20–24 May 2024
Park Hotel Cala di Lepre
Europe/Rome timezone

Monitoring resources of a computing infrastructure with Redfish and SNMP

24 May 2024, 11:35
25m
Sala Meeting "Le Saline" (Park Hotel Cala di Lepre)

Via Cala di Lepre, 07020 Palau (SS), Italy
Oral presentation ICT Services Session "ICT Services"

Speaker

Nicola Mosco (Istituto Nazionale di Fisica Nucleare)

Description

Efficient and secure operation of the machines in a computing centre is essential in today's digital landscape. The complexity and scale of computing centres call for a framework that monitors resources and supports a responsive system able to prevent potential failures. This presentation illustrates the current status and future goals of the monitoring solution being developed at the computing centre of INFN Torino.
The new challenges posed by PNRR projects such as ETIC and TeRABIT require a fresh approach to monitoring hardware and software components.
The primary motivation behind this framework is to enhance operational efficiency by providing administrators with timely insights into resource utilisation, workload distribution, and system health. By leveraging advanced monitoring tools, administrators can proactively identify bottlenecks, optimise resource allocation, and mitigate performance degradation, thereby ensuring uninterrupted service delivery.
The traditional approach to monitoring relies on SNMP. Although it has been the standard for decades, it has several drawbacks that call for a new approach.
SNMP remains widely used for network management and monitoring, especially on legacy systems, but Redfish offers advantages in modern design, scalability, feature set, security, and vendor support. Because it is based on RESTful APIs and a JSON data model, it can also be more efficient and easier to use.
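
To illustrate what the RESTful, JSON-based model means in practice, here is a minimal Python sketch that reads power draw from a Redfish Power resource. It is not the implementation presented in the talk. The function names, base URL, credentials, and chassis path are hypothetical, and the field names (PowerControl, PowerConsumedWatts) assume that the BMC exposes the standard DMTF Power schema.

```python
import base64
import json
import urllib.request


def fetch_redfish(base_url: str, path: str, user: str, password: str) -> dict:
    """GET a Redfish resource and decode its JSON body.

    Hypothetical helper: assumes the BMC accepts HTTP Basic authentication
    and presents a certificate the default TLS context can verify.
    """
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request = urllib.request.Request(
        base_url.rstrip("/") + path,
        headers={"Authorization": f"Basic {token}", "Accept": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


def power_consumed_watts(power_resource: dict) -> list[float]:
    """Return PowerConsumedWatts for each PowerControl entry, skipping nulls.

    Assumes the payload follows the DMTF Power schema, as typically served
    at /redfish/v1/Chassis/<chassis-id>/Power.
    """
    readings = []
    for control in power_resource.get("PowerControl", []):
        watts = control.get("PowerConsumedWatts")
        if watts is not None:
            readings.append(float(watts))
    return readings
```

A call such as power_consumed_watts(fetch_redfish(base_url, "/redfish/v1/Chassis/1/Power", user, password)) would return one reading per power control entry. The chassis identifier "1" is an assumption and varies between vendors.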
We show a possible implementation that collects performance metrics from physical machines using Redfish and integrates this information with an SNMP exporter for Prometheus, combined with the convenience of Grafana dashboards. The current setup shows the correlation between power consumption and the workload of HTCondor jobs.
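
The abstract does not say how the Redfish readings reach Prometheus. One possible route, sketched here purely as an assumption, is to publish them in the Prometheus text exposition format so that they can be scraped alongside the SNMP exporter. The metric name redfish_power_consumed_watts, the host label, and the helper names are hypothetical.

```python
def escape_label_value(value: str) -> str:
    """Escape backslash, double quote and newline in a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_gauge(name: str, help_text: str, samples: dict[str, float],
                 label: str = "host") -> str:
    """Render one gauge metric, with one sample per label value.

    `samples` maps a label value (here a host name) to its latest reading.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} gauge"]
    for key, value in sorted(samples.items()):
        lines.append(f'{name}{{{label}="{escape_label_value(key)}"}} {value}')
    return "\n".join(lines) + "\n"
```

Serving the returned text over HTTP, or writing it to a file read by a textfile collector, would make the readings available for Grafana dashboards next to the SNMP-derived metrics.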

Primary author

Nicola Mosco (Istituto Nazionale di Fisica Nucleare)

Co-authors

Lia Lavezzi (Istituto Nazionale di Fisica Nucleare) Luca Tabasso (Istituto Nazionale di Fisica Nucleare) Marco Sadocco (Istituto Nazionale di Fisica Nucleare)