Description
The efficient and secure operation of computing centre machines is essential in today's digital landscape. The complexity and scale of computing centres require a framework that monitors resources and a responsive system able to prevent potential failures. This presentation illustrates the current status and future goals of the monitoring solution being developed at the computing centre of INFN Torino.
The new challenges posed by the PNRR projects, such as ETIC and TeRABIT, require a fresh approach to the monitoring of hardware and software components.
The primary motivation behind this framework is to enhance operational efficiency by providing administrators with timely insights into resource utilisation, workload distribution, and system health. By leveraging advanced monitoring tools, administrators can proactively identify bottlenecks, optimise resource allocation, and mitigate performance degradation, thereby ensuring uninterrupted service delivery.
The traditional approach to monitoring relies on SNMP. Although it has been the standard for decades, it has several downsides that call for a new approach.
SNMP remains widely used for network management and monitoring, especially on legacy systems. Redfish, however, offers advantages in design, scalability, feature set, security, and vendor support, and because it is based on RESTful APIs and a JSON data model it can be more efficient and easier to use.
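To make the RESTful, JSON-based model concrete, the following is a minimal sketch of reading the power draw of one chassis over Redfish. It assumes the BMC exposes the standard `Power` resource under `/redfish/v1/Chassis/{id}/Power` with a `PowerControl` array; the host, chassis identifier, credentials, and sample payload are hypothetical and are not taken from the INFN Torino setup.

```python
import base64
import json
import urllib.request


def fetch_power_resource(bmc_url, chassis_id, user, password):
    """GET the Power resource of one chassis (hypothetical BMC URL and credentials)."""
    url = f"{bmc_url}/redfish/v1/Chassis/{chassis_id}/Power"
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request = urllib.request.Request(
        url,
        headers={"Authorization": f"Basic {token}", "Accept": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


def power_consumed_watts(power_resource):
    """Sum PowerConsumedWatts over all PowerControl entries, skipping missing or null readings."""
    total = 0.0
    for control in power_resource.get("PowerControl", []):
        watts = control.get("PowerConsumedWatts")
        if watts is not None:
            total += float(watts)
    return total


# Illustrative payload shaped like a Redfish Power resource (values are made up).
SAMPLE_POWER_RESOURCE = {
    "Id": "Power",
    "PowerControl": [
        {"MemberId": "0", "PowerConsumedWatts": 344},
        {"MemberId": "1", "PowerConsumedWatts": None},
    ],
}

if __name__ == "__main__":
    print(power_consumed_watts(SAMPLE_POWER_RESOURCE))
```

The parsing step is kept separate from the HTTP call so that it can be exercised offline against a saved payload. Newer Redfish implementations may expose power data through a different resource, in which case only the path and field names in this sketch would change.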
We show a possible implementation that collects performance metrics from physical machines using Redfish and integrates this information with an SNMP exporter for Prometheus, with Grafana dashboards for visualisation. The current setup shows the correlation between power consumption and the workload of HTCondor jobs.