The ALICE experiment has been designed to study the physics of strongly interacting matter and the Quark–Gluon Plasma in heavy-ion collisions at the CERN LHC. A major upgrade of the ALICE detectors and computing farm is currently ongoing. The new farm, part of the O2 (Online-Offline) computing project, will consist of almost 2000 nodes able to read out and process on the fly over 27 Tb/s of raw data.
To increase the efficiency of computing-farm operations, a general-purpose, real-time streaming processing system has been developed. The system is designed around high performance, high availability, modularity and open-source components. Its core component is Apache Kafka, which provides high throughput, data pipelining and fault tolerance. For monitoring, Kafka is complemented by Telegraf as metric collector, Apache Spark for complex aggregations, InfluxDB as time-series database and Grafana as visualization tool. The logging task is based on the Elasticsearch stack for log collection, storage and display.
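As a minimal illustration of the data flowing through such a pipeline, the sketch below formats one farm metric into InfluxDB line protocol, the wire format that Telegraf emits and InfluxDB ingests; in the real system this record would travel through a Kafka topic before reaching the database. All concrete names here (measurement, host, tag keys) are illustrative assumptions, not taken from the O2 codebase.

```python
def to_line_protocol(measurement, tags, fields, timestamp_ns):
    """Build one InfluxDB line-protocol record:
    measurement,tag1=v1,... field1=v1,... timestamp_ns
    (Simplified: assumes values need no escaping or type suffixes.)"""
    tag_str = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    field_str = ",".join(f"{k}={v}" for k, v in sorted(fields.items()))
    return f"{measurement},{tag_str} {field_str} {timestamp_ns}"

# Hypothetical example: a CPU-load sample from one readout node.
line = to_line_protocol(
    "cpu_load",
    tags={"host": "flp-042", "role": "readout"},
    fields={"load1": 1.25},
    timestamp_ns=1609459200000000000,
)
print(line)
# → cpu_load,host=flp-042,role=readout load1=1.25 1609459200000000000
```

A producer would publish each such line to a Kafka topic, from which Spark jobs and the InfluxDB writer consume independently; this decoupling is what gives the pipeline its fault tolerance.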
The O2 farm uses this system to handle metrics coming from the operating system, the network, custom hardware and in-house software. A prototype version of the system is currently running at CERN and has also been successfully deployed at the ReCaS Datacenter at INFN Bari for both monitoring and logging.