24–28 May 2021
Online
Europe/Rome timezone

Operational Intelligence: distributed computing operations towards the HL-LHC era

26 May 2021, 12:25
30m
Online

ICT infrastructures and distributed computing

Speaker

Federica Legger (TO)

Description

Operational Intelligence (OpInt) is a cross-experiment project that aims to reduce the cost of computing operations for HEP experiments. Distributed computing systems are heterogeneous, so they call for advanced, intelligent monitoring, and they currently require a non-negligible amount of human intervention and expertise to spot the symptoms of potential misbehaviour in advance and to troubleshoot operational issues. The OpInt initiative applies state-of-the-art techniques to optimise operations and to ease their execution by minimising human intervention. Most operational tasks involve repetitive actions (gathering and sorting monitoring information from various subsystems, spotting problems, and escalating them to the experts) that would benefit enormously from automation through adaptive systems. Anomaly detection in time series, log text analysis using Natural Language Processing (NLP), smart alert systems, and clustering are examples of the techniques used to help operators in their daily routines and to improve overall system efficiency and resource utilisation. We report on the latest developments and activities in this project, and discuss the road map for increasing automation levels towards operation in the High-Luminosity LHC (HL-LHC) era.

Primary author

Federica Legger (TO)

Co-authors

Alessandro Di Girolamo (INFN)
Daniele Bonacorsi (BO)
Leticia Decker De Sousa (BO)
Lorenzo Rinaldi (BO)
Luca Clissa (BO)
Luca Giommi (Istituto Nazionale di Fisica Nucleare)
Simone Rossi Tisbeni (CNAF)
Tommaso Diotalevi (Istituto Nazionale di Fisica Nucleare - Sezione di Bologna)
