Description
Operational Intelligence (OpInt) is a cross-experiment project that aims to reduce the cost of computing operations for High Energy Physics (HEP) experiments. The heterogeneous nature of distributed computing systems calls for advanced, intelligent monitoring, and today it takes a non-negligible amount of human intervention and expertise to spot early symptoms of potential misbehaviour and to troubleshoot operational issues. The OpInt initiative applies state-of-the-art techniques to optimise operations and to facilitate their execution while minimising human intervention. Most operational tasks involve repetitive actions (such as gathering and sorting monitoring information from various subsystems, spotting problems, and escalating to the experts) that would benefit enormously from automation through adaptive systems. Anomaly detection in time series, log text analysis using Natural Language Processing (NLP), smart alert systems, and clustering are examples of the methods used to help operators in their daily routines and to improve overall system efficiency and resource utilisation. We report on the latest developments and activities in the context of this project, and discuss the road map for increasing automation levels towards operation in the High-Luminosity LHC (HL-LHC) era.
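As a purely illustrative sketch of what anomaly detection in a monitoring time series can look like, the snippet below flags points that deviate strongly from a trailing window, using a modified z-score based on the median absolute deviation (MAD). It is not taken from the OpInt code base: the function name `flag_anomalies`, the window size, the threshold, and the sample data are all assumptions chosen for the example.

```python
from statistics import median


def flag_anomalies(series, window=5, threshold=3.5):
    """Return indices whose value deviates strongly from the trailing window.

    Hypothetical helper: scores each point with the modified z-score,
    0.6745 * |x - median| / MAD, computed over the previous `window` points.
    """
    anomalies = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        med = median(history)
        mad = median(abs(x - med) for x in history)
        if mad == 0:
            # Flat history: treat any change at all as a deviation
            # (an aggressive choice, made only to keep the sketch short).
            if series[i] != med:
                anomalies.append(i)
            continue
        score = 0.6745 * abs(series[i] - med) / mad
        if score > threshold:
            anomalies.append(i)
    return anomalies


# Made-up metric: transfer failures per hour at one site.
failures = [4, 5, 3, 4, 6, 5, 4, 48, 5, 4, 3, 5]
print(flag_anomalies(failures))  # prints [7]
```

A median-based score is used here instead of mean and standard deviation so that a single spike in the trailing window does not mask the points that follow it; a production system would more likely rely on dedicated time-series or machine-learning tooling.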