ML_INFN Weekly Meeting
Monday, 10 July 2023, 16:00
Machine Learning Approaches for Job Failure Prediction in HTC Systems
Alessio Arcara
Stefano Dal Pra (Istituto Nazionale di Fisica Nucleare)
16:00 - 17:00
CNAF operates a computing centre of approximately 59k cores spread across O(10^3) physical hosts. Jobs are queued and scheduled by the HTCondor batch system using fair-share algorithms. During execution, the state of each job is described by three variables (RAM, disk, swap), sampled every three minutes and stored in a database. Job duration varies considerably, from a few minutes to several days. This study explores machine learning techniques for predicting whether a job will succeed or fail from the evolution of its state over time. Particular attention is given to "zombie jobs": jobs that terminate without releasing the physical host, leaking resources until their timeout expires. An initial approach, which considers only the first hour of each job's evolution, enabled the early identification and interruption of problematic jobs that could cause resource leaks. Using a decision-tree ensemble algorithm based on gradient boosting, an accuracy of 72% was achieved on the least represented class (the zombie jobs).

In the second part of the study, further investigations sought to improve accuracy by examining the first day instead of the first hour and by applying supervised (CNN, CNN+LSTM, LSTM, Transformer) and unsupervised (autoencoder and variational autoencoder) deep learning techniques. However, these neural network approaches proved prone to overfitting on such highly imbalanced data. The XGBoost algorithm performed best, by a significant margin over the other methods.
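As an illustration of the first-hour approach described in the abstract, the sketch below turns a job's (RAM, disk, swap) samples into a fixed-length feature vector and computes accuracy separately for each class, which is the kind of figure quoted for the zombie class. It is a minimal sketch under stated assumptions, not the authors' pipeline: the function names, the choice of summary statistics (mean, max, last value, net change) and the class labels are all hypothetical. Only the three variables, the three-minute sampling interval and the one-hour window come from the abstract.

```python
from statistics import mean

SAMPLE_PERIOD_MIN = 3   # sampling interval stated in the abstract
WINDOW_MIN = 60         # "first hour" observation window
N_SAMPLES = WINDOW_MIN // SAMPLE_PERIOD_MIN  # 20 samples per job


def first_hour_features(samples):
    """Summarise the first hour of a job as a flat feature dict.

    `samples` is a list of (ram, disk, swap) tuples, one per 3 minutes.
    Jobs shorter than one hour simply contribute fewer samples.
    The statistics chosen here are assumptions made for illustration.
    """
    window = samples[:N_SAMPLES]
    if not window:
        raise ValueError("job has no samples")
    features = {"n_samples": len(window)}
    for idx, name in enumerate(("ram", "disk", "swap")):
        series = [s[idx] for s in window]
        features[f"{name}_mean"] = mean(series)
        features[f"{name}_max"] = max(series)
        features[f"{name}_last"] = series[-1]
        # Net change over the window: a crude proxy for the trend.
        features[f"{name}_delta"] = series[-1] - series[0]
    return features


def per_class_accuracy(y_true, y_pred):
    """Fraction of correctly predicted jobs within each true class."""
    totals, hits = {}, {}
    for t, p in zip(y_true, y_pred):
        totals[t] = totals.get(t, 0) + 1
        hits[t] = hits.get(t, 0) + (t == p)
    return {label: hits[label] / totals[label] for label in totals}


if __name__ == "__main__":
    # Synthetic job: RAM grows steadily, disk and swap stay flat.
    job = [(100 + 10 * i, 5, 0) for i in range(30)]
    print(first_hour_features(job))
    # Hypothetical labels and predictions for five jobs.
    print(per_class_accuracy(
        ["ok", "ok", "ok", "zombie", "zombie"],
        ["ok", "ok", "zombie", "zombie", "ok"],
    ))
```

Feature vectors of this kind could then be passed to a gradient-boosted tree classifier such as XGBoost. With highly imbalanced data, reporting accuracy per class rather than overall keeps the rare zombie class from being hidden by the majority class.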