The precise identification of jets initiated by heavy-flavour quarks is a central challenge in the physics programme of the Large Hadron Collider (LHC). Modern flavour-tagging algorithms rely on deep learning models trained on labelled Monte Carlo (MC) simulated data. A key limitation of these approaches is that the embedding learned from the input variables is typically tightly coupled to the specific MC generator used during training. This coupling can lead to significant discrepancies when the taggers are applied to MC simulations produced with other generators, or to real data, inflating systematic uncertainties and reducing analysis sensitivity. As a consequence, changes in the simulation setup often require a dedicated recalibration of the model output, and there is a risk that the network captures generator-specific artefacts rather than genuine physical features.
In this work, we propose a novel strategy to address this issue by constructing embeddings that are robust to variations in the underlying MC simulation. Our approach adapts foundation models for tabular data to generate explicit jet-level embeddings. We introduce a pre-training phase that learns a representation invariant across different MC generators, such that jets with identical physical properties are mapped to the same embedding regardless of the simulation used to produce them.
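One way to picture the invariance objective is as a penalty that pulls the embedding distributions of jets from different generators together. The abstract does not specify the training objective, so the following is a minimal, hypothetical sketch using a linear-kernel maximum mean discrepancy (MMD) penalty on a toy linear encoder; the jet features, encoder, and optimisation are illustrative stand-ins, not the authors' method (adversarial or contrastive alignment would be equally valid choices).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for jet feature vectors from two MC generators
# (hypothetical data; "generator B" has a deliberate distribution shift).
jets_gen_a = rng.normal(0.0, 1.0, size=(256, 8))
jets_gen_b = rng.normal(0.3, 1.2, size=(256, 8))

# Shared linear encoder mapping 8 input features to a 4-dim embedding.
W = rng.normal(0.0, 0.1, size=(8, 4))

def embed(x, W):
    return x @ W

def linear_mmd(za, zb):
    """Squared MMD with a linear kernel: the squared distance
    between the embedding means of the two samples."""
    d = za.mean(axis=0) - zb.mean(axis=0)
    return float(d @ d)

mmd_before = linear_mmd(embed(jets_gen_a, W), embed(jets_gen_b, W))

# Shrink the generator gap by gradient descent on the encoder weights.
# For a linear encoder, MMD = ||(mu_a - mu_b) @ W||^2, with gradient
# 2 * outer(d_in, d_in @ W) where d_in is the input-space mean gap.
lr = 0.05
d_in = jets_gen_a.mean(axis=0) - jets_gen_b.mean(axis=0)
for _ in range(200):
    grad = 2.0 * np.outer(d_in, d_in @ W)
    W -= lr * grad

mmd_after = linear_mmd(embed(jets_gen_a, W), embed(jets_gen_b, W))
print(f"MMD before: {mmd_before:.4f}, after: {mmd_after:.2e}")
```

In a realistic setting this penalty would be added to the pre-training loss of the tabular foundation model, balancing invariance across generators against retention of the flavour-discriminating information.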