







# Machine Learning for Real-Time Processing of ATLAS Liquid Argon Calorimeter Signals with FPGAs

Nick Fritzsche
TU Dresden
on behalf of the ATLAS Liquid Argon Calorimeter Group

8 July, 2022





# The ATLAS Liquid Argon (LAr) Calorimeters



- ATLAS detector at LHC contains sampling calorimeters for measurement of energy deposited by electrons, photons and hadronic jets
- $\sim$ 182k cells
  - active material: liquid argon
  - absorber: lead, copper, tungsten
- Triangular pulse by ionization is amplified, shaped and digitized at 40 MHz
- Energy reconstruction with Optimal Filter (OF)

$$E(t) = \sum_{i=t-N}^{t} a_i \cdot x_i$$
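
The Optimal Filter sum can be sketched as a sliding weighted sum over the digitized samples. This is a minimal illustration, not the ATLAS implementation: the coefficient values and the toy pulse are hypothetical, and the coefficients are indexed relative to the start of the window (`coeffs[k]` multiplies `samples[t - N + k]`), which is an assumption about how the indices in the formula map onto arrays.

```python
import numpy as np

def optimal_filter(samples, coeffs):
    """Weighted sum of the last len(coeffs) samples for every full window.

    Output element j corresponds to the window samples[j : j + len(coeffs)].
    """
    samples = np.asarray(samples, dtype=float)
    coeffs = np.asarray(coeffs, dtype=float)
    # 'valid' correlation slides coeffs over samples without flipping them
    return np.correlate(samples, coeffs, mode="valid")

# Hypothetical filter coefficients (N + 1 = 5 taps) and a toy ADC pulse
coeffs = np.array([0.1, 0.4, 0.5, 0.2, -0.2])
pulse = np.array([0.0, 0.0, 1.0, 2.0, 3.0, 2.0, 1.0, 0.0])

energies = optimal_filter(pulse, coeffs)
```

Each output value depends only on a fixed number of samples, which is why enlarging the window is the only way to give the filter more context.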


# Signal Processing at High Luminosity LHC (HL-LHC)



# Upgrade Challenges

- HL-LHC planned to start in 2029 with 7.5x nominal luminosity
  - $\longrightarrow$  140-200 proton-proton collisions per bunch crossing (currently  $\sim$  40)
- Energies of close-by and overlapping pulses are biased, or the pulses are missed
- Increasing receptive field does not improve OF performance much
- Advanced processing algorithms required

# Hardware Trigger

Trigger selects events after  $\sim 2\,\mu s$ , 150 ns foreseen for energy reconstruction

- → Implement algorithms on FPGA for real-time processing
- → Short latency and FPGA resources limit complexity of algorithms

# Convolutional Neural Network (CNN) Architectures



#### Architecture

2-tier convolutional network:

- pulse tagging
- energy reconstruction

# Layer operations

- Linear combination of previous layer
- Apply activation function
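
The two layer operations listed above can be sketched for a single 1D convolutional layer. This is an illustrative NumPy version under stated assumptions: the ReLU activation, the kernel size, the channel counts and the random weights are all hypothetical and not taken from the networks in the talk.

```python
import numpy as np

def conv1d_layer(x, kernels, bias):
    """One convolutional layer.

    x:       (time, in_channels) output of the previous layer
    kernels: (kernel_size, in_channels, out_channels)
    bias:    (out_channels,)
    """
    k = kernels.shape[0]
    n_out = x.shape[0] - k + 1
    out = np.empty((n_out, kernels.shape[2]))
    for t in range(n_out):
        window = x[t:t + k]  # (kernel_size, in_channels)
        # Step 1: linear combination of the previous layer's values in the window
        out[t] = np.tensordot(window, kernels, axes=([0, 1], [0, 1])) + bias
    # Step 2: activation function (ReLU is an assumption here)
    return np.maximum(out, 0.0)

rng = np.random.default_rng(0)
adc = rng.normal(size=(20, 1))        # toy ADC sequence, one input channel
kernels = rng.normal(size=(3, 1, 4))  # hypothetical: kernel size 3, 4 filters
bias = np.zeros(4)

features = conv1d_layer(adc, kernels, bias)
```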

# CNN Training



#### Training targets

- Overall: Determine deposited energy out of ADC sequence
- Pulse tagging: Binary sequence, find energy deposits
- Secondary Sec
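
A binary pulse-tagging target of the kind described above could be built from a sequence of true deposited energies. This is only a sketch: the threshold value, its units and the toy energy sequence are hypothetical, and the actual target definition used for training is not given on the slide.

```python
import numpy as np

def tagging_target(true_energy, threshold):
    """Return 1 where the true deposited energy exceeds the threshold, else 0."""
    return (np.asarray(true_energy) > threshold).astype(np.int8)

# Toy sequence of true energies per bunch crossing (arbitrary units)
true_energy = np.array([0.0, 0.0, 2.5, 0.0, 0.1, 0.0, 7.0])

# Hypothetical threshold separating energy deposits from negligible entries
target = tagging_target(true_energy, threshold=0.24)
```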



# Recurrent Neural Networks (RNNs)

#### RNN Architectures

Process new input combined with previous state



Two internal RNN architectures explored:

- Long Short-Term Memory (LSTM)
- Vanilla-RNN, fewer internal dimensions
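
The idea of combining each new input with the previous state can be sketched with a textbook vanilla RNN cell. The tanh activation, the 8 internal dimensions and the random weights are assumptions made for illustration and are not the configuration used in the talk.

```python
import numpy as np

def vanilla_rnn(inputs, w_in, w_rec, bias):
    """Run a vanilla RNN over a 1D input sequence and return all states."""
    state = np.zeros(w_rec.shape[0])  # start from an all-zero state
    states = []
    for x_t in inputs:
        # New state = activation(input term + recurrent term + bias)
        state = np.tanh(w_in @ np.atleast_1d(x_t) + w_rec @ state + bias)
        states.append(state)
    return np.array(states)

rng = np.random.default_rng(1)
hidden = 8  # hypothetical number of internal dimensions
w_in = rng.normal(scale=0.5, size=(hidden, 1))
w_rec = rng.normal(scale=0.5, size=(hidden, hidden))
bias = np.zeros(hidden)
adc = rng.normal(size=10)  # toy ADC sequence

states = vanilla_rnn(adc, w_in, w_rec, bias)
```

An LSTM follows the same pattern but carries additional gated state, which is where its higher complexity comes from.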





#### Single Cell

- Long range correction, full signal processed in a stream
- High complexity needed, only LSTM



#### Sliding Window Method

- No long-range correlations, simpler training
- Short range correction only



#### Performance under HL-LHC conditions



- Single LSTM shows best energy resolution and close-by pulse identification
- CNNs and Vanilla-RNN good compromises between complexity and performance

Figure (x-axis: Gap [BC], i.e. gap in bunch crossings)

# FPGA Implementation: CNNs

#### ANNs on FPGAs

FPGA implementation required for running ANNs on hardware

- Operations mapped to FPGA configurable logic, DSPs, memory, ...
- Fixed-point arithmetic applied
- Support time-division multiplexing
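
The effect of fixed-point arithmetic can be illustrated by quantizing floating-point weights to a signed fixed-point format and converting them back. The bit widths (18 total, 10 fractional) and the weight values are hypothetical and not the formats used in the firmware.

```python
import numpy as np

def to_fixed_point(values, total_bits, frac_bits):
    """Quantize to signed fixed point, returned as integers (value * 2**frac_bits)."""
    scale = 2 ** frac_bits
    lo = -(2 ** (total_bits - 1))
    hi = 2 ** (total_bits - 1) - 1
    # Round to the nearest representable step and saturate at the range limits
    return np.clip(np.round(np.asarray(values) * scale), lo, hi).astype(np.int64)

def from_fixed_point(ints, frac_bits):
    """Convert fixed-point integers back to floating point."""
    return ints / 2 ** frac_bits

weights = np.array([0.7312, -0.1049, 1.25])  # toy floating-point weights
q = to_fixed_point(weights, total_bits=18, frac_bits=10)
approx = from_fixed_point(q, frac_bits=10)

# Worst-case rounding error is half a step, i.e. 2**-(frac_bits + 1)
max_error = np.max(np.abs(approx - weights))
```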

#### CNNs




- CNNs use custom converter from software model to VHDL
- DSP chain designed for low latency and efficient resource usage



# FPGA Implementation: RNNs

RNNs implemented in Intel High Level Synthesis (HLS) and VHDL

# Single Cell & Sliding Window Implementation

#### Single Cell:


Single RNN instance on hardware

#### Sliding Window:



- 5 RNN instances
- Independent pipelined sequences

#### Placement Constraints

Logic locked region for each cell improves timing



#### Performance on FPGA



# Software model vs firmware implementation

- Floating-point numbers vs fixed-point calculations
- Good agreement of FPGA implementation with software
- Confirmed for CNNs with bit-exact software model of firmware

# FPGA Resource Usage

- Process 384 calorimeter cells per FPGA $\rightarrow$ in total $\sim$ 550 FPGAs needed
- Time-division multiplexing: e.g. FPGA frequency 480 MHz = 12 · 40 MHz
  - → Process 12 cells in pipeline on 1 ANN instance
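
The time-division multiplexing arithmetic above can be written out as a short calculation. The function names are hypothetical; the 40 MHz bunch-crossing rate, the 480 MHz example clock and the 384 cells per FPGA are taken from the slides.

```python
import math

BUNCH_CROSSING_MHZ = 40  # digitization rate of the calorimeter signals

def multiplicity(fpga_clock_mhz):
    """Number of cells one ANN instance can serve per bunch crossing."""
    return fpga_clock_mhz // BUNCH_CROSSING_MHZ

def instances_needed(cells_per_fpga, cells_per_instance):
    """Number of ANN instances required to cover all cells on one FPGA."""
    return math.ceil(cells_per_fpga / cells_per_instance)

m = multiplicity(480)          # 12 cells per instance
n = instances_needed(384, m)   # 32 instances per FPGA
```

A lower achieved clock frequency reduces the multiplicity, so more instances (and therefore more resources) are needed to cover the same 384 cells.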

#### Single Channel

|                                     | 3-Conv CNN   | 4-Conv CNN   | Vanilla RNN (sliding) | LSTM (single) | LSTM (sliding) |
|-------------------------------------|--------------|--------------|-----------------------|---------------|----------------|
| Frequency F<sub>max</sub> [MHz]     | 493          | 480          | 641                   | 560           | 517            |
| Latency [clk<sub>core</sub> cycles] | 62           | 58           | 206                   | 220           | 363            |
| Resource usage: #DSPs               | 46 (0.8%)    | 42 (0.7%)    | 34 (0.6%)             | 176 (3.1%)    | 738 (12.8%)    |
| Resource usage: #ALMs               | 5684 (0.6%)  | 5702 (0.6%)  | 13115 (1.4%)          | 18079 (1.9%)  | 69892 (7.5%)   |

#### Time-multiplexed

|                                 | 3-Conv CNN   | 4-Conv CNN   | Vanilla RNN (HLS) | Vanilla RNN (VHDL) | 28× Vanilla RNN (VHDL) |
|---------------------------------|--------------|--------------|-------------------|--------------------|------------------------|
| Multiplicity                    | 12           | 12           | 10                | 14                 | 14                     |
| Frequency F<sub>max</sub> [MHz] | 487          | 423          | 455               | 587                | 561                    |
| Latency [ns]                    | 125          | 150          | 302               | 121                | 121                    |
| Max. Channels                   | 516          | 660          | 370               | 588                | 392*                   |
| Resource usage: #DSPs           | 46 (0.8%)    | 42 (0.7%)    | 152 (2.6%)        | 136 (2.4%)         | 3808 (66.1%)           |
| Resource usage: #ALMs           | 21256 (2.3%) | 16698 (1.8%) | 24433 (2.6%)      | 5854 (0.6%)        | 164321 (17.6%)         |
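
The time-multiplexed numbers can be checked against the two requirements stated earlier in the talk: 150 ns foreseen for energy reconstruction and 384 cells per FPGA. The values below are copied from the table; treating both requirements as hard pass/fail limits (with "less than or equal" for latency) is an assumption made for this sketch.

```python
LATENCY_BUDGET_NS = 150   # foreseen for energy reconstruction
CELLS_PER_FPGA = 384      # cells to be processed per FPGA

# (latency in ns, maximum number of channels) per implementation, from the table
implementations = {
    "3-Conv CNN": (125, 516),
    "4-Conv CNN": (150, 660),
    "Vanilla RNN (HLS)": (302, 370),
    "Vanilla RNN (VHDL)": (121, 588),
    "28x Vanilla RNN (VHDL)": (121, 392),
}

# Keep the implementations that satisfy both limits
meets_requirements = [
    name
    for name, (latency_ns, max_channels) in implementations.items()
    if latency_ns <= LATENCY_BUDGET_NS and max_channels >= CELLS_PER_FPGA
]
```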

- High latency & resource usage for LSTMs $\rightarrow$ focus on Vanilla RNN
- VHDL implementation outperforms HLS
- ALMs shared with other firmware components $\rightarrow$ optimizations needed

#### Outlook

Further simulation studies and implementation improvements ongoing

# Architecture & Training

- Consider more realistic conditions
  - Varying pulse shapes
  - Time shifts to optimal sampling point
  - LHC bunch train structure
  - Quantization aware training
  - Different detector regions

- Add new features
  - Provide timing of detected pulse as output



# Firmware & Hardware

- Optimize firmware implementation
  - Reduce resource usage
  - Increase operation frequency
- Test ANNs on Stratix-10 hardware
  - Integrate ANNs into higher level LAr signal processor firmware



#### Summary

- Advanced signal processing algorithms required for ATLAS LAr energy reconstruction under HL-LHC conditions
  - Two machine learning based approaches: CNNs and RNNs
- Various ANN algorithms studied
  - CNNs and RNNs outperform legacy Optimal Filter algorithm
- FPGA implementation for real-time processing with high bandwidth developed
  - CNNs: VHDL implementation
  - RNNs: High level synthesis and VHDL implementation
- Promising results of firmware evaluation
  - Good reproduction of Keras results with firmware simulation
  - Optimizations ongoing to improve resource usage and latency

 $\longrightarrow$  CNNs/RNNs show great potential to improve energy reconstruction of ATLAS LAr calorimeter system under HL-LHC conditions

Ref. "Artificial Neural Networks on FPGAs for Real-Time Energy Reconstruction of the ATLAS LAr Calorimeters" Aad, G. et al., Comput Softw Big Sci 5, 19 (2021)

Ref. "Energy reconstruction in a liquid argon calorimeter cell using convolutional neural networks" Polson, L. et al., JINST 17, P01002 (2022)