



# APEIRON: a Framework for High Level Programming of Dataflow Applications on Multi-FPGA Systems

Cristian Rossi

(INFN Roma, APE Lab)

for the APEIRON team

Workshop sul Calcolo nell'I.N.F.N., Loano (Savona), 22-26 May 2023

This work is supported by the TEXTAROSSA project (G.A. n. 956831) as part of the EuroHPC-JU initiative, and by INFN National Scientific Committee 5.





**APEIRON main goal** is to develop a framework offering hardware and software support for the execution of real-time dataflow applications on a system composed of interconnected FPGAs:

- Enabling the mapping of the application's dataflow graph onto the distributed FPGA system and offering runtime support for its execution.
- Allowing users with little or no experience in hardware design tools to develop their applications on such distributed FPGA-based platforms.
  - Tasks are implemented in C++ using High Level Synthesis tools (Xilinx<sup>®</sup> Vitis).
  - Lightweight C++ communication API (HAPECOM):
    - Non-blocking *send()*
    - Blocking *receive()*
- APEIRON is based on the Xilinx<sup>®</sup> Vitis High Level Synthesis framework and on the INFN Communication IP.





### Abstract Processing Environment for Intelligent Read-Out systems based on Neural networks



- Input data from several different channels (data sources, detectors/sub-detectors).
- Data streams from different channels are recombined through the processing layers using a low-latency, modular and scalable network infrastructure.
- Distributed online processing on heterogeneous computing devices (FPGAs for the moment) in *n* subsequent layers.
- Typically, feature extraction will occur in the first NN layers on the read-out (RO) FPGAs.
- More resource-demanding NN layers can be implemented in subsequent processing layers.
- The classification produced by the NN in the last processing layer (e.g. PID) will be the input for the **trigger processor**, or for the **storage online data reduction stage in triggerless systems**.


## **INFN Communication IP**







INFN is developing the **IPs** implementing a direct network that allows **low-latency data transfer** between processing tasks deployed on the same FPGA (**intra-node communication**) and on different FPGAs (**inter-node communication**).

- Host Interface IP: interfaces the FPGA logic with the host through the system bus.
  - Xilinx<sup>®</sup> XDMA PCIe Gen3
- Routing IP: routes intra-node and inter-node messages between processing tasks on the FPGA.
- Network IP: network channels and application-dependent I/O.
  - APElink 20 Gbps → 40 Gbps
  - UDP/IP over 1/10 GbE → 25/40/100 GbE
- HLS Kernels: user-defined processing tasks.









- The APEIRON runtime software stack is built on top of the Xilinx<sup>®</sup> XRT stack, adding three layers in order to:
  - add the functionalities required to manage multiple FPGA execution platforms (e.g., program the devices, configure the IPs, start/stop execution, monitor the status of IPs, ...);
  - reduce the impact on the APEIRON host-side applications of the XRT API changes introduced with each new version of Vitis;
  - decouple the APEIRON software stack from the specific platform, easing the future porting of the framework to different platforms/vendors, ideally by extending the APEIRON library layer only.
- **Apeirond** is a persistent daemon used to manage multiple access requests from user apps to the board. It uses the functions exposed by the APEIRON library to operate on the devices.
- Using the network socket exposed by each apeirond module, the supervisor can write commands and read the answers and status of the different instances of the APEIRON framework running on each node, giving the end user a complete overview of the multi-FPGA execution platform.

# APEIRON: Workflow for FPGA bitstream generation


- The HLS task must have a generic interface; its implementation is free.
- A YAML configuration file is used to describe the kernel interconnection topology, specifying how many input/output channels each kernel has.



    void example_task(
        [list of optional kernel specific parameters],
        message_stream_t message_data_in[N_INPUT_CHANNELS],
        message_stream_t message_data_out[N_OUTPUT_CHANNELS]) {...}



- Adaptation to/from the IntraNode ports of the Routing IP is done by the automatically generated Aggregator and Dispatcher kernel templates.



## **Communication Latency Test**





The **latency test** is performed using a multi-task HLS kernel (krnl\_sr), which the host can configure in different modes:

- "**send\_receive**" mode: the kernel reads a payload data item from the FPGA memory (either BRAM or DDR), sends it through the Communication IP to a second interconnected FPGA, and receives it back.
- "**pipe**" mode: the kernel receives a single packet and bounces it back to the initiator FPGA.

The HLS kernel on the initiator FPGA is started via host code, while the HLS kernel in "pipe" mode is free-running. The former is therefore launched with a repetition parameter of 1 million send/receive operations before termination, in order to **minimize the contribution of the host call overhead to the overall elapsed time**, measured on the host from the start of the first packet send to the completion of the last packet receive.
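The amortization argument can be made concrete with a few lines of arithmetic. This is a minimal sketch: the elapsed time and the host call overhead below are hypothetical placeholders, not measured APEIRON values, and only the repetition count of 1 million comes from the slide.

```python
def per_iteration_time_us(elapsed_s, repetitions):
    """Mean time of one send/receive round trip, in microseconds."""
    return elapsed_s / repetitions * 1e6


def host_overhead_bias_us(host_overhead_s, repetitions):
    """Share of a one-off host call overhead attributed to each iteration."""
    return host_overhead_s / repetitions * 1e6


# Hypothetical numbers, for illustration only (not measurements).
REPETITIONS = 1_000_000    # repetition parameter quoted in the slide
ELAPSED_S = 1.5            # assumed wall-clock time measured on the host
HOST_OVERHEAD_S = 100e-6   # assumed one-off kernel launch overhead

round_trip_us = per_iteration_time_us(ELAPSED_S, REPETITIONS)
bias_us = host_overhead_bias_us(HOST_OVERHEAD_S, REPETITIONS)
print(f"round trip: {round_trip_us:.3f} us, overhead bias: {bias_us:.6f} us")
```

With these assumed values, a one-off overhead of 100 µs spread over one million iterations shifts each per-iteration estimate by well under a nanosecond.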







The **bandwidth test** is carried out by transferring multiple data packets with a fixed payload size between two HLS kernels:

- a **"sender"** HLS kernel, which reads data from the source buffer in FPGA memory (either DDR or BRAM) and pushes them through the Communication IP to another FPGA;
- a **"receiver"** HLS kernel, which writes data into the destination buffer in memory.

After receiving the number of data packets whose total payload adds up to the size of the receive buffer, the second FPGA sends back a single "ACK" packet with minimal payload to confirm reception.
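The bookkeeping behind such a test can be sketched as follows. The buffer size, payload size and elapsed time are hypothetical values chosen for illustration, and the function names are not part of the APEIRON software.

```python
import math


def packets_per_buffer(buffer_bytes, payload_bytes):
    """Number of fixed-size packets needed to fill the receive buffer."""
    return math.ceil(buffer_bytes / payload_bytes)


def bandwidth_gbps(buffer_bytes, elapsed_s):
    """Payload bandwidth in Gbit/s, from first send to ACK reception."""
    return buffer_bytes * 8 / elapsed_s / 1e9


# Hypothetical numbers, for illustration only (not measurements).
BUFFER_BYTES = 1 << 20   # assumed 1 MiB receive buffer
PAYLOAD_BYTES = 4096     # assumed fixed payload size per packet
ELAPSED_S = 500e-6       # assumed time until the ACK arrives

n_packets = packets_per_buffer(BUFFER_BYTES, PAYLOAD_BYTES)
rate = bandwidth_gbps(BUFFER_BYTES, ELAPSED_S)
print(f"{n_packets} packets, {rate:.3f} Gbit/s")
```

Because the ACK travels back once per buffer rather than once per packet, its cost is amortized over all packets in the buffer.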








## PID in NA62 RICH using NN on FPGA at L0 Trigger

CCR 2023

- Goal: for any event detected by the RICH, provide an estimate of the number of charged particles and of the number of electrons.
- Streaming readout processing on FPGA using Neural Networks for classification (10 MHz).
- Produce a new stream of primitives for the Level 0 Trigger Processor.
- The main challenge is the processing throughput.











- Customizable I/O and deterministic latency make FPGAs well suited for TDAQ systems.
- Improvements in the silicon manufacturing process have made them very interesting for heavy computation as well.
- In our case, the challenge is the processing throughput → a pipelined design can potentially produce a new output at each clock cycle.
- Initiation interval (II): number of clock cycles before the function can accept new input data. The lower the II, the higher the throughput.
- The greater the number of pipeline stages, the greater the latency.
- High level synthesis tools make it possible to describe FPGA datapaths using high-level software languages (C/C++, OpenCL, SYCL, ...).
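The relation between clock frequency, initiation interval and throughput can be written down directly. The sketch below uses hypothetical clock and cycle values; it assumes a fully pipelined design in which a new input is accepted every II cycles.

```python
def throughput_mhz(clock_mhz, initiation_interval):
    """Inputs accepted per microsecond: one new input every II clock cycles."""
    return clock_mhz / initiation_interval


def latency_ns(latency_cycles, clock_mhz):
    """Time from input to output for a pipeline of the given depth."""
    return latency_cycles / clock_mhz * 1e3


# Hypothetical design points, for illustration only.
print(throughput_mhz(200.0, 1))   # II = 1: one result per clock cycle
print(throughput_mhz(200.0, 4))   # II = 4: a quarter of the clock rate
print(latency_ns(20, 200.0))      # 20 pipeline cycles at 200 MHz
```

Throughput depends only on the clock and the II, while latency depends on the pipeline depth, so a deeper pipeline does not by itself reduce the sustained rate.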








- Generation strategy of training and validation data sets.

# How? Design and Implementation Workflow





**Design targets** (efficiency, purity, throughput, latency) and hardware constraints (mainly FPGA resource usage) must be taken into account and verified at every stage:

- **TensorFlow/Keras**
  - NN architecture (number and kind of layers) and **representation of the input**.
  - Training strategy (class balancing, batch sizes, optimizer choice, learning rate, ...).
- **QKeras**: iterative search for the minimal representation size, in bits, of weights, biases and activations.
- **hls4ml**: tuning of the REUSE FACTOR configuration parameter (low values → low latency, high throughput, high resource usage) and of the clock frequency.
- **Vivado HLS**: co-simulation for verification of performance (very good agreement with the QKeras model was observed).
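The resource side of the reuse-factor trade-off can be illustrated with a simplified model. It assumes that the reuse factor is the number of times each multiplier is reused when computing a dense layer, so that a layer with `n_inputs * n_outputs` multiplications needs roughly that many multipliers divided by the reuse factor. The layer sizes are hypothetical and real resource usage depends on the synthesis tool.

```python
import math


def multipliers_for_dense(n_inputs, n_outputs, reuse_factor):
    """Approximate number of parallel multipliers for one dense layer.

    Simplified model (an assumption, not a synthesis result): each
    multiplier is time-shared across `reuse_factor` multiplications.
    """
    return math.ceil(n_inputs * n_outputs / reuse_factor)


# Hypothetical 64-input, 32-output dense layer.
for reuse_factor in (1, 2, 8, 64):
    print(reuse_factor, multipliers_for_dense(64, 32, reuse_factor))
```

In this model a reuse factor of 1 is the fully parallel case, and larger values trade multipliers for additional clock cycles per inference.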





[Figure: confusion matrix, counts normalized per true label (horizontal); axes: true label vs. predicted label.]



- Input representation: normalized hitlist (max 64 hits per event)
- Output: 4 classes (0, 1, 2, 3+ rings)
- Quantization (fixed point)
  - Weights and biases: 8 bits <8, 1>
  - Activations: 16 bits <16, 6>
- FPGA resource usage (VCU118) LUT 14%, DSP 2%, BRAM 0%
- Latency: 22 cycles @ 150MHz
- Initiation Interval (II): 8 cycles
- Throughput: 18.75 MHz
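The range and resolution implied by these fixed-point formats can be worked out explicitly. The sketch assumes that `<W, I>` follows the signed `ap_fixed<W, I>` convention (W total bits, I integer bits including the sign). Truncation toward minus infinity followed by saturation is an illustrative choice here, not necessarily the mode used in the actual design.

```python
import math


def fixed_point_format(width, integer):
    """Return (lowest, highest, step) of a signed <width, integer> format."""
    frac_bits = width - integer
    step = 2.0 ** -frac_bits
    lowest = -(2.0 ** (integer - 1))
    highest = 2.0 ** (integer - 1) - step
    return lowest, highest, step


def quantize(value, width, integer):
    """Truncate to the fixed-point grid, then clamp to the representable range."""
    lowest, highest, step = fixed_point_format(width, integer)
    truncated = math.floor(value / step) * step
    return min(max(truncated, lowest), highest)


print(fixed_point_format(8, 1))    # weights and biases
print(fixed_point_format(16, 6))   # activations
print(quantize(0.3, 8, 1))
```

Under this convention the 8-bit weights cover [-1, 1) in steps of 2^-7, and the 16-bit activations cover [-32, 32) in steps of 2^-10.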

| Class | Efficiency (%) | Purity (%) |
|---|---|---|
| 0 (0 rings) | 85.7 | 95.6 |
| 1 (1 ring) | 87.7 | 82.9 |
| 2 (2 rings) | 72.3 | 67.4 |
| 3 (3+ rings) | 71.9 | 84.3 |
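Per-class efficiency and purity can be derived from a confusion matrix of raw counts. The sketch assumes the usual definitions: efficiency is the fraction of events of a true class that are assigned to it, and purity is the fraction of events assigned to a class that truly belong to it. The counts are a toy example, not NA62 data.

```python
def efficiency_and_purity(confusion):
    """confusion[t][p] = number of events with true class t predicted as p."""
    n_classes = len(confusion)
    efficiencies, purities = [], []
    for k in range(n_classes):
        true_total = sum(confusion[k])                        # row sum
        predicted_total = sum(row[k] for row in confusion)    # column sum
        correct = confusion[k][k]
        efficiencies.append(correct / true_total if true_total else 0.0)
        purities.append(correct / predicted_total if predicted_total else 0.0)
    return efficiencies, purities


# Toy two-class counts, for illustration only.
TOY_CONFUSION = [
    [90, 10],
    [20, 80],
]
eff, pur = efficiency_and_purity(TOY_CONFUSION)
print(eff, pur)
```

A matrix normalized per true label shows the efficiencies on its diagonal, whereas purity requires the unnormalized column sums.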

24/05/23




## NN Architectures: Convolutional Model



- Input representation: 16x16 images





- Output: 4 classes (0, 1, 2, 3+ rings)
- Quantization (fixed point):
  - Weights and biases: 8 bits <8, 1>
  - Activations: 16 bits <16, 6>
- FPGA resource usage (Alveo U200)
  - LUT 5.2%, FF 1.5%, DSP 4.8%,
  - BRAM 0.05%
- Latency: <u>388 cycles</u> @ 220MHz
- Initiation Interval (II): <u>369 cycles</u>
- Throughput: <u>0.6 MHz</u>




# Convolutional model issue - Kernel replication

**Throughput** is not enough to sustain the L0 rate, but we can <u>replicate the network</u> multiple times, also on multiple devices if necessary (APEIRON).
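A rough estimate of the number of replicas follows from the figures quoted for the convolutional model (220 MHz clock, II of 369 cycles). The sketch assumes that the 10 MHz rate quoted for the streaming readout is the rate to sustain and that events are distributed evenly across replicas with no dispatch overhead.

```python
import math


def replicas_needed(target_rate_mhz, clock_mhz, initiation_interval):
    """Minimum number of identical kernels to sustain the target event rate."""
    single_kernel_rate_mhz = clock_mhz / initiation_interval
    return math.ceil(target_rate_mhz / single_kernel_rate_mhz)


# Convolutional model figures from the slides; 10 MHz target is an assumption.
n_replicas = replicas_needed(10.0, 220.0, 369)
print(n_replicas)
```

Under these idealized assumptions the estimate is 17 replicas. By the same formula, a model with an II of 8 cycles at 150 MHz would need only one.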




# Results for classification of number of electrons



- Preliminary results for online classification of the number of "electrons" show that even the very simple NN architectures that we tested are capable, below 35 GeV/c momentum, of reaching a non-negligible performance (see terminal output below).
- It can be improved for the online unfiltered event stream using a dedicated NN that also receives input data from other detectors (e.g. L0CALO).

Total events: 163905

| Class | Events of this class | Share of total (%) | Events classified as this class | Share of total (%) |
|---|---|---|---|---|
| 0 | 84628 | 51.63 | 75533 | 46.08 |
| 1 | 76822 | 46.87 | 75209 | 45.89 |
| 2 | 2432 | 1.48 | 11920 | 7.27 |
| 3 | 23 | 0.01 | 1243 | 0.76 |

| Class | Efficiency (%) | Purity (%) | OverContamination (%) | UnderContamination (%) |
|---|---|---|---|---|
| 0 | 82.6 | 92.5 | 7.5 | 0.0 |
| 1 | 80.6 | 82.3 | 0.2 | 17.5 |
| 2 | 74.6 | 15.2 | 0.0 | 84.8 |
| 3 | 91.3 | 1.7 | 0.0 | 98.3 |
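The low purity of the rare classes follows from the class imbalance, and the reported numbers can be cross-checked with a short calculation. The sketch assumes the standard definitions, under which purity equals efficiency times the number of true events of a class divided by the number of events classified as that class. All inputs are taken from the terminal output above.

```python
def purity_from_efficiency(efficiency, n_true, n_predicted):
    """Purity = correctly classified events / events classified as the class."""
    return efficiency * n_true / n_predicted


# (efficiency, events of this class, events classified as this class)
SLIDE_NUMBERS = {
    0: (0.826, 84628, 75533),
    1: (0.806, 76822, 75209),
    2: (0.746, 2432, 11920),
    3: (0.913, 23, 1243),
}

purities_pct = {
    label: round(100 * purity_from_efficiency(*numbers), 1)
    for label, numbers in SLIDE_NUMBERS.items()
}
print(purities_pct)
```

The computed values agree with the reported purities to the quoted precision. For class 3, only 23 true events compete with 1243 events classified as class 3, so even a 91.3% efficiency leaves a purity below 2%.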








- The APEIRON framework enables the development and deployment of Vitis HLS dataflow applications distributed on multiple-FPGA systems.
- The co-design of its software stack and of the Communication IP made it possible to reach very low and deterministic latency and a high fraction of the channel's raw bandwidth for communications between FPGAs, addressing fundamental bottlenecks for real-time distributed dataflow applications.
- We are working to improve the framework and the Communication IP:
  - to increase the internal datapath of the IP to 256 bits and to use the transceiver with 4 lanes, in order to support applications requiring an increased communication bandwidth;
  - to implement a new channel interface based on the Xilinx<sup>®</sup> 10G/25G High Speed Ethernet Subsystem in order to enable interoperability with standard switched networks, either to support input and output streams (e.g. UDP over IP) or to implement a switched network topology.
- We have a well-controlled workflow for the implementation of real-time/high-throughput classifiers on FPGA using limited resources; this suggests applying the methodology also to:
  - less capable (i.e. front-end) FPGAs
  - complex designs making use of a large fraction of FPGA resources (e.g. L0TP+)

## **The APEIRON Team**

#### @INFN Roma – APE Lab



A. Lonardo





F. Lo Cicero





M. Martinelli F. Simula

C. Rossi



P. S. Paolucci





A. Ciardiello



R. Ammendola A. Biagioni

P. Cretaro

O. Frezza







(now @CINECA)







# **BACKUP SLIDES**





- Dataset for training and validation obtained using the NA62 analysis framework
- Analyser called RingDumperAPE
- Single run or in batch (run list) from CTRL trigger sample
- Output: histograms + events dumped to plain text files

- RICH Hit list (TDCEvent)
- RICH trackless reconstruction (TRecoRICHEvent)
- Downstreamtrack reconstruction (Downstreamtrack)
- L0TP (TNA62L0Data)
- Event Labels

- Different labels are dumped to be used as ground truth
  - 1. Number of rings from RichReco
  - 2. Number of rings from Downstreamtrack
  - 3. Number of electrons from RichReco (based on ring radius only)
  - 4. Number of electrons from Downstreamtrack (based on MostLikelyHypothesis)
  - 5. Number of electrons as 4 + check on the radius + check on Energy over momentum ratio (EOP)
  - Event rejection criteria can be optionally activated
    - Formal check on the reconstructed tracks and rings (e.g. chi2)
    - Event characteristics e.g. NHit, Momentum, etc

Electron radius = [185, 195] mm; EoP = [0.90, 1.10]

Events whose most likely hypothesis is "multiple" are rejected.
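The radius and E/p windows used for the most restrictive electron label can be expressed as a simple predicate. The function name and the inclusive treatment of the window edges are assumptions made for illustration; only the numeric windows come from the slide.

```python
# Selection windows quoted on the slide.
ELECTRON_RADIUS_MM = (185.0, 195.0)
ELECTRON_EOP = (0.90, 1.10)


def passes_electron_cuts(ring_radius_mm, energy_over_momentum):
    """True if both the ring radius and E/p fall inside the electron windows.

    Window edges are treated as inclusive (an assumption).
    """
    radius_ok = ELECTRON_RADIUS_MM[0] <= ring_radius_mm <= ELECTRON_RADIUS_MM[1]
    eop_ok = ELECTRON_EOP[0] <= energy_over_momentum <= ELECTRON_EOP[1]
    return radius_ok and eop_ok


print(passes_electron_cuts(190.0, 1.00))
print(passes_electron_cuts(180.0, 1.00))
print(passes_electron_cuts(190.0, 0.50))
```

In the labelling scheme above, these checks are applied in addition to the MostLikelyHypothesis requirement of label 4.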





[Figure: 128-bit word layout, bits 0 to 127 shown in four 32-bit columns, with the following field labels.]

| Bits | Fields (in order of increasing bit index) |
|---|---|
| 0–31 | VirtualChannel, PID/Ch ID, COORDINATE |
| 32–63 | Intratile Port, OutOfLattice pkt, Packet Type, Length |
| 64–95 | Destination Virtual Address |
| 96–127 | Num of Hops, ECC_CR, ECC_CK |
