# APEIRON: a framework for the development of smart TDAQ systems

### (INFN Sezione di Roma - APE Lab)



Speaker: Cristian Rossi (cristian.rossi@roma1.infn.it)



## **APEIRON: overview**

**APEIRON is a framework** developed to offer hardware and software support for the execution of <u>real-time dataflow</u> <u>applications</u> on a system composed of interconnected FPGAs

- Enabling the mapping of the application's dataflow graph onto the distributed FPGA system and offering runtime support for its execution.
- Allowing users, with no (or little) experience in hardware design tools, to develop their applications on such distributed FPGA-based platforms:
  - Tasks are implemented in C++ using High Level Synthesis tools (Xilinx® Vitis).
  - Lightweight C++ communication API (HAPECOM)
    - Non-blocking send()
    - Blocking receive()

APEIRON enables the scaling of Xilinx® Vitis High Level Synthesis applications across multiple FPGAs interconnected by the INFN communication IP.



## Why use FPGAs? ➤ Energy efficiency

- **Energy efficiency** has become an important performance metric in recent years. The scale of applications and computing is increasing exponentially every year, leading to an enormous amount of data to be processed
  - $\Rightarrow$ <u>high power and energy consumption</u>.
- FPGA architectures, together with high programmability, offer a good balance between **performance** and **energy efficiency** without sacrificing the throughput of the application.




#### EDP = Energy Delay Product
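
For reference, the energy-delay product is conventionally defined as the energy consumed by a task multiplied by its execution time, so a lower value is better. The symbols $E$ (energy), $T$ (execution time) and $P$ (average power) are introduced here for illustration only:

$$
\mathrm{EDP} = E \cdot T = P \cdot T^{2}
$$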




## Why use FPGAs? ➤ Real-time inference

- Customizable I/O and deterministic latency make FPGAs well suited for TDAQ systems.
- In such systems, the main design challenge is the **processing throughput**:
  - Pipelined designs can potentially produce a new output at each clock cycle.
  - Initiation interval (II): the number of clock cycles before the function can accept new input data. The lower the II, the higher the throughput.
- High-level synthesis tools allow FPGA datapaths to be described in high-level software languages (C/C++, OpenCL, SYCL, ...), using *#pragma HLS* directives to increase overall performance.
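
As a minimal sketch of the pipelining idea above, the loop below requests an initiation interval of 1 with the Vitis HLS `PIPELINE` pragma. The function names `dot_product` and `results_per_second` and the array size are assumptions made for illustration; a standard C++ compiler simply ignores the pragma, so the code also runs in plain software.

```cpp
#include <cassert>
#include <cstdint>

constexpr int N = 8;  // illustrative array length (assumption)

// Multiply-accumulate over two small arrays.
// In Vitis HLS the PIPELINE pragma asks the tool to start a new loop
// iteration every II clock cycles; a standard C++ compiler ignores it.
int32_t dot_product(const int16_t a[N], const int16_t b[N]) {
    int32_t acc = 0;
    for (int i = 0; i < N; ++i) {
#pragma HLS PIPELINE II = 1
        acc += static_cast<int32_t>(a[i]) * b[i];
    }
    return acc;
}

// Ideal steady-state throughput of a pipelined function: f_clk / II.
constexpr double results_per_second(double clock_hz, int ii) {
    return clock_hz / ii;
}
```

Under these assumptions, a design clocked at 200 MHz with II = 1 could accept up to 200 million inputs per second, while II = 4 would cut that to 50 million. Whether the tool actually achieves the requested II depends on the loop's data dependencies and resource usage.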



## **APEIRON for smart TDAQ Systems**

Abstract Processing Environment for Intelligent Read-Out systems based on Neural networks

- Input **data streams** from several different channels (data sources, detectors/sub-detectors) are recombined through the processing layers using a **low-latency, modular and scalable network infrastructure**.



- More resource-demanding NN layers can be implemented in subsequent processing layers.
- The classification produced by the NN in the last processing layer (e.g. PID) is the input to the trigger processor or, for triggerless systems, to the online data reduction stage ahead of storage.

## APEIRON building blocks: INFN Communication IP





INFN is developing the IPs implementing a <u>direct network</u> that allows **low-latency data transfer** between processing tasks deployed on the same FPGA (**intra-node communication**) and on different FPGAs (**inter-node communication**).

- Host Interface IP: Interface the FPGA logic with the host through the system bus.
- Routing IP: Routing of intra-node and inter-node messages between processing tasks on FPGA.
- Network IP: Network channels and Application-dependent I/O
  - **APElink** 20 Gbps  $\rightarrow$  40 Gbps
  - UDP/IP over 1/10 GbE  $\rightarrow$  25/40/100 GbE
  - ETH port → Xilinx® 10G/25G High Speed Ethernet Subsystem

## **APEIRON building blocks: Software Stack**

- User space: APEIRONS (socket, parser), APEIRON user app, APEIROND (daemon), Vitis HLS user app, APEIRON lib, XRT core lib, XRT runtime lib
- Kernel space: XOCL, XCLMGMT



The APEIRON runtime software stack is built on top of the Xilinx® XRT stack and adds three layers to:

- add the functionalities required to manage multiple FPGA execution platforms (e.g., program the devices, configure the IPs, start/stop execution, monitor the status of IPs, ...);
- reduce the impact of changes in XRT API introduced with any new version of Vitis on the APEIRON host-side applications;
- decouple the APEIRON software stack from the specific platform, easing the future porting of the framework to different platforms/vendors.

**Apeirond** is a persistent daemon that manages multiple access requests from user apps to the board. Using the network socket exposed by the apeirond modules, the **supervisor** can write commands to, and read the status of, the different instances of the APEIRON framework running on each node, giving the user a <u>complete overview of</u> <u>the multi-FPGA execution platform</u>.

## **APEIRON: FPGA bitstream generation**

- The **HLS task** must expose a <u>generic interface</u>; its implementation is unconstrained.
- A **YAML configuration file** describes the <u>kernel interconnection topology</u>, specifying how many input/output channels each kernel has.

Adaptation toward/from IntraNode ports of the Routing IP is done by the automatically generated **Aggregator** and **Dispatcher** kernel templates.



Generic HLS task interface:

    void example_task(
        [list of optional kernel specific parameters],
        message_stream_t message_data_in[N_INPUT_CHANNELS],
        message_stream_t message_data_out[N_OUTPUT_CHANNELS])

Example YAML configuration:

    kernels:
      - name: krnl_compute1
        input_channels: 4
        output_channels: 3
        switch_port: 1

      - name: krnl_compute2
        input_channels: 2
        output_channels: 1
        switch_port: 2

      - name: krnl_compute3
        input_channels: 1
        output_channels: 1
        switch_port: 3
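
To make the generic interface more concrete, here is a software-only sketch of a task with two input channels and one output channel, matching the `krnl_compute2` entry. It is not APEIRON code: `message_stream_t` is replaced by a `std::queue` stand-in, and `word_t`, the `offset` parameter and the task body are assumptions chosen purely for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <queue>

// Stand-in types (assumption): the real message_stream_t is provided by
// the framework and is not a std::queue.
using word_t = uint32_t;
using message_stream_t = std::queue<word_t>;

// Channel counts matching the krnl_compute2 entry of the YAML example.
constexpr int N_INPUT_CHANNELS = 2;
constexpr int N_OUTPUT_CHANNELS = 1;

// Hypothetical task body: reads one word from each input channel, adds
// an offset (a kernel-specific parameter) and emits the sum on channel 0.
void example_task(word_t offset,
                  message_stream_t message_data_in[N_INPUT_CHANNELS],
                  message_stream_t message_data_out[N_OUTPUT_CHANNELS]) {
    word_t sum = offset;
    for (int ch = 0; ch < N_INPUT_CHANNELS; ++ch) {
        sum += message_data_in[ch].front();
        message_data_in[ch].pop();
    }
    message_data_out[0].push(sum);
}
```

The point of the sketch is the shape of the signature: kernel-specific parameters first, then one array of input streams and one array of output streams whose sizes correspond to the channel counts declared in the configuration file.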

## **APEIRON performance**

### (Communication IP: 256 bit datapath @200MHz)



| Bandwidth             | DDR+sync (MB/s) | BRAM (MB/s) |
|-----------------------|-----------------|-------------|
| Intra-node (loopback) | 3938            | 5967        |
| Inter-node (one-way)  | 3938            | 4658        |
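
The measured figures can be put in context with a rough back-of-the-envelope estimate of the datapath's raw bandwidth. This is a sketch under stated assumptions: 1 MB is taken as 10^6 bytes, the helper names are hypothetical, and the limits of the network channel are not modelled.

```cpp
#include <cassert>
#include <cmath>

// Raw datapath bandwidth in MB/s (assuming 1 MB = 10^6 bytes):
// width in bytes multiplied by the clock frequency in MHz.
constexpr double raw_bandwidth_MBps(int width_bits, double clock_mhz) {
    return width_bits / 8.0 * clock_mhz;
}

// Fraction of the raw datapath bandwidth reached by a measured figure.
constexpr double efficiency(double measured_MBps, double raw_MBps) {
    return measured_MBps / raw_MBps;
}
```

For a 256-bit datapath at 200 MHz, `raw_bandwidth_MBps(256, 200.0)` gives 6400 MB/s. Against that figure, the BRAM measurement of 5967 MB/s corresponds to about 93% and 4658 MB/s to about 73%; the DDR+sync value of 3938 MB/s is about 62%.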

## APEIRON applications: FIPLib-multiFPGA



**F**PGA Image **P**rocessing **Lib**rary ⇒ multi-FPGA implementation via APEIRON

- Developed by ENEA in C++, it uses the Vitis HLS flow to build the library's kernels for the execution of image processing algorithms.
- FIPLib encompasses nearly 70 functionalities, designed with a **streaming behavior**.
- On a multi-FPGA setup, we were able to split the overall image processing by implementing a single RGB kernel on each node
  - ⇒ increased internal datapath to 32 B, avoiding FPGA resource limitations.








- Implementing FIPLib HLS kernels as APEIRON tasks means adapting the interface of each kernel to the standard that the framework requires in order to compile the entire project and generate the bitstream
  - ⇒ use of the HAPECOM C++ communication API

| SINGLE-FPGA FIPLib IMPLEMENTATION | <u>MULTI-FPGA FIPLib IMPLEMENTATION</u><br>(APEIRON) |
|-----------------------------------|------------------------------------------------------|
| <pre>while (NbWordToTransfer &gt; BUFFER_SIZE) {<br/>    if (phase) {<br/>        buffer2Stream(outStream, Buff1, BUFFER_SIZE);<br/>        stream2Buffer(inStream, Buff2, BUFFER_SIZE);<br/>    }<br/>    else {<br/>        buffer2Stream(outStream, Buff2, BUFFER_SIZE);<br/>        stream2Buffer(inStream, Buff1, BUFFER_SIZE);<br/>    }<br/>    phase = !phase;<br/>    NbWordToTransfer -= BUFFER_SIZE;<br/>}</pre> | <pre>#include "ape_hls/hapecom.hpp"<br/><br/>while (NbWordToTransfer &gt; BUFFER_SIZE) {<br/>    if (phase) {<br/>        send(Buff1, BUFFER_SIZE*sizeof(word_t), coord,<br/>             task_id, ch_id, message_data_out);<br/>        stream2Buffer(inStream, Buff2, BUFFER_SIZE);<br/>    }<br/>    else {<br/>        send(Buff2, BUFFER_SIZE*sizeof(word_t), coord,<br/>             task_id, ch_id, message_data_out);<br/>        stream2Buffer(inStream, Buff1, BUFFER_SIZE);<br/>    }<br/>    phase = !phase;<br/>    NbWordToTransfer -= BUFFER_SIZE;<br/>}</pre> |
| <pre>void buffer2Stream(hls::stream&lt;io_stream_16b&gt;&amp; outStream,<br/>    tt16 Buff[BUFFER_SIZE], unsigned int size)<br/>{<br/>#pragma HLS inline off</pre> | |








## APEIRON applications: RAIDER



Real-time AI-based Data analytics on hEteRogeneous distributed systems

- High-throughput online stream processing on multi-FPGA systems ⇒ prediction of the number of Cherenkov rings in the stream of events generated by the RICH detector of the CERN NA62 experiment at a <u>rate of about 10 MHz</u>, using multiple CNN\_kernel replicas.
- Lightweight CNN model deployed on a Xilinx Alveo U280 FPGA (limited resource usage)
   ⇒ receives as input a compressed representation of the original event in the form of a B&W 16x16 image (via the imagifier kernel)
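
The reason for replicating the CNN kernel can be illustrated with a simple sizing estimate. This is only a sketch: the clock frequency and per-event initiation interval used in the example are hypothetical values, not RAIDER measurements, and the helper names are introduced here for illustration.

```cpp
#include <cassert>
#include <cmath>

// Events per second that one kernel replica can sustain, assuming it
// accepts a new event every `ii_cycles` clock cycles.
double replica_rate_hz(double clock_hz, int ii_cycles) {
    return clock_hz / ii_cycles;
}

// Minimum number of replicas needed to keep up with the input event rate.
int replicas_needed(double event_rate_hz, double clock_hz, int ii_cycles) {
    double per_replica = replica_rate_hz(clock_hz, ii_cycles);
    return static_cast<int>(std::ceil(event_rate_hz / per_replica));
}
```

For instance, with an assumed 200 MHz clock and an assumed 64 cycles per event, one replica would handle 3.125 million events per second, so a 10 MHz event stream would need at least 4 replicas.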






### **Conclusions**

- The APEIRON framework enables the **development and deployment** of Vitis HLS dataflow applications distributed on multiple-FPGA systems, leading to increased performance in terms of throughput and energy efficiency
- The co-design of its software stack and of the Communication IP made it possible to reach **very low and deterministic latency and a high fraction of the channel's raw bandwidth** for communications between FPGAs, addressing fundamental bottlenecks for real-time distributed dataflow applications.
- We control the workflow for the implementation of real-time/high-throughput classifiers on FPGA using limited resources. This suggests the methodology can also be applied to:
  - less capable (i.e. front-end) FPGAs
  - complex designs that use a large fraction of the FPGA resources
     ⇒ multi-node setup/user-defined topology


- We are eager to find new applications for the APEIRON framework: <u>feel free to contact us</u>!



### **Contacts:**

- <u>cristian.rossi@roma1.infn.it</u>
- <u>alessandro.lonardo@roma1.infn.it</u>

## **BACKUP SLIDES**

## **FPGA overview**

The basic structure of an FPGA is composed of the following elements:

- Look-up table (LUT): This element performs logic operations
- Flip-Flop (FF): This register element <u>stores</u> the result of the LUT
- Wires: These elements connect elements to one another, both logic and clock
- Input/Output (I/O) pads: These physically available ports get signals in and out of the FPGA
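
The roles of the LUT and the flip-flop can be sketched in software as follows. This is a conceptual model only: a 4-input LUT is chosen for illustration (LUT width varies between device families), and the `Lut4` and `FlipFlop` types are hypothetical names, not part of any vendor library.

```cpp
#include <cassert>
#include <cstdint>

// A 4-input LUT is a 16-entry truth table: the inputs form an index and
// the output is the bit stored at that index.
struct Lut4 {
    uint16_t truth_table;
    bool eval(bool a, bool b, bool c, bool d) const {
        unsigned index = (a ? 1u : 0u) | (b ? 2u : 0u) |
                         (c ? 4u : 0u) | (d ? 8u : 0u);
        return ((truth_table >> index) & 1u) != 0;
    }
};

// A flip-flop holds its value until the next clock edge, when it
// captures whatever is present on its input.
struct FlipFlop {
    bool q = false;
    void clock_edge(bool d) { q = d; }
};
```

Loading a different truth table turns the same LUT into a different logic function: `0x8000` yields a 4-input AND, while `0x6666` yields an XOR of the first two inputs. Registering the LUT output in a flip-flop on each clock edge is what allows the pipelined designs discussed earlier.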



