

Antonio Falabella

LHCb Experiment Upgrade

The DAQ for the Upgrade

Event Building performance evaluator

EB performance evaluator at the large scale

Summary

#### The 40 MHz trigger-less DAQ system for the LHCb Upgrade

Antonio Falabella

INFN - CNAF (Bologna)

$13^{\rm th}$ Pisa Meeting on Advanced Detectors, La Biodola, Isola d'Elba (Italy), 24 - 30 May, 2015 (28th May, 2015)

On behalf of the LHCb collaboration




1. LHCb Experiment Upgrade
2. The DAQ for the Upgrade
3. Event Building performance evaluator
4. EB performance evaluator at the large scale



## The LHCb Experiment


- LHCb is a heavy flavour physics experiment that studies CP violation and rare decays of b and c hadrons with high precision
- Run II instantaneous luminosity: $L = 4 \times 10^{32}\ \mathrm{cm}^{-2}\,\mathrm{s}^{-1}$
- Integrated luminosity by the end of Run II: $8 \ {\rm fb}^{-1}$



- Run III scheduled for 2020/2022
- The LHC will run at its design energy of $14 \ {\rm TeV}$
- Instantaneous luminosity will increase by a factor of 5
- Upgrade the DAQ to increase trigger efficiency






## Trigger evolution for Run III


Trigger configuration for Run I and II:

- L0 hardware trigger reduces the rate from 40 MHz to 1.1 MHz
- The software trigger further reduces the rate of data sent to storage

Trigger configuration for Run III:

- Software-only trigger with full event reconstruction
- Enhanced trigger efficiency





## DAQ implementation


• The idea for the DAQ Upgrade is to use a high-throughput network for the readout and the event building (EB)



- PCIe40 readout units push event fragments into builder units (~100 Gbit/s)
- $\sim 500$  EB nodes communicate at  $\sim 100$  Gbit/s full-duplex (DAQ network)
- Output of EB to event filter farm for further processing







• The challenge is to handle 30 Tbit/s of aggregated traffic (30 MHz · 100 KByte)

| Parameter               | Value                         |
|-------------------------|-------------------------------|
| Event rate              | 30 MHz                        |
| Mean nominal event size | 100 KBytes                    |
| Readout board bandwidth | 100 Gbit/s (16 lanes, PCIe 3) |
| CPU cores               | Up to 4000                    |
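
As a rough cross-check of the figures above, the aggregate payload rate is the event rate times the mean event size. The sketch below is only back-of-the-envelope arithmetic: it assumes 1 KByte = 1000 bytes, and the function and constant names are illustrative, not taken from any LHCb software.

```python
# Back-of-the-envelope DAQ throughput estimate (all names are illustrative).
EVENT_RATE_HZ = 30e6       # event rate quoted in the table
EVENT_SIZE_BYTES = 100e3   # mean nominal event size, assuming 1 KByte = 1000 bytes
LINK_GBIT_S = 100.0        # bandwidth of one readout / EB link


def aggregate_tbit_s(rate_hz: float, size_bytes: float) -> float:
    """Aggregate payload throughput in Tbit/s."""
    return rate_hz * size_bytes * 8 / 1e12


def min_links(total_tbit_s: float, link_gbit_s: float) -> float:
    """Number of fully loaded links needed to carry the given traffic."""
    return total_tbit_s * 1000 / link_gbit_s


payload = aggregate_tbit_s(EVENT_RATE_HZ, EVENT_SIZE_BYTES)
print(f"payload from rate x size: {payload:.0f} Tbit/s")
print(f"100 Gbit/s links for the quoted 30 Tbit/s: {min_links(30, LINK_GBIT_S):.0f}")
```

Under these assumptions the raw product is about 24 Tbit/s, somewhat below the quoted 30 Tbit/s; the slides do not say what accounts for the difference. Either figure calls for a few hundred fully loaded 100 Gbit/s links, which is the same order as the ~500 EB nodes foreseen.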

• It can be done with commercial fabric technologies: InfiniBand, Ethernet

• Focus on InfiniBand in this talk



## InfiniBand standard




- The InfiniBand standard is widely used in HPC
- High speed and cost effective
- Constant speed evolution
- Thoroughly tested on different test beds with a purpose-built EB performance evaluator → *lhcb-daqpipe*



# Event Building performance evaluator - lhcb-daqpipe


- Performance evaluator for the EB software: *lhcb-daqpipe* developed and tested with the collaboration of the lhcb-online group
- EB building blocks: Generator, Readout Unit (RU), Builder Unit (BU), Event Manager (Listener and Consumer)



- The generator emulates the PCIe40 output
- It writes metadata and data directly into RU memory
- The EM elects one node as the BU
- Each RU sends its fragment to the elected BU
- Performance measured on different test beds and with different InfiniBand cards
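
The flow described above (generator, RU, EM electing a BU, every RU sending its fragment to that BU) can be sketched as a small single-process simulation. This is not the *lhcb-daqpipe* code: the class names, the round-robin election policy and the in-memory "transport" are all assumptions made for illustration.

```python
from collections import defaultdict


class EventManager:
    """Assigns each event to a builder unit (round-robin is an assumption)."""

    def __init__(self, n_nodes: int):
        self.n_nodes = n_nodes

    def elect_bu(self, event_id: int) -> int:
        return event_id % self.n_nodes


class BuilderUnit:
    """Collects one fragment per readout unit and assembles complete events."""

    def __init__(self, n_sources: int):
        self.n_sources = n_sources
        self.pending = defaultdict(dict)   # event_id -> {source: fragment}
        self.built = []                    # list of (event_id, full event)

    def receive(self, event_id: int, source: int, fragment: bytes) -> None:
        self.pending[event_id][source] = fragment
        if len(self.pending[event_id]) == self.n_sources:
            parts = self.pending.pop(event_id)
            event = b"".join(parts[s] for s in sorted(parts))
            self.built.append((event_id, event))


def run(n_nodes: int, n_events: int, fragment_size: int = 4):
    """Every node acts as both RU and BU; returns the list of builder units."""
    em = EventManager(n_nodes)
    bus = [BuilderUnit(n_nodes) for _ in range(n_nodes)]
    for event_id in range(n_events):
        target = em.elect_bu(event_id)
        for ru in range(n_nodes):
            # Stand-in for the generator: fake detector data for this RU.
            fragment = bytes([ru]) * fragment_size
            bus[target].receive(event_id, ru, fragment)
    return bus


bus = run(n_nodes=4, n_events=8)
for node, bu in enumerate(bus):
    print(node, [event_id for event_id, _ in bu.built])
```

With a round-robin election every node builds the same number of events, so the traffic into each BU is balanced; a real event builder would replace the direct `receive` call with a network transfer.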




- lhcb-daqpipe allows testing both PULL and PUSH protocols
- It provides several transport layer implementations: IB verbs, TCP, UDP
- The processes on the nodes are spawned using MPI or by a synchronization mechanism based on the ZeroMQ library


- We tested the EB software on test beds of increasing size:
  - At CNAF with 2 Intel Xeon servers connected back-to-back
  - At CERN with an 8-node Intel Xeon cluster connected through an IB switch
  - At the 512-node Galileo cluster at Cineca $\rightarrow$ next slides



## QDR and FDR characterization


• We measured the point-to-point bandwidth for different InfiniBand HCAs with RDMA *write* semantics (similar results for *send* semantics)





- QLogic: QLE7340, single port 32 Gbit/s (QDR)
  - Unidirectional throughput 27.2 Gbit/s
  - Galileo and CERN clusters
- Mellanox: MCB194A-FCAT, dual port 56 Gbit/s (FDR)
  - Unidirectional throughput 54.3 Gbit/s (per port)
  - CNAF testbed



### Tuning issues


- I/O performance can be severely degraded if the nodes are not properly tuned
  - Use PCIe Gen3 x16: any earlier PCIe generation is a bottleneck for the network traffic
  - Disable node interleaving in NUMA architectures
  - Disable power management (PM) and CPU frequency selection (PM and frequency switching are latency sources)
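
The PCIe requirement can be checked with simple arithmetic from the per-lane signalling rates and line codes of each PCIe generation. The sketch below uses the standard values (2.5, 5 and 8 GT/s; 8b/10b for Gen1 and Gen2, 128b/130b for Gen3) and ignores packet-level protocol overhead, so real throughput is somewhat lower; the names are illustrative.

```python
# Per-lane signalling rate (GT/s) and line-code efficiency per PCIe generation.
PCIE_GENERATIONS = {
    1: (2.5, 8 / 10),      # 8b/10b encoding
    2: (5.0, 8 / 10),      # 8b/10b encoding
    3: (8.0, 128 / 130),   # 128b/130b encoding
}


def pcie_gbit_s(generation: int, lanes: int = 16) -> float:
    """Line rate after encoding, in Gbit/s, before packet-level overhead."""
    rate, efficiency = PCIE_GENERATIONS[generation]
    return rate * efficiency * lanes


for gen in PCIE_GENERATIONS:
    print(f"PCIe Gen{gen} x16: {pcie_gbit_s(gen):.1f} Gbit/s")
```

A Gen2 x16 slot tops out at 64 Gbit/s, below the ~100 Gbit/s each node has to sustain, while Gen3 x16 offers about 126 Gbit/s before protocol overhead.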





#### Event Builder performance


• Measured bandwidth @CNAF as seen by the builder units on two nodes equipped with Mellanox FDR (max bandwidth 54.3 Gbit/s considering the encoding)

• Duration of the tests: 15 minutes



- PM and node interleaving disabled
- Bandwidth measured is on average 53.3 Gbit/s: 98% of the maximum allowed
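
The 54.3 Gbit/s ceiling and the 98% figure can be reproduced with a short calculation. It assumes that "considering the encoding" refers to the 64b/66b line code used by FDR links and that the nominal 56 Gbit/s is the signalling rate; the function name is illustrative.

```python
def data_rate_gbit_s(signalling_gbit_s: float, payload_bits: int, coded_bits: int) -> float:
    """Usable data rate once the line-code overhead is removed."""
    return signalling_gbit_s * payload_bits / coded_bits


fdr_max = data_rate_gbit_s(56.0, 64, 66)   # FDR: 64 payload bits per 66 coded bits
measured = 53.3                            # average EB bandwidth quoted above

print(f"FDR maximum data rate: {fdr_max:.1f} Gbit/s")
print(f"fraction of maximum reached: {measured / fdr_max:.1%}")
```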





• Extensive tests have been performed on the CINECA Galileo Tier-1 cluster

| Parameter  | Value                                        |
|------------|----------------------------------------------|
| Nodes      | 516                                          |
| Processors | 2 × 8-core Intel Haswell, 2.40 GHz, per node |
| RAM        | 128 GB/node, 8 GB/core                       |
| Network    | InfiniBand with 4x QDR switches              |
| MPI        | OpenMPI v1.8.4                               |

• The cluster size is similar to that of the upgraded LHCb DAQ network



### Event Builder performance


- Measured bandwidth as seen by the BU on an increasing number of nodes
- Blue: Average bidirectional bandwidth achievable (24.7 Gbit/s)



- The EB works properly up to a scale of 128 nodes
- A few limitations prevent reaching the maximum bandwidth:
  - The cluster is in production, so other processes pollute the network traffic
  - No control over power management and frequency switching



#### Summary


- A software-level trigger for the LHCb Upgrade is a challenging task $\to$ the DAQ can be implemented with an InfiniBand-based network
- A performance evaluator has been developed to test the possible implementation choices
- I tested several InfiniBand HCAs on different test beds
- Control over node interleaving and PM is needed to get the best performance
- Large-scale tests have been performed, showing that the EB prototype behaves properly as the number of nodes increases
- Next developments:
  - Testing the EB performance evaluator with a higher number of nodes
  - Making it less sensitive to PM issues



# Backup






# NUMA - Non-Uniform Memory Access


- Memory architecture for multiprocessors
- In the schema above each CPU (NUMA node) has its own local bank of memory
- A CPU has faster access to its local memory
- Access to non-local memory is a potential bottleneck for IO



- The NIC is connected to one of the CPUs



# NUMA topology


*lstopo*, part of the *hwloc* package, produces CPU/Cache/Memory topology schemas

[Figure: *lstopo* output for a test bed machine. Two sockets (P#0 and P#1), each with a shared 15 MB L3 cache and six cores (Core P#0 to P#5); each core has 256 KB L2, 32 KB L1d and 32 KB L1i caches. PUs P#0 to P#5 and P#12 to P#17 belong to Socket P#0; NUMANode P#1 has 16 GB of memory. A column of PCI devices is drawn on the right.]

Topology of our machines

- The test bed machines have 2 NUMA nodes
- The FDR InfiniBand network interfaces ib0 and ib1 and the Ethernet interfaces are connected to the first NUMA node
- High network latency is experienced if the EB data fragments are sent by a process running on cores 6 to 11
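
One way to avoid the remote-node penalty described above is to pin the sending process to the CPUs that share a NUMA node with the network interface. The sketch below is a Linux-only illustration: it reads the `numa_node` and `cpulist` files that the kernel exposes under `/sys`, the helper names are hypothetical, and it is not how *lhcb-daqpipe* handles affinity.

```python
import os
from pathlib import Path


def parse_cpulist(text: str) -> set:
    """Parse a sysfs CPU list such as '0-5,12-17' into a set of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        first, _, last = part.partition("-")
        cpus.update(range(int(first), int(last or first) + 1))
    return cpus


def local_cpus(interface: str, sysfs: Path = Path("/sys")) -> set:
    """CPUs on the NUMA node that hosts the given network interface."""
    node_file = sysfs / "class" / "net" / interface / "device" / "numa_node"
    node = int(node_file.read_text())
    if node < 0:  # -1 means the kernel reports no NUMA affinity for the device
        raise RuntimeError(f"{interface}: no NUMA node reported")
    cpulist = sysfs / "devices" / "system" / "node" / f"node{node}" / "cpulist"
    return parse_cpulist(cpulist.read_text())


def pin_to_interface(interface: str) -> None:
    """Restrict the calling process to CPUs local to the interface."""
    os.sched_setaffinity(0, local_cpus(interface))
```

On a machine laid out like the one above, calling `pin_to_interface("ib0")` before starting the transfers would be expected to keep the process on the first NUMA node and off cores 6 to 11.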





- Power saving states (C-states) of CPUs reduce the power consumption but can be critical to performance
- C0 corresponds to every CPU component being turned on; higher-numbered C-states correspond to lower power consumption

- Switching back and forth among the various states will result in performance degradation
- Switching between CPU frequencies gives similar effects
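
Before a measurement it can help to record which idle states and which frequency governor a node is actually using. The following is a minimal sketch for Linux that reads the `cpuidle` and `cpufreq` entries under `/sys`; the function names are illustrative, and whether these files are present depends on the kernel and drivers in use.

```python
from pathlib import Path


def idle_states(cpu: int = 0, sysfs: Path = Path("/sys")) -> list:
    """Return (name, exit latency in microseconds, disabled) per idle state."""
    base = sysfs / "devices" / "system" / "cpu" / f"cpu{cpu}" / "cpuidle"
    dirs = sorted(base.glob("state[0-9]*"), key=lambda p: int(p.name[5:]))
    states = []
    for state in dirs:
        name = (state / "name").read_text().strip()
        latency_us = int((state / "latency").read_text())
        disabled = (state / "disable").read_text().strip() == "1"
        states.append((name, latency_us, disabled))
    return states


def governor(cpu: int = 0, sysfs: Path = Path("/sys")) -> str:
    """Current cpufreq scaling governor, e.g. 'performance' or 'powersave'."""
    path = sysfs / "devices" / "system" / "cpu" / f"cpu{cpu}" / "cpufreq" / "scaling_governor"
    return path.read_text().strip()
```

Deeper states report larger exit latencies, which is the cost paid each time a core wakes up to handle network traffic; logging `idle_states()` and `governor()` alongside each bandwidth measurement makes runs on differently tuned nodes easier to compare.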