



### Trigger and Data Acquisition Systems In High Energy Physics: The Challenge of the LHCb Upgrade

Domenico Galli

Università di Bologna e INFN, Sezione di Bologna

INFN Bologna, "Aperitivi Scientifici", November 27, 2015



Top and Bottom Halves of Data Processing in HEP



#### • Top Half:

- Processing of particle interaction events is performed on data stored on disks/tapes of Tier-0/Tier-X computer centre:
  - Event tracks are reconstructed;
  - Event kinematics is reconstructed;
  - Useless event data are stripped;
  - Full event is reconstructed and tagged;
  - Mass analysis/Group analysis is performed;
  - Local (user, n-tuple) analysis is performed.
#### • Bottom Half:

- Who brought the data to the Tier-0?
- How are they driven from the detector to the storage?
- How are data selected?
- How are data coming from different sub-detectors **merged together**?





#### • Trigger:

- Multi-level trigger, Trigger trend;
- Readout and Signal Processing:
  - Pulse shaping, Range compression, Digitisation, Zero suppression;
- Data Acquisition and Event Building:
  - Basic DAQ, Collider DAQ, Event Building;
- Modules and Data Bus for DAQ systems:
  - NIM, CAMAC, FastBus, VME, PCI, PCIe, ATCA/µTCA;
- Network based DAQ:
  - Ethernet, InfiniBand;
- The Challenge of the DAQ for the LHCb Upgrade:
  - 30 MHz Rate;
  - 32 Tb/s Aggregate Throughput.









- A HEP experiment can collect hundreds of EiB (1 EiB = 2<sup>60</sup> B) of data in a year:
  - Only a small subset (~1/10<sup>6</sup> events) is of primary physics interest.
- Tape I/O lags behind many other computer components:
  - This problem can be overcome by writing in parallel to many tapes;
- Storage media costs could exceed tens of G€/year.
- How much **CPU power** for post-triggering reconstruction of all the events?



- Cannot save all raw data all the time.
- Eliminate useless background as early as possible:
  - In order to save resources to process interesting events.

**Trigger History** 

#### **Bubble chambers**:

- DAQ: stereo photograph.
- Low level trigger: piston expansion.
- High level trigger: humans (scanners).
- Early fixed target experiments:
  - Merely hardware implementation.
  - Very simple calculation.
  - Raw discrimination.
  - Large dead-time possible during readout.
  - DAQ came after the trigger.
- Nowadays HEP experiments:
  - Multi-level.
  - Pipelines to pull down dead-time (< 5%).
  - Hardware look-up tables for fast calculation.
  - Software implementation of higher level.





Cronin-Fitch experiment, 1964

ALMA MATER STUDIORUM - UNIVERSITÀ DI BOLOGNA





Istituto Nazionale

di Fisica Nucleare

Figure: LHC experiments (ALICE, ATLAS, CMS) compared by number of channels (10<sup>7</sup>), bandwidth (1000 Gbit/s) and data archive size (PetaBytes).

- Software trigger advantages:
  - Flexibility:
    - Selection rules for the events can be changed simply by modifying the software code.
  - Scalability:
    - Processed event rate can be increased simply by increasing the farm size (number of PCs) and the number of ports of the network switch.
  - Cost:
    - Commodity components used in software triggers are very cheap.
    - Allows profiting from the rapid price drop in commodity components (PC, Ethernet cables and switches, etc.).
  - Maintainability:
    - Widespread commodity interfaces will continue to be available on the market (with increased performance).
  - Upgradeability:
    - Can profit from the rapid development undergone by commodity components.

- Drawbacks:
  - Variable latency: difficult to use for any but the last trigger level;
  - DAQ/EB throughput must be suitable for the trigger.

TDAQ Systems In High Energy Physics. Domenico Galli





#### Rightsizing the HLT Farm



- The software trigger is not designed to have a fixed latency.
- We can therefore only reason in terms of average values.
- The average time spent in the selection algorithm, $\langle T_s \rangle$, must be less than the average period separating the arrival of two consecutive events at the same trigger node, $N_{\rm cpu} / \nu_{\rm input}$, i.e.:

$$\langle T_s \rangle \leq N_{\rm cpu} / \nu_{\rm input}$$

- So it must be:

$$N_{\rm cpu} \geq \langle T_s \rangle \cdot \nu_{\rm input}$$

- In the LHCb case, $\langle T_s \rangle \approx 2 \text{ ms}$ and $\nu_{\text{input}} = 1 \text{ MHz}$, so it must be $N_{\text{cpu}} \ge 2000$.
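The sizing rule above can be checked with a few lines of Python. This is only a minimal sketch: the function name `min_cpu_count` is hypothetical, and the rule is applied to average values only, ignoring fluctuations and any safety margin a real farm would need.

```python
import math

def min_cpu_count(mean_selection_time_s: float, input_rate_hz: float) -> int:
    """Smallest N_cpu satisfying N_cpu >= <T_s> * nu_input."""
    product = mean_selection_time_s * input_rate_hz
    # Subtract a tiny tolerance so that floating-point noise
    # (e.g. 2000.0000000001) does not add a spurious extra CPU.
    return math.ceil(product - 1e-9)

# LHCb numbers quoted on the slide: <T_s> ~ 2 ms, nu_input = 1 MHz.
n_cpu = min_cpu_count(2e-3, 1e6)
print(f"Minimum number of CPUs: {n_cpu}")
```

With the numbers quoted on the slide this reproduces the bound of 2000 CPUs.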


#### LHCb Trigger Evolution (II)

- 2 stages:
  - Level-0: synchronous, hardware + FPGA; 40 MHz → 1 MHz.
  - HLT: software, PC farm: 1 MHz → 2 kHz.
- Front-End Electronics:
  - Interfaced to Read-out Network.
- Read-Out Network:
  - Gigabit Ethernet LAN.
  - Read-out @ 1.1 MHz.
  - Aggregate throughput: 60 GiB/s.










- Typically, the **pulse shaper** transforms a **narrow** detector current pulse to a **broader** pulse:
  - To increase the rise time;
  - To reduce bandwidth;
  - To reduce electronic noise;
- With a **gradually rounded maximum** at the peaking time $T_P$:
  - To facilitate measurement of the peak amplitude.

Figure: sensor pulse and shaper output.



- Example: CR-RC Shaper:
  - In this case made out of CR (differentiator) and RC (integrator) filters;
  - Key elements:
    - Lower frequency bound (related to pulse duration);
    - Upper frequency bound (related to rise time).
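As an illustration of the CR-RC shaper, the sketch below evaluates its response to an ideal unit step. It assumes (this is not stated on the slide) that the differentiator and the integrator share the same time constant $\tau$ and that the stages are ideally buffered, in which case the output is $V(t) = (t/\tau)\, e^{-t/\tau}$, with a rounded maximum of $1/e$ at the peaking time $T_P = \tau$. The value $\tau = 50$ ns is purely hypothetical.

```python
import math

def cr_rc_step_response(t: float, tau: float) -> float:
    """Ideal CR-RC output for a unit step input, equal time constants tau."""
    if t < 0:
        return 0.0
    x = t / tau
    return x * math.exp(-x)

tau = 50e-9                                   # hypothetical shaping time: 50 ns
times = [i * 1e-9 for i in range(501)]        # 0 .. 500 ns in 1 ns steps
samples = [cr_rc_step_response(t, tau) for t in times]

# Locate the rounded maximum numerically.
peak_index = max(range(len(samples)), key=samples.__getitem__)
peaking_time = times[peak_index]
print(f"peaking time = {peaking_time * 1e9:.0f} ns, amplitude = {samples[peak_index]:.3f}")
```

A longer $\tau$ gives a broader, smoother pulse (less noise, easier peak measurement) at the price of a longer occupancy of the channel.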






#### ...But Not Too Much



- Broad pulses reduce the temporal spacing between consecutive pulses;
- Need to limit the effect of "pile-up":
  - Pulses not too broad;
- As usual in life: a compromise.



#### Baseline Restorer

- Any series capacitor in a system prevents transmission of a DC component.
- A sequence of unipolar pulses has a DC component that depends on the duty factor, i.e. the event rate.
- The baseline shifts to make the overall transmitted charge equal to zero.
- Random rates lead to random fluctuations of the baseline shift:
  - Spectral broadening;
- Need a baseline restorer.




#### **Range Compression**


- A non-linear transformation:
  - Compressing the signal according to an appropriate piecewise linear transfer function;
  - Producing an output in the range best suited for the digitization circuit.
- Typically the sum of the outputs of several linear amplifiers with different gain and upper cutoff.

#### **Digitisation: Flash ADC**

- **Digitization**: encoding an analog value into a binary representation;
  - By comparing the entity to be measured with a ruler.
- Flash ADC is the simplest and fastest implementation:
  - M comparisons in parallel (differential comparators);
  - Number of output bits: $N = \log_2 (M+1)$;
  - **E.g.**: M = 3, N = 2.

Figure: ADC transfer function, output code (000, 001, 010, ...) versus analog input (1/8, 2/8, ..., 7/8 of full scale FS).
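The parallel comparison performed by a flash ADC can be mimicked in software. The following is a minimal sketch under stated assumptions: the comparator thresholds are taken at multiples of FS/(M+1), as the axis labels of the transfer function suggest (a real device typically offsets them by half an LSB), and the function names are hypothetical.

```python
import math

def flash_adc(v_in: float, full_scale: float, n_bits: int) -> int:
    """Return the output code of an idealised N-bit flash ADC."""
    m = 2 ** n_bits - 1                       # number of comparators, M = 2^N - 1
    # Assumed "ruler": thresholds at FS/(M+1), 2*FS/(M+1), ..., M*FS/(M+1).
    thresholds = [full_scale * (k + 1) / (m + 1) for k in range(m)]
    # All M comparisons are independent, hence done in parallel in hardware.
    thermometer = [v_in >= th for th in thresholds]
    # Encoding a thermometer code amounts to counting the comparators that fired.
    return sum(thermometer)

def bits_for_comparators(m: int) -> int:
    """N = log2(M + 1), for M of the form 2^N - 1."""
    return round(math.log2(m + 1))

print(bits_for_comparators(3))                # M = 3 comparators -> N = 2 bits
print(format(flash_adc(0.6, 1.0, 2), "02b"))  # 0.6 FS with 2 bits
```

The speed comes at a cost: the number of comparators grows as $2^N - 1$, which is why flash ADCs are usually limited to a modest number of bits.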

- QDC (Charge to Digital Converter):
  - Often we have a current and we are interested in the total charge;
  - Essentially an integration step followed by an ADC;
  - Integration requires limits: a gate.





#### Digitisation: QDC (II)

- Relative timing between signal and gate is important:
  - Delay tuning.
- Gate should be large enough to contain the full pulse and to accommodate the jitter:
  - Fluctuations are always present.
- Gate should not be too large:
  - Increases the noise level;
  - Increases dead time.

#### Digitisation: QDC (III)

- Pedestal:
  - Due to PMT dark current (thermionic emission), thermal noise, etc.;
  - The same noise enters the physics measurements and contributes with an offset to the distribution;
  - Can be measured with an out-of-phase trigger;
  - The result of a pedestal measurement has to be subtracted from the charge measurements.

#### Digitisation: TDC

- Example: measure the **position** of a particle in a wire (drift) chamber.
- The ionization electrons created by the passage of the particle will take a time $\Delta t$ to reach the anode wire:
  - Transit time is normally **negligible** with respect to $\Delta t$;
- If we consider a constant drift speed $v_D$ (e.g.: 50 µm/ns), then the position is:

$$x = v_D \cdot \Delta t$$

#### Digitisation: TDC (II)

- Wire chamber alone is not sufficient:
  - We need a triggering system (e.g. a scintillator slab).
- We can measure the time offset between the two signals using an N-bit digital counter driven by a clock of frequency f:
  - The wire signal acts as a start signal;
  - The scintillator provides the stop signal.
- This device is a TDC.
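The counter-based time measurement and the conversion to a drift distance can be sketched as follows. The clock frequency (1 GHz) and the counter width (8 bits) are hypothetical values chosen for illustration; the drift speed of 50 µm/ns is the example value quoted for the drift chamber, and the function names are assumptions.

```python
def tdc_counts(delta_t_ns: float, clock_mhz: float, n_bits: int) -> int:
    """Clock ticks counted between start and stop by an N-bit counter.

    The counter saturates at 2^N - 1 (no wrap-around is modelled).
    """
    period_ns = 1e3 / clock_mhz
    counts = int(delta_t_ns // period_ns)
    return min(counts, 2 ** n_bits - 1)

def drift_position_um(counts: int, clock_mhz: float,
                      v_drift_um_per_ns: float = 50.0) -> float:
    """x = v_D * delta_t, with delta_t reconstructed from the TDC counts."""
    period_ns = 1e3 / clock_mhz
    return v_drift_um_per_ns * counts * period_ns

counts = tdc_counts(delta_t_ns=100.0, clock_mhz=1000.0, n_bits=8)
x_um = drift_position_um(counts, clock_mhz=1000.0)
print(f"{counts} counts -> x = {x_um / 1000:.1f} mm")
```

In this model the position resolution is one clock period times the drift speed (50 µm per count at 1 GHz), and the counter width fixes the longest measurable drift time.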

#### Zero Suppression

- In the LHCb upgrade all Front End electronics transmit data continuously at 40 MHz to the Readout Boards:
  - Very large number of optical links needed between the FE and the new Readout Boards.
- Almost a factor of ten could be gained by sending zero-suppressed data already at the FE:
  - Reducing the number of optical links from ~80000 to ~10000.
  - The zero-suppression will be **performed** in **radiation-tolerant FE chips**.
  - A possible consequence of zero-suppression is a varying latency of data transmission.
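In its simplest form, zero suppression keeps only the channels whose digitised value exceeds a threshold, together with their channel address. The sketch below is a generic illustration of that idea and not the algorithm implemented in the LHCb front-end chips; the sample values and the threshold are hypothetical.

```python
def zero_suppress(samples: list[int], threshold: int) -> list[tuple[int, int]]:
    """Keep only (channel, value) pairs with value above threshold."""
    return [(channel, adc) for channel, adc in enumerate(samples) if adc > threshold]

# Hypothetical pedestal-subtracted ADC values of one 8-channel front-end chip.
raw = [0, 1, 0, 37, 2, 0, 0, 12]
hits = zero_suppress(raw, threshold=3)
print(hits)
print(f"channels kept: {len(hits)} of {len(raw)}")
```

Since the output length now depends on the occupancy of each event, the time needed to ship the data varies from event to event, which is the varying latency mentioned above.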



### **Data Acquisition**


#### Un-Triggered DAQ



- FEE: ADC performs analog to digital conversion;
- DAQ: CPU does ADC readout and disk write.
- System limited by the time  $\tau$  needed to process an event:
  - ADC conversion + CPU processing + storage.
- The maximum sustainable DAQ rate is the inverse of  $\tau$ , e.g.:

$$\tau = 1 \text{ ms} \quad \Rightarrow \quad \nu = \frac{1}{\tau} = 1 \text{ kHz}$$

#### Basic Triggered DAQ (II)

- The process is **Poissonian**:
  - Fluctuations in time between events;
- Let's assume for example:
  - A process rate $\nu_{\rm ph} = 1$ kHz, i.e. $\lambda = 1$ ms;
  - A processing time $\tau_{\rm daq} = 1$ ms.

Figure: distribution of the time between events (ms).

#### Basic Triggered DAQ (IV)

- Definitions:
  - Average rate of physical events (input): $\nu_{\rm ph}$;
  - Average rate of DAQ (output): $\nu_{\rm daq}$;
  - **Dead time**, the time the system requires to process an event, without being able to handle other triggers: $\tau_{\rm daq}$;
  - Probability that DAQ is busy: $P({\rm busy}) = \nu_{\rm daq} \cdot \tau_{\rm daq}$;
  - Probability that DAQ is free: $P({\rm free}) = 1 - \nu_{\rm daq} \cdot \tau_{\rm daq}$;
- Therefore:

$$\nu_{\rm daq} = \nu_{\rm ph} \cdot P({\rm free}) = \nu_{\rm ph} \left(1 - \nu_{\rm daq} \cdot \tau_{\rm daq}\right) \quad \Rightarrow \quad \nu_{\rm daq} = \frac{\nu_{\rm ph}}{1 + \nu_{\rm ph} \cdot \tau_{\rm daq}}$$



#### Basic Triggered DAQ (V)

- Due to stochastic fluctuations:
  - DAQ rate always less than physics rate:

$$\nu_{\rm daq} = \frac{\nu_{\rm ph}}{1 + \nu_{\rm ph} \cdot \tau_{\rm daq}} < \nu_{\rm ph}$$

  - Efficiency always less than 1:

$$\varepsilon = \frac{1}{1 + \nu_{\rm ph} \cdot \tau_{\rm daq}} < 100\%$$

- Example:

$$\begin{cases} \nu_{\rm ph} = 1 \text{ kHz} \\ \tau_{\rm daq} = 1 \text{ ms} \end{cases} \Rightarrow \begin{cases} \nu_{\rm daq} = 500 \text{ Hz} \\ \varepsilon = 50\% \end{cases}$$

Figure: distribution of the time between events (ms), exponential with mean $\lambda = 1$ ms, $f(t) = \frac{1}{\lambda} e^{-t/\lambda}$.

#### De-Randomisation

- Input fluctuations can be absorbed and smoothed by a queue:
  - A First In First Out (FIFO) can provide a ~steady delay and de-randomised output rate.

Figure: trigger, busy logic and FIFO, with the inter-arrival time distribution (ms) before the FIFO ($\lambda$ (ms), $f$ (Hz)) and after it ($\tau$ (ms), $\nu$ (Hz)).
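The dead-time formula and the benefit of a de-randomising FIFO can be checked with a small Monte Carlo. This is a simplified sketch, not a model of any specific DAQ: arrivals are Poissonian, the processing time is fixed, and an event is lost only when the processor is busy and the FIFO (of hypothetical depth `fifo_depth`) is full. The function names are assumptions.

```python
import random
from collections import deque

def analytic_efficiency(nu_ph: float, tau_daq: float) -> float:
    """epsilon = 1 / (1 + nu_ph * tau_daq), valid without any buffering."""
    return 1.0 / (1.0 + nu_ph * tau_daq)

def daq_efficiency(nu_ph: float, tau_daq: float, fifo_depth: int = 0,
                   n_events: int = 200_000, seed: int = 1) -> float:
    """Fraction of Poissonian events accepted by a DAQ with fixed processing time."""
    rng = random.Random(seed)
    t = 0.0
    accepted = 0
    finish_times = deque()   # completion times of events in service or queued
    for _ in range(n_events):
        t += rng.expovariate(nu_ph)            # exponential inter-arrival time
        while finish_times and finish_times[0] <= t:
            finish_times.popleft()             # these events are already done
        if len(finish_times) <= fifo_depth:    # room: 1 in service + fifo_depth waiting
            start = finish_times[-1] if finish_times else t
            finish_times.append(start + tau_daq)
            accepted += 1
    return accepted / n_events

# Slide example: nu_ph = 1 kHz and tau_daq = 1 ms (units: kHz and ms).
print(f"analytic, no FIFO : {analytic_efficiency(1.0, 1.0):.3f}")
print(f"simulated, no FIFO: {daq_efficiency(1.0, 1.0):.3f}")
print(f"simulated, FIFO 8 : {daq_efficiency(1.0, 1.0, fifo_depth=8):.3f}")
```

Without buffering the simulation reproduces the 50% efficiency of the example; even a shallow FIFO recovers most of the lost events, at the price of a variable waiting time in the queue.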



#### Event Building and Trigger


- Usually low-level trigger is based on local (sub-detector) data:
  - Event fragments are sent to trigger electronics through dedicated lines, before event building;
- High-level trigger requires all detector data:
  - HLT is performed on built events.



#### Clock/Trigger Distribution and Synchronisation

- An event is a snapshot of the values of all detector front-end electronics elements whose values are caused by the same collision;
- A common clock signal must be provided to all detector elements:
  - Since c is constant (and finite), the detectors are large and the electronics is fast, the detector elements must be carefully time-aligned.
- Common system for all LHC experiments: TTC (Timing, Trigger and Control), based on radiation-hard opto-electronics.

- Timing & Sync Control:
  - Sampling clock with low jitter;
  - Synch reset;
  - Synchronization with machine bunch structure;
  - Calibration;
  - Trigger (with event type).

Figure: TTC distribution chain (TTC driver, optical and electrical links, TTC switch/fan-out).



### **Modules and Data Bus**

#### Modular Electronics


- Modularizing DAQ electronics helps in these respects:
  - Allows for the **re-use** of **generic modules** in different applications;
  - Limiting the complexity of individual modules increases their reliability and maintainability;
  - You can profit from third-party support for common modules;
  - Makes it easier to achieve scalable designs;
  - Upgrades (for performance or functionality) are less difficult.





#### NIM (Nuclear Instrument Modules)

- Not actually a data bus:
  - No common backplane bus;
  - Backplane provides only power to functional modules;
- 250 x 193 mm board size:
  - 12 boards per crate maximum;
- Plug-and-play approach:
  - Does not need any software;
  - Front panel settings and cable connections;
- Amplifiers, shapers, discriminators, delay units, etc.
Figure: NIM modules with BNC and LEMO connectors, and the NIM backplane connector.

- NIM logic levels:
  - 0 = 0 A (0 V);
  - 1 = -12 to -32 mA (typical -16 mA) at 50 $\Omega$ (-0.8 V);
- NIM connector:
  - 42 pins in total;
  - 11 pins used for power (+/- 6, 12, 24 V);
  - 2 logic pins (reset & gate).



#### CAMAC (Computer Automated Measurement and Control)

- CAMAC was the first successful databus interface between commercial computers and custom detector electronics;
- Most physics experiments from the late '60s to the late '80s were based on parallel CAMAC electronics, e.g.:
  - UA1 (interfaced to Apple Macintosh Plus);
  - UA2 (interfaced to VAX 11/780).
- Several large distributed accelerator control systems (CERN, Fermilab) were based on serial CAMAC.







#### VME (Versa Module Eurocard)

- VME standard was proposed in 1981 by Motorola, Mostek and Signetics;
- Processor independent, but the signal set has its roots in the MC 68000 CPU;
- Open architecture;
- VME International Trade Association (VITA) remains the driving force;
- Large number of commercial products (used heavily in the military);
- 32/64 bit bus (320/640 Mb/s);
- Currently there are more than 1000 VMEbus systems at CERN (accelerator and experiments).









- Classes of modules (logical):
  - Master:
    - A module that can initiate data transfers;
  - Slave:
    - A module that responds to a master;
  - Interrupter:
    - A module that can send an interrupt (usually a slave);
  - Interrupt handler:
    - A module that can receive (and handle) interrupts (usually a Single Board Computer);
  - Arbiter:
    - A piece of electronics (usually included in the SBC) that arbitrates bus access and monitors the status of the bus;
    - It should always be installed in slot 1 of the VMEbus crate if interrupts are used.

#### PCI (Peripheral Component Interconnect) / PCI-X (Extended) Bus



- Local computer bus for attaching hardware devices in a computer.
- First standardized in 1991:
  - Replaced the older ISA/EISA/MCA cards;
  - Initially intended for PC cards;
  - Later spin-offs: CompactPCI, PXI, PMC.







#### Communication in a Crate: Buses

- A bus connects two or more devices and allows them to communicate;
- The bus is shared between all devices on the bus:
  - Arbitration is required;
- Devices can be masters or slaves (some can be both);
- Devices can be uniquely identified ("addressed") on the bus.



#### PCI/PCI-X Bus (II)

#### Main features:

- Synchronous timing (but wait cycles possible);
- **Clock** rates:
  - Initially 33 MHz. Later: 66 MHz (PCI-X: 100 and 133 MHz);
- Bus width:
  - Initially 32 bit. Later: 64 bit;
- Signaling voltage:
  - Initially 5 V. Later 3.3 V;
- Bus topology:
  - 1 to 8 slots per bus;
  - Buses can be connected to form a tree;
  - Address and data as well as most protocol lines are shared by all devices;
  - The lines used for arbitration are connected point-to-point;
  - The routing of the interrupt request lines is more complicated...
  - A system can consist of several Initiators (master) and Targets (slave) but only one Initiator (master) can receive interrupts.

#### Limits of Parallel Data Bus


- What is wrong about "parallel"?
  - You need lots of pins on the chips and wires on the PCBs (printed circuit boards):
    - Control, data and address lines;
  - The **skew** (difference in arrival time of simultaneously transmitted bits) between lines **limits the maximum speed**;
- What is wrong about "bus"?
  - A bus is shared between all devices (each new active device slows every other device down);
  - Speed is a function of the length (impedance) of the lines;
    - Bus-frequency (number of elementary operations per second) can be increased, but decreases the maximum physical bus-length;
  - Number of devices and physical bus-length is limited;
  - Communication is limited to one master/slave pair at a time.
- Buses are typically useful for systems < 1 GB/s:
  - Not useful for DAQ at LHC.

#### PCIe (PCI Express)

- Not a bus any more:
  - But a point-to-point link;
- Data not transferred on parallel lines but on one or several serial lanes:
  - Lane: one pair of LVDS lines per direction;
  - Devices can support up to 32 lanes;
- Protocol at the link layer has nothing to do with the protocol of parallel PCI;
- Fully transparent at the S/W layer.



### From Parallel Data Bus to Switched Serial Link

#### Parallel Buses Are Dead

RTC magazine, September 2006, Ben Sharfi CEO, General Micro Systems;

#### • Switched Serial Link:

- Packet switching;
- Star or mesh topology;
- Examples:
  - PCIe (PCI Express);
  - InfiniBand;
  - Ethernet;
  - Serial ATA;
  - Fiber Channel;
  - Etc.





Figure: devices (Dev.) connected through a switch in star and mesh topologies.



#### ATCA (Advanced Telecommunications Computing Architecture)

- The Basic Idea:
  - Telecom companies are using proprietary electronics;
  - Let's design a standard for them from scratch;
  - It has to have all the features telecom companies need:
    - High availability (99.999%);
    - Redundancy at all levels;
    - Very high data throughput;
    - Sophisticated remote monitoring and control.







#### AMC (Advanced Mezzanine Card)


- ATCA blades are big.
- Small mezzanine modules could be helpful to modularize their functionality:
  - PMC/XMC mezzanines are not hot-swappable;
  - Let's design a new type of mezzanine for ATCA






#### µTCA / MTCA (Micro TCA)



- AMC mezzanines are great but ATCA is a heavy standard and the H/W is expensive:
  - Let's define a standard that allows for using AMCs directly in a shelf:
  - i.e. Promote the AMC from "mezzanine" to "module".




#### Which Module Standard in LHC **Upgrade**?

- LHC and experiments at CERN:
  - Still many VMEbus and PCI based;
  - CMS: Several µTCA systems in operation;
  - ATLAS: ATCA proposed as VMEbus replacement, many R&D projects;
  - LHCb: first favored ATCA then decided to go for PCs;
  - ALICE: Still planning to use ATCA;
- Control systems of new accelerators:
  - **µTCA** everywhere;
  - XFEL@DESY, LCLS@SLAC, FAIR@GSI.

### Network Based DAQ

- In large (HEP) experiments we typically have thousands of devices to read, which are sometimes very far from each other:
  - Buses cannot do that;
- Network technology solves the scalability issues of buses:
  - In a network devices are equal ("peers");
  - In a network devices communicate directly with each other:
    - No arbitration necessary;
    - Bandwidth guaranteed;
  - Data and control use the same path:
    - Far fewer lines (e.g. in traditional Ethernet only two);
  - At the signaling level buses tend to use parallel copper lines. Network technologies can also be optical or wireless, and are typically (differential) serial.





- What defines large?
  - The number of channels: for LHC experiments O(10<sup>7</sup>) channels:
    - A (digitized) channel can be between 1 and 14 bits;
  - The **rate**: for LHC experiments everything happens at **40.08 MHz**, the LHC bunch crossing frequency:
    - This corresponds to about 24.95 ns, i.e. roughly 25 ns between events;
- HEP experiments usually consist of many different sub-detectors:
  - Tracking, calorimetry, particle-ID, muon-detectors.



- HERA-B:
  - SHARC link (proprietary, by Analog Devices) until level 2, then Fast Ethernet.
- DØ:
  - Fast Ethernet / Gigabit Ethernet.
- CDF:
  - ATM / SCRAMNet (proprietary, by Systran, low latency replicated non-coherent shared memory network).
- CMS:
  - Myrinet (proprietary, Myricom) / Gigabit Ethernet.
- ATLAS / LHCb / ALICE:
  - Gigabit Ethernet.
- LHCb Upgrade:
  - InfiniBand, 100-Gigabit Ethernet.







Figure: event-builder layout (FED routing 256x256, Readout Builders, EVM).



#### Why a New Transport Protocol?

- The optimal Ethernet payload/overhead ratio is achieved when the IP datagram completely fills the 1500 B Ethernet payload.
- Moreover, the Gigabit Ethernet throughput drops for small frame sizes.
- However, each Tell-1 board can send only data fragments pertaining to the associated sub-detector element, which usually are much smaller.
- In order to optimize the payload/overhead ratio, fragments from multiple (~20) events have to be aggregated (MEP, Multi Event Packet) into a single IP datagram.
- MEP is an LHCb custom OSI-level 4 (transport) protocol:
  - OSI-level 3 (network) is IP;
  - OSI-level 2 (datalink) is Ethernet.
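The gain from packing several event fragments into one datagram can be estimated from the fixed per-frame costs of Ethernet and IP. The sketch below is only indicative: it uses the standard Ethernet framing overhead (preamble and start delimiter, MAC header, frame check sequence, inter-frame gap) and a 20-byte IPv4 header, neglects the MEP header itself, and assumes a hypothetical fragment size of 70 bytes.

```python
ETH_OVERHEAD = 8 + 14 + 4 + 12   # preamble+SFD, MAC header, FCS, inter-frame gap (bytes)
IP_HEADER = 20                   # IPv4 header without options (bytes)
ETH_MAX_PAYLOAD = 1500           # standard Ethernet payload limit (bytes)
ETH_MIN_PAYLOAD = 46             # shorter payloads are padded (bytes)

def wire_efficiency(fragment_bytes: int, events_per_packet: int = 1) -> float:
    """Useful data divided by the bytes actually occupying the wire."""
    data = fragment_bytes * events_per_packet
    ip_datagram = IP_HEADER + data
    if ip_datagram > ETH_MAX_PAYLOAD:
        raise ValueError("datagram does not fit in a single Ethernet frame")
    payload = max(ip_datagram, ETH_MIN_PAYLOAD)
    return data / (payload + ETH_OVERHEAD)

single = wire_efficiency(70)                        # one fragment per datagram
packed = wire_efficiency(70, events_per_packet=20)  # ~20 events per MEP
print(f"1 event/packet  : {single:.1%}")
print(f"20 events/packet: {packed:.1%}")
```

Besides the better payload/overhead ratio, packing also divides the frame rate seen by switches and network cards by the packing factor.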



## The LHCb Upgrade



#### Luminosity and Event Multiplicity

- **Instantaneous luminosity** leveling at 4×10<sup>32</sup> cm<sup>-2</sup> s<sup>-1</sup>, ±3% around the target value.
- LHCb was designed to operate with a single collision per bunch crossing, running at an instantaneous luminosity of $2 \times 10^{32}$ cm<sup>-2</sup> s<sup>-1</sup> (assuming about 2700 circulating bunches):
  - At the time of design there were worries about possible ambiguities in assigning the B decay vertex to the proper primary vertex among many.
- Soon LHCb realized that running at higher multiplicities was possible. In 2012 we ran at $4 \times 10^{32}$ cm<sup>-2</sup> s<sup>-1</sup> with only 1262 colliding bunches:
  - 50 ns separation between bunches instead of the nominal 25 ns (which will be available by 2015);
  - 4 times more collisions per crossing than planned in the design;
  - The average number of visible collisions per bunch crossing in 2012 rose to µ > 2.5;
  - µ ~ 5 feasible but...





yield on the hadronic channels around 4×10<sup>32</sup> cm<sup>-2</sup> s<sup>-1</sup>.

- **Increasing the first level trigger rate** considerably increases the efficiency on the hadronic channels.

#### Luminosity and Event Multiplicity (II)



- Under present conditions, if we increase the luminosity:
  - Trigger yield of hadronic events saturates;
  - The $p_{\rm T}$ cut should be raised to remain within the 1 MHz L0 output rate;
  - There would be no real gain.



### The LHCb Upgrade

- Read out the whole detector at 40 MHz:
  - Trigger-less data acquisition system, running at 40 MHz (~30 MHz are non-empty crossings):
    - Use a (Software) Low Level Trigger as a throttle mechanism, while progressively increasing the power of the event filter farm to run the HLT up to 40 MHz.
- We foresee reaching 20×10<sup>32</sup> cm<sup>-2</sup> s<sup>-1</sup> and therefore preparing the sub-detectors for this purpose:
  - pp interaction rate 27 MHz;
  - At 20×10<sup>32</sup> cm<sup>-2</sup> s<sup>-1</sup> pile-up µ = 5.2;
  - Increase the yield in the decays with muons by a factor 5 and the yield of the hadronic channels by a factor 10.
- Collect 50 fb<sup>-1</sup> of data over ten years:
  - 8 fb<sup>-1</sup> is the integrated luminosity target, to be reached by 2018 with the present detector;
  - **3.2 fb**<sup>-1</sup> collected so far.



#### LHCb Upgrade: Consequences

- The **detector front-end electronics** has to be entirely **rebuilt**, because the current readout speed is limited to 1 MHz.
- Synchronous readout, no trigger.
- No more buffering in the front-end electronics boards.
- Zero suppression and data formatting before transmission to optimize the number of required links:
  - Average event size 100 kB.
- Three times as many optical links as currently, to get the bandwidth needed to transfer data from the front-end to the read-out boards at 40 MHz:
  - GBT links simplex (DAQ) 9000, GBT duplex (ECS/TFC) 2400.
- New HLT farm and network to be built by exploiting new LAN technologies and powerful many-core processors.
- **Rebuild** the current sub-detectors equipped with embedded front-end chips:
  - Silicon strip detectors: VELO, TT, IT;
  - RICH photo-detectors: front-end chip inside the HPD.
- Consolidate sub-detectors to let them withstand the foreseen luminosity of 20×10<sup>32</sup> cm<sup>-2</sup> s<sup>-1</sup>.
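The aggregate throughput implied by these figures follows from a one-line product of readout rate and event size. The sketch below assumes 1 kB = 1000 B and treats the 100 kB event size as an exact average; the function name is hypothetical.

```python
def aggregate_throughput_tbps(event_rate_hz: float, event_size_bytes: float) -> float:
    """Aggregate DAQ throughput in Tbit/s."""
    return event_rate_hz * event_size_bytes * 8 / 1e12

full_rate = aggregate_throughput_tbps(40e6, 100e3)   # every bunch crossing read out
non_empty = aggregate_throughput_tbps(30e6, 100e3)   # only the ~30 MHz of non-empty crossings
print(f"40 MHz x 100 kB = {full_rate:.0f} Tbit/s")
print(f"30 MHz x 100 kB = {non_empty:.0f} Tbit/s")
```

The 40 MHz case gives the 32 Tb/s aggregate throughput quoted in the outline of the talk.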

#### DAQ Present View

- Use PCIe Generation 3 as communication protocol to inject data from the FEE directly into the event-builder PC.
- A much cheaper event-builder network:
  - Data-centre interconnects can be used on the PC;
  - They are not realistically implementable on an FPGA (large software stack, lack of soft IP cores, ...).
- Moreover, the PC provides: huge memory for buffering, OS and libraries.
- Up-to-date NICs and drivers available as pluggable modules.









- Intermediate layer of electronics boards arranged in crates to decouple FEE and PC farm: for buffering and data format conversion.
- The optimal solution with this approach: ATCA, µTCA crates, ATCA carrier board hosting AMC standard mezzanine boards.
- AMC boards equipped with FPGAs to de-serialize the input streams and transmit event fragments to the farm, using a standard network protocol over 10 Gb Ethernet.



#### PCIe-Gen3 Based Readout

- A main FPGA manages the input streams and transmits data to the event-builder PC by using DMA over PCIe Gen3.

#### Nominal configuration:

- 1 bidir link for TFC;
- 24 GBT inputs $\rightarrow$ limited by PCIe output bandwidth:
  - PCIe Gen3 x16 = 110 Gbit/s;
  - 24 GBT wide bus = 107 Gbit/s.
- Up to 48 bidir links available on board for low luminosity sub-detectors $\rightarrow$ decrease the costs.
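The limit of 24 GBT inputs per board can be cross-checked from the two bandwidth figures quoted above. In this sketch the per-link rate is simply derived as 107/24 Gbit/s; the raw PCIe figure uses the Gen3 signalling rate of 8 GT/s per lane with 128b/130b encoding, and the 110 Gbit/s of the slide is taken as the usable value after protocol overhead (an interpretation, not stated explicitly).

```python
GBT_LINKS = 24
GBT_TOTAL_GBPS = 107.0       # 24 GBT links in wide-bus mode (from the slide)
PCIE_USABLE_GBPS = 110.0     # PCIe Gen3 x16 (from the slide)

per_link_gbps = GBT_TOTAL_GBPS / GBT_LINKS
max_links = int(PCIE_USABLE_GBPS // per_link_gbps)

# Raw PCIe Gen3 x16 rate: 8 GT/s per lane, 16 lanes, 128b/130b line code.
pcie_raw_gbps = 8.0 * 16 * 128 / 130

print(f"per GBT link       : {per_link_gbps:.2f} Gbit/s")
print(f"links fitting PCIe : {max_links}")
print(f"PCIe Gen3 x16 raw  : {pcie_raw_gbps:.1f} Gbit/s")
```

A 25th link would exceed the usable PCIe bandwidth, which is why sub-detectors with low occupancy can instead share a board among more, partially filled links.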

**The LHCb Upgrade** 

**InfiniBand Tests** 

**Event Builder Network** 





### InfiniBand vs Ethernet

- Guaranteed delivery. Credit-based flow control:
  - Ethernet: best effort delivery. Any device may drop packets;
- Hardware-based re-transmission:
  - Ethernet: relies on TCP/IP to correct any errors;
- Dropped packets prevented by congestion management:
  - Ethernet: subject to micro-bursts;
- Cut-through design with late packet invalidation:
  - Ethernet: store and forward. Cut-through usually limited to local cluster;
- RDMA baked into standard and proven by interoperability testing:
  - Ethernet: standardization around compatible RDMA NICs only now starting;
  - Ethernet: need same NICs at both ends;
- Trunking is built into the architecture:
  - Ethernet: trunking is an add-on, multiple standards and extensions.

#### InfiniBand vs Ethernet (II)

- All links are used:
  - Ethernet: Spanning Tree creates idle links;
- Must use QoS when sharing with different applications:
  - Ethernet: now adding congestion management for FCoE but standards still developing;
- Supports storage today;
- Green field design which applied lessons learnt from previous generation interconnects:
  - Ethernet: carries legacy from its origins as a CSMA/CD media;
- Legacy protocol support with IPoIB, SRP, vNICs and vHBAs;
- Provisioned port cost for 10 Gb Ethernet approx. 40% higher than cost of 40 Gb/s InfiniBand.

#### IB Performance Test

- Performance tests performed at CNAF.
- PCIe Gen 3, 16 lanes needed:
  - Any previous version of the PCI bus represents a bottleneck for the network traffic;
- Exploiting the best performance required some tuning:
  - Disable node interleaving and bind processes according to NUMA topology;
  - Disable power saving modes and CPU frequency selection:
    - PM and frequency switching are latency sources.

Figure: throughput versus time [s] with no tuning, with NUMA tuning, and with NUMA and PM tuning.





#### IB Performance Test (II)

- IB QDR (Quad Data Rate):
  - Point-to-point bandwidth with RDMA write semantic (similar results for send semantic);
  - QLogic QLE7340, single port, 32 Gbit/s (QDR);
  - Unidirectional throughput: 27.2 Gbit/s;
  - Encoding 8b/10b.



#### IB Performance Test (III)

Istituto Nazionale di Fisica Nucleare

- IB FDR (Fourteen Data Rate):
  - Point-to-point bandwidth with RDMA write semantics (similar results for send semantics);
  - Mellanox MCB194A-FCAT, dual port, 56 Gbit/s (FDR);
  - Unidirectional throughput: 54.3 Gbit/s (per port);
  - Encoding: 64b/66b.
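The effect of the line encoding on the usable data rate can be sketched as follows. The 56 Gbit/s FDR figure and the two encodings come from the slides; the 40 Gbit/s raw signalling rate for a 4x QDR link is an assumption added here (it is the rate that yields the quoted 32 Gbit/s after 8b/10b).

```python
def data_rate_gbps(signalling_gbps: float, payload_bits: int, coded_bits: int) -> float:
    """Usable data rate once the line-encoding overhead is removed."""
    return signalling_gbps * payload_bits / coded_bits


# QDR: assumed 40 Gbit/s raw signalling, 8b/10b encoding.
qdr_max = data_rate_gbps(40.0, 8, 10)
# FDR: 56 Gbit/s as quoted on the slide, 64b/66b encoding.
fdr_max = data_rate_gbps(56.0, 64, 66)

# Fraction of the post-encoding rate reached by the measured QDR throughput.
qdr_efficiency = 27.2 / qdr_max

print(f"QDR usable rate: {qdr_max:.1f} Gbit/s, measured/usable = {qdr_efficiency:.0%}")
print(f"FDR usable rate: {fdr_max:.1f} Gbit/s")
```

With these inputs the QDR link carries at most 32.0 Gbit/s of data and the FDR link about 54.3 Gbit/s, which is why 64b/66b encoding wastes far less of the raw rate than 8b/10b.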







# The LHCb Upgrade

### **Event Builder Tests**

#### **CPU NUMA Architectures, Event Builder Network**



#### Event Builder Performance



- LHCb-daqpipe software:
  - Allows testing of both PULL and PUSH protocols;
  - Implements several transport layers: IB verbs, TCP, UDP;
- EB software tested on **test beds** of increasing size:
  - At CNAF, with 2 Intel Xeon servers connected back-to-back;
  - At CERN, with a cluster of 8 Intel Xeon servers connected through an IB switch;
  - At CINECA, on 128 nodes of the 512-node Galileo cluster.
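The difference between the two protocols can be illustrated with a toy model. This is not LHCb-daqpipe code: the node count, the round-robin assignment of events to builder units and all function names are hypothetical, and no network is involved. In PUSH the readout units send fragments unprompted; in PULL the builder unit requests them.

```python
from collections import defaultdict

N_NODES = 4    # hypothetical number of readout units (RU) and builder units (BU)
N_EVENTS = 8   # hypothetical number of events


def make_fragments():
    """fragments[ru][event_id] is the payload produced by readout unit `ru`."""
    return [{ev: f"ru{ru}-ev{ev}" for ev in range(N_EVENTS)} for ru in range(N_NODES)]


def destination(event_id: int) -> int:
    """Assumed static round-robin assignment of events to builder units."""
    return event_id % N_NODES


def build_push(fragments):
    """PUSH: each RU sends every fragment to the responsible BU on its own."""
    inbox = defaultdict(dict)  # (bu, event_id) -> {ru: payload}
    for ru, frags in enumerate(fragments):
        for ev, payload in frags.items():
            inbox[(destination(ev), ev)][ru] = payload
    return {key: [parts[ru] for ru in range(N_NODES)] for key, parts in inbox.items()}


def build_pull(fragments):
    """PULL: the responsible BU requests the fragment of an event from each RU."""
    events = {}
    for ev in range(N_EVENTS):
        bu = destination(ev)
        events[(bu, ev)] = [fragments[ru][ev] for ru in range(N_NODES)]
    return events


pushed = build_push(make_fragments())
pulled = build_pull(make_fragments())
print("same events built:", pushed == pulled)
```

Both variants assemble the same complete events; what changes in a real system is who initiates each transfer, and therefore how traffic and buffering are controlled.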



- Extensive tests on the CINECA Galileo Tier-1 cluster:
  - Nodes: 516;
  - Processors: 2 × 8-core Intel Haswell at 2.40 GHz per node;
  - RAM: 128 GB/node, 8 GB/core;
  - Network: InfiniBand with 4x QDR switches.
- Limitations:
  - Cluster is in production:
    - Other processes pollute the network traffic;
  - No control over power management and frequency switching;
- The fragment composition is performed **correctly** up to a scale of 128 nodes:
  - Maximum allowed by the cluster batch system.



#### EB Test on 2 Nodes



- Bandwidth measured as seen by the builder units on two nodes equipped with Mellanox FDR (maximum bandwidth 54.3 Gbit/s once the encoding is taken into account);
- Duration of the tests: 15 minutes (average value reported);
- Measured bandwidth is 53.3 Gbit/s on average:
  - 98% of the maximum allowed;
- PM disabled.








#### LHCb Upgrade: Temporary Software LLT

- Throttle mechanism, while progressively increasing the power of the EFF to run the HLT up to 40 MHz.
- The LLT algorithms can be executed in the event builder PCs after the event building.
- Preliminary studies show that the LLT runs in less than 1 ms if the CALO clusters are built in the FEE.
- Assuming 400 servers, 20 LLT processes running per PC, and a factor 8 in CPU power from Moore's law, the available time budget turns out to be safely greater than 1 ms.
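One way to reproduce this estimate is sketched below. It treats the Moore's-law factor as a multiplier on the number of equivalent processes, and it assumes an input rate of 40 MHz (the rate quoted above) with 30 MHz as an alternative; both the interpretation and the function name are assumptions, not the author's formula.

```python
def time_budget_ms(servers: int, processes_per_server: int,
                   cpu_speedup: float, input_rate_hz: float) -> float:
    """Average time each event may occupy one process, in milliseconds."""
    equivalent_processes = servers * processes_per_server * cpu_speedup
    return equivalent_processes / input_rate_hz * 1e3


budget_40mhz = time_budget_ms(400, 20, 8, 40e6)  # assumed 40 MHz input rate
budget_30mhz = time_budget_ms(400, 20, 8, 30e6)  # assumed 30 MHz input rate

print(f"budget at 40 MHz: {budget_40mhz:.2f} ms")
print(f"budget at 30 MHz: {budget_30mhz:.2f} ms")
```

Under these assumptions the budget is about 1.6 ms at 40 MHz and about 2.1 ms at 30 MHz, in both cases above the 1 ms the LLT is reported to need.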



- Unidirectional: … of intermediate crates, ATCA and AMC boards and cables, 10 and 40 Gb Ethernet. Cost to operate at 40 MHz: 8.9 MCHF. The cost of the ATCA crates has not been included.
- Bidirectional: PCIe and InfiniBand, the proposed approach. Cost to operate at

#### LHCb Upgrade: HLT Farm



- Trigger-less system at 40 MHz:
  - A selective, efficient and adaptable software trigger;
- Average event size: 100 kB;
- Expected data flux: 32 Tb/s;
- Total HLT trigger process latency: ~15 ms:
  - Tracking time budget (VELO + Tracking + PV searches): 50%;
  - Tracking finds 99% of offline tracks with p<sub>T</sub> > 500 MeV/c;
- Number of running trigger processes required: 4×10<sup>5</sup>;
- Number of cores per CPU available in 2018: ~200;
  - Intel tick-tock plan: 7 nm technology available by 2018-19; the number of cores scales accordingly as $12 \times (32\,\text{nm}/7\,\text{nm})^2 \approx 250$ equivalent 2010 cores.
- Number of computing nodes required: ~1000.
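The figures on this slide can be cross-checked with a few lines of arithmetic. The rate, event size, process count and core count are taken from the slide; the assumption of two CPUs per computing node is added here to connect the ~200 cores per CPU with the ~1000 nodes, and is not stated in the source.

```python
EVENT_RATE_HZ = 40e6      # trigger-less readout rate
EVENT_SIZE_BYTES = 100e3  # average event size, 100 kB
PROCESSES = 4e5           # running trigger processes required
CORES_PER_CPU = 200       # rounded-down estimate for 2018
CPUS_PER_NODE = 2         # assumption: dual-socket servers

# Aggregate data flux in Tbit/s.
flux_tbps = EVENT_RATE_HZ * EVENT_SIZE_BYTES * 8 / 1e12

# Core count scaling with the inverse square of the feature size,
# starting from 12 cores at 32 nm.
cores_2010_equivalent = 12 * (32 / 7) ** 2

# One trigger process per core.
nodes = PROCESSES / (CORES_PER_CPU * CPUS_PER_NODE)

print(f"data flux: {flux_tbps:.0f} Tb/s")
print(f"equivalent 2010 cores per CPU: {cores_2010_equivalent:.0f}")
print(f"computing nodes: {nodes:.0f}")
```

The flux comes out at 32 Tb/s and the core scaling at about 251, consistent with the quoted values; the node count matches ~1000 only under the two-CPUs-per-node assumption.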



- INFN-Bologna: Umberto Marconi, Domenico Galli, Vincenzo Vagnoni, Stefano Perazzini et al.;
- Laboratorio di Elettronica INFN-Bologna: Ignazio Lax, Gabriele Balbi et al.;
- INFN-CNAF: Antonio Falabella, Francesco Giacomini, Matteo Manzali et al.;
- INFN-Padova: Marco Bellato, Gianmaria Collazuol et al.;
- CERN: Niko Neufeld, Daniel Hugo Cámpora Pérez, Guoming Liu, Adam Otto, Flavio Pisani, et al.;
- Other








#### **Prof. Domenico Galli** Dipartimento di Fisica and INFN

domenico.galli@unibo.it

domenico.galli@bo.infn.it

http://www.unibo.it/docenti/domenico.galli

http://lhcbweb2.bo.infn.it/bin/view/GalliDidattica





[Figure: block diagram with DCS devices (HV, LV, gas, cooling, etc.), detector channels, front-end electronics, HLT farm, storage, DAQ, and external systems (LHC, technical services, safety, etc.).]





- Besides Trigger and DAQ, the Online System includes the ECS (Experiment Control System), in charge of the **Control** and **Monitoring** of:
  - Detector operations (formerly Slow Controls):
    - Gas, HV, LV, temperatures, ...;
  - Data acquisition and trigger:
    - FE electronics, event building, EFF, etc.;
  - Experimental infrastructure:
    - Cooling, ventilation, electricity distribution, ...;
  - Interaction with the outside world:
    - Magnet, accelerator system, safety system, etc.



#### Trigger Rate / Event Size Comparison


|       | Event Size | L1 Input Rate | L1 Output Rate | L2 Output Rate | L3 Output Rate |
|-------|------------|---------------|----------------|----------------|----------------|
| KTeV  | 8 KiB      |               | 100 kHz        | 20 kHz         | 2 kHz          |
|       |            |               | 800 MiB/s      | 160 MiB/s      | 7 MiB/s        |
| CDF   | 270 KiB    |               | 50 kHz         | 300 Hz         | 80 Hz          |
|       |            |               | 13 GiB/s       | 80 MiB/s       | 23 MiB/s       |
| DØ    | 250 KiB    |               | 10 kHz         | 1 kHz          | 70 Hz          |
|       |            |               | 2.5 GiB/s      | 250 MiB/s      | 13 MiB/s       |
| BaBar | 33 KiB     |               | 2 kHz          | None           | 100 Hz         |
|       | (1200 L1)  |               | 2.4 GiB/s      | (65 MiB/s)     | 4 MiB/s        |
| BTeV  | 50-80 KiB  | 800 GiB/s     | 80 kHz         |                | 4 kHz          |
|       |            |               | 8 GiB/s        |                | 200 MiB/s      |
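The bandwidth rows in the table follow from multiplying the event size by the trigger rate of the corresponding level. The sketch below checks this for the L1 output of the first three experiments, using the full event size at every level; the function name is hypothetical and the table values appear to be rounded.

```python
KIB, MIB, GIB = 2**10, 2**20, 2**30


def bandwidth_bytes_per_s(event_size_kib: float, rate_hz: float) -> float:
    """Data rate out of a trigger level: event size times accept rate."""
    return event_size_kib * KIB * rate_hz


ktev_l1_mib = bandwidth_bytes_per_s(8, 100e3) / MIB   # table: 800 MiB/s
cdf_l1_gib = bandwidth_bytes_per_s(270, 50e3) / GIB   # table: 13 GiB/s
d0_l1_gib = bandwidth_bytes_per_s(250, 10e3) / GIB    # table: 2.5 GiB/s

print(f"KTeV L1 output: {ktev_l1_mib:.0f} MiB/s")
print(f"CDF  L1 output: {cdf_l1_gib:.1f} GiB/s")
print(f"DØ   L1 output: {d0_l1_gib:.1f} GiB/s")
```

The computed values (about 781 MiB/s, 12.9 GiB/s and 2.4 GiB/s) agree with the table to within its rounding.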








|       | Event Size | L1 Input Rate | L1 Output Rate | L2 Output Rate | L3 Output Rate |
|-------|------------|---------------|----------------|----------------|----------------|
| ATLAS | 1-2 MiB    |               | 75 kHz         | 3 kHz          | 200 Hz         |
|       |            |               | 100 GiB/s      | 5 GiB/s        | 300 MiB/s      |
| CMS   | 1 MiB      |               | 100 kHz        |                | 100 Hz         |
|       |            |               | 100 GiB/s      |                | 100 MiB/s      |
| LHCb  | 35 KiB     | 1 MHz         | 1.1 MHz        |                | 2 kHz          |
|       |            |               | 60 MiB/s       |                | 68 MiB/s       |



