# RETINA REal time Tracking INnovative Approach

A specialized processor for track reconstruction at the LHC crossing rate

Budget requests (Preventivi) CSNV 2016. Michael J. Morello, Pisa, 9 July 2015



#### What is RETINA?

- RETINA was born last year from a collaboration between the Pisa and Milano groups.
- Motivation: real-time tracking at 40 MHz (and at high intensity).
- How: using an innovative "fully parallel" tracking algorithm, allowing tracking with low latency and high throughput.
- Preliminary software simulation results are very promising:
  the next step is to design and build a "realistic hardware prototype".

## A "cellular" tracking algorithm



- Basic algorithm proposed by Luciano Ristori back in 2000:
  - "An artificial retina for real-time track finding" [NIM A453 (2000) 425-429]
- Inspired by the mechanism of visual receptive fields in mammals [D.H. Hubel, T.N. Wiesel, J. Physiol. 148 (1959) 574] (hence the name Artificial Retina).
- Sits in between the "Hough transform" [P.V.C. Hough, Conf. Proc. C590914 (1959) 554] and Associative Memories [S. Belforte et al., IEEE Trans. on Nucl. Sci., 42 (1995) p. 860].

### Track Processing Unit

- Suitable for FPGA implementation because of their large I/O capabilities, O(Tb/s) with optical links, and large internal bandwidth.
- The switching network minimizes the number of hits sent to the engines, using the hits' approximate location and a LUT.
  Truncate, lookup, dispatch. Hits are delivered only where needed via 2-way sorters.
- Note: total bandwidth increases after the switching network, but then shrinks back (data are not compressed; parameters of real tracks are available).
- Parameter space is divided into blocks, each implemented on one FPGA.
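The "truncate, lookup, dispatch" idea can be illustrated with a small software sketch. This is only a toy model under stated assumptions: hits are plain integer coordinates, the LUT maps a coarse bin to the engines whose receptive field might contain it, and the names `build_lut`, `dispatch`, `drop_bits`, and `reach` are hypothetical. It does not model the tree of 2-way sorters used in hardware.

```python
from collections import defaultdict


def build_lut(n_coarse_bins, n_engines, reach=1):
    """Hypothetical LUT: coarse bin -> engines that may need hits from that bin."""
    lut = {}
    for coarse_bin in range(n_coarse_bins):
        centre = coarse_bin * n_engines // n_coarse_bins
        lo = max(0, centre - reach)
        hi = min(n_engines - 1, centre + reach)
        lut[coarse_bin] = tuple(range(lo, hi + 1))
    return lut


def dispatch(hits, lut, drop_bits):
    """Deliver each hit only to the engines listed in the LUT for its coarse bin."""
    queues = defaultdict(list)
    for hit in hits:
        coarse_bin = hit >> drop_bits      # truncate: keep only the most significant bits
        for engine in lut[coarse_bin]:     # lookup: which engines need this hit
            queues[engine].append(hit)     # dispatch: copy the hit to each of them
    return dict(queues)


# Toy example: 10-bit hit coordinates, 8 coarse bins, 8 engines.
lut = build_lut(n_coarse_bins=8, n_engines=8, reach=1)
queues = dispatch([191, 449, 725, 982], lut, drop_bits=7)
n_delivered = sum(len(q) for q in queues.values())
```

In this toy case 4 input hits become 11 delivered copies, which mirrors the bandwidth increase after the switching network, while each engine still sees only the hits in its own neighbourhood.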



The whole processing happens in a time short enough that it effectively appears to the rest of the DAQ as if tracks are coming out of the detector at the same time as the hits and all other raw data.

#### A little bit of history

- First studies with software and hardware simulation showed the feasibility of a TPU (*LHCb-PUB-2014-026*)
  - Processed simulated events from the official LHCb MC at a luminosity of $3\times10^{33}$ cm<sup>-2</sup> s<sup>-1</sup>, pileup of 7.6-11.4, 40 MHz bunch crossing.
  - Efficiency, ghost rate, and resolution comparable to offline reconstruction can be achieved with ~50k cells.
- Natural step forward $\rightarrow$ need a "realistic hardware prototype".
  - Started a 3-year project funded by CSN5 called "RETINA"
  - An aggressive R&D programme on real-time fast track finding for any experiment needing intensive tracking (1-100 GTracks/s), such as the HL-LHC experiments.

#### RETINA milestones from "Preventivi 2015"

- 2015: Preparation of the test setup and of the prototype with Tel62 boards in Pisa (PI), and construction of a demonstrator based on silicon detectors and TEL62, plus a test system, in Milano (MI).
- 2016: Test of the prototype in PI with simulated data at 1 MHz. Test of the demonstrator with simulated data and with cosmic rays in MI.
- 2017: Construction of a small full-speed (40 MHz) prototype in PI. If results are positive, assembly and test of a larger prototype for a test-beam run, or for parasitic operation during data taking of an LHC experiment.

Pisa milestones are shown in red.

# Current Status of the prototype with Tel62 boards

- Setup assembled in Pisa to run at full speed (8 Tel62 + 1 crate)
  - valuable help from the Tel62 expertise at INFN-Pisa
  - FPGAs are programmed and debugged using a USB-JTAG link.
  - A Linux server is needed to boot the mini-PC on the board, which is used to configure the FPGAs at run time and control the I/O.
- Firmware loaded on Altera Stratix III
  - ~200 engines/chip (engine = retina receptor)
  - Engines accept 1 hit per clock cycle.
  - Internal frequency: 160 MHz.
- Cellular Engines and Fitter blocks fully finalized and operating on a real board.
  - Work is ongoing on the hardware implementation of the Switching Network block.



### Prototype with Tel62 boards





- About 3k cells, fitting into 32 Stratix III chips (one IT box = 1/4 of Inner Tracker)
- Step 1 (Hits delivery through the Switching Network)
  - Hits go to the right engines (high-level simulation).
- Step 2 (Accumulating weights) and Step 3 (Find the local maxima and compute centroid):
  - firmware designed, simulated, and fully working on a real board at 160 MHz, including Ethernet output to PC.
- Interface cards to interconnect different boards designed and assembled in Pisa.
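Steps 2 and 3 can be illustrated with a toy software model. The sketch below is not the firmware: it assumes straight-line tracks `x = q + m*z`, a Gaussian weight on the hit-to-receptor distance, and a centroid over the 3x3 block of cells around each maximum. All function names, the grid, and the numerical values are hypothetical choices made for illustration.

```python
import math


def accumulate(hits, m_vals, q_vals, sigma):
    """Step 2: every cell (m, q) sums a Gaussian weight for each hit (z, x)."""
    grid = [[0.0 for _ in q_vals] for _ in m_vals]
    for i, m in enumerate(m_vals):
        for j, q in enumerate(q_vals):
            for z, x in hits:
                d = x - (q + m * z)  # distance from the cell's expected crossing point
                grid[i][j] += math.exp(-d * d / (2.0 * sigma * sigma))
    return grid


def local_maxima(grid, threshold):
    """Step 3a: interior cells above threshold and not below any of their 8 neighbours."""
    found = []
    for i in range(1, len(grid) - 1):
        for j in range(1, len(grid[0]) - 1):
            c = grid[i][j]
            neighbours = [grid[i + a][j + b]
                          for a in (-1, 0, 1) for b in (-1, 0, 1) if (a, b) != (0, 0)]
            if c > threshold and all(c >= n for n in neighbours):
                found.append((i, j))
    return found


def centroid(grid, m_vals, q_vals, i, j):
    """Step 3b: weight-averaged (m, q) over the 3x3 block centred on a maximum."""
    total = sum_m = sum_q = 0.0
    for a in (-1, 0, 1):
        for b in (-1, 0, 1):
            w = grid[i + a][j + b]
            total += w
            sum_m += w * m_vals[i + a]
            sum_q += w * q_vals[j + b]
    return sum_m / total, sum_q / total


# Toy event: one track with m = 0.12, q = 0.43 crossing four layers.
layers_z = [-1.5, -0.5, 0.5, 1.5]
hits = [(z, 0.43 + 0.12 * z) for z in layers_z]
m_vals = [-0.5 + 0.1 * i for i in range(11)]  # cell pitch 0.1 in m
q_vals = [0.1 * j for j in range(11)]         # cell pitch 0.1 in q
grid = accumulate(hits, m_vals, q_vals, sigma=0.1)
maxima = local_maxima(grid, threshold=3.5)
m_fit, q_fit = centroid(grid, m_vals, q_vals, *maxima[0])
```

The centroid interpolates between cell centres, so the estimated parameters are not limited to the grid positions.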

#### The prototype is really running

Altera allows the addition of a logic analyzer inside the firmware for debugging: internal logic signals are exactly as predicted by ModelSim.

[Figure: capture from the embedded logic analyzer (log 2015/01/12 14:48:31, 2 events), showing internal signals including `ib0_rdata`, `tpu_hitdata`, `data_valid`, `d_x`, `d_x_approx`, `acc_enable`, `acc_total`, `acc_total_buffer`, `comp_out_eng`, `out_weight`, and `maxdata_out`.]

# A realistic case: hits from an existing detector

- Use an existing (generic) silicon sub-detector (LHCb Inner Tracker):
  - strip detector with high occupancy.
  - single-hit resolution is $\sim 50\,\mu$m.
  - 3 stations, 12 boxes, xuvx layers in each box.
  - Data output on optical fiber.
  - A momentum resolution of a few % is achievable, assuming tracks come from the nominal interaction point.
  - Event readout rate is 1 MHz.
- Using ~12k cells, preliminary studies show good relative efficiency with respect to offline reconstructed tracks (Piucci's Master Thesis).
- Available test events: single track, simulated data, and real data.





#### First results

- Negligible differences in accumulator values between the C++ high-level simulation and the logic simulation (ModelSim), due to integer arithmetic inside the FPGA.
- Processing time depends only on the number of hits in the event
  - Can process ~100 MTracks/s.
  - Total latency ~100 clock cycles.
    - Corresponding to ~0.6 µs at 160 MHz.
  - Max rate for reconstructing tracks is 1.8 MHz
    - Simulating the switching network and using real Minimum Bias events as input, 90% of the events have fewer than 88 hits delivered to a single FPGA (160 MHz / 88).
  - To be compared with the readout speed of 1 MHz. The TPU is able to reconstruct tracks while the hits are being read out, without extra delay.
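The latency and rate figures above follow from simple clock arithmetic, reproduced here as a quick check. The sketch assumes that an engine accepts one hit per clock cycle, so an FPGA receiving N hits needs about N cycles per event; the variable names are illustrative only.

```python
CLOCK_HZ = 160e6      # internal clock frequency
LATENCY_CYCLES = 100  # total latency in clock cycles
HITS_PER_FPGA = 88    # 90% of events deliver fewer hits than this to a single FPGA
READOUT_HZ = 1e6      # detector readout rate

# Latency in microseconds: cycles divided by clock frequency.
latency_us = LATENCY_CYCLES / CLOCK_HZ * 1e6

# Maximum event rate in MHz, assuming one hit is consumed per clock cycle.
max_event_rate_mhz = CLOCK_HZ / HITS_PER_FPGA / 1e6

# Headroom with respect to the 1 MHz readout rate.
headroom = max_event_rate_mhz * 1e6 / READOUT_HZ

print(f"latency  ~ {latency_us:.3f} us")
print(f"max rate ~ {max_event_rate_mhz:.2f} MHz ({headroom:.1f}x the readout rate)")
```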

#### D. Ninci's Master Thesis



(a) ModelSim simulation with 1 track



#### Theses and conferences

- Master's theses (Tesi di Laurea Magistrale):
  - A. Piucci, Reconstruction of tracks in real time in the high luminosity environment at LHC.
    - http://www.infn.it/thesis/thesis\_dettaglio.php?tid=8987
  - D. Ninci, Ricostruzione di traccia in tempo reale su FPGA ad LHC.
    - https://etd.adm.unipi.it/theses/available/etd-11302014-212637/
- Recent Pisa talks at conferences/workshops:
  - G. Punzi, talk at the INFN School in Alghero, May 2015, Alghero, Italy.
  - R. Cenci, poster at the "13th Pisa Meeting", May 2015, La Biodola, Italy.
  - S. Stracka, talk at the "Connecting The Dots Workshop", Feb 2015, Berkeley, USA.
  - R. Cenci, talk at the "Connecting The Dots Workshop", Feb 2015, Berkeley, USA.
  - S. Stracka, poster at NSS-MIC14, Nov 2014, Seattle (USA).
  - A. Piucci, talk at SIF 2014.
  - P. Marino, poster at ICHEP14, Valencia (Spain) and talk at WIT 2014, Philadelphia (USA).
  - and others...
- And many others from INFN-Milano colleagues: A. Abba, F. Caponio RT14, N. Neri TIPP14, A. Abba TWEPP14, N. Neri, F. Caponio NSS-MIC14, N. Neri TREDI15, N. Neri ANIMMA15, M. Petruzzo PISA MEETING15.

#### Plans for 2016

- Ahead of our original schedule: finalization of the prototype on Tel62 boards (basic test of the logic at a low event rate of 1 MHz):
  - Engines and Fitter blocks already finalized and working.
  - Hardware implementation of the Switching Network on FPGA is ongoing.
  - Full test of the prototype with simulated and real data at 1 MHz with a system of 4 (Switch) + 4 (Engines) Tel62 boards in a single crate, corresponding to ¼ of the Inner Tracker.
- In 2016, move to the full-speed prototype. The jump to 40 MHz requires:
  - More I/O bandwidth
  - More LE/chip
  - More internal speed and internal bandwidth
  - Optical links.

Several options (from HEP or COTS) are under evaluation. Here we propose a 40 MHz prototype based on AMC ( $\mu$TCA) boards with Altera Stratix V chips, in order to give a reasonable cost estimate.

#### Prototype at 40MHz

- The aim of the 40 MHz prototype is to demonstrate that it is possible to sustain a very high throughput (reconstruct tracks every 25 ns, with a latency of ~100 clock cycles).
- From previous studies (for instance on Altera Stratix V):
  - A Stratix V can contain ~500 engines (~1.4k LE/engine)
  - The bandwidth needed to feed all engines is 4 Tbit/s
    - Assuming 16 bits per detector hit and a 0.5 GHz clock.
  - Assuming a fan-out of at least 4 inside the chip, acting as the last layers of the Switching Network, we get an input bandwidth of 1 Tbit/s. All engines can be fed (processing multiple events at once is essential for 40 MHz).
- A system of 50k cells would require 100 Stratix V chips (plus <100 Stratix V chips for the Switching Network).
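The numbers in this estimate can be reproduced with a few lines of arithmetic. The sketch below only restates the assumptions quoted above (engines per chip, bits per hit, clock frequency, fan-out); the variable names are illustrative.

```python
import math

ENGINES_PER_CHIP = 500  # engines that fit in one Stratix V
LE_PER_ENGINE = 1400    # logic elements per engine (~1.4k)
BITS_PER_HIT = 16       # assumed size of one detector hit
CLOCK_HZ = 0.5e9        # assumed internal clock
FAN_OUT = 4             # assumed minimum fan-out inside the chip
TOTAL_CELLS = 50_000    # cells in the full system

# Bandwidth needed if every engine receives one hit per clock cycle.
engine_bandwidth_bps = ENGINES_PER_CHIP * BITS_PER_HIT * CLOCK_HZ

# Each input hit is copied FAN_OUT times inside the chip,
# so the external input bandwidth is smaller by that factor.
input_bandwidth_bps = engine_bandwidth_bps / FAN_OUT

# Logic elements used by the engines, and chips needed for all cells.
le_for_engines = ENGINES_PER_CHIP * LE_PER_ENGINE
engine_chips = math.ceil(TOTAL_CELLS / ENGINES_PER_CHIP)

print(f"engine bandwidth: {engine_bandwidth_bps / 1e12:.1f} Tbit/s")
print(f"input bandwidth:  {input_bandwidth_bps / 1e12:.1f} Tbit/s")
print(f"engine chips:     {engine_chips}")
```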

#### A possible setup at 40MHz



In order to fully test functionality at high speed a "minimal" prototype requires at least two elements of the Switching Network block plus one element of the Cellular Engine/Fitter block, corresponding to just one branch of the scheme above (1% of the entire envisioned system).

## Personnel (Anagrafica)

| Name              | 2015 [%] | 2016 [%] | Position                             |
|-------------------|----------|----------|--------------------------------------|
| F. Bedeschi       | 20       | 10       | Research Director                    |
| R. Cenci          | 45       | 45       | Post-doc                             |
| M.J. Morello (RL) | 30       | 40       | Researcher                           |
| G. Punzi (RN)     | 20       | 30       | Associate Professor                  |
| L. Ristori        | 80       | -        | Retired                              |
| F. Spinella       | 20       | 10       | Technologist                         |
| S. Stracka        | 30       | 30       | Post-doc                             |
| J. Walsh          | 20       | 0        | Senior Researcher (Primo Ricercatore) |
| Total (FTE)       | 2.65     | 1.65     |                                      |

#### Requests for 2016 (Richieste 2016)

- Travel (Missioni): 2 k€
- Consumables (Consumi): 3 k€
- Inventory items (Inventariabile): 0 €
- Equipment (Apparati):
  - 1 $\mu$TCA crate (10 k€)
  - 3 AMC boards equipped with Stratix V (3 x 10 k€)
- Total: 45 k€

As for the resources of the Sezione (the local INFN section), we ask to continue using the laboratory space we are already using.