

# Fast track candidates 3D clustering





Algorithm, Hardware implementation And results

> Geoffrey GALBIT IPN Lyon

> > 23<sup>rd</sup> September 2015

# Introduction



#### $\rightarrow$ <u>Context :</u>

- $\rightarrow$  This track filtering system takes place between the Associative Memories and the fit
- → The goal of this module is to provide clean Track Candidates (5 or 6 stubs) with a low fake rate



 $\rightarrow$  The TC Builder proceeds to a **3D clustering** of the stubs **pattern per pattern** 

# Algorithm overview



#### $\rightarrow$ Step by step algorithm :

 $\rightarrow$  Each pattern is processed individually, treatment is operated simultaneously on (R,  $\phi$ ) and (R, Z) plans



### Algorithm overview



#### $\rightarrow$ Step by step algorithm :

 $\rightarrow$  Each pattern is processed individually, treatment is operated simultaneously on (R,  $\phi$ ) and (R, Z) plans



 $\rightarrow$  For each 3D seed, all the other layers stubs are tested.

 $\rightarrow$  The test consist in calculating the distance between the 3D seed projection and the stub and comparing the result with a threshold (green triangle).

 $\rightarrow$  If a stub pass the test, it is added to the current TC.

 $\rightarrow$  When all stubs are tested, if the current TC contains 5 stubs or more, it is (stored in an intermediate memory).

 $\rightarrow$  In the example the TC is not recorded for this seed (there are only 3 stubs selected).

## **Algorithm overview**



#### $\rightarrow$ Step by step algorithm :

 $\rightarrow$  Each pattern is processed individually, treatment is operated simultaneously on (R,  $\phi$ ) and (R, Z) plans



 $\rightarrow$  When a stub pass the test and is recorded in the current TC, the distance is memorized.

 $\rightarrow$  If an other stub pass the test for the same layer, the results of the stubs are compared in order to keep the best alignment with the seed (only on R,  $\varphi$  plan).

 $\rightarrow$  In our example the current TC is recorded for this seed (5 stubs selected).

 $\rightarrow$  When iterations over the seeds are finished, the intermediate memory content is read if there is a valid TC (never more than 1 TC per pattern).

 $\rightarrow$  nlter = (N5 \* N6) \* (Nstubs – N5 – N6) + (N6 \* N7) \* (Nstubs – N6 – N7) + (N5 \* N7) \* (Nstubs – N5 – N7) - 2 \* (N5 \* N6 \* N7)

$$\rightarrow$$
 For this example : nIter = 22

Fountain pattern (AM)

Current seed stub



 $\rightarrow$  Block diagram of one filtering module :



Geoffrey GALBIT



#### $\rightarrow$ Implementation of one filtering module :

- $\rightarrow$  Module is fed with patterns content (1 stub per clock cycle).
- $\rightarrow$  In the first pipeline, polar translation is processed by CORDIC while an other module compute the address where the stubs will be stored in the RAM and begins to send informations to the read logic.
- $\rightarrow$  When there is at least one complete pattern in the memory, the read logic begins to generate all the read addresses corresponding to an iteration (1 seed + 1 stub).
- $\rightarrow$  Last part of the module is a pipeline which tests alignment and memorizes stubs to construct TC.



### **Hardware Implementation**



- $\rightarrow$  Focus on the 3D alignment evaluation module :
  - $\rightarrow$  Alignment test on R,  $\phi$  :

| $ (\phi_1 - \phi_2) * (R_3 - R_1) + (\phi_1 - \phi_3) * (R_1 - \phi_3) $ | $  \mathbf{R}_2   \leq  WindowWidth^* (\mathbf{R}_1 - \mathbf{R}_2) $ |
|--------------------------------------------------------------------------|-----------------------------------------------------------------------|
| Stub (R, φ) Score                                                        | Stub (R, φ) Threshold                                                 |

Seed1 $(R_1, \varphi_1, Z_1)$ Seed2 $(R_2, \varphi_2, Z_2)$ TestedStub  $(R_3, \varphi_3, Z_3)$ 

- $\rightarrow$  Same processing is operated in parallel on the R, Z plan (with different thresholds)
- $\rightarrow$  This processing step is fully pipelined
- → Multiplications are operated by DSP Slices (up to 18 bits per operand for 1 Xilinx DSP Slice)



# Full detector adaptation



 $\rightarrow$  <u>Algorithm adaptation for the other tower types (hybrid and end-cap)</u>:

 $\rightarrow$  Straight forward to adapt to other towers (exactly the same algorithm)

## BE\_5D/Eta6\_Phi8 configuration



| Eta range      | 1                                           | 2                   | 3       | 4       | 5                         | 6                                            |  |
|----------------|---------------------------------------------|---------------------|---------|---------|---------------------------|----------------------------------------------|--|
| Sector numbers | 0→7                                         | <mark>8</mark> →15  | 16→23   | 24→31   | 32→39                     | 40→47                                        |  |
| Active layers  | <mark>56</mark> 18 19 20 21 <mark>22</mark> | 5678 <u>9101819</u> | 5678910 | 5678910 | 5678 <mark>9101112</mark> | <mark>56</mark> 11 12 13 14 <mark>1</mark> 5 |  |

 $<sup>\</sup>rightarrow$  PS and 2S modules of the disk have to be split (they receives 2 distinct sets of acceptance window width during the calibration)

## **Hardware Implementation**



#### $\rightarrow$ Full TC Builder :

- $\rightarrow$  The patterns of an event have to be split into different groups
- $\rightarrow$  A Clustering Module is light enough to be instanciate many times
- $\rightarrow$  The required N\_Modules can be defined thanks to a latency study



## **Ressources utilization**



 $\rightarrow$  **Ressource utilization :** 

 $\rightarrow$  Vivado ressource utilization estimation for one filtering module on a Kintex 7 FPGA :

| Resource   | Estimation | Available | Utilization % |
|------------|------------|-----------|---------------|
| FF         | 6100       | 407600    | 1.50          |
| LUT        | 3731       | 203800    | 1.83          |
| Memory LUT | 3          | 64000     | 0.01          |
| I/O        | 311        | 500       | 62.20         |
| BRAM       | 1          | 445       | 0.22          |
| DSP48      | 6          | 840       | 0.71          |
| BUFG       | 1          | 32        | 3.12          |

 $\rightarrow$  Note : It is not relevant to consider the I/O line (inputs and outputs of the FPGA) while the goal of this module is to be inserted between the Data Organizer and a fitter present on the FPGA.

# Latency Study



#### $\rightarrow$ Latency of the filtering module:

 $\rightarrow$  During one event, a module can manage multiple patterns



Total latency = Constant latency + size of the 1st pattern + sum of nIterations for all the patterns of the module





 $\rightarrow$  Number of clock cycles required (per filtering module per event) :

- $\rightarrow$  Round robin algorithm (each module take alternatively a pattern until there is no more)
- $\rightarrow$  This latency can be reduced thanks to a better algorithm (better load repartition between modules)



Distribution of total latency per module per event, N MODULES = 20





#### $\rightarrow$ **Current status:**

|                        | Muons                                                                             |                 | Electrons      |                | PU140           |                 | PU200           |                 |
|------------------------|-----------------------------------------------------------------------------------|-----------------|----------------|----------------|-----------------|-----------------|-----------------|-----------------|
| A.                     | $p_T \ge 3$                                                                       | $p_T \ge 10$    | $p_T \ge 3$    | $p_T \ge 10$   | $p_T \ge 3$     | $p_T \ge 10$    | $p_T \ge 3$     | $p_T \ge 10$    |
| $\epsilon^{SR}$        | $100.0_{-0.05}$                                                                   | $100.0_{-0.05}$ | $97.0 \pm 0.1$ | $98.4 \pm 0.1$ | $98.0 \pm 0.1$  | $98.4 \pm 0.2$  | $98.1 \pm 0.1$  | $98.5 \pm 0.2$  |
| $\epsilon^{TW}$        | $99.9\pm0.05$                                                                     | $99.9 \pm 0.05$ | $99.8\pm0.05$  | $99.8\pm0.05$  | $100.0_{-0.05}$ | $100.0_{-0.05}$ | $100.0_{-0.01}$ | $100.0_{-0.05}$ |
| $\epsilon^{AM}$        | $99.0\pm0.05$                                                                     | $99.1 \pm 0.05$ | $94.9 \pm 0.1$ | $95.8 \pm 0.1$ | $97.8 \pm 0.1$  | $99.5 \pm 0.1$  | $97.8 \pm 0.1$  | $99.3 \pm 0.2$  |
| $\epsilon^{CB}$        | $99.9\pm0.05$                                                                     | $99.9\pm0.05$   | $98.4 \pm 0.1$ | $98.5\pm0.1$   | $99.3\pm0.05$   | $99.4\pm0.1$    | $99.1\pm0.1$    | $99.3\pm0.2$    |
| $\epsilon^{SR \to CB}$ | $98.8\pm0.05$                                                                     | $98.9\pm0.05$   | $91.4\pm0.2$   | $92.8 \pm 0.2$ | $95.2\pm0.1$    | $97.3\pm0.2$    | $95.0\pm0.1$    | $97.2 \pm 0.3$  |
|                        | Table 6: Efficiencies up to the combination builder, for $p_T \ge 3 \text{GeV/c}$ |                 |                |                |                 |                 |                 |                 |

 $\rightarrow$  > 98% in all the case for the TC Builder stage (not affecting the overall efficiency)

- $\rightarrow$  Same effect in all the towers
- $\rightarrow$  fixed point 18 bits give the same results as the floating point software simulation

# Conclusion



#### $\rightarrow$ **Conclusion :**

- $\rightarrow$  3D thresholding allows a good fake rejection.
- $\rightarrow$  TC are perfectly formated for the fit step (5 or 6 stubs per TC).
- $\rightarrow$  An unique solution for barrel, hybrid and endcap towers.

#### $\rightarrow$ **Future work :**

 $\rightarrow$  Add some optional features to the C++ (and future CMSSW) code, like the ability to emulate the

HW binning and to estimate the latency needed by the TC Builder to process an event.

- $\rightarrow$  Increase the filtering frequency (goal = 400 MHz).
- $\rightarrow$  Continue the global optimization of the filtering modules.
- $\rightarrow$  Implement a coarse track parameters estimator.
- $\rightarrow$  Insert the TC builder implementation between the Data Organizer and a Fitter