# HPC for theoretical physics

R. (Lele) Tripiccione, tripiccione@fe.infn.it

Workshop CCR La Biodola, May 19th, 2016

### On the menu today

Computational theoretical physics and HPC.

HPC vs. HTC

HPC workhorses today
Trade secrets
The INFN way ...

WhatNext & WhatNextNext (Marconi / IBM Coral)

Programming: Present & Future

Conclusions & take-away lessons

# The starting point:

Computational theoretical physics, when **heavy**, is intrinsically **parallel**. Indeed ....

The world is "run" by partial differential equations -->
Implying short-distance communication in space and time

So you can divide the physical system into chunks, and compute on each of them independently

\*\*\* **IF** \*\*\*

You can exchange the needed information among those chunks

An example from Lattice QCD →



#### An example from Lattice QCD

When computing the Dirac operator (the key computing kernel of LatticeQCD)

For each 4-D slice of your physical system of size  $L^4$ 

$$C \sim 584 L^4 \text{ ops}$$
$$I \sim 8 \times (3 \times 8) L^3 \text{ bytes}$$
$$\frac{C}{I} = \frac{P}{B} \equiv \rho \approx \frac{584 L^4}{8 \times 3 \times 8 L^3} = 3 L$$



#### An example from Lattice QCD

If you put a tile of size  $L^4$  on each processing element, you need

$$\frac{C}{I} = \frac{P}{B} \equiv \rho \approx \frac{584 L^4}{8 \times 3 \times 8 L^3} = 3 L$$

If your computer fulfills this requirement, you can run the problem in parallel

Just an example: if  $P/B \sim 10$  ops/byte, you can study a (not too large) physical system of  $32^4$  points on **10000 processing elements**
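The estimate above can be checked with a few lines of arithmetic. This is a minimal sketch that uses only the two constants quoted on the slide (584 operations per site, 8 × 3 × 8 bytes per surface site); the function names are hypothetical, and fractional tile sides are allowed purely to get an order of magnitude.

```python
# Constants taken from the slide; everything else is an illustrative assumption.
OPS_PER_SITE = 584                   # operations per lattice site (Dirac operator)
BYTES_PER_SURFACE_SITE = 8 * 3 * 8   # bytes exchanged per site of the tile surface


def ops_per_byte(tile_side):
    """Compute-to-communication ratio C/I for a tile of side L (about 3 L)."""
    compute = OPS_PER_SITE * tile_side ** 4
    traffic = BYTES_PER_SURFACE_SITE * tile_side ** 3
    return compute / traffic


def smallest_tile_side(machine_ratio):
    """Tile side at which C/I equals the machine ratio P/B (may be fractional)."""
    return machine_ratio * BYTES_PER_SURFACE_SITE / OPS_PER_SITE


def max_processing_elements(lattice_side, machine_ratio):
    """Number of tiles of the smallest admissible side that fit in the lattice."""
    tile_side = smallest_tile_side(machine_ratio)
    return (lattice_side / tile_side) ** 4


if __name__ == "__main__":
    print("C/I for L = 4:", ops_per_byte(4))
    print("PEs for a 32^4 lattice at P/B = 10:", max_processing_elements(32, 10.0))
```

With $P/B = 10$ the smallest tile side comes out slightly above 3 sites, and a $32^4$ lattice then splits into roughly 9000 tiles, the same order of magnitude as the 10000 processing elements quoted on the slide.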

The typical HPC workhorse of 2013 – 2016 -->

|                | Fermi @ Cineca (Blue Gene/Q) |
|----------------|------------------------------|
| Cores/node     | 16              |
| Perf. / core   | 12.5 Gflops     |
| Mem/cores      | 1 Gbyte         |
| # of cores     | 160000          |
| Peak Flops     | 2.1 Pflops      |
| Energy eff.    | 3.3 Gflops/W    |
| Node bandwidth | 20 GB/sec       |
| Perf / Bwidth  | 10 Flops / byte |

The typical HTC workhorse of 2013 – 2016 → (thanks to G. Lo Presti)

| MEYRIN DATA CENTRE                   | last_value | WIGNER DATA CENTRE                   | last_value |
|--------------------------------------|------------|--------------------------------------|------------|
| Number of Cores in Meyrin            | 133,507    | Number of Cores in Wigner            | 43,328     |
| Number of Drives in Meyrin           | 00,920     | Number of Drives in Wigner           | 20,100     |
| Number of 10G NIC in Meyrin          | 7,107      | Number of 10G NIC in Wigner          | 1,399      |
| Number of 1G NIC in Meyrin           | 22,369     | Number of 1G NIC in Wigner           | 5,067      |
| Number of Processors in Meyrin       | 23,007     | Number of Processors in Wigner       | 5,418      |
| Number of Servers in Meyrin          | 12,273     | Number of Servers in Wigner          | 2,712      |
| Total Disk Space in Meyrin (TB)      | 170,963    | Total Disk Space in Wigner (TB)      | 71,738     |
| Total Memory Capacity in Meyrin (TB) | 544        | Total Memory Capacity in Wigner (TB) | 172        |
#### Any differences????

|                | CINECA          | CERN               |
|----------------|-----------------|--------------------|
| Cores/node     | 16              |                    |
| Perf. / core   | 12.5 Gflops     | 12.5 Gflops ??     |
| Mem/cores      | 1 Gbyte         | 4 Gbyte            |
| # of cores     | 160000          | ~ 177000           |
| Peak Flops     | 2.1 Pflops      | 2.3 Pflops ??      |
| Energy eff.    | 3.3 Gflops/W    | ???                |
| Node bandwidth | 20 GB/sec       |                    |
| Core bandwidth | 1.25 GB/sec     | ~ 0.06 GB/sec      |
| Perf / Bwidth  | 10 Flops / byte | ~ 208 Flops / byte |

Any reason for these differences???



At CERN ~ 1.5 cores per job

At FERMI, on average 20 jobs on 160K cores  $\rightarrow$  8000 cores / job

### HPC workhorses today

Looking at the recent past ...



#### Current HPC machines

Two examples .... two-three years between the two

|                | Fermi @ Cineca  | SuperMUC          |
|----------------|-----------------|-------------------|
| Cores/node     | 16              | 28                |
| Perf. / core   | 12.5 Gflops     | 41.6 Gflops       |
| Mem/cores      | 1 Gbyte         |                   |
| # of cores     | 160000          | 86000             |
| Peak Flops     | 2.1 Pflops      | 3.58 Pflops       |
| Energy eff.    | 3.3 Gflops/W    | 3.5 Gflops/W      |
| Node bandwidth | 20 GB/sec       | 18 Gbyte/sec      |
| Core bandwidth | 1.25 GB/sec     | ~ 0.65 GB/sec     |
| Perf / Bwidth  | 10 Flops / byte | ~ 63 Flops / byte |

.... showing a worrying trend!

The Top500 ranking list in 1978!!!!

[Figure: scanned table of early LINPACK results. Columns: facility, time for N = 100, UNIT = 10^6 TIME / (1/3 · 100^3 + 100^2) in microseconds, computer, compiler. Entries run from the CRAY-1 and CDC 7600 at the top down to the DEC KA-10 at the bottom. Footnote: TIME(100) = (100/75)^3 SGEFA(75) + (100/75)^2 SGESL(75).]

Thanks to J. Dongarra for this (and a few more) hard to find pictures

The Top500 ranking list in the last 30 years !!!!




### Trade secrets 2: Less official views

Why 500 ????? Or ...

... Is being a club member a true sign of distinction??



# Trade secrets 2: Less official views

Entry # 10 is the true boundary between the "haves" and the "have-nots"....

- #18 ENI (3.188 P / 4.605 P)
- #36 CINECA Fermi (1.789 P / 2.097 P)
- #129 CINECA Galileo (0.684 P / 1.103 P)
- #205 ENI (0.454 P / 0.496 P)



By the way, is the Linpack TOP500 test a sensible benchmark?? Make your choice!!!!!





Applying the "new" proposed Linpack has a sobering effect, but the scaling law does not change at all (except for just a few entries)

More and more high-end machines look like this .....



.... heterogeneous systems exploiting (in one way or another) massively parallel accelerators (GPUs or similar beasts)

This is so because boosting the clock no longer works.....



By the way, physics explains why it is so....

parallel computing is the <u>physics sponsored way to compute</u>:

The basic object in computers today is the transistor

Industry learns to build smaller and smaller transistors. As  $\lambda \to 0$  obviously  $N \propto 1/\lambda^2$  but speed scales less favourably,  $\propto 1/\lambda$

Trade rules: perform <u>more and more</u> things <u>in parallel</u> rather than a <u>fixed number of things faster and faster</u>
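A small worked example of the two scaling laws quoted above, as a hedged sketch: `shrink_factor` is a hypothetical name for the factor by which the feature size $\lambda$ is reduced, and the function simply evaluates $N \propto 1/\lambda^2$ and speed $\propto 1/\lambda$.

```python
def scaling_gain(shrink_factor):
    """Return (transistor-count gain, speed gain) when lambda shrinks by shrink_factor.

    Uses the two proportionalities from the slide:
    count ~ 1 / lambda**2 and speed ~ 1 / lambda.
    """
    count_gain = shrink_factor ** 2
    speed_gain = shrink_factor
    return count_gain, speed_gain


if __name__ == "__main__":
    for shrink in (2.0, 4.0, 8.0):
        count_gain, speed_gain = scaling_gain(shrink)
        print(f"lambda / {shrink:g}: {count_gain:g}x transistors, {speed_gain:g}x speed")
```

Each halving of $\lambda$ gives four times as many transistors but only twice the speed, so the extra transistors can only be put to use by doing more things in parallel.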

1) Agreements with a "large computing centre" (CINECA)

Access to 100 (out of 1600) Mcore-hours / year on the FERMI system (BG/Q); started in late 2012 ending in June 2016

Access to an additional 15 (x2) Mcore-hours / year on GALILEO 2015 through 2017

Access to ~ 6% of the new MARCONI system (see later...)
Starting June 2016, ending November 2018

After Nov. 2018, ??????

2) A smaller INFN-maintained cluster (Zefiro) for algorithm/code development, tests, fine-tuning of programs ...

Some 1600 computing cores (~1.4 Mcore-hours) Infiniband QDR + Experimental nodes (GPUs)

up and running in Pisa since Sept. 2013



3) Probably early 2017: The first significant investment in HPC since the previous millennium ...

In the framework of the CIPE project on "Integration of HPC and HTC computing", approved in late December 2015; "Decreto" published a few days ago →

an HPC island of roughly 1.5 Pflops peak power --->
1000 Mcore-hours / year (BG/Q equivalent)

plus (probably) an upgraded "R&D small" machine
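The conversion from peak power to "BG/Q equivalent" core-hours quoted for the 1.5 Pflops island can be reproduced as follows. This is a sketch under two assumptions: one BG/Q core is worth 12.5 Gflops peak (the figure in the Fermi table of this talk) and the machine is available for all 8760 hours of a year. The function name is hypothetical.

```python
BGQ_CORE_PEAK_GFLOPS = 12.5   # peak performance of one BG/Q core (Fermi table)
HOURS_PER_YEAR = 24 * 365     # assumption: machine available all year round


def bgq_equivalent_mcore_hours(peak_pflops):
    """Yearly budget, in millions of BG/Q-equivalent core-hours, for a given peak."""
    equivalent_cores = peak_pflops * 1.0e6 / BGQ_CORE_PEAK_GFLOPS  # Pflops -> Gflops
    return equivalent_cores * HOURS_PER_YEAR / 1.0e6


if __name__ == "__main__":
    print(bgq_equivalent_mcore_hours(1.5))
```

For 1.5 Pflops this gives 120000 equivalent cores and about 1050 Mcore-hours per year, in line with the "1000 Mcore-hours / year" on the slide.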

4) Are smaller installations sensible choices??

Just an example of a smaller HPC machine at Università di Ferrara

- 5 Nodes with 2 Intel CPUs and 16 GPUs on each node (assembled by E4)
  - $\rightarrow$  110+ Tflops peak performance (5.5 Gflops/W)
- $\rightarrow$  96 (BG/Q) Mcore hours .... GOOD!!!! (4 years later...)
- $\rightarrow$  But only if you can use GPUs efficiently ... LESS GOOD!!





A new Moore's law (**Zoccoli's** law) ???? Still some worries ...



# A fashionable question: WhatNext ??



#### WhatNext: Fermi → Marconi

A large upgrade scheduled in 2016 – 2017 for the Tier-0 machine at CINECA

A preview of a trend likely to be seen elsewhere

A long installation process

Three successive phases, going from 2 Pflops  $\rightarrow$  ~ 18 Pflops

Approximately 26 MEuro for the full upgrade (~1.4 MEuro/Pflops)

#### WhatNext: Fermi → Marconi

A summary of the main figures ....

|                | Fermi           | Marconi 1       | Marconi 2   | Marconi 3   |
|----------------|-----------------|-----------------|-------------|-------------|
| When           | Oct. 2013       | Jun. 2016       | Dec. 2016   | Jun. 2017   |
| Processor      | BG/Q            | Broadwell       | KNL B1      | Skylake     |
| Cores/node     | 16              | 36              | 68          | 40          |
| Perf. / core   | 12.5 Gflops     | 36.7 Gflops     | 44.9 Gflops | 74.4 Gflops |
| Mem/cores      | 1 Gbyte         | 3.55 Gbyte      | 1.4 Gbyte   | 4.8 Gbyte   |
| # of cores     | 160000          | 54432           | 244800      | 60480       |
| Peak Flops     | 2.1 Pflops      | 2.0 Pflops      | 11.0 Pflops | 4.5 Pflops  |
| Energy eff.    | 3.3 Gflops/W    | Average value   | >           | 12 Gflops/W |
| Node bandwidth | 20 GB/sec       | 25 GB/sec       | 25 GB/sec   | 25 GB/sec   |
| Perf / Bwidth  | 10 Flops / byte | 51 Flops / byte | 122 Fl/byte | 119 Fl/byte |
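The last row of the table follows from the rows above it: node performance (cores per node times performance per core) divided by node bandwidth. A minimal sketch, with the numbers copied from the table and a hypothetical function name:

```python
# (cores per node, Gflops per core, node bandwidth in GB/s), copied from the table
MACHINES = {
    "Fermi": (16, 12.5, 20.0),
    "Marconi 1": (36, 36.7, 25.0),
    "Marconi 2": (68, 44.9, 25.0),
    "Marconi 3": (40, 74.4, 25.0),
}


def flops_per_byte(cores_per_node, gflops_per_core, node_bandwidth_gb_s):
    """Ratio of peak node performance to node bandwidth, in flops per byte."""
    return cores_per_node * gflops_per_core / node_bandwidth_gb_s


if __name__ == "__main__":
    for name, figures in MACHINES.items():
        print(f"{name}: {flops_per_byte(*figures):.0f} flops/byte")
```

The computed ratios match the table to within a few percent and make the trend explicit: node performance grows by more than an order of magnitude while node bandwidth barely moves.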

#### WhatNextNext:

The next big step forward might be the Coral project:

2 GPU-based machines (IBM): Oak Ridge + LLNL

1 MIC-based machine (Cray): Argonne

Promised for (late) 2017
Each machine in the 100 – 200 Pflops range

The two IBM machines cost 325 M\$ (1.3 M\$ / Pflops)

A closer look at the Oak Ridge machine (codename: Summit)

#### WhatNextNext:

The Summit/Sierra systems are the first attempt at designing from scratch an HPC GPU-based system ...

- ... as opposed to just assembling any processor with any GPU ...
- ... trying to eliminate (or at least reduce) the two critical bottlenecks of these "accrocchi" (kludges):
  - Data bottleneck between memory and compute engine
  - Data bottleneck between GPU and CPU

#### WhatNextNext:

|                | Fermi @ Cineca  | Coral / Sierra    |
|----------------|-----------------|-------------------|
| Cores/node     | 16              | ???               |
| Perf. / node   | 0.2 Tflops      | 5 x 8 Tflops      |
| Mem/node       | 16 Gbyte        | 256 Gbyte ????    |
| # of nodes     | 10000           | 3400              |
| Peak Flops     | 2.1 Pflops      | ~ 140 Pflops      |
| Energy eff.    | 3.3 Gflops/W    | ~ 15 Gflops/W     |
| Node bandwidth | 20 Gbyte/sec    | 23 – 46 Gbyte/sec |
| Core bandwidth | 1.25 GB/sec     |                   |
| Perf / Bwidth  | 10 Flops / byte | 860 Flops/byte    |

A couple of critical architectural improvements here:



But a serious network bottleneck coming back ???

YES and NO!!!

#### Assume that:



- your pattern of node-to-node communication is local, so ...
- the information exchange is proportional to the "surface" of your computing volume:  $I \propto V^{\varepsilon}$,  $\varepsilon < 1$
- you use the increasing power of each processing node to stuff larger and larger parts of your problem into each processing node
- more formally, you are happy with "weak scaling"

*Define*  $\rho = P/B$ 

Then you can derive a scaling law telling you that you're happy as long as:  $\rho \leq P^{1-\varepsilon}$,  $\varepsilon \sim 0.5$
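One way such a scaling law can be obtained from the assumptions listed above is sketched here. The symbols $T_C$ and $T_I$ (compute and communication time per step on one node) are introduced only for this illustration, and all proportionality constants are dropped, so this is an order-of-magnitude argument rather than the derivation used in the talk.

```latex
% V: lattice volume held by one node, P: node performance, B: node bandwidth
T_C \propto \frac{V}{P}, \qquad T_I \propto \frac{I}{B} \propto \frac{V^{\varepsilon}}{B}

% communication must not take longer than computation
T_I \le T_C \;\Longrightarrow\; \rho = \frac{P}{B} \le V^{1-\varepsilon}

% weak scaling: the volume per node grows with the node performance, V \propto P
\rho \lesssim P^{1-\varepsilon}
```

The more powerful the node, the larger the tile it can hold, and the larger the $P/B$ ratio it can tolerate.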

#### Example:

When computing the Dirac operator (the key computing kernel of LatticeQCD)

For each 4-D slice of your physical system of size  $L^4$ 

$$C \sim 584 L^4 \text{ ops}$$
$$I \sim 8 \times 3 \times 8 L^3 \text{ bytes}$$
$$\rho \approx \frac{584 L^4}{8 \times 3 \times 8 L^3} = 3 L$$

(not too large) lattices of  $24^4$  ( $32^4$ ) sites scale in performance on BG/Q up to 4096 (10000) cores

It is not as bad as it looks at first ....



... but not very good either !!!!


#### WhatNextNextNext:

"La Cina è vicina ...." ("China is near ....")

#### **China Accelerator**



#### Matrix2000 GPDSP

- High Performance
  - 64-bit supported
  - ~2.4 / 4.8 Tflops (DP / SP)
  - 1 GHz, ~200 W
- High Throughput
  - High-bandwidth memory
  - 32~64 GB
  - PCIe 3.0, x16






#### THE KEY PROBLEM:

No way to increase performance if you do not go parallel

And you have to do so AT THE SAME time at many levels

1. Coarse (node) parallelism
2. Fine (core) parallelism
3. Finer (intra-core) parallelism (aka vectorization, SIMD ....)

The "old" way...

1. MPI
2. openMP
3. ?????

Is there a "new" way to:

- replace ?????
- possibly combine (at least) 2) and 3)

A possible approach to answer this question ....

Are current massively-parallel processors efficient compute engines for our typical (massively-parallel) algorithms ... if you are ready to make an "unreasonable" effort to adapt your algorithm/code to the machine?

Is there a way to squeeze out a large fraction of the potentially available computing power with a "reasonable" effort?

Is this "reasonable effort" <u>re-usable</u> as new processors/computers become available in the near future?

Can we have "(mildly-)quantitative" answers to these questions?

Direct experience in 2 real-life cases:

Fluid dynamics using the Lattice Boltzmann (LB) method

LQCD with staggered fermions.

#### The LB method

LB is discrete in position and momentum space.

LB solves numerically the Boltzmann equation on a lattice with sufficient accuracy (i.e. up to a given power of momenta) to reproduce the features of the fluid-flow described by the Navier-Stokes equations ( $\delta x \gg 1$ )













#### The LB method

An extended set of tests on the D2Q37 LB model

Surprisingly similar to LGT simulations in many of its computational details.

- Massively-parallel algorithm, easily partitioned on many nodes
- Just two computationally critical kernels (collide and propagate)
- No significant role of global sums (at variance with LQCD)
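To make the two kernels concrete, here is a minimal pure-Python sketch of one LB time step. It deliberately uses the small, standard D2Q9 model with a BGK collision on a periodic box, not the D2Q37 model tested in this talk; all function names are hypothetical. It shows why the two kernels behave so differently: `propagate` only moves data between neighbouring sites, while `collide` does all the arithmetic and is purely site-local.

```python
# Standard D2Q9 velocity set and weights (an assumption: the talk uses D2Q37).
VELOCITIES = [(0, 0), (1, 0), (0, 1), (-1, 0), (0, -1),
              (1, 1), (-1, 1), (-1, -1), (1, -1)]
WEIGHTS = [4 / 9] + [1 / 9] * 4 + [1 / 36] * 4


def equilibrium(rho, ux, uy):
    """Second-order equilibrium populations for density rho and velocity (ux, uy)."""
    usq = ux * ux + uy * uy
    feq = []
    for (cx, cy), w in zip(VELOCITIES, WEIGHTS):
        cu = cx * ux + cy * uy
        feq.append(w * rho * (1.0 + 3.0 * cu + 4.5 * cu * cu - 1.5 * usq))
    return feq


def propagate(f, nx, ny):
    """Move each population one lattice step along its velocity (periodic box)."""
    new_f = [[[0.0] * 9 for _ in range(ny)] for _ in range(nx)]
    for x in range(nx):
        for y in range(ny):
            for k, (cx, cy) in enumerate(VELOCITIES):
                new_f[(x + cx) % nx][(y + cy) % ny][k] = f[x][y][k]
    return new_f


def collide(f, nx, ny, tau):
    """BGK relaxation of every site towards its local equilibrium."""
    for x in range(nx):
        for y in range(ny):
            pops = f[x][y]
            rho = sum(pops)
            ux = sum(p * c[0] for p, c in zip(pops, VELOCITIES)) / rho
            uy = sum(p * c[1] for p, c in zip(pops, VELOCITIES)) / rho
            feq = equilibrium(rho, ux, uy)
            f[x][y] = [p - (p - e) / tau for p, e in zip(pops, feq)]
    return f


def total_mass(f):
    """Sum of all populations on the lattice."""
    return sum(sum(sum(site) for site in column) for column in f)


# Usage: a fluid at rest with a checkerboard density perturbation.
nx, ny = 8, 8
f = [[equilibrium(1.0 + 0.1 * ((x + y) % 2), 0.0, 0.0) for y in range(ny)]
     for x in range(nx)]
mass_before = total_mass(f)
for _ in range(10):
    f = collide(propagate(f, nx, ny), nx, ny, tau=0.8)
mass_after = total_mass(f)
```

Both steps conserve the total mass, which is a convenient sanity check. In a multi-node run only `propagate` needs data from neighbouring tiles, which is where the surface-to-volume argument of the earlier slides comes in.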

In practical terms: can we make a transition from here ....



In practical terms: can we make a transition from here .... to here



.... and be happy with performance and portability

## Our Graal: openACC

Is there a programming language that makes this perspective possible and efficient?

It looks like there is: it is called openACC (or maybe openMP4)

#### Let's have a look ---->

## Our Graal: openACC

|                             | Tesla K40 |        |       | Tesla K80 |       |       |
|-----------------------------|-----------|--------|-------|-----------|-------|-------|
| Code Version                | CUDA      | OCL    | OACC  | CUDA      | OCL   | OACC  |
| $T_{\text{Prop}}$ [msec]    | 13.78     | 15.80  | 13.91 | 7.60      | 7.79  | 7.51  |
| GB/s                        | 169       | 147    | 167   | 306       | 299   | 310   |
| $\mathcal{E}_p$             | 59%       | 51%    | 58%   | 64%       | 62%   | 65%   |
| $T_{\text{Bc}}$ [msec]      | 4.42      | 6.41   | 2.76  | 1.11      | 1.58  | 0.71  |
| $T_{\text{Collide}}$ [msec] | 39.86     | 136.93 | 78.65 | 16.80     | 61.73 | 36.39 |
| MLUPS                       | 99        | 29     | 50    | 234       | 64    | 108   |
| $\mathcal{E}_c$             | 45%       | 13%    | 23%   | 52%       | 14%   | 24%   |
| $T_{\text{WC}}$/iter [msec] | 58.07     | 159.14 | 96.57 | 26.84     | 71.12 | 44.61 |
| MLUPS                       | 68        | 25     | 41    | 147       | 55    | 88    |

## Our Graal: openACC

Once we have our nice openACC code, can we use it (without changes) on a different (or even a brand new) computer????

|                          | E5-2630 v3 |        |          | GK210    |          |          | Hawaii XT |          |
|--------------------------|------------|--------|----------|----------|----------|----------|-----------|----------|
| compiler                 | ICC 15     | ICC 15 | PGI 15.9 | NVCC 7.5 | NVCC 7.5 | PGI 15.9 | GCC       | PGI 15.9 |
| model                    | Intrinsics | OMP    | OACC     | CUDA     | OCL      | OACC     | OCL       | OACC     |
| *propagate* perf. [GB/s] | 38         | 32     | 32       | 154      | 150      | 155      | 232       | 216      |
| $\mathcal{E}_p$          | 65%        | 54%    | 54%      | 64%      | 62%      | 65%      | 73%       | 70%      |
| *collide* perf. [MLUPS]  | 14         | 11     | 12       | 117      | 32       | 55       | 76        | 54       |
| *collide* perf. [GFLOPs] | 92         | 71     | 78       | 760      | 208      | 356      | 494       | 351      |
| $\mathcal{E}_c$          | 30%        | 23%    | 25%      | 52%      | 14%      | 24%      | 19%       | 14%      |
| Tot perf. [MLUPS]        | 11.5       | 9.2    | 9.8      | 80.7     | 28.1     | 45.6     | 63.7      | 47.0     |

## From LB to LQCD

A LQCD code with staggered fermions + improved action + (two level) stout smearing

earlier moved to GPUs by M. D'Elia and collaborators ... and now going back to good old C + openACC

The physics problem: what happens to the quark potential in an external magnetic field

(C. Bonati et al., arXiv:1403.6094)

 $(40^4) \to (48^3 \times 96)$ 



## From LB to LQCD

Code fully (re-)written and tested (C + openACC)

(Running physics on the GPU cluster in Ferrara)

Preliminary performance tests on K20 - K40 - K80 GPUs

Comparison against the golden benchmark today (BG/Q)

|                    | Measured                       | Expected |
|--------------------|--------------------------------|----------|
| 1 K20 GPU          | $\rightarrow$ 80 BG/Q cores    | ( 93 )   |
| 1 K40 GPU          | $\rightarrow$ 96 BG/Q cores    | ( 110 )  |
| 1 K80 GPU          | $\rightarrow$ 160 BG/Q cores   | ( 190 )  |
| 1 "heavy" GPU node | $\rightarrow$ 1300 BG/Q cores  |          |
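A quick way to read the table is to take the ratio of measured to expected BG/Q-core equivalents for each GPU. The sketch below does only that arithmetic on the numbers in the table; the dictionary and function names are hypothetical.

```python
# BG/Q-core equivalents per GPU, copied from the table above
MEASURED = {"K20": 80, "K40": 96, "K80": 160}
EXPECTED = {"K20": 93, "K40": 110, "K80": 190}


def fraction_of_expected(gpu):
    """Measured performance as a fraction of the expected one for a given GPU."""
    return MEASURED[gpu] / EXPECTED[gpu]


if __name__ == "__main__":
    for gpu in MEASURED:
        print(f"{gpu}: {100 * fraction_of_expected(gpu):.0f}% of expected")
```

All three GPUs land at roughly 84 to 87% of the expected figure.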

# Is it all gold that glitters??

Obviously not ....

Some new processors not yet supported by openACC / openMP4

Non negligible (but acceptable) performance gap still present

Not all programs automatically OK...

And ...

The problem in the next few years will again be in the match between processing power and communication capacity among compute nodes (Nothing as good as BG/X will be available in the near future....)

## Is it all gold that glitters (2)??

Still, after some (non-trivial) tuning is done (once and for all) ...



Reasonable windows for efficiency can be found



## Concluding remarks (1)

HPC differs from HTC because the inter-node(core) communication requirements are hugely different

HPC machines happily increase their peak performance levels ...

... but interconnection technology is barely adequate to keep them working on useful problems

In the future HPC and HTC machines may look increasingly similar

Competitive systems are not within reach of INFN alone: agreements with other partners are necessary

## Concluding remarks (2)

Processing elements are increasingly powerful, but also complex and heterogeneous, and (luckily??) no clear winner has emerged.

Programming (for performance) in a machine-independent way is difficult.

But recent progress in this direction is promising.

And this is probably the key area in which collaboration between the theoretical and experimental communities within INFN can be very fruitful.