

## **Technological Update** November 2024





All information in the following presentation is NVIDIA confidential, including codenames, future products, and performance projections

No information in this presentation is allowed to be revealed or published without NVIDIA consent

Sharing or distributing copies of this presentation to anyone is strictly prohibited

Use of cameras to capture information is strictly prohibited 

# **NVIDIA CONFIDENTIAL** PRESENTATION

# **Customer Guidelines**



- NVIDIA hardware architecture overview;
- Main differences between the different NVIDIA GPU architectures;
- Programming languages available for GPUs;
- Main techniques for optimizing code performance;
- References to in-depth courses on the different topics.

## Agenda

- Overview of the main drivers / software libraries needed for the different hardware components;
- Tools for debugging and performance analysis (e.g. Visual/Compute Profiler);
- Main techniques for optimizing intra-node and inter-node multi-GPU communications with InfiniBand (e.g. GPUDirect P2P, RDMA);
- Roadmap of NVIDIA technological developments (CPU/GPU).







## **Pre-Exascale Supercomputing**

### EDGE

## SIMULATION





### Viz





### EDGE







## **Exascale Supercomputing**

SIM + AI









### **DIGITAL TWIN**



## **QUANTUM COMPUTING**












## **AI Factory for Research**

SIMULATION



### **DIGITAL TWIN**



### **QUANTUM COMPUTING**



CLOUD




### **DRUG DISCOVERY** EvolutionaryScale





### **CLIMATE MODELING** KAUST





## **AI: The new tool for Science**

### **CANCER RESEARCH**

Wellcome Sanger Institute

### LABORATORY ROBOTICS

Argonne National Lab



### **ASTRO FOUNDATION MODELS** The Flatiron Institute

### AGRICULTURAL HEALTH Fermata



### **MATERIALS DISCOVERY**

Microsoft Research



### SEISMIC GEOSCIENCE California Institute of Technology





## AI is Transforming Material Science and Chemistry



### Materials Project Multi-National Effort

### MACE: Higher Order Equivariant Message Passing Neural Networks for Fast and Accurate Force Fields

ML] 26 Jan 2023

Ilyes Batatia Engineering Laboratory, University of Cambridge Cambridge, CB2 1PZ UK Department of Chemistry, ENS Paris-Saclay, Université Paris-Saclay 91190 Gif-sur-Yvette, France ilyes.batatia@ens-paris-saclay.fr

**Gregor N. C. Simm** Engineering Laboratory, University of Cambridge Cambridge, CB2 1PZ UK **Dávid Péter Kovács** Engineering Laboratory, University of Cambridge Cambridge, CB2 1PZ UK

Christoph Ortner Department of Mathematics University of British Columbia Vancouver, BC, Canada V6T 1Z2

MACE OpenSource

## Broad industry innovation and achievement

### Article

## Scaling deep learning for materials discovery

https://doi.org/10.1038/s41586-023-06735-9 Received: 8 May 2023

Accepted: 10 October 2023

Published online: 29 November 2023

Open access

ntrl-Sci] 10 May 2024


Amil Merchant<sup>1,3</sup>, Simon Batzner<sup>1,3</sup>, Samuel S. Schoenholz<sup>1,3</sup>, Muratahan Aykol<sup>1</sup>, Gowoon Cheon<sup>2</sup> & Ekin Dogus Cubuk<sup>1,3</sup>

Novel functional materials enable fundamental breakthroughs across technological applications from clean energy to information processing<sup>1–11</sup>. From microchips to batteries and photovoltaics, discovery of inorganic crystals has been bottlenecked by expensive trial-and-error approaches. Concurrently, deep-learning models for language, vision and biology have showcased emergent predictive capabilities with increasing data and computation<sup>12–14</sup>. Here we show that graph networks trained at scale can reach unprecedented levels of generalization, improving the efficiency of materials discovery by an order of magnitude. Building on 48,000 stable crystals identified in continuing studies<sup>15–17</sup>, improved efficiency enables the discovery of 2.2 million structures below the current convex hull, many of which escaped previous

### **GNoME** Google DeepMind

### MatterSim: A Deep Learning Atomistic Model Across Elements, Temperatures and Pressures

Han Yang<sup>1\*†</sup>, Chenxi Hu<sup>1†</sup>, Yichi Zhou<sup>1†</sup>, Xixian Liu<sup>1†</sup>, Yu Shi<sup>1†</sup>, Jielan Li<sup>1\*†</sup>, Guanzhi Li<sup>1†</sup>, Zekun Chen<sup>1†</sup>, Shuizhou Chen<sup>1†</sup>, Claudio Zeni<sup>1</sup>, Matthew Horton<sup>1</sup>, Robert Pinsler<sup>1</sup>, Andrew Fowler<sup>1</sup>, Daniel Zügner<sup>1</sup>, Tian Xie<sup>1</sup>, Jake Smith<sup>1</sup>, Lixin Sun<sup>1</sup>, Qian Wang<sup>1</sup>, Lingyu Kong<sup>1</sup>, Chang Liu<sup>1</sup>, Hongxia Hao<sup>1\*</sup>, Ziheng Lu<sup>1\*</sup>

<sup>1\*</sup>Microsoft Research AI for Science.

\*Corresponding author(s). E-mail(s): hanyang@microsoft.com; jielanli@microsoft.com; hongxiahao@microsoft.com; zihenglu@microsoft.com; †These authors contributed equally to this work.

Abstract

MatterSIM Microsoft Research



rl-sci] 29 Jan 202

### MatterGen: a generative model for inorganic materials design

Claudio Zeni<sup>1†</sup>, Robert Pinsler<sup>1†</sup>, Daniel Zügner<sup>1†</sup>, Andrew Fowler<sup>1†</sup>, Matthew Horton<sup>1†</sup>, Xiang Fu<sup>1</sup>, Aliaksandra Shysheya<sup>1</sup>, Jonathan Crabbé<sup>1</sup>, Lixin Sun<sup>1</sup>, Jake Smith<sup>1</sup>, Bichlien Nguyen<sup>1</sup>, Hannes Schulz<sup>1</sup>, Sarah Lewis<sup>1</sup>, Chin-Wei Huang<sup>1</sup>, Ziheng Lu<sup>1</sup>, Yichi Zhou<sup>1</sup>, Han Yang<sup>1</sup>, Hongxia Hao<sup>1</sup>, Jielan Li<sup>1</sup>, Ryota Tomioka<sup>1\*†</sup>, Tian Xie<sup>1\*†</sup>

<sup>1</sup>Microsoft Research AI4Science.

\*Corresponding author(s). E-mail(s): ryoto@microsoft.com; tianxie@microsoft.com; †Equal contribution; non-corresponding authors are listed in random order.

> MatterGen Microsoft Research

### Open Materials 2024 (OMat24) Inorganic Materials Dataset and Models

Luis Barroso-Luque, Muhammed Shuaibi, Xiang Fu, Brandon M. Wood, Misko Dzamba, Meng Gao, Ammar Rizvi, C. Lawrence Zitnick, Zachary W. Ulissi

Fundamental AI Research (FAIR) at Meta

The ability to discover new materials with desirable properties is critical for numerous applications from helping mitigate climate change to advances in next generation computing hardware. AI has the potential to accelerate materials discovery and design by more effectively exploring the chemical space compared to other computational methods or by trial-and-error. While substantial progress has been made on AI for materials data, benchmarks, and models, a barrier that has emerged is the lack of publicly available training data and open pre-trained models. To address this, we present a Meta FAIR release of the Open Materials 2024 (OMat24) large-scale open dataset and an accompanying set of pre-trained models. OMat24 contains over 110 million density functional theory (DFT) calculations focused on structural and compositional diversity. Our EquiformerV2 models achieve state-of-the-art performance on the Matbench Discovery leaderboard and are capable of predicting ground-state stability and formation energies to an F1 score above 0.9 and an accuracy of 20 meV/atom, respectively. We explore the impact of model size, auxiliary denoising objectives, and fine-tuning on performance across a range of datasets including OMat24, MPtraj, and Alexandria. The open release of the OMat24 dataset and models enables the research community to build upon our efforts and drive

> Open Materials 2024 Meta FAIR

trl-sci] 16 Oct 2024



# **Announcing Earth-2 NIMs for CorrDiff & FourCastNet**

### **GEFS Lower-Resolution** Forecast ~25km







CorrDiff NIM for Generative AI Powered Downscaling | FourCastNet NIM for Global Weather Forecasting



Forecast Rollout T=0 to T=120 hrs

Capture Extremely Rare Events 3 Sigma or 4 Sigma





NIM CUDA-Accelerated Agentic AI Libraries





Grace Blackwell MGX Node



NVLink Switch





## NVIDIA BUILDS AI SUPERCOMPUTING INFRASTRUCTURE One Year Rhythm | Supercluster Scale | Full-Stack | CUDA Everywhere

Omniverse CUDA-Accelerated Physical AI Libraries





CUDA • DOCA • NCCL

- Cluster-Scale Software
- System Software
- Chip Software

GB200 NVL72 SuperPOD

Chips Purpose-Built for AI Supercomputing: GPU | CPU | DPU | NIC | NVLink Switch | IB Switch | ENET Switch

CUDA-X Libraries

Accelerated Software Stack



Quantum Switch



Spectrum-X Switch













## Inventec

Lenovo



**Grace Hopper** Superchip

**Ecosystem Powering the Next Wave of AI Supercomputing Systems** 

## Partners Supercharge HPC and AI







## PEGATRON





NVIDIA GB200 Grace Blackwell NVL4 Superchip



NVIDIA GB200 Grace Blackwell Superchip

**GIGABYTE**<sup>™</sup>

**Hewlett Packard** Enterprise













H200 NVL







AI SUPERCHIP 208B Transistors



2nd GEN TRANSFORMER ENGINE FP4/FP6 Tensor Core

## **NVIDIA Blackwell** The Engine of the New Industrial Revolution

Built to Democratize Trillion-Parameter AI



5<sup>th</sup> GENERATION NVLINK Scales to 576 GPUs



RAS ENGINE 100% In-System Self-Test



- 20 PetaFLOPS of AI performance on a single GPU
- 4X Training | 30X Inference | 25X Energy Efficiency & TCO
- Expanding AI Datacenter Scale to beyond 100K GPUs



SECURE AI Full Performance Encryption & TEE



DECOMPRESSION ENGINE 800 GB/s





## **Announcing Fifth Generation NVLink and NVLink Switch Chip** Efficient Scaling for Trillion Parameter Models

7.2 TB/s Full all-to-all Bidirectional Bandwidth

Sharp v4 plus FP8

3.6 TF In-Network Compute

Expanding NVLink up to 576 GPU NVLink Domain

18X Faster than Today's Multi-Node Interconnect




## **GB200 NVL72 Delivers New Unit of Compute**



### **GB200 NVL72**

36 GRACE CPUs | 72 BLACKWELL GPUs | Fully Connected NVLink Switch Rack

| Training | Inference | NVL Model Size | Multi-Node All-to-All | Multi-Node All-Reduce |
|---|---|---|---|---|
| 720 PFLOPs | 1,440 PFLOPs | 27T params | 130 TB/s | 260 TB/s |

**OEM and DGX** options
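The rack-level figures can be related back to a single GPU with simple division. The sketch below is only arithmetic on the numbers quoted on these slides; the variable names are illustrative, and it assumes the rack totals are spread evenly over the 72 GPUs.

```python
# Rack-level figures quoted for GB200 NVL72 on the slide.
GPUS_PER_RACK = 72
TRAINING_PFLOPS = 720
INFERENCE_PFLOPS = 1440
ALL_TO_ALL_TBPS = 130

# Assumption: totals divide evenly across the GPUs in the NVLink domain.
training_per_gpu = TRAINING_PFLOPS / GPUS_PER_RACK      # PFLOPs per GPU
inference_per_gpu = INFERENCE_PFLOPS / GPUS_PER_RACK    # PFLOPs per GPU
all_to_all_per_gpu = ALL_TO_ALL_TBPS / GPUS_PER_RACK    # TB/s per GPU

print(f"training:   {training_per_gpu:.1f} PFLOPs per GPU")
print(f"inference:  {inference_per_gpu:.1f} PFLOPs per GPU")
print(f"all-to-all: {all_to_all_per_gpu:.2f} TB/s per GPU")
```

The inference figure works out to 20 PFLOPs per GPU, which matches the "20 PetaFLOPS of AI performance on a single GPU" bullet on the Blackwell slide.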



# Choose The Right <u>Solution</u> For The Job

### Scale-up, CPU+GPU & HGX Products

**Real-Time Trillion-**Parameter Models LLM & MoE



Graph Neural Networks



Massive Scale Model Training & Inference

405B - 1T+

Rack Power: ~120 / 70kW

NVLink Domain 72

Max GPUs per NVLink Domain Max Performance & Capability

GB200 NVL72 1000W, 480GB, 144GB



Quantum-2/CX7

N/A






| HGX Products | Scale-out, CPU+GPU & PCIe Products | | | |
|---|---|---|---|---|
| Highest Compute<br>Performance HPC & AI | Most Versatile Platform<br>Diverse Workloads | Most Efficient Compute<br>LLM Inference, AI + HPC | Fast Universal<br>AI + Graphics<br>Text to Image/Video | AI Video<br>Entry-tier Universal<br>& Edge AI |
| AI Training, Inference &<br>Scientific Research | LLM Inference & Retrieval-<br>Augmented Generation | LLM Inference & Retrieval-<br>Augmented Generation | Text to Image/Video AI<br>Multi-modal Generative AI | Edge AI, Inference +<br>Video and AI |
| Data Analytics | Data Analytics, Vector<br>Database & HPC | AI, HPC & Data<br>Analytics | Fine Tune<br>Training/Inference,<br>GenAI + Omniverse | Mobile Cloud Gaming<br>+ vWS |
| 70B-405B | 70B-175B | 70B-175B | 13B-70B | Up to 7B |
| Node Power: ~11 / 5.5kW | Node Power: ~3kW | Node Power: ~8kW | Node Power: ~3kW | Node Power: ≤2kW |
| NVLink Domain 8 & 4 | NVLink Domain 2 | NVLink Domain 4 & 2 | NA | NA |
| 4-8 GPUs per Baseboard<br>Highest Compute Performance<br>AI, HPC & Data Analytics | 1 – 2 GPUs per node<br>Best Inference TCO<br>Adv. Arch for AI & HPC | 4-8 GPUs per node<br>2<sup>nd</sup> Generation MIG<br>5-year NVIDIA AI Enterprise | Fastest RT Graphics<br>Largest Render Models | 1-16 GPUs per node<br>Video & Graphics<br>Compact & Versatile |
| HGX H200 \| H100<br>700W, 141GB \| 80GB | GH200 NVL2<br>1000W, 480GB, 144GB | H200 NVL \| H100 NVL<br>600W, 141GB \| 400W, 94GB<br>NVL2 & 4 \| NVL2 | L40S<br>350W \| 48GB<br>2-Slot FHFL | <b>L4</b><br>72W \| 24GB<br>1-Slot HHHL |
| Quantum-2/ConnectX-7 | | | | |
| Spectrum-X / B3140H SuperNIC | | | | Ethernet/Wi-Fi |
| BlueField-3 DPU | | | | |





## X86 + HOPPER Architectures & Connectivity



| Platform | GPU memory | Memory bandwidth |
|---|---|---|
| HGX H100 4-GPU | 80GB HBM3 | 3.4TB/s |
| HGX H200 4-GPU (NEW) | 141GB HBM3e | 4.8TB/s |
| HGX H100 8-GPU | 80GB HBM3 | 3.4TB/s |
| HGX H200 8-GPU (NEW) | 141GB HBM3e | 4.8TB/s |



## **NVIDIA GH200 Grace Hopper** Superchip

Built for the New Era of Accelerated Computing and **Generative AI**

## Most versatile compute

Best performance across CPU, GPU or memory intensive applications

**Easy to deploy and scale out** 1 CPU:1 GPU node, simple to manage and schedule for HPC, enterprise, and cloud

**Best Perf/TCO for diverse workloads** Maximize data center utilization and power efficiency

## **Continued Innovation**

Grace and Blackwell in 2025





900GB/s NVLink-C2C | 624GB High-Speed Memory 4 PF AI Perf | 72 Arm Cores

# **Grace Hopper Powers AI Supercomputing Datacenters**

## Grace Hopper Will Deliver 200 Exaflops of AI performance for Groundbreaking Research

[Chart: Cumulative AI Performance (ExaFLOPS of AI), y-axis 0 to 400, x-axis by year starting at 2015; systems shown include Isambard and VENADO]







**200 ExaFLOPS** AI Grace Hopper in Supercomputing Centers

65% of Hopper are Grace Hopper

> **2X** More energy efficient





### GH200 96GB

480 GB LP5x 96GB HBM3

## Scale Out AI Inference

## **GRACE GPU-GPU NVLINK**

### Architectures & Cost of Connectivity







## **Grace Hopper Superchip 4-Way Design**





**Hewlett Packard** Enterprise

The choice for the world's fastest supercomputers











## **GH200 Grace Hopper HPC Platform** Unified Memory and Cache Coherence for Next Gen HPC Performance











## **Grace-Hopper Superchip Workload Performance**

# Hopper architecture





## HOPPER H100 TENSOR CORE GPU 80B Transistors, TSMC 4N

- 132 SMs: 2x Performance per Clock, 4<sup>th</sup> Gen Tensor Core, **Thread Block Clusters**
- 4<sup>th</sup> Gen NVLink: 900GB/s total BW, New SHARP support, **NVLink Network**

## NEW HOPPER SM ARCHITECTURE

- 2x faster FP32 & FP64 FMA
- 256 KB L1\$ / Shared Memory
- New 4<sup>th</sup> Gen Tensor Core
- New DPX instruction set
- New Tensor Memory Accelerator
  - Fully asynchronous data movement
- New Thread Block Clusters
  - Turn locality into efficiency

### SM

[Figure: Hopper SM block diagram showing warp scheduler, dispatch unit, register file, INT32 and FP32 units, LD/ST units, and Tex]



# **Tensor Cores**

- Specialized high-performance compute cores for matrix multiply and accumulate (MMA) math operations for AI and HPC applications.
- Tensor Cores operating in parallel across the SMs of one NVIDIA GPU deliver massive increases in throughput and efficiency compared to standard floating-point (FP), integer (INT), and fused multiply-accumulate (FMA) operations.
- Support for a wide range of data types (fp64, fp32, tf32, fp16, bfloat16, fp8, int8) and mixed precision.
- New Transformer Engine designed specifically to accelerate Transformer model training and inference (chooses dynamically between FP8 and 16-bit calculations).
- Tensor Memory Accelerator feeds the H100 Tensor Cores by transferring large blocks of data and multi-dimensional tensors from global memory to shared memory and vice-versa.
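To make the "mixed precision" point concrete, here is a minimal CPU-side sketch of why a multiply-accumulate is often run with low-precision inputs but a wider accumulator. It is not Tensor Core code: it only emulates FP16 and FP32 rounding with Python's `struct` module, and the helper names (`to_fp16`, `to_fp32`, `dot`) are made up for this illustration.

```python
import struct

def to_fp16(value):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]

def to_fp32(value):
    """Round a Python float to the nearest IEEE single-precision value."""
    return struct.unpack("<f", struct.pack("<f", value))[0]

def dot(a, b, round_accumulator):
    """Dot product with FP16 inputs and a caller-chosen accumulator precision."""
    acc = 0.0
    for ai, bi in zip(a, b):
        product = to_fp16(ai) * to_fp16(bi)      # inputs are stored as FP16
        acc = round_accumulator(acc + product)   # accumulator is rounded each step
    return acc

ones = [1.0] * 4096
fp16_accumulated = dot(ones, ones, to_fp16)  # stalls once the FP16 spacing exceeds 1
fp32_accumulated = dot(ones, ones, to_fp32)  # keeps counting

print(fp16_accumulated, fp32_accumulated)
```

With an FP16 accumulator the running sum stops growing at 2048, because neighbouring FP16 values are 2 apart from that point on; the FP32 accumulator reaches the exact answer of 4096.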



## Allocate 1 bit to either range or precision



2x throughput & half footprint of FP16/BF16
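As a rough illustration of "allocate 1 bit to either range or precision": the two FP8 encodings usually associated with H100 are called E4M3 (4 exponent bits, 3 mantissa bits) and E5M2 (5 exponent bits, 2 mantissa bits). Those names and the exponent biases below come from the public FP8 format descriptions, not from this slide text, so treat them as assumptions. The decoder is a sketch for finite values only and does not handle NaN or infinity encodings.

```python
def decode_fp8(byte, exp_bits, man_bits, bias):
    """Decode one finite FP8 bit pattern (special values are not handled)."""
    sign = -1.0 if (byte >> 7) & 1 else 1.0
    exponent = (byte >> man_bits) & ((1 << exp_bits) - 1)
    mantissa = byte & ((1 << man_bits) - 1)
    fraction = mantissa / (1 << man_bits)
    if exponent == 0:
        # Subnormal: no implicit leading 1.
        return sign * fraction * 2.0 ** (1 - bias)
    return sign * (1.0 + fraction) * 2.0 ** (exponent - bias)

# Largest finite values (bit patterns: sign | exponent | mantissa).
E4M3_MAX = decode_fp8(0b0_1111_110, exp_bits=4, man_bits=3, bias=7)
E5M2_MAX = decode_fp8(0b0_11110_11, exp_bits=5, man_bits=2, bias=15)

# Gap between 1.0 and the next representable value: the precision side of the trade.
E4M3_STEP_AT_ONE = decode_fp8(0b0_0111_001, 4, 3, 7) - 1.0
E5M2_STEP_AT_ONE = decode_fp8(0b0_01111_01, 5, 2, 15) - 1.0

print(E4M3_MAX, E5M2_MAX, E4M3_STEP_AT_ONE, E5M2_STEP_AT_ONE)
```

Moving one bit from the mantissa to the exponent raises the largest finite value from 448 to 57344, while the step size near 1.0 doubles from 0.125 to 0.25.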

## **FP8 TENSOR CORE**



## Support for multiple accumulator and output types







For details, see "NVIDIA H100 Tensor Core GPU Architecture" white paper available for download

## **Thread Block Clusters**

- New feature introduces programming locality within clusters of SMs
- Shared memory blocks of SMs within a GPU Processing Cluster (GPC) can communicate directly (w/o going to HBM)
- Leveraged with CUDA cooperative groups API





## **TENSOR MEMORY ACCELERATOR UNIT**

## HW-accelerated mem\_copies

- Global <=> Shared Mem
- Shared Mem <=> Shared Mem for Clusters
- Address generation for 1D to 5D Tensors
- Simplified programming model

## **Fully asynchronous with threads**

- No addr gen or data movement overhead
- Synchronize with transaction barrier

## ASYNC MEM COPY USING TMA
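"Address generation" here refers to turning multi-dimensional tensor coordinates into memory offsets. The sketch below shows that arithmetic on the CPU for a row-major tensor, which is the kind of per-element index math that threads would otherwise perform themselves. It does not model the TMA hardware or its descriptors; `row_major_strides` and `linear_offset` are hypothetical helper names.

```python
def row_major_strides(shape):
    """Element strides for a densely packed row-major tensor of up to 5 dimensions."""
    if not 1 <= len(shape) <= 5:
        raise ValueError("this sketch covers 1D to 5D tensors only")
    strides = [1] * len(shape)
    for dim in range(len(shape) - 2, -1, -1):
        strides[dim] = strides[dim + 1] * shape[dim + 1]
    return tuple(strides)

def linear_offset(coords, strides):
    """Offset (in elements) of the element at the given coordinates."""
    return sum(c * s for c, s in zip(coords, strides))

shape = (2, 3, 4)
strides = row_major_strides(shape)
offset = linear_offset((1, 2, 3), strides)
print(strides, offset)
```

For a 2x3x4 tensor the strides are (12, 4, 1), so element (1, 2, 3) sits at offset 23, the last element of the tensor.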









# Software





# CUDA



# **CUDA: NVIDIA's Computing Platform** Used Everywhere





## Media & Entertainment

http://developer.nvidia.com/cuda-downloads





# CUDA TOOLKIT Libraries, Languages and Development Tools for GPU Computing

# Programming Approaches

# Development Environment

# Language Support

## Libraries

## "Drop-in" Acceleration





**CUDA** Profiling Tools Interface



## Programming Languages

## Maximum Flexibility





**CUDA-GDB** Debugger





Compile new languages to CUDA







### **DEEP LEARNING**



### LINEAR ALGEBRA





## **GPU Accelerated Libraries** "Drop-In" Acceleration For Your Applications







## SIGNAL, IMAGE & VIDEO



### PARALLEL ALGORITHMS









# **NVIDIA HPC SDK**

Available at developer.nvidia.com/hpc-sdk, on NGC, via Spack, and in the Cloud

Develop for the NVIDIA Platform: GPU, CPU and Interconnect

Libraries | Accelerated C++ and Fortran | Directives | CUDA

x86\_64 | Arm | OpenPOWER

7-8 Releases Per Year | Freely Available



# OK, but, What Does It Mean?









# How GPU Acceleration Works

## **Application Code**



Rest of Sequential CPU Code

CPU 95% of Code 5% of Execution





# Piece-by-Piece, not All-or-Nothing

Incrementally accelerate key components of an application

Real applications are complex

No need to port the whole thing in one go
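One way to reason about which piece to port first is Amdahl's law: the overall speedup is limited by the share of run time that stays on the CPU. The sketch below is a generic back-of-the-envelope calculation, not something taken from the slides; the fractions and the 10x kernel speedup are invented inputs.

```python
def overall_speedup(accelerated_fraction, kernel_speedup):
    """Amdahl's law: speedup of the whole run when only part of it is accelerated."""
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must be between 0 and 1")
    remaining = 1.0 - accelerated_fraction
    return 1.0 / (remaining + accelerated_fraction / kernel_speedup)

# Hypothetical application: port more of the run time in successive steps.
for fraction in (0.50, 0.80, 0.95):
    print(f"{fraction:.0%} of run time ported, 10x faster there: "
          f"{overall_speedup(fraction, 10.0):.2f}x overall")
```

Each additional piece that moves to the GPU raises the ceiling, which is why profiling to find the next largest share of run time matters more than porting everything at once.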










CUDA includes heterogeneous profiling tools to help evaluate which components to port next
































## What is CUDA: A Simplified View

- CUDA Programming Environment
- Domain-Specific Libraries
- Compiler
- CUDA Runtime Libraries
- CUDA programming model (will talk more about this)
- Driver:
  - Kernel Mode Driver – Lives in the OS, handles low-level hardware interaction
  - User Mode Driver – Integrates with your application, maps low-level CUDA API calls to your specific HW





# But, Of Course, a Real Application is Complex

Many Components, Many Dependencies





## "Drop-in" Acceleration

# 4 Ways To Accelerate Applications

Applications

Directives (OpenACC, OpenMP)

Standard Languages

Easily Accelerate Applications

Portable Performance



Flexibility







## EASE OF USE

"DROP-IN"

QUALITY

PERFORMANCE

# Libraries: Easy, High-Quality Acceleration

Using libraries enables GPU acceleration without in-depth knowledge of GPU programming

Many GPU-accelerated libraries follow standard APIs, thus enabling acceleration with minimal code changes

Libraries offer high-quality implementations of functions encountered in a broad range of applications

NVIDIA libraries are tuned by experts










# **3** Steps To A CUDA-Accelerated Application

| Step | Action |
|---|---|
| STEP 1 | Substitute library calls with equivalent CUDA library calls |
| STEP 2 | Manage data locality<br>with CUDA: `cudaMallocManaged()`, `cudaMalloc()`, `cudaMemcpy()`<br>with CUBLAS: `cublasAlloc()`, `cublasSetVector()` |
| STEP 3 | Rebuild and link the application with the CUDA-accelerated library<br>`$ gcc myobj.o -l cublas` |





# Single Precision Alpha X Plus Y (SAXPY)

Part of Basic Linear Algebra Subroutines (BLAS) library

$z = \alpha x + y$, where $x$, $y$, $z$ are vectors and $\alpha$ is a scalar
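As a reference point for the GPU versions that follow, the definition can be written directly as a few lines of plain Python. This is a minimal sketch of the formula itself, not of any BLAS interface; the function name and the sample vectors are chosen for illustration.

```python
def saxpy(alpha, x, y):
    """Return z where z[i] = alpha * x[i] + y[i] for every element."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return [alpha * xi + yi for xi, yi in zip(x, y)]

z = saxpy(2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0])
print(z)
```

Every output element depends only on the matching input elements, which is what makes SAXPY a convenient first example for parallel hardware.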





# **Drop-In Acceleration With CUDA Maths Libraries**

In two easy steps

**Original Code**

    int N = 1 << 20; // 1M elements
    x = (float *)malloc(N * sizeof(float));
    y = (float *)malloc(N * sizeof(float));
    initData(x, y);
    // Perform SAXPY on 1M elements: y[]=a*x[]+y[]
    saxpy(N, 2.0, x, 1, y, 1);
    useResult(y);

**GPU-Accelerated Code**

    int N = 1 << 20; // 1M elements
    x = (float *)malloc(N * sizeof(float));
    y = (float *)malloc(N * sizeof(float));
    initData(x, y);
    // Perform SAXPY on 1M elements: y[]=a*x[]+y[]
    saxpy(N, 2.0, x, 1, y, 1);
    useResult(y);



## **Drop-In Acceleration With CUDA Maths Libraries** Step 1: Update memory allocation to be CUDA-aware


Here, we use Unified Memory which automatically migrates between host (CPU) and device (GPU) as needed by the program



```
int N = 1 << 20; // 1M elements
cudaMallocManaged(&x, N * sizeof(float));
cudaMallocManaged(&y, N * sizeof(float));
// Perform SAXPY on 1M elements: y[]=a*x[]+y[]
```

**GPU-Accelerated Code** 



## **Drop-In Acceleration With CUDA Maths Libraries Step 2:** Call CUDA library version of API




    int N = 1 << 20; // 1M elements
    cudaMallocManaged(&x, N * sizeof(float));
    cudaMallocManaged(&y, N * sizeof(float));

    // Perform SAXPY on 1M elements: y[]=a*x[]+y[]

**GPU-Accelerated Code** 





# SIX WAYS TO SAXPY

Programming Languages for GPU Computing






GPU SAXPY in multiple languages and libraries

A selection of possibilities, not a tutorial







## Serial C code



# **OpenACC Compiler Directives**

## Parallel C code with OpenACC

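```
void saxpy(int n,
           float a,
           float *x,
           float *y)
{
  // Ask the OpenACC compiler to offload and parallelize this loop
#pragma acc kernels
  for (int i = 0; i < n; ++i)
    y[i] = a*x[i] + y[i];
}

// Perform SAXPY on 1M elements
saxpy(1 << 20, 2.0, x, y);
```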





## Serial BLAS code

    int N = 1 << 20;
    ...
    // Use your choice of blas library
    // Perform SAXPY on 1M elements
    blas_saxpy(N, 2.0, x, 1, y, 1);

# cuBLAS Library

## Parallel cuBLAS code

```
int N = 1 << 20;
cublasInit();
cublasSetVector(N, sizeof(x[0]), x, 1, d_x, 1);
cublasSetVector(N, sizeof(y[0]), y, 1, d_y, 1);
// Perform SAXPY on 1M elements
cublasSaxpy(N, 2.0, d_x, 1, d_y, 1);
cublasGetVector(N, sizeof(y[0]), d_y, 1, y, 1);
cublasShutdown();
```

You can also call cuBLAS from Fortran, C++, Python and other languages




## Serial C code

```
void saxpy(int n, float a,
           float *x, float *y)
{
  for (int i = 0; i < n; ++i)
    y[i] = a*x[i] + y[i];
}

// Perform SAXPY on 1M elements
int N = 1 << 20;
saxpy(N, 2.0, x, y);
```

# CUDA C++

### CUDA C++ code

```
__global__
void saxpy(int n, float a, float *x, float *y)
{
  // One thread per element; the guard handles a partially filled last block
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) y[i] = a*x[i] + y[i];
}

int N = 1 << 20;
// Perform SAXPY on 1M elements
saxpy<<< 4096, 256 >>>(N, 2.0, d_x, d_y);
```
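The launch configuration `<<< 4096, 256 >>>` and the index expression in the kernel can be checked on the CPU. The sketch below emulates the grid with two nested Python loops; it runs serially and is only meant to show how block and thread indices map to array elements. `launch_config`, `saxpy_kernel`, and `launch` are illustrative names, not CUDA APIs.

```python
def launch_config(n, threads_per_block=256):
    """Number of blocks needed so that blocks * threads_per_block >= n."""
    blocks = (n + threads_per_block - 1) // threads_per_block
    return blocks, threads_per_block

def saxpy_kernel(block_idx, block_dim, thread_idx, n, a, x, y):
    """Body of the kernel for one (block, thread) pair."""
    i = block_idx * block_dim + thread_idx
    if i < n:  # threads past the end of the array do nothing
        y[i] = a * x[i] + y[i]

def launch(n, a, x, y, threads_per_block=256):
    """Serial stand-in for a kernel launch: visit every thread of every block."""
    blocks, block_dim = launch_config(n, threads_per_block)
    for block_idx in range(blocks):
        for thread_idx in range(block_dim):
            saxpy_kernel(block_idx, block_dim, thread_idx, n, a, x, y)

# 1M elements need exactly 4096 blocks of 256 threads.
print(launch_config(1 << 20))

# A size that is not a multiple of 256 exercises the bounds check.
n = 1000
x = [1.0] * n
y = [2.0] * n
launch(n, 2.0, x, y)
print(y[0], y[-1])
```

For `N = 1 << 20` the grid covers the array exactly; for other sizes the last block is only partly used, which is why the kernel keeps the `i < n` guard.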





## Serial Standard C++ code

    int N = 1 << 20;
    std::vector<float> x(N), y(N);
    ...
    // Perform SAXPY on 1M elements
    std::transform(x.begin(), x.end(), y.begin(), y.begin(), 2.0f * _1 + _2);

# CUDA C++ Core Libraries (CCCL)

### CUDA C++ code

    int N = 1 << 20;
    thrust::host_vector<float> x(N), y(N);
    ...
    thrust::device_vector<float> d_x = x;
    thrust::device_vector<float> d_y = y;

    // Perform SAXPY on 1M elements
    thrust::transform(d_x.begin(), d_x.end(), d_y.begin(), d_y.begin(), 2.0f * _1 + _2);






# Standard C++ Parallel Algorithms (stdpar)

## CUDA C++ code

```
int N = 1 << 20;
// The output is pre-sized: parallel algorithms need forward iterators,
// so std::back_inserter cannot be used with an execution policy.
std::vector<float> x(N), y(N), out(N);
...
// Perform SAXPY on 1M elements
std::transform(
        std::execution::par_unseq,
        x.begin(), x.end(), y.begin(),
        out.begin(),
        [](float a, float b) {
                  return 2.0f * a + b;
        });
```





## Standard Python

    import numpy as np

    def saxpy(a, x, y):
        return [a * xi + yi for xi, yi in zip(x, y)]

    x = np.arange(2**20, dtype=np.float32)
    y = np.arange(2**20, dtype=np.float32)

    cpu_result = saxpy(2.0, x, y)

# Python

## Numba Parallel Python

```
import numpy as np
from numba import vectorize

@vectorize(['float32(float32, float32, float32)'], target='cuda')
def saxpy(a, x, y):
    return a * x + y

N = 1048576

# Initialize arrays
A = np.ones(N, dtype=np.float32)
B = np.ones(A.shape, dtype=A.dtype)
C = np.empty_like(A, dtype=A.dtype)

# Add arrays on GPU
C = saxpy(2.0, A, B)
```





# Anatomy of a CUDA binary

Hello world example












- PTX Compatibility Layer



# How Do We Keep Things Working Together?



# **CUDA Compatibility** "Backward Compatibility" Software Considerations

- The simplest use case:
  - "Your compiled application will work forever on NVIDIA GPUs, regardless of installed driver"
- All newer GPU drivers will be binary-compatible with older binaries
  - Requires statically linking libraries like the CUDA runtime
- Recompiling from *source* may require API changes
  - Only binary compatibility is guaranteed







# **CUDA Compatibility** "SM/Compute Compatibility" Hardware Considerations (Binaries)

- Binaries are built for a specific GPU family, PTX is used to target additional families
  - Each architecture supports a given ISA, or compute capability
  - PTX enables compatibility between architectures
- Compiled applications target a specific CC, with some compatibility within a family (newer but not older)
- Supported:
  - CC 8.0 cubin runs on CC 8.6 (A100  $\rightarrow$  A40)
- Unsupported:
  - CC 8.6 cubin cannot run on CC 8.0 (A40  $\rightarrow$  A100)
  - **CC 8.0** cubin cannot run on **CC 7.0** (A100  $\rightarrow$  V100)
  - CC 7.0 cubin cannot run on CC 8.0 (V100  $\rightarrow$  A100)





# CUDA compatibility "PTX Compatibility" Hardware Considerations

- **PTX Code** is compatible with future versions, both Major and Minor
- **Supported PTX Migration:** 
  - CC 8.0 PTX runs on CC 8.6 (A100 PTX  $\rightarrow$  A40)
  - CC 7.0 PTX runs on CC 8. (V100 PTX  $\rightarrow$  A100)
- **Unsupported PTX Migration:** 
  - CC 8.6 PTX cannot run on CC 8.0 (A40 PTX  $\rightarrow$  A100)
  - CC 8.0 PTX cannot run on CC 7.0 (A100 PTX  $\rightarrow$  V100)









### "CUDA Everywhere"

Code for one GPU runs on all GPUs with newer SM version

CUDA applications are portable between all chip classes (100, 10x, 20x, 21x, 10b, etc.)

- All current features supported on all future architectures
- Performance & capacities vary (e.g. SM count)
- A few features much slower but still functional (e.g. FP64)



Volta applications "just work" on Turing/Ampere/Hopper

Datacenter libraries "just work" for Quadro, GeForce, etc.

# CUDA COMPATIBILITY "PTX Compatibility" Hardware Considerations



SASS is pre-compiled binary code native to a specific GPU architecture - multiple versions may be packaged together

PTX is assembly code JIT compiled by CUDA when an application is run on a new GPU for which there is no SASS



### Portability depends on PTX Just-In-Time Compilation

Forward compatibility guarantee: PTX 8.0 runs on CC 9,10,11, ...

Exact match of SASS runs natively (many may exist)

PTX 8.0 won't run on an older CC. Applications occasionally include older PTX to avoid shipping lots of SASS.
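The selection between pre-compiled SASS and JIT-compiled PTX can be sketched as follows. This is an illustrative model only: the function `select_code`, its return values, and the tuple encoding of compute capabilities are hypothetical, and the actual choice is made inside the CUDA driver. The sketch assumes that SASS is usable within the same major version when its minor version is not newer than the device, and that PTX is usable on any device of equal or newer compute capability.

```python
def select_code(device_cc, sass_ccs, ptx_ccs):
    """Pick what a binary packaging several SASS/PTX versions would run.

    device_cc: (major, minor) of the GPU.
    sass_ccs / ptx_ccs: lists of (major, minor) targets embedded in the binary.
    Returns ("sass", cc), ("ptx-jit", cc) or None if nothing is usable.
    """
    # Native SASS: same major version, minor not newer than the device
    native = [cc for cc in sass_ccs
              if cc[0] == device_cc[0] and cc[1] <= device_cc[1]]
    if native:
        return ("sass", max(native))
    # Otherwise fall back to JIT-compiling PTX built for an older-or-equal CC
    jit = [cc for cc in ptx_ccs if cc <= device_cc]
    if jit:
        return ("ptx-jit", max(jit))
    return None


# A binary with SASS and PTX for CC 8.0, on three different GPUs
on_cc86 = select_code((8, 6), [(8, 0)], [(8, 0)])
on_cc90 = select_code((9, 0), [(8, 0)], [(8, 0)])
on_cc70 = select_code((7, 0), [(8, 0)], [(8, 0)])
```

In this model the CC 8.6 device runs the CC 8.0 SASS natively, the CC 9.0 device falls back to PTX JIT, and the CC 7.0 device cannot run the binary at all, which is why applications sometimes ship older PTX.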




# CUDA compatibility "Minor-Version Compatibility" (Previously "Enhanced Compatibility")

- Applications created within a major-release of CUDA may run on a system with the minimum driver version
  - E.g., 11.x CTK requires 450.80.02
- Works with:
  - Newer driver than CTK
  - Newer CTK than driver
- New CTK features that require a new driver will return errors
  - Programmers must write code to check if features exist and if libraries are supported (e.g., cublas must match cudnn)\*
- PTX JIT unsupported (matching driver required)
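The minimum-driver requirement amounts to a version comparison. The sketch below only illustrates that comparison using the 450.80.02 figure quoted above; the helper names are hypothetical, and a real application would query the installed driver through the CUDA runtime rather than compare strings itself.

```python
def parse_version(text):
    """Turn a dotted version string such as '450.80.02' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def driver_meets_minimum(driver_version, minimum_driver):
    """True if the installed driver is at least the minimum required version."""
    return parse_version(driver_version) >= parse_version(minimum_driver)


# Minimum driver for an 11.x toolkit, as stated on the slide
MIN_DRIVER_CTK_11 = "450.80.02"

ok_new = driver_meets_minimum("535.129.03", MIN_DRIVER_CTK_11)
ok_old = driver_meets_minimum("450.51.06", MIN_DRIVER_CTK_11)
```

Passing this check does not mean every toolkit feature is available: features that need a newer driver still return errors and have to be checked individually.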






# CUDA Compatibility "Forward Compatibility"

## Using a CUDA toolkit with a higher-versioned user-mode driver (UMD) on a lower-versioned kernel-mode driver (KMD)

- Deployment & upgrade of drivers may be very disruptive, especially in CSP and enterprise datacenters
- Can be used across major and minor versions of CTK
- Compatibility Package to be installed, includes user-mode driver (among other files)
  - Via symbolic links, multiple compatibility versions can be installed together in a single system
- Programmers must check for supported features & supported hardware
- Supports PTX JIT compilation



# Key Takeaways

- CUDA applications are compatible forever
- CUDA programs within a major version generally are compatible
- CUDA applications run against older drivers with compatibility shims
  - Matters in e.g. containers, data center environments



# Multi GPU Multi Node programming



# Example: Jacobi solver

### Solves the 2D-Laplace Equation on a rectangle

$\Delta u(x,y) = 0 \quad \forall (x,y) \in \Omega \setminus \partial\Omega$

- Dirichlet boundary conditions (constant values on boundaries) on left and right boundary
- Periodic boundary conditions on top and bottom boundary



## Example: Jacobi Solver Single GPU

    While not converged
        Do Jacobi step:
            for( int iy = 1 ; iy < ny-1 ; iy++ )
                for( int ix = 1 ; ix < nx-1 ; ix++ )
                    a_new[iy*nx+ix] = -0.25 *
                        -( a[ iy   *nx+(ix+1)] + a[ iy   *nx+ix-1]
                         + a[(iy-1)*nx+ ix   ] + a[(iy+1)*nx+ix  ]);
        Apply periodic boundary conditions
        Swap a_new and a
        Next iteration
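The loop above can be tried out on the CPU in plain Python. This is a minimal sketch, not the solver used in the benchmarks: the function names are hypothetical, the grid is a flat list in row-major order, convergence is tested on the largest point-wise change rather than an L2 norm, and the periodic boundary is assumed to be stored as copies of the first and last interior rows.

```python
def jacobi_step(a, nx, ny):
    """One Jacobi sweep over the interior points of a flat row-major grid."""
    a_new = list(a)
    for iy in range(1, ny - 1):
        for ix in range(1, nx - 1):
            a_new[iy * nx + ix] = 0.25 * (
                a[iy * nx + ix + 1] + a[iy * nx + ix - 1]
                + a[(iy - 1) * nx + ix] + a[(iy + 1) * nx + ix]
            )
    return a_new


def apply_periodic(a, nx, ny):
    """Top/bottom periodic boundary: boundary rows mirror the opposite interior row."""
    for ix in range(nx):
        a[ix] = a[(ny - 2) * nx + ix]             # row 0 <- last interior row
        a[(ny - 1) * nx + ix] = a[1 * nx + ix]    # row ny-1 <- first interior row


def solve(a, nx, ny, tol=1e-6, iter_max=1000):
    """Iterate until the largest change drops below tol (an assumed criterion)."""
    for iteration in range(1, iter_max + 1):
        a_new = jacobi_step(a, nx, ny)
        apply_periodic(a_new, nx, ny)
        change = max(abs(n - o) for n, o in zip(a_new, a))
        a = a_new  # plays the role of "swap a_new and a"
        if change < tol:
            break
    return a, iteration


# A field that is linear in x already satisfies the Laplace equation,
# so the solver should leave it unchanged and stop after one sweep.
nx, ny = 5, 4
linear = [float(ix) for iy in range(ny) for ix in range(nx)]
result, iterations = solve(linear, nx, ny)
```

The left and right columns are never written, which is how the Dirichlet boundary values stay constant.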



# **Example: Jacobi Solver** Multi GPU

The multi-GPU loop adds a halo exchange to each iteration:

- Apply periodic boundary conditions
- **Halo exchange**
- Swap a\_new and a
- Next iteration





# Message Passing Interface - MPI

### Standard to exchange data between processes via messages

- Defines an API to exchange messages
  - Point to Point: e.g. MPI\_Send, MPI\_Recv
  - Collectives: e.g. MPI\_Reduce
- Multiple implementations (open source and commercial)
  - Bindings for C/C++, Fortran, Python, ...
  - E.g. MPICH, OpenMPI, MVAPICH, IBM Platform MPI, Cray MPT, ...



# **MPI - Skeleton**

    #include <mpi.h>

    int main(int argc, char *argv[]) {
        int rank, size;

        /* Initialize the MPI library */
        MPI_Init(&argc, &argv);

        /* Determine the calling process rank and total number of ranks */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Call MPI routines like MPI_Send, MPI_Recv, ... */
        /* ... */

        /* Shutdown MPI library */
        MPI_Finalize();
        return 0;
    }











# Multi Process Multi GPU Programming

Using CUDA-aware MPI

Handle GPU affinity on multi-GPU nodes:

    int local_rank = -1;
    MPI_Comm_rank(local_comm, &local_rank);
    int num_devices = 0;
    cudaGetDeviceCount(&num_devices);
    cudaSetDevice(local_rank % num_devices);

(Use MPI\_Comm\_split\_type(MPI\_COMM\_WORLD, MPI\_COMM\_TYPE\_SHARED, rank, info, &local\_comm); to get local\_comm.)
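The effect of the affinity code can be illustrated without MPI or a GPU. In the sketch below, `local_ranks` mimics what splitting the world communicator by shared-memory node yields (a rank index counted per node), and `device_for` is the `local_rank % num_devices` mapping. Both helper names and the host list are hypothetical.

```python
def local_ranks(hostnames):
    """Per-node rank index for each global rank, given each rank's host name."""
    seen = {}
    result = []
    for host in hostnames:
        result.append(seen.get(host, 0))
        seen[host] = seen.get(host, 0) + 1
    return result


def device_for(local_rank, num_devices):
    """Round-robin mapping of node-local ranks onto the node's GPUs."""
    return local_rank % num_devices


# Five ranks spread over two nodes that have two GPUs each (made-up layout)
hosts = ["node0", "node0", "node1", "node0", "node1"]
ranks_on_node = local_ranks(hosts)
devices = [device_for(r, 2) for r in ranks_on_node]
```

With more ranks than GPUs on a node, the modulo wraps around and several ranks share a device, as happens for the third rank on `node0` here.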



# Multi Process Multi GPU Programming

Using CUDA-aware MPI

```
while (l2_norm > tol && iter < iter_max) {
    // Reset the norm accumulator and run the Jacobi kernel on this rank's rows
    cudaMemsetAsync(l2_norm_d, 0, sizeof(real), compute_stream);
    launch_jacobi_kernel(a_new, a, l2_norm_d, iy_start, iy_end, nx, compute_stream);
    cudaEventRecord(compute_done, compute_stream);
    cudaMemcpyAsync(l2_norm_h, l2_norm_d, sizeof(real), cudaMemcpyDeviceToHost, compute_stream);

    // Wait for the kernel before exchanging halos with the neighbours
    cudaEventSynchronize(compute_done);
    const int top = rank > 0 ? rank - 1 : (size - 1);
    const int bottom = (rank + 1) % size;
    // Top/Bottom Halo exchange -> next slide

    // Combine the per-rank norms
    cudaStreamSynchronize(compute_stream);
    MPI_CALL(MPI_Allreduce(l2_norm_h, &l2_norm, 1, MPI_REAL_TYPE, MPI_SUM, MPI_COMM_WORLD));
    l2_norm = std::sqrt(l2_norm);

    std::swap(a_new, a);
    iter++;
}
```



### **Example Jacobi Top/Bottom Halo**

    MPI_Sendrecv(a_new + iy_start * nx, nx, MPI_FLOAT, top, 0,
                 a_new + (iy_end * nx), nx, MPI_FLOAT, bottom, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
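The data movement of the top/bottom halo exchange can be simulated in a single Python process. This is a sketch under stated assumptions: each rank's sub-domain is a list of rows whose first and last rows are halo rows, the neighbour arithmetic is the `top`/`bottom` computation from the main loop, and `halo_exchange` is a hypothetical stand-in for the pair of `MPI_Sendrecv` calls executed by every rank.

```python
def neighbours(rank, size):
    """Periodic top/bottom neighbours of a rank."""
    top = rank - 1 if rank > 0 else size - 1
    bottom = (rank + 1) % size
    return top, bottom


def halo_exchange(domains):
    """Exchange halo rows between all ranks.

    domains[r] is rank r's list of rows: rows[0] and rows[-1] are halo rows,
    rows[1:-1] are owned by the rank (an assumed layout for this sketch).
    """
    size = len(domains)
    # Snapshot what every rank sends before any halo row is overwritten
    first_owned = [list(rows[1]) for rows in domains]
    last_owned = [list(rows[-2]) for rows in domains]
    for rank, rows in enumerate(domains):
        top, bottom = neighbours(rank, size)
        rows[-1] = first_owned[bottom]  # bottom neighbour sent its first owned row up
        rows[0] = last_owned[top]       # top neighbour sent its last owned row down


# Three ranks, two owned rows each; the values encode (rank, row) for readability
domains = [[[0.0], [r * 10 + 1.0], [r * 10 + 2.0], [0.0]] for r in range(3)]
halo_exchange(domains)
```

After the exchange, rank 0's top halo holds rank 2's last owned row, which is the periodic wrap-around produced by the neighbour arithmetic.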



### **Benchmark Setup** DGX H100

- NVIDIA HPC SDK container: Tag nvcr.io/nvidia/nvhpc:24.1-devel-cuda12.3-ubuntu22.04
- CUDA Driver 535.129.03
- GPUs@1980 MHz
- Reported Runtime is the minimum of 5 repetitions
- For all runs CPU and GPU affinities have been tuned: See bench.sh in https://github.com/NVIDIA/multi-gpu-programming-models





# **Example: Jacobi Solver**

Figure: Single GPU performance and efficiency vs. problem size (nx=ny), NVIDIA H100 80GB HBM3



# Multi GPU Jacobi Runtime And Parallel Efficiency



Benchmark setup: DGX H100, CUDA Driver 535.129.03, NVIDIA HPC SDK container: nvcr.io/nvidia/nvhpc:24.1-devel-cuda12.3-ubuntu22.04, GPUs@1980 MHz AC, reported runtime is the minimum of 5 repetitions

MPI on DGX H100 – 20480 x 20480, 1000 iterations
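For reference, speedup and parallel efficiency are computed from runtimes as shown in this short sketch. It uses the usual definitions (speedup = T1 / Tn, efficiency = speedup / n). The timings in the example are hypothetical placeholders chosen for easy arithmetic, not measurements from this benchmark.

```python
def speedup(t_single, t_parallel):
    """Speedup of an n-GPU run over the single-GPU run."""
    return t_single / t_parallel


def parallel_efficiency(t_single, t_parallel, n_gpus):
    """Fraction of ideal linear scaling achieved on n_gpus."""
    return t_single / (n_gpus * t_parallel)


# Hypothetical runtimes in seconds (NOT measured values)
t1, t8 = 8.0, 1.25
s8 = speedup(t1, t8)
e8 = parallel_efficiency(t1, t8, 8)
```

With these placeholder numbers the 8-GPU run is 6.4 times faster than the single-GPU run, which corresponds to 80% parallel efficiency.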



# Multi GPU Jacobi Nsight Systems Timeline

Figure (Nsight Systems timeline view): MPI, 8 NVIDIA H100 80GB HBM3 on DGX H100. For each GPU, the [All Streams] row shows the Jacobi kernel and the MPI row shows the MPI\_Sendrecv calls that follow it; the CUDA API row shows cudaEventSynchronize.



# **Overlapping Communication and Computation**

- **No Overlap:** Process whole domain, then COMM
- **Overlap:** Process boundary domain, then COMM (dependency: COMM needs the boundary results), while the inner domain is processed at the same time
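The possible gain from overlapping can be estimated with a simple timing model. This is an idealised sketch, not a measurement: it assumes the boundary rows are processed first, that communication and inner-domain processing then run fully concurrently, and that splitting the domain adds no overhead. The function names and the timings are hypothetical.

```python
def step_time_no_overlap(t_boundary, t_inner, t_comm):
    """Whole domain is processed, then the halo exchange runs."""
    return t_boundary + t_inner + t_comm


def step_time_overlap(t_boundary, t_inner, t_comm):
    """Boundary first; communication then overlaps with the inner domain."""
    return t_boundary + max(t_inner, t_comm)


# Hypothetical per-iteration costs in arbitrary time units
compute_bound = (step_time_no_overlap(1, 10, 4), step_time_overlap(1, 10, 4))
comm_bound = (step_time_no_overlap(1, 2, 5), step_time_overlap(1, 2, 5))
```

In the compute-bound case the communication is hidden completely; in the communication-bound case only the inner-domain time is hidden.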







## MPI

### **Overlapping Communication and Computation**

    // Boundary rows on their own streams, inner domain on the compute stream
    launch_jacobi_kernel( a_new, a, l2_norm_d, iy_start, (iy_start+1), nx, push_top_stream );
    launch_jacobi_kernel( a_new, a, l2_norm_d, (iy_end-1), iy_end, nx, push_bottom_stream );
    launch_jacobi_kernel( a_new, a, l2_norm_d, (iy_start+1), (iy_end-1), nx, compute_stream );

    const int top = rank > 0 ? rank - 1 : (size-1);
    const int bottom = (rank+1)%size;

    // Send the top boundary row once it has been computed
    cudaStreamSynchronize( push_top_stream );
    MPI_Sendrecv( a_new+iy_start*nx, nx, MPI_REAL_TYPE, top, 0,
                  a_new+(iy_end*nx), nx, MPI_REAL_TYPE, bottom, 0,
                  MPI_COMM_WORLD, MPI_STATUS_IGNORE );

    // Send the bottom boundary row once it has been computed
    cudaStreamSynchronize( push_bottom_stream );
    MPI_Sendrecv( a_new+(iy_end-1)*nx, nx, MPI_REAL_TYPE, bottom, 0,
                  a_new, nx, MPI_REAL_TYPE, top, 0,
                  MPI_COMM_WORLD, MPI_STATUS_IGNORE );



# Multi GPU Jacobi Nsight Systems Timeline

Figure (Nsight Systems timeline view): MPI Overlap, 8 NVIDIA H100 80GB HBM3 on DGX H100. For each GPU, the [All Streams] row shows the Jacobi kernels and the MPI row shows MPI\_Allreduce; the CUDA API row shows cudaStreamSynchronize.







# Multi GPU Jacobi Parallel Efficiency DGX H100 – 20480 x 20480, 1000 iterations



# **NCCL : NVIDIA Collective Communication Library** Communication library running on GPUs, for GPU buffers.

- Library for efficient communication with GPUs
- First: Collective Operations (e.g. Allreduce), as they are required for Deep Learning
- Since 2.8: Support for Send/Recv between GPUs
- Library running on GPU: Communication calls are translated to a GPU kernel (running on a stream)

# NCCL **Optimized inter-GPU communication**



- Binaries: https://developer.nvidia.com/nccl and in NGC containers
- Source code: https://github.com/nvidia/nccl
- Perf tests: https://github.com/nvidia/nccl-tests

Diagram labels: GPU; Sockets, InfiniBand, other networks






# NCCL

**Overlapping Communication and Computation**

```
int leastPriority = 0;
int greatestPriority = leastPriority;
cudaDeviceGetStreamPriorityRange(&leastPriority, &greatestPriority);
cudaStream_t compute_stream, push_stream;
cudaStreamCreateWithPriority(&compute_stream, cudaStreamDefault, leastPriority);
cudaStreamCreateWithPriority(&push_stream, cudaStreamDefault, greatestPriority);
// ...
// Boundary rows on the high-priority stream, inner domain on the compute stream
launch_jacobi_kernel(a_new, a, l2_norm_d, iy_start, (iy_start + 1), nx, push_stream);
launch_jacobi_kernel(a_new, a, l2_norm_d, (iy_end - 1), iy_end, nx, push_stream);
launch_jacobi_kernel(a_new, a, l2_norm_d, (iy_start + 1), (iy_end - 1), nx, compute_stream);
// Halo exchange, enqueued on the same stream as the boundary kernels
ncclGroupStart();
ncclRecv(a_new,                     nx, NCCL_REAL_TYPE, top, nccl_comm, push_stream);
ncclSend(a_new + (iy_end - 1) * nx, nx, NCCL_REAL_TYPE, btm, nccl_comm, push_stream);
ncclRecv(a_new + (iy_end * nx),     nx, NCCL_REAL_TYPE, btm, nccl_comm, push_stream);
ncclSend(a_new + iy_start * nx,     nx, NCCL_REAL_TYPE, top, nccl_comm, push_stream);
ncclGroupEnd();
```

Need to use CUDA high-priority streams to avoid NCCL communication getting stuck behind compute.








# **CUDA Graphs** Reducing launch overhead

### CUDA graphs reduce kernel launch latencies

Figure: CPU and GPU timelines for a sequence of short kernels, each preceded by a kernel launch, shown for two cases.

From Advanced Performance Optimization in CUDA [S62192] by Igor Terentyev; more details on Graphs there.





# **Three-Stage Execution Model** Minimizes Execution Overheads – Pre-Initialize As Much As Possible

### Single Graph "Template"

Created in host code or built up from libraries

### Instantiate: Multiple "Executable Graphs"

- Snapshot of templates
- Sets up & initializes GPU execution structures (create once, run many times)

### Execute: **Executable Graphs Running in CUDA Streams**

Concurrency in graph is not limited by stream










# NVSHMEM

Implementation of OpenSHMEM, a Partitioned Global Address Space (PGAS) library

- Symmetric objects are allocated collectively with the same size on every PE
  - Symmetric memory: nvshmem\_malloc(...);
  - Private memory: cudaMalloc(...);
- CPU (blocking and stream-ordered) and CUDA Kernel interfaces
  - Read: nvshmem\_get(...);
  - Write: nvshmem\_put(...);
  - Atomic: nvshmem\_atomic\_add(...);
  - Flush writes: nvshmem\_quiet();
  - Order writes: nvshmem\_fence();
  - Synchronize: nvshmem\_barrier();
  - Poll: nvshmem\_wait\_until(...);
- Interoperable with MPI
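The idea of symmetric memory with one-sided put and get can be illustrated with a toy model in plain Python. This is not the NVSHMEM API: the class `ToySymmetricHeap` and its methods are hypothetical and exist only to show that every PE holds a buffer of the same size, and that a put writes directly into another PE's copy without that PE taking part.

```python
class ToySymmetricHeap:
    """Toy PGAS model: one equally sized buffer per processing element (PE)."""

    def __init__(self, n_pes, size):
        # "Collective" allocation: the same size on every PE
        self.mem = [[0.0] * size for _ in range(n_pes)]

    def put(self, dest_pe, offset, values):
        """One-sided write into dest_pe's buffer."""
        self.mem[dest_pe][offset:offset + len(values)] = values

    def get(self, src_pe, offset, count):
        """One-sided read from src_pe's buffer."""
        return list(self.mem[src_pe][offset:offset + count])


# PE 0 writes two values into PE 1's buffer; PE 1 issues no matching receive
heap = ToySymmetricHeap(n_pes=2, size=4)
heap.put(dest_pe=1, offset=1, values=[5.0, 6.0])
remote = heap.get(src_pe=1, offset=0, count=4)
local = heap.get(src_pe=0, offset=0, count=4)
```

In a real program the writer would additionally need the flush, ordering, and synchronization calls listed on the slide before the target can safely read the data; the toy model omits them because everything runs in one process.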








# **Conclusion**

|         | GPUDirect P2P/RDMA | CUDA stream/graph-aware | Kernel Initiated Communication |
|---------|--------------------|-------------------------|--------------------------------|
| MPI     | Improves Perf.     | No                      | No                             |
| NCCL    | Improves Perf.     | Yes                     | No                             |
| NVSHMEM | Required           | Yes                     | Yes                            |

Source is on GitHub: https://github.com/NVIDIA/multi-gpu-programming-models

Thank you for your attention





