Speaker
Francesco Rossi
(Università di Bologna)
Description
We present Jasmine, an implementation of a fully relativistic, 3D, electromagnetic Particle In Cell (PIC) code, capable of running on hybrid HPC (High Performance Computing) clusters, exploiting the computational power of both CUDA GPUs and CPUs.
The code's modularity and its C++ implementation allow the core algorithms to be extended to various simulation schemes with little effort. When porting a PIC scheme to a GPU-based machine, the particle-to-grid operations (e.g. the evaluation of the current density) need special care to avoid memory inconsistencies. Here we present how we implemented this operation as a parallel evaluation over grid cells, relying on robust and efficient sorting and stream compaction algorithms.
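The following is a minimal sketch, not Jasmine's actual code, of the cell-parallel deposition idea: particles are sorted by cell index so that each cell's particles are contiguous, after which one thread per cell accumulates the current without atomics or race conditions. The names (Particle, deposit_per_cell, etc.) are illustrative, and the quadratic shape functions and grid staggering mentioned later are omitted (nearest-grid-point weighting, 1D only).

```cpp
// Hypothetical illustration of per-cell current deposition after a
// sort-by-cell pass, using Thrust for the sort and range searches.
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/sort.h>
#include <thrust/binary_search.h>
#include <thrust/iterator/counting_iterator.h>

struct Particle { float x, vx, w; };      // 1D position, velocity, weight

struct CellIndex {                        // maps a particle to its cell index
    float inv_dx;
    __host__ __device__ int operator()(const Particle& p) const {
        return static_cast<int>(p.x * inv_dx);
    }
};

__global__ void deposit_per_cell(const Particle* p,
                                 const int* cell_begin, const int* cell_end,
                                 float* jx, int ncells)
{
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    if (c >= ncells) return;
    float sum = 0.f;
    // Each thread owns exactly one cell, so no atomics are needed.
    for (int i = cell_begin[c]; i < cell_end[c]; ++i)
        sum += p[i].w * p[i].vx;
    jx[c] = sum;
}

int main()
{
    const int ncells = 64, npart = 1 << 16;
    const float dx = 1.0f / ncells;

    thrust::device_vector<Particle> part(npart);   // plasma state (zero-initialized here)
    thrust::device_vector<int> keys(npart);

    // 1) compute the cell index of every particle
    thrust::transform(part.begin(), part.end(), keys.begin(), CellIndex{1.0f / dx});

    // 2) sort particles by cell index so each cell's particles are contiguous
    thrust::sort_by_key(keys.begin(), keys.end(), part.begin());

    // 3) locate the [begin, end) particle range of each cell with binary searches
    thrust::device_vector<int> cbeg(ncells), cend(ncells);
    thrust::counting_iterator<int> cells(0);
    thrust::lower_bound(keys.begin(), keys.end(), cells, cells + ncells, cbeg.begin());
    thrust::upper_bound(keys.begin(), keys.end(), cells, cells + ncells, cend.begin());

    // 4) one thread per cell accumulates the current density, race-free
    thrust::device_vector<float> jx(ncells, 0.f);
    deposit_per_cell<<<(ncells + 127) / 128, 128>>>(
        thrust::raw_pointer_cast(part.data()),
        thrust::raw_pointer_cast(cbeg.data()),
        thrust::raw_pointer_cast(cend.data()),
        thrust::raw_pointer_cast(jx.data()), ncells);
    cudaDeviceSynchronize();
    return 0;
}
```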
Running demanding simulations on GPUs offers the great advantage of the boards' high processing power, at the expense of the rather limited memory available per board. We tackle this memory limitation by streaming particle chunks asynchronously from the main node memory to the GPUs. The same chunking technique can also be used to hide the network transfer overhead occurring in the multi-GPU parallelization.
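Below is a minimal double-buffering sketch of the kind of asynchronous chunk streaming described above, again only an assumption about how such a scheme can be organized rather than Jasmine's implementation. The kernel push_chunk and the chunk size are placeholders; the point is that, with pinned host memory and two CUDA streams, the transfer of one chunk overlaps with the processing of the previous one.

```cpp
// Hypothetical host-to-GPU particle streaming with two buffers/streams.
#include <cuda_runtime.h>
#include <algorithm>
#include <cstdio>

__global__ void push_chunk(float* x, int n, float dt)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += dt;            // stand-in for the particle push
}

int main()
{
    const int nTotal = 1 << 24;       // more particles than we keep resident on the GPU
    const int chunk  = 1 << 20;       // particles streamed per transfer
    const float dt   = 1e-3f;

    // Pinned host memory so cudaMemcpyAsync can overlap with kernel execution.
    float* h_x;
    cudaMallocHost(&h_x, nTotal * sizeof(float));
    for (int i = 0; i < nTotal; ++i) h_x[i] = 0.f;

    float* d_buf[2];
    cudaStream_t s[2];
    for (int b = 0; b < 2; ++b) {
        cudaMalloc(&d_buf[b], chunk * sizeof(float));
        cudaStreamCreate(&s[b]);
    }

    // Alternate between the two buffers/streams: while one chunk is being
    // processed, the next one is already in flight over the PCIe bus.
    for (int off = 0, b = 0; off < nTotal; off += chunk, b ^= 1) {
        int n = std::min(chunk, nTotal - off);
        cudaMemcpyAsync(d_buf[b], h_x + off, n * sizeof(float),
                        cudaMemcpyHostToDevice, s[b]);
        push_chunk<<<(n + 255) / 256, 256, 0, s[b]>>>(d_buf[b], n, dt);
        cudaMemcpyAsync(h_x + off, d_buf[b], n * sizeof(float),
                        cudaMemcpyDeviceToHost, s[b]);
    }
    cudaDeviceSynchronize();

    printf("first particle after one pass: %f\n", h_x[0]);
    for (int b = 0; b < 2; ++b) { cudaFree(d_buf[b]); cudaStreamDestroy(s[b]); }
    cudaFreeHost(h_x);
    return 0;
}
```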
We compare the performance of Jasmine on different architectures: pure CPU (Intel Xeon), GPU (NVIDIA Fermi board), and a hybrid HPC cluster (Intel Xeon + NVIDIA Fermi). For a relativistic plasma simulation with grid staggering, double precision and quadratic shape functions running on an NVIDIA Fermi board, the single-particle processing time is 13 ns in the 2D case and 80 ns in the 3D one.