GPU Computing in High Energy Physics

Europe/Rome
University of Pisa

University of Pisa

Polo Fibonacci (https://www.google.com/maps/place/Dipartimento+di+Fisica/@43.720239,10.407985,17z/data=!3m1!4b1!4m2!3m1!1s0x12d591bb7d8c8ec9:0xbf91ddd442e32978), Largo Bruno Pontecorvo 3, I-56127 Pisa, phone +39 050 2214 327
Gianluca Lamanna (PI), Marco Sozzi (PI)
Description


The registration deadline is August 31st, 2014
(send e-mail to gpu2014@pi.infn.it in case of late registration)

The conference focuses on the application of GPUs in High Energy Physics (HEP), expanding on the trend of previous workshops on the topic and aiming to establish a recurrent series. The emerging paradigm of using graphics processors as powerful accelerators in data- and computation-intensive applications has found fertile ground in the computing challenges of the HEP community and is currently the object of active investigation.

This follows a long-established trend of using cheap off-the-shelf commercial units to achieve unprecedented performance in parallel data processing, leveraging the strong commitment of hardware producers to the huge market of computer graphics and games. These hardware advances come together with the continuous development of proprietary and free software to expose the raw computing power of GPUs for general-purpose applications, and for scientific computing in particular.

All the different applications of massively parallel computing in HEP will be addressed: computational speed-ups in online and offline data selection and analysis, hard real-time applications in low-level triggering, and Monte Carlo simulations for lattice QCD. Both current activities and plans for foreseen experiments and projects will be discussed, together with perspectives on the evolution of the hardware and software.


Participants
  • Akitaka Ariga
  • Alessandro Lonardo
  • Alessio Gianelle
  • Andrea Messina
  • Andrea Ruggeri
  • Andreas Herten
  • Antonio Augusto Alves Junior
  • Arkadiusz Ćwiek
  • Attilio Cucchieri
  • Bachir Bouhadef
  • Bernd Stelzer
  • Christopher Pinke
  • Claudio Bonati
  • Cristiano Bozza
  • Daniel Hugo Campora Perez
  • Daniele Cesini
  • Dario Soldi
  • David Chamont
  • Dietmar Kaletta
  • Dmitry Chirkin
  • Dorothea vom Bruch
  • Dániel Berényi
  • Egor Ovcharenko
  • Enrico Calore
  • Felice Pantaleo
  • Filippo Mantovani
  • Gergely Debreczeni
  • Gianluca Lamanna
  • Gianmaria Collazuol
  • Gilles Grasseau
  • Giorgia Mila
  • Giovanni Calderini
  • Giovanni Di Domenico
  • Hannes Vogt
  • Ilaria Neri
  • Ivan Kisel
  • Ivan Reid
  • Jacob Howard
  • Jacopo Pinzino
  • Johannes de Fine Licht
  • Juan José Rodríguez
  • Junichi Kanzaki
  • Lorenzo Rinaldi
  • Luca Pontisso
  • Luca Rei
  • Lucia Morganti
  • Ludovico Bianchi
  • Maik Dankel
  • Marco Corvo
  • Marco Palombo
  • Marco Rescigno
  • Marco Sozzi
  • Mario Schroeck
  • Mario Spera
  • Massimiliano Fiorini
  • Massimo D'Elia
  • Massimo Minuti
  • Matteo Bauce
  • Mauro Belgiovine
  • Michael Sokoloff
  • Michele Mesiti
  • Máté Ferenc Nagy-Egri
  • Natalia Kolomoyets
  • Patrick Steinbrecher
  • Peter Messmer
  • Piero Vicini
  • Richard Calland
  • Roberto Piandani
  • Salvatore Alessandro Tupputi
  • Silvia Arezzini
  • Simone Coscetti
  • Simone Michele Mazza
  • Soon Yung Jun
  • Stefanie Reichert
  • Stefano Gallorini
  • Tobias Winchen
  • Vadim Demchik
  • Vitaliy Schetinin
    • 8:30 AM 9:30 AM
      Registration 1h
    • 9:30 AM 9:45 AM
      Welcome
      • 9:30 AM
        Welcome 15m
        Speaker: Marco Sozzi (PI)
        Slides
    • 9:45 AM 11:15 AM
      GPU in High Level Trigger (1/3)
      Convener: Prof. Ivan Kisel (Goethe University Frankfurt am Main, FIAS Frankfurt Institute for Advanced Studies)
      • 9:45 AM
        The GAP Project: GPU applications for high level trigger and medical imaging 30m
        The aim of the GAP project is the deployment of Graphic Processing Units (GPUs) in real-time applications, ranging from high-energy physics online event selection (trigger) to medical imaging reconstruction. The final goal of the project is to demonstrate that GPUs can have a positive impact in sectors differing in rate, bandwidth, and computational intensity. The most crucial aspects currently under study are the analysis of the total latency of the system, the optimisation of computational algorithms, and the integration with the data acquisition systems. In this contribution we focus on the application of GPUs in asynchronous trigger systems, employed for the high-level trigger of LHC experiments. In particular, we discuss how specific trigger algorithms can be naturally parallelized and thus benefit from implementation on the GPU architecture, in terms of increased execution speed and a more favourable dependency on the complexity of the analyzed events. Such improvements are particularly relevant for the foreseen LHC luminosity upgrade, where highly selective algorithms will be crucial to maintain sustainable trigger rates with very high pileup. We will give details on how these devices can be integrated in a typical LHC trigger system and benchmark their performance. As a study case, we will consider the ATLAS experimental environment and propose a GPU implementation for a typical muon selection in a high-level trigger system.
        Speaker: Matteo Bauce (ROMA1)
        Slides
      • 10:15 AM
        GPU-based quasi-real-time track recognition in imaging devices: from raw data to particle tracks 30m
        Nuclear emulsions as tracking devices suitable for high-energy physics experiments have recently been used by CHORUS, DONUT, PEANUT and OPERA. High statistics is made accessible by fast automatic microscopes for emulsion readout. Former systems had to do most processing on the CPU; the next generation, entering duty at the present time, is based on GPUs. Automatic microscopes need real-time 2D imaging to drive the motion of the microscope axes using feedback from images; 3D track recognition occurs quasi-online (continuous buffered stream), in local clusters of computing servers based on GPUs. The proposed contribution shows the status of the Quick Scanning System (QSS), an evolution of the ESS currently used in OPERA and for muon radiography on volcanic edifices. Compared to vision processors, GPUs provide at least a factor 4 speed-up, and a factor 16 or better if the comparison includes cost. A short account is given of the hardware setup of the system, while the presentation mostly focuses on: 1) 2D image processing on-the-fly at a data rate of about 2 GB/s on a single workstation; 2) the 3D tracking algorithm and the related computing-power distribution and balancing technique; 3) local clusters of GPU-based tracking servers. Many aspects of the algorithms shown are not specific to the application, but tackle common problems of particle tracking in high-energy physics experiments. An outlook on the future evolution and applications of the QSS is given.
        Speaker: Cristiano Bozza (SA)
        Slides
      • 10:45 AM
        GPGPU for track finding and triggering in High Energy Physics 30m
        The LHC experiments are designed to detect a large number of physics events produced at a very high rate. Considering the future upgrades, the data acquisition rate will become even higher, and new computing paradigms must be adopted for fast data processing: General Purpose Graphics Processing Units (GPGPUs) can be used in a novel approach based on massively parallel computing. The intense computational power provided by GPGPUs is expected to reduce the computation time and to speed up low-latency applications and fast decision taking. In particular, this approach could hence be used for high-level triggering in very complex environments, like the typical inner track detectors of the LHC experiments, where a large amount of pile-up events overlaying the interesting physics processes is expected with the luminosity upgrade. In this contribution we discuss two typical use cases where a parallel approach is expected to dramatically reduce the execution time: a track pattern-recognition algorithm based on the Hough transform and a trigger model based on track fitting (a minimal sketch of the Hough-transform approach follows this entry).
        Speaker: Lorenzo Rinaldi (BO)
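        A minimal sketch of the parallelization pattern behind the Hough-transform use case above, assuming a simple 2D straight-line model: one CUDA thread per detector hit, voting into a shared accumulator with atomic increments. Kernel name, binning and constants are illustrative, not taken from the talk.

```cuda
#include <cuda_runtime.h>
#include <math.h>
#include <stdio.h>

#define N_THETA 256    // bins of the angular axis
#define N_RHO   256    // bins of the signed-distance axis
#define RHO_MAX 100.f  // half-range of rho, illustrative units

// One thread per hit: each hit votes for every (theta, rho) line through it.
__global__ void houghFill(const float2 *hits, int nHits, unsigned int *acc)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nHits) return;
    float x = hits[i].x, y = hits[i].y;
    for (int t = 0; t < N_THETA; ++t) {
        float theta = t * (float)M_PI / N_THETA;
        float rho = x * cosf(theta) + y * sinf(theta);
        int r = (int)((rho + RHO_MAX) * N_RHO / (2.f * RHO_MAX));
        if (r >= 0 && r < N_RHO)
            atomicAdd(&acc[t * N_RHO + r], 1u); // concurrent votes need atomics
    }
}

int main()
{
    const int nHits = 1024;
    float2 *d_hits; unsigned int *d_acc;
    cudaMalloc(&d_hits, nHits * sizeof(float2));
    cudaMalloc(&d_acc, N_THETA * N_RHO * sizeof(unsigned int));
    cudaMemset(d_hits, 0, nHits * sizeof(float2));  // stand-in for real hits
    cudaMemset(d_acc, 0, N_THETA * N_RHO * sizeof(unsigned int));
    houghFill<<<(nHits + 255) / 256, 256>>>(d_hits, nHits, d_acc);
    cudaDeviceSynchronize();
    // A peak search over the accumulator would now yield track candidates.
    printf("accumulator filled for %d hits\n", nHits);
    cudaFree(d_hits); cudaFree(d_acc);
    return 0;
}
```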
    • 11:15 AM 11:35 AM
      Coffee Break 20m
    • 11:35 AM 1:05 PM
      GPU in Offline, Monte Carlo and Analysis (1/3)
      Convener: Soon Yung Jun (Fermilab)
      • 11:35 AM
        Fast event generation on GPU 30m
        We use a graphics processing unit (GPU) for fast event generation of general Standard Model (SM) processes. The event generation system on GPU is developed based on the Monte Carlo integration and generation program BASES/SPRING in FORTRAN. For the computation of cross sections of physics processes, all SM interactions are included in the helicity amplitude computation package on GPU (HEGET), and phase-space generation codes are developed for efficient generation of jet-associated processes. For the GPU version of BASES/SPRING, a new algorithm is introduced in order to fully exploit the capability of the GPU, and an improvement in performance of nearly two orders of magnitude is achieved. The generation system is tested for general SM processes by comparison with MadGraph, which is widely used in HEP experiments, and integrated cross sections and distributions of generated events are found to be consistent. In order to enable a smooth migration to the GPU environment, a program that automatically converts FORTRAN amplitude programs generated by MadGraph into CUDA programs has been developed.
        Speaker: Dr Junichi Kanzaki (KEK)
        Slides
      • 12:05 PM
        First experience with portable high-performance geometry code on GPU 30m
        The Geant-Vector prototype is an effort to address the demand for increased performance in HEP experiment simulation software. By reorganizing particle transport towards a vector-centric data layout, the project aims for efficient use of SIMD instructions on modern CPUs, as well as co-processors such as GPUs and Xeon Phi. The geometry is an important part of particle transport, consuming a significant fraction of processing time during simulation. A next-generation geometry library must be adaptable to scalar, vector and accelerator/GPU platforms. To avoid the large potential effort going into developing and maintaining distinct solutions, we want a single code base that can be utilized on multiple platforms and for various use cases. We report on our solution: by employing C++ templating techniques we can adapt geometry operations to SIMD processing, thus enabling the development of abstract algorithms that compile to efficient, architecture-specific code. We introduce the concept of modular "backends", separating architecture-specific code and geometrical algorithms, thereby allowing distinct implementation and validation (a simplified backend sketch follows this entry). We present the templating, compilation and structural architecture employed to achieve this code generality. Benchmarks of methods for navigation in detector geometry will be shown, demonstrating abstracted algorithms compiled to scalar, SSE, AVX and CUDA instructions. By allowing users to dynamically dispatch work to the most suitable processing units at runtime, we can greatly increase hardware occupancy without inflating the code base. Finally, the advantages and disadvantages of a generalized approach will be discussed.
        Speaker: Mr Johannes de Fine Licht (CERN)
        Slides
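        The backend idea described above can be illustrated in a few lines of C++; the sketch below is a deliberately simplified stand-in, not the actual Geant-Vector API. A scalar backend fixes the lane type to double; a SIMD or device backend would substitute a vector type, and the templated algorithm compiles unchanged against either.

```cpp
// Simplified illustration of the "backend" pattern (not the actual API).
#include <cmath>
#include <cstdio>

struct ScalarBackend {
    using Real_v = double;   // one value per lane; a SIMD backend would
                             // substitute a vector type here
};

// The geometry algorithm is written once, against the backend's types:
template <typename Backend>
typename Backend::Real_v safetyToSphere(typename Backend::Real_v x,
                                        typename Backend::Real_v y,
                                        typename Backend::Real_v z,
                                        double radius)
{
    using std::sqrt;   // ADL would pick up a SIMD type's own sqrt instead
    typename Backend::Real_v r2 = x * x + y * y + z * z;
    return sqrt(r2) - radius;   // distance from point to sphere surface
}

int main()
{
    // Instantiated for the scalar backend; a vector backend would process
    // several tracks per call from the same source code.
    double s = safetyToSphere<ScalarBackend>(1.0, 2.0, 2.0, 2.5);
    std::printf("safety = %.2f (expect 0.50)\n", s);
    return 0;
}
```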
      • 12:35 PM
        Hybrid implementation of the Vegas Monte-Carlo algorithm 30m
        Multidimensional integration based on Monte-Carlo (MC) techniques is widely used in High Energy Physics (HEP) and numerous other computing domains. In HEP, it naturally arises from the multidimensional probability densities or from the likelihoods often present in analyses. Today, HPC programming requires dealing with computing accelerators like GPGPUs or 'many-core' processors, but also taking into account development portability and hardware heterogeneity through the use of open programming standards like OpenCL. Among the possible MC algorithms (MISER, Markov Chain, etc.), the choice has been driven by the popularity and the efficiency of the method. The 'Vegas' algorithm is frequently used in LHC analyses, as it is accessible from the ROOT environment and provides reasonably good performance. The parallel implementation of Vegas for computing accelerators presents no major obstacle; however, some technical difficulties occur when dealing with portability and heterogeneity, mainly due to the lack of libraries and development tools (like performance analysis tools). Combining MPI and OpenCL, we will present a scalable distributed implementation. Performance will be shown on different platforms (NVidia K20, Intel Xeon Phi), but also on heterogeneous platforms mixing CPUs and different kinds of computing accelerator cards. The presented work is a canvas to integrate various multidimensional functions for different analysis processes. It is planned to integrate and exploit this implementation in future CMS analysis processes. (A minimal sketch of the non-adaptive MC core follows this entry.)
        Speaker: Mr Gilles GRASSEAU (Laboratoire Leprince-Ringuet (IN2P3/CNRS))
        Slides
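        As background to the talk above, the sketch below shows the non-adaptive core of GPU Monte-Carlo integration: independent per-thread random streams and partial sums reduced on the host. Vegas layers an adaptive importance-sampling grid on top of this core; the CUDA/cuRAND version here (the talk uses OpenCL) and all names are illustrative.

```cuda
#include <cuda_runtime.h>
#include <curand_kernel.h>
#include <stdio.h>
#include <stdlib.h>

__device__ float f(float x, float y) { return x * y; }  // toy integrand

__global__ void mcIntegrate(unsigned long long seed, int perThread,
                            float *partial)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    curandState st;
    curand_init(seed, tid, 0, &st);   // independent stream per thread
    float sum = 0.f;
    for (int i = 0; i < perThread; ++i)
        sum += f(curand_uniform(&st), curand_uniform(&st));
    partial[tid] = sum;               // per-thread partial sums
}

int main()
{
    const int blocks = 64, threads = 256, perThread = 1000;
    const int n = blocks * threads;
    float *d_partial, *h_partial = (float *)malloc(n * sizeof(float));
    cudaMalloc(&d_partial, n * sizeof(float));
    mcIntegrate<<<blocks, threads>>>(1234ULL, perThread, d_partial);
    cudaMemcpy(h_partial, d_partial, n * sizeof(float),
               cudaMemcpyDeviceToHost);
    double total = 0.0;               // final reduction on the host
    for (int i = 0; i < n; ++i) total += h_partial[i];
    printf("estimate = %.4f (exact 0.25)\n",
           total / ((double)n * perThread));
    cudaFree(d_partial); free(h_partial);
    return 0;
}
```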
    • 1:05 PM 2:00 PM
      Lunch 55m
    • 2:00 PM 4:30 PM
      GPU in Lattice QCD
      Convener: Massimo D'Elia (PI)
      • 2:00 PM
        Designing and Optimizing LQCD code using OpenACC 30m
        An increasing number of massively-parallel machines are based on heterogeneous node architectures combining traditional powerful multicore CPUs with energy-efficient accelerators. Programming heterogeneous systems can be cumbersome, and designing efficient codes can be a hard task. The lack of standard programming frameworks for accelerator-based machines makes it more complex; in fact, in most cases the best efficiency can only be achieved by rewriting the code, usually written in C or C++, using proprietary programming languages such as CUDA. OpenACC offers a different approach based on directives. Porting applications to run on hybrid architectures "only" requires annotating existing codes with specific "pragma" instructions. These identify functions to be executed on accelerators and instruct the compiler on how to generate and structure code for a specific target device (a minimal directive example follows this entry). In this talk we present our experience in designing and optimizing an LQCD code targeted at multi-GPU cluster machines, giving details about the implementation and presenting preliminary results.
        Speaker: Dr Enrico Calore (FE)
        Slides
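        The directive-based porting described above can be illustrated with a toy loop; the code below is not taken from the talk's LQCD code. Under an OpenACC compiler the pragma offloads the loop to the accelerator; without one, the pragma is ignored and the loop runs on the CPU.

```cpp
#include <cstdio>

int main()
{
    const int n = 1 << 20;
    static float x[1 << 20], y[1 << 20];
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    // The annotation below is the whole "port": the compiler copies the
    // arrays to device memory, builds a kernel from the loop body, and
    // copies y back. Compile e.g. with `pgc++ -acc`.
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] = 2.0f * x[i] + y[i];

    std::printf("y[0] = %.1f (expect 4.0)\n", y[0]);
    return 0;
}
```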
      • 2:30 PM
        Conjugate Gradient solvers on Intel Xeon Phi and NVIDIA GPUs 30m
        The runtime of a Lattice QCD simulation is dominated by a small kernel, which calculates the product of a vector by a sparse matrix known as the "Dslash" operator. Therefore, this kernel is frequently optimized for various HPC architectures. In this contribution we compare the performance of the Intel Xeon Phi with that of current Kepler-based NVIDIA Tesla GPUs running a conjugate gradient solver. By exposing more parallelism to the accelerator, inverting multiple vectors at the same time, we obtain a performance above 250 GFLOP/s on both architectures (the multiple right-hand-sides idea is sketched after this entry). This more than doubles the performance of the naive separate inversion. A detailed comparison of the performance of the accelerators for different scenarios will be presented in the talk. We also discuss some details of the implementation and the effort required to obtain the achieved performance.
        Speaker: Mr Patrick Steinbrecher (Fakultät für Physik, Universität Bielefeld)
        Slides
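        A minimal sketch of the multiple right-hand-sides idea mentioned above, with a 1D 3-point stencil standing in for the Dslash operator; the data layout and all names are illustrative, not the actual code from the talk. Each thread applies the same operator to several vectors at once, increasing the independent work per thread.

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

#define NRHS 4  // number of vectors inverted simultaneously

// x and y hold NRHS vectors interleaved: element i of vector v at [i*NRHS+v].
// A 1D 3-point stencil stands in for the Dslash operator.
__global__ void stencilMultiRHS(const float *x, float *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    int l = (i > 0) ? i - 1 : n - 1;    // periodic neighbours
    int r = (i < n - 1) ? i + 1 : 0;
    #pragma unroll
    for (int v = 0; v < NRHS; ++v)      // same operator, NRHS vectors
        y[i * NRHS + v] = 2.f * x[i * NRHS + v]
                        - x[l * NRHS + v] - x[r * NRHS + v];
}

int main()
{
    const int n = 1 << 16;
    float *d_x, *d_y;
    cudaMalloc(&d_x, n * NRHS * sizeof(float));
    cudaMalloc(&d_y, n * NRHS * sizeof(float));
    cudaMemset(d_x, 0, n * NRHS * sizeof(float));  // stand-in for real data
    stencilMultiRHS<<<(n + 255) / 256, 256>>>(d_x, d_y, n);
    cudaDeviceSynchronize();
    printf("applied operator to %d right-hand sides of length %d\n", NRHS, n);
    cudaFree(d_x); cudaFree(d_y);
    return 0;
}
```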
      • 3:00 PM
        QCDGPU: an open-source OpenCL tool for Monte Carlo lattice simulations on heterogeneous GPU cluster 30m
        A new open-source tool, QCDGPU, for Monte Carlo lattice simulations of SU(N) gluodynamics and O(N) models has been developed. In particular, the package allows one to study vacuum dynamics in external chromomagnetic fields, spontaneous vacuum magnetization at high temperature in SU(N) gluodynamics, and other new phenomena. The QCDGPU code is implemented in the OpenCL environment and tested on different OpenCL-compatible devices. It supports single- and multi-GPU modes as well as MPI-ready GPU clusters. Built-in microbenchmarks provide adaptive performance autotuning and effective task scheduling among computing devices in very heterogeneous clusters. QCDGPU also has a client-server part for distributed simulations over VPN. The core of the Monte Carlo procedure is based on the PRNGCL library, which contains implementations of the most popular pseudo-random number generators. The package supports single, mixed and full double precision, including pseudo-random number generation. The current version of QCDGPU is available at https://github.com/vadimdi/QCDGPU.
        Speakers: Natalia Kolomoyets (Dnipropetrovsk National University), Dr Vadim Demchik (Dnipropetrovsk National University)
        Slides
      • 3:30 PM
        CL2QCD - Lattice QCD based on OpenCL 30m
        Lattice QCD (LQCD) can benefit greatly from Graphics Processing Units (GPUs), which are well suited for memory-bandwidth-limited applications. Accordingly, their usage in LQCD simulations is still expanding, mainly relying on CUDA, which is applicable to NVIDIA hardware only. A hardware-vendor-independent approach is given by the Open Computing Language (OpenCL). We present CL2QCD, an LQCD software package based on OpenCL, which has been successfully used for non-zero temperature studies on AMD-based clusters. While all mathematical operations are performed in OpenCL, the program logic and the hardware management are carried out in C++. This allows for a clear separation of concerns and, in particular, for a clear distinction between high- and low-level functionality. Several physical applications have been developed; in this contribution we will focus on the HMC implementation for Wilson and twisted-mass Wilson fermions, as well as the RHMC for staggered fermions, and their performance. In addition, we will comment on the concept of unit tests and how it can be applied to LQCD.
        Speaker: Mr Christopher Pinke (Goethe University Frankfurt)
        Slides
      • 4:00 PM
        cuLGT: Lattice gauge fixing on GPUs 30m
        We adopt CUDA-capable Graphics Processing Units (GPUs) for Landau, Coulomb and maximally Abelian gauge fixing in 3+1 dimensional SU(3) and SU(2) lattice gauge field theories. A combination of simulated annealing and overrelaxation is used to aim for the global maximum of the gauge functional. We use a fine-grained degree of parallelism to achieve maximum performance: instead of the common one-thread-per-site strategy, we use 4 or 8 threads per lattice site. Here, we report on an improved version of our publicly available code (www.culgt.com) which again increases performance and is much easier to include in existing code. On the GTX580 we achieve up to 450 GFlops (utilizing 80% of the theoretical peak bandwidth) for the Landau overrelaxation code.
        Speaker: Mr Hannes Vogt (Universitaet Tuebingen)
        Slides
    • 4:30 PM 5:00 PM
      Coffee Break 30m
    • 5:00 PM 6:00 PM
      GPU in High Level Trigger (2/3)
      Convener: Mr Daniel Hugo Campora Perez (CERN)
      • 5:00 PM
        FLES: First Level Event Selection package for the CBM experiment 30m
        The CBM (Compressed Baryonic Matter) experiment is being prepared to operate at the future Facility for Antiproton and Ion Research (FAIR, Darmstadt, Germany). Its main focus is the measurement of very rare probes, which requires interaction rates of up to 10 MHz. Together with the high multiplicity of charged tracks produced in heavy-ion collisions, this leads to huge data rates of up to 1 TB/s. Most trigger signatures are complex (short-lived particles, e.g. open charm decays) and require information from several detector sub-systems. First Level Event Selection (FLES) in the CBM experiment will be performed online on a dedicated processor farm. This requires the development of fast and precise reconstruction algorithms suitable for online data processing. The algorithms have to be intrinsically local and parallel, and thus require a fundamental redesign of traditional approaches to event data processing in order to use the full potential of modern many-core CPU/Phi/GPU architectures. Massive hardware parallelization has to be reflected in mathematical and computational optimization of the algorithms. An overview of the online FLES processor farm concept, the different levels of parallel data processing in the farm from the supervisor down to multi-threading and SIMD vectorization, the implementation of the algorithms in single precision, memory optimization, scalability with respect to the number of cores, and the efficiency, precision and speed of the FLES algorithms are presented and discussed.
        Speaker: Prof. Ivan Kisel (Goethe University Frankfurt am Main, FIAS Frankfurt Institute for Advanced Studies)
        Slides
      • 5:30 PM
        Tree contraction, connected components, minimum spanning trees: a GPU path to vertex fitting 30m
        We consider standard parallel computing operations in the context of algorithms for solving 3D graph problems, which have applications in vertex finding in HEP. Exploiting GPU acceleration for tree accumulation and graph algorithms poses a challenge: GPUs offer extreme computational power and high memory-access bandwidth, combined with a fine-grained parallelism that may not fit the irregular distribution of the linked representation of graph data structures. Assuming n vertices, the computation of minimum spanning trees for 2D graphs has efficient O(n log n) solutions through Delaunay triangulations; these, however, are not applicable in three or more dimensions. General minimum spanning tree computations are limited by lower bounds (Pettie and Ramachandran) demanding work linear in the number of edges, quadratic in the number of vertices for dense graphs. Practical efficient implementations of either the Pettie-Ramachandran or the Chazelle algorithm have been elusive, and lack parallel-architecture formulations. We consider first the efficiency of the tree accumulation algorithms by Reif and Vishkin. We use tree accumulations as a tool in 3D graph connected-components evaluation by combining them with the classical Shiloach-Vishkin algorithm and a randomized tree contraction phase. We also use tree accumulation in an approximation algorithm for minimum spanning trees, comparing it with parallel variations of the Boruvka minimum spanning forest calculation. Minimum spanning trees are at the core of the ZVMST vertex finding algorithm. We discuss implementations of graph connected components and spanning tree calculations on GPU and multi-core architectures.
        Speakers: Dr Ivan Reid (Brunel University), Prof. Peter Hobson (Brunel University), Dr Raul Lopes (Brunel University)
        Slides
    • 6:30 PM 8:00 PM
      Piazza dei Miracoli - guided visit 1h 30m
    • 9:00 AM 10:30 AM
      GPU in Low Level Trigger (1/2)
      Convener: Massimiliano Fiorini (FE)
      • 9:00 AM
        GPU-based Online Tracking for the PANDA experiment 30m
        The PANDA experiment (antiProton ANnihilation at DArmstadt) is a new hadron physics experiment currently being built at FAIR, Darmstadt (Germany). PANDA will study fixed-target collisions of phase-space-cooled antiprotons of 1.5 to 15 GeV/c momentum with protons and nuclei at a rate of 20 million events per second. To distinguish between background and signal events, PANDA will utilize a novel data acquisition mechanism. Instead of relying on fast hardware-level triggers to initiate data recording, PANDA uses a sophisticated software-based event-filtering scheme involving the reconstruction of the whole incoming data stream in real time. A massive amount of computing power is needed in order to sufficiently reduce the incoming data rate of 200 GB/s to 3 PB/year for permanent storage and further offline analysis. An important part of the experiment's online event filter is online tracking, which provides the basis for higher-level discrimination algorithms. To cope with PANDA's high data rate, we explore the feasibility of using GPUs for online tracking. This talk presents the status of the three algorithms currently investigated for PANDA's GPU-based online tracking: a Hough transform, a track finder based on Riemann paraboloids, and a novel algorithm called the Triplet Finder. Their performance and different optimizations are shown. With a current processing time of 20 µs per event, the Triplet Finder in particular is a promising algorithm for making online tracking on GPUs feasible for PANDA.
        Speaker: Mr Andreas Herten (Forschungszentrum Jülich)
        Slides
      • 9:30 AM
        Parallel Neutrino Triggers using GPUs for an underwater telescope 30m
        Graphics Processing Units are high-performance co-processors originally intended to improve the use and quality of computer graphics applications. Because of their performance, researchers have extended their use beyond the computer graphics scope. We have investigated the possibility of implementing and speeding up neutrino online trigger algorithms in the KM3 experiment using a CPU-GPU system. The results of a neutrino trigger simulation on a KM3 14-plane Tower are reported.
        Speaker: Dr Bachir Bouhadef (PI)
        Paper
        Slides
      • 10:00 AM
        Track and Vertex Reconstruction on GPUs for the Mu3e Experiment 30m
        The Mu3e experiment searches for the lepton flavour violating decay mu->eee, aiming at a branching ratio sensitivity better than 10^(-16). To reach this sensitivity, muon rates above 10^9 mu/s are required, which are delivered by the Paul Scherrer Institute in Switzerland. A high precision tracking detector composed of ~300 million pixels combined with excellent timing resolution from scintillating fibers and tiles will measure the momenta, vertices and timing of the decay products of muons stopped in the target. The trigger-less readout system will deliver about one Tbit/s of zero-suppressed data. A network of optical links and switching FPGAs sends the complete detector data for a time slice to one node of the filter farm. An FPGA transfers the event data to the GPU via PCIe direct memory access. The GPU finds and fits tracks using a 3D tracking algorithm for multiple scattering dominated resolution. In a second step, a three track vertex fit is performed, allowing for a reduction of the output data rate to below 100 MB/s. The talk discusses the implementation of the fits on the GPU, which already runs at more than 10^9 track fits/s.
        Speaker: Ms Dorothea vom Bruch (Physikalisches Institut, Universitaet Heidelberg)
        Slides
    • 10:30 AM 11:00 AM
      Coffee Break 30m
    • 11:00 AM 12:15 PM
      GPU in Other Applications (1/2)
      Convener: Mr Felice Pantaleo (CERN)
      • 11:00 AM
        Novel GPU features: Performance and Productivity 45m
        The huge amount of computing power needed for signal processing and offline simulation makes High-Energy Physics an ideal target for GPUs. Since the first versions of CUDA, considerable progress has been made in demonstrating the benefit of GPUs for these processing pipelines, and GPUs are now being deployed in production systems. However, early experiments also showed some of the challenges encountered in HEP-specific tasks, including memory footprint, complex control flow, phases of limited concurrency, and portability. Many of these concerns have been addressed with recent changes to the GPU hardware and software infrastructure: unified memory, dynamic parallelism, and priority streams are just some of the features at the developer's disposal to take full advantage of the available hardware (two of these features are sketched after this entry). In addition, recently introduced boards like the TK1 processor for embedded high-performance, low-power applications now enable CUDA-accelerated applications all the way from the sensor to offline simulation. In this talk I will present some of the latest additions to the GPU hardware, provide an overview of the recent changes to the software infrastructure, and walk through features added in the latest CUDA version.
        Speaker: Dr Peter Messmer (NVIDIA)
        Slides
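        A minimal sketch of two of the features mentioned in the abstract, unified memory and priority streams, assuming a CUDA 6-era toolkit and a compute-capability 3.0+ device; the kernel and all values are illustrative.

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void scale(float *a, int n, float s)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] *= s;
}

int main()
{
    const int n = 1 << 20;
    float *a, *b;
    cudaMallocManaged(&a, n * sizeof(float));  // visible to host and device
    cudaMallocManaged(&b, n * sizeof(float));
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 1.0f; } // plain host writes

    int least, greatest;                       // priority range of this device
    cudaDeviceGetStreamPriorityRange(&least, &greatest);
    cudaStream_t urgent, background;
    cudaStreamCreateWithPriority(&urgent, cudaStreamNonBlocking, greatest);
    cudaStreamCreateWithPriority(&background, cudaStreamNonBlocking, least);

    // Independent work on two streams; the scheduler favours `urgent`,
    // the pattern used to keep latency-critical kernels responsive.
    scale<<<(n + 255) / 256, 256, 0, background>>>(a, n, 2.0f);
    scale<<<(n + 255) / 256, 256, 0, urgent>>>(b, n, 3.0f);
    cudaDeviceSynchronize();

    printf("a[0] = %.1f, b[0] = %.1f (expect 2.0, 3.0)\n", a[0], b[0]);
    cudaStreamDestroy(urgent); cudaStreamDestroy(background);
    cudaFree(a); cudaFree(b);
    return 0;
}
```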
      • 11:45 AM
        Commodity embedded technology and mobile GPUs for future HPC systems 30m
        Around 2005-2008, (mostly) economic reasons led to the adoption of commodity GPUs in high-performance computing. This transformation has been so effective that in 2013 the TOP500 list of supercomputers is still dominated by heterogeneous architectures based on CPU+coprocessor nodes. In 2013, the largest commodity market in computing is not that of PCs or GPUs, but mobile computing, comprising smartphones and tablets, most of which are built with ARM-based Systems on Chip (SoCs). This suggests that, once mobile SoCs deliver sufficient performance, they can help reduce the cost of HPC. Moreover, mobile SoCs embed GPUs that are in many cases OpenCL/CUDA capable, so most of the computing experience gained over these years can be highly beneficial. Since the end of 2011 the Mont-Blanc project has been tackling the challenges related to the use of mobile SoCs in an HPC environment, developing a prototype, evaluating heterogeneous CPU+GPU architectures and porting libraries and scientific applications to ARM-based architectures. In view of the experience gained within the Mont-Blanc project at the Barcelona Supercomputing Center, this contribution will show preliminary results on the performance of heterogeneous CPU+GPU computation on mobile SoCs, and will describe the possibilities and challenges involved in developing high-performance computing platforms from low-cost, energy-efficient mobile processors and commodity components.
        Speaker: Dr Filippo Mantovani (Barcelona Supercomputing Center)
        Slides
    • 12:15 PM 1:00 PM
      Poster Session
      • 12:15 PM
        CUDA implementation of CG inverters for the Faddeev-Popov matrix 45m
        The study of Green's functions in Yang-Mills theory in minimal Landau gauge (MLG) may offer crucial insights for the understanding of quark confinement in quantum chromodynamics. In MLG, the functional integral over gauge-field configurations is restricted to the set of transverse configurations for which the so-called Faddeev-Popov (FP) matrix is positive definite. Thus, this matrix should encode all the relevant (non-perturbative) aspects of the theory, related to the color-confinement mechanism. In particular, the inverse of the FP matrix enters into the evaluation of several fundamental Green's functions of the theory, such as the ghost propagator, the ghost-gluon vertex, the Bose-ghost propagator, etc. These Green's functions can be computed through Monte Carlo simulations using the lattice formulation of gauge theories. However, the numerical inversion of the FP matrix is rather time consuming, since it is a huge (sparse) matrix with an extremely small eigenvalue, thus requiring the use of a parallel preconditioned conjugate-gradient (CG) algorithm. Moreover, for each lattice configuration, this inversion has to be done for hundreds of different kinematic combinations. In fact, this matrix inversion is the performance bottleneck for these numerical studies. In this poster we present several preconditioned CG algorithms and their implementation (through CUDA) in double and mixed precisions using multiple GPUs. In particular, we report on the performance of the code for Tesla and Kepler GPUs, as well as on its weak and strong scaling for up to 32 GPUs interconnected by InfiniBand.
        Speaker: Attilio Cucchieri (University of São Paulo)
        Slides
      • 12:15 PM
        GPUs for real-time processing in HEP trigger systems 45m
        Speakers: Andrea Messina (ROMA1), Gianluca Lamanna (LNF), Massimiliano Fiorini (FE)
        Poster
      • 12:15 PM
        Implementation of a Data Transfer Method for GPU-based Online Tracking for the PANDA Experiment 45m
        PANDA (AntiProton Annihilation at Darmstadt) is a new hadron physics experiment currently under construction at FAIR, Darmstadt. PANDA will analyze reactions of antiprotons at 1.5 to 15 GeV/c momentum with protons and other heavier nuclei. In this energy region, signal and background events have similar signatures, rendering a conventional hardware-level trigger unfeasible. PANDA will instead employ a software-based data acquisition paradigm, reconstructing the whole event stream in realtime to perform background rejection. Our research focuses on the use of Graphics Processing Units (GPUs) for online tracking, an essential part of PANDA's online event reconstruction and filtering process. At an average collision rate of 20 million events per second, PANDA will require a massive amount of computational power to reduce the incoming raw data rate from 200 GB/s to the 3 PB/year of storage available for offline analysis. In order to reach this goal, it is vital to ensure optimal performance of the whole data manipulation chain, including data transfer to and from the GPU devices. This poster outlines PANDA's progress in GPU-based online tracking, and introduces our work on GPU data transfer with FairMQ, a flexible abstraction layer using message queues to communicate with different consumers.
        Speaker: Mr Ludovico Bianchi (Forschungszentrum Jülich)
        Poster
      • 12:15 PM
        Photon Propagation with GPUs in IceCube 45m
        Describing the propagation of a large number of photons in a transparent medium is a computational problem of a highly parallel nature. All of the simulated photons go through the same stages: they are emitted, they may scatter a few times, and they get absorbed. These steps, when performed in parallel on a large number of photons, can be done very efficiently on a GPU (a minimal per-photon sketch follows this entry). The IceCube collaboration uses parallelized code that runs on both GPUs and CPUs to simulate photon propagation in a variety of settings, with significant gains in precision and, in many cases, speed of the simulation compared to the table-lookup-based code. The same code is also used for the detector medium calibration and as part of an event reconstruction tool. I will describe the code and discuss some of its applications within our collaboration.
        Speaker: Dmitry Chirkin (UW-Madison, U.S.A.)
        Slides
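        A minimal sketch of the per-photon parallelism described above: one thread per photon, drawing exponential scattering and absorption distances from its own cuRAND stream. The constants and the simplified model (no direction tracking) are illustrative, not the actual IceCube propagation code.

```cuda
#include <cuda_runtime.h>
#include <curand_kernel.h>
#include <stdio.h>

// One thread per photon: emit, scatter a few times, absorb.
__global__ void propagate(unsigned long long seed, int nPhotons,
                          float scatLen, float absLen, float *pathLen)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nPhotons) return;
    curandState st;
    curand_init(seed, i, 0, &st);
    // Exponentially distributed distance until absorption.
    float remaining = -absLen * logf(curand_uniform(&st));
    float travelled = 0.f;
    for (;;) {
        // Exponentially distributed distance to the next scattering point.
        float step = -scatLen * logf(curand_uniform(&st));
        if (step >= remaining) { travelled += remaining; break; } // absorbed
        travelled += step;      // scattered: a full simulation would rotate
        remaining -= step;      // the direction vector here
    }
    pathLen[i] = travelled;
}

int main()
{
    const int n = 1 << 16;
    float *d_len;
    cudaMalloc(&d_len, n * sizeof(float));
    propagate<<<(n + 255) / 256, 256>>>(42ULL, n, 25.f, 100.f, d_len);
    cudaDeviceSynchronize();
    printf("propagated %d photons\n", n);
    cudaFree(d_len);
    return 0;
}
```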
      • 12:15 PM
        Study of SU(N) LGT in an external chromomagnetic field with QCDGPU 45m
        The vacuum structure of lattice SU(N) gluodynamics in the presence of an external chromomagnetic field is studied with the open-source package QCDGPU. The package is adapted to investigate vacuum thermodynamics at non-zero chromomagnetic field, both at zero and at finite temperature. In particular, the QCDGPU package allows one to explore such an important problem as spontaneous chromomagnetic field generation in the high-temperature phase. The package provides measurements of standard lattice quantities as well as some non-standard quantities like the spatial distribution of the Polyakov loop, the components of the field tensor, and so on. In particular, these may be used to study the coexistence of chromoelectric and chromomagnetic fields, to investigate the structure of the $A_0$ condensate, etc. The direct field-strength measurement provides an alternative way to investigate spontaneous vacuum magnetization at high temperature. Examples of QCDGPU usage in exploring some properties of SU(2) and SU(3) gauge theories are presented.
        Speaker: Ms Natalia Kolomoyets (Dnipropetrovsk National University)
        Poster
    • 1:00 PM 2:00 PM
      Lunch 1h
    • 2:00 PM 4:00 PM
      GPU in Offline, Monte Carlo and Analysis (2/3)
      Convener: Piero Vicini (ROMA1)
      • 2:00 PM
        GPUs for Higgs boson data analysis at the LHC using the Matrix Element Method 30m
        The matrix element method utilizes ab initio calculations of probability densities as powerful discriminants to extract small signals from large backgrounds at hadron collider experiments. The computational complexity of this method for final states with many particles and degrees of freedom sets it at a disadvantage compared to supervised classification methods such as decision trees, k nearest-neighbour, or neural networks. We will present a concrete implementation of the matrix element technique in the context of Higgs boson analysis at the LHC employing graphics processing units. Due to the intrinsic parallelizability of multidimensional phase space integration, dramatic speedups can be readily achieved, which makes the matrix element technique viable for general usage at collider experiments.
        Speaker: Dr Bernd Stelzer (Simon Fraser University)
        Slides
      • 2:30 PM
        Fast 3D track reconstruction for antiproton annihilation analysis using GPUs 30m
        Fast 4pi solid-angle particle track recognition has been a challenge in particle physics for a long time, especially when using nuclear emulsion detectors. In particular, the data rate from emulsion detectors, i.e. from a scanning microscope, is about 10-100 TB/day. Real-time 3D volume processing and track reconstruction of such a quantity of data, without limiting the angular acceptance, need a large amount of computation, in which GPU technology plays an essential role. In order to reconstruct annihilations of antiprotons, a fast 4pi solid-angle particle track reconstruction based on GPU technology combined with multithread programming has been developed. By employing 3 state-of-the-art GPUs with multithread programming, an at least 60-times-faster processing of 3D emulsion detector data has been achieved, with excellent tracking performance in comparison with single-thread CPU processing. This tracking framework will be used in a wide range of applications, like analyses of antiproton annihilations and neutron dosimetry.
        Speaker: Dr Akitaka Ariga (University of Bern)
        Slides
      • 3:00 PM
        GPUs in gravitational wave data analysis 30m
        Gravitational wave physics is on the doorstep of a new, very exciting era. When the Advanced Virgo and Advanced LIGO detectors start their operation, there will be a considerable probability of performing the first direct detection of gravitational waves, predicted almost 100 years ago by Einstein's theory of General Relativity. However, the extraction of the faint signal from the noisy measurement data is a challenging task and, due to the high arithmetic density of the algorithms, requires special methods and their efficient, sophisticated implementation on high-end many-core architectures such as GPUs, APUs, MIC and FPGAs. The operation-level parallelizability of the algorithms, so far executed on single CPU cores, has already resulted in nearly two orders of magnitude speedup of the analysis, which can directly be translated into detector sensitivity! As such, the developed and applied computational algorithms can be regarded as part of the instrument, thus giving thorough meaning to the notion of "e-detectors". In this talk we will briefly present and discuss the many-core GPU algorithms used in gravitational wave data analysis for extracting the continuous waves emitted by isolated, spinning neutron stars and the chirp-like signals of binary NS-NS or NS-BH systems, with an outlook on future possibilities.
        Speaker: Dr Gergely Debreczeni (Wigner Research Centre for Physics of the Hungarian Academy of Sciences)
        Slides
      • 3:30 PM
        Accelerated neutrino oscillation probability calculations and reweighting on GPUs 30m
        Neutrino oscillation experiments are reaching high levels of precision in measurements, which are critical for the search for CP violation in the neutrino sector. The inclusion of matter effects increases the computational burden of oscillation probability calculations. The independence of reweighting individual events in a Monte Carlo sample lends itself to parallel implementation on a Graphics Processing Unit (a minimal per-event sketch follows this entry). The library Prob3++ was ported to the GPU using the CUDA C API, allowing for large-scale parallelized calculations of neutrino oscillation probabilities through matter of constant density, decreasing the execution time by two orders of magnitude compared to performance on a single CPU. Additionally, benefit can be gained by porting some systematic-uncertainty calculations to the GPU, especially non-linear uncertainties evaluated on splines. The implementation of a fast, parallel spline evaluation on a GPU is discussed.
        Speaker: Mr Richard Calland (University of liverpool)
        Slides
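        A minimal sketch of per-event parallel reweighting, assuming the standard two-flavour vacuum oscillation formula; Prob3++ as used in the talk treats three flavours and matter effects. Kernel and parameter names are illustrative.

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

// Two-flavour vacuum formula:
//   P = sin^2(2*theta) * sin^2(1.267 * dm2[eV^2] * L[km] / E[GeV])
__global__ void oscWeight(const float *energy, float *weight, int n,
                          float sin2_2theta, float dm2, float L)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float s = sinf(1.267f * dm2 * L / energy[i]);
    weight[i] = sin2_2theta * s * s; // each event independent: ideal on GPU
}

int main()
{
    const int n = 1 << 20;
    float *d_E, *d_w;
    cudaMalloc(&d_E, n * sizeof(float));
    cudaMalloc(&d_w, n * sizeof(float));
    // ... fill d_E with per-event neutrino energies in GeV, then reweight:
    oscWeight<<<(n + 255) / 256, 256>>>(d_E, d_w, n, 0.95f, 2.4e-3f, 295.f);
    cudaDeviceSynchronize(); // d_w now holds one oscillation weight per event
    printf("reweighted %d events\n", n);
    cudaFree(d_E); cudaFree(d_w);
    return 0;
}
```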
    • 4:00 PM 4:30 PM
      Coffee Break 30m
    • 4:30 PM 7:00 PM
      GPU in High Level Trigger (3/3)
      Convener: Andrea Messina (ROMA1)
      • 4:30 PM
        Manycore feasibility studies at the LHCb trigger 30m
        The LHCb trigger is a real-time system with high computation requirements, where incoming data from the LHCb detector are analyzed and selected by applying a chain of algorithms. The infrastructure that sustains the current trigger consists of Intel Xeon based servers and is designed for sequential execution. We have extended the current software infrastructure to include support for offloaded execution on many-core platforms like graphics cards or the Intel Xeon Phi. In this paper, we present the latest developments of our offloading mechanism, and we also show feasibility studies of subdetector-specific problems which may benefit from a many-core approach.
        Speaker: Mr Daniel Hugo Campora Perez (CERN)
        Slides
      • 5:00 PM
        Track pattern-recognition on GPGPUs in the LHCb experiment 30m
        The LHCb experiment is entering its upgrade phase, with its detector and read-out system re-designed to cope with the increased LHC energy after the long shutdown of 2018. In this upgrade, a trigger-less data acquisition system is being developed to read out the full detector at the bunch-crossing rate of 40 MHz. In particular, the High Level Trigger (HLT) system, where the bulk of the trigger decision is implemented in software on a CPU farm, has to be heavily revised. Given the small LHCb event size (about 100 kB), many-core architectures such as General Purpose Graphics Processing Units (GPGPUs) and multi-core CPUs can be used to process many events in parallel for real-time selection, and may offer a solution for reducing the cost of the HLT farm. Track reconstruction and vertex finding are the most time-consuming applications running in the HLT and are therefore the first to be ported to many-core architectures. In this talk we present our implementation of the existing tracking algorithms on GPGPUs, discussing in detail the case of the VErtex LOcator detector (VELO), and we show the achieved performance. We also discuss other tracking algorithms that can be used in view of the LHCb upgrade.
        Speaker: Stefano Gallorini (PD)
        Slides
      • 5:30 PM
        A GPU-based track reconstruction in the core of high pT jets in CMS 30m
        The Large Hadron Collider is presently undergoing work to increase the centre-of-mass energy to 13 TeV and to reach a much higher beam luminosity. It is scheduled to return to operation in early 2015. With the increasing amount of data delivered by the LHC, the experiments are facing enormous challenges to adapt their computing resources, also in terms of CPU usage. This trend will continue with the planned future upgrade to the High-Luminosity LHC. Of particular interest is the full reconstruction of the decay products of 3rd-generation quarks in high-pT jets, which have a crucial role in searches for new physics at the energy frontier. At high pT, tracks from B-decays become more collimated, reducing the track-finding efficiency of generic tracking algorithms in the core of the jet. The problem of reconstructing high-pT tracks in the core of the jet, once a narrow eta-phi region around the jet is defined, was found to be especially well suited to GPU programming techniques due to the combinatorial complexity of the algorithm. Our approach to the problem will be described, with particular focus on the partitioning of the problem to map onto the GPU architecture and improve load balancing. To conclude, measurements are presented showing the execution speedups achieved via multi-threaded and CUDA code in the context of the object-oriented C++ software framework (CMSSW) used to process data acquired by the CMS detector at the LHC.
        Speaker: Mr Felice Pantaleo (CERN)
        Slides
      • 6:00 PM
        An evaluation of the potential of GPUs to accelerate tracking algorithms for the ATLAS trigger 30m
        The potential of GPUs has been evaluated as a possible way to accelerate trigger algorithms for the ATLAS experiment located at the Large Hadron Collider (LHC). During LHC Run 1, ATLAS employed a three-level trigger system to progressively reduce the LHC collision rate of 20 MHz to a storage rate of about 600 Hz for offline processing. Reconstruction of charged-particle trajectories through the Inner Detector (ID) was performed at the second (L2) and third (EF) trigger levels. The ID contains pixel, silicon strip (SCT) and straw-tube technologies. Prior to tracking, data-preparation algorithms processed the ID raw data, producing measurements of the track position at each detector layer. Data preparation and tracking consumed almost three-quarters of the total L2 CPU resources during 2012 data-taking. Detailed performance studies of a CUDA implementation of the L2 pixel and SCT data-preparation and tracking algorithms running on an Nvidia Tesla C2050 GPU have shown a speed-up by a factor of 12 for the tracking code, and by up to a factor of 26 for the data-preparation code, compared to the equivalent C++ code running on a CPU. A client-server technology has been used to interface the CUDA code to the CPU-based software, allowing the GPU resource to be shared between several CPU tasks. A re-implementation of the pixel data-preparation code in OpenCL has also been performed, offering the advantage of portability between various GPU and multi-core CPU architectures.
        Speakers: Dr Denis Oliveira Damazio (Brookhaven National Laboratory), Mr Jacob Howard (University of Oxford)
        Slides
      • 6:30 PM
        Use of hardware accelerators for ATLAS computing 30m
        Modern HEP experiments produce tremendous amounts of data. These data are processed by in-house developed software frameworks which have lifetimes longer than the detector itself. Such frameworks were traditionally based on serial code and relied on advances in CPU technology, mainly clock frequency, to cope with increasing data volumes. With the advent of many-core architectures and GPGPUs, this paradigm has to shift to parallel processing and has to include the use of co-processors. However, since the design of most existing frameworks is based on the assumption of frequency scaling and predates co-processors, parallelisation and the integration of co-processors are not easy tasks. The ATLAS experiment is an example of such a big experiment, with a big software framework called Athena. In this talk we will present studies on parallelisation and co-processor (GPGPU) use in data preparation and tracking for trigger and offline reconstruction, as well as their integration into the multiple-process-based Athena framework.
        Speakers: Mr Maik Dankel (CERN), Dr Sami Kama (Southern Methodist University Dallas/US)
        Slides
    • 8:00 PM 9:00 PM
      Dinner 1h
    • 9:00 AM 11:00 AM
      GPU in Offline, Monte Carlo and Analysis (3/3)
      Convener: Marco Sozzi (PI)
      • 9:00 AM
        Sampling secondary particles in high energy physics simulation on the GPU 30m
        We present a massively parallel application for sampling secondary particles in high energy physics (HEP) simulation on a Graphics Processing Unit (GPU). HEP experiments primarily use the Geant4 toolkit to simulate the passage of particles through a general-purpose detector, which requires intensive computing resources due to the complexity of the geometry as well as the physics processes applied to particles copiously produced by primary collisions and secondary interactions. The combined composition and rejection methods often used in Geant4 to sample secondary particles may not be suitable for optimal performance of HEP event simulation on recent accelerated or many-core processor architectures, owing to the stochastic nature of the Monte Carlo technique. An alternative approach based on a discrete inverse cumulative probability distribution is explored to minimize divergence in thread-level parallelism, as well as to vectorize physics processes for spatial locality and instruction throughput (a minimal sampling sketch follows this entry). The inverse cumulative distribution of the differential cross section associated with each electromagnetic physics process is tabulated, based on algorithms excerpted from Geant4, and a simple random sampling technique with linear interpolation is implemented for the GPU. Validation and performance evaluation of the alternative technique, compared to the conventional composition-and-rejection method both on GPU and CPU, are presented.
        Speaker: Soon Yung Jun (Fermilab)
        Slides
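        A minimal sketch of the inverse-CDF sampling described above, assuming a pre-tabulated inverse cumulative distribution: every thread performs the same fixed table lookup and linear interpolation, avoiding the divergent loops of composition-rejection. The table below (a unit exponential distribution) is illustrative, not a Geant4-derived table.

```cuda
#include <cuda_runtime.h>
#include <curand_kernel.h>
#include <math.h>
#include <stdio.h>

#define NBINS 128

// invCdf[k] approximates the value below which a fraction k/(NBINS-1) of
// samples fall. Every thread does the same fixed amount of work:
// no rejection loop, hence no thread divergence.
__global__ void sampleSecondaries(unsigned long long seed, int n,
                                  const float *invCdf, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    curandState st;
    curand_init(seed, i, 0, &st);
    float u = curand_uniform(&st) * (NBINS - 1);  // position on the table
    int k = min((int)u, NBINS - 2);
    float frac = u - k;
    out[i] = invCdf[k] * (1.f - frac) + invCdf[k + 1] * frac; // linear interp.
}

int main()
{
    const int n = 1 << 20;
    float h_table[NBINS], *d_table, *d_out;
    // Illustrative table: inverse CDF of a unit exponential distribution.
    for (int k = 0; k < NBINS; ++k)
        h_table[k] = -logf(1.f - (k + 0.5f) / NBINS);
    cudaMalloc(&d_table, NBINS * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_table, h_table, NBINS * sizeof(float),
               cudaMemcpyHostToDevice);
    sampleSecondaries<<<(n + 255) / 256, 256>>>(7ULL, n, d_table, d_out);
    cudaDeviceSynchronize();
    printf("drew %d samples\n", n);
    cudaFree(d_table); cudaFree(d_out);
    return 0;
}
```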
      • 9:30 AM
        Implementation of a Thread-Parallel, GPU-Friendly Function Evaluation Library 30m
        GooFit is a thread-parallel, GPU-friendly function evaluation library, nominally designed for use with the maximum likelihood fitting program MINUIT. In this use case, it provides highly parallel calculations of normalization integrals and log(likelihood) sums. A key feature of the design is its use of the Thrust library to manage all parallel kernel launches. This allows GooFit to execute on any architecture for which Thrust has a backend, currently including CUDA for nVidia GPUs and OpenMP for single- and multi-core CPUs (the Thrust pattern is sketched after this entry). Running on an nVidia C2050, GooFit executes as much as 300 times faster on a complex high energy physics problem than the prior (algorithmically equivalent) code running on a single CPU core. This talk will focus on design and implementation issues, in addition to performance.
        Speaker: Michael Sokoloff (University of Cincinnati)
        Slides
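        The Thrust pattern underlying GooFit can be sketched in a few lines: a single transform_reduce maps a PDF over every event and sums the negative log-likelihood, with Thrust selecting the parallel backend at compile time. The Gaussian PDF and the values below are illustrative; GooFit's real PDF and MINUIT machinery are far more general.

```cuda
#include <thrust/device_vector.h>
#include <thrust/transform_reduce.h>
#include <thrust/functional.h>
#include <cstdio>

struct NegLogGauss {
    float mu, sigma;
    NegLogGauss(float m, float s) : mu(m), sigma(s) {}
    __host__ __device__ float operator()(float x) const {
        float z = (x - mu) / sigma;
        // -log of a normalised Gaussian density at x
        return 0.5f * z * z + logf(sigma) + 0.918938533f; // log(sqrt(2*pi))
    }
};

int main()
{
    thrust::device_vector<float> data(1000, 1.0f); // toy dataset on the GPU
    // One call: map the PDF over all events and reduce to the total NLL.
    float nll = thrust::transform_reduce(data.begin(), data.end(),
                                         NegLogGauss(0.f, 1.f),
                                         0.f, thrust::plus<float>());
    std::printf("NLL = %f\n", nll);
    // MINUIT would now vary (mu, sigma) and repeat this evaluation
    // at each step of the minimisation.
    return 0;
}
```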
      • 10:00 AM
        Discovering matter-antimatter asymmetries with GPUs 30m
        The search for matter-antimatter asymmetries requires analyses of the highest precision, and thus very large datasets and intensive computing. This contribution discusses two complementary approaches where GPU systems have been successfully exploited in this area. Both approaches make use of the CUDA Thrust library. The first approach is a generic search for local asymmetries in phase-space distributions of matter and antimatter particle decays. This powerful analysis method had never been used to date due to its high demand on CPU time. The implementation details on a GPU system, which allowed this method to be used for the first time, as well as its performance, including on GPUs on the grid, will be discussed in detail. The second approach uses the GooFit framework, a generic fitting framework that exploits massive parallelisation on GPUs. Its performance for the use case of a many-parameter fit to a large dataset is discussed, as well as its interface from a user's point of view.
        Speaker: Ms Stefanie Reichert (The University of Manchester)
        Slides
      • 10:30 AM
        Prospects of GPGPU in the Offline Software Framework 30m
        The Pierre Auger Observatory is currently the largest experiment dedicated to unveiling the nature and origin of the highest-energy cosmic rays. The software framework 'Offline' has been developed by the Pierre Auger Collaboration for the joint analysis of data from the different independent detector systems used in one observatory. While the reconstruction modules are specific to the Pierre Auger Observatory, components of the Offline framework are also used by other experiments. The software framework has recently been extended to incorporate data from the Auger Engineering Radio Array (AERA), the radio extension of the Pierre Auger Observatory. The reconstruction of the data of such radio detectors requires the repeated evaluation of complex antenna gain patterns, which significantly increases the required computing resources in the joint analysis. In this contribution we explore the usability of massive parallelization of parts of the Offline code on GPUs. We present the results of a systematic profiling of the joint analysis in the Offline software framework, aiming at the identification of code areas suitable for parallelization on GPUs. Possible strategies and obstacles for the use of GPGPU in an existing experiment framework are discussed.
        Speaker: Dr Tobias Winchen (University of Wuppertal)
        Slides
    • 11:00 AM 11:30 AM
      Coffee Break 30m
    • 11:30 AM 1:15 PM
      GPU in Low Level Trigger (2/2)
      Convener: Silvia Arezzini (PI)
      • 11:30 AM
        INTEL HPC portfolio 45m
        The use of heterogeneous architectures in HPC at the large scale has become increasingly common over the past few years. One new technology for HPC is the Intel Xeon Phi co-processor, which is x86-based, hosts its own Linux OS, and is capable of running most codes with little porting effort. However, the Xeon Phi architecture has significant features that differ from those of Xeon CPUs. Attaining optimal performance requires an understanding of the possible execution models and of the architecture. This talk highlights various options in the Intel HPC portfolio.
        Speaker: Emiliano Politano
        Slides
      • 12:15 PM
        The GAP project: GPU for online processing in low-level trigger 30m
        We describe a pilot project for the use of GPUs (Graphics Processing Units) in online triggering applications for high energy physics experiments. General-purpose computing on GPUs is emerging as a new paradigm in several fields of science, although so far applications have been tailored to the specific strengths of such devices as accelerators in offline computation. With the steady reduction of GPU latencies, and the increase in link and memory throughput, the use of such devices for real-time applications in high-energy physics data acquisition and trigger systems is becoming ripe. We will discuss in detail the use of online parallel computing on GPUs for synchronous low-level triggers. We will show the results of two solutions to reduce the data transmission latency: the first based on a fast-capture special driver, and the second based on direct GPU communication with the NaNet board. We will present preliminary results of a first field test in the CERN NA62 experiment. This study is done in the framework of GAP (GPU Application Project), a wider project intended to study the use of GPUs in real-time applications.
        Speaker: Massimiliano Fiorini (FE)
        Slides
      • 12:45 PM
        An FPGA-based Network Interface Card with GPUDirect enabling real-time GPU computing in HEP experiments 30m
        While the GPGPU paradigm is widely recognized as an effective approach to high-performance computing, its usage in low-latency, real-time systems is still at an early stage in HEP experiments. GPUs typically show deterministic behaviour in terms of processing latency once data are available in their internal memories, but assessing the real-time features of a standard GPGPU system requires a careful characterization of all subsystems along the data stream path. The networking subsystem proves to be the most critical one in terms of latency fluctuations. Our envisioned solution to this issue is NaNet, an FPGA-based PCIe Network Interface Card (NIC) design featuring a configurable set of network channels with direct access to NVIDIA Fermi/Kepler GPU memories (GPUDirect). The NaNet design currently supports both standard channels - 1 GbE (1000Base-T) and 10 GbE (10GBase-R) - and custom ones - 34 Gbps APElink and 2.5 Gbps deterministic-latency KM3link - but its modularity allows for straightforward inclusion of other link technologies. To avoid host OS intervention on the data stream and to remove a possible source of jitter, the design includes a transport-layer offload module with cycle-accurate upper-bound latency, supporting the UDP, KM3link Time Domain Multiplexing and APElink protocols. After a description of the NaNet architecture and its latency/bandwidth characterization for all supported links, two real-world use cases will be presented: the GPU-based low-level trigger for the RICH detector in the NA62 experiment and the on-/off-shore data link for the KM3 underwater neutrino telescope. NaNet performance in both experiments will be presented and discussed.
        Speaker: Alessandro Lonardo (ROMA1)
        Slides
    • 1:15 PM 2:15 PM
      Lunch 1h
    • 2:15 PM 3:45 PM
      GPU in Other Applications (2/2)
      Convener: Claudio Bonati (PI)
      • 2:15 PM
        Using GPUs to Solve the Classical N-Body Problem in Physics and Astrophysics 30m
        Computational physics has experienced fast growth in the last few years, thanks also to the advent of new technologies such as Graphics Processing Units (GPUs). GPUs are currently used for a plethora of scientific applications. In particular, in computational astrophysics, GPUs can speed up the solution of many problems, like data processing and the study of the dynamical evolution of stellar systems. The gain in terms of performance can be more than a factor of 100 with respect to the use of Central Processing Units (CPUs) alone. In this talk I will show some techniques and strategies adopted to speed up the classical Newtonian N-body problem using GPUs (the core force loop is sketched after this entry), and I will present HiGPUs, a fully parallel, direct N-body code developed at the Department of Physics of Sapienza University of Rome. I will also discuss several promising applications of GPUs in astrophysics concerning high-energy phenomena, like the mutual interaction of black holes and the dynamical evolution of dense stellar environments around supermassive black holes. Although the main applications of this code are in astrophysics, some of the techniques discussed in this talk are of general validity and can be efficiently applied to other branches of physics like, for example, electrodynamics and QCD. For this reason, my talk is a fruitful link between the themes discussed in this Pisa meeting and those of "Perspectives of GPU Computing in Physics and Astrophysics", held in Rome on September 15-17, 2014.
        Speaker: Dr Mario Spera (INAF - Astronomical Observatory of Padova)
        Slides
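        A minimal sketch of the direct-summation kernel at the heart of N-body codes such as HiGPUs: one thread per body, accumulating softened Newtonian accelerations over all pairs. Names, units and the softening value are illustrative; a production code would add shared-memory tiling, higher-order integrators and multi-GPU domain decomposition.

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void accel(const float4 *pos, float3 *acc, int n, float soft2)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float4 pi = pos[i];
    float3 a = make_float3(0.f, 0.f, 0.f);
    for (int j = 0; j < n; ++j) {          // every pair: O(N^2) total work
        float4 pj = pos[j];                // pj.w holds the mass
        float dx = pj.x - pi.x, dy = pj.y - pi.y, dz = pj.z - pi.z;
        float r2 = dx * dx + dy * dy + dz * dz + soft2; // softened distance
        float inv = rsqrtf(r2);
        float w = pj.w * inv * inv * inv;  // m_j / r^3 (self term adds zero)
        a.x += w * dx; a.y += w * dy; a.z += w * dz;
    }
    acc[i] = a;
}

int main()
{
    const int n = 4096;
    float4 *d_pos; float3 *d_acc;
    cudaMalloc(&d_pos, n * sizeof(float4));
    cudaMalloc(&d_acc, n * sizeof(float3));
    cudaMemset(d_pos, 0, n * sizeof(float4));
    // ... load particle positions and masses into d_pos, then:
    accel<<<(n + 255) / 256, 256>>>(d_pos, d_acc, n, 1e-4f);
    cudaDeviceSynchronize();
    printf("computed accelerations for %d bodies\n", n);
    cudaFree(d_pos); cudaFree(d_acc);
    return 0;
}
```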
      • 2:45 PM
        Fast Cone-beam CT reconstruction using GPU 30m
        The Compute Unified Device Architecture (CUDA) is an NVIDIA software development platform that allows us to implement general-purpose applications on graphics processing units (GPUs). Many application areas benefit from GPU computing. Fast 3D cone-beam reconstruction is required in many application fields, like medical CT imaging or industrial non-destructive testing; for that reason, researchers work on hardware-optimized 3D reconstruction algorithms to reduce the reconstruction time. We have used GPU hardware and the CUDA platform to speed up the Feldkamp-Davis-Kress (FDK) algorithm, which permits the reconstruction of cone-beam CT data. In this work, we present our implementation of the most time-consuming steps of the FDK algorithm: filtering and back-projection. We also show the steps required for parallelization of the algorithm on the CUDA architecture. In addition, our FDK implementation allows rapid reconstruction, meaning that the reconstructed data are ready just after the end of data acquisition.
        Speaker: Giovanni Di Domenico (FE)
        Slides
      • 3:15 PM
        GPU-parallelized Levenberg-Marquardt model fitting towards real-time automated parametric diffusion NMR imaging 30m
        In this contribution we report on one of the main goals of the GAP project, which aims to investigate the deployment of Graphics Processing Units (GPUs) in different contexts of real-time scientific applications. In particular, we focused on the application of GPUs to the reconstruction of diffusion-weighted nuclear magnetic resonance (DW-NMR) images using non-Gaussian diffusion models. This application can benefit from implementation on the massively parallel architecture of GPUs, optimizing different aspects and enabling online imaging. In this work the stretched-exponential model [1] was fitted to DW-NMR biomedical images, obtained from an excised (in vitro) and a healthy (in vivo) mouse brain at 7.0 T, in order to extract quantitative non-Gaussian diffusion parametric maps. A pixel-wise approach [2] using a fast, accurate and robust parallel Levenberg-Marquardt minimization optimizer [3] was implemented on an Nvidia Quadro K2000 GPU. A dramatic speed-up (~250x) in massive model-fitting analysis was obtained with respect to a multi-core Intel Xeon E5430 processor @ 2.66 GHz. These results suggest that real-time automated pixel-wise parametric DW-NMR imaging is a promising application of GPUs. [1] Palombo, M., et al. JCP 2011, 135(3), 034504. [2] Capuani, S., et al. MRI 2013, 31(3), 359-365. [3] Zhu, X., Zhang, D. PLoS ONE 2013, 8(10), e76665.
        Speaker: Dr Marco Palombo (MIRcen, I2BM, DSV, CEA, Fontenay-aux-Roses, France; CNR-IPCF UOS ROMA, Physics Department, Sapienza University of Rome, Rome, Italy)
        Slides
    • 3:45 PM 4:00 PM
      Conclusions
      • 3:45 PM
        Summary & Conclusions 15m
        Speaker: Gianluca Lamanna (LNF)
        Slides
    • 4:00 PM 4:20 PM
      Coffee Break 20m