3–7 Jun 2019
Hotel Hermitage - Isola d'Elba
Europe/Rome timezone

Web-based interactive data analysis for HEP with Spark and ROOT DataFrame

6 Jun 2019, 09:00
30m
Sala Maria Luisa (Hotel Hermitage - Isola d'Elba)

La Biodola 57037 Portoferraio (Li) Tel. +39.0565 9740 http://www.hotelhermitage.it/

Speakers

Enric Tejedor Saavedra (CERN), Javier Cervantes Villanueva (CERN), Piotr Mrowczynski (CERN), Prasanth Kothuri (CERN), Vincenzo Eduardo Padulano

Description

This talk shares our recent experience in providing a data analytics platform based on Apache Spark for High Energy Physics (HEP), building applications on top of it, and exploring its use in several use cases.
Apache Spark is an analytics framework designed for processing big data, with strong traction and the support of a large user base. At CERN, the distribution of large-scale computations has been tackled in two ways: by deploying Hadoop services on on-premise clusters, with Spark running on YARN, and by using OpenStack cloud infrastructure, with Spark running on Kubernetes.
Meanwhile, CERN provides a web platform called SWAN (Service for Web-based ANalysis), where users can write and run their analyses in the form of Jupyter notebooks, seamlessly accessing the data and software they need without having them on their own machine.
The first part of the presentation covers the integration of Spark with Hadoop and YARN and with OpenStack and Kubernetes, which together form a unified analytics platform enabling scalable, distributed HEP data processing. We will also discuss how SWAN has become the interface to this analytics platform, providing submission and monitoring capabilities for Spark computations.
The second part will focus on how the use of this analytics infrastructure is evolving, namely on new developments in the ROOT framework (Distributed RDataFrame and PyRDF). Through SWAN, these allow interactive, parallel and distributed analysis of large physics datasets stored on EOS, and the resulting analyses can be easily monitored and shared with others.

Primary author

Vincenzo Eduardo Padulano
