Description
This talk shares our recent experience in providing a data analytics platform based on Apache Spark for High Energy Physics (HEP), building applications on top of it, and exploring its usage in several use cases.
Apache Spark is an analytics framework aimed at processing big data, with strong traction and the support of a large user base. At CERN, the challenge of distributing large-scale computations has been tackled in two ways: by deploying Hadoop services on on-premises clusters, with Spark running on YARN, and by using OpenStack cloud infrastructure, with Spark running on Kubernetes.
Meanwhile, CERN provides a web platform called SWAN (Service for Web-based ANalysis), where users can write and run their analyses as Jupyter notebooks, seamlessly accessing the data and software they need without having them on their own machines.
The first part of the presentation covers the integrations between Spark, Hadoop with YARN, and OpenStack with Kubernetes, which together form a unified analytics platform enabling scalable, distributed HEP data processing. We will also discuss how SWAN has become the interface to this platform, providing submission and monitoring capabilities for Spark computations.
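As a rough illustration of what targeting either backend from a single entry point can look like, the sketch below builds the Spark properties for a YARN or a Kubernetes session. The helper name `spark_conf_for`, the API server URL and the container image are hypothetical placeholders, not the configuration used at CERN; only the property keys (`spark.master`, `spark.submit.deployMode`, `spark.kubernetes.container.image`) are standard Spark settings.

```python
from typing import Dict, Optional


def spark_conf_for(
    backend: str,
    k8s_api_url: Optional[str] = None,
    container_image: Optional[str] = None,
) -> Dict[str, str]:
    """Return Spark properties for the chosen resource manager.

    Hypothetical helper for illustration only; the values a real
    deployment needs (queues, namespaces, credentials) are omitted.
    """
    # Client deploy mode keeps the driver in the calling process,
    # which is the usual arrangement for interactive notebook sessions.
    conf = {"spark.submit.deployMode": "client"}

    if backend == "yarn":
        # The cluster location is taken from the Hadoop client configuration.
        conf["spark.master"] = "yarn"
    elif backend == "kubernetes":
        if not k8s_api_url or not container_image:
            raise ValueError("kubernetes needs an API server URL and a container image")
        # Spark expects the Kubernetes API server URL prefixed with k8s://
        conf["spark.master"] = f"k8s://{k8s_api_url}"
        conf["spark.kubernetes.container.image"] = container_image
    else:
        raise ValueError(f"unknown backend: {backend!r}")

    return conf


# Placeholder values, assumed for the example.
yarn_conf = spark_conf_for("yarn")
k8s_conf = spark_conf_for(
    "kubernetes",
    k8s_api_url="https://k8s.example.invalid:6443",
    container_image="example/spark:latest",
)
```

With PySpark available, a dictionary like this could be applied with `SparkConf().setAll(conf.items())` before creating the session; the analysis code itself would stay the same for both backends.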
The second part focuses on how the use of the analytics infrastructure is evolving, namely new developments in the ROOT analysis framework (Distributed RDataFrame and PyRDF). Through SWAN, these allow interactive, parallel and distributed analysis of large physics datasets stored on EOS, which can be easily monitored and shared with others.