Data generation rates for database workloads at CERN are growing rapidly going into LHC Run 2 and beyond. In particular, this growth is expected for data coming from controls, logging and monitoring systems. Storing, administering and accessing large data sets in a relational database system can quickly become a hard technical challenge as the size of the active data set and the number of concurrent users increase. To cope with this problem, CERN has adopted modern Big Data solutions based on Apache Hadoop and its ecosystem. Notably, technologies such as Apache Spark, Impala and Parquet offer a rapidly developing set of solutions for deploying and managing very large data warehouses on commodity hardware with open source software. Additionally, they enable new, flexible interfaces for data processing, including machine learning.
This presentation will also describe the infrastructure currently deployed at CERN and the most interesting projects running on top of it.