Speaker
Description
The book “The Fourth Paradigm” describes how scientific computing and data-driven research have become the third and fourth pillars of science, following observation and theoretical modeling, and it is debated whether AI constitutes a separate fifth pillar. In any case, computing infrastructures play a critical role in building on these pillars, whether in running bottom-up simulations, in processing large amounts of data from experimental sources, as happens in high-energy physics or biology, or in applying machine learning techniques whose training is based on these data.

On one hand, this evolution has required HPC centers to provide computing power adapted to different needs at any scale, and this demand for increased flexibility has brought increased complexity, with more complicated software stacks and more varied ways of accessing a computing facility. On the other hand, external constraints, such as pressure from the industrial sector (with its push towards architectures specialized for inference and low-precision integer arithmetic), the legitimate demand for a controlled carbon footprint, and the existence of heterogeneous platforms, call for careful management of HPC workloads and infrastructures in order to adapt to different demands while still providing optimal performance, reliability, and power use. At E4 we have labeled this “reconfigurable HPC”, and we are developing a complete hardware portfolio and software stack that blends the flexibility of procedures inherited from the cloud with the level of control and optimization required by an on-premise HPC cluster.

The presentation will cover some case studies, starting from the evaluation of different architectures for different workloads and moving on to examples of how this end-to-end building of HPC uses different solutions for lifecycle management, software installation, orchestration, and monitoring, in accordance with users’ needs. Finally, a perspective will be given on the direction in which E4 also intends to proceed: a data-driven HPC in which monitoring and analysis play a central role in the design and operation of an infrastructure.