AI_INFN Technical Meeting

Europe/Rome
Description

Virtual meeting room (zoom): https://l.infn.it/ai-infn-meeting

Date: 2024-03-11

An update on the migration was discussed.
In summary:

  • we are on schedule: the new platform (https://hub.ai.cloud.infn.it) is online and several users have started migrating to it;
  • interestingly, some users are still using the 212 deployment; we contacted them to ask whether anything is blocking their migration, but received no reply;
  • we introduced a mechanism to reserve Kubernetes nodes for JupyterHub-defined user groups through node taints;
  • we transferred data from the clusters used for testing before the migration; users and groups that had data on multiple clusters will find it in folders named from.96.90, from.97.147, and so on;
  • the 212 deployment will be deleted tomorrow.
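A minimal sketch of how such a node reservation could look in practice, assuming a Zero to JupyterHub-style deployment; the taint key, value, and node name below are illustrative assumptions, not the actual AI_INFN configuration:

```yaml
# Illustrative sketch only: taint key, value and names are assumptions.
# The node is first tainted so that ordinary user pods cannot land on it:
#   kubectl taint nodes gpu-node-01 ai-infn.it/reserved-group=gpu-users:NoSchedule
#
# A JupyterHub profile offered only to the matching user group then adds
# the corresponding toleration (Zero to JupyterHub values.yaml):
singleuser:
  profileList:
    - display_name: "Reserved GPU node"
      kubespawner_override:
        tolerations:
          - key: "ai-infn.it/reserved-group"
            operator: "Equal"
            value: "gpu-users"
            effect: "NoSchedule"
```

With the taint alone, pods lacking the toleration are rejected by the node; restricting the profile to the relevant JupyterHub group is what makes the reservation effective.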

Tracked developments:

:fast_forward: Tests on deployments with RKE2 (L. Anderlini, R. Petrini, G. Misurelli, M. Corvo)

  • NTR

:arrow_forward: Port monitoring infrastructure to Helm chart (R. Petrini)

  • The bug in the NFS exporter reported last week has been fixed. The problem was a pile-up of multiple scans: the time required to scan the filesystem exceeded the interval between scans.
  • Still no reply from the DataCloud PMB on the possibility of using the centralized Grafana instance.
  • Problems with the monitoring of an RTX node: our investigation led to the conclusion that the silicon of RTX5000 devices lacks some profiling-related features, which makes the NVIDIA exporter crash with the default configuration. Digging into the exporter docs, we managed to disable those metrics (currently they are disabled on all nodes, since we did not find a way to disable them only on the nodes with an RTX5000 installed).
  • We created a PostgreSQL instance to take snapshots from Prometheus as a first attempt at an accounting and billing mechanism. We managed to connect it to Grafana, but currently it only includes random metrics. More effort is needed on encryption (TLS) and a least-privilege policy (we gave Grafana admin access, which is unacceptable if real data end up in the DB).
  • Help from S. Dal Pra will be needed to secure the installation and the data once it is clearer what we want from that DB.
  • The development of a user’s personal dashboard has started. Last week the focus was on the authentication of a Dash application, which now works (though it is a bit rough: for example, there is no logout and one needs to clear the cookies manually).
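The scan pile-up fixed above is a common failure mode for periodic collectors; one simple guard is to skip a tick while the previous scan is still running, instead of letting scans queue up. A minimal sketch of that idea (the class and names are illustrative, not the actual exporter code):

```python
import threading
import time

class NonOverlappingScanner:
    """Run scans so that a new tick is skipped while a scan is in progress.

    This prevents the pile-up that occurs when scanning the filesystem
    takes longer than the interval between scheduled scans.
    """

    def __init__(self, scan_fn):
        self.scan_fn = scan_fn
        self._lock = threading.Lock()
        self.completed = 0
        self.skipped = 0

    def tick(self):
        # Non-blocking acquire: if a scan is already running, count a
        # skipped tick instead of starting a concurrent scan.
        if not self._lock.acquire(blocking=False):
            self.skipped += 1
            return False
        try:
            self.scan_fn()
            self.completed += 1
            return True
        finally:
            self._lock.release()

# Example: a slow scan overlapping with the next tick.
scanner = NonOverlappingScanner(lambda: time.sleep(0.2))
t = threading.Thread(target=scanner.tick)
t.start()
time.sleep(0.05)             # the first scan is still running here...
overlapped = scanner.tick()  # ...so this tick is skipped
t.join()
```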

:arrow_forward: Define a list of libraries for QC simulations in Cloud (S. Giagu, S. Bordoni)

  • Matteo B. reports on attempts to give the QC conda environment developed by Giagu and Bordoni access to the GPUs: there are inconsistencies in the versions, and he did not manage to find a recipe that lets PyTorch, JAX, TensorFlow and PennyLane share the same CUDA version. The details are in the mail below (translated from Italian):
This afternoon I tried working on the environment we had discussed before the migration of the AI-INFN hub.

As a first step I tried to prepare it by hand, selecting the packages listed below one by one and favouring installation with pip (which typically gives access to the latest available versions). While I managed without too much trouble to create an environment in which TensorFlow and PyTorch coexist, both with GPU access, adding JAX revealed the first snags. In particular, JAX is available for CUDA 11.8 or 12.3, while the latest TensorFlow versions (2.15 or 2.16) require CUDA > 11, and PyTorch 2 is available for CUDA 11.8 or 12.1. The only solution that seemed viable was to fall back to TensorFlow 2.14 (with CUDA 11.8) and combine it with PyTorch 2.1, to finally create a trio working on GPU together with JAX. As hoped, I succeeded, but a new misfortune was lurking around the corner…

From the PennyLane documentation (https://pypi.org/project/PennyLane-Lightning-GPU):

> Lightning-GPU can be installed using pip:
> 	pip install pennylane-lightning[gpu]
> Lightning-GPU requires CUDA 12 […]

Checkmate!

I then tried following the instructions reported below step by step, but I ended up with an environment in which none of the three packages mentioned above manages to "see" the GPU correctly.
What am I doing wrong? Had you managed to get a working environment?
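The version juggling described in the mail is, at bottom, an intersection of the CUDA versions each package supports. A small sketch of that reasoning; the support table is transcribed from the mail above and may not reflect current releases:

```python
# CUDA versions supported by each package, as reported in the mail above
# (illustrative transcription; real support matrices change every release).
TRIO = {
    "tensorflow-2.14": {"11.8"},
    "torch-2.1": {"11.8", "12.1"},
    "jax": {"11.8", "12.3"},
}

def common_cuda(requirements):
    """Return the CUDA versions acceptable to every listed package."""
    versions = None
    for supported in requirements.values():
        versions = set(supported) if versions is None else versions & supported
    return versions or set()

# TensorFlow 2.14 + PyTorch 2.1 + JAX agree on CUDA 11.8...
print(common_cuda(TRIO))

# ...but PennyLane-Lightning-GPU requires CUDA 12, so the intersection vanishes.
WITH_PENNYLANE = dict(TRIO, **{"pennylane-lightning-gpu": {"12.1", "12.2", "12.3"}})
print(common_cuda(WITH_PENNYLANE))
```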

:arrow_forward: Offloading tests with virtual kubelets (G. Bianchini, D. Ciangottini)

  • The first working Helm chart is ready. We have a toy JupyterHub to bring up the setup. The notebook starts correctly with the GPU. The image opens an SSH connection to talk to the JupyterHub server, to work around the networking issues.

:arrow_forward: FPGA purchase

  • NTR. We spoke with Cesini: the order is about to be placed and everything is defined.

:arrow_forward: User’s forum

  • The newly selected dates are Tuesday June 11th to Wednesday June 12th, 2024, or alternatively June 12th-13th.
  • An Indico event (empty for the moment) has been created: https://agenda.infn.it/event/40489/

Status legend

:arrow_forward: Active
:fast_forward: Priority
:bangbang: Problems
:parking: Postponed or Blocked by others
:white_check_mark: Completed

There are minutes attached to this event.
    • 1. News and setup (speaker: Lucio Anderlini, Istituto Nazionale di Fisica Nucleare)
    • 2. Discussion on tasks and priorities (speaker: All)
    • 3. Any other business