AI_INFN Technical Meeting

Europe/Rome
Description

Virtual meeting room (zoom): https://l.infn.it/ai-infn-meeting

Date: 2024-04-08

News

  • Abstract submission for the CCR meeting in Palau is open and closes today.
  • Next week's AI_INFN meeting is cancelled.
  • Tomorrow the platform will be unreachable because the hardware assets are being migrated from one site to another.
    We are doing our best to restart all services on Wednesday.

Operations

  • 17 users have accessed the platform during the last month
  • We keep receiving complaints about the MinIO backbone. We have started working on an in-house solution beyond NFS.
  • Users reported crashes of the environment when trying to use the GPU. The problem seems related to incorrect propagation of some environment variables from the conda environment to the IPython kernel. Matteo is following up.
  • We started testing the batch system with GPUs (Ale Bombini); we are debugging and fixing issues.
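
One way to narrow down the environment-variable issue reported above is to compare the variables seen in a conda-activated shell with those seen inside the IPython kernel. The following is a minimal sketch, not the diagnosis actually used on the platform: the function name `diff_env` and the list of watched variables are assumptions made for illustration.

```python
import os

# Variables that commonly affect GPU discovery (this list is an assumption).
WATCHED = ("PATH", "LD_LIBRARY_PATH", "CUDA_HOME", "CUDA_VISIBLE_DEVICES", "CONDA_PREFIX")


def diff_env(shell_env, kernel_env, names=WATCHED):
    """Return {name: (shell_value, kernel_value)} for watched variables that differ."""
    diffs = {}
    for name in names:
        shell_value = shell_env.get(name)    # None if the variable is unset
        kernel_value = kernel_env.get(name)
        if shell_value != kernel_value:
            diffs[name] = (shell_value, kernel_value)
    return diffs


if __name__ == "__main__":
    # Both sides are the current process here, so an empty dict is expected.
    # In practice one side would be a dump saved from the conda shell.
    print(diff_env(dict(os.environ), dict(os.environ)))
```

Any variable that appears in the result with `None` on the kernel side is a candidate for the missing propagation.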

Tracked developments:

:white_check_mark: Tests on deployments with RKE2 (L. Anderlini, R. Petrini, G. Misurelli, M. Corvo)

  • The RKE2 testing activity is complete; the work continues as “Automation of RKE2 deployments in INFN Cloud”.

:arrow_forward: Automation of RKE2 deployments in INFN Cloud (R. Petrini, G. Misurelli, M. Corvo)

  • We are formalizing an activity to create pipelines. We will schedule a meeting in the week of the 21st.
  • The final goal is to move to a CD approach.

:arrow_forward: Develop monitoring infrastructure (R. Petrini)

  • Still no reply from the DataCloud PMB on the possibility of using the centralized Grafana instance. We have reached the maximum number of administrators (users) available in the free tier of Grafana Cloud (3 admins).
  • We now monitor GPU usage with a metric that infers the fraction in use from the allocated framebuffer memory. More checks are needed to assess its reliability.
  • Accounting: Rosa and Stefano are focusing on the TLS certification of the PostgreSQL database used for the accounting. Nadir (DataCloud WP1) is providing support.
  • Work on the dashboard
    • NTR
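
The framebuffer-based GPU metric mentioned above amounts to a ratio of allocated to total memory. The sketch below illustrates that arithmetic only; the function name and the MiB inputs are hypothetical, and the actual exporter and metric names used on the platform are not stated in these minutes.

```python
def framebuffer_usage_fraction(fb_used_mib, fb_total_mib):
    """Fraction of GPU framebuffer memory currently allocated, clamped to [0, 1]."""
    if fb_total_mib <= 0:
        raise ValueError("total framebuffer memory must be positive")
    return min(max(fb_used_mib / fb_total_mib, 0.0), 1.0)


# Example: 4 GiB allocated on a 16 GiB card.
print(framebuffer_usage_fraction(4096, 16384))  # 0.25
```

A known limitation of this proxy is that allocated memory is not compute utilization: TensorFlow and JAX, for instance, preallocate a large share of GPU memory by default, so an idle notebook can look busy. This may be one of the aspects the reliability checks need to cover.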

:arrow_forward: Environment setup (M. Barbetti, S. Giagu, S. Bordoni)

  • The QC environment is ready. All packages are installed and all of them talk to the GPU, except PennyLane. Qiskit seems to work. A test script is available for PennyLane. The CUDA version is 11.8.
  • JAX, PyTorch and TensorFlow all work with GPU.
  • PennyLane, however, does work with GPU.
  • Problems with version 2.14: all versions had to be installed with the CUDA drivers, installing TensorFlow and cuda-toolkit independently.
  • Problems with version 2.16 as well; the cause is unclear. It works when launched directly from a script.
  • Activity focused on following up on users’ difficulties in setting up the environment.
  • Created a “keras3” environment, made available by default to everyone.
  • Problems with PennyLane, NTR.
  • A set of tests has been prepared to validate the environment.
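
A check of the kind described above (does each framework see the GPU?) could look like the sketch below. It is not the test set prepared by the team: the function and probe names are hypothetical, and only JAX, PyTorch and TensorFlow are probed, using their public device-listing calls. PennyLane and Qiskit are left out because their GPU checks depend on the chosen backend.

```python
def _probe_torch():
    import torch
    return torch.cuda.is_available()


def _probe_tensorflow():
    import tensorflow as tf
    return len(tf.config.list_physical_devices("GPU")) > 0


def _probe_jax():
    import jax
    # Platform string may vary across JAX versions; accept the common GPU names.
    return any(d.platform in ("gpu", "cuda", "rocm") for d in jax.devices())


DEFAULT_PROBES = {
    "torch": _probe_torch,
    "tensorflow": _probe_tensorflow,
    "jax": _probe_jax,
}


def check_gpu_support(probes=DEFAULT_PROBES):
    """Run each probe; report 'gpu', 'cpu-only', or the error type it raised."""
    report = {}
    for name, probe in probes.items():
        try:
            report[name] = "gpu" if probe() else "cpu-only"
        except Exception as exc:  # missing package, driver mismatch, etc.
            report[name] = f"error: {type(exc).__name__}"
    return report
```

Calling `check_gpu_support()` both from a script and from a notebook kernel in the same environment gives two reports to compare, which is relevant to the case where a framework works only when launched directly from a script.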

:arrow_forward: Offloading tests with virtual kubelets (G. Bianchini, D. Ciangottini)

:arrow_forward: FPGA purchase

  • NTR.

:arrow_forward: User’s forum

  • The Indico page of the User’s forum (June 11-12) is available here: https://agenda.infn.it/event/40489/
  • Please get in touch with Elisabetta and Matteo if you wish to present something.

Status legend

:arrow_forward: Active
:fast_forward: Priority
:bangbang: Problems
:parking: Postponed or Blocked by others
:white_check_mark: Completed

    • 16:00 - 16:15
      News and setup (15m)
      Speaker: Lucio Anderlini (Istituto Nazionale di Fisica Nucleare)
    • 16:15 - 16:50
      Discussion on tasks and priorities (35m)
      Speaker: All
    • 16:50 - 17:00
      Any other business (10m)