AI_INFN Technical Meeting


Virtual meeting room (zoom): https://l.infn.it/ai-infn-meeting

Date: 2025-02-10

Operations

  • The meetings for the organization of the INFN Computing Workshop (La Biodola) have started. AI will have a more prominent role in the program than last year. Please start planning abstract submissions.
  • CSN5 has started a review of the age of experiment participants:
    • FTEs of people aged above 70 should not be included in the experiment plan;
    • Responsibilities should be assigned to people aged below 65.
  • The Alghero school (June 8-14), featuring a basic introduction to machine learning (and replacing our hackathon in that role), will open registrations soon.

Tracked developments:

:arrow_forward: Automation of RKE2 deployments in INFN Cloud

  • Gioacchino tagged the image with a new versioning scheme. Snakemake is now available.
  • The newly named image will soon become the default one on the platform.

:arrow_forward: Develop monitoring and accounting infrastructure (R. Petrini)

  • While practicing with monitoring on the new storage cluster, something went wrong with the setup, so we have just re-created it.
  • Created a new dashboard for InterLink.

:arrow_forward: Environment setup (S. Giagu, S. Bordoni, L. Cappelli)

  • Mail from Luca Clissa on GPU profiling:

In short, with tf I don't think I fully understand how it works, or at least what it reports does not match the output of nvidia-smi. With torch it seems more controlled.

More details

I tried a simple test, loading onto the GPU a model with the same number of parameters, both with torch and with tf. Then I checked the total, used, and available memory reported by torch, to compare it with TensorFlow's "current" memory (which should be the memory in use) [1]. Result:
• torch gives more details, and they seem consistent with the output of nvidia-smi (about 2.7 GB used out of about 9.5 GB total)
• tensorflow in general tracks less information (at least, I could not find functions giving more detail than ‘tf.config.experimental.get_memory_info(‘GPU:0’)’, which returns only “current” and “peak”)
• the tensorflow numbers seem counterintuitive. “current” should be the memory occupied when I run the command, while “peak” is the maximum reached [1]. In my case they come out to 0.6 GB and 7.7 GB respectively. However, the output of nvidia-smi suggests that practically all the memory (8 GB) is in use, so closer to the number I find in peak (not in current), and completely different from what pytorch uses.

If you want to have a look, I copied two notebooks (one per library) into a folder ‘shared/public/test_gpu_usage’. You should be able to see it; the code is very simple, but let me know if anything is unclear.

If I manage to run more checks later on, I'll let you know.

Thanks,
Luca
[1] https://www.tensorflow.org/api_docs/python/tf/config/experimental/get_memory_info
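A minimal sketch of the kind of comparison described above (function names here are illustrative, not taken from the notebooks in shared/public/test_gpu_usage). One likely explanation of the discrepancy: by default TensorFlow reserves nearly all GPU memory up front, so nvidia-smi reports the whole reservation, while get_memory_info's "current" reports only what TF's allocator is actually using; enabling tf.config.experimental.set_memory_growth makes the two figures closer.

```python
def to_gib(n_bytes):
    """Convert a byte count to GiB for readable reporting."""
    return n_bytes / 2**30

def report_torch():
    """PyTorch exposes both allocator-level and device-level counters."""
    import torch
    if not torch.cuda.is_available():
        return None
    dev = torch.device("cuda:0")
    free, total = torch.cuda.mem_get_info(dev)  # same pool nvidia-smi sees
    return {
        "allocated_GiB": to_gib(torch.cuda.memory_allocated(dev)),  # live tensors
        "reserved_GiB": to_gib(torch.cuda.memory_reserved(dev)),    # caching allocator
        "device_used_GiB": to_gib(total - free),                    # whole device
    }

def report_tf():
    """TensorFlow only exposes 'current' and 'peak' for its own allocator."""
    import tensorflow as tf
    if not tf.config.list_physical_devices("GPU"):
        return None
    info = tf.config.experimental.get_memory_info("GPU:0")
    return {k: to_gib(v) for k, v in info.items()}

if __name__ == "__main__":
    for name, fn in (("torch", report_torch), ("tensorflow", report_tf)):
        try:
            print(f"{name}: {fn()}")
        except ImportError:
            print(f"{name}: not installed")
```

Note that torch.cuda.mem_get_info queries the CUDA driver directly, so its used/total figures should match nvidia-smi, whereas both frameworks' allocator counters only describe their own internal bookkeeping.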

:arrow_forward: Offloading tests with virtual kubelets (G. Bianchini, D. Ciangottini)

  • JuiceFS on Redis fails to scale beyond 100 concurrent jobs. This can still be optimized, but we won't be able to improve by orders of magnitude; we hope for a factor of 2.

:arrow_forward: Acquisto FPGA

  • Diego collected information from Enrico and others and tried to contact an AMD engineer. The main problem is that RHEL 8.3 is needed to run the board, while OpenStack requires RHEL 9.
  • Call with Mario Ruiz. He was surprised we had that V70: it appears to be a recent revision with a platform that is not acknowledged by the Xilinx driver, a newer object that was not supposed to be sold yet. He later wrote that the platform is not supported by Vitis AI: we should contact the distributor to ask for the software platform to be replaced (possibly replacing the board itself).
  • Given that AMD's official position is that this board can only be supported through Vitis AI, and that this unit is not usable with Vitis AI… how can it be used?
  • Diego will get in touch with E4 to understand how to move forward.

Status legend

:arrow_forward: Active
:fast_forward: Priority
:bangbang: Problems
:parking: Postponed or Blocked by others
:white_check_mark: Completed

    • 16:00-16:15 News and setup (15m)
      Speaker: Lucio Anderlini (Istituto Nazionale di Fisica Nucleare)
    • 16:15-16:50 Updates on development activities (35m)
    • 16:50-17:00 Any other business (10m)