AI_INFN Technical Meeting

Europe/Rome
Description

Virtual meeting room (zoom): https://l.infn.it/ai-infn-meeting

AI_INFN Technical meeting – Minutes and actions

Date: 2024-04-22

News

  • Both AI_INFN abstracts submitted to CCR were accepted as talks.
  • Next week's AI_INFN meeting is cancelled because it overlaps with EuCAIFCon.
  • During the weekend (most probably on Friday 26) we will restart the VMs of the Kubernetes cluster. User sessions will be interrupted.
  • Giuseppe has called a meeting on May 2nd, at 3.30 CEST, on Docker image pipelines as a starting point towards automating the platform deployment. Please join.
  • Please register for the AI_INFN users' workshop in Bologna, 11-12 June.

Operations

  • The downtime on April 9 was caused by the migration of the hardware assets to the Technopole.
  • During the migration, the calico network interface broke and we did not manage to restart it. We had to reinitialize the cluster.
  • 18 users logged in during the last month.
  • Experimental support to increase the resource allocation for specific user groups/projects.
  • 732 GB in 2.3M files on NFS. Thanks to users who keep their environments clean.
  • We are experimenting with MinIO on NVMe. We allocated 1 TB and are using 125 GB for testing. If you are interested in beta testing, let us know.
  • We continue the tests of the batch system with GPUs (Ale Bombini).

Tracked developments:

:arrow_forward: Automation of RKE2 deployments in INFN Cloud (R. Petrini, G. Misurelli, M. Corvo)

:arrow_forward: Develop monitoring infrastructure (R. Petrini)

  • Work with Nadir is ongoing to set up the AI_INFN multi-site database. We are fighting with Ansible.

:arrow_forward: Environment setup (M. Barbetti, S. Giagu, S. Bordoni)

Matteo B. prepared an environment with the following libraries:

  • TensorFlow 2.14.0
  • PyTorch 2.1.0
  • JAX 0.4.25
  • PennyLane 0.35.1
  • Qiskit 1.0.2
  • CuPy 13.0.0

All of them, with the exception of PennyLane, access the GPU properly.

We identified a script that verifies whether PennyLane finds the GPU, and documented how to clone the environment in order to extend it further.

Stefano G. and Simone B. are testing it.

Preliminary attempts at using Apptainer instead of conda are documented in the attached slides.

:arrow_forward: Offloading tests with virtual kubelets (G. Bianchini, D. Ciangottini)

  • The target is a single VM with an NVIDIA T4 GPU, provided by AI_INFN.
  • The deployment for the upcoming tests is: https://jhub.131.154.98.92.myip.cloud.infn.it
  • Cloud Veneto is setting up a second target for testing.
  • We have created a tiny RKE2 cluster to test the submission of jobs, and the target is being tweaked to support job submission.
    • The submission is not reproducible. Sometimes the job is accepted and runs, sometimes it goes immediately into “Completed” status. GB is following up.
    • It was agreed with DataCloud WP6 to collect the tests we plan to perform in a shared document.
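To help tell a job that really ran from one that went straight to “Completed”, the following minimal sketch inspects a pod description in the JSON form returned by `kubectl get pod <name> -o json`. The function names and the 5-second threshold are assumptions chosen for the example, not part of the test setup described above. The fields read (`status.phase`, `status.containerStatuses[].state.terminated.startedAt` and `finishedAt`) are standard Kubernetes pod status fields.

```python
from datetime import datetime

# Kubernetes reports timestamps as RFC 3339 strings in UTC.
K8S_TIME = "%Y-%m-%dT%H:%M:%SZ"


def container_runtimes(pod):
    """Return {container name: runtime in seconds} for terminated containers."""
    runtimes = {}
    for cs in pod.get("status", {}).get("containerStatuses", []):
        terminated = cs.get("state", {}).get("terminated")
        if not terminated:
            continue
        started = terminated.get("startedAt")
        finished = terminated.get("finishedAt")
        if not started or not finished:
            continue
        delta = datetime.strptime(finished, K8S_TIME) - datetime.strptime(started, K8S_TIME)
        runtimes[cs["name"]] = delta.total_seconds()
    return runtimes


def looks_like_instant_completion(pod, min_seconds=5.0):
    """True if the pod succeeded but every container ran for less than min_seconds.

    min_seconds is a hypothetical threshold: tune it to the shortest
    plausible runtime of the test job.
    """
    if pod.get("status", {}).get("phase") != "Succeeded":
        return False
    runtimes = container_runtimes(pod)
    return bool(runtimes) and all(t < min_seconds for t in runtimes.values())
```

Collecting this flag for every submitted test job would give a simple count of how often the immediate “Completed” case occurs.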

:arrow_forward: FPGA procurement

  • NTR.

:arrow_forward: User’s forum

  • The Indico page of the User’s forum (June 11-12) is accessible here: https://agenda.infn.it/event/40489/
  • Please get in touch with Elisabetta and Matteo if you wish to present something.

Status legend

:arrow_forward: Active
:fast_forward: Priority
:bangbang: Problems
:parking: Postponed or Blocked by others
:white_check_mark: Completed
