AI_INFN Technical Meeting

Europe/Rome
Description

Virtual meeting room (zoom): https://l.infn.it/ai-infn-meeting

Date: 2025-04-07

News

  • Web page: https://ai.cloud.infn.it
  • Year-end reports (consuntivi): meeting with the referees tomorrow. We will discuss a document, prepared with the WP leaders, on the achievements of 2024. Lucio will circulate the slides.
  • Consuntivi database: deadline April 9.
  • “XXII Seminar on Software for Nuclear, Subnuclear and Applied Physics, Alghero”, registration deadline May 10th, 2025.
  • “Introductory Course to VHDL and HLS FPGA Programming, Milano”, by ICSC-Spoke 2 - WP4 agenda, registration deadline April 30th.

Operations

  • A few temporary connection problems on Thursday, due to work requiring updates of the hub container (FPGA support, see below). No tickets opened, no support requests.

Tracked developments:

:arrow_forward: Automation of RKE2 deployments in INFN Cloud

  • March 3
    • Gioacchino tagged the image with a new versioning scheme. Snakemake is now available.
    • Gioacchino is working on Jupyter for INFN Cloud, trying to remain aligned with AI_INFN.
    • Plan migration to Jupyter 5 together with Gioacchino.
    • The renamed image will soon be the default one on the platform.
  • March 10
    • The update is stopped because authentication is being forced every minute; the cause is still to be understood.
      Development activity is also paused because a course is ongoing on the same infrastructure.
      The course should be concluded by tomorrow.
  • March 31
    • The interface between snakemake and WebDAV required a patch. Followed up in a GitHub Issue. A new image release will be needed.
  • April 7
    • We have added the Jupyter Remote Desktop plugin to the JupyterLab image; it is especially important for FPGA programming.
    • This week, Lucio will send Gioacchino an updated summary of the modifications to the image (patched snakemake-s3, patched snakemake-webdav, Jupyter RDS).

:arrow_forward: Develop monitoring and accounting infrastructure (R. Petrini)

  • March 31
    • Rosa added the network metrics to the file system monitoring.

:arrow_forward: Environment setup (S. Giagu, S. Bordoni, L. Cappelli)

  • March 31
    • An image with Jupyter Remote Desktop and VIVADO would be needed for testing and developing the workflow.
      Giulio has an image with everything preinstalled in his hub.
  • April 7
    • We are trying to replicate, for the FPGA stack, the setup used for Python: minimal dependencies in the Docker image, and as much as possible on NFS.
    • We seem to have a working version for the CLI, though a piece of firmware used for the tests appears to be missing.
    • We did not manage to run X11 applications within Apptainer, though. Further investigation is needed.
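
The "minimal image, tools on NFS" approach in the April 7 entry could be sketched as follows. This is only an illustration under stated assumptions: it supposes the tools live under a shared directory with a `<tool>/bin` layout, and the mount point, the tool names and the helper `env_with_nfs_tools` are all hypothetical. The setup actually used on the platform may differ.

```python
import os
from pathlib import Path
from typing import Iterable, Mapping


def env_with_nfs_tools(base_env: Mapping[str, str], nfs_root: str,
                       tools: Iterable[str]) -> dict:
    """Return a copy of base_env whose PATH starts with <nfs_root>/<tool>/bin.

    Hypothetical helper: only directories that actually exist are added,
    so a missing tool leaves the environment untouched.
    """
    bin_dirs = [Path(nfs_root) / tool / "bin" for tool in tools]
    existing = [str(d) for d in bin_dirs if d.is_dir()]

    env = dict(base_env)
    old_path = env.get("PATH", "")
    env["PATH"] = os.pathsep.join(existing + ([old_path] if old_path else []))
    return env


# Example with a placeholder mount point and tool name (both assumptions).
env = env_with_nfs_tools(os.environ, "/path/to/nfs/software", ["fpga-tools"])
print(env["PATH"].split(os.pathsep)[0])
```

The resulting dictionary can be passed as `env=` to `subprocess.run`, so the container image itself only needs the few system libraries the tools depend on.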

:arrow_forward: Offloading tests with virtual kubelets (G. Bianchini, D. Ciangottini)

  • March 3
    • 3 buckets made available; not yet tested;
    • Work on GPU continues.
    • Offloading to FPGA: the SSH tunnel no longer works, and it is not clear why.
  • March 10
    • Offloading to GPU: the NATS plugin was updated to use “SlurmFlavor” in order to support GPUs. The flavor is selected by iterating over the available flavors from the cheapest to the most expensive.
    • Offloading to FPGA: the interLink VK supports FPGA provisioning through the Docker plugin. A pod requesting an FPGA can be scheduled in the same way as one requesting a GPU. All the Xilinx tools can already be used in the Jupyter notebook. Tests were run with a U55c in Perugia, and the first “simple” tests all seem to work correctly. The system could also be made to work with the V70.
    • Stefano Dal Pra is setting up a call to organize the tests.
  • March 24
    • Tested nodes by Stefano Stalio: problems with the boto3 client. We will consider adding a WebDAV layer in front of S3 so that a different client can be used.
    • In addition, presigned URLs do not work for PUTs (Access Denied). The application relying on presigned URLs would need a complete rewrite of its authorization pattern to avoid using them.
  • March 31
    • Tested multi-GPU training in Cloud Veneto from the AI INFN Platform. Multi-node training has not been tested yet; Lucio wonders whether it is even possible.
    • Marco suggests that a single Pod could be executed on multiple nodes, using the cluster network as loopback.
    • Enrico shares info on MPI+SLURM.
  • April 7
    • We managed to submit jobs transparently to the HPC bubble in Cloud Veneto or to CINECA Leonardo. The next step is trying to submit a job running half on Leonardo and half in the bubble.
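
The flavor selection mentioned in the March 10 entry (scan the available flavors from the cheapest to the most expensive and take the first one that fits) can be illustrated with a minimal sketch. The `Flavor` fields, the cost values and the function name below are assumptions made for the example; they do not reproduce the actual “SlurmFlavor” schema of the NATS plugin.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Flavor:
    """Hypothetical resource flavor (not the plugin's real schema)."""
    name: str
    cost: float      # relative cost, used only to order the flavors
    cpus: int
    memory_gb: int
    gpus: int = 0


def select_flavor(flavors: Iterable[Flavor], cpus: int, memory_gb: int,
                  gpus: int = 0) -> Optional[Flavor]:
    """Return the cheapest flavor satisfying the request, or None if none fits."""
    for flavor in sorted(flavors, key=lambda f: f.cost):
        if (flavor.cpus >= cpus
                and flavor.memory_gb >= memory_gb
                and flavor.gpus >= gpus):
            return flavor
    return None


# Illustrative flavors, deliberately listed out of cost order.
FLAVORS = [
    Flavor("gpu-large", cost=10.0, cpus=32, memory_gb=128, gpus=4),
    Flavor("cpu-small", cost=1.0, cpus=4, memory_gb=16),
    Flavor("gpu-small", cost=4.0, cpus=8, memory_gb=32, gpus=1),
]

chosen = select_flavor(FLAVORS, cpus=4, memory_gb=16, gpus=1)
print(chosen.name if chosen else "no matching flavor")  # prints "gpu-small"
```

Sorting by cost before scanning means a request that needs a GPU skips the cheaper CPU-only flavor and lands on the least expensive flavor that provides one.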

:arrow_forward: FPGA purchase

  • March 10
    • Stefano G. is sending the U55c FPGA to CNAF;
    • Lucio asks to remove one of the V70s and send it to Ferrara, to continue the tests with two different hypervisors and collect additional information.
  • March 24
    • We are starting to gather information on the bureaucratic steps needed to obtain a refund for the V70.
    • (Diego M., offline) E4 is available t
  • March 31
    • Jupyter Hub with the U55C
  • April 7
    • Enrico replied to E4 asking to trade 2x V70 for 1x V80.

Status legend

:arrow_forward: Active
:fast_forward: Priority
:bangbang: Problems
:parking: Postponed or Blocked by others
:white_check_mark: Completed

    • 16:00 16:15
      News and setup 15m
      Speaker: Lucio Anderlini (Istituto Nazionale di Fisica Nucleare)
    • 16:15 16:50
      Updates on development activities 35m
    • 16:50 17:00
      Any other business 10m