AI_INFN Technical Meeting

Europe/Rome
Description

Virtual meeting room (zoom): https://l.infn.it/ai-infn-meeting

Date: 2024-06-10

News

  • The call for the national Ph.D. program on Artificial Intelligence is open: https://dottorato.unipi.it/index.php/it/concorsi-d-ammissione-a-a-2024-2025.html
    Deadline: 2024-06-20.
  • The Sixth International School on Open Science Cloud (SOSC24) will be held in Bologna from 2024-12-02 to 2024-12-06. Applications are open.
  • INFN Cloud updated its S3 cloud storage, phasing out MinioGW and moving to RadosGW. To access your cloud storage data through the AI_INFN Platform, you need to stop your server and start it again. Your data should then be mounted in /home/rgw (rather than /home/minio, where it was before).
  • We are testing a BorgBackup-based solution to back up users’ data. Currently, only NFS data is backed up. A 4 TB volume on Cloud@CNAF Ceph stores an encrypted, compressed and deduplicated copy of the data. Pruning is configured to keep 12 hourly, 7 daily, 4 weekly and 6 monthly backups. The uncompressed NFS filesystem is 761 GB; the compressed and deduplicated backups require 406 GB. Minimal monitoring has been added to Grafana (https://hub.ai.cloud.infn.it/grafana).
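
The retention policy above maps onto the `--keep-*` options of `borg prune`. The following is a minimal sketch, not the platform's actual configuration: the repository path is hypothetical, and the positional repository argument assumes the BorgBackup 1.x command-line syntax. It also works out the space saving from the figures reported above (roughly 47%).

```python
import shlex

# Retention policy as reported in the notes above.
RETENTION = {"hourly": 12, "daily": 7, "weekly": 4, "monthly": 6}


def prune_command(repo, retention=RETENTION):
    """Build the argument list for `borg prune` (BorgBackup 1.x syntax assumed)."""
    args = ["borg", "prune"]
    for period, count in retention.items():
        args.append(f"--keep-{period}={count}")
    args.append(repo)
    return args


def saving_fraction(source_gb, stored_gb):
    """Fraction of space saved by compression and deduplication."""
    return 1.0 - stored_gb / source_gb


if __name__ == "__main__":
    # "/mnt/backup/borg-repo" is a hypothetical path, used only for illustration.
    print(shlex.join(prune_command("/mnt/backup/borg-repo")))
    print(f"space saved: {saving_fraction(761, 406):.1%}")
```

The script only builds and prints the command; it never runs `borg`, so it is safe to try without a repository.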

Tracked developments:

:arrow_forward: Automation of RKE2 deployments in INFN Cloud

  • INFN Cloud is moving forward with our request to add dashboard functionality for resizing the cluster by adding and removing nodes.
  • There is a call for developing a Kubernetes AutoScaler interfaced with the INFN Cloud dashboard to automate this process based on the pressure on the cluster. Please volunteer!
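
As a starting point for volunteers, the decision step of such an autoscaler could be sketched as follows. Everything here is an assumption made for illustration: the `ClusterState` fields, the thresholds and the one-node-at-a-time policy are hypothetical, and the sketch does not call any INFN Cloud dashboard or Kubernetes API.

```python
from dataclasses import dataclass


@dataclass
class ClusterState:
    nodes: int               # current number of worker nodes
    pending_pods: int        # pods that cannot be scheduled for lack of resources
    cpu_utilization: float   # requested CPU / allocatable CPU, in [0, 1]


def desired_nodes(state, min_nodes=1, max_nodes=10, scale_down_below=0.3):
    """Return the node count a resize request should target.

    Hypothetical policy: add one node while pods are pending, remove one
    node when utilization is low, otherwise keep the current size.
    """
    if state.pending_pods > 0:
        target = state.nodes + 1
    elif state.cpu_utilization < scale_down_below:
        target = state.nodes - 1
    else:
        target = state.nodes
    # Never go outside the allowed cluster size.
    return max(min_nodes, min(max_nodes, target))


if __name__ == "__main__":
    print(desired_nodes(ClusterState(nodes=3, pending_pods=2, cpu_utilization=0.9)))
```

Keeping the decision a pure function of the observed state makes it easy to test independently of whatever interface is eventually used to resize the cluster.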

:arrow_forward: Develop monitoring and accounting infrastructure (R. Petrini)

  • [Open issue since last week] There is a problem with the migration to the database developed by Nadir: when writing “TYPES” the SQL console hangs forever. We rolled back to the previous database, and we will give it another try as soon as possible.

:arrow_forward: Environment setup (M. Barbetti, S. Giagu, S. Bordoni, L. Cappelli)

  • The documentation has been moved to https://ai-infn.baltig-pages.infn.it/wp-1/docs/ and is linked in the JupyterHub splash page.
  • Writing of the documentation is progressing, but help is very much appreciated.
  • We agree that we should move forward with two different environments: one with Torch and Quantum, and one with TensorFlow and Quantum.

:arrow_forward: Offloading tests with virtual kubelets (G. Bianchini, D. Ciangottini)

  • We have integrated CloudVeneto among the sites that can run CPU jobs through offloading. We are following up on some problems with the provisioning of cvmfs, which are not specific to CloudVeneto but are more evident there.
  • The error rate of jobs submitted through interlink is higher. In most cases it seems connected to filesystem issues. We are working on automatically relaunching failed jobs and on increasing the success rate by tuning the configuration of both the jobs and the remote sites. Example of unattended activity during the weekend: Lamarr validation.

  • More work needed to improve the reliability of the infrastructure.
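
The automatic relaunching of failed jobs mentioned above could look roughly like the sketch below. It is an illustration only: `submit` is a hypothetical callable standing in for the real submission mechanism, and the exponential backoff between attempts is an assumption, not the policy actually adopted.

```python
import time


def run_with_retries(submit, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call submit() until it succeeds or the attempts are exhausted.

    submit: hypothetical callable returning True when the job succeeded.
    Returns the number of attempts used; raises RuntimeError on final failure.
    """
    for attempt in range(1, max_attempts + 1):
        if submit():
            return attempt
        if attempt < max_attempts:
            # Exponential backoff: base_delay, 2*base_delay, 4*base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError(f"job failed after {max_attempts} attempts")


if __name__ == "__main__":
    # Simulated job that fails twice (e.g. on a filesystem issue), then succeeds.
    outcomes = iter([False, False, True])
    attempts = run_with_retries(lambda: next(outcomes), base_delay=0.01)
    print(f"succeeded after {attempts} attempts")
```

Bounding the number of attempts keeps a persistently failing site from absorbing resubmissions indefinitely.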

:arrow_forward: FPGA purchase

  • NTR (nothing to report)

:arrow_forward: User’s forum

  • It’s tomorrow! See you soon.

Status legend

:arrow_forward: Active
:fast_forward: Priority
:bangbang: Problems
:parking: Postponed or Blocked by others
:white_check_mark: Completed

There are minutes attached to this event.