AI_INFN Technical Meeting

Europe/Rome
Description

Virtual meeting room (zoom): https://l.infn.it/ai-infn-meeting

AI_INFN Technical meeting – Minutes and actions

Date: 2024-03-25

News

  • Abstract submission for the CCR meeting in Palau is open. If you plan to submit activities related to AI_INFN, please get in touch to coordinate the effort.

Operations

  • 16 users have accessed the platform since its deployment on March 8th
  • Problems were reported at the boundary between IAM and MinIO; they are not directly related to the platform.
  • A user failed to log in, possibly because of a mismatch between the IAM token recorded in their instance and the new token issued by the IAM service. The administrators solved the problem by restarting the user's server.
  • The culling policy is extremely relaxed, and some users don't even realize they have resources allocated. Currently this is not a problem, as resources are more than sufficient to cope with the demand; we will simply clean up stale user sessions manually each Monday.
  • Because people do not trust MinIO, they put stress on NFS, which should be used for software rather than data. The situation is still manageable.
  • On Thursday, we switched off M2. This may have caused problems with a dev Kubernetes cluster and some project-managed VMs. The platform is untouched.
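
The weekly manual clean-up mentioned above could be supported by a small helper that flags users who hold a server but have been idle for a long time. The following is only a minimal sketch, assuming the platform exposes user records shaped like the JupyterHub REST API user model (`name`, `last_activity`, `servers`); the function name `find_stale_users`, the 7-day threshold and the sample records are hypothetical and not taken from the actual deployment.

```python
from datetime import datetime, timedelta, timezone


def find_stale_users(users, now, max_idle=timedelta(days=7)):
    """Return the names of users who hold a server but have been idle longer than max_idle.

    `users` is assumed to be a list of dicts shaped like the JupyterHub user model.
    The 7-day default is an illustrative assumption, not an agreed policy.
    """
    stale = []
    for user in users:
        # "servers" maps server names to server records; empty means nothing is allocated.
        if not user.get("servers"):
            continue
        last = user.get("last_activity")
        if last is None:
            # A server with no recorded activity at all is treated as stale.
            stale.append(user["name"])
            continue
        # Timestamps are ISO 8601 strings with a trailing "Z" for UTC.
        last_seen = datetime.fromisoformat(last.replace("Z", "+00:00"))
        if now - last_seen > max_idle:
            stale.append(user["name"])
    return stale


# Hypothetical records, for illustration only.
SAMPLE_USERS = [
    {"name": "alice", "last_activity": "2024-03-24T10:00:00.000000Z", "servers": {"": {}}},
    {"name": "bob", "last_activity": "2024-03-10T09:30:00.000000Z", "servers": {"": {}}},
    {"name": "carol", "last_activity": "2024-03-01T08:00:00.000000Z", "servers": {}},
]
NOW = datetime(2024, 3, 25, 16, 0, tzinfo=timezone.utc)

print(find_stale_users(SAMPLE_USERS, NOW))
```

In a real setting the list of users would come from an authenticated request to the hub API rather than from hard-coded records, and the flagged users would still be reviewed by hand before any server is stopped.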

Tracked developments:

:white_check_mark: Tests on deployments with RKE2 (L. Anderlini, R. Petrini, G. Misurelli, M. Corvo)

  • The RKE2 testing activity is complete; the work continues as “Automation of RKE2 deployments in INFN Cloud”.

:arrow_forward: Automation of RKE2 deployments in INFN Cloud (R. Petrini, G. Misurelli, M. Corvo)

  • See kick-off slides.

:arrow_forward: Port monitoring infrastructure to Helm chart (R. Petrini)

  • Still no reply from the DataCloud PMB on the possibility of using the centralized Grafana instance. We have reached the maximum number of administrators (users) available in the free tier of Grafana Cloud (3 admins).
  • No news on metrics to monitor real GPU usage.
    • As a reminder, the RTX5000 silicon does not support the Prometheus exporter metric for the percentage of GPU usage, which results in crashes of the Prometheus stack itself. We removed the metrics unavailable on RTX5000, but this leaves us blind. We moved to frame buffer memory usage, which is a more opaque metric but might be sufficient. Further testing is needed.
  • Accounting: Rosa and Stefano are focusing on the TLS certification of the PostgreSQL database used for the accounting.
    • There is an inconsistency in the solution we have followed. We request the certificates from Sectigo.
  • Work on the dashboard: we are studying Dash.
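
To make the frame buffer proxy discussed above more concrete, the sketch below derives a per-GPU memory usage fraction from exporter output in the Prometheus text format. It assumes the exporter is NVIDIA's dcgm-exporter, with its `DCGM_FI_DEV_FB_USED` and `DCGM_FI_DEV_FB_FREE` metrics and a `gpu` label, which the minutes do not state; the function name `fb_usage_fraction` and the sample values are hypothetical.

```python
import re

# Matches 'metric_name{label="value",...} number' lines of the Prometheus text format.
SAMPLE_LINE = re.compile(r'^(\w+)\{([^}]*)\}\s+(\S+)')
LABEL = re.compile(r'(\w+)="([^"]*)"')


def fb_usage_fraction(exposition_text):
    """Return {gpu_index: used / (used + free)} from frame buffer metrics.

    The total is approximated as used + free, so any memory the driver
    reserves for itself is ignored.
    """
    used, free = {}, {}
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_LINE.match(line)
        if match is None:
            continue  # skip samples without labels
        name, raw_labels, value = match.groups()
        gpu = dict(LABEL.findall(raw_labels)).get("gpu")
        if gpu is None:
            continue
        if name == "DCGM_FI_DEV_FB_USED":
            used[gpu] = float(value)
        elif name == "DCGM_FI_DEV_FB_FREE":
            free[gpu] = float(value)

    fractions = {}
    for gpu, used_mib in used.items():
        if gpu not in free:
            continue
        total = used_mib + free[gpu]
        if total > 0:
            fractions[gpu] = used_mib / total
    return fractions


# Hypothetical exporter output, for illustration only.
SAMPLE_METRICS = """
# TYPE DCGM_FI_DEV_FB_USED gauge
DCGM_FI_DEV_FB_USED{gpu="0",modelName="Quadro RTX 5000"} 4096
DCGM_FI_DEV_FB_FREE{gpu="0",modelName="Quadro RTX 5000"} 12288
DCGM_FI_DEV_FB_USED{gpu="1",modelName="Quadro RTX 5000"} 0
DCGM_FI_DEV_FB_FREE{gpu="1",modelName="Quadro RTX 5000"} 16384
"""

print(fb_usage_fraction(SAMPLE_METRICS))
```

A high fraction only shows that a process has allocated memory on the GPU, not that it is computing, which is why this metric is more opaque than a utilization percentage.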

:arrow_forward: Define a list of libraries for QC simulations in Cloud (S. Giagu, S. Bordoni)

  • Matteo is looking at the QC environment.
  • More users, and therefore more user requests, are needed.
  • Problems were reported with PennyLane, but it is not clear whether they are specific to PennyLane.
  • A set of tests is prepared to test the environment.

:arrow_forward: Offloading tests with virtual kubelets (G. Bianchini, D. Ciangottini)

  • The simplified helm chart developed by Giulio has been deployed on INFN Cloud resources.
  • The target is a single VM with an NVIDIA T4 GPU, provided by AI_INFN.
  • The deployment for the upcoming tests is: https://jhub.131.154.98.92.myip.cloud.infn.it

:arrow_forward: FPGA purchase

  • NTR.

:arrow_forward: User’s forum

  • The indico page of the User’s forum (June 11-12) is accessible here: https://agenda.infn.it/event/40489/
  • Please get in touch with Elisabetta and Matteo if you wish to present something.

Status legend

:arrow_forward: Active
:fast_forward: Priority
:bangbang: Problems
:parking: Postponed or Blocked by others
:white_check_mark: Completed

    • 16:00 - 16:15
      News and setup (15m)
      Speaker: Lucio Anderlini (Istituto Nazionale di Fisica Nucleare)
    • 16:15 - 16:35
      Setting up Day-2 Operations (20m)
      Speaker: Giuseppe Misurelli (Istituto Nazionale di Fisica Nucleare)
    • 16:35 - 16:50
      Discussion on tasks and priorities (15m)
      Speaker: All
    • 16:50 - 17:00
      Any other business (10m)