AI_INFN Technical meeting – Minutes and actions
Date: 2024-12-09
Operations
- On November 14, we filled the NFS disks of the platform. This caused a DiskPressure condition on the storage node, which led to the eviction of some pods. Consequences:
  - A short downtime (no access to the file systems)
  - Loss of the JuiceFS metadata running on Redis. It is still not 100% clear why, but the backup of the Redis database was last updated on April 26th (basically useless). As a result, 835 GB of data in JuiceFS were lost.
- We have created a new storage cluster with 6 TB of NVMe storage on 3 instances. It runs:
  - MinIO operator for data (with 1-server failure tolerance)
  - PostgreSQL for metadata
- On November 27th, “Macchinone 1 (T4 and RTX GPUs)” had a kernel panic due to an “Uncorrectable ECC (@DIMME1(CPU2))”. The maintenance on this machine has expired. Diego Michelotto will monitor the machine more closely in the coming days. Consequences:
  - The RTX nodes on the platform were unreachable, but the GPUs were not removed from the availability display on the spawning page. One ticket was opened for failures at spawning time.
  - If kernel panics become more frequent, we will need to consider decommissioning this machine.
- On Friday December 7th, the platform was used for a demo on offloading at SOSC2024.
- Hackathon. At the end of this week, we should free the resources at CNAF and clean up the access through IAM-Demo.
Tracked developments:
Automation of RKE2 deployments in INFN Cloud
- The button in the INFN Cloud dashboard redirecting to the AI_INFN platform has been added.
For the moment it is only visible in the development version of the PaaS and for users in the admin/ai-infn group.
- Gioacchino pushed the JupyterLab and JupyterHub Docker images to the DataCloud repository, with automated builds.
Develop monitoring and accounting infrastructure (R. Petrini)
- The Grafana dashboard has been extended to:
  - monitor the hackathon platform
  - monitor the Postgres used for the new JuiceFS
  - monitor the MinIO instance used by the new JuiceFS (this does not work)
Environment setup (M. Barbetti, S. Giagu, S. Bordoni, L. Cappelli)
Offloading tests with virtual kubelets (G. Bianchini, D. Ciangottini)
- After running a first demonstrator of workflows combining Tier-1 and Cineca Leonardo resources,
we have started redesigning the Condor plugin to introduce a further abstraction layer based on NATS.
The new layer makes it possible to pull jobs through a websocket and submit them through Condor.
It also decouples the translation to Singularity from the submission process, so the same job script can be submitted to Condor or to another backend by modifying only the submission and retrieval logic, instead of redesigning the translation from Kubernetes to the Docker runtime.
The new setup was tested for a live demo at the end of SOSC2024 but, while preparing for the demo, we observed scalability issues that will be investigated further this week.
- This approach will also be used to submit jobs to CloudVeneto with an interlink-to-interlink plugin.
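
The decoupling described above can be illustrated with a minimal Python sketch. It is not the actual plugin code: `PodSpec`, `translate_to_singularity`, `Backend`, `InMemoryBackend` and `offload` are hypothetical names introduced only for illustration, and the NATS/websocket transport and the real Condor submission are deliberately left out. The point is only that the translation step is written once, while each backend supplies its own submission and retrieval logic.

```python
import shlex
from dataclasses import dataclass
from typing import Protocol


@dataclass
class PodSpec:
    """Hypothetical, heavily simplified stand-in for a Kubernetes pod."""
    name: str
    image: str
    command: list[str]


def translate_to_singularity(pod: PodSpec) -> str:
    """Backend-independent step: build a Singularity job script from the pod."""
    args = " ".join(shlex.quote(arg) for arg in pod.command)
    return f"#!/bin/sh\nsingularity exec docker://{pod.image} {args}\n"


class Backend(Protocol):
    """Only these two methods change from one backend to another."""

    def submit(self, script: str) -> str: ...

    def retrieve(self, job_id: str) -> str: ...


class InMemoryBackend:
    """Placeholder backend (assumption): stores scripts instead of running them."""

    def __init__(self, prefix: str) -> None:
        self.prefix = prefix
        self.jobs: dict[str, str] = {}

    def submit(self, script: str) -> str:
        job_id = f"{self.prefix}-{len(self.jobs) + 1}"
        self.jobs[job_id] = script
        return job_id

    def retrieve(self, job_id: str) -> str:
        return "submitted" if job_id in self.jobs else "unknown"


def offload(pod: PodSpec, backend: Backend) -> str:
    """Translate once, then hand the same script to whichever backend is given."""
    return backend.submit(translate_to_singularity(pod))


# The same pod goes to two different (placeholder) backends unchanged.
pod = PodSpec("demo", "python:3.12", ["python", "-c", "print('hello')"])
condor_like = InMemoryBackend("condor")
other = InMemoryBackend("other")
condor_id = offload(pod, condor_like)
other_id = offload(pod, other)
```

In this sketch, supporting a new backend means writing a new class with `submit` and `retrieve`; `translate_to_singularity` is untouched.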
FPGA purchase
Status legend
Active
Priority
Problems
Postponed or Blocked by others
Completed