AI_INFN Technical Meeting

Europe/Rome
Description

Virtual meeting room (zoom): https://l.infn.it/ai-infn-meeting

Date: 2024-07-01

  • New managed environments added to the platform dashboard.
  • Several timeout errors were reported when spawning notebooks. They appear to be genuine timeouts, so the spawn timeout will be increased.

Tracked developments:

:arrow_forward: Automation of RKE2 deployments in INFN Cloud

  • Giuseppe confirms that work on deploying the service for the SBOM files of the cluster images is ongoing; he is developing a Puppet module for the setup.
  • Lucio asks whether the recipes for the multi-master configuration and the RKE2 cluster upgrade can be shared. It was agreed to perform the upgrade in a joint session, possibly not this week.

:arrow_forward: Develop monitoring and accounting infrastructure (R. Petrini)

  • [Open since last week] We need an automated backup of the PostgreSQL database used for accounting.
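
As a starting point for the backup action above, here is a minimal sketch of how a timestamped dump could be prepared. It is not the agreed solution: the database name `accounting`, the `/backups` directory, the host, the user, and the retention of 7 dumps are all placeholders. It relies only on the standard `pg_dump` options `--host`, `--username`, `--format=custom` and `--file`.

```python
from datetime import datetime, timezone
from pathlib import Path


def build_backup_command(dbname, backup_dir, now=None, host="localhost", user="postgres"):
    """Return (command, output_path) for a timestamped pg_dump of `dbname`.

    All defaults are placeholders, not the actual accounting setup.
    """
    now = now or datetime.now(timezone.utc)
    stamp = now.strftime("%Y%m%dT%H%M%SZ")
    output_path = Path(backup_dir) / f"{dbname}-{stamp}.dump"
    command = [
        "pg_dump",
        "--host", host,
        "--username", user,
        "--format=custom",          # compressed archive, restorable with pg_restore
        "--file", str(output_path),
        dbname,
    ]
    return command, output_path


def prune_old_backups(names, keep=7):
    """Return the backup names to delete, keeping the `keep` most recent.

    Works because the timestamp in the name sorts chronologically.
    """
    ordered = sorted(names, reverse=True)
    return ordered[keep:]
```

The returned command could be passed to `subprocess.run(command, check=True)` from a scheduled job (for example a cron entry or a Kubernetes CronJob), with credentials supplied through a `.pgpass` file rather than on the command line.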

:arrow_forward: Environment setup (M. Barbetti, S. Giagu, S. Bordoni, L. Cappelli)

Matteo reports on the five new environments now available through the dashboard:

  • TF 2.14
  • TF 2.16 + JAX
  • PyTorch + JAX
  • TF 2.16 + JAX + QC (an open issue with PennyLane does not prevent its use, but requires an ugly workaround)
  • PyTorch + QC + JAX

Stefano Giagu has prepared a notebook with tests
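
Independently of that notebook, a quick smoke check of an environment could look like the sketch below. It is not Stefano's test suite, and the import names (`tensorflow`, `jax`, `torch`, `pennylane`) are assumptions about what each environment ships.

```python
import importlib.util

# Assumed top-level import names per environment; adjust to the real contents.
EXPECTED_MODULES = {
    "TF 2.14": ["tensorflow"],
    "TF 2.16 + JAX": ["tensorflow", "jax"],
    "PyTorch + JAX": ["torch", "jax"],
    "TF 2.16 + JAX + QC": ["tensorflow", "jax", "pennylane"],
    "PyTorch + QC + JAX": ["torch", "pennylane", "jax"],
}


def missing_modules(module_names):
    """Return the top-level modules that cannot be found in the current environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]
```

Running `missing_modules(EXPECTED_MODULES["PyTorch + JAX"])` in a notebook started from that environment should return an empty list if the assumed packages are installed.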

Laura and Matteo will try out the notebooks this week.

An action is pending on Matteo to document the contents of the environments in the RTD. At the moment, the environments are documented in a self-contained README.md, but this is not standard and users might not look for that file. It is fine as a temporary solution.

:arrow_forward: Offloading tests with virtual kubelets (G. Bianchini, D. Ciangottini)

  • A new version of the VK and of interLink increases resilience against unavailability of the remote backend. It has been thoroughly tested.
  • The virtual node in the cluster must be updated.
  • The docker plugin in the T4 is updated.
  • A lot of work is ongoing to provide telemetry capabilities to the interLink application. It relies on Jaeger and Grafana, both running on the master cluster.

:arrow_forward: FPGA purchase

  • NTR

:arrow_forward: Advanced Hackathon

  • We are waiting for confirmation of the availability of the rooms in Padua for the weeks of 18/11 and 25/11.

Status legend

:arrow_forward: Active
:fast_forward: Priority
:bangbang: Problems
:parking: Postponed or Blocked by others
:white_check_mark: Completed

    • 16:00 16:15
      News and setup 15m
      Speaker: Lucio Anderlini (Istituto Nazionale di Fisica Nucleare)
    • 16:15 16:50
      Discussion on tasks and priorities 35m
      Speaker: All
    • 16:50 17:00
      Any other business 10m