Slurm hardware migration 2023-09-04

On the morning of Monday, 4 September 2023, the Slurm services will be migrated to newer hardware.

During this time we will stop the queues (the partitions will be in the down state).

Slurm commands such as squeue will also be unavailable for a short time during this migration.


2023-09-04 09:10
The migration is complete.
2023-09-04 10:20
Any job notification emails sent between approximately 08:00 and 10:15 were lost and will not be delivered.
2023-09-04 12:20
The partitions are down. We are investigating and will bring them back as soon as possible.
2023-09-04 13:20
The partitions have been re-opened.
2023-09-04 15:40
Many jobs that started before the migration and finished after it are stuck in the Completing state. These jobs generally completed successfully. We are looking into resolving this gracefully.
2023-09-04 17:40
Jobs that started before the migration and finished before 17:00 may have the status Cancelled, even though the calculations completed successfully. Jobs that started before the migration but finished after 17:00 are expected to complete normally.
2023-09-05 09:00
Interactive srun issue: srun jobs submitted from the login nodes never launch, and the srun command hangs.
2023-09-05 15:30
Interactive srun issue: this issue has been resolved. srun jobs can be run normally again.
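
To see whether any of your own jobs are among those reported above as stuck in the Completing state, something like the following Python sketch may help. It is an illustration only, not an official tool of this cluster: the function names are made up for this example, and it assumes that squeue is on your PATH and that its %T format field reports the state as COMPLETING.

```python
import getpass
import subprocess


def parse_squeue(output):
    """Parse lines of the form 'jobid|state' into (jobid, state) tuples."""
    jobs = []
    for line in output.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        job_id, _, state = line.partition("|")
        jobs.append((job_id.strip(), state.strip()))
    return jobs


def completing_jobs(jobs):
    """Return the IDs of jobs whose state is COMPLETING."""
    return [job_id for job_id, state in jobs if state == "COMPLETING"]


def query_my_jobs():
    """Ask squeue for the current user's jobs (hypothetical helper).

    -h suppresses the header, -u selects the user, and -o sets the
    output format: %i is the job ID and %T the job state.
    """
    result = subprocess.run(
        ["squeue", "-h", "-u", getpass.getuser(), "-o", "%i|%T"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_squeue(result.stdout)
```

On a login node, completing_jobs(query_my_jobs()) would then return the IDs of your jobs that are still in the Completing state. The parsing is kept separate from the squeue call so that it can be tried on sample text without access to the cluster.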