Slurm broken (04.03.2024)
From ScientificComputing
The Slurm batch system experienced an error when updating its configuration.
As a result, most batch jobs that were running at around 16:50 on 4 March 2024 were killed by Slurm and ended in the "NODE_FAIL" job state.
We will start to re-open the partitions (queues) later this evening after ensuring the health of the batch system and compute nodes.
We apologize for the inconvenience and disruption this has caused.
Updates
- 2024-03-04 21:10: The Slurm partitions (queues) have been re-opened.