Slurm broken (04.03.2024)


The Slurm batch system experienced an error when updating its configuration.

As a result, most batch jobs that were running at around 16:50 on 4 March 2024 were killed by Slurm and ended in the job state "NODE_FAIL".

We will start to re-open the partitions (queues) later this evening after ensuring the health of the batch system and compute nodes.

We apologize for the inconvenience and disruption this has caused.

Updates

2024-03-04 21:10
The Slurm partitions (queues) have been re-opened.