Slurm broken (04.03.2024)

From ScientificComputing
Revision as of 21:15, 4 March 2024 by Urbanb (talk | contribs)

The Slurm batch system experienced an error when updating its configuration.

As a result, most batch jobs that were running at around 16:50 on 4 March 2024 were killed by Slurm and ended in the job state "NODE_FAIL".
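To check whether your own jobs were affected, you can query the Slurm accounting database with sacct. The following is a minimal sketch, not an official tool: it assumes that sacct is available on the login node, that job accounting is enabled, and that killed jobs are recorded with the state NODE_FAIL. The time window, the field list, and the function names are illustrative assumptions.

```python
import subprocess

# Assumed sacct invocation (hypothetical time window and field list):
# -X reports one row per job allocation, --parsable2 separates fields with "|".
SACCT_CMD = [
    "sacct", "-X", "--noheader", "--parsable2",
    "--starttime", "2024-03-04T00:00:00",
    "--endtime", "2024-03-05T00:00:00",
    "--format", "JobID,JobName,State,End",
]


def find_node_fail_jobs(sacct_output):
    """Return (job_id, job_name, end_time) for every row in state NODE_FAIL."""
    affected = []
    for line in sacct_output.splitlines():
        fields = line.strip().split("|")
        if len(fields) != 4:
            # Skip blank or malformed rows instead of guessing their layout.
            continue
        job_id, job_name, state, end_time = fields
        if state.startswith("NODE_FAIL"):
            affected.append((job_id, job_name, end_time))
    return affected


def list_my_node_fail_jobs():
    """Run sacct for the current user and return the affected jobs."""
    result = subprocess.run(SACCT_CMD, capture_output=True, text=True, check=True)
    return find_node_fail_jobs(result.stdout)
```

Calling list_my_node_fail_jobs() on a login node returns the jobs that need to be resubmitted. The parsing step is kept separate from the sacct call so that it can be checked against saved output.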

We will start to re-open the partitions (queues) later this evening after ensuring the health of the batch system and compute nodes.

We apologize for the inconvenience and disruption this has caused.

Updates

2024-03-04 21:10
The Slurm partitions (queues) have been re-opened.