Power outage 2023-08-29

From ScientificComputing
Revision as of 10:16, 30 August 2023 by Sfux (talk | contribs)

Due to a short power outage in the CSCS datacenter, hundreds of compute nodes went down around 11:15 on 2023-08-29. All jobs running on these compute nodes were lost.

Many of these nodes rebooted and came back up when the power was restored, but some were left in a bad state. We are currently investigating this issue with CSCS.

Updates

2023-08-29 16:40
While we investigate a network issue, all Euler VII nodes, which represent almost ⅔ of all CPU nodes, remain closed for the time being.
2023-08-30 11:15
We have resolved the network issue, and the cluster is fully operational again.