Infrastructure incident at CSCS 2023-10-07

From ScientificComputing
Revision as of 07:29, 16 October 2023 by Sfux


On Saturday, 7 October 2023, at around 19:00, a number of compute nodes went down due to an infrastructure incident at CSCS. The cause of this incident is currently being investigated. The affected compute nodes will be brought back into operation as soon as possible.

This infrastructure incident affected only a limited number of compute nodes and had no impact on the operation of Euler as a whole. In particular, all the cluster's file servers and the vast majority of its compute nodes remained up and running throughout this incident.

Unfortunately, all jobs that were running on the affected nodes have been lost.

We are sorry for the inconvenience.


Please check this wiki page for further updates.

Updates

2023-10-16 08:30
Most of the compute nodes that went down were brought back into operation on Monday/Tuesday. On Tuesday/Wednesday, some problems with the mounting of storage systems on Euler VIII were resolved. Since Wednesday, the cluster has been back to normal operation.