Euler power outage (10 Aug 2018)

From ScientificComputing
Revision as of 07:16, 13 August 2018 by Byrdeo (talk | contribs)

Due to a fire in the electrical substation providing power to CSCS, most of the compute nodes went down around 15:10 today. Furthermore, we were asked by CSCS to perform an emergency shutdown of Euler to save UPS power for critical systems. All jobs running on Euler have been lost.

We will update this page as the situation evolves.

Updates

10 August, 15:45: All compute nodes are down, and the storage systems are being gradually shut down.

10 August, 16:15: According to CSCS, it will be several hours before the electricity provider can give us more information about the situation. Euler will therefore remain down for the whole weekend.

13 August, 08:00: The cluster team worked throughout the weekend to restore network connectivity and bring up the cluster's administration servers and storage systems (NetApp, Panasas, Lustre). So far everything is working: all file systems (apps, home, project, scratch, work) are healthy, with no data loss. Today we will gradually power up and test all compute nodes, as well as the InfiniBand network between them.