Euler power outage (10 Aug 2018)


Due to a fire in the electrical substation providing power to CSCS, most of the compute nodes went down around 15:10 today. Furthermore, we were asked by CSCS to perform an emergency shutdown of Euler to save UPS power for critical systems. All jobs running on Euler have been lost.

We will update this page as the situation evolves.

Updates

2018-08-10 15:45
All compute nodes are down; storage systems are being gradually shut down.
2018-08-10 16:15
According to CSCS, it will be several hours before the electricity provider can give us more information about the situation. Euler will therefore remain down for the whole weekend.
2018-08-13 08:00
The cluster team worked throughout the weekend to restore network connectivity and bring up the cluster's administration servers and storage systems (NetApp, Panasas, Lustre). So far everything works fine: all file systems (apps, home, project, scratch, work) are healthy, with no data loss. Today we will gradually power up and test all compute nodes, as well as the InfiniBand network between them.
2018-08-13 10:15
We are experiencing problems with some cluster management tools. Consequently, Euler will probably remain off-line today.
2018-08-13 17:00
Most of the compute nodes have been powered up and have passed health and performance tests. However, we are still working on a storage issue, so Euler will remain off-line until tomorrow. (We will try to schedule a few large jobs manually tonight.)
2018-08-14 09:30
Batch queues for 4h and 24h jobs are active again. Since we are still working on /cluster/scratch, this file system is accessible read-only at this time. Jobs that require it will be held in the queue until this file system is fully operational.
2018-08-14 09:40
We plan to reopen Euler at 10:00 today.
2018-08-14 10:00
Euler is open again. Please keep in mind that /cluster/scratch is accessible read-only until further notice.
2018-08-15 11:00
We are making good progress with /cluster/scratch. If all goes well, it should be fully functional (read+write) later today.
2018-08-15 17:15
The scratch file system has been migrated to a new storage system and is fully functional again. Jobs that depend on it will be progressively released from the queue and will start as soon as the requested resources are available.