Difference between revisions of "Euler power outage (10 Aug 2018)"

Revision as of 07:43, 14 August 2018

Due to a fire in the electrical substation providing power to CSCS, most of the compute nodes went down around 15:10 on 10 August 2018. Furthermore, CSCS asked us to perform an emergency shutdown of Euler to save UPS power for critical systems. All jobs running on Euler have been lost.

We will update this page as the situation evolves.

Updates

10 Aug, 15:45 — All compute nodes are down; storage systems are being gradually shut down.

10 Aug, 16:15 — According to CSCS, it will be several hours before the electricity provider can give us more information about the situation. Euler will therefore remain down for the whole weekend.

13 Aug, 08:00 — The cluster team worked throughout the weekend to restore network connectivity and bring up the cluster's administration servers and storage systems (NetApp, Panasas, Lustre). So far everything works: all file systems (apps, home, project, scratch, work) are healthy, with no data loss. Today we will gradually power up and test all compute nodes, as well as the InfiniBand network between them.

13 Aug, 10:15 — We are experiencing problems with some cluster management tools. Consequently, Euler will probably remain off-line today.

13 Aug, 17:00 — Most of the compute nodes have been powered up and have passed health and performance tests. However, we are still working on a storage issue, so Euler will remain off-line until tomorrow. (We will try to schedule a few big jobs manually as early as tonight.)

14 Aug, 09:30 — Batch queues for 4h and 24h jobs are active again. Since we are still working on /cluster/scratch, this file system is currently accessible read-only. Jobs that require it will be held in the queue until it is fully operational.

14 Aug, 09:40 — We plan to reopen Euler at 10:00 today.