Euler power outage (5 July 2018)
Last night, shortly after 1 AM, a thunderstorm in Lugano caused a partial power outage in the CSCS data centre. Most of Euler's compute nodes went down, and all jobs that were running on them crashed. (LSF will report their status as "UNKNOWN" until the nodes are rebooted.)
The cluster’s storage systems, which are connected to an uninterruptible power supply (UPS), survived the outage, apparently without data loss.
The cluster team is busy bringing the cluster back on-line and testing all its components. The login nodes are up and accessible normally. Batch queues will remain inactive until we are sure that the cluster is healthy.
Sorry for the inconvenience.
- 2018-07-05 13:00
- We are progressively opening the batch queues, starting with the shortest ones.
- 2018-07-05 17:00
- All batch queues are open and active; Euler is fully operational. Hopefully there won't be thunderstorms again tonight :-/