Outage 2023-07-03

From ScientificComputing

Current Status

A power interruption in the CSCS datacenter where Euler is located has caused a loss of networking and the reboot of many compute nodes at around 09:45 today.

Jobs running on any rebooted compute nodes were lost.

Most compute nodes are online and all queues are open.

Due to a bug in the Nvidia CUDA driver, a series of GPU jobs failed with the state NODE_FAIL.
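
To check whether your own jobs were affected, you can look for the NODE_FAIL state in the Slurm accounting records. The following is only a minimal sketch, not an official tool: the function names are hypothetical, and it assumes that sacct is available on the login nodes and that the output fields are requested in the order JobID, JobName, State, NodeList.

```python
import subprocess


def query_sacct(start="2023-07-03"):
    """Return raw sacct output for jobs in state NODE_FAIL since `start`.

    Assumption: sacct is on PATH and accounting data is readable by the user.
    """
    cmd = [
        "sacct",
        "--allocations",   # one line per job, no job steps
        "--noheader",
        "--parsable2",     # fields separated by "|", no trailing delimiter
        "--starttime", start,
        "--state", "NODE_FAIL",
        "--format", "JobID,JobName,State,NodeList",
    ]
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout


def parse_sacct(output):
    """Parse "JobID|JobName|State|NodeList" lines into dictionaries."""
    jobs = []
    for line in output.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split from both ends so a "|" inside the job name does not break parsing.
        job_id, rest = line.split("|", 1)
        name, state, nodes = rest.rsplit("|", 2)
        jobs.append({"job_id": job_id, "name": name, "state": state, "nodes": nodes})
    return jobs


def node_fail_jobs(jobs):
    """Keep only the jobs whose state is exactly NODE_FAIL."""
    return [job for job in jobs if job["state"] == "NODE_FAIL"]
```

Calling node_fail_jobs(parse_sacct(query_sacct())) on a login node would list the affected jobs together with the nodes they ran on. Such jobs need to be resubmitted.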

Updates

2023-07-03 10:00
An incident in the CSCS datacenter where Euler is located has caused the majority of compute nodes to become unavailable at around 09:45 on 3 July 2023.
2023-07-03 10:05
A network problem is a symptom of the outage, not its cause.
2023-07-03 11:00
A brief power outage in the CSCS datacenter in Lugano caused an interruption in the network as well as the reboot of many compute nodes. Jobs running on any rebooted compute nodes are lost.
2023-07-03 11:15
We are working on bringing up the rebooted compute nodes. We will also progressively re-open the queues, starting with the shortest ones.
2023-07-04 10:00
Some jobs on GPU nodes are failing with the error NODE_FAIL. We are investigating.
2023-07-05 09:00
Recently started jobs should no longer fail with the NODE_FAIL error. The affected nodes have been taken out of Slurm.
2023-07-05 15:00
The cause of the jobs failing with NODE_FAIL has been found: a bug in the newest Nvidia CUDA driver.