Euler maintenance (October 2022)

CSCS informed us that the data center will be undergoing a major power and cooling maintenance on Wednesday 19 October 2022, which will require a complete shutdown of Euler.

Tentative schedule (subject to change):

Tue 18 Oct, early morning: Start of power-down procedure, Euler off-line
Wed 19 Oct, whole day: CSCS maintenance, Euler completely down
Thu 20 Oct, whole day: Power-on of networks, storage system, admin nodes
Fri 21 Oct, noon: Power-on of login nodes, access to storage possible; compute nodes still down
Sat 22 Oct – Sun 23 Oct: Power-on and testing of all compute nodes
Mon 24 Oct, afternoon: Gradual reactivation of batch queues: first 4h, then 24h, and finally 120h

As usual, batch queues will be progressively deactivated in the days and hours before the maintenance, to ensure that no jobs get killed when the cluster is shut down. Short jobs can still run until the cluster is taken off-line. You will not be able to access your data during the first phase of the maintenance window (Tuesday morning to Friday noon). If all goes well, Euler will start running jobs again on the afternoon of Monday, 24 October and will be fully operational by the evening.
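The reasoning behind progressive deactivation can be illustrated with a little arithmetic: a queue with a run-time limit of N hours has to stop starting jobs at least N hours before the shutdown, so the longest queue closes first. The following is only a sketch of that calculation, not the actual scheduler configuration. The shutdown time of 06:00 is an assumption (the schedule only says "early morning"), the function names are hypothetical, and the 4h, 24h and 120h limits are taken from the schedule above.

```python
from datetime import datetime, timedelta

# ASSUMPTION: the page only says "Tue 18 Oct, early morning";
# 06:00 is a placeholder chosen for illustration.
SHUTDOWN = datetime(2022, 10, 18, 6, 0)

# Run-time limits of the queues named in the schedule (4h, 24h, 120h).
QUEUE_LIMITS_HOURS = [4, 24, 120]


def latest_start(limit_hours, shutdown=SHUTDOWN):
    """Latest time a job using the full run-time limit can start
    and still finish before the shutdown."""
    return shutdown - timedelta(hours=limit_hours)


def drain_schedule(limits=QUEUE_LIMITS_HOURS, shutdown=SHUTDOWN):
    """Map each queue limit (hours) to its cutoff time, longest queue first."""
    return {h: latest_start(h, shutdown) for h in sorted(limits, reverse=True)}


if __name__ == "__main__":
    for hours, cutoff in drain_schedule().items():
        print(f"{hours:>3}h queue: last start {cutoff:%a %d %b %H:%M}")
```

Under the assumed 06:00 shutdown, the 120h queue would have to close five days earlier, while 4h jobs could still start until shortly before the power-down. This matches the statement that short jobs can run until the cluster is taken off-line.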

This is for your information only. No action is required on your part.

We are sorry for the inconvenience.

We will update this page in the days before and during the maintenance. Please come back here to get the latest information.

Status updates

2022-10-19, 22:00: Power in the data center was restored around 15:30 this afternoon so we could start the power-on procedure of Euler a bit earlier than planned.

2022-10-21, 11:45: All storage systems are operational, login nodes are open, and most compute nodes are up. Performance and stability tests are in progress; batch queues will be progressively reactivated this afternoon.

2022-10-21, 16:00: The 4h queues in LSF and Slurm are active. If no problems appear, the 24h queues will be activated in 1-2 hours.

2022-10-21, 19:00: All queues in LSF and Slurm are active and the cluster is operating normally. A number of nodes are not operational yet: some cannot boot, some are broken, and some have other problems. We'll take care of them next week.