Issues with /cluster/work file system (31.07.2024)

We have been experiencing a series of (unrelated) issues which affected

/cluster/work

in the last few days. As a result, accessing /cluster/work was slow and the file system was at times unresponsive. We are investigating the problem and working to resolve it as soon as possible.

We are sorry for the inconvenience.

Updates

2024-08-05 16:50
The file system check finished much faster than expected. Nodes are back online and the queues are processing jobs again. The system status has been set back to green.
2024-08-05 14:00
Due to instabilities of /cluster/work, we have to start the file system check now (14:00). The check will run for about 8 hours.
2024-08-05 10:10
Tonight at 11:00 PM, we will start a low-level file system check on /cluster/work because of the ongoing issues of the past days. The check is estimated to finish tomorrow around 8:00 AM. We have set the system status of Euler to orange, as the queues will be closed during that time.
  • The WORK file system will be completely OFFLINE during this operation
  • The batch system will not start new jobs today if their expected run time overlaps with the maintenance
  • In particular, 24h and 120h jobs will be held in the queue until tomorrow; jobs in the 4h queues will be started until 19:00 tonight
  • These restrictions apply to ALL batch jobs, including those submitted via JupyterHub, as well as those that do NOT use /cluster/work
  • Jobs and Jupyter sessions that have already started and will still be running after 23:00 will HANG when they try to access /cluster/work; jobs that do not use that file system will not be affected
  • Accessing /cluster/work on the login nodes will hang during the check.
2024-07-31 18:00
These issues have now been resolved. /cluster/work is accessible but running in a degraded mode, which may affect its performance over the coming days. We will continue working behind the scenes to make sure the system runs stably in the future.