Work storage /cluster/work/ partially available (10 May 2022)

From ScientificComputing
Revision as of 18:58, 11 May 2022 by Morenod (talk | contribs) (Updates)

This morning a storage controller crashed, which affects the /cluster/work storage system. Parts of /cluster/work are temporarily unavailable. Our storage specialists are in close contact with the vendor and are working to bring the storage system back as quickly as possible. Please note that only some users are affected by this incident, not all.

If your .bashrc or .bash_profile contains a command that accesses a storage volume that is temporarily unavailable, your login might get stuck. If you encounter this problem, please write to cluster-support@id.ethz.ch and we can comment out those commands in your .bashrc and/or .bash_profile so that you can log in to Euler again.
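For users who can still log in and want to protect their startup files themselves, the following is a minimal sketch of one possible guard, not an official recommendation. It assumes a hypothetical helper function `is_affected_volume`, a hypothetical variable `AFFECTED_VOLUMES` that you fill with the volumes you use from the list of affected volumes, and a hypothetical directory `TOOLS_DIR`. The helper compares path strings only and never touches the file system, because even a simple `ls` or `test -d` on an unavailable volume may block.

```shell
# Hypothetical helper for ~/.bashrc: decide from the path string alone
# whether a directory lives on a volume listed as affected.
# It performs no file system access, so it cannot block on a stuck volume.

# Assumption: fill in the affected volumes that you actually use.
AFFECTED_VOLUMES="/cluster/work/math /cluster/work/biol"

is_affected_volume() {
    local path="$1" vol
    for vol in $AFFECTED_VOLUMES; do
        case "$path" in
            # Match the volume itself or anything below it,
            # but not a sibling such as /cluster/work/mathematics.
            "$vol"|"$vol"/*) return 0 ;;
        esac
    done
    return 1
}

# Example use: skip a setup script that lives on an affected volume.
# TOOLS_DIR and setup.sh are made-up names for illustration.
TOOLS_DIR="/cluster/work/math/example_user/tools"
if ! is_affected_volume "$TOOLS_DIR"; then
    . "$TOOLS_DIR/setup.sh"
fi
```

Once the volumes are available again, setting `AFFECTED_VOLUMES` to an empty string restores the normal behaviour without further changes to the startup file.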

Affected volumes:

  • /cluster/work/anasta
  • /cluster/work/beltrao
  • /cluster/work/bewi
  • /cluster/work/biol
  • /cluster/work/bmlbb
  • /cluster/work/borgw
  • /cluster/work/bsse_sdsc
  • /cluster/work/cemk
  • /cluster/work/chenp
  • /cluster/work/cobi
  • /cluster/work/compmech
  • /cluster/work/coss
  • /cluster/work/cotterell
  • /cluster/work/cpesm
  • /cluster/work/demello
  • /cluster/work/drzrh
  • /cluster/work/faist
  • /cluster/work/fcoletti
  • /cluster/work/flatt
  • /cluster/work/gdc
  • /cluster/work/gess
  • /cluster/work/gfb
  • /cluster/work/grewe
  • /cluster/work/hahnl
  • /cluster/work/harra
  • /cluster/work/hilliges
  • /cluster/work/hora
  • /cluster/work/ibk_chatzi
  • /cluster/work/ifd
  • /cluster/work/igp_psr
  • /cluster/work/igt_tunnel
  • /cluster/work/infk_mtc
  • /cluster/work/itphys
  • /cluster/work/ivt_vpl
  • /cluster/work/jesch
  • /cluster/work/karlen
  • /cluster/work/kovalenko
  • /cluster/work/krek
  • /cluster/work/kurtcuoglu
  • /cluster/work/lav
  • /cluster/work/lke
  • /cluster/work/lpc
  • /cluster/work/mandm
  • /cluster/work/mansuy
  • /cluster/work/math
  • /cluster/work/moor
  • /cluster/work/nenad
  • /cluster/work/nme
  • /cluster/work/pacbio
  • /cluster/work/pausch
  • /cluster/work/petro
  • /cluster/work/pueschel
  • /cluster/work/puzrin
  • /cluster/work/qchem
  • /cluster/work/reddy
  • /cluster/work/refcosmo
  • /cluster/work/reiher
  • /cluster/work/riner
  • /cluster/work/rjeremy
  • /cluster/work/rre
  • /cluster/work/rsl
  • /cluster/work/sachan
  • /cluster/work/sorkine
  • /cluster/work/sornette
  • /cluster/work/stocke
  • /cluster/work/swissloop
  • /cluster/work/woern
  • /cluster/work/yang

Volumes not affected:

  • /cluster/work/climate
  • /cluster/work/cmbm
  • /cluster/work/cvl
  • /cluster/work/gfd
  • /cluster/work/igc
  • /cluster/work/magna
  • /cluster/work/noiray
  • /cluster/work/refregier
  • /cluster/work/tnu
  • /cluster/work/wenderoth
  • /cluster/work/zhang

We will update this news item whenever new information becomes available.

We are sorry for the inconvenience.

Updates

2022-05-10 13:20
The problem with the storage controller could not be fixed; the controller needs to be replaced. We do not know yet how long it will take until /cluster/work is back to normal operation (our current estimate is 24 to 96 hours). After the replacement we will also run some integrity checks on the data. We will publish another update later this afternoon.
2022-05-10 16:30
The vendor is sending a new controller, which is already on its way to the data center. We will publish another update tomorrow morning.
2022-05-11 11:55
We are still working on fixing the problem with the affected /cluster/work volumes. We now prevent jobs of users who own affected volumes from starting. This ensures that jobs do not get stuck in D-state (uninterruptible sleep) when trying to access an affected volume.
2022-05-11 15:50
The controller has now been replaced and we have started running the file system and data integrity checks. We will publish another update on Thursday afternoon.
2022-05-11 21:00
The file system is online again from the login nodes. Some of the file system checks have finished successfully, and the integrity checks have been successful so far. However, the file system is running in degraded performance mode until the integrity checks finish and new storage hardware is installed. This means that blocked jobs will remain blocked until further notice. This situation might last until Friday morning, when we expect the system to be back to nominal status and all jobs to be released. We continue to work with our HPC storage vendor on a definitive resolution of the problem.