Work storage /cluster/work/ partially available (10 May 2022)
This morning a storage controller crashed, which affects the /cluster/work storage. Parts of /cluster/work/ are temporarily unavailable. Our storage specialists are in close contact with the vendor and are working on bringing the storage system back as fast as possible. Please note that only some users are affected by this incident, not all.
If you have any command in your .bashrc or .bash_profile that accesses a storage volume that is temporarily unavailable, your login might get stuck. If you encounter this problem, please write to cluster-support@id.ethz.ch and we can comment out those commands in your .bashrc and/or .bash_profile so that you can log in to Euler again.
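If you would rather adjust your dotfiles yourself, one possible workaround is to guard such commands so they are skipped when the volume does not respond. Below is a minimal sketch, assuming a hypothetical volume path /cluster/work/example (replace it with your own volume); note that a command already hung in D-state cannot be interrupted by a timeout, so commenting the lines out entirely remains the safer option:

    # Hypothetical guard in ~/.bashrc: skip storage-dependent setup if the
    # volume does not answer within 2 seconds. /cluster/work/example is a
    # placeholder, not a real volume.
    if timeout 2 ls /cluster/work/example > /dev/null 2>&1; then
        export SCRATCH_DIR=/cluster/work/example/$USER   # illustrative variable
    fi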
Affected volumes:
- /cluster/work/anasta
- /cluster/work/beltrao
- /cluster/work/bewi
- /cluster/work/biol
- /cluster/work/bmlbb
- /cluster/work/borgw
- /cluster/work/bsse_sdsc
- /cluster/work/cemk
- /cluster/work/chenp
- /cluster/work/cobi
- /cluster/work/compmech
- /cluster/work/coss
- /cluster/work/cotterell
- /cluster/work/cpesm
- /cluster/work/demello
- /cluster/work/drzrh
- /cluster/work/faist
- /cluster/work/fcoletti
- /cluster/work/flatt
- /cluster/work/gdc
- /cluster/work/gess
- /cluster/work/gfb
- /cluster/work/grewe
- /cluster/work/hahnl
- /cluster/work/harra
- /cluster/work/hilliges
- /cluster/work/hora
- /cluster/work/ibk_chatzi
- /cluster/work/ifd
- /cluster/work/igp_psr
- /cluster/work/igt_tunnel
- /cluster/work/infk_mtc
- /cluster/work/itphys
- /cluster/work/ivt_vpl
- /cluster/work/jesch
- /cluster/work/karlen
- /cluster/work/kovalenko
- /cluster/work/krek
- /cluster/work/kurtcuoglu
- /cluster/work/lav
- /cluster/work/lke
- /cluster/work/lpc
- /cluster/work/mandm
- /cluster/work/mansuy
- /cluster/work/math
- /cluster/work/moor
- /cluster/work/nenad
- /cluster/work/nme
- /cluster/work/pacbio
- /cluster/work/pausch
- /cluster/work/petro
- /cluster/work/pueschel
- /cluster/work/puzrin
- /cluster/work/qchem
- /cluster/work/reddy
- /cluster/work/refcosmo
- /cluster/work/reiher
- /cluster/work/riner
- /cluster/work/rjeremy
- /cluster/work/rre
- /cluster/work/rsl
- /cluster/work/sachan
- /cluster/work/schneider
- /cluster/work/sis
- /cluster/work/sorkine
- /cluster/work/sornette
- /cluster/work/stocke
- /cluster/work/sunagawa
- /cluster/work/swissloop
- /cluster/work/tnubank
- /cluster/work/treutlein
- /cluster/work/woern
- /cluster/work/yang
Volumes not affected:
- /cluster/work/climate
- /cluster/work/cmbm
- /cluster/work/cvl
- /cluster/work/gfd
- /cluster/work/igc
- /cluster/work/magna
- /cluster/work/noiray
- /cluster/work/refregier
- /cluster/work/tnu
- /cluster/work/wenderoth
- /cluster/work/zhang
We will update this news item whenever new information becomes available.
We are sorry for the inconvenience.
Updates
- 2022-05-10 13:20
- The problem with the storage controller could not be fixed; the controller needs to be replaced. We don't know yet how long it will take until /cluster/work is back to normal operation (our current guess is 24 to 96 hours). After the replacement we will also run some integrity checks on the data. We will publish another update later this afternoon.
- 2022-05-10 16:30
- The vendor is sending a new controller, which is already on the way to the data center. We will publish another update tomorrow morning.
- 2022-05-11 11:55
- We are still working on fixing the problem with the affected /cluster/work volumes. We now prevent jobs of users who own affected volumes from starting. This ensures that jobs don't get stuck in D-state (uninterruptible sleep) when trying to access an affected volume; see the sketch after these updates.
- 2022-05-11 15:50
- The controller has now been replaced and we have started running the file system and data integrity checks. We will publish another update on Thursday afternoon.
- 2022-05-11 21:00
- The filesystem is online again from the login nodes. Some of the filesystem checks have finished and the integrity checks have been successful so far. The filesystem is, however, running in degraded performance mode until the integrity checks finish and new storage hardware is installed. This means that jobs will remain blocked until further notice. This situation might last until Friday morning, when we expect the system to be back to nominal status and all jobs to be released. We continue to work with our HPC storage vendor on a definitive resolution of the problem.
- 2022-05-12 18:00
- The integrity checks are continuing and should finish around midnight. So far no major issues have been found, so we are progressively allowing more jobs to run. We expect operation to be back to normal on Friday morning.
- 2022-05-13 08:25
- The integrity checks have finished and we no longer prevent jobs from starting. There is still one final change required that will cause a short interruption.
- 2022-05-13 10:20
- All the operations on the storage system are finished. The system is back to nominal status. No further operations are required for now.
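For reference, processes stuck in D-state (uninterruptible sleep, typically blocked on I/O to an unavailable volume) cannot be killed, which is why affected jobs were blocked from starting. A minimal sketch using standard tools to spot such processes:

    # List processes in uninterruptible sleep (state D), typically blocked on I/O.
    ps -eo state,pid,user,cmd | awk '$1 == "D"'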