Difference between revisions of "Work storage /cluster/work/ partially available (10 May 2022)"

From ScientificComputing

Revision as of 13:49, 11 May 2022

This morning a storage controller crashed, which affects the /cluster/work storage. Parts of /cluster/work are temporarily unavailable. Our storage specialists are in close contact with the vendor and are working on bringing the storage system back as fast as possible. Please note that only some users are affected by this incident, not all.

If your .bashrc or .bash_profile contains a command that accesses a storage volume that is temporarily unavailable, your login might get stuck. If you encounter this problem, please write to cluster-support@id.ethz.ch and we can comment out those commands in your .bashrc and/or .bash_profile so that you can log in to Euler again.
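As an illustration of how a login script could avoid blocking on an unresponsive volume, here is a minimal sketch of a guard for .bashrc. The function name `volume_responds`, the path `/cluster/work/mygroup` and the variable `MY_DATA` are hypothetical and not part of the Euler setup. The probe runs in a detached subshell, so if it hangs, the login shell moves on after the time limit; the sketch assumes a `sleep` that accepts fractional seconds (as GNU coreutils does) and that `stat` on an unavailable volume hangs rather than returning an error.

```shell
# Hypothetical helper: returns 0 if "stat" on the directory succeeds within
# the time limit (in seconds, default 3), and 1 if it fails or takes too long.
volume_responds() {
    local dir=$1 limit=${2:-3} marker result waited=0
    marker=$(mktemp) || return 1
    # Run the probe detached from this shell, so a probe that hangs on an
    # unavailable volume cannot block the login itself.
    ( { stat "$dir" >/dev/null 2>&1 && echo ok || echo fail; } >"$marker" & )
    # Poll every 0.1 s until the probe has written its result.
    while [ "$waited" -lt $((limit * 10)) ]; do
        if [ -s "$marker" ]; then
            result=$(cat "$marker")
            rm -f "$marker"
            [ "$result" = ok ]
            return
        fi
        sleep 0.1
        waited=$((waited + 1))
    done
    rm -f "$marker"
    return 1
}

# Example use in .bashrc (the path and variable are placeholders):
if volume_responds /cluster/work/mygroup 3; then
    export MY_DATA=/cluster/work/mygroup
else
    echo "Skipping /cluster/work/mygroup: volume not responding" >&2
fi
```

A probe that hangs stays behind as a stuck background process, but the login shell itself continues after the time limit.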

Affected volumes:

  • /cluster/work/anasta
  • /cluster/work/beltrao
  • /cluster/work/bewi
  • /cluster/work/biol
  • /cluster/work/bmlbb
  • /cluster/work/borgw
  • /cluster/work/bsse_sdsc
  • /cluster/work/cemk
  • /cluster/work/chenp
  • /cluster/work/cobi
  • /cluster/work/compmech
  • /cluster/work/coss
  • /cluster/work/cotterell
  • /cluster/work/cpesm
  • /cluster/work/demello
  • /cluster/work/drzrh
  • /cluster/work/faist
  • /cluster/work/fcoletti
  • /cluster/work/flatt
  • /cluster/work/gdc
  • /cluster/work/gess
  • /cluster/work/gfb
  • /cluster/work/grewe
  • /cluster/work/hahnl
  • /cluster/work/harra
  • /cluster/work/hilliges
  • /cluster/work/hora
  • /cluster/work/ibk_chatzi
  • /cluster/work/ifd
  • /cluster/work/igp_psr
  • /cluster/work/igt_tunnel
  • /cluster/work/infk_mtc
  • /cluster/work/itphys
  • /cluster/work/ivt_vpl
  • /cluster/work/jesch
  • /cluster/work/karlen
  • /cluster/work/kovalenko
  • /cluster/work/krek
  • /cluster/work/kurtcuoglu
  • /cluster/work/lav
  • /cluster/work/lke
  • /cluster/work/lpc
  • /cluster/work/mandm
  • /cluster/work/mansuy
  • /cluster/work/math
  • /cluster/work/moor
  • /cluster/work/nenad
  • /cluster/work/nme
  • /cluster/work/pacbio
  • /cluster/work/pausch
  • /cluster/work/petro
  • /cluster/work/pueschel
  • /cluster/work/puzrin
  • /cluster/work/qchem
  • /cluster/work/reddy
  • /cluster/work/refcosmo
  • /cluster/work/reiher
  • /cluster/work/riner
  • /cluster/work/rjeremy
  • /cluster/work/rre
  • /cluster/work/rsl
  • /cluster/work/sachan
  • /cluster/work/sorkine
  • /cluster/work/sornette
  • /cluster/work/stocke
  • /cluster/work/swissloop
  • /cluster/work/woern
  • /cluster/work/yang
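To check from a script whether a given path lies on one of the affected volumes, something along the following lines could be used. This is only a sketch: the function name `is_affected` and the variable `affected_volumes` are made up for illustration, and the variable contains just the first four entries of the list above as an excerpt, so it would need to be extended with the remaining names.

```shell
# Excerpt only: the first four volume names from the list of affected volumes.
# Extend with the remaining names as needed.
affected_volumes="anasta beltrao bewi biol"

# Hypothetical helper: returns 0 if the path is on a listed affected volume.
is_affected() {
    local path=$1 name v
    # Only paths below /cluster/work/ can match.
    case $path in
        /cluster/work/*) ;;
        *) return 1 ;;
    esac
    # Keep only the volume name, i.e. the first component after /cluster/work/.
    name=${path#/cluster/work/}
    name=${name%%/*}
    for v in $affected_volumes; do
        [ "$v" = "$name" ] && return 0
    done
    return 1
}

# Example: a path on an affected volume
if is_affected /cluster/work/biol/some/dir; then
    echo "path is on an affected volume"
fi
```

The comparison uses the full volume name rather than a prefix match, so a name that merely starts with a listed entry is not reported as affected.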

Volumes not affected:

  • /cluster/work/climate
  • /cluster/work/cmbm
  • /cluster/work/cvl
  • /cluster/work/gfd
  • /cluster/work/igc
  • /cluster/work/magna
  • /cluster/work/noiray
  • /cluster/work/refregier
  • /cluster/work/tnu
  • /cluster/work/wenderoth
  • /cluster/work/zhang

We will update this news item whenever new information is available.

We are sorry for the inconvenience.

Updates

2022-05-10 13:20
The problem with the storage controller could not be fixed; the controller needs to be replaced. We don't know yet how long it will take until /cluster/work is back to normal operation (our current guess is 24 to 96 hours). After the replacement we will also run some integrity checks on the data. We will publish another update later this afternoon.
2022-05-10 16:30
The vendor is sending a new controller, which is already on its way to the data center. We will publish another update tomorrow morning.
2022-05-11 11:55
We are still working on fixing the problem with the affected /cluster/work volumes. We now prevent jobs of users who own affected volumes from starting. This ensures that the jobs don't get stuck in D-state when trying to access an affected volume.
2022-05-11 15:50
The controller has now been replaced and we have started running the file system and data integrity checks. We will publish another update on Thursday afternoon.