Leonhard Open maintenance (December 2018)

From ScientificComputing

Latest revision as of 13:50, 12 December 2018

We would like to inform you about an upcoming maintenance of the Leonhard Open cluster.

The Leonhard Open cluster will be offline from 15:00 on Friday, 7 December 2018, while we migrate data to a new storage system. We expect to bring the cluster back online in the afternoon of Monday, 10 December 2018.

No action is needed on your side. As usual, jobs that cannot start before the downtime will be held in the queues until the end of the maintenance, after which they will start normally.

We are sorry for any inconvenience this may cause.

We will update this page before and during the maintenance.

Updates

2018-12-10 10:20
Our storage experts have successfully migrated the data to the new storage system and finished the integrity checks. We are currently running tests and will provide further updates on the maintenance in the afternoon.
2018-12-10 16:30
Testing the Leonhard Open cluster after the maintenance has revealed that the openmpi MPI module does not work as expected. Jobs that have been identified as MPI jobs have been suspended; we encourage you to kill them and resubmit them once we have resolved the MPI issues. In the meantime, please do not submit new MPI jobs and do not use the openmpi modules.
2018-12-11 16:40
We are still working on the OpenMPI issue. OpenMPI jobs will not fail, but they will produce many warnings. This is a known problem, so please do not report these warnings to cluster support. Since many jobs do not use OpenMPI, we have decided to open the 4h and 24h queues.
2018-12-12 14:50
We have opened the 120h queue and set the system status of Leonhard Open back to fully operational. Users who run OpenMPI jobs will still get the following warnings:
libibverbs: Warning: couldn't open config directory '/etc/libibverbs.d'.
libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs0

We cannot suppress them, but they can safely be ignored and will not have any influence on your jobs.
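
Although the warnings cannot be suppressed at the source, they can be filtered out of a job's output afterwards if they clutter it. The following is a minimal sketch, assuming the job output is available as a list of text lines; the helper name `strip_known_warnings` is hypothetical and is not part of any cluster tooling. It matches only the two warning messages quoted above and leaves every other line untouched.

```python
# Hypothetical post-processing helper; not provided by the cluster.
# The prefixes below are copied from the two warnings quoted above.
KNOWN_WARNING_PREFIXES = (
    "libibverbs: Warning: couldn't open config directory",
    "libibverbs: Warning: no userspace device-specific driver found for",
)


def strip_known_warnings(lines):
    """Return the lines that do not start with a known harmless warning."""
    return [
        line for line in lines
        if not line.startswith(KNOWN_WARNING_PREFIXES)
    ]


# Example with made-up job output (the "result" line is illustrative only).
sample_output = [
    "libibverbs: Warning: couldn't open config directory '/etc/libibverbs.d'.",
    "result = 42",
    "libibverbs: Warning: no userspace device-specific driver found for "
    "/sys/class/infiniband_verbs/uverbs0",
]

for line in strip_known_warnings(sample_output):
    print(line)
```

Matching on the exact message prefixes, rather than on every line that begins with `libibverbs:`, keeps any other, possibly relevant, libibverbs message visible.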