Infiniband problems on Euler VII nodes (November 2021)

From ScientificComputing
Revision as of 09:44, 22 November 2021 by Sfux (talk | contribs)

We are currently experiencing a problem with the Infiniband network on Euler VII nodes. This can affect MPI jobs that use Infiniband, as well as jobs that access /cluster/scratch or /cluster/work from an Euler VII node. We are in close contact with the hardware vendors and are investigating the problem. We will change the scheduling of jobs to prevent multi-node MPI jobs from starting on Euler VII nodes.

If you encounter problems with stuck jobs on nodes whose hostname does not start with eu-a2p, please report those cases to cluster support.
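The hostname rule above can be checked with a short script. This is only a minimal sketch: the function name `should_report` and the example hostnames are hypothetical, and the only fact taken from this notice is the eu-a2p prefix.

```python
import socket

# Prefix named in this notice; stuck jobs on nodes with other hostnames
# should be reported to cluster support.
PREFIX = "eu-a2p"


def should_report(hostname: str) -> bool:
    """Return True if the hostname does NOT start with the eu-a2p prefix."""
    return not hostname.startswith(PREFIX)


if __name__ == "__main__":
    # Check the node this script is running on.
    host = socket.gethostname()
    if should_report(host):
        print(f"{host}: does not start with {PREFIX}, please report stuck jobs")
    else:
        print(f"{host}: starts with {PREFIX}")
```

Run it on the node where the job is stuck, or call `should_report` with a hostname taken from the job's output.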

Updates

2021-11-22 10:30
We found the root cause of the problem with the Infiniband network on Euler VII nodes and fixed it by updating the configuration of the nodes. The Euler VII nodes have now been running stably for 5 days.