Infiniband problems on Euler VII nodes (November 2021)
We are currently experiencing a problem with the Infiniband network on Euler VII nodes. This can affect MPI jobs that use Infiniband as well as jobs that access /cluster/scratch and /cluster/work from an Euler VII node. We are in close contact with the hardware vendors and are investigating the problem. We will change the scheduling of jobs to prevent multi-node MPI jobs from starting on Euler VII nodes.
If you encounter problems with stuck jobs on nodes whose hostname does not start with eu-a2p, please report those cases to cluster support.
Updates
- 2021-11-22 10:30
- We found the root cause of the problem with the Infiniband network on Euler VII nodes and fixed it by updating the configuration of the nodes. The Euler VII nodes have now been running stably for 5 days.