Euler VI Testing

From ScientificComputing
Revision as of 15:00, 31 January 2020 by Urbanb (talk | contribs) (Troubleshooting)


The new Euler VI nodes are available for beta testing. Each node has 128 cores and 512 GB of memory, and the nodes are connected by a 200 Gbps EDR InfiniBand fabric.

During the beta testing phase, these nodes should not be used for production runs, and the stability of the system is not guaranteed. Expect problems, and be prepared to report them to us and to help resolve them.

Roadmap

Euler VI will be put into regular production through the following phases. The current phase is highlighted in bold.

  • Closed beta testing: the new nodes are tested by the HPC group and interested users who contact us and agree to be beta-testers.
  • Open beta testing: the new nodes can be tested by anyone who is interested.
  • Gradual easement: a portion of all jobs may be able to run on the system, starting with a minimal set and gradually increasing.
  • Production: the new nodes are treated as any other node in the cluster.


Select or avoid Euler VI nodes

During the testing and transition period you can force your job to use or avoid these nodes.

To force your job to run on these nodes, request the “-R beta” or “-R "select[model==EPYC_7742]"” bsub option:

bsub -R beta [other bsub options] ./my_command
bsub -R "select[model==EPYC_7742]" [other bsub options] ./my_command

To prevent your job from running on these nodes, request the “-R stable” bsub option:

bsub -R stable [other bsub options] ./my_command
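For beta testing it can be useful to submit the same command once to the new nodes and once to the existing ones, and then compare the results. The following is a minimal dry-run sketch of that idea: it only prints the two bsub commands instead of submitting them. The function name compare_cmds, the job names and the output file names are illustrative assumptions, not part of the batch system.

```shell
# Dry-run sketch: print (rather than run) one bsub command per node class.
# Remove the leading "echo" to actually submit the jobs.
# Job names (-J) and output files (-o) are illustrative assumptions.
compare_cmds() {
    for class in beta stable; do
        echo bsub -R "$class" -J "test_$class" -o "test_$class.out" "$@"
    done
}

# ./my_command is a placeholder for your own program
compare_cmds ./my_command
```

Keeping the two jobs apart by name and output file makes it easier to tell which result came from which type of node when you report a problem.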

If you encounter any problem with running your jobs on the new Euler VI nodes, then please report it to cluster support.

After the nodes are put into production, any jobs submitted with the “stable” option may run on the new nodes, too. Some time after that, jobs submitted with the “beta” option will no longer be able to run.

Changes in behavior

If you request Euler VI nodes, then the batch system will run jobs requesting up to 128 cores on a single node.

Threaded jobs

Non-threaded (multi-node MPI) jobs

If you intend to run multi-node jobs, use the “-R "span[ptile=128]"” bsub option (or another appropriate value instead of 128) to set the number of cores used per node.
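As a rough sketch of how the ptile value relates to the number of nodes, the helper below computes how many nodes a job occupies for a given core count, and prints (without submitting) a matching bsub command that combines the node selection with the span requirement in a single resource string. The function names nodes_needed and mpi_cmd and the program ./my_mpi_program are hypothetical placeholders.

```shell
# Hypothetical helper: number of nodes occupied by a job with the given
# core count and ptile value (ceiling division; ptile defaults to 128).
nodes_needed() {
    cores=$1
    ptile=${2:-128}
    echo $(( (cores + ptile - 1) / ptile ))
}

# Dry run: print the bsub command for a multi-node MPI job on Euler VI.
# ./my_mpi_program is a placeholder; run the printed command to submit.
mpi_cmd() {
    printf 'bsub -n %s -R "select[model==EPYC_7742] span[ptile=%s]" mpirun ./my_mpi_program\n' "$1" "${2:-128}"
}

echo "256 cores at 128 per node: $(nodes_needed 256) nodes"
mpi_cmd 256
```

A smaller ptile value spreads the same number of cores over more nodes, which may help jobs that need more memory per core.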

Known issues

Multi-node MPI jobs on Euler VI require OpenMPI 4.0.x

The new Mellanox InfiniBand cards in the Euler VI nodes require OpenMPI to be built with support for Unified Communication X (UCX), an optimized communication layer for MPI and PGAS/OpenSHMEM libraries. The existing OpenMPI installations were not built with support for UCX.

If you use OpenMPI < 4.0.x, then only single-node MPI jobs (up to 128 cores) will work on the new Euler VI nodes. We are working on providing a new OpenMPI 4.0.2 installation that also supports running multi-node MPI jobs on the new Euler VI nodes. Once the new OpenMPI installation is available we will also provide the most common libraries (HDF5, NetCDF, Boost, etc.) compiled with the new OpenMPI. This is work in progress.
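The version rule above can be sketched as a small check. This is only an illustration: the function name openmpi_multinode_ok is hypothetical, the version numbers in the loop are example inputs, and a version of 4.0 or newer is necessary but not sufficient, because the installation must also be built with UCX support.

```shell
# Hypothetical helper: succeeds when the given OpenMPI version string
# (e.g. "4.0.2") has major version 4 or higher.
openmpi_multinode_ok() {
    major=${1%%.*}
    [ "$major" -ge 4 ]
}

# Example version strings, chosen for illustration only
for v in 2.1.1 3.0.0 4.0.2; do
    if openmpi_multinode_ok "$v"; then
        echo "OpenMPI $v: multi-node jobs possible (if built with UCX)"
    else
        echo "OpenMPI $v: single-node jobs only on Euler VI"
    fi
done
```

To find the version of the OpenMPI installation you have loaded, you can run mpirun --version. On a typical installation, ompi_info | grep -i ucx lists UCX components only if OpenMPI was built with UCX support.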

Troubleshooting

Please contact us in case of problems.