Euler VI Testing

The new Euler VI nodes are available for beta testing. They have 128 cores, 512 GB of memory, and are connected through a 200 Gbps EDR InfiniBand fabric.

During the beta testing phase these nodes should not be used for production runs and the stability of the system is not guaranteed. You can expect problems and you should be willing to report them to us and work on resolving them.

Roadmap

Euler VI will be put into regular production through the following phases. The current phase is highlighted in bold.

  • Closed beta testing: the new nodes are tested by the HPC group and interested users who contact us and agree to be beta-testers.
  • Open beta testing: the new nodes can be tested by anyone who is interested.
  • Gradual introduction: a portion of all jobs may be able to run on the system, starting with a minimal set and gradually increasing.
  • Production: the new nodes are treated as any other node in the cluster.


Select or avoid Euler VI nodes

During the testing and transition period you can force your job to use or avoid these nodes.

To force your job to run on these nodes, use the -R beta or -R "select[model==EPYC_7742]" bsub option:

bsub -R beta [other bsub options] ./my_command
bsub -R "select[model==EPYC_7742]" [other bsub options] ./my_command

To prevent your job from running on these nodes, use the -R stable bsub option:

bsub -R stable [other bsub options] ./my_command

If you encounter any problem with running your jobs on the new Euler VI nodes, then please report it to cluster support.

After the nodes are put into production, any jobs submitted with the “stable” option may run on the new nodes, too. Some time after that, jobs submitted with the “beta” option will no longer be able to run.

Changes in behavior

If you request Euler VI nodes, then the batch system will run jobs requesting up to 128 cores on a single node (unless otherwise specified).
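For example, the following submission (my_threaded_command is a placeholder) requests 128 cores on the Euler VI nodes; the batch system will place all of them on a single node:

bsub -n 128 -R beta ./my_threaded_command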

The scheduler will abort job submission if it detects that an old MPI version will be used on these nodes.

Threaded jobs

Non-threaded (multi-node MPI) jobs

You should use the -R "span[ptile=128]" option (or another appropriate value instead of 128) if you intend to run multi-node jobs.
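For example (illustrative values; my_mpi_program is a placeholder), a 256-rank MPI job spread across two Euler VI nodes could be submitted as:

bsub -n 256 -R beta -R "span[ptile=128]" mpirun ./my_mpi_program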

Known issues

Multi-node MPI jobs on Euler VI require OpenMPI 4.0.x

The new Mellanox InfiniBand cards in the Euler VI nodes require OpenMPI to be built with support for Unified Communication X (UCX), an optimized communication layer for MPI and PGAS/OpenSHMEM libraries. The existing OpenMPI installations were not built with UCX support.

If you use OpenMPI < 4.0.x, then only single-node MPI jobs (up to 128 cores) will work on the new Euler VI nodes. OpenMPI 4.0.2, which also supports running multi-node MPI jobs on the new Euler VI nodes, is now available in the new software stack on Euler:

https://scicomp.ethz.ch/wiki/New_SPACK_software_stack_on_Euler
https://scicomp.ethz.ch/wiki/Euler_applications_and_libraries

We provide the most common libraries (HDF5, NetCDF, Boost, etc.) compiled with the new OpenMPI 4.0.2 in the new software stack.
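A minimal sketch of submitting a multi-node MPI job with the new software stack (the exact module names are assumptions; see the pages linked above for the authoritative commands):

env2lmod                                    # switch to the new software stack
module load gcc/6.3.0 openmpi/4.0.2         # assumed module names; adjust to what the new stack provides
bsub -n 256 -R beta -R "span[ptile=128]" mpirun ./my_mpi_program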

Software known to fail on Euler VI nodes

  • abaqus/6.14-1 (as a workaround, use a newer version, e.g., abaqus/2018, which has been tested on Euler VI and works fine, or use the -R stable option)
  • Programs compiled with the Intel compiler using Intel CPU-specific optimization flags (as a workaround, recompile the software without these flags; see the illustration after this list)
  • The intel/18.0.1 toolchain does not work on Euler VI, because its packages were compiled with Intel CPU-specific optimization flags. Please use the new intel/19.1.0 toolchain if you would like to run Intel-compiled codes on the Euler VI nodes.
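As an illustration of the recompilation workaround mentioned above (a sketch only; the appropriate flags depend on your code and build system), Intel's -x options such as -xHost embed a CPU check that typically fails on the AMD EPYC processors in the Euler VI nodes, whereas -march targets the same instruction set without that check:

icc -O3 -xHost -o my_program my_program.c             # may abort at runtime on AMD EPYC (Intel-only CPU check)
icc -O3 -march=core-avx2 -o my_program my_program.c   # targets AVX2 and runs on both Intel and AMD CPUs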

Troubleshooting

Please contact us in case of problems.