In certain cases it is advantageous to run hybrid jobs such as a program that mixes both MPI and OpenMP. For example, instead of running a program with 24 MPI ranks on 24 cores you run a program with 2 MPI ranks with 12 threads each on those 24 cores.
Before running such jobs for production calculations, please ensure that you get an appropriate speedup in comparison to a pure MPI or pure OpenMP job.
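One way to quantify this comparison is the ratio of the wall-clock time of the reference run (for example pure MPI) to that of the hybrid run on the same number of cores. A minimal sketch, in which `speedup` is a hypothetical helper and the two timings are placeholder values, not measurements:

```shell
#!/bin/bash
# speedup REFERENCE_SECONDS HYBRID_SECONDS
# Prints reference time divided by hybrid time; a value above 1 means
# the hybrid run was faster than the reference run.
speedup() {
    LC_ALL=C awk -v ref="$1" -v hyb="$2" 'BEGIN { printf "%.2f\n", ref / hyb }'
}

# Placeholder timings in seconds (replace with your own measurements)
speedup 120 100
```

A ratio close to or below 1 suggests that the hybrid setup brings no benefit for that particular program and problem size.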
Let's say you want to run a program on N cores with M MPI ranks and T OpenMP threads per MPI rank where N=M×T. It is strongly advisable that
- the number of cores on the node (24 in Euler) is divisible by your chosen T, the number of threads per MPI rank, and
- you match threads and MPI ranks to the sockets of the node (there are two sockets per node in Euler).
Good combinations on Euler:
- 2 MPI ranks per node, 12 threads per MPI rank (M=N/12, T=12 and S=1 where S is the number of ranks per socket),
- 4 MPI ranks per node, 6 threads per MPI rank (M=N/6, T=6 and S=2), or even
- 12 MPI ranks per node, 2 threads per MPI rank (M=N/2, T=2 and S=6).
Of course, this needs to be balanced against the performance behavior of your threaded program, which you should test before relying on such jobs for production.
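The arithmetic behind these combinations can be sketched as a small shell function. This is only an illustration: `hybrid_layout` is a hypothetical helper, not a tool provided on the cluster. It assumes 24 cores and 2 sockets per node as stated above, assumes full nodes, and merges the two recommendations into one stricter check (T must divide the 12 cores of a socket).

```shell
#!/bin/bash
# Sketch: derive M, S and the node count from N (total cores) and
# T (threads per MPI rank). Node geometry is an assumption taken from the text.
CORES_PER_NODE=24
SOCKETS_PER_NODE=2

hybrid_layout() {
    local n=$1 t=$2
    local cores_per_socket=$(( CORES_PER_NODE / SOCKETS_PER_NODE ))
    # T must divide the cores of one socket so that no rank spans two sockets
    if (( t < 1 || cores_per_socket % t != 0 )); then
        echo "T=$t does not divide the $cores_per_socket cores of a socket" >&2
        return 1
    fi
    # Full nodes are assumed, so N must be a multiple of the node size
    if (( n < 1 || n % CORES_PER_NODE != 0 )); then
        echo "N=$n is not a multiple of $CORES_PER_NODE" >&2
        return 1
    fi
    local m=$(( n / t ))                             # M = N / T
    local s=$(( cores_per_socket / t ))              # ranks per socket
    local nodes=$(( m / (SOCKETS_PER_NODE * s) ))    # nodes = M / (2S)
    echo "M=$m T=$t S=$s nodes=$nodes"
}

hybrid_layout 48 6   # prints: M=8 T=6 S=2 nodes=2
```

A value such as T=8 is rejected here: it divides 24 but not 12, so one of the three ranks on a node would span both sockets.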
Open MPI >= 4.1.4 and Intel 2022
The general way to run such a job is to submit the following script with "sbatch FILENAME" where the file contains:
    #!/bin/bash
    #SBATCH --ntasks=M
    #SBATCH --cpus-per-task=T
    #SBATCH --ntasks-per-socket=S
    #SBATCH --nodes=M/(2S)
    export OMP_NUM_THREADS=T
    srun --cpu-bind=cores my_hybrid_program
Here we assume 2 sockets per node in order to set the number of nodes. If you do not specify the number of tasks per socket, you do not need to specify the number of nodes. We also request CPU binding to cores, which often improves performance thanks to faster access to caches. If you are interested in more details (e.g. GPU binding), you can take a look at the NERSC documentation.
For example, to run your software on 48 cores (N=48) with 8 MPI ranks (M=8), 6 threads per rank (T=6) and only 2 MPI ranks per socket (S=2), which gives M/(2S) = 2 nodes:
    #!/bin/bash
    #SBATCH --ntasks=8
    #SBATCH --cpus-per-task=6
    #SBATCH --ntasks-per-socket=2
    #SBATCH --nodes=2
    export OMP_NUM_THREADS=6
    srun --cpu-bind=cores my_hybrid_program
These examples assume that full nodes are used.
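To avoid writing T in two places, the thread count can be taken from the `SLURM_CPUS_PER_TASK` environment variable, which Slurm sets inside the job when `--cpus-per-task` is given. A minimal sketch, in which `set_omp_threads` is a hypothetical helper and the fallback to 1 thread is an assumption made for this illustration:

```shell
#!/bin/bash
# Sketch: derive OMP_NUM_THREADS from the Slurm allocation instead of
# hard-coding it. Falls back to 1 thread when the variable is not set,
# for example when the script is run outside of a Slurm job.
set_omp_threads() {
    export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"
}

set_omp_threads
echo "OMP_NUM_THREADS=$OMP_NUM_THREADS"
```

In a job script, these lines would take the place of the explicit `export OMP_NUM_THREADS=T` line, before the program is launched.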
Older MPI libraries
Older MPI libraries should be used in the same way except that srun is not aware of the MPI library, so mpirun should be used. We strongly recommend using a more recent version of MPI that is supported by srun.
With Open MPI's mpirun:
    #!/bin/bash
    #SBATCH --ntasks=M
    #SBATCH --cpus-per-task=T
    #SBATCH --ntasks-per-socket=S
    #SBATCH --nodes=M/(2S)
    export OMP_NUM_THREADS=T
    mpirun -n M --map-by node:PE=T my_hybrid_program
With Intel MPI's mpirun:
    #!/bin/bash
    #SBATCH --ntasks=M
    #SBATCH --cpus-per-task=T
    #SBATCH --ntasks-per-socket=S
    #SBATCH --nodes=M/(2S)
    export OMP_NUM_THREADS=T
    mpirun -n M -ppn ranks_per_node my_hybrid_program