Hybrid jobs

Introduction

In certain cases it is advantageous to run hybrid jobs, i.e. programs that mix both MPI and OpenMP. For example, instead of running a program with 24 MPI ranks on 24 cores, you run it with 2 MPI ranks and 12 threads per rank on those 24 cores.

Before running such jobs for production calculations, please ensure that you get an appropriate speedup in comparison to a pure MPI or pure OpenMP job.
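
A simple way to check this is to run both variants on the same number of cores and compare their run times. As a sketch (the script names and job IDs below are placeholders, not part of any official recipe), Slurm's accounting can report the elapsed time of each job:

sbatch pure_mpi_job.sh      # e.g. 48 MPI ranks, 1 thread each
sbatch hybrid_job.sh        # e.g. 8 MPI ranks, 6 threads each
sacct -j <jobid_mpi>,<jobid_hybrid> --format=JobID,JobName,Elapsed,NCPUS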

Let's say you want to run a program on N cores with M MPI ranks and T OpenMP threads per MPI rank where N=M×T. It is strongly advisable that

  • the number of cores on the node (24 in Euler) is divisible by your chosen T, the number of threads per MPI rank, and
  • you match threads and MPI ranks to the sockets of the node (there are two sockets per node in Euler).

Good combinations on Euler:

  • 2 MPI ranks per node, 12 threads per MPI rank (M=N/12, T=12 and S=1 where S is the number of ranks per socket),
  • 4 MPI ranks per node, 6 threads per MPI rank (M=N/6, T=6 and S=2), or even
  • 12 MPI ranks per node, 2 threads per MPI rank (M=N/2, T=2 and S=6).

Of course this needs to be balanced against the performance behavior of your threaded program, which you should test before relying on such jobs for production.
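
As a quick sanity check of this arithmetic, here is a minimal shell sketch (assuming Euler's 24 cores and 2 sockets per node; the variable names are only for illustration) that derives M, S and the node count from N and T:

#!/bin/bash
# Derive the hybrid layout from N (total cores) and T (threads per rank),
# assuming 24 cores and 2 sockets per node as on Euler.
N=48
T=6
CORES_PER_NODE=24
SOCKETS_PER_NODE=2
M=$(( N / T ))                                    # total MPI ranks (--ntasks)
S=$(( CORES_PER_NODE / T / SOCKETS_PER_NODE ))    # ranks per socket (--ntasks-per-socket)
NODES=$(( N / CORES_PER_NODE ))                   # full nodes (--nodes)
echo "ntasks=$M cpus-per-task=$T ntasks-per-socket=$S nodes=$NODES"

For N=48 and T=6 this prints ntasks=8 cpus-per-task=6 ntasks-per-socket=2 nodes=2, which matches the example used below.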

Open MPI >= 4.1.4 and Intel 2022

The general way to run such a job is to submit the following script with "sbatch FILENAME" where the file contains:

#!/bin/bash
#SBATCH --ntasks=M
#SBATCH --cpus-per-task=T
#SBATCH --ntasks-per-socket=S
#SBATCH --nodes=M/(2S)
export OMP_NUM_THREADS=T
srun --cpu-bind=cores my_hybrid_program

Here we assume that there are 2 sockets per node in order to set the number of nodes. If you do not specify the number of tasks per socket, you do not need to specify the number of nodes. We also request CPU binding to the cores, which often improves performance thanks to faster access to the caches. If you are interested in more details (e.g. GPU binding), take a look at the NERSC documentation at https://docs.nersc.gov/jobs/affinity/.
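
If you want to verify where the ranks actually end up, srun can print the CPU masks it assigns, and recent OpenMP runtimes can report their settings at startup; the following is only a diagnostic sketch:

export OMP_DISPLAY_ENV=true                        # optional: OpenMP runtime prints its settings at startup
srun --cpu-bind=verbose,cores my_hybrid_program    # "verbose" reports the chosen CPU masks per task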

For example, if you wish to run your software on 48 cores (N=48) with 8 MPI ranks (M=8) and 6 threads per rank (T=6), using only 2 MPI ranks per socket (S=2):

#!/bin/bash
#SBATCH --ntasks=8
#SBATCH --cpus-per-task=6
#SBATCH --ntasks-per-socket=2
#SBATCH --nodes=2
export OMP_NUM_THREADS=6
srun --cpu-bind=cores my_hybrid_program

These examples assume that full nodes are used.
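
Assuming the script above is saved as hybrid_job.sh (the file name is arbitrary), submitting and checking it looks like:

sbatch hybrid_job.sh    # submit the batch script shown above
squeue -u $USER         # check that the job is queued or running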

Older MPI libraries

Older MPI libraries are used in the same way, except that srun is not aware of them, so mpirun must be used instead. We strongly recommend using a more recent MPI version that is supported by srun.

Open MPI:

#!/bin/bash
#SBATCH --ntasks=M
#SBATCH --cpus-per-task=T
#SBATCH --ntasks-per-socket=S
#SBATCH --nodes=M/(2S)
export OMP_NUM_THREADS=T
mpirun -n M --map-by node:PE=T my_hybrid_program
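
Filled in with the same example values as in the srun case above (N=48, M=8, T=6, S=2), the Open MPI script would look like this (a sketch; my_hybrid_program stands for your executable):

#!/bin/bash
#SBATCH --ntasks=8
#SBATCH --cpus-per-task=6
#SBATCH --ntasks-per-socket=2
#SBATCH --nodes=2
export OMP_NUM_THREADS=6
mpirun -n 8 --map-by node:PE=6 my_hybrid_program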

MVAPICH2:

#!/bin/bash
#SBATCH --ntasks=M
#SBATCH --cpus-per-task=T
#SBATCH --ntasks-per-socket=S
#SBATCH --nodes=M/(2S)
export OMP_NUM_THREADS=T
mpirun -n M -ppn ranks_per_node my_hybrid_program
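
Here ranks_per_node is the number of MPI ranks per node, generally 24/T on Euler. With the same example values as above (N=48, M=8, T=6, S=2, hence 4 ranks per node), a sketch of the MVAPICH2 script would be:

#!/bin/bash
#SBATCH --ntasks=8
#SBATCH --cpus-per-task=6
#SBATCH --ntasks-per-socket=2
#SBATCH --nodes=2
export OMP_NUM_THREADS=6
mpirun -n 8 -ppn 4 my_hybrid_program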