Hybrid jobs
In certain cases it is advantageous to run hybrid jobs, i.e., programs that mix MPI and OpenMP. For example, instead of running a program with 48 MPI ranks on 48 cores, you run it with 2 MPI ranks per node, each using 12 OpenMP threads, on those same 48 cores.
Let's say you want to run a program on N cores with M MPI ranks and T OpenMP threads per MPI rank where N=M×T. It is strongly advisable that
- the number of cores on the node (24 in Euler) is divisible by your chosen T, the number of threads per MPI rank, and
- you match threads and MPI ranks to the sockets of the node (there are two sockets per node in Euler).
Good combinations on Euler:
- 2 MPI ranks per node, 12 threads per MPI rank (M=N/12 and T=12),
- 4 MPI ranks per node, 6 threads per MPI rank (M=N/6 and T=6), or even
- 12 MPI ranks per node, 2 threads per MPI rank (M=N/2 and T=2).
Of course, this choice needs to be balanced against the performance behavior of your threaded program, which you should test before relying on such jobs for production.
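As a rough sketch of the arithmetic above (the 24-core node size and the example values of N and T are assumptions to be replaced with your own):

 # Sketch: check that T divides the cores per node and derive M from N and T.
 # N=48 and T=6 are example values, not requirements.
 CORES_PER_NODE=24
 N=48
 T=6
 if [ $((CORES_PER_NODE % T)) -ne 0 ]; then
     echo "warning: T=$T does not divide $CORES_PER_NODE cores per node" >&2
 fi
 M=$((N / T))                      # here M = 48/6 = 8 MPI ranks
 echo "N=$N cores -> M=$M MPI ranks with T=$T threads each"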
Open MPI 1.6
The general way to run such a job is
 export OMP_NUM_THREADS=T
 bsub -n N mpirun --loadbalance --cpus-per-proc T my_hybrid_program
for example, for N=48, M=8, and T=6:
 export OMP_NUM_THREADS=6
 bsub -n 48 mpirun --loadbalance --cpus-per-proc 6 my_hybrid_program
These examples assume that full nodes are used.
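If you prefer a job script over a one-line submission, the same Open MPI 1.6 example can be wrapped roughly as follows; this is a sketch rather than a tested recipe, and the script and program names are placeholders:

 #!/bin/bash
 # hybrid_job.sh -- sketch of the Open MPI 1.6 example above as an LSF job script.
 # Submit with:  bsub < hybrid_job.sh
 #BSUB -n 48                       # total number of cores (N)
 export OMP_NUM_THREADS=6          # T threads per MPI rank
 mpirun --loadbalance --cpus-per-proc 6 my_hybrid_program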
Open MPI ≥1.10
The general way to run such a job is
export OMP_NUM_THREADS=T bsub -n N "unset LSB_AFFINITY_HOSTFILE ; mpirun -n M --map-by node:PE=T ./my_hybrid_program"
For example,
export OMP_NUM_THREADS=6 bsub -n 48 "unset LSB_AFFINITY_HOSTFILE ; mpirun -n 8 --map-by node:PE=6 ./my_hybrid_program"
These examples assume that full nodes are used.
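If you want to verify where Open MPI actually placed the ranks, its --report-bindings option prints the binding of every rank at startup; for example, for the job above:

 export OMP_NUM_THREADS=6
 bsub -n 48 "unset LSB_AFFINITY_HOSTFILE ; mpirun -n 8 --map-by node:PE=6 --report-bindings ./my_hybrid_program"

The binding report appears in the job output and should show each rank bound to T cores.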
MVAPICH2
The general way to run such a job is
export OMP_NUM_THREADS=T bsub -n N "export MV2_ENABLE_AFFINITY=0 ; mpirun -n M -ppn ranks_per_node ./my_mpi_program"
where ranks_per_node is generally 24/T on Euler. For example,
export OMP_NUM_THREADS=6 bsub -n 48 "MV2_ENABLE_AFFINITY=0 mpirun -n 8 -ppn 4 ./my_mpi_program"
These examples assume that full nodes are used.
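The same MVAPICH2 submission can also be written with the bookkeeping in shell variables, so that changing T updates M and ranks_per_node automatically; this is a sketch that assumes 24-core nodes as above:

 # Sketch: derive M and ranks_per_node from N and T, then submit as above.
 N=48
 T=6
 M=$((N / T))                      # total MPI ranks
 RPN=$((24 / T))                   # MPI ranks per 24-core node
 export OMP_NUM_THREADS=$T
 bsub -n $N "export MV2_ENABLE_AFFINITY=0 ; mpirun -n $M -ppn $RPN ./my_mpi_program"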