Hybrid jobs

Revision as of 12:25, 7 April 2017

In certain cases it is advantageous to run hybrid jobs, i.e. programs that mix MPI and OpenMP. For example, instead of running a program with 48 MPI ranks on 48 cores, you run it with 4 MPI ranks and 12 OpenMP threads per rank on those same 48 cores.

Let's say you want to run a program on N cores with M MPI ranks and T OpenMP threads per MPI rank where N=M×T. It is strongly advisable that

  • the number of cores on the node (24 in Euler) is divisible by your chosen T, the number of threads per MPI rank, and
  • you match threads and MPI ranks to the sockets of the node (there are two sockets per node in Euler).

Good combinations on Euler:

  • 2 MPI ranks per node, 12 threads per MPI rank (M=N/12 and T=12),
  • 4 MPI ranks per node, 6 threads per MPI rank (M=N/6 and T=6), or even
  • 12 MPI ranks per node, 2 threads per MPI rank (M=N/2 and T=2).

Of course, this choice needs to be balanced against the performance behavior of your threaded program, which you should test before relying on such jobs for production.
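
The examples in the sections below all launch a binary called my_hybrid_program. As a purely illustrative sketch (your actual application is whatever hybrid code you want to run), a minimal MPI+OpenMP program in C has the following structure: each of the M MPI ranks opens an OpenMP parallel region whose width is set by OMP_NUM_THREADS, i.e. T.

 #include <stdio.h>
 #include <mpi.h>
 #include <omp.h>
 
 int main(int argc, char **argv)
 {
     int provided, rank, nranks;
 
     /* Request a threading level that allows OpenMP regions between MPI calls. */
     MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
     MPI_Comm_size(MPI_COMM_WORLD, &nranks);
 
     /* Each MPI rank spawns OMP_NUM_THREADS (= T) OpenMP threads. */
     #pragma omp parallel
     {
         printf("MPI rank %d of %d, OpenMP thread %d of %d\n",
                rank, nranks, omp_get_thread_num(), omp_get_num_threads());
     }
 
     MPI_Finalize();
     return 0;
 }

Such a program is typically compiled with the MPI compiler wrapper and OpenMP enabled, e.g. mpicc -fopenmp hybrid_hello.c -o my_hybrid_program (the exact wrapper depends on the compiler and MPI modules you have loaded; the file name hybrid_hello.c is just an example).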

Open MPI 1.6

The general way to run such a job is

export OMP_NUM_THREADS=T
bsub -n N mpirun --loadbalance --cpus-per-proc T my_hybrid_program

For example, for N=48, M=8, and T=6:

export OMP_NUM_THREADS=6
bsub -n 48 mpirun --loadbalance --cpus-per-proc 6 my_hybrid_program

These examples assume that full nodes are used.

Open MPI ≥1.10

The general way to run such a job is

export OMP_NUM_THREADS=T
bsub -n N "unset LSB_AFFINITY_HOSTFILE ; mpirun -n M --map-by node:PE=T ./my_hybrid_program"

For example, for N=48, M=8, and T=6:

export OMP_NUM_THREADS=6
bsub -n 48 "unset LSB_AFFINITY_HOSTFILE ; mpirun -n 8 --map-by node:PE=6 ./my_hybrid_program"

These examples assume that full nodes are used.

MVAPICH2

The general way to run such a job is

export OMP_NUM_THREADS=T
bsub -n N "export MV2_ENABLE_AFFINITY=0 ; mpirun -n M -ppn ranks_per_node ./my_mpi_program"

where ranks_per_node is generally 24/T on Euler. For example,

export OMP_NUM_THREADS=6
bsub -n 48 "MV2_ENABLE_AFFINITY=0 mpirun -n 8 -ppn 4 ./my_mpi_program"

These examples assume that full nodes are used.