Hybrid jobs


In certain cases it is advantageous to run hybrid jobs, i.e. programs that mix MPI and OpenMP. For example, instead of running a program with 24 MPI ranks on 24 cores, you run it with 2 MPI ranks of 12 threads each on the same 24 cores.

Before running such jobs for production calculations, please ensure that you get an appropriate speedup in comparison to a pure MPI or a pure OpenMP job.
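As a minimal sketch of such a comparison (using the Open MPI 1.6 syntax shown below and the hypothetical program name my_hybrid_program used throughout this page), you could submit the same 24-core job with several rank/thread splits and compare the reported run times:

# T=1 is the pure MPI baseline, T=24 the pure OpenMP extreme
for T in 1 2 6 12 24 ; do
    export OMP_NUM_THREADS=$T
    bsub -n 24 mpirun --loadbalance --cpus-per-proc $T my_hybrid_program
done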

Let's say you want to run a program on N cores with M MPI ranks and T OpenMP threads per MPI rank, where N = M×T (a short shell sketch for deriving these values is given below). It is strongly advisable that

  • the number of cores on the node (24 on Euler) is divisible by your chosen T, the number of threads per MPI rank, and
  • you match threads and MPI ranks to the sockets of the node (there are two sockets per node on Euler).

Good combinations on Euler:

  • 2 MPI ranks per node, 12 threads per MPI rank (M=N/12 and T=12),
  • 4 MPI ranks per node, 6 threads per MPI rank (M=N/6 and T=6), or even
  • 12 MPI ranks per node, 2 threads per MPI rank (M=N/2 and T=2).
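As a minimal sketch of this arithmetic (assuming the 24-core Euler nodes mentioned above), you can derive M from a chosen N and T and check the divisibility constraint in the shell:

N=48                       # total number of cores
T=6                        # OpenMP threads per MPI rank
if [ $(( 24 % T )) -ne 0 ]; then
    echo "T=$T does not divide the 24 cores of a node" >&2
else
    M=$(( N / T ))         # number of MPI ranks
    echo "N=$N cores = $M MPI ranks x $T threads each"
fi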

Of course this needs to be balanced against the performance behavior of your threaded program, which you should test before relying on such jobs for production.

Open MPI 1.6

The general way to run such a job is

export OMP_NUM_THREADS=T
bsub -n N mpirun --loadbalance --cpus-per-proc T my_hybrid_program

for example, for N=48, M=8, and T=6:

export OMP_NUM_THREADS=6
bsub -n 48 mpirun --loadbalance --cpus-per-proc 6 my_hybrid_program

These examples assume that full nodes are used.
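If you want to verify where the ranks actually land, Open MPI's --report-bindings option prints the core binding of each rank; for example:

export OMP_NUM_THREADS=6
bsub -n 48 mpirun --loadbalance --cpus-per-proc 6 --report-bindings my_hybrid_program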

Open MPI ≥1.10

The general way to run such a job is

export OMP_NUM_THREADS=T
bsub -n N "unset LSB_AFFINITY_HOSTFILE ; mpirun -n M --map-by node:PE=T ./my_hybrid_program"

For example, for N=48, M=8, and T=6:

export OMP_NUM_THREADS=6
bsub -n 48 "unset LSB_AFFINITY_HOSTFILE ; mpirun -n 8 --map-by node:PE=6 ./my_hybrid_program"

These examples assume that full nodes are used. The LSB_AFFINITY_HOSTFILE environment variable must be unset; otherwise, the mapping directives will be ignored.
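Instead of the one-liner, the same job can also be written as an LSF batch script, which keeps the unset and the mapping options together; a minimal sketch (with the same hypothetical program name):

#!/bin/bash
#BSUB -n 48                     # total number of cores (N)
export OMP_NUM_THREADS=6        # threads per MPI rank (T)
unset LSB_AFFINITY_HOSTFILE     # otherwise the mapping directives are ignored
mpirun -n 8 --map-by node:PE=6 ./my_hybrid_program

Submit the script with bsub < job_script.sh (the file name is arbitrary).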

MVAPICH2

The general way to run such a job is

export OMP_NUM_THREADS=T
bsub -n N "export MV2_ENABLE_AFFINITY=0 ; mpirun -n M -ppn ranks_per_node ./my_mpi_program"

where ranks_per_node is generally 24/T on Euler. For example, for N=48, M=8, and T=6 (i.e. 4 ranks per node):

export OMP_NUM_THREADS=6
bsub -n 48 "MV2_ENABLE_AFFINITY=0 mpirun -n 8 -ppn 4 ./my_mpi_program"

These examples assume that full nodes are used.
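Since -n and -ppn follow directly from N and T, a short sketch (again assuming 24-core nodes and the hypothetical program name above) can derive all arguments:

N=48 ; T=6                             # total cores, threads per rank
M=$(( N / T ))                         # total MPI ranks
PPN=$(( 24 / T ))                      # MPI ranks per node
export OMP_NUM_THREADS=$T
bsub -n $N "export MV2_ENABLE_AFFINITY=0 ; mpirun -n $M -ppn $PPN ./my_mpi_program"

Note that $M, $PPN, and the other variables are expanded by your local shell before the command string is handed to bsub, since the string is double-quoted.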