Parallel computing

From ScientificComputing
Revision as of 10:15, 30 November 2017 by Sfux (talk | contribs)



Shared memory parallelization (OpenMP)

OpenMP (Open Multi-Processing) is an application programming interface (API) that supports multi-platform shared memory multiprocessing programming in C, C++, and Fortran. It consists of a set of compiler directives, library routines, and environment variables that influence run-time behavior.

OpenMP uses a portable, scalable model that gives programmers a simple and flexible interface for developing parallel applications for platforms ranging from the standard desktop computer to the supercomputer.

If your application is parallelized using OpenMP or linked against a library using OpenMP (Intel MKL, OpenBLAS, etc.), the number of cores (or threads) that it can use is controlled by the environment variable OMP_NUM_THREADS. This variable must be set before you submit your job:

export OMP_NUM_THREADS=number_of_cores
sbatch --ntasks=1 --cpus-per-task=number_of_cores ...

Please note that for OpenMP, you request --ntasks=1 and then request the number of cores through the sbatch option --cpus-per-task.

NOTE: if OMP_NUM_THREADS is not set, your application will either use one core only, or will attempt to use all cores that it can find, 'stealing' them from other jobs if needed. In other words, your job will either use too few or too many cores.
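As a sketch, a minimal job script for a 4-thread OpenMP run could look like this. The program name is a placeholder; deriving OMP_NUM_THREADS from Slurm's SLURM_CPUS_PER_TASK variable avoids hard-coding the core count in two places:

```shell
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4

# Match the thread count to the allocated cores
# (falls back to 4 when run outside a Slurm job)
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-4}
echo "running with $OMP_NUM_THREADS threads"
# ./my_openmp_program    <- placeholder for your OpenMP binary
```

Submit the script with sbatch; if you later change --cpus-per-task, the thread count follows automatically.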

Pthreads and other threaded applications

Pthreads and other threaded applications behave similarly to OpenMP applications. It is important to limit the number of threads that the application spawns. There is no standard way to do this, so be sure to check the application's documentation. Usually a program supports at least one of the following four ways to limit itself to N threads:

  • it understands the OMP_NUM_THREADS=N environment variable,
  • it has its own environment variable, such as GMX_NUM_THREADS=N for Gromacs,
  • it has a command-line option, such as -nt N (for Gromacs), or
  • it has an input-file option, such as num_threads N.

If you are unsure about the program's behavior, please contact us and we will analyze it.

Tips and Tricks

GNU libgomp

Virtually all OpenMP programs compiled with the GNU compilers use the GNU libgomp library. Some of its environment variables may make your program run faster, while others may make it run much slower. It is safer not to use them at all than to use them without testing whether they actually help.

OMP_PROC_BIND

Binds threads to cores. This option is relatively safe to use, although some programs may run much slower with it. To use it, set the OMP_PROC_BIND environment variable to true before submitting the job (or in a job script):
export OMP_PROC_BIND=true
sbatch --ntasks=1 --cpus-per-task=4 --wrap="my_program"
GOMP_CPU_AFFINITY

Binds specific threads to specific cores. Some programs may run much slower with this option. This is strongly discouraged: in most cases the OMP_PROC_BIND option is sufficient. To use it anyway, set the GOMP_CPU_AFFINITY environment variable in a job script (e.g., according to the cores assigned by the batch system) and submit that script with sbatch.
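A hypothetical job script could look like the following; the core list 0-3 and the program name are placeholders, and the list must correspond to the cores Slurm actually assigns to the job:

```shell
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4

# Pin OpenMP threads to specific logical cores (placeholder core list)
export GOMP_CPU_AFFINITY="0-3"
echo "affinity set to $GOMP_CPU_AFFINITY"
# ./my_program    <- placeholder for your OpenMP binary
```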
Intel OpenMP library

Use the KMP_AFFINITY environment variable to control thread affinity. For example, to bind threads to consecutive cores:

export KMP_AFFINITY=granularity=fine,compact

Refer to the Intel compiler documentation for details and other options.


Hyper-threading

Hyper-threading is Intel's proprietary simultaneous multithreading (SMT) implementation used to improve parallelization of computations (doing multiple tasks at once) performed on x86 microprocessors.

For each processor core that is physically present, the operating system addresses two virtual (logical) cores and shares the workload between them when possible. The main function of hyper-threading is to increase the number of independent instructions in the pipeline; it takes advantage of superscalar architecture, in which multiple instructions operate on separate data in parallel. With HTT, one physical core appears as two processors to the operating system, allowing concurrent scheduling of two processes per core.

Hyperthreading on Euler

With hyper-threading enabled, the operating system sees 48 logical cores on a 24-core node. The batch system also sees these logical cores, but continues to use physical cores when allocating resources to batch jobs. A job requesting 1 (physical) core thus gets two logical cores and can execute two threads simultaneously, if the application supports it.

Relation to Slurm job slots

Slurm is aware of hyperthreading so there is no change to how jobs are assigned to physical cores. This means there continue to be 24 job slots on the 24 cores of an Euler I or II node. The slots, however, are assigned to both virtual cores of a physical core.

All of the supported MPI libraries we provide are also aware of hyperthreading and continue to schedule only one rank (MPI process) per physical core.

Using hyperthreading

If you are running a loosely-coupled parallel program, you can make use of hyperthreading to run twice as many processes as you have requested cores. While each individual process will run slower, the time-to-solution will probably be shorter than if they ran sequentially one after the other.

To ensure your job runs with two threads per physical core, use the sbatch option --ntasks-per-core=2:

sbatch --ntasks-per-core=2 ...     # request nodes where HT is enabled (2 threads per core)
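As a sketch, such a loosely-coupled job could then launch twice as many independent workers as physical cores; here process_chunk is a stand-in for your real per-task command:

```shell
# Run 8 independent workers on 4 physical cores (8 logical cores with HT)
process_chunk() { echo "task $1 done"; }   # stand-in for the real work
NWORKERS=8
for i in $(seq 1 "$NWORKERS"); do
    process_chunk "$i" &    # start each worker in the background
done
wait                        # block until all workers have finished
```

The shell's job control handles the oversubscription; the operating system schedules the workers across the logical cores.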

Distributed memory parallelization (MPI)

Message Passing Interface (MPI) is a standardized and portable message-passing standard designed by a group of researchers from academia and industry to function on a wide variety of parallel computing architectures. MPI is a communication protocol for programming parallel computers. Both point-to-point and collective communication are supported. "MPI's goals are high performance, scalability, and portability." MPI remains the dominant model used in high-performance computing today. MPI is not sanctioned by any major standards body; nevertheless, it has become a de facto standard for communication among processes that model a parallel program running on a distributed memory system. Actual distributed memory supercomputers such as computer clusters often run such programs. Most MPI implementations consist of a specific set of routines directly callable from C, C++, and Fortran (i.e., an API) and any language able to interface with such libraries, including C#, Java, and Python.

MPI on our clusters

Two kinds of MPI libraries are available on our clusters: Open MPI (recommended) and MVAPICH2. Before you can submit and execute an MPI job, you must load the corresponding modules (compiler + MPI, in that order):

module load compiler
module load mpi_library

The command used to launch an MPI application is mpirun.

Let's assume for example that hello_world was compiled with GCC 8.2.0 and linked with Open MPI 4.1.4. The command to execute this job on 4 processors is:

module load gcc/8.2.0
module load openmpi/4.1.4
sbatch --ntasks=4 --wrap="mpirun ./hello_world"

Note that mpirun automatically uses all processors allocated to the job by Slurm. It is therefore not necessary to indicate this number again to the mpirun command itself:

sbatch --ntasks=4 --wrap="mpirun -np 4 ./hello_world"      ←  "-np 4" not needed!

Hybrid jobs



In certain cases it is advantageous to run hybrid jobs such as a program that mixes both MPI and OpenMP. For example, instead of running a program with 24 MPI ranks on 24 cores you run a program with 2 MPI ranks with 12 threads each on those 24 cores.

Before running such jobs for production calculations, please ensure that you get an appropriate speedup in comparison to a pure MPI or pure OpenMP job.

Let's say you want to run a program on N cores with M MPI ranks and T OpenMP threads per MPI rank where N=M×T. It is strongly advisable that

  • the number of cores on the node (24 in Euler) is divisible by your chosen T, the number of threads per MPI rank, and
  • you match threads and MPI ranks to the sockets of the node (there are two sockets per node in Euler).

Good combinations on Euler:

  • 2 MPI ranks per node, 12 threads per MPI rank (M=N/12, T=12 and S=1 where S is the number of ranks per socket),
  • 4 MPI ranks per node, 6 threads per MPI rank (M=N/6, T=6 and S=2), or even
  • 12 MPI ranks per node, 2 threads per MPI rank (M=N/2, T=2 and S=6).

Of course this needs to be balanced by the performance behavior of your thread program, which you should test before relying on such jobs for production.

Open MPI >= 4.1.4 and Intel 2022

The general way to run such a job is to submit the following script with "sbatch FILENAME" where the file contains:

#!/bin/bash
#SBATCH --ntasks=M
#SBATCH --cpus-per-task=T
#SBATCH --ntasks-per-socket=S
#SBATCH --nodes=M/(2S)
srun --cpu-bind=cores my_hybrid_program

Here we assumed two sockets per node in order to set the number of nodes. If you do not specify the number of tasks per socket, you do not need to specify the number of nodes. We also request CPU binding to the cores, which often improves performance thanks to faster access to caches. If you are interested in more details (e.g., GPU binding), take a look at the NERSC documentation.

For example, if you wish to run your software on 48 cores (N=48) with 8 MPI ranks (M=8) and 6 threads per rank (T=6) and using only 2 MPI ranks per socket (S=2):

#!/bin/bash
#SBATCH --ntasks=8
#SBATCH --cpus-per-task=6
#SBATCH --ntasks-per-socket=2
#SBATCH --nodes=2
srun --cpu-bind=cores my_hybrid_program

These examples assume that full nodes are used.
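To check where your ranks and threads actually end up, you can ask Slurm for a verbose binding report (written to the job output). This assumes a Slurm version whose --cpu-bind option accepts the verbose flag; the program name is a placeholder:

```shell
srun --cpu-bind=cores,verbose my_hybrid_program
```

Each task then prints a line describing the CPU mask it was bound to, which you can compare against the intended layout.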

Older MPI libraries

Older MPI libraries are used in the same way, except that srun is not aware of the MPI library, so mpirun should be used instead. We strongly recommend using a more recent MPI version that is supported by srun.


For Open MPI:

#!/bin/bash
#SBATCH --ntasks=M
#SBATCH --cpus-per-task=T
#SBATCH --ntasks-per-socket=S
#SBATCH --nodes=M/(2S)
mpirun -n M --map-by node:PE=T my_hybrid_program


For MVAPICH2 (and other MPICH-derived libraries):

#!/bin/bash
#SBATCH --ntasks=M
#SBATCH --cpus-per-task=T
#SBATCH --ntasks-per-socket=S
#SBATCH --nodes=M/(2S)
mpirun -n M -ppn ranks_per_node my_hybrid_program

where ranks_per_node is the number of MPI ranks per node (2S with two sockets per node).