Parallel computing


Introduction

Parallel computing is a type of computation in which many calculations or the execution of processes are carried out simultaneously. Large problems can often be divided into smaller ones, which can then be solved at the same time. Parallelism has been employed for many years in high-performance computing.

In parallel computing, a computational task is typically broken down in several, often many, very similar subtasks that can be processed independently and whose results are combined afterwards, upon completion. In contrast, in concurrent computing, the various processes often do not address related tasks; when they do, as is typical in distributed computing, the separate tasks may have a varied nature and often require some inter-process communication during execution.

Parallel computers can be roughly classified according to the level at which the hardware supports parallelism, with multi-core and multi-processor computers having multiple processing elements within a single machine, while clusters, MPPs, and grids use multiple computers to work on the same task.

A theoretical upper bound on the speed-up of a single program as a result of parallelization is given by Amdahl's law.

Shared memory parallelization (OpenMP)

OpenMP (Open Multi-Processing) is an application programming interface (API) that supports multi-platform shared memory multiprocessing programming in C, C++, and Fortran. It consists of a set of compiler directives, library routines, and environment variables that influence run-time behavior.

OpenMP uses a portable, scalable model that gives programmers a simple and flexible interface for developing parallel applications for platforms ranging from the standard desktop computer to the supercomputer.

If your application is parallelized using OpenMP or linked against a library using OpenMP (Intel MKL, OpenBLAS, etc.), the number of cores (or threads) that it can use is controlled by the environment variable OMP_NUM_THREADS. This variable must be set before you submit your job:

export OMP_NUM_THREADS=number_of_cores
sbatch --ntasks=1 --cpus-per-task=number_of_cores ...

Please note that for OpenMP, you request --ntasks=1 and then request the number of cores through the sbatch option --cpus-per-task.

NOTE: if OMP_NUM_THREADS is not set, your application will either use one core only, or will attempt to use all cores that it can find, 'stealing' them from other jobs if needed. In other words, your job will either use too few or too many cores.
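
A complete job script for such a case could look roughly as follows (a minimal sketch; my_openmp_program is a placeholder, and SLURM_CPUS_PER_TASK is set by Slurm to the value of --cpus-per-task):

#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
# use exactly the number of cores granted by Slurm
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
./my_openmp_program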

Pthreads and other threaded applications

The behavior of Pthreads and other threaded applications is similar to that of OpenMP applications. It is important to limit the number of threads that the application spawns. There is no standard way to do this, so be sure to check the application's documentation. Usually a program supports at least one of the following four ways to limit itself to N threads (a generic sketch follows the list):

  • it understands the OMP_NUM_THREADS=N environment variable,
  • it has its own environment variable, such as GMX_NUM_THREADS=N for Gromacs,
  • it has a command-line option, such as -nt N (for Gromacs), or
  • it has an input-file option, such as num_threads N.
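
A generic sketch of these four options (the program name, the MYAPP_NUM_THREADS variable, and the -nt flag are hypothetical placeholders; check your application's documentation for the real ones):

export OMP_NUM_THREADS=4                                                # (1) generic OpenMP-style variable
export MYAPP_NUM_THREADS=4                                              # (2) hypothetical program-specific variable
sbatch --ntasks=1 --cpus-per-task=4 --wrap="./my_threaded_app -nt 4"    # (3) command-line option
# (4) alternatively, set an option such as "num_threads 4" in the program's input file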

If you are unsure about the program's behavior, please contact us and we will analyze it.

Tips and Tricks

GNU libgomp

Virtually all OpenMP programs compiled with the GNU compilers use the GNU libgomp library. As such, the following environment variables may make your program run faster, or they may make it run much slower. It is safer not to use them at all than to use them without testing whether they actually help.

OMP_PROC_BIND
Binds threads to cores. This option is relatively safe to use, although some programs may run much slower with it. To use it, set the OMP_PROC_BIND environment variable to true before submitting the job (or in a job script):
export OMP_PROC_BIND=true
bsub -n 4 my_program
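On the Slurm-based setup used elsewhere on this page, a roughly equivalent submission (my_program is a placeholder) would be:
export OMP_PROC_BIND=true
sbatch --ntasks=1 --cpus-per-task=4 --wrap="./my_program"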
GOMP_CPU_AFFINITY
Binds specific threads to specific cores. Some programs may run much slower with this option. This is strongly discouraged: in most cases the OMP_PROC_BIND option is sufficient. To use it anyway, set the GOMP_CPU_AFFINITY environment variable in a job script (e.g., my_script.sh) according to the cores assigned by LSF:
#!/bin/bash
export GOMP_CPU_AFFINITY="${LSB_BIND_CPU_LIST//,/ }"
my_program
and submit the script to LSF:
bsub -n 4 < my_script.sh
Intel OpenMP library

Use the KMP_AFFINITY environment variable to control affinity. For example,

KMP_AFFINITY=compact

Refer to the Intel compiler documentation for details and other options.
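
For example, a job script for a program built with the Intel compilers might contain the following (a sketch; my_intel_program is a placeholder, and the verbose modifier, which prints the resulting binding at startup, is optional):

export OMP_NUM_THREADS=8
export KMP_AFFINITY=verbose,compact
./my_intel_program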

Hyperthreading

Hyper-threading is Intel's proprietary simultaneous multithreading (SMT) implementation used to improve parallelization of computations (doing multiple tasks at once) performed on x86 microprocessors.

For each processor core that is physically present, the operating system addresses two virtual (logical) cores and shares the workload between them when possible. The main function of hyper-threading is to increase the number of independent instructions in the pipeline; it takes advantage of superscalar architecture, in which multiple instructions operate on separate data in parallel. With HTT, one physical core appears as two processors to the operating system, allowing concurrent scheduling of two processes per core.

Hyperthreading on Euler

With hyper-threading enabled, the operating system sees 48 logical cores on a 24-core node. The batch system also sees these logical cores, but continues to use physical cores when allocating resources to batch jobs. A job requesting 1 (physical) core will thus get two logical cores and will be able to execute two threads simultaneously, if the application supports it.

Relation to Slurm job slots

Slurm is aware of hyperthreading, so there is no change to how jobs are assigned to physical cores. This means there continue to be 24 job slots on the 24 cores of an Euler I or II node. Each slot, however, is assigned both virtual cores of a physical core.

All of the supported MPI libraries we provide are also aware of hyperthreading and continue to schedule only one rank (MPI process) to each physical core.

Using hyperthreading

In those cases where you are running a loosely-coupled parallel program, you can make use of hyperthreading to let twice as many processes run as you have requested cores. While each individual process will run slower, the time-to-solution will probably be shorter than if they ran sequentially one after the other.

To ensure your job will run only on nodes with hyperthreading enabled, use the sbatch option --ntasks-per-core=2:

sbatch --ntasks-per-core=2 ...     # request nodes where HT is enabled (2 threads per core)
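
As a rough sketch, assuming the behavior described above (each requested physical core provides two logical cores) and placeholder program and file names, a job script for a loosely-coupled workload could start twice as many processes as physical cores requested:

#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --ntasks-per-core=2
# 4 physical cores with hyperthreading -> 8 logical cores; run 8 independent processes
for i in $(seq 1 8); do
    ./my_serial_program input_$i.dat > output_$i.log &
done
wait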


Distributed memory parallelization (MPI)

Message Passing Interface (MPI) is a standardized and portable message-passing standard designed by a group of researchers from academia and industry to function on a wide variety of parallel computing architectures. MPI is a communication protocol for programming parallel computers; both point-to-point and collective communication are supported. MPI's goals are high performance, scalability, and portability, and it remains the dominant model used in high-performance computing today. MPI is not sanctioned by any major standards body; nevertheless, it has become a de facto standard for communication among processes that model a parallel program running on a distributed-memory system, such as a computer cluster. Most MPI implementations consist of a specific set of routines directly callable from C, C++, and Fortran (i.e., an API), and from any language able to interface with such libraries, including C#, Java, or Python.

MPI on our clusters

Two kinds of MPI libraries are available on our cluster: Open MPI (recommended) and MVAPICH2. Before you can submit and execute an MPI job, you must load the corresponding modules (compiler + MPI, in that order):

module load compiler
module load mpi_library

The command used to launch an MPI application is mpirun.

Let's assume for example that hello_world was compiled with GCC 8.2.0 and linked against Open MPI 4.1.4. The command to execute this job on 4 processors is:

module load gcc/8.2.0
module load openmpi/4.1.4
sbatch --ntasks=4 --wrap="mpirun ./hello_world"

Note that mpirun automatically uses all processors allocated to the job by Slurm. It is therefore not necessary to indicate this number again to the mpirun command itself:

sbatch --ntasks=4 --wrap="mpirun -np 4 ./hello_world"      ←  "-np 4" not needed!
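
The same example can also be written as a job script and submitted with sbatch (a minimal sketch; hello_world is assumed to have been compiled beforehand, e.g. with mpicc after loading the same modules):

#!/bin/bash
#SBATCH --ntasks=4
module load gcc/8.2.0
module load openmpi/4.1.4
# mpirun picks up the 4 tasks allocated by Slurm
mpirun ./hello_world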

Hybrid jobs

Introduction

In certain cases it is advantageous to run hybrid jobs such as a program that mixes both MPI and OpenMP. For example, instead of running a program with 24 MPI ranks on 24 cores you run a program with 2 MPI ranks with 12 threads each on those 24 cores.

Before running such jobs for production calculations, please ensure that you get an appropriate speedup in comparison to a pure MPI or pure OpenMP job.

Let's say you want to run a program on N cores with M MPI ranks and T OpenMP threads per MPI rank where N=M×T. It is strongly advisable that

  • the number of cores on the node (24 in Euler) is divisible by your chosen T, the number of threads per MPI rank, and
  • you match threads and MPI ranks to the sockets of the node (there are two sockets per node in Euler).

Good combinations on Euler:

  • 2 MPI ranks per node, 12 threads per MPI rank (M=N/12, T=12 and S=1 where S is the number of ranks per socket),
  • 4 MPI ranks per node, 6 threads per MPI rank (M=N/6, T=6 and S=2), or even
  • 12 MPI ranks per node, 2 threads per MPI rank (M=N/2, T=2 and S=6).

Of course this needs to be balanced against the performance behavior of your threaded program, which you should test before relying on such jobs for production.

Open MPI >= 4.1.4 and Intel 2022

The general way to run such a job is to submit the following script with "sbatch FILENAME" where the file contains:

#!/bin/bash
#SBATCH --ntasks=M
#SBATCH --cpus-per-task=T
#SBATCH --ntasks-per-socket=S
#SBATCH --cores-per-socket=T*S
#SBATCH --nodes=M/(2S)
export OMP_NUM_THREADS=T
export OMP_PLACES=cores
srun --cpus-per-task=T my_hybrid_program

Here we assumed that there are 2 sockets per node in order to set the number of nodes. If you do not specify the number of tasks per socket, you do not need to specify the number of nodes. We need to specify --cpus-per-task twice due to a bug in Slurm. We also request CPU binding to the cores, which often improves performance thanks to faster access to the caches. If you are interested in more details (e.g., GPU binding), take a look at the NERSC documentation.

For example, if you wish to run your software on 48 cores (N=48) with 8 MPI ranks (M=8) and 6 threads per rank (T=6) and using only 2 MPI ranks per socket (S=2):

#!/bin/bash
#SBATCH --ntasks=8
#SBATCH --cpus-per-task=6
#SBATCH --ntasks-per-socket=2
#SBATCH --cores-per-socket=12
#SBATCH --nodes=2
export OMP_NUM_THREADS=6
export OMP_PLACES=cores
srun --cpus-per-task=6 my_hybrid_program

These examples assume that full nodes are used.

Older MPI libraries

Older MPI libraries should be used in the same way, except that they are not integrated with srun, so mpirun must be used to launch the program. We strongly recommend using a more recent version of MPI that is supported by srun.

OpenMPI:

#!/bin/bash
#SBATCH --ntasks=M
#SBATCH --cpus-per-task=T
#SBATCH --ntasks-per-socket=S
#SBATCH --nodes=M/(2S)
export OMP_NUM_THREADS=T
mpirun -n M --map-by node:PE=T my_hybrid_program

MVAPICH2:

#!/bin/bash
#SBATCH --ntasks=M
#SBATCH --cpus-per-task=T
#SBATCH --ntasks-per-socket=S
#SBATCH --nodes=M/(2S)
export OMP_NUM_THREADS=T
mpirun -n M -ppn ranks_per_node my_hybrid_program
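
For instance, with the values from the example above (M=8, T=6, S=2, i.e. 4 ranks per node on 2 nodes), the launch lines would become (a sketch; my_hybrid_program is a placeholder):

mpirun -n 8 --map-by node:PE=6 my_hybrid_program    # Open MPI
mpirun -n 8 -ppn 4 my_hybrid_program                # MVAPICH2: 4 ranks per node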

Parallel efficiency

The goal of parallel computing is to reduce the time-to-solution of a problem by running it on multiple cores.

Speedup and efficiency

For a parallel job, we can calculate the speedup S and the efficiency E by comparing the run-time on one core T_1 and on n cores T_n.

S=\tfrac{T_1}{T_n}
E=\tfrac{T_1}{n\cdot T_n}

Optimally, the speedup from parallelization would be linear: doubling the number of processing elements should halve the run-time, and doubling it a second time should again halve the run-time. However, very few parallel algorithms achieve optimal speedup. If you plan to run a larger parallel job, please do a scaling study first: run a medium-size example on 1, 2, 4, 8, 12 and 24 cores, then calculate the speedup and efficiency of the runs to determine whether your code scales up to 24 cores or whether the sweet spot lies at a lower core count. If you can only use a limited number of cores, this will help you get a higher throughput in terms of number of jobs.
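
Such a scaling study can be scripted; a rough sketch for an MPI program (my_program is a placeholder) that submits one job per core count:

for n in 1 2 4 8 12 24; do
    sbatch --ntasks=$n --job-name=scaling_$n --wrap="mpirun ./my_program"
done
# compare the run times reported for the jobs to compute speedup and efficiency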

Below is an example of a scaling study done on Euler for a code that is memory-bound.

Cores   Runtime [s]   Speedup   Efficiency
    1         26779         -            -
    2         15237      1.76          88%
    4          6743      3.97          99%
    8         13435      1.99          25%
   12         19369      1.38          12%
   24         18761      1.43           6%
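
For instance, the 4-core row follows directly from the definitions above:

S_4=\tfrac{T_1}{T_4}=\tfrac{26779}{6743}\approx 3.97,\qquad E_4=\tfrac{3.97}{4}\approx 99\%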

The scaling study indicates that 4 cores is the sweet spot in terms of parallel efficiency; adding more cores even makes the job slower.

Amdahl's law

Often a fraction of the code cannot be parallelized and will run serially.

Amdahl's law describes the maximal speedup S_{max} of a computation with a parallel code fraction p and a speedup s of the parallel part of the code.

S_{max}=\tfrac{1}{(1-p)+\tfrac{p}{s}}

Assume, for instance, a code with a serial fraction of 20% (p=1-0.2=0.8) whose parallel part has a linear speedup (s equal to the number of cores). If the job is executed on a 24-core compute node in Euler, then the maximal speedup is

S_{max}=\tfrac{1}{(1-0.8)+\tfrac{0.8}{24}} = 4.29

This job would therefore on average use only 4.29 of the 24 cores allocated to it, which would be quite a waste of resources. You can get a much higher throughput by running six 4-core jobs on those 24 cores instead of using all 24 cores for a single job.
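
For comparison, with the same assumptions (p=0.8 and a linear speedup of the parallel part), the maximal speedup on 4 cores is

S_{max}=\tfrac{1}{(1-0.8)+\tfrac{0.8}{4}} = 2.5

corresponding to a parallel efficiency of about 63%, instead of 4.29/24 ≈ 18% for the 24-core job; this is why six 4-core jobs deliver a much higher overall throughput than a single 24-core job.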