Parallel efficiency

From ScientificComputing

The goal of parallel computing is to reduce the time-to-solution of a problem by running it on multiple cores.

Speedup and efficiency

For a parallel job, we can calculate the speedup S and the efficiency E by comparing the run-time on one core T_1 and on n cores T_n.

S=\tfrac{T_1}{T_n}

E=\tfrac{S}{n}=\tfrac{T_1}{n\cdot T_n}
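These two quantities can be sketched as small helper functions (a minimal Python illustration, not part of any cluster tooling):

```python
# Speedup S = T_1 / T_n and efficiency E = S / n from measured run-times.

def speedup(t1, tn):
    """Speedup: single-core run-time divided by n-core run-time."""
    return t1 / tn

def efficiency(t1, tn, n):
    """Parallel efficiency: speedup divided by the number of cores."""
    return t1 / (n * tn)

# Example: a run that takes 100 s on 1 core and 30 s on 4 cores
print(speedup(100, 30))        # ~3.33
print(efficiency(100, 30, 4))  # ~0.83
```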

Optimally, the speedup from parallelization would be linear: doubling the number of processing elements should halve the run-time, and doubling it a second time should again halve the run-time. However, very few parallel algorithms achieve optimal speedup. If you plan to run a larger parallel job, then please do a scaling study first: run a medium-size example on 1, 2, 4, 8, 12 and 24 cores, then calculate the speedup and the efficiency of the runs to determine whether your code even scales up to 24 cores, or whether the sweet spot is a lower number of cores. This will help you to get a higher throughput in terms of number of jobs if you can only use a limited number of cores.

Below is an example of a scaling study done on Euler for a code that is memory-bound.

Cores   runtime [s]   speedup   efficiency
    1         26779         -            -
    2         15237      1.76          88%
    4          6743      3.97          99%
    8         13435      1.99          25%
   12         19369      1.38          12%
   24         18761      1.43           6%

The scaling study indicates that using 4 cores is the sweet spot in terms of parallel efficiency, and that adding more cores even makes the job slower.
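The speedup and efficiency columns of the table can be reproduced directly from the measured run-times (a short Python sketch):

```python
# Run-times [s] from the scaling study on Euler, keyed by core count.
runtimes = {1: 26779, 2: 15237, 4: 6743, 8: 13435, 12: 19369, 24: 18761}
t1 = runtimes[1]  # single-core reference run-time

print(f"{'Cores':>5} {'runtime [s]':>12} {'speedup':>8} {'efficiency':>11}")
for n, tn in runtimes.items():
    s = t1 / tn   # speedup S = T_1 / T_n
    e = s / n     # efficiency E = S / n
    print(f"{n:>5} {tn:>12} {s:>8.2f} {e:>10.0%}")
```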

Amdahl's law

Often a fraction of the code cannot be parallelized and will run serially.

Amdahl's Law describes the maximal speedup S_{max} of a computation with a parallel fraction p of the code and a speedup S of the parallel part of the code.

S_{max}=\tfrac{1}{(1-p)+\tfrac{p}{S}}
Assume, for instance, a code with a serial fraction of 20% (p = 1 - 0.2 = 0.8) and a linear speedup in the parallel part (S equal to the number of cores). If the job is executed on a 24-core compute node in Euler, then the maximal speedup is

S_{max}=\tfrac{1}{(1-0.8)+\tfrac{0.8}{24}} = 4.29
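This number can be checked with a one-line implementation of Amdahl's law (a minimal Python sketch; the function name is illustrative):

```python
def amdahl_max_speedup(p, s):
    """Amdahl's law: maximal speedup for a code with parallel fraction p
    whose parallel part is sped up by a factor s."""
    return 1.0 / ((1.0 - p) + p / s)

# Serial fraction 20% (p = 0.8), linear speedup on 24 cores (s = 24)
print(amdahl_max_speedup(0.8, 24))  # ~4.29
```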

This job would therefore on average only use 4.29 of the 24 cores allocated to it, which would be quite a waste of resources. You can get a much higher throughput by running six 4-core jobs on those 24 cores instead of using all 24 cores for a single job.
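The throughput argument can be made concrete with Amdahl's law, under the same assumptions as above (serial fraction 20%, linear speedup of the parallel part); a back-of-the-envelope Python sketch:

```python
def amdahl_max_speedup(p, n):
    """Amdahl's law: maximal speedup on n cores with parallel fraction p."""
    return 1.0 / ((1.0 - p) + p / n)

p = 0.8  # serial fraction 20%, as in the example above

# Option A: one job using all 24 cores.
# Throughput in jobs per unit of single-core run-time:
throughput_one_24 = amdahl_max_speedup(p, 24)

# Option B: six independent 4-core jobs running side by side on 24 cores.
throughput_six_4 = 6 * amdahl_max_speedup(p, 4)

print(throughput_one_24)  # ~4.29 jobs per unit time
print(throughput_six_4)   # ~15 jobs per unit time, i.e. ~3.5x higher
```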