The goal of parallel computing is to reduce the time-to-solution of a problem by running it on multiple cores.
Speedup and efficiency
For a parallel job, we can calculate the speedup S and the efficiency E by comparing the run-time T(1) on one core with the run-time T(n) on n cores:

S = T(1) / T(n),    E = S / n
Optimally, the speedup from parallelization would be linear: doubling the number of processing elements should halve the run-time, and doubling it a second time should again halve the run-time. However, very few parallel algorithms achieve optimal speedup. If you plan to run a larger parallel job, then please do a scaling study first: run a medium-size example on 1, 2, 4, 8, 12, and 24 cores and calculate the speedup and the efficiency of the runs to determine whether your code even scales up to 24 cores, or whether the sweet spot corresponds to a lower number of cores. If you can only use a limited number of cores, this will help you to achieve a higher throughput in terms of the number of jobs.
Please see below an example of a scaling study done on Euler for a code that is memory-bound.
The scaling study indicates that using 4 cores is the sweet spot in terms of parallel efficiency; adding more cores even makes the job slower.
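The evaluation step of such a scaling study can be sketched as follows. The run-times below are hypothetical placeholders chosen to illustrate a memory-bound code whose sweet spot is at 4 cores; they are not measurements from Euler.

```python
# Hypothetical wall-clock run-times (seconds) from a scaling study,
# measured for the same medium-size example on different core counts.
run_times = {1: 1000.0, 2: 520.0, 4: 280.0, 8: 260.0, 12: 270.0, 24: 300.0}

t1 = run_times[1]  # reference: run-time on a single core
for n in sorted(run_times):
    speedup = t1 / run_times[n]   # S = T(1) / T(n)
    efficiency = speedup / n      # E = S / n
    print(f"{n:>2} cores: speedup = {speedup:5.2f}, efficiency = {efficiency:4.2f}")
```

With numbers like these, the efficiency drops sharply beyond 4 cores, and the 24-core run is slower than the 8-core run, which is the pattern described above.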
Often a fraction of the code cannot be parallelized and will run serially.
Amdahl's Law describes the maximal speedup S of a computation with a parallel fraction p of the code and a speedup s of the parallel part of the code:

S = 1 / ((1 - p) + p/s)
Assume, for instance, a code that contains a serial fraction of 20% (p = 0.8) and that has a linear speedup in the parallel part. If the job is executed on a 24-core compute node in Euler (s = 24), then the maximal speedup is

S = 1 / (0.2 + 0.8/24) ≈ 4.29
This job would therefore on average only use 4.29 of the 24 cores that are allocated for it, which would be quite a waste of resources. You can get a much higher throughput by running six 4-core jobs on 24 cores instead of using all 24 cores for a single job.
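The Amdahl calculation above can be reproduced with a few lines of Python; the function name is just a label for this sketch.

```python
def amdahl_speedup(p, s):
    """Maximal speedup by Amdahl's law.

    p: parallel fraction of the code (0 <= p <= 1)
    s: speedup of the parallel part (e.g. the core count, if that
       part scales linearly)
    """
    return 1.0 / ((1.0 - p) + p / s)

# Example from the text: 20% serial fraction (p = 0.8),
# linear speedup of the parallel part on 24 cores (s = 24).
S = amdahl_speedup(0.8, 24)
print(f"maximal speedup on 24 cores: {S:.2f}")  # about 4.29
```

Note that even with an unlimited number of cores (s → ∞), the speedup of this code can never exceed 1/0.2 = 5, which is why adding cores beyond the sweet spot wastes resources.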