- 1 Introduction
- 2 Storage
- 3 Slurm
- 4 Own Code
- 5 Running a benchmark on Euler
- 6 Languages
On this page, we list some recommendations on how to use the cluster efficiently. Following them reduces the load on the system and speeds up your job throughput by improving your workflow.
Euler contains multiple storage systems, which are optimized for different purposes (high amount of I/O operations, large bandwidth, etc.). By choosing the best suited storage system for your workflow, you can improve time to solution for your jobs.
- This comparison of storage systems gives an overview of the differences between the available storage systems.
- Quotas are based on the total size of the files AND on the number of files, so please try to keep the number of files low (e.g. avoid conda).
- Due to its low latency, the local scratch of the compute nodes is almost always the best option in terms of performance.
- NetApp file systems (/cluster/home, /cluster/project) are optimized for a high number of I/O operations. These storage systems are well suited for handling a large number of small files
- Lustre file systems (/cluster/scratch, /cluster/work) are optimized for large bandwidth. Lustre is a high-performance parallel file system with read/write speeds of up to 38 GB/s. It is optimized for large files (>= 4 MB) and does not handle a large number of small files well.
For more details see Job Management with Slurm.
- Check if your resource usage is efficient. We provide the command get_inefficient_jobs for this purpose; run it regularly to ensure good usage of the cluster.
- Request only the resources you actually need, not what you think you might need (see this section).
- Avoid requesting too many resources on a GPU node: multiple GPUs are available on a single node, and oversized requests can prevent other users from using the remaining GPUs (see this section).
- Slurm is not designed to handle a large number of tiny jobs. Please group them together to avoid overloading Slurm.
- Do not run commands in a loop without waiting between steps. Sending too many requests spams Slurm and makes it unresponsive for other users.
- If you need to submit many jobs, consider using job arrays.
- Do not use the wrap parameter with sbatch (e.g. sbatch --wrap="echo test"); instead, submit a script that sets up the environment and contains your commands. This makes it easier for someone else to reuse your work.
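The last two points can be combined: a minimal job-array submission script might look as follows (a sketch only; the resource values and the program `process.py` are placeholders to adapt to your workload):

```shell
#!/bin/bash
#SBATCH --job-name=my_array
#SBATCH --array=1-100          # 100 tasks as one array job instead of 100 tiny jobs
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=4000
#SBATCH --time=01:00:00

# Each array task receives its own index in SLURM_ARRAY_TASK_ID,
# which can be used to select its input file or parameter set.
python process.py --input "data_${SLURM_ARRAY_TASK_ID}.txt"
```

Submitting this once with sbatch schedules all 100 tasks while sending only a single request to Slurm.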
How to Improve Your Resource Usage
The easiest approach is to run your script once and then check the resource usage with myjobs -j JOB_ID:
```
XXX@eu-login-26:~$ myjobs -j 24283000
Job information
 Job ID                      : 24283000
 Status                      : COMPLETED
 Running on node             : eu-a2p-372
 User                        : XXX
 Shareholder group           : XXX
 Slurm partition (queue)     : normal.120h
 Command                     : sbatch submit.sh
 Working directory           : /cluster/scratch/XXX
Requested resources
 Requested runtime           : 2-00:00:00
 Requested cores (total)     : 8
 Requested nodes             : 1
 Requested memory (total)    : 40960 MiB
Job history
 Submitted at                : 2023-08-11T13:20:00
 Started at                  : 2023-08-11T13:25:00
 Queue waiting time          : 5 m
Resource usage
 Wall-clock                  : 00:00:51
 Total CPU time              : 00:34.007
 CPU utilization             : 12.2%
 Total resident memory       : 3960.6 MiB
 Resident memory utilization : 9.67%
```
The interesting parts are CPU utilization and Resident memory utilization. From the percentage, you can see that this job is requesting too many cores and too much system RAM.
On the CPU side, we can see that the utilization is far less than 100% / 8 cores requested, which is a strong indication that the code does not run on multiple cores. Requesting a single core would be sufficient and would increase the CPU utilization to almost 100%.
If the percentage were higher but still below 100%, you could either decrease the number of cores or improve the parallel efficiency of your code.
For the system RAM, we are in a similar situation. Requesting less RAM (e.g. 4000 MiB) would be sufficient and would still leave a small buffer for uncertainties.
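The memory utilization reported above can be verified from the raw values in the output; a quick sanity check (figures taken from the example above):

```python
# Values from the myjobs example output above.
requested_mem_mib = 40960
resident_mem_mib = 3960.6

# Memory utilization = resident memory / requested memory.
mem_utilization = resident_mem_mib / requested_mem_mib * 100
print(f"{mem_utilization:.2f}%")  # 9.67%, matching the myjobs report

# CPU utilization follows the same idea: CPU time relative to
# wall-clock time times the number of requested cores. Note that
# myjobs' exact accounting of elapsed time may differ slightly.
cpu_utilization = 34.007 / (51 * 8) * 100
```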
How Many Resources Can I Request on a GPU Node
In theory, each job on a GPU node should request at most the fraction of the node's resources corresponding to the number of GPUs requested divided by the total number of GPUs on the node. Of course, people request different amounts of resources, so smaller jobs can run next to large ones; we therefore have no issue if you use slightly more resources than this theoretical limit.
The first step to find out this limit is to determine on which node you are running. It is shown by squeue for running jobs, and you can check it with myjobs -j JOB_ID for past jobs. Once you have the node name (say eu-g3-001), you can check its configuration with:
```
XXX@eu-login-08:~$ scontrol show node=eu-g3-001 | grep -e CfgTRES
   CfgTRES=cpu=128,mem=513500M,billing=2177240,gres/gpu=8,gres/gpu:nvidia_geforce_rtx_2080_ti=8,gres/gpumem=11811160064
```
You can see that this node has 128 cores (cpu=128), 513.5 GB of system RAM (mem=513500M) and 8 GPUs (gres/gpu=8). So if you request 1 GPU, the limit for this node is 16 cores (128 / 8) and roughly 64 GB of RAM (513.5 / 8).
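This per-GPU share can be computed mechanically; a small sketch using the eu-g3-001 values from the scontrol output above:

```python
# Compute the fair share of a node's resources per requested GPU,
# using the eu-g3-001 figures from the scontrol output above.
def per_gpu_share(total_cores, total_mem_mib, total_gpus, gpus_requested=1):
    fraction = gpus_requested / total_gpus
    return total_cores * fraction, total_mem_mib * fraction

cores, mem = per_gpu_share(total_cores=128, total_mem_mib=513500, total_gpus=8)
print(cores, mem)  # 16.0 cores and 64187.5 MiB (~64 GB) per GPU
```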
While we do our best to provide an HPC infrastructure, the most crucial part of HPC is the software used for the computations. Spending a bit of time optimizing and/or parallelizing a code can greatly decrease the time required to get a solution, while also reducing your climate impact by avoiding the use of unnecessary resources. In this section, we give hints on how to take care of your code and how to optimize and parallelize it.
If you need help improving your code, you can ask the cluster support or, for DBIOL, DBSSE and DGESS, contact the code clinic, where you can get a far higher level of support (other departments need a subscription for this service).
Formatting / Checking your Code
Tools exist in every language to format your code according to the standard and check it for best practices. Using them helps when sharing your code (e.g. with the cluster support when requesting help) and can even prevent mistakes.
Most Git hosting websites (e.g. gitlab.com or github.com) provide continuous integration, which is a fancy term for scripts that run every time a change is pushed to the repository. A good practice is to set up a pipeline that runs code formatting tools and automatic tests (e.g. unit tests), which guarantees that your code works as expected.
When talking about optimization, keep in mind that optimization does not only concern the time to solution but also memory / storage usage. While this section focuses on speed optimization, memory / storage optimization can also help a lot, but it is far more specific to each code.
The main steps to optimize a code are:
- Finding the bottleneck with a profiler
- Finding why it is a bottleneck
- Re-implementing the bottleneck
- Repeating these steps: check whether your bottleneck is still a bottleneck and move on to the next one if it is not
When optimizing your code, the most important elements to think about (from most to least important) are:
- Algorithm / Data structure (e.g. transforming a code scaling in N^3 to scaling in N)
- Improvements to the implementation (e.g. parallelization, pre-computation, using caches, avoiding extra copies, avoiding type conversions, calling compiled code from scripting languages [such as Python], ...)
- Vectorization / Low level optimization (advanced users)
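As an illustration of the first point (a hypothetical example, not taken from a real user code): changing only the data structure can turn a quadratic algorithm into a linear one.

```python
# Count how many query values appear in a reference collection.
queries = list(range(50000))
reference = list(range(0, 100000, 2))

# O(N*M): each `in` on a list scans the whole list.
# slow_count = sum(1 for q in queries if q in reference)

# O(N + M): a set lookup is constant time on average,
# so building the set once makes the whole count linear.
reference_set = set(reference)
fast_count = sum(1 for q in queries if q in reference_set)
print(fast_count)  # 25000: the even numbers among 0..49999
```

The two versions produce identical results; only the data structure (and therefore the scaling) differs.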
Using specialized hardware (e.g. GPUs) is not necessarily the way to go to get better performance. If your computations cannot be done in parallel by a large number of threads performing the same operation at the same time (e.g. no if or switch), then GPUs will most likely not provide interesting performance. Optimizing a code requires far less effort than migrating it to GPUs and, depending on your work, can be a lot more effective.
We provide a list of tools specific to each programming language in their corresponding section below.
The tools for parallelization depend on the programming language, so they will be mentioned below. Different parallelization models exist depending on your aim (from simpler to harder):
- Shared memory model: This is by far the simplest model, where multiple threads on the same machine perform the computation. All threads share the same memory and can read / write it at the same time (one needs to be careful with race conditions).
- Distributed memory model: This model allows scaling to multiple nodes (and also works with multiple processes on a single machine) but requires explicitly transferring data between processes, as they cannot access each other's memory.
- Hybrid memory model: This is simply a mix of the two previous models. It relies on the shared memory model between threads within a node and the distributed memory model between nodes.
The most common libraries for the shared memory model are OpenMP and pthreads. For distributed memory, it is MPI (e.g. Open MPI, MPICH and Intel MPI).
Once a code is parallelized, it is important to check if the implementation is working correctly by doing a scaling plot. Before going into the details, it is important to mention the two types of scaling:
- Weak scaling: A perfectly parallelized code should take the same time to process N elements on T threads as N * X elements on T * X threads.
- Strong scaling: A perfectly parallelized code should take a time T to reach the solution on a single thread and T / M to reach it with M threads.
It means that, given an infinite amount of resources, weak scaling ensures that you can solve a problem of any size, while strong scaling ensures that you can solve it in any required amount of time.
To create a scaling plot, simply run the computation with different numbers of threads and measure the time required. In the case of weak scaling, the problem size needs to be adapted to scale linearly with the number of threads. You can then plot the speedup (the time to solution for one thread divided by the time to solution) against the number of threads, for both strong and weak scaling. If you want more information, you can look at the following wiki. Something to keep in mind when analyzing the scaling of a code is Amdahl's law, which states that your maximum speedup is limited by the fraction of your code that is not parallel.
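Amdahl's law can be written as S(M) = 1 / ((1 - p) + p / M), where p is the parallel fraction of the code and M the number of threads; a small sketch:

```python
# Amdahl's law: maximum speedup on m threads for a code
# whose parallel fraction is p (0 <= p <= 1).
def amdahl_speedup(p, m):
    return 1.0 / ((1.0 - p) + p / m)

# Even with 95% of the code parallelized, the serial 5%
# caps the speedup at 1 / (1 - p) = 20x.
print(round(amdahl_speedup(0.95, 64), 1))  # 15.4x on 64 threads
print(round(1.0 / (1.0 - 0.95)))           # 20x asymptotic limit
```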
Running a benchmark on Euler
Running benchmarks on Euler requires attention to a few key points:
- We have many different types of nodes, so make sure that you always compare runs on the same hardware (CPU, GPU, networking). You can pin down the hardware with the Slurm --constraint option to select a given CPU model and node generation (i.e., using the ibfabric7 or fabric8 feature) for CPU nodes. For GPU nodes, you can select the GPU model using --gpus=MODEL:X together with a CPU model (i.e., --constraint EPYC_7742 --gpus=2080:1). We do not recommend selecting specific nodes by name (the -w option).
- Ensure that your performance is not impacted by other users. You can request an exclusive node with --exclusive. Please use this only for running benchmarks and not as a normal workflow (the CPU time of the whole node will be deducted from your account even if you request a single core).
- If you use storage / networking, you might be impacted by other users, as those resources are shared. Running the benchmarks multiple times can help to get more reliable results.
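Putting these points together, a benchmark submission might look as follows (a sketch only; the constraint value must be adapted to the hardware you want to test, and ./my_benchmark is a placeholder):

```shell
#!/bin/bash
#SBATCH --job-name=benchmark
#SBATCH --constraint=EPYC_7742   # always benchmark on the same CPU model
#SBATCH --exclusive              # whole node, only for benchmarks
#SBATCH --time=01:00:00

# Repeat the measurement a few times: shared storage / networking
# can disturb individual runs.
for i in 1 2 3; do
    ./my_benchmark
done
```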
Below you can find recommendations specific to some programming languages.
Python does not truly compile your code: when the code is executed, the Python interpreter reads your file and sends the instructions to the CPU without any deep optimization based on the surrounding code. To avoid this behavior, the code can be compiled through the C API, which can easily provide speedups of 100x. Some libraries already do this, and you can also do it yourself. See below for more details.
- Python loops are slow and should only be used for the highest-level loops (each instruction within a Python loop should take at least a few seconds to make the loop overhead worthwhile).
- Use C libraries as much as possible, as they provide C performance (e.g. numpy [more performant than pandas], numba, pytorch, tensorflow).
- You can write part of your code in C (directly or indirectly) in order to speed it up. From simplest to most complex: numba, Cython, the C++ Boost library, the C/C++ API.
- Shared memory: Unfortunately, the infamous GIL (Python's global interpreter lock) prevents a true shared memory model. One solution is to release the GIL in C and use C parallelization. With numba, you don't have to deal with this yourself.
- Distributed memory: MPI4Py provides an implementation of MPI for Python.
It is worth mentioning the standard library package multiprocessing, which allows sharing data between processes.
- cProfile: here you can find an example
- Flake8: Ensure that your code follows the standard recommendations
- Black and Isort: Format your code according to a standard style
- Pytest: Allows the creation of tests for your code (ensuring that the code behaves as expected). It can be used with Coverage to see which parts of the code have been tested.
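A minimal cProfile sketch (the profiled function is a placeholder for your own code):

```python
import cProfile
import io
import pstats

def slow_sum(n):
    # Placeholder workload: an intentionally slow Python-level loop.
    total = 0
    for i in range(n):
        total += i
    return total

profiler = cProfile.Profile()
profiler.enable()
result = slow_sum(100000)
profiler.disable()

# Sort by cumulative time and show the most expensive calls;
# in a real code this points at the bottleneck to optimize first.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
print(result)  # 4999950000
```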
C / C++
- Use the compiler's optimization flags. The most important one is -O3.
- Inline code
- Make memory access as predictable as possible in order to allow the memory to be fetched as early as possible (avoid too many levels of pointers; lay out the memory so you can loop over it without jumps).
- Avoid implicit type conversion (e.g. multiplying a float with a double)
- Avoid using double if not needed (many scientific simulations don't require double)
- Avoid having global variables
- Avoid divisions when possible (compute the inverse once, then multiply by it as many times as needed)
- In C++, avoid using polymorphism
- In C++, use constant references rather than copies
- In C++, avoid using exceptions in the standard path of your code
- Shared memory: OpenMP
- Distributed memory: MPI