GPU job submission with SLURM

From ScientificComputing


ⓘ Note

You can only use GPUs if you are a member of a shareholder group that has invested in GPU nodes.


Figure: Example of a CPU & GPU system architecture. Several different system architectures are present on the cluster.

To use GPUs for a job, you need to request them with the -G or --gpus option.

For example, to run a serial job with one GPU:

$ sbatch -G 1 --wrap="./my_cuda_program"
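The same request can be made from a jobscript instead of with --wrap. A minimal sketch (the #SBATCH directives mirror the command-line options; the run time and memory values are placeholders to adjust for your job, and my_cuda_program stands in for your own executable):

```shell
#!/bin/bash
#SBATCH -G 1                  # request one GPU
#SBATCH --time=01:00:00       # wall-clock limit (placeholder value)
#SBATCH --mem-per-cpu=4g      # CPU memory per core (placeholder value)

# run the CUDA program on the allocated GPU
./my_cuda_program
```

Submit it with sbatch jobscript.sh.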

How to select GPU memory

If you know that your program needs more GPU memory than some models provide, e.g., more than 8 GB, you can request that your job run only on GPUs with enough memory. Use the --gres=gpumem:<size> option for this. For example, if you need 10 GB (= 10240 MB) per GPU:

$ sbatch -G 1 --gres=gpumem:10g --wrap="./my_cuda_program"

This ensures your job will not run on GPUs with less than 10 GB of GPU memory. Note that the default unit for the gpumem option is bytes; you are therefore advised to specify units explicitly, for example 20g or 11000m.
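Equivalently, as a jobscript (the 10g value is just the amount from the example above; request whatever your program actually needs):

```shell
#!/bin/bash
#SBATCH -G 1                  # one GPU
#SBATCH --gres=gpumem:10g     # only run on GPUs with at least 10 GB of GPU memory

./my_cuda_program
```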

How to select a GPU model

In some cases it is desirable or necessary to select the GPU model on which your job runs, for example if you know your code runs much faster on a newer model. Keep in mind, however, that narrowing down the list of allowable GPUs may increase your job's waiting time.

To select a certain GPU model, add the --gpus=<slurm_specifier>:N resource requirement to sbatch, where the Slurm specifier for each GPU model is listed in the table below, and N is the number of requested GPUs.

$ sbatch --gpus=gtx_1080_ti:1 --wrap="./my_cuda_program"
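As a jobscript, using the gtx_1080_ti specifier from the table below (any other specifier from the table works the same way):

```shell
#!/bin/bash
#SBATCH --gpus=gtx_1080_ti:1   # one GeForce GTX 1080 Ti

./my_cuda_program
```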


Available GPU node types

| GPU model | Slurm specifier | GPUs per node | GPU memory per GPU | CPU cores per node | System memory per node | CPU cores per GPU | System memory per GPU | Compute capability | Minimal CUDA version |
|---|---|---|---|---|---|---|---|---|---|
| NVIDIA GeForce GTX 1080 Ti | gtx_1080_ti | 8 | 11 GiB | 20 | 256 GiB | 2.5 | 32 GiB | 6.1 | 8.0 |
| NVIDIA GeForce RTX 2080 Ti | rtx_2080_ti | 8 | 11 GiB | 36 | 384 GiB | 4.5 | 48 GiB | 7.5 | 10.0 |
| NVIDIA GeForce RTX 2080 Ti | rtx_2080_ti | 8 | 11 GiB | 128 | 512 GiB | 16 | 64 GiB | 7.5 | 10.0 |
| NVIDIA GeForce RTX 3090 | rtx_3090 | 8 | 24 GiB | 128 | 512 GiB | 16 | 64 GiB | 8.6 | 11.0 |
| NVIDIA GeForce RTX 4090 | rtx_4090 | 8 | 24 GiB | 128 | 512 GiB | 16 | 64 GiB | 8.9 | 11.8 |
| NVIDIA TITAN RTX | titan_rtx | 8 | 24 GiB | 128 | 512 GiB | 16 | 64 GiB | 7.5 | 10.0 |
| NVIDIA Quadro RTX 6000 | quadro_rtx_6000 | 8 | 24 GiB | 128 | 512 GiB | 8 | 64 GiB | 7.5 | 10.0 |
| NVIDIA Tesla V100-SXM2 (32 GiB) | v100 | 8 | 32 GiB | 48 | 768 GiB | 6 | 96 GiB | 7.0 | 9.0 |
| NVIDIA Tesla V100-SXM2 (32 GiB) | v100 | 8 | 32 GiB | 40 | 512 GiB | 5 | 64 GiB | 7.0 | 9.0 |
| NVIDIA Tesla A100 (40 GiB) | a100-pcie-40gb | 8 | 40 GiB | 48 | 768 GiB | 6 | 96 GiB | 8.0 | 11.0 |
| NVIDIA Tesla A100 (80 GiB) | a100_80gb | 10 | 80 GiB | 48 | 1024 GiB | 4.8 | 96 GiB | 8.0 | 11.0 |
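Once a job is running, you can verify which GPU it actually received. The sketch below relies only on standard tooling: SLURM exposes the allocated GPU index(es) through the CUDA_VISIBLE_DEVICES environment variable, and nvidia-smi can report the model name and memory of the visible GPUs:

```shell
#!/bin/bash
#SBATCH -G 1                  # request one GPU

# SLURM restricts the job to its allocated GPU(s) via CUDA_VISIBLE_DEVICES
echo "Allocated GPU index(es): $CUDA_VISIBLE_DEVICES"

# print the model name and total memory of the allocated GPU
nvidia-smi --query-gpu=name,memory.total --format=csv

./my_cuda_program
```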


