Change of GPU specifiers in the batch system
Introduction
The batch system lets you target particular GPU models by using a specifier that is provided by the NVIDIA GPU driver. Up to now (driver version 450.80.02) these specifiers were stable, but with the new GPU driver (495.29.05) they change, which can cause problems when submitting GPU jobs. The new driver version is required to fully support CUDA versions newer than 11.1.x.
Available GPU types and corresponding specifiers for the old and the new driver version
GPU Model | Slurm specifier | GPU per node | GPU memory per GPU | CPU cores per node | System memory per node | CPU cores per GPU | System memory per GPU | Compute capability | Minimal CUDA version required |
---|---|---|---|---|---|---|---|---|---|
NVIDIA GeForce RTX 2080 Ti | rtx_2080_ti | 8 | 11 GiB | 36 | 384 GiB | 4.5 | 48 GiB | 7.5 | 10.0 |
NVIDIA GeForce RTX 2080 Ti | rtx_2080_ti | 8 | 11 GiB | 128 | 512 GiB | 16 | 64 GiB | 7.5 | 10.0 |
NVIDIA GeForce RTX 3090 | rtx_3090 | 8 | 24 GiB | 128 | 512 GiB | 16 | 64 GiB | 8.6 | 11.0 |
NVIDIA GeForce RTX 4090 | rtx_4090 | 8 | 24 GiB | 128 | 512 GiB | 16 | 64 GiB | 8.9 | 11.8 |
NVIDIA TITAN RTX | titan_rtx | 8 | 24 GiB | 128 | 512 GiB | 16 | 64 GiB | 7.5 | 10.0 |
NVIDIA Quadro RTX 6000 | quadro_rtx_6000 | 8 | 24 GiB | 128 | 512 GiB | 8 | 64 GiB | 7.5 | 10.0 |
NVIDIA Tesla V100-SXM2 32 GiB | v100 | 8 | 32 GiB | 48 | 768 GiB | 6 | 96 GiB | 7.0 | 9.0 |
NVIDIA Tesla V100-SXM2 32 GiB | v100 | 8 | 32 GiB | 40 | 512 GiB | 5 | 64 GiB | 7.0 | 9.0 |
NVIDIA Tesla A100 (40 GiB) | a100-pcie-40gb | 8 | 40 GiB | 48 | 768 GiB | 6 | 96 GiB | 8.0 | 11.0 |
NVIDIA Tesla A100 (80 GiB) | a100_80gb | 10 | 80 GiB | 48 | 1024 GiB | 4.8 | 96 GiB | 8.0 | 11.0 |
Targeting GPU nodes with the new driver version
For the time being, there will be a mix of nodes with the old (450.80.02) and the new (495.29.05) version of the GPU driver, until all GPU nodes have been updated (rebooted). If you don't target a particular driver version, your job can run on any GPU node. To target only GPU nodes with the new driver version, use the bsub option:
-R 'select[gpu_driver>460]'
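As a rough sketch, the option can be placed in a complete submission line as shown below. The snippet only prints the command so that it can be checked before use. The helper name `driver_select` and the job script `./my_gpu_job.sh` are hypothetical and not part of the batch system. Any GPU request options that your job needs are not shown here.

```shell
#!/bin/sh
# Build the resource requirement that restricts a job to nodes with the new driver.
# 460 is the threshold used on this page; it lies between the old driver
# version (450.80.02) and the new one (495.29.05).
driver_select() {
    min_driver="${1:-460}"
    printf 'select[gpu_driver>%s]' "$min_driver"
}

# Print the submission command instead of running it.
# ./my_gpu_job.sh is a hypothetical job script.
echo "bsub -R '$(driver_select)' ./my_gpu_job.sh"
```

Once the printed line looks right, it can be copied to the command line as is.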
Potential issues when submitting jobs
If you target nodes with the new GPU driver and request a particular GPU model, please make sure to use the new specifier for that GPU model. If you use the old specifier while targeting nodes with the new GPU driver, your job will stay pending forever, because no node can satisfy both requirements at the same time.
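The following sketch illustrates the difference between a satisfiable and an unsatisfiable request. It assumes that the driver and model conditions are combined in a single select string with the logical and operator (`&&`). The helper name `new_driver_model_select` and the job script `./my_gpu_job.sh` are hypothetical. The command is printed, not submitted.

```shell
#!/bin/sh
# Combine the driver requirement with a GPU model requirement in one select string.
# The model has to be given as the NEW specifier; otherwise no node can
# match both conditions.
new_driver_model_select() {
    model="$1"
    printf 'select[gpu_driver>460 && gpu_model0==%s]' "$model"
}

# Satisfiable: new driver together with the new specifier.
echo "bsub -R \"$(new_driver_model_select NVIDIAGeForceRTX2080Ti)\" ./my_gpu_job.sh"

# Not satisfiable (the job would stay pending): new driver with the old specifier.
# new_driver_model_select GeForceRTX2080Ti
```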
If you don't explicitly depend on the new GPU driver, you can use the logical or operator (||) to request a GPU type using both specifiers (old and new), for instance:
-R "select[(gpu_model0==GeForceRTX2080Ti || gpu_model0==NVIDIAGeForceRTX2080Ti)]"
Please note that this only works if the two specifiers are for the same GPU type.
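If you submit such jobs often, a small helper along the following lines could build the combined expression from the two specifiers. This is only a sketch: the function name `either_model_select` and the job script `./my_gpu_job.sh` are hypothetical, and the helper does not check that both specifiers refer to the same GPU type.

```shell
#!/bin/sh
# Build a select string that accepts either the old or the new specifier
# of ONE GPU type. The caller has to pass two specifiers for the same model.
either_model_select() {
    old_spec="$1"
    new_spec="$2"
    printf 'select[(gpu_model0==%s || gpu_model0==%s)]' "$old_spec" "$new_spec"
}

# Print the submission command instead of running it.
echo "bsub -R \"$(either_model_select GeForceRTX2080Ti NVIDIAGeForceRTX2080Ti)\" ./my_gpu_job.sh"
```

Because the driver version is not constrained here, the job can start on a node with either driver version.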