Change of GPU specifiers in the batch system

Introduction

The batch system lets you target particular GPU models by using a specifier that is provided by the Nvidia GPU driver. Up to now (driver version 450.80.02) these specifiers were stable, but with the new GPU driver (495.29.05) they change, which can cause problems when submitting GPU jobs. The new driver version is required to fully support CUDA versions newer than 11.1.x.

Available GPU types and corresponding specifiers for the old and the new driver version

| GPU model | Slurm specifier | GPUs per node | GPU memory per GPU | CPU cores per node | System memory per node | CPU cores per GPU | System memory per GPU | Compute capability | Minimal CUDA version required |
|---|---|---|---|---|---|---|---|---|---|
| NVIDIA GeForce RTX 2080 Ti | rtx_2080_ti | 8 | 11 GiB | 36 | 384 GiB | 4.5 | 48 GiB | 7.5 | 10.0 |
| NVIDIA GeForce RTX 2080 Ti | rtx_2080_ti | 8 | 11 GiB | 128 | 512 GiB | 16 | 64 GiB | 7.5 | 10.0 |
| NVIDIA GeForce RTX 3090 | rtx_3090 | 8 | 24 GiB | 128 | 512 GiB | 16 | 64 GiB | 8.6 | 11.0 |
| NVIDIA GeForce RTX 4090 | rtx_4090 | 8 | 24 GiB | 128 | 512 GiB | 16 | 64 GiB | 8.9 | 11.8 |
| NVIDIA TITAN RTX | titan_rtx | 8 | 24 GiB | 128 | 512 GiB | 16 | 64 GiB | 7.5 | 10.0 |
| NVIDIA Quadro RTX 6000 | quadro_rtx_6000 | 8 | 24 GiB | 128 | 512 GiB | 8 | 64 GiB | 7.5 | 10.0 |
| NVIDIA Tesla V100-SXM2 32 GiB | v100 | 8 | 32 GiB | 48 | 768 GiB | 6 | 96 GiB | 7.0 | 9.0 |
| NVIDIA Tesla V100-SXM2 32 GiB | v100 | 8 | 32 GiB | 40 | 512 GiB | 5 | 64 GiB | 7.0 | 9.0 |
| NVIDIA Tesla A100 (40 GiB) | a100-pcie-40gb | 8 | 40 GiB | 48 | 768 GiB | 6 | 96 GiB | 8.0 | 11.0 |
| NVIDIA Tesla A100 (80 GiB) | a100_80gb | 10 | 80 GiB | 48 | 1024 GiB | 4.8 | 96 GiB | 8.0 | 11.0 |

Targeting GPU nodes with the new driver version

For the time being, there will be a mix of nodes with the old (450.80.02) and the new (495.29.05) version of the GPU driver, until all GPU nodes have been updated (rebooted). If you don't target a particular driver version, your job can run on any GPU node. To target GPU nodes with the new driver version, use the bsub option:

-R 'select[gpu_driver>460]'
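
As a rough illustration of where this option goes on the command line, the sketch below builds the select string and prints a complete bsub command without submitting anything. The function name driver_select and the job script my_gpu_job.sh are hypothetical, and the GPU request itself is left out because its syntax is not covered on this page.

```shell
#!/bin/bash
# Build the resource requirement string that selects nodes whose GPU driver
# version is above the given threshold (driver_select is a hypothetical helper).
driver_select() {
    local min_driver="$1"
    printf 'select[gpu_driver>%s]' "${min_driver}"
}

# 460 lies between the old (450.80.02) and the new (495.29.05) driver version.
req="$(driver_select 460)"

# my_gpu_job.sh is a hypothetical job script. The command is only printed;
# remove 'echo' and the inner quotes to actually submit it.
echo bsub -R "'${req}'" ./my_gpu_job.sh
```

Printing the command first is a cheap way to check the quoting before anything reaches the scheduler.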

Potential issues when submitting jobs

If you target nodes with the new GPU driver and request a particular GPU model, please make sure to use the correct specifier for that GPU model. If you use the old specifier while targeting nodes with the new GPU driver, your job will remain pending indefinitely because the two requirements are mutually exclusive.
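
The following sketch contrasts a consistent and an inconsistent combination of the two requirements. It assumes that both conditions can be joined with && inside a single select string, as LSF resource requirement strings generally allow, and it reuses the gpu_model0 resource and the RTX 2080 Ti specifiers that appear elsewhere on this page. The helper name combined_select is hypothetical.

```shell
#!/bin/bash
# Combine a driver requirement and a GPU model requirement in one select string
# (combined_select is a hypothetical helper; '&&' joining is an assumption).
combined_select() {
    local min_driver="$1" model_spec="$2"
    printf 'select[gpu_driver>%s && gpu_model0==%s]' "${min_driver}" "${model_spec}"
}

# Consistent: new driver together with the new-style specifier.
combined_select 460 NVIDIAGeForceRTX2080Ti; echo

# Inconsistent: new driver with the old-style specifier. Per the text above,
# no node can satisfy both conditions, so such a job would stay pending.
combined_select 460 GeForceRTX2080Ti; echo
```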

If you don't explicitly depend on the new GPU driver, you can use the logical OR operator (||) to request a GPU type with both specifiers (old and new), for instance:

-R "select[(gpu_model0==GeForceRTX2080Ti || gpu_model0==NVIDIAGeForceRTX2080Ti)]"

Please note that this only works if the two specifiers are for the same GPU type.
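
To avoid typing both specifiers by hand each time, the OR expression can be generated from the old and new names. This is only a sketch: gpu_model_select and my_gpu_job.sh are hypothetical names, and the caller is responsible for passing two specifiers that refer to the same GPU type.

```shell
#!/bin/bash
# Build a select string that matches either the old or the new specifier
# of the same GPU type (gpu_model_select is a hypothetical helper).
gpu_model_select() {
    local old_spec="$1" new_spec="$2"
    printf 'select[(gpu_model0==%s || gpu_model0==%s)]' "${old_spec}" "${new_spec}"
}

req="$(gpu_model_select GeForceRTX2080Ti NVIDIAGeForceRTX2080Ti)"

# my_gpu_job.sh is a hypothetical job script. The command is only printed;
# remove 'echo' and the escaped quotes to actually submit it.
echo bsub -R "\"${req}\"" ./my_gpu_job.sh
```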