Change of GPU specifiers in the batch system

Introduction

The batch system lets you target particular GPU models by using a specifier that is provided by the Nvidia GPU driver. Up to now (driver version 450.80.02), those specifiers were consistent and did not change, but with the new GPU driver (495.29.05) they change, which can cause issues when submitting GPU jobs. The new driver version is required to fully support CUDA versions newer than 11.1.x.

Available GPU types and corresponding specifiers for the old and the new driver version

GPU model | LSF specifier (GPU driver > 450.80.02) | Slurm specifier | GPU memory per GPU | CPU cores per node | CPU memory per node
NVIDIA GeForce GTX 1080 | NVIDIAGeForceGTX1080 | unavailable | 8 GiB | 20 | 256 GiB
NVIDIA GeForce GTX 1080 Ti | NVIDIAGeForceGTX1080Ti | gtx_1080_ti | 11 GiB | 20 | 256 GiB
NVIDIA GeForce RTX 2080 Ti | NVIDIAGeForceRTX2080Ti | unavailable | 11 GiB | 36 | 384 GiB
NVIDIA GeForce RTX 2080 Ti | NVIDIAGeForceRTX2080Ti | unavailable | 11 GiB | 128 | 512 GiB
NVIDIA GeForce RTX 3090 | NVIDIAGeForceRTX3090 | rtx_3090 | 24 GiB | 128 | 512 GiB
NVIDIA TITAN RTX | NVIDIATITANRTX | unavailable | 24 GiB | 128 | 512 GiB
NVIDIA Quadro RTX 6000 | QuadroRTX6000 | unavailable | 24 GiB | 128 | 512 GiB
NVIDIA Tesla V100-SXM2 32 GiB | TeslaV100_SXM2_32GB | unavailable | 32 GiB | 48 | 768 GiB
NVIDIA Tesla V100-SXM2 32 GiB | TeslaV100_SXM2_32GB | unavailable | 32 GiB | 40 | 512 GiB
NVIDIA Tesla A100 (40 GiB) | NVIDIAA100_PCIE_40GB | unavailable | 40 GiB | 48 | 768 GiB
NVIDIA Tesla A100 (80 GiB) | unavailable | nvidia_a100_80gb_pcie | 80 GiB | 48 | 1024 GiB

Targeting GPU nodes with the new driver version

For the time being, the cluster will have a mix of nodes with the old (450.80.02) and the new (495.29.05) version of the GPU driver, until all GPU nodes have been updated (rebooted). If you don't target a particular driver version, your job can run on either kind of GPU node. If you would like to target GPU nodes with the new driver version, you can use the bsub option:

-R 'select[gpu_driver>460]'

Potential issues when submitting jobs

If you target nodes with the new GPU driver and request a particular GPU model, please make sure to use the new specifier for that GPU model. If you use the old specifier while targeting nodes with the new GPU driver, your job will stay pending forever, because no node can satisfy both requirements at once.
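A consistent request could be sketched as follows. This is only an illustration under assumptions: it combines the two conditions from this page with LSF's logical and operator (&&) in a single select string, the helper name new_driver_select and the script ./my_gpu_job.sh are hypothetical, and the option that requests the number of GPUs is left out because it is not covered here.

```shell
# Hypothetical helper: build a select string that requires the new driver
# AND the new-style specifier of one GPU model, so both conditions match.
new_driver_select() {
    # $1 = new-style specifier, e.g. NVIDIAGeForceRTX2080Ti
    printf 'select[gpu_driver>460 && gpu_model0==%s]' "$1"
}

# Print the command instead of running it, so it can be checked first.
# ./my_gpu_job.sh is a placeholder for your own job script.
echo bsub -R "\"$(new_driver_select NVIDIAGeForceRTX2080Ti)\"" ./my_gpu_job.sh
```

Printing the command first makes it easy to verify that the specifier matches the one listed for the new driver before the job is actually submitted.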

If you don't explicitly depend on the new GPU driver, you can use the logical or operator (||) to request a GPU type with both specifiers (old and new), for instance:

-R "select[(gpu_model0==GeForceRTX2080Ti || gpu_model0==NVIDIAGeForceRTX2080Ti)]"

Please note that this only works if the two specifiers are for the same GPU type.
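To avoid typing both specifiers by hand, the select string could be generated by a small shell function. This is a minimal sketch; the function name either_specifier_select is hypothetical, and you remain responsible for passing the old and the new specifier of the same GPU model.

```shell
# Hypothetical helper: build a select string that accepts either the
# old-style or the new-style specifier of the SAME GPU model.
either_specifier_select() {
    # $1 = old specifier, $2 = new specifier
    printf 'select[(gpu_model0==%s || gpu_model0==%s)]' "$1" "$2"
}

# Show the resulting string for the RTX 2080 Ti example above.
either_specifier_select GeForceRTX2080Ti NVIDIAGeForceRTX2080Ti
echo
```

The output can then be passed to bsub with -R, enclosed in double quotes so that the shell does not interpret the brackets and the || operator.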