Change of GPU specifiers in the batch system

Revision as of 07:18, 16 December 2021 by Sfux

Introduction

The batch system lets you target particular GPU models by using a specifier that is provided by the NVIDIA GPU driver. Up to driver version 450.80.02, these specifiers were stable, but with the new GPU driver (495.29.05) some of them change, which can cause problems when submitting GPU jobs. The new driver version is required to fully support CUDA versions newer than 11.1.x.

Available GPU types and corresponding specifiers for the old and the new driver version

GPU Model Specifier (GPU driver <= 450.80.02) Specifier (GPU driver > 450.80.02) GPU memory per GPU CPU cores per node CPU memory per node
NVIDIA GeForce GTX 1080 GeForceGTX1080 8 GiB 20 256 GiB
NVIDIA GeForce GTX 1080 Ti GeForceGTX1080Ti NVIDIAGeForceRTX1080Ti 11 GiB 20 256 GiB
NVIDIA GeForce RTX 2080 Ti GeForceRTX2080Ti NVIDIAGeForceRTX2080Ti 11 GiB 36 384 GiB
NVIDIA GeForce RTX 2080 Ti GeForceRTX2080Ti NVIDIAGeForceRTX2080Ti 11 GiB 128 512 GiB
NVIDIA TITAN RTX TITANRTX NVIDIATITANRTX 24 GiB 128 512 GiB
NVIDIA Quadro RTX 6000 QuadroRTX6000 QuadroRTX6000 24 GiB 128 512 GiB
NVIDIA Tesla V100-SXM2 32 GB TeslaV100_SXM2_32GB 32 GiB 48 768 GiB
NVIDIA Tesla V100-SXM2 32 GB TeslaV100_SXM2_32GB TeslaV100_SXM2_32GB 32 GiB 40 512 GiB
NVIDIA Tesla A100 A100_PCIE_40GB 40 GiB 48 768 GiB

Please note that the GPU driver is being rolled out as a rolling update. For GPU node types where all nodes already have the updated driver version, the old specifier is crossed out in the table above. Do not use crossed-out specifiers: your job would stay pending forever, because LSF cannot find any nodes with GPUs that have those specifiers.

Targeting GPU nodes with the new driver version

For the time being, there will be a mix of nodes with the old (450.80.02) and the new (495.29.05) version of the GPU driver, until all GPU nodes have been updated (rebooted). If you do not target a particular driver version, your job can run on any kind of GPU node. To target GPU nodes with the new driver version, use the bsub option

-R 'select[gpu_driver>460]'
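As a rough illustration of where this option fits in a full submission command, the following sketch composes a bsub command line and prints it instead of submitting it. Only the select[gpu_driver>460] part is taken from this page. The GPU request rusage[ngpus_excl_p=1] and the program name ./my_gpu_program are assumptions made for the example, and the helper functions driver_select and build_bsub_cmd are hypothetical names, so please adapt them to your own job.

```shell
#!/bin/bash
# Dry run: compose a bsub command line and print it instead of submitting.

# Return the resource requirement that selects nodes whose GPU driver
# version is greater than the given threshold (this page uses 460).
driver_select() {
    printf 'select[gpu_driver>%s]' "$1"
}

# Compose the full command line.
# Assumptions (not taken from this page): the GPU request
# "rusage[ngpus_excl_p=1]" and the placeholder program passed as $2.
build_bsub_cmd() {
    printf "bsub -n 1 -R 'rusage[ngpus_excl_p=1]' -R '%s' %s" \
        "$(driver_select "$1")" "$2"
}

# Print the command so that it can be inspected before it is used.
build_bsub_cmd 460 ./my_gpu_program
echo
```

Printing the command first makes it easy to check the quoting of the resource requirement strings before a job is actually submitted.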

Potential issues when submitting jobs

If you target nodes with the new GPU driver and request a particular GPU model, please make sure to use the specifier that is listed for the new driver version. If you use the old specifier while targeting a node with the new GPU driver, your job will stay pending forever, because the two requirements are mutually exclusive.
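One way to avoid such a mismatch is to derive the specifier from the driver requirement rather than typing both by hand. The sketch below maps a few old specifiers to their new counterparts, using only pairs that appear in the table on this page, and builds a matching select string. The resource name gpu_model0 is an assumption about how the GPU model is requested, and new_specifier and select_new_driver are hypothetical helper names.

```shell
#!/bin/bash
# Map an old-driver specifier to the specifier used with the new driver.
# Only a subset of the table on this page is covered here.
new_specifier() {
    case "$1" in
        GeForceRTX2080Ti)    echo NVIDIAGeForceRTX2080Ti ;;
        TITANRTX)            echo NVIDIATITANRTX ;;
        QuadroRTX6000)       echo QuadroRTX6000 ;;        # unchanged
        TeslaV100_SXM2_32GB) echo TeslaV100_SXM2_32GB ;;  # unchanged
        *) echo "unknown specifier: $1" >&2; return 1 ;;
    esac
}

# Build a select string in which the driver requirement and the model
# specifier agree with each other.
# Assumption: gpu_model0 is the resource that holds the GPU model.
select_new_driver() {
    local spec
    spec=$(new_specifier "$1") || return 1
    printf 'select[gpu_driver>460 && gpu_model0==%s]' "$spec"
}

# Print the resulting resource requirement for one model.
select_new_driver TITANRTX
echo
```

The printed string could then be passed to bsub with the -R option. An unknown specifier makes the function fail, so no inconsistent request is produced.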