Change of GPU specifiers in the batch system
The batch system lets you target particular GPU models by using a specifier that is provided by the NVIDIA GPU driver. Up to now (driver version 450.80.02), those specifiers were stable, but with the new GPU driver (495.29.05) they change, which can cause issues when submitting GPU jobs. The new driver version is required to fully support CUDA versions newer than 11.1.x.
Available GPU types and corresponding specifiers for the old and the new driver version
|GPU model||LSF specifier (GPU driver > 450.80.02)||Slurm specifier||GPU memory per GPU||CPU cores per node||CPU memory per node|
|NVIDIA GeForce GTX 1080||NVIDIAGeForceGTX1080||unavailable||8 GiB||20||256 GiB|
|NVIDIA GeForce GTX 1080 Ti||NVIDIAGeForceGTX1080Ti||gtx_1080_ti||11 GiB||20||256 GiB|
|NVIDIA GeForce RTX 2080 Ti||NVIDIAGeForceRTX2080Ti||unavailable||11 GiB||36||384 GiB|
|NVIDIA GeForce RTX 2080 Ti||NVIDIAGeForceRTX2080Ti||unavailable||11 GiB||128||512 GiB|
|NVIDIA GeForce RTX 3090||NVIDIAGeForceRTX3090||rtx_3090||24 GiB||128||512 GiB|
|NVIDIA TITAN RTX||NVIDIATITANRTX||unavailable||24 GiB||128||512 GiB|
|NVIDIA Quadro RTX 6000||QuadroRTX6000||unavailable||24 GiB||128||512 GiB|
|NVIDIA Tesla V100-SXM2 32 GiB||TeslaV100_SXM2_32GB||unavailable||32 GiB||48||768 GiB|
|NVIDIA Tesla V100-SXM2 32 GiB||TeslaV100_SXM2_32GB||unavailable||32 GiB||40||512 GiB|
|NVIDIA Tesla A100 (40 GiB)||NVIDIAA100_PCIE_40GB||unavailable||40 GiB||48||768 GiB|
|NVIDIA Tesla A100 (80 GiB)||unavailable||nvidia_a100_80gb_pcie||80 GiB||48||1024 GiB|
Please note that the GPU driver update is a rolling update. For GPU node types where all nodes already have the updated driver version, the old identifier is crossed out in the table above. Don't use crossed-out identifiers: your job will stay pending forever, because LSF cannot find nodes with GPUs that have those identifiers.
Targeting GPU nodes with the new driver version
For the time being, there will be a mix of nodes with the old (450.80.02) and the new (495.29.05) version of the GPU driver, until all GPU nodes have been updated (rebooted). If you don't target a particular driver version, your job can run on any kind of GPU node. If you would like to target GPU nodes with the new driver version, then you can use the bsub option
Potential issues when submitting jobs
If you target nodes with the new GPU driver and request a particular GPU model, please make sure to use the correct specifier for that GPU model. If you use the old specifier while targeting a node with the new GPU driver, your job will stay pending forever, because those two requirements are mutually exclusive.
If you don't explicitly depend on the new GPU driver, you can use the logical OR operator (||) to request a GPU type using both specifiers (old and new), for instance:
-R "select[(gpu_model0==GeForceRTX2080Ti || gpu_model0==NVIDIAGeForceRTX2080Ti)]"
Please note that this only works if the two specifiers are for the same GPU type.
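To avoid typos when combining the two specifiers, the resource string can be generated rather than typed by hand. The following is a minimal sketch in bash, not an official tool: the function name `gpu_select` is hypothetical, and the only things taken from this page are the `select[...]` syntax and the two RTX 2080 Ti specifiers shown above.

```shell
#!/usr/bin/env bash
# gpu_select is a hypothetical helper (not part of LSF): it builds a select
# string that matches either the old or the new specifier of one GPU model.
# Both arguments must refer to the same GPU type.
gpu_select() {
    local old_spec="$1"   # specifier reported by the old driver
    local new_spec="$2"   # specifier reported by the new driver
    printf 'select[(gpu_model0==%s || gpu_model0==%s)]' "$old_spec" "$new_spec"
}

# Print the generated string for the RTX 2080 Ti
echo "$(gpu_select GeForceRTX2080Ti NVIDIAGeForceRTX2080Ti)"
# prints: select[(gpu_model0==GeForceRTX2080Ti || gpu_model0==NVIDIAGeForceRTX2080Ti)]
```

The output can then be passed to bsub in double quotes, for example `-R "$(gpu_select GeForceRTX2080Ti NVIDIAGeForceRTX2080Ti)"`, so that the shell does not interpret `||` or the brackets.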