Getting started with GPUs


The Euler cluster contains GPU nodes. They are reserved exclusively for the shareholder groups that invested in them. Guest users and shareholders that purchased CPU nodes but no GPU nodes cannot use the GPU nodes.

CUDA and cuDNN

Each cuDNN version provided is compiled for a particular CUDA version. A table of compatible versions will be added here soon.

How to submit a GPU job

All GPUs are configured in Exclusive Process mode. To run a multi-node job, you need to request span[ptile=XX], where XX is the number of CPU cores per GPU node, which depends on the node type (the node types are listed in the table below).

The LSF batch system has partially integrated support for GPUs. To use GPUs in a job, you need to request the ngpus_excl_p resource, which specifies the number of GPUs per node. This is unlike other resources, which are requested per core.

For example, to run a serial job with one GPU,

bsub -R "rusage[ngpus_excl_p=1]" ./my_cuda_program

or on a full node with all 8 GeForce GTX 1080 Ti GPUs and up to 90 GB of RAM,

bsub -n 20 -R "rusage[mem=4500,ngpus_excl_p=8]" -R "select[gpu_model0==GeForceGTX1080Ti]" ./my_cuda_program

or on two full nodes:

bsub -n 40 -R "rusage[mem=4500,ngpus_excl_p=8]" -R "select[gpu_model0==GeForceGTX1080Ti]" -R "span[ptile=20]" ./my_cuda_program
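The arithmetic behind the multi-node example can be sketched as a small shell helper. The function name multi_node_bsub and its argument order are hypothetical and chosen for illustration only; the helper just prints the command instead of submitting it, and it leaves out the GPU model selection for brevity.

```shell
#!/bin/sh
# Hypothetical helper: print a bsub command for a job spanning full GPU nodes.
# Arguments: number of nodes, CPU cores per node, GPUs per node, memory per core (MB).
multi_node_bsub() (
    nodes="$1"
    cores_per_node="$2"
    gpus_per_node="$3"
    mem_per_core_mb="$4"

    # -n takes the total core count; ptile keeps cores_per_node cores on each node.
    total_cores=$((nodes * cores_per_node))

    # ngpus_excl_p is requested per node, mem is requested per core.
    printf 'bsub -n %d -R "rusage[mem=%d,ngpus_excl_p=%d]" -R "span[ptile=%d]" ./my_cuda_program\n' \
        "$total_cores" "$mem_per_core_mb" "$gpus_per_node" "$cores_per_node"
)

# Two full nodes with 20 cores and 8 GPUs each, 4500 MB per core.
multi_node_bsub 2 20 8 4500

# Memory requested per node: 20 cores * 4500 MB = 90000 MB.
echo "Memory per node: $((20 * 4500)) MB"
```

Printing the command first makes it easy to check the core count and ptile value before actually submitting.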

While your jobs will see all GPUs, LSF will set the CUDA_VISIBLE_DEVICES environment variable, which is honored by CUDA programs.
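To check which GPUs were assigned to a job, a job script could inspect CUDA_VISIBLE_DEVICES before starting the actual program. The following is a minimal sketch; the function name count_visible_gpus is hypothetical, and it assumes the variable holds a comma-separated list of device identifiers.

```shell
#!/bin/sh
# Hypothetical helper: count the entries in a CUDA_VISIBLE_DEVICES-style
# string, assumed to be a comma-separated list such as "0,3".
count_visible_gpus() (
    devices="$1"
    if [ -z "$devices" ]; then
        # Empty or unset: no GPU is listed.
        echo 0
    else
        # Split on commas and count the resulting fields.
        IFS=,
        set -- $devices
        echo "$#"
    fi
)

echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-<unset>}"
echo "GPUs assigned to this job: $(count_visible_gpus "${CUDA_VISIBLE_DEVICES:-}")"
```

Comparing this count with the requested ngpus_excl_p value is a quick sanity check at the start of a job.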

Software with GPU support

On Euler, packages with GPU support are only available in the new software stack. None of the packages in the old software stack on Euler has support for GPUs.

Available GPU node types


GPU model | Specifier (GPU driver <= 450.80.02) | Specifier (GPU driver > 450.80.02) | GPU memory per GPU | CPU cores per node | CPU memory per node
NVIDIA GeForce GTX 1080 | GeForceGTX1080 | NVIDIAGeForceGTX1080 | 8 GiB | 20 | 256 GiB
NVIDIA GeForce GTX 1080 Ti | GeForceGTX1080Ti | NVIDIAGeForceGTX1080Ti | 11 GiB | 20 | 256 GiB
NVIDIA GeForce RTX 2080 Ti | GeForceRTX2080Ti | NVIDIAGeForceRTX2080Ti | 11 GiB | 36 | 384 GiB
NVIDIA GeForce RTX 2080 Ti | GeForceRTX2080Ti | NVIDIAGeForceRTX2080Ti | 11 GiB | 128 | 512 GiB
NVIDIA GeForce RTX 3090 | - | NVIDIAGeForceRTX3090 | 24 GiB | 128 | 512 GiB
NVIDIA Quadro RTX 6000 | QuadroRTX6000 | QuadroRTX6000 | 24 GiB | 128 | 512 GiB
NVIDIA Tesla V100-SXM2 32 GB | TeslaV100_SXM2_32GB | TeslaV100_SXM2_32GB | 32 GiB | 48 | 768 GiB
NVIDIA Tesla V100-SXM2 32 GB | TeslaV100_SXM2_32GB | TeslaV100_SXM2_32GB | 32 GiB | 40 | 512 GiB
NVIDIA Tesla A100 | A100_PCIE_40GB | NVIDIAA100_PCIE_40GB | 40 GiB | 48 | 768 GiB

Please note that the GPU driver is being updated in a rolling fashion. For GPU node types where all nodes already have the updated driver version, the old identifier is crossed out in the table above. Do not use crossed-out identifiers: your job would pend forever because LSF cannot find nodes with GPUs that have those identifiers.
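The driver threshold from the table can be expressed as a small version comparison. This is only a sketch under assumptions: the helper names version_gt and pick_specifier are hypothetical, and it assumes you already know the driver version of the target nodes (the value 470.57.02 below is just an example input).

```shell
#!/bin/sh
# Hypothetical helper: succeed if dotted version $1 is strictly greater than $2.
version_gt() (
    a="$1"
    b="$2"
    while [ -n "$a" ] || [ -n "$b" ]; do
        # Leading dotted field of each version; missing fields count as 0.
        fa=${a%%.*}
        fb=${b%%.*}
        fa=${fa:-0}
        fb=${fb:-0}
        # Drop the field that was just read.
        case "$a" in *.*) a=${a#*.} ;; *) a="" ;; esac
        case "$b" in *.*) b=${b#*.} ;; *) b="" ;; esac
        if [ "$fa" -gt "$fb" ]; then exit 0; fi
        if [ "$fa" -lt "$fb" ]; then exit 1; fi
    done
    # All fields equal: not strictly greater.
    exit 1
)

# Hypothetical helper: choose between the old and new specifier of a GPU model.
# Arguments: driver version, old specifier, new specifier.
pick_specifier() (
    if version_gt "$1" 450.80.02; then
        echo "$3"
    else
        echo "$2"
    fi
)

# Example input only: a driver newer than 450.80.02 selects the new specifier.
pick_specifier 470.57.02 GeForceGTX1080Ti NVIDIAGeForceGTX1080Ti
```

Because the update is rolling, nodes of the same type may temporarily report different identifiers, so the result only applies to nodes that run the given driver version.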

How to select GPU memory

If you know that you will need more memory on a GPU than some models provide, e.g., more than 8 GB, you can request that your job runs only on GPUs that have enough memory. Use the gpu_mtotal0 host selection to do this. For example, if you need 10 GB (= 10240 MB) per GPU:

 [sfux@lo-login-01 ~]$ bsub -R "rusage[ngpus_excl_p=1]" -R "select[gpu_mtotal0>=10240]" ./my_cuda_program

This ensures your job will not run on GPUs with less than 10 GB of GPU memory.
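The conversion from GB to the MB value expected by gpu_mtotal0 can be scripted. The sketch below uses a hypothetical function name, gpu_mem_select, takes whole GB values only, and prints the resulting command rather than submitting it.

```shell
#!/bin/sh
# Hypothetical helper: build a gpu_mtotal0 selection string from a
# GPU memory requirement given in whole GB (1 GB = 1024 MB here).
gpu_mem_select() (
    gb="$1"
    mb=$((gb * 1024))
    printf 'select[gpu_mtotal0>=%d]\n' "$mb"
)

# 10 GB per GPU becomes gpu_mtotal0>=10240.
echo "bsub -R \"rusage[ngpus_excl_p=1]\" -R \"$(gpu_mem_select 10)\" ./my_cuda_program"
```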

How to select a GPU model

In some cases it is desirable or necessary to select the GPU model on which your job runs, for example if you know your code runs much faster on a newer model. However, keep in mind that narrowing down the list of allowable GPUs may make your job wait longer.

To select a certain GPU model, add the -R "select[gpu_model0==GPU_MODEL]" resource requirement to bsub:

[sfux@lo-login-01 ~]$ bsub -R "rusage[ngpus_excl_p=1]" -R "select[gpu_model0==GeForceGTX1080]" ./my_cuda_program
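Since a job with an unknown specifier never starts, it can help to check the specifier against the table of GPU node types before submitting. This is a minimal sketch; the function names is_known_gpu_specifier and model_select are hypothetical, and the list is copied from the table on this page, so it needs to be updated whenever the table changes.

```shell
#!/bin/sh
# Hypothetical helper: succeed if $1 is one of the specifiers listed in the
# table of GPU node types on this page.
is_known_gpu_specifier() {
    case "$1" in
        GeForceGTX1080|NVIDIAGeForceGTX1080|\
        GeForceGTX1080Ti|NVIDIAGeForceGTX1080Ti|\
        GeForceRTX2080Ti|NVIDIAGeForceRTX2080Ti|\
        NVIDIAGeForceRTX3090|QuadroRTX6000|TeslaV100_SXM2_32GB|\
        A100_PCIE_40GB|NVIDIAA100_PCIE_40GB)
            return 0 ;;
        *)
            return 1 ;;
    esac
}

# Hypothetical helper: print the selection string for a known specifier,
# or fail with a message on stderr for an unknown one.
model_select() (
    if is_known_gpu_specifier "$1"; then
        printf 'select[gpu_model0==%s]\n' "$1"
    else
        echo "unknown GPU specifier: $1" >&2
        exit 1
    fi
)

# Print the command for a single GeForce GTX 1080 GPU.
echo "bsub -R \"rusage[ngpus_excl_p=1]\" -R \"$(model_select GeForceGTX1080)\" ./my_cuda_program"
```

The check only catches typos; it does not know which identifiers are currently valid on the cluster during the rolling driver update.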