Revision as of 08:22, 18 August 2022
Introduction
There are GPU nodes in the Euler cluster. The GPU nodes are reserved exclusively for the shareholder groups that invested in them. Guest users and shareholders that purchased CPU nodes but no GPU nodes cannot use the GPU nodes.
CUDA and cuDNN
The cuDNN versions provided are compiled for a particular CUDA version. We will soon add a table with the compatible versions here.
How to submit a GPU job
All GPUs are configured in Exclusive Process mode. To run a multi-node job, you will need to request span[ptile=XX], with XX being the number of CPU cores per GPU node, which depends on the node type (the node types are listed in the table below).
The LSF batch system has partially integrated support for GPUs. To use GPUs for a job, you need to request the ngpus_excl_p resource. It refers to the number of GPUs per node; this is unlike other resources, which are requested per core.
For example, to run a serial job with one GPU,
bsub -R "rusage[ngpus_excl_p=1]" ./my_cuda_program
or on a full node with all 8 GeForce GTX 1080 Ti GPUs and up to 90 GB of RAM,
bsub -n 20 -R "rusage[mem=4500,ngpus_excl_p=8]" -R "select[gpu_model0==GeForceGTX1080Ti]" ./my_cuda_program
or on two full nodes:
bsub -n 40 -R "rusage[mem=4500,ngpus_excl_p=8]" -R "select[gpu_model0==GeForceGTX1080Ti]" -R "span[ptile=20]" ./my_cuda_program
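Note that the mem value in rusage is requested per core, so the examples above reserve 20 × 4500 MB = 90 GB per node. A minimal sketch of this arithmetic (per_core_mem_mb is a hypothetical helper, not part of LSF):

```shell
# Derive the per-core mem value for rusage[mem=...] from a
# total-memory target per node (all values in MB).
per_core_mem_mb() {
  echo $(( $1 / $2 ))   # total MB per node / CPU cores per node
}
per_core_mem_mb 90000 20   # -> 4500, as used in the examples above
```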
While your jobs will see all GPUs of a node, LSF sets the CUDA_VISIBLE_DEVICES environment variable, which is honored by CUDA programs, so that each job only uses the GPUs assigned to it.
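A job script can inspect this variable to see which GPUs it was assigned. A minimal sketch (show_gpus is a hypothetical helper; the environment assignment stands in for what LSF does on the compute node):

```shell
# Print which GPUs LSF assigned to this job. On a compute node,
# CUDA_VISIBLE_DEVICES is set by LSF before the job starts.
show_gpus() {
  echo "Assigned GPUs: ${CUDA_VISIBLE_DEVICES:-none}"
}
CUDA_VISIBLE_DEVICES=0,1 show_gpus   # -> Assigned GPUs: 0,1
```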
Software with GPU support
On Euler, packages with GPU support are only available in the new software stack. None of the packages in the old software stack on Euler has support for GPUs.
Available GPU node types
Euler
GPU Model | Slurm specifier | GPU per node | GPU memory per GPU | CPU cores per node | System memory per node | Recommended max CPU cores per GPU | Recommended max system memory per GPU |
---|---|---|---|---|---|---|---|
NVIDIA GeForce GTX 1080 | gtx_1080 | 8 | 8 GiB | 20 | 256 GiB | 2 | 32 GiB |
NVIDIA GeForce GTX 1080 Ti | gtx_1080_ti | 8 | 11 GiB | 20 | 256 GiB | 2 | 32 GiB |
NVIDIA GeForce RTX 2080 Ti | rtx_2080_ti | 8 | 11 GiB | 36 | 384 GiB | 4 | 48 GiB |
NVIDIA GeForce RTX 2080 Ti | rtx_2080_ti | 8 | 11 GiB | 128 | 512 GiB | 4 | 48 GiB |
NVIDIA GeForce RTX 3090 | rtx_3090 | 8 | 24 GiB | 128 | 512 GiB | 16 | 64 GiB |
NVIDIA TITAN RTX | titan_rtx | 8 | 24 GiB | 128 | 512 GiB | 16 | 64 GiB |
NVIDIA Quadro RTX 6000 | quadro_rtx_6000 | 8 | 24 GiB | 128 | 512 GiB | 8 | 64 GiB |
NVIDIA Tesla V100-SXM2 32 GiB | v100 | 8 | 32 GiB | 48 | 768 GiB | 6 | 96 GiB |
NVIDIA Tesla V100-SXM2 32 GiB | v100 | 8 | 32 GiB | 40 | 512 GiB | 5 | 64 GiB |
NVIDIA Tesla A100 (40 GiB) | a100-pcie-40gb | 8 | 40 GiB | 48 | 768 GiB | 6 | 96 GiB |
NVIDIA Tesla A100 (80 GiB) | a100_80gb | 10 | 80 GiB | 48 | 1024 GiB | 4 | 96 GiB |
How to select GPU memory
If you know that you will need more memory on a GPU than some models provide, i.e., more than 8 GB, then you can request that your job runs only on GPUs that have enough memory. Use the gpu_mtotal0 host selection to do this. For example, if you need 10 GB (= 10240 MB) per GPU:
[sfux@lo-login-01 ~]$ bsub -R "rusage[ngpus_excl_p=1]" -R "select[gpu_mtotal0>=10240]" ./my_cuda_program
This ensures your job will not run on GPUs with less than 10 GB of GPU memory.
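The gpu_mtotal0 threshold is specified in MB, so a requirement in GB has to be converted first. A minimal sketch of that conversion (gpu_mem_mb is a hypothetical helper; it uses the 1 GB = 1024 MB convention from the example above):

```shell
# Convert a GPU-memory requirement in GB to the MB value
# expected by the gpu_mtotal0 host selection.
gpu_mem_mb() {
  echo $(( $1 * 1024 ))
}
gpu_mem_mb 10   # -> 10240, as in the select[gpu_mtotal0>=10240] example
```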
How to select a GPU model
In some cases it is desirable or necessary to select the GPU model on which your job runs, for example if you know your code runs much faster on a newer model. However, you should consider that by narrowing down the list of allowable GPUs, your job may have to wait longer before it can start.
To select a certain GPU model, add the -R "select[gpu_model0==GPU_MODEL]" resource requirement to bsub,
[sfux@lo-login-01 ~]$ bsub -R "rusage[ngpus_excl_p=1]" -R "select[gpu_model0==GeForceGTX1080]" ./my_cuda_program