Difference between revisions of "GPU job submission"


Revision as of 07:42, 9 December 2021



ⓘ Note

You can only use GPUs if you are a member of a shareholder group that has invested in GPU nodes.


Figure: Example of a CPU and GPU system architecture. The cluster contains several different system architectures.

To use GPUs for a job, you need to request the ngpus_excl_p resource. It refers to the number of GPUs per node, unlike other resources, which are requested per core.

For example, to run a serial job with one GPU:

$ bsub -R "rusage[ngpus_excl_p=1]" ./my_cuda_program
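The per-node versus per-core distinction can be sketched with a small calculation. This is only an illustration: it assumes that mem is one of the per-core resources (in MB), that all cores of the job land on a single node, and the numbers are made up. The script prints the command rather than submitting it.

```shell
#!/bin/bash
# Illustrative values only; mem is assumed here to be a per-core resource in MB.
cores=4
mem_per_core_mb=2048
gpus_per_node=2

# Per-core resources scale with the core count; ngpus_excl_p does not.
total_mem_mb=$((cores * mem_per_core_mb))
echo "cores=${cores} total_mem_mb=${total_mem_mb} gpus=${gpus_per_node}"

# Print the command instead of submitting it, so it can be checked first.
echo "bsub -n ${cores} -R \"rusage[mem=${mem_per_core_mb},ngpus_excl_p=${gpus_per_node}]\" ./my_cuda_program"
```

With these values the job would ask for four cores and two GPUs in total, not eight GPUs.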

How to select GPU memory

If you know that your job needs more GPU memory than some models provide, e.g., more than 8 GB, you can request that it run only on GPUs with enough memory. Use the gpu_mtotal0 host selection to do this. For example, if you need 10 GB (= 10240 MB) per GPU:

 $ bsub -R "rusage[ngpus_excl_p=1]" -R "select[gpu_mtotal0>=10240]" ./my_cuda_program

This ensures your job will not run on GPUs with less than 10 GB of GPU memory.
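Since gpu_mtotal0 is compared against a value in MB, a small helper can do the conversion. The sketch below is a convenience wrapper, not part of LSF: the function name gpu_mem_select is hypothetical, and it uses the same 1 GB = 1024 MB convention as the example above.

```shell
#!/bin/bash
# Hypothetical helper: build the select[] string for a minimum GPU memory in GB.
gpu_mem_select() {
    local gb="$1"
    # gpu_mtotal0 is compared in MB; 1 GB is taken as 1024 MB, as in the text.
    echo "select[gpu_mtotal0>=$((gb * 1024))]"
}

# Print the command instead of submitting it, so it can be checked first.
echo "bsub -R \"rusage[ngpus_excl_p=1]\" -R \"$(gpu_mem_select 10)\" ./my_cuda_program"
```

Calling gpu_mem_select 10 yields the same select string as the command shown above.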

How to select a GPU model

In some cases it is desirable or necessary to select the GPU model on which your job runs, for example if you know your code runs much faster on a newer model. However, narrowing down the list of allowable GPUs may mean that your job waits longer in the queue.

To select a certain GPU model, add the -R "select[gpu_model0==GPU_MODEL]" resource requirement to bsub:

$ bsub -R "rusage[ngpus_excl_p=1]" -R "select[gpu_model0==GeForceGTX1080]" ./my_cuda_program
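A model requirement and a memory requirement can also be written as one selection expression. The sketch below assumes that the two conditions can be joined with && inside a single select[] section; the function name gpu_select is hypothetical, and the command is printed rather than submitted.

```shell
#!/bin/bash
# Hypothetical helper: build one select[] string from a GPU model and an
# optional minimum GPU memory in MB.
gpu_select() {
    local model="$1"
    local min_mb="${2:-}"
    local expr="gpu_model0==${model}"
    # Append the memory condition only when a minimum was given.
    if [ -n "$min_mb" ]; then
        expr="${expr} && gpu_mtotal0>=${min_mb}"
    fi
    echo "select[${expr}]"
}

# Print the command instead of submitting it, so it can be checked first.
echo "bsub -R \"rusage[ngpus_excl_p=1]\" -R \"$(gpu_select GeForceGTX1080)\" ./my_cuda_program"
```

Without a second argument the helper reproduces the model-only selection used above.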

Although your job can see all GPUs on the node, LSF sets the CUDA_VISIBLE_DEVICES environment variable to the GPUs assigned to the job, and CUDA programs honor this variable.
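To check what a job was actually given, a job script can inspect CUDA_VISIBLE_DEVICES before starting the CUDA program. The following is a minimal sketch that assumes the variable holds a comma-separated list of devices; the function name count_visible_gpus is hypothetical.

```shell
#!/bin/bash
# Count the entries in a comma-separated CUDA_VISIBLE_DEVICES-style value.
count_visible_gpus() {
    local devices="$1"
    if [ -z "$devices" ]; then
        echo 0
        return
    fi
    # Split on commas and count the resulting fields.
    local IFS=','
    set -- $devices
    echo "$#"
}

echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-<unset>}"
echo "GPUs assigned: $(count_visible_gpus "${CUDA_VISIBLE_DEVICES:-}")"
```

Comparing the reported count with the requested ngpus_excl_p value is a quick sanity check at the top of a job script.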


Available GPU node types

{| class="wikitable"
|-
! GPU model !! LSF specifier (GPU driver > 450.80.02) !! Slurm specifier !! GPU memory per GPU !! CPU cores per node !! CPU memory per node
|-
| NVIDIA GeForce GTX 1080 || NVIDIAGeForceGTX1080 || gtx_1080 || 8 GiB || 20 || 256 GiB
|-
| NVIDIA GeForce GTX 1080 Ti || NVIDIAGeForceGTX1080Ti || gtx_1080_ti || 11 GiB || 20 || 256 GiB
|-
| NVIDIA GeForce RTX 2080 Ti || NVIDIAGeForceRTX2080Ti || rtx_2080_ti || 11 GiB || 36 || 384 GiB
|-
| NVIDIA GeForce RTX 2080 Ti || NVIDIAGeForceRTX2080Ti || rtx_2080_ti || 11 GiB || 128 || 512 GiB
|-
| NVIDIA GeForce RTX 3090 || NVIDIAGeForceRTX3090 || rtx_3090 || 24 GiB || 128 || 512 GiB
|-
| NVIDIA TITAN RTX || NVIDIATITANRTX || titan_rtx || 24 GiB || 128 || 512 GiB
|-
| NVIDIA Quadro RTX 6000 || QuadroRTX6000 || quadro_rtx_6000 || 24 GiB || 128 || 512 GiB
|-
| NVIDIA Tesla V100-SXM2 32 GB || TeslaV100_SXM2_32GB || v100 || 32 GiB || 48 || 768 GiB
|-
| NVIDIA Tesla V100-SXM2 32 GB || TeslaV100_SXM2_32GB || v100 || 32 GiB || 40 || 512 GiB
|-
| NVIDIA Tesla A100 (40 GiB) || NVIDIAA100_PCIE_40GB || a100-pcie-40gb || 40 GiB || 48 || 768 GiB
|-
| NVIDIA Tesla A100 (80 GiB) || unavailable || a100_80gb || 80 GiB || 48 || 1024 GiB
|}

Example

Further reading

