− | <font color=" | + | |
+ | <div class="button" style=" width: 95%; text-align:left; background: #FFFFCE; border-radius: 5px; padding-left: 15px;"> | ||
+ | <p style="font-weight: bold; size:5">ⓘ Note</p> | ||
+ | |||
+ | <font color="#3A3B3C" size="4"> You can only use GPUs if you are a member of a shareholder group that invested into GPU nodes</font> | ||
+ | </div> | ||
+ | |||
+ | |||
+ | <table style="width: 100%;"> | ||
+ | <tr valign=top> | ||
+ | <td style="width: 23%; text-align:left"> | ||
+ | [[File:Cpu_gpu_system_arch.png|320px]] | ||
+ | |||
+ | <font color="#3A3B3C" size="1"> Figure: Here is an example of CPU & GPU system architecture. There are several system architectures on the cluster. </font> | ||
+ | </td> | ||
+ | <td style="width: 2%;"> | ||
+ | </td> | ||
+ | <td style="width: 75%;"> | ||
To use GPUs in a job, you need to request the '''ngpus_excl_p''' resource. It refers to the number of GPUs '''per node'''. This is unlike other resources, which are requested '''per core'''.

For example, to run a serial job with one GPU:

 $ bsub '''-R "rusage[ngpus_excl_p=1]"''' ./my_cuda_program
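
Because '''ngpus_excl_p''' is counted per node, it combines with the usual per-core requests. As a hedged sketch (the core count, GPU count, and program name <tt>./my_multi_gpu_program</tt> are placeholders):

 $ bsub -n 4 -R "span[ptile=4]" -R "rusage[ngpus_excl_p=2]" ./my_multi_gpu_program

Here <tt>-n 4</tt> requests four cores, <tt>span[ptile=4]</tt> keeps them on a single node, and the job gets two GPUs of that node.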
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
== How to select GPU memory ==
If you know that you will need more memory on a GPU than some models provide, <em>i.e.,</em> more than 8 GB, then you can request that your job runs only on GPUs that have enough memory. Use the <tt>gpu_mtotal0</tt> host selection to do this. For example, if you need 10 GB (= 10240 MB) per GPU:

 $ bsub -R "rusage[ngpus_excl_p=1]" '''-R "select[gpu_mtotal0>=10240]"''' ./my_cuda_program

This ensures that your job will not run on GPUs with less than 10 GB of GPU memory.
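
Note that GPU memory is separate from the node's CPU memory, which is still requested per core via the usual <tt>mem</tt> resource. A hedged sketch combining the two (the value of 8192 MB per core is only an example):

 $ bsub -R "rusage[ngpus_excl_p=1,mem=8192]" -R "select[gpu_mtotal0>=10240]" ./my_cuda_program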
== How to select a GPU model ==

In some cases it is desirable or necessary to select the GPU model on which your job runs, for example if you know that your code runs much faster on a newer model. However, you should consider that by narrowing down the list of allowable GPUs, your job may need to wait for a longer time.
To select a certain GPU model, add the <tt>-R "select[gpu_model0==GPU_MODEL]"</tt> resource requirement to bsub:

 $ bsub -R "rusage[ngpus_excl_p=1]" '''-R "select[gpu_model0==GeForceGTX1080]"''' ./my_cuda_program

While your jobs will see all GPUs of a node, LSF will set the [https://devblogs.nvidia.com/parallelforall/cuda-pro-tip-control-gpu-visibility-cuda_visible_devices/ CUDA_VISIBLE_DEVICES] environment variable, which is honored by CUDA programs, so that only the GPUs allocated to your job are used. A quick way to verify this is shown below.
</td>
</tr>
</table>
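
To check which GPU(s) your job was given, a minimal sketch is to print the variable from inside a job. This assumes interactive jobs (<tt>-I</tt>) are permitted on your cluster; the single quotes keep the login shell from expanding the variable before submission:

 $ bsub -I -R "rusage[ngpus_excl_p=1]" 'echo $CUDA_VISIBLE_DEVICES'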

== Available GPU node types ==

{| class="wikitable"
! GPU model !! Specifier (GPU driver <= 450.80.02) !! Specifier (GPU driver > 450.80.02) !! GPU memory per GPU !! CPU cores per node !! CPU memory per node
|-
| NVIDIA GeForce GTX 1080 || || NVIDIAGeForceGTX1080 || 8 GiB || 20 || 256 GiB
|-
| NVIDIA GeForce GTX 1080 Ti || || NVIDIAGeForceGTX1080Ti || 11 GiB || 20 || 256 GiB
|-
| NVIDIA GeForce RTX 2080 Ti || || NVIDIAGeForceRTX2080Ti || 11 GiB || 36 || 384 GiB
|-
| NVIDIA GeForce RTX 2080 Ti || || NVIDIAGeForceRTX2080Ti || 11 GiB || 128 || 512 GiB
|-
| NVIDIA TITAN RTX || || NVIDIATITANRTX || 24 GiB || 128 || 512 GiB
|-
| NVIDIA Quadro RTX 6000 || || QuadroRTX6000 || 24 GiB || 128 || 512 GiB
|-
| NVIDIA Tesla V100-SXM2 32 GB || || TeslaV100_SXM2_32GB || 32 GiB || 48 || 768 GiB
|-
| NVIDIA Tesla V100-SXM2 32 GB || || TeslaV100_SXM2_32GB || 32 GiB || 40 || 512 GiB
|-
| NVIDIA Tesla A100 || || NVIDIAA100_PCIE_40GB || 40 GiB || 48 || 768 GiB
|}

Please note that the update of the GPU driver is a rolling update. For GPU node types where all nodes already have the updated driver version, the old identifier is crossed out in the table above. Do not use crossed-out identifiers: your job would stay pending forever, because LSF cannot find nodes with GPUs matching those identifiers.
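
On node types where the driver has already been updated, only the new-style identifier will match. For example, to request one of the RTX 2080 Ti nodes using the specifier from the right-hand specifier column above:

 $ bsub -R "rusage[ngpus_excl_p=1]" -R "select[gpu_model0==NVIDIAGeForceRTX2080Ti]" ./my_cuda_program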
== Example ==

* [[Neural network training with TensorFlow on GPU | Deep learning with TensorFlow on GPU]]
== Further reading ==

* [[Getting started with GPUs]]