Getting started with GPUs

From ScientificComputing

Latest revision as of 14:14, 31 May 2022

Introduction

There are GPU nodes in the Euler cluster. The GPU nodes are reserved exclusively for the shareholder groups that invested in them. Guest users and shareholders that purchased CPU nodes but no GPU nodes cannot use the GPU nodes.

CUDA and cuDNN

Each cuDNN version we provide is compiled for a particular CUDA version. We will soon add a table of compatible versions here.

How to submit a GPU job

All GPUs are configured in Exclusive Process mode. To run a multi-node job, you need to request span[ptile=XX], where XX is the number of CPU cores per GPU node. This number depends on the node type (the node types are listed in the table below).

The LSF batch system has partially integrated support for GPUs. To use GPUs in a job, you need to request the ngpus_excl_p resource. It refers to the number of GPUs per node, unlike other resources such as mem, which are requested per core.
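The following small Python sketch makes the per-node versus per-core distinction concrete. It only does the arithmetic. The function name job_totals is hypothetical, and the sketch assumes that the job fills every node completely, i.e. that span[ptile=...] equals the number of CPU cores per node.

```python
import math


def job_totals(n_cores, mem_per_core_mb, gpus_per_node, cores_per_node):
    """Estimate what an LSF GPU request adds up to (hypothetical helper).

    rusage[mem=...] is counted per core, while ngpus_excl_p is counted
    per node, so the two totals scale differently with the job size.
    """
    # Number of nodes, assuming each node is filled with cores_per_node cores.
    nodes = math.ceil(n_cores / cores_per_node)
    total_mem_mb = n_cores * mem_per_core_mb  # per-core resource
    total_gpus = nodes * gpus_per_node        # per-node resource
    return nodes, total_mem_mb, total_gpus


# One full GTX 1080 Ti node: -n 20, mem=4500, ngpus_excl_p=8
print(job_totals(20, 4500, 8, 20))  # (1, 90000, 8)
# Two full nodes: -n 40 with span[ptile=20]
print(job_totals(40, 4500, 8, 20))  # (2, 180000, 16)
```

Doubling the number of cores doubles both the memory and the number of GPUs here. The reason differs for each: memory grows because there are more cores, and GPUs grow because there are more nodes.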

For example, to run a serial job with one GPU,

bsub -R "rusage[ngpus_excl_p=1]" ./my_cuda_program

or on a full node with all 8 GeForce GTX 1080 Ti GPUs and up to 90 GB of RAM,

bsub -n 20 -R "rusage[mem=4500,ngpus_excl_p=8]" -R "select[gpu_model0==GeForceGTX1080Ti]" ./my_cuda_program

or on two full nodes:

bsub -n 40 -R "rusage[mem=4500,ngpus_excl_p=8]" -R "select[gpu_model0==GeForceGTX1080Ti]" -R "span[ptile=20]" ./my_cuda_program
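If you submit many multi-node jobs, it can help to assemble the bsub arguments in a script rather than typing them by hand. The following is a minimal sketch, not an official tool. full_node_bsub is a hypothetical helper that reproduces the pattern of the commands above. shlex.join requires Python 3.8 or later.

```python
import shlex


def full_node_bsub(nodes, cores_per_node, gpus_per_node, mem_per_core_mb,
                   gpu_model, program):
    """Build a bsub argument list requesting `nodes` full GPU nodes (hypothetical helper)."""
    cmd = [
        "bsub",
        "-n", str(nodes * cores_per_node),  # total number of cores
        "-R", f"rusage[mem={mem_per_core_mb},ngpus_excl_p={gpus_per_node}]",
        "-R", f"select[gpu_model0=={gpu_model}]",
    ]
    if nodes > 1:
        # Multi-node jobs: place exactly cores_per_node cores on each node.
        cmd += ["-R", f"span[ptile={cores_per_node}]"]
    cmd.append(program)
    return cmd


args = full_node_bsub(2, 20, 8, 4500, "GeForceGTX1080Ti", "./my_cuda_program")
print(shlex.join(args))
# bsub -n 40 -R 'rusage[mem=4500,ngpus_excl_p=8]' -R 'select[gpu_model0==GeForceGTX1080Ti]' -R 'span[ptile=20]' ./my_cuda_program
```

shlex.join quotes the bracketed resource strings with single quotes. The shell treats them the same way as the double quotes used above. You could also pass the list directly to subprocess.run to submit the job.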

Although your job can see all GPUs on a node, LSF sets the CUDA_VISIBLE_DEVICES environment variable to the GPUs allocated to the job, and CUDA programs honor this variable.
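Inside a job, a program can check which GPUs LSF assigned to it by reading CUDA_VISIBLE_DEVICES. Below is a minimal Python sketch. The helper name allocated_gpus and the sample value "0,3" are hypothetical. The sketch assumes the variable holds a comma-separated list of device identifiers, which is the usual CUDA format. CUDA renumbers the visible devices starting from 0, so a program that sees "0,3" addresses these two GPUs as devices 0 and 1.

```python
import os


def allocated_gpus(environ=None):
    """Return the GPU identifiers listed in CUDA_VISIBLE_DEVICES (hypothetical helper).

    Returns None if the variable is unset (e.g. outside a GPU job) and
    an empty list if it is set but empty.
    """
    if environ is None:
        environ = os.environ
    value = environ.get("CUDA_VISIBLE_DEVICES")
    if value is None:
        return None
    # Comma-separated entries; strip whitespace and drop empty items.
    return [item.strip() for item in value.split(",") if item.strip()]


print(allocated_gpus({"CUDA_VISIBLE_DEVICES": "0,3"}))  # ['0', '3']
print(allocated_gpus())  # whatever the current environment provides
```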

Software with GPU support

On Euler, packages with GPU support are only available in the new software stack. None of the packages in the old software stack on Euler has support for GPUs.

Available GPU node types

Euler

GPU model | Specifier (GPU driver <= 450.80.02) | Specifier (GPU driver > 450.80.02) | GPU memory per GPU | CPU cores per node | CPU memory per node
NVIDIA GeForce GTX 1080 | GeForceGTX1080 | NVIDIAGeForceGTX1080 | 8 GiB | 20 | 256 GiB
NVIDIA GeForce GTX 1080 Ti | GeForceGTX1080Ti | NVIDIAGeForceGTX1080Ti | 11 GiB | 20 | 256 GiB
NVIDIA GeForce RTX 2080 Ti | GeForceRTX2080Ti | NVIDIAGeForceRTX2080Ti | 11 GiB | 36 | 384 GiB
NVIDIA GeForce RTX 2080 Ti | GeForceRTX2080Ti | NVIDIAGeForceRTX2080Ti | 11 GiB | 128 | 512 GiB
NVIDIA GeForce RTX 3090 | (none) | NVIDIAGeForceRTX3090 | 24 GiB | 128 | 512 GiB
NVIDIA TITAN RTX | TITANRTX | NVIDIATITANRTX | 24 GiB | 128 | 512 GiB
NVIDIA Quadro RTX 6000 | QuadroRTX6000 | QuadroRTX6000 | 24 GiB | 128 | 512 GiB
NVIDIA Tesla V100-SXM2 32 GB | TeslaV100_SXM2_32GB | TeslaV100_SXM2_32GB | 32 GiB | 48 | 768 GiB
NVIDIA Tesla V100-SXM2 32 GB | TeslaV100_SXM2_32GB | TeslaV100_SXM2_32GB | 32 GiB | 40 | 512 GiB
NVIDIA Tesla A100 | A100_PCIE_40GB | NVIDIAA100_PCIE_40GB | 40 GiB | 48 | 768 GiB

Please note that the GPU driver is being updated as a rolling update. For GPU node types where all nodes already have the updated driver version, the old identifier is crossed out in the table above. Do not use crossed-out identifiers: LSF cannot find any nodes with GPUs that have those identifiers, so your job would stay pending forever.
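The two specifier columns differ only in the GPU driver version they apply to. You might want to check which column matches a given driver version string, for example one reported on a node by nvidia-smi --query-gpu=driver_version --format=csv,noheader. In that case, compare the versions numerically rather than as strings. Here is a sketch under that assumption. The function names are hypothetical, and the cutoff is taken from the table header.

```python
def driver_version_tuple(version):
    """Turn a driver version string such as '450.80.02' into a tuple of ints."""
    return tuple(int(part) for part in version.strip().split("."))


# Cutoff between the two specifier columns of the table above.
SPECIFIER_CUTOFF = driver_version_tuple("450.80.02")


def specifier_column(driver_version):
    """Return which specifier column applies to a GPU driver version (hypothetical helper)."""
    if driver_version_tuple(driver_version) <= SPECIFIER_CUTOFF:
        return "old"  # e.g. GeForceGTX1080
    return "new"      # e.g. NVIDIAGeForceGTX1080


for v in ["418.67", "450.80.02", "450.102.04", "460.32.03"]:
    print(v, specifier_column(v))
```

Comparing tuples of integers correctly orders "450.102.04" after "450.80.02". A plain string comparison would put it before.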

How to select GPU memory

If you know that your job needs more GPU memory than some models provide, i.e., more than 8 GB, you can request that it runs only on GPUs with enough memory. Use the gpu_mtotal0 selection string to do this. For example, if you need 10 GB (= 10240 MB) per GPU:

 [sfux@lo-login-01 ~]$ bsub -R "rusage[ngpus_excl_p=1]" -R "select[gpu_mtotal0>=10240]" ./my_cuda_program

This ensures your job will not run on GPUs with less than 10 GB of GPU memory.
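The value compared with gpu_mtotal0 is in MB, so 10 GB becomes 10 × 1024 = 10240 MB. The sketch below builds the selection string. It also uses the GPU memory column of the table above to list which models would qualify. The names GPU_MEMORY_GIB, mtotal_select and eligible_models are hypothetical. The memory that LSF reports for a GPU may be somewhat below its nominal size, so it is safer not to set the threshold exactly at a model's nominal capacity.

```python
# Nominal GPU memory per GPU in GiB, copied from the table above.
GPU_MEMORY_GIB = {
    "NVIDIA GeForce GTX 1080": 8,
    "NVIDIA GeForce GTX 1080 Ti": 11,
    "NVIDIA GeForce RTX 2080 Ti": 11,
    "NVIDIA GeForce RTX 3090": 24,
    "NVIDIA TITAN RTX": 24,
    "NVIDIA Quadro RTX 6000": 24,
    "NVIDIA Tesla V100-SXM2 32 GB": 32,
    "NVIDIA Tesla A100": 40,
}


def mtotal_select(min_gb):
    """Build the select string; gpu_mtotal0 is compared in MB (1 GB = 1024 MB)."""
    return f"select[gpu_mtotal0>={round(min_gb * 1024)}]"


def eligible_models(min_gb):
    """Models from the table whose nominal GPU memory is at least min_gb."""
    return sorted(model for model, gib in GPU_MEMORY_GIB.items() if gib >= min_gb)


print(mtotal_select(10))  # select[gpu_mtotal0>=10240]
print(eligible_models(10))
```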

How to select a GPU model

In some cases it is desirable or necessary to select the GPU model on which your job runs, for example if you know that your code runs much faster on a newer model. However, keep in mind that narrowing down the list of allowed GPUs may increase the time your job waits in the queue.

To select a certain GPU model, add the -R "select[gpu_model0==GPU_MODEL]" resource requirement to bsub, where GPU_MODEL is one of the specifiers listed in the table above:

[sfux@lo-login-01 ~]$ bsub -R "rusage[ngpus_excl_p=1]" -R "select[gpu_model0==GeForceGTX1080]" ./my_cuda_program
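A misspelled specifier is easy to overlook. It likely leaves the job pending, for the same reason as a crossed-out identifier: no host matches it. Below is a minimal sketch that checks the name against the specifiers from the table before building the selection string. The set KNOWN_SPECIFIERS is copied by hand from the table, so it would need to be kept in sync with it. model_select is a hypothetical helper name.

```python
# Specifiers copied from both driver columns of the table above. Remove any
# identifier that is crossed out there, since LSF has no hosts matching it.
KNOWN_SPECIFIERS = {
    # GPU driver <= 450.80.02
    "GeForceGTX1080", "GeForceGTX1080Ti", "GeForceRTX2080Ti", "TITANRTX",
    "QuadroRTX6000", "TeslaV100_SXM2_32GB", "A100_PCIE_40GB",
    # GPU driver > 450.80.02
    "NVIDIAGeForceGTX1080", "NVIDIAGeForceGTX1080Ti", "NVIDIAGeForceRTX2080Ti",
    "NVIDIAGeForceRTX3090", "NVIDIATITANRTX", "NVIDIAA100_PCIE_40GB",
}


def model_select(specifier):
    """Return an LSF select string for one GPU model, rejecting unknown names."""
    if specifier not in KNOWN_SPECIFIERS:
        raise ValueError(f"unknown GPU model specifier: {specifier!r}")
    return f"select[gpu_model0=={specifier}]"


print(model_select("GeForceGTX1080"))  # select[gpu_model0==GeForceGTX1080]
```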

Python and GPUs

Because some Python packages need different installations for their CPU and GPU versions, we provide separate Python modules for CPUs and GPUs (python/XXX and python_gpu/XXX). The python_gpu modules additionally load a CUDA and a cuDNN module automatically. Running the GPU version of TensorFlow (< 2.0.0) or PyTorch on a CPU node will crash immediately, because these packages check at start-up whether the compute node has a GPU driver installed. From TensorFlow 2.0.0 on, Google merged the CPU and GPU versions of TensorFlow into a single package. PyTorch, however, still requires separate CPU and GPU installations.
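GPU builds of TensorFlow 1.x and PyTorch fail on nodes without a GPU driver. A script may therefore want to check for the driver before importing such a package. Here is a minimal sketch. It assumes a Linux node on which the loaded NVIDIA kernel driver exposes /proc/driver/nvidia/version. node_has_nvidia_driver is a hypothetical helper, not part of any module we provide.

```python
import os

# On Linux, the loaded NVIDIA kernel driver exposes its version in this file;
# on a CPU-only node the file normally does not exist.
NVIDIA_DRIVER_FILE = "/proc/driver/nvidia/version"


def node_has_nvidia_driver(path=NVIDIA_DRIVER_FILE):
    """Return True if the NVIDIA driver appears to be loaded on this node."""
    return os.path.exists(path)


if __name__ == "__main__":
    if node_has_nvidia_driver():
        print("NVIDIA driver found: a python_gpu module can be used here")
    else:
        print("No NVIDIA driver found: use a python (CPU) module here")
```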

TensorFlow 1.x example

As an example of running a TensorFlow job on a GPU node, we print the TensorFlow version, the string Hello, TensorFlow! and the result of a simple matrix multiplication:

[sfux@lo-login-01 ~]$ cd testrun/python
[sfux@lo-login-01 python]$ module load python_gpu/2.7.13
[sfux@lo-login-01 python]$ cat tftest1.py
#!/usr/bin/env python
from __future__ import print_function
import tensorflow as tf

vers = tf.__version__
print(vers)
hello = tf.constant('Hello, TensorFlow!')
matrix1 = tf.constant([[3., 3.]])
matrix2 = tf.constant([[2.],[2.]])
product = tf.matmul(matrix1, matrix2)

sess = tf.Session()
print(sess.run(hello))
print(sess.run(product))
sess.close()
[sfux@lo-login-01 python]$ bsub -n 1 -W 4:00 -R "rusage[mem=2048, ngpus_excl_p=1]" python tftest1.py
Generic job.
Job <10620> is submitted to queue <gpu.4h>.
[sfux@lo-login-01 python]$ bjobs
JOBID      USER      STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
10620      sfux      PEND  gpu.4h     lo-login-01             *tftest.py Sep 28 08:02
[sfux@lo-login-01 python]$ bjobs
JOBID      USER      STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
10620      sfux      RUN   gpu.4h     lo-login-01 lo-gtx-001  *ftest1.py Sep 28 08:03
[sfux@lo-login-01 python]$ bjobs
No unfinished job found
[sfux@lo-login-01 python]$ grep -A3 "Creating TensorFlow device" lsf.o10620
2017-09-28 08:08:43.235886: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1080, pci bus id: 0000:04:00.0)
1.3.0
Hello, TensorFlow!
[[12.]]
[sfux@lo-login-01 python]$
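You can also extract the device that TensorFlow 1.x created from the job output in Python instead of using grep. This is useful, for example, to confirm in a script that the job really ran on a GPU. The sketch assumes the log line format shown above; other TensorFlow versions word this message differently. tensorflow_devices is a hypothetical helper.

```python
import re

# Matches TensorFlow 1.x device-creation log lines like the one shown above.
DEVICE_LINE = re.compile(
    r"Creating TensorFlow device \((?P<tf_device>[^)]+)\) -> "
    r"\(device: (?P<index>\d+), name: (?P<name>[^,]+),"
)


def tensorflow_devices(log_text):
    """Return (tf_device, index, gpu_name) for each device-creation line."""
    return [
        (m["tf_device"], int(m["index"]), m["name"])
        for m in DEVICE_LINE.finditer(log_text)
    ]


sample = ("2017-09-28 08:08:43.235886: I tensorflow/core/common_runtime/gpu/"
          "gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> "
          "(device: 0, name: GeForce GTX 1080, pci bus id: 0000:04:00.0)")
print(tensorflow_devices(sample))  # [('/gpu:0', 0, 'GeForce GTX 1080')]
```

To apply it to the job above, read the output file, e.g. tensorflow_devices(open("lsf.o10620").read()).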

Please note that your job will crash if you run the GPU version of TensorFlow 1.x on a CPU node, because TensorFlow checks at start-up whether the compute node has a GPU driver.

TensorFlow 2.x example

TensorFlow 2.x no longer uses sessions. Below is an updated example job for TensorFlow 2.0. It creates two 2000x2000 matrices with random numbers, carries out a matrix multiplication once on the CPU and once on the GPU, and compares the run times.

[sfux@lo-login-01 ~]$ cd testrun/tf/test2
[sfux@lo-login-01 test2]$ module load gcc/6.3.0 python_gpu/3.7.4

The following have been reloaded with a version change:
  1) gcc/4.8.5 => gcc/6.3.0

[sfux@lo-login-01 test2]$ cat tf2test.py 
#!/usr/bin/env python

import time
import tensorflow as tf

k = 2000
a = tf.random.uniform(shape=[k,k], minval=0, maxval=20,dtype=tf.float16)
b = tf.random.uniform(shape=[k,k], minval=0, maxval=20,dtype=tf.float16)

cpu_slot = 0
gpu_slot = 0

# Using CPU at slot 0
with tf.device('/CPU:' + str(cpu_slot)):
    start = time.time()
    c1 = tf.matmul(a,b)
    print("Time on CPU:")
    end = time.time() - start
    print(end)

# Using the GPU at slot 0
with tf.device('/GPU:' + str(gpu_slot)):
    start = time.time()
    c2 = tf.matmul(a,b)
    print("Time on GPU:")
    end = time.time() - start
    print(end)

[sfux@lo-login-01 test2]$ bsub -n 1 -W 4:00 -R "rusage[mem=2048, ngpus_excl_p=1]" python tf2test.py 
Generic job.
Job <5074756> is submitted to queue <gpu.4h>.
[sfux@lo-login-01 test2]$ bjobs
JOBID      USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
5074756    sfux    PEND  gpu.4h     lo-login-01             *f2test.py Mar  5 12:28
[sfux@lo-login-01 test2]$ bjobs
JOBID      USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
5074756    sfux    RUN   gpu.4h     lo-login-01 lo-s4-082   *f2test.py Mar  5 12:28
[sfux@lo-login-01 test2]$ bjobs
No unfinished job found
[sfux@lo-login-01 test2]$ grep -A1 "Time on" lsf.o5074756
Time on CPU:
63.97628474235535
Time on GPU:
0.4504997730255127
[sfux@lo-login-01 test2]$
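The grep command above only prints the numbers. To compute a speedup, a short Python sketch can parse the "Time on ...:" lines from the LSF output file. The helper parse_timings is hypothetical. It assumes that each label is followed directly by the measured time on the next line, as in the output above.

```python
def parse_timings(log_text):
    """Map 'Time on X:' labels to the float printed on the following line."""
    lines = log_text.splitlines()
    timings = {}
    # Pair every line with the next one and keep the label/value pairs.
    for label, value in zip(lines, lines[1:]):
        if label.startswith("Time on ") and label.endswith(":"):
            device = label[len("Time on "):-1]
            timings[device] = float(value)
    return timings


log = """Time on CPU:
63.97628474235535
Time on GPU:
0.4504997730255127
"""
t = parse_timings(log)
print(t)
print("speedup:", t["CPU"] / t["GPU"])  # about 142
```

To use it on a real job, pass the contents of the output file, e.g. open("lsf.o5074756").read(). Treat the resulting ratio with caution: GPU operations may be launched asynchronously, so a wall-clock time taken right after tf.matmul returns does not necessarily include the full GPU computation.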

Since TensorFlow 2.0, a single Python package supports both CPUs and GPUs. If TensorFlow 2.x is imported on a pure CPU compute node, it no longer fails the GPU driver check. Instead, it falls back to running on the CPU.