Getting started with GPUs
Introduction
Currently we only provide GPUs in the Leonhard Cluster, where access is restricted to Shareholders.
Using GPUs on Leonhard
All GPUs in Leonhard are configured in Exclusive Process mode. The GPU nodes have 20 cores, 8 GPUs, and 256 GB of RAM (of which only about 210 GB is usable). To run a multi-node job, you will need to request span[ptile=20].
The LSF batch system has partially integrated support for GPUs. To use GPUs in a job, you need to request the ngpus_excl_p resource, which specifies the number of GPUs per node. This is unlike other resources, which are requested per core.
For example, to run a serial job with one GPU,
bsub -R "rusage[ngpus_excl_p=1]" ./my_cuda_program
or on a full node with all eight GPUs and up to 90 GB of RAM,
bsub -n 20 -R "rusage[mem=4500,ngpus_excl_p=8]" ./my_cuda_program
or on two full nodes:
bsub -n 40 -R "rusage[mem=4500,ngpus_excl_p=8] span[ptile=20]" ./my_cuda_program
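The per-core versus per-node distinction in the commands above can be checked with a small helper. This is only a sketch: the function names gpu_bsub_command and memory_per_node_mb are hypothetical and not part of LSF, and the default of 20 cores per node is taken from the node description above.

```python
def gpu_bsub_command(nodes, gpus_per_node, mem_per_core_mb, program,
                     cores_per_node=20):
    """Build a bsub command line for a job that uses whole GPU nodes.

    Hypothetical helper for illustration; it only formats a string.
    """
    cores = nodes * cores_per_node
    # mem is requested per core, ngpus_excl_p is requested per node.
    resources = "rusage[mem={},ngpus_excl_p={}]".format(
        mem_per_core_mb, gpus_per_node)
    if nodes > 1:
        # Multi-node jobs need span[ptile=...] to fill each node.
        resources += " span[ptile={}]".format(cores_per_node)
    return 'bsub -n {} -R "{}" {}'.format(cores, resources, program)


def memory_per_node_mb(mem_per_core_mb, cores_per_node=20):
    """Total memory requested on one full node, in MB."""
    return mem_per_core_mb * cores_per_node


print(gpu_bsub_command(2, 8, 4500, "./my_cuda_program"))
print(memory_per_node_mb(4500))  # 20 cores * 4500 MB = 90000 MB, about 90 GB
```

Requesting mem=4500 on all 20 cores therefore stays well below the roughly 210 GB that is usable on a node.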
Although your job can see all GPUs on the node, LSF sets the CUDA_VISIBLE_DEVICES environment variable to the GPUs allocated to the job, and CUDA programs honor this variable.
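To see which GPUs a job has been given, the variable can be inspected from inside the job. The following is a minimal sketch; the function name visible_gpus is hypothetical, and it assumes the variable holds a comma-separated list of device identifiers, which is the format CUDA expects.

```python
import os


def visible_gpus(environ=None):
    """Return the device identifiers listed in CUDA_VISIBLE_DEVICES.

    Returns None if the variable is not set (no restriction applied)
    and an empty list if it is set but empty (no device visible).
    """
    environ = os.environ if environ is None else environ
    value = environ.get("CUDA_VISIBLE_DEVICES")
    if value is None:
        return None
    return [item.strip() for item in value.split(",") if item.strip()]


# Inside a GPU job this prints the devices assigned by the batch system.
print(visible_gpus())
```

Passing a dictionary instead of the real environment makes the function easy to try out on a login node.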
How to submit a GPU job
TensorFlow example
As an example of running a TensorFlow job on a GPU node, the script below prints the TensorFlow version, the string Hello, TensorFlow!, and the result of a simple matrix multiplication:
[leonhard@lo-login-01 ~]$ cd testrun/python
[leonhard@lo-login-01 python]$ module load python_gpu/2.7.13
[leonhard@lo-login-01 python]$ cat tftest1.py
#!/usr/bin/env python
from __future__ import print_function
import tensorflow as tf

vers = tf.__version__
print(vers)
hello = tf.constant('Hello, TensorFlow!')
# 1x2 matrix times 2x1 matrix gives a 1x1 result
matrix1 = tf.constant([[3., 3.]])
matrix2 = tf.constant([[2.],[2.]])
product = tf.matmul(matrix1, matrix2)
# Evaluate the graph in a session (TensorFlow 1.x API)
sess = tf.Session()
print(sess.run(hello))
print(sess.run(product))
sess.close()
[leonhard@lo-login-01 python]$ bsub -n 1 -W 4:00 -R "rusage[mem=2048, ngpus_excl_p=1]" python tftest1.py
Generic job.
Job <10620> is submitted to queue <gpu.4h>.
[leonhard@lo-login-01 python]$ bjobs
JOBID      USER     STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
10620      leonhard PEND  gpu.4h     lo-login-01             *tftest.py Sep 28 08:02
[leonhard@lo-login-01 python]$ bjobs
JOBID      USER     STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
10620      leonhard RUN   gpu.4h     lo-login-01 lo-gtx-001  *ftest1.py Sep 28 08:03
[leonhard@lo-login-01 python]$ bjobs
No unfinished job found
[leonhard@lo-login-01 python]$ grep -A3 "Creating TensorFlow device" lsf.o10620
2017-09-28 08:08:43.235886: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1080, pci bus id: 0000:04:00.0)
1.3.0
Hello, TensorFlow!
[[ 12.]]
[leonhard@lo-login-01 python]$
Please note that your job will crash if you run the GPU version of TensorFlow on a CPU node, because TensorFlow checks at startup whether the compute node has a GPU driver.
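A script can guard against this by checking for the driver before importing TensorFlow. The sketch below is an illustration only: it assumes that the NVIDIA Linux driver exposes the file /proc/driver/nvidia/version when it is loaded, and the function names has_nvidia_driver and require_gpu_driver are hypothetical.

```python
import os

# Assumption: this file exists on nodes where the NVIDIA driver is loaded.
NVIDIA_DRIVER_FILE = "/proc/driver/nvidia/version"


def has_nvidia_driver(path=NVIDIA_DRIVER_FILE):
    """Return True if the driver information file exists on this node."""
    return os.path.exists(path)


def require_gpu_driver(path=NVIDIA_DRIVER_FILE):
    """Raise a readable error instead of letting the import fail later."""
    if not has_nvidia_driver(path):
        raise RuntimeError(
            "No NVIDIA driver found on this node; "
            "submit the job with ngpus_excl_p to get a GPU node.")


# Typical use at the top of a job script:
#     require_gpu_driver()
#     import tensorflow as tf
```

Failing early with an explicit message makes it easier to tell a scheduling mistake from a problem in the TensorFlow code itself.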