Getting started with GPUs
Introduction
Currently we only provide GPUs in the Leonhard Cluster, where access is restricted to Shareholders.
How to submit a GPU job
All GPUs in Leonhard are configured in Exclusive Process mode. The GPU nodes have 20 cores, 8 GPUs, and 256 GB of RAM (of which only about 210 GB is usable). To run a multi-node job, you will need to request span[ptile=20].
The LSF batch system has partially integrated support for GPUs. To use the GPUs for a job, you need to request the ngpus_excl_p resource. It refers to the number of GPUs per node; this is unlike other resources, which are requested per core.
For example, to run a serial job with one GPU,
bsub -R "rusage[ngpus_excl_p=1]" ./my_cuda_program
or on a full node with all eight GPUs and up to 90 GB of RAM,
bsub -n 20 -R "rusage[mem=4500,ngpus_excl_p=8]" ./my_cuda_program
or on two full nodes:
bsub -n 40 -R "rusage[mem=4500,ngpus_excl_p=8] span[ptile=20]" ./my_cuda_program
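The same resource requests can also be placed in a job script using #BSUB directives and submitted with bsub < script. A minimal sketch for a full-node job, assuming your program is called ./my_cuda_program (the script name gpu_job.sh is just an example):
#!/bin/bash
#BSUB -n 20                                  # 20 cores = one full node
#BSUB -W 4:00                                # 4 h run time limit
#BSUB -R "rusage[mem=4500,ngpus_excl_p=8]"   # 4500 MB per core, all 8 GPUs of the node
./my_cuda_program
Submit it with bsub < gpu_job.sh.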
While your jobs will see all GPUs, LSF will set the CUDA_VISIBLE_DEVICES environment variable, which is honored by CUDA programs.
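To check which GPUs LSF assigned to your job, you can print the variable and query the devices from inside the job itself. A minimal sketch using the resource syntax from the examples above (nvidia-smi is NVIDIA's standard monitoring tool; the single quotes keep the variable from being expanded on the login node):
bsub -R "rusage[ngpus_excl_p=2]" 'echo CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES; nvidia-smi'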
Tensorflow example
As an example of running a TensorFlow job on a GPU node, we print the TensorFlow version, the string 'Hello, TensorFlow!', and the result of a simple matrix multiplication:
[leonhard@lo-login-01 ~]$ cd testrun/python
[leonhard@lo-login-01 python]$ module load python_gpu/2.7.13
[leonhard@lo-login-01 python]$ cat tftest1.py
#!/usr/bin/env python
from __future__ import print_function
import tensorflow as tf

vers = tf.__version__
print(vers)

hello = tf.constant('Hello, TensorFlow!')
matrix1 = tf.constant([[3., 3.]])
matrix2 = tf.constant([[2.],[2.]])
product = tf.matmul(matrix1, matrix2)

sess = tf.Session()
print(sess.run(hello))
print(sess.run(product))
sess.close()
[leonhard@lo-login-01 python]$ bsub -n 1 -W 4:00 -R "rusage[mem=2048, ngpus_excl_p=1]" python tftest1.py
Generic job.
Job <10620> is submitted to queue <gpu.4h>.
[leonhard@lo-login-01 python]$ bjobs
JOBID   USER      STAT  QUEUE   FROM_HOST    EXEC_HOST   JOB_NAME    SUBMIT_TIME
10620   leonhard  PEND  gpu.4h  lo-login-01              *tftest.py  Sep 28 08:02
[leonhard@lo-login-01 python]$ bjobs
JOBID   USER      STAT  QUEUE   FROM_HOST    EXEC_HOST   JOB_NAME    SUBMIT_TIME
10620   leonhard  RUN   gpu.4h  lo-login-01  lo-gtx-001  *ftest1.py  Sep 28 08:03
[leonhard@lo-login-01 python]$ bjobs
No unfinished job found
[leonhard@lo-login-01 python]$ grep -A3 "Creating TensorFlow device" lsf.o10620
2017-09-28 08:08:43.235886: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1080, pci bus id: 0000:04:00.0)
1.3.0
Hello, TensorFlow!
[[ 12.]]
Please note that your job will crash if you run the GPU version of TensorFlow on a CPU node, because TensorFlow checks at startup whether the compute node has a GPU driver.
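If you are unsure whether TensorFlow can see a GPU on the node your job landed on, you can list the detected devices before running real work. A minimal sketch, assuming the TensorFlow 1.x provided by python_gpu/2.7.13 (the file name tfdevices.py is just an example):
#!/usr/bin/env python
# List the devices TensorFlow detects; on a GPU node the output
# should include a '/gpu:0' entry in addition to '/cpu:0'.
from __future__ import print_function
from tensorflow.python.client import device_lib

for dev in device_lib.list_local_devices():
    print(dev.name, dev.device_type)
Submit it to a GPU node in the same way as the example above, e.g. bsub -R "rusage[ngpus_excl_p=1]" python tfdevices.py.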