Getting started with GPUs

From ScientificComputing
 
Revision as of 14:36, 30 November 2017

Introduction

Currently we only provide GPUs in the Leonhard Cluster, where access is restricted to Shareholders.

How to submit a GPU job

All GPUs in Leonhard are configured in Exclusive Process mode. The GPU nodes have 20 cores, 8 GPUs, and 256 GB of RAM (of which only about 210 GB is usable). To run a multi-node job, you will need to request span[ptile=20].

The LSF batch system has partial integrated support for GPUs. To use GPUs, a job needs to request the ngpus_excl_p resource. Unlike other resources, which are requested per core, it refers to the number of GPUs per node.

For example, to run a serial job with one GPU,

bsub -R "rusage[ngpus_excl_p=1]" ./my_cuda_program

or on a full node with all eight GPUs and up to 90 GB of RAM,

bsub -n 20 -R "rusage[mem=4500,ngpus_excl_p=8]" ./my_cuda_program

or on two full nodes:

bsub -n 40 -R "rusage[mem=4500,ngpus_excl_p=8] span[ptile=20]" ./my_cuda_program
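Since mem is a per-core request, the memory available to a full-node job is the per-core value times the number of cores. A quick sketch of the arithmetic behind the examples above:

```python
# rusage[mem=...] in LSF is a per-core request, in MB.
mem_per_core_mb = 4500   # value used in the bsub examples above
cores_per_node = 20      # cores on a Leonhard GPU node

total_mb = mem_per_core_mb * cores_per_node
print(total_mb)  # 90000 MB, i.e. the "up to 90 GB of RAM" per node
```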

While your jobs will see all GPUs, LSF will set the CUDA_VISIBLE_DEVICES environment variable, which is honored by CUDA programs.
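As an illustration of what this means in practice, a program can inspect CUDA_VISIBLE_DEVICES to see which GPU indices LSF granted to the job. The helper name visible_gpus below is our own; only the environment variable itself comes from LSF/CUDA:

```python
import os

def visible_gpus():
    """Parse CUDA_VISIBLE_DEVICES into a list of GPU indices.

    Returns an empty list when the variable is unset or empty,
    which is how a CUDA program would end up seeing no devices.
    """
    value = os.environ.get("CUDA_VISIBLE_DEVICES", "")
    return [int(part) for part in value.split(",") if part.strip().isdigit()]

# Simulate LSF having granted GPUs 2 and 3 to this job (illustration only).
os.environ["CUDA_VISIBLE_DEVICES"] = "2,3"
print(visible_gpus())  # → [2, 3]
```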

Tensorflow example

As an example of running a TensorFlow job on a GPU node, the following script prints the TensorFlow version, the string Hello, TensorFlow!, and the result of a simple matrix multiplication:

[leonhard@lo-login-01 ~]$ cd testrun/python
[leonhard@lo-login-01 python]$ module load python_gpu/2.7.13
[leonhard@lo-login-01 python]$ cat tftest1.py
#!/usr/bin/env python
from __future__ import print_function
import tensorflow as tf

vers = tf.__version__
print(vers)
hello = tf.constant('Hello, TensorFlow!')
matrix1 = tf.constant([[3., 3.]])
matrix2 = tf.constant([[2.],[2.]])
product = tf.matmul(matrix1, matrix2)

sess = tf.Session()
print(sess.run(hello))
print(sess.run(product))
sess.close()
[leonhard@lo-login-01 python]$ bsub -n 1 -W 4:00 -R "rusage[mem=2048, ngpus_excl_p=1]" python tftest1.py
Generic job.
Job <10620> is submitted to queue <gpu.4h>.
[leonhard@lo-login-01 python]$ bjobs
JOBID      USER      STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
10620      leonhard  PEND  gpu.4h     lo-login-01             *tftest.py Sep 28 08:02
[leonhard@lo-login-01 python]$ bjobs
JOBID      USER      STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
10620      leonhard  RUN   gpu.4h     lo-login-01 lo-gtx-001  *ftest1.py Sep 28 08:03
[leonhard@lo-login-01 python]$ bjobs
No unfinished job found
[leonhard@lo-login-01 python]$ grep -A3 "Creating TensorFlow device" lsf.o10620
2017-09-28 08:08:43.235886: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1080, pci bus id: 0000:04:00.0)
1.3.0
Hello, TensorFlow!
[[ 12.]]
[leonhard@lo-login-01 python]$

Please note that your job will crash if you run the GPU version of TensorFlow on a CPU node, because TensorFlow checks at startup whether the compute node has a GPU driver.
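One way to fail gracefully instead of crashing is to check for the NVIDIA kernel driver before touching the GPU build. The /proc path below is where the NVIDIA driver exposes its version on Linux; the guard itself is our sketch, not part of the cluster setup:

```python
import os

def has_nvidia_driver():
    """Heuristic check: the NVIDIA kernel driver exposes this /proc file."""
    return os.path.exists("/proc/driver/nvidia/version")

if has_nvidia_driver():
    print("GPU driver present; the GPU build of TensorFlow should start")
else:
    print("No GPU driver; use the CPU build or resubmit to a GPU node")
```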