Getting started with GPUs

From ScientificComputing

Revision as of 12:41, 30 November 2017

Introduction

Currently we only provide GPUs in the Leonhard Cluster, where access is restricted to Shareholders.

How to submit a GPU job

All GPUs in Leonhard are configured in Exclusive Process mode. Each GPU node has 20 cores, 8 GPUs, and 256 GB of RAM (of which only about 210 GB is usable). To run a multi-node job, you need to request span[ptile=20].

The LSF batch system has partially integrated support for GPUs. To use GPUs in a job, you need to request the ngpus_excl_p resource. It specifies the number of GPUs per node, unlike other resources, which are requested per core.
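The per-node versus per-core distinction can be made concrete with a little arithmetic. The following is a minimal sketch, not part of LSF: the function name requested_totals is hypothetical, and it assumes the job fills whole nodes of 20 cores each (as with span[ptile=20]).

```python
import math

CORES_PER_NODE = 20  # node size stated in this section


def requested_totals(cores, mem_per_core_mb, gpus_per_node):
    """Return what a request adds up to, assuming jobs fill whole nodes.

    mem is requested per core, so it is multiplied by the core count;
    ngpus_excl_p is requested per node, so it is multiplied by the node count.
    """
    nodes = int(math.ceil(cores / float(CORES_PER_NODE)))
    return {
        "nodes": nodes,
        "total_mem_mb": cores * mem_per_core_mb,
        "total_gpus": nodes * gpus_per_node,
    }
```

For instance, requested_totals(20, 4500, 8) describes one node with 90000 MB of memory and 8 GPUs, while requested_totals(40, 4500, 8) describes two nodes with 16 GPUs in total.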

For example, to run a serial job with one GPU,

bsub -R "rusage[ngpus_excl_p=1]" ./my_cuda_program

or on a full node with all eight GPUs and up to 90 GB of RAM (20 cores × 4500 MB per core),

bsub -n 20 -R "rusage[mem=4500,ngpus_excl_p=8]" ./my_cuda_program

or on two full nodes:

bsub -n 40 -R "rusage[mem=4500,ngpus_excl_p=8] span[ptile=20]" ./my_cuda_program
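The three commands above follow one pattern, which can be sketched as a small helper that assembles the bsub command line. This is only an illustration: build_gpu_bsub is a hypothetical name, the helper prints nothing and submits nothing, and it assumes 20 cores and 8 GPUs per node as described in this section.

```python
CORES_PER_NODE = 20  # node size stated in this section
GPUS_PER_NODE = 8


def build_gpu_bsub(program, cores=1, gpus_per_node=1, mem_per_core_mb=None):
    """Assemble a bsub command string for a GPU job (hypothetical helper)."""
    if not 1 <= gpus_per_node <= GPUS_PER_NODE:
        raise ValueError("gpus_per_node must be between 1 and 8")

    # mem is per core, ngpus_excl_p is per node
    rusage = []
    if mem_per_core_mb is not None:
        rusage.append("mem={}".format(mem_per_core_mb))
    rusage.append("ngpus_excl_p={}".format(gpus_per_node))
    resources = "rusage[{}]".format(",".join(rusage))

    # Multi-node jobs must place 20 cores on each node
    if cores > CORES_PER_NODE:
        resources += " span[ptile={}]".format(CORES_PER_NODE)

    parts = ["bsub"]
    if cores > 1:
        parts += ["-n", str(cores)]
    parts += ["-R", '"{}"'.format(resources), program]
    return " ".join(parts)
```

Calling build_gpu_bsub("./my_cuda_program", cores=40, gpus_per_node=8, mem_per_core_mb=4500) reproduces the two-node command shown above.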

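Although your job can see all GPUs on the node, LSF sets the CUDA_VISIBLE_DEVICES environment variable (see https://devblogs.nvidia.com/parallelforall/cuda-pro-tip-control-gpu-visibility-cuda_visible_devices/), which CUDA programs honor.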
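To check which GPUs a job has been given, the variable can be inspected from inside the job. The following is a minimal sketch; visible_gpu_ids is a hypothetical helper name, and the entries are kept as strings because CUDA_VISIBLE_DEVICES may hold either device indices or device UUIDs.

```python
import os


def visible_gpu_ids(environ=None):
    """Return the entries of CUDA_VISIBLE_DEVICES as a list of strings.

    Returns None if the variable is not set at all, and an empty list
    if it is set but empty (no GPU visible to CUDA).
    """
    env = os.environ if environ is None else environ
    raw = env.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        return None
    return [token.strip() for token in raw.split(",") if token.strip()]
```

Note that CUDA renumbers the visible devices starting from 0 inside the program, so a job that was given physical GPU 5 still addresses it as device 0.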

Python and GPUs

Because certain Python packages need different installations for their CPU and GPU versions, we provide separate Python installations for CPU and GPU use. For instance, running the GPU version of TensorFlow on a CPU node will crash immediately, because TensorFlow checks at startup whether the compute node has a GPU driver.

CPU version                     GPU version
module load python_cpu/3.6.1    module load python_gpu/3.6.1

TensorFlow example

As an example of running a TensorFlow job on a GPU node, the following script prints the TensorFlow version, the string Hello, TensorFlow! and the result of a simple matrix multiplication:

[leonhard@lo-login-01 ~]$ cd testrun/python
[leonhard@lo-login-01 python]$ module load python_gpu/2.7.13
[leonhard@lo-login-01 python]$ cat tftest1.py
#!/usr/bin/env python
from __future__ import print_function
import tensorflow as tf

# Print the installed TensorFlow version
vers = tf.__version__
print(vers)
hello = tf.constant('Hello, TensorFlow!')
# Multiply a 1x2 matrix by a 2x1 matrix: 3*2 + 3*2 = 12
matrix1 = tf.constant([[3., 3.]])
matrix2 = tf.constant([[2.],[2.]])
product = tf.matmul(matrix1, matrix2)

# TensorFlow 1.x: evaluate the graph inside a session
sess = tf.Session()
print(sess.run(hello))
print(sess.run(product))
sess.close()
[leonhard@lo-login-01 python]$ bsub -n 1 -W 4:00 -R "rusage[mem=2048, ngpus_excl_p=1]" python tftest1.py
Generic job.
Job <10620> is submitted to queue <gpu.4h>.
[leonhard@lo-login-01 python]$ bjobs
JOBID      USER      STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
10620      leonhard  PEND  gpu.4h     lo-login-01             *ftest1.py Sep 28 08:02
[leonhard@lo-login-01 python]$ bjobs
JOBID      USER      STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
10620      leonhard  RUN   gpu.4h     lo-login-01 lo-gtx-001  *ftest1.py Sep 28 08:03
[leonhard@lo-login-01 python]$ bjobs
No unfinished job found
[leonhard@lo-login-01 python]$ grep -A3 "Creating TensorFlow device" lsf.o10620
2017-09-28 08:08:43.235886: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1080, pci bus id: 0000:04:00.0)
1.3.0
Hello, TensorFlow!
[[ 12.]]
[leonhard@lo-login-01 python]$

Please note that your job will crash if you run the GPU version of TensorFlow on a CPU node, because TensorFlow checks at startup whether the compute node has a GPU driver.
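A script can guard against this by checking for a driver before importing TensorFlow. The following is only a heuristic sketch: gpu_driver_present is a hypothetical helper, and it assumes a Linux node where a loaded NVIDIA kernel driver exposes the file /proc/driver/nvidia/version. It is not the check TensorFlow itself performs.

```python
import os


def gpu_driver_present(proc_path="/proc/driver/nvidia/version"):
    """Heuristic: report whether the NVIDIA driver's proc entry exists.

    Assumption: on Linux, this path is present only while the NVIDIA
    kernel module is loaded. The path is a parameter so it can be
    overridden for testing.
    """
    return os.path.exists(proc_path)
```

A job script could call gpu_driver_present() first and exit with a clear message when it returns False, instead of failing inside the TensorFlow import.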