Leonhard beta testing
Revision as of 06:55, 28 September 2017
The Leonhard cluster is available for early-access beta testing.
Please read through the following to get started.
Accessing the cluster
Who can access the cluster
Access is restricted to Leonhard shareholders and groups that want to test it before investing. Guest users cannot access the Leonhard cluster.
SSH
Users can access the Leonhard cluster via SSH:
ssh username@login.leonhard.ethz.ch
where username corresponds to your NETHZ username.
Note: the load balancer is still a work in progress; if it does not work, please try to access one of the login nodes directly:
ssh username@lo-login-01.login.leonhard.ethz.ch
Storage
As on the Euler cluster, every user has a home directory and a personal scratch directory:
/cluster/home/username
/cluster/scratch/username
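For illustration, a minimal sketch of staging an input file into the personal scratch directory; input_data.tar is a placeholder file name and username stands for your NETHZ username:
# copy an input archive from your home directory to your personal scratch directory
cp /cluster/home/username/input_data.tar /cluster/scratch/username/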
Applications
For the Leonhard cluster, we decided to switch from the environment modules used on the Euler cluster to Lmod modules, which provide some useful features that environment modules lack. You should barely notice the transition, as the commands are mostly the same:
[leonhard@lo-login-02 ~]$ module avail ------------------------------------------------- /cluster/spack/lmodules ------------------------------------------------- apr-util/1.5.4 gettext/0.19.8.1 libgpg-error/1.21 ncurses/6.0 apr/1.5.2 gflags/2.1.2 libgpuarray/0.6.2_py2 nettle/3.2 arpack/96 ghostscript-fonts/8.11 libgpuarray/0.6.2_py3 (D) openblas/0.2.19 atk/2.20.0 ghostscript/9.21 libice/1.0.9 openssl/1.0.1e atlas/3.11.34 git/2.12.1 libiconv/1.15 pango/1.40.3 atop/2.2-3 glib/2.49.7 libmng/2.0.2 patch/2.7.5 autoconf/2.69 glog/0.3.4 libogg/1.3.2 patchelf/0.9 automake/1.15 glpk/4.61 libpciaccess/0.13.4 pcre/8.40 bash/4.4 glproto/1.4.17 libpng/1.6.27 perl/5.24.1 bdw-gc/7.4.4 gmake/4.0 libpthread-stubs/0.3 pixman/0.34.0 binutils/2.28 gmp/6.1.2 libsigsegv/2.11 pkg-config/0.29.2 bison/3.0.4 gnat/2016 libsm/1.2.2 presentproto/1.0 bitmap/1.0.8 gnuplot/5.0.5 libtiff/4.0.6 py-mako/1.0.4 blaze/3.1 gnutls/3.5.10 libtool/2.4.6 python/2.7.13 boost/1.63.0 go-bootstrap/1.4-bootstrap-20161024 libunistring/0.9.7 python/3.6.0 (D) bzip2/1.0.6 go/1.8.1 libunwind/1.1 python_gpu/2.7.12 cairo/1.14.8 gobject-introspection/1.49.2 libx11/1.6.3 python_gpu/3.6.0 (D) cmake/2.8.10.2 gperf/3.0.4 libxau/1.0.8 qhull/2015.2 cmake/3.4.3 gperftools/2.4 libxaw/1.0.13 r/3.3.3 cmake/3.8.0 (D) gtkplus/2.24.31 libxcb/1.12 readline/7.0 coreutils/8.26 guile/2.0.11 libxdamage/1.1.4 renderproto/0.11.1 cscope/15.8b harfbuzz/1.4.6 libxdmcp/1.1.2 ruby/2.2.0 cuda/8.0.61 help2man/1.47.4 libxext/1.3.3 scotch/6.0.4 cudnn/6.0 hwloc/1.11.6 libxfixes/5.0.2 sqlite/3.18.0 curl/7.53.1 icu4c/58.2 libxft/2.3.2 suite-sparse/4.5.5 damageproto/1.2.1 image-magick/7.0.2-7 libxml2/2.9.4 swig/3.0.12 dbus/1.11.2 inputproto/2.3.2 libxmu/1.1.2 tar/1.29 dos2unix/7.3.4 isl/0.18 libxpm/3.5.10 tbb/2017.5 dri2proto/2.8 jdk/8u92 libxrender/0.9.10 tcl/8.6.6 dri3proto/1.0 jpeg/9b libxshmfence/1.2 tk/8.6.6 eigen/3.3.3 jsoncpp/1.7.3 libxslt/1.1.29 unzip/6.0 expat/2.2.0 kbproto/1.0.7 libxt/1.1.5 util-linux/2.29.1 exuberant-ctags/5.8 lcms/2.8 llvm/3.8.1 util-macros/1.19.1 fftw/3.3.5 libarchive/3.2.1 lmod/7.4.11 (D) vim/8.0.0503 fixesproto/5.0 libatomic-ops/7.4.4 lua-luafilesystem/1_6_3 wget/1.17 flex/2.6.1 libcerf/1.3 lua-luaposix/33.4.0 wx/3.1.0 flex/2.6.3 (D) libctl/3.2.2 lua/5.3.2 xbitmaps/1.1.1 font-util/1.3.1 libdrm/2.4.70 lz4/1.7.5 xcb-proto/1.12 fontcacheproto/0.1.3 libdwarf/20160507 lzma/4.32.7 xextproto/7.3.0 fontconfig/2.11.1 libedit/3.1-20170329 lzo/2.09 xproto/7.0.29 fontsproto/2.1.3 libelf/0.8.13 m4/1.4.18 xtrans/1.3.5 fonttosfnt/1.0.4 libffi/3.2.1 mawk/1.3.4 xz/5.2.3 freetype/2.7 libfontenc/1.1.3 metis/5.1.0 yasm/1.3.0 gawk/4.1.4 libfs/1.0.7 mpc/1.0.3 zlib/1.2.11 gdbm/1.13 libgcrypt/1.6.2 mpfr/3.1.5 gdk-pixbuf/2.31.2 libgd/2.2.4 nasm/2.11.06 ------------------------------------------------ /cluster/apps/lmodules/Core ------------------------------------------------- StdEnv (L) eth_proxy gcc/4.8.5 (L) lmod/7.4.11 settarg/7.4.11
[leonhard@lo-login-02 ~]$ module avail boost

----------------------------------------- /cluster/apps/lmodules/Compiler/gcc/4.8.5 ------------------------------------------
   boost/1.63.0

Use "module spider" to find all possible modules.
Use "module keyword key1 key2 ..." to search for all possible modules matching any of the "keys".

[leonhard@lo-login-02 ~]$ module load boost/1.63.0
[leonhard@lo-login-02 ~]$ module list

Currently Loaded Modules:
  1) gcc/4.8.5   2) StdEnv   3) boost/1.63.0

[leonhard@lo-login-02 ~]$
Please note that this is a work in progress and the module names might change. Currently, the number of software packages provided on Leonhard is not comparable to the software we provide on the Euler cluster, but it will grow over time.
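As the "module avail" output above notes, you can use Lmod's spider command to search the full module hierarchy. A minimal sketch, using python as an example search term:
# list all modules whose name matches "python"
module spider python
# show details for one specific version, including how to load it
module spider python/3.6.0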
TensorFlow
On Leonhard, we provide several versions of TensorFlow (for different Python versions, for CPUs, for GPUs, etc.). The following combinations are available:
CPU

Module command                   Provided versions
module load python_cpu/2.7.12    Python 2.7.12, TensorFlow 1.2.1
module load python_cpu/2.7.13    Python 2.7.13, TensorFlow 1.3
module load python_cpu/3.6.0     Python 3.6.0, TensorFlow 1.2.1
module load python_cpu/3.6.1     Python 3.6.1, TensorFlow 1.3

GPU

Module command                   Provided versions
module load python_gpu/2.7.12    Python 2.7.12, TensorFlow 1.2.1, CUDA 8.0.61, cuDNN 5.1
module load python_gpu/2.7.13    Python 2.7.13, TensorFlow 1.3, CUDA 8.0.61, cuDNN 6.0
module load python_gpu/3.6.0     Python 3.6.0, TensorFlow 1.2.1, CUDA 8.0.61, cuDNN 5.1
module load python_gpu/3.6.1     Python 3.6.1, TensorFlow 1.3, CUDA 8.0.61, cuDNN 6.0
To run a TensorFlow job on a CPU node, load one of the CPU versions; to run it on a GPU node, load one of the GPU versions.
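As a quick sanity check, a minimal sketch of loading one of the modules from the table above on a login node and printing the TensorFlow version it provides:
# load the CPU build of Python 3.6.1 with TensorFlow 1.3
module load python_cpu/3.6.1
# print the TensorFlow version to confirm the expected build is picked up
python -c "import tensorflow as tf; print(tf.__version__)"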
Submitting jobs
Leonhard uses the same LSF batch system as the Euler cluster.
Use the "bsub" command to submit a job and specify the resources needed to run it. By default, a job will get 1 core and 1024 MB of RAM for 4 hours. Unless otherwise specified, jobs requesting up to 36 cores will run on a single node. Regular nodes have 36 cores and 128 or 512 GB of RAM (of which about 90 and 460 GB, respectively, are usable).
Unlike on Euler, the requested memory is strictly enforced as a memory limit. For example, if you do not specifically state a memory requirement, your program cannot use more than 1 GB of RAM per core. What counts is the actually used memory, including the page cache for your job. All processes of the same job on a node share the same memory pool. For example, with a job submitted as
bsub -n 16 -R "rusage[mem=1024] span[ptile=8]" mpirun ./my_job
the 8 MPI ranks running on each node together share a pool of up to 8 GB.
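As another sketch, a single-node job that explicitly requests 4 cores, 4096 MB of RAM per core and a 24 hour run time; ./my_program is a placeholder for your own executable:
# 4 cores with 4096 MB each: the processes of this job share a 16 GB pool on one node
bsub -n 4 -W 24:00 -R "rusage[mem=4096]" ./my_program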
Submitting GPU jobs
All GPUs in Leonhard are configured in Exclusive Process mode. The GPU nodes have 20 cores, 8 GPUs, and 256 GB of RAM (of which only about 210 GB is usable). To run a multi-node job, you will need to request span[ptile=20].
The LSF batch system has partially integrated support for GPUs. To use the GPUs for a job, you need to request the ngpus_excl_p resource. It refers to the number of GPUs per node; this is unlike other resources, which are requested per core.
For example, to run a serial job with one GPU,
bsub -R "rusage[ngpus_excl_p=1]" ./my_cuda_program
or on a full node with all eight GPUs and up to 90 GB of RAM,
bsub -n 20 -R "rusage[mem=4500,ngpus_excl_p=8]" ./my_cuda_program
or on two full nodes:
bsub -n 40 -R "rusage[mem=4500,ngpus_excl_p=8] span[ptile=20]" ./my_cuda_program
While your jobs will see all GPUs of a node, LSF sets the CUDA_VISIBLE_DEVICES environment variable to the GPUs assigned to your job; this variable is honored by CUDA programs.
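To see which GPUs LSF assigned to a job, a minimal sketch that simply echoes the variable from inside a job requesting two GPUs (the single quotes make sure it is expanded on the compute node, not on the login node):
# print the GPU indices assigned by LSF for this job
bsub -R "rusage[ngpus_excl_p=2]" 'echo CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES'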
TensorFlow example
As an example of running a TensorFlow job on a GPU node, we print the TensorFlow version, the string Hello, TensorFlow! and the result of a simple matrix multiplication:
[leonhard@lo-login-01 ~]$ cd testrun/python
[leonhard@lo-login-01 python]$ module load python_gpu/2.7.13
[leonhard@lo-login-01 python]$ cat tftest1.py
#!/usr/bin/env python
from __future__ import print_function
import tensorflow as tf
vers = tf.__version__
print(vers)
hello = tf.constant('Hello, TensorFlow!')
matrix1 = tf.constant([[3., 3.]])
matrix2 = tf.constant([[2.],[2.]])
product = tf.matmul(matrix1, matrix2)
sess = tf.Session()
print(sess.run(hello))
print(sess.run(product))
sess.close()
[leonhard@lo-login-01 python]$ bsub -n 1 -W 4:00 -R "rusage[mem=2048, ngpus_excl_p=1]" python tftest1.py
Generic job.
Job <10620> is submitted to queue <gpu.4h>.
[leonhard@lo-login-01 python]$ bjobs
JOBID      USER      STAT  QUEUE      FROM_HOST    EXEC_HOST    JOB_NAME    SUBMIT_TIME
10620      leonhard  PEND  gpu.4h     lo-login-01               *tftest.py  Sep 28 08:02
[leonhard@lo-login-01 python]$ bjobs
JOBID      USER      STAT  QUEUE      FROM_HOST    EXEC_HOST    JOB_NAME    SUBMIT_TIME
10620      leonhard  RUN   gpu.4h     lo-login-01  lo-gtx-001   *ftest1.py  Sep 28 08:03
[leonhard@lo-login-01 python]$ bjobs
No unfinished job found
[leonhard@lo-login-01 python]$ grep -A3 "Creating TensorFlow device" lsf.o10620
2017-09-28 08:08:43.235886: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1080, pci bus id: 0000:04:00.0)
1.3.0
Hello, TensorFlow!
[[ 12.]]
[leonhard@lo-login-01 python]$
Please note that your job will crash if you run the GPU version of TensorFlow on a CPU node, because TensorFlow checks on startup whether the compute node has a GPU driver.
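If you are unsure whether a job actually landed on a GPU node, a minimal sketch that lists the devices TensorFlow can see from inside a job (device_lib is part of TensorFlow 1.x; submit it with a GPU resource request as in the example above):
# list the devices visible to TensorFlow; on a GPU node the output should include a /gpu:0 entry
python -c "from tensorflow.python.client import device_lib; print(device_lib.list_local_devices())"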