Leonhard beta testing

This page contains information about the Leonhard Open cluster, which is now obsolete, as the cluster was integrated into the Euler cluster on 14/15 September 2021.

Please read through the following to get started on the Leonhard cluster.

Current status

The Leonhard Open cluster is still in the beta testing phase.

The lo-a2-*, lo-gtx-* and lo-s4-* compute nodes are connected to the InfiniBand high-performance interconnect and can access the GPFS storage system (/cluster/scratch, /cluster/work) and the NetApp storage system (/cluster/project, /cluster/home).

Accessing the cluster

Who can access the cluster

Access is restricted to Leonhard shareholders and groups that want to test it before investing. Guest users cannot access the Leonhard cluster.

SSH

You can find general information about how to access our clusters via SSH in the Accessing the clusters tutorial.

The command to access the Leonhard cluster via SSH is:

ssh username@login.leonhard.ethz.ch

where username corresponds to your ETH username.
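
The same address can also be used to copy files from your workstation to the cluster, for example with scp. A minimal sketch (output.tar.gz is a placeholder file name; the trailing colon places the file in your home directory):

 scp output.tar.gz username@login.leonhard.ethz.ch: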

Storage

You can find general information about the storage systems in the Storage systems tutorial.

Like on the Euler cluster, every user also has a home directory and a personal scratch directory:

/cluster/home/username
/cluster/scratch/username
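
For example, to stage an input file from your home directory into your personal scratch directory before running a job (a minimal sketch; input.dat is a placeholder file name):

 # $USER expands to your username, so the target is the personal scratch directory listed above
 cp $HOME/input.dat /cluster/scratch/$USER/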

Modules for preparing your environment

LMOD

For the Leonhard cluster, we decided to switch from the environment modules used on the Euler cluster to Lmod modules, which provide some additional features. You should barely notice the transition from environment modules to Lmod modules, as the commands are mostly the same. Please refer to the Setting up your environment tutorial for general documentation about the module commands.

[leonhard@lo-login-02 ~]$ module avail boost

----------------------------------------- /cluster/apps/lmodules/Compiler/gcc/4.8.5 ------------------------------------------
   boost/1.63.0

Use "module spider" to find all possible modules.
Use "module keyword key1 key2 ..." to search for all possible modules matching any of the "keys".


[leonhard@lo-login-02 ~]$ module load boost/1.63.0
[leonhard@lo-login-02 ~]$ module list

Currently Loaded Modules:
  1) gcc/4.8.5   2) StdEnv   3) boost/1.63.0

[leonhard@lo-login-02 ~]$ 

Please note that this is a work in progress and the module names might change. Currently, the number of software packages provided on Leonhard is not comparable to what we provide on the Euler cluster, but it will grow over time.

Hierarchical modules

LMOD allows defining a hierarchy of modules with three layers (Core, Compiler, MPI). The core layer contains all module files that do not depend on any compiler or MPI library. The compiler layer contains all modules that depend on a particular compiler, but not on any MPI library. The MPI layer contains modules that depend on a particular compiler/MPI combination.

When you log in to the Leonhard cluster, the standard compiler gcc/4.8.5 is automatically loaded. Running the module avail command displays all modules that are available for gcc/4.8.5. If you would like to see the modules available for a different compiler, for instance gcc/6.3.0, then you need to load that compiler module and run module avail again. To see the available modules for the gcc/4.8.5 and openmpi/2.1.0 combination, you would load the corresponding compiler and MPI modules and run module avail again.
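
For example, a short sketch using the versions mentioned above (output omitted):

 module avail                          # modules available for the default gcc/4.8.5
 module load gcc/6.3.0
 module avail                          # modules available for gcc/6.3.0
 module load gcc/4.8.5 openmpi/2.1.0
 module avail                          # modules for the gcc/4.8.5 + openmpi/2.1.0 combination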

As a consequence of the module hierarchy, you can never have two different versions of the same module loaded at the same time. This helps avoid problems arising from a misconfigured environment.


Batch system

Leonhard uses the same LSF batch system as the Euler cluster. You can find some general information about the batch system in the Using the batch system tutorial.

Unless otherwise specified, jobs requesting up to 36 cores will run on a single node. Regular nodes have 36 cores and 128 or 512 GB of RAM (of which about 90 and 460 GB, respectively, are usable).

Unlike Euler, requested memory is strictly enforced as a memory limit.
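
What counts is the memory actually used by your job, including its page cache, and all processes of a job running on the same node share one memory pool. For example, with a job submitted as

 bsub -n 16 -R "rusage[mem=1024] span[ptile=8]" mpirun ./my_job

the 8 MPI ranks placed on each node can together use up to 8 GB of RAM.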

Submitting GPU jobs

All GPUs in Leonhard are configured in Exclusive Process mode. The GPU nodes have 20 cores, 8 GPUs, and 256 GB of RAM (of which only about 210 GB is usable). To run a multi-node job, you will need to request span[ptile=20].

The LSF batch system has partially integrated support for GPUs. To use GPUs in a job, you need to request the ngpus_excl_p resource. It refers to the number of GPUs per node; this is unlike other resources, which are requested per core.

For example, to run a serial job with one GPU,

bsub -R "rusage[ngpus_excl_p=1]" ./my_cuda_program

or on a full node with all eight GPUs and up to 90 GB of RAM,

bsub -n 20 -R "rusage[mem=4500,ngpus_excl_p=8]" ./my_cuda_program

or on two full nodes:

bsub -n 40 -R "rusage[mem=4500,ngpus_excl_p=8] span[ptile=20]" ./my_cuda_program

While your job will see all GPUs of a node, LSF sets the CUDA_VISIBLE_DEVICES environment variable, which is honored by CUDA programs, so that only the GPUs assigned to the job are used.
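
To check which GPUs were assigned to a job, you can print the variable from within the job itself; a minimal sketch:

 bsub -R "rusage[ngpus_excl_p=1]" "echo CUDA_VISIBLE_DEVICES=\$CUDA_VISIBLE_DEVICES"

The escaped \$ makes sure the variable is expanded on the compute node rather than on the login node.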

TensorFlow example

As an example of running a TensorFlow job on a GPU node, we print out the TensorFlow version, the string Hello, TensorFlow! and the result of a simple matrix multiplication:

[leonhard@lo-login-01 ~]$ cd testrun/python
[leonhard@lo-login-01 python]$ module load python_gpu/2.7.13
[leonhard@lo-login-01 python]$ cat tftest1.py
#!/usr/bin/env python
from __future__ import print_function
import tensorflow as tf

vers = tf.__version__
print(vers)
hello = tf.constant('Hello, TensorFlow!')
matrix1 = tf.constant([[3., 3.]])
matrix2 = tf.constant([[2.],[2.]])
product = tf.matmul(matrix1, matrix2)

sess = tf.Session()
print(sess.run(hello))
print(sess.run(product))
sess.close()
[leonhard@lo-login-01 python]$ bsub -n 1 -W 4:00 -R "rusage[mem=2048, ngpus_excl_p=1]" python tftest1.py
Generic job.
Job <10620> is submitted to queue <gpu.4h>.
[leonhard@lo-login-01 python]$ bjobs
JOBID      USER      STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
10620      leonhard  PEND  gpu.4h     lo-login-01             *tftest.py Sep 28 08:02
[leonhard@lo-login-01 python]$ bjobs
JOBID      USER      STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
10620      leonhard  RUN   gpu.4h     lo-login-01 lo-gtx-001  *ftest1.py Sep 28 08:03
[leonhard@lo-login-01 python]$ bjobs
No unfinished job found
[leonhard@lo-login-01 python]$ grep -A3 "Creating TensorFlow device" lsf.o10620
2017-09-28 08:08:43.235886: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1080, pci bus id: 0000:04:00.0)
1.3.0
Hello, TensorFlow!
 12.
[leonhard@lo-login-01 python]$

Please note that your job will crash if you run the GPU version of TensorFlow on a CPU node, because TensorFlow checks at startup whether the compute node has a GPU driver.

Third-party applications

Python on Leonhard

Because certain Python packages need different installations for their CPU and GPU versions, we provide separate Python installations for CPU and for GPU usage.

 CPU version:  module load python_cpu/3.6.1
 GPU version:  module load python_gpu/3.6.1

TensorFlow

On Leonhard, we provide several versions of TensorFlow. The following combinations are available:

CPU

 Module command                  TensorFlow version
 module load python_cpu/2.7.12   Python 2.7.12, TensorFlow 1.2.1
 module load python_cpu/2.7.13   Python 2.7.13, TensorFlow 1.3
 module load python_cpu/2.7.14   Python 2.7.14, TensorFlow 1.7
 module load python_cpu/3.6.0    Python 3.6.0, TensorFlow 1.2.1
 module load python_cpu/3.6.1    Python 3.6.1, TensorFlow 1.3
 module load python_cpu/3.6.4    Python 3.6.4, TensorFlow 1.7

GPU

 Module command                  TensorFlow version
 module load python_gpu/2.7.12   Python 2.7.12, TensorFlow 1.2.1, CUDA 8.0.61, cuDNN 5.1
 module load python_gpu/2.7.13   Python 2.7.13, TensorFlow 1.3, CUDA 8.0.61, cuDNN 6.0
 module load python_gpu/2.7.14   Python 2.7.14, TensorFlow 1.7, CUDA 9.0.176, cuDNN 7.0
 module load python_gpu/3.6.0    Python 3.6.0, TensorFlow 1.2.1, CUDA 8.0.61, cuDNN 5.1
 module load python_gpu/3.6.1    Python 3.6.1, TensorFlow 1.3, CUDA 8.0.61, cuDNN 6.0
 module load python_gpu/3.6.4    Python 3.6.4, TensorFlow 1.7, CUDA 9.0.176, cuDNN 7.0

If you would like to run a TensorFlow job on a CPU node, load a CPU version of TensorFlow; to run it on a GPU node, load a GPU version.
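
A minimal sketch combining the module commands above with the bsub syntax from the GPU section (my_tf_script.py is a placeholder for your own script):

 # job for a CPU node
 module load python_cpu/3.6.1
 bsub -n 1 -W 4:00 -R "rusage[mem=2048]" python my_tf_script.py

 # job for a GPU node
 module load python_gpu/3.6.1
 bsub -n 1 -W 4:00 -R "rusage[mem=2048, ngpus_excl_p=1]" python my_tf_script.py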

Troubleshooting

I can no longer access the Leonhard Open cluster

Some users were only granted temporary access to the Leonhard Open cluster for a course or a project that is limited in time. The HPC group does not manage membership of the shareholder groups on Leonhard; the shareholder groups are defined through custom ETH groups. If you could access Leonhard Open for some time and can no longer do so, then you were most likely removed from the custom ETH group that was used to define the share. If this is the case, then please contact the local IT support group (ISG) of your department for further information, as they manage the custom ETH groups.

I get an error message when running my software that uses a GPU

If you are getting error messages about not finding a CUDA library, for instance:

ImportError: libcuda.so.1: cannot open shared object file: No such file or directory

or

ImportError: libcublas.so.9.0: cannot open shared object file: No such file or directory

then you are most likely running the software on a login node or on a compute node without GPUs.

If you would like to run software that requires access to the GPU driver, then you need to submit it as a batch job and request a GPU from the batch system.

Cluster is missing the h5py python package

The h5py Python package is linked against the HDF5 library; therefore, you also need to load the HDF5 module so that h5py can locate the HDF5 libraries.

[leonhard@lo-s4-019 ~]$ module load python_gpu/3.6.1 hdf5/1.10.1
[leonhard@lo-s4-019 ~]$ python
Python 3.6.1 (default, Sep 27 2017, 13:27:13)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-11)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import h5py
>>> h5py.__version__
'2.7.1'
>>>