Leonhard beta testing

Revision as of 12:12, 17 May 2017

The Leonhard cluster is available for early-access beta testing.

Please read through the following to get started.

Accessing the cluster

Who can access the cluster

Access to the Leonhard cluster is restricted to Leonhard shareholders and prospective Leonhard shareholders. Guest users cannot access the Leonhard cluster.

SSH

Users can access the Leonhard cluster via SSH.

ssh USERNAME@login.leonhard.ethz.ch

where USERNAME needs to be replaced with your NETHZ username.
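If you connect frequently, a host alias in your SSH client configuration can shorten the command. A minimal sketch, assuming the standard OpenSSH client ("leonhard" is an arbitrary alias and USERNAME is a placeholder for your NETHZ username):

```shell
# Append a host alias for the Leonhard login node to the SSH client config.
# "leonhard" is an arbitrary alias; USERNAME is a placeholder.
mkdir -p ~/.ssh
cat >> ~/.ssh/config << 'EOF'
Host leonhard
    HostName login.leonhard.ethz.ch
    User USERNAME
EOF
```

Afterwards, ssh leonhard is equivalent to the full command above.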

Lmod modules

For the Leonhard cluster, we decided to switch from the environment modules used on the Euler cluster to Lmod modules, which provide some useful features that environment modules lack. Users should barely notice the transition, as Lmod supports mostly the same commands as environment modules:

[leonhard@lo-login-02 ~]$ module list

Currently Loaded Modules:
  1) StdEnv

 

[leonhard@lo-login-02 ~]$ module avail openblas

 ------------------------------- /cluster/spack/lmodules -------------------------------
   gcc/4.8.5/openblas/0.2.19

Use "module spider" to find all possible modules.
Use "module keyword key1 key2 ..." to search for all possible modules matching any of
the "keys".
 

[leonhard@lo-login-02 ~]$ module load gcc/4.8.5/openblas/0.2.19
[leonhard@lo-login-02 ~]$ module list

Currently Loaded Modules:
  1) StdEnv   2) gcc/4.8.5/openblas/0.2.19

 

[leonhard@lo-login-02 ~]$

Please note that this is work in progress and module names might change. We are also planning to introduce a so-called module hierarchy, where users first load a compiler module; the module avail command then only shows modules that have been compiled with that particular compiler. In most cases, the hierarchy has three layers: a compiler, an MPI version (for serial applications, the MPI category will be serial) and the application itself:

COMPILER / MPI / APPLICATION
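With such a hierarchy in place, a job script would load modules layer by layer. A sketch with hypothetical module names and versions (none of these are confirmed to exist on Leonhard):

```shell
# Write a sketch job script illustrating the three hierarchy layers.
# All module names and versions below are assumptions, for illustration only.
cat > hierarchy_example.sh << 'EOF'
#!/bin/bash
module load gcc/6.3.0       # 1) compiler layer
module load openmpi/2.1.0   # 2) MPI layer ("serial" for non-MPI applications)
module load example_app     # 3) application layer (hypothetical name)
EOF
```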

Available software

In addition to switching from environment modules to Lmod modules, we are also setting up a new software stack based on the package manager Spack, which is developed at the Lawrence Livermore National Laboratory (LLNL). Currently, the number of software packages provided on Leonhard is not comparable to what we provide on the Euler cluster, but it will grow over time.

Storage

As on the Euler cluster, every user on Leonhard Open has a home directory and a personal scratch directory.

Submitting jobs

Leonhard uses the same LSF batch system as the Euler cluster.

Use the “bsub” command to submit a job and to specify the resources it needs. By default, a job gets 1 core and 1024 MB of RAM for 4 hours. Unless otherwise specified, jobs requesting no more than 36 cores will run on a single node. Regular nodes have 36 cores and 128 or 512 GB of RAM (of which about 90 and 460 GB, respectively, are usable).
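To override these defaults, state the resources explicitly when submitting. A sketch that writes such a submission line to a file (the values are arbitrary examples and ./my_program is a placeholder):

```shell
# Write a sketch submission script; the values are examples only:
# 8 cores, 2048 MB of RAM per core, and 24 hours of wall-clock time.
cat > submit_example.sh << 'EOF'
#!/bin/bash
bsub -n 8 -W 24:00 -R "rusage[mem=2048]" ./my_program
EOF
```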

Unlike on Euler, the requested memory is strictly enforced as a memory limit. For example, if you do not explicitly state a memory requirement, your program cannot use more than 1 GB of RAM per core. What counts is the actually used memory, including the page cache of your job. All processes of the same job on a node share the same memory pool. For example, with a job submitted as

bsub -n 16 -R "rusage[mem=1024] span[ptile=8]" mpirun ./my_job

the 8 MPI ranks on each node together can use up to 8 GB.
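The size of the per-node pool is simply the per-core request multiplied by the number of ranks per node:

```shell
# Per-node memory pool = ranks per node (ptile) * per-core rusage[mem], in MB.
ptile=8
mem_per_core=1024
echo "pool per node: $(( ptile * mem_per_core )) MB"   # prints: pool per node: 8192 MB
```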

Submitting GPU jobs

All GPUs in Leonhard are configured in Exclusive Process mode. The GPU nodes have 20 cores, 8 GPUs, and 256 GB of RAM (of which only about 210 GB are usable). To run a multi-node job, you need to request span[ptile=20].

The LSF batch system has partially integrated support for GPUs. To use GPUs, a job needs to request the ngpus_excl_p resource. Unlike other resources, which are requested per core, it refers to the number of GPUs per node.

For example, to run a serial job with one GPU,

bsub -R "rusage[ngpus_excl_p=1]" ./my_cuda_program

or on a full node with all eight GPUs and up to 90 GB of RAM,

bsub -n 20 -R "rusage[mem=4500,ngpus_excl_p=8]" ./my_cuda_program

or on two full nodes:

bsub -n 40 -R "rusage[mem=4500,ngpus_excl_p=8] span[ptile=20]" ./my_cuda_program

While your job will see all GPUs of a node, LSF sets the CUDA_VISIBLE_DEVICES environment variable, which is honored by CUDA programs.
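To see which GPUs were assigned, a job can print that variable before starting the actual program. A sketch wrapper (./my_cuda_program is a placeholder):

```shell
# Write a sketch wrapper that reports the GPUs assigned by LSF before
# starting the actual program. On a compute node LSF sets
# CUDA_VISIBLE_DEVICES; elsewhere it is typically unset ("none" fallback).
cat > gpu_wrapper.sh << 'EOF'
#!/bin/bash
echo "Assigned GPUs: ${CUDA_VISIBLE_DEVICES:-none}"
./my_cuda_program
EOF
chmod +x gpu_wrapper.sh
```

The wrapper could then be submitted in place of ./my_cuda_program in the examples above.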