Leonhard beta testing


The Leonhard cluster is available for early-access beta testing.

Please read through the following to get started.

Submitting jobs

Leonhard uses the same LSF batch system as the Euler cluster.

Use the “bsub” command to submit a job and to specify the resources it needs. By default, a job gets 1 core and 1024 MB of RAM for 4 hours.
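For example, to request 4 cores, 2 GB of RAM per core and a run time of 24 hours for a hypothetical program ./my_program, a submission could look roughly like

bsub -n 4 -W 24:00 -R "rusage[mem=2048]" ./my_program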

Unlike on Euler, the requested memory is strictly enforced as a memory limit. For example, if you do not explicitly request memory, your program cannot use more than 1 GB of RAM per core. What counts is the memory actually used, including the page cache of your job. All processes of the same job on a node share a single memory pool. For example, with a job submitted as

bsub -n 16 -R "rusage[mem=1024] span[ptile=8]" mpirun ./my_job

the 8 MPI ranks placed on each node share a pool of 8 GB of RAM (8 cores × 1024 MB).

Requesting GPUs

All GPUs in Leonhard are configured in Exclusive Process mode.
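If you want to verify this from within a job, nvidia-smi can report the compute mode; assuming nvidia-smi is available on the node, a quick check could be

nvidia-smi --query-gpu=compute_mode --format=csv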

The LSF batch system has partially integrated support for GPUs. To use GPUs in a job, you need to request the ngpus_excl_p resource. Unlike other resources, which are requested per core, it refers to the number of GPUs per node.

For example, to run a serial job with one GPU,

bsub -R "rusage[ngpus_excl_p=1]" ./my_cuda_program

or on a full node with all eight GPUs and up to 90 GB of RAM,

bsub -n 20 -R "rusage[mem=4500,ngpus_excl_p=8]" ./my_cuda_program

or on two full nodes:

bsub -n 40 -R "rusage[mem=4500,ngpus_excl_p=8] span[ptile=20]" ./my_cuda_program

Although your job will see all GPUs of a node, LSF sets the CUDA_VISIBLE_DEVICES environment variable to the GPUs assigned to your job, and this variable is honored by CUDA programs.
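To see which GPUs were assigned to your job, you can, for instance, print the variable from within the job itself; a minimal sketch (the command string and program name are only placeholders) is

bsub -R "rusage[ngpus_excl_p=1]" 'echo $CUDA_VISIBLE_DEVICES; ./my_cuda_program'

The single quotes prevent the variable from being expanded at submission time, so it is evaluated on the compute node where the job runs.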