Leonhard beta testing

The Leonhard cluster is available for early-access beta testing.

Please read through the following to get started.

Submitting jobs

Leonhard uses the same LSF batch system as the Euler cluster.

Use the “bsub” command to submit a job and to specify the resources it needs. By default, a job gets 1 core and 1024 MB of RAM for 4 hours.
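For example, a sketch of a submission that overrides these defaults (my_program is a placeholder): -n requests cores, -W sets the wall-clock limit in hours:minutes, and mem is the memory per core in MB,

bsub -n 4 -W 24:00 -R "rusage[mem=2048]" ./my_program

which asks for 4 cores, 24 hours of run time, and 2048 MB of RAM per core.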

Unlike Euler, requested memory is strictly enforced as a memory limit. For example, if you do not specifically state a memory requirement, your program cannot use more than 1 GB of RAM per core. What counts is the memory actually used, including your job's page cache. All processes from the same job on a node share the same memory pool. For example, with a job submitted as

bsub -n 16 -R "rusage[mem=1024] span[ptile=8]" mpirun ./my_job

all 8 MPI ranks on a single node together share a pool of up to 8 GB (8 ranks per node × 1024 MB per core).

Requesting GPUs

All GPUs in Leonhard are configured in Exclusive Process mode, meaning only a single process can use each GPU at a time.

The LSF batch system has partially integrated support for GPUs. To use GPUs in a job, you need to request the ngpus_excl_p resource. Unlike other resources, which are requested per core, it refers to the number of GPUs per node.

For example, to run a serial job with one GPU,

bsub -R "rusage[ngpus_excl_p=1]" ./my_cuda_program

or on a full node with all eight GPUs and up to 90 GB of RAM (20 cores × 4500 MB per core),

bsub -n 20 -R "rusage[mem=4500,ngpus_excl_p=8]" ./my_cuda_program

or on two full nodes, where span[ptile=20] places exactly 20 cores on each node:

bsub -n 40 -R "rusage[mem=4500,ngpus_excl_p=8] span[ptile=20]" ./my_cuda_program
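These resource requests can also be kept in the job script itself as #BSUB directives, which LSF reads when the script is passed to bsub on standard input. A sketch of the full-node example in this form (my_job.sh and my_cuda_program are placeholders):

#!/bin/bash
#BSUB -n 20
#BSUB -R "rusage[mem=4500,ngpus_excl_p=8]"
./my_cuda_program

submitted with

bsub < my_job.sh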

While your job can technically see all GPUs in a node, LSF sets the CUDA_VISIBLE_DEVICES environment variable, which is honored by CUDA programs, so that your job only uses the GPUs assigned to it.

Known problem: We have seen instances of LSF not assigning GPUs to jobs that have requested them. In such a case your job will start but will see no GPUs. We are investigating this issue but do not have a fix yet. Please report these issues to us if you encounter them.
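To check which GPUs your job was actually given, you can print the variable from inside the job. A sketch using an interactive job (nvidia-smi is assumed to be available on the GPU nodes; the single quotes matter, so that the variable is expanded inside the job rather than in your login shell):

bsub -I -R "rusage[ngpus_excl_p=1]" bash -c 'echo $CUDA_VISIBLE_DEVICES; nvidia-smi'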