Using the batch system

From ScientificComputing
Jump to: navigation, search

Introduction

On our HPC cluster, we use the IBM LSF (Load Sharing Facility) batch system. A basic knowledge of LSF is required if you would like to work on the HPC clusters. The present article will show you how to use LSF to execute simple batch jobs and give you an overview of some advanced features that can dramatically increase your productivity on a cluster.

Using a batch system has numerous advantages:

  • single system image — all computing resources in the cluster can be accessed from a single point
  • load balancing — the workload is automatically distributed across all available processors
  • exclusive use — many computations can be executed at the same time without affecting each other
  • prioritization — computing resources can be dedicated to specific applications or people
  • fair share — a fair allocation of those resources among all users is guaranteed

In fact, our HPC clusters contains so many processors (30,000) and are used by so many people (more than 2,000) that it would be impossible to use it efficiently without a batch system.

All computations on our HPC cluster must be submitted to the batch system. Please do not run any job interactively on the login nodes, except for testing or debugging purposes.

Basic job submission

Simple commands and programs

Submitting a job to the batch system is as easy as:

bsub command [arguments]
bsub /path/to/program [arguments] 

Examples:

[leonhard@euler03 ~]$ bsub gzip big_file.dat
Generic job.
Job <8146539> is submitted to queue <normal.4h>.
[leonhard@euler03 ~]$ bsub ./hello_world
Generic job.
Job <8146540> is submitted to queue <normal.4h>.

Two or more commands can be combined together by enclosing them in quotes:

bsub "command1; command2"

Example:

[leonhard@euler03 ~]$ bsub "configure; make; make install"
Generic job.
Job <8146541> is submitted to queue <normal.4h>.

Quotes are also necessary if you want to use I/O redirection (">", "<"), pipes ("|") or conditional operators ("&&", "||"):

bsub "command < data.in > data.out"
bsub "command1 | command2"

Examples:

[leonhard@euler03 ~]$ bsub "tr ',' '\n' < comma_separated_list > linebreak_separated_list"
Generic job.
Job <8146542> is submitted to queue <normal.4h>.
[leonhard@euler03 ~]$ bsub "cat unsorted_list_with_redundant_entries | sort | uniq > sorted_list"
Generic job.
Job <8146543> is submitted to queue <normal.4h>.

Shell scripts

More complex commands may be placed in a shell script, which should then be submitted like this:

bsub < script

Example:

[leonhard@euler03 ~]$ bsub < hello.sh
Generic job.
Job <8146544> is submitted to queue <normal.4h>.

In principle, it is also possible to submit a script as if it were a program:

bsub /path/to/scriptBAD IDEA!

however this syntax is strongly discouraged on our clusters because it does not allow the batch system to "see" what your script is doing, which may lead to errors in the submission and/or execution of your job.

Output file

By default your job's output (or standard output, to be precise) is written into a file named lsf.oJobID in the directory where you executed bsub, where JobID is the number assigned to your job by LSF. You can select a different output file using the option:

bsub -o output_file command [argument]

The option -o output_file tells LSF to append your job's output to output_file. If you want to overwrite this file, use:

bsub -oo output_file ...

Note that this option, like all bsub options, must be placed before the command that you want to execute in your job. A common mistake is to place bsub options in the wrong place, like.

bsub command -o output_fileWRONG!

Batch interactive job

If you just want to run a quick test, you can submit it as a batch interactive job. In this case the job's output is not written into a file, but directly to your terminal, as if it were executed interactively:

bsub -I command [arguments]

Example:

[leonhard@euler03 ~]$ bsub -I "env | sort"
Generic job.
Job <8146545> is submitted to queue <normal.4h>.
<<Waiting for dispatch ...>

Resource requirements

By default, a batch job can use only one processor for up to 4 hours. (The job is killed when it reaches its run-time limit.) If your job needs more resources — time, processors, memory or scratch space —, you must request them when you submit it.

Wall-clock time

The time limits on our clusters are always based on wall-clock (or elapsed) time. You can specify the amount of time needed by your job using the option:

bsub -W minutes ...                  example:  bsub -W 90 ...
bsub -W HH:MM ...                    example:  bsub -W 1:30 ...

Examples:

[leonhard@euler03 ~]$ bsub -W 20 ./Riemann_zeta -arg 26
Generic job.
Job <8146546> is submitted to queue <normal.4h>.
[leonhard@euler03 ~]$ bsub -W 20:00 ./solve_Koenigsberg_bridge_problem
Generic job.
Job <8146547> is submitted to queue <normal.24h>.

Since our clusters contains processors with different speeds two similar jobs will not necessarily take the same time to complete. It is therefore safer to request more time than strictly necessary... but not too much, for shorter jobs have generally a higher priority than longer ones.

The maximum run-time for jobs that can run on most compute nodes in the cluster is 240 hours. We remain the right to stop jobs with a run time of more than 5 days in case of an emergency maintenance.

Number of processor cores

If your job requires multiple processors (or threads), you must request them using the option:

bsub -n number_of_procs ...

Note that merely requesting multiple processors does not mean that your application will use them.

Memory

By default the batch system allocates 1024 MB (1 GB) of memory per processor core. A single-core job will thus get 1 GB of memory; a 4-core job will get 4 GB; and a 16-core job, 16 GB. If your computation requires more memory, you must request it when you submit your job:

bsub -R "rusage[mem=XXX]" ...

Example:

[leonhard@euler03 ~]$ bsub -R "rusage[mem=2048]" ./evaluate_gamma -precision 10e-30
Generic job.
Job <8146548> is submitted to queue <normal.4h>.

where XXX is the amount of memory needed by your job, in MB per processor.

Scratch space

LSF automatically creates a local scratch directory when your job starts and deletes it when the job ends. This directory has a unique name, which is passed to your job via the variable $TMPDIR.

Unlike memory, the batch system does not reserve any disk space for this scratch directory by default. If your job is expected to write large amounts of temporary data (say, more than 250 MB) into $TMPDIR — or anywhere in the local /scratch file system — you must request enough scratch space when you submit it:

bsub -R "rusage[scratch=YYY]" ...

Example:

[leonhard@euler03 ~]$ bsub -R "rusage[scratch=5000]" ./generating_Euler_numbers -num 5000000
Generic job.
Job <8146548> is submitted to queue <normal.4h>.

where YYY is the amount of scratch space needed by your job, in MB per processor.

Note that /tmp is reserved for the operating system. Do not write temporary data there! You should either use the directory created by LSF ($TMPDIR) or create your own temporary directory in the local /scratch file system; in the latter case, do not forget to delete this directory at the end of your job.

Multiple requirements

It is possible to combine memory and scratch requirements:

bsub -R "rusage[mem=XXX]" -R "rusage[scratch=YYY]" ...

is equivalent to:

bsub -R "rusage[mem=XXX,scratch=YYY]" ...

LSF submission line advisor

For users that are not yet very experienced with using a batch system, we provide a small helper tool, which simplifies to setup the command for requesting resources from the batch system in order to submit a job.

https://scicomp.ethz.ch/lsf_submission_line_advisor

GPU

Please note that currently only the Leonhard cluster contains GPUs, but the Euler cluster does not. Unlike Euler, which is open to all members of ETH without restriction, Leonhard is reserved exclusively to the groups who have invested in it (the so-called shareholders). Therefore the following information is only relevant for Leonhard shareholders.

All GPUs in Leonhard are configured in Exclusive Process mode. The GPU nodes have 20 cores, 8 GPUs, and 256 GB of RAM (of which only about 210 GB is usable). To run multi-node job, you will need to request span[ptile=20].

The LSF batch system has partial integrated support for GPUs. To use the GPUs for a job node you need to request the ngpus_excl_p resource. It refers to the number of GPUs per node. This is unlike other resources, which are requested per core.

For example, to run a serial job with one GPU,

bsub -R "rusage[ngpus_excl_p=1]" ./my_cuda_program

or on a full node with all eight GPUs and up to 90 GB of RAM,

bsub -n 20 -R "rusage[mem=4500,ngpus_excl_p=8]" ./my_cuda_program

or on two full nodes:

bsub -n 40 -R "rusage[mem=4500,ngpus_excl_p=8] span[ptile=20]" ./my_cuda_program

While your jobs will see all GPUs, LSF will set the CUDA_VISIBLE_DEVICES environment variable, which is honored by CUDA programs.

For advanced settings, please have a look at our getting started with GPUs page.

Parallel job submission

Before submitting parallel jobs, please make sure that your application can run in parallel at all in order to not waste resources by requesting multiple cores for a serial application. Further more, please do a short scaling analysis to see how well your code scales in parallel before requesting dozens or hundreds of cores.


OpenMP

If your application is parallelized using OpenMP or linked against a library using OpenMP (Intel MKL, OpenBLAS, etc.), the number of processors (or threads) that it can use is controlled by the environment variable OMP_NUM_THREADS. This variable must be set before you submit your job:

export OMP_NUM_THREADS=number_of_processors
bsub -R "span[ptile=number_of_processors]" -n number_of_processors ...

NOTE: if OMP_NUM_THREADS is not set, your application will either use one processor only, or will attempt to use all processors that it can find, stealing them from other jobs if needed. In other words, your job will either use too few or too many processors.

MPI

Three kinds of MPI libraries are available on our cluster: Open MPI (recommended), MVAPICH2 and Intel MPI. Before you can submit and execute an MPI job, you must load the corresponding modules (compiler + MPI, in that order):

module load compiler
module load mpi_library

The command used to launch an MPI application is mpirun.

Let's assume for example that hello_world was compiled with PGI 15.1 and linked with Open MPI 1.6.5. The command to execute this job on 4 processors is:

module load pgi/15.1
module load open_mpi/1.6.5
bsub -n 4 mpirun ./hello_world

Note that mpirun automatically uses all processors allocated to the job by LSF. It is therefore not necessary to indicate this number again to the mpirun command itself:

bsub -n 4 mpirun -np 4 ./hello_world      ←  "-np 4" not needed!

Euler III nodes are targeted to serial and shared-memory parallel jobs, but multi-node parallel jobs are still accepted.

You need to tell the system that Infiniband is not available,

module load interconnect/ethernet

before loading the MPI module. Then you need to request at most four cores per node:

bsub -R "span[ptile=4] select[maxslots==4]" [other bsub options] ./my_command
Open MPI
Open MPI 1.6.5 has been tested to work with acceptable performance.
Open MPI 2.0.2 has been tested to work
MVAPICH2
MVAPICH2 2.1 works but preliminary results show low scalability. You need to load the interconnect/ethernet module.
Intel MPI
Intel MPI 5.1.3 has been tested.

Pthreads and other threaded applications

Their behavior is similar to OpenMP applications. It is important to limit the number of threads that the application spawns. There is no standard way to do this, so be sure to check the application's documentation on how to do this. Usually a program supports at least one of four ways to limit itself to N threads:

  • it understands the OMP_NUM_THREADS=N environment variable,
  • it has its own environment variable, such as GMX_NUM_THREADS=N for Gromacs,
  • it has a command-line option, such as -nt N (for Gromacs), or
  • it has an input-file option, such as num_threads N.

If you are unsure about the program's behavior, please contact us and we will analyze it.

Hybrid jobs

It is possible to run hybrid jobs that mix MPI and OpenMP on our HPC clusters, but we strongly recommend to not submit these kind of jobs.

Full-node jobs

(Only on Euler) If you need to run a job that will use a full node, then request the fullnode resource and request a number of cores that is a multiple of 24 or 36 cores:

bsub -n 48 -R fullnode ./my_job

Such a job will only run on two 24-core nodes.

Job monitoring

Please find below a table with commands for job monitoring and job control

Command Description
busers user limits, number of pending and running jobs
bqueues queues status (open/closed; active/inactive)
bjobs more or less detailed information about pending, running and recently finished jobs
bbjobs better bjobs (bjobs with human readable output)
bhist information about jobs that finished in the last hours/days
bpeek display the standard output of a given job
lsf_load show the CPU load of all nodes used by a job
bjob_connect login to a node where one of your jobs is running
bkill kill a job

For an overview on the most common options for the LSF commands, please have a look at the LSF mini reference.

bjobs

The bjobs command allows you to get information about pending, running and shortly finished jobs.

bbjobs

The command bbjobs can be used to see the resource request and usage (cpu, memory, swap, etc.) of any specific job.

bbjobs [-u username -r -a -s -d -p -f -l -P] JOBID
Option Description
(no option) List your jobs — information, requested resources and usage.
-u username user username.
-r Show only running jobs.
-a Show all jobs.
-s Show only suspended jobs.
-d Show only jobs that ended recently (done).
-p Show only pending jobs.
-f Show job cpu affinity, which cores it is running.
-l Show job information in log format.

Example of output for bbjobs:

[leonhard@euler08 ~]$ bbjobs 31989961
Job information
 Job ID                          : 31989961
 Status                          : RUNNING
 Running on node                 : e1268 
 User                            : leonhard
 Queue                           : normal.4h
 Command                         : compute_pq.py
 Working directory               : $HOME/testruns
Requested resources
 Requested cores                 : 1
 Requested memory                : 1024 MB per core
 Requested scratch               : not specified
 Dependency                      : -
Job history
 Submitted at                    : 08:45 2016-11-15
 Started at                      : 08:48 2016-11-15
 Queue wait time                 : 140 sec
Resource usage
 Updated at                      : 08:48 2016-11-15
 Wall-clock                      : 34 sec
 Tasks                           : 4
 Total CPU time                  : 5 sec
 CPU utilization                 : 80.0 %
 Sys/Kernel time                 : 0.0 %
 Total resident memory           : 2 MB
 Resident memory utilization     : 0.2 %

bjob_connect

Sometimes it is necessary to monitor the job on the node(s) where it is running. On Euler, compute nodes can not be accessed directly via ssh. To access a node where a job is running the tool bjob_connect should be used.

bjob_connect JOBID [SSH OPTIONS]

The tool will connect directly to the node where the job is running. In the case of multi-node runs, a list of nodes will be printed and one should be chosen to be accessed.

Connections to nodes created via bjob_connect must be ended explicitly (exit from terminal) by the user when done with job monitoring.

Troubleshooting

Bsub rejects my job

If the error message is not self-explanatory, then please report it to the cluster support.

My job is stick in the queue since XXX hours/days

Please try to find out, why the job is pending. You can do this with the following command.

bjobs -p

Individual host-based reasons means that the resources requested by your jobs are not available at this time. Some resources may never become available ( e.g. mem=10000000). Some resource requirements may be mutually exclusive.

My job was sent to the purgatory queue

The purgatory queue is designed to catch jobs that were not submitted properly, either due to a user error or a bug in the batch system. Please always report this type of problem to the cluster support.