Job arrays

From ScientificComputing
Revision as of 08:26, 21 September 2017 by Urbanb (talk | contribs) (Some clarifications.)


Introduction

Many cluster users are running embarrassingly parallel simulations consisting of hundreds or thousands of similar calculations, each one executing the same program but with slightly different — or random in the case of Monte-Carlo simulation — parameters. The usual approach is to submit each one as an independent job. This works fine, although keeping track of all these jobs is not easy, and can get quite complicated if these jobs must be executed in a coordinated fashion (e.g. master/slave). It would be much simpler if one could submit all these jobs at once, and manage them as a single entity. The good news is that it is indeed possible using a so-called job array. Jobs in an array have a common name and job-ID, plus a specific job-index ($LSB_JOBINDEX) corresponding to their position in the array. The name is mandatory, as it is used to define the range of the job array.

Submitting a job array

Let's take for example a simulation consisting of 4 independent calculations. Normally, one would submit them as 4 individual jobs:

bsub -J "calc 1" ./program [arguments]
bsub -J "calc 2" ./program [arguments]
bsub -J "calc 3" ./program [arguments]
bsub -J "calc 4" ./program [arguments]

or

for ((n=1;n<=4;n++)); do
    bsub -J "calc $n" ./program [arguments]
done

Using a job array, however, one can submit these calculations all at once:

bsub -J "calc[1-4]" ./program [arguments]

The option -J "calc[1-4]" defines both the name of the array, calc, and the number of its elements, 4.

[leonhard@euler05 ~]$ bsub -J "calc[1-4]" echo "Hello, I am an independent job"
Job array.
Job <33383740> is submitted to queue <normal.4h>.
[leonhard@euler05 ~]$ bjobs
JOBID      USER        STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
33383740   leonhard    PEND  normal.4h  euler05                 calc[1]    Dec  2 08:50
33383740   leonhard    PEND  normal.4h  euler05                 calc[2]    Dec  2 08:50
33383740   leonhard    PEND  normal.4h  euler05                 calc[3]    Dec  2 08:50
33383740   leonhard    PEND  normal.4h  euler05                 calc[4]    Dec  2 08:50

A job array creates only a single LSF logfile, which contains the stdout of all array elements. The job name, in this case calc, can also be used in a dependency condition. For example, to run an analysis job only after all calc calculations are done, submit it like this:

bsub -w "numdone(calc,*)" -J analyze "./program [arguments]"

When appropriate, the re-runnable option, -r, can be specified when submitting the job array. Calculations that fail because of a system failure are then re-run automatically. A calculation that fails on its own (segmentation fault, out of time, ...) is not re-run, and the exit status of the job array is considered unsuccessful.

Simulation parameters

Since all jobs in an array execute the same program (or script), you need to define specific parameters for each calculation. You can do this using different mechanisms:

  • create a different input file for each job
  • pass the job index as argument to the program
  • use environment variables set by LSF

Input and output files

One can use the special string %I in the job's input file name as a placeholder for the job's index in the array. For example:

bsub -i param.%I
bsub -i calc%I.in

The same mechanism also applies to the output file:

bsub -o result.%I
bsub -o calc%I.out

If the name of the input and/or output file does not contain %I, all jobs use the same input and/or output file.

The main drawback of this mechanism is that all jobs' input files must be created in advance.
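The input files do not have to be written by hand. A minimal sketch, assuming each calculation reads a single number from its input file; the parameter values are hypothetical placeholders, and the param.N naming follows the -i param.%I example above:

```shell
# Write one input file per array element: param.1 ... param.4.
# The parameter values are hypothetical placeholders.
n=1
for value in 0.1 0.2 0.5 1.0; do
    # Array element n reads param.n when submitted with "-i param.%I".
    echo "$value" > "param.$n"
    n=$((n + 1))
done
```

The loop must produce exactly as many files as the array has elements, otherwise some jobs will find no input file.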

Program arguments

A common case is to pass the parameter value (the array index $LSB_JOBINDEX) as a command-line argument. Here is an example for a MATLAB function with the parameter as its sole argument:

bsub -J "hello[1-4]" matlab -nodisplay -singleCompThread -r "my_function(\$LSB_JOBINDEX)"

Here it is important that the $ sign in front of LSB_JOBINDEX is escaped with a backslash: the variable must be evaluated when the job runs, not by the shell from which the job is submitted. This example would be equivalent to submitting 4 jobs in a row:

bsub -J "hello[1]" matlab -nodisplay -singleCompThread -r "my_function(1)"
bsub -J "hello[2]" matlab -nodisplay -singleCompThread -r "my_function(2)"
bsub -J "hello[3]" matlab -nodisplay -singleCompThread -r "my_function(3)"
bsub -J "hello[4]" matlab -nodisplay -singleCompThread -r "my_function(4)"
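The effect of the backslash can be checked without LSF. In the sketch below, echo stands in for the real program and LSB_JOBINDEX is set by hand to imitate what LSF does when an array element starts; both are assumptions made for illustration only:

```shell
# At submission time LSB_JOBINDEX is not set in the submitting shell.
unset LSB_JOBINDEX

# Unescaped: the submitting shell expands the variable immediately (to nothing).
unescaped="echo index=$LSB_JOBINDEX"
# Escaped: the literal text $LSB_JOBINDEX survives into the command string.
escaped="echo index=\$LSB_JOBINDEX"

# Imitate LSF running array element 3: the variable exists only now.
out_unescaped=$(LSB_JOBINDEX=3 sh -c "$unescaped")
out_escaped=$(LSB_JOBINDEX=3 sh -c "$escaped")

echo "unescaped: $out_unescaped"
echo "escaped:   $out_escaped"
```

Only the escaped variant sees the index, because its variable reference is still intact when the command is finally executed.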

Environment variables

Each job executes the same script or command, but can have its own input and output file (-i "in.%I" -o "out.%I"), where %I corresponds to the job index in the array. Inside the script, the variables $LSB_JOBINDEX and $LSB_JOBINDEX_END give the current job's index and the last index of the array (equal to the number of jobs when the range starts at 1), for example:

  bsub -J "hello[1-10]" "echo Hello, I am job \$LSB_JOBINDEX of \$LSB_JOBINDEX_END"
  bsub -w hello "echo Everybody is here"
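A job script can also use the index to look up its own parameter. The following is a sketch of such a script body, not a prescribed method; the temperatures table and the pick_param helper are hypothetical names, and the fallback to index 1 is only there so the script can be tried outside LSF:

```shell
# Hypothetical parameter table: array element n uses the n-th value.
temperatures="280 300 320 340"

# Print the parameter belonging to a given array index.
pick_param() {
    echo "$temperatures" | cut -d' ' -f"$1"
}

# Inside a job LSF sets LSB_JOBINDEX; default to 1 for a local test run.
index=${LSB_JOBINDEX:-1}
echo "element $index uses temperature $(pick_param "$index")"
```

Keeping the table inside the script avoids creating one input file per job, at the cost of editing the script whenever the parameters change.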

to be continued…

Rerunning failed jobs

One of the advantages of job arrays is that it is easy to rerun just the failed jobs. Simply run

brequeue -e JOBID

and LSF will resubmit just the failed jobs from the job array. This assumes that a successful job exits with exit code 0 while a failed job exits with a non-zero exit code.
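When the program is started from a wrapper script, the wrapper has to pass the program's exit code on, otherwise a failed calculation may look successful to LSF. A minimal sketch of this idea; run_step is a hypothetical stand-in for the real calculation and is made to fail for index 2:

```shell
# Hypothetical stand-in for the real calculation: fails for index 2.
run_step() {
    [ "$1" -ne 2 ]
}

# Wrapper: report a failure and return the program's exit code unchanged,
# so that 0 still means success and non-zero still means failure.
run_element() {
    run_step "$1"
    status=$?
    if [ "$status" -ne 0 ]; then
        echo "element $1 failed with exit code $status" >&2
    fi
    return "$status"
}
```

In a job script, the last lines would call run_element with $LSB_JOBINDEX and exit with its return value.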

You can combine this with a dependency condition on the entire job array:

bsub -J "calc[1-4]" [other bsub options] ./program [arguments]
# Note the JOBID
bsub -w "numended(JOBID,*)" "brequeue -e JOBID"

Any failed jobs will be requeued once all of the jobs have had a chance to run. The same JOBID will be reused when running the failed jobs a second time. This process can be repeated if necessary.
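Instead of noting the JOBID by hand, it can be extracted from the confirmation message that bsub prints. A sketch, assuming the message has the format shown in the submission example earlier on this page; job_id_from is a hypothetical helper name:

```shell
# Extract the numeric job ID from bsub's confirmation message.
# Lines that do not match (such as "Job array.") produce no output.
job_id_from() {
    echo "$1" | sed -n 's/^Job <\([0-9][0-9]*\)> is submitted.*/\1/p'
}

# Sample line copied from the bsub output shown earlier on this page.
line='Job <33383740> is submitted to queue <normal.4h>.'
jobid=$(job_id_from "$line")
echo "$jobid"
```

On the cluster, the sample line would be replaced by the captured output of the actual bsub command, and the resulting value substituted for JOBID in the commands above.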

An alternative is to create a dependency condition to rerun failed jobs in a new array job:

bsub -J "calc[1-4]" [other bsub options] ./program [arguments]
bsub -J "calc_two[1-4]" -w "exit(calc[*])" [other bsub options] ./program [arguments]

See also