Job arrays
Revision as of 10:16, 12 March 2019
Introduction
Many cluster users are running embarrassingly parallel simulations consisting of hundreds or thousands of similar calculations, each one executing the same program but with slightly different — or random in the case of Monte-Carlo simulation — parameters. The usual approach is to submit each one as an independent job. This works fine, although keeping track of all these jobs is not easy, and can get quite complicated if these jobs must be executed in a coordinated fashion (e.g. master/slave). It would be much simpler if one could submit all these jobs at once, and manage them as a single entity. The good news is that it is indeed possible using a so-called job array. Jobs in an array have a common name and job-ID, plus a specific job-index ($LSB_JOBINDEX) corresponding to their position in the array. The name is mandatory, as it is used to define the range of the job array.
Submitting a job array
Let's take for example a simulation consisting of 4 independent calculations. Normally, one would submit them as 4 individual jobs:
bsub -J "calc 1" ./program [arguments]
bsub -J "calc 2" ./program [arguments]
bsub -J "calc 3" ./program [arguments]
bsub -J "calc 4" ./program [arguments]
or
for ((n=1;n<=4;n++)); do
    bsub -J "calc $n" ./program [arguments]
done
Using a job array, however, one can submit these calculations all at once:
bsub -J "calc[1-4]" ./program [arguments]
The option -J "calc[1-4]" defines both the name of the array, calc, and the number of its elements, 4.
[leonhard@euler05 ~]$ bsub -J "calc[1-4]" echo "Hello, I am an independent job"
Job array.
Job <33383740> is submitted to queue <normal.4h>.
[leonhard@euler05 ~]$ bjobs
JOBID      USER      STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
33383740   leonhard  PEND  normal.4h  euler05                 calc[1]    Dec  2 08:50
33383740   leonhard  PEND  normal.4h  euler05                 calc[2]    Dec  2 08:50
33383740   leonhard  PEND  normal.4h  euler05                 calc[3]    Dec  2 08:50
33383740   leonhard  PEND  normal.4h  euler05                 calc[4]    Dec  2 08:50
A job array creates only a single LSF logfile, which contains the stdout of all array elements. A job array can also be used in a dependency condition. For example, to run an analysis job only after all 4 calc calculations are done, submit it like this:
bsub -w "numdone(33383740,*)" -J analyze "./program [arguments]"
Please note that for dependency conditions regarding job arrays, you need to specify the job ID (job names do not work). You can extract the job ID after submitting a job by parsing the stdout of bsub. For example:
bsub test | awk '/is submitted/{print substr($2, 2, length($2)-2);}'
or
jobid=$(bsub test | awk '/is submitted/{print substr($2, 2, length($2)-2);}')
if [ -n "$jobid" ]; then
    bsub -w "numdone($jobid,*)" ./followup_job
fi
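The parsing step can be tried without submitting anything. The following sketch wraps the awk expression in a shell function (the name extract_jobid is hypothetical) and feeds it the submission message shown earlier instead of the output of a real bsub call.

```shell
#!/bin/bash
# extract_jobid: hypothetical helper that reads bsub's stdout on stdin
# and prints the job ID found between "<" and ">" in the second field.
extract_jobid() {
    awk '/is submitted/{print substr($2, 2, length($2)-2);}'
}

# Sample message copied from the example output; no job is submitted here.
sample_output='Job <33383740> is submitted to queue <normal.4h>.'

jobid=$(printf '%s\n' "$sample_output" | extract_jobid)
echo "$jobid"
```

If the message does not match (for example because the submission was rejected), the function prints nothing, which is why the script above tests for a non-empty jobid before submitting the follow-up job.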
When appropriate, the re-runnable option, -r, can be specified when submitting the job array. If any calculation fails due to a system failure, it is automatically re-run. If a calculation fails on its own (segmentation fault, out of time, ...), it is not re-run and the exit status of the job array is considered unsuccessful.
Limiting the number of jobs that are allowed to run at the same time
A job array allows a large number of jobs to be submitted with one command, potentially flooding a system. Job slot limits provide a way to limit the impact a job array may have on a system. You can set this limit by adding %job_slot_limit after specifying the range of the array:
bsub -J "calc[1-10000]%10" echo "Hello, I am an independent job"
In this example the array contains 10000 elements and at most 10 jobs are allowed to run at the same time.
Simulation parameters
Since all jobs in an array execute the same program (or script), you need to define specific parameters for each calculation. You can do this using different mechanisms:
- create a different input file for each job
- pass the job index as argument to the program
- use environment variables set by LSF
Input and output files
One can use the special string %I in the job's input file name as a placeholder for the job's index in the array. For example:
bsub -i param.%I
bsub -i calc%I.in
The same mechanism also applies to the output file:
bsub -o result.%I
bsub -o calc%I.out
If the name of the input and/or output file does not contain %I, all jobs use the same input and/or output file.
The main drawback of this mechanism is that all jobs' input files must be created in advance.
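One way to prepare the input files is a short loop run before submission. The sketch below assumes four array elements and a purely illustrative file content (one seed value per file); only the param.N naming is taken from the -i param.%I example above.

```shell
#!/bin/bash
# Create one input file per array element before submitting the array.
# NJOBS and the file content are assumptions chosen for illustration.
NJOBS=4
for i in $(seq 1 "$NJOBS"); do
    # Produces param.1 ... param.4, matching "bsub -i param.%I"
    echo "seed=$((1000 + i))" > "param.$i"
done
echo "created $NJOBS input files"
```

Each array element then reads the file whose suffix matches its own index.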
Program arguments
A common case is to pass the parameter value (the array index $LSB_JOBINDEX) as a command-line argument. Here is an example for a MATLAB function with the parameter as its sole argument:
bsub -J "hello[1-4]" matlab -nodisplay -singleCompThread -r "my_function(\$LSB_JOBINDEX)"
It is important that the $ sign in front of LSB_JOBINDEX is escaped with a backslash (\$), because the variable must be evaluated at runtime rather than at submission time. This example is equivalent to submitting 4 jobs in a row:
bsub -J "hello[1]" matlab -nodisplay -singleCompThread -r "my_function(1)"
bsub -J "hello[2]" matlab -nodisplay -singleCompThread -r "my_function(2)"
bsub -J "hello[3]" matlab -nodisplay -singleCompThread -r "my_function(3)"
bsub -J "hello[4]" matlab -nodisplay -singleCompThread -r "my_function(4)"
You can specify the range for the job array by using the format
start-end:step
For example
bsub -J "testjob[10-20:2]" echo "\$LSB_JOBINDEX"
would create a job array with 6 elements that would be equivalent to submitting the following six commands:
bsub -J "testjob[10]" echo "10"
bsub -J "testjob[12]" echo "12"
bsub -J "testjob[14]" echo "14"
bsub -J "testjob[16]" echo "16"
bsub -J "testjob[18]" echo "18"
bsub -J "testjob[20]" echo "20"
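To check which indices a start-end:step range produces before submitting, the same sequence can be generated locally with seq. This is only a local preview and not an LSF command; note that seq takes its arguments in the order first, increment, last, which differs from the order in the range.

```shell
#!/bin/bash
# Local preview of the indices in "testjob[10-20:2]"; nothing is submitted.
# seq argument order: first, increment, last.
indices=$(echo $(seq 10 2 20))   # unquoted expansion joins the lines with spaces

# Count the elements of the array.
count=0
for i in $indices; do
    count=$((count + 1))
done

echo "indices: $indices"
echo "elements: $count"
```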
Environment variables
Each job executes the same script or command, but can have its own input and output file (-i "in.%I" -o "out.%I"), where %I corresponds to the job index in the array. The variables $LSB_JOBINDEX and $LSB_JOBINDEX_END can be used inside the script to find the current job's index and the last index of the array, which equals the number of jobs when the array starts at 1 with a step of 1. For example:
bsub -J "hello[1-10]" "echo Hello, I am job \$LSB_JOBINDEX of \$LSB_JOBINDEX_END"
bsub -w hello "echo Everybody is here"
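A common pattern is to use the job index to pick one line from a shared parameter file, so that no per-job input files are needed. The sketch below assumes a hypothetical file params.txt with one parameter set per line and a hypothetical helper params_for_index. LSB_JOBINDEX is set by LSF inside a running array job; the fallback value of 1 is an assumption added here so the script can also be tried outside LSF.

```shell
#!/bin/bash
# Hypothetical parameter file: line N holds the parameters for array element N.
printf '%s\n' "alpha=0.1" "alpha=0.2" "alpha=0.3" > params.txt

# params_for_index: hypothetical helper that prints line number $1 of params.txt.
params_for_index() {
    sed -n "${1}p" params.txt
}

# Inside an array job LSF sets LSB_JOBINDEX; default to 1 for a local dry run.
index=${LSB_JOBINDEX:-1}
params=$(params_for_index "$index")
echo "Job $index uses: $params"
```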
to be continued…
Group calculations into fewer jobs
Often the jobs within a job array are too short (anything below a few minutes) because every job in the array runs just one short calculation.
You can increase the throughput of your entire job array by grouping several calculations into fewer jobs instead of running a single calculation per job. Aim for each job to run for about half an hour, and for 5 minutes at the very least.
In the previous example, we showed how to run four matlab function calls (matlab -nodisplay -singleCompThread -r "my_function(\$LSB_JOBINDEX)") as a job array with four jobs. Now let us convert this to a job array with two jobs, each of which runs two of the function calls. In the first step we will put the matlab call into a script, run_my_function.sh:
#!/bin/bash
matlab -nodisplay -singleCompThread -r "my_function($LSB_JOBINDEX)"
which can be submitted by redirecting it to the bsub command:
bsub -J "hello[1-4]" < run_my_function.sh
So far nothing has changed except for how the command is passed to bsub. Note that there is no backslash before $LSB_JOBINDEX in the script. In the second step, change the run_my_function.sh script to run two matlab function calls by writing a for loop. Define the STEP variable to be the number of calculations to run in each job. In our case this is 2:
#!/bin/bash
STEP=2
for ((i=1;i<=$STEP;i++)); do
    MY_JOBINDEX=$((($LSB_JOBINDEX-1)*$STEP + $i))
    matlab -nodisplay -singleCompThread -r "my_function($MY_JOBINDEX)"
done
Note that we now pass MY_JOBINDEX instead of LSB_JOBINDEX to the my_function call so that each calculation gets its own unique index. Submit this script but tell LSF to run just two jobs in the job array (4 calculations/(2 calculations/job) = 2 jobs):
bsub -J "hello[1-2]" < run_my_function.sh
If the number of calculations to run is not divisible by the number of calculations per job (let's say we want to run 3 calculations per job), then expand the script to be as follows:
#!/bin/bash
STEP=3
MAXINDEX=4
for ((i=1;i<=$STEP;i++)); do
    MY_JOBINDEX=$((($LSB_JOBINDEX-1)*$STEP + $i))
    if [ $MY_JOBINDEX -gt $MAXINDEX ]; then
        break
    fi
    matlab -nodisplay -singleCompThread -r "my_function($MY_JOBINDEX)"
done
Submit this script and set the ending value of the array range to ceiling(MAXINDEX/STEP) = ceiling(4/3) = 2:
bsub -J "hello[1-2]" < run_my_function.sh
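The ceiling can be computed with integer arithmetic in the shell, and the index mapping can be checked locally before submitting. The sketch below reuses the STEP and MAXINDEX values from the script above; the outer loop variable job merely stands in for the values LSB_JOBINDEX would take, and nothing is submitted.

```shell
#!/bin/bash
STEP=3       # calculations per job
MAXINDEX=4   # total number of calculations

# Integer ceiling of MAXINDEX/STEP: add STEP-1 before dividing.
NJOBS=$(( (MAXINDEX + STEP - 1) / STEP ))
echo "jobs needed: $NJOBS"

# Show which calculation indices each job would run.
job=1
while [ "$job" -le "$NJOBS" ]; do
    line="job $job runs:"
    i=1
    while [ "$i" -le "$STEP" ]; do
        calc=$(( (job - 1) * STEP + i ))
        # Stop once the last calculation has been reached.
        if [ "$calc" -gt "$MAXINDEX" ]; then
            break
        fi
        line="$line $calc"
        i=$((i + 1))
    done
    echo "$line"
    job=$((job + 1))
done
```

The value of NJOBS is what goes into the upper bound of the array range, as in hello[1-2] above.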
Monitoring job arrays
You can monitor a job array with the bjobs command:
bjobs -J array_name              # all jobs in an array
bjobs jobID                      # all jobs in an array
bjobs -J "array_name[index]"     # specific job in an array
bjobs "jobID[index]"             # specific job in an array
Rerunning failed jobs
One of the advantages of job arrays is that it is easy to rerun just the failed jobs in a job array. Simply run
brequeue -e JOBID
and LSF will resubmit just the failed jobs from the job array. This assumes that a successful job exits with exit code 0 while a failed job exits with a non-zero exit code.
You can combine this with a dependency condition on the entire job array:
bsub -J "calc[1-4]" [other bsub options] ./program [arguments]
# Note the JOBID
bsub -w "numended(JOBID,*)" "brequeue -e JOBID"
Any failed jobs will be requeued once all of the jobs have had a chance to run. The same JOBID will be reused when running the failed jobs a second time. This process can be repeated if necessary.
An alternative is to create a dependency condition to rerun failed jobs in a new array job:
bsub -J "calc[1-4]" [other bsub options] ./program [arguments]
bsub -J "calc_two[1-4]" -w "exit(calc[*])" [other bsub options] ./program [arguments]
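Both approaches depend on a failed calculation actually exiting with a non-zero code. If a program does not report failure reliably, a wrapper script can derive the exit status from its results instead. The following is a minimal sketch under that assumption: check_result and the file name result.1 are hypothetical, and the echo line merely stands in for the real calculation.

```shell
#!/bin/bash
# check_result: hypothetical check that succeeds only if the given
# result file exists and is not empty.
check_result() {
    [ -s "$1" ]
}

# Stand-in for the real calculation; here it just writes a result file.
echo "42" > result.1

if check_result result.1; then
    status=0
else
    status=1
fi
echo "exit status would be $status"
# In a real job script, finish with: exit $status
```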