Job chaining

Introduction

There are several use cases that may require you to set up a job chain. You might want to split a very long calculation into a series of jobs that fit within the allowed run-time limits on the cluster, or you might have a workflow whose jobs are not independent of each other. For instance, if you run two jobs and the second one requires information generated by the first one, you need to make sure that job 2 does not start before job 1 has finished.

Obvious (and not recommended) solution

A simple way to chain two jobs is to add a bsub < job2 command at the end of job1. This solution is not recommended because it is error-prone and may lead to infinite loops.

One typical problem is that job2 inherits the environment variables of job1, which may conflict with its own variables or those set by the batch system. This type of error may go undetected for a few jobs, for example if each job appends some directories to $PATH, and then cause weird problems when this variable hits the maximum length allowed by the shell.
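One way to keep a variable such as $PATH from growing from job to job is to append a directory only when it is not already present. The following is a minimal sketch of such a guard, not part of the recipe described here; the function name append_path_once is hypothetical.

```shell
# Hypothetical helper: append a directory to PATH only if it is missing,
# so that a PATH inherited from a previous job is not lengthened again.
append_path_once() {
    case ":$PATH:" in
        *":$1:"*) ;;              # already present: leave PATH unchanged
        *) PATH="$PATH:$1" ;;     # absent: append it once
    esac
}
```

Calling append_path_once several times with the same directory leaves a single entry in $PATH. This only addresses the growing-variable symptom; it does not make the bsub-at-the-end approach safe.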

Using dependency conditions

A more robust solution is to use dependency conditions, e.g. job2 should start only when job1 is done, job3 after job2, etc. This is done using bsub -w (wait):

bsub -J job1 command1
bsub -J job2 -w "done(job1)" command2
bsub -J job3 -w "done(job2)" command3

All jobs in a series may be submitted at once. Each job must be given a name (option -J) that will be used to define the dependency condition of the subsequent job.

The condition "done(job1)" is true only if job1 completed successfully. If job1 crashed or was killed by LSF when it reached its run-time limit, the dependency condition becomes "invalid or never satisfied" and job2 will not be executed, ever. (Invalid jobs stay in the queue until they are deleted, which is done periodically.)

Use the condition "ended(job1)" if job2 ought to be executed no matter what happened to job1.
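For longer chains, the bsub lines can be generated in a loop. The sketch below assumes that every job runs the same command and that the jobs are named by a prefix followed by a counter; the function name submit_chain and its argument order are hypothetical. The first argument selects the condition, done or ended.

```shell
# Hypothetical helper: submit <count> jobs named <prefix>1 ... <prefix>N,
# each one waiting for the previous job.
# Usage: submit_chain <done|ended> <prefix> <count> <command> [args...]
submit_chain() {
    cond=$1
    prefix=$2
    count=$3
    shift 3
    i=1
    while [ "$i" -le "$count" ]; do
        if [ "$i" -eq 1 ]; then
            # The first job has no predecessor, hence no -w option.
            bsub -J "${prefix}1" "$@" || return 1
        else
            prev=$((i - 1))
            bsub -J "${prefix}${i}" -w "${cond}(${prefix}${prev})" "$@" || return 1
        fi
        i=$((i + 1))
    done
}
```

For example, submit_chain done job 3 command1 issues the same kind of bsub calls as the three lines shown earlier, with command1 used for every job.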

If job1, job2, job3 are merely iterations of the same program, it may be more convenient to use a single name for all jobs, for example "job_chain"; in that case the dependency is based on the order in which the jobs were submitted:

bsub -J job_chain command
bsub -J job_chain -w "done(job_chain)" command
bsub -J job_chain -w "done(job_chain)" command

In the example above, command is generally a shell script that will retrieve data from the previous job, check if there was any error, prepare the input for the current job and execute it. If the script detects an error, it should kill itself and all subsequent jobs in the chain using the command:

bkill -J job_chain 0

The special job ID "0" (zero) means all jobs submitted under the specified name, i.e. the whole chain.
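A wrapper of this kind might look like the sketch below. It is only an illustration under stated assumptions: the marker files step.started and step.ok, the function name run_step, and the idea of passing the real program as arguments are all hypothetical, and the step that retrieves data and prepares input is left out because it depends on the application.

```shell
# Hypothetical wrapper for one step of the chain.
# Assumption: all steps run in the same working directory and use the
# marker files step.started and step.ok to record their state.
# Usage: run_step <program> [args...]
run_step() {
    # A previous step that started but left no step.ok has failed.
    if [ -e step.started ] && [ ! -e step.ok ]; then
        echo "previous step failed, killing the chain" >&2
        bkill -J job_chain 0
        return 1
    fi
    touch step.started
    rm -f step.ok
    if "$@"; then
        touch step.ok          # record success for the next step
    else
        echo "this step failed, killing the chain" >&2
        bkill -J job_chain 0   # kills this job and all later ones
        return 1
    fi
}
```

The script submitted as command would then end with a call such as run_step ./my_program input.dat, where the program name and its argument are placeholders.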

Fixing chains after crashed jobs

Let's say you have submitted the above three jobs (with job IDs 1001, 1002 and 1003) and the first one (1001) crashes. The second job (1002) will then wait forever because its dependency condition can never be satisfied. There are two ways to solve this problem, depending on your application:

  • ignore the failed job and continue with the waiting ones or
  • resubmit the failed job and have the waiting one continue after the new job is done.

To ignore the failed job, use the bmod command to remove the dependency condition of the waiting job:

bmod -wn jobid

or bmod -wn 1002 for the example.

To resubmit the failed job, first submit it again with no dependency condition:

bsub -J job_chain command

(let's say this job gets jobid 2001) and then modify the dependency condition of the second job:

bmod -w "done(jobid_of_new_job)" jobid_of_waiting_job

or bmod -w "done(2001)" 1002 for the example.
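The two steps can be combined so that the new job ID does not have to be copied by hand. The sketch below assumes that bsub prints a line of the form "Job <ID> is submitted to queue <name>." on standard output; if your installation prints something different, the sed expression needs adapting. The function name relink_chain is hypothetical.

```shell
# Hypothetical helper: resubmit a failed job and make the waiting job
# depend on the new one.
# Usage: relink_chain <jobid_of_waiting_job> <command> [args...]
relink_chain() {
    waiting=$1
    shift
    # Assumption: bsub reports "Job <ID> is submitted ..."; keep only the ID.
    new_id=$(bsub -J job_chain "$@" | sed -n 's/^Job <\([0-9][0-9]*\)>.*/\1/p')
    if [ -z "$new_id" ]; then
        echo "could not determine the ID of the resubmitted job" >&2
        return 1
    fi
    bmod -w "done($new_id)" "$waiting"
}
```

With the job IDs used in this section, relink_chain 1002 command resubmits the first job and then points job 1002 at the newly assigned ID.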

See also

Dependency conditions in administering Platform LSF