Job chaining
Introduction
There are several use cases that can require you to set up a job chain. You may want to split a very long calculation into a series of jobs that each fit within the allowed run-time limits on the cluster, or you may have a workflow whose jobs are not independent of each other. For instance, if you want to run two jobs and the second one requires information generated by the first one, you need to make sure that job 2 does not start before job 1 has finished.
Obvious (and not recommended) solution
A simple way to chain two jobs is to add an sbatch job2.sh command at the end of the job script job1.sh. This solution is not recommended because it is error-prone and may lead to infinite loops, for example when a script resubmits itself without a stopping condition.
One typical problem is that job2 inherits the environment variables of job1, which may conflict with its own variables or those set by the batch system. This type of error may go undetected for a few jobs, for example if each job appends some directories to $PATH, and only cause hard-to-diagnose failures once this variable hits the maximum length allowed by the shell.
Using dependency conditions
A more robust solution is to use dependency conditions, e.g. job2 should start only when job1 is done, job3 after job2, etc. This is done using sbatch --dependency (short form: -d):
myjobid=$(sbatch --parsable -J job1 --wrap="command1")
sbatch -J job2 -d afterany:$myjobid --wrap="command2"
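All jobs in a series may be submitted at once. The job ID of the first job needs to be stored in a variable that is then used to define the job dependency of the second job. The --parsable option makes sbatch print only the job ID (followed by a semicolon and the cluster name, if one is present), so its output can be captured directly.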
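The same pattern extends to longer series. The following is a minimal sketch, not a site-provided tool: the function name submit_chain and the job names step1, step2, ... are hypothetical, and it assumes bash and that each step can be passed as a single command string to --wrap. Each job is made to depend on the one submitted just before it.

```shell
#!/bin/bash
# Hypothetical helper: submit each argument as one job, chained with afterany.
# Prints the job ID of every submitted job, one per line.
submit_chain() {
    local prev_id="" job_id cmd i=1
    for cmd in "$@"; do
        if [ -z "$prev_id" ]; then
            # First job in the chain: no dependency.
            job_id=$(sbatch --parsable -J "step$i" --wrap="$cmd") || return 1
        else
            # Later jobs wait until the previous job has terminated.
            job_id=$(sbatch --parsable -J "step$i" -d "afterany:$prev_id" --wrap="$cmd") || return 1
        fi
        # --parsable may print "jobid;cluster"; keep only the job ID.
        job_id=${job_id%%;*}
        echo "$job_id"
        prev_id=$job_id
        i=$((i + 1))
    done
}
```

It could be called as submit_chain "command1" "command2" "command3". All three jobs are submitted immediately, but the scheduler holds the second and third until their predecessor has terminated.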
There are different possible conditions:
- after:job_id[[+time][:jobid[+time]...]]
- This job can begin execution after the specified jobs have started or been cancelled and 'time' minutes have elapsed since that start or cancellation. If no 'time' is given, there is no delay after start or cancellation.
- afterany:job_id[:jobid...]
- This job can begin execution after the specified jobs have terminated. This is the default dependency type.
- afterburstbuffer:job_id[:jobid...]
- This job can begin execution after the specified jobs have terminated and any associated burst buffer stage out operations have completed.
- aftercorr:job_id[:jobid...]
- A task of this job array can begin execution after the corresponding task ID in the specified job has completed successfully (ran to completion with an exit code of zero).
- afternotok:job_id[:jobid...]
- This job can begin execution after the specified jobs have terminated in some failed state (non-zero exit code, node failure, timed out, etc). This job must be submitted while the specified job is still active or within MinJobAge seconds after the specified job has ended.
- afterok:job_id[:jobid...]
- This job can begin execution after the specified jobs have successfully executed (ran to completion with an exit code of zero). This job must be submitted while the specified job is still active or within MinJobAge seconds after the specified job has ended.
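As an illustration of how afterok and afternotok can be combined, here is a minimal sketch under stated assumptions: the function name submit_with_fallback and the job names main, on_success and on_failure are hypothetical, and the three commands are placeholders. One follow-up job runs only if the main job succeeds, the other only if it fails.

```shell
#!/bin/bash
# Hypothetical helper: submit a main job and two followers.
#   $1 = main command, $2 = command to run on success, $3 = command to run on failure
submit_with_fallback() {
    local main_cmd=$1 ok_cmd=$2 fail_cmd=$3 main_id
    main_id=$(sbatch --parsable -J main --wrap="$main_cmd") || return 1
    # --parsable may print "jobid;cluster"; keep only the job ID.
    main_id=${main_id%%;*}
    # Runs only if the main job finishes with exit code zero.
    sbatch -J on_success -d "afterok:$main_id" --wrap="$ok_cmd"
    # Runs only if the main job ends in a failed state.
    sbatch -J on_failure -d "afternotok:$main_id" --wrap="$fail_cmd"
}
```

A call might look like submit_with_fallback "command1" "command2" "cleanup_command". Only one of the two followers will ever run. Depending on the cluster configuration, the other may remain pending because its dependency can never be satisfied, in which case it has to be cancelled with scancel.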