Job monitoring
The most frequent job monitoring operations are:
- Check the job status with bjobs or bbjobs
- Check the job's screen output with bpeek
- Kill a job with bkill
bjobs
After you submit a job, it waits in a queue until it can run on a compute node. While waiting, it has the PENDING status (PEND in the STAT column):

    $ bjobs
    JOBID      USER     STAT  QUEUE      FROM_HOST    EXEC_HOST    JOB_NAME    SUBMIT_TIME
    161182423  jarunan  PEND  normal.4h  eu-login-43               *cho hello  Jan 22 06:01
Once the job starts on a compute node, its status changes to RUNNING (RUN), and the EXEC_HOST column shows the node it runs on:

    $ bjobs
    JOBID      USER     STAT  QUEUE      FROM_HOST    EXEC_HOST    JOB_NAME    SUBMIT_TIME
    161182423  jarunan  RUN   normal.4h  eu-login-43  eu-ms-005-0  *cho hello  Jan 22 06:01
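The change from PENDING to RUNNING, and later to DONE or EXIT, can also be followed from a script. The sketch below polls bjobs until a job has finished and then prints its final status. It assumes an LSF version whose bjobs accepts `-noheader` together with `-o stat` (as in LSF 10.1); the function name `wait_for_job` and its 60-second default polling interval are hypothetical choices. Polling very often puts unnecessary load on the batch system, so keep the interval generous.

```shell
# Hypothetical helper: wait until an LSF job has finished, then print its final status.
# Usage: wait_for_job JOBID [SECONDS_BETWEEN_POLLS]
# Assumes bjobs accepts "-noheader" and "-o stat" (custom output, LSF 10.1 or later).
wait_for_job() {
    local jobid="$1"
    local interval="${2:-60}"
    local stat
    while true; do
        # Keep only the first word of the STAT column; error messages are discarded.
        stat=$(bjobs -noheader -o stat "$jobid" 2>/dev/null | awk 'NR == 1 { print $1 }')
        case "$stat" in
            DONE|EXIT)
                echo "$stat"              # job has ended (successfully or not)
                return 0
                ;;
            PEND|RUN|PSUSP|USUSP|SSUSP|WAIT|PROV)
                sleep "$interval"         # still pending, running or suspended
                ;;
            *)
                echo "UNKNOWN"            # not listed (any more) or unexpected output
                return 1
                ;;
        esac
    done
}
```

For example, `wait_for_job 161182423` would print DONE or EXIT once the job has ended. Since bjobs only keeps finished jobs for a limited time, a job that ended long ago is reported as UNKNOWN and the function returns 1.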
bjobs options | Description |
---|---|
(no option) | list all your jobs in all queues |
-p | list only pending (waiting) jobs and show why they are pending |
-r | list only running jobs |
-d | list only recently finished jobs (finished within the last hour) |
-l | display status in long format |
-w | display status in wide format (fields are not truncated) |
-o "format" | use a custom output format (see the LSF documentation for details) |
-J jobname | show only job(s) called jobname |
-q queue | show only jobs in a specific queue |
job-ID(s) | one or more job IDs (must come after all other options) |
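The -o option combines well with the other filters and with standard text tools. As an illustration, the hypothetical helper `job_summary` below counts your jobs per status; any extra arguments, such as `-q normal.4h` or `-J jobname`, are passed on to bjobs. It assumes bjobs supports `-noheader` and `-o stat`, as in LSF 10.1.

```shell
# Hypothetical helper: count your jobs per status (PEND, RUN, ...).
# Usage: job_summary [bjobs filter options, e.g. -q normal.4h or -J jobname]
# Assumes bjobs accepts "-noheader" and "-o stat" (LSF 10.1 or later).
job_summary() {
    bjobs -noheader -o stat "$@" 2>/dev/null |
        awk 'NF > 0 { count[$1]++ } END { for (s in count) print s, count[s] }' |
        sort
}
```

With two pending and three running jobs, `job_summary` would print `PEND 2` and `RUN 3` on separate lines.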
bbjobs
bbjobs displays job information in a more human-friendly format than bjobs. Below are examples for a job in the PENDING and in the RUNNING status.

PENDING status:

    $ bbjobs
    Job information
      Job ID                      : 161182479
      Status                      : PENDING
      User                        : jarunanp
      Queue                       : normal.4h
      Command                     : sleep 10; echo hello
      Working directory           : $HOME/-
    Requested resources
      Requested cores             : 1
      Requested runtime           : 4 h 0 min
      Requested memory            : 1024 MB per core
      Requested scratch           : not specified
      Dependency                  : -
    Job history
      Submitted at                : 06:03 2021-01-22
      Queue wait time             : 18 sec

RUNNING status:

    $ bbjobs
    Job information
      Job ID                      : 161182479
      Status                      : RUNNING
      Running on node             : eu-ms-025-27
      User                        : jarunanp
      Queue                       : normal.4h
      Command                     : sleep 10; echo hello
      Working directory           : $HOME/-
    Requested resources
      Requested cores             : 1
      Requested runtime           : 4 h 0 min
      Requested memory            : 1024 MB per core
      Requested scratch           : not specified
      Dependency                  : -
    Job history
      Submitted at                : 06:03 2021-01-22
      Started at                  : 06:03 2021-01-22
      Queue wait time             : 20 sec
    Resource usage
      Updated at                  : 06:04 2021-01-22
      Wall-clock                  : 4 sec
      Tasks                       : 4
      Total CPU time              : 0 sec
      CPU utilization             : 0.0 %
      Sys/Kernel time             : 0.0 %
      Total resident Memory       : 2 MB
      Resident memory utilization : 0.2 %
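Because bbjobs prints one `Name : value` pair per line, a single field can be extracted with awk, for instance to check the CPU utilization of a running job. This is a sketch that assumes the layout shown in the examples (field name, padding spaces, ` : `, value); bbjobs is an ETH-specific wrapper, so its output format may change, and the helper name `bbjobs_field` is hypothetical.

```shell
# Hypothetical helper: print the value of one field from bbjobs output.
# Usage: bbjobs_field JOBID "CPU utilization"
# Assumes each field line looks like "  Name   : value", as in the examples.
bbjobs_field() {
    bbjobs "$1" | awk -v key="$2" '{
        i = index($0, " : ")            # first " : " separates name from value
        if (i == 0) next                # skip section titles such as "Job history"
        name = substr($0, 1, i - 1)
        value = substr($0, i + 3)
        gsub(/^[ \t]+|[ \t]+$/, "", name)
        gsub(/^[ \t]+|[ \t]+$/, "", value)
        if (name == key) print value
    }'
}
```

Searching for the first ` : ` keeps values that contain colons, such as `Submitted at : 06:03 2021-01-22`, intact.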
bpeek
Use bpeek to display the standard output that an unfinished job has produced so far:

    $ bpeek jobID

To keep displaying the output as it grows, similar to tail -f:

    $ bpeek -f jobID
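When a job writes a lot of output, piping bpeek through standard tools keeps the display manageable. A minimal sketch with two hypothetical helpers: `peek_tail` shows only the last lines (20 by default) and `peek_grep` shows only the lines matching a pattern, for example progress messages.

```shell
# Hypothetical helpers built on bpeek (a standard LSF command).
# Usage: peek_tail JOBID [N]      -> last N lines (default 20) of the output so far
#        peek_grep JOBID PATTERN  -> only the lines that match PATTERN
peek_tail() {
    bpeek "$1" | tail -n "${2:-20}"
}

peek_grep() {
    bpeek "$1" | grep -- "$2"
}
```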
bkill
Use bkill to terminate a submitted job, whether it is pending or running:

    $ bkill 161182774
    Job <161182774> is being terminated
bkill options | Description |
---|---|
job-ID | kill the job with this ID |
0 | kill all of your jobs |
-J jobname | kill the most recent job called jobname |
-J jobname 0 | kill all jobs called jobname |
-q queue | kill the most recent job in queue |
-q queue 0 | kill all jobs in queue |
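When none of these selectors matches what you need, bkill can be combined with a bjobs filter. For example, to cancel only the jobs that are still waiting in the queue while leaving running ones alone, the sketch below passes the IDs reported by `bjobs -p` to bkill. It assumes that `-p` can be combined with `-noheader -o jobid` (LSF 10.1) and that bkill accepts several job IDs at once; the function name `kill_pending` is hypothetical. Check the list with `bjobs -p` before using it.

```shell
# Hypothetical helper: kill only your pending jobs; running jobs are left alone.
# Usage: kill_pending [bjobs filter options, e.g. -q normal.4h or -J jobname]
# Assumes bjobs accepts "-p" together with "-noheader -o jobid" (LSF 10.1 or later)
# and that bkill accepts several job IDs. Not meant for job arrays: killing an
# array's job ID would also kill its running elements.
kill_pending() {
    local ids
    ids=$(bjobs -p -noheader -o jobid "$@" 2>/dev/null)
    if [ -z "$ids" ]; then
        echo "No pending jobs found."
        return 0
    fi
    # $ids is deliberately unquoted so that each job ID becomes its own argument.
    bkill $ids
}
```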
Job control commands
Command | Description |
---|---|
busers | user limits, number of pending and running jobs |
bqueues | queue status (open/closed; active/inactive) |
bjobs | more or less detailed information about pending, running and recently finished jobs |
bbjobs | more readable version of bjobs |
bhist | history of jobs, including jobs finished in the last hours/days |
bpeek | display the standard output of a given job |
lsf_load | show the CPU load of all nodes used by a job |
bjob_connect | log in to a node where your job is running |
bkill | kill a job |
Of these, bbjobs, lsf_load and bjob_connect are specific to the HPC clusters at ETH and are not standard LSF commands.