Job monitoring
From ScientificComputing
Latest revision as of 09:26, 1 October 2021
The most frequent job monitoring operations are:
- Check the job status with bjobs or bbjobs
- Check the job screen output with bpeek
- Kill a job with bkill
bjobs
After you submit a job, it waits in a queue until it can run on a compute node. While it waits, it has the PENDING status (PEND in the STAT column).
    $ bjobs
    JOBID      USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
    161182423  jarunan PEND  normal.4h  eu-login-43             *cho hello Jan 22 06:01
Once the job is running on a compute node, it has the RUNNING status (RUN in the STAT column) and the EXEC_HOST column shows the node.
    $ bjobs
    JOBID      USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
    161182423  jarunan RUN   normal.4h  eu-login-43 eu-ms-005-0 *cho hello Jan 22 06:01
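
As a minimal sketch of how the STAT column can be read in a script, the helper below prints the status of one job from a default bjobs listing. The function name job_status is hypothetical, and the sketch assumes the default column order shown above (job ID first, STAT third). The sample text is copied from the example so the snippet also runs on a machine without LSF.

```shell
# Hypothetical helper: print the STAT column for one job ID.
# Reads a default bjobs listing from standard input.
job_status() {
    awk -v id="$1" '$1 == id { print $3 }'
}

# Sample listing copied from the example above. On a cluster one would
# pipe the live output instead:  bjobs | job_status 161182423
sample='JOBID      USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
161182423  jarunan RUN   normal.4h  eu-login-43 eu-ms-005-0 *cho hello Jan 22 06:01'

printf '%s\n' "$sample" | job_status 161182423   # prints RUN
```

Because the job ID and status are always the first and third fields, the same helper works for a pending job, where the EXEC_HOST column is still empty.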
bjobs options | Description |
---|---|
(no option) | list all your jobs in all queues |
-p | list only pending (waiting) jobs and indicate why they are pending |
-r | list only running jobs |
-d | list only done jobs (finished within the last hour) |
-l | display status in long format |
-w | display status in wide format |
-o "format" | use custom output format (see LSF documentation for details) |
-J jobname | show only job(s) called jobname |
-q queue | show only jobs in a specific queue |
job-ID(s) | list of job-IDs (this must be the last option) |
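
Passing a job ID to bjobs makes it easy to wait for one job in a script. The following is a rough sketch, not an official tool: wait_for_job and POLL_SECONDS are names invented here, and the loop assumes that a job which is neither PEND nor RUN no longer needs watching (this includes finished jobs, but also suspended ones).

```shell
# Hypothetical helper: poll bjobs until the job is no longer pending or running,
# then print its last known status.
# POLL_SECONDS is an assumed knob for this sketch (default: 30 seconds).
wait_for_job() {
    job_id=$1
    while :; do
        # STAT is the third column of the default bjobs listing.
        stat=$(bjobs "$job_id" 2>/dev/null | awk -v id="$job_id" '$1 == id { print $3 }')
        case $stat in
            PEND|RUN) sleep "${POLL_SECONDS:-30}" ;;
            *) break ;;   # any other status, or no listing at all, ends the loop
        esac
    done
    echo "${stat:-UNKNOWN}"
}

# Usage on a cluster (job ID taken from the example above):
#   wait_for_job 161182423
```

Polling every 30 seconds is an arbitrary choice; a longer interval puts less load on the batch system.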
bbjobs
bbjobs displays job information in a more human-friendly format than bjobs. The examples below show the same job first in PENDING and then in RUNNING status.
PENDING status
    $ bbjobs
    Job information
      Job ID                      : 161182479
      Status                      : PENDING
      User                        : jarunanp
      Queue                       : normal.4h
      Command                     : sleep 10; echo hello
      Working directory           : $HOME/-
    Requested resources
      Requested cores             : 1
      Requested runtime           : 4 h 0 min
      Requested memory            : 1024 MB per core
      Requested scratch           : not specified
      Dependency                  : -
    Job history
      Submitted at                : 06:03 2021-01-22
      Queue wait time             : 18 sec
RUNNING status
    $ bbjobs
    Job information
      Job ID                      : 161182479
      Status                      : RUNNING
      Running on node             : eu-ms-025-27
      User                        : jarunanp
      Queue                       : normal.4h
      Command                     : sleep 10; echo hello
      Working directory           : $HOME/-
    Requested resources
      Requested cores             : 1
      Requested runtime           : 4 h 0 min
      Requested memory            : 1024 MB per core
      Requested scratch           : not specified
      Dependency                  : -
    Job history
      Submitted at                : 06:03 2021-01-22
      Started at                  : 06:03 2021-01-22
      Queue wait time             : 20 sec
    Resource usage
      Updated at                  : 06:04 2021-01-22
      Wall-clock                  : 4 sec
      Tasks                       : 4
      Total CPU time              : 0 sec
      CPU utilization             : 0.0 %
      Sys/Kernel time             : 0.0 %
      Total resident Memory       : 2 MB
      Resident memory utilization : 0.2 %
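
Since bbjobs prints one "label : value" pair per line, a single field can be pulled out with sed. This is a sketch under assumptions: bbjobs_field is a hypothetical name, the exact spacing of real bbjobs output may differ from the shortened sample below, and the label is used as a regular expression, so it should not contain special characters.

```shell
# Hypothetical helper: print the value of one bbjobs field.
# $1 is the label to the left of the colon, e.g. "Status" or "Queue wait time".
# Reads bbjobs output from standard input.
bbjobs_field() {
    sed -n "s/^[[:space:]]*$1[[:space:]]*:[[:space:]]*//p"
}

# Shortened sample based on the RUNNING example above. On a cluster one would
# pipe the live output instead:  bbjobs | bbjobs_field "Status"
sample='Job information
  Job ID            : 161182479
  Status            : RUNNING
  Running on node   : eu-ms-025-27
Job history
  Submitted at      : 06:03 2021-01-22
  Queue wait time   : 20 sec'

printf '%s\n' "$sample" | bbjobs_field "Status"            # prints RUNNING
printf '%s\n' "$sample" | bbjobs_field "Queue wait time"   # prints 20 sec
```

Only the text up to the first colon after the label is stripped, so values that contain colons themselves, such as the timestamps, are kept whole.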
bpeek
Use bpeek to display the standard output that a job has produced so far:
    $ bpeek jobID
To keep following the output as it grows, add the -f option:
    $ bpeek -f jobID
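
For long-running jobs the full output can be large. A small wrapper, sketched here with the hypothetical name peek_tail, shows only the last few lines by piping bpeek through the standard tail utility.

```shell
# Hypothetical helper: show only the last N lines of a job's standard output.
# $1 is the job ID, $2 the number of lines (default: 20).
peek_tail() {
    bpeek "$1" | tail -n "${2:-20}"
}

# Usage on a cluster (job ID taken from the bkill example on this page):
#   peek_tail 161182774 5
```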
bkill
Use bkill to terminate a submitted job:
    $ bkill 161182774
    Job <161182774> is being terminated
bkill options | Description |
---|---|
job-ID | kill job-ID |
0 | kill all jobs (yours only) |
-J jobname | kill most recent job called jobname |
-J jobname 0 | kill all jobs called jobname |
-q queue | kill most recent job in queue |
-q queue 0 | kill all jobs in queue |
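
Because the trailing 0 kills every matching job, it can be worth previewing the command first. The wrapper below is only a sketch: kill_jobs_by_name and the DRY_RUN variable are names made up for this example, and the actual kill uses the "-J jobname 0" form from the table above.

```shell
# Hypothetical helper: kill all of your jobs called JOBNAME.
# With DRY_RUN=1 (an assumed variable for this sketch) it only prints the command.
kill_jobs_by_name() {
    name=$1
    if [ -z "$name" ]; then
        echo "usage: kill_jobs_by_name JOBNAME" >&2
        return 2
    fi
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "would run: bkill -J $name 0"
        return 0
    fi
    bkill -J "$name" 0
}

# Usage on a cluster (myjob is a placeholder job name):
#   DRY_RUN=1 kill_jobs_by_name myjob   # preview only
#   kill_jobs_by_name myjob             # actually kill the jobs
```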
Job control commands
Job control commands | Description |
---|---|
busers | user limits, number of pending and running jobs |
bqueues | queues status (open/closed; active/inactive) |
bjobs | information, in more or less detail, about pending, running and recently finished jobs |
bbjobs | more human-friendly version of bjobs |
bhist | information about jobs that finished in the last hours or days |
bpeek | display the standard output of a given job |
lsf_load | show the CPU load of all nodes used by a job |
bjob_connect | log in to a node where your job is running |
bkill | kill a job |
bbjobs, lsf_load and bjob_connect are specific to the HPC clusters at ETH and are not standard LSF commands.
Further reading
- User guide: Using the batch system - Job monitoring