Job monitoring
From ScientificComputing
Latest revision as of 09:26, 1 October 2021
The most frequent job monitoring operations are:
- Check the job status with bjobs or bbjobs
- Check the job screen output with bpeek
- Kill a job with bkill
bjobs
After you submit a job, it waits in a queue until it can run on a compute node. While it waits, it has the PENDING status (PEND in the STAT column).

    $ bjobs
    JOBID      USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
    161182423  jarunan PEND  normal.4h  eu-login-43             *cho hello Jan 22 06:01

When the job is running on a compute node, it has the RUNNING status (RUN in the STAT column).

    $ bjobs
    JOBID      USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
    161182423  jarunan RUN   normal.4h  eu-login-43 eu-ms-005-0 *cho hello Jan 22 06:01

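The STAT column shown above can also be read from a script. The following is a minimal sketch, assuming the default bjobs column layout shown in the examples (job ID in column 1, status in column 3); `job_status` is a hypothetical helper name, not an LSF command.

```shell
#!/bin/sh
# job_status JOBID: read default `bjobs` output on stdin and print the
# STAT column of the given job. Prints nothing if the job is not listed.
job_status() {
    # NR > 1 skips the header line; column 1 is JOBID, column 3 is STAT.
    awk -v id="$1" 'NR > 1 && $1 == id { print $3 }'
}

# Demonstration with the sample output from this page (no LSF needed).
sample='JOBID      USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
161182423  jarunan RUN   normal.4h  eu-login-43 eu-ms-005-0 *cho hello Jan 22 06:01'
printf '%s\n' "$sample" | job_status 161182423   # prints RUN
```

On the cluster, the same function would be used as `bjobs | job_status 161182423`.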
bjobs options | Description |
---|---|
(no option) | list all your jobs in all queues |
-p | list only pending (waiting) jobs and indicate why they are pending |
-r | list only running jobs |
-d | list only done jobs (finished within the last hour) |
-l | display status in long format |
-w | display status in wide format |
-o "format" | use custom output format (see LSF documentation for details) |
-J jobname | show only job(s) called jobname |
-q queue | show only jobs in a specific queue |
job-ID(s) | list of job-IDs (this must be the last option) |
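A script sometimes needs to wait until a job has finished before it continues. The sketch below polls `bjobs` with the job ID as the last argument, as the table requires, and stops when the status is DONE or EXIT or when the job is no longer listed. `wait_for_job` is a hypothetical helper, the 30-second default interval is an arbitrary assumption, and the column positions are taken from the default output shown earlier. Keep the interval generous so that polling does not load the batch system.

```shell
#!/bin/sh
# wait_for_job JOBID [INTERVAL]: poll bjobs until the job reaches DONE or
# EXIT, or is no longer listed, then print the last status seen.
wait_for_job() {
    job_id="$1"
    interval="${2:-30}"          # seconds between polls (assumed default)
    while :; do
        # The job ID goes last on the bjobs command line.
        stat="$(bjobs "$job_id" 2>/dev/null |
                awk -v id="$job_id" 'NR > 1 && $1 == id { print $3; exit }')"
        case "$stat" in
            DONE|EXIT|"") break ;;           # finished, failed, or unknown
            *) sleep "$interval" ;;          # PEND, RUN, suspended, ...
        esac
    done
    printf '%s\n' "${stat:-UNKNOWN}"
}

# Usage on the cluster (not run here):
#   wait_for_job 161182423 60
```

The function only defines the loop; nothing contacts LSF until it is called.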
bbjobs
bbjobs displays more human-friendly information than bjobs. Here are examples in PENDING and RUNNING status.
PENDING status

    $ bbjobs
    Job information
      Job ID                     : 161182479
      Status                     : PENDING
      User                       : jarunanp
      Queue                      : normal.4h
      Command                    : sleep 10; echo hello
      Working directory          : $HOME/-
    Requested resources
      Requested cores            : 1
      Requested runtime          : 4 h 0 min
      Requested memory           : 1024 MB per core
      Requested scratch          : not specified
      Dependency                 : -
    Job history
      Submitted at               : 06:03 2021-01-22
      Queue wait time            : 18 sec

RUNNING status

    $ bbjobs
    Job information
      Job ID                     : 161182479
      Status                     : RUNNING
      Running on node            : eu-ms-025-27
      User                       : jarunanp
      Queue                      : normal.4h
      Command                    : sleep 10; echo hello
      Working directory          : $HOME/-
    Requested resources
      Requested cores            : 1
      Requested runtime          : 4 h 0 min
      Requested memory           : 1024 MB per core
      Requested scratch          : not specified
      Dependency                 : -
    Job history
      Submitted at               : 06:03 2021-01-22
      Started at                 : 06:03 2021-01-22
      Queue wait time            : 20 sec
    Resource usage
      Updated at                 : 06:04 2021-01-22
      Wall-clock                 : 4 sec
      Tasks                      : 4
      Total CPU time             : 0 sec
      CPU utilization            : 0.0 %
      Sys/Kernel time            : 0.0 %
      Total resident Memory      : 2 MB
      Resident memory utilization : 0.2 %

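Because bbjobs prints one `name : value` pair per line, a single field can be picked out with a small filter. This is a sketch that assumes the layout shown in the examples above; `bbjobs_field` is a hypothetical helper, and the sample text is an excerpt of the RUNNING example.

```shell
#!/bin/sh
# bbjobs_field NAME: read bbjobs output on stdin and print the value of the
# first line whose field name (the text before the first colon) equals NAME.
bbjobs_field() {
    awk -v key="$1" '
        {
            pos = index($0, ":")              # first colon separates name and value
            if (pos == 0) next                # section headings have no colon
            name = substr($0, 1, pos - 1)
            value = substr($0, pos + 1)
            gsub(/^[ \t]+|[ \t]+$/, "", name) # trim surrounding whitespace
            gsub(/^[ \t]+|[ \t]+$/, "", value)
            if (name == key) { print value; exit }
        }'
}

# Demonstration with an excerpt of the RUNNING example (no LSF needed).
sample_bbjobs='Job information
  Job ID                     : 161182479
  Status                     : RUNNING
Job history
  Submitted at               : 06:03 2021-01-22
Resource usage
  CPU utilization            : 0.0 %'
printf '%s\n' "$sample_bbjobs" | bbjobs_field "Status"            # prints RUNNING
printf '%s\n' "$sample_bbjobs" | bbjobs_field "CPU utilization"   # prints 0.0 %
```

Splitting at the first colon only keeps values such as `06:03 2021-01-22` intact.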
bpeek
Use bpeek to display the standard output of a given job:

    $ bpeek jobID

To follow the output as it grows:

    $ bpeek -f jobID

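The output of bpeek can be piped into ordinary text tools, for example to check whether a running job has already printed an error message. A minimal sketch follows; `peek_grep` is a hypothetical helper name and the pattern is whatever your program prints.

```shell
#!/bin/sh
# peek_grep JOBID PATTERN: print the lines of the job's current standard
# output that match PATTERN (a grep basic regular expression).
# The exit status is that of grep: 0 if at least one line matched.
peek_grep() {
    bpeek "$1" 2>/dev/null | grep -e "$2"
}

# Usage on the cluster (not run here):
#   peek_grep 161182479 'ERROR' && echo "job reported an error"
```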
bkill
Use bkill to terminate a submitted job:

    $ bkill 161182774
    Job <161182774> is being terminated

bkill options | Description |
---|---|
job-ID | kill job-ID |
0 | kill all jobs (yours only) |
-J jobname | kill most recent job called jobname |
-J jobname 0 | kill all jobs called jobname |
-q queue | kill most recent job in queue |
-q queue 0 | kill all jobs in queue |
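Killing several jobs at once is hard to undo, so it can help to preview which jobs a name matches before terminating them. The sketch below combines `bjobs -J jobname` and `bkill job-ID` from the tables above. `kill_by_name` and its `--yes` switch are hypothetical, and the parsing assumes the default bjobs layout with the job ID in the first column.

```shell
#!/bin/sh
# kill_by_name JOBNAME [--yes]: list your jobs called JOBNAME and print the
# bkill command for each one. The jobs are only killed when --yes is given.
kill_by_name() {
    name="$1"
    mode="${2:-}"
    # Keep only lines that start with a numeric job ID (skips the header
    # and any continuation lines).
    ids="$(bjobs -J "$name" 2>/dev/null | awk '$1 ~ /^[0-9]+$/ { print $1 }')"
    if [ -z "$ids" ]; then
        echo "no jobs called $name" >&2
        return 1
    fi
    for id in $ids; do
        if [ "$mode" = "--yes" ]; then
            bkill "$id"
        else
            echo "would run: bkill $id"
        fi
    done
}

# Usage on the cluster (not run here):
#   kill_by_name myjob          # preview only
#   kill_by_name myjob --yes    # actually terminate the jobs
```

When no preview is needed, `bkill -J jobname 0` from the table does the same in one step.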
Job control commands
Job control commands | Description |
---|---|
busers | user limits, number of pending and running jobs |
bqueues | queues status (open/closed; active/inactive) |
bjobs | information about pending, running, and recently finished jobs (the level of detail depends on the options) |
bbjobs | more human-friendly version of bjobs |
bhist | information about jobs finished in the last hours or days |
bpeek | display the standard output of a given job |
lsf_load | show the CPU load of all nodes used by a job |
bjob_connect | log in to a node where your job is running |
bkill | kill a job |
Commands shown in green on the wiki page (bbjobs, lsf_load and bjob_connect) are specific to the HPC clusters at ETH and are not standard LSF commands.
Further reading
- User guide: Using the batch system - Job monitoring