Difference between revisions of "Job monitoring"

From ScientificComputing
(Created page with "== Job control commands == {| class="wikitable" |- | busers || user limits, number of pending and running jobs |- | bqueues || queues status (open/closed; active/inactive) |-...")
 
 
(27 intermediate revisions by the same user not shown)
Line 1: Line 1:
 +
__NOTOC__
 +
<table style="width: 100%;">
 +
<tr valign=top>
 +
<td style="width: 30%; text-align:left">
 +
< [[GPU job submission | Submit a GPU job]]
 +
</td>
 +
<td style="width: 35%; text-align:center">
 +
[[Main Page | Home]]
 +
</td>
 +
<td style="width: 35%; text-align:right">
 +
[[Job output]] >
 +
</td>
 +
</tr>
 +
</table>
 +
 +
 +
 +
The most frequent job monitoring operations are
 +
# Check the job status with [[Job monitoring#bjobs|'''bjobs''']] or [[Job monitoring#bbjobs|'''bbjobs''']]
 +
# Check the job screen output with [[Job monitoring#bpeek|'''bpeek''']]
 +
# Kill a job with [[Job monitoring#bkill|'''bkill''']]
 +
 +
== bjobs ==
 +
After submitting a job, the job will wait in a queue to be run on a compute node and has the PENDING status.
 +
$ bjobs
 +
JOBID      USER    STAT  QUEUE      FROM_HOST  EXEC_HOST  JOB_NAME  SUBMIT_TIME
 +
161182423  jarunan PEND  normal.4h  eu-login-43            *cho hello Jan 22 06:01
 +
 +
When the job is running on a compute node, it has the RUNNING status.
 +
$ bjobs
 +
JOBID      USER    STAT  QUEUE      FROM_HOST  EXEC_HOST  JOB_NAME  SUBMIT_TIME
 +
161182423  jarunan RUN  normal.4h  eu-login-43 eu-ms-005-0 *cho hello Jan 22 06:01
 +
 +
{| class="wikitable" | style="background:white;"
 +
! bjobs options || Description
 +
|-
 +
| (no option) || list all your jobs in all queues
 +
|-
 +
| -p || list only pending (waiting) jobs and indicate why they are pending
 +
|-
 +
| -r || list only running jobs
 +
|-
 +
| -d || list only done jobs (finished within the last hour)
 +
|-
 +
| -l || display status in long format
 +
|-
 +
| -w || display status in wide format
 +
|-
 +
| -o "format" || use custom output format (see LSF documentation for details)
 +
|-
 +
| -J jobname || show only job(s) called jobname
 +
|-
 +
| -q queue || show only jobs in a specific queue
 +
|-
 +
| job-ID(s) || list of job-IDs (this must be the last option)
 +
|}
 +
 +
== bbjobs ==
 +
bbjobs displays more human-friendly information than bjobs. Here are examples in PENDING and RUNNING status.
 +
<table style="width: 100%">
 +
<tr valign=top>
 +
<td style="width: 45%; background: white;">
 +
==== PENDING status ====
 +
$ bbjobs
 +
Job information
 +
  Job ID                : 161182479
 +
  Status                : PENDING
 +
  User                  : jarunanp
 +
  Queue                  : normal.4h
 +
  Command                : sleep 10; echo hello
 +
  Working directory      : $HOME/-
 +
Requested resources
 +
  Requested cores        : 1
 +
  Requested runtime      : 4 h 0 min
 +
  Requested memory      : 1024 MB per core
 +
  Requested scratch      : not specified
 +
  Dependency            : -
 +
Job history
 +
  Submitted at          : 06:03 2021-01-22
 +
  Queue wait time        : 18 sec
 +
</td>
 +
<td style="width: 3%; background: white;">
 +
</td>
 +
<td style="width: 50%; background: white;">
 +
 +
==== RUNNING status ====
 +
$ bbjobs
 +
Job information
 +
  Job ID                        : 161182479
 +
  Status                        : RUNNING
 +
  Running on node              : eu-ms-025-27
 +
  User                          : jarunanp
 +
  Queue                        : normal.4h
 +
  Command                      : sleep 10; echo hello
 +
  Working directory            : $HOME/-
 +
Requested resources
 +
  Requested cores              : 1
 +
  Requested runtime            : 4 h 0 min
 +
  Requested memory              : 1024 MB per core
 +
  Requested scratch            : not specified
 +
  Dependency                    : -
 +
Job history
 +
  Submitted at                  : 06:03 2021-01-22
 +
  Started at                    : 06:03 2021-01-22
 +
  Queue wait time              : 20 sec
 +
Resource usage
 +
  Updated at                    : 06:04 2021-01-22
 +
  Wall-clock                    : 4 sec
 +
  Tasks                        : 4
 +
  Total CPU time                : 0 sec
 +
  CPU utilization              : 0.0 %
 +
  Sys/Kernel time              : 0.0 %
 +
  Total resident Memory        : 2 MB
 +
  Resident memory utilization  : 0.2 %
 +
</td>
 +
</tr>
 +
</table>
 +
 +
== bpeek ==
 +
Use bpeek to display the standard output of a given job
 +
$ bpeek jobID
 +
 +
To display the updated information as the standard output grows
 +
$ bpeek -f jobID
 +
 +
 +
== bkill ==
 +
Use bkill to terminate a submitted job
 +
$ bkill 161182774
 +
Job <161182774> is being terminated
 +
 +
{| class="wikitable" | style="background:white;"
 +
! bkill options || Description
 +
|-
 +
| job-ID || kill job-ID
 +
|-
 +
| 0 || kill all jobs (yours only)
 +
|-
 +
| -J jobname || kill most recent job called jobname
 +
|-
 +
| -J jobname 0 || kill all jobs called jobname
 +
|-
 +
| -q queue || kill most recent job in queue
 +
|-
 +
| -q queue 0 || kill all jobs in queue
 +
|}
 +
 
== Job control commands ==
 
== Job control commands ==
{| class="wikitable"
+
{| class="wikitable" | style="background:white;"
 +
! Job control commands || Description
 
|-
 
|-
 
| busers || user limits, number of pending and running jobs
 
| busers || user limits, number of pending and running jobs
Line 7: Line 155:
 
|-
 
|-
 
| bjobs || more or less detailed information about pending and running jobs, and recently finished jobs
 
| bjobs || more or less detailed information about pending and running jobs, and recently finished jobs
|-
+
|- style="color:green"
 
| bbjobs || better bjobs
 
| bbjobs || better bjobs
 
|-  
 
|-  
 
| bhist || info about jobs finished in the last hours/days
 
| bhist || info about jobs finished in the last hours/days
 
|-  
 
|-  
| bpeek || display the standard output of a given joblsf_loadshow the CPU load of all nodes used by a job
+
| bpeek || display the standard output of a given job
|-  
+
|- style="color:green"
 +
| lsf_load || show the CPU load of all nodes used by a job
 +
|- style="color:green"
 
| bjob_connect || login to a node where your job is running
 
| bjob_connect || login to a node where your job is running
 
|-
 
|-
 
| bkill || kill a job
 
| bkill || kill a job
 
|}
 
|}
 +
 +
Commands shown in green are specific to the HPC clusters at ETH and are not standard LSF commands.
 +
 +
== Further reading ==
 +
* [[Using_the_batch_system#Job_monitoring|User guide: Using the batch system - Job monitoring]]
 +
 +
 +
 +
<table style="width: 100%;">
 +
<tr valign=top>
 +
<td style="width: 30%; text-align:left">
 +
< [[GPU job submission | Submit a GPU job]]
 +
</td>
 +
<td style="width: 35%; text-align:center">
 +
[[Main Page| Home]]
 +
</td>
 +
<td style="width: 35%; text-align:right">
 +
[[Job output]] >
 +
</td>
 +
</tr>
 +
</table>

Latest revision as of 09:26, 1 October 2021


The most frequent job monitoring operations are:

  1. Check the job status with bjobs or bbjobs
  2. Check the job screen output with bpeek
  3. Kill a job with bkill

bjobs

After submission, a job waits in a queue until it can run on a compute node. While it waits, it has the PENDING status, which bjobs reports as PEND in the STAT column.

$ bjobs
JOBID      USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
161182423  jarunan PEND  normal.4h  eu-login-43             *cho hello Jan 22 06:01

Once the job is running on a compute node, it has the RUNNING status (RUN in the STAT column), and the EXEC_HOST column shows the node it runs on.

$ bjobs
JOBID      USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
161182423  jarunan RUN   normal.4h  eu-login-43 eu-ms-005-0 *cho hello Jan 22 06:01

bjobs options   Description
(no option)     list all your jobs in all queues
-p              list only pending (waiting) jobs and indicate why they are pending
-r              list only running jobs
-d              list only done jobs (finished within the last hour)
-l              display status in long format
-w              display status in wide format
-o "format"     use custom output format (see LSF documentation for details)
-J jobname      show only job(s) called jobname
-q queue        show only jobs in a specific queue
job-ID(s)       list of job-IDs (this must be the last option)
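
The STAT column lends itself to quick summaries. The following is a minimal sketch, not part of LSF, that counts jobs per state from a bjobs listing read on standard input. The helper name count_by_stat is made up for illustration, and the sketch assumes the default column layout shown above, with STAT as the third column.

```shell
# Count jobs per state from bjobs-style output read on stdin.
# Assumption: default bjobs layout, header on line 1, STAT in column 3.
count_by_stat() {
  awk 'NR > 1 && NF >= 3 { n[$3]++ } END { for (s in n) print s, n[s] }' | sort
}

# Demo with a sample row from this page.
count_by_stat <<'EOF'
JOBID      USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
161182423  jarunan RUN   normal.4h  eu-login-43 eu-ms-005-0 *cho hello Jan 22 06:01
EOF

# On the cluster (not run here):
#   bjobs | count_by_stat
```

A custom format (-o) or the long format (-l) changes the column positions, so the column number would need adjusting in those cases.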

bbjobs

bbjobs displays job information in a more human-friendly format than bjobs. The examples below show the same job in PENDING and in RUNNING status.

PENDING status

$ bbjobs
Job information
  Job ID                 : 161182479
  Status                 : PENDING
  User                   : jarunanp
  Queue                  : normal.4h
  Command                : sleep 10; echo hello
  Working directory      : $HOME/-
Requested resources
  Requested cores        : 1
  Requested runtime      : 4 h 0 min
  Requested memory       : 1024 MB per core
  Requested scratch      : not specified
  Dependency             : -
Job history
  Submitted at           : 06:03 2021-01-22
  Queue wait time        : 18 sec

RUNNING status

$ bbjobs
Job information
  Job ID                        : 161182479
  Status                        : RUNNING
  Running on node               : eu-ms-025-27 
  User                          : jarunanp
  Queue                         : normal.4h
  Command                       : sleep 10; echo hello
  Working directory             : $HOME/-
Requested resources
  Requested cores               : 1
  Requested runtime             : 4 h 0 min
  Requested memory              : 1024 MB per core
  Requested scratch             : not specified
  Dependency                    : -
Job history
  Submitted at                  : 06:03 2021-01-22
  Started at                    : 06:03 2021-01-22
  Queue wait time               : 20 sec
Resource usage
  Updated at                    : 06:04 2021-01-22
  Wall-clock                    : 4 sec
  Tasks                         : 4
  Total CPU time                : 0 sec
  CPU utilization               : 0.0 %
  Sys/Kernel time               : 0.0 %
  Total resident Memory         : 2 MB
  Resident memory utilization   : 0.2 % 
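
Because bbjobs prints one "key : value" pair per line, a single field can be picked out in a script. The sketch below assumes the output layout of the samples above. The helper name bbjobs_field is hypothetical and is not part of bbjobs or LSF.

```shell
# Print the value of one field from bbjobs-style "key : value" output on stdin.
# Assumption: the layout matches the sample output on this page.
bbjobs_field() {
  awk -v key="$1" '
    {
      i = index($0, ":")                 # the first colon separates key and value
      if (i == 0) next                   # section headings contain no colon
      k = substr($0, 1, i - 1)
      v = substr($0, i + 1)
      gsub(/^[ \t]+|[ \t]+$/, "", k)     # trim surrounding whitespace
      gsub(/^[ \t]+|[ \t]+$/, "", v)
      if (k == key) { print v; exit }
    }'
}

# Demo with lines taken from the RUNNING sample above.
bbjobs_field "Status" <<'EOF'
Job information
  Job ID                        : 161182479
  Status                        : RUNNING
Job history
  Submitted at                  : 06:03 2021-01-22
EOF

# On the cluster (not run here):
#   bbjobs | bbjobs_field "Queue wait time"
```

Splitting at the first colon only keeps values such as "06:03 2021-01-22" intact, even though they contain a colon themselves.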

bpeek

Use bpeek to display the standard output of a given job:

$ bpeek jobID

To keep following the standard output as it grows, add the -f option:

$ bpeek -f jobID
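
In scripts, the job ID for bpeek has to come from somewhere. One possible approach, sketched below, is to extract it from an LSF message of the form "Job <ID> is ...". The helper name extract_jobid is hypothetical, and the exact wording of the submission message is an assumption that may differ on a given cluster.

```shell
# Extract the numeric job ID from an LSF message such as "Job <12345> is ...".
# Assumption: the message starts with "Job <ID>"; other lines are ignored.
extract_jobid() {
  sed -n 's/^Job <\([0-9][0-9]*\)>.*/\1/p' | head -n 1
}

# Demo with a message in the assumed format.
echo 'Job <161182479> is submitted to queue <normal.4h>.' | extract_jobid

# On the cluster (not run here), the ID could then be passed to bpeek:
#   jobid=$(bsub "sleep 10; echo hello" | extract_jobid)
#   bpeek -f "$jobid"
```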


bkill

Use bkill to terminate a submitted job:

$ bkill 161182774
Job <161182774> is being terminated

bkill options   Description
job-ID          kill job-ID
0               kill all jobs (yours only)
-J jobname      kill most recent job called jobname
-J jobname 0    kill all jobs called jobname
-q queue        kill most recent job in queue
-q queue 0      kill all jobs in queue
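
The options above select jobs by ID, name or queue. To select by state instead, for example to cancel only jobs that are still pending, the IDs can be filtered out of a bjobs listing first. This is a sketch under the assumption of the default bjobs column layout (JOBID first, STAT third). The helper name ids_with_stat is made up for illustration.

```shell
# Print the IDs of jobs whose STAT column equals the given state (e.g. PEND).
# Assumption: default bjobs layout, header on line 1, JOBID in column 1, STAT in column 3.
ids_with_stat() {
  awk -v stat="$1" 'NR > 1 && $3 == stat { print $1 }'
}

# Demo with the two sample rows from the bjobs section of this page.
ids_with_stat PEND <<'EOF'
JOBID      USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
161182423  jarunan PEND  normal.4h  eu-login-43             *cho hello Jan 22 06:01
161182423  jarunan RUN   normal.4h  eu-login-43 eu-ms-005-0 *cho hello Jan 22 06:01
EOF

# On the cluster (not run here), review the list first, then pass it to bkill:
#   bjobs | ids_with_stat PEND
#   bkill $(bjobs | ids_with_stat PEND)
```

Printing the IDs before killing anything gives a chance to check the selection, since bkill acts immediately.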

Job control commands

Job control commands   Description
busers                 user limits, number of pending and running jobs
bqueues                queue status (open/closed; active/inactive)
bjobs                  information about pending, running and recently finished jobs, at varying levels of detail
bbjobs *               more human-friendly version of bjobs
bhist                  information about jobs finished in the last hours/days
bpeek                  display the standard output of a given job
lsf_load *             show the CPU load of all nodes used by a job
bjob_connect *         log in to a node where your job is running
bkill                  kill a job

Commands marked with * (shown in green on the wiki page) are specific to the HPC clusters at ETH and are not standard LSF commands.

Further reading

  * User guide: Using the batch system - Job monitoring

