Difference between revisions of "Job monitoring"

From ScientificComputing
 
(14 intermediate revisions by the same user not shown)
Line 1: Line 1:
 
__NOTOC__
 
__NOTOC__
 +
<table style="width: 100%;">
 +
<tr valign=top>
 +
<td style="width: 30%; text-align:left">
 +
< [[GPU job submission | Submit a GPU job]]
 +
</td>
 +
<td style="width: 35%; text-align:center">
 +
[[Main Page | Home]]
 +
</td>
 +
<td style="width: 35%; text-align:right">
 +
[[Job output]] >
 +
</td>
 +
</tr>
 +
</table>
 +
 +
 +
 +
The most frequent job monitoring operations are
 +
# Check the job status with [[Job monitoring#bjobs|'''bjobs''']] or [[Job monitoring#bbjobs|'''bbjobs''']]
 +
# Check the job screen output with [[Job monitoring#bpeek|'''bpeek''']]
 +
# Kill a job with [[Job monitoring#bkill|'''bkill''']]
 +
 
== bjobs ==
 
== bjobs ==
 
After submitting a job, the job will wait in a queue to be run on a compute node and has the PENDING status.
 
After submitting a job, the job will wait in a queue to be run on a compute node and has the PENDING status.
Line 11: Line 32:
 
  161182423  jarunan RUN  normal.4h  eu-login-43 eu-ms-005-0 *cho hello Jan 22 06:01
 
  161182423  jarunan RUN  normal.4h  eu-login-43 eu-ms-005-0 *cho hello Jan 22 06:01
  
{| class="wikitable"
+
{| class="wikitable" | style="background:white;"
 
! bjobs options || Description
 
! bjobs options || Description
 
|-
 
|-
Line 109: Line 130:
 
  Job <161182774> is being terminated
 
  Job <161182774> is being terminated
  
{| class="wikitable"
+
{| class="wikitable" | style="background:white;"
 
! bkill options || Description
 
! bkill options || Description
 
|-
 
|-
Line 126: Line 147:
  
 
== Job control commands ==
 
== Job control commands ==
{| class="wikitable"
+
{| class="wikitable" | style="background:white;"
 
! Job control commands || Description
 
! Job control commands || Description
 
|-
 
|-
Line 149: Line 170:
  
 
Command shown in green are specific to HPC clusters at ETH and are not standard LSF commands.
 
Command shown in green are specific to HPC clusters at ETH and are not standard LSF commands.
 +
 +
== Further reading ==
 +
* [[Using_the_batch_system#Job_monitoring|User guide: Using the batch system - Job monitoring]]
 +
 +
 +
 +
<table style="width: 100%;">
 +
<tr valign=top>
 +
<td style="width: 30%; text-align:left">
 +
< [[GPU job submission | Submit a GPU job]]
 +
</td>
 +
<td style="width: 35%; text-align:center">
 +
[[Main Page| Home]]
 +
</td>
 +
<td style="width: 35%; text-align:right">
 +
[[Job output]] >
 +
</td>
 +
</tr>
 +
</table>

Latest revision as of 09:26, 1 October 2021

< Submit a GPU job

Home

Job output >


The most frequent job monitoring operations are:

  1. Check the job status with bjobs or bbjobs
  2. Check the job screen output with bpeek
  3. Kill a job with bkill

bjobs

After submission, a job waits in a queue until it can run on a compute node. During this time it has the PENDING status, shown as PEND in the STAT column.

$ bjobs
JOBID      USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
161182423  jarunan PEND  normal.4h  eu-login-43             *cho hello Jan 22 06:01

Once the job is running on a compute node, it has the RUNNING status, shown as RUN in the STAT column. The EXEC_HOST column then shows the node it runs on.

$ bjobs
JOBID      USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
161182423  jarunan RUN   normal.4h  eu-login-43 eu-ms-005-0 *cho hello Jan 22 06:01

bjobs options   Description
(no option)     list all your jobs in all queues
-p              list only pending (waiting) jobs and indicate why they are pending
-r              list only running jobs
-d              list only done jobs (finished within the last hour)
-l              display status in long format
-w              display status in wide format
-o "format"     use custom output format (see LSF documentation for details)
-J jobname      show only job(s) called jobname
-q queue        show only jobs in a specific queue
job-ID(s)       list of job-IDs (this must be the last option)
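
The default bjobs output shown above can also be filtered in a script. The following is a minimal sketch, assuming the default column layout (JOBID in column 1, STAT in column 3); the function name jobs_with_status is hypothetical and not part of LSF.

```shell
# Print the job IDs whose STAT column matches the given status (e.g. PEND or RUN).
# Reads default-format bjobs output on standard input.
# jobs_with_status is a hypothetical helper, not an LSF command.
jobs_with_status() {
    status="$1"
    # Skip the header line; column 1 is JOBID, column 3 is STAT.
    awk -v s="$status" 'NR > 1 && $3 == s { print $1 }'
}
```

For example, "bjobs | jobs_with_status PEND" would print the IDs of your pending jobs, one per line. If the output format was changed with -o, the column numbers would need to be adapted.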

bbjobs

bbjobs displays job information in a more human-readable form than bjobs. The examples below show the same job in the PENDING and RUNNING states.

PENDING status

$ bbjobs
Job information
  Job ID                 : 161182479
  Status                 : PENDING
  User                   : jarunanp
  Queue                  : normal.4h
  Command                : sleep 10; echo hello
  Working directory      : $HOME/-
Requested resources
  Requested cores        : 1
  Requested runtime      : 4 h 0 min
  Requested memory       : 1024 MB per core
  Requested scratch      : not specified
  Dependency             : -
Job history
  Submitted at           : 06:03 2021-01-22
  Queue wait time        : 18 sec

RUNNING status

$ bbjobs
Job information
  Job ID                        : 161182479
  Status                        : RUNNING
  Running on node               : eu-ms-025-27 
  User                          : jarunanp
  Queue                         : normal.4h
  Command                       : sleep 10; echo hello
  Working directory             : $HOME/-
Requested resources
  Requested cores               : 1
  Requested runtime             : 4 h 0 min
  Requested memory              : 1024 MB per core
  Requested scratch             : not specified
  Dependency                    : -
Job history
  Submitted at                  : 06:03 2021-01-22
  Started at                    : 06:03 2021-01-22
  Queue wait time               : 20 sec
Resource usage
  Updated at                    : 06:04 2021-01-22
  Wall-clock                    : 4 sec
  Tasks                         : 4
  Total CPU time                : 0 sec
  CPU utilization               : 0.0 %
  Sys/Kernel time               : 0.0 %
  Total resident Memory         : 2 MB
  Resident memory utilization   : 0.2 % 
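
Because bbjobs prints one "name : value" pair per line, a single field can be pulled out in a script. The sketch below assumes the layout shown in the examples above; bbjobs_field is a hypothetical helper name, and only the first matching field is printed, so with several jobs listed it reports the first one.

```shell
# Print the value of one field (e.g. "Status") from bbjobs output on standard input.
# bbjobs_field is a hypothetical helper, not part of LSF or the ETH tools.
bbjobs_field() {
    name="$1"
    awk -v name="$name" '
        {
            # Split at the first colon only, since values such as
            # "06:03 2021-01-22" contain colons themselves.
            pos = index($0, ":")
            if (pos == 0) next
            key = substr($0, 1, pos - 1)
            val = substr($0, pos + 1)
            # Trim leading and trailing blanks from key and value.
            gsub(/^[ \t]+|[ \t]+$/, "", key)
            gsub(/^[ \t]+|[ \t]+$/, "", val)
            if (key == name) { print val; exit }
        }'
}
```

For example, 'bbjobs | bbjobs_field "Status"' would print PENDING or RUNNING for the jobs shown above.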

bpeek

Use bpeek to display the standard output that a given job has produced so far:

$ bpeek jobID

To keep following the standard output as the job writes it, add the -f option:

$ bpeek -f jobID
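
For a job that has already written a lot of output, it can be convenient to look only at the most recent lines. The following is a small sketch that pipes bpeek through the standard tail utility; peek_tail is a hypothetical helper name and the default of 20 lines is an arbitrary choice.

```shell
# Show only the last N lines (default 20) of a job's standard output so far.
# peek_tail is a hypothetical helper; it simply combines bpeek with tail.
peek_tail() {
    job_id="$1"
    lines="${2:-20}"
    bpeek "$job_id" | tail -n "$lines"
}
```

For example, "peek_tail jobID 50" would show the last 50 lines currently available for that job.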


bkill

Use bkill to terminate a submitted job, whether it is still pending or already running:

$ bkill 161182774
Job <161182774> is being terminated

bkill options   Description
job-ID          kill job-ID
0               kill all jobs (yours only)
-J jobname      kill most recent job called jobname
-J jobname 0    kill all jobs called jobname
-q queue        kill most recent job in queue
-q queue 0      kill all jobs in queue
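
The options above select jobs by name or queue, not by status. As a hedged sketch, bjobs and bkill can be combined to kill only the jobs in a given state, for example all pending jobs. The function name kill_jobs_with_status is hypothetical, and the code assumes the default bjobs column layout (JOBID in column 1, STAT in column 3).

```shell
# Kill all of your jobs whose STAT column matches the given status (e.g. PEND).
# kill_jobs_with_status is a hypothetical helper, not an LSF command.
kill_jobs_with_status() {
    status="$1"
    # Collect matching job IDs from the default bjobs listing (header skipped).
    ids=$(bjobs 2>/dev/null | awk -v s="$status" 'NR > 1 && $3 == s { print $1 }')
    if [ -z "$ids" ]; then
        echo "No jobs with status $status"
        return 0
    fi
    # Word splitting is intended here: each job ID becomes one bkill argument.
    bkill $ids
}
```

For example, "kill_jobs_with_status PEND" would leave running jobs untouched. Since killed jobs cannot be recovered, it is worth checking the bjobs listing first.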

Job control commands

Job control commands   Description
busers                 user limits, number of pending and running jobs
bqueues                queue status (open/closed; active/inactive)
bjobs                  more or less detailed information about pending, running and recently finished jobs
bbjobs                 better bjobs
bhist                  information about jobs finished in the last hours/days
bpeek                  display the standard output of a given job
lsf_load               show the CPU load of all nodes used by a job
bjob_connect           log in to a node where your job is running
bkill                  kill a job

Commands shown in green are specific to HPC clusters at ETH and are not standard LSF commands.

Further reading

  * User guide: Using the batch system - Job monitoring

