Job monitoring

From ScientificComputing

Revision as of 10:59, 22 January 2021

bjobs

After submission, a job waits in a queue until it can run on a compute node. While it waits, it has the PENDING status (shown as PEND in the STAT column).

$ bjobs
JOBID      USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
161182423  jarunan PEND  normal.4h  eu-login-43             *cho hello Jan 22 06:01

Once the job is running on a compute node, it has the RUNNING status (shown as RUN in the STAT column).

$ bjobs
JOBID      USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
161182423  jarunan RUN   normal.4h  eu-login-43 eu-ms-005-0 *cho hello Jan 22 06:01

bjobs options   Description
(no option)     list all your jobs in all queues
-p              list only pending (waiting) jobs and indicate why they are pending
-r              list only running jobs
-d              list only done jobs (finished within the last hour)
-l              display status in long format
-w              display status in wide format
-o "format"     use custom output format (see LSF documentation for details)
-J jobname      show only job(s) called jobname
-q queue        show only jobs in a specific queue
job-ID(s)       list of job-IDs (this must be the last option)
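
The default bjobs listing can be summarized with a few lines of shell. The following is a minimal sketch, not part of LSF: the function name count_by_status is hypothetical, and it assumes the column layout shown above, with the job ID in the first column and STAT in the third.

```shell
# Count jobs per status from the default `bjobs` listing read on stdin.
# Assumption: column 1 is the numeric JOBID and column 3 is STAT,
# as in the sample output above. Header and other lines are skipped.
count_by_status() {
    awk '$1 ~ /^[0-9]+$/ && NF >= 3 { n[$3]++ }
         END { for (s in n) print s, n[s] }' | sort
}
```

On a cluster with LSF available, it would be used as `bjobs | count_by_status`, which prints one line per status together with the number of jobs in that status.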

bbjobs

bbjobs displays job information in a more human-friendly format than bjobs. The examples below show a job in PENDING and in RUNNING status.

PENDING status

$ bbjobs
Job information
  Job ID                 : 161182479
  Status                 : PENDING
  User                   : jarunanp
  Queue                  : normal.4h
  Command                : sleep 10; echo hello
  Working directory      : $HOME/-
Requested resources
  Requested cores        : 1
  Requested runtime      : 4 h 0 min
  Requested memory       : 1024 MB per core
  Requested scratch      : not specified
  Dependency             : -
Job history
  Submitted at           : 06:03 2021-01-22
  Queue wait time        : 18 sec

RUNNING status

$ bbjobs
Job information
  Job ID                        : 161182479
  Status                        : RUNNING
  Running on node               : eu-ms-025-27 
  User                          : jarunanp
  Queue                         : normal.4h
  Command                       : sleep 10; echo hello
  Working directory             : $HOME/-
Requested resources
  Requested cores               : 1
  Requested runtime             : 4 h 0 min
  Requested memory              : 1024 MB per core
  Requested scratch             : not specified
  Dependency                    : -
Job history
  Submitted at                  : 06:03 2021-01-22
  Started at                    : 06:03 2021-01-22
  Queue wait time               : 20 sec
Resource usage
  Updated at                    : 06:04 2021-01-22
  Wall-clock                    : 4 sec
  Tasks                         : 4
  Total CPU time                : 0 sec
  CPU utilization               : 0.0 %
  Sys/Kernel time               : 0.0 %
  Total resident Memory         : 2 MB
  Resident memory utilization   : 0.2 % 
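
Because bbjobs prints one "Key : value" pair per line, a single field can be extracted in a script. The sketch below assumes the output format shown above. The helper name bbjobs_field is hypothetical and not part of bbjobs itself.

```shell
# Print the value of one field from bbjobs-style "Key : value" lines on stdin.
# Assumption: the first colon on a line separates key and value, so values
# that contain colons themselves (e.g. "06:03 2021-01-22") are kept whole.
bbjobs_field() {
    awk -v key="$1" '
        {
            pos = index($0, ":")
            if (pos == 0) next                  # section titles have no colon
            k = substr($0, 1, pos - 1)
            v = substr($0, pos + 1)
            gsub(/^[ \t]+|[ \t]+$/, "", k)      # trim surrounding whitespace
            gsub(/^[ \t]+|[ \t]+$/, "", v)
            if (k == key) { print v; exit }     # first match only
        }'
}
```

With a single job in the listing, `bbjobs | bbjobs_field "Status"` would print PENDING or RUNNING. If several jobs are listed, only the first match is printed.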

bpeek

Use bpeek to display the standard output of a given job:

$ bpeek jobID

To follow the standard output continuously as it grows, use the -f option:

$ bpeek -f jobID


bkill

Use bkill to terminate a submitted job:

$ bkill 161182774
Job <161182774> is being terminated

bkill options   Description
job-ID          kill job-ID
0               kill all jobs (yours only)
-J jobname      kill most recent job called jobname
-J jobname 0    kill all jobs called jobname
-q queue        kill most recent job in queue
-q queue 0      kill all jobs in queue
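
The options above select jobs by name or queue, not by status. To kill only the jobs in a given status (for example all pending jobs), the job IDs can be taken from the bjobs listing first. This is a hedged sketch: the helper job_ids_with_status is hypothetical, and it assumes the default bjobs column layout (JOBID first, STAT third).

```shell
# Print the IDs of all jobs with the given status (e.g. PEND or RUN),
# reading the default `bjobs` listing from stdin.
# Assumption: column 1 is the numeric JOBID and column 3 is STAT.
job_ids_with_status() {
    awk -v want="$1" '$1 ~ /^[0-9]+$/ && $3 == want { print $1 }'
}

# Usage on a cluster (requires LSF); inspect the list before killing anything:
#   ids=$(bjobs | job_ids_with_status PEND)
#   echo "would run: bkill $ids"
```

Printing the command first, as in the commented usage, is a deliberate precaution, since bkill acts immediately on every job ID it is given.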

Job control commands

Job control commands   Description
busers                 user limits, number of pending and running jobs
bqueues                queue status (open/closed; active/inactive)
bjobs                  more or less detailed information about pending, running and recently finished jobs
bbjobs                 better bjobs (more human-friendly output)
bhist                  information about jobs finished in the last hours/days
bpeek                  display the standard output of a given job
lsf_load               show the CPU load of all nodes used by a job
bjob_connect           log in to a node where your job is running
bkill                  kill a job

Commands shown in green (bbjobs, lsf_load and bjob_connect) are specific to the HPC clusters at ETH and are not standard LSF commands.

Further reading

The complete guide: Job monitoring