Job monitoring

From ScientificComputing
Revision as of 09:26, 1 October 2021 by Jarunanp (talk | contribs)



The most frequent job monitoring operations are:

  1. Check the job status with bjobs or bbjobs
  2. Check the job screen output with bpeek
  3. Kill a job with bkill

bjobs

After submission, a job waits in a queue until it can run on a compute node. While it waits, its status is PEND (pending).

$ bjobs
JOBID      USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
161182423  jarunan PEND  normal.4h  eu-login-43             *cho hello Jan 22 06:01

Once the job is running on a compute node, its status changes to RUN (running) and the EXEC_HOST column shows the node.

$ bjobs
JOBID      USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
161182423  jarunan RUN   normal.4h  eu-login-43 eu-ms-005-0 *cho hello Jan 22 06:01

bjobs option   Description
(no option)    list all your jobs in all queues
-p             list only pending (waiting) jobs and indicate why they are pending
-r             list only running jobs
-d             list only done jobs (finished within the last hour)
-l             display status in long format
-w             display status in wide format
-o "format"    use custom output format (see LSF documentation for details)
-J jobname     show only job(s) called jobname
-q queue       show only jobs in a specific queue
job-ID(s)      list of job-IDs (this must be the last option)
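
For scripting, the default listing can be summarised with standard text tools. The following is a minimal sketch, not part of LSF: the helper name count_by_status is hypothetical, and it assumes the default column layout shown above, with STAT as the third column. It reads the listing on standard input so that it can be tried without access to the cluster.

```shell
# Hypothetical helper: count jobs per status from default `bjobs` output.
# Assumes the first line is the header and STAT is the third column.
count_by_status() {
    awk 'NR > 1 && NF >= 3 { n[$3]++ } END { for (s in n) print s, n[s] }' | sort
}

# On the cluster this would be used as:  bjobs | count_by_status
# Here the sample listing from above is fed in instead.
count_by_status <<'EOF'
JOBID      USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
161182423  jarunan RUN   normal.4h  eu-login-43 eu-ms-005-0 *cho hello Jan 22 06:01
EOF
# prints: RUN 1
```

If the layout is changed with -o "format", the column number in the awk program has to be adjusted accordingly.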

bbjobs

bbjobs presents the same information as bjobs in a more human-friendly layout. The examples below show a job in the PENDING and in the RUNNING state.

PENDING status

$ bbjobs
Job information
  Job ID                 : 161182479
  Status                 : PENDING
  User                   : jarunanp
  Queue                  : normal.4h
  Command                : sleep 10; echo hello
  Working directory      : $HOME/-
Requested resources
  Requested cores        : 1
  Requested runtime      : 4 h 0 min
  Requested memory       : 1024 MB per core
  Requested scratch      : not specified
  Dependency             : -
Job history
  Submitted at           : 06:03 2021-01-22
  Queue wait time        : 18 sec

RUNNING status

$ bbjobs
Job information
  Job ID                        : 161182479
  Status                        : RUNNING
  Running on node               : eu-ms-025-27 
  User                          : jarunanp
  Queue                         : normal.4h
  Command                       : sleep 10; echo hello
  Working directory             : $HOME/-
Requested resources
  Requested cores               : 1
  Requested runtime             : 4 h 0 min
  Requested memory              : 1024 MB per core
  Requested scratch             : not specified
  Dependency                    : -
Job history
  Submitted at                  : 06:03 2021-01-22
  Started at                    : 06:03 2021-01-22
  Queue wait time               : 20 sec
Resource usage
  Updated at                    : 06:04 2021-01-22
  Wall-clock                    : 4 sec
  Tasks                         : 4
  Total CPU time                : 0 sec
  CPU utilization               : 0.0 %
  Sys/Kernel time               : 0.0 %
  Total resident Memory         : 2 MB
  Resident memory utilization   : 0.2 % 
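
A single value can be pulled out of this output in a script. The sketch below is illustrative only: bbjobs_field is a hypothetical helper name, and it assumes the "Label : value" layout seen in the samples above, which may differ between bbjobs versions. It reads the text on standard input.

```shell
# Hypothetical helper: print the value of one "Label : value" line.
# Assumes the layout of the bbjobs samples above.
bbjobs_field() {
    awk -F ' *: ' -v label="$1" '
        { key = $1; sub(/^ +/, "", key) }      # label without leading spaces
        key == label {
            sub(/^[^:]*: */, "")               # drop everything up to the first colon
            sub(/ +$/, "")                     # drop trailing spaces
            print
            exit
        }
    '
}

# On the cluster this would be used as:  bbjobs 161182479 | bbjobs_field "Status"
# Here a few lines of the sample output are fed in instead.
bbjobs_field "Status" <<'EOF'
Job information
  Job ID                 : 161182479
  Status                 : PENDING
  Queue                  : normal.4h
EOF
# prints: PENDING
```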

bpeek

Use bpeek to display the standard output of a job while it is still running:

$ bpeek jobID

To follow the standard output continuously as it grows:

$ bpeek -f jobID
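
A pending job has no output to show yet, so bjobs and bpeek are often combined. The following is a minimal sketch under stated assumptions: job_status and peek_when_started are hypothetical helper names, the status is parsed from the third column of the default bjobs listing, and the 30-second polling interval is an arbitrary choice.

```shell
# Hypothetical helper: print the STAT column for one job,
# parsed from the default `bjobs` listing (header on line 1).
job_status() {
    bjobs "$1" 2>/dev/null | awk 'NR == 2 { print $3 }'
}

# Hypothetical helper: wait while the job is pending, then follow its output.
# $1 = job ID, $2 = polling interval in seconds (default 30, arbitrary).
peek_when_started() {
    jobid=$1
    interval=${2:-30}
    while [ "$(job_status "$jobid")" = "PEND" ]; do
        sleep "$interval"
    done
    bpeek -f "$jobid"
}

# Usage on the cluster:
#   peek_when_started 161182423
```

The loop also ends if the job is unknown or already finished, in which case bpeek reports the problem itself.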


bkill

Use bkill to terminate a submitted job:

$ bkill 161182774
Job <161182774> is being terminated

bkill option    Description
job-ID          kill job-ID
0               kill all jobs (yours only)
-J jobname      kill most recent job called jobname
-J jobname 0    kill all jobs called jobname
-q queue        kill most recent job in queue
-q queue 0      kill all jobs in queue
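
The options above select jobs by name or queue. To select by status instead, for example all pending jobs, the IDs can be filtered out of the bjobs listing first. This is a sketch only: job_ids_with_status is a hypothetical helper name, and it assumes the default bjobs column layout (job ID in column 1, STAT in column 3).

```shell
# Hypothetical helper: print the IDs of jobs whose STAT column equals $1.
# Reads the default `bjobs` listing on standard input.
job_ids_with_status() {
    awk -v want="$1" 'NR > 1 && $3 == want { print $1 }'
}

# Usage on the cluster, in two steps so the list can be checked first:
#   bjobs | job_ids_with_status PEND
#   bjobs | job_ids_with_status PEND | xargs bkill
# Here the sample listing from the bjobs section is fed in instead.
job_ids_with_status PEND <<'EOF'
JOBID      USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
161182423  jarunan PEND  normal.4h  eu-login-43             *cho hello Jan 22 06:01
EOF
# prints: 161182423
```

Printing the IDs before passing them to bkill avoids terminating jobs that were not meant to be selected.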

Job control commands

Command        Description
busers         user limits, number of pending and running jobs
bqueues        queue status (open/closed; active/inactive)
bjobs          more or less detailed information about pending, running and recently finished jobs
bbjobs         more human-friendly version of bjobs
bhist          information about jobs finished in the last hours or days
bpeek          display the standard output of a given job
lsf_load       show the CPU load of all nodes used by a job
bjob_connect   log in to a node where your job is running
bkill          kill a job

The commands bbjobs, lsf_load and bjob_connect are specific to the HPC clusters at ETH and are not standard LSF commands.
