Job monitoring with SLURM

From ScientificComputing



The most frequent job monitoring operations are:

  1. Check the job status with squeue and myjobs
  2. Kill a job with scancel

squeue

After submission, a job waits in a queue until it can be run on a compute node; while it waits, it has the PD (pending) status.

$ squeue
JOBID   PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
6037416 normal.4h    myjob nmarouni PD       0:00      1 (None)

Once the job is running on a compute node, it has the R (running) status.

$ squeue
JOBID   PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
6037416 normal.4h    myjob nmarouni  R       0:03      1 eu-g5-047-1

squeue option      Description
(no option)        List all of your jobs in all queues
-t <state>         List only jobs in the given state; valid states include PENDING, RUNNING, SUSPENDED, COMPLETED, CANCELLED, FAILED and TIMEOUT
-o "<format>"      Use a custom output format (see the SLURM documentation for details)
-j <job_id_list>   Show only the jobs with the given IDs, as a comma-separated list
-p <partition>     Show only jobs in the given partition (queue)
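
These options can be combined on the cluster, e.g. `squeue -t RUNNING -p normal.4h`. As a self-contained sketch, the same state filter can be reproduced offline with awk on captured squeue output (the job IDs and names below are made up):

```shell
# Captured squeue output as a here-string so the example runs without a cluster;
# on a real system you would pipe `squeue` itself.
sample='JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
6037416 normal.4h jobA nmarouni PD 0:00 1 (None)
6037417 normal.4h jobB nmarouni R 0:05 1 eu-g5-047-1'

# Keep the header plus only RUNNING (R) jobs; ST is the fifth column.
echo "$sample" | awk 'NR==1 || $5=="R"'
```

On the cluster, `squeue -t RUNNING` achieves the same result directly; the awk version is only useful when post-processing saved output.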

myjobs

The myjobs command displays job information in a more human-friendly form than squeue. Below are examples for a job in the PENDING and in the RUNNING state.

PENDING status

$ myjobs -j 6038307
Job information
 Job ID                          : 6038307
 Status                          : PENDING
 Running on node                 : None assigned
 User                            : nmarounina
 Shareholder group               : es_cdss
 Slurm partition (queue)         : gpu.24h
 Command                         : script.sbatch
 Working directory               : /cluster/home/nmarounina
Requested resources
 Requested runtime               : 08:00:00
 Requested cores (total)         : 12
 Requested nodes                 : 1
 Requested memory (total)        : 120000 MiB
Job history
 Submitted at                    : 2023-01-09T15:56:09
 Started at                      : Job did not start yet
 Queue waiting time              : 8 s
Resource usage
 Wall-clock                      : 
 Total CPU time                  : -
 CPU utilization                 : - %
 Total resident memory           : - MiB
 Resident memory utilization     : - %

RUNNING status

$ myjobs -j 6038307
Job information
 Job ID                          : 6038307
 Status                          : RUNNING
 Running on node                 : eu-g3-022
 User                            : nmarounina
 Shareholder group               : es_cdss
 Slurm partition (queue)         : gpu.24h
 Command                         : script.sbatch
 Working directory               : /cluster/home/nmarounina
Requested resources
 Requested runtime               : 08:00:00
 Requested cores (total)         : 12
 Requested nodes                 : 1
 Requested memory (total)        : 120000 MiB
Job history
 Submitted at                    : 2023-01-09T15:56:09
 Started at                      : 2023-01-09T15:56:38
 Queue waiting time              : 29 s
Resource usage
 Wall-clock                      : 00:00:36
 Total CPU time                  : 00:00:00
 CPU utilization                 : 0%
 Total resident memory           : 2.94 MiB
 Resident memory utilization     : 0%
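
The CPU utilization figure reported above can be read as total CPU time relative to wall-clock time multiplied by the number of requested cores; the 0% in the example simply reflects a job that had not yet consumed CPU time. A rough back-of-the-envelope check (the numbers below are made up, and the exact formula myjobs uses is an assumption):

```shell
# Hypothetical job: 00:03:36 of total CPU time over 36 s of wall-clock on 12 cores.
cpu_seconds=216
wall_seconds=36
cores=12

# Integer-percentage utilization under the assumed formula.
util=$((100 * cpu_seconds / (wall_seconds * cores)))
echo "${util}% utilization"
```

A sustained value far below 100% usually means the job requested more cores than it can keep busy.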


scancel

Use scancel to terminate a submitted job:

$ scancel 161182774

scancel option     Description
<job-ID>           Kill the job with the given ID
-n <jobname>       Kill all jobs named jobname
-p <partition>     Restrict the operation to jobs in the given partition
-t <state>         Restrict the operation to jobs in the given state
-i                 Interactive mode: ask for confirmation before cancelling each job
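
scancel also combines naturally with squeue in scripts, for example to cancel every pending job at once. The sketch below works on captured sample output (job IDs are made up) and prefixes scancel with echo so it is safe to run anywhere; on the cluster you would drop the echo:

```shell
# Captured squeue output; on a real cluster, pipe `squeue -h -t PENDING` instead.
sample='JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
6037416 normal.4h myjob nmarouni PD 0:00 1 (None)
6037417 gpu.24h myjob nmarouni R 0:03 1 eu-g5-047-1'

# Extract the IDs of pending (PD) jobs and hand them to scancel via xargs.
# `echo scancel` only prints the command; remove `echo` to actually cancel.
echo "$sample" | awk 'NR>1 && $5=="PD" {print $1}' | xargs -r echo scancel
```

Note that `scancel -t PENDING` alone already cancels all of your pending jobs; the pipeline form is mainly useful when you need a custom selection.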



