Job monitoring with SLURM
From ScientificComputing
The most frequently used job monitoring commands are squeue, myjobs and scancel.
squeue
After submission, a job waits in a queue until it can run on a compute node. During this time it has the PD (i.e. pending) status.
    $ squeue
    JOBID    PARTITION  NAME   USER      ST  TIME  NODES  NODELIST(REASON)
    6037416  normal.4h  myjob  nmarouni  PD  0:00  1      (None)
When the job is running on a compute node, it has the R (i.e. running) status.
    $ squeue
    JOBID    PARTITION  NAME   USER      ST  TIME  NODES  NODELIST(REASON)
    6037416  normal.4h  myjob  nmarouni  R   0:03  1      eu-g5-047-1
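The job state shown in the ST column can also be read from a script. The following is a minimal sketch, not an official tool: it assumes the standard squeue flags -h (no header) and -o "%T" (long-form state), and the function names job_state and wait_until_started as well as the 30-second default interval are illustrative choices.

```shell
# job_state JOBID
# Print the long-form state of a job (e.g. PENDING, RUNNING).
# Prints nothing if squeue no longer lists the job.
job_state() {
    squeue -h -j "$1" -o "%T" 2>/dev/null
}

# wait_until_started JOBID [INTERVAL_SECONDS]
# Poll squeue until the job is no longer PENDING, then print the last
# state that was seen (NOT_IN_QUEUE if squeue returned nothing).
wait_until_started() {
    local jobid=$1
    local interval=${2:-30}   # assumed default; keep polling infrequent
    local state
    state=$(job_state "$jobid")
    while [ "$state" = "PENDING" ]; do
        sleep "$interval"
        state=$(job_state "$jobid")
    done
    echo "${state:-NOT_IN_QUEUE}"
}
```

For example, `wait_until_started 6037416 60` would check once per minute and return as soon as the job leaves the pending state. Polling the scheduler very frequently puts load on it, so a generous interval is the safer choice.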
squeue options | Description |
---|---|
(no option) | list all your jobs in all queues |
-t STATE | list only jobs with a specified STATE. Valid job states include (but are not limited to): PENDING, RUNNING, SUSPENDED, COMPLETED, CANCELLED, FAILED, TIMEOUT |
-o "format" | use custom output format (see SLURM documentation for details) |
-j <job_id_list> | show only the job(s) with the given job IDs, passed as a comma-separated list |
-p partition | show only jobs in a specific partition (queue) |
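The options above can be combined. As a hedged sketch, the helper below summarizes how many jobs are in each state by asking squeue for the state column only (-h suppresses the header, -o "%T" prints the long-form state) and counting the lines. The function name count_jobs_by_state is hypothetical; any extra arguments, such as -p partition, are passed through to squeue unchanged.

```shell
# count_jobs_by_state [EXTRA_SQUEUE_OPTIONS...]
# Print one line per job state, followed by the number of jobs in it.
count_jobs_by_state() {
    squeue -h -o "%T" "$@" |
        awk '{ count[$1]++ } END { for (s in count) print s, count[s] }' |
        sort
}
```

Calling `count_jobs_by_state -p normal.4h` would restrict the summary to one partition, and `count_jobs_by_state -t PENDING,RUNNING` to the listed states.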
myjobs
myjobs displays job information in a more human-friendly form than squeue. Below are examples for a job in the PENDING and in the RUNNING state.
PENDING status
    $ myjobs -j 6038307
    Job information
     Job ID                          : 6038307
     Status                          : PENDING
     Running on node                 : None assigned
     User                            : nmarounina
     Shareholder group               : es_cdss
     Slurm partition (queue)         : gpu.24h
     Command                         : script.sbatch
     Working directory               : /cluster/home/nmarounina
    Requested resources
     Requested runtime               : 08:00:00
     Requested cores (total)         : 12
     Requested nodes                 : 1
     Requested memory (total)        : 120000 MiB
    Job history
     Submitted at                    : 2023-01-09T15:56:09
     Started at                      : Job did not start yet
     Queue waiting time              : 8 s
    Resource usage
     Wall-clock                      :
     Total CPU time                  : -
     CPU utilization                 : - %
     Total resident memory           : - MiB
     Resident memory utilization     : - %
RUNNING status
    $ myjobs -j 6038307
    Job information
     Job ID                          : 6038307
     Status                          : RUNNING
     Running on node                 : eu-g3-022
     User                            : nmarounina
     Shareholder group               : es_cdss
     Slurm partition (queue)         : gpu.24h
     Command                         : script.sbatch
     Working directory               : /cluster/home/nmarounina
    Requested resources
     Requested runtime               : 08:00:00
     Requested cores (total)         : 12
     Requested nodes                 : 1
     Requested memory (total)        : 120000 MiB
    Job history
     Submitted at                    : 2023-01-09T15:56:09
     Started at                      : 2023-01-09T15:56:38
     Queue waiting time              : 29 s
    Resource usage
     Wall-clock                      : 00:00:36
     Total CPU time                  : 00:00:00
     CPU utilization                 : 0%
     Total resident memory           : 2.94 MiB
     Resident memory utilization     : 0%
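Because myjobs prints one "name : value" pair per line, a single field can be extracted in a script. The sketch below assumes exactly the layout shown in the examples above and may need adjusting if the output format differs. The function name myjobs_field is hypothetical. The line is split at the first colon only, so values that contain colons themselves (such as timestamps) stay intact.

```shell
# myjobs_field JOBID FIELD_NAME
# Print the value of one field from the myjobs output, e.g. "Status".
# Prints nothing if the field is not found.
myjobs_field() {
    local jobid=$1
    local field=$2
    myjobs -j "$jobid" | awk -v key="$field" '
        {
            pos = index($0, ":")          # first colon separates name and value
            if (pos == 0) next            # section titles have no colon
            name  = substr($0, 1, pos - 1)
            value = substr($0, pos + 1)
            gsub(/^[ \t]+|[ \t]+$/, "", name)    # trim surrounding whitespace
            gsub(/^[ \t]+|[ \t]+$/, "", value)
            if (name == key) { print value; exit }
        }'
}
```

With the running job from the example, `myjobs_field 6038307 "Running on node"` would be expected to print the node name.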
scancel
Use scancel to cancel a submitted job, whether it is still pending or already running:
$ scancel 161182774
scancel options | Description |
---|---|
job-ID | cancel the job with this job ID |
-n jobname | cancel all jobs named jobname |
-p partition | restrict the scancel operation to jobs in this partition |
-t state | restrict the scancel operation to jobs in this state |
-i | interactive mode: ask for confirmation before performing the cancel operation |
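The filter options are most useful in combination. As a cautious sketch, the helper below first lists the pending jobs in one partition with squeue and then cancels them with scancel in interactive mode, so that each cancellation has to be confirmed. The function name cancel_pending_in_partition is hypothetical; only flags listed in the tables above are used.

```shell
# cancel_pending_in_partition PARTITION
# Show the pending jobs in PARTITION, then cancel them one by one,
# asking for confirmation each time.
cancel_pending_in_partition() {
    local partition=$1
    # List the affected jobs first, so the prompts from -i have context.
    squeue -t PENDING -p "$partition"
    # -i asks for confirmation before each cancel operation.
    scancel -i -t PENDING -p "$partition"
}
```

For example, `cancel_pending_in_partition gpu.24h` would leave running jobs untouched and only offer to cancel jobs that have not started yet.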
Further reading