Using the batch system
Contents
Command Summary
Please find below a table with commands for job submission, monitoring and control
Command | Description |
---|---|
sbatch | Submit scripts to Slurm |
scancel | Kill a job |
srun | Run a parallel job within Slurm (e.g. create a job or do it within the current one) |
squeue | View job and job step information for jobs managed by Slurm |
scontrol | Display information about the resource usage of a job |
sstat | Display the status information of a running job/step |
sacct | Displays accounting data for all jobs and job steps in the Slurm job accounting log or Slurm database |
myjobs | Job information in human readable format |
Introduction
On our HPC cluster, we use the Slurm (Simple Linux Utility for Resource Management) batch system. A basic knowledge of Slurm is required if you would like to work on the HPC clusters of ETH. The present article will show you how to use Slurm to execute simple batch jobs and give you an overview of some advanced features that can dramatically increase your productivity on a cluster.
Using a batch system has numerous advantages:
- single system image — all computing resources in the cluster can be accessed from a single point
- load balancing — the workload is automatically distributed across all available processor cores
- exclusive use — many computations can be executed at the same time without affecting each other
- prioritization — computing resources can be dedicated to specific applications or people
- fair share — a fair allocation of those resources among all users is guaranteed
In fact, our HPC clusters contain so many cores (130,000) and are used by so many people (more than 3,200) that it would be impossible to use them efficiently without a batch system.
All computations on our HPC cluster must be submitted to the batch system. Please do not run any job interactively on the login nodes, except for testing or debugging purposes.
If you are a member of multiple shareholder groups, then please have a look at our wiki page about working in multiple shareholder groups.
Basic job submission
We provide a helper tool to facilitate setting up submission commands and/or job scripts for Slurm and LSF:
Slurm/LSF Submission Line Advisor
You can specify the resources required by your job as well as the command, and the tool will output the corresponding Slurm/LSF submission command or job script, depending on your choice.
Slurm provides two different ways of submitting jobs. While we first show the solution with --wrap, we strongly recommend using scripts as described in the section Job scripts. A script requires a bit more work to set up, but comes with some major advantages:
- Better reproducibility
- Easier and faster handover (which includes the cluster support when you need our help)
- Modules can be loaded directly within the script
Simple commands and programs
Submitting a job to the batch system is as easy as:
sbatch --wrap="command [arguments]"
sbatch --wrap="/path/to/program [arguments]"
Examples:
[sfux@eu-login-03 ~]$ sbatch --wrap="gzip big_file.dat"
Submitted batch job 1010113
[sfux@eu-login-03 ~]$ sbatch --wrap="./hello_world"
Submitted batch job 1010171
Two or more commands can be combined together by enclosing them in quotes:
sbatch --wrap="command1; command2"
Example:
[sfux@eu-login-03 ~]$ sbatch --wrap "configure; make; make install"
Submitted batch job 1010213
Quotes are also necessary if you want to use I/O redirection (">", "<"), pipes ("|") or conditional operators ("&&", "||"):
sbatch --wrap="command < data.in > data.out"
sbatch --wrap="command1 | command2"
Examples:
[sfux@eu-login-03 ~]$ sbatch --wrap="tr ',' '\n' < comma_separated_list > linebreak_separated_list"
Submitted batch job 1010258
[sfux@eu-login-03 ~]$ sbatch --wrap="cat unsorted_list_with_redundant_entries | sort | uniq > sorted_list"
Submitted batch job 1010272
Shell scripts
More complex commands may be placed in a shell script, which should then be submitted like this:
sbatch < script
sbatch script
Example:
[sfux@eu-login-03 ~]$ sbatch < hello.sh
Submitted batch job 1010279
Output file
By default your job's output and error messages (or stdout and stderr, to be precise) are combined and written into a file named slurm-JobID.out in the directory where you executed sbatch, where JobID is the number assigned to your job by Slurm. You can select a different output file using the option:
sbatch --output=output_file --open-mode=append --wrap="command [argument]"
The option --output output_file in combination with --open-mode=append tells Slurm to append your job's output to output_file. If you want to overwrite this file, use:
sbatch --output output_file --open-mode=truncate --wrap="command [argument]"
Note that this option, like all sbatch options, must be placed before the command that you want to execute in your job. A common mistake is to place sbatch options in the wrong place, like:
sbatch --wrap=command -o output_file ← WRONG!
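For comparison, a correct version of this submission places all sbatch options before --wrap:

sbatch --output=output_file --wrap="command [argument]"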
Error file
It is also possible to store the stderr of a job in a separate file (and again, you can choose with the --open-mode parameter whether you would like to append or overwrite):
sbatch --error=error_file --open-mode=append --wrap "command [argument]"
Queues
Slurm uses different queues (partitions) to manage the scheduling of jobs. As a user, you do not need to specify which queue to use, as it is automatically picked by Slurm when you submit the job.
Resource requirements
By default, a batch job can use only one core for up to 1 hour. (The job is killed when it reaches its run-time limit.) If your job needs more resources — time, cores, memory or scratch space —, you must request them when you submit it.
Wall-clock time
The time limits on our clusters are always based on wall-clock (or elapsed) time. You can specify the amount of time needed by your job with several formats using the option:
sbatch --time=minutes ...                        example: sbatch --time=10 ...
sbatch --time=minutes:seconds ...                example: sbatch --time=10:50 ...
sbatch --time=hours:minutes:seconds ...          example: sbatch --time=5:10:50 ...
sbatch --time=days-hours ...                     example: sbatch --time=1-5 ...
sbatch --time=days-hours:minutes ...             example: sbatch --time=1-5:10 ...
sbatch --time=days-hours:minutes:seconds ...     example: sbatch --time=1-5:10:50 ...
Examples:
[sfux@eu-login-03 ~]$ sbatch --time=20 --wrap="./Riemann_zeta -arg 26"
Submitted batch job 1010305
[sfux@eu-login-03 ~]$ sbatch --time=20:00 --wrap="./solve_Koenigsberg_bridge_problem"
Submitted batch job 1010312
Since our clusters contain processor cores with different speeds, two similar jobs will not necessarily take the same time to complete. It is therefore safer to request more time than strictly necessary... but not too much, as shorter jobs generally have a higher priority than longer ones.
The maximum run time for jobs that can run on most compute nodes in the cluster is 360 hours. We reserve the right to stop jobs with a run time of more than 5 days in case of an emergency maintenance.
Number of processor cores
If your job requires multiple cores (or threads), you must request them using the option:
sbatch --ntasks=number_of_cores --wrap="..."
or
sbatch --ntasks=1 --cpus-per-task=number_of_cores --wrap="..."
Please make sure to check the paragraph about parallel job submission before requesting multiple cores.
Note that merely requesting multiple cores does not mean that your application will use them.
Memory
By default the batch system allocates 1024 MB (1 GB) of memory per processor core. A single-core job will thus get 1 GB of memory; a 4-core job will get 4 GB; and a 16-core job, 16 GB. If your computation requires more memory, you must request it when you submit your job:
sbatch --mem-per-cpu=XXX ...
where XXX is an integer. The default unit is MB, but you can also specify the value in GB by adding the suffix "G" after the integer value.
Example:
[sfux@eu-login-03 ~]$ sbatch --mem-per-cpu=2G --wrap="./evaluate_gamma -precision 10e-30"
Submitted batch job 1010322
Note: Users cannot request the full memory of a node, as some of the memory is reserved for the operating system of the compute node, which runs in memory. If a user for instance requests 256 GiB of memory, the job will therefore not be dispatched to a node with 256 GiB of memory, but to a node with 512 GiB of memory or more. As a general rule, jobs that request about 3% less memory than a node has can run on that node type. For instance, on a node with 256 GiB of memory, you can request up to 256*0.97 GiB = 248.32 GiB.
Scratch space
Slurm automatically creates a local scratch directory when your job starts and deletes it when the job ends. This directory has a unique name, which is passed to your job via the variable $TMPDIR.
Unlike memory, the batch system does not reserve any disk space for this scratch directory by default. If your job is expected to write large amounts of temporary data (say, more than 250 MB) into $TMPDIR — or anywhere in the local /scratch file system — you must request enough scratch space when you submit it:
sbatch --tmp=YYY ...
where YYY is the amount of scratch space needed by your job, in MB per host (there is no setting in Slurm to request it per core). You can also specify the amount in GB by adding the suffix "G" after YYY.
Example:
[sfux@eu-login-03 ~]$ sbatch --tmp=5000 --wrap="./generating_Euler_numbers -num 5000000"
Submitted batch job 1010713
Note that /tmp is reserved for the operating system. Do not write temporary data there! You should either use the directory created by Slurm ($TMPDIR) or create your own temporary directory in the local /scratch file system; in the latter case, do not forget to delete this directory at the end of your job.
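As an illustration, a minimal job script using $TMPDIR could look like the following sketch (input.dat, result.out and my_program are placeholder names, not files that exist on the cluster):

#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --time=1:00:00
#SBATCH --tmp=5000                       # request 5000 MB of local scratch per node

# copy the input data into the local scratch directory created by Slurm
cp input.dat $TMPDIR

# run the computation inside the local scratch directory
cd $TMPDIR
/path/to/my_program input.dat > result.out

# copy the results back to the directory from which the job was submitted
cp result.out $SLURM_SUBMIT_DIR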
GPU
There are GPU nodes in the Euler cluster. The GPU nodes are reserved exclusively for the shareholder groups that invested in them. Guest users and shareholders that purchased CPU nodes but no GPU nodes cannot use the GPU nodes.
All GPUs in Slurm are configured in non-exclusive process mode, such that you can run multiple processes/threads on a single GPU. Please find below the available GPU node types.
Euler
GPU Model | Slurm specifier | GPU per node | GPU memory per GPU | CPU cores per node | System memory per node | CPU cores per GPU | System memory per GPU | Compute capability | Minimal CUDA version required |
---|---|---|---|---|---|---|---|---|---|
NVIDIA GeForce RTX 2080 Ti | rtx_2080_ti | 8 | 11 GiB | 36 | 384 GiB | 4.5 | 48 GiB | 7.5 | 10.0 |
NVIDIA GeForce RTX 2080 Ti | rtx_2080_ti | 8 | 11 GiB | 128 | 512 GiB | 16 | 64 GiB | 7.5 | 10.0 |
NVIDIA GeForce RTX 3090 | rtx_3090 | 8 | 24 GiB | 128 | 512 GiB | 16 | 64 GiB | 8.6 | 11.0 |
NVIDIA GeForce RTX 4090 | rtx_4090 | 8 | 24 GiB | 128 | 512 GiB | 16 | 64 GiB | 8.9 | 11.8 |
NVIDIA TITAN RTX | titan_rtx | 8 | 24 GiB | 128 | 512 GiB | 16 | 64 GiB | 7.5 | 10.0 |
NVIDIA Quadro RTX 6000 | quadro_rtx_6000 | 8 | 24 GiB | 128 | 512 GiB | 8 | 64 GiB | 7.5 | 10.0 |
NVIDIA Tesla V100-SXM2 32 GiB | v100 | 8 | 32 GiB | 48 | 768 GiB | 6 | 96 GiB | 7.0 | 9.0 |
NVIDIA Tesla V100-SXM2 32 GB | v100 | 8 | 32 GiB | 40 | 512 GiB | 5 | 64 GiB | 7.0 | 9.0 |
Nvidia Tesla A100 (40 GiB) | a100-pcie-40gb | 8 | 40 GiB | 48 | 768 GiB | 6 | 96 GiB | 8.0 | 11.0 |
Nvidia Tesla A100 (80 GiB) | a100_80gb | 10 | 80 GiB | 48 | 1024 GiB | 4.8 | 96 GiB | 8.0 | 11.0 |
You can request one or more GPUs with the command
sbatch --gpus=number_of_GPUs ...
To run multi-node GPU jobs, you need to use the option --gpus-per-node:
sbatch --gpus-per-node=2 ...
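If you need a particular GPU model, the standard Slurm type:count syntax can be combined with the Slurm specifiers listed in the table above; a sketch (./my_gpu_program is a placeholder):

sbatch --gpus=rtx_3090:2 --wrap="./my_gpu_program"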
For advanced settings, please have a look at our getting started with GPUs page.
Interactive jobs
If you just want to run a quick test, you can submit it as a batch interactive job. In this case the job's output is not written into a file, but directly to your terminal, as if it were executed interactively:
srun --pty bash -l
Please note that the bash option -l is required to start a login shell.
Example:
[sfux@eu-login-35 ~]$ srun --pty bash -l
srun: job 2040660 queued and waiting for resources
srun: job 2040660 has been allocated resources
[sfux@eu-a2p-515 ~]$
For interactive jobs with X11 forwarding, you need to make sure that you have logged in to the cluster with X11 forwarding enabled; then you can run
srun [Slurm options] --x11 --pty bash -l
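The [Slurm options] placeholder accepts the same resource options as sbatch. For example, an interactive session with 4 cores, 2 GB of memory per core and a run time of one hour could be requested with a command along these lines:

srun --ntasks=1 --cpus-per-task=4 --mem-per-cpu=2G --time=1:00:00 --pty bash -l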
Parallel job submission
Before submitting parallel jobs, please make sure that your application can run in parallel at all, in order not to waste resources by requesting multiple cores for a serial application. Furthermore, please do a short scaling analysis to see how well your code scales in parallel before requesting dozens or hundreds of cores.
OpenMP
If your application is parallelized using OpenMP or linked against a library using OpenMP (Intel MKL, OpenBLAS, etc.), the number of processor cores (or threads) that it can use is controlled by the environment variable OMP_NUM_THREADS. This variable must be set before you submit your job:
export OMP_NUM_THREADS=number_of_cores
sbatch --ntasks=1 --cpus-per-task=number_of_cores --wrap="..."
NOTE: if OMP_NUM_THREADS is not set, your application will either use only one core, or it will attempt to use all cores that it can find. As you are restricted to your job's resources, all threads will be bound to the cores allocated to your job. Starting more than one thread per core will slow down your application, as the threads will be fighting for time on the CPU.
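Putting this together, a minimal OpenMP job script could look like the following sketch (./my_openmp_program is a placeholder); using $SLURM_CPUS_PER_TASK keeps the number of threads consistent with the number of requested cores:

#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --time=1:00:00
#SBATCH --mem-per-cpu=2000

# use exactly as many OpenMP threads as cores allocated by Slurm
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
./my_openmp_program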
MPI
Three kinds of MPI libraries are available on our cluster: Open MPI (recommended), Intel MPI and MVAPICH2. Before you can submit and execute an MPI job, you must load the corresponding modules (compiler + MPI, in that order):
module load compiler
module load mpi_library
The command used to launch an MPI application is mpirun.
Let's assume for example that hello_world was compiled with GCC 6.3.0 and linked with Open MPI 4.1.4. The command to execute this job on 4 cores is:
module load gcc/6.3.0
module load open_mpi/4.1.4
sbatch -n 4 --wrap="mpirun ./hello_world"
Note that mpirun automatically uses all cores allocated to the job by Slurm. It is therefore not necessary to indicate this number again to the mpirun command itself:
sbatch --ntasks=4 --wrap="mpirun -np 4 ./hello_world" ← "-np 4" not needed!
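For reference, the same MPI job can also be written as a job script (a sketch; see the section Job scripts below):

#!/bin/bash
#SBATCH --ntasks=4
#SBATCH --time=1:00:00

module load gcc/6.3.0
module load open_mpi/4.1.4

# mpirun automatically starts one rank per task allocated by Slurm
mpirun ./hello_world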
Pthreads and other threaded applications
Their behavior is similar to OpenMP applications. It is important to limit the number of threads that the application spawns. There is no standard way to do this, so be sure to check the application's documentation on how to do this. Usually a program supports at least one of four ways to limit itself to N threads:
- it understands the OMP_NUM_THREADS=N environment variable,
- it has its own environment variable, such as GMX_NUM_THREADS=N for Gromacs,
- it has a command-line option, such as -nt N (for Gromacs), or
- it has an input-file option, such as num_threads N.
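For instance, assuming a program that accepts an -nt option to limit its thread count (as Gromacs does), a submission could look like this sketch (./threaded_program is a placeholder):

sbatch --ntasks=1 --cpus-per-task=8 --wrap="./threaded_program -nt 8"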
If you are unsure about the program's behavior, please contact us and we will analyze it.
Hybrid jobs
It is possible to run hybrid jobs that mix MPI and OpenMP on our HPC clusters, but this requires a more advanced knowledge of Slurm and the hardware.
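As a rough sketch only (not a tuned configuration), a hybrid job with 4 MPI ranks and 4 OpenMP threads per rank could be requested as follows (./hybrid_program is a placeholder):

#!/bin/bash
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=4
#SBATCH --time=4:00:00

module load gcc/6.3.0
module load open_mpi/4.1.4

# one OpenMP thread per core allocated to each MPI rank
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
mpirun ./hybrid_program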
Job scripts
You can also use a job script to specify all sbatch options using #SBATCH pragmas. We strongly recommend loading the modules within the submission script in order to improve reproducibility.
#!/bin/bash

#SBATCH -n 4
#SBATCH --time=8:00
#SBATCH --mem-per-cpu=2000
#SBATCH --tmp=4000                        # per node!!
#SBATCH --job-name=analysis1
#SBATCH --output=analysis1.out
#SBATCH --error=analysis1.err

module load xyz/123
command1
command2
The script can then be submitted as
sbatch < script
or
sbatch script
Job monitoring
This section is still work in progress.
squeue
The squeue command allows you to get information about pending, running and recently finished jobs.
[sfux@eu-login-41 ~]$ squeue
  JOBID  PARTITION  NAME  USER  ST  TIME  NODES  NODELIST(REASON)
1433323  normal.4h  wrap  sfux  PD  0:04      1  eu-g1-026-2
1433322  normal.4h  wrap  sfux   R  0:11      1  eu-a2p-483
You can also check only for running jobs (R) or for pending jobs (PD):
[sfux@eu-login-41 ~]$ squeue -t RUNNING
  JOBID  PARTITION  NAME  USER  ST  TIME  NODES  NODELIST(REASON)
1433322  normal.4h  wrap  sfux   R  0:28      1  eu-a2p-483
[sfux@eu-login-41 ~]$ squeue -t PENDING
  JOBID  PARTITION  NAME  USER  ST  TIME  NODES  NODELIST(REASON)
1433323  normal.4h  wrap  sfux  PD  0:21      1  eu-g1-026-2
[sfux@eu-login-41 ~]$
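squeue can also inspect a single job or print custom columns; for example (the job ID is the one from the listing above, and the -o format string is only an illustration):

squeue -j 1433322
squeue -u sfux -o "%.10i %.12P %.20j %.8T %.10M %R"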
An overview of all squeue options is available in the squeue documentation:
https://slurm.schedmd.com/squeue.html
scontrol
The scontrol command is one of several commands that allow you to check information about a running job:
[sfux@eu-login-15 ~]$ scontrol show jobid -dd 1498523
JobId=1498523 JobName=wrap
   UserId=sfux(40093) GroupId=sfux-group(104222) MCS_label=N/A
   Priority=1769 Nice=0 Account=normal/es_hpc QOS=es_hpc/normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   DerivedExitCode=0:0
   RunTime=00:00:38 TimeLimit=01:00:00 TimeMin=N/A
   SubmitTime=2022-10-27T11:44:30 EligibleTime=2022-10-27T11:44:30
   AccrueTime=2022-10-27T11:44:30
   StartTime=2022-10-27T11:44:31 EndTime=2022-10-27T12:44:31 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2022-10-27T11:44:31 Scheduler=Main
   Partition=normal.4h AllocNode:Sid=eu-login-15:26645
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=eu-a2p-528
   BatchHost=eu-a2p-528
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=1G,node=1,billing=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
   JOB_GRES=(null)
     Nodes=eu-a2p-528 CPU_IDs=127 Mem=1024 GRES=
   MinCPUsNode=1 MinMemoryCPU=1G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/cluster/home/sfux
   StdErr=/cluster/home/sfux/slurm-1498523.out
   StdIn=/dev/null
   StdOut=/cluster/home/sfux/slurm-1498523.out
   Power=
Squeue States
Job Status | Description |
---|---|
QOSMaxCpuPerUserLimit | You are using more CPUs than allowed by your share. You can either cancel some running jobs or wait until they are finished. |
QOSMaxMemoryPerUser | You are using more RAM than allowed by your share. You can either cancel some running jobs or wait until they are finished. |
QOSMaxGRESPerUser | You are using more generic resources (e.g. GPUs) than allowed by your share. You can either cancel some running jobs or wait until they are finished. |
PartitionDown | If a maintenance is about to start, some partitions will not be available. Otherwise, it might be an issue on our side. |
Priority | Your job is scheduled, but some other jobs with a higher priority (e.g. that have been longer in the queue) are scheduled before yours. |
ReqNodeNotAvail | Your job requirements cannot match any available nodes. Either wait until some resources are available or reduce your restriction (e.g. RAM, cores, GPU type, GPU RAM, ...). |
Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions | Your job requirements cannot match any available nodes. Either wait until some resources are available or reduce your restriction (e.g. RAM, cores, GPU type, GPU RAM, ...). |
Resources | Your job is waiting for resources to be available. |
InvalidAccount | If you did not specify the account when submitting your job, you might have been removed from a share. Please try to log out and log in again to update the share information. Otherwise, please check with `my_share_info` that the account name is valid. |
PartitionTimeLimit | Your job requests more time than available for the partition. |
JobArrayTaskLimit | Too many jobs of the array are already running; waiting will solve the issue. |
sstat
You can use the sstat command to display information about your running jobs, for instance resources like CPU time (MinCPU) and memory usage (MaxRSS):
[sfux@eu-login-35 ~]$ sstat --all --format JobID,NTasks,MaxRSS,MinCPU -j 2039738
JobID          NTasks     MaxRSS     MinCPU
------------ -------- ---------- ----------
2039738.ext+        1          0   00:00:00
2039738.bat+        1    886660K   00:07:14
An overview of all available fields for the format option is provided in the sstat documentation:
https://slurm.schedmd.com/sstat.html
sacct
The sacct command allows users to check information on running or finished jobs.
[sfux@eu-login-35 ~]$ sacct --format JobID,User,State,AllocCPUS,Elapsed,NNodes,NTasks,ReqMem,ExitCode
JobID             User      State  AllocCPUS    Elapsed   NNodes   NTasks     ReqMem ExitCode
------------ --------- ---------- ---------- ---------- -------- -------- ---------- --------
2039738           sfux    RUNNING          4   00:06:01        1                  8G      0:0
2039738.bat+             RUNNING          4   00:06:01        1        1                 0:0
2039738.ext+             RUNNING          4   00:06:01        1        1                 0:0
[sfux@eu-login-35 ~]$
An overview of all format fields for sacct is available in the documentation:
https://slurm.schedmd.com/sacct.html
Please note that the CPU time (TotalCPU) and memory usage (MaxRSS) are only displayed correctly for finished jobs. If you check these properties for running jobs, they will just show 0. For checking the CPU time and memory usage of running jobs, please use sstat.
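For example, to check the CPU time and peak memory usage of a finished job, you could run something like:

sacct -j 2039738 --format=JobID,State,Elapsed,TotalCPU,MaxRSS,ExitCode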
myjobs
We are working on providing a bbjobs-like wrapper for monitoring Slurm jobs. The wrapper script is called myjobs and accepts a single option -j to specify the job ID.
- Please note that the script only works correctly for simple jobs without additional job steps
- Please note that the CPU efficiency for multi-node jobs displayed by myjobs is not correct (sstat, which is used to get the CPU time of a running job, only reports the CPU time of the first node).
The script is still work in progress and we try to improve it continuously.
[sfux@eu-login-39 ~]$ myjobs -j 2647208
Job information
 Job ID                        : 2647208
 Status                        : RUNNING
 Running on node               : eu-a2p-277
 User                          : sfux
 Shareholder group             : es_hpc
 Slurm partition (queue)       : normal.24h
 Command                       : sbatch --ntasks=4 --time=4:30:00 --mem-per-cpu=2g
 Working directory             : /cluster/home/sfux/testrun/adf/2021_test
Requested resources
 Requested runtime             : 04:30:00
 Requested cores (total)       : 4
 Requested nodes               : 1
 Requested memory (total)      : 8192 MiB
 Requested scratch (per node)  : #not yet implemented#
Job history
 Submitted at                  : 2022-11-18T11:10:37
 Started at                    : 2022-11-18T11:10:37
 Queue waiting time            : 0 sec
Resource usage
 Wall-clock                    : 00:10:34
 Total CPU time                : 00:41:47
 CPU utilization               : 98.85%
 Total resident memory         : 1135.15 MiB
 Resident memory utilization   : 13.85%
[sfux@eu-login-39 ~]$
We are still working on implementing some missing features like displaying the requested local scratch and Sys/Kernel time.
If you would like to get the myjobs output for all your jobs in the queue (pending/running), you can omit the job ID parameter:
myjobs
For displaying only information about pending jobs, you can use
myjobs -p
For displaying only information about running jobs, you can use
myjobs -r
Please note that these commands might not work for job arrays.
scancel
You can use the scancel command to cancel jobs.
[sfux@eu-login-15 ~]$ squeue
  JOBID  PARTITION  NAME    USER  ST  TIME  NODES  NODELIST(REASON)
1525589  normal.24  sbatch  sfux   R  0:11      1  eu-a2p-373
[sfux@eu-login-15 ~]$ scancel 1525589
[sfux@eu-login-15 ~]$ squeue
  JOBID  PARTITION  NAME    USER  ST  TIME  NODES  NODELIST(REASON)
[sfux@eu-login-15 ~]$
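scancel can also operate on several jobs at once; some standard options, shown as a sketch:

scancel -u sfux                    # cancel all of your jobs
scancel --name=analysis1           # cancel all jobs with a given job name
scancel -t PENDING -u sfux         # cancel only your pending jobs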
bjob_connect
Sometimes it is necessary to monitor a job on the node(s) where it is running. On Euler, compute nodes cannot be accessed directly via ssh. To access a node where a job is running, the tool srun should be used. You can connect to one of your running jobs with srun:
srun --interactive --jobid JOBID --pty bash
where you need to replace JOBID with the id of your batch job. For jobs running on multiple nodes, you can use --nodelist=NODE to pick one.
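Once attached, you can use standard tools to monitor your processes on that node, for example:

top -u $USER                       # inspect the CPU and memory usage of your processes on the node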