Using the batch system

Introduction

On our HPC cluster, we use the Slurm (Simple Linux Utility for Resource Management) batch system. A basic knowledge of Slurm is required if you would like to work on the HPC clusters of ETH. The present article will show you how to use Slurm to execute simple batch jobs and give you an overview of some advanced features that can dramatically increase your productivity on a cluster.

Using a batch system has numerous advantages:

  • single system image — all computing resources in the cluster can be accessed from a single point
  • load balancing — the workload is automatically distributed across all available processor cores
  • exclusive use — many computations can be executed at the same time without affecting each other
  • prioritization — computing resources can be dedicated to specific applications or people
  • fair share — a fair allocation of those resources among all users is guaranteed

In fact, our HPC clusters contain so many cores (130,000) and are used by so many people (more than 3,200) that it would be impossible to use them efficiently without a batch system.

All computations on our HPC cluster must be submitted to the batch system. Please do not run any job interactively on the login nodes, except for testing or debugging purposes.

If you are a member of multiple shareholder groups, then please have a look at our wiki page about working in multiple shareholder groups.

Basic job submission

We provide a helper tool to facilitate setting up the submission command and/or job script for Slurm and LSF:

Slurm/LSF Submission Line Advisor

You can specify the resources required by your job and the command to run, and the tool will output the corresponding Slurm/LSF submission command or job script, depending on your choice.

Slurm provides two different ways of submitting jobs. While we first show the solution with --wrap, we strongly recommend using scripts as indicated in the section Job scripts. Scripts require a bit more work to set up, but come with some major advantages:

  • Better reproducibility
  • Easier and faster handover (which includes the cluster support when you need our help)
  • Modules can be loaded directly within the script

Simple commands and programs

Submitting a job to the batch system is as easy as:

sbatch --wrap="command [arguments]"
sbatch --wrap="/path/to/program [arguments]"

Examples:

[sfux@eu-login-03 ~]$ sbatch --wrap="gzip big_file.dat"
Submitted batch job 1010113
[sfux@eu-login-03 ~]$ sbatch --wrap="./hello_world"
Submitted batch job 1010171

Two or more commands can be combined together by enclosing them in quotes:

sbatch --wrap="command1; command2"

Example:

[sfux@eu-login-03 ~]$ sbatch --wrap="configure; make; make install"
Submitted batch job 1010213

Quotes are also necessary if you want to use I/O redirection (">", "<"), pipes ("|") or conditional operators ("&&", "||"):

sbatch --wrap="command < data.in > data.out"
sbatch --wrap="command1 | command2"

Examples:

[sfux@eu-login-03 ~]$ sbatch --wrap="tr ',' '\n' < comma_separated_list > linebreak_separated_list"
Submitted batch job 1010258
[sfux@eu-login-03 ~]$ sbatch --wrap="cat unsorted_list_with_redundant_entries | sort | uniq > sorted_list"
Submitted batch job 1010272

Shell scripts

More complex commands may be placed in a shell script, which should then be submitted like this:

sbatch < script
sbatch script

Example:

[sfux@eu-login-03 ~]$ sbatch < hello.sh
Submitted batch job 1010279

Output file

By default your job's output and error messages (or stdout and stderr, to be precise) are combined and written into a file named slurm-JobID.out in the directory where you executed sbatch, where JobID is the number assigned to your job by Slurm. You can select a different output file using the option:

sbatch --output=output_file --open-mode=append --wrap="command [argument]" 

The option --output output_file in combination with --open-mode=append tells Slurm to append your job's output to output_file. If you want to overwrite this file, use:

sbatch --output output_file --open-mode=truncate --wrap="command [argument]"

Note that this option, like all sbatch options, must be placed before the command that you want to execute in your job. A common mistake is to place sbatch options in the wrong place, like:

sbatch --wrap=command -o output_file          ←  WRONG!

Error file

It is also possible to store the stderr of a job in a separate file (and again, you can choose with the --open-mode parameter whether you would like to append or overwrite):

sbatch --error=error_file --open-mode=append --wrap "command [argument]"
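
For example, to write stdout and stderr to two separate files that are overwritten on each run (the file names and my_program are placeholders):

sbatch --output=analysis.out --error=analysis.err --open-mode=truncate --wrap="./my_program"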

Resource requirements

By default, a batch job can use only one core for up to 1 hour. (The job is killed when it reaches its run-time limit.) If your job needs more resources — time, cores, memory or scratch space —, you must request them when you submit it.

Wall-clock time

The time limits on our clusters are always based on wall-clock (or elapsed) time. You can specify the amount of time needed by your job in several formats using the --time option:

sbatch --time=minutes ...                        example:  sbatch --time=10 ...
sbatch --time=minutes:seconds ...                example:  sbatch --time=10:50 ...
sbatch --time=hours:minutes:seconds ...          example:  sbatch --time=5:10:50 ...
sbatch --time=days-hours ...                     example:  sbatch --time=1-5 ...
sbatch --time=days-hours:minutes ...             example:  sbatch --time=1-5:10 ...
sbatch --time=days-hours:minutes:seconds ...     example:  sbatch --time=1-5:10:50 ...

Examples:

[sfux@eu-login-03 ~]$ sbatch --time=20 --wrap="./Riemann_zeta -arg 26"
Submitted batch job 1010305
[sfux@eu-login-03 ~]$ sbatch --time=20:00 --wrap="./solve_Koenigsberg_bridge_problem"
Submitted batch job 1010312

Since our clusters contain processor cores with different speeds, two similar jobs will not necessarily take the same time to complete. It is therefore safer to request more time than strictly necessary... but not too much, since shorter jobs generally have a higher priority than longer ones.

The maximum run-time for jobs that can run on most compute nodes in the cluster is 360 hours. We reserve the right to stop jobs with a run time of more than 5 days in case of an emergency maintenance.

Number of processor cores

If your job requires multiple cores (or threads), you must request them using the option:

sbatch --ntasks=number_of_cores --wrap="..."

or

sbatch --ntasks=1 --cpus-per-task=number_of_cores --wrap="..."

Please make sure to check the paragraph about parallel job submission before requesting multiple cores.

Note that merely requesting multiple cores does not mean that your application will use them.

Memory

By default the batch system allocates 1024 MB (1 GB) of memory per processor core. A single-core job will thus get 1 GB of memory; a 4-core job will get 4 GB; and a 16-core job, 16 GB. If your computation requires more memory, you must request it when you submit your job:

sbatch --mem-per-cpu=XXX ...

where XXX is an integer. The default unit is MB, but you can also specify the value in GB by adding the suffix "G" after the integer value.

Example:

[sfux@eu-login-03 ~]$ sbatch --mem-per-cpu=2G --wrap="./evaluate_gamma -precision 10e-30"
Submitted batch job 1010322

Scratch space

Slurm automatically creates a local scratch directory when your job starts and deletes it when the job ends. This directory has a unique name, which is passed to your job via the variable $TMPDIR.

Unlike memory, the batch system does not reserve any disk space for this scratch directory by default. If your job is expected to write large amounts of temporary data (say, more than 250 MB) into $TMPDIR — or anywhere in the local /scratch file system — you must request enough scratch space when you submit it:

sbatch --tmp=YYY ...

where YYY is the amount of scratch space needed by your job, in MB per host (there is no setting in Slurm to request it per core). You can also specify the amount in GB by adding the suffix "G" after YYY.

Example:

[sfux@eu-login-03 ~]$ sbatch --tmp=5000 --wrap="./generating_Euler_numbers -num 5000000"
Submitted batch job 1010713

Note that /tmp is reserved for the operating system. Do not write temporary data there! You should either use the directory created by Slurm ($TMPDIR) or create your own temporary directory in the local /scratch file system; in the latter case, do not forget to delete this directory at the end of your job.
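
If you create your own temporary directory, a minimal sketch of the create-and-clean-up pattern inside a job script (my_program and its --tmpdir option are placeholders):

MYTMP=$(mktemp -d /scratch/${USER}_XXXXXX)   # unique directory on the local /scratch file system
trap 'rm -rf "$MYTMP"' EXIT                  # delete the directory when the job script exits
./my_program --tmpdir "$MYTMP"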

GPU

There are GPU nodes in the Euler cluster. The GPU nodes are reserved exclusively for the shareholder groups that invested in them. Guest users and shareholders that purchased CPU nodes but no GPU nodes cannot use the GPU nodes.

All GPUs in Slurm are configured in non-exclusive process mode, such that you can run multiple processes/threads on a single GPU. Please find below the available GPU node types.

Euler

GPU Model                       LSF Specifier (GPU driver > 450.80.02)   Slurm specifier   GPU memory per GPU   CPU cores per node   CPU memory per node
NVIDIA GeForce GTX 1080         NVIDIAGeForceGTX1080                     gtx_1080          8 GiB                20                   256 GiB
NVIDIA GeForce GTX 1080 Ti      NVIDIAGeForceGTX1080Ti                   gtx_1080_ti       11 GiB               20                   256 GiB
NVIDIA GeForce RTX 2080 Ti      NVIDIAGeForceRTX2080Ti                   rtx_2080_ti       11 GiB               36                   384 GiB
NVIDIA GeForce RTX 2080 Ti      NVIDIAGeForceRTX2080Ti                   rtx_2080_ti       11 GiB               128                  512 GiB
NVIDIA GeForce RTX 3090         NVIDIAGeForceRTX3090                     rtx_3090          24 GiB               128                  512 GiB
NVIDIA TITAN RTX                NVIDIATITANRTX                           titan_rtx         24 GiB               128                  512 GiB
NVIDIA Quadro RTX 6000          QuadroRTX6000                            quadro_rtx_6000   24 GiB               128                  512 GiB
NVIDIA Tesla V100-SXM2 32 GiB   TeslaV100_SXM2_32GB                      v100              32 GiB               48                   768 GiB
NVIDIA Tesla V100-SXM2 32 GB    TeslaV100_SXM2_32GB                      v100              32 GiB               40                   512 GiB
NVIDIA Tesla A100 (40 GiB)      NVIDIAA100_PCIE_40GB                     a100-pcie-40gb    40 GiB               48                   768 GiB
NVIDIA Tesla A100 (80 GiB)      unavailable                              a100_80gb         80 GiB               48                   1024 GiB

You can request one or more GPUs with the command

sbatch --gpus=number of GPUs ...
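
You can also request GPUs of a specific model by adding the Slurm specifier from the table above; a sketch, where my_cuda_program is a placeholder:

sbatch --gpus=rtx_3090:1 --wrap="./my_cuda_program"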

To run multi-node GPU jobs, you need to use the option --gpus-per-node:

sbatch --gpus-per-node=2 ...

For advanced settings, please have a look at our getting started with GPUs page.

Interactive jobs

If you just want to run a quick test, you can submit it as a batch interactive job. In this case the job's output is not written into a file, but directly to your terminal, as if it were executed interactively:

srun --pty bash

Example:

[sfux@eu-login-35 ~]$ srun --pty bash
srun: job 2040660 queued and waiting for resources
srun: job 2040660 has been allocated resources
[sfux@eu-a2p-515 ~]$

For interactive jobs with X11 forwarding, you need to make sure that you log in to the cluster with X11 forwarding enabled; you can then run

srun [Slurm options] --x11 --pty bash
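
srun accepts the same resource options as sbatch; for instance, a sketch of an interactive shell with 4 cores, 2 GB of memory per core and a 2-hour run-time limit:

srun --ntasks=1 --cpus-per-task=4 --mem-per-cpu=2G --time=2:00:00 --pty bash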

Parallel job submission

Before submitting parallel jobs, please make sure that your application can run in parallel at all, in order not to waste resources by requesting multiple cores for a serial application. Furthermore, please do a short scaling analysis to see how well your code scales in parallel before requesting dozens or hundreds of cores.

OpenMP

If your application is parallelized using OpenMP or linked against a library using OpenMP (Intel MKL, OpenBLAS, etc.), the number of processor cores (or threads) that it can use is controlled by the environment variable OMP_NUM_THREADS. This variable must be set before you submit your job:

export OMP_NUM_THREADS=number_of_cores
sbatch --ntasks=1 --cpus-per-task=number_of_cores --wrap="..."

NOTE: if OMP_NUM_THREADS is not set, your application will either use only one core, or will attempt to use all cores that it can find. As you are restricted to your job's resources, all threads will be bound to the cores allocated to your job. Starting more than 1 thread per core will slow down your application, as the threads will be fighting for time on the CPU.
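
In a job script (see the section Job scripts below), a common pattern is to derive the thread count from the allocation instead of hard-coding it; a minimal sketch, where my_openmp_program is a placeholder:

#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK   # match the thread count to the allocated cores
./my_openmp_program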

MPI

Three kinds of MPI libraries are available on our cluster: Open MPI (recommended), Intel MPI and MVAPICH2. Before you can submit and execute an MPI job, you must load the corresponding modules (compiler + MPI, in that order):

module load compiler
module load mpi_library

The command used to launch an MPI application is mpirun.

Let's assume for example that hello_world was compiled with GCC 6.3.0 and linked with Open MPI 4.1.4. The command to execute this job on 4 cores is:

module load gcc/6.3.0
module load open_mpi/4.1.4
sbatch -n 4 --wrap="mpirun ./hello_world"

Note that mpirun automatically uses all cores allocated to the job by Slurm. It is therefore not necessary to indicate this number again to the mpirun command itself:

sbatch --ntasks=4 --wrap="mpirun -np 4 ./hello_world"      ←  "-np 4" not needed!
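
MPI jobs can also span multiple nodes; the scheduler distributes the ranks for you. A sketch for a two-node run, assuming nodes with 48 cores each (the core counts are illustrative):

module load gcc/6.3.0
module load open_mpi/4.1.4
sbatch --nodes=2 --ntasks=96 --wrap="mpirun ./hello_world"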

Pthreads and other threaded applications

Their behavior is similar to that of OpenMP applications. It is important to limit the number of threads that the application spawns. There is no standard way to do this, so be sure to check the application's documentation. Usually a program supports at least one of four ways to limit itself to N threads (an example follows the list):

  • it understands the OMP_NUM_THREADS=N environment variable,
  • it has its own environment variable, such as GMX_NUM_THREADS=N for Gromacs,
  • it has a command-line option, such as -nt N (for Gromacs), or
  • it has an input-file option, such as num_threads N.
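
For illustration, the command-line variant for Gromacs could look as follows (a sketch; the input file run.tpr is a placeholder):

sbatch --ntasks=1 --cpus-per-task=4 --wrap="gmx mdrun -nt 4 -s run.tpr"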

If you are unsure about the program's behavior, please contact us and we will analyze it.

Hybrid jobs

It is possible to run hybrid jobs that mix MPI and OpenMP on our HPC clusters, but this requires more advanced knowledge of Slurm and of the hardware.
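
As a rough illustration only (the executable is a placeholder, and sensible rank/thread counts depend on your application and the hardware), a job with 4 MPI ranks and 8 OpenMP threads per rank could be sketched as:

#!/bin/bash
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=8

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK   # 8 threads per MPI rank
mpirun ./hybrid_app                           # 4 ranks x 8 threads = 32 cores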

Job scripts

You can also use a job script to specify all sbatch options using #SBATCH pragmas. We strongly recommend loading the modules within the submission script in order to improve reproducibility.

#!/bin/bash

#SBATCH -n 4                              # number of cores (tasks)
#SBATCH --time=8:00                       # wall-clock time limit
#SBATCH --mem-per-cpu=2000                # memory per core, in MiB
#SBATCH --tmp=4000                        # local scratch, per node!!
#SBATCH --job-name=analysis1              # name of the job
#SBATCH --output=analysis1.out            # file for standard output
#SBATCH --error=analysis1.err             # file for standard error

module load xyz/123
command1
command2

The script can then be submitted as

sbatch < script

or

sbatch script

Job monitoring

Please find below a table with commands for job monitoring and job control:

Command     Description
squeue      View job and job step information for jobs managed by Slurm
scontrol    Display information about the resource usage of a job
sstat       Display status information of a running job/step
sacct       Display accounting data for all jobs and job steps in the Slurm accounting log or database
myjobs      Job information in human-readable format
scancel     Kill a job

This section is still work in progress.

squeue

The squeue command allows you to get information about pending, running and recently finished jobs.

[sfux@eu-login-41 ~]$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           1433323 normal.4h     wrap     sfux  PD      0:04      1 eu-g1-026-2
           1433322 normal.4h     wrap     sfux  R       0:11      1 eu-a2p-483

You can also check only for running jobs (R) or for pending jobs (PD):

[sfux@eu-login-41 ~]$ squeue -t RUNNING
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           1433322 normal.4h     wrap     sfux  R       0:28      1 eu-a2p-483
[sfux@eu-login-41 ~]$ squeue -t PENDING
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           1433323 normal.4h     wrap     sfux  PD      0:21      1 eu-g1-026-2
[sfux@eu-login-41 ~]$ 

An overview of all squeue options is available in the squeue documentation:

https://slurm.schedmd.com/squeue.html
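
For example, to restrict the output to your own jobs or to a single job (the jobid is a placeholder):

squeue -u $USER       # only your own jobs
squeue -j 1433322     # a single job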

scontrol

The scontrol command is one of several that allow you to check information about a running job:

[sfux@eu-login-15 ~]$ scontrol show jobid -dd 1498523
JobId=1498523 JobName=wrap
   UserId=sfux(40093) GroupId=sfux-group(104222) MCS_label=N/A
   Priority=1769 Nice=0 Account=normal/es_hpc QOS=es_hpc/normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   DerivedExitCode=0:0
   RunTime=00:00:38 TimeLimit=01:00:00 TimeMin=N/A
   SubmitTime=2022-10-27T11:44:30 EligibleTime=2022-10-27T11:44:30
   AccrueTime=2022-10-27T11:44:30
   StartTime=2022-10-27T11:44:31 EndTime=2022-10-27T12:44:31 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2022-10-27T11:44:31 Scheduler=Main
   Partition=normal.4h AllocNode:Sid=eu-login-15:26645
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=eu-a2p-528
   BatchHost=eu-a2p-528
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=1G,node=1,billing=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
   JOB_GRES=(null)
     Nodes=eu-a2p-528 CPU_IDs=127 Mem=1024 GRES=
   MinCPUsNode=1 MinMemoryCPU=1G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/cluster/home/sfux
   StdErr=/cluster/home/sfux/slurm-1498523.out
   StdIn=/dev/null
   StdOut=/cluster/home/sfux/slurm-1498523.out
   Power=
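
If you are only interested in a single field, you can filter the output, for instance:

scontrol show job 1498523 | grep JobState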

sstat

You can use the sstat command to display information about your running jobs, for instance resource usage like CPU time (MinCPU) and memory usage (MaxRSS):

[sfux@eu-login-35 ~]$ sstat --all --format JobID,NTasks,MaxRSS,MinCPU -j 2039738
JobID          NTasks     MaxRSS     MinCPU
------------ -------- ---------- ----------
2039738.ext+        1          0   00:00:00
2039738.bat+        1    886660K   00:07:14

An overview of all available fields for the format option is provided in the sstat documentation:

https://slurm.schedmd.com/sstat.html

sacct

The sacct command allows users to check information on running or finished jobs.

[sfux@eu-login-35 ~]$ sacct  --format JobID,User,State,AllocCPUS,Elapsed,NNodes,NTasks,ReqMem,ExitCode
JobID             User      State  AllocCPUS    Elapsed   NNodes   NTasks     ReqMem ExitCode
------------ --------- ---------- ---------- ---------- -------- -------- ---------- --------
2039738           sfux    RUNNING          4   00:06:01        1                  8G      0:0
2039738.bat+              RUNNING          4   00:06:01        1        1                 0:0
2039738.ext+              RUNNING          4   00:06:01        1        1                 0:0
[sfux@eu-login-35 ~]$

An overview of all format fields for sacct is available in the documentation:

https://slurm.schedmd.com/sacct.html

Please note that the CPU time (TotalCPU) and memory usage (MaxRSS) are only displayed correctly for finished jobs. If you check these properties for running jobs, they will just show 0. To check the CPU time and memory usage of running jobs, please use sstat.
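
For a finished job, checking these properties could look as follows (the jobid is a placeholder):

sacct --format JobID,State,Elapsed,TotalCPU,MaxRSS -j 2039738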

myjobs

We are working on providing a bbjobs-like wrapper for monitoring Slurm jobs. The wrapper script is called myjobs and accepts a single option -j to specify the jobid.

  • Please note that the script only correctly works for simple jobs without additional job steps
  • Please note that the CPU efficiency displayed by myjobs for multi-node jobs is not correct (sstat, which is used to get the CPU time of a running job, only reports the CPU time of the first node).

The script is still work in progress and we are continuously improving it.

[sfux@eu-login-39 ~]$ myjobs -j 2647208
Job information
 Job ID                          : 2647208
 Status                          : RUNNING
 Running on node                 : eu-a2p-277
 User                            : sfux
 Shareholder group               : es_hpc
 Slurm partition (queue)         : normal.24h
 Command                         : sbatch --ntasks=4 --time=4:30:00 --mem-per-cpu=2g
 Working directory               : /cluster/home/sfux/testrun/adf/2021_test
Requested resources
 Requested runtime               : 04:30:00
 Requested cores (total)         : 4
 Requested nodes                 : 1
 Requested memory (total)        : 8192 MiB
 Requested scratch (per node)    : #not yet implemented#
Job history
 Submitted at                    : 2022-11-18T11:10:37
 Started at                      : 2022-11-18T11:10:37
 Queue waiting time              : 0 sec
Resource usage
 Wall-clock                      : 00:10:34
 Total CPU time                  : 00:41:47
 CPU utilization                 : 98.85%
 Total resident memory           : 1135.15 MiB
 Resident memory utilization     : 13.85%
[sfux@eu-login-39 ~]$ 

We are still working on implementing some missing features like displaying the requested local scratch and Sys/Kernel time.

If you would like to get the myjobs output for all your jobs in the queue (pending/running), you can omit the jobid parameter:

myjobs

For displaying only information about pending jobs, you can use

myjobs -p

For displaying only information about running jobs, you can use

myjobs -r

Please note that these commands might not work for job arrays.

scancel

You can use the scancel command to cancel jobs:

[sfux@eu-login-15 ~]$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           1525589 normal.24   sbatch     sfux  R       0:11      1 eu-a2p-373
[sfux@eu-login-15 ~]$ scancel 1525589
[sfux@eu-login-15 ~]$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
[sfux@eu-login-15 ~]$
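
To cancel all of your jobs at once, scancel also accepts a user filter:

scancel -u $USER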

Attaching to a running job with srun

Sometimes it is necessary to monitor a job on the node(s) where it is running. On Euler, compute nodes cannot be accessed directly via ssh. To access a node where one of your jobs is running, use srun:

srun --interactive --jobid JOBID --pty bash

where you need to replace JOBID with the id of your batch job. For jobs running on multiple nodes, you can use --nodelist=NODE to pick one.
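
A sketch combining both options for a multi-node job (jobid and node name are placeholders):

srun --interactive --jobid 2040660 --nodelist=eu-a2p-515 --pty bash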