- 1 About us
- 2 Access
- 2.1 Who can use the central clusters of ETH?
- 2.2 How do I get an account?
- 2.3 How can I become shareholder?
- 2.4 Why can't my browser access euler.ethz.ch?
- 2.5 How do I open a terminal session (shell)?
- 2.6 How do I open a graphical session (X11)?
- 2.7 X11-forwarding with -X does not work, what am I doing wrong?
- 2.8 How can I change my password?
- 2.9 Can I change my default shell?
- 3 Software
- 3.1 Do you provide any software on your clusters?
- 3.2 Why does my 32-bit executable not work on your clusters?
- 3.3 Can I run Windows executables on the clusters?
- 3.4 Can you please update GLIBC on the clusters?
- 3.5 Is it necessary to recompile or can I just copy my application to a cluster?
- 3.6 Are development tools available on the clusters?
- 3.7 How do I set up my environment for these compilers?
- 3.8 How do I compile MPI applications?
- 3.9 Can I use another implementation of MPI?
- 3.10 What about OpenMP applications?
- 3.11 What scientific libraries are available on the clusters?
- 3.12 Can you please allow me to run sudo for installing my code?
- 3.13 Why can't I install my application into /usr/bin and /usr/lib64?
- 3.14 Is there a license available for application XYZ?
- 3.15 Why is SVN not working on the cluster?
- 4 Environment modules
- 5 Submitting jobs
- 5.1 Can I run an application on the login nodes?
- 5.2 Can I access a compute node via ssh or rsh?
- 5.3 How do I execute a program on the cluster?
- 5.4 How do I submit a simple command?
- 5.5 How do I submit a shell script?
- 5.6 How do I submit a parallel job?
- 5.7 What are the processor and time limits?
- 5.8 What is the maximal amount of memory that I can use?
- 5.9 Can I use GPUs?
- 5.10 Which queue should I choose?
- 5.11 How many jobs can I submit?
- 5.12 How much time should I request for my job?
- 5.13 What happens when a job reaches its time limit?
- 5.14 My job is terminated with the error message slurmstepd: error: poll(): Bad address
- 5.15 How do I submit a series of jobs (job chaining)?
- 5.16 I can't use srun in a GPU job
- 6 Monitoring jobs
- 6.1 When does my job start?
- 6.2 How can I check the status of my job(s)?
- 6.3 Why is my job waiting for a long time in the queue?
- 6.4 Where is my job's output?
- 6.5 Can I see my job's output in real time?
- 6.6 How do I know when my job is done?
- 6.7 Can I see the resources used by my job(s)?
- 6.8 How do I kill a job?
- 7 Data management and file transfer
- 7.1 How much disk space is available on the clusters?
- 7.2 How much space can I use?
- 7.3 How can I check my quota usage?
- 7.4 What happens when I reach my quota?
- 7.5 What if I need more space?
- 7.6 Why is there a limit for the number of files in my home/scratch directory?
- 7.7 Why is storage in the cluster more expensive than cheap external USB 3 hard drives?
- 7.8 How long can I keep files in the scratch directories?
- 7.9 Why did you delete my files in scratch?
- 7.10 Are my files backed up regularly?
- 7.11 How do I restore a file from a backup?
- 7.12 What is the recommended way to transfer files from/to the cluster?
- 7.13 Why is file transfer very slow?
- 8 Support
- 9 Miscellaneous
- 9.1 How can I credit or acknowledge the usage of the central clusters of ETH in a publication?
- 9.2 Do I need to be logged in when a job is executed?
- 9.3 Can I let my co-workers run jobs from my account?
- 9.4 What are your recommendations regarding security?
- 9.5 I have a problem, can I come to your office and bring my laptop?
Who are we and what do we do?
How much do our services cost?
Guest users can use the cluster for free and get access to a limited amount of resources. These resources (public share) are not guaranteed and can only be used if enough free resources are available in the cluster. If you need a guaranteed share of resources, you can become a shareholder of the cluster by buying a share of the cluster's resources. The service description and the current price list are available on the IT service catalogue. Contact us if you are interested in getting more information.
Where can I find more information?
Who can use the central clusters of ETH?
Any member of ETH Zurich may use the central clusters operated by the HPC group. Professors and institutes who participated in the financing of the clusters — the so-called shareholders — are guaranteed a share of the resources proportional to their investment. Other users — guest users — share the public resources financed by the IT Services. Researchers from other Swiss and international institutions can use the services, as long as they have a collaboration with an institute of ETH Zurich.
How do I get an account?
The procedure depends on the service you intend to use:
- Everybody at ETH Zurich can use the Euler cluster. On the first login of a new user, a verification code is sent to the user's ETH email address (USERNAME@ethz.ch, with USERNAME being the ETH account name). The user is then prompted to enter the verification code; once the correct code has been entered, the cluster account of the user is created.
How can I become shareholder?
If you need more computing resources than you would get as a guest user, your research group/institute/department can become a shareholder by financing a share of a cluster. Refer to the Scientific Compute Clusters service page of the IT Services for further details. Traditionally only groups within ETH Zurich could become shareholders. Since July 2016, this privilege has been extended to other institutions in the ETH Domain, namely EAWAG, EMPA, PSI and WSL.
Why can't my browser access euler.ethz.ch?
Euler is a cluster, not a website. The address euler.ethz.ch can therefore only be reached via SSH (see below how), not HTTP.
How do I open a terminal session (shell)?
For security reasons, you can only access our services from within the ETH network. If you are outside the ETH network, you have to establish a VPN connection first. From a Linux or macOS computer, you can log in with the ssh command, using your ETH account name and the address euler.ethz.ch.
How do I open a graphical session (X11)?
Graphical sessions on Euler are based on X11. This does not provide you with a remote desktop, but allows you to launch graphical applications on the cluster and forward their windows to your local workstation. To do this, you need a so-called X11 Server program on your workstation:
- On most Linux distributions, X11 is built in; you do not need to install anything
- On macOS, you need to install XQuartz
- On Windows, you need to install for example MobaXterm, Cygwin/X, Xming or XWin-32
Once you have installed and launched the X11 Server program on your workstation, use ssh -Y to login:
ssh -Y username@euler.ethz.ch
The -Y option creates an SSH tunnel between your workstation and the cluster, which allows X11 to communicate between the client and server.
X11-forwarding with -X does not work, what am I doing wrong?
As described above, you have to use the -Y option for X11-forwarding. Log in with:
ssh -Y username@hostname
How can I change my password?
You cannot change your password on Euler because this system uses ETH authentication. If you want to change your ETH password, go to http://password.ethz.ch.
Can I change my default shell?
Do you really want to do that? Bash is the default shell for all users. The configuration of our services is complex and everything is tested extensively using bash. It is therefore the only shell that we fully support. You are free to use a different shell, but you do so at your own risk.
Do you provide any software on your clusters?
On our clusters, we provide a wide range of centrally installed applications and libraries. There are two software stacks on Euler, which contain commercial as well as open source software. An overview of all centrally installed applications can be found on our wiki:
For continuity and reproducibility, we also keep the old software stack:
Why does my 32-bit executable not work on your clusters?
Our clusters are pure 64-bit systems. Your 32-bit executable might run without problems in some cases, but there are certain limitations. A 32-bit executable can only use up to 3 GB of virtual memory. If you try to use more, this might result in a segmentation fault or an out-of-memory error message. The solution to this problem is to recompile your application for 64-bit.
Can I run Windows executables on the clusters?
Windows executables do not run under Linux. In order to be able to run your application on our clusters, you need to make sure that it is a 64-bit binary for Linux.
Can you please update GLIBC on the clusters?
The libc is part of the operating system. Updating libc is equivalent to updating the operating system on the cluster, so we cannot simply update libc. If your executable requires a newer version of libc (GLIBC), please consider recompiling the executable from its source code directly on the cluster where you would like to run it.
Is it necessary to recompile or can I just copy my application to a cluster?
Statically linked, single-processor executables built on standard x86 Linux platforms should run without any problem on our clusters. Recompiling may improve the performance, though. Dynamically linked executables will not run if the required shared libraries are either not available or not compatible (e.g. 32-bit executable and 64-bit library). Recompiling is recommended.
Are development tools available on the clusters?
On our clusters we provide different versions of the standard compilers from gcc and Intel. To identify the actual versions that are installed on the cluster, please use the module available command:
module available gcc
module available intel
Executables corresponding to the compilers:
gcc ← GNU C compiler
g++ ← GNU C++ compiler
gfortran ← GNU Fortran 90/95 compiler
icc ← Intel C compiler
icpc ← Intel C++ compiler
ifort ← Intel Fortran 90/95 compiler
How do I set up my environment for these compilers?
On our clusters, we use environment modules to prepare the environment for applications and compilers. By loading the corresponding module with the module load command, e.g.,
module load gcc/8.2.0
the environment variables such as PATH and LD_LIBRARY_PATH are adapted to the compiler you loaded.
How do I compile MPI applications?
The compilation of parallel applications based on the Message Passing Interface (MPI) is slightly more complicated. Once you have loaded the compiler of your choice, you must also decide which MPI library you want to use. Three MPI libraries are available on Euler:
- Open MPI (recommended)
- Intel MPI (as part of the Intel OneAPI installation)
- MPICH (only available in the gcc/4.8.5 toolchain)
Applications compiled with Open MPI run on nodes connected to the InfiniBand network. Please check our wiki page about running MPI jobs on Euler.
Open MPI is recommended for all applications.
Two series of modules — openmpi and intel — are available to configure your environment for a particular MPI library. In addition, these modules define wrappers — e.g. mpicc, mpif90 — that greatly simplify the compilation of MPI applications. These wrappers are compiler-dependent and invoke whichever compiler was active (loaded) when you loaded the MPI module. For this reason, the MPI module must absolutely be loaded after the compiler module.
To summarize, the compilation of an MPI application should look somewhat like this:
module load compiler
module load MPI library
mpicc program -o executable ← C program
mpiCC program -o executable ← C++ program
mpif77 program -o executable ← Fortran 77 program
mpif90 program -o executable ← Fortran 90 program
Can I use another implementation of MPI?
Yes. We provide Intel MPI and MPICH, but we do not compile libraries with support for those MPI implementations. We strongly recommend using the centrally installed Open MPI library.
What about OpenMP applications?
You can use OpenMP, but do not forget to set OMP_NUM_THREADS=#threads and to submit the job as a shared memory job with the sbatch options --ntasks=1 --cpus-per-task=#threads.
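Inside a job script, the thread count can be derived from the allocation instead of being hard-coded. The following is a minimal sketch, assuming the job was submitted in the shared memory form (sbatch --ntasks=1 --cpus-per-task=#threads), in which case Slurm sets the environment variable SLURM_CPUS_PER_TASK. The helper name set_omp_threads and the program name are illustrative, not tools provided on the cluster.

```shell
#!/bin/bash
# Sketch only: set_omp_threads is an illustrative helper, not a cluster tool.
# SLURM_CPUS_PER_TASK is set by Slurm when --cpus-per-task is requested.
set_omp_threads() {
    # Fall back to a single thread when running outside a Slurm job.
    export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"
}

set_omp_threads
echo "OMP_NUM_THREADS=$OMP_NUM_THREADS"
# ./my_openmp_program    # hypothetical executable, started after the variable is set
```

Deriving the value from the allocation keeps the number of threads and the number of requested cores in agreement when the sbatch options change.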
What scientific libraries are available on the clusters?
On our clusters, we provide a large range of scientific libraries and/or applications. Please check the following wiki pages for an overview:
Can you please allow me to run sudo for installing my code?
For security reasons, we cannot allow users to run sudo to install their application of choice. The clusters are shared by more than 4000 users, and allowing them to use sudo could cause many problems that would affect all other cluster users. We recommend that you install software in your home directory, so that you do not need sudo for the installation step.
Why can't I install my application into /usr/bin and /usr/lib64?
The directories /usr/bin and /usr/lib64 are primarily used by the operating system for installing packages through the package manager, and only our system administrators have write access to them. The centrally installed applications and libraries are located in /cluster/apps, and user software should be installed in the home directory.
Is there a license available for application XYZ?
The ID SIS HPC team operates and maintains the HPC clusters and provides some more services, but we do not provide any software license at all. Licenses for commercial applications are either provided by the IT shop of ETH or directly by a research group or an institute/department.
Why is SVN not working on the cluster?
When you run Subversion, you might get the following error: svn: E000013: Can't check path '/cluster/home/.svn/wc.db': Permission denied. It is caused by the automounter for the home directories, which tries to mount .svn. To avoid this issue, you can use the following trick:
cd /scratch
svn co https://github.com/dummy/dummy.git
rsync -av dummy.git/.svn ~/.
rm -rf /scratch/dummy.git
Can I automatically load modules on login?
You can add module load commands to either your $HOME/.bashrc or your $HOME/.bash_profile file. These commands are executed when you log in to the cluster. We recommend not loading modules automatically on login, because at some point you might forget that modules are already loaded and then load one of the smart modules (like open_mpi/1.6.5), whose behavior depends on the modules that are already loaded. You might then not get the result you were looking for. If you would still like to load modules automatically, add them to .bash_profile and not to .bashrc, as the latter could break some of your workflows.
Is it possible to load modules in a script?
It is possible to load modules in a script. Please find below an excerpt of the script, that we use to install OpenFOAM:
echo "-> Loading modules required for build"
module list &> /dev/null || source /cluster/apps/modules/init/bash
module purge
module load gcc/8.2.0 python/3.10.4
Module load does not work properly, what am I doing wrong?
Did you load all the modules that the software you would like to use depends on? For some applications and libraries, you must load a compiler module first. Please have a look at the wiki page about the application and check which modules need to be loaded for that software.
In the software overview version X is listed, why does module avail not list it?
Please check if you are using the proper software stack. You can check your current setting with the command:
Can I run an application on the login nodes?
Login nodes are the gateway to the cluster and only have very few resources. They are used to compile programs and submit job requests to the compute nodes, not to run applications. You are allowed to run really short programs interactively on the login nodes for testing and debugging purposes, or for pre- or post-processing. Anything else is prohibited, and if you overload the login nodes, your processes will be killed without prior notice.
Can I access a compute node via ssh or rsh?
You cannot access a compute node directly to run a program or a command, whether via ssh, rsh or any other means. From a user's point of view, compute nodes do not exist. If you have submitted a job through the batch system, you can access the node where the job is running (for advanced job monitoring) with the srun command, which expects the job ID as an argument.
srun --interactive --jobid <Job ID> --pty bash
If your job is using multiple nodes, you can pick one with --nodelist=NODE.
How do I execute a program on the cluster?
Every command, program or script that you want to execute on the cluster must be submitted to the batch system (Slurm). The sbatch command is used to submit a batch job and to indicate what resources are required for that job. On Euler, two types of resources must be specified: number of processors and computation time:
Shared memory job:
sbatch --ntasks=1 --cpus-per-task=#CPUs --time=HH:MM:SS ...
Distributed memory job (e.g. MPI):
sbatch --ntasks=#CPUs --time=HH:MM:SS ...
When Slurm receives a job, it checks its requirements and either accepts or rejects it. If accepted, the job is dispatched to a batch queue that meets its requirements. The job will remain in the queue until enough resources are available to execute it.
Slurm operates like a "black box". You do not need to know anything about the underlying queue structure to use it. Just tell Slurm what you want, and you'll get it -- or not.
How do I submit a simple command?
To execute a simple Unix command on one processor, use:
sbatch [--time=HH:MM:SS] [--ntasks=1] --wrap="command [arguments]"
The time limit can be expressed as HH:MM:SS, for example "--time=2:30:00". The default time limit is four hours. Since batch jobs are executed on one processor by default, the argument "--ntasks=1" can be omitted. All environment variables defined in your current shell, as well as the current working directory, are automatically passed to your job by Slurm.
How do I submit a shell script?
To execute a shell script, you can use either:
sbatch [optional flags] ./script
sbatch [optional flags] < script
These forms are not equivalent. In the first case, the script (which must have "execute" permission) is read only when the job starts; any change made to it between submission and execution will be "seen" by your job. In the second case, however, the script is read by Slurm when you submit it.
If your script contains "#SBATCH" statements, you must use the second form.
How do I submit a parallel job?
To execute a parallel program on N processors, you need to specify the number of processors. In addition, the parallel code itself must be launched with the corresponding command, for instance "mpirun".
If you launch a script with the mpirun command, the whole script will be executed in parallel. You therefore have to be careful not to use any command that can cause a race condition, such as "cp", "mv", etc. For this reason, it is often preferable to execute the script on a single processor (without "mpirun") and place "mpirun" inside the script before each command that must be executed in parallel.
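The layout described above, with serial file handling around a single parallel step, could look like the following minimal sketch. It writes a job script to disk and prints the submission command. The file names and my_mpi_program are hypothetical, and the resource values are placeholders.

```shell
#!/bin/bash
# Sketch only: file names and my_mpi_program are hypothetical.
# Write a job script in which only the solver line runs in parallel.
cat > parallel_job.sh <<'EOF'
#!/bin/bash
#SBATCH --ntasks=4
#SBATCH --time=01:00:00

cp input.dat work.dat            # serial: executed once
mpirun ./my_mpi_program work.dat # parallel: launched as MPI processes
mv work.out result.dat           # serial: executed once
EOF

echo "Submit with: sbatch < parallel_job.sh"
```

Because the script itself is not started with mpirun, the cp and mv commands run exactly once and cannot race against copies of themselves.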
What are the processor and time limits?
Please note that processor limits are not fully static and may change over time. Currently the limit for guest users is 48 cores. For shareholder groups, there is no hard limit. The amount of resources that a group can use is determined by its priority in the batch system, which depends on the resources used in the past. If a share is overused, the priority in the batch system decreases, whereas it increases when a share is underused.
What is the maximal amount of memory that I can use?
If you are a member of a shareholder group, the maximal amount of memory that you can use in a single-core job is 3 TB (although you might face very long queuing times if the share of your shareholder group does not explicitly contain so-called ultra-fat memory nodes). For parallel jobs the theoretical limit is higher than 100 TB.
Can I use GPUs?
GPUs in Euler are restricted to shareholder groups that invested in GPU nodes. Guest users and shareholder groups with a pure CPU node share do not have access to GPUs in Euler.
Which queue should I choose?
In principle, you should not choose a queue at all. It is sufficient if you request the amount of resources that your job will require. The batch system will then take care of dispatching your job to the appropriate queue.
How many jobs can I submit?
The number of concurrent jobs depends on your account status and on your priority in the batch system.
- Guest users can use at most 48 cores at the same time
- Members of shareholder groups don't have a hard limit on the number of cores. The number of concurrently running jobs depends on the priority in the batch system, which depends on the share size and the recent usage
There are also limits for the number of pending jobs that you can have at the same time:
- Guest users are limited to 1,000 pending jobs
- Members of shareholder groups are limited to 30,000 pending jobs
How much time should I request for my job?
The time you request has a direct influence on the scheduling of your job. Short jobs have higher priority than long jobs. In addition, short jobs can use processors reserved by a large parallel job, if Slurm determines that your job will finish before the expected start time of the large job.
Therefore, you have a pretty good reason to request as little time as possible. On the other hand, you want to make sure that your job has enough time to complete.
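One way to balance the two goals is to measure a representative test run and add a safety margin. The helper below is a minimal sketch that turns a measured runtime in seconds into the HH:MM:SS format expected by --time. The function name time_request and the default margin of 20% are assumptions made for illustration, not a recommendation from the cluster documentation.

```shell
#!/bin/bash
# Sketch only: time_request is an illustrative helper, and the default
# 20% safety margin is an assumption, not a cluster recommendation.
time_request() {
    measured=$1            # runtime of a representative test run, in seconds
    margin=${2:-20}        # safety margin in percent
    total=$(( measured * (100 + margin) / 100 ))
    printf '%02d:%02d:%02d\n' $(( total / 3600 )) $(( total % 3600 / 60 )) $(( total % 60 ))
}

# A test run that took 2 h 30 min (9000 s), padded by the default margin:
echo "sbatch --time=$(time_request 9000) ..."
```

With the default margin, a 9000-second test run yields a request of 03:00:00.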
What happens when a job reaches its time limit?
When the time limit is reached, each task in each job step is sent SIGTERM followed by SIGKILL, which will terminate the job. You can use the bash feature trap to catch those signals in case you prefer to do some cleanup before a job is terminated:
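A minimal sketch of such a trap follows. The function job_body, the scratch directory and the sleep payload are illustrative stand-ins for a real job script. The last lines only simulate the signal locally so that the behavior can be tried outside the batch system.

```shell
#!/bin/bash
# Sketch only: job_body, the scratch directory and the sleep payload are
# illustrative stand-ins for a real job script.
job_body() {
    scratch=$(mktemp -d)
    # On SIGTERM: stop the payload, remove temporary data, exit cleanly.
    trap 'echo "SIGTERM received, cleaning up"; kill "$payload" 2>/dev/null; rm -rf "$scratch"; exit 0' TERM
    sleep 60 &               # stand-in for the real computation
    payload=$!
    wait "$payload"          # wait is interrupted by the signal, so the trap runs promptly
}

# Local demonstration: run the body in the background and send it the
# signal that the batch system sends at the time limit.
trap_log=$(mktemp)
job_body > "$trap_log" &
job_pid=$!
sleep 1
kill -TERM "$job_pid"
wait "$job_pid"
cat "$trap_log"
```

The payload is started in the background and awaited with wait because bash runs a trap only after the current foreground command has finished. A long-running foreground command would delay the cleanup until SIGKILL arrives.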
My job is terminated with the error message slurmstepd: error: poll(): Bad address
Most likely your job tried to use more memory than you requested and was therefore terminated by the batch system.
Please check your job with
myjobs -j JOBID
If you see a memory utilization >= 90%, then please also check your job with
sacct -j JOBID
and look for the status OUT_OF_MEMORY. If you see this status, then please resubmit the job and request more memory.
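The check for the OUT_OF_MEMORY status can be scripted. The following is a minimal sketch in which the helper name job_ran_out_of_memory and the sample text are made up for illustration; on the cluster the helper would read the real output of sacct -j JOBID.

```shell
#!/bin/bash
# Sketch only: job_ran_out_of_memory is an illustrative helper.
# It reads `sacct` output on stdin and succeeds if any job step
# reports the state OUT_OF_MEMORY.
job_ran_out_of_memory() {
    grep -q 'OUT_OF_MEMORY'
}

# Intended use on the cluster (JOBID is a placeholder):
#   sacct -j JOBID | job_ran_out_of_memory && echo "resubmit with more memory"

# Demonstration with made-up sacct-like text:
sample='12345.batch  batch  OUT_OF_MEMORY  0:125'
if printf '%s\n' "$sample" | job_ran_out_of_memory; then
    echo "resubmit with a larger memory request, e.g. a higher --mem-per-cpu"
fi
```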
How do I submit a series of jobs (job chaining)?
Job chaining can be used to split a very long computation into a series of jobs that fit within the allowed time limits. The batch system offers the possibility to set dependency conditions, e.g. job2 should start only when job1 is done, job3 after job2, etc.
|LSF||Slurm|
|bsub -J job_chain||sbatch -J job_chain -d singleton|
|bsub -J job_chain -w "done(job_chain)"|| |
A job that is submitted with the option -d singleton can begin execution after any previously launched jobs sharing the same job name and user have terminated. In other words, only one job by that name and owned by that user can be running or suspended at any point in time.
You can also use dependency conditions, which use the job ID rather than the job name to define the dependencies.
|LSF||Slurm|
|Job #1: bsub -J job1 command1||Job #1: myjobid=$(sbatch --parsable -J job1 --wrap="command1")|
|Job #2: bsub -J job2 -w "done(job1)" command2||Job #2: sbatch -J job2 -d afterany:$myjobid --wrap="command2"|
In Slurm, sbatch --parsable returns the JOBID of the job.
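For longer chains, the job-ID-based dependency can be applied in a loop. The helper below is a minimal sketch that relies only on the sbatch options used above (--parsable, -J, -d afterany). The function name submit_chain and the job names step1, step2, ... are illustrative.

```shell
#!/bin/bash
# Sketch only: submit_chain is an illustrative helper, not a cluster tool.
# Usage: submit_chain N [sbatch options...]
# Submits N jobs; each one may start only after the previous one has ended.
submit_chain() {
    chain_len=$1
    shift
    prev_id=""
    step=1
    while [ "$step" -le "$chain_len" ]; do
        if [ -z "$prev_id" ]; then
            prev_id=$(sbatch --parsable -J "step$step" "$@")
        else
            prev_id=$(sbatch --parsable -J "step$step" -d "afterany:$prev_id" "$@")
        fi
        step=$(( step + 1 ))
    done
    echo "$prev_id"          # job ID of the last job in the chain
}

# Example call on the cluster (command is a placeholder):
#   submit_chain 3 --time=04:00:00 --wrap="command"
```

The first job is submitted without a dependency, and every later job depends on the job ID returned for its predecessor.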
I can't use srun in a GPU job
Slurm does not export gres resources properly, so you need to specify them again manually for srun:
#SBATCH --gpus=1
#SBATCH --gres=gpumem:20g
srun --gres=gpumem:20g COMMAND
When does my job start?
It is very hard to give an accurate estimate of when a job will start. The start time of a job depends on two factors:
- Can the resource request of the job be fulfilled on a compute node in the cluster?
- Is the priority of the user who submitted the job higher than that of all other users whose jobs have the same or very similar resource requirements?
How can I check the status of my job(s)?
You can check the status of your job either with the myjobs command (this shows the status of the first job step) or with the sacct command (all job steps).
LSF: Use the LSF command bjobs to see all your jobs (in all states) with their unique job identifier. Additional details can be obtained with the option "-l" (lowercase "L").
The command bjobs -p lists only pending jobs and indicates why they are pending. The most common reasons are explained in the table below.
|New job is waiting for scheduling||Your job's requirements are being analyzed|
|Individual host based reasons||A complicated way to say that not enough processors are available (literally, all hosts are unable to run your job for various, individual reasons)|
|The user has reached his/her job slot limit||Don't you think you are using enough processors already?|
|Job dependency condition not satisfied||Your job is waiting for another job to complete|
|The queue is inactivated by its time windows||This queue is active only during pre-defined time windows; your job will be considered for execution when the next window is open|
|Dependency condition invalid or never satisfied||Your job's dependency condition is false or can not be determined (usually because the status of the previous job in the chain is unknown)|
In the last case the job will never run. The simplest solution is to kill it and resubmit it with the correct dependency condition (or none at all). Alternatively, you can remove the dependency condition using the command bmod -w JOBID
Why is my job waiting for a long time in the queue?
You can check the pending reason of your job with the command
squeue -j <Job ID>
The last field on the right side of the output will show the pending reason
|BadConstraints||The job's constraints can not be satisfied|
|BeginTime||The job's earliest start time has not yet been reached.|
|Cleaning||The job is being requeued and still cleaning up from its previous execution.|
|Dependency||This job has a dependency on another job that has not been satisfied.|
|DependencyNeverSatisfied||This job has a dependency on another job that will never be satisfied.|
|InvalidAccount||The job's account is invalid.|
|InvalidQOS||The job's QOS is invalid.|
|JobHeldAdmin||The job is held by a system administrator.|
|JobHeldUser||The job is held by the user.|
|JobLaunchFailure||The job could not be launched. This may be due to a file system problem, invalid program name, etc.|
|NodeDown||A node required by the job is down.|
|NonZeroExitCode||The job terminated with a non-zero exit code.|
|PartitionDown||The partition required by this job is in a DOWN state.|
|PartitionInactive||The partition required by this job is in an Inactive state and not able to start jobs.|
|Priority||One or more higher priority jobs exist for this partition or advanced reservation.|
|QOSGrp*Limit||The job's QOS has reached an aggregate limit on some resource.|
|QOSJobLimit||The job's QOS has reached its maximum job count.|
|QOSMax*Limit||The job requests a resource that violates a per-job limit on the requested QOS.|
|QOSResourceLimit||The job's QOS has reached some resource limit.|
|ReqNodeNotAvail||Some node specifically required by the job is not currently available. The node may currently be in use, reserved for another job, in an advanced reservation, DOWN, DRAINED, or not responding.|
|Reservation||The job is waiting for its advanced reservation to become available.|
|Resources||The job is waiting for resources to become available.|
|SystemFailure||Failure of the Slurm system, a file system, the network, etc.|
Where is my job's output?
By default, your job's output (and error) is stored in a file called slurm-JOBID.out, where JOBID corresponds to the job ID of the job. If you use the "-o" or "-e" argument of the sbatch command, you can give the output and the error file different names.
sbatch -o job.out -e job.err ...
The stdout and stderr of a job are written in real time, unless the software is buffering it.
LSF: By default, your job's output (and error) is stored in a file called lsf.oJOBID, where JOBID corresponds to the job ID of the job. If you use the "-o" or "-e" argument of the bsub command, you can give the output and the error file different names.
bsub -o job.out -e job.err ...
Can I see my job's output in real time?
Contrary to LSF, in Slurm the stdout and stderr are not buffered. The slurm-JOBID.out file is created at the beginning of the job and updated in real time. If you don't get stdout and stderr in real time, then most likely the software that you are running is buffering the stdout and stderr (for instance Python and Julia). You would need to disable the buffering in the software to get stdout and stderr in real time.
LSF: You can check the output of a particular running job with the bpeek command. You can specify the job either via its job ID or via its job name:
bpeek JOBID
bpeek -J JOBNAME
Add the -f option to follow the output in realtime:
bpeek -f JOBID
How do I know when my job is done?
You can instruct the batch system to notify you by e-mail when your job begins and ends using the following options:
LSF:
bsub -B ...
bsub -N ...
Slurm:
sbatch --mail-type=BEGIN ...
sbatch --mail-type=END,FAIL ...
Multiple types can be combined in one option, e.g.
sbatch --mail-type=BEGIN,END,FAIL ...
Notifications are sent to your official ETH e-mail address.
Can I see the resources used by my job(s)?
You can display the load and resource usage (memory, swap, etc.) of any specific job with the following commands:
LSF:
[sfux@euler01 ~]$ bbjobs 25445659
Job information
  Job ID : 25445659
  Status : RUNNING
  Running on node : 8*e1374
  User : sfux
  Queue : normal.4h
  Command : mpirun solve_Basel_problem -accuracy 10e-8
  Working directory : $HOME/unsolved_problems/basel_problem
Requested resources
  Requested cores : 8
  Requested memory : 1024 MB per core
  Requested scratch : not specified
  Dependency : -
Job history
  Submitted at : 13:42 1735-08-22
  Started at : 13:43 1735-08-22
  Queue wait time : 34 sec
Resource usage
  Updated at : 13:44 1735-08-22
  Wall-clock : 59 sec
  Tasks : 12
  Total CPU time : 7 min
  CPU utilization : 99.8 %
  Sys/Kernel time : 0.1 %
  Total resident memory : 8150 MB
  Resident memory utilization : 99.2 %
Affinity per Host
  Host : e1374
  Task affinity : by core
  Cores : /0/0/0
  Memory affinity : not defined
Slurm:
[sfux@eu-login-39 ~]$ myjobs -j 2647208
Job information
  Job ID : 2647208
  Status : RUNNING
  Running on node : eu-a2p-277
  User : sfux
  Shareholder group : es_hpc
  Slurm partition (queue) : normal.24h
  Command : sbatch --ntasks=4 --time=4:30:00 --mem-per-cpu=2g
  Working directory : /cluster/home/sfux/testrun/adf/2021_test
Requested resources
  Requested runtime : 04:30:00
  Requested cores (total) : 4
  Requested nodes : 1
  Requested memory (total) : 8192 MiB
  Requested scratch (per node) : #not yet implemented#
Job history
  Submitted at : 2022-11-18T11:10:37
  Started at : 2022-11-18T11:10:37
  Queue waiting time : 0 sec
Resource usage
  Wall-clock : 00:10:34
  Total CPU time : 00:41:47
  CPU utilization : 98.85%
  Total resident memory : 1135.15 MiB
  Resident memory utilization : 13.85%
[sfux@eu-login-39 ~]$
How do I kill a job?
LSF: The LSF command bkill is used to kill pending or running jobs. For obvious reasons, you can kill only jobs that you own. Use bkill JOBID to kill a particular job, or bkill 0 (zero) to kill all your jobs (running and pending). You can kill a job by name using bkill -J jobname (this will kill the last job with that name) or a whole series of jobs with bkill -J jobname 0 (zero again).
You can use bkill to send a signal to a job without necessarily killing it. For example, if your application is programmed to save results when it receives a USR2 signal (i.e. the signal sent by LSF when the time limit is reached), you can trigger this action manually with the command bkill -s USR2 JOBID
The Slurm command scancel is used to kill pending or running jobs.
[sfux@eu-login-15 ~]$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1525589 normal.24 sbatch sfux R 0:11 1 eu-a2p-373
[sfux@eu-login-15 ~]$ scancel 1525589
[sfux@eu-login-15 ~]$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
[sfux@eu-login-15 ~]$
Data management and file transfer
How much disk space is available on the clusters?
Every user gets a home directory with a quota of 20 GB and 200'000 files and directories. In addition, every user gets a personal scratch directory that can hold up to 2.5 TB of data for a short time (scratch space is by definition meant for temporary storage of data). If you plan to use your personal scratch directory, please read the usage rules carefully first in order to avoid misunderstandings. The usage rules are stored in the file $SCRATCH/__USAGE_RULES__.
If you are the owner of a central NAS share hosted by the ID Storage group, then this can also be mounted on the cluster. Private NAS systems can also be mounted on the cluster, but we do not provide any support for them.
Shareholders have the option to buy more permanent storage in the cluster.
How much space can I use?
You can store up to 20 GB of data in your home directory (permanent storage) and, temporarily, up to 2.5 TB in your personal scratch directory (temporary storage). Shareholders can additionally buy as much storage as they need.
How can I check my quota usage?
We provide the command lquota for users to check their quota usage:
[sfux@eu-login-36 ~]$ lquota
+-----------------------------+-------------+------------------+------------------+------------------+
| Storage location:           | Quota type: | Used:            | Soft quota:      | Hard quota:      |
+-----------------------------+-------------+------------------+------------------+------------------+
| /cluster/home/sfux          | space       |          8.17 GB |         17.18 GB |         21.47 GB |
| /cluster/home/sfux          | files       |            33431 |            80000 |           100000 |
+-----------------------------+-------------+------------------+------------------+------------------+
| /cluster/shadow             | space       |          8.19 kB |          2.15 GB |          2.15 GB |
| /cluster/shadow             | files       |                3 |            50000 |            50000 |
+-----------------------------+-------------+------------------+------------------+------------------+
| /cluster/scratch/sfux       | space       |         20.48 kB |          2.50 TB |          2.70 TB |
| /cluster/scratch/sfux       | files       |                5 |          1000000 |          1500000 |
+-----------------------------+-------------+------------------+------------------+------------------+
[sfux@eu-login-36 ~]$
To check the quotas of project and work storage, pass the path to the storage share as an argument to lquota:
[sfux@eu-login-36 ~]$ lquota /cluster/project/sis
+-----------------------------+-------------+------------------+------------------+------------------+
| Storage location:           | Quota type: | Used:            | Soft quota:      | Hard quota:      |
+-----------------------------+-------------+------------------+------------------+------------------+
| /cluster/project/sis        | space       |         13.17 TB |                - |         16.50 TB |
| /cluster/project/sis        | files       |          1676434 |                - |         31876696 |
+-----------------------------+-------------+------------------+------------------+------------------+
[sfux@eu-login-36 ~]$ lquota /cluster/work/sis
+-----------------------------+-------------+------------------+------------------+------------------+
| Storage location:           | Quota type: | Used:            | Soft quota:      | Hard quota:      |
+-----------------------------+-------------+------------------+------------------+------------------+
| /cluster/work/sis           | space       |         15.42 TB |         30.00 TB |         33.00 TB |
| /cluster/work/sis           | files       |          1606372 |         10000000 |         11000000 |
+-----------------------------+-------------+------------------+------------------+------------------+
[sfux@eu-login-36 ~]$
What happens when I reach my quota?
Quotas have both soft and hard limits. The soft limit is the amount of disk space you can use on a day-to-day basis. You may exceed it temporarily but you can never go beyond the hard limit.
You will be warned by e-mail before you reach the soft limit. At this point you will have 5 days to reduce your disk usage. If you are still over quota after this so-called "grace period", you will not be allowed to write a single file in your home directory.
What if I need more space?
If you need more space, you can, for instance, become a shareholder and buy storage directly inside the cluster. Another option is to buy a central NAS share from the ID Storage group and mount it on the cluster. You can also mount your private group NAS, but we do not provide support for this.
Why is there a limit for the number of files in my home/scratch directory?
The home directories of all users on Euler are backed up every night. If there are too many files (before these limits were introduced, there were about 100 million files in total), the nightly backup cannot finish and users no longer have a backup of their data. Therefore we had to introduce strict limits on the home directories. Users are informed by email when they reach 80% of the file/directory quota.
The limit on the scratch directory had to be introduced because a large number of small files (on the order of millions) slows down the storage system that hosts the personal scratch directories, and this affects all users. The storage system is optimized for medium and large files. Here, a small file means one with a size on the order of kilobytes; a medium file has a size of several megabytes.
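To see how close a directory is to a file/directory limit before the warning email arrives, the entries can be counted directly. This is a minimal sketch and not a replacement for lquota, whose numbers are authoritative. The function names are made up for this example, the 80% threshold mirrors the warning level mentioned above, and the limit passed in is an assumption that should be taken from your own lquota output.

```shell
#!/bin/bash
# Count files and directories below a directory (the directory itself included).
# File names containing newlines would be counted more than once.
count_entries() {
    find "$1" | wc -l | tr -d ' '
}

# Integer percentage (rounded down) of a limit that is in use.
#   $1: used, $2: limit
percent_used() {
    echo $(( $1 * 100 / $2 ))
}

# Report the usage and warn once it reaches 80% of the limit.
#   $1: directory, $2: file/directory limit taken from lquota
check_file_quota() {
    local dir="$1" limit="$2" used pct
    used=$(count_entries "$dir")
    pct=$(percent_used "$used" "$limit")
    if [ "$pct" -ge 80 ]; then
        echo "WARNING: $used of $limit entries used ($pct%)"
    else
        echo "OK: $used of $limit entries used ($pct%)"
    fi
}
```

For example, check_file_quota "$HOME" 100000 compares the home directory against a limit of 100000 entries. Walking a large directory tree puts load on the file system, so run such a check occasionally rather than in a loop.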
Why is storage in the cluster more expensive than cheap external USB 3 hard drives?
- Enterprise class hardware
- Much better network
- Managed by ID SIS HPC
How long can I keep files in the scratch directories?
[sfux@euler06 ~]$ grep -A1 "2)" $SCRATCH/__USAGE_RULES__
2) Files older than 15 days will be automatically DELETED without prior notice.
Why did you delete my files in scratch?
If the files in your personal scratch directory have been deleted, then they were older than 15 days and were removed according to the purge policy of the personal scratch directory. Please read the usage rules again; they are stored in the file $SCRATCH/__USAGE_RULES__.
There it is clearly indicated that files older than 15 days will be deleted without prior notice to the user. If you need storage space for a longer time (or permanent storage), then please check out the different options that we provide for our clusters.
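To find out which files are getting close to the 15-day limit, find can list files by age. The sketch below is only an approximation: it goes by modification time, and the text above does not say which timestamp the purge actually uses, so that is an assumption. The function name is made up for this example.

```shell
#!/bin/bash
# List regular files below a directory that were last modified
# at least the given number of days ago.
#   $1: directory, $2: age in days (1 or more)
list_files_older_than() {
    local dir="$1" days="$2"
    # "-mtime +N" matches files that are more than N whole days old,
    # i.e. at least N+1 days, hence the "days - 1".
    find "$dir" -type f -mtime +"$(( days - 1 ))"
}
```

For example, list_files_older_than "$SCRATCH" 10 prints the files that have not been modified for 10 days or more, which leaves a few days to copy anything still needed to permanent storage.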
Are my files backed up regularly?
Home directories are saved every hour and backed up every night. Project and work storage shares are backed up on a weekly basis for disaster recovery. Shareholders have the option to buy a daily backup for their project or work shares. Scratch directories are not backed up at all. Do not use them to store valuable data over long periods of time.
How do I restore a file from a backup?
Only a system administrator can restore backed up files. Please contact the cluster support and indicate the exact location (full path) of the file(s) you wish to recover.
However, your $HOME provides a "hidden" .snapshot directory in every subdirectory, which holds exact copies of the state of that directory at different times. If you would like to restore individual files (or whole directory structures) from this location, you can simply copy them out. The snapshots are read-only.
[sfux@eu-login-02:~ ]$ ls -al .snapshot/
total 84
drwxrwxrwx 20 root root  8192 Oct 12 08:05 .
drwxr-x---  2 sfux T0000 4096 Oct 10 13:55 ..
drwxr-x---  2 sfux T0000 4096 Sep 26 06:59 daily.2017-10-07_0120
drwxr-x---  2 sfux T0000 4096 Sep 26 06:59 daily.2017-10-08_0120
drwxr-x---  2 sfux T0000 4096 Sep 26 06:59 daily.2017-10-09_0120
drwxr-x---  2 sfux T0000 4096 Sep 26 06:59 daily.2017-10-10_0120
drwxr-x---  2 sfux T0000 4096 Oct 10 13:55 daily.2017-10-11_0120
drwxr-x---  2 sfux T0000 4096 Oct 10 13:55 daily.2017-10-12_0120
drwxr-x---  2 sfux T0000 4096 Oct 10 13:55 hourly.2017-10-12_0005
drwxr-x---  2 sfux T0000 4096 Oct 10 13:55 hourly.2017-10-12_0105
drwxr-x---  2 sfux T0000 4096 Oct 10 13:55 hourly.2017-10-12_0205
drwxr-x---  2 sfux T0000 4096 Oct 10 13:55 hourly.2017-10-12_0305
drwxr-x---  2 sfux T0000 4096 Oct 10 13:55 hourly.2017-10-12_0405
drwxr-x---  2 sfux T0000 4096 Oct 10 13:55 hourly.2017-10-12_0505
drwxr-x---  2 sfux T0000 4096 Oct 10 13:55 hourly.2017-10-12_0605
drwxr-x---  2 sfux T0000 4096 Oct 10 13:55 hourly.2017-10-12_0705
drwxr-x---  2 sfux T0000 4096 Oct 10 13:55 hourly.2017-10-12_0805
drwxr-x---  2 sfux T0000 4096 Sep 15 09:11 weekly.2017-09-24_0636
drwxr-x---  2 sfux T0000 4096 Sep 26 06:59 weekly.2017-10-01_0636
drwxr-x---  2 sfux T0000 4096 Sep 26 06:59 weekly.2017-10-08_0636
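Copying a file out of a snapshot is an ordinary cp. The helper below is a small sketch of that step; the function name is made up for this example, and the snapshot name in the usage line is simply one of the entries from the listing above.

```shell
#!/bin/bash
# Copy one file or directory out of a snapshot.
#   $1: snapshot directory, e.g. "$HOME/.snapshot/daily.2017-10-11_0120"
#   $2: path of the item relative to the snapshot directory
#   $3: destination path
restore_from_snapshot() {
    local snapshot="$1" relpath="$2" dest="$3"
    if [ ! -e "$snapshot/$relpath" ]; then
        echo "not found in snapshot: $snapshot/$relpath" >&2
        return 1
    fi
    # -a copies directories recursively and keeps timestamps and permissions
    cp -a "$snapshot/$relpath" "$dest"
}
```

For example, restore_from_snapshot "$HOME/.snapshot/daily.2017-10-11_0120" notes.txt "$HOME/notes.txt.restored" copies the older version next to the current one. Restoring under a new name avoids overwriting the current file before the two have been compared.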
What is the recommended way to transfer files from/to the cluster?
For small and medium amounts of data, we recommend using the standard command line tools scp or rsync to transfer files from/to the cluster.
files:
scp source/file username@hostname:destination
scp username@hostname:source/file destination
directories:
scp -r source/directory username@hostname:destination
scp -r username@hostname:source/directory destination
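The same kind of transfer can be done with rsync, which only copies files that differ between source and destination. This is a minimal sketch; the function name is made up, username@hostname:destination is the same placeholder as in the scp examples, and the choice of options is a suggestion rather than a site recommendation.

```shell
#!/bin/bash
# Copy the contents of a directory with rsync.
#   $1: source directory
#   $2: destination, either a local path or username@hostname:path
sync_dir() {
    local src="${1%/}" dest="$2"
    # -a         recurse and preserve permissions and timestamps
    # --partial  keep partially transferred files so that an interrupted
    #            transfer does not have to start from scratch
    # The trailing slash on the source copies the contents of the directory
    # rather than the directory itself.
    rsync -a --partial "$src/" "$dest"
}
```

For example, sync_dir results username@hostname:destination uploads the contents of the local directory results. Running the same command again transfers only files that have changed in the meantime.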
For sftp, it is usually easier to use a drag-and-drop graphical interface such as the one provided by WinSCP (Windows). Be aware of possible incompatibilities between Windows and Linux, for example different line endings in text files. The tools dos2unix and unix2dos convert text files between the two formats.
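A common incompatibility is a script edited on Windows that carries carriage-return line endings and then fails on Linux. The sketch below checks for such line endings and removes them. The function names are made up for this example, and the sed fallback assumes GNU sed, which is an assumption about the system you run it on.

```shell
#!/bin/bash
# Succeed if the file contains at least one carriage return,
# which indicates Windows line endings.
has_crlf() {
    local cr
    cr=$(printf '\r')
    grep -q "$cr" "$1"
}

# Convert a file to Unix line endings in place.
# Uses dos2unix when it is installed, otherwise falls back to GNU sed.
to_unix() {
    if command -v dos2unix >/dev/null 2>&1; then
        dos2unix -q "$1"
    else
        sed -i 's/\r$//' "$1"
    fi
}
```

For example, has_crlf job.sh && to_unix job.sh converts the file only when it needs it.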
Why is file transfer very slow?
There can be multiple reasons for slow file transfer:
- Bad network
- Bad performance on the system hosting the data that you would like to transfer to Euler
- Problems with a storage system on Euler
How can I create a support request?
If you would like to create a support request, then send an email to
Alternatively, you can also use the smartdesk interface to create a support request.
What information should I provide in a support request?
Please make sure that you provide all relevant information about your problem, otherwise we cannot provide help.
- Which software are you using?
- Which modules did you load?
- If you report a problem about a job, then please always provide the corresponding jobid, the complete sbatch (or bsub) command that you used and if possible the slurm-*.out (or lsf.o*) logfile
- Always provide the complete error message, not just parts
- If you report a problem with a file or storage system, provide the complete path to the file or storage system
The more information we have about the problem, the higher the chance that we can resolve the issue.
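The items in the list above can be collected in one go before writing the request. The function below is only a sketch: its name and the output layout are made up for this example and it is not a tool provided on the cluster. It prints the job ID, the submission command and the loaded modules, and reminds you to attach the complete log file.

```shell
#!/bin/bash
# Print the details that a support request about a job should contain.
#   $1: job ID
#   $2: the complete submission command, quoted as one string
#   $3: path to the job log file (optional)
support_report() {
    local jobid="$1" submit_cmd="$2" logfile="${3:-}"
    echo "Job ID:          $jobid"
    echo "Submit command:  $submit_cmd"
    echo "Host:            $(uname -n)"
    echo "Loaded modules:"
    # "module" is a shell function and is only defined where
    # environment modules are set up for the current shell.
    if command -v module >/dev/null 2>&1; then
        module list 2>&1
    else
        echo "  (module command not available in this shell)"
    fi
    if [ -n "$logfile" ]; then
        echo "Log file:        $logfile (attach the complete file)"
    fi
}
```

For example, support_report 2647208 "sbatch --ntasks=4 --time=4:30:00 --mem-per-cpu=2g" slurm-2647208.out > report.txt writes the summary to a file that can be pasted into the request together with the complete error message.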
How can I credit or acknowledge the usage of the central clusters of ETH in a publication?
We are very thankful if our users acknowledge the usage of the central HPC clusters in their publications, but it is not mandatory at all.
If you are using Euler as a guest user, we kindly ask you to acknowledge this in your publications (acknowledgment section). This will help us keep Euler open to all members of ETH, as the cluster's public share is financed by the (limited) budget of Scientific IT Services.
There is no standard sentence how Euler should be cited, but please find below some examples from publications from this list:
- "The numerical simulations were performed on the Euler cluster operated by the High Performance Computing group at ETH Zürich."
- "Numerical simulations were performed on the ETH Zürich Euler cluster"
- "The calculations were run on the Euler cluster of ETH Zürich"
- "The simulations were performed on the ETH Euler cluster"
- "All simulations were performed on the ETH-Zürich Euler cluster"
- "Calculations were carried out on the ETH Euler cluster"
A list of scientific publications referencing the HPC clusters of ETH is provided on our wiki.
Do I need to be logged in when a job is executed?
No. After submitting a job, you can log out without any consequences. Since the job is submitted to the batch system, the batch system takes care of its execution on the compute nodes. You do not need to be logged in for this.
Can I let my co-workers run jobs from my account?
No, you cannot. The HPC clusters of ID SIS HPC are subject to ETH's acceptable use policy for IT resources (Benutzungsordnung für Telematik an der ETH Zürich, BOT), which states that accounts are strictly personal. But you can share your input files with your co-workers, and they can run the jobs with their ETHZ account.
What are your recommendations regarding security?
First, keep your own account secure. Do not share your account with anyone. Choose a strong password and change it regularly. Please inform us immediately if you suspect that someone has been using your account without authorization.
Second, keep your personal workstation secure. Do not believe that your workstation is safe just because it's not running Windows. Linux is an easy target for hackers, especially if you have not installed the latest security patches. So please keep your system up-to-date, whether you are using Windows, Mac OS X, Linux, or any other flavor of UNIX.
I have a problem, can I come to your office and bring my laptop?
We are a small team and provide services to more than 4000 users. It would be very nice if we could meet all of our users, but our schedule does not allow for this. We would prefer that you first use the main support channels, for instance the ticket system, the service desk, or our support email address. If this does not help, or if the case is very difficult, then please ask us to schedule a meeting where we can discuss the problem in more detail. This way we can make sure that the specialists for the topic you would like to discuss are present.