Migrating to Ubuntu
Contents
- 1 Introduction
- 2 What is new
- 3 What stays the same
- 4 Timeline
- 5 Information for users
- 5.1 Login
- 5.2 Software stack
- 5.3 User environment
- 5.4 Slurm
- 5.5 Jupyterhub
- 5.6 Backward compatibility
- 5.7 Support
- 5.8 Known issues
- 5.9 FAQ
- 5.9.1 Basic questions about the migration
- 5.9.2 Batch system
- 5.9.3 Modules
- 5.9.4 Issues with software packages
- 5.9.4.1 I need to run my notebook urgently but it does not work on Ubuntu
- 5.9.4.2 Emacs 29.1 is crashing when I connect with X11 to Euler
- 5.9.4.3 Tensorflow is crashing with the error Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
- 5.9.4.4 Why is Matlab filling up my home directory with installations of the MathWorks Service Host?
- 5.9.5 Intel compilers / Intel MPI
Introduction
The operating system of Euler, CentOS 7, will reach its end-of-life on 30 June 2024. No new version of CentOS will be released in the future, and no security updates will be published for the current version. The only way to provide a secure computing environment after this date is to migrate the whole Euler cluster to a more recent operating system (OS).
After considering different options (Rocky Linux, RHEL, Ubuntu), we decided to migrate to Ubuntu 22.04 LTS. (The upcoming 24.04 LTS cannot be used because it is not yet compatible with some of the hardware in Euler.) Our goal is to make this transition as smooth as possible for users, but since we are dealing with a different Linux distribution, some changes and adjustments (e.g. regarding workflows) are unavoidable.
What is new
Ubuntu 22.04 LTS provides a much newer kernel (5.15) than CentOS 7 (3.10), as well as newer versions of glibc. This will simplify the installation of newer software and should also bring some performance benefits due to better support for newer CPU architectures.
Some applications, scripts and workflows may need to be recompiled and/or adapted to run on Ubuntu. Today we are launching a beta testing phase to let you check whether your codes run correctly on Ubuntu and to make adjustments if necessary.
What stays the same
The migration to the new OS does not affect your data. All files in /cluster/home, /cluster/scratch, /cluster/project and /cluster/work will remain exactly the same.
The Slurm batch system has already been ported to Ubuntu and is working fine. There will be no changes to the shareholder model or to the Slurm partitions, except during the beta testing phase, where some functionalities will be limited.
Timeline
The migration will be done in stages as follows:
- Early April: beta phase 1, Ubuntu installed on some login nodes and CPU nodes
- Early May: beta phase 2, Ubuntu also installed on some GPU nodes
- Late May: official release of Ubuntu on Euler, operating side-by-side with CentOS
- 4-6 June: cluster maintenance and migration of Jupyterhub to Ubuntu
- June: migration of compute nodes to Ubuntu (20% on 6 June, 40% on 17 June, 60% on 24 June, 80% on 1 July)
- 27 June: ssh connections to "euler.ethz.ch" redirected to Ubuntu nodes
- End of August: retirement of CentOS
In parallel, we will introduce a new software stack compiled specially for the new OS, starting with the most commonly used software. Due to complex support issues, the installation of some commercial software may take some time.
Information for users
Login
Since 27 June 2024, Ubuntu has been the default OS on Euler, so the login procedure is the same as it was on CentOS:
ssh username@euler.ethz.ch
You can still login to the CentOS part with
ssh username@login-centos.euler.ethz.ch
However, this will only be possible for a few more days; it will stop working when the last nodes have been switched to Ubuntu.
Software stack
We are providing a new software stack for the Ubuntu setup. In order to improve the isolation between stacks and commercial software, the stacks now require an explicit module load. You can see the available versions and load the most recent one with
module avail stack
module load stack
We provide a list of all modules available in each stack.
Please check whether the tools and libraries that you need for your work are already available. If they are, please switch to Ubuntu as soon as possible and test your workflow. If software that you require is missing, please open a ticket by writing to cluster support.
Note that many low-level modules (i.e., modules needed only by other modules) are now hidden by default, so module avail and module spider will only show you the higher-level modules. To see all modules, you can use the commands
module --show_hidden avail
module --show_hidden spider SOFTWARE
For more advanced users, we also provide a way to install your own software stack on top of the centrally provided one using spack.
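As a minimal sketch (assuming a plain upstream Spack checkout; the exact procedure for chaining your instance to the central stack on Euler may differ, see the cluster documentation), a personal Spack instance can be set up like this:
# Sketch only: set up a personal Spack instance and install a package into it.
# Chaining it to the centrally provided stack is Euler-specific and not shown here.
git clone https://github.com/spack/spack.git $HOME/spack
. $HOME/spack/share/spack/setup-env.sh
spack install fftw        # "fftw" is just an example package
spack load fftw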
Work in progress
Here you can find a list of software that we are about to install in the new software environment.
List 1 (GCC):
- GAMESS-US
- GDAL +hdf4+python
- gpu-burn
- meep
- opencv
- relion 3 & 5
- spades 4.0.0
- Xtb
List 2 (Intel, commercial, special cases):
- Abaqus 2024 (commercial)
- Hyperworks 2021.2 (commercial)
- netcdf-c (intel)
- netcdf-cxx4 (intel)
- netcdf-fortran (intel)
- Quantum Espresso (intel)
List 3 (Python/R):
- climada
User environment
You may need to edit your ~/.bashrc to work with the new OS, in particular if you are loading some modules by default, since the module names/versions may differ in the new OS.
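For example, a possible sketch of an OS-dependent block in ~/.bashrc (the module names below are placeholders, use the ones relevant to your workflow):
# Load modules only on Ubuntu nodes; fall back to the old setup on CentOS.
if grep -qi ubuntu /etc/os-release 2>/dev/null; then
    module load stack/2024-06 gcc/12.2.0    # example modules for the Ubuntu stack
else
    module load gcc/8.2.0                   # example module from the old CentOS setup
fi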
Slurm
A separate instance of Slurm is deployed on all beta nodes (login and compute), so you cannot submit jobs to Ubuntu nodes from CentOS nodes or the other way around.
Jupyterhub
The service was migrated during the cluster maintenance on 4-6 June. The configuration files are now read from ~/.config/euler/jupyterhub. Please move them manually if you wish to preserve your configuration.
Backward compatibility
In order to simplify the migration, we provide a compatibility tool with the command run-centos7 which can be used to get a shell (e.g. run-centos7) or to run a command (e.g. run-centos7 ls) in a CentOS 7 container running on top of Ubuntu.
sfux@eu-login-15:~$ uname -a
Linux eu-login-15 5.15.0-102-generic #112-Ubuntu SMP Tue Mar 5 16:50:32 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
sfux@eu-login-15:~$ run-centos7
Apptainer> module avail intel

------------------- /cluster/apps/lmodules/Compiler/gcc/4.8.5 -------------------
   intel-tbb/2017.5    intel-tbb/2018.2    intel-tbb/2020.3 (D)

--------------------------- /cluster/apps/lmodules/Core --------------------------
   intel/18.0.1          intel/19.1.0            intel/2018.1
   intel/2020.0          intel/2022.1.2     (D)  intel_tools/18.0.1
   intel_tools/2018.1 (D)

  Where:
   D:  Default Module

Use "module spider" to find all possible modules.
Use "module keyword key1 key2 ..." to search for all possible modules matching any of the "keys".

Apptainer>
The container can also be used to run batch jobs:
sfux@eu-login-43:~$ cat run.sh
#!/bin/bash
#SBATCH -n 2
#SBATCH --time=1:00:00
#SBATCH --mem-per-cpu=4000
#SBATCH --tmp=4000

module load qchem/4.3_SMP
. /cluster/apps/qchem/4.3_SMP/x86_64/qcenv.sh
qchem -np 2 water.in water.out
sfux@eu-login-43:~$ sbatch run-centos7 ./run.sh
Submitted batch job 3011063
sfux@eu-login-43:~$
We strongly recommend doing a full migration, as we only provide support for software on Ubuntu. Only use the container if, for some reason, you cannot migrate your setup to Ubuntu (e.g. finishing a project, or using old software that has not been ported to Ubuntu). Please also note that we cannot make any changes to the container; it is provided on an as-is basis.
Support
In case of problems, please check the "Known issues" section below. If your problem is not listed, please contact cluster-support@id.ethz.ch and mention that your issue is with the Ubuntu beta.
Known issues
- For certain software, the beta stack contains the same version twice (for instance llvm, openmm). We will fix this for the new stack that is deployed after the beta test is finished.
- The performance of OpenMPI is low. This will be fixed in the production software stack. If you wish to test it, you can set MODULEPATH=/cluster/software/stacks/2024-05/spack/share/spack/lmod/linux-ubuntu22.04-x86_64/Core:/cluster/software/lmods and load the required modules (we don't provide any guarantee on the availability of this stack as we are working on it)
- OpenMPI in the 2024-05 stack has a bug that causes jobs to fail on Euler IX nodes. Please use OpenMPI from the 2024-04 stack or from 2024-06 (work in progress)
FAQ
Basic questions about the migration
How can I check which OS I am running on?
Run
uname -a | grep Ubuntu
If you see any output, you are on Ubuntu; otherwise you are on CentOS.
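Alternatively, you can inspect /etc/os-release, which exists on both systems:
grep PRETTY_NAME /etc/os-release
# prints something like PRETTY_NAME="Ubuntu 22.04 LTS" on the Ubuntu nodes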
I was sent a first-login access code, although I have already logged in to Euler many times
This is a known problem and we are working on resolving it. For the moment, please try to log in again, as this only affects some of the login nodes.
Batch system
I cannot see any of my old jobs with sacct or myjobs
The CentOS and Ubuntu part of the cluster are separate and have their own Slurm instance.
- When you login to euler.ethz.ch (Ubuntu), you can only see jobs that you submitted in the Ubuntu part
- When you login to login-centos.euler.ethz.ch (CentOS), you can only see jobs that you submitted in the CentOS part
Why do multi-gpu jobs fail when running them with srun?
If you would like to run a multi-GPU job and start a Python script with srun instead of mpirun (srun is the preferred way in Slurm), then you need to repeat the GPU resource request for the srun command; otherwise the job cannot run on multiple GPUs.
Example:
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=2
#SBATCH --mem-per-cpu=2G
#SBATCH --time=01:00:00
#SBATCH --gpus=2

module load eth_proxy stack/2024-06 python_cuda/3.11.6

srun --gpus=2 python train.py
Modules
There is no GCC available
In the new setup, you need to load a stack first, which will then automatically load the corresponding GCC module (see the example after this list):
- stack/2024-04 -> GCC 8.5.0
- stack/2024-06 -> GCC 12.2.0
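For example, to load the 2024-06 stack and verify which GCC it provides:
module load stack/2024-06
gcc --version    # should report GCC 12.2.0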
Why am I getting an error about $MODULESHOME when logging in to the Ubuntu part of the cluster?
When using one of the commands
- env2lmod
- lmod2env
- source /cluster/apps/local/env2lmod.sh or . /cluster/apps/local/env2lmod.sh
- source /cluster/apps/local/lmod2env.sh or . /cluster/apps/local/lmod2env.sh
then logging in to Ubuntu will result in the message:
Current modulesystem could not be determined
$MODULESHOME=/cluster/software/stacks/2024-06/spack/opt/spack/linux-ubuntu22.04-x86_64_v3/gcc-12.2.0/lmod-8.7.24-ou4i7x2rgiaysly4vgawaga6muhkdye4/lmod/lmod
Please logout and login again
These commands were used to switch between software stacks on Euler under CentOS. Those software stacks are no longer available on Ubuntu, so the resolution is simply to remove the commands from your scripts, .bashrc, or .bash_profile.
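To locate leftover calls in your startup files, you can for instance run:
# search the common startup files for the old stack-switching commands
grep -n -E "env2lmod|lmod2env" ~/.bashrc ~/.bash_profile 2>/dev/null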
Module commands are not working in sbatch --wrap=""
When using the --wrap option, Slurm generates a script on the fly but runs it with the /bin/sh interpreter. On CentOS, sh was a symlink to bash, while on Ubuntu the symlink points to dash, which is not compatible with modules. Possible workarounds (a complete example follows this list):
- Load the modules before submitting the job with sbatch, as the shell in the batch system will inherit the environment from the parent shell which has the modules loaded
- Use a jobscript with the first line being #!/bin/bash
- Start a bash shell:
sbatch [Slurm options] --wrap="bash -c 'my_commands' "
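Putting it together, a complete submission could look like this (the module names are only examples):
# load the modules inside an explicit bash shell so that the module command works
sbatch --ntasks=1 --time=01:00:00 --wrap="bash -c 'module load stack/2024-06 python/3.11.6 && python my_script.py'"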
I cannot find the python_gpu module
In order to increase flexibility regarding GPUs, we renamed it to python_cuda/X.X.X, where X.X.X is the version number. This will also allow us to support AMD GPUs in the future.
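For example (the version shown is the one used elsewhere on this page, pick the version you need):
module load stack/2024-06 python_cuda/3.11.6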
Issues with software packages
I need to run my notebook urgently but it does not work on Ubuntu
The tool nbconvert can help you run Jupyter notebooks from the command line.
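For example, to execute a notebook non-interactively and save the result (the file names are placeholders):
# runs all cells of my_notebook.ipynb and writes the executed copy to a new file
jupyter nbconvert --to notebook --execute my_notebook.ipynb --output my_notebook_executed.ipynb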
Emacs 29.1 is crashing when I connect with X11 to Euler
When connecting with X11 to Euler and loading emacs/29.1, you additionally need to load the module gdk-pixbuf/2.42.10-ihevv7i:
module load stack/2024-06 emacs/29.1 gdk-pixbuf/2.42.10-ihevv7i
This will prevent emacs from crashing.
Tensorflow is crashing with the error Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
Please set the environment variables
export XLA_FLAGS=--xla_gpu_cuda_data_dir=$CUDA_EULER_ROOT
export CUDA_DIR=$CUDA_EULER_ROOT
after loading a python_cuda module. This should resolve the error.
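For example, in a job script (the module versions are examples):
module load stack/2024-06 python_cuda/3.11.6
export XLA_FLAGS=--xla_gpu_cuda_data_dir=$CUDA_EULER_ROOT
export CUDA_DIR=$CUDA_EULER_ROOT
python train.py    # placeholder for your TensorFlow script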
Why is Matlab filling up my home directory with installations of the MathWorks Service Host?
Newer Matlab versions install the MathWorks Service Host (MSH, https://ch.mathworks.com/matlabcentral/answers/2111226-what-is-the-mathworks-service-host-and-why-is-it-running) in the user's home directory under
$HOME/.MathWorks/ServiceHost
The installation is required for Matlab licensing and uses 600-700 MB of space. Matlab R2024a makes such an installation for every host on which Matlab is executed, which is clearly problematic on an HPC cluster with more than 1000 hosts. We informed MathWorks support, and they provided a solution to prevent Matlab from filling up users' home directories with these installations.
We have now installed an updated version of MSH for Matlab R2024a, such that only one installation of MSH is required per user instead of one installation per user per host.
If you have redundant installations of MSH in your home directory, then please remove those installations with the commands:
rm -rf ~/.MathWorks/ServiceHost
rm -rf ~/.MATLABConnector
The next time you start Matlab, a single installation of the newer MSH will be performed, which will then be reused for all Matlab instances started by the user on any host in the cluster.
Intel compilers / Intel MPI
Intel compiler - internal error: 0_9007
When compiling code with the Intel compilers on a login node, the compilation sometimes fails with the error message
internal error: 0_9007
This usually means that the compiler is running out of memory. The solution for this problem is to submit an interactive batch job to compile your code:
srun --ntasks=1 --mem-per-cpu=4g --time=01:00:00 --pty bash
In the interactive batch job, compilation should work. If you still get the same error, please request more memory.
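Inside the interactive job, compilation then typically looks like this (the module and file names are examples):
module load intel                 # or the Intel compiler module from the stack you use
ifort -O2 -o my_app my_app.f90    # example compile of a Fortran source file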
Intel C++ compiler - icpc/icpx constantly fail with error about header file location
The Intel C++ compiler requires a host GCC installation to compile C++ code. On the new Ubuntu setup, you need to specify the location of the GCC host compiler, otherwise the C++ compiler will fail with the error message
icpx: error: C++ header location not resolved; check installed C++ dependencies
or for the classic C++ compiler
icpc: error #10417: Problem setting up the Intel(R) Compiler compilation environment. Requires 'install path' setting gathered from 'g++'
The solution is to specify the location of the host GCC compiler with a compiler option:
icpx --gcc-toolchain=/cluster/software/stacks/2024-06/spack/opt/spack/linux-ubuntu22.04-x86_64_v3/gcc-11.4.0/gcc-12.2.0-bj2twcnwcownogkldo6ndfylxx5sqpbn
or for the classic C++ compiler
icpc -gxx-name=/cluster/software/stacks/2024-06/spack/opt/spack/linux-ubuntu22.04-x86_64_v3/gcc-11.4.0/gcc-12.2.0-bj2twcnwcownogkldo6ndfylxx5sqpbn/bin/g++
If you do not use the compiler directly but invoke it via other software, then setting the option via the $CXXFLAGS variable works as well:
export CXXFLAGS="--gcc-toolchain=/cluster/software/stacks/2024-06/spack/opt/spack/linux-ubuntu22.04-x86_64_v3/gcc-11.4.0/gcc-12.2.0-bj2twcnwcownogkldo6ndfylxx5sqpbn"
or for the classic C++ compiler
export CXXFLAGS="-gxx-name=/cluster/software/stacks/2024-06/spack/opt/spack/linux-ubuntu22.04-x86_64_v3/gcc-11.4.0/gcc-12.2.0-bj2twcnwcownogkldo6ndfylxx5sqpbn/bin/g++"
Please note that for the --gcc-toolchain option, you need to specify the path to the top-level directory of the GCC installation, while for the -gxx-name option you need to specify the full path to g++.