Migrating to Ubuntu

Introduction

The operating system of Euler, CentOS 7, will reach its end-of-life on 30 June 2024. No new version of CentOS will be released in the future, and no security updates will be published for the current version. The only way to provide a secure computing environment after this date is to migrate the whole Euler cluster to a more recent operating system (OS).

After considering different options (Rocky Linux, RHEL, Ubuntu), we decided to migrate to Ubuntu 22.04 LTS. (The upcoming 24.04 LTS cannot be used because it is not yet compatible with some of the hardware in Euler.) Our goal is to make this transition as smooth as possible for users, but since we are dealing with a different Linux distribution, some changes and adjustments (e.g. regarding workflows) are unavoidable.

What is new

Ubuntu 22.04 LTS provides a much newer kernel (5.15) than CentOS 7 (3.10), as well as a newer version of glibc. This will simplify the installation of newer software and should also bring some performance benefits, due to better support for newer CPU architectures.

Some applications, scripts and workflows may need to be recompiled and/or adapted to run on Ubuntu. We are launching a beta testing phase today so that you can check whether your codes run correctly on Ubuntu and make adjustments if necessary.

What stays the same

The migration to the new OS does not affect your data. All files in /cluster/home, /cluster/scratch, /cluster/project and /cluster/work will remain exactly the same.

The Slurm batch system has already been ported to Ubuntu and is working fine. There will be no changes to the shareholder model or to the Slurm partitions, except during the beta testing phase, where some functionalities will be limited.

Timeline

The migration will be done in stages as follows:

  • Early April: beta phase 1, Ubuntu installed on some login nodes and CPU nodes
  • Early May: beta phase 2, Ubuntu also installed on some GPU nodes
  • Late May: official release of Ubuntu on Euler, operating side-by-side with CentOS
  • 4-6 June: cluster maintenance and migration of Jupyterhub to Ubuntu
  • June: migration of compute nodes to Ubuntu (20% on 6 June, 40% on 17 June, 60% on 24 June, 80% on 1 July)
  • 27 June: ssh connections to "euler.ethz.ch" redirected to Ubuntu nodes
  • End of August: retirement of CentOS

In parallel, we will introduce a new software stack compiled specially for the new OS, starting with the most commonly used software. Due to complex support issues, the installation of some commercial software may take some time.

Information for users

Login

Since 27 June 2024, Ubuntu has been the default OS on Euler, so the login procedure is the same as it was on CentOS:

ssh username@euler.ethz.ch

You can still login to the CentOS part with

ssh username@login-centos.euler.ethz.ch

However, this will only be possible for a few more days; it will stop working when the last nodes have been switched to Ubuntu.

Software stack

We are providing a new software stack for the Ubuntu setup. In order to improve the isolation between stacks and commercial software, each stack now requires a module load. You can see the available versions and load the most recent one with

module avail stack
module load stack
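
For example, to load a specific stack release and then a module from it (the version numbers stack/2024-06 and python_cuda/3.11.6 are taken from examples further down on this page; adjust them to what you need):

module load stack/2024-06
module load python_cuda/3.11.6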

We provide a list of all modules available in each stack.

Please check whether the tools/libraries that you need for your work are already available. If they are, please switch to Ubuntu as soon as possible and test your workflow. If software that you require is missing, please open a ticket by writing to cluster support.

Note that many low-level modules (i.e., modules needed only by other modules) are now hidden by default, so module avail and module spider will only show you the higher-level modules. To see all modules, you can use the commands

module --show_hidden avail
module --show_hidden spider SOFTWARE

For more advanced users, we also provide a way to install your own software stack on top of the centrally provided one using Spack.
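
As a rough illustration, a typical Spack workflow looks like the following (these are generic Spack commands, not an Euler-specific recipe; the package name zlib is only a placeholder, and the exact integration with the centrally provided stack may require additional configuration):

spack install zlib
spack load zlib
spack find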

Work in progress

Here you can find a list of software that we are about to install in the new software environment.

List 1 (GCC):

  • GAMESS-US
  • GDAL +hdf4+python
  • gpu-burn
  • meep
  • opencv
  • relion 3 & 5
  • spades 4.0.0
  • Xtb

List 2 (Intel, commercial, special cases):

  • Abaqus 2024 (commercial)
  • Hyperworks 2021.2 (commercial)
  • netcdf-c (intel)
  • netcdf-cxx4 (intel)
  • netcdf-fortran (intel)
  • Quantum Espresso (intel)

List 3 (Python/R):

  • climada

User environment

You may need to edit your ~/.bashrc for the new OS, in particular if you load some modules by default, since module names/versions may differ on Ubuntu.
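
One option during the transition is to guard OS-specific module loads in your ~/.bashrc, for example (a minimal sketch; the check of /etc/os-release is a generic Linux mechanism, and the module names are taken from examples on this page, so adjust them to your needs):

# only load these modules on Ubuntu nodes; the CentOS part uses different module names
if grep -qi ubuntu /etc/os-release 2>/dev/null; then
    module load stack/2024-06 python_cuda/3.11.6
fi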

Slurm

A separate instance of Slurm is deployed on all beta nodes (login and compute), so you cannot submit jobs to Ubuntu nodes from CentOS nodes or the other way around.

Jupyterhub

The service was migrated during the cluster maintenance on 4-6 June. Configuration files are now read from ~/.config/euler/jupyterhub. Please move them there manually if you wish to keep your configuration.

Backward compatibility

In order to simplify the migration, we provide a compatibility tool, run-centos7, which can be used to get a shell (run-centos7) or to run a command (e.g. run-centos7 ls) in a CentOS 7 container running on top of Ubuntu.

sfux@eu-login-15:~$ uname -a
Linux eu-login-15 5.15.0-102-generic #112-Ubuntu SMP Tue Mar 5 16:50:32 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
sfux@eu-login-15:~$ run-centos7
Apptainer> module avail intel

-------------------------------------------------------------------------------------- /cluster/apps/lmodules/Compiler/gcc/4.8.5 ---------------------------------------------------------------------------------------
   intel-tbb/2017.5    intel-tbb/2018.2    intel-tbb/2020.3 (D)

--------------------------------------------------------------------------------------------- /cluster/apps/lmodules/Core ----------------------------------------------------------------------------------------------
   intel/18.0.1    intel/19.1.0    intel/2018.1    intel/2020.0    intel/2022.1.2 (D)    intel_tools/18.0.1    intel_tools/2018.1 (D)

  Where:
   D:  Default Module

Use "module spider" to find all possible modules.
Use "module keyword key1 key2 ..." to search for all possible modules matching any of the "keys".


Apptainer>

The container can also be used to run batch jobs:

sfux@eu-login-43:~$ cat run.sh
#!/bin/bash

#SBATCH -n 2
#SBATCH --time=1:00:00
#SBATCH --mem-per-cpu=4000
#SBATCH --tmp=4000

module load qchem/4.3_SMP

. /cluster/apps/qchem/4.3_SMP/x86_64/qcenv.sh

qchem -np 2 water.in water.out
sfux@eu-login-43:~$ sbatch run-centos7 ./run.sh
Submitted batch job 3011063
sfux@eu-login-43:~$

We strongly recommend doing a full migration, as we only provide support for software on Ubuntu. Only use the container if for some reason you cannot migrate your setup to Ubuntu (e.g. finishing a project, or using old software that has not been ported to Ubuntu). Please also note that we cannot make any changes to the container; it is provided on an as-is basis.

Support

In case of problems, please check the "Known issues" below. If your problem is not listed, please contact cluster-support@id.ethz.ch and mention that your issue is with the Ubuntu beta.

Known issues

  • For certain software, the beta stack contains the same version twice (for instance llvm, openmm). We will fix this for the new stack that is deployed after the beta test is finished.
  • The performance of OpenMPI is low. This will be fixed in the production software stack. If you wish to test it, you can set MODULEPATH=/cluster/software/stacks/2024-05/spack/share/spack/lmod/linux-ubuntu22.04-x86_64/Core:/cluster/software/lmods and load the required modules (we do not provide any guarantee on the availability of this stack, as we are still working on it).
  • OpenMPI in the 2024-05 stack has a bug that causes jobs to fail on Euler IX nodes. Please use OpenMPI from the 2024-04 stack or from 2024-06 (work in progress)

FAQ

Basic questions about the migration

How do I check which OS I am running on?

Run

 uname -a | grep Ubuntu

If you see output, you are on Ubuntu; otherwise you are on CentOS.
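
Alternatively, you can inspect /etc/os-release directly, which works on both operating systems (a generic Linux check, not something specific to Euler):

grep PRETTY_NAME /etc/os-release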

I am being sent a first-login access code, even though I have already logged in to Euler many times

This is a known problem, and we are working on resolving it. For the moment, please try to log in again, as this only affects some of the login nodes.

Batch system

I cannot see any of my old jobs with sacct or myjobs

The CentOS and Ubuntu parts of the cluster are separate and each has its own Slurm instance.

  • When you log in to euler.ethz.ch (Ubuntu), you can only see jobs that you submitted in the Ubuntu part
  • When you log in to login-centos.euler.ethz.ch (CentOS), you can only see jobs that you submitted in the CentOS part

Why do multi-gpu jobs fail when running them with srun?

If you would like to run a multi-GPU job and start a Python script with srun instead of mpirun (this is the preferred way in Slurm), then you need to repeat the GPU resource request for the srun command; otherwise the job cannot run on multiple GPUs.

Example:

#!/bin/bash 

#SBATCH --nodes=1
#SBATCH --ntasks=2
#SBATCH --mem-per-cpu=2G
#SBATCH --time=01:00:00
#SBATCH --gpus=2

module load eth_proxy stack/2024-06 python_cuda/3.11.6
srun --gpus=2 python train.py

Modules

There is no GCC available

In the new setup, you need to load a stack first, which will then automatically load the corresponding GCC module (see the example after the list):

  • stack/2024-04 -> GCC 8.5.0
  • stack/2024-06 -> GCC 12.2.0
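
For example (a minimal check; the GCC version you see should match the mapping above):

module load stack/2024-06
gcc --version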

Why am I getting an error about $MODULESHOME when logging in to the Ubuntu part of the cluster?

When using one of the commands

  • env2lmod
  • lmod2env
  • source /cluster/apps/local/env2lmod.sh or . /cluster/apps/local/env2lmod.sh
  • source /cluster/apps/local/lmod2env.sh or . /cluster/apps/local/lmod2env.sh

then login to Ubuntu will result in the message:

Current modulesystem could not be determined
$MODULESHOME=/cluster/software/stacks/2024-06/spack/opt/spack/linux-ubuntu22.04-x86_64_v3/gcc-12.2.0/lmod-8.7.24-ou4i7x2rgiaysly4vgawaga6muhkdye4/lmod/lmod
Please logout and login again

These commands were used to switch between software stacks on the CentOS part of Euler. These software stacks are no longer available on Ubuntu, so the resolution is simply to remove these commands from your scripts, .bashrc, or .bash_profile.
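
To find where these commands are still referenced, you can search your startup files, for example (a simple check; adjust the list of files to your setup):

grep -nE 'env2lmod|lmod2env' ~/.bashrc ~/.bash_profile 2>/dev/null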

Module commands are not working in sbatch --wrap=""

When using the --wrap option, Slurm generates a script on the fly but runs it with the /bin/sh interpreter. On CentOS, sh was a symlink to bash, while on Ubuntu the symlink points to dash, which is not compatible with modules. There are several workarounds:

  • Load the modules before submitting the job with sbatch, as the shell in the batch system will inherit the environment from the parent shell which has the modules loaded
  • Use a jobscript with the first line being #!/bin/bash
  • Start a bash shell:
sbatch [Slurm options] --wrap="bash -c 'my_commands' "
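
For example, a one-liner that loads modules inside an explicit bash shell (the module names and script are taken from earlier examples on this page; adjust them to your job):

sbatch -n 1 --time=1:00:00 --mem-per-cpu=2G --wrap="bash -c 'module load stack/2024-06 python_cuda/3.11.6 && python train.py'"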

I cannot find the python_gpu module

In order to increase flexibility regarding GPUs, we renamed it to python_cuda/X.X.X, where X.X.X is the version number. This will also allow us to support AMD GPUs in the future.

Issues with software packages

I need to run my notebook urgently but it does not work on Ubuntu

The tool nbconvert can help you run Jupyter notebooks from the command line.
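
For example, to execute a notebook non-interactively and save the result (jupyter nbconvert is a standard Jupyter command; my_notebook.ipynb is a placeholder file name):

jupyter nbconvert --to notebook --execute --output my_notebook_out.ipynb my_notebook.ipynb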

Emacs 29.1 is crashing when I connect with X11 to Euler

When connecting to Euler with X11 and loading emacs/29.1, you need to additionally load the module gdk-pixbuf/2.42.10-ihevv7i:

module load stack/2024-06 emacs/29.1 gdk-pixbuf/2.42.10-ihevv7i

This will prevent emacs from crashing.

Tensorflow is crashing with the error Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice

Please set the environment variables

export XLA_FLAGS=--xla_gpu_cuda_data_dir=$CUDA_EULER_ROOT
export CUDA_DIR=$CUDA_EULER_ROOT

after loading a python_cuda module. This should resolve the error.

Why is Matlab filling up my home directory with installations of the MathWorks Service Host?

Newer Matlab versions make an installation of the MathWorks Service Host (MSH, https://ch.mathworks.com/matlabcentral/answers/2111226-what-is-the-mathworks-service-host-and-why-is-it-running) in the user's home directory in

$HOME/.MathWorks/ServiceHost

The installation is required for the Matlab licensing and uses 600-700 MB of space. Matlab R2024a makes such an installation for every host that Matlab is executed on, which is obviously problematic on an HPC cluster with more than 1000 hosts. We informed MathWorks support, and they provided a solution to prevent Matlab from filling up the user's home directory with these installations.

We have now installed an updated version of MSH for Matlab R2024a, such that only one installation of MSH is required per user instead of one installation per user per host.

If you have redundant installations of MSH in your home directory, then please remove those installations with the commands:

rm -rf ~/.MathWorks/ServiceHost
rm -rf ~/.MATLABConnector

The next time you start Matlab, a single installation of the newer MSH will be performed, which will then be reused for all Matlab instances started by the user on any host in the cluster.

Intel compilers / Intel MPI

Intel compiler - internal error: 0_9007

When compiling code with the Intel compilers on a login node, the compilation sometimes fails with the error message

internal error: 0_9007

This usually means that the compiler is running out of memory. The solution for this problem is to submit an interactive batch job to compile your code:

srun --ntasks=1 --mem-per-cpu=4g --time=01:00:00 --pty bash

In the interactive batch job, compilation should work. If you still get the same error, please request more memory.

Intel C++ compiler - icpc/icpx constantly fail with error about header file location

The Intel C++ compiler requires a host GCC installation to compile C++ code. On the new Ubuntu setup, you need to specify the location of the GCC host compiler; otherwise the C++ compiler will fail with the error message

icpx: error: C++ header location not resolved; check installed C++ dependencies

or for the classic C++ compiler

icpc: error #10417: Problem setting up the Intel(R) Compiler compilation environment.  Requires 'install path' setting gathered from 'g++'

The solution is to specify the location of the host GCC compiler with a compiler option:

icpx --gcc-toolchain=/cluster/software/stacks/2024-06/spack/opt/spack/linux-ubuntu22.04-x86_64_v3/gcc-11.4.0/gcc-12.2.0-bj2twcnwcownogkldo6ndfylxx5sqpbn

or for the classic C++ compiler

icpc -gxx-name=/cluster/software/stacks/2024-06/spack/opt/spack/linux-ubuntu22.04-x86_64_v3/gcc-11.4.0/gcc-12.2.0-bj2twcnwcownogkldo6ndfylxx5sqpbn/bin/g++

If you do not use the compiler directly but invoke it via another piece of software, setting the option via the $CXXFLAGS variable works as well:

export CXXFLAGS="--gcc-toolchain=/cluster/software/stacks/2024-06/spack/opt/spack/linux-ubuntu22.04-x86_64_v3/gcc-11.4.0/gcc-12.2.0-bj2twcnwcownogkldo6ndfylxx5sqpbn"

or for the classic C++ compiler

export CXXFLAGS="-gxx-name=/cluster/software/stacks/2024-06/spack/opt/spack/linux-ubuntu22.04-x86_64_v3/gcc-11.4.0/gcc-12.2.0-bj2twcnwcownogkldo6ndfylxx5sqpbn/bin/g++"

Please note that for the --gcc-toolchain option, you need to specify the path to the top-level directory of the GCC installation, while for the -gxx-name option you need to specify the full path to the g++ binary.