Integration of Leonhard Open into Euler

From ScientificComputing
Revision as of 15:54, 21 September 2023 by Byrdeo (talk | contribs) (When)


Why

The Leonhard Open cluster, introduced in 2017 as a new platform for big data analytics and GPU computing, has become a victim of its own success. Due to the very high demand for GPU nodes, it has reached the space, power and cooling limits of our data center in Zurich. For this reason, all new GPU nodes bought in the last 12 months have been installed in Euler in Lugano. This has led to a situation where customers who initially bought a share of Leonhard ended up with GPU nodes in both clusters. Since moving individual shareholders from Leonhard to Euler is not practical, the Scientific IT Services have decided to completely integrate Leonhard Open into Euler. (The Leonhard Med cluster is not affected by this change.) This will benefit not only existing shareholders who had to deal with two separate clusters, but also future customers who had difficulty choosing between Euler and Leonhard. It will also simplify the work of the cluster management team.

How

The existing Leonhard GPU nodes will physically remain in Zurich but will be logically moved into the Euler network. They will be integrated into the cluster management tools and batch system of Euler.

All files currently stored in the "work" and "project" file systems of Leonhard Open will be transferred to their equivalent in Euler. This operation will be done by the cluster management team and will be mostly transparent to the users. It will require a short down-time (less than 24 hours) during which Leonhard Open users will not be able to access their data. Once the transfer is done, they will find their files in Euler under the usual path — unless they already have a "work" or "project" directory in Euler, in which case they will find their files in a sub-directory called "leonhard".

Due to potential conflicts between the two clusters, the contents of "home" and "scratch" will not be transferred. Every user will have to copy the files they want to keep themselves. For this purpose, the login nodes and file systems of Leonhard Open will remain accessible (in read-only mode) for one month after the integration.

All Leonhard Open shares will be transferred to Euler. Leonhard users will therefore enjoy the same shareholder privileges and priority on Euler as they did on Leonhard Open. Apart from the hostname (euler.ethz.ch instead of login.leonhard.ethz.ch) nothing will change for the users.

The software environment of Euler has already been modified to support GPUs and features the same toolchains (GCC 4.8.5, 6.3.0, 8.2.0 and Intel 18.0.1). Packages that were only available on Leonhard Open are being installed on Euler to make the migration of your workflows as seamless as possible. The cluster support team will be happy to assist you in porting your workflows from Leonhard Open to Euler and will install any missing packages on demand.

When

The integration will take place on 14-15 September 2021. The detailed schedule is:

Date and time | Task | Status
Now - 14.09.2021 | Batch queues of Leonhard Open will be progressively inactivated to drain the compute nodes and ensure that no job is running on 14.09.2021 | Done
14.09.2021, 07:00 | All batch queues will be closed, compute nodes will be taken out of operation and reconfigured as Euler nodes | Done
14.09.2021, 15:00 | All login nodes will be closed, Leonhard Open will be taken off-line | Done
14.09.2021, 15:00 - 15.09.2021, 12:00 (noon) | Work and project data will be transferred/synchronized from Leonhard Open to Euler | Done
15.09.2021, 12:00 | Leonhard Open users will find their data in Euler under the usual path (with a few exceptions) | Done
15.09.2021, 12:00 | The login nodes of Leonhard Open will be reopened for one month to allow users to copy data in their "home" and "scratch" directories (if needed) | Done
14.10.2021, 12:00 | Access to Leonhard Open will be closed, all remaining user data will be deleted | Done

Altogether, Leonhard Open users will not be able to access their data from 15:00 on Tuesday 14 September to 12:00 on Wednesday 15 September.

Current status

15 September, 12:00

  • Leonhard Open nodes have been reconfigured as Euler nodes and will be progressively activated in the batch system
  • Leonhard shares have been transferred to Euler
  • All "project" volumes have been copied over to Euler and are accessible there without restriction
  • The copy and verification of "work" volumes is still in progress; we expect that most of these volumes will be accessible later this afternoon (check this page for updates)
  • The login nodes of Leonhard Open are accessible again and will remain open until 14 October to allow users to copy data from their "home" and "scratch" (if applicable) directories

15 September, 13:30

  • The copy and verification of "work" volumes is complete
  • Those groups who already had a "work" volume on Euler will find their Leonhard data in a subdirectory called "leonhard"
  • Some quota adjustments will take place over the next few days; in the meantime, the values reported by "lquota" may be incorrect
  • This concludes the integration of Leonhard Open into Euler. Thank you for your patience!

What YOU need to do

Please read this section carefully; some actions are needed on your part to ensure a smooth integration.

  • Do not submit long jobs on Leonhard Open that would not finish before 07:00, Tuesday 14 September. The batch system would not be able to start them.
  • Verify that you can log in to Euler. If you have never done so, you will need to go through the account verification process.
  • Check that all the applications and libraries you need are available in the new software stack of Euler. If not, contact us (see "Need help?" below).
  • You can already start copying files from your "home" or "scratch" directory from Leonhard Open to Euler before 14 September.
  • You must finish copying those files before 14 October.
  • If you do not need data from your "work" or "project" directories, you can already run jobs on Euler before the migration. However, your Leonhard shareholder privileges (including access to GPUs if applicable) will not be transferred to Euler until 12:00, Wednesday 15 September. If you are not already an Euler shareholder, you will be treated as a guest user until then.

Login nodes

Euler login nodes have considerably fewer resources than those of Leonhard Open:

  • 32 GB memory
  • 4 CPUs

We have 50 of them, but they have to serve a lot more users. Therefore we must insist that all users respect the following limits:

  1. not more than 4 (threads + processes) on login nodes, the initial ssh and bash not counted
  2. not more than 2 GB virtual memory consumed on login nodes
  3. no process running for longer than 60 CPU-minutes on login nodes (excluding file-transfer / -handling)
  4. no unattended processes on login nodes (daemons, remote agents, ...)

For example, the Microsoft Visual Studio Code (VS Code) remote extension, when used on Euler login nodes, violates limits 1 and 4. You can use VS Code on Euler, but not on login nodes. Please use it only as documented under

https://scicomp.ethz.ch/wiki/VSCode
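
To get a rough idea of where you stand with respect to limit 1, a minimal sketch along the following lines could be used on a login node. It assumes a Linux ps that supports the nlwp (number of threads) output column. The function names sum_threads and my_thread_count are hypothetical, and summing threads over all your processes is only one possible reading of "threads + processes". Note that the total includes your ssh and bash sessions as well as the ps and awk processes started by the check itself, which the limit does not count.

```shell
# Sum the first column of standard input (one thread count per line).
# Prints 0 when the input is empty.
sum_threads() {
    awk '{ total += $1 } END { print total + 0 }'
}

# Print the total number of threads owned by the current user on this node.
# Assumption: a Linux procps "ps" that understands the "nlwp" column.
my_thread_count() {
    ps -u "$(id -un)" -o nlwp= | sum_threads
}
```

Running my_thread_count prints a single number that can be compared against the limit after subtracting the session and helper processes mentioned above.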

FAQ

Why are you doing this change now?

The decision to integrate Leonhard Open into Euler was already taken last year, but the change was delayed due to Covid-19 and last year's cyber-attack against many HPC sites. We have used this time to run a proof of concept verifying that Leonhard Open nodes in Zurich could be integrated into the Euler cluster in Lugano. The date was set during the summer holiday, before the start of the Fall semester, to minimise the impact on students and teachers who rely on Leonhard for their courses.

Does this change affect Leonhard Med?

No. Leonhard Med will continue to exist as a separate system but will be operated by a dedicated team. Its users and data are not affected by this integration.

What happens to my Leonhard Open share?

Nothing. You will get exactly the same resources on Euler that you had on Leonhard Open and the temporal validity of your share will remain the same.

Does this integration bring any benefits to Leonhard Open shareholders?

Since Euler is much larger than Leonhard Open, it will provide more elasticity, thus allowing for higher peak usage. Also, Euler contains new GPUs that are not available on Leonhard, such as Nvidia Tesla A100.

Do I need to transfer my data to Euler?

You will need to transfer data from your "home" directory and your personal "scratch" directory. We will take care of transferring the data from your "work" and "project" directories (if applicable).

How can I transfer my data to Euler?

You can use rsync to copy the contents of your "home" directory from Leonhard Open into a "leonhard" subdirectory on Euler. Simply log in to Euler and execute the command:

rsync -Sav login.leonhard.ethz.ch:./ $HOME/leonhard/

Note: we use a subdirectory in this example because your "home" directory on Leonhard Open contains a lot of hidden files and directories (e.g. .profile, .ssh), which may conflict with their equivalents on Euler. You may need to edit them by hand to make sure that they work properly on Euler (check in particular host names, modules, aliases, batch queues, etc.).

Likewise, to copy the contents of your "scratch" directory into the same location on Euler (not a subdirectory), execute:

rsync -Sav login.leonhard.ethz.ch:$SCRATCH/ $SCRATCH/
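
Once your "home" has been copied into the "leonhard" subdirectory, a quick way to spot hidden files that still mention Leonhard host names is sketched below. This is only an illustration: the function name find_leonhard_refs is hypothetical, the search covers only hidden regular files directly under the given directory, and it assumes a find that supports -maxdepth (as GNU and BSD find do).

```shell
# List hidden regular files directly under a directory whose contents
# mention "leonhard" (case-insensitive). Defaults to $HOME/leonhard.
find_leonhard_refs() {
    dir="${1:-$HOME/leonhard}"
    # -name '.*' restricts the search to hidden files;
    # grep -il prints only the names of the files that match.
    find "$dir" -maxdepth 1 -type f -name '.*' -exec grep -il 'leonhard' {} +
}
```

The files it lists are candidates for manual review before you reuse any of their settings on Euler.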

What if I do not have enough space on Euler to copy the contents of my "home" directory?

Unfortunately "home" quotas are set system-wide and cannot be raised, even temporarily, for individual users. Also, we do not have enough capacity to raise the quota of all users.

If you have a lot of data in your "home" on both clusters, there is a high probability that you have installed the same applications and/or datasets on both clusters. Therefore we recommend that you clean up your "home" on Leonhard Open before copying its contents to Euler, or that you selectively copy only the files that do not already exist on Euler. Please note that this clean-up must be done before 15:00, Tuesday 14 September because after that time all file systems of Leonhard Open will be read-only. Alternatively, you can copy your Leonhard Open "home" into your Euler "scratch", and do the clean-up on Euler.
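
One possible way to copy only the files that do not already exist on Euler is rsync's --ignore-existing option, which skips every file already present at the same relative path on the receiving side. The sketch below is an illustration, not an official procedure: the function name copy_missing_from_leonhard is hypothetical, and copying directly into $HOME rather than into a "leonhard" subdirectory is an assumption. Keep in mind that --ignore-existing compares paths only, not contents, so a file that differs between the two clusters is left untouched on Euler.

```shell
# Copy from the Leonhard Open home only those files that do not yet
# exist at the same relative path under the Euler home.
# Any arguments (e.g. --dry-run) are passed through to rsync.
copy_missing_from_leonhard() {
    rsync -Sav --ignore-existing "$@" login.leonhard.ethz.ch:./ "$HOME/"
}
```

Calling copy_missing_from_leonhard --dry-run first lists what would be transferred without copying anything, which helps to judge whether the result will fit into your quota.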

Can I access my NAS share also on Euler?

If your NAS share only has an NFS export for the IP range of the Leonhard Open cluster, then please contact the person responsible for the NAS share and ask to change the NFS export to make the NAS share accessible on Euler:

https://scicomp.ethz.ch/wiki/Getting_started_with_clusters#Central_NAS.2FCDS

Example NFS export for Euler:

# cat /etc/exports
/export 129.132.93.64/26(rw,root_squash,secure) 10.205.0.0/16(rw,root_squash,secure) 10.204.0.0/16(rw,root_squash,secure)

Is all the software that was available on Leonhard Open also available on Euler?

To prepare for the integration of Leonhard Open into Euler, we started some months ago to identify differences between the software stacks of the two clusters and to install missing software packages on Euler. Some packages will still be missing; we can install them on request.

ERROR:105: Unable to locate a modulefile for 'gcc/6.3.0'

Euler features two software stacks (old/new). The old software stack is still set as default upon login (this will change in the future). You can either switch back and forth between the old (env modules) and the new (lmod modules) software stack using the commands env2lmod and lmod2env

https://scicomp.ethz.ch/wiki/New_SPACK_software_stack_on_Euler

or set the new software stack as permanent default upon login by running once the command

set_software_stack.sh new

After running this command, you need to log out and log in again for the change to take effect.

https://scicomp.ethz.ch/wiki/Setting_permanent_default_for_software_stack_upon_login

Some of the python_gpu modules are missing on Euler

The main difference between python and python_gpu modules is that the python_gpu modules automatically load some additional modules (CUDA, cuDNN, NCCL) which are required for many GPU packages. We already provided a python_gpu/3.8.5 (GCC 6.3.0) environment on Euler. We have now also created a python_gpu/3.7.4 (GCC 6.3.0) module, which points to the existing python/3.7.4 (GCC 6.3.0) installation and loads the same versions of CUDA, cuDNN and NCCL as the python_gpu/3.7.4 environment on Leonhard Open.

Need help?

If you have a question that is not covered by the FAQ above, or any concern about the integration of Leonhard Open into Euler (e.g., if you have a workflow tightly coupled with Leonhard Open), do not hesitate to contact our cluster support team.