Clusters temporarily closed for security reasons (14 May 2020)

From ScientificComputing
Revision as of 22:21, 17 June 2020 by Byrdeo (talk | contribs) (2020-06-02 11:00)

Introduction

A cyber-attack has been conducted against several European HPC and academic compute sites. Some of our systems have been compromised. Please check this page regularly for updates on the status of the Euler, Leonhard Open and Leonhard Med clusters.

Thank you for your understanding and sorry for the inconvenience.

Updates

2020-05-14 14:30

As a precaution, we have closed our clusters temporarily. After a careful analysis of our systems, we have not found any indication that Euler and/or Leonhard have been compromised. We have therefore decided to open the login nodes again.

Please note that the closure affected only the login nodes. The clusters' compute nodes and the jobs running there have not been affected by this closure.

Based on the information we have received from HPC sites that have been hacked, the attacker apparently used compromised SSH credentials. As a general rule, please keep your operating system up to date and always apply security patches as soon as they are released. If you have reason to believe that your personal computer has been compromised, we recommend that you delete all your SSH keys and generate new ones.

2020-05-15 12:00

After receiving new information on the attack against several European HPC infrastructures, we discovered that some of our HPC systems have been compromised. Based on this information, access to all clusters has been closed with immediate effect. Please note that the closure affects only the login nodes. The clusters' compute nodes and the jobs running there are not affected by this closure.

2020-05-15 14:00

The clusters will remain closed until we know how the attack took place and how to protect our systems against it. This will most likely take several days, possibly weeks. Please do not contact us to ask when they will be accessible again. We do not have enough information to answer this question at this stage.

2020-05-16 09:00

We have set up a FAQ section on this page (see below) where we will post answers to the questions that we have received so far. If you have a question, please check this FAQ before contacting us.

2020-05-18 19:00

The IT Security Center of ETH is investigating the incident in close coordination with the HPC Group, and is also coordinating with the other affected HPC sites and the responsible authorities. In parallel, the HPC Group is looking at ways to further strengthen the already strict security measures on our HPC systems. We are taking this security incident very seriously and will take all measures to protect our infrastructure even better against future attacks. We will start reopening the service on our clusters as soon as possible and will give you more details about the timeline in the coming days.

2020-05-20 17:30

We are reinstalling our clusters from scratch to ensure that we can provide you with a clean and safe working environment. If all goes well, Euler should be open again in the middle of next week (tentatively: 27 May) and Leonhard a week or so later.

We have reason to believe that the attacker stole a user's credentials to gain access to our clusters. For this reason, we recommend that all users change their ETH (LDAP) password as soon as possible. All SSH keys stored on the cluster will be disabled. People using SSH keys will be required to generate new private/public key pairs, preferably of type ed25519, protected with a non-empty passphrase. Detailed instructions will be published on this wiki before we reopen Euler.
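
Until the detailed instructions are published, the key-generation step can be sketched as follows. This is only an illustration, not the official procedure: it assumes the OpenSSH ssh-keygen tool is installed on your personal computer, and the function names, the key file name and the comment string are hypothetical. No passphrase option is passed on the command line, so ssh-keygen asks for the passphrase interactively; it would accept an empty one, so it is up to you to enter a non-empty passphrase.

```python
import subprocess
from pathlib import Path


def build_keygen_command(key_path: Path, comment: str) -> list[str]:
    """Return an ssh-keygen command line for a new ed25519 key pair.

    The -N option is deliberately omitted, so ssh-keygen prompts for the
    passphrase and it never shows up in the shell history or process list.
    """
    return [
        "ssh-keygen",
        "-t", "ed25519",       # key type recommended in the announcement
        "-f", str(key_path),   # private key file; the public key gets ".pub"
        "-C", comment,         # free-text label stored with the public key
    ]


def generate_key(key_path: Path, comment: str) -> None:
    """Create the key pair, refusing to overwrite an existing key file."""
    if key_path.exists():
        raise FileExistsError(f"{key_path} already exists; choose another name")
    # Create ~/.ssh (or the chosen directory) if it does not exist yet.
    key_path.parent.mkdir(mode=0o700, parents=True, exist_ok=True)
    # Runs ssh-keygen in the foreground so that it can prompt for a passphrase.
    subprocess.run(build_keygen_command(key_path, comment), check=True)
```

Calling generate_key(Path.home() / ".ssh" / "id_ed25519_euler", "new key for Euler") would write the private key to that (hypothetical) file and the public key next to it with a ".pub" suffix. Building the command as a list rather than a shell string avoids quoting problems with unusual file names.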

2020-05-22 16:00

A special login node has been set up to let users of Leonhard Open access their data on this cluster. Please refer to this wiki page for details.

2020-05-26 17:30

The reinstallation of Euler is taking more time than expected. We hope to reopen the login nodes towards the end of this week.

2020-06-02 11:00

Euler will return to normal operation on Tuesday 2 June around 2 pm.

2020-06-17 22:00

The reinstallation of Leonhard (Open and Med) is almost complete. We expect that all login nodes will be reopened later this week. Compute nodes will be progressively reinstalled, tested and added to the batch system.

FAQ

I have stored files in my personal scratch directory ($SCRATCH). Will the regular purge of personal scratch directories continue while the clusters are closed?

Purging of the personal scratch directories has been suspended while the clusters are closed.

Are batch jobs affected by the closure of the clusters?

The compute nodes are being drained for reinstallation. All queues have been inactivated. Jobs that were running when the cluster was closed will complete normally, except possibly for some very long jobs, which will need to be terminated. New jobs will not be started. Pending jobs will remain in the queue until the cluster is back in operation.

I have a paper deadline and urgently need to access data on the clusters that I don't have stored locally. When will access to data on the cluster be possible?

Our first priority (Plan A) is to bring the clusters on-line as quickly as possible. Euler is still closed. Leonhard is partially open (data access only; no computation) while we reinstall it.

HPC resources are critical for our research. When will it again be possible to compute on the HPC clusters of ETH?

We are fully aware of this and are working around the clock — literally! — to bring you a clean and safe working environment as quickly as possible. Based on the current progress, we estimate that Euler will be reopened around 27 May. Leonhard will follow about a week later. A temporary solution has been put in place to access its storage systems in the meantime.

Do you know how Euler and Leonhard have been compromised?

The attacker apparently got access through a compromised user account. The details are still being investigated.

Have my data been accessed, copied or modified?

The investigation is still ongoing. So far, there is no indication that user data have been tampered with. This is not a ransomware attack.

What measures have been put in place to prevent similar events in the future?

Security starts with you. You must protect your account (strong password, SSH keys protected by a passphrase, etc.) to ensure that no one but you can use it. We are also taking additional measures to protect the system. For obvious reasons, we cannot disclose the details to the public.