Emergency maintenance to fix security vulnerability (CVE-2016-5195)

A recently published vulnerability in the Linux kernel (CVE-2016-5195, also known as "Dirty COW") allows any local user to gain full control of the operating system. This is a critical security issue, which leaves us with no choice but to take BOTH Brutus and Euler OFF-LINE until the issue has been fixed.
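For background: CVE-2016-5195 is a race condition in the kernel's copy-on-write handling that lets an unprivileged local user gain write access to read-only memory mappings and escalate to root. As a minimal sketch (not an official tool), one could flag nodes still running a pre-fix kernel roughly as follows; the MIN_FIXED build number is an assumed placeholder, since the authoritative fixed version comes from the distribution's security advisory:

 # Minimal sketch: flag nodes whose running kernel predates the fixed
 # build. MIN_FIXED is an assumed placeholder, not an official value.
 import platform
 
 MIN_FIXED = (2, 6, 32, 642, 6, 2)  # assumed fixed CentOS 6 build
 
 def version_tuple(release):
     """Turn a kernel release such as '2.6.32-642.6.2.el6.x86_64'
     into a comparable tuple of integers."""
     head = release.split(".el")[0]  # drop '.el6.x86_64'
     return tuple(int(p) for p in head.replace("-", ".").split(".") if p.isdigit())
 
 release = platform.release()
 if version_tuple(release) < MIN_FIXED:
     print(release + ": kernel predates the assumed fix -- treat as vulnerable")
 else:
     print(release + ": kernel is at or above the assumed fixed build")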

Since we cannot exclude the possibility that someone already exploited this vulnerability, all login nodes and compute nodes will need to be wiped clean and their OS reinstalled from scratch, before they can be put back in production.

The reinstallation of the login and compute nodes will affect only system files stored in these nodes' local file systems (/bin, /etc, /sbin, /scratch, /tmp, /usr, etc.). User data (/cluster/home, /cluster/scratch, /cluster/work, /cluster/project) pose no security risk and will therefore not be touched in any way.

At the time of writing, neither Red Hat nor CentOS has released a patch for the operating system that we are using on Brutus and Euler, and no one knows how long this will take. Please refrain from submitting tickets or sending emails asking when Brutus and Euler will be back on-line. We will publish regular status updates on this page and notify all cluster users by email when Brutus and Euler are on-line again.

Thank you for your understanding.

Updates

2016-10-25 13:30
Red Hat released a patch for RHEL 7 yesterday evening. It may take some time until they release one for RHEL 6, and more time still for CentOS to port it to the version we are using on our clusters (CentOS 6.8).
Our local kernel expert has therefore decided to write her own patch for CentOS 6.8, based on the publicly available information about the vulnerability. The cluster support team is testing it right now. As far as we can tell, it fixes the vulnerability, but we still have to make sure that the new kernel does not have any undesirable side effects. If these tests are successful, we will deploy it to the login nodes of Euler, and then progressively reinstall all compute nodes. That should allow us to (partly) reopen Euler while we wait for the official patch for CentOS 6.8.
2016-10-25 15:15
We have installed our custom-made patch on the login nodes of Euler and reopened them to all users. (You should have received a notification by email.)
Please note that we are doing this primarily to let Euler users access their data on the cluster. All compute nodes will remain closed until they are reinstalled. This process will span several days, as we need to wait until the nodes are empty before we can reinstall them. (We don't want to kill the jobs running there unless absolutely necessary.) In the meantime, the computing capacity of Euler will remain severely limited, which will result in long queueing times. In a first phase, only short (4h) jobs will be allowed to run; an example of submitting such a job is sketched below. Longer jobs (24h) will be allowed once we are certain that our custom-made patch does not have any undesirable side effects, and once a sufficient number of compute nodes have been reinstalled and put back into production. Very long jobs (120h) will not be allowed to run until CentOS releases an official patch for CentOS 6.8.
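For reference, here is a minimal sketch of submitting a job that fits within the short queue, assuming the cluster's LSF batch system (bsub) and a hypothetical job script myjob.sh. The key point is the requested wall-clock limit (-W, in hours:minutes), which must stay below 4 hours:

 # Minimal sketch. Assumptions: LSF's bsub is on the PATH and
 # './myjob.sh' is a hypothetical, executable job script.
 import subprocess
 
 walltime = "3:59"  # hours:minutes, just under the 4h queue limit
 result = subprocess.run(
     ["bsub", "-W", walltime, "./myjob.sh"],
     capture_output=True, text=True, check=True,
 )
 print(result.stdout)  # e.g. 'Job <12345> is submitted to queue <...>.'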
2016-10-25 16:30
Our first priority is to bring Euler back into production.
Brutus uses an older kernel that requires a different patch. It will therefore remain off-line until an official patch for CentOS 6.4 becomes available.
2016-10-26 13:15
We are currently testing the batch system and the InfiniBand network of Euler with a first series of compute nodes running our patched kernel. The short queues (4h) will be reactivated as soon as we are confident that everything works fine.
The official patch for CentOS 6.8 has just been released. We will compare its source code to our custom patch and, depending on the outcome, either continue to deploy our own patch or switch to the official one.
2016-10-27 11:00
The short queues (4h) have been reactivated. The remaining queues will be reactivated gradually, provided no problems are detected.
2016-10-27 18:00
The long queues of Euler have been reactivated this afternoon. We are continuing to reinstall compute nodes as they become empty.
There is still no official patch for the version of CentOS used on Brutus. We have therefore decided to upgrade one of the login nodes to a newer (patched) version of CentOS, so that Brutus users can at least access their data on this cluster. Unfortunately, we cannot guarantee that this new OS version is fully compatible with the hardware and software of Brutus. We will need to run many tests before we can upgrade compute nodes as well. In the meantime, all batch queues of Brutus will remain inactive.
2016-10-28 13:30
Euler is operating (almost) normally again. More than half of the cluster's compute nodes have been reinstalled already. All 4h, 24h and 120h queues are active and processing jobs.
A second login node of Brutus has been upgraded to CentOS 6.8. Everything seems to be running fine, so we have decided to upgrade compute nodes to the same version. Over 120 compute nodes have been reinstalled and are ready to be put back in production. As we did for Euler, we will reactivate the shortest (1h) queue first and, if all goes well, proceed with the 8h and 36h queues. As a precaution, the 7d queues will remain inactive until next week.
The new version of CentOS is incompatible with the (very old) Nvidia Tesla GPUs in Brutus. These GPUs will therefore be taken out of operation permanently. As a consequence, jobs requesting GPUs will no longer be able to run on Brutus.
2016-10-31 10:00
The long queues of Brutus (36h and 7d) will remain inactive while we investigate an issue with the updated Lustre client, which causes some jobs to hang when accessing /cluster/scratch_xl.
2016-11-02 17:00
Brutus will be taken OFF-LINE tomorrow (Nov 3rd) to upgrade its Lustre servers. This short-notice upgrade is necessary because the current Lustre version is not compatible with CentOS 6.8.
2016-11-04 14:00
The Lustre servers of Brutus have been successfully upgraded. The cluster is operational; all login nodes and almost all compute nodes are up and running. All queues are active except the 7d queues, which will be reactivated after the weekend.