Emergency maintenance to fix security vulnerability (CVE-2016-5195)

From ScientificComputing
Revision as of 09:59, 27 October 2016 by Urbanb

A recently published vulnerability in the Linux kernel (CVE-2016-5195) allows any user to get full control of the operating system. This is a critical security issue, which leaves us with no choice but to take BOTH Brutus and Euler OFF-LINE until the issue has been fixed.

Since we cannot exclude the possibility that someone already exploited this vulnerability, all login nodes and compute nodes will need to be wiped clean and their OS reinstalled from scratch, before they can be put back in production.

The reinstallation of the login and compute nodes will affect only system files stored in these nodes' local file system (/bin, /etc, /sbin, /scratch, /tmp, /usr, etc.). User data (/cluster/home, /cluster/scratch, /cluster/work, /cluster/project) do not pose any security risk and will therefore not be touched in any way.
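As a rough illustration of the distinction above, the following Python sketch classifies a path as node-local (wiped during reinstallation) or shared user storage (not touched). The directory prefixes are copied from the paragraph above; since that list ends with "etc.", the set of local prefixes here is not exhaustive, and the function name classify is purely hypothetical.

```python
from pathlib import PurePosixPath

# Prefixes copied from the announcement. The original list of local
# directories ends with "etc.", so LOCAL_PREFIXES is NOT exhaustive.
LOCAL_PREFIXES = ("/bin", "/etc", "/sbin", "/scratch", "/tmp", "/usr")
SHARED_PREFIXES = (
    "/cluster/home",
    "/cluster/scratch",
    "/cluster/work",
    "/cluster/project",
)


def _is_under(path, prefix):
    """Return True if path equals prefix or lies somewhere below it."""
    p = PurePosixPath(path)
    root = PurePosixPath(prefix)
    return p == root or root in p.parents


def classify(path):
    """Hypothetical helper: say whether a path is affected by the reinstall."""
    if any(_is_under(path, s) for s in SHARED_PREFIXES):
        return "shared (not touched)"
    if any(_is_under(path, s) for s in LOCAL_PREFIXES):
        return "node-local (wiped)"
    return "unknown"


if __name__ == "__main__":
    for example in ("/scratch/job123/out.dat", "/cluster/scratch/user/out.dat"):
        print(example, "->", classify(example))
```

Note that /scratch (local to each node) and /cluster/scratch (shared) are different locations, which is why the check compares whole path components rather than substrings.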

At the time of writing, neither Red Hat nor CentOS has released a patch for the operating system that we are using on Brutus and Euler. No one knows how long this will take. Please refrain from submitting tickets or sending emails asking when Brutus and Euler will be back on-line. We will publish regular status updates on this page and notify all cluster users by email when Brutus and Euler are on-line again.

Thank you for your understanding.


2016-10-25 13:30 :

  • Red Hat released a patch for RHEL 7 yesterday evening. It may take some time until they release one for RHEL 6, and then for CentOS to port it to the version we are using on our clusters (CentOS 6.8).
  • Our local kernel expert has therefore decided to write her own patch for CentOS 6.8, based on the information publicly available about the kernel's vulnerability. The cluster support team is testing it right now. As far as we can tell, it fixes the vulnerability, but we still have to make sure that the new kernel does not have any undesirable side effects. If these tests are successful, we will deploy it to the login nodes of Euler, and then progressively reinstall all compute nodes. That should allow us to (partly) reopen Euler while we wait for the official patch for CentOS 6.8.

2016-10-25 15:15 :

  • We have installed our custom-made patch on the login nodes of Euler and reopened them to all users. (You should have received a notification by email.)
  • Please note that we are doing this primarily to let Euler users access their data on the cluster. All compute nodes will remain closed until they are reinstalled. This process will span several days as we need to wait until they are empty before we can reinstall them. (We don't want to kill the jobs running there unless absolutely necessary.) In the meantime, the computing capacity of Euler will remain severely limited, which will result in long queueing times. In the first phase, only short (4h) jobs will be allowed to run. Longer jobs (24h) will be allowed once we are certain that our custom-made patch does not have any undesirable side effects, and once a sufficient number of compute nodes have been reinstalled and put back into production. Very long jobs (120h) will not be allowed to run until CentOS releases an official patch for CentOS 6.8.
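
The staged reopening described above can be summarised in a small Python sketch. The run-time limits (4h, 24h, 120h) come from the text; the phase names and the function may_run are hypothetical labels introduced only for this illustration and do not correspond to any setting of the batch system.

```python
# Maximum job run time (in hours) allowed in each reopening phase.
# The limits are taken from the announcement; the phase names are
# hypothetical labels, not actual batch-system configuration.
MAX_HOURS_BY_PHASE = {
    "short_only": 4,               # first phase: only short jobs
    "custom_patch_validated": 24,  # custom patch trusted, enough nodes reinstalled
    "official_patch": 120,         # official CentOS 6.8 patch deployed
}


def may_run(requested_hours, phase):
    """Return True if a job requesting this run time fits the given phase."""
    return requested_hours <= MAX_HOURS_BY_PHASE[phase]


if __name__ == "__main__":
    for hours in (4, 24, 120):
        allowed = [p for p in MAX_HOURS_BY_PHASE if may_run(hours, p)]
        print(f"{hours}h job allowed in phases: {allowed}")
```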

2016-10-25 16:30 :

  • Our first priority is to bring Euler back into production.
  • Brutus uses an older kernel that requires a different patch. It will therefore remain off-line until an official patch for CentOS 6.4 becomes available.

2016-10-26 13:15 :

  • We are currently testing the batch system and the InfiniBand network of Euler with a first series of compute nodes running our patched kernel. The short queues (4h) will be reactivated as soon as we are confident that everything works fine.
  • The patch for CentOS 6.8 has just been released. We will compare its source code to our custom patch. Depending on the outcome, we will continue to deploy our own or switch to the official patch.

2016-10-27 11:00 :

  • The short queues (4h) have been reactivated. The rest of the queues will be gradually activated if no problems are detected.