Transition from LSF to Slurm

From ScientificComputing
Revision as of 10:30, 6 May 2022 by Urbanb (talk | contribs) (What will change)

Since Gonzales in 2004, all central clusters at ETH have used Platform's Load Sharing Facility (LSF) as a batch system. This product, which has since been acquired by IBM and renamed Spectrum LSF, is the only commercial software used for the operation of the Euler cluster. Since LSF is licensed per core, its cost has increased significantly over the years and has become harder and harder to justify. Therefore, the Scientific IT Services decided last year to phase out LSF in favour of Slurm, an open-source batch system used by most HPC centres throughout the world -- including CSCS in Lugano.

What will remain the same

Today LSF is used not only to schedule jobs on the cluster, but also to ensure that all shareholders get their "fair share" of the cluster. The Cluster Support team has invested a lot of effort to implement this functionality in Slurm.

What will change

The LSF commands that you use today to submit and monitor jobs — bsub, bjobs — and their various options will not work with Slurm.
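
To give a rough idea of what adapting a submission command involves, here is a minimal Python sketch that rewrites a simple bsub command line into an sbatch one. It is an illustration only, not an official conversion tool: the helper names (lsf_walltime_to_slurm, translate_bsub) are hypothetical, the mapping covers just a handful of widely used options, and the choice of --ntasks as the counterpart of -n is an assumption that may not match how Euler is configured. One detail worth noting is the run-time limit: LSF's -W expects [hours:]minutes, whereas Slurm's --time reads a value such as 4:00 as minutes:seconds, so the value has to be converted rather than copied.

```python
import shlex

# Widely used LSF -> Slurm command equivalents (illustrative, not exhaustive).
COMMAND_MAP = {"bsub": "sbatch", "bjobs": "squeue", "bkill": "scancel"}

# Options that take one value and have a direct sbatch counterpart.
# Assumption: "-n" is mapped to "--ntasks"; a site may prefer another option.
OPTION_MAP = {
    "-n": "--ntasks",
    "-J": "--job-name",
    "-o": "--output",
    "-e": "--error",
}


def lsf_walltime_to_slurm(value):
    """Convert an LSF -W value ([hours:]minutes) to Slurm's HH:MM:SS form."""
    parts = value.split(":")
    if len(parts) == 1:
        hours, minutes = 0, int(parts[0])
    elif len(parts) == 2:
        hours, minutes = int(parts[0]), int(parts[1])
    else:
        raise ValueError(f"unexpected LSF run-time limit: {value!r}")
    total = hours * 60 + minutes
    return f"{total // 60:02d}:{total % 60:02d}:00"


def translate_bsub(argv):
    """Translate a simple 'bsub [options] command...' list into sbatch arguments."""
    if not argv or argv[0] != "bsub":
        raise ValueError("expected a command line starting with 'bsub'")
    result = [COMMAND_MAP["bsub"]]
    i = 1
    while i < len(argv) and argv[i].startswith("-"):
        opt = argv[i]
        if i + 1 >= len(argv):
            raise ValueError(f"option {opt} is missing its value")
        val = argv[i + 1]
        if opt == "-W":
            result.append(f"--time={lsf_walltime_to_slurm(val)}")
        elif opt in OPTION_MAP:
            result.append(f"{OPTION_MAP[opt]}={val}")
        else:
            # Refuse to guess: unknown options must be translated by hand.
            raise ValueError(f"no known Slurm equivalent for {opt}")
        i += 2
    command = argv[i:]
    if command:
        # sbatch expects a script; --wrap submits a plain command line instead.
        result.append(f"--wrap={shlex.join(command)}")
    return result


if __name__ == "__main__":
    print(translate_bsub(["bsub", "-n", "4", "-W", "4:00", "./my_program", "input.txt"]))
```

The sketch deliberately raises an error for any option it does not recognise instead of guessing, since many LSF options (resource requirement strings in particular) have no one-to-one Slurm counterpart and need to be reviewed by hand.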

Some services that are tightly coupled to LSF, such as CLC Genomics Server and MATLAB Distributed Computing Server, will not be migrated to Slurm but will be phased out together with LSF.

How will the transition from LSF to Slurm take place

Slurm is already installed on Euler and can be used to run small jobs on a small subset of nodes. We will soon start a public beta where users will be able to test it and adapt their scripts and workflows to this new batch system. This public beta will run until September 2022.

All new compute nodes that will be installed later this year will be managed directly by Slurm. Existing nodes will be progressively migrated from LSF to Slurm over the summer. We expect that by September, 60% of Euler's computing capacity will be managed by Slurm and 40% by LSF.

Our goal is to phase out LSF by the end of 2022. However, we may keep a small number of compute nodes under the control of LSF for a few more months, in case some shareholders need more time to transition complex workflows from LSF to Slurm.