Transition from LSF to Slurm

From ScientificComputing
Revision as of 12:34, 30 November 2022 by Lhausammann (talk | contribs)


Since Gonzales in 2004, all central clusters at ETH have used Platform's Load Sharing Facility (LSF) as a batch system. This product, which in the meantime has been bought and renamed Spectrum LSF by IBM, is the only commercial software used for the operation of the Euler cluster. Since LSF is licensed per core, its cost has increased significantly over the years and has become harder and harder to justify. Therefore, the Scientific IT Services decided last year to phase out LSF in favour of Slurm, an open-source batch system used by most HPC centres throughout the world — including CSCS in Lugano.

What will remain the same

Slurm offers the same basic functionality as LSF: you can use it to submit jobs, monitor their progress, and kill them if necessary. As in LSF, a job can be a single command, a parallel program using MPI or OpenMP, or a complex script. Slurm also supports GPUs and advanced features like job arrays.
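As an illustration of one of these advanced features, a Slurm job array can be requested with a single `--array` directive; each task then finds its index in the `SLURM_ARRAY_TASK_ID` environment variable. The program and input file names below are hypothetical placeholders, not something installed on Euler:

```shell
#!/bin/bash
#SBATCH --array=1-10          # run 10 independent tasks
#SBATCH --time=01:00:00       # run time limit per task

# Each task processes its own input file, selected by the array index.
# "process_input" and the input files are placeholders for your own workload.
./process_input input_${SLURM_ARRAY_TASK_ID}.dat
```

Submitting this script once with `sbatch` creates all ten tasks; Slurm schedules them independently as resources become available.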

To make the transition easier, we have configured Slurm to work in the same way as LSF: you only need to specify the resources needed by your job, such as the number of cores and GPUs (if applicable), memory, and run time. Slurm will analyse your job's requirements and automatically send it to the right partition. (Slurm uses partitions instead of queues, but the idea is the same.)
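In practice, these resource requirements are given as `#SBATCH` directives at the top of the job script. The sketch below uses standard Slurm options; the exact set of options supported on Euler may differ, so treat the values as an example rather than a recommendation:

```shell
#!/bin/bash
#SBATCH --ntasks=1            # one task (one process)
#SBATCH --cpus-per-task=4     # 4 cores for that task
#SBATCH --mem-per-cpu=2G      # 2 GB of memory per core
#SBATCH --time=04:00:00       # run time limit of 4 hours
##SBATCH --gpus=1             # uncomment if your job needs a GPU

# Your actual workload goes here; "my_program" is a placeholder.
./my_program
```

With these directives in place, submitting is a single command (`sbatch job_script.sh`), and Slurm picks the appropriate partition based on the requested resources.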

Today LSF is used not only to schedule jobs on the cluster, but also to ensure that all shareholders get their "fair share" of the cluster. The Cluster Support team has invested a lot of effort to implement this functionality in Slurm. Shareholder groups and priorities will therefore work the same way in Slurm as in LSF.

What will change

The LSF commands that you use today to submit and monitor jobs — bsub, bjobs — and their various options will need to be replaced with their Slurm equivalent as described here.
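As a rough orientation, the most common LSF commands map onto Slurm equivalents roughly as follows (a sketch, not an exhaustive list; options differ between the two systems, so check the Slurm man pages for details):

```shell
# Rough LSF -> Slurm command mapping (sketch):
#   bsub < script.sh   ->  sbatch script.sh      # submit a batch job
#   bjobs              ->  squeue -u $USER       # list your pending/running jobs
#   bkill <jobid>      ->  scancel <jobid>       # cancel a job
#   bqueues            ->  sinfo                 # show partitions (queues) and their state
```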

Some services that are tightly coupled to LSF, such as CLC Genomics Server and MATLAB Distributed Computing Server, will not be migrated to Slurm but will be phased out together with LSF at the end of 2022.

How will the transition from LSF to Slurm take place

Slurm is already installed on Euler and it can be used to run jobs on a small subset of nodes. We will soon start a public beta where users will be able to test it and adapt their scripts & workflows to this new batch system. This public beta will run until September 2022.

All new compute nodes that will be installed later this year will be managed by Slurm. Existing nodes will be progressively migrated from LSF to Slurm over the summer. We expect that by September, 60% of Euler's computing capacity will be managed by Slurm and 40% by LSF.

Our goal is to phase out LSF by the end of 2022. However, we may keep a small number of compute nodes under the control of LSF for a few more months, in case some shareholders need more time to transition complex workflows from LSF to Slurm.

Known issues

  • We are aware that job monitoring with Slurm may not be as convenient as with bbjobs under LSF. We are working on a bbjobs equivalent for Slurm.
  • We have received reports of performance issues with some jobs when running them under Slurm instead of LSF. We are investigating these issues.

More information