MPI on Euler


Introduction

The recent network drivers on Euler for the Infiniband high-speed interconnect no longer support the BTL OpenIB transport layer that is used, for instance, in OpenMPI <= 4.0.2. This has some consequences for MPI jobs that users run on Euler. Furthermore, the Euler VI and VII nodes have very new Mellanox ConnectX-6 network cards, which only support very recent MPI versions. For jobs to make use of this high-speed interconnect between the compute nodes, the MPI implementation that is used needs to be compatible with the current network driver (Mellanox OFED), and some configuration options may need to be set.

Infiniband vs. Ethernet

Please note that not all hardware generations in Euler have the Infiniband high-speed interconnect, which gives you the best performance for MPI jobs. Euler V and VIII nodes are connected with Ethernet. If you run an MPI job, please add the sbatch option

-C ib

to ensure that your job is dispatched to nodes that have Infiniband.
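A minimal job script using this option might look as follows (the module versions and the executable name are placeholders):

```shell
#!/bin/bash
#SBATCH --ntasks=48          # number of MPI ranks
#SBATCH --time=04:00:00
#SBATCH -C ib                # dispatch only to nodes with Infiniband

# load a toolchain providing MPI (versions are placeholders)
module load gcc/8.2.0 openmpi/4.1.4

# run the (placeholder) MPI executable
srun ./my_mpi_program
```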

OpenMPI

OpenMPI 4.0.2 or newer

Jobs using OpenMPI 4.0.2 or newer should not have any problems using Infiniband, as OpenMPI is compiled with UCX support. No additional configuration is required. Multi-node jobs also run fine on Euler VI and VII nodes.

OpenMPI older than 4.0.2

For OpenMPI versions older than 4.0.2 there are certain restrictions. On Euler VI and VII nodes, running an unsupported OpenMPI version works, but only for single-node jobs (which do not use the Infiniband interconnect); this limits those jobs to 128 cores. Multi-node jobs only work with OpenMPI >= 4.0.2. We implemented a rule in the batch system that prevents multi-node jobs using an unsupported OpenMPI version from being dispatched to Euler VI and VII nodes.

Jobs using OpenMPI older than 4.0.2 cannot use the Infiniband network on Euler (unless you compiled a version yourself with UCX support), as the centrally provided installations use the BTL OpenIB transport layer from verbs, which is no longer supported by the current network driver (Mellanox OFED). You can still run jobs with those installations (for instance to reproduce older results) on Euler III, IV and V nodes, but you need to disable the BTL OpenIB transport layer with the following mpirun option:

-mca btl ^openib

This disables the Infiniband high-speed interconnect and jobs fall back to the slower Ethernet, which still allows you to reproduce older results, just with lower performance.
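As a sketch, a complete mpirun invocation with this option could look like the following (the program name and rank count are placeholders):

```shell
# run 8 ranks over Ethernet with the OpenIB BTL disabled (placeholder program)
mpirun -mca btl ^openib -np 8 ./my_mpi_program
```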

Intel MPI

Intel MPI requires some configuration options, which differ between Intel MPI versions and node types. When not targeting a particular CPU model, users do not know which node type their jobs will be dispatched to. We therefore recommend using a script instead of specifying the mpirun command directly on the Slurm command line; in the script you can set configuration options based on the node type. Please find below an example script:

#!/bin/bash

# Load your modules here
# module load intel

# set configuration options based on the Intel MPI version / node type the job is running on

# get the host type from the node name (e.g. eu-g1-... -> g1)
hname=$(hostname | cut -d "-" -f 2)

# check whether the loaded Intel module is older than 2022.1.2
# (grep returns 1 when the version string does not match)
echo $INTEL_PYTHONHOME | grep -q '2022.1.2'
old_intel=$?
if [ $old_intel == 1 ]
then
  export I_MPI_PMI_LIBRARY=/cluster/apps/gcc-4.8.5/openmpi-4.1.4-pu2smponvdeu574nqolsw4rynnagngch/lib/libpmi.so
  case $hname in

          g1 | a2p)
                  # euler VI or VII node
                  # set variables for Intel MPI here
                  export FI_PROVIDER=verbs
          ;;
          *)
                  # not Euler VI or VII
                  export FI_PROVIDER=tcp
          ;;
  esac
fi

# command to run
srun ...

Known Issues

Using the TCP provider might result in lower performance (about 30% in our benchmarks). If you want the best performance with Intel MPI, please move to the most recent version of the Intel compiler.

Commercial software

Abaqus

Abaqus does not support the Slurm batch system; therefore you need to submit your jobs correctly to make sure they can use the resources allocated by the batch system.

Abaqus provides different MPI implementations that can be used with the software. We have tested Abaqus on Euler using the default IntelMPI implementation. Since Abaqus does not provide support for the Slurm batch system, you need to provide the host list to the software in the format mp_host_list=[['host1',n1],['host2',n2],...,['hostx',nx]]. The host list needs to be written into an Abaqus environment file abaqus_v6.env. Abaqus checks three locations for environment files:

  • install_directory/os/SMA/site/abaqus_v6.env
  • $HOME/abaqus_v6.env
  • current_directory/abaqus_v6.env

The host list is specific to each job, therefore we recommend writing it into a file in the same directory as the input file.
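The following sketch shows how a host list in this format can be built from per-task hostnames. The output of srun hostname is simulated here with a literal string for a hypothetical 2-node job with 4 tasks per node; in a real job script you would pipe srun hostname directly into the same commands (as done in the job script below):

```shell
# simulated output of `srun hostname` (hypothetical hostnames, 4 tasks per node)
hosts='eu-g1-001
eu-g1-001
eu-g1-001
eu-g1-001
eu-g1-002
eu-g1-002
eu-g1-002
eu-g1-002'

# count tasks per host and format the result as [['host1',n1],['host2',n2],...]
list=$(echo "$hosts" | sort | uniq -c | awk '{print("[\047"$2"\047,"$1"]")}' | paste -s -d ",")
echo "mp_host_list=[$list]"
```

This prints mp_host_list=[['eu-g1-001',4],['eu-g1-002',4]], which is exactly the line that needs to end up in abaqus_v6.env.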

Please find below an example Slurm job script to run Abaqus with MPI on multiple compute nodes:

#!/bin/bash
#SBATCH -n 8
#SBATCH --nodes=2
#SBATCH --tasks-per-node=4
#SBATCH --time=08:00:00
#SBATCH --mem-per-cpu=4000
#SBATCH --tmp=50g
#SBATCH --constraint=ib

module load intel/2022.1.2 abaqus/2023

unset SLURM_GTIDS

echo "mp_host_list=[$(srun hostname | sort | uniq -c | awk '{print("[\047"$2"\047,"$1"]")}' | paste -s -d ",")]" > abaqus_v6.env

abaqus job=test cpus=8 input=my_abaqus_input_file scratch=$TMPDIR mp_mode=MPI

Please note the following differences to running Abaqus in threads mode:

  • #SBATCH --constraint=ib makes sure that the job is allocated nodes with the fast Infiniband interconnect to handle inter-node communication efficiently
  • It is important to unset the variable SLURM_GTIDS, because the MPI implementations bundled with Abaqus do not support Slurm
  • The example above explicitly requests multiple nodes (--nodes=2) and specifies the number of cores per node (--tasks-per-node=4), but the commands for creating the host list also work when you only request a number of cores (--ntasks=8) and let Slurm decide whether those are allocated on one or multiple hosts
  • The job script creates a file abaqus_v6.env in the current directory. If such a file already exists, it will be overwritten

LS-DYNA

ANSYS LS-DYNA does not have Slurm support; therefore the list of nodes allocated by Slurm must be passed to the software explicitly.

We recommend using the following code snippet for this purpose. It stores the information about the allocated nodes in the format required by LS-DYNA:

# Extract from SLURM the information about cluster machine hostnames and number of tasks per node:
machines=""
for i in $(scontrol show hostnames=$SLURM_JOB_NODELIST); do
        machines=$machines:$i:$SLURM_NTASKS_PER_NODE
done
machines=${machines:1}
# For later check of this information, echo it to stdout so that the information is captured in the job file:
echo $machines
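For illustration, here is the same loop with hypothetical hostnames substituted for the scontrol output; it shows that the snippet produces a colon-separated string of hosts and task counts:

```shell
# hypothetical values: 2 nodes, 4 tasks per node; in a real job the hostnames
# come from `scontrol show hostnames=$SLURM_JOB_NODELIST` and the task count
# from Slurm
SLURM_NTASKS_PER_NODE=4
machines=""
for i in eu-g1-001 eu-g1-002; do
        machines=$machines:$i:$SLURM_NTASKS_PER_NODE
done
machines=${machines:1}   # strip the leading colon
echo $machines           # host1:n1:host2:n2
```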

Example jobscript:

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8 # sets SLURM_NTASKS_PER_NODE, which is used below
#SBATCH --time=8:00:00 
#SBATCH --constraint=ib # request infiniband network for MPI job

# Load ANSYS module
module load ansys/23.1_research

# Required for running ANSYS products with Slurm
unset SLURM_GTIDS

# Extract from SLURM the information about cluster machine hostnames and number of tasks per node:
machines=""
for i in $(scontrol show hostnames=$SLURM_JOB_NODELIST); do
        machines=$machines:$i:$SLURM_NTASKS_PER_NODE
done
machines=${machines:1}
# For later check of this information, echo it to stdout so that the information is captured in the job file:
echo $machines

# Set variables for old IntelMPI
unset I_MPI_PMI_LIBRARY
export FI_PROVIDER=verbs

# Set ANSYS input file and solver
lsdyna231 -dp -mpp i=./input.k -machines $machines