MPI on Euler

Introduction

The recent network drivers on Euler for the Infiniband high-speed interconnect no longer support the BTL OpenIB transport layer that is used, for instance, in OpenMPI <= 4.0.2. This has some consequences for MPI jobs that users run on Euler. Furthermore, the Euler VI and VII nodes have very new Mellanox ConnectX-6 network cards, which only support very recent MPI versions. For jobs to make use of this high-speed interconnect between the compute nodes, the MPI implementation that is used needs to be compatible with the current network driver (Mellanox OFED), and some configuration options may need to be set.

Infiniband vs. Ethernet

Please note that not all hardware generations in Euler have the Infiniband high-speed interconnect, which gives you the best performance for MPI jobs. Euler V and VIII nodes are connected with Ethernet. If you run an MPI job, then please add the sbatch option

-C ib

to ensure that your job is dispatched to nodes that have Infiniband.
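
As a minimal sketch, a job submission with this constraint could look as follows; the resource requests, the module line and the program name are placeholders that you would adapt to your own job:

#!/bin/bash
#SBATCH --ntasks=48          # number of MPI ranks (placeholder)
#SBATCH --time=01:00:00      # run time (placeholder)
#SBATCH --constraint=ib      # same as -C ib: only dispatch to Infiniband nodes

# load the compiler and MPI modules you intend to use (example, adapt to your setup)
# module load gcc openmpi

srun ./my_mpi_program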

OpenMPI

OpenMPI 4.0.2 or newer

Jobs using OpenMPI 4.0.2 or newer should not have any problems using Infiniband, as OpenMPI is compiled with UCX support. No additional configuration is required. Multi-node jobs will also run fine on Euler VI and VII nodes.
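
If you are unsure whether a given OpenMPI installation was built with UCX support, one way to check after loading the corresponding module is to look for UCX components in the output of ompi_info; the module line below is only an example:

# load the OpenMPI module you intend to use (example, adapt to your setup)
# module load gcc openmpi

# a UCX-enabled build lists UCX components such as "pml ucx"
ompi_info | grep -i ucx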

OpenMPI older than 4.0.2

For OpenMPI versions older than 4.0.2 there are certain restrictions. On Euler VI and VII nodes, running an unsupported OpenMPI version works, but only for single-node jobs (these do not use the Infiniband interconnect), which limits them to 128 cores. Multi-node jobs only work with OpenMPI >= 4.0.2. We have implemented a rule in the batch system that prevents multi-node jobs using an unsupported OpenMPI version from being dispatched to Euler VI and VII nodes.
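
If you need to stay on an older OpenMPI version, you can keep the job explicitly within a single node; a sketch with placeholder resource requests and a hypothetical job script name:

# request all ranks on one node, so the job does not rely on the Infiniband interconnect
sbatch --nodes=1 --ntasks=64 --time=01:00:00 ./my_job_script.sh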

Jobs using OpenMPI older than 4.0.2 cannot use the Infiniband network on Euler (unless you compiled a version yourself with UCX support), as the centrally provided installations use the BTL OpenIB transport layer from verbs, which is no longer supported by the current network driver (Mellanox OFED). You can still run jobs with those installations (for instance to reproduce older results) on Euler III, IV and V nodes, but you would need to disable the BTL OpenIB transport layer with the following mpirun option:

-mca btl ^openib

This will disable the Infiniband high-speed interconnect and jobs will fall back to the slower Ethernet, but it still allows older results to be reproduced, just with lower performance.
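
As a sketch, the option is simply passed together with your usual mpirun arguments; the rank count and program name are placeholders:

# disable the unsupported OpenIB BTL; communication falls back to TCP over Ethernet
mpirun -mca btl ^openib -np 48 ./my_mpi_program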

Intel MPI

Intel MPI requires some configuration options (which are different for each Intel MPI version and node type). Users do not know which node type their jobs will be dispatched to when not targeting a particular CPU model, and therefore we recommend using a script instead of specifying the mpirun command directly on the Slurm command line. In the script you can set the configuration options based on the node type. Please find an example script below:

#!/bin/bash

# Load your modules here
# module load intel

# Set configuration options based on the Intel version and the node type the job is running on

# get the node type from the hostname (e.g. eu-g1-... -> g1)
hname=$(hostname | cut -d "-" -f 2)

# check which Intel version is loaded; grep returns 0 if it is oneAPI 2022.1.2
echo "$INTEL_PYTHONHOME" | grep -q '2022.1.2'
old_intel=$?

# only older Intel versions need these settings
if [ "$old_intel" -eq 1 ]; then
        # use the PMI library provided on the cluster so that srun can launch the ranks
        export I_MPI_PMI_LIBRARY=/cluster/apps/gcc-4.8.5/openmpi-4.1.4-pu2smponvdeu574nqolsw4rynnagngch/lib/libpmi.so
        case $hname in
                g1 | a2p)
                        # Euler VI or VII node
                        # set variables for Intel MPI here
                        export FI_PROVIDER=verbs
                ;;
                *)
                        # not Euler VI or VII
                        export FI_PROVIDER=tcp
                ;;
        esac
fi

# command to run
srun ...
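
The script is then submitted as the batch job itself, so the version check and the case statement run on the allocated node type; a sketch with placeholder resource requests and a hypothetical script name:

sbatch --ntasks=48 --time=01:00:00 ./run_intel_mpi.sh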

Known Issues

Using the TCP provider might result in lower performance (30% in our benchmarks). If you wish to have the best performance with Intel MPI, please move to the most recent version of the Intel compiler.
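
To see which provider an Intel MPI job actually selected, one option is to raise Intel MPI's debug output before launching; a minimal sketch:

# print the selected fabric/provider at startup (look for the libfabric provider line in the job output)
export I_MPI_DEBUG=5
srun ./my_mpi_program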