AMD MI250 Evaluation


Introduction

A test node with 4 AMD MI250 GPU cards is available for testing. Each of these GPU cards is a pair of independent GPUs with 64 GiB memory each, connected with AMD Infinity Fabric links.

These four GPUs are therefore seen as 8 individual GPUs with 64 GiB of GPU RAM each.
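The arithmetic behind this layout can be sketched in a few lines (numbers taken from the node description above; the variable names are illustrative only):

```python
# Each MI250 card is a pair of independent GPUs (Graphics Compute Dies),
# each with its own 64 GiB of memory.
cards = 4
gpus_per_card = 2
mem_per_gpu_gib = 64

logical_gpus = cards * gpus_per_card
total_mem_gib = logical_gpus * mem_per_gpu_gib
print(f"{logical_gpus} logical GPUs, {total_mem_gib} GiB total GPU memory")
```

Software such as Slurm and PyTorch therefore schedules and enumerates 8 devices, not 4.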

Quickstart guide

The following quickstart tutorial shows you a quick but complete example of running a calculation on the AMD GPU. It

  1. sets up your environment for Apptainer,
  2. downloads the PyTorch quickstart tutorial script, and
  3. runs it on an AMD GPU node using an AMD-built PyTorch container that we cache locally.

For further information about these steps, please refer to the following sections on this wiki page.

One-time setup of Apptainer

grep -qE "(SINGULARITY|APPTAINER)_CACHEDIR" ~/.bashrc || echo 'export APPTAINER_CACHEDIR="$SCRATCH/.apptainer"' >> ~/.bashrc
id -Gn | grep -qw SINGULARITY || /cluster/apps/local/get-access

Run tutorial

# Enable access to the Slurm cluster containing the AMD GPU evaluation node:
export SLURM_CONF=/cluster/adm/slurm-amdgpu/slurm/etc/slurm.conf
# Setup environment
module load eth_proxy
mkdir -p ~/amd-gpu-quickstart
cd ~/amd-gpu-quickstart
# Download PyTorch quickstart tutorial.
wget https://github.com/pytorch/tutorials/raw/main/beginner_source/basics/quickstart_tutorial.py
# Submit interactive Slurm job to run the PyTorch quickstart tutorial.
srun --pty --gpus=amdgpu:1 --mem-per-cpu=4g apptainer-amdgpu /cluster/work/apptainer/rocm/pytorch_latest.sif python quickstart_tutorial.py

Usage

We have placed this GPU node into a separate Slurm cluster. To use this cluster, you will need to set the SLURM_CONF environment variable:

export SLURM_CONF=/cluster/adm/slurm-amdgpu/slurm/etc/slurm.conf

To run a job on the AMD GPUs, you need to request the amdgpu GPU model:

sbatch --cpus-per-task=1 --gpus=amdgpu:1 --mem-per-cpu=32g --wrap="./my_gpu_program"

Of course, you should adjust the memory requirement and any other sbatch options to your use case. However, avoid using other GPU selection options, such as requesting a specific GPU memory size, for this evaluation.

Please note that during the test phase, we only allow jobs with a maximum runtime of 24 hours.
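For longer-running work, the options above can also be collected into a batch script. The following is a minimal sketch; `./my_gpu_program` is a placeholder from the example above, and the resource values should be adjusted to your use case:

```shell
# Write a sketch of a batch script for the AMD GPU evaluation cluster.
cat > amdgpu_job.sh <<'EOF'
#!/bin/bash
#SBATCH --cpus-per-task=1
#SBATCH --gpus=amdgpu:1
#SBATCH --mem-per-cpu=32g
#SBATCH --time=24:00:00   # maximum allowed during the test phase
./my_gpu_program
EOF

# Submit it with the evaluation cluster's configuration active:
#   export SLURM_CONF=/cluster/adm/slurm-amdgpu/slurm/etc/slurm.conf
#   sbatch amdgpu_job.sh
```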

Software

We currently do not provide any software that has been compiled for these AMD GPUs. You will need to use Apptainer (née Singularity) containers to use them, such as those provided on Dockerhub by AMD. Please refer to the summary of available images and an example of using them.

For your convenience, during this beta period we are providing Apptainer .sif images of AMD's ROCm-enabled PyTorch and TensorFlow. They are available in /cluster/work/apptainer/rocm.

Working with Apptainer containers

The AMD GPU containers tend to be very large (multiple gigabytes)! We therefore recommend the following steps for downloading and using them.

Apptainer environment

Add the following line to your .bashrc file:

export APPTAINER_CACHEDIR="$SCRATCH/.apptainer"

to store the large temporary files used by Apptainer in your personal scratch directory ($SCRATCH).
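Put together, the setup looks as follows. On the cluster, $SCRATCH is set for you; the `${SCRATCH:-$HOME/scratch}` fallback here is only for illustration:

```shell
# Point Apptainer's cache at personal scratch instead of the small $HOME quota.
export APPTAINER_CACHEDIR="${SCRATCH:-$HOME/scratch}/.apptainer"

# Create the cache directory if it does not exist yet.
mkdir -p "$APPTAINER_CACHEDIR"
echo "Apptainer cache: $APPTAINER_CACHEDIR"
```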

Downloading a container image

On a login node, run

cd $SCRATCH
apptainer pull docker://rocm/pytorch:latest

This will pull the PyTorch Docker image, convert it to the Apptainer .sif format, and store it in $SCRATCH. Due to its size, this can take up to an hour.

You should end up with a pytorch_latest.sif image file in your $SCRATCH directory:

ls -ld $SCRATCH/pytorch_latest.sif
-rwxr-x--- 1 urbanb urbanb-group 16994959360 Mar 20 14:20 /cluster/scratch/urbanb/pytorch_latest.sif

Running a container image

We provide the apptainer-amdgpu wrapper to run an Apptainer container image. For example,

[apps@eu-login-41 ~]$ srun --pty --gpus=amdgpu:1 --mem-per-cpu=32g apptainer-amdgpu /cluster/work/apptainer/rocm/pytorch_latest.sif python
Python 3.9.18 (main, Sep 11 2023, 13:41:44) 
[GCC 11.2.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>>

Note that all cluster locations such as your $HOME and $SCRATCH are available in the container.
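You can verify this from inside the container with a short stdlib-only check (a sketch; $SCRATCH may be unset in other environments, which the snippet tolerates):

```python
import os

# Cluster filesystems such as $HOME and $SCRATCH are bind-mounted into the
# container, so the same paths resolve inside and outside of it.
for var in ("HOME", "SCRATCH"):
    path = os.environ.get(var)
    if path:
        print(f"${var} -> {path} (exists: {os.path.isdir(path)})")
    else:
        print(f"${var} is not set")
```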

To run a script, just append it to the previous command. For example, the Quickstart tutorial can be tested as

[apps@eu-login-41 ~]$ module load eth_proxy
[apps@eu-login-41 ~]$ wget https://github.com/pytorch/tutorials/raw/main/beginner_source/basics/quickstart_tutorial.py
[apps@eu-login-41 ~]$ srun --gpus=amdgpu:1 --mem-per-cpu=4g apptainer-amdgpu $SCRATCH/pytorch_latest.sif python ~/quickstart_tutorial.py

Support

We provide only limited support for running software on this evaluation GPU node. However, do not hesitate to contact us in case of problems submitting jobs or accessing the GPUs. Please state in your ticket that it concerns the AMD GPU!

We would appreciate feedback on your experience with the performance and capabilities of the GPU itself.