# Nvidia DGX-1 with Tensor Cores

## Introduction

The newest extension to the Leonhard cluster consists of four **Nvidia DGX-1 deep learning servers**, each equipped with 8 Tesla V100 GPU cards connected through the NVLink2 interconnect. Nvidia developed the Tesla V100 specifically for deep learning applications. Each of the 8 cards has 640 Tensor Cores in addition to its 5120 CUDA cores, which amounts to a **total of 40960 CUDA cores and 5120 Tensor Cores per DGX-1 server**. According to the DGX-1 data sheet, one server delivers a peak performance of 1 PFLOPS (mixed precision).

## Open beta

We kindly invite our Leonhard users to join the open beta test and gain some experience with the new DGX-1 servers. To run a job on the DGX-1 servers, add the following bsub option to your submission command:

-R volta

Selecting this resource does not request any GPUs by itself: you still need to request them with the option **-R "rusage[ngpus_excl_p=1]"**.
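As a minimal sketch of how the two options combine, the following Python snippet assembles a complete submission command. The helper `build_bsub_command` and the script name `train.py` are hypothetical and only serve as an illustration; the `volta` resource and the `rusage[ngpus_excl_p=...]` string are the ones described above.

```python
import shlex


def build_bsub_command(job_command, ngpus=1):
    """Assemble a bsub command line that targets the DGX-1 (volta) servers.

    Hypothetical helper for illustration only; it does not submit anything.
    """
    args = [
        "bsub",
        "-R", "volta",                          # select the DGX-1 servers
        "-R", f"rusage[ngpus_excl_p={ngpus}]",  # GPUs must still be requested
    ]
    args.extend(job_command)
    # shlex.join quotes the rusage string so the shell leaves the brackets alone
    return shlex.join(args)


if __name__ == "__main__":
    # train.py is a placeholder for your own script
    print(build_bsub_command(["python", "train.py"]))
```

The printed line can be pasted into a shell on a login node. Quoting the `rusage[...]` string matters because square brackets are otherwise subject to shell pattern expansion.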

## Specifications

| Component | Specification |
| --- | --- |
| GPU | 8x Tesla V100 |
| GPU memory | 32 GB HBM2 memory per card |
| GPU memory bandwidth | 900 GB/s |
| GPU interconnect | NVLink2, 300 GB/s per card |
| CPU | 2x Intel Xeon E5-2698v4 (20 cores each) |
| Memory | 512 GB |
| Storage | 4x 1.92 TB SSD |
| Peak performance | 1 PFLOPS (mixed precision), according to Nvidia |

## Tensor Cores

Tesla V100’s Tensor Cores are programmable matrix-multiply-and-accumulate units.

- Each Tensor Core provides a 4x4x4 matrix processing array which performs the operation D = A * B + C, where A, B, C and D are 4×4 matrices
- Each Tensor Core performs 64 floating point FMA mixed-precision operations per clock (FP16 input multiply with full-precision product and FP32 accumulate)
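The figures above can be checked with a little arithmetic. The sketch below counts the multiply-add operations in one 4x4x4 matrix operation and derives a rough Tensor Core peak per card and per server. The 1.53 GHz boost clock is an assumption taken from Nvidia's published V100 specifications, not from this page, and each FMA is counted as two floating point operations.

```python
# Rough peak-throughput estimate for the Tensor Cores of one DGX-1 server.

TENSOR_CORES_PER_GPU = 640   # per Tesla V100 card
GPUS_PER_SERVER = 8          # cards per DGX-1
BOOST_CLOCK_HZ = 1.53e9      # ASSUMPTION: V100 boost clock, not stated on this page

# D = A * B + C on 4x4 matrices: each of the 16 elements of D needs 4 multiply-adds
fma_per_clock = 4 * 4 * 4
flops_per_fma = 2            # one multiply plus one add

flops_per_gpu = TENSOR_CORES_PER_GPU * fma_per_clock * flops_per_fma * BOOST_CLOCK_HZ
flops_per_server = flops_per_gpu * GPUS_PER_SERVER

print(f"FMA per Tensor Core per clock: {fma_per_clock}")
print(f"Per GPU:    {flops_per_gpu / 1e12:.1f} TFLOPS")
print(f"Per server: {flops_per_server / 1e15:.2f} PFLOPS")
```

Under these assumptions the estimate comes out at roughly 1 PFLOPS per server, which is consistent with the mixed-precision peak quoted by Nvidia. It is a theoretical upper bound; real applications will reach less.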

Two CUDA libraries that use Tensor Cores are cuBLAS and cuDNN.

### cuBLAS

cuBLAS uses Tensor Cores to speed up GEMM computations (GEMM is the BLAS term for a matrix-matrix multiplication).

Existing cuBLAS GEMM code must meet the following requirements to use Tensor Cores:

- The routine must be a GEMM; currently, only GEMMs support Tensor Core execution.
- The math mode must be set to CUBLAS_TENSOR_OP_MATH. Floating point math is not associative, so the results of the Tensor Core math routines are not quite bit-equivalent to the results of the analogous non-Tensor Core math routines. cuBLAS requires the user to “opt in” to the use of Tensor Cores.
- k, lda, ldb, and ldc must all be multiples of eight; m must be a multiple of four. The Tensor Core math routines stride through input data in steps of eight values, so the dimensions of the matrices must be multiples of eight.
- The input and output data types for the matrices must be either half precision (CUDA_R_16F) or single precision (CUDA_R_32F).

GEMMs that do not satisfy the above rules will fall back to a non-Tensor Core implementation.
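Because the fallback happens silently, it can help to check the size and type rules before launching a job. The following sketch is a hypothetical helper that only mirrors the rules listed above; it does not call or query cuBLAS, and the data types are passed as plain strings for illustration.

```python
def gemm_tensor_core_violations(m, k, lda, ldb, ldc, dtype):
    """List the rules from this page that a GEMM call would violate.

    An empty list means the size and type rules are met. The math mode must
    still be set to CUBLAS_TENSOR_OP_MATH in the actual cuBLAS code.
    """
    violations = []
    if dtype not in ("CUDA_R_16F", "CUDA_R_32F"):
        violations.append(f"data type {dtype} is neither half nor single precision")
    if m % 4 != 0:
        violations.append(f"m={m} is not a multiple of four")
    for name, value in (("k", k), ("lda", lda), ("ldb", ldb), ("ldc", ldc)):
        if value % 8 != 0:
            violations.append(f"{name}={value} is not a multiple of eight")
    return violations


if __name__ == "__main__":
    # Example sizes chosen to break two of the rules
    for problem in gemm_tensor_core_violations(1022, 1020, 1024, 1024, 1024, "CUDA_R_16F"):
        print(problem)
```

Returning the list of violated rules, rather than a single yes or no, makes it easier to see which dimension needs padding.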

### cuDNN

cuDNN uses Tensor Cores to speed up both convolutions and recurrent neural networks (RNNs).

Using Tensor Cores requires a few changes from common cuDNN use:

- The convolution algorithm must be ALGO_1 (IMPLICIT_PRECOMP_GEMM for forward). Other convolution algorithms besides ALGO_1 may use Tensor Cores in future cuDNN releases.
- The math type must be set to CUDNN_TENSOR_OP_MATH. As in cuBLAS, the results of the Tensor Core math routines are not quite bit-equivalent to the results of the analogous non-Tensor Core math routines, so cuDNN requires the user to “opt in” to the use of Tensor Cores.
- Both input and output channel dimensions must be a multiple of eight. Again as in cuBLAS, the Tensor Core math routines stride through input data in steps of eight values, so the dimensions of the input data must be multiples of eight.
- The input, filter, and output data types for the convolutions must be half precision.

Convolutions that do not satisfy the above rules will fall back to a non-Tensor Core implementation.
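The channel rule is the one most often broken in practice, for example by an RGB input layer with 3 channels. The sketch below checks the rules listed above and rounds a channel count up to the next multiple of eight. Both functions are hypothetical helpers that do not call cuDNN, and padding the channel dimension is only one possible workaround, not something this page prescribes.

```python
def pad_channels(channels, multiple=8):
    """Round a channel count up to the next multiple of `multiple`."""
    return ((channels + multiple - 1) // multiple) * multiple


def conv_meets_tensor_core_rules(in_channels, out_channels, half_precision, uses_algo_1):
    """Check a convolution against the rules listed on this page.

    The math type must still be set to CUDNN_TENSOR_OP_MATH in the real code.
    """
    return bool(
        uses_algo_1
        and half_precision
        and in_channels % 8 == 0
        and out_channels % 8 == 0
    )


if __name__ == "__main__":
    rgb = 3  # channels of an RGB image
    print(conv_meets_tensor_core_rules(rgb, 64, True, True))
    print(conv_meets_tensor_core_rules(pad_channels(rgb), 64, True, True))
```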

### Leonhard

On Leonhard, Tensor Cores are supported by programs that can use

- CUDA 9.0.176 and newer versions
- cuDNN 7.0.3 and newer versions

The Python modules **python_gpu/2.7.14** and **python_gpu/3.6.4** automatically load the **cuda/9.0.176** and **cudnn/7.0** (7.0.3) modules, which support Tensor Cores. If you would like to use Tensor Cores with another Python installation, please make sure that you load one of the following module combinations:

module load cuda/9.0.176 cudnn/7.0

or

module load cuda/9.0.176 cudnn/7.3

or

module load cuda/9.2.88 cudnn/7.2

Please note that the combination of `cuda/9.2.88` and `cudnn/7.3` does not work, because Nvidia built cudnn/7.3 against CUDA 9.0.176.

**Please make sure that your code fulfills the requirements listed in the sections cuBLAS and/or cuDNN before you submit your jobs to the DGX-1 servers.**

TensorFlow 1.5 is the first version that supports CUDA 9.0 and cuDNN 7.0 and therefore Tensor Cores. According to the Nvidia documentation, the following frameworks can use Tensor Cores automatically if FP16 storage is enabled:

- NVCaffe
- Caffe2
- MXNet
- PyTorch
- TensorFlow
- Theano

## Useful links

- https://devblogs.nvidia.com/programming-tensor-cores-cuda-9/
- https://devblogs.nvidia.com/tensor-ops-made-easier-in-cudnn/
- https://devblogs.nvidia.com/tensor-cores-mixed-precision-scientific-computing/
- https://devblogs.nvidia.com/video-mixed-precision-techniques-tensor-cores-deep-learning/
- https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html#framework
- https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf