Nvidia DGX-1 with Tensor Cores
The newest extension to the Leonhard cluster consists of four Nvidia DGX-1 deep learning servers, each equipped with 8 Tesla V100 GPU cards connected with NVLink2. The Tesla V100 GPU cards were developed by Nvidia for deep learning applications. In addition to its 5120 CUDA cores, each of the 8 Tesla V100 cards has 640 Tensor Cores, which amounts to a total of 40960 CUDA cores and 5120 Tensor Cores per DGX-1 server. According to the DGX-1 data sheet, one server provides a peak performance of 1 PFLOPS (mixed precision).
We kindly invite our Leonhard users to join the open beta and test the new DGX-1 servers. We will soon provide information on how jobs can be run on the DGX-1 servers.
| Component | Specification |
|---|---|
| GPU | 8x Tesla V100 |
| GPU memory | 32 GB HBM2 memory per card |
| GPU memory bandwidth | 900 GB/s per card |
| GPU interconnect | NVLink2, 300 GB/s per card |
| CPU | 2x Intel Xeon E5-2698v4 (20 cores each) |
| Storage | 4x 1.92 TB SSD |
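The per-server totals and the 1 PFLOPS figure can be checked with a few lines of arithmetic. The following is a minimal sketch, not an official calculation: it uses the 64 FMA operations per Tensor Core per clock given in the Tensor Core description on this page, and it assumes a boost clock of 1.53 GHz, which is not stated here.

```python
# Per-card figures quoted on this page.
CARDS_PER_SERVER = 8
CUDA_CORES_PER_CARD = 5120
TENSOR_CORES_PER_CARD = 640
GPU_MEMORY_GB_PER_CARD = 32
FMA_PER_TENSOR_CORE_PER_CLOCK = 64  # one 4x4x4 multiply-accumulate per clock
FLOPS_PER_FMA = 2                   # one multiply plus one add

# ASSUMPTION: boost clock in Hz; this value is not given on this page.
ASSUMED_BOOST_CLOCK_HZ = 1.53e9

cuda_cores_per_server = CARDS_PER_SERVER * CUDA_CORES_PER_CARD
tensor_cores_per_server = CARDS_PER_SERVER * TENSOR_CORES_PER_CARD
gpu_memory_gb_per_server = CARDS_PER_SERVER * GPU_MEMORY_GB_PER_CARD

# Peak mixed-precision throughput if every Tensor Core is busy on every clock.
peak_flops_per_card = (TENSOR_CORES_PER_CARD
                       * FMA_PER_TENSOR_CORE_PER_CLOCK
                       * FLOPS_PER_FMA
                       * ASSUMED_BOOST_CLOCK_HZ)
peak_flops_per_server = CARDS_PER_SERVER * peak_flops_per_card

print("CUDA cores per server:  ", cuda_cores_per_server)
print("Tensor Cores per server:", tensor_cores_per_server)
print("GPU memory per server:  ", gpu_memory_gb_per_server, "GB")
print(f"Peak per card:   {peak_flops_per_card / 1e12:.1f} TFLOPS")
print(f"Peak per server: {peak_flops_per_server / 1e15:.2f} PFLOPS")
```

This is a theoretical peak; real applications will reach only a fraction of it, depending on how much of their work maps onto Tensor Core matrix operations.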
Tesla V100’s Tensor Cores are programmable matrix-multiply-and-accumulate units.
- Each Tensor Core provides a 4x4x4 matrix processing array which performs the operation D = A * B + C, where A, B, C and D are 4×4 matrices
- Each Tensor Core performs 64 floating point FMA mixed-precision operations per clock (FP16 input multiply with full-precision product and FP32 accumulate)
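To make the operation concrete, the sketch below emulates one D = A * B + C step on 4×4 matrices in plain Python. It is an illustration only, not a model of the actual hardware: the helper names (`to_fp16`, `to_fp32`, `tensor_core_mma`) are hypothetical, and rounding to FP16 and FP32 is approximated with the standard `struct` module.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def to_fp32(x):
    """Round a Python float to the nearest IEEE single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def tensor_core_mma(a, b, c):
    """Return (D, fma_count) for D = A * B + C on 4x4 matrices.

    A and B are rounded to FP16 on input; C and the running sums are
    kept in FP32. This only approximates the hardware behaviour.
    """
    n = 4
    d = [[0.0] * n for _ in range(n)]
    fma_count = 0
    for i in range(n):
        for j in range(n):
            acc = to_fp32(c[i][j])
            for k in range(n):
                # The product of two FP16 values fits exactly in FP32,
                # so precision is only lost when accumulating.
                product = to_fp16(a[i][k]) * to_fp16(b[k][j])
                acc = to_fp32(acc + product)
                fma_count += 1
            d[i][j] = acc
    return d, fma_count


identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
b = [[0.5 * (i + j) for j in range(4)] for i in range(4)]
ones = [[1.0] * 4 for _ in range(4)]

d, fmas = tensor_core_mma(identity, b, ones)
print(fmas)  # 4 * 4 * 4 multiply-accumulate steps
for row in d:
    print(row)
```

The FMA counter shows where the figure of 64 operations per clock comes from: each of the 16 output elements needs 4 multiply-accumulate steps. Values such as 0.1 that are not exactly representable in FP16 are rounded by `to_fp16`, which is the precision trade-off of FP16 inputs.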
Two CUDA libraries that use Tensor Cores are cuBLAS and cuDNN.
cuBLAS uses Tensor Cores to speed up GEMM computations (GEMM is the BLAS term for a matrix-matrix multiplication).
Existing cuBLAS GEMM codes need to be adapted:
- The routine must be a GEMM; currently, only GEMMs support Tensor Core execution.
- The math mode must be set to CUBLAS_TENSOR_OP_MATH. Floating point math is not associative, so the results of the Tensor Core math routines are not quite bit-equivalent to the results of the analogous non-Tensor Core math routines. cuBLAS requires the user to “opt in” to the use of Tensor Cores.
- k, lda, ldb, and ldc must all be multiples of eight; m must be a multiple of four. The Tensor Core math routines stride through input data in steps of eight values, so the dimensions of the matrices must be multiples of eight.
- The input and output data types for the matrices must be either half precision (CUDA_R_16F) or single precision (CUDA_R_32F).
GEMMs that do not satisfy the above rules will fall back to a non-Tensor Core implementation.
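The rules above can be checked before calling cuBLAS. The following is a minimal sketch of such a check in Python; the function `gemm_tensor_core_blockers` is a hypothetical helper and not part of cuBLAS, and the strings are only stand-ins for the corresponding cuBLAS and CUDA enum values.

```python
def gemm_tensor_core_blockers(m, k, lda, ldb, ldc, data_type, math_mode):
    """Return the reasons why a GEMM would fall back to a non-Tensor Core path.

    An empty list means all rules listed above are satisfied.
    Hypothetical helper; the string arguments stand in for cuBLAS enums.
    """
    reasons = []
    if math_mode != "CUBLAS_TENSOR_OP_MATH":
        reasons.append("math mode is not CUBLAS_TENSOR_OP_MATH")
    for name, value in (("k", k), ("lda", lda), ("ldb", ldb), ("ldc", ldc)):
        if value % 8 != 0:
            reasons.append(f"{name}={value} is not a multiple of 8")
    if m % 4 != 0:
        reasons.append(f"m={m} is not a multiple of 4")
    if data_type not in ("CUDA_R_16F", "CUDA_R_32F"):
        reasons.append(f"data type {data_type} is not half or single precision")
    return reasons


ok = gemm_tensor_core_blockers(
    m=1024, k=512, lda=1024, ldb=512, ldc=1024,
    data_type="CUDA_R_16F", math_mode="CUBLAS_TENSOR_OP_MATH")
bad = gemm_tensor_core_blockers(
    m=1022, k=500, lda=1022, ldb=500, ldc=1022,
    data_type="CUDA_R_64F", math_mode="CUBLAS_DEFAULT_MATH")

print(ok)
for reason in bad:
    print("fallback:", reason)
```

Because a GEMM that breaks a rule silently falls back, a check like this can help explain why a code does not show the expected speed-up.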
cuDNN uses Tensor Cores to speed up both convolutions and recurrent neural networks (RNNs).
Compared with common cuDNN use, a few changes are required:
- The convolution algorithm must be ALGO_1 (IMPLICIT_PRECOMP_GEMM for forward). Other convolution algorithms besides ALGO_1 may use Tensor Cores in future cuDNN releases.
- The math type must be set to CUDNN_TENSOR_OP_MATH. As in cuBLAS, the results of the Tensor Core math routines are not quite bit-equivalent to the results of the analogous non-Tensor Core math routines, so cuDNN requires the user to “opt in” to the use of Tensor Cores.
- Both input and output channel dimensions must be a multiple of eight. Again as in cuBLAS, the Tensor Core math routines stride through input data in steps of eight values, so the dimensions of the input data must be multiples of eight.
- The input, filter, and output data types for the convolutions must be half precision.
Convolutions that do not satisfy the above rules will fall back to a non-Tensor Core implementation.
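A common obstacle is a channel count that is not a multiple of eight, for example the 3 channels of an RGB input image. The sketch below checks the rules above and computes the padded channel count; both functions are hypothetical helpers and not part of cuDNN, and the strings are only stand-ins for the corresponding cuDNN enum values.

```python
def round_up_to_multiple(value, multiple=8):
    """Round value up to the next multiple (8 for Tensor Core convolutions)."""
    return ((value + multiple - 1) // multiple) * multiple


def conv_tensor_core_blockers(in_channels, out_channels, data_type,
                              math_type, uses_algo_1):
    """Return the reasons why a convolution would not use Tensor Cores.

    An empty list means all rules listed above are satisfied.
    Hypothetical helper; the string arguments stand in for cuDNN enums.
    """
    reasons = []
    if not uses_algo_1:
        reasons.append("convolution algorithm is not ALGO_1")
    if math_type != "CUDNN_TENSOR_OP_MATH":
        reasons.append("math type is not CUDNN_TENSOR_OP_MATH")
    for name, value in (("input channels", in_channels),
                        ("output channels", out_channels)):
        if value % 8 != 0:
            reasons.append(f"{name}={value} is not a multiple of 8")
    if data_type != "CUDNN_DATA_HALF":
        reasons.append(f"data type {data_type} is not half precision")
    return reasons


rgb_channels = 3
padded_channels = round_up_to_multiple(rgb_channels)

before = conv_tensor_core_blockers(
    rgb_channels, 64, "CUDNN_DATA_HALF", "CUDNN_TENSOR_OP_MATH", True)
after = conv_tensor_core_blockers(
    padded_channels, 64, "CUDNN_DATA_HALF", "CUDNN_TENSOR_OP_MATH", True)

print(padded_channels)
print(before)
print(after)
```

Padding the channel dimension with zeros leaves the convolution result unchanged but costs extra memory, so whether it pays off depends on the layer.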
On Leonhard, Tensor Cores are supported by programs that can use:
- CUDA 9.0.176 and newer versions
- cuDNN 7.0.3 and newer versions
The Python modules python_gpu/2.7.14 and python_gpu/3.6.4 automatically load the cuda/9.0.176 and cudnn/7.0 (7.0.3) modules, which support Tensor Cores. If you would like to use Tensor Cores with another Python installation, please make sure that you load one of the following module combinations:
module load cuda/9.0.176 cudnn/7.0
module load cuda/9.2.88 cudnn/7.3
TensorFlow 1.5 is the first TensorFlow version that supports CUDA 9.0 and cuDNN 7.0, and therefore Tensor Cores. According to the Nvidia documentation, the following frameworks can use Tensor Cores automatically if FP16 storage is enabled: