Neural network training with TensorFlow on GPU

From ScientificComputing

Revision as of 09:55, 15 June 2021

Load modules

We will use the new software stack in this tutorial:

[jarunanp@eu-login-11 ~]$ env2lmod  

Load the Python module that provides the TensorFlow package (version 2.3.0 in python_gpu/3.8.5)

[jarunanp@eu-login-11 ~]$ module load gcc/6.3.0 python_gpu/3.8.5 eth_proxy

The following have been reloaded with a version change:
  1) gcc/4.8.5 => gcc/6.3.0
[jarunanp@eu-login-29 ~]$ module list

Currently Loaded Modules:
  1) StdEnv            4) cuda/11.0.3    7) python_gpu/3.8.5
  2) gcc/6.3.0         5) cudnn/8.0.5    8) eth_proxy
  3) openblas/0.2.20   6) nccl/2.7.8-1


Check that the TensorFlow package can be imported

[jarunanp@eu-login-11 tf_gpu]$ python -c "import tensorflow as tf; print(tf.__version__)"
2.3.0
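
Beyond the version string, it can be useful to see how many GPUs TensorFlow detects. The following is a minimal sketch, not part of the original tutorial; the function name tensorflow_report is hypothetical, and it assumes TensorFlow 2.1 or later, where tf.config.list_physical_devices is available. On a login node it will typically report zero GPUs, since GPUs are only attached to compute nodes.

```python
import importlib.util


def tensorflow_report():
    """Return a short report on the TensorFlow install and visible GPUs."""
    # Avoid a hard failure when the python_gpu module has not been loaded.
    if importlib.util.find_spec("tensorflow") is None:
        return "TensorFlow is not importable; was the python_gpu module loaded?"

    import tensorflow as tf

    # list_physical_devices is available in TensorFlow 2.1 and later.
    gpus = tf.config.list_physical_devices("GPU")
    return f"TensorFlow {tf.__version__}, {len(gpus)} GPU(s) visible"


if __name__ == "__main__":
    print(tensorflow_report())
```

Running the same check again inside the interactive GPU session is a quick way to confirm that the requested GPUs are visible to TensorFlow before starting a long training run.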

A neural network model

Create a working directory on $SCRATCH

[jarunanp@eu-login-11 ~]$ cd $SCRATCH
[jarunanp@eu-login-11 jarunanp]$ mkdir tf_gpu
[jarunanp@eu-login-11 jarunanp]$ cd tf_gpu

Download the script train_mnist_gpu.py, which defines a neural network model and trains it on the MNIST dataset. The example is taken from the TensorFlow tutorials.

[jarunanp@eu-login-11 tf_gpu]$ wget https://scicomp.ethz.ch/public/examples/tensorflow/train_mnist_gpu.py
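
The contents of train_mnist_gpu.py are not reproduced on this page. For orientation, the sketch below shows what a small MNIST training script in the style of the TensorFlow beginner tutorials might look like. Everything in it is an assumption: the layer sizes, the function names build_model and train, the default epochs and batch size, and the use of tf.distribute.MirroredStrategy to spread training over several GPUs. The downloaded script may differ.

```python
# Hypothetical sketch, not the contents of train_mnist_gpu.py.


def build_model():
    """Small dense classifier for 28x28 grayscale digits, 10 output logits."""
    import tensorflow as tf  # imported lazily so this file loads without TensorFlow

    return tf.keras.Sequential([
        tf.keras.Input(shape=(28, 28)),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Dense(10),
    ])


def train(epochs=5, batch_size=64):
    """Train on MNIST, replicating the model across all visible GPUs."""
    import tensorflow as tf

    # Downloads the dataset on first use, so network access is required.
    (x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
    x_train, x_test = x_train / 255.0, x_test / 255.0  # scale pixels to [0, 1]

    # MirroredStrategy is an assumption here; it falls back to a single
    # device when fewer than two GPUs are visible.
    strategy = tf.distribute.MirroredStrategy()
    with strategy.scope():
        # Variables must be created inside the strategy scope.
        model = build_model()
        model.compile(
            optimizer="adam",
            loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
            metrics=["accuracy"],
        )

    model.fit(x_train, y_train, epochs=epochs, batch_size=batch_size)
    return model.evaluate(x_test, y_test, verbose=2)
```

Calling train() from a Python session on the compute node would start the training. The model is built and compiled inside strategy.scope() so that each GPU holds a replica and gradients are combined across replicas at every step.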

Request an interactive session on a compute node with 4 CPU cores and 2 GPUs

[jarunanp@eu-login-11 tf_gpu]$ bsub -n 4 -R "rusage[ngpus_excl_p=2]" -Is bash
Generic job.
Job <175537249> is submitted to queue <gpu.4h>.
<<Waiting for dispatch ...>>
<<Starting on eu-g3-001>>
FILE: /sys/fs/cgroup/cpuset/lsf/euler/job.175537249.83337.1623700766/tasks
[jarunanp@eu-g3-001 tf_gpu]$
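
Once the session has started, it may be worth confirming that two GPUs were actually assigned to the job. The sketch below assumes that the batch system exports the CUDA_VISIBLE_DEVICES environment variable for GPU jobs; this is a common convention but is not shown in the session above, and the helper name assigned_gpu_ids is hypothetical.

```python
import os


def assigned_gpu_ids(environ=os.environ):
    """Return the GPU ids listed in CUDA_VISIBLE_DEVICES, or an empty list."""
    raw = environ.get("CUDA_VISIBLE_DEVICES", "")
    # Split the comma-separated list and drop empty entries.
    return [gpu_id.strip() for gpu_id in raw.split(",") if gpu_id.strip()]


if __name__ == "__main__":
    ids = assigned_gpu_ids()
    print(f"{len(ids)} GPU(s) assigned: {ids}")
```

With the request above (ngpus_excl_p=2), two ids would be expected. An empty list suggests either that the variable is not set on this system or that the shell is not running inside a GPU job.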

Launch the training

[jarunanp@eu-g3-001 tf_gpu]$ python train_mnist_gpu.py
