Using local scratch


Introduction

Compute nodes in our clusters have a local disk that can be used as local scratch. The advantage of local scratch is that I/O operations are carried out inside the compute node and do not go over the network.

When should local scratch be used?

The local scratch is useful if your job produces up to a few hundred gigabytes of files that are needed only for the lifetime of the job. For example, some programs can use local scratch as a cache to avoid repeating expensive calculations.
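As an illustration of the caching idea, here is a minimal sketch. It assumes the job's scratch directory is available in $TMPDIR (described further down this page); the function name cached and the cache file naming are hypothetical and not part of any cluster tool.

```shell
#!/bin/bash
# Hypothetical helper: run a command once per key and keep its output
# in local scratch, so later calls with the same key reuse the result.
# Usage: cached KEY COMMAND [ARGS...]
cached() {
    local key=$1 file
    shift
    file="${TMPDIR:-/tmp}/cache_${key}"
    if [ ! -f "${file}" ]; then
        # Write to a temporary name first, so that a failed run
        # does not leave a broken cache entry behind.
        if ! "$@" > "${file}.part"; then
            rm -f "${file}.part"
            return 1
        fi
        mv "${file}.part" "${file}"
    fi
    cat "${file}"
}
```

Because the scratch directory is removed at the end of the job, such a cache only lives as long as the job itself.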

The local scratch should also be used if a program performs inefficient input/output (I/O), for example when it

  • performs I/O in small chunks (e.g., reading a file several bytes at a time)
  • re-reads the same file multiple times
  • continuously opens and closes files

These operations are especially stressful for networked file systems such as those containing the home and global scratch directories. While it is better to fix a program with such I/O patterns, the local scratch lets such programs run without severely affecting other users or the cluster as a whole.

Using the local scratch

To use the local scratch, you must request space with the --tmp=X option of sbatch or srun, where X is the size in MB/core. For example, to request 1000 MB (about 1 GB) of local scratch, you can submit your job this way:

sbatch --tmp=1000 < my_jobs_script.sh

The batch system will create a directory for your job and set the $TMPDIR environment variable to point to that directory. The directory is automatically cleaned up when your job has finished.
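Inside a job script it can help to confirm that $TMPDIR is usable before writing to it. The following is a minimal sketch; the function names scratch_dir and free_mb are hypothetical helpers, not commands provided by the batch system.

```shell
#!/bin/bash
# Hypothetical helper: print the scratch directory of the job,
# or fail if $TMPDIR is unset, missing, or not writable.
scratch_dir() {
    if [ -n "${TMPDIR:-}" ] && [ -d "${TMPDIR}" ] && [ -w "${TMPDIR}" ]; then
        printf '%s\n' "${TMPDIR}"
    else
        echo "TMPDIR is not set or not writable; was --tmp requested?" >&2
        return 1
    fi
}

# Hypothetical helper: print the free space in MB on the file system
# that holds the given directory (df -P prints one line per file system).
free_mb() {
    df -Pk "$1" | awk 'NR == 2 { print int($4 / 1024) }'
}
```

A job script could then start with something like scratch=$(scratch_dir) || exit 1 and log the result of free_mb "$scratch" before producing large files.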

Using the local scratch with staging

When the local scratch is used to mitigate inefficient I/O, a common approach is to copy the input files to the local scratch, run the calculation there, and then copy all new or changed files back to the original location. A simplified version of this workflow is listed below:

#!/bin/bash
# Copy files from the current directory to local scratch
rsync -aq ./ "${TMPDIR}/"
# Run commands; stop if the scratch directory cannot be entered
cd "${TMPDIR}" || exit 1
do_my_calculation
# Copy new and changed files back.
# Slurm saves the path of the directory from which the job was submitted in $SLURM_SUBMIT_DIR.
# LSF saves the path of the directory from which the job was submitted in $LS_SUBCWD.
rsync -auq "${TMPDIR}/" "${SLURM_SUBMIT_DIR}/"

A more complex script is found in /cluster/apps/local/runintemp.sh. It will also trigger a copy of files from the local scratch to the original directory if the job is about to run out of time. Run the script without options for usage instructions.
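The contents of runintemp.sh are not reproduced here. The following is an independent, minimal sketch of the same idea: copy the results back even when the job is interrupted. The function name run_staged is hypothetical, and the sketch assumes that the batch system sends SIGUSR1 or SIGTERM to the job script shortly before the time limit (with Slurm this can be requested with, for example, the sbatch option --signal=B:USR1@120).

```shell
#!/bin/bash
# Hypothetical helper for illustration; this is not the cluster's runintemp.sh.
# Usage: run_staged SRC_DIR SCRATCH_DIR COMMAND [ARGS...]
run_staged() {
    local src=$1 scratch=$2 pid status
    shift 2

    # Stage in: copy the input files to local scratch.
    rsync -aq "${src}/" "${scratch}/" || return 1

    # Run the command in the background inside the scratch directory, so
    # that the shell can react to signals while the command is running.
    ( cd "${scratch}" && exec "$@" ) &
    pid=$!

    # If the batch system signals the job, stop the command early.
    trap "kill ${pid} 2>/dev/null" USR1 TERM

    # wait returns early when a trapped signal arrives, so keep waiting
    # until the command has really terminated.
    wait "${pid}"
    status=$?
    while kill -0 "${pid}" 2>/dev/null; do
        wait "${pid}"
        status=$?
    done
    trap - USR1 TERM

    # Stage out: copy new and changed files back, even after a failure.
    rsync -auq "${scratch}/" "${src}/" || return 1
    return "${status}"
}
```

In a job script this could be called as run_staged "$SLURM_SUBMIT_DIR" "$TMPDIR" do_my_calculation. The warning time requested from the batch system must be long enough for the final rsync to finish.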