Too much space is used by your output files

The information on this wiki page refers to a buffer for stdout that is used by the LSF batch system. Because of the transition from LSF to Slurm, this page will soon become obsolete.

Introduction

On our clusters, data written to stdout/stderr are buffered in a shadow file system with a small quota of 2 GB per user (for technical reasons, no user quota is set for the shadow file system on ETHSec). If this quota were reached, all jobs would crash. We have therefore recently modified the batch system to detect this condition and preemptively reject new jobs until the data stored by these jobs in the shadow file system have been removed.

Error message

Users then receive the following error message:

 Too much space is used by your output files
 in the LSF batch system's temporary directory.

You cannot clean up your files in the shadow file system yourself. If you receive this error message, then please contact cluster support.

How to solve this problem

Writing this much data to stdout or stderr not only fills up the shadow file system but also slows down your jobs. You should therefore:

  1. Kill all jobs to prevent further problems
  2. Modify the program to NOT write all the output to stdout
  3. Resubmit all jobs

Modifying the program to NOT write all of this output to stdout might not be possible in all cases. In such cases you can redirect the program's stdout to a file using a command like:

bsub [LSF options] "program [arguments] > program.out"

(Note that the quotes above are necessary; otherwise the redirection operator would apply to bsub instead of the program.)

If the information written to stdout and stderr is not needed at all, you can also redirect it to the virtual device /dev/null, so that nothing is stored.

bsub [LSF options] "program [arguments] &> /dev/null"

Please find below some documentation about the most common redirection operators.

Redirection operator   Description
>                      redirect stdout
2>                     redirect stderr
&>                     redirect stdout and stderr
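
To see what each operator captures, the following minimal sketch can be run in any POSIX shell, without LSF. The `program` function is a hypothetical stand-in for a real executable, and the file names are only illustrative. Note that `&>` is a bash extension; `> file 2>&1` is the portable equivalent and is used here.

```shell
# Hypothetical stand-in for "program": one line to stdout, one line to stderr.
program() {
    echo "result line"
    echo "warning line" >&2
}

# Scratch directory so the demo does not clutter the current directory.
workdir=$(mktemp -d)

# stdout and stderr into separate files
program > "$workdir/program.out" 2> "$workdir/program.err"

# both streams into one file (portable form of "&> file")
program > "$workdir/program.all" 2>&1

# discard both streams entirely
program > /dev/null 2>&1

echo "--- program.out ---"; cat "$workdir/program.out"
echo "--- program.err ---"; cat "$workdir/program.err"
echo "--- program.all ---"; cat "$workdir/program.all"
```

In a real submission the same operators go inside the quoted command passed to bsub, as in the examples above.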

In a job array, all individual jobs write their output to the same file, which may not be desirable. This can be avoided by using the run-time variable $LSB_JOBINDEX, e.g.:

bsub [LSF options] "program [arguments] 2> program_\$LSB_JOBINDEX.error"
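
The backslash before `$LSB_JOBINDEX` matters: inside double quotes, an unescaped variable is expanded by the submitting shell, where it is normally unset, rather than inside the job. The sketch below imitates this without LSF. The helper `run_as_array_element` is a hypothetical stand-in for the batch system: it merely runs the quoted command in a child shell with `LSB_JOBINDEX` set.

```shell
# Hypothetical stand-in for the batch system: run a command string in a
# child shell whose environment contains LSB_JOBINDEX=<index>.
run_as_array_element() {
    index=$1
    shift
    LSB_JOBINDEX=$index sh -c "$1"
}

# On the submission host the variable is normally not set.
unset LSB_JOBINDEX

# Escaped: the literal text $LSB_JOBINDEX reaches the job and is expanded there.
escaped=$(run_as_array_element 3 "echo program_\$LSB_JOBINDEX.error")

# Unescaped: the submitting shell expands it (to nothing) before the job runs.
unescaped=$(run_as_array_element 3 "echo program_$LSB_JOBINDEX.error")

echo "escaped:   $escaped"     # program_3.error
echo "unescaped: $unescaped"   # program_.error
```

With the unescaped form, every element of the array would write to the same file, which is exactly the situation the variable is meant to avoid.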

Be careful though: writing a lot of data to stdout or stderr is always a BAD IDEA because it slows down the program and overloads the cluster's file system. The "shadow" file system and its 2 GB quota are a protection against misbehaving jobs; you are bypassing them at your own risk. The BEST solution is to modify the program to reduce or eliminate this unnecessary I/O.
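
If the program itself cannot be changed, one possible middle ground (an assumption of this sketch, not a procedure described on this page) is to filter the output inside the job so that only the part you need is kept. The example uses a hypothetical `noisy_program` function in place of a real executable and keeps only its last lines with the standard `tail` utility.

```shell
# Hypothetical stand-in for a chatty program: 1000 progress lines, then a result.
noisy_program() {
    i=1
    while [ "$i" -le 1000 ]; do
        echo "step $i"
        i=$((i + 1))
    done
    echo "RESULT 42"
}

# Keep only the last 3 lines instead of all 1001.
noisy_program | tail -n 3

# Inside a job, the pipe would go within the quoted command, e.g.:
#   bsub [LSF options] "program [arguments] | tail -n 100"
```

The program still spends time producing the output, so this only limits what is stored; reducing the output at the source remains preferable.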