Too much space is used by your output files

From ScientificComputing
Revision as of 10:09, 23 February 2017

Introduction

On our clusters, data written to stdout/stderr are buffered in a shadow file system with a small quota of 2 GB per user. If this quota were reached, all jobs would crash. We have therefore recently modified the batch system to detect this condition and preemptively reject new jobs until the data already stored in the shadow file system have been removed.

Error message

Users then receive the following error message:

 Too much space is used by your output files
 in the LSF batch system's temporary directory.

You cannot clean up your files in the shadow file system yourself. If you receive this error message, then please contact cluster support.

How to solve this problem

Writing so much data to stdout or stderr not only fills up the shadow file system but also slows down your jobs. You should therefore:

  1. Kill all jobs to prevent further problems
  2. Modify the program to NOT write all the output to stdout
  3. Resubmit all jobs

Modifying the program so that it does NOT write all of this output to stdout might not be possible in every case. In such cases, you can redirect the program's stdout/stderr to a file using a command like:

bsub [LSF options] "program [arguments] > program.error"

(Note that the quotes above are necessary; otherwise the redirection operator would apply to bsub instead of the program.)
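
The effect of the quotes can be checked without submitting anything. The following is a minimal sketch, not the cluster's actual behaviour: `fake_bsub` is a hypothetical shell function standing in for bsub, and it only prints the command string it receives. The file name `program.out` and the temporary directory are likewise illustrative assumptions.

```shell
# fake_bsub is a hypothetical stand-in for bsub: it only reports the
# command string it received and does not submit anything.
fake_bsub() {
    echo "job command: $*"
}

tmpdir=$(mktemp -d)

# Quoted: the redirection is part of the string handed over as the job command.
quoted=$(fake_bsub "program arg > program.out")

# Unquoted: the submitting shell applies the redirection itself, so the
# output of fake_bsub (not of the program) ends up in the file.
fake_bsub program arg > "$tmpdir/program.out"
unquoted=$(cat "$tmpdir/program.out")

echo "quoted:   $quoted"
echo "unquoted: $unquoted"
```

In the quoted case the redirection stays inside the job command; in the unquoted case it has been consumed by the submitting shell and the job command no longer contains it.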

Redirection operator    Description
>                       redirect stdout
2>                      redirect stderr
&>                      redirect stdout and stderr
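
The operators listed above can be tried locally before using them in a job. This is a small sketch under stated assumptions: `demo_program` is a hypothetical function standing in for a real program, and the file names are made up for illustration. Because `&>` is a bash extension, the sketch uses the portable spelling `> file 2>&1`, which has the same effect.

```shell
# demo_program is a hypothetical stand-in for a real program: it writes
# one line to stdout and one line to stderr.
demo_program() {
    echo "result"
    echo "warning" >&2
}

outdir=$(mktemp -d)

# ">" captures stdout only (stderr is discarded here to keep the demo quiet).
demo_program > "$outdir/stdout.txt" 2> /dev/null

# "2>" captures stderr only (stdout is discarded here).
demo_program 2> "$outdir/stderr.txt" > /dev/null

# "> file 2>&1" captures both streams; it is the portable form of bash's "&>".
demo_program > "$outdir/both.txt" 2>&1

only_out=$(cat "$outdir/stdout.txt")
only_err=$(cat "$outdir/stderr.txt")
both_lines=$(wc -l < "$outdir/both.txt" | tr -d ' ')

echo "stdout file: $only_out"
echo "stderr file: $only_err"
echo "lines in combined file: $both_lines"
```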

In a job array, all individual jobs will write their output to the same file, which may not be desirable. This can be avoided by using the run-time $LSB_JOBINDEX variable in the file name, e.g.:

bsub [LSF options] "program [arguments] 2> program_\$LSB_JOBINDEX.error"
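
The backslash in front of $LSB_JOBINDEX matters: inside double quotes it stops the submitting shell from expanding the variable, so the expansion is left to the job. The sketch below illustrates this locally and rests on two assumptions: the variable is empty in the submitting shell, and a simple loop plays the role of the batch system by setting LSB_JOBINDEX for three array elements.

```shell
# Assumption: the variable is empty in the submitting shell.
LSB_JOBINDEX=""

# Escaped: the dollar sign survives, so expansion is deferred to the job.
escaped="program_\$LSB_JOBINDEX.error"
# Unescaped: the submitting shell expands the (empty) variable immediately.
unescaped="program_$LSB_JOBINDEX.error"

echo "escaped at submission:   $escaped"
echo "unescaped at submission: $unescaped"

# Simulate three array elements: here the loop, not the batch system,
# sets LSB_JOBINDEX before the command string is evaluated by a new shell.
names=""
for LSB_JOBINDEX in 1 2 3; do
    export LSB_JOBINDEX
    names="$names $(sh -c "echo $escaped")"
done
echo "file names at run time:$names"
```

Without the backslash, every array element would write to the same file, `program_.error`, which is exactly the collision the per-index file name is meant to avoid.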

Be careful though: writing a lot of data to stdout or stderr is always a BAD IDEA because it slows down the program and overloads the cluster's file system. The "shadow" file system and its 2 GB quota are a protection against misbehaving jobs; you are bypassing them at your own risk. The BEST solution is to modify the program to reduce or eliminate this unnecessary I/O.