Difference between revisions of "Too much space is used by your output files"

From ScientificComputing

Revision as of 10:08, 23 February 2017

Introduction

On our clusters, data written to stdout/stderr are buffered in a shadow file system with a small quota of 2 GB per user. If this quota were reached, all jobs would crash. We have therefore recently modified the batch system to detect this condition and preemptively reject new jobs until the data that your jobs have stored in the shadow file system have been removed.

Error message

Users then receive the following error message:

 Too much space is used by your output files
 in the LSF batch system's temporary directory.

You cannot clean up your files in the shadow file system yourself. If you receive this error message, please contact cluster support (cluster-support@id.ethz.ch).

How to solve this problem

Writing so much data to stdout or stderr not only fills up the shadow file system but also slows down your jobs. You should therefore:

  1. Kill all jobs to prevent further problems
  2. Modify the program to NOT write all the output to stdout
  3. Resubmit all jobs
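
Step 1 can be sketched as follows. In LSF, passing job ID 0 to bkill selects all of your own jobs. The `run` wrapper and the `DRY_RUN` switch are hypothetical safety aids added for this illustration, not part of LSF; by default the sketch only prints what it would do.

```shell
#!/bin/sh
# DRY_RUN is a hypothetical safety switch for this sketch (not an LSF setting).
# Leave it at 1 to only print the command; set it to 0 to really execute it.
DRY_RUN=${DRY_RUN:-1}

# Print the command in dry-run mode, otherwise execute it unchanged.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Job ID 0 tells bkill to act on all jobs owned by the calling user.
run bkill 0
```

Once the program has been modified (step 2), the jobs can be resubmitted with the same bsub command lines as before.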

Modifying the program to NOT write all this output to stdout might not be possible in all cases. In such cases, you can redirect the program's stdout or stderr to a file using a command like:

bsub [LSF options] "program [arguments] > program.error"

 Redirection operator   Description
 >                      redirect stdout
 2>                     redirect stderr
 &>                     redirect stdout and stderr

(Note that the quotes above are necessary; otherwise the redirection operator would apply to bsub instead of the program.)
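
The effect of the quotes can be reproduced without a cluster. The sketch below uses a hypothetical stand-in, `fake_bsub`, which prints a submission message on its own stdout and then runs its arguments through `sh -c`. Real bsub differs in many respects, so this only illustrates which process the redirection is applied to. The `noisy.sh` script is likewise made up for the example.

```shell
#!/bin/sh
# Work in a scratch directory so the example leaves no files behind elsewhere.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical noisy program: one line on stdout, one line on stderr.
cat > noisy.sh <<'EOF'
echo "result"
echo "warning" >&2
EOF

# Hypothetical stand-in for bsub (NOT the real LSF command): it reports the
# submission on its own stdout, then runs the command string in a new shell.
fake_bsub() {
    echo "Job is submitted."
    sh -c "$*"
}

# Quoted: the redirections travel inside the command string, so they are
# applied to the program, and the submission message stays separate.
fake_bsub "sh noisy.sh > quoted.out 2> quoted.err" > submit.log

# Unquoted: the calling shell applies the redirections to fake_bsub itself,
# so the submission message ends up mixed into the program's output file.
fake_bsub sh noisy.sh > unquoted.out 2> unquoted.err

# Both streams into one file. "> file 2>&1" is the portable spelling of
# the bash shorthand "&> file".
fake_bsub "sh noisy.sh > both.log 2>&1" > /dev/null
```

After running this, `quoted.out` holds only the program's output, whereas `unquoted.out` also starts with the submission message.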

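In a job array, all individual jobs will write their output into the same file, which may not be desirable. This can be avoided by using the run-time variable $LSB_JOBINDEX in the file name. The backslash keeps the submitting shell from expanding the variable, so that it is evaluated only when each job runs, e.g.: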

bsub [LSF options] "program [arguments] 2> program_\$LSB_JOBINDEX.error"
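
Why the backslash matters can be checked without LSF. In this sketch, a loop stands in for the batch system and sets `LSB_JOBINDEX` for each element of a three-element array. `fake_bsub_array` and `errprog.sh` are hypothetical names invented for the illustration, and the sketch assumes that the variable is not set in the submitting shell.

```shell
#!/bin/sh
# Work in a scratch directory so the example leaves no files behind elsewhere.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical program that reports its array index on stderr.
cat > errprog.sh <<'EOF'
echo "error from element $LSB_JOBINDEX" >&2
EOF

# Hypothetical stand-in for LSF running a 3-element job array: every element
# runs the same command string with its own LSB_JOBINDEX in the environment.
fake_bsub_array() {
    for i in 1 2 3; do
        LSB_JOBINDEX=$i sh -c "$1"
    done
}

# Assumption: the variable is not set in the shell that submits the jobs.
unset LSB_JOBINDEX

# Escaped: the command string contains a literal $LSB_JOBINDEX, which each
# element expands at run time, giving program_1.error ... program_3.error.
fake_bsub_array "sh errprog.sh 2> program_\$LSB_JOBINDEX.error"

# Not escaped: the submitting shell expands the unset variable to nothing,
# so every element overwrites the same file, program_.error.
fake_bsub_array "sh errprog.sh 2> program_$LSB_JOBINDEX.error"
```

The escaped form produces one file per array element; the unescaped form leaves a single `program_.error` containing only the last element's output.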

Be careful though: writing a lot of data to stdout or stderr is always a BAD IDEA because it slows down the program and overloads the cluster's file system. The "shadow" file system and its 2 GB quota are a protection against misbehaving jobs; you are bypassing them at your own risk. The BEST solution is to modify the program to reduce or eliminate this unnecessary I/O.
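
One common way to reduce such output, if the program is a script you control, is to make diagnostic messages opt-in. The following is only a minimal sketch: `VERBOSE`, `debug` and `process` are hypothetical names, and the same idea would have to be expressed in whatever language your program is actually written in.

```shell
#!/bin/sh
# VERBOSE is a hypothetical switch for this sketch: 0 (the default) keeps
# the job quiet, 1 enables per-step diagnostics for debugging runs.
VERBOSE=${VERBOSE:-0}

# Emit a diagnostic line on stderr only when verbose mode is enabled.
debug() {
    if [ "$VERBOSE" -ge 1 ]; then
        echo "debug: $*" >&2
    fi
}

# Stand-in for the real work: three steps, one summary line at the end.
process() {
    for step in 1 2 3; do
        debug "finished step $step"
    done
    echo "done: 3 steps"
}

process
```

With the default setting, the job writes a single summary line instead of one line per step.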