Too much space is used by your output files
On our clusters, data written to stdout/stderr is buffered in a shadow file system with a small quota of 2 GB per user. If this quota were reached, all jobs would crash. We have therefore recently modified the batch system to detect this condition and preemptively reject new jobs until the data stored by your jobs in the shadow file system has been removed.
Users then receive the following error message:
Too much space is used by your output files in the LSF batch system's temporary directory.
You cannot clean up your files in the shadow file system yourself. If you receive this error message, then please contact cluster support.
How to solve this problem
Writing so much data to stdout or stderr not only fills up the shadow file system but also slows down your jobs. You should therefore:
- Kill all jobs to prevent further problems
- Modify the program to NOT write all the output to stdout
- Resubmit all jobs
Modifying the program so that it does NOT write all this output to stdout might not be possible in all cases. In such cases you can redirect the program's stdout or stderr to a file (using > or 2>, respectively) with a command like:
bsub [LSF options] "program [arguments] 2> program.error"
(Note that the quotes above are necessary; otherwise the redirection operator would apply to bsub instead of the program.)
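The effect of the quotes can be tried out locally, without LSF. The following is only a sketch: fake_bsub is a hypothetical stand-in for bsub that prints a submission message and runs its arguments with sh -c, sending the job's stdout/stderr to a file named shadow.log that plays the role of the shadow file system. Both names, and the small test script called program, are assumptions made for this illustration and say nothing about how LSF works internally.

```shell
#!/bin/sh
# Hypothetical stand-in for bsub, for local experiments only.
fake_bsub() {
    echo "Job is submitted"          # message printed by the submitter itself
    sh -c "$*" >> shadow.log 2>&1    # the job's stdout/stderr land in the stand-in buffer
}

workdir=$(mktemp -d)
cd "$workdir" || exit 1

# A tiny test program that writes one line to stderr.
cat > program <<'EOF'
echo "diagnostic message" >&2
EOF

# Quoted: the redirection is part of the command string, so it applies to the program.
fake_bsub "sh ./program 2> quoted.error" > submit.log

# Unquoted: the submitting shell applies 2> to fake_bsub itself,
# so the program's stderr still ends up in the stand-in buffer.
fake_bsub sh ./program 2> unquoted.error > submit.log

echo "quoted.error:   $(cat quoted.error)"
echo "unquoted.error: $(cat unquoted.error)"
echo "shadow.log:     $(cat shadow.log)"
```

In this sketch only the quoted form keeps the program's message out of shadow.log; the unquoted form leaves unquoted.error empty.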
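In a job array, all individual jobs would write their output to the same file, which is usually not desirable. This can be avoided by using the run-time variable $LSB_JOBINDEX, which holds the index of each job within the array. The backslash keeps the submitting shell from expanding the variable before the job runs, e.g.: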
bsub [LSF options] "program [arguments] 2> program_\$LSB_JOBINDEX.error"
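Why the backslash matters can also be checked locally. The sketch below does not call bsub at all; it only builds the command strings and uses sh -c as an assumed stand-in for the shell that later runs each job, with LSB_JOBINDEX set by hand to mimic three array elements.

```shell
#!/bin/sh
# In the submitting shell the variable is normally not set; LSF defines it inside the job.
unset LSB_JOBINDEX

# Without the backslash the submitting shell expands the variable at once (to nothing).
unescaped="program [arguments] 2> program_$LSB_JOBINDEX.error"

# With the backslash the literal text $LSB_JOBINDEX survives in the command string.
escaped="program [arguments] 2> program_\$LSB_JOBINDEX.error"

echo "unescaped: $unescaped"
echo "escaped:   $escaped"

# Simulate three array elements; the variable is expanded only when the job's shell runs.
names=""
for i in 1 2 3; do
    name=$(LSB_JOBINDEX=$i sh -c "echo program_\$LSB_JOBINDEX.error")
    names="$names $name"
done
echo "file names at run time:$names"
```

The unescaped string has already lost the variable before submission, so every array element would share one file; the escaped string yields a distinct file name per index.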
Be careful though: writing a lot of data to stdout or stderr is always a BAD IDEA because it slows down the program and overloads the cluster's file system. The "shadow" file system and its 2 GB quota are a protection against misbehaving jobs; you are bypassing them at your own risk. The BEST solution is to modify the program to reduce or eliminate this unnecessary I/O.
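To judge whether a program needs to be changed, it can help to measure how much it writes during a short test run and extrapolate to the full run. The following is a rough sketch under stated assumptions: noisy_program is a hypothetical stand-in for your own program, and the planned number of iterations (100 million) is an invented figure used only for the arithmetic.

```shell
#!/bin/sh
# Hypothetical stand-in for a chatty program: one progress line per iteration.
noisy_program() {
    i=1
    while [ "$i" -le 1000 ]; do
        echo "iteration $i: still running"
        i=$((i + 1))
    done
}

# Count the bytes written to stdout and stderr together during the short test run.
bytes=$(noisy_program 2>&1 | wc -c)
bytes=$((bytes))                     # drop any padding added by wc
echo "test run (1000 iterations) wrote $bytes bytes"

# Extrapolate: bytes per iteration times millions of iterations gives megabytes.
per_iteration=$((bytes / 1000))
planned_millions=100                 # assumed length of the real run, in millions of iterations
estimate_mb=$((per_iteration * planned_millions))
echo "estimated output of the full run: about $estimate_mb MB"
```

If the estimate comes anywhere near the 2 GB quota, reducing the program's output is the safer choice than redirecting it.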