Best practices on Lustre parallel file systems

From ScientificComputing
Revision as of 13:29, 12 February 2019

Introduction

On the Euler and the Leonhard cluster, the global scratch and work directories

/cluster/scratch/$USER
/cluster/work/

are hosted on Lustre file systems. They are optimized especially for parallel I/O and large files. These file systems are shared among many users. If you are

  • working with a large number of small files
  • running thousands of unnecessary I/O operations per second (for example, opening and closing a file inside a loop)
  • accessing the same file with hundreds of processes

then this will not only slow down your jobs; it can also overload the entire file system, affecting all users. Therefore, please read our best practices guide carefully before using /cluster/work or /cluster/scratch.

Troubleshooting

If you experience lag on /cluster/scratch or /cluster/work, then please consider the following recommendations:

  • If you need to edit small text files, then please copy them to your home directory, as the home directories are not affected by the lag
  • If you need to run computations that access data sets on Lustre, then please try to use local scratch whenever possible, so that the data in your personal scratch directory only needs to be accessed once, when copying it to local scratch
  • https://scicomp.ethz.ch/wiki/Using_local_scratch

Please try to avoid sourcing scripts hosted in /cluster/work or /cluster/scratch in your .bashrc or .bash_profile, as this will cause your login to lag as well.

Lustre architecture

Lustre is a parallel distributed file system. Files are distributed across multiple servers, and then striped across multiple disks.

A Lustre file system has three major functional units:

  • Metadata servers (MDS) that store namespace metadata, such as filenames, directories, access permissions, and file layout.
  • Object storage server (OSS) nodes that store file data on one or more object storage target (OST) devices.
  • Client(s) that access and use the data.

When a client accesses a file, it performs a filename lookup on the MDS. When the MDS filename lookup is complete and the user and client have permission to access and/or create the file, then either the layout of an existing file is returned or a new file is created.

For read or write operations, the client then interprets the file layout, which maps the file logical offset and size to one or more objects, each residing on a separate OST. The client then locks the file range being operated on and executes one or more parallel read or write operations directly to the OSS nodes.

After the initial lookup of the file layout, the MDS is not normally involved in file IO operations since all block allocation and data IO is managed internally by the OST. Clients do not directly modify the objects or data on the OST filesystems, but instead delegate this task to OSS nodes.
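The mapping from a logical file offset to objects described above can be illustrated with a small sketch. It assumes plain round-robin (RAID-0 style) striping; the function name locate_stripe and the default values (1 MiB stripe size, 4 objects) are chosen for illustration only and do not describe the configuration of a particular file system.

```python
def locate_stripe(offset, stripe_size=1048576, stripe_count=4):
    """Map a logical file offset to (object index, offset inside that object).

    Assumes plain round-robin (RAID-0 style) striping; the default values
    are illustrative, not the settings of any particular file system.
    """
    stripe_number = offset // stripe_size        # which stripe-sized chunk of the file
    object_index = stripe_number % stripe_count  # which object (and thus OST) holds it
    # Each object receives every stripe_count-th chunk, stored back to back.
    object_offset = (stripe_number // stripe_count) * stripe_size + offset % stripe_size
    return object_index, object_offset


# The first four 1 MiB chunks land on four different objects,
# the fifth chunk wraps around to the first object again.
for chunk in range(5):
    print(chunk, locate_stripe(chunk * 1048576))
```

With more than one object per file, consecutive chunks can be read or written on different OSTs in parallel, which is why large files benefit most from striping.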

Best practices

Avoid unnecessary I/O operations

In many programs, there are options to control I/O and make the output more or less verbose. In general, I/O operations slow down your computation, because the CPU is waiting and doing nothing while they complete. Therefore, only do I/O if it is required and adds value to your computation, and avoid unnecessary I/O operations whenever possible.

Limit repetitive Open/Close operations

If you need to write a lot of values into a file as part of a loop, then there are multiple ways of achieving this task. Please make sure that you never put the open and close statements inside the loop, as is done in this Python example:

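some_data = 'value\n'  # placeholder payload

# Anti-pattern: the file is opened and closed in every iteration
for i in range(1000):
    f = open('test2.txt', 'a')
    f.write(some_data)
    f.close()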

As a result, the same file is opened and closed 1000 times, for a total of 2000 open/close operations, 1998 of which are unnecessary. It is sufficient to open the file once, write all values to it and close it at the end, resulting in only 2 open/close operations:

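some_data = 'value\n'  # placeholder payload

# Open the file once; the with block closes it after the loop
with open('test1.txt', 'w') as f:
    for i in range(1000):
        f.write(some_data)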

Limit repetitive "stat" operations

If you are running code that needs to wait until a certain file is created, it is sufficient to check for the file every few seconds. Checking every few milliseconds can create a lot of unnecessary file stat calls.
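A minimal sketch of such a polling loop in Python is shown here. The function name wait_for_file and the default values for interval and timeout are illustrative assumptions; choose values that fit your workflow.

```python
import os
import time


def wait_for_file(path, interval=5.0, timeout=3600.0):
    """Return True once path exists, or False if timeout seconds pass first."""
    deadline = time.monotonic() + timeout
    while True:
        if os.path.exists(path):   # one stat call per iteration
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)       # pause instead of busy-waiting
```

With an interval of 5 seconds, this loop issues at most 12 stat calls per minute, compared to many thousands for a loop without a sleep.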

Directory listings: ls vs. ls -l

If you run the ls command for listing a file or a directory, then it will query the MDS for this information. But when running the command with the -l option, it will also need to access the OSS to look up the file size, which creates additional load on the storage system.

  • Use ls if you would like to list files and directories
  • Only use ls -l if you also need to know about the file size
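The same distinction applies inside your own programs. The following Python sketch contrasts a listing that only reads entry names with one that also requests the size of each file. The function names count_entries and total_size are made up for this illustration.

```python
import os


def count_entries(directory):
    """Count directory entries using only their names (no size lookup)."""
    with os.scandir(directory) as entries:
        return sum(1 for _ in entries)


def total_size(directory):
    """Sum file sizes; this needs a stat call per file, similar to ls -l."""
    with os.scandir(directory) as entries:
        return sum(entry.stat().st_size for entry in entries if entry.is_file())
```

If a script only needs to know which files exist, prefer the first variant and request sizes only for the files that actually need them.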

Use subdirectories instead of storing all files in a single directory

When a file is accessed, Lustre puts a lock on the parent directory. If many files are opened in the same directory, then this will cause contention. To minimize contention, distribute your files into a subdirectory structure. This way your files are also organized and easier to handle.

If you accidentally run ls -l instead of just ls, then it also makes a difference whether the directory contains 20 or 100'000 files.
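One possible way to spread files over subdirectories is to derive the subdirectory name from a hash of the file name. The sketch below is only an illustration of this idea; the helper names sharded_path and open_sharded, as well as the choice of SHA-256 and two-character directory names, are assumptions and not part of any cluster tooling.

```python
import hashlib
import os


def sharded_path(base_dir, filename, levels=1, width=2):
    """Return a path for filename inside a hash-based subdirectory of base_dir."""
    digest = hashlib.sha256(filename.encode("utf-8")).hexdigest()
    # width=2 gives up to 256 subdirectories per level ("00" .. "ff")
    parts = [digest[i * width:(i + 1) * width] for i in range(levels)]
    return os.path.join(base_dir, *parts, filename)


def open_sharded(base_dir, filename, mode="w"):
    """Create the subdirectory if needed and open the file inside it."""
    path = sharded_path(base_dir, filename)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    return open(path, mode)
```

Because the subdirectory is computed from the file name, the same function can be used later to find the file again without searching the directory tree.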

Use local scratch instead of doing computations directly on Lustre

If your data set fits into local scratch (storage inside the compute node), then try to use local scratch instead of Lustre.

https://scicomp.ethz.ch/wiki/Using_local_scratch

A typical workflow could be to copy your files from Lustre to local scratch at the beginning of a job, then process the files and copy back the results of the job from local scratch to Lustre.
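This workflow could be sketched in Python as follows. The sketch assumes that $TMPDIR points to local scratch inside a job and that Python 3.8 or newer is available; the function name run_on_local_scratch and the process callback are hypothetical placeholders for your own computation.

```python
import os
import shutil
import tempfile


def run_on_local_scratch(input_dir, output_dir, process):
    """Stage input_dir to local scratch, run process(), copy results back."""
    # $TMPDIR is assumed to point to local scratch inside a job;
    # fall back to the system temp directory elsewhere.
    scratch_root = os.environ.get("TMPDIR", tempfile.gettempdir())
    work_dir = tempfile.mkdtemp(dir=scratch_root)
    try:
        local_in = os.path.join(work_dir, "input")
        local_out = os.path.join(work_dir, "output")
        shutil.copytree(input_dir, local_in)   # read the input from Lustre once
        os.makedirs(local_out)
        process(local_in, local_out)           # all further I/O stays on the node
        # Write the results back to Lustre in one pass (Python 3.8+)
        shutil.copytree(local_out, output_dir, dirs_exist_ok=True)
    finally:
        shutil.rmtree(work_dir, ignore_errors=True)
```

The process callback receives the local input and output directories, so the computation itself never touches Lustre.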

Use other storage locations for small files

The Lustre file system is the worst place to store a lot of small files. Other file systems like $HOME or local scratch ($TMPDIR, only on compute nodes) are much better suited to dealing with small files. If you have to store a lot of small files on Lustre, then please at least pack them into a single tar archive. To process those files, untar the archive to local scratch at the beginning of the job, process the files on the compute node, and at the end of the job tar up the results and copy the archive back to Lustre.
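Packing and unpacking can also be done from within a Python job script using the standard tarfile module. The following is a minimal sketch; the function names pack and unpack are illustrative, and the archive is left uncompressed to keep the example simple.

```python
import os
import tarfile


def pack(source_dir, archive_path):
    """Bundle all files below source_dir into a single tar archive."""
    name = os.path.basename(os.path.normpath(source_dir))
    with tarfile.open(archive_path, "w") as tar:
        tar.add(source_dir, arcname=name)


def unpack(archive_path, target_dir):
    """Extract an archive created by pack() into target_dir."""
    os.makedirs(target_dir, exist_ok=True)
    # Only extract archives that you created yourself.
    with tarfile.open(archive_path, "r") as tar:
        tar.extractall(target_dir)
```

In a job, unpack would typically be called with a target directory on local scratch, and pack with an archive path on Lustre once the results are complete.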

Limit the Number of Processes Performing Parallel I/O

Avoid Having Multiple Processes Open the Same File(s) at the Same Time

Working with stripes (advanced users)

Lustre will always try to distribute your data across all OSTs. The striping parameters can be tuned per file or directory.

How to display the current striping settings

The default stripe setting of a file or directory can be shown with the command lfs getstripe:

[sfux@eu-login-24-ng ~]$ lfs getstripe $SCRATCH/__USAGE_RULES__ 
/cluster/scratch/sfux/__USAGE_RULES__
lmm_stripe_count:   1
lmm_stripe_size:    1048576
lmm_pattern:        1
lmm_layout_gen:     0
lmm_stripe_offset:  3
        obdidx           objid           objid           group
             3          619261        0x972fd                0 

[sfux@eu-login-24-ng ~]$

For directories, use the -d option:

[sfux@eu-login-24-ng ~]$ lfs getstripe -d $SCRATCH
stripe_count:   1 stripe_size:    1048576 stripe_offset:  -1
[sfux@eu-login-24-ng ~]$ 
  • stripe_count = 1 : Each file is stored on a single OST (a value of -1 would spread the data across all OSTs)
  • stripe_size = 1048576 : Use 1 MiB stripe/chunk size
  • stripe_offset = -1: Let Lustre choose the next OST (you shouldn't change this)

How to change stripe settings

The stripe setting of a directory can be changed with the command lfs setstripe.

Note!

  • You cannot change the striping of existing files
  • You can always change the striping parameters of an existing directory
  • It is possible to create files with non-default striping parameters with the lfs command
  • A subdirectory inherits all stripe parameters from its parent directory (if not changed via lfs setstripe)
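As an illustration, the stripe count of a directory can be set with the -c option of lfs setstripe. The Python sketch below only wraps this command; the function names are made up for this example, and the call is skipped when the lfs tool is not found, for instance on a machine that is not a Lustre client.

```python
import shutil
import subprocess


def setstripe_command(directory, stripe_count):
    """Build the lfs setstripe command line for a directory."""
    return ["lfs", "setstripe", "-c", str(stripe_count), directory]


def set_stripe_count(directory, stripe_count):
    """Run lfs setstripe if the lfs tool is available; return True if it ran."""
    if shutil.which("lfs") is None:   # not on a Lustre client
        return False
    subprocess.run(setstripe_command(directory, stripe_count), check=True)
    return True
```

New files created in the directory afterwards use the new setting, while existing files keep their striping, as noted in the list above.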