Best practices on Lustre parallel file systems

==Introduction==
On the Euler and Leonhard clusters, the global '''scratch''' and '''work''' directories

/cluster/scratch/$USER
/cluster/work/

are hosted on Lustre file systems. They are optimized especially for '''parallel I/O''' and '''large files'''. These file systems are '''shared among many users'''. If you are

*working with a large number of small files
*running thousands of unnecessary I/O operations per second (e.g. opening and closing a file in a loop)
*accessing the same file with hundreds of processes

then this will not only slow down your jobs: '''it can overload the entire file system and affect all users'''. Therefore, please carefully read our [[#Best_practices|best practices guide]] before using <tt>/cluster/work</tt> and/or <tt>/cluster/scratch</tt>.

==Lustre architecture==
Lustre is a '''parallel distributed file system'''. Files are distributed across multiple servers and then striped across multiple disks.

A Lustre file system has three major functional units:

* '''Metadata servers (MDS)''' that store namespace metadata, such as filenames, directories, access permissions, and file layout.
* '''Object storage server (OSS)''' nodes that store file data on one or more object storage target (OST) devices.
* '''Client(s)''' that access and use the data.

When a client accesses a file, it performs a filename lookup on the MDS. Once the lookup is complete and the user and client have permission to access the file, the layout of the file is returned to the client.

For read or write operations, the client then interprets the file layout, which maps the file's logical offset and size to one or more objects, each residing on a separate OST. The client then locks the file range being operated on and executes one or more parallel read or write operations directly to the OSS nodes.

After the initial lookup of the file layout, the MDS is normally not involved in file I/O operations, since all block allocation and data I/O are managed internally by the OSTs. Clients do not directly modify the objects or data on the OST file systems, but instead delegate this task to the OSS nodes.

==Best practices==

===Avoid unnecessary I/O operations===
Many programs have options that control how much I/O they perform, i.e. how verbose they are. In general, I/O operations slow down your computation, because the CPU sits idle while they complete. Therefore, only do I/O when it is required and adds value to your computation, and avoid unnecessary I/O operations whenever possible.
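
For example, instead of writing output in every iteration of a tight loop, you can buffer values and write them out at a coarser interval. A minimal Python sketch of this idea; the file name <tt>results.txt</tt> and the trivial computation are placeholders for illustration:

# Buffer intermediate results and write them out every 100 iterations
# instead of once per iteration, reducing the number of write calls.
buffer = []
with open('results.txt', 'w') as f:
    for i in range(10000):
        value = i * i                   # stand-in for the actual computation
        buffer.append(str(value))
        if len(buffer) == 100:          # flush the buffer at a coarse interval
            f.write('\n'.join(buffer) + '\n')
            buffer.clear()
    if buffer:                          # write any leftover values at the end
        f.write('\n'.join(buffer) + '\n')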

===Limit repetitive Open/Close operations===
If you need to write many values into a file as part of a loop, there are several ways to achieve this. Make sure that you never put the open and close statements inside the loop, as in this Python example:

some_data = 'some text\n'   # whatever data you want to write
for i in range(1000):
    f = open('test2.txt', 'a')
    f.write(some_data)
    f.close()

This opens and closes the same file 1000 times, which causes a total of 2000 I/O operations, 1998 of which are unnecessary. It is sufficient to open the file once, write all values to it, and close it at the end, resulting in only 2 I/O operations:

some_data = 'some text\n'   # whatever data you want to write
f = open('test1.txt', 'w')
for i in range(1000):
    f.write(some_data)
f.close()

===Limit repetitive "stat" operations===
If your code needs to wait until a certain file is created, it is sufficient to check for the file every 5-10 seconds. Checking in a loop without any delay can cause more than 1000 checks per second and creates a lot of unnecessary file stat calls.
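
A minimal Python sketch of such a polling loop; the marker file name <tt>results.done</tt> is a placeholder:

import os
import time

# Poll for the marker file every 10 seconds instead of spinning in a tight loop;
# each os.path.exists() call results in one stat operation.
while not os.path.exists('results.done'):
    time.sleep(10)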

===Directory listings: ls vs. ls -l vs. ls --color===
If you run the <tt>ls</tt> command to list a file or a directory, it only queries the MDS for this information. When running the command with the <tt>-l</tt> option, it also needs to contact the OSSes to look up the file sizes, which creates additional load on the storage system.

* Use <tt>ls</tt> if you only need to list files and directories
* Only use <tt>ls -l</tt> if you also need to know the file sizes
* Unset the default <tt>color</tt> option of <tt>ls</tt> (<tt>ls --color</tt>) if you need to list Lustre directories with many files: use <tt>ls --color=never</tt>, or bypass the <tt>ls</tt> alias for a single invocation with <tt>\ls</tt>

===Use subdirectories instead of storing all files in a single directory===
When a file is accessed, Lustre puts a lock on its parent directory. If many files are opened in the same directory, this causes contention. To minimize contention, distribute your files over a subdirectory structure, as sketched below; this also keeps your files organized and easier to handle.

If you accidentally run an <tt>ls -l</tt> instead of just <tt>ls</tt>, it also makes a big difference whether the directory contains 20 or 100'000 files.
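
A minimal Python sketch that spreads many output files over subdirectories; the directory names and the bucket size of 100 files are arbitrary choices for illustration:

import os

base_dir = 'results'

# Spread 10'000 output files over 100 subdirectories of 100 files each,
# instead of creating all of them in one large directory.
for i in range(10000):
    subdir = os.path.join(base_dir, 'part_%03d' % (i // 100))
    os.makedirs(subdir, exist_ok=True)    # create the subdirectory on first use
    path = os.path.join(subdir, 'sample_%05d.dat' % i)
    with open(path, 'w') as f:
        f.write('example data\n')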

===Try to use local scratch for serial jobs===
If your data set fits into local scratch (≤ 300 GB), try to use local scratch instead of Lustre, as it will be faster in almost all cases:

https://scicomp.ethz.ch/wiki/Using_local_scratch

A typical workflow is to copy your input files from Lustre to local scratch at the beginning of a job, process them there, and copy the results back from local scratch to Lustre at the end of the job.

Jobs working with many small files can be sped up by a large factor thanks to the much lower latency of local scratch.
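
A minimal Python sketch of this workflow, assuming the job runs on a compute node where $TMPDIR points to local scratch; the file names input.dat/output.dat and the program my_program are placeholders:

import os
import shutil
import subprocess

scratch = os.environ['SCRATCH']    # Lustre scratch directory
tmpdir = os.environ['TMPDIR']      # local scratch of the compute node

# 1. copy the input from Lustre to local scratch
local_input = shutil.copy(os.path.join(scratch, 'input.dat'), tmpdir)

# 2. process the data on local scratch (placeholder for the actual program)
local_output = os.path.join(tmpdir, 'output.dat')
subprocess.run(['my_program', local_input, local_output], check=True)

# 3. copy the results back to Lustre
shutil.copy(local_output, scratch)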

===Use other storage locations for small files===
The Lustre file system is the worst place to store a lot of small files. Other file systems like $HOME or local scratch ($TMPDIR, only available on compute nodes) are much better suited for small files. If you have to keep a lot of small files on Lustre, then please at least tar them up into a single archive. To process those files, untar the archive to local scratch at the beginning of the job, process the files on the compute node, and at the end of the job tar up the results and copy the archive back to Lustre.
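
A minimal Python sketch of this pattern, assuming the hypothetical archive input.tar.gz is stored in your Lustre scratch directory and $TMPDIR points to the job's local scratch:

import os
import tarfile

scratch = os.environ['SCRATCH']    # Lustre scratch directory
tmpdir = os.environ['TMPDIR']      # local scratch of the compute node

# 1. unpack the many small input files to local scratch
with tarfile.open(os.path.join(scratch, 'input.tar.gz'), 'r:gz') as tar:
    tar.extractall(path=tmpdir)

# 2. process the files on local scratch (placeholder for the actual work)
results_dir = os.path.join(tmpdir, 'results')
os.makedirs(results_dir, exist_ok=True)
# ... produce result files in results_dir ...

# 3. pack the results into a single archive and copy it back to Lustre
with tarfile.open(os.path.join(scratch, 'results.tar.gz'), 'w:gz') as tar:
    tar.add(results_dir, arcname='results')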

===Avoid reading the same region of a file from many processes at the same time===
If you are running a series of jobs in parallel that all access the same region of a file, this will cause performance problems. It is better to either split the file into parts that can be used by the individual jobs, or to work on multiple copies of the same file. Each job could, for instance, copy the file to local scratch in order to avoid contention. A sketch of reading disjoint parts of a shared file is shown below.
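
A minimal Python sketch of the "split the file into parts" approach, where each job reads only its own byte range of a shared file; the file name and the way the job index is obtained are placeholders:

import os

job_index = int(os.environ.get('JOB_INDEX', '0'))   # placeholder: e.g. the index within a job array
num_jobs = 8                                         # placeholder: total number of parallel jobs

filename = 'shared_input.dat'
size = os.path.getsize(filename)
chunk = size // num_jobs

start = job_index * chunk
# the last job also reads the remainder of the file
end = size if job_index == num_jobs - 1 else start + chunk

with open(filename, 'rb') as f:
    f.seek(start)                   # jump to this job's part of the file
    data = f.read(end - start)      # read only the bytes belonging to this job
# ... process data ...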

==Working with stripes (advanced users)==
Lustre always tries to distribute your data across all OSTs. The striping parameters can be tuned '''per file''' or '''per directory'''.

===How to display the current striping settings===
The stripe settings of a file or directory can be shown with the command '''lfs getstripe'''. The default stripe count is 1:

[sfux@eu-login-24-ng ~]$ lfs getstripe $SCRATCH/__USAGE_RULES__
/cluster/scratch/sfux/__USAGE_RULES__
lmm_stripe_count:   1
lmm_stripe_size:    1048576
lmm_pattern:        1
lmm_layout_gen:     0
lmm_stripe_offset:  3
        obdidx           objid           objid           group
             3          619261        0x972fd                0

[sfux@eu-login-24-ng ~]$

* stripe_count = 1 : use the file system default stripe count
* stripe_size = 1048576 : use a 1 MiB stripe/chunk size
* stripe_offset = -1 : let Lustre choose the OST (you should not change this)

===Hints for proper striping count===
The best stripe count depends mostly on the I/O access pattern, the file size and the number of nodes used to access the file. To help users distribute their data, the following command automatically sets an appropriate striping that adapts to the size of the file as the file grows:

lfs setstripe -E 500M -c 1 -E 10G -c 2 -E 50G -c 4 -E 200G -c 8 -E -1 -c 20 <file_to_be_created>

===How to change stripe settings===
The stripe settings of a directory can be changed with the command '''lfs setstripe''':

[sfux@eu-login-02-ng test]$ lfs setstripe -E 500M -c 1 -E 10G -c 2 -E 50G -c 4 -E 200G -c 8 -E -1 -c 20 my_directory

Please note:
* You '''cannot''' change the striping of '''existing files'''
* You '''can''' always change the striping parameters for new files with the <tt>lfs</tt> command
* A subdirectory '''inherits''' all stripe parameters from its parent directory (unless changed via lfs setstripe)

Example:

[sfux@eu-login-02-ng test]$ lfs setstripe -E 500M -c 1 -E 10G -c 2 -E 50G -c 4 -E 200G -c 8 -E -1 -c 20 my_new_file

The file will be precreated with zero bytes.

To change the stripe settings of an existing file:

* Create an empty file with the preferred stripe settings
* Copy the original file into the new one
* Move the new file to the original file name

Example:

[sfux@eu-login-02-ng test]$ pwd
/cluster/scratch/sfux/test
[sfux@eu-login-02-ng test]$ ls
DFT.tar.gz
[sfux@eu-login-02-ng test]$ lfs getstripe DFT.tar.gz
DFT.tar.gz
lmm_stripe_count:   1
lmm_stripe_size:    1048576
lmm_pattern:        1
lmm_layout_gen:     0
lmm_stripe_offset:  3
        obdidx           objid           objid           group
             3       218244665      0xd022639                0

The original file has a stripe count of 1. Now it is replaced by a file with the progressive layout shown above (a stripe count of -1 would stripe a file across all available OSTs):

[sfux@eu-login-02-ng test]$ lfs setstripe -E 500M -c 1 -E 10G -c 2 -E 50G -c 4 -E 200G -c 8 -E -1 -c 20 DFT.tar.gz_tmp
[sfux@eu-login-02-ng test]$ cp -a DFT.tar.gz DFT.tar.gz_tmp
[sfux@eu-login-02-ng test]$ mv DFT.tar.gz_tmp DFT.tar.gz
[sfux@eu-login-02-ng test]$

==Optimizing HDF5 file layout for best performance on Lustre==
For HDF5 files, there are different ways to lay out the data inside the file when storing it. The following guides describe how to optimize HDF5 function calls and the file layout to get good performance on Lustre file systems; a small example of chunked writes is sketched after the links:

* [[media:Tuning_HDF5_for_Lustre_File_Systems.pdf|Tuning HDF5 for Lustre file systems]]
* [https://www.nersc.gov/users/training/online-tutorials/introduction-to-scientific-i-o/?start=5 Introduction to scientific I/O from NERSC]
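
As an illustration of one common recommendation from such guides (write data in large, contiguous chunks rather than many small pieces), here is a minimal Python sketch using the h5py package; the file name, dataset shape and chunk shape are arbitrary illustration values, so consult the guides above for settings that match your stripe configuration:

import numpy as np
import h5py

# Create a chunked dataset and write it one full chunk at a time,
# so that each write is a single large, contiguous I/O operation.
with h5py.File('output.h5', 'w') as f:
    dset = f.create_dataset('data', shape=(512, 512, 64),
                            dtype='float64', chunks=(512, 512, 1))
    for k in range(64):
        dset[:, :, k] = np.random.rand(512, 512)   # one full chunk (~2 MiB) per write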
