Best practices on Lustre parallel file systems

From ScientificComputing
Revision as of 08:49, 12 February 2019 by Sfux (talk | contribs)

Jump to: navigation, search


Lustre is a type of parallel distributed file system, generally used for large-scale cluster computing. Files are distributed across multiple servers, and then striped across multiple disks.

A Lustre file system has three major functional units:

  • Metadata servers (MDS) nodes that has one or more metadata target (MDT) devices per Lustre filesystem that stores namespace metadata, such as filenames, directories, access permissions, and file layout. The Lustre metadata server is only involved in pathname and permission checks, and is not involved in any file I/O operations
  • Object storage server (OSS) nodes that store file data on one or more object storage target (OST) devices.
  • Client(s) that access and use the data.

When a client accesses a file, it performs a filename lookup on the MDS. When the MDS filename lookup is complete and the user and client have permission to access and/or create the file, either the layout of an existing file is returned to the client or a new file is created on behalf of the client, if requested. For read or write operations, the client then interprets the file layout in the logical object volume (LOV) layer, which maps the file logical offset and size to one or more objects, each residing on a separate OST. The client then locks the file range being operated on and executes one or more parallel read or write operations directly to the OSS nodes. With this approach, bottlenecks for client-to-OSS communications are eliminated, so the total bandwidth available for the clients to read and write data scales almost linearly with the number of OSTs in the filesystem.

After the initial lookup of the file layout, the MDS is not normally involved in file IO operations since all block allocation and data IO is managed internally by the OST. Clients do not directly modify the objects or data on the OST filesystems, but instead delegate this task to OSS nodes. This approach ensures scalability for large-scale clusters and supercomputers, as well as improved security and reliability. In contrast, shared block-based filesystems such as GPFS and OCFS allow direct access to the underlying storage by all of the clients in the filesystem, which requires a large back-end SAN attached to clients, and increases the risk of filesystem corruption from misbehaving/defective clients.

Best practices

Working with stripes