Best practices on Lustre parallel file systems
Lustre is a type of parallel distributed file system, generally used for large-scale cluster computing. Files are distributed across multiple servers, and then striped across multiple disks.
A Lustre file system has three major functional units:
One or more metadata server (MDS) nodes, each with one or more metadata target (MDT) devices per Lustre filesystem, that store namespace metadata such as filenames, directories, access permissions, and file layout. The MDT data is stored in a local disk filesystem. Unlike block-based distributed filesystems such as GPFS and PanFS, where the metadata server controls all block allocation, the Lustre metadata server is involved only in pathname and permission checks and not in any file I/O operations, which avoids I/O scalability bottlenecks on the metadata server. Support for multiple MDTs in a single filesystem was introduced in Lustre 2.4 and allows directory subtrees to reside on secondary MDTs; Lustre 2.7 and later also allow a large single directory to be distributed across multiple MDTs.
One or more object storage server (OSS) nodes that store file data on one or more object storage target (OST) devices. Depending on the server's hardware, an OSS typically serves between two and eight OSTs, with each OST managing a single local disk filesystem. The capacity of a Lustre file system is the sum of the capacities provided by the OSTs.
One or more clients that access and use the data. Lustre presents all clients with a unified namespace for all of the files and data in the filesystem, using standard POSIX semantics, and allows concurrent and coherent read and write access to the files in the filesystem.
The MDT, OST, and client may be on the same node (usually for testing purposes), but in typical production installations they are on separate nodes that communicate over a network. Each MDT and OST may be part of only a single filesystem, though a single node can host multiple MDTs or OSTs that belong to different filesystems.
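The capacity rule stated in the OSS description above (filesystem capacity is the sum of the OST capacities) can be illustrated with a small Python sketch. The class name, server names, and sizes below are hypothetical and chosen only for the arithmetic; the figure is the raw sum and ignores any overhead of the backing filesystem.

```python
from dataclasses import dataclass
from typing import List

TIB = 1 << 40  # bytes in one tebibyte


@dataclass
class ToyOSS:
    """Hypothetical description of one OSS node and the OSTs it serves."""
    name: str
    ost_capacities: List[int]  # capacity of each OST in bytes


def filesystem_capacity(servers: List[ToyOSS]) -> int:
    """Raw capacity: the sum over every OST on every OSS (MDTs are not counted)."""
    return sum(sum(oss.ost_capacities) for oss in servers)


# Assumed example: 4 OSS nodes, each serving 4 OSTs of 16 TiB.
servers = [ToyOSS(f"oss{i}", [16 * TIB] * 4) for i in range(4)]
total = filesystem_capacity(servers)
print(total // TIB, "TiB")  # prints "256 TiB"
```

Adding a fifth OSS with the same four OSTs would raise the total by 64 TiB, which is why capacity grows with the number of OSTs rather than with anything on the metadata side.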
The Lustre Network (LNet) layer can use several types of network interconnects: native InfiniBand verbs, Omni-Path, RoCE, and iWARP via OFED; TCP/IP on Ethernet; and proprietary network technologies such as the Cray Gemini interconnect. Lustre 2.3 and earlier also supported Myrinet, Quadrics, Cray SeaStar, and RapidArray networks, but these drivers were deprecated once the networks were no longer commercially available, and support was removed completely in Lustre 2.8. Lustre takes advantage of remote direct memory access (RDMA) transfers, when available, to improve throughput and reduce CPU usage.
The storage used for the MDT and OST backing filesystems is normally provided by hardware RAID devices, though Lustre will work with any block device. Since Lustre 2.4, the MDT and OST can also use ZFS for the backing filesystem in addition to ext4, allowing them to use JBOD storage effectively instead of hardware RAID devices. The Lustre OSS and MDS servers read, write, and modify data in the format imposed by the backing filesystem and return this data to the clients. This allows Lustre to take advantage of improvements and features in the underlying filesystem, such as compression and data checksums in ZFS. Clients do not have any direct access to the underlying storage, which ensures that a malfunctioning or malicious client cannot corrupt the filesystem structure.
An OST is a dedicated filesystem that exports an interface to byte ranges of file objects for read/write operations. An MDT is a dedicated filesystem that stores inodes, directories, POSIX and extended file attributes, controls file access permissions/ACLs, and tells clients the layout of the object(s) that make up each regular file. MDTs and OSTs currently store their files and objects on one of two back ends: ldiskfs, an enhanced version of ext4, or ZFS/DMU, using the open source ZFS-on-Linux port.
When a client accesses a file, it performs a filename lookup on the MDS. When the MDS filename lookup is complete and the user and client have permission to access and/or create the file, either the layout of an existing file is returned to the client or a new file is created on behalf of the client, if requested. For read or write operations, the client then interprets the file layout in the logical object volume (LOV) layer, which maps the file logical offset and size to one or more objects, each residing on a separate OST. The client then locks the file range being operated on and executes one or more parallel read or write operations directly to the OSS nodes. With this approach, bottlenecks for client-to-OSS communications are eliminated, so the total bandwidth available for the clients to read and write data scales almost linearly with the number of OSTs in the filesystem.
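The offset-to-object mapping described above can be sketched in Python. This is a minimal illustration that assumes simple round-robin (RAID-0 style) striping with a fixed stripe size; the names `ToyLayout`, `locate`, and `split_range` are hypothetical and are not part of any Lustre API.

```python
from dataclasses import dataclass
from typing import List, Tuple

MIB = 1 << 20


@dataclass(frozen=True)
class ToyLayout:
    """Hypothetical file layout: fixed-size chunks dealt round-robin to objects."""
    stripe_size: int   # bytes per stripe chunk
    stripe_count: int  # number of objects, each assumed to live on a separate OST


def locate(layout: ToyLayout, offset: int) -> Tuple[int, int]:
    """Map a logical file offset to (object index, offset inside that object)."""
    chunk = offset // layout.stripe_size              # which chunk of the file
    obj_index = chunk % layout.stripe_count           # round-robin object choice
    full_rounds = chunk // layout.stripe_count        # chunks already in that object
    obj_offset = full_rounds * layout.stripe_size + offset % layout.stripe_size
    return obj_index, obj_offset


def split_range(layout: ToyLayout, offset: int, length: int) -> List[Tuple[int, int, int]]:
    """Split [offset, offset + length) into (object index, object offset, size) extents."""
    extents = []
    end = offset + length
    while offset < end:
        obj_index, obj_offset = locate(layout, offset)
        # Never cross a chunk boundary within a single extent.
        chunk_left = layout.stripe_size - offset % layout.stripe_size
        size = min(chunk_left, end - offset)
        extents.append((obj_index, obj_offset, size))
        offset += size
    return extents


layout = ToyLayout(stripe_size=MIB, stripe_count=4)
print(locate(layout, 5 * MIB + 10))            # (1, 1048586)
print(split_range(layout, MIB // 2, MIB))      # [(0, 524288, 524288), (1, 0, 524288)]
```

In the second call a 1 MiB request that starts halfway through the first chunk is split into two extents on different objects, so a client could issue both requests in parallel to different OSS nodes.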
After the initial lookup of the file layout, the MDS is not normally involved in file I/O operations, since all block allocation and data I/O is managed internally by the OST. Clients do not directly modify the objects or data on the OST filesystems, but instead delegate this task to OSS nodes. This approach ensures scalability for large-scale clusters and supercomputers, as well as improved security and reliability. In contrast, shared block-based filesystems such as GPFS and OCFS allow all of the clients in the filesystem direct access to the underlying storage, which requires a large back-end SAN attached to the clients and increases the risk of filesystem corruption from misbehaving or defective clients.
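The division of labour described above can be modelled with a toy Python sketch: the client asks the metadata server for the layout once, then sends all data requests to the object storage side. Everything here is hypothetical (class names, the path, the per-call round-robin placement), and locking, networking, and permission checks are deliberately left out; it only counts who gets contacted.

```python
class ToyMDS:
    """Holds namespace and layout information only; never sees file data."""

    def __init__(self):
        self.layouts = {}  # path -> tuple of OST indices
        self.lookups = 0   # how many times a client asked for a layout

    def create(self, path, ost_ids):
        self.layouts[path] = tuple(ost_ids)

    def lookup(self, path):
        self.lookups += 1
        return self.layouts[path]


class ToyOST:
    """Stores object data; how it allocates space is internal to it."""

    def __init__(self):
        self.objects = {}  # path -> bytes stored in this OST's object
        self.requests = 0  # how many I/O requests this OST served

    def write(self, path, data):
        self.requests += 1
        self.objects[path] = self.objects.get(path, b"") + data


class ToyClient:
    """Assumes the client keeps the layout after the first lookup."""

    def __init__(self, mds, osts):
        self.mds, self.osts = mds, osts
        self.layout_cache = {}

    def write_chunks(self, path, chunks):
        if path not in self.layout_cache:  # only the first access reaches the MDS
            self.layout_cache[path] = self.mds.lookup(path)
        layout = self.layout_cache[path]
        for i, chunk in enumerate(chunks):  # simplified per-call round-robin
            self.osts[layout[i % len(layout)]].write(path, chunk)


mds = ToyMDS()
osts = [ToyOST() for _ in range(4)]
mds.create("/scratch/out.dat", ost_ids=[0, 2])  # hypothetical path and layout
client = ToyClient(mds, osts)
client.write_chunks("/scratch/out.dat", [b"aa", b"bb", b"cc"])
client.write_chunks("/scratch/out.dat", [b"dd"])
print(mds.lookups, [ost.requests for ost in osts])  # 1 [3, 0, 1, 0]
```

Four data requests reach the OSTs while the metadata server is contacted once, which is the property that keeps the MDS out of the data path as the number of writes grows.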