2008
To improve the checkpoint bandwidth of critical applications at LANL, we developed the Parallel Log Structured File System (PLFS) [1]. PLFS is a transformative I/O middleware layer placed within our storage stack. It transforms a concurrently written single shared file into non-shared component pieces. This reorganized I/O has made write size a non-issue and improved checkpoint performance by orders of magnitude, meeting the project's L2 milestone to show increased performance for checkpointing with LANL codes. LANL is working with EMC under an umbrella Cooperative Research and Development Agreement (CRADA) to further enhance, design, build, test, and deploy PLFS. PLFS has been integrated with multiple types of storage systems, including cloud storage, and has shown improvements in file storage sizes and metadata rates.
2009
Parallel applications running across thousands of processors must protect themselves from inevitable system failures. Many applications insulate themselves from failures by checkpointing. For many applications, checkpointing into a shared single file is most convenient. With such an approach, the writes are often small and not aligned with file system boundaries. Unfortunately for these applications, this preferred data layout results in pathologically poor performance from the underlying file system, which is optimized for large, aligned writes to non-shared files. To address this fundamental mismatch, we have developed a virtual parallel log structured file system, PLFS. PLFS remaps an application's preferred data layout into one which is optimized for the underlying file system. Through testing on PanFS, Lustre, and GPFS, we have seen that this layer of indirection and reorganization can reduce checkpoint time by an order of magnitude for several important benchmarks and real applications without any application modification.
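The remapping described above can be pictured with a minimal sketch, assuming a simplified decomposition into per-process data logs plus an index that maps logical offsets in the shared file to physical locations in those logs; the class and field names below are illustrative, not the PLFS on-disk format or API.

```python
# Minimal sketch of the remapping idea: each writer appends to its own data
# log and records an index entry mapping the logical (shared-file) offset to
# the physical location. Names are illustrative, not the PLFS format or API.

from dataclasses import dataclass, field
from typing import List

@dataclass
class IndexEntry:
    logical_off: int    # offset in the logically shared file
    length: int
    log_id: int         # which per-process data log holds the bytes
    physical_off: int   # offset within that data log

@dataclass
class WriterLog:
    log_id: int
    data: bytearray = field(default_factory=bytearray)
    index: List[IndexEntry] = field(default_factory=list)

    def write(self, logical_off: int, buf: bytes) -> None:
        """Append-only: small, unaligned writes to the shared file become
        large sequential appends on the underlying file system."""
        self.index.append(IndexEntry(logical_off, len(buf),
                                     self.log_id, len(self.data)))
        self.data.extend(buf)

def read(logs, logical_off, length):
    """Resolve a read by walking the merged index (overlap and
    last-writer-wins handling are omitted for brevity)."""
    out = bytearray(length)
    for log in logs:
        for e in log.index:
            lo = max(logical_off, e.logical_off)
            hi = min(logical_off + length, e.logical_off + e.length)
            if lo < hi:
                src = e.physical_off + (lo - e.logical_off)
                out[lo - logical_off:hi - logical_off] = log.data[src:src + hi - lo]
    return bytes(out)

# Two "processes" writing interleaved 7-byte records into one logical file:
logs = [WriterLog(0), WriterLog(1)]
logs[0].write(0, b"rank0__")
logs[1].write(7, b"rank1__")
assert read(logs, 0, 14) == b"rank0__rank1__"
```

Because each writer only ever appends to its own log, concurrent small writes never contend for the same file region; the cost is moved to reads, which must consult the index.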
2012 19th International Conference on High Performance Computing, 2012
Long-running HPC applications guard against node failures by writing checkpoints to parallel file systems. Writing these checkpoints with petascale class machines has proven difficult and the increased concurrency demands of exascale computing will exacerbate this problem. To meet checkpointing demands and sustain application-perceived throughput at exascale, multi-tiered hierarchical storage architectures involving solid-state burst buffers are being considered. In this paper, we describe the design and implementation of cento, a multilevel, content-addressable checkpoint file system for large-scale HPC systems. cento achieves in-flight checkpoint data reduction across all compute nodes through compression and elimination of duplicate blocks over a series of checkpoints. Through a detailed analysis of checkpoint dumps, we assess the benefits of data reduction for scientific applications that are representative of production workloads. We observe up to 40% data reduction within a limited sample of representative workloads. Finally, experiments on existing systems show a decrease in checkpoint commit latencies by 5 to 20%, reducing the load on the parallel file system.
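The duplicate-block elimination rests on content addressing: each block is named by a cryptographic digest, so a block that recurs within or across a series of checkpoints is stored once and later checkpoints only reference it. The sketch below illustrates that idea under assumed choices (fixed 4 KB blocks, SHA-256, an in-memory store); it is not cento's layout or implementation.

```python
# Sketch of content-addressable checkpoint storage: split each checkpoint into
# fixed-size blocks, key every block by its SHA-256 digest, and store a block
# only the first time that digest is seen. Block size and the in-memory store
# are illustrative assumptions.

import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size

class BlockStore:
    def __init__(self):
        self.blocks = {}  # digest -> block bytes (unique blocks only)

    def put_checkpoint(self, data):
        """Store a checkpoint; return its 'recipe' (ordered list of digests)."""
        recipe = []
        for i in range(0, len(data), BLOCK_SIZE):
            block = data[i:i + BLOCK_SIZE]
            digest = hashlib.sha256(block).hexdigest()
            self.blocks.setdefault(digest, block)  # duplicate blocks stored once
            recipe.append(digest)
        return recipe

    def get_checkpoint(self, recipe):
        return b"".join(self.blocks[d] for d in recipe)

store = BlockStore()
# Contrived data chosen so duplicates occur both within and across checkpoints:
ckpt1 = bytes(8 * BLOCK_SIZE)                            # initial state, all zeros
ckpt2 = bytes(6 * BLOCK_SIZE) + b"x" * (2 * BLOCK_SIZE)  # only two blocks changed
r1, r2 = store.put_checkpoint(ckpt1), store.put_checkpoint(ckpt2)
assert store.get_checkpoint(r2) == ckpt2
print(f"{len(r1) + len(r2)} logical blocks stored as {len(store.blocks)} unique blocks")
```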
2012
Input/Output (I/O) operations can represent a significant proportion of run-time when large scientific applications are run in parallel and at scale. In order to address the growing divergence between processing speeds and I/O performance, the Parallel Log-structured File System (PLFS) has been developed by EMC Corporation and the Los Alamos National Laboratory (LANL) to improve the performance of parallel file activities.
Research results demonstrate that a log-structured file system (LFS) offers the potential for dramatically improved write performance, faster recovery time, and faster file creation and deletion than traditional UNIX file systems. This paper presents a redesign and implementation of the Sprite [ROSE91] log-structured file system that is more robust and integrated into the vnode interface. Measurements show its performance to be superior to the 4BSD Fast File System (FFS) in a variety of benchmarks and not significantly less than FFS in any test. Unfortunately, an enhanced version of FFS (with read and write clustering) [MCVO91] provides comparable and sometimes superior performance to our LFS. However, LFS can be extended to provide additional functionality such as embedded transactions and versioning, not easily implemented in traditional file systems.
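The write-performance results come from the core log-structured mechanism: modifications are buffered and written out as large sequential segments, with a map tracking the newest on-log copy of each block. A toy sketch of that idea follows, with illustrative names and with segment cleaning, checkpoints, and crash recovery omitted.

```python
# Toy sketch of the core log-structured idea: buffer dirty blocks and flush
# them as one large sequential segment, updating an in-memory map that points
# at the newest on-log location of each (file, block) pair. Cleaning, crash
# recovery, and checkpoints are omitted; names are illustrative.

SEGMENT_BLOCKS = 4  # assumed segment size, in blocks

class LogFS:
    def __init__(self):
        self.log = []        # the on-"disk" log, block by block
        self.block_map = {}  # (file_id, block_no) -> index into the log
        self.dirty = {}      # write buffer

    def write(self, file_id, block_no, data):
        self.dirty[(file_id, block_no)] = data
        if len(self.dirty) >= SEGMENT_BLOCKS:
            self.flush_segment()

    def flush_segment(self):
        """One large sequential write instead of many small in-place updates."""
        for key, data in self.dirty.items():
            self.block_map[key] = len(self.log)  # newest copy wins
            self.log.append(data)
        self.dirty.clear()

    def read(self, file_id, block_no):
        key = (file_id, block_no)
        if key in self.dirty:
            return self.dirty[key]
        return self.log[self.block_map[key]]

fs = LogFS()
for b in range(5):
    fs.write(file_id=1, block_no=b, data=f"v1-block{b}".encode())
fs.write(file_id=1, block_no=0, data=b"v2-block0")  # overwrite goes to the log tail
fs.flush_segment()
assert fs.read(1, 0) == b"v2-block0"
```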
1992
Robotic storage devices offer huge storage capacity at a low cost per byte, but with large access times. Integrating these devices into the storage hierarchy presents a challenge to file system designers. Log-structured file systems (LFSs) were developed to reduce latencies involved in accessing disk devices, but their sequential write patterns match well with tertiary storage characteristics. Unfortunately, existing versions only manage memory caches and disks, and do not support a broader storage hierarchy.
2012
When deployed in 2008/2009 the Spider system at the Oak Ridge National Laboratory's Leadership Computing Facility (OLCF) was the world's largest scale Lustre parallel file system. Envisioned as a shared parallel file system capable of delivering both the bandwidth and capacity requirements of the OLCF's diverse computational environment, Spider has since become a blueprint for shared Lustre environments deployed worldwide. Designed to support the parallel I/O requirements of the Jaguar XT5 system and other smaller-scale platforms at the OLCF, the upgrade to the Titan XK6 heterogeneous system will begin to push the limits of Spider's original design by mid-2013. With a doubling in total system memory and a 10x increase in FLOPS, Titan will require both higher bandwidth and larger total capacity. Our goal is to provide a 4x increase in total I/O bandwidth from over 240 GB/sec today to 1 TB/sec and a doubling in total capacity. While aggregate bandwidth and total capacity remain important capabilities, an equally important goal in our efforts is dramatically increasing metadata performance, currently the Achilles heel of parallel file systems at leadership scale. We present in this paper an analysis of our current I/O workloads, our operational experiences with the Spider parallel file systems, the high-level design of our Spider upgrade, and our efforts in developing benchmarks that synthesize our performance requirements based on our workload characterization studies.
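The bandwidth target follows from simple arithmetic: the checkpoint window is roughly the memory to be dumped divided by aggregate file system bandwidth, so doubling memory at constant bandwidth doubles the window. The sketch below works this out; the 240 GB/sec and 1 TB/sec figures are the ones quoted above, while the memory sizes are illustrative assumptions, not OLCF figures.

```python
# Back-of-the-envelope checkpoint window: time to dump memory is roughly
# (memory to be written) / (aggregate file system bandwidth). The memory
# sizes below are illustrative assumptions, not OLCF figures; the bandwidth
# figures are the 240 GB/sec and 1 TB/sec targets quoted above.

def checkpoint_minutes(memory_tb, bandwidth_gb_per_s, fraction=1.0):
    """Minutes to write `fraction` of `memory_tb` terabytes at the given rate."""
    return memory_tb * 1024 * fraction / bandwidth_gb_per_s / 60

for memory_tb, bw in [(300, 240),    # assumed pre-upgrade memory at today's bandwidth
                      (600, 240),    # memory doubles, bandwidth unchanged
                      (600, 1000)]:  # memory doubles, bandwidth raised to ~1 TB/sec
    minutes = checkpoint_minutes(memory_tb, bw)
    print(f"{memory_tb} TB @ {bw} GB/s -> {minutes:.0f} min full-memory dump")
```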
2006
We have developed Ceph, a distributed file system that provides excellent performance, reliability, and scalability. Ceph maximizes the separation between data and metadata management by replacing allocation tables with a pseudo-random data distribution function (CRUSH) designed for heterogeneous and dynamic clusters of unreliable object storage devices (OSDs). We leverage device intelligence by distributing data replication, failure detection and recovery to semi-autonomous OSDs running a specialized local object file system. A dynamic distributed metadata cluster provides extremely efficient metadata management and seamlessly adapts to a wide range of general purpose and scientific computing file system workloads. Performance measurements under a variety of workloads show that Ceph has excellent I/O performance and scalable metadata management, supporting more than 250,000 metadata operations per second.
2006
File system designers continue to look to new architectures to improve scalability. Object-based storage diverges from server-based (e.g., NFS) and SAN-based storage systems by coupling processors and memory with disk drives, delegating low-level allocation to object storage devices (OSDs) and decoupling I/O (read/write) from metadata (file open/close) operations. Even recent object-based systems inherit decades-old architectural choices going back to early UNIX file systems, however, limiting their ability to effectively scale to hundreds of petabytes. We present Ceph, a distributed file system that provides excellent performance and reliability with unprecedented scalability. Ceph maximizes the separation between data and metadata management by replacing allocation tables with a pseudo-random data distribution function (CRUSH) designed for heterogeneous and dynamic clusters of unreliable OSDs. We leverage OSD intelligence to distribute data replication, failure detection and recovery with semi-autonomous OSDs running a specialized local object storage file system (EBOFS). Finally, Ceph is built around a dynamic distributed metadata management cluster that provides extremely efficient metadata management that seamlessly adapts to a wide range of general purpose and scientific computing file system workloads. We present performance measurements under a variety of workloads that show superior I/O performance and scalable metadata management (more than a quarter million metadata ops/sec).
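The key property of a pseudo-random placement function is that any client can compute an object's location from the object name and the cluster map alone, with no allocation table to consult. The sketch below conveys that property using rendezvous-style hashing as a simplified stand-in; it is not the CRUSH algorithm, which additionally models device weights and failure-domain hierarchies.

```python
# Simplified stand-in for a pseudo-random placement function: every client
# derives an object's OSDs from a hash of (object name, OSD), so there is no
# allocation table or central lookup. Rendezvous-style hashing is used for
# illustration only; it is not CRUSH.

import hashlib

def place(object_name, osds, replicas=3):
    """Rank OSDs by a per-object hash and take the top `replicas`."""
    def score(osd):
        return hashlib.sha256(f"{object_name}:{osd}".encode()).hexdigest()
    return sorted(osds, key=score)[:replicas]

osds = [f"osd{i}" for i in range(10)]
print(place("ino1000.00000000", osds))  # any client computes the same placement
print(place("ino1000.00000001", osds))  # objects spread pseudo-randomly across OSDs

# Removing a device only remaps the objects that ranked it highly:
print(place("ino1000.00000000", [o for o in osds if o != "osd3"]))
```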
2012 International Conference for High Performance Computing, Networking, Storage and Analysis, 2012
High performance computing (HPC) systems use checkpoint-restart to tolerate failures. Typically, applications store their states in checkpoints on a parallel file system (PFS). As applications scale up, checkpoint-restart incurs high overheads due to contention for PFS resources. The high overheads force large-scale applications to reduce checkpoint frequency, which means more compute time is lost in the event of failure. We alleviate this problem through a scalable checkpoint-restart system, MCRENGINE. MCRENGINE aggregates checkpoints from multiple application processes with knowledge of the data semantics available through widely-used I/O libraries, e.g., HDF5 and netCDF, and compresses them. Our novel scheme improves compressibility of checkpoints by up to 115% over simple concatenation and compression. Our evaluation with large-scale application checkpoints shows that MCRENGINE reduces checkpointing overhead by up to 87% and restart overhead by up to 62% over a baseline with no aggregation or compression.
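The gain from semantics-aware aggregation comes from grouping the same variable from every process side by side before compressing, so similar bytes stay within the compressor's window. The toy comparison below, with contrived data, illustrative variable names, and zlib standing in for the compression stage, shows the effect; it is not MCRENGINE's actual merging scheme.

```python
# Toy illustration of "aggregate then compress" vs. naive concatenation:
# grouping the same variable from every rank keeps similar bytes inside the
# compressor's window, while per-rank concatenation pushes them apart.
# Variable names, the data, and zlib are illustrative assumptions only.

import random, struct, zlib

random.seed(0)
N = 3000  # doubles per variable per rank (24 KB, inside zlib's 32 KB window)

# Contrived data: "temperature" is identical across ranks, "density" is not.
temp = b"".join(struct.pack("<d", 300 + random.random()) for _ in range(N))
ckpts = [{"temperature": temp,
          "density": b"".join(struct.pack("<d", random.random()) for _ in range(N))}
         for _ in range(8)]

# Scheme 1: concatenate whole per-rank checkpoints, then compress.
concatenated = b"".join(c["temperature"] + c["density"] for c in ckpts)

# Scheme 2: aggregate variable-by-variable across ranks, then compress.
aggregated = (b"".join(c["temperature"] for c in ckpts) +
              b"".join(c["density"] for c in ckpts))

print("concatenated:", len(zlib.compress(concatenated, 9)), "bytes")
print("aggregated  :", len(zlib.compress(aggregated, 9)), "bytes")
```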
2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), 2017
We present Nswap2L-FS, a fast, adaptable, and heterogeneous storage system for backing file data in clusters. Nswap2L-FS particularly targets backing temporary files, such as those created by data-intensive applications for storing intermediate results. Our work addresses the problem of how to efficiently and effectively make use of heterogeneous storage devices that are increasingly common in clusters. Nswap2L-FS implements a two-layer device design. The top layer transparently manages a set of bottom layer physical storage devices, which may include SSD, HDD, and its own implementation of network RAM. Nswap2L-FS appears to node operating systems as a single, fast backing storage device for file systems, hiding the complexity of heterogeneous storage management from OS subsystems. Internally, it implements adaptable and tunable policies that specify where data should be placed and whether data should be migrated from one underlying physical device to another based on resource usage and the characteristics of different devices. We present solutions to challenges that are specific to supporting backing filesystems, including how to efficiently support a wide range of I/O request sizes and balancing fast storage goals with expectations of persistence of stored file data. Nswap2L-FS defines relaxed persistence guarantees on individual file writes to achieve faster I/O accesses; less stringent persistence semantics allow it to make use of network RAM to store file data, resulting in faster file I/O to applications. Relaxed persistence guarantees are acceptable in many situations, particularly those involving short-lived data such as temporary files. Nswap2L-FS provides a persistence snapshot mechanism that can be used by applications or checkpointing systems to ensure that file data are persistent at certain points in their execution. Nswap2L-FS is implemented as a Linux block device driver that can be added as a file partition on individual cluster nodes. Experimental results show that file-intensive applications run faster when using Nswap2L-FS as backing store. Additionally, its adaptive data placement and migration policies, which make effective use of different underlying physical storage devices, result in performance exceeding that of any single device.
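The two-layer design can be sketched as a top-level store that picks a bottom-layer device for each write and migrates blocks between devices as conditions change. The sketch below is a minimal illustration under assumed device names, capacities, and a first-fit placement policy; Nswap2L-FS's actual policies, which also consider request size and device characteristics, are richer than this.

```python
# Small sketch of a two-layer heterogeneous block store: the top layer picks a
# bottom-layer device per write (fastest tier with free space first) and can
# migrate blocks between devices. Device names, capacities, and the first-fit
# policy are illustrative assumptions, not Nswap2L-FS's actual policies.

class Device:
    def __init__(self, name, capacity_blocks):
        self.name, self.capacity = name, capacity_blocks
        self.blocks = {}  # block_no -> data

    def free(self):
        return self.capacity - len(self.blocks)

class TieredStore:
    def __init__(self):
        # Ordered fastest to slowest.
        self.tiers = [Device("network-ram", 64), Device("ssd", 256), Device("hdd", 4096)]
        self.location = {}  # block_no -> Device

    def write(self, block_no, data):
        """Placement policy: first tier with free space, preferring fast tiers."""
        old = self.location.pop(block_no, None)
        if old is not None:
            old.blocks.pop(block_no, None)
        for dev in self.tiers:
            if dev.free() > 0:
                dev.blocks[block_no] = data
                self.location[block_no] = dev
                return
        raise IOError("backing store full")

    def read(self, block_no):
        return self.location[block_no].blocks[block_no]

    def migrate_down(self, block_no):
        """Demote a block one tier, e.g. when fast tiers run low."""
        dev = self.location[block_no]
        lower = self.tiers[self.tiers.index(dev) + 1]
        lower.blocks[block_no] = dev.blocks.pop(block_no)
        self.location[block_no] = lower

store = TieredStore()
for b in range(100):       # the first 64 blocks land in network RAM, the rest on SSD
    store.write(b, b"x" * 4096)
store.migrate_down(0)      # a cold block is demoted from network RAM to SSD
print(store.location[0].name, store.location[99].name)
```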