We develop OrcFS, an Orchestrated File System for Flash storage. OrcFS vertically integrates the log-structured file system and the Flash-based storage device, eliminating the redundancies across the layers. A few modern file systems adopt sophisticated append-only data structures to manage their space, in an effort to optimize their behavior with respect to the append-only nature of the Flash medium. While the benefit of adopting an append-only data structure seems fairly promising, it leaves the stack of software layers full of unnecessary redundancies, and thus substantial room for improvement. The redundancies include (i) redundant levels of indirection (address translation), (ii) duplicate efforts to reclaim invalid blocks, called segment cleaning in the file system layer and garbage collection in the storage device, and (iii) excessive over-provisioning, i.e., separate over-provisioning areas in each layer. OrcFS eliminates these redundancies by distributing address translation, segment cleaning (or garbage collection), bad block management, and wear-leveling across the layers. OrcFS reduces the device mapping table requirement to 1/465 of that of page mapping and eliminates one fourth of the write volume under heavy random write workloads. On the varmail workload, OrcFS achieves a 56% performance gain over EXT4.
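To see why coarsening the mapping granularity shrinks the device mapping table so dramatically, consider the back-of-the-envelope sketch below. All parameters (1 TiB device, 4 KiB pages, 2 MiB segments, 4-byte entries) are illustrative assumptions; OrcFS's reported 1/465 ratio comes from its own configuration, which this sketch does not reproduce.

```python
# Rough arithmetic for FTL mapping table sizes under page-level vs.
# segment-level mapping. All parameters below are assumed values.
DEVICE_BYTES = 1 << 40   # 1 TiB device
PAGE = 4 << 10           # 4 KiB mapping unit (page mapping)
SEGMENT = 2 << 20        # 2 MiB mapping unit (coarse, segment-level mapping)
ENTRY = 4                # bytes per mapping table entry

page_table = DEVICE_BYTES // PAGE * ENTRY     # one entry per page
seg_table = DEVICE_BYTES // SEGMENT * ENTRY   # one entry per segment

print(f"page-mapping table:    {page_table >> 20} MiB")      # 1024 MiB
print(f"segment-mapping table: {seg_table >> 10} KiB")       # 2048 KiB
print(f"reduction factor:      {page_table // seg_table}x")  # 512x here
```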
Successfully integrating cloud storage as a primary storage layer in the I/O stack is highly challenging, due to two inherent critical issues: the high latency of cloud I/Os and the unconventional pricing model of cloud storage. Caching is a crucial technology for minimizing both the latency and the price of cloud I/Os. Unfortunately, current cloud caching schemes are designed with miss reduction as the sole objective, ignoring the fact that different cache misses can have distinct effects in terms of latency and monetary cost. In this paper, we present a cost-aware caching scheme specifically designed for cloud storage, called GDS-LC. The proposed scheme offers a comprehensive cache design by considering not only access locality but also the associated latency and price. With GDS-LC, we can effectively filter out the high-latency and high-price cloud I/Os and thus reshape the cloud I/O stream into the desired low-latency, low-cost pattern. We have built a prototype to emulate a typical cloud client cache and evaluated the GDS-LC scheme with Amazon Simple Storage Service (S3) in three different scenarios: local cloud, Internet cloud, and heterogeneous cloud. Our experimental results are highly promising.
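To make the cost-aware idea concrete, here is a minimal sketch of a GreedyDual-Size-style policy whose miss penalty blends fetch latency and monetary price. It is an illustration under assumptions, not GDS-LC itself: the real scheme manages separate latency-aware and price-aware cache regions rather than the single weighted sum (`alpha`) used here.

```python
import heapq

class CostAwareCache:
    """GreedyDual-Size variant: priority = L + miss_penalty / size."""

    def __init__(self, capacity_bytes, alpha=0.5):
        self.capacity = capacity_bytes
        self.alpha = alpha       # hypothetical latency-vs-price weight
        self.used = 0
        self.inflation = 0.0     # the classic GreedyDual aging value "L"
        self.table = {}          # key -> [priority, size]
        self.heap = []           # (priority, key) min-heap, lazy deletion

    def _penalty(self, latency_s, price_usd):
        # Blend the two miss costs; GDS-LC proper keeps them separate.
        return self.alpha * latency_s + (1.0 - self.alpha) * price_usd

    def access(self, key, size, latency_s, price_usd):
        """Return True on a hit; on a miss, admit the object after
        evicting whatever is cheapest to re-fetch."""
        if key in self.table:                     # hit: refresh priority
            prio = self.inflation + self._penalty(latency_s, price_usd) / size
            self.table[key][0] = prio
            heapq.heappush(self.heap, (prio, key))
            return True
        while self.used + size > self.capacity and self.heap:
            p, victim = heapq.heappop(self.heap)
            entry = self.table.get(victim)
            if entry is None or entry[0] != p:
                continue                          # stale heap record
            self.inflation = p                    # GreedyDual aging step
            self.used -= entry[1]
            del self.table[victim]
        prio = self.inflation + self._penalty(latency_s, price_usd) / size
        self.table[key] = [prio, size]
        heapq.heappush(self.heap, (prio, key))
        self.used += size
        return False

# Usage: priorities rise with re-fetch latency and price, fall with size.
cache = CostAwareCache(capacity_bytes=1 << 20)
cache.access("obj-a", 4096, latency_s=0.020, price_usd=0.0)
cache.access("obj-b", 65536, latency_s=0.250, price_usd=4e-6)
```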
Emerging byte-addressable non-volatile memory (NVRAM) is expected to replace block storage devices as an alternative low-latency persistent storage device. If NVRAM is used as a persistent storage device, a cache line, instead of a disk page, becomes the unit of data transfer, consistency, and durability. In this work, we design and develop clfB-tree, a B-tree structure whose tree node fits in a single cache line. We employ the existing write-combining store buffer and restricted transactional memory (RTM) to provide a failure-atomic cache line write operation. Using such failure-atomic cache line writes, we atomically update a clfB-tree node via a single cache line flush instruction, without major changes in hardware. However, many processors do not provide a software interface for transactional memory. For those processors, our proposed clfB-tree achieves atomicity and consistency via in-place updates, which require at most four cache line flushes. We evaluate the performance of clfB-tree on an NVRAM emulation board with an ARM Cortex-A9 processor and on a workstation with an Intel Xeon E7-4809 v3 processor. Our experimental results show that clfB-tree outperforms wB-tree and CDDS B-tree by a large margin in terms of both insertion and search performance.
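To illustrate the single-cache-line constraint, here is a minimal sketch of packing a B-tree node into exactly 64 bytes, which is what allows one cache line flush to persist a whole node update. The field layout (one 8-byte header plus seven 8-byte slots) is a hypothetical encoding for illustration, not the paper's actual node format, and the durability step (clflush/clwb plus a fence on x86) is only noted in comments, since Python cannot issue it.

```python
import struct

CACHE_LINE = 64   # persistence granularity the design builds on
SLOTS = 7         # 64 B = 8 B header + 7 x 8 B key/child slots

def pack_node(is_leaf, nkeys, slots):
    """Pack a (hypothetical) node image that fits one cache line."""
    assert len(slots) == SLOTS and nkeys <= SLOTS
    header = (int(is_leaf) << 63) | nkeys     # tag bit + key count
    line = struct.pack("<8Q", header, *slots)
    assert len(line) == CACHE_LINE
    # On real hardware, a single clflush/clwb + fence of this line
    # would make the whole node update durable atomically.
    return line

def unpack_node(line):
    header, *slots = struct.unpack("<8Q", line)
    return bool(header >> 63), header & 0x7FFFFFFFFFFFFFFF, slots

leaf = pack_node(True, 3, [10, 20, 30, 0, 0, 0, 0])
assert unpack_node(leaf) == (True, 3, [10, 20, 30, 0, 0, 0, 0])
```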
LSM-trees have been widely used in data management production systems for write-intensive workloads. However, when read and write workloads co-exist under an LSM-tree, data accesses can experience long latency and low throughput due to interference with buffer caching from compaction, a major and frequent operation in LSM-trees. After a compaction, existing data blocks are reorganized and written to other locations on disk. As a result, related data blocks that have been loaded into the buffer cache are invalidated, since their referencing addresses have changed, causing serious performance degradation. To re-enable high-speed buffer caching during intensive writes, we propose the Log-Structured buffered-Merge tree (LSbM-tree for short), which adds a compaction buffer on disk to minimize the buffer cache invalidations caused by compactions. The compaction buffer efficiently and adaptively maintains the frequently visited data sets. In LSbM, objects with strong locality can be kept in the buffer cache with minimal or no harmful invalidations. With the help of a small on-disk compaction buffer, LSbM achieves high query performance by enabling effective buffer caching, while retaining all the merits of LSM-trees for write-intensive data processing and providing high disk bandwidth for range queries.
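The read path below is a minimal sketch of this idea, under the assumption that lookups check the buffer cache first, then the on-disk compaction buffer (which retains hot pre-compaction data at stable addresses), and finally the regular LSM levels. The names and structures are illustrative, not LSbM's implementation.

```python
def lsbm_get(key, buffer_cache, compaction_buffer, lsm_levels):
    """Illustrative LSbM-style lookup; dicts stand in for on-disk tables."""
    # 1. Buffer cache: entries stay valid across compactions because
    #    the blocks they cache still exist in the compaction buffer.
    if key in buffer_cache:
        return buffer_cache[key]
    # 2. Compaction buffer: frequently visited data kept at its
    #    pre-compaction location, so cached addresses stay stable.
    for table in compaction_buffer:
        if key in table:
            buffer_cache[key] = table[key]   # promote on hit
            return table[key]
    # 3. Regular LSM levels, searched newest-first as usual.
    for level in lsm_levels:
        if key in level:
            buffer_cache[key] = level[key]
            return level[key]
    return None

# Example: the hot key is served from the compaction buffer even
# though compaction has relocated it within the LSM levels.
print(lsbm_get("hot", {}, [{"hot": 1}], [{"hot": 1, "cold": 2}]))  # 1
```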
Since both NAND flash memory manufacturers and users are turning their attention from planar architectures toward 3D architectures, it becomes more critical and urgent to gain a strong understanding of the characteristics of 3D NAND flash memory. These characteristics, especially those that differ from planar NAND flash, are the foundation of efficient flash management. In this paper, we characterize a state-of-the-art 3D floating-gate NAND flash memory through comprehensive experiments on an FPGA-based 3D NAND flash evaluation platform. We then present distinct observations on its performance and reliability, such as operation latencies and various error patterns, and carefully analyze them from physical and circuit-level perspectives. Although 3D NAND flash provides much higher storage density than planar NAND flash, it faces new performance challenges in garbage collection overhead and program performance variation, and more complicated reliability issues, such as distinct location dependence and value dependence of errors. We also summarize how 3D NAND flash differs from planar NAND flash and discuss the design implications for flash memory management brought by this architectural innovation. We believe that our work will facilitate the development of novel 3D NAND flash-oriented designs that achieve better performance and reliability.
IBM Spectrum Scale's parallel file system, the General Parallel File System (GPFS), has a 20-year development history with over 100 contributing developers. Its ability to support strict POSIX semantics across more than 10K clients leads to a complex design with intricate interactions between the cluster nodes. Tracing has proven to be a vital tool for understanding the behavior and the anomalies of such a complex software product. However, the necessary trace information is often buried in hundreds of gigabytes of byproduct trace records. Further, the overhead of tracing can significantly impact running applications and file system performance, limiting the use of tracing in production systems. In this article, we discuss the evolution of the mature and highly scalable GPFS tracing tool and describe the process of designing GPFS's new tracing interface, FlexTrace, which allows developers and users to specify precisely what to trace for the problem they are trying to solve. We evaluate our methodology and prototype, demonstrating that the proposed approach has negligible overhead even under intensive I/O workloads.
Accurately modeling drive-managed SMR disks is challenging, requiring an array of approaches that includes both existing disk modeling techniques and new techniques for inferring the algorithms of the internal translation layer. In this work, we present the first predictive simulation model of a generally available drive-managed SMR disk. Despite the unknown proprietary algorithms used in this device, our model, derived purely from external measurements, predicts mean latency to within a few percent and achieves an RMS cumulative latency error of 25% or less for most workloads tested. These errors, although not small, are in most cases less than three times the drive-to-drive variation observed among seemingly identical drives.
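The RMS figure is best read as a relative error over the cumulative latency curve. Assuming the conventional definition (the paper's exact normalization may differ), with measured cumulative latency $C_i$ and simulated cumulative latency $\hat{C}_i$ after the $i$-th request:

$$\mathrm{RMS\ error} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\frac{\hat{C}_i - C_i}{C_i}\right)^2} \le 0.25$$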
We introduce new methods to replay intensive block I/O workloads more accurately. These methods can be used to reproduce realistic workloads for benchmarking, performance validation, and tuning of a high-performance block storage device or system. In this paper, we study several sources of uncertainty that the stock operating system introduces into workload replay. Based on remedies for these findings, we design and develop a new replay tool called hfplayer that replays intensive block I/O workloads more accurately in a similar, unscaled environment. To replay a given workload trace in a scaled environment with a faster storage device or host server, the dependencies between I/O requests become crucial, since the timing and ordering of I/O requests are expected to change according to these dependencies. Therefore, we propose a heuristic method for speculating I/O dependencies in a block I/O trace. Using the resulting dependency graph, hfplayer propagates I/O-related performance gains along the I/O dependency chains and mimics the original application behavior when it executes in a scaled environment. We evaluate hfplayer with a wide range of workloads using several accuracy metrics and find that it produces better accuracy than other replay approaches.
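The following is a minimal sketch of this kind of dependency speculation: if a request is issued shortly after an earlier request completes, the replay conservatively treats it as dependent on that completion. The proximity threshold and the nearest-completed-predecessor rule are illustrative assumptions, not hfplayer's exact heuristic.

```python
from collections import namedtuple

IO = namedtuple("IO", "id issue complete")   # timestamps in microseconds

def speculate_dependencies(trace, think_cap_us=500):
    """Return edges (i, j): replay request j only after i completes."""
    edges = []
    for j, req in enumerate(trace):
        # Scan backwards for the nearest request whose completion
        # plausibly triggered this issue (small non-negative gap).
        for i in range(j - 1, -1, -1):
            gap = req.issue - trace[i].complete
            if 0 <= gap <= think_cap_us:
                edges.append((trace[i].id, req.id))
                break
    return edges

# IO 3 is issued 30us after IO 2 completes -> speculated dependency.
trace = [IO(1, 0, 100), IO(2, 10, 90), IO(3, 120, 300)]
print(speculate_dependencies(trace))   # [(2, 3)]
```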
Persistent memory provides data persistence in main memory with emerging non-volatile main memories (NVMMs). Recent persistent memory file systems aggressively use direct access, which copies data directly between the user buffer and the storage layer, to avoid the double-copy overhead of the OS page cache. However, we observe that they all suffer from slow writes, due to the asymmetric read-write performance of NVMMs and their much lower performance compared to DRAM. In this paper, we propose HiNFS, a high-performance file system for non-volatile main memory that combines buffering and direct access for fine-grained file system operations. HiNFS uses an NVMM-aware Write Buffer to buffer lazy-persistent file writes in DRAM, while performing direct access to NVMM for eager-persistent file writes. It reads file data directly from both DRAM and NVMM, ensuring read consistency with a combination of a DRAM Block Index and a Cacheline Bitmap that track the latest data between DRAM and NVMM. HiNFS also employs a Buffer Benefit Model to identify eager-persistent file writes before issuing I/Os. Evaluations show that HiNFS significantly improves throughput by up to 184% and reduces execution time by up to 64% compared with the state-of-the-art persistent memory file systems PMFS and EXT4-DAX.
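A minimal sketch of this write path follows. The benefit test is a hypothetical stand-in for the Buffer Benefit Model, whose actual inputs the abstract does not enumerate; the latency constants and the overwrite statistic are assumed values.

```python
NVMM_WRITE_NS = 300   # assumed: NVMM writes cost more than DRAM writes
DRAM_WRITE_NS = 100

class HybridWritePath:
    """Illustrative HiNFS-style split between buffered and direct writes."""

    def __init__(self):
        self.dram_buffer = {}   # (file, offset) -> data, persisted lazily
        self.nvmm = {}          # stands in for direct-access (DAX) NVMM

    def write(self, file, offset, data, expected_overwrites):
        if self._buffer_benefit(len(data), expected_overwrites) > 0:
            self.dram_buffer[(file, offset)] = data   # lazy-persistent
        else:
            self.nvmm[(file, offset)] = data          # eager-persistent

    def _buffer_benefit(self, size, expected_overwrites):
        # Hypothetical model: buffering wins when the NVMM writes it
        # absorbs (overwrites coalesce in DRAM) outweigh the extra
        # DRAM copy plus the eventual flush to NVMM.
        saved = expected_overwrites * NVMM_WRITE_NS * size
        cost = DRAM_WRITE_NS * size
        return saved - cost

    def fsync(self, file):
        # Eager-persistence point: flush staged writes to NVMM.
        for (f, off), data in list(self.dram_buffer.items()):
            if f == file:
                self.nvmm[(f, off)] = data
                del self.dram_buffer[(f, off)]
```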
This paper presents an analysis of drive workloads from enterprise storage systems. The drive workloads were obtained from field-return units from a cross-section of enterprise storage system vendors and thus provide a view of workload characteristics over a wide spectrum of end-user applications. The workload parameters characterized include transfer lengths, access patterns, locality, and throughput. The study shows that reads are the dominant workload, accounting for 80% of the accesses to the drive. Writes are dominated by short-block random accesses, while reads range from random to highly sequential. A trend analysis over the period 2010-2014 shows that the workload has remained fairly constant even as the capacities of the drives shipped have steadily increased. The study also shows that the data stored on disk drives is relatively cold: on average, less than 4% of the drive capacity is accessed in a given two-hour interval.
The Log-Structured Merge tree (LSM-tree) has been one of the mainstream indexes in key-value systems supporting a variety of write-intensive Internet applications in today's data centers. However, the performance of the LSM-tree is seriously hampered by constantly occurring compaction procedures, which incur significant write amplification and degrade write throughput. To alleviate the performance degradation caused by compactions, we introduce the light-weight compaction tree (LWC-tree), a variant of the LSM-tree index optimized to minimize write amplification and maximize system throughput. Light-weight compaction drastically decreases write amplification by appending data to a table and merging only the metadata, which is much smaller. We have implemented three key-value LWC-stores based on the LWC-tree on different storage media. The LWC-store is particularly well suited to SMR drives, as it eliminates the multiplicative I/O amplification from both LSM-trees and SMR drives. Owing to its light-weight compaction procedure, LWC-store reduces write amplification by up to 5× compared with the popular LevelDB key-value store. Moreover, the random write throughput of the LWC-tree on SMR drives improves by 467% even compared with LevelDB on conventional HDDs. Furthermore, the LWC-tree has wide applicability and delivers impressive performance improvements under various conditions.
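The contrast with conventional compaction can be sketched as follows: rather than reading, merging, and rewriting overlapping data, a light-weight compaction appends the incoming segments to the victim table as-is and merges only the per-segment key-range metadata. The structures below are illustrative, not LWC-tree's on-disk format.

```python
class Table:
    """One level's table: appended data segments plus small metadata."""

    def __init__(self):
        self.segments = []   # runs of sorted (key, value) pairs, append-only
        self.metadata = []   # per-segment (min_key, max_key), merged cheaply

    def append_segment(self, kvs):
        self.segments.append(kvs)                      # sequential write only
        self.metadata.append((kvs[0][0], kvs[-1][0]))

def lightweight_compact(upper, victim):
    # No read-modify-write of the victim's data: data moves once, so
    # write amplification stays low and writes stay sequential (which
    # is what suits SMR drives).
    for seg in upper.segments:
        victim.append_segment(seg)
    upper.segments.clear()
    upper.metadata.clear()

def lookup(table, key):
    # Metadata narrows the search to segments whose range covers key.
    for (lo, hi), seg in zip(table.metadata, table.segments):
        if lo <= key <= hi:
            for k, v in seg:
                if k == key:
                    return v
    return None

# Example: compact one table into the next level without rewriting it.
upper, victim = Table(), Table()
victim.append_segment([("a", 1), ("m", 2)])
upper.append_segment([("b", 3), ("z", 4)])
lightweight_compact(upper, victim)
print(lookup(victim, "z"))   # 4
```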