Successfully integrating cloud storage as a primary storage layer in the I/O stack is highly challenging due to two inherent critical issues: the high latency of cloud I/Os and the unconventional pricing model of cloud storage. Caching is a crucial technology for minimizing the latency and monetary cost of cloud I/Os. Unfortunately, current cloud caching schemes adopt miss reduction as the sole objective, ignoring the fact that different cache misses can have distinct actual effects in terms of latency and monetary cost. In this paper, we present a cost-aware caching scheme specifically designed for cloud storage, called GDS-LC. The proposed scheme offers a comprehensive cache design by considering not only access locality but also the associated latency and price. With GDS-LC, we can effectively filter out high-latency and high-price cloud I/Os and thus reshape the cloud I/O stream into the desired low-latency and low-cost pattern. We have built a prototype to emulate a typical cloud client cache and evaluate the GDS-LC scheme with Amazon Simple Storage Service (S3) in three different scenarios: local cloud, Internet cloud, and heterogeneous cloud. Our experimental results are very promising.
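As an illustration of how latency and price can be folded into an eviction decision, the sketch below adapts the classic GreedyDual-Size priority to a cloud setting; the cost function, weights, and all identifiers are illustrative assumptions, not the actual GDS-LC design.

```python
# Illustrative sketch of a GreedyDual-Size-style cost-aware eviction priority
# that combines retrieval latency and monetary price per object. All names and
# the exact cost function are assumptions for illustration, not GDS-LC itself.

class CostAwareCache:
    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.used = 0
        self.inflation = 0.0          # global aging value L, as in GreedyDual-Size
        self.entries = {}             # key -> (priority, size)

    def _priority(self, latency_s, price_usd, size_bytes,
                  latency_weight=1.0, price_weight=1.0):
        # Higher miss penalty (latency and price) per byte -> keep the object longer.
        cost = latency_weight * latency_s + price_weight * price_usd
        return self.inflation + cost / size_bytes

    def access(self, key, size, latency_s, price_usd):
        if key in self.entries:
            _, sz = self.entries[key]
            self.entries[key] = (self._priority(latency_s, price_usd, sz), sz)
            return "hit"
        while self.used + size > self.capacity and self.entries:
            victim = min(self.entries, key=lambda k: self.entries[k][0])
            self.inflation = self.entries[victim][0]   # age the remaining entries
            self.used -= self.entries[victim][1]
            del self.entries[victim]
        self.entries[key] = (self._priority(latency_s, price_usd, size), size)
        self.used += size
        return "miss"
```

Under such a policy, objects that are cheap and fast to re-fetch are evicted first, which is one way the I/O stream could be reshaped toward low-latency, low-cost misses.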
In this paper, we design and implement a cooperative shingle-aware file system, called CosaFS, on heterogeneous storage devices that combine solid-state drives (SSDs) and shingled magnetic recording (SMR) technology to improve the overall performance of storage systems. The basic idea of CosaFS is to classify objects as hot or cold based on a proposed Lookahead with Recency Weight (LRW) scheme. An object identified as hot (small) is served by the SSD; cold (large) objects are stored on the SMR drive. On an SMR drive, large objects can be accessed in large sequential blocks, making their access performance comparable to accessing the same large sequential blocks on a conventional hard drive. Small objects, such as inodes and directories, are stored on the SSD, where seeks for such objects are nearly free. Through thorough empirical studies, we demonstrate that CosaFS, with its metadata separation and cache assistance, is a very effective way to handle the disk-based data demanded by shingled writes, and that it outperforms device-side and host-side shingle-aware file systems in terms of throughput, IOPS, and access latency.
This paper proposes a frequency-based cache admission policy to boost the effectiveness of caches subject to skewed access distributions. Given a newly accessed item and an eviction candidate from the cache, our scheme decides, based on the recent access history, whether it is worth admitting the new item into the cache at the expense of the eviction candidate. Realizing this concept is enabled by a novel approximate LFU structure called TinyLFU, which maintains an approximate representation of the access frequency of a large sample of recently accessed items. TinyLFU is very compact and lightweight as it builds upon Bloom filter theory. We study the properties of TinyLFU through simulations of both synthetic workloads and multiple real traces from several sources. These simulations demonstrate the performance boost obtained by enhancing various replacement policies with the TinyLFU admission policy. In addition, a new combined replacement and admission policy scheme nicknamed W-TinyLFU is presented. W-TinyLFU is demonstrated to obtain equal or better hit ratios than other state-of-the-art replacement policies on these traces, and it is the only scheme to obtain such good results on all of them.
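To make the admission idea concrete, the sketch below compares the approximate recent frequency of the newly accessed item against that of the eviction candidate using a count-min sketch; sketch dimensions, hashing, and the aging step are illustrative assumptions rather than TinyLFU's exact construction.

```python
# Minimal sketch of a frequency-based admission filter in the spirit of TinyLFU.
# A count-min sketch approximates recent access frequencies; a new item is
# admitted only if it appears more frequent than the eviction candidate.

import hashlib

class CountMinSketch:
    def __init__(self, width=1024, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _indexes(self, item):
        for row in range(self.depth):
            h = hashlib.blake2b(f"{row}:{item}".encode(), digest_size=8)
            yield row, int.from_bytes(h.digest(), "big") % self.width

    def add(self, item):
        for row, col in self._indexes(item):
            self.table[row][col] += 1

    def estimate(self, item):
        return min(self.table[row][col] for row, col in self._indexes(item))

    def age(self):
        # Periodic halving keeps the sketch focused on recent history.
        for row in self.table:
            for i in range(len(row)):
                row[i] //= 2

def admit(sketch, candidate, victim):
    """Admit the newly accessed item only if its recent frequency
    exceeds that of the cache's eviction candidate."""
    return sketch.estimate(candidate) > sketch.estimate(victim)
```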
Repair performance in hierarchical data centers is often bottlenecked by cross-rack network transfer. Recent theoretical results show that the cross-rack repair traffic can be minimized through repair layering, whose idea is to partition a repair operation into inner-rack and cross-rack layers. However, how repair layering should be implemented and deployed in practice remains an open issue. In this paper, we address this issue by proposing a practical repair layering framework called DoubleR. We design two families of practical double regenerating codes (DRC), which not only minimize the cross-rack repair traffic but also have several practical properties that improve upon state-of-the-art regenerating codes. We implement and deploy DoubleR atop the Hadoop Distributed File System (HDFS) and show that DoubleR maintains the theoretical guarantees of DRC and improves the repair performance of regenerating codes in both node recovery and degraded read operations.
Hardware consolidation in the datacenter often leads to scalability bottlenecks from heavy utilization of critical resources, such as storage and network bandwidth. Host-side caching on durable media is already applied at the block level to reduce the storage backend load, but it has received criticism for added overhead, restricted sharing, and possible data loss at client crash. We introduce a journal to the kernel-level client of an object-based distributed filesystem in order to improve durability at high I/O performance and reduced shared-resource utilization. Storage virtualization at the file interface achieves clear consistency semantics across data and metadata, supports native file sharing among clients, and provides flexible configuration of durable data staging at the host. Using a prototype that we have implemented, we experimentally quantify the performance and efficiency of the proposed Arion system in comparison to a production system. We run microbenchmarks and application-level workloads over a local cluster and a public cloud. We demonstrate a 60% reduction in latency and performance improvements of up to 150%, while reducing server network and disk bandwidth by 41% and 77%, respectively. The performance improvement reaches 92% for 16 relational databases as clients and gets as high as 11.3x with 2 key-value stores as clients.
In multi-level cell (MLC) NAND flash memory, two logical pages are overlapped on a single physical page. Even after a logical page is programmed, its data can be corrupted if the programming of the coexisting logical page is interrupted. This phenomenon is called paired page interference. This paper proposes a novel software technique, called GCMix, to deal with paired page interference without any additional hardware or extra page writes. The proposed technique utilizes valid pages in the victim block during garbage collection (GC) as the backup against the interference, and pairs them with incoming pages written by the host. This approach eliminates the undesirable page copies otherwise needed to back up data against the interference. However, such a strategy has an adverse effect on the hot/cold separation policy, which is essential to improving the efficiency of GC. To limit this downside, we devise a metric to estimate the benefit of GCMix on-the-fly so that GCMix can be adaptively utilized only when the benefit outweighs the overhead. Evaluations using synthetic and real workloads show that GCMix can effectively deal with paired page interference, reducing the write amplification factor by up to 17.3% compared to the traditional technique while providing comparable I/O performance.
Emerging byte-addressable non-volatile memory (NVRAM) is expected to replace block storage devices as an alternative low-latency persistent storage device. If NVRAM is used as a persistent storage device, a cache line, instead of a disk page, becomes the unit of data transfer, consistency, and durability. In this work, we design and develop clfB-tree, a B-tree structure whose tree node fits in a single cache line. We employ the existing write-combining store buffer and restricted transactional memory (RTM) to provide a failure-atomic cache line write operation. Using these failure-atomic cache line writes, we atomically update a clfB-tree node via a single cache line flush instruction without major hardware changes. However, many processors do not provide a software interface for transactional memory. For those processors, our proposed clfB-tree achieves atomicity and consistency via in-place updates, which require at most four cache line flushes. We evaluate the performance of clfB-tree on an NVRAM emulation board with an ARM Cortex-A9 processor and on a workstation with an Intel Xeon E7-4809 v3 processor. Our experimental results show that clfB-tree outperforms wB-tree and CDDS B-tree by a large margin in terms of both insertion and search performance.
Flash storage has become the mainstream destination for storage users. However, SSDs do not always deliver the performance that users expect. The core culprit of flash performance instability is the well-known garbage collection (GC) process, which causes long delays as the SSD cannot serve (i.e., blocks) incoming I/Os, inducing the long tail latency problem. We present ttFlash as a solution to this problem. ttFlash is a tiny-tail flash drive (SSD) that eliminates GC-induced tail latencies by circumventing GC-blocked I/Os with four novel strategies: plane-blocking GC, rotating GC, GC-tolerant read, and GC-tolerant flush. These four strategies leverage the timely combination of modern SSD internal technologies such as a powerful controller, parity-based redundancy (RAIN), and capacitor-backed RAM. Our strategies depend on the use of intra-plane copyback operations. Through an extensive evaluation, we show that ttFlash comes significantly close to a no-GC scenario. Specifically, between the 99th and 99.99th percentiles, ttFlash is only 1.0 to 2.6x slower than the no-GC case, while a base approach suffers from 5 to 138x GC-induced slowdowns.
IBM Spectrum Scale's parallel file system, the General Parallel File System (GPFS), has a 20-year development history with over 100 contributing developers. Its ability to support strict POSIX semantics across more than 10K clients leads to a complex design with intricate interactions between the cluster nodes. Tracing has proven to be a vital tool for understanding the behavior and the anomalies of such a complex software product. However, the necessary trace information is often buried in hundreds of gigabytes of byproduct trace records. Further, the overhead of tracing can significantly impact running applications and file system performance, limiting the use of tracing in a production system. In this article, we discuss the evolution of the mature and highly scalable GPFS tracing tool and describe the process of designing GPFS's new tracing interface, FlexTrace, which allows developers and users to accurately specify what to trace for the problem they are trying to solve. We evaluate our methodology and prototype, demonstrating that the proposed approach has negligible overhead even under intensive I/O workloads.
Accurately modeling drive-managed SMR disks is a challenge, requiring an array of approaches that includes both existing disk modeling techniques and new techniques for inferring internal translation layer algorithms. In this work, we present the first predictive simulation model of a generally available drive-managed SMR disk. Despite the use of unknown proprietary algorithms in this device, our model, derived from external measurements, is able to predict mean latency within a few percent, with an RMS cumulative latency error of 25% or less for most workloads tested. These variations, although not small, are in most cases less than three times the drive-to-drive variation seen among seemingly identical drives.
We introduce new methods to replay intensive block I/O workloads more accurately. These methods can be used to reproduce realistic workloads for benchmarking, performance validation, and tuning of a high-performance block storage device or system. In this paper, we study several sources of uncertainty in workload replay introduced by the stock operating system. Based on remedies for these findings, we design and develop a new replay tool called hfplayer that replays intensive block I/O workloads with greater accuracy in a similar, unscaled environment. To replay a given workload trace in a scaled environment with faster storage or a faster host server, the dependencies between I/O requests become crucial, since the timing and ordering of I/O requests are expected to change according to these dependencies. Therefore, we propose a heuristic way of speculating I/O dependencies in a block I/O trace. Using the generated dependency graph, hfplayer propagates I/O-related performance gains along the I/O dependency chains and mimics the original application behavior when it executes in a scaled environment. We evaluate hfplayer with a wide range of workloads using several accuracy metrics and find that it produces better accuracy than other replay approaches.
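To make the idea of speculating I/O dependencies concrete, the sketch below infers a dependency edge whenever a request is issued shortly after an earlier request completes; the time-window heuristic, trace record format, and names are illustrative assumptions, not hfplayer's actual algorithm.

```python
# Illustrative sketch: speculate dependencies in a block I/O trace by assuming
# that a request issued within a short "think time" after an earlier request's
# completion may depend on it.

from dataclasses import dataclass, field
from typing import List

@dataclass
class IORequest:
    rid: int
    issue_time: float       # seconds
    complete_time: float    # seconds
    parents: List[int] = field(default_factory=list)

def speculate_dependencies(trace, think_time=0.0005):
    """Add a dependency edge A -> B when B was issued within `think_time`
    seconds after A completed."""
    completed = []                                   # (complete_time, rid)
    for req in sorted(trace, key=lambda r: r.issue_time):
        for done_time, done_rid in completed:
            if done_time <= req.issue_time <= done_time + think_time:
                req.parents.append(done_rid)
        completed.append((req.complete_time, req.rid))
    return trace

# During replay, a request would be issued only after all of its speculated
# parents complete, so that performance gains propagate along the chains.
```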
Persistent memory provides data persistence at the main memory level with emerging non-volatile main memories (NVMMs). Recent persistent memory file systems aggressively use direct access, which copies data directly between the user buffer and the storage layer, to avoid the double-copy overhead through the OS page cache. However, we observe that they all suffer from slow writes due to NVMM's asymmetric read-write performance and its much slower performance than DRAM. In this paper, we propose HiNFS, a high-performance file system for non-volatile main memory, which combines buffering and direct access for fine-grained file system operations. HiNFS uses an NVMM-aware Write Buffer to buffer lazy-persistent file writes in DRAM, while performing direct access to NVMM for eager-persistent file writes. It reads file data directly from both DRAM and NVMM, ensuring read consistency with a combination of a DRAM Block Index and a Cacheline Bitmap that track the latest data between DRAM and NVMM. HiNFS also employs a Buffer Benefit Model to identify eager-persistent file writes before issuing I/Os. Evaluations show that HiNFS significantly improves throughput by up to 184% and reduces execution time by up to 64% compared with the state-of-the-art persistent memory file systems PMFS and EXT4-DAX.
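A simplified view of the buffering-versus-direct-access decision described above might look like the following; the classification heuristic, in-memory stand-ins for DRAM and NVMM, and all identifiers are illustrative assumptions rather than HiNFS's actual Buffer Benefit Model.

```python
# Illustrative sketch of a write path that buffers lazy-persistent writes in
# DRAM while sending eager-persistent writes directly to NVMM. Dictionaries
# stand in for the two media; the classification rule is an assumption.

class HybridWritePath:
    def __init__(self):
        self.dram_buffer = {}   # offset -> data, volatile staging area
        self.nvmm = {}          # offset -> data, stands in for persistent NVMM

    def classify(self, write):
        # Assumed stand-in for a buffer-benefit estimate: writes that must be
        # durable immediately gain little from buffering, so they bypass DRAM.
        return "eager" if write.get("needs_immediate_durability") else "lazy"

    def write(self, write):
        if self.classify(write) == "eager":
            self.nvmm[write["offset"]] = write["data"]         # direct access
        else:
            self.dram_buffer[write["offset"]] = write["data"]  # buffered in DRAM

    def flush(self):
        # Lazily persist buffered writes to NVMM, off the critical path.
        self.nvmm.update(self.dram_buffer)
        self.dram_buffer.clear()

    def read(self, offset):
        # Reads consult the DRAM buffer first so the latest data wins.
        return self.dram_buffer.get(offset, self.nvmm.get(offset))
```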
In designing the write buffer and FTL for SSDs, previous studies have tried to increase overall performance through parallel I/O and garbage collection reduction. Recent works have proposed pattern-based management, which uses the request size and read- or write-intensiveness to apply different policies to each type of data. However, the locations of read and write requests are closely related, and the pattern of each type of data can change. In this work, we propose SUPA, a single unified read-write buffer and pattern-change-aware FTL on a multi-channel SSD architecture. To increase both read and write hit ratios on the buffer based on locality, we use a single unified read-write buffer for both clean and dirty blocks. To handle pattern-changed blocks, we add a pattern handler between the buffer and the FTL. To reduce the policy-switching overhead for pattern-changed data, when a pattern change is detected, the pattern handler saves the corresponding data to the two locations handled by the respective policies. Overall, our evaluations show that SUPA achieves up to 2.0 and 3.9 times lower read and write latency, respectively, without loss of lifetime.
The paper presents an analysis of drive workloads from enterprise storage systems. The drive workloads are obtained from field return units from a cross-section of enterprise storage system vendors and thus provide a view of the workload characteristics over a wide spectrum of end-user applications. The workload parameters that have been characterized include transfer lengths, access patterns, locality, and throughput. The study shows that reads are the dominant workload, accounting for 80% of the accesses to the drive. Writes are dominated by short-block random accesses, while reads range from random to highly sequential. A trend analysis over the period 2010-2014 shows that the workload has remained fairly constant even as the capacities of the drives shipped have steadily increased. The study also shows that the data stored on disk drives is relatively cold: on average, less than 4% of the drive capacity is accessed in a given 2-hour interval.
The Log-Structured Merge tree (LSM-tree) has been one of the mainstream indexes in key-value systems supporting a variety of write-intensive Internet applications in today's data centers. However, the performance of the LSM-tree is seriously hampered by constantly occurring compaction procedures, which incur significant write amplification and degrade the write throughput. To alleviate the performance degradation caused by compactions, we introduce the light-weight compaction tree (LWC-tree), a variant of the LSM-tree index optimized for minimizing write amplification and maximizing system throughput. Light-weight compaction drastically decreases write amplification by appending data to a table and merging only the metadata, which is much smaller in size. We have implemented three key-value LWC-stores based on the LWC-tree on different storage media. The LWC-store is particularly optimized for SMR drives, as it eliminates the multiplicative I/O amplification from both LSM-trees and SMR drives. Thanks to the light-weight compaction procedure, the LWC-store reduces write amplification by a factor of up to 5× compared to the popular LevelDB key-value store. Moreover, the random write throughput of the LWC-tree on SMR drives is improved by 467% even compared with LevelDB on conventional HDDs. Furthermore, the LWC-tree has wide applicability and delivers substantial performance improvements under various conditions.
Emerging non-volatile RAM (NVRAM) has a limit on the number of writes that can be made to any cell. This motivates the need for wear-leveling to distribute writes evenly among the cells. Unlike NAND flash, cells in NVRAM can be rewritten without erasing the entire containing block, which avoids the issues of garbage collection and motivates alternative approaches to the problem. In this paper, we propose a hierarchical wear-leveling model called Ouroboros Wear-leveling. Ouroboros uses a two-level strategy whereby frequent, low-cost intra-region wear-leveling at small granularity is combined with inter-region wear-leveling at a larger time interval and granularity. Ouroboros is a hybrid migration scheme that exploits correct demand predictions to make better wear-leveling decisions, while using randomization to avoid attacks by deterministic access patterns. We also propose a way to optimize the wear-leveling parameters to meet a target smoothness level under limited time and space overhead constraints for different memory architectures and trace characteristics. Several experiments are performed on synthetically generated memory traces with special characteristics, block-level storage traces, and memory-line-level memory traces. The results show that Ouroboros Wear-leveling can distribute writes smoothly across the whole NVRAM with up to 0.2% space overhead and 0.52% time overhead for a 512GB memory.
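The two-level structure can be illustrated with a toy scheme in which each region frequently rotates its lines by a small offset while a coarser pass occasionally swaps the hottest and a randomly chosen cold region; the periods, region sizes, swap rule, and names below are illustrative assumptions, not the actual Ouroboros algorithms (data migration on remapping is also omitted).

```python
# Toy two-level wear-leveling sketch: cheap intra-region rotation every
# INTRA_PERIOD writes per region, plus a coarser inter-region swap every
# INTER_PERIOD writes overall. Only the address mapping is modeled here.

import random

LINES_PER_REGION = 64
INTRA_PERIOD = 100        # writes to a region between intra-region rotations
INTER_PERIOD = 10_000     # total writes between inter-region swaps

class TwoLevelWearLeveler:
    def __init__(self, num_regions):
        self.region_map = list(range(num_regions))   # logical -> physical region
        self.rotation = [0] * num_regions            # per-region line rotation
        self.region_writes = [0] * num_regions
        self.total_writes = 0

    def translate(self, logical_addr):
        region, line = divmod(logical_addr, LINES_PER_REGION)
        phys_region = self.region_map[region]
        phys_line = (line + self.rotation[region]) % LINES_PER_REGION
        return phys_region * LINES_PER_REGION + phys_line

    def on_write(self, logical_addr):
        region = logical_addr // LINES_PER_REGION
        self.region_writes[region] += 1
        self.total_writes += 1
        # Level 1: frequent, low-cost rotation within the region.
        if self.region_writes[region] % INTRA_PERIOD == 0:
            self.rotation[region] = (self.rotation[region] + 1) % LINES_PER_REGION
        # Level 2: occasional swap of the hottest region with a randomly chosen
        # cold region (randomization guards against deterministic attacks).
        if self.total_writes % INTER_PERIOD == 0:
            ranked = sorted(range(len(self.region_map)),
                            key=lambda r: self.region_writes[r])
            hot, cold = ranked[-1], random.choice(ranked[:4])
            self.region_map[hot], self.region_map[cold] = (
                self.region_map[cold], self.region_map[hot])
```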
Introduction to the Special Issue on USENIX FAST 2017
Data visualization is a thriving field of computer science, with widespread impact on diverse scientific disciplines, from medicine and meteorology to visual data mining. Advances in large-scale storage systems, as well as in low-level storage technology, have played a significant role in accelerating the applicability and adoption of modern visualization techniques. Ironically, the cobbler's children have no shoes: researchers who wish to analyze storage systems and devices are usually limited to a variety of static histograms and basic displays. The dynamic nature of data movement on flash motivated the introduction of SSDPlayer, a graphical tool for visualizing the various processes that cause data movement on SSDs. In 2015, we used the initial version of SSDPlayer to demonstrate how visualization can assist researchers and developers in their understanding of modern, complex flash-based systems. While we continued to use SSDPlayer for analysis purposes, we found it extremely useful for education and presentation purposes as well. In this paper, we describe our experience from two years of using, sharing, and extending SSDPlayer, and how similar techniques can further advance storage systems research and education.