Systematic Erasure Codes with Optimal Repair Bandwidth and Storage
Emerging byte-addressable non-volatile memory (NVRAM) is expected to replace block device storages as an alternative low latency persistent storage device. If NVRAM is used as a persistent storage device, a cache line instead of a disk page will be the unit of data transfer, consistency, and durability. In this work, we design and develop clfB-tree - a B-tree structure whose tree node fits in a single cache line. We employ existing write combining store buffer and restricted transactional memory (RTM) to provide a failure-atomic cache line write operation. Using the failure-atomic cache line write operations, we atomically update a clfB-tree node via a single cache line flush instruction without major changes in hardware. However, there exist many processors that do not provide SW interface for transactional memory. For those processors, our proposed clfB-tree achieves atomicity and consistency via in-place update, which requires maximum four cache line flushes. We evaluate the performance of clfB-tree on an NVRAM emulation board with ARM Cortex A-9 processor and a workstation that has Intel Xeon E7-4809 v3 processor. Our experimental results show clfB-tree outperforms wB-tree and CDDS B-tree by a large margin in terms of both insertion and search performance
Flash storage has become the mainstream destination for storage users. However, SSDs do not always deliver the performance that users expect. The core culprit of flash performance instability is the well-known garbage collection (GC) process, which causes long delays as the SSD cannot serve (blocks) incoming I/Os, which then induces the long tail latency problem. We present ttFlash as a solution to this problem. ttFlash is a tiny-tail flash drive (SSD) that eliminates GC-induced tail latencies by circumventing GC-blocked I/Os with four novel strategies: plane-blocking GC, rotating GC, GC-tolerant read, and GC-tolerant flush. These four strategies leverage the timely combination of modern SSD internal technologies such as powerful controller, parity-based redundancies (RAIN), and capacitor-backed RAM. Our strategies are dependent on the use of intra-plane copyback operations. Through an extensive evaluation, we show that ttFlash comes significantly close to a no-GC scenario. Specifically, between 9999.99th percentiles, ttFlash is only 1.0 to 2.6x slower than the no-GC case, while a base approach suffers from 5138ms GC-induced slowdowns.
Recent research has shown that applications often incorrectly implement crash consistency. We present the Crash-Consistent File System (ccfs), a file system that improves the correctness of application-level crash consistency protocols while maintaining high performance. A key idea in ccfs is the abstraction of a stream. Within a stream, updates are committed in program order, improving correctness; across streams, there are no ordering restrictions, enabling scheduling flexibility and high performance. We empirically demonstrate that applications running atop ccfs achieve high levels of crash consistency. Further, we show that ccfs performance under standard file-system benchmarks is excellent, in the worst case on par with the highest performing modes of Linux ext4, and in some cases notably better. Overall, we demonstrate that both application correctness and high performance can be realized in a modern file system.
Modern systems use networks extensively, accessing both services and storage across local and remote networks. Latency is a key performance challenge, and packing multiple small operations into fewer large ones is an effective way to amortize that cost, especially after years of significant improvement in bandwidth but not latency. To this end, the NFSv4 protocol supports a compounding feature to combine multiple operations. Yet compounding has been underused since its conception because the synchronous POSIX file-system API issues only one (small) request at a time. We propose vNFS, an NFSv4.1-compliant client that exposes a vectorized high-level API and leverages NFS compound procedures to maximize performance. We designed and implemented vNFS as a user-space RPC library that supports an assortment of bulk operations on multiple files and directories. We found it easy to modify several UNIX utilities, an HTTP/2 server, and Filebench to use vNFS. We evaluated vNFS under a wide range of workloads and network latency conditions, showing that vNFS improves performance even for low-latency networks. On high-latency networks, vNFS can improve performance by as much as two orders of magnitude.
Accurately modeling drive-managed SMR disks is a challenge, requiring an array of approaches including both existing disk modeling techniques as well as new techniques for inferring internal translation layer algorithms. In this work we present the first predictive simulation model of a generally-available drive-managed SMR disk. Despite the use of unknown proprietary algorithms in this device, our model that is derived from external measurements is able to predict mean latency within a few percent, and with an RMS cumulative latency error of 25% or less for most workloads tested. These variations, although not small, are in most cases less than three times the drive-to-drive variation seen among seemingly identical drives.
Persistent memory provides data persistence at main memory with emerging non-volatile main memories (NVMMs). Recent persistent memory file systems aggressively use direct access, which directly copy data between user buffer and the storage layer, to avoid the double-copy overheads through the OS page cache. However, we observe they all suffer from slow writes due to NVMMs asymmetric read-write performance and much slower performance than DRAM. In this paper, we propose HiNFS, a high performance file system for non-volatile main memory, to combine both buffering and direct access for fine-grained file system operations. HiNFS uses an NVMM-aware Write Buffer to buffer the lazy-persistent file writes in DRAM, while performing direct access to NVMM for eager- persistent file writes. It directly reads file data from both DRAM and NVMM, by ensuring read consistency with a combination of the DRAM Block Index and Cacheline Bitmap to track the latest data between DRAM and NVMM. HiNFS also employs a Buffer Benefit Model to identify the eager-persistent file writes before issuing I/Os. Evaluations show that HiNFS significantly improves throughput by up to 184% and reduces execution time by up to 64% comparing with state-of-the-art persistent memory file systems PMFS and EXT4-DAX.
Besides well-known benefits, commodity cloud storage also raises concerns that include security, reliability, and consistency. We present Hybris key-value store, the first robust hybrid cloud storage system, aiming at addressing these concerns leveraging both private and public cloud resources. Hybris robustly replicates metadata on trusted private premises (private cloud), separately from data which is dispersed (using replication or erasure coding) across multiple untrusted public clouds. Hybris maintains metadata stored on private premises at the order of few dozens of bytes per key, avoiding the scalability bottleneck at the private cloud. In turn, the hybrid design allows to efficiently and robustly tolerate cloud outages, but also potential malice in clouds without overhead. Finally, Hybris leverages strong metadata consistency to guarantee to Hybris applications strong data consistency without any modifications to the eventually consistent public clouds. We implemented Hybris in Java and evaluated it using a series of micro and macro-benchmarks. Our results show that Hybris significantly outperforms comparable multi-cloud storage systems and approaches the performance of bare-bone commodity public cloud storage.
To design write buffer and FTL for SSD, previous studies have tried to increase overall performance by parallel I/O and garbage collection reduction. Recent works have proposed pattern-based management, which uses the request size and read- or write-intensiveness to apply different policies to each type of data. However, the locations of read and write requests are closely related, and the pattern of each type of data can be changed. In this work, we propose SUPA, a single unified read-write buffer and pattern-change-aware FTL on multi-channel SSD architecture. To increase both read and write hit ratios on the buffer based on locality, we use a single unified read-write buffer for both clean and dirty blocks. To handle pattern-changed blocks, we add a pattern handler between the buffer and the FTL. To reduce policy switching overhead for pattern-changed data, if pattern change is detected, pattern handler saves the corresponding data to the two locations handled by different policies respectively. In total, our evaluations show that SUPA can get up to 2.0 and 3.9 times less read and write latency, respectively, without loss of lifetime.
Log-Structure Merge tree (LSM-tree) has been one of the mainstream indexes in key-value systems supporting a variety of write-intensive Internet applications in todays data centers. However, the performance of LSM-tree is seriously hampered by constantly occurring compaction procedures, which incur significant write amplification and degrade the write throughput. To alleviate the performance degradation caused by compactions, we introduce a light-weight compaction tree (LWC-tree), a variant of LSM-tree index optimized for minimizing the write amplification and maximizing the system throughput. The light-weight compaction drastically decreases write amplification by appending data in a table and only merging the metadata that has much smaller size. We have implemented three key-value LWC-stores based on the LWC-tree on different storage mediums. The LWC-store is particularly optimized for SMR drives as it eliminates the multiplicative I/O amplification from both LSM-trees and SMR drives. Due to the light-weight compaction procedure, LWC-store reduces the write amplification by a factor of up to 5× compared to the popular LevelDB key-value store. Moreover, the random write throughput of the LWC-tree on SMR drives is significantly improved by 467% even compared with LevelDB on conventional HDDs. Furthermore, LWC-tree has wide applicability and delivers impressive performance improvement in various conditions.
Emerging non-volatile RAM (NVRAM) have a limit on the number of writes that can be made to any cell. This motivates the need for wear-leveling to distribute the writes evenly among the cells. Unlike NAND Flash, cells in NVRAM can be rewritten without erasing the entire containing block, avoiding the issues of garbage collection, motivating alternate approaches to the problem. In this paper, we propose a hierarchical wear-leveling model called Ouroboros Wear-leveling. Ouroboros uses a two-level strategy whereby frequent low-cost intra-region wear-leveling at small granularity is combined with inter-region wear-leveling at a larger time interval and granularity. Ouroboros is a hybrid migration scheme that exploits correct demand predictions in making better wear-leveling decisions, while using randomization to avoid attacks by deterministic access patterns. We also propose a way to optimize wear-leveling parameters to meet a target smoothness level under limited time and space overhead constraints for different memory architectures and trace characteristics. Several experiments are performed on synthetically-generated memory traces with special characteristics, block-level storage traces, and memory-line-level memory traces. The results show that Ouroboros Wear-leveling can distribute writes smoothly across the whole NVRAM with up to 0.2% space overhead and 0.52% time overhead for a 512GB memory.
We analyze how modern distributed storage systems behave in the presence of file-system faults such as data corruption and read and write errors. We characterize eight popular distributed storage systems and uncover numerous problems related to file-system fault tolerance. We find that modern distributed systems do not consistently use redundancy to recover from file-system faults: a single file-system fault can cause catastrophic outcomes such as data loss, corruption, and unavailability. We also find that the above outcomes arise due to fundamental problems in file-system fault handling that are common across many systems. Our results have implications for the design of next generation fault-tolerant distributed and cloud storage systems.
Introduction to the Special Issue on USENIX FAST 2017
NetApp® WAFL® is a transactional file system that uses the copy-on-write mechanism to support fast write performance and efficient snapshot creation. However, copy-on-write increases the demand on the file system to find free blocks quickly, which makes rapid free space reclamation essential. Inability to find free blocks quickly may impede allocations for incoming writes. Efficiency is also important, because the task of reclaiming free space may consume CPU and other resources at the expense of client operations. In this paper, we describe the evolution (over more than a decade) of the WAFL algorithms and data structures for reclaiming space with minimal impact to the overall performance of the storage appliance.