Fail-Slow at Scale: Evidence of Hardware Performance Faults in Large Production Systems
Phase Change Memory (PCM) has drawn great attention as a main memory technology due to its attractive characteristics such as non-volatility, byte-addressability, and in-place update. However, since the capacity of PCM is not yet mature, a hybrid memory architecture that consists of DRAM and PCM has been suggested as main memory. In addition, page replacement algorithms for hybrid memory architectures are actively being studied, because existing page replacement algorithms do not consider the two weaknesses of PCM: high write latency and low endurance. In this paper, to mitigate these hardware limitations of PCM, we revisit the page cache layer for the hybrid memory architecture and propose a novel page replacement algorithm, called M-CLOCK, to improve the performance of the hybrid memory architecture and the lifespan of PCM. In particular, M-CLOCK aims to reduce the number of PCM writes, which negatively affect the performance of the hybrid memory architecture. Experimental results show that M-CLOCK outperforms the state-of-the-art page replacement algorithms in terms of the number of PCM writes and effective memory access time by up to 98% and 9.4 times, respectively.
Introduction to the Special Issue on USENIX FAST 2018
In this paper, we advocate reconsidering the cache system design and directly opening device-level details of the underlying flash storage for key-value caching. We propose an enhanced flash-aware key-value cache manager, which consists of a novel unified address mapping module, an integrated garbage collection policy, dynamic over-provisioning space management, and a customized wear-leveling policy, to directly drive the flash management. A thin intermediate library layer provides a slab-based abstraction of low-level flash memory space and an API for directly and easily operating flash devices. Special flash SSD hardware that exposes physical flash details is adopted to store key-value items. This co-design approach bridges the semantic gap and connects the two layers, which allows us to leverage both the domain knowledge of key-value caches and the unique device properties. In this way, we can maximize the efficiency of key-value caching on flash devices while minimizing its weaknesses. We implemented a prototype, called DIDACache, based on the Open-Channel SSD platform. Our experiments on real hardware show that we can increase the throughput by 35.5%, reduce the latency by 23.6%, and remove unnecessary erase operations by 28%.
The demand for preserving large amounts of data over the long term has made it challenging to store data cost-effectively. While modern optical discs guarantee a 50-year lifecycle, the limited performance and capacity of a single optical disc have restricted their use in datacenters. This paper presents a Rack-scale Optical disc library System, ROS for short, which provides PB-level total capacity and inline access to thousands of optical discs within a 42U rack. A rotatable roller and robotic arm are designed to increase disc placement density and simplify the mechanical structure. A hierarchical storage system based on SSDs, hard disks, and optical discs is proposed to hide the delay of mechanical operations. An optical library file system (OLFS) based on FUSE is proposed to schedule mechanical operations and organize data behind a POSIX user interface. We further optimize OLFS by reducing unnecessary user/kernel context switches inherited from FUSE. We evaluate ROS on key performance metrics, including mechanical operation delays and software overhead, on a prototype PB-level ROS system. The results show that ROS in network-attached storage mode almost saturates the bandwidth of Samba on a 10GbE network. ROS is able to hide its complex internal behaviors and can be deployed in data centers.
We introduce protocol-aware recovery (Par), a new approach that exploits protocol-specific knowledge to correctly recover from storage faults in distributed systems. We demonstrate the efficacy of Par through the design and implementation of corruption-tolerant replication (Ctrl), a Par mechanism specific to replicated state machine (RSM) systems. We experimentally show that the Ctrl versions of two systems, LogCabin and ZooKeeper, safely recover from storage faults and provide high availability, while the unmodified versions can lose data or become unavailable. We also show that the Ctrl versions achieve this reliability with little performance overhead.
Persistent Memory devices present properties that are uniquely different from prior technologies for which applications have been built. Unfortunately, the conventional approach to building applications fails either to utilize these new devices efficiently or to provide programmers a seamless development experience. We have built LibPM, a Persistent Memory Library that implements an easy-to-use container abstraction for consuming PM. LibPM's containers are data hosting units that can store arbitrarily complex data types while preserving their integrity and consistency. Consequently, LibPM's containers provide a generic interface to applications, allowing applications to store and manipulate arbitrarily structured data with strong durability and consistency properties, without having to navigate the myriad pitfalls of programming PM directly. By providing a simple and high-performing transactional update mechanism, LibPM allows applications to manipulate persistent data at the speed of memory. The container abstraction and automatic persistent data discovery mechanisms within LibPM also simplify porting legacy applications to PM. From a performance perspective, LibPM closely matches and often exceeds the performance of state-of-the-art application libraries for PM. For instance, LibPM's performance is 195X better for write-intensive workloads and 2.6X better for read-intensive workloads when compared with the Pmem.IO persistent memory library.
SSDs achieve high performance using internal parallelism and a Flash Translation Layer (FTL). Unfortunately, current state-of-the-art cache-based FTLs like the Demand-based Flash Translation Layer (DFTL) do not allow IO schedulers to take full advantage of internal parallelism because they impose a tight coupling between the logical-to-physical address translation and the data access. In this work, we propose an innovative IO handling scheme called Parallel-DFTL that works with the DFTL to decouple address translation operations from data accesses. It separates address translation and data access operations into different queues and uses a Parallel-LRU cache replacement algorithm to allow the SSD to use its flash access channel resources concurrently and fully for both types of operations. We present a performance model of FTL schemes that predicts the benefit of Parallel-DFTL over DFTL. We implemented our approach in an SSD simulator using real SSD device parameters, and used trace-driven simulation to evaluate its efficacy. Parallel-DFTL improved overall performance by up to 32% for the real IO workloads we tested, and by up to two orders of magnitude for our synthetic test workloads. We also found that Parallel-DFTL is able to achieve reasonable performance with a very small cache size.
Bringing Order to Chaos: Barrier-Enabled I/O Stack for Flash Storage
Storage disaggregation separates compute and storage onto different nodes in order to allow for independent resource scaling and thus better hardware resource utilization. While disaggregation of hard-drive storage is a common practice, NVMe-SSD (i.e., PCIe-based SSD) disaggregation is considered more challenging. This is because SSDs are significantly faster than hard drives, so the latency overheads (due to both network and CPU processing) as well as the extra compute cycles needed for the offloading stack become much more pronounced. In this work we characterize the overheads of NVMe-SSD disaggregation. We show that NVMe-over-Fabrics (NVMe-oF) -- a recently released remote storage protocol specification -- reduces the overheads of remote access to a bare minimum, thus greatly increasing the cost-efficiency of Flash disaggregation. Specifically, while recent work showed that SSD storage disaggregation via iSCSI degrades application-level throughput by 20%, we report negligible performance degradation with NVMe-oF, both with stress tests and with a more realistic KV-store workload.
We present a framework called HIL (Hierarchically Interacting Logs) for constructing FTLs (Flash Translation Layers). The main goal of the HIL framework is to heal the Achilles heel - the crash recovery - of FTLs (hence its name). Nonetheless, the framework itself is general enough to encompass not only block-mapped and page-mapped FTLs but also many of their variants, including hybrid ones, because of its compositional nature. Crash recovery within the HIL framework proceeds in two phases: structural recovery and functional recovery. During the structural recovery, residual effects due to program operations ongoing at the time of the crash are eliminated in an atomic manner using shadow paging. During the functional recovery, operations that would have been performed if there had been no crash are replayed in a redo-only fashion. Both phases operate in an idempotent manner, preventing repeated crashes during recovery from causing any additional problems. We demonstrate the practicality of the proposed HIL framework by implementing a prototype and showing that its performance during normal execution and also during crash recovery is at least as good as that of state-of-the-art SSDs.
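The idempotence property mentioned in the HIL abstract can be illustrated with a minimal sketch. This is not the HIL implementation: the log format (pairs of logical and physical page numbers), the function name `replay`, and the `crash_after` parameter are all hypothetical, chosen only to show why a redo-only replay built from absolute assignments can be interrupted and restarted safely.

```python
def replay(log, mapping, crash_after=None):
    """Redo-only replay of a (hypothetical) mapping-update log.

    Each record is (logical_page, physical_page). Records are applied as
    absolute assignments, so applying the same record twice leaves the
    mapping unchanged. crash_after simulates a crash before record i.
    """
    for i, (lpn, ppn) in enumerate(log):
        if crash_after is not None and i == crash_after:
            raise RuntimeError("simulated crash during recovery")
        mapping[lpn] = ppn  # absolute write: safe to re-apply
    return mapping


log = [(0, 10), (1, 11), (0, 12)]
mapping = {}

# First recovery attempt crashes partway through the log.
try:
    replay(log, mapping, crash_after=2)
except RuntimeError:
    pass

# Restarting from the beginning of the log yields the same final state
# as an uninterrupted replay would have.
recovered = replay(log, mapping)
```

The design point is that no record depends on the mapping's prior contents (there are no increments or toggles), which is what lets recovery restart from the top of the log after any number of crashes.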
File systems may become corrupted for many reasons despite various protection techniques. Therefore, most file systems come with a checker to recover the file system to a consistent state. However, existing checkers are commonly assumed to be able to complete the repair without interruption, which may not be true in practice. In this work, we demonstrate via fault injection that checkers of widely used file systems may leave the file system in an uncorrectable state if the repair is interrupted unexpectedly. To address the problem, we first fix the ordering issue in the undo logging of e2fsck, and then build a general logging library (rfsck-lib) for strengthening checkers. To demonstrate the practicality, we integrate rfsck-lib with existing checkers and create two new checkers: (1) rfsck-ext, a robust checker for Ext-family file systems, and (2) rfsck-xfs, a robust checker for XFS, both of which require only tens of lines of modification to the original versions. Both rfsck-ext and rfsck-xfs are resilient to faults in our experiments. Also, both checkers incur reasonable performance overhead (i.e., up to 12%) compared to the original unreliable versions. Moreover, rfsck-ext outperforms the patched e2fsck by up to nine times while achieving the same level of robustness.
Full-path indexing can improve I/O efficiency for workloads that operate on data organized using traditional, hierarchical directories, because data is placed on persistent storage in scan order. Prior results indicate, however, that renames in a local file system with full-path indexing are prohibitively expensive. This paper shows how to use full-path indexing in a file system to realize fast directory scans, writes, and renames. The paper introduces a range-rename mechanism for efficient key-space changes in a write-optimized dictionary. This mechanism is encapsulated in the key-value API and simplifies the overall file system design. We implemented this mechanism in BetrFS, an in-kernel, local file system for Linux. This new version, BetrFS 0.4, performs recursive greps 1.5x faster and random writes 1.2x faster than BetrFS 0.3, but renames are competitive with indirection-based file systems for a range of sizes. BetrFS 0.4 outperforms BetrFS 0.3, as well as traditional file systems, such as ext4, XFS, and ZFS, across a variety of workloads.
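To make the trade-off in the BetrFS abstract concrete, the following toy sketch shows why full-path keys make directory scans a contiguous range read, and why a naive directory rename must re-key every entry under the old prefix. It is an illustration only: the class `FullPathIndex` and its methods are hypothetical, it uses an in-memory sorted list rather than a write-optimized dictionary, and it does not implement the paper's range-rename mechanism.

```python
import bisect


class FullPathIndex:
    """Toy in-memory index keyed by full path (hypothetical, not BetrFS code)."""

    def __init__(self):
        self._keys = []    # full paths, kept in sorted order
        self._values = {}  # full path -> file contents

    def put(self, path, data):
        if path not in self._values:
            bisect.insort(self._keys, path)
        self._values[path] = data

    def _range(self, prefix):
        # Keys sharing a prefix are contiguous in sorted order.
        lo = bisect.bisect_left(self._keys, prefix)
        hi = lo
        while hi < len(self._keys) and self._keys[hi].startswith(prefix):
            hi += 1
        return lo, hi

    def scan(self, directory):
        """Return every path under directory as one contiguous slice."""
        lo, hi = self._range(directory.rstrip("/") + "/")
        return self._keys[lo:hi]

    def rename(self, old_dir, new_dir):
        """Naive rename: re-key every entry under old_dir; returns the count."""
        old_prefix = old_dir.rstrip("/") + "/"
        new_prefix = new_dir.rstrip("/") + "/"
        lo, hi = self._range(old_prefix)
        moved = self._keys[lo:hi]
        del self._keys[lo:hi]
        for old_key in moved:
            new_key = new_prefix + old_key[len(old_prefix):]
            self._values[new_key] = self._values.pop(old_key)
            bisect.insort(self._keys, new_key)
        return len(moved)


idx = FullPathIndex()
idx.put("/home/a/x.txt", "1")
idx.put("/home/a/y.txt", "2")
idx.put("/home/a/sub/w.txt", "4")
idx.put("/home/b/z.txt", "3")

before = idx.scan("/home/a")
moved = idx.rename("/home/a", "/home/c")
after = idx.scan("/home/c")
```

In this naive version the rename cost grows with the number of entries under the directory, which is the expense the abstract attributes to earlier full-path-indexed designs; the paper's contribution is a mechanism that avoids this per-key work.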
During the past decade, network and storage devices have undergone rapid performance improvements, delivering ultra-low latency and several Gbps of bandwidth. Nevertheless, current network and storage stacks fail to deliver this hardware performance to the applications, often due to the loss of I/O efficiency from stalled CPU performance. While many efforts attempt to address this issue solely on either the network or the storage stack, achieving high-performance for networked-storage applications requires a holistic approach that considers both. In this paper, we present FlashNet, a software I/O stack that unifies high-performance network properties with flash storage access and management. FlashNet builds on RDMA principles and abstractions to provide a direct, asynchronous, end-to-end data path between a client and remote flash storage. The key insight behind FlashNet is to co-design the stack's components (an RDMA controller, a flash controller, and a file system) to enable cross-stack optimizations and maximize I/O efficiency. In micro-benchmarks, FlashNet improves 4kB network IOPS by 38.6% to 1.22M, decreases access latency by 43.5% to 50.4 µsecs, and prolongs the flash lifetime by 1.6-5.9x for writes. We illustrate the capabilities of FlashNet by building a Key-Value store, and porting a distributed data store that uses RDMA on it.
Despite its widespread use in consumer devices and enterprise storage systems, NAND flash faces a growing number of challenges. While the storage density is increasing and costs are dropping, technology advances have led to reduced endurance and larger variations across blocks, which cannot be compensated solely by stronger ECC or read-retry schemes, but have to be addressed holistically. We present novel flash-management technologies that reduce write amplification, achieve better wear leveling, and significantly enhance endurance without sacrificing performance, to bring next-generation flash to the levels required in enterprise storage. In particular, we introduce block calibration, which determines optimal read-threshold voltages, and new garbage-collection and data-placement schemes with heat segregation and health binning to overcome variations in flash blocks, and we show how these techniques can be integrated in an optimized and interdependent fashion. These complementary flash-management algorithms were designed and implemented in simulators, hardware test platforms, and an enterprise-level all-flash array. By adopting these techniques, we show that the average endurance of all blocks, rather than that of the worst block, determines the overall endurance, thereby enhancing endurance by 58.8%. By combining all schemes we improve endurance by up to 15x compared with the baseline, even without a stronger ECC.
The emerging Phase Change Memory (PCM) is considered a promising candidate to replace DRAM as the next-generation main memory since it has better scalability and lower leakage power. However, high write power consumption has become a main challenge in adopting PCM as main memory. In addition to the fact that writing to PCM cells requires high write current and voltage, current loss in the charge pumps (CPs) also contributes a large percentage of the high power consumption. The pumping efficiency of a PCM chip is a concave function of the write current. Based on the characteristics of the concave function, the overall pumping efficiency can be improved if the write current is uniform. In this paper, we propose the peak-to-average (PTA) write scheme, which smooths the write current fluctuation by regrouping write units. Specifically, we calculate the current requirement of each write unit from its values when it is evicted from the last-level cache. While the write units are waiting in the memory controller, we regroup them using two efficient online algorithms to reach the current-uniform goal. Experimental results show that LLC-Assistance PTA achieves an overall energy saving of 9.7% compared to the baseline.
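The step from "concave efficiency" to "uniform current is better" in the PTA abstract follows from Jensen's inequality. The derivation below is a sketch under simplifying assumptions that are not stated in the abstract: the overall efficiency is modeled as the mean of the per-group efficiencies, regrouping leaves the total current unchanged, and the square-root efficiency curve in the numeric example is purely hypothetical.

```latex
% Assumptions (not from the source): \eta is the pumping efficiency as a
% function of write current, I_1,\dots,I_n are the currents of n write groups,
% and regrouping preserves \sum_i I_i.
\[
  \frac{1}{n}\sum_{i=1}^{n} \eta(I_i)
  \;\le\;
  \eta\!\left(\frac{1}{n}\sum_{i=1}^{n} I_i\right)
  \qquad \text{for concave } \eta ,
\]
% with equality when I_1 = \dots = I_n. For a fixed total current, the
% uniform assignment therefore maximizes the mean efficiency.
%
% Hypothetical example with \eta(I) = \sqrt{I}:
\[
  \tfrac{1}{2}\bigl(\sqrt{1} + \sqrt{9}\bigr) = 2
  \;<\;
  \tfrac{1}{2}\bigl(\sqrt{5} + \sqrt{5}\bigr) = \sqrt{5} \approx 2.24 .
\]
```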