ACM Transactions on Storage (TOS)

Latest Articles

Protocol-Aware Recovery for Consensus-Based Distributed Storage

We introduce protocol-aware recovery (Par), a new approach that exploits protocol-specific knowledge...

Fail-Slow at Scale: Evidence of Hardware Performance Faults in Large Production Systems

Fail-slow hardware is an under-studied failure mode. We present a study of 114 reports of fail-slow hardware incidents, collected from large-scale cluster deployments in 14 institutions. We show that all hardware types such as disk, SSD, CPU, memory, and network components can exhibit performance faults. We made several important observations such as faults convert from one form to another, the...

Bringing Order to Chaos: Barrier-Enabled I/O Stack for Flash Storage

This work is dedicated to eliminating the overhead required for guaranteeing the storage order in the modern IO stack. The existing block device adopts a prohibitively expensive approach in ensuring the storage order among write requests: interleaving the write requests with Transfer-and-Flush. For exploiting the cache barrier command for flash...

M-CLOCK: Migration-optimized Page Replacement Algorithm for Hybrid Memory Architecture

Phase Change Memory (PCM) has drawn great attention as a main memory due to its attractive characteristics such as non-volatility, byte-addressability, and in-place update. However, since the capacity of PCM is not fully mature yet, hybrid memory architecture that consists of DRAM and PCM has been suggested as a main memory. In addition, page...

DIDACache: An Integration of Device and Application for Flash-based Key-value Caching

Key-value caching is crucial to today’s low-latency Internet services. Conventional key-value cache systems, such as Memcached, heavily rely on expensive DRAM memory. To lower Total Cost of Ownership, the industry has recently been moving toward more cost-efficient flash-based solutions, such as Facebook’s McDipper [14] and...

ROS: A Rack-based Optical Storage System with Inline Accessibility for Long-Term Data Preservation

The combination of the explosive growth in digital data and the demand to preserve much of these data in the long term has made it imperative to find a more cost-effective way than HDD arrays and a more easily accessible way than tape libraries to store massive amounts of data. While modern optical discs are capable of guaranteeing more than...


  • TOS EiC Professor Sam H. Noh of UNIST named as ACM Distinguished Member
    A complete list of 2017 ACM Distinguished Members can be found here.

  • CFP - Special Issue on NVM and Storage (in detail)

  • TOS Editor-in-Chief featured in "People of ACM"
    Sam H. Noh is Editor-in-Chief of ACM Transactions on Storage (TOS) - featured in the periodic series "People of ACM", full article available here
    November 01, 2016

  • ACM Transactions on Storage (TOS) welcomes Sam H. Noh as its new Editor-in-Chief for a 3-year term, effective August 1, 2016.
    Sam H. Noh is a professor and Head of the School of Electrical and Computer Engineering at UNIST (Ulsan National Institute of Science and Technology) in Ulsan, Korea, and a leader in the use of new memory technology such as flash memory and non-volatile memory in storage.
    - August 01, 2016

Forthcoming Articles

Leveraging Glocality of Cube-Based Networks for Fast Failure Recovery in Distributed RAM Storage

Distributed RAM storage aggregates the RAM of servers in data center networks (DCN) to provide extremely high I/O performance for large-scale cloud systems. For quick recovery from storage server failures, MemCube exploits the proximity of the BCube network to limit the recovery traffic to the recovery servers' 1-hop neighborhood. However, the previous design is applicable only to the symmetric BCube(n,k) network with n^{k+1} nodes, and has suboptimal recovery performance due to congestion and contention. To address these problems, in this paper we propose CubeX, which (i) generalizes the "1-hop" principle of MemCube for arbitrary cube-based networks, and (ii) improves the throughput and recovery performance of RAM-based key-value (KV) stores via cross-layer optimizations. The core idea of CubeX is to leverage the glocality (= globality + locality) of cube-based networks: it scatters backup data across a large number of disks globally distributed throughout the cube, and restricts all recovery traffic to the small local range of each server node. Our evaluation shows that CubeX not only efficiently supports RAM-based KV stores for cube-based networks, but also significantly outperforms MemCube and RAMCloud in both throughput and recovery time.
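
As a rough illustration of the n^{k+1} node count and the 1-hop neighborhood mentioned above, the sketch below enumerates server addresses in a symmetric BCube(n,k). It assumes the usual BCube addressing, where each server is a (k+1)-digit address in base n and two servers share a switch exactly when their addresses differ in one digit. The function names are hypothetical and are not taken from MemCube or CubeX.

```python
from itertools import product


def bcube_servers(n, k):
    """All server addresses in BCube(n, k): (k+1)-digit tuples in base n."""
    return list(product(range(n), repeat=k + 1))


def one_hop_neighbors(addr, n):
    """Servers whose address differs from addr in exactly one digit.

    Assumption: servers attached to the same switch (one switch hop apart)
    differ in exactly one address digit.
    """
    neighbors = []
    for level, digit in enumerate(addr):
        for d in range(n):
            if d != digit:
                neighbors.append(addr[:level] + (d,) + addr[level + 1:])
    return neighbors
```

Under these assumptions each server has (n-1)(k+1) one-hop neighbors, which is the set a MemCube-style scheme would confine recovery traffic to.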

Introduction to the Special Section on Systor 2017

LibPM: Simplifying Application Usage of Persistent Memory

Persistent Memory devices present properties that are uniquely different from prior technologies for which applications have been built. Unfortunately, the conventional approach to building applications fails to either efficiently utilize these new devices or provide programmers a seamless development experience. We have built LibPM, a Persistent Memory Library that implements an easy-to-use container abstraction for consuming PM. LibPM's containers are data hosting units that can store arbitrarily complex data types while preserving their integrity and consistency. Consequently, LibPM's containers provide a generic interface to applications, allowing applications to store and manipulate arbitrarily structured data with strong durability and consistency properties, all without having to navigate the myriad pitfalls of programming PM directly. By providing a simple and high-performing transactional update mechanism, LibPM allows applications to manipulate persistent data at the speed of memory. The container abstraction and automatic persistent data discovery mechanisms within LibPM also simplify porting legacy applications to PM. From a performance perspective, LibPM closely matches and often exceeds the performance of state-of-the-art application libraries for PM. For instance, LibPM's performance is 195X better for write-intensive workloads and 2.6X better for read-intensive workloads when compared with the Pmem.IO persistent memory library.

Exploiting Internal Parallelism for Address Translation in Solid State Drives

SSDs achieve high performance using internal parallelism and a Flash Translation Layer (FTL). Unfortunately, current state-of-the-art cache-based FTLs like the Demand-based Flash Translation Layer (DFTL) do not allow IO schedulers to take full advantage of internal parallelism because they impose a tight coupling between the logical-to-physical address translation and the data access. In this work, we propose an innovative IO handling scheme called Parallel-DFTL that works with the DFTL to decouple address translation operations from data accesses. It places address translation and data access operations in separate queues and uses a Parallel-LRU cache replacement algorithm, allowing the SSD to use its flash access channel resources concurrently and fully for both types of operations. We present a performance model of FTL schemes that predicts the benefit of Parallel-DFTL against DFTL. We implemented our approach in an SSD simulator using real SSD device parameters, and used trace-driven simulation to evaluate its efficacy. Parallel-DFTL improved overall performance by up to 32% for the real IO workloads we tested, and up to two orders of magnitude for our synthetic test workloads. We also found that Parallel-DFTL is able to achieve reasonable performance with a very small cache size.
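
The idea of keeping address translation and data access in separate queues can be sketched with a toy model. This is only an illustration under stated assumptions: the class and method names are hypothetical, the mapping table is an in-memory dict standing in for flash-resident translation pages, and a plain LRU cache is used in place of the Parallel-LRU algorithm, whose details the abstract does not give.

```python
from collections import OrderedDict, deque


class DecoupledFTL:
    """Toy FTL front end with separate translation and data queues."""

    def __init__(self, cache_size, mapping):
        self.mapping = mapping            # full logical->physical table (stand-in for flash)
        self.cache = OrderedDict()        # cached mappings, oldest entry first
        self.cache_size = cache_size
        self.translation_queue = deque()  # requests still waiting for a mapping
        self.data_queue = deque()         # requests ready for data access

    def submit(self, lpn):
        """Route a request: cache hits go straight to the data queue."""
        if lpn in self.cache:
            self.cache.move_to_end(lpn)   # mark as most recently used
            self.data_queue.append((lpn, self.cache[lpn]))
        else:
            self.translation_queue.append(lpn)

    def service_translation(self):
        """Resolve one pending translation and hand it to the data queue."""
        lpn = self.translation_queue.popleft()
        ppn = self.mapping[lpn]
        self.cache[lpn] = ppn
        self.cache.move_to_end(lpn)
        if len(self.cache) > self.cache_size:
            self.cache.popitem(last=False)  # evict the least recently used entry
        self.data_queue.append((lpn, ppn))
```

Because the two queues are independent, a scheduler in this model could drain the data queue while translations for other requests are still pending.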

Performance Characterization of NVMe-over-Fabrics Storage Disaggregation

Storage disaggregation separates compute and storage onto different nodes in order to allow for independent resource scaling and thus better hardware resource utilization. While disaggregation of hard-drive storage is a common practice, NVMe-SSD (i.e., PCIe-based SSD) disaggregation is considered more challenging. This is because SSDs are significantly faster than hard drives, so the latency overheads (due to both network and CPU processing) as well as the extra compute cycles needed for the offloading stack become much more pronounced. In this work we characterize the overheads of NVMe-SSD disaggregation. We show that NVMe-over-Fabrics (NVMe-oF) -- a recently-released remote storage protocol specification -- reduces the overheads of remote access to a bare minimum, thus greatly increasing the cost-efficiency of Flash disaggregation. Specifically, while recent work showed that SSD storage disaggregation via iSCSI degrades application-level throughput by 20%, we report negligible performance degradation with NVMe-oF -- both when using stress-tests as well as with a more-realistic KV-store workload.

HIL: A Framework for Compositional FTL Development and Provably-Correct Crash Recovery

We present a framework called HIL (Hierarchically Interacting Logs) for constructing FTLs (Flash Translation Layers). The main goal of the HIL framework is to heal the Achilles heel - the crash recovery - of FTLs (hence its name). Nonetheless, the framework itself is general enough to encompass not only block-mapped and page-mapped FTLs but also many of their variants including hybrid ones because of its compositional nature. Crash recovery within the HIL framework proceeds in two phases: structural recovery and functional recovery. During the structural recovery, residual effects due to program operations ongoing at the time of the crash are eliminated in an atomic manner using shadow paging. During the functional recovery, operations that would have been performed if there had been no crash are replayed in a redo-only fashion. Both phases operate in an idempotent manner, preventing repeated crashes during recovery from causing any additional problems. We demonstrate the practicality of the proposed HIL framework by implementing a prototype and showing that its performance during normal execution and also during crash recovery is at least as good as that of state-of-the-art SSDs.
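
The property that redo-only replay is idempotent, so that a crash during recovery causes no additional problems, can be illustrated with a minimal sketch. It is not the HIL implementation: the function name, the (logical page, physical page) record format, and the crash_after parameter are assumptions made for this example.

```python
def redo_replay(mapping, log, crash_after=None):
    """Apply log records (logical_page, physical_page) in order.

    Each record sets an absolute value, so re-applying a record that was
    already applied leaves the mapping unchanged. crash_after simulates a
    crash after that many records; the function then returns False.
    """
    for i, (lpn, ppn) in enumerate(log):
        if crash_after is not None and i >= crash_after:
            return False  # recovery was interrupted
        mapping[lpn] = ppn
    return True
```

Restarting the replay from the beginning of the log after an interrupted attempt yields the same mapping as a single uninterrupted replay, because no record depends on how many times it has been applied.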

Liquid Cloud Storage

A liquid system provides durable object storage based on spreading redundantly generated data across a network of hundreds to thousands of potentially unreliable storage nodes. A liquid system uses a combination of a large code, lazy repair, and a flow storage organization. We show that a liquid system can be operated to enable flexible and essentially optimal combinations of storage durability, storage overhead, repair bandwidth usage, and access performance.

Towards Robust File System Checkers

File systems may become corrupted for many reasons despite various protection techniques. Therefore, most file systems come with a checker to recover the file system to a consistent state. However, existing checkers are commonly assumed to be able to complete the repair without interruption, which may not be true in practice. In this work, we demonstrate via fault injection that checkers of widely used file systems may leave the file system in an uncorrectable state if the repair is interrupted unexpectedly. To address the problem, we first fix the ordering issue in the undo logging of e2fsck, and then build a general logging library (rfsck-lib) for strengthening checkers. To demonstrate the practicality, we integrate rfsck-lib with existing checkers and create two new checkers: (1) rfsck-ext, a robust checker for Ext-family file systems, and (2) rfsck-xfs, a robust checker for XFS, both of which require only tens of lines of modification to the original versions. Both rfsck-ext and rfsck-xfs are resilient to faults in our experiments. Also, both checkers incur reasonable performance overhead (i.e., up to 12%) compared to the original unreliable versions. Moreover, rfsck-ext outperforms the patched e2fsck by up to nine times while achieving the same level of robustness.
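
Why ordering matters in undo logging can be shown with a small in-memory sketch. It assumes the general undo-logging rule that a block's old content must be recorded before the block is overwritten. The class below is hypothetical and is not the rfsck-lib or e2fsck interface.

```python
class UndoLoggedDisk:
    """In-memory stand-in for a disk whose repairs are protected by an undo log."""

    def __init__(self, blocks):
        self.blocks = dict(blocks)
        self.undo_log = []  # (block_no, old_content) records, oldest first

    def write(self, block_no, data):
        # Record the old content first, then overwrite in place.
        self.undo_log.append((block_no, self.blocks.get(block_no)))
        self.blocks[block_no] = data

    def commit(self):
        # The repair finished, so the saved old contents are no longer needed.
        self.undo_log.clear()

    def rollback(self):
        # Undo in reverse order so the earliest saved content is restored last.
        while self.undo_log:
            block_no, old = self.undo_log.pop()
            if old is None:
                self.blocks.pop(block_no, None)  # block did not exist before
            else:
                self.blocks[block_no] = old
```

If the overwrite happened before the log record existed, an interruption between the two steps would leave a modified block with no saved copy to restore.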

Efficient Directory Mutations in a Full-Path-Indexed File System

Full-path indexing can improve I/O efficiency for workloads that operate on data organized using traditional, hierarchical directories, because data is placed on persistent storage in scan order. Prior results indicate, however, that renames in a local file system with full-path indexing are prohibitively expensive. This paper shows how to use full-path indexing in a file system to realize fast directory scans, writes, and renames. The paper introduces a range-rename mechanism for efficient key-space changes in a write-optimized dictionary. This mechanism is encapsulated in the key-value API and simplifies the overall file system design. We implemented this mechanism in BetrFS, an in-kernel, local file system for Linux. This new version, BetrFS 0.4, performs recursive greps 1.5x faster and random writes 1.2x faster than BetrFS 0.3, but renames are competitive with indirection-based file systems for a range of sizes. BetrFS 0.4 outperforms BetrFS 0.3, as well as traditional file systems, such as ext4, XFS, and ZFS, across a variety of workloads.
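
To make the trade-off concrete, the sketch below models a full-path index as a sorted list of path keys. It is a naive illustration, not BetrFS's range-rename mechanism: the class name and methods are hypothetical, and the rename shown here rewrites every affected key, which is the cost a range-rename is meant to avoid.

```python
import bisect


class FullPathIndex:
    """Toy full-path index: keys are full paths kept in sorted order."""

    def __init__(self):
        self.keys = []    # sorted list of full paths
        self.values = {}  # path -> value

    def put(self, path, value):
        if path not in self.values:
            bisect.insort(self.keys, path)
        self.values[path] = value

    def scan(self, directory):
        """Return all paths under directory; they are contiguous in key order."""
        prefix = directory.rstrip("/") + "/"
        start = bisect.bisect_left(self.keys, prefix)
        result = []
        for key in self.keys[start:]:
            if not key.startswith(prefix):
                break
            result.append(key)
        return result

    def rename(self, old_dir, new_dir):
        """Naive rename: re-insert every key under old_dir with the new prefix."""
        old_prefix = old_dir.rstrip("/") + "/"
        new_prefix = new_dir.rstrip("/") + "/"
        for key in self.scan(old_dir):
            value = self.values.pop(key)
            self.keys.remove(key)
            self.put(new_prefix + key[len(old_prefix):], value)
```

A directory scan touches one contiguous key range, while the naive rename does work proportional to the number of entries under the renamed directory.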

FlashNet: Flash/Network Stack Co-Design

During the past decade, network and storage devices have undergone rapid performance improvements, delivering ultra-low latency and several Gbps of bandwidth. Nevertheless, current network and storage stacks fail to deliver this hardware performance to the applications, often due to the loss of I/O efficiency from stalled CPU performance. While many efforts attempt to address this issue solely on either the network or the storage stack, achieving high performance for networked-storage applications requires a holistic approach that considers both. In this paper, we present FlashNet, a software I/O stack that unifies high-performance network properties with flash storage access and management. FlashNet builds on RDMA principles and abstractions to provide a direct, asynchronous, end-to-end data path between a client and remote flash storage. The key insight behind FlashNet is to co-design the stack's components (an RDMA controller, a flash controller, and a file system) to enable cross-stack optimizations and maximize I/O efficiency. In micro-benchmarks, FlashNet improves 4kB network IOPS by 38.6% to 1.22M, decreases access latency by 43.5% to 50.4 µsecs, and prolongs the flash lifetime by 1.6-5.9x for writes. We illustrate the capabilities of FlashNet by building a Key-Value store, and porting a distributed data store that uses RDMA on it.

Management of Next-Generation NAND Flash to achieve Enterprise-Level Endurance and Latency Targets

Despite its widespread use in consumer devices and enterprise storage systems, NAND flash faces a growing number of challenges. While the storage density is increasing and costs are dropping, technology advances have led to reduced endurance and larger variations across blocks, which cannot be compensated solely by stronger ECC or read-retry schemes, but have to be addressed holistically. We present novel flash-management technologies that reduce write amplification, achieve better wear leveling, and significantly enhance endurance without sacrificing performance, to bring next-generation flash to the levels required in enterprise storage. In particular, we introduce block calibration, which determines optimal read-threshold voltages, new garbage-collection and data-placement schemes with heat segregation and health binning to overcome variations in flash blocks, and show how these techniques can be integrated in an optimized and interdependent fashion. These complementary flash-management algorithms were designed and implemented in simulators, hardware test platforms, and an enterprise-level all-flash array. By adopting these techniques, we show that the average endurance of all blocks, rather than that of the worst block, determines the overall endurance thereby enhancing endurance by 58.8%. By combining all schemes we improve endurance by up to 15x compared with the baseline even without a stronger ECC.

Write Energy Reduction for PCM via Pumping Efficiency Improvement

The emerging Phase Change Memory (PCM) is considered a promising candidate to replace DRAM as the next generation main memory since it has better scalability and lower leakage power. However, the high write power consumption has become a main challenge in adopting PCM as main memory. In addition to the fact that writing to PCM cells requires high write current and voltage, current loss in the charge pumps (CPs) also contributes a large percentage of the high power consumption. The pumping efficiency of a PCM chip is a concave function of the write current. Based on the characteristics of the concave function, the overall pumping efficiency can be improved if the write current is uniform. In this paper, we propose the peak-to-average (PTA) write scheme, which smooths the write current fluctuation by regrouping write units. Specifically, we calculate the current requirements for each write unit by their values when they are evicted from the last level cache. When the write units are waiting in the memory controller, we regroup the write units by two efficient online algorithms to reach the current-uniform goal. Experimental results show that LLC-Assistance PTA achieved 9.7% of overall energy saving compared to the baseline.
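
The claim that a uniform write current improves overall efficiency when efficiency is concave follows from Jensen's inequality: for a concave function, the average of the function values is at most the function of the average. The sketch below checks this numerically. The square-root curve and the current values are hypothetical, chosen only for illustration, and treating overall efficiency as a plain mean is a simplification rather than the paper's model.

```python
import math


def efficiency(current):
    # Hypothetical concave efficiency curve; not a measured PCM charge-pump model.
    return math.sqrt(current)


def mean_efficiency(currents):
    """Average efficiency over a sequence of write rounds."""
    return sum(efficiency(c) for c in currents) / len(currents)


# Two schedules with the same total current over four write rounds.
fluctuating = [1.0, 9.0, 1.0, 9.0]
uniform = [5.0, 5.0, 5.0, 5.0]
```

With the same total current, the uniform schedule has the higher mean efficiency in this toy model, which is the effect that regrouping write units aims for.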
