ACM Transactions on

Storage (TOS)



Volume 15, Issue 1, March 2019 is now available.


We would like to thank our colleagues who have served as reviewers for ACM TOS between 2016 ~ 2018. Through this list, we express our deepest gratitude for sacrificing your time and effort in providing your valuable comments. Thanks!


Special Issue on Computational Storage

Since the first hard disk drive (HDD) was introduced in 1956, storage devices have remained “dumb” for more than 60 years. However, ever growing demand for big data processing and recent advances in storage technology are reshaping the traditional CPU-centric computing paradigm. Many studies show that the energy consumed by data movement is starting to exceed the energy consumed by computation. Especially, the advent of high-performance solid-state drives (SSDs) based on non-volatile memory (e.g., NAND flash memory, 3D XPoint, etc.) opens up new opportunities for storage-centric computing paradigm.

Read more.

Forthcoming Articles
Efficient Erasure-Coded Block Storage System Based on Speculative Partial Write

In this paper we present PariX, a speculative partial write scheme which supports fast small writes in erasure-coded block storage systems. We transform the original formula of parity calculation to use the data deltas (between the current/original data values), instead of the parity deltas, to calculate the parities in journal replay. For each partial write, this allows PariX to speculatively log only the new value of the data without reading its original value. For a series of n partial writes to the same data, PariX performs pure write (instead of write-after-read) for the last n ? 1 ones while only introducing a small penalty of an extra network RTT (round-trip time) to the first one. We design and implement a PariX-based Block Storage system (PBS) which provides high-performance virtual disk service for VMs running cloud-oblivious applications. PBS not only supports fast partial writes, but also realizes efficient full writes, background journal replay and fast failure recovery with strong consistency guarantees. Both micro benchmarks and trace-driven evaluation show that PBS provides efficient block storage and outperforms state-of-the-art EC-based systems by orders of magnitude.

Information Leakage in Encrypted Deduplication via Frequency Analysis: Attacks and Defenses

Encrypted deduplication combines encryption and deduplication to simultaneously achieve both data security and storage efficiency. State-of-the-art encrypted deduplication systems mainly build on deterministic encryption to preserve deduplication effectiveness. However, such deterministic encryption reveals the underlying frequency distribution of the original plaintext chunks. This allows an adversary to launch frequency analysis against the ciphertext chunks and infer the content of the original plaintext chunks. In this paper, we study how frequency analysis affects information leakage in encrypted deduplication storage, from both attack and defense perspectives. Specifically, we target backup workloads, and propose a new inference attack that exploits chunk locality to increase the coverage of inferred chunks. We further combine the new inference attack with the knowledge of chunk sizes and show its attack effectiveness against variable-size chunks. We conduct trace-driven evaluation on both real-world and synthetic datasets and show that our proposed attacks infer a significant fraction of plaintext chunks under backup workloads. To defend against frequency analysis, we present two defense approaches, namely MinHash encryption and scrambling. Our trace-driven evaluation shows that our combined MinHash encryption and scrambling scheme effectively mitigates the severity of the inference attacks, while maintaining high storage efficiency and incurring limited metadata access overhead.

GraphOne: A Data Store for Real-time Analytics on Evolving Graphs

There is a growing need to perform real-time analytics on evolving graphs in order to deliver the values of big data to users. The key requirement from such applications is to have a data store to support their diverse data access efficiently, while concurrently ingesting fine-grained updates at a high velocity. Unfortunately, current graph systems, either graph databases or analytics engines, are not designed to achieve high performance for both operations. To address this challenge, we have designed and developed GraphOne, a graph data store that combines two complementary graph storage formats (edge list and adjacency list), and uses dual versioning to decouple graph computations from updates. Importantly, it presents a new data abstraction, GraphView, to enable data access at two different granularities with only a small data duplication. Experimental results show that GraphOne achieves an ingestion rate of two to three orders of magnitude higher than graph databases, while delivering algorithmic performance comparable to a static graph system. GraphOneis able to deliver 5.36x higher update rate and over 3x better analytics performance compared to a state-of-the- art dynamic graph system

Countering Fragmentation in an Enterprise Storage System

As a file system ages, it can experience multiple forms of fragmentation. Fragmentation of the free space in the file system can lower write performance and subsequent read performance. Client operations as well as internal operations, such as deduplication, can fragment the layout of an individual file, which also impacts file read performance. File systems that allow sub-block granular addressing can gather intra-block fragmentation, which leads to wasted free space. Similarly, wasted space can also occur when a file system writes a collection of blocks out to object storage as a single large object, as the constituent blocks can become free at different times. The impact of fragmentation also depends on the underlying storage media. This paper describes how the NetApp WAFL file system leverages a storage virtualization layer for defragmentation techniques that physically relocate blocks efficiently, including those in read-only snapshots. The paper analyzes the effectiveness of these techniques at reducing fragmentation and improving overall performance across various storage media.

Instalytics: Cluster Filesystem Co-design for Big-data Analytics

We present the design, implementation, and evaluation of INSTalytics a co-designed stack of a cluster file system and the compute layer, for efficient big data analytics in large-scale data centers. INSTalytics amplifies the well-known benefits of data partitioning in analytics systems; instead of traditional partitioning on one dimension, INSTalytics enables data to be simultaneously partitioned on four different dimensions at the same storage cost, enabling a larger fraction of queries to benefit from partition filtering and joins without network shuffle. To achieve this, INSTalytics uses compute-awareness to customize the 3-way replication that the cluster file system employs for availability. A new heterogeneous replication layout enables INSTalytics to preserve the same recovery cost and availability as traditional replication. INSTalytics also uses compute-awareness to expose a new sliced-read API that improves performance of joins by enabling multiple compute nodes to read slices of a data block efficiently via co-ordinated request scheduling and selective caching at the storage nodes. We have built a prototype implementation of INSTalytics in a production analytics stack, and show that recovery performance and availability is similar to physical replication, while providing significant improvements in query performance, suggesting a new approach to designing cloud-scale big-data analytics systems.

Determining Data Distribution for Large Disk Enclosures with 3-D Data Templates

Existing RAID solutions partition large disk enclosures so that each RAID group uses its own disks exclusively. This achieves good performance isolation across underlying disk groups, at the cost of disk under-utilization and slow RAID reconstruction from disk failures. We propose RAID+, a new RAID construction mechanism that spreads both normal I/O and reconstruction workloads to a larger disk pool in a balanced manner. Unlike systems conducting randomized placement, RAID+ employs deterministic addressing enabled by the mathematical properties of mutually orthogonal Latin squares, based on which it constructs 3-D data templates mapping a logical data volume to uniformly distributed disk blocks across all disks. While the total read/write volume remains unchanged, with or without disk failures, many more disk drives participate in data service and disk reconstruction. Our evaluation with a 60-drive disk enclosure using both synthetic and real-world workloads shows that RAID+ significantly speeds up data recovery while delivering better normal I/O performance and higher multi-tenant system throughput.

Sketching Volume Capacities in Deduplicated Storage

The adoption of deduplication in storage systems has introduced significant new challenges for storage management. Specifically, the physical capacities associated with volumes are no longer readily available. In this work we introduce a new approach to analyzing capacities in deduplicated storage environments. We provide sketch-based estimations of fundamental capacity measures required for managing a storage system: How much physical space would be reclaimed if a volume or group of volumes were to be removed from a system (the {\em reclaimable} capacity) and how much of the physical space should be attributed to each of the volumes in the system (the {\em attributed} capacity). Our methods also support capacity queries for volume groups across multiple storage systems, e.g., how much capacity would a volume group consume after being migrated to another storage system? We provide analytical accuracy guarantees for our estimations as well as empirical evaluations. Our technology is integrated into a prominent all-flash storage array and exhibits high performance even for very large systems. We also demonstrate how this method opens the door for performing placement decisions at the data center level and obtaining insights on deduplication in the field.

Characterizing output behaviors of a production supercomputer: analysis and implications

This paper studies the output behavior of the Titan supercomputer and its Lustre file stores. We introduce a statistical benchmarking methodology that collects/combines samples over times and settings: 1) To measure the performance impact of parameter choices against the interference in the production setting; 2) to derive the performance of individual stages/components in the multi-stage write pipelines, and their variations over time. We find that Titan's I/O system is highly variable with two major implications: 1) Stragglers lessen the benefit of coupled I/O parallelism. I/O parallelism is most effective when the application distributes the I/O load so that each target stores files for multiple clients and each client writes files on multiple targets, in a balanced way with minimal contention. 2) our results also suggest that the potential benefit of dynamic adaptation is limited. In particular, it is not fruitful to attempt to identify "good locations" in the machine or in the file system: component performance is driven by transient load conditions, and past performance is not a useful predictor of future performance. For example, we do not observe diurnal load patterns that are predictable.

The Composite-file File System: Decoupling One-to-one Mapping of Files and Metadata for Better Performance

The design and implementation of traditional file systems typically use the one-to-one mapping of logical files to their physical metadata representations. File system optimizations generally follow this rigid mapping and miss opportunities for an entire class of optimizations. We designed, implemented, and evaluated a composite-file file system, which allows many-to-one mappings of files to metadata. Through exploring different mapping strategies, our empirical evaluation shows up to a 27% performance improvement under webserver and software development workloads, for both disks and SSDs. This result demonstrates that our approach of relaxing file-to-metadata mapping is promising.

LDJ: Version Consistency Is Almost Free on Commercial Storage Devices

All ACM Journals | See Full Journal Index

Search TOS
enter search term and/or author name