In this paper we present PariX, a speculative partial write scheme which supports fast small writes in erasure-coded block storage systems. We transform the original formula of parity calculation to use the data deltas (between the current/original data values), instead of the parity deltas, to calculate the parities in journal replay. For each partial write, this allows PariX to speculatively log only the new value of the data without reading its original value. For a series of n partial writes to the same data, PariX performs pure write (instead of write-after-read) for the last n ? 1 ones while only introducing a small penalty of an extra network RTT (round-trip time) to the first one. We design and implement a PariX-based Block Storage system (PBS) which provides high-performance virtual disk service for VMs running cloud-oblivious applications. PBS not only supports fast partial writes, but also realizes efficient full writes, background journal replay and fast failure recovery with strong consistency guarantees. Both micro benchmarks and trace-driven evaluation show that PBS provides efficient block storage and outperforms state-of-the-art EC-based systems by orders of magnitude.
Encrypted deduplication combines encryption and deduplication to simultaneously achieve both data security and storage efficiency. State-of-the-art encrypted deduplication systems mainly build on deterministic encryption to preserve deduplication effectiveness. However, such deterministic encryption reveals the underlying frequency distribution of the original plaintext chunks. This allows an adversary to launch frequency analysis against the ciphertext chunks and infer the content of the original plaintext chunks. In this paper, we study how frequency analysis affects information leakage in encrypted deduplication storage, from both attack and defense perspectives. Specifically, we target backup workloads, and propose a new inference attack that exploits chunk locality to increase the coverage of inferred chunks. We further combine the new inference attack with the knowledge of chunk sizes and show its attack effectiveness against variable-size chunks. We conduct trace-driven evaluation on both real-world and synthetic datasets and show that our proposed attacks infer a significant fraction of plaintext chunks under backup workloads. To defend against frequency analysis, we present two defense approaches, namely MinHash encryption and scrambling. Our trace-driven evaluation shows that our combined MinHash encryption and scrambling scheme effectively mitigates the severity of the inference attacks, while maintaining high storage efficiency and incurring limited metadata access overhead.
There is a growing need to perform real-time analytics on evolving graphs in order to deliver the values of big data to users. The key requirement from such applications is to have a data store to support their diverse data access efficiently, while concurrently ingesting fine-grained updates at a high velocity. Unfortunately, current graph systems, either graph databases or analytics engines, are not designed to achieve high performance for both operations. To address this challenge, we have designed and developed GraphOne, a graph data store that combines two complementary graph storage formats (edge list and adjacency list), and uses dual versioning to decouple graph computations from updates. Importantly, it presents a new data abstraction, GraphView, to enable data access at two different granularities with only a small data duplication. Experimental results show that GraphOne achieves an ingestion rate of two to three orders of magnitude higher than graph databases, while delivering algorithmic performance comparable to a static graph system. GraphOneis able to deliver 5.36x higher update rate and over 3x better analytics performance compared to a state-of-the- art dynamic graph system
As a file system ages, it can experience multiple forms of fragmentation. Fragmentation of the free space in the file system can lower write performance and subsequent read performance. Client operations as well as internal operations, such as deduplication, can fragment the layout of an individual file, which also impacts file read performance. File systems that allow sub-block granular addressing can gather intra-block fragmentation, which leads to wasted free space. Similarly, wasted space can also occur when a file system writes a collection of blocks out to object storage as a single large object, as the constituent blocks can become free at different times. The impact of fragmentation also depends on the underlying storage media. This paper describes how the NetApp WAFL file system leverages a storage virtualization layer for defragmentation techniques that physically relocate blocks efficiently, including those in read-only snapshots. The paper analyzes the effectiveness of these techniques at reducing fragmentation and improving overall performance across various storage media.
Existing RAID solutions partition large disk enclosures so that each RAID group uses its own disks exclusively. This achieves good performance isolation across underlying disk groups, at the cost of disk under-utilization and slow RAID reconstruction from disk failures. We propose RAID+, a new RAID construction mechanism that spreads both normal I/O and reconstruction workloads to a larger disk pool in a balanced manner. Unlike systems conducting randomized placement, RAID+ employs deterministic addressing enabled by the mathematical properties of mutually orthogonal Latin squares, based on which it constructs 3-D data templates mapping a logical data volume to uniformly distributed disk blocks across all disks. While the total read/write volume remains unchanged, with or without disk failures, many more disk drives participate in data service and disk reconstruction. Our evaluation with a 60-drive disk enclosure using both synthetic and real-world workloads shows that RAID+ significantly speeds up data recovery while delivering better normal I/O performance and higher multi-tenant system throughput.
This paper studies the output behavior of the Titan supercomputer and its Lustre file stores. We introduce a statistical benchmarking methodology that collects/combines samples over times and settings: 1) To measure the performance impact of parameter choices against the interference in the production setting; 2) to derive the performance of individual stages/components in the multi-stage write pipelines, and their variations over time. We find that Titan's I/O system is highly variable with two major implications: 1) Stragglers lessen the benefit of coupled I/O parallelism. I/O parallelism is most effective when the application distributes the I/O load so that each target stores files for multiple clients and each client writes files on multiple targets, in a balanced way with minimal contention. 2) our results also suggest that the potential benefit of dynamic adaptation is limited. In particular, it is not fruitful to attempt to identify "good locations" in the machine or in the file system: component performance is driven by transient load conditions, and past performance is not a useful predictor of future performance. For example, we do not observe diurnal load patterns that are predictable.
The design and implementation of traditional file systems typically use the one-to-one mapping of logical files to their physical metadata representations. File system optimizations generally follow this rigid mapping and miss opportunities for an entire class of optimizations. We designed, implemented, and evaluated a composite-file file system, which allows many-to-one mappings of files to metadata. Through exploring different mapping strategies, our empirical evaluation shows up to a 27% performance improvement under webserver and software development workloads, for both disks and SSDs. This result demonstrates that our approach of relaxing file-to-metadata mapping is promising.