Editor-in Chief Letter
We develop OrcFS, Orchestrated File System for Flash storage. It vertically integrates the log-structured file system and the Flash-based storage device eliminating all the redundancies across the layers. A few modern file systems adopt sophisticated append-only data structures to manage its space in an effort to optimize the behavior with respect to the append-only nature of the Flash medium. While the benefit of adopting append-only data structure seems to be fairly promising, it makes the stack of software layers full of unnecessary redundancies which leaves substantial room for improvement. The redundancies include (i) redundant levels of indirection (address translation) , (ii) duplicate efforts to reclaim the invalid blocks which is called segment cleaning and garbage collection in the file system layer and in the storage device, respectively, and (iii) excessive over-provisioning, i.e. separate over-provisioning areas in each layer. OrcFS eliminates the redundancies via distributing the address translation, segment cleaning (or garbage collection), bad block management, and wear-leveling across the layers. OrcFS reduces the device mapping table requirement to 1/465 against the page mapping and removes 1/4 of the write volume under heavy random write workload. In varmail, OrcFS achieves 56% performance gain against EXT4.
Deduplication is essential in disk-based backup systems, but there are few long-term studies of backup workloads. Past studies either were of a small static snapshot or covered only a short periods of time. We collected 21 months of data from a shared user file system (33 users and over 4,000 snapshots). We analyzed the data set, examining a variety of essential characteristics across two dimensions: single-node deduplication and cluster deduplication. For single-node analysis, our focus was individual-user data. Despite apparently similar roles among our users, we found significant differences in their deduplication ratios; and we found high deduplication ration among some users. For cluster deduplication analysis, we implemented seven published data-routing algorithms and created a detailed comparison of their performance with respect to deduplication ratio, load distribution, and communication overhead. We found that per-file routing achieves a higher deduplication ratio than routing by super-chunk, but it also leads to high data imbalance across nodes. We found that large chunking sizes are better for cluster deduplication, as they reduce data-routing overhead, while their impact on deduplication ratios was small. We draw interesting conclusions from both single-node and cluster deduplication analysis, and make recommendations for future deduplication systems design.
LSM-tree has been widely used in data management production systems for write-intensive workloads. However, as read and write workloads co-exist under LSM-tree, data accesses can experience long latency and low throughput due to the interferences to buffer caching from the compaction, a major and frequent operation in LSM-tree. After a compaction, the existing data blocks are reorganized and written to other locations on disks. As a result, the related data blocks that have been loaded in the buffer cache are invalidated since their referencing addresses are changed, causing serious performance degradations. In order to re-enable high-speed buffer caching during intensive writes, we propose Log-Structured buffered-Merge tree (simplified as LSbM-tree) by adding a compaction buffer on disks, to minimize the cache invalidations on buffer cache caused by compactions. The compaction buffer efficiently and adaptively maintains the frequently visited data sets. In LSbM, strong locality objects can be effectively kept in the buffer cache with minimum or without harmful invalidations. With the help of a small on-disk compaction buffer, LSbM achieves a high query performance by enabling effective buffer caching, while retaining all the merits of LSM-tree for write-intensive data processing, and providing high bandwidth of disks for range queries.
Since both NAND flash memory manufacturers and users are turning their attentions from planar architecture towards 3D architecture, it becomes more critical and urgent to have a strong understanding of the characteristics of 3D NAND flash memory. These characteristics, especially those different from planar NAND flash, are the foundations of efficient flash management. In this paper, we characterize a state-of-the-art 3D floating gate NAND flash memory through comprehensive experiments on an FPGA-based 3D NAND flash evaluation platform. Then, we present distinct observations on its performance and reliability, such as operation latencies and various error patterns, and carefully analyze them from physical and circuit-level perspectives. Although 3D NAND flash provides much higher storage densities than planar NAND flash, it faces new performance challenges in garbage collection overhead and program performance variation, and more complicated reliability issues due to e.g., distinct location dependence and value dependence of errors. We also summarize the differences of 3D NAND flash from planar NAND flash and discuss design implications on flash memory management brought by the architecture innovation. We believe that our work will facilitate developing novel 3D NAND flash-oriented designs to achieve better performance and reliability.
IBM Spectrum Scale's parallel file system General Parallel File System (GPFS) has a 20-year development history with over 100 contributing developers. Its ability to support strict POSIX semantics across more than 10K clients leads to a complex design with intricate interactions between the cluster nodes. Tracing has proven to be a vital tool to understand the behavior and the anomalies of such a complex software product. However, the necessary trace information is often buried in hundreds of gigabytes of byproduct trace records. Further, the overhead of tracing can significantly impact running applications and file system performance, limiting the use of tracing in a production system. In this article, we discuss the evolution of the mature and highly scalable GPFS tracing tool and describe the process of designing GPFS' new tracing interface, FlexTrace, which allows developers and users to accurately specify what to trace for the problem they are trying to solve. We evaluate our methodology and prototype, demonstrating that the proposed approach has negligible overhead even under intensive I/O workloads.
Graph algorithms such as graph traversal have been gaining ever-increasing importance in the era of big data. However, graph processing on traditional architectures issues many random and irregular memory accesses, leading to a huge number of data movements and the consumption of very large amounts of energy. To minimize the waste of memory bandwidth, we investigate utilizing processing-in-memory (PIM), combined with non-volatile metal-oxide resistive random access memory (ReRAM), to improve both computation and I/O performance. We propose a new ReRAM-based processing-in-memory architecture called RPBFS, in which graph data can be persistently stored and processed in place. We study the problem of graph traversal, and we design an efficient graph traversal algorithm in RPBFS. Benefiting from low data movement overhead and high bank-level parallel computation, RPBFS shows a significant performance improvement compared with both the CPU-based and the GPU-based BFS implementations. On a suite of real world graphs, our architecture yields a speedup in graph traversal performance of up to 33.8X, and achieves a reduction in energy over conventional systems of up to 142.8X.
The reuse distance (LRU stack distance) is an essential metric for performance prediction and optimization of storage cache. Over the last four decades, there have been steady improvements in the algorithmic efficiency of reuse distance measurement. This progress is accelerating in recent years both in theory and practical implementation. In this paper, we present a kinetic model of LRU cache memory, based on the average eviction time (AET) of the cached data. The AET model enables fast measurement and use of low-cost sampling. It can produce the miss ratio curve (MRC) in linear time with extremely low space costs. On storage trace benchmarks, AET reduces the time and space costs compared to former techniques. Furthermore, AET is a composable model that can characterize shared cache behavior through sampling and modeling individual programs or traces.
Emerging non-volatile memory (NVM) offers non-volatility, byte-addressability and fast access at the same time. It is suggested that programs should access NVM directly through CPU load and store instructions. To guarantee crash consistency, durable transactions become a common choice of applications for accessing persistent memory data. However, existing durable transaction systems employ either undo logging, which requires a fence for every memory write, or redo logging, which requires intercepting all memory reads within transactions. This paper presents DudeTX, a crash-consistent durable transaction system that avoids the drawbacks of both undo and redo logging. DudeTX uses shadow DRAM to decouple the execution of a durable transaction into three fully asynchronous steps. The advantage is that only minimal fences and no memory read instrumentation are required. This design enables an out-of-the-box concurrency control mechanism (transactional memory or fine-grained locks) to be used as an independent component. The evaluation results show that DudeTX adds durability to a software transactional memory system with only 7.4%~24.6% throughput degradation. Compared to the existing systems, DudeTX provides 1.7X to 4.4X higher throughput. Moreover, DudeTX can be implemented with existing hardware transactional memory or lock-based concurrency control, leading to a further 1.7X and 8.2X speedup, respectively.
Abstract The paper presents an analysis of drive workloads from enterprise storage systems. The drive workloads are obtained from field return units from a cross-section of enterprise storage system vendors and thus provides a view of the workload characteristics over a wide spectrum of end-user applications. The workload parameters that have been characterized include transfer lengths, access patterns, locality and throughput. The study shows that reads are the dominant workload accounting for 80% of the accesses to the drive. Writes are dominated by short block random accesses while reads range from random to highly sequential. A trend analysis over the period 2010-2014 shows that the workload has remained fairly constant even as the capacities of the drives shipped has steadily increased. The study shows that the data stored on disk drives is relatively cold on average less than 4% of the drive capacity is accessed in a given 2 hour interval.
With the coming of Big Data era, high-energy-efficiency database is demanded for the Internet of things (IoT) application scenarios. The emerging Resistive Random Access Memory (RRAM) has been considered as an energy-efficient replacement of DRAM for next-generation main memory. In this paper, we propose an RRAM-based SQL query unit with process-in-memory (PIM) characteristic. A bidirectional storage structure for a database in RRAM crossbar array is proposed, which avoids redundant data transfer to cache and reduces cache miss rate compared with the storage method in DRAM for in-memory database. The proposed RRAM-based SQL query unit can support a representative subset of SQL queries in memory, and thus further reduce the data transfer cost. The corresponding query optimization method is proposed to fully utilize the PIM characteristics. Simulation results show that the energy efficiency of the proposed RRAM-based SQL query unit is increased by 4 to 6 orders of magnitude compared with the traditional architecture.
Although the Multi-Level-Cell technique is widely adopted by flash-memory vendors to boost the chip density and to lower the cost, it results in serious performance and reliability problems. Different from the past work, a new cell programming method is proposed to not only significantly improve the chip performance but also reduce the potential bit error rate. In particular, a Single-Level-Cell-like programming scheme is proposed to better explore the threshold-voltage relationship to denote different Multi-Level-Cell bit information, which in turn drastically provides a larger window of threshold voltage similar to that found in Single-Level-Cell chips. It could result in less programming iterations and simultaneously a much less reliability problem in programming flash-memory cells. In the experiments, the new programming scheme could accelerate the programming speed up to 742% and even reduce the bit error rate up to 471% for Multi-Level-Cell pages.
With the advanced technology in persistent random access memory (PRAM), PRAM such as 3D XPoint memory is emerging as a promising candidate for the next-generation medium for both (main) memory and storage. Previous works mainly focus on how to overcome the possible endurance issues of PRAM while both main memory and storage own a partition on the same PRAM device. However, a holistic software-level system design should be proposed to fully exploit the benefit of PRAM. This paper proposes a union storage file system (UnistorFS), which aims to jointly manage the PRAM resource for main memory and storage. It realizes the concept of using the PRAM resource as memory and storage interchangeably, so as to achieve resource sharing while main memory and storage coexist on the same PRAM device with no partition or logical boundary. This approach not only enables PRAM resource sharing but also eliminates unnecessary data movements between main memory and storage since they are already in the same address space and can be accessed directly. A series of experiments was conducted on a modified Linux kernel. The results show that the proposed UnistorFS can eliminate unnecessary memory accesses and outperform other PRAM-based file systems.
Coupled with the scaling down of flash technology, the popularity of flash memory motivates the search for methods to increase flash reliability and lifetime. Erasures are the dominant cause of flash cell wear, but reducing them is challenging because flash is a write-once medium---memory cells must be erased prior to writing. An approach that has recently received considerable attention relies on write-once memory (WOM) codes, designed to accommodate additional writes on write-once media. However, the techniques proposed for reusing flash pages with WOM codes are limited in their scope. Many focus on the coding theory alone, while others suggest FTL designs that are application specific, or not applicable due to their complexity, overheads, or specific constraints of MLC flash. This work is the first that addresses all aspects of page reuse within an end-to-end analysis of a general-purpose FTL. We use a hardware evaluation setup to directly measure the short and long-term effects of page reuse on durability and energy consumption, and show that FTL design must take them into account. We provide a detailed analytical model for deriving optimal garbage collection for such FTL designs, and for predicting the benefit from reuse on realistic hardware and workload characteristics.
The byte-addressable non-volatile memory (NVM) is going to reshape conventional computer systems. With advantages of low latency, byte-addressability and non-volatility, NVM can be directly put on the memory bus to replace DRAM. As a result, both system and application softwares have to be adjusted to perceive the fact that the persistent layer moves up to the memory. However, most of the current in-memory data structures will be problematic with consistency issues if not well tuned with NVM. This paper places emphasizes on an important in-memory structure that is widely used in computer systems, i.e., the Red/Black-tree (RB-tree). Since it has a long and complicated update process, the RB-tree is prone to inconsistency problems with NVM. This paper presents an NVM-compatible consistent RB-tree with a new technique named cascade-versioning. The proposed RB-tree (i) is all-time consistent and scalable, and (ii) needs no recovery procedure after system crashes. Experiment results show that the RB-tree for NVM not only achieves the aim of consistency with insignificant spatial overhead, but also yields comparable performance to an ordinary volatile RB-tree.