We develop OrcFS, Orchestrated File System for Flash storage. It vertically integrates the log-structured file system and the Flash-based storage device eliminating all the redundancies across the layers. A few modern file systems adopt sophisticated append-only data structures to manage its space in an effort to optimize the behavior with respect to the append-only nature of the Flash medium. While the benefit of adopting append-only data structure seems to be fairly promising, it makes the stack of software layers full of unnecessary redundancies which leaves substantial room for improvement. The redundancies include (i) redundant levels of indirection (address translation) , (ii) duplicate efforts to reclaim the invalid blocks which is called segment cleaning and garbage collection in the file system layer and in the storage device, respectively, and (iii) excessive over-provisioning, i.e. separate over-provisioning areas in each layer. OrcFS eliminates the redundancies via distributing the address translation, segment cleaning (or garbage collection), bad block management, and wear-leveling across the layers. OrcFS reduces the device mapping table requirement to 1/465 against the page mapping and removes 1/4 of the write volume under heavy random write workload. In varmail, OrcFS achieves 56% performance gain against EXT4.
Deduplication is essential in disk-based backup systems, but there are few long-term studies of backup workloads. Past studies either were of a small static snapshot or covered only a short periods of time. We collected 21 months of data from a shared user file system (33 users and over 4,000 snapshots). We analyzed the data set, examining a variety of essential characteristics across two dimensions: single-node deduplication and cluster deduplication. For single-node analysis, our focus was individual-user data. Despite apparently similar roles among our users, we found significant differences in their deduplication ratios; and we found high deduplication ration among some users. For cluster deduplication analysis, we implemented seven published data-routing algorithms and created a detailed comparison of their performance with respect to deduplication ratio, load distribution, and communication overhead. We found that per-file routing achieves a higher deduplication ratio than routing by super-chunk, but it also leads to high data imbalance across nodes. We found that large chunking sizes are better for cluster deduplication, as they reduce data-routing overhead, while their impact on deduplication ratios was small. We draw interesting conclusions from both single-node and cluster deduplication analysis, and make recommendations for future deduplication systems design.
LSM-tree has been widely used in data management production systems for write-intensive workloads. However, as read and write workloads co-exist under LSM-tree, data accesses can experience long latency and low throughput due to the interferences to buffer caching from the compaction, a major and frequent operation in LSM-tree. After a compaction, the existing data blocks are reorganized and written to other locations on disks. As a result, the related data blocks that have been loaded in the buffer cache are invalidated since their referencing addresses are changed, causing serious performance degradations. In order to re-enable high-speed buffer caching during intensive writes, we propose Log-Structured buffered-Merge tree (simplified as LSbM-tree) by adding a compaction buffer on disks, to minimize the cache invalidations on buffer cache caused by compactions. The compaction buffer efficiently and adaptively maintains the frequently visited data sets. In LSbM, strong locality objects can be effectively kept in the buffer cache with minimum or without harmful invalidations. With the help of a small on-disk compaction buffer, LSbM achieves a high query performance by enabling effective buffer caching, while retaining all the merits of LSM-tree for write-intensive data processing, and providing high bandwidth of disks for range queries.
Since both NAND flash memory manufacturers and users are turning their attentions from planar architecture towards 3D architecture, it becomes more critical and urgent to have a strong understanding of the characteristics of 3D NAND flash memory. These characteristics, especially those different from planar NAND flash, are the foundations of efficient flash management. In this paper, we characterize a state-of-the-art 3D floating gate NAND flash memory through comprehensive experiments on an FPGA-based 3D NAND flash evaluation platform. Then, we present distinct observations on its performance and reliability, such as operation latencies and various error patterns, and carefully analyze them from physical and circuit-level perspectives. Although 3D NAND flash provides much higher storage densities than planar NAND flash, it faces new performance challenges in garbage collection overhead and program performance variation, and more complicated reliability issues due to e.g., distinct location dependence and value dependence of errors. We also summarize the differences of 3D NAND flash from planar NAND flash and discuss design implications on flash memory management brought by the architecture innovation. We believe that our work will facilitate developing novel 3D NAND flash-oriented designs to achieve better performance and reliability.
IBM Spectrum Scale's parallel file system General Parallel File System (GPFS) has a 20-year development history with over 100 contributing developers. Its ability to support strict POSIX semantics across more than 10K clients leads to a complex design with intricate interactions between the cluster nodes. Tracing has proven to be a vital tool to understand the behavior and the anomalies of such a complex software product. However, the necessary trace information is often buried in hundreds of gigabytes of byproduct trace records. Further, the overhead of tracing can significantly impact running applications and file system performance, limiting the use of tracing in a production system. In this article, we discuss the evolution of the mature and highly scalable GPFS tracing tool and describe the process of designing GPFS' new tracing interface, FlexTrace, which allows developers and users to accurately specify what to trace for the problem they are trying to solve. We evaluate our methodology and prototype, demonstrating that the proposed approach has negligible overhead even under intensive I/O workloads.
The reuse distance (LRU stack distance) is an essential metric for performance prediction and optimization of storage cache. Over the last four decades, there have been steady improvements in the algorithmic efficiency of reuse distance measurement. This progress is accelerating in recent years both in theory and practical implementation. In this paper, we present a kinetic model of LRU cache memory, based on the average eviction time (AET) of the cached data. The AET model enables fast measurement and use of low-cost sampling. It can produce the miss ratio curve (MRC) in linear time with extremely low space costs. On storage trace benchmarks, AET reduces the time and space costs compared to former techniques. Furthermore, AET is a composable model that can characterize shared cache behavior through sampling and modeling individual programs or traces.
Emerging non-volatile memory (NVM) offers non-volatility, byte-addressability and fast access at the same time. It is suggested that programs should access NVM directly through CPU load and store instructions. To guarantee crash consistency, durable transactions become a common choice of applications for accessing persistent memory data. However, existing durable transaction systems employ either undo logging, which requires a fence for every memory write, or redo logging, which requires intercepting all memory reads within transactions. This paper presents DudeTX, a crash-consistent durable transaction system that avoids the drawbacks of both undo and redo logging. DudeTX uses shadow DRAM to decouple the execution of a durable transaction into three fully asynchronous steps. The advantage is that only minimal fences and no memory read instrumentation are required. This design enables an out-of-the-box concurrency control mechanism (transactional memory or fine-grained locks) to be used as an independent component. The evaluation results show that DudeTX adds durability to a software transactional memory system with only 7.4%~24.6% throughput degradation. Compared to the existing systems, DudeTX provides 1.7X to 4.4X higher throughput. Moreover, DudeTX can be implemented with existing hardware transactional memory or lock-based concurrency control, leading to a further 1.7X and 8.2X speedup, respectively.
Abstract The paper presents an analysis of drive workloads from enterprise storage systems. The drive workloads are obtained from field return units from a cross-section of enterprise storage system vendors and thus provides a view of the workload characteristics over a wide spectrum of end-user applications. The workload parameters that have been characterized include transfer lengths, access patterns, locality and throughput. The study shows that reads are the dominant workload accounting for 80% of the accesses to the drive. Writes are dominated by short block random accesses while reads range from random to highly sequential. A trend analysis over the period 2010-2014 shows that the workload has remained fairly constant even as the capacities of the drives shipped has steadily increased. The study shows that the data stored on disk drives is relatively cold on average less than 4% of the drive capacity is accessed in a given 2 hour interval.
Since little has been reported in the literature concerning enterprise storage system file-level request scheduling, we have minimal knowledge about how various scheduling factors affect performance. Moreover, we are in lack of a good understanding on how to enhance request scheduling to adapt to the changing characteristics of workloads and hardware resources. To answer these questions, we first build a request scheduler prototype based on WAFL, a mainstream enterprise file system running on numerous enterprise storage systems world-wide. Next, we use the prototype to quantitatively measure the impact of various scheduling configurations on performance on a NetApp's enterprise-class storage system. Several observations have been made. For example, we discover that in order to improve performance the priority of write requests and non-preempted restarted requests should be boosted in some workloads. Inspired by these observations, we further propose two scheduling enhancement heuristics called SORD (size-oriented request dispatching) and QATS (queue-depth aware time slicing). Finally, we evaluate them by conducting a wide range of experiments using workloads generated by SPC-1 and SFS2014 on both HDD-based and all- ash platforms. Experimental results show that the combination of the two noticeably reduces request latency under some workloads.
Although the Multi-Level-Cell technique is widely adopted by flash-memory vendors to boost the chip density and to lower the cost, it results in serious performance and reliability problems. Different from the past work, a new cell programming method is proposed to not only significantly improve the chip performance but also reduce the potential bit error rate. In particular, a Single-Level-Cell-like programming scheme is proposed to better explore the threshold-voltage relationship to denote different Multi-Level-Cell bit information, which in turn drastically provides a larger window of threshold voltage similar to that found in Single-Level-Cell chips. It could result in less programming iterations and simultaneously a much less reliability problem in programming flash-memory cells. In the experiments, the new programming scheme could accelerate the programming speed up to 742% and even reduce the bit error rate up to 471% for Multi-Level-Cell pages.