Although a set of reactive fault tolerant measures such as RAID have been implemented, it is still a tough issue to enhance the reliability of large-scale storage systems. Proactive prediction is an effective method to avoid the possible hard drive failures in advance. A series of models based on the self-monitoring, analysis and reporting technology (SMART) have been proposed to predict impending hard drive failures. Unfortunately, there remain some serious yet unsolved challenges like the lack of explainability of prediction results. In order to address these issues, we carefully analyze a dataset collected from a real-world large-scale storage system. Based on the insights gotten from the analysis, we design an attention-augmented deep architecture for hard drive health status assessment and failure prediction in this paper. The deep architecture, named AMENDER, can not only monitor the status of hard drives but also assist in failure cause diagnose. We evaluate AMENDER through large amounts of experiments based on the real-world datasets.
Persistent key-value (KV) stores mostly build on the Log-Structured Merge (LSM) tree for high write performance, yet the LSM-tree suffers from the inherently high I/O amplification. KV separation mitigates I/O amplification by storing only keys in the LSM-tree and values in separate storage. However, the current KV separation design remains inefficient under update-intensive workloads due to its high garbage collection (GC) overhead in value storage. We propose HashKV, which aims for high update performance atop KV separation under update-intensive workloads. HashKV uses hash-based data grouping, which deterministically maps values to storage space so as to make both updates and GC efficient. We further relax the restriction of such deterministic mappings via simple but useful design extensions. We extensively evaluate various design aspects of HashKV. We show that HashKV achieves 4.6x update throughput and 53.4% less write traffic compared to the current KV separation design. In addition, we demonstrate that we can integrate the design of HashKV with state-of-the-art KV stores and improve their respective performance.
Existing RAID solutions partition large disk enclosures so that each RAID group uses its own disks exclusively. This achieves good performance isolation across underlying disk groups, at the cost of disk under-utilization and slow RAID reconstruction from disk failures. We propose RAID+, a new RAID construction mechanism that spreads both normal I/O and reconstruction workloads to a larger disk pool in a balanced manner. Unlike systems conducting randomized placement, RAID+ employs deterministic addressing enabled by the mathematical properties of mutually orthogonal Latin squares, based on which it constructs 3-D data templates mapping a logical data volume to uniformly distributed disk blocks across all disks. While the total read/write volume remains unchanged, with or without disk failures, many more disk drives participate in data service and disk reconstruction. Our evaluation with a 60-drive disk enclosure using both synthetic and real-world workloads shows that RAID+ significantly speeds up data recovery while delivering better normal I/O performance and higher multi-tenant system throughput.
This paper studies the output behavior of the Titan supercomputer and its Lustre file stores. We introduce a statistical benchmarking methodology that collects/combines samples over times and settings: 1) To measure the performance impact of parameter choices against the interference in the production setting; 2) to derive the performance of individual stages/components in the multi-stage write pipelines, and their variations over time. We find that Titan's I/O system is highly variable with two major implications: 1) Stragglers lessen the benefit of coupled I/O parallelism. I/O parallelism is most effective when the application distributes the I/O load so that each target stores files for multiple clients and each client writes files on multiple targets, in a balanced way with minimal contention. 2) our results also suggest that the potential benefit of dynamic adaptation is limited. In particular, it is not fruitful to attempt to identify "good locations" in the machine or in the file system: component performance is driven by transient load conditions, and past performance is not a useful predictor of future performance. For example, we do not observe diurnal load patterns that are predictable.