Deduplication is essential in disk-based backup systems, but there are few long-term studies of backup workloads. Past studies either examined a small static snapshot or covered only a short period of time. We collected 21 months of data from a shared user file system (33 users and over 4,000 snapshots). We analyzed the data set, examining a variety of essential characteristics along two dimensions: single-node deduplication and cluster deduplication. For single-node analysis, our focus was individual-user data. Despite apparently similar roles among our users, we found significant differences in their deduplication ratios, and some users exhibited notably high ratios. For cluster deduplication analysis, we implemented seven published data-routing algorithms and compared their performance in detail with respect to deduplication ratio, load distribution, and communication overhead. We found that per-file routing achieves a higher deduplication ratio than routing by super-chunk, but it also leads to high data imbalance across nodes. We also found that large chunk sizes are better for cluster deduplication: they reduce data-routing overhead, while their impact on deduplication ratios is small. We draw conclusions from both the single-node and cluster deduplication analyses and make recommendations for future deduplication system design.
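To make the per-file versus super-chunk trade-off concrete, here is a minimal sketch of the two routing styles in a stateless, fingerprint-based form. It is not a reproduction of any of the seven algorithms the study implemented; the names (`route_per_file`, `route_per_super_chunk`) and the parameters `NUM_NODES` and `SUPER_CHUNK_SIZE` are illustrative assumptions.

```python
import hashlib

NUM_NODES = 8          # cluster size (illustrative)
SUPER_CHUNK_SIZE = 16  # chunks grouped per super-chunk (illustrative)

def fingerprint(chunk: bytes) -> int:
    """Content fingerprint of a chunk, here SHA-1, as an integer."""
    return int.from_bytes(hashlib.sha1(chunk).digest(), "big")

def route_per_file(chunks: list[bytes]) -> int:
    """Per-file routing: all chunks of a file go to one node, chosen
    from a representative fingerprint (here, the file's first chunk).
    Keeps a file's duplicates together but can skew node loads."""
    return fingerprint(chunks[0]) % NUM_NODES

def route_per_super_chunk(chunks: list[bytes]):
    """Stateless super-chunk routing: group consecutive chunks and send
    each group to the node owning its minimum fingerprint, trading some
    deduplication for finer-grained load spreading."""
    for i in range(0, len(chunks), SUPER_CHUNK_SIZE):
        group = chunks[i:i + SUPER_CHUNK_SIZE]
        representative = min(fingerprint(c) for c in group)
        yield representative % NUM_NODES, group
```

Because per-file routing ties an entire file (however large) to a single node, it maximizes intra-file deduplication but is the source of the data imbalance noted above; super-chunk routing spreads load at the cost of splitting some duplicate runs across nodes.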
In this paper, we advocate reconsidering the cache system design and directly exposing device-level details of the underlying flash storage to key-value caching. We propose an enhanced flash-aware key-value cache manager, which consists of a novel unified address-mapping module, an integrated garbage-collection policy, dynamic over-provisioning space management, and a customized wear-leveling policy, to directly drive flash management. A thin intermediate library layer provides a slab-based abstraction of the low-level flash memory space and an API for operating flash devices directly and easily. A special flash SSD that exposes physical flash details is adopted to store key-value items. This co-design approach bridges the semantic gap and connects the two layers, allowing us to leverage both the domain knowledge of key-value caches and the unique device properties. In this way, we can maximize the efficiency of key-value caching on flash devices while minimizing their weaknesses. We implemented a prototype, called DIDACache, based on the Open-Channel SSD platform. Our experiments on real hardware show that we can significantly increase throughput by 35.5%, reduce latency by 23.6%, and reduce unnecessary erase operations by 28%.
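The key idea of the unified address mapping is that one key-to-flash-location index replaces the usual stack of cache index plus FTL mapping. The following is a hypothetical sketch of what such a slab-over-flash abstraction might look like; it is not DIDACache's actual library API, and the class names, slab geometry, and the simplistic "grab a fresh slab" fallback (where real garbage collection would run) are all assumptions for illustration.

```python
class FlashSlab:
    """One slab mapped onto a flash block: pages are written strictly
    sequentially and space is reclaimed only by erasing the whole block."""
    def __init__(self, block_id: int, num_pages: int, page_size: int):
        self.block_id = block_id
        self.page_size = page_size
        self.pages = [None] * num_pages
        self.write_ptr = 0  # next append position (flash is append-only)

    def append(self, item: bytes) -> int:
        if len(item) > self.page_size:
            raise ValueError("item exceeds page size")
        if self.write_ptr == len(self.pages):
            raise RuntimeError("slab full")
        self.pages[self.write_ptr] = item
        self.write_ptr += 1
        return self.write_ptr - 1  # page offset within the slab

class SlabCache:
    """Unified mapping: key -> (slab, page offset), with no separate
    device-internal FTL translation underneath."""
    def __init__(self, num_blocks: int, num_pages: int = 256,
                 page_size: int = 4096):
        self.free = [FlashSlab(b, num_pages, page_size)
                     for b in range(num_blocks)]
        self.active = self.free.pop()
        self.index = {}  # key -> (FlashSlab, offset)

    def put(self, key: str, value: bytes) -> None:
        try:
            off = self.active.append(value)
        except RuntimeError:
            # In a real design, integrated GC would evict stale items
            # and erase a slab here; we just take a fresh one.
            self.active = self.free.pop()
            off = self.active.append(value)
        self.index[key] = (self.active, off)

    def get(self, key: str):
        loc = self.index.get(key)
        if loc is None:
            return None
        slab, off = loc
        return slab.pages[off]
```

Because the cache manager itself decides which slab to erase, garbage collection can evict cold key-value items outright instead of copying them, which is the source of the erase-operation savings reported above.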
Since little has been reported in the literature on file-level request scheduling in enterprise storage systems, we know little about how various scheduling factors affect performance. Moreover, we lack a good understanding of how to adapt request scheduling to the changing characteristics of workloads and hardware resources. To answer these questions, we first build a request-scheduler prototype based on WAFL, a mainstream enterprise file system running on numerous enterprise storage systems worldwide. Next, we use the prototype to quantitatively measure the impact of various scheduling configurations on the performance of a NetApp enterprise-class storage system. We make several observations; for example, we discover that in some workloads, boosting the priority of write requests and of non-preempted restarted requests improves performance. Inspired by these observations, we propose two scheduling-enhancement heuristics: SORD (size-oriented request dispatching) and QATS (queue-depth-aware time slicing). Finally, we evaluate them with a wide range of experiments using workloads generated by SPC-1 and SFS2014 on both HDD-based and all-flash platforms. Experimental results show that the combination of the two noticeably reduces request latency under some workloads.
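As a rough illustration of the size-oriented idea behind SORD, the sketch below dispatches the smallest queued request first while letting waiting time earn a credit so large requests cannot starve. This is an assumption-laden stand-in, not the paper's heuristic: the class name `SordDispatcher`, the linear aging rule, and the `AGE_CREDIT` constant are all invented for illustration.

```python
class SordDispatcher:
    """Size-oriented dispatching sketch: pick the request with the
    smallest effective size, where each millisecond of waiting shrinks
    a request's effective size so large requests eventually run."""
    AGE_CREDIT = 4096  # bytes of credit per ms waited (made-up constant)

    def __init__(self):
        self._queue = []  # entries: (arrival_ms, size_bytes, request)

    def submit(self, request, size_bytes: int, now_ms: int) -> None:
        self._queue.append((now_ms, size_bytes, request))

    def dispatch(self, now_ms: int):
        if not self._queue:
            return None
        # Effective priority = size minus age credit; smaller wins.
        entry = min(self._queue,
                    key=lambda e: e[1] - self.AGE_CREDIT * (now_ms - e[0]))
        self._queue.remove(entry)
        return entry[2]
```

QATS would complement this by varying how long each priority class holds the device based on the observed queue depth, shortening slices when queues are deep so latency-sensitive requests are not pinned behind a long slice.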
The emerging Phase Change Memory (PCM) is considered a promising candidate to replace DRAM as the next-generation main memory, since it has better scalability and lower leakage power. However, high write power consumption remains a major challenge to adopting PCM as main memory. Besides the high write current and voltage required to program PCM cells, current loss in the charge pumps (CPs) also accounts for a large share of the power consumption. The pumping efficiency of a PCM chip is a concave function of the write current; by Jensen's inequality, the average efficiency under a fluctuating current is at most the efficiency at the average current, so overall pumping efficiency improves when the write current is kept uniform. In this paper, we propose the peak-to-average (PTA) write scheme, which smooths write-current fluctuation by regrouping write units. Specifically, we calculate the current requirement of each write unit from its data values when it is evicted from the last-level cache (LLC). While write units wait in the memory controller, we regroup them using two efficient online algorithms to approach the current-uniformity goal. Experimental results show that LLC-assisted PTA achieves 9.7% overall energy savings compared to the baseline.
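A minimal sketch of the regrouping idea follows: sort pending write units by their estimated current draw and interleave cheap and expensive units into each group, so every group dispatched from the memory-controller queue draws roughly the average current. This greedy pairing is an illustrative stand-in, not the paper's two online algorithms; the function name, the `current_mA` field, and the group size are assumptions.

```python
def regroup_pta(units, group_size=4):
    """Pair low-current with high-current write units so each group's
    total current is close to the overall average (peak-to-average)."""
    ordered = sorted(units, key=lambda u: u["current_mA"])
    groups, lo, hi = [], 0, len(ordered) - 1
    while lo <= hi:
        group = []
        # Alternate picking the cheapest and the most expensive unit.
        while len(group) < group_size and lo <= hi:
            group.append(ordered[lo]); lo += 1
            if len(group) < group_size and lo <= hi:
                group.append(ordered[hi]); hi -= 1
        groups.append(group)
    return groups

# Example: eight units with skewed current estimates.
units = [{"id": i, "current_mA": c}
         for i, c in enumerate([5, 7, 8, 10, 40, 42, 45, 50])]
for g in regroup_pta(units):
    print([u["id"] for u in g], "total:",
          sum(u["current_mA"] for u in g))
```

On this toy input the two groups draw 107 mA and 100 mA, versus 30 mA and 177 mA under naive FIFO grouping; flattening the peak in this way is exactly what lets the concave charge pumps operate near their best efficiency point.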