Due to the record transformation in storage layer, the unnecessary processing costs derived from either unwanted fields or unsatisfied rows may be very heavy in complex schemas, significantly wasting the computational resources in the large-scale analytical workloads. We present CORES (Column-Oriented Regeneration Embedding Scheme) to push highly-selective filters down into the column-based storage, where each filter consists of several filtering conditions on a field. By applying highly-selective filters to column scan in storage, we demonstrate that both the IO and the deserialization cost could be significantly reduced by introducing a fine-gained composition based on bitset. We also generalize this technique by two pair-wise operations rollup and drilldown, such that a series of conjunctive filters can effectively deliver their payloads in nested schema. The proposed methods are implemented on an open source platform. For practical purposes, we highlight how to effectively construct a nested column storage and efficiently drive multiple filters by a cost model. We apply this design to the nested relational model especially when hierarchical entities are frequently required by ad-hoc queries. The experiments, covering a real workload and the modified TPCH benchmark, demonstrate that CORES improves the performance by 0.7X<26.9X compared to the state-of-the-art platforms in scan-intensive workloads.
We introduce TxFS, a transactional file system that builds upon a file system?s atomic-update mechanism such as journaling. Though prior work has explored a number of transactional file systems, TxFS has a unique set of properties: a simple API, portability across different hardware, high performance, low complexity (by building on the file-system journal), and full ACID transactions. We port SQLite, OpenLDAP, and Git to use TxFS, and experimentally show that TxFS provides strong crash consistency while providing equal or better performance.
Synchronous I/O has long been a design challenge in file systems. Although open-channel SSDs provide better performance and endurance to file systems, they still suffer from synchronous I/Os due to the amplified writes and worse hot/cold data grouping. The reason lies in the controversy design choices between flash write and read/erase operations. While fine-grained logging improves performance and endurance in writes, it hurts indexing and data grouping efficiency in read and erase operations. In this paper, we propose a flash-friendly data layout by introducing a built-in persistent staging layer, to provide balanced read, write and garbage collection performance. Based on this, we design a new flash file system named StageFS, which decouples the content and structure updates. Content updates are logically logged to the staging layer in a persistence-efficient way, which achieves better write performance and lower write amplification. The updated contents are reorganized into the normal data area for structure update, with improved hot/cold grouping and in page-level indexing way, which is more friendly to read and garbage collection operations. Evaluation results show that, compared to F2FS, StageFS effectively improves performance by up to 211.4% and achieves low garbage collection overhead for workloads with frequent synchronization.
This paper presents a design framework aiming to reduce mass data storage cost in data centers. Its underlying principle is simple: One may noticeably reduce HDD manufacturing cost by significantly (i.e., at least several orders of magnitude) relaxing raw HDD reliability, and meanwhile ensure the eventual data storage integrity via low-cost system-level redundancy. This is called system-assisted HDD bit cost reduction. In order to better utilize both capacity and random IOPS of HDDs, it is desirable to mix data with complementary requirements on capacity and random IOPS in each HDD. Nevertheless, different capacity and random IOPS requirements may demand different raw HDD reliability vs. bit cost trade-offs and hence different forms of system-assisted bit cost reduction. This paper presents a software-centric design framework to realize data-adaptive system-assisted bit cost reduction for data center HDDs. Aiming to improve its practical feasibility, its implementation is solely handled by filesystem and demands an only minor change of the error correction coding (ECC) module inside HDDs. Hence, it is completely transparent to all the other components in the software stack and keeps fundamental HDD design practice intact. We carried out analysis and experiments to evaluate its implementation feasibility and effectiveness.
Traditionally, file systems were implemented as part of OS kernels. As complexity of file systems grew, many new file systems began being developed in user space. Low performance is considered the main disadvantage of user-space file systems but the extent of this problem has never been explored systematically. As a result, the topic of user-space file systems remains rather controversial: while some consider user-space file systems a "toy" not to be used in production, others develop full-fledged production file systems in user space. In this article we analyze the design and implementation of the most widely known user-space file system framework, FUSE, for Linux. We then characterize its performance and resource utilization for a wide range of workloads. Our experiments indicate that depending on the workload and hardware used, throughput degradation caused by FUSE can be completely imperceptible or as high as -83%, even when optimized; latencies of FUSE file system operations can increase from none to 4x when compared to in-kernel Ext4. On the resource utilization side, FUSE can increase relative CPU utilization by up to 31% and underutilize disk bandwidth by as much as -80% compared to Ext4.
Introduction to the Special Section on 2018 USENIX Annual Technical Conference (ATC '18)
Non-volatile memory (NVM) as persistent memory is expected to substitute or complement DRAM in memory hierarchy, due to the strengths of non-volatility, high density, and near-zero standby power. However, due to the requirement of data consistency and hardware limitations of NVM, traditional indexing techniques originally designed for DRAM become inefficient in persistent memory. To efficiently index the data in persistent memory, this paper proposes a write-optimized and high-performance hashing index scheme, called level hashing, with low-overhead consistency guarantee and cost-efficient resizing. Level hashing provides a sharing-based two-level hash table, which achieves a constant-scale search/insertion/deletion/update time complexity in the worst case and rarely incurs extra NVM writes. To cost-efficiently resize this hash table, level hashing leverages an in-place resizing scheme that only needs to rehash 1/3 of buckets instead of the entire table to expand a hash table and rehash 2/3 of buckets to shrink a hash table, thus significantly reducing the number of rehashed buckets and improving the resizing performance. Experimental results demonstrate that level hashing achieves 1.4×?3.0× speedup for insertions, 1.2×?2.1× speedup for updates, 4.3× speedup for expanding, and 1.4× speedup for shrinking a hash table, while maintaining high search and deletion performance, compared with state-of-the-art hashing schemes
Efficient transaction processing over large databases is a key requirement for many mission-critical applications. Though modern databases have achieved good performance through horizontal partitioning, their performance deteriorates when cross-partition distributed transactions have to be executed. This paper presents Solar, a distributed relational database system that has been successfully tested at a large commercial bank. The key features of Solar include: 1) a shared-everything architecture based on a two-layer log-structured merge-tree; 2) a new concurrency control algorithm that works with the log-structured storage, which ensures efficient and non-blocking transaction processing even when the storage layer is compacting data among nodes in the background; 3) fine-grained data access to effectively minimize and balance network communication within the cluster. According to our empirical evaluations on TPC-C, Smallbank and a real-world workload, Solar outperforms the existing shared-nothing systems by up to 50x when there are close to or more than 5% distributed transactions.