2014, Proceedings of the VLDB Endowment
Compression has historically been used to reduce the cost of storage, I/Os from that storage, and buffer pool utilization, at the expense of the CPU required to decompress data every time it is queried. However, significant additional CPU efficiencies can be achieved by deferring decompression as late in query processing as possible and performing query processing operations directly on the still-compressed data. In this paper, we investigate the benefits and challenges of performing joins on compressed (or encoded) data. We demonstrate the benefit of independently optimizing the compression scheme of each join column, even though join predicates relating values from multiple columns may require translation of the encoding of one join column into the encoding of the other. We also show the benefit of compressing "payload" data other than the join columns "on the fly," to minimize the size of hash tables used in the join. By partitioning the domain of each column ...
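To make the encoding-translation idea concrete, here is a minimal sketch under my own assumptions (the names and structure are illustrative, not the paper's implementation): two columns are dictionary-encoded independently, and a one-time code-translation table lets the hash join probe on small integers without ever decoding a value.

```python
# Hypothetical sketch: joining two independently dictionary-encoded columns
# without decompressing row values. Instead of decoding every code back to
# its string, we build a one-time translation table mapping codes of R's
# join column into codes of S's join column, then probe on small integers.

def build_translation(dict_r, dict_s):
    """Map each code in R's dictionary to the matching code in S's
    dictionary (or None if the value is absent from S)."""
    reverse_s = {value: code for code, value in enumerate(dict_s)}
    return [reverse_s.get(value) for value in dict_r]

def join_encoded(col_r, col_s, dict_r, dict_s):
    """Hash join on encoded columns; values are decoded zero times."""
    translate = build_translation(dict_r, dict_s)
    # Build side: hash table keyed by S's (already small) integer codes.
    table = {}
    for i, code in enumerate(col_s):
        table.setdefault(code, []).append(i)
    # Probe side: translate R's code once, then probe with the integer.
    for j, code in enumerate(col_r):
        s_code = translate[code]
        if s_code is not None:
            for i in table.get(s_code, []):
                yield (j, i)

dict_r = ["ann", "bob", "eve"]          # R's encoding: ann=0, bob=1, eve=2
dict_s = ["bob", "eve", "zoe"]          # S's encoding: bob=0, eve=1, zoe=2
col_r = [0, 1, 2, 1]                    # encoded join column of R
col_s = [0, 0, 1, 2]                    # encoded join column of S
print(list(join_encoded(col_r, col_s, dict_r, dict_s)))
```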
ACM SIGMOD Record, 2001
Over the last decades, improvements in CPU speed have outpaced improvements in main memory and disk access rates by orders of magnitude, enabling the use of data compression techniques to improve the performance of database systems. Previous work describes the benefits of compression for numerical attributes, where data is stored in compressed format on disk. Despite the abundance of string-valued attributes in relational schemas, there is little work on compression for string attributes in a database context. Moreover, none of the previous work suitably addresses the role of the query optimizer: during query execution, data is either eagerly decompressed when it is read into main memory, or data lazily stays compressed in main memory and is decompressed on demand only. In this paper, we present an effective approach for database compression based on lightweight, attribute-level compression techniques. We propose a Hierarchical Dictionary Encoding strategy that intelligently selects ...
2005
Compression is a known technique used by many database management systems ("DBMS") to increase performance [4, 5, 14]. However, not much research has been done on how compression can be used within column-oriented architectures. Storing data in columns increases the similarity between adjacent records, thus increasing the compressibility of the data. In addition, compression schemes not traditionally used in row-oriented DBMSs can be applied to column-oriented systems. This thesis presents a column-oriented query executor designed to operate directly on compressed data. We show that operating directly on compressed data can improve query performance. Additionally, the choice of compression scheme depends on the expected query workload, suggesting that for ad-hoc queries we may wish to store a column redundantly under different coding schemes. Furthermore, the executor is designed to be extensible so that the addition of new compression schemes does not impact operator implementation. The executor is part of a larger database system, known as C-Store [10].
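As one illustration of what operating directly on compressed data can mean, the following toy sketch (my own example, not the thesis's executor) sums a run-length-encoded column while touching one entry per run instead of one per row.

```python
# A minimal sketch of an operator that works directly on run-length-encoded
# (RLE) data. Summing a compressed column costs one multiply-add per run
# rather than one addition per row.

def rle_encode(values):
    """Compress a column into (value, run_length) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs

def rle_sum(runs):
    """Aggregate without decompressing: value * run_length per run."""
    return sum(v * n for v, n in runs)

column = [5, 5, 5, 7, 7, 3]        # sorted/clustered columns compress well
runs = rle_encode(column)          # [[5, 3], [7, 2], [3, 1]]
assert rle_sum(runs) == sum(column)
```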
Proceedings of the 18th ACM Conference on Information and Knowledge Management - CIKM '09, 2009
We present a new class of adaptive algorithms that use compressed bitmap indexes to speed up evaluation of the range join query in relational databases. We determine the best strategy to process a join query based on a fast sub-linear time computation of the join selectivity (the ratio of the number of tuples in the result to the total number of possible tuples). In addition, we use compressed bitmaps to represent the join output compactly: the space requirement for storing the tuples representing the join of two relations is asymptotically bounded by min(h, n · c_b), where h is the number of tuple pairs in the result relation, n is the number of tuples in the smaller of the two relations, and c_b is the cardinality of the larger column being joined. We present a theoretical analysis of our algorithms, as well as experimental results on large-scale synthetic and real data sets. Our implementations are efficient, and consistently outperform well-known approaches for a range of join selectivity factors. For instance, our count-only algorithm is up to three orders of magnitude faster than the sort-merge approach, and our best bitmap index-based algorithm is 1.2x-80x faster than the sort-merge algorithm, for various query instances. We achieve these speedups by exploiting several inherent performance advantages of compressed bitmap indexes for join processing: an implicit partitioning of the attributes, space-efficiency, and tolerance of high-cardinality relations.
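The count-only flavor of such an algorithm can be sketched compactly; here Python integers stand in for the paper's compressed bitmaps, and all names are illustrative assumptions rather than the paper's implementation.

```python
# Hedged sketch of the count-only idea: represent, for each distinct key,
# the set of matching rows as a bitmap, then compute the join size by
# pairing key sets instead of sorting tuples.

def build_bitmaps(column):
    """One bitmap per distinct value; bit i is set if row i holds it."""
    bitmaps = {}
    for row, key in enumerate(column):
        bitmaps[key] = bitmaps.get(key, 0) | (1 << row)
    return bitmaps

def join_count(col_r, col_s):
    """Number of result tuples: sum over shared keys of |R_k| * |S_k|."""
    bm_r, bm_s = build_bitmaps(col_r), build_bitmaps(col_s)
    return sum(bin(bm_r[k]).count("1") * bin(bm_s[k]).count("1")
               for k in bm_r.keys() & bm_s.keys())

print(join_count([1, 2, 2, 3], [2, 2, 3, 4]))   # 2*2 + 1*1 = 5
```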
Journal of Computers, 2009
Lossless data compression is potentially attractive in database applications for storage cost reduction and performance improvement. The existing compression architectures work well for small to large databases and provide good performance. But these systems can ...
Proceedings of the 2009 ACM symposium on Applied Computing, 2009
Hash joins combine massive relations in data warehouses, decision support systems, and scientific data stores. Faster hash join performance significantly improves query throughput, response time, and overall system performance. In this work, we demonstrate how using join cardinality improves hash join performance. The key contribution is the development of an algorithm to determine join cardinality in an arbitrary query plan. We implemented early hash join and the join cardinality algorithm in PostgreSQL. Experimental results demonstrate that early hash join has an immediate response time that is an order of magnitude faster than the existing hybrid hash join implementation. One-to-one joins execute up to 50% faster and perform significantly fewer I/Os, and one-to-many joins have similar or better performance over all memory sizes.
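A simplified illustration of why cardinality knowledge pays off, under my own assumptions rather than the paper's PostgreSQL implementation: in a one-to-one join, each build tuple matches at most once, so it can be discarded after its first match, shrinking the hash table while the probe proceeds.

```python
# Sketch of exploiting join cardinality in a hash join. With a one-to-one
# join, a matched build entry can never match again, so deleting it frees
# memory early and shortens later probe chains.

def hash_join(build, probe, one_to_one=False):
    table = {}
    for key, payload in build:
        table.setdefault(key, []).append(payload)
    for key, payload in probe:
        matches = table.get(key)
        if matches:
            for m in matches:
                yield (key, m, payload)
            if one_to_one:
                del table[key]       # never needed again: free memory early

r = [(1, "a"), (2, "b"), (3, "c")]
s = [(2, "x"), (3, "y"), (4, "z")]
print(list(hash_join(r, s, one_to_one=True)))   # [(2,'b','x'), (3,'c','y')]
```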
Rcc, 2001
The advent of the telecommunications era and the constant development of hardware and network structures have encouraged the decentralization of data while increasing the need to access information from different sites. Query optimization strategies aim to minimize the cost of transferring data across networks. Many techniques and algorithms have been proposed to optimize queries. Perhaps one of the more important algorithms is the AHY algorithm using semi-joins, implemented by Apers, Hevner and Yao in [1]. Nowadays, a new technique called PERF (Partially Encoded Record Filters) seems to bring some improvement over semi-joins [12]. PERF joins are two-way semi-joins using a bit vector as their backward phase. Our research encompasses applying PERF joins to two well-known algorithms, AHY and W, which both deal with query optimization. Programs were designed to implement both the original and the enhanced algorithms. Several experiments were conducted, and the results showed a very considerable improvement obtained by applying the PERF concept. This major improvement led us to further observations and studies.
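A hedged sketch of the PERF idea as described in the abstract follows; the message shapes and function names are my assumptions. The backward phase returns one bit per shipped join value, in the order the values were sent, instead of shipping the matching values themselves back.

```python
# Illustrative sketch of a PERF-style two-way semi-join: the backward phase
# answers with a bit vector over the projected join column in its original
# order, which is cheaper to transfer than the matching values.

def forward_phase(r_join_column):
    """Site A ships its (deduplicated, order-preserving) join column."""
    seen, projected = set(), []
    for v in r_join_column:
        if v not in seen:
            seen.add(v)
            projected.append(v)
    return projected

def backward_phase(projected, s_keys):
    """Site B answers with one bit per shipped value: 1 if it joins."""
    s_set = set(s_keys)
    return [1 if v in s_set else 0 for v in projected]

def filter_with_perf(r_tuples, projected, bits):
    """Site A keeps only tuples whose join value's bit is set."""
    matched = {v for v, b in zip(projected, bits) if b}
    return [t for t in r_tuples if t[0] in matched]

r = [(1, "a"), (2, "b"), (2, "c"), (5, "d")]
proj = forward_phase([k for k, _ in r])       # [1, 2, 5]
bits = backward_phase(proj, [2, 5, 9])        # [0, 1, 1]
print(filter_with_perf(r, proj, bits))        # [(2,'b'), (2,'c'), (5,'d')]
```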
ACM SIGMOD Record, 2000
In this paper, we show how compression can be integrated into a relational database system. Specifically, we describe how the storage manager, the query execution engine, and the query optimizer of a database system can be extended to deal with compressed data. Our main result is that compression can significantly improve the response time of queries if very lightweight compression techniques are used. We present such lightweight compression techniques and give the results of running the TPC-D benchmark on a compressed and a non-compressed database using the AODB database system, an experimental database system developed at the Universities of Mannheim and Passau. Our benchmark results demonstrate that compression indeed offers high performance gains (up to 50%) for I/O-intensive queries and moderate gains for CPU-intensive queries. Compression can, however, also increase the running time of certain update operations. In all, we recommend extending today's database systems with lightweight compression techniques and making extensive use of this feature.
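As an example of the kind of lightweight technique meant here, consider frame-of-reference coding, sketched below under my own assumptions: each value in a block is stored as a small offset from the block minimum, so decompression is a single addition and adds little CPU cost.

```python
# Frame-of-reference coding, a classic light-weight scheme: store a
# per-block base plus small offsets. Decompression is one addition per
# value, cheap enough not to dominate CPU-bound queries.

def for_encode(block):
    base = min(block)
    return base, [v - base for v in block]

def for_decode(base, offsets):
    return [base + o for o in offsets]

block = [10007, 10012, 10009, 10021]
base, offsets = for_encode(block)     # base=10007, offsets=[0, 5, 2, 14]
assert for_decode(base, offsets) == block
```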
1992
In this paper we introduce the concept of declustering-aware joins (DA-Joins), a family of join algorithms with each member determined by the underlying declustering method and the join technique used. We present and analyze a member of the DA-Join family that is based on the parallel hybrid-hash join and the CMD declustering method, which we call DA-Join_{HH,CMD}. We show that DA-Join_{HH,CMD} has very good overall performance. Its main advantages over non-declustering-aware joins are (i) its ability to partition the problem into a set of smaller independent sub-problems, each of which can utilize memory more efficiently, and (ii) pruning the problem size by not considering portions of the relations that cannot join. It shares with hash joins the desirable property of scalability with respect to the degree of parallelism. However, it can avoid the data skew problems faced by parallel hash joins. The CMD declustering method has been shown to be optimal for multi-attribute range queries on parallel I/O architectures, and our analysis and experimental evaluation of DA-Join_{HH,CMD} prove that it is also possible to perform joins efficiently on CMD-stored relations, thus providing evidence for the desirability of CMD as a basic technique for multi-attribute declustering of relations in a parallel database.
Database Engineering …, 1997
This paper addresses the question of how information-theoretically-derived compact representations can be applied in practice to improve storage and processing efficiency in DBMS. Compact data representation has the potential for savings in storage, access and processing costs throughout the systems architecture and may alter the balance of usage between disk and solid state storage. To realise the potential performance benefits, however, novel systems engineering must be adopted to ensure that compression/decompression overheads are limited. This paper describes a basic approach to storage and processing of relations in a highly compressed form. A vertical columnwise representation is adopted in which columns can dynamically vary incrementally in both length and width. To achieve good performance, query processing is carried out directly on the compressed relational representation using a compressed representation of the query, thus avoiding decompression overheads. Measurements of performance of the Hibase prototype implementation are compared with those obtained from conventional DBMS.
21st International Conference on Advanced Networking and Applications (AINA '07), 2007
Bloom filter based algorithms have proven successful as a very efficient technique to reduce the communication costs of database joins in a distributed setting. However, the full potential of Bloom filters has not yet been exploited. Especially in the case of multi-joins, where the data is distributed among several sites, additional optimization opportunities arise, which require new Bloom filter operations and computations. In this paper, we present these extensions and point out how they improve the performance of such distributed joins. While the paper focuses on efficient join computation, the described extensions are applicable to a wide range of uses where Bloom filters serve as a compressed set representation.
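For readers unfamiliar with the baseline the paper builds on, here is a hedged sketch: one site summarizes its join keys in a Bloom filter and ships the filter instead of the keys; the other site forwards only tuples whose keys might match. The hash construction and sizes below are illustrative assumptions.

```python
# Bloom-filter semi-join sketch: site A ships a fixed-size bit array in
# place of its join keys; site B forwards only tuples that pass the filter
# (no false negatives, occasional false positives).

import hashlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.m, self.k, self.bits = num_bits, num_hashes, 0

    def _positions(self, key):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, key):
        for p in self._positions(key):
            self.bits |= 1 << p

    def might_contain(self, key):
        return all(self.bits & (1 << p) for p in self._positions(key))

site_a_keys = [101, 205, 314]
bf = BloomFilter()
for key in site_a_keys:
    bf.add(key)
# Site B ships only tuples that pass the filter (plus rare false positives).
site_b = [(101, "x"), (999, "y"), (314, "z")]
print([t for t in site_b if bf.might_contain(t[0])])
```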
arXiv (Cornell University), 2020
In this paper, we present MorphStore, an open-source in-memory columnar analytical query engine with a novel holistic compression-enabled processing model. Compression using lightweight integer compression algorithms already plays an important role in existing in-memory column-store database systems, but mainly for base data. In particular, during query processing, these systems only keep the data compressed until an operator cannot process the compressed data directly, whereupon the data is decompressed, but not recompressed. Thus, the full potential of compression during query processing is not exploited. To overcome this, we developed the novel compression-enabled processing model presented in this paper. As we show, the continuous use of compression for all base data and all intermediates is very beneficial for reducing the overall memory footprint as well as for improving query performance.
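One way to picture a compression-enabled operator is sketched below, under my own simplifying assumptions (MorphStore's real operators handle many lightweight integer formats with far more machinery): a selection reads run-length-encoded input and emits run-length-encoded output, so the intermediate is never materialized uncompressed.

```python
# Toy compression-enabled operator: selection over RLE runs whose output
# stays run-length encoded, so downstream operators also see compressed
# intermediates.

def rle_select(runs, predicate):
    """Filter (value, run_length) runs; output stays run-length encoded."""
    out = []
    for value, length in runs:
        if predicate(value):
            if out and out[-1][0] == value:
                out[-1] = (value, out[-1][1] + length)
            else:
                out.append((value, length))
    return out

runs = [(5, 3), (7, 2), (5, 1), (2, 4)]
print(rle_select(runs, lambda v: v >= 5))   # [(5, 3), (7, 2), (5, 1)]
```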
Proceedings. 20th International Conference on Data Engineering, 2004
This paper introduces the hash-merge join algorithm (HMJ, for short); a new non-blocking join algorithm that deals with data items from remote sources via unpredictable, slow, or bursty network traffic. The HMJ algorithm is designed with two goals in mind: (1) Minimize the time to produce the first few results, and (2) Produce join results even if the two sources of the join operator occasionally get blocked. The HMJ algorithm has two phases: The hashing phase and the merging phase. The hashing phase employs an in-memory hash-based join algorithm that produces join results as quickly as data arrives. The merging phase is responsible for producing join results if the two sources are blocked. Both phases of the HMJ algorithm are connected via a flushing policy that flushes in-memory parts into disk storage once the memory is exhausted. Experimental results show that HMJ combines the advantages of two state-of-the-art non-blocking join algorithms (XJoin and Progressive Merge Join) while avoiding their shortcomings.
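The defining property of the hashing phase, producing results as soon as tuples arrive from either source, can be sketched as a symmetric hash join; the merging phase and flushing policy are omitted, and the code is an illustration rather than the paper's algorithm.

```python
# Symmetric in-memory hash join: one hash table per input; each arriving
# tuple is inserted into its own side's table and immediately probed
# against the other side's, so early results never wait for a full build.

def symmetric_hash_join(stream):
    """stream yields ('R'|'S', key, payload) in network-arrival order."""
    tables = {"R": {}, "S": {}}
    for side, key, payload in stream:
        other = "S" if side == "R" else "R"
        tables[side].setdefault(key, []).append(payload)
        for match in tables[other].get(key, []):     # emit immediately
            yield (key, payload, match) if side == "R" else (key, match, payload)

arrivals = [("R", 1, "a"), ("S", 2, "x"), ("R", 2, "b"), ("S", 1, "y")]
print(list(symmetric_hash_join(arrivals)))   # [(2,'b','x'), (1,'a','y')]
```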
2016
The paper is devoted to the decomposition of the relational join operator with the aid of distributed column indices. Such decomposition allows one to utilize modern many-core accelerators (GPU or Intel Xeon Phi) to speed up query execution for very large databases. Column indices are a new kind of index structure that exploits "key-value" techniques. The paper describes methods of column index fragmentation based on domain intervals. This technique allows organizing parallel query processing without data exchanges. All column index fragments are stored in main memory in compressed form to conserve space. This approach can be implemented as a coprocessor for relational database systems. The database coprocessor is able to perform resource-intensive operations much faster than a conventional DBMS.
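A minimal sketch of fragmentation by domain intervals, under my own assumptions about the index layout: each (key, row id) pair is routed to the fragment owning its key range, so two relations fragmented on the same intervals can be joined fragment by fragment with no exchange between cores.

```python
# Illustrative domain-interval fragmentation of a column index. Matching
# keys from two relations always land in same-numbered fragments, which is
# what makes exchange-free parallel joins possible.

import bisect

def fragment(index_pairs, boundaries):
    """boundaries [100, 200] split keys into k<100, 100<=k<200, k>=200."""
    fragments = [[] for _ in range(len(boundaries) + 1)]
    for key, row_id in index_pairs:
        fragments[bisect.bisect_right(boundaries, key)].append((key, row_id))
    return fragments

pairs = [(42, 0), (150, 1), (99, 2), (310, 3)]
print(fragment(pairs, [100, 200]))
# [[(42, 0), (99, 2)], [(150, 1)], [(310, 3)]]
```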
Proceedings of the 25th …, 2002
Compression reduces both the size of indexes and the time needed to evaluate queries. In this paper, we revisit the compression of inverted lists of document postings that store the position and frequency of indexed terms, considering two approaches to improving retrieval efficiency: better implementation and better choice of integer compression schemes. First, we propose several simple optimisations to well-known integer compression schemes, and show experimentally that these lead to significant reductions in time. Second, we explore the impact of choice of compression scheme on retrieval efficiency.
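Variable-byte coding of d-gaps is one of the well-known schemes in this space; the sketch below is a plain illustration of it (not the paper's optimised implementation): each posting is stored as the gap from its predecessor, using 7 payload bits per byte with the high bit marking the final byte of a gap.

```python
# Variable-byte coding of document-number gaps, the canonical baseline
# integer scheme for inverted lists.

def vbyte_encode(numbers):
    out, prev = bytearray(), 0
    for n in numbers:
        gap = n - prev
        prev = n
        while gap >= 128:
            out.append(gap & 0x7F)
            gap >>= 7
        out.append(gap | 0x80)          # high bit terminates the gap
    return bytes(out)

def vbyte_decode(data):
    numbers, gap, shift, prev = [], 0, 0, 0
    for byte in data:
        gap |= (byte & 0x7F) << shift
        if byte & 0x80:
            prev += gap
            numbers.append(prev)
            gap, shift = 0, 0
        else:
            shift += 7
    return numbers

postings = [3, 7, 11, 900, 1000]
assert vbyte_decode(vbyte_encode(postings)) == postings
```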
New Generation Computing, 1988
In recent investigations of reducing the complexity of the relational join operation, several hash-based partitioned-join strategies have been introduced. All of these strategies depend upon the costly operation of data space partitioning before the join can be carried out. We had previously introduced a partitioned-join based on a dynamic and order-preserving multidimensional data organization called DYOP. The present study extends the earlier research on DYOP and constructs a simulation model. The simulation studies on DYOP and subsequent comparisons of all the partitioned-join methodologies, including DYOP, show that the space utilization of DYOP improves with an increasing number of attributes. Furthermore, the DYOP-based join outperforms all the hash-based methodologies by greatly reducing the total I/O bandwidth required for the entire partitioned-join operation. The comparison model is independent of architectural issues such as multiprocessing, multiple disk usage, and large memory availability, all of which help to further increase the efficiency of the operation.
Proceedings of the ACM SIGMOD 39th International Conference on Management of Data
There exists a need for high-performance, read-only main-memory database systems for OLAP-style application scenarios. Most of the existing work in this area is centered around column-store databases, which are particularly well suited to OLAP-style scenarios and have been shown to overcome the memory bottleneck issues that hinder the more traditional row-store database systems. One of the main database operations these systems focus on optimizing is the JOIN operation. However, all these existing systems use join algorithms designed under the unrealistic assumption that there is unlimited temporary memory available to perform the join. In contrast, we propose a Memory Constrained Join algorithm (MCJoin) which is both high performing and performs all of its operations within a tight given memory constraint. Extensive experimental results show that MCJoin outperforms a naive memory-constrained version of the state-of-the-art Radix-Clustered Hash Join algorithm in all of the situations tested, with margins of up to almost 500%.
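MCJoin builds on radix-style partitioning; the fragment below sketches only that building block, under my own assumptions, with the memory-constrained bookkeeping that is the paper's actual contribution left out: tuples are split by the low bits of their key's hash so that each partition can be joined within a small budget.

```python
# Radix partitioning sketch: route each tuple by the low bits of its key's
# hash, so each partition's hash table stays small enough for a cache or
# memory budget.

def radix_partition(tuples, bits=2):
    parts = [[] for _ in range(1 << bits)]
    mask = (1 << bits) - 1
    for key, payload in tuples:
        parts[hash(key) & mask].append((key, payload))
    return parts

tuples = [(k, f"v{k}") for k in range(8)]
for i, p in enumerate(radix_partition(tuples)):
    print(i, p)
```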
Proceedings of the …, 2009
Join is an important database operation. As computer architectures evolve, the best join algorithm may change hands. This paper reexamines two popular join algorithms, hash join and sort-merge join, to determine if the latest computer architecture trends shift the tide that has favored hash join for many years. For a fair comparison, we implemented the most optimized parallel version of both algorithms on the latest Intel Core i7 platform. Both implementations scale well with the number of cores in the system and take advantage of the latest processor features for performance. Our hash-based implementation achieves more than 100M tuples per second, which is 17X faster than the best reported performance on CPUs and 8X faster than that reported for GPUs. Moreover, the performance of our hash join implementation is consistent over a wide range of input data sizes from 64K to 128M tuples and is not affected by data skew. We compare this implementation to our highly optimized sort-based implementation that achieves 47M to 80M tuples per second. We developed analytical models to study how both algorithms would scale with upcoming processor architecture trends. Our analysis projects that current architectural trends of wider SIMD, more cores, and lower memory bandwidth per core imply better scalability potential for sort-merge join. Consequently, sort-merge join is likely to outperform hash join on upcoming chip multiprocessors. In summary, we offer multicore implementations of hash join and sort-merge join which consistently outperform all previously reported results. We further conclude that the tide that favors the hash join algorithm has not changed yet, but the change is just around the corner.
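For contrast with the hash join sketches above, here is a compact sort-merge join over already-sorted inputs; the paper's versions are heavily parallelized and SIMD-tuned, so treat this only as a statement of the algorithm being compared.

```python
# Sort-merge join over (key, payload) lists sorted by key, including
# handling of duplicate keys on both sides.

def sort_merge_join(r, s):
    i = j = 0
    while i < len(r) and j < len(s):
        if r[i][0] < s[j][0]:
            i += 1
        elif r[i][0] > s[j][0]:
            j += 1
        else:
            key = r[i][0]
            j0 = j
            while i < len(r) and r[i][0] == key:
                j = j0
                while j < len(s) and s[j][0] == key:
                    yield (key, r[i][1], s[j][1])
                    j += 1
                i += 1

r = sorted([(2, "b"), (1, "a"), (3, "c")])
s = sorted([(2, "x"), (3, "y"), (3, "z")])
print(list(sort_merge_join(r, s)))  # [(2,'b','x'), (3,'c','y'), (3,'c','z')]
```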
1996
The widening performance gap between CPU and disk is significant for hash join performance. Most current hash join methods try to reduce the volume of data transferred between memory and disk. In this paper, we try to reduce hash-join times by reducing random I/O. We study how current algorithms incur random I/O, and propose a new hash join method, Seq+, that converts much of the random I/O to sequential I/O. Seq+ uses a new organization for hash buckets on disk, and larger input and output buffer sizes. We introduce the technique of batch writes to reduce the bucket-write cost, and the concepts of write-groups and read-groups of hash buckets to reduce the bucket-read cost. We derive a cost model for our method, and present formulas for choosing various algorithm parameters, including input and output buffer sizes. Our performance study shows that the new hash join method performs many times better than current algorithms under various environments. Since our cost functions underestimate the cost of current algorithms and overestimate the cost of Seq+, the actual performance gain of Seq+ is likely to be even greater.
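The batch-write idea can be sketched as follows, with the on-disk layout and group sizing as illustrative assumptions rather than Seq+'s actual design: spilled tuples are staged in per-bucket buffers and flushed in one sequential pass over adjacent buckets, instead of one random write per tuple.

```python
# Batch-write sketch: stage spilled tuples per bucket, flush many adjacent
# buckets in a single sequential pass when the output buffer fills.

class BatchWriter:
    def __init__(self, num_buckets, flush_threshold):
        self.buffers = [[] for _ in range(num_buckets)]
        self.staged = 0
        self.threshold = flush_threshold
        self.disk = []                      # stands in for sequential disk

    def write(self, bucket, tuple_):
        self.buffers[bucket].append(tuple_)
        self.staged += 1
        if self.staged >= self.threshold:
            self.flush()

    def flush(self):
        # One sequential pass over adjacent buckets instead of many seeks.
        for bucket, buf in enumerate(self.buffers):
            if buf:
                self.disk.append((bucket, list(buf)))
                buf.clear()
        self.staged = 0

w = BatchWriter(num_buckets=4, flush_threshold=3)
for bucket, t in [(2, "a"), (0, "b"), (2, "c"), (1, "d")]:
    w.write(bucket, t)
w.flush()
print(w.disk)   # [(0, ['b']), (2, ['a', 'c']), (1, ['d'])]
```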
2014
Domain encoding is a common technique to compress the columns of a column store and to accelerate many types of queries at the same time. It is based on the assumption that most columns contain a relatively small set of distinct values, in particular string columns. In this paper, we argue that domain encoding is not the end of the story. In real-world systems, we observe that a substantial fraction of the columns are of string type. Moreover, most of the memory space is consumed by only a small fraction of these columns. To address this issue, we make three main contributions: First, we survey several approaches and variants for dictionary compression, i.e., data structures that store the dictionary of domain encoding in a compressed way. As expected, there is a trade-off between the size of the data structure and its access performance. This observation can be used to compress rarely accessed data more than frequently accessed data. Furthermore, the question which approach has the best ...
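Plain domain encoding, the starting point whose dictionary the paper then compresses, can be sketched in a few lines; the order-preserving code assignment below is my own illustrative choice, and it is what lets range predicates run on the integer codes alone.

```python
# Domain (dictionary) encoding of a string column: one sorted dictionary
# plus an integer code per row. Sorted dictionaries give order-preserving
# codes, so range predicates become integer comparisons.

import bisect

def domain_encode(column):
    dictionary = sorted(set(column))           # order-preserving codes
    code_of = {v: c for c, v in enumerate(dictionary)}
    return dictionary, [code_of[v] for v in column]

city = ["Berlin", "Aachen", "Berlin", "Celle", "Aachen"]
dictionary, codes = domain_encode(city)
# dictionary = ['Aachen', 'Berlin', 'Celle'], codes = [1, 0, 1, 2, 0]

# Predicate city < 'Celle' becomes an integer comparison on codes:
bound = bisect.bisect_left(dictionary, "Celle")
matching_rows = [i for i, c in enumerate(codes) if c < bound]
print(matching_rows)                            # [0, 1, 2, 4]
```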