2005
Compression is a well-known technique used by many database management systems ("DBMSs") to increase performance [4, 5, 14]. However, not much research has been done into how compression can be used within column-oriented architectures. Storing data in columns increases the similarity between adjacent records, thus increasing the compressibility of the data. In addition, compression schemes not traditionally used in row-oriented DBMSs can be applied to column-oriented systems. This thesis presents a column-oriented query executor designed to operate directly on compressed data. We show that operating directly on compressed data can improve query performance. Additionally, the choice of compression scheme depends on the expected query workload, suggesting that for ad-hoc queries we may wish to store a column redundantly under different coding schemes. Furthermore, the executor is designed to be extensible so that the addition of new compression schemes does not impact operator implementation. The executor is part of a larger database system, known as CStore [10].
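To make the idea of operating directly on compressed data concrete, here is a minimal, hedged sketch (not the CStore executor itself; the helper names are invented for illustration): a SUM aggregate evaluated on a run-length-encoded column without decompressing it.

```python
# Illustrative sketch (not the CStore executor): an aggregate operator that
# works directly on a run-length-encoded (RLE) column. Each run is a
# (value, run_length) pair, so SUM never needs to decompress the column.

from typing import Iterable, List, Tuple

Run = Tuple[int, int]  # (value, run_length)

def rle_encode(values: Iterable[int]) -> List[Run]:
    """Compress a repetitive column into (value, length) runs."""
    runs: List[Run] = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1] = (v, runs[-1][1] + 1)
        else:
            runs.append((v, 1))
    return runs

def sum_compressed(runs: Iterable[Run]) -> int:
    """SUM evaluated on the compressed representation: value * run_length."""
    return sum(value * length for value, length in runs)

column = [3, 3, 3, 7, 7, 9]   # a column with adjacent duplicates
runs = rle_encode(column)      # [(3, 3), (7, 2), (9, 1)]
assert sum_compressed(runs) == sum(column)
```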
Database Engineering …, 1997
This paper addresses the question of how information-theoretically derived compact representations can be applied in practice to improve storage and processing efficiency in DBMS. Compact data representation has the potential for savings in storage, access and processing costs throughout the systems architecture and may alter the balance of usage between disk and solid-state storage. To realise the potential performance benefits, however, novel systems engineering must be adopted to ensure that compression/decompression overheads are limited. This paper describes a basic approach to storage and processing of relations in a highly compressed form. A vertical, column-wise representation is adopted in which columns can dynamically vary incrementally in both length and width. To achieve good performance, query processing is carried out directly on the compressed relational representation using a compressed representation of the query, thus avoiding decompression overheads. Measurements of performance of the Hibase prototype implementation are compared with those obtained from conventional DBMS.
International Journal of Advanced Computer Science and Applications
Compression of data in traditional relational database management systems significantly improves system performance by decreasing the size of the data, which results in less data transfer time within the communication environment and higher efficiency in I/O operations. Column-oriented database management systems should perform even better, since each attribute is stored in a separate column, so that its sequential values are stored and accessed sequentially on the disk. That further increases compression efficiency, as the entire column is compressed/decompressed at once. The aim of this research is to determine whether data compression could improve the performance of HBase, running on a small-sized Hadoop cluster consisting of one name node and nine data nodes. The test scenario includes performing Insert and Select queries on multiple records with and without data compression. Four data compression algorithms are tested, since they are natively supported by HBase: SNAPPY, LZO, LZ4 and GZ. Results show that data compression in HBase highly improves system performance in terms of storage saving. It shrinks data 5 to 10 times (depending on the algorithm) without any noticeable additional CPU load. That allows smaller but significantly faster SSD disks to be used as the cluster's primary data storage. Furthermore, the substantial decrease in network traffic is an additional benefit with a major impact on big data processing.
ACM SIGMOD Record, 2001
Over the last decades, improvements in CPU speed have outpaced improvements in main memory and disk access rates by orders of magnitude, enabling the use of data compression techniques to improve the performance of database systems. Previous work describes the benefits of compression for numerical attributes, where data is stored in compressed format on disk. Despite the abundance of string-valued attributes in relational schemas, there is little work on compression for string attributes in a database context. Moreover, none of the previous work suitably addresses the role of the query optimizer: during query execution, data is either eagerly decompressed when it is read into main memory, or data lazily stays compressed in main memory and is decompressed on demand only. In this paper, we present an effective approach for database compression based on lightweight, attribute-level compression techniques. We propose a Hierarchical Dictionary Encoding strategy that intelligently selects ...
ACM SIGMOD Record, 2000
In this paper, we show how compression can be integrated into a relational database system. Specifically, we describe how the storage manager, the query execution engine, and the query optimizer of a database system can be extended to deal with compressed data. Our main result is that compression can significantly improve the response time of queries if very light-weight compression techniques are used. We present such light-weight compression techniques and give the results of running the TPC-D benchmark on a compressed and a non-compressed database using the AODB database system, an experimental database system that was developed at the Universities of Mannheim and Passau. Our benchmark results demonstrate that compression indeed offers high performance gains (up to 50%) for I/O-intensive queries and moderate gains for CPU-intensive queries. Compression can, however, also increase the running time of certain update operations. In all, we recommend extending today's database systems with lightweight compression techniques and making extensive use of this feature.
arXiv (Cornell University), 2020
In this paper, we present MorphStore, an open-source in-memory columnar analytical query engine with a novel holistic compression-enabled processing model. Basically, compression using lightweight integer compression algorithms already plays an important role in existing in-memory column-store database systems, but mainly for base data. In particular, during query processing, these systems only keep the data compressed until an operator cannot process the compressed data directly, whereupon the data is decompressed, but not recompressed. Thus, the full potential of compression during query processing is not exploited. To overcome that, we developed a novel compression-enabled processing model as presented in this paper. As we are going to show, the continuous usage of compression for all base data and all intermediates is very beneficial to reduce the overall memory footprint as well as to improve the query performance.
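The following is a small illustrative sketch, not MorphStore's actual operator interface, of what it means for an operator to consume and produce compressed intermediates: a selection over an RLE-encoded column that emits its result as an RLE-encoded 0/1 column, so the next operator again receives compressed input.

```python
# Minimal sketch of keeping intermediates compressed (assumed interfaces;
# not MorphStore's actual operators): a selection on an RLE column that
# emits its result as an RLE-compressed bitmap.

def select_gt(runs, threshold):
    """Evaluate value > threshold on RLE input; return an RLE 0/1 bitmap."""
    out = []
    for value, length in runs:
        bit = 1 if value > threshold else 0
        if out and out[-1][0] == bit:
            out[-1] = (bit, out[-1][1] + length)   # extend the previous run
        else:
            out.append((bit, length))
    return out

print(select_gt([(3, 3), (7, 2), (9, 1)], 5))      # [(0, 3), (1, 3)]
```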
Proceedings of 16th International Conference on Data Engineering (Cat. No.00CB37073)
Decision-support applications in emerging environments require that SQL query results or intermediate results be shipped to clients for further analysis and presentation. These clients may use low bandwidth connections or have severe storage restrictions. Consequently, there is a need to compress the results of a query for efficient transfer and client-side access. This paper explores a variety of techniques that address this issue. Instead of using a fixed method, we choose a combination of compression methods that use statistical and semantic information of the query results to enhance the effect of compression. To represent such a combination, we present a framework of "compression plans" formed by composing primitive compression operators. We also present optimization algorithms that enumerate valid compression plans and choose an optimal plan. Our experiments show that our techniques achieve significant performance improvement over standard compression tools like WinZip.
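As a hedged illustration of the "compression plan" idea (the primitive operators and interfaces below are invented for the example; the paper's actual primitives and plan enumeration are more elaborate), a plan can be modelled as a composition of per-column compression operators:

```python
# Hypothetical sketch: a compression "plan" as a sequence of primitive
# compression operators applied to one column of a query result.

def dict_encode(column):
    """Replace each value with a small integer code plus a dictionary."""
    dictionary = {v: i for i, v in enumerate(sorted(set(column)))}
    return [dictionary[v] for v in column], dictionary

def delta_encode(codes):
    """Store differences between consecutive codes (often smaller numbers)."""
    return [codes[0]] + [b - a for a, b in zip(codes, codes[1:])]

def apply_plan(column, plan):
    """Apply a sequence of primitive compression operators to one column."""
    data, meta = column, []
    for op in plan:
        result = op(data)
        data, extra = result if isinstance(result, tuple) else (result, None)
        meta.append(extra)
    return data, meta

compressed, metadata = apply_plan(["NY", "NY", "CA", "TX", "TX"],
                                  [dict_encode, delta_encode])
```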
2012
Through this study, we propose two algorithms. The first algorithm describes the concept of compressing domains at the attribute level, and we call it "Attribute Domain Compression". This algorithm can be implemented on both row and columnar databases. The idea behind the algorithm is to reduce the size of large databases so as to store them optimally. The second algorithm is also applicable to both kinds of databases but works best for columnar databases. The idea behind the algorithm is to generalize the tuple domains by assigning a value, say n, such that all of the other n-1 tuples, or at least as many as possible, can be identified.
Proceedings of the 25th …, 2002
Compression reduces both the size of indexes and the time needed to evaluate queries. In this paper, we revisit the compression of inverted lists of document postings that store the position and frequency of indexed terms, considering two approaches to improving retrieval efficiency: better implementation and better choice of integer compression schemes. First, we propose several simple optimisations to well-known integer compression schemes, and show experimentally that these lead to significant reductions in time. Second, we explore the impact of choice of compression scheme on retrieval efficiency.
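One widely used integer compression scheme in this line of work is variable-byte coding; the sketch below is a textbook version for illustration, not the authors' optimised implementation, and uses the convention that the high bit marks the last byte of each number.

```python
# Textbook variable-byte (v-byte) coding: each integer is split into 7-bit
# chunks; the high bit of the final byte terminates the number.

def vbyte_encode(numbers):
    out = bytearray()
    for n in numbers:
        chunk = []
        while True:
            chunk.append(n & 0x7F)
            n >>= 7
            if n == 0:
                break
        chunk[0] |= 0x80                 # high bit set on the last 7-bit chunk
        out.extend(reversed(chunk))
    return bytes(out)

def vbyte_decode(data):
    numbers, n = [], 0
    for byte in data:
        n = (n << 7) | (byte & 0x7F)
        if byte & 0x80:                  # terminator byte reached
            numbers.append(n)
            n = 0
    return numbers

gaps = [5, 1, 130, 3]                    # document gaps in a postings list
assert vbyte_decode(vbyte_encode(gaps)) == gaps
```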
1999
Efficient query processing is critical in a data warehouse environment because the warehouse is very large, queries are often ad hoc and complex, and decision support applications typically require interactive response times. Existing approaches often use indexes to speed up such queries. However, the addition of index structures can significantly increase storage costs. In this paper, we consider the application of compression techniques to data warehouses. In particular, we examine a recently proposed access structure for warehouses known as DataIndexes, and discuss the application of several well-known compression methods to this approach. We also include a brief performance analysis, which indicates that the DataIndexing approach is well-suited to compression techniques in many cases.
IBM Journal of Research and Development, 2018
In this paper, we describe how the IBM z14 processor, together with Db2 for z/OS Version 12, can improve data compression rates and thus reduce data storage requirements and cost for large databases. The new processor improves on the compression hardware accelerator available in earlier IBM Z generations, by adding new hardware algorithms that increase the compression ratio and extend the applicability to additional data structures in Db2 for z/OS databases. A new entropy coding step employed after Ziv-Lempel compression reduces the size of data compressed with the prior algorithms by 30% on average. Moreover, the new order-preserving compression algorithm enables database index compression, reducing index sizes by roughly 30%. This results in an overall improvement of 30% of the database size for many applications, with associated benefits in storage requirements, I/O (input/output) bandwidth, and buffer pool efficiency.
2014
Domain encoding is a common technique to compress the columns of a column store and to accelerate many types of queries at the same time. It is based on the assumption that most columns contain a relatively small set of distinct values, in particular string columns. In this paper, we argue that domain encoding is not the end of the story. In real world systems, we observe that a substantial amount of the columns are of string types. Moreover, most of the memory space is consumed by only a small fraction of these columns. To address this issue, we make three main contributions: First we survey several approaches and variants for dictionary compression, i. e., data structures that store the dictionary of domain encoding in a compressed way. As expected, there is a trade-off between size of the data structure and its access performance. This observation can be used to compress rarely accessed data more than frequently accessed data. Furthermore the question which approach has the best ...
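For illustration, one classic way to compress the dictionary itself is front coding of the sorted string entries, storing only the length of the prefix shared with the previous entry plus the new suffix; this is a simplified sketch, and the paper surveys several such structures and their space/performance trade-offs.

```python
# Illustrative front coding of a sorted string dictionary: each entry is
# stored as (shared_prefix_length, remaining_suffix).

def front_code(sorted_strings):
    coded, prev = [], ""
    for s in sorted_strings:
        prefix = 0
        while prefix < min(len(prev), len(s)) and prev[prefix] == s[prefix]:
            prefix += 1
        coded.append((prefix, s[prefix:]))
        prev = s
    return coded

def front_decode(coded):
    strings, prev = [], ""
    for prefix, suffix in coded:
        s = prev[:prefix] + suffix
        strings.append(s)
        prev = s
    return strings

words = ["compress", "compression", "computer", "database"]
assert front_decode(front_code(words)) == words
```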
Journal of Computers, 2009
Loss-less data compression is potentially attractive in database applications for storage cost reduction and performance improvement. The existing compression architectures work well for small to large databases and provide good performance. But these systems can ...
The Computer Journal, 1998
Future database applications will require significant improvements in performance beyond the capabilities of conventional disk based systems. This paper describes a new approach to database systems architecture, which is intended to take advantage of solid-state memory in combination with data compression to provide substantial performance improvements. The compressed data representation is tailored to the data manipulation operations requirements. The architecture has been implemented and measurements of performance are compared to those obtained using other high-performance database systems. The results indicate from one to five orders of magnitude speed-up in retrieval, equivalent or slightly faster performance during insertion (and compression) of data, while achieving approximately one order of magnitude compression in data volume. The resultant architecture is thus capable of greater cost/effectiveness than conventional approaches.
Proceedings of the 2009 ACM SIGMOD International Conference on Management of data, 2009
Column-oriented database systems [19, 23] perform better than traditional row-oriented database systems on analytical workloads such as those found in decision support and business intelligence applications. Moreover, recent work [1, 24] has shown that lightweight compression schemes significantly improve the query processing performance of these systems. One such lightweight compression scheme is to use a dictionary in order to replace long (variable-length) values of a certain domain with shorter (fixed-length) integer codes. In order to further improve expensive query operations such as sorting and searching, column-stores often use order-preserving compression schemes. In contrast to the existing work, in this paper we argue that order-preserving dictionary compression does not only pay off for attributes with a small fixed domain size but also for long string attributes with a large domain size which might change over time. Consequently, we introduce new data structures that efficiently support an order-preserving dictionary compression for (variable-length) string attributes with a large domain size that is likely to change over time. The main idea is that we model a dictionary as a table that specifies a mapping from string values to arbitrary integer codes (and vice versa), and we introduce a novel indexing approach that provides efficient access paths to such a dictionary while compressing the index data. Our experiments show that our data structures are as fast as (or in some cases even faster than) other state-of-the-art data structures for dictionaries while being less memory intensive.
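A minimal sketch of order-preserving dictionary compression follows (assuming a static dictionary for simplicity; the paper's contribution is an indexed, updatable dictionary for large, changing domains): codes are assigned in sorted string order, so range predicates and sorting can be evaluated directly on the integer codes.

```python
# Hedged sketch of order-preserving dictionary compression: codes follow the
# sort order of the strings, so string comparisons become integer comparisons.

def build_order_preserving_dict(values):
    distinct = sorted(set(values))
    encode = {v: i for i, v in enumerate(distinct)}   # string -> integer code
    decode = distinct                                  # integer code -> string
    return encode, decode

names = ["Smith", "Adams", "Jones", "Smith", "Brown"]
encode, decode = build_order_preserving_dict(names)
codes = [encode[n] for n in names]                     # column stored as codes

# A range predicate on strings becomes an integer comparison on the codes:
lo, hi = encode["Brown"], encode["Jones"]
matches = [decode[c] for c in codes if lo <= c <= hi]  # ['Jones', 'Brown']
```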
Security and Privacy, 2019
A digital watermark is a secret pattern of bits inserted into data objects to verify the integrity of the contents or to identify ownership. It is an effective technique for protecting databases against data copyright infringement, piracy, and data tampering, in both relational and not-only-Structured Query Language (NoSQL) databases, including columnar databases. Unfortunately, most existing digital watermarking schemes do not take into consideration the side effects that watermarking might have on the database's important characteristics, such as data compression and overall performance. Thus, we need database watermarking schemes that are tailored to the target database system's specific architecture so that they do not interfere with the database's characteristics and performance. A columnar database is a type of NoSQL database in which data are stored in a column-wise manner. This facilitates better query processing (by leaving the unrelated attributes on the disk) as well as a higher compression ratio. In this research, we propose a distortion-free fragile watermarking scheme for the columnar database architecture that does not interfere with its underlying data compression scheme or its overall performance. The proposed scheme is flexible and can be adapted to various distributions of data. We tested our proposed scheme on both synthetic and real-world data, and proved its effectiveness.

Keywords: columnar databases, data integrity, database security, digital watermarking, information hiding

Databases have served the business intelligence industry well over the last few decades. However, in this current era of "Big Data," where increasingly large volumes of data are added to business organizations' databases at high speed, fast analysis of data is becoming more difficult. A columnar database (DB) is a type of NoSQL database where data is stored in a column-wise manner, with all values of an attribute clustered together. In contrast to row-wise relational databases, a columnar DB boosts the performance of data analytic transactions by reducing the amount of data that needs to be read from the disk, thus decreasing the bottleneck between Random Access Memory (RAM) and the hard drive. For instance, in a traditional relational database, answering a range query would require loading the set of records that falls in the range and then executing a projection query to discard the unwanted attributes. In a columnar DB, in contrast, answering a range query requires loading only the columns that contain the attributes associated with the query. Therefore, query performance is often increased, especially when the database is very large. Another crucial advantage of storing data in columns is that it lends itself naturally to compression. Columnar DBs achieve a much higher compression ratio and allow many queries to be answered on the compressed data.
Proceedings 2003 VLDB Conference, 2003
The Oracle RDBMS recently introduced an innovative compression technique for reducing the size of relational tables.
Soft Computing, 2018
Structured data are one of the most important segments in the realm of big data analysis that have undeniably prevailed over the years. In recent years, column-oriented design has become a frequent practice to organize structured data in analytical systems. The storage systems that organize data in a column-wise manner are often referred to as column stores. Column-oriented databases or warehouses, and spreadsheet applications in particular, have recently become a popular and convenient tool for column-wise data processing and analysis. At the same time, the volume of data is increasing at an extreme rate, which, despite the decrease in pricing of storage systems, stresses the importance of data compression. Apart from a resounding performance gain in large read-mostly data repositories, column-oriented data are easily compressible, which enables efficient query processing and pushes the peak of the overall performance. Many compression algorithms, including Run Length Encoding (RLE), exploit the similarity among the column values, where repetitions of the same value form columnar runs that can be found in most database systems. This paper presents a comprehensive analysis and comparison of common and well-known meta-heuristics for columnar run minimization, based on standard implementations using real datasets. We have analyzed genetic algorithms, simulated annealing, cuckoo search, particle swarm optimization, Tabu search, and the bat algorithm. The first three, being the most efficient, have undergone sensitivity analysis on synthetic datasets to fine-tune ...
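The sketch below illustrates the objective such meta-heuristics minimise, namely the total number of columnar runs induced by a given row ordering; the search algorithms themselves (genetic algorithms, simulated annealing, and so on) are not shown, and the helper is an illustrative assumption rather than the paper's implementation.

```python
# Count the total number of RLE runs over all columns for one row ordering;
# run minimisation searches for the ordering that makes this number small.

def count_runs(rows):
    """Total columnar runs for a given ordering of the rows."""
    if not rows:
        return 0
    runs = len(rows[0])                        # every column starts one run
    for prev, curr in zip(rows, rows[1:]):
        runs += sum(1 for a, b in zip(prev, curr) if a != b)
    return runs

table = [("DE", "red"), ("US", "red"), ("DE", "blue"), ("DE", "red")]
print(count_runs(table))                        # 6 runs in the given order
print(count_runs(sorted(table)))                # 4 runs after sorting the rows
```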
International Journal of Data Mining & Knowledge Management Process, 2013
A column-store database stores data column by column. The need for column-store databases arose from the need for efficient query processing in read-intensive relational databases, for which extensive research has been performed on efficient data storage and query processing. This paper gives an overview of the storage and performance optimization techniques used in column-stores.
IEEE Transactions on Knowledge and Data Engineering, 1997
Disk I/O has long been a performance bottleneck for very large databases. Database compression can be used to reduce disk I/O bandwidth requirements for large data transfers. In this paper, we explore the compression of large statistical databases and propose techniques for organizing the compressed data such that standard database operations such as retrievals, inserts, deletes and modifications are supported. We examine the applicability and performance of three methods. Two of these are adaptations of existing methods, but the third, called Tuple Differential Coding (TDC) [16], is a new method that allows conventional access mechanisms to be used with the compressed data to provide efficient access. We demonstrate how the performance of queries that involve large data transfers can be improved with these database compression techniques.
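The following is a simplified, hedged sketch of the differential idea behind tuple coding (not the exact TDC scheme from the paper): each tuple is mapped to a single integer via a mixed-radix combination of its attribute values, the relation is kept sorted, and only the differences between consecutive tuple numbers are stored.

```python
# Simplified differential coding of tuples: map each tuple to one integer
# using its attribute domain sizes, sort, and store successive differences.

def tuple_to_int(tup, domain_sizes):
    n = 0
    for value, size in zip(tup, domain_sizes):
        n = n * size + value              # mixed-radix combination
    return n

def diff_encode(tuples, domain_sizes):
    numbers = sorted(tuple_to_int(t, domain_sizes) for t in tuples)
    return [numbers[0]] + [b - a for a, b in zip(numbers, numbers[1:])]

# Two attributes with (hypothetical) domain sizes 10 and 100.
print(diff_encode([(1, 5), (1, 7), (3, 2)], (10, 100)))   # [105, 2, 195]
```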