2004, Proceedings. 20th International Conference on Data Engineering
Real datasets are often large enough to necessitate data compression. Traditional 'syntactic' data compression methods treat the table as a large byte string and operate at the byte level. The tradeoff in such cases is usually between ease of retrieval (how readily a single tuple or attribute value can be retrieved without decompressing a much larger unit) and the effectiveness of the compression. In this regard, the use of semantic compression has generated considerable interest and motivated several recent works.
Journal of Software, 2015
In recent years, the amount of data stored in databases has grown enormously with the widespread use of databases and the rapid adoption of information systems and data warehouse technologies. Storing and retrieving this growing volume of data efficiently is a challenge. Compression is attractive in database systems for two reasons: storage cost reduction and performance improvement. Lossy compression in databases can generally achieve better compression ratios than lossless compression, but it is rarely used because of the concern of losing data. For relational databases, standard compression techniques such as Gzip or Zip do not take advantage of relational properties, since they do not consider the nature of the data. In this paper, we propose a database compression system that takes advantage of attribute semantics and data-mining models to find frequent attribute patterns with maximum gain and compress massive tables. Furthermore, the suggested system relies on an augmented vector quantization (AVQ) algorithm to achieve lossless compression without losing any information. Extensive experiments were conducted, and the results indicate the superiority of the system with respect to previously known techniques.
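A minimal sketch of the frequent-pattern idea this abstract describes: attribute-value combinations that co-occur often enough are replaced by short codes from a codebook, and decompression inverts the codebook, so no information is lost. The table, attribute names, and min_support threshold below are illustrative assumptions, not the paper's actual AVQ algorithm.

    from collections import Counter

    def compress_rows(rows, min_support=2):
        """Replace frequently co-occurring (city, country) pairs with short codes."""
        pairs = Counter((r["city"], r["country"]) for r in rows)
        # A pattern earns a codebook entry only if it repeats often enough to pay off.
        codebook = {p: i for i, (p, n) in enumerate(pairs.items()) if n >= min_support}
        out = []
        for r in rows:
            p = (r["city"], r["country"])
            out.append({"code": codebook[p], "name": r["name"]} if p in codebook else dict(r))
        return codebook, out

    def decompress_rows(codebook, compressed):
        """Lossless inverse: expand codes back into the original attribute values."""
        inverse = {i: p for p, i in codebook.items()}
        rows = []
        for r in compressed:
            if "code" in r:
                city, country = inverse[r["code"]]
                rows.append({"name": r["name"], "city": city, "country": country})
            else:
                rows.append(dict(r))
        return rows

    rows = [
        {"name": "a", "city": "Cairo", "country": "Egypt"},
        {"name": "b", "city": "Cairo", "country": "Egypt"},
        {"name": "c", "city": "Oslo",  "country": "Norway"},
    ]
    codebook, comp = compress_rows(rows)
    assert decompress_rows(codebook, comp) == rows  # round trip is lossless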
Database Engineering …, 1997
This paper addresses the question of how information-theoretically derived compact representations can be applied in practice to improve storage and processing efficiency in DBMS. Compact data representation has the potential for savings in storage, access and processing costs throughout the systems architecture and may alter the balance of usage between disk and solid-state storage. To realise the potential performance benefits, however, novel systems engineering must be adopted to ensure that compression/decompression overheads are limited. This paper describes a basic approach to storage and processing of relations in a highly compressed form. A vertical column-wise representation is adopted in which columns can dynamically vary incrementally in both length and width. To achieve good performance, query processing is carried out directly on the compressed relational representation using a compressed representation of the query, thus avoiding decompression overheads. Measurements of performance of the Hibase prototype implementation are compared with those obtained from conventional DBMS.
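To illustrate querying a compressed representation directly, here is a toy sketch assuming simple per-column dictionary encoding: the query constant is itself translated into a code, so selection compares small integers and never decompresses the column. Hibase's dynamically resizable structures are far more elaborate than this.

    class CompressedColumn:
        """Dictionary-encoded column; rows hold small integer codes, not values."""

        def __init__(self, values):
            self.dictionary = {}   # value -> code
            self.codes = []        # one code per row
            for v in values:
                self.codes.append(self.dictionary.setdefault(v, len(self.dictionary)))

        def select_eq(self, constant):
            """Row ids where column == constant, comparing codes only (no decompression)."""
            code = self.dictionary.get(constant)
            if code is None:       # constant not in dictionary: provably empty result
                return []
            return [i for i, c in enumerate(self.codes) if c == code]

    col = CompressedColumn(["red", "blue", "red", "green", "red"])
    print(col.select_eq("red"))   # [0, 2, 4]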
ACM SIGMOD Record, 2001
Over the last decades, improvements in CPU speed have outpaced improvements in main memory and disk access rates by orders of magnitude, enabling the use of data compression techniques to improve the performance of database systems. Previous work describes the benefits of compression for numerical attributes, where data is stored in compressed format on disk. Despite the abundance of string-valued attributes in relational schemas, there is little work on compression for string attributes in a database context. Moreover, none of the previous work suitably addresses the role of the query optimizer: during query execution, data is either eagerly decompressed when it is read into main memory, or data lazily stays compressed in main memory and is decompressed on demand only. In this paper, we present an effective approach for database compression based on lightweight, attribute-level compression techniques. We propose a Hierarchical Dictionary Encoding strategy that intelligently selects ...
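As a rough illustration of hierarchical dictionary encoding for strings (a simplified reading of the strategy named above, not the paper's actual design): words are dictionary-coded, and each full string is then coded as a tuple of word codes that is itself dictionary-coded, so repeated words are shared across attribute values.

    word_dict, string_dict = {}, {}   # level 1: words; level 2: word-code sequences

    def encode(s: str) -> int:
        word_codes = tuple(word_dict.setdefault(w, len(word_dict)) for w in s.split())
        return string_dict.setdefault(word_codes, len(string_dict))

    def decode(code: int) -> str:
        inv_strings = {c: ws for ws, c in string_dict.items()}
        inv_words = {c: w for w, c in word_dict.items()}
        return " ".join(inv_words[c] for c in inv_strings[code])

    c = encode("New York City")
    assert decode(c) == "New York City"   # round trip is lossless
    assert encode("New Jersey") != c      # shared word "New" reuses its code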
2005
This paper proposes the compression of data in Relational Database Management Systems (RDBMS) using existing text compression algorithms. Although the proposed technique is general, we believe it is particularly advantageous for the compression of medium-size and large dimension tables in data warehouses. In fact, dimensions usually have a high number of text attributes, and a reduction in their size has a big impact on the execution time of queries that join dimensions with fact tables. In general, the high complexity and long execution time of most data warehouse queries make the compression of dimension text attributes (and any text attributes that may exist in the fact table, such as false facts) an effective approach to speeding up query response time. The proposed approach has been evaluated using the well-known TPC-H benchmark, and the results show that speed improvements greater than 40% can be achieved for most of the queries.
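A minimal sketch of the general approach, with zlib as a stand-in assumption for whichever text compressor is actually used: a dimension's text attribute is stored compressed and decompressed only when a query touches it. The column value is invented for illustration.

    import zlib

    def compress_attr(text: str) -> bytes:
        # Store the attribute value compressed; any general text compressor works.
        return zlib.compress(text.encode("utf-8"))

    def decompress_attr(blob: bytes) -> str:
        # Decompress only when the attribute is actually accessed by a query.
        return zlib.decompress(blob).decode("utf-8")

    comment = "expedited shipping requested; handle with care; " * 8
    blob = compress_attr(comment)
    assert decompress_attr(blob) == comment
    print(len(comment), "->", len(blob), "bytes")  # redundant text shrinks well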
2001
While a variety of lossy compression schemes have been developed for certain forms of digital data (e.g., images, audio, video), the area of lossy compression techniques for arbitrary data tables has been left relatively unexplored. Nevertheless, such techniques are clearly motivated by the ever-increasing data collection rates of modern enterprises and the need for effective, guaranteed-quality approximate answers to queries over massive relational data sets. In this paper, we propose Model-Based Semantic Compression (MBSC), a novel data-compression framework that takes advantage of attribute semantics and data-mining models to perform lossy compression of massive data tables. We describe the architecture and some of the key algorithms underlying SPARTAN, a model-based semantic compression system that exploits predictive data correlations and prescribed error tolerances for individual attributes to construct concise and accurate Classification and Regression Tree (CaRT) models for entire columns of a table. More precisely, SPARTAN selects a certain subset of attributes for which no values are explicitly stored in the compressed table; instead, concise CaRTs that predict these values (within the prescribed error bounds) are maintained. To restrict the huge search space and construction cost of possible CaRT predictors, SPARTAN employs sophisticated learning techniques and novel combinatorial optimization algorithms. Our experimentation with several real-life data sets has offered convincing evidence of the effectiveness of SPARTAN's model-based approach: SPARTAN is able to consistently yield substantially better compression ratios than existing semantic or syntactic compression tools (e.g., gzip) while utilizing only small data samples for model inference. Several promising directions for future research and possible applications of MBSC in the context of network management are identified and discussed.
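A minimal sketch of the CaRT idea under stated assumptions: one numeric column is dropped from storage, a small regression tree predicts it from a stored column, and only the rows where the prediction misses the error tolerance are kept explicitly. The synthetic data, scikit-learn tree, and tolerance value are illustrative; SPARTAN's attribute-selection and optimization machinery are not shown.

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(0)
    protocol = rng.integers(0, 4, 1000)                   # stored column: 4 protocol ids
    base = np.array([40.0, 80.0, 120.0, 200.0])           # per-protocol mean payload size
    bytes_sent = base[protocol] + rng.normal(0, 3, 1000)  # predicted column, not stored

    tolerance = 10.0                                      # prescribed per-attribute error bound
    X = protocol.reshape(-1, 1)
    cart = DecisionTreeRegressor(max_leaf_nodes=8).fit(X, bytes_sent)
    pred = cart.predict(X)

    # Keep only the values the tree cannot predict within the tolerance.
    outliers = np.flatnonzero(np.abs(pred - bytes_sent) > tolerance)
    stored = {int(i): float(bytes_sent[i]) for i in outliers}
    print(f"explicitly stored: {len(stored)} of {len(bytes_sent)} values")

    # Decompression: predict every value, then patch the stored outliers.
    restored = cart.predict(X)
    for i, v in stored.items():
        restored[i] = v
    assert np.all(np.abs(restored - bytes_sent) <= tolerance)  # error guarantee holds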
IEEE Transactions on Parallel and Distributed Systems, 2016
The Semantic Web comprises enormous volumes of semi-structured data elements. For interoperability, these elements are represented by long strings. Such representations are not efficient for applications that perform computations over large volumes of such information. A common approach to alleviating this problem is the use of compression methods that produce more compact representations of the data. Dictionary encoding is particularly prevalent in Semantic Web database systems for this purpose. However, centralized implementations present performance bottlenecks, giving rise to the need for scalable, efficient distributed encoding schemes. In this paper, we propose an efficient algorithm for fast encoding of large Semantic Web datasets. Specifically, we present a detailed implementation of our approach based on the state-of-the-art asynchronous partitioned global address space (APGAS) parallel programming model. We evaluate performance on a cluster of up to 384 cores and datasets of up to 11 billion triples (1.9 TB). Compared to the state-of-the-art approach, we demonstrate a speed-up of 2.6–7.4× and excellent scalability. These results also illustrate the significant potential of the APGAS model for efficient implementation of dictionary encoding and contribute to the engineering of more efficient, larger-scale Semantic Web applications.
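To make the encoding task concrete, here is a toy sketch of the partitioned id-space idea, assuming terms are routed to partitions by hash and each partition assigns ids from its own disjoint range, so uniqueness needs no coordination. The paper's asynchronous APGAS implementation is far more sophisticated than this single-process stand-in.

    NUM_PARTITIONS = 4
    dictionaries = [dict() for _ in range(NUM_PARTITIONS)]  # one dictionary per worker

    def encode_term(term: str) -> int:
        # Route the term to a partition by hash (stable within one process).
        p = hash(term) % NUM_PARTITIONS
        local = dictionaries[p]
        if term not in local:
            # Interleave the partition-local counter with the partition id so
            # id ranges are disjoint across partitions without coordination.
            local[term] = len(local) * NUM_PARTITIONS + p
        return local[term]

    triple = ("<http://example.org/alice>",
              "<http://xmlns.com/foaf/0.1/knows>",
              "<http://example.org/bob>")
    print(tuple(encode_term(t) for t in triple))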
Lecture Notes in Computer Science, 2013
Linked data has experienced accelerated growth in recent years. With the continuing proliferation of structured data, demand for RDF compression is becoming increasingly important. In this study, we introduce a novel lossless compression technique for RDF datasets, called Rule Based Compression (RB Compression) that compresses datasets by generating a set of new logical rules from the dataset and removing triples that can be inferred from these rules. Unlike other compression techniques, our approach not only takes advantage of syntactic verbosity and data redundancy but also utilizes semantic associations present in the RDF graph. Depending on the nature of the dataset, our system is able to prune more than 50% of the original triples without affecting data integrity.
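A toy sketch of the rule-based idea, under the simplifying assumption that rules take the form (p1, o1) => (p2, o2) over shared subjects: a rule that holds for every matching subject is mined, the triples it implies are deleted, and decompression re-infers them losslessly. The five-triple dataset and brute-force miner are illustrative only.

    triples = {
        ("alice", "type", "Person"), ("alice", "species", "Human"),
        ("bob",   "type", "Person"), ("bob",   "species", "Human"),
        ("rex",   "type", "Dog"),
    }

    def mine_rule(ts):
        """Find one (p1, o1) => (p2, o2) rule that holds for every matching subject."""
        patterns = {(p, o) for _, p, o in ts}
        for head in patterns:
            subjects = {s for s, p, o in ts if (p, o) == head}
            for body in patterns - {head}:
                if all((s, *body) in ts for s in subjects):
                    return head, body
        return None

    head, body = mine_rule(triples)  # e.g. ('type', 'Person') => ('species', 'Human')

    # Compression: drop every triple the rule can re-infer.
    compressed = {t for t in triples
                  if not ((t[1], t[2]) == body and (t[0], *head) in triples)}

    # Decompression: re-apply the rule to restore the dropped triples.
    restored = set(compressed)
    for s, p, o in compressed:
        if (p, o) == head:
            restored.add((s, *body))
    assert restored == triples  # lossless round trip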
International Journal of Scientific & Engineering Research, 2013
Data compression has a paramount effect on data warehouses, reducing data size and improving query processing. Distinct compression techniques are feasible at different levels; each either gives a good compression ratio or is well suited to query processing. This paper focuses on applying lossless and lossy compression techniques to relational databases. The proposed technique works at the attribute level of a data warehouse, applying lossless compression to three types of attributes (string, integer, and float) and lossy compression to image attributes.
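As a rough sketch of the attribute-level dispatch described above, with codec choices that are stand-in assumptions rather than the paper's actual ones: lossless codecs for string, integer, and float columns, and a deliberately lossy reduction for an image column. Inverse (decompression) functions are omitted for brevity.

    import struct, zlib

    def compress_column(kind, values):
        if kind == "string":     # lossless: general text compression
            return zlib.compress("\x00".join(values).encode("utf-8"))
        if kind == "int":        # lossless: delta encoding, packed as 64-bit ints
            deltas = [values[0]] + [b - a for a, b in zip(values, values[1:])]
            return struct.pack(f"{len(deltas)}q", *deltas)
        if kind == "float":      # lossless: packed binary doubles
            return struct.pack(f"{len(values)}d", *values)
        if kind == "image":      # lossy: crude downsampling of raw pixel bytes
            return bytes(values[::2])
        raise ValueError(f"unknown attribute kind: {kind}")

    # Example: a sorted integer key column turns into small, compressible deltas.
    print(len(compress_column("int", [1000, 1001, 1003, 1006])), "bytes")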
Proceedings 2003 VLDB Conference, 2003
The Oracle RDBMS recently introduced an innovative compression technique for reducing the size of relational tables.