In this paper, we propose a new web search engine model based on index-query bit-level compression. The model incorporates two bit-level compression layers, both implemented at the back-end processor (server) side: one layer resides after the indexer, acting as a second compression layer to generate a double-compressed index, and the second layer is located after the query parser to compress the query and so enable bit-level compressed index-query search. This reduces the size of the index file as well as disk I/O overheads, and consequently yields a higher retrieval rate and better performance. The data compression scheme used in this model is the adaptive character wordlength (ACW(n,s)) scheme, an asymmetric, lossless, bit-level scheme that permits compressed index-query search. Results investigating the performance of the ACW(n,s) scheme are presented and discussed.
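To make the two-layer idea concrete, here is a minimal Python sketch of compressed index-query search. It uses a hypothetical placeholder codec rather than the actual ACW(n,s) or HCDC schemes, whose codeword construction is not described here; the point is only that applying the same lossless bit-level code to index terms and query terms lets the lookup proceed entirely over compressed keys.

# Minimal sketch of compressed index-query search (illustration only; the
# placeholder codec below stands in for the real bit-level scheme).

def toy_bitlevel_encode(term: str) -> str:
    """Placeholder codec: fixed 8-bit codewords concatenated into a bit string."""
    return "".join(f"{ord(c):08b}" for c in term)

def build_compressed_index(documents: dict[int, str]) -> dict[str, set[int]]:
    """Second compression layer: store postings under compressed term keys."""
    index: dict[str, set[int]] = {}
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index.setdefault(toy_bitlevel_encode(term), set()).add(doc_id)
    return index

def compressed_query_search(index: dict[str, set[int]], query: str) -> set[int]:
    """Query compressor + search: encode the query with the same codec and
    intersect postings without decompressing the index keys."""
    postings = [index.get(toy_bitlevel_encode(t), set()) for t in query.lower().split()]
    return set.intersection(*postings) if postings else set()

docs = {1: "bit level index compression", 2: "query compression for search engines"}
idx = build_compressed_index(docs)
print(compressed_query_search(idx, "compression search"))   # -> {2}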
International Journal of Information Technology and Web Engineering, 2011
In this paper, the authors present a description of a new Web search engine model, the compressed index-query (CIQ) Web search engine model. This model incorporates two bit-level compression layers implemented at the back-end processor (server) side: one layer resides after the indexer, acting as a second compression layer to generate a double-compressed index (index compressor), and the second layer resides after the query parser for query compression (query compressor), enabling bit-level compressed index-query search. The data compression algorithm used in this model is the Hamming codes-based data compression (HCDC) algorithm, an asymmetric, lossless, bit-level algorithm that permits CIQ search. The different components of the new Web model are implemented in a prototype CIQ test tool (CIQTT), which is used as a test bench to validate the accuracy and integrity of the retrieved data and to evaluate the performance of the proposed model. The test results demonstrate that the prop...
2011
In this paper we present a description of a new Web search engine model, namely the compressed index-query (CIQ) Web search engine model, which incorporates two bit-level compression layers implemented at the back-end processor (server) side: one layer resides after the indexer, acting as a second compression layer to generate a double-compressed index (index compressor), and the second layer resides after the query parser for query compression (query compressor), enabling bit-level compressed index-query search. The data compression algorithm used in this model is the Hamming codes-based data compression (HCDC) algorithm, an asymmetric, lossless, bit-level algorithm that permits CIQ search. The different components of the new Web model are implemented in a prototype CIQ test tool (CIQTT), which is used as a test bench to validate the accuracy and integrity of the retrieved data and to evaluate the performance of the new Web search engine model. The test results demonstra...
1997
Keyword-based search engines are the basic building block of text retrieval systems. Higher-level systems such as content-sensitive search engines and knowledge-based systems still rely on keyword search as the underlying text retrieval mechanism. With the explosive growth in content, Internet and intranet information repositories require efficient mechanisms to store as well as index data. In this paper we discuss the implementation of the Shrink and Search Engine (SASE) framework, which unites text compression and indexing to maximize keyword search performance while reducing storage cost. SASE features the novel capability of searching directly through compressed text without explicit decompression. The implementation includes a search server architecture, which can be accessed from a Java front-end to perform keyword search on the Internet. The performance results show that the compression efficiency of SASE is within 7-17% of GZIP, one of the best lossless compression sch...
2012
Finding desired information in a large data set is a difficult problem. Information retrieval is concerned with the structure, analysis, organization, storage, searching, and retrieval of information, and the index is the main constituent of an IR system. Nowadays the exponential growth of information makes the index structure large enough to affect the quality of the IR system, so compressing the index structure is the main contribution of this paper. We compress the document numbers in inverted-file entries using a new coding technique based on run-length encoding; our mechanism applies a specialized code on top of run-length coding. Our experiments show that, on average, this mechanism compresses 67.34% more than the other techniques.
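The abstract does not give the authors' exact code construction, so the following is only an illustrative Python sketch of the general idea: document numbers in an inverted-file entry are converted to gaps, and runs of consecutive documents are collapsed by a run-length step.

# Illustrative sketch (not the authors' exact code): docIDs become gaps, and
# runs of consecutive documents (gap == 1) are collapsed by run-length coding.

def encode_postings(doc_ids: list[int]) -> list[tuple[str, int]]:
    """Return ('gap', g) and ('run', n) symbols for a sorted list of docIDs."""
    out, prev, run = [], 0, 0
    for d in doc_ids:
        gap = d - prev
        prev = d
        if gap == 1:
            run += 1
        else:
            if run:
                out.append(("run", run))
                run = 0
            out.append(("gap", gap))
    if run:
        out.append(("run", run))
    return out

def decode_postings(symbols: list[tuple[str, int]]) -> list[int]:
    doc_ids, prev = [], 0
    for kind, value in symbols:
        if kind == "gap":
            prev += value
            doc_ids.append(prev)
        else:  # a run of `value` consecutive documents
            for _ in range(value):
                prev += 1
                doc_ids.append(prev)
    return doc_ids

postings = [3, 4, 5, 6, 9, 15, 16]
assert decode_postings(encode_postings(postings)) == postings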
Proceedings of the 14th international conference on World Wide Web - WWW '05, 2005
The unarguably fast, and continuous, growth of the volume of indexed (and indexable) documents on the Web poses a great challenge for search engines, regarding not only search effectiveness but also time and space efficiency. In this paper we present an index pruning technique targeted at search engines that addresses the latter issue without disregarding the former. To this effect, we adopt a new pruning strategy capable of greatly reducing the size of search engine indices. Experiments using a real search engine show that our technique can reduce the indices' storage costs by up to 60% over traditional lossless compression methods, while keeping the loss in retrieval precision to a minimum. When compared to the index size with no compression at all, the compression rate is higher than 88%, i.e., less than one eighth of the original size. More importantly, our results indicate that, due to the reduction in storage overhead, query processing time can be reduced to nearly 65% of the original time, with no loss in average precision. The new method yields significant improvements over the best known static pruning method for search engine indices. In addition, since our technique is orthogonal to the underlying search algorithms, it can be adopted by virtually any search engine.
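The pruning strategy itself is not detailed in the abstract; the hypothetical Python sketch below shows a generic term-centric variant of static pruning, in which each inverted list keeps only its highest-scoring postings, to illustrate how such a technique stays orthogonal to the underlying search algorithm.

# Generic static-pruning sketch (not the paper's own strategy): keep only the
# top-scoring postings of each term, trading bounded precision loss for space.

def prune_index(index: dict[str, list[tuple[int, float]]],
                keep_fraction: float = 0.4,
                min_keep: int = 10) -> dict[str, list[tuple[int, float]]]:
    """index maps term -> [(doc_id, score), ...]; keep the top-scoring part."""
    pruned = {}
    for term, postings in index.items():
        k = max(min_keep, int(len(postings) * keep_fraction))
        ranked = sorted(postings, key=lambda p: p[1], reverse=True)
        # store the survivors back in docID order so gap coding still applies
        pruned[term] = sorted(ranked[:k], key=lambda p: p[0])
    return pruned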
Proceedings of the 23rd International Conference on Database and Expert Systems Applications, 2012
To sustain the tremendous workloads they face on a daily basis, Web search engines employ highly compressed data structures known as inverted indexes. Previous work demonstrated that organizing the inverted lists of the index in individual blocks of postings leads to significant efficiency improvements, and the recent literature has shown that current state-of-the-art compression strategies such as PForDelta and VSEncoding perform well when used to encode the lists' docIDs. In this paper we examine their performance when used to compress the positional values. We expose their drawbacks and introduce PFBC, a simple yet efficient encoding scheme that encodes the positional data of an inverted list block using a fixed number of bits. PFBC allows direct access to the required data by avoiding costly look-ups and unnecessary decoding, achieving positional decompression several times faster than the state-of-the-art approaches.
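A simplified Python illustration of the PFBC idea follows: every positional value in a block is stored with one fixed bit width, chosen from the block's largest value, so the i-th position sits at a known bit offset and can be read without decoding its neighbours. The packing layout here is an assumption for illustration, not the paper's exact format.

# Simplified illustration: fixed-width bit packing of a block of positions.

def pack_block(positions: list[int]) -> tuple[int, bytes]:
    width = max(1, max(positions).bit_length())
    bits = 0
    for i, p in enumerate(positions):
        bits |= p << (i * width)
    nbytes = (width * len(positions) + 7) // 8
    return width, bits.to_bytes(nbytes, "little")

def read_position(width: int, packed: bytes, i: int) -> int:
    """Direct access: the i-th value sits at the known bit offset i * width."""
    bits = int.from_bytes(packed, "little")
    return (bits >> (i * width)) & ((1 << width) - 1)

width, packed = pack_block([4, 17, 23, 90, 1024])
assert read_position(width, packed, 3) == 90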
2010
Modern computers typically use 64-bit words as the fundamental unit of data access. However, the decade-long migration from 32-bit architectures has not been reflected in compression technology, because of a widespread assumption that effective compression techniques operate in terms of bits or bytes rather than words. Here we demonstrate that the use of 64-bit access units, especially in connection with word-bounded codes, does indeed provide an opportunity to improve compression performance. In particular, we extend several 32-bit word-bounded coding schemes to 64-bit operation and explore their use in information retrieval applications. Our results show that the Simple-8b approach, a 64-bit word-bounded code, is an excellent self-skipping code and has a clear advantage over its competitors in supporting fast query evaluation when the data being compressed represent the inverted index for a large text collection. The advantages of the new code also accrue on 32-bit architectures, and for all of Boolean, ranked, and phrase queries, which means that it can be used in any situation.
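The sketch below shows the word-bounded packing decision in the Simple-8b spirit, in Python for readability. The real Simple-8b code uses a 4-bit selector and a fixed table of admissible widths inside each 64-bit word; this reduced variant keeps the width explicit and only enforces the 60-bit payload budget.

# Greedy sketch of a 64-bit word-bounded code (not the real selector table):
# each output "word" holds as many equal-width values as fit in 60 payload bits.

def pack_words(values: list[int]) -> list[tuple[int, list[int]]]:
    """Split `values` into groups, each encodable as equal-width integers
    inside one word's 60-bit payload."""
    assert all(v.bit_length() <= 60 for v in values)
    words, i = [], 0
    while i < len(values):
        width, payload, j = 1, [], i
        while j < len(values):
            w = max(width, values[j].bit_length() or 1)
            if w * (len(payload) + 1) > 60:
                break
            width = w
            payload.append(values[j])
            j += 1
        words.append((width, payload))
        i = j
    return words

def unpack_words(words: list[tuple[int, list[int]]]) -> list[int]:
    return [v for _, payload in words for v in payload]

gaps = [1, 1, 3, 2, 7, 120, 1, 1, 5]
assert unpack_words(pack_words(gaps)) == gaps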
The field of compressed data structures seeks to achieve fast search time, but using a compressed representation, ideally requiring less space than that occupied by the original input data. The challenge is to construct a compressed representation that provides the same functionality and speed as traditional data structures. In this invited presentation, we discuss some breakthroughs in compressed data structures over the course of the last decade that have significantly reduced the space requirements for fast text and document indexing. One interesting consequence is that, for the first time, we can construct data structures for text indexing that are competitive in time and space with the well-known technique of inverted indexes, but that provide more general search capabilities. Several challenges remain, and we focus in this presentation on two in particular: building I/O-efficient search structures when the input data are so massive that external memory must be used, and incorporating notions of relevance in the reporting of query answers.
TURKISH JOURNAL OF ELECTRICAL ENGINEERING & COMPUTER SCIENCES, 2016
We propose a new compression algorithm that compresses plain texts using a dictionary-based model, together with a compressed string-matching approach that can be used with the compressed texts produced by this algorithm. The compression algorithm (CAFTS) can reduce texts to approximately 41% of their original size. The presented compressed string-matching approach (SoCAFTS), which can be used with any of the known pattern-matching algorithms, is compared with a powerful compressed string-matching algorithm (ETDC) and a compressed string-matching tool (Lzgrep). Although the search speed of ETDC is very good for short patterns, it can only search for exact words, and its compression performance differs from one natural language to another because of its word-based structure. Our experimental results show that SoCAFTS is a good solution when it is necessary to search for long patterns in a compressed document.
Computer, 2000
In this article we discuss recent methods for compressing the text and the index of text retrieval systems. By compressing both the complete text and the index, the total amount of space is less than half the size of the original text alone. Most surprisingly, the time required to build the index and also to answer a query is much less than if the index and text had not been compressed. This is one of the few cases where there is no space-time trade-off. Moreover, the text can be kept compressed all the time, allowing updates when changes occur in the compressed text.
Computing Research Repository, 2007
A compressed full-text self-index represents a text in a compressed form and still answers queries efficiently. This technology represents a breakthrough over the text indexing techniques of the previous decade, whose indexes required several times the size of the text. Although it is relatively new, this technology has matured to the point where theoretical research is giving way to practical developments. Nonetheless this requires significant programming skills, a deep engineering effort, and a strong algorithmic background to dig into the research results. To date only isolated implementations and focused comparisons of compressed indexes have been reported, and they lacked a common API, which prevented their re-use or deployment within other applications.
2009
Web search engines use indexes to efficiently retrieve pages containing specified query terms, as well as pages linking to specified pages. The problem of building compressed indexes that permit such fast retrieval has a long history. We consider the following question: assuming that the terms in (or links to) a page are generated from a probability distribution, how compactly can we build such indexes while still allowing fast retrieval?
Journal of Experimental Algorithmics, 2009
A compressed full-text self-index represents a text in a compressed form and still answers queries efficiently. This represents a significant advancement over the (full-)text indexing techniques of the previous decade, whose indexes required several times the size of the text. Although it is relatively new, this algorithmic technology has matured to the point where theoretical research is giving way to practical developments. Nonetheless this requires significant programming skills, a deep engineering effort, and a strong algorithmic background to dig into the research results. To date only isolated implementations and focused comparisons of compressed indexes have been reported, and they lacked a common API, which prevented their re-use or deployment within other applications. The goal of this article is to fill this gap. First, we present the existing implementations of compressed indexes from a practitioner's point of view. Second, we introduce the Pizza&Chili site, which of...
Arabian Journal for Science and Engineering, 2018
In this article, we present a novel word-based lossless compression algorithm for text files using a semi-static model. We named this method the 'Multi-stream word-based compression algorithm (MWCA)' because it stores the compressed forms of the words in three individual streams depending on their frequencies in the text, and stores two dictionaries and a bit vector as side information. In our experiments, MWCA produces a compression ratio of 3.23 bpc on average and 2.88 bpc for files greater than 50 MB; if a variable-length encoder such as Huffman coding is used after MWCA, these ratios are reduced to 2.65 and 2.44 bpc, respectively. MWCA supports exact word matching without decompression, and its multi-stream approach reduces search time with respect to single-stream algorithms. Additionally, the MWCA multi-stream structure reduces network load by requesting only the necessary streams from the database. With the advantage of its fast compressed-search feature and multi-stream structure, we believe that MWCA is a good solution, especially for storing and searching big text data.
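A rough Python sketch of the multi-stream routing idea follows. MWCA's actual codeword layout, dictionaries, and bit-vector side information are not reproduced; the sketch only shows how tokens can be routed to frequency classes and later merged back, which is what lets a search fetch only the streams it needs.

# Rough sketch of the multi-stream idea (not MWCA's actual format): tokens are
# routed to a high-, mid-, or low-frequency stream, with a routing sequence
# kept as side information so the original order can be rebuilt.

from collections import Counter

def split_streams(words: list[str], hi: int = 100, mid: int = 10):
    freq = Counter(words)
    streams = {0: [], 1: [], 2: []}   # 0: high-, 1: mid-, 2: low-frequency
    routing = []                       # stream id per token position
    for w in words:
        s = 0 if freq[w] >= hi else 1 if freq[w] >= mid else 2
        streams[s].append(w)
        routing.append(s)
    return streams, routing

def merge_streams(streams: dict[int, list[str]], routing: list[int]) -> list[str]:
    """Rebuild the original token order from the streams and the routing info."""
    iters = {s: iter(lst) for s, lst in streams.items()}
    return [next(iters[s]) for s in routing]

tokens = ("the quick fox and the lazy dog and the fox " * 20).split()
streams, routing = split_streams(tokens, hi=40, mid=25)
assert merge_streams(streams, routing) == tokens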
1997
We present a technique to build an index based on suffix arrays for compressed texts. We also propose a compression scheme for textual databases based on words that generates a compression code that preserves the lexicographical ordering of the text words. As a consequence it permits the sorting of the compressed strings to generate the suffix array without decompressing. As the compressed text is under 30% of the size of the original text, we are able to build the suffix array twice as fast on the compressed text. The compressed text plus index is 55-60% of the size of the original text plus index, and search times are reduced to approximately half the time. We also present analytical and experimental results for different variations of the word-oriented compression paradigm.
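A small Python sketch of an order-preserving word code is given below (not the paper's exact scheme): the vocabulary is sorted and each word receives a fixed-width codeword in that order, so comparing compressed strings is equivalent to comparing the underlying word sequences, which is the property that allows the suffix array to be built on the compressed text.

# Sketch of an order-preserving word code: codewords are assigned in the
# lexicographic order of the vocabulary, so codeword order equals word order.

def build_codebook(vocabulary: set[str]) -> dict[str, bytes]:
    width = max(2, (len(vocabulary).bit_length() + 7) // 8)
    return {w: i.to_bytes(width, "big")
            for i, w in enumerate(sorted(vocabulary))}

def compress(text: str, codebook: dict[str, bytes]) -> bytes:
    return b"".join(codebook[w] for w in text.split())

vocab = {"compressed", "index", "suffix", "array", "text"}
book = build_codebook(vocab)
# lexicographic order of codewords matches lexicographic order of words
assert (book["array"] < book["compressed"] < book["index"]
        < book["suffix"] < book["text"])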
Proceedings of the 25th …, 2002
Compression reduces both the size of indexes and the time needed to evaluate queries. In this paper, we revisit the compression of inverted lists of document postings that store the position and frequency of indexed terms, considering two approaches to improving retrieval efficiency: better implementation and better choice of integer compression schemes. First, we propose several simple optimisations to well-known integer compression schemes, and show experimentally that these lead to significant reductions in time. Second, we explore the impact of choice of compression scheme on retrieval efficiency.
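As one concrete example of the well-known integer compression schemes such work starts from, here is plain variable-byte coding of d-gaps in Python; the paper's specific optimisations are not reproduced.

# Plain variable-byte coding of d-gaps: 7 payload bits per byte, with the high
# bit set on the final byte of each integer.

def vbyte_encode(gaps: list[int]) -> bytes:
    out = bytearray()
    for g in gaps:
        while g >= 128:
            out.append(g & 0x7F)
            g >>= 7
        out.append(0x80 | g)
    return bytes(out)

def vbyte_decode(data: bytes) -> list[int]:
    gaps, value, shift = [], 0, 0
    for b in data:
        value |= (b & 0x7F) << shift
        shift += 7
        if b & 0x80:           # terminating byte of this integer
            gaps.append(value)
            value, shift = 0, 0
    return gaps

gaps = [5, 1, 130, 20000, 3]
assert vbyte_decode(vbyte_encode(gaps)) == gaps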
Three information retrieval storage structures are considered to determine their suitability for a World Wide Web search engine, The Wolverhampton Web Library – The Next Generation: an inverted file, a signature file, and a Pat tree. A number of implementations are considered for each structure. For the index of an inverted file, a sorted array, B-tree, B+-tree, trie, and hash table are considered; for the signature file, vertical and horizontal partitioning schemes are considered; and for the Pat tree, an array and a Patricia tree are considered. A theoretical comparison of the structures is made using seven criteria: response time, support for results ranking, search techniques, file maintenance, efficient use of disk space (including the use of compression), scalability, and extensibility. The comparison reveals that the inverted file is the most suitable structure, unlike the signature file and Pat tree, which encounter problems with very large corpora.
Information Processing & Management, 1992
This article reports on a variety of compression algorithms developed in the context of a project to put all the data files for a full-text retrieval system on CD-ROM. In the context of inexpensive pre-processing, a text-compression algorithm is presented that is based on Markov-modeled Huffman coding on an extended alphabet. Data structures are examined for facilitating random access into the compressed text. In addition, new algorithms are presented for compression of word indices, both the dictionaries (word lists) and the text pointers (concordances). The ARTFL database is used as a test case throughout the article.
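As an illustration of the basic building block, the following Python sketch constructs a word-level Huffman code; the article's Markov modelling and extended alphabet are not reproduced here.

# Tiny Huffman-code construction over a word alphabet (building block only).

import heapq
from collections import Counter

def huffman_code(tokens: list[str]) -> dict[str, str]:
    heap = [(n, i, {tok: ""}) for i, (tok, n) in enumerate(Counter(tokens).items())]
    heapq.heapify(heap)
    seq = len(heap)
    if len(heap) == 1:                       # degenerate single-symbol alphabet
        return {tok: "0" for tok in heap[0][2]}
    while len(heap) > 1:
        n1, _, c1 = heapq.heappop(heap)      # two least-frequent subtrees
        n2, _, c2 = heapq.heappop(heap)
        merged = {t: "0" + c for t, c in c1.items()}
        merged.update({t: "1" + c for t, c in c2.items()})
        heapq.heappush(heap, (n1 + n2, seq, merged))
        seq += 1
    return heap[0][2]

codes = huffman_code("to be or not to be".split())
# prefix-free check: no codeword is a prefix of another
assert all(not b.startswith(a) for a in codes.values() for b in codes.values() if a != b)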