Fast Dictionary-Based Compression for Inverted Indexes

Giulio Ermanno Pibiri; Matthias Petri; Alistair Moffat

Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining

Abstract

Dictionary-based compression schemes provide fast decoding, typically at the expense of reduced compression effectiveness compared to statistical or probability-based approaches. In this work, we apply dictionary-based techniques to the compression of inverted lists, showing that the high degree of regularity that these integer sequences exhibit is a good match for certain types of dictionary methods, and that an important new trade-off between compression effectiveness and compression efficiency can be achieved. Our observations are supported by experiments using the document-level inverted index data for two large text collections, with a wide range of other index compression implementations as reference points. Those experiments demonstrate that the gap between efficiency and effectiveness can be substantially narrowed.

Figures (14)

Table 1: Two examples of plurally parsable dictionaries of width ℓ = 4 over the alphabet {a,b,c,d} where symbol “a” is highly probable, symbol “b” is moderately probable, and “-” entries indicate don’t-care values. The last column provides the length of each string and is also stored as part of the dictionary. In (b), index zero is used as the code for rare symbol exceptions.

Table 2: Total index size in GiB for a complete document-level index (docids and freqs combined, including block-access overhead and dictionary space) for the Gov2 collection using a block size of B = 256 items and the DSV dictionary construction approach.

Table 3: Performance counts reported by the Linux perf tool, comparing variable-length copying and constant-length copying for ℓ = 16 when decoding the index sequences of Gov2 using a rectangular dictionary.

Figure 3: Packed layout for the dictionary shown in Table 1(b), with ℓ = 4 and b = 3. The first number in each element in start[] is the sequence length. All trailing don’t-care symbols have been trimmed, and dictionary sequences have been removed if they are a prefix of another longer sequence. As an example, when the input codeword is 3 the four integers in dictionary[4 ...7] are copied to the output, and then the output pointer is incremented by two. In this illustration no provision has been made for frequent symbol exceptions.
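The decoding step described in the Figure 3 caption can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the contents of DICTIONARY and START are hypothetical (Table 1(b) is not reproduced here), the names are invented for the example, and symbols are written as small integers. Only entry 3 is chosen to mirror the caption, so that codeword 3 copies dictionary[4..7] and advances the output pointer by two.

```python
L = 4  # entry width (4 in the caption); every copy moves exactly L integers

# Hypothetical packed dictionary. The three trailing zeros are padding so
# that a fixed-width copy starting near the end stays in bounds.
DICTIONARY = [1, 1, 1, 1, 1, 2, 1, 3, 2, 2, 0, 0, 0]

# START[code] = (sequence length, offset into DICTIONARY).
# Prefixes of a longer sequence share its storage (codes 0, 1, 2).
START = [
    (4, 0),  # 1 1 1 1
    (2, 0),  # 1 1
    (1, 0),  # 1
    (2, 4),  # 1 2   (copy DICTIONARY[4..7], then advance by 2)
    (4, 4),  # 1 2 1 3
    (1, 5),  # 2
    (2, 8),  # 2 2
    (1, 7),  # 3
]

def decode(codewords, n):
    """Decode codewords into n integers using constant-length copies.

    n is assumed to be known to the caller (for example, a block length).
    """
    out = [0] * (n + L)  # slack absorbs the over-copy of the last codeword
    pos = 0
    for code in codewords:
        length, offset = START[code]
        # Always copy L integers; the surplus is overwritten by the next
        # copy or discarded at the end.
        out[pos:pos + L] = DICTIONARY[offset:offset + L]
        pos += length  # advance only by the true sequence length
    return out[:pos]

print(decode([3, 0, 7], 7))  # [1, 2, 1, 1, 1, 1, 3]
```

Copying a fixed number of integers removes a data-dependent loop bound from the inner loop; the cost is that the output buffer needs L integers of slack and the dictionary array needs padding at its end.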

Table 4: Average number of decoded integers per codeword, m, when decoding the sequences of Gov2, using DINT-DSF with a rectangular dictionary and b = 16; and predicted and actual decoding times measured in nanoseconds per integer, based on c, the measured decoding time per codeword.
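One plausible reading of the Table 4 caption is that the predicted time per integer is the per-codeword time divided by the average number of integers each codeword yields, that is, c / m. The sketch below encodes that assumption; the function name and the sample values are hypothetical and are not taken from the table.

```python
def predicted_ns_per_integer(c_ns_per_codeword, m_integers_per_codeword):
    """Predicted decoding time per integer, assuming the model c / m."""
    if m_integers_per_codeword <= 0:
        raise ValueError("m must be positive")
    return c_ns_per_codeword / m_integers_per_codeword

# Hypothetical inputs: 2.0 ns per codeword, 4 integers per codeword.
print(predicted_ns_per_integer(2.0, 4.0))  # 0.5
```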

Figure 4: Example in which optimal parsing requires fewer codewords than greedy parsing. The sequence “aadbaaaa” is being represented relative to the dictionary shown in Table 1(b). The cost below each node is the length of the shortest path from the origin to that point. The parse shown at the bottom of the graph has a cost of 5, whereas the greedy parse shown above requires 6 codewords. In both cases it is assumed that “d” requires a rare symbol exception (rse) codeword, followed by a patch codeword that identifies the symbol required.
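The difference between the two parses in the Figure 4 caption can be sketched as a longest-match scan versus a shortest-path computation over positions in the sequence. This is an illustration under stated assumptions, not the authors' code: ENTRIES is a hypothetical dictionary (Table 1(b) is not reproduced here) chosen so that the codeword counts for “aadbaaaa” come out the same as in the caption, and any symbol may be coded as an exception costing two codewords (rse plus patch).

```python
# Hypothetical dictionary entries; "d" is absent and so needs an exception.
ENTRIES = {"a", "aa", "aaaa", "b", "ba", "c"}
MAX_LEN = max(len(e) for e in ENTRIES)

def greedy_cost(seq, exception_cost=2):
    """Codeword count when the longest matching entry is taken each time."""
    i = cost = 0
    while i < len(seq):
        for k in range(min(MAX_LEN, len(seq) - i), 0, -1):
            if seq[i:i + k] in ENTRIES:
                i += k
                cost += 1
                break
        else:  # no entry matches here: rse codeword plus patch codeword
            i += 1
            cost += exception_cost
    return cost

def optimal_cost(seq, exception_cost=2):
    """Minimum codeword count, as a shortest path over positions 0..len(seq)."""
    inf = float("inf")
    best = [0] + [inf] * len(seq)  # best[i] = cheapest parse of seq[:i]
    for i in range(len(seq)):
        # Exception edge: always available, covers one symbol.
        best[i + 1] = min(best[i + 1], best[i] + exception_cost)
        # Dictionary edges: one codeword covers k symbols.
        for k in range(1, min(MAX_LEN, len(seq) - i) + 1):
            if seq[i:i + k] in ENTRIES:
                best[i + k] = min(best[i + k], best[i] + 1)
    return best[-1]

print(greedy_cost("aadbaaaa"))   # 6
print(optimal_cost("aadbaaaa"))  # 5
```

With this dictionary, greedy takes “ba” after the exception and is then left with “aa” and “a”, whereas the shortest path takes “b” followed by “aaaa”.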

Table 5: Number of lists, postings and documents for the Gov2 and CCNEWS collections.

Table 6: Total index size (GiB) and compression rate (bits per integer) for docids and freqs for two test collections, using a range of compression techniques. The two DINT implementations both make use of b = 16 and ℓ = 16, single dictionaries, and greedy parsing. The rows are ordered by decreasing total index size.

Table 7: Sequential decoding throughput in nanoseconds per integer, measured over the complete index. Both DINT rows make use of b = 16 and a packed dictionary. The rows are ordered by increasing speed of docids decoding on Gov2.

Figure 5: Percentage of integers, codewords, and dictionary entries associated with each target size for the docids and freqs of Gov2, using the DSF approach and parameters b = 16, ℓ = 16.

Table 8: Total index size (GiB) and compression rate (bits per integer) for docids and freqs using b = 16 and ℓ = 16. The first row uses DINT-DSF with greedy parsing; four enhancements are then added and in the penultimate row a total of 12 contexts are used with optimal parsing and an exhaustive search to identify the cheapest context for each block. In the last row, the dictionary indices are then assumed to be input to a set of 12 optimal entropy coders. Except for the last row (gray numbers), which contains values that are calculated rather than measured, these results can be directly compared with Table 6.

Table 9: Dictionary space (MiB) for different schemes and corresponding decoding speeds, for Gov2. In the three “x1” rows, one dictionary is used for the docids and another for the freqs. In the “x6” rows, six dictionaries are used for each stream, with (in the last row) all six of them combined into a single dictionary[] array, and six start[] arrays maintained.

Figure 6: Final effectiveness-efficiency graph for Gov2. The vertical scale sums the per-posting docids and freqs times for each method; the horizontal scale shows total index size in GiB. The two DINT points both use a single packed dictionary per stream, b = 16, and optimal parsing. Both scales are logarithmic.

Anh Vo

Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval - SIGIR '98, 1998

Compressed inverted files are the most compact way of indexing large text databases, typically occupying around 10% of the space of the collection they index. The drawback of compression is the need to decompress the index lists during query processing. Here we describe an improved implementation of compressed inverted lists that eliminates almost all redundant decoding and allows extremely fast processing of conjunctive Boolean queries and ranked queries. We also describe a pruning method to reduce the number of candidate documents considered during the evaluation of ranked queries. Experimental results with a database of 510 Mb show that the new mechanism can reduce the CPU and elapsed time for Boolean queries of 4-10 terms to one tenth and one fifth respectively of the standard technique. For ranked queries, the new mechanism reduces both CPU and elapsed time to one third and memory usage to less than one tenth of the standard algorithm, with no degradation in retrieval effectiveness.
