Index Compression
Dr. Subrat Kumar Nayak
Associate Professor
Dept. of CSE, ITER, SOADU
What Will Be Discussed?
Collection statistics in more detail (with RCV1)
How big will the dictionary and postings be?
Dictionary compression
Postings compression
Why compression (in general)?
Use less disk space
Saves a little money
Keep more stuff in memory
Increases speed
Increase speed of data transfer from disk to memory
[read compressed data | decompress] is faster than [read uncompressed data]
Premise: Decompression algorithms are fast
True of the decompression algorithms we use
Why compression for inverted indexes?
Dictionary
Make it small enough to keep in main memory
Make it so small that you can keep some postings lists in main memory too
Postings file(s)
Reduce disk space needed
Decrease time needed to read postings lists from disk
Large search engines keep a significant part of the postings in memory.
Compression lets you keep more in memory
We will devise various IR-specific compression
schemes
Recall Reuters RCV1
symbol   statistic                                        value
N        documents                                        800,000
L        avg. # tokens per doc                            200
M        terms (= word types)                             ~400,000
         avg. # bytes per token (incl. spaces/punct.)     6
         avg. # bytes per token (without spaces/punct.)   4.5
         avg. # bytes per term                            7.5
         non-positional postings                          100,000,000
Index parameters vs. what we index
                 word types (terms)           non-positional postings       positional postings
                 (dictionary)                 (non-positional index)        (positional index)
                 Size (K)   ∆%   cumul %      Size (K)   ∆%   cumul %       Size (K)   ∆%   cumul %
Unfiltered         484                         109,971                       197,879
No numbers         474      -2      -2         100,680    -8      -8         179,158    -9      -9
Case folding       392     -17     -19          96,969    -3     -12         179,158     0      -9
30 stopwords       391      -0     -19          83,390   -14     -24         121,858   -31     -38
150 stopwords      391      -0     -19          67,002   -30     -39          94,517   -47     -52
stemming           322     -17     -33          63,812    -4     -42          94,517     0     -52
Lossless vs. lossy compression
Lossless compression: All information is preserved.
What we mostly do in IR.
Lossy compression: Discard some information
Several of the preprocessing steps can be viewed as lossy
compression: case folding, stop words, stemming, number
elimination.
Chap/Lecture 7: Prune postings entries that are unlikely to turn up in
the top k list for any query.
Almost no loss of quality for the top k list.
Vocabulary vs. collection size
How big is the term vocabulary?
That is, how many distinct words are there?
Can we assume an upper bound?
Not really: at least 70^20 ≈ 10^37 different words of length 20
In practice, the vocabulary will keep growing with the collection
size
Especially with Unicode ☺
Vocabulary vs. collection size
Heaps’ law: M = kT^b
M is the size of the vocabulary, T is the number of tokens in the
collection
Typical values: 30 ≤ k ≤ 100 and b ≈ 0.5
In a log-log plot of vocabulary size M vs. T, Heaps’ law predicts a
line with slope about ½
It is the simplest possible relationship between the two in log-log space
An empirical finding (“empirical law”)
Heaps’ Law
For RCV1, the dashed line log10 M = 0.49 log10 T + 1.64 is the best least-squares fit.
Thus M = 10^1.64 T^0.49, so k = 10^1.64 ≈ 44 and b = 0.49.
Good empirical fit for Reuters RCV1!
For the first 1,000,020 tokens, the law predicts 38,323 terms; actually, 38,365 terms.
(Fig 5.1, p. 81)
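A quick way to check the fit is to plug the constants straight into Heaps' law; the short Python sketch below (constants as given above, shown only for illustration) reproduces the predicted vocabulary size.

# Heaps' law: M = k * T^b, with the RCV1 fit k ≈ 44 and b = 0.49
k = 44
b = 0.49
T = 1_000_020                  # tokens seen so far

M_predicted = k * T ** b
print(round(M_predicted))      # ≈ 38,323 terms (the slide reports 38,365 observed)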
Zipf’s law
Heaps’ law gives the vocabulary size in collections.
We also study the relative frequencies of terms.
In natural language, there are a few very frequent terms and very many
very rare terms.
Zipf’s law: The i-th most frequent term has frequency proportional to 1/i.
cf_i ∝ 1/i = K/i, where K is a normalizing constant
cf_i is collection frequency: the number of occurrences of the term t_i in the collection.
Zipf consequences
If the most frequent term (the) occurs cf_1 times
then the second most frequent term (of) occurs cf_1/2 times
the third most frequent term (and) occurs cf_1/3 times …
Equivalently: cf_i = c·i^k with k = −1, so
log cf_i = log c + k·log i = log c − log i
Linear relationship between log cf_i and log i
Another power-law relationship
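As a concrete illustration of this consequence, the sketch below (the top-term frequency is a made-up number, not an RCV1 statistic) generates the frequencies Zipf's law predicts from the frequency of the most frequent term.

# Zipf's law: cf_i ≈ cf_1 / i for the i-th most frequent term
cf_1 = 1_000_000                       # hypothetical collection frequency of "the"

for i, term in enumerate(["the", "of", "and"], start=1):
    print(term, round(cf_1 / i))       # 1000000, 500000, 333333

# In log space: log cf_i = log cf_1 - log i, i.e. a line of slope -1 on a log-log plot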
Zipf’s law for Reuters RCV1
Compression
Now, we will consider compressing the space for the dictionary
and postings
Basic Boolean index only
No study of positional indexes, etc.
We will consider compression schemes
Why compress the dictionary?
Search begins with the dictionary
We want to keep it in memory
Memory footprint competition with other applications
Embedded/mobile devices may have very little memory
Even if the dictionary isn’t in memory, we want it to be
small for a fast search startup time
So, compressing the dictionary is important
Dictionary storage - first cut
Array of fixed-width entries
~400,000 terms; 28 bytes/term = 11.2 MB.
Terms      Freq.      Postings ptr.
a          656,265    →
aachen     65         →
….         ….         ….
zulu       221        →
(20 bytes per term, 4 bytes each for Freq. and the postings pointer; a dictionary search structure sits on top of the terms)
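As a back-of-the-envelope check on the 11.2 MB figure, here is a minimal Python sketch of the arithmetic (field widths as in the table above).

# Fixed-width entry: 20-byte term + 4-byte Freq. + 4-byte postings pointer = 28 bytes
TERMS = 400_000
ENTRY_BYTES = 20 + 4 + 4

print(TERMS * ENTRY_BYTES / 1e6, "MB")   # 11.2 MB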
Fixed-width terms are wasteful
Most of the bytes in the Term column are wasted – we allot 20 bytes for 1-letter terms.
And we still can’t handle supercalifragilisticexpialidocious or
hydrochlorofluorocarbons.
Written English averages ~4.5 characters/word.
Exercise: Why is/isn’t this the number to use for estimating the dictionary
size?
Ave. dictionary word in English: ~8 characters
How do we use ~8 characters per dictionary term?
Short words dominate token counts but not type average.
Compressing the term list: Dictionary-as-a-String
Store dictionary as a (long) string of characters:
Pointer to next word shows end of current word.
Hope to save up to 60% of dictionary space.
….systilesyzygeticsyzygialsyzygyszaibelyiteszczecinszomo….
Each table entry now holds Freq., a Postings ptr., and a Term ptr. into the string.
Total string length = 400K x 8B = 3.2 MB
Pointers resolve 3.2M positions: log2(3.2M) ≈ 22 bits = 3 bytes
Space for dictionary as a string
4 bytes per term for Freq.
4 bytes per term for pointer to Postings.
3 bytes per term pointer (into the term string)
Avg. 8 bytes per term in the term string
Table entries now average 11 bytes/term, not 20.
400K terms x 19 bytes = 7.6 MB (against 11.2 MB for fixed width)
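A minimal sketch of the idea (hypothetical helper names, for illustration only, not the layout of any particular system): the sorted terms are concatenated into one string, each table entry keeps an offset into it, and lookup is a binary search over those offsets.

# Dictionary-as-a-string: sorted terms concatenated, one term pointer (offset) per term
terms = sorted(["systile", "syzygetic", "syzygial", "syzygy", "szaibelyite"])
term_string = "".join(terms)

offsets = []                         # the 3-byte term pointers of the real structure
pos = 0
for t in terms:
    offsets.append(pos)
    pos += len(t)
offsets.append(len(term_string))     # sentinel: a term ends where the next one begins

def term_at(i):
    """Recover the i-th term from the string via its pointer and the next pointer."""
    return term_string[offsets[i]:offsets[i + 1]]

def lookup(query):
    """Binary search over the term pointers; returns the row index for Freq./postings."""
    lo, hi = 0, len(terms) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        t = term_at(mid)
        if t == query:
            return mid
        if t < query:
            lo = mid + 1
        else:
            hi = mid - 1
    return None

print(lookup("syzygy"))              # 3 (its position in the sorted term list)

With 4 + 4 + 3 bytes in the table plus an average of 8 bytes in the string, this is the 19 bytes/term (7.6 MB) quoted above.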
Blocking
Store pointers to every kth term string.
Example below: k=4.
Need to store term lengths (1 extra byte)
….7systile9syzygetic8syzygial6syzygy11szaibelyite8szczecin9szomo….
The table still holds Freq. and Postings ptr. for every term, but only one Term ptr. per block.
Save 9 bytes on 3 term pointers; lose 4 bytes on term lengths.
Net saving: 5 bytes per block of 4.
Example for block size k = 4
Where we used 3 bytes/pointer without blocking
3 x 4 = 12 bytes,
now we use 3 + 4 = 7 bytes.
Shaved another ~0.5MB. This reduces the size of the
dictionary from 7.6 MB to 7.1 MB.
We can save more with larger k.
Why not go with larger k?
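The per-block arithmetic generalizes to any block size; a small sketch of the estimate, using the byte counts assumed on these slides (8-byte average terms, 4-byte Freq. and postings pointers, 3-byte term pointers, 1-byte lengths), is shown below.

def dictionary_size_mb(k, terms=400_000, term_bytes=8, freq_bytes=4,
                       postings_ptr_bytes=4, term_ptr_bytes=3, length_bytes=1):
    """Estimated dictionary size (MB) with blocking: one term pointer per block of k,
    one length byte per term, full Freq. and postings-pointer fields per term."""
    per_term = (term_bytes + freq_bytes + postings_ptr_bytes
                + term_ptr_bytes / k      # term pointer shared by the whole block
                + length_bytes)           # length byte prefixed to every term
    return terms * per_term / 1e6

print(dictionary_size_mb(4))              # 7.1 MB, the figure quoted for k = 4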
Exercise
Estimate the space usage (and savings compared to 7.6 MB) with blocking,
for block sizes of k = 4, 8 and 16.
Dictionary search without blocking
Assuming each dictionary term is equally likely in a query (not really so in practice!), the average number of comparisons is (1 + 2·2 + 4·3 + 4)/8 ≈ 2.6.
Exercise: if the frequencies of query terms were non-uniform but known, how would you structure the dictionary search tree?
Dictionary search with blocking
Binary search down to 4-term block;
Then linear search through terms in block.
Blocks of 4 (binary tree), avg. = (1+2∙2+2∙3+2∙4+5)/8 = 3 compares
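Both averages follow from summing the comparisons needed to reach each of the 8 (equally likely) terms; the sketch below simply evaluates the two expressions from these slides.

# Binary search over 8 terms, no blocking: comparison counts summed as on the slide
print((1 + 2*2 + 4*3 + 4) / 8)            # 2.625 ≈ 2.6 comparisons on average

# Blocks of 4: binary search to the block, then linear scan inside it
print((1 + 2*2 + 2*3 + 2*4 + 5) / 8)      # 3.0 comparisons on average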
Exercise
Estimate the impact on search performance (and slowdown compared to
k=1) with blocking, for block sizes of k = 4, 8 and 16.
Front coding
Front-coding:
Sorted words commonly have long common prefix – store differences only
(for last k-1 in a block of k)
8automata8automate9automatic10automation
→ 8automat*a1⋄e2⋄ic3⋄ion
('8automat*' encodes the shared prefix automat; each number before a '⋄' gives the extra length beyond automat.)
Begins to resemble general string compression.
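A minimal Python sketch of front coding for one block (the encoding format mirrors the example above; '|' stands in for the diamond separator, and the helper names are made up for illustration).

def common_prefix(block):
    """Longest prefix shared by every term in a sorted block."""
    first, last = block[0], block[-1]      # in sorted order these bound the prefix
    i = 0
    while i < len(first) and i < len(last) and first[i] == last[i]:
        i += 1
    return first[:i]

def front_code(block):
    """Write the first term in full (length, shared prefix, '*', its own suffix),
    then only length + '|' + suffix for each remaining term."""
    p = common_prefix(block)
    out = [f"{len(block[0])}{p}*{block[0][len(p):]}"]
    for t in block[1:]:
        suffix = t[len(p):]
        out.append(f"{len(suffix)}|{suffix}")
    return "".join(out)

print(front_code(["automata", "automate", "automatic", "automation"]))
# 8automat*a1|e2|ic3|ion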
RCV1 dictionary compression summary
Technique Size in MB
Fixed width 11.2
Dictionary-as-String with pointers to every term 7.6
Also, blocking k = 4 7.1
Also, Blocking + front coding 5.9