Inverted Index in Information Retrieval

The document discusses the concept of an Inverted Index, a crucial structure in information retrieval that maps terms to the documents in which they appear. It details the construction process, including tokenization, normalization, and indexing, as well as the efficiency of query processing using Boolean retrieval methods. Additionally, it emphasizes the importance of query optimization to minimize processing time when handling multiple terms.

Uploaded by

golanihimanshu2

Inverted Index

Dr. Subrat Kumar Nayak

Associate Professor
Dept. of CSE, ITER, SOADU
Inverted Index
• This is the first major concept of information retrieval.
• The name is actually redundant: an index always maps back from terms to the parts of a document where they occur.
• An inverted index is sometimes called an inverted file.
• An inverted index keeps a dictionary of terms. Then, for each term, we have a list that records which documents the term occurs in.
• Each item in the list, which records that a term appeared in a document, is conventionally called a posting.
• The list is then called a postings list (or inverted list), and all the postings lists taken together are referred to as the postings.
• The term "dictionary" is used for the data structure, and "vocabulary" for the set of terms.
Inverted Index
• For each term t, we must store a list of all documents that contain t.
  Identify each document by a docID, a document serial number.
• Can we use fixed-size arrays for this?

Brutus    → 1 2 4 11 31 45 173 174
Caesar    → 1 2 4 5 6 16 57 132
Calpurnia → 2 31 54 101

What happens if the word Caesar is added to document 14?
Inverted Index
• We need variable-size postings lists.
  On disk, a continuous run of postings is normal and best.
  In memory, we can use linked lists or variable-length arrays.
  There are some tradeoffs in size and ease of insertion.

Dictionary  Postings
Brutus    → 1 2 4 11 31 45 173 174
Caesar    → 1 2 4 5 6 16 57 132
Calpurnia → 2 31 54 101

Each docID entry in a list is a posting. Postings are sorted by docID (more later on why).
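The variable-size postings lists described above can be sketched in memory as follows. This is only an illustration under stated assumptions: a Python dict stands in for the dictionary, Python lists stand in for variable-length arrays, and `add_posting` is a hypothetical helper name that does not come from the slides.

```python
from bisect import bisect_left

# Dictionary: each term maps to its postings list (docIDs in ascending order).
index = {
    "Brutus": [1, 2, 4, 11, 31, 45, 173, 174],
    "Caesar": [1, 2, 4, 5, 6, 16, 57, 132],
    "Calpurnia": [2, 31, 54, 101],
}


def add_posting(index, term, doc_id):
    """Insert doc_id into the postings list of term, keeping it sorted.

    Hypothetical helper: a new term gets a fresh list, and a docID that is
    already present is not added a second time.
    """
    postings = index.setdefault(term, [])
    pos = bisect_left(postings, doc_id)  # binary search for the insert position
    if pos == len(postings) or postings[pos] != doc_id:
        postings.insert(pos, doc_id)


# The slide's question: the word Caesar is added to document 14.
add_posting(index, "Caesar", 14)
print(index["Caesar"])  # [1, 2, 4, 5, 6, 14, 16, 57, 132]
```

A growable list absorbs the new posting without reserving space in advance, which is what a fixed-size array cannot do. The price is that inserting in the middle shifts the later entries.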
Inverted Index (Construction)
Documents to be indexed:  Friends, Romans, countrymen.
        ↓ Tokenizer
Token stream:             Friends  Romans  Countrymen
        ↓ Linguistic modules
Modified tokens:          friend  roman  countryman
        ↓ Indexer
Inverted index:           friend     → 2 4
                          roman      → 1 2
                          countryman → 13 16
Initial stages of text processing
• Tokenization
  Cut the character sequence into word tokens.
  Deal with cases such as “John’s” and “a state-of-the-art solution”.
• Normalization
  Map text and query terms to the same form.
  You want U.S.A. and USA to match.
• Stemming
  We may wish different forms of a root to match.
  Example: authorize, authorization.
• Stop words
  We may omit very common words (or not).
  Examples: the, a, to, of.
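A rough sketch of three of these stages (tokenization, normalization, stop-word removal) is given below. The regular expression, the rule of dropping periods, and the function names are assumptions chosen for illustration. Stemming is left out because a realistic stemmer is beyond a short example.

```python
import re

# Stop-word list taken from the slide's examples; real lists are longer.
STOP_WORDS = {"the", "a", "to", "of"}

# A token is a run of letters/digits, optionally continued by an apostrophe,
# period or hyphen followed by more letters/digits (John's, U.S.A, state-of-the-art).
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+(?:['’.\-][A-Za-z0-9]+)*")


def tokenize(text):
    """Cut a character sequence into word tokens."""
    return TOKEN_PATTERN.findall(text)


def normalize(token):
    """Map a token to one form: lower-case it and drop periods (U.S.A -> usa)."""
    return token.lower().replace(".", "")


def preprocess(text, drop_stop_words=True):
    """Tokenize, normalize and optionally remove stop words."""
    terms = [normalize(token) for token in tokenize(text)]
    if drop_stop_words:
        terms = [term for term in terms if term not in STOP_WORDS]
    return terms


print(tokenize("a state-of-the-art solution"))  # ['a', 'state-of-the-art', 'solution']
print(preprocess("U.S.A."), preprocess("USA"))  # ['usa'] ['usa']
```

The same `preprocess` function would have to be applied to both documents and queries, otherwise the two sides would not map to the same terms.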
Indexer steps: Token sequence

• Sequence of (modified token, document ID) pairs.

Doc 1: I did enact Julius Caesar I was killed i’ the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious
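As a sketch of this step, the two example documents can be turned into a sequence of (term, docID) pairs. The tokenization here is deliberately crude (lower-case, letters only) and is an assumption made for brevity. `term_doc_pairs` is a hypothetical name.

```python
import re

docs = {
    1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious",
}


def term_doc_pairs(docs):
    """Return (term, docID) pairs in the order the terms appear in each document."""
    pairs = []
    for doc_id, text in docs.items():
        # Crude tokenization: lower-case the text and keep runs of letters.
        for term in re.findall(r"[a-z]+", text.lower()):
            pairs.append((term, doc_id))
    return pairs


pairs = term_doc_pairs(docs)
print(pairs[:5])
# [('i', 1), ('did', 1), ('enact', 1), ('julius', 1), ('caesar', 1)]
```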
Indexer steps: Sort

• Sort by terms (at least conceptually), and then by docID.
• This is the core indexing step.

Indexer steps: Dictionary & Postings

• Multiple term entries in a single document are merged.
• Split into Dictionary and Postings.
• Document frequency information is added.

Why frequency? Will discuss later.
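The sort, merge and split steps can be sketched as below. The input pairs are a small hypothetical sample, and storing each dictionary entry as a `(document frequency, postings)` tuple is an assumption about layout, not something the slides prescribe.

```python
from itertools import groupby

# Hypothetical (term, docID) pairs in order of appearance.
pairs = [
    ("caesar", 1), ("brutus", 1), ("killed", 1), ("killed", 1),
    ("caesar", 2), ("brutus", 2), ("caesar", 2),
]


def build_index(pairs):
    """Build {term: (document frequency, postings list)} from (term, docID) pairs."""
    # set() merges repeated entries of a term within one document;
    # sorted() orders the pairs by term, then by docID (tuple ordering).
    sorted_pairs = sorted(set(pairs))
    index = {}
    for term, group in groupby(sorted_pairs, key=lambda pair: pair[0]):
        postings = [doc_id for _, doc_id in group]
        # Document frequency = number of documents containing the term.
        index[term] = (len(postings), postings)
    return index


index = build_index(pairs)
print(index["caesar"])  # (2, [1, 2])
print(index["killed"])  # (1, [1])
```

Because the pairs are sorted before grouping, every postings list comes out ordered by docID with no extra work.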
Where do we pay in storage?
• Terms and counts
• Pointers
• Lists of docIDs

IR system implementation
• How do we index efficiently?
• How much storage do we need?
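Inverted Index

• For Boolean retrieval, an inverted index works much better than scanning every document, because only the postings lists of the query terms have to be examined.
• Sort-based index construction keeps every postings list ordered by docID, so the least work needs to be done when lists are merged at query time.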
Query processing with an inverted index
• How do we process a query? (Our focus.)
• Later: what kinds of queries can we process?

Query processing: AND
• Consider processing the query: Brutus AND Caesar
  • Locate Brutus in the Dictionary; retrieve its postings.
  • Locate Caesar in the Dictionary; retrieve its postings.
  • “Merge” the two postings (intersect the document sets):

Brutus → 2 4 8 16 32 64 128
Caesar → 1 2 3 5 8 13 21 34
The merge
• Walk through the two postings simultaneously, in time linear in the total number of postings entries.

Brutus → 2 4 8 16 32 64 128
Caesar → 1 2 3 5 8 13 21 34
Result → 2 8

If the list lengths are x and y, the merge takes O(x+y) operations.
Crucial: postings sorted by docID.

Intersecting two postings lists (a “merge” algorithm)
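The merge can be sketched as a two-pointer walk. The function below is a minimal illustration that assumes both postings lists are Python lists sorted by docID. The name `intersect` is chosen here for convenience.

```python
def intersect(p1, p2):
    """Intersect two postings lists that are sorted by docID.

    Each comparison advances at least one pointer, so the loop runs
    at most len(p1) + len(p2) times: O(x + y).
    """
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])  # docID occurs in both lists
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # p1[i] is too small to match anything left in p2
        else:
            j += 1  # p2[j] is too small to match anything left in p1
    return answer


brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))  # [2, 8]
```

If the lists were not sorted, advancing the pointer at the smaller docID would no longer be safe, and every posting of one list would have to be compared against the whole other list.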
Boolean queries: Exact match
• The Boolean retrieval model lets us ask a query that is a Boolean expression:
  • Boolean queries use AND, OR and NOT to join query terms.
  • It views each document as a set of words.
  • It is precise: a document either matches the condition or does not.
• Perhaps the simplest model to build an IR system on.
• Primary commercial retrieval tool for 3 decades.
• Many search systems you still use are Boolean:
  • Email, library catalog, macOS Spotlight
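To make the exact-match behaviour concrete, the sketch below evaluates AND, OR and AND NOT with Python sets of docIDs. Using sets instead of sorted postings lists is a simplification made here for brevity. The postings are the example lists from the earlier slides.

```python
# Postings stored as sets of docIDs (a simplification for this sketch).
postings = {
    "Brutus": {1, 2, 4, 11, 31, 45, 173, 174},
    "Caesar": {1, 2, 4, 5, 6, 16, 57, 132},
    "Calpurnia": {2, 31, 54, 101},
}

# Brutus AND Caesar: documents that contain both terms.
both = postings["Brutus"] & postings["Caesar"]

# Brutus OR Calpurnia: documents that contain at least one of the terms.
either = postings["Brutus"] | postings["Calpurnia"]

# Brutus AND Caesar AND NOT Calpurnia: remove documents containing Calpurnia.
without = both - postings["Calpurnia"]

print(sorted(both))     # [1, 2, 4]
print(sorted(either))   # [1, 2, 4, 11, 31, 45, 54, 101, 173, 174]
print(sorted(without))  # [1, 4]
```

Every document is either in the result or not. There is no ranking and no partial match.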
Query optimization
• Query optimization is the process of selecting how to organize the work of answering a query so that the least total amount of work needs to be done by the system.
• What is the best order for query processing?
• Consider a query that is an AND of n terms.
• For each of the n terms, get its postings, then AND them together.

Brutus    → 2 4 8 16 32 64 128
Caesar    → 1 2 3 5 8 16 21 34
Calpurnia → 13 16

Query: Brutus AND Calpurnia AND Caesar

Query optimization example
• Process in order of increasing frequency:
  • Start with the smallest set, then keep cutting further.
  • This is why we kept the document frequency in the dictionary.

Brutus    → 2 4 8 16 32 64 128
Caesar    → 1 2 3 5 8 16 21 34
Calpurnia → 13 16

Execute the query as (Calpurnia AND Brutus) AND Caesar.

More general optimization
• e.g., (madding OR crowd) AND (ignoble OR strife)
• Get the document frequencies for all terms.
• Estimate the size of each OR by the sum of its document frequencies (conservative).
• Process in increasing order of OR sizes.
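The ordering heuristic can be sketched as follows. The document frequencies are invented purely for illustration, and `order_or_groups` is a hypothetical helper name.

```python
# Hypothetical document frequencies, invented for this example.
doc_freq = {"madding": 10, "crowd": 500, "ignoble": 5, "strife": 40}


def estimated_size(or_group, doc_freq):
    """Upper bound on the size of an OR: the sum of its terms' document frequencies."""
    return sum(doc_freq[term] for term in or_group)


def order_or_groups(or_groups, doc_freq):
    """Return the OR groups sorted by increasing estimated size."""
    return sorted(or_groups, key=lambda group: estimated_size(group, doc_freq))


# (madding OR crowd) AND (ignoble OR strife)
query = [("madding", "crowd"), ("ignoble", "strife")]
print(order_or_groups(query, doc_freq))
# [('ignoble', 'strife'), ('madding', 'crowd')]
```

The sum is conservative because documents that contain both terms of an OR are counted twice, so the true result can only be smaller than the estimate.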
Algorithm to Intersect n terms.
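A minimal sketch of intersecting n terms is shown below, assuming the index is a dict from term to a postings list sorted by docID. The list length plays the role of the document frequency. The function names are chosen for this example only.

```python
def intersect(p1, p2):
    """Two-pointer intersection of two postings lists sorted by docID."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer


def intersect_terms(terms, index):
    """AND together the postings of all terms, rarest term first."""
    if not terms:
        return []
    # Process in order of increasing document frequency (postings list length).
    ordered = sorted(terms, key=lambda term: len(index.get(term, [])))
    result = index.get(ordered[0], [])
    for term in ordered[1:]:
        if not result:
            break  # an empty intermediate result cannot grow again
        result = intersect(result, index.get(term, []))
    return result


index = {
    "Brutus": [2, 4, 8, 16, 32, 64, 128],
    "Caesar": [1, 2, 3, 5, 8, 16, 21, 34],
    "Calpurnia": [13, 16],
}
print(intersect_terms(["Brutus", "Calpurnia", "Caesar"], index))  # [16]
```

Starting with Calpurnia means the intermediate result never holds more than two docIDs, so each later merge has very little to carry forward.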
