Unit-1 Complete Notes

The document outlines the course objectives and content for a lecture on Information Retrieval Systems, focusing on fundamental concepts, data structures, algorithms, and indexing techniques. It covers various types of data, the information retrieval process, and the architecture of information retrieval systems, including their applications and evaluation metrics. Additionally, it distinguishes between information retrieval and data retrieval, emphasizing the importance of relevance and user needs in retrieving information from large unstructured datasets.


DEPARTMENT OF CSE (ARTIFICIAL INTELLIGENCE & MACHINE LEARNING)

R23 Regulation, V Semester, A.Y.: 2025-26

Lecture Notes
Information Retrieval System (23ML5T01)

Prepared By
Mr. P V Kishore Kumar M.Tech., (Ph.D.)
Associate Professor

Department of CSE (AI & ML)


Ramachandra College of Engineering (A), Eluru.

pg. 1 - IRS - 23MLT01 – P V K K – Dept. of CSE (AIML) 8008066724,


Course Objectives:
1 To introduce the fundamental concepts and domain of Information Retrieval (IR) systems
and how they differ from other information systems.
2 To explain the data structures and algorithms fundamental to building and operating
efficient IR systems.
3 To analyse the working of indexing techniques such as inverted files and signature files
and their enhancements.
4 To explore advanced indexing structures like PAT Trees and the role of lexical analysis,
stop lists, and stemming in IR systems.
5 To study various string searching algorithms and their applicability to IR systems for
efficient retrieval.

Unit – I Introduction to Information Storage and Retrieval Systems: Domain Analysis of IR
systems, IR and other types of Information Systems, IR System Evaluation. Introduction to
Data Structures and Algorithms related to Information Retrieval: Basic Concepts, Data
Structures, Algorithms.
Unit – II Inverted Files and Signature Files Introduction, Structures used in Inverted Files,
building an Inverted file using a sorted array, Modifications to the Basic Techniques.
Signature Files: Concepts of Signature files, Compression, Vertical Partitioning, Horizontal
Partitioning.
Unit – III New Indices for Text, Lexical Analysis and Stoplists: PAT Trees and PAT Arrays:
Introduction, PAT Tree structure, Algorithms on PAT Trees, Building PAT Trees as
PATRICIA Trees, PAT representation as Arrays. Stoplists.
Unit – IV Stemming Algorithms and Thesaurus Construction Types of Stemming
algorithms, Experimental Evaluations of Stemming, stemming to Compress Inverted Files.
Thesaurus Construction: Features of Thesauri, Thesaurus Construction, Thesaurus
construction from Texts, Merging existing Thesauri.
Unit – V String Searching Algorithms: Introduction, Preliminaries, The Naive Algorithm, The
Knuth-Morris-Pratt Algorithm, The Boyer-Moore Algorithm, The Shift-Or Algorithm, The
Karp-Rabin Algorithm.

Text Books:
1 Modern Information Retrieval, Ricardo Baeza-Yates and Berthier Ribeiro-Neto, PEA, 2007.
2 Information Storage and Retrieval Systems: Theory and Implementation, Gerald Kowalski
and Mark Maybury, Academic Press, 2000.



INTRODUCTION TO INFORMATION STORAGE AND RETRIEVAL
1.1. Data, information, Knowledge and Wisdom
Data: raw, unprocessed facts. Data simply exists and has no meaning by itself; it can exist in
any form, usable or not, and can be a symbol.
Example: 2009 could be a year, an amount, a count, etc. "It is raining" is data because it
carries no further indication of the conditions, such as the temperature.
Data can be defined as a representation of facts, concepts, or instructions in a formalized
manner, suitable for communication, interpretation, or processing by humans or electronic
machines.
1.1.1. Types of data
Data can be categorized by its content into three types: Structured, Semi-structured, and
Unstructured data.
Structured data
Structured data is data that adheres to a pre-defined data model and is therefore
straightforward to analyze. Structured data conforms to a tabular format with a relationship
between the different rows and columns.
Common examples of structured data are Excel files or SQL databases. Each of these has
structured rows and columns that can be sorted.
Unstructured data
Unstructured data is information that either does not have a predefined data model or is not
organized in a pre-defined manner. Unstructured information is typically text-heavy but may
contain data such as dates, numbers, and facts as well.
Common examples of unstructured data include audio, video files or NoSQL databases.
Semi-structured data
Semi-structured data is a form of structured data that does not conform with the formal
structure of data models associated with relational databases or other forms of data tables, but
nonetheless, contains tags or other markers to separate semantic elements and enforce
hierarchies of records and fields within the data. Therefore, it is also known as a
self-describing structure. JSON and XML are common examples of semi-structured data.
Information: data that has been processed and has meaning of its own; the meaning is often
useful, but it does not have to be. Information embodies the understanding of a relationship of
some sort, such as cause and effect.



An information or records management system -- most often electronic -- designed to
capture, process, store and retrieve information is the glue that holds a business together.
Knowledge: the appropriate collection of information, such that its intent is to be useful.
Knowledge is a deterministic process: when someone memorizes information (as less-aspiring,
test-bound students often do), they have amassed knowledge.
Wisdom: an extrapolative, non-deterministic, non-probabilistic process that goes far beyond
understanding itself. Wisdom is therefore the process by which we judge what is right and
what is wrong, or what is good and what is bad.
Storage: the action or method of storing something; the place where data is held in
electromagnetic or optical form for access by a computer processor.
Retrieval: the process of getting something back from somewhere; the action of obtaining
or consulting material stored in a computer system. Example: find 'BRUTUS AND
CAESAR AND NOT CALPURNIA' in the big book of Shakespeare.
Information storage: The computers can store different types of information in different
ways, depending on what the information is, how much storage it requires and how quickly it
needs to be accessed.
Information Retrieval (IR) is finding material (usually documents) of an unstructured nature
(usually text) that satisfies an information need from within large collections (usually stored
on computers). Information retrieval technology has been central to the success of the Web.
Information Retrieval is the process of obtaining relevant information from a collection of
informational resources.
1.2. What is Information Storage and Retrieval (ISR)?
Information storage and retrieval is the process of searching for relevant documents, from a
large unstructured corpus, that satisfy the user's information need. Information Storage and
Retrieval (IS&R) deals with the representation, storage, organization of, and access to
information items so as to satisfy users' information needs.
1.3. Information Retrieval System (IRS)
An Information Retrieval System (IRS) is self-explanatory from the terminological point of
view: it is a 'system which retrieves information'. An IRS is concerned with two basic
aspects: (i) how to store information, and (ii) how to retrieve information.
Modern information retrieval systems deal not only with textual information but also with
multimedia information comprising text, audio, images and video. Thus, modern information
retrieval systems deal with storage, organization and access to text, as well as multimedia
information resources. Thus, an IR system is a set of rules and procedures, for performing
some or all of the following operations:
a) Indexing (or constructing of representations of documents)



b) Search formulation (or constructing of representations of information needs)
c) Searching (or matching representations of documents against representations of needs) and
d) Index language construction (or generation of rules of representation)
So, information retrieval is collectively defined as a “science of search” or a process, method
and procedure used to select or recall, recorded and/or indexed information from files of data.
1.4. Data versus information retrieval
Data Retrieval systems directly retrieve data from database management systems by
identifying keywords in the queries provided by users and matching them with the documents
in the database.
Data retrieval: the task of determining which documents of a collection contain the
keywords in the user query. Information retrieval, by contrast, aims to return documents
ranked by relevance to the user's information need rather than exact keyword matches.
The main reason for this difference is that information retrieval usually deals with natural
language text, which is not always well structured and can be semantically ambiguous.

1.5. Information Retrieval serve as Bridge



An Information Retrieval System aims at collecting and organizing all the documents
available in one or more subject areas in order to provide them to users as soon as required.
It thus serves as a bridge between the world of authors and the world of readers/users;
that is, writers present a set of ideas in a document using a set of concepts.

1.6. Information Retrieval System Architecture


Before an information retrieval system can actually operate to retrieve some information, the
information must already have been stored inside the system; this is true both for manual and
computerized systems.
The retrieval system is not likely to have stored the complete text of each document in the
natural language in which it was written. Instead, it will have a document representative,
which may have been produced from the documents either manually or automatically.



Information Retrieval (IR) means finding a set of documents that is relevant to the query. The
ranking of a set of documents is usually performed according to their relevance scores to the
query. The user with information need issues a query to the retrieval system through the
query operational module.
We use a Search Engines to retrieve information from different web sites. A search engine is a
tool designed to search for information. When this occurs on the World Wide Web it is then
called a web search engine. The search results are usually presented in a list and are
commonly called hits. The information may consist of web pages, images, information, and
other types of files. Some search engines also mine data available in databases or open
directories. Unlike Web directories, which are maintained by human editors, search engines
operate algorithmically or are a mixture of algorithmic and human input. Common search
engines include Google, Yahoo, AOL Search, Ask.com, Bing and LookSmart.
1.7. Information retrieval System vs. Web Search System

Aspect: Input Source
  Classic IR System: A specific, fixed document collection (e.g., library database, research articles).
  Web Search System: Publicly accessible web pages (both static and dynamic content).

Aspect: Goal
  Classic IR System: Retrieve documents/text relevant to the user's information need.
  Web Search System: Retrieve high-quality, relevant web pages matching the user's query.

Aspect: Types of Content
  Classic IR System: Primarily text-based documents.
  Web Search System: Text, images, audio, video, and dynamic database-generated pages.

Aspect: Main Tasks
  Classic IR System: 1. Processing the document collection. 2. Processing queries (searching).
  Web Search System: 1. Processing & representing the document collection (gathering static pages or understanding dynamic pages). 2. Processing queries.

Aspect: Retrieval Models Used
  Classic IR System: Boolean model, vector space model, probabilistic models.
  Web Search System: Traditional IR models plus link analysis (e.g., PageRank), machine learning, personalization, etc.

Aspect: User Features
  Classic IR System: Limited personalization; results are generic.
  Web Search System: Highly personalized results, query refinement, interactive search experience.

Aspect: Collection Features
  Classic IR System: No hyperlinks; documents are standalone.
  Web Search System: Hyperlinks connect related pages, enabling navigation and ranking based on link structure.

Aspect: Statistics Gathering
  Classic IR System: More difficult to collect on large datasets.
  Web Search System: Easy to gather statistics from large-scale user interactions.

Aspect: Interactivity
  Classic IR System: Basic search-result display.
  Web Search System: High interactivity: query suggestions, expansions, and refinements.

Aspect: Data Nature
  Classic IR System: Mostly static.
  Web Search System: Mix of static and dynamic (database-driven) content.

1.8. Information retrieval Process


Information retrieval is the process of searching for relevant documents from a large
unstructured corpus that satisfy the user's information need. IR is simply about finding
relevant information, focusing on providing the user with easy access to information of their interest.



Information retrieval is the process of matching the query against the indexed information
objects.
An index is an optimized data structure that is built on top of the information objects,
allowing faster access for the search process.
The indexer:
• tokenizes the text (tokenization)
• removes words with little semantic value (stop words)
• unifies word families (stemming)
The same is done for the query as well.
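The indexer steps above can be sketched in a few lines of Python; the stop list and the crude longest-suffix stripper below are toy stand-ins (a real system would use a full stemmer such as Porter's), so all word lists here are illustrative:

```python
# A minimal sketch of the indexer pipeline: tokenization, stop-word
# removal, and stemming. Word lists are illustrative only.
import re

STOP_WORDS = {"a", "an", "the", "of", "in", "is", "and", "to"}
SUFFIXES = ("ing", "ed", "es", "s")  # toy suffix list, not a real stemmer

def stem(word):
    """Strip the longest matching suffix (crude stand-in for Porter stemming)."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    tokens = re.findall(r"[a-z0-9]+", text.lower())      # tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]  # stop-word removal
    return [stem(t) for t in tokens]                     # stemming

print(preprocess("Shipment of gold damaged in a fire"))
# → ['shipment', 'gold', 'damag', 'fire']
```

The same `preprocess` function would be applied to queries, so that query terms and index terms land in the same normalized form.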
1.9. Issues that arise in Information Retrieval
The main issues in Information Retrieval (IR) are:
 Document and Query Indexing
 Query Evaluation, and System Evaluation.
 How can interactive query formulation and refinement be supported?
 Comparing representations
 What is a “good” similarity measure & retrieval model?
 How is uncertainty represented?
 Text representation
 What makes a “good” representation?
 How is a representation generated from the text?



 What are the retrievable objects and how are they organized?
 Information needs representation
 What is an appropriate query language?
 Evaluating the effectiveness of retrieval
 What are good metrics?
1.10. Information retrieval some Applications areas
 Graphical interfaces to support information search
 Information Retrieval & Extraction
 XML retrieval
 Geographic Information Retrieval
 Multimedia information retrieval
 Cross-Language & Multilingual Information Retrieval
 Agent-based (like information filtering, tracking, routing)
 Adversarial Information Retrieval
 Question answering
 Document Summarization
 Text classification
 Multi-database searching
 Document provenance
 Recommender systems
 Information Retrieval & Machine Learning
 Text Mining & Web Mining
 N-Grams in Information Retrieval
IR System Evaluation
Documents are the primary objects in IR systems and there are many operations for them.
In many types of IR systems, documents added to a database must be given unique
identifiers, parsed into their constituent fields, and those fields broken into field
identifiers and terms. Once in the database, one sometimes wishes to mask off certain
fields for searching and display. For example, the searcher may wish to search only the
title and abstract fields of documents for a given query, or may wish to see only the title
and author of retrieved documents.
An Information Retrieval System is a system capable of storing, maintaining, and retrieving
information. This information may be in any form: audio, video, or text.
An Information Retrieval System mainly focuses on electronic searching and retrieval of
documents.
Information Retrieval is the activity of obtaining relevant documents, based on user needs,
from a collection of documents.
Objectives of Information Retrieval Systems



Help users find needed information quickly, reducing the extra work (overhead).
Overhead includes:
 Writing the query
 Running the query
 Looking through results
 Reading irrelevant documents
Relevance
 Relevant items = contain the information the user needs.
 Non-relevant items = do not help answer the user’s question.
 For the user, “relevant” = “needed.”
Precision = How many retrieved items are actually relevant.
 Drops if too many non-relevant items are retrieved.
Recall = How many of the relevant items were found.
 Not affected by non-relevant items.
 Once all relevant items are found, recall = 100%.
Example Search Process
 The database can be divided into 4 categories:
1. Relevant & retrieved
2. Relevant & not retrieved
3. Non-relevant & retrieved
4. Non-relevant & not retrieved
Functional Overview



A total Information Storage and Retrieval System is composed of four major functional
processes:
1) Item Normalization
2) Selective Dissemination of Information (i.e., "Mail")
3) Archival Document Database Search
4) Index Database Search, along with the Automatic File Build process that supports Index
Files.

1) Item Normalization
Purpose: Standardize all incoming items so they can be searched easily.
Steps:
o Convert data to a standard format (e.g., text to Unicode, videos to MPEG,
audio to WAV, images to JPEG).
o Break content into tokens (words), identify them, and remove endings
(stemming).
o Remove unnecessary words/numbers (Stop List).

o Organize content into zones (Title, Author, Abstract, etc.) so searches can
focus on specific sections.
For multimedia: Different formats need standardization (video, audio, images).
2) Selective Dissemination of Information (SDI / “Mail”)



Purpose: Automatically send new items to users who are interested in them.
How it works:
o Each user has a profile (search statement).

o New items are compared to profiles.

o If matched, the item is sent to the user’s “mail file.”

Current limitation: Not yet widely used for multimedia sources.


3) Document Database Search
Purpose: Let users search the entire database for stored items.
How it works:
o Users send ad-hoc queries.

o System searches all stored items (which usually don’t change after being
saved).
4) Index Database Search
Purpose: Allow saving and retrieving specific items for future use.
Types of Index Files:
1. Public Indexes – created by professionals, cover the whole database, many
users can access.
2. Private Indexes – created by individual users, contain only selected items,
limited access.
Automatic File Build (AFB): Helps generate indexes automatically (also called
Information Extraction).
Multimedia in IR Systems
Multimedia (video, audio, images) is stored alongside text in the same system.
Relation to Database Management Systems (DBMS)
Many modern DBMS integrate IR features:
o INQUIRE DBMS – early integration example.

o Oracle (CONVECTIS) – uses a thesaurus to create “themes” for items.

o Informix + RetrievalWare – links structured database data with IR functions.

Precision
Meaning: How accurate the retrieved results are.



Formula:

Precision = (Relevant documents retrieved) / (Total documents retrieved)

Example:
o System retrieves 10 documents.

o 9 are relevant → Precision = 9 ÷ 10 = 0.9 (90%).

Key idea: High precision means fewer irrelevant documents, but it doesn’t tell us if we found
all the relevant ones.
Recall
Meaning: How complete the search is in finding all relevant documents.
Formula:

Recall = (Relevant documents retrieved) / (Total relevant documents in the collection)

Example:

o There are 100 relevant documents in the database.

o System retrieves 9 of them → Recall = 9 ÷ 100 = 0.09 (9%).

Key idea: High recall means the system found most of the relevant documents, but it might
also include irrelevant ones.
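Both measures follow directly from the formulas above; the sketch below reuses the numbers from the two examples (10 documents retrieved, 9 of them relevant; 100 relevant documents in the collection), with hypothetical document identifiers:

```python
# Precision and recall computed over sets of document identifiers.
def precision(retrieved, relevant):
    retrieved, relevant = set(retrieved), set(relevant)
    return len(retrieved & relevant) / len(retrieved)

def recall(retrieved, relevant):
    retrieved, relevant = set(retrieved), set(relevant)
    return len(retrieved & relevant) / len(relevant)

relevant = {f"d{i}" for i in range(100)}          # 100 relevant docs in the collection
retrieved = [f"d{i}" for i in range(9)] + ["x1"]  # 10 retrieved, 9 of them relevant

print(precision(retrieved, relevant))  # 0.9  (90%)
print(recall(retrieved, relevant))     # 0.09 (9%)
```

Note how the same retrieval result scores high on precision but low on recall, which is exactly the trade-off the two examples illustrate.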

RETRIEVAL STRATEGIES
 A retrieval strategy is a method (algorithm) that compares a query (Q) with
documents (D1, D2, …, Dn) and calculates a similarity score (SC) for each document.
 The higher the score, the more relevant the document is considered.
 The idea: if more query terms appear in a document, the document is probably more
relevant.



 Problem: language is tricky — the same idea can be written in different words
(synonyms), which can confuse the system.
Vector Space Model (VSM)
 Both queries and documents are turned into vectors (lists of numbers) in a "term
space".
 The similarity between them is calculated, often using the dot product formula.
 Words in documents and queries get weights based on:
o TF (Term Frequency): How many times a term appears in a document.

o IDF (Inverse Document Frequency): How rare a term is across all


documents (rare terms are more important).

 Example:
Q: "gold silver truck"
D1: "Shipment of gold damaged in a fire"
D2: "Delivery of silver arrived in a silver truck"
D3: "Shipment of gold arrived in a truck"
 After calculating TF-IDF and similarity scores, documents might be ranked like:
D2 > D3 > D1
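Under one common set of assumptions (raw tf × idf weights, base-10 idf, a plain dot product with no length normalization), the example ranking can be reproduced in a few lines; the exact weighting scheme is an assumption, since the notes do not fix one:

```python
# Vector space ranking for the example query and documents,
# using raw tf x idf weights and a dot-product similarity.
import math
from collections import Counter

docs = {
    "D1": "shipment of gold damaged in a fire",
    "D2": "delivery of silver arrived in a silver truck",
    "D3": "shipment of gold arrived in a truck",
}
query = "gold silver truck"

N = len(docs)
tfs = {d: Counter(text.split()) for d, text in docs.items()}
df = Counter()
for tf in tfs.values():
    df.update(tf.keys())          # document frequency of each term

def idf(term):
    return math.log10(N / df[term]) if df[term] else 0.0

def score(d):
    # dot product of the query and document tf-idf vectors
    return sum(tfs[d][t] * idf(t) * idf(t) for t in query.split())

ranking = sorted(docs, key=score, reverse=True)
print(ranking)  # ['D2', 'D3', 'D1']
```

"silver" is rare (it appears only in D2) and appears there twice, so its high idf and tf push D2 to the top, matching the ranking D2 > D3 > D1 above.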

Inverted Index
 Instead of checking every document each time, an inverted index is made
beforehand.
 For each term, it stores:
o Which documents contain the term.

o How many times the term appears in each document (TF).

 This makes searching much faster.
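A minimal sketch of such an index, mapping each term to the documents that contain it and the term frequency in each (document texts reused from the earlier example):

```python
# A toy inverted index: term -> {doc_id: term frequency}.
from collections import defaultdict, Counter

def build_inverted_index(docs):
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(text.split()).items():
            index[term][doc_id] = tf
    return index

docs = {
    "D2": "delivery of silver arrived in a silver truck",
    "D3": "shipment of gold arrived in a truck",
}
index = build_inverted_index(docs)
print(index["silver"])  # {'D2': 2}
print(index["truck"])   # {'D2': 1, 'D3': 1}
```

At query time the system only touches the postings for the query terms, instead of scanning every document.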


Term Weighting Improvements
 Basic TF-IDF can give too much importance to a single frequent term.
 To fix this, use log scaling:
Weight = (1 + log(tf)) × idf
Queries and documents can be weighted differently.
 Example weighting code: lnc.ltc
o l = log scaling

o t = use IDF

o c = cosine normalization (adjust for document length)
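A small illustration of how log scaling damps a frequent term compared with raw tf × idf; the idf value of 2.0 is arbitrary, and the natural logarithm is an assumption (the log base used in the "l" scheme varies between implementations):

```python
# Raw tf x idf grows linearly with tf; the log-scaled weight grows
# much more slowly, so one very frequent term cannot dominate.
import math

def raw_weight(tf, idf):
    return tf * idf

def log_weight(tf, idf):
    return (1 + math.log(tf)) * idf if tf > 0 else 0.0

idf = 2.0
for tf in (1, 10, 100):
    print(tf, raw_weight(tf, idf), round(log_weight(tf, idf), 2))
```

Going from tf = 1 to tf = 100 multiplies the raw weight by 100 but roughly quintuples the log-scaled weight, which is the damping effect the scheme is after.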



Probabilistic Retrieval Strategies: The probabilistic model computes the similarity
coefficient (SC) between a query and a document as the probability that the document will be
relevant to the query. This reduces the relevance ranking problem to an application of
probability theory. Probability theory can be used to compute a measure of relevance between
a query and a document.
1. Simple Term Weights.
2. Non binary independent model.
3. Language model.
Simple Term Weights (Simplified)
We use term weights to rank documents based on how likely they are to be relevant to a
query.
This comes from the Probability Ranking Principle (PRP):
The best results come when documents are ordered by their probability of being relevant.
 Each term in a query is given a weight = probability that the term will retrieve a
relevant document.
 We combine the weights of all query terms to calculate the overall probability that a
document is relevant.
 If terms are independent, we multiply their probabilities to get the final score.
Example
 A softball team has:
o 75% chance of winning on sunny days.

o 60% chance of winning if the best shortstop plays.

 If these events are independent, combining both gives a higher chance of winning
than either alone.
 In documents:
o Term A might mean a 70% chance of relevance.

o Term B might mean a 60% chance.

o Combined (if independent) → higher probability.

The Independence Assumption


 We assume that the presence of one term does not affect the presence of another
term.



 In reality, this is often not true (e.g., a document with “apple” in an apple pie query is
more likely to also have “pie”).
Challenges
 Considering dependencies makes calculations much harder.
 We need training data: examples of relevant and non-relevant documents.
 Often, we don’t have enough data to accurately estimate probabilities.
Independence Assumptions (I1 & I2)
 I1: Terms are independent in relevant documents and in all documents.
 I2: Terms are independent in relevant docs and also in non-relevant docs.
Ordering Assumptions (O1 & O2)
 O1: Rank documents only if they contain matching query terms.
 O2: Consider both:
o Terms present in the document.

o Terms absent from the document.

 Different combinations of independence assumptions and ordering rules lead to


different weighting formulas.
 For a term t, we track:
o N: Total docs.

o R: Relevant docs.

o n: Docs containing t.

o r: Relevant docs containing t.

Non-Binary Independence Model


The non-binary independence model is an extension of the probabilistic retrieval model that:
 Considers term frequency (how many times a term appears in a document).
 Adjusts for document length.
 Still assumes term independence like in the binary model, but instead of only
checking if a term appears or not, it accounts for how often it appears.
Difference from Simple Term Weights
 Simple Term Weights: Only checks presence or absence of a term in a document.



 Non-Binary Independence Model: Calculates separate probabilities for different
term frequencies.
Example
Collection: 10 documents.
 Doc 1: contains the term blue once.
 Doc 2: contains the term blue ten times.
 Both are relevant; 8 other docs are not relevant.
Non-Binary Method
 Calculate probability for each frequency separately:

o P(1∣R)=0.1 → “Blue appears once” in 1 of 10 relevant docs.

o P(10∣R)=0.1 → “Blue appears 10 times” in 1 of 10 relevant docs.

Incorporating Document Length


 Normalize term counts by document size:
o Doc 1: 5 total terms → blue occurs 0.5 times per term.

o Doc 2: 10 total terms → blue occurs 1 time per term.

Non-Relevant Probability

We also calculate P(di∣N) → the probability that a non-relevant document contains a term i
exactly di times.
Final Weight Formula
The weight for a term occurrence count is:

wi(di) = P(di | R) / P(di | N)

Where:

 P(di | R) = probability that a relevant document contains term i exactly di times.

 P(di | N) = probability that a non-relevant document contains term i exactly di times.

Captures strength of term occurrence (frequency) rather than just presence.

Adjusts for document length so large docs don’t unfairly dominate.

Still assumes independence between terms.



Requires more training data than binary models because probabilities are frequency-
specific.
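A sketch of how these frequency-specific weights could be estimated from training counts; the relevant/non-relevant occurrence counts below are hypothetical, chosen only to make the arithmetic visible:

```python
# Estimate wi(di) = P(di|R) / P(di|N) from training counts of how
# often each occurrence count of term i appears in relevant vs
# non-relevant documents. All counts here are made-up training data.
from collections import Counter

relevant_counts = [1, 10]                      # term counts in 2 relevant docs
nonrelevant_counts = [0, 0, 0, 0, 0, 0, 1, 2]  # term counts in 8 non-relevant docs

def weight(d_i, rel, nonrel):
    p_rel = Counter(rel)[d_i] / len(rel)        # P(di | R)
    p_non = Counter(nonrel)[d_i] / len(nonrel)  # P(di | N)
    return p_rel / p_non if p_non > 0 else float("inf")

print(weight(1, relevant_counts, nonrelevant_counts))  # 4.0
```

With these counts, "term appears exactly once" is four times as likely in a relevant document as in a non-relevant one, so that occurrence count gets weight 4.0. The need for counts per frequency, not just per presence, is exactly why this model demands more training data.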

Language Models
A statistical language model is a probabilistic way to generate text.
It gives the probability of any sequence of words appearing.
In IR, it’s used to estimate how likely a document would “generate” a user’s query.
Unigram Model: The simplest type — treats each word as independent and calculates
probabilities based only on word frequencies.
More complex models: Consider context (e.g., previous words) to predict the next one.
Formal Definition
For a document Di, we imagine its language model MDi.

We then calculate:

Similarity(Q, Di) = P(Q | MDi)


This means:
If someone was “reading” document Di what’s the probability they would produce the words
in query Q?
Treat the presence or absence of each term as independent events (Bernoulli model).
Probability of the whole query = product of:
1. Probabilities of terms present in the query.
2. Probabilities of terms absent from the query.
Estimating Term Probabilities
 A simple method: Maximum Likelihood Estimate (MLE).
 For a term tj in document Di:

P(tj | MDi) = (number of occurrences of tj in Di) / (total number of terms in Di)
The Problem with MLE

 If a query term does not appear in a document → P(tj∣MDi)=0


 Since we multiply probabilities for all query terms, one zero makes the whole
similarity score zero.

Example
Query: "gold silver truck"
Documents:



 D1: "Shipment of gold damaged in a fire" → No "silver" → score = 0
 D2: "Delivery of silver arrived in a silver truck" → No "gold" → score = 0
 D3: "Shipment of gold arrived in a truck" → No "silver" → score = 0
Result: All scores become zero, which is a problem.
 Language models rank documents by probability of generating the query.
 MLE is simple but fails if query terms are missing from the document.
 This leads to the need for smoothing techniques to avoid zero probabilities.
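The sketch below reproduces the zero-score problem with MLE on the example collection, then applies Jelinek-Mercer smoothing (mixing document and collection frequencies) as one standard fix; the notes do not specify a smoothing method, and the choice of lambda = 0.5 is arbitrary:

```python
# Query-likelihood scoring: MLE gives every document a zero score
# because each one misses a query term; smoothing repairs this.
from collections import Counter

docs = {
    "D1": "shipment of gold damaged in a fire",
    "D2": "delivery of silver arrived in a silver truck",
    "D3": "shipment of gold arrived in a truck",
}
query = "gold silver truck".split()

models = {d: Counter(t.split()) for d, t in docs.items()}
lengths = {d: sum(c.values()) for d, c in models.items()}
collection = Counter()
for c in models.values():
    collection.update(c)
coll_len = sum(collection.values())

def mle_score(d):
    score = 1.0
    for t in query:
        score *= models[d][t] / lengths[d]  # zero if t is absent from d
    return score

def smoothed_score(d, lam=0.5):
    score = 1.0
    for t in query:
        p_doc = models[d][t] / lengths[d]
        p_coll = collection[t] / coll_len   # back off to collection stats
        score *= lam * p_doc + (1 - lam) * p_coll
    return score

print([mle_score(d) for d in docs])  # [0.0, 0.0, 0.0] -- the problem above
ranking = sorted(docs, key=smoothed_score, reverse=True)
print(ranking)  # ['D2', 'D3', 'D1']
```

With smoothing, a missing query term merely lowers a document's score instead of zeroing it, and the documents become comparable again.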
Index term selection
Some words are not good for representing documents. Using all words has a computational
cost, increases searching time and storage requirements, and using the set of all words in a
collection to index documents generates too much noise for the retrieval task; therefore, term
selection is very important. The main objectives of term selection are:
• Represent textual documents by a set of keywords called index terms or simply terms
• Increase efficiency by extracting from the document a selected set of terms to be used for
indexing the document
• If full-text representation is adopted, then all words are used for indexing (not as efficient,
since it carries time and space overhead)
An index term, also called a keyword, is a word (single word) or phrase (multiword) in a
document whose semantics gives an indication of the document's theme (main idea). Index
terms:
 Capture the subject discussed
 Help in remembering the documents
Indexing:
– Is the art of organizing information
– Is an association of descriptors (keywords, concepts) to documents in view of future
retrieval – Is a process of constructing document surrogates by assigning identifiers to text
items
– Is the process of storing data in a particular way in order to locate and retrieve the data
– Is the process of analyzing the information content in the language of the indexing system
The most important and well-known purposes/objectives of indexing are:
– To give access points to a collection that are expected to be most useful to the users of
information
– To allow easy identification of documents (e.g., find documents by topic)



– To relate documents to each other
– To allow prediction of document relevance to a particular information need
There are two ways of indexing:
Manual Indexing
 Done by human indexers.
 Uses controlled vocabulary (thesaurus, guidelines).
 Based on human judgment and semantic understanding of the document.
 Requires prior knowledge of:
o User’s search terms.

o Indexing vocabulary.

o Collection characteristics.

 Usually done in libraries or archival settings.


 Slower, time-consuming, and costly.
 High accuracy if done by skilled indexers.
Automatic Indexing
 Done by computer systems.
 Scans documents against a controlled vocabulary/taxonomy.
 Fast and suitable for large volumes of data (e.g., internet search engines).
 Advantages:
o Very fast and cost-effective.
o Predictable and consistent.
o Can extract and cluster terms.
o Great for similar types of content.
 Disadvantages:
o Less flexible than humans.
o May need human checking for errors.
 Best for online environments with huge document collections.
Text Operations
Transforming text into logical representations (index terms) for retrieval.
Main Operations
1. Lexical Analysis / Tokenization
o Splitting text into words or tokens.
o Handling digits, hyphens, punctuation, and letter cases.
2. Elimination of Stop Words
o Removing common words (e.g., "the", "is", "and") that do not help in retrieval.
3. Stemming
o Removing prefixes/suffixes to get the root form of words (e.g., playing → play).
4. Thesaurus Construction
o Building structures that show relationships between words for query expansion.
Why is Preprocessing Needed?
 Not all words are equally important (nouns are often most meaningful).
 Using all words as index terms causes noise (irrelevant terms).
 Preprocessing reduces vocabulary size, improving retrieval performance.
Logical View of Documents
 Documents are represented by keywords/index terms.
 In large collections, the list of keywords is reduced for efficiency (text
transformation).
Typical Text Processing System Steps
1. Remove High-Frequency Words
o Use a stop list to remove common “fluff” words.
o Reduces document file size by 30–50%.
2. Suffix Stripping
o Remove the longest possible suffix from words.
o Example: connection, connected → connect.
3. Detect Equivalent Stems
o Group different forms of a word under one stem for consistency.
Lexical Analysis/Tokenization of Text
Definition: Breaking a text into words/tokens (meaningful units) for indexing or searching.
Converts character streams → a sequence of words (w1, w2, …, wn).
First step in:
 Automatic Indexing (finding index terms).
 Query Processing (analyzing a search query).
Tokens are candidate index terms for retrieval.
Identify words in the text that will be used as index terms.
Decide: what counts as a token? (Letters only? Letters + numbers? Hyphenated words?)
Common Tokenization Rules
 Words are separated by spaces, punctuation, or special characters.
 Common words (“a”, “the”, “of”, etc.) are often ignored.
 Convert all words to lowercase for consistency.
 Numbers and punctuation marks are usually ignored unless meaningful (e.g., MS-DOS, 510 B.C.).
Issues in Tokenization
1. Hyphenated Words
o state-of-the-art → one token or three?
o Some hyphenated words are unique (e.g., MS-DOS).
2. Punctuation Marks
o Remove unless significant (e.g., x.exe, email addresses).
3. Multi-Word Units
o San Francisco, Addis Ababa → treat as one unit or two?
4. Numbers
o Dates, phone numbers, IP addresses – usually not indexed, unless important.
5. Case Sensitivity
o Data vs. data vs. DATA → usually converted to lowercase.
6. Language-Specific Rules
o Tokenization rules vary with the language of the document.
Example
Input: "Friends, Romans, and Countrymen"
Output Tokens:
 Friends
 Romans
 and
 Countrymen
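The rules above can be sketched in a few lines. This is a minimal illustration, not a complete tokenizer: it keeps only alphabetic runs and lowercases them, so special cases like hyphenated words (MS-DOS) and meaningful numbers (510 B.C.) are not handled.

```python
import re

def tokenize(text):
    """Return candidate index terms: alphabetic runs, case-folded.

    A sketch only; real tokenizers need extra rules for hyphens,
    numbers, and language-specific conventions.
    """
    tokens = re.findall(r"[A-Za-z]+", text)
    return [t.lower() for t in tokens]

print(tokenize("Friends, Romans, and Countrymen"))
# → ['friends', 'romans', 'and', 'countrymen']
```

Note that, per the lowercase rule above, the output tokens are case-folded, unlike the raw token list in the example.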
Elimination of Stopwords
Stopwords
 Definition: Very common words in documents that have little or no value for distinguishing between documents.
 Found in 80% or more of documents.
 Examples:
o Articles: a, an, the
o Pronouns: I, he, she, it, their, his
o Prepositions: on, of, in, about, besides, against
o Conjunctions: and, but, for, nor, or, so, yet
o Verbs: is, are, was, were
o Adverbs: here, there, because, soon, after
o Adjectives: all, any, each, every, few, many, some
 Language-dependent (stopwords differ by language).
Why Remove Stopwords?
 They have little semantic content.
 They occupy up to 50% of the text, so removing them reduces storage size by 30–50%.
 They make indices smaller → faster retrieval.
 Removal helps in better text classification, categorization, and summarization.
How to Detect Stopwords?
1. By Frequency:
o Sort terms by document frequency (DF) and choose the most frequent as stopwords.
2. By Stop List:
o Use a predefined list of common words to exclude.

Stopword Removal – Past vs. Present
 Older IR Systems:
o Stopwords were always removed to save space.
 Modern Search Engines:
o Often keep stopwords because:
 Needed for phrase queries (“King of Denmark”).
 Important in song titles, quotes (“Let it be”, “To be or not to be”).
 Useful in relational queries (“flights to London”).
 Risk of Removal:
o May reduce recall (miss important matches if stopwords are removed).
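Stop-list based removal can be sketched as below. The stop list here is a small illustrative subset for the example, not a standard list.

```python
# Illustrative stop list (a tiny subset of a real one).
STOP_LIST = {"a", "an", "the", "is", "are", "of", "in", "on", "and", "to"}

def remove_stopwords(tokens):
    """Drop tokens that appear in the stop list (case-insensitive)."""
    return [t for t in tokens if t.lower() not in STOP_LIST]

print(remove_stopwords(["the", "cat", "sat", "on", "the", "mat"]))
# → ['cat', 'sat', 'mat']
```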

Normalization
Purpose: Make tokens match despite small differences in their written form.
Idea: Convert both indexed text and query terms into the same standard form.
Examples
 U.S.A. → USA (remove periods)
 Case Folding: Convert everything to lowercase.
o Examples:
 Republican → republican
 Fasil / fasil / FASIL → fasil
 Anti-discriminatory → antidiscriminatory
o Helps match words that are the same except for capitalization.
Advantages
 Matches words even if:
o One appears at the start of a sentence (capitalized)
o Users type in lowercase
o There are slight differences in form (Ferrari → ferrari)
Disadvantages
 Not always good for proper names vs. common nouns:
o “General Motors” (the company) vs. “general motors” (ordinary motors)
o “Associated Press”, “Kebede”
 Possible solution: Only lowercase words that are not at the beginning of a sentence.
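The normalization steps above can be sketched as a single function; this is a minimal example covering only the period removal, hyphen removal, and case folding shown here.

```python
def normalize(token):
    """Map a token to a standard form for matching."""
    token = token.replace(".", "")   # U.S.A. → USA
    token = token.replace("-", "")   # Anti-discriminatory → Antidiscriminatory
    return token.lower()             # case folding

print([normalize(t) for t in ["U.S.A.", "Anti-discriminatory", "FASIL"]])
# → ['usa', 'antidiscriminatory', 'fasil']
```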

DATA STRUCTURE
Knowledge of data structures gives an insight into the capabilities available to the system.
• Each data structure has a set of associated capabilities.
• Ability to represent concepts and their relationships.
• Supports location of those concepts.
Introduction
There are two major data structures in any IRS:
1. One structure stores and manages received items in their normalized form; it is called the document manager.
2. The other data structure contains the processing tokens and associated data needed to support search.
The results of a search are references to the items that satisfy the search statement; these are passed to the document manager for retrieval.
Focus: on the data structures that support the search function.
Stemming is a transformation often applied to data before placing it in the searchable data structure. Stemming maps a concept (word) to a canonical (authorized; recognized; accepted) representation.
Risk with stemming: concept discrimination information may be lost in the process, causing a decrease in performance.
Stemming Algorithm
 A stemming algorithm is used to improve the efficiency of the IRS and to improve recall.
 Conflation (the process or result of fusing items into one entity; fusion; amalgamation) is a term used to refer to the mapping of multiple morphological variants to a single representation (stem).
 The stem carries the meaning of the concept associated with the word, and the affixes (endings) introduce subtle (slight) modifications of that concept.
 Terms with a common stem will usually have similar meanings, for example:
• CONNECT
• CONNECTED
• CONNECTING
• CONNECTION
• CONNECTIONS
 Frequently, the performance of an IR system will be improved if term groups such
as this are conflated into a single term. This may be done by removal of the
various suffixes -ED, -ING, -ION, IONS to leave the single term CONNECT
 In addition, the suffix stripping process will reduce the total number of terms in
the IR system, and hence reduce the size and complexity of the data in the system,
which is always advantageous.
 The major usage of stemming is to improve recall.
 It is important for a system to categorize a word prior to making the decision to stem.
 Proper names and acronyms (a word formed from the initial letters of a name, say IARE) should not have stemming applied.
 Stemming can also cause problems for natural language processing (NLP) systems by causing loss of information.
 Common stemming algorithms are Porter Stemming, Table Lookup Stemming, Dictionary Stemming and Successor Variety Stemming.
PORTER STEMMING ALGORITHM
 Based on a set of conditions on the stem.
 A consonant in a word is a letter other than A, E, I, O or U. Some important stem conditions are:
1. The measure m of a stem is a function of sequences of vowels (V) followed by sequences of consonants (C).
2. A stem has the form C(VC)mV, where m is the number of VC repeats. The case m = 0 covers the null word.
3. *X – the stem ends with the letter X.
4. *v* – the stem contains a vowel.
5. *d – the stem ends in a double consonant (e.g. -TT, -SS).
6. *o – the stem ends in a consonant-vowel-consonant sequence where the final consonant is not w, x or y (e.g. -WIL, -HOP).
 Suffix conditions take the form: current_suffix == pattern. Actions take the form: old_suffix -> new_suffix.
 Rules are divided into steps to define the order for applying the rules.
Examples of the rules
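The measure m from conditions 1–2 can be computed as sketched below. This is a simplified reading of Porter's definition (including his convention that 'y' after a consonant acts as a vowel), not the full algorithm with its suffix-rule steps.

```python
def is_consonant(word, i):
    """Porter's definition: not a/e/i/o/u, and 'y' is a consonant
    only when it is not preceded by a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    """Count the VC sequences in the C(VC)^m V form of the stem."""
    # Collapse the stem into its C/V form, e.g. "trouble" -> "CVCV".
    form = ""
    for i in range(len(stem)):
        c = "C" if is_consonant(stem, i) else "V"
        if not form or form[-1] != c:
            form += c
    return form.count("VC")

# Porter's own examples: m=0 (tree), m=1 (trouble), m=2 (private)
print(measure("tree"), measure("trouble"), measure("private"))
# → 0 1 2
```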

Table Lookup Stemming Algorithm
 The Table Lookup Stemming Algorithm is a simple and accurate stemming approach where each word is matched directly against a predefined table or dictionary containing inflected/derived forms and their corresponding stems.
 Precompiled Table:
A table (lookup dictionary) is prepared containing pairs of:
o Inflected/derived forms (e.g., "running", "jumps", "easily")
o Corresponding root words or stems (e.g., "run", "jump", "easy")

Word → Stem
running → run
ran → run
jumps → jump
easily → easy
studies → study

 Lookup Operation:
When a word is encountered in a document or query, the algorithm searches the table.
If the word exists, it is replaced with its stem.
If not, the word may be left unchanged or passed to a fallback stemmer.
Advantages:
• Very accurate for known words
• Fast lookup using hash maps or tries
• No ambiguity if the table is well-defined
Disadvantages:
• Requires large storage for a comprehensive vocabulary
• Fails on unseen words (not in the table)
• Not dynamic; needs updating as new words evolve
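A minimal sketch of the lookup operation, using the example table entries above (an illustrative subset, not a complete vocabulary):

```python
# Illustrative stem table from the example above.
STEM_TABLE = {
    "running": "run", "ran": "run",
    "jumps": "jump", "easily": "easy",
    "studies": "study",
}

def table_stem(word, fallback=None):
    """Return the table stem if present; otherwise leave the word
    unchanged or hand it to an optional fallback stemmer."""
    w = word.lower()
    if w in STEM_TABLE:
        return STEM_TABLE[w]
    return fallback(w) if fallback else w

print([table_stem(w) for w in ["Running", "studies", "algorithm"]])
# → ['run', 'study', 'algorithm']
```

The Dictionary Lookup Algorithm below works the same way, only with a much larger mapping.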
Dictionary Lookup Stemming Algorithm
 The Dictionary Lookup Algorithm is a stemming technique where words are matched directly against a predefined dictionary of words and their base/root forms (stems). This is a direct and accurate method for reducing words to their root forms.
 Precompiled Dictionary:
A large dictionary is created that maps inflected/derived words to their root words, e.g.:
 "running" → "run"
 "bought" → "buy"
 "happier" → "happy"

Word → Stem
studies → study
children → child
went → go
playing → play
happily → happy

Word Matching:
 When processing a document or query:
o Each word is looked up in the dictionary.
o If found, the root word is returned.
o If not found, it may be left unchanged or passed to a fallback stemmer.
Successor Variety Stemming Algorithm
 The Successor Variety Algorithm is a statistical, unsupervised stemming method that detects word boundaries (stems) based on successor variety, i.e., the number of different characters or letter sequences that follow a certain position in a word across a corpus.
 Step by step:
1. Input: a large set of words (the corpus vocabulary).
2. For each word, scan from left to right and track how many different characters follow each position in the word across the vocabulary.
3. Successor variety at a position = the number of different letters that follow that position in different words.
4. A morpheme boundary is assumed where the successor variety is highest.
Example corpus: play, played, playing, playful, player, plays
Test word: playing (positions 1–7)
Successor varieties might look like:
• After p: 1 (only 'l'); after l: 1 (only 'a'); after a: 1 (only 'y'); after y: 4 (e, i, f, s) ← high variety, likely stem boundary here
• Successor Variety (SV): looks at the next letter only.
• Successor Variety with Threshold: only cut at positions with variety above a threshold.
• Forward and Backward SV: some algorithms use both successor and predecessor varieties.

Successor varieties of words are used to segment a word by applying one of the following four methods:
1. Cutoff method: a cutoff value is selected to define the stem length.
2. Peak and plateau: a segment break is made after a character whose successor variety exceeds that of the characters immediately before and after it.
3. Complete word method: break on boundaries of complete words in the corpus.
4. Entropy method: uses the distribution of successor variety letters.
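The successor variety computation can be sketched as below, over the toy vocabulary from the example above; a real system would use the full corpus vocabulary.

```python
# Toy vocabulary from the "playing" example.
VOCAB = ["play", "played", "playing", "playful", "player", "plays"]

def successor_variety(word, vocab):
    """For each prefix of `word`, count the distinct letters that
    follow that prefix across the vocabulary."""
    varieties = []
    for i in range(1, len(word) + 1):
        prefix = word[:i]
        successors = {w[i] for w in vocab if w.startswith(prefix) and len(w) > i}
        varieties.append((prefix, len(successors)))
    return varieties

for prefix, v in successor_variety("playing", VOCAB):
    print(f"{prefix!r}: {v}")
# The variety peaks after 'play' (e, i, f, s), marking the stem boundary.
```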
INVERTED FILE STRUCTURE
The most common data structure.
Inverted file structures are composed of three files:
1. The document file
2. The inversion list (posting list)
3. The dictionary
The inverted file is based on the methodology of storing an inversion of the documents: for each word, a list of the documents in which that word is found is stored (the inversion of the documents).
Each document is given a unique numerical identifier, and these identifiers are stored in the inversion lists. The dictionary is used to locate the inversion list for a particular word. The dictionary is a sorted list of the processing tokens in the system, each with a pointer to the location of its inversion list. The dictionary can also store other information used in query optimization, such as the lengths of the inversion lists, to increase precision.
 Zoning can be used to improve precision and restrict entries.
 The inversion list consists of a document identifier for each document in which the word is found.
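Building the dictionary and inversion lists can be sketched as below, assuming a toy collection of three documents with numeric identifiers; a real system would store these structures on disk and add positional or frequency data.

```python
from collections import defaultdict

# Toy document file: identifier -> normalized text (illustrative).
docs = {
    1: "the cat sat on the mat",
    2: "the dog sat",
    3: "cat and dog",
}

# Dictionary mapping each term to its inversion (posting) list.
inverted = defaultdict(list)
for doc_id, text in sorted(docs.items()):
    for term in dict.fromkeys(text.split()):  # unique terms, in order
        inverted[term].append(doc_id)

print(inverted["cat"])  # → [1, 3]
print(inverted["sat"])  # → [1, 2]
```

A query for a word is then answered by looking up its inversion list and passing the identifiers to the document manager.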
N-GRAM DATA STRUCTURE
 N-Grams can be viewed as a special technique for conflation (stemming) and as a
unique data structure in information systems.
 N-Grams are a fixed length consecutive series of “n” characters.
 Unlike stemming that generally tries to determine the stem of a word that represents
the semantic meaning of the word, n-grams do not care about semantics.
 The searchable data structure is transformed into overlapping n-grams, which are then
used to create the searchable database.
 Examples can be given as bigrams, trigrams and pentagrams; for the word phrase “sea colony”, the bigrams (no interword symbols) are: se ea co ol lo on ny.
 The symbol # is used to represent the interword symbol, which is any one of a set of symbols (e.g., blank, period, semicolon, colon, etc.).
 Each of the n-grams created becomes a separate processing token and is searchable.
 It is possible that the same n-gram can be created multiple times from a single word.
Applications of n-grams include:
 Text preprocessing
 Spelling correction
 Query suggestion
 Language modeling
 Plagiarism detection
 Indexing in search engines
Types of n-Grams
Unigram – 1 item (e.g., one word or one character)
Bigram – 2 consecutive items
Trigram – 3 consecutive items
n-Gram – n consecutive items
Example with Words (Word-Level n-Gram)
Sentence:
Information retrieval is important
• Unigrams:
["Information", "retrieval", "is", "important"]
• Bigrams:
["Information retrieval", "retrieval is", "is important"]
• Trigrams:
["Information retrieval is", "retrieval is important"]
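Both character-level and word-level n-grams come from the same sliding-window idea, which can be sketched as:

```python
def ngrams(items, n):
    """Return the consecutive n-grams over a sequence of items
    (characters of a word, or words of a sentence)."""
    return [items[i:i + n] for i in range(len(items) - n + 1)]

# Character-level bigrams of "sea" and "colony" (no interword symbols),
# matching the "sea colony" example above.
print(ngrams("sea", 2) + ngrams("colony", 2))
# → ['se', 'ea', 'co', 'ol', 'lo', 'on', 'ny']

# Word-level bigrams of the example sentence.
words = "Information retrieval is important".split()
print([" ".join(g) for g in ngrams(words, 2)])
# → ['Information retrieval', 'retrieval is', 'is important']
```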
PAT Data Structures
1. PAT stands for Practical Algorithm To Retrieve Information Coded In Alphanumeric (PATRICIA).
2. A PAT structure (PAT tree or PAT array) treats the text as a continuous input stream and builds a string data structure (like the n-gram data structure).
3. The input stream is transformed into a searchable data structure consisting of substrings; all substrings are unique.
4. Each position in an input string is an anchor point for a substring.
5. In the creation of PAT trees, each position in the input string is the anchor point for a substring (sistring) that starts at that point and includes all text up to the end of the input.
6. A PAT tree is a binary tree, the most common class for prefix search; but PAT trees are sorted logically, which facilitates range search, and they are more accurate than inverted files.
7. PAT trees provide an alternate structure for supporting string search.
Example: sistrings
Text → Economics for Warsaw is complex.
---------------------------------------------------------------------------------
sistring 1 Economics for Warsaw is complex.
sistring 2 conomics for Warsaw is complex.
sistring 5 omics for Warsaw is complex.
sistring 10 for Warsaw is complex.
sistring 20 w is complex.
sistring 30 ?
The key values are stored at the leaf nodes (bottom nodes) of the PAT tree.
For a text input of size “n” there are “n” leaf nodes and at most “n-1” higher-level nodes.
It is possible to place additional constraints on the sistrings used for the leaf nodes.
If the binary representations of “h” is (100), “o” is (110), “m” is (001) and “e” is (101), then the word “home” produces the input 100110001101… Using the sistrings:
1. The full PAT binary tree is

The value in the intermediate nodes (indicated by rectangles) is the number of bits to skip
until the next bit to compare that causes differences between similar terms.
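Sistring generation itself is straightforward to sketch (building the PAT tree over the sistrings is not shown here):

```python
# Sistrings (semi-infinite strings): every position in the input text
# anchors a substring running to the end of the text.
text = "Economics for Warsaw is complex."

def sistring(text, pos):
    """Return the sistring anchored at 1-based position `pos`."""
    return text[pos - 1:]

for pos in (1, 2, 5, 20):
    print(f"sistring {pos:2d}: {sistring(text, pos)}")
```

Running this reproduces the sistrings listed in the example (e.g., sistring 2 is "conomics for Warsaw is complex." and sistring 20 is "w is complex.").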
Signature File Structures
Superimposed coding is a technique used to represent words and documents using bit patterns (signatures).
Useful for fast searching and retrieval of data in various environments like databases, distributed systems, and parallel processing.
Each word is converted into a fixed-length bit code called a word signature.
The code is created using a hash function which decides which bits to set to ‘1’ in the signature.
For example, a 16-bit code may have 5 bits set to 1 for each word.
Generate Word Signature:
A hash function sets selected bit positions to ‘1’.
Example: Computer → 0001011000000110; Science → 1001000011100000
A block is a set of a fixed number of words (e.g., 5 words).
All word signatures in the block are combined using the OR operation to form the block signature.
Example block signature: 1001011111100110
Searching is done using template matching: the bit positions of the search pattern are matched against the block signature.
If too many bits are 1s, the signature becomes dense and inefficient.
To avoid this:
• Limit the number of words per block (e.g., 5 words).
• Fix the code length (e.g., 16 bits).
• Limit the number of 1s per word (e.g., 5 bits).
Example:
TEXT: Computer Science graduate students study
Block size: 5 words
Code length: 16 bits
Each word has 5 bits set to 1
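The scheme can be sketched as below with a 16-bit code and up to 5 bits per word. The hash-based bit selection here is an assumption for illustration, so the bit patterns will differ from the example signatures above, but the OR-combination and template-matching logic are as described.

```python
import hashlib

CODE_LEN, BITS_PER_WORD = 16, 5  # assumed parameters from the example

def word_signature(word):
    """Hash the word to choose up to 5 distinct bit positions to set."""
    digest = hashlib.md5(word.lower().encode()).digest()
    positions = []
    for byte in digest:
        p = byte % CODE_LEN
        if p not in positions:
            positions.append(p)
        if len(positions) == BITS_PER_WORD:
            break
    sig = 0
    for p in positions:
        sig |= 1 << p
    return sig

def block_signature(words):
    """OR together the word signatures of all words in the block."""
    sig = 0
    for w in words:
        sig |= word_signature(w)
    return sig

block = "Computer Science graduate students study".split()
bsig = block_signature(block)

# Template matching: a word MAY be in the block if all of its signature
# bits are set in the block signature (false drops possible, no misses).
query = word_signature("computer")
print(format(bsig, "016b"), query & bsig == query)
```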
Applications / Advantages
Efficient in searching and storing information.
Suitable for:
Medium-sized databases
Low-term frequency databases
WORM (Write Once Read Many) devices
Parallel processing machines
Distributed computing systems

Hypertext and XML Data Structures
Hypertext refers to non-linear text containing links (hyperlinks) to other documents or to parts of the same document.
Hypertext is used for browsing and retrieving linked documents.
It enables interactive navigation through related documents.
Anchor text (the clickable link text) is often used as a cue for relevance in retrieval.
Search engines analyze the link structure (e.g., the PageRank algorithm in Google) to improve relevance.
Used in Web Information Retrieval systems.
XML (eXtensible Markup Language) is used to represent structured and semi-structured data.
XML documents are organized in a tree-like structure, making them suitable for hierarchical data retrieval.
1. The advent of the Internet, with its exponential growth and wide acceptance as a new global information network, has introduced new mechanisms for representing information.
2. This structure is called hypertext and differs from traditional information storage data structures in format and use.
3. Hypertext is stored in HTML and XML formats.
4. Both of these languages provide detailed descriptions for subsets of text, similar to zoning.
5. Hypertext allows one item to reference another item via an embedded pointer.
6. HTML defines the internal structure for information exchange over the WWW on the Internet.
7. XML is defined by DTD, DOM, XSL, etc.
Document and Term Clustering
There are two types of clustering:
1) clustering index terms to create a statistical thesaurus, and
2) clustering items to create document clusters.
In the first case clustering is used to increase recall by expanding searches with related terms. In document clustering the search can retrieve items similar to an item of interest, even if the query would not have retrieved the item. The clustering process is not precise, and care must be taken in the use of clustering techniques to minimize the negative impact that misuse can have.

Applications in IRS
Hypertext:
 Web search engines.
 Link-based ranking (e.g., PageRank).
 Web crawling and indexing.
XML:
 Digital libraries.
 News/article archives.
 Scientific repositories.
Advantages
Hypertext:
Enhances browsing experience.
Useful in exploratory search.
XML:
Enables precise and structured querying.
Easy to integrate across platforms/systems.
Short Answer Questions
1. Define Data in the context of Information Retrieval.
2. Differentiate between Structured data and Unstructured data with examples.
3. What is Semi-structured data? Give one example.
4. Define Information and give an example.
5. What is the difference between Knowledge and Wisdom?
6. Define Information Retrieval (IR).
7. What is Information Storage and Retrieval (ISR)?
8. List the two basic aspects of an Information Retrieval System (IRS).
9. Mention two operations performed by an IR system.
10. Define Data Retrieval. How is it different from Information Retrieval?
11. State how an IR system acts as a bridge between authors and users.
12. Mention any two features of a Web Search System that distinguish it from a
classic IRS.
13. Define Precision in the context of IR.
14. Define Recall in the context of IR.
15. Write the formula for Precision and give one example.
16. Write the formula for Recall and give one example.
17. What is Indexing in IR systems?
18. Mention two differences between Manual Indexing and Automatic Indexing.
19. Define Stopwords with two examples.
20. What is Stemming? Why is it important in IR?
Long Answer Questions
1. Define and differentiate IR systems from other information systems.
2. Explain the importance of domain analysis in IR system design.
3. Describe the evaluation measures used for IR systems.
4. Illustrate with examples how recall and precision are computed.
5. Explain basic data structures used in IR systems.
6. Discuss algorithms for efficient data retrieval.
7. Identify different domains where IR systems can be applied.
8. Analyze the challenges in IR for each domain.
9. Evaluate the performance of an IR system with given metrics.
10. Suggest improvements for better retrieval effectiveness.