IR MT Notes

Information Retrieval (IR) is the process of finding unstructured documents that meet an information need, evolving from professional use to everyday activities through web search engines. IR systems manage large collections of unstructured data, utilizing techniques like indexing and ranking to efficiently retrieve relevant information while facing challenges such as data variety and ambiguity. The retrieval process involves user queries, preprocessing, and feedback loops to enhance accuracy and relevance in results.


MODULE 1

Introduction to Information Retrieval

What is Information Retrieval?



●​ Information Retrieval refers to the process of finding material, usually documents of an
unstructured nature such as text, that satisfies an information need from within large
collections stored on computers.
●​ In a general sense, IR can mean any act of retrieving required information, but in
computer science it specifically relates to searching and managing large document
collections.
●​ It has evolved from being used mainly by professionals to becoming an everyday activity
for millions of users through web search engines and email searches.
●​ IR deals primarily with unstructured or semi-structured data such as text, unlike
structured data stored in relational databases.
●​ The field also supports browsing and filtering document collections, clustering
documents into related groups, and classifying them into predefined categories such as
spam or non-spam.
●​ Information retrieval systems operate at different scales: at the web level with billions of
documents, at the enterprise level for domain-specific collections like patents or research
papers, and at the personal level with search tools integrated into operating systems and
email clients.
●​ The foundation of IR relies on techniques like the term-document matrix for representing
data, the inverted index for efficient searching, and the Boolean retrieval model for
processing logical queries.

Example: Suppose a university student searches the library’s digital archive for “neural
networks for image recognition.” The IR system scans thousands of research papers, uses an
inverted index to quickly locate documents containing the keywords “neural networks” in the
title and “image recognition” in the abstract or body, and returns the most relevant papers.
Need for an Information Retrieval (IR) System

●​ The rapid growth of digital information has made it impossible for humans to manually
locate relevant material within massive collections of documents.
●​ Traditional database systems are limited to structured data and require exact identifiers
(like IDs or codes), which is impractical for unstructured or text-heavy data.
●​ Information Retrieval systems allow users to express their needs in natural language or
keywords and still retrieve relevant results, making information access more flexible.
●​ They help in managing unstructured and semi-structured data such as web pages, emails,
research articles, and multimedia content.
●​ IR systems not only retrieve documents but also support filtering, clustering, and
classification, ensuring users get the most relevant and organized information.
●​ At different scales, IR systems are crucial: at the web level for search engines, at the
enterprise level for corporate databases, and at the personal level for searching files and
emails on individual devices.
●​ Without IR systems, handling the vast and continuously growing amount of digital
information would be inefficient, time-consuming, and often impossible.

Example: A researcher looking for studies on “climate change impact on agriculture” cannot
manually sift through millions of published papers. An IR system like Google Scholar quickly
scans large digital repositories, identifies documents with the required keywords in titles and
abstracts, and ranks them by relevance, saving time and effort.

D/B Information Retrieval system vs Data Retrieval system

●​ Matching: A data retrieval system performs exact matching (a record either satisfies the query or it does not), while an IR system performs partial or best matching and ranks results by relevance.
●​ Data: Data retrieval works on structured data with well-defined semantics (e.g., relational tables); IR works on unstructured or semi-structured text.
●​ Queries: Data retrieval uses artificial, formally defined query languages (e.g., SQL); IR accepts keywords or natural-language queries.
●​ Errors: In data retrieval, a single erroneous item means failure; IR is error-tolerant, since small mismatches only lower a document's rank.
●​ Result: Data retrieval returns exactly the matching items; IR returns a ranked list of possibly relevant documents.
●​ Model: Data retrieval is deterministic; IR is probabilistic, because relevance is uncertain.
Challenges Faced by IR Systems While Handling Unstructured Data

●​ Data Variety and Volume: The massive growth of text, images, audio, and video makes
it difficult to store, process, and retrieve information efficiently.
●​ Ambiguity and Context: Words can have multiple meanings, and user queries may be
vague, making it challenging to determine the intended context.
●​ Lack of Standardization: Unstructured data comes in diverse formats (documents,
emails, web pages, PDFs) with inconsistent structures, which complicates indexing and
retrieval.
●​ NLP Limitations: Current natural language processing techniques struggle with idioms,
sarcasm, domain-specific jargon, and cross-lingual retrieval.
●​ Quality and Noise: Data often contains spelling errors, duplicates, incomplete
information, or irrelevant content that reduces retrieval accuracy.
●​ Semantic Understanding: IR systems find it hard to capture the true meaning behind
user queries and documents, as keyword matching alone may miss relevant results.
●​ Scalability and Performance: Handling billions of documents requires efficient
indexing, storage, and real-time query processing while maintaining speed.
●​ Privacy and Security: Sensitive information in emails, medical records, or enterprise
data must be retrieved without violating user privacy or access restrictions.
●​ Evaluation and Metrics: Defining what counts as a “relevant” document is subjective,
making it difficult to measure and improve retrieval effectiveness.

Example: If a user searches for “Apple benefits”, an IR system must decide whether the query
refers to the fruit or the technology company. Without proper context handling, the system may
return irrelevant results like health articles when the user intended to find stock reports,
highlighting the challenges of ambiguity and semantic understanding in unstructured data.
COMPONENTS OF IR SYSTEM

1. Document Collection

●​ This is the input source for the IR system.


●​ It consists of a large set of documents (web pages, PDFs, articles, etc.).
●​ These documents are stored in a repository and form the knowledge base from which the
IR system retrieves information.

2. Crawling

●​ The crawler (or spider) automatically visits documents/web pages to gather their
contents.
●​ It follows hyperlinks and identifies new or updated pages.
●​ The output of crawling is a set of URLs to be crawled further and the raw data of
documents to be processed.

3. Indexing

●​ The crawled documents are processed to create an index, similar to the index of a book.
●​ It maps terms (keywords) to the documents where they appear.
●​ This step uses techniques like tokenization, stemming, stop-word removal, and term
weighting.
●​ The index allows fast retrieval instead of scanning entire documents every time.

4. Ranking

●​ When a user submits a query, the system matches it against the index.
●​ The retrieved documents are then ranked based on their relevance.
●​ Ranking algorithms (e.g., TF-IDF, BM25, PageRank) score documents by similarity to
the query.
5. Search Result

●​ The ranked list of documents is presented to the user.


●​ This is the direct output of the IR system that the user interacts with.

6. Relevance Feedback

●​ After viewing results, the user may mark some documents as relevant or irrelevant.
●​ The system uses this feedback to improve future retrieval (query expansion, re-ranking).
●​ This creates a feedback loop that helps refine the system’s accuracy over time.

7. Query

●​ The query is the input given by the user in the form of keywords, natural language
questions, or structured queries.
●​ It triggers the entire retrieval process.

User Interaction/Tasks in an Information Retrieval (IR) System

●​ Definition
○​ User interaction in an IR system refers to the way users communicate their
information needs to the system, navigate through results, and refine their search
until the desired information is obtained.
○​ According to Ricardo Baeza-Yates, interaction is not just about issuing queries
but also about iterative exploration, browsing, and feedback between the user and
the retrieval system.
●​ Core User Tasks
○​ Retrieval Task: The user formulates a query (keywords, phrases, or structured
conditions), and the system searches the database to return a ranked list of
potentially relevant documents.
○​ Browsing Task: Instead of issuing precise queries, users explore the information
space by following categories, hyperlinks, or clusters of documents to gradually
reach relevant information.
●​ Interaction Flow (Based on the Diagram)
○​ The user can directly interact with the retrieval module by submitting a query.
○​ The retrieval system communicates with the database to fetch results.
○​ Alternatively, the user can browse the document collection (e.g., by topic,
category, or metadata).
○​ Browsing can feed into retrieval (e.g., selecting a document cluster to refine
queries) and retrieval can feed into browsing (e.g., checking related results).
○​ Both browsing and retrieval tasks loop back, allowing iterative refinement of the
search process until the information need is satisfied.
●​ Importance of Interaction
○​ Users rarely find the desired information on the first attempt. Interaction allows
them to refine queries, filter irrelevant results, and explore related documents.
○​ Effective interaction bridges the gap between the user’s mental model of the
information need and the system’s structured representation of data.

Example: A researcher searching for “deep learning applications in medical imaging” may start
by entering the query (retrieval). From the results, they might browse related categories like
“MRI-based methods” or “cancer detection”. They may then refine the search using terms
found while browsing, creating an interactive cycle between retrieval and browsing until the
exact set of relevant papers is located.
Logical View of a Document in IR

The logical view of a document refers to the transformation of a raw document into a structured
representation suitable for indexing and retrieval. Instead of treating the document as just plain
text, an Information Retrieval (IR) system processes it step by step to extract meaningful
features.

1. Input Document

●​ A document contains both text (words, sentences) and structure (titles, sections, tables,
metadata).
●​ This raw form is not efficient for retrieval because it is too large and unorganized.

2. Structure Recognition

●​ The system identifies logical components such as headings, abstracts, captions, or bibliographic data.
●​ This helps separate important content (e.g., keywords in the title) from supporting
details (e.g., references).
●​ Output: recognized document structure.

3. Normalization (Accents, Spacing, etc.)

●​ Convert text into a standard, uniform form.


●​ Tasks include lowercasing, removing extra spaces, handling special symbols, and
normalizing accents.
●​ Example: “Café” → “cafe”.

4. Stopword Removal

●​ Commonly used words (e.g., the, and, is, of) are removed since they carry little meaning.
●​ This reduces noise and storage requirements.
●​ Example: “The history of information retrieval” → “history information retrieval.”

5. Noun Group Identification

●​ Multi-word expressions are identified because they represent richer meaning than
individual words.
●​ Example: “information retrieval system” is more meaningful than {information,
retrieval, system} separately.

6. Stemming or Lemmatization

●​ Words are reduced to their root or base form to group variations.


●​ Example: “connected, connecting, connection” → “connect.”
●​ This improves matching between queries and documents.

7. Indexing (Automatic or Manual)

●​ After preprocessing, a set of index terms is generated.


●​ Index terms are the core representation of the document used for retrieval.
●​ Indexing can be automatic (by algorithms) or manual (by experts in libraries or archives).
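The preprocessing chain above (normalization → stopword removal → stemming → index terms) can be sketched in a few lines of Python. This is a toy illustration: the stopword list and the suffix-stripping rules are simplified stand-ins for real resources such as a full stopword lexicon and a Porter stemmer.

```python
import re

# Tiny illustrative stopword list (a real system uses a much larger one).
STOPWORDS = {"the", "and", "is", "of", "a", "an", "in", "on", "for"}

def suffix_stem(token):
    # Crude suffix stripping standing in for a real stemmer (e.g. Porter).
    # Note it over-stems: "information" becomes "informat".
    for suffix in ("ing", "ion", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def index_terms(text):
    # 1. Normalization: lowercase and strip punctuation/special symbols.
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    tokens = text.split()
    # 2. Stopword removal.
    tokens = [t for t in tokens if t not in STOPWORDS]
    # 3. Stemming.
    return [suffix_stem(t) for t in tokens]

print(index_terms("The history of Information Retrieval"))
# → ['history', 'informat', 'retrieval']
```

The notes' own example "connected, connecting, connection → connect" falls out of the same rules.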

Information Retrieval in the Library (optional, just give a read)

●​ Libraries were among the first to adopt IR systems for retrieving information, initially
developed in academic institutions and later by commercial vendors.
●​ The first generation of these systems automated earlier catalog processes like card
catalogs and allowed searches mainly by author name and title.
●​ The second generation introduced more advanced search functions, such as searching by
subject headings, keywords, and complex queries, thereby improving precision and
recall.
●​ The third generation, still in use, emphasized user convenience through improved
graphical interfaces, electronic forms, hypertext linking, and open architectures.
●​ Several vendors supported these systems, including Endeavor Information Systems Inc.,
Innovative Interfaces Inc., and EOS International.
●​ Academic and research-focused systems also emerged, like Okapi (City University,
London), MELVYL (University of California), and Cheshire II (UC Berkeley).
●​ For example, in a university library using MELVYL, a student could search not only by
book title but also by keywords like “quantum mechanics” to retrieve multiple related
resources instantly.

The Web and Digital Libraries (optional, just give a read)

●​ Modern Web search engines still rely on indexing techniques similar to those used by
librarians a century ago, but massive advances in computing and the Web have
transformed their scale and accessibility.
●​ Costs of accessing information have dropped significantly, enabling wider audiences to
reach diverse resources.
●​ Advances in digital communication have provided global access, meaning users can
retrieve data from anywhere, even if the source is physically distant.
●​ Publishing freedom allows anyone to post information, making the Web and digital
libraries highly interactive platforms for sharing documents, photos, videos, and ideas.
●​ Users enjoy greater convenience, as information can be accessed anytime and in flexible
ways—for instance, buying a book online at night or participating in forums and
discussions.
●​ Despite these improvements, challenges persist in finding the most relevant information
due to the vast volume of available content. Faster indexing techniques and smaller query
response times are urgently needed.
●​ The quality of retrieval strongly depends on user interaction, so understanding user
behavior is vital in designing new IR strategies.
●​ For example, unlike a library catalog limited to metadata, a Web user searching “climate
change impact on agriculture” can instantly retrieve global research articles, government
reports, and multimedia content.​
THE RETRIEVAL PROCESS

Retrieval Process in an IR System

1.​ User Need → User Interface


○​ The process begins when a user has an information need.
○​ The user expresses this need as a query through the user interface (keywords,
natural language, Boolean operators).
2.​ Text Operations
○​ The query text undergoes preprocessing so that it can be matched with documents.
○​ Preprocessing includes:
■​ Tokenization (breaking into words)
■​ Stop-word removal
■​ Stemming/Lemmatization (reducing to root form)
○​ Similarly, the system also preprocesses document texts during indexing to
maintain consistency.
3.​ Query Operations
○​ The system interprets the processed query.
○​ It may apply query expansion (adding synonyms, related terms) or
reformulation (improving the query using past user behavior or relevance
feedback).
○​ The output is a structured query ready for searching.
4.​ Searching Using the Index
○​ The structured query is compared against the Index.
○​ The Index (inverted file) maps terms to the documents where they occur, enabling
fast retrieval.
○​ Instead of scanning the full database, the system only looks at documents listed in
the index for those query terms.
5.​ Indexing and Text Database
○​ Documents in the Text Database are continuously processed by the Indexing
module to update the index.
○​ This ensures that newly added or modified documents are searchable.
○​ Thus, the index serves as the bridge between user queries and documents.
6.​ Retrieving Candidate Documents
○​ After searching the index, the system retrieves a set of candidate documents that
contain (or are related to) the query terms.
○​ These documents are relevant to some extent but not yet ranked.
7.​ Ranking of Documents
○​ The retrieved documents are then ranked according to their degree of relevance.
○​ Ranking models include:
■​ TF-IDF weighting and cosine similarity (vector space model)
■​ BM25 (probabilistic model)
■​ Language models
○​ Documents most relevant to the query are placed at the top.
8.​ Results Displayed to User
○​ The ranked list of documents is shown to the user through the interface.
○​ The user can browse through the results and judge relevance.
9.​ User Feedback Loop
○​ The system may collect feedback from the user
■​ Explicit (user marks documents as relevant/irrelevant).
■​ Implicit (click-through rates, time spent on documents).
○​ This feedback is used to refine the query (e.g., Rocchio algorithm) and improve
future retrievals.
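The feedback step can be illustrated with a Rocchio-style query update, which the notes mention by name: the query vector is moved toward the centroid of relevant documents and away from the centroid of non-relevant ones. The α, β, γ values below are commonly cited defaults and the three-term vocabulary is an illustrative assumption, not something prescribed by these notes.

```python
def centroid(vectors):
    # Component-wise mean of a list of equal-length vectors.
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    # q_new = alpha*q + beta*centroid(relevant) - gamma*centroid(non_relevant)
    rel_c = centroid(relevant)
    non_c = centroid(non_relevant)
    new_q = [alpha * q + beta * r - gamma * s
             for q, r, s in zip(query, rel_c, non_c)]
    # Negative weights are usually clamped to zero.
    return [max(0.0, w) for w in new_q]

# Toy vocabulary: (information, retrieval, spam)
q = [1.0, 1.0, 0.0]
rel = [[1.0, 1.0, 0.0], [1.0, 0.0, 0.0]]   # documents the user marked relevant
non = [[0.0, 0.0, 1.0]]                     # document marked non-relevant
print(rocchio(q, rel, non))  # → [1.75, 1.375, 0.0]
```

The refined query weights "information" and "retrieval" more heavily and suppresses the term that only appeared in the non-relevant document.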
PRECISION AND RECALL

Precision

●​ Precision is about exactness.


●​ It measures how many of the documents retrieved by the system are actually relevant to
the user’s query.
●​ If a system gives you 10 documents and only 6 are useful, then precision is 60%.
●​ High precision means the system is good at filtering out irrelevant documents.

Recall

●​ Recall is about completeness.


●​ It measures how many of the relevant documents in the whole collection were
successfully retrieved.
●​ For example, if there are 50 relevant documents in the database, and the system gives you
20 of them, then recall is 40%.
●​ High recall means the system doesn’t miss many relevant documents.

Key Difference

●​ Precision focuses on quality of results (Are the retrieved results relevant?).


●​ Recall focuses on quantity of coverage (Did the system retrieve most of the relevant
results?).

Trade-off

●​ Often, increasing recall (retrieving more documents) lowers precision, because irrelevant
documents also get included.
●​ Increasing precision (returning fewer but highly relevant documents) may lower recall,
because some relevant documents get left out.​
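The two measures reduce to simple ratios, shown here with the same figures used in the bullets above (6 relevant out of 10 retrieved; 20 retrieved out of 50 relevant in the collection):

```python
def precision(retrieved_relevant, retrieved_total):
    # Of the documents the system returned, what fraction is relevant?
    return retrieved_relevant / retrieved_total

def recall(retrieved_relevant, relevant_total):
    # Of all relevant documents in the collection, what fraction was returned?
    return retrieved_relevant / relevant_total

print(precision(6, 10))  # → 0.6  (60%)
print(recall(20, 50))    # → 0.4  (40%)
```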
Numericals

(Worked examples appear here as figures.)
TERM DOCUMENT MATRIX AND INVERTED INDEX

Numericals

(Worked examples 1–3 and DIY practice problems appear here as figures.)
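Both structures from this section can be built in a few lines. This is a toy sketch over a two-document collection; a real system stores postings lists compactly on disk rather than in Python dictionaries.

```python
from collections import defaultdict

docs = {
    "D1": "information retrieval and search engines",
    "D2": "boolean retrieval model and algorithms",
}

# Term-document (incidence) matrix: 1 if the term occurs in the document.
vocab = sorted({t for text in docs.values() for t in text.split()})
matrix = {t: {d: int(t in text.split()) for d, text in docs.items()}
          for t in vocab}

# Inverted index: term -> postings list of document IDs containing it.
inverted = defaultdict(list)
for d, text in docs.items():
    for t in sorted(set(text.split())):
        inverted[t].append(d)

print(matrix["retrieval"])   # → {'D1': 1, 'D2': 1}
print(inverted["boolean"])   # → ['D2']
```

With the inverted index, a query term leads directly to its postings list instead of requiring a scan over every document.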
Module 2: Modeling in Information Retrieval

1. Introduction to Modeling in Information Retrieval

1.1 What is an Information Retrieval (IR) Model?

●​ An Information Retrieval model provides the mathematical and conceptual framework to describe:
1.​ How documents are represented (representation of knowledge contained in
documents).
2.​ How queries are represented (user’s information need in a formal way).
3.​ How to measure the similarity/relevance between a query and a document.
●​ Formally, an IR model is defined as a quadruple:

M = [D, Q, F, R(q, d)]

where:

●​ D → The set of documents (logical representations of documents in the collection).


●​ Q → The set of queries (formal representation of information needs).
●​ F → A framework to model documents and queries, i.e., rules or mathematical structure
used.
●​ R(q, d) → A ranking function that assigns a real number indicating the degree of
relevance of a document d to a query q.

Thus, IR models allow us to transform vague user information needs into ranked lists of
documents.

1.2 Need for IR Models

●​ In the real world, information needs are ambiguous and cannot be expressed precisely.
●​ Document collections are large, unstructured, and diverse.
●​ Simply matching keywords is not sufficient (as in early search systems).
●​ Hence, IR models are needed to:
○​ Represent documents and queries in a systematic way.
○​ Allow efficient retrieval and ranking.
○​ Handle uncertainty and vagueness in user queries.

1.3 General Information Retrieval Process

The IR process can be divided into sequential stages:


1.​ Document Collection (Corpus)
○​ A set of documents (could be text files, webpages, research papers, etc.).
○​ Example: Digital library containing 1 million research articles.
2.​ Indexing:
○​ Each document is analyzed to extract index terms (keywords or features that
represent content).
○​ Terms can be selected by removing stop words, stemming, lemmatization, etc.
○​ Example: Document → “Information Retrieval and Search Engines” → Index
terms: {information, retrieval, search, engine}.
3.​ Query Formulation:
○​ The user expresses an information need in natural language.
○​ This query is transformed into a formal representation compatible with the IR
model
○​ Example: Query: “retrieval models” → Index terms: {retrieval, model}.
4.​ Matching:
○​ The system compares the query representation with the indexed documents using
a matching strategy (Boolean operators, cosine similarity, probability estimates,
etc.).
5.​ Ranking:
○​ The IR system assigns a relevance score to each document.
○​ Documents are sorted in descending order of similarity to the query.
○​ Example: If the query is “information retrieval”, document D1 may get a
similarity score of 0.89, D2 = 0.65, D3 = 0.41.
6.​ Retrieval (Output):
○​ The system retrieves the most relevant documents (e.g., top 10 results like in
Google).

1.4 Key Concepts

●​ Index Terms:
○​ Keywords/phrases used to represent the semantic content of documents.
○​ Example: “machine learning algorithms” → index terms = {machine, learning,
algorithm}.
●​ Vocabulary (V):
○​ Set of all distinct index terms in the collection.
○​ Example: If the corpus has terms {machine, learning, retrieval, engine}, then V = {machine, learning, retrieval, engine}.
●​ Ranking Function:
○​ A scoring mechanism to evaluate similarity/relevance between query and
documents.
○​ Example: Cosine similarity in Vector Model, Probabilistic relevance in BM25.
●​ Relevance:
○​ A measure of how well a document satisfies a user’s information need.
○​ Can be binary relevance (relevant or not) or graded relevance (highly relevant,
moderately relevant, etc.).

1.6 Challenges in IR Modeling

Information Retrieval models face difficulties due to language complexity, diverse document
collections, and the gap between user intent and system interpretation. The six key challenges
are:

1. Uncertainty of User Needs​


Queries are often short, vague, and incomplete. Example: “retrieval models” could mean
comparison of models, mathematical details, or applications. This ambiguity makes exact
retrieval difficult.

2. Vocabulary Mismatch​
Users and authors may use different terms for the same concept (e.g., “car insurance” vs.
“automobile coverage”). Strict keyword matching misses relevant results. Solutions include
stemming, thesaurus expansion, and word embeddings.

3. Polysemy and Ambiguity​


Words can have multiple meanings (e.g., “bank security” = finance vs. riverbank). Queries like
“python” may refer to a language or animal. Disambiguation requires context and semantic
modeling.

4. Subjectivity of Relevance​
Relevance varies by user and purpose. Example: “history of India” → a student may want
basics, a researcher detailed studies. IR systems struggle to model this subjectivity.

5. Scalability and Efficiency​


Large-scale IR systems (Google, Bing) handle billions of documents and queries. Challenges
include indexing, high-dimensional term-document matrices, and delivering millisecond
responses efficiently.

6. Evaluation of IR Systems​
Assessing effectiveness is difficult since relevance is subjective. Metrics like precision, recall,
F1, and MAP exist, but trade-offs remain (high recall = more irrelevant docs, high precision =
missed results). User satisfaction also depends on freshness, diversity, and readability.
2. Taxonomy of Information Retrieval (IR) Models

2.1 What is a Taxonomy in IR?

●​ A taxonomy is a systematic classification of IR models.


●​ It shows different approaches for how documents and queries are represented, and how
relevance is computed.
●​ The taxonomy helps us understand the scope of IR models and compare their strengths
and weaknesses.

2.2 Two Main Retrieval Scenarios

IR systems can be broadly categorized into two retrieval modes:

1. Ad-hoc Retrieval

●​ Definition: The user submits a query to a static collection of documents, and the system
retrieves and ranks relevant results.
●​ Example: Google search, library catalog search, research database search.
●​ Process:
○​ User formulates a query (e.g., keywords, natural language question).
○​ The system matches this query against pre-indexed documents.
○​ A ranking algorithm (e.g., TF-IDF, BM25, PageRank) is applied to order the
results by relevance.
●​ Characteristics:
○​ Each query is independent; no prior context or feedback is assumed.
○​ The document collection is relatively stable (static).
○​ Works well for one-time information needs.
●​ Real-world analogy: Asking a librarian a new question every time you visit the library.

2. Filtering (or Routing)

●​ Definition: Documents arrive as a stream over time, and the system must filter them
according to a long-term user profile (standing query).
●​ Example:
○​ News recommendation systems (e.g., Google News personalization).
○​ Email spam filtering.
○​ Stock market alerts, research paper alerts.
●​ Process:
○​ A user profile is stored, representing long-term interests (e.g., “AI and deep
learning papers”).
○​ As new documents arrive in real-time, the system checks them against this profile.
○​ Only documents matching the user’s interest are delivered.
●​ Characteristics:
○​ Continuous and long-term process (profile remains active).
○​ Focuses on personalization and relevance over time.
○​ Common in recommender systems and notification services.
●​ Real-world analogy: Subscribing to a magazine or journal that automatically sends you
issues relevant to your interest.​
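A standing query can be sketched as a stored profile checked against each arriving document. The term-overlap threshold below is an illustrative assumption; real filtering systems use learned relevance scores rather than raw overlap counts.

```python
# Long-term user profile (standing query) as a set of interest terms.
profile = {"deep", "learning", "nlp"}

def matches(profile, document, threshold=2):
    # Deliver the document if it shares at least `threshold` terms
    # with the profile (toy matching rule).
    return len(profile & set(document.lower().split())) >= threshold

# Documents arriving as a stream over time.
stream = [
    "new deep learning model for nlp tasks",
    "stock market closes higher today",
]
delivered = [d for d in stream if matches(profile, d)]
print(delivered)  # only the deep-learning paper is delivered
```

Unlike ad-hoc retrieval, the profile persists: every future document in the stream is checked against it.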

2.3 Classification of IR Models

Within both retrieval modes, IR models can be grouped into classic models and alternative
models.

A. Classic Models

These are the foundational IR models that shaped the theory of information retrieval. They are
widely studied and form the basis of most search engines:

1.​ Boolean Model


○​ Based on set theory and Boolean logic (AND, OR, NOT).
○​ A document is either retrieved (relevant) or not retrieved.
○​ Example: Query = machine AND learning.
2.​ Vector Space Model (VSM)
○​ Represents documents and queries as vectors in a multi-dimensional space.
○​ Uses term weighting (TF-IDF) and cosine similarity to measure closeness
between query and document vectors.
○​ More flexible than Boolean because it supports partial matching.
3.​ Probabilistic Model
○​ Uses probability theory to estimate the likelihood that a document is relevant to
a query.
○​ Documents are ranked by their estimated probability of relevance.
○​ Provides a more principled mathematical foundation than Boolean or VSM.

These will be explained in detail in the next section.


2.4 Example to Differentiate Ad-hoc vs Filtering

To clearly highlight the difference, let’s take the example of a query on Deep Learning for
NLP:

●​ Ad-hoc Retrieval:
○​ User enters a one-time query: “Best deep learning algorithms for NLP”.
○​ Google searches its entire indexed database and produces a ranked list of web
pages, articles, and tutorials.
○​ The user selects from these results and the process ends there.
●​ Filtering (Routing):
○​ User defines a long-term profile: “Send me updates about new research papers on
deep learning and NLP”.
○​ Every time new papers appear on ArXiv or research databases, the system
automatically evaluates them against this profile.
○​ Only relevant results are continuously delivered to the user (e.g., email alerts, feed
recommendations).

1. Boolean Model

Definition and Basic Idea

●​ The Boolean model is the earliest and simplest model of Information Retrieval. It
represents both documents and queries as sets of index terms and uses the principles of
Boolean algebra to determine whether a document is relevant.
●​ In this model, retrieval is based on exact matching. A document is either considered
relevant (retrieved) if it satisfies the query condition, or non-relevant (not retrieved) if it
does not. There is no concept of degrees of relevance.

Representation of Documents and Queries

●​ A document is represented as a set of terms or keywords that describe its content. Each
term can either be present (1) or absent (0).
●​ Queries are expressed using Boolean operators:
○​ AND ( ∧ ) ensures that a document must contain all the specified terms.
○​ OR ( ∨ ) ensures that a document can contain any one of the specified terms.
○​ NOT ( ¬ ) ensures that a document must not contain the specified term.
●​ For example, the query “information AND retrieval” will only retrieve documents
containing both terms, while “information OR retrieval” retrieves documents containing
either one or both terms.
Disjunctive Normal Form (DNF)

●​ Any Boolean query can be rewritten in Disjunctive Normal Form (DNF), which is a
logical expression formed as an OR (disjunction) of multiple AND (conjunction)
conditions.
●​ This makes queries easier to evaluate because they reduce to checking which documents
satisfy at least one conjunction.
●​ Example: Consider the query q = A ∧ (B ∨ ¬C)
●​ This can be rewritten as:​
q_DNF = (A ∧ B) ∨ (A ∧ ¬C)
●​ Each conjunction represents one possible way in which a document can satisfy the query, and documents are retrieved if they satisfy any of the conjunctions.
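The equivalence of the query and its DNF form can be checked exhaustively, since each term is either present (True) or absent (False) in a document:

```python
from itertools import product

def q(a, b, c):
    # Original query: A AND (B OR NOT C)
    return a and (b or not c)

def q_dnf(a, b, c):
    # DNF form: (A AND B) OR (A AND NOT C)
    return (a and b) or (a and not c)

# The two forms agree on every presence/absence combination of A, B, C.
assert all(q(*bits) == q_dnf(*bits)
           for bits in product([False, True], repeat=3))
print("q and q_DNF are equivalent on all 8 combinations")
```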

Example of Boolean Retrieval

Suppose we have the following document collection

●​ D1: “information retrieval and search engines”


●​ D2: “boolean retrieval model and algorithms”
●​ D3: “search engines and algorithms”
●​ D4: “information retrieval and algorithms”

Queries:

1.​ “information AND retrieval” – Retrieves D1 and D4, since both contain the terms
“information” and “retrieval”.
2.​ “boolean OR search” – Retrieves D2 (boolean), D1 (search), and D3 (search).
3.​ “algorithms AND NOT retrieval” – Retrieves D3, since it contains “algorithms” but
does not contain “retrieval”.

This shows that Boolean queries act as strict filters.
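The three queries above can be evaluated directly as set operations over postings lists, using the same four documents D1–D4 (a minimal sketch; real engines merge sorted postings rather than materializing Python sets):

```python
docs = {
    "D1": "information retrieval and search engines",
    "D2": "boolean retrieval model and algorithms",
    "D3": "search engines and algorithms",
    "D4": "information retrieval and algorithms",
}

# Build postings: term -> set of documents containing it.
postings = {}
for d, text in docs.items():
    for term in set(text.split()):
        postings.setdefault(term, set()).add(d)

all_docs = set(docs)
AND = set.intersection
OR = set.union
def NOT(s):
    return all_docs - s   # complement with respect to the collection

print(sorted(AND(postings["information"], postings["retrieval"])))      # ['D1', 'D4']
print(sorted(OR(postings["boolean"], postings["search"])))              # ['D1', 'D2', 'D3']
print(sorted(AND(postings["algorithms"], NOT(postings["retrieval"])))) # ['D3']
```

The outputs reproduce the three query results worked out above, confirming the strict-filter behavior: a document is in or out, with no ranking.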

Weight Vectors in Boolean Model

●​ Each document and query can be represented as a binary vector of 0s and 1s.
●​ A 1 indicates that the term appears in the document or query, while a 0 indicates that it is
absent.
●​ The similarity function is binary:​
sim(dj, q) = 1 if document dj satisfies the Boolean query q, and 0 otherwise.
●​ This binary scoring confirms that retrieval is exact and not based on partial similarity.

Advantages of Boolean Model

●​ The model is simple and easy to implement. It provides a clear logical framework for
document retrieval.
●​ Queries are unambiguous because they are explicitly formulated using logical operators.
●​ It is suitable when the user wants very precise results, for example in legal or patent
retrieval systems where exact matching is important.

Limitations of Boolean Model

●​ Documents are not ranked, meaning all retrieved documents are considered equally
relevant, which does not reflect real-world user needs.
●​ Partial matches are not allowed. If a document misses even one required term, it is
excluded, even if it is otherwise highly relevant.
●​ Users must be skilled at formulating complex Boolean queries, which is often impractical
for ordinary users.
●​ In practice, Boolean queries often return either too few results (famine) or too many
results (feast), making retrieval less effective.

2: Vector Space Model

Introduction and Motivation

●​ The Boolean model is limited because it only supports exact matching and provides no
ranking of documents. This makes it unsuitable for large, diverse collections where
queries may be vague or incomplete.
●​ The Vector Space Model (VSM) was developed to overcome these limitations. It allows
partial matching of documents to queries and ranks documents in decreasing order of
similarity.
●​ This ranking capability makes VSM one of the most widely used IR models, forming the
foundation for many modern search systems.
Representation of Documents and Queries

●​ In the vector space model, both documents and queries are represented as vectors in a
t-dimensional space, where t is the number of distinct terms (the vocabulary).
●​ Each term corresponds to one dimension, and the value in that dimension represents the
weight of the term in the document or query.

●​ A document dj is represented as dj = (w1j, w2j, ..., wtj) and a query q as q = (w1q, w2q, ..., wtq), where wij is the weight of term i in document j.
●​ The weight values are usually derived from term frequency (TF), inverse document
frequency (IDF), or their combination (TF-IDF).

Term Weighting

●​ Not all terms contribute equally to the relevance of a document. Common words like
"the" or "is" are less useful, while rare but meaningful terms provide stronger evidence of
relevance.
●​ To capture this, each term is assigned a weight that reflects its importance.
1.​ Term Frequency (TF):
○​ Measures how often a term appears in a document.
○​ Higher frequency → higher importance.
○​ Example: If the term “retrieval” occurs 5 times in a document, TF = 5.
2.​ Inverse Document Frequency (IDF):
○​ Reduces the weight of terms that occur in many documents, since they provide
little discrimination.
○​ Defined as:​
idf_i = log(N / n_i), where N is the total number of documents and n_i is the number of documents containing term i.
○​ Rare terms (low n_i) → high IDF value.


3.​ TF-IDF Weighting:
○​ Combines TF and IDF to balance importance within a document and across the
collection.
○​ Defined as:​
w_ij = tf_ij × log(N / n_i)
○​ This ensures that terms with high frequency in a document but low frequency in
the collection get the highest weights.
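The TF-IDF weighting described above can be sketched directly, here with raw term counts as TF and log(N / n_i) as IDF (a tiny two-document collection for illustration):

```python
import math

# Compute TF-IDF weights: w_ij = tf_ij * log(N / n_i).
docs = [
    "information retrieval and search engines",
    "boolean retrieval model and algorithms",
]

tokenized = [d.split() for d in docs]
N = len(tokenized)

def idf(term):
    n_i = sum(1 for doc in tokenized if term in doc)  # document frequency
    return math.log(N / n_i)

def tfidf(doc):
    # weight for each distinct term of the document
    return {t: doc.count(t) * idf(t) for t in set(doc)}

weights = tfidf(tokenized[0])
# "retrieval" occurs in both documents, so idf = log(2/2) = 0
print(weights["retrieval"])          # 0.0
print(round(weights["search"], 3))   # log(2/1) ≈ 0.693
```

Note how a term shared by every document ("retrieval" here) gets weight zero: it cannot discriminate between documents, exactly as the IDF rationale predicts.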

Similarity Computation

●​ The similarity between a document and a query is computed by comparing their vectors.
●​ The most common measure is cosine similarity, which computes the cosine of the angle
between two vectors:​
sim(dj, q) = (Σ_i w_ij × w_iq) / ( √(Σ_i w_ij²) × √(Σ_i w_iq²) )
●​ This value ranges from 0 to 1:


○​ 1 → document and query are identical in direction (perfectly relevant).
○​ 0 → no similarity.
●​ Example: If a query vector is [0.5, 0.2, 0.3] and a document vector is [0.3, 0.4, 0.2], their
similarity is computed by the dot product divided by vector norms.
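Carrying out that computation for the two vectors given above:

```python
import math

def cosine(a, b):
    # dot product divided by the product of the vector norms
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

q = [0.5, 0.2, 0.3]
d = [0.3, 0.4, 0.2]

# dot = 0.15 + 0.08 + 0.06 = 0.29; |q| = sqrt(0.38); |d| = sqrt(0.29)
print(round(cosine(q, d), 3))  # 0.874
```

A value this close to 1 indicates the document and query vectors point in nearly the same direction, i.e. a strong match.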

Document Length Normalization

●​ Longer documents naturally contain more terms and may unfairly appear more similar to
queries.
●​ To address this, the cosine similarity formula normalizes by the length of vectors.
●​ The length of a document vector is:​
|dj| = √(Σ_i w_ij²)
●​ This ensures that the similarity depends on term distribution rather than document size.
Example of Ranking

Suppose we query “machine learning algorithms” and have the following documents:

●​ D1: “Introduction to machine learning algorithms and their applications”


●​ D2: “Deep learning and its relationship with machine learning”
●​ D3: “Comparing various machine learning techniques for data analysis”

Using TF-IDF and cosine similarity, the model will assign a similarity score to each document.
The documents will then be ranked in decreasing order of score, showing the most relevant first.

Advantages

●​ Allows partial matching, which retrieves documents even if they do not contain all query
terms.
●​ Produces ranked results, making it easier for users to access the most relevant documents
first.
●​ Incorporates statistical term weighting (TF-IDF), which significantly improves retrieval
effectiveness.
●​ Document length normalization prevents bias toward longer documents.

Limitations

●​ Assumes independence of terms, which is not always true (terms like “computer” and
“network” often appear together).
●​ Computationally expensive for very large document collections because vectors can be
high-dimensional.
●​ TF-IDF weighting does not account for semantic meaning or synonymy; two different
terms with similar meaning are treated separately.
Practice Problems on Vector Space Model

Problem 1:​
Given the following term-document matrix with TF-IDF weights:

Term   Document 1   Document 2   Document 3
X      0.3          0.4          0.2
Y      0.6          0.1          0.8
Z      0.4          0.5          0.7

Query vector:
Problem 2:​
Query: “machine learning algorithms”​
Documents:

●​ D1: “Introduction to machine learning algorithms and their applications”


●​ D2: “Deep learning and its relationship with machine learning”
●​ D3: “Comparing various machine learning techniques for data analysis”

Task: Represent terms with TF-IDF, compute cosine similarity, and rank the documents.

Solution (outline):

●​ Step 1: Preprocess (tokenize, remove stopwords, keep main terms: machine, learning,
algorithms, applications, deep, techniques, analysis).
●​ Step 2: Build term-document matrix with frequencies.
●​ Step 3: Compute IDF for each term across the 3 documents.
●​ Step 4: Calculate TF-IDF weights for each document vector and query vector.
●​ Step 5: Use cosine similarity to find sim(D1, Q), sim(D2, Q), sim(D3, Q).
●​ Step 6: Rank based on similarity values.
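The six steps of this outline can be sketched end to end. The tokenizer and stopword list below are simplifying assumptions; note that with this tiny collection only "algorithms" discriminates ("machine" and "learning" occur in all three documents, so their IDF is zero), which is why D1 ranks first:

```python
import math

# Steps 1-6: preprocess, build counts, compute IDF, TF-IDF vectors, cosine, rank.
query = "machine learning algorithms"
docs = {
    "D1": "Introduction to machine learning algorithms and their applications",
    "D2": "Deep learning and its relationship with machine learning",
    "D3": "Comparing various machine learning techniques for data analysis",
}
stopwords = {"to", "and", "their", "its", "with", "for", "in", "a", "the"}

def tokens(text):
    return [w for w in text.lower().split() if w not in stopwords]

tokenized = {d: tokens(t) for d, t in docs.items()}
N = len(tokenized)
vocab = sorted({t for ts in tokenized.values() for t in ts})
idf = {t: math.log(N / sum(1 for ts in tokenized.values() if t in ts)) for t in vocab}

def vec(ts):
    return [ts.count(t) * idf[t] for t in vocab]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

qv = vec(tokens(query))
ranking = sorted(docs, key=lambda d: cosine(vec(tokenized[d]), qv), reverse=True)
print(ranking)
```

The same skeleton scales to any collection; only the tokenizer and the weighting scheme would change in a production system.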

Solution:
Problem 3:​
Query: “gold silver truck”​
Documents:

●​ D1: “Shipment of gold damaged in a fire”


●​ D2: “Delivery of silver arrived in a silver truck”
●​ D3: “Shipment of gold arrived in a truck”

Solution (outline):

●​ Extract terms: gold, silver, truck.


●​ Represent documents with term frequencies.
●​ Apply TF-IDF weighting (gold appears in D1, D3; silver in D2; truck in D2, D3).
●​ Compute similarity of query vector with each document vector.
●​ Likely ranking: D2 (highest, because contains both silver and truck), then D3 (gold +
truck), then D1 (only gold).
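A sketch confirming the likely ranking for this example (whitespace tokenization; only the three query terms are weighted, an assumption for brevity):

```python
import math

query = "gold silver truck"
docs = {
    "D1": "shipment of gold damaged in a fire",
    "D2": "delivery of silver arrived in a silver truck",
    "D3": "shipment of gold arrived in a truck",
}
terms = ["gold", "silver", "truck"]   # the query terms
N = len(docs)
df = {t: sum(1 for d in docs.values() if t in d.split()) for t in terms}
idf = {t: math.log(N / df[t]) for t in terms}

def vec(text):
    words = text.split()
    return [words.count(t) * idf[t] for t in terms]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

qv = vec(query)
scores = {d: round(cosine(vec(t), qv), 3) for d, t in docs.items()}
print(scores)  # D2 scores highest, then D3, then D1
```

D2 wins because "silver" is the rarest (highest-IDF) term and appears twice in D2; D3 beats D1 because it matches two query terms rather than one.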

Solution
3: Probabilistic Model

Introduction

The probabilistic model of Information Retrieval is based on the idea that relevance is uncertain
and can be expressed in terms of probability. Unlike the Boolean model, which retrieves
documents using exact matching, or the Vector Space Model, which uses geometric similarity,
the probabilistic model estimates how likely a document is to be relevant for a given query.

The central question is:

●​ Given a query and a document, what is the probability that this document is relevant?

Documents are then ordered (ranked) according to these probabilities, so that the most relevant
documents appear first. This approach is based on the Probability Ranking Principle (PRP),
which states that retrieval effectiveness is maximized when documents are ranked in decreasing
order of probability of relevance.

Prior and Posterior Probability

In probability theory:

●​ Prior probability is the probability of an event before considering any new evidence. In
IR, this could mean the probability that a document is relevant without looking at its
terms.
●​ Posterior probability is the probability of an event after considering evidence. In IR,
this is the probability that a document is relevant given that it contains certain query
terms.

This distinction is important because IR systems start with a prior belief about relevance and
then update that belief as new evidence (terms in the document) is observed.

Bayes’ Theorem in IR

The updating of beliefs is done using Bayes’ theorem, which provides a formal rule for moving from prior to posterior probability:

P(R | E) = P(E | R) × P(R) / P(E), where P(E) = P(E | R) P(R) + P(E | ¬R) P(¬R)

This formula means: the probability that a document is relevant after observing evidence
depends on how likely that evidence is to appear in relevant documents versus non-relevant ones.
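As an illustration with assumed numbers (not taken from these notes): suppose 30% of documents are relevant a priori, a query term appears in 80% of relevant documents, and in 20% of non-relevant ones.

```python
# Bayes' theorem: P(R|E) = P(E|R) * P(R) / P(E),
# with P(E) = P(E|R)P(R) + P(E|~R)P(~R).
# All numbers below are illustrative assumptions.
p_r = 0.30           # prior probability of relevance
p_e_given_r = 0.80   # term occurs in relevant documents
p_e_given_nr = 0.20  # term occurs in non-relevant documents

p_e = p_e_given_r * p_r + p_e_given_nr * (1 - p_r)
posterior = p_e_given_r * p_r / p_e
print(round(posterior, 4))  # 0.6316
```

Observing the term raises the belief in relevance from 0.30 to about 0.63, because the term is four times more likely in relevant documents than in non-relevant ones.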

Odds and Log Odds

The odds of relevance are defined as O(R) = P(R) / P(¬R). After observing evidence E, Bayes’ rule in odds form gives:

O(R | E) = O(R) × [ P(E | R) / P(E | ¬R) ]

where the second factor is the likelihood ratio of the evidence. Taking logarithms turns the product into a sum:

log O(R | E) = log O(R) + log [ P(E | R) / P(E | ¬R) ]

The log odds form is useful because it converts multiplication into addition, and makes
interpretation simpler. Each new piece of evidence contributes additively to the score of a
document.
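The additive behaviour of log odds can be checked numerically; the prior and likelihood ratios below are illustrative assumptions:

```python
import math

# Odds form of Bayes' rule: posterior odds = prior odds * likelihood ratio.
# In log space, each piece of evidence adds its log likelihood ratio.
prior_odds = 0.30 / 0.70    # O(R) for an assumed prior P(R) = 0.3
lr_term1 = 0.80 / 0.20      # P(E1|R) / P(E1|~R) = 4.0
lr_term2 = 0.60 / 0.30      # P(E2|R) / P(E2|~R) = 2.0

post_odds = prior_odds * lr_term1 * lr_term2
log_post = math.log(prior_odds) + math.log(lr_term1) + math.log(lr_term2)

assert abs(math.log(post_odds) - log_post) < 1e-12  # multiplication became addition
print(round(post_odds / (1 + post_odds), 4))  # back to a probability: 0.7742
```

Each observed term simply adds its log likelihood ratio to the document's score, which is why many retrieval functions are sums of per-term contributions.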
Advantages of Probabilistic Model

1.​ Theoretical Foundation: It is based on probability theory and Bayes’ rule, giving a
strong mathematical justification for ranking documents.
2.​ Dynamic Updating: Prior probabilities can be updated to posterior probabilities as new
evidence (terms) is observed, which makes the model adaptive.
3.​ Interpretability: The use of prior, posterior, and odds provides an intuitive explanation
of how relevance changes with evidence.
4.​ Discriminative Evidence Handling: If a term is much more likely in relevant documents
than in non-relevant ones, the odds ratio strongly favors documents containing that term.

Disadvantages of Probabilistic Model

1.​ Independence Assumption: It assumes terms occur independently, which is not true in
natural language (e.g., “machine” and “learning” are correlated).
2.​ Binary Term Representation: In its basic form, it only considers whether a term is
present or absent, ignoring term frequency (a term appearing 10 times is treated the same
as once).
3.​ Initial Estimates Needed: Prior probabilities and conditional probabilities
P(E∣R),P(E∣¬R) must be estimated, which is difficult without relevance judgments.
4.​ Simplistic in Syllabus Form: When taught only as Bayes’ rule and odds, it explains the
principle but is not practical enough for large-scale retrieval systems without further
extensions.
PROBLEMS:
2)
Conclusion:

Even if a person tests positive, the probability that they actually have the disease is only about
8.75%. This low probability is due to the fact that the disease is rare, and the test has a significant
false positive rate.

Note: I did not understand the odds and ranking numerical myself, so please work it out on your own.
Module 4: Indexing and Scoring in Information Systems

Section 4.1: Introduction

Role of Indexing in IR

●​ Information retrieval systems deal with large document collections. A naïve approach
would be to scan every document sequentially to answer a query. This is called
sequential searching, and it is highly inefficient when dealing with millions of
documents.
●​ To overcome this, indexing is used. Indexing structures organize the document collection
in a way that allows fast and efficient retrieval.
●​ The index acts like the index of a book: instead of scanning every page, we can directly
jump to the relevant locations for a keyword.

Role of Scoring in IR

●​ Simply identifying which documents contain a query term is not enough; we also need to
know which documents are more relevant than others.
●​ Scoring refers to the assignment of a numerical score to each document with respect to a
query.
●​ This score reflects the degree of relevance, and documents are returned in ranked order,
with the most relevant at the top.

Combined Role

●​ Indexing ensures fast access to candidate documents.


●​ Scoring ensures that the retrieved documents are ranked according to relevance.
●​ Together, indexing and scoring form the core of efficient IR systems (including web
search engines like Google).

Section 4.2: Inverted Files


Definition

●​ The inverted file index is the most widely used indexing structure in information
retrieval.
●​ It is called “inverted” because it maps terms to documents, instead of documents to
terms.
●​ The index consists of two main components:
1.​ Dictionary (or Vocabulary): A list of all distinct terms in the collection.
2.​ Postings Lists: For each term, a list of documents in which the term appears,
often along with additional information (e.g., frequency, positions).

Construction Steps

1.​ Document preprocessing: Tokenization, stopword removal, and stemming/lemmatization are applied.
2.​ Term extraction: Distinct terms are identified for each document.
3.​ Posting creation: Each term is associated with the document IDs in which it occurs.
4.​ Index storage: Dictionary and postings are stored in a compact format to allow fast
access.

Example

Suppose the collection is:

●​ D1: “data retrieval systems”


●​ D2: “database systems and retrieval”

Dictionary: {data, database, retrieval, systems}

Inverted index:

●​ data → {D1}
●​ database → {D2}
●​ retrieval → {D1, D2}
●​ systems → {D1, D2}

This allows immediate identification of documents containing the term “retrieval”, without
scanning all documents.
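The construction steps above can be sketched as follows (simplified: whitespace tokenization, with "and" as the only stopword):

```python
# Build an inverted index: term -> sorted list of document IDs.
docs = {
    "D1": "data retrieval systems",
    "D2": "database systems and retrieval",
}
stopwords = {"and"}

index = {}
for doc_id, text in docs.items():
    for term in set(text.split()) - stopwords:    # distinct terms per document
        index.setdefault(term, []).append(doc_id)

for postings in index.values():
    postings.sort()   # sorted postings make later merging/intersection fast

print(index["retrieval"])  # ['D1', 'D2']
print(index["data"])       # ['D1']
```

Real indexers also store term frequencies or positions in each posting, but the term-to-documents mapping shown here is the core structure.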

Advantages of Inverted Files

●​ Extremely efficient for text retrieval, as it avoids full scans.


●​ Can be extended to include additional information such as term frequency, term positions,
or even zone information (e.g., title vs body).
●​ Forms the backbone of almost all practical IR systems today.

Limitations

●​ Requires preprocessing and storage overhead.


●​ Updates (adding or deleting documents) can be expensive, as the index must be adjusted.
Section 4.3: Boolean Queries and Sequential Searching
Boolean Queries in IR

●​ A Boolean query is a formal way of representing an information need using the operators
AND, OR, and NOT from Boolean logic. Each operator specifies how terms should be
combined to determine which documents are retrieved.
●​ Documents are represented by postings lists in the inverted index. A postings list is the
set of document IDs in which a term appears. By applying set operations on these
postings lists, Boolean queries can be efficiently answered.
●​ For example, “information AND retrieval” means that only those documents that contain
both the term “information” and the term “retrieval” should be returned. Similarly,
“algorithms OR systems” retrieves documents that contain at least one of the two terms,
while “retrieval AND NOT systems” retrieves documents that contain “retrieval” but do
not contain “systems.”

Processing Boolean Queries with Inverted Index

●​ To process a Boolean query, the retrieval system uses the inverted index to obtain
postings lists of the relevant terms.
●​ AND corresponds to the intersection of postings lists, because a document must be
present in both lists.
●​ OR corresponds to the union of postings lists, since documents may appear in either list.
●​ NOT corresponds to set difference, as documents in one postings list are excluded if they
appear in the postings list of the negated term.

Example​
Suppose the collection has three documents:

●​ D1: “data retrieval systems”


●​ D2: “database systems and retrieval”
●​ D3: “retrieval techniques in data mining”

Inverted index:

●​ data → {D1, D3}


●​ database → {D2}
●​ retrieval → {D1, D2, D3}
●​ systems → {D1, D2}

Queries

1.​ “data AND retrieval” → {D1, D3} ∩ {D1, D2, D3} = {D1, D3}
○​ Documents D1 and D3 contain both “data” and “retrieval.”
2.​ “retrieval OR systems” → {D1, D2, D3} ∪ {D1, D2} = {D1, D2, D3}
○​ All three documents contain at least one of the two terms.
3.​ “retrieval AND NOT systems” → {D1, D2, D3} – {D1, D2} = {D3}
○​ Only D3 contains “retrieval” but not “systems.”

This demonstrates how Boolean operators are executed as set operations on postings lists.
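The three queries above map directly onto set operations; a sketch using the same postings lists:

```python
# Postings lists from the inverted index in the example above.
index = {
    "data":      {"D1", "D3"},
    "database":  {"D2"},
    "retrieval": {"D1", "D2", "D3"},
    "systems":   {"D1", "D2"},
}

# AND -> intersection, OR -> union, AND NOT -> set difference.
print(sorted(index["data"] & index["retrieval"]))     # ['D1', 'D3']
print(sorted(index["retrieval"] | index["systems"]))  # ['D1', 'D2', 'D3']
print(sorted(index["retrieval"] - index["systems"]))  # ['D3']
```

Only the postings of the query terms are touched; no document text is read at query time.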

Sequential Searching

●​ Sequential searching is the simplest way to answer a query: the system scans every
document in the collection, one by one, to check if it satisfies the query.
●​ For example, to answer the query “data AND retrieval,” the system would open each
document, check if both words appear, and only then include it in the results.
●​ While this works for small document collections, it becomes impractical for large-scale
systems like web search engines. With millions of documents, sequential scanning would
be far too slow.

Indexed Search vs Sequential Search

●​ In sequential search, the time required is proportional to the size of the collection,
because every document must be checked. Thus, the complexity is O(N),where N is the
number of documents.
●​ In indexed search, only the postings lists of the query terms are accessed. This means the
system works with a much smaller subset of the collection, often only a few hundred or
thousand entries, even if the collection contains millions of documents.
●​ This difference in efficiency is why inverted indices are the backbone of modern search
engines: they allow queries to be answered in milliseconds, even for massive collections.
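In practice, postings lists are kept sorted so that an AND query is answered with a linear merge over just the two lists, in O(len(a) + len(b)) time, instead of an O(N) scan of the whole collection. A sketch (the doc IDs are illustrative):

```python
# Intersect two sorted postings lists with a two-pointer merge,
# touching only the postings of the query terms, never the full collection.
def intersect(a, b):
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return result

data      = [1, 3, 7, 12, 40]        # doc IDs containing "data"
retrieval = [1, 2, 3, 8, 12, 100]    # doc IDs containing "retrieval"
print(intersect(data, retrieval))    # [1, 3, 12]
```

Because the merge advances one pointer per step, its cost depends on the postings lengths, not on the number of documents in the collection.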

Strengths of Boolean Queries

●​ Boolean queries provide clear and precise control. A user can specify exactly what
terms must or must not appear in the results.
●​ They are suitable in domains where strict precision is required, such as legal document
retrieval, patent search, or medical research databases.
●​ They are straightforward conceptually, since Boolean operators (AND, OR, NOT) are
widely understood.
Weaknesses of Boolean Queries

●​ The Boolean model does not provide ranking. All retrieved documents are considered
equally relevant, which is not realistic in most situations.
●​ It does not allow partial matching. If a document contains many of the query terms but
misses one, it is excluded, even if it is otherwise useful.
●​ Boolean queries can be difficult to construct for average users. Users may not know
how to best combine AND, OR, and NOT to capture their information need.
●​ In practice, Boolean retrieval often produces either too many documents (feast) or too
few documents (famine).
