DEPARTMENT OF COMPUTER ENGINEERING
Class/Sem/Year: B.E./ VII/2025-26(ODD)
Div: (A/B) Subject: Information Retrieval
Assignment No. 1 Solution
1) Define information retrieval and list the classification of
information. (CO 1)
Information retrieval (IR) is the process of obtaining relevant information from a large collection of
data, typically stored in various forms such as text documents, multimedia files, or databases. The
goal of information retrieval is to efficiently retrieve documents or resources that are relevant to a
user's information needs.
Classification of Information Retrieval Systems:
1. Based on Data Type:
. Text Retrieval Systems: Primarily focused on retrieving textual documents.
. Multimedia Retrieval Systems: Designed to retrieve multimedia content such as images, audio,
video, etc.
. Database Retrieval Systems: Aimed at retrieving structured data stored in databases.
2. Based on Retrieval Model:
. Boolean Retrieval Systems: Retrieve documents based on Boolean logic (AND, OR, NOT) matching
user queries with document terms.
. Vector Space Model (VSM): Represents documents and queries as vectors in a multidimensional
space and retrieves documents based on similarity.
. Probabilistic Retrieval Systems: Employ probabilistic models to estimate the likelihood of relevance
of documents to a query.
. Latent Semantic Indexing (LSI): Utilizes singular value decomposition to identify latent semantic
relationships between terms and documents.
. Neural Network-based Retrieval Systems: Utilize neural network architectures for information
retrieval tasks.
3. Based on Access Method:
. Online Retrieval Systems: Users interact with the retrieval system over a network, accessing data
stored remotely.
. Offline Retrieval Systems: Users interact with the retrieval system locally, accessing data stored on
their own systems.
4. Based on Domain:
. Web Search Engines: Retrieve information from the World Wide Web.
. Enterprise Search Systems: Designed for searching within an organization's internal documents and
resources.
. Domain-Specific Retrieval Systems: Tailored for specific domains such as medical, legal, scientific,
etc.
5. Based on User Interaction:
. Batch Retrieval Systems: Process queries in batches without user interaction.
. Interactive Retrieval Systems: Allow users to refine queries based on intermediate results and
feedback.
These classifications provide a framework for understanding the various types of information retrieval
systems and their characteristics. Each type has its own strengths and weaknesses, making them
suitable for different contexts and requirements.
2) Explain the process of the structured text retrieval model. (CO 3)
The Structured Text Retrieval Model is a method used for retrieving information from structured text
documents, where the documents contain well-defined fields or attributes. This model is commonly
used in database retrieval systems or in scenarios where the text data is organized in a structured
format such as XML, JSON, CSV, etc. The process of structured text retrieval involves several steps:
1. Document Preprocessing:
Structured text documents are preprocessed to extract relevant fields or attributes.
This may involve parsing the documents to identify and extract fields, removing noise or irrelevant
information, and performing any necessary normalization.
2. Indexing:
Once the documents are preprocessed, an index is created to facilitate efficient retrieval.
Each field or attribute in the documents is indexed separately to enable precise retrieval based on
specific criteria.
Indexing involves creating data structures (such as inverted indexes) that map terms to the documents
in which they occur and the positions within those documents.
3. Query Processing:
When a user submits a query, the query is processed to identify the relevant terms and fields.
The query may contain keywords or specified criteria for matching particular fields or attributes.
Query processing involves identifying the relevant terms and translating the query into a form suitable
for retrieval.
4. Matching and Ranking:
The structured text retrieval model matches the query against the indexed documents based on the
specified criteria.
Depending on the retrieval model used (e.g., Boolean, Vector Space, etc.), documents are scored or
ranked based on their relevance to the query.
Matching involves comparing the terms in the query with the terms indexed in the documents' fields
and attributes.
5. Result Presentation:
Once the matching and ranking process is complete, the retrieved documents are presented to the user.
Results may be presented in a ranked list, with the most relevant documents appearing at the top.
The presentation may also include highlighting the matched terms or displaying relevant metadata
associated with the documents.
6. User Interaction:
In interactive retrieval systems, users may have the option to refine their queries based on the initial
results.
Users can iteratively adjust their queries or specify additional criteria to further narrow down the
search results.
Overall, the structured text retrieval model enables efficient retrieval of information from structured
text documents by organizing and indexing the data in a way that supports precise querying and
matching based on specific fields or attributes.
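The indexing and matching steps above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the records, the field names (`title`, `author`), and the helper names `build_field_index` and `search` are hypothetical, tokenization is plain whitespace splitting, and ranking is by raw occurrence count rather than any particular retrieval model.

```python
from collections import defaultdict

# Hypothetical structured records; the fields and values are made up for illustration.
DOCS = {
    1: {"title": "Introduction to Information Retrieval", "author": "Manning"},
    2: {"title": "Modern Information Retrieval", "author": "Baeza-Yates"},
    3: {"title": "Database Systems", "author": "Manning"},
}


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple normalization)."""
    return text.lower().split()


def build_field_index(docs):
    """Build one inverted index per field: field -> term -> {doc_id: [positions]}."""
    index = defaultdict(lambda: defaultdict(dict))
    for doc_id, fields in docs.items():
        for field, text in fields.items():
            for pos, term in enumerate(tokenize(text)):
                index[field][term].setdefault(doc_id, []).append(pos)
    return index


def search(index, criteria):
    """Return ids of documents that satisfy every (field, term) criterion.

    Results are ordered by the total number of matching occurrences,
    with the document id as a tie-breaker.
    """
    if not criteria:
        return []
    postings = [index.get(field, {}).get(term.lower(), {}) for field, term in criteria]
    matching = set(postings[0]).intersection(*postings[1:])

    def score(doc_id):
        return sum(len(p[doc_id]) for p in postings)

    return sorted(matching, key=lambda d: (-score(d), d))


INDEX = build_field_index(DOCS)
print(search(INDEX, [("title", "retrieval"), ("author", "manning")]))
```

Keeping a separate index per field is what lets a query such as "title contains retrieval AND author is Manning" be answered without scanning the documents themselves.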
3) Explain the taxonomy of information retrieval models. (CO 1,2)
The taxonomy of information retrieval (IR) models categorizes various approaches used to retrieve
relevant information from a collection of documents. These models can be classified based on
different factors such as the representation of documents and queries, the ranking mechanism, and the
underlying mathematical frameworks. Here's a breakdown of the taxonomy:
1. Representation-Based Taxonomy:
Binary Models: Documents and queries are represented as binary vectors indicating the presence or
absence of terms.
Boolean Models: Queries consist of Boolean expressions (AND, OR, NOT) over terms; matching
documents are retrieved based on exact term matches.
Term Frequency Models: Documents and queries are represented as vectors indicating the frequency
of each term occurrence.
Vector Space Models: Documents and queries are represented as vectors in a multidimensional
space, where each dimension corresponds to a term, and similarity is measured using metrics like
cosine similarity.
2. Probabilistic Taxonomy:
Probabilistic Models: Based on the probability of relevance of documents given a query. Examples
include the Binary Independence Model (BIM) and the Okapi BM25 model.
Language Models: Treat both documents and queries as collections of words and model their
generation process using statistical language models.
3. Model-Based Taxonomy:
Classical IR Models: Include Boolean, Vector Space, and Probabilistic models which form the
foundation of traditional IR.
Latent Semantic Models: Capture the latent relationships between terms and documents, such as
Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA).
Machine Learning Models: Utilize machine learning algorithms to learn the relevance of documents
to queries. Examples include Support Vector Machines (SVM), Neural Networks, and Random
Forests.
4. Task-Based Taxonomy:
Ad Hoc Retrieval: Focuses on retrieving relevant documents for a single query without considering
the user's context or history.
Relevance Feedback: Incorporates user feedback to refine the retrieval process iteratively.
Filtering: Involves automatically selecting and delivering relevant documents to users based on
predefined criteria or user profiles.
Cross-Language Retrieval: Addresses the retrieval of documents in languages different from the
query language.
5. Hybrid Taxonomy:
Combination Models: Combine different IR models or techniques to improve retrieval effectiveness.
For example, combining a Boolean model with a vector space model.
Fusion Models: Integrate retrieval results from multiple sources or models to generate a unified
ranking of documents.
These taxonomies provide a structured way to understand the diverse range of information retrieval
models and their characteristics. Researchers and practitioners often choose models based on the
specific requirements of their applications and the nature of the data they are dealing with.
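The difference between the binary and term-frequency representations described in this taxonomy can be seen in a small sketch. The two sample documents and the helper names below are hypothetical, and the vocabulary is simply the sorted set of whitespace-separated terms.

```python
from collections import Counter

# Hypothetical toy collection used only for illustration.
DOCS = ["apple banana apple", "banana cherry"]


def build_vocabulary(texts):
    """Sorted list of distinct terms; each term becomes one vector dimension."""
    return sorted({term for text in texts for term in text.lower().split()})


def tf_vector(text, vocab):
    """Term-frequency representation: how many times each term occurs."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocab]


def binary_vector(text, vocab):
    """Binary representation: 1 if the term is present, 0 otherwise."""
    return [1 if count > 0 else 0 for count in tf_vector(text, vocab)]


VOCAB = build_vocabulary(DOCS)
print(VOCAB)
print(tf_vector(DOCS[0], VOCAB))
print(binary_vector(DOCS[0], VOCAB))
```

For the first document the term-frequency vector records that "apple" occurs twice, while the binary vector only records that it is present.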
4) Explain the role of pattern matching in information retrieval. (CO 3)
Pattern matching plays a crucial role in information retrieval by enabling systems to identify relevant
documents or data based on patterns present in queries and documents. Here's a breakdown of its role:
1. Query Processing:
. When a user submits a query to an information retrieval system, the system needs to match the query
terms with the terms present in the documents' content.
. Pattern matching techniques are used to identify exact or approximate matches between the query
terms and the terms within the documents.
2. Term Matching:
. Information retrieval systems often use pattern matching to match individual terms or phrases in the
query with terms present in the documents.
. This matching process involves considering various factors such as case sensitivity, stemming
(matching different forms of the same word), and synonyms.
3. Wildcards and Regular Expressions:
. Pattern matching allows for the use of wildcards and regular expressions in queries, enabling users to
specify more complex search patterns.
. Wildcards such as '*' or '?' can represent variable characters, while regular expressions provide a
powerful way to define flexible search patterns.
4. Approximate Matching:
. In some cases, exact term matching may not be sufficient, especially when dealing with noisy or
misspelled text.
. Pattern matching techniques such as edit distance algorithms or phonetic matching are used to find
approximate matches for query terms, allowing retrieval of relevant documents even if the exact terms
do not match.
5. Metadata Matching:
. Pattern matching is also used to match query terms with metadata associated with documents, such as
titles, authors, dates, etc.
. This allows for precise retrieval based on specific metadata criteria specified in the query.
6. Faceted Search:
. In systems employing faceted search, pattern matching is used to match query terms with predefined
facets or categories associated with the documents.
. This enables users to narrow down search results based on specific facets or attributes.
Overall, pattern matching is essential in information retrieval as it enables systems to efficiently
match queries with relevant documents based on various patterns present in both the query and
document content. It forms the foundation for accurate and effective retrieval of information from
large collections of data.
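Two of the techniques mentioned above, wildcard matching and edit-distance matching, can be sketched as follows. This is a minimal illustration, assuming a small hypothetical vocabulary; the wildcard part relies on Python's standard `fnmatch` module, and the approximate part uses the Levenshtein distance as one possible edit-distance measure.

```python
import fnmatch

# Hypothetical index vocabulary used only for illustration.
VOCAB = ["retrieval", "retrieve", "retriever", "relevance", "ranking"]


def wildcard_match(pattern, terms):
    """Terms matching a shell-style pattern: '*' is any run of characters, '?' is one character."""
    return [t for t in terms if fnmatch.fnmatchcase(t, pattern)]


def edit_distance(a, b):
    """Levenshtein distance by dynamic programming, keeping only the previous row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete a character of a
                            curr[j - 1] + 1,      # insert a character of b
                            prev[j - 1] + cost))  # substitute, or keep a match
        prev = curr
    return prev[-1]


def fuzzy_match(query, terms, max_distance=1):
    """Terms within max_distance edits of the (possibly misspelled) query term."""
    return [t for t in terms if edit_distance(query, t) <= max_distance]


print(wildcard_match("retriev*", VOCAB))
print(fuzzy_match("retrival", VOCAB))
```

Scanning the whole vocabulary is fine for a sketch; a real system would normally use an index structure designed for wildcard or fuzzy lookup rather than a linear scan.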
5) State the difference between data retrieval and information retrieval. (CO 1)
6) Compare Boolean Model and Vector Model. (CO 2)
Feature | Boolean Model | Vector Model
Representation | Documents and queries represented as sets of keywords (terms). | Documents and queries represented as vectors in a multi-dimensional space (terms as dimensions).
Matching Criterion | Exact matching based on Boolean logic (AND, OR, NOT). Documents either fully match or don't match the query. | Partial matching based on similarity measures (e.g., cosine similarity). Documents are ranked by relevance.
Output | Binary: document is either relevant or not relevant. | Ranked list of documents according to degree of relevance.
Query Flexibility | Queries are rigid; complex Boolean expressions needed to refine. | Queries are flexible; can handle vague or approximate queries.
Handling Term Frequency | Does not consider term frequency or importance. | Considers term frequency and inverse document frequency (TF-IDF), reflecting term importance.
Complexity | Simple and easy to implement. | More computationally intensive due to vector calculations.
Use Case | Good for precise searches where exact matches are needed. | Better for ranked retrieval and handling large document collections with noisy data.
Example | Query: (apple AND banana) NOT orange; documents must contain apple and banana but not orange. | Query vector compared with document vectors, ranking documents by cosine similarity score.
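The contrast in the table can be made concrete with a short sketch that runs both models over the same toy collection. The documents and function names are hypothetical, and the vector model here uses raw term counts as weights, which is a simplification of the TF-IDF weighting mentioned above.

```python
import math
from collections import Counter

# Hypothetical toy collection used only for illustration.
DOCS = {
    "d1": "apple banana smoothie",
    "d2": "apple banana orange salad",
    "d3": "apple pie apple tart",
}


def terms(text):
    return text.lower().split()


def boolean_query(docs):
    """Evaluate (apple AND banana) NOT orange with set operations."""
    def having(term):
        return {d for d, text in docs.items() if term in terms(text)}
    return (having("apple") & having("banana")) - having("orange")


def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0


def vector_rank(query, docs):
    """Rank every document by cosine similarity to the query (raw term counts as weights)."""
    q = Counter(terms(query))
    scores = {d: cosine(q, Counter(terms(text))) for d, text in docs.items()}
    return sorted(scores, key=lambda d: (-scores[d], d))


print(boolean_query(DOCS))
print(vector_rank("apple banana", DOCS))
```

The Boolean query returns an unordered set of exact matches, while the vector model returns every document in ranked order, including ones that match the query only partially.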
7) Which are different classic IR models? Briefly explain the probabilistic IR model
(CO 2)
In information retrieval (IR), classic models are foundational approaches used to index,
retrieve, and rank documents based on user queries. Here are some of the key classic IR
models:
1. Boolean Model: This model retrieves documents based on boolean logic (AND, OR,
NOT). Queries are expressed as boolean expressions, and documents are retrieved if
they meet the exact boolean criteria.
2. Vector Space Model: Documents and queries are represented as vectors in a multi-
dimensional space. The relevance of documents is determined by calculating the
similarity between the query vector and document vectors, often using measures like
cosine similarity.
3. Probabilistic Model: This model estimates the probability that a document is relevant
to a given query. It incorporates statistical methods to rank documents based on their
likelihood of relevance.
4. Latent Semantic Analysis (LSA): This model improves upon the vector space model
by reducing the dimensionality of term-document matrices. It captures semantic
relationships between terms and documents by identifying patterns in the co-
occurrence of terms.
5. Language Models: This approach estimates a statistical language model for each
document. Documents are ranked based on the likelihood that the given
query would be generated by the language model of the document.
Probabilistic IR Model
The probabilistic information retrieval model is grounded in probability theory and aims to
rank documents based on their likelihood of being relevant to a user's query. Here's a brief
overview:
1. Core Idea: The probabilistic model estimates the probability that a document is
relevant to a given query. It relies on the assumption that relevance can be represented
probabilistically rather than deterministically.
2. Relevance Probability: For each document, the model calculates the probability that
the document is relevant to the query. This probability is denoted as P(R | Q, D),
where R indicates relevance, Q is the query, and D is the document.
3. Ranking: Documents are ranked based on their estimated relevance probabilities. The
higher the probability that a document is relevant, the higher it is ranked in the search
results.
4. BM25: One of the most well-known probabilistic models is BM25 (Best Match 25),
which extends the basic probabilistic model. It combines inverse document frequency
with a saturating term-frequency component and document length normalization. Two
tuning parameters, usually written k1 (term-frequency saturation) and b (strength of
length normalization), control how much each factor affects the relevance score.
5. Advantages: Probabilistic models can handle uncertainty in relevance and often
provide better ranking results compared to models that rely solely on exact keyword
matches. They can incorporate various factors and use statistical methods to estimate
relevance more effectively.
In summary, the probabilistic IR model provides a sophisticated approach to ranking
documents by estimating their relevance based on probability, offering a more nuanced and
flexible alternative to simpler models like the Boolean approach.
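As an illustration of probabilistic ranking, the following is a minimal sketch of BM25 scoring. It assumes whitespace tokenization, the commonly used default values k1 = 1.2 and b = 0.75, and one common smoothed form of the inverse document frequency; other variants of the formula exist, and the sample documents are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy collection used only for illustration.
DOCS = [
    "information retrieval ranks documents",
    "retrieval retrieval retrieval models",
    "cooking recipes",
]


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document against the query with one common BM25 variant."""
    if not docs:
        return []
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for d in tokenized for term in set(d))

    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Smoothed IDF; the "+ 1" keeps the value non-negative.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            # Term frequency saturates via k1; b scales the length normalization.
            denom = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores


SCORES = bm25_scores("retrieval models", DOCS)
print(SCORES)
```

Documents with no query terms score zero, and repeated occurrences of a term raise the score with diminishing returns rather than linearly.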
8) Compare flat browsing and hypertext browsing models. (CO 2)
Flat Browsing Model:
1. Definition:
- In the flat browsing model, information is organized in a hierarchical or linear structure, where
content is accessed sequentially or through predefined categories.
- Users navigate through the content by following predefined paths or categories without the ability
to directly link between unrelated pieces of information.
2. Characteristics:
- Content is typically organized in a tree-like structure, with parent-child relationships between
categories or pages.
- Navigation is sequential or hierarchical, following a predefined order or taxonomy.
- Users have limited flexibility in exploring unrelated topics or content outside the predefined
structure.
3. Examples:
- Traditional websites with menu navigation structures.
- File systems on computers, organized in directories and subdirectories.
4. Advantages:
- Provides a structured and organized way to navigate content, suitable for scenarios where content
relationships are well-defined.
- Offers predictability and familiarity to users accustomed to hierarchical navigation paradigms.
5. Disadvantages:
- Limited flexibility for exploring content outside predefined paths.
- May not be suitable for complex or interconnected information environments.
Hypertext Browsing Model:
1. Definition:
- In the hypertext browsing model, information is interconnected through hyperlinks, allowing users
to navigate non-linearly between related pieces of content.
- Users can follow links to explore different topics, access related information, and traverse
interconnected content freely.
2. Characteristics:
- Content is interconnected through hyperlinks, allowing for non-linear navigation.
- Users can follow links to navigate between related pages or documents, regardless of their
hierarchical relationships.
- Navigation is fluid and non-linear, enabling users to explore content in a more exploratory manner.
3. Examples:
- Web browsing using hyperlinks on websites.
- Wikis, where users can navigate between interconnected articles.
- E-books or digital documents with embedded hyperlinks.
4. Advantages:
- Offers flexibility for exploring interconnected content and discovering related information.
- Supports non-linear navigation, allowing users to follow their interests and explore content freely.
- Facilitates serendipitous discovery of new information.
5. Disadvantages:
- Can lead to disorientation or loss of context if links are not well-organized or if there are too many
options available.
- Requires effective link management and organization to ensure usability and coherence.
In summary, the flat browsing model provides a structured and hierarchical way to navigate content,
while the hypertext browsing model offers more flexibility and exploration opportunities through
interconnected links. The choice between these models depends on the nature of the content, user
preferences, and the desired user experience.
Feature | Flat Browsing | Hypertext Browsing
Structure | Linear or hierarchical list of items or pages. | Non-linear network of interconnected pages (nodes) via hyperlinks.
Navigation Style | Sequential navigation, moving through items one after another. | Non-sequential navigation, jumping between related pages through links.
User Control | Limited; users browse in a fixed order or within a simple hierarchy. | High; users can choose different paths based on interests and link choices.
Information Access | Access to information is more rigid and predictable. | Access is flexible and exploratory, allowing associative navigation.
Example | Table of contents, file directory structures. | World Wide Web, Wikipedia articles with internal links.
Complexity | Simpler to implement and use. | More complex due to linking and potential for many navigation paths.
Context Awareness | Limited; user often loses context if jumping between unrelated items. | High; links provide contextual relationships between pages.
User Experience | May feel restrictive or linear. | More dynamic and engaging due to freedom of choice and exploration.
9) Explain keyword-based queries. (CO 3)
Keyword-based queries are a common way to search for information using specific words or
phrases that represent the information you're seeking. Here's a breakdown of how they work:
1. Keywords: These are the main terms or phrases you enter into a search engine or
database to describe what you're looking for. For instance, if you're searching for
recipes for chocolate cake, your keywords might be "chocolate cake recipes."
2. Search Process:
- Input: You type your keywords into a search engine or a database query field.
- Matching: The search engine or database uses algorithms to match your
keywords with relevant content. This often involves scanning through
documents, webpages, or entries to find where those keywords appear.
- Ranking: The results are then ranked based on relevance, which can be
influenced by various factors like keyword frequency, the quality of the
content, and the site’s authority.
3. Relevance: The effectiveness of keyword-based queries depends on the precision of
the keywords you use. For example, using "chocolate cake recipe" might yield more
relevant results than just "cake," because it's more specific.
4. Search Operators: Advanced users often employ search operators (like quotation
marks for exact phrases, or minus signs to exclude terms) to refine their queries and
improve the accuracy of search results.
In summary, keyword-based queries are a straightforward way to find information by using
specific words that describe what you're looking for. The more precise your keywords, the
more relevant your search results are likely to be.
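The search operators mentioned above (quotation marks for exact phrases, a minus sign to exclude terms) can be sketched with a small query parser. The syntax handled here and the names `parse_query` and `matches` are assumptions made for illustration, not the behavior of any particular search engine.

```python
import re

# An optional leading '-', followed by either a quoted phrase or a bare word.
TOKEN = re.compile(r'(-?)(?:"([^"]+)"|(\S+))')


def parse_query(query):
    """Split a query into required and excluded items; each item is a list of words."""
    required, excluded = [], []
    for m in TOKEN.finditer(query.lower()):
        negated, phrase, word = m.group(1), m.group(2), m.group(3)
        words = re.findall(r"\w+", phrase or word)
        if words:
            (excluded if negated else required).append(words)
    return required, excluded


def contains(tokens, words):
    """True if the words occur in the token list as one contiguous run."""
    n = len(words)
    return any(tokens[i:i + n] == words for i in range(len(tokens) - n + 1))


def matches(text, parsed):
    """A document matches if it has every required item and no excluded item."""
    required, excluded = parsed
    tokens = re.findall(r"\w+", text.lower())
    return (all(contains(tokens, w) for w in required)
            and not any(contains(tokens, w) for w in excluded))


QUERY = parse_query('"chocolate cake" recipe -vegan')
print(QUERY)
print(matches("Easy chocolate cake recipe", QUERY))
```

Treating a single keyword as a one-word phrase keeps the matching logic uniform for keywords, phrases, and exclusions.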
10) What are the different types of queries in Information Retrieval? (CO 3)
1. Keyword Queries :
Simplest and most common queries.
The user enters just a combination of keywords to retrieve documents.
These keywords are implicitly connected by a logical AND operator.
All retrieval models provide support for keyword queries.
2. Boolean Queries :
Some IR systems allow the Boolean operators AND, OR, NOT, +, -, and
parentheses ( ) to be combined with keyword formulations.
No ranking is involved because a document either satisfies such a
query or does not satisfy it.
A document is retrieved for a Boolean query only if the query
evaluates to true for it under exact term matching.
3. Phrase Queries :
When documents are represented using an inverted keyword index for
searching, the relative order of terms in the document is lost.
To perform exact phrase retrieval, phrases must be encoded in the
inverted index or implemented differently.
This query consists of a sequence of words that make up a phrase.
It is generally enclosed within double quotes.
4. Proximity Queries :
Proximity refers to a search that accounts for how close within a record
multiple terms should be to each other.
The most commonly used proximity search option is a phrase search, which
requires terms to be in exact order.
Other proximity operators can specify how close terms should be to
each other. Some also specify the order of the search terms.
Search engines use various operator names such as NEAR, ADJ
(adjacent), or AFTER.
However, providing support for complex proximity operators becomes
expensive as it requires time-consuming pre-processing of documents,
so it is better suited to smaller document collections than to the
web.
5. Wildcard Queries :
It supports regular expressions and pattern matching-based searching
in text.
Retrieval models do not directly support this query type.
In IR systems, certain kinds of wildcard search support may be
implemented. Example: usually words with any trailing characters
after a given prefix (such as comput*).
6. Natural Language Queries :
There are only a few natural language search engines that aim to
understand the structure and meaning of queries written in natural
language text, generally as a question or narrative.
The system tries to formulate answers for these queries from retrieved
results.
Semantic models can provide support for this query type.
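Phrase and proximity queries both depend on knowing where terms occur, which a positional inverted index provides. The sketch below is a minimal illustration under simple assumptions: whitespace tokenization, hypothetical sample documents, and a hypothetical `near_search` helper that ignores term order, which is only one of several possible semantics for an operator like NEAR.

```python
from collections import defaultdict

# Hypothetical toy collection used only for illustration.
DOCS = {
    1: "information retrieval systems rank documents",
    2: "retrieval of information from the web",
    3: "systems for storing information",
}


def build_positional_index(docs):
    """Map each term to {doc_id: [positions]} so that word order is preserved."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term][doc_id].append(pos)
    return index


def phrase_search(index, phrase):
    """Documents containing the words of the phrase at consecutive positions."""
    words = phrase.lower().split()
    if not words:
        return []
    hits = []
    for doc_id, starts in index.get(words[0], {}).items():
        if any(all(start + k in index.get(w, {}).get(doc_id, [])
                   for k, w in enumerate(words))
               for start in starts):
            hits.append(doc_id)
    return sorted(hits)


def near_search(index, a, b, max_gap):
    """Documents where a and b occur within max_gap positions, in either order."""
    pa, pb = index.get(a.lower(), {}), index.get(b.lower(), {})
    hits = [doc_id for doc_id in pa.keys() & pb.keys()
            if any(abs(i - j) <= max_gap for i in pa[doc_id] for j in pb[doc_id])]
    return sorted(hits)


INDEX = build_positional_index(DOCS)
print(phrase_search(INDEX, "information retrieval"))
print(near_search(INDEX, "information", "retrieval", 2))
```

The phrase query matches only the document where the two words are adjacent and in order, while the proximity query also accepts the document where they appear two positions apart in reverse order.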