DEPARTMENT OF COMPUTER ENGINEERING
Class/Sem/Year: B.E./ VII/2023-24(ODD)
Div: (A/B) Subject: Information Retrieval
Assignment No. 2 Solution
S.No Questions
1 Explain Pattern Matching? What are the Different Types of Patterns Used for Pattern
Matching?
Pattern Matching
Pattern Matching is a process of checking and finding a specific sequence of characters,
symbols, or structures within a larger body of data (text, string, or database).
It is widely used in Information Retrieval, Text Processing, Compilers, and Data
Mining.
The main goal of pattern matching is to locate desired information efficiently from a
large collection by comparing the given input (pattern) with stored data.
Types of Patterns Used in Pattern Matching
There are several kinds of patterns, depending on the matching approach:
1. Exact Match Patterns
● The pattern must match the text exactly without any deviation.
● Example: Searching for the word "apple" in a document – only exact "apple" matches
are retrieved.
● Used in: dictionary lookups, database queries.
2. Substring Patterns
● The pattern can occur as a part (substring) within a larger string.
● Example: Searching "cat" will match "concatenate", "educate", etc.
● Useful in: keyword search engines, spell checking.
3. Prefix and Suffix Patterns
● Prefix Pattern: Pattern appears at the start of the string.
Example: "pre" matches "prefix", "preorder".
● Suffix Pattern: Pattern appears at the end of the string.
Example: "ing" matches "playing", "running".
● Used in: autocomplete systems, stemming in IR.
4. Regular Expression (Regex) Patterns
● Patterns defined using special symbols and rules.
● Can match complex text forms like email addresses, phone numbers, or specific
sequences.
● Example: Regex [0-9]+ matches any sequence of digits.
● Used in: text editors, search engines, programming.
5. Wildcard Patterns
● Patterns with placeholders (wildcards) that can match any character(s).
o ? → matches a single character
o * → matches zero or more characters
● Example: ca* matches "cat", "car", "capital".
● Used in: file searching (*.txt), command line tools.
6. Approximate (Fuzzy) Patterns
● Allow partial matching when exact match is not possible.
● Useful when spelling errors, typos, or variations occur.
● Example: "colour" matches "color".
● Used in: spell checkers, DNA sequence analysis, OCR systems.
7. Structural or Syntactic Patterns
● Patterns based on hierarchical structures such as trees or graphs.
● Example: Matching an XML/JSON structure, compiler syntax checking.
● Used in: programming languages, structured data retrieval.
2 List and explain the types of Structured Queries?
Structured Queries
Structured queries are formal queries written in a structured query language (like SQL) or
other IR languages that follow a defined syntax.
They specify exact conditions for retrieving information from databases or information
retrieval systems.
Types of Structured Queries
1. Boolean Queries
● Based on Boolean logic (AND, OR, NOT).
● The query retrieves documents that satisfy exact logical conditions.
● Example:
o Climate AND Change → retrieves documents containing both terms.
o AI OR Machine Learning → retrieves documents containing either.
o Pollution NOT Noise → excludes documents with “Noise.”
● Application: Legal databases, digital libraries, simple IR systems.
2. Vector Space Queries
● Represent documents and queries as vectors in a multidimensional space.
● Relevance is based on cosine similarity between query vector and document vector.
● Example: Searching "machine learning algorithms" will rank documents by closeness
(similarity score).
● Application: Modern search engines, ranking in IR.
3. Probabilistic Queries
● Retrieve documents based on probability of relevance to the user’s query.
● Uses statistical models to estimate if a document is likely relevant.
● Example: "Cancer Treatment" → system calculates probability (like 0.8 relevant, 0.2 not
relevant).
● Application: Medical information retrieval, recommendation systems.
4. Faceted Queries
● Query refinement using facets (categories or attributes).
● Example: Searching for "Laptops" → refine by brand = Dell, price < ₹50,000, RAM ≥
8GB.
● Application: E-commerce websites, digital libraries.
5. Proximity Queries
● Retrieve documents where terms occur near each other (within a certain distance).
● Example: "Artificial w/3 Intelligence" → finds documents where “Artificial” occurs
within 3 words of “Intelligence.”
● Application: Legal documents, news retrieval.
6. Structured Field Queries
● Queries that target specific fields in structured databases.
● Example in SQL:
o SELECT Name FROM Students WHERE Marks > 80;
● Example in IR (library search):
o title: "Renewable Energy" AND author: "Patel"
● Application: Relational databases, library catalogues, metadata search.
3 Explain User Relevance Feedback in Detail.
User Relevance Feedback (RF)
User Relevance Feedback is a technique used in Information Retrieval (IR) where the system
learns from the user’s judgment about which retrieved documents are relevant or not.
● The idea is that initial search results may not always satisfy the user’s intent, so
feedback is taken from the user to improve future results.
● It is an iterative process: the user provides feedback, and the system refines the query or
ranking to deliver better results.
Process of Relevance Feedback
1. Initial Query Submission
o The user enters a query (e.g., "renewable energy sources").
2. Retrieval of Initial Results
o The system retrieves documents based on the query (using Boolean, Vector Space,
etc.).
3. User Feedback Collection
o The user marks some results as Relevant or Not Relevant.
4. Query Refinement
o The system analyzes the feedback and adjusts the query (by adding, removing,
or re-weighting terms).
5. Improved Retrieval
o A new set of results is retrieved, more closely aligned with the user’s actual
information need.
Techniques of Relevance Feedback
1. Explicit Feedback
● The user manually marks documents as relevant/not relevant.
● Example: Clicking thumbs-up/down on search results.
● Advantage: Accurate feedback.
● Disadvantage: Requires user effort.
2. Implicit Feedback
● The system observes user behavior to infer relevance.
● Examples: Clicks, dwell time (time spent on page), scrolling, download.
● Advantage: No extra effort from user.
● Disadvantage: May be noisy (not always correct).
3. Pseudo Relevance Feedback (Blind Feedback)
● The system assumes the top-ranked results are relevant and uses them to expand/refine
the query.
● Example: Google expanding search results using synonyms.
● Advantage: Fully automatic, no user effort.
● Disadvantage: Risk of introducing errors if top results are irrelevant.
Advantages of Relevance Feedback
● Improves retrieval effectiveness (precision & recall).
● Helps in query reformulation when the user query is vague.
● Learns user preferences over time.
Limitations of Relevance Feedback
● Users may be unwilling to give feedback (extra effort).
● Implicit signals may be inaccurate.
● Initial results must contain some relevant documents; otherwise, refinement fails.
4 Write a Short note on Query Protocols.
Query Protocols
A Query Protocol refers to the set of rules and procedures that define how a user’s query is
communicated to an information retrieval (IR) system or database, and how the results are
returned. It standardizes the interaction between the user interface and the search
engine/database.
Functions of Query Protocols
1. Define Query Syntax – Specify how queries should be written (keywords, operators,
fields).
2. Control Communication – Ensure correct transfer of query from client to server.
3. Interpretation of Results – Specify how search results are structured and delivered back.
4. Interoperability – Enable different systems (databases, digital libraries, web services) to
work together.
Types of Query Protocols
1. Boolean Query Protocol
o Uses logical operators AND, OR, NOT.
o Example: climate AND change.
2. SQL (Structured Query Language)
o Standard database query protocol for structured data.
o Example: SELECT * FROM Students WHERE Marks > 80;.
3. Z39.50 Protocol
o Standard client–server protocol for searching remote library databases.
o Widely used in digital libraries and bibliographic databases.
4. XQuery / XPath
o Used for querying XML data in structured or semi-structured formats.
o Example: /bookstore/book[price>500].
5. Web Query Protocols
o HTTP/HTTPS used in web search engines.
o User query is submitted through a search form, sent as a request, and results are
returned as web pages.
Advantages
● Standardizes query communication.
● Improves compatibility between heterogeneous systems.
● Enables efficient retrieval across databases, libraries, and web systems.
5 What do you mean by Inverted Files? Explain in detail.
Inverted Files
An Inverted File (or Inverted Index) is a data structure used in Information Retrieval (IR)
systems to efficiently store and search for terms within a large collection of documents.
● It is called “inverted” because instead of mapping documents → words, it maps words
(terms) → documents where they appear.
● This allows for fast full-text search, which is the basis of search engines (like Google,
Bing) and digital libraries.
Structure of an Inverted File
An inverted file typically consists of two main components:
1. Vocabulary (Dictionary):
o A list of all unique terms/keywords appearing in the document collection.
o Each term points to its posting list.
2. Posting List (Inverted List):
o For each term, it stores the list of documents in which the term occurs, along
with additional information like:
▪ Document ID
▪ Term Frequency (TF): number of times the term appears in the document
▪ Position Information: exact word position(s) in the document (useful for
phrase/proximity queries)
Example
Consider 3 documents:
● D1: "Information retrieval is powerful"
● D2: "Information systems are useful"
● D3: "Retrieval of data is important"
Inverted File:
Posting List (Doc IDs &
Term
Frequency)
Informati
{D1:1, D2:1}
on
Retrieval {D1:1, D3:1}
Powerful {D1:1}
Systems {D2:1}
Useful {D2:1}
Data {D3:1}
Important {D3:1}
👉 This structure makes it fast to retrieve documents containing “retrieval” or “information.”
Types of Inverted Files
1. Word-Level Inverted File
o Stores only the list of documents in which each word appears.
o Example: “retrieval” → {D1, D3}.
2. Word-POSition Inverted File
o Stores the position of each word within documents.
o Useful for phrase queries like "information retrieval".
3. Full Inverted File
o Stores complete occurrence information (frequency + position).
o Used in advanced IR systems and search engines.
Advantages of Inverted Files
● Fast search and retrieval (especially for large text collections).
● Supports Boolean, phrase, and proximity queries.
● Efficient for full-text search engines.
Disadvantages of Inverted Files
● Requires large storage space (posting lists can be huge).
● Needs updating when documents are added or removed.
● Construction can be time-consuming for very large datasets.
6 Discuss the various Structures used in Inverted Files.
Structures Used in Inverted Files
An Inverted File (Index) can be implemented using different data structures depending on
the storage, efficiency, and query requirements. These structures organize the dictionary
(vocabulary) and posting lists for fast retrieval.
1. Sequential (Linear) Structure
● Both dictionary and posting lists are stored as sequential files (arrays or linked lists).
● Dictionary: Sorted list of terms.
● Posting List: Sequentially stores document IDs for each term.
● Advantages: Simple to implement, easy to maintain.
● Disadvantages: Searching requires sequential scanning unless dictionary is sorted
(binary search possible).
● Use: Small IR systems.
2. Hashing Structure
● The dictionary terms are stored in a hash table.
● Each term is hashed to a bucket, which contains a pointer to its posting list.
● Advantages: Very fast search (O(1) average time).
● Disadvantages: Collisions may occur, dictionary not stored in sorted order (hard for
prefix queries).
● Use: Real-time search systems.
3. Tree Structure
● The dictionary is stored in a tree-based structure like B-Tree, B+ Tree, or Binary
Search Tree (BST).
● Posting lists are stored as linked lists or arrays, pointed to by the tree nodes.
● Advantages: Efficient for range queries, supports ordered traversal.
● Disadvantages: Slightly more complex than hashing.
● Use: Databases, large-scale IR systems.
4. Bitmaps
● Instead of storing a posting list, a bit vector (bitmap) is used.
● For each term, the bitmap indicates the presence/absence of the term in each document (1
= present, 0 = absent).
● Example (for 5 documents):
o Term “retrieval” → 1 0 1 0 0
● Advantages: Very fast Boolean operations (AND, OR, NOT).
● Disadvantages: High space consumption for large document collections.
● Use: When the number of documents is fixed and not too large.
5. Signature Files
● Documents are represented using bit-string signatures generated by hashing terms.
● Query terms are hashed and matched with document signatures.
● Advantages: Saves space compared to full posting lists.
● Disadvantages: May produce false matches (requires verification).
● Use: Alternative to inverted indexes in some database systems.
6. Clustered or Blocked Inverted Files
● Documents are grouped into blocks or clusters.
● Posting list points to blocks instead of individual documents.
● Advantages: Reduces index size.
● Disadvantages: Less precise than standard posting lists.
● Use: Large-scale IR where memory is limited.
7 What are Boolean Queries? Explain in Brief
Boolean Queries
A Boolean Query is a type of structured query in Information Retrieval (IR) that uses Boolean
logic operators (AND, OR, NOT) to combine keywords for retrieving documents.
● Based on George Boole’s Boolean algebra.
● The search system matches documents exactly according to the logical conditions.
● Commonly used in databases, library systems, and early IR models.
Operators in Boolean Queries
1. AND → Retrieves documents containing all terms.
o Example: Climate AND Change → documents must have both words.
2. OR → Retrieves documents containing at least one term.
o Example: AI OR Machine Learning → documents with either or both terms.
3. NOT → Excludes documents containing a term.
o Example: Pollution NOT Noise → retrieves docs with “Pollution” but without
“Noise.”
Example
● Query: (Renewable AND Energy) OR (Solar AND Power)
● Meaning: Documents must contain either both "Renewable Energy" OR both "Solar
Power.
Advantages
● Simple and easy to use.
● Provides precise control over search results.
Limitations
● Does not rank results (all retrieved documents are equal).
● May retrieve too few or too many results if query is not well-formed.
● Requires users to know exact keywords.
8 Explain TF-IDF Weighting.
TF–IDF Weighting
TF–IDF (Term Frequency – Inverse Document Frequency) is a popular weighting scheme
used in Information Retrieval (IR) and text mining to evaluate how important a word (term)
is to a document in a collection (corpus).
It combines two measures:
1. TF (Term Frequency) – Importance of a term within a document.
2. IDF (Inverse Document Frequency) – Importance of a term across the whole collection
(rarity).
1. Term Frequency (TF)
2.
Advantages of TF–IDF
● Simple and effective.
● Balances local importance (TF) and global rarity (IDF).
● Forms the basis of modern ranking models.
Limitations
● Does not consider semantic meaning (synonyms, context).
● Assumes terms are independent.
● Cannot handle polysemy (same word, multiple meanings).
9 Write a Short note on :
1) Single-word Queries
● A single-word query consists of only one term or keyword.
● The system retrieves all documents that contain this specific word.
● Example: Query = "Blockchain" → retrieves all documents containing the word
"Blockchain".
● Advantages:
o Simple and quick to execute.
o Useful for broad searches.
● Limitations:
o May return too many irrelevant results due to ambiguity.
o Cannot capture the context or precise meaning of the word.
2) Multi-Word Queries
● A query containing two or more terms.
● The system retrieves documents containing all or some of these terms, depending on the
query model (Boolean, Vector Space, etc.).
● Example: "Renewable Energy Sources" → retrieves documents with one or more of the
terms.
● Advantage: More specific than single-word queries.
● Limitation: May still retrieve irrelevant results if words are scattered far apart.
3) Phrase and Proximity Queries
Phrase Queries: Retrieve documents where terms occur next to each other in
the exact order.
● Example: "Information Retrieval" → retrieves documents containing the exact
phrase.
Proximity Queries: Retrieve documents where terms occur within a certain
distance (window) of each other.
● Example: "Artificial w/3 Intelligence" → retrieves documents where “Artificial”
occurs within 3 words of “Intelligence.”
Advantage: Increases precision, useful for names, titles, and specific concepts.
Limitation: More computationally expensive than simple queries.
4) More Complex Queries
● These queries go beyond simple single-word, multi-word, or phrase queries,
and combine multiple conditions, operators, or fields to refine the search.
● They are typically used in advanced IR systems, databases, and web search
engines.
Examples of More Complex Queries
1. Boolean + Phrase Combination
o Example: (“Renewable Energy” AND “Solar Power”) OR “Wind
Turbines”
o Mixes Boolean logic with phrase queries for precision.
2. Fielded / Structured Queries
o Example: title:"Machine Learning" AND author:"Andrew Ng"
o Searches specific metadata fields (title, author, year).
3. Proximity + Boolean Queries
o Example: (Artificial w/5 Intelligence) AND (Neural w/3 Network)
o Combines proximity with Boolean operators.
4. Weighted Queries (Vector Space Model)
o Terms are assigned weights based on importance.
o Example: 0.7 * "Climate" + 0.3 * "Policy"
5. Nested / Hierarchical Queries
o Example: (AI AND (Robotics OR Automation)) AND NOT "Science
Fiction"
o Uses parentheses to control order of evaluation.
Advantages
● Provides highly accurate and customized retrieval.
● Allows complex information needs to be expressed.
● Useful in research databases, legal IR, and scientific search engines.
Limitations
● Requires more effort and knowledge from the user.
● Complex queries may be time-consuming for the system to process.
10 Explain Various Reference Collection/Standard Test Collection in brief.
Reference Collection / Standard Test Collection
A Reference Collection (or Standard Test Collection) is a predefined set of documents,
queries, and relevance judgments used to evaluate and benchmark information retrieval
systems.
It provides a controlled environment to test retrieval effectiveness and compare different
systems or algorithms.
Components of a Test Collection
1. Document Set – A collection of text documents or records.
2. Query Set (Topics) – A set of sample queries or information needs.
3. Relevance Judgments (Ground Truth) – Predefined labels indicating which documents
are relevant for each query.
Popular Standard Test Collections
1. Cranfield Collection
o One of the earliest IR test collections.
o Includes documents, queries, and relevance judgments.
o Used to study and evaluate retrieval performance.
2. TREC (Text REtrieval Conference) Collections
o Large-scale collections developed for benchmarking IR systems.
o Includes diverse corpora: news, web, medical, legal, etc.
o Provides queries and human-judged relevance labels.
3. Reuters-21578 Collection
o Collection of news articles from Reuters.
o Used in text categorization and retrieval experiments.
4. MEDLINE / OHSUMED Collection
o Medical literature documents.
o Used for evaluating medical IR systems.
5. Other Domain-Specific Collections
o Legal, patent, scientific, or social media datasets.
Advantages
● Provides standardized benchmarks for system evaluation.
● Enables comparison between different IR models.
● Helps in research and development of new retrieval algorithms.
Limitations
● May not reflect real-world queries or dynamic data.
● Relevance judgments are subjective, depending on human assessors.
● Updating collections is costly.
11 Explain Retrieval Evaluation Measures.
Retrieval Evaluation Measures
Retrieval Evaluation Measures are metrics used to assess the effectiveness of an
Information Retrieval (IR) system. They measure how well the system retrieves relevant
documents in response to a query.
Evaluation focuses on precision, recall, and other derived metrics to quantify system
performance.
4. Mean Average Precision (MAP)
● Measures the average precision over multiple queries.
● Considers the order of retrieved documents (ranking matters).
● Useful for: Ranking-based IR systems like search engines.
5. Precision-Recall Curve
● A graphical representation of precision vs recall at different thresholds.
● Interpretation:
o Ideal systems maintain high precision at high recall.
o Shows trade-off between precision and recall.
6. Other Measures
1. R-Precision – Precision at R, where R = total number of relevant documents for a query.
2. Discounted Cumulative Gain (DCG / nDCG) – Measures ranked relevance (higher
weight to relevant documents appearing earlier).
3. Average Rank / Mean Reciprocal Rank (MRR) – Focuses on first relevant document
retrieved.
Advantages of Evaluation Measures
● Helps compare different IR systems or algorithms.
● Provides quantitative feedback for system improvement.
● Useful in research, benchmarks, and industry.
Limitations
● Requires a reference collection with relevance judgments.
● Relevance is often subjective, may vary between users.
● Trade-off between precision and recall makes optimization challenging.
12 Explain in brief Distributed Information
Retrieval (DIR)Architectures. (Peer-to-Peer, Broker-Based,Crawling,Metadata Harvesting,
Hybrid)
Distributed Information Retrieval (DIR) Architectures
Distributed Information Retrieval (DIR) refers to IR systems where information is spread
across multiple, heterogeneous sources, and retrieval is performed without centralizing all
data.
There are several architectures used in DIR:
1. Peer-to-Peer (P2P) Architecture
● Description: Every node acts as both client and server, sharing data directly with peers.
● Characteristics: Fully decentralized, no central authority.
● Example: Early file-sharing networks like Gnutella.
● Advantages: Scalable, robust to single-point failures.
● Disadvantages: Query routing and indexing are challenging.
2. Broker-Based (Centralized Coordinator) Architecture
● Description: A central broker receives user queries and forwards them to distributed
databases or servers.
● Characteristics: Broker maintains metadata about available sources.
● Example: Federated search systems in digital libraries.
● Advantages: Efficient query routing, easier to manage.
● Disadvantages: Broker is a potential single point of failure.
3. Crawling-Based Architecture
● Description: Distributed sources are periodically crawled, and documents are indexed
centrally.
● Example: Search engines like Google, Bing.
● Advantages: Fast retrieval from centralized index, simple querying.
● Disadvantages: Requires storage and frequent crawling to stay up-to-date.
4. Metadata Harvesting Architecture
● Description: Only metadata (summaries, abstracts) is collected from distributed sources,
not full documents.
● Example: Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH).
● Advantages: Lightweight, faster than full crawling.
● Disadvantages: Limited search to metadata fields; full documents are not always
searchable.
5. Hybrid Architecture
● Description: Combines two or more DIR approaches (e.g., crawling + broker-based).
● Characteristics: Tries to balance efficiency, scalability, and accuracy.
● Example: Modern enterprise search systems combining metadata and content indexing.
● Advantages: Flexible, scalable, and can optimize retrieval performance.
● Disadvantages: More complex to implement and maintain.
13 Write a Short note on :
A) Multimedia Information Retrieval
Multimedia Information Retrieval (MIR)
Multimedia Information Retrieval refers to the process of searching and retrieving
multimedia data—such as images, audio, video, graphics, and animations—from large
collections based on user queries.
Key Points:
1. Content-Based Retrieval: Uses features of the multimedia content like:
o Images: color, texture, shape, patterns
o Audio: pitch, rhythm, timbre
o Video: motion, frames, key scenes
2. Text-Based Retrieval: Uses metadata or annotations associated with multimedia files,
like captions, tags, or descriptions.
3. Applications:
o Image search engines (e.g., Google Images)
o Video retrieval (e.g., YouTube, surveillance video search)
o Audio/music retrieval (e.g., Shazam)
o Digital libraries, medical imaging, social media search
4. Challenges:
o Large data size and complexity
o Feature extraction and representation
o Semantic gap between user query and multimedia content
o Efficient indexing and retrieval
B) Distributed Information Retrieval
Distributed Information Retrieval (DIR)
Distributed Information Retrieval refers to searching and retrieving information from
multiple, geographically or logically distributed sources rather than a single centralized
database.
Key Points:
1. Architecture: DIR systems can be peer-to-peer, broker-based, crawling-based,
metadata harvesting, or hybrid, depending on how queries and data are managed.
2. Process:
o A user query is sent to multiple sources.
o Each source processes the query locally.
o Results are collected, ranked, and presented to the user.
3. Advantages:
o Access to diverse, heterogeneous data sources.
o Scalable and can handle large distributed datasets.
4. Challenges:
o Query routing and result merging.
o Maintaining consistency and relevance across distributed sources.
o Handling network delays and heterogeneous formats.
Applications:
● Federated search in digital libraries
● Web search across multiple servers
● Enterprise information systems