0% found this document useful (0 votes)

31 views18 pages

IR Assignment 2 Solution

The document provides an overview of various concepts in Information Retrieval, including Pattern Matching, Structured Queries, User Relevance Feedback, Query Protocols, Inverted Files, and Boolean Queries. It explains different types of patterns used in pattern matching, structured queries, and the relevance feedback process, along with their advantages and limitations. Additionally, it discusses the structures used in inverted files and the TF-IDF weighting scheme.

Uploaded by

Vikas Pal

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

31 views18 pages

IR Assignment 2 Solution

Uploaded by

Vikas Pal

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 18

DEPARTMENT OF COMPUTER ENGINEERING

Class/Sem/Year: B.E./ VII/2023-24(ODD)

Div: (A/B) Subject: Information Retrieval
Assignment No. 2 Solution
S.No Questions

1 Explain Pattern Matching? What are the Different Types of Patterns Used for Pattern
Matching?

Pattern Matching

Pattern Matching is a process of checking and finding a specific sequence of characters,

symbols, or structures within a larger body of data (text, string, or database).
It is widely used in Information Retrieval, Text Processing, Compilers, and Data
Mining.

The main goal of pattern matching is to locate desired information efficiently from a
large collection by comparing the given input (pattern) with stored data.

Types of Patterns Used in Pattern Matching

There are several kinds of patterns, depending on the matching approach:

1. Exact Match Patterns

● The pattern must match the text exactly without any deviation.

● Example: Searching for the word "apple" in a document – only exact "apple" matches
are retrieved.

● Used in: dictionary lookups, database queries.

2. Substring Patterns

● The pattern can occur as a part (substring) within a larger string.

● Example: Searching "cat" will match "concatenate", "educate", etc.

● Useful in: keyword search engines, spell checking.

3. Prefix and Suffix Patterns

● Prefix Pattern: Pattern appears at the start of the string.

Example: "pre" matches "prefix", "preorder".
● Suffix Pattern: Pattern appears at the end of the string.
Example: "ing" matches "playing", "running".

● Used in: autocomplete systems, stemming in IR.

4. Regular Expression (Regex) Patterns

● Patterns defined using special symbols and rules.

● Can match complex text forms like email addresses, phone numbers, or specific
sequences.

● Example: Regex [0-9]+ matches any sequence of digits.

● Used in: text editors, search engines, programming.

5. Wildcard Patterns

● Patterns with placeholders (wildcards) that can match any character(s).

o ? → matches a single character

o * → matches zero or more characters

● Example: ca* matches "cat", "car", "capital".

● Used in: file searching (*.txt), command line tools.

6. Approximate (Fuzzy) Patterns

● Allow partial matching when exact match is not possible.

● Useful when spelling errors, typos, or variations occur.

● Example: "colour" matches "color".

● Used in: spell checkers, DNA sequence analysis, OCR systems.

7. Structural or Syntactic Patterns

● Patterns based on hierarchical structures such as trees or graphs.

● Example: Matching an XML/JSON structure, compiler syntax checking.

● Used in: programming languages, structured data retrieval.

2 List and explain the types of Structured Queries?

Structured Queries
Structured queries are formal queries written in a structured query language (like SQL) or
other IR languages that follow a defined syntax.
They specify exact conditions for retrieving information from databases or information
retrieval systems.
Types of Structured Queries
1. Boolean Queries
● Based on Boolean logic (AND, OR, NOT).
● The query retrieves documents that satisfy exact logical conditions.
● Example:
o Climate AND Change → retrieves documents containing both terms.
o AI OR Machine Learning → retrieves documents containing either.
o Pollution NOT Noise → excludes documents with “Noise.”
● Application: Legal databases, digital libraries, simple IR systems.
2. Vector Space Queries
● Represent documents and queries as vectors in a multidimensional space.
● Relevance is based on cosine similarity between query vector and document vector.
● Example: Searching "machine learning algorithms" will rank documents by closeness
(similarity score).
● Application: Modern search engines, ranking in IR.
3. Probabilistic Queries
● Retrieve documents based on probability of relevance to the user’s query.
● Uses statistical models to estimate if a document is likely relevant.
● Example: "Cancer Treatment" → system calculates probability (like 0.8 relevant, 0.2 not
relevant).
● Application: Medical information retrieval, recommendation systems.
4. Faceted Queries
● Query refinement using facets (categories or attributes).
● Example: Searching for "Laptops" → refine by brand = Dell, price < ₹50,000, RAM ≥
8GB.
● Application: E-commerce websites, digital libraries.
5. Proximity Queries
● Retrieve documents where terms occur near each other (within a certain distance).
● Example: "Artificial w/3 Intelligence" → finds documents where “Artificial” occurs
within 3 words of “Intelligence.”
● Application: Legal documents, news retrieval.
6. Structured Field Queries
● Queries that target specific fields in structured databases.
● Example in SQL:
o SELECT Name FROM Students WHERE Marks > 80;
● Example in IR (library search):
o title: "Renewable Energy" AND author: "Patel"
● Application: Relational databases, library catalogues, metadata search.
3 Explain User Relevance Feedback in Detail.
User Relevance Feedback (RF)
User Relevance Feedback is a technique used in Information Retrieval (IR) where the system
learns from the user’s judgment about which retrieved documents are relevant or not.
● The idea is that initial search results may not always satisfy the user’s intent, so
feedback is taken from the user to improve future results.
● It is an iterative process: the user provides feedback, and the system refines the query or
ranking to deliver better results.
Process of Relevance Feedback
1. Initial Query Submission
o The user enters a query (e.g., "renewable energy sources").
2. Retrieval of Initial Results
o The system retrieves documents based on the query (using Boolean, Vector Space,
etc.).
3. User Feedback Collection
o The user marks some results as Relevant or Not Relevant.
4. Query Refinement
o The system analyzes the feedback and adjusts the query (by adding, removing,
or re-weighting terms).
5. Improved Retrieval
o A new set of results is retrieved, more closely aligned with the user’s actual
information need.
Techniques of Relevance Feedback
1. Explicit Feedback
● The user manually marks documents as relevant/not relevant.
● Example: Clicking thumbs-up/down on search results.
● Advantage: Accurate feedback.
● Disadvantage: Requires user effort.
2. Implicit Feedback
● The system observes user behavior to infer relevance.
● Examples: Clicks, dwell time (time spent on page), scrolling, download.
● Advantage: No extra effort from user.
● Disadvantage: May be noisy (not always correct).
3. Pseudo Relevance Feedback (Blind Feedback)
● The system assumes the top-ranked results are relevant and uses them to expand/refine
the query.
● Example: Google expanding search results using synonyms.
● Advantage: Fully automatic, no user effort.
● Disadvantage: Risk of introducing errors if top results are irrelevant.
Advantages of Relevance Feedback
● Improves retrieval effectiveness (precision & recall).
● Helps in query reformulation when the user query is vague.
● Learns user preferences over time.
Limitations of Relevance Feedback
● Users may be unwilling to give feedback (extra effort).
● Implicit signals may be inaccurate.
● Initial results must contain some relevant documents; otherwise, refinement fails.
4 Write a Short note on Query Protocols.
Query Protocols
A Query Protocol refers to the set of rules and procedures that define how a user’s query is
communicated to an information retrieval (IR) system or database, and how the results are
returned. It standardizes the interaction between the user interface and the search
engine/database.
Functions of Query Protocols
1. Define Query Syntax – Specify how queries should be written (keywords, operators,
fields).
2. Control Communication – Ensure correct transfer of query from client to server.
3. Interpretation of Results – Specify how search results are structured and delivered back.
4. Interoperability – Enable different systems (databases, digital libraries, web services) to
work together.
Types of Query Protocols
1. Boolean Query Protocol
o Uses logical operators AND, OR, NOT.
o Example: climate AND change.
2. SQL (Structured Query Language)
o Standard database query protocol for structured data.
o Example: SELECT * FROM Students WHERE Marks > 80;.
3. Z39.50 Protocol
o Standard client–server protocol for searching remote library databases.
o Widely used in digital libraries and bibliographic databases.
4. XQuery / XPath
o Used for querying XML data in structured or semi-structured formats.
o Example: /bookstore/book[price>500].
5. Web Query Protocols
o HTTP/HTTPS used in web search engines.
o User query is submitted through a search form, sent as a request, and results are
returned as web pages.
Advantages
● Standardizes query communication.
● Improves compatibility between heterogeneous systems.
● Enables efficient retrieval across databases, libraries, and web systems.

5 What do you mean by Inverted Files? Explain in detail.

Inverted Files
An Inverted File (or Inverted Index) is a data structure used in Information Retrieval (IR)
systems to efficiently store and search for terms within a large collection of documents.
● It is called “inverted” because instead of mapping documents → words, it maps words
(terms) → documents where they appear.
● This allows for fast full-text search, which is the basis of search engines (like Google,
Bing) and digital libraries.
Structure of an Inverted File
An inverted file typically consists of two main components:
1. Vocabulary (Dictionary):
o A list of all unique terms/keywords appearing in the document collection.
o Each term points to its posting list.
2. Posting List (Inverted List):
o For each term, it stores the list of documents in which the term occurs, along
with additional information like:
▪ Document ID
▪ Term Frequency (TF): number of times the term appears in the document
▪ Position Information: exact word position(s) in the document (useful for
phrase/proximity queries)
Example
Consider 3 documents:
● D1: "Information retrieval is powerful"
● D2: "Information systems are useful"
● D3: "Retrieval of data is important"
Inverted File:
Posting List (Doc IDs &
Term
Frequency)
Informati
{D1:1, D2:1}
on
Retrieval {D1:1, D3:1}
Powerful {D1:1}
Systems {D2:1}
Useful {D2:1}
Data {D3:1}
Important {D3:1}
👉 This structure makes it fast to retrieve documents containing “retrieval” or “information.”
Types of Inverted Files
1. Word-Level Inverted File
o Stores only the list of documents in which each word appears.
o Example: “retrieval” → {D1, D3}.
2. Word-POSition Inverted File
o Stores the position of each word within documents.
o Useful for phrase queries like "information retrieval".
3. Full Inverted File
o Stores complete occurrence information (frequency + position).
o Used in advanced IR systems and search engines.
Advantages of Inverted Files
● Fast search and retrieval (especially for large text collections).
● Supports Boolean, phrase, and proximity queries.
● Efficient for full-text search engines.
Disadvantages of Inverted Files
● Requires large storage space (posting lists can be huge).
● Needs updating when documents are added or removed.
● Construction can be time-consuming for very large datasets.
6 Discuss the various Structures used in Inverted Files.
Structures Used in Inverted Files
An Inverted File (Index) can be implemented using different data structures depending on
the storage, efficiency, and query requirements. These structures organize the dictionary
(vocabulary) and posting lists for fast retrieval.
1. Sequential (Linear) Structure
● Both dictionary and posting lists are stored as sequential files (arrays or linked lists).
● Dictionary: Sorted list of terms.
● Posting List: Sequentially stores document IDs for each term.
● Advantages: Simple to implement, easy to maintain.
● Disadvantages: Searching requires sequential scanning unless dictionary is sorted
(binary search possible).
● Use: Small IR systems.
2. Hashing Structure
● The dictionary terms are stored in a hash table.
● Each term is hashed to a bucket, which contains a pointer to its posting list.
● Advantages: Very fast search (O(1) average time).
● Disadvantages: Collisions may occur, dictionary not stored in sorted order (hard for
prefix queries).
● Use: Real-time search systems.
3. Tree Structure
● The dictionary is stored in a tree-based structure like B-Tree, B+ Tree, or Binary
Search Tree (BST).
● Posting lists are stored as linked lists or arrays, pointed to by the tree nodes.
● Advantages: Efficient for range queries, supports ordered traversal.
● Disadvantages: Slightly more complex than hashing.
● Use: Databases, large-scale IR systems.
4. Bitmaps
● Instead of storing a posting list, a bit vector (bitmap) is used.
● For each term, the bitmap indicates the presence/absence of the term in each document (1
= present, 0 = absent).
● Example (for 5 documents):
o Term “retrieval” → 1 0 1 0 0
● Advantages: Very fast Boolean operations (AND, OR, NOT).
● Disadvantages: High space consumption for large document collections.
● Use: When the number of documents is fixed and not too large.
5. Signature Files
● Documents are represented using bit-string signatures generated by hashing terms.
● Query terms are hashed and matched with document signatures.
● Advantages: Saves space compared to full posting lists.
● Disadvantages: May produce false matches (requires verification).
● Use: Alternative to inverted indexes in some database systems.
6. Clustered or Blocked Inverted Files
● Documents are grouped into blocks or clusters.
● Posting list points to blocks instead of individual documents.
● Advantages: Reduces index size.
● Disadvantages: Less precise than standard posting lists.
● Use: Large-scale IR where memory is limited.
7 What are Boolean Queries? Explain in Brief
Boolean Queries
A Boolean Query is a type of structured query in Information Retrieval (IR) that uses Boolean
logic operators (AND, OR, NOT) to combine keywords for retrieving documents.
● Based on George Boole’s Boolean algebra.
● The search system matches documents exactly according to the logical conditions.
● Commonly used in databases, library systems, and early IR models.
Operators in Boolean Queries
1. AND → Retrieves documents containing all terms.
o Example: Climate AND Change → documents must have both words.
2. OR → Retrieves documents containing at least one term.
o Example: AI OR Machine Learning → documents with either or both terms.
3. NOT → Excludes documents containing a term.
o Example: Pollution NOT Noise → retrieves docs with “Pollution” but without
“Noise.”
Example
● Query: (Renewable AND Energy) OR (Solar AND Power)
● Meaning: Documents must contain either both "Renewable Energy" OR both "Solar
Power.
Advantages
● Simple and easy to use.
● Provides precise control over search results.
Limitations
● Does not rank results (all retrieved documents are equal).
● May retrieve too few or too many results if query is not well-formed.
● Requires users to know exact keywords.

8 Explain TF-IDF Weighting.

TF–IDF Weighting
TF–IDF (Term Frequency – Inverse Document Frequency) is a popular weighting scheme
used in Information Retrieval (IR) and text mining to evaluate how important a word (term)
is to a document in a collection (corpus).
It combines two measures:
1. TF (Term Frequency) – Importance of a term within a document.
2. IDF (Inverse Document Frequency) – Importance of a term across the whole collection
(rarity).
1. Term Frequency (TF)

2.
Advantages of TF–IDF
● Simple and effective.
● Balances local importance (TF) and global rarity (IDF).
● Forms the basis of modern ranking models.
Limitations
● Does not consider semantic meaning (synonyms, context).
● Assumes terms are independent.
● Cannot handle polysemy (same word, multiple meanings).
9 Write a Short note on :
1) Single-word Queries
● A single-word query consists of only one term or keyword.
● The system retrieves all documents that contain this specific word.
● Example: Query = "Blockchain" → retrieves all documents containing the word
"Blockchain".
● Advantages:
o Simple and quick to execute.
o Useful for broad searches.
● Limitations:
o May return too many irrelevant results due to ambiguity.
o Cannot capture the context or precise meaning of the word.
2) Multi-Word Queries
● A query containing two or more terms.
● The system retrieves documents containing all or some of these terms, depending on the
query model (Boolean, Vector Space, etc.).
● Example: "Renewable Energy Sources" → retrieves documents with one or more of the
terms.
● Advantage: More specific than single-word queries.
● Limitation: May still retrieve irrelevant results if words are scattered far apart.
3) Phrase and Proximity Queries
 Phrase Queries: Retrieve documents where terms occur next to each other in
the exact order.
● Example: "Information Retrieval" → retrieves documents containing the exact
phrase.
 Proximity Queries: Retrieve documents where terms occur within a certain
distance (window) of each other.
● Example: "Artificial w/3 Intelligence" → retrieves documents where “Artificial”
occurs within 3 words of “Intelligence.”
 Advantage: Increases precision, useful for names, titles, and specific concepts.
 Limitation: More computationally expensive than simple queries.
4) More Complex Queries
● These queries go beyond simple single-word, multi-word, or phrase queries,
and combine multiple conditions, operators, or fields to refine the search.
● They are typically used in advanced IR systems, databases, and web search
engines.
Examples of More Complex Queries
1. Boolean + Phrase Combination
o Example: (“Renewable Energy” AND “Solar Power”) OR “Wind
Turbines”
o Mixes Boolean logic with phrase queries for precision.
2. Fielded / Structured Queries
o Example: title:"Machine Learning" AND author:"Andrew Ng"
o Searches specific metadata fields (title, author, year).
3. Proximity + Boolean Queries
o Example: (Artificial w/5 Intelligence) AND (Neural w/3 Network)
o Combines proximity with Boolean operators.
4. Weighted Queries (Vector Space Model)
o Terms are assigned weights based on importance.
o Example: 0.7 * "Climate" + 0.3 * "Policy"
5. Nested / Hierarchical Queries
o Example: (AI AND (Robotics OR Automation)) AND NOT "Science
Fiction"
o Uses parentheses to control order of evaluation.
Advantages
● Provides highly accurate and customized retrieval.
● Allows complex information needs to be expressed.
● Useful in research databases, legal IR, and scientific search engines.
Limitations
● Requires more effort and knowledge from the user.
● Complex queries may be time-consuming for the system to process.
10 Explain Various Reference Collection/Standard Test Collection in brief.
Reference Collection / Standard Test Collection
A Reference Collection (or Standard Test Collection) is a predefined set of documents,
queries, and relevance judgments used to evaluate and benchmark information retrieval
systems.
It provides a controlled environment to test retrieval effectiveness and compare different
systems or algorithms.
Components of a Test Collection
1. Document Set – A collection of text documents or records.
2. Query Set (Topics) – A set of sample queries or information needs.
3. Relevance Judgments (Ground Truth) – Predefined labels indicating which documents
are relevant for each query.
Popular Standard Test Collections
1. Cranfield Collection
o One of the earliest IR test collections.
o Includes documents, queries, and relevance judgments.
o Used to study and evaluate retrieval performance.
2. TREC (Text REtrieval Conference) Collections
o Large-scale collections developed for benchmarking IR systems.
o Includes diverse corpora: news, web, medical, legal, etc.
o Provides queries and human-judged relevance labels.
3. Reuters-21578 Collection
o Collection of news articles from Reuters.
o Used in text categorization and retrieval experiments.
4. MEDLINE / OHSUMED Collection
o Medical literature documents.
o Used for evaluating medical IR systems.
5. Other Domain-Specific Collections
o Legal, patent, scientific, or social media datasets.
Advantages
● Provides standardized benchmarks for system evaluation.
● Enables comparison between different IR models.
● Helps in research and development of new retrieval algorithms.
Limitations
● May not reflect real-world queries or dynamic data.
● Relevance judgments are subjective, depending on human assessors.
● Updating collections is costly.
11 Explain Retrieval Evaluation Measures.
Retrieval Evaluation Measures
Retrieval Evaluation Measures are metrics used to assess the effectiveness of an
Information Retrieval (IR) system. They measure how well the system retrieves relevant
documents in response to a query.
Evaluation focuses on precision, recall, and other derived metrics to quantify system
performance.
4. Mean Average Precision (MAP)
● Measures the average precision over multiple queries.
● Considers the order of retrieved documents (ranking matters).
● Useful for: Ranking-based IR systems like search engines.
5. Precision-Recall Curve
● A graphical representation of precision vs recall at different thresholds.
● Interpretation:
o Ideal systems maintain high precision at high recall.
o Shows trade-off between precision and recall.
6. Other Measures
1. R-Precision – Precision at R, where R = total number of relevant documents for a query.
2. Discounted Cumulative Gain (DCG / nDCG) – Measures ranked relevance (higher
weight to relevant documents appearing earlier).
3. Average Rank / Mean Reciprocal Rank (MRR) – Focuses on first relevant document
retrieved.
Advantages of Evaluation Measures
● Helps compare different IR systems or algorithms.
● Provides quantitative feedback for system improvement.
● Useful in research, benchmarks, and industry.
Limitations
● Requires a reference collection with relevance judgments.
● Relevance is often subjective, may vary between users.
● Trade-off between precision and recall makes optimization challenging.
12 Explain in brief Distributed Information
Retrieval (DIR)Architectures. (Peer-to-Peer, Broker-Based,Crawling,Metadata Harvesting,
Hybrid)
Distributed Information Retrieval (DIR) Architectures
Distributed Information Retrieval (DIR) refers to IR systems where information is spread
across multiple, heterogeneous sources, and retrieval is performed without centralizing all
data.
There are several architectures used in DIR:
1. Peer-to-Peer (P2P) Architecture
● Description: Every node acts as both client and server, sharing data directly with peers.
● Characteristics: Fully decentralized, no central authority.
● Example: Early file-sharing networks like Gnutella.
● Advantages: Scalable, robust to single-point failures.
● Disadvantages: Query routing and indexing are challenging.
2. Broker-Based (Centralized Coordinator) Architecture
● Description: A central broker receives user queries and forwards them to distributed
databases or servers.
● Characteristics: Broker maintains metadata about available sources.
● Example: Federated search systems in digital libraries.
● Advantages: Efficient query routing, easier to manage.
● Disadvantages: Broker is a potential single point of failure.
3. Crawling-Based Architecture
● Description: Distributed sources are periodically crawled, and documents are indexed
centrally.
● Example: Search engines like Google, Bing.
● Advantages: Fast retrieval from centralized index, simple querying.
● Disadvantages: Requires storage and frequent crawling to stay up-to-date.
4. Metadata Harvesting Architecture
● Description: Only metadata (summaries, abstracts) is collected from distributed sources,
not full documents.
● Example: Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH).
● Advantages: Lightweight, faster than full crawling.
● Disadvantages: Limited search to metadata fields; full documents are not always
searchable.
5. Hybrid Architecture
● Description: Combines two or more DIR approaches (e.g., crawling + broker-based).
● Characteristics: Tries to balance efficiency, scalability, and accuracy.
● Example: Modern enterprise search systems combining metadata and content indexing.
● Advantages: Flexible, scalable, and can optimize retrieval performance.
● Disadvantages: More complex to implement and maintain.
13 Write a Short note on :
A) Multimedia Information Retrieval
Multimedia Information Retrieval (MIR)
Multimedia Information Retrieval refers to the process of searching and retrieving
multimedia data—such as images, audio, video, graphics, and animations—from large
collections based on user queries.
Key Points:
1. Content-Based Retrieval: Uses features of the multimedia content like:
o Images: color, texture, shape, patterns
o Audio: pitch, rhythm, timbre
o Video: motion, frames, key scenes
2. Text-Based Retrieval: Uses metadata or annotations associated with multimedia files,
like captions, tags, or descriptions.
3. Applications:
o Image search engines (e.g., Google Images)
o Video retrieval (e.g., YouTube, surveillance video search)
o Audio/music retrieval (e.g., Shazam)
o Digital libraries, medical imaging, social media search
4. Challenges:
o Large data size and complexity
o Feature extraction and representation
o Semantic gap between user query and multimedia content
o Efficient indexing and retrieval
B) Distributed Information Retrieval
Distributed Information Retrieval (DIR)
Distributed Information Retrieval refers to searching and retrieving information from
multiple, geographically or logically distributed sources rather than a single centralized
database.
Key Points:
1. Architecture: DIR systems can be peer-to-peer, broker-based, crawling-based,
metadata harvesting, or hybrid, depending on how queries and data are managed.
2. Process:
o A user query is sent to multiple sources.
o Each source processes the query locally.
o Results are collected, ranked, and presented to the user.
3. Advantages:
o Access to diverse, heterogeneous data sources.
o Scalable and can handle large distributed datasets.
4. Challenges:
o Query routing and result merging.
o Maintaining consistency and relevance across distributed sources.
o Handling network delays and heterogeneous formats.
Applications:
● Federated search in digital libraries
● Web search across multiple servers
● Enterprise information systems

Ir Individual Kalkidan Tilahun
No ratings yet
Ir Individual Kalkidan Tilahun
16 pages
Irs CH 3
No ratings yet
Irs CH 3
28 pages
7 B - Query Languages
No ratings yet
7 B - Query Languages
33 pages
Query Languages
No ratings yet
Query Languages
54 pages
Text Databases and Information Retrieval
No ratings yet
Text Databases and Information Retrieval
23 pages
Lecture17 IR
No ratings yet
Lecture17 IR
28 pages
Cs8080 - Information Retrieval Techniques: Sequential Inverted
No ratings yet
Cs8080 - Information Retrieval Techniques: Sequential Inverted
12 pages
Unit II
No ratings yet
Unit II
73 pages
Query Languages and Search Techniques
No ratings yet
Query Languages and Search Techniques
36 pages
Irs Ia 1
No ratings yet
Irs Ia 1
12 pages
Information Retrieval: Adt-V Unit
No ratings yet
Information Retrieval: Adt-V Unit
106 pages
Information Retrieval System MODULE 3 Mumbai University
No ratings yet
Information Retrieval System MODULE 3 Mumbai University
27 pages
IR Unit III - Notes
No ratings yet
IR Unit III - Notes
18 pages
LIBS 894 Assignment Three Classic Models
No ratings yet
LIBS 894 Assignment Three Classic Models
8 pages
Ir Ass1
No ratings yet
Ir Ass1
12 pages
Unit 5 - Query Operations and Languages
No ratings yet
Unit 5 - Query Operations and Languages
11 pages
Information Retrivals Ans
No ratings yet
Information Retrivals Ans
78 pages
Introduction To Information Retrieval
No ratings yet
Introduction To Information Retrieval
50 pages
Department of Information Technology Sub: Information Storage and Retrieval Topic: Basics of IR Mr. Rajpure A.S
No ratings yet
Department of Information Technology Sub: Information Storage and Retrieval Topic: Basics of IR Mr. Rajpure A.S
35 pages
IR Ans
No ratings yet
IR Ans
13 pages
Bulu
No ratings yet
Bulu
47 pages
CompletedUNIT 1 PPT 10.7.17
100% (6)
CompletedUNIT 1 PPT 10.7.17
87 pages
1) Explain User Interaction With IR With The Help of A Diagram
No ratings yet
1) Explain User Interaction With IR With The Help of A Diagram
12 pages
Web Mining UNIT-II Chapter-01 - 02 - 03
No ratings yet
Web Mining UNIT-II Chapter-01 - 02 - 03
19 pages
NLP See
No ratings yet
NLP See
9 pages
IRS Unit 4 by Krishna
No ratings yet
IRS Unit 4 by Krishna
23 pages
1.explain User Search Techniques
No ratings yet
1.explain User Search Techniques
8 pages
CS8080 Irt
No ratings yet
CS8080 Irt
30 pages
Information Retrieval
No ratings yet
Information Retrieval
5 pages
Understanding Query Languages in Search
No ratings yet
Understanding Query Languages in Search
19 pages
5 Unit Notes
100% (1)
5 Unit Notes
166 pages
IR Unit 1
No ratings yet
IR Unit 1
19 pages
IR Lec1
No ratings yet
IR Lec1
26 pages
Chapter 1
No ratings yet
Chapter 1
52 pages
Unit 3
No ratings yet
Unit 3
27 pages
IR
No ratings yet
IR
5 pages
IR-Module 1 and 2
No ratings yet
IR-Module 1 and 2
48 pages
Module 1print
No ratings yet
Module 1print
5 pages
Information Retrieval: Prof: Ehab Ezzat Hassanein
No ratings yet
Information Retrieval: Prof: Ehab Ezzat Hassanein
49 pages
IR Ut-1
No ratings yet
IR Ut-1
16 pages
Information Retrieval System and The Pagerank Algorithm
No ratings yet
Information Retrieval System and The Pagerank Algorithm
37 pages
Informaiton Retrieval and Web Search
No ratings yet
Informaiton Retrieval and Web Search
44 pages
Unit - I - IR
No ratings yet
Unit - I - IR
39 pages
IR (1-7) - Heet
No ratings yet
IR (1-7) - Heet
19 pages
4 TH Unit
No ratings yet
4 TH Unit
13 pages
IR Chapter 1
No ratings yet
IR Chapter 1
29 pages
Understanding Information Retrieval Systems
No ratings yet
Understanding Information Retrieval Systems
59 pages
Intro to Information Retrieval
No ratings yet
Intro to Information Retrieval
47 pages
Unit 1: Introduction and Data Pre-Processing
No ratings yet
Unit 1: Introduction and Data Pre-Processing
71 pages
Irs 1
No ratings yet
Irs 1
4 pages
CSE Information Retrieval Guide
100% (1)
CSE Information Retrieval Guide
33 pages
Information Search and Retrieval
No ratings yet
Information Search and Retrieval
23 pages
Irs Unit - 4
No ratings yet
Irs Unit - 4
29 pages
Intro to Info Retrieval Course
No ratings yet
Intro to Info Retrieval Course
31 pages
IRS Extended
No ratings yet
IRS Extended
15 pages
Introduction to Information Retrieval
No ratings yet
Introduction to Information Retrieval
88 pages
Ir Mod2 Notes
No ratings yet
Ir Mod2 Notes
26 pages
Lecture1 Chap1
No ratings yet
Lecture1 Chap1
22 pages
Query Languages and Query Operation: Chapter Seven
No ratings yet
Query Languages and Query Operation: Chapter Seven
20 pages
2 Introduction To Information Retrieval
No ratings yet
2 Introduction To Information Retrieval
38 pages
Assessment Guide - DBA
No ratings yet
Assessment Guide - DBA
14 pages
Business Applications: Tanmay Roy
No ratings yet
Business Applications: Tanmay Roy
32 pages
Cloud Computing: IaaS Overview and Benefits
No ratings yet
Cloud Computing: IaaS Overview and Benefits
77 pages
NSA Guide Keeping Home Network Secure
No ratings yet
NSA Guide Keeping Home Network Secure
4 pages
Comparison of Objective Image Quality Metrics To Expert Radiologists Scoring of Diagnostic Quality of MR Images
No ratings yet
Comparison of Objective Image Quality Metrics To Expert Radiologists Scoring of Diagnostic Quality of MR Images
9 pages
UltraBot Autonomous Mobile Robot For Indoor UV-C Disinfection
No ratings yet
UltraBot Autonomous Mobile Robot For Indoor UV-C Disinfection
6 pages
FDHGAF
No ratings yet
FDHGAF
27 pages
Mastering Data Visualization Techniques (Part 1)
No ratings yet
Mastering Data Visualization Techniques (Part 1)
20 pages
Python Data Science Course Overview
No ratings yet
Python Data Science Course Overview
61 pages
Digital - Questions - With - Answers - PDF Version 1
No ratings yet
Digital - Questions - With - Answers - PDF Version 1
69 pages
Microservices for Developers
No ratings yet
Microservices for Developers
23 pages
User's Manual: Ns King 1000-2000-3000VA
No ratings yet
User's Manual: Ns King 1000-2000-3000VA
26 pages
Mi Smart Band 6
No ratings yet
Mi Smart Band 6
1 page
Cloudera Hadoop Solutions Overview
No ratings yet
Cloudera Hadoop Solutions Overview
20 pages
K-8000C Using Manual
No ratings yet
K-8000C Using Manual
9 pages
Allen Bradley PLC Programming Guide
100% (5)
Allen Bradley PLC Programming Guide
50 pages
Logic and Theory of Computing Course Outline
No ratings yet
Logic and Theory of Computing Course Outline
73 pages
Firewall App Blocker v1.7
No ratings yet
Firewall App Blocker v1.7
6 pages
Sawtooth Software: Analysis of Traditional Conjoint Using Microsoft Excel: An Introductory Example
No ratings yet
Sawtooth Software: Analysis of Traditional Conjoint Using Microsoft Excel: An Introductory Example
7 pages
Essential Resume Writing Tips for MBA Graduates
No ratings yet
Essential Resume Writing Tips for MBA Graduates
13 pages
Worksheet Chapter 2 Polynomials
No ratings yet
Worksheet Chapter 2 Polynomials
4 pages
A3 DTG T-Shirt Printer Specifications
No ratings yet
A3 DTG T-Shirt Printer Specifications
2 pages
Data Structures Algorithms
No ratings yet
Data Structures Algorithms
27 pages
LCD Television Service Manual: Chassis MT9602
100% (3)
LCD Television Service Manual: Chassis MT9602
64 pages
TIMELINE: Cybercrime Prevention Act of 2012: A Law 11 Years in The Making
No ratings yet
TIMELINE: Cybercrime Prevention Act of 2012: A Law 11 Years in The Making
4 pages
Azure Devops Pipelines Yaml Schema Azure Pipelines
No ratings yet
Azure Devops Pipelines Yaml Schema Azure Pipelines
193 pages
GEN SM-T230 Galaxy Tab 4 KK English User Manual NC4 F3
No ratings yet
GEN SM-T230 Galaxy Tab 4 KK English User Manual NC4 F3
121 pages
Fractal 50 MCQ Quiz
No ratings yet
Fractal 50 MCQ Quiz
3 pages
A Survey For The Application of Blockchain Technology in The Media
No ratings yet
A Survey For The Application of Blockchain Technology in The Media
23 pages
DPlus Hosts
No ratings yet
DPlus Hosts
11 pages

IR Assignment 2 Solution

Uploaded by

IR Assignment 2 Solution

Uploaded by

DEPARTMENT OF COMPUTER ENGINEERING

Class/Sem/Year: B.E./ VII/2023-24(ODD)

Pattern Matching is a process of checking and finding a specific sequence of characters,

Types of Patterns Used in Pattern Matching

There are several kinds of patterns, depending on the matching approach:

1. Exact Match Patterns

●​ Used in: dictionary lookups, database queries.

●​ The pattern can occur as a part (substring) within a larger string.

●​ Example: Searching "cat" will match "concatenate", "educate", etc.

●​ Useful in: keyword search engines, spell checking.

3. Prefix and Suffix Patterns

●​ Prefix Pattern: Pattern appears at the start of the string.​

●​ Used in: autocomplete systems, stemming in IR.

4. Regular Expression (Regex) Patterns

●​ Patterns defined using special symbols and rules.

●​ Example: Regex [0-9]+ matches any sequence of digits.

●​ Used in: text editors, search engines, programming.

●​ Patterns with placeholders (wildcards) that can match any character(s).

o​ ? → matches a single character

o​ * → matches zero or more characters

●​ Example: ca* matches "cat", "car", "capital".

●​ Used in: file searching (*.txt), command line tools.

6. Approximate (Fuzzy) Patterns

●​ Allow partial matching when exact match is not possible.

●​ Useful when spelling errors, typos, or variations occur.

●​ Example: "colour" matches "color".

●​ Used in: spell checkers, DNA sequence analysis, OCR systems.

7. Structural or Syntactic Patterns

●​ Patterns based on hierarchical structures such as trees or graphs.

●​ Example: Matching an XML/JSON structure, compiler syntax checking.

●​ Used in: programming languages, structured data retrieval.

2 List and explain the types of Structured Queries?

5 What do you mean by Inverted Files? Explain in detail.

8 Explain TF-IDF Weighting.

You might also like

● Used in: dictionary lookups, database queries.

● The pattern can occur as a part (substring) within a larger string.

● Example: Searching "cat" will match "concatenate", "educate", etc.

● Useful in: keyword search engines, spell checking.

● Prefix Pattern: Pattern appears at the start of the string.

● Used in: autocomplete systems, stemming in IR.

● Patterns defined using special symbols and rules.

● Example: Regex [0-9]+ matches any sequence of digits.

● Used in: text editors, search engines, programming.

● Patterns with placeholders (wildcards) that can match any character(s).

o ? → matches a single character

o * → matches zero or more characters

● Example: ca* matches "cat", "car", "capital".

● Used in: file searching (*.txt), command line tools.

● Allow partial matching when exact match is not possible.

● Useful when spelling errors, typos, or variations occur.

● Example: "colour" matches "color".

● Used in: spell checkers, DNA sequence analysis, OCR systems.

● Patterns based on hierarchical structures such as trees or graphs.

● Example: Matching an XML/JSON structure, compiler syntax checking.

● Used in: programming languages, structured data retrieval.