
Module 2


What is the Fuzzy Set Model?

When we search for something (for example, on Google), we use keywords, but keywords rarely
describe exactly what we want. Likewise, a document may be related to the query without
containing the exact keywords.
So the match between a query and a document is not exact; it is vague or approximate.
The Fuzzy Set Model is designed to handle this vagueness.
• Every query term (like a keyword) is treated as a fuzzy set.
• Each document has a degree of membership in that fuzzy set.
• Instead of saying a document is relevant or not (0 or 1), we say it's partially relevant, like
0.6 or 0.9.
Fuzzy Set Theory
• In normal logic (Boolean), something is either in a set or not (0 or 1).
• In fuzzy logic, it can be partially in. For example:
o Membership = 1 → fully in the set (very relevant)
o Membership = 0 → not in the set at all (not relevant)
o Membership = 0.5 → somewhat in the set (partially relevant)

Why Use Fuzzy Sets in IR?


Because relevance is not always black or white. Documents might be:
• Strongly relevant
• Partially relevant
• Slightly related
Fuzzy sets let us measure this degree and rank results more accurately.
Fuzzy Operations
These operations combine or modify fuzzy sets:

| Operation    | Meaning                                                          |
|--------------|------------------------------------------------------------------|
| Complement   | Opposite of the set (e.g., not relevant documents)               |
| Union        | Combines multiple sets (e.g., documents relevant to term A or B) |
| Intersection | Common elements (e.g., documents relevant to A and B)            |
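
The notes do not give formulas for these operations. In classical fuzzy set theory they are commonly defined as 1 − μ for the complement, the maximum of the memberships for union, and the minimum for intersection. The sketch below assumes those definitions; the second term and all membership values are made up for illustration.

```python
# Fuzzy sets as dicts mapping document id -> degree of membership in [0, 1].
# The membership values below are illustrative, not taken from real data.
healthy = {"D1": 0.9, "D2": 0.5, "D3": 0.2}
snacks = {"D1": 0.7, "D2": 0.8, "D3": 0.1}


def complement(a):
    """Complement: membership becomes 1 - original membership."""
    return {doc: 1.0 - mu for doc, mu in a.items()}


def union(a, b):
    """Union (term A OR term B): take the larger membership."""
    docs = set(a) | set(b)
    return {doc: max(a.get(doc, 0.0), b.get(doc, 0.0)) for doc in docs}


def intersection(a, b):
    """Intersection (term A AND term B): take the smaller membership."""
    docs = set(a) | set(b)
    return {doc: min(a.get(doc, 0.0), b.get(doc, 0.0)) for doc in docs}
```

With these values, D2 belongs to "healthy OR snacks" with degree 0.8 and to "healthy AND snacks" with degree 0.5.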

Prof. Harsha Zope


Example
Let’s say you search for:
Query : "healthy snacks"
And we have documents:

| Document | Relevance (membership)   |
|----------|--------------------------|
| D1       | 0.9 (highly relevant)    |
| D2       | 0.5 (partially relevant) |
| D3       | 0.2 (barely relevant)    |

The fuzzy model uses these partial scores to sort documents better than just saying "yes" or "no".
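
As a small illustration, ranking by membership is just a sort on the scores from the table above. The function name `rank` is a hypothetical helper, not part of any library.

```python
# Membership degrees from the example table, deliberately out of order.
memberships = {"D3": 0.2, "D1": 0.9, "D2": 0.5}


def rank(memberships):
    """Return (document, membership) pairs, most relevant first."""
    return sorted(memberships.items(), key=lambda item: item[1], reverse=True)


ranking = rank(memberships)
```

A Boolean model could only say which documents match; the sorted list keeps the information that D1 is a better answer than D2.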

• It uses degrees of relevance (not just 0 or 1).
• It gives more realistic and flexible search results.

What is Fuzzy Information Retrieval?


Fuzzy Information Retrieval (FIR) is used when exact keyword matching isn't enough. It helps
the system find relevant documents even when:
• A word is misspelled
• A word is close in meaning
• A word has similar spelling

1. Query Expansion:
o The system takes your search query and adds similar or related terms using a
thesaurus or dictionary.
o This helps find more documents that may be useful (even if they don’t exactly
match your keywords).
Example:
You search for "computer", and the system also looks for:
o "compute"
o "compiter"
o "computter"
...even if those are typos or spelling variations.

2. Handling Spelling Mistakes:
o Fuzzy search helps when you mistype a word.
o It matches words whose spelling and character positions are similar to the query.
Example:
Search for: comptuer
It finds: computer, commuter, compter, etc.

3. Term-Term Correlation Matrix:
o The system builds a matrix that records how closely related each pair of terms is,
typically based on how often they occur in the same documents.
o From these correlations it calculates a score (a degree of relevance) for each document.

4. Algebraic Operations (instead of Boolean):
o Instead of simple AND/OR logic (Boolean), fuzzy retrieval uses math operations
like sum and product to calculate how relevant a document is.
o This gives a gradual score rather than just "yes" or "no".

5. Trade-off:
o Increases recall → finds more relevant documents.
o Decreases precision → may include some less relevant ones too.
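
Points 3 and 4 can be sketched together. The notes do not state the formulas; the code below assumes one common textbook formulation, in which the correlation between two terms is c = n_ab / (n_a + n_b − n_ab) (document counts), and a document's membership for a query term is the algebraic sum 1 − ∏(1 − c) over the terms in the document. The toy collection and function names are invented for illustration.

```python
# Toy collection: each document is the set of terms it contains (made-up data).
docs = {
    "D1": {"computer", "software", "program"},
    "D2": {"computer", "software"},
    "D3": {"computer", "hardware"},
    "D4": {"garden", "flower"},
}


def correlation(term_a, term_b, docs):
    """c = n_ab / (n_a + n_b - n_ab), using document co-occurrence counts."""
    n_a = sum(term_a in terms for terms in docs.values())
    n_b = sum(term_b in terms for terms in docs.values())
    n_ab = sum(term_a in terms and term_b in terms for terms in docs.values())
    denom = n_a + n_b - n_ab
    return n_ab / denom if denom else 0.0


def membership(query_term, doc_terms, docs):
    """Algebraic sum: 1 - product of (1 - c) over the terms in the document."""
    product = 1.0
    for term in doc_terms:
        product *= 1.0 - correlation(query_term, term, docs)
    return 1.0 - product
```

In this toy data, D3 never mentions "software", yet it receives a membership of about 0.67 for that term because it contains "computer", which co-occurs with "software" elsewhere. D4 shares no related terms and scores 0. This is the recall gain described above, and also why precision can drop.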

Simple Example:
You search: “computer”
The system will also include documents with:
• "computter" (typo)
• "compute" (related)
• "compiter" (misspelling)
It may leave out "commuter", which is similar in spelling but different in meaning, unless asked to include it.
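
One simple way to measure "similar spelling" is edit distance. The notes do not name a specific measure, so the sketch below assumes plain Levenshtein distance (insertions, deletions, substitutions); `fuzzy_matches`, its threshold, and the vocabulary are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def fuzzy_matches(query, vocabulary, max_distance=2):
    """Return vocabulary words within max_distance edits, nearest first."""
    scored = [(edit_distance(query, word), word) for word in vocabulary]
    return [word for dist, word in sorted(scored) if dist <= max_distance]


vocabulary = ["computer", "compute", "commuter", "garden"]
matches = fuzzy_matches("computter", vocabulary)
```

With a threshold of 2 edits, "commuter" also matches, which shows how a looser threshold raises recall at the cost of precision. Tightening to `max_distance=1` keeps only "computer".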



| Concept               | Simple Meaning                           |
|-----------------------|------------------------------------------|
| Fuzzy Search          | Finds similar words, not just exact ones |
| Query Expansion       | Adds related terms to your query         |
| Spelling Match        | Helps find results even with typos       |
| Term Correlation      | Measures how words relate                |
| Algebraic Matching    | Uses math, not just yes/no logic         |
| Recall ↑, Precision ↓ | Finds more, but may be less accurate     |

What is the Extended Boolean Model?


The Extended Boolean Model is an improved version of the traditional Boolean model. It keeps
the AND/OR/NOT logic from Boolean searches, but also adds ranking and term weights, just like
in the Vector Space Model.

Problem with Traditional Boolean Model:


• Only gives yes/no answers: either a document matches or it doesn’t.
• Doesn’t rank results: all matching documents are treated equally.
• Can give too many or too few results.
• No concept of importance of a word.

What Does the Extended Model Add?


1. Partial Matching:
o Even if a document doesn't match all terms, it can still be somewhat relevant and
appear in results.
2. Term Weighting:
o Words are given weights to show how important they are.
o The weight is usually a number between 0 and 1.
o Higher weight = more important in the document or query.
3. Ranking Results:

o The system calculates a score for each document.
o Documents are shown in order from most to least relevant.

Simple Example
Let’s say you search for:
Query: apple AND juice
Boolean Model:
• Only shows documents that have both words.
• No ranking.
Extended Boolean Model:
• Gives higher score to documents that:
o Have both words many times.
o Have "apple" or "juice" as important keywords.
• Returns documents ranked by relevance.

Term Weights in Documents


• For example, a document that mentions “apple” 10 times and “juice” once might be assigned:
o Weight of “apple” = 0.9
o Weight of “juice” = 0.2
• These weights tell the system how strongly each term is related to the document.
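
The notes do not say how the weights are combined into a score. One well-known formulation is the p-norm model, where an OR query is scored by the normalised distance from the all-zero point and an AND query by one minus the normalised distance from the all-one point. The sketch below assumes that model with p = 2; the function names are illustrative.

```python
def or_score(weights, p=2.0):
    """Extended Boolean OR: normalised p-norm distance from the all-zero point."""
    n = len(weights)
    return (sum(w ** p for w in weights) / n) ** (1.0 / p)


def and_score(weights, p=2.0):
    """Extended Boolean AND: 1 minus normalised distance from the all-one point."""
    n = len(weights)
    return 1.0 - (sum((1.0 - w) ** p for w in weights) / n) ** (1.0 / p)


# Term weights from the example: "apple" = 0.9, "juice" = 0.2.
doc_weights = [0.9, 0.2]
apple_and_juice = and_score(doc_weights)
apple_or_juice = or_score(doc_weights)
```

Under these assumptions the document scores roughly 0.43 for "apple AND juice" and roughly 0.65 for "apple OR juice". A strict Boolean AND would only report a match or no match; here the weak "juice" weight lowers the AND score without eliminating the document.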

Summary Table:

| Feature         | Boolean Model    | Extended Boolean Model     |
|-----------------|------------------|----------------------------|
| Logic           | AND/OR/NOT       | AND/OR/NOT + partial match |
| Ranking         | No               | Yes                        |
| Term Importance | All equal        | Weighted                   |
| Result Type     | Exact match only | Ranked & flexible          |
| Matching        | All or nothing   | Allows "close enough"      |

• The Extended Boolean Model is a mix of Boolean and Vector Space Models.
• It helps improve flexibility, relevance, and user experience.



What is Structured Text Retrieval?
Structured Text Retrieval is a model that allows searching based on both content and structure
of the document.
Instead of just searching for words, you can also search based on:
• Where the word appears (e.g., title, heading, caption)
• How it is formatted (e.g., italic, bold)
• The section or page it appears in (e.g., inside a figure or a table)

Why Do We Need It?


Sometimes, people remember more than just the keywords.
They remember how or where those words appeared in the document.

Example:
Let’s say a user remembers:
“I saw the words atomic holocaust in italics next to a figure with the word earth in its label.”

Normal search:
A basic search such as:
    "atomic holocaust" AND "earth"
finds every document that contains both phrases, which is usually far too many results.

Structured Text Retrieval:


You can search with structure, like:
    same-page( near("atomic holocaust", Figure(label("earth")) ) )
This means:
“Find a page where atomic holocaust is near a figure whose label has the word earth.”

The result set is now much more precise.
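
To make the idea concrete, here is a rough sketch of how such a query could be evaluated. The page representation, the function `same_page_near`, and the `max_gap` notion of "near" are all assumptions made for illustration; they are not the syntax or semantics of any real query engine, and the italics condition is ignored.

```python
# Hypothetical in-memory layout: each page is a list of elements in reading
# order. All contents are made up for illustration.
pages = [
    [
        {"kind": "text", "content": "Introduction to geology"},
        {"kind": "figure", "label": "Layers of the earth"},
    ],
    [
        {"kind": "text", "content": "fears of an atomic holocaust grew"},
        {"kind": "figure", "label": "The earth seen from orbit"},
    ],
    [
        {"kind": "text", "content": "an atomic holocaust was averted"},
        {"kind": "text", "content": "unrelated paragraph"},
    ],
]


def same_page_near(pages, phrase, label_word, max_gap=1):
    """Return page numbers where text containing `phrase` lies within
    `max_gap` elements of a figure whose label contains `label_word`."""
    hits = []
    for page_no, elements in enumerate(pages, start=1):
        text_pos = [i for i, e in enumerate(elements)
                    if e["kind"] == "text" and phrase in e["content"]]
        fig_pos = [i for i, e in enumerate(elements)
                   if e["kind"] == "figure" and label_word in e["label"].lower()]
        if any(abs(t - f) <= max_gap for t in text_pos for f in fig_pos):
            hits.append(page_no)
    return hits


result = same_page_near(pages, "atomic holocaust", "earth")
```

Only page 2 satisfies both conditions on the same page. A plain AND over the whole document could not tell the three pages apart.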

What Does It Support?

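| Feature     | Meaning                                               |
|-------------|-------------------------------------------------------|
| Text search | Find words or phrases                                 |
| Structure   | Specify where it appears (title, figure, table, etc.) |
| Formatting  | Italics, bold, headings                               |
| Proximity   | Words near each other or on the same page             |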

Summary

| Concept              | Easy Meaning                                          |
|----------------------|-------------------------------------------------------|
| Structured Retrieval | Search using both words and layout/structure          |
| Classic Search       | Only matches words, no structure                      |
| Better Precision     | Helps narrow down search to exact format/location     |
| Real-life Use        | Helpful for users who remember layout, not exact text |

2.6.2.1 Model Based on Non-Overlapping Lists


What it means:
• The document is split into parts that do not overlap, like:
o Chapter list
o Section list
o Subsection list
(Each is stored as a separate list)



Example:
• A book has:
o Chapter 1, Chapter 2, ...
o Section 1.1, Section 1.2, ...
o Subsection 1.1.1, 1.1.2, ...
Each list (chapters, sections, etc.) is stored separately, and within each list, the parts don’t
overlap.

How it works:
• An index is built so that you can search for:
o Which chapter contains the word “virus”
o Which section does not contain another subsection
o Which paragraph stands alone (not inside a section)
Key Points:

| Feature         | Explanation                                       |
|-----------------|---------------------------------------------------|
| Non-overlapping | Text regions in the same list don't overlap       |
| Separate lists  | Chapters, sections, etc. are stored independently |
| Simple queries  | Search within or outside certain parts            |
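
A minimal sketch of this model, assuming each region is stored as a (name, start, end) range of word positions. The regions, positions, and helper names are invented for illustration.

```python
# Each list holds non-overlapping regions as (name, start, end) word positions.
chapters = [("Chapter 1", 0, 499), ("Chapter 2", 500, 999)]
sections = [("Section 1.1", 0, 199), ("Section 1.2", 200, 499),
            ("Section 2.1", 500, 999)]
subsections = [("Subsection 1.1.1", 0, 99), ("Subsection 1.1.2", 100, 199)]

# Positions where the word "virus" occurs (as an inverted index would report).
virus_positions = [150, 620]


def containing_word(regions, positions):
    """Regions that contain at least one occurrence of the word."""
    return [name for name, start, end in regions
            if any(start <= pos <= end for pos in positions)]


def not_containing_region(regions, others):
    """Regions from one list that contain no region from another list."""
    return [name for name, start, end in regions
            if not any(start <= o_start and o_end <= end
                       for _, o_start, o_end in others)]


chapters_with_virus = containing_word(chapters, virus_positions)
sections_without_subsections = not_containing_region(sections, subsections)
```

Because regions within one list never overlap, each word position falls in at most one region per list, which keeps these lookups simple.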

2.6.2.2 Model Based on Proximal Nodes


What it means:
• The document is structured into hierarchies (like trees):
o Chapter → Section → Paragraph → Line
• These are called nodes.
• Each node points to a part of the text.
Example:
You can define two different hierarchies:
• One based on chapters/sections
• Another based on pages/paragraphs
How queries work:

• If a user query refers to different hierarchies, the answer is taken from only one
hierarchy.
• For example:
o You can search “paragraphs inside sections”
o But not “lines from both pages and sections”

This rule makes the search faster, at the cost of some flexibility.

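| Feature                 | Explanation                                    |
|-------------------------|------------------------------------------------|
| Hierarchical            | Chapters → Sections → Paragraphs, etc.         |
| Nodes                   | Each part is a node that covers some text      |
| One hierarchy per query | Results come from a single structure for speed |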

Final Summary Table:

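| Model                 | Simple Meaning                                                     | Key Feature                                    |
|-----------------------|--------------------------------------------------------------------|------------------------------------------------|
| Non-Overlapping Lists | Document is split into flat, separate parts                        | No overlapping in the same list                |
| Proximal Nodes        | Document is structured like a tree (chapter → section → paragraph) | Queries return results from one hierarchy only |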



Top Part: Hierarchical Structure
• The document is broken into 4 levels:
1. Chapter
2. Sections
3. Subsections
4. Subsubsections
These are connected like a tree: each chapter has sections, each section has subsections, and
so on.

Bottom Part: Inverted List for the Word ‘holocaust’


• The word ‘holocaust’ is stored in an inverted list.
• It points to all the places (positions) in the document where the word appears.

Example:
    holocaust → 10 → 256 → ... → 48,324
This means:
• The word 'holocaust' appears at position 10, then at 256, and so on, all through the
document.

How Searching Works (Query Example):

Suppose we ask:
Find all sections, subsections, or subsubsections that contain the word 'holocaust'.

The system does this in 2 steps:


1. Find the word in the inverted list (like position 10, 256, etc.).
2. Check the hierarchy to see which section, subsection, or subsubsection that position
belongs to.
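
The two steps can be sketched as follows. The `Node` class, the tree, and the position ranges are hypothetical; only the first two positions from the inverted list example (10 and 256) are used.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    kind: str    # "chapter", "section", "subsection" or "subsubsection"
    name: str
    start: int   # first word position covered by this node
    end: int     # last word position covered by this node
    children: list = field(default_factory=list)


# Illustrative hierarchy; the position ranges are made up.
chapter = Node("chapter", "1", 0, 999, [
    Node("section", "1.1", 0, 299, [
        Node("subsection", "1.1.1", 0, 149),
        Node("subsection", "1.1.2", 150, 299, [
            Node("subsubsection", "1.1.2.1", 250, 299),
        ]),
    ]),
    Node("section", "1.2", 300, 999),
])
inverted_index = {"holocaust": [10, 256]}


def nodes_containing(root, word, kinds, index):
    """Step 1: look up positions. Step 2: walk the tree for covering nodes."""
    positions = index.get(word, [])
    found = []

    def visit(node):
        if not any(node.start <= p <= node.end for p in positions):
            return  # no occurrence here, so none in the children either
        if node.kind in kinds:
            found.append((node.kind, node.name))
        for child in node.children:
            visit(child)

    visit(root)
    return found


hits = nodes_containing(
    chapter, "holocaust", {"section", "subsection", "subsubsection"},
    inverted_index)
```

Section 1.2 is skipped without visiting anything below it, because neither position falls inside its range. Pruning of this kind is what makes the word-first, structure-second order efficient.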

Query Language Features:


You can:
• Search for words using regular expressions
• Search by structure (e.g., “section” or “subsection”)
• Combine both (e.g., section that contains the word "holocaust")

Summary:

| Concept       | Easy Meaning                                                                   |
|---------------|--------------------------------------------------------------------------------|
| Hierarchy     | Document is split like a tree: chapter → section → subsection...               |
| Inverted List | Fast lookup: shows where a word appears                                        |
| Query         | You can search by word and structure                                           |
| Efficient     | Fast because it first finds the word, then checks where it is in the structure |
