Spelling Correction in IR Systems
BSc CS SEM VI
Mumbai University
Information Retrieval
M.S. College of Science, Arts, Commerce, BSc (IT), BSc (CS), B.Com, BMS (Devghar)
Miss …………………………………………………………………………………………
INDEX
SR NO. TOPIC
1. Document Indexing and Retrieval
• Implement an inverted index construction algorithm.
• Build a simple document retrieval system using the
constructed index.
2. Retrieval Models
• Implement the Boolean retrieval model and process
queries.
• Implement the vector space model with TF-IDF weighting
and cosine similarity.
3. Spelling Correction in IR Systems
Practical No: 1
Aim: Document Indexing and Retrieval
● Implement an inverted index construction algorithm.
● Build a simple document retrieval system using the constructed
index.
Theory:
● An Inverted Index is a data structure used in information retrieval
systems to efficiently retrieve documents or web pages containing a
specific term or set of terms.
● In an inverted index, the index is organised by terms (words), and each
term points to a list of documents or web pages that contain that term.
● Inverted indexes are widely used in search engines, database systems, and
other applications where efficient text search is required.
● They are especially useful for large collections of documents, where searching through all the documents would be prohibitively slow.
● An inverted index is an index data structure storing a mapping from content, such as words or numbers, to its locations in a document or a set of documents.
Practical:
Input:
import nltk # Import NLTK to download stopwords
from nltk.corpus import stopwords # Import stopwords from NLTK
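Only the imports survive the page breaks; the construction loop below is a minimal reconstruction. The sample corpus (documents_corpus) is a hypothetical stand-in for the one shown in the original screenshots, and the loop is written so that it ends at the posting-list assignment that survives on the next page:

nltk.download('stopwords')  # Ensure the stopword list is available
stop_words = set(stopwords.words('english'))

# Hypothetical sample corpus (the original corpus was in a screenshot)
documents_corpus = {
    1: "the cat sat on the mat",
    2: "the dog chased the cat",
    3: "dogs and cats are popular pets",
}

inverted_index = {}
for doc_id, text in documents_corpus.items():
    for term in text.lower().split():
        if term in stop_words:
            continue  # Skip stopwords such as 'the' and 'on'
        documents = inverted_index.get(term, set())  # Postings collected so far
        documents.add(doc_id)  # Record that this document contains the term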
        inverted_index[term] = documents
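With the index built, the retrieval side of the aim reduces to a lookup; the helper below is a sketch (the name retrieve is our assumption):

def retrieve(term):
    # Look up the postings set for a single query term
    return sorted(inverted_index.get(term.lower(), set()))

print("Documents containing 'cat':", retrieve("cat"))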
Practical No: 2
Aim: Retrieval Models
● Implement the Boolean retrieval model and process queries.
● Implement the vector space model with TF-IDF weighting and cosine
similarity.
Theory:
A) Boolean Retrieval Model -
● A Boolean model is a fundamental concept in Information Retrieval
(IR) that is used to represent and retrieve documents or
information based on Boolean logic.
Boolean Operators:
● AND: for the query "cats AND dogs," only documents containing both "cats" and "dogs" will be retrieved.
● OR: for "cats OR dogs," documents containing "cats" or "dogs" or both will be retrieved.
● NOT: for "cats NOT dogs," documents containing "cats" but not "dogs" will be retrieved.
B) TF-IDF
● Term Frequency - Inverse Document Frequency (TF-IDF) is a
widely used statistical method in information retrieval.
● It measures how important a term is within a document
relative to a collection of documents.
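One common formulation (scikit-learn's TfidfTransformer applies a smoothed variant of the same idea):
tf(t, d) = (number of occurrences of term t in document d) / (total number of terms in d)
idf(t) = log(N / number of documents containing t), where N is the number of documents in the collection
tf-idf(t, d) = tf(t, d) × idf(t)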
C) Cosine Similarity -
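● Cosine similarity measures the similarity of two documents as the cosine of the angle between their term-weight vectors.
● For vectors A and B: cosine(A, B) = (A · B) / (||A|| × ||B||). A value of 1 means the vectors point in the same direction; 0 means the documents share no terms.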
Practical:
A) Implement the Boolean retrieval model and process queries:
Input:
documents = {
    1: "apple banana orange",
    2: "apple banana",
    3: "banana orange",
    4: "apple"
}
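# The step that builds the term-to-document index fell on the screenshot
# pages; a minimal reconstruction, assuming 'index' maps each term to the
# set of IDs of the documents containing it (as the AND handler below expects):
index = {}
for doc_id, text in documents.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)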
def boolean_and(operands, index):  # Function header reconstructed; the signature is assumed
    result = index.get(operands[0], set())  # Get the set of document IDs for the first operand
    for term in operands[1:]:  # Iterate through the rest of the operands
        result = result.intersection(index.get(term, set()))  # Intersect with each term's document IDs
    return list(result)  # Return the resulting list of document IDs
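# The OR and NOT handlers also fell on the screenshot pages; minimal
# sketches consistent with boolean_and above (the names are assumptions):
def boolean_or(operands, index):
    result = set()
    for term in operands:  # Union of every operand's document IDs
        result = result.union(index.get(term, set()))
    return list(result)

def boolean_not(operand, index, all_doc_ids):
    # Documents in the collection that do not contain the operand
    return list(all_doc_ids - index.get(operand, set()))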
# Example queries
query1 = ["apple", "banana"]  # Query for documents containing both "apple" and "banana"
query2 = ["apple", "orange"]  # Query for documents containing "apple" or "orange"
# Printing results
print("Documents containing 'apple' and 'banana':", result1)
print("Documents containing 'apple' or 'orange':", result2)
print("Documents not containing 'orange':", result3)
print("Performed by 740_Pallavi & 743_Deepak")
Output:
B) Implement the vector space model with TF-IDF weighting and cosine
similarity:
Input:
# Import necessary libraries
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
import nltk  # Import NLTK to download stopwords
from nltk.corpus import stopwords  # Import stopwords from NLTK
import numpy as np  # Import NumPy library
from numpy.linalg import norm  # Import norm function from NumPy's linear algebra module
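# The corpus and vectorisation step fell on the screenshot pages; a minimal
# reconstruction consistent with the variable names used below. The two
# document sets are hypothetical examples:
nltk.download('stopwords')  # Ensure the stopword list is available
trainSet = ["The sky is blue.", "The sun is bright."]
testSet = ["The sun in the sky is bright."]

vectorizer = CountVectorizer(stop_words=stopwords.words('english'))
transformer = TfidfTransformer()

trainVectorizerArray = vectorizer.fit_transform(trainSet).toarray()
testVectorizerArray = vectorizer.transform(testSet).toarray()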
# Fit the transformer to the test set and transform it to TF-IDF representation
transformer.fit(testVectorizerArray)
print()
tfidf = transformer.transform(testVectorizerArray)
print(tfidf.todense())
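# The cosine-similarity step named in the aim was also a screenshot; a
# minimal version using the norm import above:
def cos_sim(a, b):
    # Cosine of the angle between two term-count vectors
    return np.dot(a, b) / (norm(a) * norm(b))

for train_vec in trainVectorizerArray:
    for test_vec in testVectorizerArray:
        print("Cosine similarity:", cos_sim(train_vec, test_vec))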
Output:
Practical No: 3
Aim: Spelling Correction in IR Systems
Theory:
Edit Distance:
● Edit distance (also called Levenshtein distance) is the minimum number of single-character operations (insertions, deletions, and substitutions) required to convert one string into another.
● In spelling correction, a misspelled query term can be corrected by choosing the dictionary word with the smallest edit distance from it.
Practical:
Input:
# A naive recursive Python program to find the minimum number of
# operations required to convert str1 to str2
def editDistance(str1, str2, m, n):
    # If the first string is empty, the only option is to insert all
    # characters of the second string into the first
    if m == 0:
        return n
    # If the second string is empty, the only option is to remove all
    # characters of the first string
    if n == 0:
        return m
    # If the last characters of the two strings are the same, ignore them
    # and get the count for the remaining prefixes
    if str1[m-1] == str2[n-1]:
        return editDistance(str1, str2, m-1, n-1)
    # If the last characters differ, consider all three operations on the
    # last character of the first string, recursively compute the cost of
    # each, and take the minimum of the three values
    return 1 + min(editDistance(str1, str2, m, n-1),    # Insert
                   editDistance(str1, str2, m-1, n),    # Remove
                   editDistance(str1, str2, m-1, n-1))  # Replace

# Driver code
str1 = "sunday"
str2 = "saturday"
print('Edit Distance is:', editDistance(str1, str2, len(str1), len(str2)))
Output:
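The naive recursion above recomputes the same subproblems exponentially many times. A dynamic-programming version (a sketch added here for reference, not part of the original program) solves the same problem in O(m × n) time:

def editDistanceDP(str1, str2):
    m, n = len(str1), len(str2)
    # dp[i][j] holds the edit distance between str1[:i] and str2[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        for j in range(n + 1):
            if i == 0:
                dp[i][j] = j  # Insert all of str2[:j]
            elif j == 0:
                dp[i][j] = i  # Remove all of str1[:i]
            elif str1[i-1] == str2[j-1]:
                dp[i][j] = dp[i-1][j-1]  # Last characters match
            else:
                dp[i][j] = 1 + min(dp[i][j-1],    # Insert
                                   dp[i-1][j],    # Remove
                                   dp[i-1][j-1])  # Replace
    return dp[m][n]

print('Edit Distance (DP):', editDistanceDP("sunday", "saturday"))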
Practical No: 4
Aim: Evaluation Metrics in IR Systems (precision, recall, F-measure, and average precision)
Theory:
1. Precision:
● Precision is the ratio of correctly predicted positive observations to
the total predicted positives.
● It is also called Positive Predictive Value (PPV).
● Precision is calculated using the following formula:
Precision = TP / (TP + FP)
Where:
• TP (True Positives) is the number of instances correctly predicted as positive.
• FP (False Positives) is the number of instances incorrectly predicted as positive.
High precision indicates that the model has a low rate of false positives. In
other words, when the model predicts a positive result, it is likely to be
correct.
2. Recall:
• Recall is the ratio of correctly predicted positive observations to all observations in the actual class.
• Recall is calculated using the following formula:
Recall = TP / (TP + FN)
Where:
• TP (True Positives) is the number of instances correctly predicted as positive.
• FN (False Negatives) is the number of instances incorrectly predicted as negative.
High recall indicates that the model has a low rate of false negatives. In
other words, the model is effective at capturing all the positive instances.
3. F-measure:
• The F-measure is a metric commonly used in performance evaluation.
• It combines precision and recall into a single value, providing a balanced measure of a model's performance.
• It is the harmonic mean of the two: F-measure = 2 × (Precision × Recall) / (Precision + Recall).
4. Average Precision:
• Average Precision averages the precision values computed at each rank at which a relevant result is retrieved.
• Algorithm:
Input:
'''
(Optional)
PPT values:
true_positive = 20
false_positive = 10
false_negative = 30
'''
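# The input values and the metric computations fell on the screenshot
# pages; reconstructed here using the PPT values quoted above:
true_positive = 20
false_positive = 10
false_negative = 30

precision = true_positive / (true_positive + false_positive)
recall = true_positive / (true_positive + false_negative)
f_measure = 2 * precision * recall / (precision + recall)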
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F-measure: {f_measure}")
Output:
Input:
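The average-precision program was a screenshot in the original; the sketch below shows the usual computation. The relevance list is a hypothetical example (1 marks a relevant result at that rank):

relevance = [1, 0, 1, 1, 0]  # Hypothetical relevance judgements by rank
precisions = []
hits = 0
for rank, rel in enumerate(relevance, start=1):
    if rel:
        hits += 1
        precisions.append(hits / rank)  # Precision at each relevant rank
average_precision = sum(precisions) / len(precisions)
print(f"Average Precision: {average_precision}")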
Practical No: 5
Aim: Text Classification using the Naive Bayes classifier
Theory:
Naive Bayes
• The Naïve Bayes algorithm is a supervised learning algorithm based on Bayes' theorem, used for solving classification problems.
• It is mainly used in text classification with high-dimensional training datasets.
• The Naive Bayes classifier is one of the simplest and most effective classification algorithms, and it helps in building fast machine learning models that can make quick predictions.
• It is a probabilistic classifier, which means it predicts on the basis of the probability of an object belonging to each class.
• Some popular applications of the Naive Bayes algorithm are spam filtering, sentiment analysis, and classifying articles.
Bayes' Theorem:
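P(A|B) = P(B|A) × P(A) / P(B)
Where P(A|B) is the posterior probability of hypothesis A given evidence B, P(B|A) is the likelihood, P(A) is the prior probability, and P(B) is the probability of the evidence.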
Practical:
Input:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report
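# The dataset loading, training, and evaluation steps fell on the screenshot
# pages; a minimal reconstruction. The inline dataset is a hypothetical
# stand-in for the CSV the original presumably loaded with pandas:
data = pd.DataFrame({
    'text': ["win money now", "meeting at noon", "free prize waiting",
             "lunch tomorrow?", "claim your reward", "project deadline today"],
    'label': ["spam", "ham", "spam", "ham", "spam", "ham"],
})

X_train, X_test, y_train, y_test = train_test_split(
    data['text'], data['label'], test_size=0.33, random_state=42)

vectorizer = CountVectorizer()  # Bag-of-words features
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

model = MultinomialNB()  # Multinomial Naive Bayes classifier
model.fit(X_train_vec, y_train)
predictions = model.predict(X_test_vec)

print("Accuracy:", accuracy_score(y_test, predictions))
print(classification_report(y_test, predictions))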
Output:
Practical No: 6
Aim: Document Clustering using the K-Means algorithm
Theory:
K-Means Clustering:
● K-Means Clustering is an Unsupervised Learning algorithm, which
groups the unlabeled dataset into different clusters.
● Here K defines the number of predefined clusters that need to be created
in the process, as if K=2, there will be two clusters, and for K=3, there
will be three clusters, and so on.
● It allows us to cluster the data into different groups and is a convenient way to discover the categories of groups in an unlabeled dataset on its own, without the need for any training.
● The main aim of this algorithm is to minimise the sum of distances between each data point and its corresponding cluster centroid.
● The algorithm takes the unlabeled dataset as input, divides the dataset
into k-number of clusters, and repeats the process until it does not find the
best clusters. The value of k should be predetermined in this algorithm.
Step-1: Select the number K to decide the number of clusters.
Step-2: Select K random points as the initial centroids.
Step-3: Assign each data point to its closest centroid, which forms the K clusters.
Step-4: Recompute the centroid of each cluster from the points assigned to it.
Step-5: Repeat the third step, which means reassign each data point to the new closest centroid of its cluster.
Step-6: If any reassignment occurs, then go to step-4 else go to FINISH.
Step-7: The model is ready.
Practical
Input:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

documents = ["Cats are known for their agility and grace",                 # cat doc1
             "Dogs are often called ‘man’s best friend’.",                 # dog doc1
             "Some dogs are trained to assist people with disabilities.",  # dog doc2
             "The sun rises in the east and sets in the west.",            # sun doc1
             "Many cats enjoy climbing trees and chasing toys.",           # cat doc2
             ]
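# The vectorisation and clustering steps fell on the screenshot pages; a
# minimal reconstruction using the imports above. k=3 matches the three
# topics (cats, dogs, sun) marked in the comments:
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(documents)  # One TF-IDF vector per document
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
kmeans.fit(X)
for doc, label in zip(documents, kmeans.labels_):
    print(f"Cluster {label}: {doc}")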
Output:
Practical No: 7
Aim: Web Crawling and Indexing
Theory
Crawling: Google downloads text, images, and videos from pages it found on
the internet with automated programs called crawlers.
Indexing: Google analyses the text, images, and video files on the page, and
stores the information in the Google index, which is a large database.
Crawling Process -
1) Starting Point: The crawling process usually begins with a set of seed
URLs, which can be provided manually or generated through algorithms.
These URLs serve as the starting points for the web crawlers.
Indexing Process -
1) HTML Parsing: The content retrieved by the crawler is parsed to extract
relevant information. This involves analysing the HTML structure to
identify text, metadata, and other elements on the page.
2) Isolating Textual Content: From the parsed content, the crawler isolates
the textual information, such as the body of the page, headings, and other
relevant textual data.
Practical
Input:
import requests
from bs4 import BeautifulSoup
import time
from urllib.parse import urljoin, urlparse
from urllib.robotparser import RobotFileParser
def get_html(url):
    # Send the request with a browser-like User-Agent header
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                             'AppleWebKit/537.36 (KHTML, like Gecko) '
                             'Chrome/58.0.3029.110 Safari/537.3'}
    try:
        response = requests.get(url, headers=headers)
        response.raise_for_status()
        return response.text
    except requests.exceptions.HTTPError as errh:
        print(f"HTTP Error: {errh}")
    except requests.exceptions.RequestException as err:
        print(f"Request Error: {err}")
    return None
def save_robots_txt(url):
    try:
        robots_url = urljoin(url, '/robots.txt')
        robots_content = get_html(robots_url)
        if robots_content:
            with open('robots.txt', 'wb') as file:
                file.write(robots_content.encode('utf-8-sig'))
    except Exception as e:
        print(f"Error saving robots.txt: {e}")

def load_robots_txt():
    try:
        with open('robots.txt', 'rb') as file:
            return file.read().decode('utf-8-sig')
    except FileNotFoundError:
        return None
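# extract_links is called below, but its definition fell on the screenshot
# pages; a minimal sketch using BeautifulSoup, consistent with the imports:
def extract_links(html, base_url):
    # Parse the page and resolve every anchor href against the base URL
    soup = BeautifulSoup(html, 'html.parser')
    return [urljoin(base_url, a['href']) for a in soup.find_all('a', href=True)]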
visited_urls = set()  # Pages already crawled, shared across recursive calls

# The bodies of recursive_crawl() and crawl() were split across screenshot
# pages; the headers and control flow below are reconstructed (assumed):
def recursive_crawl(url, depth, robots_content, max_depth=2, delay=2):
    if depth > max_depth or url in visited_urls:
        return
    visited_urls.add(url)
    html = get_html(url)
    if html:
        print(f"Crawling {url}")
        links = extract_links(html, url)
        time.sleep(delay)  # Be polite between requests
        for link in links:
            recursive_crawl(link, depth + 1, robots_content, max_depth, delay)

def crawl(start_url, max_depth=2, delay=2):
    save_robots_txt(start_url)
    robots_content = load_robots_txt()
    if not robots_content:
        print("Unable to retrieve robots.txt. Crawling without restrictions.")
    recursive_crawl(start_url, 1, robots_content, max_depth, delay)
# Example usage:
print("Performed by 740_Pallavi & 743_Deepak")
crawl('https://wikipedia.com', max_depth=2, delay=2)
Output:
robots.txt file:
Practical No: 8
Aim: Link Analysis using the PageRank algorithm
Theory
Link Analysis:
PageRank Algorithm -
● The PageRank algorithm is an algorithm used by the Google search
engine to rank web pages in its search results.
● It was developed by Larry Page and Sergey Brin, the co-founders of
Google, and is named after Larry Page.
● PageRank is based on the idea that the importance of a webpage is
determined by the number and quality of other pages that link to it.
Working -
1) Initialization:
Assign every page an equal initial PageRank of 1/N, where N is the total number of pages.
2) Rank Distribution:
On each iteration, every page distributes its current PageRank equally among the pages it links to.
3) Rank Update:
Each page's new rank is PR(p) = (1 - d)/N + d × Σ PR(q)/L(q), where the sum runs over the pages q that link to p, L(q) is the number of outgoing links of q, and d is the damping factor (typically 0.85).
4) Convergence Check:
After each iteration, check for convergence. If the difference between the new
and previous PageRank values falls below a certain threshold, the algorithm has
converged, and you can stop iterating.
5) Repeat Iterations:
Continue iterating until the maximum number of iterations is reached or until
convergence is achieved.
Practical
Input:
import numpy as np
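# The body of page_rank() fell on the screenshot pages; below is a minimal
# reconstruction of the standard iterative computation (parameter names are
# assumptions), ending in the return statement that survives on this page:
def page_rank(graph, damping=0.85, max_iterations=100, tol=1e-6):
    n = len(graph)
    page_ranks = np.ones(n) / n  # Start every page with an equal rank
    for _ in range(max_iterations):
        new_ranks = np.full(n, (1 - damping) / n)
        for node, links in enumerate(graph):
            if links:
                # Distribute this node's rank over its outgoing links
                for target in links:
                    new_ranks[target] += damping * page_ranks[node] / len(links)
            else:
                # Dangling node: spread its rank over all pages
                new_ranks += damping * page_ranks[node] / n
        # Stop once the ranks change by less than the tolerance
        if np.abs(new_ranks - page_ranks).sum() < tol:
            page_ranks = new_ranks
            break
        page_ranks = new_ranks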
    return page_ranks
# Example usage
if __name__ == "__main__":
    # Define a simple directed graph as an adjacency list.
    # Each index represents a node, and the list at that index contains the
    # nodes to which it has outgoing links.
    web_graph = [
        [1, 2],  # Node 0 has links to Node 1 and Node 2
        [0, 2],  # Node 1 has links to Node 0 and Node 2
        [0, 1],  # Node 2 has links to Node 0 and Node 1
        [1, 2],  # Node 3 has links to Node 1 and Node 2
    ]
    # Calculate PageRank and display the result
    result = page_rank(web_graph)
    print("PageRank values:", result)  # (The display line was on a screenshot page)
Output: