0% found this document useful (0 votes)

162 views36 pages

Spelling Correction in IR Systems

This document outlines the practical components of an Information Retrieval course for BSc CS students at M.S. College of Science, Arts, Commerce. It includes various topics such as document indexing, retrieval models, spelling correction, evaluation metrics, text categorization, clustering, web crawling, and link analysis, along with practical implementations for each topic. The document serves as a guide for students to complete their practical assignments in the academic year 2024-2025.

Uploaded by

surajdhapshi07

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

162 views36 pages

Spelling Correction in IR Systems

Uploaded by

surajdhapshi07

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 36

T. Y.

BSc CS SEM VI

Habib Education and Welfare Society’s

M.S. College of Science, Arts, Commerce, BSc (IT),

BSc (CS), B.com, Bms.
(Devghar)

Mumbai University

Information Retrieval

(A.Y. 2024 – 2025)

Information Retrieval 1
T. Y. BSc CS SEM VI

M.S. College of Science, Arts, Commerce, BSc (IT), BSc (CS), B.com, BMS
(Devghar)

This is to certify that Mr. /

Miss…………………………………………………………………………………………

Class: F.Y. BSc C.S Exam Seat No. ________________

…………………………………………………………………………………..

Satisfactorily completed the practical

in……………………………………………………………………………. As laid down in

the regulation of the University of Mumbai for……………………….

… ..................................................... Examination 2024 – 25.

In-Charge Professor External

Signature
Examiner Signature

Principal Signature H.O. D Signature

Information Retrieval 2
T. Y. BSc CS SEM VI

INDEX
SR TOPIC Page No. Date Sign
N
O.
1. Document Indexing and Retrieval
• Implement an inverted index construction algorithm.
• Build a simple document retrieval system using the
constructed index.
2. Retrieval Models
• Implement the Boolean retrieval model and process
queries.
• Implement the vector space model with TF-IDF weighting
and cosine similarity.
3. Spelling Correction in IR Systems

• Develop a spelling correction module using edit distance

algorithms.
• Integrate the spelling correction module into an
information retrieval system.
4. Evaluation Metrics for IR Systems
• Calculate precision, recall, and F-measure for a given set
of retrieval results.
• Use an evaluation toolkit to measure average precision
and other evaluation metrics.
5. Text Categorization

• Implement a text classification algorithm (e.g., Naive

Bayes or Support Vector Machines).
• Train the classifier on a labelled dataset and evaluate its
performance.
6. Clustering for Information Retrieval

• Implement a clustering algorithm (e.g., K-means or

hierarchical clustering).
• Apply the clustering algorithm to a set of documents
and evaluate the clustering results.
7. Web Crawling and Indexing

• Develop a web crawler to fetch and index web pages.

• Handle challenges such as robots.txt, dynamic content,
and crawling delays.
8. Link Analysis and PageRank

• Implement the PageRank algorithm to rank web pages

based on link analysis.
• Apply the PageRank algorithm to a small web graph and
analyze the results.

Information Retrieval 3
T. Y. BSc CS SEM VI

Practical No: 1
Aim: Document Indexing and Retrieval
● Implement an inverted index construction algorithm.
● Build a simple document retrieval system using the constructed
index.

Theory :
● An Inverted Index is a data structure used in information retrieval
systems to efficiently retrieve documents or web pages containing a
specific term or set of terms.
● In an inverted index, the index is organised by terms (words), and each
term points to a list of documents or web pages that contain that term.
● Inverted indexes are widely used in search engines, database systems, and
other applications where efficient text search is required.
● They are especially useful for large collections of documents, where
searching through all the documents would be prohibitively slow. An
inverted index is an index data structure storing a mapping from content,
such as words or numbers, to its locations in a document or a set of
documents.

Rules to create an inverted index -

1) The text of each document is first preprocessed by removing stop words :
Stop words are the most occurring and useless words in documents like
“I”, “the”, “we”, “is”, and “an”.
2) The text is tokenized, meaning that it is split into individual terms.
3) The terms are then added to the index, with each term pointing to the
documents in which it appears.

Practical:
Input:
import nltk # Import NLTK to download stopwords
from nltk.corpus import stopwords # Import stopwords from NLTK

# Define the documents

document1 = "The quick brown fox jumped over the lazy dog"
document2 = "The lazy dog slept in the sun"

Information Retrieval 4
T. Y. BSc CS SEM VI

# Get the stopwords for English language from NLTK

nltk.download('stopwords')
stopWords = stopwords.words('english')

# Step 1: Tokenize the documents

# Convert each document to lowercase and split it into words
tokens1 = document1.lower().split()
tokens2 = document2.lower().split()

# Combine the tokens into a list of unique terms

terms = list(set(tokens1 + tokens2))

# Step 2: Build the inverted index

# Create an empty dictionary to store the inverted index as well as a dictionary
to store number of occurrences
inverted_index = {}
occ_num_doc1 = {}
occ_num_doc2 = {}

# For each term, find the documents that contain it

for term in terms:
if term in stopWords:
continue
documents = []
if term in tokens1:
documents.append("Document 1")
occ_num_doc1[term] = tokens1.count(term)
if term in tokens2:
documents.append("Document 2")
occ_num_doc2[term] = tokens2.count(term)

Information Retrieval 5
T. Y. BSc CS SEM VI

inverted_index[term] = documents

# Step 3: Print the inverted index

for term, documents in inverted_index.items():
print(term, "->", end=" ")
for doc in documents:
if doc == "Document 1":
print(f"{doc} ({occ_num_doc1.get(term, 0)}),", end=" ")
else:
print(f"{doc} ({occ_num_doc2.get(term, 0)}),", end=" ")
print()
print("Performed by 740_Pallavi & 743_Deepak")
Output:

Information Retrieval 6
T. Y. BSc CS SEM VI

Practical No: 2
Aim: Retrieval Models
● Implement the Boolean retrieval model and process queries.
● Implement the vector space model with TF-IDF weighting and cosine
similarity.

Theory :
A) Boolean Retrieval Model -
● A Boolean model is a fundamental concept in Information Retrieval
(IR) that is used to represent and retrieve documents or
information based on Boolean logic.

● In this model, a document is typically represented as a set of terms

(words or phrases), and queries are also represented using Boolean
operators (AND, OR, NOT) to specify the desired information.

Here's how the Boolean model works in IR:

1. Document Representation: Each document in the collection is

represented as a set of terms. These terms can be extracted
from the document's content and can be single words, phrases,
or other units of information.

2. Query Representation: Queries are also represented as sets of terms,

and Boolean operators (AND, OR, NOT) are used to combine these
terms to express the user's information needs. For example, a query
might be "cats AND dogs," meaning the user wants documents that
contain both "cats" and "dogs."

3. Boolean Operators:
● AND: "cats AND dogs,"
both "cats" and "dogs" will be retrieved.
● OR: "cats OR dogs,"
"cats" or "dogs" or both will be retrieved.
● NOT: "cats NOT dogs"
"cats" but not "dogs."

Information Retrieval 7
T. Y. BSc CS SEM VI

B) TF-IDF
● Term Frequency - Inverse Document Frequency (TF-IDF) is a
widely used statistical method in information retrieval.
● It measures how important a term is within a document
relative to a collection of documents.

Term Frequency (TF): TF of a term or word is the number of times the

term appears in a document compared to the total number of words in
the document.

Inverse Document Frequency(IDF):

● IDF of a term reflects the proportion of documents in the corpus
that contain the term.
IDF = log( N / df ) where,
N= total no. of documents
df = no. of documents containing a term

● The TF-IDF of a term is calculated by multiplying TF and IDF

scores. TF-IDF = TF*IDE

C) Cosine Similarity -

● Cosine similarity is a measure of similarity between two non-

zero vectors defined in an inner product space.
● Cosine similarity is the cosine of the angle between the
vectors. ● The cosine similarity always belongs to the
interval [−1,1].
● In cosine similarity, data objects in a dataset are treated as a
vector. ● The formula to find the cosine similarity between two
vectors is -

Information Retrieval 8
T. Y. BSc CS SEM VI

Here A . B is the product of the vector.

Practical:

A) Implement the Boolean retrieval model and process queries:

Input:

documents = {
1: "apple banana orange",
2: "apple banana",
3: "banana orange",
4: "apple"
}

# Function to build an inverted index using dictionaries

def build_index(docs):
index = {} # Initialize an empty dictionary to store the inverted index
for doc_id, text in docs.items(): # Iterate through each document and its text
terms = set(text.split()) # Split the text into individual terms
for term in terms: # Iterate through each term in the document
if term not in index:
index[term] = {doc_id} # If the term is not in the index, create a new
set with document ID
else:
index[term].add(doc_id) # If the term exists, add the document ID to
its set
return index # Return the built inverted index

Information Retrieval 9
T. Y. BSc CS SEM VI

# Building the inverted index

inverted_index = build_index(documents)

# Function for Boolean AND operation using inverted index

def boolean_and(operands, index):
if not operands: # If there are no operands, return all document IDs
return list(range(1, len(documents) + 1))

result = index.get(operands[0], set()) # Get the set of document IDs for the
first operand
for term in operands[1:]: # Iterate through the rest of the operands
result = result.intersection(index.get(term, set())) # Compute intersection
with sets of document IDs
return list(result) # Return the resulting list of document IDs

# Function for Boolean OR operation using inverted index

def boolean_or(operands, index):
result = set() # Initialize an empty set to store the resulting document IDs
for term in operands: # Iterate through each term in the query
result = result.union(index.get(term, set())) # Union of sets of document
IDs for each term
return list(result) # Return the resulting list of document IDs

# Function for Boolean NOT operation using inverted index

def boolean_not(operand, index, total_docs):
operand_set = set(index.get(operand, set())) # Get the set of document IDs
for the operand
all_docs_set = set(range(1, total_docs + 1)) # Create a set of all document
IDs

Information Retrieval 10
T. Y. BSc CS SEM VI

return list(all_docs_set.difference(operand_set)) # Return documents not in

the operand set

# Example queries
query1 = ["apple", "banana"] # Query for documents containing both "apple"
and "banana"
query2 = ["apple", "orange"] # Query for documents containing "apple" or
"orange"

# Performing Boolean Model queries using inverted index

result1 = boolean_and(query1, inverted_index) # Get documents containing
both terms
result2 = boolean_or(query2, inverted_index) # Get documents containing
either of the terms
result3 = boolean_not("orange", inverted_index, len(documents)) # Get
documents not containing "orange"

# Printing results
print("Documents containing 'apple' and 'banana':", result1)
print("Documents containing 'apple' or 'orange':", result2)
print("Documents not containing 'orange':", result3)
print("Performed by 740_Pallavi & 743_Deepak")

Output:

Information Retrieval 11
T. Y. BSc CS SEM VI

B) Implement the vector space model with TF-IDF weighting and cosine
similarity:
Input:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
# Import necessary libraries
import nltk # Import NLTK to download stopwords
from nltk.corpus import stopwords # Import stopwords from NLTK
import numpy as np # Import NumPy library
from numpy.linalg import norm # Import norm function from NumPy's linear
algebra module

# Define the training and test sets of text documents

train_set = ["The sky is blue.", "The sun is bright."] # Documents
test_set = ["The sun in the sky is bright."] # Query

# Get the stopwords for English language from NLTK

nltk.download('stopwords')
stopWords = stopwords.words('english')

# Initialize CountVectorizer and TfidfTransformer objects

vectorizer = CountVectorizer(stop_words=stopWords) # CountVectorizer to
convert text to matrix of token counts
transformer = TfidfTransformer() # TfidfTransformer to convert matrix of
token counts to TF-IDF representation

# Convert the training and test sets to arrays of TF-IDF features

trainVectorizerArray = vectorizer.fit_transform(train_set).toarray() # Fit-
transform training set
testVectorizerArray = vectorizer.transform(test_set).toarray() # Transform test
set

# Display the TF-IDF arrays for training and test sets

Information Retrieval 12
T. Y. BSc CS SEM VI

print('Fit Vectorizer to train set', trainVectorizerArray)

print('Transform Vectorizer to test set', testVectorizerArray)

# Define a lambda function to calculate cosine similarity between vectors

cx = lambda a, b: round(np.inner(a, b) / (norm(a) * norm(b)), 3)

# Iterate through each vector in the training set

for vector in trainVectorizerArray:
print(vector) # Display each vector in the training set
# Iterate through each vector in the test set
for testV in testVectorizerArray:
print(testV) # Display each vector in the test set
cosine = cx(vector, testV) # Calculate cosine similarity between vectors
print(cosine) # Display the cosine similarity

# Fit the transformer to the training set and transform it to TF-IDF

representation
transformer.fit(trainVectorizerArray)
print()
print(transformer.transform(trainVectorizerArray).toarray())

# Fit the transformer to the test set and transform it to TF-IDF representation
transformer.fit(testVectorizerArray)
print()
tfidf = transformer.transform(testVectorizerArray)
print(tfidf.todense())

Information Retrieval 13
T. Y. BSc CS SEM VI

Output:

Information Retrieval 14
T. Y. BSc CS SEM VI

Practical No: 3

Aim: Spelling Correction in IR Systems

● Develop a spelling correction module using edit distance algorithms.
● Integrate the spelling correction module into an information retrieval
system.

Theory:

Edit Distance :

● Edit distance is a measure of the similarity between two strings by

calculating the minimum number of single-character edits (insertions,
deletions, or substitutions) required to change one string into the other.
● The smaller the edit distance, the more similar the strings are.

Consider two strings str1 and str2 of length M and N respectively.

For finding edit distance there are performed below operations -

1. Operation 1 (INSERT): Insert any character before or after any
index value
2. Operation 2 (REMOVE): Remove a character
3. Operation 3 (Replace): Replace a character at any index value with
some other character

Practical:
Input:
# A Naive recursive python program to find minimum number
# operations to convert str1 to str2
def editDistance(str1, str2, m, n):
# If first string is empty, the only option is to insert all characters of second
string into first
if m == 0:
return n
# If second string is empty, the only option is to remove all characters of first
string
if n == 0:

Information Retrieval 15
T. Y. BSc CS SEM VI

return m
# If last characters of two strings are same, nothing much to do. Ignore last
characters and get count for remaining strings.
if str1[m-1] == str2[n-1]:
return editDistance(str1, str2, m-1, n-1)
# If last characters are not same, consider all three operations on last
character of first string, recursively compute minimum cost for all three
operations and take minimum of three values.
return 1 + min(editDistance(str1, str2, m, n-1), # Insert
editDistance(str1, str2, m-1, n), # Remove
editDistance(str1, str2, m-1, n-1) # Replace
)
# Driver code
str1 = "sunday"
str2 = "saturday"
print('Edit Distance is: ', editDistance(str1, str2, len(str1), len(str2)))

Output:

Information Retrieval 16
T. Y. BSc CS SEM VI

Practical No: 4

Aim: Evaluation Metrics for IR Systems

A) Calculate precision, recall, and F-measure for a given set of retrieval
results.
B) Use an evaluation toolkit to measure average precision and other
evaluation metrics.

Theory:
1. Precision:
● Precision is the ratio of correctly predicted positive observations to
the total predicted positives.
● It is also called Positive Predictive Value (PPV).
● Precision is calculated using the following formula:

Precision = TP / TP+FP
Where:
• TP (True Positives) is the number of instances correctly
predicted as positive. • FP (False Positives) is the number of
instances incorrectly predicted as positive.
High precision indicates that the model has a low rate of false positives. In
other words, when the model predicts a positive result, it is likely to be
correct.

2. Recall:
• Recall is the ratio of correctly predicted positive observations to all
observations in actual class.
• Recall is calculated using the following formula:
Recall= TP /TP+FN
Where:
• TP (True Positives) is the number of instances correctly
predicted as positive. • FN (False Negatives) is the number of
instances incorrectly predicted as negative.
High recall indicates that the model has a low rate of false negatives. In

Information Retrieval 17
T. Y. BSc CS SEM VI

other words, the model is effective at capturing all the positive instances.

3. F-measure:
• The F-measure is a metric commonly used in performance evaluation.
• It combines precision and recall into a single value, providing a
balanced measure of a model's performance.

• The formula for F-measure is:

• The F-measure ranges from 0 to 1, where 1 indicates perfect precision and

recall.

4. Average Precision:
Average Precision is used to find the Average of the model precision based
on relevancy of result. • Algorithm:

In order to find Average Precision:

1) Take 2 variables X and Y as 0
2) We will then go through the prediction from left to right:
3) In case the prediction is 0, we will only increment Y by 1 and not
find prediction score
4) In case the prediction is 1, we will increment both X and Y by 1
5) After incrementing, we use the formula X/Y to get the current position
prediction score.
6) Lastly we will find summation of all prediction scores and divide them
by total number of positive predictions.

Information Retrieval 18
T. Y. BSc CS SEM VI

A) Calculate precision, recall, and F-measure for a given set of retrieval

results.

Input:

def calculate_metrics(retrieved_set, relevant_set):

true_positive = len(retrieved_set.intersection(relevant_set))
false_positive = len(retrieved_set.difference(relevant_set))
false_negative = len(relevant_set.difference(retrieved_set))

'''
(Optional)
PPT values:
true_positive = 20
false_positive = 10
false_negative = 30
'''

print("True Positive: ", true_positive

,"\nFalse Positive: ", false_positive
,"\nFalse Negative: ", false_negative ,"\n")

precision = true_positive / (true_positive + false_positive)

recall = true_positive / (true_positive + false_negative)
f_measure = 2 * precision * recall / (precision + recall)

return precision, recall, f_measure

retrieved_set = set(["doc1", "doc2", "doc3"]) #Predicted set

relevant_set = set(["doc1", "doc4"]) #Actually Needed set (Relevant)

precision, recall, f_measure = calculate_metrics(retrieved_set, relevant_set)

Information Retrieval 19
T. Y. BSc CS SEM VI

print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F-measure: {f_measure}")

Output:

B) Use an evaluation toolkit to measure average precision and other

evaluation metrics.

Input:

from sklearn.metrics import average_precision_score

y_true = [0, 1, 1, 0, 1, 1] #Binary Prediction

y_scores = [0.1, 0.4, 0.35, 0.8, 0.65, 0.9] #Model's estimation score

average_precision = average_precision_score(y_true, y_scores)

print(f'Average precision-recall score: {average_precision}'

Output:

Information Retrieval 20
T. Y. BSc CS SEM VI

Practical No: 5

Aim: Text Categorization

A) Implement a text classification algorithm (e.g., Naive Bayes or
Support Vector Machines).
B) Train the classifier on a labelled dataset and evaluate its
performance.

Theory:
Naive Bayes
• The Naïve Bayes algorithm is a supervised learning algorithm, which is
based on Bayes theorem and used for solving classification problems.
• It is mainly used in text classification that includes a high-
dimensional training dataset.
• Naive Bayes Classifier is one of the simple and most effective
Classification algorithms which helps in building the fast machine
learning models that can make quick predictions.
• It is a probabilistic classifier, which means it predicts on the basis of the
probability of an object.
• Some popular examples of Naive Bayes Algorithm are spam filtration,
Sentimental analysis, and classifying articles.

Bayes' Theorem:

• Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is

used to determine the probability of a hypothesis with prior
knowledge. It depends on the conditional probability.

• The formula for Bayes' theorem is given as:

P(B|A) * P(A)
P(A|B) =
P(B)

Create two CSV file:

Dataset.csv file:

Information Retrieval 21
T. Y. BSc CS SEM VI

Practical:

Input:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

# Load the CSV file

df = pd.read_csv(r"C:\Users\Administrator\Documents\Sem 6\IR\Dataset.csv")
data = df["covid"] + "" + df["fever"]
X = data.astype(str) # Test data
y = df['flu'] # Labels

# Splitting the data into training and test data

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2,
random_state = 42)

# Converting data into bag-of-data format to train the model

vectorizer = CountVectorizer()

Information Retrieval 22
T. Y. BSc CS SEM VI

# initializing the converter

X_train_counts = vectorizer.fit_transform(X_train)
# converting the training data
X_test_counts = vectorizer.transform(X_test)
# converting the test data
# using and training the multinomial model of naive bayes algorithm
classifier = MultinomialNB() # initializing the classifier
classifier.fit(X_train_counts, y_train) # training the classifier

# loading another dataset to test if the model is working properly

data1 = pd.read_csv(r"C:\Users\Administrator\Documents\Sem 6\IR\Test.csv")
new_data = data1["covid"] + "" + data1["fever"]
new_data_counts = vectorizer.transform(new_data.astype(str)) # converting
the new data

# making the model to predict the results for new dataset

predictions = classifier.predict(new_data_counts)

# Output the results

new_data = predictions
print(new_data)

# retrieving the accuracy and classification report

accuracy = accuracy_score(y_test, classifier.predict(X_test_counts))
print(f"\nAccuracy: {accuracy:.2f}")

print("Classification Report: ")

print(classification_report(y_test, classifier.predict(X_test_counts)))

# Convert the predictions to a DataFrame

predictions_df = pd.DataFrame(predictions, columns = ['flu_prediction'])

Information Retrieval 23
T. Y. BSc CS SEM VI

# concatenate the original DataFrame with the predictions DataFrame

data1 = pd.concat([data1, predictions_df], axis = 1)

# write the DataFrame back to CSV

data1.to_csv(r"C:\Users\Administrator\Documents\Sem 6\IR\Test1.csv", index
= False)

Output:

Information Retrieval 24
T. Y. BSc CS SEM VI

Practical No: 6

Aim: Clustering for Information Retrieval

• Implement a clustering algorithm (e.g., K-means or
hierarchical clustering).
• Apply the clustering algorithm to a set of documents and evaluate
the clustering results.

Theory:
K-Means Clustering:
● K-Means Clustering is an Unsupervised Learning algorithm, which
groups the unlabeled dataset into different clusters.
● Here K defines the number of predefined clusters that need to be created
in the process, as if K=2, there will be two clusters, and for K=3, there
will be three clusters, and so on.
● It allows us to cluster the data into different groups and a convenient way
to discover the categories of groups in the unlabeled dataset on its own
without the need for any training.
● The main aim of this algorithm is to minimise the sum of distances
between the data point and their corresponding clusters.
● The algorithm takes the unlabeled dataset as input, divides the dataset
into k-number of clusters, and repeats the process until it does not find the
best clusters. The value of k should be predetermined in this algorithm.

● The k-means clustering algorithm mainly performs two tasks:

1. Determines the best value for K centre points or centroids by an
iterative process.
2. Assigns each data point to its closest k-centre. Those data points
which are near to the particular k-centre, create a cluster.

Working of K-means Algorithm -

Step-1: Select the number K to decide the number of clusters.
Step-2: Select random K points or centroids. (It can be different from the input
dataset).
Step-3: Assign each data point to their closest centroid, which will form the
predefined K clusters.
Step-4: Calculate the variance and place a new centroid of each cluster.

Information Retrieval 25
T. Y. BSc CS SEM VI

Step-5: Repeat the third steps, which means assign each datapoint to the new
closest centroid of each cluster.
Step-6: If any reassignment occurs, then go to step-4 else go to FINISH.
Step-7: The model is ready.

Practical

Input:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

documents = ["Cats are known for their agility and grace", #cat doc1
"Dogs are often called ‘man’s best friend’.", #dog doc1
"Some dogs are trained to assist people with disabilities.", #dog doc2
"The sun rises in the east and sets in the west.", #sun doc1
"Many cats enjoy climbing trees and chasing toys.", #cat doc2
]

# Create a TfidfVectorizer object

vectorizer = TfidfVectorizer(stop_words='english')

# Learn vocabulary and idf from training set.

X = vectorizer.fit_transform(documents)

# Perform k-means clustering

kmeans = KMeans(n_clusters=3, random_state=0).fit(X)

# Print cluster labels for each document

print(kmeans.labels_)

Output:

Information Retrieval 26
T. Y. BSc CS SEM VI

Practical No: 7

Aim: Web Crawling and Indexing

A) Develop a web crawler to fetch and index web pages.
B) Handle challenges such as robots.txt, dynamic content, and crawling
delays.

Theory

Crawling: Google downloads text, images, and videos from pages it found on
the internet with automated programs called crawlers.

Indexing: Google analyses the text, images, and video files on the page, and
stores the information in the Google index, which is a large database.

Crawling Process -
1) Starting Point: The crawling process usually begins with a set of seed
URLs, which can be provided manually or generated through algorithms.
These URLs serve as the starting points for the web crawlers.

2) Queue of URLs: Web crawlers maintain a queue of URLs, often referred

to as the URL frontier. This queue represents the set of URLs that the
crawler is yet to visit. New URLs are continuously added to this queue
during the crawling process.

3) Parsing Content: When a web crawler visits a webpage, it parses the

HTML content to extract links (URLs) embedded within the page. This
process involves examining HTML tags and, in some cases, executing
JavaScript to discover additional links.

4) Respecting Directives: Crawlers adhere to rules specified in the

robots.txt file, which indicates areas of a website that should not be
crawled. Additionally, crawlers may implement policies to filter out
certain types of content or URLs.

Information Retrieval 27
T. Y. BSc CS SEM VI

5) Avoiding Redundancy: To prevent redundancy and ensure efficient

crawling, duplicate URLs are often identified and removed from the
crawling queue.

6) Traversal Strategy: Crawlers can follow various traversal strategies,

such as depth-first or breadth-first, as they explore the web. The chosen
strategy determines the order in which pages are crawled.

7) Politeness: To avoid overloading web servers with too many requests,

crawlers may introduce a crawl delay between successive requests.

8) Retrieving Web Pages: Crawlers download the content of web pages,

including HTML, text, images, and other resources. The downloaded
content is then processed for indexing.

Indexing Process -
1) HTML Parsing: The content retrieved by the crawler is parsed to extract
relevant information. This involves analysing the HTML structure to
identify text, metadata, and other elements on the page.

2) Isolating Textual Content: From the parsed content, the crawler isolates
the textual information, such as the body of the page, headings, and other
relevant textual data.

3) Breaking into Tokens: The textual content is tokenized, breaking it

down into smaller units, typically words or phrases. Tokenization
facilitates efficient indexing and retrieval based on keywords.

4) Data Organization: The extracted information is organized into an

index, which is a structured database allowing for fast and efficient
retrieval. The index includes details about keywords, their locations, and
other relevant metadata.

5) Optimising for Retrieval: The index is often organised as an inverted

index, mapping each term to the documents or web pages where it

Information Retrieval 28
T. Y. BSc CS SEM VI

appears. This structure enables quick retrieval of documents containing

specific terms.

6) Additional Information: In addition to textual content, metadata such as

page title, URL, and other relevant details may be included in the index to
enhance the search experience.
7) Real-time Changes: Search engines continuously update their indexes to
reflect changes on the web. This ensures that the search results remain
current and accurate.

Practical

Input:
import requests
from bs4 import BeautifulSoup
import time
from urllib.parse import urljoin, urlparse
from urllib.robotparser import RobotFileParser

def get_html(url):
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)
AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110
Safari/537.3'}
try:
response = requests.get(url, headers=headers)
response.raise_for_status()
return response.text
except requests.exceptions.HTTPError as errh:
print(f"HTTP Error: {errh}")
except requests.exceptions.RequestException as err:
print(f"Request Error: {err}")
return None

Information Retrieval 29
T. Y. BSc CS SEM VI

def save_robots_txt(url):
try:
robots_url = urljoin(url, '/robots.txt')
robots_content = get_html(robots_url)
if robots_content:
with open('robots.txt', 'wb') as file:
file.write(robots_content.encode('utf-8-sig'))
except Exception as e:
print(f"Error saving robots.txt: {e}")

def load_robots_txt():
try:
with open('robots.txt', 'rb') as file:
return file.read().decode('utf-8-sig')
except FileNotFoundError:
return None

def extract_links(html, base_url):

soup = BeautifulSoup(html, 'html.parser')
links = []
for link in soup.find_all('a', href=True):
absolute_url = urljoin(base_url, link['href'])
links.append(absolute_url)
return links

def is_allowed_by_robots(url, robots_content):

parser = RobotFileParser()
parser.parse(robots_content.split('\n'))
return parser.can_fetch('*', url)

def crawl(start_url, max_depth=3, delay=1):

Information Retrieval 30
T. Y. BSc CS SEM VI

visited_urls = set()

def recursive_crawl(url, depth, robots_content):

if depth > max_depth or url in visited_urls or not
is_allowed_by_robots(url, robots_content):
return
visited_urls.add(url)
time.sleep(delay)

html = get_html(url)
if html:
print(f"Crawling {url}")
links = extract_links(html, url)

for link in links:

recursive_crawl(link, depth + 1, robots_content)

save_robots_txt(start_url)
robots_content = load_robots_txt()
if not robots_content:
print("Unable to retrieve robots.txt. Crawling without restrictions.")

recursive_crawl(start_url, 1, robots_content)

# Example usage:
print("Performed by 740_Pallavi & 743_Deepak")
crawl('https://wikipedia.com', max_depth=2, delay=2)

Information Retrieval 31
T. Y. BSc CS SEM VI

Output:

robot.txt file:

Information Retrieval 32
T. Y. BSc CS SEM VI

Practical No: 8

Aim: Link Analysis and PageRank

A) Implement the PageRank algorithm to rank web pages based on link
analysis.
B) Apply the PageRank algorithm to a small web graph and analyse the
results.

Theory

Link Analysis:

● Link analysis is a method used to examine relationships and connections

between entities in a network.
● It involves studying the links or connections between different elements
to uncover patterns, structures, and insights.
● Link analysis is commonly applied in various fields, including
information retrieval, social network analysis, fraud detection, and
recommendation systems.

PageRank Algorithm -
● The PageRank algorithm is an algorithm used by the Google search
engine to rank web pages in its search results.
● It was developed by Larry Page and Sergey Brin, the co-founders of
Google, and is named after Larry Page.
● PageRank is based on the idea that the importance of a webpage is
determined by the number and quality of other pages that link to it.

Working -

1) Initialize PageRank Values:

Set an initial PageRank value for each node. Commonly, this is initialised
to 1 divided by the total number of nodes, making the sum of all
PageRank values equal to 1.

2) Define Damping Factor and Iterations:

Choose a damping factor (typically 0.85) to model the probability that a

Information Retrieval 33
T. Y. BSc CS SEM VI

user will continue navigating through the web by following links.

Decide on the maximum number of iterations for the algorithm.

4) Convergence Check:
After each iteration, check for convergence. If the difference between the new
and previous PageRank values falls below a certain threshold, the algorithm has
converged, and you can stop iterating.

5) Repeat Iterations:
Continue iterating until the maximum number of iterations is reached or until
convergence is achieved.

6) Final PageRank Values:

The final PageRank values represent the importance of each node in the graph
based on the link structure.

Practical

Input:
import numpy as np

def page_rank(graph, damping_factor=0.85, max_iterations=100, tolerance=1e-

6):
# Get the number of nodes
num_nodes = len(graph)

# Initialize PageRank values

page_ranks = np.ones(num_nodes) / num_nodes

Information Retrieval 34
T. Y. BSc CS SEM VI

# Iterative PageRank calculation

for _ in range(max_iterations):
prev_page_ranks = np.copy(page_ranks)

for node in range(num_nodes):

# Calculate the contribution from incoming links
incoming_links = [i for i, v in enumerate(graph) if node in v]
if not incoming_links:
continue

page_ranks[node] = (1 - damping_factor) / num_nodes + \

damping_factor * sum(prev_page_ranks[link] /
len(graph[link]) for link in incoming_links)

# Check for convergence

if np.linalg.norm(page_ranks - prev_page_ranks, 2) < tolerance:
break

return page_ranks

# Example usage
if name == " main ":
# Define a simple directed graph as an adjacency list
# Each index represents a node, and the list at that index contains nodes to
which it has outgoing links
web_graph = [
[1, 2], # Node 0 has links to Node 1 and Node 2
[0, 2], # Node 1 has links to Node 0 and Node 2
[0, 1] , # Node 2 has links to Node 0 and Node 1
[1,2], # Node 3 has links to Node 1 and Node 2
]

Information Retrieval 35
T. Y. BSc CS SEM VI

# Calculate PageRank
result = page_rank(web_graph)

# Display PageRank values

for i, pr in enumerate(result):
print(f"Page {i}: {pr}")

Output:

Information Retrieval 36

Certificate: T.Y.Bsc Cs
No ratings yet
Certificate: T.Y.Bsc Cs
120 pages
IR Journal
No ratings yet
IR Journal
20 pages
2 Introduction To Information Retrieval
No ratings yet
2 Introduction To Information Retrieval
38 pages
IR Journal 21054
No ratings yet
IR Journal 21054
30 pages
Document Indexing & Retrieval Guide
No ratings yet
Document Indexing & Retrieval Guide
20 pages
Advanced Database Tech: IR & Web Search
No ratings yet
Advanced Database Tech: IR & Web Search
21 pages
Ir Mod4 Notes
No ratings yet
Ir Mod4 Notes
19 pages
IR Practical Theory
No ratings yet
IR Practical Theory
9 pages
Inverted Index Search Engine Guide
No ratings yet
Inverted Index Search Engine Guide
11 pages
1 Overview
No ratings yet
1 Overview
44 pages
Chapter 2
No ratings yet
Chapter 2
37 pages
IR Practical 1
No ratings yet
IR Practical 1
5 pages
Information Retrievalpdf
No ratings yet
Information Retrievalpdf
7 pages
Boolean and Vector Space Retrieval Models
No ratings yet
Boolean and Vector Space Retrieval Models
31 pages
CS583 Info Retrieval
No ratings yet
CS583 Info Retrieval
34 pages
Information Retrieval Practical
No ratings yet
Information Retrieval Practical
10 pages
Relevance of A Document To A Query
No ratings yet
Relevance of A Document To A Query
10 pages
Module 1 Part BInformation Retrieval Webdocuments
No ratings yet
Module 1 Part BInformation Retrieval Webdocuments
49 pages
Inverted Index Construction Explained
No ratings yet
Inverted Index Construction Explained
10 pages
Information Retrieval Models Overview
No ratings yet
Information Retrieval Models Overview
21 pages
Vector-Space Retrieval Implementation Guide
No ratings yet
Vector-Space Retrieval Implementation Guide
16 pages
Information Retrieval Models Guide
No ratings yet
Information Retrieval Models Guide
54 pages
Introduction To Information Retrieval
No ratings yet
Introduction To Information Retrieval
61 pages
Chap5 Index Construction
No ratings yet
Chap5 Index Construction
38 pages
Aspect Information Retrieval (IR) Web Search
No ratings yet
Aspect Information Retrieval (IR) Web Search
19 pages
Boolean and Vector Space Retrieval Models
No ratings yet
Boolean and Vector Space Retrieval Models
33 pages
Boolean and Vector Space Retrieval Models
No ratings yet
Boolean and Vector Space Retrieval Models
27 pages
Text Databases and Information Retrieval
No ratings yet
Text Databases and Information Retrieval
23 pages
Intro to Information Retrieval
No ratings yet
Intro to Information Retrieval
47 pages
Overview of Information Retrieval in CS583
No ratings yet
Overview of Information Retrieval in CS583
33 pages
Indexing 1
No ratings yet
Indexing 1
61 pages
Bulu
No ratings yet
Bulu
47 pages
Module 3 Indexing Part A
No ratings yet
Module 3 Indexing Part A
46 pages
NLP Ir
No ratings yet
NLP Ir
24 pages
Unit2 ISR
No ratings yet
Unit2 ISR
12 pages
Query Languages
No ratings yet
Query Languages
54 pages
Comprehensive Guide to Information Retrieval
No ratings yet
Comprehensive Guide to Information Retrieval
74 pages
Index: SR. NO. Practical Name Date of Perform NO. Sign
No ratings yet
Index: SR. NO. Practical Name Date of Perform NO. Sign
23 pages
Information Retrieval & MapReduce
No ratings yet
Information Retrieval & MapReduce
72 pages
NLP Week10 IR Enc Dec Annotated - by - Ces
No ratings yet
NLP Week10 IR Enc Dec Annotated - by - Ces
83 pages
Unit 2 IR
No ratings yet
Unit 2 IR
13 pages
Boolean VectorSpace 11
No ratings yet
Boolean VectorSpace 11
15 pages
Thesis
No ratings yet
Thesis
49 pages
Irt Unit5
No ratings yet
Irt Unit5
111 pages
3 Retrieval Models
No ratings yet
3 Retrieval Models
87 pages
Ir 1
No ratings yet
Ir 1
14 pages
Tamirat IRS
No ratings yet
Tamirat IRS
7 pages
Vector Space Model
No ratings yet
Vector Space Model
6 pages
All Unit 2 Mark
No ratings yet
All Unit 2 Mark
15 pages
Learning Guide Unit 4 - Home
No ratings yet
Learning Guide Unit 4 - Home
14 pages
1) Explain User Interaction With IR With The Help of A Diagram
No ratings yet
1) Explain User Interaction With IR With The Help of A Diagram
12 pages
Ir QB
No ratings yet
Ir QB
8 pages
IR Unit III - Notes
No ratings yet
IR Unit III - Notes
18 pages
Document Ranking Using Customizes Vector Method
No ratings yet
Document Ranking Using Customizes Vector Method
6 pages
CS8080 INFORMATION RETRIEVAL TECHNIQUES II INTERNAL EXAMINATION - Google Forms
No ratings yet
CS8080 INFORMATION RETRIEVAL TECHNIQUES II INTERNAL EXAMINATION - Google Forms
420 pages
Decision Tree Lecture 1
No ratings yet
Decision Tree Lecture 1
7 pages
Advanced Economic Theory Lecture
No ratings yet
Advanced Economic Theory Lecture
36 pages
Algebra 1 Ch. 5 Practice Test
100% (1)
Algebra 1 Ch. 5 Practice Test
7 pages
Help Tops 1130
No ratings yet
Help Tops 1130
312 pages
Farm Management Handout STS Plant and Horticulture
100% (1)
Farm Management Handout STS Plant and Horticulture
104 pages
Moma Internship
No ratings yet
Moma Internship
2 pages
Advanced Free DPP - 01-2
No ratings yet
Advanced Free DPP - 01-2
6 pages
Forum 15fforum Halliwell Complex
No ratings yet
Forum 15fforum Halliwell Complex
66 pages
Sven O Krumke Integer Programming Polyhedra and Algorithms Lecture Notes
No ratings yet
Sven O Krumke Integer Programming Polyhedra and Algorithms Lecture Notes
188 pages
Artificial Intelligence Programming Language
No ratings yet
Artificial Intelligence Programming Language
2 pages
DSD ch-9 Sequential Circuits
No ratings yet
DSD ch-9 Sequential Circuits
76 pages
Oxford EAP Advanced C1 Teacher S Book DVD and Audio CD Pack Edward de Chazal Download Instantly
100% (1)
Oxford EAP Advanced C1 Teacher S Book DVD and Audio CD Pack Edward de Chazal Download Instantly
72 pages
Annual15790 Exam-Syllabus-Class 7
No ratings yet
Annual15790 Exam-Syllabus-Class 7
4 pages
IP Board Ut
No ratings yet
IP Board Ut
5 pages
Task Management Functions Overview
No ratings yet
Task Management Functions Overview
4 pages
IB 04 Straight Lines (11 16)
100% (1)
IB 04 Straight Lines (11 16)
3 pages
CH 5 More About Block Ciphers-3
No ratings yet
CH 5 More About Block Ciphers-3
68 pages
Engineering Mathematics 2A (Scee08009) : Refresher Sheet
No ratings yet
Engineering Mathematics 2A (Scee08009) : Refresher Sheet
2 pages
Final Report - CAB 420
No ratings yet
Final Report - CAB 420
13 pages
Numerical Error Analysis Guide
100% (1)
Numerical Error Analysis Guide
8 pages
DM R-18 Important Questions
No ratings yet
DM R-18 Important Questions
8 pages
Max78000 Article Series Part 1
No ratings yet
Max78000 Article Series Part 1
4 pages
Biostatistics in Neonatal Nursing
No ratings yet
Biostatistics in Neonatal Nursing
275 pages
Road and Off-Road Vehicle Dynamics Handbook
No ratings yet
Road and Off-Road Vehicle Dynamics Handbook
5 pages
Masmoudi Et Al 2019
No ratings yet
Masmoudi Et Al 2019
11 pages
C Notes
100% (1)
C Notes
96 pages
Linear Algebra: Basis and Spanning Sets
No ratings yet
Linear Algebra: Basis and Spanning Sets
19 pages
Kriging 1 - 3
No ratings yet
Kriging 1 - 3
23 pages
Year 10 Baseline Test Maths Foundation Calculator - Mark Scheme
No ratings yet
Year 10 Baseline Test Maths Foundation Calculator - Mark Scheme
4 pages
Ors 2002 B
No ratings yet
Ors 2002 B
28 pages