DSA Module 5 Notes

SUBJECT: Data Science and Its Applications (21AD62)

MODULE-5 Natural Language Processing


Syllabus: Word Clouds, n-Gram Language Models, Grammars, An Aside: Gibbs Sampling,
Topic Modeling, Word Vectors, Recurrent Neural Networks, Example: Using a Character-Level
RNN, Network Analysis, Betweenness Centrality, Eigenvector Centrality, Directed Graphs and
PageRank, Recommender Systems, Manual Curation, Recommending What's Popular, User-
Based Collaborative Filtering, Item-Based Collaborative Filtering, Matrix Factorization.

Introduction
Natural Language Processing (NLP) is a field of artificial intelligence that focuses on the interaction
between computers and humans through natural language. The goal of NLP is to enable computers to
understand, interpret, and generate human language in a way that is both meaningful and useful.

Applications of NLP
1. Chatbots and Virtual Assistants.
2. Text Analysis and Sentiment Analysis
3. Machine Translation
4. Speech Recognition and Voice Assistants
5. Information Retrieval and Search Engines
6. Language Modeling and Understanding
7. Recommendation Systems.
8. Healthcare and Medical Applications

Word cloud
A word cloud, also known as a tag cloud, is a visual representation of text data. It highlights the most
frequently occurring words within a body of text by making them larger and bolder compared to less
frequent words. Word clouds are often used to quickly identify prominent terms in large text datasets
and provide an at-a-glance summary of the content.

Example: The word cloud is created from a list of data science-related buzzwords, where each word is
associated with two numbers: how frequently it appears in job postings and how frequently it appears
on resumes.

Creating a Word Cloud


To create a word cloud, one typically uses a tool or library (such as the wordcloud package in Python) to generate
the visualization. The size of each word in the cloud is proportional to its frequency.
Steps:
1. Text Input:
o A body of text is provided as input, which can be anything from a single document to a
collection of articles, tweets, or reviews.
2. Text Processing:
o The text is processed to remove common stop words (e.g., "and", "the", "is") that do
not contribute to the main content. Words are then counted to determine their
frequency.

3. Visualization:

o Words are arranged in a layout, with their size proportional to their frequency. The
arrangement often ensures that words fit together compactly, creating a dense and
readable cloud.


Example Data

Consider the following data for creating a word cloud:

data = [ ("big data", 100, 15), ("Hadoop", 95, 25), ("Python", 75, 50), ("R", 50, 40), ("machine
learning", 80, 20), ("statistics", 20, 60), ("data science", 60, 70), ("analytics", 90, 3), ("team player", 85,
85), ("dynamic", 2, 90), ("synergies", 70, 0), ("actionable insights", 40, 30), ("think out of the box",
45, 10), ("self-starter", 30, 50), ("customer focus", 65, 15), ("thought leadership", 35, 35)]
Where,

• X-Axis: Represents the frequency of the buzzword in job postings.
• Y-Axis: Represents the frequency of the buzzword in resumes.

Plotting the words on a scatter plot with these axes adds more value and clarity compared to a
traditional word cloud.
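A minimal sketch of such a scatter-style plot, assuming matplotlib is installed and reusing the data list shown above (the font-size scaling is an illustrative choice, not from the notes):

import matplotlib.pyplot as plt

# data is the list of (word, job_popularity, resume_popularity) tuples shown above
for word, job_popularity, resume_popularity in data:
    plt.text(job_popularity, resume_popularity, word,
             ha='center', va='center',
             size=8 + (job_popularity + resume_popularity) / 20)  # bigger text for more popular words

plt.xlabel("Popularity in Job Postings")
plt.ylabel("Popularity on Resumes")
plt.axis([0, 100, 0, 100])
plt.show()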

Limitations

1. Lack of Precision: Word clouds don't provide exact counts or clear comparisons between
words.
2. Random Placement: The position of words in a word cloud is arbitrary and doesn't convey any
additional information.
3. Space Usage: Words are placed wherever there is space, which can make the visualization
cluttered and less readable.

Python Code

import matplotlib.pyplot as plt


from wordcloud import WordCloud

data = [
("big data", 100, 15),
("Hadoop", 95, 25),
("Python", 75, 50),
("R", 50, 40),
("machine learning", 80, 20),
("statistics", 20, 60),
("data science", 60, 70),
("analytics", 90, 3),
("team player", 85, 85),
("dynamic", 2, 90),
("synergies", 70, 0),
("actionable insights", 40, 30),
("think out of the box", 45, 10),
("self-starter", 30, 50),
("customer focus", 65, 15),
("thought leadership", 35, 35)
]

#Prepare text for word cloud


word_freq = {word: job_popularity + resume_popularity for word, job_popularity, resume_popularity in data}


# Generate word cloud


wordcloud = WordCloud(width=800, height=400,
background_color='white').generate_from_frequencies(word_freq)

# Display the generated image


plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()

N-Gram Model

Language modeling is the task of determining the probability of a sequence of words. It is used in a
wide variety of applications such as speech recognition, spam filtering, etc.

N-gram modeling is a foundational technique in natural language processing (NLP) that


leverages sequences of words to model and generate human-like text. It balances simplicity and
efficiency.

N-gram

An N-gram is a contiguous sequence of n items from a given sample of text or speech. The items can
be letters, words, or base pairs according to the application. N-grams are typically collected from a
text or speech corpus.
An N-gram language model predicts the probability of a given N-gram within any sequence of
words in the language. A good N-gram model can predict the next word in a sentence. The goal is to
estimate the probability of a sentence by decomposing it into a product of conditional word
probabilities using the chain rule as follows:
P(S) = P(W1, W2, W3, W4, ..., Wn)
     = P(W1) P(W2 | W1) P(W3 | W1 W2) P(W4 | W1 W2 W3) ... P(Wn | W1 W2 ... Wn-1)
     = ∏ P(Wi | hi), for i = 1, ..., n

where hi is the history of word Wi, defined as W1, W2, ..., Wi-1.


To calculate a sentence probability, we need the probability of each word given the words that precede it.
An n-gram model simplifies this task by approximating the probability of a word given all previous words
with the conditional probability given only the previous n-1 words.

Thus, an n-gram model calculates P(Wi | hi) by modeling language as a Markov model of order n-1.
Steps in N-gram Modeling
1. Collect and pre-process data: Gather a corpus of text, then clean and tokenize it.

2. Generate n-grams:

Examples of n-grams include unigrams ("This", "article", "is", "on", "NLP") or bigrams ("This article", "article is", "is on", "on NLP").


3. Build transitions dictionary:


o Store frequencies of each n-gram in a dictionary.
4. Estimate probabilities:
o Calculate the probability of each word given its preceding words from the n-gram counts (relative frequencies).

5. Generate text: Use the model to predict and generate sequences of text

Consider an example

"Corpus"

The girl bought a chocolate


The boy ate the chocolate
The girl bought a toy
The girl played with the toy

"Vocabulary"

{the, girl, bought, a,


chocolate, boy, ate, toy,
played, with}

P(bought | The, girl) = count(The, girl, bought) / count(The, girl) = 2/3 ≈ 0.67

P(played | The, girl) = count(The, girl, played) / count(The, girl) = 1/3 ≈ 0.33
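A minimal sketch of steps 3-5 on the toy corpus above, building a bigram transitions dictionary and generating text from it (the tokenization and the stopping rule are illustrative assumptions, not from the notes):

import random
from collections import defaultdict

corpus = [
    "the girl bought a chocolate",
    "the boy ate the chocolate",
    "the girl bought a toy",
    "the girl played with the toy",
]

# Build a bigram transitions dictionary: word -> list of words observed after it
transitions = defaultdict(list)
for sentence in corpus:
    words = sentence.split()
    for prev_word, next_word in zip(words, words[1:]):
        transitions[prev_word].append(next_word)

def generate_bigram_text(start="the", max_words=10):
    # Repeatedly sample the next word given only the current word
    current, result = start, [start]
    for _ in range(max_words - 1):
        followers = transitions.get(current)
        if not followers:      # no observed continuation, so stop
            break
        current = random.choice(followers)
        result.append(current)
    return " ".join(result)

print(generate_bigram_text())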

Limitations

• Data sparsity: Large numbers of possible n-grams, many of which may not appear in the training data.
• Context limitation: Limited to a fixed n words of context, leading to potential incoherence in longer texts.
• Scalability: Higher-order n-grams require significantly more data to produce accurate models.


GRAMMAR
A grammar is a set of rules for generating sentences. Each rule consists of a non-terminal (e.g., _S, _NP, _VP) and its possible expansions. Non-terminals are symbols that can be expanded further, whereas terminals are symbols that cannot be expanded further and appear in the final sentence (e.g., "Python", "trains"). Grammars can be recursive, allowing rules to refer to themselves. This enables the generation of complex and varied sentences. The process involves defining rules for different parts of speech and then expanding those rules until only terminal symbols remain.

Using grammars to model language allows for the generation of diverse and potentially infinite
sentences by defining a set of rules and recursively expanding them. This approach is useful in natural
language processing, computational linguistics and various applications where syntactically correct
sentences need to be generated.

Grammar Example
grammar = {
    "_S": ["_NP _VP"],
    "_NP": ["_N", "_A _NP _P _A _N"],
    "_VP": ["_V", "_V _NP"],
    "_N": ["data science", "Python", "regression"],
    "_A": ["big", "linear", "logistic"],
    "_P": ["about", "near"],
    "_V": ["learns", "trains", "tests", "is"]
}

In the above example, _S represents a sentence, _NP represents a noun phrase, and _VP represents a verb phrase. _N, _A, _P, and _V represent nouns, adjectives, prepositions, and verbs, respectively.
Start with the Sentence Rule: Begin with the non-terminal _S.
Expand Non-terminal _S: Replace _S with its rule, _NP _VP.
Expand _NP: Choose one of the expansions for _NP, say _N.
Replace _N: Choose a terminal for _N, say "Python".
Expand _VP: Choose one of the expansions for _VP, say _V _NP.
Replace _V: Choose a terminal for _V, say "trains".
Expand _NP: Choose one of the expansions for _NP, say _A _NP _P _A _N.
Replace Each Non-terminal: Continue replacing until all are terminals.

To implement the described approach to generating sentences using grammars, the steps are as
follows:

1. Define the grammar: Create a dictionary where each key is a non-terminal and each value is a
list of possible expansions.
2. Helper function to identify terminals: A terminal is a token that doesn't start with an
underscore.
3. Function to expand tokens: Recursively replace non-terminal tokens with one of their possible
expansions until all tokens are terminals.
4. Function to generate sentences: Start with the sentence rule and expand it using the defined
functions.


CODE:

import random

# Define the grammar
grammar = {
    "_S": ["_NP _VP"],
    "_NP": ["_N", "_A _NP _P _A _N"],
    "_VP": ["_V", "_V _NP"],
    "_N": ["data science", "Python", "regression"],
    "_A": ["big", "linear", "logistic"],
    "_P": ["about", "near"],
    "_V": ["learns", "trains", "tests", "is"]
}

# Helper function to identify terminals
def is_terminal(token):
    return token[0] != "_"

# Function to expand tokens
def expand(grammar, tokens):
    for i, token in enumerate(tokens):
        # Skip over terminals
        if is_terminal(token):
            continue
        # Found a non-terminal token, choose a replacement at random
        replacement = random.choice(grammar[token])
        if is_terminal(replacement):
            tokens[i] = replacement
        else:
            tokens = tokens[:i] + replacement.split() + tokens[(i+1):]
        # Recursively expand the new list of tokens
        return expand(grammar, tokens)
    # If we get here, we had all terminals and are done
    return tokens

# Function to generate sentences


def generate_sentence(grammar):
    return ' '.join(expand(grammar, ["_S"]))

# Generate and print a sentence


print(generate_sentence(grammar))

GIBBS SAMPLING
Gibbs sampling is a Markov Chain Monte Carlo (MCMC) technique used to generate samples from a
joint distribution when the conditional distributions are known. It is a technique for obtaining a
sequence of observations that are approximately drawn from a specified multivariate probability
distribution. It is particularly useful for high-dimensional problems where direct sampling is difficult.

In many practical problems, the joint distribution of multiple variables is complex, and direct sampling
is infeasible. Gibbs Sampling allows us to sample from such distributions by breaking the problem
into simpler conditional distributions.


Example: Rolling Two Dice


Consider a simple example to generate samples for two dice. Let x be the value of the first die, and
y be the sum of the two dice. Given the conditional distributions:

• P(y | x) is straightforward: if you know x, then y can be x+1, x+2, ..., x+6.

• P(x | y) is a bit more complicated, but can be derived from the rules of dice.

The Gibbs sampling process involves alternating between these conditional distributions:
1. Initialize (x, y): Start with any valid pair (x, y), for instance (1, 2).

2. Update x given y: Sample x from P(x | y).

3. Update y given x: Sample y from P(y | x).

That is, at each iteration, sample a new x from P(x | y) using the current y, and then sample a new y from P(y | x) using the new x.

4. Repeat: Repeat steps 2 and 3 for a number of iterations.

After sufficient iterations, the values of x and y will approximate samples from the joint distribution
P(x,y). This is due to the Markov property of Gibbs sampling, where each step only depends on the
current state and not on the history of states.

To verify the correctness of Gibbs sampling, compare the distribution of samples obtained
from Gibbs sampling with those from direct sampling. Both methods should yield similar distributions
after a large number of samples, validating that Gibbs sampling is correctly approximating the joint
distribution.

By comparing the counts of outcomes from Gibbs sampling and direct sampling, it is observed that the
Gibbs sampler converges to the same distribution as the direct sampler.

CODE:

import random
from collections import defaultdict

def roll_a_die():
    return random.choice([1, 2, 3, 4, 5, 6])

def direct_sample():
    d1 = roll_a_die()
    d2 = roll_a_die()
    return d1, d1 + d2

def random_y_given_x(x):
    """Equally likely to be x + 1, x + 2, ..., x + 6"""
    return x + roll_a_die()

def random_x_given_y(y):
    if y <= 7:
        # if the total is 7 or less, the first die is equally likely to be
        # 1, 2, ..., (total - 1)
        return random.randrange(1, y)
    else:
        # if the total is 7 or more, the first die is equally likely to be
        # (total - 6), (total - 5), ..., 6
        return random.randrange(y - 6, 7)

def gibbs_sample(num_iters=100):
    x, y = 1, 2  # initial values, doesn't really matter
    for _ in range(num_iters):
        x = random_x_given_y(y)
        y = random_y_given_x(x)
    return x, y

def compare_distributions(num_samples=1000):
    counts = defaultdict(lambda: [0, 0])
    for _ in range(num_samples):
        counts[gibbs_sample()][0] += 1
        counts[direct_sample()][1] += 1
    return counts

# Example usage
num_samples = 1000
distribution_counts = compare_distributions(num_samples)

for outcome, (gibbs_count, direct_count) in distribution_counts.items():
    print(f"Outcome {outcome}: Gibbs = {gibbs_count}, Direct = {direct_count}")

Topic Modeling
Latent Dirichlet Allocation (LDA) is a powerful technique used for topic modeling, which helps us
identify the underlying topics in a collection of documents. The idea is to discover the hidden
topic structure in a corpus of text data. Let us consider an example based on users' stated interests.

Probabilistic Model

• Documents and Topics: For each document, LDA assumes that it is a mixture of topics. Each topic, in turn, is a mixture of words.
• Random Variables: LDA involves two sets of random variables:
  o θd: Distribution of topics for document d.
  o ϕk: Distribution of words for topic k.

LDA Process

1. Initialization: Assign each word in each document a topic randomly.


2. Gibbs Sampling: Iterate over each word in each document to update its topic assignment
based on conditional probabilities.

This involves two steps:


o Calculate the weight for each topic based on current topic distributions in the document
and word distributions in the topic.
o Sample a new topic for the word based on these weights.


Consider the following documents (users' interests):

documents = [
    ["Hadoop", "Big Data", "HBase", "Java", "Spark", "Storm", "Cassandra"],
    ["NoSQL", "MongoDB", "Cassandra", "HBase", "Postgres"],
    ["Python", "scikit-learn", "scipy", "numpy", "statsmodels", "pandas"],
    ["R", "Python", "statistics", "regression", "probability"],
    ["machine learning", "regression", "decision trees", "libsvm"],
    ["Python", "R", "Java", "C++", "Haskell", "programming languages"],
    ["statistics", "probability", "mathematics", "theory"],
    ["machine learning", "scikit-learn", "Mahout", "neural networks"],
    ["neural networks", "deep learning", "Big Data", "artificial intelligence"],
    ["Hadoop", "Java", "MapReduce", "Big Data"],
    ["statistics", "R", "statsmodels"],
    ["C++", "deep learning", "artificial intelligence", "probability"],
    ["pandas", "R", "Python"],
    ["databases", "HBase", "Postgres", "MySQL", "MongoDB"],
    ["libsvm", "regression", "support vector machines"]
]

The aim is to find K = 4 topics.

Document-Topic Counts - Track how many times each topic is assigned to each document.
Topic-Word Counts - Track how many times each word is assigned to each topic.
Topic Counts - Track the total number of words assigned to each topic.
Document Lengths - Track the total number of words in each document.
The conditional probabilities are calculated as:

Probability of Topic Given Document:

P(topic | document) ≈ (number of words in document d assigned to the topic + α) / (number of words in document d + K·α)

Probability of Word Given Topic:

P(word | topic) ≈ (number of times the word is assigned to the topic + β) / (total number of words assigned to the topic + W·β)

Here, α and β are smoothing parameters to ensure non-zero probabilities, K is the number of topics, and W is the number of distinct words.

CODE:

import random
from collections import Counter
import numpy as np

def sample_from(weights):
    """Choose an index i with probability weights[i] / sum(weights)."""
    total = sum(weights)
    rnd = total * random.random()      # uniform between 0 and total
    for i, w in enumerate(weights):
        rnd -= w                       # return the smallest i such that
        if rnd <= 0:                   # weights[0] + ... + weights[i] >= rnd
            return i


def p_topic_given_document(topic, d, alpha=0.1):
    return (document_topic_counts[d][topic] + alpha) / (document_lengths[d] + K * alpha)

def p_word_given_topic(word, topic, beta=0.1):
    return (topic_word_counts[topic][word] + beta) / (topic_counts[topic] + W * beta)

def topic_weight(d, word, k):
    return p_word_given_topic(word, k) * p_topic_given_document(k, d)

def choose_new_topic(d, word):
    return sample_from([topic_weight(d, word, k) for k in range(K)])

# Initialize variables
random.seed(0)
K=4
document_topic_counts = [Counter() for _ in documents]
topic_word_counts = [Counter() for _ in range(K)]
topic_counts = [0 for _ in range(K)]
document_lengths = list(map(len, documents))
distinct_words = set(word for document in documents for word in document)
W = len(distinct_words)
D = len(documents)

# Randomly assign topics to words


document_topics = [[random.randrange(K) for word in document] for document in documents]
for d in range(D):
    for word, topic in zip(documents[d], document_topics[d]):
        document_topic_counts[d][topic] += 1
        topic_word_counts[topic][word] += 1
        topic_counts[topic] += 1

# Perform Gibbs sampling


for iter in range(1000):
    for d in range(D):
        for i, (word, topic) in enumerate(zip(documents[d], document_topics[d])):
            # remove this word / topic from the counts so it does not
            # influence its own new topic assignment
            document_topic_counts[d][topic] -= 1
            topic_word_counts[topic][word] -= 1
            topic_counts[topic] -= 1
            document_lengths[d] -= 1

            # choose a new topic based on the current weights
            new_topic = choose_new_topic(d, word)
            document_topics[d][i] = new_topic

            # add the word / new topic back to the counts
            document_topic_counts[d][new_topic] += 1
            topic_word_counts[new_topic][word] += 1
            topic_counts[new_topic] += 1
            document_lengths[d] += 1

# Output the topics and their most common words


for k, word_counts in enumerate(topic_word_counts):
    print(f"Topic {k}:")
    for word, count in word_counts.most_common():
        if count > 0:
            print(f"{word}: {count}")
    print()

# Assign topic names based on the most common words


topic_names = ["Big Data and programming languages", "Python and statistics",
               "databases", "machine learning"]

# Print the topic distribution for each document


for document, topic_counts in zip(documents, document_topic_counts):
    print(f"Document: {document}")
    for topic, count in topic_counts.most_common():
        if count > 0:
            print(f"{topic_names[topic]}: {count}")
    print()

Network Analysis

Network analysis in data science involves examining the structure and dynamics of networks.
Networks, also known as graphs, consist of nodes (or vertices) and edges (or links) that represent
relationships or interactions between entities. Network analysis provides insights into the patterns,
connections, and behaviors within the network, which can be applied to various fields such as social
sciences, biology, computer science, and more. Network analysis provides a powerful framework for
understanding and leveraging the relationships and interactions within complex data. It enables data
scientists to gain deeper insights, make informed decisions, and solve problems that involve
interconnected entities. Network analysis in data science is crucial because it allows us to understand
and model the relationships and interactions between different entities.

Networks naturally represent complex relationships between entities, such as:

• Social Networks: Relationships between individuals on social media.
• Biological Networks: Interactions between proteins, genes, or species.
• Communication Networks: Connections between devices in a telecommunication network.

Centrality is a key concept in network theory and graph analysis, which focuses on identifying the
most important or influential nodes within a network. Centrality can be measured in several ways,
each capturing different aspects of a node's importance:
1. Degree Centrality: Measures the number of direct connections a node has. Nodes with higher
degrees are considered more central.
2. Betweenness Centrality: Quantifies the number of times a node acts as a bridge along the
shortest path between two other nodes. Nodes with high betweenness centrality play a critical
role in the flow of information within the network.
3. Eigenvector Centrality: Measures a node's influence based on the number and quality of its
connections. A node is considered central if it is connected to other nodes that are themselves
central. It is often used in identifying influential nodes in social networks.
4. PageRank: A variant of eigenvector centrality, originally used by Google to rank web pages in
search results. It measures the probability that a random walker will land on a particular node,
considering both the quantity and quality of incoming links.

Each centrality measure provides different insights into the network structure and the roles of
individual nodes. The choice of centrality measure depends on the specific context and the aspect of importance being investigated.


Betweenness Centrality

Betweenness centrality is a measure of centrality in a graph based on the shortest paths. It quantifies
the number of times a node acts as a bridge along the shortest path between two other nodes. Nodes
with high betweenness centrality are crucial for the flow of information through the network.

Definition

For a given graph G = (V, E), where V is the set of vertices and E is the set of edges:

• σst is the total number of shortest paths from node s to node t.
• σst(v) is the number of those shortest paths that pass through v.

The betweenness centrality of node v is then the sum of σst(v) / σst over all pairs of nodes s ≠ v ≠ t.

Steps to calculate betweenness centrality for the DataSciencester network:

1. Setup the Users and Friendships:

• Define the users and their friendships.
• Add lists to each user to track their friends.

Code:
from collections import deque

# Define users
users = [
{ "id": 0, "name": "Hero" },
{ "id": 1, "name": "Dunn" },
{ "id": 2, "name": "Sue" },
{ "id": 3, "name": "Chi" },
{ "id": 4, "name": "Thor" },
{ "id": 5, "name": "Clive" },
{ "id": 6, "name": "Hicks" },
{ "id": 7, "name": "Devin" },
{ "id": 8, "name": "Kate" },
{ "id": 9, "name": "Klein" }
]

# Define friendships
friendships = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4),
(4, 5), (5, 6), (5, 7), (6, 8), (7, 8), (8, 9)]

# Add friends lists to each user


for user in users:
    user["friends"] = []

for i, j in friendships:
    users[i]["friends"].append(users[j])  # add j as a friend of i
    users[j]["friends"].append(users[i])  # add i as a friend of j


2. Compute the Shortest Paths:


• Use Breadth-First Search (BFS) to find the shortest paths from a given user to every other user.
• Store all the shortest paths in a dictionary.

Code:
def shortest_paths_from(from_user):
    # dictionary from user_id to all shortest paths to that user;
    # the only path to from_user itself is the empty path
    shortest_paths_to = { from_user["id"]: [[]] }

    # a queue of (previous user, next user) pairs still to explore
    frontier = deque((from_user, friend) for friend in from_user["friends"])

    while frontier:
        prev_user, user = frontier.popleft()
        user_id = user["id"]

        # every shortest path to prev_user, extended by user_id, is a candidate path to user
        paths_to_prev_user = shortest_paths_to[prev_user["id"]]
        new_paths_to_user = [path + [user_id] for path in paths_to_prev_user]

        # we may already know some shortest paths to user
        old_paths_to_user = shortest_paths_to.get(user_id, [])

        if old_paths_to_user:
            min_path_length = len(old_paths_to_user[0])
        else:
            min_path_length = float('inf')

        # keep only candidate paths that are no longer than what we already have
        new_paths_to_user = [path for path in new_paths_to_user
                             if len(path) <= min_path_length and path not in old_paths_to_user]

        shortest_paths_to[user_id] = old_paths_to_user + new_paths_to_user

        # add never-seen neighbors to the frontier
        frontier.extend((user, friend) for friend in user["friends"]
                        if friend["id"] not in shortest_paths_to)

    return shortest_paths_to

for user in users:
    user["shortest_paths"] = shortest_paths_from(user)

3. Compute Betweenness Centrality:

• For each pair of nodes, find all shortest paths between them.
• For each node on these paths (excluding the source and target), update its betweenness centrality score.

Code:
# Initialize betweenness centrality for each user
for user in users:
    user["betweenness_centrality"] = 0.0

# Compute betweenness centrality
for source in users:
    source_id = source["id"]
    for target_id, paths in source["shortest_paths"].items():
        if source_id < target_id:          # count each pair only once
            num_paths = len(paths)
            contrib = 1 / num_paths        # each shortest path contributes equally
            for path in paths:
                for id in path:
                    if id not in [source_id, target_id]:
                        users[id]["betweenness_centrality"] += contrib

# Print the results


for user in users:
    print(f"User {user['name']} has betweenness centrality: {user['betweenness_centrality']:.4f}")

Eigenvector centrality
Eigenvector centrality is a measure of the influence of a node in a network. Unlike other centrality
measures, it assigns relative scores to all nodes in the network based on the concept that connections to
high-scoring nodes contribute more to the score of the node in question. It measures a node's influence
based on the number and quality of its connections and is often used to identify influential nodes in
social networks.

Example:

Consider a simple network with four nodes A, B, C, and D, with the following adjacency matrix A (rows and columns ordered A, B, C, D):

A = [[0, 1, 1, 0],
     [1, 0, 1, 1],
     [1, 1, 0, 1],
     [0, 1, 1, 0]]

This adjacency matrix represents the following graph:


• A is connected to B and C.
• B is connected to A, C, and D.
• C is connected to A, B, and D.
• D is connected to B and C.
For matrix A, the eigenvalues and eigenvectors are found by solving the characteristic equation det(A − λI) = 0.

Eigenvalues: λ1 = (1 + √17)/2 ≈ 2.56, λ2 = (1 − √17)/2 ≈ −1.56, λ3 = −1, λ4 = 0.

The principal eigenvector (for λ1 ≈ 2.56) has the form (a, b, b, a), reflecting the symmetry of the graph: A and D play the same structural role, as do B and C.


Normalize the Principal Eigenvector:

Normalizing the principal eigenvector to unit length gives eigenvector centrality scores for nodes A, B, C, and D of approximately 0.44, 0.56, 0.56, and 0.44, respectively. Nodes B and C have the highest centrality scores, indicating they are the most influential nodes in this network, which matches the fact that they have the most connections.

Code:

import numpy as np

def normalize(v):
    norm = np.linalg.norm(v)
    if norm == 0:
        return v
    return v / norm

def eigenvector_centrality(A, tolerance=1e-6, max_iter=1000):
    n = len(A)
    v = np.random.rand(n)        # start from a random vector
    v = normalize(v)

    for _ in range(max_iter):
        v_new = np.dot(A, v)     # power iteration: repeatedly multiply by A
        v_new = normalize(v_new)

        if np.linalg.norm(v - v_new) < tolerance:
            return v_new

        v = v_new

    raise Exception("Eigenvector centrality did not converge")

# Example usage:
A = np.array([
[0, 1, 1, 0],
[1, 0, 1, 1],
[1, 1, 0, 1],
[0, 1, 1, 0]
])

centrality = eigenvector_centrality(A)
print("Eigenvector Centrality:", centrality)


PageRank
The PageRank algorithm, introduced by Google, evaluates the importance of nodes in a directed
graph based on their connections. This is especially useful in social networks, where endorsements
can signify the respect or influence of a person within a community.
The PageRank algorithm is a method for ranking nodes in a directed graph, such as webpages in a
network of hyperlinks. The algorithm assigns a numerical weight to each node, representing the
likelihood that a person randomly clicking on links will arrive at that particular node.
Algorithm
• Each node starts with an equal PageRank value.
• The total PageRank across all nodes is normalized to 1.
• PageRank is distributed from each node to its outbound links.
• A damping factor is used to simulate the probability that a person will continue clicking on
links (typically set to 0.85).
• The algorithm iterates until the PageRank values stabilize (i.e., change very little between
iterations).

The main idea behind PageRank is that a page is considered important if many other important pages
link to it. The algorithm works as follows:

1. Every page is assigned an initial rank. If there are N pages, each page is typically assigned a
rank of 1/N.
2. The rank of a page is determined by the ranks of the pages linking to it. If a page P has
incoming links from pages A1,A2,...,Am with ranks. R(A1),R(A2),...,R(Am), and if each page Ai
has L(Ai) outgoing links, then the rank of page P is calculated as:

R(P) = (1 − d)/N + d · [ R(A1)/L(A1) + R(A2)/L(A2) + ... + R(Am)/L(Am) ]

Here, d is a damping factor, usually set to 0.85, which accounts for the probability that a user
continues clicking on links rather than starting a new search.

3. The rank values are updated iteratively until they converge to a stable set of values.

PageRank revolutionized web search by providing a way to rank pages based on their relevance and
importance rather than just keyword matching.
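A minimal sketch of the iterative computation on a tiny directed graph (the example graph and the fixed number of iterations are illustrative assumptions, not from the notes):

def page_rank(outlinks, damping=0.85, num_iters=100):
    # outlinks maps each node to the list of nodes it links to
    nodes = list(outlinks.keys())
    n = len(nodes)
    ranks = {node: 1 / n for node in nodes}          # every node starts with equal rank

    for _ in range(num_iters):
        new_ranks = {node: (1 - damping) / n for node in nodes}
        for node, targets in outlinks.items():
            if targets:
                share = damping * ranks[node] / len(targets)
                for target in targets:               # distribute rank over outbound links
                    new_ranks[target] += share
            else:
                for target in nodes:                 # dangling node: spread rank evenly
                    new_ranks[target] += damping * ranks[node] / n
        ranks = new_ranks
    return ranks

# Example: A links to B and C, B links to C, C links to A, D links to C
outlinks = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
print(page_rank(outlinks))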

Recommender Systems
A recommendation system is a type of information filtering system that predicts and suggests items or
content to users based on various criteria. These systems are widely used in various domains such as e-
commerce, streaming services, social media, and content websites to enhance user experience by
providing personalized recommendations.
Recommending what's popular is a straightforward approach in recommendation systems. This
method involves identifying the most popular items based on user interactions and then recommending
these items to users.


Steps:
• Collect Data: Gather data on user interactions with items. For example, this could be a list of
items each user is interested in.
• Calculate Popularity: Count how often each item appears in the list of user interests.
• Sort by Popularity: Sort the items by their popularity
• Recommend Popular Items: Recommend the top N popular items that a user has not already
interacted with.

Code:
from collections import Counter
from typing import List, Tuple

users_interests = [
["Hadoop", "Big Data", "HBase", "Java", "Spark", "Storm", "Cassandra"],
["NoSQL", "MongoDB", "Cassandra", "HBase", "Postgres"],
["Python", "scikit-learn", "scipy", "numpy", "statsmodels", "pandas"],
["R", "Python", "statistics", "regression", "probability"],
["machine learning", "regression", "decision trees", "libsvm"],
["Python", "R", "Java", "C++", "Haskell", "programming languages"],
["statistics", "probability", "mathematics", "theory"],
["machine learning", "scikit-learn", "Mahout", "neural networks"],
["neural networks", "deep learning", "Big Data", "artificial intelligence"],
["Hadoop", "Java", "MapReduce", "Big Data"],
["statistics", "R", "statsmodels"],
["C++", "deep learning", "artificial intelligence", "probability"],
["pandas", "R", "Python"],
["databases", "HBase", "Postgres", "MySQL", "MongoDB"],
["libsvm", "regression", "support vector machines"]
]

# Count how often each interest appears across all users
popular_interests = Counter(interest
                            for user_interests in users_interests
                            for interest in user_interests)

popular_interests_sorted = popular_interests.most_common()
print(popular_interests_sorted)

def most_popular_new_interests(user_interests: List[str], max_results: int = 5) -> List[Tuple[str, int]]:
    # Filter out interests the user already has
    suggestions = [(interest, frequency)
                   for interest, frequency in popular_interests_sorted
                   if interest not in user_interests]
    return suggestions[:max_results]

# Example usage:
user_id = 1 # User with interests ["NoSQL", "MongoDB", "Cassandra", "HBase", "Postgres"]
recommendations = most_popular_new_interests(users_interests[user_id], 5)
print(f"Recommendations for user {user_id}: {recommendations}")


Types of recommendations

Recommendation systems typically use two approaches, content-based filtering and collaborative filtering, to make recommendations.

Content-Based Filtering

This approach recommends items based on user preferences. It matches the user's requirements by considering their past actions, detected patterns, and any explicit feedback they have provided, and makes recommendations accordingly.

Example: If a person prefers chocolate and purchases a chocolate ice cream, then the next time they raise a query, the system will scan for options related to chocolate and recommend trying a chocolate cake.
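A minimal content-based sketch, assuming each item is described by a set of feature tags and recommending items whose tags overlap most with an item the user already liked (all names and data here are illustrative, not from the notes):

# Illustrative item descriptions: item -> set of feature tags
item_tags = {
    "chocolate ice cream": {"chocolate", "dessert", "cold"},
    "chocolate cake":      {"chocolate", "dessert", "baked"},
    "vanilla ice cream":   {"vanilla", "dessert", "cold"},
    "fruit salad":         {"fruit", "dessert", "fresh"},
}

def jaccard(a, b):
    # Similarity between two tag sets: intersection size over union size
    return len(a & b) / len(a | b)

def content_based_recommendations(liked_item, top_n=2):
    scores = [(item, jaccard(item_tags[liked_item], tags))
              for item, tags in item_tags.items() if item != liked_item]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]

print(content_based_recommendations("chocolate ice cream"))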

Collaborative Filtering

This approach uses similarities between users and items simultaneously, to provide recommendations.
It is the idea of recommending an item or making a prediction, depending on other like-minded
individuals. It could comprise a set of users, items, opinions about an item, ratings, reviews, or
purchases.

Example: Suppose Persons A and B both like chocolate and have both tried the ice cream and the cake. If Person A then buys chocolate biscuits, the system will recommend chocolate biscuits to Person B.

User-Based Collaborative Filtering

User-Based Collaborative Filtering is a technique used to predict the items that a user might like on
the basis of ratings given to those items by other users who have similar tastes to the target user. Many websites use collaborative filtering to build their recommendation systems.


Steps for User-Based Collaborative Filtering:

1. Collect Unique Interests:

unique_interests = sorted({interest
                           for user_interests in users_interests
                           for interest in user_interests})

2. Create User Interest Vectors:

def make_user_interest_vector(user_interests):
    return [1 if interest in user_interests else 0 for interest in unique_interests]

user_interest_matrix = list(map(make_user_interest_vector, users_interests))

3. Compute Pairwise User Similarities:

user_similarities = [[cosine_similarity(interest_vector_i, interest_vector_j)
                      for interest_vector_j in user_interest_matrix]
                     for interest_vector_i in user_interest_matrix]
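The cosine_similarity helper used in step 3 is not defined in these notes; a minimal sketch, assuming the interest vectors are plain Python lists of 0/1 values:

import math

def dot(v, w):
    return sum(v_i * w_i for v_i, w_i in zip(v, w))

def cosine_similarity(v, w):
    # 1.0 when the vectors point in the same direction, 0.0 when they share no interests
    denominator = math.sqrt(dot(v, v)) * math.sqrt(dot(w, w))
    return dot(v, w) / denominator if denominator else 0.0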
4. Find Most Similar Users:

def most_similar_users_to(user_id):
    pairs = [(other_user_id, similarity)
             for other_user_id, similarity in enumerate(user_similarities[user_id])
             if user_id != other_user_id and similarity > 0]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)

5. Suggest New Interests:

from collections import defaultdict

def user_based_suggestions(user_id, include_current_interests=False):
    # Sum up the similarities of other users who share each interest
    suggestions = defaultdict(float)
    for other_user_id, similarity in most_similar_users_to(user_id):
        for interest in users_interests[other_user_id]:
            suggestions[interest] += similarity

    # Sort the suggestions by total weight, largest first
    suggestions = sorted(suggestions.items(), key=lambda pair: pair[1], reverse=True)

    if include_current_interests:
        return suggestions
    else:
        return [(suggestion, weight)
                for suggestion, weight in suggestions
                if suggestion not in users_interests[user_id]]

6. Users and Interests

users_interests = [
["Hadoop", "Big Data", "HBase", "Java", "Spark", "Storm", "Cassandra"],
["NoSQL", "MongoDB", "Cassandra", "HBase", "Postgres"],
["Python", "scikit-learn", "scipy", "numpy", "statsmodels", "pandas"],
["R", "Python", "statistics", "regression", "probability"],
["machine learning", "regression", "decision trees", "libsvm"],

["Python", "R", "Java", "C++", "Haskell", "programming languages"],


["statistics", "probability", "mathematics", "theory"],

["machine learning", "scikit-learn", "Mahout", "neural networks"],


["neural networks", "deep learning", "Big Data", "artificial intelligence"],
["Hadoop", "Java", "MapReduce", "Big Data"],


["statistics", "R", "statsmodels"],


["C++", "deep learning", "artificial intelligence", "probability"],
["pandas", "R", "Python"],
["databases", "HBase", "Postgres", "MySQL", "MongoDB"],
["libsvm", "regression", "support vector machines"]
]

Item -Based Collaborative Filtering

Item-Based Collaborative Filtering is a recommendation system technique that focuses on the


similarity between items to make recommendations to users. Instead of finding similar users, it finds
items similar to the ones a user has shown interest in.

Steps in Item-Based Collaborative filtering


1. Create the User-Item Matrix: Collect data on user interactions with items.

user_interest_matrix = [
    [1, 0, 1],  # User 0
    [0, 1, 0],  # User 1
    [1, 1, 0]   # User 2
]
unique_interests = ["Interest A", "Interest B", "Interest C"]

2. Compute Item Similarities: Calculate the similarity score for each pair of items.

import numpy as np
def cosine_similarity(v1, v2):
    dot_product = np.dot(v1, v2)
    norm_v1 = np.linalg.norm(v1)
    norm_v2 = np.linalg.norm(v2)
    return dot_product / (norm_v1 * norm_v2)

3. Build the Item-Item Similarity Matrix: Store these scores in an item-item similarity matrix.

def calculate_similarities(matrix):
    # matrix has one row per item; compare rows to get item-item similarities
    num_items = len(matrix)
    similarities = np.zeros((num_items, num_items))

    for i in range(num_items):
        for j in range(num_items):
            if i != j:
                similarities[i][j] = cosine_similarity(matrix[i], matrix[j])

    return similarities

# Transpose so that each row corresponds to an interest (item) rather than a user
interest_user_matrix = np.array(user_interest_matrix).T
interest_similarities = calculate_similarities(interest_user_matrix)


4. Generate Recommendations:

For a given user, identify items similar to the items they have interacted with. Aggregate these similar items to form a ranked list of recommendations.

def item_based_suggestions(user_id, user_interest_matrix, interest_similarities, unique_interests):
    user_interests = user_interest_matrix[user_id]
    scores = np.zeros(len(user_interests))

    # Add up the similarities of every item the user is already interested in
    for i, interested in enumerate(user_interests):
        if interested:
            scores += interest_similarities[i]

    # Only recommend items the user has not interacted with yet
    recommendations = [(unique_interests[i], score)
                       for i, score in enumerate(scores)
                       if user_interests[i] == 0]

    recommendations.sort(key=lambda x: x[1], reverse=True)
    return recommendations

recommendations = item_based_suggestions(0, user_interest_matrix, interest_similarities, unique_interests)
print(recommendations)

Matrix Factorization
• Matrix factorization is one of the most widely used machine learning techniques for building recommendation models.
• It enables the system to gauge the customer's exact purchase intent, scan numerous items, shortlist and rank the right product or service, and recommend multiple available options.
• Once the output matches the requirement, the lead translates into a transaction.
• When an individual raises a query on a search engine, the system uses matrix factorization to generate an output in the form of recommendations.
• Matrix factorization is an extensively used technique in collaborative filtering recommendation systems.
• Its objective is to factorize a user-item matrix into two low-rank matrices, the user-factor matrix and the item-factor matrix, whose product can predict new items that users might be interested in (a minimal sketch follows below).
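A minimal sketch of factorizing a small user-item rating matrix into two low-rank factors with gradient descent (the ratings, number of factors, learning rate, and regularization value are all illustrative assumptions, not from the notes):

import random

# Toy user-item rating matrix; 0 means "not rated" (illustrative values)
ratings = [
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [0, 0, 5, 4],
]

num_users, num_items, num_factors = len(ratings), len(ratings[0]), 2
random.seed(0)

# Low-rank user-factor and item-factor matrices, initialized randomly
user_factors = [[random.random() for _ in range(num_factors)] for _ in range(num_users)]
item_factors = [[random.random() for _ in range(num_factors)] for _ in range(num_items)]

def predict(u, i):
    # Predicted rating is the dot product of the user's and item's factor vectors
    return sum(user_factors[u][f] * item_factors[i][f] for f in range(num_factors))

learning_rate, regularization = 0.01, 0.02
for _ in range(5000):
    for u in range(num_users):
        for i in range(num_items):
            if ratings[u][i] > 0:                 # train only on observed ratings
                error = ratings[u][i] - predict(u, i)
                for f in range(num_factors):
                    u_f, i_f = user_factors[u][f], item_factors[i][f]
                    user_factors[u][f] += learning_rate * (error * i_f - regularization * u_f)
                    item_factors[i][f] += learning_rate * (error * u_f - regularization * i_f)

# Predict an unobserved entry, e.g. how user 0 might rate item 2
print(round(predict(0, 2), 2))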


The (i, j)th entry of the matrix product AB is the sum over k of A[i][k] * B[k][j], which is just the dot product of the ith row of A with the jth column of B:

def matrix_product_entry(A, B, i, j):
    return dot(get_row(A, i), get_column(B, j))

def matrix_multiply(A, B):
    n1, k1 = shape(A)
    n2, k2 = shape(B)
    if k1 != n2:
        raise ArithmeticError("incompatible shapes!")
    return make_matrix(n1, k2, partial(matrix_product_entry, A, B))
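The helpers used above (dot, shape, get_row, get_column, make_matrix, and partial) are assumed from earlier material and are not defined in these notes; a minimal sketch consistent with the list-of-lists matrix representation used here:

from functools import partial

def dot(v, w):
    return sum(v_i * w_i for v_i, w_i in zip(v, w))

def shape(A):
    num_rows = len(A)
    num_cols = len(A[0]) if A else 0
    return num_rows, num_cols

def get_row(A, i):
    return A[i]

def get_column(A, j):
    return [A_i[j] for A_i in A]

def make_matrix(num_rows, num_cols, entry_fn):
    # Build a num_rows x num_cols matrix whose (i, j)th entry is entry_fn(i, j)
    return [[entry_fn(i, j) for j in range(num_cols)] for i in range(num_rows)]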

Notice that if A is an n × k matrix and B is a k × 1 matrix, then AB is an n × 1 matrix.

If we think of a vector as a one-column matrix, we can treat A as a function that maps k-dimensional vectors to n-dimensional vectors, where the function is just matrix multiplication, i.e.:

v = [1, 2, 3]
v_as_matrix = [[1],
               [2],
               [3]]

Therefore:

def vector_as_matrix(v):
    """returns the vector v (represented as a list) as an n x 1 matrix"""
    return [[v_i] for v_i in v]


def vector_from_matrix(v_as_matrix):
    """returns the n x 1 matrix as a list of values"""
    return [row[0] for row in v_as_matrix]

We can then define the matrix operation using matrix_multiply:

def matrix_operate(A, v):
    v_as_matrix = vector_as_matrix(v)
    product = matrix_multiply(A, v_as_matrix)
    return vector_from_matrix(product)

When A is a square matrix, this operation maps n-dimensional vectors to other n-dimensional vectors.
