DSA Module 5 Notes
Introduction
Natural Language Processing (NLP) is a field of artificial intelligence that focuses on the interaction
between computers and humans through natural language. The goal of NLP is to enable computers to
understand, interpret, and generate human language in a way that is both meaningful and useful.
Applications of NLP
1. Chatbots and Virtual Assistants
2. Text Analysis and Sentiment Analysis
3. Machine Translation
4. Speech Recognition and Voice Assistants
5. Information Retrieval and Search Engines
6. Language Modeling and Understanding
7. Recommendation Systems.
8. Healthcare and Medical Applications
Word cloud
A word cloud, also known as a tag cloud, is a visual representation of text data. It highlights the most
frequently occurring words within a body of text by making them larger and bolder compared to less
frequent words. Word clouds are often used to quickly identify prominent terms in large text datasets
and provide an at-a-glance summary of the content.
Example: The word cloud is created from a list of data science-related buzzwords, where each word is
associated with two numbers: how frequently it appears in job postings and how frequently it appears
on resumes
Visualization: Words are arranged in a layout, with their size proportional to their frequency. The arrangement often ensures that words fit together compactly, creating a dense and readable cloud.
Example Data
data = [ ("big data", 100, 15), ("Hadoop", 95, 25), ("Python", 75, 50), ("R", 50, 40), ("machine
learning", 80, 20), ("statistics", 20, 60), ("data science", 60, 70), ("analytics", 90, 3), ("team player", 85,
85), ("dynamic", 2, 90), ("synergies", 70, 0), ("actionable insights", 40, 30), ("think out of the box",
45, 10), ("self-starter", 30, 50), ("customer focus", 65, 15), ("thought leadership", 35, 35)]
Here each tuple contains the word, how frequently it appears in job postings, and how frequently it appears on resumes. Plotting the words on a scatter plot with these two frequencies as the axes adds more value and clarity compared to a traditional word cloud.
Limitations
1. Lack of Precision: Word clouds don't provide exact counts or clear comparisons between
words.
2. Random Placement: The position of words in a word cloud is arbitrary and doesn't convey any
additional information.
3. Space Usage: Words are placed wherever there is space, which can make the visualization cluttered and less readable.
Python Code
data = [
("big data", 100, 15),
("Hadoop", 95, 25),
("Python", 75, 50),
("R", 50, 40),
("machine learning", 80, 20),
("statistics", 20, 60),
("data science", 60, 70),
("analytics", 90, 3),
("team player", 85, 85),
("dynamic", 2, 90),
("synergies", 70, 0),
("actionable insights", 40, 30),
("think out of the box", 45, 10),
("self-starter", 30, 50),
("customer focus", 65, 15),
("thought leadership", 35, 35)
]
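A plotting sketch along these lines, assuming matplotlib is available; the text_size scaling used here is one reasonable choice, not prescribed by the notes:
import matplotlib.pyplot as plt

def text_size(total):
    # scale font size: 8 pt if total is 0, up to 28 pt if total is 200
    return 8 + total / 200 * 20

for word, job_popularity, resume_popularity in data:
    plt.text(job_popularity, resume_popularity, word,
             ha='center', va='center',
             size=text_size(job_popularity + resume_popularity))
plt.xlabel("Popularity on Job Postings")
plt.ylabel("Popularity on Resumes")
plt.axis([0, 100, 0, 100])
plt.show()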
N-Gram Model
Language modeling is the way of determining the probability of any sequence of words.
Language modeling is used in a wide variety of applications such as Speech Recognition, Spam
filtering, etc.
N-gram
N-gram can be defined as the contiguous sequence of n items from a given sample of text or
speech. The items can be letters, words, or base pairs according to the application. The N-grams
typically are collected from a text or speech corpus.
An N-gram language model predicts the probability of a given N-gram within any sequence of
words in the language. A good N-gram model can predict the next word in the sentence. The goal is to
estimate the probability of a word by decomposing a sentence probability into a product of conditional
probabilities using chain rule as follows:
P(S) = P(W1, W2, W3, W4, ..., Wn)
     = P(W1) P(W2 | W1) P(W3 | W1, W2) P(W4 | W1, W2, W3) ... P(Wn | W1, W2, ..., Wn-1)
Equivalently, log P(S) = Σi log P(Wi | W1, ..., Wi-1).
Thus an n-gram model calculates P(Wi | hi), where the history hi is truncated to the previous n-1 words; that is, it models language as a Markov model of order n-1.
Steps in N-gram Modeling
1. Collect and pre-process data: Gather a corpus of text, then clean and tokenize it.
2. Generate n-grams: For example, unigrams ("This", "article", "is", "on", "NLP") or bigrams ("This article", "article is", "is on", "on NLP").
3. Count n-grams: Count how often each n-gram occurs in the corpus.
4. Estimate probabilities: Compute the conditional probability of each word given its preceding n-1 words from the counts.
5. Generate text: Use the model to predict and generate sequences of text.
Consider an example with a small corpus and its vocabulary (the set of distinct words in the corpus), as sketched below.
Limitations
Data sparsity: Large number of possible n-grams, many of which may not appear in the training data.
Context limitation: Limited to fixed 'n' words of context, leading to potential incoherence in longer
texts.
Scalability: Higher-order n-grams require significantly more data to produce accurate models.
GRAMMAR
A grammar is a set of rules for generating sentences. Each rule consists of a non-terminal (e.g., _S, _NP, _VP) and its possible expansions. Non-terminals are symbols that can be expanded further, whereas terminals are symbols that cannot be expanded further and appear in the final sentence (e.g., "Python", "trains"). Grammars can be recursive, allowing rules to refer to themselves, which enables the generation of complex and varied sentences. The process involves defining rules for different parts of speech and then expanding those rules until only terminal symbols remain.
Using grammars to model language allows for the generation of diverse and potentially infinite
sentences by defining a set of rules and recursively expanding them. This approach is useful in natural
language processing, computational linguistics and various applications where syntactically correct
sentences need to be generated.
Grammar Example
grammar = {
    "_S":  ["_NP _VP"],
    "_NP": ["_N", "_A _NP _P _A _N"],
    "_VP": ["_V", "_V _NP"],
    "_N":  ["data science", "Python", "regression"],
    "_A":  ["big", "linear", "logistic"],
    "_P":  ["about", "near"],
    "_V":  ["learns", "trains", "tests", "is"]
}
To implement the described approach to generating sentences using grammars, the steps are as
follows:
1. Define the grammar: Create a dictionary where each key is a non-terminal and each value is a
list of possible expansions.
2. Helper function to identify terminals: A terminal is a token that doesn't start with an
underscore.
3. Function to expand tokens: Recursively replace non-terminal tokens with one of their possible
expansions until all tokens are terminals.
4. Function to generate sentences: Start with the sentence rule and expand it using the defined
functions.
CODE:
import random

# Define the grammar
grammar = {
    "_S":  ["_NP _VP"],
    "_NP": ["_N", "_A _NP _P _A _N"],
    "_VP": ["_V", "_V _NP"],
    "_N":  ["data science", "Python", "regression"],
    "_A":  ["big", "linear", "logistic"],
    "_P":  ["about", "near"],
    "_V":  ["learns", "trains", "tests", "is"]
}

# Helper function to identify terminals
def is_terminal(token):
    return token[0] != "_"

# Function to expand tokens
def expand(grammar, tokens):
    for i, token in enumerate(tokens):
        # Skip over terminals
        if is_terminal(token):
            continue
        # Found a non-terminal token, so choose a replacement at random
        replacement = random.choice(grammar[token])
        if is_terminal(replacement):
            tokens[i] = replacement
        else:
            tokens = tokens[:i] + replacement.split() + tokens[(i+1):]
        # Recursively expand the new list of tokens
        return expand(grammar, tokens)
    # If we get here, we had all terminals and are done
    return tokens
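Step 4 above can be completed with a small wrapper that starts from the sentence rule _S; a minimal sketch (the function name is a natural choice, not fixed by the notes):
# Function to generate sentences: start with the sentence rule and expand it
def generate_sentence(grammar):
    return expand(grammar, ["_S"])

# Example usage
print(" ".join(generate_sentence(grammar)))
# e.g. "logistic data science about big Python learns"  (output varies randomly)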
GIBBS SAMPLING
Gibbs sampling is a Markov Chain Monte Carlo (MCMC) technique used to generate samples from a joint distribution when the conditional distributions are known. It is a technique for obtaining a sequence of observations that approximate a specified multivariate probability distribution, and it is particularly useful for high-dimensional problems where direct sampling is difficult. In many practical problems the joint distribution of multiple variables is complex and direct sampling is infeasible. Gibbs sampling allows us to sample from such distributions by breaking the problem into a sequence of simpler ones: each variable is sampled in turn from its conditional distribution given the current values of the other variables.
The Gibbs sampling process involves alternating between these conditional distributions:
1. Initialize (x, y): Start with any valid pair (x, y), for instance (1, 2).
2. Sample x from the conditional distribution P(x | y), keeping y fixed.
3. Sample y from the conditional distribution P(y | x), keeping x fixed.
4. Repeat steps 2 and 3 for many iterations.
After sufficient iterations, the values of x and y will approximate samples from the joint distribution P(x, y). This is due to the Markov property of Gibbs sampling, where each step only depends on the current state and not on the history of states.
To verify the correctness of Gibbs sampling, compare the distribution of samples obtained
from Gibbs sampling with those from direct sampling. Both methods should yield similar distributions
after a large number of samples, validating that Gibbs sampling is correctly approximating the joint
distribution.
By comparing the counts of outcomes from Gibbs sampling and direct sampling, it is observed that
Gibbs sampler converges to the same distribution as the direct sampler.
CODE:
import random
from collections import defaultdict

def roll_a_die():
    return random.choice([1, 2, 3, 4, 5, 6])

def direct_sample():
    d1 = roll_a_die()
    d2 = roll_a_die()
    return d1, d1 + d2

def random_y_given_x(x):
    """Equally likely to be x + 1, x + 2, ..., x + 6."""
    return x + roll_a_die()

def random_x_given_y(y):
    if y <= 7:
        # if the total is 7 or less, the first die is equally likely to be
        # 1, 2, ..., (total - 1)
        return random.randrange(1, y)
    else:
        # if the total is 7 or more, the first die is equally likely to be
        # (total - 6), (total - 5), ..., 6
        return random.randrange(y - 6, 7)

def gibbs_sample(num_iters=100):
    x, y = 1, 2  # initial values, they don't really matter
    for _ in range(num_iters):
        x = random_x_given_y(y)
        y = random_y_given_x(x)
    return x, y

def compare_distributions(num_samples=1000):
    counts = defaultdict(lambda: [0, 0])
    for _ in range(num_samples):
        counts[gibbs_sample()][0] += 1
        counts[direct_sample()][1] += 1
    return counts
# Example usage
num_samples = 1000
distribution_counts = compare_distributions(num_samples)
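A small follow-up sketch (not from the source) that prints the two samplers' counts side by side for comparison:
# Print how often each (die1, total) outcome occurred under each sampler
print("(x, y)      gibbs   direct")
for outcome in sorted(distribution_counts):
    gibbs_count, direct_count = distribution_counts[outcome]
    print(f"{outcome}    {gibbs_count:5d}   {direct_count:6d}")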
Topic Modeling
Latent Dirichlet Allocation (LDA) is a powerful technique used for topic modeling, which helps us
identify the underlying topics in a collection of documents. The idea is to discover the hidden topical structure in a corpus of text data. Let us consider an example based on users' stated interests.
Probabilistic Model
Documents and Topics: For each document, LDA assumes that it is a mixture of topics. For
each topic, it is a mixture of words.
Random Variables: LDA involves two sets of random variables:
o θd: Distribution of topics for document d.
o ϕk: Distribution of words for topic k.
LDA Process
documents = [
["Hadoop", "Big Data", "HBase", "Java", "Spark", "Storm", "Cassandra"],
["NoSQL", "MongoDB", "Cassandra", "HBase", "Postgres"],
["Python", "scikit-learn", "scipy", "numpy", "statsmodels", "pandas"],
["R", "Python", "statistics", "regression", "probability"],
["machine learning", "regression", "decision trees", "libsvm"],
["Python", "R", "Java", "C++", "Haskell", "programming languages"],
["statistics", "probability", "mathematics", "theory"],
["machine learning", "scikit-learn", "Mahout", "neural networks"],
["neural networks", "deep learning", "Big Data", "artificial intelligence"],
["Hadoop", "Java", "MapReduce", "Big Data"],
["statistics", "R", "statsmodels"],
["C++", "deep learning", "artificial intelligence", "probability"],
["pandas", "R", "Python"],
["databases", "HBase", "Postgres", "MySQL", "MongoDB"],
["libsvm", "regression", "support vector machines"]
Document-Topic Counts - Track how many times each topic is assigned to each document.
Topic-Word Counts - Track how many times each word is assigned to each topic.
Topic Counts - Track the total number of words assigned to each topic.
Document Lengths - Track the total number of words in each document.
Conditional probabilities are calculated with smoothing (α and β are small smoothing constants, K is the number of topics, and W is the number of distinct words):
P(topic | document) ≈ (number of words in the document assigned to the topic + α) / (length of the document + K·α)
P(word | topic) ≈ (number of times the word is assigned to the topic + β) / (total number of words assigned to the topic + W·β)
CODE:
import random
from collections import Counter

def sample_from(weights):
    # return index i with probability weights[i] / sum(weights)
    total = sum(weights)
    rnd = total * random.random()
    for i, w in enumerate(weights):
        rnd -= w
        if rnd <= 0:
            return i

# Initialize variables
random.seed(0)
K = 4  # number of topics
document_topic_counts = [Counter() for _ in documents]
topic_word_counts = [Counter() for _ in range(K)]
topic_counts = [0 for _ in range(K)]
document_lengths = list(map(len, documents))
distinct_words = set(word for document in documents for word in document)
W = len(distinct_words)
D = len(documents)

# Smoothed conditional probabilities used to re-sample topics
def p_topic_given_document(topic, d, alpha=0.1):
    return (document_topic_counts[d][topic] + alpha) / (document_lengths[d] + K * alpha)

def p_word_given_topic(word, topic, beta=0.1):
    return (topic_word_counts[topic][word] + beta) / (topic_counts[topic] + W * beta)

def choose_new_topic(d, word):
    return sample_from([p_word_given_topic(word, k) * p_topic_given_document(k, d)
                        for k in range(K)])

# Randomly assign an initial topic to every word and record the counts
document_topics = [[random.randrange(K) for _ in document] for document in documents]
for d in range(D):
    for word, topic in zip(documents[d], document_topics[d]):
        document_topic_counts[d][topic] += 1
        topic_word_counts[topic][word] += 1
        topic_counts[topic] += 1

# Gibbs sampling: remove each word's topic assignment, sample a new topic, add it back
for _ in range(1000):
    for d in range(D):
        for i, (word, topic) in enumerate(zip(documents[d], document_topics[d])):
            document_topic_counts[d][topic] -= 1
            topic_word_counts[topic][word] -= 1
            topic_counts[topic] -= 1
            document_lengths[d] -= 1
            new_topic = choose_new_topic(d, word)
            document_topics[d][i] = new_topic
            document_topic_counts[d][new_topic] += 1
            topic_word_counts[new_topic][word] += 1
            topic_counts[new_topic] += 1
            document_lengths[d] += 1

# Print the words most strongly associated with each topic
for k, word_counts in enumerate(topic_word_counts):
    print(f"Topic {k}:")
    for word, count in word_counts.most_common():
        if count > 0:
            print(f"{word}: {count}")
    print()
Network Analysis
Network analysis in data science involves examining the structure and dynamics of networks.
Networks, also known as graphs, consist of nodes (or vertices) and edges (or links) that represent
relationships or interactions between entities. Network analysis provides insights into the patterns,
connections, and behaviors within the network, which can be applied to various fields such as social
sciences, biology, computer science, and more. Network analysis provides a powerful framework for
understanding and leveraging the relationships and interactions within complex data. It enables data
scientists to gain deeper insights, make informed decisions, and solve problems that involve
interconnected entities. Network analysis in data science is crucial because it allows us to understand
and model the relationships and interactions between different entities.
Centrality is a key concept in network theory and graph analysis, which focuses on identifying the
most important or influential nodes within a network. Centrality can be measured in several ways,
each capturing different aspects of a node's importance:
1. Degree Centrality: Measures the number of direct connections a node has. Nodes with higher
degrees are considered more central.
2. Betweenness Centrality: Quantifies the number of times a node acts as a bridge along the
shortest path between two other nodes. Nodes with high betweenness centrality play a critical
role in the flow of information within the network.
3. Eigenvector Centrality: Measures a node's influence based on the number and quality of its
connections. A node is considered central if it is connected to other nodes that are themselves
central. It is often used in identifying influential nodes in social networks.
4. PageRank: A variant of eigenvector centrality, originally used by Google to rank web pages in
search results. It measures the probability that a random walker will land on a particular node,
considering both the quantity and quality of incoming links.
Each centrality measure provides different insights into the network structure and the roles of individual nodes. The choice of centrality measure depends on the specific context and the aspect of importance being analyzed.
Betweenness Centrality
Betweenness centrality is a measure of centrality in a graph based on the shortest paths. It quantifies
the number of times a node acts as a bridge along the shortest path between two other nodes. Nodes
with high betweenness centrality are crucial for the flow of information through the network.
Definition
For a given graph G = (V, E), where V is the set of vertices and E is the set of edges, the betweenness centrality of a node v is
C_B(v) = Σ_{s ≠ v ≠ t} σ_st(v) / σ_st
where σ_st is the total number of shortest paths from node s to node t, and σ_st(v) is the number of those paths that pass through v.
Code:
from collections import deque
# Define users
users = [
{ "id": 0, "name": "Hero" },
{ "id": 1, "name": "Dunn" },
{ "id": 2, "name": "Sue" },
{ "id": 3, "name": "Chi" },
{ "id": 4, "name": "Thor" },
{ "id": 5, "name": "Clive" },
{ "id": 6, "name": "Hicks" },
{ "id": 7, "name": "Devin" },
{ "id": 8, "name": "Kate" },
{ "id": 9, "name": "Klein" }
]
# Define friendships
friendships = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4),
               (4, 5), (5, 6), (5, 7), (6, 8), (7, 8), (8, 9)]

# Give each user an (initially empty) friends list, then populate it
for user in users:
    user["friends"] = []
for i, j in friendships:
    users[i]["friends"].append(users[j])   # add j as a friend of i
    users[j]["friends"].append(users[i])   # add i as a friend of j
Code:
def shortest_paths_from(from_user):
    # a dictionary from user_id to *all* shortest paths to that user
    shortest_paths_to = { from_user["id"]: [[]] }
    # a queue of (previous user, next user) pairs that still need checking
    frontier = deque((from_user, friend) for friend in from_user["friends"])
    while frontier:
        prev_user, user = frontier.popleft()
        user_id = user["id"]
        # we already know some shortest paths to prev_user
        paths_to_prev_user = shortest_paths_to[prev_user["id"]]
        new_paths_to_user = [path + [user_id] for path in paths_to_prev_user]
        # it's possible we already know some shortest paths to user
        old_paths_to_user = shortest_paths_to.get(user_id, [])
        if old_paths_to_user:
            min_path_length = len(old_paths_to_user[0])
        else:
            min_path_length = float('inf')
        # only keep paths that aren't too long and are actually new
        new_paths_to_user = [path for path in new_paths_to_user
                             if len(path) <= min_path_length
                             and path not in old_paths_to_user]
        shortest_paths_to[user_id] = old_paths_to_user + new_paths_to_user
        # add never-before-seen neighbors to the frontier
        frontier.extend((user, friend) for friend in user["friends"]
                        if friend["id"] not in shortest_paths_to)
    return shortest_paths_to
For each pair of nodes, find all shortest paths between them.
For each node on these paths (excluding the source and target), update its betweenness
centrality score.
Code:
# Initialize betweenness centrality for each user
for user in users:
    user["betweenness_centrality"] = 0.0

# Compute all shortest paths from every user
for user in users:
    user["shortest_paths"] = shortest_paths_from(user)

for source in users:
    source_id = source["id"]
    for target_id, paths in source["shortest_paths"].items():
        if source_id < target_id:              # don't double count each pair
            num_paths = len(paths)             # how many shortest paths?
            contrib = 1 / num_paths            # contribution to centrality
            for path in paths:
                for between_id in path:
                    if between_id not in [source_id, target_id]:
                        users[between_id]["betweenness_centrality"] += contrib
Eigenvector centrality
Eigenvector centrality is a measure of the influence of a node in a network. Unlike other centrality
measures, it assigns relative scores to all nodes in the network based on the concept that connections to
high-scoring nodes contribute more to the score of the node in question. It measures a node's influence based on the number and quality of its connections and is often used to identify influential nodes in social networks.
Example:
Consider a simple network with four nodes A, B, C, and D, whose adjacency matrix is given in the code below. Running the power-iteration code yields eigenvector centrality scores of approximately 0.44, 0.56, 0.56, and 0.44 for nodes A, B, C, and D respectively (normalized to unit length). Nodes B and C have the highest centrality scores, indicating they are the most influential nodes in this network: each is connected to three others, including the other high-scoring node.
Code:
import numpy as np

def normalize(v):
    norm = np.linalg.norm(v)
    if norm == 0:
        return v
    return v / norm

def eigenvector_centrality(A, max_iter=100, tol=1e-6):
    # Power iteration: repeatedly multiply by A and renormalize
    n = A.shape[0]
    v = normalize(np.ones(n))                 # start from a uniform vector
    for _ in range(max_iter):
        v_new = np.dot(A, v)
        v_new = normalize(v_new)
        if np.linalg.norm(v_new - v) < tol:   # stop once it has converged
            return v_new
        v = v_new
    return v

# Example usage:
A = np.array([
    [0, 1, 1, 0],
    [1, 0, 1, 1],
    [1, 1, 0, 1],
    [0, 1, 1, 0]
])
centrality = eigenvector_centrality(A)
print("Eigenvector Centrality:", centrality)
PageRank
The PageRank algorithm, introduced by Google, evaluates the importance of nodes in a directed
graph based on their connections. This is especially useful in social networks, where endorsements
can signify the respect or influence of a person within a community.
The PageRank algorithm is a method for ranking nodes in a directed graph, such as webpages in a
network of hyperlinks. The algorithm assigns a numerical weight to each node, representing the
likelihood that a person randomly clicking on links will arrive at that particular node.
Algorithm
• Each node starts with an equal PageRank value.
• The total PageRank across all nodes is normalized to 1.
• PageRank is distributed from each node to its outbound links.
• A damping factor is used to simulate the probability that a person will continue clicking on
links (typically set to 0.85).
• The algorithm iterates until the PageRank values stabilize (i.e., change very little between
iterations).
The main idea behind PageRank is that a page is considered important if many other important pages
link to it. The algorithm works as follows:
1. Every page is assigned an initial rank. If there are N pages, each page is typically assigned a
rank of 1/N.
2. The rank of a page is determined by the ranks of the pages linking to it. If a page P has incoming links from pages A1, A2, ..., Am with ranks R(A1), R(A2), ..., R(Am), and if each page Ai has L(Ai) outgoing links, then the rank of page P is calculated as:
R(P) = (1 - d)/N + d · [ R(A1)/L(A1) + R(A2)/L(A2) + ... + R(Am)/L(Am) ]
Here, d is a damping factor, usually set to 0.85, which accounts for the probability that a user continues clicking on links rather than starting a new search, and N is the total number of pages.
3. The rank values are updated iteratively until they converge to a stable set of values.
PageRank revolutionized web search by providing a way to rank pages based on their relevance and
importance rather than just keyword matching.
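A minimal sketch of the iterative computation described above; the function name, graph data, and iteration settings are illustrative assumptions, not from the notes:
from collections import defaultdict

def page_rank(nodes, edges, damping=0.85, num_iters=100):
    # edges is a list of directed (source, target) links
    outgoing = defaultdict(list)
    for source, target in edges:
        outgoing[source].append(target)

    n = len(nodes)
    ranks = {node: 1 / n for node in nodes}        # start with equal PageRank

    for _ in range(num_iters):
        next_ranks = {node: (1 - damping) / n for node in nodes}
        for node in nodes:
            if outgoing[node]:
                share = damping * ranks[node] / len(outgoing[node])
                for target in outgoing[node]:
                    next_ranks[target] += share
            else:
                # a node with no outgoing links spreads its rank to everyone
                for target in nodes:
                    next_ranks[target] += damping * ranks[node] / n
        ranks = next_ranks
    return ranks

# Example usage on a tiny directed graph (illustrative data)
nodes = ["A", "B", "C", "D"]
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "A"), ("D", "C")]
for node, rank in sorted(page_rank(nodes, edges).items(), key=lambda p: -p[1]):
    print(node, round(rank, 3))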
Recommender Systems
A recommendation system is a type of information filtering system that predicts and suggests items or
content to users based on various criteria. These systems are widely used in various domains such as e-
commerce, streaming services, social media, and content websites to enhance user experience by
providing personalized recommendations.
Recommending what's popular is a straightforward approach in recommendation systems. This
method involves identifying the most popular items based on user interactions and then recommending
these items to users.
Steps:
• Collect Data: Gather data on user interactions with items. For example, this could be a list of
items each user is interested in.
• Calculate Popularity: Count how often each item appears in the list of user interests.
• Sort by Popularity: Sort the items by their popularity
• Recommend Popular Items: Recommend the top N popular items that a user has not already
interacted with.
Code:
from collections import Counter
from typing import List, Tuple
users_interests = [
["Hadoop", "Big Data", "HBase", "Java", "Spark", "Storm", "Cassandra"],
["NoSQL", "MongoDB", "Cassandra", "HBase", "Postgres"],
["Python", "scikit-learn", "scipy", "numpy", "statsmodels", "pandas"],
["R", "Python", "statistics", "regression", "probability"],
["machine learning", "regression", "decision trees", "libsvm"],
["Python", "R", "Java", "C++", "Haskell", "programming languages"],
["statistics", "probability", "mathematics", "theory"],
["machine learning", "scikit-learn", "Mahout", "neural networks"],
["neural networks", "deep learning", "Big Data", "artificial intelligence"],
["Hadoop", "Java", "MapReduce", "Big Data"],
["statistics", "R", "statsmodels"],
["C++", "deep learning", "artificial intelligence", "probability"],
["pandas", "R", "Python"],
["databases", "HBase", "Postgres", "MySQL", "MongoDB"],
["libsvm", "regression", "support vector machines"]
]
# Count how often each interest appears across all users
popular_interests = Counter(interest
                            for user_interests in users_interests
                            for interest in user_interests)
popular_interests_sorted = popular_interests.most_common()
print(popular_interests_sorted)

def most_popular_new_interests(user_interests, max_results=5):
    # recommend the most popular interests the user doesn't already have
    suggestions = [(interest, frequency)
                   for interest, frequency in popular_interests.most_common()
                   if interest not in user_interests]
    return suggestions[:max_results]

# Example usage:
user_id = 1  # User with interests ["NoSQL", "MongoDB", "Cassandra", "HBase", "Postgres"]
recommendations = most_popular_new_interests(users_interests[user_id], 5)
print(f"Recommendations for user {user_id}: {recommendations}")
Types of recommendations
Recommendation systems generally use two approaches, content-based filtering and collaborative filtering, to make recommendations.
Content-Based Filtering
This approach recommends items based on the user's own preferences. It matches item attributes against the user's past actions, detected patterns, or any explicit feedback provided by the user, and makes recommendations accordingly.
Example: If a person prefers the chocolate flavor and purchases a chocolate ice cream, then the next time they raise a query, the system will scan for options related to chocolate and recommend a chocolate cake.
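A minimal, illustrative sketch of content-based scoring; the item catalogue and attribute tags below are invented for the example, not from the notes:
# Illustrative item catalogue: each item is described by content attributes
items = {
    "chocolate ice cream": {"chocolate", "frozen", "dessert", "sweet"},
    "chocolate cake":      {"chocolate", "baked", "dessert", "sweet"},
    "vanilla ice cream":   {"vanilla", "frozen", "dessert"},
}

def content_based_recommendations(liked_items, items):
    # Build a profile from the attributes of items the user already liked
    profile = set().union(*(items[name] for name in liked_items))
    # Score every unseen item by how many attributes it shares with the profile
    scores = {name: len(attrs & profile)
              for name, attrs in items.items() if name not in liked_items}
    return sorted(scores.items(), key=lambda pair: -pair[1])

print(content_based_recommendations({"chocolate ice cream"}, items))
# "chocolate cake" ranks first because it shares the most attributes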
Collaborative Filtering
This approach uses similarities between users and items simultaneously to provide recommendations. It is the idea of recommending an item or making a prediction based on other like-minded individuals. The data could comprise a set of users, items, opinions about an item, ratings, reviews, or purchases.
Example: Suppose Persons A and B both like the chocolate flavor and both have tried the ice cream and the cake. If Person A then buys chocolate biscuits, the system will recommend chocolate biscuits to Person B.
User-Based Collaborative Filtering is a technique used to predict the items that a user might like on the basis of ratings given to those items by other users who have similar tastes to the target user. Many websites use collaborative filtering for building their recommendation systems.
# All distinct interests, in a fixed order
unique_interests = sorted({interest for u in users_interests for interest in u})

def make_user_interest_vector(user_interests):
    return [1 if interest in user_interests else 0 for interest in unique_interests]

user_interest_matrix = [make_user_interest_vector(u) for u in users_interests]

# Pairwise cosine similarity between users (cosine_similarity is defined below)
user_similarities = [[cosine_similarity(vi, vj) for vj in user_interest_matrix]
                     for vi in user_interest_matrix]

def most_similar_users_to(user_id):
    pairs = [(other_user_id, similarity)
             for other_user_id, similarity in enumerate(user_similarities[user_id])
             if user_id != other_user_id and similarity > 0]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)
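To turn these similarities into recommendations, one common completion (a sketch in the spirit of the user-based approach described above; the helper name user_based_suggestions is assumed, not from the notes) sums, for each interest, the similarities of the other users who share it:
from collections import defaultdict

def user_based_suggestions(user_id, max_results=5):
    # Add up the similarities of every similar user who has each interest
    suggestions = defaultdict(float)
    for other_user_id, similarity in most_similar_users_to(user_id):
        for interest in users_interests[other_user_id]:
            suggestions[interest] += similarity
    # Sort by total weight and drop interests the user already has
    ranked = sorted(suggestions.items(), key=lambda pair: pair[1], reverse=True)
    return [(interest, weight) for interest, weight in ranked
            if interest not in users_interests[user_id]][:max_results]

print(user_based_suggestions(0))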
2. Compute Item Similarities: Calculate the similarity score for each pair of items.
import numpy as np

def cosine_similarity(v1, v2):
    dot_product = np.dot(v1, v2)
    norm_v1 = np.linalg.norm(v1)
    norm_v2 = np.linalg.norm(v2)
    return dot_product / (norm_v1 * norm_v2)

def calculate_similarities(matrix):
    # Treat each row of the matrix as one item vector and compare every pair
    num_items = len(matrix)
    similarities = np.zeros((num_items, num_items))
    for i in range(num_items):
        for j in range(num_items):
            if i != j:
                similarities[i][j] = cosine_similarity(matrix[i], matrix[j])
    return similarities
# Transpose so that each row corresponds to one interest (across all users)
interest_user_matrix = np.array(user_interest_matrix).T
interest_similarities = calculate_similarities(interest_user_matrix)
4. Generate Recommendations:
For a given user, identify items similar to the items they have interacted with. Aggregate these similar items to form a ranked list of recommendations.
def item_based_suggestions(user_id, max_results=5):
    user_interests = user_interest_matrix[user_id]
    scores = np.zeros(len(user_interests))
    for i, interested in enumerate(user_interests):
        if interested:                      # sum the similarity rows of the user's interests
            scores += interest_similarities[i]
    recommendations = sorted(((unique_interests[i], s) for i, s in enumerate(scores)
                              if not user_interests[i] and s > 0),
                             key=lambda pair: pair[1], reverse=True)
    return recommendations[:max_results]
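A quick, illustrative usage check:
print(item_based_suggestions(0))   # top new interests suggested for the first user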
Matrix Factorization
Matrix factorization is one of the most sought-after machine learning recommendation models. It acts as a catalyst, enabling the system to gauge the customer's exact purpose for the purchase, scan numerous pages, shortlist and rank the right product or service, and recommend multiple available options. Once the output matches the requirement, the lead translates into a transaction and the deal is closed. When an individual raises a query on a search engine, the machine uses matrix factorization to generate an output in the form of recommendations.
Matrix factorization is an extensively used technique in collaborative filtering recommendation systems. Its objective is to factorize a user-item matrix into two low-rank matrices, the user-factor matrix and the item-factor matrix, which can predict new items that users might be interested in. A small factorization sketch is given at the end of this section, after a brief refresher on matrix multiplication.
Matrix multiplication: if A is an n × k matrix and B is a k × m matrix, their product AB is the n × m matrix whose (i, j) entry is
(AB)[i][j] = A[i][0] B[0][j] + A[i][1] B[1][j] + ... + A[i][k-1] B[k-1][j]
which is just the dot product of the ith row of A with the jth column of B.
A vector can be treated as a one-column matrix. For example:
v = [1, 2, 3]
v_as_matrix = [[1],
               [2],
               [3]]
Therefore, a vector can be multiplied by a matrix by first converting it to an n × 1 matrix and then converting the result back. The following helpers do these conversions:
def vector_as_matrix(v):
    """Returns the vector v (represented as a list) as an n x 1 matrix."""
    return [[v_i] for v_i in v]

def vector_from_matrix(v_as_matrix):
    """Returns the n x 1 matrix as a list of values."""
    return [row[0] for row in v_as_matrix]
When A is a square matrix, this operation maps n-dimensional vectors to other n dimensional vectors.
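A small sketch of that mapping, combining the helpers above with a plain-Python matrix_multiply; both matrix_multiply and matrix_operate are assumed helper names, shown here for completeness:
def matrix_multiply(A, B):
    # (i, j) entry of A·B is the dot product of row i of A with column j of B
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]

def matrix_operate(A, v):
    """Apply the square matrix A to the vector v and return a vector."""
    return vector_from_matrix(matrix_multiply(A, vector_as_matrix(v)))

A = [[1, 0, 0],
     [0, 2, 0],
     [0, 0, 3]]
print(matrix_operate(A, [1, 2, 3]))   # [1, 4, 9]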
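Finally, a minimal sketch of factorizing a small user-item rating matrix into two low-rank factor matrices with stochastic gradient descent, in the spirit of the description above; the ratings, factor dimension, learning rate, and regularization values are illustrative assumptions, not from the notes:
import random

# Tiny illustrative user-item rating matrix (0 means "not rated")
ratings = [[5, 3, 0, 1],
           [4, 0, 0, 1],
           [1, 1, 0, 5],
           [0, 1, 5, 4]]
num_users, num_items = len(ratings), len(ratings[0])
num_factors, learning_rate, reg = 2, 0.01, 0.02

random.seed(0)
# User-factor and item-factor matrices, initialized with small random values
user_factors = [[random.random() for _ in range(num_factors)] for _ in range(num_users)]
item_factors = [[random.random() for _ in range(num_factors)] for _ in range(num_items)]

def predict(u, i):
    return sum(user_factors[u][f] * item_factors[i][f] for f in range(num_factors))

# Stochastic gradient descent on the observed entries only
for _ in range(5000):
    for u in range(num_users):
        for i in range(num_items):
            if ratings[u][i] > 0:
                error = ratings[u][i] - predict(u, i)
                for f in range(num_factors):
                    uf, itf = user_factors[u][f], item_factors[i][f]
                    user_factors[u][f] += learning_rate * (error * itf - reg * uf)
                    item_factors[i][f] += learning_rate * (error * uf - reg * itf)

# Predicted score for an item the first user has not rated yet
print(round(predict(0, 2), 2))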