Unit 5
Classification: Performance measures, Logistic Regression implementation in R, K-Nearest
Neighbors (KNN), K-Nearest Neighbors implementation in R, Clustering: K-Means Algorithm, K-Means
implementation in R.
Case studies of Data Science Applications: Weather Forecasting, Stock Market Prediction, Object
Recognition, Real Time Sentiment Analysis.
_____________________________________________________________________________________
Classification
Classification is a type of supervised learning that predicts a discrete value of the target attribute for a
given input. Examples of classification algorithms include the K-Nearest Neighbor Algorithm, Logistic
Regression, the Support Vector Machine Algorithm, the Decision Tree Algorithm, etc.
Performance measures
Classification models predict categorical outcomes for the given test set records. To evaluate the
performance of a classification model, various performance measures or metrics are used:
accuracy, precision, sensitivity (recall), F-score, specificity, and the confusion matrix.
The following terms are used to measure the performance of the model.
True Positive (TP) – The classification model predicts the output as positive and actual output is
also positive.
True Negative (TN) – The model predicts the output as negative and actual output is also negative.
False Positive (FP) – The model predicts the output as positive but the actual output is negative.
False Negative (FN) – The model predicts the output as negative but the actual output is positive.
1. Accuracy
Accuracy is the ratio of correct predictions to the total predictions.
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
Example: If a model correctly classifies 90 out of 100 instances, the accuracy is 90%.
2. Precision
Precision measures how many of the predicted positive instances are actually positive.
It is the ratio of correct positive predictions to the total positive predictions.
$$\text{Precision} = \frac{TP}{TP + FP}$$
Example: In a spam detection system, if 80 emails are predicted as spam and 70 of them are actually
spam, precision is 70/80 = 0.875.
3. Recall (Sensitivity)
Recall measures the proportion of actual positive cases that are correctly identified.
It is the ratio of correct positive predictions to the actual positives.
$$\text{Recall} = \frac{TP}{TP + FN}$$
Example: If there are 100 spam emails and the model correctly identifies 70, recall is 70/100 = 0.7.
4. F1 Score
F1 Score is the harmonic mean of precision and recall, balancing both metrics.
$$F_1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
Example: If precision is 0.875 and recall is 0.7, the F1 Score is
2 × (0.875 × 0.7) / (0.875 + 0.7) ≈ 0.778.
5. Specificity
Specificity measures the proportion of actual negative cases that are correctly identified.
It is the ratio of correct negative predictions to the actual negatives.
$$\text{Specificity} = \frac{TN}{TN + FP}$$
Example: If 900 non-spam emails are correctly identified out of 950, specificity is 900/950 ≈ 0.947.
6. Confusion Matrix
Definition: A confusion matrix summarizes the model's performance by showing the number of
true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN).
Example:
                      Predicted
                      Positive    Negative
Actual   Positive     TP=70       FN=30
         Negative     FP=10       TN=9
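The precision, recall and F1 formulas above can be checked with a few lines of R. This is a minimal sketch that reuses the counts from the spam example (TP = 70, FN = 30, FP = 10); the variable names are chosen here for illustration only.

```r
# Counts taken from the spam example above
tp <- 70  # spam emails predicted as spam
fn <- 30  # spam emails predicted as non-spam
fp <- 10  # non-spam emails predicted as spam

precision <- tp / (tp + fp)   # 70 / 80
recall    <- tp / (tp + fn)   # 70 / 100
f1        <- 2 * precision * recall / (precision + recall)

print(round(c(precision = precision, recall = recall, f1 = f1), 3))
```

Accuracy and specificity additionally need the TN count, so they are left out of this sketch.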
K-Nearest Neighbor (KNN) Algorithm
KNN is a supervised learning algorithm used for classification and regression. It stores the available
training set data and classifies a new input (new data point) based on its similarity to the stored records.
KNN is also called a lazy learning algorithm because it does not learn from the training set
immediately. Instead, it stores the dataset in memory and works on it only at classification time.
Working of KNN Algorithm or steps in KNN Algorithm
1. Choose the number of neighbors K (usually an odd number, which avoids ties when there are two
categories).
2. Calculate the distance between the new data point (new input) and every point (record) in
the training set, using Euclidean distance.
3. Identify the K nearest neighbors, i.e. the K records with the smallest distances.
4. Among these K neighbors, count the number of data points (records) in each category.
5. Assign the new data point to the category with the largest number of neighbors.
Advantages
Simple and easy to implement.
No training phase, useful for small datasets.
Works well for classification.
Disadvantages
Computationally expensive for large datasets.
Requires careful selection of K.
To visualize this, plot each record in the training set as a point in 2D space. Place the new record in the
same space and find its K nearest neighbors. Find the category of each neighbor and assign the new data
point to the category that holds the most of those neighbors.
Suppose we have a new data point and we need to put it in the required category. Consider the below
image:
Assume that K = 5, i.e. we have to find the 5 nearest neighbors. Find the category of each neighbor.
Among the 5 neighbors, 3 belong to category A and the remaining 2 belong to the other category.
So the new input belongs to category A.
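The KNN steps listed earlier can be sketched in base R as follows. The training points, the labels and the function name `knn_predict` are hypothetical and exist only for illustration; this is not the `knn()` function from the `class` package used later in this unit.

```r
# Hypothetical training set: two numeric features per record and a category label
train_x <- matrix(c(1, 1,
                    1, 2,
                    2, 1,
                    2, 2,
                    6, 6,
                    6, 7,
                    7, 6),
                  ncol = 2, byrow = TRUE)
train_y <- c("A", "A", "A", "A", "B", "B", "B")

knn_predict <- function(train_x, train_y, new_point, k) {
  # Step 2: Euclidean distance from the new point to every training record
  diffs <- sweep(train_x, 2, new_point)   # subtract new_point from each row
  distances <- sqrt(rowSums(diffs^2))
  # Step 3: indices of the K smallest distances
  nearest <- order(distances)[seq_len(k)]
  # Step 4: count the neighbors in each category
  votes <- table(train_y[nearest])
  # Step 5: return the category with the most neighbors
  names(votes)[which.max(votes)]
}

print(knn_predict(train_x, train_y, new_point = c(3, 3), k = 5))
```

With K = 5, four of the five nearest neighbors of the point (3, 3) carry label A, so the sketch returns "A".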
Clustering
The process of forming clusters or groups from a dataset is known as clustering.
The data objects within a cluster are similar to each other, and data objects in different clusters are
dissimilar. Each cluster has a cluster centroid or center, obtained by calculating the mean of all data
objects within that cluster.
The Euclidean distance is calculated from a data object to every cluster centroid. The data object is
placed in the cluster whose centroid is nearest to it.
K-Means Clustering
K-Means is an unsupervised machine learning algorithm used for clustering. It partitions a dataset into
K clusters, where each data point belongs to the cluster with the nearest mean.
K-Means Algorithm
Input
K: The number of clusters to be formed
D : Dataset that contains n number of data objects or records
Output
K Clusters from data set
Method:
1. Randomly select K data objects as the initial cluster centroids (cluster centers) of the K clusters.
2. Repeat
3. Assign each data object to the cluster whose centroid is nearest to it.
4. Update the cluster centroids, i.e. recalculate the mean of each cluster from its updated
cluster members.
5. Until there is no change in cluster members or centroids.
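The method above can be sketched in base R as follows. The function name `simple_kmeans`, the `max_iter` safeguard and the six data points are assumptions made for illustration; in practice the built-in `kmeans()` function would normally be used.

```r
simple_kmeans <- function(data, k, max_iter = 100) {
  data <- as.matrix(data)
  # Step 1: pick K distinct data objects at random as the initial centroids
  centroids <- data[sample(nrow(data), k), , drop = FALSE]
  cluster <- rep(0L, nrow(data))
  for (iter in seq_len(max_iter)) {
    # Step 3: squared Euclidean distance from every object to every centroid
    # (the nearest centroid is the same for squared and unsquared distance)
    dists <- sapply(seq_len(k), function(j) {
      rowSums(sweep(data, 2, centroids[j, ])^2)
    })
    new_cluster <- max.col(-dists, ties.method = "first")
    # Step 5: stop when no object changes its cluster
    if (all(new_cluster == cluster)) break
    cluster <- new_cluster
    # Step 4: recompute each centroid as the mean of its current members
    for (j in seq_len(k)) {
      members <- data[cluster == j, , drop = FALSE]
      if (nrow(members) > 0) centroids[j, ] <- colMeans(members)
    }
  }
  list(cluster = cluster, centers = centroids)
}

# Hypothetical data: three points near (1, 1) and three points near (8.5, 9)
set.seed(1)
pts <- rbind(c(1, 1), c(1.5, 2), c(1, 0.5),
             c(8, 8), c(9, 9.5), c(8.5, 9))
fit <- simple_kmeans(pts, k = 2)
print(fit$cluster)
print(fit$centers)
```

The cluster numbers themselves are arbitrary; what matters is that the first three points end up in one cluster and the last three in the other.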
Some of the applications of K-Means Clustering include customer segmentation in marketing and
anomaly detection in cybersecurity.
Example for K-Means clustering: [handwritten worked example; not legible in this copy]
Case Studies of Data Science Applications
1. Weather Forecasting
Weather forecasting is the application of science and technology to predict atmospheric conditions for a
given location and time. With advancements in data science and machine learning, forecasting accuracy
has significantly improved.
1. Importance of Weather Forecasting
Agriculture: Helps farmers plan irrigation and crop protection.
Disaster Management: Predicts extreme weather events like hurricanes, floods, and droughts.
Aviation: Ensures safe flight operations.
Energy Sector: Optimizes power generation and distribution.
2. Data Sources for Weather Forecasting
Meteorological Stations: Collect temperature, humidity, wind speed, and pressure data.
Satellites: Provide cloud cover, precipitation, and storm movement.
IoT Sensors: Gather real-time weather data from different locations.
3. Data Science and Machine Learning Approaches
Regression Models: Predict temperature and precipitation.
Time Series Analysis: Uses past weather data to make future predictions.
Deep Learning Models: Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM)
networks analyze time-dependent weather data.
4. Workflow of Weather Forecasting Using Machine Learning
1. Data Collection and Preprocessing
Obtain data from sources like Kaggle or OpenWeatherMap.
Handle missing values and perform feature scaling.
Convert categorical variables (e.g., sunny, rainy) into numerical format.
2. Model Building
Split the collected dataset into a training set and a test set.
Build the model by supplying the training set to a machine learning or deep learning algorithm.
3. Model Evaluation
Evaluate the performance of the model using the test set data and performance metrics.
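The workflow above can be sketched end to end in base R. Everything in this example is an assumption made for illustration: the data is synthetic (randomly generated, not real weather readings), the column names are hypothetical, and a linear regression stands in for whichever algorithm is actually chosen.

```r
set.seed(42)
n <- 200
# Synthetic stand-in for a real weather dataset (all values are made up)
weather <- data.frame(
  humidity  = runif(n, 30, 90),
  pressure  = rnorm(n, 1013, 8),
  condition = sample(c("sunny", "rainy", "cloudy"), n, replace = TRUE)
)
weather$temperature <- 35 - 0.15 * weather$humidity + rnorm(n, sd = 1.5)
weather$humidity[sample(n, 10)] <- NA   # simulate missing sensor readings

# 1. Data preprocessing
# Replace missing humidity values with the column mean
weather$humidity[is.na(weather$humidity)] <- mean(weather$humidity, na.rm = TRUE)
# Store the categorical variable as a factor; lm() expands it into dummy variables
weather$condition <- factor(weather$condition)
# Feature scaling of the numeric predictors
weather[c("humidity", "pressure")] <- scale(weather[c("humidity", "pressure")])

# 2. Model building: 75% training set, 25% test set
train_idx <- sample(n, size = 0.75 * n)
training_set <- weather[train_idx, ]
test_set <- weather[-train_idx, ]
model <- lm(temperature ~ humidity + pressure + condition, data = training_set)

# 3. Model evaluation: root mean squared error on the test set
pred <- predict(model, newdata = test_set)
rmse <- sqrt(mean((test_set$temperature - pred)^2))
print(rmse)
```

For a real forecasting task the same three stages apply, with the synthetic data frame replaced by collected weather records.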
2. Stock Market Prediction
A stock market is a public market where the public can buy and sell shares of listed companies. Stock
market prediction involves using historical data, statistical methods, and machine learning algorithms to
forecast stock prices and trends.
Data Sources for Stock Market Prediction
Historical Stock Prices: Open, close, high, and low prices of certain company shares in the past.
Fundamental Data: Company earnings, revenue, and financial profits.
Machine Learning-Based Approaches
Regression Models: Linear Regression for stock price prediction.
Classification Models: Predict stock movement (increase/decrease) using Support Vector
Machines (SVM) and Random Forest.
Deep Learning Models:
Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks: use the
historical prices of a stock to predict its future share price.
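The regression approach can be sketched in base R by predicting each day's closing price from the previous day's close. The prices below are a simulated random walk, not real market data, and the column names are hypothetical.

```r
set.seed(7)
# Simulated closing prices (a random walk); not real market data
close_price <- 100 + cumsum(rnorm(120))
n <- length(close_price)

# Each row pairs the previous day's close (feature) with the current close (target)
prices <- data.frame(
  prev_close = close_price[-n],
  close      = close_price[-1]
)
# Label for the classification view: 1 if the price went up, 0 otherwise
prices$went_up <- as.integer(prices$close > prices$prev_close)

# Time-ordered split: train on the first 80% of days, test on the rest
n_train <- floor(0.8 * nrow(prices))
training_set <- prices[1:n_train, ]
test_set <- prices[(n_train + 1):nrow(prices), ]

# Linear regression of today's close on yesterday's close
model <- lm(close ~ prev_close, data = training_set)
pred <- predict(model, newdata = test_set)
mae <- mean(abs(test_set$close - pred))   # mean absolute error on the test days
print(mae)
```

The split keeps the rows in time order so that the model is always tested on days that come after the days it was trained on.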
3. Object Recognition
Object recognition is the technique of identifying the objects present in images and videos. It is one of
the most important applications of machine learning and deep learning. The goal of this field is to teach
machines to understand (recognize) the content of an image just like humans do.
It takes an image as input and outputs the classification label of that image. Object detection algorithms
also locate the object in the image and represent its location with a bounding box, given as a position,
a height and a width.
Applications of Object Recognition
Medical Imaging: Identifying tumors in X-rays or MRIs.
Security and Surveillance: Recognizing faces and detecting suspicious activities.
Retail and Inventory Management: Automated checkout and product detection.
Agriculture: Monitoring plant health and detecting pests.
Deep learning techniques, particularly Convolutional Neural Networks (CNNs), are used for object
detection.
Steps in Object Recognition Process
1. Data Collection and Preprocessing
Collect images from sources like Open Images or custom datasets.
Label objects using tools like LabelImg.
Normalize and resize images for uniform processing.
2. Model Training
Split data into training, validation, and test sets.
Build a model with training set images.
3. Model Evaluation
Evaluate model performance on the test set images using accuracy metrics.
4. Real Time Sentiment Analysis
Real-time sentiment analysis analyzes and classifies the sentiment (positive, negative, or
neutral) of text data from social media, reviews, and customer feedback.
It is a machine learning (ML) technique that automatically recognizes and extracts the sentiment
in a text as soon as the text appears. It is most commonly used to analyze brand and product
mentions in live social comments and posts. Real-time sentiment analysis is therefore possible
only on platforms that share live feeds, as Twitter does.
Working of Real Time Sentiment Analysis
1. Data Collection
Collect sentiment data about a product or event from social media such as Instagram,
Twitter, Facebook, etc.
2. Data Processing
All text comments (sentiments) are cleaned and prepared for the next stage. Non-text
data from live video or audio feeds is converted to text and added as text comments.
3. Data Analysis
All text comments are analyzed using Natural Language Processing and clustering. The analysis
gives the overall percentage of each opinion or type of feedback given by people about a product
or event.
4. Data visualization
The results of opinions or sentiments are shown in form of charts or graphs.
The following techniques can be used to build the model for Real Time Sentiment Analysis:
NLP Techniques: Bag of Words
Machine Learning: Logistic Regression, Naïve Bayes, Random Forest.
Deep Learning: Recurrent Neural Networks (RNN)
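The Bag of Words technique named above turns each comment into a row of word counts. The following base R sketch shows the idea on three hypothetical comments invented for illustration; a real pipeline would usually rely on a text-mining package and a larger vocabulary.

```r
# Hypothetical comments, invented for illustration
comments <- c("I love this phone",
              "I hate this phone",
              "love love the camera")

# Lowercase the text, strip everything except letters and spaces, split on spaces
tokens <- strsplit(gsub("[^a-z ]", "", tolower(comments)), " +")
vocabulary <- sort(unique(unlist(tokens)))

# One row per comment, one column per vocabulary word, cells hold word counts
bag_of_words <- t(vapply(tokens, function(words) {
  tabulate(match(words, vocabulary), nbins = length(vocabulary))
}, integer(length(vocabulary))))
colnames(bag_of_words) <- vocabulary
print(bag_of_words)
```

The resulting count matrix is the kind of numeric input that a classifier such as Logistic Regression or Naïve Bayes can be trained on.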
Logistic Regression implementation in R
Steps in implementation
Step 1: Import the dataset.
Step 2: Split the dataset into a training set and a test set.
Step 3: Build the logistic regression model by supplying the training set to the logistic
regression algorithm.
Step 4: Evaluate the model performance with the test set data and a confusion matrix.
R Code for logistic regression
# Importing the dataset from the Social_Network_Ads.csv file
dataset = read.csv('Social_Network_Ads.csv')
# Keep columns 3 to 5: two predictor columns and the Purchased target
dataset = dataset[3:5]
# Splitting the dataset into the Training set and Test set
# install.packages('caTools')
library(caTools)
set.seed(123)
split = sample.split(dataset$Purchased, SplitRatio = 0.75)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)
# Feature Scaling (column 3 is the target, so it is left unscaled)
training_set[-3] = scale(training_set[-3])
test_set[-3] = scale(test_set[-3])
# Fitting Logistic Regression to the Training set
classifier = glm(formula = Purchased ~ .,
                 family = binomial,
                 data = training_set)
# Predicting the Test set results as probabilities of class 1
prob_pred = predict(classifier, type = 'response', newdata = test_set[-3])
# Converting the probabilities to class labels with a 0.5 threshold
y_pred = ifelse(prob_pred > 0.5, 1, 0)
# Making the Confusion Matrix (rows = actual class, columns = predicted class)
cm = table(test_set[, 3], y_pred)
print(cm)
Output
Confusion Matrix (rows = actual class, columns = predicted class)
      0    1
0    57    7
1    10   26
Accuracy = (57 + 26) / 100 = 83%
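The performance measures defined at the start of this unit can be computed directly from this confusion matrix. The sketch below re-enters the four printed counts by hand, treating class 1 as the positive class; it does not rerun the model.

```r
# Confusion matrix as printed above: rows = actual class, columns = predicted class
cm <- matrix(c(57, 10, 7, 26), nrow = 2,
             dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))

tn <- cm["0", "0"]   # actual 0, predicted 0
fp <- cm["0", "1"]   # actual 0, predicted 1
fn <- cm["1", "0"]   # actual 1, predicted 0
tp <- cm["1", "1"]   # actual 1, predicted 1

accuracy    <- (tp + tn) / sum(cm)
precision   <- tp / (tp + fp)
recall      <- tp / (tp + fn)
specificity <- tn / (tn + fp)
print(round(c(accuracy = accuracy, precision = precision,
              recall = recall, specificity = specificity), 3))
```

This gives an accuracy of 0.83, a precision of about 0.788, a recall of about 0.722 and a specificity of about 0.891.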
K-Nearest Neighbors (K-NN) implementation in R
# Importing the dataset
dataset = read.csv('Social_Network_Ads.csv')
# Keep columns 3 to 5: two predictor columns and the Purchased target
dataset = dataset[3:5]
# Splitting the dataset into the Training set and Test set
# install.packages('caTools')
library(caTools)
set.seed(123)
split = sample.split(dataset$Purchased, SplitRatio = 0.75)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)
# Feature Scaling (column 3 is the target, so it is left unscaled)
training_set[-3] = scale(training_set[-3])
test_set[-3] = scale(test_set[-3])
# Fitting K-NN to the Training set and Predicting the Test set results
library(class)
y_pred = knn(train = training_set[, -3],
             test = test_set[, -3],
             cl = training_set[, 3],
             k = 5,
             prob = TRUE)
# Making the Confusion Matrix (rows = actual class, columns = predicted class)
cm = table(test_set[, 3], y_pred)
print(cm)
Output
Confusion Matrix (rows = actual class, columns = predicted class)
      0    1
0    59    5
1     6   30
Accuracy = (59 + 30) / 100 = 89%
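Both listings above scale the test set using the test set's own mean and standard deviation. A common alternative, sketched here as an option rather than as what the listings do, is to reuse the mean and standard deviation computed on the training set so that both sets are transformed identically. The two small data frames are hypothetical.

```r
# Hypothetical numeric features, invented for illustration
training_x <- data.frame(age = c(20, 30, 40, 50),
                         salary = c(20000, 40000, 60000, 80000))
test_x <- data.frame(age = c(25, 45),
                     salary = c(30000, 70000))

# Scale the training set and keep the statistics that scale() computed
scaled_train <- scale(training_x)
train_mean <- attr(scaled_train, "scaled:center")
train_sd   <- attr(scaled_train, "scaled:scale")

# Apply the same shift and scale to the test set
scaled_test <- scale(test_x, center = train_mean, scale = train_sd)
print(scaled_test)
```

Switching the listings to this approach would change the scaled test values, so the confusion matrices shown above would not necessarily be reproduced.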
K-Means Clustering implementation in R
# K-Means Clustering
# Importing the dataset
dataset = read.csv('mall.csv')
# Keep columns 4 and 5 (plotted below as Annual Income and Spending Score)
X = dataset[4:5]
# Using the elbow method to find the optimal number of clusters
# Fixing the seed makes the random centroid initialisation repeatable (the value is arbitrary)
set.seed(123)
wcss = vector()
for (i in 1:10) wcss[i] = sum(kmeans(X, i)$withinss)
plot(x = 1:10, y = wcss, type = 'b', main = 'The Elbow Method',
     xlab = 'Number of clusters', ylab = 'WCSS')
# Fitting K-Means to the dataset
# The result is stored as kmeans_fit so that it does not mask the kmeans() function
kmeans_fit = kmeans(x = X, centers = 5, iter.max = 300, nstart = 10)
# Visualising the clusters
library(cluster)
clusplot(x = X, clus = kmeans_fit$cluster, lines = 0, shade = TRUE, color = TRUE, labels = 2,
         plotchar = FALSE, span = TRUE, main = 'Clusters of customers',
         xlab = 'Annual Income', ylab = 'Spending Score')
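Besides plotting, the object returned by `kmeans()` can be inspected directly. The sketch below uses synthetic data (two made-up, well-separated groups of 20 records each, with hypothetical column names) instead of the mall dataset, so it runs without any CSV file.

```r
set.seed(10)
# Two synthetic, well-separated groups of 20 customers each (made-up values)
X <- data.frame(
  income = c(rnorm(20, mean = 30, sd = 3), rnorm(20, mean = 90, sd = 3)),
  score  = c(rnorm(20, mean = 20, sd = 3), rnorm(20, mean = 80, sd = 3))
)

fit <- kmeans(x = X, centers = 2, iter.max = 300, nstart = 10)

print(fit$centers)        # one row per cluster centroid
print(fit$size)           # number of records in each cluster
print(fit$tot.withinss)   # total WCSS, equal to sum(fit$withinss)
```

The `tot.withinss` value is the quantity that the elbow method plots for each candidate number of clusters.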