Experiment 5
(a) Execute Logistic Regression with the help of a properly identified data set. Analyse
the result and identify how well the model performed on the test set. Brief the steps that
you followed to analyse the data set.
(b) Implement Logistic Regression using python.
What is Logistic Regression?
Logistic regression is used for binary classification, where a sigmoid function takes the
independent variables as input and produces a probability value between 0 and 1.
For example, given two classes, Class 0 and Class 1: if the value of the logistic function for
an input is greater than 0.5 (the threshold value), the input belongs to Class 1; otherwise, it
belongs to Class 0.
It is referred to as regression because it is an extension of linear regression, but it is mainly
used for classification problems.
Key Points:
● Logistic regression predicts the output of a categorical dependent variable. Therefore,
the outcome must be a categorical or discrete value.
● It can be either Yes or No, 0 or 1, True or False, etc., but instead of giving the exact
values 0 and 1, it gives probabilistic values which lie between 0 and 1.
● In Logistic regression, instead of fitting a regression line, we fit an “S” shaped logistic
function, which predicts two maximum values (0 or 1).
Linear Regression Equation:
y = b0 + b1x1 + b2x2 + ... + bnxn
where y is the dependent variable and x1, x2, ..., xn are the explanatory variables.
Sigmoid Function:
σ(z) = 1 / (1 + e^(−z))
Applying the sigmoid function to the linear regression equation gives the logistic regression
model:
log(y / (1 − y)) = b0 + b1x1 + b2x2 + ... + bnxn
Logistic Function – Sigmoid Function
● The sigmoid function is a mathematical function used to map the predicted values to
probabilities.
● It maps any real value into another value within the range of 0 and 1. Since the output
of logistic regression must lie between 0 and 1 and cannot go beyond this limit, the
function forms a curve in the “S” form.
● The S-form curve is called the sigmoid function or the logistic function.
● In logistic regression, we use the concept of a threshold value, which decides between
0 and 1: values above the threshold tend to 1, and values below the threshold tend to 0
(see the short sketch after this list).
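The following is a minimal, self-contained sketch of the sigmoid function and the 0.5
threshold rule described above (the input values are illustrative only, and this snippet is
separate from the program below):
import numpy as np

def sigmoid(z):
    # maps any real value into the open interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
probs = sigmoid(z)
print(probs)                       # probabilities between 0 and 1
print((probs > 0.5).astype(int))   # threshold at 0.5 -> Class 0 or Class 1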
Program: -
#importing libraries
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt
df= pd.read_csv("iris.csv") #importing dataset and making dataframe
df.head() #showing top 5 data entries
df.describe() #describes our data
df.info() #gives information about the columns
df.shape #tells us the number of rows and columns [rows, columns]
# Output: (150, 5)
print(df["variety"].value_counts())
sns.countplot(x="variety", data=df)
plt.figure(figsize=(8,4))
sns.heatmap(df.corr(numeric_only=True),annot=True,fmt=".0%") #heatmap of the correlation matrix calculated by df.corr() (numeric columns only)
plt.show()
# We'll use seaborn's FacetGrid to color the scatterplot by species
sns.FacetGrid(df, hue="variety", height=5).map(plt.scatter, "sepal.length", "sepal.width").add_legend()
from sklearn.linear_model import LogisticRegression # for Logistic Regression algorithm
from sklearn.model_selection import train_test_split #to split the dataset for training and testing
from sklearn import metrics #for checking the model accuracy
X=df.iloc[:,0:4]
Y=df["variety"]
X.head()
X_train,X_test,Y_train,Y_test=train_test_split(X,Y,test_size=0.25,random_state=0) #split the main data into train and test
# the attribute test_size=0.25 splits the data into a 75%/25% ratio: train=75% and test=25%
print("Train Shape",X_train.shape)
print("Test Shape",X_test.shape)
log = LogisticRegression()
log.fit(X_train,Y_train)
prediction=log.predict(X_test)
print('The accuracy of the Logistic Regression is',metrics.accuracy_score(Y_test,prediction))
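To brief how well the model performed on the test set, one straightforward follow-up (a
short sketch using sklearn's standard metrics, not part of the original listing) is to print the
confusion matrix and the per-class report:
print(metrics.confusion_matrix(Y_test, prediction))      # rows: true class, columns: predicted class
print(metrics.classification_report(Y_test, prediction)) # precision/recall/F1 for each variety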
Experiment 6
Execute the Naïve Bayes algorithm with suitable data set and do proper
analysis on the result. Also implement Naïve Bayes algorithm using
python.
Naïve Bayes Classifier Algorithm
● Naïve Bayes algorithm is a supervised learning algorithm, which is based on Bayes
theorem and used for solving classification problems.
● It is mainly used in text classification that includes a high-dimensional training
dataset.
● The Naïve Bayes Classifier is one of the simplest and most effective classification
algorithms, helping to build fast machine learning models that can make quick
predictions.
● It is a probabilistic classifier, which means it predicts on the basis of the probability of
an object.
● Some popular applications of the Naïve Bayes algorithm are spam filtering, sentiment
analysis, and classifying articles.
Why is it called Naïve Bayes?
The Naïve Bayes algorithm is made up of two words, Naïve and Bayes, which can be
described as:
Naïve: It is called naïve because it assumes that the occurrence of a certain feature is
independent of the occurrence of other features. For example, if a fruit is identified on the
basis of colour, shape, and taste, then a red, spherical, and sweet fruit is recognised as an
apple. Hence each feature individually contributes to identifying it as an apple without
depending on the others (a small numeric sketch of this assumption follows below).
Bayes: It is called Bayes because it depends on the principle of Bayes' theorem.
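The following is a small numeric illustration of the independence assumption (the numbers
are hypothetical, chosen only for this sketch): the joint likelihood of the features given the
class factorises into a product of per-feature likelihoods.
# hypothetical per-feature likelihoods given the class "apple"
p_red_given_apple = 0.8
p_spherical_given_apple = 0.9
p_sweet_given_apple = 0.7
# the naive assumption: multiply the individual likelihoods
p_features_given_apple = p_red_given_apple * p_spherical_given_apple * p_sweet_given_apple
print(p_features_given_apple)  # 0.504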
Bayes' Theorem:
o Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to
determine the probability of a hypothesis with prior knowledge. It depends on the
conditional probability.
o The formula for Bayes' theorem is given as:
P(A|B) = P(B|A) · P(A) / P(B)
Where,
P(A|B) is Posterior probability: Probability of hypothesis A on the observed event B.
P(B|A) is Likelihood probability: Probability of the evidence given that the hypothesis is true.
P(A) is Prior Probability: Probability of the hypothesis before observing the evidence.
P(B) is Marginal Probability: Probability of the evidence.
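As a quick worked example of the formula (with hypothetical numbers): let A = "an email
is spam" and B = "the email contains the word 'offer'".
p_a = 0.20           # P(A): prior probability that an email is spam
p_b_given_a = 0.60   # P(B|A): likelihood of the word given spam
p_b = 0.25           # P(B): marginal probability of the word overall
p_a_given_b = p_b_given_a * p_a / p_b  # Bayes' theorem
print(p_a_given_b)   # 0.48 -> posterior probability of spam given the word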
Program: -
#importing libraries
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.naive_bayes import GaussianNB
from sklearn import metrics
from sklearn.model_selection import train_test_split #to split the dataset for training and testing
df= pd.read_csv("iris.csv") #importing dataset and making dataframe
df.head() #showing top 5 data entries
df.describe() #describes our data
df.info() #gives information about the columns
df.shape #tells us the number of rows and columns [rows, columns]
# Output: (150, 5)
print(df["variety"].value_counts())
sns.countplot(x="variety", data=df)
plt.figure(figsize=(8,4))
sns.heatmap(df.corr(numeric_only=True),annot=True,fmt=".0%") #heatmap of the correlation matrix calculated by df.corr() (numeric columns only)
plt.show()
# We'll use seaborn's FacetGrid to color the scatterplot by species
sns.FacetGrid(df, hue="variety", height=5).map(plt.scatter, "sepal.length", "sepal.width").add_legend()
X=df.iloc[:,0:4] #feature columns
Y=df["variety"] #target column
X_train,X_test,Y_train,Y_test=train_test_split(X,Y,test_size=0.4,random_state=0) #split the main data into train and test
# the attribute test_size=0.4 splits the data into a 60%/40% ratio: train=60% and test=40%
print("Train Shape",X_train.shape)
print("Test Shape",X_test.shape)
# Output:
# Train Shape (90, 4)
# Test Shape (60, 4)
#Creating the Naive Bayes classifier model
gnb = GaussianNB()
gnb.fit(X_train, Y_train)
# Output: GaussianNB(priors=None, var_smoothing=1e-09)
# making predictions on the testing set
y_pred = gnb.predict(X_test)
print("Gaussian Naive Bayes model accuracy(in %):", metrics.accuracy_score(Y_test,
y_pred)*100)
Gaussian Naive Bayes model accuracy(in %): 93.33333333333333
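As an extra check on the fitted model (a sketch, not part of the original output), one can ask
GaussianNB for the per-class probabilities of a single new flower measurement:
sample = pd.DataFrame([[5.1, 3.5, 1.4, 0.2]], columns=X_train.columns) # hypothetical measurements
print(gnb.predict(sample))        # predicted variety
print(gnb.predict_proba(sample))  # per-class probabilities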
Experiment 7
Identify a data set for executing the Decision Tree algorithm to implement
using python and analyse the same with cross validation and percentage
split.
Decision Tree Algorithm
• Decision Tree is a Supervised learning technique that can be used for both
classification and Regression problems, but mostly it is preferred for solving
Classification problems.
• It is a tree-structured classifier, where internal nodes represent the features of a
dataset, branches represent the decision rules and each leaf node represents the
outcome.
• In a Decision tree, there are two nodes, which are the Decision Node and Leaf Node.
• Decision nodes are used to make any decision and have multiple branches, whereas
Leaf nodes are the output of those decisions and do not contain any further branches.
• The decisions or the test are performed on the basis of features of the given dataset.
• It is a graphical representation for getting all the possible solutions to a
problem/decision based on given conditions.
• It is called a decision tree because, similar to a tree, it starts with the root node, which
expands on further branches and constructs a tree-like structure.
A decision tree simply asks a question and, based on the answer (Yes/No), it further splits
the tree into subtrees.
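The following is a minimal self-contained sketch (toy data, separate from the program
below) showing how a fitted tree encodes its root node, decision nodes, and leaf nodes as
rules:
from sklearn.tree import DecisionTreeClassifier, export_text

X_toy = [[0, 0], [1, 0], [0, 1], [1, 1]]  # two binary features
y_toy = [0, 1, 0, 1]                      # the label follows the first feature
tree = DecisionTreeClassifier().fit(X_toy, y_toy)
print(export_text(tree, feature_names=["feature_0", "feature_1"]))
# the printed rules show the root split, the branches, and the leaf outcomes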
Program: -
# Load libraries
import pandas as pd
import seaborn as sns
from sklearn.tree import DecisionTreeClassifier # Import Decision Tree Classifier
from sklearn.model_selection import train_test_split # Import train_test_split function
from sklearn import metrics #Import scikit-learn metrics module for accuracy calculation
# load dataset
pima = pd.read_csv("diabetes.csv")
pima.head()
pima.describe()
pima.info()
print(pima["Outcome"].value_counts())
sns.countplot(x="Outcome", data=pima)
#split dataset in features and target variable
feature_cols = ['Pregnancies', 'Glucose', 'BloodPressure',
'SkinThickness','Insulin','BMI','DiabetesPedigreeFunction','Age']
X = pima[feature_cols] # Features
y = pima.Outcome # Target variable
# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1) # 70% training and 30% test
# Create Decision Tree classifer object
clf = DecisionTreeClassifier()
# Train Decision Tree Classifer
clf = clf.fit(X_train,y_train)
#Predict the response for test dataset
y_pred = clf.predict(X_test)
# Model Accuracy, how often is the classifier correct?
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
# Output: Accuracy: 0.70995670995671
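The experiment also asks for an analysis with cross validation in addition to the percentage
split above. A short sketch using sklearn's cross_val_score on the same features and target:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(DecisionTreeClassifier(random_state=1), X, y, cv=5) # 5-fold CV
print("5-fold CV accuracies:", scores)
print("Mean CV accuracy:", scores.mean())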
Experiment 8
Identify / prepare a data set for executing the K-Means algorithm. Implement K-
Means using python. Do a proper analysis of the result by visualizing the
clusters and by changing K.
What is K-Means Algorithm?
• K-Means Clustering is an Unsupervised Learning algorithm, which groups the
unlabeled dataset into different clusters.
• Here K defines the number of pre-defined clusters that need to be created in the
process, as if K=2, there will be two clusters, and for K=3, there will be three clusters,
and so on.
• It allows us to cluster the data into different groups and provides a convenient way to
discover the categories of groups in an unlabeled dataset on its own, without the need
for any training.
• It is a centroid-based algorithm, where each cluster is associated with a centroid.
• The main aim of this algorithm is to minimize the sum of distances between the data
point and their corresponding clusters.
• The algorithm takes the unlabeled dataset as input, divides the dataset into k
clusters, and repeats the process until it finds the best clusters.
• The value of k should be predetermined in this algorithm.
• The k-means clustering algorithm mainly performs two tasks:
• Determines the best value for K center points or centroids by an iterative process.
• Assigns each data point to its closest k-center. The data points that are near a
particular k-center form a cluster.
Hence each cluster has data points with some commonalities, and it is away from other
clusters.
The working of the K-Means algorithm is explained in the below steps:
• Step-1: Select the number K to decide the number of clusters.
• Step-2: Select K random points as centroids (they may be points other than those
from the input dataset).
• Step-3: Assign each data point to its closest centroid, which will form the
predefined K clusters.
• Step-4: Calculate the variance and place a new centroid for each cluster.
• Step-5: Repeat the third step, which means reassign each data point to the new
closest centroid of its cluster.
• Step-6: If any reassignment occurs, then go to step-4 else go to FINISH.
• Step-7: The model is ready.
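The following is a minimal NumPy sketch of the steps above (illustrative only; the program
at the end of this experiment uses sklearn's KMeans instead):
import numpy as np

def kmeans_sketch(points, k, n_iter=100, seed=42):
    rng = np.random.default_rng(seed)
    # Step-2: pick K random points from the data as the initial centroids
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Step-3: assign each point to its closest centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step-4: recompute each centroid as the mean of its cluster
        # (keep the old centroid if a cluster happens to be empty)
        new_centroids = np.array([points[labels == j].mean(axis=0)
                                  if np.any(labels == j) else centroids[j]
                                  for j in range(k)])
        # Step-6: stop when no centroid moves any more
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids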
How to choose the value of "K number of clusters" in K-means Clustering?
• The performance of the K-means clustering algorithm depends upon highly efficient
clusters that it forms.
• But choosing the optimal number of clusters is a big task.
• There are some different ways to find the optimal number of clusters, but here we are
discussing the most appropriate method to find the number of clusters or value of K.
Elbow Method
• The Elbow method is one of the most popular ways to find the optimal number of
clusters. This method uses the concept of WCSS value. WCSS stands for Within
Cluster Sum of Squares, which defines the total variations within a cluster.
• The formula to calculate the value of WCSS (for 3 clusters) is:
WCSS = Σ(Pi in Cluster1) distance(Pi, C1)² + Σ(Pi in Cluster2) distance(Pi, C2)²
+ Σ(Pi in Cluster3) distance(Pi, C3)²
• When WCSS is plotted against the number of clusters, the graph shows a sharp bend
that looks like an elbow, which is why this approach is known as the elbow method.
Program: -
# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd
from sklearn.cluster import KMeans
# Importing the dataset
dataset = pd.read_csv('Mall_Customers.csv')
print(dataset.head())
x = dataset.iloc[:, [3, 4]].values
wcss_list= [] #Initializing the list for the values of WCSS
#Using a for loop to iterate K from 1 to 10
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state= 42)
    kmeans.fit(x)
    wcss_list.append(kmeans.inertia_)
mtp.plot(range(1, 11), wcss_list)
mtp.title('The Elbow Method Graph')
mtp.xlabel('Number of clusters(k)')
mtp.ylabel('wcss_list')
mtp.show()
#training the K-means model on a dataset
kmeans = KMeans(n_clusters=5, init='k-means++', random_state= 42)
y_predict=kmeans.fit_predict(x)
#visualizing the clusters
mtp.scatter(x[y_predict == 0, 0], x[y_predict == 0, 1], s = 100, c = 'blue', label = 'Cluster 1') #first cluster
mtp.scatter(x[y_predict == 1, 0], x[y_predict == 1, 1], s = 100, c = 'green', label = 'Cluster 2') #second cluster
mtp.scatter(x[y_predict == 2, 0], x[y_predict == 2, 1], s = 100, c = 'red', label = 'Cluster 3') #third cluster
mtp.scatter(x[y_predict == 3, 0], x[y_predict == 3, 1], s = 100, c = 'cyan', label = 'Cluster 4') #fourth cluster
mtp.scatter(x[y_predict == 4, 0], x[y_predict == 4, 1], s = 100, c = 'magenta', label = 'Cluster 5') #fifth cluster
mtp.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s = 300, c = 'yellow', label = 'Centroid')
mtp.title('Clusters of customers')
mtp.xlabel('Annual Income (k$)')
mtp.ylabel('Spending Score (1-100)')
mtp.legend(loc='lower center')
mtp.show()
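The experiment also asks to analyse the result by changing K. One way to do this (a sketch
that reuses the feature matrix x from the program above) is to compare silhouette scores for
several values of K; higher scores indicate better-separated clusters:
from sklearn.metrics import silhouette_score

for k in range(2, 8):
    km = KMeans(n_clusters=k, init='k-means++', random_state=42)
    labels = km.fit_predict(x)
    print("K =", k, "-> silhouette score =", round(silhouette_score(x, labels), 3))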