Datascience Notes Unit-4
Supervised Machine Learning:
Supervised machine learning is a fundamental approach for machine learning and artificial
intelligence. It involves training a model using labeled data, where each input comes with a
corresponding correct output. The process is like a teacher guiding a student—hence the term
"supervised" learning.
A supervised learning algorithm works with input features and corresponding output
labels. The process works as follows:
Training Data: The model is provided with a training dataset that includes input data
(features) and corresponding output data (labels or target variables).
Learning Process: The algorithm processes the training data, learning the relationships
between the input features and the output labels. This is achieved by adjusting the
model's parameters to minimize the difference between its predictions and the actual
labels.
Training phase involves feeding the algorithm labeled data, where each data point is
paired with its correct output. The algorithm learns to identify patterns and
relationships between the input and output data.
Testing phase involves feeding the algorithm new, unseen data and evaluating its ability
to predict the correct output based on the learned patterns.
Classification: Where the output is a categorical variable (e.g., spam vs. non-spam
emails, yes vs. no).
Regression: Where the output is a continuous variable (e.g., predicting house prices,
stock prices).
Regression:
The regression procedure aims to identify the mapping function that maps the input
variable "x" to the continuous output variable "y."
Classification:
On the other hand, classification is an algorithm that identifies functions that support
categorizing the dataset based on different factors. Computer software learns from the training
dataset when employing a classification algorithm, then divides the data into several groups
based on what it has discovered.
Classification algorithms find the mapping function that maps the "x" input to a discrete "y"
output. Based on a given set of independent variables, the algorithms estimate discrete values
(often binary values such as 0 and 1, yes and no, true or false). Put more simply, classification
algorithms determine the likelihood that an event will occur by fitting the data to a logit function.
For example, a classification model with a decision boundary can determine whether an email is
spam or not, classify images as "cat" or "dog," or predict weather conditions like "sunny,"
"rainy," or "cloudy." Regression models, by contrast, fit a straight line (or curve) to predict house
prices from features like size and location, or to forecast stock prices over time.
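As a quick, hedged illustration of the two problem types, the sketch below fits a classifier on toy spam-style labels and a regressor on toy house prices with scikit-learn; the feature values and labels are made up purely for illustration.

from sklearn.linear_model import LogisticRegression, LinearRegression

# Classification: discrete labels (0 = not spam, 1 = spam) from two toy features
X_cls = [[1, 20], [3, 45], [10, 5], [12, 8]]
y_cls = [0, 0, 1, 1]
clf = LogisticRegression().fit(X_cls, y_cls)
print(clf.predict([[11, 6]]))    # outputs a class label such as [1]

# Regression: continuous target (house price) from house size
X_reg = [[1300], [1500], [1700], [1830]]
y_reg = [240, 320, 330, 295]
reg = LinearRegression().fit(X_reg, y_reg)
print(reg.predict([[1600]]))     # outputs a continuous value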
Supervised Machine Learning Algorithms
Supervised learning can be further divided into several different types, each with its own
unique characteristics and applications. Here are some of the most common types of
supervised learning algorithms:
Decision Trees : A decision tree is a tree-like structure that is used to model decisions and
their possible consequences. Each internal node in the tree represents a decision, while
each leaf node represents a possible outcome.
Random Forests : A random forest is made up of multiple decision trees that
work together to make predictions. Each tree in the forest is trained on a different
subset of the input features and data. The final prediction is made by aggregating the
predictions of all the trees in the forest.
K-Nearest Neighbors (KNN) : KNN works by finding the k training examples closest to a given input
and then predicts the class or value based on the majority class or average value of
these neighbors. The performance of KNN can be influenced by the choice of k and the
distance metric used to measure proximity.
Gradient Boosting : Gradient Boosting combines weak learners, like decision trees, to
create a strong model. It iteratively builds new models that correct errors made by
previous ones.
Naive Bayes Algorithm: The Naive Bayes algorithm is a supervised machine learning
algorithm based on applying Bayes' Theorem with the “naive” assumption that features
are independent of each other given the class label.
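To make the K-Nearest Neighbors idea above concrete, here is a minimal sketch using scikit-learn's KNeighborsClassifier; the data points and the choice of k = 3 are assumptions made only for this example.

from sklearn.neighbors import KNeighborsClassifier

X = [[1.0], [1.2], [1.4], [5.0], [5.2], [5.4]]   # one feature, two well-separated groups
y = [0, 0, 0, 1, 1, 1]

# k = 3 neighbors with the default Euclidean distance metric
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)
print(knn.predict([[1.3], [5.1]]))   # majority vote of the 3 closest points -> [0 1]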
Support Vector Machine (SVM):
Support Vector Machine (SVM) is a supervised machine learning algorithm used for classification
and regression tasks. It finds the optimal boundary (hyperplane) to separate classes while
ensuring the maximum margin between them. SVM is widely applied in fields like image
recognition, text classification, and bioinformatics due to its efficiency in handling
high-dimensional data. The following sections cover SVM's working principles, mathematical
formulation, different types, real-world applications, and implementation.
SVM is a classification algorithm that finds the best boundary (hyperplane) to separate different
classes in a dataset. It works by identifying key data points, called support vectors, that
influence the position of this boundary, ensuring maximum separation between categories.
For example, if we have a dataset of emails labeled as spam and not spam, SVM will create a
decision boundary that best separates these two groups based on features like word frequency
and message length.
What SVM Does:
Support Vector Machines (SVM) find the optimal hyperplane that maximizes the margin
between two classes. The margin is the distance between the decision boundary and the
closest data points from each class, which are called support vectors.
SVM aims to find the hyperplane that has the maximum margin, as this typically leads to better
generalization on unseen data.
SVM works through the following steps:
1. Hyperplane Selection: SVM finds the optimal hyperplane that separates the data into
distinct classes.
2. Maximizing the Margin: It ensures that the separation margin between classes is as
wide as possible to improve generalization.
3. Using Support Vectors: The closest points to the hyperplane (support vectors) influence
its placement.
4. Kernel Trick (for Non-Linearity): If the data is not linearly separable, SVM transforms it
into a higher dimension using kernel functions.
For two-class classification, given a dataset (xi, yi), where xi are feature vectors and yi are class
labels (either +1 or -1), SVM aims to minimize the function:
(1/2) ||w||²
Subject to:
yi (w · xi + b) ≥ 1 for all i
where:
w is the weight vector normal to the hyperplane, b is the bias term, and ||w|| controls the
width of the margin (the margin equals 2 / ||w||).
In cases where data points are not entirely separable, slack variables ξi are introduced,
modifying the function as follows:
(1/2) ||w||² + C Σi ξi, subject to yi (w · xi + b) ≥ 1 − ξi and ξi ≥ 0
Here, C controls the trade-off between maximizing the margin and minimizing classification
errors.
Types of SVM
SVM can be classified into different types based on the nature of the dataset:
1. Linear SVM: Used when data can be separated using a straight hyperplane. It is suitable
for datasets where classes are linearly separable. The decision boundary is a straight line
(in 2D) or a plane (in higher dimensions).
2. Non-Linear SVM: When data is not linearly separable, kernel functions are used to map
data into a higher-dimensional space where separation is possible. Kernel functions like
RBF and polynomials are commonly used in non-linear SVM.
3. Support Vector Regression (SVR): A variation of SVM used for regression problems
instead of classification. It works similarly to classification SVM but tries to fit a function
within a margin of tolerance rather than finding a strict boundary between categories.
4. Hard Margin SVM: Assumes that the data is perfectly separable and aims to find a
hyperplane that classifies all data points correctly with no tolerance for misclassification.
This works well when there is a clear distinction between classes.
5. Soft Margin SVM: Introduces slack variables to handle cases where classes may overlap
slightly. It allows some misclassifications to improve generalization and prevent
overfitting.
Real-world data is often non-linearly separable. SVM uses kernel functions to transform data
into a higher-dimensional space where it becomes separable. Some popular kernels include the
linear kernel, the polynomial kernel, the radial basis function (RBF) kernel, and the sigmoid kernel.
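As a rough sketch of the kernel trick in practice, the example below compares a linear kernel and an RBF kernel in scikit-learn's SVC on a synthetically generated circular dataset; the dataset (make_circles) and parameter values are assumptions chosen only for illustration.

from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings: not separable by a straight line
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear_svm = SVC(kernel='linear').fit(X, y)            # straight decision boundary
rbf_svm = SVC(kernel='rbf', gamma='scale').fit(X, y)   # implicit mapping to a higher dimension

print("linear accuracy:", linear_svm.score(X, y))      # poor on circular data
print("rbf accuracy:", rbf_svm.score(X, y))            # close to 1.0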
Implementation:
For implementing SVM in Python, we will start by importing the standard libraries as follows −
import numpy as np
import matplotlib.pyplot as plt
Next, we create a sample dataset with linearly separable data using make_blobs from
sklearn.datasets for classification using SVM −
from sklearn.datasets import make_blobs
X, y = make_blobs(n_samples=100, centers=2, random_state=0, cluster_std=0.50)
The following would be the output after generating a sample dataset with 100 samples and 2
clusters −
We know that SVM supports discriminative classification: it divides the classes from each
other by finding a line in the case of two dimensions, or a manifold in the case of multiple
dimensions. It is implemented on the above dataset as follows −
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='autumn')
plt.xlim(-1, 3.5)
As discussed, the main goal of SVM is to divide the dataset into classes by finding a maximum
marginal hyperplane (MMH). Hence, rather than drawing a zero-width line between the classes,
we can draw around each candidate line a margin of some width up to the nearest point. It can
be done as follows −
xfit = np.linspace(-1, 3.5)
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='autumn')
for m, b, d in [(1, 0.65, 0.33), (0.5, 1.6, 0.55), (-0.2, 2.9, 0.2)]:
    yfit = m * xfit + b
    plt.plot(xfit, yfit, '-k')
    plt.fill_between(xfit, yfit - d, yfit + d, edgecolor='none',
                     color='#AAAAAA', alpha=0.4)
plt.xlim(-1, 3.5)
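Building on the blobs generated above, the following hedged sketch fits scikit-learn's SVC with a linear kernel and prints the support vectors and the learned hyperplane parameters; it assumes X and y from the make_blobs step, and the very large C value is only an assumption used to approximate a hard margin.

from sklearn.svm import SVC

model = SVC(kernel='linear', C=1e10)   # large C ~ hard margin on separable data
model.fit(X, y)

print("support vectors:\n", model.support_vectors_)     # the points that fix the margin
print("weights w:", model.coef_, "bias b:", model.intercept_)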
Random Forest Algorithm:
Random Forest is a machine learning algorithm that uses an ensemble of decision trees to make
predictions. The algorithm was first introduced by Leo Breiman in 2001. The key idea behind the
algorithm is to create a large number of decision trees, each of which is trained on a different
subset of the data. The predictions of these individual trees are then combined to produce a
final prediction.
We can understand the working of Random Forest algorithm with the help of following steps −
Step 1 − First, start with the selection of random samples from a given dataset.
Step 2 − Next, this algorithm will construct a decision tree for every sample. Then it will
get the prediction result from every decision tree.
Step 3 − In this step, voting will be performed for every predicted result.
Step 4 − At last, select the most voted prediction result as the final prediction result.
The following diagram illustrates how the Random Forest Algorithm works −
Random Forest is a flexible algorithm that can be used for both classification and regression
tasks. In classification tasks, the algorithm uses the mode of the predictions of the individual
trees to make the final prediction. In regression tasks, the algorithm uses the mean of the
predictions of the individual trees.
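The following small sketch (synthetic data and parameters assumed only for illustration) shows the point above for regression: the forest's prediction is simply the mean of the individual trees' predictions.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.1, 200)

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

x_new = [[2.5]]
tree_preds = [tree.predict(x_new)[0] for tree in forest.estimators_]
print("mean of the trees:", np.mean(tree_preds))
print("forest prediction:", forest.predict(x_new)[0])   # the same value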
Advantages of Random Forest Algorithm
Random Forest algorithm has several advantages over other machine learning algorithms. Some
of the key advantages are −
High Accuracy − Random Forest algorithm is known for its high accuracy. This is because
the algorithm combines the predictions of multiple decision trees, which helps to
reduce the impact of individual decision trees that may be biased or inaccurate.
Handles Missing Data − Random Forest algorithm can handle missing data without the
need for imputation. This is because the algorithm only considers the features that are
available for each data point and does not require all features to be present for all data
points.
Feature Importance − Random Forest algorithm can provide information about the
importance of each feature in the model. This information can be used to identify the
most important features in the data and can be used for feature selection and feature
engineering.
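As a brief, hedged illustration of the feature-importance point, the sketch below reads feature_importances_ from a forest trained on the Iris data; loading Iris from scikit-learn (rather than from the CSV used in the implementation that follows) and the random_state value are assumptions for this example only.

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
rfc = RandomForestClassifier(n_estimators=100, random_state=42).fit(iris.data, iris.target)

# Higher scores indicate features the trees relied on more for their splits
for name, score in zip(iris.feature_names, rfc.feature_importances_):
    print(f"{name}: {score:.3f}")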
Let's take a look at the implementation of Random Forest Algorithm in Python. We will be using
the scikit-learn library to implement the algorithm. The scikit-learn library is a popular machine
learning library that provides a wide range of algorithms and tools for machine learning.
We will begin by importing the necessary libraries. We will be using the pandas library for data
manipulation, and the scikit-learn library for implementing the Random Forest algorithm.
import pandas as pd
from sklearn.model_selection import train_test_split
Next, we will load the data into a pandas dataframe. For this tutorial, we will be using the
famous Iris dataset, which is a classic dataset for classification tasks.
Before we can use the data to train our model, we need to preprocess it. This involves
separating the features and the target variable and splitting the data into training and testing
sets.
X = iris.iloc[:, :-1]
y = iris.iloc[:, -1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.35, random_state=42)
Next, we will train our Random Forest classifier on the training data.
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(n_estimators=100)
rfc.fit(X_train, y_train)
Once we have trained our model, we can use it to make predictions on the test data.
y_pred = rfc.predict(X_test)
Finally, we evaluate the model on the test data using accuracy, precision, recall, and F1-score −
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')
f1 = f1_score(y_test, y_pred, average='weighted')
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1-score:", f1)
Below is the complete implementation example of Random Forest Algorithm in python using
the iris dataset −
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Load the Iris dataset and separate the features (X) from the target label (y)
iris = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data',
                   header=None)
X = iris.iloc[:, :-1]
y = iris.iloc[:, -1]

# Split the data, train the classifier and predict on the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.35, random_state=42)
rfc = RandomForestClassifier(n_estimators=100)
rfc.fit(X_train, y_train)
y_pred = rfc.predict(X_test)

# Evaluate the predictions (weighted averaging for the three classes)
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')
f1 = f1_score(y_test, y_pred, average='weighted')
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1-score:", f1)
Output
This will give us the performance metrics of our Random Forest classifier as follows −
Accuracy: 0.9811320754716981
Precision: 0.9821802935010483
Recall: 0.9811320754716981
F1-score: 0.9811157396063056
Pros
Random forests work well for a larger range of data than a single decision tree does.
Random forests are very flexible and possess very high accuracy.
Random forest does not require scaling of the data. It maintains good accuracy even when
data is provided without scaling.
Cons
Construction of random forests is much harder and more time-consuming than that of
decision trees.
The prediction process using random forests is time-consuming in comparison with
other algorithms.
Regression Analysis:
Regression models use the input data features (independent variables) and
their corresponding continuous numeric output values (dependent or
outcome variables) to learn specific associations between inputs and
corresponding outputs.
Regression:
During the training phase, the regression algorithm learns the relation
between independent variables (predictors) and dependent variables
(target).
The regression models predict new values based on the learned relation
between predictors and targets during the training.
Regression in machine learning refers to a supervised learning technique where the goal is to
predict a continuous numerical value based on one or more independent features. It finds
relationships between variables so that predictions can be made. We have two types of
variables present in regression:
Dependent Variable (Target): The variable we are trying to predict e.g house price.
Independent Variables (Features): The input variables that influence the prediction e.g locality,
number of rooms.
A regression analysis problem arises when the output variable is a real or continuous value such as
"salary" or "weight". Many different regression models can be used, but the simplest among
them is linear regression.
Types of Regression
Regression can be classified into different types based on the number of predictor variables and
the nature of the relationship between variables:
1. Linear Regression
Linear regression is one of the simplest and most widely used statistical models. This assumes
that there is a linear relationship between the independent and dependent variables. This
means that the change in the dependent variable is proportional to the change in the
independent variables. For example predicting the price of a house based on its size.
2. Multiple Linear Regression
Multiple linear regression extends simple linear regression by using multiple independent
variables to predict the target variable. For example predicting the price of a house based on
multiple features such as size, location, number of rooms, etc.
3. Polynomial Regression
Polynomial regression is used to model non-linear relationships between the dependent
variable and the independent variables. It adds polynomial terms to the linear regression model
to capture more complex relationships. For example when we want to predict a non-linear
trend like population growth over time we use polynomial regression.
4. Ridge & Lasso Regression
Ridge and lasso regression are regularized versions of linear regression that help avoid overfitting
by penalizing large coefficients. When there is a risk of overfitting due to too many features, we
use these types of regression algorithms (see the sketch after this list).
5. Support Vector Regression (SVR)
SVR is a type of regression algorithm that is based on the Support Vector Machine (SVM)
algorithm. SVM is primarily used for classification tasks, but it can also be used for regression.
SVR works by finding a function that fits the data within a margin of tolerance (epsilon),
penalizing only the predictions that fall outside this margin.
6. Decision Tree Regression
Decision tree regression uses a tree-like structure to make decisions, where each branch of the
tree represents a decision and the leaves represent outcomes. For example, we use decision
tree regression to predict customer behavior based on features like age, income, etc.
7. Random Forest Regression
Random forest is an ensemble method that builds multiple decision trees, each trained on a
different subset of the training data. The final prediction is made by averaging the predictions
of all the trees. For example, forecasting customer churn or sales data using this method.
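Below is the small sketch referred to in the ridge & lasso item above; the synthetic data and the alpha values are illustrative assumptions, chosen only to show how the two penalties behave.

import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 10))            # 10 features, only the first two matter
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=100)

ridge = Ridge(alpha=1.0).fit(X, y)        # shrinks all coefficients toward zero
lasso = Lasso(alpha=0.1).fit(X, y)        # can set irrelevant coefficients exactly to zero

print("ridge coefficients:", np.round(ridge.coef_, 2))
print("lasso coefficients:", np.round(lasso.coef_, 2))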
Evaluation in machine learning measures the performance of a model. Here are some popular
evaluation metrics for regression:
Mean Absolute Error (MAE): The average absolute difference between the predicted and actual
values of the target variable.
Mean Squared Error (MSE): The average squared difference between the predicted and actual
values of the target variable.
Root Mean Squared Error (RMSE): Square root of the mean squared error.
Huber Loss: A hybrid loss function that transitions from MAE to MSE for larger errors, providing
balance between robustness and MSE’s sensitivity to outliers.
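A minimal sketch of how these metrics can be computed, using a few made-up predictions and scikit-learn's metric functions (RMSE is taken here as the square root of MSE):

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([3.0, 5.0, 7.5, 10.0])
y_pred = np.array([2.5, 5.5, 7.0, 11.0])

mae = mean_absolute_error(y_true, y_pred)    # average |error|
mse = mean_squared_error(y_true, y_pred)     # average squared error
rmse = np.sqrt(mse)                          # back in the target's units

print("MAE:", mae, "MSE:", mse, "RMSE:", rmse)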
Let's take an example of linear regression. We have a Housing data set and we want to predict
the price of the house. Following is the python code for it.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn import linear_model
# Load dataset
df = pd.read_csv("Housing.csv")
Y = df['price']
X = df['lotsize']
X = X.to_numpy().reshape(len(X), 1)
Y = Y.to_numpy().reshape(len(Y), 1)
X_train = X[:-250]
X_test = X[-250:]
Y_train = Y[:-250]
Y_test = Y[-250:]
plt.scatter(X_test, Y_test, color='black')
plt.title('Test Data')
plt.xlabel('Size')
plt.ylabel('Price')
plt.xticks(())
plt.yticks(())
regr = linear_model.LinearRegression()
regr.fit(X_train, Y_train)
# Plot predictions
plt.plot(X_test, regr.predict(X_test), color='red', linewidth=3)
plt.show()
Output:
Here in this graph we plot the test data. The red line indicates the best fit line for predicting the
price.
Applications of Regression:
Predicting prices: Used to predict the price of a house based on its size, location and other
features.
Forecasting trends: Model to forecast the sales of a product based on historical sales data.
Identifying risk factors: Used to identify risk factors for heart disease based on patient medical
data.
Making decisions: It could be used to recommend which stock to buy based on market data.
Advantages of Regression:
Robust regression variants (e.g., those using Huber loss) can handle outliers.
Disadvantages of Regression:
Assumes linearity.
Sensitive to situations where two or more independent variables are highly correlated with each
other, i.e. multicollinearity.
Linear regression is a statistical technique that estimates the linear relationship between a
dependent and one or more independent variables. In machine learning, linear regression is
implemented as a supervised learning approach. In machine learning, labeled datasets contain
input data (features) and output labels (target values). For linear regression in machine
learning, we represent features as independent variables and target values as the dependent
variable.
For simplicity, consider the following data (single feature and single target):

Square Feet (X)   House Price (Y)
1300              240
1500              320
1700              330
1830              295
1550              256
2350              409
1450              319
In the above data, the feature Square Feet is the independent variable, represented by X, and
the target House Price is the dependent variable, represented by Y. The input features (X) are
used to predict the target label (Y). So, the independent variables are also known as predictor
variables, and the dependent variable is known as the response variable.
In machine learning, linear regression uses a linear equation to model the relationship between
a dependent variable (Y) and one or more independent variables (X).
The main goal of the linear regression model is to find the best-fitting straight line (often called
a regression line) through a set of data points.
Line of Regression
A straight line that shows a relation between the dependent variable and independent variables
is known as the line of regression or regression line.
Furthermore, the linear relationship can be positive or negative in nature as explained below −
1. Positive Linear Relationship
A linear relationship will be called positive if both the independent and dependent variables
increase. It can be understood with the help of the following graph –
2. Negative Linear Relationship
A linear relationship will be called negative if the independent variable increases and the
dependent variable decreases. It can be understood with the help of the following graph –
Simple Linear Regression:
Simple linear regression is a type of regression analysis in which a single independent variable
(also known as a predictor variable) is used to predict the dependent variable. In other words, it
models the linear relationship between the dependent variable and a single independent
variable.
In the above image, the straight line represents the simple linear regression line where Ŷ
is the predicted value, and X is the input value.
Y = w0 + w1X + ϵ
Where
w0 is the intercept (bias) of the regression line,
w1 is the weight or coefficient of the input feature X, and
ϵ is the error term, representing the difference between the observed value and the model's
prediction.
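As a hedged illustration, the sketch below estimates w0 and w1 for the square-feet data in the table above using the standard least-squares formulas; the printed values are illustrative, not output taken from the notes.

import numpy as np

X = np.array([1300, 1500, 1700, 1830, 1550, 2350, 1450])   # square feet
Y = np.array([240, 320, 330, 295, 256, 409, 319])           # house price

# w1 = sum((X - mean(X)) * (Y - mean(Y))) / sum((X - mean(X))^2), w0 = mean(Y) - w1 * mean(X)
w1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
w0 = Y.mean() - w1 * X.mean()

print(f"regression line: Y = {w0:.2f} + {w1:.4f} * X")
print("prediction for 1600 square feet:", w0 + w1 * 1600)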
Multiple Linear Regression:
Multiple linear regression is basically the extension of simple linear regression that predicts a
response using two or more features. When dealing with more than one independent variable,
we extend simple linear regression to multiple linear regression. The model is expressed as:
Y = w0 + w1X1 + w2X2 + ⋯ + wpXp + ϵ
Where
w0 is the intercept,
w1, w2, …, wp are the coefficients of the independent variables X1, X2, …, Xp, and
ϵ is the error term.
Polynomial Regression:
Polynomial Linear Regression is a type of regression analysis in which the relationship between
the independent variable and the dependent variable is modeled as an n-th degree polynomial
function. Polynomial regression allows for a more complex relationship between the variables
to be captured beyond the linear relationship in simple linear regression and multiple linear
regression.
In machine learning (ML) and data science, choosing between a linear regression or polynomial
regression depends upon the characteristics of the dataset. A non-linear dataset can't be fitted
with a linear regression. If we apply linear regression to a nonlinear dataset, it will not be able
to capture the non-linear patterns in the data.
Equation of Polynomial Regression Model
In machine learning, the general formula for polynomial regression of degree n is as follows −
y = w0 + w1x + w2x² + w3x³ + … + wnxⁿ + ϵ
Where
w0, w1, …, wn are the model coefficients (weights) of the polynomial terms, and
ϵ is the error term or residual, representing the difference between the observed value
and the model's prediction.
For example, a second-degree (quadratic) polynomial regression model is:
y = w0 + w1x + w2x² + ϵ
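As a sketch of how such a model can be fitted in practice, the example below expands a single feature into polynomial terms with scikit-learn's PolynomialFeatures and then fits an ordinary linear model; the synthetic data and the degree are assumptions made only for illustration.

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

rng = np.random.RandomState(0)
x = np.linspace(-3, 3, 50).reshape(-1, 1)
y = 1 + 2 * x.ravel() + 0.5 * x.ravel() ** 2 + rng.normal(0, 0.2, 50)   # quadratic trend

# Expand x into [1, x, x^2] and fit y = w0 + w1*x + w2*x^2
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(x, y)
print("w1, w2:", model.named_steps['linearregression'].coef_[1:])
print("w0:", model.named_steps['linearregression'].intercept_)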
Logistic Regression:
Logistic regression is a supervised learning classification algorithm used to predict the
probability of a target variable. In simple words, the dependent variable is binary in nature,
having data coded as either 1 (stands for success/yes) or 0 (stands for failure/no).
Generally, logistic regression means binary logistic regression having binary target variables, but
there can be two more categories of target variables that can be predicted by it. Based on those
number of categories, Logistic regression can be divided into following types −
Binary or Binomial:
In such a kind of classification, a dependent variable will have only two possible types, either 1
or 0. For example, these variables may represent success or failure, yes or no, win or loss etc.
Multinomial:
In such a kind of classification, dependent variable can have 3 or more possible unordered types
or the types having no quantitative significance. For example, these variables may represent
"Type A" or "Type B" or "Type C".
Ordinal:
In such a kind of classification, dependent variable can have 3 or more possible ordered types
or the types having a quantitative significance. For example, these variables may represent
"poor" or "good", "very good", "Excellent" and each category can have the scores like 0,1,2,3.
The simplest form of logistic regression is binary or binomial logistic regression in which the
target or dependent variable can have only 2 possible types either 1 or 0. It allows us to model
a relationship between multiple predictor variables and a binary/binomial target variable. In
case of logistic regression, the linear function is basically used as an input to another function
such as in the following relation −
hθ(x) = g(θTx), where 0 ≤ hθ(x) ≤ 1
g(z) = 1 / (1 + e^(−z)), where z = θTx
The sigmoid curve can be represented with the help of the following graph. We can see that the
values on the y-axis lie between 0 and 1, and the curve crosses the axis at 0.5.
The classes can be divided into positive or negative. The output of the hypothesis function,
which lies between 0 and 1, is interpreted as the probability of the positive class. For our
implementation, we interpret the output of the hypothesis function as positive if it is ≥ 0.5,
otherwise negative.
We also need to define a loss function to measure how well the algorithm performs using the
weights on functions, represented by theta as follows −
h = g(Xθ)
J(θ) = (1/m) · (−yT log(h) − (1 − y)T log(1 − h))
Now, after defining the loss function our prime goal is to minimize the loss function. It can be
done with the help of fitting the weights which means by increasing or decreasing the weights.
With the help of derivatives of the loss function w.r.t each weight, we would be able to know
what parameters should have high weight and what should have smaller weight.
The following gradient descent equation tells us how loss would change if we modified the
parameters −
δJ(θ)/δθj = (1/m) XT (g(Xθ) − y)
Now we will implement the above concept of binomial logistic regression in Python. For this
purpose, we are using a multivariate flower dataset named iris, which has 3 classes of 50
instances each, but we will be using only the first two feature columns. Every class represents a type
of iris flower.
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets

iris = datasets.load_iris()
X = iris.data[:, :2]
y = (iris.target != 0) * 1
plt.figure(figsize=(6, 6))
plt.scatter(X[y == 0][:, 0], X[y == 0][:, 1], color='b', label='0')
plt.scatter(X[y == 1][:, 0], X[y == 1][:, 1], color='r', label='1')
plt.legend();
Next, we will define the sigmoid function, loss function, and gradient descent as follows −
class LogisticRegression:
    def __init__(self, lr=0.01, num_iter=100000, fit_intercept=True, verbose=False):
        self.lr = lr
        self.num_iter = num_iter
        self.fit_intercept = fit_intercept
        self.verbose = verbose
    def __add_intercept(self, X):
        intercept = np.ones((X.shape[0], 1))
        return np.concatenate((intercept, X), axis=1)
    def __sigmoid(self, z):
        return 1 / (1 + np.exp(-z))
    def __loss(self, h, y):
        return (-y * np.log(h) - (1 - y) * np.log(1 - h)).mean()
    def fit(self, X, y):
        if self.fit_intercept:
            X = self.__add_intercept(X)
        self.theta = np.zeros(X.shape[1])
        for i in range(self.num_iter):
            z = np.dot(X, self.theta)
            h = self.__sigmoid(z)
            gradient = np.dot(X.T, (h - y)) / y.size
            self.theta -= self.lr * gradient
            if self.verbose and i % 10000 == 0:
                z = np.dot(X, self.theta)
                h = self.__sigmoid(z)
                loss = self.__loss(h, y)
                print(f'loss: {loss}')
With the help of the following methods (continuing the class above), we can predict the output
probabilities and class labels −
    def predict_prob(self, X):
        if self.fit_intercept:
            X = self.__add_intercept(X)
        return self.__sigmoid(np.dot(X, self.theta))
    def predict(self, X):
        return self.predict_prob(X).round()
Next, we can train the model, evaluate its training accuracy, and plot it as follows −
model = LogisticRegression(lr=0.1, num_iter=300000)
model.fit(X, y)
preds = model.predict(X)
(preds == y).mean()
plt.figure(figsize=(10, 6))
plt.scatter(X[y == 0][:, 0], X[y == 0][:, 1], color='b', label='0')
plt.scatter(X[y == 1][:, 0], X[y == 1][:, 1], color='r', label='1')
plt.legend()
xx1, xx2 = np.meshgrid(np.linspace(X[:, 0].min(), X[:, 0].max()),
                       np.linspace(X[:, 1].min(), X[:, 1].max()))
grid = np.c_[xx1.ravel(), xx2.ravel()]
probs = model.predict_prob(grid).reshape(xx1.shape)
plt.contour(xx1, xx2, probs, [0.5], colors='black')
Another useful form of logistic regression is multinomial logistic regression in which the target
or dependent variable can have 3 or more possible unordered types i.e. the types having no
quantitative significance.
Now we will implement the above concept of multinomial logistic regression in Python. For this
purpose, we are using a dataset from sklearn named digits.
from sklearn import datasets, linear_model, metrics
from sklearn.model_selection import train_test_split
digits = datasets.load_digits()
X = digits.data
y = digits.target
With the help of the following lines of code, we can split X and y into training and testing sets
and create the logistic regression object −
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
digreg = linear_model.LogisticRegression()
Now, we need to train the model by using the training sets as follows −
digreg.fit(X_train, y_train)
y_pred = digreg.predict(X_test)
print("Accuracy of Logistic Regression model is:", metrics.accuracy_score(y_test, y_pred))
Output: