Datascience Notes Unit-4

UNIT-4

Supervised machine learning:

Supervised machine learning is a fundamental approach for machine learning and artificial
intelligence. It involves training a model using labeled data, where each input comes with a
corresponding correct output. The process is like a teacher guiding a student—hence the term
"supervised" learning.

 Supervised learning is a type of machine learning where a model is trained on labeled data, meaning each input is paired with the correct output.
 The model learns by comparing its predictions with the actual answers provided in the training data. Over time, it adjusts itself to minimize errors and improve accuracy.
 The goal of supervised learning is to make accurate predictions when given new, unseen data. For example, if a model is trained to recognize handwritten digits, it will use what it learned to correctly identify new numbers it hasn't seen before.
 Supervised learning can be applied in various forms, including classification and regression, making it a crucial technique in artificial intelligence and supervised data mining.
 A fundamental concept in supervised machine learning is learning a class from examples. This involves providing the model with examples where the correct label is known, such as learning to classify images of cats and dogs by being shown labeled examples of both. The model then learns the distinguishing features of each class and applies this knowledge to classify new images.
How Does Supervised Machine Learning Work?

A supervised learning algorithm works on input features and their corresponding output labels. The process works through:

 Training Data: The model is provided with a training dataset that includes input data
(features) and corresponding output data (labels or target variables).

 Learning Process: The algorithm processes the training data, learning the relationships
between the input features and the output labels. This is achieved by adjusting the
model's parameters to minimize the difference between its predictions and the actual
labels.

 Evaluation and Optimization: After training, the model is evaluated using a test dataset to measure its accuracy and performance. The model's performance is then optimized by adjusting parameters and using techniques like cross-validation to balance bias and variance. This ensures the model generalizes well to new, unseen data.
 In short, a supervised machine learning model is trained on a dataset to learn a mapping function between input and output; the learned function is then used to make predictions on new data:

 Training phase involves feeding the algorithm labeled data, where each data point is
paired with its correct output. The algorithm learns to identify patterns and
relationships between the input and output data.

 Testing phase involves feeding the algorithm new, unseen data and evaluating its ability
to predict the correct output based on the learned patterns.

Types of Supervised Learning in Machine Learning


Supervised learning can be applied to two main types of problems:

 Classification: Where the output is a categorical variable (e.g., spam vs. non-spam
emails, yes vs. no).

 Regression: Where the output is a continuous variable (e.g., predicting house prices,
stock prices).
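Both problem types follow the same train/predict workflow in practice. As a quick illustration, here is a minimal sketch using scikit-learn (the datasets and models chosen here are illustrative assumptions, not prescribed by these notes):

from sklearn.datasets import load_iris, load_diabetes
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.model_selection import train_test_split

# Classification: the output is a categorical variable (iris species)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Classification accuracy:", clf.score(X_test, y_test))

# Regression: the output is a continuous variable (disease progression score)
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
reg = LinearRegression().fit(X_train, y_train)
print("Regression R^2:", reg.score(X_test, y_test))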

Regression:

Regression determines how dependent and independent variables are related to each other.

Regression algorithms therefore aid in predicting continuous variables such as real estate values, economic trends, climatic patterns, oil and gas prices (a crucial task in today's world!), etc.

The regression procedure aims to identify the mapping function that maps the input variable "x" to the continuous output variable "y".

Classification:

On the other hand, classification is an algorithm that identifies functions that help categorize the dataset based on different factors. When employing a classification algorithm, the software learns from the training dataset and then divides the data into several groups based on what it has discovered.

Classification algorithms find the mapping function that converts the "x" input into a discrete "y" output. Based on a given set of independent variables, the algorithms estimate discrete values (often binary values such as 0 and 1, yes and no, or true and false). Put more simply, classification algorithms determine the likelihood that an event will occur by fitting data to a logistic function.

For example, a classification model with a decision boundary can determine whether an email is spam or not, classify images as "cat" or "dog," or predict weather conditions like "sunny," "rainy," or "cloudy," while regression models are used to predict house prices based on features like size and location, or to forecast stock prices over time with a straight fit line.
Supervised Machine Learning Algorithms

Supervised learning can be further divided into several different types, each with its own
unique characteristics and applications. Here are some of the most common types of
supervised learning algorithms:

 Linear Regression: Linear regression is a type of supervised learning regression


algorithm that is used to predict a continuous output value. It is one of the simplest and
most widely used algorithms in supervised learning.

 Logistic Regression : Logistic regression is a type of supervised learning classification


algorithm that is used to predict a binary output variable.

 Decision Trees : Decision tree is a tree-like structure that is used to model decisions and
their possible consequences. Each internal node in the tree represents a decision, while
each leaf node represents a possible outcome.

 Random Forests : Random forests again are made up of multiple decision trees that
work together to make predictions. Each tree in the forest is trained on a different
subset of the input features and data. The final prediction is made by aggregating the
predictions of all the trees in the forest.

 Support Vector Machine(SVM) : The SVM algorithm creates a hyperplane to segregate


n-dimensional space into classes and identify the correct category of new data points.
The extreme cases that help create the hyperplane are called support vectors, hence the
name Support Vector Machine.

 K-Nearest Neighbors (KNN) : KNN works by finding the k training examples closest to a given input and then predicts the class or value based on the majority class or average value of these neighbors. The performance of KNN can be influenced by the choice of k and the distance metric used to measure proximity.
 Gradient Boosting : Gradient Boosting combines weak learners, like decision trees, to
create a strong model. It iteratively builds new models that correct errors made by
previous ones.

 Naive Bayes Algorithm: The Naive Bayes algorithm is a supervised machine learning
algorithm based on applying Bayes' Theorem with the “naive” assumption that features
are independent of each other given the class label.
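A convenient property of these algorithms in scikit-learn is that they all share the same fit/predict estimator interface, so they can be swapped freely. A minimal sketch (the dataset and hyperparameters are illustrative assumptions):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

models = {
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(),
    "Gradient Boosting": GradientBoostingClassifier(),
    "Naive Bayes": GaussianNB(),
    "K-Nearest Neighbors": KNeighborsClassifier(n_neighbors=5),
}
for name, model in models.items():
    model.fit(X_train, y_train)      # identical interface for every algorithm
    print(name, "accuracy:", model.score(X_test, y_test))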

Support Vector Machine (SVM) Algorithm

Support Vector Machine (SVM) is a supervised machine learning algorithm used for classification and regression tasks. It finds the optimal boundary (hyperplane) to separate classes, ensuring maximum margin, and is widely applied in fields like image recognition, text classification, and bioinformatics due to its efficiency in handling high-dimensional data.

In this section, we will start from the basics of SVM, gradually diving into its working principles, mathematical foundation, types, real-world applications, and implementation with examples.

What is Support Vector Machine (SVM)?

SVM is a classification algorithm that finds the best boundary (hyperplane) to separate different
classes in a dataset. It works by identifying key data points, called support vectors, that
influence the position of this boundary, ensuring maximum separation between categories.

For example, if we have a dataset of emails labeled as spam and not spam, SVM will create a
decision boundary that best separates these two groups based on features like word frequency
and message length.
What SVM Does:

Support Vector Machines (SVM) find the optimal hyperplane that maximizes the margin
between two classes. The margin is the distance between the decision boundary and the
closest data points from each class, which are called support vectors.

In a typical visualization of this setup:

 The solid line represents the decision boundary

 The dashed lines represent the margins

 The highlighted points are the support vectors

 Differently colored dots represent the two classes

SVM aims to find the hyperplane that has the maximum margin, as this typically leads to better
generalization on unseen data.

How Does SVM Classify Data?

SVM follows these steps to classify data:

1. Data Representation: Each data point is represented as a vector in an n-dimensional


space.

2. Hyperplane Selection: SVM finds the optimal hyperplane that separates the data into
distinct classes.

3. Maximizing the Margin: It ensures that the separation margin between classes is as
wide as possible to improve generalization.

4. Using Support Vectors: The closest points to the hyperplane (support vectors) influence
its placement.
5. Kernel Trick (for Non-Linearity): If the data is not linearly separable, SVM transforms it
into a higher dimension using kernel functions.

Mathematical Foundation of SVM Algorithm

For two-class classification, given a dataset (xᵢ, yᵢ), where xᵢ are feature vectors and yᵢ are class labels (either +1 or -1), SVM aims to minimize the function:

min_{w,b} (1/2) ||w||²

Subject to:

yᵢ (w · xᵢ + b) ≥ 1 for all i

where:

 w represents the weight vector,

 b is the bias term,

 minimizing ||w||² maximizes the margin, since the margin width is 2/||w||.

In cases where data points are not entirely separable, slack variables ξᵢ are introduced, modifying the objective as follows:

min_{w,b,ξ} (1/2) ||w||² + C Σᵢ ξᵢ

Here, C controls the trade-off between maximizing the margin and minimizing classification errors.

Types of SVM
SVM can be classified into different types based on the nature of the dataset:

1. Linear SVM: Used when data can be separated using a straight hyperplane. It is suitable
for datasets where classes are linearly separable. The decision boundary is a straight line
(in 2D) or a plane (in higher dimensions).

2. Non-Linear SVM: When data is not linearly separable, kernel functions are used to map
data into a higher-dimensional space where separation is possible. Kernel functions like
RBF and polynomials are commonly used in non-linear SVM.

3. Support Vector Regression (SVR): A variation of SVM used for regression problems
instead of classification. It works similarly to classification SVM but tries to fit a function
within a margin of tolerance rather than finding a strict boundary between categories.

4. Hard Margin SVM: Assumes that the data is perfectly separable and aims to find a
hyperplane that classifies all data points correctly with no tolerance for misclassification.
This works well when there is a clear distinction between classes.

5. Soft Margin SVM: Introduces slack variables to handle cases where classes may overlap
slightly. It allows some misclassifications to improve generalization and prevent
overfitting.
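In scikit-learn, the hard/soft margin trade-off from points 4 and 5 is controlled by the C parameter of SVC: a very large C approximates a hard margin, while a small C tolerates more misclassifications. A minimal sketch (the specific C values and dataset are illustrative assumptions):

from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two slightly overlapping clusters
X, y = make_blobs(n_samples=100, centers=2, cluster_std=1.2, random_state=0)

for C in [0.01, 1.0, 1000.0]:  # small C = softer margin, large C = closer to hard margin
    clf = SVC(kernel='linear', C=C).fit(X, y)
    print(f"C={C}: {len(clf.support_)} support vectors, "
          f"training accuracy={clf.score(X, y):.2f}")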

The Kernel Trick: Handling Non-Linearity

Real-world data is often non-linearly separable. SVM uses kernel functions to transform data
into a higher-dimensional space where it becomes separable. Some popular kernels include:

 Linear Kernel: K(x, y) = x · y (used in linear support vector machines)

 Polynomial Kernel: K(x, y) = (x · y + c)^d

 Radial Basis Function (RBF) Kernel: K(x, y) = exp(−γ ||x − y||²)
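These kernels map directly onto the kernel parameter of sklearn.svm.SVC. A minimal sketch comparing them on data that is not linearly separable (the make_circles dataset is an illustrative choice):

from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Concentric circles: no straight line can separate the classes
X, y = make_circles(n_samples=200, factor=0.4, noise=0.1, random_state=0)

for kernel in ['linear', 'poly', 'rbf']:
    clf = SVC(kernel=kernel, degree=3, gamma='scale').fit(X, y)
    print(kernel, "training accuracy:", round(clf.score(X, y), 2))

Expect the RBF kernel to fit this data far better than the linear kernel, since the mapping into a higher-dimensional space makes the circular boundary separable.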

Real-World Applications of SVM


SVM has practical use cases across different domains:

 Spam Detection: Classifying emails as spam or not spam.

 Image Classification: Recognizing objects, faces, or handwritten digits.

 Sentiment Analysis: Determining if a review is positive or negative.

 Bioinformatics: Identifying diseases based on genetic data.

 Financial Fraud Detection: Identifying unusual transaction patterns.


Support Vector Machine Example: Solving a Classification Problem

Implementation:

For implementing SVM in Python, we will start with the standard library imports as follows −

import numpy as np

import matplotlib.pyplot as plt

from scipy import stats

import seaborn as sns; sns.set()

 Next, we create a sample dataset having linearly separable data, using sklearn.datasets.make_blobs, for classification with SVM −
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=100, centers=2, random_state=0, cluster_std=0.50)

plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='summer');

The following would be the output after generating a sample dataset having 100 samples and 2 clusters −

We know that SVM supports discriminative classification: it divides the classes from each other by finding a line in the case of two dimensions, or a manifold in the case of multiple dimensions. It is implemented on the above dataset as follows −

xfit = np.linspace(-1, 3.5)

plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='summer')

plt.plot([0.6], [2.1], 'x', color='black', markeredgewidth=4, markersize=12)

for m, b in [(1, 0.65), (0.5, 1.6), (-0.2, 2.9)]:
    plt.plot(xfit, m * xfit + b, '-k')

plt.xlim(-1, 3.5);

The output is as follows −


We can see from the above output that there are three different separators that perfectly
discriminate the above samples.

As discussed, the main goal of SVM is to divide the dataset into classes by finding a maximum marginal hyperplane (MMH). Hence, rather than simply drawing a zero-width line between the classes, we can draw around each candidate line a margin of some width, up to the nearest point. It can be done as follows −

xfit = np.linspace(-1, 3.5)

plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='summer')

for m, b, d in [(1, 0.65, 0.33), (0.5, 1.6, 0.55), (-0.2, 2.9, 0.2)]:
    yfit = m * xfit + b
    plt.plot(xfit, yfit, '-k')
    plt.fill_between(xfit, yfit - d, yfit + d, edgecolor='none',
                     color='#AAAAAA', alpha=0.4)

plt.xlim(-1, 3.5);
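The walkthrough above stops at drawing candidate lines and margins by hand. To actually find the maximum marginal hyperplane, we can fit scikit-learn's SVC on the same kind of data (a minimal sketch; the very large C is an assumption used to approximate a hard margin):

from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=0, cluster_std=0.50)

model = SVC(kernel='linear', C=1E10)  # very large C approximates a hard margin
model.fit(X, y)

# The support vectors are the points that pin down the optimal hyperplane
print("Support vectors:\n", model.support_vectors_)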
Random Forest Algorithm:

Random Forest is a machine learning algorithm that uses an ensemble of decision trees to make
predictions. The algorithm was first introduced by Leo Breiman in 2001. The key idea behind the
algorithm is to create a large number of decision trees, each of which is trained on a different
subset of the data. The predictions of these individual trees are then combined to produce a
final prediction.

Working of Random Forest Algorithm

We can understand the working of Random Forest algorithm with the help of following steps −

 Step 1 − First, start with the selection of random samples from a given dataset.

 Step 2 − Next, this algorithm will construct a decision tree for every sample. Then it will
get the prediction result from every decision tree.

 Step 3 − In this step, voting will be performed for every predicted result.

 Step 4 − At last, select the most voted prediction result as the final prediction result.

Random Forest is a flexible algorithm that can be used for both classification and regression
tasks. In classification tasks, the algorithm uses the mode of the predictions of the individual
trees to make the final prediction. In regression tasks, the algorithm uses the mean of the
predictions of the individual trees.
Advantages of Random Forest Algorithm

Random Forest algorithm has several advantages over other machine learning algorithms. Some
of the key advantages are −

 Robustness to Overfitting − Random Forest algorithm is known for its robustness to


overfitting. This is because the algorithm uses an ensemble of decision trees, which
helps to reduce the impact of outliers and noise in the data.

 High Accuracy − Random Forest algorithm is known for its high accuracy. This is because
the algorithm combines the predictions of multiple decision trees, which helps to
reduce the impact of individual decision trees that may be biased or inaccurate.

 Handles Missing Data − Random Forest algorithm can handle missing data without the
need for imputation. This is because the algorithm only considers the features that are
available for each data point and does not require all features to be present for all data
points.

 Non-Linear Relationships − Random Forest algorithm can handle non-linear relationships


between the features and the target variable. This is because the algorithm uses
decision trees, which can model non-linear relationships.

 Feature Importance − Random Forest algorithm can provide information about the
importance of each feature in the model. This information can be used to identify the
most important features in the data and can be used for feature selection and feature
engineering.

Implementation of Random Forest Algorithm in Python

Let's take a look at the implementation of Random Forest Algorithm in Python. We will be using
the scikit-learn library to implement the algorithm. The scikit-learn library is a popular machine
learning library that provides a wide range of algorithms and tools for machine learning.

Step 1 − Importing the Libraries

We will begin by importing the necessary libraries. We will be using the pandas library for data
manipulation, and the scikit-learn library for implementing the Random Forest algorithm.

import pandas as pd

from sklearn.ensemble import RandomForestClassifier

Step 2 − Loading the Data

Next, we will load the data into a pandas dataframe. For this tutorial, we will be using the
famous Iris dataset, which is a classic dataset for classification tasks.

# Loading the iris dataset


iris = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data', header=None)

iris.columns = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']

Step 3 − Data Preprocessing

Before we can use the data to train our model, we need to preprocess it. This involves
separating the features and the target variable and splitting the data into training and testing
sets.

# Separating the features and target variable

X = iris.iloc[:, :-1]

y = iris.iloc[:, -1]

# Splitting the data into training and testing sets

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.35, random_state=42)

Step 4 − Training the Model

Next, we will train our Random Forest classifier on the training data.

# Creating the Random Forest classifier object

rfc = RandomForestClassifier(n_estimators=100)

# Training the model on the training data

rfc.fit(X_train, y_train)

Step 5 − Making Predictions

Once we have trained our model, we can use it to make predictions on the test data.

# Making predictions on the test data

y_pred = rfc.predict(X_test)

Step 6 − Evaluating the Model


Finally, we will evaluate the performance of our model using various metrics such as accuracy,
precision, recall, and F1-score.

# Importing the metrics library

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Calculating the accuracy, precision, recall, and F1-score

accuracy = accuracy_score(y_test, y_pred)

precision = precision_score(y_test, y_pred, average='weighted')

recall = recall_score(y_test, y_pred, average='weighted')

f1 = f1_score(y_test, y_pred, average='weighted')

print("Accuracy:", accuracy)

print("Precision:", precision)

print("Recall:", recall)

print("F1-score:", f1)

Complete Implementation Example

Below is the complete implementation example of Random Forest Algorithm in python using
the iris dataset −

import pandas as pd

from sklearn.ensemble import RandomForestClassifier

# Loading the iris dataset

iris = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data', header=None)

iris.columns = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']

# Separating the features and target variable


X = iris.iloc[:, :-1]

y = iris.iloc[:, -1]

# Splitting the data into training and testing sets

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.35, random_state=42)

# Creating the Random Forest classifier object

rfc = RandomForestClassifier(n_estimators=100)

# Training the model on the training data

rfc.fit(X_train, y_train)

# Making predictions on the test data

y_pred = rfc.predict(X_test)

# Importing the metrics library

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Calculating the accuracy, precision, recall, and F1-score

accuracy = accuracy_score(y_test, y_pred)

precision = precision_score(y_test, y_pred, average='weighted')

recall = recall_score(y_test, y_pred, average='weighted')

f1 = f1_score(y_test, y_pred, average='weighted')

print("Accuracy:", accuracy)

print("Precision:", precision)
print("Recall:", recall)

print("F1-score:", f1)

Output

This will give us the performance metrics of our Random Forest classifier as follows −

Accuracy: 0.9811320754716981

Precision: 0.9821802935010483

Recall: 0.9811320754716981

F1-score: 0.9811157396063056
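One advantage listed earlier, feature importance, can be read directly off the trained classifier. A short sketch (this assumes the script above has already run, so rfc and the DataFrame X are in scope):

# Importance scores sum to 1; higher means the feature contributed more to the splits
for name, score in zip(X.columns, rfc.feature_importances_):
    print(f"{name}: {score:.3f}")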

Pros and Cons of Random Forest

Pros

The following are the advantages of Random Forest algorithm −

 It overcomes the problem of overfitting by averaging or combining the results of different decision trees.

 Random forests work well for a larger range of data items than a single decision tree does.

 Random forest has less variance than a single decision tree.

 Random forests are very flexible and possess very high accuracy.

 Scaling of data is not required in the random forest algorithm. It maintains good accuracy even when given data without scaling.

Cons

The following are the disadvantages of Random Forest algorithm −

 Complexity is the main disadvantage of Random Forest algorithms.

 Construction of random forests is much harder and more time-consuming than that of decision trees.

 More computational resources are required to implement the Random Forest algorithm.

 It is less intuitive when we have a large collection of decision trees.

 The prediction process using random forests is very time-consuming in comparison with other algorithms.
Regression Analysis:

Regression analysis is a fundamental concept in machine learning and is used in many applications such as forecasting, predictive analytics, etc.

In machine learning, regression is a type of supervised learning. The key objective of regression-based tasks is to predict output labels or responses, which are continuous numeric values, for the given input data. The output will be based on what the model has learned in the training phase.

Regression models use the input data features (independent variables) and
their corresponding continuous numeric output values (dependent or
outcome variables) to learn specific associations between inputs and
corresponding outputs.

Regression:

Regression in machine learning is a form of supervised learning. Basically, regression is a statistical technique that finds a relationship between dependent and independent variables. To implement regression in machine learning, a regression algorithm is trained with a labeled dataset. The dataset contains features (independent variables) and target values (dependent variable).

During the training phase, the regression algorithm learns the relation
between independent variables (predictors) and dependent variables
(target).

The regression models predict new values based on the learned relation
between predictors and targets during the training.

Regression in machine learning refers to a supervised learning technique where the goal is to predict a continuous numerical value based on one or more independent features. It finds relationships between variables so that predictions can be made. We have two types of variables present in regression:
Dependent Variable (Target): The variable we are trying to predict, e.g., house price.

Independent Variables (Features): The input variables that influence the prediction, e.g., locality, number of rooms.

A regression analysis problem arises when the output variable is a real or continuous value such as "salary" or "weight". Many different regression models can be used, but the simplest among them is linear regression.

Types of Regression

Regression can be classified into different types based on the number of predictor variables and
the nature of the relationship between variables:

1. Simple Linear Regression

Linear regression is one of the simplest and most widely used statistical models. It assumes that there is a linear relationship between the independent and dependent variables, meaning that the change in the dependent variable is proportional to the change in the independent variables. For example, predicting the price of a house based on its size.

2. Multiple Linear Regression

Multiple linear regression extends simple linear regression by using multiple independent variables to predict the target variable. For example, predicting the price of a house based on multiple features such as size, location, number of rooms, etc.

3. Polynomial Regression

Polynomial regression is used to model non-linear relationships between the dependent variable and the independent variables. It adds polynomial terms to the linear regression model to capture more complex relationships. For example, when we want to predict a non-linear trend like population growth over time, we use polynomial regression.

4. Ridge & Lasso Regression

Ridge and lasso regression are regularized versions of linear regression that help avoid overfitting by penalizing large coefficients. When there's a risk of overfitting due to too many features, we use these types of regression algorithms.
5. Support Vector Regression (SVR)

SVR is a type of regression algorithm based on the Support Vector Machine (SVM) algorithm. SVM is primarily used for classification tasks, but it can also be used for regression. Consistent with the description given earlier, SVR tries to fit a function that keeps predictions within a margin of tolerance around the actual values rather than finding a strict boundary between categories.

6. Decision Tree Regression

Decision tree regression uses a tree-like structure to make decisions, where each branch of the tree represents a decision and the leaves represent outcomes. For example, when predicting customer behavior based on features like age, income, etc., we use decision tree regression.

7. Random Forest Regression

Random Forest is an ensemble method that builds multiple decision trees, where each tree is trained on a different subset of the training data. The final prediction is made by averaging the predictions of all of the trees. For example, we can predict customer churn or forecast sales using this method.

Regression Evaluation Metrics

Evaluation in machine learning measures the performance of a model. Here are some popular
evaluation metrics for regression:

Mean Absolute Error (MAE): The average absolute difference between the predicted and actual
values of the target variable.

Mean Squared Error (MSE): The average squared difference between the predicted and actual
values of the target variable.

Root Mean Squared Error (RMSE): Square root of the mean squared error.

Huber Loss: A hybrid loss function that transitions from MAE to MSE for larger errors, providing
balance between robustness and MSE’s sensitivity to outliers.

R² Score: The proportion of variance in the target explained by the model; values closer to 1 indicate a better fit.
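Except for Huber loss (which scikit-learn uses as a training loss in some estimators rather than as an evaluation metric), all of these are available in sklearn.metrics. A minimal sketch on hand-made numbers (the values are illustrative):

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.8, 5.4, 2.0, 6.5])

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)  # RMSE is simply the square root of MSE
r2 = r2_score(y_true, y_pred)
print(f"MAE={mae:.3f}  MSE={mse:.3f}  RMSE={rmse:.3f}  R2={r2:.3f}")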

Regression Model Machine Learning

Let's take an example of linear regression. We have a Housing dataset and we want to predict the price of a house. Following is the Python code for it.
import matplotlib

matplotlib.use('TkAgg') # General backend for plots

import matplotlib.pyplot as plt

import numpy as np

from sklearn import datasets, linear_model

import pandas as pd

# Load dataset

df = pd.read_csv("Housing.csv")

# Extract features and target variable

Y = df['price']

X = df['lotsize']

# Reshape for compatibility with scikit-learn

X = X.to_numpy().reshape(len(X), 1)

Y = Y.to_numpy().reshape(len(Y), 1)

# Split data into training and testing sets

X_train = X[:-250]

X_test = X[-250:]

Y_train = Y[:-250]

Y_test = Y[-250:]

# Plot the test data

plt.scatter(X_test, Y_test, color='black')

plt.title('Test Data')

plt.xlabel('Size')

plt.ylabel('Price')
plt.xticks(())

plt.yticks(())

# Train linear regression model

regr = linear_model.LinearRegression()

regr.fit(X_train, Y_train)

# Plot predictions

plt.plot(X_test, regr.predict(X_test), color='red', linewidth=3)

plt.show()

Output:

Here in this graph we plot the test data. The red line indicates the best fit line for predicting the
price.

To make an individual prediction using the linear regression model:

print("Predicted price for a lot size of 5000: " + str(round(regr.predict([[5000]])[0][0])))

Applications of Regression:

Predicting prices: Used to predict the price of a house based on its size, location and other
features.

Forecasting trends: Model to forecast the sales of a product based on historical sales data.
Identifying risk factors: Used to identify risk factors for heart patients based on patient medical data.

Making decisions: It could be used to recommend which stock to buy based on market data.

Advantages of Regression:

Easy to understand and interpret.

Computationally efficient.

Can handle linear relationships easily.

Disadvantages of Regression:

Assumes linearity.

Sensitive to situation where two or more independent variables are highly correlated with each
other i.e multicollinearity.

May not be suitable for highly complex relationships.

Linear regression is a statistical technique that estimates the linear relationship between a
dependent and one or more independent variables. In machine learning, linear regression is
implemented as a supervised learning approach. In machine learning, labeled datasets contain
input data (features) and output labels (target values). For linear regression in machine
learning, we represent features as independent variables and target values as the dependent
variable.

For simplicity, take the following data (single feature and single target):

Square Feet (X)    House Price (Y)
1300               240
1500               320
1700               330
1830               295
1550               256
2350               409
1450               319

In the above data, the target House Price is the dependent variable represented by Y, and the feature, Square Feet, is the independent variable represented by X. The input features (X) are used to predict the target label (Y). So, the independent variables are also known as predictor variables, and the dependent variable is known as the response variable.

So let's define linear regression in machine learning as follows:

In machine learning, linear regression uses a linear equation to model the relationship between a dependent variable (Y) and one or more independent variables (X).

The main goal of the linear regression model is to find the best-fitting straight line (often called
a regression line) through a set of data points.

Line of Regression

A straight line that shows a relation between the dependent variable and independent variables
is known as the line of regression or regression line.

ML Regression Line

Furthermore, the linear relationship can be positive or negative in nature as explained below −
1. Positive Linear Relationship

A linear relationship is called positive if the dependent variable increases as the independent variable increases. It can be understood with the help of the following graph −

Positive Linear Relationship

2. Negative Linear Relationship

A linear relationship is called negative if the dependent variable decreases as the independent variable increases. It can be understood with the help of the following graph −

Negative Linear Relationship

Types of Linear Regression:


Linear regression is of the following two types −

Simple Linear Regression

Multiple Linear Regression

1. Simple Linear Regression

Simple linear regression is a type of regression analysis in which a single independent variable
(also known as a predictor variable) is used to predict the dependent variable. In other words, it
models the linear relationship between the dependent variable and a single independent
variable.

Simple Linear Regression

In the above image, the straight line represents the simple linear regression line where Ŷ
is the predicted value, and X is the input value.

Mathematically, the relationship can be modeled as a linear equation −

Y=w0+w1X+ϵ

Where

Y is the dependent variable (target).

X is the independent variable (feature).


w0 is the y-intercept of the line.

w1 is the slope of the line, representing the effect of X on Y.

ε is the error term, capturing the variability in Y not explained by X.
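As a worked check, the least-squares estimates for this equation are w1 = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and w0 = ȳ − w1·x̄. A minimal numpy sketch computing them from the square-feet table above:

import numpy as np

X = np.array([1300, 1500, 1700, 1830, 1550, 2350, 1450])  # square feet
Y = np.array([240, 320, 330, 295, 256, 409, 319])          # house price

w1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
w0 = Y.mean() - w1 * X.mean()
print(f"Fitted line: Y = {w0:.2f} + {w1:.4f} * X")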

2. Multiple Linear Regression

Multiple linear regression is basically the extension of simple linear regression: it predicts a response using two or more features. When dealing with more than one independent variable, we extend simple linear regression to multiple linear regression. The model is expressed as:

Y=w0+w1X1+w2X2+⋯+wpXp+ϵ

Where

X1, X2, ..., Xp are the independent variables (features).

w0, w1, ..., wp are the coefficients for these variables.

ε is the error term.
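A minimal sketch of multiple linear regression with scikit-learn (the synthetic data and true coefficients below are illustrative assumptions, chosen so we can check that the model recovers them):

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 3))  # three features: X1, X2, X3
y = 5 + 2.0*X[:, 0] - 1.5*X[:, 1] + 0.5*X[:, 2] + rng.normal(0, 0.5, size=100)

model = LinearRegression().fit(X, y)
print("w0 (intercept):", model.intercept_)    # should be close to 5
print("w1..w3 (coefficients):", model.coef_)  # should be close to [2.0, -1.5, 0.5]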

Polynomial Regression:

Polynomial regression is a type of regression analysis in which the relationship between the independent variable and the dependent variable is modeled as an n-th degree polynomial function. Polynomial regression allows a more complex relationship between the variables to be captured, beyond the linear relationship of simple and multiple linear regression.

In machine learning (ML) and data science, choosing between a linear regression or polynomial
regression depends upon the characteristics of the dataset. A non-linear dataset can't be fitted
with a linear regression. If we apply linear regression to a nonlinear dataset, it will not be able
to capture the non-linear patterns in the data.
Equation of Polynomial Regression Model

In machine learning, the general formula for polynomial regression of degree n is as follows −

y = w0 + w1x + w2x² + w3x³ + … + wnxⁿ + ϵ

Where

 y is the dependent variable (output).

 x is the independent variable (input).

 w0,w1,w2,…,wn are the coefficients (parameters) of the model.

 n is the degree of the polynomial (the highest power of x).

 ϵ is the error term or residual, representing the difference between the observed value
and the model's prediction.

For a quadratic (second-degree) polynomial regression, the formula would be:

y = w0 + w1x + w2x² + ϵ

This would fit a parabolic curve to the data points.
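In scikit-learn, polynomial regression is usually implemented by expanding the input with PolynomialFeatures and then fitting ordinary linear regression on the expanded features. A minimal quadratic sketch (the synthetic parabolic data is an illustrative assumption):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 50).reshape(-1, 1)
y = 1 + 2*x.ravel() + 3*x.ravel()**2 + rng.normal(0, 1, size=50)  # w0=1, w1=2, w2=3

# degree=2 adds the x^2 column; linear regression then fits w0, w1, w2
model = make_pipeline(PolynomialFeatures(degree=2, include_bias=False), LinearRegression())
model.fit(x, y)
print("Coefficients (w1, w2):", model.named_steps['linearregression'].coef_)
print("Intercept (w0):", model.named_steps['linearregression'].intercept_)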

Introduction to Logistic Regression:

Logistic regression is a supervised learning classification algorithm used to predict the


probability of a target variable. The nature of target or dependent variable is dichotomous,
which means there would be only two possible classes.

In simple words, the dependent variable is binary in nature having data coded as either 1
(stands for success/yes) or 0 (stands for failure/no).

Mathematically, a logistic regression model predicts P(Y=1) as a function of X. It is one of the simplest ML algorithms and can be used for various classification problems such as spam detection, diabetes prediction, cancer detection, etc.
Types of Logistic Regression

Generally, logistic regression means binary logistic regression having binary target variables, but
there can be two more categories of target variables that can be predicted by it. Based on those
number of categories, Logistic regression can be divided into following types −

Binary or Binomial:

In such a kind of classification, the dependent variable will have only two possible types, either 1 or 0. For example, these variables may represent success or failure, yes or no, win or loss, etc.

Multinomial:

In such a kind of classification, dependent variable can have 3 or more possible unordered types
or the types having no quantitative significance. For example, these variables may represent
"Type A" or "Type B" or "Type C".

Ordinal:

In such a kind of classification, the dependent variable can have 3 or more possible ordered types, or types having quantitative significance. For example, these variables may represent "poor", "good", "very good", and "excellent", and each category can have a score like 0, 1, 2, or 3.

Binary Logistic Regression Model

The simplest form of logistic regression is binary or binomial logistic regression in which the
target or dependent variable can have only 2 possible types either 1 or 0. It allows us to model
a relationship between multiple predictor variables and a binary/binomial target variable. In
case of logistic regression, the linear function is basically used as an input to another function
such as in the following relation −

hθ(x) = g(θᵀx), where 0 ≤ hθ(x) ≤ 1

Here, g is the logistic or sigmoid function, which can be given as follows −

g(z) = 1 / (1 + e^(−z)), where z = θᵀx

The sigmoid curve can be represented with the help of the following graph. We can see the values of the y-axis lie between 0 and 1, and the curve crosses the axis at 0.5.

The classes can be divided into positive or negative. The output is interpreted as the probability of the positive class, since it lies between 0 and 1. For our implementation, we interpret the output of the hypothesis function as positive if it is ≥ 0.5, otherwise negative.

We also need to define a loss function to measure how well the algorithm performs using the
weights on functions, represented by theta as follows −

J(θ) = (1/m) · (−yᵀ log(h) − (1 − y)ᵀ log(1 − h))

Now, after defining the loss function our prime goal is to minimize the loss function. It can be
done with the help of fitting the weights which means by increasing or decreasing the weights.
With the help of derivatives of the loss function w.r.t each weight, we would be able to know
what parameters should have high weight and what should have smaller weight.

The following gradient descent equation tells us how the loss would change if we modified the parameters −

∂J(θ)/∂θ = (1/m) · Xᵀ (g(Xθ) − y)

Implementation of Binary Logistic Regression Model in Python:

Now we will implement the above concept of binomial logistic regression in Python. For this purpose, we are using a multivariate flower dataset named iris, which has 3 classes of 50 instances each, but we will be using only the first two feature columns. Every class represents a type of iris flower.

First, we need to import the necessary libraries as follows −

import numpy as np

import matplotlib.pyplot as plt

import seaborn as sns

from sklearn import datasets

Next, load the iris dataset as follows −


iris = datasets.load_iris()

X = iris.data[:, :2]

y = (iris.target != 0) * 1

We can plot our training data as follows −

plt.figure(figsize=(6, 6))

plt.scatter(X[y == 0][:, 0], X[y == 0][:, 1], color='g', label='0')

plt.scatter(X[y == 1][:, 0], X[y == 1][:, 1], color='y', label='1')

plt.legend();

Next, we will define the sigmoid function, loss function and gradient descent as follows −

class LogisticRegression:
    def __init__(self, lr=0.01, num_iter=100000, fit_intercept=True, verbose=False):
        self.lr = lr
        self.num_iter = num_iter
        self.fit_intercept = fit_intercept
        self.verbose = verbose

    def __add_intercept(self, X):
        intercept = np.ones((X.shape[0], 1))
        return np.concatenate((intercept, X), axis=1)

    def __sigmoid(self, z):
        return 1 / (1 + np.exp(-z))

    def __loss(self, h, y):
        return (-y * np.log(h) - (1 - y) * np.log(1 - h)).mean()

    def fit(self, X, y):
        if self.fit_intercept:
            X = self.__add_intercept(X)

Now, initialize the weights as follows −

        self.theta = np.zeros(X.shape[1])

        for i in range(self.num_iter):
            z = np.dot(X, self.theta)
            h = self.__sigmoid(z)
            gradient = np.dot(X.T, (h - y)) / y.size
            self.theta -= self.lr * gradient

            z = np.dot(X, self.theta)
            h = self.__sigmoid(z)
            loss = self.__loss(h, y)

            if self.verbose and i % 10000 == 0:
                print(f'loss: {loss} \t')

With the help of the following script, we can predict the output probabilities −

    def predict_prob(self, X):
        if self.fit_intercept:
            X = self.__add_intercept(X)
        return self.__sigmoid(np.dot(X, self.theta))

    def predict(self, X):
        return self.predict_prob(X).round()
Next, we can evaluate the model and plot it as follows −

model = LogisticRegression(lr=0.1, num_iter=300000)

preds = model.predict(X)

print((preds == y).mean())

plt.figure(figsize=(10, 6))

plt.scatter(X[y == 0][:, 0], X[y == 0][:, 1], color='g', label='0')

plt.scatter(X[y == 1][:, 0], X[y == 1][:, 1], color='y', label='1')

plt.legend()

x1_min, x1_max = X[:, 0].min(), X[:, 0].max()

x2_min, x2_max = X[:, 1].min(), X[:, 1].max()

xx1, xx2 = np.meshgrid(np.linspace(x1_min, x1_max), np.linspace(x2_min, x2_max))

grid = np.c_[xx1.ravel(), xx2.ravel()]

probs = model.predict_prob(grid).reshape(xx1.shape)

plt.contour(xx1, xx2, probs, [0.5], linewidths=1, colors='red');


Multinomial Logistic Regression Model

Another useful form of logistic regression is multinomial logistic regression in which the target
or dependent variable can have 3 or more possible unordered types i.e. the types having no
quantitative significance.

Implementation of Multinomial Logistic Regression Model in Python

Now we will implement the above concept of multinomial logistic regression in Python. For this purpose, we are using a dataset from sklearn named digits.

First, we need to import the necessary libraries as follows −

import sklearn

from sklearn import datasets

from sklearn import linear_model

from sklearn import metrics

from sklearn.model_selection import train_test_split

Next, we need to load the digits dataset −

digits = datasets.load_digits()

Now, define the feature matrix (X) and response vector (y) as follows −

X = digits.data

y = digits.target

With the help of the next line of code, we can split X and y into training and testing sets −

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=1)

Now create an object of logistic regression as follows −

digreg = linear_model.LogisticRegression()

Now, we need to train the model by using the training sets as follows −

digreg.fit(X_train, y_train)

Next, make the predictions on the testing set as follows −

y_pred = digreg.predict(X_test)

Next, print the accuracy of the model as follows −

print("Accuracy of Logistic Regression model is:",


metrics.accuracy_score(y_test, y_pred)*100)

Output:

Accuracy of Logistic Regression model is: 95.6884561891516
