Supervised Machine Learning Algorithms
UNIT-III
Types of Machine Learning (ML)
Machine learning algorithms help computer systems learn without being
explicitly programmed.
These algorithms are broadly categorized as supervised or unsupervised.
Supervised machine learning algorithms
These are the most commonly used machine learning algorithms.
They are called supervised because the process of an algorithm learning from the
training dataset can be thought of as a teacher supervising the learning
process.
It can be understood as follows −
Suppose we have an input variable X and an output variable Y, and we apply
an algorithm to learn the mapping function from input to output:
Y = f(X)
The main goal is to approximate the mapping function so well that when we have
new input data (X), we can predict the output variable (Y) for that data.
Supervised learning problems can mainly be divided into the following two kinds of
problems −
Classification − A problem is called a classification problem when the output is
categorical, such as “black”, “teaching”, “non-teaching”, etc.
Regression − A problem is called a regression problem when the output is a real
value, such as “distance”, “kilograms”, etc.
Decision tree, random forest, k-NN, and logistic regression are examples of supervised
machine learning algorithms.
1. Decision Tree
Decision Trees are a class of very powerful Machine Learning models capable of
achieving high accuracy in many tasks while remaining highly interpretable.
What makes decision trees special among ML models is the clarity with which they
represent information.
The “knowledge” learned by a decision tree through training is directly formulated
into a hierarchical structure.
This structure holds and displays the knowledge in such a way that it can easily be
understood, even by non-experts.
The decision tree algorithm falls under the category of supervised learning.
It can be used to solve both regression and classification problems.
A decision tree uses a tree representation to solve the problem: each leaf
node corresponds to a class label, and attributes are tested at the internal nodes of
the tree.
We can represent any Boolean function on discrete attributes using a decision tree.
The most notable decision tree algorithms are ID3, C4.5, and CART.
What is a Decision Tree?
A Decision Tree is a supervised learning algorithm used for both classification and
regression tasks.
It mimics human decision-making by breaking down a dataset into smaller subsets
based on feature values, forming a tree-like structure.
Structure of a Decision Tree-
A Decision Tree consists of:
• Root Node: The starting point that represents the entire dataset.
• Branches: Connections between nodes that show decision paths.
• Internal Nodes: Points where decisions are made based on feature values.
• Leaf Nodes: The final output or classification.
How Does a Decision Tree Work?
1. Splitting: The dataset is split based on feature values to
create pure subsets.
2. Attribute Selection: The best feature for splitting is chosen
using metrics like the following (see the sketch after this list):
1. Information Gain (based on Entropy)
2. Gini Index (used in the CART algorithm)
3. Recursive Partitioning: The process continues until a
stopping criterion is met.
4. Pruning: To prevent overfitting, unnecessary branches are
removed.
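Both attribute-selection metrics can be computed directly from the class counts in a subset. Below is a minimal Python sketch (the function names entropy and gini are illustrative, not from any particular library) showing how a candidate split is scored by information gain:

import math
from collections import Counter

def entropy(labels):
    # Shannon entropy of a list of class labels
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def gini(labels):
    # Gini impurity of a list of class labels
    total = len(labels)
    return 1 - sum((c / total) ** 2 for c in Counter(labels).values())

# Class labels before a split, and the two subsets produced by a candidate split
parent = ["Yes", "Yes", "No", "No", "No"]
left, right = ["Yes", "Yes"], ["No", "No", "No"]

# Information gain = parent entropy minus the weighted entropy of the children
gain = entropy(parent) - (len(left) / len(parent)) * entropy(left) \
                       - (len(right) / len(parent)) * entropy(right)
print(round(entropy(parent), 3), round(gini(parent), 3), round(gain, 3))  # 0.971 0.48 0.971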
Example of a Decision Tree
Imagine predicting whether a customer will buy a product based on income,
age, and previous purchases:
1. Root Node (Income): "Is the person’s income greater than $50,000?"
1. If Yes, proceed to the next question.
2. If No, predict "No Purchase" (leaf node).
2. Internal Node (Age): "Is the person’s age above 30?"
1. If Yes, proceed to the next question.
2. If No, predict "No Purchase" (leaf node).
3. Internal Node (Previous Purchases): "Has the person made previous
purchases?"
1. If Yes, predict "Purchase" (leaf node).
2. If No, predict "No Purchase" (leaf node).
Advantages of Decision Trees
• Easy to interpret and visualize
• Handles both numerical and categorical data
• Requires minimal data preprocessing
• Works well with missing values
Disadvantages of Decision Trees
• Prone to overfitting if not pruned properly
• Can be biased toward dominant features
• Sensitive to noisy data
Example: Predicting Whether a Customer Will Buy a Product
Imagine we want to predict whether a customer will buy a product based on age,
income, and previous purchases.
Step 1: Dataset
Age | Income | Previous Purchases | Buy Product?
25  | Low    | No                 | No
45  | High   | Yes                | Yes
35  | Medium | No                 | No
50  | High   | Yes                | Yes
30  | Low    | Yes                | No
Step 2: Building the Decision Tree
1. Root Node (Income): "Is the person’s income High?"
• If Yes, proceed to the next question.
• If No, predict "No Purchase" (leaf node).
2. Internal Node (Age): "Is the person’s age above 40?"
• If Yes, proceed to the next question.
• If No, predict "No Purchase" (leaf node).
3. Internal Node (Previous Purchases): "Has the person made
previous purchases?"
• If Yes, predict "Purchase" (leaf node).
• If No, predict "No Purchase" (leaf node).
Step 3: Making Predictions
• If a new customer is 45 years old, has High income, and has made previous
purchases, the model predicts "Purchase".
• If a customer is 30 years old, has Low income, and has not made previous
purchases, the model predicts "No Purchase".
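A minimal scikit-learn sketch of this toy example is shown below; the numeric encoding of Income and previous purchases (and the column names) are assumptions made for illustration only.

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Toy dataset from Step 1 (Income encoded as Low=0, Medium=1, High=2; Yes/No as 1/0)
data = pd.DataFrame({
    "Age":       [25, 45, 35, 50, 30],
    "Income":    [0, 2, 1, 2, 0],
    "PrevPurch": [0, 1, 0, 1, 1],
    "Buy":       [0, 1, 0, 1, 0],
})
X, y = data[["Age", "Income", "PrevPurch"]], data["Buy"]

# Fit a small tree; limiting max_depth is a simple way to keep it from overfitting
model = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# New customer: 45 years old, High income, has made previous purchases
print(model.predict(pd.DataFrame([[45, 2, 1]], columns=X.columns)))  # expected: [1] -> "Purchase"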
2. Naive Bayes
Naive Bayes is a probabilistic classification algorithm based on Bayes' theorem,
which calculates the probability of a sample belonging to a particular class given
the probabilities of its features.
It is called "naive" because it assumes that all features are independent of each
other, which is often not the case in real-world scenarios.
It is widely used in machine learning due to its simplicity and efficiency.
Understanding Bayes' Theorem
Bayes' theorem states:
P(A|B) = [ P(B|A) × P(A) ] / P(B)
Where:
• P(A|B) is the posterior probability (the probability of class A given feature B).
• P(B|A) is the likelihood (the probability of feature B given class A).
• P(A) is the prior probability (the initial probability of class A).
• P(B) is the probability of feature B (the evidence).
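As a quick worked illustration (the numbers are invented for the example), suppose 30% of emails are spam, the word "free" appears in 80% of spam emails, and "free" appears in 35% of all emails:

# Hypothetical numbers, chosen only to illustrate Bayes' theorem
p_spam = 0.30              # P(A): prior probability of spam
p_free_given_spam = 0.80   # P(B|A): likelihood of the word "free" given spam
p_free = 0.35              # P(B): overall probability of the word "free"

# Posterior P(spam | "free") = P("free" | spam) * P(spam) / P("free")
p_spam_given_free = p_free_given_spam * p_spam / p_free
print(round(p_spam_given_free, 2))  # ≈ 0.69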
Why is it "Naive"?
• The algorithm assumes that all features are independent, meaning the presence
of one feature does not affect another.
• While this assumption is often unrealistic, Naive Bayes still performs well in
many applications.
Types of Naive Bayes Classifiers
1. Gaussian Naive Bayes: Used when features follow a normal distribution.
2. Multinomial Naive Bayes: Commonly used in text classification (e.g., spam
filtering).
3. Bernoulli Naive Bayes: Suitable for binary feature data (e.g., sentiment
analysis).
Applications
• Spam Filtering: Classifies emails as spam or not.
• Sentiment Analysis: Determines whether a review is positive or negative.
• Medical Diagnosis: Predicts diseases based on symptoms.
• Text Classification: Categorizes documents into predefined classes.
Advantages
• Fast & Efficient: Works well with large datasets.
• Requires Minimal Training Data: Performs well even with small datasets.
• Handles High-Dimensional Data: Useful for text classification.
Limitations
• Feature Independence Assumption: Not always realistic.
• Zero Probability Problem: If a feature never appears in the training data, it gets
assigned zero probability.
• Sensitive to Data Quality: Requires good feature selection for optimal
performance.
Example: Spam Detection
We'll classify messages as spam or not spam using Naive Bayes.
Step 1: Install Required Libraries
pip install scikit-learn pandas
Step 2: Import Libraries
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
Step 3: Prepare Data
Step 4: Convert Text to Numerical Data
Step 5: Split Data into Training & Testing Sets
Step 6: Train Naive Bayes Model
Step 7: Make Predictions
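The bodies of Steps 3–7 are not shown above; the following is a minimal sketch of what they might look like (it continues from the imports in Step 2; the sample messages and variable names are assumptions for illustration).

# Step 3: Prepare data (tiny hand-made sample; 1 = spam, 0 = not spam)
df = pd.DataFrame({
    "message": ["Win a free prize now", "Limited offer, click here",
                "Are we meeting tomorrow?", "Please review the attached report",
                "Free entry in a weekly draw", "Lunch at noon?"],
    "label":   [1, 1, 0, 0, 1, 0],
})

# Step 4: Convert text to numerical word-count features
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(df["message"])
y = df["label"]

# Step 5: Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# Step 6: Train the Naive Bayes model
model = MultinomialNB()
model.fit(X_train, y_train)

# Step 7: Make predictions on new messages
new_msgs = vectorizer.transform(["Free prize waiting for you", "See you at the meeting"])
print(model.predict(new_msgs))  # e.g. [1 0] -> spam, not spam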
This will classify the sample messages as spam or not spam based
on the trained model.
3. Support Vector Machine (SVM)
SVM (Support Vector Machine) is a supervised algorithm, effective for
both regression and classification, though it excels in classification
tasks.
Popular since the 1990s, it performs well on smaller or complex
datasets with minimal tuning.
What is a Support Vector Machine (SVM)?
A Support Vector Machine (SVM) is a machine learning
algorithm used for classification and regression.
It finds the best line (or hyperplane) to separate data into groups,
maximizing the distance between the closest points (support vectors)
of each group.
Types of Support Vector Machine (SVM) Algorithms
• Linear SVM: Linear SVM can be used only when the data is perfectly
linearly separable.
• Perfectly linearly separable means that the data points can be classified
into 2 classes using a single straight line (if 2D).
• Non-Linear SVM: When the data is not linearly separable, we can use
Non-Linear SVM.
• This happens when the data points cannot be separated into two classes
using a straight line (if 2D).
• In such cases, we use advanced techniques like kernel tricks to classify
them.
How Does Support Vector Machine Algorithm Work?
SVM is defined in terms of the support vectors only: we do not
have to worry about other observations, because the margin is
made using the points which are closest to the hyperplane
(the support vectors), whereas in logistic regression the
classifier is defined over all the points.
Let’s understand the working of SVM using an example.
To classify these points, we can have many decision
boundaries, but the question is which is the best and
how do we find it?
NOTE: Since we are plotting the data points in a 2-dimensional graph
we call this decision boundary a straight line but if we have more
dimensions, we call this decision boundary a “hyperplane”
This diagram illustrates:
• Hyperplane: The decision boundary that separates different classes.
• Support Vectors: The closest data points to the hyperplane, which
influence its position.
• Margin: The distance between the hyperplane and the nearest
support vectors.
• A larger margin improves classification accuracy.
Advantages of Support Vector Machine
1.Works well with complex data: SVM is great for datasets
where the separation between categories is not clear. It can
handle both linear and non-linear data effectively.
2.Effective in high-dimensional spaces: SVM performs
well even when there are more features (dimensions) than
samples, making it useful for tasks like text classification or
image recognition.
3.Avoids overfitting: SVM focuses on finding the best decision
boundary (margin) between classes, which helps in reducing the risk
of overfitting, especially in high-dimensional data.
4.Versatile with kernels: By using different kernel functions (like
linear, polynomial, or radial basis function), SVM can adapt to various
types of data and solve complex problems.
5.Robust to outliers: SVM is less affected by outliers because it
focuses on the support vectors (data points closest to the margin),
which helps in creating a more generalized model.
Disadvantages of Support Vector Machine
1.Slow with large datasets: SVM can be computationally
expensive and slow to train, especially when the dataset is very
large.
2.Difficult to tune: Choosing the right kernel and parameters
(like C and gamma) can be tricky and often requires a lot of trial
and error.
3.Not suitable for noisy data: If the dataset has too many
overlapping classes or noise, SVM may struggle to perform well
because it tries to find a perfect separation.
4.Hard to interpret: Unlike some other algorithms, SVM models
are not easy to interpret or explain, especially when using
non-linear kernels.
How SVMs Solve Classification Problems
1. Finding the Best Hyperplane: SVM identifies a hyperplane that maximizes
the margin between different classes.
The larger the margin, the better the generalization.
2. Support Vectors: These are the data points closest to the hyperplane and
play a crucial role in defining its position.
3. Kernel Trick: If data is not linearly separable, SVM uses kernel functions to
map it into a higher-dimensional space where separation becomes possible.
Common kernels include:
• Linear Kernel (for simple separable data)
• Polynomial Kernel (for complex boundaries)
• Radial Basis Function (RBF) Kernel (for intricate patterns)
Real-World Applications
• Spam Detection: Classifying emails as spam or not spam.
• Handwritten Digit Recognition: Identifying numbers from
images.
• Medical Diagnosis: Predicting whether a tumor is benign or
malignant.
• Stock Market Prediction: Classifying stocks based on trends.
Support Vector Machines-
Hyperplane
A hyperplane is a decision boundary which separates between given set of data points
having different class labels.
The SVM classifier separates data points using a hyperplane with the maximum amount
of margin.
This hyperplane is known as the maximum margin hyperplane and the linear classifier it
defines is known as the maximum margin classifier.
Support Vectors
Support vectors are the sample data points that are closest to the hyperplane. These
data points help define the separating line or hyperplane by determining the margin.
Margin
A margin is the separation gap between the two parallel lines drawn through the
closest data points of each class.
It is calculated as the perpendicular distance from the line to the support vectors
(the closest data points).
In SVMs, we try to maximize this separation gap so that we get maximum margin.
The following diagram illustrates these concepts visually (diagram: Margin in SVM).
Linear Support Vector Machine (SVM) is a type of SVM used when data is linearly
separable, meaning it can be divided using a straight line (or hyperplane in higher
dimensions).
How Linear SVM Works
1. Finds the Best Hyperplane: It identifies the optimal boundary that separates
different classes while maximizing the margin between them.
2. Uses Support Vectors: The closest data points to the hyperplane influence its
position and help define the separation.
3. Maximizes Margin: A larger margin improves classification accuracy and
generalization.
Example of Linear SVM
Imagine classifying emails as spam or not spam based on word frequency.
If spam emails contain words like "free," "win," or "offer" more frequently, Linear
SVM can draw a straight boundary separating spam from non-spam emails.
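A minimal scikit-learn sketch of this idea (the example messages are made up for illustration):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# Tiny made-up corpus: 1 = spam, 0 = not spam
messages = ["free offer win now", "win a free prize",
            "project meeting at ten", "see the agenda attached"]
labels = [1, 1, 0, 0]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(messages)   # word-count features

clf = LinearSVC()                        # linear SVM: a straight separating boundary
clf.fit(X, labels)
print(clf.predict(vectorizer.transform(["claim your free offer"])))  # e.g. [1] -> spam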
A Non-Linear Support Vector Machine (SVM) is used when data cannot be separated
by a straight line.
Instead of finding a simple boundary, it uses a technique called the kernel trick to
transform data into a higher-dimensional space where separation becomes
possible.
How Non-Linear SVM Works
1. Data Transformation: If the data is scattered in a way that a straight line cannot
separate it, SVM applies a mathematical function (kernel) to map it into a higher
dimension.
2. Kernel Trick: Instead of manually transforming the data, SVM uses kernel functions
to make separation easier. Common kernels include:
o Radial Basis Function (RBF): Helps separate circular or complex patterns.
o Polynomial Kernel: Useful for curved boundaries.
o Sigmoid Kernel: Mimics neural networks for specific cases.
3. Finding the Best Boundary: Once transformed, SVM finds the optimal hyperplane
that separates different classes.
Example
Imagine classifying red and blue dots arranged in a circular pattern.
A straight line cannot separate them, but using an RBF kernel, SVM
transforms the data into a higher dimension where a clear boundary can
be drawn.
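A minimal sketch of this circular-pattern case, using scikit-learn's make_circles data generator and an RBF-kernel SVM (the dataset and parameter values are illustrative choices):

from sklearn.datasets import make_circles
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

# Two classes arranged as concentric circles: not separable by a straight line
X, y = make_circles(n_samples=300, factor=0.4, noise=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The RBF kernel implicitly maps the points into a higher-dimensional space
clf = SVC(kernel="rbf", gamma="scale").fit(X_train, y_train)
print("Test accuracy:", round(clf.score(X_test, y_test), 3))  # typically close to 1.0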
4. Random Forest
The Random Forest algorithm is a powerful machine learning technique that
combines multiple decision trees to improve accuracy and reduce overfitting.
It works by creating many decision trees, each trained on a random subset of
the data, and then aggregating their predictions through majority voting (for
classification) or averaging (for regression).
Key Features:
• Handles Missing Data: Works even if some data is missing.
• Feature Importance: Identifies the most useful features for predictions.
• Versatile: Used for both classification (predicting categories) and regression
(predicting numerical values).
• Robust to Overfitting: Since it averages multiple trees, it avoids overfitting
better than individual decision trees.
How Random Forest Works-
1. Bootstrapping (Random Sampling): The algorithm selects random
subsets of the training data to build multiple decision trees.
2. Feature Selection: Each tree is trained on a random subset of features,
ensuring diversity among trees.
3. Decision Trees Construction: Each tree learns patterns from its
subset of data and makes predictions.
4. Aggregation of Predictions:
• Classification: The final prediction is determined by majority voting
among all trees.
• Regression: The final prediction is the average of all tree predictions.
Advantages of Random Forest
• Handles Missing Data: Works well even if some data is missing.
• Reduces Overfitting: Since multiple trees are used, it avoids overfitting better
than a single decision tree.
• Feature Importance: Identifies the most influential features in the dataset.
• Works Well with Large Datasets: Can handle high-dimensional data efficiently.
Applications of Random Forest
• Medical Diagnosis: Used to predict diseases based on patient data.
• Fraud Detection: Helps banks and financial institutions detect fraudulent
transactions.
• Stock Market Prediction: Used to analyze trends and predict stock prices.
• Customer Churn Prediction: Businesses use it to identify customers likely to leave.
A typical Random Forest diagram shows:
• Multiple decision trees trained on different parts of the dataset.
• Each tree making its own prediction.
• The final result being determined by majority voting (for
classification) or averaging (for regression).
Random Forest is a machine learning algorithm that builds multiple
decision trees and combines their predictions to improve accuracy.
1. Data Splitting: The algorithm randomly selects different parts of the
dataset to create multiple decision trees.
2. Tree Building: Each tree learns patterns from its subset of data and
makes predictions.
3. Voting/Averaging:
o For classification, the final result is based on majority voting
(the most common prediction among trees).
o For regression, the final result is the average of all tree
predictions.
4. Final Prediction: The combined result from all trees gives a more
accurate and stable prediction.
Step-by-Step Implementation
1. Import Libraries
2. Load Dataset
3. Split Data into Training and Testing Sets
4. Train the Random Forest Model
5. Make Predictions
6. Evaluate Model Performance
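A minimal sketch of these six steps using scikit-learn and its built-in Iris dataset (the dataset choice and parameter values are assumptions made for illustration):

# 1. Import libraries
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# 2. Load dataset
X, y = load_iris(return_X_y=True)

# 3. Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 4. Train the Random Forest model (100 trees, each on a bootstrap sample of the data)
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# 5. Make predictions
y_pred = model.predict(X_test)

# 6. Evaluate model performance
print("Accuracy:", accuracy_score(y_test, y_pred))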
5. Linear Regression
• Linear regression is a statistical regression method which is used for
predictive analysis.
• It is one of the simplest and easiest algorithms; it works on
regression and shows the relationship between continuous
variables.
• It is used for solving regression problems in machine learning.
• Linear regression shows the linear relationship between the
independent variable (X-axis) and the dependent variable (Y-axis),
hence the name linear regression.
Types of Linear Regression
Linear regression is of the following two types −
• Simple linear regression − A linear regression algorithm is called
simple linear regression if it is having only one independent
variable.
• Multiple linear regression − A linear regression algorithm is
called multiple linear regression if it is having more than one
independent variable.
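In equation form, simple linear regression fits Y = b0 + b1·X, while multiple linear
regression fits Y = b0 + b1·X1 + b2·X2 + … + bn·Xn, where b0 is the intercept and
b1…bn are the coefficients learned from the training data.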
Linear regression is mainly used to estimate the real values based on
continuous variable(s). For example, the total sale of a shop in a day, based on
real values, can be estimated by linear regression.
Advantages of Linear Regression
• Interpretable & Simple: Easy to understand and explain
• Efficient: Works well for datasets with linear relationships
• Fast Training: Computationally inexpensive
Python Implementation Example
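A minimal scikit-learn sketch (the daily-sales numbers below are made up for illustration):

import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data: number of customers in a day vs. total sales of the shop
X = np.array([[10], [20], [30], [40], [50]])   # independent variable
y = np.array([120, 230, 310, 410, 500])        # dependent variable (total sales)

model = LinearRegression().fit(X, y)

print("Intercept (b0):", round(model.intercept_, 2))
print("Coefficient (b1):", round(model.coef_[0], 2))
print("Predicted sales for 60 customers:", round(model.predict([[60]])[0], 2))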
6. Logistic Regression
• It is a classification algorithm, also known as logit regression.
• Mainly, logistic regression is a classification algorithm used to estimate discrete
values like 0 or 1, true or false, yes or no, based on a given set of independent variables.
Basically, it predicts a probability, hence its output lies between 0 and 1.
• It helps predict whether something belongs to one group or another, like spam vs. not
spam, sick vs. healthy, or pass vs. fail.
• Instead of predicting a number, it calculates the probability that something belongs to a
certain category.
• It uses a special mathematical function called the sigmoid function to keep values
between 0 and 1, representing probability.
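The sigmoid function is σ(z) = 1 / (1 + e^(−z)); a quick sketch of how it maps any real number into the (0, 1) range:

import numpy as np

def sigmoid(z):
    # Squashes any real number into (0, 1), so the output can be read as a probability
    return 1 / (1 + np.exp(-z))

print(sigmoid(-3), sigmoid(0), sigmoid(3))  # ≈ 0.047, 0.5, 0.953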
Example: Predicting if a Student Passes an Exam
Imagine you are a teacher who wants to predict whether a student will pass or
fail based on their study hours.
•Independent Variable (X): Hours spent studying
•Dependent Variable (Y): Pass (1) or Fail (0)
How Logistic Regression Works Here
•If a student studies for many hours, their probability of passing is high.
•If a student studies little, their probability of passing is low.
•The logistic regression model calculates this probability and assigns a
result:
•If probability > 0.5, the student is predicted to pass (1)
•If probability < 0.5, the student is predicted to fail (0)
Example Prediction
Let’s assume the model gives the following results:
Study Hours | Probability of Passing | Predicted Outcome
2 hours     | 0.30                   | Fail (0)
5 hours     | 0.70                   | Pass (1)
8 hours     | 0.90                   | Pass (1)
So, a student studying for 8 hours has a 90% chance of passing
according to the logistic regression model.
This is a simple example of how logistic regression helps make binary
classifications based on probabilities.
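A minimal scikit-learn sketch of this study-hours example (the training data is made up so that the fitted probabilities roughly match the table above):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up training data: hours studied -> pass (1) / fail (0)
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

model = LogisticRegression().fit(X, y)

# Probability of passing for 2, 5 and 8 hours of study
print(np.round(model.predict_proba([[2], [5], [8]])[:, 1], 2))  # probabilities of passing
print(model.predict([[2], [5], [8]]))                           # predicted outcomes (0 = fail, 1 = pass)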