Machine Learning


What is machine learning?

 Machine learning is the field of study that gives computers the ability to learn from
past experience and data without being explicitly programmed
o Arthur Samuel, 1959
 Or, in simple words, machine learning is the science (or art) of programming
computers to learn from data
Supervised Learning
Definition
 Supervised Learning is a type of Machine Learning where the model is trained
using labeled data (data with both inputs and correct outputs). The computer
uses this information to learn the relationship between inputs and outputs.
It’s called “supervised” because it’s like a teacher guiding the computer.
Simple Example
 Imagine you want to teach a computer to recognize whether an email is spam
or not:
 You show the computer lots of emails.
 Each email is labeled as “spam” or “not spam.”
 The computer learns what spam emails look like based on these labeled
examples
 Later, you give it a new email, and it predicts whether it is spam or not
Supervised learning solves two types of problems
Regression and Classification
 Regression
o Predicts continuous numeric values
o E.g.
 Population growth prediction
 Life expectancy prediction
 Market forecasting/prediction
 Advertising popularity prediction
 Stock price prediction
o Algorithms (a minimal code sketch follows this list)
 Linear Regression (single feature)
 Multiple Linear Regression (many features)
 Ridge Regression (regularized linear regression)
 Lasso Regression (another regularized version)
 Support Vector Regression (SVR)
 Decision Tree Regression / Random Forest Regression
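
A minimal regression sketch using scikit-learn's LinearRegression; the experience/salary numbers are made-up illustrations, not from these notes:

# Linear regression: fit a line to numeric targets and predict a new value
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [4], [5]])            # single feature, e.g. years of experience (illustrative)
y = np.array([30000, 35000, 42000, 50000, 58000])  # numeric target, e.g. salary (illustrative)

model = LinearRegression()
model.fit(X, y)                     # estimate slope and intercept from the data
print(model.predict([[6]]))         # predict a continuous value for an unseen input
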
 Classification
o Classifies records into discrete classes/labels
o E.g.
 Determine whether a received email is spam or ham
 Identify customer segments
 Decide whether a bank loan should be granted
 Predict whether a student will pass or fail an examination
o Algorithms
 Logistic Regression
 Decision Tree
 Random Forest
 Support Vector Machine (SVM)
 K-Nearest Neighbors (KNN)
 Naïve Bayes
 AdaBoost
 Gradient Boosting / XGBoost
Unsupervised Learning
Definition
Unsupervised Learning is when the model is trained on unlabeled data (only inputs,
no correct outputs).
The goal is to find patterns, groups, or structures hidden in the data.

Examples

 Customer segmentation (grouping customers based on buying habits)


 Market basket analysis (finding which products are bought together)
 Anomaly detection (fraud detection, unusual activity in sensors)
 Document/topic clustering (news grouped by topics automatically)
 Image compression (reducing size without labels)

It solves problems such as clustering, association rules, and dimensionality reduction

1. Clustering – discover the inherent groupings in the data, such as grouping customers by
purchasing behavior (a minimal code sketch follows this list)
o e.g., K-Means, Hierarchical Clustering, DBSCAN
2. Association Rules – finding relationships between items

An association rule learning problem is where you want to discover rules that describe
large portions of your data, such as people that buy X also tend to buy Y

E.g.

▪ Market basket analysis

Algorithms

▪ Apriori

▪ Eclat

▪ FP-Growth (used in market basket analysis)

3. Dimensionality Reduction

 The number of input features, variables, or columns present in a given dataset is known
as dimensionality, and the process to reduce these features is called dimensionality
reduction
 It is a way of converting a higher-dimensional dataset into a lower-dimensional one while
ensuring that it still provides similar information.
 In many cases a dataset contains a huge number of input features, which makes the
predictive modeling task more complicated and increases resource usage and training/testing time
 Because it is very difficult to visualize or make predictions on a training dataset with a
high number of features, dimensionality reduction techniques are required in such cases
o Features Selection
 Filter
 Wrapper
 Embedded
o Features Extraction
 Principal Component Analysis (PCA)
 Linear Discriminant Analysis (LDA)
 Generalized Discriminant Analysis (GDA)
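
A minimal unsupervised-learning sketch combining K-Means clustering and PCA from scikit-learn; the random data, number of clusters, and number of components are illustrative assumptions:

# Unsupervised learning: find clusters and reduce dimensionality without labels
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.rand(100, 5)                          # 100 unlabeled samples with 5 features (synthetic)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X)           # clustering: a group index for each sample

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)              # dimensionality reduction: 5 features -> 2 components
print(cluster_ids[:10], X_reduced.shape)
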

Reinforcement Learning
 A type of machine learning where an agent learns by interacting with an
environment, taking actions, and receiving rewards or penalties.
Goal: Learn the best strategy (policy) to maximize long-term rewards.
 It is employed by various software and machines to find the best possible
behavior or path it should take in a specific situation
 Reinforcement learning differs from the supervised learning in a way that in
supervised learning the training data has the answer key with it so the model
is trained with the correct answer itself whereas in reinforcement learning,
there is no answer but the reinforcement agent decides what to do to perform
the given task
 In the absence of a training dataset, it is bound to learn from its own experience
 Like humans learn in the real world – through trial and error. The agent
(software/machine) learns by experience:
o Action → Environment → Reward/Penalty → Learn → Improve
 Examples
o Resource management in computer clusters
o Traffic Light Control
o Robotics
o Web system configuration
o Chemistry
 Algorithms
o Q-Learning
o Deep Q-Learning
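
A minimal sketch of the tabular Q-Learning update rule; the states, action, reward, and the alpha/gamma values are hypothetical placeholders:

# Q-Learning update: Q(s,a) += alpha * (reward + gamma * max_a' Q(s',a') - Q(s,a))
import numpy as np

n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))              # action-value table, learned from experience

alpha, gamma = 0.1, 0.9                          # learning rate and discount factor (illustrative)
state, action, reward, next_state = 0, 1, 1.0, 2 # one hypothetical action -> environment -> reward step

Q[state, action] += alpha * (reward + gamma * np.max(Q[next_state]) - Q[state, action])
print(Q[state, action])                          # the agent's improved estimate for that state-action pair
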
Batch Learning / Offline Learning

 Production: the server/environment on which the deployed code is going to run


 Conventional way of training a model: use the whole dataset to train the model
 No incremental learning
 Generally trained on your own system (not the production server)
 Training data -> train model on the system using the data -> test data -> deploy model
 If the dataset is large (big data), training is costly and time-consuming
 Problem with batch learning

The model is static -> once trained, its knowledge does not change -> even after 1 year
there is no change in the model -> it keeps giving recommendations based on the
previous year's dataset.

Alternatively, we need to keep updating and redeploying the model (e.g., every month).
If the data is large (big data), this creates hardware and availability problems:

Train the model -> put it on the server -> pull the ML model again -> retrain with the
previous and updated data -> redeploy the model -> repeat the process again
and again.
The above process is very time-consuming.
In simple terms:
Batch learning is a machine learning approach where the
model is trained on the entire dataset all at once, instead of being updated
continuously as new data arrives [no incremental learning]. After training, the model
is used for making predictions, and it only changes if it is retrained on new data from scratch.

Batch learning: advantages and drawbacks


If you want a batch learning system to know about new data, you need to train a
new version of the system from scratch on the full dataset, then stop the old system
and replace it with the new one
▪ The whole process of training, evaluating, and launching a Machine
Learning system can be automated easily

↳ MLOps pipeline
▪ Training using the full set of data can take many hours
▪ Typically train a new system only every 24 hours or even just weekly
▪ Training on the full set of data requires a lot of computing resources (CPU,
memory space, disk space, disk I/O, network I/O)

Online Learning/ Incremental Learning


Online learning is a method where the model learns continuously and sequentially, updating
itself with each new data point or small group of data points.
Online learning is like learning day by day from what you see, rather than studying
everything all at once; it helps the computer keep improving step by step as new information
arrives.
Fast
Less costly
Less time-consuming
Can train on the server
Online learning does incremental learning using mini-batches of data, training the
model sequentially; since the batches are small, the model can be trained on the server

Each learning step is fast and cheap, so the system can learn about new data on the
server, as it arrives
Online learning is great for systems that receive data as a continuous flow (e.g.,
stock prices) and need to adapt to change rapidly or autonomously
It is also a good option if you have limited computing resources: once an online
learning system has learned about new data instances, it does not need them
anymore, so you can discard them
This can save a huge amount of space
Out-of-core learning: training systems on huge datasets that cannot fit in one
machine's main memory
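
A minimal sketch of incremental training with scikit-learn's SGDClassifier and partial_fit; the synthetic mini-batches and the learning-rate settings are illustrative assumptions:

# Online / incremental learning: update the model one mini-batch at a time
import numpy as np
from sklearn.linear_model import SGDClassifier

model = SGDClassifier(learning_rate="constant", eta0=0.01)   # eta0 = learning rate (illustrative value)
classes = np.array([0, 1])                                   # all classes must be declared up front

rng = np.random.RandomState(0)
for step in range(10):                       # each loop simulates a new mini-batch arriving on the server
    X_batch = rng.rand(32, 4)                # 32 samples, 4 features (synthetic)
    y_batch = (X_batch[:, 0] > 0.5).astype(int)
    model.partial_fit(X_batch, y_batch, classes=classes)
    # old batches can now be discarded, saving space

print(model.predict(rng.rand(3, 4)))
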

When to use:
 When there is concept drift
 If the software has a volatile nature
 When the frequency of data changes is high
 When a cost-effective solution is needed
 When a faster solution is needed

Learning Rate:
Decides how fast the model adapts to new data
One important parameter of online learning systems is how fast they should
adapt to changing data
If you set a high learning rate, then your system will rapidly adapt to new
data, but it will also tend to quickly forget the old data
If you set a low learning rate, the system will have more inertia;
that is, it will learn more slowly, but it will also be less sensitive to
noise in the new data or to sequences of nonrepresentative data points
(outliers)
We need to find the correct learning rate; otherwise the model may
 learn new patterns and forget old patterns, or
 learn new patterns too slowly
Disadvantages:
 Tricky to use
 Risky
 Online learning is less stable and gives lower accuracy than batch learning
o e.g., if the server is hacked and the hacker feeds it spam/bad data
 This can be handled using a monitoring system such as anomaly detection, or
by rolling back the model
Instance Based
The system learns the examples by heart (it stores each training instance)
Then generalizes to new cases by using a similarity measure to compare them to
the learned examples (or a subset of them)
It is called instance-based because it builds the hypotheses from the training
instances
It is also known as memory-based learning or lazy-learning
Ex. K Nearest Neighbor (KNN)

Model Based
 Model-> formula
 Train model from training data to estimate model parameters i.e. discover
patterns
 Store the built model in suitable format
 Generalize the rules of the model (e.g., a pickled model) instead of keeping the training set
 Predict the unseen instance (data) using the model
 It requires a known model form
 It takes less memory compared to the instance based learning
 E.g. ▪ Linear Regression
End to End for Model Deployment
Model Evaluation
Mean Absolute Error (MAE):
measures the average magnitude of the errors in a set of forecasts, without
considering their direction
measures accuracy for continuous variables

The MAE is a linear score which means that all the individual differences are
weighted equally in the average

Mean Squared Error (MSE)/ mean squared deviation (MSD):


measures the average of the squares of the errors,
i.e., the average squared difference between the actual and estimated values
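
A minimal sketch computing MAE and MSE with scikit-learn's metrics; the actual/predicted values below are made-up numbers:

# MAE = mean(|y_true - y_pred|);  MSE = mean((y_true - y_pred)^2)
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = [3.0, 5.0, 2.5, 7.0]    # actual values (illustrative)
y_pred = [2.5, 5.0, 4.0, 8.0]    # model estimates (illustrative)

print(mean_absolute_error(y_true, y_pred))   # every error weighted equally (linear score)
print(mean_squared_error(y_true, y_pred))    # squaring penalizes large errors more heavily
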

Classification
 Prediction of class/label values
 Classification is a supervised machine learning method where the model tries to predict the correct
label for a given input
 In classification, the model is fully trained on the training data and then evaluated on test data before
it is used to predict on unseen data

Learners

 There are two types of Learners


o Eager Learner
 Logistic Regression
 SVM
 Naïve Bayes
 Decision Tree
 Artificial Neural Network

o Lazy Learners or Instance-Based Learners [KNN or case-based reasoning]


 Eager Learner

Model based [Model = Formula]

These are machine learning algorithms that first build a model from the training dataset
before making any future predictions

They spend more time during training because of their eagerness to achieve better
generalization, and they require less time for prediction

Training Time  High


Prediction Time  Low

 Lazy Learner or instance based learner


No model is created
They memorize the training data, and during prediction they find the nearest
neighbors from the training dataset, which makes prediction slow

Training Time -> Low


Prediction Time -> High

Types of classification

 There are three types of classification


o Binary Classification
o Multi-class Classification
o Multilabel Classification

Binary Classification

Labels = 2 (e.g., Email -> Spam or Not Spam)

Output = 1 class per instance

The goal is to classify the input into one of two mutually exclusive categories

The training data in such situations is labeled as True/False, 0/1, or Spam/Not Spam

Multiclass Classification
Labels > 2 (e.g., Person -> Upper class, Middle class, or Lower class)

Output = 1 class per instance

• Each instance belongs to exactly one class out of multiple possible classes.
• Classes are mutually exclusive (choosing one means excluding the others).

• Example:
- Predicting the type of fruit (Apple, Banana, Mango, Orange) → one fruit at a time.
- Handwritten digit recognition (0–9) → only one digit per image.

Multilabel Classification
• Each instance can belong to multiple classes simultaneously.
• Classes are not mutually exclusive (an instance may have zero, one, or several
labels).
• Example:
- Predicting movie genres → a movie can be Action + Comedy + Drama.
- Detecting objects in an image → one picture can have Dog + Car + Tree.
We can cite:
 Multi-label Decision Tree
 Multi-label Gradient Boosting
 Multi-label Random Forest
Logistic Regression
Designed for Binary Classification

1) Definition

 Logistic Regression is a supervised machine learning algorithm used for classification


problems (mostly binary).
 It predicts the probability of an instance belonging to a class.
 Decision is made by applying a threshold (commonly 0.5).

2) How it works

 It takes input features and combines them linearly with weights.


 The result is passed through a sigmoid function, which compresses the value into a range
between 0 and 1.
 The output can be interpreted as a probability.

3) Training / Learning

 Parameters (weights) are learned using Maximum Likelihood Estimation (MLE).


 The objective is to minimize the log loss (cross-entropy loss).
 Optimization is usually done with Gradient Descent or advanced solvers (like Newton’s
method, LBFGS).
 Regularization (L1 or L2) is often applied to avoid overfitting and handle
multicollinearity.

4) Decision Making

 If predicted probability ≥ threshold → Class 1.


 If predicted probability < threshold → Class 0.
 Threshold can be adjusted depending on business needs (e.g., medical tests may need
higher recall).

Example: Cat vs Dog


 A Logistic Regression model is trained with images of cats (label = 0) and dogs (label =
1).
 For a new test image, the model predicts a probability:
 The decision is made by comparing it against a threshold (commonly 0.5).

👉 Suppose the model outputs: P(Dog) = 0.82

 Since probability ≥ 0.5 threshold, the model decides → Dog.

👉 Another image gives: P(Dog) = 0.23

 Since probability < 0.5 threshold, the model decides → Cat.

Decision Rule (in words)

 If probability ≥ 0.5 → predict Dog


 If probability < 0.5 → predict Cat
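
A minimal scikit-learn sketch of this probability-plus-threshold decision rule; the two-feature "image" data and the resulting probability are invented for illustration:

# Logistic regression: P(class 1) >= 0.5 -> Dog, otherwise Cat
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.1, 0.2], [0.2, 0.1], [0.8, 0.9], [0.9, 0.7]])  # hypothetical image features
y = np.array([0, 0, 1, 1])                                      # 0 = Cat, 1 = Dog

clf = LogisticRegression()
clf.fit(X, y)

p_dog = clf.predict_proba([[0.85, 0.8]])[0, 1]   # sigmoid output: probability of class 1 (Dog)
threshold = 0.5
print(p_dog, "Dog" if p_dog >= threshold else "Cat")
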

5) Evaluation Metrics

 Confusion Matrix for understanding TP, TN, FP, FN.


 Accuracy (useful only if classes are balanced).
 Precision, Recall, F1-score (better for imbalanced datasets).
 ROC-AUC / PR-AUC for probability ranking and imbalanced problems.
 Calibration curves when probability outputs are directly used in decision-making.

6) Assumptions

 Requires the dependent variable to be binary [two categories/classes]


 Only the meaningful variables should be included

 The independent variables should be independent of each other, i.e., the model should have
little or no multicollinearity
 Requires quite large sample sizes
7) Extensions

 Multinomial Logistic Regression → for multiclass problems (more than 2 classes).


 One-vs-Rest approach → trains multiple binary logistic models for each class.
 Multilabel classification → sigmoid applied independently to each label.

8) Advantages

 Simple and easy to implement.


 Allows easy regularization to prevent overfitting, while yielding probabilities as the
prediction result
 Allows easy model updating using stochastic gradient descent
 Output probabilities are not affected by removing variables that are uncorrelated with the
output or that are multicollinear

9) Disadvantages

 fails to solve non-linear problems


 underperforms when there are multiple or non-linear decision boundaries.
 It fails to capture more complex relationships.
 Without proper identification of independent variables Logistic Regression fails to
perform correctly.
 Logistic Regression can only predict a categorical (discrete) outcome

10) Key Interview Highlights

 Logistic Regression is for classification, not regression.


 Uses a sigmoid function to output probabilities.
 Parameters are estimated by maximum likelihood.
 Evaluated using precision, recall, F1, ROC-AUC (not just accuracy).
 Assumes linear relationship in log-odds, not raw features.
 Interpretation is often in terms of odds ratios.
Sigmoid Function

 Formula:
o sigmoid(z) = 1 / (1 + e^(-z))
 Range:
o Output is always between 0 and 1.
o This makes it suitable for probability representation.

 In ML usage:
o In binary classification, sigmoid maps raw model output (logit) into a probability
of belonging to the positive class.
o Example: if sigmoid(z) = 0.82 → interpreted as 82% probability of being class
1.
o In multilabel classification, sigmoid is applied to each label independently (since
each can have its own probability).
o In multiclass classification, we usually use softmax instead of sigmoid, because
probabilities must sum to 1 across classes.

🔑 Key Point:

 Sigmoid ≈ probability for one independent outcome.


 Softmax ≈ probability distribution for mutually exclusive outcomes.
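
A minimal NumPy sketch contrasting sigmoid (independent probabilities) and softmax (a distribution that sums to 1); the logit values are arbitrary examples:

# Sigmoid: each score becomes an independent probability in (0, 1)
# Softmax: a vector of scores becomes a probability distribution summing to 1
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))           # subtract the max for numerical stability
    return e / e.sum()

logits = np.array([2.0, -1.0, 0.5])     # arbitrary raw model outputs
print(sigmoid(logits))                  # multilabel style: each label scored independently
print(softmax(logits), softmax(logits).sum())   # multiclass style: probabilities sum to 1.0
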
Naïve Bayes

Used in text classification


Naïve  assumes that all features are conditionally independent of each other,
given the class label.

Bayes’ Theorem  finds the probability of an event occurring given the probability
of another event that has already occurred

A and B are events

P(A)  Prior Probability

 is happening or event A before seeing any evidence

P(B)  Evidence

 already happen or total probability of observing B under all possible conditions.

P(B|A)  Likelihood

 probability of observing event B given that A is true.

P(A|B)  Posterior Probability

 A happening given that event B has already happened


 Posterior = (Prior × Likelihood) ÷ Evidence

Naïve Bayes has been studied extensively since the 1960s


Types of Naïve Bayes:

 Gaussian Naïve Bayes classifier


o When features are continuous numbers [marks, height, weight]
 Multinomial Naive Bayes [Multiclass Classification]
o Data has counts of things [Spam Detection, Sentiment Analysis]
o How many times a word appears
 Bernoulli Naive Bayes [Binary Classification]
o Only cares whether something is present or not [Yes or No]
o All features are binary
o Ex. Email spam check
 Is the word “win” in the email? Yes / No
 Categorical Naïve Bayes
o Features are categorical (can take more than two discrete values)
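
A minimal text-classification sketch using Multinomial Naïve Bayes on word counts; the tiny spam/ham corpus is invented for illustration:

# Multinomial Naive Bayes on word counts (spam vs ham)
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["win money now", "meeting at noon", "win a free prize", "project status update"]
labels = ["spam", "ham", "spam", "ham"]          # illustrative labels

vectorizer = CountVectorizer()                   # counts how many times each word appears
X_counts = vectorizer.fit_transform(texts)

clf = MultinomialNB()
clf.fit(X_counts, labels)

print(clf.predict(vectorizer.transform(["free money prize"])))   # expected: ['spam']
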
KNN [K-Nearest Neighbors]

 It is a supervised machine learning algorithm which can be used for both classification
and regression problems
 K  the number of nearest neighbors
 Instance-based algorithm or lazy learner
 No model is created; it memorizes the training data, finds the nearest neighbors of the
input, and predicts the output
 Requires more time for prediction
 However, it is widely used for classification in industry
 It assumes that similar data points are in close proximity

Working:

 Classify by the vote of its neighbors, with the case being assigned to the class most
common amongst its K nearest neighbors, measured by a distance function

 Choose the optimal value for K by examining the dataset [usually between 3 and 10 for
most datasets]
 Cross-validation is another way to retrospectively determine a good K value by using an
independent dataset to validate the K value
 Disadvantage:
o Computationally expensive – it stores all the training data in memory
o High memory requirement
 Applications of KNN
o Recommender system
o Relevant document classification
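
A minimal KNN classification sketch with scikit-learn; K = 3 and the toy 2-D points are illustrative choices:

# K-Nearest Neighbors: prediction is a majority vote among the closest training points
from sklearn.neighbors import KNeighborsClassifier

X_train = [[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]]   # toy points
y_train = [0, 0, 0, 1, 1, 1]

knn = KNeighborsClassifier(n_neighbors=3)    # K = number of neighbors that vote
knn.fit(X_train, y_train)                    # lazy learner: essentially just stores the data

print(knn.predict([[2, 2], [8, 7]]))         # each query classified by its 3 nearest neighbors
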
SVM [Support Vector Machine]

 It is Supervised Machine Learning Algorithm which is use for both


classification as well as regression
 It is mostly used for classification problems
 It separates the classes using a hyperplane
 In a two-dimensional space the hyperplane is a line dividing the plane into two parts, with
each class lying on either side
 There can be multiple ways to choose a separating hyperplane, but SVM's goal is
to choose the maximum-margin hyperplane.
Picture this: you’re on a quest to find the perfect algorithm that can effortlessly
distinguish between apples and oranges, even when they’re mixed together in a
basket. Enter Support Vector Machines, or SVM

Class labels are denoted as

-1  for -ve class

+1  for +ve class

The main task of the classification problem is to find the best separating
hyperplane/ Decision boundary.

In an n-dimensional feature space the separating hyperplane is an (n-1)-dimensional boundary,
which can be either linear or non-linear. The data points closest to the hyperplane, which
determine its position and margin, are called support vectors.
What exactly are Margins?

 The further the data points are from the decision boundary (beyond the margin), the more
confidently they are classified.
 Margins represent the width of the corridor that the SVM algorithm aims to
maximize when finding the optimal hyperplane to separate different classes
of data. The larger the margin, the greater the confidence in the
classification made by the SVM model.
Linear and Non-Linear SVM

Linear SVM:

 In linear SVM, it separates data by a straight line or hyperplane in the


input space, rendering it suitable for linearly separable data.
 The key advantage of linear SVM lies in its simplicity and efficiency.

Non-Linear SVM

 Non-linear SVM is employed when the relationship between features and


classes is not linear and cannot be separated by a straight line or
hyperplane in the input space.
 It addresses this by mapping the input data into a higher-dimensional
feature space where it becomes linearly separable.
Optimization Technique used in SVM

 This optimization problem aims to minimize the classification error while maximizing
the margin, which is the distance between the decision boundary and the closest data
points from each class.

 Hard constrain that Support Vector Machine follows:- each data point
must lie on the correct side of the margin and there should be no
misclassification.

Hard and Soft SVM

Hard SVM

 Algorithm aims to find the hyperplane that separates the classes with the
maximum margin while strictly enforcing that all data points are correctly
classified.
 Assuming that the data is linearly separable, it implies the existence of at
least one hyperplane that can perfectly separate the classes without any
misclassifications.
 However, Hard SVM does not tolerate any misclassification errors and
demands the data to be perfectly separable, which can be overly restrictive
and might lead to poor performance on noisy or overlapping datasets.
Soft SVM

 Also Known as C-SVM (C - regularization parameter)


 relaxes the strict requirement of Hard SVM by allowing some
misclassification errors.
 It introduces a regularization parameter (C) that controls the trade-off
between maximizing the margin and minimizing the classification error.
 A smaller value of C allows for a wider margin and more
misclassifications, while a larger value of C penalizes misclassifications
more heavily, leading to a narrower margin.
 Soft SVM is suitable for cases where the data may not be perfectly
separable or contains noise or outliers.
 It provides a more robust and flexible approach to classification, often
yielding better performance in practical scenarios.

Relation between Regularization parameter (C) and SVM


 Low C → Wide margin, allows more misclassifications, better generalization.
 High C → Narrow margin, fewer misclassifications, risk of overfitting

Kernel trick in SVM

The kernel trick maps the input data implicitly into a higher-dimensional space so that a
linear separator can be found there.
Common kernels: Linear SVM, Polynomial SVM, RBF SVM

Library:
from sklearn.svm import SVC
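
A minimal SVC sketch showing the kernel choice and the regularization parameter C discussed above; the toy points and parameter values are illustrative:

# Soft-margin SVM: kernel sets linear vs non-linear boundary; C trades margin width vs misclassification
from sklearn.svm import SVC

X_train = [[0, 0], [0, 1], [1, 0], [3, 3], [3, 4], [4, 3]]   # toy 2-D points
y_train = [-1, -1, -1, 1, 1, 1]                              # -1 = negative class, +1 = positive class

clf = SVC(kernel="rbf", C=1.0)       # low C -> wider margin; high C -> penalize misclassification more
clf.fit(X_train, y_train)

print(clf.support_vectors_)          # the points that define the maximum-margin boundary
print(clf.predict([[1, 1], [4, 4]]))
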
Decision Tree

Decision Tree Classifier

ID3 (Iterative Dichotomiser 3): a decision tree algorithm that uses information gain based on
entropy to select the best attribute for splitting.

CART (Classification and Regression Tree): a decision tree algorithm that uses the Gini Index
(for classification) or variance reduction (for regression) to split data.
CART always creates a binary tree.

a. Entropy and Gini Index  measure the purity of a split


b. Information Gain  decides on which feature the tree splits
Example:-
Let's go with a practical example:
Problem statement: Can Suraj go to play tennis?

Node types (each non-leaf node holds a test condition):
 Root Node
 Parent Node
 Child Node
 Leaf Node

Hyperparameter
 max_depth = number of levels / height of the tree
Entropy:-
 Entropy H(S) = -p log2(p) - q log2(q), where p and q are the proportions of the two classes
 Range between 0 and 1
 If entropy is 1  not a pure split
 If entropy is 0  pure split

Gini Index / Gini Impurity:


 Gini impurity = 1 - (p^2 + q^2), i.e., 1 minus the sum of squared class proportions
 Range between 0 and 1 (at most 0.5 for two classes)
 If Gini impurity is 0  pure split
 If Gini impurity is 0.5  not a pure split (maximally impure for two classes)

[Figure: Entropy H(S) and Gini impurity plotted against the class probability p; entropy peaks at 1 and Gini impurity at 0.5 when p+ = p- = 0.5]
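
A small sketch computing entropy and Gini impurity for a binary split, with p and q = 1 - p as the class proportions defined above:

# Entropy H(S) = -p*log2(p) - q*log2(q);  Gini impurity = 1 - (p^2 + q^2)
import math

def entropy(p):
    q = 1 - p
    if p in (0, 1):
        return 0.0                      # a pure node has zero entropy
    return -p * math.log2(p) - q * math.log2(q)

def gini(p):
    q = 1 - p
    return 1 - (p**2 + q**2)

print(entropy(0.5), gini(0.5))          # 1.0 and 0.5 -> maximally impure split
print(entropy(1.0), gini(1.0))          # 0.0 and 0.0 -> pure split
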

About Decision tree:


 It is a supervised machine learning algorithm that predicts the output by splitting the data
into branches based on decision rules
 Every node except the leaf nodes contains a conditional statement
 Works for both classification and regression problems
 Splitting is done using information gain with the Gini index or entropy [the choice can depend
on dataset size]
 Unlike linear models, it captures non-linear relationships quite well
Application:
 Most popular algorithm use in Data Mining
Assumptions:
 At the beginning, the whole dataset is considered as the root node
 Feature values are preferred to be categorical
 If the values are continuous, they are discretized prior to building the model
 Records are distributed recursively on the basis of attribute values
Advantages
 Easy to interpret and understand, even for non-technical users.
 No statistical knowledge required to interpret results.
 Graphical representation is intuitive and hypothesis-friendly.
 Helps in quick data exploration and variable importance detection.
 Robust to missing values and outliers to some extent.

Disadvantages

 Prone to overfitting if not pruned or constrained.


 Not suitable for continuous variables (information loss on binning).
 Can be unstable (small data changes → very different tree).
 Biased results if data is imbalanced.

Pre-pruning (Early Stopping)

 Stops the tree from growing once it reaches a certain condition.


 Criteria: maximum depth, minimum samples per split, minimum information gain, etc.
 Prevents overfitting during training itself.
 Faster, less complex, but may cause underfitting if stopped too early.

Post-pruning (Pruning after Full Growth)

 Allows the tree to grow fully, then removes (cuts back) the branches that add little
predictive power.
 Based on validation error, cost complexity pruning, or statistical tests.
 Reduces overfitting while keeping useful splits.
 More accurate but computationally expensive

 If the dataset is big, pre-pruning is used; otherwise, post-pruning is used
 The more efficient of the two is pre-pruning, because it prunes while the tree is being created
 Pre-pruning = Stop early to avoid overfitting.
 Post-pruning = Grow fully, then trim to simplify.
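
A minimal scikit-learn sketch contrasting pre-pruning (stopping criteria set before growth) with post-pruning (cost-complexity pruning after growth); the dataset and parameter values are illustrative:

# Pre-pruning vs post-pruning with DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Pre-pruning: constrain the tree while it is being built (early stopping)
pre_pruned = DecisionTreeClassifier(max_depth=3, min_samples_split=10, random_state=0)
pre_pruned.fit(X, y)

# Post-pruning: let the tree grow, then trim weak branches via cost-complexity pruning
post_pruned = DecisionTreeClassifier(ccp_alpha=0.02, random_state=0)
post_pruned.fit(X, y)

print(pre_pruned.get_depth(), post_pruned.get_depth())
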
Ensemble Learning
Ensembling is the art of combining a diverse set of learners [individual models] to improve
the stability and predictive power of the model.

Primarily used to improve the model's:


 Classification
 Prediction
 Function approximation
 Performance
Or
 Reduce the risk of an unfortunate selection of a single poor model
Goals:
 Assigning Confidence  Bagging
 Improving Accuracy  Boosting
 Give one model O/P to another as I/P  Stacking

Bagging
 Stands for bootstrap aggregation
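
A minimal bagging (bootstrap aggregation) sketch with scikit-learn; the base estimator and parameter values are illustrative choices:

# Bagging: train many learners on bootstrap samples of the data and aggregate their votes
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

bag = BaggingClassifier(DecisionTreeClassifier(),  # individual (diverse) learners
                        n_estimators=10,           # number of bootstrap models
                        bootstrap=True,            # sample the training data with replacement
                        random_state=0)
bag.fit(X, y)
print(bag.predict(X[:3]))
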

You might also like