Unit 4: Classification and Regression

Topic 1: Supervised Learning vs Unsupervised Learning

Feature | Supervised Learning | Unsupervised Learning
Definition | Learning with labeled data (input + correct output provided). | Learning with unlabeled data (only input data, no output labels).
Goal | Learn a mapping function from input → output. | Discover hidden patterns, structure, or grouping in data.
Data Requirement | Requires a large labeled dataset. | Works with unlabeled data.
Output | Predictions (classification or regression). | Clusters, associations, dimensionality reduction.
Examples of Algorithms | Linear Regression, Logistic Regression, Decision Trees, Random Forest, Support Vector Machines (SVM), Neural Networks | K-Means Clustering, Hierarchical Clustering, DBSCAN, Principal Component Analysis (PCA), Autoencoders
Evaluation | Easy to evaluate (compare predictions with true labels). | Harder to evaluate (no ground truth).
Applications | Spam email detection (Spam / Not Spam), Disease diagnosis (Diabetic / Not), Credit score prediction | Customer segmentation, Market basket analysis, Document/topic clustering, Image compression
Topic 2: Classification of Supervised and Unsupervised Algorithms in ML
(A) Supervised Learning Algorithms

(Supervised = Data has labels → we predict output)

(1) Regression Algorithms (Predict Continuous Values)

 Linear Regression
 Polynomial Regression
 Ridge Regression (L2 Regularization)
 Lasso Regression (L1 Regularization)
 Elastic Net Regression
 Support Vector Regression (SVR)
 Decision Tree Regression
 Random Forest Regression
 Gradient Boosting Regression (XGBoost, LightGBM, CatBoost)
 k-Nearest Neighbors (k-NN) Regression
 Neural Networks (Deep Learning for regression)

(2) Classification Algorithms (Predict Categories / Classes)

 Logistic Regression
 k-Nearest Neighbors (k-NN)
 Support Vector Machine (SVM)
 Decision Tree Classifier
 Random Forest Classifier
 Naïve Bayes
 Gradient Boosting Classifiers (XGBoost, LightGBM, CatBoost)
 Neural Networks (Deep Learning for classification)

(B) Unsupervised Learning Algorithms


(Unsupervised = Data has no labels → we find patterns, structure, groups)
(1) Clustering

 K-Means Clustering
 Hierarchical Clustering (Agglomerative, Divisive)
 DBSCAN (Density-Based Spatial Clustering)
 Mean-Shift Clustering
 Gaussian Mixture Models (GMM, Expectation-Maximization)
 Self-Organizing Maps (SOM)

(2) Association Rule Learning

 Apriori Algorithm
 FP-Growth Algorithm
 Eclat Algorithm

(3) Dimensionality Reduction

 Principal Component Analysis (PCA)


 Singular Value Decomposition (SVD)
 Independent Component Analysis (ICA)
 t-SNE (t-distributed Stochastic Neighbor Embedding)
 UMAP (Uniform Manifold Approximation and Projection)

Summary:

 Supervised → Predict outcomes (classification, regression).


 Unsupervised → Discover hidden structure (clustering, associations,
dimensionality reduction).

Topic 3: Learning Steps in Machine Learning


Whether for classification or regression, the general ML pipeline is similar.

1. Problem Definition

 Define the task: classification, regression, clustering, etc.


 Example: Predict whether a customer will churn (Yes/No).

2. Data Collection

 Gather dataset from sources (databases, APIs, sensors, files).


 Example: Customer details, usage patterns, and churn status.

3. Data Preprocessing

 Handle missing values, duplicates, outliers.


 Encode categorical variables (e.g., One-Hot Encoding, Label Encoding).
 Feature scaling (Normalization/Standardization).

4. Splitting Dataset

 Train set (e.g., 70–80%) → used for model training.


 Test set (e.g., 20–30%) → used for model evaluation.
 Sometimes a validation set (for hyperparameter tuning).

5. Model Selection

 Choose algorithm (Logistic Regression, Decision Tree, SVM, etc.).


 Consider accuracy, interpretability, speed, and scalability.

6. Model Training

 Feed training data to the algorithm.


 Model learns patterns (weights/parameters).

7. Model Evaluation

 Test on unseen data (test set).


 Use metrics like:
o Accuracy
o Precision, Recall, F1-Score
o Confusion Matrix
o ROC Curve, AUC

8. Model Optimization

 Tune hyperparameters (Grid Search, Random Search).


 Use regularization to prevent overfitting.

9. Deployment

 Deploy model into production (web app, mobile app, embedded system).
 Monitor performance over time.
10. Maintenance & Updates

 Retrain the model with new data to keep it accurate.

So, in short:
Problem → Data → Preprocessing → Train/Test Split → Model →
Training → Evaluation → Deployment → Maintenance
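A minimal sketch of this pipeline in scikit-learn; the Iris dataset, the Logistic Regression model, and the grid of C values are illustrative choices, not part of the notes above:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Steps 1-2: problem definition & data collection (here: classify Iris species)
X, y = load_iris(return_X_y=True)

# Step 4: train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Steps 3, 5, 6: preprocessing (scaling) + model selection + training, wrapped in a Pipeline
pipe = Pipeline([("scaler", StandardScaler()),
                 ("clf", LogisticRegression(max_iter=1000))])

# Step 8: hyperparameter tuning (regularization strength C) with grid search
grid = GridSearchCV(pipe, {"clf__C": [0.1, 1, 10]}, cv=5)
grid.fit(X_train, y_train)

# Step 7: evaluation on unseen data
y_pred = grid.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))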

Topic 4: Linear Regression


Linear Regression is a supervised learning algorithm used for predicting a
continuous dependent variable (Y) based on one or more independent variables
(X).

It assumes a linear relationship between input features and output.

Equation of Linear Regression
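In its standard form, with \beta_0 the intercept, \beta_j the feature coefficients, and \varepsilon the error term:

Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n + \varepsilon

For simple linear regression (one feature), this reduces to Y = \beta_0 + \beta_1 X + \varepsilon.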

Types of Linear Regression


1. Simple Linear Regression – one independent variable.
2. Multiple Linear Regression – more than one independent variable.
3. Polynomial Regression – models non-linear relationships by adding polynomial terms of the features (covered in Topic 5).

4. Ridge & Lasso Regression – regularized forms to prevent overfitting.


Cost Function (Loss Function)
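The model is typically fit by minimizing the Mean Squared Error (MSE) over the m training samples, where \hat{y}_i is the predicted value:

J(\beta) = \frac{1}{m} \sum_{i=1}^{m} \left( y_i - \hat{y}_i \right)^2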

Training the Model

The coefficients are estimated by minimizing the cost function, either in closed form with Ordinary Least Squares (the normal equation) or iteratively with gradient descent.
Advantages of Linear Regression


1. Simplicity & Interpretability
o Very easy to understand and implement.
o Coefficients clearly show the effect of each feature.
2. Fast & Efficient
o Computationally inexpensive, works well even on large datasets.
3. Less Data Hungry
o Can work with relatively small datasets compared to deep learning.
4. Good for Linearly Separable Data
o Performs well when there is a clear linear relationship between
features and target.
5. Baseline Model
o Often used as a starting point for regression tasks before trying
more complex models.

Disadvantages of Linear Regression


1. Assumption of Linearity
o Works poorly if the relationship between variables is non-linear.
2. Sensitive to Outliers
o A single extreme value can significantly affect the regression line.
3. Multicollinearity Problem
o If independent variables are highly correlated, it affects stability
and interpretation of coefficients.
4. Poor at Complex Relationships
o Cannot capture interactions and non-linear patterns without
modification (e.g., polynomial regression).

5. Assumption Requirements

o Requires assumptions like homoscedasticity (constant variance), normal


distribution of errors, and independence of observations

Python Example (Simple Linear Regression)


import numpy as np

import matplotlib.pyplot as plt

from sklearn.linear_model import LinearRegression

# Sample data

X = np.array([1, 2, 3, 4, 5]).reshape(-1, 1) # independent variable

y = np.array([2, 4, 5, 4, 5]) # dependent variable

# Create and train model

model = LinearRegression()

model.fit(X, y)

# Predictions

y_pred = model.predict(X)
# Plot

plt.scatter(X, y, color="blue", label="Actual data")

plt.plot(X, y_pred, color="red", label="Regression line")

plt.xlabel("X")

plt.ylabel("y")

plt.legend()

plt.show()

# Coefficients

print("Intercept:", model.intercept_)

print("Slope:", model.coef_[0])

👉 In short:

 Use Linear Regression when the relationship is simple, linear, and you
need interpretability.
 Avoid it for complex, non-linear, or high-dimensional problems.

Topic 5: Polynomial Regression


Polynomial Regression is a type of supervised learning regression algorithm.
It is an extension of Linear Regression where the relationship between
independent variable(s) X and dependent variable Y is non-linear.

Equation
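For a single feature x and polynomial degree n, the model is still linear in its coefficients:

Y = \beta_0 + \beta_1 x + \beta_2 x^2 + \dots + \beta_n x^n + \varepsilon
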
When to Use
 When data shows a non-linear trend.
 Example: Predicting growth curves, demand curves, temperature patterns,
etc.

Steps in Polynomial Regression

1. Choose the polynomial degree n.
2. Transform the original features into polynomial terms (x, x², ..., xⁿ).
3. Fit an ordinary linear regression on the transformed features (as in the example below).
Advantages
o Captures non-linear relationships.
o Simple to implement (just an extension of linear regression).

Disadvantages
o Can easily overfit if degree is too high.
o Extrapolation (prediction outside range) is unreliable.
o More computational cost with higher degrees.

Python Example: Polynomial Regression


import numpy as np

import matplotlib.pyplot as plt


from sklearn.linear_model import LinearRegression

from sklearn.preprocessing import PolynomialFeatures

# Generate sample data

X = np.linspace(0, 10, 50).reshape(-1, 1)

y = 2 + 0.5*X + 1.5*X**2 + np.random.randn(50, 1)*3 # quadratic with noise

# Transform features for polynomial regression (degree=2)

poly = PolynomialFeatures(degree=2)

X_poly = poly.fit_transform(X)

# Fit linear regression on transformed features

model = LinearRegression()

model.fit(X_poly, y)

# Predict

y_pred = model.predict(X_poly)

# Plot

plt.scatter(X, y, color='blue', label="Data")

plt.plot(X, y_pred, color='red', linewidth=2, label="Polynomial Regression")

plt.legend()

plt.title("Polynomial Regression (Degree=2)")


plt.show()

Topic 6: Ridge Regression (L2 Regularization)


What is Ridge Regression?
 Ridge Regression is a linear regression technique that uses L2
regularization.
 It modifies the ordinary least squares (OLS) regression by adding a
penalty term to the loss function.
 The penalty discourages large coefficients (weights), which helps prevent
overfitting.

Ridge Regression Objective Function
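With regularization strength \lambda (called alpha in scikit-learn), an L2 penalty is added to the least-squares loss:

J(\beta) = \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2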

where λ ≥ 0 is the regularization parameter that controls the strength of the penalty.

🔹 Effect of λ (Regularization Parameter)


 λ = 0 → Equivalent to standard linear regression (no penalty).
 Small λ → Slight shrinkage, keeps model flexible.
 Large λ → Strong shrinkage, coefficients close to zero, model becomes
simpler.

Advantages
 Reduces overfitting by controlling large coefficients.
 Works well when predictors are highly correlated (multicollinearity).
 Always has a unique solution (since penalty ensures matrix invertibility).

Disadvantages
 Doesn’t perform feature selection (unlike Lasso, coefficients don’t
become exactly zero).
 Still assumes linear relationship between predictors and target.

Python Example
import numpy as np

import matplotlib.pyplot as plt

from sklearn.linear_model import Ridge

from sklearn.model_selection import train_test_split

from sklearn.metrics import mean_squared_error

# Generate sample data

np.random.seed(42)

X = 2 * np.random.rand(100, 1)

y = 4 + 3 * X.ravel() + np.random.randn(100)  # 1-D target with noise

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=42)

# Ridge Regression

ridge = Ridge(alpha=10) # alpha = λ

ridge.fit(X_train, y_train)

# Predictions

y_pred = ridge.predict(X_test)

# Evaluation

print("MSE:", mean_squared_error(y_test, y_pred))

print("Coefficients:", ridge.coef_)

print("Intercept:", ridge.intercept_)

In short:
Ridge Regression = Linear Regression + L2 penalty.
It shrinks coefficients but never eliminates them, making it effective for
reducing overfitting while keeping all predictors.
Topic 7: Lasso Regression (L1 Regularization)
What is Lasso Regression?

 Lasso (Least Absolute Shrinkage and Selection Operator) is a type of


linear regression that uses L1 regularization.
 It adds a penalty equal to the absolute value of the coefficients to the
loss function.
 Unlike Ridge, Lasso can shrink some coefficients exactly to zero →
acts as a method for feature selection.

Lasso Regression Objective Function
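The Lasso objective replaces Ridge's squared (L2) penalty with an absolute-value (L1) penalty:

J(\beta) = \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert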

Effect of λ

 λ = 0 → Equivalent to Linear Regression (no penalty).


 Small λ → Slight shrinkage, model remains flexible.
 Large λ → Many coefficients shrink to zero → sparse model (feature
selection).

Advantages

 Performs feature selection (unlike Ridge): some coefficients become exactly zero.
 Helps when there are many irrelevant features.
 Produces simpler and more interpretable models.

Disadvantages

 Can struggle when predictors are highly correlated (randomly selects


one).
 If number of predictors > number of samples, Lasso selects at most n
features.
 Biased estimates (shrinks coefficients too much).

Python Example
import numpy as np

import matplotlib.pyplot as plt

from sklearn.linear_model import Lasso

from sklearn.model_selection import train_test_split

from sklearn.metrics import mean_squared_error

# Generate sample data

np.random.seed(42)

X = np.random.rand(100, 5) # 5 features

y = 3*X[:,0] + 2*X[:,1] + np.random.randn(100)  # only first 2 features are relevant

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=42)

# Lasso Regression

lasso = Lasso(alpha=0.1) # alpha = λ

lasso.fit(X_train, y_train)

# Predictions

y_pred = lasso.predict(X_test)

# Evaluation

print("MSE:", mean_squared_error(y_test, y_pred))

print("Coefficients:", lasso.coef_)

print("Intercept:", lasso.intercept_)

Topic 8: Support Vector Regression (SVR)


What is Support Vector Regression (SVR)?

 SVR is the regression counterpart of Support Vector Machine (SVM).


 Instead of finding a hyperplane to separate classes (like in classification),
SVR tries to find a function that predicts continuous values within a
margin of tolerance.
 It focuses on fitting the data while ignoring small errors (within a
certain threshold).

🔹 SVR Key Idea

 In linear regression, we minimize sum of squared errors.


 In SVR, we minimize the error only when it exceeds a certain margin
(ε).
 SVR tries to keep predictions within a tube of radius ε (epsilon-
insensitive zone).
🔹 SVR Objective Function
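A common formulation, using slack variables \xi_i, \xi_i^* for points that fall outside the \varepsilon-tube:

\min_{w, b, \xi, \xi^*} \ \frac{1}{2} \lVert w \rVert^2 + C \sum_{i=1}^{n} (\xi_i + \xi_i^*)

subject to \; y_i - (w \cdot x_i + b) \le \varepsilon + \xi_i, \quad (w \cdot x_i + b) - y_i \le \varepsilon + \xi_i^*, \quad \xi_i, \xi_i^* \ge 0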

Parameters in SVR

 C (Regularization parameter)
o Large C → less tolerance for errors (low bias, high variance).
o Small C → more tolerance, simpler model.
 ε (Epsilon)
o Defines the "tube" around the regression line.
o Larger ε → fewer support vectors, simpler model.
 Kernel
o SVR can use linear or nonlinear kernels (polynomial, RBF,
sigmoid).
o RBF kernel is most common.

Advantages

 Works well for both linear and nonlinear regression (using kernels).
 Robust to outliers (if ε is large enough).
 Effective in high-dimensional spaces.
Disadvantages

 Computationally expensive for large datasets.


 Choice of kernel and parameters (C, ε, γ) greatly affects performance.
 Harder to interpret compared to linear models.

Python Example (Using RBF Kernel)

import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVR

# Generate sample data


np.random.seed(42)
X = np.sort(5 * np.random.rand(40, 1), axis=0)
y = np.sin(X).ravel() + 0.2 * np.random.randn(40)

# Fit SVR model


svr_rbf = SVR(kernel='rbf', C=100, epsilon=0.1)
svr_rbf.fit(X, y)

# Predict
X_test = np.linspace(0, 5, 100).reshape(-1, 1)
y_pred = svr_rbf.predict(X_test)

# Plot results
plt.scatter(X, y, color='blue', label="Training data")
plt.plot(X_test, y_pred, color='red', label="SVR Prediction")
plt.legend()
plt.show()
Topic 9: Decision Tree Regression
What is Decision Tree Regression?

 A Decision Tree Regressor is a machine learning model that predicts


continuous values by splitting the dataset into smaller and smaller
regions.
 It works like a flowchart, where each internal node represents a decision
on a feature, each branch represents the outcome, and each leaf node
gives a predicted value (usually the mean of target values in that region).

How Does It Work?

1. The dataset is split based on a feature that minimizes variance in target


values.
2. The process repeats recursively → creating a tree.
3. At each leaf node, the prediction is usually the average of all training
samples in that node.

Decision Tree Regression Objective

The model tries to minimize the Mean Squared Error (MSE) at each split:
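A standard way to write the impurity of a node t containing n_t samples, with \bar{y}_t the mean target value in that node:

\text{MSE}(t) = \frac{1}{n_t} \sum_{i \in t} \left( y_i - \bar{y}_t \right)^2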

At each step, it chooses the feature & threshold that reduces MSE the most.

Key Parameters

 max_depth → Maximum depth of the tree (prevents overfitting).


 min_samples_split → Minimum samples required to split a node.
 min_samples_leaf → Minimum samples required in a leaf node.
 max_features → Number of features to consider for best split.

Advantages

 Easy to understand and visualize.


 Nonlinear relationships are captured naturally.
 No need for feature scaling (unlike SVR or linear regression).
Disadvantages

 Prone to overfitting if tree is deep.


 Small changes in data can drastically change the tree (high variance)
 Not smooth → predictions are step-wise constant, not continuous.

Python Example
import numpy as np

import matplotlib.pyplot as plt

from sklearn.tree import DecisionTreeRegressor

# Generate sample data

np.random.seed(42)

X = np.sort(5 * np.random.rand(80, 1), axis=0)

y = np.sin(X).ravel() + 0.2 * np.random.randn(80)

# Fit Decision Tree Regressor

tree_reg = DecisionTreeRegressor(max_depth=3)

tree_reg.fit(X, y)

# Predictions

X_test = np.linspace(0, 5, 200).reshape(-1, 1)

y_pred = tree_reg.predict(X_test)

# Plot

plt.scatter(X, y, color="blue", label="Training data")


plt.plot(X_test, y_pred, color="red", label="Decision Tree Prediction")

plt.legend()

plt.show()

Topic 10: Random Forest Regression


What is Random Forest Regression?

 Random Forest Regressor is an ensemble learning method that builds


multiple Decision Trees and averages their predictions.
 It’s based on the idea:

"A group of weak learners (decision trees) can work together to form a
strong learner."

 Unlike a single Decision Tree (which can overfit), Random Forest
reduces variance by combining many trees.

How It Works

1. Bootstrap Sampling (Bagging):


o From the training dataset, multiple random samples are drawn with
replacement.
o Each sample trains a separate decision tree.
2. Random Feature Selection:
o At each split, a random subset of features is considered (not all
features).
o This makes trees less correlated and improves generalization.
3. Prediction:
o For regression: the final output is the average of predictions from
all trees.
Random Forest Regression Objective
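With B trees f_1, \dots, f_B, each trained on a bootstrap sample, the forest's prediction is simply the average of the individual tree predictions:

\hat{y}(x) = \frac{1}{B} \sum_{b=1}^{B} f_b(x)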

Key Parameters

 n_estimators → Number of trees in the forest.


 max_depth → Maximum depth of each tree.
 min_samples_split / min_samples_leaf → Control overfitting.
 max_features → Number of features considered at each split.
 bootstrap → Whether to sample with replacement (default=True).

Advantages

 Reduces overfitting (better than a single tree).


 Works well with both linear & nonlinear data.
 Robust to noise and outliers.
 Handles high-dimensional data well.

Disadvantages

 Slower training & prediction (many trees).


 Less interpretable than a single decision tree.
 Still memory-intensive for very large datasets.

Python Example

import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor

# Generate sample data


np.random.seed(42)
X = np.sort(5 * np.random.rand(80, 1), axis=0)
y = np.sin(X).ravel() + 0.2 * np.random.randn(80)

# Fit Random Forest Regressor


forest_reg = RandomForestRegressor(n_estimators=100, max_depth=5,
random_state=42)
forest_reg.fit(X, y)

# Predictions
X_test = np.linspace(0, 5, 200).reshape(-1, 1)
y_pred = forest_reg.predict(X_test)

# Plot
plt.scatter(X, y, color="blue", label="Training data")
plt.plot(X_test, y_pred, color="red", label="Random Forest Prediction")
plt.legend()
plt.show()

 The curve will be smoother than Decision Tree Regression, since it


averages multiple trees.

Decision Tree vs Random Forest


Aspect | Decision Tree | Random Forest
Variance | High (overfits easily) | Low (averaging reduces variance)
Bias | Low | Slightly higher
Interpretability | Easy | Hard (black box)
Accuracy | Moderate | High
Robustness | Sensitive to noise | Robust

In short:
Random Forest Regression = Bagging (Bootstrap + Aggregation) of
Decision Trees.
It improves accuracy, reduces overfitting, and works well in practice.
Topic 11: Gradient Boosting Regression (XGBoost, LightGBM, CatBoost)
What is Gradient Boosting Regression?

 Gradient Boosting Regression is an ensemble method that builds decision trees sequentially, with each new tree trained to correct the residual errors of the trees built so far.
How Gradient Boosting Works

1. Start with an initial prediction (e.g., mean of y).


2. Compute residuals (errors).
3. Fit a decision tree to residuals.
4. Update the prediction by adding the weighted tree output (see the formula after this list).
5. Repeat for M iterations.
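These steps can be written compactly as an additive model, where h_m is the tree fitted to the residuals at stage m and \nu is the learning rate:

F_0(x) = \bar{y}, \qquad F_m(x) = F_{m-1}(x) + \nu \, h_m(x), \quad m = 1, \dots, M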

Key Parameters

 n_estimators → number of boosting stages (trees).


 learning_rate → shrinks contribution of each tree (smaller → more trees
needed).
 max_depth → depth of each tree (controls complexity).
 subsample → fraction of training samples used per tree (reduces
overfitting).
 loss → e.g., MSE for regression.
Advantages

 Very high predictive accuracy.


 Handles nonlinearity and interactions well.
 Works well with mixed feature types.
 Can assign feature importance.

Disadvantages

 Slower training than Random Forest (sequential).


 Sensitive to hyperparameters (learning_rate, n_estimators, etc.).
 Can overfit if not tuned properly.

Variants of Gradient Boosting

1. XGBoost (Extreme Gradient Boosting)


 Highly optimized gradient boosting.
 Features: Regularization (L1 & L2), parallelization, handling missing
values.
 Very popular in Kaggle competitions.

from xgboost import XGBRegressor

model = XGBRegressor(n_estimators=200, learning_rate=0.1, max_depth=4,


random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

2. LightGBM (Light Gradient Boosting Machine)


 Faster than XGBoost (uses leaf-wise tree growth instead of level-wise).
 Efficient for large datasets with many features.

import lightgbm as lgb

model = lgb.LGBMRegressor(n_estimators=200, learning_rate=0.1,


max_depth=-1, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
3. CatBoost
 Gradient boosting optimized for categorical features.
 Automatically handles categorical encoding (no need for one-hot).
 Often outperforms others on datasets with many categorical variables.

from catboost import CatBoostRegressor

model = CatBoostRegressor(iterations=200, learning_rate=0.1, depth=6,


verbose=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

Comparison of Boosting Methods


Feature | XGBoost | LightGBM | CatBoost
Speed | Fast | Very fast | Moderate
Large Datasets | ✅ | ✅ (best) | ✅
Categorical Features | ❌ Needs encoding | ❌ Needs encoding | ✅ Native support
Memory Usage | High | Low | Medium
Accuracy | High | Very high | Very high
Best Use Case | General-purpose | Very large datasets | Datasets with categorical variables

When to Use What?

 XGBoost → Default choice if you want accuracy + efficiency.


 LightGBM → Best for big data (millions of rows).
 CatBoost → Best for datasets with lots of categorical features.

In short:
Gradient Boosting Regression builds trees sequentially, correcting previous
errors.

 XGBoost: general powerhouse.


 LightGBM: super fast for large-scale data.
 CatBoost: best for categorical-heavy data.
Topic 12: k-Nearest Neighbours (k-NN) Regression
What is k-NN Regression?

 k-Nearest Neighbours (k-NN) is a non-parametric, instance-based


regression method.
 Instead of learning a function explicitly, it stores the training data and
predicts outputs based on the k closest data points to a query point.
 For regression, prediction is usually the average of the target values of
the nearest neighbors.

How k-NN Regression Works

1. Choose the number of neighbors k.


2. Given a test input, compute the distance (usually Euclidean) to all
training samples.
3. Select the k nearest neighbors.
4. Prediction = average (or weighted average) of their target values.

Sometimes, neighbors are weighted by distance (closer points get higher


weight).
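For a query point x with neighbor set N_k(x), the uniform and distance-weighted predictions are:

\hat{y}(x) = \frac{1}{k} \sum_{i \in N_k(x)} y_i \qquad \text{or} \qquad \hat{y}(x) = \frac{\sum_{i \in N_k(x)} w_i \, y_i}{\sum_{i \in N_k(x)} w_i}, \quad w_i = \frac{1}{d(x, x_i)}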

Key Parameters

 k (number of neighbors):
o Small k → more flexible, but can overfit (high variance).
o Large k → smoother predictions, but may underfit (high bias).
 Distance metric:
o Euclidean (default)
o Manhattan, Minkowski, cosine, etc.
 Weighting scheme:
o "uniform" → all neighbors contribute equally.
o "distance" → closer neighbors contribute more.
Advantages
 Simple and intuitive.
 Works for nonlinear relationships.
 No training time (lazy learner).

Disadvantages

 Prediction is slow for large datasets (needs distance calculation).


 Sensitive to irrelevant features & feature scaling.
 Choice of k strongly affects performance.

Python Example
import numpy as np

import matplotlib.pyplot as plt

from sklearn.neighbors import KNeighborsRegressor

# Generate sample data

np.random.seed(42)

X = np.sort(5 * np.random.rand(80, 1), axis=0)

y = np.sin(X).ravel() + 0.2 * np.random.randn(80)

# Fit k-NN Regressor

knn = KNeighborsRegressor(n_neighbors=5, weights='uniform')

knn.fit(X, y)

# Predictions

X_test = np.linspace(0, 5, 200).reshape(-1, 1)

y_pred = knn.predict(X_test)
# Plot

plt.scatter(X, y, color="blue", label="Training data")

plt.plot(X_test, y_pred, color="red", label="k-NN Prediction")

plt.legend()

plt.show()

Choosing k

 Small k (e.g., 1, 2, 3) → captures noise, high variance.


 Large k (e.g., >20) → smoother predictions, may miss patterns.
 Typically, cross-validation is used to select the best k.

Comparison with Other Regression Models


Model | Global/Local | Smoothness | Feature Scaling Needed
Linear Regression | Global | Smooth | ✅
Decision Tree | Local | Step function | ❌
Random Forest | Local + Averaged | Smooth | ❌
SVR | Global + Local (kernels) | Smooth | ✅
k-NN | Local | Piecewise smooth | ✅

In short:
k-NN Regression = Predict using the average of k nearest neighbors.
It’s simple, flexible, and works well on small datasets, but struggles with high
dimensions and large datasets.
Topic 13: Multivariate Regression
What is Multivariate Regression?

 Multivariate regression is a regression model that predicts two or more dependent (output) variables simultaneously from one or more independent variables, unlike multiple regression, which has several inputs but a single output.
General Equation
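In matrix form, with n samples, p predictors, and q target variables:

Y_{n \times q} = X_{n \times (p+1)} \, B_{(p+1) \times q} + E_{n \times q}

where X is the design matrix (including a column of ones for the intercept), B the coefficient matrix, and E the error matrix.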

Applications

 Predicting exam scores in multiple subjects based on study hours,


attendance, and sleep.
 Predicting multiple health indicators (blood pressure, sugar level, BMI)
from lifestyle factors.
 Forecasting sales across different product categories based on
advertising spend and season.

Python Example (Multivariate Regression with scikit-learn)


import numpy as np
from sklearn.linear_model import LinearRegression

# Independent variables (features)


X = np.array([[1, 2],
[2, 3],
[3, 4],
[4, 5]]) # shape = (4 samples, 2 features)

# Dependent variables (multiple outputs)


Y = np.array([[2, 3],
[3, 5],
[4, 7],
[5, 9]]) # shape = (4 samples, 2 targets)

# Create model
model = LinearRegression()
model.fit(X, Y)

# Predictions
Y_pred = model.predict(X)

print("Coefficients:", model.coef_)
print("Intercept:", model.intercept_)
print("Predictions:\n", Y_pred)

Advantages

o Handles multiple dependent variables at once


o Captures relationships among multiple outputs
o Useful when outputs are correlated (e.g., predicting temperature &
humidity together)

Disadvantages

o More complex than simple/multiple regression


o Needs more data to estimate parameters reliably
o Sensitive to multi-collinearity & outliers

Topic 14: Logistic Regression


Logistic Regression is a supervised learning algorithm used for classification
problems.
Unlike Linear Regression, which predicts continuous values, Logistic
Regression predicts the probability of a class label (e.g., Yes/No, Spam/Not
Spam, 0/1).
Why “Logistic”?
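The name comes from the logistic (sigmoid) function, which maps the linear combination of inputs z = \beta_0 + \beta_1 x_1 + \dots + \beta_n x_n to a probability between 0 and 1:

P(y = 1 \mid x) = \sigma(z) = \frac{1}{1 + e^{-z}}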

Decision Rule
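With the default threshold of 0.5, the predicted class is:

\hat{y} = \begin{cases} 1 & \text{if } P(y = 1 \mid x) \ge 0.5 \\ 0 & \text{otherwise} \end{cases}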

Cost Function (Log Loss)
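For m training examples with predicted probabilities \hat{p}_i, the log loss (binary cross-entropy) is:

J(\beta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y_i \log \hat{p}_i + (1 - y_i) \log (1 - \hat{p}_i) \right]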


Advantages
o Simple and easy to implement
o Works well for binary classification problems
o Outputs probability scores (not just labels)
o Efficient for linearly separable data

Disadvantages
o Assumes linear decision boundary (may fail for complex data)
o Not suitable for non-linear relationships unless combined with feature
engineering
o Can be sensitive to outliers
o Struggles with high-dimensional datasets without regularization

Applications
 Spam detection (Spam / Not Spam)
 Disease prediction (Diabetic / Not Diabetic)
 Customer churn prediction (Leave / Stay)
 Credit risk modeling (Default / No Default)

Python code
# Import libraries

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix,
classification_report

# Load Iris dataset


iris = datasets.load_iris()
X = iris.data[:, :2] # take only first two features for visualization
y = (iris.target != 0) * 1 # convert to binary: class 0 vs others

# Split into train & test sets


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
random_state=42)

# Create and train Logistic Regression model


model = LogisticRegression()
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)

# Evaluation
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))
# Decision Boundary Visualization
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 200),
np.linspace(y_min, y_max, 200))

Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

plt.contourf(xx, yy, Z, alpha=0.3, cmap=plt.cm.Paired)


plt.scatter(X[:, 0], X[:, 1], c=y, edgecolors='k', cmap=plt.cm.Paired)
plt.xlabel("Sepal length")
plt.ylabel("Sepal width")
plt.title("Logistic Regression Decision Boundary")
plt.show()
What this code does:

 Loads Iris dataset


 Converts into a binary classification problem (class 0 vs not class 0)
 Trains a Logistic Regression model
 Prints accuracy, confusion matrix, classification report
 Plots the decision boundary

Topic 16: Support Vector Machine (SVM)

What is SVM?

 Support Vector Machine (SVM) is a supervised learning algorithm


used for classification (and also for regression → SVR).
 The main idea:
Find a hyperplane that best separates data into classes with the
maximum margin.

Key Concepts in SVM


1. Hyperplane

 In 2D → a line that separates classes.


 In 3D → a plane.
 In n-dimensions → a hyperplane.

2. Margin

 Distance between the hyperplane and the nearest data points from each
class.
 SVM tries to maximize this margin → “Maximum Margin Classifier”.

3. Support Vectors

 The data points that lie closest to the hyperplane.


 They “support” or define the position of the hyperplane.

SVM Objective Function
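For the linearly separable (hard-margin) case, the margin is maximized by minimizing the norm of the weight vector w:

\min_{w, b} \ \frac{1}{2} \lVert w \rVert^2 \quad \text{subject to} \quad y_i (w \cdot x_i + b) \ge 1 \ \ \forall i

The soft-margin version adds slack variables \xi_i and a penalty term C \sum_i \xi_i to tolerate some misclassification.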


Kernel Trick

 Real-world data is often not linearly separable.


 SVM uses kernels to map data into higher dimensions where separation
is possible.

Common kernels:
 Linear: good for simple, linearly separable data.
 Polynomial: allows curved boundaries.
 RBF (Radial Basis Function / Gaussian): most common, good for
nonlinear data.
 Sigmoid: similar to neural networks.

Parameters in SVM

 C (Regularization parameter):
o High C → less margin violation (overfits).
o Low C → wider margin, more tolerance to misclassification.
 Kernel parameters (e.g., γ in RBF):
o High γ → more complex decision boundary (overfits).
o Low γ → smoother boundary (underfits).
Advantages

 Works well in high-dimensional spaces.


 Effective even when number of features > number of samples.
 Flexible with kernels for nonlinear problems.

Disadvantages

 Training can be slow for large datasets.


 Choice of kernel & parameters (C, γ) is tricky.
 Hard to interpret compared to decision trees.

Python Example (Classification)

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.svm import SVC

# Generate sample data (binary classification)


X, y = datasets.make_classification(n_features=2, n_redundant=0,
n_informative=2,random_state=42, n_clusters_per_class=1)

# Train SVM with RBF kernel


svm = SVC(kernel='rbf', C=1, gamma=0.5)
svm.fit(X, y)

# Predict
xx, yy = np.meshgrid(np.linspace(X[:,0].min()-1, X[:,0].max()+1, 200),
np.linspace(X[:,1].min()-1, X[:,1].max()+1, 200))
Z = svm.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
# Plot
plt.contourf(xx, yy, Z, alpha=0.3)
plt.scatter(X[:,0], X[:,1], c=y, edgecolors='k')
plt.title("SVM Decision Boundary (RBF Kernel)")
plt.show()

You’ll see a nonlinear boundary separating the two classes.

SVM vs Other Models


Model | Main Idea | Handles Nonlinear? | Feature Scaling Needed
Logistic Regression | Linear boundary | ❌ | ✅
Decision Tree | Splitting rules | ✅ | ❌
Random Forest | Ensemble of trees | ✅ | ❌
SVM | Maximum margin hyperplane | ✅ (with kernels) | ✅
k-NN | Neighbors voting | ✅ | ✅

In short:
SVM = Maximum Margin Classifier.
It finds the hyperplane that best separates classes, using kernels for nonlinear
problems.

Topic: Difference between SVM (Classification) and SVR (Regression)
Key Difference: Purpose

 SVM (classification): Find a hyperplane that best separates data into


classes with maximum margin.
 SVR (regression): Fit a function that predicts continuous values while
keeping errors within an ε-tube (margin of tolerance).
Comparison Table
Aspect | SVM (Classification) | SVR (Regression)
Goal | Separate data into distinct classes | Predict continuous values
Output | Class labels (e.g., 0/1, -1/+1) | Real-valued predictions (e.g., price, temperature)
Margin | Maximize margin between classes | Allow deviations within ε (epsilon-insensitive tube)
Support Vectors | Points closest to the hyperplane that define the boundary | Points lying outside the ε-tube (errors)
Loss Function | Hinge loss | ε-insensitive loss
Decision Function | sign(w · x + b) | f(x) = w · x + b (with ε constraints)
Regularization (C) | Controls margin violations (misclassified points) | Controls tolerance for errors beyond ε
Use Case | Spam detection, image recognition, sentiment analysis | Stock prediction, house price prediction, demand forecasting

Visual Intuition

 SVM (classification): Finds a hyperplane → all points on one side are


“Class A”, the other side “Class B”. Support vectors define the boundary.
 SVR (regression): Finds a regression line/curve → points inside the ε-
tube are considered “good enough”, only those outside contribute to the
error.

Quick Example

 SVM: Given email data → classify as spam or not spam.


 SVR: Given house features → predict house price (a number).

In short:
 SVM → Classification (separates categories).
 SVR → Regression (predicts continuous values).

Topic 17: Decision Tree Classifier


What is a Decision Tree Classifier?

 A Decision Tree Classifier is a supervised learning model used for


classification tasks.
 It works like a flowchart:
o Internal nodes → represent decisions based on feature values.
o Branches → represent outcomes of those decisions.
o Leaf nodes → represent final class labels.

Example: "If Age > 30 and Income = High → Predict: Yes (buys product)"

How Does It Work?

1. The dataset is split into subsets based on a feature that best separates the
classes.
2. Splitting is done recursively until a stopping condition is met (e.g., max
depth, min samples).
3. The final classification is assigned at leaf nodes (majority class of
samples).

Splitting Criteria (Impurity Measures)


To decide the best feature to split, decision trees use measures of
purity/impurity:
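The two most common measures, for a node with class proportions p_k, are:

\text{Gini}(t) = 1 - \sum_{k} p_k^2 \qquad \text{Entropy}(t) = -\sum_{k} p_k \log_2 p_k

The chosen split is the one that gives the largest impurity reduction (information gain) from the parent node to the weighted child nodes.
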
Key Parameters

 criterion → “gini” (default) or “entropy”.


 max_depth → maximum depth of the tree.
 min_samples_split → minimum samples required to split a node.
 min_samples_leaf → minimum samples per leaf.
 max_features → number of features considered at each split.

Advantages

 Easy to understand & interpret.


 Captures nonlinear relationships.
 Works for numerical & categorical data.
 No need for feature scaling.

Disadvantages

 Prone to overfitting if not pruned.


 Unstable (small changes in data → different tree).
 Greedy splits may not give global optimum.

Python Example

import numpy as np
import matplotlib.pyplot as plt

from sklearn import datasets

from sklearn.tree import DecisionTreeClassifier, plot_tree

# Load dataset (binary classification: 2 classes)

X, y = datasets.make_classification(n_features=2, n_redundant=0,
n_informative=2, n_clusters_per_class=1, random_state=42)

# Train Decision Tree

clf = DecisionTreeClassifier(max_depth=3, criterion='gini')

clf.fit(X, y)

# Plot decision tree

plt.figure(figsize=(10,6))

plot_tree(clf, filled=True, feature_names=["Feature1", "Feature2"],


class_names=["Class 0", "Class 1"])

plt.show()

Topic 18: Random Forest Classifier

What is Random Forest Classifier?

 A Random Forest Classifier is an ensemble learning method that


combines multiple decision trees to improve classification accuracy and
reduce overfitting.
 It uses the idea of “wisdom of the crowd”: many weak models (trees)
together form a strong model.

How Does It Work?

1. Bootstrap Sampling (Bagging):


o From the training data, multiple random samples are drawn with
replacement.
o Each sample is used to train a separate decision tree.
2. Random Feature Selection:
o At each split, instead of considering all features, a random subset
of features is considered.
o This adds more diversity among trees and reduces correlation.
3. Voting:
o Each tree predicts a class label.
o The final prediction is made by majority voting (most common
class).

Key Parameters in Sklearn

 n_estimators: number of trees (default = 100).


 criterion: “gini” or “entropy” for measuring split quality.
 max_depth: max depth of trees.
 max_features: number of features to consider at each split (default =
√features for classification).
 bootstrap: whether bootstrap sampling is used (default = True).

Advantages

 Reduces overfitting compared to a single decision tree.


 Works well with high-dimensional data.
 Handles missing values and categorical + numerical features.
 Provides feature importance scores.

Disadvantages

 Slower and more memory-intensive than a single tree.


 Less interpretable compared to one decision tree.
 May still overfit on very noisy data.

Python Example

import numpy as np

from sklearn.datasets import load_iris

from sklearn.model_selection import train_test_split

from sklearn.ensemble import RandomForestClassifier


from sklearn.metrics import accuracy_score, classification_report

# Load dataset

iris = load_iris()

X, y = iris.data, iris.target

# Train-test split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,


random_state=42)

# Random Forest Classifier

rf = RandomForestClassifier(n_estimators=100, max_depth=3, random_state=42)

rf.fit(X_train, y_train)

# Predictions

y_pred = rf.predict(X_test)

# Evaluation

print("Accuracy:", accuracy_score(y_test, y_pred))

print("Classification Report:\n", classification_report(y_test, y_pred))

print("Feature Importances:", rf.feature_importances_)

Random Forest Classifier vs Decision Tree Classifier


Aspect | Decision Tree | Random Forest
Model | Single tree | Ensemble of many trees
Variance | High (unstable) | Low (averaging reduces variance)
Bias | Can be low if deep | Slightly higher
Overfitting | Very likely | Much less likely
Interpretability | Easy | Harder

In short:
Random Forest Classifier = A collection of decision trees trained on random
subsets of data + features, combined by majority vote to improve accuracy and
robustness.

Topic 19: Naïve Bayes

What is Naïve Bayes?

 Naïve Bayes is a probabilistic classifier based on Bayes’ Theorem with


a strong (naïve) assumption that features are conditionally independent
given the class.
 Despite the “naïve” assumption, it works surprisingly well in practice,
especially for text classification (spam detection, sentiment analysis,
document categorization).

Bayes’ Theorem
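For a class C and feature vector X = (x_1, \dots, x_n):

P(C \mid X) = \frac{P(X \mid C) \, P(C)}{P(X)}

With the naïve independence assumption, the posterior is proportional to P(C) \prod_{i=1}^{n} P(x_i \mid C).
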
Types of Naïve Bayes Classifiers

1. Gaussian Naïve Bayes


o Assumes features follow a normal distribution.
o Useful for continuous data.
2. Multinomial Naïve Bayes
o Suitable for discrete counts (e.g., word counts in text
classification).
3. Bernoulli Naïve Bayes
o Suitable for binary features (word presence/absence).

Example Intuition

Suppose we want to classify an email as Spam (S) or Not Spam (N).

 Features: presence of words like “free”, “money”, “win”.


 Using Naïve Bayes:
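Under the independence assumption, the posterior for each class is proportional to the prior times the per-word likelihoods, for example:

P(S \mid \text{free}, \text{money}, \text{win}) \propto P(S) \, P(\text{free} \mid S) \, P(\text{money} \mid S) \, P(\text{win} \mid S)

and similarly for N (Not Spam).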

We choose the class with the highest posterior probability.

Advantages

 Simple, fast, and efficient.


 Works well with high-dimensional data (e.g., text).
 Requires small amount of training data.
 Performs surprisingly well even if independence assumption is violated.

Disadvantages

 Assumes independence of features (often unrealistic).


 Struggles with correlated features.
 Not great for continuous variables unless distribution assumption holds.

Python Example
from sklearn.datasets import load_iris

from sklearn.model_selection import train_test_split


from sklearn.naive_bayes import GaussianNB

from sklearn.metrics import accuracy_score, classification_report

# Load dataset

iris = load_iris()

X, y = iris.data, iris.target

# Train-test split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,


random_state=42)

# Train Naïve Bayes classifier

nb = GaussianNB()

nb.fit(X_train, y_train)

# Predictions

y_pred = nb.predict(X_test)

# Evaluation

print("Accuracy:", accuracy_score(y_test, y_pred))

print("Classification Report:\n", classification_report(y_test, y_pred))

Naïve Bayes vs Logistic Regression


Aspect | Naïve Bayes | Logistic Regression
Approach | Probabilistic (Bayes' theorem) | Discriminative (direct decision boundary)
Assumption | Feature independence | No independence assumption
Output | Posterior probabilities | Class probabilities via sigmoid/softmax
Speed | Very fast | Slower for large data
Use case | Text classification, spam filtering | General classification problems

In short:
Naïve Bayes = A simple probabilistic classifier that assumes features are
independent given the class, and applies Bayes’ theorem to make predictions.

Topic 20: Clustering (Unsupervised Learning)


Clustering is an unsupervised learning technique that groups similar data
points together without labels.

✅ Goal: Discover hidden patterns or groupings in data.


✅ Input: Only features (no output labels).
✅ Output: Clusters (groups of similar items).

🔹 Common Clustering Algorithms:

 K-Means Clustering → partitions data into k clusters.


 Hierarchical Clustering → builds a tree-like cluster structure
(dendrogram).
 DBSCAN (Density-Based) → finds clusters of varying shape/density.
 Gaussian Mixture Models (GMM) → probabilistic clustering.
🔹 Applications:

 Customer segmentation in marketing (e.g., high spender, medium


spender, budget shopper).
 Document clustering (news articles, research papers).
 Image segmentation (separating objects in images).
 Anomaly detection (fraud detection, outlier detection).

(1) K-Means Clustering

K-Means is one of the most popular unsupervised learning algorithms used


for clustering.

Idea
 Divide data into K clusters (groups).
 Each cluster is represented by its centroid (mean position of points).
 Points are assigned to the cluster with the nearest centroid.

Algorithm Steps (Lloyd’s Algorithm)


1. Choose K → the number of clusters.
2. Initialize centroids → randomly pick K points as cluster centers.
3. Assign points → each data point goes to the nearest centroid.
4. Update centroids → calculate new centroid of each cluster.
5. Repeat steps 3–4 until:
o Centroids don’t change much, OR
o Max iterations reached.

Mathematical Objective
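K-Means minimizes the within-cluster sum of squares (WCSS, also called inertia), where \mu_k is the centroid of cluster C_k:

J = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2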

Advantages
o Simple and easy to implement
o Scales well to large datasets
o Works well for spherical, well-separated clusters

Disadvantages
o Must choose K beforehand
o Sensitive to initial centroids (can converge to local minima)
o Not good for non-spherical clusters or clusters with different densities
o Sensitive to outliers

Python Example: K-Means


import numpy as np

import matplotlib.pyplot as plt

from sklearn.datasets import make_blobs

from sklearn.cluster import KMeans

# Generate sample data

X, y = make_blobs(n_samples=300, centers=4, cluster_std=0.60,


random_state=0)

# Apply K-Means

kmeans = KMeans(n_clusters=4, random_state=0)

y_kmeans = kmeans.fit_predict(X)

# Plot clusters

plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=30, cmap='viridis')


centers = kmeans.cluster_centers_

plt.scatter(centers[:, 0], centers[:, 1], c='red', s=200, alpha=0.7, marker='X')

plt.title("K-Means Clustering")

plt.show()

This code:

 Generates synthetic data (4 clusters).


 Runs K-Means with k=4.
 Plots clusters with centroids (red X marks).

(2) Hierarchical Clustering


Hierarchical Clustering is an unsupervised learning algorithm that builds a
hierarchy of clusters.
Unlike K-Means, you don’t need to pre-specify K (number of clusters).

It produces a dendrogram (tree-like diagram) showing how clusters are merged


or split.

Types of Hierarchical Clustering


1. Agglomerative (Bottom-Up)
o Start with each point as its own cluster.
o Iteratively merge the closest clusters until only one cluster
remains.
o Most common approach.
2. Divisive (Top-Down)
o Start with all points in one cluster.
o Iteratively split clusters until each point is its own cluster.

How Clusters Are Merged (Linkage Criteria)


Different ways to define “distance” between clusters:

 Single Linkage → Minimum distance between points of two clusters


 Complete Linkage → Maximum distance between points of two clusters
 Average Linkage → Average distance between all points in clusters
 Ward’s Method → Minimize variance within clusters (most popular)

Advantages
o No need to pre-define number of clusters (can cut dendrogram at any
level)
o Produces a dendrogram (visual interpretability)
o Can capture non-spherical clusters

Disadvantages
o Computationally expensive (O(n²)) → not good for very large datasets
o Sensitive to noise and outliers
o Once a merge/split is done, it cannot be undone

Applications
 Document clustering
 Gene expression data analysis (bioinformatics)
 Market research (grouping customers)

Python Example: Hierarchical Clustering

import matplotlib.pyplot as plt


import numpy as np
from sklearn.datasets import make_blobs
from scipy.cluster.hierarchy import dendrogram, linkage, fcluster

# Generate synthetic data


X, y = make_blobs(n_samples=50, centers=3, random_state=42,
cluster_std=0.8)

# Perform hierarchical clustering (Agglomerative)


Z = linkage(X, method='ward') # 'ward', 'single', 'complete', 'average'

# Plot dendrogram
plt.figure(figsize=(8, 5))
dendrogram(Z)
plt.title("Hierarchical Clustering Dendrogram")
plt.xlabel("Data Points")
plt.ylabel("Distance")
plt.show()

# Cut the dendrogram to form clusters (e.g., 3 clusters)


clusters = fcluster(Z, 3, criterion='maxclust')

# Plot clustered data


plt.scatter(X[:, 0], X[:, 1], c=clusters, cmap='viridis')
plt.title("Hierarchical Clustering Result")
plt.show()

This code:

 Creates synthetic data.


 Runs Agglomerative Clustering with Ward’s method.
 Plots a dendrogram.
 Cuts the dendrogram to form 3 clusters, and visualizes them.

Topic: DBSCAN
What is DBSCAN?

 DBSCAN is a density-based clustering algorithm.


 It groups together data points that are close to each other (high-density
regions) and marks points in low-density regions as noise/outliers.
 Unlike k-means, it does not require specifying the number of clusters
in advance.

Key Concepts in DBSCAN

DBSCAN uses two important parameters:

1. ε (epsilon):
o The maximum distance between two points for them to be
considered as neighbors.
2. MinPts (minimum points):
o Minimum number of points required to form a dense region (i.e., a
cluster).

Based on these, points are classified into three categories:

 Core Point: Has at least MinPts neighbors within radius ε.


 Border Point: Not a core point, but within the neighborhood of a core
point.
 Noise Point (Outlier): Neither a core point nor a border point.

DBSCAN Algorithm (Steps)

1. Pick a random unvisited point.


2. Check how many points are within ε distance.
o If ≥ MinPts → mark as Core Point and form a cluster.
o If < MinPts → mark as Noise (may later become a border point).
3. Expand the cluster by recursively including all density-reachable points.
4. Repeat until all points are visited.

Advantages

 No need to specify number of clusters (unlike K-means).


 Can find clusters of arbitrary shape (not just spherical).
 Automatically detects outliers (noise points).

Disadvantages

 Sensitive to choice of ε and MinPts.


 Struggles with varying density clusters.
 Performance can degrade on high-dimensional data.

Python Example (DBSCAN with sklearn)

import numpy as np

import matplotlib.pyplot as plt

from sklearn.datasets import make_moons

from sklearn.cluster import DBSCAN


# Generate toy dataset (two moons shape)

X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

# Apply DBSCAN

dbscan = DBSCAN(eps=0.2, min_samples=5)

labels = dbscan.fit_predict(X)

# Plot clusters

plt.scatter(X[:,0], X[:,1], c=labels, cmap='plasma', s=50)

plt.title("DBSCAN Clustering")

plt.show()

DBSCAN vs K-Means
Aspect | K-Means | DBSCAN
Clusters required | Yes (must specify k) | No (automatic)
Cluster shape | Spherical (convex) | Arbitrary
Handles noise | ❌ No | ✅ Yes
Sensitive to scale | ✅ Yes | ✅ Yes (ε depends on scale)

In short:
DBSCAN groups dense regions of points into clusters and marks sparse regions
as noise, making it great for data with irregular cluster shapes and outliers.
Topic: GMM
What is GMM?

 Gaussian Mixture Model (GMM) is a probabilistic model that


assumes data is generated from a mixture of several Gaussian (Normal)
distributions with unknown parameters.
 Each cluster is represented by a Gaussian distribution (bell curve in
higher dimensions → ellipses).
 Unlike k-means, GMM allows clusters to have different shapes, sizes,
and orientations.
Key Idea
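The data density is modelled as a weighted sum of K Gaussians, where \pi_k are the mixing weights (\sum_k \pi_k = 1) and each component has its own mean \mu_k and covariance \Sigma_k:

p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)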

How GMM Works (EM Algorithm)

1. Initialize the means, covariances, and mixing weights of the K Gaussians.
2. E-step: for each point, compute the responsibility (posterior probability) of each Gaussian component.
3. M-step: re-estimate the means, covariances, and mixing weights using those responsibilities.
4. Repeat the E and M steps until the log-likelihood converges.

Advantages

 Can model elliptical clusters (more flexible than K-means).


 Provides soft clustering (probabilities instead of hard assignments).
 Handles overlapping clusters well.

Disadvantages

 Need to specify number of clusters KKK.


 Sensitive to initialization.
 Can struggle with very high-dimensional data.
 Assumes data follows Gaussian distributions.

Python Example (Scikit-learn)

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Generate synthetic data


X, _ = make_blobs(n_samples=500, centers=3, cluster_std=1.0,
random_state=42)

# Fit GMM
gmm = GaussianMixture(n_components=3, random_state=42)
labels = gmm.fit_predict(X)

# Plot results
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', s=40)
plt.scatter(gmm.means_[:, 0], gmm.means_[:, 1], c='red', marker='x', s=100)

# cluster centers
plt.title("Gaussian Mixture Model Clustering")
plt.show()

GMM vs K-Means
Aspect | K-Means | GMM
Cluster shape | Spherical | Elliptical
Assignment | Hard (each point → one cluster) | Soft (probability for each cluster)
Distribution assumption | None | Gaussian
Flexibility | Less | More (captures covariance)

In short:
GMM is a soft-clustering algorithm that models data as a mixture of Gaussian
distributions, using probabilities to assign points to clusters. It’s more flexible
than K-means, especially for elliptical or overlapping clusters.

Topic 8: Association Rules (Market Basket Analysis)


Association rule learning is an unsupervised learning method used to find
relationships (if-then rules) between variables in large datasets.

o Goal: Identify co-occurrence patterns in data.


o Input: Transactions or sets of items.
o Output: Rules in the form IF (Antecedent) → THEN (Consequent)

Key Measures:
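The three standard measures, also used in the Apriori example later, are:

Support(A → B) = P(A ∪ B): the fraction of transactions containing both A and B.
Confidence(A → B) = Support(A ∪ B) / Support(A): how often B appears in transactions that contain A.
Lift(A → B) = Confidence(A → B) / Support(B): lift > 1 indicates a positive association between A and B.
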
Algorithms:
 Apriori Algorithm
 FP-Growth Algorithm

Applications:

 Market Basket Analysis: “If a customer buys bread, they are likely to
buy butter.”
 Recommender systems (Amazon, Netflix suggestions).
 Medical data analysis (disease co-occurrence).
 Website clickstream analysis.

Summary:
 Clustering → Groups similar items (unsupervised, pattern discovery).
 Association Rules → Finds co-occurrence relationships (if-then
patterns).

Topic 9: Apriori Algorithm (Association Rule Mining)


The Apriori Algorithm is an unsupervised learning algorithm used for
association rule mining.
It helps discover frequent itemsets and association rules from transaction
datasets (like supermarket purchases).

Steps of Apriori Algorithm


1. Set Minimum Support & Confidence (thresholds).
2. Find Frequent Itemsets:
o Generate candidate itemsets of length 1, 2, 3 …
o Keep only those meeting minimum support.
3. Generate Association Rules:
o From frequent itemsets, generate rules that meet minimum
confidence.

Example (Market Basket Data)


Transactions:

 T1: {Milk, Bread, Butter}


 T2: {Bread, Butter}
 T3: {Milk, Bread}
 T4: {Milk, Butter}
 T5: {Bread, Butter}

Frequent Itemsets (Support ≥ 0.4):

 {Milk} → 0.6
 {Bread} → 0.8
 {Butter} → 0.8
 {Milk, Bread} → 0.4
 {Bread, Butter} → 0.6

Rules (Confidence ≥ 0.6):

 {Bread} → {Butter}, Confidence = 0.75, Lift > 1

Python Example: Apriori


import pandas as pd

from mlxtend.frequent_patterns import apriori, association_rules

# Sample dataset (Transactions)

dataset = [

['Milk', 'Bread', 'Butter'],

['Bread', 'Butter'],

['Milk', 'Bread'],

['Milk', 'Butter'],

['Bread', 'Butter']
]

# Convert dataset to one-hot encoded DataFrame


from mlxtend.preprocessing import TransactionEncoder

te = TransactionEncoder()

te_array = te.fit(dataset).transform(dataset)

df = pd.DataFrame(te_array, columns=te.columns_)

# Apply Apriori

frequent_itemsets = apriori(df, min_support=0.4, use_colnames=True)

# Generate association rules

rules = association_rules(frequent_itemsets, metric="confidence",


min_threshold=0.6)

print("Frequent Itemsets:\n", frequent_itemsets)

print("\nAssociation Rules:\n",
rules[['antecedents','consequents','support','confidence','lift']])

This code:
 Creates a simple shopping dataset.
 Converts it into one-hot encoding.
 Uses Apriori to find frequent itemsets (support ≥ 0.4).
 Generates association rules (confidence ≥ 0.6).
