Unit 4: Classification and Regression
Topic 1: Supervised Learning vs Unsupervised Learning
Feature | Supervised Learning | Unsupervised Learning
Definition | Learning with labeled data (input + correct output provided). | Learning with unlabeled data (only input data, no output labels).
Goal | Learn a mapping function from input → output. | Discover hidden patterns, structure, or grouping in data.
Data Requirement | Requires a large labeled dataset. | Works with unlabeled data.
Output | Predictions (classification or regression). | Clusters, associations, dimensionality reduction.
Examples of Algorithms | Linear Regression, Logistic Regression, Decision Trees, Random Forest, Support Vector Machines (SVM), Neural Networks | K-Means Clustering, Hierarchical Clustering, DBSCAN, Principal Component Analysis (PCA), Autoencoders
Evaluation | Easy to evaluate (compare predictions with true labels). | Harder to evaluate (no ground truth).
Applications | Spam email detection (Spam / Not Spam), Disease diagnosis (Diabetic / Not), Credit score prediction | Customer segmentation, Market basket analysis, Document/topic clustering, Image compression
Topic 2: Classification of Supervised and Unsupervised Algorithms in ML
(A) Supervised Learning Algorithms
(Supervised = Data has labels → we predict output)
(1) Regression Algorithms (Predict Continuous Values)
Linear Regression
Polynomial Regression
Ridge Regression (L2 Regularization)
Lasso Regression (L1 Regularization)
Elastic Net Regression
Support Vector Regression (SVR)
Decision Tree Regression
Random Forest Regression
Gradient Boosting Regression (XGBoost, LightGBM, CatBoost)
k-Nearest Neighbors (k-NN) Regression
Neural Networks (Deep Learning for regression)
(2) Classification Algorithms (Predict Categories / Classes)
Logistic Regression
k-Nearest Neighbors (k-NN)
Support Vector Machine (SVM)
Decision Tree Classifier
Random Forest Classifier
Naïve Bayes
Gradient Boosting Classifiers (XGBoost, LightGBM, CatBoost)
Neural Networks (Deep Learning for classification)
(B) Unsupervised Learning Algorithms
(Unsupervised = Data has no labels → we find patterns, structure, groups)
(1) Clustering
K-Means Clustering
Hierarchical Clustering (Agglomerative, Divisive)
DBSCAN (Density-Based Spatial Clustering)
Mean-Shift Clustering
Gaussian Mixture Models (GMM, Expectation-Maximization)
Self-Organizing Maps (SOM)
(2) Association Rule Learning
Apriori Algorithm
FP-Growth Algorithm
Eclat Algorithm
(3) Dimensionality Reduction
Principal Component Analysis (PCA)
Singular Value Decomposition (SVD)
Independent Component Analysis (ICA)
t-SNE (t-distributed Stochastic Neighbor Embedding)
UMAP (Uniform Manifold Approximation and Projection)
Summary:
Supervised → Predict outcomes (classification, regression).
Unsupervised → Discover hidden structure (clustering, associations,
dimensionality reduction).
Topic 3: Learning Steps in Machine Learning
Whether for classification or regression, the general ML pipeline is similar.
1. Problem Definition
Define the task: classification, regression, clustering, etc.
Example: Predict whether a customer will churn (Yes/No).
2. Data Collection
Gather dataset from sources (databases, APIs, sensors, files).
Example: Customer details, usage patterns, and churn status.
3. Data Preprocessing
Handle missing values, duplicates, outliers.
Encode categorical variables (e.g., One-Hot Encoding, Label Encoding).
Feature scaling (Normalization/Standardization).
4. Splitting Dataset
Train set (e.g., 70–80%) → used for model training.
Test set (e.g., 20–30%) → used for model evaluation.
Sometimes a validation set (for hyperparameter tuning).
5. Model Selection
Choose algorithm (Logistic Regression, Decision Tree, SVM, etc.).
Consider accuracy, interpretability, speed, and scalability.
6. Model Training
Feed training data to the algorithm.
Model learns patterns (weights/parameters).
7. Model Evaluation
Test on unseen data (test set).
Use metrics like:
o Accuracy
o Precision, Recall, F1-Score
o Confusion Matrix
o ROC Curve, AUC
8. Model Optimization
Tune hyperparameters (Grid Search, Random Search).
Use regularization to prevent overfitting.
9. Deployment
Deploy model into production (web app, mobile app, embedded system).
Monitor performance over time.
10. Maintenance & Updates
Retrain the model with new data to keep it accurate.
So, in short:
Problem → Data → Preprocessing → Train/Test Split → Model →
Training → Evaluation → Deployment → Maintenance
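The pipeline above can be condensed into a few lines of scikit-learn. The sketch below only illustrates steps 4–7 (split → model → training → evaluation), using the built-in Iris dataset as assumed stand-in data:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
# Load data (stand-in for the Data Collection step)
X, y = load_iris(return_X_y=True)
# Step 4: train/test split (80/20)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Steps 5–6: choose a model and train it
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
# Step 7: evaluate on unseen data
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))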
Topic 4: Linear Regression
Linear Regression is a supervised learning algorithm used for predicting a
continuous dependent variable (Y) based on one or more independent variables
(X).
It assumes a linear relationship between input features and output.
Equation of Linear Regression
y = β0 + β1x1 + β2x2 + … + βnxn + ε
where β0 is the intercept, β1 … βn are the coefficients (slopes) of the features, and ε is the error term.
Types of Linear Regression
1. Simple Linear Regression – one independent variable.
2. Multiple Linear Regression – more than one independent variable.
3. Polynomial Regression – models non-linear relationships by adding polynomial terms of the features (see Topic 5).
4. Ridge & Lasso Regression – regularized forms to prevent overfitting.
Cost Function (Loss Function)
Training minimizes the Mean Squared Error (MSE) between predictions and actual values:
J(β) = (1/n) Σ (yi − ŷi)²
Training the Model
The coefficients β are chosen to minimize this cost, either analytically (Ordinary Least Squares / Normal Equation) or iteratively (Gradient Descent).
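As a small illustration of the training step (a sketch with made-up toy numbers, not part of the original notes), the OLS coefficients can be computed directly with the normal equation β = (XᵀX)⁻¹ Xᵀ y:
import numpy as np
# Toy data: y ≈ 2 + 3x plus a little noise (illustrative values only)
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([5.1, 7.9, 11.2, 13.8, 17.1])
# Add a column of ones so the intercept is learned as a coefficient
X_b = np.c_[np.ones((X.shape[0], 1)), X]
# Normal equation: beta = (X^T X)^(-1) X^T y
beta = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y
print("Intercept:", beta[0], "Slope:", beta[1])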
Advantages of Linear Regression
1. Simplicity & Interpretability
o Very easy to understand and implement.
o Coefficients clearly show the effect of each feature.
2. Fast & Efficient
o Computationally inexpensive, works well even on large datasets.
3. Less Data Hungry
o Can work with relatively small datasets compared to deep learning.
4. Good for Linearly Separable Data
o Performs well when there is a clear linear relationship between
features and target.
5. Baseline Model
o Often used as a starting point for regression tasks before trying
more complex models.
Disadvantages of Linear Regression
1. Assumption of Linearity
o Works poorly if the relationship between variables is non-linear.
2. Sensitive to Outliers
o A single extreme value can significantly affect the regression line.
3. Multicollinearity Problem
o If independent variables are highly correlated, it affects stability
and interpretation of coefficients.
4. Poor at Complex Relationships
o Cannot capture interactions and non-linear patterns without
modification (e.g., polynomial regression).
5. Assumption Requirements
o Requires assumptions like homoscedasticity (constant variance), normal
distribution of errors, and independence of observations
Python Example (Simple Linear Regression)
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
# Sample data
X = np.array([1, 2, 3, 4, 5]).reshape(-1, 1) # independent variable
y = np.array([2, 4, 5, 4, 5]) # dependent variable
# Create and train model
model = LinearRegression()
model.fit(X, y)
# Predictions
y_pred = model.predict(X)
# Plot
plt.scatter(X, y, color="blue", label="Actual data")
plt.plot(X, y_pred, color="red", label="Regression line")
plt.xlabel("X")
plt.ylabel("y")
plt.legend()
plt.show()
# Coefficients
print("Intercept:", model.intercept_)
print("Slope:", model.coef_[0])
👉 In short:
Use Linear Regression when the relationship is simple, linear, and you
need interpretability.
Avoid it for complex, non-linear, or high-dimensional problems.
Topic 5: Polynomial Regression
Polynomial Regression is a type of supervised learning regression algorithm.
It is an extension of Linear Regression where the relationship between
independent variable(s) X and dependent variable Y is non-linear.
Equation
y = β0 + β1x + β2x² + … + βnxⁿ + ε
The model is a polynomial of degree n in x, but it is still linear in the coefficients β, so it can be fitted with ordinary linear regression on the transformed features.
When to Use
When data shows a non-linear trend.
Example: Predicting growth curves, demand curves, temperature patterns,
etc.
Steps in Polynomial Regression
1. Transform the input feature(s) into polynomial features (x, x², …, xⁿ).
2. Fit an ordinary linear regression model on the transformed features.
3. Choose the degree n carefully (e.g., via cross-validation) to balance underfitting and overfitting.
Advantages
o Captures non-linear relationships.
o Simple to implement (just an extension of linear regression).
Disadvantages
o Can easily overfit if degree is too high.
o Extrapolation (prediction outside range) is unreliable.
o More computational cost with higher degrees.
Python Example: Polynomial Regression
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
# Generate sample data
X = np.linspace(0, 10, 50).reshape(-1, 1)
y = 2 + 0.5*X + 1.5*X**2 + np.random.randn(50, 1)*3 # quadratic with noise
# Transform features for polynomial regression (degree=2)
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)
# Fit linear regression on transformed features
model = LinearRegression()
model.fit(X_poly, y)
# Predict
y_pred = model.predict(X_poly)
# Plot
plt.scatter(X, y, color='blue', label="Data")
plt.plot(X, y_pred, color='red', linewidth=2, label="Polynomial Regression")
plt.legend()
plt.title("Polynomial Regression (Degree=2)")
plt.show()
Topic 6: Ridge Regression (L2 Regularization)
What is Ridge Regression?
Ridge Regression is a linear regression technique that uses L2
regularization.
It modifies the ordinary least squares (OLS) regression by adding a
penalty term to the loss function.
The penalty discourages large coefficients (weights), which helps prevent
overfitting.
Ridge Regression Objective Function
J(β) = Σ (yi − ŷi)² + λ Σ βj²
where the first term is the usual sum of squared errors, λ ≥ 0 is the regularization strength, and the penalty λ Σ βj² shrinks the coefficients toward zero.
🔹 Effect of λ (Regularization Parameter)
λ = 0 → Equivalent to standard linear regression (no penalty).
Small λ → Slight shrinkage, keeps model flexible.
Large λ → Strong shrinkage, coefficients close to zero, model becomes
simpler.
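A small sketch (illustrative synthetic data, not from the notes) of this shrinkage effect, sweeping alpha (scikit-learn's name for λ):
import numpy as np
from sklearn.linear_model import Ridge
np.random.seed(0)
X = np.random.rand(50, 3)
y = 5 * X[:, 0] + 2 * X[:, 1] + np.random.randn(50) * 0.5  # third feature is irrelevant
for alpha in [0.01, 1, 10, 100]:  # alpha plays the role of λ
    ridge = Ridge(alpha=alpha)
    ridge.fit(X, y)
    print(f"alpha={alpha}: coefficients = {np.round(ridge.coef_, 3)}")
# As alpha grows, the coefficients shrink toward zero but never become exactly zero.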
Advantages
Reduces overfitting by controlling large coefficients.
Works well when predictors are highly correlated (multicollinearity).
Always has a unique solution (since penalty ensures matrix invertibility).
Disadvantages
Doesn’t perform feature selection (unlike Lasso, coefficients don’t
become exactly zero).
Still assumes linear relationship between predictors and target.
Python Example
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
# Generate sample data
np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X.ravel() + np.random.randn(100)  # 1-D target: y ≈ 4 + 3x + noise
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=42)
# Ridge Regression
ridge = Ridge(alpha=10) # alpha = λ
ridge.fit(X_train, y_train)
# Predictions
y_pred = ridge.predict(X_test)
# Evaluation
print("MSE:", mean_squared_error(y_test, y_pred))
print("Coefficients:", ridge.coef_)
print("Intercept:", ridge.intercept_)
In short:
Ridge Regression = Linear Regression + L2 penalty.
It shrinks coefficients but never eliminates them, making it effective for
reducing overfitting while keeping all predictors.
Topic 7: Lasso Regression (L1 Regularization)
What is Lasso Regression?
Lasso (Least Absolute Shrinkage and Selection Operator) is a type of
linear regression that uses L1 regularization.
It adds a penalty equal to the absolute value of the coefficients to the
loss function.
Unlike Ridge, Lasso can shrink some coefficients exactly to zero →
acts as a method for feature selection.
Lasso Regression Objective Function
J(β) = Σ (yi − ŷi)² + λ Σ |βj|
where λ ≥ 0 controls how strongly the absolute values of the coefficients are penalized; a large enough λ drives some coefficients exactly to zero.
Effect of λ
λ = 0 → Equivalent to Linear Regression (no penalty).
Small λ → Slight shrinkage, model remains flexible.
Large λ → Many coefficients shrink to zero → sparse model (feature
selection).
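A similar sketch (illustrative synthetic data, not from the notes) showing coefficients being driven exactly to zero as alpha (λ) grows:
import numpy as np
from sklearn.linear_model import Lasso
np.random.seed(0)
X = np.random.rand(100, 5)
y = 3 * X[:, 0] + 2 * X[:, 1] + np.random.randn(100) * 0.1  # only the first 2 features matter
for alpha in [0.001, 0.01, 0.1, 0.5]:  # alpha plays the role of λ
    lasso = Lasso(alpha=alpha)
    lasso.fit(X, y)
    print(f"alpha={alpha}: coefficients = {np.round(lasso.coef_, 3)}")
# Larger alpha → more coefficients become exactly 0 → automatic feature selection.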
Advantages
Performs feature selection (unlike Ridge).
Helps when there are many irrelevant features.
Produces simpler and more interpretable models.
Disadvantages
Can struggle when predictors are highly correlated (randomly selects
one).
If number of predictors > number of samples, Lasso selects at most n
features.
Biased estimates (shrinks coefficients too much).
Python Example
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
# Generate sample data
np.random.seed(42)
X = np.random.rand(100, 5) # 5 features
y = 3*X[:,0] + 2*X[:,1] + np.random.randn(100)  # only the first 2 features are relevant
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=42)
# Lasso Regression
lasso = Lasso(alpha=0.1) # alpha = λ
lasso.fit(X_train, y_train)
# Predictions
y_pred = lasso.predict(X_test)
# Evaluation
print("MSE:", mean_squared_error(y_test, y_pred))
print("Coefficients:", lasso.coef_)
print("Intercept:", lasso.intercept_)
Topic 8: Support Vector Regression (SVR)
What is Support Vector Regression (SVR)?
SVR is the regression counterpart of Support Vector Machine (SVM).
Instead of finding a hyperplane to separate classes (like in classification),
SVR tries to find a function that predicts continuous values within a
margin of tolerance.
It focuses on fitting the data while ignoring small errors (within a
certain threshold).
🔹 SVR Key Idea
In linear regression, we minimize sum of squared errors.
In SVR, we minimize the error only when it exceeds a certain margin
(ε).
SVR tries to keep predictions within a tube of radius ε (epsilon-
insensitive zone).
🔹 SVR Objective Function
Minimize (1/2)||w||² + C Σ (ξi + ξi*)
subject to: yi − (w·xi + b) ≤ ε + ξi, (w·xi + b) − yi ≤ ε + ξi*, with ξi, ξi* ≥ 0.
Errors smaller than ε cost nothing (the ε-insensitive zone); the slack variables ξi, ξi* measure how far points fall outside the tube, and C trades off flatness against tolerated errors.
Parameters in SVR
C (Regularization parameter)
o Large C → less tolerance for errors (low bias, high variance).
o Small C → more tolerance, simpler model.
ε (Epsilon)
o Defines the "tube" around the regression line.
o Larger ε → fewer support vectors, simpler model.
Kernel
o SVR can use linear or nonlinear kernels (polynomial, RBF,
sigmoid).
o RBF kernel is most common.
Advantages
Works well for both linear and nonlinear regression (using kernels).
Robust to outliers (if ε is large enough).
Effective in high-dimensional spaces.
Disadvantages
Computationally expensive for large datasets.
Choice of kernel and parameters (C, ε, γ) greatly affects performance.
Harder to interpret compared to linear models.
Python Example (Using RBF Kernel)
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVR
# Generate sample data
np.random.seed(42)
X = np.sort(5 * np.random.rand(40, 1), axis=0)
y = np.sin(X).ravel() + 0.2 * np.random.randn(40)
# Fit SVR model
svr_rbf = SVR(kernel='rbf', C=100, epsilon=0.1)
svr_rbf.fit(X, y)
# Predict
X_test = np.linspace(0, 5, 100).reshape(-1, 1)
y_pred = svr_rbf.predict(X_test)
# Plot results
plt.scatter(X, y, color='blue', label="Training data")
plt.plot(X_test, y_pred, color='red', label="SVR Prediction")
plt.legend()
plt.show()
Topic 9: Decision Tree Regression
What is Decision Tree Regression?
A Decision Tree Regressor is a machine learning model that predicts
continuous values by splitting the dataset into smaller and smaller
regions.
It works like a flowchart, where each internal node represents a decision
on a feature, each branch represents the outcome, and each leaf node
gives a predicted value (usually the mean of target values in that region).
How Does It Work?
1. The dataset is split based on a feature that minimizes variance in target
values.
2. The process repeats recursively → creating a tree.
3. At each leaf node, the prediction is usually the average of all training
samples in that node.
Decision Tree Regression Objective
The model tries to minimize the Mean Squared Error (MSE) at each split:
MSE = (1/n) Σ (yi − ȳ)², where ȳ is the mean target value of the samples in a node.
At each step, it chooses the feature & threshold that reduces MSE the most.
Key Parameters
max_depth → Maximum depth of the tree (prevents overfitting).
min_samples_split → Minimum samples required to split a node.
min_samples_leaf → Minimum samples required in a leaf node.
max_features → Number of features to consider for best split.
Advantages
Easy to understand and visualize.
Nonlinear relationships are captured naturally.
No need for feature scaling (unlike SVR or linear regression).
Disadvantages
Prone to overfitting if tree is deep.
Small changes in data can drastically change the tree (high variance).
Not smooth → predictions are step-wise constant, not continuous.
Python Example
import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeRegressor
# Generate sample data
np.random.seed(42)
X = np.sort(5 * np.random.rand(80, 1), axis=0)
y = np.sin(X).ravel() + 0.2 * np.random.randn(80)
# Fit Decision Tree Regressor
tree_reg = DecisionTreeRegressor(max_depth=3)
tree_reg.fit(X, y)
# Predictions
X_test = np.linspace(0, 5, 200).reshape(-1, 1)
y_pred = tree_reg.predict(X_test)
# Plot
plt.scatter(X, y, color="blue", label="Training data")
plt.plot(X_test, y_pred, color="red", label="Decision Tree Prediction")
plt.legend()
plt.show()
Topic 10: Random Forest Regression
What is Random Forest Regression?
Random Forest Regressor is an ensemble learning method that builds
multiple Decision Trees and averages their predictions.
It’s based on the idea:
"A group of weak learners (decision trees) can work together to form a
strong learner."
Unlike a single Decision Tree (which can overfit), Random Forest reduces variance by combining many trees.
How It Works
1. Bootstrap Sampling (Bagging):
o From the training dataset, multiple random samples are drawn with
replacement.
o Each sample trains a separate decision tree.
2. Random Feature Selection:
o At each split, a random subset of features is considered (not all
features).
o This makes trees less correlated and improves generalization.
3. Prediction:
o For regression: the final output is the average of predictions from
all trees.
Random Forest Regression Objective
The final prediction is simply the average of the individual trees' predictions:
ŷ(x) = (1/B) Σ ŷb(x), where B is the number of trees and ŷb is the prediction of tree b.
Key Parameters
n_estimators → Number of trees in the forest.
max_depth → Maximum depth of each tree.
min_samples_split / min_samples_leaf → Control overfitting.
max_features → Number of features considered at each split.
bootstrap → Whether to sample with replacement (default=True).
Advantages
Reduces overfitting (better than a single tree).
Works well with both linear & nonlinear data.
Robust to noise and outliers.
Handles high-dimensional data well.
Disadvantages
Slower training & prediction (many trees).
Less interpretable than a single decision tree.
Still memory-intensive for very large datasets.
Python Example
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
# Generate sample data
np.random.seed(42)
X = np.sort(5 * np.random.rand(80, 1), axis=0)
y = np.sin(X).ravel() + 0.2 * np.random.randn(80)
# Fit Random Forest Regressor
forest_reg = RandomForestRegressor(n_estimators=100, max_depth=5,
random_state=42)
forest_reg.fit(X, y)
# Predictions
X_test = np.linspace(0, 5, 200).reshape(-1, 1)
y_pred = forest_reg.predict(X_test)
# Plot
plt.scatter(X, y, color="blue", label="Training data")
plt.plot(X_test, y_pred, color="red", label="Random Forest Prediction")
plt.legend()
plt.show()
The curve will be smoother than Decision Tree Regression, since it
averages multiple trees.
Decision Tree vs Random Forest
Aspect | Decision Tree | Random Forest
Variance | High (overfits easily) | Low (averaging reduces variance)
Bias | Low | Slightly higher
Interpretability | Easy | Hard (black box)
Accuracy | Moderate | High
Robustness | Sensitive to noise | Robust
In short:
Random Forest Regression = Bagging (Bootstrap + Aggregation) of
Decision Trees.
It improves accuracy, reduces overfitting, and works well in practice.
Topic 11: Gradient Boosting Regression (XGBoost, LightGBM, CatBoost)
What is Gradient Boosting Regression?
Gradient Boosting Regression is an ensemble method that builds decision trees sequentially: each new tree is fitted to the residual errors of the current model, so the ensemble gradually corrects its own mistakes.
How Gradient Boosting Works
1. Start with an initial prediction (e.g., mean of y).
2. Compute residuals (errors).
3. Fit a decision tree to residuals.
4. Update prediction by adding weighted tree output.
5. Repeat for M iterations.
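A hand-rolled sketch of this residual-fitting loop (an illustration on synthetic data under simple assumptions, not the optimized implementation used by the libraries below; the library snippets further down assume a train/test split such as the one created here):
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
# Illustrative synthetic data
np.random.seed(42)
X = np.sort(5 * np.random.rand(200, 1), axis=0)
y = np.sin(X).ravel() + 0.2 * np.random.randn(200)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
learning_rate = 0.1
n_estimators = 100
# Step 1: start with an initial prediction (mean of y)
pred_train = np.full_like(y_train, y_train.mean())
pred_test = np.full(X_test.shape[0], y_train.mean())
for _ in range(n_estimators):
    # Step 2: compute residuals (errors of the current model)
    residuals = y_train - pred_train
    # Step 3: fit a small tree to the residuals
    tree = DecisionTreeRegressor(max_depth=3)
    tree.fit(X_train, residuals)
    # Step 4: update predictions by adding the weighted tree output
    pred_train += learning_rate * tree.predict(X_train)
    pred_test += learning_rate * tree.predict(X_test)
# Step 5 happened inside the loop (M = n_estimators iterations)
print("Test MSE:", np.mean((y_test - pred_test) ** 2))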
Key Parameters
n_estimators → number of boosting stages (trees).
learning_rate → shrinks contribution of each tree (smaller → more trees
needed).
max_depth → depth of each tree (controls complexity).
subsample → fraction of training samples used per tree (reduces
overfitting).
loss → e.g., MSE for regression.
Advantages
Very high predictive accuracy.
Handles nonlinearity and interactions well.
Works well with mixed feature types.
Can assign feature importance.
Disadvantages
Slower training than Random Forest (sequential).
Sensitive to hyperparameters (learning_rate, n_estimators, etc.).
Can overfit if not tuned properly.
Variants of Gradient Boosting
1. XGBoost (Extreme Gradient Boosting)
Highly optimized gradient boosting.
Features: Regularization (L1 & L2), parallelization, handling missing
values.
Very popular in Kaggle competitions.
from xgboost import XGBRegressor
model = XGBRegressor(n_estimators=200, learning_rate=0.1, max_depth=4,
random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
2. LightGBM (Light Gradient Boosting Machine)
Faster than XGBoost (uses leaf-wise tree growth instead of level-wise).
Efficient for large datasets with many features.
import lightgbm as lgb
model = lgb.LGBMRegressor(n_estimators=200, learning_rate=0.1,
max_depth=-1, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
3. CatBoost
Gradient boosting optimized for categorical features.
Automatically handles categorical encoding (no need for one-hot).
Often outperforms others on datasets with many categorical variables.
from catboost import CatBoostRegressor
model = CatBoostRegressor(iterations=200, learning_rate=0.1, depth=6,
verbose=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
Comparison of Boosting Methods
Feature | XGBoost | LightGBM | CatBoost
Speed | Fast | Very fast | Moderate
Large Datasets | ✅ | ✅ (best) | ✅
Categorical Features | ❌ Needs encoding | ❌ Needs encoding | ✅ Native support
Memory Usage | High | Low | Medium
Accuracy | High | Very high | Very high
Best Use Case | General-purpose | Very large datasets | Datasets with categorical variables
When to Use What?
XGBoost → Default choice if you want accuracy + efficiency.
LightGBM → Best for big data (millions of rows).
CatBoost → Best for datasets with lots of categorical features.
In short:
Gradient Boosting Regression builds trees sequentially, correcting previous
errors.
XGBoost: general powerhouse.
LightGBM: super fast for large-scale data.
CatBoost: best for categorical-heavy data.
Topic 12: k-Nearest Neighbours (k-NN) Regression
What is k-NN Regression?
k-Nearest Neighbours (k-NN) is a non-parametric, instance-based
regression method.
Instead of learning a function explicitly, it stores the training data and
predicts outputs based on the k closest data points to a query point.
For regression, prediction is usually the average of the target values of
the nearest neighbors.
How k-NN Regression Works
1. Choose the number of neighbors k.
2. Given a test input, compute the distance (usually Euclidean) to all
training samples.
3. Select the k nearest neighbors.
4. Prediction = average (or weighted average) of their target values.
Sometimes, neighbors are weighted by distance (closer points get higher
weight).
Key Parameters
k (number of neighbors):
o Small k → more flexible, but can overfit (high variance).
o Large k → smoother predictions, but may underfit (high bias).
Distance metric:
o Euclidean (default)
o Manhattan, Minkowski, cosine, etc.
Weighting scheme:
o "uniform" → all neighbors contribute equally.
o "distance" → closer neighbors contribute more.
Advantages
Simple and intuitive.
Works for nonlinear relationships.
No training time (lazy learner).
Disadvantages
Prediction is slow for large datasets (needs distance calculation).
Sensitive to irrelevant features & feature scaling.
Choice of k strongly affects performance.
Python Example
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsRegressor
# Generate sample data
np.random.seed(42)
X = np.sort(5 * np.random.rand(80, 1), axis=0)
y = np.sin(X).ravel() + 0.2 * np.random.randn(80)
# Fit k-NN Regressor
knn = KNeighborsRegressor(n_neighbors=5, weights='uniform')
knn.fit(X, y)
# Predictions
X_test = np.linspace(0, 5, 200).reshape(-1, 1)
y_pred = knn.predict(X_test)
# Plot
plt.scatter(X, y, color="blue", label="Training data")
plt.plot(X_test, y_pred, color="red", label="k-NN Prediction")
plt.legend()
plt.show()
Choosing k
Small k (e.g., 1, 2, 3) → captures noise, high variance.
Large k (e.g., >20) → smoother predictions, may miss patterns.
Typically, cross-validation is used to select the best k.
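A short sketch (illustrative synthetic data, not from the notes) of selecting k by cross-validation with scikit-learn's GridSearchCV:
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import GridSearchCV
np.random.seed(42)
X = np.sort(5 * np.random.rand(80, 1), axis=0)
y = np.sin(X).ravel() + 0.2 * np.random.randn(80)
# Try several values of k and keep the one with the best cross-validated score
param_grid = {'n_neighbors': [1, 3, 5, 7, 11, 15, 21]}
grid = GridSearchCV(KNeighborsRegressor(), param_grid, cv=5, scoring='neg_mean_squared_error')
grid.fit(X, y)
print("Best k:", grid.best_params_['n_neighbors'])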
Comparison with Other Regression Models
Model | Global/Local | Smoothness | Feature Scaling Needed
Linear Regression | Global | Smooth | ✅
Decision Tree | Local | Step function | ❌
Random Forest | Local + Averaged | Smooth | ❌
SVR | Global + Local (kernels) | Smooth | ✅
k-NN | Local | Piecewise smooth | ✅
In short:
k-NN Regression = Predict using the average of k nearest neighbors.
It’s simple, flexible, and works well on small datasets, but struggles with high
dimensions and large datasets.
Topic 13: Multivariate Regression
What is Multivariate Regression?
Multivariate Regression is a regression model that predicts two or more dependent (output) variables simultaneously from one or more independent variables.
General Equation
Y = XB + E, where Y is the matrix of outputs (one column per target), X is the matrix of input features, B is the matrix of regression coefficients, and E is the matrix of errors.
Applications
Predicting exam scores in multiple subjects based on study hours,
attendance, and sleep.
Predicting multiple health indicators (blood pressure, sugar level, BMI)
from lifestyle factors.
Forecasting sales across different product categories based on
advertising spend and season.
Python Example (Multivariate Regression with scikit-learn)
import numpy as np
from sklearn.linear_model import LinearRegression
# Independent variables (features)
X = np.array([[1, 2],
[2, 3],
[3, 4],
[4, 5]]) # shape = (4 samples, 2 features)
# Dependent variables (multiple outputs)
Y = np.array([[2, 3],
[3, 5],
[4, 7],
[5, 9]]) # shape = (4 samples, 2 targets)
# Create model
model = LinearRegression()
model.fit(X, Y)
# Predictions
Y_pred = model.predict(X)
print("Coefficients:", model.coef_)
print("Intercept:", model.intercept_)
print("Predictions:\n", Y_pred)
Advantages
o Handles multiple dependent variables at once
o Captures relationships among multiple outputs
o Useful when outputs are correlated (e.g., predicting temperature &
humidity together)
Disadvantages
o More complex than simple/multiple regression
o Needs more data to estimate parameters reliably
o Sensitive to multi-collinearity & outliers
Topic 14: Logistic Regression
Logistic Regression is a supervised learning algorithm used for classification
problems.
Unlike Linear Regression, which predicts continuous values, Logistic
Regression predicts the probability of a class label (e.g., Yes/No, Spam/Not
Spam, 0/1).
Why “Logistic”?
It uses the logistic (sigmoid) function to squash any real-valued score into a probability between 0 and 1:
p = σ(z) = 1 / (1 + e^(−z)), where z = w·x + b.
Decision Rule
Predict class 1 if p ≥ 0.5, otherwise predict class 0 (the 0.5 threshold can be adjusted if needed).
Cost Function (Log Loss)
J = −(1/m) Σ [ y log(p) + (1 − y) log(1 − p) ]
which penalizes confident wrong predictions heavily.
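A tiny numeric sketch (hypothetical scores and labels, not from the notes) of the sigmoid, the 0.5 decision rule, and the log loss:
import numpy as np
def sigmoid(z):
    return 1 / (1 + np.exp(-z))
# Hypothetical scores z = w·x + b for three samples
z = np.array([-2.0, 0.3, 4.0])
p = sigmoid(z)                      # predicted probabilities of class 1
pred = (p >= 0.5).astype(int)       # decision rule with threshold 0.5
y_true = np.array([0, 1, 1])        # assumed true labels
log_loss = -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
print("Probabilities:", np.round(p, 3))
print("Predictions:", pred)
print("Log loss:", round(log_loss, 3))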
Advantages
o Simple and easy to implement
o Works well for binary classification problems
o Outputs probability scores (not just labels)
o Efficient for linearly separable data
Disadvantages
o Assumes linear decision boundary (may fail for complex data)
o Not suitable for non-linear relationships unless combined with feature
engineering
o Can be sensitive to outliers
o Struggles with high-dimensional datasets without regularization
Applications
Spam detection (Spam / Not Spam)
Disease prediction (Diabetic / Not Diabetic)
Customer churn prediction (Leave / Stay)
Credit risk modeling (Default / No Default)
Python code
# Import libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
# Load Iris dataset
iris = datasets.load_iris()
X = iris.data[:, :2] # take only first two features for visualization
y = (iris.target != 0) * 1 # convert to binary: class 0 vs others
# Split into train & test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
random_state=42)
# Create and train Logistic Regression model
model = LogisticRegression()
model.fit(X_train, y_train)
# Predictions
y_pred = model.predict(X_test)
# Evaluation
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))
# Decision Boundary Visualization
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 200),
np.linspace(y_min, y_max, 200))
Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.contourf(xx, yy, Z, alpha=0.3, cmap=plt.cm.Paired)
plt.scatter(X[:, 0], X[:, 1], c=y, edgecolors='k', cmap=plt.cm.Paired)
plt.xlabel("Sepal length")
plt.ylabel("Sepal width")
plt.title("Logistic Regression Decision Boundary")
plt.show()
What this code does:
Loads Iris dataset
Converts into a binary classification problem (class 0 vs not class 0)
Trains a Logistic Regression model
Prints accuracy, confusion matrix, classification report
Plots the decision boundary
Topic 16: Support Vector Machine (SVM)
What is SVM?
Support Vector Machine (SVM) is a supervised learning algorithm
used for classification (and also for regression → SVR).
The main idea:
Find a hyperplane that best separates data into classes with the
maximum margin.
Key Concepts in SVM
1. Hyperplane
In 2D → a line that separates classes.
In 3D → a plane.
In n-dimensions → a hyperplane.
2. Margin
Distance between the hyperplane and the nearest data points from each
class.
SVM tries to maximize this margin → “Maximum Margin Classifier”.
3. Support Vectors
The data points that lie closest to the hyperplane.
They “support” or define the position of the hyperplane.
SVM Objective Function
Minimize (1/2)||w||² + C Σ ξi
subject to yi (w·xi + b) ≥ 1 − ξi, with ξi ≥ 0.
Maximizing the margin 2/||w|| is equivalent to minimizing ||w||; the slack variables ξi allow some points to violate the margin, and C controls how heavily violations are penalized.
Kernel Trick
Real-world data is often not linearly separable.
SVM uses kernels to map data into higher dimensions where separation
is possible.
Common kernels:
Linear: good for simple, linearly separable data.
Polynomial: allows curved boundaries.
RBF (Radial Basis Function / Gaussian): most common, good for
nonlinear data.
Sigmoid: similar to neural networks.
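A short sketch (assumed example data via make_moons, not from the notes) comparing a linear kernel with an RBF kernel on data that is not linearly separable:
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
# Two interleaving half-moons: not separable by a straight line
X, y = make_moons(n_samples=300, noise=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
for kernel in ['linear', 'rbf']:
    clf = SVC(kernel=kernel, C=1.0)
    clf.fit(X_train, y_train)
    print(f"{kernel} kernel accuracy: {clf.score(X_test, y_test):.2f}")
# The RBF kernel typically scores clearly higher here, since it can bend the boundary.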
Parameters in SVM
C (Regularization parameter):
o High C → less margin violation (overfits).
o Low C → wider margin, more tolerance to misclassification.
Kernel parameters (e.g., γ in RBF):
o High γ → more complex decision boundary (overfits).
o Low γ → smoother boundary (underfits).
Advantages
Works well in high-dimensional spaces.
Effective even when number of features > number of samples.
Flexible with kernels for nonlinear problems.
Disadvantages
Training can be slow for large datasets.
Choice of kernel & parameters (C, γ) is tricky.
Hard to interpret compared to decision trees.
Python Example (Classification)
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.svm import SVC
# Generate sample data (binary classification)
X, y = datasets.make_classification(n_features=2, n_redundant=0,
n_informative=2,random_state=42, n_clusters_per_class=1)
# Train SVM with RBF kernel
svm = SVC(kernel='rbf', C=1, gamma=0.5)
svm.fit(X, y)
# Predict
xx, yy = np.meshgrid(np.linspace(X[:,0].min()-1, X[:,0].max()+1, 200),
np.linspace(X[:,1].min()-1, X[:,1].max()+1, 200))
Z = svm.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
# Plot
plt.contourf(xx, yy, Z, alpha=0.3)
plt.scatter(X[:,0], X[:,1], c=y, edgecolors='k')
plt.title("SVM Decision Boundary (RBF Kernel)")
plt.show()
You’ll see a nonlinear boundary separating the two classes.
SVM vs Other Models
Model | Main Idea | Handles Nonlinear? | Feature Scaling Needed
Logistic Regression | Linear boundary | ❌ | ✅
Decision Tree | Splitting rules | ✅ | ❌
Random Forest | Ensemble of trees | ✅ | ❌
SVM | Maximum margin hyperplane | ✅ (with kernels) | ✅
k-NN | Neighbors voting | ✅ | ✅
In short:
SVM = Maximum Margin Classifier.
It finds the hyperplane that best separates classes, using kernels for nonlinear
problems.
Topic: Difference Between SVM (Classification) and SVR (Regression)
Key Difference: Purpose
SVM (classification): Find a hyperplane that best separates data into
classes with maximum margin.
SVR (regression): Fit a function that predicts continuous values while
keeping errors within an ε-tube (margin of tolerance).
Comparison Table
Aspect | SVM (Classification) | SVR (Regression)
Goal | Separate data into distinct classes | Predict continuous values
Output | Class labels (e.g., 0/1, −1/+1) | Real-valued predictions (e.g., price, temperature)
Margin | Maximize margin between classes | Allow deviations within ε (epsilon-insensitive tube)
Support Vectors | Points closest to the hyperplane that define the boundary | Points lying outside the ε-tube (errors)
Loss Function | Hinge loss | ε-insensitive loss
Decision Function | sign(w·x + b) | f(x) = w·x + b (with ε constraints)
Regularization (C) | Controls margin violations (misclassified points) | Controls tolerance for errors beyond ε
Use Case | Spam detection, image recognition, sentiment analysis | Stock prediction, house price prediction, demand forecasting
Visual Intuition
SVM (classification): Finds a hyperplane → all points on one side are
“Class A”, the other side “Class B”. Support vectors define the boundary.
SVR (regression): Finds a regression line/curve → points inside the ε-
tube are considered “good enough”, only those outside contribute to the
error.
Quick Example
SVM: Given email data → classify as spam or not spam.
SVR: Given house features → predict house price (a number).
In short:
SVM → Classification (separates categories).
SVR → Regression (predicts continuous values).
Topic 17: Decision Tree Classifier
What is a Decision Tree Classifier?
A Decision Tree Classifier is a supervised learning model used for
classification tasks.
It works like a flowchart:
o Internal nodes → represent decisions based on feature values.
o Branches → represent outcomes of those decisions.
o Leaf nodes → represent final class labels.
Example: "If Age > 30 and Income = High → Predict: Yes (buys product)"
How Does It Work?
1. The dataset is split into subsets based on a feature that best separates the
classes.
2. Splitting is done recursively until a stopping condition is met (e.g., max
depth, min samples).
3. The final classification is assigned at leaf nodes (majority class of
samples).
Splitting Criteria (Impurity Measures)
To decide the best feature to split, decision trees use measures of purity/impurity:
Gini Impurity: Gini = 1 − Σ pi² (0 for a pure node).
Entropy: Entropy = − Σ pi log2(pi); the split giving the largest Information Gain (reduction in entropy) is chosen.
Here pi is the proportion of samples of class i in the node.
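A small sketch (hypothetical node labels, not from the notes) computing both impurity measures for the class labels in a node:
import numpy as np
def gini(labels):
    # Gini impurity: 1 - sum(p_i^2)
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1 - np.sum(p ** 2)
def entropy(labels):
    # Entropy: -sum(p_i * log2(p_i))
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))
node = np.array([0, 0, 0, 1, 1, 1, 1, 1])    # hypothetical class labels in one node
print("Gini:", round(gini(node), 3))          # ≈ 0.469
print("Entropy:", round(entropy(node), 3))    # ≈ 0.954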
Key Parameters
criterion → “gini” (default) or “entropy”.
max_depth → maximum depth of the tree.
min_samples_split → minimum samples required to split a node.
min_samples_leaf → minimum samples per leaf.
max_features → number of features considered at each split.
Advantages
Easy to understand & interpret.
Captures nonlinear relationships.
Works for numerical & categorical data.
No need for feature scaling.
Disadvantages
Prone to overfitting if not pruned.
Unstable (small changes in data → different tree).
Greedy splits may not give global optimum.
Python Example
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier, plot_tree
# Load dataset (binary classification: 2 classes)
X, y = datasets.make_classification(n_features=2, n_redundant=0,
n_informative=2, n_clusters_per_class=1, random_state=42)
# Train Decision Tree
clf = DecisionTreeClassifier(max_depth=3, criterion='gini')
clf.fit(X, y)
# Plot decision tree
plt.figure(figsize=(10,6))
plot_tree(clf, filled=True, feature_names=["Feature1", "Feature2"],
class_names=["Class 0", "Class 1"])
plt.show()
Topic 18: Random Forest Classifier
What is Random Forest Classifier?
A Random Forest Classifier is an ensemble learning method that
combines multiple decision trees to improve classification accuracy and
reduce overfitting.
It uses the idea of “wisdom of the crowd”: many weak models (trees)
together form a strong model.
How Does It Work?
1. Bootstrap Sampling (Bagging):
o From the training data, multiple random samples are drawn with replacement.
o Each sample is used to train a separate decision tree.
2. Random Feature Selection:
o At each split, instead of considering all features, a random subset
of features is considered.
o This adds more diversity among trees and reduces correlation.
3. Voting:
o Each tree predicts a class label.
o The final prediction is made by majority voting (most common
class).
Key Parameters in Sklearn
n_estimators: number of trees (default = 100).
criterion: “gini” or “entropy” for measuring split quality.
max_depth: max depth of trees.
max_features: number of features to consider at each split (default =
√features for classification).
bootstrap: whether bootstrap sampling is used (default = True).
Advantages
Reduces overfitting compared to a single decision tree.
Works well with high-dimensional data.
Handles missing values and categorical + numerical features.
Provides feature importance scores.
Disadvantages
Slower and more memory-intensive than a single tree.
Less interpretable compared to one decision tree.
May still overfit on very noisy data.
Python Example
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
# Load dataset
iris = load_iris()
X, y = iris.data, iris.target
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=42)
# Random Forest Classifier
rf = RandomForestClassifier(n_estimators=100, max_depth=3, random_state=42)
rf.fit(X_train, y_train)
# Predictions
y_pred = rf.predict(X_test)
# Evaluation
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))
print("Feature Importances:", rf.feature_importances_)
Random Forest Classifier vs Decision Tree Classifier
Aspect | Decision Tree | Random Forest
Model | Single tree | Ensemble of many trees
Variance | High (unstable) | Low (averaging reduces variance)
Bias | Can be low if deep | Slightly higher
Overfitting | Very likely | Much less likely
Interpretability | Easy | Harder
In short:
Random Forest Classifier = A collection of decision trees trained on random
subsets of data + features, combined by majority vote to improve accuracy and
robustness.
Topic 19: Naïve Bayes
What is Naïve Bayes?
Naïve Bayes is a probabilistic classifier based on Bayes’ Theorem with
a strong (naïve) assumption that features are conditionally independent
given the class.
Despite the “naïve” assumption, it works surprisingly well in practice,
especially for text classification (spam detection, sentiment analysis,
document categorization).
Bayes’ Theorem
P(C | X) = [ P(X | C) · P(C) ] / P(X)
where P(C | X) is the posterior probability of class C given features X, P(X | C) the likelihood, P(C) the prior, and P(X) the evidence.
With the naïve independence assumption, P(X | C) = P(x1 | C) · P(x2 | C) · … · P(xn | C), so the classifier predicts the class that maximizes P(C) · Π P(xi | C).
Types of Naïve Bayes Classifiers
1. Gaussian Naïve Bayes
o Assumes features follow a normal distribution.
o Useful for continuous data.
2. Multinomial Naïve Bayes
o Suitable for discrete counts (e.g., word counts in text
classification).
3. Bernoulli Naïve Bayes
o Suitable for binary features (word presence/absence).
Example Intuition
Suppose we want to classify an email as Spam (S) or Not Spam (N).
Features: presence of words like “free”, “money”, “win”.
Using Naïve Bayes:
P(S | words) ∝ P(S) · P(“free” | S) · P(“money” | S) · P(“win” | S), and similarly for N.
We choose the class with the highest posterior probability.
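A minimal sketch of this idea (hypothetical toy messages and labels, not from the notes) using scikit-learn's CountVectorizer and MultinomialNB:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
# Hypothetical toy training messages and labels (1 = spam, 0 = not spam)
messages = ["win free money now", "free money offer", "meeting at noon",
            "project deadline tomorrow", "win a free prize", "lunch with team"]
labels = [1, 1, 0, 0, 1, 0]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(messages)   # word-count features
nb = MultinomialNB()
nb.fit(X, labels)
test = vectorizer.transform(["free money win", "team meeting tomorrow"])
print(nb.predict(test))          # expected: [1 0]
print(nb.predict_proba(test))    # posterior probabilities for each class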
Advantages
Simple, fast, and efficient.
Works well with high-dimensional data (e.g., text).
Requires small amount of training data.
Performs surprisingly well even if independence assumption is violated.
Disadvantages
Assumes independence of features (often unrealistic).
Struggles with correlated features.
Not great for continuous variables unless distribution assumption holds.
Python Example
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report
# Load dataset
iris = load_iris()
X, y = iris.data, iris.target
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=42)
# Train Naïve Bayes classifier
nb = GaussianNB()
nb.fit(X_train, y_train)
# Predictions
y_pred = nb.predict(X_test)
# Evaluation
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))
Naïve Bayes vs Logistic Regression
Aspect | Naïve Bayes | Logistic Regression
Approach | Probabilistic (Bayes theorem) | Discriminative (direct decision boundary)
Assumption | Feature independence | No independence assumption
Output | Posterior probabilities | Class probabilities via sigmoid/softmax
Speed | Very fast | Slower for large data
Use case | Text classification, spam filtering | General classification problems
In short:
Naïve Bayes = A simple probabilistic classifier that assumes features are
independent given the class, and applies Bayes’ theorem to make predictions.
Topic 20: Clustering (Unsupervised Learning)
Clustering is an unsupervised learning technique that groups similar data
points together without labels.
✅ Goal: Discover hidden patterns or groupings in data.
✅ Input: Only features (no output labels).
✅ Output: Clusters (groups of similar items).
🔹 Common Clustering Algorithms:
K-Means Clustering → partitions data into k clusters.
Hierarchical Clustering → builds a tree-like cluster structure
(dendrogram).
DBSCAN (Density-Based) → finds clusters of varying shape/density.
Gaussian Mixture Models (GMM) → probabilistic clustering.
🔹 Applications:
Customer segmentation in marketing (e.g., high spender, medium
spender, budget shopper).
Document clustering (news articles, research papers).
Image segmentation (separating objects in images).
Anomaly detection (fraud detection, outlier detection).
(1) K-Means Clustering
K-Means is one of the most popular unsupervised learning algorithms used
for clustering.
Idea
Divide data into K clusters (groups).
Each cluster is represented by its centroid (mean position of points).
Points are assigned to the cluster with the nearest centroid.
Algorithm Steps (Lloyd’s Algorithm)
1. Choose K → the number of clusters.
2. Initialize centroids → randomly pick K points as cluster centers.
3. Assign points → each data point goes to the nearest centroid.
4. Update centroids → calculate new centroid of each cluster.
5. Repeat steps 3–4 until:
o Centroids don’t change much, OR
o Max iterations reached.
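A compact NumPy sketch of Lloyd's algorithm (an illustration under simple assumptions, e.g., that no cluster ever ends up empty; not scikit-learn's optimized implementation):
import numpy as np
def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: initialize centroids by picking k random data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 3: assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: update each centroid as the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 5: stop when centroids no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
# Illustrative data: two well-separated groups
np.random.seed(0)
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
labels, centroids = kmeans(X, k=2)
print("Centroids:\n", centroids)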
Mathematical Objective
K-Means minimizes the within-cluster sum of squared distances (inertia):
J = Σk Σx∈Ck ||x − μk||²
where μk is the centroid of cluster Ck.
Advantages
o Simple and easy to implement
o Scales well to large datasets
o Works well for spherical, well-separated clusters
Disadvantages
o Must choose K beforehand
o Sensitive to initial centroids (can converge to local minima)
o Not good for non-spherical clusters or clusters with different densities
o Sensitive to outliers
Python Example: K-Means
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
# Generate sample data
X, y = make_blobs(n_samples=300, centers=4, cluster_std=0.60,
random_state=0)
# Apply K-Means
kmeans = KMeans(n_clusters=4, random_state=0)
y_kmeans = kmeans.fit_predict(X)
# Plot clusters
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=30, cmap='viridis')
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=200, alpha=0.7, marker='X')
plt.title("K-Means Clustering")
plt.show()
This code:
Generates synthetic data (4 clusters).
Runs K-Means with k=4.
Plots clusters with centroids (red X marks).
(2) Hierarchical Clustering
Hierarchical Clustering is an unsupervised learning algorithm that builds a
hierarchy of clusters.
Unlike K-Means, you don’t need to pre-specify K (number of clusters).
It produces a dendrogram (tree-like diagram) showing how clusters are merged
or split.
Types of Hierarchical Clustering
1. Agglomerative (Bottom-Up)
o Start with each point as its own cluster.
o Iteratively merge the closest clusters until only one cluster
remains.
o Most common approach.
2. Divisive (Top-Down)
o Start with all points in one cluster.
o Iteratively split clusters until each point is its own cluster.
How Clusters Are Merged (Linkage Criteria)
Different ways to define “distance” between clusters:
Single Linkage → Minimum distance between points of two clusters
Complete Linkage → Maximum distance between points of two clusters
Average Linkage → Average distance between all points in clusters
Ward’s Method → Minimize variance within clusters (most popular)
Advantages
o No need to pre-define number of clusters (can cut dendrogram at any
level)
o Produces a dendrogram (visual interpretability)
o Can capture non-spherical clusters
Disadvantages
o Computationally expensive (O(n²)) → not good for very large datasets
o Sensitive to noise and outliers
o Once a merge/split is done, it cannot be undone
Applications
Document clustering
Gene expression data analysis (bioinformatics)
Market research (grouping customers)
Python Example: Hierarchical Clustering
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_blobs
from scipy.cluster.hierarchy import dendrogram, linkage, fcluster
# Generate synthetic data
X, y = make_blobs(n_samples=50, centers=3, random_state=42,
cluster_std=0.8)
# Perform hierarchical clustering (Agglomerative)
Z = linkage(X, method='ward') # 'ward', 'single', 'complete', 'average'
# Plot dendrogram
plt.figure(figsize=(8, 5))
dendrogram(Z)
plt.title("Hierarchical Clustering Dendrogram")
plt.xlabel("Data Points")
plt.ylabel("Distance")
plt.show()
# Cut the dendrogram to form clusters (e.g., 3 clusters)
clusters = fcluster(Z, 3, criterion='maxclust')
# Plot clustered data
plt.scatter(X[:, 0], X[:, 1], c=clusters, cmap='viridis')
plt.title("Hierarchical Clustering Result")
plt.show()
This code:
Creates synthetic data.
Runs Agglomerative Clustering with Ward’s method.
Plots a dendrogram.
Cuts the dendrogram to form 3 clusters, and visualizes them.
(3) DBSCAN (Density-Based Spatial Clustering)
What is DBSCAN?
DBSCAN is a density-based clustering algorithm.
It groups together data points that are close to each other (high-density
regions) and marks points in low-density regions as noise/outliers.
Unlike k-means, it does not require specifying the number of clusters
in advance.
Key Concepts in DBSCAN
DBSCAN uses two important parameters:
1. ε (epsilon):
o The maximum distance between two points for them to be
considered as neighbors.
2. MinPts (minimum points):
o Minimum number of points required to form a dense region (i.e., a
cluster).
Based on these, points are classified into three categories:
Core Point: Has at least MinPts neighbors within radius ε.
Border Point: Not a core point, but within the neighborhood of a core
point.
Noise Point (Outlier): Neither a core point nor a border point.
DBSCAN Algorithm (Steps)
1. Pick a random unvisited point.
2. Check how many points are within ε distance.
o If ≥ MinPts → mark as Core Point and form a cluster.
o If < MinPts → mark as Noise (may later become a border point).
3. Expand the cluster by recursively including all density-reachable points.
4. Repeat until all points are visited.
Advantages
No need to specify number of clusters (unlike K-means).
Can find clusters of arbitrary shape (not just spherical).
Automatically detects outliers (noise points).
Disadvantages
Sensitive to choice of ε and MinPts.
Struggles with varying density clusters.
Performance can degrade on high-dimensional data.
Python Example (DBSCAN with sklearn)
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN
# Generate toy dataset (two moons shape)
X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)
# Apply DBSCAN
dbscan = DBSCAN(eps=0.2, min_samples=5)
labels = dbscan.fit_predict(X)
# Plot clusters
plt.scatter(X[:,0], X[:,1], c=labels, cmap='plasma', s=50)
plt.title("DBSCAN Clustering")
plt.show()
DBSCAN vs K-Means
Aspect | K-Means | DBSCAN
Clusters required | Yes (must specify k) | No (automatic)
Cluster shape | Spherical (convex) | Arbitrary
Handles noise | ❌ No | ✅ Yes
Sensitive to scale | ✅ Yes | ✅ Yes (ε depends on scale)
In short:
DBSCAN groups dense regions of points into clusters and marks sparse regions
as noise, making it great for data with irregular cluster shapes and outliers.
(4) Gaussian Mixture Models (GMM)
What is GMM?
Gaussian Mixture Model (GMM) is a probabilistic model that
assumes data is generated from a mixture of several Gaussian (Normal)
distributions with unknown parameters.
Each cluster is represented by a Gaussian distribution (bell curve in
higher dimensions → ellipses).
Unlike k-means, GMM allows clusters to have different shapes, sizes,
and orientations.
Key Idea
The overall data density is modeled as a weighted sum of K Gaussians:
p(x) = Σ πk · N(x | μk, Σk), where the mixture weights πk sum to 1 and μk, Σk are the mean and covariance of component k. Each point receives a probability of belonging to each component (soft clustering).
How GMM Works (EM Algorithm)
1. Initialize the means, covariances, and mixture weights.
2. E-step: compute each component's responsibility (posterior probability) for each data point.
3. M-step: update the means, covariances, and weights using those responsibilities.
4. Repeat the E and M steps until the log-likelihood converges.
Advantages
Can model elliptical clusters (more flexible than K-means).
Provides soft clustering (probabilities instead of hard assignments).
Handles overlapping clusters well.
Disadvantages
Need to specify the number of clusters K.
Sensitive to initialization.
Can struggle with very high-dimensional data.
Assumes data follows Gaussian distributions.
Python Example (Scikit-learn)
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture
# Generate synthetic data
X, _ = make_blobs(n_samples=500, centers=3, cluster_std=1.0,
random_state=42)
# Fit GMM
gmm = GaussianMixture(n_components=3, random_state=42)
labels = gmm.fit_predict(X)
# Plot results
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', s=40)
plt.scatter(gmm.means_[:, 0], gmm.means_[:, 1], c='red', marker='x', s=100)
# cluster centers
plt.title("Gaussian Mixture Model Clustering")
plt.show()
GMM vs K-Means
Aspect | K-Means | GMM
Cluster shape | Spherical | Elliptical
Assignment | Hard (each point → one cluster) | Soft (probability for each cluster)
Distribution assumption | None | Gaussian
Flexibility | Less | More (captures covariance)
In short:
GMM is a soft-clustering algorithm that models data as a mixture of Gaussian
distributions, using probabilities to assign points to clusters. It’s more flexible
than K-means, especially for elliptical or overlapping clusters.
Topic 21: Association Rules (Market Basket Analysis)
Association rule learning is an unsupervised learning method used to find
relationships (if-then rules) between variables in large datasets.
o Goal: Identify co-occurrence patterns in data.
o Input: Transactions or sets of items.
o Output: Rules in the form IF (Antecedent) → THEN (Consequent)
Key Measures:
Support(A → B) = fraction of transactions that contain both A and B.
Confidence(A → B) = Support(A ∪ B) / Support(A) → how often B appears when A is present.
Lift(A → B) = Confidence(A → B) / Support(B) → lift > 1 means A and B occur together more often than expected by chance.
Algorithms:
Apriori Algorithm
FP-Growth Algorithm
Applications:
Market Basket Analysis: “If a customer buys bread, they are likely to
buy butter.”
Recommender systems (Amazon, Netflix suggestions).
Medical data analysis (disease co-occurrence).
Website clickstream analysis.
Summary:
Clustering → Groups similar items (unsupervised, pattern discovery).
Association Rules → Finds co-occurrence relationships (if-then
patterns).
Topic 22: Apriori Algorithm (Association Rule Mining)
The Apriori Algorithm is an unsupervised learning algorithm used for
association rule mining.
It helps discover frequent itemsets and association rules from transaction
datasets (like supermarket purchases).
Steps of Apriori Algorithm
1. Set Minimum Support & Confidence (thresholds).
2. Find Frequent Itemsets:
o Generate candidate itemsets of length 1, 2, 3 …
o Keep only those meeting minimum support.
3. Generate Association Rules:
o From frequent itemsets, generate rules that meet minimum
confidence.
Example (Market Basket Data)
Transactions:
T1: {Milk, Bread, Butter}
T2: {Bread, Butter}
T3: {Milk, Bread}
T4: {Milk, Butter}
T5: {Bread, Butter}
Frequent Itemsets (Support ≥ 0.4):
{Milk} → 0.6
{Bread} → 0.8
{Butter} → 0.8
{Milk, Bread} → 0.4
{Milk, Butter} → 0.4
{Bread, Butter} → 0.6
Rules (Confidence ≥ 0.6), for example:
{Bread} → {Butter}: Confidence = Support({Bread, Butter}) / Support({Bread}) = 0.6 / 0.8 = 0.75; Lift = Confidence / Support({Butter}) = 0.75 / 0.8 ≈ 0.94
Python Example: Apriori
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules
# Sample dataset (Transactions)
dataset = [
['Milk', 'Bread', 'Butter'],
['Bread', 'Butter'],
['Milk', 'Bread'],
['Milk', 'Butter'],
['Bread', 'Butter']
]
# Convert dataset to one-hot encoded DataFrame
from mlxtend.preprocessing import TransactionEncoder
te = TransactionEncoder()
te_array = te.fit(dataset).transform(dataset)
df = pd.DataFrame(te_array, columns=te.columns_)
# Apply Apriori
frequent_itemsets = apriori(df, min_support=0.4, use_colnames=True)
# Generate association rules
rules = association_rules(frequent_itemsets, metric="confidence",
min_threshold=0.6)
print("Frequent Itemsets:\n", frequent_itemsets)
print("\nAssociation Rules:\n",
rules[['antecedents','consequents','support','confidence','lift']])
This code:
Creates a simple shopping dataset.
Converts it into one-hot encoding.
Uses Apriori to find frequent itemsets (support ≥ 0.4).
Generates association rules (confidence ≥ 0.6).