Mining Frequent Patterns, Associations, and Correlations
1. Basic Concepts
• Frequent Patterns:
Patterns (itemsets, subsequences, or substructures) that occur frequently in a
dataset.
Example: in a supermarket, bread and butter are often bought together.
• Association Rules:
Imply a strong relationship between items in a dataset.
Form: X ⇒ Y (If X happens, Y is likely to happen too).
• Support:
o Fraction of transactions that contain both X and Y (i.e., all items of X ∪ Y).
o support(X ⇒ Y) = P(X ∪ Y)
• Confidence:
o How often Y appears in transactions that contain X.
o confidence(X ⇒ Y) = P(Y | X) = support(X ∪ Y) / support(X)
• Lift:
o Measures how much more often X and Y occur together than expected if
independent.
o lift(X ⇒ Y) = confidence(X ⇒ Y) / support(Y)
• Correlations:
o Find relationships between itemsets that go beyond simple co-occurrence.
o Lift > 1 indicates positive correlation, lift = 1 independence, and lift < 1 negative correlation.
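As a quick worked example (using the same five toy transactions as the table in Section 3), the rule {Diaper} ⇒ {Beer} can be evaluated directly by counting transactions; this is just a sketch of the definitions above in Python:

transactions = [
    {'Bread', 'Milk'},
    {'Bread', 'Diaper', 'Beer', 'Eggs'},
    {'Milk', 'Diaper', 'Beer', 'Coke'},
    {'Bread', 'Milk', 'Diaper', 'Beer'},
    {'Bread', 'Milk', 'Diaper', 'Coke'},
]
n = len(transactions)

def support(itemset):
    # fraction of transactions containing every item of the itemset
    return sum(itemset <= t for t in transactions) / n

sup_xy = support({'Diaper', 'Beer'})      # 3/5 = 0.6
conf = sup_xy / support({'Diaper'})       # 0.6 / 0.8 = 0.75
lift = conf / support({'Beer'})           # 0.75 / 0.6 = 1.25
print(sup_xy, conf, lift)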
2. Association Rule Mining
Objective:
• Discover interesting relationships between variables in large datasets.
Challenges:
• Search Space: Exponential growth.
• Interestingness Measures: Not all frequent itemsets are meaningful.
• Scalability: Handle very large data volumes.
3. The Apriori Algorithm
Goal: Find frequent itemsets to generate association rules.
Principle:
• Apriori Property:
If an itemset is frequent, all of its subsets must also be frequent.
Steps:
1. Scan Dataset to find frequent 1-itemsets (items appearing frequently alone).
2. Generate Candidate Itemsets of length k from frequent itemsets of length k-1.
3. Prune candidates whose subsets are not frequent (using the Apriori property).
4. Repeat until no frequent itemsets can be generated.
Theoretical Example:
Transaction ID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke
• Frequent itemsets (min support = 60%, i.e., at least 3 of 5 transactions): {Bread}, {Milk}, {Diaper}, {Beer}, {Bread, Milk}, {Bread, Diaper}, {Milk, Diaper}, {Diaper, Beer}
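For intuition, here is a minimal pure-Python sketch of the level-wise Apriori procedure on the table above (min support = 60%). It is illustrative only; the mlxtend implementation in the next section is what you would normally use:

from itertools import combinations

transactions = [
    {'Bread', 'Milk'},
    {'Bread', 'Diaper', 'Beer', 'Eggs'},
    {'Milk', 'Diaper', 'Beer', 'Coke'},
    {'Bread', 'Milk', 'Diaper', 'Beer'},
    {'Bread', 'Milk', 'Diaper', 'Coke'},
]
min_support = 0.6
n = len(transactions)

def support(itemset):
    return sum(itemset <= t for t in transactions) / n

# Level 1: frequent single items
items = {i for t in transactions for i in t}
frequent = [{frozenset([i]) for i in items if support(frozenset([i])) >= min_support}]

# Level k: join frequent (k-1)-itemsets, prune by the Apriori property, then count support
k = 2
while frequent[-1]:
    candidates = {a | b for a in frequent[-1] for b in frequent[-1] if len(a | b) == k}
    candidates = {c for c in candidates
                  if all(frozenset(s) in frequent[-1] for s in combinations(c, k - 1))}
    frequent.append({c for c in candidates if support(c) >= min_support})
    k += 1

for level in frequent:
    for itemset in level:
        print(set(itemset), round(support(itemset), 2))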
4. Python Implementation Tips
You can use the mlxtend library (standalone packages such as apyori are an alternative).
from mlxtend.frequent_patterns import apriori, association_rules
import pandas as pd
# Sample dataset
dataset = [
['Bread', 'Milk'],
['Bread', 'Diaper', 'Beer', 'Eggs'],
['Milk', 'Diaper', 'Beer', 'Coke'],
['Bread', 'Milk', 'Diaper', 'Beer'],
['Bread', 'Milk', 'Diaper', 'Coke']
]
# Convert to one-hot encoding
from mlxtend.preprocessing import TransactionEncoder
te = TransactionEncoder()
te_ary = te.fit(dataset).transform(dataset)
df = pd.DataFrame(te_ary, columns=te.columns_)
# Frequent itemsets
frequent_itemsets = apriori(df, min_support=0.6, use_colnames=True)
print(frequent_itemsets)
# Generate association rules
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.7)
print(rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']])
Classification and Prediction
1. Basic Concepts
• Classification:
Predict categorical labels. (E.g., spam or not spam)
• Prediction:
Predict continuous values. (E.g., house prices)
Classification Types:
• Binary Classification: Two classes (Yes/No).
• Multiclass Classification: More than two classes (Cat/Dog/Fish).
• Multi-label Classification: Multiple labels at once (e.g., News tagged as Sports and
Politics).
2. Supervised Learning Flow
Step Description
Data Preparation Collect and preprocess the data
Model Building Train a model using training data
Prediction Use model to predict unseen data
Evaluation Measure accuracy, precision, recall
3. Common Algorithms
Algorithm Quick Idea
Decision Tree Tree-like structure of decisions
Random Forest Ensemble of decision trees
Logistic Regression Probabilistic classification
k-Nearest Neighbors (KNN) Based on closest training examples
Naive Bayes Bayes' theorem with independence assumptions
Support Vector Machine (SVM) Maximize margin between classes
Neural Networks Layered networks of neurons
4. Python Example (Simple Classification)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
# Dummy dataset
from sklearn.datasets import load_iris
data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.3)
# Train classifier
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
# Prediction
y_pred = model.predict(X_test)
# Evaluation
print("Accuracy:", accuracy_score(y_test, y_pred))
Classification
Definition:
Classification is the process of finding a model that describes and distinguishes data classes
or concepts.
The model can be used to predict the class label of new data points.
Issues Regarding Classification
Issue Description
Accuracy How correct the model is.
Speed Training time and prediction time.
Robustness Ability to handle noisy or missing data.
Scalability Ability to handle large datasets.
Interpretability Whether humans can understand the model's decisions.
Overfitting Model performs well on training data but poorly on new data.
Underfitting Model is too simple to capture data patterns.
Imbalanced Classes One class has significantly more samples than others.
Classification by Decision Tree Induction
Decision Tree
• A flowchart-like tree structure where:
o Each internal node represents a test on an attribute.
o Each branch represents an outcome of the test.
o Each leaf node represents a class label.
Popular Algorithms: ID3, C4.5, CART
Algorithm Steps:
1. Start with the entire dataset.
2. Choose the best attribute using a splitting criterion (e.g., Information Gain, Gini
Index).
3. Split the dataset based on the selected attribute.
4. Recur for each branch.
Splitting Criteria:
• Information Gain (used by ID3/C4.5): Gain(A) = Info(D) − Info_A(D), where Info(D) = −Σ p_i log2(p_i) is the entropy of the class distribution in D.
• Gini Index (used by CART): Gini(D) = 1 − Σ p_i², where p_i is the proportion of class i in D.
• At each node, the attribute (or split) that most reduces impurity is chosen.
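To make the criteria concrete, a small sketch that computes entropy and the Gini index for a set of class labels (the labels are toy values chosen for illustration):

import numpy as np

def entropy(labels):
    # Info(D) = -sum p_i log2 p_i
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def gini(labels):
    # Gini(D) = 1 - sum p_i^2
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1 - (p ** 2).sum()

labels = np.array(['yes', 'yes', 'no', 'yes', 'no'])
print("Entropy:", entropy(labels))   # ~0.971
print("Gini:", gini(labels))         # 0.48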
Python Example (Decision Tree)
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, plot_tree
import matplotlib.pyplot as plt
# Load dataset
iris = load_iris()
X, y = iris.data, iris.target
# Train decision tree
tree = DecisionTreeClassifier(criterion='gini') # or 'entropy'
tree.fit(X, y)
# Visualize tree
plt.figure(figsize=(12,8))
plot_tree(tree, filled=True, feature_names=iris.feature_names,
class_names=iris.target_names)
plt.show()
Bayesian Classification
Naive Bayes Assumption:
• Features are conditionally independent given the class.
Types:
• Gaussian Naive Bayes (for continuous data).
• Multinomial Naive Bayes (for count data).
• Bernoulli Naive Bayes (for binary/boolean features).
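Under the independence assumption, the class posterior factorizes, so a new point x = (x1, …, xn) is assigned the class C that maximizes the product of the prior and the per-feature likelihoods:
P(C | x) ∝ P(C) × P(x1 | C) × … × P(xn | C)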
Python Example (Naive Bayes)
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB
# Load dataset
iris = load_iris()
X, y = iris.data, iris.target
# Train Naive Bayes classifier
model = GaussianNB()
model.fit(X, y)
# Prediction
y_pred = model.predict(X)
print("Predictions:", y_pred)
Rule-Based Classification
Idea:
• IF-THEN rules to perform classification.
Example:
IF age < 30 AND income = high THEN buy_computer = no
How Rules Are Generated:
• From decision trees (e.g., extract paths).
• From association rule mining.
• Direct rule induction algorithms (e.g., RIPPER, CN2).
Advantages:
• Easy to understand.
• Fast classification.
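A rule-based classifier can be sketched as an ordered list of IF-THEN rules checked in turn, with a default class when no rule fires. The rules below are hypothetical, purely for illustration:

# Toy rule-based classifier: hypothetical rules, checked in order; first match wins.
rules = [
    (lambda r: r['age'] < 30 and r['income'] == 'high', 'no'),
    (lambda r: r['credit'] == 'excellent', 'yes'),
]
default_class = 'yes'  # used when no rule fires

def classify(record):
    for condition, label in rules:
        if condition(record):
            return label
    return default_class

print(classify({'age': 25, 'income': 'high', 'credit': 'fair'}))  # 'no'
print(classify({'age': 45, 'income': 'low', 'credit': 'fair'}))   # 'yes' (default)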
Metrics for Evaluating Classifier Performance
• Accuracy = (TP + TN) / (TP + TN + FP + FN): overall correctness.
• Precision = TP / (TP + FP): how many predicted positives are actual positives.
• Recall (Sensitivity) = TP / (TP + FN): how many actual positives were correctly predicted.
• F1-Score = 2 × (Precision × Recall) / (Precision + Recall): balance between precision and recall.
• ROC Curve: plot of TPR vs. FPR; shows diagnostic ability at various thresholds.
• AUC (Area Under Curve): single-number summary of the ROC curve; higher AUC = better model.
Where:
• TP = True Positive
• TN = True Negative
• FP = False Positive
• FN = False Negative
Python Example (Evaluation Metrics)
from sklearn.metrics import confusion_matrix, classification_report
# Assume y_test (true labels) and y_pred (predictions) are available,
# e.g., from the train/test split example above
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
Holdout Method and Random Subsampling
Holdout Method:
• Split dataset into:
o Training set (e.g., 70%)
o Testing set (e.g., 30%)
Simple, but the performance estimate can be biased if the single split happens to be unrepresentative.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
Random Subsampling:
• Repeat holdout method multiple times with different random splits.
• Average the performance over multiple trials.
• Reduces bias compared to one-time holdout.
Pseudo-code:
for i in range(N):
split data randomly
train model
evaluate and record performance
average all recorded performances
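A minimal sketch of random subsampling with scikit-learn, assuming a decision tree on the Iris data used earlier (N = 10 random splits, accuracies averaged):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
scores = []
for i in range(10):  # N = 10 random splits
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=i)
    model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
    scores.append(accuracy_score(y_test, model.predict(X_test)))
print("Mean accuracy over 10 splits:", np.mean(scores))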
Prediction
Definition:
Prediction involves building a model to predict continuous-valued functions (numeric
outcomes), unlike classification which predicts discrete labels.
Example:
• Predict house prices based on area, location.
• Predict temperature for the next week.
Issues Regarding Prediction
Issue Description
Quality of Input Data Missing, noisy, or irrelevant features can affect performance.
Model Complexity Overly complex models may overfit; simple models may underfit.
Feature Selection Important to choose the right input attributes.
Interpretability How understandable the prediction model is.
Scalability Handle large datasets efficiently.
Evaluation Hard to measure "goodness" accurately, especially for time-series data.
Python Example: (Prediction Error Metrics)
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np
# True and predicted values
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])
print("MAE:", mean_absolute_error(y_true, y_pred))
print("MSE:", mean_squared_error(y_true, y_pred))
print("RMSE:", mean_squared_error(y_true, y_pred, squared=False))
print("R2 Score:", r2_score(y_true, y_pred))
Evaluating the Accuracy of a Classifier or Predictor
Method Description
Holdout Method Split into training and testing once.
k-Fold Cross-Validation Split into k subsets; train k times, each time testing on a different held-out subset.
Leave-One-Out Cross-Validation (LOOCV) Extreme case of k-fold where k = number of samples.
Bootstrap Resample with replacement to create multiple datasets.
Key Metrics:
• For Classifiers: Accuracy, Precision, Recall, F1-score.
• For Predictors: MAE, MSE, RMSE, R2.
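A minimal sketch of k-fold cross-validation with scikit-learn (5 folds on the Iris data); LOOCV is the same call with cv=LeaveOneOut():

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, KFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=kf)
print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())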
Clustering
Cluster Analysis
Definition:
Unsupervised learning task that groups a set of objects such that objects in the same group
(cluster) are more similar to each other than to those in other groups.
Applications:
• Customer segmentation.
• Image compression.
• Anomaly detection.
Agglomerative vs Divisive Hierarchical Clustering
• Agglomerative (bottom-up): start with each point as its own cluster, then iteratively merge the closest clusters. Examples: single-link and complete-link methods.
• Divisive (top-down): start with one large cluster and recursively split it into smaller clusters. Example: bisecting k-means.
Hierarchical Clustering creates a tree structure (dendrogram).
Agglomerative Hierarchical Clustering Python Example:
from scipy.cluster.hierarchy import dendrogram, linkage
import matplotlib.pyplot as plt
import numpy as np
# Sample data
X = np.array([[1, 2],
[3, 4],
[5, 6],
[8, 8]])
# Perform hierarchical clustering
linked = linkage(X, 'single') # 'single', 'complete', 'average'
# Plot dendrogram
plt.figure(figsize=(8, 5))
dendrogram(linked, labels=[1, 2, 3, 4])
plt.show()
Evaluation of Clustering
Since clustering is unsupervised, evaluation is tricky!
Metric Description
Silhouette Coefficient Measures how similar an object is to its own cluster vs. other clusters; ranges from -1 to 1.
Davies-Bouldin Index Ratio of within-cluster scatter to between-cluster separation; lower value = better clustering.
Inertia (Within-Cluster Sum of Squares) Used in KMeans; lower is better.
External Validation (if labels are known) Adjusted Rand Index, Mutual Information Score.
Python Example: (Evaluate Clustering)
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
# Sample data
X = np.random.rand(50, 2)
# Apply clustering
kmeans = KMeans(n_clusters=3, random_state=42)
kmeans.fit(X)
# Evaluate
labels = kmeans.labels_
print("Silhouette Score:", silhouette_score(X, labels))
Python Example: Simple Gradient Descent
import numpy as np
# Hypothesis function
def predict(X, theta):
    return X.dot(theta)

# Cost function (mean squared error / 2, shown for reference)
def cost(X, y, theta):
    m = len(y)
    return (1 / (2 * m)) * np.sum((predict(X, theta) - y) ** 2)

# Gradient descent function
def gradient_descent(X, y, theta, alpha, iterations):
    m = len(y)
    for _ in range(iterations):
        theta -= (alpha / m) * (X.T.dot(predict(X, theta) - y))
    return theta
# Dummy dataset
X = np.array([[1], [2], [3]])
y = np.array([2, 4, 6])
X_b = np.c_[np.ones((3, 1)), X] # add bias term
theta = np.zeros(2)
theta = gradient_descent(X_b, y, theta, alpha=0.1, iterations=1000)
print("Theta:", theta)
Linear Regression with One Variable (Univariate)
Simple case with only one independent variable x.
Equation:
h_θ(x) = θ₀ + θ₁x
Use case:
Predict salary based on years of experience.
Python Example (One Variable)
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
# Data
X = np.array([[1], [2], [3], [4]])
y = np.array([2, 4, 6, 8])
# Model
model = LinearRegression()
model.fit(X, y)
# Predict and plot
plt.scatter(X, y, color='red')
plt.plot(X, model.predict(X), color='blue')
plt.xlabel('X')
plt.ylabel('y')
plt.title('Linear Regression with One Variable')
plt.show()
Python Example (Multiple Variables)
# Data
X = np.array([[1, 2], [2, 3], [4, 5]])
y = np.array([5, 7, 11])
# Model
model = LinearRegression()
model.fit(X, y)
print("Coefficients:", model.coef_)
print("Intercept:", model.intercept_)
Python Example (Polynomial Regression)
from sklearn.preprocessing import PolynomialFeatures
# Data
X = np.array([[1], [2], [3], [4]])
y = np.array([2, 5, 10, 17])
# Transform features
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)
# Model
model = LinearRegression()
model.fit(X_poly, y)
# Predict and plot
X_fit = np.linspace(1, 4, 100).reshape(-1,1)
X_fit_poly = poly.transform(X_fit)
y_fit = model.predict(X_fit_poly)
plt.scatter(X, y, color='red')
plt.plot(X_fit, y_fit, color='blue')
plt.xlabel('X')
plt.ylabel('y')
plt.title('Polynomial Regression')
plt.show()
Python Example (Scaling)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
Feature Selection
Definition:
Choosing only the most important/related features for the model.
Techniques:
• Filter Methods (e.g., correlation, chi-square test)
• Wrapper Methods (e.g., recursive feature elimination (RFE))
• Embedded Methods (e.g., Lasso regression)
Why Needed:
• Reduce model complexity.
• Improve accuracy.
• Reduce training time.
Python Example (Feature Selection using RFE)
from sklearn.feature_selection import RFE
model = LinearRegression()
rfe = RFE(model, n_features_to_select=1)
rfe = rfe.fit(X, y)
print("Selected Features:", rfe.support_)
Classification Using Logistic Regression
Definition:
Logistic Regression is a supervised learning algorithm used for binary classification (output
0 or 1).
Despite its name, it's used for classification, not regression!
Core Idea:
Rather than fitting a straight line (as in Linear Regression), Logistic Regression fits an S-shaped curve, the sigmoid function σ(z) = 1 / (1 + e^(−z)), which maps any real-valued number to a probability between 0 and 1.
Python Example (One Variable)
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
# Dummy Data
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([0, 0, 0, 1, 1])
# Model
model = LogisticRegression()
model.fit(X, y)
# Prediction
x_test = np.linspace(0, 6, 100).reshape(-1,1)
y_pred = model.predict_proba(x_test)[:,1]
# Plot
plt.scatter(X, y, color='red')
plt.plot(x_test, y_pred, color='blue')
plt.xlabel('Feature X')
plt.ylabel('Probability')
plt.title('Logistic Regression with One Variable')
plt.show()
Python Example (Multiple Variables)
# Dummy Data
X = np.array([[1, 2], [2, 3], [3, 1], [4, 3], [5, 5]])
y = np.array([0, 0, 0, 1, 1])
# Model
model = LogisticRegression()
model.fit(X, y)
# Prediction
print("Predicted probabilities:", model.predict_proba([[3, 4]]))
print("Predicted class:", model.predict([[3, 4]]))
Deep Learning
History of Deep Learning
• 1943: McCulloch & Pitts — first mathematical model of a neuron.
• 1958: Frank Rosenblatt — Perceptron (first algorithm for binary classification).
• 1986: Rumelhart, Hinton, Williams — Backpropagation algorithm.
• 1998: Yann LeCun — Convolutional Neural Networks (LeNet for handwritten digit
recognition).
• 2006–2012: "Deep Learning" boom — Geoffrey Hinton popularized Deep Belief
Networks, then came AlexNet (2012 ImageNet winner).
Scope and Specifications of Deep Learning
• Scope:
o Computer Vision (object detection, facial recognition)
o Natural Language Processing (translation, sentiment analysis)
o Healthcare (disease diagnosis)
o Robotics (autonomous control)
o Finance (fraud detection)
• Specifications:
o Requires large datasets.
o Requires high computational power (GPUs/TPUs).
o Involves training models with millions of parameters.
Why Deep Learning Now?
• Big Data: Availability of massive datasets (e.g., ImageNet, OpenAI datasets).
• Hardware Advances: GPUs, TPUs, Parallel Computing.
• Algorithmic Improvements: Better optimizers (Adam, RMSProp), regularization
methods (dropout, batch normalization).
• Open-Source Ecosystem: Libraries like TensorFlow, PyTorch, Keras, Hugging Face.
Building Blocks of Neural Networks
Building Block Description
Neuron (Unit) Mimics a biological neuron, performs weighted sum + activation
Layer Group of neurons; input, hidden, output
Weights & Bias Learnable parameters
Activation Function Non-linear transformation (e.g., ReLU, Sigmoid)
Loss Function Measures prediction error
Optimizer Adjusts weights to minimize loss
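To make "weighted sum + activation" concrete, here is a single artificial neuron in NumPy; the inputs, weights, and bias are toy values chosen for illustration:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

x = np.array([0.5, -1.2, 3.0])   # inputs
w = np.array([0.4, 0.1, -0.6])   # learnable weights
b = 0.2                          # learnable bias

z = w.dot(x) + b                 # weighted sum
a = sigmoid(z)                   # non-linear activation
print("Pre-activation:", z, "Output:", a)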
Deep Learning Hardware
• GPU: Graphics Processing Unit — parallelism for matrix ops.
• TPU: Tensor Processing Unit — Google's hardware for tensor ops.
• ASICs: Application-Specific Integrated Circuits for DL workloads.
• Frameworks: TensorFlow, PyTorch leverage hardware accelerations.
Forward and Backward Propagation
Direction Purpose
Forward Propagation Compute output predictions from inputs
Backward Propagation Update weights based on prediction error using gradients
Forward Pass:
• Inputs → Multiply weights → Add bias → Activation → Output
Backward Pass:
• Use Chain Rule of derivatives to compute gradient w.r.t each parameter.
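A minimal one-neuron example of both passes, using the chain rule to compute the gradient of a squared-error loss and then taking a few gradient-descent steps (toy numbers, single training example):

import numpy as np

x, y_true = 2.0, 1.0    # one input, one target
w, b = 0.5, 0.0         # parameters
lr = 0.1                # learning rate

for step in range(3):
    # Forward pass: prediction and loss
    z = w * x + b
    a = 1 / (1 + np.exp(-z))           # sigmoid activation
    loss = 0.5 * (a - y_true) ** 2

    # Backward pass: chain rule dL/dw = dL/da * da/dz * dz/dw
    dL_da = a - y_true
    da_dz = a * (1 - a)
    dL_dw = dL_da * da_dz * x
    dL_db = dL_da * da_dz

    # Gradient descent update
    w -= lr * dL_dw
    b -= lr * dL_db
    print(f"step {step}: loss = {loss:.4f}")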
XOR Model
• XOR (Exclusive OR) Problem:
o Input: two binary variables.
o Output: 1 if only one of the inputs is 1, else 0.
• Challenge: Single-layer perceptron can't solve XOR (not linearly separable).
• Solution:
o Multi-Layer Neural Networks (MLPs) can solve XOR!
Python Example (Simple XOR with MLP)
import torch
import torch.nn as nn
# XOR dataset
X = torch.tensor([[0,0],[0,1],[1,0],[1,1]], dtype=torch.float32)
y = torch.tensor([[0],[1],[1],[0]], dtype=torch.float32)
# Model
model = nn.Sequential(
nn.Linear(2, 2),
nn.Sigmoid(),
nn.Linear(2, 1),
nn.Sigmoid()
)
# Loss and Optimizer
criterion = nn.BCELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
# Training loop
for epoch in range(10000):
    optimizer.zero_grad()
    outputs = model(X)
    loss = criterion(outputs, y)
    loss.backward()
    optimizer.step()
print(model(X).detach())
Layers:
• Input Layer: Takes raw features.
• Hidden Layers: Feature extraction (representation learning).
• Output Layer: Final prediction.
Normalization
Normalization improves training speed and stability:
• Batch Normalization: Normalize activations of a layer during training.
• Input Normalization: Normalize features to mean 0 and variance 1.
Why Normalize?
To prevent some features from dominating and to stabilize gradients.
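A quick sketch of input normalization (zero mean, unit variance per feature) in NumPy; sklearn's StandardScaler, shown earlier, does the same thing:

import numpy as np

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

# Normalize each feature (column) to mean 0 and variance 1
X_norm = (X - X.mean(axis=0)) / X.std(axis=0)
print(X_norm.mean(axis=0))  # ~[0, 0]
print(X_norm.std(axis=0))   # ~[1, 1]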
Hyper-Parameter Tuning
Hyperparameters: Values set before training (not learned).
Examples:
• Learning Rate
• Number of Layers
• Number of Neurons per Layer
• Batch Size
• Number of Epochs
• Dropout Rate
Tuning methods:
• Grid Search
• Random Search
• Bayesian Optimization
• Hyperband
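A minimal grid search sketch with scikit-learn's GridSearchCV, tuning two decision-tree hyperparameters by cross-validation; the parameter grid is illustrative:

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
param_grid = {
    'max_depth': [2, 3, 5, None],
    'min_samples_split': [2, 5, 10],
}
grid = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
grid.fit(X, y)
print("Best parameters:", grid.best_params_)
print("Best CV accuracy:", grid.best_score_)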
Convolutional Neural Networks (CNNs)
CNNs are specialized for image data!
Component Purpose
Convolution Layer Extract features (edges, corners, patterns)
Pooling Layer Reduce spatial size (downsampling)
Fully Connected Layer Final classification
CNN Architecture Example:
Input Image → Conv Layer → ReLU → Pooling → Conv Layer → ReLU → Pooling → Flatten →
Fully Connected Layer → Output
CNN Architecture Typical Example:
Layer Details
Conv2D filters=32, kernel_size=(3,3)
Activation ReLU
MaxPooling2D pool_size=(2,2)
Conv2D filters=64, kernel_size=(3,3)
Activation ReLU
MaxPooling2D pool_size=(2,2)
Flatten
Dense 128 neurons
Output Dense Softmax for multiclass classification
Python Code (Basic CNN - Keras)
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense
model = Sequential([
Conv2D(32, (3,3), activation='relu', input_shape=(64, 64, 3)),
MaxPooling2D(pool_size=(2,2)),
Conv2D(64, (3,3), activation='relu'),
MaxPooling2D(pool_size=(2,2)),
Flatten(),
Dense(128, activation='relu'),
Dense(10, activation='softmax') # 10 classes
])
Iris Dataset Classification
1. Import Libraries
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import matplotlib.pyplot as plt
import seaborn as sns
2. Load and Explore the Dataset
# Load Dataset
iris = load_iris()
X = iris.data
y = iris.target
# Feature Names
print(iris.feature_names)
# Target Names
print(iris.target_names)
3. Preprocess the Data
• Scaling is optional for tree-based models but good for Logistic Regression.
# Feature Scaling
scaler = StandardScaler()
X = scaler.fit_transform(X)
4. Split into Train and Test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
5. Build Classifiers
(a) Logistic Regression
# Logistic Regression
lr_model = LogisticRegression(max_iter=200)
lr_model.fit(X_train, y_train)
# Predictions
y_pred_lr = lr_model.predict(X_test)
(b) Decision Tree Classifier
# Decision Tree
dt_model = DecisionTreeClassifier(criterion='gini', random_state=42)
dt_model.fit(X_train, y_train)
# Predictions
y_pred_dt = dt_model.predict(X_test)
6. Evaluate the Models
# Logistic Regression Evaluation
print("Logistic Regression Accuracy:", accuracy_score(y_test, y_pred_lr))
print("\nClassification Report for Logistic Regression:\n", classification_report(y_test,
y_pred_lr))
# Decision Tree Evaluation
print("Decision Tree Accuracy:", accuracy_score(y_test, y_pred_dt))
print("\nClassification Report for Decision Tree:\n", classification_report(y_test, y_pred_dt))
7. Confusion Matrix Visualization
# Confusion Matrix for Logistic Regression
cm_lr = confusion_matrix(y_test, y_pred_lr)
sns.heatmap(cm_lr, annot=True, cmap='Blues', fmt='d')
plt.title('Logistic Regression Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.show()
# Confusion Matrix for Decision Tree
cm_dt = confusion_matrix(y_test, y_pred_dt)
sns.heatmap(cm_dt, annot=True, cmap='Greens', fmt='d')
plt.title('Decision Tree Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.show()
Loan Dataset Classification (Machine Learning)
1. Import Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import matplotlib.pyplot as plt
import seaborn as sns
2. Load the Dataset
Suppose you have a CSV called loan_data.csv.
# Load the dataset
df = pd.read_csv('loan_data.csv')
# View first few rows
print(df.head())
Typical columns in loan dataset:
• Loan_ID
• Gender
• Married
• Dependents
• Education
• Self_Employed
• ApplicantIncome
• CoapplicantIncome
• LoanAmount
• Loan_Amount_Term
• Credit_History
• Property_Area
• Loan_Status (Target variable: Y/N)
3. Preprocess the Data
• Handle missing values.
• Encode categorical variables.
• Feature scaling.
# Drop Loan_ID (not useful for prediction)
df = df.drop('Loan_ID', axis=1)
# Fill missing values (assignment form avoids pandas chained-assignment warnings)
df['Gender'] = df['Gender'].fillna(df['Gender'].mode()[0])
df['Married'] = df['Married'].fillna(df['Married'].mode()[0])
df['Dependents'] = df['Dependents'].fillna(df['Dependents'].mode()[0])
df['Self_Employed'] = df['Self_Employed'].fillna(df['Self_Employed'].mode()[0])
df['Credit_History'] = df['Credit_History'].fillna(df['Credit_History'].mode()[0])
df['LoanAmount'] = df['LoanAmount'].fillna(df['LoanAmount'].median())
df['Loan_Amount_Term'] = df['Loan_Amount_Term'].fillna(df['Loan_Amount_Term'].mode()[0])
# Encode categorical variables
label_encoders = {}
categorical_columns = ['Gender', 'Married', 'Dependents', 'Education', 'Self_Employed',
'Property_Area', 'Loan_Status']
for col in categorical_columns:
    label_encoders[col] = LabelEncoder()
    df[col] = label_encoders[col].fit_transform(df[col])
# Separate features and target
X = df.drop('Loan_Status', axis=1)
y = df['Loan_Status']
# Feature Scaling
scaler = StandardScaler()
X = scaler.fit_transform(X)
4. Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
5. Build Classifiers
(a) Logistic Regression
# Logistic Regression
lr_model = LogisticRegression()
lr_model.fit(X_train, y_train)
# Predictions
y_pred_lr = lr_model.predict(X_test)
(b) Decision Tree Classifier
# Decision Tree
dt_model = DecisionTreeClassifier(random_state=42)
dt_model.fit(X_train, y_train)
# Predictions
y_pred_dt = dt_model.predict(X_test)
6. Evaluate the Models
# Logistic Regression
print("Logistic Regression Accuracy:", accuracy_score(y_test, y_pred_lr))
print("\nClassification Report for Logistic Regression:\n", classification_report(y_test,
y_pred_lr))
# Decision Tree
print("Decision Tree Accuracy:", accuracy_score(y_test, y_pred_dt))
print("\nClassification Report for Decision Tree:\n", classification_report(y_test, y_pred_dt))
7. Confusion Matrix Visualization
# Logistic Regression Confusion Matrix
cm_lr = confusion_matrix(y_test, y_pred_lr)
sns.heatmap(cm_lr, annot=True, cmap='Blues', fmt='d')
plt.title('Logistic Regression Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.show()
# Decision Tree Confusion Matrix
cm_dt = confusion_matrix(y_test, y_pred_dt)
sns.heatmap(cm_dt, annot=True, cmap='Greens', fmt='d')
plt.title('Decision Tree Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.show()