COURSE NAME: DATA ANALYTICS LAB COURSE CODE: 22ML607PC
Write Python programs for the following
1. Data Preprocessing
a. Handling missing values
b. Noise detection removal
c. Identifying data redundancy and elimination
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from scipy import stats
from sklearn.preprocessing import StandardScaler
# Sample dataset with missing values, noise, and redundant data
data = {
'A': [10, 15, np.nan, 20, 25, 30, np.nan, 35, 1000], # Contains missing values & an outlier (1000)
'B': [5, 7, 8, 5, 10, 12, 8, 5, 15], # No missing values
'C': [10, 15, 20, 25, 30, 35, 40, 45, 50], # Highly correlated with A (redundant)
'D': ['yes', 'no', np.nan, 'yes', 'no', 'yes', 'no', 'yes', 'no'] # Categorical with missing values
}
df = pd.DataFrame(data)
print("Original Data:\n", df)
# ---------- Handling Missing Values ----------
imputer_numeric = SimpleImputer(strategy='mean') # Using mean imputation for numerical columns
df[['A']] = imputer_numeric.fit_transform(df[['A']]) # Fill missing values in column 'A'
imputer_categorical = SimpleImputer(strategy='most_frequent') # Using mode imputation for categorical data
df[['D']] = imputer_categorical.fit_transform(df[['D']]) # Fill missing values in column 'D'
print("\nData After Handling Missing Values:\n", df)
# ---------- Noise Detection & Removal (Z-score method) ----------
z_scores = np.abs(stats.zscore(df[['A', 'B']])) # Compute Z-scores for numerical columns
df_no_noise = df[(z_scores < 3).all(axis=1)] # Remove rows where Z-score > 3
print("\nData After Removing Noise:\n", df_no_noise)
# ---------- Identifying & Removing Redundant Data ----------
correlation_matrix = df_no_noise.corr(numeric_only=True) # Compute correlation matrix (numeric columns only)
# Flag columns highly correlated (>0.95) with another column; keep 'A' as the reference feature
high_correlation_features = [col for col in correlation_matrix.columns
                             if (correlation_matrix[col].drop(col) > 0.95).any() and col != 'A']
df_final = df_no_noise.drop(columns=high_correlation_features) # Drop highly correlated features
print("\nFinal Processed Data (Redundant Features Removed):\n", df_final)
OUTPUT:
Original Data:
A B C D
0 10.0 5 10 yes
1 15.0 7 15 no
2 NaN 8 20 NaN
3 20.0 5 25 yes
4 25.0 10 30 no
5 30.0 12 35 yes
6 NaN 8 40 no
7 35.0 5 45 yes
8 1000.0 15 50 no
Data After Handling Missing Values:
A B C D
0 10.000000 5 10 yes
1 15.000000 7 15 no
2 162.142857 8 20 no
3 20.000000 5 25 yes
4 25.000000 10 30 no
5 30.000000 12 35 yes
6 162.142857 8 40 no
7 35.000000 5 45 yes
8 1000.000000 15 50 no
Data After Removing Noise:
A B C D
0 10.000000 5 10 yes
1 15.000000 7 15 no
2 162.142857 8 20 no
3 20.000000 5 25 yes
4 25.000000 10 30 no
5 30.000000 12 35 yes
6 162.142857 8 40 no
7 35.000000 5 45 yes
8 1000.000000 15 50 no
2. Implement any one imputation model
import pandas as pd
import numpy as np
def mean_imputation(data):
    """Imputes missing values with the mean of each column."""
    return data.fillna(data.mean())
# Example dataset with missing values
data = pd.DataFrame({
'A': [1, 2, np.nan, 4, 5],
'B': [3, np.nan, 7, 8, 9],
'C': [10, 11, 12, np.nan, 14]
})
print("Original Data:")
print(data)
# Apply mean imputation
imputed_data = mean_imputation(data)
print("\nData after Mean Imputation:")
print(imputed_data)
OUTPUT:
Original Data:
A B C
0 1.0 3.0 10.0
1 2.0 NaN 11.0
2 NaN 7.0 12.0
3 4.0 8.0 NaN
4 5.0 9.0 14.0
Data after Mean Imputation:
A B C
0 1.0 3.00 10.00
1 2.0 6.75 11.00
2 3.0 7.00 12.00
3 4.0 8.00 11.75
4 5.0 9.00 14.00
What is an imputer?
An imputer is an estimator used to fill missing values in a dataset. For numerical columns it supports strategies such as mean, median, and constant; for categorical columns it supports most frequent and constant. You can also train a model to predict the missing values.
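A minimal sketch of using scikit-learn's SimpleImputer with the median strategy (the column name and values are illustrative):
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Illustrative data with one missing entry
df = pd.DataFrame({'age': [22, 35, np.nan, 41]})

# Median imputation is less sensitive to outliers than mean imputation
imputer = SimpleImputer(strategy='median')
df[['age']] = imputer.fit_transform(df[['age']])
print(df)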
What are NumPy and Pandas?
NumPy and Pandas are two popular Python libraries often used in data analytics. NumPy is used for working with arrays and also provides functions for linear algebra, Fourier transforms, and matrices. NumPy excels at creating N-dimensional data objects and performing mathematical operations efficiently, while Pandas is renowned for data wrangling and its ability to handle large tabular datasets.
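A small sketch contrasting the two libraries (the array values are illustrative):
import numpy as np
import pandas as pd

# NumPy: fast vectorized math on N-dimensional arrays
arr = np.array([[1, 2, 3], [4, 5, 6]])
print(arr.mean(axis=0))        # column-wise means

# Pandas: labeled, tabular data built on top of NumPy
df = pd.DataFrame(arr, columns=['a', 'b', 'c'])
print(df.describe())           # summary statistics per column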
What are the uses of sklearn?
scikit-learn (sklearn) is one of the most useful libraries for machine learning in Python. It contains many efficient tools for machine learning and statistical modeling, including classification, regression, clustering, and dimensionality reduction. Generated sklearn datasets are synthetic datasets created with the sklearn library; they are used for testing, benchmarking, and developing machine learning algorithms/models.
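A minimal sketch of generating a synthetic classification dataset with sklearn (the parameter values are arbitrary):
from sklearn.datasets import make_classification

# 200 samples, 4 features, 2 classes; random_state makes the data reproducible
X, y = make_classification(n_samples=200, n_features=4, n_informative=3,
                           n_redundant=1, n_classes=2, random_state=42)
print(X.shape, y.shape)   # (200, 4) (200,)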
What is a dataframe?
A DataFrame is a data structure organized into rows and columns, similar to a database table or an Excel/Calc spreadsheet. A Pandas DataFrame is a two-dimensional, size-mutable, heterogeneous tabular data structure with labeled rows and columns.
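A short sketch of creating and inspecting a DataFrame (the column names and values are illustrative):
import pandas as pd

df = pd.DataFrame({'name': ['Asha', 'Ravi'], 'marks': [82, 91]})
print(df.shape)    # (2, 2) -> rows, columns
print(df.dtypes)   # per-column data types
print(df.head())   # first few rows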
What is a dict{} in Python?
A dictionary can be created by placing a sequence of key-value pairs within curly braces {}, separated by commas. Python dictionaries are ordered (insertion order is preserved since Python 3.7). Dictionary keys are case sensitive: the same name with different cases is treated as a distinct key. With dictionaries you access values via their keys. Keys can be of any immutable (hashable) type (int, float, string, and even tuple). A dictionary may contain duplicate values, but the keys must be unique (so it isn't possible to access different values via the same key).
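A minimal sketch illustrating these properties (the keys and values are illustrative):
# Keys are case sensitive and must be unique; values may repeat
scores = {'math': 90, 'Math': 75, ('lab', 1): 90}

print(scores['math'])      # 90  -> access by key
print(scores['Math'])      # 75  -> 'Math' is a different key from 'math'
print(scores[('lab', 1)])  # 90  -> tuples can be keys because they are hashable

scores['math'] = 95        # re-assigning an existing key overwrites its value
print(list(scores))        # insertion order is preserved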
3. Implement Linear Regression
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
# Generate synthetic data
np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train Linear Regression using sklearn
model = LinearRegression()
model.fit(X_train, y_train)
# Predict on test set
y_pred = model.predict(X_test)
# Calculate mean squared error
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse}")
# Plot results
plt.scatter(X_test, y_test, color='blue', label='Actual')
plt.plot(X_test, y_pred, color='red', linewidth=2, label='Predicted')
plt.xlabel("X")
plt.ylabel("y")
plt.legend()
plt.show()
# Manual Implementation of Linear Regression using Normal Equation
X_b = np.c_[np.ones((100, 1)), X] # Add bias term
theta_best = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)
print(f"Calculated coefficients: {theta_best.ravel()}")
What is Linear Regression?
Linear Regression is a supervised learning algorithm used for predicting a continuous dependent
variable based on one or more independent variables. It models the relationship between
variables by fitting a linear equation:
y = β0 + β1x1 + β2x2 + ... + βnxn + ε
where:
y is the dependent variable (target),
x1, ..., xn are the independent variables (features),
β0, β1, ..., βn are the coefficients (weights),
ε is the error term.
The goal of Linear Regression is to find the best-fitting line (or hyperplane in higher dimensions)
that minimizes the difference between predicted and actual values, often using methods like
Ordinary Least Squares (OLS).
Linear Regression
Predicts continuous values.
The model fits a straight line to the data.
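A minimal sketch of OLS for the one-feature case, estimating the slope and intercept from the closed-form formulas β1 = cov(x, y)/var(x) and β0 = mean(y) - β1*mean(x) (the data below is illustrative):
import numpy as np

# Illustrative data: y is roughly 4 + 3x plus noise
rng = np.random.default_rng(0)
x = 2 * rng.random(100)
y = 4 + 3 * x + rng.normal(size=100)

# Ordinary Least Squares for a single feature
beta1 = np.cov(x, y, bias=True)[0, 1] / np.var(x)   # slope
beta0 = y.mean() - beta1 * x.mean()                  # intercept
print(f"Intercept: {beta0:.3f}, Slope: {beta1:.3f}")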
4. Implement Logistic Regression
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix
# Generate synthetic data
np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = (X > 1).astype(int).ravel() # Binary classification based on threshold
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train Logistic Regression using sklearn
model = LogisticRegression()
model.fit(X_train, y_train)
# Predict on test set
y_pred = model.predict(X_test)
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")
# Confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(conf_matrix)
# Plot decision boundary
X_values = np.linspace(0, 2, 100).reshape(-1, 1)
y_proba = model.predict_proba(X_values)[:, 1]
plt.scatter(X_test, y_test, color='blue', label='Actual')
plt.plot(X_values, y_proba, color='red', linewidth=2, label='Predicted Probability')
plt.xlabel("X")
plt.ylabel("Probability")
plt.legend()
plt.show()
What is Logistic Regression?
Logistic Regression is a supervised learning algorithm used for classification problems.
Unlike Linear Regression, which predicts continuous values, Logistic Regression predicts
probabilities and assigns data points to discrete classes (e.g., 0 or 1, spam or not spam, disease
or no disease).
Mathematical Formulation
Instead of a direct linear equation like in Linear Regression, Logistic Regression uses the
sigmoid (logistic) function to map outputs between 0 and 1:
P(y=1) = 1 / (1 + e^(-(β0 + β1x1 + β2x2 + ... + βnxn)))
where:
P(y=1) is the probability that the output belongs to class 1.
β0, β1, ..., βn are the model coefficients.
x1, x2, ..., xn are the input features.
The sigmoid function transforms the linear output into a probability in the range (0, 1).
Classification Decision
Once the probability is computed, the decision boundary is set (commonly at 0.5):
If P(y=1) ≥ 0.5, classify as 1.
If P(y=1) < 0.5, classify as 0.
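A minimal sketch of the sigmoid function and the 0.5 threshold rule (the coefficients β0 and β1 are made up for illustration):
import numpy as np

def sigmoid(z):
    """Map any real value into the (0, 1) probability range."""
    return 1 / (1 + np.exp(-z))

# Assumed (illustrative) coefficients of a fitted one-feature model
beta0, beta1 = -4.0, 4.0
x = np.array([0.2, 0.8, 1.1, 1.7])

proba = sigmoid(beta0 + beta1 * x)     # P(y=1) for each sample
labels = (proba >= 0.5).astype(int)    # apply the 0.5 decision boundary
print(proba.round(3), labels)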
Types of Logistic Regression
1. Binary Logistic Regression (Two classes, e.g., spam vs. not spam).
2. Multinomial Logistic Regression (More than two classes, e.g., cat, dog, horse).
3. Ordinal Logistic Regression (Ordered classes, e.g., low, medium, high risk).
Loss Function in Logistic Regression
Instead of the Mean Squared Error (MSE) used in Linear Regression, Logistic Regression optimizes the Log Loss (Cross-Entropy Loss).
It ensures the model penalizes wrong classifications more strongly.
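A minimal sketch of the log-loss computation, L = -(1/N) Σ [y·log(p) + (1-y)·log(1-p)] (the labels and predicted probabilities are illustrative):
import numpy as np

def log_loss(y_true, y_prob, eps=1e-15):
    """Binary cross-entropy; eps keeps log() away from 0."""
    y_prob = np.clip(y_prob, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

y_true = np.array([1, 0, 1, 1])
y_prob = np.array([0.9, 0.2, 0.6, 0.95])  # confident correct predictions -> low loss
print(log_loss(y_true, y_prob))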
Logistic Regression
Predicts binary class probabilities.
The model fits an S-shaped sigmoid curve.
Note: Linear Regression produces continuous outputs, while Logistic Regression produces probabilities mapped to class labels.
When to Use Logistic Regression?
✅ When you need classification (yes/no, pass/fail, fraud/not fraud).
✅ When the relationship between independent variables and output is non-linear but can be
mapped using probabilities.
✅ When the dataset is small to medium-sized, as logistic regression is computationally
efficient.
Logistic Regression is suitable for classification tasks, whereas Linear Regression is for regression tasks.
Real-Life Examples of Linear and Logistic Regression
📌 Linear Regression Examples (Predicting Continuous Values)
1. House Price Prediction
o Predicting house prices based on factors like size, location, number of bedrooms, and
age.
2. Stock Market Forecasting
o Predicting stock prices based on past trends, economic indicators, and company
performance.
3. Salary Prediction
o Estimating an employee's salary based on experience, education, and skills.
4. Temperature Prediction
o Forecasting the temperature of a city based on historical weather data, humidity, and
wind speed.
Logistic Regression Examples (Predicting Classifications)
1. Spam Detection 📩
o Classifying emails as spam (1) or not spam (0) based on word frequency and metadata.
2. Disease Diagnosis 🏥
o Predicting whether a patient has diabetes (1) or not (0) based on glucose levels, age, and
BMI.
3. Credit Card Fraud Detection 💳
o Identifying fraudulent transactions based on transaction patterns, location, and
frequency.
4. Customer Churn Prediction 📊
o Predicting whether a customer will continue (0) or cancel (1) a subscription based on
usage and complaints.
5. Implement Decision Tree Classifier
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Load dataset
data = load_iris()
X, y = data.data, data.target
# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train Decision Tree classifier
clf = DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=42)
clf.fit(X_train, y_train)
# Make predictions
y_pred = clf.predict(X_test)
# Evaluate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')
# Display the decision tree rules
print(export_text(clf, feature_names=data.feature_names))
NOTES:
Decision Tree Classifier:
A Decision Tree Classifier is a supervised learning algorithm used for
classification tasks. It works by splitting the data into subsets based on feature
values, forming a tree-like structure where:
Each internal node represents a decision based on a feature.
Each branch represents an outcome of that decision.
Each leaf node represents a class label (final prediction).
How It Works:
The algorithm selects the best feature to split the dataset using criteria like:
Gini Impurity (default in sklearn; see the sketch after this list)
Entropy (Information Gain)
It recursively splits the data, forming a tree structure.
The process stops when:
A predefined depth is reached.
All samples in a node belong to the same class.
Further splits don’t improve accuracy.
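A minimal sketch of the Gini impurity calculation a splitter evaluates at a node (the class labels are illustrative, not sklearn's internal code):
import numpy as np

def gini_impurity(labels):
    """Gini = 1 - sum(p_k^2) over the class proportions p_k at a node."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1 - np.sum(p ** 2)

print(gini_impurity([0, 0, 0, 0]))        # 0.0 -> pure node
print(gini_impurity([0, 0, 1, 1]))        # 0.5 -> maximally mixed (2 classes)
print(gini_impurity([0, 0, 1, 2, 2, 2]))  # mixed 3-class node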
Advantages:
Easy to understand and interpret
Requires little data preprocessing (no need for feature scaling)
Handles both numerical and categorical data
Disadvantages:
Prone to overfitting (solved using pruning or ensemble methods like Random
Forest)
Can be unstable with small data changes
What is an Iris dataset?
The Iris dataset is a well-known dataset in machine learning and statistics, used primarily for
classification tasks. It consists of 150 samples of iris flowers, categorized into three species:
Setosa
Versicolor
Virginica
Each sample has four features (measured in centimeters):
Sepal length
Sepal width
Petal length
Petal width
It is often used in educational contexts for classification tasks because:
Simplicity – It is small, well-structured, and easy to understand and visualize.
Accessibility – It is built into scikit-learn, making it easy to access.
Balanced Classes – The dataset has three classes with roughly equal representation.
Benchmarking – Many algorithms have been tested on it, making it a good reference.
Well-Defined Features – The four numerical features (sepal length, sepal width, petal length,
petal width) provide clear distinctions between classes. The classes are well-separated, making it
a good dataset for testing classification algorithms.
However, you can use other datasets like:
Wine Dataset (sklearn.datasets.load_wine) – Good for multi-class classification.
Breast Cancer Dataset (sklearn.datasets.load_breast_cancer) – Used for binary classification.
Custom Data – You can use real-world datasets from CSV files or databases.
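A minimal sketch of swapping in one of these alternatives (load_wine here; the split parameters mirror the Iris example above):
from sklearn.datasets import load_wine
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Same workflow as the Iris example, only the dataset loader changes
data = load_wine()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

clf = DecisionTreeClassifier(max_depth=3, random_state=42)
clf.fit(X_train, y_train)
print(f"Accuracy: {accuracy_score(y_test, clf.predict(X_test)):.2f}")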
6. Implement Random Forest Classifier
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Load dataset
data = load_iris()
X, y = data.data, data.target
# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train Random Forest classifier
clf = RandomForestClassifier(n_estimators=100, criterion='gini', max_depth=3, random_state=42)
clf.fit(X_train, y_train)
# Make predictions
y_pred = clf.predict(X_test)
# Evaluate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')
NOTE: This program modifies the previous one to use a Random Forest classifier instead of a Decision Tree classifier.
NOTES:
A Random Forest Classifier is an ensemble learning method that builds multiple decision trees and
combines their predictions to improve accuracy and reduce overfitting. Here's how it works:
Bootstrap Sampling – The dataset is randomly sampled with replacement to create multiple training
subsets.
Multiple Decision Trees – A decision tree is trained on each subset.
Random Feature Selection – Each tree considers a random subset of features at each split, increasing
diversity among trees.
Voting/Averaging – For classification, the majority vote from all trees determines the final prediction.
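A minimal sketch of the bootstrap-and-vote idea, built by hand from a few decision trees (illustrative only; RandomForestClassifier does this internally and also randomizes the features considered at each split):
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(42)

# Train several trees, each on a bootstrap sample (rows drawn with replacement)
trees = []
for _ in range(5):
    idx = rng.integers(0, len(X), size=len(X))
    trees.append(DecisionTreeClassifier(max_depth=3, random_state=0).fit(X[idx], y[idx]))

# Majority vote across the trees gives the ensemble prediction
samples = X[::30]                                         # a few illustrative samples
votes = np.array([t.predict(samples) for t in trees])     # shape: (n_trees, n_samples)
majority = np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
print(votes)
print("Ensemble prediction:", majority)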
Advantages of Random Forest
Reduces overfitting compared to a single decision tree
Handles missing values and large datasets well
Works for both classification and regression tasks
Can measure feature importance
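A minimal sketch of reading feature importances from a Random Forest trained on the Iris data (parameters mirror the program above):
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

data = load_iris()
clf = RandomForestClassifier(n_estimators=100, random_state=42).fit(data.data, data.target)

# Importances sum to 1; higher values mean a feature contributed more to the forest's splits
for name, importance in zip(data.feature_names, clf.feature_importances_):
    print(f"{name}: {importance:.3f}")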