PROGRAM:
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import KBinsDiscretizer
# Sample dataset
data = {
    'Age': [25, 30, 35, np.nan, 40, 50, 60, 22, np.nan, 28],
    'Income': [50000, 60000, 80000, 90000, np.nan, 120000, 140000, 65000, 70000, 50000],
    'Gender': ['M', 'F', 'M', 'F', 'M', 'M', 'F', 'M', 'M', 'F'],
    'Score': [88, 92, 95, 88, 75, 84, 80, 91, 89, 70]
}
# Create a DataFrame
df = pd.DataFrame(data)
print("Original Data:")
print(df)
# a. Attribute Selection (Selecting only relevant features)
# For simplicity, let's assume we are interested in 'Age' and 'Income' only
df_selected = df[['Age', 'Income']]
print("\nSelected Attributes:")
print(df_selected)
# b. Handling Missing Values
# Impute missing values using the mean strategy for numerical columns
imputer = SimpleImputer(strategy='mean')
# We can use 'median' or 'most_frequent' as well
df_selected_imputed = pd.DataFrame(imputer.fit_transform(df_selected), columns=df_selected.columns)
print("\nData after Handling Missing Values:")
print(df_selected_imputed)
# c. Discretization (Binning continuous variables like 'Age' and 'Income')
# We will use KBinsDiscretizer to convert continuous variables into discrete bins
discretizer = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform')
df_discretized = pd.DataFrame(discretizer.fit_transform(df_selected_imputed),
                              columns=df_selected_imputed.columns)
print("\nData after Discretization:")
print(df_discretized)
# d. Elimination of Outliers
# We'll use the IQR method to detect and remove outliers for the 'Age' and 'Income' columns
# Calculate IQR
Q1 = df_selected_imputed.quantile(0.25)
Q3 = df_selected_imputed.quantile(0.75)
IQR = Q3 - Q1
# Define outlier conditions
outlier_condition = ((df_selected_imputed < (Q1 - 1.5 * IQR)) | (df_selected_imputed > (Q3 + 1.5 * IQR)))
# Remove rows with outliers
df_no_outliers = df_selected_imputed[~outlier_condition.any(axis=1)]
print("\nData after Eliminating Outliers:")
print(df_no_outliers)
OUTPUT:
Original Data:
    Age    Income Gender  Score
0  25.0   50000.0      M     88
1  30.0   60000.0      F     92
2  35.0   80000.0      M     95
3   NaN   90000.0      F     88
4  40.0       NaN      M     75
5  50.0  120000.0      M     84
6  60.0  140000.0      F     80
7  22.0   65000.0      M     91
8   NaN   70000.0      M     89
9  28.0   50000.0      F     70
Selected Attributes:
    Age    Income
0  25.0   50000.0
1  30.0   60000.0
2  35.0   80000.0
3   NaN   90000.0
4  40.0       NaN
5  50.0  120000.0
6  60.0  140000.0
7  22.0   65000.0
8   NaN   70000.0
9  28.0   50000.0
Data after Handling Missing Values:
     Age         Income
0  25.00   50000.000000
1  30.00   60000.000000
2  35.00   80000.000000
3  36.25   90000.000000
4  40.00   80555.555556
5  50.00  120000.000000
6  60.00  140000.000000
7  22.00   65000.000000
8  36.25   70000.000000
9  28.00   50000.000000
Data after Discretization:
   Age  Income
0  0.0     0.0
1  0.0     0.0
2  1.0     1.0
3  1.0     1.0
4  1.0     1.0
5  2.0     2.0
6  2.0     2.0
7  0.0     0.0
8  1.0     0.0
9  0.0     0.0
Data after Eliminating Outliers:
     Age         Income
0  25.00   50000.000000
1  30.00   60000.000000
2  35.00   80000.000000
3  36.25   90000.000000
4  40.00   80555.555556
5  50.00  120000.000000
7  22.00   65000.000000
8  36.25   70000.000000
9  28.00   50000.000000
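
CROSS-CHECK (sketch):
The imputation, binning, and outlier steps in the program can be checked by hand with NumPy alone. The sketch below is not part of the original program. The helper names impute_mean, uniform_bins, and iqr_fences are hypothetical, and the binning rule (equal-width edges, with a value that lies exactly on an edge assigned to the upper bin) is an assumption about how the 'uniform' strategy behaves.

```python
import numpy as np

# Raw columns copied from the sample dataset in the program above.
raw_age = np.array([25, 30, 35, np.nan, 40, 50, 60, 22, np.nan, 28])
raw_income = np.array([50000, 60000, 80000, 90000, np.nan,
                       120000, 140000, 65000, 70000, 50000])


def impute_mean(values):
    """Replace NaN entries with the mean of the observed entries."""
    return np.where(np.isnan(values), np.nanmean(values), values)


def uniform_bins(values, n_bins=3):
    """Equal-width binning: return (bin edges, ordinal bin index per value)."""
    edges = np.linspace(values.min(), values.max(), n_bins + 1)
    # Compare against the interior edges only; side="right" sends a value
    # that sits exactly on an edge to the upper bin.
    codes = np.searchsorted(edges[1:-1], values, side="right")
    return edges, codes


def iqr_fences(values, k=1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


age = impute_mean(raw_age)
income = impute_mean(raw_income)

age_edges, age_codes = uniform_bins(age)
income_edges, income_codes = uniform_bins(income)

age_lo, age_hi = iqr_fences(age)
income_lo, income_hi = iqr_fences(income)

# A row counts as an outlier if either column falls outside its fences.
outlier_mask = ((age < age_lo) | (age > age_hi) |
                (income < income_lo) | (income > income_hi))
outlier_rows = np.flatnonzero(outlier_mask)

print("Imputed Age:", age)
print("Age bin edges:", age_edges)
print("Income bin edges:", income_edges)
print("Age fences:", (age_lo, age_hi))
print("Income fences:", (income_lo, income_hi))
print("Outlier row indices:", outlier_rows)
```

On this dataset the mean of the observed ages is 36.25, and row 6 (Age 60, Income 140000) is the only row outside the 1.5 * IQR fences in either column.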