IML Program 2

The document outlines a data processing program using Python's pandas and scikit-learn libraries. It demonstrates attribute selection, handling missing values through imputation, discretization of continuous variables, and elimination of outliers using the IQR method. The program processes a sample dataset containing age and income information, ultimately producing a cleaned and transformed dataset.

Uploaded by logitech9966

PROGRAM:

import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import KBinsDiscretizer
# Sample dataset
data = {
'Age': [25, 30, 35, np.nan, 40, 50, 60, 22, np.nan, 28],
'Income': [50000, 60000, 80000, 90000, np.nan, 120000, 140000, 65000, 70000, 50000],
'Gender': ['M', 'F', 'M', 'F', 'M', 'M', 'F', 'M', 'M', 'F'],
'Score': [88, 92, 95, 88, 75, 84, 80, 91, 89, 70]
}
# Create a DataFrame
df = pd.DataFrame(data)
print("Original Data:")
print(df)
# a. Attribute Selection (Selecting only relevant features)
# For simplicity, let's assume we are interested in 'Age' and 'Income' only
df_selected = df[['Age', 'Income']]
print("\nSelected Attributes:")
print(df_selected)
# b. Handling Missing Values
# Impute missing values using the mean strategy for numerical columns
imputer = SimpleImputer(strategy='mean')
# We can use 'median' or 'most_frequent' as well
df_selected_imputed = pd.DataFrame(imputer.fit_transform(df_selected), columns=df_selected.columns)
print("\nData after Handling Missing Values:")
print(df_selected_imputed)

# c. Discretization (Binning continuous variables like 'Age' and 'Income')
# We will use KBinsDiscretizer to convert continuous variables into discrete bins
discretizer = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform')
df_discretized = pd.DataFrame(discretizer.fit_transform(df_selected_imputed),
                              columns=df_selected_imputed.columns)
print("\nData after Discretization:")
print(df_discretized)

# d. Elimination of Outliers
# We'll use the IQR method to detect and remove outliers for the 'Age' and 'Income' columns
# Calculate IQR
Q1 = df_selected_imputed.quantile(0.25)
Q3 = df_selected_imputed.quantile(0.75)
IQR = Q3 - Q1
# Define outlier conditions
outlier_condition = ((df_selected_imputed < (Q1 - 1.5 * IQR)) | (df_selected_imputed > (Q3 + 1.5 * IQR)))
# Remove rows with outliers
df_no_outliers = df_selected_imputed[~outlier_condition.any(axis=1)]
print("\nData after Eliminating Outliers:")
print(df_no_outliers)
OUTPUT:

Original Data:
    Age    Income Gender  Score
0  25.0   50000.0      M     88
1  30.0   60000.0      F     92
2  35.0   80000.0      M     95
3   NaN   90000.0      F     88
4  40.0       NaN      M     75
5  50.0  120000.0      M     84
6  60.0  140000.0      F     80
7  22.0   65000.0      M     91
8   NaN   70000.0      M     89
9  28.0   50000.0      F     70

Selected Attributes:
    Age    Income
0  25.0   50000.0
1  30.0   60000.0
2  35.0   80000.0
3   NaN   90000.0
4  40.0       NaN
5  50.0  120000.0
6  60.0  140000.0
7  22.0   65000.0
8   NaN   70000.0
9  28.0   50000.0

Data after Handling Missing Values:
     Age         Income
0  25.00   50000.000000
1  30.00   60000.000000
2  35.00   80000.000000
3  36.25   90000.000000
4  40.00   80555.555556
5  50.00  120000.000000
6  60.00  140000.000000
7  22.00   65000.000000
8  36.25   70000.000000
9  28.00   50000.000000
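
The program comments note that 'median' or 'most_frequent' could replace the mean strategy. As a rough sketch (not part of the original program), the snippet below rebuilds the 'Age' and 'Income' columns from the sample dataset and prints the fill value SimpleImputer learns per column for the mean and median strategies. The names `column_means` and `fill_values` are introduced here for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Same 'Age' and 'Income' values as the sample dataset in the program
df_selected = pd.DataFrame({
    'Age': [25, 30, 35, np.nan, 40, 50, 60, 22, np.nan, 28],
    'Income': [50000, 60000, 80000, 90000, np.nan,
               120000, 140000, 65000, 70000, 50000],
})

# Column means over the non-missing entries (pandas skips NaN by default)
column_means = df_selected.mean()

# Fill value learned for each column under each strategy
fill_values = {}
for strategy in ('mean', 'median'):
    imputer = SimpleImputer(strategy=strategy)
    imputer.fit(df_selected)
    # statistics_ holds one fill value per input column, in column order
    fill_values[strategy] = dict(zip(df_selected.columns, imputer.statistics_))

print(column_means)
print(fill_values)
```

With a right-skewed column such as 'Income', the median is typically less affected by the largest values than the mean, which is one reason it may be preferred before an outlier step.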

Data after Discretization:
   Age  Income
0  0.0     0.0
1  0.0     0.0
2  1.0     1.0
3  1.0     1.0
4  1.0     1.0
5  2.0     2.0
6  2.0     2.0
7  0.0     0.0
8  1.0     0.0
9  0.0     0.0
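
To see how each value ends up in its bin, it may help to inspect the edges the discretizer learned. The sketch below (an illustration, not part of the original program) repeats the imputation and discretization steps on the same two columns, prints `bin_edges_`, and rebuilds the same equal-width edges by hand with `np.linspace`. The names `imputed`, `codes` and `manual_edges` are assumptions made for this example.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import KBinsDiscretizer

# Same 'Age' and 'Income' values as the sample dataset in the program
df_selected = pd.DataFrame({
    'Age': [25, 30, 35, np.nan, 40, 50, 60, 22, np.nan, 28],
    'Income': [50000, 60000, 80000, 90000, np.nan,
               120000, 140000, 65000, 70000, 50000],
})

# Mean imputation, as in step b of the program
imputed = pd.DataFrame(
    SimpleImputer(strategy='mean').fit_transform(df_selected),
    columns=df_selected.columns,
)

# Three equal-width bins per column, encoded as ordinal codes 0, 1, 2
discretizer = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform')
codes = discretizer.fit_transform(imputed)

# bin_edges_ holds one array of n_bins + 1 edges per column
for name, edges in zip(imputed.columns, discretizer.bin_edges_):
    print(name, np.round(edges, 2))

# The same edges by hand: equal steps from the column minimum to its maximum
manual_edges = {
    name: np.linspace(imputed[name].min(), imputed[name].max(), 4)
    for name in imputed.columns
}
print(manual_edges)
```

Because the 'uniform' strategy depends only on each column's minimum and maximum, a single extreme value stretches every bin; the 'quantile' strategy is an alternative when roughly equal bin counts are wanted.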

Data after Eliminating Outliers:
     Age         Income
0  25.00   50000.000000
1  30.00   60000.000000
2  35.00   80000.000000
3  36.25   90000.000000
4  40.00   80555.555556
5  50.00  120000.000000
7  22.00   65000.000000
8  36.25   70000.000000
9  28.00   50000.000000
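
To check why a given row is kept or dropped by the IQR rule, the lower and upper fences can be printed per column. The following is a minimal sketch under stated assumptions, not the original program: it fills missing values with `fillna` on the column means (which should match mean imputation for these numeric columns), and the names `fences`, `is_outlier` and `flagged_rows` are introduced only for this example.

```python
import numpy as np
import pandas as pd

# Same 'Age' and 'Income' values as the sample dataset in the program
df_selected = pd.DataFrame({
    'Age': [25, 30, 35, np.nan, 40, 50, 60, 22, np.nan, 28],
    'Income': [50000, 60000, 80000, 90000, np.nan,
               120000, 140000, 65000, 70000, 50000],
})

# Mean imputation done with pandas to keep the sketch short
imputed = df_selected.fillna(df_selected.mean())

# Quartiles and interquartile range per column
q1 = imputed.quantile(0.25)
q3 = imputed.quantile(0.75)
iqr = q3 - q1

# Fences: values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] count as outliers
fences = pd.DataFrame({'lower': q1 - 1.5 * iqr, 'upper': q3 + 1.5 * iqr})
print(fences)

# A row is flagged if any of its columns falls outside that column's fences
is_outlier = (imputed < fences['lower']) | (imputed > fences['upper'])
flagged_rows = imputed[is_outlier.any(axis=1)]
print(flagged_rows)
```

Printing the flagged rows before dropping them makes it easier to confirm that the rule removes only the records intended.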
