Arbitrary Value Imputation.

The document outlines a data preprocessing workflow using Python libraries such as pandas, numpy, and sklearn on a Titanic dataset. It includes steps for handling missing values through different imputation strategies and visualizing the effects of these strategies on the data distributions. Additionally, it demonstrates the use of a ColumnTransformer for applying multiple imputation methods simultaneously.

Uploaded by

Rudraksh Amar


In [29]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [30]:
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer

In [31]:
df = pd.read_csv('titanic_toy.csv')

In [32]:
df.head()

Out[32]:
    Age     Fare  Family  Survived
0  22.0   7.2500       1         0
1  38.0  71.2833       1         1
2  26.0   7.9250       0         1
3  35.0  53.1000       1         1
4  35.0   8.0500       0         0

In [33]:
df.isnull().mean()

Out[33]:
Age         0.198653
Fare        0.050505
Family      0.000000
Survived    0.000000
dtype: float64

In [34]:
X = df.drop(columns=['Survived'])
y = df['Survived']

In [35]:
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_stat

In [36]:
# Work on an explicit copy so that adding columns does not raise SettingWithCopyWarning
X_train = X_train.copy()

X_train['Age_99'] = X_train['Age'].fillna(99)
X_train['Age_minus1'] = X_train['Age'].fillna(-1)

X_train['Fare_999'] = X_train['Fare'].fillna(999)
X_train['Fare_minus1'] = X_train['Fare'].fillna(-1)

In [37]:
print('Original Age variable variance: ', X_train['Age'].var())
print('Age variance after imputation with 99: ', X_train['Age_99'].var())
print('Age variance after imputation with -1: ', X_train['Age_minus1'].var())

print('Original Fare variable variance: ', X_train['Fare'].var())
print('Fare variance after imputation with 999: ', X_train['Fare_999'].var())
print('Fare variance after imputation with -1: ', X_train['Fare_minus1'].var())

Original Age variable variance: 204.3495133904614
Age variance after imputation with 99: 951.7275570187172
Age variance after imputation with -1: 318.0896202624484
Original Fare variable variance: 2448.197913706318
Fare variance after imputation with 999: 47219.20265217623
Fare variance after imputation with -1: 2378.5676784883503
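The output above shows that an arbitrary placeholder can move the variance in either direction: Age variance rises from about 204 to 952 with 99, while Fare variance drops slightly (2448 to 2379) with -1 because -1 sits close to the bulk of the fares. The following is a minimal sketch of the same comparison on a small made-up column (the values are hypothetical and are not taken from titanic_toy.csv), so the effect can be checked by hand.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column (not the Titanic data): five observed ages, two gaps.
age = pd.Series([22.0, 38.0, 26.0, 35.0, np.nan, 54.0, np.nan], name="Age")

var_original = age.var()            # pandas skips NaN by default
var_99 = age.fillna(99).var()       # placeholder far above the observed range
var_minus1 = age.fillna(-1).var()   # placeholder just below the observed range

print(f"original:       {var_original:.1f}")
print(f"filled with 99: {var_99:.1f}")
print(f"filled with -1: {var_minus1:.1f}")
```

In this toy case both placeholders inflate the variance, and the one further from the observed values inflates it more.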

In [38]:
fig = plt.figure()
ax = fig.add_subplot(111)

# original variable distribution
X_train['Age'].plot(kind='kde', ax=ax)

# variable imputed with 99
X_train['Age_99'].plot(kind='kde', ax=ax, color='red')

# variable imputed with -1
X_train['Age_minus1'].plot(kind='kde', ax=ax, color='green')

# add legends
lines, labels = ax.get_legend_handles_labels()
ax.legend(lines, labels, loc='best')
Out[38]: <matplotlib.legend.Legend at 0x227a8f2a3a0>

In [20]:
fig = plt.figure()
ax = fig.add_subplot(111)

# original variable distribution
X_train['Fare'].plot(kind='kde', ax=ax)

# variable imputed with 999
X_train['Fare_999'].plot(kind='kde', ax=ax, color='red')

# variable imputed with -1
X_train['Fare_minus1'].plot(kind='kde', ax=ax, color='green')

# add legends
lines, labels = ax.get_legend_handles_labels()
ax.legend(lines, labels, loc='best')

Out[20]: <matplotlib.legend.Legend at 0x227a8bb0430>

In [13]:
X_train.cov()
Out[13]:
                     Age         Fare     Family       Age_99   Age_minus1      Fare_99
Age           204.349513    70.719262  -6.498901   204.349513   204.349513    162.79343
Fare           70.719262  2448.197914  17.258917  -101.671097   125.558364   2448.19791
Family         -6.498901    17.258917   2.735252    -7.387287    -4.149246     11.52862
Age_99        204.349513  -101.671097  -7.387287   951.727557  -189.535540   -159.93166
Age_minus1    204.349513   125.558364  -4.149246  -189.535540   318.089620    257.37988
Fare_999      162.793430  2448.197914  11.528625  -159.931663   257.379887  47219.20265
Fare_minus1    63.321188  2448.197914  16.553989   -94.317400   114.394141    762.47498

In [14]:
X_train.corr()

Out[14]:
                   Age       Fare     Family     Age_99 Age_minus1   Fare_999  Fare_m
Age           1.000000   0.092644  -0.299113   1.000000   1.000000   0.051179    0.08
Fare          0.092644   1.000000   0.208268  -0.066273   0.142022   1.000000    1.00
Family       -0.299113   0.208268   1.000000  -0.144787  -0.140668   0.032079    0.20
Age_99        1.000000  -0.066273  -0.144787   1.000000  -0.344476  -0.023857   -0.06
Age_minus1    1.000000   0.142022  -0.140668  -0.344476   1.000000   0.066411     0.1
Fare_999      0.051179   1.000000   0.032079  -0.023857   0.066411   1.000000    0.07
Fare_minus1   0.084585   1.000000   0.205233  -0.062687   0.131514   0.071946    1.00
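In the tables above, the Age/Fare correlation of 0.092644 becomes -0.066273 once Age is filled with 99. The sketch below reproduces that kind of shift on a hypothetical two-column frame (made-up numbers, not the notebook's data). It relies on the fact that pandas computes correlations from the rows where both values are present, so the "before" figure uses only the complete pairs.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: Age and Fare are perfectly correlated where Age is known.
toy = pd.DataFrame({
    "Age":  [20.0, 30.0, np.nan, 40.0, 50.0, np.nan],
    "Fare": [10.0, 20.0, 5.0, 30.0, 40.0, 8.0],
})

# Series.corr drops rows in which either value is missing.
corr_before = toy["Age"].corr(toy["Fare"])

# After imputation the two filled rows pair a very high Age with a low Fare.
corr_after = toy["Age"].fillna(99).corr(toy["Fare"])

print(f"before imputation: {corr_before:.3f}")
print(f"after filling 99:  {corr_after:.3f}")
```

Here the placeholder rows are enough to flip the sign of the relationship, which is the same distortion visible in the Age_99 row of the notebook's output.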

Using Sklearn
In [39]:
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_stat

In [22]:
imputer1 = SimpleImputer(strategy='constant',fill_value=99)
imputer2 = SimpleImputer(strategy='constant',fill_value=999)

In [40]:
trf = ColumnTransformer([
    ('imputer1', imputer1, ['Age']),
    ('imputer2', imputer2, ['Fare'])
], remainder='passthrough')

In [41]:
trf.fit(X_train)
Out[41]: ColumnTransformer(remainder='passthrough',
transformers=[('imputer1',
SimpleImputer(fill_value=99,
strategy='constant'),
['Age']),
('imputer2',
SimpleImputer(fill_value=999,
strategy='constant'),
['Fare'])])

In [42]:
trf.named_transformers_['imputer1'].statistics_

Out[42]: array([99.])

In [43]:
trf.named_transformers_['imputer2'].statistics_

Out[43]: array([999.])
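With strategy='constant', nothing is learned from the data: statistics_ simply echoes the fill_value, as the two outputs above show. A minimal standalone sketch on a hypothetical single-column array (not the notebook's X_train) illustrates this:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical one-column input with a single missing entry.
X_toy = np.array([[22.0], [np.nan], [35.0]])

imputer = SimpleImputer(strategy="constant", fill_value=99)
imputer.fit(X_toy)

# For the constant strategy, statistics_ holds the fill value for each column.
print(imputer.statistics_)

# transform replaces only the missing entry and leaves observed values untouched.
filled = imputer.transform(X_toy)
print(filled.ravel())
```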

In [44]:
X_train = trf.transform(X_train)
X_test = trf.transform(X_test)

In [45]:
X_train

Out[45]: array([[ 40.    ,  27.7208,   0.    ],
       [  4.    ,  16.7   ,   2.    ],
       [ 47.    ,   9.    ,   0.    ],
       ...,
       [ 71.    ,  49.5042,   0.    ],
       [ 99.    , 221.7792,   0.    ],
       [ 99.    ,  25.925 ,   0.    ]])
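The transformed result above is a plain NumPy array, so the column names are lost. The columns come out in transformer order (Age, then Fare), followed by the passthrough remainder (Family). One way to restore the labels, sketched here on a hypothetical three-row frame rather than the notebook's split, is to rebuild a DataFrame with that column order written out by hand:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Hypothetical toy frame with the same column names as the notebook.
X_toy = pd.DataFrame({
    "Age": [40.0, np.nan, 47.0],
    "Fare": [27.7208, 16.7, np.nan],
    "Family": [0, 2, 0],
})

trf_toy = ColumnTransformer([
    ("imputer1", SimpleImputer(strategy="constant", fill_value=99), ["Age"]),
    ("imputer2", SimpleImputer(strategy="constant", fill_value=999), ["Fare"]),
], remainder="passthrough")

out = trf_toy.fit_transform(X_toy)

# Output order: columns of imputer1, then imputer2, then the passthrough remainder.
columns = ["Age", "Fare", "Family"]
X_filled = pd.DataFrame(out, columns=columns, index=X_toy.index)
print(X_filled)
```

Writing the column list by hand only works while it matches the order of the transformers, so it has to be updated whenever the ColumnTransformer definition changes.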
