Assignment 23
NIELIT
1. Handle missing values in a dataset by filling with mean, median, and dropping rows.
2. Encode categorical data using one-hot encoding and label encoding.
3. Scale features using standardization and normalization.
4. Split the dataset into training and testing sets using an 80-20 split.
5. Remove duplicate rows from a dataset.
6. Rename columns in a dataset.
Take a dataset for the above questions and write the code for the questions mentioned.
OR
Use the dataset below and write the code for the questions mentioned above.
data = {
    'ID': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Color': ['Red', 'Blue', 'Green', 'Blue', 'Red', 'Green', 'Red', np.nan, 'Blue', 'Green'],
    'Size': ['S', 'M', 'L', 'XL', 'M', 'S', 'XL', 'L', 'M', 'S'],
    'Height': [150, 160, 170, np.nan, 190, 180, 175, 165, np.nan, 155],
    'Weight': [60, 65, 70, 75, 80, 85, np.nan, 95, 90, np.nan],
    'Age': [25, 30, 35, 40, 45, np.nan, 55, 60, 65, 70],
    'Feature1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Feature2': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
}
# Convert to DataFrame
df = pd.DataFrame(data)
Solution
import pandas as pd
import numpy as np
from sklearn.preprocessing import (OneHotEncoder, LabelEncoder,
                                   StandardScaler, MinMaxScaler)
from sklearn.model_selection import train_test_split
# Sample dataset
data = {
    'ID': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Color': ['Red', 'Blue', 'Green', 'Blue', 'Red', 'Green', 'Red', np.nan, 'Blue', 'Green'],
    'Size': ['S', 'M', 'L', 'XL', 'M', 'S', 'XL', 'L', 'M', 'S'],
    'Height': [150, 160, 170, np.nan, 190, 180, 175, 165, np.nan, 155],
    'Weight': [60, 65, 70, 75, 80, 85, np.nan, 95, 90, np.nan],
    'Age': [25, 30, 35, 40, 45, np.nan, 55, 60, 65, 70],
    'Feature1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Feature2': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
}
df = pd.DataFrame(data)
# 1. Handling missing values
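# Filling with mean for numerical columns
# numeric_only=True skips 'Color' and 'Size'; without it, pandas 2.x raises a TypeError
# The missing 'Color' value is left as NaN here because it has no mean or median
df_mean_filled = df.fillna(df.mean(numeric_only=True))
# Filling with median for numerical columns
df_median_filled = df.fillna(df.median(numeric_only=True))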
# Dropping rows with any missing values
df_dropna = df.dropna()
# 2. Encoding categorical data
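# One-hot encoding (the missing color becomes its own 'Missing' category)
# sparse_output requires scikit-learn >= 1.2; older versions use sparse=False instead
ohe = OneHotEncoder(sparse_output=False)
color_encoded = ohe.fit_transform(df[['Color']].fillna('Missing'))
color_encoded_df = pd.DataFrame(color_encoded,
                                columns=ohe.get_feature_names_out(['Color']))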
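# Label encoding
# LabelEncoder assigns codes in sorted order (L=0, M=1, S=2, XL=3),
# which does not follow the natural size order S < M < L < XL
le = LabelEncoder()
size_encoded = le.fit_transform(df['Size'])
df['Size_LabelEncoded'] = size_encoded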
# Merging one-hot encoded columns back to the dataframe
df = df.drop(columns=['Color']).join(color_encoded_df)
# 3. Scaling features
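# Standardization (zero mean, unit variance)
# Missing values are filled with the column mean first; filling with 0 would distort the mean and variance
num_cols = ['Height', 'Weight', 'Age']
scaler_standard = StandardScaler()
df[num_cols] = scaler_standard.fit_transform(df[num_cols].fillna(df[num_cols].mean()))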
# Normalization (min-max scaling to the range [0, 1])
scaler_minmax = MinMaxScaler()
df[['Feature1', 'Feature2']] = scaler_minmax.fit_transform(df[['Feature1', 'Feature2']])
# 4. Splitting the dataset
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)
# 5. Removing duplicate rows
df_no_duplicates = df.drop_duplicates()
# 6. Renaming columns
df.rename(columns={'Size_LabelEncoded': 'Size_Encoded'}, inplace=True)
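The two scalers in step 3 can be checked by hand. The sketch below is an illustration, not part of the assigned solution: it recomputes standardization as (x - mean) / std and min-max normalization as (x - min) / (max - min) for the 'Feature1' values and compares the results with scikit-learn. The variable names are chosen here for illustration only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# 'Feature1' values from the sample dataset, as a single-column 2D array
x = np.arange(1, 11, dtype=float).reshape(-1, 1)

# Standardization by hand; np.std uses the population formula (ddof=0),
# which is what StandardScaler uses as well
z_manual = (x - x.mean()) / x.std()
z_sklearn = StandardScaler().fit_transform(x)

# Min-max normalization by hand
mm_manual = (x - x.min()) / (x.max() - x.min())
mm_sklearn = MinMaxScaler().fit_transform(x)

print(np.allclose(z_manual, z_sklearn))    # True
print(np.allclose(mm_manual, mm_sklearn))  # True
```

Standardized values are centered on 0 and can be negative, while min-max values always fall between 0 and 1.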
Explanation:
1. Handling Missing Values: The code demonstrates three approaches to handling missing data: filling with the mean, filling with the median, and dropping rows.
2. Encoding Categorical Data: Categorical variables are encoded using both one-hot encoding and label encoding. One-hot encoding is used for the 'Color' column, while label encoding is applied to the 'Size' column.
3. Scaling Features: Numerical features are scaled using standardization and normalization. StandardScaler is used to standardize 'Height', 'Weight', and 'Age'. MinMaxScaler is used for normalizing 'Feature1' and 'Feature2'.
4. Splitting the Dataset: The dataset is split into training and testing sets using an 80-20 split.
5. Removing Duplicate Rows: Any duplicate rows in the dataset are removed.
6. Renaming Columns: The code renames a column to reflect its encoded nature.
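To make the three missing-value strategies concrete, the following sketch applies them to the columns of the sample dataset that contain gaps and prints the values involved. It assumes pandas 2.x, where `numeric_only=True` is needed to skip the text column. The variable names are illustrative and not part of the original solution.

```python
import numpy as np
import pandas as pd

# Columns of the sample dataset that contain missing values
df = pd.DataFrame({
    'Color': ['Red', 'Blue', 'Green', 'Blue', 'Red', 'Green', 'Red', np.nan, 'Blue', 'Green'],
    'Height': [150, 160, 170, np.nan, 190, 180, 175, 165, np.nan, 155],
    'Weight': [60, 65, 70, 75, 80, 85, np.nan, 95, 90, np.nan],
    'Age': [25, 30, 35, 40, 45, np.nan, 55, 60, 65, 70],
})

# NaN entries are skipped when computing the statistics
means = df.mean(numeric_only=True)
medians = df.median(numeric_only=True)

mean_filled = df.fillna(means)      # numeric gaps get the column mean
median_filled = df.fillna(medians)  # numeric gaps get the column median
dropped = df.dropna()               # keeps only fully populated rows

print(means['Height'], medians['Height'])  # 168.125 167.5
print(len(dropped))                        # 4
print(mean_filled['Color'].isna().sum())   # 1 (the text column is not filled)
```

Dropping rows keeps only 4 of the 10 rows here, which shows how costly `dropna()` can be on a small dataset compared with filling.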
# Display the final DataFrame
print(df)
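Because every 'ID' in the sample dataset is unique, `drop_duplicates()` removes nothing from it. The sketch below appends a repeated row, which is a hypothetical addition made only for illustration, so that the effect of deduplication and of the 80-20 split can be seen in the row counts.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Columns without missing values from the sample dataset
df = pd.DataFrame({
    'ID': list(range(1, 11)),
    'Feature1': list(range(1, 11)),
    'Feature2': list(range(10, 101, 10)),
})

# Hypothetical: repeat the first row so there is a duplicate to remove
df_with_dup = pd.concat([df, df.iloc[[0]]], ignore_index=True)
deduped = df_with_dup.drop_duplicates()

# 80-20 split on the deduplicated data; random_state fixes the shuffle
train_df, test_df = train_test_split(deduped, test_size=0.2, random_state=42)

print(len(df_with_dup), len(deduped))  # 11 10
print(len(train_df), len(test_df))     # 8 2
```

Removing duplicates before splitting is one reasonable ordering, since it prevents the same row from ending up in both the training and the testing set.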