Untitled Document 5

Uploaded by

saurabhsin6294
Copyright
© © All Rights Reserved

Assignment 23

NIELIT

1. Handle missing values in a dataset by filling with the mean, filling with the median, and dropping rows.
2. Encode categorical data using one-hot encoding and label encoding.
3. Scale features using standardization and normalization.
4. Split the dataset into training and testing sets using an 80-20 split.
5. Remove duplicate rows from a dataset.
6. Rename columns in a dataset.

Take a dataset for the above questions and write the code for the questions mentioned.

OR

Use the dataset below and write the code for the questions mentioned above.

data = {
    'ID': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Color': ['Red', 'Blue', 'Green', 'Blue', 'Red', 'Green', 'Red', np.nan, 'Blue', 'Green'],
    'Size': ['S', 'M', 'L', 'XL', 'M', 'S', 'XL', 'L', 'M', 'S'],
    'Height': [150, 160, 170, np.nan, 190, 180, 175, 165, np.nan, 155],
    'Weight': [60, 65, 70, 75, 80, 85, np.nan, 95, 90, np.nan],
    'Age': [25, 30, 35, 40, 45, np.nan, 55, 60, 65, 70],
    'Feature1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Feature2': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
}

# Convert to DataFrame
df = pd.DataFrame(data)

Solution

import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split

# Sample dataset
data = {
    'ID': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Color': ['Red', 'Blue', 'Green', 'Blue', 'Red', 'Green', 'Red', np.nan, 'Blue', 'Green'],
    'Size': ['S', 'M', 'L', 'XL', 'M', 'S', 'XL', 'L', 'M', 'S'],
    'Height': [150, 160, 170, np.nan, 190, 180, 175, 165, np.nan, 155],
    'Weight': [60, 65, 70, 75, 80, 85, np.nan, 95, 90, np.nan],
    'Age': [25, 30, 35, 40, 45, np.nan, 55, 60, 65, 70],
    'Feature1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Feature2': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
}

df = pd.DataFrame(data)

# 1. Handling missing values

# Filling with mean for numerical columns
# (numeric_only=True skips the text columns 'Color' and 'Size')
df_mean_filled = df.fillna(df.mean(numeric_only=True))

# Filling with median for numerical columns
df_median_filled = df.fillna(df.median(numeric_only=True))

# Dropping rows with any missing values
df_dropna = df.dropna()

# 2. Encoding categorical data

# One-hot encoding
# (sparse_output replaces the older 'sparse' argument from scikit-learn 1.2 onward)
ohe = OneHotEncoder(sparse_output=False)
# Missing colors are treated as their own category, 'Missing'
color_encoded = ohe.fit_transform(df[['Color']].fillna('Missing'))
color_encoded_df = pd.DataFrame(color_encoded,
                                columns=ohe.get_feature_names_out(['Color']))

# Label encoding
le = LabelEncoder()
size_encoded = le.fit_transform(df['Size'])
df['Size_LabelEncoded'] = size_encoded

# Merging one-hot encoded columns back to the dataframe
df = df.drop(columns=['Color']).join(color_encoded_df)

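# 3. Scaling features

# Standardization (zero mean, unit variance)
# Missing values are filled with the column mean first; filling with 0
# would distort the mean and variance of columns such as 'Height'.
num_cols = ['Height', 'Weight', 'Age']
scaler_standard = StandardScaler()
df[num_cols] = scaler_standard.fit_transform(df[num_cols].fillna(df[num_cols].mean()))

# Normalization (min-max scaling to the range [0, 1])
scaler_minmax = MinMaxScaler()
df[['Feature1', 'Feature2']] = scaler_minmax.fit_transform(df[['Feature1', 'Feature2']])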

# 4. Splitting the dataset
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

# 5. Removing duplicate rows
df_no_duplicates = df.drop_duplicates()

# 6. Renaming columns
df.rename(columns={'Size_LabelEncoded': 'Size_Encoded'}, inplace=True)
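
Every row in the sample dataset has a unique 'ID', so step 5 returns the DataFrame unchanged. To see what drop_duplicates() does when repeats are present, a minimal sketch on a toy frame may help. The frame toy and the choice of subset column are assumptions made for illustration and are not part of the assignment data.

```python
import pandas as pd

# Toy frame (hypothetical) in which the last row repeats the first one
toy = pd.DataFrame({
    'ID': [1, 2, 1],
    'Color': ['Red', 'Blue', 'Red'],
})

# Default: a row is a duplicate only if every column matches an earlier row
deduped = toy.drop_duplicates()
print(len(toy), len(deduped))  # 3 2

# subset= restricts the comparison to the chosen columns;
# the first occurrence is kept by default (keep='first')
by_color = toy.drop_duplicates(subset=['Color'])
print(by_color['ID'].tolist())  # [1, 2]
```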

Explanation:

Handling Missing Values: The code demonstrates three approaches to handling missing data: filling with the mean, filling with the median, and dropping rows.
Encoding Categorical Data: Categorical variables are encoded using both one-hot encoding and label encoding. One-hot encoding is used for the 'Color' column, while label encoding is applied to the 'Size' column.
Scaling Features: Numerical features are scaled using standardization and normalization. StandardScaler is used to standardize 'Height', 'Weight', and 'Age'. MinMaxScaler is used for normalizing 'Feature1' and 'Feature2'.
Splitting the Dataset: The dataset is split into training and testing sets using an 80-20 split.
Removing Duplicate Rows: Any duplicate rows in the dataset are removed.
Renaming Columns: The code renames a column to reflect its encoded nature.
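
One caveat on the label encoding of 'Size': LabelEncoder assigns integer codes in the sorted order of the labels, so the codes follow alphabetical order (L, M, S, XL) rather than the natural order S < M < L < XL. If that order matters to a downstream model, an explicit mapping is one possible alternative. The sketch below is illustrative only; the names size_order and size_ordinal are hypothetical and do not appear in the solution.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

sizes = pd.Series(['S', 'M', 'L', 'XL', 'M', 'S', 'XL', 'L', 'M', 'S'], name='Size')

# LabelEncoder sorts the classes, so the codes follow alphabetical order
le = LabelEncoder()
label_codes = le.fit_transform(sizes)
print(le.classes_.tolist())  # ['L', 'M', 'S', 'XL']

# Hypothetical explicit mapping that preserves S < M < L < XL
size_order = {'S': 0, 'M': 1, 'L': 2, 'XL': 3}
size_ordinal = sizes.map(size_order)
print(size_ordinal.tolist())  # [0, 1, 2, 3, 1, 0, 3, 2, 1, 0]
```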

# Display the final DataFrame
print(df)
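
The solution scales the whole DataFrame before splitting it, which means the scaler statistics are computed from rows that later end up in the test set. A commonly used alternative is to split first and fit the scaler on the training rows only. The following is a minimal sketch of that ordering under the same 80-20 split; the names features, train_X, and test_X are assumptions introduced here, not part of the original solution.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Two complete numeric columns taken from the sample dataset
features = pd.DataFrame({
    'Feature1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Feature2': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
})

# Split first (80-20), then fit the scaler on the training rows only
train_X, test_X = train_test_split(features, test_size=0.2, random_state=42)
scaler = StandardScaler().fit(train_X)

# Apply the training statistics to both parts
train_scaled = scaler.transform(train_X)
test_scaled = scaler.transform(test_X)

print(train_scaled.shape, test_scaled.shape)  # (8, 2) (2, 2)
```

With this ordering the training columns have mean 0 and unit variance after scaling, while the test columns generally do not, since they are transformed with the training statistics.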
