DATA SCIENCE AND
VISUALIZATION
12202080501060
202046707
Practical 5:
Implement a method to handle missing values for gender and marks. (Using Practical 3
dataset)
Introduction:
In real-world datasets, missing values are very common and need to be handled properly
before performing any analysis or machine learning tasks. Missing values in categorical
variables such as Gender can be handled using the Mode, while missing values in
numerical variables such as Marks can be imputed using measures like Mean, Median, or
Mode. In this practical, we use the mean for handling missing marks and the mode for
handling missing gender values.
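As a minimal sketch of the idea described above, the following toy example computes the mean, median, and mode and then fills the gaps. The DataFrame `toy` and its column names `Gender` and `Marks` are hypothetical and are not taken from the practical's dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are illustrative only
toy = pd.DataFrame({
    "Gender": ["F", "M", None, "F", "M", "F"],
    "Marks": [70.0, 80.0, np.nan, 90.0, 60.0, np.nan],
})

# Candidate fill values for the numerical column (NaN is skipped by default)
marks_mean = toy["Marks"].mean()
marks_median = toy["Marks"].median()

# Most frequent category for the categorical column
gender_mode = toy["Gender"].mode()[0]

# Fill a copy so the original toy data stays available for comparison
filled = toy.copy()
filled["Marks"] = filled["Marks"].fillna(marks_mean)
filled["Gender"] = filled["Gender"].fillna(gender_mode)
print(filled)
```

The median is often preferred over the mean when the marks contain outliers, because a few extreme values do not shift it.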
# Import required libraries
import numpy as np
import pandas as pd
import sklearn
# Load dataset
df = pd.read_csv('/content/drive/MyDrive/DSV /Dataset_(12202080501060)/student_dataset_with_missing_values.c')
df.info()
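Besides df.info(), it can help to count the missing cells per column before choosing an imputation strategy. The sketch below uses a small hypothetical frame named `sample` as a stand-in for the loaded dataset, so it runs without the original file.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the loaded dataset
sample = pd.DataFrame({
    "Gender": ["F", None, "M", "M"],
    "Sem1_Math": [55.0, np.nan, 72.0, np.nan],
})

# Number of missing cells in each column
missing_counts = sample.isna().sum()

# Percentage of missing cells in each column
missing_share = sample.isna().mean() * 100

print(missing_counts)
print(missing_share.round(1))
```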
GCET
17
x = df.iloc[:, :-1].values
y = df.iloc[:,3].values
# Handle missing values in marks using Mean
df['Sem1_Math'] = df['Sem1_Math'].fillna(df['Sem1_Math'].mean())
df['Sem1_Science'] = df['Sem1_Science'].fillna(df['Sem1_Science'].mean())
df['Sem1_English'] = df['Sem1_English'].fillna(df['Sem1_English'].mean())
df['Sem1_History'] = df['Sem1_History'].fillna(df['Sem1_History'].mean())
df['Sem1_CS'] = df['Sem1_CS'].fillna(df['Sem1_CS'].mean())
df['Sem2_Math'] = df['Sem2_Math'].fillna(df['Sem2_Math'].mean())
df['Sem2_Science'] = df['Sem2_Science'].fillna(df['Sem2_Science'].mean())
df['Sem2_English'] = df['Sem2_English'].fillna(df['Sem2_English'].mean())
df['Sem2_History'] = df['Sem2_History'].fillna(df['Sem2_History'].mean())
df['Sem2_CS'] = df['Sem2_CS'].fillna(df['Sem2_CS'].mean())
df['Sem3_Math'] = df['Sem3_Math'].fillna(df['Sem3_Math'].mean())
df['Sem3_Science'] = df['Sem3_Science'].fillna(df['Sem3_Science'].mean())
df['Sem3_English'] = df['Sem3_English'].fillna(df['Sem3_English'].mean())
df['Sem3_History'] = df['Sem3_History'].fillna(df['Sem3_History'].mean())
df['Sem3_CS'] = df['Sem3_CS'].fillna(df['Sem3_CS'].mean())
df['Sem4_Math'] = df['Sem4_Math'].fillna(df['Sem4_Math'].mean())
df['Sem4_Science'] = df['Sem4_Science'].fillna(df['Sem4_Science'].mean())
df['Sem4_English'] = df['Sem4_English'].fillna(df['Sem4_English'].mean())
df['Sem4_History'] = df['Sem4_History'].fillna(df['Sem4_History'].mean())
df['Sem4_CS'] = df['Sem4_CS'].fillna(df['Sem4_CS'].mean())
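The per-column assignments above can also be written as a single call. The sketch below assumes that every marks column starts with the prefix "Sem", which matches the names used in this practical. The small frame `marks` is a hypothetical miniature of the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the dataset with two of the marks columns
marks = pd.DataFrame({
    "Gender": ["F", "M", "F"],
    "Sem1_Math": [60.0, np.nan, 80.0],
    "Sem1_CS": [np.nan, 50.0, 70.0],
})

# Select every column whose name starts with "Sem" (assumed naming pattern)
mark_cols = [c for c in marks.columns if c.startswith("Sem")]

# DataFrame.fillna accepts a Series of fill values, matched by column name,
# so each column is filled with its own mean
marks[mark_cols] = marks[mark_cols].fillna(marks[mark_cols].mean())
print(marks)
```

This form avoids copy-and-paste mistakes when the dataset gains or loses a subject column.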
# Handle missing values in Gender using Mode
df['Gender'] = df['Gender'].fillna(df['Gender'].mode()[0])
df
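One detail of mode() is worth noting: it returns a Series, not a single value, because a column can have several equally frequent categories. The sketch below shows what mode()[0] selects in that case. The Series `gender` is hypothetical.

```python
import pandas as pd

# Hypothetical Gender column with a tie between "F" and "M"
gender = pd.Series(["F", "M", None, "M", "F"], name="Gender")

# Missing values are ignored; tied modes are all returned, in sorted order
modes = gender.mode()

# Taking the first entry picks the first mode in sorted order
fill_value = modes.iloc[0]

gender_filled = gender.fillna(fill_value)
print(modes.tolist(), fill_value)
```

When the categories are tied, the chosen fill value therefore depends on sort order rather than on the data, which may be worth checking before imputing.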
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
y = df.iloc[:,3].values
y_reshaped = y.reshape(-1, 1)
imputer.fit(y_reshaped)
y_imputed = imputer.transform(y_reshaped)
y_imputed
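SimpleImputer can also handle the categorical column through its "most_frequent" strategy. The following sketch applies both strategies to a hypothetical frame named `demo`; it illustrates the same idea as the code above and is not the practical's original code.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data; Gender is kept as object dtype so np.nan marks the gap
demo = pd.DataFrame({
    "Gender": pd.Series(["F", np.nan, "F", "M"], dtype=object),
    "Sem1_Math": [40.0, np.nan, 60.0, 80.0],
})

# Mean strategy for the numerical column (expects 2-D input, hence [[...]])
num_imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
demo["Sem1_Math"] = num_imputer.fit_transform(demo[["Sem1_Math"]]).ravel()

# Most-frequent strategy for the categorical column
cat_imputer = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
demo["Gender"] = cat_imputer.fit_transform(demo[["Gender"]]).ravel()

print(demo)
```

After fitting, the learned fill values are available in the `statistics_` attribute of each imputer, which allows the same values to be reused on new data with transform().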
Important Points:
- Missing values can introduce bias or reduce the quality of analysis.
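- For numerical data (marks), mean imputation leaves the column mean unchanged, but it
reduces the variance because every filled value sits at the centre of the distribution.
- For categorical data (gender), mode imputation replaces missing entries with the most
frequent category, so the column stays categorical; it can, however, overstate the
majority class.
- Scikit-learn and Pandas both provide imputation tools, such as SimpleImputer and fillna().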
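The effect of mean imputation on a column's distribution can be checked directly. The sketch below compares the mean and standard deviation before and after filling. The Series `scores` is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical marks with two missing entries
scores = pd.Series([40.0, 50.0, np.nan, 70.0, np.nan, 80.0])

# Fill the gaps with the mean of the observed values
scores_filled = scores.fillna(scores.mean())

# The mean is unchanged, while the spread shrinks after filling
mean_before, mean_after = scores.mean(), scores_filled.mean()
std_before, std_after = scores.std(), scores_filled.std()

print(mean_before, mean_after)
print(round(std_before, 2), round(std_after, 2))
```

The more values are missing, the more the spread shrinks, so mean imputation is best suited to columns with only a small share of gaps.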
Conclusion:
In this practical, missing values in the student dataset were successfully handled. We used
mean imputation for numerical marks and mode imputation for categorical gender data.
Handling missing values is an essential preprocessing step to ensure reliable and accurate
analysis in data science and machine learning.