DATA SCIENCE AND VISUALIZATION 12202080501060
202046707
Practical 6:
Perform encoding of categorical variables in the given dataset.
Introduction:
In data preprocessing, categorical variables need to be transformed into numerical
representations so that machine learning algorithms can process them effectively. This
practical demonstrates how to apply One-Hot Encoding, Label Encoding, and
preprocessing techniques such as scaling, normalization, and handling missing values. The
dataset used contains student details, including gender, city, mobile, semester marks, and
more.
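To make the difference between the two encodings concrete, here is a minimal sketch on a hypothetical toy column. The city names and the variable names `cities`, `city_onehot`, and `city_labels` are assumptions for illustration and are not taken from the student dataset.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical toy values; not taken from the student dataset.
cities = np.array([["Anand"], ["Surat"], ["Anand"], ["Rajkot"]])

# One-Hot Encoding: one binary column per category (categories are sorted).
ohe = OneHotEncoder(handle_unknown="ignore")
city_onehot = ohe.fit_transform(cities).toarray()

# Label Encoding: one integer per class; scikit-learn intends it for targets.
le = LabelEncoder()
city_labels = le.fit_transform(cities.ravel())

print(ohe.categories_[0])  # the learned category order
print(city_onehot)         # one row per sample, one column per category
print(city_labels)         # one integer per sample
```

One-hot output grows by one column per category, while label encoding keeps a single column but imposes an arbitrary integer order, which is why the practical reserves it for the target.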
Code:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
import numpy as np
import pandas as pd
df = pd.read_csv('/content/drive/MyDrive/DSV/Dataset_(12202080501060)/student_dataset_with_missing_values.csv')
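# Drop the identifier columns
df = df.drop(['Name', 'Enrollment'], axis=1)
# Features: every column except the last; target: the last column
X = df.iloc[:, :-1].values
y = df.iloc[:, -1].values
# Positional indices of the columns that need encoding or imputation
gender_col_index = df.columns.get_loc('Gender')
city_col_index = df.columns.get_loc('City')
mobile_col_index = df.columns.get_loc('Mobile')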
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import FunctionTransformer
# Mean-impute missing values in Mobile
numeric_transformer = SimpleImputer(strategy='mean')
# One-hot encode Gender and City; unseen categories become all-zero rows
categorical_transformer = OneHotEncoder(handle_unknown='ignore')
ct = make_column_transformer(
    (categorical_transformer, [gender_col_index, city_col_index]),
    (numeric_transformer, [mobile_col_index]),
    remainder='passthrough'
)
X = ct.fit_transform(X)
X = X.toarray() if hasattr(X, 'toarray') else X
print("Data after encoding 'Gender' and 'City' and handling 'Mobile':")
print(X[:5])
from sklearn.preprocessing import LabelEncoder
# Encode the target labels as integers
le = LabelEncoder()
y = le.fit_transform(y)
print(y)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 7)
X_train
X_test
y_train
y_test
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
# Columns from index 8 onward are treated as the numerical features
X_train_numeric = X_train[:, 8:]
X_test_numeric = X_test[:, 8:]
X_train_scaled = sc.fit_transform(X_train_numeric)
X_test_scaled = sc.transform(X_test_numeric)
print("Scaled X_train (numerical columns):")
print(X_train_scaled)
from sklearn.preprocessing import Normalizer
from sklearn.impute import SimpleImputer
import numpy as np
nm = Normalizer()
numerical_cols_indices = slice(8, None)
imputer_numerical = SimpleImputer(missing_values=np.nan, strategy='mean')
X_train[:, numerical_cols_indices] = imputer_numerical.fit_transform(
    X_train[:, numerical_cols_indices])
X_test[:, numerical_cols_indices] = imputer_numerical.transform(
    X_test[:, numerical_cols_indices])
X_train[:, numerical_cols_indices] = nm.fit_transform(X_train[:, numerical_cols_indices])
X_test[:, numerical_cols_indices] = nm.transform(X_test[:, numerical_cols_indices])
print("Numerical columns normalized after imputation.")
print(X_train)
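The slice `8:` used in the code is hard-coded, so it presumably depends on how many one-hot columns the encoder produced for this particular dataset. The following is a minimal sketch of deriving that offset from the fitted transformer instead. The toy rows and the names `X_toy`, `ct_toy`, and `numeric_start` are hypothetical, and it is an assumption that the numerical block begins right after the one-hot columns and the imputed Mobile column.

```python
import numpy as np
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy rows in the order [Gender, City, Mobile, Sem1].
X_toy = np.array([
    ["F", "Anand", 9.0, 71.0],
    ["M", "Surat", np.nan, 64.0],
    ["F", "Rajkot", 7.0, np.nan],
], dtype=object)

ct_toy = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"), [0, 1]),
    (SimpleImputer(strategy="mean"), [2]),
    remainder="passthrough",
)
X_enc = ct_toy.fit_transform(X_toy)
X_enc = X_enc.toarray() if hasattr(X_enc, "toarray") else X_enc

# Output order is: one-hot columns, imputed Mobile, then passthrough columns.
onehot = ct_toy.named_transformers_["onehotencoder"]
n_onehot = sum(len(cats) for cats in onehot.categories_)
n_imputed = 1  # the single Mobile column
numeric_start = n_onehot + n_imputed

print(numeric_start)
print(X_enc[:, numeric_start:])  # remaining (passthrough) numerical columns
```

Computing the offset this way keeps the slice correct if the number of distinct genders or cities in the data changes.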
Important Points:
1. One-Hot Encoding is used for categorical variables like Gender and City.
2. Label Encoding is applied to the target variable.
3. Missing values in numerical columns are handled using mean imputation.
4. StandardScaler standardizes each numerical feature (column) to zero mean and unit variance.
5. Normalizer rescales each sample (row) so that its feature vector has unit norm (L2 by default).
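Points 4 and 5 are easy to confuse, so here is a minimal sketch contrasting the two transformers on a hypothetical array of marks. The values and the names `marks`, `scaled`, and `normalized` are assumptions for illustration only.

```python
import numpy as np
from sklearn.preprocessing import Normalizer, StandardScaler

# Hypothetical marks: 3 students (rows) x 2 subjects (columns).
marks = np.array([[3.0, 4.0],
                  [6.0, 8.0],
                  [9.0, 12.0]])

# StandardScaler works column-wise: each feature gets mean 0 and std 1.
scaled = StandardScaler().fit_transform(marks)

# Normalizer works row-wise: each sample is divided by its own L2 norm.
normalized = Normalizer().fit_transform(marks)

print(scaled.mean(axis=0), scaled.std(axis=0))  # per-column statistics
print(np.linalg.norm(normalized, axis=1))       # per-row norms
```

Because Normalizer uses no statistics from other rows, its `fit` step learns nothing; StandardScaler, by contrast, must be fitted on the training split and then reused on the test split, as the practical does.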
Conclusion:
Encoding categorical variables is a crucial step in data preprocessing. It allows machine
learning models to interpret categorical data effectively. In this practical, we successfully
encoded categorical features, handled missing values, and applied scaling and
normalization to numerical data, preparing the dataset for model building.
GCET