Data Cleaning and Preprocessing
1. Introduction
Data cleaning and preprocessing are essential steps in data science and
machine learning. Raw data is often messy, containing missing values,
inconsistencies, and irrelevant features. A well-processed dataset leads to
accurate and efficient machine learning models.
1.1 Importance of Data Cleaning and Preprocessing
Data cleaning and preprocessing are crucial because they ensure the
dataset is reliable, consistent, and suitable for analysis. Below are some
key reasons why this process is essential:
Enhances Data Quality and Reliability: Unclean data can lead to inaccurate insights and poor decision-making. Cleaning ensures that data is consistent and free of errors.
Eliminates Biases and Inconsistencies: Datasets often contain biased, redundant, or incorrect information that can skew results. Cleaning ensures that only relevant and unbiased data is used.
Reduces Noise and Irrelevant Information: Raw data may contain unnecessary or misleading values, which can negatively impact models and analyses.
Improves Model Accuracy and Generalizability: Well-preprocessed data helps machine learning models perform better by removing inconsistencies and irrelevant features.
Ensures Data is in the Correct Format for Analysis: Different sources provide data in varying formats. Preprocessing ensures all data is formatted uniformly for easy analysis.
Enhances Computational Efficiency: Cleaning reduces the size of the dataset, making computations more efficient and reducing processing time.
Prevents Data Leakage: Applying cleaning and preprocessing steps consistently, and fitting them on the training data only, prevents unintentional information leakage that could produce misleadingly optimistic results.
Facilitates Better Feature Engineering: Clean data allows for more meaningful feature extraction, leading to more robust predictive models.
Aids in Regulatory Compliance: Many industries have regulations that require data to be accurate and complete. Cleaning ensures compliance with data governance standards.
2. Data Cleaning
Types of Missing Data
1. Missing Completely at Random (MCAR) - The probability that a value is missing is unrelated to any variable, observed or not (e.g., a sensor that drops readings at random).
2. Missing at Random (MAR) - Missingness depends only on other observed variables (e.g., income is missing more often for younger respondents, and age is recorded).
3. Missing Not at Random (MNAR) - Missingness depends on the missing value itself (e.g., people with very high incomes decline to report their income).
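Before choosing a strategy, it helps to profile how much data is actually missing. A minimal sketch, assuming the data is already loaded into a pandas DataFrame named df:
import pandas as pd

# Count and percentage of missing values per column
missing_count = df.isnull().sum()
missing_pct = df.isnull().mean() * 100
print(pd.concat([missing_count, missing_pct], axis=1, keys=['count', 'percent']))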
2.1 Handling Missing Values
Missing values can significantly impact the quality of data. Common
techniques to handle missing values include:
Removing Missing Values: If missing values are few, they can be
removed.
df.dropna(inplace=True)
Filling Missing Values (Imputation):
o Mean/Median Imputation: Suitable for numerical data.
df['column'] = df['column'].fillna(df['column'].mean())
o Mode Imputation: Suitable for categorical data.
df['column'] = df['column'].fillna(df['column'].mode()[0])
o Forward/Backward Fill: Used for time-series data.
df = df.ffill()  # propagate the last valid observation forward
df = df.bfill()  # fill remaining gaps with the next valid observation
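When the same imputation has to be reapplied to new data (for example, a held-out test set), scikit-learn's SimpleImputer wraps the strategies above in a reusable object. A minimal sketch, where 'col1' and 'col2' are assumed numerical columns:
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='median')  # or 'mean', 'most_frequent'
df[['col1', 'col2']] = imputer.fit_transform(df[['col1', 'col2']])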
2.2 Removing Duplicates
Duplicate data can distort analysis and predictions. Removing duplicates
ensures data integrity.
df.drop_duplicates(inplace=True)
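It is often useful to count duplicates before dropping them, and to restrict the check to the columns that identify a record. A short sketch, where 'id' is a hypothetical key column:
print(df.duplicated().sum())  # number of fully duplicated rows
df.drop_duplicates(subset=['id'], keep='first', inplace=True)  # keep the first row per key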
2.3 Handling Outliers
Outliers can skew results, making them unreliable. Common methods to
handle outliers:
Using the IQR Method (Interquartile Range):
Q1 = df['column'].quantile(0.25)
Q3 = df['column'].quantile(0.75)
IQR = Q3 - Q1
df = df[(df['column'] >= (Q1 - 1.5 * IQR)) & (df['column'] <= (Q3 + 1.5 * IQR))]
Using Z-score Method:
import numpy as np
from scipy import stats
df = df[(np.abs(stats.zscore(df['column'])) < 3)]
2.4 Fixing Inconsistent Data Entries
Inconsistent entries can occur due to human errors or different data
sources.
Standardizing Text Data:
df['column'] = df['column'].str.lower().str.strip()
Replacing Incorrect Values:
df.replace({'wrong_value': 'correct_value'}, inplace=True)
2.5 Handling Data Type Inconsistencies
Ensuring correct data types improves processing efficiency.
Converting Data Types:
df['column'] = df['column'].astype(int) # Convert to integer
df['date_column'] = pd.to_datetime(df['date_column']) # Convert to datetime
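Conversions fail when a column contains stray non-numeric or unparseable entries; pandas can coerce such entries to missing values so they can be handled with the techniques from Section 2.1. A minimal sketch:
df['column'] = pd.to_numeric(df['column'], errors='coerce')  # invalid entries become NaN
df['date_column'] = pd.to_datetime(df['date_column'], errors='coerce')  # unparseable dates become NaT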
3. Data Preprocessing
3.1 Feature Scaling
Scaling puts numerical features on a comparable range, which improves the performance of algorithms that are sensitive to feature magnitude (for example, gradient-based and distance-based methods).
Min-Max Scaling (Normalization) (Values between 0 and 1)
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df[['col1', 'col2']] = scaler.fit_transform(df[['col1', 'col2']])
Standardization (Z-score Normalization)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df[['col1', 'col2']] = scaler.fit_transform(df[['col1', 'col2']])
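Note that fitting a scaler on the full dataset lets information from the test set leak into training (see Section 1.1). A common pattern is to fit the scaler on the training split only; a sketch, assuming a feature matrix X and target y:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training data only
X_test_scaled = scaler.transform(X_test)        # reuse the same parameters on the test data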
3.2 Encoding Categorical Variables
Many machine learning algorithms require numerical input, so categorical
data must be converted into numeric representations.
One-Hot Encoding (For Nominal Categories)
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder()
encoded_data = encoder.fit_transform(df[['category']]).toarray()
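The encoder returns an array without column names; it can be wrapped back into a DataFrame, and handle_unknown='ignore' keeps categories unseen during fitting from raising errors later. A sketch building on the snippet above:
import pandas as pd

encoder = OneHotEncoder(handle_unknown='ignore')
encoded = encoder.fit_transform(df[['category']]).toarray()
encoded_df = pd.DataFrame(encoded,
                          columns=encoder.get_feature_names_out(['category']),
                          index=df.index)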
Label Encoding (For Ordinal Categories)
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
df['category'] = encoder.fit_transform(df['category'])
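LabelEncoder assigns integer codes in alphabetical order, which may not match the real ranking of an ordinal variable; scikit-learn's OrdinalEncoder accepts an explicit order instead. A sketch with a hypothetical 'size' column:
from sklearn.preprocessing import OrdinalEncoder

encoder = OrdinalEncoder(categories=[['small', 'medium', 'large']])  # explicit order: small < medium < large
df['size'] = encoder.fit_transform(df[['size']]).ravel()  # flatten the (n, 1) output into a column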
3.3 Feature Engineering
Feature engineering involves creating new meaningful features from
existing data to improve model performance.
Extracting Date Components
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
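In the same spirit, features such as the day of week or a weekend flag often carry useful signal for time-dependent targets; a small extension of the snippet above:
df['day_of_week'] = df['date'].dt.dayofweek  # Monday=0 ... Sunday=6
df['is_weekend'] = (df['date'].dt.dayofweek >= 5).astype(int)  # 1 for Saturday/Sunday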
3.4 Handling Imbalanced Data
Imbalanced datasets can lead to machine learning models that favor the majority class.
Oversampling (SMOTE - Synthetic Minority Over-sampling
Technique)
from imblearn.over_sampling import SMOTE
smote = SMOTE()
X_resampled, y_resampled = smote.fit_resample(X, y)
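Checking the class distribution before and after resampling confirms the effect. A minimal sketch:
from collections import Counter

print('Before:', Counter(y))            # imbalanced counts per class
print('After:', Counter(y_resampled))   # classes balanced by SMOTE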
3.5 Principal Component Analysis (PCA) for Dimensionality
Reduction
PCA reduces the number of features while retaining essential information.
Applying PCA
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
df_pca = pca.fit_transform(df)
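Because PCA is driven by variance, it is usually applied after standardization (Section 3.1), and the amount of information retained can be checked through the explained variance ratio. A short sketch, assuming df contains only numerical columns:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(df)  # PCA is sensitive to feature scale
pca = PCA(n_components=2)
df_pca = pca.fit_transform(X_scaled)
print(pca.explained_variance_ratio_)  # share of variance captured by each component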