Data Cleaning Approaches for Machine Learning
1. Handling Missing Data
Identify missing values.
Impute or remove missing data using appropriate techniques.
Python Code:
import pandas as pd
from sklearn.impute import SimpleImputer
# Identify missing data
missing_values = data.isnull().sum()
# Mean/Median/Mode Imputation (mean and median apply to numeric columns only)
imputer = SimpleImputer(strategy='mean') # change to 'median' or 'most_frequent'
data_imputed = pd.DataFrame(imputer.fit_transform(data), columns=data.columns)
# Forward/Backward Fill (the fillna(method=...) form is deprecated in recent pandas)
data_filled = data.ffill() # use data.bfill() for backward fill
# Alternatively, drop rows that contain any missing values
data_dropped = data.dropna()
2. Handling Outliers
Detect outliers using statistical methods or visual tools.
Handle outliers by capping, transforming, or removing them.
Python Code:
import numpy as np
# Z-score Method for Outlier Detection (numeric columns only)
z_scores = (data - data.mean()) / data.std()
data_no_outliers = data[(np.abs(z_scores) < 3).all(axis=1)] # keep rows where every |z| < 3
# IQR Method for Outlier Detection
Q1 = data.quantile(0.25)
Q3 = data.quantile(0.75)
IQR = Q3 - Q1
data_no_outliers = data[~((data < (Q1 - 1.5 * IQR)) | (data > (Q3 + 1.5 * IQR))).any(axis=1)]
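Capping, mentioned above, keeps every row but limits extreme values; a minimal sketch that clips each column to the IQR fences computed in the previous block:
# Winsorize: clip every value into [Q1 - 1.5*IQR, Q3 + 1.5*IQR] per column
lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR
data_capped = data.clip(lower=lower, upper=upper, axis=1)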
3. Removing Duplicates
Identify duplicate records in the dataset.
Remove duplicates while retaining necessary unique entries.
Python Code:
# Identify and remove duplicates
data_no_duplicates = data.drop_duplicates()
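When duplicates should be judged by a key column rather than by whole rows, drop_duplicates takes subset and keep arguments; 'id_column' below is a hypothetical key:
# Keep only the first record per key ('id_column' is a placeholder name)
data_unique = data.drop_duplicates(subset=['id_column'], keep='first')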
4. Normalizing and Scaling
Normalize or scale features for algorithms sensitive to different feature scales.
Python Code:
from sklearn.preprocessing import StandardScaler, MinMaxScaler
# Standardization
scaler = StandardScaler()
data_scaled = pd.DataFrame(scaler.fit_transform(data), columns=data.columns)
# Min-Max Scaling
min_max_scaler = MinMaxScaler()
data_min_max_scaled = pd.DataFrame(min_max_scaler.fit_transform(data), columns=data.columns)
5. Encoding Categorical Variables
Convert categorical variables into numerical values using encoding techniques
like One-Hot Encoding or Label Encoding.
Python Code:
from sklearn.preprocessing import LabelEncoder
# Label Encoding (intended for target labels; prefer OrdinalEncoder for input features)
label_encoder = LabelEncoder()
data['encoded_column'] = label_encoder.fit_transform(data['categorical_column'])
# One-Hot Encoding via pandas
data_one_hot = pd.get_dummies(data, columns=['categorical_column'], drop_first=True)
6. Dealing with Imbalanced Data
Apply oversampling or undersampling techniques to balance class distributions.
Python Code:
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
# Split the dataset (stratify preserves the class ratio in both halves)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
# SMOTE Oversampling
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)
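Undersampling, the alternative mentioned above, can be sketched with imbalanced-learn's RandomUnderSampler:
from imblearn.under_sampling import RandomUnderSampler
# Randomly drop majority-class samples until the classes are balanced
under_sampler = RandomUnderSampler(random_state=42)
X_train_under, y_train_under = under_sampler.fit_resample(X_train, y_train)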
7. Handling Inconsistent Data
Standardize formats, correct typos, and handle inconsistencies in data types or
units.
Python Code:
# Correct inconsistent data
data['date_column'] = pd.to_datetime(data['date_column'])
data['text_column'] = data['text_column'].str.lower() # Lowercase text
# Handling typos using fuzzy matching (fuzzywuzzy; its maintained fork is thefuzz)
from fuzzywuzzy import process
correct_spellings = ["category1", "category2"]
# extractOne returns a (best_match, score) tuple; keep the match
data['corrected_column'] = data['categorical_column'].apply(
    lambda x: process.extractOne(x, correct_spellings)[0])
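Unit inconsistencies, also mentioned above, are typically resolved by converting everything to a single unit; the 'weight_lb' column below is a hypothetical example:
# Convert a hypothetical pounds column to kilograms (1 lb = 0.453592 kg)
data['weight_kg'] = data['weight_lb'] * 0.453592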
8. Feature Engineering
Create new features based on existing ones, or use interaction and polynomial
features.
Python Code:
from sklearn.preprocessing import PolynomialFeatures
# Creating new features (e.g., interaction terms)
data['new_feature'] = data['feature1'] * data['feature2']
# Polynomial features
poly = PolynomialFeatures(degree=2)
data_poly = poly.fit_transform(data[['feature1', 'feature2']])
9. Removing Irrelevant Features
Remove features that provide little or no information.
Python Code:
# Variance Threshold (applied to numeric features)
from sklearn.feature_selection import VarianceThreshold
selector = VarianceThreshold(threshold=0.1)
data_reduced = selector.fit_transform(data)
kept_columns = data.columns[selector.get_support()]  # names of the retained features
10. Handling Multicollinearity
Detect multicollinearity using correlation matrices or VIF and remove highly
correlated features.
Python Code:
from statsmodels.stats.outliers_influence import variance_inflation_factor
# Calculate VIF for each feature (adding a constant column first, via statsmodels
# add_constant, gives intercept-adjusted VIFs)
vif = pd.DataFrame()
vif['Feature'] = X.columns
vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
# Remove highly collinear features based on VIF score ('high_vif_feature' is a placeholder)
X_reduced = X.drop(columns=['high_vif_feature'])
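The correlation-matrix route mentioned above can be sketched as follows; the 0.9 cutoff is an assumed threshold:
import numpy as np
# Drop one feature from every pair whose absolute correlation exceeds 0.9
corr = X.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
X_reduced_corr = X.drop(columns=to_drop)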
11. Text Data Cleaning
Clean and preprocess text data by tokenizing, removing stopwords, and
normalizing case.
Python Code:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import string
# Remove punctuation and stopwords, and lowercase
# (requires the NLTK resources: nltk.download('punkt'), nltk.download('stopwords'))
stop_words = set(stopwords.words('english'))
def clean_text(text):
    tokens = word_tokenize(text.lower())
    return ' '.join(w for w in tokens if w not in stop_words and w not in string.punctuation)
data['cleaned_text'] = data['text_column'].apply(clean_text)
12. Date/Time Data Handling
Extract features from date columns or normalize to a common time zone.
Python Code:
# Extracting year, month, day from datetime
data['year'] = data['date_column'].dt.year
data['month'] = data['date_column'].dt.month
data['day'] = data['date_column'].dt.day
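Normalizing to a common time zone, also mentioned above, might look like this; treating the timestamps as naive US/Eastern is an assumption:
# Localize naive timestamps (assumed US/Eastern), then convert to UTC
data['date_utc'] = data['date_column'].dt.tz_localize('US/Eastern').dt.tz_convert('UTC')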
13. Handling Data Leakage
Prevent target leakage by splitting the data into training and test sets before any
preprocessing, and by fitting transformers on the training set only (see the sketch after the split below).
Python Code:
# Separate data before processing
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
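Splitting early is only half the story; transformers must also be fit on the training set alone. A minimal sketch reusing the scaler from section 4:
from sklearn.preprocessing import StandardScaler
# Fit on the training data only, then apply the same transform to the test data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)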
14. Handling Zero-Variance Features
Identify features with no variance and remove them from the dataset.
Python Code:
# Variance Threshold to remove zero variance features
from sklearn.feature_selection import VarianceThreshold
selector = VarianceThreshold(threshold=0)
data_cleaned = selector.fit_transform(data)
15. Addressing Class Imbalance in Regression
Use techniques like stratified sampling or weighted loss functions to handle
imbalanced data in regression problems.
Python Code:
from sklearn.utils import class_weight
# A continuous target has no classes, so bin it first, then weight samples
# inversely to bin frequency (5 quantile bins is an arbitrary choice)
y_train_bins = pd.qcut(y_train, q=5, labels=False, duplicates='drop')
sample_weights = class_weight.compute_sample_weight(class_weight='balanced', y=y_train_bins)
# Pass sample_weights to the estimator, e.g. model.fit(X_train, y_train, sample_weight=sample_weights)
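Stratified sampling for regression, also mentioned above, can reuse the same binning trick; the 5 quantile bins are an assumed choice:
from sklearn.model_selection import train_test_split
# Stratify the split on a binned copy of the continuous target
y_binned = pd.qcut(y, q=5, labels=False, duplicates='drop')
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y_binned, random_state=42)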
16. Addressing Imbalanced Data in Classification
Use oversampling, undersampling, or adjusting decision thresholds to handle
imbalanced classes.
Python Code:
# Adjust the decision threshold for a fitted classifier
from sklearn.metrics import precision_recall_curve
y_pred_prob = classifier.predict_proba(X_test)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_test, y_pred_prob)
# precision/recall have one more element than thresholds; pick the threshold maximizing F1
f1_scores = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-12)
best_threshold = thresholds[np.argmax(f1_scores)]
y_pred = (y_pred_prob >= best_threshold).astype(int)  # apply the chosen threshold
17. Handling Missing Categorical Values
Impute missing categorical values using the mode or create a separate category.
Python Code:
# Impute missing categorical values with the most frequent category (mode)
imputer = SimpleImputer(strategy='most_frequent')
data['categorical_column'] = imputer.fit_transform(data[['categorical_column']]).ravel()
# Alternatively, treat missingness as its own category:
# data['categorical_column'] = data['categorical_column'].fillna('missing')
18. Log Transformation
Apply logarithmic transformation to reduce skewness in data.
Python Code:
# Log transformation for skewed features (log1p computes log(1 + x), avoiding log(0))
data['log_transformed_feature'] = np.log1p(data['skewed_feature'])
19. Binning Continuous Variables
Convert continuous features into discrete intervals or bins for simplification.
Python Code:
# Binning a continuous feature into discrete categories
data['binned_feature'] = pd.cut(data['continuous_feature'], bins=5,
                                labels=['very low', 'low', 'medium', 'high', 'very high'])
20. Converting Numerical to Categorical
Convert numerical variables into categorical ones based on specific ranges or
thresholds.
Python Code:
# Convert numerical age into categories (four bin edges pairs define four intervals, so four labels)
data['age_group'] = pd.cut(data['age'], bins=[0, 18, 35, 60, 100],
                           labels=['child', 'young adult', 'adult', 'senior'])
Note: The above data cleaning techniques and their corresponding Python code can
help you create a robust preprocessing pipeline, improving the quality of the datasets
before feeding them into machine learning models.