Data Preprocess
Data Pre-Processing
A Guide to Understanding and Preparing Your Data
mlpfu.pages.dev 1
Core Libraries
These are the foundational tools for data manipulation and visualization
in Python.
Pandas: For data structures and analysis.
NumPy: For numerical operations.
Matplotlib & Seaborn: For data visualization.
# Importing the essentials
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
Loading Data
The first step is to get your data into a Pandas DataFrame.
# Load data from a CSV file
df = pd.read_csv('your_dataset.csv')
# Display the first 5 rows
print(df.head())
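Real files often need a few hints at load time. The sketch below is only an illustration: the in-memory text stands in for 'your_dataset.csv', and the column names and the 'unknown' placeholder are made up.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file on disk
csv_text = """id,signup_date,city,income
1,2023-01-15,Paris,52000
2,2023-02-03,Lyon,unknown
3,2023-02-20,Paris,61000
"""

df = pd.read_csv(
    io.StringIO(csv_text),
    parse_dates=["signup_date"],  # parse this column as datetime
    na_values=["unknown"],        # treat this placeholder as missing
)

print(df.head())
print(df.dtypes)
```

Declaring dates and missing-value markers up front can save a type-correction step later.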
Initial Data Exploration
Get a first look at your dataset's structure and content.
# Get a concise summary of the dataframe
df.info()
# Generate descriptive statistics for the numeric columns
# (wrap in print() so the table also shows when run as a script)
print(df.describe())
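By default, describe() summarizes only numeric columns. A small sketch on a made-up DataFrame shows how include='all' brings text columns into the summary:

```python
import pandas as pd

# Hypothetical data with one numeric and one text column
df = pd.DataFrame({
    "age": [25, 32, 47, 51],
    "city": ["Paris", "Lyon", "Paris", "Nice"],
})

numeric_summary = df.describe()            # numeric columns only
full_summary = df.describe(include="all")  # adds unique, top, freq for text columns

print(numeric_summary)
print(full_summary)
```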
Deeper Data Inspection
Go beyond the basics to understand your data's shape and types.
# Check the dimensions of the DataFrame (rows, columns)
print(df.shape)
# List all column names
print(df.columns)
# Check the data type of each column
print(df.dtypes)
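Once the dtypes are known, it often helps to split columns by type, since numeric and categorical columns are cleaned differently. A minimal sketch with made-up column names:

```python
import pandas as pd

# Hypothetical data: two numeric columns and one text column
df = pd.DataFrame({
    "age": [25, 32],
    "income": [52000.0, 61000.0],
    "city": ["Paris", "Lyon"],
})

# Select columns by dtype family
numeric_cols = df.select_dtypes(include="number").columns.tolist()
other_cols = df.select_dtypes(exclude="number").columns.tolist()

print("Numeric:", numeric_cols)
print("Other:", other_cols)
```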
Exploring Categorical Data with value_counts
A good first step for any categorical column: value_counts shows how many
times each category appears.
Why is this important?
Reveals the distribution of categories.
Helps identify data quality issues (e.g., typos, mixed case).
Shows if you have a data imbalance problem.
# Get the counts of each unique value in a column
print(df['category_column'].value_counts())
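The quality issues mentioned above (typos, mixed case) show up as near-duplicate categories. A sketch on a made-up column, assuming that stripping whitespace and lowercasing is an acceptable normalization for your data:

```python
import pandas as pd

# Hypothetical column with mixed case, stray whitespace, and a missing value
s = pd.Series(["Red", "red", "Blue", " blue", "Red", None], name="color")

# Raw counts treat 'Red' and 'red' as different categories
raw_counts = s.value_counts(dropna=False)

# Normalize the text, then count again
cleaned = s.str.strip().str.lower()
clean_counts = cleaned.value_counts(dropna=False)

# Proportions make imbalance easier to read than raw counts
proportions = cleaned.value_counts(normalize=True)

print(raw_counts)
print(clean_counts)
print(proportions)
```

Passing dropna=False keeps missing values visible in the counts instead of silently dropping them.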
Correcting Data Types
Sometimes data is loaded with the wrong type (e.g., numbers as text).
This is a common and critical cleaning step.
# Example: A column 'price' was read as object (text)
# because of '$' symbols.
# First, remove the non-numeric characters.
# A raw string (r'...') keeps '\$' from being read as an invalid escape sequence.
df['price_cleaned'] = df['price'].replace({r'\$': ''}, regex=True)
# Now, convert the column to a numeric type.
# errors='coerce' will turn any remaining non-numeric values into NaN
df['price_numeric'] = pd.to_numeric(df['price_cleaned'], errors='coerce')
# Verify the change
print(df.dtypes)
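It is worth checking how many values errors='coerce' silently turned into NaN. A sketch with made-up prices, assuming '$' and ',' are the only symbols to strip:

```python
import pandas as pd

# Hypothetical prices read as text, including one unparseable entry
df = pd.DataFrame({"price": ["$1,200.50", "$35", "N/A", "$7.25"]})

# Strip '$' and thousands separators, then convert; 'N/A' becomes NaN
cleaned = df["price"].str.replace(r"[$,]", "", regex=True)
df["price_numeric"] = pd.to_numeric(cleaned, errors="coerce")

# Values that were present as text but failed to convert
n_failed = df["price_numeric"].isna().sum() - df["price"].isna().sum()

print(df)
print("Values coerced to NaN:", n_failed)
```

A non-zero count is a prompt to inspect the offending rows before moving on.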
Analyzing Two Categories: Crosstabulation
A crosstab (or contingency table) counts how often each combination of two
categorical variables occurs, which makes their relationship easy to see.
# Create a table showing the frequency of each combination
# of 'category1' and 'category2'.
crosstab_result = pd.crosstab(df['category1'], df['category2'])
print(crosstab_result)
# This can be visualized with a heatmap for better readability
sns.heatmap(crosstab_result, annot=True, fmt='d')
plt.show()
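Raw counts are hard to compare when groups differ in size. A sketch on made-up data using two crosstab options, margins for totals and normalize='index' for row shares:

```python
import pandas as pd

# Hypothetical data: subscription plan vs. whether the customer churned
df = pd.DataFrame({
    "plan": ["basic", "basic", "pro", "pro", "pro", "basic"],
    "churned": ["no", "yes", "no", "no", "yes", "no"],
})

# Counts with row and column totals in an 'All' row/column
counts = pd.crosstab(df["plan"], df["churned"], margins=True)

# Share of each 'churned' value within each plan (each row sums to 1)
row_share = pd.crosstab(df["plan"], df["churned"], normalize="index")

print(counts)
print(row_share)
```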
Handling Duplicates & Unwanted Columns
Keep your dataset clean and relevant.
# Check for and count duplicate rows
print(df.duplicated().sum())
# Remove duplicate rows
df = df.drop_duplicates()
# Remove a single unwanted column
df = df.drop('unwanted_column_name', axis=1)
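Rows can also be duplicates on a key column even when other fields differ. A sketch on made-up data, assuming 'customer_id' is the key and the last record per key is the one to keep:

```python
import pandas as pd

# Hypothetical data: customer 1 is repeated exactly, customer 3 with a new email
df = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3],
    "email": ["a@x.com", "a@x.com", "b@x.com", "c@x.com", "c@y.com"],
})

# Fully identical rows vs. rows repeating the key only
full_dupes = df.duplicated().sum()
key_dupes = df.duplicated(subset=["customer_id"]).sum()

# Keep the last row seen for each customer_id
deduped = df.drop_duplicates(subset=["customer_id"], keep="last")

print(full_dupes, key_dupes)
print(deduped)
```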
Creating Custom Columns
Engineer new features from existing data.
# Example: Create 'price_per_sqft' from 'price' and 'sqft_living'
# (both columns must already be numeric)
df['price_per_sqft'] = df['price'] / df['sqft_living']
# Display the new column
print(df[['price', 'sqft_living', 'price_per_sqft']].head())
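Ratio features can produce infinite values when the denominator is zero. A sketch with made-up numbers, assuming a zero area should be treated as missing:

```python
import numpy as np
import pandas as pd

# Hypothetical data: the second row has a zero living area
df = pd.DataFrame({
    "price": [300000.0, 450000.0, 120000.0],
    "sqft_living": [1500, 0, 800],
})

# Division by zero yields inf rather than raising an error
ratio = df["price"] / df["sqft_living"]

# Replace infinities with NaN so they are handled with other missing values
df["price_per_sqft"] = ratio.replace([np.inf, -np.inf], np.nan)

print(df)
```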
Handling Missing Values
Missing data is common. Here's how to handle it.
# Check for missing values in each column
print(df.isnull().sum())
# Option 1: Drop rows with any missing values
df_cleaned = df.dropna()
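# Option 2: Fill missing values in one column with that column's mean
# (calling df.fillna(mean_value) would put this mean into every column)
mean_value = df['some_column'].mean()
df_filled = df.copy()
df_filled['some_column'] = df_filled['some_column'].fillna(mean_value)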
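Different column types usually call for different fill values. A sketch on made-up data, assuming the median suits a numeric column with an extreme value and the most frequent value suits a text column:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value per column
df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 100.0],
    "city": ["Paris", None, "Paris", "Lyon"],
})

df_filled = df.copy()

# Numeric column: the median is less sensitive to extreme values than the mean
df_filled["age"] = df_filled["age"].fillna(df_filled["age"].median())

# Text column: fill with the most frequent value
df_filled["city"] = df_filled["city"].fillna(df_filled["city"].mode()[0])

print(df_filled)
```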
Data Transformation: Encoding
Most machine learning models need numerical input, so categorical columns
must be encoded as numbers first.
# One-Hot Encode a categorical column
# 'prefix' is used to name the new columns
dummies = pd.get_dummies(df['category_column'], prefix='cat')
# Add the new columns and drop the original
df = pd.concat([df, dummies], axis=1)
df = df.drop('category_column', axis=1)
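The encode, concat, and drop steps can also be done in one call by passing the whole DataFrame with columns=[...]. A sketch on made-up data; drop_first=True is an optional choice shown here because it removes one redundant dummy column:

```python
import pandas as pd

# Hypothetical data with one categorical column
df = pd.DataFrame({
    "size": ["S", "M", "L", "M"],
    "price": [10, 12, 15, 11],
})

# Encode 'size' in place of the original column; the first category
# in sorted order ('L') is dropped and becomes the implicit baseline
encoded = pd.get_dummies(
    df, columns=["size"], prefix="size", drop_first=True, dtype=int
)

print(encoded)
```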
Data Transformation: Feature Scaling
Feature scaling transforms the data itself so that numerical features share
a comparable range. It is not to be confused with model scaling.
from sklearn.preprocessing import MinMaxScaler
# Scale a numerical feature to a range (e.g., 0-1)
scaler = MinMaxScaler()
df['scaled_feature'] = scaler.fit_transform(df[['numeric_feature']])
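To see what the scaler computes, here is a sketch of the min-max formula written out in pandas, with standardization (z-scores) as a common alternative. The numbers are made up, and ddof=0 is chosen to match the population standard deviation used by scikit-learn's StandardScaler.

```python
import pandas as pd

# Hypothetical numeric feature
df = pd.DataFrame({"numeric_feature": [10.0, 20.0, 30.0, 50.0]})
col = df["numeric_feature"]

# Min-max scaling: (x - min) / (max - min), giving values in [0, 1]
df["minmax"] = (col - col.min()) / (col.max() - col.min())

# Standardization: (x - mean) / std, giving mean 0 and unit variance
df["zscore"] = (col - col.mean()) / col.std(ddof=0)

print(df)
```

In a modeling workflow, the minimum, maximum, mean, and standard deviation are normally computed on the training data only and then reused on new data.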
Visualization: Correlation Heatmap
Visualize the correlation between numerical features.
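# Calculate the correlation matrix (numeric columns only;
# pandas 2.0+ raises an error if text columns are included)
corr_matrix = df.corr(numeric_only=True)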
# Plot the heatmap
plt.figure(figsize=(12, 10))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.show()
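On wide datasets a heatmap gets crowded, so a ranked list of the strongest pairs can be easier to read. A sketch on made-up data that keeps each pair once by masking the lower triangle and the diagonal:

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'b' is exactly twice 'a'; 'label' is non-numeric
df = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],
    "c": [5, 3, 4, 1, 2],
    "label": list("xyxyx"),
})

corr = df.corr(numeric_only=True)

# Keep only the upper triangle so each pair appears once, without self-pairs
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = (
    corr.where(mask)
        .stack()
        .dropna()
        .sort_values(key=abs, ascending=False)  # strongest first, either sign
)

print(pairs)
```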
Visualization: Bivariate Analysis with Pairplot
Quickly visualize relationships across your entire dataset.
# Creates scatterplots for joint relationships and histograms for univariate distributions.
# Use a subset of columns for readability on large datasets.
sns.pairplot(df[['col1', 'col2', 'col3']])
plt.show()
Visualization: Grouping and Aggregating
Explore data by grouping it based on categories.
# Group by a category and calculate the mean of another column
avg_price_by_category = df.groupby('category')['price'].mean()
print(avg_price_by_category)
# Visualize the aggregated data
avg_price_by_category.plot(kind='bar')
plt.show()
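A group mean on its own can mislead when a group has very few rows. A sketch on made-up data that reports the count and median alongside the mean:

```python
import pandas as pd

# Hypothetical data with two categories of different sizes
df = pd.DataFrame({
    "category": ["A", "A", "B", "B", "B"],
    "price": [10.0, 20.0, 5.0, 7.0, 9.0],
})

# Several aggregations at once, sorted by mean price
summary = (
    df.groupby("category")["price"]
      .agg(["count", "mean", "median"])
      .sort_values("mean", ascending=False)
)

print(summary)
```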
Visualization: Outlier Detection
Identify outliers in your data using box plots.
# Create a box plot for a specific feature
sns.boxplot(x=df['feature_to_check'])
plt.title('Outlier Detection in Feature')
plt.show()
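By default, box plot whiskers extend 1.5 times the interquartile range (IQR) beyond the quartiles, and points outside them are drawn as outliers. A sketch on made-up values that applies the same rule numerically, so the flagged rows can be inspected or filtered:

```python
import pandas as pd

# Hypothetical feature with one extreme value
s = pd.Series([10, 12, 11, 13, 12, 11, 95], name="feature_to_check")

# Quartiles and interquartile range
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the quartiles
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside the fences are flagged as potential outliers
outliers = s[(s < lower) | (s > upper)]

print(f"Fences: [{lower}, {upper}]")
print(outliers)
```

Whether a flagged value is an error or a genuine extreme is a judgment call that depends on the data.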
Visualization: Distributions
Understand the distribution of a single variable.
# Histogram with Seaborn
sns.histplot(df['age'], kde=True)
plt.title('Age Distribution')
plt.show()
Visualization: Relationships
Explore how two variables relate to each other.
# Scatter plot to see the relationship between two variables
sns.scatterplot(x='feature1', y='feature2', data=df)
plt.title('Feature1 vs. Feature2')
plt.show()
Visualization: Categorical Data
Use bar plots for categorical data.
# Count plot for a categorical variable
sns.countplot(x='category_column', data=df)
plt.title('Count of Categories')
plt.xticks(rotation=45)
plt.show()