Data Pre-Processing

A Guide to Understanding and Preparing Your Data


Core Libraries
These are the foundational tools for data manipulation and visualization in Python.

Pandas: For data structures and analysis.
NumPy: For numerical operations.
Matplotlib & Seaborn: For data visualization.

# Importing the essentials
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


Loading Data
The first step is to get your data into a Pandas DataFrame.

# Load data from a CSV file
df = pd.read_csv('your_dataset.csv')

# Display the first 5 rows
print(df.head())
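read_csv also takes options for common quirks in raw files. A minimal sketch; the NA markers and 'date_column' are hypothetical placeholders for your data:

# Treat '?' and 'N/A' as missing, and parse a date column on load
# (the markers and column name are placeholders)
df = pd.read_csv('your_dataset.csv',
                 na_values=['?', 'N/A'],
                 parse_dates=['date_column'])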


Initial Data Exploration
Get a first look at your dataset's structure and content.

# Get a concise summary of the dataframe
df.info()

# Generate descriptive statistics
df.describe()
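By default, describe() summarizes only numeric columns; passing include='object' covers text columns as well. A minimal sketch:

# Descriptive statistics for text (object) columns:
# count, unique values, top value, and its frequency
print(df.describe(include='object'))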


Deeper Data Inspection
Go beyond the basics to understand your data's shape and types.

# Check the dimensions of the DataFrame (rows, columns)
print(df.shape)

# List all column names
print(df.columns)

# Check the data type of each column
print(df.dtypes)


Exploring Categorical Data with value_counts
This is the best first step for any categorical column. It shows the number of times each category appears.

Why is this important?

Reveals the distribution of categories.
Helps identify data quality issues (e.g., typos, mixed case).
Shows if you have a data imbalance problem.

# Get the counts of each unique value in a column
print(df['category_column'].value_counts())
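To act on the points above: normalize=True reports proportions (useful for spotting imbalance), and lower-casing first can expose mixed-case duplicates. A sketch using the same placeholder column:

# Proportions instead of raw counts, to gauge class imbalance
print(df['category_column'].value_counts(normalize=True))

# Lower-case and strip whitespace first to surface near-duplicate
# categories caused by typos or mixed case
print(df['category_column'].str.lower().str.strip().value_counts())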


Correcting Data Types
Sometimes data is loaded with the wrong type (e.g., numbers as text). This is a common and critical cleaning step.

# Example: A column 'price' was read as object (text)
# because of '$' symbols.
# First, remove the non-numeric characters (a raw string
# avoids an invalid-escape warning for '\$').
df['price_cleaned'] = df['price'].replace({r'\$': ''}, regex=True)

# Now, convert the column to a numeric type.
# errors='coerce' will turn any remaining non-numeric values into NaN.
df['price_numeric'] = pd.to_numeric(df['price_cleaned'], errors='coerce')

# Verify the change
print(df.dtypes)
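Dates stored as text are another frequent case and follow the same pattern; a minimal sketch (the 'date_sold' column is a hypothetical example):

# Convert a text column to datetime; errors='coerce' turns
# unparseable strings into NaT (the datetime equivalent of NaN)
df['date_sold'] = pd.to_datetime(df['date_sold'], errors='coerce')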


Analyzing Two Categories: Crosstabulation
A crosstab (or contingency table) is the best way to see the relationship and frequency between two categorical variables.

# Create a table showing the frequency of each combination
# of 'category1' and 'category2'.
crosstab_result = pd.crosstab(df['category1'], df['category2'])
print(crosstab_result)

# This can be visualized with a heatmap for better readability
sns.heatmap(crosstab_result, annot=True, fmt='d')
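When row totals differ widely, raw counts are hard to compare; normalize='index' converts each row to proportions. A small sketch:

# Each row now sums to 1, making categories directly comparable
crosstab_pct = pd.crosstab(df['category1'], df['category2'],
                           normalize='index')
sns.heatmap(crosstab_pct, annot=True, fmt='.2f')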


Handling Duplicates & Unwanted Columns
Keep your dataset clean and relevant.

# Check for and count duplicate rows
print(df.duplicated().sum())

# Remove duplicate rows
df = df.drop_duplicates()

# Remove a single unwanted column
df = df.drop('unwanted_column_name', axis=1)
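Duplicates are sometimes defined by a subset of columns rather than whole rows, and several columns can be dropped in one call; a sketch with hypothetical column names:

# Treat rows sharing the same 'id_column' as duplicates,
# keeping the first occurrence
df = df.drop_duplicates(subset=['id_column'], keep='first')

# Drop several unwanted columns at once
df = df.drop(columns=['col_a', 'col_b'])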


Creating Custom Columns
Engineer new features from existing data.

# Example: Create 'price_per_sqft' from 'price' and 'sqft_living'
df['price_per_sqft'] = df['price'] / df['sqft_living']

# Display the new column
print(df[['price', 'sqft_living', 'price_per_sqft']].head())
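If 'sqft_living' can be zero, the division yields inf; one way to guard against that (a minimal sketch):

# Replace inf values produced by division by zero with NaN,
# so they can be handled like other missing values
df['price_per_sqft'] = df['price_per_sqft'].replace([np.inf, -np.inf], np.nan)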


Handling Missing Values
Missing data is common. Here's how to handle it.

# Check for missing values in each column
print(df.isnull().sum())

# Option 1: Drop rows with any missing values
df_cleaned = df.dropna()

# Option 2: Fill a column's missing values with that column's mean
# (filling the whole DataFrame with one column's mean would
# overwrite NaNs in every column with the wrong value)
mean_value = df['some_column'].mean()
df['some_column'] = df['some_column'].fillna(mean_value)
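The median is more robust to outliers, and the mode suits categorical columns; a sketch with hypothetical column names:

# Option 3: Fill a skewed numeric column with its median
df['skewed_column'] = df['skewed_column'].fillna(df['skewed_column'].median())

# Option 4: Fill a categorical column with its most frequent value
df['category_column'] = df['category_column'].fillna(df['category_column'].mode()[0])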


Data Transformation: Encoding
Machine learning models need numerical input.

# One-Hot Encode a categorical column
# 'prefix' is used to name the new columns
dummies = pd.get_dummies(df['category_column'], prefix='cat')

# Add the new columns and drop the original
df = pd.concat([df, dummies], axis=1)
df = df.drop('category_column', axis=1)
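For linear models, one dummy column per category is redundant (its value is implied by the others); drop_first=True removes it. A minimal sketch:

# drop_first=True avoids perfectly correlated dummy columns
dummies = pd.get_dummies(df['category_column'], prefix='cat',
                         drop_first=True)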


Data Transformation: Feature Scaling
This transforms the data itself, rescaling numerical features to a comparable range so that features with large values don't dominate distance-based models.

from sklearn.preprocessing import MinMaxScaler

# Scale a numerical feature to a range (e.g., 0-1)
scaler = MinMaxScaler()
df['scaled_feature'] = scaler.fit_transform(df[['numeric_feature']])
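Standardization (zero mean, unit variance) is a common alternative when outliers would squash a min-max range; a minimal sketch:

from sklearn.preprocessing import StandardScaler

# Rescale to mean 0 and standard deviation 1
std_scaler = StandardScaler()
df['standardized_feature'] = std_scaler.fit_transform(df[['numeric_feature']])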


Visualization: Correlation Heatmap
Visualize the correlation between numerical features.

# Calculate the correlation matrix; numeric_only=True skips
# text columns, which would otherwise raise an error in
# recent pandas versions
corr_matrix = df.corr(numeric_only=True)

# Plot the heatmap
plt.figure(figsize=(12, 10))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.show()


Visualization: Bivariate Analysis with Pairplot
Quickly visualize relationships across your entire dataset.

# Creates scatterplots for joint relationships and histograms for univariate distributions.
# Use a subset of columns for readability on large datasets.
sns.pairplot(df[['col1', 'col2', 'col3']])
plt.show()
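Coloring points by a categorical column often reveals group structure; a sketch (the hue column is a hypothetical placeholder):

# hue colors each point by its category for group comparison
sns.pairplot(df[['col1', 'col2', 'col3', 'category_column']],
             hue='category_column')
plt.show()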


Visualization: Grouping and Aggregating
Explore data by grouping it based on categories.

# Group by a category and calculate the mean of another column
avg_price_by_category = df.groupby('category')['price'].mean()
print(avg_price_by_category)

# Visualize the aggregated data
avg_price_by_category.plot(kind='bar')
plt.show()
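Several aggregates can be computed in one pass with agg; a minimal sketch:

# Mean, median, and count per category in a single table
stats = df.groupby('category')['price'].agg(['mean', 'median', 'count'])
print(stats)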


Visualization: Outlier Detection
Identify outliers in your data using box plots.

# Create a box plot for a specific feature
sns.boxplot(x=df['feature_to_check'])
plt.title('Outlier Detection in Feature')
plt.show()
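The same 1.5 x IQR rule that the box plot's whiskers use can be computed directly to flag outlier rows; a minimal sketch:

# Flag rows outside 1.5 * IQR of the quartiles
q1 = df['feature_to_check'].quantile(0.25)
q3 = df['feature_to_check'].quantile(0.75)
iqr = q3 - q1
outliers = df[(df['feature_to_check'] < q1 - 1.5 * iqr) |
              (df['feature_to_check'] > q3 + 1.5 * iqr)]
print(len(outliers), 'potential outliers')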


Visualization: Distributions
Understand the distribution of a single variable.

# Histogram with Seaborn, with a kernel density curve overlaid
sns.histplot(df['age'], kde=True)
plt.title('Age Distribution')
plt.show()


Visualization: Relationships
Explore how two variables relate to each other.

# Scatter plot to see the relationship between two variables
sns.scatterplot(x='feature1', y='feature2', data=df)
plt.title('Feature1 vs. Feature2')
plt.show()


Visualization: Categorical Data
Use bar plots for categorical data.

# Count plot for a categorical variable
sns.countplot(x='category_column', data=df)
plt.title('Count of Categories')
plt.xticks(rotation=45)
plt.show()
