Data Processing

The document provides a comprehensive guide on data preprocessing using Python, covering essential libraries like Pandas and NumPy, and steps for loading, exploring, and cleaning data. It includes techniques for handling missing values, correcting data types, and visualizing relationships through various plots. Additionally, it discusses feature engineering, encoding categorical variables, and scaling features for machine learning applications.

Uploaded by

Hshshe

Data Preprocess

Data Pre-Processing

A Guide to Understanding and Preparing Your Data

mlpfu.pages.dev 1

Core Libraries
These are the foundational tools for data manipulation and visualization
in Python.

Pandas: For data structures and analysis.

NumPy: For numerical operations.
Matplotlib & Seaborn: For data visualization.

# Importing the essentials

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


Loading Data
The first step is to get your data into a Pandas DataFrame.

# Load data from a CSV file

df = pd.read_csv('your_dataset.csv')

# Display the first 5 rows

print(df.head())
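
If no CSV file is at hand, the same call can be tried on in-memory text. The sketch below assumes a made-up three-row dataset (the names csv_text and df_demo and the columns id, price, and city are illustrative, not part of the original guide) and relies on read_csv accepting any file-like object.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for 'your_dataset.csv'
csv_text = """id,price,city
1,250000,Austin
2,,Boston
3,410000,austin
"""

# read_csv accepts a file-like object, so StringIO behaves like a file on disk
df_demo = pd.read_csv(io.StringIO(csv_text))

# The empty price field is loaded as NaN, which makes the column float
print(df_demo.head())
print(df_demo.shape)
```

Loading a small sample like this is a cheap way to check separators and header handling before reading a large file.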


Initial Data Exploration

Get a first look at your dataset's structure and content.

# Get a concise summary of the dataframe

df.info()

# Generate descriptive statistics

df.describe()


Deeper Data Inspection

Go beyond the basics to understand your data's shape and types.

# Check the dimensions of the DataFrame (rows, columns)

print(df.shape)

# List all column names

print(df.columns)

# Check the data type of each column

print(df.dtypes)


Exploring Categorical Data with value_counts

A good first step for any categorical column is value_counts(), which shows the
number of times each category appears.

Why is this important?

Reveals the distribution of categories.

Helps identify data quality issues (e.g., typos, mixed case).
Shows if you have a data imbalance problem.

# Get the counts of each unique value in a column

print(df['category_column'].value_counts())
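
To illustrate the data quality point above, here is a minimal sketch on an invented column (the Series colors and its values are assumptions for the example). It shows how mixed case and stray whitespace split one category into several, and how normalize=True reports proportions instead of counts.

```python
import pandas as pd

# Hypothetical categorical data with mixed case and a leading space
colors = pd.Series(['Red', 'red', ' Blue', 'blue', 'RED', 'Green'])

# Raw counts treat every spelling variant as its own category
raw_counts = colors.value_counts()
print(raw_counts)

# Trim whitespace and lowercase before counting again
clean = colors.str.strip().str.lower()
clean_counts = clean.value_counts()
print(clean_counts)

# normalize=True returns shares, which makes class imbalance easier to read
shares = clean.value_counts(normalize=True)
print(shares)
```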


Correcting Data Types

Sometimes data is loaded with the wrong type (e.g., numbers as text).
This is a common and critical cleaning step.

# Example: a 'price' column was read as object (text) because of '$' symbols.
# First, remove the non-numeric characters.
# A raw string (r'...') keeps '\$' as a regex for a literal dollar sign.
df['price_cleaned'] = df['price'].replace({r'\$': ''}, regex=True)

# Now, convert the column to a numeric type.

# errors='coerce' will turn any remaining non-numeric values into NaN
df['price_numeric'] = pd.to_numeric(df['price_cleaned'], errors='coerce')

# Verify the change

print(df.dtypes)
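
The same two-step conversion can be tried on a small invented Series. This is a sketch under assumptions: the sample values, the thousands separators, and the 'n/a' placeholder are made up, and the pattern [$,] is one possible choice for stripping symbols.

```python
import pandas as pd

# Hypothetical price strings, including a value that cannot be parsed
prices = pd.Series(['$1,200', '$950', 'n/a', '$3,075.50'])

# Remove dollar signs and thousands separators
stripped = prices.str.replace(r'[$,]', '', regex=True)

# errors='coerce' turns anything still non-numeric into NaN
numeric = pd.to_numeric(stripped, errors='coerce')
print(numeric)

# Count how many values were lost in the conversion
n_failed = numeric.isna().sum() - prices.isna().sum()
print(n_failed)
```

Checking the number of coerced values is worthwhile because errors='coerce' hides parsing failures silently.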


Analyzing Two Categories: Crosstabulation

A crosstab (or contingency table) shows how often each combination of two
categorical variables occurs, which makes relationships between them easy to spot.

# Create a table showing the frequency of each combination

# of 'category1' and 'category2'.
crosstab_result = pd.crosstab(df['category1'], df['category2'])

print(crosstab_result)

# This can be visualized with a heatmap for better readability

sns.heatmap(crosstab_result, annot=True, fmt='d')
plt.show()
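
Raw counts can be hard to compare when the groups differ in size. As a sketch on invented data (the DataFrame demo and its plan and churned columns are assumptions), the normalize='index' option of pd.crosstab converts each row to proportions.

```python
import pandas as pd

# Hypothetical data: subscription plan versus whether the customer churned
demo = pd.DataFrame({
    'plan': ['basic', 'basic', 'pro', 'pro', 'pro', 'basic'],
    'churned': ['no', 'yes', 'no', 'no', 'yes', 'no'],
})

# Frequency of each (plan, churned) combination
counts = pd.crosstab(demo['plan'], demo['churned'])
print(counts)

# Share of each outcome within a plan; every row sums to 1
row_share = pd.crosstab(demo['plan'], demo['churned'], normalize='index')
print(row_share)
```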


Handling Duplicates & Unwanted Columns

Keep your dataset clean and relevant.

# Check for and count duplicate rows

print(df.duplicated().sum())

# Remove duplicate rows

df = df.drop_duplicates()

# Remove a single unwanted column

df = df.drop('unwanted_column_name', axis=1)
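
Rows are often duplicated on a key rather than on every column. The sketch below assumes an invented orders table (the column names order_id, amount, and notes are illustrative) and uses the subset and keep parameters of duplicated and drop_duplicates.

```python
import pandas as pd

# Hypothetical orders: order 1 is repeated exactly, order 3 has two versions
orders = pd.DataFrame({
    'order_id': [1, 1, 2, 3, 3],
    'amount': [10.0, 10.0, 25.0, 40.0, 45.0],
    'notes': ['a', 'a', 'b', 'c', 'c'],
})

# Rows that match an earlier row in every column
n_exact = orders.duplicated().sum()

# Rows that repeat an earlier order_id, regardless of other columns
n_by_key = orders.duplicated(subset=['order_id']).sum()

# Keep only the last row seen for each order_id
deduped = orders.drop_duplicates(subset=['order_id'], keep='last')

# drop(columns=...) is an equivalent, more explicit form of drop(..., axis=1)
slim = deduped.drop(columns=['notes'])
print(n_exact, n_by_key)
print(slim)
```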


Creating Custom Columns

Engineer new features from existing data.

# Example: Create 'price_per_sqft' from 'price' and 'sqft_living'
# (both columns must already be numeric)

df['price_per_sqft'] = df['price'] / df['sqft_living']

# Display the new column

print(df[['price', 'sqft_living', 'price_per_sqft']].head())


Handling Missing Values

Missing data is common. Here's how to handle it.

# Check for missing values in each column

print(df.isnull().sum())

# Option 1: Drop rows with any missing values

df_cleaned = df.dropna()

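# Option 2: Fill missing values in one column with that column's mean.
# Passing a dict limits the fill to 'some_column'; a bare df.fillna(mean_value)
# would write this mean into the gaps of every column.

mean_value = df['some_column'].mean()
df_filled = df.fillna({'some_column': mean_value})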
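
Different column types usually call for different fill values. As one possible approach (the DataFrame people and the choice of median for numbers and mode for text are assumptions, not the original guide's prescription), fillna accepts a dictionary that maps column names to fill values.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing number and one missing label
people = pd.DataFrame({
    'age': [25.0, np.nan, 40.0, 31.0],
    'city': ['Austin', None, 'Austin', 'Boston'],
})

# Median for the numeric column, most frequent value for the text column
fill_values = {
    'age': people['age'].median(),
    'city': people['city'].mode().iloc[0],
}

# fillna returns a new DataFrame; the original is left unchanged
filled = people.fillna(fill_values)
print(filled)
```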


Data Transformation: Encoding

Machine learning models need numerical input.

# One-Hot Encode a categorical column

# 'prefix' is used to name the new columns
dummies = pd.get_dummies(df['category_column'], prefix='cat')

# Add the new columns and drop the original

df = pd.concat([df, dummies], axis=1)
df = df.drop('category_column', axis=1)
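
The concat-and-drop steps can also be done in one call by passing the whole DataFrame to pd.get_dummies with the columns argument. This is a sketch on invented data (the houses table and its type column are assumptions). The drop_first=True variant removes one redundant indicator column, which some linear models prefer.

```python
import pandas as pd

# Hypothetical data with one categorical column
houses = pd.DataFrame({
    'price': [100, 200, 150],
    'type': ['flat', 'house', 'flat'],
})

# Encode 'type' in place; the original column is replaced by indicators
encoded = pd.get_dummies(houses, columns=['type'], prefix='type', dtype=int)
print(encoded)

# drop_first=True keeps k-1 indicators for k categories
reduced = pd.get_dummies(houses, columns=['type'], drop_first=True, dtype=int)
print(reduced)
```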


Data Transformation: Feature Scaling

Feature scaling rescales the data itself so that features measured on different
ranges become comparable. It is not to be confused with model scaling.

from sklearn.preprocessing import MinMaxScaler

# Scale a numerical feature to a range (e.g., 0-1)

scaler = MinMaxScaler()
df['scaled_feature'] = scaler.fit_transform(df[['numeric_feature']])
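
With its default range of 0 to 1, MinMaxScaler applies the formula (x - min) / (max - min). The sketch below reproduces that formula with plain pandas on invented numbers (the train and test Series are assumptions) to show why the minimum and maximum are usually learned from the training data only.

```python
import pandas as pd

# Hypothetical training and test values for one numeric feature
train = pd.Series([10.0, 20.0, 30.0])
test = pd.Series([15.0, 40.0])

# Learn the range from the training data only
v_min, v_max = train.min(), train.max()

# Apply the same transformation to both sets
train_scaled = (train - v_min) / (v_max - v_min)
test_scaled = (test - v_min) / (v_max - v_min)

print(train_scaled.tolist())
print(test_scaled.tolist())
```

Test values outside the training range land outside 0 to 1, which is expected and signals that the new data differs from what the scaler saw.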


Visualization: Correlation Heatmap

Visualize the correlation between numerical features.

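# Calculate the correlation matrix
# numeric_only=True skips text columns, which otherwise raise an error in pandas 2.x

corr_matrix = df.corr(numeric_only=True)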

# Plot the heatmap

plt.figure(figsize=(12, 10))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.show()
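
On wide datasets the heatmap gets crowded, so a ranked list against one column of interest can be easier to read. A minimal sketch, assuming an invented table in which price is the target (all column names and values are made up for illustration):

```python
import pandas as pd

# Hypothetical housing data; 'city' is text and is excluded from the matrix
demo = pd.DataFrame({
    'sqft': [800, 1200, 1500, 2000],
    'price': [160, 240, 300, 400],
    'age': [40, 30, 20, 5],
    'city': ['a', 'b', 'a', 'b'],
})

corr = demo.corr(numeric_only=True)

# Correlation of every other numeric feature with 'price',
# ordered by absolute strength so strong negatives are not buried
target_corr = corr['price'].drop('price').sort_values(key=abs, ascending=False)
print(target_corr)
```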


Visualization: Bivariate Analysis with Pairplot

Quickly visualize relationships across your entire dataset.

# Creates scatterplots for joint relationships and histograms for univariate distributions.
# Use a subset of columns for readability on large datasets.
sns.pairplot(df[['col1', 'col2', 'col3']])
plt.show()


Visualization: Grouping and Aggregating

Explore data by grouping it based on categories.

# Group by a category and calculate the mean of another column

avg_price_by_category = df.groupby('category')['price'].mean()
print(avg_price_by_category)

# Visualize the aggregated data

avg_price_by_category.plot(kind='bar')
plt.show()
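
A mean alone can hide small or skewed groups. As a sketch on invented data (the sales table is an assumption), agg computes several statistics per group in one pass.

```python
import pandas as pd

# Hypothetical sales data with two categories
sales = pd.DataFrame({
    'category': ['a', 'a', 'b', 'b', 'b'],
    'price': [10.0, 20.0, 5.0, 7.0, 9.0],
})

# Group size, mean, and median of price for each category
summary = sales.groupby('category')['price'].agg(['count', 'mean', 'median'])
print(summary)
```

Including the count next to the mean shows how many rows each average is based on.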


Visualization: Outlier Detection

Identify outliers in your data using box plots.

# Create a box plot for a specific feature

sns.boxplot(x=df['feature_to_check'])
plt.title('Outlier Detection in Feature')
plt.show()
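
By default, a box plot draws points beyond 1.5 times the interquartile range (IQR) from the quartiles as outliers. The same rule can be applied numerically to list those rows. A minimal sketch on invented values (the Series values is an assumption):

```python
import pandas as pd

# Hypothetical measurements with one extreme value
values = pd.Series([10, 12, 11, 13, 12, 11, 95])

# Quartiles and interquartile range
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Fences used by the common 1.5 * IQR rule
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Values outside the fences are flagged as potential outliers
outliers = values[(values < lower) | (values > upper)]
print(lower, upper)
print(outliers)
```

Flagged values are candidates for inspection, not automatic removal, since an extreme value may be a genuine observation.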


Visualization: Distributions
Understand the distribution of a single variable.

# Histogram with Seaborn

sns.histplot(df['age'], kde=True)
plt.title('Age Distribution')
plt.show()


Visualization: Relationships
Explore how two variables relate to each other.

# Scatter plot to see the relationship between two variables

sns.scatterplot(x='feature1', y='feature2', data=df)
plt.title('Feature1 vs. Feature2')
plt.show()


Visualization: Categorical Data

Use bar plots for categorical data.

# Count plot for a categorical variable

sns.countplot(x='category_column', data=df)
plt.title('Count of Categories')
plt.xticks(rotation=45)
plt.show()
