
Data Preparation Guide

The document provides a comprehensive guide on data preparation and exploration, detailing the steps involved in cleaning, transforming, and organizing raw data for analysis. It includes Python code examples for data collection, cleaning, transformation, integration, and reduction, as well as exploratory data analysis techniques. The guide emphasizes the importance of high-quality data in achieving accurate insights and model performance.

Uploaded by

aymenahmed630


Data Preparation and Exploration: A Detailed Guide with Examples and Python Code

1. Introduction to Data Preparation

Data preparation is the process of cleaning, transforming, and organizing raw data into a usable format for
analysis, modeling, or reporting. It is a critical step in the data science workflow because high-quality data
leads to more accurate and reliable insights. Poor data quality can result in misleading conclusions, poor
model performance, and wasted effort.

2. Steps in Data Preparation

Step 1: Data Collection

Collect raw data from one or more sources, such as:

• Databases (SQL, NoSQL)
• APIs (RESTful services)
• CSV/Excel files
• Web scraping

Python Example:

import pandas as pd
# Load from CSV
df = pd.read_csv('sales_data.csv')
# Load from Excel
df_excel = pd.read_excel('customers.xlsx')

Output Preview:

   customer_id   age   income  gender  online_purchase  store_purchase
0            1  28.0  45000.0    Male            250.0           100.0
1            2  34.0  52000.0  Female            300.0           150.0
2            3  45.0  61000.0    Male            400.0           200.0
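
The example above covers file sources only. For the database source in the list, a minimal sketch using Python's built-in sqlite3 module together with pandas.read_sql_query might look like the following. The in-memory database, the table name sales, and its rows are made up for illustration and are not part of the original dataset.

```python
import sqlite3

import pandas as pd

# Hypothetical in-memory database standing in for a real SQL source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customer_id INTEGER, price REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [(1, 19.99), (2, 5.50), (3, 12.00)],
)
conn.commit()

# Load the query result straight into a DataFrame.
df_sql = pd.read_sql_query("SELECT customer_id, price FROM sales", conn)
conn.close()

print(df_sql)
```

With a real database, only the connection line would change; the query-to-DataFrame step stays the same.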

Step 2: Data Cleaning

a) Handling Missing Values

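# Fill missing prices with the column mean.
# Assign the result back to the column: recent pandas versions warn when
# fillna(inplace=True) is called on a selected column.

df['price'] = df['price'].fillna(df['price'].mean())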

Output:

df['price'].isnull().sum()
0
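
Mean imputation is sensitive to extreme values, so it can help to compare it with the median before choosing. The sketch below uses a small made-up price column (not the guide's data) to show how far apart the two fill values can be when one price is extreme.

```python
import numpy as np
import pandas as pd

# Toy data (made up): one missing price and one extreme value.
prices = pd.DataFrame({"price": [10.0, 12.0, np.nan, 11.0, 500.0]})

# Count missing values before deciding how to fill them.
n_missing = prices["price"].isna().sum()

# Fill the gap with the mean and, separately, with the median.
mean_fill = prices["price"].fillna(prices["price"].mean())
median_fill = prices["price"].fillna(prices["price"].median())

print(n_missing)       # 1
print(mean_fill[2])    # 133.25, pulled up by the 500.0 value
print(median_fill[2])  # 11.5
```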

b) Removing Duplicates

# Remove duplicated records

df.drop_duplicates(inplace=True)

Output:

df.duplicated().sum()
0
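
Calling drop_duplicates() with no arguments only removes rows that match in every column. When a key such as customer_id should be unique, the subset and keep parameters give finer control. The following sketch uses made-up rows to illustrate the difference.

```python
import pandas as pd

# Toy data (made up): customer 2 appears twice with different ages.
customers = pd.DataFrame(
    {"customer_id": [1, 2, 2, 3], "age": [28, 34, 35, 45]}
)

# Exact-row matching finds nothing, because the age values differ.
exact_dupes = customers.duplicated().sum()

# Matching on the key column flags the second row for customer 2.
key_dupes = customers.duplicated(subset=["customer_id"]).sum()

# Keep the last record per customer_id.
deduped = customers.drop_duplicates(subset=["customer_id"], keep="last")

print(exact_dupes, key_dupes)  # 0 1
print(deduped)
```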

c) Correcting Data Types and Formats

# Convert date column to datetime

df['date'] = pd.to_datetime(df['date'])

Output:

df.dtypes
customer_id int64
age float64
income float64
gender object
online_purchase float64
store_purchase float64
price float64
date datetime64[ns]
dtype: object
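
Real columns often contain entries that cannot be parsed, and by default pd.to_datetime raises an error on them. A hedged sketch of one common approach is shown here: errors="coerce" converts bad entries to missing values so they can be handled in the missing-value step. The rows are invented for illustration.

```python
import pandas as pd

# Toy data (made up): one malformed date and one non-numeric income.
raw = pd.DataFrame(
    {
        "date": ["2024-01-15", "not a date", "2024-03-02"],
        "income": ["45000", "52000", "unknown"],
    }
)

# errors="coerce" turns unparseable entries into NaT / NaN instead of raising.
raw["date"] = pd.to_datetime(raw["date"], errors="coerce")
raw["income"] = pd.to_numeric(raw["income"], errors="coerce")

print(raw.dtypes)
print(raw.isna().sum())
```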

d) Handling Outliers

# Remove outliers using IQR

Q1 = df['income'].quantile(0.25)
Q3 = df['income'].quantile(0.75)
IQR = Q3 - Q1
filtered_df = df[(df['income'] >= Q1 - 1.5 * IQR) & (df['income'] <= Q3 + 1.5 * IQR)]

Output:

filtered_df.shape
(95, 7)
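
To make the IQR rule concrete, here is a small worked example on six made-up income values. It computes the lower and upper fences explicitly and lists the values that fall outside them; the variable names are illustrative only.

```python
import pandas as pd

# Toy incomes (made up) with one extreme value.
income = pd.Series([40000, 45000, 50000, 55000, 60000, 250000])

q1 = income.quantile(0.25)   # 46250.0
q3 = income.quantile(0.75)   # 58750.0
iqr = q3 - q1                # 12500.0

# Fences: 1.5 * IQR below Q1 and above Q3.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Boolean mask of values inside the fences (inclusive on both ends).
inside = income.between(lower, upper)

print(lower, upper)               # 27500.0 77500.0
print(income[~inside].tolist())   # [250000]
```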

Step 3: Data Transformation

a) Standardization and Normalization

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
df[['age', 'income']] = scaler.fit_transform(df[['age', 'income']])

Output:

df[['age', 'income']].head()
age income
0 -1.214678 -0.892450
1 -0.542135 -0.372156
2 0.421251 0.653435
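
The heading mentions both standardization and normalization, while the code above shows only the former. The sketch below computes both by hand on made-up values so the formulas are visible: z-score standardization, (x - mean) / std, and min-max normalization, (x - min) / (max - min). It uses the population standard deviation (ddof=0), which is the convention StandardScaler follows.

```python
import numpy as np
import pandas as pd

# Toy data (made up).
data = pd.DataFrame(
    {"age": [28.0, 34.0, 45.0], "income": [45000.0, 52000.0, 61000.0]}
)

# Z-score standardization: each column ends up with mean 0 and
# population standard deviation 1.
standardized = (data - data.mean()) / data.std(ddof=0)

# Min-max normalization: each column is rescaled to the range [0, 1].
normalized = (data - data.min()) / (data.max() - data.min())

print(standardized.round(3))
print(normalized.round(3))
```

Standardization suits methods that assume roughly centered inputs, while min-max scaling is useful when a bounded range is required.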

b) Encoding Categorical Variables

# Binary encoding
df['gender'] = df['gender'].map({'Male': 0, 'Female': 1})

Output:

df['gender'].value_counts()
0    52
1    48
Name: gender, dtype: int64
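
A dictionary mapping works for two categories, but note that .map() leaves NaN for any value missing from the dictionary. For a column with more than two unordered categories, one-hot encoding is a common alternative. The sketch below uses pd.get_dummies on a made-up region column, which is not part of the guide's dataset.

```python
import pandas as pd

# Toy data (made up): a column with three unordered categories.
stores = pd.DataFrame({"region": ["North", "South", "East", "North"]})

# One indicator column per category; dtype=int gives 0/1 instead of booleans.
encoded = pd.get_dummies(stores, columns=["region"], dtype=int)

print(encoded)
```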

c) Feature Engineering

# Create a new feature

df['total_purchase'] = df['online_purchase'] + df['store_purchase']

Output Preview:

df[['online_purchase', 'store_purchase', 'total_purchase']].head()

online_purchase store_purchase total_purchase
0 250.0 100.0 350.0
1 300.0 150.0 450.0
2 400.0 200.0 600.0
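
Ratio features are another common kind of engineered feature, and they need a guard against division by zero. The following sketch derives a hypothetical online_share column (not in the original guide) from made-up purchase values, returning NaN where a customer has no purchases at all.

```python
import numpy as np
import pandas as pd

# Toy data (made up): the second customer has no purchases.
purchases = pd.DataFrame(
    {"online_purchase": [250.0, 0.0, 400.0], "store_purchase": [100.0, 0.0, 200.0]}
)

total = purchases["online_purchase"] + purchases["store_purchase"]

# Replace a zero total with NaN so the division yields NaN rather than
# raising a warning or producing inf/NaN implicitly.
purchases["online_share"] = purchases["online_purchase"] / total.replace(0, np.nan)

print(purchases)
```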

Step 4: Data Integration

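# Join sales records to customer attributes on the shared key.
# df_sales and df_customers are assumed to be DataFrames loaded beforehand.
merged_df = pd.merge(df_sales, df_customers, on='customer_id')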

Output:

merged_df.shape
(100, 10)
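
By default pd.merge performs an inner join, so rows whose key appears in only one table are dropped without notice. A short sketch on made-up tables shows how the how and indicator parameters can expose such rows before they are lost.

```python
import pandas as pd

# Toy tables (made up): customer 3 has no sale, customer 4 has no profile.
df_sales = pd.DataFrame({"customer_id": [1, 2, 4], "price": [19.99, 5.50, 12.00]})
df_customers = pd.DataFrame({"customer_id": [1, 2, 3], "age": [28, 34, 45]})

# The default how="inner" keeps only keys present in both tables.
inner = pd.merge(df_sales, df_customers, on="customer_id")

# how="outer" with indicator=True adds a _merge column that records
# which side each row came from.
outer = pd.merge(
    df_sales, df_customers, on="customer_id", how="outer", indicator=True
)

print(inner.shape)  # (2, 3)
print(outer[["customer_id", "_merge"]])
```

Comparing the row counts of the inner and outer results is a quick check that the join keys line up as expected.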

Step 5: Data Reduction

a) Feature Selection

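# Keep only the columns needed downstream; .copy() avoids SettingWithCopyWarning
# if new columns are assigned to df later.
df = df[['customer_id', 'gender', 'age', 'income', 'total_purchase']].copy()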

Output:

df.columns
Index(['customer_id', 'gender', 'age', 'income', 'total_purchase'],
dtype='object')

b) Dimensionality Reduction (PCA)

from sklearn.decomposition import PCA

pca = PCA(n_components=2)
reduced_data = pca.fit_transform(df[['age', 'income', 'total_purchase']])

Output:

reduced_data.shape
(100, 2)
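
To see what PCA computes and how much information two components retain, the sketch below reproduces the core steps with NumPy only: center the data, take a singular value decomposition, project, and compute the share of variance per component. The data is synthetic and the variable names are illustrative; scikit-learn's PCA exposes the same ratio through its explained_variance_ratio_ attribute.

```python
import numpy as np

# Synthetic data (made up): 100 rows, 3 features, the third nearly a
# linear combination of the first two.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
X = np.column_stack([X, X[:, 0] + X[:, 1] + 0.01 * rng.normal(size=100)])

# Center the columns, then take the SVD of the centered matrix.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Project onto the first two principal directions.
reduced = Xc @ Vt[:2].T

# Share of total variance captured by each component.
explained_ratio = S**2 / np.sum(S**2)

print(reduced.shape)  # (100, 2)
print(explained_ratio.round(4))
```

Because the third feature is almost redundant here, the first two components carry nearly all of the variance. On real data, this ratio is a reasonable basis for choosing n_components.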

3. Data Exploration (EDA)

3.1 Descriptive Statistics

print(df.describe())

Output:

       customer_id     gender        age     income  total_purchase
count   100.000000  100.00000  100.00000  100.00000      100.000000
mean     50.500000    0.48000    0.00000    0.00000      425.000000
std      29.011492    0.50253    1.00000    1.00000      100.000000
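
Overall summaries can hide differences between groups and relationships between columns. As a hedged illustration on made-up rows, the sketch below adds a per-group summary with groupby and a pairwise Pearson correlation.

```python
import pandas as pd

# Toy data (made up).
df_demo = pd.DataFrame(
    {
        "gender": ["Male", "Female", "Male", "Female"],
        "income": [45000.0, 52000.0, 61000.0, 48000.0],
        "total_purchase": [350.0, 450.0, 600.0, 400.0],
    }
)

# Per-group count and mean of a numeric column.
by_gender = df_demo.groupby("gender")["total_purchase"].agg(["count", "mean"])

# Pearson correlation between two numeric columns.
corr = df_demo["income"].corr(df_demo["total_purchase"])

print(by_gender)
print(round(corr, 3))
```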

3.2 Visual Exploration

(Charts are not shown as text; they are produced directly when the code is executed.)
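
The original guide does not include its plotting code. As one possible sketch, assuming matplotlib is installed, the following draws a histogram and a scatter plot from made-up data and saves them to a file. The output file name eda_plots.png and all values are hypothetical.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import pandas as pd

# Toy data (made up).
df_plot = pd.DataFrame(
    {
        "income": [45000, 52000, 61000, 48000, 75000],
        "total_purchase": [350, 450, 600, 400, 700],
    }
)

fig, (ax_hist, ax_scatter) = plt.subplots(1, 2, figsize=(8, 3))

# Distribution of a single numeric column.
ax_hist.hist(df_plot["income"], bins=5)
ax_hist.set_title("Income distribution")

# Relationship between two numeric columns.
ax_scatter.scatter(df_plot["income"], df_plot["total_purchase"])
ax_scatter.set_xlabel("income")
ax_scatter.set_ylabel("total_purchase")

fig.tight_layout()
fig.savefig("eda_plots.png")  # hypothetical output file name
```

In an interactive session, plt.show() can replace the savefig call and the Agg backend line can be dropped.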

6. Output of Final Workflow Checks

print(df.head())

Output Preview:

   customer_id  gender       age    income  total_purchase
0            1       0 -1.214678 -0.892450           350.0
1            2       1 -0.542135 -0.372156           450.0
2            3       0  0.421251  0.653435           600.0

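# Store the two PCA components (reduced_data from Step 5b) as columns first;
# the original snippet printed PCA1 and PCA2 without creating them.
df['PCA1'], df['PCA2'] = reduced_data[:, 0], reduced_data[:, 1]
print(df[['PCA1', 'PCA2']].describe())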

Output:

PCA1 PCA2
count 100.000000 100.00000
mean 0.000000 0.00000
std 1.000000 1.00000

End of Document
