Data Exploration & Visualization
Unit-2
Dr. Selva Kumar S
Assistant Professor, Dept. of CSE
Big Data and Analytics by Seema Acharya and Subhashini Chellappan
Copyright 2015, WILEY INDIA PVT. LTD.
Agenda
Transformation Techniques
Performing data deduplication
Replacing values
Discretization and binning
Introduction to Missing data
Handling missing data
Traditional methods
Maximum Likelihood Estimation
Prerequisites
Pandas
NumPy
Seaborn
Matplotlib
https://www.youtube.com/watch?v=kjXgYBiOzc4
Data Transformation
Data transformation is a set of techniques used to convert data from one format or structure to another format or structure.
Transformation Activities
Data deduplication
Key restructuring
Data cleansing
Data validation
Format revisioning
Data derivation
Data aggregation
Data integration
Data filtering
Data joining
Data deduplication
Original Data:
Deduplication Process:
1. Identification of Duplicates: Identify duplicates based on specific criteria, such as matching email addresses or phone numbers.
2. Removal of Duplicates: After identification, choose which records to keep and which ones to remove.
Deduplicated Data:
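A minimal sketch of this process with pandas, assuming a small hypothetical customer table where duplicates are identified by email address:
import pandas as pd
# Hypothetical customer records; the first and third rows share an email address
customers = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Alice A.'],
    'Email': ['alice@example.com', 'bob@example.com', 'alice@example.com']
})
# Step 1: identify duplicates based on the Email column
duplicates = customers.duplicated(subset=['Email'], keep='first')
# Step 2: remove them, keeping the first occurrence
deduplicated = customers[~duplicates]
print(deduplicated)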
Key restructuring
Key restructuring involves transforming any keys with built-in meanings into generic keys.
This restructuring makes the keys more generic and applicable to a wider range of data or entities, rather than being specifically tied to products.
It allows for easier reuse and scalability if the dataset expands beyond just products to include other types of entities.
Example:
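The example table on the original slide is an image; a minimal sketch of the idea, assuming hypothetical product keys that encode meaning (category and year) being replaced by generic surrogate keys:
import pandas as pd
# Hypothetical data: keys with built-in meaning (category and year encoded in the key)
items = pd.DataFrame({
    'Old_Key': ['PRD-2023-TV', 'PRD-2023-FRIDGE', 'PRD-2024-OVEN'],
    'Name': ['Television', 'Refrigerator', 'Oven']
})
# Replace the meaningful keys with generic surrogate keys
items['Generic_Key'] = range(1, len(items) + 1)
print(items[['Generic_Key', 'Name', 'Old_Key']])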
Data cleansing
Data cleansing involves extracting words and deleting out-of-date, inaccurate, and incomplete information from the source data without altering its intended meaning.
Original Data:
Data Cleansing Process:
1. Extracting Words: First, extract words from the product descriptions to remove spelling errors and make the text more uniform.
2. Deleting Outdated, Inaccurate, and Incomplete Information: Remove outdated, inaccurate, or incomplete information without altering the intended meaning.
Cleansed Data:
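The before/after tables are images on the original slide; a minimal sketch of the idea, assuming a hypothetical product table with messy description text:
import pandas as pd
# Hypothetical product descriptions with extra whitespace and inconsistent casing
products = pd.DataFrame({
    'Product': ['Laptop', 'Phone'],
    'Description': ['  Fast LAPTOP, 8GB ram  ', 'phone with   great camera']
})
# Normalize whitespace and casing without changing the meaning
products['Description'] = (products['Description']
                           .str.strip()
                           .str.replace(r'\s+', ' ', regex=True)
                           .str.lower())
print(products)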
Data validation
Data validation is a process of formulating rules or algorithms that help in
validating different types of data against some known issues.
Data Validation Rules:
1. Name Field:
   Rule: The name should contain only alphabetic characters and spaces.
   Example: John Doe, Alice Smith
2. Email Field:
   Rule: Email should follow a standard email format.
   Example: [email protected], [email protected]
3. Phone Number Field:
   Rule: Phone number should be in a specific format (e.g., ###-###-####).
   Example: 123-456-7890, 987-654-3210
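A minimal sketch of how such rules might be checked with pandas string matching, assuming a hypothetical contacts table (the column names and regular expressions are illustrative):
import pandas as pd
# Hypothetical records to validate
contacts = pd.DataFrame({
    'Name': ['John Doe', 'Alice1'],
    'Phone': ['123-456-7890', '98765']
})
# Rule: name contains only alphabetic characters and spaces
contacts['Name_valid'] = contacts['Name'].str.fullmatch(r'[A-Za-z ]+')
# Rule: phone follows the ###-###-#### pattern
contacts['Phone_valid'] = contacts['Phone'].str.fullmatch(r'\d{3}-\d{3}-\d{4}')
print(contacts)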
Format revisioning
Format revisioning involves converting from one format to another.
Format Revisioning Process:
1. Identifying Old Format Files: Locate the existing text documents in the older "DOC" format that need to be revised or updated to the newer format.
2. Converting to New Format: Use software or tools capable of batch conversion to convert all "DOC" files to the "DOCX" format. Microsoft Word, for instance, allows multiple "DOC" files to be opened and saved as "DOCX" files.
Data derivation
Data derivation consists of creating a set of rules to generate more information
from the data source.
Example:
1. Creating a New Column - Total Sales: Units_Sold * Revenue
2. Deriving Monthly Sales Data: Sum the Total_sales
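A minimal sketch of both derivations in pandas, assuming a hypothetical sales table with Date, Units_Sold, and per-unit Revenue columns:
import pandas as pd
# Hypothetical daily sales records
sales = pd.DataFrame({
    'Date': pd.to_datetime(['2023-01-05', '2023-01-20', '2023-02-10']),
    'Units_Sold': [10, 5, 8],
    'Revenue': [15.0, 15.0, 20.0]   # revenue per unit (assumption)
})
# 1. Create a new column using the rule Total_Sales = Units_Sold * Revenue
sales['Total_Sales'] = sales['Units_Sold'] * sales['Revenue']
# 2. Derive monthly sales data by summing Total_Sales per month
monthly_sales = sales.groupby(sales['Date'].dt.to_period('M'))['Total_Sales'].sum()
print(monthly_sales)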
Data aggregation
Data aggregation involves searching, extracting, summarizing, and preserving
important information in different types of reporting systems.
Example:
1. Summarizing Daily Sales:
2. Calculating Monthly Revenue:
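A minimal sketch of both aggregations with pandas groupby, assuming a hypothetical transaction table with Date and Amount columns:
import pandas as pd
# Hypothetical transaction-level data
transactions = pd.DataFrame({
    'Date': pd.to_datetime(['2023-01-01', '2023-01-01', '2023-01-02', '2023-02-01']),
    'Amount': [100, 250, 80, 300]
})
# 1. Summarizing daily sales
daily_sales = transactions.groupby('Date')['Amount'].sum()
# 2. Calculating monthly revenue
monthly_revenue = transactions.groupby(transactions['Date'].dt.to_period('M'))['Amount'].sum()
print(daily_sales)
print(monthly_revenue)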
Data integration
Data integration involves converting different data types and merging them into a
common structure or schema.
Example:
Suppose a retail company has data stored in various formats and locations:
Sales Data: Stored in a SQL database with columns like Transaction_ID,
Product_ID, Quantity, Price, and Timestamp.
Customer Data: Stored in a CSV file containing customer information such as
Customer_ID, Name, Email, and Phone_Number.
Inventory Data: Stored in an Excel sheet with details like Product_ID,
Product_Name, Stock_Quantity, and Supplier.
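A minimal sketch of bringing such sources into a common schema with pandas, using in-memory stand-ins for the sources (in practice the data might come from pd.read_sql, pd.read_csv, and pd.read_excel):
import pandas as pd
# Stand-ins for the SQL sales table and the Excel inventory sheet
sales = pd.DataFrame({'Transaction_ID': [1, 2], 'Product_ID': [10, 11],
                      'Quantity': [2, 1], 'Price': [5.0, 8.0]})
inventory = pd.DataFrame({'Product_ID': [10, 11],
                          'Product_Name': ['Pen', 'Notebook'],
                          'Stock_Quantity': [100, 50]})
# Merge into one common structure on the shared Product_ID key
integrated = sales.merge(inventory, on='Product_ID', how='left')
print(integrated)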
Data filtering
Data filtering involves identifying information relevant to any particular user.
Example:
Imagine a scenario where a user wants to find movies or TV shows based on
specific criteria, such as genre, release year, and rating.
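A minimal sketch of such filtering with boolean indexing, assuming a hypothetical movie catalogue:
import pandas as pd
# Hypothetical catalogue
movies = pd.DataFrame({
    'Title': ['A', 'B', 'C'],
    'Genre': ['Comedy', 'Drama', 'Comedy'],
    'Year': [2019, 2021, 2022],
    'Rating': [7.8, 6.5, 8.2]
})
# Keep only comedies released after 2020 with a rating of at least 8
relevant = movies[(movies['Genre'] == 'Comedy') &
                  (movies['Year'] > 2020) &
                  (movies['Rating'] >= 8)]
print(relevant)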
Data joining
Data joining involves establishing a relationship between two or more tables.
Example:
Result of Data Joining:
Pandas dataframes
Pandas offer different functionalities for
combining DataFrames:
• Append
• Concat
• Merge
• Join
When to use each Function?
• append: Use when adding rows from one DataFrame
to another.
• concat: Use for combining DataFrames along rows or
columns based on axes.
• merge: Use for performing SQL-like joins based on
common columns.
• join: Use for combining DataFrames based on index
or specific columns.
Append function
•Usage: Appending rows from one DataFrame to another. (Note: DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0; pd.concat is the recommended replacement in newer versions.)
•Example:
import pandas as pd
# Creating DataFrames
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})
# Appending df2 to df1
df_appended = df1.append(df2)
print(df_appended)
Concat Function
•Usage: Concatenating DataFrames along rows or columns
based on axes.
•Example:
import pandas as pd
# Creating DataFrames
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})
# Concatenating along rows (axis=0)
df_concatenated = pd.concat([df1, df2], axis=0)
print("Concatenated DataFrame:")
print(df_concatenated)
Merge Function
•Usage: Performing database-style joins by merging
DataFrames based on key columns.
•Example:
import pandas as pd
# Creating DataFrames
df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'ID': [2, 3, 4], 'Age': [25, 30, 28]})
# Merging based on 'ID' column
merged_df = pd.merge(df1, df2, on='ID', how='inner')
print("Merged DataFrame:")
print(merged_df)
Join Function
•Usage: Joining DataFrames based on their index or a
specific column.
•Example:
import pandas as pd
# Creating DataFrames
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]}, index=['X', 'Y'])
df2 = pd.DataFrame({'C': [5, 6], 'D': [7, 8]}, index=['X', 'Y'])
# Joining based on index
joined_df = df1.join(df2)
print("Joined DataFrame:")
print(joined_df)
Guess the output??
import pandas as pd
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'C': [7, 8]})
df_appended = df1.append(df2, sort=False)  # sort=False to handle different columns
print("Appended DataFrame:")
print(df_appended)
Answer
A B C
0 1 3.0 NaN
1 2 4.0 NaN
0 5 NaN 7.0
1 6 NaN 8.0
Guess the output?
import pandas as pd
# Creating DataFrames with unequal columns
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'C': [7, 8]})
# Concatenating along columns (axis=1) with unequal columns
df_concatenated = pd.concat([df1, df2], axis=1, sort=False)
print("Concatenated DataFrame:")
print(df_concatenated)
Answer
A B A C
0 1 3 5 7
1 2 4 6 8
Guess the output?
import pandas as pd
df1 = pd.DataFrame({'ID': [1, 2], 'Name': ['Alice', 'Bob']})
df2 = pd.DataFrame({'ID': [1, 2], 'Age': [25, 30], 'Location': ['NY',
'CA']})
# Merging based on 'ID' column
merged_df = pd.merge(df1, df2, on='ID', how='inner')
print("Merged DataFrame:")
print(merged_df)
Answer
ID Name Age Location
0 1 Alice 25 NY
1 2 Bob 30 CA
Guess the output?
import pandas as pd
# Creating DataFrames
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]}, index=['X', 'Y'])
df2 = pd.DataFrame({'C': [5, 6], 'D': [7, 8]}, index=['X', 'Y'])
# Joining based on index
joined_df = df1.join(df2, how='outer')
print("Joined DataFrame:")
print(joined_df)
Answer
A B C D
X 1 3 5 7
Y 2 4 6 8
Exercise-1
Use Case: Sales Data Analysis
Scenario:
Suppose you have two datasets related to sales:
sales_data_1 containing sales figures for various products in different regions.
sales_data_2 containing additional sales information for specific products.
Tasks:
Append Data: Combine both datasets vertically to create a comprehensive sales dataset.
Concatenate Data: Concatenate the sales data along columns for a more detailed analysis.
Merge Data: Merge the datasets based on a common key (product ID) to enrich the sales
information.
Join Data: Join the datasets based on the product names to analyze sales by product names.
Transformation Techniques
Performing data deduplication
• Data deduplication involves removing duplicate rows from a dataset
to maintain data integrity and accuracy.
• Example:
import pandas as pd
# Sample data with duplicates
data = {
'Order_ID': [101, 102, 101, 103, 104, 102],
'Product': ['A', 'B', 'A', 'C', 'D', 'B'],
'Quantity': [5, 3, 4, 2, 6, 3],
'Amount': [100, 80, 100, 120, 150, 80]
}
sales_data = pd.DataFrame(data)
print("Original Data:")
print(sales_data)
Deduplication
• Deduplication using the drop_duplicates() method in Pandas:
# Removing duplicates based on all columns
deduplicated_data = sales_data.drop_duplicates()
print("\nDeduplicated Data:")
print(deduplicated_data)
Deduplication
# Removing duplicates based on specific columns
deduplicated_data_subset = sales_data.drop_duplicates(subset=['Order_ID', 'Product'])
print("\nDeduplicated Data (Subset):")
print(deduplicated_data_subset)
Guess the output?
import pandas as pd
frame3 = pd.DataFrame({'column 1': ['Looping'] * 3 + ['Functions'] * 4,
                       'column 2': [10, 10, 22, 23, 23, 24, 24]})
frame3
frame3.duplicated()
Answer
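The answer slide is an image in the original; the expected output of frame3.duplicated(), marking the second occurrence of each repeated row, is:
0    False
1     True
2    False
3    False
4     True
5    False
6     True
dtype: bool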
Exercise-2
Scenario: You are working as a data analyst for an
e-commerce company that has a massive dataset
containing customer information. The dataset has
accumulated duplicate entries due to various
reasons like multiple registrations, system errors,
or data entry mistakes. Your task is to use the
Pandas library in Python to identify and remove
these duplicate entries efficiently.
Exercise-3
Scenario: You're analyzing a sales dataset
from a retail store chain that contains
transactions from multiple branches. Due to
system errors and data integration issues,
the dataset has accumulated duplicate
records of sales transactions. Your task is to
use Pandas to identify these duplicates
based on specific columns and retain only
the unique transactions.
Replacing Values
• Use the replace() method:
import pandas as pd
import numpy as np
replaceFrame = pd.DataFrame({'column 1': [200., 3000., -786., 3000., 234., 444., -786., 332., 3332.],
                             'column 2': range(9)})
replaceFrame.replace(to_replace=-786, value=np.nan)
• Replace multiple values:
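A minimal sketch of replacing several values at once (the sentinel values come from the frame above):
replaceFrame.replace(to_replace=[-786, 3000], value=[np.nan, np.nan])
# or, equivalently, with a mapping of old values to new values
replaceFrame.replace({-786: np.nan, 3000: np.nan})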
Exercise-4
Scenario:
You're working with a dataset that contains information about
product sales. However, you've noticed that certain entries
have inconsistent or incorrect values in the 'Category' column.
You need to standardize these categories by replacing specific
values with a unified category name using Pandas' replace()
method.
data = {
'Product_ID': [1, 2, 3, 4, 5],
'Product_Name': ['Apple', 'Banana', 'Orange', 'Pear', 'Grapes'],
'Category': ['Fruit', 'Fruit', 'Fruit', 'Veggie', 'Fruit']
}
Handling Missing Data
• NaN value is used to indicate that there is no value.
• There could be several reasons why a value
could be NaN:
• It can happen when data is retrieved from an
external source and there are some
incomplete values in the dataset.
• It can also happen when we join two different
datasets and some values are not matched.
• Missing values due to data collection errors.
• When the shape of data changes, new additional rows or columns are introduced whose values are not yet determined.
• Reindexing of data can result in incomplete data.
Example
import pandas as pd
import numpy as np
data = np.arange(15, 30).reshape(5, 3)
dfx = pd.DataFrame(data, index=['apple', 'banana', 'kiwi', 'grapes', 'mango'],
                   columns=['store1', 'store2', 'store3'])
dfx
dfx['store4'] = np.nan
dfx.loc['watermelon'] = np.arange(15, 19)
dfx.loc['oranges'] = np.nan
dfx['store5'] = np.nan
dfx.loc['apple', 'store4'] = 20.  # .loc avoids chained-assignment warnings
dfx
NaN values in Pandas Objects
• dfx.isnull(): boolean DataFrame that is True wherever a value is NaN
• dfx.notnull(): the complement, True wherever a value is present
• dfx.isnull().sum(): number of NaN values per column
• dfx.isnull().sum().sum(): total number of NaN values in the DataFrame
• dfx.count(): number of non-NaN values per column
Example
import pandas as pd
# Load your dataset into a DataFrame
df = pd.read_csv('your_dataset.csv')
# Check for missing values in the entire DataFrame
print(df.isnull().sum())
# Check for missing values in a specific column
print(df['Column_Name'].isnull().sum())
Exercise-5
Scenario: You've received a dataset containing information about
students' performance, but the dataset has missing values across
multiple columns due to various reasons. Your task is to analyze and
handle these missing values using Pandas' methods.
Questions:
1. How many missing values are there in the 'Grade' column of the dataset?
2. What is the total count of missing values in the entire dataset?
3. Can you identify and display rows where 'Absenteeism' values are missing?
4. How many rows have complete data (non-missing) in the 'Study Hours' column?
5. What is the sum of missing values across each column in the dataset?
6. How many missing values are there in each column of the dataset?
Dropping missing values
• One of the ways to handle missing values is to simply remove them from
dataset.
• dropna() method just returns a copy of the dataframe by dropping the rows
with NaN. The original dataframe is not changed.
• Drop rows with any missing values
df.dropna(inplace=True)  # Use inplace=True to modify the original DataFrame
• Drop rows where specific columns have missing values
df.dropna(subset=['Column_Name'], inplace=True)
• Drop rows where entire row values are NaN
df.dropna(how='all')
• Drop columns where entire column values are NaN
df.dropna(how='all', axis=1)
• Keep only rows or columns that contain at least a minimum number of non-NaN values, using the thresh parameter (see the sketch below).
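A minimal sketch of the thresh option (the threshold value 2 is illustrative):
df.dropna(thresh=2)           # keep rows having at least 2 non-NaN values
df.dropna(thresh=2, axis=1)   # keep columns having at least 2 non-NaN values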
Example-1
import pandas as pd
# Sample data with missing values
data = {
'A': [1, 2, None, 4, 5],
'B': ['a', 'b', 'c', None, 'e'],
'C': [10.5, None, 30.2, 40.1, None]
}
# Creating a DataFrame
df = pd.DataFrame(data)
# Dropping rows with any missing values
df_dropped = df.dropna()
print("DataFrame after dropping rows with any missing values:")
print(df_dropped)
Example-2
import pandas as pd
# Sample data with missing values
data = {
'A': [1, 2, None, 4, 5],
'B': ['a', 'b', 'c', None, 'e'],
'C': [10.5, None, 30.2, 40.1, None]
}
# Creating a DataFrame
df = pd.DataFrame(data)
# Dropping rows where 'B' column has missing values
df_dropped_b = df.dropna(subset=['B'])
print("DataFrame after dropping rows with missing values in 'B' column:")
print(df_dropped_b)
Example-3
import pandas as pd
# Sample data with missing values
data = {
'A': [1, 2, None, 4, 5],
'B': ['a', 'b', 'c', None, 'e'],
'C': [10.5, None, 30.2, 40.1, None]
}
# Creating a DataFrame
df = pd.DataFrame(data)
# Dropping columns with any missing values
df_dropped_columns = df.dropna(axis=1)
print("DataFrame after dropping columns with any missing values:")
print(df_dropped_columns)
Mathematical operations with NaN
Consider the following example:
import numpy as np
import pandas as pd
ar1 = np.array([100, 200, np.nan, 300])
ser1 = pd.Series(ar1)
ar1.mean(), ser1.mean()
The output of the preceding code is the following:
(nan, 200.0)
• When a NumPy function encounters NaN values, it returns NaN.
• Pandas, on the other hand, ignores the NaN values and moves ahead with processing: when performing the sum operation, NaN is treated as 0, and operations such as mean return NaN only when all the values are NaN.
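A small sketch illustrating this behaviour with the series defined above:
ser1.sum()    # 600.0: the NaN is skipped, i.e. treated as contributing 0
ser1.mean()   # 200.0: pandas ignores the NaN, while ar1.mean() gives nan
pd.Series([np.nan, np.nan]).mean()   # nan: all values are missing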
Exercise-6
Scenario:
You are analyzing a dataset containing information about employees in a company. However, the
dataset is not entirely clean and has missing values in various columns due to data entry errors and
incomplete records. Your task is to preprocess the data by handling missing values using Pandas'
dropna() method.
Questions:
1. How would you drop all rows with any missing values from the entire dataset using Pandas?
2. If the 'Salary' column is critical for analysis, how can you drop only the rows where 'Salary' values are missing while preserving the rest of the dataset?
3. In some columns, missing values are acceptable up to a certain limit. How would you drop rows with more than two missing values across any column?
4. For the 'Department' column, there are missing values that can't be imputed. How can you drop rows where 'Department' is missing while keeping rows with other missing values intact?
5. Considering the 'Address' and 'Phone' columns have a significant number of missing values, how can you drop these columns entirely from the dataset?
6. If you want to drop rows where 'Years of Experience' are missing and 'Salary' is also missing or zero, how would you achieve this?
Dataset
Filling Missing values
We can use the fillna() method to replace NaN values with any particular
values.
Example:
filledDf = dfx.fillna(0)
filledDf
# Fill missing values with a specific constant
df['Column_Name'].fillna(value=0, inplace=True)
# Fill missing values with the mean of the column
df['Column_Name'].fillna(df['Column_Name'].mean(), inplace=True)
# Fill missing values with the most frequent value in the column
df['Column_Name'].fillna(df['Column_Name'].mode()[0], inplace=True)
Example
import pandas as pd
import numpy as np
data = {
'A': [1, 2, np.nan, 4, 5], 'B': ['a', np.nan, 'c', 'd', 'e'], 'C': [10, 20, np.nan, np.nan, 50],
'D': [np.nan, 'Yes', 'No', 'Yes', np.nan]
}
df = pd.DataFrame(data)
# 1. Fill missing numerical values with the mean of the column
df_filled_mean = df.copy()
df_filled_mean['A'].fillna(df_filled_mean['A'].mean(), inplace=True)
df_filled_mean['C'].fillna(df_filled_mean['C'].mean(), inplace=True)
# 2. Fill missing categorical values with the most frequent value (mode)
df_filled_mode = df.copy()
df_filled_mode['B'].fillna(df_filled_mode['B'].mode()[0], inplace=True)
df_filled_mode['D'].fillna(df_filled_mode['D'].mode()[0], inplace=True)
print("DataFrame with missing values:") print(df)
print("\nDataFrame after filling missing numerical values with mean:") print(df_filled_mean)
print("\nDataFrame after filling missing categorical values with mode:") print(df_filled_mode)
Output
Backward and Forward Filling
NaN values can be filled based on the last known values.
# Forward fill missing values with the previous valid value
df.fillna(method='ffill', inplace=True)
# Backward fill missing values with the next valid value
df.fillna(method='bfill', inplace=True)
Example
import pandas as pd
import numpy as np
# Sample data with missing values
data = {
'A': [1, 2, np.nan, 4, 5], 'B': ['a', np.nan, 'c', 'd', 'e'], 'C': [10, 20, np.nan, np.nan, 50],
'D': [np.nan, 'Yes', 'No', 'Yes', np.nan]
}
df = pd.DataFrame(data)
# 1. Forward fill missing values in categorical columns
df_ffill = df.copy()
df_ffill['B'].fillna(method='ffill', inplace=True)
df_ffill['D'].fillna(method='ffill', inplace=True)
# 2. Backward fill missing values in numerical columns
df_bfill = df.copy()
df_bfill['A'].fillna(method='bfill', inplace=True)
df_bfill['C'].fillna(method='bfill', inplace=True)
print("DataFrame with missing values:") print(df)
print("\nDataFrame after forward filling missing categorical values:") print(df_ffill)
print("\nDataFrame after backward filling missing numerical values:") print(df_bfill)
Output
Interpolating missing value
The pandas library provides the interpolate() function for both Series and DataFrames.
By default, it performs a linear interpolation of the missing values.
ser3 = pd.Series([100, np.nan, np.nan, np.nan, 292])
ser3.interpolate()
# Use linear interpolation to fill missing values
df['Column_Name'] = df['Column_Name'].interpolate(method='linear')
Example
import pandas as pd
import numpy as np
# Sample data with missing values
data = {
'Date': pd.date_range(start='2023-01-01', periods=10),
'Value': [10, 20, np.nan, 40, 50, np.nan, np.nan, 80, 90, np.nan]
}
# Creating a DataFrame
df = pd.DataFrame(data)
# Interpolating missing values in the 'Value' column using linear interpolation
df['Value_interpolated'] = df['Value'].interpolate(method='linear')
# Displaying the original and interpolated DataFrames
print("Original DataFrame:")
print(df)
Using Mask for missing values
Creating a Mask : Once you've identified the missing values, create a boolean mask that indicates the positions of
these missing values in your dataset.
For example, in Pandas, you can use functions like isnull() or notnull() to create a boolean mask.
import pandas as pd
import numpy as np
data = pd.DataFrame({'A': [1, 2, np.nan, 4],
'B': [5, np.nan, 7, 8]})
# Creating a mask for missing values
mask = data.isnull()
Using Mask
• Use the mask to perform specific operations:
Filling missing values with a certain value
Dropping rows or columns with missing values
Performing computations only on non-missing values.
# Filling missing values with a specific number
filled_data = data.fillna(0)
# Dropping rows with any missing values
cleaned_data = data.dropna()
# Performing operations only on non-missing values
mean_without_missing = data.mean(skipna=True)
# Create a mask to identify missing values in a column
mask = df['Column_Name'].isnull()
# Use the mask to perform operations or replacements on missing values
df.loc[mask, 'Column_Name'] = 'Missing' # For instance, replace missing values with a string 'Missing'
Applying the Mask
Apply the mask to filter out missing values or perform operations
selectively on non-missing values.
# Applying the mask to get non-missing values
non_missing_values = data[mask == False]
# Applying mask to perform operations selectively
sum_without_missing = data[mask == False].sum()
Scenario-1
Handling missing values can vary based on the nature of the
data and the context of the analysis.
Scenario 1: Numerical Data
Approach 1: Imputation
Mean/Median/Mode Imputation: Replace missing values in
numerical columns with the mean, median, or mode of the
non-missing values in that column.
# Impute missing values with mean
data['column'].fillna(data['column'].mean(), inplace=True)
Scenario-2
Scenario 2: Categorical Data
Approach 1: Mode Imputation
Replace missing categorical values with the mode (most frequent) value in
that column.
# Impute missing categorical values with mode
data['column'].fillna(data['column'].mode()[0], inplace=True)
Approach 2: Create a New Category
If the missing values represent a category on their own, you might encode
missing values as a separate category.
# Replace missing categorical values with 'Unknown'
data['column'].fillna('Unknown', inplace=True)
Scenario-3
Scenario 3: Time Series Data
Approach 1: Forward or Backward Fill
Use the previous known value (forward fill) or the next known value (backward fill) to fill missing values in a time series.
# Forward fill missing values
data['column'].fillna(method='ffill', inplace=True)
Approach 2: Interpolation
Interpolate missing values based on existing data points. For time series,
you might use linear interpolation.
# Linear interpolation
data['column'] = data['column'].interpolate(method='linear')
Renaming axis indexes
Renaming axis indexes is a common operation in pandas when you want
to change the labels or names of rows or columns.
Renaming Row Indexes (Axis=0)
import pandas as pd
data = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=['X', 'Y', 'Z'])
data.rename(index={'X': 'New_X'}, inplace=True)
data.rename(index={'X': 'New_X', 'Y': 'New_Y'}, inplace=True)
Renaming Column Indexes (Axis=1)
data.rename(columns={'A': 'Column_A', 'B': 'Column_B'}, inplace=True)
Scenario
Let's consider a scenario where you have a dataset representing sales data for different products
in a store. The dataset contains sales figures indexed by the product names, and you want to
rename some products and update the column names to better reflect the data.
import pandas as pd
# Creating sample sales data
sales_data = pd.DataFrame({
'Old_Product_A': [100, 120, 90],
'Old_Product_B': [80, 60, 70],
'Old_Product_C': [50, 40, 60]
}, index=['Jan', 'Feb', 'Mar'])
# Renaming indexes (products) and columns
sales_data.rename(index={'Jan': 'January', 'Feb': 'February', 'Mar': 'March'}, inplace=True)
sales_data.rename(columns={'Old_Product_A': 'Product_X', 'Old_Product_B': 'Product_Y',
'Old_Product_C': 'Product_Z'}, inplace=True)
print(sales_data)
Discretization and Binning
Discretization:
Discretization involves dividing a continuous variable into intervals or bins. There are
mainly two types:
Equal Width Discretization:
Divides the range of values into equally spaced intervals.
Example:
import pandas as pd
data = pd.DataFrame({'A': [10, 15, 18, 20, 31, 34, 41, 46, 51, 53, 54]})
bins = pd.cut(data['A'], bins=3) # Divides data into 3 equal-width bins
Equal Frequency Discretization (Quantile-Based):
Divides the data into intervals with approximately the same number of observations in
each bin.
Example:
bins = pd.qcut(data['A'], q=3) # Divides data into 3 quantile-based bins
Equal Width Discretization
Dataset points: 10, 15, 18, 20, 31, 34, 41, 46, 51, 53, 54
Step 1: Sort the dataset in ascending order: 10, 15, 18, 20, 31, 34, 41, 46, 51,
53, 54.
Step 2: Determine the width of each bin using the formula w = (max - min) / N, where N is the number of bins.
With N = 4 bins, w = (54 - 10) / 4 = 11.
BIN 1 : [lower bound , upper bound] = [(min) , (min + w -1)] = [10, 20]
BIN 2 : [lower bound , upper bound] = [(min + w) , (min + 2w -1)] = [21, 31]
BIN 3 : [lower bound , upper bound] = [(min + 2w) , (min + 3w -1)] = [32, 42]
BIN 4 : [lower bound , upper bound] = [(min + 3w) , (max)] = [43, 54]
BIN 1 : [10, 15, 18, 20]
BIN 2 : [31]
BIN 3 : [34, 41]
BIN 4 : [46, 51, 53, 54]
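A small sketch checking the same grouping with pandas (pd.cut with bins=4 builds four equal-width intervals of width 11 over this data):
import pandas as pd
data = pd.DataFrame({'A': [10, 15, 18, 20, 31, 34, 41, 46, 51, 53, 54]})
bins = pd.cut(data['A'], bins=4)
print(bins.value_counts().sort_index())   # expected counts per bin: 4, 1, 2, 4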
Equal Frequency Discretization
Dataset: 10,15, 18, 20, 31, 34, 41, 46, 51, 53, 54, 60
Step 1: Sort the data.
Step 2: Find the frequency per bin: frequency = total number of data points / number of bins.
The total number of data points is 12 and the number of bins required is 3, so each bin holds 12 / 3 = 4 values.
BIN 1: 10, 15, 18, 20
BIN 2: 31, 34, 41, 46
BIN 3: 51, 53, 54, 60
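A small sketch checking the same grouping with pandas (pd.qcut splits the data into three quantile-based bins of roughly equal size):
import pandas as pd
values = pd.Series([10, 15, 18, 20, 31, 34, 41, 46, 51, 53, 54, 60])
bins = pd.qcut(values, q=3)
print(bins.value_counts().sort_index())   # expected counts per bin: 4, 4, 4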
Binning
Binning involves assigning continuous values to specific bins or categories based on certain criteria.
This can be applied to group data or simplify its representation.
Example:
Suppose you have a list of ages and want to categorize them into different age groups:
import pandas as pd
ages = [22, 35, 47, 50, 28, 19, 65, 37, 42, 51]
# Define bins and labels for age groups
bins = [18, 30, 40, 50, 60, 100]
labels = ['18-29', '30-39', '40-49', '50-59', '60+']
# Create age groups based on defined bins and labels
age_groups = pd.cut(ages, bins=bins, labels=labels)
# Add age groups to the original data
data = pd.DataFrame({'Age': ages, 'AgeGroup': age_groups})
print(data)
Exercise-7
Scenario: You're working on a dataset containing information
about housing prices in a city. One of the features in your
dataset is the square footage of the houses. You've been
tasked with binning the square footage data into categories
to better understand how the size of a house relates to its
price.
Outlier detection and filtering
• Outlier detection and filtering are important steps in
data preprocessing.
• Identify and handle anomalous or extreme values
that can skew statistical analyses or machine learning
models.
• Several methods can be employed to detect outliers.
• Filtering can involve handling them by removing,
capping, or transforming them.
Outlier Detection Methods
• Statistical Methods:
• Z-Score: Identifying values far from the mean in terms of standard deviations.
• IQR (Interquartile Range): Using quartiles to identify values outside a specific range.
• Visualization Techniques:
• Boxplots: Visual representation of the data's distribution to spot outliers.
• Scatterplots: Visualizing relationships and identifying points distant from others.
• Machine Learning Methods:
• Clustering Algorithms: Detecting data points that don’t cluster well with others.
• Isolation Forest, Local Outlier Factor (LOF): Algorithms specifically designed for outlier detection.
Filtering Outliers
• Removal:
• Deleting the identified outlier data points from the dataset.
• Capping or Flooring:
• Setting a threshold and capping or flooring outlier values to a
specific limit.
• Transformations:
• Applying mathematical transformations (log, square root) to
reduce the impact of outliers.
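The worked example that follows uses the Z-score approach; as a complementary hedged sketch, capping outliers with the IQR rule might look like this (the sample values and the conventional 1.5 multiplier are illustrative assumptions):
import pandas as pd
# Hypothetical numeric column with one extreme value
prices = pd.Series([100, 120, 130, 110, 115, 900])
q1, q3 = prices.quantile(0.25), prices.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
# Capping/flooring outliers to the IQR limits instead of removing them
capped = prices.clip(lower=lower, upper=upper)
print(capped)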
Example
import pandas as pd
import numpy as np
# Load housing price data
housing_data = pd.read_csv('housing_data.csv')
# Calculate Z-Score for 'Price' column
mean_price = np.mean(housing_data['Price'])
std_price = np.std(housing_data['Price'])
z_scores = (housing_data['Price'] - mean_price) / std_price
# Detect and filter outliers (considering a Z-Score threshold of 3)
outlier_threshold = 3
filtered_data = housing_data[abs(z_scores) < outlier_threshold]
# Explore filtered data
print(filtered_data.head())
Permutation and random sampling
• Permutation refers to rearranging or shuffling the order of a set of
elements.
• In statistics, permutation tests involve creating new arrangements of
the data to assess the likelihood of obtaining the observed results by
chance.
Example:
import numpy as np
# Original data
data = np.array([10, 20, 30, 40, 50])
# Permute the data (shuffle the order)
permuted_data = np.random.permutation(data)
print(permuted_data)
[30 40 10 50 20]
Permutation and random sampling
• Random sampling involves selecting a subset of data points from a
larger dataset.
• Each data point has an equal probability of being chosen.
• It's used to create representative samples for analysis without bias.
• Example:
import pandas as pd
# Original dataset
dataset = pd.read_csv('data.csv')
# Random sampling of 30% of the data
sampled_data = dataset.sample(frac=0.3, random_state=42)
print(sampled_data)
Random sampling without replacement
It involves selecting a subset of items from a larger set without allowing for duplicates
in the selected subset.
Example:
import pandas as pd
# Original dataset (a DataFrame)
data = pd.DataFrame({'Values': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]})
# Random sampling without replacement using Pandas
sampled_data = data['Values'].sample(n=5, replace=False)
print(sampled_data)
Example
Consider a dataset of student grades:
import pandas as pd
import numpy as np
# Create a DataFrame with student grades
data = pd.DataFrame({
'Student_ID': np.arange(1, 21), # IDs for 20 students
'Grade': np.random.randint(60, 100, size=20) # Random grades between 60 and 100
})
print("Original Data:")
print(data)
# Random sampling without replacement
sampled_data = data['Grade'].sample(n=5, replace=False)
print("\nSampled Data:")
print(sampled_data)
Random sampling with replacement
It involves selecting items from a dataset where each selection is made
independently and with the possibility of selecting the same item multiple
times.
Example:
import pandas as pd
import numpy as np
# Create a DataFrame with student grades
data = pd.DataFrame({
'Student_ID': np.arange(1, 21), # IDs for 20 students
'Grade': np.random.randint(60, 100, size=20) # Random grades between 60 and 100
})
print("Original Data:")
print(data)
# Random sampling with replacement
sampled_data = data['Grade'].sample(n=5, replace=True, random_state=42)
print("\nSampled Data:")
print(sampled_data)
Exercise-8
You are given a dataset containing the daily sales figures of a retail store for the past
month. The dataset comprises 30 entries, each representing the sales amount for a
particular day.
Use pandas library to perform the following:
Create dataset
Perform random sampling with replacement to select a subset of 8 days' sales figures.
Display the sampled data.
Perform random sampling without replacement to select a subset of 8 days' sales
figures. Display the sampled data.
Benefits of data transformation
Handling Skewed Distributions
Improving Model Performance
Managing Missing Values
Better Visualization
Noise Reduction
Outlier Handling