
Data Exploration & Visualization
Unit-2

Dr. SELVA KUMAR S
Assistant Professor, Dept. of CSE

Reference: Big Data and Analytics by Seema Acharya and Subhashini Chellappan, Copyright 2015, WILEY INDIA PVT. LTD.
Agenda

• Transformation Techniques
• Performing data deduplication
• Replacing values
• Discretization and binning
• Introduction to Missing data
• Handling missing data
• Traditional methods
• Maximum Likelihood Estimation

Prerequisites

Pandas
NumPy
Seaborn
Matplotlib

https://www.youtube.com/watch?v=kjXgYBiOzc4

Data Transformation

Data transformation is a set of techniques used to convert data from one format or structure to another format or structure.
Transformation Activities

• Data deduplication
• Key restructuring
• Data cleansing
• Data validation
• Format revisioning
• Data derivation
• Data aggregation
• Data integration
• Data filtering
• Data joining
Data deduplication

Original Data:

Deduplication Process:
1. Identification of Duplicates: Identify duplicates based on specific criteria, such as matching email addresses or phone numbers.
2. Removal of Duplicates: After identification, choose which records to keep and which ones to remove.

Deduplicated Data:
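A minimal sketch of both steps, assuming a small customer table where duplicates are identified by matching email addresses (the table and column names are illustrative):

import pandas as pd

# Illustrative customer records; rows 0 and 1 refer to the same person (same email)
customers = pd.DataFrame({
    'Name':  ['Asha', 'Asha R.', 'Ben'],
    'Email': ['asha@example.com', 'asha@example.com', 'ben@example.com'],
})

# 1. Identification: flag duplicates based on the Email column
print(customers.duplicated(subset=['Email']))

# 2. Removal: keep the first occurrence of each email
deduplicated = customers.drop_duplicates(subset=['Email'], keep='first')
print(deduplicated)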

Key restructuring

Key restructuring involves transforming keys with built-in meanings into generic keys.

This restructuring makes the keys more generic and applicable to a wider range of data or entities, rather than being specifically tied to products.

It allows for easier reuse and scalability if the dataset expands beyond just products to include other types of entities.

Example:
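A hedged sketch of the idea: product-specific keys such as 'PRD-00042' are replaced with generic surrogate keys that any entity type could use (the column and key names are assumptions):

import pandas as pd

records = pd.DataFrame({
    'Product_Key': ['PRD-00042', 'PRD-00043', 'PRD-00044'],
    'Name': ['Keyboard', 'Mouse', 'Monitor'],
})

# Replace the product-specific key with a generic surrogate key plus an entity type
records['Entity_ID'] = range(1, len(records) + 1)
records['Entity_Type'] = 'product'
restructured = records.drop(columns=['Product_Key'])
print(restructured)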

Data cleansing

Data cleansing involves extracting words and deleting out-of-date, inaccurate, and incomplete information from the source data without altering its meaning.

Original Data:

Data Cleansing Process:
1. Extracting Words: First, extract words from the product descriptions to remove spelling errors and make the text more uniform.
2. Deleting Outdated, Inaccurate, and Incomplete Information: Remove outdated, inaccurate, or incomplete information without altering the intended meaning.

Cleansed Data:
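A minimal sketch of both cleansing steps on product descriptions, assuming a Description column and a simple spelling fix (both are illustrative):

import pandas as pd

products = pd.DataFrame({
    'Product': ['Laptop', 'Phone'],
    'Description': ['  Fast laptp with 16GB RAM ', 'OLD MODEL - discontinued phone'],
})

# 1. Extract/normalise words: trim whitespace, lower-case, fix a known typo
products['Description'] = (products['Description']
                           .str.strip()
                           .str.lower()
                           .str.replace('laptp', 'laptop', regex=False))

# 2. Delete out-of-date records (here, anything marked "discontinued")
cleansed = products[~products['Description'].str.contains('discontinued')]
print(cleansed)
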
Data validation

Data validation is the process of formulating rules or algorithms that help in validating different types of data against some known issues.

Data Validation Rules:
1. Name Field:
   • Rule: The name should contain only alphabetic characters and spaces.
   • Example: John Doe, Alice Smith
2. Email Field:
   • Rule: Email should follow a standard email format.
   • Example: [email protected], [email protected]
3. Phone Number Field:
   • Rule: Phone number should be in a specific format (e.g., ###-###-####).
   • Example: 123-456-7890, 987-654-3210
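A small sketch of these three rules expressed as regular-expression checks in pandas (the column names and patterns are assumptions, not a complete validator):

import pandas as pd

df = pd.DataFrame({
    'Name':  ['John Doe', 'Al1ce'],
    'Email': ['john@example.com', 'not-an-email'],
    'Phone': ['123-456-7890', '98765'],
})

# One boolean Series per rule
valid_name  = df['Name'].str.fullmatch(r'[A-Za-z ]+')
valid_email = df['Email'].str.fullmatch(r'[^@\s]+@[^@\s]+\.[^@\s]+')
valid_phone = df['Phone'].str.fullmatch(r'\d{3}-\d{3}-\d{4}')

# A record is valid only if it passes all three rules
df['Is_Valid'] = valid_name & valid_email & valid_phone
print(df)
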
Format revisioning

Format revisioning involves converting from one format to another.

Format Revisioning Process:
1. Identifying Old Format Files: Locate the existing text documents in the older "DOC" format that need to be revised or updated to the newer format.
2. Converting to New Format:
   • Use software or tools capable of batch conversion to convert all "DOC" files to the "DOCX" format.
   • Microsoft Word, for instance, allows for batch conversion where multiple "DOC" files can be opened and saved as "DOCX" files.
Data derivation

Data derivation consists of creating a set of rules to generate more information from the data source.

Example (see the sketch below):
1. Creating a New Column - Total Sales: Units_Sold * Revenue
2. Deriving Monthly Sales Data: Sum the Total_Sales
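A minimal sketch of both derivation rules, assuming columns Units_Sold, Revenue (read as per-unit revenue, following the slide's formula) and a Date column:

import pandas as pd

sales = pd.DataFrame({
    'Date': pd.to_datetime(['2023-01-05', '2023-01-20', '2023-02-10']),
    'Units_Sold': [10, 5, 8],
    'Revenue': [20.0, 20.0, 25.0],   # per-unit revenue (assumption)
})

# 1. Derive a new column: Total_Sales = Units_Sold * Revenue
sales['Total_Sales'] = sales['Units_Sold'] * sales['Revenue']

# 2. Derive monthly sales by summing Total_Sales per month
monthly_sales = sales.groupby(sales['Date'].dt.to_period('M'))['Total_Sales'].sum()
print(monthly_sales)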


Data aggregation

Data aggregation involves searching, extracting, summarizing, and preserving important information in different types of reporting systems.

Example (see the sketch below):
1. Summarizing Daily Sales:
2. Calculating Monthly Revenue:
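A hedged sketch of both aggregations, assuming a transactions table with Timestamp and Price columns:

import pandas as pd

transactions = pd.DataFrame({
    'Timestamp': pd.to_datetime(['2023-01-01 09:10', '2023-01-01 14:30', '2023-01-02 11:00']),
    'Price': [250.0, 120.0, 300.0],
})

# 1. Summarizing daily sales
daily_sales = transactions.groupby(transactions['Timestamp'].dt.date)['Price'].sum()

# 2. Calculating monthly revenue
monthly_revenue = transactions.groupby(transactions['Timestamp'].dt.to_period('M'))['Price'].sum()

print(daily_sales)
print(monthly_revenue)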


Data integration

Data integration involves converting different data types and merging them into a common structure or schema.

Example:
Suppose a retail company has data stored in various formats and locations:

• Sales Data: Stored in a SQL database with columns like Transaction_ID, Product_ID, Quantity, Price, and Timestamp.
• Customer Data: Stored in a CSV file containing customer information such as Customer_ID, Name, Email, and Phone_Number.
• Inventory Data: Stored in an Excel sheet with details like Product_ID, Product_Name, Stock_Quantity, and Supplier.
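A hedged sketch of bringing such sources into one schema with pandas; the database file, table name, and file names are placeholders:

import pandas as pd
import sqlite3

# Load each source in its native format (paths and table names are placeholders)
conn = sqlite3.connect('store.db')
sales = pd.read_sql('SELECT * FROM sales', conn)      # SQL database
customers = pd.read_csv('customers.csv')              # CSV file
inventory = pd.read_excel('inventory.xlsx')           # Excel sheet

# Merge sales and inventory into a common schema keyed on Product_ID;
# customer data would be merged the same way on whatever key the sales table shares with it
combined = sales.merge(inventory, on='Product_ID', how='left')
print(combined.head())
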
Data filtering

Data filtering involves identifying information relevant to a particular user.

Example:
Imagine a scenario where a user wants to find movies or TV shows based on specific criteria, such as genre, release year, and rating.
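A small sketch of that filter using boolean indexing (the catalogue DataFrame, its columns, and the thresholds are assumptions):

import pandas as pd

catalogue = pd.DataFrame({
    'Title': ['Movie A', 'Movie B', 'Show C'],
    'Genre': ['Sci-Fi', 'Drama', 'Sci-Fi'],
    'Year': [2021, 2015, 2019],
    'Rating': [8.1, 6.9, 7.5],
})

# Keep only Sci-Fi titles released after 2018 with a rating of at least 7.5
relevant = catalogue[(catalogue['Genre'] == 'Sci-Fi') &
                     (catalogue['Year'] > 2018) &
                     (catalogue['Rating'] >= 7.5)]
print(relevant)
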
Data joining

Data joining involves establishing a relationship between two or more tables.

Example:

Result of Data Joining:


Pandas dataframes

Pandas offers different functionalities for combining DataFrames:

• Append
• Concat
• Merge
• Join
When to use each Function?

• append: Use when adding rows from one DataFrame to another.
• concat: Use for combining DataFrames along rows or columns based on axes.
• merge: Use for performing SQL-like joins based on common columns.
• join: Use for combining DataFrames based on index or specific columns.
Append function

• Usage: Appending rows from one DataFrame to another.
• Note: DataFrame.append() was deprecated and removed in pandas 2.0; pd.concat([df1, df2]) is the recommended replacement.

• Example:

import pandas as pd

# Creating DataFrames
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})

# Appending df2 to df1 (on pandas >= 2.0, use pd.concat([df1, df2]) instead)
df_appended = df1.append(df2)
print(df_appended)
Concat Function

• Usage: Concatenating DataFrames along rows or columns based on axes.

• Example:

import pandas as pd

# Creating DataFrames
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})

# Concatenating along rows (axis=0)
df_concatenated = pd.concat([df1, df2], axis=0)
print("Concatenated DataFrame:")
print(df_concatenated)
Merge Function

• Usage: Performing database-style joins by merging DataFrames based on key columns.

• Example:

import pandas as pd

# Creating DataFrames
df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'ID': [2, 3, 4], 'Age': [25, 30, 28]})

# Merging based on 'ID' column
merged_df = pd.merge(df1, df2, on='ID', how='inner')
print("Merged DataFrame:")
print(merged_df)
Join Function

• Usage: Joining DataFrames based on their index or a specific column.

• Example:

import pandas as pd

# Creating DataFrames
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]}, index=['X', 'Y'])
df2 = pd.DataFrame({'C': [5, 6], 'D': [7, 8]}, index=['X', 'Y'])

# Joining based on index
joined_df = df1.join(df2)
print("Joined DataFrame:")
print(joined_df)
Guess the output??

import pandas as pd

df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'C': [7, 8]})

df_appended = df1.append(df2, sort=False)  # sort=False to handle different columns
print("Appended DataFrame:")
print(df_appended)
Answer

A B C
0 1 3.0 NaN
1 2 4.0 NaN
0 5 NaN 7.0
1 6 NaN 8.0
Guess the output?

import pandas as pd

# Creating DataFrames with unequal columns
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'C': [7, 8]})

# Concatenating along columns (axis=1) with unequal columns
df_concatenated = pd.concat([df1, df2], axis=1, sort=False)
print("Concatenated DataFrame:")
print(df_concatenated)
Answer

A B A C
0 1 3 5 7
1 2 4 6 8
Guess the output?

import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2], 'Name': ['Alice', 'Bob']})
df2 = pd.DataFrame({'ID': [1, 2], 'Age': [25, 30], 'Location': ['NY', 'CA']})

# Merging based on 'ID' column
merged_df = pd.merge(df1, df2, on='ID', how='inner')
print("Merged DataFrame:")
print(merged_df)
Answer

ID Name Age Location


0 1 Alice 25 NY
1 2 Bob 30 CA
Guess the output?

import pandas as pd

# Creating DataFrames
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]}, index=['X', 'Y'])
df2 = pd.DataFrame({'C': [5, 6], 'D': [7, 8]}, index=['X', 'Y'])

# Joining based on index


joined_df = df1.join(df2, how='outer')
print("Joined DataFrame:")
print(joined_df)
Answer

A B C D
X 1 3 5 7
Y 2 4 6 8
Exercise-1

Use Case: Sales Data Analysis

Scenario:
Suppose you have two datasets related to sales:
• sales_data_1 containing sales figures for various products in different regions.
• sales_data_2 containing additional sales information for specific products.

Tasks:
• Append Data: Combine both datasets vertically to create a comprehensive sales dataset.
• Concatenate Data: Concatenate the sales data along columns for a more detailed analysis.
• Merge Data: Merge the datasets based on a common key (product ID) to enrich the sales information.
• Join Data: Join the datasets based on the product names to analyze sales by product names.
Transformation Techniques
Performing data deduplication

• Data deduplication involves removing duplicate rows from a dataset to maintain data integrity and accuracy.

• Example:

import pandas as pd

# Sample data with duplicates
data = {
    'Order_ID': [101, 102, 101, 103, 104, 102],
    'Product': ['A', 'B', 'A', 'C', 'D', 'B'],
    'Quantity': [5, 3, 4, 2, 6, 3],
    'Amount': [100, 80, 100, 120, 150, 80]
}

sales_data = pd.DataFrame(data)
print("Original Data:")
print(sales_data)
Deduplication

• Deduplication using the drop_duplicates() method in Pandas:

# Removing duplicates based on all columns
deduplicated_data = sales_data.drop_duplicates()

print("\nDeduplicated Data:")
print(deduplicated_data)
Deduplication

# Removing duplicates based on specific columns
deduplicated_data_subset = sales_data.drop_duplicates(subset=['Order_ID', 'Product'])

print("\nDeduplicated Data (Subset):")
print(deduplicated_data_subset)
Guess the output?

frame3 = pd.DataFrame({'column 1': ['Looping'] * 3 + ['Functions'] * 4,
                       'column 2': [10, 10, 22, 23, 23, 24, 24]})
frame3

frame3.duplicated()
Answer
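duplicated() marks a row as True only from its second occurrence onward, so the expected output is:

0    False
1     True
2    False
3    False
4     True
5    False
6     True
dtype: bool
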
Exercise-2

Scenario: You are working as a data analyst for an e-commerce company that has a massive dataset containing customer information. The dataset has accumulated duplicate entries due to various reasons like multiple registrations, system errors, or data entry mistakes. Your task is to use the Pandas library in Python to identify and remove these duplicate entries efficiently.
Exercise-3

Scenario: You're analyzing a sales dataset from a retail store chain that contains transactions from multiple branches. Due to system errors and data integration issues, the dataset has accumulated duplicate records of sales transactions. Your task is to use Pandas to identify these duplicates based on specific columns and retain only the unique transactions.
Replacing Values

• Use the replace() method:

import pandas as pd
import numpy as np

replaceFrame = pd.DataFrame({'column 1': [200., 3000., -786., 3000., 234., 444., -786., 332., 3332.],
                             'column 2': range(9)})

replaceFrame.replace(to_replace=-786, value=np.nan)

• Replace multiple values:
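A minimal sketch of replacing several values in one call, assuming the same replaceFrame; replace() accepts parallel lists of old and new values, or a dict mapping old to new:

# Replace -786 with NaN and 0 with 2 in a single call (list form)
replaceFrame.replace(to_replace=[-786, 0], value=[np.nan, 2])

# Equivalent dict form
replaceFrame.replace({-786: np.nan, 0: 2})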


Exercise-4

Scenario:
You're working with a dataset that contains information about
product sales. However, you've noticed that certain entries
have inconsistent or incorrect values in the 'Category' column.
You need to standardize these categories by replacing specific
values with a unified category name using Pandas' replace()
method.

data = {
'Product_ID': [1, 2, 3, 4, 5],
'Product_Name': ['Apple', 'Banana', 'Orange', 'Pear', 'Grapes'],
'Category': ['Fruit', 'Fruit', 'Fruit', 'Veggie', 'Fruit']
}
Handling Missing Data

• NaN value is used to indicate that there is no value.
• There could be several reasons why a value could be NaN:
  • It can happen when data is retrieved from an external source and there are some incomplete values in the dataset.
  • It can also happen when we join two different datasets and some values are not matched.
  • Missing values due to data collection errors.
  • When the shape of data changes, there are new additional rows or columns that are not determined.
  • Reindexing of data can result in incomplete data.
Example

data = np.arange(15, 30).reshape(5, 3)
dfx = pd.DataFrame(data, index=['apple', 'banana', 'kiwi', 'grapes', 'mango'],
                   columns=['store1', 'store2', 'store3'])
dfx

dfx['store4'] = np.nan
dfx.loc['watermelon'] = np.arange(15, 19)
dfx.loc['oranges'] = np.nan
dfx['store5'] = np.nan
dfx['store4']['apple'] = 20.  # chained assignment; dfx.loc['apple', 'store4'] = 20. is the safer form
dfx
NaN values in Pandas Objects

• dfx.isnull()              # element-wise True where a value is NaN
• dfx.notnull()             # element-wise True where a value is present
• dfx.isnull().sum()        # number of NaN values per column
• dfx.isnull().sum().sum()  # total number of NaN values in the DataFrame
• dfx.count()               # number of non-NaN values per column
Example

import pandas as pd

# Load your dataset into a DataFrame


df = pd.read_csv('your_dataset.csv')

# Check for missing values in the entire DataFrame


print(df.isnull().sum())

# Check for missing values in a specific column


print(df['Column_Name'].isnull().sum())
Exercise-5

Scenario: You've received a dataset containing information about students' performance, but the dataset has missing values across multiple columns due to various reasons. Your task is to analyze and handle these missing values using Pandas' methods.

Questions:
1. How many missing values are there in the 'Grade' column of the dataset?
2. What is the total count of missing values in the entire dataset?
3. Can you identify and display rows where 'Absenteeism' values are missing?
4. How many rows have complete data (non-missing) in the 'Study Hours' column?
5. What is the sum of missing values across each column in the dataset?
6. How many missing values are there in each column of the dataset?
Dropping missing values

• One of the ways to handle missing values is to simply remove them from the dataset.

• The dropna() method just returns a copy of the dataframe with the NaN rows dropped. The original dataframe is not changed unless inplace=True is passed.

• Drop rows with any missing values
df.dropna(inplace=True)  # Use inplace=True to modify the original DataFrame

• Drop rows where specific columns have missing values
df.dropna(subset=['Column_Name'], inplace=True)

• Drop rows where all values in the row are NaN
df.dropna(how='all')

• Drop columns where all values in the column are NaN
df.dropna(how='all', axis=1)

• Drop rows or columns that lack a minimum number of non-NaN values (the thresh parameter)
df.dropna(thresh=2)          # keep only rows with at least 2 non-NaN values
df.dropna(thresh=2, axis=1)  # keep only columns with at least 2 non-NaN values
Example-1

import pandas as pd

# Sample data with missing values
data = {
    'A': [1, 2, None, 4, 5],
    'B': ['a', 'b', 'c', None, 'e'],
    'C': [10.5, None, 30.2, 40.1, None]
}

# Creating a DataFrame
df = pd.DataFrame(data)

# Dropping rows with any missing values
df_dropped = df.dropna()

print("DataFrame after dropping rows with any missing values:")
print(df_dropped)
Example-2

import pandas as pd

# Sample data with missing values
data = {
    'A': [1, 2, None, 4, 5],
    'B': ['a', 'b', 'c', None, 'e'],
    'C': [10.5, None, 30.2, 40.1, None]
}

# Creating a DataFrame
df = pd.DataFrame(data)

# Dropping rows where 'B' column has missing values
df_dropped_b = df.dropna(subset=['B'])

print("DataFrame after dropping rows with missing values in 'B' column:")
print(df_dropped_b)
Example-3

import pandas as pd

# Sample data with missing values
data = {
    'A': [1, 2, None, 4, 5],
    'B': ['a', 'b', 'c', None, 'e'],
    'C': [10.5, None, 30.2, 40.1, None]
}

# Creating a DataFrame
df = pd.DataFrame(data)

# Dropping columns with any missing values
df_dropped_columns = df.dropna(axis=1)

print("DataFrame after dropping columns with any missing values:")
print(df_dropped_columns)
Mathematical operations with NaN

Consider the following example:

ar1 = np.array([100, 200, np.nan, 300])
ser1 = pd.Series(ar1)
ar1.mean(), ser1.mean()

The output of the preceding code is the following:
(nan, 200.0)

• When a NumPy function encounters NaN values, it returns NaN.
• Pandas, on the other hand, ignores the NaN values and moves ahead with processing. When performing the sum operation, NaN is treated as 0. If all the values are NaN, the result is also NaN.
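A small illustration of this behaviour with sum; pandas' skipna flag (True by default) controls whether NaN is skipped or propagated:

import numpy as np
import pandas as pd

ar1 = np.array([100, 200, np.nan, 300])
ser1 = pd.Series(ar1)

print(ar1.sum())                # nan   -- NumPy propagates NaN
print(ser1.sum())               # 600.0 -- pandas skips NaN, so it contributes nothing to the sum
print(ser1.mean(skipna=False))  # nan   -- pandas can also be told to propagate NaN
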
Exercise-6

Scenario:
You are analyzing a dataset containing information about employees in a company. However, the dataset is not entirely clean and has missing values in various columns due to data entry errors and incomplete records. Your task is to preprocess the data by handling missing values using Pandas' dropna() method.

Questions:
1. How would you drop all rows with any missing values from the entire dataset using Pandas?
2. If the 'Salary' column is critical for analysis, how can you drop only the rows where 'Salary' values are missing while preserving the rest of the dataset?
3. In some columns, missing values are acceptable up to a certain limit. How would you drop rows with more than two missing values across any column?
4. For the 'Department' column, there are missing values that can't be imputed. How can you drop rows where 'Department' is missing while keeping rows with other missing values intact?
5. Considering the 'Address' and 'Phone' columns have a significant number of missing values, how can you drop these columns entirely from the dataset?
6. If you want to drop rows where 'Years of Experience' are missing and 'Salary' is also missing or zero, how would you achieve this?
Dataset
Filling Missing values

We can use the fillna() method to replace NaN values with any particular value.

Example:
filledDf = dfx.fillna(0)
filledDf

# Fill missing values with a specific constant
df['Column_Name'].fillna(value=0, inplace=True)

# Fill missing values with the mean of the column
df['Column_Name'].fillna(df['Column_Name'].mean(), inplace=True)

# Fill missing values with the most frequent value in the column
df['Column_Name'].fillna(df['Column_Name'].mode()[0], inplace=True)
Example

import pandas as pd
import numpy as np

data = {
    'A': [1, 2, np.nan, 4, 5],
    'B': ['a', np.nan, 'c', 'd', 'e'],
    'C': [10, 20, np.nan, np.nan, 50],
    'D': [np.nan, 'Yes', 'No', 'Yes', np.nan]
}

df = pd.DataFrame(data)

# 1. Fill missing numerical values with the mean of the column
df_filled_mean = df.copy()
df_filled_mean['A'].fillna(df_filled_mean['A'].mean(), inplace=True)
df_filled_mean['C'].fillna(df_filled_mean['C'].mean(), inplace=True)

# 2. Fill missing categorical values with the most frequent value (mode)
df_filled_mode = df.copy()
df_filled_mode['B'].fillna(df_filled_mode['B'].mode()[0], inplace=True)
df_filled_mode['D'].fillna(df_filled_mode['D'].mode()[0], inplace=True)

print("DataFrame with missing values:")
print(df)

print("\nDataFrame after filling missing numerical values with mean:")
print(df_filled_mean)
print("\nDataFrame after filling missing categorical values with mode:")
print(df_filled_mode)
Output
Backward and Forward Filling

NaN values can be filled based on the last known values.

# Forward fill missing values with the previous valid value
df.fillna(method='ffill', inplace=True)

# Backward fill missing values with the next valid value
df.fillna(method='bfill', inplace=True)

Note: newer pandas versions prefer df.ffill() and df.bfill() over fillna(method=...), which is deprecated.
Example

# Sample data with missing values
data = {
    'A': [1, 2, np.nan, 4, 5],
    'B': ['a', np.nan, 'c', 'd', 'e'],
    'C': [10, 20, np.nan, np.nan, 50],
    'D': [np.nan, 'Yes', 'No', 'Yes', np.nan]
}
df = pd.DataFrame(data)

# 1. Forward fill missing values in categorical columns
df_ffill = df.copy()
df_ffill['B'].fillna(method='ffill', inplace=True)
df_ffill['D'].fillna(method='ffill', inplace=True)

# 2. Backward fill missing values in numerical columns
df_bfill = df.copy()
df_bfill['A'].fillna(method='bfill', inplace=True)
df_bfill['C'].fillna(method='bfill', inplace=True)

print("DataFrame with missing values:")
print(df)

print("\nDataFrame after forward filling missing categorical values:")
print(df_ffill)
print("\nDataFrame after backward filling missing numerical values:")
print(df_bfill)
Output
Interpolating missing value

The pandas library provides the interpolate() function for both Series and DataFrames.

By default, it performs a linear interpolation of the missing values.

ser3 = pd.Series([100, np.nan, np.nan, np.nan, 292])
ser3.interpolate()

# Use linear interpolation to fill missing values
df['Column_Name'] = df['Column_Name'].interpolate(method='linear')
Example

import pandas as pd
import numpy as np

# Sample data with missing values


data = {
'Date': pd.date_range(start='2023-01-01', periods=10),
'Value': [10, 20, np.nan, 40, 50, np.nan, np.nan, 80, 90, np.nan]
}

# Creating a DataFrame
df = pd.DataFrame(data)

# Interpolating missing values in the 'Value' column using linear interpolation


df['Value_interpolated'] = df['Value'].interpolate(method='linear')

# Displaying the original and interpolated DataFrames


print("Original DataFrame:")
print(df)
Using Mask for missing values

Creating a Mask: Once you've identified the missing values, create a boolean mask that indicates the positions of these missing values in your dataset.

For example, in Pandas, you can use functions like isnull() or notnull() to create a boolean mask.

import pandas as pd
import numpy as np

data = pd.DataFrame({'A': [1, 2, np.nan, 4],
                     'B': [5, np.nan, 7, 8]})

# Creating a mask for missing values
mask = data.isnull()
Using Mask

• Use the mask to perform specific operations:
  • Filling missing values with a certain value
  • Dropping rows or columns with missing values
  • Performing computations only on non-missing values

# Filling missing values with a specific number
filled_data = data.fillna(0)

# Dropping rows with any missing values
cleaned_data = data.dropna()

# Performing operations only on non-missing values
mean_without_missing = data.mean(skipna=True)

# Create a mask to identify missing values in a column
mask = df['Column_Name'].isnull()

# Use the mask to perform operations or replacements on missing values
df.loc[mask, 'Column_Name'] = 'Missing'  # For instance, replace missing values with the string 'Missing'
Applying the Mask

Apply the mask to filter out missing values or perform operations selectively on non-missing values.

# Applying the mask to get non-missing values (assuming mask = data.isnull())
non_missing_values = data[~mask]

# Applying the mask to perform operations selectively
sum_without_missing = data[~mask].sum()
Scenario-1

Handling missing values can vary based on the nature of the data and the context of the analysis.

Scenario 1: Numerical Data

Approach 1: Imputation
Mean/Median/Mode Imputation: Replace missing values in numerical columns with the mean, median, or mode of the non-missing values in that column.

# Impute missing values with mean
data['column'].fillna(data['column'].mean(), inplace=True)
Scenario-2

Scenario 2: Categorical Data

Approach 1: Mode Imputation
Replace missing categorical values with the mode (most frequent) value in that column.

# Impute missing categorical values with mode
data['column'].fillna(data['column'].mode()[0], inplace=True)

Approach 2: Create a New Category
If the missing values represent a category on their own, you might encode missing values as a separate category.

# Replace missing categorical values with 'Unknown'
data['column'].fillna('Unknown', inplace=True)
Scenario-3

Scenario 3: Time Series Data

Approach 1: Forward or Backward Fill
Use the previous known value (forward fill) or the next known value (backward fill) to fill missing values in a time series.

# Forward fill missing values
data['column'].fillna(method='ffill', inplace=True)

Approach 2: Interpolation
Interpolate missing values based on existing data points. For time series, you might use linear interpolation.

# Linear interpolation
data['column'] = data['column'].interpolate(method='linear')
Renaming axis indexes
Renaming axis indexes is a common operation in pandas when you want
to change the labels or names of rows or columns.

Renaming Row Indexes (Axis=0)

import pandas as pd

data = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=['X', 'Y', 'Z'])
data.rename(index={'X': 'New_X'}, inplace=True)

data.rename(index={'X': 'New_X', 'Y': 'New_Y'}, inplace=True)

Renaming Column Indexes (Axis=1)

data.rename(columns={'A': 'Column_A', 'B': 'Column_B'}, inplace=True)


Scenario
Let's consider a scenario where you have a dataset representing sales data for different products
in a store. The dataset contains sales figures indexed by the product names, and you want to
rename some products and update the column names to better reflect the data.

import pandas as pd

# Creating sample sales data


sales_data = pd.DataFrame({
'Old_Product_A': [100, 120, 90],
'Old_Product_B': [80, 60, 70],
'Old_Product_C': [50, 40, 60]
}, index=['Jan', 'Feb', 'Mar'])

# Renaming indexes (products) and columns


sales_data.rename(index={'Jan': 'January', 'Feb': 'February', 'Mar': 'March'}, inplace=True)
sales_data.rename(columns={'Old_Product_A': 'Product_X', 'Old_Product_B': 'Product_Y',
'Old_Product_C': 'Product_Z'}, inplace=True)

print(sales_data)
Discretization and Binning
Discretization:
Discretization involves dividing a continuous variable into intervals or bins. There are
mainly two types:

Equal Width Discretization:

Divides the range of values into equally spaced intervals.


Example:
import pandas as pd

data = pd.DataFrame({'A': [10, 15, 18, 20, 31, 34, 41, 46, 51, 53, 54]})
bins = pd.cut(data['A'], bins=3) # Divides data into 3 equal-width bins

Equal Frequency Discretization (Quantile-Based):

Divides the data into intervals with approximately the same number of observations in
each bin.
Example:
bins = pd.qcut(data['A'], q=3) # Divides data into 3 quantile-based bins
Equal Width Discretization

Dataset points: 10, 15, 18, 20, 31, 34, 41, 46, 51, 53, 54

Step 1: Sort the dataset in ascending order: 10, 15, 18, 20, 31, 34, 41, 46, 51,
53, 54.

Step 2: Determine the width of each bin using the formula: w = (max-min) / N (
N is number of bin)

w is calculated as (54 - 10) / 4 = 11.

BIN 1 : [lower bound , upper bound] = [(min) , (min + w -1)] = [10, 20]
BIN 2 : [lower bound , upper bound] = [(min + w) , (min + 2w -1)] = [21, 31]
BIN 3 : [lower bound , upper bound] = [(min + 2w) , (min + 3w -1)] = [32, 42]
BIN 4 : [lower bound , upper bound] = [(min + 3w) , (max)] = [43, 54]

BIN 1 : [10, 15, 18, 20]


BIN 2 : [31]
BIN 3 : [34, 41]
BIN 4 : [46, 51, 53, 54]
Equal Frequency Discretization

Dataset: 10, 15, 18, 20, 31, 34, 41, 46, 51, 53, 54, 60

Step 1: Sort the data.

Step 2: Find the frequency of each bin: frequency = total number of data points / number of bins.

The total number of data points is 12, and the number of bins required is 3. Therefore, the frequency comes out to be 4.

BIN 1: 10, 15, 18, 20
BIN 2: 31, 34, 41, 46
BIN 3: 51, 53, 54, 60
Binning

Binning involves assigning continuous values to specific bins or categories based on certain criteria.
This can be applied to group data or simplify its representation.

Example:
Suppose you have a list of ages and want to categorize them into different age groups:

import pandas as pd

ages = [22, 35, 47, 50, 28, 19, 65, 37, 42, 51]

# Define bins and labels for age groups


bins = [18, 30, 40, 50, 60, 100]
labels = ['18-29', '30-39', '40-49', '50-59', '60+']

# Create age groups based on defined bins and labels


age_groups = pd.cut(ages, bins=bins, labels=labels)

# Add age groups to the original data


data = pd.DataFrame({'Age': ages, 'AgeGroup': age_groups})
print(data)
Exercise-7

Scenario: You're working on a dataset containing information about housing prices in a city. One of the features in your dataset is the square footage of the houses. You've been tasked with binning the square footage data into categories to better understand how the size of a house relates to its price.
Outlier detection and filtering

• Outlier detection and filtering are important steps in data preprocessing.
• Identify and handle anomalous or extreme values that can skew statistical analyses or machine learning models.
• Several methods can be employed to detect outliers.
• Filtering can involve handling them by removing, capping, or transforming them.
Outlier Detection Methods

• Statistical Methods:
  • Z-Score: Identifying values far from the mean in terms of standard deviations.
  • IQR (Interquartile Range): Using quartiles to identify values outside a specific range.

• Visualization Techniques:
  • Boxplots: Visual representation of the data's distribution to spot outliers.
  • Scatterplots: Visualizing relationships and identifying points distant from others.

• Machine Learning Methods:
  • Clustering Algorithms: Detecting data points that don't cluster well with others.
  • Isolation Forest, Local Outlier Factor (LOF): Algorithms specifically designed for outlier detection.
Filtering Outliers

• Removal:
  • Deleting the identified outlier data points from the dataset.

• Capping or Flooring:
  • Setting a threshold and capping or flooring outlier values to a specific limit (see the IQR sketch after this list).

• Transformations:
  • Applying mathematical transformations (log, square root) to reduce the impact of outliers.
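The Z-Score example on the next slide removes outliers; as a complementary, hypothetical sketch, the IQR rule can be used to cap them instead (the 'Price' column and sample values are assumptions):

import pandas as pd

# Hypothetical data; any numeric column works the same way
housing_data = pd.DataFrame({'Price': [120, 135, 150, 160, 145, 900, 155, 140, 130, 25]})

# IQR-based bounds
q1 = housing_data['Price'].quantile(0.25)
q3 = housing_data['Price'].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Capping (and flooring) outliers to the bounds instead of dropping them
housing_data['Price_capped'] = housing_data['Price'].clip(lower=lower, upper=upper)
print(housing_data)
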
Example

import pandas as pd
import numpy as np

# Load housing price data


housing_data = pd.read_csv('housing_data.csv')

# Calculate Z-Score for 'Price' column


mean_price = np.mean(housing_data['Price'])
std_price = np.std(housing_data['Price'])
z_scores = (housing_data['Price'] - mean_price) / std_price

# Detect and filter outliers (considering a Z-Score threshold of 3)


outlier_threshold = 3
filtered_data = housing_data[abs(z_scores) < outlier_threshold]

# Explore filtered data


print(filtered_data.head())
Permutation and random sampling

• Permutation refers to rearranging or shuffling the order of a set of elements.
• In statistics, permutation tests involve creating new arrangements of the data to assess the likelihood of obtaining the observed results by chance.

Example:

import numpy as np

# Original data
data = np.array([10, 20, 30, 40, 50])

# Permute the data (shuffle the order)
permuted_data = np.random.permutation(data)
print(permuted_data)

Possible output (the order varies from run to run):
[30 40 10 50 20]
Permutation and random sampling

• Random sampling involves selecting a subset of data points from a larger dataset.
• Each data point has an equal probability of being chosen.
• It's used to create representative samples for analysis without bias.

Example:

import pandas as pd

# Original dataset
dataset = pd.read_csv('data.csv')

# Random sampling of 30% of the data
sampled_data = dataset.sample(frac=0.3, random_state=42)
print(sampled_data)
Random sampling without replacement

It involves selecting a subset of items from a larger set without allowing for duplicates in the selected subset.

Example:

import pandas as pd

# Original dataset (a DataFrame)
data = pd.DataFrame({'Values': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]})

# Random sampling without replacement using Pandas
sampled_data = data['Values'].sample(n=5, replace=False)
print(sampled_data)
Example

Consider a dataset of student grades:


import pandas as pd
import numpy as np

# Create a DataFrame with student grades


data = pd.DataFrame({
'Student_ID': np.arange(1, 21), # IDs for 20 students
'Grade': np.random.randint(60, 100, size=20) # Random grades between 60 and 100
})

print("Original Data:")
print(data)

# Random sampling without replacement


sampled_data = data['Grade'].sample(n=5, replace=False)

print("\nSampled Data:")
print(sampled_data)
Random sampling with replacement

It involves selecting items from a dataset where each selection is made


independently and with the possibility of selecting the same item multiple
times.

Example:

import pandas as pd
import numpy as np

# Create a DataFrame with student grades


data = pd.DataFrame({
'Student_ID': np.arange(1, 21), # IDs for 20 students
'Grade': np.random.randint(60, 100, size=20) # Random grades between 60 and 100
})

print("Original Data:")
print(data)
# Random sampling with replacement
sampled_data = data['Grade'].sample(n=5, replace=True, random_state=42)
print("\nSampled Data:")
print(sampled_data)
Exercise-8

You are given a dataset containing the daily sales figures of a retail store for the past month. The dataset comprises 30 entries, each representing the sales amount for a particular day.

Use the pandas library to perform the following:

• Create the dataset.
• Perform random sampling with replacement to select a subset of 8 days' sales figures. Display the sampled data.
• Perform random sampling without replacement to select a subset of 8 days' sales figures. Display the sampled data.
Benefits of data transformation

Handling Skewed Distributions

Improving Model Performance

Managing Missing Values

Better Visualization

Noise Reduction

Outlier Handling
