Data Exploration & Visualization
Unit-2
Dr. Selva Kumar S
Assistant Professor, Dept. of CSE
Big Data and Analytics by Seema Acharya and Subhashini Chellappan
Copyright 2015, WILEY INDIA PVT. LTD.
Agenda
Transformation Techniques
Performing data deduplication
Replacing values
Discretization and binning
Introduction to Missing data
Handling missing data
Traditional methods
Maximum Likelihood Estimation
Prerequisites
Pandas
NumPy
Seaborn
Matplotlib
https://www.youtube.com/watch?v=kjXgYBiOzc4
Data Transformation
Data transformation is a set of techniques used to convert data from one format or structure to another format or structure.
Transformation Activities
Data deduplication
Key restructuring
Data cleansing
Data validation
Format revisioning
Data derivation
Data aggregation
Data integration
Data filtering
Data joining
Data deduplication
Original Data:
Deduplication Process:
1. Identification of Duplicates: Identify duplicates based on specific criteria, such as matching email addresses or phone numbers.
2. Removal of Duplicates: After identification, choose which records to keep and which ones to remove.
Deduplicated Data:
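A minimal sketch of this process with pandas, assuming a small hypothetical customer table where duplicates are identified by email address:
import pandas as pd
# Hypothetical customer records; the first and third rows share an email address
customers = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Alice A.'],
    'Email': ['alice@example.com', 'bob@example.com', 'alice@example.com']
})
# Step 1: identify duplicates based on the Email column
duplicates = customers.duplicated(subset=['Email'], keep='first')
# Step 2: remove them, keeping the first occurrence
deduplicated = customers[~duplicates]
print(deduplicated)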
Key restructuring
Key restructuring involves transforming any keys with built-in meanings into generic keys.
This restructuring makes the keys more generic and applicable to a wider range of data or entities, rather than being specifically tied to products.
It allows for easier reuse and scalability if the dataset expands beyond just products to include other types of entities.
Example:
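The example table on the original slide is an image; a minimal sketch of the idea, assuming hypothetical product keys that encode meaning (category and year) being replaced by generic surrogate keys:
import pandas as pd
# Hypothetical data: keys with built-in meaning (category and year encoded in the key)
items = pd.DataFrame({
    'Old_Key': ['PRD-2023-TV', 'PRD-2023-FRIDGE', 'PRD-2024-OVEN'],
    'Name': ['Television', 'Refrigerator', 'Oven']
})
# Replace the meaningful keys with generic surrogate keys
items['Generic_Key'] = range(1, len(items) + 1)
print(items[['Generic_Key', 'Name', 'Old_Key']])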
Data cleansing
Data cleansing involves extracting words and deleting out-of-date, inaccurate, and incomplete information from the source data without altering its intended meaning.
Original Data:
Data Cleansing Process:
1. Extracting Words: First, extract words from the product descriptions to remove spelling errors and make the text more uniform.
2. Deleting Outdated, Inaccurate, and Incomplete Information: Remove outdated, inaccurate, or incomplete information without altering the intended meaning.
Cleansed Data:
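The before/after tables are images on the original slide; a minimal sketch of the idea, assuming a hypothetical product table with messy description text:
import pandas as pd
# Hypothetical product descriptions with extra whitespace and inconsistent casing
products = pd.DataFrame({
    'Product': ['Laptop', 'Phone'],
    'Description': ['  Fast LAPTOP, 8GB ram  ', 'phone with   great camera']
})
# Normalize whitespace and casing without changing the meaning
products['Description'] = (products['Description']
                           .str.strip()
                           .str.replace(r'\s+', ' ', regex=True)
                           .str.lower())
print(products)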
Data validation
Data validation is a process of formulating rules or algorithms that help in
validating different types of data against some known issues.
Data Validation Rules:
1. Name Field:
   Rule: The name should contain only alphabetic characters and spaces.
   Example: John Doe, Alice Smith
2. Email Field:
   Rule: Email should follow a standard email format.
   Example: [email protected], [email protected]
3. Phone Number Field:
   Rule: Phone number should be in a specific format (e.g., ###-###-####).
   Example: 123-456-7890, 987-654-3210
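A minimal sketch of how such rules might be checked with pandas string matching, assuming a hypothetical contacts table (the column names and regular expressions are illustrative):
import pandas as pd
# Hypothetical records to validate
contacts = pd.DataFrame({
    'Name': ['John Doe', 'Alice1'],
    'Phone': ['123-456-7890', '98765']
})
# Rule: name contains only alphabetic characters and spaces
contacts['Name_valid'] = contacts['Name'].str.fullmatch(r'[A-Za-z ]+')
# Rule: phone follows the ###-###-#### pattern
contacts['Phone_valid'] = contacts['Phone'].str.fullmatch(r'\d{3}-\d{3}-\d{4}')
print(contacts)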
Format revisioning
Format revisioning involves converting from one format to another.
Format Revisioning Process:
1. Identifying Old Format Files: Locate the existing text documents in the older "DOC" format that need to be revised or updated to the newer format.
2. Converting to New Format: Use software or tools capable of batch conversion to convert all "DOC" files to the "DOCX" format. Microsoft Word, for instance, allows multiple "DOC" files to be opened and saved as "DOCX" files.
Data derivation
Data derivation consists of creating a set of rules to generate more information
from the data source.
Example:
1. Creating a New Column - Total Sales: Units_Sold * Revenue
2. Deriving Monthly Sales Data: Sum the Total_sales
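A minimal sketch of both derivations in pandas, assuming a hypothetical sales table with Date, Units_Sold, and per-unit Revenue columns:
import pandas as pd
# Hypothetical daily sales records
sales = pd.DataFrame({
    'Date': pd.to_datetime(['2023-01-05', '2023-01-20', '2023-02-10']),
    'Units_Sold': [10, 5, 8],
    'Revenue': [15.0, 15.0, 20.0]   # revenue per unit (assumption)
})
# 1. Create a new column using the rule Total_Sales = Units_Sold * Revenue
sales['Total_Sales'] = sales['Units_Sold'] * sales['Revenue']
# 2. Derive monthly sales data by summing Total_Sales per month
monthly_sales = sales.groupby(sales['Date'].dt.to_period('M'))['Total_Sales'].sum()
print(monthly_sales)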
Data aggregation
Data aggregation involves searching, extracting, summarizing, and preserving
important information in different types of reporting systems.
Example:
1. Summarizing Daily Sales:
2. Calculating Monthly Revenue:
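A minimal sketch of both aggregations with pandas groupby, assuming a hypothetical transaction table with Date and Amount columns:
import pandas as pd
# Hypothetical transaction-level data
transactions = pd.DataFrame({
    'Date': pd.to_datetime(['2023-01-01', '2023-01-01', '2023-01-02', '2023-02-01']),
    'Amount': [100, 250, 80, 300]
})
# 1. Summarizing daily sales
daily_sales = transactions.groupby('Date')['Amount'].sum()
# 2. Calculating monthly revenue
monthly_revenue = transactions.groupby(transactions['Date'].dt.to_period('M'))['Amount'].sum()
print(daily_sales)
print(monthly_revenue)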
Data integration
Data integration involves converting different data types and merging them into a
common structure or schema.
Example:
Suppose a retail company has data stored in various formats and locations:
Sales Data: Stored in a SQL database with columns like Transaction_ID,
Product_ID, Quantity, Price, and Timestamp.
Customer Data: Stored in a CSV file containing customer information such as
Customer_ID, Name, Email, and Phone_Number.
Inventory Data: Stored in an Excel sheet with details like Product_ID,
Product_Name, Stock_Quantity, and Supplier.
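A minimal sketch of bringing such sources into a common schema with pandas, using in-memory stand-ins for the sources (in practice the data might come from pd.read_sql, pd.read_csv, and pd.read_excel):
import pandas as pd
# Stand-ins for the SQL sales table and the Excel inventory sheet
sales = pd.DataFrame({'Transaction_ID': [1, 2], 'Product_ID': [10, 11],
                      'Quantity': [2, 1], 'Price': [5.0, 8.0]})
inventory = pd.DataFrame({'Product_ID': [10, 11],
                          'Product_Name': ['Pen', 'Notebook'],
                          'Stock_Quantity': [100, 50]})
# Merge into one common structure on the shared Product_ID key
integrated = sales.merge(inventory, on='Product_ID', how='left')
print(integrated)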
Data filtering
Data filtering involves identifying information relevant to any particular user.
Example:
Imagine a scenario where a user wants to find movies or TV shows based on
specific criteria, such as genre, release year, and rating.
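A minimal sketch of such filtering with boolean indexing, assuming a hypothetical movie catalogue:
import pandas as pd
# Hypothetical catalogue
movies = pd.DataFrame({
    'Title': ['A', 'B', 'C'],
    'Genre': ['Comedy', 'Drama', 'Comedy'],
    'Year': [2019, 2021, 2022],
    'Rating': [7.8, 6.5, 8.2]
})
# Keep only comedies released after 2020 with a rating of at least 8
relevant = movies[(movies['Genre'] == 'Comedy') &
                  (movies['Year'] > 2020) &
                  (movies['Rating'] >= 8)]
print(relevant)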
Data joining
Data joining involves establishing a relationship between two or more tables.
Example:
Result of Data Joining:
Pandas dataframes
Pandas offer different functionalities for
combining DataFrames:
• Append
• Concat
• Merge
• Join
When to use each Function?
• append: Use when adding rows from one DataFrame
to another.
• concat: Use for combining DataFrames along rows or
columns based on axes.
• merge: Use for performing SQL-like joins based on
common columns.
• join: Use for combining DataFrames based on index
or specific columns.
Append function
•Usage: Appending rows from one DataFrame to another. (Note: DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0; pd.concat is the recommended replacement in newer versions.)
•Example:
import pandas as pd
# Creating DataFrames
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})
# Appending df2 to df1
df_appended = df1.append(df2)
print(df_appended)
Concat Function
•Usage: Concatenating DataFrames along rows or columns
based on axes.
•Example:
import pandas as pd
# Creating DataFrames
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})
# Concatenating along rows (axis=0)
df_concatenated = pd.concat([df1, df2], axis=0)
print("Concatenated DataFrame:")
print(df_concatenated)
Merge Function
•Usage: Performing database-style joins by merging
DataFrames based on key columns.
•Example:
import pandas as pd
# Creating DataFrames
df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'ID': [2, 3, 4], 'Age': [25, 30, 28]})
# Merging based on 'ID' column
merged_df = pd.merge(df1, df2, on='ID', how='inner')
print("Merged DataFrame:")
print(merged_df)
Join Function
•Usage: Joining DataFrames based on their index or a
specific column.
•Example:
import pandas as pd
# Creating DataFrames
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]}, index=['X', 'Y'])
df2 = pd.DataFrame({'C': [5, 6], 'D': [7, 8]}, index=['X', 'Y'])
# Joining based on index
joined_df = df1.join(df2)
print("Joined DataFrame:")
print(joined_df)
Guess the output??
import pandas as pd
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'C': [7, 8]})
df_appended = df1.append(df2, sort=False)  # sort=False to handle different columns
print("Appended DataFrame:")
print(df_appended)
Answer
A B C
0 1 3.0 NaN
1 2 4.0 NaN
0 5 NaN 7.0
1 6 NaN 8.0
Guess the output?
import pandas as pd
# Creating DataFrames with unequal columns
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'C': [7, 8]})
# Concatenating along columns (axis=1) with unequal columns
df_concatenated = pd.concat([df1, df2], axis=1, sort=False)
print("Concatenated DataFrame:")
print(df_concatenated)
Answer
A B A C
0 1 3 5 7
1 2 4 6 8
Guess the output?
import pandas as pd
df1 = pd.DataFrame({'ID': [1, 2], 'Name': ['Alice', 'Bob']})
df2 = pd.DataFrame({'ID': [1, 2], 'Age': [25, 30], 'Location': ['NY',
'CA']})
# Merging based on 'ID' column
merged_df = pd.merge(df1, df2, on='ID', how='inner')
print("Merged DataFrame:")
print(merged_df)
Answer
ID Name Age Location
0 1 Alice 25 NY
1 2 Bob 30 CA
Guess the output?
import pandas as pd
# Creating DataFrames
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]}, index=['X', 'Y'])
df2 = pd.DataFrame({'C': [5, 6], 'D': [7, 8]}, index=['X', 'Y'])
# Joining based on index
joined_df = df1.join(df2, how='outer')
print("Joined DataFrame:")
print(joined_df)
Answer
A B C D
X 1 3 5 7
Y 2 4 6 8
Exercise-1
Use Case: Sales Data Analysis
Scenario:
Suppose you have two datasets related to sales:
sales_data_1 containing sales figures for various products in different regions.
sales_data_2 containing additional sales information for specific products.
Tasks:
Append Data: Combine both datasets vertically to create a comprehensive sales dataset.
Concatenate Data: Concatenate the sales data along columns for a more detailed analysis.
Merge Data: Merge the datasets based on a common key (product ID) to enrich the sales
information.
Join Data: Join the datasets based on the product names to analyze sales by product names.
Transformation Techniques
Performing data deduplication
• Data deduplication involves removing duplicate rows from a dataset
to maintain data integrity and accuracy.
• Example:
import pandas as pd
# Sample data with duplicates
data = {
'Order_ID': [101, 102, 101, 103, 104, 102],
'Product': ['A', 'B', 'A', 'C', 'D', 'B'],
'Quantity': [5, 3, 4, 2, 6, 3],
'Amount': [100, 80, 100, 120, 150, 80]
}
sales_data = pd.DataFrame(data)
print("Original Data:")
print(sales_data)
Deduplication
• Deduplication using the drop_duplicates() method in Pandas:
# Removing duplicates based on all columns
deduplicated_data = sales_data.drop_duplicates()
print("\nDeduplicated Data:")
print(deduplicated_data)
Deduplication
# Removing duplicates based on specific columns
deduplicated_data_subset = sales_data.drop_duplicates(subset=['Order_ID', 'Product'])
print("\nDeduplicated Data (Subset):")
print(deduplicated_data_subset)
Guess the output?
import pandas as pd
frame3 = pd.DataFrame({'column 1': ['Looping'] * 3 + ['Functions'] * 4,
                       'column 2': [10, 10, 22, 23, 23, 24, 24]})
frame3
frame3.duplicated()
Answer
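The answer slide is an image in the original; the expected output of frame3.duplicated(), marking the second occurrence of each repeated row, is:
0    False
1     True
2    False
3    False
4     True
5    False
6     True
dtype: bool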
Exercise-2
Scenario: You are working as a data analyst for an
e-commerce company that has a massive dataset
containing customer information. The dataset has
accumulated duplicate entries due to various
reasons like multiple registrations, system errors,
or data entry mistakes. Your task is to use the
Pandas library in Python to identify and remove
these duplicate entries efficiently.
Exercise-3
Scenario: You're analyzing a sales dataset
from a retail store chain that contains
transactions from multiple branches. Due to
system errors and data integration issues,
the dataset has accumulated duplicate
records of sales transactions. Your task is to
use Pandas to identify these duplicates
based on specific columns and retain only
the unique transactions.
Replacing Values
• Use the replace() method:
import pandas as pd
import numpy as np
replaceFrame = pd.DataFrame({'column 1': [200., 3000., -786., 3000., 234., 444., -786., 332., 3332.],
                             'column 2': range(9)})
replaceFrame.replace(to_replace=-786, value=np.nan)
• Replace multiple values:
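A minimal sketch of replacing several values at once (the sentinel values come from the frame above):
replaceFrame.replace(to_replace=[-786, 3000], value=[np.nan, np.nan])
# or, equivalently, with a mapping of old values to new values
replaceFrame.replace({-786: np.nan, 3000: np.nan})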
Exercise-4
Scenario:
You're working with a dataset that contains information about
product sales. However, you've noticed that certain entries
have inconsistent or incorrect values in the 'Category' column.
You need to standardize these categories by replacing specific
values with a unified category name using Pandas' replace()
method.
data = {
'Product_ID': [1, 2, 3, 4, 5],
'Product_Name': ['Apple', 'Banana', 'Orange', 'Pear', 'Grapes'],
'Category': ['Fruit', 'Fruit', 'Fruit', 'Veggie', 'Fruit']
}
Handling Missing Data
• NaN value is used to indicate that there is no value.
• There could be several reasons why a value
could be NaN:
• It can happen when data is retrieved from an
external source and there are some
incomplete values in the dataset.
• It can also happen when we join two different
datasets and some values are not matched.
• Missing values due to data collection errors.
• When the shape of data changes, new additional rows or columns are introduced whose values are not yet determined.
• Reindexing of data can result in incomplete data.
Example
import pandas as pd
import numpy as np
data = np.arange(15, 30).reshape(5, 3)
dfx = pd.DataFrame(data, index=['apple', 'banana', 'kiwi', 'grapes', 'mango'],
                   columns=['store1', 'store2', 'store3'])
dfx
dfx['store4'] = np.nan
dfx.loc['watermelon'] = np.arange(15, 19)
dfx.loc['oranges'] = np.nan
dfx['store5'] = np.nan
dfx.loc['apple', 'store4'] = 20.  # .loc avoids chained-assignment warnings
dfx
NaN values in Pandas Objects
• dfx.isnull(): boolean DataFrame that is True wherever a value is NaN
• dfx.notnull(): the complement, True wherever a value is present
• dfx.isnull().sum(): number of NaN values per column
• dfx.isnull().sum().sum(): total number of NaN values in the DataFrame
• dfx.count(): number of non-NaN values per column
Example
import pandas as pd
# Load your dataset into a DataFrame
df = pd.read_csv('your_dataset.csv')
# Check for missing values in the entire DataFrame
print(df.isnull().sum())
# Check for missing values in a specific column
print(df['Column_Name'].isnull().sum())
Exercise-5
Scenario: You've received a dataset containing information about
students' performance, but the dataset has missing values across
multiple columns due to various reasons. Your task is to analyze and
handle these missing values using Pandas' methods.
Questions:
1. How many missing values are there in the 'Grade' column of the dataset?
2. What is the total count of missing values in the entire dataset?
3. Can you identify and display rows where 'Absenteeism' values are missing?
4. How many rows have complete data (non-missing) in the 'Study Hours' column?
5. What is the sum of missing values across each column in the dataset?
6. How many missing values are there in each column of the dataset?
Dropping missing values
• One of the ways to handle missing values is to simply remove them from
dataset.
• dropna() method just returns a copy of the dataframe by dropping the rows
with NaN. The original dataframe is not changed.
• Drop rows with any missing values
df.dropna(inplace=True)  # Use inplace=True to modify the original DataFrame
• Drop rows where specific columns have missing values
df.dropna(subset=['Column_Name'], inplace=True)
• Drop rows where entire row values are NaN
df.dropna(how='all')
• Drop columns where entire column values are NaN
df.dropna(how='all', axis=1)
• Keep only rows or columns that contain at least a minimum number of non-NaN values, using the thresh parameter (see the sketch below).
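A minimal sketch of the thresh option (the threshold value 2 is illustrative):
df.dropna(thresh=2)           # keep rows having at least 2 non-NaN values
df.dropna(thresh=2, axis=1)   # keep columns having at least 2 non-NaN values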
Example-1
import pandas as pd
# Sample data with missing values
data = {
'A': [1, 2, None, 4, 5],
'B': ['a', 'b', 'c', None, 'e'],
'C': [10.5, None, 30.2, 40.1, None]
}
# Creating a DataFrame
df = pd.DataFrame(data)
# Dropping rows with any missing values
df_dropped = df.dropna()
print("DataFrame after dropping rows with any missing values:")
print(df_dropped)
Example-2
import pandas as pd
# Sample data with missing values
data = {
'A': [1, 2, None, 4, 5],
'B': ['a', 'b', 'c', None, 'e'],
'C': [10.5, None, 30.2, 40.1, None]
}
# Creating a DataFrame
df = pd.DataFrame(data)
# Dropping rows where 'B' column has missing values
df_dropped_b = df.dropna(subset=['B'])
print("DataFrame after dropping rows with missing values in 'B' column:")
print(df_dropped_b)
Example-3
import pandas as pd
# Sample data with missing values
data = {
'A': [1, 2, None, 4, 5],
'B': ['a', 'b', 'c', None, 'e'],
'C': [10.5, None, 30.2, 40.1, None]
}
# Creating a DataFrame
df = pd.DataFrame(data)
# Dropping columns with any missing values
df_dropped_columns = df.dropna(axis=1)
print("DataFrame after dropping columns with any missing values:")
print(df_dropped_columns)
Mathematical operations with NaN
Consider the following example:
import numpy as np
import pandas as pd
ar1 = np.array([100, 200, np.nan, 300])
ser1 = pd.Series(ar1)
ar1.mean(), ser1.mean()
The output of the preceding code is the following:
(nan, 200.0)
• When a NumPy function encounters NaN values, it returns NaN.
• Pandas, on the other hand, ignores the NaN values and moves ahead with processing: when performing the sum operation, NaN is treated as 0, and operations such as mean return NaN only when all the values are NaN.
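A small sketch illustrating this behaviour with the series defined above:
ser1.sum()    # 600.0: the NaN is skipped, i.e. treated as contributing 0
ser1.mean()   # 200.0: pandas ignores the NaN, while ar1.mean() gives nan
pd.Series([np.nan, np.nan]).mean()   # nan: all values are missing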
Exercise-6
Scenario:
You are analyzing a dataset containing information about employees in a company. However, the
dataset is not entirely clean and has missing values in various columns due to data entry errors and
incomplete records. Your task is to preprocess the data by handling missing values using Pandas'
dropna() method.
Questions:
1. How would you drop all rows with any missing values from the entire dataset using Pandas?
2. If the 'Salary' column is critical for analysis, how can you drop only the rows where 'Salary' values are missing while preserving the rest of the dataset?
3. In some columns, missing values are acceptable up to a certain limit. How would you drop rows with more than two missing values across any column?
4. For the 'Department' column, there are missing values that can't be imputed. How can you drop rows where 'Department' is missing while keeping rows with other missing values intact?
5. Considering the 'Address' and 'Phone' columns have a significant number of missing values, how can you drop these columns entirely from the dataset?
6. If you want to drop rows where 'Years of Experience' are missing and 'Salary' is also missing or zero, how would you achieve this?
Dataset
Filling Missing values
We can use the fillna() method to replace NaN values with any particular
values.
Example:
filledDf = dfx.fillna(0)
filledDf
# Fill missing values with a specific constant
df['Column_Name'].fillna(value=0, inplace=True)
# Fill missing values with the mean of the column
df['Column_Name'].fillna(df['Column_Name'].mean(), inplace=True)
# Fill missing values with the most frequent value in the column
df['Column_Name'].fillna(df['Column_Name'].mode()[0], inplace=True)
Example
import pandas as pd
import numpy as np
data = {
'A': [1, 2, np.nan, 4, 5], 'B': ['a', np.nan, 'c', 'd', 'e'], 'C': [10, 20, np.nan, np.nan, 50],
'D': [np.nan, 'Yes', 'No', 'Yes', np.nan]
}
df = pd.DataFrame(data)
# 1. Fill missing numerical values with the mean of the column
df_filled_mean = df.copy()
df_filled_mean['A'].fillna(df_filled_mean['A'].mean(), inplace=True)
df_filled_mean['C'].fillna(df_filled_mean['C'].mean(), inplace=True)
# 2. Fill missing categorical values with the most frequent value (mode)
df_filled_mode = df.copy()
df_filled_mode['B'].fillna(df_filled_mode['B'].mode()[0], inplace=True)
df_filled_mode['D'].fillna(df_filled_mode['D'].mode()[0], inplace=True)
print("DataFrame with missing values:") print(df)
print("\nDataFrame after filling missing numerical values with mean:") print(df_filled_mean)
print("\nDataFrame after filling missing categorical values with mode:") print(df_filled_mode)
Output
Backward and Forward Filling
NaN values can be filled based on the last known values.
# Forward fill missing values with the previous valid value
df.fillna(method='ffill', inplace=True)
# Backward fill missing values with the next valid value
df.fillna(method='bfill', inplace=True)
Example
import pandas as pd
import numpy as np
# Sample data with missing values
data = {
'A': [1, 2, np.nan, 4, 5], 'B': ['a', np.nan, 'c', 'd', 'e'], 'C': [10, 20, np.nan, np.nan, 50],
'D': [np.nan, 'Yes', 'No', 'Yes', np.nan]
}
df = pd.DataFrame(data)
# 1. Forward fill missing values in categorical columns
df_ffill = df.copy()
df_ffill['B'].fillna(method='ffill', inplace=True)
df_ffill['D'].fillna(method='ffill', inplace=True)
# 2. Backward fill missing values in numerical columns
df_bfill = df.copy()
df_bfill['A'].fillna(method='bfill', inplace=True)
df_bfill['C'].fillna(method='bfill', inplace=True)
print("DataFrame with missing values:") print(df)
print("\nDataFrame after forward filling missing categorical values:") print(df_ffill)
print("\nDataFrame after backward filling missing numerical values:") print(df_bfill)
Output
Interpolating missing value
The pandas library provides the interpolate() function for both Series and DataFrames.
By default, it performs a linear interpolation of the missing values.
ser3 = pd.Series([100, np.nan, np.nan, np.nan, 292])
ser3.interpolate()
# Use linear interpolation to fill missing values
df['Column_Name'] = df['Column_Name'].interpolate(method='linear')
Example
import pandas as pd
import numpy as np
# Sample data with missing values
data = {
'Date': pd.date_range(start='2023-01-01', periods=10),
'Value': [10, 20, np.nan, 40, 50, np.nan, np.nan, 80, 90, np.nan]
}
# Creating a DataFrame
df = pd.DataFrame(data)
# Interpolating missing values in the 'Value' column using linear interpolation
df['Value_interpolated'] = df['Value'].interpolate(method='linear')
# Displaying the original and interpolated DataFrames
print("Original DataFrame:")
print(df)
Using Mask for missing values
Creating a Mask : Once you've identified the missing values, create a boolean mask that indicates the positions of
these missing values in your dataset.
For example, in Pandas, you can use functions like isnull() or notnull() to create a boolean mask.
import pandas as pd
import numpy as np
data = pd.DataFrame({'A': [1, 2, np.nan, 4],
'B': [5, np.nan, 7, 8]})
# Creating a mask for missing values
mask = data.isnull()
Using Mask
• Use the mask to perform specific operations:
Filling missing values with a certain value
Dropping rows or columns with missing values
Performing computations only on non-missing values.
# Filling missing values with a specific number
filled_data = data.fillna(0)
# Dropping rows with any missing values
cleaned_data = data.dropna()
# Performing operations only on non-missing values
mean_without_missing = data.mean(skipna=True)
# Create a mask to identify missing values in a column
mask = df['Column_Name'].isnull()
# Use the mask to perform operations or replacements on missing values
df.loc[mask, 'Column_Name'] = 'Missing' # For instance, replace missing values with a string 'Missing'
Applying the Mask
Apply the mask to filter out missing values or perform operations
selectively on non-missing values.
# Applying the mask to get non-missing values
non_missing_values = data[mask == False]
# Applying mask to perform operations selectively
sum_without_missing = data[mask == False].sum()
Scenario-1
Handling missing values can vary based on the nature of the
data and the context of the analysis.
Scenario 1: Numerical Data
Approach 1: Imputation
Mean/Median/Mode Imputation: Replace missing values in
numerical columns with the mean, median, or mode of the
non-missing values in that column.
# Impute missing values with mean
data['column'].fillna(data['column'].mean(), inplace=True)
Scenario-2
Scenario 2: Categorical Data
Approach 1: Mode Imputation
Replace missing categorical values with the mode (most frequent) value in
that column.
# Impute missing categorical values with mode
data['column'].fillna(data['column'].mode()[0], inplace=True)
Approach 2: Create a New Category
If the missing values represent a category on their own, you might encode
missing values as a separate category.
# Replace missing categorical values with 'Unknown'
data['column'].fillna('Unknown', inplace=True)
Scenario-3
Scenario 3: Time Series Data
Approach 1: Forward or Backward Fill
Use the previous known value (forward fill) or the next known value (backward fill) to fill missing values in a time series.
# Forward fill missing values
data['column'].fillna(method='ffill', inplace=True)
Approach 2: Interpolation
Interpolate missing values based on existing data points. For time series,
you might use linear interpolation.
# Linear interpolation
data['column'] = data['column'].interpolate(method='linear')
Renaming axis indexes
Renaming axis indexes is a common operation in pandas when you want
to change the labels or names of rows or columns.
Renaming Row Indexes (Axis=0)
import pandas as pd
data = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=['X', 'Y', 'Z'])
data.rename(index={'X': 'New_X'}, inplace=True)
data.rename(index={'X': 'New_X', 'Y': 'New_Y'}, inplace=True)
Renaming Column Indexes (Axis=1)
data.rename(columns={'A': 'Column_A', 'B': 'Column_B'}, inplace=True)
Scenario
Let's consider a scenario where you have a dataset representing sales data for different products
in a store. The dataset contains sales figures indexed by the product names, and you want to
rename some products and update the column names to better reflect the data.
import pandas as pd
# Creating sample sales data
sales_data = pd.DataFrame({
'Old_Product_A': [100, 120, 90],
'Old_Product_B': [80, 60, 70],
'Old_Product_C': [50, 40, 60]
}, index=['Jan', 'Feb', 'Mar'])
# Renaming indexes (products) and columns
sales_data.rename(index={'Jan': 'January', 'Feb': 'February', 'Mar': 'March'}, inplace=True)
sales_data.rename(columns={'Old_Product_A': 'Product_X', 'Old_Product_B': 'Product_Y',
'Old_Product_C': 'Product_Z'}, inplace=True)
print(sales_data)
Discretization and Binning
Discretization:
Discretization involves dividing a continuous variable into intervals or bins. There are
mainly two types:
Equal Width Discretization:
Divides the range of values into equally spaced intervals.
Example:
import pandas as pd
data = pd.DataFrame({'A': [10, 15, 18, 20, 31, 34, 41, 46, 51, 53, 54]})
bins = pd.cut(data['A'], bins=3) # Divides data into 3 equal-width bins
Equal Frequency Discretization (Quantile-Based):
Divides the data into intervals with approximately the same number of observations in
each bin.
Example:
bins = pd.qcut(data['A'], q=3) # Divides data into 3 quantile-based bins
Equal Width Discretization
Dataset points: 10, 15, 18, 20, 31, 34, 41, 46, 51, 53, 54
Step 1: Sort the dataset in ascending order: 10, 15, 18, 20, 31, 34, 41, 46, 51,
53, 54.
Step 2: Determine the width of each bin using the formula w = (max - min) / N, where N is the number of bins.
With N = 4 bins, w = (54 - 10) / 4 = 11.
BIN 1 : [lower bound , upper bound] = [(min) , (min + w -1)] = [10, 20]
BIN 2 : [lower bound , upper bound] = [(min + w) , (min + 2w -1)] = [21, 31]
BIN 3 : [lower bound , upper bound] = [(min + 2w) , (min + 3w -1)] = [32, 42]
BIN 4 : [lower bound , upper bound] = [(min + 3w) , (max)] = [43, 54]
BIN 1 : [10, 15, 18, 20]
BIN 2 : [31]
BIN 3 : [34, 41]
BIN 4 : [46, 51, 53, 54]
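A small sketch checking the same grouping with pandas (pd.cut with bins=4 builds four equal-width intervals of width 11 over this data):
import pandas as pd
data = pd.DataFrame({'A': [10, 15, 18, 20, 31, 34, 41, 46, 51, 53, 54]})
bins = pd.cut(data['A'], bins=4)
print(bins.value_counts().sort_index())   # expected counts per bin: 4, 1, 2, 4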
Equal Frequency Discretization
Dataset: 10,15, 18, 20, 31, 34, 41, 46, 51, 53, 54, 60
Step 1: Sort the data.
Step 2: Find the frequency per bin: frequency = total number of data points / number of bins.
The total number of data points is 12 and the number of bins required is 3, so each bin holds 12 / 3 = 4 values.
BIN 1: 10, 15, 18, 20
BIN 2: 31, 34, 41, 46
BIN 3: 51, 53, 54, 60
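A small sketch checking the same grouping with pandas (pd.qcut splits the data into three quantile-based bins of roughly equal size):
import pandas as pd
values = pd.Series([10, 15, 18, 20, 31, 34, 41, 46, 51, 53, 54, 60])
bins = pd.qcut(values, q=3)
print(bins.value_counts().sort_index())   # expected counts per bin: 4, 4, 4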
Binning
Binning involves assigning continuous values to specific bins or categories based on certain criteria.
This can be applied to group data or simplify its representation.
Example:
Suppose you have a list of ages and want to categorize them into different age groups:
import pandas as pd
ages = [22, 35, 47, 50, 28, 19, 65, 37, 42, 51]
# Define bins and labels for age groups
bins = [18, 30, 40, 50, 60, 100]
labels = ['18-29', '30-39', '40-49', '50-59', '60+']
# Create age groups based on defined bins and labels
age_groups = pd.cut(ages, bins=bins, labels=labels)
# Add age groups to the original data
data = pd.DataFrame({'Age': ages, 'AgeGroup': age_groups})
print(data)
Exercise-7
Scenario: You're working on a dataset containing information
about housing prices in a city. One of the features in your
dataset is the square footage of the houses. You've been
tasked with binning the square footage data into categories
to better understand how the size of a house relates to its
price.
Outlier detection and filtering
• Outlier detection and filtering are important steps in
data preprocessing.
• Identify and handle anomalous or extreme values
that can skew statistical analyses or machine learning
models.
• Several methods can be employed to detect outliers.
• Filtering can involve handling them by removing,
capping, or transforming them.
Outlier Detection Methods
• Statistical Methods:
• Z-Score: Identifying values far from the mean in terms of standard deviations.
• IQR (Interquartile Range): Using quartiles to identify values outside a specific range.
• Visualization Techniques:
• Boxplots: Visual representation of the data's distribution to spot outliers.
• Scatterplots: Visualizing relationships and identifying points distant from others.
• Machine Learning Methods:
• Clustering Algorithms: Detecting data points that don’t cluster well with others.
• Isolation Forest, Local Outlier Factor (LOF): Algorithms specifically designed for outlier detection.
Filtering Outliers
• Removal:
• Deleting the identified outlier data points from the dataset.
• Capping or Flooring:
• Setting a threshold and capping or flooring outlier values to a
specific limit.
• Transformations:
• Applying mathematical transformations (log, square root) to
reduce the impact of outliers.
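The worked example that follows uses the Z-score approach; as a complementary hedged sketch, capping outliers with the IQR rule might look like this (the sample values and the conventional 1.5 multiplier are illustrative assumptions):
import pandas as pd
# Hypothetical numeric column with one extreme value
prices = pd.Series([100, 120, 130, 110, 115, 900])
q1, q3 = prices.quantile(0.25), prices.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
# Capping/flooring outliers to the IQR limits instead of removing them
capped = prices.clip(lower=lower, upper=upper)
print(capped)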
Example
import pandas as pd
import numpy as np
# Load housing price data
housing_data = pd.read_csv('housing_data.csv')
# Calculate Z-Score for 'Price' column
mean_price = np.mean(housing_data['Price'])
std_price = np.std(housing_data['Price'])
z_scores = (housing_data['Price'] - mean_price) / std_price
# Detect and filter outliers (considering a Z-Score threshold of 3)
outlier_threshold = 3
filtered_data = housing_data[abs(z_scores) < outlier_threshold]
# Explore filtered data
print(filtered_data.head())
Permutation and random sampling
• Permutation refers to rearranging or shuffling the order of a set of
elements.
• In statistics, permutation tests involve creating new arrangements of
the data to assess the likelihood of obtaining the observed results by
chance.
Example:
import numpy as np
# Original data
data = np.array([10, 20, 30, 40, 50])
# Permute the data (shuffle the order)
permuted_data = np.random.permutation(data)
print(permuted_data)
[30 40 10 50 20]
Permutation and random sampling
• Random sampling involves selecting a subset of data points from a
larger dataset.
• Each data point has an equal probability of being chosen.
• It's used to create representative samples for analysis without bias.
• Example:
import pandas as pd
# Original dataset
dataset = pd.read_csv('data.csv')
# Random sampling of 30% of the data
sampled_data = dataset.sample(frac=0.3, random_state=42)
print(sampled_data)
Random sampling without replacement
It involves selecting a subset of items from a larger set without allowing for duplicates
in the selected subset.
Example:
import pandas as pd
# Original dataset (a DataFrame)
data = pd.DataFrame({'Values': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]})
# Random sampling without replacement using Pandas
sampled_data = data['Values'].sample(n=5, replace=False)
print(sampled_data)
Example
Consider a dataset of student grades:
import pandas as pd
import numpy as np
# Create a DataFrame with student grades
data = pd.DataFrame({
'Student_ID': np.arange(1, 21), # IDs for 20 students
'Grade': np.random.randint(60, 100, size=20) # Random grades between 60 and 100
})
print("Original Data:")
print(data)
# Random sampling without replacement
sampled_data = data['Grade'].sample(n=5, replace=False)
print("\nSampled Data:")
print(sampled_data)
Random sampling with replacement
It involves selecting items from a dataset where each selection is made
independently and with the possibility of selecting the same item multiple
times.
Example:
import pandas as pd
import numpy as np
# Create a DataFrame with student grades
data = pd.DataFrame({
'Student_ID': np.arange(1, 21), # IDs for 20 students
'Grade': np.random.randint(60, 100, size=20) # Random grades between 60 and 100
})
print("Original Data:")
print(data)
# Random sampling with replacement
sampled_data = data['Grade'].sample(n=5, replace=True, random_state=42)
print("\nSampled Data:")
print(sampled_data)
Exercise-8
You are given a dataset containing the daily sales figures of a retail store for the past
month. The dataset comprises 30 entries, each representing the sales amount for a
particular day.
Use pandas library to perform the following:
Create dataset
Perform random sampling with replacement to select a subset of 8 days' sales figures.
Display the sampled data.
Perform random sampling without replacement to select a subset of 8 days' sales
figures. Display the sampled data.
Benefits of data transformation
Handling Skewed Distributions
Improving Model Performance
Managing Missing Values
Better Visualization
Noise Reduction
Outlier Handling