Chapter 3 Exploratory Data Analysis

EXPLORATORY DATA ANALYSIS

CHAPTER 3

“Introduction to Data Science: Practical Approach with R and Python”

B. Uma Maheswari and R. Sujatha
Copyright © 2021 Wiley India Pvt. Ltd. All rights reserved.
LEARNING OBJECTIVES
Apply the steps in data pre-processing.
Understand data by inspecting and visualizing it.
Learn the concept of outliers and how to deal with them.
Deal with missing values during data pre-processing.
Understand the concept of standardization.
Apply R and Python programming for data analysis.
DATA SCIENCE PROCESS MODEL
Objectives of EDA
To develop an understanding of the data
To identify trends and patterns
To understand the relationships between variables
To decide on the appropriate models to be executed on the data
To find answers to questions relating to the data
To test assumptions
STEPS IN DATA PRE-PROCESSING
DATASET DESCRIPTION
S.No. Column Name Description
1 phoneno Phone Number of the customer
2 age Age of the customer (1-> 18-30, 2->31-40, 3->41-50, 4->Above 50)
3 gender Gender of the customer (0->Male, 1->Female)
4 zipcode Zip code of the area where the customer lives
5 calls Number of calls made by the customer per month
6 sms Number of SMS sent by the customer per month
7 mms Number of MMS sent by the customer per month
8 charges Monthly charges paid by the customer
9 coverage Number of days out of coverage
10 complaint Type of Complaint (0->No problem, 1->Recharge issues, 2->Problems in the offer/package, 3->Network problem, 4->Call dropping)
11 sim Single or dual sim (0->Single sim, 1->Dual sim)
12 phone Type of Phone (0->Android, 1-> IOS)
13 prepost Prepaid or Post Paid (0->Prepaid, 1->Post Paid)
14 churn Customer Churn (0->No Churn, 1->Churn)
UNDERSTANDING THE DATA

Load the dataset
Dimensions of the data • dim, nrow, ncol, names
Structure of the dataset
Summary of the dataset
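The steps above can be sketched in Python using only the standard library. This is a minimal illustration, not the chapter's own code: the three rows in SAMPLE_CSV are made-up values, only a few of the column names from the dataset description are used, and load_rows is a hypothetical helper name.

```python
import csv
import io
import statistics

# Hypothetical three-row sample using a few column names from the dataset
# description; the values are made up for illustration only.
SAMPLE_CSV = """phoneno,age,gender,calls,charges,churn
9000000001,1,0,120,350.5,0
9000000002,3,1,45,199.0,1
9000000003,2,0,80,275.0,0
"""


def load_rows(text):
    """Parse CSV text into a list of dictionaries, one per row."""
    return list(csv.DictReader(io.StringIO(text)))


# Step 1: load the dataset.
rows = load_rows(SAMPLE_CSV)

# Step 2: dimensions of the data (compare dim, nrow, ncol and names in R).
n_rows = len(rows)
column_names = list(rows[0].keys())
n_cols = len(column_names)
print("dimensions:", (n_rows, n_cols))
print("names:", column_names)

# Step 3: structure. The csv module reads every field as a string, so
# looking at one row shows which columns still need type conversion.
print("first row:", rows[0])

# Step 4: summary of one numeric column after converting it to float.
charges = [float(row["charges"]) for row in rows]
summary = {
    "min": min(charges),
    "mean": statistics.mean(charges),
    "max": max(charges),
}
print("charges summary:", summary)
```

To run this against a real file, the in-memory string can be replaced by the text read from the file.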
CONTINUOUS AND CATEGORICAL VARIABLES
Continuous variables are quantitative variables that can be measured and can take any of an
infinite number of values within a range. Mean, median and mode can be calculated for
continuous variables. Examples: height, weight, speed of a vehicle.
Categorical variables take one of a finite number of distinct groups or categories.
Examples: gender, pass/fail.
In simple words, a variable that we measure is continuous, and a variable whose values we
sort into groups and count is categorical.
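As a rough sketch of how the two kinds of variable are summarised differently, the Python snippet below computes mean, median and mode for a continuous column and frequency counts for a categorical one. The values are hypothetical and are not taken from the chapter's dataset; only the column names and the gender coding follow the dataset description.

```python
import statistics
from collections import Counter

# Hypothetical values for illustration only.
charges = [350.5, 199.0, 275.0, 199.0, 420.0]  # continuous: measured amounts
gender = [0, 1, 0, 0, 1]                       # categorical: 0 -> Male, 1 -> Female

# A continuous variable is summarised with measures of central tendency.
continuous_summary = {
    "mean": statistics.mean(charges),
    "median": statistics.median(charges),
    "mode": statistics.mode(charges),
}

# A categorical variable is summarised by counting how often each group occurs.
category_counts = Counter(gender)

print(continuous_summary)
print(dict(category_counts))
```

Averaging the gender codes would produce a number, but it would not describe a typical customer, which is why counts are used for categorical columns.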
NORMAL DISTRIBUTION

Line drawing to be drawn
RIGHT SKEWED AND LEFT SKEWED

Line drawing to be drawn

DATA VISUALIZATION
Histogram (continuous variables)
Barplot (categorical variables)
Boxplot (continuous variables)
BOXPLOT
A box plot gives a compact picture of the distribution of quantitative data. It is also known as a box-and-whisker plot and is used in exploratory data analysis to draw inferences from the data.
A boxplot divides the data into quartiles.
The first 25% of the data lies between the minimum value and the start of the box, which is the first quartile (Q1). This segment is drawn as the lower whisker.
The second 25% of the data lies between the start of the box and the median, which is the second quartile (Q2).
The third 25% of the data lies between the median and the end of the box, which is the third quartile (Q3).
The last 25% of the data lies between the end of the box and the maximum value. This segment is drawn as the upper whisker.
The lengths of the whiskers and the position of the median within the box indicate the skewness of the data.
The box spans the interquartile range (IQR), which is the difference between the 75th and the 25th percentiles (Q3 - Q1).
A boxplot also indicates the presence of outliers.
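The quantities a boxplot draws can be computed directly. The sketch below uses Python's statistics.quantiles with its default method; other tools use slightly different quartile conventions and may give slightly different values. The charges list is hypothetical and five_number_summary is an illustrative helper name.

```python
import statistics

# Hypothetical monthly charges, made up for illustration only.
charges = [12, 15, 18, 20, 22, 25, 27, 30, 33, 38, 45]


def five_number_summary(values):
    """Return the minimum, Q1, median (Q2), Q3 and maximum of the values."""
    ordered = sorted(values)
    # n=4 asks for the three cut points that split the data into quarters.
    q1, q2, q3 = statistics.quantiles(ordered, n=4)
    return {"min": ordered[0], "q1": q1, "median": q2, "q3": q3, "max": ordered[-1]}


summary = five_number_summary(charges)

# The box spans Q1 to Q3, so its length is the interquartile range.
iqr = summary["q3"] - summary["q1"]

# Whisker lengths as described above: minimum to Q1 and Q3 to maximum.
lower_whisker = summary["q1"] - summary["min"]
upper_whisker = summary["max"] - summary["q3"]

print(summary)
print("IQR:", iqr)
print("whiskers:", lower_whisker, upper_whisker)
```

In this made-up sample the upper whisker is longer than the lower one, which is the pattern a right-skewed distribution produces.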
BOX PLOT AND OUTLIERS
Figure labels: Outliers, Whiskers, Minimum value, 1st Quartile, Median (2nd Quartile), 3rd Quartile, Maximum value
OUTLIER TREATMENT
First 25% of the data | Second 25% of the data | Third 25% of the data | Last 25% of the data
DEALING WITH MISSING VALUES
STANDARDIZING DATA

This process is also called feature scaling.

It is usually done when the columns of a dataset have very different ranges of values, to
ensure that the variables are on the same scale.
It can be done in two ways: normalization and standardisation.
Normalization uses the minimum and maximum values: x' = (x - min) / (max - min).
Standardisation uses the mean and standard deviation: z = (x - mean) / (standard deviation).
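The two scaling approaches can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the chapter's own code: the function names and the sample values are hypothetical, and the population standard deviation is used here, although the sample standard deviation (statistics.stdev) is an equally common choice.

```python
import statistics


def min_max_normalize(values):
    """Rescale values to the range 0..1 using the minimum and maximum."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so there is no spread to rescale.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Rescale values to mean 0 and standard deviation 1 (z-scores)."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # population standard deviation (assumption)
    if sd == 0:
        return [0.0 for _ in values]
    return [(v - mean) / sd for v in values]


# Hypothetical monthly charges, made up for illustration only.
charges = [100.0, 200.0, 300.0, 400.0, 500.0]

normalized = min_max_normalize(charges)
standardized = standardize(charges)

print(normalized)
print([round(z, 3) for z in standardized])
```

Normalization pins the smallest value to 0 and the largest to 1, so a single extreme value compresses everything else; standardisation is not bounded but keeps the result centred on 0.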
MEAN
MEDIAN
MODE
VARIANCE AND STANDARD DEVIATION
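The IQR can also be used to identify suspected outliers.
In general, a suspected outlier is a value that lies in either of the following two ranges
(here Q1 = 4 and Q3 = 15, so IQR = 11 and 1.5 × IQR = 16.5):
Below Q1 - 1.5 × IQR = 4 - 16.5 = -12.5
Above Q3 + 1.5 × IQR = 15 + 16.5 = 31.5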
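The two bounds above match the common 1.5 × IQR rule with Q1 = 4 and Q3 = 15. The sketch below applies that rule in Python. The data list is hypothetical, chosen only so that its quartiles reproduce those figures, and the function names are illustrative; quartiles come from statistics.quantiles with its default method.

```python
import statistics


def iqr_bounds(values, k=1.5):
    """Return (lower, upper) bounds: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(sorted(values), n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def suspected_outliers(values, k=1.5):
    """Return the values that fall below the lower or above the upper bound."""
    lower, upper = iqr_bounds(values, k)
    return [v for v in values if v < lower or v > upper]


# Hypothetical data chosen so that Q1 = 4 and Q3 = 15, as in the figures above.
data = [1, 3, 4, 6, 8, 10, 12, 14, 15, 18, 40]

lower, upper = iqr_bounds(data)
outliers = suspected_outliers(data)

print("bounds:", lower, upper)
print("suspected outliers:", outliers)
```

A value flagged this way is only a suspect; whether it is removed, capped or kept depends on what is known about how the data was collected.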
Independent Variables
Dependent Variables

A sample dataset
