EXPLORATORY DATA ANALYSIS
CHAPTER 3
“Introduction to Data Science: Practical Approach with R and Python”
B. Uma Maheswari and R. Sujatha
Copyright © 2021 Wiley India Pvt. Ltd. All rights reserved.
LEARNING OBJECTIVES
Apply the steps in data pre-processing.
Understand the data by inspecting and visualizing it.
Learn the concept of outliers and how to deal with them.
Deal with missing values during data pre-processing.
Understand the concept of standardization.
Apply R and Python programming for data analysis.
DATA SCIENCE PROCESS MODEL
Objectives of EDA
To develop an understanding of the data
To identify trends and patterns
To understand the relationships between variables
To decide on the appropriate models to be executed on the data
To find answers to questions relating to the data
To test assumptions
STEPS IN DATA PRE-PROCESSING
DATASET DESCRIPTION
S.No.  Column Name  Description
1      phoneno      Phone number of the customer
2      age          Age of the customer (1 -> 18-30, 2 -> 31-40, 3 -> 41-50, 4 -> Above 50)
3      gender       Gender of the customer (0 -> Male, 1 -> Female)
4      zipcode      Zip code of the area where the customer lives
5      calls        Number of calls made by the customer per month
6      sms          Number of SMS sent by the customer per month
7      mms          Number of MMS sent by the customer per month
8      charges      Monthly charges paid by the customer
9      coverage     Number of days out of coverage
10     complaint    Type of complaint (0 -> No problem, 1 -> Recharge issues, 2 -> Problems in the offer/package, 3 -> Network problem, 4 -> Call dropping)
11     sim          Single or dual SIM (0 -> Single SIM, 1 -> Dual SIM)
12     phone        Type of phone (0 -> Android, 1 -> iOS)
13     prepost      Prepaid or postpaid (0 -> Prepaid, 1 -> Postpaid)
14     churn        Customer churn (0 -> No churn, 1 -> Churn)
UNDERSTANDING THE DATA
Load the dataset
Dimensions of the data: dim, nrow, ncol, names
Structure of the dataset
Summary of the dataset
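A minimal Python sketch of these steps, using only the standard library. The rows are made up for illustration; only the column names come from the dataset description. dim, nrow, ncol and names are R functions, and the variables below are stand-ins for them. With pandas, the analogous attributes would be DataFrame.shape and DataFrame.columns.

```python
import csv
import io
import statistics

# Hypothetical sample rows; only the column names come from the dataset description.
SAMPLE_CSV = """phoneno,age,gender,calls,sms,charges,churn
9000000001,1,0,120,30,450.0,0
9000000002,3,1,80,10,300.0,1
9000000003,2,0,200,55,700.0,0
9000000004,4,1,60,5,250.0,1
"""

# Load the dataset: each row becomes a dict keyed by column name.
rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))

# Dimensions of the data (stand-ins for R's nrow, ncol, dim and names).
names = list(rows[0].keys())
nrow, ncol = len(rows), len(names)
print("dim:", (nrow, ncol))
print("names:", names)

# Structure: csv reads every field as text, so numeric columns are converted explicitly.
calls = [int(r["calls"]) for r in rows]

# Summary of one numeric column.
summary = {
    "min": min(calls),
    "median": statistics.median(calls),
    "mean": statistics.mean(calls),
    "max": max(calls),
}
print("calls summary:", summary)
```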
CONTINUOUS AND CATEGORICAL VARIABLES
Continuous variables are quantitative variables that can be measured and can take
any of an infinite number of values within a range. Mean, median and mode
can be calculated for continuous variables. Examples: height, weight,
speed of a vehicle.
Categorical variables take one of a finite number of distinct groups.
Examples: gender, pass/fail.
In simple words, if we can measure a variable it is continuous, and if we can
only count how many observations fall into each of its groups it is categorical.
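As a small illustration of the difference, the sketch below summarizes a continuous column with mean and median, and a categorical column with counts per group. The values are hypothetical; the column names and complaint codes follow the dataset description.

```python
from collections import Counter
import statistics

# Hypothetical values for two columns of the dataset.
charges = [450.0, 300.0, 700.0, 250.0, 300.0]   # continuous: measured amounts
complaint = [0, 3, 0, 4, 3, 0]                  # categorical: coded groups

# Codes taken from the dataset description.
complaint_labels = {
    0: "No problem",
    1: "Recharge issues",
    2: "Problems in the offer/package",
    3: "Network problem",
    4: "Call dropping",
}

# A continuous variable is summarized with measures such as mean and median.
mean_charges = statistics.mean(charges)
median_charges = statistics.median(charges)

# A categorical variable is summarized by counting observations per group.
complaint_counts = Counter(complaint_labels[c] for c in complaint)
most_frequent = complaint_counts.most_common(1)[0][0]

print("charges: mean =", mean_charges, "median =", median_charges)
print("complaint counts:", dict(complaint_counts))
print("most frequent complaint type:", most_frequent)
```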
NORMAL DISTRIBUTION
Figure: line drawing of a normal distribution (to be drawn)
RIGHT SKEWED AND LEFT SKEWED
Figure: line drawings of right-skewed and left-skewed distributions (to be drawn)
DATA VISUALIZATION
Histogram (continuous variables)
Barplot (categorical variables)
Boxplot (continuous variables)
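A histogram and a bar plot both draw counts: a histogram counts values per numeric bin, a bar plot counts observations per category. The sketch below computes those counts with the standard library and prints them as text bars; a plotting library would draw the same numbers. The data and the helper name histogram_counts are hypothetical.

```python
from collections import Counter

def histogram_counts(values, bin_width):
    """Count how many values fall in each bin [start, start + bin_width)."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))

# Hypothetical monthly call counts (continuous variable -> histogram).
calls = [12, 35, 47, 51, 58, 63, 66, 72, 88, 95]
hist = histogram_counts(calls, 20)
for start, count in hist.items():
    print(f"{start:>3} to {start + 19:<3} {'#' * count}")

# Hypothetical phone types (categorical variable -> bar plot).
phone = ["Android", "Android", "iOS", "Android", "iOS"]
bars = Counter(phone)
for category, count in bars.items():
    print(f"{category:<8} {'#' * count}")
```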
BOXPLOT
A box plot, also known as a box-and-whisker plot, gives a compact picture of the distribution of
quantitative data. It is used in exploratory data analysis to draw inferences from the data.
A box plot divides the data into quartiles.
The first 25% of the data lies between the minimum value and the start of the box, which is the first
quartile (Q1). This part is drawn as the lower whisker.
The second 25% of the data lies between the start of the box and the median, which is the second
quartile (Q2).
The third 25% of the data lies between the median and the end of the box, which is the third
quartile (Q3).
The last 25% of the data lies between the end of the box and the maximum value. This part is drawn as
the upper whisker.
The lengths of the whiskers and the position of the median within the box indicate the skewness of the data.
The box spans the interquartile range (IQR), which is the difference between the 75th and the 25th
percentiles (IQR = Q3 - Q1).
A box plot also indicates the presence of outliers.
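The numbers a box plot draws can be computed directly. The sketch below uses Python's statistics.quantiles on hypothetical data. The choice of method="inclusive" is an assumption: tools differ in how they interpolate quartiles, so R or another library may give slightly different values on the same data.

```python
import statistics

# Hypothetical sample, already sorted for readability.
data = [1, 3, 4, 6, 8, 11, 15, 18, 20]

# Three cut points that split the data into four quarters.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

five_number_summary = {
    "minimum": min(data),
    "Q1": q1,
    "median": q2,
    "Q3": q3,
    "maximum": max(data),
}
print(five_number_summary)
print("IQR =", iqr)

# The median sits closer to Q1 than to Q3 here, so the upper part of the
# box is longer, which hints at right skew.
print("median - Q1 =", q2 - q1, "| Q3 - median =", q3 - q2)
```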
BOX PLOT AND OUTLIERS
Figure: box plot with the following parts labelled: outliers, whiskers, minimum value, 1st quartile, median (2nd quartile), 3rd quartile and maximum value. The four segments are marked as the first, second, third and last 25% of the data.
OUTLIER TREATMENT
DEALING WITH MISSING VALUES
STANDARDIZING DATA
This process is also called feature scaling.
It is usually done when the ranges of values differ widely between the columns of a
dataset, and it ensures that the variables are on the same scale.
It can be done in two ways: normalization and standardization.
Normalization uses the minimum and maximum values of a column; standardization uses
its mean and standard deviation.
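A minimal sketch of the two approaches, written with the usual formulas: min-max normalization (x - min) / (max - min) and z-score standardization (x - mean) / standard deviation. The function names and sample values are hypothetical, and the population standard deviation (statistics.pstdev) is used as an assumption; the sample standard deviation is an equally common choice.

```python
import statistics

def min_max_normalize(values):
    """Rescale values to the range 0..1 using the minimum and maximum."""
    lo, hi = min(values), max(values)
    # Assumes hi > lo; a constant column would divide by zero.
    return [(v - lo) / (hi - lo) for v in values]

def z_score_standardize(values):
    """Rescale values to mean 0 and standard deviation 1."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)  # population standard deviation (assumption)
    return [(v - mu) / sigma for v in values]

# Hypothetical monthly charges: mean 500, population standard deviation 200.
charges = [200, 400, 400, 400, 500, 500, 700, 900]

normalized = min_max_normalize(charges)
standardized = z_score_standardize(charges)

print("normalized:  ", [round(v, 3) for v in normalized])
print("standardized:", standardized)
```

After normalization the smallest value maps to 0 and the largest to 1; after standardization each value is expressed as a number of standard deviations from the mean.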
MEAN
MEDIAN
MODE
VARIANCE AND STANDARD DEVIATION
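The measures named in the headings above can be computed with Python's statistics module. The sample values are hypothetical. The population formulas (pvariance, pstdev) are used here as an assumption; statistics.variance and statistics.stdev give the sample versions, which divide by n - 1 instead of n.

```python
import statistics

# Hypothetical sample values.
data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(data)          # sum of the values divided by their count
median = statistics.median(data)      # middle value of the sorted data
mode = statistics.mode(data)          # most frequent value
variance = statistics.pvariance(data) # average squared deviation from the mean
std_dev = statistics.pstdev(data)     # square root of the variance

print("mean =", mean)
print("median =", median)
print("mode =", mode)
print("variance =", variance)
print("standard deviation =", std_dev)
```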
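The IQR can also be used to identify suspected outliers.
In general, a value is a suspected outlier if it falls in either of the following two ranges:
below Q1 - 1.5 x IQR, or above Q3 + 1.5 x IQR.
Example with Q1 = 4 and Q3 = 15, so that IQR = 15 - 4 = 11 and 1.5 x IQR = 16.5:
Lower limit = Q1 - 1.5 x IQR = 4 - 16.5 = -12.5
Upper limit = Q3 + 1.5 x IQR = 15 + 16.5 = 31.5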
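The two limits above, -12.5 and 31.5, are consistent with Q1 = 4, Q3 = 15 and the common 1.5 x IQR rule. The sketch below reproduces that arithmetic and flags values outside the limits. The function name iqr_limits and the test values are hypothetical.

```python
def iqr_limits(q1, q3, k=1.5):
    """Return the (lower, upper) limits q1 - k*IQR and q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Quartiles matching the worked numbers: IQR = 11, 1.5 * IQR = 16.5.
lower, upper = iqr_limits(4, 15)
print("lower limit =", lower)
print("upper limit =", upper)

# Hypothetical values; anything strictly outside the limits is a suspected outlier.
values = [-20, 1, 8, 31.5, 40]
suspected = [v for v in values if v < lower or v > upper]
print("suspected outliers:", suspected)
```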
Figure: a sample dataset with the independent variables and the dependent variables labelled