EDA Unit3
UNIT III
EXPLORATORY DATA ANALYSIS
UNIVARIATE ANALYSIS - Introduction to Single variable: Distribution Variables -
Numerical Summaries of Level and Spread - Scaling and Standardizing – Inequality.
Heights (in cm): 164, 167.3, 170, 174.2, 178, 180, 186
Suppose that the heights of seven students in a class are recorded (above table). There is
only one variable, height, and the analysis does not deal with any cause or relationship.
Rajalakshmi Institute of Technology
(An Autonomous Institution), Affiliated to Anna University, Chennai
Department of Computer Science and Engineering
CCS346-Exploratory Data Analysis
Advertisement   Gender   Click rate
Ad1             Male     80
Ad3             Female   55
Ad1             Male     66
Ad3             Male     35
The click rates could be measured for both men and women, and relationships between
the variables can then be examined. Multivariate analysis is similar to bivariate analysis
but involves more than one dependent variable.
Key points in Multivariate analysis:
1. Analysis Techniques: How the analysis is performed on this data depends on
the goals to be achieved. Some of the techniques are regression analysis, principal
component analysis, path analysis, factor analysis, and multivariate analysis of
variance (MANOVA).
2. Goals of Analysis: The choice of analysis technique depends on the specific
goals of the study. For example, researchers may be interested in predicting one
variable based on others, identifying underlying factors that explain patterns, or
comparing group means across multiple variables.
Univariate analysis: It does not contain any dependent variable.
Example: height.
Bivariate analysis: It contains only one dependent variable.
Example: temperature and ice cream sales in summer vacation.
Multivariate analysis: It is similar to bivariate but contains more than 2 variables.
Example: Suppose an advertiser wants to compare the popularity of four advertisements
on a website. Their click rates could be measured for both men and women, and
relationships between the variables can be examined.
We could choose to perform univariate analysis on any of the individual variables in the
dataset to gain a better understanding of its distribution of values.
For example, we may choose to perform univariate analysis on the variable Household
Size:
Summary Statistics
We can calculate the following measures of central tendency for Household Size:
Mean (the average value): 3.8
Median (the middle value): 4
These values give us an idea of where the “center” value is located.
We can also calculate the following measures of dispersion:
Range (the difference between the max and min): 6
Interquartile Range (the spread of the middle 50% of values): 2.5
Standard Deviation (an average measure of spread): 1.87
These values give us an idea of how spread out the values are for this variable.
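These measures can be computed directly with Python's statistics module. The Household Size values below are illustrative, not the original table from the notes:

```python
# Sketch: central tendency and dispersion measures for a hypothetical sample.
import statistics

sizes = [2, 3, 3, 4, 4, 4, 5, 7]

mean = statistics.mean(sizes)          # average value
median = statistics.median(sizes)      # middle value
value_range = max(sizes) - min(sizes)  # max minus min
stdev = statistics.stdev(sizes)        # sample standard deviation

# Quartiles (inclusive method); the IQR is Q3 minus Q1.
q1, q2, q3 = statistics.quantiles(sizes, n=4, method="inclusive")
iqr = q3 - q1

print(mean, median, value_range, iqr, round(stdev, 2))
```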
Frequency Distributions
We can also create the following frequency distribution table to summarize how often
different values occur:
This allows us to quickly see that the most frequent household size is 4.
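A frequency distribution table like this can be produced with collections.Counter. The values below are illustrative:

```python
# Sketch: a frequency distribution table via collections.Counter.
from collections import Counter

sizes = [2, 3, 4, 4, 4, 5, 4, 3, 6, 4]
freq = Counter(sizes)

for value, count in sorted(freq.items()):
    print(f"Household size {value}: {count}")

most_common_size = freq.most_common(1)[0][0]
print("Most frequent household size:", most_common_size)
```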
Charts
We can create the following charts to help us visualize the distribution of values
for Household Size:
1. Boxplot
A boxplot is a plot that shows the five-number summary of a dataset.
The five-number summary includes:
The minimum value
The first quartile
The median value
The third quartile
The maximum value
Here’s what a boxplot would look like for the variable Household Size:
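The five-number summary that a boxplot displays can be computed directly (illustrative Household Size values, not the original dataset):

```python
# Sketch: minimum, Q1, median, Q3, maximum for a small sample.
import statistics

sizes = [1, 2, 3, 4, 4, 5, 6, 7]
q1, median, q3 = statistics.quantiles(sizes, n=4, method="inclusive")
five_number = (min(sizes), q1, median, q3, max(sizes))
print(five_number)
```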
2. Histogram
A histogram is a type of chart that uses vertical bars to display frequencies. This type of
chart is a useful way to visualize the distribution of values in a dataset.
Here’s what a histogram would look like for the variable Household Size:
3. Density Curve
A density curve is a curve on a graph that represents the distribution of values in a
dataset.
It’s particularly useful for visualizing the “shape” of a distribution, including whether or
not a distribution has one or more “peaks” of frequently occurring values and whether or
not the distribution is skewed to the left or the right.
Here’s what a density curve would look like for the variable Household Size:
4. Pie Chart
A pie chart is a type of chart that is shaped like a circle and uses slices to represent
proportions of a whole.
Here’s what a pie chart would look like for the variable Household Size:
Depending on the type of data, one of these charts may be more useful for visualizing the
distribution of values than the others.
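Underneath a histogram, each bar is just a bin count. A minimal sketch of that tallying, with illustrative values and a bin width of 2:

```python
# Sketch: count how many values fall into each fixed-width bin.
def histogram_counts(values, bin_width, low):
    """Map each bin's start value to the number of points falling in it."""
    counts = {}
    for v in values:
        bin_start = low + ((v - low) // bin_width) * bin_width
        counts[bin_start] = counts.get(bin_start, 0) + 1
    return counts

sizes = [1, 2, 2, 3, 4, 4, 4, 5, 6, 7]
print(histogram_counts(sizes, bin_width=2, low=1))
```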
Numerical Summaries of Level and Spread
Another way of expressing this is to say that the residual is the observed data value
minus the predicted value; in this case, 45 − 40 = 5. Any data value such as a
measurement of hours worked or income earned can be thought of as being
composed of two components: a fitted part and a residual part. This can be
expressed as an equation:
Data = Fit + Residual
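The Data = Fit + Residual identity can be checked with the worked numbers from the text (observed 45, fitted 40):

```python
# Sketch: the Data = Fit + Residual decomposition.
observed = 45
fitted = 40
residual = observed - fitted   # the part the fit does not explain
assert observed == fitted + residual
print(residual)  # 5
```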
3. Mode:
o The mode is the value that appears most frequently in a dataset. A dataset
may have one mode (unimodal), more than one mode (bimodal,
trimodal), or no mode (if all values occur with equal frequency).
Dataset: [1, 2, 2, 3, 2, 4, 1, 2]
Mode: 2 (appears most frequently)
Interpretation: The mode is 2, indicating it's the most common number
of cars per family.
4. Percentile:
o A percentile is a measure indicating the value below which a given
percentage of observations in a group of observations fall. For example,
the 25th percentile is the value below which 25% of observations fall.
5. Quartiles (Five-Number Summary):
o Quartiles divide a dataset into four equal parts. The five-number
summary includes:
Minimum: The smallest value in the dataset.
Q1 (First Quartile): The value below which 25% of the data fall.
Median (Second Quartile): The middle value of the dataset.
Q3 (Third Quartile): The value below which 75% of the data fall.
Maximum: The largest value in the dataset.
6. Standard Deviation:
o The standard deviation measures the amount of variation or dispersion of
a set of values. It is the square root of the variance.
7. Variance:
o Variance measures how far each number in the dataset is from the mean.
It is the average of the squared differences from the mean.
8. Range:
o The range is the difference between the maximum and minimum values
in a dataset.
Example:
Dataset: [25, 30, 35, 40, 45]
Range: 45 − 25 = 20
Interpretation: The range of ages is 20 years.
9. Proportion:
Proportion refers to a part or share of a whole. It is often expressed as a
fraction or percentage of the total.
Example:
Gender distribution in a survey: Male: 30, Female: 20
Proportion of Males: 30/50=0.6
Proportion of Females: 20/50=0.4
Interpretation: 60% of respondents are male, and 40% are female.
10. Correlation:
o Correlation measures the strength and direction of the linear relationship
between two variables. It ranges from -1 to 1, where:
1: Perfect positive correlation
0: No linear correlation
-1: Perfect negative correlation
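The measures listed above can be sketched with the standard library. The datasets reuse the worked examples; Pearson correlation is written out by hand rather than relying on a library call:

```python
# Sketch: mode, range, proportion, and Pearson correlation.
import math
import statistics

cars = [1, 2, 2, 3, 2, 4, 1, 2]     # dataset from the Mode example
mode = statistics.mode(cars)         # most frequent value

ages = [25, 30, 35, 40, 45]          # dataset from the Range example
age_range = max(ages) - min(ages)

males, females = 30, 20              # counts from the Proportion example
p_male = males / (males + females)

def pearson(xs, ys):
    """Pearson correlation: covariance over the product of spread terms."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

r = pearson([1, 2, 3], [2, 4, 6])    # a perfectly linear pair
print(mode, age_range, p_male, r)
```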
Summaries of spread
Measures of spread (also called measures of dispersion) tell you something about how
wide the set of data is. There are several basic measures of spread used in statistics. The
most common are:
1. The range (including the interquartile range and the interdecile range),
2. The standard deviation,
3. The variance,
4. Quartiles.
5. Midspread
The Range
The range tells you the distance between the lowest value and the highest value.
For example, if your minimum value is $10 and the maximum value is $100, then the
range is $90 ($100 − $10). A similar statistic is the interquartile range, which tells you
the range of the middle fifty percent of a set of data; in other words, it's where the bulk
of the data tends to lie.
Another, less common measure is the Semi Interquartile Range, which is one half of the
interquartile range.
2. Standard Deviation
The standard deviation tells you, percentage-wise, where a certain value falls. For
example, let's say you took a test and the scores were normally distributed (shaped like a
bell). You score one standard deviation above the mean. That tells you your score is
higher than about 84% of test takers.
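The "one standard deviation above the mean" claim can be verified with the standard library's NormalDist: for a normal distribution, the CDF at z = 1 is about 0.8413, i.e. the score beats roughly 84% of test takers.

```python
# Sketch: percentile of a score one standard deviation above the mean.
from statistics import NormalDist

percentile = NormalDist().cdf(1)   # standard normal, z = 1
print(round(percentile, 4))
```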
3. The Variance
The variance is a very simple statistic that gives you an extremely rough idea of how
spread out a data set is. As a measure of spread, it’s actually pretty weak. A large
variance of 22,000, for example, doesn’t tell you much about the spread of data — other
than it’s big! The most important reason the variance exists is to give you a way to find
the standard deviation: the standard deviation is the square root of variance.
4. Quartiles (Interquartile Range / Midspread)
o Description: The interquartile range is the difference between the third
quartile (Q3) and the first quartile (Q1), representing the middle 50% of
the dataset.
Scaling in Univariate Analysis
Scaling transforms the data to a common range.
1. Min-Max Scaling:
o Formula: x_scaled = (x − min) / (max − min)
o Description: Scales the data to a fixed range, usually [0, 1]. This method
is suitable when the distribution of the data is known and doesn't have
outliers significantly affecting the range.
2. Max Abs Scaling:
o Formula: x_scaled = x / max(|x|)
o Description: Scales the data to the [-1, 1] range by dividing all values by
the maximum absolute value in the dataset.
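Both scaling methods can be sketched in a few lines. The data is illustrative, and the code assumes non-constant input (otherwise the denominators would be zero):

```python
# Sketch: min-max scaling to [0, 1] and max-abs scaling to [-1, 1].
def min_max_scale(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def max_abs_scale(values):
    m = max(abs(v) for v in values)
    return [v / m for v in values]

data = [-2, 0, 2, 4]
print(min_max_scale(data))   # smallest maps to 0.0, largest to 1.0
print(max_abs_scale(data))   # every value divided by max |value| = 4
```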
Standardizing in Univariate Analysis
Standardizing transforms the data to have a mean of 0 and a standard deviation of 1.
This makes the distribution of the data more suitable for algorithms that assume a
normal distribution or that are sensitive to the scale of the data.
Method of Standardizing:
Standardization (Z-score Scaling):
o Formula: z = (x − μ) / σ, where x is a data value, μ is the mean, and σ is
the standard deviation.
Standardization:
Mean: μ = 20
Standard Deviation: σ = 7.07
Standardize Temperature:
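A minimal sketch of the standardization step. The temperature values below are an assumption chosen to match the stated mean (20) and standard deviation (about 7.07); they are not taken from the original notes:

```python
# Sketch: z-score standardization. The temperatures are assumed data
# matching the stated mean (20) and population std. deviation (~7.07).
import statistics

temps = [10, 15, 20, 25, 30]
mu = statistics.mean(temps)        # 20
sigma = statistics.pstdev(temps)   # population standard deviation, ~7.07

z_scores = [(t - mu) / sigma for t in temps]
print([round(z, 2) for z in z_scores])   # standardized values, mean 0
```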
Types of Inequality
1. Statistical Inequality:
o Description: Statistical measures that quantify the differences or
disparities within a dataset or between datasets.
o Examples: Variance, standard deviation, range, interquartile range
(IQR), coefficient of variation.
2. Income and Wealth Inequality:
o Description: Disparities in income and wealth distribution across
individuals or groups within a population.
o Measures: Gini coefficient, Lorenz curve, percentile ratios (e.g., top 10%
vs bottom 10% income).
3. Social and Economic Inequality:
o Description: Disparities in access to resources, opportunities, and
outcomes among different social or economic groups.
o Examples: Educational attainment, employment rates, healthcare access,
poverty rates.
Measures of Inequality
1. Variance:
o Description: Measures the spread or dispersion of values around the
mean in a dataset.
2. Standard Deviation:
o Description: Measures the amount of variation of values around the
mean; it is the square root of the variance.
3. Range:
o Description: The difference between the maximum and minimum values in a
dataset.
4. Gini Coefficient:
o Description: Measures income or wealth distribution inequality across a
population, ranging from 0 (perfect equality) to 1 (maximum inequality).
o Calculation: Often derived from a Lorenz curve, which plots cumulative income
or wealth against the cumulative share of the population.
5. Interquartile Range (IQR):
o Description: Range between the first quartile (Q1) and the third quartile (Q3),
representing the middle 50% of the dataset.
o Experian:
Data: Credit scores, income estimates, consumer data.
Website: experian.com
7. Open Data Portals:
o data.gov:
Data: U.S. government’s open data portal, including various
economic datasets.
Website: data.gov
o World Bank Open Data:
Data: Global income, economic indicators.
Website: data.worldbank.org
Quantile Shares
Quantile shares are used to understand the proportion of the total sum of the variable
(e.g., income) held by different quantiles of the population.
1. Top Decile Share:
o Description: The percentage of total income held by the top 10% of
earners.
o Example: If the total income in a population is $1,000,000 and the top
10% of earners collectively make $400,000, the top decile share is 40%.
2. Top 1% Share:
o Description: The percentage of total income held by the top 1% of
earners.
o Example: If the total income is $1,000,000 and the top 1% earn
$200,000, the top 1% share is 20%.
3. Bottom Quintile Share:
o Description: The percentage of total income held by the bottom 20% of
earners.
o Example: If the bottom 20% of earners collectively make $50,000 out of
a total of $1,000,000, the bottom quintile share is 5%.
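Quantile shares can be computed by ranking earners and summing the top slice. The income list below is constructed so that the top 10% of earners hold $400,000 of a $1,000,000 total, matching the top decile example:

```python
# Sketch: share of total income held by the top `fraction` of earners.
def share_of_top(incomes, fraction):
    ranked = sorted(incomes, reverse=True)
    k = max(1, int(len(ranked) * fraction))
    return sum(ranked[:k]) / sum(ranked)

incomes = [400_000.0] + [600_000.0 / 9] * 9   # ten earners, ~$1,000,000 total
top_decile_share = share_of_top(incomes, 0.10)
print(round(top_decile_share, 2))   # 0.4
```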
Cumulative income shares and Lorenz curves
Cumulative income shares and Lorenz curves are essential tools in understanding
and visualizing economic inequality. They help depict the distribution of income or
wealth across a population, making it easier to identify disparities.
Lorenz Curves
The Lorenz curve is a graphical representation used to illustrate the distribution
of income or wealth within a population. It shows the proportion of total income earned
by the cumulative percentages of the population. This curve is a fundamental tool in
economics and inequality studies, providing insights into the degree of income or wealth
inequality.
Constructing the Lorenz Curve
Steps to Construct a Lorenz Curve:
1. Sort the Population by Income: Arrange individuals or households in
ascending order of income.
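The construction can be sketched end to end: sort incomes in ascending order, then plot the cumulative population share against the cumulative income share (the incomes below are illustrative):

```python
# Sketch: the points of a Lorenz curve.
def lorenz_points(incomes):
    ranked = sorted(incomes)   # step 1: sort by income, ascending
    total = sum(ranked)
    n = len(ranked)
    points = [(0.0, 0.0)]      # the curve starts at the origin
    cumulative = 0
    for i, income in enumerate(ranked, start=1):
        cumulative += income
        points.append((i / n, cumulative / total))
    return points

print(lorenz_points([10, 20, 30, 40]))
```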
1. Gini Coefficient
Description: Measures income or wealth inequality, ranging from 0 (perfect
equality) to 1 (maximum inequality).
Formula: G = Σi Σj |yi − yj| / (2 n² μ)
where yi and yj are individual incomes, n is the number of individuals, and μ is
the mean income.
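The pairwise mean-absolute-difference form of the Gini coefficient translates directly into code (an O(n²) double loop, fine for small illustrative datasets):

```python
# Sketch: Gini coefficient from pairwise absolute income differences.
def gini(incomes):
    n = len(incomes)
    mu = sum(incomes) / n
    diff_sum = sum(abs(yi - yj) for yi in incomes for yj in incomes)
    return diff_sum / (2 * n * n * mu)

print(gini([1, 2, 3, 4]))    # 0.25 -> moderate inequality
print(gini([5, 5, 5, 5]))    # 0.0 -> perfect equality
```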
2. Theil Index
Description: Measures entropy or disorder in the income distribution. It has two
versions: Theil T and Theil L.
Range: 0 (perfect equality) to ∞ (perfect inequality).
Formula (Theil T): T = (1/n) Σi (yi / μ) ln(yi / μ)
where yi is individual income, n is the number of individuals, and μ is the mean
income.
3. Atkinson Index
Description: Measures inequality with sensitivity set by an inequality aversion
parameter.
Formula: A = 1 − (1/μ) [ (1/n) Σi yi^(1−ϵ) ]^(1/(1−ϵ))
where yi is individual income, n is the number of individuals, μ is the
mean income, and ϵ is the inequality aversion parameter.
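Both entropy-style indices can be sketched directly. The Theil T follows the standard formula; the second function implements the Atkinson index, assuming that is what the ϵ-parameterized formula here refers to, with ϵ = 0.5 as an arbitrary illustrative choice:

```python
# Sketch: Theil T index and an Atkinson-style index (eps is illustrative).
import math

def theil_t(incomes):
    n = len(incomes)
    mu = sum(incomes) / n
    return sum((y / mu) * math.log(y / mu) for y in incomes) / n

def atkinson(incomes, eps=0.5):
    n = len(incomes)
    mu = sum(incomes) / n
    ede = (sum(y ** (1 - eps) for y in incomes) / n) ** (1 / (1 - eps))
    return 1 - ede / mu   # 0 means perfect equality

print(theil_t([10, 10, 10]))            # 0.0 -> perfect equality
print(round(theil_t([1, 2, 3, 10]), 3))
```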
Smoothing Time Series
Introduction to Time Series
A time series is a sequence of data points recorded at successive points in time,
typically at uniform intervals. Examples include daily stock prices, monthly sales data,
and annual GDP growth rates. Time series analysis involves understanding the
underlying patterns and predicting future values based on historical data.
Smoothing in Time Series
Smoothing is a technique used in time series analysis to reduce noise and reveal
underlying trends. By applying smoothing methods, we can better understand the
structure and patterns within the data, making it easier to identify trends, cycles, and
other important features.
Aim of Smoothing
The primary aim of smoothing is to filter out short-term fluctuations and highlight
longer-term trends and patterns in the data. This makes it easier to:
1. Identify Trends: Detect long-term upward or downward movements in the data.
2. Remove Noise: Eliminate random variations that may obscure the true signal.
3. Enhance Interpretability: Make the data easier to visualize and interpret.
Opinion Poll
In the context of time series, an opinion poll can refer to a series of survey results
collected over time. Smoothing these results can help identify trends in public opinion,
such as shifts in political preferences or consumer confidence.
Refinement
Refinement in time series smoothing refers to the process of selecting and applying the
appropriate smoothing techniques to improve the clarity and interpretability of the data.
This involves choosing the right method (e.g., moving averages, exponential smoothing)
and the right parameters (e.g., window size, smoothing factor).
Meaning of Residual
Residuals are the differences between the observed values and the values predicted by
the smoothing model. In other words, residuals represent the noise or irregularities that
remain after smoothing. Analyzing residuals is crucial for understanding the
effectiveness of the smoothing process and identifying any remaining patterns or
anomalies.
Common Smoothing Techniques
1. Moving Averages
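A simple moving average replaces each point with the mean of a window of k consecutive observations, so the smoothed series is shorter by k − 1 points. A minimal sketch with illustrative data:

```python
# Sketch: a trailing moving average of window k.
def moving_average(series, k):
    return [sum(series[i:i + k]) / k for i in range(len(series) - k + 1)]

sales = [10, 12, 11, 15, 14, 18, 17]   # illustrative monthly sales
print(moving_average(sales, 3))
```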
End Point of Smoothing
Smoothing can be considered complete when:
1. Trends and Patterns Are Clear: The primary trend and significant seasonal
patterns are visible without the distraction of random noise.
2. Residuals Analysis: Residuals should appear random, indicating that the
smoothing process has effectively captured the systematic part of the data.
3. Ready for Further Analysis: The smoothed data is in a form that can be used
for further analysis, forecasting, and decision-making.
By reaching this end point, analysts can confidently move forward with additional
analytical steps, knowing that the core patterns in their time series data have been
effectively isolated and understood.
Conclusion
Smoothing is a vital technique in time series analysis, particularly in exploratory
data analysis (EDA). It helps to filter out noise and highlight important trends, making
the data easier to interpret and analyze. By applying appropriate smoothing techniques
and analyzing residuals, we can gain valuable insights into the underlying patterns and
structures within time series data.