
Rajalakshmi Institute of Technology

(An Autonomous Institution), Affiliated to Anna University, Chennai


Department of Computer Science and Engineering
CCS346-Exploratory Data Analysis

UNIT III
EXPLORATORY DATA ANALYSIS
UNIVARIATE ANALYSIS - Introduction to Single variable: Distribution Variables -
Numerical Summaries of Level and Spread - Scaling and Standardizing – Inequality.

Introduction to univariate analysis:

Univariate analysis is the simplest form of analyzing data. “Uni” means “one”; in other words, your data has only one variable. It does not deal with causes or relationships (unlike regression), and its major purpose is to describe: it takes data, summarizes that data and finds patterns in the data.
What is a variable in univariate analysis?
A variable in univariate analysis is just a condition or subset that your data falls into. You can think of it as a “category.” For example, the analysis might look at a variable of “age”, or it might look at “height” or “weight”. However, it does not look at more than one variable at a time; otherwise the analysis becomes bivariate or multivariate.
 Univariate data:
Univariate data refers to a type of data in which each observation or data point
corresponds to a single variable. In other words, it involves the measurement or
observation of a single characteristic or attribute for each individual or item in the
dataset. Analyzing univariate data is the simplest form of analysis in statistics.

Heights (in cm) 164 167.3 170 174.2 178 180 186

Suppose that the heights of seven students in a class are recorded (table above). There is only one variable, which is height, and the analysis does not deal with any cause or relationship.

 Key points in Univariate analysis:


1. No Relationships: Univariate analysis focuses solely on describing and
summarizing the distribution of the single variable. It does not explore relationships
between variables or attempt to identify causes.
2. Descriptive Statistics: Descriptive statistics, such as measures of central
tendency (mean, median, mode) and measures of dispersion (range, standard deviation),
are commonly used in the analysis of univariate data.
3. Visualization: Histograms, box plots, and other graphical representations are
often used to visually represent the distribution of the single variable.
 Bivariate data
Bivariate data involves two different variables, and the analysis of this type of data focuses on understanding the relationship or association between these two variables. An example of bivariate data is temperature and ice cream sales in the summer season (table 2).

Temperature (°C)    Ice Cream Sales
20                  2000
25                  2500
35                  5000

Suppose the temperature and ice cream sales are the two variables of a bivariate dataset (table 2). Here, the relationship is visible from the table: temperature and sales are directly proportional to each other, and thus related, because as the temperature increases, the sales also increase.


Key points in Bivariate analysis:


1. Relationship Analysis: The primary goal of analyzing bivariate data is to
understand the relationship between the two variables. This relationship could be
positive (both variables increase together), negative (one variable increases while the
other decreases), or show no clear pattern.
2. Scatterplots: A common visualization tool for bivariate data is a scatterplot,
where each data point represents a pair of values for the two variables. Scatterplots help
visualize patterns and trends in the data.
3. Correlation Coefficient: A quantitative measure called the correlation
coefficient is often used to quantify the strength and direction of the linear relationship
between two variables. The correlation coefficient ranges from -1 to 1.
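As an illustration, here is a minimal Python sketch (assuming NumPy and Matplotlib are available) that computes the correlation coefficient and draws a scatterplot for the temperature and ice cream sales values from table 2:

import numpy as np
import matplotlib.pyplot as plt

# Temperature and ice cream sales from the bivariate example (table 2)
temperature = np.array([20, 25, 35])
sales = np.array([2000, 2500, 5000])

# Pearson correlation coefficient (ranges from -1 to 1)
r = np.corrcoef(temperature, sales)[0, 1]
print(f"Correlation coefficient: {r:.2f}")   # close to +1, a strong positive relationship

# Scatterplot: each point is one (temperature, sales) pair
plt.scatter(temperature, sales)
plt.xlabel("Temperature (°C)")
plt.ylabel("Ice cream sales")
plt.title("Bivariate data: temperature vs sales")
plt.show()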
 Multivariate data
Multivariate data refers to datasets where each observation or sample point consists of
multiple variables or features. These variables can represent different aspects,
characteristics, or measurements related to the observed phenomenon. When dealing
with three or more variables, the data is specifically categorized as multivariate.
An example of this type of data: suppose an advertiser wants to compare the popularity of four advertisements on a website.

Advertisement    Gender    Click rate
Ad1              Male      80
Ad3              Female    55
Ad2              Female    123
Ad1              Male      66
Ad3              Male      35

The click rates could be measured for both men and women and relationships between
variables can then be examined. It is similar to bivariate but contains more than one
dependent variable.
Key points in Multivariate analysis:
1. Analysis Techniques: The ways to perform analysis on this data depends on
the goals to be achieved. Some of the techniques are regression analysis, principal
component analysis, path analysis, factor analysis and multivariate analysis of
variance (MANOVA).
2. Goals of Analysis: The choice of analysis technique depends on the specific
goals of the study. For example, researchers may be interested in predicting one
variable based on others, identifying underlying factors that explain patterns, or
comparing group means across multiple variables.
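As a small illustration, here is a minimal Python sketch (assuming pandas is available) that loads the advertisement click-rate table above and examines mean click rates across the variables with a group-by:

import pandas as pd

# Click-rate data from the multivariate example table above
ads = pd.DataFrame({
    "Advertisement": ["Ad1", "Ad3", "Ad2", "Ad1", "Ad3"],
    "Gender":        ["Male", "Female", "Female", "Male", "Male"],
    "Click rate":    [80, 55, 123, 66, 35],
})

# Mean click rate by advertisement and gender (three variables considered together)
print(ads.groupby(["Advertisement", "Gender"])["Click rate"].mean())

# Mean click rate by gender alone, for comparison
print(ads.groupby("Gender")["Click rate"].mean())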


3. Interpretation: Multivariate analysis allows for a more nuanced interpretation


of complex relationships within the data. It helps uncover patterns that may not be
apparent when examining variables individually.
There are a lot of different tools, techniques and methods that can be used to conduct the analysis. You could use software libraries, visualization tools and statistical testing methods. The following table compares univariate, bivariate and multivariate analysis.
Difference between Univariate, Bivariate and Multivariate data

Univariate | Bivariate | Multivariate
It summarizes only a single variable at a time. | It summarizes only two variables. | It summarizes more than two variables.
It does not deal with causes and relationships. | It does deal with causes and relationships, and analysis is done. | It does not deal with causes and relationships, and analysis is done.
It does not contain any dependent variable. | It contains only one dependent variable. | It is similar to bivariate but contains more than two variables.
The main purpose is to describe. | The main purpose is to explain. | The main purpose is to study the relationships among the variables.
Example: the height of students. | Example: temperature and ice cream sales in the summer season. | Example: an advertiser compares the popularity of four advertisements on a website; click rates are measured for both men and women, and the relationships between the variables are then examined.

Univariate Descriptive Statistics


The term univariate analysis refers to the analysis of one variable. You can
remember this because the prefix “uni” means “one.”
The purpose of univariate analysis is to understand the distribution of values for a single
variable. You can contrast this type of analysis with the following:
 Bivariate Analysis: The analysis of two variables.

 Multivariate Analysis: The analysis of more than two variables.


For example, suppose we have the following dataset:

We could choose to perform univariate analysis on any of the individual variables in the
dataset to gain a better understanding of its distribution of values.
For example, we may choose to perform univariate analysis on the variable Household
Size:


There are three common ways to perform univariate analysis:


1. Summary Statistics
The most common way to perform univariate analysis is to describe a variable
using summary statistics.
There are two popular types of summary statistics:
 Measures of central tendency: these numbers describe where the center of a
dataset is located. Examples include the mean and the median.
 Measures of dispersion: these numbers describe how spread out the values are
in the dataset. Examples include the range, interquartile range, standard deviation,
and variance.
2. Frequency Distributions
Another way to perform univariate analysis is to create a frequency distribution, which
describes how often different values occur in a dataset.
3. Charts
Yet another way to perform univariate analysis is to create charts to visualize the
distribution of values for a certain variable.
Common examples include:
 Boxplots
 Histograms
 Density Curves
 Pie Charts
The following examples show how to perform each type of univariate analysis using
the Household Size variable from our dataset mentioned earlier:

Summary Statistics
We can calculate the following measures of central tendency for Household Size:
 Mean (the average value): 3.8
 Median (the middle value): 4
These values give us an idea of where the “center” value is located.
We can also calculate the following measures of dispersion:
 Range (the difference between the max and min): 6
 Interquartile Range (the spread of the middle 50% of values): 2.5
 Standard Deviation (an average measure of spread): 1.87
These values give us an idea of how spread out the values are for this variable.
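The following minimal Python sketch shows how these measures could be computed with pandas. The household-size values below are an assumption chosen to be consistent with the summary statistics quoted above, since the original dataset table is not reproduced here:

import pandas as pd

# Hypothetical household sizes, consistent with the statistics quoted above
household_size = pd.Series([1, 2, 2, 3, 4, 4, 4, 5, 6, 7])

mean = household_size.mean()                                          # 3.8
median = household_size.median()                                      # 4
rng = household_size.max() - household_size.min()                     # 6
iqr = household_size.quantile(0.75) - household_size.quantile(0.25)   # 2.5
std = household_size.std()                                            # about 1.87 (sample standard deviation)

print(mean, median, rng, iqr, std)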
Frequency Distributions
We can also create the following frequency distribution table to summarize how often
different values occur:


This allows us to quickly see that the most frequent household size is 4.
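A frequency distribution for the same hypothetical household-size values can be produced with value_counts; a minimal sketch is shown here:

import pandas as pd

household_size = pd.Series([1, 2, 2, 3, 4, 4, 4, 5, 6, 7])   # same assumed values as above

freq = household_size.value_counts().sort_index()
print(freq)                # absolute frequencies; 4 occurs most often
print(freq / freq.sum())   # relative frequencies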
Charts
We can create the following charts to help us visualize the distribution of values
for Household Size:
1. Boxplot
A boxplot is a plot that shows the five-number summary of a dataset.
The five-number summary includes:
 The minimum value
 The first quartile
 The median value
 The third quartile
 The maximum value
Here’s what a boxplot would look like for the variable Household Size:


2. Histogram
A histogram is a type of chart that uses vertical bars to display frequencies. This type of
chart is a useful way to visualize the distribution of values in a dataset.
Here’s what a histogram would look like for the variable Household Size:

3. Density Curve
A density curve is a curve on a graph that represents the distribution of values in a
dataset.


It’s particularly useful for visualizing the “shape” of a distribution, including whether or
not a distribution has one or more “peaks” of frequently occurring values and whether or
not the distribution is skewed to the left or the right.
Here’s what a density curve would look like for the variable Household Size:
4. Pie Chart
A pie chart is a type of chart that is shaped like a circle and uses slices to represent
proportions of a whole.
Here’s what a pie chart would look like for the variable Household Size:

Depending on the type of data, one of these charts may be more useful for visualizing the
distribution of values than the others.
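Since the original figures are not reproduced here, the following minimal Python sketch (assuming pandas and Matplotlib, with SciPy installed for the density curve) shows how all four charts could be drawn for the same hypothetical household-size values:

import pandas as pd
import matplotlib.pyplot as plt

household_size = pd.Series([1, 2, 2, 3, 4, 4, 4, 5, 6, 7])   # same assumed values as above

fig, axes = plt.subplots(2, 2, figsize=(10, 8))
household_size.plot.box(ax=axes[0, 0], title="Boxplot")                    # five-number summary
household_size.plot.hist(ax=axes[0, 1], bins=7, title="Histogram")         # frequencies as vertical bars
household_size.plot.kde(ax=axes[1, 0], title="Density curve")              # smoothed shape of the distribution
household_size.value_counts().plot.pie(ax=axes[1, 1], title="Pie chart")   # slices as proportions of the whole
plt.tight_layout()
plt.show()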
Numerical Summaries of Level and Spread


Data can be described numerically by various statistics, or statistical measures.


These statistical measures are often grouped in 3 categories:
1. Measures of central tendency
2. Measures of position
3. Measures of dispersion
In modern society there is considerable interest in the length of time people
spend at work. The measurement of hours that people work is important
when analysing a variety of economic and social phenomena. The number
of hours worked is a measure of labour input that can be used to derive
key measures of productivity and labour costs. The patterns of hours
worked and comparisons of the hours worked by different groups within
society give important evidence for studying and understanding lifestyles,
the labour market and social changes.
Here we focus on working hours to demonstrate how simple descriptive statistics can be used to provide numerical summaries of level and spread, by examining data on working hours in Britain taken from the General Household Survey discussed in the previous chapter. These data are used to illustrate measures of level, such as the mean and the median, and measures of spread or variability, such as the standard deviation and the midspread.
Working hours of couples in Britain
The histograms of the working hours distributions of men and women in
the 2005 General Household Survey are shown in figures 2.1 and 2.2.


We can compare these two distributions in terms of the four features introduced in the previous chapter, namely level, spread, shape and outliers.
Summaries of level
The level expresses where on the scale of numbers found in the dataset the distribution is concentrated. In the previous example, it expresses where, on a scale running from 1 hour per week to 100 hours per week, the distribution's centre point lies.
Residuals
A residual can be defined as the difference between a data point and the
observed typical, or average, value. For example if we had chosen 40 hours
a week as the typical level of men's working hours, using data from the
General Household Survey in 2005, then a man who was recorded in the survey
as working 45 hours a week would have a residual of 5 hours. Another way of

expressing this is to say that the residual is the observed data value minus the
predicted value and in this case 45-40 = 5. Any data value such as a
measurement of hours worked or income earned can be thought of as being
composed of two components: a fitted part and a residual part. This can be
expressed as an equation:
Data = Fit + Residual
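To make the Data = Fit + Residual decomposition concrete, here is a tiny Python sketch. The fit of 40 hours and the observation of 45 hours come from the example above; the other observed values are assumptions added purely for illustration:

# Data = Fit + Residual for weekly working hours
fit = 40                                  # typical level chosen for men's working hours
observed_hours = [38, 45, 52, 40, 36]     # hypothetical observations; 45 matches the example in the text

residuals = [hours - fit for hours in observed_hours]
print(residuals)                          # the man working 45 hours has a residual of 45 - 40 = 5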

1. Mean (Arithmetic Mean):


o The mean is the sum of all values in a dataset divided by the number of
values. It is a measure of central tendency.
Mean = (x1 + x2 + … + xn) / n = (Σ xi) / n, where the xi are the individual values and n is the number of values.
Ex: Dataset: Exam scores of students in a class: [85, 90, 75, 88, 92]
Mean: (85 + 90 + 75 + 88 + 92) / 5 = 430 / 5 = 86
Interpretation: The average exam score in the class is 86.


2. Median:
o The median is the middle value in a sorted dataset. It divides the dataset into two equal halves and is less sensitive to outliers than the mean.
o Rule: for an odd number of values, the median is the single middle value; for an even number of values, it is the average of the two middle values.

3. Mode:
o The mode is the value that appears most frequently in a dataset. A dataset
may have one mode (unimodal), more than one mode (bimodal,
trimodal), or no mode (if all values occur with equal frequency).
Dataset (number of cars per family): [1, 2, 2, 3, 2, 4, 1, 2]
Mode: 2 (appears most frequently)
Interpretation: The mode is 2, indicating that two cars per family is the most common value.
4. Percentile:
o A percentile is a measure indicating the value below which a given
percentage of observations in a group of observations fall. For example,


the 25th percentile is the value below which 25% of observations fall.
5. Quartiles (Five-Number Summary):
o Quartiles divide a dataset into four equal parts. The five-number
summary includes:
 Minimum: The smallest value in the dataset.
 Q1 (First Quartile): The value below which 25% of the data fall.
 Median (Second Quartile): The middle value of the dataset.
 Q3 (Third Quartile): The value below which 75% of the data fall.
 Maximum: The largest value in the dataset.
6. Standard Deviation:
o The standard deviation measures the amount of variation or dispersion of a set of values. It is the square root of the variance: σ = √( Σ (xi − μ)² / n ), where μ is the mean of the dataset.


7. Variance:
o Variance measures how far each number in the dataset is from the mean. It is the average of the squared differences from the mean: σ² = Σ (xi − μ)² / n.

8. Range:


o The range is the difference between the maximum and minimum values
in a dataset.

Example:
Dataset: [25, 30, 35, 40, 45]
Range: 45 - 25 = 20
Interpretation: The range of ages is 20 years.

9. Proportion:
Proportion refers to a part or share of a whole. It is often expressed as a
fraction or percentage of the total.
Example:
 Gender distribution in a survey: Male: 30, Female: 20
 Proportion of Males: 30/50=0.6
 Proportion of Females: 20/50=0.4
 Interpretation: 60% of respondents are male, and 40% are female.

10. Correlation:
o Correlation measures the strength and direction of the linear relationship
between two variables. It ranges from -1 to 1, where:
 1: Perfect positive correlation
 -1: Perfect negative correlation
 0: No correlation
These statistical measures are fundamental in analyzing and summarizing data,
providing insights into its central tendency, spread, variation, and relationships between
variables.
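As a quick check of several of these measures, here is a minimal Python sketch (assuming NumPy) applied to the ages dataset [25, 30, 35, 40, 45] used in the range example:

import numpy as np

ages = np.array([25, 30, 35, 40, 45])

print(np.percentile(ages, 25))      # first quartile (Q1)
print(np.median(ages))              # median (second quartile)
print(np.percentile(ages, 75))      # third quartile (Q3)
print(ages.max() - ages.min())      # range = 45 - 25 = 20
print(ages.var())                   # population variance
print(ages.std())                   # population standard deviation (square root of the variance)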

Summaries of spread
Measures of spread (also called measures of dispersion) tell you something about how
wide the set of data is. There are several basic measures of spread used in statistics. The
most common are:
1. The range (including the interquartile range and the interdecile range),
2. The standard deviation,
3. The variance,
4. Quartiles.
5. Midspread


1. The Range

The Range tells you how much is in between the lowest value
(start) and highest value (end).
The range is a basic statistic that tells you the range of values. For example, if your
minimum value is $10 and the maximum value is $100 then the range is $90 ($100 –
$10). A similar statistic is the interquartile range, which tells you the range in the middle
fifty percent of a set of data; in other words, it’s where the bulk of data tends to lie.
Another, less common measure is the Semi Interquartile Range, which is one half of the
interquartile range.
2. Standard Deviation

Simply put, the standard deviation is a measure of how spread out data is around the center of the distribution (the mean). It also gives you an idea of


where, percentage-wise, a certain value falls. For example, let's say you took a test and it was normally distributed (shaped like a bell), and you score one standard deviation above the mean. That tells you that your score is higher than about 84% of test takers (roughly the 84th percentile).
3. The Variance
The variance is a very simple statistic that gives you an extremely rough idea of how
spread out a data set is. As a measure of spread, it’s actually pretty weak. A large
variance of 22,000, for example, doesn’t tell you much about the spread of data — other
than it’s big! The most important reason the variance exists is to give you a way to find
the standard deviation: the standard deviation is the square root of variance.
4. Quartiles

A set of numbers (-2, -1, 0, 1, 2) can be divided into four quartiles. Quartiles divide your data set into quarters according to where those numbers fall on the number line. Like the variance, the quartile isn't very useful on its own. Instead, it's used to find more useful values like the interquartile range.
5. Midspread
"Midspread" generally refers to measures that describe the central tendency or the
spread of values around the median of a dataset. Here are a few common measures of
midspread:
1. Interquartile Range (IQR):

o Description: The difference between the third quartile (Q3) and the first
quartile (Q1), representing the middle 50% of the dataset.

Scaling and Standardizing


Scaling in Univariate Analysis
Scaling adjusts the range of values of a variable to a specific interval, making it easier to
compare variables that initially had different ranges. It is particularly useful when the
variables have different units or scales.
Methods of Scaling:
1. Min-Max Scaling (Normalization):
o Formula: x' = (x - min) / (max - min)
o Description: Scales the data to a fixed range, usually [0, 1]. This method is suitable when the distribution of the data is known and doesn't have outliers significantly affecting the range.
2. Max Abs Scaling:

o Formula: x' = x / max(|x|)
o Description: Scales the data to the [-1, 1] range by dividing all values by the maximum absolute value in the dataset.
Standardizing in Univariate Analysis
Standardizing transforms the data to have a mean of 0 and a standard deviation of 1.
This makes the distribution of the data more suitable for algorithms that assume a
normal distribution or that are sensitive to the scale of the data.
Method of Standardizing:
 Standardization (Z-score Scaling):
o Formula: z = (x - μ) / σ, where μ is the mean and σ is the standard deviation.
o Description: Centers the data around 0 with a standard deviation of 1. It is less affected by outliers than min-max scaling and is suitable when the distribution of the data is approximately normal or when using algorithms like PCA (Principal Component Analysis).
Example:
Consider a dataset of monthly temperatures (in Celsius):
Temperatures=[10,15,20,25,30]
Min-Max Scaling: scaled values = (x - 10) / (30 - 10) = [0, 0.25, 0.5, 0.75, 1]


Standardization:
 Mean μ = 20
 Standard Deviation σ = 7.07
 Standardized temperatures: z = (x - 20) / 7.07 ≈ [-1.41, -0.71, 0, 0.71, 1.41]

 In univariate analysis, these techniques ensure that data is uniformly scaled or


standardized, making it easier to interpret statistical measures like mean, median,
variance, and also preparing it for further advanced analyses or modeling. These
transformations are crucial for ensuring that the inherent characteristics of the
data do not bias or skew the results of the analysis.
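The following minimal Python sketch (assuming NumPy) applies these transformations to the temperature example above and reproduces the values just quoted:

import numpy as np

temps = np.array([10, 15, 20, 25, 30], dtype=float)     # monthly temperatures from the example

# Min-max scaling to [0, 1]
min_max = (temps - temps.min()) / (temps.max() - temps.min())
print(min_max)                     # [0.   0.25 0.5  0.75 1.  ]

# Max-abs scaling to [-1, 1]
max_abs = temps / np.abs(temps).max()
print(max_abs)

# Standardization (z-scores) using the population mean and standard deviation
z = (temps - temps.mean()) / temps.std()
print(z)                           # approximately [-1.41 -0.71  0.    0.71  1.41]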
Inequality
Inequality in the context of statistics and data analysis typically refers to disparities or
differences between variables, groups, or distributions. It's a fundamental concept used
to describe relationships and variations within datasets. Here's a breakdown of how
inequality is understood and measured:


Types of Inequality
1. Statistical Inequality:
o Description: Statistical measures that quantify the differences or
disparities within a dataset or between datasets.
o Examples: Variance, standard deviation, range, interquartile range
(IQR), coefficient of variation.
2. Income and Wealth Inequality:
o Description: Disparities in income and wealth distribution across
individuals or groups within a population.
o Measures: Gini coefficient, Lorenz curve, percentile ratios (e.g., top 10%
vs bottom 10% income).
3. Social and Economic Inequality:
o Description: Disparities in access to resources, opportunities, and
outcomes among different social or economic groups.
o Examples: Educational attainment, employment rates, healthcare access,
poverty rates.
Measures of Inequality
1. Variance:
o Description: Measures the spread or dispersion of values around the
mean in a dataset.

2. Standard Deviation:

 Description: Square root of the variance, indicating the average deviation of


values from the mean.

3. Range:
 Description: The difference between the maximum and minimum values in a
dataset.

4. Gini Coefficient:
 Description: Measures income or wealth distribution inequality across a
population, ranging from 0 (perfect equality) to 1 (maximum inequality).
 Calculation: Often derived from a Lorenz curve, which plots cumulative income
or wealth against the cumulative share of the population.
5. Interquartile Range (IQR):
 Description: Range between the first quartile (Q1) and the third quartile (Q3),
representing the middle 50% of the dataset.

Examples of Inequality Measures


 Example 1: Income Distribution


o Data: Monthly incomes of a population [2000, 3000, 3500, 4000, 5000,


8000, 10000]
o Inequality Measure: Calculate the Gini coefficient or plot a Lorenz curve to visualize income distribution inequality (a sketch of this calculation appears after these examples).
 Example 2: Educational Attainment
o Data: Educational levels (e.g., high school diploma, bachelor's degree,
graduate degree) across different demographics.
o Inequality Measure: Compare the proportion of individuals with higher
education among different income groups or regions.
 Example 3: Healthcare Access
o Data: Availability of healthcare facilities and services in urban vs rural
areas.
o Inequality Measure: Analyze the disparity in healthcare access using
demographic data and geographic distribution.
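As a concrete illustration of Example 1 above, here is a minimal Python sketch (assuming NumPy) that computes the Gini coefficient of the monthly incomes using the pairwise-difference definition given later in this unit:

import numpy as np

incomes = np.array([2000, 3000, 3500, 4000, 5000, 8000, 10000], dtype=float)   # Example 1

# Gini coefficient: mean absolute difference over all pairs, divided by twice the mean income
mean_abs_diff = np.abs(incomes[:, None] - incomes[None, :]).mean()
gini = mean_abs_diff / (2 * incomes.mean())
print(f"Gini coefficient: {gini:.3f}")       # 0 = perfect equality, 1 = maximum inequality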
Inequality measures are essential for understanding disparities within data and
populations, informing policy decisions, and addressing social and economic challenges.
They provide quantitative insights into the distribution of resources, opportunities, and
outcomes across different groups or segments of society.
Sources of data on income
Accurate and comprehensive data on income is essential for economic analysis,
policy-making, and research. Various organizations and databases provide detailed
information on income distribution, wages, and related economic metrics. Here are some
key sources of income data:
1. Government Statistical Agencies:

o United States Census Bureau:


 Data: Annual Social and Economic Supplement (ASEC) of the
Current Population Survey (CPS), American Community Survey
(ACS).
 Website: census.gov
o Bureau of Labor Statistics (BLS):
 Data: Occupational Employment and Wage Statistics (OEWS),
Consumer Expenditure Survey (CES).
 Website: bls.gov
o Internal Revenue Service (IRS):
 Data: Statistics of Income (SOI).
 Website: irs.gov
2. International Organizations:
o World Bank:
 Data: World Development Indicators (WDI), PovcalNet.
 Website: worldbank.org
o Organisation for Economic Co-operation and Development (OECD):
 Data: Income Distribution Database.
 Website: oecd.org
o International Labour Organization (ILO):
 Data: ILOSTAT.
 Website: ilo.org
3. National Statistical Agencies:
o Eurostat (European Union):

 Data: Income and Living Conditions (EU-SILC).


 Website: ec.europa.eu/eurostat
o Office for National Statistics (ONS) - United Kingdom:
 Data: Annual Survey of Hours and Earnings (ASHE), Family
Resources Survey (FRS).
 Website: ons.gov.uk
4. Research Institutions and Think Tanks:
o Pew Research Center:
 Data: Studies on income distribution, economic inequality.
 Website: pewresearch.org
o Brookings Institution:
 Data: Economic studies, income inequality research.
 Website: brookings.edu
5. Surveys and Longitudinal Studies:
o Panel Study of Income Dynamics (PSID):
 Data: Longitudinal data on income, wealth, and expenditures.
 Website: psidonline.isr.umich.edu
o Survey of Consumer Finances (SCF):
 Data: Detailed data on household income and wealth.
 Website: federalreserve.gov/econres/scfindex.htm
6. Private Data Providers:
o Nielsen:
 Data: Consumer spending and income data.
 Website: nielsen.com

o Experian:
 Data: Credit scores, income estimates, consumer data.
 Website: experian.com
7. Open Data Portals:
o data.gov:
 Data: U.S. government’s open data portal, including various
economic datasets.
 Website: data.gov
o World Bank Open Data:
 Data: Global income, economic indicators.
 Website: data.worldbank.org

Measuring inequality: quantiles and quantile shares


Introduction
Quantiles and quantile shares are statistical tools used to measure and analyze
inequality within a dataset. They provide insights into the distribution of data,
particularly in understanding how income, wealth, or other variables are spread across a
population.
Quantiles
Quantiles divide a dataset into equal-sized, ordered segments. They help in
understanding the spread and distribution of the data.
1. Quartiles:
o Description: Quartiles divide the data into four equal parts.


 Q1 (First Quartile): The value below which 25% of the data


falls.
 Q2 (Second Quartile/Median): The value below which 50% of
the data falls.
 Q3 (Third Quartile): The value below which 75% of the data
falls.
o Example: In a dataset of incomes $20,000, $30,000, $40,000, $50,000,
$60,000, Q1 is $30,000, Q2 (median) is $40,000, and Q3 is $50,000.
2. Deciles:
o Description: Deciles divide the data into ten equal parts.
 D1 (First Decile): The value below which 10% of the data falls.
 D2 (Second Decile): The value below which 20% of the data
falls, and so on.
o Example: In a dataset of incomes $10,000, $20,000, $30,000, $40,000,
$50,000, $60,000, $70,000, $80,000, $90,000, $100,000, D1 is $10,000,
D2 is $20,000, and so on.
3. Percentiles:
o Description: Percentiles divide the data into 100 equal parts.
 P1 (First Percentile): The value below which 1% of the data
falls.
 P50 (Median/50th Percentile): The value below which 50% of
the data falls, etc.
o Example: In a dataset, if the 90th percentile (P90) of incomes is $90,000,
it means 90% of individuals earn less than $90,000.

Quantile Shares
Quantile shares are used to understand the proportion of the total sum of the variable
(e.g., income) held by different quantiles of the population.
1. Top Decile Share:
o Description: The percentage of total income held by the top 10% of
earners.
o Example: If the total income in a population is $1,000,000 and the top
10% of earners collectively make $400,000, the top decile share is 40%.
2. Top 1% Share:
o Description: The percentage of total income held by the top 1% of
earners.
o Example: If the total income is $1,000,000 and the top 1% earn
$200,000, the top 1% share is 20%.
3. Bottom Quintile Share:
o Description: The percentage of total income held by the bottom 20% of
earners.
o Example: If the bottom 20% of earners collectively make $50,000 out of
a total of $1,000,000, the bottom quintile share is 5%.
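The next sketch (assuming NumPy) computes quantile shares for the ten-person income dataset used in the deciles example above:

import numpy as np

incomes = np.array([10_000, 20_000, 30_000, 40_000, 50_000,
                    60_000, 70_000, 80_000, 90_000, 100_000])

sorted_incomes = np.sort(incomes)
total = sorted_incomes.sum()

n = len(sorted_incomes)
n_top = n // 10            # people in the top decile
n_bottom = n // 5          # people in the bottom quintile

top_decile_share = sorted_incomes[-n_top:].sum() / total
bottom_quintile_share = sorted_incomes[:n_bottom].sum() / total

print(f"Top 10% share: {top_decile_share:.0%}")
print(f"Bottom 20% share: {bottom_quintile_share:.0%}")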
Cumulative income shares and Lorenz curves
Cumulative income shares and Lorenz curves are essential tools in understanding
and visualizing economic inequality. They help depict the distribution of income or
wealth across a population, making it easier to identify disparities.


Cumulative Income Shares


Cumulative income shares measure the proportion of total income held by cumulative
percentages of the population, from the poorest to the richest.
Steps to Calculate Cumulative Income Shares:
1. Sort the Population by Income: Arrange individuals or households in
ascending order of income.
2. Calculate Cumulative Income: Compute the cumulative income as you move
through the sorted list.
3. Calculate Cumulative Population Percentage: Determine the cumulative
percentage of the population corresponding to each cumulative income value.
4. Calculate Cumulative Income Share: For each cumulative population
percentage, calculate the corresponding cumulative income share.
Example:
Consider a population with incomes: $15,000, $18,000, $22,000, $27,000,
$30,000, $35,000, $40,000, $45,000, $50,000, $60,000.


Lorenz Curves
The Lorenz curve is a graphical representation used to illustrate the distribution
of income or wealth within a population. It shows the proportion of total income earned
by the cumulative percentages of the population. This curve is a fundamental tool in
economics and inequality studies, providing insights into the degree of income or wealth
inequality.
Constructing the Lorenz Curve
Steps to Construct a Lorenz Curve:
1. Sort the Population by Income: Arrange individuals or households in
ascending order of income.

2. Calculate Cumulative Income: Compute the cumulative income as you move


through the sorted list.
3. Calculate Cumulative Population Percentage: Determine the cumulative
percentage of the population corresponding to each cumulative income value.
4. Calculate Cumulative Income Share: For each cumulative population
percentage, calculate the corresponding cumulative income share.
5. Plot the Lorenz Curve: Plot the cumulative population percentage on the X-axis and the corresponding cumulative income share on the Y-axis.
Example:
Using the previous example, plot the following points:
 (0%, 0%)
 (10%, 4.4%)
 (20%, 9.6%)
 (30%, 16.1%)
 (40%, 24%)
 (50%, 32.7%)
 (60%, 43%)
 (70%, 54.7%)
 (80%, 67.8%)
 (90%, 82.5%)
 (100%, 100%)


Desirable properties in a summary measure of inequality


When evaluating and comparing inequality across different populations or over time, it's
important to use summary measures that possess certain desirable properties. These
properties ensure that the measures are reliable, meaningful, and useful for policy
analysis and decision-making. Below are some key desirable properties:
1. Anonymity (Symmetry)
The measure should not depend on the identity of individuals. It should only
consider the distribution of income or wealth, not who earns it.
 Example: If two people swap incomes, the measure of inequality should remain
unchanged.
2. Population Invariance (Replication Invariance)
The measure should remain unchanged if the population is replicated. Doubling
the population while maintaining the same income distribution should not affect
the inequality measure.


 Example: If a society with 10 people each earning $10,000 is replicated to create


a society with 20 people each earning $10,000, the inequality measure should be
the same.
3. Scale Invariance (Income Homogeneity)
The measure should be unaffected by proportional changes in all incomes. If all
incomes are multiplied by a constant factor, the measure should remain the same.
 Example: If everyone's income in a society is doubled, the measure of inequality
should remain the same.
4. Transfer Principle (Pigou-Dalton Principle)
The measure should decrease if income is transferred from a richer individual to
a poorer individual, assuming the transfer does not reverse their income ranking.
 Example: If a rich person with $100,000 gives $1,000 to a poorer person with
$10,000, the inequality measure should reflect a reduction in inequality.
5. Decomposability
The measure should be decomposable into within-group and between-group
components. This property allows the measure to attribute overall inequality to
inequality within subgroups and inequality between subgroups.
 Example: In a country with multiple regions, it should be possible to decompose
the national inequality into the inequality within each region and the inequality
between regions.
6. Additivity
The measure should allow for the aggregation of inequality measures of different
subpopulations to obtain the total inequality measure.


 Example: If a country is divided into several states, the national inequality


measure should be the sum of the inequality measures of the individual states.
7. Sensitivity to Income Differences
The measure should be sensitive to changes in income distribution. Small
changes in income distribution should be reflected in the inequality measure.
 Example: If a small amount of income is transferred from a richer to a poorer
person, the measure should show a corresponding decrease in inequality.
Common Measures of Inequality
1. Gini Coefficient
 Description: Measures the extent to which the distribution of income or wealth
deviates from a perfectly equal distribution.
 Range: 0 (perfect equality) to 1 (perfect inequality).

 Formula: G = (Σi Σj |yi − yj|) / (2 n² μ), where yi and yj are individual incomes, n is the number of individuals, and μ is the mean income.
2. Theil Index
 Description: Measures entropy or disorder in the income distribution. It has two
versions: Theil T and Theil L.
 Range: 0 (perfect equality) to ∞ (perfect inequality).

 Formula (Theil T): T = (1/n) Σ (yi / μ) ln(yi / μ)

 where yi is individual income, n is the number of individuals, and μ is the mean


income.
3. Atkinson Index
 Description: Measures inequality with sensitivity to different parts of the
income distribution, parameterized by ϵ which reflects the society's aversion to
inequality.
 Range: 0 (perfect equality) to 1 (perfect inequality).

 Formula (for ϵ ≠ 1): A = 1 − (1/μ) [ (1/n) Σ yi^(1−ϵ) ]^(1/(1−ϵ))
 where yi is individual income, n is the number of individuals, μ is the
mean income, and ϵ is the inequality aversion parameter.
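For completeness, here is a minimal Python sketch (assuming NumPy) that evaluates the Theil T and Atkinson indices for the income data from the earlier Gini example; the aversion parameter ϵ = 0.5 is an assumption chosen only for illustration:

import numpy as np

incomes = np.array([2000, 3000, 3500, 4000, 5000, 8000, 10000], dtype=float)
mu = incomes.mean()

# Theil T index: (1/n) * sum( (y_i / mu) * ln(y_i / mu) )
theil_t = np.mean((incomes / mu) * np.log(incomes / mu))

# Atkinson index with inequality aversion parameter eps (eps != 1, assumed 0.5 here)
eps = 0.5
atkinson = 1 - (np.mean(incomes ** (1 - eps))) ** (1 / (1 - eps)) / mu

print(f"Theil T: {theil_t:.3f}")
print(f"Atkinson (eps=0.5): {atkinson:.3f}")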
Smoothing Time Series
Introduction to Time Series
A time series is a sequence of data points recorded at successive points in time,
typically at uniform intervals. Examples include daily stock prices, monthly sales data,
and annual GDP growth rates. Time series analysis involves understanding the
underlying patterns and predicting future values based on historical data.
Smoothing in Time Series
Smoothing is a technique used in time series analysis to reduce noise and reveal
underlying trends. By applying smoothing methods, we can better understand the
structure and patterns within the data, making it easier to identify trends, cycles, and
other important features.

Aim of Smoothing
The primary aim of smoothing is to filter out short-term fluctuations and highlight
longer-term trends and patterns in the data. This makes it easier to:
1. Identify Trends: Detect long-term upward or downward movements in the data.
2. Remove Noise: Eliminate random variations that may obscure the true signal.
3. Enhance Interpretability: Make the data easier to visualize and interpret.
Opinion Poll
In the context of time series, an opinion poll can refer to a series of survey results
collected over time. Smoothing these results can help identify trends in public opinion,
such as shifts in political preferences or consumer confidence.
Refinement
Refinement in time series smoothing refers to the process of selecting and applying the
appropriate smoothing techniques to improve the clarity and interpretability of the data.
This involves choosing the right method (e.g., moving averages, exponential smoothing)
and the right parameters (e.g., window size, smoothing factor).
Meaning of Residual
Residuals are the differences between the observed values and the values predicted by
the smoothing model. In other words, residuals represent the noise or irregularities that
remain after smoothing. Analyzing residuals is crucial for understanding the
effectiveness of the smoothing process and identifying any remaining patterns or
anomalies.
Common Smoothing Techniques
1. Moving Averages


o Simple Moving Average (SMA): The average of a fixed number of


consecutive observations.
o Weighted Moving Average (WMA): Similar to SMA but assigns
different weights to observations.
2. Exponential Smoothing
o Single Exponential Smoothing (SES): Applies exponentially decreasing
weights to past observations.
o Double Exponential Smoothing (Holt’s Linear Trend Model):
Extends SES to capture trends.
o Triple Exponential Smoothing (Holt-Winters Method): Captures
seasonality along with trend.
3. LOESS (Locally Estimated Scatterplot Smoothing)
o A non-parametric method that fits multiple regressions in localized
subsets of the data.
Example in Python
Below is an example of smoothing a time series using a simple moving average
and exponential smoothing in Python:
import pandas as pd
import matplotlib.pyplot as plt

# Example time series data
data = {'value': [112, 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 118]}
df = pd.DataFrame(data)

# Simple Moving Average (SMA)
df['SMA_3'] = df['value'].rolling(window=3).mean()

# Single Exponential Smoothing (SES)
alpha = 0.2
df['SES'] = df['value'].ewm(alpha=alpha, adjust=False).mean()

# Plotting the original data and smoothed data
plt.plot(df['value'], label='Original Data')
plt.plot(df['SMA_3'], label='SMA (3-period)', color='orange')
plt.plot(df['SES'], label='SES (α=0.2)', color='red')
plt.legend()
plt.show()
Interpretation of Residuals
After applying smoothing techniques, residuals can be analyzed to understand the
effectiveness of the smoothing process. Ideally, residuals should resemble white noise,
indicating that the model has successfully captured the underlying trend and any
systematic patterns.
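Continuing the Python example above, the residuals from the exponential smoother could be inspected with a short sketch like this (run after the code above, so that df and plt are already defined):

# Residuals: observed value minus the smoothed (fitted) value
df['residual'] = df['value'] - df['SES']

print(df['residual'].describe())      # ideally centred near zero with no strong pattern
df['residual'].plot(title='Residuals after exponential smoothing')
plt.axhline(0, color='grey', linestyle='--')
plt.show()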
End Point of Smoothing
The end point of smoothing is the achievement of a clear and interpretable
representation of the underlying trend, seasonality, and patterns in the time series data.
Once the data is sufficiently smoothed:


1. Trends and Patterns Are Clear: The primary trend and significant seasonal
patterns are visible without the distraction of random noise.
2. Residuals Analysis: Residuals should appear random, indicating that the
smoothing process has effectively captured the systematic part of the data.
3. Ready for Further Analysis: The smoothed data is in a form that can be used
for further analysis, forecasting, and decision-making.
By reaching this end point, analysts can confidently move forward with additional
analytical steps, knowing that the core patterns in their time series data have been
effectively isolated and understood.
Conclusion
Smoothing is a vital technique in time series analysis, particularly in exploratory
data analysis (EDA). It helps to filter out noise and highlight important trends, making
the data easier to interpret and analyze. By applying appropriate smoothing techniques
and analyzing residuals, we can gain valuable insights into the underlying patterns and
structures within time series data.

Unit III Completed
