The Nature and Sources of Data
Nature and Sources of Data
Econometric analysis requires data
Different kinds of economic data sets are:
Cross-sectional data
Time series data
Pooled cross-sectional data
Panel/Longitudinal data
Econometric methods depend on the nature of the data
used
Use of inappropriate methods may lead to misleading
results
Cross-Sectional Data Sets
Sample of individuals, households, firms, cities, states,
countries, or other units of interest at a given point of
time/in a given period
Cross-sectional observations are more or less
independent
For example, pure random sampling from a population
Sometimes pure random sampling is violated, e.g. units
refuse to respond in surveys, or if sampling is
characterized by clustering
Cross-sectional data typically encountered in applied
microeconomics
Cross-Sectional Data Sets: Example 1
Cross-sectional data set on wages and other characteristics
[Table: one row per observation; columns include the observation number, the hourly wage, and several indicator variables (1 = yes, 0 = no)]
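A minimal sketch in Python of how such a cross-sectional wage data set might be arranged (the values and column names are hypothetical, chosen only for illustration):

```python
import pandas as pd

# Hypothetical cross-sectional wage data: one row per worker, observed once
wage_data = pd.DataFrame({
    "obs":     [1, 2, 3, 4],
    "wage":    [3.10, 3.24, 3.00, 6.00],  # hourly wage
    "educ":    [11, 12, 11, 8],           # years of education
    "female":  [1, 1, 0, 0],              # indicator variable: 1 = yes, 0 = no
    "married": [0, 1, 0, 1],              # indicator variable: 1 = yes, 0 = no
})
print(wage_data)
```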
Time Series Data
Observations of a variable or several variables over time
For example, stock prices, money supply, consumer price
index, gross domestic product, annual homicide rates,
automobile sales, …
Time series observations are typically serially correlated
Ordering of observations conveys important information
Data frequency: daily, weekly, monthly, quarterly, annually, …
Typical features of time series: trends and seasonality
Typical applications: applied macroeconomics and finance
Time Series Data Sets: Example
Time series data on minimum wages and related variables
[Table: one observation per year; columns: average minimum wage for the given year, average coverage rate, unemployment rate, gross national product]
Pooled Cross Sections Data
Two or more cross sections are combined in one data set
Cross sections are drawn independently of each other
Pooled cross sections often used to evaluate policy
changes
Example:
Evaluate effect of change in property taxes on house prices
Random sample of house prices for the year 1993
A new random sample of house prices for the year 1995
Compare before/after (1993: before reform, 1995: after reform)
Pooled Cross Sections Data: Example
Pooled cross sections on housing prices
[Table: columns include the property tax, the size of the house in square feet, and the number of bathrooms; observations are grouped into before-reform and after-reform years]
Panel or Longitudinal Data
The same cross-sectional units are followed over time
Panel data have a cross-sectional and a time series
dimension
Panel data can be used to account for time-invariant
unobservables
Panel data can be used to model lagged responses
Example:
• City crime statistics; each city is observed in two years
• Time-invariant unobserved city characteristics may be modeled
• Effect of police on crime rates may exhibit time lag
Panel or Longitudinal Data: Example
Two-year panel data on city crime statistics
[Table: each city has two time-series observations, e.g. the number of police in 1986 and the number of police in 1990]
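A minimal sketch in Python of how a two-year city panel might be stored (the cities, counts, and column names are hypothetical, chosen only for illustration):

```python
import pandas as pd

# Hypothetical two-year city crime panel: one row per city per year
crime_panel = pd.DataFrame({
    "city":    [1, 1, 2, 2, 3, 3],
    "year":    [1986, 1990, 1986, 1990, 1986, 1990],
    "murders": [5, 8, 2, 1, 25, 32],          # hypothetical counts
    "police":  [440, 471, 75, 75, 520, 493],  # number of police in each year
})

# A (city, year) index makes the cross-sectional and time-series
# dimensions of the panel explicit.
crime_panel = crime_panel.set_index(["city", "year"])
print(crime_panel)
```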
Sources of Data
Sources of Data (International)
Success of regression is dependent on availability
of quality data
International Sources?
World Bank, World Development Indicators (WDI)
IMF (International Monetary Fund) website
Asian Development Bank, International Energy
Organization
Yet the best starting point is often Google itself: you can search for relevant international data directly
Sources of Data (Local)
Local Sources of Data?
1. Economic Survey
(http://www.finance.gov.pk/survey_1819.html)
2. Handbook of SBP
(http://www.sbp.org.pk/departments/stats/PakEconomy_HandBook/index.htm)
3. Surveys of Pakistan Bureau of Statistics
4. Websites of Planning Commission, SBP, PBS,
Ministry of Finance, SECP etc.
Screenshots of Pakistani sources of data
Quality of the Data
…success of regression analysis depends on quality
and availability of data
…data may not always be available, and not all available data are of good quality
Check carefully the quality of the agency that collects the data
There is a possibility of errors of measurement, omission or rounding in the data, and these need to be checked
Data may be available only at a high level of aggregation and may not be usable at the disaggregated level
….the results of research are only as good as the
quality of the data
Estimating a Linear Regression Model
Linear Regression Model (LRM)
Consider the “simple Linear Regression Model”
Yi = β1 + β2X2i + εi
It is called “simple” because it has two variables only
The term “linear” in the linear regression model
refers to linearity in the regression coefficients, the
βs, and not linearity in the Y and X variables.
So X and Y may enter with higher powers, in logs, etc., in a linear regression model
The β coefficients, however, cannot be raised to a power or divided by another coefficient
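For illustration, the first two models below are linear in the parameters (even though they are nonlinear in the variables), while the last two are not:

$$
\begin{aligned}
Y_i &= \beta_1 + \beta_2 \ln X_{2i} + u_i && \text{linear in the } \beta\text{s} \\
Y_i &= \beta_1 + \beta_2 X_{2i} + \beta_3 X_{2i}^2 + u_i && \text{linear in the } \beta\text{s} \\
Y_i &= \beta_1 + \beta_2^{2} X_{2i} + u_i && \text{not linear: a coefficient is squared} \\
Y_i &= \beta_1 + \frac{\beta_2}{\beta_3} X_{2i} + u_i && \text{not linear: one coefficient is divided by another}
\end{aligned}
$$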
Estimating the Regression Model
Consider this model with K-variables
Yi = B1 + B2X2i + B3X3i + … + BkXki + ui
Consider a simplified version of it with three explanatory variables:
Wagei = B1 + B2Edui + B3Expi + B4Femalei + ui
Furthermore, let us assume that you have data on 450 workers, covering their experience, education, gender, etc.
…the structure of the data can be seen on the next slide.
Data on Wage, Education, Gender (sample data)
How can you "estimate" the regression from these data?
Ordinary Least Squares (OLS/LS) Method
There are several ways to estimate this regression model
The most common method is called the "Ordinary Least Squares / Least Squares method"
It works as follows: "OLS calculates the unknown regression parameters by minimizing the sum of the squared errors of the regression model"
Formally: Yi = B1 + B2X2i + B3X3i + … + BkXki + ui (1)
ui =Yi – (B1 + B2X2i + B3X3i + … + BkXki ) (2)
or ui = Yi – BX (3)
Ordinary Least Squares (OLS/LS) Method
or ui =Yi – BX (3)
This equation means that the error term equals the difference between the actual value (Y) and the value of Y obtained from the regression (BX)
To obtain the values of the Bs we want to make the errors as small as possible
However, simply making the sum of the errors equal to zero (Σui = 0) does not work, because positive and negative errors cancel each other out; so we instead minimize the "sum of squared errors", i.e. Σui²
This implies that we can write equation (3) as follows:
Σ ui2= Σ (Yi – BX)2
Ordinary Least Squares (OLS/LS) Method
Σ ui2= Σ (Yi – BX)2
or Σ ui2= Σ (Yi –B1 – B2X2i – B3X3i – … – BkXki)2 (4)
Now we have data on the Y and X variables, but the Bs are unknown
To obtain the values of the regression coefficients, derivatives are taken with respect to the regression coefficients and set equal to zero.
This is the standard procedure for optimization: take the first-order derivative of the function with respect to each unknown, set it equal to zero, and solve.
The same can be done here for each of the Bs, one by one; for all the Bs we then get equations in terms of the X and Y variables.
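As a quick numerical sketch of this idea (the toy data and variable names below are assumptions, chosen only for illustration), one can minimize the sum of squared errors of a simple model with a generic optimizer and check that the result matches the least-squares line:

```python
import numpy as np
from scipy.optimize import minimize

x2 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y  = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

def ssr(b):
    # Sum of squared errors for candidate coefficients b = (B1, B2)
    return np.sum((y - b[0] - b[1] * x2) ** 2)

# Minimize the sum of squared errors numerically
b1_hat, b2_hat = minimize(ssr, x0=[0.0, 0.0]).x
print(b1_hat, b2_hat)            # roughly 0.14 and 1.96 for this toy data
print(np.polyfit(x2, y, deg=1))  # the same line, reported as [slope, intercept]
```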
Ordinary Least Squares (OLS/LS) Method for SLR
Σ ui2= Σ (Yi – BX)2 (4)
Or for simple linear regression we can write Eq. (4) as
Σ ui2= Σ (Yi –B1 – B2X2i)2
To minimize the error sum of squares:

$$\frac{\partial \sum u_i^2}{\partial B_1} = \frac{\partial}{\partial B_1}\sum (Y_i - B_1 - B_2 X_{2i})^2 = 0$$

$$\frac{\partial \sum u_i^2}{\partial B_2} = \frac{\partial}{\partial B_2}\sum (Y_i - B_1 - B_2 X_{2i})^2 = 0$$

These two equations give the following two results.
Derivations

$$\sum_{i=1}^{n} u_i^2 = \sum_{i=1}^{n} (y_i - \beta_1 - \beta_2 x_{2i})^2$$

Take the derivative with respect to β1 and set it equal to zero:

$$\frac{\partial}{\partial \beta_1}\sum_{i=1}^{n} u_i^2 = \frac{\partial}{\partial \beta_1}\sum_{i=1}^{n} (y_i - \beta_1 - \beta_2 x_{2i})^2 = 0$$

$$2\sum_{i=1}^{n} (y_i - \beta_1 - \beta_2 x_{2i})^{2-1}\,\frac{\partial (y_i - \beta_1 - \beta_2 x_{2i})}{\partial \beta_1} = 0$$

$$2\sum_{i=1}^{n} (y_i - \beta_1 - \beta_2 x_{2i})(-1) = 0$$

$$\sum_{i=1}^{n} y_i - \sum_{i=1}^{n} \beta_1 - \beta_2\sum_{i=1}^{n} x_{2i} = 0$$

$$\sum_{i=1}^{n} y_i - \underbrace{(\beta_1 + \beta_1 + \dots + \beta_1)}_{n\ \text{times}} - \beta_2\sum_{i=1}^{n} x_{2i} = 0$$
Derivations (Cont…)

$$\sum_{i=1}^{n} y_i - \underbrace{(\beta_1 + \beta_1 + \dots + \beta_1)}_{n\ \text{times}} - \beta_2\sum_{i=1}^{n} x_{2i} = 0$$

$$\sum_{i=1}^{n} y_i - n\beta_1 - \beta_2\sum_{i=1}^{n} x_{2i} = 0$$

$$n\beta_1 = \sum_{i=1}^{n} y_i - \beta_2\sum_{i=1}^{n} x_{2i}$$

$$\beta_1 = \frac{\sum_{i=1}^{n} y_i}{n} - \beta_2\,\frac{\sum_{i=1}^{n} x_{2i}}{n}$$

$$\beta_1 = \bar{Y} - \beta_2\,\bar{X}_2$$
Derivations (Cont…)

$$\sum_{i=1}^{n} u_i^2 = \sum_{i=1}^{n} (y_i - \beta_1 - \beta_2 x_{2i})^2$$

Take the derivative with respect to β2 and set it equal to zero:

$$\frac{\partial}{\partial \beta_2}\sum_{i=1}^{n} u_i^2 = \frac{\partial}{\partial \beta_2}\sum_{i=1}^{n} (y_i - \beta_1 - \beta_2 x_{2i})^2 = 0$$

$$2\sum_{i=1}^{n} (y_i - \beta_1 - \beta_2 x_{2i})^{2-1}\,\frac{\partial (-\beta_2 x_{2i})}{\partial \beta_2} = 0$$

$$-2\sum_{i=1}^{n} (y_i - \beta_1 - \beta_2 x_{2i})\,x_{2i} = 0$$

$$\sum_{i=1}^{n} x_{2i}y_i - \beta_1\sum_{i=1}^{n} x_{2i} - \beta_2\sum_{i=1}^{n} x_{2i}^2 = 0$$

$$\beta_2\sum_{i=1}^{n} x_{2i}^2 = \sum_{i=1}^{n} x_{2i}y_i - \beta_1\sum_{i=1}^{n} x_{2i}$$
Derivations (Cont…)

But we know that

$$\beta_1 = \bar{Y} - \beta_2\,\bar{X}_2
\qquad\text{and}\qquad
\beta_2\sum_{i=1}^{n} x_{2i}^2 = \sum_{i=1}^{n} x_{2i}y_i - \beta_1\sum_{i=1}^{n} x_{2i}$$

Putting the value of β1 into the equation above:

$$\beta_2\sum_{i=1}^{n} x_{2i}^2 = \sum_{i=1}^{n} x_{2i}y_i - \left(\frac{\sum_{i=1}^{n} y_i}{n} - \beta_2\,\frac{\sum_{i=1}^{n} x_{2i}}{n}\right)\sum_{i=1}^{n} x_{2i}$$

$$\beta_2\sum_{i=1}^{n} x_{2i}^2 = \sum_{i=1}^{n} x_{2i}y_i - \frac{\sum_{i=1}^{n} x_{2i}\sum_{i=1}^{n} y_i}{n} + \frac{\beta_2\left(\sum_{i=1}^{n} x_{2i}\right)^2}{n}$$
Derivations (Cont…)

$$\beta_2\sum_{i=1}^{n} x_{2i}^2 = \sum_{i=1}^{n} x_{2i}y_i - \frac{\sum_{i=1}^{n} x_{2i}\sum_{i=1}^{n} y_i}{n} + \frac{\beta_2\left(\sum_{i=1}^{n} x_{2i}\right)^2}{n}$$

$$\beta_2\sum_{i=1}^{n} x_{2i}^2 - \frac{\beta_2\left(\sum_{i=1}^{n} x_{2i}\right)^2}{n} = \sum_{i=1}^{n} x_{2i}y_i - \frac{\sum_{i=1}^{n} x_{2i}\sum_{i=1}^{n} y_i}{n}$$

$$\beta_2\left[\sum_{i=1}^{n} x_{2i}^2 - \frac{\left(\sum_{i=1}^{n} x_{2i}\right)^2}{n}\right] = \sum_{i=1}^{n} x_{2i}y_i - \frac{\sum_{i=1}^{n} x_{2i}\sum_{i=1}^{n} y_i}{n}$$

$$\beta_2 = \frac{\displaystyle\sum_{i=1}^{n} x_{2i}y_i - \frac{\sum_{i=1}^{n} x_{2i}\sum_{i=1}^{n} y_i}{n}}{\displaystyle\sum_{i=1}^{n} x_{2i}^2 - \frac{\left(\sum_{i=1}^{n} x_{2i}\right)^2}{n}}$$
Derivations (Cont…)

$$\beta_2 = \frac{\displaystyle\sum_{i=1}^{n} x_{2i}y_i - \frac{\sum_{i=1}^{n} x_{2i}\sum_{i=1}^{n} y_i}{n}}{\displaystyle\sum_{i=1}^{n} x_{2i}^2 - \frac{\left(\sum_{i=1}^{n} x_{2i}\right)^2}{n}}$$

Multiplying the numerator and the denominator by n:

$$\beta_2 = \frac{n\sum_{i=1}^{n} x_{2i}y_i - \sum_{i=1}^{n} x_{2i}\sum_{i=1}^{n} y_i}{n\sum_{i=1}^{n} x_{2i}^2 - \left(\sum_{i=1}^{n} x_{2i}\right)^2}$$
Derivations (Cont…)

The OLS estimators of the model Yi = β1 + β2X2i + ui are therefore given as follows:

$$\beta_1 = \bar{Y} - \beta_2\,\bar{X}_2$$

$$\beta_2 = \frac{n\sum_{i=1}^{n} x_{2i}y_i - \sum_{i=1}^{n} x_{2i}\sum_{i=1}^{n} y_i}{n\sum_{i=1}^{n} x_{2i}^2 - \left(\sum_{i=1}^{n} x_{2i}\right)^2}
= \frac{\sum_{i=1}^{n} (x_{2i} - \bar{x}_2)(y_i - \bar{y})}{\sum_{i=1}^{n} (x_{2i} - \bar{x}_2)^2}$$
Formulas for Estimating the SLR Model
For the simple population regression Yi = B1 + B2X2i + ui
we estimate Yi = b1 + b2X2i + ei with b1 and b2 as follows:

$$b_1 = \bar{Y} - b_2\,\bar{X}_2$$

$$b_2 = \frac{n\sum X_2 Y - \sum X_2 \sum Y}{n\sum X_2^2 - \left(\sum X_2\right)^2}
\quad\text{or}\quad
b_2 = \frac{\sum (X_2 - \bar{X}_2)(Y - \bar{Y})}{\sum (X_2 - \bar{X}_2)^2}$$

Note 1: all Σs run from i = 1 to n, i.e. $\sum_{i=1}^{n}$; for simplicity we just write Σ.
Note 2: we also drop the "i" subscripts on the variables (both X and Y) to keep the equations readable and simple.
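A minimal sketch in Python of these two equivalent formulas for b2, plus b1 (the toy data are an assumption, chosen only for illustration):

```python
import numpy as np

x2 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y  = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n  = len(y)

# Raw-sums form: b2 = (n*Sum(X2*Y) - Sum(X2)*Sum(Y)) / (n*Sum(X2^2) - (Sum(X2))^2)
b2_sums = (n * np.sum(x2 * y) - np.sum(x2) * np.sum(y)) / \
          (n * np.sum(x2 ** 2) - np.sum(x2) ** 2)

# Deviation form: b2 = Sum((X2 - X2bar)*(Y - Ybar)) / Sum((X2 - X2bar)^2)
b2_dev = np.sum((x2 - x2.mean()) * (y - y.mean())) / np.sum((x2 - x2.mean()) ** 2)

b1 = y.mean() - b2_dev * x2.mean()
print(b2_sums, b2_dev, b1)   # the two b2 values agree; b1 = Ybar - b2*X2bar
```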
Ordinary Least Squares (OLS/LS) Method for MLR
Σ ui2= Σ (Yi – BX)2 (4)
Or for multiple linear regression (MLR) we can write Eq. (4)
as
Σ ui2= Σ (Yi –B1 – B2X2i– B3X3i)2
To minimize the error sum of squares:

$$\frac{\partial \sum u_i^2}{\partial B_1} = \frac{\partial}{\partial B_1}\sum (Y_i - B_1 - B_2X_{2i} - B_3X_{3i})^2 = 0$$

$$\frac{\partial \sum u_i^2}{\partial B_2} = \frac{\partial}{\partial B_2}\sum (Y_i - B_1 - B_2X_{2i} - B_3X_{3i})^2 = 0$$

$$\frac{\partial \sum u_i^2}{\partial B_3} = \frac{\partial}{\partial B_3}\sum (Y_i - B_1 - B_2X_{2i} - B_3X_{3i})^2 = 0$$

These three equations give the following three results.
Formulas for Multiple Regression Models
For the population regression Yi = B1 + B2X2i + B3X3i + ui
we estimate the multiple regression Yi = b1 + b2X2i + b3X3i + ei
with b1, b2 and b3 as follows, where the lower-case x2, x3 and y denote deviations from the sample means (e.g. x2 = X2 − X̄2):

$$b_1 = \bar{Y} - b_2\,\bar{X}_2 - b_3\,\bar{X}_3$$

$$b_2 = \frac{\left(\sum x_3^2\right)\left(\sum x_2 y\right) - \left(\sum x_2 x_3\right)\left(\sum x_3 y\right)}{\left(\sum x_2^2\right)\left(\sum x_3^2\right) - \left(\sum x_2 x_3\right)^2}$$

$$b_3 = \frac{\left(\sum x_2^2\right)\left(\sum x_3 y\right) - \left(\sum x_2 x_3\right)\left(\sum x_2 y\right)}{\left(\sum x_2^2\right)\left(\sum x_3^2\right) - \left(\sum x_2 x_3\right)^2}$$
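A minimal sketch in Python of these two-regressor formulas in deviation form, cross-checked against a standard least-squares routine (the toy data are an assumption, chosen only for illustration):

```python
import numpy as np

x2 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x3 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
y  = np.array([3.1, 3.0, 7.2, 6.9, 11.1, 10.8])

# Deviations from the sample means (the lower-case variables in the formulas)
dx2, dx3, dy = x2 - x2.mean(), x3 - x3.mean(), y - y.mean()

den = np.sum(dx2**2) * np.sum(dx3**2) - np.sum(dx2 * dx3) ** 2
b2 = (np.sum(dx3**2) * np.sum(dx2 * dy) - np.sum(dx2 * dx3) * np.sum(dx3 * dy)) / den
b3 = (np.sum(dx2**2) * np.sum(dx3 * dy) - np.sum(dx2 * dx3) * np.sum(dx2 * dy)) / den
b1 = y.mean() - b2 * x2.mean() - b3 * x3.mean()

# Cross-check: the same coefficients from a library least-squares fit
X = np.column_stack([np.ones_like(x2), x2, x3])
print(np.linalg.lstsq(X, y, rcond=None)[0])  # should match (b1, b2, b3)
print(b1, b2, b3)
```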
Where Did These Formulas Come From?
There is a complete derivation of these formulas using calculus tools for function minimization (you do not need to memorize it)
…it proceeds by minimizing the error sum of squares of these models (that is what OLS does)
…if you are interested, you can check the annex of the course book; I can do it too, but for applications of regression models we do not need the derivation; rather, we are interested in how it works
…I showed part of the derivation on the board for the simple linear regression model
…ALL YOU NEED IS TO KNOW THE BACKGROUND AND BE ABLE TO APPLY THE FORMULAS