SIMPLE LINEAR REGRESSION
ANALYSIS
Mr. Biswajit Sahoo
Guest Faculty, P.G. Department of Commerce
Utkal University, Vani Vihar
Introduction
The term “Regression” was introduced by Francis Galton, and Galton's Law of Universal Regression was confirmed by his friend, Karl Pearson. The modern interpretation of regression is quite different from their analysis.
Regression analysis is concerned with the study of the dependence of one variable, the dependent variable, on one or more other variables, the explanatory (independent) variables, with a view to estimating and/or predicting the mean or average value of the former in terms of the known or fixed values of the latter.
Modern Interpretation Of Regression
• The modern interpretation of regression is, however, quite different.
Broadly speaking, we may say Regression analysis is concerned with the
study of the dependence of one variable, the dependent variable, on one
or more other variables, the explanatory variables, with a view to
estimating and/or predicting the (population) mean or average value of the
former in terms of the known or fixed (in repeated sampling) values of the
latter.
Objective
1. The key objective behind regression analysis is the statistical dependence of one
variable, the dependent variable, on one or more other variables, the explanatory
variables.
2. The objective of such analysis is to estimate and/or predict the mean or average
value of the dependent variable on the basis of the known or fixed values of the
explanatory variables.
3. In practice the success of regression analysis depends on the availability of the
appropriate data.
4. In any research, the researcher should clearly state the sources of the data used
in the analysis, their definitions, their methods of collection, and any gaps or
omissions in the data as well as any revisions in the data.
5. The researcher should ensure that the data used are properly gathered and that the
computations and analysis are correct.
Simple Regression Model
• The term “simple” regression means that the dependent variable is related to a single explanatory variable.
• The term “model” is broadly used to represent any phenomenon in a
mathematical framework.
Two-variable or bivariate regression
• This means regression in which the dependent variable is related to a single explanatory variable (the regressor).
• When the mean value of Y depends upon the conditioning variable X, it is called a conditional expected value. Regression analysis is largely concerned with estimating and/or predicting the (population) mean value of the dependent variable on the basis of the known or fixed values of the explanatory variable(s).
In regression analysis, we estimate the mean value, or average value, or expected value of the dependent variable Y based on the known values of the independent variable(s) X. That is, we estimate E(Y|Xi), i.e., the conditional expectation of Y given Xi, as illustrated in the sketch below.
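A minimal Python sketch of this idea, using a small set of hypothetical income–consumption pairs (values assumed for illustration only), groups the observations by X and averages Y within each group:

```python
from collections import defaultdict

# Hypothetical data: weekly family income X and weekly consumption Y
data = [(80, 55), (80, 65), (100, 70), (100, 80), (120, 84), (120, 94)]

groups = defaultdict(list)
for x, y in data:
    groups[x].append(y)

# The conditional expectation E(Y | X = x) is estimated by the mean of the
# Y values observed at each fixed value of X.
for x, ys in sorted(groups.items()):
    print(f"E(Y | X = {x}) = {sum(ys) / len(ys):.1f}")
```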
Suppose the outcome of a process is denoted by a random variable Y, called the dependent (or study) variable, which depends on k independent (or explanatory) variables denoted by X1, X2, ..., Xk. Suppose the behaviour of Y can be explained by a relationship given by
Y = f(X1, X2, ..., Xk, β1, β2, ..., βk) + u
where f is some well-defined function and β1, β2, ..., βk are the parameters which characterize the role and contribution of X1, X2, ..., Xk respectively. The term u reflects the stochastic nature of the relationship between Y and X1, X2, ..., Xk and indicates that such a relationship is not exact. When u = 0, the relationship is called a mathematical model; otherwise it is a statistical model.
The linear model
Consider a simple linear regression model
y = b0 + b1 X + e
where y is termed as the dependent or study variable and X is termed as the independent
or explanatory variable.
The terms b0 and b1 are the parameters of the model.
The parameter b0 is termed the intercept, and the parameter b1 is termed the slope parameter. These parameters are usually called regression coefficients.
The unobservable error component e accounts for the failure of the data to lie on a straight line and represents the difference between the true and observed realizations of y.
There can be several reasons for such a difference, e.g., the effect of omitted variables, variables that may be qualitative, inherent randomness in the observations, etc. We assume that e is an independent and identically distributed random variable with mean zero and constant variance; additionally, we may assume that e is normally distributed.
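A minimal simulation sketch of this model, with assumed values b0 = 2, b1 = 0.5 and normally distributed errors, shows how the error component keeps the observations off the straight line:

```python
import numpy as np

# Illustrative (assumed) parameter values: b0 = 2, b1 = 0.5, error std. dev. = 1
rng = np.random.default_rng(0)
b0, b1 = 2.0, 0.5

X = rng.uniform(0, 10, size=100)              # explanatory variable
e = rng.normal(loc=0.0, scale=1.0, size=100)  # i.i.d. N(0, sigma^2) error component
y = b0 + b1 * X + e                           # observed y does not lie exactly on the line

print(y[:5])
```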
Simple Regression
• The simple regression model can be used to study the relationship between two variables.
• Suppose the scatter plot between x and y looks like Fig. 1.
[Fig. 1: scatter plot of x and y]
OLS: Graphically
• To obtain the best-fitted line, the method of OLS entails taking each vertical distance from the point to the line, squaring it, and then minimizing the total sum of the areas of the squares; see Fig. 2.
[Fig. 2: fitted line with squared vertical distances from each point]
OLS: Graphically
[Fig. 3]
OLS: Mathematically
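Stated here as a sketch in the notation used later in these notes, OLS chooses the estimates α̂ and β̂ of the two-variable model yt = α + βxt + ut so as to minimize the residual sum of squares:
min Σt ût² = Σt (yt − α̂ − β̂xt)²
The resulting estimators are
β̂ = Σt (xt − x̄)(yt − ȳ) / Σt (xt − x̄)²   and   α̂ = ȳ − β̂x̄.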
Estimated Parameters
Use of Parameter Estimates
Change of Scale
Example
• In the table below, estimates from two regressions of annual gross bank credit to industry (y) on the annual call money rate (x) for the period 1990 – 2018 are reported with two different scales of measurement of y; the simulation sketch after the table illustrates the effect.

                    y (in ₹ billion)    y (in ₹ million)
Intercept               1708.635            1708635
Call money rate         -943.589            -943589
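The rescaling effect can be checked with a minimal Python sketch; the series below are simulated stand-ins for the credit data, not the actual 1990 – 2018 figures:

```python
import numpy as np

# Simulated data standing in for the 1990-2018 series (values are illustrative only)
rng = np.random.default_rng(42)
x = rng.uniform(4, 12, size=29)                       # call money rate (%)
y_billion = 1700 - 950 * x + rng.normal(0, 50, 29)    # credit in ₹ billion (hypothetical)
y_million = y_billion * 1000                          # same series expressed in ₹ million

slope_b, intercept_b = np.polyfit(x, y_billion, 1)
slope_m, intercept_m = np.polyfit(x, y_million, 1)

print(intercept_b, slope_b)   # intercept and slope for y in billions
print(intercept_m, slope_m)   # both are exactly 1000 times larger for y in millions
```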
Linearity
• In order to use OLS, a model has to be linear.
• This means that the relationship between x and y must be capable of being expressed diagrammatically using a straight line. More specifically, the model must be linear in the parameters (α and β).
• By ‘linear in the parameters’, it is meant that the parameters are not multiplied together, divided, squared, or cubed, etc.; an example follows below.
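• For example, yt = α + β ln xt + ut is non-linear in the variable xt but linear in the parameters α and β, so it can be estimated by OLS; by contrast, a model such as yt = α + xt^β + ut is non-linear in the parameter β and cannot be estimated by OLS directly.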
Goodness-of-Fit
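• The standard goodness-of-fit measure for this model is the coefficient of determination R². With TSS = Σt (yt − ȳ)² the total sum of squares, ESS = Σt (ŷt − ȳ)² the explained sum of squares, and RSS = Σt ût² the residual sum of squares,
R² = ESS / TSS = 1 − RSS / TSS,   0 ≤ R² ≤ 1,
so R² measures the proportion of the variation in y explained by the regression.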
Regression through the Origin
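• When the intercept is suppressed, the model becomes yt = βxt + ut (a regression through the origin), and the OLS slope estimator in this case is β̂ = Σt xt yt / Σt xt².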
Estimators
• So far we have assumed that we have exact information about
the random variable under discussion, in particular that we
know the probability distribution, in the case of a
discrete random variable, or the probability density
function, in the case of a continuous variable.
• With this information it is possible to work out the population
mean and variance and any other population characteristics in
which we might be interested.
Estimators
• In practice, except for artificially simple random variables such as
the numbers on thrown dice, we do not know the exact probability
distribution or density function.
• It follows that we do not know the population mean or variance.
• However, we would like to obtain an estimate of them or some
other population characteristic.
• This is done by taking a sample of n observations and deriving an estimate of the population characteristic using some appropriate formula. This formula is technically known as an estimator, and the number obtained is known as the estimate.
Estimators
• The estimator is a general rule or formula, whereas the estimate
is a specific number that will vary from sample to sample.
• The estimator of the population mean μ is the sample mean, x̄ = (1/n) Σ xi.
• The estimator of the population variance σ² is the sample variance, s² = Σ (xi − x̄)² / (n − 1).
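A minimal Python sketch, using hypothetical sample values, shows the distinction between the estimator (the formula) and the estimate (the number it yields for a particular sample):

```python
# Illustrative sample of n = 5 observations (hypothetical values)
sample = [4.1, 5.3, 6.0, 4.8, 5.6]
n = len(sample)

# Applying the estimator (the formula) to this sample yields the estimate (the number)
mean_estimate = sum(sample) / n
var_estimate = sum((x - mean_estimate) ** 2 for x in sample) / (n - 1)

print(mean_estimate, var_estimate)
```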
Assumptions of Classical Linear Regression
Classical Linear Regression Model (CLRM)
• In the specification yt = α + βxt + ut, the data for xt are observable, but since yt also depends on ut, it is necessary to be specific about how the ut are generated.
• Certain assumptions are usually made concerning the ut, the unobservable error or disturbance terms.
• Note that no assumptions are made concerning their observable counterparts, the estimated model's residuals.
Assumptions of OLS
1. In the population, the regression model is linear in the
parameters, i.e. y t = α + βxt + ut
2. We have a random sample of size T, {(xt, yt): t = 1, 2, …, T}.
3. The sample outcomes on x, namely, {xt, t = 1, …, T}, are not all
the same values.
4. The error u has an expected value of zero given any value of the
explanatory variable. In other words, E(u|x) = 0.
5. The error u has the same variance given any value of the explanatory variable, i.e. Var(u|x) = σ².
Classical Linear Regression Model (CLRM)
• Alternatively, the model yt = α + βxt + ut is known as the CLRM given that it fulfils the following assumptions.
1. E(ut) = 0            the errors have zero mean
2. Var(ut) = σ² < ∞     the error variance is constant and finite
3. Cov(ui, uj) = 0      the errors are linearly independent of one another
4. Cov(ut, xt) = 0      the regressor is unrelated to the error term
5. ut ~ N(0, σ²)        ut is normally distributed
ASSUMPTIONS OF CLASSICAL REGRESSION MODEL
• ASSUMPTION-1
The regression model is linear in parameters. It may or may not be linear in variables. For example, the equation yt = α + βxt + ut is linear in the parameters as well as in the variables.
• ASSUMPTION-2
The explanatory variable is not correlated with the disturbance term u.
This assumption requires that Σ ut xt = 0.
In other words, the covariance between the error term and the explanatory variable is zero.
This assumption is automatically fulfilled if X is non-stochastic.
It requires that the X values are kept fixed in repeated samples.
• ASSUMPTION-3
The expected value or mean value of the error term u is zero.
In symbols, E(ut|Xt) = 0. It does not mean that all error terms are zero.
It implies that the error terms cancel each other out.
• Given E(u|x) = 0, the population regression function (PRF) is given by E(y|x) = α + βx.
Population Regression Function (PRF)
• This tells us how the average or expected value of y changes with x: a one-unit increase in x changes the expected value of y by the amount β.
• Also, this implies that for any given value of x, the distribution of y is centered about E(y|x).
Sample Regression Function (SRF)
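• The sample regression function is the sample counterpart of the PRF, ŷt = α̂ + β̂xt, where α̂ and β̂ are the estimates computed from a particular sample and ût = yt − ŷt are the residuals.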
• ASSUMPTION-4
The variance of each ut is constant. In symbols, Var(ut) = σ². The conditional distribution of the error term is displayed in Fig. 1.1.
• The corresponding error variance for a specific value of the error term is depicted in Fig. 1.2.
• From the figure you can see that the error variance is constant at all levels of the X variable. This describes the case of ‘homoscedasticity’.
Homoskedasticity
• Homoskedasticity, or the constant variance assumption, states that the error u has the same variance given any value of the explanatory variable. In other words, Var(u|x) = σ².
• The importance of this assumption will be reflected in deriving the properties of the OLS estimators and their variances.
• Since Var(u|x) = E(u²|x) − [E(u|x)]² and E(u|x) = 0, we have E(u²|x) = σ²; σ² is also the unconditional variance of u, i.e. Var(u) = σ².
• σ² is often called the error variance, and σ is the standard deviation of the error term.
Homoskedasticity
• Given the assumption of homoskedasticity,
Var(y|x) = Var(u|x) = σ²
because, for a given value of x, the term α + βx is constant.
• When Var(u|x) depends on x, the error term is said to exhibit heteroskedasticity (or non-constant variance).
• Because Var(u|x) = Var(y|x), heteroskedasticity is present whenever Var(y|x) is a function of x.
• ASSUMPTION-5
There is no correlation between any two error terms: cov(ui, uj) = 0 for i ≠ j. This is the assumption of no autocorrelation.
This assumption implies that the error terms ui are random.
It implies that there is no systematic relationship between any two error terms.
Since any two error terms are assumed to be uncorrelated, any two Y values will also be uncorrelated, i.e., cov(Yi, Yj) = 0.
[Figure: (i) No Autocorrelation (ii) Positive Autocorrelation (iii) Negative Autocorrelation]
• ASSUMPTION-6
The regression model is correctly specified, that is, there is no
specification error in the model.
If a relevant variable is not included, or an irrelevant variable is included, in the regression model, then we commit a model specification error.
For instance, suppose we study the demand for automobiles. If we include only the price of automobiles and do not include the income of the consumer, then there is a specification error.
The Simple Regression Model under Homoskedasticity
[Figure: E(y|x) = α + βx]
Example of Heteroskedastic Variance
[Figure: E(y|x) = α + βx]
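The contrast between the two figures can be reproduced with a minimal simulation sketch, using assumed values α = 1 and β = 2:

```python
import numpy as np

# Illustrative parameter values: alpha = 1, beta = 2
rng = np.random.default_rng(1)
x = np.linspace(1, 10, 200)

u_homo = rng.normal(0, 1.0, size=x.size)         # Var(u|x) = sigma^2 at every x
u_hetero = rng.normal(0, 0.4 * x, size=x.size)   # Var(u|x) increases with x

y_homo = 1 + 2 * x + u_homo        # spread around E(y|x) = 1 + 2x is constant
y_hetero = 1 + 2 * x + u_hetero    # spread around E(y|x) fans out as x grows

print(np.std(u_homo[x < 5]), np.std(u_homo[x >= 5]))      # roughly equal
print(np.std(u_hetero[x < 5]), np.std(u_hetero[x >= 5]))  # second is clearly larger
```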