STATISTICS
Simple Linear Regression
ETC 1052 /BTC 1052/ ITC 1032 Principles of Statistics 1
Research Interests
• Whether socio-economic status has a large
effect on educational achievement
• The importance of education for earnings
• How exercise habits affect weight
• Whether advertising has an impact on the sales
volume of cookies
Regression Analysis
Regression analysis is a statistical technique for
investigating and modeling the relationship
between variables.
• Example
Production line   Amount of production (y)   Man-hours (x)
1                 10                         300
2                 12                         250
3                 6                          200
4                 18                         450
5                 20                         500
Amount of production (y) is the outcome, i.e. the dependent variable.
Man-hours (x) is the input, i.e. the independent variable.
Y = f(X)
Y: dependent variable, also called the response variable or outcome variable
X: independent variable, also called the predictor variable, regressor variable or
explanatory variable
• Functional relationship and statistical relationship
Functional relationship
The observations fall directly on the curve of
relationship.
Statistical relationship
In general, observations derived from
real-life scenarios do not fall directly
on the curve of relationship.
• Examples: functional relationships
• Y=f(X)=
Notation
Y = f(X), where X, Y = variables and x, y = data points
Y    X
y1   x1
y2   x2
y3   x3
y4   x4
• Linear function: Y = mX + C
Y = β0 + β1X
β0 = intercept
β1 = gradient or slope
• Simple linear regression
Y = β0 + β1X
Y = dependent variable
β0 = intercept
β1 = slope
X = independent variable
• Simple linear regression
Y = β0 + β1X
Y = dependent variable
β0 = intercept
β1 = slope
X = independent variable
• Multiple linear regression
Y = β0 + β1X1 + β2X2 + β3X3 + β4X4
Y = dependent variable
β0, β1, β2, β3, … = regression coefficients
X1, X2, X3, … = regressor variables
[Figure: scatter plot of our data points around the true regression line Y = β0 + β1X, which passes through the mean values of Y for each x]
We want to know the true regression line, i.e. β0 and β1.
But we get a different line from our limited number of data points.
So we find estimates for β0 and β1.
Parameter   Estimate
β0          $\hat{\beta}_0$
β1          $\hat{\beta}_1$
Residual
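The difference between the observed value $y_i$ and the
corresponding fitted value $\hat{y}_i$ is a residual.
Residual value for the 2nd data point: $e_2 = y_2 - \hat{y}_2$
Residual value for the i-th data point: $e_i = y_i - \hat{y}_i$
[Figure: Y against X, showing the observed value $y_2$ at $x_2$, the fitted value $\hat{y}_2$ on the regression line, and the residual $e_2$ as the vertical distance between them]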
The residuals from the least-squares line have a special property: the mean of the
least-squares residuals is always zero.
Least Squares Method
Residual: $e_i = y_i - \hat{y}_i$
Sum of squared residuals: $\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
We want to minimize the total error (the sum of squared residuals) because that means
the data points are collectively as close to the model's values as possible.
Estimates for β0 and β1 are calculated so that the sum of squares of the errors (or
residuals) is minimized.
This method of estimating β0 and β1 is called the least-squares method.
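As a sketch of how this minimization yields the estimates (a standard calculus argument, not shown on the slide): write S for the sum of squared residuals as a function of candidate values of β0 and β1, and set both partial derivatives to zero. The symbol S is introduced here only for illustration.

```latex
\begin{align*}
S(\beta_0, \beta_1) &= \sum_{i=1}^{n} \left( y_i - \beta_0 - \beta_1 x_i \right)^2 \\
\frac{\partial S}{\partial \beta_0} = 0 &\;\Rightarrow\; \sum_{i=1}^{n} \left( y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i \right) = 0 \\
\frac{\partial S}{\partial \beta_1} = 0 &\;\Rightarrow\; \sum_{i=1}^{n} x_i \left( y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i \right) = 0
\end{align*}
```

The first condition says that the least-squares residuals sum to zero, which is why their mean is always zero.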
Least Squares Estimators
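$\hat{\beta}_0$ and $\hat{\beta}_1$ are the estimators of the intercept and slope obtained by the least-squares
method.
$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$
$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$
Using the sample data and the two formulas above, it is possible to
calculate $\hat{\beta}_0$ and $\hat{\beta}_1$. Thereby, we get the fitted model as follows.
Fitted regression model is
$\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X$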
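The calculation can be sketched in a few lines of Python, here applied to the production-line example (man-hours as x, amount of production as y). The function name `least_squares` is a hypothetical helper chosen for illustration, and the slope is computed with the standard closed-form least-squares expression.

```python
# Production-line example: x = man-hours, y = amount of production.
x = [300, 250, 200, 450, 500]
y = [10, 12, 6, 18, 20]


def least_squares(x, y):
    """Return (b0, b1), the least-squares intercept and slope (hypothetical helper)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sum of cross-products and sum of squares about the means.
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    return b0, b1


b0, b1 = least_squares(x, y)

# Fitted values and residuals (observed minus fitted).
fitted = [b0 + b1 * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]
mean_residual = sum(residuals) / len(residuals)

print(f"b0 = {b0:.4f}, b1 = {b1:.4f}")
print(f"mean residual = {mean_residual:.2e}")
```

For these five points the slope works out to 2860/67000, roughly 0.043 units of production per man-hour, and the mean residual is zero up to floating-point rounding.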
Regression Analysis on Minitab
• Investigating the relationship between quiz
averages and final exam scores.
Source: online.stat.psu.edu
• What is the independent variable?
• What is the dependent variable?
• What is $\hat{\beta}_0$ of the above regression equation?
• Interpret $\hat{\beta}_0$
• What is $\hat{\beta}_1$ of the above regression equation?
• Interpret $\hat{\beta}_1$
• What is the predicted final score when the quiz average is 20?
• Is this regression significant?
Assumptions on Regression
1. The mean of $y_i$ is a linear function of $x_i$.
2. The errors $e_i$ are normally distributed.
3. The errors have zero mean and constant
variance (homoscedasticity).
4. The errors $e_i$ are independent.
[Figure: normal distribution curve, labeled with its mean value and standard deviation]
Checking the Assumptions of the Model
Assumption 1:
The mean of 𝑦𝑖 is a linear function of 𝑥𝑖.
Testing Assumption
Draw a scatter plot of Y against X to check whether their
relationship is linear.
Checking the Assumptions of the Model
Assumption 2:
The errors 𝑒𝑖 are normally distributed
Testing Assumption
1. Draw a histogram for residuals
2. Draw a normal quantile plot (or normal probability plot) of
the residuals. When the residuals are normally distributed, the
points in this plot fall approximately along a straight line.
Checking the Assumptions of the Model
Assumption 3:
The errors have zero mean and constant
variance.
Testing Assumption
Draw a scatter plot of residuals versus fitted values.
The important thing to check here is that
there is no special pattern: the residuals
should scatter randomly around zero. A
violation shows up, for instance, as residuals
that increase or decrease in a regular way,
or that form distinct clusters.
Checking the Assumptions of the Model
Assumption 4:
The errors ( 𝑒𝑖) are independent.
Testing Assumption
Draw a scatter plot for residuals versus observation order
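The quantities behind these diagnostic plots can be sketched as follows, again using the production-line example. This is only an illustration under stated assumptions: the plotting positions (i - 0.5)/n for the normal quantile plot are one common convention rather than the one any particular package uses, and the variable names are hypothetical. The resulting coordinate pairs can be passed to any plotting tool.

```python
from statistics import NormalDist

# Production-line example: x = man-hours, y = amount of production.
x = [300, 250, 200, 450, 500]
y = [10, 12, 6, 18, 20]
n = len(x)

# Least-squares fit.
x_bar, y_bar = sum(x) / n, sum(y) / n
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

fitted = [b0 + b1 * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

# Assumption 2: normal quantile plot, i.e. sorted residuals against
# theoretical standard normal quantiles at positions (i - 0.5) / n (assumed convention).
std_normal = NormalDist()
theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
qq_points = list(zip(theoretical, sorted(residuals)))

# Assumption 3: residuals versus fitted values.
resid_vs_fitted = list(zip(fitted, residuals))

# Assumption 4: residuals versus observation order 1, 2, ..., n.
resid_vs_order = list(zip(range(1, n + 1), residuals))

for order, e in resid_vs_order:
    print(f"observation {order}: residual = {e:+.3f}")
```

With only five observations these plots say little; the sketch is meant to show what is plotted against what, not to reach a conclusion about this data set.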
Prediction
Fitted regression model
• Practical Interpretation: Using the regression model,
we would predict oxygen purity of 89.23% when the
hydrocarbon level is x = 1.00%.
• Exercise
The regression equation below relates the scores that students in an advanced
statistics course received for homework completed (x) and for the subsequent
midterm exam (y). Homework scores (x) are based on assignments that
preceded the exam. The maximum homework score a student could obtain
was 500 and the maximum midterm score was 350. The regression line that
was obtained is given by
$\hat{y} = -84.4 + 0.91x$
If a student had a homework score of 420, the midterm score would be
predicted to be (rounded to an integer)
a. 298.
b. 336.
c. 378.
d. None of the choices are correct.
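The arithmetic for this kind of prediction is a single substitution into the fitted line. A minimal sketch, where the function name `predict_midterm` is hypothetical:

```python
def predict_midterm(homework_score):
    """Predicted midterm score from the fitted line y_hat = -84.4 + 0.91 x."""
    return -84.4 + 0.91 * homework_score


prediction = predict_midterm(420)
print(round(prediction))  # -84.4 + 382.2 = 297.8, which rounds to 298
```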
Interpolation and Extrapolation
Interpolation
• Regression relationships are valid only for values of the regressor variable within
the range of the original data.
E.g.: Sarah's height was plotted against her age.
[Figure: scatter plot of height (cm), from 80 to 100, against age (months), from 30 to 65]
Fitted line: $\hat{y} = 71.95 + 0.383x$
Can you predict her height at age 42 months? $\hat{y} = 71.95 + 0.383(42) \approx 88$ cm
Interpolation and Extrapolation
Extrapolation
$\hat{y} = 71.95 + 0.383x$
Can we predict Sarah's height at age 30 years
(360 months)? $\hat{y} = 71.95 + 0.383(360) \approx 209.8$ cm, about 6'10.5"
The above prediction is an extrapolation. However,
regression models are not necessarily valid for
extrapolation purposes; they may not provide reliable
predictions.
[Figure: the fitted line extended over age (months) from 30 to 390, with height (cm) from 70 to 210]
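One way to make the distinction operational is to flag any prediction whose x value lies outside the range of the data used to fit the line. The sketch below assumes the observed ages span 30 to 65 months, which is read off the age axis of the first plot and is not stated in the text; the function and constant names are hypothetical.

```python
# Fitted line from the slides: height (cm) as a function of age (months).
B0, B1 = 71.95, 0.383

# ASSUMPTION: range of observed ages, taken from the age axis of the first plot.
AGE_MIN, AGE_MAX = 30, 65


def predict_height(age_months):
    """Return (predicted height in cm, True if age is outside the assumed data range)."""
    height = B0 + B1 * age_months
    is_extrapolation = not (AGE_MIN <= age_months <= AGE_MAX)
    return height, is_extrapolation


for age in (42, 360):
    height, extrapolated = predict_height(age)
    label = "extrapolation" if extrapolated else "interpolation"
    print(f"age {age} months: {height:.1f} cm ({label})")
```

The check does not make an extrapolated prediction any more reliable; it only marks it so that it is not read with the same confidence as a prediction inside the data range.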
Exercise
The data and the graph below show the scores students in an advanced statistics course
received for homework (Hw) completed and the subsequent midterm exam. Homework
scores are based on assignments that preceded the exam. The maximum homework
score a student could obtain was 500 and the maximum midterm score was 350. The
residual for the student whose homework score was 395 is
a. negative.
b. positive.
c. zero.
d. undetermined.
Hw score     387  275  280  459  395  314  428  366  421  234
Exam score   190  200  108  323  315  256  341  236  285  125
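Since the graph is not reproduced here, the residual can be checked numerically instead. The sketch below fits the least-squares line to the ten tabulated pairs and computes the residual (observed minus fitted exam score) for the student with a homework score of 395. Variable names are chosen for illustration.

```python
# Homework (x) and midterm exam (y) scores from the table above.
hw = [387, 275, 280, 459, 395, 314, 428, 366, 421, 234]
exam = [190, 200, 108, 323, 315, 256, 341, 236, 285, 125]

n = len(hw)
x_bar = sum(hw) / n
y_bar = sum(exam) / n

# Least-squares slope and intercept.
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(hw, exam))
s_xx = sum((x - x_bar) ** 2 for x in hw)
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

# Residual = observed exam score minus fitted exam score.
i = hw.index(395)
fitted_395 = b0 + b1 * hw[i]
residual_395 = exam[i] - fitted_395

print(f"fitted line: y_hat = {b0:.1f} + {b1:.2f} x")
print(f"observed = {exam[i]}, fitted = {fitted_395:.1f}, residual = {residual_395:.1f}")
```

The fitted coefficients agree, after rounding, with the line $\hat{y} = -84.4 + 0.91x$ quoted in the earlier exercise, and the observed score of 315 lies above the fitted value, so the residual is positive.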
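The slope $\hat{\beta}_1$ and the correlation coefficient r
There is a close connection between correlation and the slope of the least-squares
line. We can get
$\hat{\beta}_1 = r \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}}$
The slope and the correlation coefficient always have the same sign.
The least-squares regression line always
passes through $(\bar{x}, \bar{y})$, where
$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$ and $\bar{y} = \frac{\sum_{i=1}^{n} y_i}{n}$
[Figure: regression line on Y against X axes, passing through the point $(\bar{x}, \bar{y})$]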
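Both properties can be verified numerically. The sketch below uses the homework and exam scores from the preceding exercise, computes the slope once directly and once from the correlation coefficient, and evaluates the fitted line at the mean of x. Variable names are chosen for illustration.

```python
from math import sqrt

hw = [387, 275, 280, 459, 395, 314, 428, 366, 421, 234]
exam = [190, 200, 108, 323, 315, 256, 341, 236, 285, 125]

n = len(hw)
x_bar = sum(hw) / n
y_bar = sum(exam) / n

s_xx = sum((x - x_bar) ** 2 for x in hw)
s_yy = sum((y - y_bar) ** 2 for y in exam)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(hw, exam))

# Pearson correlation coefficient.
r = s_xy / sqrt(s_xx * s_yy)

# Slope computed directly, and via r times the ratio of spreads.
b1_direct = s_xy / s_xx
b1_from_r = r * sqrt(s_yy / s_xx)
b0 = y_bar - b1_direct * x_bar

# The fitted line evaluated at x_bar should return y_bar.
value_at_mean = b0 + b1_direct * x_bar

print(f"r = {r:.3f}")
print(f"slope (direct) = {b1_direct:.4f}, slope (from r) = {b1_from_r:.4f}")
print(f"line at x_bar = {value_at_mean:.1f}, y_bar = {y_bar:.1f}")
```

Because the square-root factor is always positive, the slope inherits its sign from r, which is the same-sign property stated above.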
Measure of Association
• Correlation Coefficient (r)
(Pearson’s product moment correlation coefficient)
Both X and Y are quantitative (Interval or ratio)
-1≤ r ≤ 1
• Coefficient of Determination (R2)
$R^2 = \frac{SSR}{SST}$; $0 \le R^2 \le 1$
Proportion of the variance of Y explained by the regressor variable X
Coefficient of Determination (R2)
• Values of R2 that are close to 1 imply that
most of the variability in Y is explained by the
regression model.
[Figure: two scatter plots, one labeled "R value is high" and one labeled "R value is low"]
Coefficient of Determination (R2)
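We often refer loosely to R2 as the
amount of variability in the data
explained or accounted for by the
regression model.
[Figure: data following a curve, with a fitted line that still gives R2 = 0.84]
A high R2 can occur even when the
relationship is curved. Therefore, R2 is
not always a good measure of how well
a linear model describes the data.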
SST = SSR + SSE
SST: the total variation of Y
SSR: the variation of Y associated with the regression model
SSE: the variation of Y associated with error
SST = SSR + SSE
• SSE (Sum of Squares of Error) = $\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
• SSR (Sum of Squares of Regression) = $\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$
• SST (Sum of Squares of Total) = $\sum_{i=1}^{n} (y_i - \bar{y})^2$
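These three sums, and the coefficient of determination built from them, can be computed directly from a fitted line. The sketch below uses the homework and exam scores from the earlier exercise; variable names are chosen for illustration.

```python
hw = [387, 275, 280, 459, 395, 314, 428, 366, 421, 234]
exam = [190, 200, 108, 323, 315, 256, 341, 236, 285, 125]

n = len(hw)
x_bar = sum(hw) / n
y_bar = sum(exam) / n

# Least-squares fit.
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(hw, exam))
s_xx = sum((x - x_bar) ** 2 for x in hw)
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar
fitted = [b0 + b1 * x for x in hw]

# Sums of squares as defined above.
sse = sum((y - f) ** 2 for y, f in zip(exam, fitted))  # error
ssr = sum((f - y_bar) ** 2 for f in fitted)            # regression
sst = sum((y - y_bar) ** 2 for y in exam)              # total

r_squared = ssr / sst

print(f"SSE = {sse:.1f}, SSR = {ssr:.1f}, SST = {sst:.1f}")
print(f"SSR + SSE = {ssr + sse:.1f}")
print(f"R^2 = {r_squared:.3f}")
```

For this data set roughly 71% of the variation in exam scores is accounted for by the regression on homework scores.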
Testing the Significance of the Regression
(Using ANOVA)
Source of variation   SS                                               Degrees of freedom   MS                        P
Regression            $SSR = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$   1                    $MSR = \frac{SSR}{1}$     p-value
Error                 $SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$       n-2                  $MSE = \frac{SSE}{n-2}$
Total                 $SST = \sum_{i=1}^{n} (y_i - \bar{y})^2$         n-1
n = sample size (number of data points)
Hypothesis
Null hypothesis: the gradient is zero, i.e. there is no significant linear
relationship.
H0 : β1 = 0
H1 : β1 ≠ 0
If p-value ≤ 0.05, then reject H0.
Therefore β1 ≠ 0, and the linear relationship between X
and Y is significant.
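The entries of this table can be sketched for the production-line example as follows. The ratio F = MSR/MSE computed at the end is the standard ANOVA test statistic from which software such as Minitab derives the p-value; it does not appear as a column in the table above, so treat it as an addition for illustration. Obtaining the p-value itself needs the F distribution with 1 and n-2 degrees of freedom, which is left to statistical software here.

```python
# Production-line example: x = man-hours, y = amount of production.
x = [300, 250, 200, 450, 500]
y = [10, 12, 6, 18, 20]
n = len(x)

# Least-squares fit.
x_bar, y_bar = sum(x) / n, sum(y) / n
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar
fitted = [b0 + b1 * xi for xi in x]

# Sums of squares.
ssr = sum((fi - y_bar) ** 2 for fi in fitted)
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
sst = sum((yi - y_bar) ** 2 for yi in y)

# Degrees of freedom and mean squares.
df_regression = 1
df_error = n - 2
msr = ssr / df_regression
mse = sse / df_error

# Standard ANOVA F statistic (not a column in the slide's table).
f_stat = msr / mse

print(f"Regression: SS = {ssr:.3f}, df = {df_regression}, MS = {msr:.3f}")
print(f"Error:      SS = {sse:.3f}, df = {df_error}, MS = {mse:.3f}")
print(f"Total:      SS = {sst:.3f}, df = {n - 1}")
print(f"F = {f_stat:.2f}")
```

A large F corresponds to a small p-value, and the decision rule is the one stated above: reject H0 when the p-value is at most 0.05.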
• A strong observed association between variables does not
necessarily imply that a causal relationship exists between
those variables.
Social Relationships and Health
House, J., Landis, K., and Umberson, D. “Social Relationships and
Health,” Science, Vol. 241 (1988), pp 540-545.
Does lack of social relationships cause people to become ill? (There was
a strong correlation.)
Or, are unhealthy people less likely to establish and maintain
social relationships? (reversed relationship)
Or, is there some other factor that predisposes people both to
have lower social activity and become ill?
• Designed experiments are the only way to determine
cause-and-effect relationships.