University of Gondar
College of Medicine and Health Science
Department of Epidemiology and Biostatistics
Linear regression
Haileab F. (BSc., MPH)
Scatter Plots and Correlation
Before fitting any model, it is best to inspect a scatter plot of the data
A scatter plot (or scatter diagram) is used to show the
relationship between two variables
If a scatter plot shows some sort of linear relationship, we
can use correlation analysis to measure the strength of the linear
relationship between the two variables
o Only concerned with strength of linear relationship and its
direction
o We consider the two variables equally; as a result no causal
effect is implied
Scatter Plot Examples
[Figure: four scatter plots of y against x; two panels show linear relationships and two show curvilinear relationships]
Scatter Plot Examples
[Figure: four scatter plots of y against x, contrasting strong relationships with weak relationships]
Scatter Plot Examples
[Figure: scatter plot of y against x showing no relationship at all]
Correlation Coefficient
The population correlation coefficient ρ (rho) measures
the strength of the linear association between the variables
The sample correlation coefficient r is an estimate of ρ
and is used to measure the strength of the linear
relationship in the sample observations
Features of ρ and r
Unit free
Range between -1 and 1
The closer to -1, the stronger the negative linear
relationship
The closer to 1, the stronger the positive linear relationship
The closer to 0, the weaker the linear relationship
Examples of Approximate r Values
[Figure: five scatter plots of y against x illustrating r = -1, r = -0.6, r = 0, r = +0.3, and r = +1]
Calculating the Correlation Coefficient
Sample correlation coefficient:
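$$ r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\left[\sum (x - \bar{x})^2\right]\left[\sum (y - \bar{y})^2\right]}} = \frac{SS_{xy}}{\sqrt{SS_{xx}\, SS_{yy}}} $$
or the algebraic equivalent:
$$ r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}} $$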
where:
r = Sample correlation coefficient
n = Sample size
x = Value of the ‘independent’ variable
y = Value of the ‘dependent’ variable
Example
Child height (cm)   Child weight (kg)
x                   y                   xy       x²       y²
35                  8                   280      1225     64
49                  9                   441      2401     81
27                  7                   189      729      49
33                  6                   198      1089     36
60                  13                  780      3600     169
21                  7                   147      441      49
45                  11                  495      2025     121
51                  12                  612      2601     144
Σx = 321            Σy = 73             Σxy = 3142   Σx² = 14111   Σy² = 713
Calculation Example
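With x = child height and y = child weight, as in the table:
$$ r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}} = \frac{8(3142) - (321)(73)}{\sqrt{\left[8(14111) - (321)^2\right]\left[8(713) - (73)^2\right]}} = 0.886 $$
[Figure: scatter plot with child height on the vertical axis (0 to 70) and child weight on the horizontal axis (0 to 14)]
r = 0.886 → relatively strong positive
linear association between x and y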
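The calculation above can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the function name pearson_r and the list names are illustrative choices, not part of the original slides.

```python
import math

# Child height (cm) and weight (kg) from the example table.
heights = [35, 49, 27, 33, 60, 21, 45, 51]
weights = [8, 9, 7, 6, 13, 7, 11, 12]

def pearson_r(x, y):
    """Sample correlation coefficient via the algebraic (computational) formula."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

r = pearson_r(heights, weights)
print(round(r, 3))  # 0.886
```

Swapping the two arguments gives the same value, because r treats the two variables equally.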
Significance Test for Correlation
Hypotheses
H0: ρ = 0 (no correlation)
HA: ρ ≠ 0 (correlation exists)
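Test statistic
$$ t = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}} $$ (with n - 2 degrees of freedom)
The degrees of freedom are n - 2 because any two points
can always be joined exactly by a straight line, so two
observations carry no information about the strength of the relationship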
Example:
Is there evidence of a linear relationship between child
height and weight at the 0.05 level of significance?
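H0: ρ = 0 (no correlation)
HA: ρ ≠ 0 (correlation exists)
α = 0.05, df = 8 - 2 = 6
$$ t = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}} = \frac{0.886}{\sqrt{\dfrac{1 - 0.886^2}{8 - 2}}} = 4.68 $$
Since t = 4.68 exceeds the two-sided critical value of 2.447 for df = 6, H0 is rejected:
there is evidence of a linear relationship between child height and weight at the 0.05 level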
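The test statistic for this example can be reproduced as follows. This is a minimal sketch: the function name correlation_t is illustrative, and the critical value 2.447 is an assumption taken from a standard t table (two-sided, α = 0.05, df = 6) rather than computed in the code.

```python
import math

def correlation_t(r, n):
    """t statistic for H0: rho = 0, with n - 2 degrees of freedom."""
    return r / math.sqrt((1 - r ** 2) / (n - 2))

r, n = 0.886, 8
t_stat = correlation_t(r, n)

# Assumption: tabled two-sided critical value for alpha = 0.05 and df = 6.
T_CRITICAL = 2.447

reject_h0 = abs(t_stat) > T_CRITICAL
print(round(t_stat, 2), reject_h0)  # 4.68 True
```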
Introduction to Regression Analysis
Regression analysis is used to:
Predict the value of a dependent variable based on the value of
at least one independent variable
Explain the impact of changes in an independent variable on the
dependent variable
Dependent variable: the variable we wish to explain. In linear
regression it is always a continuous variable.
Independent variable: the variable used to explain the dependent
variable. In linear regression it can have any measurement scale.
Simple Linear Regression Model
Only one independent variable, x
Relationship between x and y is described
by a linear function.
Changes in y are assumed to be caused by
changes in x
Population Linear Regression
The population regression model:
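$$ y = \beta_0 + \beta_1 x + \varepsilon $$
where y is the dependent variable, β0 is the population y intercept,
β1 is the population slope coefficient, x is the independent variable,
and ε is the random error term, or residual.
β0 + β1x is the linear component; ε is the random error component.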
Population Linear Regression
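[Figure: the population regression model y = β0 + β1x + ε drawn as a line with intercept β0 and slope β1; for a given xi, the random error εi is the vertical distance between the observed value of y and the predicted value of y on the line]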
Estimated Regression Model
The sample regression line provides an estimate of the
population regression line
$$ \hat{y}_i = b_0 + b_1 x $$
where ŷi is the estimated (or predicted) y value, b0 is the estimate of
the regression intercept, b1 is the estimate of the regression slope,
and x is the independent variable.
The individual random error terms ei have a mean of zero
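The slides do not state how b0 and b1 are obtained. Assuming the usual least-squares criterion (choose the line that minimizes the sum of squared errors), the estimates can be written with the same sums of squares used for the correlation coefficient:

```latex
b_1 = \frac{SS_{xy}}{SS_{xx}}
    = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2},
\qquad
b_0 = \bar{y} - b_1 \bar{x}
```

One consequence of the second formula is that the fitted line always passes through the point of means (x̄, ȳ).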
Interpretation of the Slope and the Intercept
b0 is the estimated average value of y when the value of x
is zero (provided that x = 0 is inside the data range considered).
Otherwise it has no direct practical interpretation and simply
represents the portion of y not explained by the independent
variables considered
b1 is the estimated change in the average value of y as a
result of a one-unit change in x
Example: Simple Linear Regression
A researcher wishes to examine the relationship between
the average daily amount of food eaten by each child in a sample of
20 children and the weight each child gained in one
month (both measured in kg). The content of the food is
the same for all children.
Dependent variable (y) = weight gained in one month,
in kilograms
Independent variable (x) = average weight of food eaten per
day by a child, in kilograms
Sample Data for Child Weight Model
Diet (x)   Weight gained (y)     Diet (x)   Weight gained (y)
0.4        0.65                  0.86       1.1
0.46       0.66                  0.89       1.12
0.55       0.63                  0.91       1.20
0.56       0.73                  0.93       1.32
0.65       0.78                  0.96       1.33
0.67       0.76                  0.98       1.35
0.78       0.72                  1.02       1.42
0.79       0.84                  1.04       1.1
0.80       0.87                  1.08       1.5
0.83       0.97                  1.11       1.3
Regression Using SPSS
Analyze > Regression > Linear...
Coefficients
              Unstandardized Coefficients    Standardized Coefficients
Model         B        Std. Error            Beta                         t        Sig.
(Constant)    0.160    .077                                               2.065    .054
foodweight    0.643    .073                  .900                         8.772    .000
Weight gained = 0.16 + 0.643(food weight)
Interpretation of the Intercept, b0
Weight gained = 0.16 + 0.643(food weight)
Here, no child ate 0 kg of food per day, so x = 0 lies outside
the range observed. b0 = 0.16 kg is therefore read only as the
portion of the weight gained not explained by food.
b1 = 0.643 tells us that the weight gained in one month
increases by 0.643 kg, on average, for each additional
kilogram of food taken per day
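The SPSS coefficients can be checked against the sample data with the closed-form least-squares formulas. This is a minimal sketch, and it rests on one loud assumption: the reported values (0.160 and 0.643) are reproduced when the second column of each pair in the data table is used as the predictor and the first column as the response. With the columns used exactly as labelled (y on x), the slope comes out at about 1.26 instead. The names col_first, col_second and simple_ols are illustrative.

```python
import math

# Columns of the sample data table, in the order they appear there.
col_first = [0.4, 0.46, 0.55, 0.56, 0.65, 0.67, 0.78, 0.79, 0.80, 0.83,
             0.86, 0.89, 0.91, 0.93, 0.96, 0.98, 1.02, 1.04, 1.08, 1.11]
col_second = [0.65, 0.66, 0.63, 0.73, 0.78, 0.76, 0.72, 0.84, 0.87, 0.97,
              1.1, 1.12, 1.20, 1.32, 1.33, 1.35, 1.42, 1.1, 1.5, 1.3]

def simple_ols(x, y):
    """Return intercept, slope, slope standard error and slope t statistic."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    ss_xx = sum((a - mean_x) ** 2 for a in x)
    ss_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    b1 = ss_xy / ss_xx
    b0 = mean_y - b1 * mean_x
    # Residual sum of squares, then mean squared error with n - 2 df.
    sse = sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, y))
    se_b1 = math.sqrt(sse / (n - 2) / ss_xx)
    return b0, b1, se_b1, b1 / se_b1

# Assumption: predictor = second column, response = first column.
b0, b1, se_b1, t_b1 = simple_ols(col_second, col_first)
print(f"{b0:.3f} {b1:.3f} {se_b1:.3f} {t_b1:.2f}")  # approximately 0.160 0.643 0.073 8.77
```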
Multiple linear regression
Multiple Linear Regression (MLR) is a
statistical method for estimating the
relationship between a dependent variable and
two or more independent (or predictor)
variables.
Function: Ypred = B0 + B1X1 + B2X2 + … + BnXn
Multiple Linear Regression
Simply, MLR is a method for studying the
relationship between a dependent variable
and two or more independent variables.
Purposes:
Prediction
Explanation
Theory building
Variations
Total variation in Y = predictable variation (explained by the
combination of independent variables) + unpredictable variation
Multiple regression
Example: Regress the percentage of body fat on age and sex
%fat    age     sex
9.5     23.0    0.0
27.9    23.0    1.0
7.8     27.0    0.0
17.8    27.0    0.0
31.4    39.0    1.0
25.9    41.0    1.0
27.4    45.0    0.0
25.2    49.0    1.0
31.1    50.0    1.0
34.7    53.0    1.0
42.0    53.0    1.0
42.0    54.0    1.0
29.1    54.0    1.0
32.5    56.0    1.0
30.3    57.0    1.0
21.0    57.0    1.0
33.0    58.0    1.0
33.8    58.0    1.0
41.1    60.0    1.0
34.5    61.0    1.0
The SPSS results follow.
Model Summary
Model   R       R Square   Adjusted R Square   Std. Error of the Estimate   R Square Change   F Change   df1   df2   Sig. F Change
1       .729a   .532       .506                6.5656                       .532              20.440     1     18    .000
2       .794b   .631       .587                5.9986                       .099              4.564      1     17    .047
(The last five columns are the Change Statistics.)
a. Predictors: (Constant), sex
b. Predictors: (Constant), sex, age
ANOVA
Model             Sum of Squares   df   Mean Square   F        Sig.
1   Regression    881.128          1    881.128       20.440   .000a
    Residual      775.932          18   43.107
    Total         1657.060         19
2   Regression    1045.346         2    522.673       14.525   .000b
    Residual      611.714          17   35.983
    Total         1657.060         19
a. Predictors: (Constant), sex
b. Predictors: (Constant), sex, age
c. Dependent Variable: %age of body fat
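The Model Summary figures for model 2 follow directly from the ANOVA sums of squares. The sketch below uses the standard definitions of R Square, Adjusted R Square, F and F Change; the variable names are illustrative, and the numbers are copied from the tables above.

```python
import math

# Sums of squares from the ANOVA table (model 2: sex and age).
ss_regression = 1045.346
ss_residual = 611.714
ss_total = 1657.060
n, k = 20, 2  # number of observations, number of predictors

ms_residual = ss_residual / (n - k - 1)  # residual mean square, df = 17

r_squared = ss_regression / ss_total
adj_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - k - 1)
f_stat = (ss_regression / k) / ms_residual
std_error_estimate = math.sqrt(ms_residual)

# F change when age is added to the sex-only model (model 1).
ss_regression_model1 = 881.128
f_change = (ss_regression - ss_regression_model1) / ms_residual

print(f"{r_squared:.3f} {adj_r_squared:.3f} {f_stat:.3f} {f_change:.3f}")
# 0.631 0.587 14.525 4.564
```

R Square = 0.631 means that sex and age together account for about 63% of the total variation in percentage body fat in this sample.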
Coefficients
                 Unstandardized Coefficients   Standardized Coefficients                     95% Confidence Interval for B
Model            B         Std. Error          Beta                        t       Sig.     Lower Bound   Upper Bound
1   (Constant)   15.625    3.283                                           4.760   .000     8.728         22.522
    sex          16.594    3.670               .729                        4.521   .000     8.883         24.305
2   (Constant)   6.209     5.331                                           1.165   .260     -5.039        17.457
    sex          10.130    4.517               .445                        2.243   .039     .600          19.659
    age          .309      .145                .424                        2.136   .047     .004          .614
a. Dependent Variable: %age of body fat relative to body
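The unstandardized coefficients for model 2 can be reproduced from the 20 observations by solving the normal equations (X'X)b = X'y. This is a minimal standard-library sketch, not the routine SPSS uses internally; the helper names solve and fit_ols are illustrative.

```python
# (%fat, age, sex) rows from the example data.
rows = [
    (9.5, 23, 0), (27.9, 23, 1), (7.8, 27, 0), (17.8, 27, 0),
    (31.4, 39, 1), (25.9, 41, 1), (27.4, 45, 0), (25.2, 49, 1),
    (31.1, 50, 1), (34.7, 53, 1), (42.0, 53, 1), (42.0, 54, 1),
    (29.1, 54, 1), (32.5, 56, 1), (30.3, 57, 1), (21.0, 57, 1),
    (33.0, 58, 1), (33.8, 58, 1), (41.1, 60, 1), (34.5, 61, 1),
]

def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_ols(X, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    p = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
    return solve(xtx, xty)

y = [fat for fat, _, _ in rows]
X = [[1.0, sex, age] for _, age, sex in rows]  # leading 1.0 is the intercept column
b0, b_sex, b_age = fit_ols(X, y)
print(f"{b0:.3f} {b_sex:.3f} {b_age:.3f}")  # approximately 6.209 10.130 0.309
```

Dropping the age column from X gives the model 1 coefficients, where the constant is the mean %fat of the sex = 0 group and the sex coefficient is the difference between the two group means.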
Write the model for the output.
Interpret the findings.
Thank you!!