Simple Linear Regression

The document provides an overview of simple linear regression, including its purpose, methodology, and interpretation of results. It discusses the relationship between dependent and independent variables, the least squares method for estimating regression coefficients, and the importance of residual analysis for validating assumptions. Additionally, it includes examples using real estate data to illustrate the application of regression analysis and the calculation of the coefficient of determination.

Basic Business Statistics

Simple Linear Regression


Business Analytics

Descriptive analytics
- What has happened?
- Mean, Min, Max
Predictive analytics
- What could happen?
- Regression, Time Series Analysis
Prescriptive analytics
- What should we do?
- Optimization, Simulation
Optimization/Simulation Example

Constraint: X + 2Y = 50; objective: maximize X * Y
- X = 48, Y = 1    →  X * Y = 48
- X = 2, Y = 24    →  X * Y = 48
- X = 25, Y = 12.5 →  X * Y = 312.5
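All three candidate pairs satisfy the constraint, but only the last maximizes the product. As an illustration not found in the original slides, a minimal Python sketch (assuming numpy is available) that finds the maximizing pair by substituting the constraint:

```python
# Illustrative sketch (not from the slides): maximize X * Y subject to X + 2Y = 50
# by substituting X = 50 - 2Y and scanning candidate values of Y on a fine grid.
import numpy as np

ys = np.linspace(0, 25, 2501)      # candidate Y values in steps of 0.01
xs = 50 - 2 * ys                   # X implied by the constraint
products = xs * ys
best = np.argmax(products)

print(xs[best], ys[best], products[best])   # 25.0, 12.5, 312.5
```

The grid search agrees with the calculus solution: substituting X = 50 − 2Y gives f(Y) = 50Y − 2Y², which is maximized at Y = 12.5 and X = 25.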
Correlation vs. Regression

- A scatter plot can be used to show the relationship between two variables
- Correlation analysis is used to measure the strength of the association (linear relationship) between two variables
  - Correlation is only concerned with the strength of the relationship
  - No causal effect is implied by correlation
Introduction to
Regression Analysis

- Regression analysis is used to:
  - Predict the value of a dependent variable based on the value of at least one independent variable
  - Explain the impact of changes in an independent variable on the dependent variable
- Dependent variable: the variable we wish to predict or explain
- Independent variable: the variable used to predict or explain the dependent variable
Simple Linear Regression Model

- Only one independent variable, X
- Changes in Y are assumed to be related to changes in X
- The relationship between X and Y is described by a linear function
Types of Relationships

[Figure: scatter plots of Y versus X illustrating linear relationships and curvilinear relationships]
Types of Relationships (continued)

[Figure: scatter plot of Y versus X illustrating no relationship]
Simple Linear Regression Model

Yi = β0 + β1Xi + εi

where:
- Yi = dependent variable
- β0 = population Y intercept
- β1 = population slope coefficient
- Xi = independent variable
- εi = random error term

The linear component is β0 + β1Xi; the random error component is εi.
Simple Linear Regression Model (continued)

[Figure: the population regression line Yi = β0 + β1Xi + εi, showing the observed value of Y for Xi, the predicted value of Y for Xi, the random error εi for this Xi value, the slope β1, and the intercept β0]
Simple Linear Regression Equation (Prediction Line)

The simple linear regression equation provides an estimate of the population regression line:

Ŷi = b0 + b1Xi

where:
- Ŷi = estimated (or predicted) Y value for observation i
- b0 = estimate of the regression intercept
- b1 = estimate of the regression slope
- Xi = value of X for observation i
The Least Squares Method

b0 and b1 are obtained by finding the values of b0 and b1 that minimize the sum of the squared differences between Yi and Ŷi:

min Σ(Yi − Ŷi)² = min Σ(Yi − (b0 + b1Xi))²
Finding the Least Squares Equation

- The coefficients b0 and b1, and other regression results in this chapter, will be found using Excel or Minitab
- Formulas are shown in the text for those who are interested (see the note below)
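For reference (the slide defers these formulas to the text), the standard closed-form least-squares estimates are:

$$
b_1 = \frac{\sum_{i=1}^{n}\left(X_i-\bar{X}\right)\left(Y_i-\bar{Y}\right)}{\sum_{i=1}^{n}\left(X_i-\bar{X}\right)^{2}}, \qquad b_0 = \bar{Y} - b_1\bar{X}
$$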
Interpretation of the
Slope and the Intercept

- b0 is the estimated value of Y when the value of X is zero
- b1 is the estimated change in the value of Y as a result of a one-unit change in X
Simple Linear Regression Example

- A real estate agent wishes to examine the relationship between the selling price of a home and its size (measured in square feet)
- A random sample of 10 houses is selected
  - Dependent variable (Y) = house price in $1000s
  - Independent variable (X) = square feet
Simple Linear Regression Example:
Data

House Price in $1000s (Y)    Square Feet (X)
245                          1400
312                          1600
279                          1700
308                          1875
199                          1100
219                          1550
405                          2350
324                          2450
319                          1425
255                          1700
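As an aside not in the original slides, a minimal Python sketch (assuming numpy is available) that fits the least-squares line to this data; the coefficients match the Excel and Minitab output shown on the following slides.

```python
# Illustrative sketch (not from the slides): least-squares fit of the house data with numpy.
import numpy as np

sqft  = np.array([1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700])
price = np.array([245, 312, 279, 308, 199, 219, 405, 324, 319, 255])   # in $1000s

b1, b0 = np.polyfit(sqft, price, 1)      # slope first, then intercept
print(f"b0 = {b0:.5f}, b1 = {b1:.5f}")   # approximately b0 = 98.24833, b1 = 0.10977
```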
Simple Linear Regression Example:
Scatter Plot

House price model: scatter plot

[Figure: scatter plot of house price ($1000s) versus square feet for the 10 sampled houses]
Simple Linear Regression Example:
Using Excel
Simple Linear Regression Example:
Excel Output

Regression Statistics
Multiple R           0.76211
R Square             0.58082
Adjusted R Square    0.52842
Standard Error       41.33032
Observations         10

The regression equation is:
house price = 98.24833 + 0.10977 (square feet)

ANOVA
             df   SS           MS           F         Significance F
Regression   1    18934.9348   18934.9348   11.0848   0.01039
Residual     8    13665.5652   1708.1957
Total        9    32600.5000

              Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
Intercept     98.24833       58.03348         1.69296   0.12892   -35.57720   232.07386
Square Feet   0.10977        0.03297          3.32938   0.01039   0.03374     0.18580

Simple Linear Regression Example:
Minitab Output

The regression equation is
Price = 98.2 + 0.110 Square Feet

Predictor     Coef      SE Coef   T      P
Constant      98.25     58.03     1.69   0.129
Square Feet   0.10977   0.03297   3.33   0.010

S = 41.3303   R-Sq = 58.1%   R-Sq(adj) = 52.8%

Analysis of Variance

Source           DF   SS      MS      F       P
Regression       1    18935   18935   11.08   0.010
Residual Error   8    13666   1708
Total            9    32600

That is: house price = 98.24833 + 0.10977 (square feet)
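The slides use Excel and Minitab; as a sketch not in the original, the same output can be reproduced in Python with statsmodels (assuming it is installed):

```python
# Illustrative sketch (not from the slides): reproducing the regression output with statsmodels.
import numpy as np
import statsmodels.api as sm

sqft  = np.array([1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700])
price = np.array([245, 312, 279, 308, 199, 219, 405, 324, 319, 255])

X = sm.add_constant(sqft)            # add the intercept column
results = sm.OLS(price, X).fit()
print(results.summary())             # coefficients, standard errors, t stats, p-values, R-squared, F
```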
Simple Linear Regression Example:
Graphical Representation

House price model: scatter plot and prediction line

[Figure: scatter plot of house price ($1000s) versus square feet with the fitted prediction line; intercept = 98.248, slope = 0.10977]

house price = 98.24833 + 0.10977 (square feet)
Simple Linear Regression Example:
Interpretation of b0

house price = 98.24833 + 0.10977 (square feet)

- b0 is the estimated value of Y when the value of X is zero (if X = 0 is in the range of observed X values)
- Because a house cannot have a square footage of 0, b0 has no practical application
Simple Linear Regression Example:
Interpreting b1

house price = 98.24833 + 0.10977 (square feet)

- b1 estimates the change in the value of Y as a result of a one-unit increase in X
- Here, b1 = 0.10977 tells us that the value of a house increases by 0.10977($1000) = $109.77, on average, for each additional square foot of size
Simple Linear Regression
Example: Making Predictions

Predict the price for a house with 2000 square feet:

house price = 98.25 + 0.1098 (sq. ft.)
            = 98.25 + 0.1098(2000)
            = 317.85

The predicted price for a house with 2000 square feet is 317.85 ($1000s) = $317,850
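As a small sketch not in the original slides, the same prediction in Python (the function name is illustrative):

```python
# Illustrative sketch (not from the slides): prediction with the fitted equation.
def predict_price(square_feet, b0=98.25, b1=0.1098):
    """Predicted house price in $1000s for a home of the given size."""
    return b0 + b1 * square_feet

print(predict_price(2000))   # 317.85, i.e. about $317,850
```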
Simple Linear Regression Example:
Making Predictions

- When using a regression model for prediction, only predict within the relevant range of data (the relevant range for interpolation)

[Figure: scatter plot of house price ($1000s) versus square feet with the prediction line, marking the relevant range for interpolation; do not try to extrapolate beyond the range of observed X's]
Measures of Variation

- Total variation is made up of two parts:

SST = SSR + SSE

- SST = Total Sum of Squares:      SST = Σ(Yi − Ȳ)²
- SSR = Regression Sum of Squares: SSR = Σ(Ŷi − Ȳ)²
- SSE = Error Sum of Squares:      SSE = Σ(Yi − Ŷi)²

where:
- Ȳ  = mean value of the dependent variable
- Yi = observed value of the dependent variable
- Ŷi = predicted value of Y for the given Xi value
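As an illustration not in the original slides, a short Python sketch (assuming numpy) that computes the three sums of squares for the house-price data and confirms the decomposition:

```python
# Illustrative sketch (not from the slides): SST, SSR and SSE for the house-price data.
import numpy as np

sqft  = np.array([1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700])
price = np.array([245, 312, 279, 308, 199, 219, 405, 324, 319, 255])

b1, b0 = np.polyfit(sqft, price, 1)
y_hat = b0 + b1 * sqft

sst = np.sum((price - price.mean()) ** 2)   # total variation        (~32600.5)
ssr = np.sum((y_hat - price.mean()) ** 2)   # explained variation    (~18934.9)
sse = np.sum((price - y_hat) ** 2)          # unexplained variation  (~13665.6)

print(sst, ssr + sse)      # SST = SSR + SSE
print(ssr / sst)           # ~0.58082, the coefficient of determination r2 discussed below
```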
Measures of Variation
(continued)

- SST = total sum of squares (Total Variation)
  - Measures the variation of the Yi values around their mean Ȳ
- SSR = regression sum of squares (Explained Variation)
  - Variation attributable to the relationship between X and Y
- SSE = error sum of squares (Unexplained Variation)
  - Variation in Y attributable to factors other than X
Measures of Variation (continued)

[Figure: a single observation (Xi, Yi) plotted with the fitted line and the mean Ȳ, illustrating SST = Σ(Yi − Ȳ)², SSR = Σ(Ŷi − Ȳ)², and SSE = Σ(Yi − Ŷi)²]
Coefficient of Determination, r2

- The coefficient of determination is the portion of the total variation in the dependent variable that is explained by variation in the independent variable
- The coefficient of determination is also called r-squared and is denoted as r²

r² = SSR / SST = regression sum of squares / total sum of squares

note: 0 ≤ r² ≤ 1
Examples of Approximate
r2 Values

r² = 1

[Figure: scatter plots where every point lies exactly on the fitted line]

Perfect linear relationship between X and Y: 100% of the variation in Y is explained by variation in X
Examples of Approximate
r2 Values

0 < r² < 1

[Figure: scatter plots where points are scattered around the fitted line]

Weaker linear relationships between X and Y: some but not all of the variation in Y is explained by variation in X
Examples of Approximate
r2 Values

r² = 0

[Figure: scatter plot showing no relationship between X and Y]

No linear relationship between X and Y: the value of Y does not depend on X (none of the variation in Y is explained by variation in X)
Simple Linear Regression Example:
Coefficient of Determination, r2 in Excel

Regression Statistics
Multiple R           0.76211
R Square             0.58082
Adjusted R Square    0.52842
Standard Error       41.33032
Observations         10

r² = SSR / SST = 18934.9348 / 32600.5000 = 0.58082

58.08% of the variation in house prices is explained by variation in square feet

ANOVA
             df   SS           MS           F         Significance F
Regression   1    18934.9348   18934.9348   11.0848   0.01039
Residual     8    13665.5652   1708.1957
Total        9    32600.5000

              Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
Intercept     98.24833       58.03348         1.69296   0.12892   -35.57720   232.07386
Square Feet   0.10977        0.03297          3.32938   0.01039   0.03374     0.18580
Simple Linear Regression Example:
Coefficient of Determination, r2 in Minitab

The regression equation is
Price = 98.2 + 0.110 Square Feet

Predictor     Coef      SE Coef   T      P
Constant      98.25     58.03     1.69   0.129
Square Feet   0.10977   0.03297   3.33   0.010

S = 41.3303   R-Sq = 58.1%   R-Sq(adj) = 52.8%

Analysis of Variance

Source           DF   SS      MS      F       P
Regression       1    18935   18935   11.08   0.010
Residual Error   8    13666   1708
Total            9    32600

r² = SSR / SST = 18934.9348 / 32600.5000 = 0.58082

58.08% of the variation in house prices is explained by variation in square feet
Assumptions of Regression
L.I.N.E

- Linearity
  - The relationship between X and Y is linear
- Independence of Errors
  - Error values are statistically independent
- Normality of Error
  - Error values are normally distributed for any given value of X
- Equal Variance (also called homoscedasticity)
  - The probability distribution of the errors has constant variance
Residual Analysis

ei = Yi − Ŷi

- The residual for observation i, ei, is the difference between its observed and predicted value
- Check the assumptions of regression by examining the residuals
  - Examine for linearity assumption
  - Evaluate independence assumption
  - Evaluate normal distribution assumption
  - Examine for constant variance for all levels of X (homoscedasticity)
- Graphical analysis of residuals: can plot residuals vs. X (see the sketch below)
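A minimal Python sketch, not in the original slides, for producing the residual-versus-X plot (assuming numpy and matplotlib are available):

```python
# Illustrative sketch (not from the slides): residuals plotted against X.
import numpy as np
import matplotlib.pyplot as plt

sqft  = np.array([1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700])
price = np.array([245, 312, 279, 308, 199, 219, 405, 324, 319, 255])

b1, b0 = np.polyfit(sqft, price, 1)
residuals = price - (b0 + b1 * sqft)        # e_i = Y_i - Yhat_i

plt.scatter(sqft, residuals)
plt.axhline(0, linestyle="--")              # reference line at zero
plt.xlabel("Square Feet")
plt.ylabel("Residuals")
plt.title("House Price Model Residual Plot")
plt.show()
```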
Residual Analysis for Linearity

[Figure: scatter plots and residual plots versus X contrasting a relationship that is not linear with one that is linear]
Residual Analysis for Independence

[Figure: residual plots versus X contrasting residuals that are not independent with residuals that are independent]
Residual Analysis for Normality

[Figure: residual plots versus X contrasting non-normal errors with normal errors]

Lots of large residuals: no good. Lots of small residuals (around the X axis) and only sporadic large residuals: good.
Residual Analysis for
Equal Variance

[Figure: scatter plots and residual plots versus X contrasting non-constant variance with constant variance]

Simple Linear Regression
Example: Excel Residual Output

RESIDUAL OUTPUT

Observation   Predicted House Price   Residuals
1             251.92316               -6.923162
2             273.87671               38.12329
3             284.85348               -5.853484
4             304.06284               3.937162
5             218.99284               -19.99284
6             268.38832               -49.38832
7             356.20251               48.79749
8             367.17929               -43.17929
9             254.66740               64.33264
10            284.85348               -29.85348

[Figure: House Price Model residual plot (residuals versus square feet)]

Does not appear to violate any regression assumptions
Z Distribution, t Distribution, and the t Test

- t statistic: b1 / Sb1
- Compared against the critical value, determined by α (alpha) and the degrees of freedom (df)

Inferences About the Slope:
t Test Example

Estimated regression equation:

house price = 98.25 + 0.1098 (sq. ft.)

House Price in $1000s (Y)    Square Feet (X)
245                          1400
312                          1600
279                          1700
308                          1875
199                          1100
219                          1550
405                          2350
324                          2450
319                          1425
255                          1700

The slope of this model is 0.1098. Is there a relationship between the square footage of the house and its sales price?
Inferences About the Slope:
t Test

- t test for a population slope
  - Is there a linear relationship between X and Y?
- Null and alternative hypotheses
  - H0: β1 = 0 (no linear relationship)
  - H1: β1 ≠ 0 (linear relationship does exist)
- Test statistic:

  tSTAT = (b1 − β1) / Sb1,   d.f. = n − 2

  where:
  - b1 = regression slope coefficient
  - β1 = hypothesized slope
  - Sb1 = standard error of the slope
Inferences About the Slope:
t Test Example

H0: β1 = 0
H1: β1 ≠ 0

From Excel output:
              Coefficients   Standard Error   t Stat    P-value
Intercept     98.24833       58.03348         1.69296   0.12892
Square Feet   0.10977        0.03297          3.32938   0.01039

From Minitab output:
Predictor     Coef      SE Coef   T      P
Constant      98.25     58.03     1.69   0.129
Square Feet   0.10977   0.03297   3.33   0.010

Here b1 = 0.10977 and Sb1 = 0.03297, so:

tSTAT = (b1 − β1) / Sb1 = (0.10977 − 0) / 0.03297 = 3.32938
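As a sketch not in the original slides, the same test statistic, critical value, and p-value can be computed in Python (assuming scipy is available):

```python
# Illustrative sketch (not from the slides): t test for the slope using the reported
# coefficient and standard error.
from scipy import stats

b1, se_b1, n = 0.10977, 0.03297, 10
t_stat = (b1 - 0) / se_b1                 # hypothesized slope under H0 is 0
df = n - 2
t_crit = stats.t.ppf(1 - 0.05 / 2, df)    # two-tailed critical value at alpha = 0.05
p_value = 2 * stats.t.sf(abs(t_stat), df)

print(t_stat)    # ~3.3294
print(t_crit)    # ~2.3060
print(p_value)   # ~0.0104 -> reject H0
```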
Inferences About the Slope:
t Test Example

H0: β1 = 0
H1: β1 ≠ 0

Test statistic: tSTAT = 3.329
d.f. = 10 − 2 = 8
α/2 = .025 in each tail, so the critical values are −tα/2 = −2.3060 and tα/2 = 2.3060

Decision: Reject H0, since tSTAT = 3.329 falls in the upper rejection region.

There is sufficient evidence that square footage affects house price.
Inferences About the Slope:
t Test Example
H0: β1 = 0
H1: β1 ≠ 0

From Excel output:
              Coefficients   Standard Error   t Stat    P-value
Intercept     98.24833       58.03348         1.69296   0.12892
Square Feet   0.10977        0.03297          3.32938   0.01039

From Minitab output:
Predictor     Coef      SE Coef   T      P
Constant      98.25     58.03     1.69   0.129
Square Feet   0.10977   0.03297   3.33   0.010

The p-value for the slope is 0.01039 (Excel) or 0.010 (Minitab).

Decision: Reject H0, since p-value < α.

There is sufficient evidence that square footage affects house price.
Confidence Interval Estimate
for the Slope

Confidence interval estimate of the slope:

b1 ± tα/2 Sb1,   d.f. = n − 2

Excel printout for house prices:

              Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
Intercept     98.24833       58.03348         1.69296   0.12892   -35.57720   232.07386
Square Feet   0.10977        0.03297          3.32938   0.01039   0.03374     0.18580

At the 95% level of confidence, the confidence interval for the slope is (0.0337, 0.1858).
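The same interval can be reproduced from the reported coefficient and standard error; a short Python sketch not in the original slides (assuming scipy):

```python
# Illustrative sketch (not from the slides): 95% confidence interval for the slope.
from scipy import stats

b1, se_b1, df = 0.10977, 0.03297, 8
t_crit = stats.t.ppf(0.975, df)                 # ~2.3060 for d.f. = 8
lower, upper = b1 - t_crit * se_b1, b1 + t_crit * se_b1

print(lower, upper)    # ~(0.0337, 0.1858), matching the Excel output
```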
Confidence Interval Estimate
for the Slope

(continued)

              Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
Intercept     98.24833       58.03348         1.69296   0.12892   -35.57720   232.07386
Square Feet   0.10977        0.03297          3.32938   0.01039   0.03374     0.18580

- Since the units of the house price variable are $1000s, we are 95% confident that the average impact on sales price is between $33.74 and $185.80 per square foot of house size
- This 95% confidence interval does not include 0. Conclusion: there is a significant relationship between house price and square feet at the .05 level of significance
