
CH 16 Aslr

This document discusses simple linear regression. It defines the components of the simple regression equation (β1, β0) and explains how to interpret them. It demonstrates the least squares method for calculating β1 and β0, which finds the line that best fits the data by minimizing the vertical distances between the data points and the regression line. It also defines the correlation coefficient and coefficient of determination, and discusses the relationship between correlation and causation.

Uploaded by

benny

Simple Linear Regression

________________________________________

1) Discuss conceptual differences between ANOVA and regression.

2) Identify the components of the simple regression equation (β1, β0) and explain their interpretation.

3) Demonstrate the Least-Squares method for calculating β1 and β0.

4) Develop a measure for error in the regression model and demonstrate a method for comparing the variance due to error with the variance due to our model.

5) Define and explain the correlation coefficient and the coefficient of determination.

6) Discuss the relationship between correlation and causation.
CI vs. ANOVA vs. Regression
__________________________________________

Key word for CI:

Key word for ANOVA:


t-test

Key word(s) for regression:


Trivia Wars
______________________________________

Let’s say Amherst declares war on Northampton because Northampton tries to lure Judie's into moving out of Amherst. No one actually wants to kill anyone, so we decide to settle our differences with a rousing game of Jeopardy! You are elected the Captain of Amherst’s team (as if you would be selected instead of me). How are you going to choose the team?

Multiple criteria:
1) Knowledge
2) Performance under pressure
EX: Cindy Brady
3) Speed

Historical roots in WW II
Who would be a good ball turret gunner?
Regression
______________________________________

What is the relationship between…

Grades or Money or Relationship Status or Health

…and Life Satisfaction?
______________________________________

How well can I predict a person’s Life Satisfaction if I know their …

Grades or Money or Relationship Status or Health
______________________________________

How are we going to do this?



General form of Probabilistic (Regression) Models
________________________________________

y= +

or

y = regression line + error

or

y= +

_______________________________________

E(y) -
 Regression line connects
Simple Regression
First-Order
Single-Predictor
___________________________________________

y = β0 + β1x + ε
y =

x =

E(y) =

ε =
______________________________________

β0

y = mx + b

β1
Interpretation of y-intercept and slope
________________________________________

Intercept
 Intercept only makes sense if x

 Regression equation only applies

________________________________________

Slope
 Change in y for a unit change in x.
o + implies relationship
o – implies relationship

________________________________________

Most important point:

Give me a value for x and the regression equation


and I can
Steps to completing a regression analysis
(both simple and multiple)
________________________________________

Step 1: Hypothesize the deterministic component of the model.

Step 2: Use sample data to

Step 3: Specify the probability distribution of the

Step 4: Evaluate the usefulness of

Step 5: Use the model for
Fitting a model to our data (Step 2)
________________________________________

Least-Squares method

1) Sum of the vertical distance between each point

2) Square of the vertical distance is


When in doubt, think Bribery!!
____________________________________________

You want to determine the relationship between monetary gifts and "BONUS POINTS FOR SPECIAL CONTRIBUTIONS TO CLASS" added to your final average so that you can decide how large a check to write at the end of the semester (though I do prefer cash for tax purposes). Let's say x represents the amount of money contributed by past students, and y represents the number of "Bonus Points" awarded to them.

[Figure: "Bribery" scatter plot of Bonus Points (y-axis) against Donation (x-axis, 0 to 10)]
Fishing for a regression line
________________________________________
[Figure: scatter plot of Bonus Points against Donation showing the data (Series1) and two candidate lines, y = x + 1 and y = 5]

X (Gift)   Y (BP)   Distance   Distance    Squared Distance   Squared Distance
                    (y = 5)    (y = x+1)   (y = 5)            (y = x+1)
4          1        -4         -4          16                 16
8          9         4          0          16                  0
2          5         0          2           0                  4
6          5         0         -2           0                  4
Sum                  0         -4          32                 24
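The comparison in the table can be reproduced with a short Python sketch. The function name `sum_squared_distance` is a hypothetical helper; the data and the two candidate lines are the ones from the table.

```python
gifts = [4, 8, 2, 6]   # x: donation
bonus = [1, 9, 5, 5]   # y: bonus points

def sum_squared_distance(xs, ys, predict):
    """Sum of squared vertical distances between each point and a line."""
    return sum((y - predict(x)) ** 2 for x, y in zip(xs, ys))

flat = sum_squared_distance(gifts, bonus, lambda x: 5)        # line y = 5
sloped = sum_squared_distance(gifts, bonus, lambda x: x + 1)  # line y = x + 1
print(flat, sloped)  # 32 24
```

The raw distances can cancel out (they sum to 0 for y = 5), which is why the squared distances are the ones compared.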

Which regression line is better?


Is that the ‘best’ regression line?
Formulae for Least Squares Method
________________________________________

β1 = SP / SSx

β0 = My – (β1 * Mx)

______________________________________________

SSx = Σ(x²) – [(Σx)² / n]

SP = Σ(xy) – [(Σx)(Σy) / n]
Finding the best-fit regression line
________________________________________

x     y     x²     xy
4     1     16     4
8     9     64     72
2     5     4      10
6     5     36     30
Σx = 20   Σy = 20   Σ(x²) = 120   Σ(xy) = 116

SSx = Σ(x²) – [(Σx)² / n]
    = 120 – [(20)² / 4]
    = 120 – (400 / 4)
    = 120 – 100 = 20

SP = Σ(xy) – [(Σx)(Σy) / n]
   = 116 – [(20)(20) / 4]
   = 116 – (400 / 4)
   = 116 – 100 = 16
________________________________________

β1 = SP / SSx
   = 16 / 20 = 0.8

β0 = My – (β1 * Mx)
   = 5 – (.8)(5) = 1.0
________________________________________
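As a rough check on the hand calculation, here is a small Python sketch of the same least-squares formulas. The function name `least_squares` is a hypothetical helper, not something defined in these notes.

```python
def least_squares(xs, ys):
    """Return (beta0, beta1) using SSx and SP as defined in the notes."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    ss_x = sum(x * x for x in xs) - sum_x ** 2 / n               # SSx
    sp = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n  # SP
    beta1 = sp / ss_x                                            # slope
    beta0 = sum_y / n - beta1 * (sum_x / n)                      # My - beta1 * Mx
    return beta0, beta1

beta0, beta1 = least_squares([4, 8, 2, 6], [1, 9, 5, 5])
print(round(beta0, 6), round(beta1, 6))  # 1.0 0.8
```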
The Least-Squares Regression Line
________________________________________

[Figure: scatter plot of Bonus Points against Donation showing the data (Series1), the lines y = x + 1 and y = 5, and the Least Squares Reg. Line]

x     y     E(y)     Distance     Squared Distance
4     1     4.2      -3.2         10.24
8     9     7.4       1.6          2.56
2     5     2.6       2.4          5.76
6     5     5.8      -0.8          0.64
Sum                   0           19.20
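A short Python sketch that puts the three lines side by side on the same four data points. The dictionary layout and the helper name `sse` are illustrative choices; the lines and data are the ones used in this example.

```python
gifts = [4, 8, 2, 6]
bonus = [1, 9, 5, 5]

def sse(predict):
    """Sum of squared vertical distances from the points to a line."""
    return sum((y - predict(x)) ** 2 for x, y in zip(gifts, bonus))

candidates = {
    "y = 5": lambda x: 5,
    "y = x + 1": lambda x: x + 1,
    "y = 1 + 0.8x": lambda x: 1 + 0.8 * x,  # least-squares line
}
totals = {name: round(sse(line), 2) for name, line in candidates.items()}
best = min(totals, key=totals.get)  # line with the smallest squared distance
print(totals)
print(best)
```

Of these three candidates, the least-squares line has the smallest sum of squared distances (19.20 against 24 and 32).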
Testing Example
__________________________________________

Unbeknownst to you, Biff is the heir to his family’s Widget fortune. For his summer job, Biff was asked to evaluate a group of employees’ widget making ability using a standardized widget-making test. Biff’s boss (Uncle Buck) asks Biff to determine the regression equation that one would use to predict performance on the test from years of service with the company. The data appear below.

x (years)   y (score)   x²     y²       xy
3           55          9      3025     165
4           78          16     6084     312
4           72          16     5184     288
2           58          4      3364     116
5           89          25     7921     445
3           63          9      3969     189
4           73          16     5329     292
5           84          25     7056     420
3           75          9      5625     225
2           48          4      2304     96
Σx = 35   Σy = 695   Σ(x²) = 133   Σ(y²) = 49,861   Σ(xy) = 2,548
Calculations
______________________________________________

SSx = Σ(x²) – [(Σx)² / n]

SP = Σ(xy) – [(Σx)(Σy) / n]

______________________________________________

β1 = SP / SSx

β0 = My – (β1 * Mx)
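One way to check the widget calculations is the Python sketch below, which applies the same formulas to the ten employees. The variable names are illustrative. SSx = 10.5, SP = 115.5, and a slope of 11.0 match the values used later in these notes; the intercept of 31.0 is not stated on the slides and simply follows from the formula.

```python
years = [3, 4, 4, 2, 5, 3, 4, 5, 3, 2]            # x: years of service
score = [55, 78, 72, 58, 89, 63, 73, 84, 75, 48]  # y: test score

n = len(years)
ss_x = sum(x * x for x in years) - sum(years) ** 2 / n                       # SSx
sp = sum(x * y for x, y in zip(years, score)) - sum(years) * sum(score) / n  # SP
beta1 = sp / ss_x                                # slope
beta0 = sum(score) / n - beta1 * sum(years) / n  # My - beta1 * Mx
print(ss_x, sp, beta1, beta0)  # 10.5 115.5 11.0 31.0
```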
Widget Test Scatter Plot
________________________________________

[Figure: scatter plot of Test Score (y-axis, 40 to 100) against Experience (x-axis, 2 to 5)]
Assumptions regarding Error (ε)
________________________________________

ε: essentially vertical distance from regression line

_________________________________________

1) The mean of the probability distribution =

2) The variance of the probability distribution of ε is

3) Distribution of ε is

4) Values of ε are ____ of one another.


Factors that contribute to Error
________________________________________

Two types of Error

1) Measurement Error -

EX: incorrect reading of beaker

2) Chance factors
EX: unusually non/reactive chemical
Estimation of Variability due to Error (Step 3)
________________________________________

s² is analogous to MSE
s² = SSE / dferror = SSE / (n – 2)

SSE = SSy – β1(SP)

SSy = Σ(y²) – [(Σy)² / n]

________________________________________

s² = SSE / (n-2) = MSE

s = Estimated Standard Error of the Regression Model

or

= Root MSE
Calculate the error
______________________________________

SSy = Σ(y²) – [(Σy)² / n]
    = 49,861 – [(695)² / 10]
    = 49,861 – (483,025 / 10)
    = 49,861 – 48,302.5 = 1558.5

SSE = SSy – β1(SP)
    = 1558.5 – 11.0(115.5)
    = 1558.5 – 1270.5 = 288

s² = SSE / (n-2)
   = 288 / (10-2) = 36
(a/k/a MSE)

s = √36 = 6
(a/k/a Root MSE)
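The error calculation for the widget example can be sketched in Python from the summary totals alone. The variable names are illustrative; the inputs (n = 10, Σy = 695, Σ(y²) = 49,861, slope 11.0, SP = 115.5) are the ones used in the hand calculation.

```python
import math

n = 10
sum_y, sum_y2 = 695, 49861   # totals from the widget data table
beta1, sp = 11.0, 115.5      # slope and SP from the fitted line

ss_y = sum_y2 - sum_y ** 2 / n   # SSy
sse = ss_y - beta1 * sp          # SSE
s2 = sse / (n - 2)               # s squared (MSE)
s = math.sqrt(s2)                # s (Root MSE)
print(ss_y, sse, s2, s)  # 1558.5 288.0 36.0 6.0
```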
Important points about error or ε
________________________________________

1. The smaller ε, the better we can

2. The smaller ε, the more the individual data points will be around the regression line.

3. A smaller ε implies that x is a predictor of y.

Why?

Also, can use this information to develop a sense of how far points should fall off the line.
 • We can calculate a CI around the regression line. 95% of our points should fall within about 2 RMSEs of the regression line. If not, HMMMM…
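The "about 2 RMSEs" rule of thumb can be tried on the widget data with a short Python sketch. The slope of 11.0 and Root MSE of 6 come from the worked example; the intercept of 31.0 is an assumption derived here from β0 = My – (β1 * Mx) and is not stated on the slides.

```python
years = [3, 4, 4, 2, 5, 3, 4, 5, 3, 2]
score = [55, 78, 72, 58, 89, 63, 73, 84, 75, 48]
beta0, beta1 = 31.0, 11.0   # intercept derived from the formulas (assumption)
s = 6.0                     # Root MSE from the worked example

# Vertical distance of each point from the regression line
residuals = [y - (beta0 + beta1 * x) for x, y in zip(years, score)]
within = sum(abs(e) <= 2 * s for e in residuals)
print(within, len(residuals))  # 10 10
```

With only ten points, all of them landing inside the band is consistent with the rule; the largest distance is 11 points, just under 2 × 6 = 12.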
Evaluate the usefulness of the model (Step 4)
________________________________________

Step 1: Specify the null and alternative hypotheses.
 • Ho: β1 = 0
 • Ha: β1 ≠ 0

Step 2: Designate the rejection region by selecting α.

Step 3: Obtain the critical value for your test statistic
 • t
 • df = n-2

Step 4: Collect your data

Step 5: Use your sample data to calculate:
 • β1 = SP / SSx
 • sβ1 = SE = s / √SSx

Step 6: Use your parameter estimates to calculate the observed value of your test statistic
 • t = (β1 – 0) / sβ1

Step 7: Compare tobs with tcrit:
 • If the test statistic falls in the RR, reject the null.
 • Otherwise, we fail to reject the null.
Calculating whether β1 (slope) ≠ 0
________________________________________

Ho: β1 = 0
Ha: β1 ≠ 0

tcrit = 2.306
(df = 8; α = .05)

RR: |tobs| > 2.306

Observed t = (β1 – 0) / (s / √SSx)
           = (11 – 0) / (6 / √10.5)
           = 11 / 1.85
           = 5.94

We would reject the null hypothesis because tobs exceeds tcrit. In other words, tobs falls in the rejection region.
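The slope test for the widget example can be sketched in Python as follows. The critical value 2.306 is copied from the slide rather than computed, and the variable names are illustrative.

```python
import math

beta1, s, ss_x = 11.0, 6.0, 10.5   # slope, Root MSE, and SSx from the widget example
t_crit = 2.306                     # two-tailed, df = 8, alpha = .05 (from the slide)

se_beta1 = s / math.sqrt(ss_x)     # standard error of the slope
t_obs = (beta1 - 0) / se_beta1     # observed test statistic
reject_null = abs(t_obs) > t_crit  # does t fall in the rejection region?
print(round(se_beta1, 2), round(t_obs, 2), reject_null)  # 1.85 5.94 True
```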

Implication:
Correlation Coefficient
________________________________________

Pearson’s product moment coefficient of correlation – a measure of the strength of the linear relationship between two variables.
Terminology / notation:
 • r
 • Pearson’s r
 • correlation coefficient

________________________________________

r = SP / √[(SSx)(SSy)]

Interpretation:

+1   perfect positive relationship
     (strong positive relationship)
 0   no relationship
     (strong negative relationship)
-1   perfect negative relationship
r for the Widget Example
____________________________________________
r = SP / √[(SSx)(SSy)]

Experience in Years

r = 115.5 / √[(10.5)(1558.5)]
  = 115.5 / √16,364.25
  = 115.5 / 127.92 = .90

Experience in Months

r = 1386 / √[(1512)(1558.5)]
  = 1386 / √2,356,452
  = 1386 / 1535.074 = .90
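The years-versus-months comparison can be reproduced with a small Python sketch. The function name `pearson_r` is a hypothetical helper that applies the formula r = SP / √[(SSx)(SSy)] to the widget data.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r computed from SP, SSx, and SSy."""
    n = len(xs)
    sp = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    ss_x = sum(x * x for x in xs) - sum(xs) ** 2 / n
    ss_y = sum(y * y for y in ys) - sum(ys) ** 2 / n
    return sp / math.sqrt(ss_x * ss_y)

years = [3, 4, 4, 2, 5, 3, 4, 5, 3, 2]
score = [55, 78, 72, 58, 89, 63, 73, 84, 75, 48]
months = [12 * x for x in years]   # same experience, different unit

r_years = pearson_r(years, score)
r_months = pearson_r(months, score)
print(round(r_years, 2), round(r_months, 2))  # 0.9 0.9
```

Rescaling x multiplies SP by 12 and SSx by 144, so the factor cancels and r is unchanged.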
Stress and Health
____________________________________________

There is a strong negative correlation between stress and health. Generally, the more stressed a person is, the worse their health is.

But, does that mean that stress causes poor health?

No... Yes...
Coefficient of Determination
________________________________________

r² represents the proportion of the total sample variability

For simple, linear regression, r² = (r)².

________________________________________

More general formula is as follows:

r² = (SSy – SSE) / SSy

   = 1 – (SSE / SSy)
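As an illustration, the Python sketch below computes the coefficient of determination two ways for the widget example: from the error formula, and by squaring Pearson's r. The variable names are illustrative; the inputs (SSy = 1558.5, SSE = 288, SP = 115.5, SSx = 10.5) are taken from that example.

```python
ss_y, sse = 1558.5, 288.0   # total and error sums of squares (widget example)
sp, ss_x = 115.5, 10.5      # SP and SSx (widget example)

r2_from_error = 1 - sse / ss_y       # 1 - (SSE / SSy)
r2_from_r = sp ** 2 / (ss_x * ss_y)  # square of r = SP / sqrt(SSx * SSy)
print(round(r2_from_error, 3), round(r2_from_r, 3))  # 0.815 0.815
```

The two routes agree, so in that example roughly 82% of the variability in test scores is accounted for by years of experience.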

________________________________________

SPSS will give us everything we need!


Questions about Regression output
__________________________________________

1) What is r?

2) Is this correlation significant?

3) How much of the variance in # of colds per winter can be explained by weekend bedtime?

4) What is the y-intercept?

5) Is it significantly different from zero?

6) What is E(y) if x = 10:00 PM (10)?

7) What is E(y) if x = 2:00 AM (14)?

8) Are your answers to questions 6 and 7 meaningful?


SPSS output
______________________________________________
Model Summary
Model   R      R²     Adj R²   SE
1       .204   .041   .034     1.20

ANOVA
Model          Sum of Squares   df    Mean Square   F      Sig.
1 Regression   7.68             1     7.68          5.32   .023
  Residual     177.58           123   1.44
  Total        185.27           124

Coefficients
                 Unstand Coeff       Stand Coeff
Model            B        SE         Beta          t       Sig.
1 (Constant)     5.711    1.69                     3.38    .001
  bed_we         -.266    .12        -.20          -2.31   .023
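A minimal Python sketch of how the output can be turned into predictions. It assumes, based on the questions about this output, that the outcome is colds per winter and that `bed_we` is weekend bedtime coded in hours (10 = 10:00 PM, 14 = 2:00 AM); the function name `predicted_colds` is hypothetical.

```python
b0, b1 = 5.711, -0.266   # (Constant) and bed_we from the Coefficients table

def predicted_colds(bedtime):
    """E(y) = b0 + b1 * x for a given weekend bedtime code."""
    return b0 + b1 * bedtime

ss_regression, ss_total = 7.68, 185.27   # from the ANOVA table
r_squared = ss_regression / ss_total     # proportion of variance explained

print(round(predicted_colds(10), 3), round(predicted_colds(14), 3), round(r_squared, 3))
# 3.051 1.987 0.041
```

Whether those two predictions are meaningful depends on whether 10 and 14 lie inside the range of bedtimes actually observed, which the output alone does not show.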
I just don’t get it
____________________________________________

I know I’m old, but I just don’t get the tattoo thing. I
gotta figure that people regret their decision as time
passes. The data below represent 100 subjects who had
tattoos etched into their skin between 1 and 5 years ago.
They rated their satisfaction with their lifetime scar on a
scale of 1-10 (10 = extremely satisfied). Is there a
relationship between tattoo age and tattoo satisfaction?
(x) (y) (x2) (y2) (x)(y)
300 600 1100 3954 1660

Regression Equation
SP = (xi)(yi) – [(xi)yi)] / n
SSx = xi2 – [(xi)2 / n]
1 = SP / SSx
0 = My – (1* Mx)

Hypothesis Test
SSy = yi2 – [(yi)2 / n]
SSE = SSy - 1(SP)
s2 (MSE) = SSE / (n-2)
t = 1 - 0 / (s / SSxx)

Correlation Coefficient
SP
r = ( SS x )( SS y )
Calculating the regression parameters
______________________________________________

SP = Σ(xiyi) – [(Σxi)(Σyi) / n]

SSx = Σ(xi²) – [(Σxi)² / n]

β1 = SP / SSx

β0 = My – (β1 * Mx)
Let's do a t-test
______________________________________________

SSy = Σ(yi²) – [(Σyi)² / n]

SSE = SSy – β1 * (SP)

s² = MSE

s =

t = (β1 – 0) / (s / √SSx)

We reject the null and conclude that there is a significant NEGATIVE relationship between tattoo age and tattoo satisfaction.
Let's calculate the correlation coefficient
____________________________________________

SSy = Σ(yi²) – [(Σyi)² / n]

r = SP / √[(SSx)(SSy)]

r² =

Although there is a significant NEGATIVE relationship between tattoo age and tattoo satisfaction, age only explains about 25% of the variance in satisfaction. Clearly, other factors are involved.
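The whole tattoo exercise can be worked from the summary totals with a Python sketch like the one below. It assumes n = 100 (the number of subjects in the problem statement) and uses the totals Σx = 300, Σy = 600, Σ(x²) = 1100, Σ(y²) = 3954, Σ(xy) = 1660; the variable names are illustrative.

```python
import math

n = 100   # subjects, from the problem statement
sum_x, sum_y, sum_x2, sum_y2, sum_xy = 300, 600, 1100, 3954, 1660

sp = sum_xy - sum_x * sum_y / n         # SP
ss_x = sum_x2 - sum_x ** 2 / n          # SSx
ss_y = sum_y2 - sum_y ** 2 / n          # SSy
beta1 = sp / ss_x                       # slope
beta0 = sum_y / n - beta1 * sum_x / n   # intercept: My - beta1 * Mx
sse = ss_y - beta1 * sp                 # SSE
s = math.sqrt(sse / (n - 2))            # Root MSE
t_obs = (beta1 - 0) / (s / math.sqrt(ss_x))
r = sp / math.sqrt(ss_x * ss_y)         # Pearson's r

print(round(beta1, 2), round(beta0, 2), round(t_obs, 1), round(r, 2), round(r * r, 2))
```

Under these assumptions the slope is negative, the observed t is far from zero, and r² comes out near 0.28, in the same neighborhood as the "about 25%" quoted in the text.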
Skipping Class
__________________________________________
In a perfect world, the correlation between the number of classes skipped and
the percentage of classes skipped should be 1.00. Let's see how well the
percentage of classes skipped (x) predicts the number of hours of classes
skipped (y). Please calculate the regression line, the correlation
coefficient, and the coefficient of determination.

(x) (y) (x2) (y2) (x)(y)

Regression Equation
SP = (xi)(yi) – [(xi)yi)] / n
SSx = xi2 – [(xi)2 / n]
1 = SP / SSx
0 = My – (1* Mx)

Hypothesis Test
SSy = yi2 – [(yi)2 / n]
SSE = SSy - 1(SP)
s2 (MSE) = SSE / (n-2)
t = 1 - 0 / (s / SSxx)

Correlation Coefficient
SP
r = ( SS x )( SS y )
