Statistics for Data Analysts

1. Regression analysis is used to predict a variable (y) from another variable (x) when their relationship appears linear in a scatterplot.
2. The regression line is the line of best fit: it minimizes the sum of squared distances between the observed y-values and the predicted y-values.
3. The correlation coefficient (r) measures the strength and direction of the linear relationship between x and y, and is used to compute the regression line. When x is far from its mean, the predicted y is relatively closer to its own mean; this is known as regression to the mean.


Prediction is a key task of statistics

Predict the height of a son who is chosen at random from 928 sons. The average height of sons, 68.1 in, is the 'best' predictor.

[Figure: histogram of the heights of 928 sons (about 62 to 74 inches).]

Predict the height of a son whose father is 72 in tall. This additional information about the father should allow us to make a better prediction. Regression does just that.

[Figure: scatterplot of son's height against father's height.]
The correlation coefficient
[Figures: scatterplot of son's height against father's height, and scatterplot of income against education.]

The scatterplot visualizes the relationship between two quantitative variables. It may
have a direction (sloping up or down), form (a scatter that clusters around a line is
called linear) and strength (how closely do the points follow the form?).
If the form is linear, then a good measure of strength is the correlation coefficient r:
Our data are (xᵢ, yᵢ), i = 1, …, n.

    r = (1/n) Σᵢ₌₁ⁿ [(xᵢ − x̄)/sx] · [(yᵢ − ȳ)/sy]

(divide by n − 1 instead of n if this is also done for the standard deviations sx, sy).
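As a quick check of the formula, here is a minimal sketch in R (the software mentioned later for regression); the father/son heights are made-up illustrative values, not the lecture's data:

    # Correlation from the formula vs. R's built-in cor().
    father <- c(65, 67, 68, 70, 72, 74)      # made-up heights, for illustration
    son    <- c(66.5, 67, 68.5, 70, 71, 71.5)
    n <- length(father)
    # scale() standardizes with the n - 1 convention for the SDs,
    # so we divide by n - 1 here as well (see the remark above):
    r <- sum(scale(father) * scale(son)) / (n - 1)
    r - cor(father, son)                     # essentially zero: the two agree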


Correlation measures linear association
A numerical summary of these pairs of data is given by: x̄, sx , ȳ, sy , r.
As a convention the variable on the horizontal axis is called explanatory variable or
predictor, the one on the vertical axis is called response variable.
r is always between −1 and 1. The sign of r gives the direction of the association and
its absolute value gives the strength:
[Figure: example scatterplots with r = −0.9, r = −0.6, r = 0, r = 0.2, and r = 1.]

Since both x and y were standardized when computing r, r has no units and is not
affected by changing the center or the scale of either variable.
Correlation measures linear association
Keep in mind that r is only useful for measuring linear association:

[Figure: a scatterplot with a clear non-linear pattern but r = 0.]

Also remember that correlation does not mean causation:

Among school children there is a high correlation between shoe size and reading ability. Both are driven by the lurking variable 'age'.

[Figure: scatterplot of reading score against shoe size.]
The regression line
If the scatterplot shows a linear association, then this relationship can be summarized
by a line.

[Figures: two scatterplots of percent body fat against age.]

To find this line for n pairs of data (x₁, y₁), …, (xₙ, yₙ), recall that the equation of a line produces the y-value ŷᵢ = a + bxᵢ. The idea is to choose the line that minimizes the sum of the squared distances between the observed yᵢ and the ŷᵢ. In other words, find a and b that minimize

    Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)² = Σᵢ₌₁ⁿ (yᵢ − (a + bxᵢ))²
The method of least squares

For n pairs of data (x₁, y₁), …, (xₙ, yₙ), find a and b that minimize

    Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)² = Σᵢ₌₁ⁿ (yᵢ − (a + bxᵢ))²

This is the method of least squares. It turns out that b = r · (sy/sx) and a = ȳ − b·x̄. This line ŷ = a + bx is called the regression line.
There is another interpretation of the regression line:
it computes the average value of y when the first coordinate is near x.
Remember that an average is often the 'best' predictor. This shows how the regression line incorporates the information given by x to produce a good predictor of y.
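A minimal sketch in R of this recipe, reusing the made-up father/son heights from above and checking the result against lm():

    b <- cor(father, son) * sd(son) / sd(father)   # b = r * sy / sx
    a <- mean(son) - b * mean(father)              # a = ybar - b * xbar
    coef(lm(son ~ father))                         # reproduces a and b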
Regression to the mean
The main use of regression is to predict y from x:
Given x, predict y to be ŷ = a + bx.
The prediction for y at x = x̄ is simply ŷ = ȳ.
But b = r · (sy/sx) means that if x is one standard deviation sx above x̄, then the predicted ŷ is only r · sy above ȳ.
Since r is between −1 and 1, the prediction is ‘towards the mean’: ŷ is fewer standard
deviations away from ȳ than x is from x̄.

[Figure: scatterplot of final score against midterm score.]
Regression to the mean

This is called regression to the mean (or: the regression effect). It can be observed
in data whose scatter is football-shaped such as the exam scores: In such a test-retest
situation, the top group on the test will drop down somewhat on the retest, while the
bottom group moves up.
A heuristic explanation is this: To score among the very top on the midterm requires
excellent preparation as well as some luck. This luck may not be there any more on the
final exam, and so we expect this group to fall back a bit.
This effect is simply a consequence of there being a scatter around the line.
Erroneously assuming that this occurs due to some action (e.g. ‘the top scorers on the
midterm slackened off’) is the regression fallacy.
Predicting y from x and x from y
If we are given x, then we use the regression line ŷ = a + bx to predict y.
To find this regression line we need only x̄, ȳ, sx , sy and r.
We can use software to compute this line, e.g. ‘lm’ in R, but it can also be done
quickly by hand:
mean midterm = 49.5, mean final = 69.1, smid = 10.2, sfinal = 11.8, r = 0.67.
Predict the final exam score of a student who scored 41 on the midterm.
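A sketch of the hand computation in R, using only the five summary statistics above:

    b <- 0.67 * 11.8 / 10.2   # slope r * sfinal / smid, about 0.78
    a <- 69.1 - b * 49.5      # intercept ybar - b * xbar, about 30.7
    a + b * 41                # predicted final score, about 62.5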
Predict x from y
Predict the midterm score of a student who scored 89 on the final.
When predicting x from y it is a mistake to use the regression line ŷ = a + bx, derived
for regressing y on x, and solve for x. This is because regressing x on y will result in a
different regression line.
To avoid confusing these, always put the predictor on the x-axis and proceed as on the
previous slide.
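Numerically, with the same summary statistics (a sketch; note the slope is r · smid/sfinal, not the reciprocal of the earlier slope):

    b_rev <- 0.67 * 10.2 / 11.8   # about 0.58, while 1/0.78 would be about 1.29
    a_rev <- 49.5 - b_rev * 69.1  # about 9.5
    a_rev + b_rev * 89            # predicted midterm score, about 61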
Normal approximation in regression
Regression requires that the scatter is football-shaped. Then one may use normal
approximation for the y-values conditional on x. That is, the observations whose first
coordinate is near that x have y-values that approximately follow the normal curve.

To standardize, subtract off the predicted value ŷ, then divide by √(1 − r²) × sy.

Among the students who scored around 41 on the midterm, what percentage scored
above 60 on the final?
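A sketch of this computation in R, reusing the prediction ŷ ≈ 62.5 at x = 41 from the earlier slide:

    yhat <- 62.5
    s <- sqrt(1 - 0.67^2) * 11.8         # SD around the line, about 8.8
    1 - pnorm(60, mean = yhat, sd = s)   # about 0.61, i.e. roughly 61%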
Residuals
The differences between observed and predicted y-values are called residuals:
eᵢ = yᵢ − ŷᵢ, i = 1, …, n
Residuals are used to check whether the use of regression is appropriate. The residual
plot is a scatterplot of the residuals against the x-values. It should show an
unstructured horizontal band.
[Figures: scatterplot of final score against midterm score, and the residual plot of the residuals against midterm score.]
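A minimal sketch of such a plot in R; the exam scores here are simulated, since the raw data are not given on the slide:

    set.seed(1)
    midterm <- runif(100, 20, 70)                      # simulated scores
    final   <- 30 + 0.8 * midterm + rnorm(100, sd = 9)
    fit <- lm(final ~ midterm)
    plot(midterm, resid(fit))    # should be an unstructured horizontal band
    abline(h = 0)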


Residual plots
A curved pattern suggests that the scatter is not linear:

[Figures: scatterplot of income against education, and the residual plot against education showing a curved pattern.]

But it may still be possible to analyze these data with regression! Regression may be applicable after transforming the data, e.g. regress √income or log(income) on Education.
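A sketch of such a transformed fit in R, with simulated data standing in for the income survey:

    set.seed(1)
    Education <- sample(6:16, 100, replace = TRUE)               # simulated
    Income    <- exp(7 + 0.2 * Education + rnorm(100, sd = 0.3))
    fit_log <- lm(log(Income) ~ Education)   # or lm(sqrt(Income) ~ Education)
    plot(Education, resid(fit_log))          # check that the curvature is gone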
Transformations of the variables
Another violation of the football-shaped assumption about the scatter arises if the
scatter is heteroscedastic:

[Figure: a residual plot whose spread changes across x (heteroscedastic scatter).]
A transformation of the y-variables may produce a homoscedastic scatter, i.e. result in
equal spread of the residuals across x. (However, it may also result in a non-linear
scatter, which may require a second transformation of the x-values to fix!)
Transformation of the variables
[Figures: 2000 Pres. Election in Florida by county (w/o Palm Beach): scatterplot of Buchanan votes against Bush votes, and the residual plot against Bush votes.]

The residual plot looks heteroscedastic. Taking log of both variables produces a residual
plot that is very satisfactory:
[Figures: scatterplot of log(Buchanan) against log(Bush), and the residual plot against log(Bush).]
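In R this is a one-line change; a sketch assuming a data frame florida (not shown on the slides) with county-level vote counts in columns Bush and Buchanan:

    fit <- lm(log(Buchanan) ~ log(Bush), data = florida)
    plot(log(florida$Bush), resid(fit))   # band should now look homoscedastic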
Outliers

Points with very large residuals (outliers) should be examined: they may represent
typos or interesting phenomena.

[Figures: 2000 Pres. Election in Florida by county (all counties): scatterplot of Buchanan votes against Bush votes, and the corresponding residual plot.]
Leverage and influential points
A point whose x-value is far from the mean of the x-values has high leverage: it has the potential to cause a big change in the regression line.

[Figure: scatterplot with a high-leverage point whose x-value lies far from the rest.]

Whether it does change the line a lot (→ influential point) or not can only be
determined by refitting the regression without the point. An influential point may have
a small residual (because it is influential!), so a residual plot is not helpful for this
analysis.
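A minimal sketch of this refitting check in R, with simulated data containing one high-leverage point:

    set.seed(1)
    x <- c(runif(20, 1, 4), 7)    # the last point has high leverage
    y <- c(1 + 0.4 * x[1:20] + rnorm(20, sd = 0.2), 4.5)
    coef(lm(y ~ x))               # fit with the point
    coef(lm(y[-21] ~ x[-21]))     # fit without it; a large change in the
                                  # coefficients flags an influential point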
Some other issues

• Avoid predicting y by extrapolation, i.e. at x-values that are outside the range of the x-values that were used for the regression: the linear relationship often breaks down outside a certain range.
• Beware of data that are summaries (e.g. averages of some data). Those are less variable than individual observations, and correlations between averages tend to overstate the strength of the relationship.
• Regression analyses often report 'R-squared': R² = r². It gives the fraction of the variation in the y-values that is explained by the regression line. (So 1 − r² is the fraction of the variation in the y-values that is left in the residuals.) See the sketch below.
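A sketch verifying the identity R² = r² on simulated data:

    set.seed(1)
    x <- rnorm(50)
    y <- 2 + 0.5 * x + rnorm(50)
    summary(lm(y ~ x))$r.squared   # equals cor(x, y)^2 in simple regression
    cor(x, y)^2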
