Statistics for Data Analysts

1. Regression analysis is used to predict a variable (y) from another variable (x) when their relationship appears linear in a scatterplot.
2. The regression line is the line of best fit: it minimizes the sum of squared distances between the observed y-values and the predicted y-values.
3. The correlation coefficient (r) measures the strength and direction of the linear relationship between x and y, and is used to compute the regression line. When x is far from its mean, the predicted y is relatively closer to its own mean; this is known as regression to the mean.


Prediction is a key task of statistics

Predict the height of a son who is chosen at random from 928 sons. The average height of sons, 68.1 in, is the 'best' predictor.

[Figure: histogram of the heights of 928 sons (about 62 to 74 inches).]

Predict the height of a son whose father is 72 in tall. This additional information about the father should allow us to make a better prediction. Regression does just that.

[Figure: scatterplot of son's height against father's height.]
The correlation coefficient
[Figures: scatterplot of son's height against father's height, and scatterplot of income against education.]

The scatterplot visualizes the relationship between two quantitative variables. It may
have a direction (sloping up or down), form (a scatter that clusters around a line is
called linear) and strength (how closely do the points follow the form?).
If the form is linear, then a good measure of strength is the correlation coefficient r:
Our data are (xᵢ, yᵢ), i = 1, …, n.

    r = (1/n) Σᵢ₌₁ⁿ [(xᵢ − x̄)/sx] · [(yᵢ − ȳ)/sy]

(divide by n − 1 instead of n if this is also done for the standard deviations sx, sy).
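As a quick check of the formula, here is a minimal sketch in R (the software mentioned later for regression); the father/son heights are made-up illustrative values, not the lecture's data:

    # Correlation from the formula vs. R's built-in cor().
    father <- c(65, 67, 68, 70, 72, 74)      # made-up heights, for illustration
    son    <- c(66.5, 67, 68.5, 70, 71, 71.5)
    n <- length(father)
    # scale() standardizes with the n - 1 convention for the SDs,
    # so we divide by n - 1 here as well (see the remark above):
    r <- sum(scale(father) * scale(son)) / (n - 1)
    r - cor(father, son)                     # essentially zero: the two agree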


Correlation measures linear association
A numerical summary of these pairs of data is given by: x̄, sx , ȳ, sy , r.
As a convention the variable on the horizontal axis is called explanatory variable or
predictor, the one on the vertical axis is called response variable.
r is always between −1 and 1. The sign of r gives the direction of the association and
its absolute value gives the strength:
[Figure: example scatterplots with r = −0.9, r = −0.6, r = 0, r = 0.2, and r = 1.]

Since both x and y were standardized when computing r, r has no units and is not
affected by changing the center or the scale of either variable.
Correlation measures linear association
Keep in mind that r is only useful for measuring linear association:

[Figure: a scatterplot with a clear non-linear pattern but r = 0.]

Also remember that correlation does not mean causation:

Among school children there is a high correlation between shoe size and reading ability. Both are driven by the lurking variable 'age'.

[Figure: scatterplot of reading score against shoe size.]
The regression line
If the scatterplot shows a linear association, then this relationship can be summarized
by a line.

[Figures: two scatterplots of percent body fat against age.]

To find this line for n pairs of data (x₁, y₁), …, (xₙ, yₙ), recall that the equation of a line produces the y-value ŷᵢ = a + bxᵢ. The idea is to choose the line that minimizes the sum of the squared distances between the observed yᵢ and the ŷᵢ. In other words, find a and b that minimize

    Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)² = Σᵢ₌₁ⁿ (yᵢ − (a + bxᵢ))²
The method of least squares

For n pairs of data (x₁, y₁), …, (xₙ, yₙ), find a and b that minimize

    Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)² = Σᵢ₌₁ⁿ (yᵢ − (a + bxᵢ))²

This is the method of least squares. It turns out that b = r · (sy/sx) and a = ȳ − b·x̄. This line ŷ = a + bx is called the regression line.
There is another interpretation of the regression line:
it computes the average value of y when the first coordinate is near x.
Remember that an average is often the 'best' predictor. This shows how the regression line incorporates the information given by x to produce a good predictor of y.
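A minimal sketch in R of this recipe, reusing the made-up father/son heights from above and checking the result against lm():

    b <- cor(father, son) * sd(son) / sd(father)   # b = r * sy / sx
    a <- mean(son) - b * mean(father)              # a = ybar - b * xbar
    coef(lm(son ~ father))                         # reproduces a and b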
Regression to the mean
The main use of regression is to predict y from x:
Given x, predict y to be ŷ = a + bx.
The prediction for y at x = x̄ is simply ŷ = ȳ.
But b = r · (sy/sx) means that if x is one standard deviation sx above x̄, then the predicted ŷ is only r · sy above ȳ.
Since r is between −1 and 1, the prediction is ‘towards the mean’: ŷ is fewer standard
deviations away from ȳ than x is from x̄.

[Figure: scatterplot of final score against midterm score.]
Regression to the mean

This is called regression to the mean (or: the regression effect). It can be observed
in data whose scatter is football-shaped such as the exam scores: In such a test-retest
situation, the top group on the test will drop down somewhat on the retest, while the
bottom group moves up.
A heuristic explanation is this: To score among the very top on the midterm requires
excellent preparation as well as some luck. This luck may not be there any more on the
final exam, and so we expect this group to fall back a bit.
This effect is simply a consequence of there being a scatter around the line.
Erroneously assuming that this occurs due to some action (e.g. ‘the top scorers on the
midterm slackened off’) is the regression fallacy.
Predicting y from x and x from y
If we are given x, then we use the regression line ŷ = a + bx to predict y.
To find this regression line we need only x̄, ȳ, sx , sy and r.
We can use software to compute this line, e.g. ‘lm’ in R, but it can also be done
quickly by hand:
mean midterm = 49.5, mean final = 69.1, smid = 10.2, sfinal = 11.8, r = 0.67.
Predict the final exam score of a student who scored 41 on the midterm.
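A sketch of the hand computation in R, using only the five summary statistics above:

    b <- 0.67 * 11.8 / 10.2   # slope r * sfinal / smid, about 0.78
    a <- 69.1 - b * 49.5      # intercept ybar - b * xbar, about 30.7
    a + b * 41                # predicted final score, about 62.5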
Predict x from y
Predict the midterm score of a student who scored 89 on the final.
When predicting x from y it is a mistake to use the regression line ŷ = a + bx, derived
for regressing y on x, and solve for x. This is because regressing x on y will result in a
different regression line.
To avoid confusing these, always put the predictor on the x-axis and proceed as on the
previous slide.
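Numerically, with the same summary statistics (a sketch; note the slope is r · smid/sfinal, not the reciprocal of the earlier slope):

    b_rev <- 0.67 * 10.2 / 11.8   # about 0.58, while 1/0.78 would be about 1.29
    a_rev <- 49.5 - b_rev * 69.1  # about 9.5
    a_rev + b_rev * 89            # predicted midterm score, about 61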
Normal approximation in regression
Regression requires that the scatter is football-shaped. Then one may use normal
approximation for the y-values conditional on x. That is, the observations whose first
coordinate is near that x have y-values that approximately follow the normal curve.

To standardize, subtract off the predicted value ŷ, then divide by √(1 − r²) × sy.

Among the students who scored around 41 on the midterm, what percentage scored
above 60 on the final?
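A sketch of this computation in R, reusing the prediction ŷ ≈ 62.5 at x = 41 from the earlier slide:

    yhat <- 62.5
    s <- sqrt(1 - 0.67^2) * 11.8         # SD around the line, about 8.8
    1 - pnorm(60, mean = yhat, sd = s)   # about 0.61, i.e. roughly 61%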
Residuals
The differences between observed and predicted y-values are called residuals:
eᵢ = yᵢ − ŷᵢ, i = 1, …, n
Residuals are used to check whether the use of regression is appropriate. The residual
plot is a scatterplot of the residuals against the x-values. It should show an
unstructured horizontal band.
[Figures: scatterplot of final score against midterm score, and the residual plot of the residuals against midterm score.]
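A minimal sketch of such a plot in R; the exam scores here are simulated, since the raw data are not given on the slide:

    set.seed(1)
    midterm <- runif(100, 20, 70)                      # simulated scores
    final   <- 30 + 0.8 * midterm + rnorm(100, sd = 9)
    fit <- lm(final ~ midterm)
    plot(midterm, resid(fit))    # should be an unstructured horizontal band
    abline(h = 0)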


Residual plots
A curved pattern suggests that the scatter is not linear:

[Figures: scatterplot of income against education, and the residual plot against education showing a curved pattern.]

But it may still be possible to analyze these data with regression! Regression may be applicable after transforming the data, e.g. regress √income or log(income) on Education.
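A sketch of such a transformed fit in R, with simulated data standing in for the income survey:

    set.seed(1)
    Education <- sample(6:16, 100, replace = TRUE)               # simulated
    Income    <- exp(7 + 0.2 * Education + rnorm(100, sd = 0.3))
    fit_log <- lm(log(Income) ~ Education)   # or lm(sqrt(Income) ~ Education)
    plot(Education, resid(fit_log))          # check that the curvature is gone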
Transformations of the variables
Another violation of the football-shaped assumption about the scatter arises if the
scatter is heteroscedastic:

[Figure: a residual plot whose spread changes across x (heteroscedastic scatter).]
A transformation of the y-variables may produce a homoscedastic scatter, i.e. result in
equal spread of the residuals across x. (However, it may also result in a non-linear
scatter, which may require a second transformation of the x-values to fix!)
Transformation of the variables
[Figures: 2000 Pres. Election in Florida by county (w/o Palm Beach): scatterplot of Buchanan votes against Bush votes, and the residual plot against Bush votes.]

The residual plot looks heteroscedastic. Taking log of both variables produces a residual
plot that is very satisfactory:
[Figures: scatterplot of log(Buchanan) against log(Bush), and the residual plot against log(Bush).]
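In R this is a one-line change; a sketch assuming a data frame florida (not shown on the slides) with county-level vote counts in columns Bush and Buchanan:

    fit <- lm(log(Buchanan) ~ log(Bush), data = florida)
    plot(log(florida$Bush), resid(fit))   # band should now look homoscedastic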
Outliers

Points with very large residuals (outliers) should be examined: they may represent
typos or interesting phenomena.

[Figures: 2000 Pres. Election in Florida by county (all counties): scatterplot of Buchanan votes against Bush votes, and the corresponding residual plot.]
Leverage and influential points
A point whose x-value is far from the mean of the x-values has high leverage: it has the potential to cause a big change in the regression line.

[Figure: scatterplot with a high-leverage point whose x-value lies far from the rest.]

Whether it does change the line a lot (→ influential point) or not can only be
determined by refitting the regression without the point. An influential point may have
a small residual (because it is influential!), so a residual plot is not helpful for this
analysis.
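A minimal sketch of this refitting check in R, with simulated data containing one high-leverage point:

    set.seed(1)
    x <- c(runif(20, 1, 4), 7)    # the last point has high leverage
    y <- c(1 + 0.4 * x[1:20] + rnorm(20, sd = 0.2), 4.5)
    coef(lm(y ~ x))               # fit with the point
    coef(lm(y[-21] ~ x[-21]))     # fit without it; a large change in the
                                  # coefficients flags an influential point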
Some other issues

• Avoid predicting y by extrapolation, i.e. at x-values that are outside the range of the x-values that were used for the regression: the linear relationship often breaks down outside a certain range.
• Beware of data that are summaries (e.g. averages of some data). Those are less variable than individual observations, and correlations between averages tend to overstate the strength of the relationship.
• Regression analyses often report 'R-squared': R² = r². It gives the fraction of the variation in the y-values that is explained by the regression line. (So 1 − r² is the fraction of the variation in the y-values that is left in the residuals.) See the sketch below.
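A sketch verifying the identity R² = r² on simulated data:

    set.seed(1)
    x <- rnorm(50)
    y <- 2 + 0.5 * x + rnorm(50)
    summary(lm(y ~ x))$r.squared   # equals cor(x, y)^2 in simple regression
    cor(x, y)^2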
