H2 Mathematics JC 2
Student's Copy
Correlation Coefficient and Linear Regression
Include:
• concepts of scatter diagram, correlation coefficient and linear regression
• calculation and interpretation of the product moment correlation coefficient and of the equation of the least squares regression line
• interpolation and extrapolation
• use of a square, reciprocal or logarithmic transformation to achieve linearity
Exclude:
• derivation of formulae
• hypothesis tests
In this unit, students will:
(a) understand that bivariate data consists of the values of two variables (independent and dependent variables) obtained from the same sample, expressed as ordered pairs;
(b) use a graphic calculator to plot the scatter diagram for a set of bivariate data to determine if there is a linear relationship between the two variables;
(c) understand that the correlation coefficient is a measure of the fit of a scatter diagram to a linear model;
(d) calculate the product moment correlation coefficient for a set of bivariate data using a graphic calculator, and relate the value (in particular, values close to -1, 0 and 1) to the appearance of the scatter diagram; [Note: Zero correlation does not necessarily imply 'no relationship', but rather 'no linear relationship'.]
(e) understand that a high correlation between two variables does not necessarily imply one directly causes the other;
(f) understand the concepts of linear regression and 'least squares' with reference to the scatter diagram;
(g) use a graphic calculator to find the equation of the least squares regression line, and interpret its slope and intercept; [Note: A different line of regression will be obtained if we interchange the independent and dependent variables.]
(h) understand the concepts of extrapolation and interpolation of data, and use the appropriate regression line to make a prediction or estimate a value in practical situations;
(i) use an appropriate transformation to linearise a set of bivariate data to fit the regression model.
§ 1 Introduction
In the previous chapters, we have been studying data with one variable. In this chapter, we shall investigate data with two variables and their relationship. For example, if we can find a relationship between the midyear examination scores and year-end examination scores of students, we will be able to use the information to help us make statistical inferences about the two examination scores.

§ 2 Bivariate Data
Examples of data with two variables include
• displacement of an object and its velocity,
• speed and time of a falling object,
• annual income level and education level of individuals,
• students' examination scores in Chemistry and Physics.
Data with two variables are known as bivariate data and they are usually expressed as ordered pairs (x, y).
Studies of bivariate data first began in the 1860s when Sir Francis Galton investigated the degree of resemblance between children and their parents. In a study carried out by Galton and his student, Karl Pearson, they measured the heights of 1078 fathers and the heights of their sons. In order to investigate the data, the first use of correlation and regression studies of data emerged.
[Photograph: Sir Francis Galton (1822-1911)]
Suppose we wish to examine the relationship between the midyear scores and final examination scores of students. We may want to find a model that can be used to predict the final examination score for a student having a known midyear examination score.

§ 3 Scatter Diagram (or Scatter Plot)
A scatter diagram is obtained when each of the observed bivariate data (x_i, y_i), i = 1, 2, ..., n is plotted on the Cartesian plane.
Each (x_i, y_i) on the scatter diagram represents a single data point. From the scatter diagram, we can judge visually if there is any relationship between x and y.
An example of a data set and its scatter diagram is shown below (one column per student).
Midyear score              40   50   55   60   65   80
Final examination score    50   53   58   60   70   88
[Scatter diagram: Final Examination Score (vertical axis) against Midyear Score (horizontal axis)]
Example 1
The temperature, T, in degrees Celsius (°C) of the tyre of a car is measured when the car travels at different speeds, v (km h⁻¹). Eight sets of data are obtained. Sketch the scatter diagram for the data.
v    20   30   40   50   60   70   80   90
T    45   52   64   66   91   86   98   104
Solution
Using TI-84+:
Create two new lists, L1 and L2, using the v and T values respectively.
Press [STAT], select 1:Edit, press [ENTER]. Key into L1 and L2 the values for v and T respectively.
To plot the data, press [2nd] [STAT PLOT] for Stat Plot.
Press [ENTER] to select Plot 1.
Press [ENTER] to highlight On.
Under Type, choose the first icon for scatter plot.
To select L1 for Xlist and L2 for Ylist, press [2nd] [1] for L1 and [2nd] [2] for L2.
For 'Mark', select the desired style to represent the data points.
[Calculator screen: Stat Plot 1 settings with Xlist: L1 and Ylist: L2]
To view the scatter diagram, press [ZOOM], select 9:ZoomStat, press [ENTER].
To read the coordinates of each point, press [TRACE].
Using CASIO GC:
1. Go to […].
2. Create two new lists, List 1 and List 2, using the given v and T values respectively.
3. To plot the data, press the key for Graph.
4. Press the key for [SET] (settings).
5. Choose settings as shown on the right.
6. Choose List 1 for X list (independent) and List 2 for Y list (dependent).
7. For Mark Type, select the style to represent the data points.
8. To view the plot, press […] followed by […].
9. To read the coordinates of each point, press […] (TRACE).
[Calculator screen: graph settings showing XList: List 1, YList: List 2, Frequency and Mark Type]
§ 4 Analysis of Scatter Diagrams
Scatter diagrams can reveal general patterns and relationships between two variables. We can comment on the following:
1. Direction
The two variables x and y can be positively related or negatively related.
In general, if y increases as x increases, then x and y are positively related.
In general, if y decreases as x increases, then x and y are negatively related.
[Diagram: Negative relationship]
2. Form
The form of a scatter diagram refers to the shape of the distribution of data points. The relationship may be linear or curvilinear, or there may be no clear relationship.
[Diagrams: Quadratic relationship; No clear relationship]
3. Strength
The strength of a scatter diagram describes how tightly clustered the points are to the underlying form.
If the data points are tightly clustered to a straight line, there is a strong linear relation between x and y.
If the data points are loosely clustered to a straight line, it may indicate a weak linear relation between x and y.
[Diagrams: Strong linear relationship; Weak linear relationship]
Example 2
Comment on the relationship between the two variables based on the scatter diagrams below.
[Scatter diagrams for Example 2]
Note
Scatter diagrams can also give us visual evidence of outliers or suspicious observations. These data points are points in a sample that lie outside the overall pattern of a distribution.
§ 5 Correlation
Interpretation of the relationship between the variables of sample data solely based on the scatter diagram is subjective. It can even be deceiving when different scales for the axes are used. The scatter diagrams below are plotted using the same set of data but on different scales for the y-axis.
[Two scatter diagrams of the same data plotted with different y-axis scales]
Since the scale of a scatter diagram can be manipulated, it may be more helpful to use a numerical approach to measure the strength of a relationship between two variables. The product moment correlation coefficient, r, is used to measure the strength of a linear relationship. The formula for r, with n data points, is as follows:
\[ r = \frac{\sum (x-\bar{x})(y-\bar{y})}{\sqrt{\sum (x-\bar{x})^2 \sum (y-\bar{y})^2}} = \frac{\sum xy - \frac{\sum x \sum y}{n}}{\sqrt{\left(\sum x^2 - \frac{(\sum x)^2}{n}\right)\left(\sum y^2 - \frac{(\sum y)^2}{n}\right)}} \]
Using
\[ S_{xx} = \sum (x-\bar{x})^2 = \sum x^2 - \frac{(\sum x)^2}{n}, \quad S_{yy} = \sum (y-\bar{y})^2 = \sum y^2 - \frac{(\sum y)^2}{n}, \quad S_{xy} = \sum (x-\bar{x})(y-\bar{y}) = \sum xy - \frac{\sum x \sum y}{n}, \]
we have
\[ r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}. \]
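The summary-sum form of the formula for r can be checked with a short script. The following is a minimal Python sketch, not part of the graphic calculator workflow in these notes; the function name `pmcc_from_sums` and the small data set are illustrative assumptions chosen so that the points lie exactly on a line with positive gradient.

```python
import math

def pmcc_from_sums(n, sum_x, sum_y, sum_x2, sum_y2, sum_xy):
    """Product moment correlation coefficient from the summary sums."""
    s_xx = sum_x2 - sum_x ** 2 / n      # S_xx = sum(x^2) - (sum x)^2 / n
    s_yy = sum_y2 - sum_y ** 2 / n      # S_yy = sum(y^2) - (sum y)^2 / n
    s_xy = sum_xy - sum_x * sum_y / n   # S_xy = sum(xy) - (sum x)(sum y) / n
    return s_xy / math.sqrt(s_xx * s_yy)

# Hypothetical data (not from the notes): the points lie exactly on y = 2x + 1.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]

r = pmcc_from_sums(
    len(xs),
    sum(xs),
    sum(ys),
    sum(x * x for x in xs),
    sum(y * y for y in ys),
    sum(x * y for x, y in zip(xs, ys)),
)
print(r)  # the points are perfectly linear, so r should be 1
```

Because only the five sums and n are needed, the same function can be applied whenever a question gives summary statistics instead of raw data.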
Example 3
In a physical education class, the number of push-ups (x) and sit-ups (y) done by a sample of ten randomly chosen students were recorded and summarised as shown.
Student    1    2    3    4    5    6    7    8    9    10
x          27   22   15   35   30   52   35   55   40   40
y          30   26   25   42   38   40   32   54   50   43
\[ \sum xy = 14257, \quad \sum x^2 = 13717, \quad \sum y^2 = 15298, \quad \sum x = 351, \quad \sum y = 380 \]
Find the product moment correlation coefficient of the sample.
Solution
Since the actual data is given, we can use the GC to calculate the product moment correlation coefficient.
Using TI-84+:
To turn on Stat Diagnostics for TI-84+: press [MODE] and scroll down to STAT DIAGNOSTICS, select On, and press [ENTER]. Subsequently, Stat Diagnostics will be switched on by default and we will not need to activate it manually.
Store data under L1 (x) and L2 (y). (Recall Example 1)
Press [STAT] > CALC, select 8:LinReg(a+bx) and press [ENTER].
Key in L1 under Xlist, then L2 under Ylist and press Calculate.
[Calculator screen: y = a + bx, a ≈ 14.908, b ≈ 0.658, r² ≈ 0.705, r ≈ 0.839]
From the results displayed, r = 0.839.
Using CASIO GC:
1. Store data under List 1 (x) and List 2 (y).
2. Press the key for [CALC], then the key for [REG], and select the key for X (linear).
[Calculator screen: linear regression output, including r and r²]
Note: If the independent variable is not in List 1 or the dependent variable is not in List 2, select [SET], and change the lists in 2Var XList (independent) and 2Var YList (dependent).
We will be looking at the significance of the other values that appear in the screenshot in the next section.
Note
1. For any sample data, -1 ≤ r ≤ 1.
2. The sign of the correlation coefficient indicates the direction of linear correlation.
When r > 0, the correlation between x and y is positive.
When r < 0, the correlation between x and y is negative.
3. The magnitude of r indicates the strength of the linear correlation. Generally, the bigger the value of |r|, the stronger the linear relationship.
When r = +1, we have perfect positive linear correlation. All the points lie on a straight line with positive gradient.
When r = -1, we have perfect negative linear correlation. All the points lie on a straight line with negative gradient.
When r = 0, there is no linear correlation between x and y. It does not mean there is no relationship between the two variables.
4. A scatter diagram together with the product moment correlation coefficient should be used to determine if there is a linear relationship. See Appendix A for an example of data sets with the same product moment correlation coefficient but different scatter diagrams.
5. Correlation does not imply causation.
6. In general,
0.8 ≤ |r| ≤ 1 indicates strong linear correlation between the two variables.
0.5 ≤ |r| < 0.8 indicates moderately strong linear correlation between the two variables.
|r| < 0.5 indicates weak linear correlation between the two variables.
The above classification should only be treated as a guide.
7. The measure r has no units. It is independent of the scale of measurement of the variables. (See Appendix B)
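Notes 3 and 6 can be illustrated numerically. The sketch below is a minimal Python example under stated assumptions: the helper names `pmcc` and `strength` and the quadratic data set are hypothetical, and the thresholds simply encode the rough guide in Note 6.

```python
import math

def pmcc(xs, ys):
    """Product moment correlation coefficient of paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

def strength(r):
    """Rough guide from Note 6; the cut-offs are a convention, not a rule."""
    size = abs(r)
    if size >= 0.8:
        return "strong"
    if size >= 0.5:
        return "moderately strong"
    return "weak"

# Hypothetical data: y = x^2 on values of x symmetric about 0.
# y is completely determined by x, yet there is no linear correlation.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]
r_quadratic = pmcc(xs, ys)
print(r_quadratic, strength(r_quadratic))
```

Here r is 0 even though y depends exactly on x, which is why r = 0 should be read as "no linear correlation" rather than "no relationship".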
Example 4
The temperature, T, in degrees Celsius (°C) of the tyre of a car is measured when the car travels at different speeds, v (km h⁻¹). Eight sets of data are obtained.
v    20   30   40   50   60   70   80   90
T    45   52   64   66   91   86   98   104
(i) Find the product moment correlation coefficient between v and T.
(ii) Find the corresponding product moment correlation coefficient between v and F, where F is the temperature of the tyre measured in degrees Fahrenheit (°F). (Use the formula F = (9/5)T + 32.)
Solution (TI-84+)
(i) Using TI-84+, r = 0.975.
Store data under L1 (x) and L2 (y).
Press [STAT] > CALC, select 8:LinReg(a+bx) and press [ENTER].
Key in L1 under Xlist, then L2 under Ylist and press Calculate.
[Calculator screen: r² ≈ 0.9506, r ≈ 0.9750]
From the results displayed, r ≈ 0.975.
(ii) Using TI-84+, r = 0.975.
Scroll up to the header L3 and highlight it by pressing [ENTER].
Key into the header L3 the formula for F, which is (9/5)L2 + 32, and press [ENTER]. Note that to enter L2, we press [2nd] [2].
Press [STAT] > CALC, select 8:LinReg(a+bx) and press [ENTER].
Key in L1 under Xlist, then L3 under Ylist and press Calculate.
[Calculator screen: y = a + bx, a ≈ 81.843, b ≈ 1.573, r² ≈ 0.9506, r ≈ 0.9750]
From the results displayed, r ≈ 0.975.
Solution (CASIO GC):
(i) Using CASIO GC, r = 0.975.
1. Store data under List 1 (x) and List 2 (y).
2. Press the key for [CALC], then the key for [REG], and select the key for X (linear).
(ii) Using CASIO GC, r = 0.975.
1. Scroll up to the header of List 3.
2. Key in the formula for F. Since List 2 stores the data for T and F = (9/5)T + 32, key in List 3 = (9÷5) List 2 + 32 (press […] for [List]).
Note
Notice that the values of r calculated in (i) and (ii) are the same. This illustrates an important property of r: it is independent of the scale of measurement for temperature.
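The same check can be made outside the calculator. The following is a minimal Python sketch using the Example 4 data; the helper name `pmcc` is an assumption for illustration, and the conversion uses the formula F = (9/5)T + 32 given in the question.

```python
import math

def pmcc(xs, ys):
    """Product moment correlation coefficient of paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

# Example 4 data: speed v (km/h) and tyre temperature T (degrees Celsius).
v = [20, 30, 40, 50, 60, 70, 80, 90]
T = [45, 52, 64, 66, 91, 86, 98, 104]

# Convert each temperature to degrees Fahrenheit.
F = [9 / 5 * t + 32 for t in T]

r_celsius = pmcc(v, T)
r_fahrenheit = pmcc(v, F)
print(round(r_celsius, 3), round(r_fahrenheit, 3))  # both round to 0.975
```

Adding 32 shifts every value by the same amount and multiplying by 9/5 rescales them by a positive constant, so neither step changes r.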
§ 6 Linear Regression
In the last section, we used both the scatter diagram and the product moment correlation coefficient between the two variables to indicate whether it is meaningful to model the observed data with a straight line. If it appears that the data fits a linear model, we then attempt to find an equation to represent the relationship by linear regression.
In Example 4, the speed, v, is controlled and the temperature, T, is measured based on v. Thus, v is known as the independent variable while T, whose value depends on v, is called the dependent variable.
In Example 3, we were investigating the number of sit-ups and push-ups a student can do. In this case, there is no clear dependency between the two variables.
§ 7 Method of Least Squares
Consider the data given in the scatter diagram below. We randomly draw a line to fit the data first. The line drawn below may not be the line that best represents the data, the line of best fit. There are several ways to find a line of best fit, and we use the method most commonly used for finding such a line, called the least squares method. The line obtained by this method is called the least squares regression line.
To understand this method, we consider any line drawn to fit the data, for example
y = a + bx
We consider x as the independent variable and y as the dependent variable. The circled points in the diagram correspond to observed data points. If we were to use the line given above to model the data, we would predict a different y value for the corresponding x-value. The difference is the error, e, which is known as the residual and is calculated as
e = observed y value - predicted y value
Each pair of observations (x_i, y_i) produces a residual, e_i, for i = 1, 2, ..., n.
\( \sum_{i=1}^{n} e_i^2 \) is known as the sum of the squares of the residuals and we use \( \sum e_i^2 \) to denote \( \sum_{i=1}^{n} e_i^2 \).
The least squares regression line of y on x is the line that produces the smallest \( \sum e_i^2 \), where x is the independent variable and y is the dependent variable.
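The idea of "smallest sum of squared residuals" can be made concrete with a short computation. The sketch below is a minimal Python illustration using the midyear and final examination scores from § 3; the function names and the arbitrary comparison line y = 10 + x are assumptions, and the closed-form slope and intercept used in `least_squares_line` are the standard least squares results (quoted without proof here).

```python
def sum_squared_residuals(xs, ys, a, b):
    """Sum of e_i^2, where e_i = observed y - predicted y for the line y = a + bx."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

def least_squares_line(xs, ys):
    """Intercept a and slope b of the least squares regression line of y on x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    b = s_xy / s_xx
    a = mean_y - b * mean_x
    return a, b

# Midyear scores (x) and final examination scores (y) from the scatter diagram section.
x = [40, 50, 55, 60, 65, 80]
y = [50, 53, 58, 60, 70, 88]

a_ls, b_ls = least_squares_line(x, y)
sse_ls = sum_squared_residuals(x, y, a_ls, b_ls)

# An arbitrary "by eye" line, y = 10 + x, for comparison (hypothetical choice).
sse_guess = sum_squared_residuals(x, y, 10.0, 1.0)

print(f"least squares line: y = {a_ls:.4f} + {b_ls:.4f}x, sum of squares = {sse_ls:.2f}")
print(f"guessed line:       y = 10 + x, sum of squares = {sse_guess:.2f}")
```

Any other choice of a and b, including small shifts of the least squares line itself, gives a sum of squared residuals at least as large as `sse_ls`.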
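§ 8 Equation of the Least Squares Regression Line of y on x
Consider a line of equation y = a + bx. Given a set of data, we want to find the values of a and b that minimise
\[ \sum e_i^2 = \sum (\text{observed } y \text{ value} - \text{predicted } y \text{ value})^2 = \sum \left(y_i - (a + b x_i)\right)^2. \]
It can be proven (Appendix C) that in order to minimise \( \sum e_i^2 \),
\[ b = \frac{\sum (x-\bar{x})(y-\bar{y})}{\sum (x-\bar{x})^2} \quad \text{and} \quad a = \bar{y} - b\bar{x} \quad \text{------ (1)} \]
Consider the general equation of a line y = a + bx and substituting (1) into the equation,
\[ y = (\bar{y} - b\bar{x}) + bx \;\Rightarrow\; y - \bar{y} = b(x - \bar{x}). \]
Therefore, the equation of the regression line of y on x is
\[ y - \bar{y} = b(x - \bar{x}), \quad \text{where } b = \frac{\sum (x-\bar{x})(y-\bar{y})}{\sum (x-\bar{x})^2} \]
is also known as the regression coefficient of y on x.
Note
1. \( \sum (x-\bar{x})(y-\bar{y}) = \sum xy - \dfrac{\sum x \sum y}{n} \)
2. \( \sum (x-\bar{x})^2 = \sum x^2 - \dfrac{(\sum x)^2}{n} \)
3. Another formula for b is \( b = \dfrac{\sum xy - \frac{\sum x \sum y}{n}}{\sum x^2 - \frac{(\sum x)^2}{n}} \).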
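The alternative formula for b in Note 3 only needs summary sums. As a minimal Python sketch (the function name is an assumption), it can be applied to the summary statistics quoted in Example 3, namely n = 10, Σx = 351, Σy = 380, Σx² = 13717 and Σxy = 14257.

```python
def regression_y_on_x_from_sums(n, sum_x, sum_y, sum_x2, sum_xy):
    """Return (a, b) for the least squares regression line y = a + bx."""
    b = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)
    a = sum_y / n - b * sum_x / n   # a = y_bar - b * x_bar
    return a, b

# Summary statistics from Example 3 (push-ups x, sit-ups y, ten students).
a, b = regression_y_on_x_from_sums(10, 351, 380, 13717, 14257)
print(f"y = {a:.3f} + {b:.3f}x")  # y = 14.908 + 0.658x
```

These values agree with the a and b reported by the graphic calculator in the solution to Example 3.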
§ 9 Equation of the Least Squares Regression Line of x on y
Let the equation of the regression line of x on y be x = c + dy, where c and d are to be found so that the sum of the squares of the residuals in the x-direction is a minimum.
We consider y as the independent variable and x as the dependent variable.
We can show that the equation of the regression line of x on y is
\[ x - \bar{x} = d(y - \bar{y}), \quad \text{where } d = \frac{\sum (x-\bar{x})(y-\bar{y})}{\sum (y-\bar{y})^2}. \]
Note
1. \( \sum (y-\bar{y})^2 = \sum y^2 - \dfrac{(\sum y)^2}{n} \)
2. Another formula for d is \( d = \dfrac{\sum xy - \frac{\sum x \sum y}{n}}{\sum y^2 - \frac{(\sum y)^2}{n}} \).
3. The equation of the regression line of x on y cannot be found by making x the subject in the equation of the regression line of y on x.
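The last note can be verified numerically. The sketch below is a minimal Python illustration using the push-up and sit-up data of Example 3; the helper name `regress` is an assumption. It compares the slope d of the regression line of x on y with the slope 1/b obtained by (incorrectly) rearranging the regression line of y on x.

```python
def regress(independent, dependent):
    """Return (intercept, slope) of the least squares line of dependent on independent."""
    n = len(independent)
    mean_i = sum(independent) / n
    mean_d = sum(dependent) / n
    s_id = sum((i - mean_i) * (d - mean_d) for i, d in zip(independent, dependent))
    s_ii = sum((i - mean_i) ** 2 for i in independent)
    slope = s_id / s_ii
    return mean_d - slope * mean_i, slope

# Example 3 data: push-ups (x) and sit-ups (y) for ten students.
x = [27, 22, 15, 35, 30, 52, 35, 55, 40, 40]
y = [30, 26, 25, 42, 38, 40, 32, 54, 50, 43]

a, b = regress(x, y)   # y on x: y = a + bx
c, d = regress(y, x)   # x on y: x = c + dy

# Making x the subject of y = a + bx gives x = -a/b + (1/b)y.
rearranged_slope = 1 / b

print(f"x on y:              x = {c:.3f} + {d:.3f}y")
print(f"rearranged y on x:   x = {-a / b:.3f} + {rearranged_slope:.3f}y")
```

The two slopes differ because the regression of x on y minimises residuals in the x-direction, whereas rearranging the line of y on x still describes the line that minimises residuals in the y-direction.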
§ 10 Use of Regression Lines for Estimation
Consider the data (x, y). Given a value of one of the variables, regression lines can be used to predict or estimate the value of the other. The choice of the regression line used depends on the context of the situation:
(a) If there is a clear indication that x is the independent variable, we will always use the regression line of y on x to do estimation.
(b) For cases where there is no clear independent variable, if we want to estimate y for a given value of x, we use the regression line of y on x. If we want to estimate x for a given value of y, we use the regression line of x on y.
Note
1. When we do estimation within the given range of values of the data, it is known as interpolation.
2. When we do estimation outside the given range of values of the data, it is known as extrapolation. Values obtained by extrapolation may not be reliable since the relationship between the two variables may not follow the same linear model outside the range of the data.
3. Estimates using regression lines are more reliable if both the following conditions are met:
(a) The value of r of the data is close to ±1, and the scatter diagram also suggests that there is a strong linear correlation.
(b) The estimation is done within the given range of values of data.
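The distinction between interpolation and extrapolation can be built into a prediction routine. The following minimal Python sketch uses the midyear and final examination scores from the scatter diagram section; the function names and the two midyear scores being predicted (70 and 95) are illustrative assumptions.

```python
def least_squares_line(xs, ys):
    """Intercept a and slope b of the regression line of y on x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    b = s_xy / s_xx
    return mean_y - b * mean_x, b

def predict_with_check(x_new, xs, a, b):
    """Predict y at x_new and report whether this is interpolation or extrapolation."""
    kind = "interpolation" if min(xs) <= x_new <= max(xs) else "extrapolation"
    return a + b * x_new, kind

# Midyear scores (x) and final examination scores (y).
midyear = [40, 50, 55, 60, 65, 80]
final = [50, 53, 58, 60, 70, 88]

a, b = least_squares_line(midyear, final)

pred_70, kind_70 = predict_with_check(70, midyear, a, b)   # inside the range 40 to 80
pred_95, kind_95 = predict_with_check(95, midyear, a, b)   # outside the range 40 to 80

print(f"x = 70: predicted y = {pred_70:.1f} ({kind_70})")
print(f"x = 95: predicted y = {pred_95:.1f} ({kind_95})")
```

The estimate flagged as extrapolation should be treated with caution, since nothing in the data shows that the linear model still holds beyond a midyear score of 80.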
Example 5
In a physical education class, the number of push-ups (x) and sit-ups (y) done by a sample of ten randomly chosen students were recorded in the table below.
Student    1    2    3    4    5    6    7    8    9    10
x          27   22   15   35   30   52   35   55   40   40
y          30   26   25   42   38   40   32   54   50   43
(i) Find the equation of the regression line of y on x.
(ii) Interpret the slope and intercept in the context of the question.
(iii) Predict the number of sit-ups a student can do when he can do 50 push-ups. State, with a reason, whether the predicted value is reliable.
(iv) State, with a reason, whether it is reliable to use the equation in (i) to predict the number of sit-ups when 60 push-ups are done.
Solution
Example 6
An electrical fire was switched on in a cold room and the temperature of the room was noted at 5-minute intervals.
Time, x (in minutes) from switching on fire    0     5     10    15    20    25    30     35     40
Temperature, y (in °C)                         0.4   1.5   3.4   5.5   7.7   9.7   11.7   13.5   15.4
(a) Find the equation of the regression line of y on x and the product moment correlation coefficient between x and y. Comment on what its value implies about the relationship between x and y.
(b) Explain why the regression line of y on x rather than the regression line of x on y should be used to predict the time that has passed after switching on the fire if the temperature is 93C.
(c) Predict the temperature of the room when the fire is switched on for 30 and 60 minutes. Comment on the reliability of your answers.
(d) Starting with the equation of the regression line of y on x, deduce the equation of the regression line of
(i) y on t, where y is the temperature in °C and t is time in hours,
(ii) z on x, where z is the temperature in Kelvin (K) and x is time in minutes. (A temperature in °C is converted to K by adding 273.)
(iii) Comment on the values of r obtained in (i) and (ii).
§ 11 Properties of Regression Lines
In general, the regression line of y on x, i.e., y = a + bx, is different from the regression line of x on y, i.e., x = c + dy.
These are some observations about the lines y = a + bx and x = c + dy as well as r:
(i) Both the regression line of y on x and the regression line of x on y pass through the point \( (\bar{x}, \bar{y}) \).
(ii) If r > 0, both regression coefficients b and d are positive.
If r < 0, both regression coefficients b and d are negative.
(iii) r² = bd.
(iv) The larger the value of |r|, i.e., almost 1, the "closer" the regression line of y on x is to the regression line of x on y.
If r = ±1, the regression lines of y on x and x on y are identical.
If r = 0, the regression line of y on x and the regression line of x on y are a pair of horizontal and vertical lines.
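Properties (i) to (iii) can be confirmed on any data set. The following minimal Python sketch does so for the push-up and sit-up data of Example 3; the helper names are assumptions for illustration.

```python
import math

def summary(xs, ys):
    """Return (mean_x, mean_y, S_xx, S_yy, S_xy) for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return mean_x, mean_y, s_xx, s_yy, s_xy

# Example 3 data: push-ups (x) and sit-ups (y).
x = [27, 22, 15, 35, 30, 52, 35, 55, 40, 40]
y = [30, 26, 25, 42, 38, 40, 32, 54, 50, 43]

mean_x, mean_y, s_xx, s_yy, s_xy = summary(x, y)

b = s_xy / s_xx              # regression coefficient of y on x
a = mean_y - b * mean_x
d = s_xy / s_yy              # regression coefficient of x on y
c = mean_x - d * mean_y
r = s_xy / math.sqrt(s_xx * s_yy)

# (i) both lines pass through (x_bar, y_bar)
on_line_y_on_x = math.isclose(a + b * mean_x, mean_y)
on_line_x_on_y = math.isclose(c + d * mean_y, mean_x)

# (ii) b and d share the sign of r; (iii) r^2 = bd
same_sign = (b > 0) == (d > 0) == (r > 0)
r_squared_equals_bd = math.isclose(r ** 2, b * d)

print(on_line_y_on_x, on_line_x_on_y, same_sign, r_squared_equals_bd)
```

Property (iii) follows directly from the formulas, since bd = S_xy² / (S_xx S_yy) = r².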
§ 12 Transformations
Not all relationships between x and y are linear. If the relationship between x and y is not linear, we can sometimes use a suitable transformation to linearise the relationship. Here are some examples:
Relationship        Transformation                                                 Linear Relationship
y = ax^b            Take natural logarithm (or take logarithm of another base)     ln y = ln a + b ln x, i.e., ln y and ln x have a linear relationship.
y = ae^(bx)         Take natural logarithm (or take logarithm of another base)     ln y = ln a + bx, i.e., ln y and x have a linear relationship.
y = √(ax + b)       Square both sides                                              y² = ax + b, i.e., y² and x have a linear relationship.
y = 1/(ax + b)      Take reciprocal                                                1/y = ax + b, i.e., 1/y and x have a linear relationship.
Example 7
A school bookshop sells a popular guidebook. The sales of the guidebook in each of five successive years (Yr 1 - 5) are given.
Year (x)     1      2      3      4       5
Sales (y)    1000   3000   7000   14000   21000
(i) Draw a scatter plot of the above data and find the product moment correlation coefficient between x and y. Comment on the suitability of the use of a linear model, y = ax + b, for the sales of the guidebook.
(ii) By calculating the product moment correlation coefficient between ln y and ln x, comment on the suitability of the use of the model y = ax^b as compared to the linear model in (i).
(iii) Find the least squares regression line of ln y on ln x. Hence estimate the values of a and b.
Solution
(i)
[Scatter plot of Sales (y) against Year (x)]
Using GC, the product moment correlation coefficient, r, between x and y is ________.
(ii)
Key in x and y data into L1 and L2 respectively.
Scroll up to the header L3 and highlight it by pressing [ENTER].
Key in the transformation, in this case L3 = ln(L1), and press [ENTER] to generate the transformed data. Similarly, generate L4 = ln(L2).
Find the equation of the regression line using L3 and L4.
[Calculator screen: y = a + bx, a = 6.815631664, b = 1.919318251, r² = 0.9932916127, r = 0.9966481621]
(iii)
Note
From part (i) of the above example, we have seen that the value of r may be very close to 1 but it does not necessarily imply that a linear model y = a + bx is the most suitable for the data. It is always important to draw a scatter diagram to decide which model is more suitable.
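The logarithmic transformation in Example 7 can also be carried out in a few lines of code. This is a minimal Python sketch rather than the graphic calculator method the notes use; the helper names are assumptions. It compares r for the untransformed data with r for (ln x, ln y), then converts the fitted line ln y = ln a + b ln x back to the parameters of y = ax^b.

```python
import math

def pmcc(xs, ys):
    """Product moment correlation coefficient of paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

def least_squares_line(xs, ys):
    """Intercept and slope of the regression line of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    return mean_y - slope * mean_x, slope

# Example 7 data.
year = [1, 2, 3, 4, 5]
sales = [1000, 3000, 7000, 14000, 21000]

ln_x = [math.log(x) for x in year]
ln_y = [math.log(y) for y in sales]

r_linear = pmcc(year, sales)   # linear model y = ax + b
r_log = pmcc(ln_x, ln_y)       # power model y = a x^b, after taking logarithms

intercept, slope = least_squares_line(ln_x, ln_y)   # ln y = intercept + slope * ln x
a_model = math.exp(intercept)                       # since intercept = ln a
b_model = slope

print(f"r (x, y) = {r_linear:.4f}, r (ln x, ln y) = {r_log:.4f}")
print(f"ln y = {intercept:.4f} + {slope:.4f} ln x, so a = {a_model:.0f}, b = {b_model:.2f}")
```

The intercept and slope should agree with the calculator screen in the solution (about 6.816 and 1.919), and a is recovered by taking the exponential of the intercept.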
Summary
• A scatter diagram is obtained when each of the observed bivariate data (x_i, y_i), i = 1, 2, ..., n is plotted on the Cartesian plane.
• The product moment correlation coefficient is
\[ r = \frac{\sum (x-\bar{x})(y-\bar{y})}{\sqrt{\sum (x-\bar{x})^2 \sum (y-\bar{y})^2}} = \frac{\sum xy - \frac{\sum x \sum y}{n}}{\sqrt{\left(\sum x^2 - \frac{(\sum x)^2}{n}\right)\left(\sum y^2 - \frac{(\sum y)^2}{n}\right)}} \]
• In order to determine if there is a linear relationship, a scatter diagram, together with the product moment correlation coefficient, should be used.
• -1 ≤ r ≤ 1
When r > 0, there is positive linear correlation between x and y.
When r < 0, there is negative linear correlation between x and y.
When r = 0, there is no linear correlation between x and y.
• r is independent of the scale of measurement of x and y.
• The least squares regression line of y on x is the line that produces the smallest \( \sum e_i^2 \).
• Equation of the regression line of y on x is
\[ y - \bar{y} = b(x - \bar{x}), \quad \text{where } b = \frac{\sum (x-\bar{x})(y-\bar{y})}{\sum (x-\bar{x})^2}. \]
• Equation of the regression line of x on y is
\[ x - \bar{x} = d(y - \bar{y}), \quad \text{where } d = \frac{\sum (x-\bar{x})(y-\bar{y})}{\sum (y-\bar{y})^2}. \]
• If there is a clear independent variable x, we will always use the regression line of y on x to do estimation.
• For cases where there is no clear independent variable, if we want to estimate y for a given value of x, we use the regression line of y on x. If we want to estimate x for a given value of y, we use the regression line of x on y.
Appendix A - Anscombe's Quartet
To illustrate the importance of using the scatter diagram to support the product moment correlation coefficient, F. J. Anscombe constructed the following 4 data sets, which have almost exactly the same r value but appear very different when graphed:
Anscombe's Quartet
      I               II              III             IV
  x       y       x       y       x       y       x       y
 10.0    8.04    10.0    9.14    10.0    7.46     8.0    6.58
  8.0    6.95     8.0    8.14     8.0    6.77     8.0    5.76
 13.0    7.58    13.0    8.74    13.0   12.74     8.0    7.71
  9.0    8.81     9.0    8.77     9.0    7.11     8.0    8.84
 11.0    8.33    11.0    9.26    11.0    7.81     8.0    8.47
 14.0    9.96    14.0    8.10    14.0    8.84     8.0    7.04
  6.0    7.24     6.0    6.13     6.0    6.08     8.0    5.25
  4.0    4.26     4.0    3.10     4.0    5.39    19.0   12.50
 12.0   10.84    12.0    9.13    12.0    8.15     8.0    5.56
  7.0    4.82     7.0    7.26     7.0    6.42     8.0    7.91
  5.0    5.68     5.0    4.74     5.0    5.73     8.0    6.89
Note that all 4 sets of data have (to two or three significant figures) the same mean, variance, r value and linear regression line.
[Scatter diagrams of data sets I to IV]
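The claim in Appendix A about Anscombe's quartet can be checked directly from the tabulated values. The sketch below is a minimal Python illustration; the helper name `pmcc` and the dictionary layout are assumptions, while the numbers are those in the table.

```python
import math

def pmcc(xs, ys):
    """Product moment correlation coefficient of paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

# Data sets I, II and III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]

quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

r_values = {name: pmcc(xs, ys) for name, (xs, ys) in quartet.items()}
for name, r in r_values.items():
    print(name, round(r, 3))
```

All four values of r agree to three decimal places even though the scatter diagrams show a roughly linear cloud, a curve, a line with one outlier, and a vertical cluster with one extreme point.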
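Appendix B - Properties of r
The measure r will not change if we add a constant to, or multiply by a positive constant, all the values of a variable.
If s = a + bx and t = c + dy, where a, b, c, d are constants, with b and d both positive, then \( r_{st} = r_{xy} \).
\[ r_{st} = \frac{\sum (s-\bar{s})(t-\bar{t})}{\sqrt{\sum (s-\bar{s})^2 \sum (t-\bar{t})^2}} \]
\[ = \frac{\sum \left[(a+bx)-(a+b\bar{x})\right]\left[(c+dy)-(c+d\bar{y})\right]}{\sqrt{\sum \left[(a+bx)-(a+b\bar{x})\right]^2 \sum \left[(c+dy)-(c+d\bar{y})\right]^2}} \]
\[ = \frac{\sum (bx-b\bar{x})(dy-d\bar{y})}{\sqrt{\sum (bx-b\bar{x})^2 \sum (dy-d\bar{y})^2}} \]
\[ = \frac{bd \sum (x-\bar{x})(y-\bar{y})}{\sqrt{b^2 d^2 \sum (x-\bar{x})^2 \sum (y-\bar{y})^2}} \]
\[ = \frac{\sum (x-\bar{x})(y-\bar{y})}{\sqrt{\sum (x-\bar{x})^2 \sum (y-\bar{y})^2}} = r_{xy} \]
(Note: \( \sqrt{b^2 d^2} = bd \) since bd is positive.)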
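Appendix C - Partial Differentiation to Determine Regression Coefficients
Let \( y_1, y_2, \ldots, y_n \) be the observed values corresponding to the values \( x_1, x_2, \ldots, x_n \).
The least squares line y = a + bx is the line that minimises the sum of the squares of the residuals, i.e.
\[ S = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - a - b x_i)^2. \]
Calculating the partial derivative of S with respect to a (note the notation \( \frac{\partial S}{\partial a} \) being used instead of \( \frac{dS}{da} \)):
\[ \frac{\partial S}{\partial a} = -2 \sum (y_i - a - b x_i) = -2\left(\sum y_i - na - b \sum x_i\right). \]
Setting \( \frac{\partial S}{\partial a} = 0 \) gives
\[ a = \frac{\sum y_i}{n} - b\,\frac{\sum x_i}{n} = \bar{y} - b\bar{x}. \]
Calculating the partial derivative of S with respect to b:
\[ \frac{\partial S}{\partial b} = -2 \sum x_i (y_i - a - b x_i), \quad \text{and let } \frac{\partial S}{\partial b} = 0. \]
Thus
\[ \sum x_i y_i - a \sum x_i - b \sum x_i^2 = 0 \;\Rightarrow\; \sum x_i y_i - (\bar{y} - b\bar{x})\, n\bar{x} - b \sum x_i^2 = 0 \]
\[ \Rightarrow\; b = \frac{\sum x_i y_i - n\bar{x}\bar{y}}{\sum x_i^2 - n\bar{x}^2} = \frac{S_{xy}}{S_{xx}}. \]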
Questions
1. The product moment correlation coefficient is denoted by r. Comment on the validity of the following statements:
(a) r = 0 for a set of data (x, y) implies that X and Y are unrelated.
(b) If X is the number of cigarettes smoked per day by people dying of lung cancer and Y is the age at death, then r = -0.9 implies that smoking more cigarettes per day causes a person to die younger.
(c) If the value of r for a set of sample data (x, y) is 1, this means that a linear relation holds for the population (X, Y).
2. (a) The product moment correlation coefficient for a set of data (x, y) is denoted by r. State a data point which can be included such that r will remain the same for the new set of data.
(b) Express, in terms of r,
(i) the product moment correlation coefficient of (y, x),
(ii) the product moment correlation coefficient when all values of x are increased by 5,
(iii) the product moment correlation coefficient for the set of (x, -y).
3. With the aid of a suitable diagram, describe the difference between the regression line of Y on X and that of X on Y. Under what circumstances would the above two lines be coincident?
The following summarizes the data from 10 sets of lengths (x) and breadths (y) in mm:
\[ \sum x = 1782, \quad \sum y = 1483, \quad \sum x^2 = 318086, \quad \sum y^2 = 220257, \quad \sum xy = 264582 \]
(i) Find the value of the product moment correlation coefficient.
(ii) Find the equation of the regression line of y on x and that of x on y.
(iii) Predict the breadth when the length is 185 mm. Comment on the reliability of your prediction.
[(i) 0.744 (ii) y = 0.584x + 44.3; x = 0.949y + 37.4 (iii) 152]
4. Explain, with the aid of a diagram, what is meant by the term "least squares" in the context of regression lines.
Delegates who travelled by car to a Statistics Conference were asked to report d, the distance travelled (in km), and t, the time taken (in minutes). A random sample of the reported values was given in the table below:
d    113   14   98    130   75    120   143   55   127
t    130   25   180   148   100   120   196   48   165
(i) Find the value of the product moment correlation coefficient.
(ii) Find the equation of the least squares regression line of t on d in the form t = a + bd. Interpret the coefficient b in the context of the question.
(iii) Explain why it is more appropriate to regress t on d.
(iv) Estimate the time taken by the delegates who travelled
(a) 100 km
(b) 150 km
and comment on the reliability of your estimates.
[(i) 0.894 (ii) t = 3.43 + 1.24d (iv) (a) 127 (b) 189]
5. FM N94/4/9
The following summary data refers to concentrations of carbon dioxide in the atmosphere (y) in parts per million for the 8 years 1971, 1973, ..., 1985 (x).
\[ \sum (x - 1971) = 56, \quad \sum (y - 325) = 69, \quad \sum (x - 1971)^2 = 560, \quad \sum (y - 325)^2 = 887, \quad \sum (x - 1971)(y - 325) = 704 \]
[Source: Council on Environmental Quality, 1987.]
(i) Let u = x - 1971 and v = y - 325. Calculate the equation of the least squares regression line of v on u. Hence find the equation of the least squares regression line of y on x.
(ii) Calculate the product moment correlation coefficient for x and y. Comment on what its value implies about the regression line.
(iii) Estimate the concentration of carbon dioxide in the atmosphere in (a) 1974 and (b) 1988. Comment on the reliability of your answers.
[(i) v = 1.32u - 0.583; y = 1.32x - 2268 (ii) 0.998 (iii) 328; 347]
6. A Comprehensive Guide: H2 Mathematics for 'A' Level, Q16 p221 (modified)
The data shows the result of an experiment to investigate the relationship between two variables x and t, where x is dependent on t.
x    22.5   25.0   28.0   30.5   38.0   40.5   42.5   48.0   54.5   55.0   70.0
t    44.0   42.0   33.5   28.0   18.0   13.6   15.0   10.3   9.0    6.3    4.0
(i) Obtain the scatter diagram and comment on any relationship between x and t.
(ii) State, with a reason, which of the following models is more appropriate to fit the data points:
(a) x = at^b where a > 0 and b < 0
(b) x = a + bt² where a > 0 and b < 0.
(iii) For the appropriate model, find the product moment correlation coefficient for the transformed data. Estimate the values of a and b.
[(iii) -0.990, a = 136, b = -0.453]
7. N2009/II/6
The table gives the world record time, in seconds above 3 minutes 30 seconds, for running 1 mile as at 1st January in various years.
Year, x    1930   1940   1950   1960   1970   1980   1990   2000
Time, t    40.4   36.4   31.3   24.5   21.1   19.0   16.3   13.1
(i) Draw a scatter diagram to illustrate the data.
(ii) Comment on whether a linear model would be appropriate, referring both to the scatter diagram and the context of the question.
(iii) Explain why in this context a quadratic model would probably not be appropriate for long-term predictions.
(iv) Fit a model of the form ln t = a + bx to the data, and use it to predict the world record time as at 1st January 2010. Comment on the reliability of your prediction.
[(ii) inappropriate (iv) 3 mins 41.4 secs, unreliable]
8. FM2003/II/11 OR (modified)
A random sample of eight pairs of values of x and y is used to obtain the following equations of the regression lines of y on x and x on y respectively.
71517
Y='-fr*tl0,
7s=--Y+20
Seven of the pairs of data are given in the table.
x
v
l0
11
l2
11
t7
t4
l9
Find the eighth pair of values of x and y.
Determine the value of the product moment correlation coefficient, and say what it leads you to expect about the scatter diagram for this sample.
Let Y be the value obtained by substituting a sample value of x into the equation of the regression line of y on x. Evaluate Y for each of the eight values of x and state the value of \( \sum (y - Y)^2 \).
For each of the eight sample values of x, Y' is given by Y' = a + bx, where a and b are any constants. What can you say about the value of \( \sum (y - Y')^2 \)?
[(10, 5); r = -0.904; 8.81]
9. FM2006/II/10
For a random sample of 12 observations of pairs of values (x, y), the equation of the regression line of y on x is y = 4.82 - 2.25x. The sum of the 12 values of x is 20.64 and the product moment correlation coefficient for the sample is -0.3.
(i) Find the sum of the 12 values of y.
(ii) Find the equation of the regression line of x on y.
(iii) Find the estimated value of y when x = 2.8 and comment on the reliability of this estimate.
[(i) 11.4 (ii) x = 1.76 - 0.04y (iii) -1.48]
10. N2007/II/11
Research is carried out into how the concentration of a drug in the bloodstream varies with time, measured from when the drug is given. Observations at successive times give the data shown in the following table.
Time (t minutes)                              15   30   60   90   120   150   180   240   300
Concentration (x micrograms per litre)        82   65   43   37   22    19    12    …     …
It is given that the value of the product moment correlation coefficient for this data is -0.912, correct to 3 decimal places.
Obtain the scatter diagram of x on t and calculate the equation of the regression line of x on t.
Calculate the corresponding estimated value of x when t = 300, and comment on the suitability of the linear model.
The variable y is defined by y = ln x. For the variables y and t,
(i) calculate the product moment correlation coefficient and comment on its value,
(ii) calculate the equation of the appropriate regression line.
Use a regression line to give the best estimate that you can of the time when the drug concentration is 15 micrograms per litre.
[x = -0.260t + 66.2; x = -11.7 or -11.8 (i) r = -0.994 (ii) ln x = 4.6… - 0.01…t; t = …]