
Applied Statistics I

CC 3201 (2 Credits)
BSc in ARMT, AB, GT

Upuli I Wickramaarachchi
Learning Outcomes:

1. Evaluate the strength and direction of a relationship between two variables
2. Compute and interpret correlation coefficients
3. Fit and interpret simple linear regression model coefficients
4. Use statistical software to compute correlation coefficients and to fit an SLR model
Activity 1

Record the heights and weights of a random sample of 15 students of the same sex (males).

Is there any apparent relationship between the two variables?

Would you expect the same relationship (if any) to exist between the heights and weights of the opposite sex?
Studying Relationships - Example

The data below show the marks obtained by 1st year ARMT students for the Basic Maths and Statistics courses.

Student ID:            3257 3260 3214 3253 3289 3190 3215 3265 3271 3248
Maths marks (/30), x:    20   23    8   29   14   11   11   20   17   17
Stats marks, y:          30   35   21   33   33   26   22   31   33   36

 Is there a relationship between the marks obtained by each student for maths and statistics?
Studying Relationships – Example contd.

• A starting point would be to plot the marks as a scatter plot.
• We can calculate the means: x̄ = 17 and ȳ = 30.
• Drawing the lines x = 17 and y = 30 divides the graph into 4 sections. Most points fall in the bottom-left and top-right sections, so the two marks tend to increase together.
• The problem is to find how strong this tendency is.
What is Covariance?

 An attempt to quantify the tendency of the points to go from bottom left to top right is to evaluate the expression

$$\sum (x - \bar{x})(y - \bar{y})$$
Simplifying the Covariance Formula

x̄ and ȳ are constants (averages), so you can factor them out of summations. Also, Σx = n x̄ and Σy = n ȳ.

Replacing the constants and simplifying:

$$\sum (x - \bar{x})(y - \bar{y}) = \sum xy - \bar{x}\sum y - \bar{y}\sum x + n\bar{x}\bar{y} = \sum xy - n\bar{x}\bar{y}$$
Covariance

Covariance measures the direction of the linear relationship between two variables: whether they increase or decrease together.

$$S_{xy} = \frac{1}{n}\sum (x - \bar{x})(y - \bar{y}) = \frac{1}{n}\sum xy - \bar{x}\,\bar{y}$$
Studying Relationships – Example contd.

• Thus, the relationship between the Stats and Maths marks is

$$S_{xy} = \frac{1}{n}\sum xy - \bar{x}\,\bar{y}$$

Substituting values into the formula,

$$S_{xy} = \frac{1}{10} \times 5313 - 17 \times 30 = 21.3$$

• But what about the strength and the direction of the relationship?
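As a quick check, here is a minimal Python sketch (using NumPy, one possible software choice for this course) that reproduces the covariance for the marks data above. Note that np.cov defaults to the 1/(n-1) divisor, so bias=True is needed to match the 1/n formula used in these slides.

```python
import numpy as np

# Maths (x) and Stats (y) marks from the example
x = np.array([20, 23, 8, 29, 14, 11, 11, 20, 17, 17])
y = np.array([30, 35, 21, 33, 33, 26, 22, 31, 33, 36])

# Covariance with the 1/n divisor, as in the slide formula
s_xy = np.mean(x * y) - x.mean() * y.mean()
print(s_xy)                            # 21.3

# np.cov uses 1/(n-1) by default; bias=True switches to 1/n
print(np.cov(x, y, bias=True)[0, 1])   # 21.3
```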
Correlation?

 A correlation is a statistical measure of association/relationship between two variables.
 Correlation analysis quantifies the degree (strength and direction) to which an association tends to a certain pattern, via a measure called a correlation coefficient.
 For example, Pearson’s correlation coefficient measures the degree to which two variables tend toward a straight-line relationship.
Quantifying correlation

There are different methods for quantifying correlation, but these all share a number of properties:

1. If there is no relationship between the variables, the correlation coefficient will be zero. The closer to 0 the value, the weaker the relationship. A perfect correlation will be either -1 or +1, depending on the direction.

2. The value of a correlation coefficient indicates the direction and strength of the association, but it says nothing about the steepness of the relationship. A correlation coefficient is just a number, so it can’t tell us exactly how one variable depends on the other.
Pearson’s product-moment correlation (r)

 A measure of linear association between numeric variables.
 Gives the strength and direction of the relationship between the two variables.
 This means Pearson’s correlation (r) is appropriate when numeric variables follow a ‘straight-line’ relationship. That doesn’t mean they have to be perfectly related, by the way. It simply means there shouldn’t be any ‘curviness’ to the relationship.
Pearson’s correlation formula

$$r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \sum (y - \bar{y})^2}}$$

Or simply, Pearson’s correlation formula is written as

$$r = \frac{S_{xy}}{S_x S_y}$$

where

$$S_x = \sqrt{\frac{1}{n}\sum (x - \bar{x})^2} \quad\text{and}\quad S_y = \sqrt{\frac{1}{n}\sum (y - \bar{y})^2}$$
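A short sketch with SciPy (an assumption; any statistical package would do) applies this to the marks example. scipy.stats.pearsonr also returns a p-value for the test that the true correlation is zero.

```python
import numpy as np
from scipy.stats import pearsonr

x = np.array([20, 23, 8, 29, 14, 11, 11, 20, 17, 17])   # Maths marks
y = np.array([30, 35, 21, 33, 33, 26, 22, 31, 33, 36])  # Stats marks

# By the formula: r = S_xy / (S_x * S_y)
s_xy = np.mean(x * y) - x.mean() * y.mean()   # 21.3
r_manual = s_xy / (x.std() * y.std())         # np.std uses the 1/n divisor
print(r_manual)                               # 0.71

r, p_value = pearsonr(x, y)
print(r, p_value)                             # r = 0.71, with its p-value
```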
“Pearson’s correlation test” - Assumptions

1. Both variables are measured on an interval or ratio scale.
2. The two variables are normally distributed (in the population).
3. The relationship between the variables is linear.
Correlation tests between two variables

Both measure the strength of the association:

• Pearson’s product-moment correlation (r): for quantitative variables (ratio and interval scale data).
• Spearman’s rank correlation: for qualitative variables (ordinal scale data).
• r only quantifies the strength and direction of the relationship between x and y; it doesn’t reveal the “form of the relationship”.

[Figure: three scatter plots of the lines y = 3 + 5x, y = 3 + 10x and y = 3 + x; each has r = 1.]

• Though r = 1 in each case, the form of the relationship is not the same (the rate of increment is different).
• Our correlation analysis only characterises the strength and direction of the association. We need to use a different kind of analysis to say how one variable depends on the other.
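A tiny NumPy sketch makes this concrete: all three lines from the figure give exactly r = 1 despite having very different slopes.

```python
import numpy as np

x = np.arange(10)
for slope in (5, 10, 1):        # the three lines: y = 3 + 5x, 3 + 10x, 3 + x
    y = 3 + slope * x
    r = np.corrcoef(x, y)[0, 1]
    print(slope, round(r, 6))   # r = 1.0 in every case
```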
Relationships and regression

Applications of Regression Analysis:

Much of biology is concerned with relationships between numeric variables. For example:

 We sample fish and measure their length and weight because we want to understand how weight changes with length.
 We survey grassland plots and measure soil pH and species diversity to understand how species diversity depends on soil pH.
Applications of Regression Analysis:

 We manipulate temperature and measure fitness in insects because we want to describe their thermal tolerance.
 Studying the joint effect of seed quality, soil fertility, fertilizer use, temperature and rainfall on rice yield.
Regression Analysis:

 In contrast to correlation, a regression analysis allows us to make precise statements about how one numeric variable depends on the values of another. Graphically, we can evaluate such dependencies using a scatter plot. We may be interested in knowing:
 Are the variables related or not? There’s not much point in studying a relationship that isn’t there.
 Is the relationship positive or negative? Sometimes we can answer a scientific question just by knowing the direction of a relationship.
 Is the relationship a straight line or a curve? It is important to know the form of a relationship if we want to make predictions.
Uses of Regression Analysis:

Purpose: Models the relationship by fitting a line (equation) that predicts the value of one variable (dependent) from another (independent).

 To know the form of the relationship
 Parameter estimation
 For controlling purposes (in production processes)
 Predictions: given the value of X, the value of Y can be estimated
Uses of Regression Analysis:

[Figure: a fitted regression line illustrating interpolation (predicting within the observed range of X) and extrapolation (predicting beyond it).]
Regression

 Univariate regression: single response variable (‘y’)
   • Simple linear regression
   • Multiple linear regression
 Multivariate regression: multiple response variables (many ‘y’s)
Steps in Regression Analysis:

1. Statement of the problem
2. Selection of potentially relevant variables
3. Data collection
4. Model specification
5. Model validation
6. Use the fitted model
What does linear regression do?

 Simple linear regression allows researchers to predict how one variable (the response variable) depends on another (the predictor variable), assuming a straight-line relationship.
How does simple linear regression work?

Finding the best fit line:

 If we draw a straight line through a set of points on a graph then, unless they form a perfect straight line, some points will lie close to the line and others further away.
 The vertical distances between the line and each point (i.e. measured parallel to the 𝑦-axis) are called residuals.

• The residuals represent the ‘left over’ variation after the line has been fitted through the data. They indicate how well the line fits the data.
• If all the points lay close to the line, the variability of the residuals would be low relative to the overall variation in the response variable, 𝑦.

Regression works by finding the line which minimises the size of the residuals in some sense.
Simple Linear Regression

$$Y = \beta_0 + \beta_1 X + \varepsilon$$

where

 𝑦 is the response variable,
 𝑥 is the predictor variable,
 β₀ is the intercept (i.e. where the line crosses the 𝑦-axis),
 β₁ is the slope of the line, and
 ε is the random error.

 The slope of the line is the amount by which 𝑦 changes for a change of one unit in 𝑥.
Simple Linear Regression

$$Y = \beta_0 + \beta_1 X + \varepsilon$$

 The slope of the line is the amount by which 𝑦 changes for a change of one unit in 𝑥.
 If the slope is positive (i.e. a plus sign in the above equation), the line slopes upwards to the right.
 A negative slope (𝑦 = β₀ − β₁𝑥, with β₁ > 0) means the line slopes downwards to the right.
Calculating Simple Linear Regression

 The Simple Linear Regression Model:

$$y = \beta_0 + \beta_1 x + \varepsilon$$

 Estimating the model parameters by the OLS (ordinary least squares) method gives the Least Squares Regression Line:

$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$$

where

$$\hat{\beta}_1 = \frac{SS_{xy}}{SS_x}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$

$$SS_x = \sum (x_i - \bar{x})^2 = \sum x_i^2 - \frac{(\sum x_i)^2}{n}$$

$$SS_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y}) = \sum x_i y_i - \frac{(\sum x_i)(\sum y_i)}{n}$$
Example - SLR

An educational economist wants to establish the relationship between an individual’s income and education. He takes a random sample of 10 individuals and asks for their annual income (in $1000s) and education (in years). The results are shown below.

Education (years):  11 12 11 15  8 10 11 12 17 11
Income ($1000s):    25 33 22 41 18 28 32 24 53 26
Dependent and Independent Variables

 The dependent variable is the one that we want to forecast or analyze.
 The independent variable is hypothesized to affect the dependent variable.
 In this example, we wish to analyze income, and we choose the variable that most affects income: the individual’s education. Hence, y is income and x is the individual’s education.
First Step:

$$\sum x_i = 118, \qquad \sum y_i = 302, \qquad \sum x_i^2 = 1450, \qquad \sum x_i y_i = 3779$$

Sum of Squares:

$$SS_{xy} = \sum x_i y_i - \frac{(\sum x_i)(\sum y_i)}{n} = 3779 - \frac{118 \times 302}{10} = 215.4$$

$$SS_x = \sum x_i^2 - \frac{(\sum x_i)^2}{n} = 1450 - \frac{118^2}{10} = 57.6$$

Therefore,

$$\hat{\beta}_1 = \frac{SS_{xy}}{SS_x} = \frac{215.4}{57.6} = 3.74$$

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = \frac{302}{10} - 3.74 \times \frac{118}{10} = -13.93$$
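A minimal NumPy sketch, using the education/income data above, reproduces these estimates directly from the formulas:

```python
import numpy as np

x = np.array([11, 12, 11, 15, 8, 10, 11, 12, 17, 11])   # education (years)
y = np.array([25, 33, 22, 41, 18, 28, 32, 24, 53, 26])   # income ($1000s)
n = len(x)

ss_xy = np.sum(x * y) - x.sum() * y.sum() / n   # 215.4
ss_x = np.sum(x ** 2) - x.sum() ** 2 / n        # 57.6

beta1 = ss_xy / ss_x                 # slope, ~3.74
beta0 = y.mean() - beta1 * x.mean()  # intercept, ~-13.93
print(beta0, beta1)
```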
The Least Squares Regression Line

 The least squares regression line is

$$\hat{y} = -13.93 + 3.74x$$

 Interpretation of coefficients:
 The sample slope $\hat{\beta}_1 = 3.74$ tells us that, on average, for each additional year of education an individual’s income rises by $3.74 thousand.
 The y-intercept is $\hat{\beta}_0 = -13.93$. This value is the expected (or average) income for an individual who has zero years of education (which is meaningless here).
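As an illustration of the prediction use of the fitted line (a hypothetical value of x, not from the original example): for an individual with 12 years of education, the model predicts

$$\hat{y} = -13.93 + 3.74 \times 12 = 30.95$$

i.e. an expected annual income of about $30,950.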
Coefficient of Determination (R²)

 R² measures the degree of linear association between X and Y.
 So, an R² close to 0 does not necessarily indicate that X and Y are unrelated (the relationship can be nonlinear).
 Also, a high R² does not necessarily indicate that the estimated regression line is a good fit.

For simple linear regression,

$$R^2 = r^2$$
Degrees of freedom, mean squares and F tests

 Degrees of freedom: the number of independent components in a statistic.

Source of Variation | Sum of Squares  | Degrees of Freedom | Mean Squares    | F0
Regression          | SSR             | 1                  | MSR = SSR/1     | F0 = MSR/MSE
Residual            | SSE = SST - SSR | n-2                | MSE = SSE/(n-2) |
Total               | SST             | n-1                |                 |

 To test the hypothesis H0: β1 = 0 at the α% level of significance, compute the test statistic F0 and reject H0 if F0 > F(α; 1, n−2), or use the p-value.
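Assuming the statsmodels package is available, a short sketch computes F0 and its p-value for the education/income example (values in comments are approximate):

```python
import numpy as np
import statsmodels.api as sm

x = np.array([11, 12, 11, 15, 8, 10, 11, 12, 17, 11])
y = np.array([25, 33, 22, 41, 18, 28, 32, 24, 53, 26])

model = sm.OLS(y, sm.add_constant(x)).fit()

# F0 = MSR/MSE with (1, n-2) degrees of freedom
print(model.fvalue, model.f_pvalue)  # F0 ~ 44.1; p-value << 0.05, so reject H0: beta1 = 0
```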
Interpreting the SLR Model

Fitted model:

 The residuals are the difference between the actual values and the predicted values.
 The p-value, in association with the t-statistic, shows how significant each coefficient is to the model.
 The residual standard error is the average amount that the actual values of Y (the dots) differ from the predictions (the line), in units of Y.

Significance of the model:

 Coefficient of determination: R² = SSR/SST × 100%.
 Also, R² = (coefficient of correlation)², i.e. R² = r².
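A sketch of how these quantities can be read from statistical software, here statsmodels in Python on the education/income data (the exact layout of the output differs across packages):

```python
import numpy as np
import statsmodels.api as sm

x = np.array([11, 12, 11, 15, 8, 10, 11, 12, 17, 11])
y = np.array([25, 33, 22, 41, 18, 28, 32, 24, 53, 26])

model = sm.OLS(y, sm.add_constant(x)).fit()
print(model.summary())           # full table: coefficients, t-statistics, p-values, R-squared

print(model.params)              # beta0 ~ -13.93, beta1 ~ 3.74
print(model.pvalues)             # p-value for each coefficient
print(model.rsquared)            # R-squared ~ 0.85
print(np.sqrt(model.mse_resid))  # residual standard error, in units of Y
```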
Test for Model Adequacy (Testing assumptions/Testing residuals)

1. Test for normality

H0: errors have a normal distribution
H1: errors do not have a normal distribution

This can be assessed by constructing a “normal probability plot of residuals”.

 A normal probability plot of the residuals is a scatter plot with the theoretical percentiles of the normal distribution on the x-axis and the sample percentiles of the residuals on the y-axis.
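A minimal sketch of this check with statsmodels, continuing the education/income fit; points falling close to the reference line support the normality assumption:

```python
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

x = np.array([11, 12, 11, 15, 8, 10, 11, 12, 17, 11])
y = np.array([25, 33, 22, 41, 18, 28, 32, 24, 53, 26])
model = sm.OLS(y, sm.add_constant(x)).fit()

# Normal probability (Q-Q) plot of the residuals
sm.qqplot(model.resid, line="s")  # "s" adds a line fitted through the sample quantiles
plt.show()
```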
Test for Model Adequacy (Testing assumptions/Testing residuals)

2. Test for constant variance

This can be assessed by constructing a “Residuals vs. Fits plot”.

 A Residuals vs. Fits plot is a scatter plot of residuals on the y-axis and fitted values (estimated responses) on the x-axis. The plot is used to detect non-linearity, unequal error variances, and outliers.
 Ideally, the residuals have fairly constant variance (i.e. the distance between the residuals and the value zero) at each level of the fitted values: symmetric around 0 and parallel to the x-axis.
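A sketch of the corresponding plot, using the same fitted model as above:

```python
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

x = np.array([11, 12, 11, 15, 8, 10, 11, 12, 17, 11])
y = np.array([25, 33, 22, 41, 18, 28, 32, 24, 53, 26])
model = sm.OLS(y, sm.add_constant(x)).fit()

# Residuals vs. fitted values: look for a band symmetric around 0
plt.scatter(model.fittedvalues, model.resid)
plt.axhline(0, linestyle="--")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()
```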
Non-constant variance:

• If the variance is not constant, there is no constant MSE and the ANOVA is invalid.
• To avoid this situation, a data transformation is used.

Constant variance but a clear pattern in the residuals:

• The fitted model is wrong.
• Maybe a non-linear regression is needed.
Test for Model Adequacy (Testing assumptions/Testing residuals)

2. Test for constant variance (contd.)

 Another method to test for constant variance is by using the Scale-Location plot.
 Here, the fitted values are plotted against the square root of the standardized residuals.
 Ideally, the residual points should spread equally around the red line, which would indicate constant variance.
Linearity assumption?

 If there is a clear pattern in the residual plot, this indicates that we failed to meet the assumption of a linear relationship between the predictors and the outcome variable.
Test for Model Adequacy (Testing assumptions/Testing residuals)

3. Test for independence

 The easiest way to check the assumption of independence is using the Durbin-Watson test.
 The null hypothesis states that the errors are not auto-correlated with themselves (they are independent).
 Thus, if the p-value > 0.05, we would fail to reject the null hypothesis.
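A sketch of the statistic with statsmodels, on the same fitted model. Note that durbin_watson returns the DW statistic itself rather than a p-value, so it is read against the benchmarks listed on the next slide:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

x = np.array([11, 12, 11, 15, 8, 10, 11, 12, 17, 11])
y = np.array([25, 33, 22, 41, 18, 28, 32, 24, 53, 26])
model = sm.OLS(y, sm.add_constant(x)).fit()

dw = durbin_watson(model.resid)
print(dw)  # ~2 suggests independent errors; <2 positive, >2 negative autocorrelation
```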
Test for Model Adequacy (Testing assumptions/Testing residuals)

Interpreting Durbin-Watson Values:

 DW = 2: This signifies no autocorrelation, meaning the residuals are independent, which is the ideal scenario for linear regression assumptions.
 DW < 2: This suggests positive autocorrelation, where successive errors are correlated positively.
 DW > 2: This indicates negative autocorrelation, where successive errors are correlated negatively.
