Hypothesis testing for correlation and regression
We consider the probabilistic model y = A + Bx + e for the relationship between two random
variables x and y, where x is the independent variable, y is the dependent variable, and e is the random error.
The theoretical background for making inferences about the slope B lies in the following
properties of the least squares estimator b:
PROPERTIES OF THE LEAST SQUARES ESTIMATOR b
1. b has a sampling distribution that is normally distributed.
2. The mean of the least squares estimator b is B, E(b) = B, that is, b is an
unbiased estimator for B.
3. The standard deviation of the sampling distribution of b is
$$\sigma_b = \frac{\sigma}{\sqrt{SS_{xx}}},$$
where σ is the standard deviation of the random error e and $SS_{xx} = \sum (x_i - \bar{x})^2$.
We will use these results to test hypotheses about and to construct a confidence
interval for the slope B of the population regression line.
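The three properties can be checked numerically. The following is a minimal simulation sketch, not part of the original notes: the parameter values (A = 1, B = 2, σ = 1.5), the x values, and the function name `simulate_slopes` are all assumptions chosen for illustration. It repeatedly draws samples from y = A + Bx + e with normal errors, fits the least squares slope each time, and compares the mean and standard deviation of the fitted slopes with B and σ/√SSxx.

```python
import random
import statistics

def simulate_slopes(A, B, sigma, xs, n_reps=5000, seed=0):
    """Draw repeated samples from y = A + B*x + e and return the fitted slopes.

    All argument values used below are hypothetical; they are not taken
    from the examples in the text.
    """
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    x_bar = statistics.fmean(xs)
    ss_xx = sum((x - x_bar) ** 2 for x in xs)
    slopes = []
    for _ in range(n_reps):
        # Generate one sample with normally distributed random error e.
        ys = [A + B * x + rng.gauss(0.0, sigma) for x in xs]
        y_bar = statistics.fmean(ys)
        ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
        slopes.append(ss_xy / ss_xx)  # least squares estimate b
    return slopes, ss_xx

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
slopes, ss_xx = simulate_slopes(A=1.0, B=2.0, sigma=1.5, xs=xs)

mean_b = statistics.fmean(slopes)   # should be close to B (unbiasedness)
sd_b = statistics.stdev(slopes)     # should be close to sigma / sqrt(SSxx)
theoretical_sd = 1.5 / ss_xx ** 0.5

print(f"mean of b = {mean_b:.3f}, sd of b = {sd_b:.3f}, "
      f"sigma/sqrt(SSxx) = {theoretical_sd:.3f}")
```

With enough repetitions the simulated mean and standard deviation should land near the theoretical values, which is what properties 2 and 3 state.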
Since σ is usually unknown, we use its estimator s, and instead of $\sigma_b = \sigma/\sqrt{SS_{xx}}$
we use its estimate $s_b = s/\sqrt{SS_{xx}}$.
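As a rough illustration of how the estimate s/√SSxx is obtained from raw data, here is a short sketch. The data set is made up, the helper name `slope_standard_error` is hypothetical, and the formula s = √(SSE/(n − 2)) is the usual estimator of σ in simple linear regression, which these notes use but do not write out; treat it as an assumption here.

```python
import math

def slope_standard_error(xs, ys):
    """Return (b, s, ss_xx, s_b) for a least squares line fitted to (xs, ys).

    Assumes s = sqrt(SSE / (n - 2)), the standard estimator of sigma
    in simple linear regression.
    """
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    ss_xx = sum((x - x_bar) ** 2 for x in xs)
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = ss_xy / ss_xx                 # least squares slope
    a = y_bar - b * x_bar             # least squares intercept
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    s = math.sqrt(sse / (n - 2))      # estimate of sigma
    s_b = s / math.sqrt(ss_xx)        # estimated standard deviation of b
    return b, s, ss_xx, s_b

# Hypothetical data, not the cigarette or food-cost samples from the text.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 1.0, 4.0, 3.0, 5.0]

b, s, ss_xx, s_b = slope_standard_error(xs, ys)
print(f"b = {b:.3f}, s = {s:.3f}, SSxx = {ss_xx:.1f}, s_b = {s_b:.3f}")
```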
To model the relationship between the CO (carbon monoxide) ranking, y, and the
nicotine content, x, of an American-made cigarette, the Federal Trade Commission
tested a random sample of 5 cigarettes. The CO ranking and nicotine content values
are given in Table
At significance level α = 0.05, test the hypothesis that the nicotine content of a
cigarette contributes useful information for the prediction of carbon monoxide
ranking y, i.e., test the predictive ability of the least squares straight line model
ŷ = −3.0 + 20.5x.
Solution:
Testing the usefulness of the model requires testing the hypothesis
$$H_0: B = 0 \quad \text{against} \quad H_a: B \neq 0.$$
With n = 5 and α = 0.05, the critical value based on (5 − 2) = 3 df is obtained from the t table:
$$t_{\alpha/2} = t_{0.025} = 3.182.$$
Thus, we will reject H0 if t < −3.182 or t > 3.182.
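In order to compute the test statistic we need the values of b, s and SSxx.
From the least squares straight line, b = 20.5.
From the sample data, s = 1.82 and SSxx = 0.4.
The test statistic is
$$t = \frac{b}{s/\sqrt{SS_{xx}}} = \frac{20.5}{1.82/\sqrt{0.4}} \approx \frac{20.5}{2.88} \approx 7.12.$$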
Since the calculated t-value is greater than the critical value t0.025 = 3.182, we reject
the null hypothesis and conclude that the slope B ≠ 0. At the significance level α =
0.05, the sample data provide sufficient evidence to conclude that nicotine content
does contribute useful information for prediction of carbon-monoxide ranking using
the linear model.
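The arithmetic of this test can be reproduced in a few lines. This is only a sketch using the summary values quoted in the example (b = 20.5, s = 1.82, SSxx = 0.4) and the table value 3.182; the variable names are assumptions, and the critical value is hard-coded from the text rather than computed.

```python
import math

# Summary values quoted in the cigarette example.
b = 20.5
s = 1.82
ss_xx = 0.4

# Critical value t_0.025 with 3 df, taken from the text (not computed here).
t_critical = 3.182

s_b = s / math.sqrt(ss_xx)   # estimated standard deviation of b
t_stat = b / s_b             # test statistic for H0: B = 0

# Two-tailed test: reject H0 if t < -3.182 or t > 3.182.
reject_h0 = abs(t_stat) > t_critical

print(f"s_b = {s_b:.2f}, t = {t_stat:.2f}, reject H0: {reject_h0}")
```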
A consumer investigator obtained the following least squares straight line model
(based on a sample of n = 100 families) relating the yearly food cost y for a
family of 4 to annual income x:
ŷ = 467 + 0.26x
In addition, the investigator computed the quantities s = 1.1 and SSxx = 26. Compute
the observed p-value for a test to determine whether mean yearly food cost y
increases as annual income x increases, i.e., whether the slope B of the population
regression line is positive.
Solution:
The consumer investigator wants to test
$$H_0: B = 0 \quad \text{against} \quad H_a: B > 0.$$
To compute the observed significance level (p-value ) of the test we must first find
the calculated value of the test statistic, tc .
Since b = 0.26, s = 1.1, and SSxx = 26,
$$t_c = \frac{b}{s/\sqrt{SS_{xx}}} = \frac{0.26}{1.1/\sqrt{26}} \approx 1.21.$$
The observed significance level or p-value is given by
P(t > tc) = P(t > 1.21),
where the t-distribution is based on (n − 2) = (100 − 2) = 98 df.
Since df > 30, we can approximate the t-distribution with the z-distribution:
p-value = P(t > 1.21) ≈ P(z > 1.21) = 0.5 − 0.3869 = 0.1131
To conclude that the mean yearly food cost increases as annual income
increases (B > 0), we would have to tolerate a significance level α ≥ 0.1131. That is a large risk, and usually we take
α = 0.05. At this significance level we cannot reject the null hypothesis H0, so
we consider the sample result to be statistically insignificant.
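The p-value calculation can be sketched with the Python standard library. The code below follows the same steps as the solution: it rounds the test statistic to two decimals, as one would for a table lookup, and uses the normal approximation to the t-distribution with 98 df. The variable names are assumptions.

```python
import math
from statistics import NormalDist

# Summary values quoted in the food-cost example.
b = 0.26
s = 1.1
ss_xx = 26.0
alpha = 0.05

t_c = b / (s / math.sqrt(ss_xx))   # calculated test statistic
t_rounded = round(t_c, 2)          # rounded as for a z-table lookup

# Upper-tail test (Ha: B > 0); normal approximation since df = 98 > 30.
p_value = 1.0 - NormalDist().cdf(t_rounded)

reject_h0 = p_value < alpha

print(f"t_c = {t_rounded}, p-value = {p_value:.4f}, reject H0: {reject_h0}")
```

Using the unrounded statistic, or an exact t-distribution with 98 df, would change the p-value slightly but not the conclusion at α = 0.05.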