
CH 2

This document discusses statistical estimation and introduces concepts like point estimation, interval estimation, confidence intervals, and properties of estimators. It provides examples of how to calculate point and interval estimates of population means and proportions, including the formulas and steps to construct confidence intervals.

Uploaded by

tihunwuro
Copyright
© © All Rights Reserved

CHAPTER-TWO

Estimation and sampling distribution


Statistical inference is the act of generalizing from the data (“sample”) to a larger phenomenon (“population”)
with a calculated degree of certainty. The act of generalizing and deriving statistical judgments is the process of
inference. [Note: There is a distinction between causal inference and statistical inference. Here we consider
only statistical inference.]
In statistics there are two ways through which inference can be made:
- Statistical estimation
- Statistical hypothesis testing
In this section we consider only statistical estimation.
Definition: Statistical Estimation is one way of making inference about the population parameter where
the investigator does not have any prior notion about values or characteristics of the population
parameter.
There are two types of estimation.
1) Point estimation
It is a procedure that results in a single value as an estimate for a parameter.
2) Interval estimation
It is a procedure that results in an interval of values as an estimate for a parameter, that is, an
interval that contains the likely values of the parameter. It deals with identifying the upper and lower
limits of a parameter. The limits themselves are random variables.
Definitions of terms used in this section:
Confidence Interval: An interval estimate with a specific level of confidence.
Confidence Level: The percentage of the time that interval estimates constructed in this way will contain the true value.
Degrees of Freedom: The number of data values which are free to vary once a statistic has been
determined.
Estimator: A sample statistic which is used to estimate a population parameter. A good estimator should be unbiased,
consistent, and relatively efficient.
Estimate: A particular value that an estimator takes for a given sample.
Interval Estimate: A range of values used to estimate a parameter.
Point Estimate: A single value used to estimate a parameter.

Properties of the best estimator
- Unbiased Estimator: An estimator whose expected value is the value of the parameter being
estimated.
- Consistent Estimator: An estimator which gets closer to the value of the parameter as the
sample size increases.
- Relatively Efficient Estimator: Among the estimators of a parameter, the one with the smallest variance.

2.1 Point and Interval estimation of the population mean: µ


- Point Estimation
Another term for a statistic is point estimate, since we are estimating the parameter value. A point
estimator is the mathematical rule we use to compute the point estimate. For instance, the sum of $x_i$ over n is the
point estimator used to compute the estimate of the population mean. That is,

$$\bar{X} = \frac{\sum x_i}{n}$$

is a point estimator of the population mean.
Confidence interval estimation of the population mean: µ

Although $\bar{X}$ possesses nearly all the qualities of a good estimator, because of sampling error we know
that the sample statistic is unlikely to be exactly equal to the population parameter. Instead, we report
an interval of values that is likely to contain the parameter. We will have to be satisfied knowing that the statistic is "close to" the
parameter.

There are different cases to be considered to construct confidence intervals.

Case 1:

If sample size is large or if the population is normal with known variance

Consider samples of size n drawn from a population whose mean is $\mu$ and standard deviation is $\sigma$, with
replacement and order important. The population can have any frequency distribution. The sampling
distribution of $\bar{X}$ will have a mean $\mu_{\bar{x}} = \mu$ and a standard deviation $\sigma_{\bar{x}} = \sigma/\sqrt{n}$, and approaches a normal
distribution as n gets large. This allows us to use the normal distribution curve for computing
confidence intervals.

$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \text{ has a normal distribution with mean } 0 \text{ and variance } 1$$

$$\Rightarrow \mu = \bar{X} \pm Z\,\sigma/\sqrt{n} = \bar{X} \pm \varepsilon, \text{ where } \varepsilon \text{ is a measure of error (margin of error)}$$

$$\Rightarrow \varepsilon = Z\,\sigma/\sqrt{n}$$

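- For the interval estimator to be good, the margin of error ($\varepsilon$) should be small. How can it be made small?
  - By making n large
  - By having small variability ($\sigma$)
  - By taking Z small
- To obtain the value of Z, we have to attach this to a theory of chance. That is, there is an area of size
$1 - \alpha$ such that

$$P(-Z_{\alpha/2} < Z < Z_{\alpha/2}) = 1 - \alpha$$

where $\alpha$ is the probability that the parameter lies outside the interval, and $Z_{\alpha/2}$ stands for the standard normal value to the right of which
$\alpha/2$ probability lies, i.e. $P(Z > Z_{\alpha/2}) = \alpha/2$.

$$\Rightarrow P\left(-Z_{\alpha/2} < \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} < Z_{\alpha/2}\right) = 1 - \alpha$$

$$\Rightarrow P\left(\bar{X} - Z_{\alpha/2}\,\sigma/\sqrt{n} < \mu < \bar{X} + Z_{\alpha/2}\,\sigma/\sqrt{n}\right) = 1 - \alpha$$

$$\Rightarrow \left(\bar{X} - Z_{\alpha/2}\,\sigma/\sqrt{n},\ \bar{X} + Z_{\alpha/2}\,\sigma/\sqrt{n}\right) \text{ is a } 100(1 - \alpha)\% \text{ confidence interval for } \mu$$

But usually $\sigma^2$ is not known; in that case we estimate it by its point estimator $S^2$.

$$\Rightarrow \left(\bar{X} - Z_{\alpha/2}\,S/\sqrt{n},\ \bar{X} + Z_{\alpha/2}\,S/\sqrt{n}\right) \text{ is a } 100(1 - \alpha)\% \text{ confidence interval for } \mu$$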

Here are the z values corresponding to the most commonly used confidence levels.

$100(1 - \alpha)\%$    $\alpha$    $\alpha/2$    $Z_{\alpha/2}$
90                     0.10        0.05          1.645
95                     0.05        0.025         1.96
99                     0.01        0.005         2.58
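
The tabulated z values can also be computed rather than looked up. The following is a minimal sketch, assuming Python 3.8 or later (for `statistics.NormalDist`); the helper name `z_critical` is introduced here for illustration only and is not part of the notes.

```python
from statistics import NormalDist


def z_critical(confidence_level):
    """Return Z_{alpha/2} for a two-sided interval at the given confidence level."""
    alpha = 1 - confidence_level
    # Z_{alpha/2} leaves alpha/2 probability to its right,
    # so it is the (1 - alpha/2) quantile of the standard normal.
    return NormalDist().inv_cdf(1 - alpha / 2)


for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%} confidence: z = {z_critical(level):.3f}")
```

For the 99% level this gives approximately 2.576, which the table rounds to 2.58.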

Case 2:

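If sample size is small and the population variance, $\sigma^2$, is not known.

$$t = \frac{\bar{X} - \mu}{S/\sqrt{n}} \text{ has a } t \text{ distribution with } n - 1 \text{ degrees of freedom.}$$

$$\Rightarrow \left(\bar{X} - t_{\alpha/2}\,S/\sqrt{n},\ \bar{X} + t_{\alpha/2}\,S/\sqrt{n}\right) \text{ is a } 100(1 - \alpha)\% \text{ confidence interval for } \mu$$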

The unit of measurement of the confidence interval is the standard error. This is just the standard
deviation of the sampling distribution of the statistic.

Examples:

1. From a normal population, a sample of 25 trainers was taken and the average heart rate was found to be 32. Given
that the population standard deviation is 4.2, find
a) A 95% confidence interval for the population mean.
b) A 99% confidence interval for the population mean.

Solution:

a) $\bar{X} = 32,\ \sigma = 4.2,\ 1 - \alpha = 0.95 \Rightarrow \alpha = 0.05,\ \alpha/2 = 0.025$

$\Rightarrow Z_{\alpha/2} = 1.96$ from the table.
$\Rightarrow$ The required interval will be $\bar{X} \pm Z_{\alpha/2}\,\sigma/\sqrt{n}$
$= 32 \pm 1.96 \times 4.2/\sqrt{25}$
$= 32 \pm 1.65$
$= (30.35, 33.65)$

Therefore, we are 95% confident that the average heart rate of the population falls
between 30.35 and 33.65.

b) $\bar{X} = 32,\ \sigma = 4.2,\ 1 - \alpha = 0.99 \Rightarrow \alpha = 0.01,\ \alpha/2 = 0.005$

$\Rightarrow Z_{\alpha/2} = 2.58$ from the table.
$\Rightarrow$ The required interval will be $\bar{X} \pm Z_{\alpha/2}\,\sigma/\sqrt{n}$
$= 32 \pm 2.58 \times 4.2/\sqrt{25}$
$= 32 \pm 2.17$
$= (29.83, 34.17)$

Therefore, we are 99% confident that the average heart rate of the population falls between 29.83 and
34.17.
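
The arithmetic in this example can be checked with a few lines of code. This is a minimal sketch, with the function name `z_interval` chosen here for illustration; the z values are the tabulated ones used in the solution.

```python
from math import sqrt


def z_interval(xbar, sigma, n, z):
    """Confidence interval xbar +/- z * sigma / sqrt(n) for a population mean."""
    margin = z * sigma / sqrt(n)
    return xbar - margin, xbar + margin


# Values from Example 1: sample mean 32, sigma 4.2, n = 25.
low95, high95 = z_interval(32, 4.2, 25, 1.96)
low99, high99 = z_interval(32, 4.2, 25, 2.58)

print(f"95% C.I.: ({low95:.2f}, {high95:.2f})")
print(f"99% C.I.: ({low99:.2f}, {high99:.2f})")
```

The wider 99% interval reflects the larger z value: more confidence is bought with a less precise interval.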

2. A drug company is testing a new drug which is supposed to reduce blood pressure. From the six
people who are used as subjects, it is found that the average drop in blood pressure is 2.28 points,
with a standard deviation of .95 points. What is the 95% confidence interval for the mean change in
pressure?

Solution: (exercise)

2.2 Point and Interval estimation of the population proportion:

The procedure to find the confidence interval, the sample size, the error bound, and the confidence level
for a proportion is similar to that for the population mean, but the formulas are different. How do you
know you are dealing with a proportion problem? There is no mention of a mean or average!

To form a proportion, take X, the random variable for the number of successes, and divide it by n, the
number of trials (or the sample size). The random variable $\hat{p}$ (read "p hat") is that proportion, $\hat{p} = X/n$.

When n is large and p is not close to zero or one (as a rule of thumb, when $np \ge 5$ and $n(1 - p) \ge 5$), we can use the normal distribution to
approximate the distribution of the number of successes.

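$X \sim N(np, \sqrt{npq})$, where $q = 1 - p$. If we divide the random variable, the mean, and the standard deviation by n, we get a
normal distribution of proportions with $\hat{p}$, called the estimated proportion, as the random variable.
(Recall that a proportion is the number of successes divided by n.)

$$\frac{X}{n} = \hat{p} \sim N\left(\frac{np}{n}, \frac{\sqrt{npq}}{n}\right)$$

Using algebra to simplify: $\frac{\sqrt{npq}}{n} = \sqrt{\frac{pq}{n}}$

$\hat{p}$ follows a normal distribution for proportions: $\frac{X}{n} = \hat{p} \sim N\left(p, \sqrt{\frac{pq}{n}}\right)$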

Calculating the Margin of error (E)

The margin of error for a proportion is $E = Z_{\alpha/2}\sqrt{\frac{\hat{p}\hat{q}}{n}}$, where $\hat{q} = 1 - \hat{p}$.

This formula is similar to the margin of error formula for a mean, except that the "appropriate standard
deviation" is different. For a mean, when the population standard deviation is known, the appropriate
standard deviation that we use is $\sigma/\sqrt{n}$. For a proportion, the appropriate standard deviation is $\sqrt{pq/n}$.

However, in the margin of error formula, we use $\sqrt{\hat{p}\hat{q}/n}$ as the standard deviation, instead of $\sqrt{pq/n}$.

In the margin of error formula, the sample proportions $\hat{p}$ and $\hat{q}$ are estimates of the unknown population
proportions p and q. The estimated proportions $\hat{p}$ and $\hat{q}$ are used because p and q are not known. The
sample proportions $\hat{p}$ and $\hat{q}$ are calculated from the data: $\hat{p}$ is the estimated proportion of successes, and
$\hat{q}$ is the estimated proportion of failures.

The $(1 - \alpha)100\%$ confidence interval for p is:

$$\hat{p} \pm Z_{\alpha/2}\sqrt{\frac{\hat{p}\hat{q}}{n}}$$
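
As an illustration of the interval for a proportion, here is a minimal sketch. The counts (40 successes in 100 trials) are hypothetical and chosen only to keep the arithmetic easy to follow; they do not come from the notes, and the function name `proportion_interval` is likewise an assumption.

```python
from math import sqrt


def proportion_interval(successes, n, z):
    """Confidence interval p_hat +/- z * sqrt(p_hat * q_hat / n) for a proportion."""
    p_hat = successes / n
    q_hat = 1 - p_hat
    margin = z * sqrt(p_hat * q_hat / n)  # margin of error E
    return p_hat - margin, p_hat + margin


# Hypothetical data: 40 successes out of 100, 95% confidence (z = 1.96).
low, high = proportion_interval(40, 100, 1.96)
print(f"95% C.I. for p: ({low:.4f}, {high:.4f})")
```

The same function can be applied to the exercises that follow by substituting the observed counts and the z value for the required confidence level.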
Example 1: A random sample of 122 statistics students was asked: “Have you smoked a cigarette in the
past week?” 64 students reported smoking within the past week. Find a 90% confidence interval for the
true proportion of statistics students who smoke. (Round the answers to 4 decimal places)

Solution: (exercise)

Example 2: Out of a random sample of 69 freshmen at State University, 33 students have declared a
major. Find a 98% confidence interval for the true proportion of freshmen at State University who have
declared a major. Solution: (exercise)

Example 3: Suppose a mobile phone company wants to determine the current percentage of customers
aged 50+ who use text messaging on their cell phones. How many customers aged 50+ should the
company survey in order to be 90% confident that the estimated (sample) proportion is within three
percentage points of the true population proportion of customers aged 50+ who use text messaging on
their cell phones. Solution: (exercise)

Example 4: Suppose an internet marketing company wants to determine the current percentage of
customers who click on ads on their smartphones. How many customers should the company survey in
order to be 99% confident that the estimated proportion is within five percentage points of the true
population proportion of customers who click on ads on their smartphones? (Use Population
proportion, P=0.5)

2.3 Confidence interval estimation for the difference of means: -


Consider two different populations. The first population (X) has mean $\mu_x$ and standard deviation $\sigma_x$; the
second (Y) has mean $\mu_y$ and standard deviation $\sigma_y$. From the first population take a sample of size $n_x$ and
compute its mean $\bar{x}$; from the second population independently take a sample of size $n_y$ and compute $\bar{y}$;
then determine $\bar{x} - \bar{y}$. Do this for all pairs of samples that can be chosen independently from the two
populations. The differences, $\bar{x} - \bar{y}$, are a new set of scores which form the sampling distribution of
differences of means.
The characteristics of the sampling distribution of differences of means are:
- The mean of the sampling distribution of differences of means equals the difference of the
population means (Mean = $\mu_x - \mu_y$).

- The standard deviation of the sampling distribution of differences of means, also called the
standard error of differences of means, is denoted by $\sigma_{(\bar{x} - \bar{y})}$:

$$\sigma_{(\bar{x} - \bar{y})} = \sqrt{\sigma_{\bar{x}}^2 + \sigma_{\bar{y}}^2}$$

where $\sigma_{\bar{x}}$ is the standard error of the mean for the first population and $\sigma_{\bar{y}}$ is the
standard error of the mean for the second population ($\sigma_{\bar{x}}^2 = \sigma_x^2/n_x$; $\sigma_{\bar{y}}^2 = \sigma_y^2/n_y$).
- The sampling distribution is normal if both populations are normal, and is approximately normal if
the samples are large enough (even if the populations aren’t normal). In practice, it is assumed that
the sampling distribution of differences of means is normal if both $n_x$ and $n_y$ are ≥ 30.
Then the $(1 - \alpha)100\%$ C.I. for the difference between the two population means, $\mu_x - \mu_y$, is:

$$(\bar{x} - \bar{y}) \pm Z_{\alpha/2}\sqrt{\frac{\sigma_x^2}{n_x} + \frac{\sigma_y^2}{n_y}}$$

Eg. If a random sample of 50 non-smokers has a mean life of 76 years with a standard deviation of 8
years, and a random sample of 65 smokers has a mean life of 68 years with a standard deviation of 9 years,
A) What is the point estimate for the difference of the population means?
B) Find a 95% C.I. for the difference of mean lifetime of non-smokers and smokers.

Solutions:
Population x (non-smokers): $n_x = 50$, $\bar{x} = 76$, $S_x = 8$, $\sigma_{\bar{x}}^2 = S_x^2/n_x = 64/50 = 1.28$ years²
Population y (smokers): $n_y = 65$, $\bar{y} = 68$, $S_y = 9$, $\sigma_{\bar{y}}^2 = S_y^2/n_y = 81/65 = 1.25$ years²

A) A point estimate for the difference of population means ($\mu_x - \mu_y$) is $\bar{x} - \bar{y} = 76 - 68 = 8$ years.
B) At a 95% confidence level, $Z = \pm 1.96$ and

$$Z\,\sigma_{(\bar{x} - \bar{y})} = \pm 1.96\sqrt{\sigma_{\bar{x}}^2 + \sigma_{\bar{y}}^2} = \pm 1.96\sqrt{1.28 + 1.25} = \pm 1.96\sqrt{2.53} = \pm 1.96 \times 1.59 \text{ years}$$

Hence, the 95% C.I. for $\mu_x - \mu_y$ is $(\bar{x} - \bar{y}) \pm 1.96\,\sigma_{(\bar{x} - \bar{y})} = 8 \pm 1.96 \times 1.59 = 8 \pm 3.12$ = (4.88 to 11.12
years).
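
The computation above can be reproduced as follows. This is a minimal sketch using the figures from the example; the function name `diff_means_interval` is introduced here for illustration, and the sample standard deviations stand in for the unknown population values, as in the worked solution.

```python
from math import sqrt


def diff_means_interval(xbar, sx, nx, ybar, sy, ny, z):
    """Large-sample C.I. for mu_x - mu_y using sample standard deviations."""
    diff = xbar - ybar
    # Standard error of the difference of two independent sample means.
    std_error = sqrt(sx**2 / nx + sy**2 / ny)
    margin = z * std_error
    return diff - margin, diff + margin


# Non-smokers: n = 50, mean 76, s = 8; smokers: n = 65, mean 68, s = 9.
low, high = diff_means_interval(76, 8, 50, 68, 9, 65, 1.96)
print(f"95% C.I. for the difference: ({low:.2f}, {high:.2f}) years")
```

Since the whole interval lies above zero, the data suggest that non-smokers live longer on average.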

Exercise An anthropologist who wanted to study the heights of adult men and women took a random
sample of 128 adult men and 100 adult women and found the following summary results.

2.4 Confidence interval estimation for the difference of two proportions:

By the same analogy, the C.I. for the difference of proportions ($P_x - P_y$) is given by the following
formula.

C.I. for $P_x - P_y$ = $(p_x - p_y) \pm Z\,\sigma_{(P_x - P_y)}$, where Z is determined by the confidence coefficient and

$$\sigma_{(P_x - P_y)} = \sqrt{\frac{p_x q_x}{n_x} + \frac{p_y q_y}{n_y}}$$

Example: Each of two groups consists of 100 patients who have leukaemia. A new drug is given to the
first group but not to the second (the control group). It is found that in the first group 75 people have
remission for 2 years; but only 60 in the second group. Find 95% confidence limits for the difference in
the proportion of all patients with leukaemia who have remission for 2 years.

Note that: $n_x p_x = 100 \times 0.75 = 75 > 5$
$n_x q_x = 100 \times 0.25 = 25 > 5$
$n_y p_y = 100 \times 0.60 = 60 > 5$
$n_y q_y = 100 \times 0.40 = 40 > 5$

$p_x = 0.75$, $q_x = 0.25$, $n_x = 100$, $\sigma_{P_x}^2 = p_x q_x/n_x = 0.75 \times 0.25/100 = 0.001875$
$p_y = 0.60$, $q_y = 0.40$, $n_y = 100$, $\sigma_{P_y}^2 = p_y q_y/n_y = 0.60 \times 0.40/100 = 0.0024$

Hence, $\sigma_{(P_x - P_y)} = \sqrt{\sigma_{P_x}^2 + \sigma_{P_y}^2} = \sqrt{0.001875 + 0.0024} = \sqrt{0.004275} = 0.065$

At a 95% confidence level, Z = ±1.96 and the difference of the two sample proportions is
(0.75 − 0.60) = 0.15. Therefore, a 95% C.I. for the difference in the proportion with 2-year remission is
0.15 ± 1.96(0.065) = 0.15 ± 0.13 = (0.02 to 0.28).
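
A short check of this example, written as a minimal sketch; the function name `diff_proportions_interval` is an illustrative choice, and the z value is the tabulated 1.96.

```python
from math import sqrt


def diff_proportions_interval(x_success, nx, y_success, ny, z):
    """Large-sample C.I. for P_x - P_y from two independent samples."""
    px, py = x_success / nx, y_success / ny
    # Standard error: sqrt(px*qx/nx + py*qy/ny).
    std_error = sqrt(px * (1 - px) / nx + py * (1 - py) / ny)
    margin = z * std_error
    return (px - py) - margin, (px - py) + margin


# Treated group: 75 of 100 in remission; control group: 60 of 100.
low, high = diff_proportions_interval(75, 100, 60, 100, 1.96)
print(f"95% C.I. for the difference: ({low:.2f}, {high:.2f})")
```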

Chapter Three

Hypothesis testing about population means and proportion


- This is also one way of making inference about a population parameter, where the investigator has a prior
notion about the value of the parameter.
Definitions:
- Statistical hypothesis: an assertion or statement about the population whose plausibility is to be
evaluated on the basis of the sample data.
- Test statistic: a statistic whose value serves to determine whether to reject or accept the hypothesis
to be tested. It is a random variable.
- Statistical test: a test or procedure used to evaluate a statistical hypothesis; its outcome depends on the
sample data.
There are two types of hypothesis:
Null hypothesis:
- It is the hypothesis to be tested.
- It is the hypothesis of equality or the hypothesis of no difference.
- Usually denoted by H0.
Alternative hypothesis:
- It is the hypothesis available when the null hypothesis has to be rejected.
- It is the hypothesis of difference.
- Usually denoted by H1 or Ha.
Types and size of errors:
- Testing hypothesis is based on sample data which may involve sampling and non-sampling errors.
- The following table gives a summary of possible results of any hypothesis test:

                     Decision
Truth      Reject H0                       Don't reject H0
H0         Type I error ($\alpha$)         Right decision ($1 - \alpha$)
H1         Right decision ($1 - \beta$)    Type II error ($\beta$)

- Type I error: Rejecting the null hypothesis when it is true.
- Type II error: Failing to reject the null hypothesis when it is false.
NOTE:
1. These errors are present in any two-choice decision-making problem.
2. There is always a possibility of committing one or the other error.
3. For a fixed sample size, the probabilities of type I error ($\alpha$) and type II error ($\beta$) have an inverse relationship and therefore cannot be
minimized at the same time.
- In practice we set $\alpha$ at some value and design a test that minimizes $\beta$. This is because a type I
error is often considered to be more serious, and therefore more important to avoid, than a type II
error.
General steps in hypothesis testing:

1. Specify the null hypothesis (H0) and the alternative hypothesis (H1).
2. Select a significance level, $\alpha$.
3. Identify the sampling distribution of the estimator (t, Z, F, $\chi^2$).
4. Calculate a statistic analogous to the parameter specified by the null hypothesis.
5. Identify the critical region from the table.
6. Make a decision.
7. Summarize the result (interpretation).

Hypothesis testing about the population mean $\mu$: (one population)

Suppose the assumed or hypothesized value of $\mu$ is denoted by $\mu_0$; then one can formulate two-sided
(1) and one-sided (2 and 3) hypotheses as follows:

1. $H_0: \mu = \mu_0$ vs $H_1: \mu \ne \mu_0$

2. $H_0: \mu = \mu_0$ vs $H_1: \mu < \mu_0$

3. $H_0: \mu = \mu_0$ vs $H_1: \mu > \mu_0$

CASES:

Case 1: When sampling is from a normal distribution with $\sigma^2$ known

- The relevant test statistic is

$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$$

- After specifying $\alpha$ we have the following regions (critical and acceptance) on the standard normal
distribution corresponding to the above three hypotheses.

Summary table for decision rule

H1                  Reject H0 if                    Accept H0 if
$\mu \ne \mu_0$     $|Z_{cal}| > Z_{\alpha/2}$      $|Z_{cal}| < Z_{\alpha/2}$
$\mu < \mu_0$       $Z_{cal} < -Z_{\alpha}$         $Z_{cal} > -Z_{\alpha}$
$\mu > \mu_0$       $Z_{cal} > Z_{\alpha}$          $Z_{cal} < Z_{\alpha}$

Where: $Z_{cal} = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}}$

Case 2: When sampling is from a normal distribution with $\sigma^2$ unknown and small sample size

- The relevant test statistic is

$$t = \frac{\bar{X} - \mu}{S/\sqrt{n}} \sim t \text{ with } n - 1 \text{ degrees of freedom.}$$

- After specifying $\alpha$ we have the following regions on the Student t-distribution corresponding to the
above three hypotheses.

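H1                  Reject H0 if                    Accept H0 if
$\mu \ne \mu_0$     $|t_{cal}| > t_{\alpha/2}$      $|t_{cal}| < t_{\alpha/2}$
$\mu < \mu_0$       $t_{cal} < -t_{\alpha}$         $t_{cal} > -t_{\alpha}$
$\mu > \mu_0$       $t_{cal} > t_{\alpha}$          $t_{cal} < t_{\alpha}$

Where: $t_{cal} = \frac{\bar{X} - \mu_0}{S/\sqrt{n}}$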

Case 3: When sampling is from a non-normally distributed population or a population whose
functional form is unknown.
- If the sample size is large, one can perform a test of hypothesis about the mean by using:

$$Z_{cal} = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}}, \text{ if } \sigma^2 \text{ is known}$$

$$Z_{cal} = \frac{\bar{X} - \mu_0}{S/\sqrt{n}}, \text{ if } \sigma^2 \text{ is unknown}$$

- The decision rule is the same as in Case 1.
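
The Z-based decision rule used in Case 1 and Case 3 can be sketched in code. This is an illustration only: the function name `z_test`, the labels for the alternative, and the sample figures (n = 36, sample mean 52, hypothesized mean 50, standard deviation 6) are all hypothetical and not taken from the notes. It assumes Python 3.8 or later for `statistics.NormalDist`.

```python
from math import sqrt
from statistics import NormalDist


def z_test(xbar, mu0, sd, n, alpha, alternative):
    """Return (z_cal, reject_h0) for H0: mu = mu0.

    alternative: "two-sided" (mu != mu0), "less" (mu < mu0) or "greater" (mu > mu0).
    sd is sigma if known, otherwise the sample standard deviation S (large n).
    """
    z_cal = (xbar - mu0) / (sd / sqrt(n))
    std_normal = NormalDist()
    if alternative == "two-sided":
        reject = abs(z_cal) > std_normal.inv_cdf(1 - alpha / 2)
    elif alternative == "less":
        reject = z_cal < -std_normal.inv_cdf(1 - alpha)
    elif alternative == "greater":
        reject = z_cal > std_normal.inv_cdf(1 - alpha)
    else:
        raise ValueError("alternative must be 'two-sided', 'less' or 'greater'")
    return z_cal, reject


# Hypothetical data: z_cal = (52 - 50) / (6 / 6) = 2.0
z_cal, reject_at_5 = z_test(52, 50, 6, 36, 0.05, "two-sided")
_, reject_at_1 = z_test(52, 50, 6, 36, 0.01, "two-sided")
print(z_cal, reject_at_5, reject_at_1)
```

With these hypothetical figures the same sample leads to rejection at the 5% level but not at the 1% level, which shows how the choice of $\alpha$ changes the critical region.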


Examples:
1. Test the hypothesis that the average content of containers of a certain lubricant is 10 liters if the
contents of a random sample of 10 containers are 10.2, 9.7, 10.1, 10.3, 10.1, 9.8, 9.9, 10.4, 10.3, and
9.8 liters. Use the 0.01 level of significance and assume that the distribution of contents is normal.
Solution:
Let $\mu$ = population mean, $\mu_0 = 10$.
Step 1: Identify the appropriate hypotheses
$H_0: \mu = 10$ vs $H_1: \mu \ne 10$
Step 2: Select the level of significance, $\alpha = 0.01$ (given)
Step 3: Select an appropriate test statistic
The t-statistic is appropriate because the population variance is not known and the sample size is small.
Step 4: Identify the critical region.
Here we have two critical regions since we have a two-tailed hypothesis.
The critical region is $|t_{cal}| > t_{0.005}(9) = 3.2498$
$\Rightarrow (-3.2498, 3.2498)$ is the acceptance region.
Step 5: Computations:
$\bar{X} = 10.06,\ S = 0.25$

$$\Rightarrow t_{cal} = \frac{\bar{X} - \mu_0}{S/\sqrt{n}} = \frac{10.06 - 10}{0.25/\sqrt{10}} = 0.76$$

Step 6: Decision
Accept H0, since $t_{cal}$ is in the acceptance region.
Step 7: Conclusion
At the 1% level of significance, we have no evidence to say that the average content of containers
of the given lubricant is different from 10 liters, based on the given sample data.
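
The computations in Step 5 can be reproduced from the raw data. The following is a minimal sketch using only the Python standard library; the critical value 3.2498 is the tabulated $t_{0.005}(9)$ quoted in the solution rather than something the code derives.

```python
from math import sqrt
from statistics import mean, stdev

contents = [10.2, 9.7, 10.1, 10.3, 10.1, 9.8, 9.9, 10.4, 10.3, 9.8]
mu0 = 10             # hypothesized mean (liters)
t_critical = 3.2498  # t_{0.005}(9), taken from the t table

n = len(contents)
xbar = mean(contents)
s = stdev(contents)  # sample standard deviation (divisor n - 1)
t_cal = (xbar - mu0) / (s / sqrt(n))

# Two-tailed test: reject H0 only if |t_cal| exceeds the critical value.
reject_h0 = abs(t_cal) > t_critical
print(f"mean = {xbar:.2f}, S = {s:.4f}, t_cal = {t_cal:.2f}, reject H0: {reject_h0}")
```

Without intermediate rounding the statistic comes out near 0.77; the worked solution obtains 0.76 because it rounds S to 0.25 first. The decision is the same either way.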
Example: The mean life time of a sample of 16 fluorescent light bulbs produced by a company is
computed to be 1570 hours. The population standard deviation is 120 hours. Suppose the hypothesized
value for the population mean is 1600 hours. Can we conclude that the life time of light bulbs is
decreasing?
(Use $\alpha = 0.05$ and assume the normality of the population) (exercise!)

Hypothesis testing about the population proportion : (one population)

Steps to Conduct a Hypothesis Test for a Population Proportion

1. Write down the null and alternative hypotheses in terms of the population proportion p. Include
appropriate units with the values of the proportion.
2. Use the form of the alternative hypothesis to determine if the test is left-tailed, right-tailed, or two-
tailed.
Example: Suppose the hypotheses for a hypothesis test are:

H0: p = 20%
Ha: p > 20%
Because the alternative hypothesis is a >, this is a right-tail test. The p-value is the area in the right tail
of the distribution.

3. Collect the sample information for the test and identify the significance level.
4. Find the p-value (the area in the corresponding tail) for the test using the appropriate distribution:
- If $n \times p \ge 5$ and $n \times (1 - p) \ge 5$, use the normal distribution with $z = \frac{\hat{p} - p}{\sqrt{p(1 - p)/n}}$, where p is the proportion stated in H0.
- If one of $n \times p < 5$ or $n \times (1 - p) < 5$, use a binomial distribution.
5. Compare the p-value to the significance level and state the outcome of the test:
- If p-value ≤ α, reject H0 in favor of Ha.
  The results of the sample data are significant. There is sufficient evidence to conclude
  that the null hypothesis H0 is an incorrect belief and that the alternative hypothesis Ha is
  most likely correct.
- If p-value > α, do not reject H0.
  The results of the sample data are not significant. There is not sufficient evidence to
  conclude that the alternative hypothesis Ha may be correct.
6. Write down a concluding
sentence specific to the context of the question.
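
The p-value approach for a proportion can be sketched as follows. The hypotheses H0: p = 20% and Ha: p > 20% are those of the example in the steps above, but the sample (52 successes in 200 trials) is hypothetical and is used only for illustration, as is the function name `proportion_z_test`. The sketch covers the normal-approximation branch only and assumes Python 3.8 or later.

```python
from math import sqrt
from statistics import NormalDist


def proportion_z_test(successes, n, p0, alternative):
    """Return (z, p_value) for H0: p = p0 using the normal approximation.

    alternative: "greater" (right-tailed), "less" (left-tailed) or "two-sided".
    """
    if n * p0 < 5 or n * (1 - p0) < 5:
        raise ValueError("normal approximation not appropriate; use the binomial")
    p_hat = successes / n
    z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
    area_left = NormalDist().cdf(z)
    if alternative == "greater":
        p_value = 1 - area_left
    elif alternative == "less":
        p_value = area_left
    elif alternative == "two-sided":
        p_value = 2 * min(area_left, 1 - area_left)
    else:
        raise ValueError("alternative must be 'greater', 'less' or 'two-sided'")
    return z, p_value


# Hypothetical sample: 52 successes out of 200, testing H0: p = 0.20 vs Ha: p > 0.20.
z, p_value = proportion_z_test(52, 200, 0.20, "greater")
alpha = 0.05
print(f"z = {z:.2f}, p-value = {p_value:.4f}, reject H0: {p_value <= alpha}")
```

For this hypothetical sample the p-value is below 0.05, so H0 would be rejected at the 5% level but not at the 1% level.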

Example 1: Joon believes that 50% of first-time brides in the United States are younger than their
grooms. She performs a hypothesis test to determine if the percentage is the same or different from 50%.
Joon samples 100 first-time brides and 53 reply that they are younger than their grooms. For the
hypothesis test, she uses a 1% level of significance. (Exercise)
Example 2: A teacher believes that 85% of students in the class will want to go on a field trip to the
local zoo. She performs a hypothesis test to determine if the percentage is the same or different from
85%. The teacher samples 50 students and 39 reply that they would want to go to the zoo. For the
hypothesis test, use a 1% level of significance. (Exercise)

Chapter Four

Chi-Square Tests

4.1 Test of Association

- Suppose we have a population consisting of observations having two attributes or qualitative
characteristics, say A and B.
- If the attributes are independent, then the probability of possessing both A and B is $P_A \times P_B$,
where $P_A$ is the probability that a member has attribute A and
$P_B$ is the probability that a member has attribute B.
- Suppose A has r mutually exclusive and exhaustive classes and
B has c mutually exclusive and exhaustive classes.
- The entire set of data can be represented using an r × c contingency table.

                             B
A        B1     B2     ...    Bj     ...    Bc     Total
A1       O11    O12    ...    O1j    ...    O1c    R1
A2       O21    O22    ...    O2j    ...    O2c    R2
...
Ai       Oi1    Oi2    ...    Oij    ...    Oic    Ri
...
Ar       Or1    Or2    ...    Orj    ...    Orc    Rr
Total    C1     C2     ...    Cj     ...    Cc     n
- The chi-square test procedure is used to test the hypothesis of independence of two attributes. For
instance, we may be interested in:
  - Whether the presence or absence of hypertension is independent of smoking habit
    or not.
  - Whether the size of the family is independent of the level of education attained by
    the mothers.
  - Whether there is association between father and son regarding boldness.
  - Whether there is association between stability of marriage and period of
    acquaintanceship prior to marriage.

- The $\chi^2$ statistic is given by:

$$\chi^2_{cal} = \sum_{i=1}^{r}\sum_{j=1}^{c}\left[\frac{(O_{ij} - e_{ij})^2}{e_{ij}}\right] \sim \chi^2_{(r-1)(c-1)}$$

where $O_{ij}$ = the number of units that belong to category i of A and j of B,
$e_{ij}$ = the expected frequency that belongs to category i of A and j of B.

- The $e_{ij}$ is given by:

$$e_{ij} = \frac{R_i \times C_j}{n}$$

where $R_i$ = the i-th row total,
$C_j$ = the j-th column total,
n = total number of observations.
Remark:

$$n = \sum_{i=1}^{r}\sum_{j=1}^{c} O_{ij} = \sum_{i=1}^{r}\sum_{j=1}^{c} e_{ij}$$

- The null and alternative hypotheses may be stated as:

$H_0$: There is no association between A and B.
$H_1$: not $H_0$ (There is association between A and B).
Decision Rule:

- Reject H0 (independence) at the $\alpha$ level of significance if the calculated value of $\chi^2$ exceeds the
tabulated value with degrees of freedom equal to $(r - 1)(c - 1)$:

$$\Rightarrow \text{Reject } H_0 \text{ if } \chi^2_{cal} = \sum_{i=1}^{r}\sum_{j=1}^{c}\left[\frac{(O_{ij} - e_{ij})^2}{e_{ij}}\right] > \chi^2_{(r-1)(c-1)} \text{ at } \alpha$$
Examples:
1. A geneticist took a random sample of 300 men to study whether there is association between father
and son regarding boldness. He obtained the following results.

              Son
Father    Bold    Not
Bold      85      59
Not       65      91

Using $\alpha = 5\%$, test whether there is association between father and son regarding boldness.
Solution:
$H_0$: There is no association between father and son regarding boldness.
$H_1$: not $H_0$

- First calculate the row and column totals:

$R_1 = 144,\ R_2 = 156,\ C_1 = 150,\ C_2 = 150$
- Then calculate the expected frequencies ($e_{ij}$'s), using $e_{ij} = \frac{R_i \times C_j}{n}$:

$$e_{11} = \frac{R_1 \times C_1}{n} = \frac{144 \times 150}{300} = 72$$

$$e_{12} = \frac{R_1 \times C_2}{n} = \frac{144 \times 150}{300} = 72$$

$$e_{21} = \frac{R_2 \times C_1}{n} = \frac{156 \times 150}{300} = 78$$

$$e_{22} = \frac{R_2 \times C_2}{n} = \frac{156 \times 150}{300} = 78$$
- Obtain the calculated value of the chi-square.

$$\chi^2_{cal} = \sum_{i=1}^{2}\sum_{j=1}^{2}\left[\frac{(O_{ij} - e_{ij})^2}{e_{ij}}\right] = \frac{(85 - 72)^2}{72} + \frac{(59 - 72)^2}{72} + \frac{(65 - 78)^2}{78} + \frac{(91 - 78)^2}{78} = 9.028$$

- Obtain the tabulated value of chi-square.

$\alpha = 0.05$
Degrees of freedom = $(r - 1)(c - 1) = 1 \times 1 = 1$
$\chi^2_{0.05}(1) = 3.841$ from the table.

- The decision is to reject H0 since $\chi^2_{cal} > \chi^2_{0.05}(1)$.

Conclusion: At the 5% level of significance we have evidence to say there is association between father
and son regarding boldness, based on this sample data.
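
The expected frequencies and the calculated chi-square in this example can be verified with a short routine. This is a minimal sketch for an r × c table given as a list of rows; the function name `chi_square_independence` is an illustrative choice, and the critical value 3.841 is the tabulated $\chi^2_{0.05}(1)$ quoted above rather than something the code computes.

```python
def chi_square_independence(observed):
    """Return (chi2, degrees_of_freedom, expected) for an r x c table of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    # e_ij = R_i * C_j / n
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    chi2 = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, dof, expected


# Father (rows) by son (columns) table from the example.
observed = [[85, 59], [65, 91]]
chi2, dof, expected = chi_square_independence(observed)
critical = 3.841  # chi-square table value at alpha = 0.05 with 1 degree of freedom

print(f"expected = {expected}")
print(f"chi2 = {chi2:.3f}, df = {dof}, reject H0: {chi2 > critical}")
```

The same function accepts larger tables, such as the 2 × 3 table in the next exercise, provided the matching critical value is looked up for its degrees of freedom.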
2. A random sample of 200 men, all retired, was classified according to education and number of
children, as shown below.

Education level          Number of children
                         0-1     2-3     Over 3
Elementary               14      37      32
Secondary and above      31      59      27

Test the hypothesis that the size of the family is independent of the level of education attained by
fathers. (Use 5% level of significance) (exercise)
