
Analytics for Managerial Decision Making

IBM 322

Sumit Kumar Yadav

Department of Management Studies

Tuesday 24th September, 2024


Sampling - Motivation

2 / 31
Sampling

❑ Suppose we are interested in knowing the average time an IIT Roorkee B.Tech student takes to solve an Easy-level Sudoku.
❑ I randomly pick 30 of you and get the average.
❑ Is it equal to the true average?
❑ Is it close to the true average?
❑ If we take the data of 300 students instead, will it be closer to the true average?
❑ Is there a guarantee?
❑ Can we make some probabilistic remarks?

4 / 31
Basic Ideas of Sampling

1. Population (sometimes it is not even observable, only abstract)
2. Sampling frame (if you are lucky you might get this; it is not guaranteed in most practical situations)
3. Subject
4. Parameter (a constant, though it might be unknown)
5. Statistic (a random variable)
6. A statistic ceases to be a random variable once it is observed

5 / 31
Types of Sampling

❑ Simple random sample with replacement
❑ Simple random sample without replacement
❑ Cluster Sampling
❑ Stratified Sampling

6 / 31
Potential Causes of Bias

❑ Convenience Sampling
❑ Volunteer Sampling
❑ Systematic Sampling
❑ Non-response Bias
❑ Response Bias
Does that mean we shouldn’t use any of these types of sampling?
No, one can use them, but with caution. Make sure the method does not lead to a systematic error.

7 / 31
Sampling Using Python in Presence of Sampling Frame

One idea: generate a random number for each observation in the frame, sort the observations by that number, and take the first n as the sample.
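
The random-number-and-sort idea can be sketched in Python as follows. This is a minimal illustration, assuming the sampling frame is available as a plain list; the function name `srs_by_random_keys` and the roll-number frame are hypothetical and not part of the course material.

```python
import random


def srs_by_random_keys(frame, n, seed=None):
    """Draw a simple random sample without replacement from a sampling frame.

    Each unit gets an independent Uniform(0, 1) key. Sorting by the key
    puts the frame in random order, and the first n units form the sample.
    """
    if not 0 <= n <= len(frame):
        raise ValueError("n must be between 0 and the frame size")
    rng = random.Random(seed)
    keyed = [(rng.random(), unit) for unit in frame]
    keyed.sort(key=lambda pair: pair[0])
    return [unit for _, unit in keyed[:n]]


# Hypothetical sampling frame: roll numbers 1..200
frame = list(range(1, 201))
sample = srs_by_random_keys(frame, n=30, seed=42)
print(sample)
```

In everyday use, `random.sample(frame, n)` from the standard library draws the same kind of sample in one call; the explicit version above only makes the mechanism visible.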

8 / 31
What after sampling?

❑ Ask: why did we sample? The objective is to learn about the population
❑ Statistical inference: learn about parameters from sample statistics
❑ Usually, the quantities of interest are the population mean and the population proportion (depending on the context)
❑ We deal with them separately

9 / 31
Errors in the Process of Estimation

❑ Sampling error: because we consider only a subset of the population, the point estimate is rarely exactly correct. This error is unavoidable, but we can estimate it and hence have some control over it
❑ Non-sampling error: arises if there is bias in the observations or the sampling wasn’t done properly. It can’t be dealt with mathematically and should be avoided

10 / 31
Estimating Population Mean from Sample Mean

1. Let the true values in the population be $A_1, A_2, A_3, \dots, A_N$
2. Population mean is denoted by $\mu$ and equals $\frac{1}{N}\sum_{i=1}^{N} A_i$
3. Also, population variance is denoted by $\sigma^2$ and equals $\frac{1}{N}\sum_{i=1}^{N} (A_i - \mu)^2$
4. Let the sample be an SRS of size n
5. Observations are $X_1, X_2, \dots, X_n$
6. Sample mean is denoted by $\bar{X}$ and defined as $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$
7. $E(\bar{X}) = \mu$
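
The claim $E(\bar{X}) = \mu$ can be checked by brute force on a tiny population. The sketch below assumes sampling without replacement and uses a made-up population of six values; it enumerates every possible sample of size 3 and averages the sample means.

```python
from itertools import combinations
from statistics import mean

# Hypothetical population of N = 6 solving times (minutes)
population = [4.0, 5.5, 6.0, 7.5, 9.0, 10.0]
n = 3

mu = mean(population)

# Every SRS without replacement of size n is equally likely, so
# E(X-bar) is the plain average of X-bar over all possible samples.
sample_means = [mean(s) for s in combinations(population, n)]
expected_sample_mean = mean(sample_means)

print(f"number of possible samples = {len(sample_means)}")
print(f"mu = {mu:.4f}, E(X-bar) = {expected_sample_mean:.4f}")
```

Individual sample means differ from $\mu$, but their average over all possible samples matches it, which is what unbiasedness means.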

11 / 31
Overview of the Black Friday Dataset

❑ Source: Kaggle
❑ Purpose: Analyze consumer behavior on Black Friday
❑ Features:
  ❑ User demographics
  ❑ Product information
  ❑ Purchase details
❑ Our Interest - 1: From a small sample of customers, make a
reasonable guess for average purchase amount
❑ Our Interest - 2: From a small sample of customers, make a
reasonable guess for gender ratio of the customers

12 / 31
Estimating Population Variance from Sample Observations

❑ If the sampling scheme is WITH REPLACEMENT: $s_X^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1}$
❑ If the sampling scheme is WITHOUT REPLACEMENT: $s_{X,\mathrm{WOR}}^2 = \left(\frac{N-1}{N}\right) \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1}$
Why is estimating population variance important?
To get an idea of the error made when the sample mean is used to estimate the population mean
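
As a rough check of the two estimators, the sketch below enumerates all possible samples from a small made-up population and averages each estimator. The helper names `s2_wr` and `s2_wor` are hypothetical; the population variance uses divisor N, as in the definition of $\sigma^2$ earlier.

```python
from itertools import combinations, product
from statistics import mean, pvariance, variance


def s2_wr(sample):
    """Estimator for sampling with replacement: divisor n - 1."""
    return variance(sample)


def s2_wor(sample, N):
    """Estimator for sampling without replacement: (N - 1) / N times s^2."""
    return (N - 1) / N * variance(sample)


# Hypothetical population
population = [4.0, 5.5, 6.0, 7.5, 9.0, 10.0]
N, n = len(population), 3
sigma2 = pvariance(population)  # population variance, divisor N

# Average each estimator over all equally likely samples of its scheme.
avg_wr = mean([s2_wr(s) for s in product(population, repeat=n)])
avg_wor = mean([s2_wor(s, N) for s in combinations(population, n)])

print(f"sigma^2 = {sigma2:.4f}")
print(f"average estimate, with replacement    = {avg_wr:.4f}")
print(f"average estimate, without replacement = {avg_wor:.4f}")
```

Both averages agree with $\sigma^2$, which illustrates why the extra factor $\frac{N-1}{N}$ appears in the without-replacement case.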

13 / 31
Standard Error in Sample Mean

❑ If the sampling scheme is WITH REPLACEMENT: $\mathrm{Var}(\bar{X}) = \frac{\sigma^2}{n}$
❑ If the sampling scheme is WITHOUT REPLACEMENT: $\mathrm{Var}(\bar{X}) = \left(\frac{N-n}{N-1}\right) \frac{\sigma^2}{n}$
❑ $\frac{N-n}{N-1}$ is called the finite population correction
❑ Typically, it can be ignored if the sampling fraction $\frac{n}{N} \le 0.05$
❑ The standard deviation of $\bar{X}$ is called the standard error of the sample mean
❑ Do we know $\sigma^2$? What is the remedy?
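
One natural remedy, in line with the variance estimators introduced earlier, is to plug the estimated variance into these formulas. The sketch below does that; the function name `standard_error_of_mean` and the sample values are hypothetical, and the plug-in step is an assumption of this example rather than something stated on the slide.

```python
import math
from statistics import variance


def standard_error_of_mean(sample, N=None):
    """Estimated standard error of the sample mean.

    With N omitted, the scheme is treated as with replacement: sqrt(s^2 / n).
    With N given, sigma^2 is estimated by (N - 1) / N * s^2 and the finite
    population correction (N - n) / (N - 1) is applied.
    """
    n = len(sample)
    s2 = variance(sample)  # divisor n - 1
    if N is None:
        return math.sqrt(s2 / n)
    sigma2_hat = (N - 1) / N * s2
    fpc = (N - n) / (N - 1)
    return math.sqrt(fpc * sigma2_hat / n)


# Hypothetical sample of five exercise times (minutes)
sample = [18.0, 22.0, 19.0, 25.0, 21.0]
se_wr = standard_error_of_mean(sample)
se_wor = standard_error_of_mean(sample, N=50)
print(f"with replacement: {se_wr:.4f}, without replacement (N=50): {se_wor:.4f}")
```

When N is very large relative to n, the two values are nearly equal, which is the sense in which the correction can be ignored for small sampling fractions.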

14 / 31
Central Limit Theorem

❑ Can we do better in our inference from the sample mean?
❑ As the sample size increases, can we say something more than just the expectation and variance of the sample mean?
❑ Thank you, Central Limit Theorem
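
A small simulation can make the Central Limit Theorem concrete. The sketch below is only an illustration under assumed settings: the population is taken to be Exponential(1), a skewed distribution with $\mu = 1$ and $\sigma = 1$, and the sample sizes, repetition count, and seed are arbitrary choices.

```python
import math
import random
from statistics import NormalDist


def standardized_means(n, reps, rng):
    """Simulate reps standardized sample means from an Exponential(1) population.

    Exponential(1) has mu = 1 and sigma = 1, so Z = (X-bar - 1) * sqrt(n).
    """
    zs = []
    for _ in range(reps):
        xbar = sum(rng.expovariate(1.0) for _ in range(n)) / n
        zs.append((xbar - 1.0) * math.sqrt(n))
    return zs


rng = random.Random(2024)
std_normal = NormalDist()
results = {}
for n in (1, 5, 50):
    zs = standardized_means(n, reps=5000, rng=rng)
    below_zero = sum(z <= 0 for z in zs) / len(zs)
    within_two = sum(-2 <= z <= 2 for z in zs) / len(zs)
    results[n] = (below_zero, within_two)
    print(f"n={n:>2}: P(Z<=0) ~ {below_zero:.3f}, P(|Z|<=2) ~ {within_two:.3f}")

print(f"normal: P(Z<=0) = {std_normal.cdf(0):.3f}, "
      f"P(|Z|<=2) = {std_normal.cdf(2) - std_normal.cdf(-2):.3f}")
```

As n grows, the simulated proportions should move toward the standard normal values printed on the last line, even though the population itself is far from normal.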

15 / 31
Making Intervals

16 / 31
Confidence Interval

Back to Sample Mean $\bar{X}$

1. $\bar{X}$ is a random variable
2. Under certain conditions (large sample size, etc.), we use the CLT to get a better idea about $\bar{X}$
3. Using properties of the normal distribution, what can be said about the following probability?
4. $P\left(\mu - \frac{2\sigma}{\sqrt{n}} \le \bar{X} \le \mu + \frac{2\sigma}{\sqrt{n}}\right)$
5. Is it not approximately 0.9544? Why approximately? Because the CLT is an approximate result.
6. Thus, $P\left(\mu - \frac{2\sigma}{\sqrt{n}} \le \bar{X} \le \mu + \frac{2\sigma}{\sqrt{n}}\right) \approx 0.9544$
7. Or, by rearrangement of terms, $P\left(\bar{X} - \frac{2\sigma}{\sqrt{n}} \le \mu \le \bar{X} + \frac{2\sigma}{\sqrt{n}}\right) \approx 0.9544$
8. Magic here: we have created an interval for $\mu$
9. This is nothing but the confidence interval
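
The normal probability used in this argument, and the resulting interval, can be computed with the standard library. The numbers passed to the hypothetical helper `two_se_interval` are made up for illustration, and $\sigma$ is assumed known.

```python
import math
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

# P(-2 <= Z <= 2) for a standard normal Z
prob_within_2 = std_normal.cdf(2) - std_normal.cdf(-2)
print(f"P(|Z| <= 2) = {prob_within_2:.4f}")


def two_se_interval(xbar, sigma, n):
    """Interval X-bar +/- 2 * sigma / sqrt(n), assuming sigma is known."""
    half_width = 2 * sigma / math.sqrt(n)
    return xbar - half_width, xbar + half_width


# Hypothetical numbers: observed mean 20, known sigma 5, sample size 100
print(two_se_interval(xbar=20.0, sigma=5.0, n=100))
```

The computed probability is 0.9545 to four decimals; the 0.9544 quoted above is the value obtained from a four-decimal normal table (2 × 0.4772).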
17 / 31
Confidence Interval Discussions

❑ Can you do similar calculations and make a confidence interval for the population proportion? (Hint: use the CLT and our remark that the sample proportion can be given a similar treatment as the sample mean)
❑ Khan Academy Video
https://www.youtube.com/watch?v=bGALoCckICI
❑ Which is bigger: a 99% confidence interval or a 95% confidence interval?

18 / 31
Summary of results for 100(1-α)% C.I.

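n | $\sigma^2$ | C.I. Type | Symmetric C.I.
Large | known | Approximate | $\left(\bar{X} - \frac{z_{\alpha/2}\,\sigma}{\sqrt{n}},\ \bar{X} + \frac{z_{\alpha/2}\,\sigma}{\sqrt{n}}\right)$
Large | unknown | Approximate | $\left(\bar{X} - \frac{z_{\alpha/2}\,s}{\sqrt{n}},\ \bar{X} + \frac{z_{\alpha/2}\,s}{\sqrt{n}}\right)$

Table: C.I. for population mean $\mu$, s is sample standard deviation

n | C.I. Type | Symmetric C.I.
Large | Approximate | $\left(\hat{p} - z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n-1}},\ \hat{p} + z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n-1}}\right)$

Table: C.I. for population proportion p, $\hat{p}$ is sample proportion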
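
The large-sample formulas in these tables translate directly into code. The following is a minimal sketch using only the standard library; the function names `z_critical`, `mean_ci`, and `proportion_ci` and the example data are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev


def z_critical(confidence):
    """Return z_{alpha/2} for a 100(1 - alpha)% interval."""
    alpha = 1 - confidence
    return NormalDist().inv_cdf(1 - alpha / 2)


def mean_ci(sample, confidence=0.95, sigma=None):
    """Large-sample C.I. for the mean; uses s when sigma is not supplied."""
    n = len(sample)
    spread = sigma if sigma is not None else stdev(sample)
    half_width = z_critical(confidence) * spread / math.sqrt(n)
    xbar = mean(sample)
    return xbar - half_width, xbar + half_width


def proportion_ci(successes, n, confidence=0.95):
    """Large-sample C.I. for a proportion, with n - 1 in the denominator."""
    p_hat = successes / n
    half_width = z_critical(confidence) * math.sqrt(p_hat * (1 - p_hat) / (n - 1))
    return p_hat - half_width, p_hat + half_width


# Hypothetical data: 49 observations cycling through 17..23, and 120 YES out of 400
sample = [20 + (i % 7) - 3 for i in range(49)]
mean_lo, mean_hi = mean_ci(sample)
prop_lo, prop_hi = proportion_ci(120, 400)
print(f"95% C.I. for mean: ({mean_lo:.3f}, {mean_hi:.3f})")
print(f"95% C.I. for proportion: ({prop_lo:.4f}, {prop_hi:.4f})")
```

Passing a larger `confidence` value gives a larger $z_{\alpha/2}$ and therefore a wider interval for the same data.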

19 / 31
Sample Size Determination

❑ A survey asked 500 randomly selected students about the average time they spend on physical exercise daily. The sample mean was 20 minutes, and the sample standard deviation was 5 minutes. Construct a 95% confidence interval for the population mean of time spent on physical exercise daily.
❑ We want to repeat this study. How many students should we survey so that the width of the 99% confidence interval is no more than 2 minutes?
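
One possible way to work through this exercise is sketched below, using the large-sample formula $\bar{X} \pm z_{\alpha/2}\, s/\sqrt{n}$. It assumes that the earlier sample standard deviation of 5 minutes is a reasonable planning value for the repeated study; that assumption is not stated in the problem.

```python
import math
from statistics import NormalDist

n, xbar, s = 500, 20.0, 5.0

# Part 1: 95% confidence interval for the population mean
z95 = NormalDist().inv_cdf(0.975)
half_width_95 = z95 * s / math.sqrt(n)
ci_95 = (xbar - half_width_95, xbar + half_width_95)
print(f"95% C.I.: ({ci_95[0]:.2f}, {ci_95[1]:.2f})")

# Part 2: smallest n so that the 99% interval has total width <= 2 minutes,
# i.e. half-width z * s / sqrt(n) <= 1, so n >= (z * s / 1)^2
z99 = NormalDist().inv_cdf(0.995)
max_half_width = 1.0
n_required = math.ceil((z99 * s / max_half_width) ** 2)
print(f"required sample size: {n_required}")
```

Under these assumptions the interval is roughly (19.56, 20.44) minutes and the required sample size is 166 students.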

20 / 31
Confidence Interval

1. A confidence interval is typically computed by adding a multiple of the standard error to the point estimate and subtracting it from the point estimate
2. Point estimate +/- some multiple of the standard error
3. What is confidence level?

21 / 31
Confidence Interval Idea

1. A confidence interval for any parameter of the population is a random interval that contains the parameter with some probability
2. If sampling is done repeatedly and a confidence interval is constructed each time, about 95% of the 95% confidence intervals will contain the true parameter value
3. Typically, the significance level, denoted by $\alpha$, is used
4. What is confidence level?
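
The repeated-sampling interpretation can be illustrated by simulation. The sketch below assumes a normal population with made-up parameters and uses the large-sample interval $\bar{X} \pm z_{\alpha/2}\, s/\sqrt{n}$; the function name `coverage` is hypothetical.

```python
import math
import random
from statistics import NormalDist, mean, stdev


def coverage(mu, sigma, n, reps, confidence=0.95, seed=0):
    """Fraction of simulated confidence intervals that contain the true mean."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    hits = 0
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        xbar = mean(sample)
        half_width = z * stdev(sample) / math.sqrt(n)
        if xbar - half_width <= mu <= xbar + half_width:
            hits += 1
    return hits / reps


# Hypothetical population: mean 20, standard deviation 5
rate = coverage(mu=20.0, sigma=5.0, n=100, reps=2000)
print(f"share of intervals containing mu: {rate:.3f}")
```

Each individual interval either contains $\mu$ or does not; the 95% describes how often the procedure succeeds over many repetitions.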

22 / 31
Confidence Interval Idea

1. 95% is the confidence level of the interval generated
2. We pick a sample and construct the interval. Can we say that the probability that this interval contains the true value is 0.95?
3. There are different schools of thought; most don’t agree with the above statement
4. But everyone agrees that the confidence is in the procedure used to construct the confidence interval

23 / 31
Sample Proportion

❑ Sometimes, one is interested in estimating a population proportion
❑ What is the proportion of B.Tech participants who like statistics?
❑ One can attempt to answer this using sampling

24 / 31
Sample Proportion

❑ Can we make use of results from sample mean?
❑ If the i-th respondent says YES, model it as $X_i = 1$
❑ If the i-th respondent says NO, model it as $X_i = 0$
❑ Denote by $n_{YES}$ and $n_{NO}$ the numbers of YES and NO responses in the sample of size n
❑ Denote by $N_{YES}$ and $N_{NO}$ the corresponding actual values in the population of size N

25 / 31
Sampling Proportion

❑ We denote the estimate by $\hat{p}$
❑ The population proportion is denoted by p
❑ $\hat{p} = \frac{n_{YES}}{n}$
❑ $E(\hat{p}) = p$. Do we need to prove this?
❑ $\mathrm{Var}(\hat{p}) = \frac{p(1-p)}{n}$. Why?
❑ Is p known?
❑ State CLT for sample proportion
❑ Additional conditions: $np \ge 10$ and $n(1-p) \ge 10$

26 / 31
Easier way to check unbiasedness of sample proportion

❑ We denote the estimate by $\hat{p}$
❑ The population proportion is denoted by p
❑ $\hat{p} = \frac{n_{YES}}{n}$
❑ What kind of random variable is $n_{YES}$?
❑ $n_{YES}$ is a Binomial random variable with parameters n and p
❑ Hence, $E(\hat{p}) = E\left(\frac{n_{YES}}{n}\right) = \frac{E(n_{YES})}{n} = p$
❑ Also, $\mathrm{Var}(\hat{p}) = \mathrm{Var}\left(\frac{n_{YES}}{n}\right) = \frac{\mathrm{Var}(n_{YES})}{n^2} = \frac{p(1-p)}{n}$
❑ But we don’t know p
❑ We therefore estimate $\mathrm{Var}(\hat{p})$ by $\frac{\hat{p}(1-\hat{p})}{n-1}$. Why $n-1$?
❑ To provide an unbiased estimator of $\mathrm{Var}(\hat{p})$
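
The unbiasedness claim can be verified exactly for a particular n and p by summing over the Binomial distribution of $n_{YES}$. The values n = 25 and p = 0.3 and the helper names below are arbitrary choices for illustration.

```python
from math import comb


def binomial_pmf(k, n, p):
    """P(n_YES = k) when n_YES ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def expected_variance_estimate(n, p):
    """E[ p_hat (1 - p_hat) / (n - 1) ] when n_YES ~ Binomial(n, p)."""
    total = 0.0
    for k in range(n + 1):
        p_hat = k / n
        total += binomial_pmf(k, n, p) * p_hat * (1 - p_hat) / (n - 1)
    return total


n, p = 25, 0.3
true_var = p * (1 - p) / n
unbiased = expected_variance_estimate(n, p)
naive = unbiased * (n - 1) / n  # expectation of p_hat (1 - p_hat) / n

print(f"Var(p_hat)                   = {true_var:.6f}")
print(f"E[p_hat(1 - p_hat)/(n - 1)]  = {unbiased:.6f}")
print(f"E[p_hat(1 - p_hat)/n]        = {naive:.6f}")
```

Dividing by n instead of n − 1 underestimates the variance on average, which is why the n − 1 version is used.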
27 / 31
Application of distributions that we saw earlier: t, chi-square

❑ Is knowing the population distribution itself useful information?
❑ If yes, how do we make use of it?
❑ Let us concern ourselves with the sample mean
❑ Assume you have the information that the population distribution is normal.
❑ How do you use this?
❑ Is the CLT required?

28 / 31
Case of normal population

❑ Sample mean is denoted by $\bar{X}$ and defined as $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$
❑ If the sampling scheme is WITH REPLACEMENT, the sample variance equals $s_X^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1}$
❑ It can be shown that:
  1. $\bar{X}$ and $s_X^2$ are independent
  2. $\frac{(n-1)\,s_X^2}{\sigma^2} \sim \chi^2_{n-1}$
❑ Can you guess the distribution of $\frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$?
❑ If $\sigma$ is unknown, we replace it by $s_X$, but then the distribution ceases to be N(0, 1). So what is it then?
❑ Distribution of $\frac{\bar{X} - \mu}{s_X/\sqrt{n}} \sim t_{n-1}$
❑ Hence, even for small sample size, we can make confidence
intervals if we know that the population is normally distributed
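
To see why the t distribution matters for small samples, the simulation below compares two multipliers for n = 5 drawn from a normal population: the normal value 1.960 and the t-table value $t_{4,\,0.025} = 2.776$. The population parameters, repetition count, and seed are arbitrary, and the function name `coverage` is hypothetical.

```python
import math
import random
from statistics import mean, stdev

Z_975 = 1.960   # z_{0.025} from the standard normal table
T4_975 = 2.776  # t_{4, 0.025} from the t table (n - 1 = 4 degrees of freedom)


def coverage(multiplier, n=5, reps=10000, mu=10.0, sigma=2.0, seed=1):
    """Share of intervals X-bar +/- multiplier * s / sqrt(n) that contain mu."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        half_width = multiplier * stdev(sample) / math.sqrt(n)
        if abs(mean(sample) - mu) <= half_width:
            hits += 1
    return hits / reps


cov_z = coverage(Z_975)
cov_t = coverage(T4_975)
print(f"coverage with z multiplier: {cov_z:.3f}")
print(f"coverage with t multiplier: {cov_t:.3f}")
```

With the z multiplier the intervals are too narrow and cover $\mu$ noticeably less often than 95% of the time, while the t multiplier should bring the coverage back to about 95%.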

29 / 31
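Moral of the story - results for 100(1-α)% C.I. for Mean

Population distribution | n | $\sigma^2$ | C.I. Type | Symmetric C.I.
Any | Large | known | Approximate | $\left(\bar{X} - \frac{z_{\alpha/2}\,\sigma}{\sqrt{n}},\ \bar{X} + \frac{z_{\alpha/2}\,\sigma}{\sqrt{n}}\right)$
Any | Large | unknown | Approximate | $\left(\bar{X} - \frac{z_{\alpha/2}\,s}{\sqrt{n}},\ \bar{X} + \frac{z_{\alpha/2}\,s}{\sqrt{n}}\right)$
Normal | Any | known | Exact | $\left(\bar{X} - \frac{z_{\alpha/2}\,\sigma}{\sqrt{n}},\ \bar{X} + \frac{z_{\alpha/2}\,\sigma}{\sqrt{n}}\right)$
Normal | Any | unknown | Exact | $\left(\bar{X} - \frac{t_{n-1,\alpha/2}\,s}{\sqrt{n}},\ \bar{X} + \frac{t_{n-1,\alpha/2}\,s}{\sqrt{n}}\right)$

Table: C.I. for population mean $\mu$, s is sample standard deviation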

Typically, $t_{n-1,\alpha/2}$ is used only for small n, because for large n, $z_{\alpha/2}$ gives nearly the same value.
30 / 31
Moral of the story - results for 100(1-α)% C.I. for Proportion

n | C.I. Type | Symmetric C.I.
Large | Approximate | $\left(\hat{p} - z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n-1}},\ \hat{p} + z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n-1}}\right)$

Table: C.I. for population proportion p, $\hat{p}$ is sample proportion

For the case of a proportion, it is advised to use these formulae only when, apart from n being large, $n\hat{p} \ge 10$ and also $n(1-\hat{p}) \ge 10$

31 / 31
