Analytics for Managerial Decision Making
IBM 322
Sumit Kumar Yadav
Department of Management Studies
Tuesday 24th September, 2024
Sampling - Motivation
Sampling
❑ Suppose we are interested in knowing the average time an IIT
Roorkee B.Tech student takes to solve an easy-level Sudoku.
❑ I randomly pick 30 of you and compute the average.
❑ Is it equal to the true average?
❑ Is it close to the true average?
❑ If we take the data of 300 students instead, will it be closer to the
true average?
❑ Is there a guarantee?
❑ Can we make some probabilistic remarks?
Basic Ideas of Sampling
1. Population (Sometimes, it is not even observable and only
abstract)
2. Sampling Frame (if you are lucky, you might get this, not
guaranteed in most practical situations)
3. Subject
4. Parameter (Constant - might be unknown)
5. Statistic (Random Variable)
6. Statistic ceases to be a random variable after it is observed
Types of Sampling
❑ Simple Random Sample with replacement
❑ Simple Random Sample without replacement
❑ Cluster Sampling
❑ Stratified Sampling
Potential Causes of Bias
❑ Convenience Sampling
❑ Volunteer Sampling
❑ Systematic Sampling
❑ Non-response Bias
❑ Response Bias
Does that mean we shouldn’t use any of these types of sampling??
No. One can use them, but with caution: make sure the method does not
lead to a systematic error
Sampling Using Python in Presence of Sampling Frame
One Idea - Generate a random number for each observation and then
sort the observations
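The random-number-and-sort idea can be sketched as follows. This is a minimal illustration, not a prescribed implementation: the function name `sample_by_random_keys` and the frame of enrolment numbers 1 to 500 are hypothetical, chosen only for the example.

```python
import random

def sample_by_random_keys(frame, n, seed=None):
    """Draw a simple random sample without replacement of size n.

    Every unit in the sampling frame gets an independent Uniform(0, 1) key.
    Sorting by key and keeping the first n units gives each subset of
    size n the same chance of being selected.
    """
    rng = random.Random(seed)
    keyed = [(rng.random(), unit) for unit in frame]
    keyed.sort(key=lambda pair: pair[0])
    return [unit for _, unit in keyed[:n]]

# Hypothetical sampling frame: enrolment numbers 1..500
frame = list(range(1, 501))
sample = sample_by_random_keys(frame, n=30, seed=42)
print(sample)
```

The standard library call `random.sample(frame, 30)` also draws a simple random sample without replacement; the explicit version above only makes the idea visible.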
What after sampling?
❑ Ask, why did we do sampling? Objective is to learn about the
population
❑ Statistical Inference - Learn about parameters from sample
statistic
❑ Usually, the quantities of interest are mean and proportion in the
population (depending on the context)
❑ We deal with them separately
Errors in the Process of Estimation
❑ Sampling Error - Because we are only considering a subset of
population, the point estimate is rarely exactly correct.
Unavoidable error, but we can estimate the error and hence have
some control over it
❑ Non-sampling Error - If there is bias in the observations, or
sampling wasn’t done properly. Can’t be dealt with
mathematically. Should be avoided
Estimating Population Mean from Sample Mean
1. Let the true values in the population be $A_1, A_2, A_3, \dots, A_N$
2. Population mean is denoted by $\mu$ and equals $\mu = \frac{\sum_{i=1}^{N} A_i}{N}$
3. Also, population variance is denoted by $\sigma^2$ and equals
$\sigma^2 = \frac{\sum_{i=1}^{N} (A_i - \mu)^2}{N}$
4. Let the sample be an SRS of size $n$
5. Observations are $X_1, X_2, \dots, X_n$
6. Sample mean is denoted by $\bar{X}$ and defined as follows
$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$
7. $E(\bar{X}) = \mu$
Overview of the Black Friday Dataset
❑ Source: Kaggle
❑ Purpose: Analyze consumer behavior on Black Friday
❑ Features:
❑ User demographics
❑ Product information
❑ Purchase details
❑ Our Interest - 1: From a small sample of customers, make a
reasonable guess for average purchase amount
❑ Our Interest - 2: From a small sample of customers, make a
reasonable guess for gender ratio of the customers
Estimating Population Variance from Sample Observations
❑ If the sampling scheme is WITH REPLACEMENT
$s_X^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1}$
❑ If the sampling scheme is WITHOUT REPLACEMENT
$s_{X,WOR}^2 = \frac{N-1}{N} \cdot \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1}$
Why is estimating population variance important?
To get an idea of the error made when the sample mean is used to estimate the population mean
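One way to convince yourself that both variance estimators are unbiased is to enumerate every possible sample from a tiny population and average the estimates. The sketch below does this; the five population values are hypothetical, and $\sigma^2$ is the population variance with divisor $N$, as defined earlier.

```python
from itertools import combinations, product
from statistics import mean, pvariance, variance

population = [2, 4, 4, 7, 13]   # hypothetical values A_1..A_N
N, n = len(population), 3
sigma2 = pvariance(population)  # population variance, divisor N

# WITH replacement: all N**n ordered samples are equally likely.
# variance() is the usual sample variance with divisor n - 1.
wr_avg = mean(variance(s) for s in product(population, repeat=n))

# WITHOUT replacement: all C(N, n) subsets are equally likely.
# The estimator carries the extra factor (N - 1) / N.
wor_avg = mean((N - 1) / N * variance(s) for s in combinations(population, n))

print(sigma2, wr_avg, wor_avg)
```

All three printed numbers should agree up to floating-point rounding, which is what unbiasedness means here.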
Standard Error in Sample Mean
❑ If the sampling scheme is WITH REPLACEMENT
$\mathrm{Var}(\bar{X}) = \frac{\sigma^2}{n}$
❑ If the sampling scheme is WITHOUT REPLACEMENT
$\mathrm{Var}(\bar{X}) = \frac{N-n}{N-1} \cdot \frac{\sigma^2}{n}$
❑ $\frac{N-n}{N-1}$ is called the finite population correction
❑ Typically, it can be ignored if the sampling fraction $\frac{n}{N} \le 0.05$
❑ Standard deviation of $\bar{X}$ is called the Standard error of the sample
mean
❑ Do we know $\sigma^2$? What is the remedy?
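The two variance formulas can be checked by brute force on a tiny population: list every possible sample, compute its mean, and take the variance of those means. This is a sketch under assumed data; the population values are hypothetical.

```python
from itertools import combinations, product
from statistics import mean, pvariance

population = [2, 4, 4, 7, 13]   # hypothetical values
N, n = len(population), 3
mu, sigma2 = mean(population), pvariance(population)

def mean_and_variance(values):
    """Mean and variance of equally likely outcomes."""
    values = list(values)
    return mean(values), pvariance(values)

# Sampling distribution of the sample mean under each scheme
wr_mean, wr_var = mean_and_variance(
    mean(s) for s in product(population, repeat=n))
wor_mean, wor_var = mean_and_variance(
    mean(s) for s in combinations(population, n))

fpc = (N - n) / (N - 1)         # finite population correction
print(wr_mean, wor_mean, mu)    # both schemes give E(X-bar) = mu
print(wr_var, sigma2 / n)       # with replacement
print(wor_var, fpc * sigma2 / n)  # without replacement
```

Here the sampling fraction is 3/5, far above 0.05, so the correction halves the variance and clearly cannot be ignored.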
Central Limit Theorem
❑ Can we do better about our inference from sample mean??
❑ Maybe as the sample size increases, can we say something more
than just expectation and variance of sample mean?
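❑ Thank you, Central Limit Theorem: for large $n$, $\bar{X}$ is approximately
normally distributed with mean $\mu$ and variance $\sigma^2 / n$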
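A small simulation gives a feel for what the CLT buys us. The sketch below assumes a skewed Exponential(1) population (mean 1, standard deviation 1), a choice made only for illustration, and counts how often the sample mean lands within two standard errors of $\mu$.

```python
import random
from math import sqrt

rng = random.Random(0)
n, reps = 100, 5000
mu, sigma = 1.0, 1.0   # Exponential(1) population: mean 1, s.d. 1 (assumed)

inside = 0
for _ in range(reps):
    # Sample mean of n independent draws (sampling with replacement)
    xbar = sum(rng.expovariate(1.0) for _ in range(n)) / n
    if abs(xbar - mu) <= 2 * sigma / sqrt(n):
        inside += 1

fraction = inside / reps
print(fraction)
```

Even though the population is far from normal, the printed fraction should come out close to 0.95, which is what a normal distribution predicts for a band of two standard deviations.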
Making Intervals
Confidence Interval
Back to Sample Mean $\bar{X}$
1. $\bar{X}$ is a random variable
2. Under certain conditions (large sample size, etc.), we use the CLT to
get a better idea about $\bar{X}$
3. Using properties of the normal distribution, what can be said about
$P\left(\mu - 2\frac{\sigma}{\sqrt{n}} \le \bar{X} \le \mu + 2\frac{\sigma}{\sqrt{n}}\right)$?
4. Is it not approximately 0.9544? Why approximately? Because the
CLT is an approximate result.
5. Thus, $P\left(\mu - 2\frac{\sigma}{\sqrt{n}} \le \bar{X} \le \mu + 2\frac{\sigma}{\sqrt{n}}\right) \approx 0.9544$
6. Or, by rearrangement of terms,
$P\left(\bar{X} - 2\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{X} + 2\frac{\sigma}{\sqrt{n}}\right) \approx 0.9544$
7. Magic here: we have created an interval for $\mu$
8. This is nothing but the confidence interval
Confidence Interval Discussions
❑ Can you also do similar calculations and make a confidence
interval for Population proportion? (Hint - Use CLT and our remark
that sample proportion can be given a similar treatment as sample
mean)
❑ Khan Academy Video
https://www.youtube.com/watch?v=bGALoCckICI
❑ Which is bigger - 99% confidence interval or 95% confidence
interval?
Summary of results for 100(1-α)% C.I.
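n | $\sigma^2$ | C.I. Type | Symmetric C.I.
Large | known | Approximate | $\left(\bar{X} - \frac{z_{\alpha/2}\,\sigma}{\sqrt{n}},\ \bar{X} + \frac{z_{\alpha/2}\,\sigma}{\sqrt{n}}\right)$
Large | unknown | Approximate | $\left(\bar{X} - \frac{z_{\alpha/2}\,s}{\sqrt{n}},\ \bar{X} + \frac{z_{\alpha/2}\,s}{\sqrt{n}}\right)$
Table: C.I. for population mean $\mu$, $s$ is sample standard deviation

n | C.I. Type | Symmetric C.I.
Large | Approximate | $\left(\hat{p} - z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n-1}},\ \hat{p} + z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n-1}}\right)$
Table: C.I. for population proportion $p$, $\hat{p}$ is sample proportion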
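The mean rows of the table translate into a few lines of Python. This is a sketch: the function name `z_interval_mean` and the summary numbers (225 observations, mean 100, standard deviation 15) are hypothetical. `NormalDist().inv_cdf` from the standard library supplies $z_{\alpha/2}$.

```python
from math import sqrt
from statistics import NormalDist

def z_interval_mean(xbar, sd, n, confidence=0.95):
    """Large-sample C.I. for the mean: xbar -/+ z_{alpha/2} * sd / sqrt(n).

    Pass sigma as sd when it is known, otherwise the sample s.d. s.
    """
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z * sd / sqrt(n)
    return xbar - half_width, xbar + half_width

# Hypothetical summary numbers: n = 225, sample mean 100, s = 15
low, high = z_interval_mean(xbar=100, sd=15, n=225)
low99, high99 = z_interval_mean(xbar=100, sd=15, n=225, confidence=0.99)
print((low, high))
print((low99, high99))
```

Comparing the two printed intervals also settles which of the 95% and 99% intervals is wider.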
Sample Size Determination
❑ A survey asked 500 randomly selected students about the average
time they spend on physical exercise daily. The sample mean was 20
minutes, and the sample standard deviation was 5 minutes.
Construct a 95% confidence interval for the population mean of
time spent on physical exercise daily.
❑ We want to repeat this study. How many students should we
survey so that the width of the 99% confidence interval is no more than
2 minutes?
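Both questions can be worked through numerically. The sketch below assumes the large-sample formula $\bar{X} \pm z_{\alpha/2}\, s/\sqrt{n}$ and, for the second question, assumes that the first study's $s = 5$ is a reasonable planning value for the unknown $\sigma$.

```python
from math import ceil, sqrt
from statistics import NormalDist

n, xbar, s = 500, 20.0, 5.0   # numbers from the survey

# 95% confidence interval for the population mean
z95 = NormalDist().inv_cdf(0.975)
half_width = z95 * s / sqrt(n)
ci95 = (xbar - half_width, xbar + half_width)

# Sample size so that the 99% interval's width 2 * z * s / sqrt(n)
# is at most 2 minutes (assumption: s = 5 is used in place of sigma)
z99 = NormalDist().inv_cdf(0.995)
max_width = 2.0
n_required = ceil((2 * z99 * s / max_width) ** 2)

print(ci95)
print(n_required)
```

The interval comes out at roughly 19.56 to 20.44 minutes, and the required sample size is 166 students.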
Confidence Interval
1. A confidence interval is typically computed by adding a multiple of the
standard error to the point estimate and subtracting it from the point estimate
2. Point estimate ± some multiple of the standard error
3. What is confidence level?
Confidence Interval Idea
1. Confidence Interval for any parameter of the population is a
random interval that contains the parameter with some probability
2. If sampling is done repeatedly and a 95% confidence interval is
constructed each time, about 95% of these intervals will contain the
true parameter value
3. Typically, the significance level is used, denoted by $\alpha$
4. What is confidence level?
Confidence Interval Idea
1. 95% is the confidence level of the interval generated
2. We pick a sample and construct the interval. Can we say that the
probability that this interval contains the true value is 0.95?
3. There are different schools of thought; most do not agree with the
statement above
4. But everyone agrees that the confidence is in the
procedure used to construct the confidence interval
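The "confidence is in the procedure" idea can be seen by repeating the procedure many times on a population whose mean we know. The sketch below assumes a Uniform(0, 10) population (true mean 5), samples of size 50, and the large-sample interval $\bar{X} \pm z_{\alpha/2}\, s/\sqrt{n}$. All of these choices are illustrative.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, stdev

rng = random.Random(1)
true_mean = 5.0            # mean of the assumed Uniform(0, 10) population
n, reps = 50, 2000
z = NormalDist().inv_cdf(0.975)

covered = 0
for _ in range(reps):
    sample = [rng.uniform(0, 10) for _ in range(n)]
    xbar = mean(sample)
    half_width = z * stdev(sample) / sqrt(n)
    # Each repetition produces a different (random) interval
    if xbar - half_width <= true_mean <= xbar + half_width:
        covered += 1

coverage = covered / reps
print(coverage)
```

Any single interval either contains 5 or it does not; the roughly 95% figure describes how often the procedure succeeds across repetitions.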
Sample Proportion
❑ Sometimes, one is interested in estimating population proportion
❑ What is the proportion of B.Tech participants who like statistics?
❑ One can attempt the answer to this using sampling
❑ Can we make use of results from sample mean?
❑ If the $i$-th respondent says YES, model it as $X_i = 1$
❑ If the $i$-th respondent says NO, model it as $X_i = 0$
❑ Let $n_{YES}$ and $n_{NO}$ denote the numbers of YES and NO responses in the sample of size $n$
❑ Let $N_{YES}$ and $N_{NO}$ denote the actual numbers in the population
of size $N$
Sampling Proportion
❑ We denote the estimate by $\hat{p}$
❑ The population proportion is denoted by $p$
❑ $\hat{p} = \frac{n_{YES}}{n}$
❑ $E(\hat{p}) = p$. Do we need to prove this?
❑ $\mathrm{Var}(\hat{p}) = \frac{p(1-p)}{n}$. Why?
❑ Is $p$ known?
❑ State CLT for sample proportion
❑ Additional conditions - $np \ge 10$ and $n(1-p) \ge 10$
Easier way to check unbiasedness of sample proportion
❑ We denote the estimate by $\hat{p}$
❑ The population proportion is denoted by $p$
❑ $\hat{p} = \frac{n_{YES}}{n}$
❑ What kind of random variable is $n_{YES}$?
❑ $n_{YES}$ is a Binomial random variable with parameters $n$ and $p$
❑ Hence, $E(\hat{p}) = E\left(\frac{n_{YES}}{n}\right) = \frac{E(n_{YES})}{n} = p$
❑ Also, $\mathrm{Var}(\hat{p}) = \mathrm{Var}\left(\frac{n_{YES}}{n}\right) = \frac{\mathrm{Var}(n_{YES})}{n^2} = \frac{p(1-p)}{n}$
❑ But, we don't know $p$
❑ We estimate $\mathrm{Var}(\hat{p})$ by $\frac{\hat{p}(1-\hat{p})}{n-1}$. Why?
❑ To provide an unbiased estimator of $\mathrm{Var}(\hat{p})$
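The claims on this slide can be checked exactly by summing over the Binomial probability mass function, with no simulation needed. In the sketch below the values $n = 25$ and $p = 0.3$ and the helper name `expectation` are hypothetical.

```python
from math import comb

def expectation(g, n, p):
    """E[g(K)] for K ~ Binomial(n, p), by summing g(k) * P(K = k)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) * g(k)
               for k in range(n + 1))

n, p = 25, 0.3   # hypothetical sample size and true proportion

# p-hat = K / n, where K plays the role of n_YES
e_phat = expectation(lambda k: k / n, n, p)
var_phat = expectation(lambda k: (k / n - p) ** 2, n, p)

# Expected value of the estimator p-hat (1 - p-hat) / (n - 1)
e_var_hat = expectation(lambda k: (k / n) * (1 - k / n) / (n - 1), n, p)

print(e_phat, p)
print(var_phat, p * (1 - p) / n)
print(e_var_hat)
```

The last printed value matches $p(1-p)/n$, which is why the divisor is $n-1$ rather than $n$: dividing by $n$ would underestimate the variance on average.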
Application of distributions that we saw earlier: t, chi-sq
❑ Is knowing the population distribution itself useful
information?
❑ If yes, how do we make use of it?
❑ Let us concern ourselves with the sample mean
❑ Assume you have the information that the population distribution
is normal.
❑ How do you use this?
❑ Is CLT required?
Case of normal population
❑ Sample mean is denoted by $\bar{X}$ and defined as follows
$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$
❑ If the sampling scheme is WITH REPLACEMENT, sample
variance equals
$s_X^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1}$
❑ It can be shown that -
1. $\bar{X}$ and $s_X^2$ are independent
2. $\frac{(n-1)\,s_X^2}{\sigma^2} \sim \chi^2_{n-1}$
❑ Can you guess the distribution of $\frac{\bar{X} - \mu}{\sigma / \sqrt{n}}$?
❑ If $\sigma$ is unknown, we replace it by $s_X$, but then the distribution ceases
to be $N(0, 1)$. So what is it then?
❑ $\frac{\bar{X} - \mu}{s_X / \sqrt{n}} \sim t_{n-1}$
❑ Hence, even for small sample size, we can make confidence
intervals if we know that the population is normally distributed
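A simulation shows why the t distribution matters for small samples. The sketch below assumes a normal population with $\mu = 10$ and $\sigma = 3$ (illustrative values) and samples of size 5. The constant 2.776 is the tabulated value of $t_{4,\,0.025}$, hard-coded because the Python standard library has no t quantile function.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, stdev

rng = random.Random(2)
mu, sigma = 10.0, 3.0      # assumed normal population
n, reps = 5, 4000
z = NormalDist().inv_cdf(0.975)   # about 1.96
t = 2.776                         # table value of t_{4, 0.025}

z_hits = 0
t_hits = 0
for _ in range(reps):
    sample = [rng.gauss(mu, sigma) for _ in range(n)]
    xbar = mean(sample)
    se = stdev(sample) / sqrt(n)  # estimated standard error, sigma unknown
    if abs(xbar - mu) <= z * se:
        z_hits += 1
    if abs(xbar - mu) <= t * se:
        t_hits += 1

z_coverage = z_hits / reps
t_coverage = t_hits / reps
print(z_coverage, t_coverage)
```

With only five observations, the interval built with $z$ covers $\mu$ noticeably less often than 95% of the time, while the interval built with $t_{n-1}$ stays close to 95%.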
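Moral of the story - results for 100(1-α)% C.I. for Mean
Population distribution | n | $\sigma^2$ | C.I. Type | Symmetric C.I.
Any | Large | known | Approximate | $\left(\bar{X} - \frac{z_{\alpha/2}\,\sigma}{\sqrt{n}},\ \bar{X} + \frac{z_{\alpha/2}\,\sigma}{\sqrt{n}}\right)$
Any | Large | unknown | Approximate | $\left(\bar{X} - \frac{z_{\alpha/2}\,s}{\sqrt{n}},\ \bar{X} + \frac{z_{\alpha/2}\,s}{\sqrt{n}}\right)$
Normal | Any | known | Exact | $\left(\bar{X} - \frac{z_{\alpha/2}\,\sigma}{\sqrt{n}},\ \bar{X} + \frac{z_{\alpha/2}\,\sigma}{\sqrt{n}}\right)$
Normal | Any | unknown | Exact | $\left(\bar{X} - \frac{t_{n-1,\alpha/2}\,s}{\sqrt{n}},\ \bar{X} + \frac{t_{n-1,\alpha/2}\,s}{\sqrt{n}}\right)$
Table: C.I. for population mean $\mu$, $s$ is sample standard deviation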
Typically, $t_{n-1,\alpha/2}$ is used only for small $n$, because for large $n$, $z_{\alpha/2}$ gives
nearly the same value
Moral of the story - results for 100(1-α)% C.I. for Proportion
n | C.I. Type | Symmetric C.I.
Large | Approximate | $\left(\hat{p} - z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n-1}},\ \hat{p} + z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n-1}}\right)$
Table: C.I. for population proportion $p$, $\hat{p}$ is sample proportion
For proportions, it is advised to use these formulae only when,
apart from $n$ being large, $n\hat{p} \ge 10$ and also $n(1 - \hat{p}) \ge 10$
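The proportion interval and its rule of thumb fit together in one small function. This is a sketch: the name `proportion_interval` and the counts (120 YES out of 400) are hypothetical, and raising an error when the conditions fail is a design choice for the example, not something the slides prescribe.

```python
from math import sqrt
from statistics import NormalDist

def proportion_interval(n_yes, n, confidence=0.95):
    """Approximate C.I. for p: p_hat -/+ z * sqrt(p_hat (1 - p_hat) / (n - 1))."""
    p_hat = n_yes / n
    # Rule of thumb from the slide: both counts should be at least 10
    if n * p_hat < 10 or n * (1 - p_hat) < 10:
        raise ValueError("need n*p_hat >= 10 and n*(1 - p_hat) >= 10")
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    half_width = z * sqrt(p_hat * (1 - p_hat) / (n - 1))
    return p_hat - half_width, p_hat + half_width

# Hypothetical sample: 120 of 400 respondents say YES
low, high = proportion_interval(n_yes=120, n=400)
print(low, high)
```

For these counts the interval is roughly 0.255 to 0.345. A call such as `proportion_interval(3, 400)` raises the error, because only 3 YES responses are too few for the normal approximation.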