
Module 9 - Simulation, sampling and resampling

Instructional Hours: 4

Module Description
In this module, you will learn about sampling and the different sampling methods, simulation with emphasis on Monte Carlo simulation, and resampling using the bootstrapping and jackknife methods. Integration based on Monte Carlo integration is presented. Practical examples using R software are provided. Further, the module provides videos, examples and a quiz for self-assessment in the learning process.
This module introduces sampling methods, simulation using Monte Carlo simulation and resampling using the bootstrapping and jackknife methods.

Module Learning Outcomes


• Define the terms sampling, simulation and resampling
• Show how to simulate data with and without replacement
• Solve an integral problem using the Monte Carlo integration
• Apply the bootstrapping and jackknife methods in resampling

Learning Activities
1. Read the lecture notes (PDF)
2. Read the assigned reference materials
3. Watch lecture videos
4. Complete the module assessment/Quiz

Introduction
0.0.1 Simulation
Simulation is a statistical technique that involves generating random data based
on specified parameters or models to understand the behavior of a system or
process. It allows us to explore the potential outcomes and variability of complex
systems when analytical solutions are difficult or impossible to obtain.

Key points
• Random sampling: Simulation often involves random sampling from probability distributions.
• Modeling uncertainty: It helps in modeling the uncertainty and variability inherent in real-world systems.
• Applications: Commonly used in risk assessment, financial modeling, engineering, and scientific research.
• Monte Carlo simulation: A popular simulation method that relies on repeated random sampling to estimate properties of a system.

0.0.2 Resampling
Resampling is a statistical technique that involves repeatedly drawing samples
from observed data. It is used to assess the variability of a statistic and make
inferences about the population from which the sample was drawn.

Key points
• Bootstrap method: A resampling technique that involves repeatedly sampling with replacement from the observed data to estimate the sampling distribution of a statistic.
• Jackknife method: A resampling technique that systematically leaves out one observation at a time from the sample set and calculates the statistic for each subset.
• Permutation tests: Used to test hypotheses by comparing the observed data to data generated by randomly permuting the labels of the observations.
• Cross-validation: Used in machine learning to assess the performance of a model by dividing the data into training and validation sets multiple times.
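The permutation-test idea above can be sketched in R. The two groups and their values below are invented purely for illustration:

```r
# Permutation test sketch (toy data invented for illustration):
# compare two group means by randomly permuting the group labels
set.seed(123)
group_a <- c(5.1, 4.8, 5.5, 5.0, 5.3)
group_b <- c(4.2, 4.5, 4.1, 4.4, 4.6)
observed_diff <- mean(group_a) - mean(group_b)

combined <- c(group_a, group_b)
n_perm <- 10000
perm_diffs <- replicate(n_perm, {
  shuffled <- sample(combined)                 # shuffle the labels
  mean(shuffled[1:5]) - mean(shuffled[6:10])
})

# Two-sided p-value: proportion of permuted differences at least
# as extreme as the observed difference
p_value <- mean(abs(perm_diffs) >= abs(observed_diff))
p_value
```

A small p-value suggests the observed difference is unlikely to arise from label shuffling alone.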

0.1 Simulation, sampling and resampling


Simulation
Simulation involves generating random data based on specified parameters or
models to study the behavior of a system or process.

Sampling
Sampling is the process of selecting a subset of individuals or observations from
a larger population to infer characteristics about the population.

Resampling
Resampling involves repeatedly drawing samples from observed data, often to
assess the variability of a statistic.

Bootstrapping
Bootstrapping is a specific resampling technique where samples are drawn with
replacement from the observed data to estimate the sampling distribution of a
statistic.

0.1.1 Similarities
• Simulation and sampling: Both involve drawing samples from a distribution.
• Resampling and bootstrapping: Bootstrapping is a type of resampling.
• Simulation and resampling: Both methods generate multiple datasets to analyze variability and uncertainty.

0.1.2 Differences
Source of data
• Simulation: Data is generated based on specified models or distributions.
• Sampling: Data is drawn from an actual population.
• Resampling: Data is repeatedly drawn from the observed dataset.
• Bootstrapping: A specific form of resampling with replacement from the observed dataset.

Purpose
• Simulation: To model and study complex systems.
• Sampling: To make inferences about a population.
• Resampling: To assess the variability of a statistic.
• Bootstrapping: To estimate the sampling distribution of a statistic.

Method
• Simulation: Involves generating random data.
• Sampling: Involves selecting a subset of the population.
• Resampling: Involves creating multiple samples from observed data.
• Bootstrapping: Involves drawing samples with replacement from the observed data.

0.2 Sampling methods


Sampling is the process of selecting a subset of data points from a larger dataset
(population) to make statistical inferences about the population.

0.2.1 Types of Sampling Methods


Simple Random Sampling
Each member of the population has an equal chance of being selected.
Example 1. A data scientist wants to survey 100 employees out of 1,000 in a
company to understand their job satisfaction.
• Assign a unique number to each employee (1 to 1000).
• Use a random number generator to select 100 unique numbers.
• Survey the employees corresponding to these numbers.
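This procedure can be sketched in R as follows (the employee IDs 1 to 1000 are assumed for illustration):

```r
# Simple random sampling: select 100 of 1,000 employee IDs
# (IDs 1:1000 are assumed for illustration)
set.seed(123)
selected_ids <- sample(1:1000, size = 100, replace = FALSE)
length(selected_ids)   # 100 unique employees to survey
```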

Stratified Sampling
The population is divided into strata (subgroups) based on specific characteristics, and random samples are taken from each stratum.
Example 2. A researcher wants to ensure representation from different age groups in a health survey. The population is divided into three age groups: 18-30, 31-50, and 51+.

• Divide the population into three strata based on age groups.
• Randomly sample individuals from each age group proportional to their population size.
Example 3. In a factory with 200 workers, there are 50 in the assembly line,
100 in packaging, and 50 in quality control. A sample of 60 workers is needed.

1. Divide the workers into three strata: assembly line, packaging, and quality
control.

2. Calculate the sample size for each stratum proportional to their population size:
• Assembly line: (50/200) × 60 = 15
• Packaging: (100/200) × 60 = 30
• Quality control: (50/200) × 60 = 15
3. Randomly sample 15 workers from the assembly line, 30 from packaging,
and 15 from quality control.
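The allocation above can be sketched in R. The worker IDs below are invented labels, used only to make the example self-contained:

```r
# Stratified sampling: draw 15 / 30 / 15 workers from each stratum
# (worker IDs are invented labels for illustration)
set.seed(123)
assembly  <- paste0("A", 1:50)
packaging <- paste0("P", 1:100)
quality   <- paste0("Q", 1:50)

stratified_sample <- c(sample(assembly,  15),
                       sample(packaging, 30),
                       sample(quality,   15))
length(stratified_sample)   # 60 workers in total
```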

Cluster Sampling
The population is divided into clusters, some of which are randomly selected.
All members of selected clusters are sampled.

Example 4. A school district has 20 schools, and a researcher wants to survey all teachers in 5 randomly selected schools.
1. Number the schools from 1 to 20.
2. Use a random number generator to select 5 unique numbers.

3. Survey all teachers in the schools corresponding to these numbers.
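A minimal R sketch of the cluster selection step:

```r
# Cluster sampling: randomly select 5 of the 20 schools,
# then survey every teacher in the selected schools
set.seed(123)
selected_schools <- sample(1:20, size = 5)
selected_schools
```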

Systematic Sampling
Every k-th member of the population is selected after a random starting point.
Example 5. From a list of 500 customers, select a sample of 50 using systematic sampling.
1. Calculate the sampling interval: k = 500/50 = 10.
2. Randomly select a starting point between 1 and 10, say 4.
3. Select every 10th customer from the starting point: 4, 14, 24, 34, ..., 494.
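The steps above can be sketched in R (here the starting point is drawn at random rather than fixed at 4):

```r
# Systematic sampling: every 10th customer after a random start
set.seed(123)
k <- 500 / 50                       # sampling interval
start <- sample(1:k, 1)             # random starting point in 1..10
selected <- seq(from = start, by = k, length.out = 50)
head(selected)
```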

0.3 Simulation
Simulation is a computational technique used to model the behavior of a system
by generating random samples and observing the outcomes. It is often used to
estimate the probability of complex events that are difficult to analyze using
analytical methods.

0.3.1 Monte Carlo Simulation
Monte Carlo Simulation is a specific type of simulation that relies on repeated
random sampling to compute results. It is named after the Monte Carlo Casino
in Monaco, due to the element of chance.

Example 6. Simulate rolling a fair six-sided die 10,000 times and estimate the
probability of rolling a 4.

1. Generate 10,000 random numbers between 1 and 6.
2. Count the number of times the number 4 appears.
3. Estimate the probability by dividing the count of 4s by the total number of rolls.

set.seed(123)
rolls <- sample(1:6, 10000, replace = TRUE)
prob_4 <- sum(rolls == 4) / 10000
prob_4

# Expected output
0.1675

Practical Session
Create an interactive R studio in LMS for the student to practice the
above codes.

Example 7. Estimate the value of π using Monte Carlo simulation.

1. Generate random points within a unit square.
2. Count the number of points that fall within a quarter circle of radius 1.
3. Use the ratio of points inside the circle to the total number of points to estimate π.

set.seed(123)
n <- 10000
x <- runif(n, min = 0, max = 1)
y <- runif(n, min = 0, max = 1)
inside_circle <- (x^2 + y^2) <= 1
pi_estimate <- 4 * sum(inside_circle) / n
pi_estimate

# Expected output
3.1424

Practical Session
Create an interactive R studio in LMS for the student to practice the
above codes.

Example 8. Simulate tossing a fair coin 1,000 times and estimate the probability of getting heads.

1. Generate 1,000 random numbers (0 or 1) where 0 represents tails and 1 represents heads.
2. Count the number of heads.
3. Estimate the probability by dividing the count of heads by the total number of tosses.

set.seed(123)
tosses <- sample(c(0, 1), 1000, replace = TRUE)
prob_heads <- sum(tosses) / 1000
prob_heads

# Expected output
0.507

Practical Session
Create an interactive R studio in LMS for the student to practice the
above codes.

Example 9. Estimate the integral of f(x) = x² from 0 to 1 using Monte Carlo simulation.

1. Generate random points uniformly distributed in [0,1].
2. Evaluate the function at these points.
3. Take the average of the function values to estimate the integral.

set.seed(123)
n <- 10000
x <- runif(n, min = 0, max = 1)
f_x <- x^2
integral_estimate <- mean(f_x)
integral_estimate

# Expected output
0.3337

Practical Session
Create an interactive R studio in LMS for the student to practice the
above codes.

0.3.2 Watch the video


Watch the video and answer the questions that follow

Video: Visit the URL below to view a video:

https://www.youtube.com/embed/v=9crPxlA795Y

Video by Stat Legend - Monte Carlo Integration using R


1. Using nrep = 2,000, find the integral of x² dx from 0 to 1 using Monte Carlo integration (4 Marks) (Answer is 0.3303753)

2. Using nrep = 4,000, find the integral of eˣ dx from 1 using Monte Carlo integration (4 Marks) (Answer is 20.9764)

0.4 Bootstrapping
Bootstrapping is a statistical technique used to estimate the sampling distribution of a statistic by repeatedly resampling with replacement from the original sample data. This method allows for assessing the variability and confidence intervals of estimates, particularly when the theoretical distribution of the statistic is unknown or the sample size is small.

0.4.1 Steps involved in bootstrapping


1. Original sample: Start with an original sample of size n from the population.
2. Resampling: Randomly draw n observations with replacement from the original sample to create a bootstrap sample.
3. Statistic calculation: Calculate the desired statistic (e.g., mean, median) for the bootstrap sample.
4. Repeat: Repeat steps 2 and 3 a large number of times (e.g., 1,000 or 10,000) to create a distribution of the statistic.

5. Analysis: Analyze the distribution of the bootstrap statistics to estimate the standard error, confidence intervals, and other properties of the original statistic.

Example 10. Estimate the mean and its 95% confidence interval for a sample using bootstrapping.

1. Generate a bootstrap sample by resampling with replacement from the original sample.
2. Compute the mean for each bootstrap sample.
3. Repeat this process 10,000 times to create a distribution of the mean.
4. Calculate the 95% confidence interval from the bootstrap distribution.

set.seed(123)
# Original sample
original_sample <- c(10, 12, 15, 18, 20, 25, 30, 35, 40, 45)

# Function to generate bootstrap samples and calculate the mean
bootstrap_mean <- function(data, n_bootstrap) {
  bootstrap_means <- numeric(n_bootstrap)
  n <- length(data)
  for (i in 1:n_bootstrap) {
    bootstrap_sample <- sample(data, size = n, replace = TRUE)
    bootstrap_means[i] <- mean(bootstrap_sample)
  }
  return(bootstrap_means)
}

# Generate 10,000 bootstrap samples
n_bootstrap <- 10000
bootstrap_means <- bootstrap_mean(original_sample, n_bootstrap)

# Calculate the 95% confidence interval
ci_lower <- quantile(bootstrap_means, 0.025)
ci_upper <- quantile(bootstrap_means, 0.975)

list(mean = mean(bootstrap_means), ci_lower = ci_lower, ci_upper = ci_upper)

# Expected output
$mean
[1] 25.02295

$ci_lower
[1] 17.5

$ci_upper
[1] 32.5

Example 11. Estimate the median and its 95% confidence interval for a sample using bootstrapping.
Solution:
1. Generate a bootstrap sample by resampling with replacement from the original sample.
2. Compute the median for each bootstrap sample.
3. Repeat this process 10,000 times to create a distribution of the median.
4. Calculate the 95% confidence interval from the bootstrap distribution.

set.seed(123)
# Function to generate bootstrap samples and calculate the median
bootstrap_median <- function(data, n_bootstrap) {
  bootstrap_medians <- numeric(n_bootstrap)
  n <- length(data)
  for (i in 1:n_bootstrap) {
    bootstrap_sample <- sample(data, size = n, replace = TRUE)
    bootstrap_medians[i] <- median(bootstrap_sample)
  }
  return(bootstrap_medians)
}

# Generate 10,000 bootstrap samples
bootstrap_medians <- bootstrap_median(original_sample, n_bootstrap)

# Calculate the 95% confidence interval
ci_lower <- quantile(bootstrap_medians, 0.025)
ci_upper <- quantile(bootstrap_medians, 0.975)

list(median = median(bootstrap_medians), ci_lower = ci_lower, ci_upper = ci_upper)

# Expected output
$median
[1] 25

$ci_lower
[1] 15

$ci_upper
[1]

Practical Session
Create an interactive R studio in LMS for the student to practice the
above codes.

0.4.2 Quiz
Given the following dataset representing the weights (in kg) of five individuals:

data = {55, 62, 70, 58, 64}

Using the bootstrapping method, calculate the bootstrap estimate of the mean weight based on 1000 resamples. (6 Marks) (Answer is 61.4). Hint: Execute this question by setting the seed to 123.

0.5 Jackknife
Jackknife resampling is a technique used to estimate the bias and variance of a
statistical estimator. It systematically leaves out one observation at a time from
the sample set and calculates the estimate over n different samples (where n is
the sample size). It is particularly useful for small datasets and can be applied
to various statistical measures.

0.5.1 Steps involved in Jackknife resampling


1. Compute the original estimator: Calculate the estimator (e.g., mean, variance) using the full dataset.
2. Leave-one-out samples: Create n subsets of the data, each leaving out one observation.
3. Compute leave-one-out estimates: For each subset, compute the estimator.
4. Jackknife estimate of bias: Calculate the bias of the estimator using the leave-one-out estimates.
5. Jackknife estimate of variance: Calculate the variance of the estimator using the leave-one-out estimates.
6. Combine results: Use the bias and variance estimates to adjust the original estimator if needed.

Example 12. Estimate the mean of a sample and its standard error using the jackknife method.

1. Compute the mean of the full dataset.
2. Generate leave-one-out samples.
3. Compute the mean for each leave-one-out sample.
4. Calculate the jackknife estimate of the mean and its standard error.

# Sample data
data <- c(2, 4, 6, 8, 10)

# Step 1: Compute the original estimator (mean)
original_mean <- mean(data)

# Step 2: Generate leave-one-out samples and compute their means
n <- length(data)
leave_one_out_means <- numeric(n)

for (i in 1:n) {
  leave_one_out_sample <- data[-i]
  leave_one_out_means[i] <- mean(leave_one_out_sample)
}

# Step 3: Compute the jackknife estimate of the mean
jackknife_mean <- mean(leave_one_out_means)

# Step 4: Compute the jackknife estimate of the standard error
jackknife_se <- sqrt((n - 1) * mean((leave_one_out_means - jackknife_mean)^2))

# Results
original_mean
jackknife_mean
jackknife_se

# Expected output
original_mean: 6
jackknife_mean: 6
jackknife_se: 1.414214

0.5.2 Quiz
Given the dataset
data = {5, 7, 8, 6, 9}
Calculate the Jackknife mean, bias and variance (6 Marks) (Answer: mean =7.2,
Bias=0.2 and variance=0.24 )

0.5.3 Discussion forum


Provide a discussion forum for the students to share their experiences in this module.

0.6 Reading Materials
1. Miltiadis C. Mavrakakis and Jeremy Penzer (2021). Probability and Statistical Inference: From Basic Principles to Advanced Models. Chapman and Hall/CRC (Pages 379-400). Read the selected pages.

0.7 Summary
1. Sampling is the process of selecting a subset of individuals from a population to estimate characteristics of the whole population.

Types of sampling

• Simple random sampling: Every individual has an equal chance of being selected.
• Stratified sampling: Population is divided into strata, and samples are taken from each stratum.
• Cluster sampling: Population is divided into clusters, and whole clusters are sampled.
• Systematic sampling: Every nth individual is selected from a list of the population.
• Convenience sampling: Samples are taken from a group that is conveniently accessible.

2. Bootstrapping is a resampling method used to estimate the distribution of a statistic by sampling with replacement from the original data.

Key methods

• Generating bootstrap samples: Repeatedly draw samples with replacement from the original data set.
• Calculating statistics: For each bootstrap sample, calculate the desired statistic (e.g., mean, variance).
• Estimating distribution: Use the distribution of the bootstrap statistics to estimate the sampling distribution.
• Confidence intervals: Derive confidence intervals from the bootstrap distribution.

3. Jackknife is a resampling method used to estimate the bias and variance of a statistic by systematically leaving out one observation at a time from the sample set.

Key methods

• Leave-one-out: Create jackknife samples by systematically omitting each observation from the sample.
• Calculate statistic: For each jackknife sample, calculate the statistic of interest.
• Estimate bias and variance: Use the jackknife samples to estimate the bias and variance of the statistic.
• Jackknife-after-bootstrap: Combine jackknife and bootstrap for more robust variance estimates.

4. Monte Carlo Simulation is a computational technique that uses repeated random sampling to simulate and understand the behavior of a system or process.

Key methods

• Random sampling: Generate random samples from the probability distributions of the input variables.
• Model simulation: Use the random samples to simulate the model or system.
• Outcome analysis: Analyze the results of the simulations to estimate probabilities, averages, variances, etc.
• Applications: Used in various fields like finance, engineering, and science to solve problems involving uncertainty and complex systems.

