Data Science: Probability
Section 1: Discrete Probability 1.1 Introduction to Discrete Probability Discrete Probability
RAFAEL IRIZARRY: We start by covering some basic principles related to categorical data.
This subset of probability is referred to as discrete probability.
It will help us understand the probability theory
we will later introduce for numeric and continuous data, which is more
common in data science applications.
Discrete probability is particularly useful in card games,
so we use these as examples.
The word probability is used in everyday language.
For example, Google's autocomplete of "what are the chances of"
gives us: getting pregnant, having twins, and rain tomorrow.
Answering questions about probability is often hard, if not impossible.
Here, we discuss a mathematical definition of probability
that does permit us to give precise answers to certain questions.
For example, if I have two red beads and three blue beads inside an urn
and I pick one at random, what is the probability of picking a red one?
Our intuition tells us that the answer is 2/5, or 40%.
A precise definition can be given by noting
that there are five possible outcomes of which two satisfy
the condition necessary for the event "pick a red bead."
Because each of the five outcomes has the same chance of occurring,
we conclude that the probability is 0.4 for red and 0.6 for blue.
A more tangible way to think about the probability of an event
is as a proportion of times the event occurs
when we repeat the experiment over and over independently and under the same
conditions.
Before we continue, let's introduce some notation.
We use the notation probability of A to denote the probability of an event A
happening.
We use the very general term event to refer
to things that can happen when something happens by chance.
For example, in our previous example, the event was picking a red bead.
In a political poll, in which we call 100 likely voters
at random, an example of an event is calling
48 Democrats and 52 Republicans.
In data science applications, we will often deal with continuous variables.
In these cases, events will often be things like,
is this person taller than 6 feet?
In this case, we write events in a more mathematical form.
For example, x greater than 6.
We'll see more of these examples later.
Here, we focus on categorical data and discrete probability.
Textbook link
This video corresponds to the textbook section on discrete probability.
Key points
The probability of an event is the proportion of times the event occurs when
we repeat the experiment independently under the same conditions.
Pr(A)=probability of event A
An event is defined as an outcome that can occur when something
happens by chance.
We can determine probabilities related to discrete variables (picking a red
bead, choosing 48 Democrats and 52 Republicans from 100 likely voters)
and continuous variables (height over 6 feet).
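The definition above, probability as the long-run proportion of times an event occurs, can be checked with a quick sketch in R (this anticipates the sample() function introduced in the next video):

```r
# the urn from the video: 2 red beads, 3 blue beads
beads <- rep(c("red", "blue"), times = c(2, 3))

# estimate Pr(red) as the proportion of red draws when the
# experiment is repeated many times under the same conditions
draws <- sample(beads, 100000, replace = TRUE)
mean(draws == "red")   # close to 2/5 = 0.4
```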
Section 1: Discrete Probability 1.1 Introduction to Discrete Probability Monte Carlo
Simulations
RAFAEL IRIZARRY: Computers provide a way to actually perform
the simple random experiments, such as the one we did before.
Pick a bead at random from a bag or an urn with 3 blue beads and 2 red ones.
Random number generators permit us to mimic the process of picking at random.
An example in R is the sample function.
We demonstrate its use showing you some code.
First, use the rep function to generate the urn.
We create an urn with 2 red beads and 3 blue beads.
You can see when we type beads we see this.
Now, we can use the sample function to pick one at random.
If we type sample beads comma 1, in this case, we get a blue.
This line of code produces one random outcome.
Now, we want to repeat this experiment over and over.
However, it is, of course, impossible to repeat forever.
Instead, we repeat the experiment a large enough number of times
to make the results practically equivalent to doing it
over and over forever.
This is an example of a Monte Carlo simulation.
Note that much of what mathematical and theoretical statisticians study--
something we do not cover in this course--
relates to providing rigorous definitions of practically equivalent,
as well as studying how close a large number of experiments
gets us to what happens in the limit, the limit meaning if we did it forever.
Later in this module, we provide a practical approach
to deciding what is large enough.
To perform our first Monte Carlo simulation,
we use the replicate function.
This permits us to repeat the same task any number of times we want.
Here, we repeat the random event 10,000 times.
We set B to be 10,000, then we use the replicate function
to sample from the beads 10,000 times.
We can now see if, in fact, our definition
is in agreement with this Monte Carlo simulation approximation.
We can use table, for example, to see the distribution.
And then we can use prop.table to give us the proportions.
And we see that, in fact, the Monte Carlo simulation
gives a very good approximation with 0.5962 for blue and 0.4038 for red.
We didn't get exactly 0.6 and exactly 0.4,
but statistical theory tells us that, if we make B large enough,
we can get as close as we want to those numbers.
We just covered a simple and not very useful
example of Monte Carlo simulations.
But we will use Monte Carlo simulation to estimate probabilities
in cases in which it is harder to compute the exact ones.
Before we go into more complex examples, we still
use simple ones to demonstrate the computing tools available in R.
Let's start by noting that we don't actually
have to use replicate in this particular example.
This is because the function sample has an argument that
permits us to pick more than one element from the urn.
However, by default, this selection occurs without replacement.
After a bead is selected, it is not put back in the urn.
Note what happens when we ask to randomly select 5 beads.
Let's do it over and over again.
Let's do it three times.
This results in a rearrangement that always has three blue and two
red beads.
If we asked for six beads, then we get an error.
It tells us you don't have enough beads in here to get six.
This is because it's doing it without replacement.
However, this function, the sample function,
can be used directly-- again, without the replicate--
to repeat the same experiment of picking 1 out
of 5 beads over and over under the same conditions.
To do this, we sample with replacement.
After we pick the bead we put it back in the urn.
We can tell sample to do this by changing the replace argument which
defaults to false to true.
We do it like this.
And when we do this, we see that we get very similar answers to what
we got using the replicate function.
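The sampling behavior described above can be sketched as follows (the urn and variable names match the earlier video code):

```r
beads <- rep(c("red", "blue"), times = c(2, 3))   # urn with 2 red, 3 blue

# without replacement (the default), 5 draws use up the whole urn,
# so each call returns a rearrangement of 3 blue and 2 red
sample(beads, 5)
sample(beads, 5)
sample(beads, 5)

# asking for 6 beads errors because only 5 are available:
# sample(beads, 6)  # Error: cannot take a sample larger than the population

# with replacement, each draw is independent, so one call replaces replicate()
B <- 10000
events <- sample(beads, B, replace = TRUE)
prop.table(table(events))   # proportions close to 0.6 blue, 0.4 red
```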
Textbook link
This video corresponds to the textbook section on Monte Carlo simulations.
Key points
Monte Carlo simulations model the probability of different outcomes by
repeating a random process a large enough number of times that the results
are similar to what would be observed if the process were repeated forever.
The sample() function draws random outcomes from a set of options.
The replicate() function repeats lines of code a set number of times. It
is used with sample() and similar functions to run Monte Carlo simulations.
Video code
Note that your exact outcome values from the Monte Carlo simulation will differ
because the sampling is random.
beads <- rep(c("red", "blue"), times = c(2,3)) # create an urn with 2 red, 3 blue
beads # view beads object
sample(beads, 1) # sample 1 bead at random
B <- 10000 # number of times to draw 1 bead
events <- replicate(B, sample(beads, 1)) # draw 1 bead, B times
tab <- table(events) # make a table of outcome counts
tab # view count table
prop.table(tab) # view table of outcome proportions
Section 1: Discrete Probability 1.1 Introduction to Discrete Probability Setting the Random
Seed
The set.seed() function
Before we continue, we will briefly explain the following important line of code:
set.seed(1986)
Throughout this book, we use random number generators. This implies that many
of the results presented can actually change by chance, which then suggests that a
frozen version of the book may show a different result than what you obtain when
you try to code as shown in the book. This is actually fine since the results are
random and change from time to time. However, if you want to ensure that
results are exactly the same every time you run them, you can set R’s random
number generation seed to a specific number. Above we set it to 1986. We want to
avoid using the same seed every time. A popular way to pick the seed is the year -
month - day. For example, we picked 1986 on December 20, 2018: 2018 − 12 −
20 = 1986.
You can learn more about setting the seed by looking at the documentation:
?set.seed
In the exercises, we may ask you to set the seed to assure that the results you
obtain are exactly what we expect them to be.
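As a quick check that set.seed() does what we claim, resetting the seed before an identical call reproduces the same random draw:

```r
set.seed(1986)
x <- sample(1:100, 5)

set.seed(1986)   # same seed, same call
y <- sample(1:100, 5)

identical(x, y)  # TRUE
```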
Important note on seeds in R 3.5 and R 3.6
R was recently updated to version 3.6 in early 2019. In this update, the default
method for setting the seed changed. This means that exercises, videos, textbook
excerpts and other code you encounter online may yield a different result based on
your version of R.
If you are running R 3.6, you can revert to the original seed setting behavior by
adding the argument sample.kind="Rounding". For example:
set.seed(1)
set.seed(1, sample.kind="Rounding") # will make R 3.6 generate a seed as in R 3.5
Using the sample.kind="Rounding" argument will generate a message:
non-uniform 'Rounding' sampler used
This is not a warning or a cause for alarm - it is a confirmation that R is using the
alternate seed generation method, and you should expect to receive this message
in your console.
If you use R 3.6, you should always use the second form of set.seed() in
this course series (outside of DataCamp assignments). Failure to do so may
result in an otherwise correct answer being rejected by the grader. In most cases
where a seed is required, you will be reminded of this fact.
Section 1: Discrete Probability 1.1 Introduction to Discrete Probability Using the mean
Function for Probability
An important application of the mean() function
In R, applying the mean() function to a logical vector returns the proportion of
elements that are TRUE. It is very common to use the mean() function in this way to
calculate probabilities and we will do so throughout the course.
Suppose you have the vector beads from a previous video:
beads <- rep(c("red", "blue"), times = c(2,3))
beads
[1] "red" "red" "blue" "blue" "blue"
To find the probability of drawing a blue bead at random, you can run:
mean(beads == "blue")
[1] 0.6
This code is broken down into steps inside R. First, R evaluates the logical
statement beads == "blue", which generates the vector:
FALSE FALSE TRUE TRUE TRUE
When the mean function is applied, R coerces the logical values to numeric values,
changing TRUE to 1 and FALSE to 0:
0 0 1 1 1
The mean of the zeros and ones thus gives the proportion of TRUE values. As we
have learned and will continue to see, probabilities are directly related to the
proportion of events that satisfy a requirement.
Section 1: Discrete Probability 1.1 Introduction to Discrete Probability Probability Distributions
RAFAEL IRIZARRY: Defining a distribution for categorical outcomes
is relatively straightforward.
We simply assign a probability to each category.
In cases that can be thought of as beads in an urn, for each bead type,
the proportion defines the distribution.
Another example comes from polling.
If you're randomly calling likely voters from a population that
has 44% Democrat, 44% Republican, 10% undecided, and 2% green,
these proportions define the probability for each group.
For this example, the probability distribution
is simply these four proportions.
Again, categorical data makes it easy to define probability distributions.
However, later in applications that are more common in data science,
we will learn about probability distributions for continuous variables.
In this case, it'll get a little bit more complex.
But for now, we're going to stick to discrete probabilities
before we move on.
Textbook link
This video corresponds to the textbook section on probability distributions.
Key points
The probability distribution for a variable describes the probability of
observing each possible outcome.
For discrete categorical variables, the probability distribution is defined by
the proportions for each group.
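For a categorical variable, the distribution is just a set of proportions, one per category. A minimal sketch using the polling example from the video (the vector name poll is ours):

```r
# the proportions for each group define the probability distribution
poll <- c(Democrat = 0.44, Republican = 0.44, Undecided = 0.10, Green = 0.02)
sum(poll)   # a valid distribution sums to 1

# we can simulate calling one likely voter at random from this population
sample(names(poll), 1, prob = poll)
```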
Section 1: Discrete Probability 1.1 Introduction to Discrete Probability Independence
RAFAEL IRIZARRY: We say that two events are independent if the outcome of one
does not affect the other.
The classic example is coin tosses.
Every time we toss a fair coin, the probability of seeing heads
is 1/2 regardless of what previous tosses have revealed.
The same is true when we pick beads from an urn with replacement.
In the example we saw earlier, the probability of red
was .40 regardless of previous draws.
Many examples of events that are not independent come from card games.
When we deal the first card, the probability of getting, say,
a king is 1 in 13.
This is because there are 13 possibilities.
You can get an ace, a 2, a 3, a 4, et cetera, 10, jack, queen or king.
Now, if we deal a king for the first card and I don't replace it,
then the probability of getting a king as the second card
is lower because there are only three kings left.
The probability is 3 out of not 52--
because we already dealt one card-- but out of 51.
These events are, therefore, not independent.
The first outcome affects the second.
To see an extreme case of non-independent events,
consider an example of drawing five beads at random
without replacement from an urn.
Three are blue, two are red.
I'm going to generate data like this using the sample function
and assign it to x.
You can't see the outcomes.
Now, if I ask you to guess the color of the first bead, what do you guess?
Since there's more blue beads, there's actually a 0.6 chance of seeing blue.
That's probably what you guess.
But now, I'm going to show you the outcomes of the other four.
The second, third, fourth and fifth outcomes you can see here.
You can see that the three blue beads have already come out.
This affects the probability of the first.
They are not independent.
So would you still guess blue?
Of course not.
Now, you know that the probability of red is 1.
These events are not independent.
The probabilities change once you see the other outcomes.
When events are not independent, conditional probabilities
are useful and necessary to make correct calculations.
We already saw an example of a conditional probability.
We computed the probability that a second dealt card is
a king given that the first was a king.
In probability, we use the following notation.
We use the vertical bar, ∣, as a shorthand
for "given that" or "conditional on."
These are synonyms.
Note that, when two events, say A and B, are independent,
we have the following equation.
The probability of A given B is equal to the probability of A.
It doesn't matter what B is.
The probability A is unchanged.
This is the mathematical way of saying it.
And in fact, this can be considered the mathematical definition
of independence.
All right now.
If we want to know the probability of two events, say A and B, occurring,
we can use the multiplication rule.
So the probability of A and B is equal to the probability of A multiplied
by the probability of B given that A already happened.
Let's use blackjack as an example.
In blackjack, you get assigned two random cards without replacement.
Then, you can ask for more.
The goal is to get closer to 21 than the dealer without going over.
Face cards are worth 10 points.
So is the 10 card.
That's worth 10 points, too.
And aces are worth either 11 or 1.
So if you get an ace and a face card, you win automatically.
So, in blackjack, to calculate the chances of getting 21
in the following way, first an ace
and then a face card or a 10, we compute the probability of the first card
being an ace and then multiply by the probability of a face card
or a 10 given that the first card was an ace.
This calculation is 1/13--
chance of getting an ace--
times the chance of getting a card with value 10
given that we already saw an ace, which is 16 out of 51.
We've already taken one card out.
This is approximately 2%.
The multiplicative rule also applies to more than two events.
We can use induction to expand for more than two.
So the probability of A and B and C is equal to the probability of A times
the probability of B given that A happened times the probability of C
given that A and B happened.
When we have independent events, the multiplication rule becomes simpler.
We simply multiply the three probabilities.
But we have to be very careful when we use a multiplicative rule in practice.
We're assuming independence, and this can
result in very different and incorrect probability calculations
when we don't actually have independence.
This could have dire consequences, for example, in a trial
if an expert doesn't really know the multiplication rule and how to use it.
So let's use an example.
This is loosely based on something that actually happened.
Imagine a court case in which the suspect was described
to have a mustache and a beard, and the prosecution brings in an expert
to argue that, because 1 in 10 men have beards and 1 in 5 men
have mustaches, using the multiplication rule, this means that only 2% of men
have both beards and mustaches--
1/10 times 1/5.
2% is a pretty unlikely event.
However, to multiply like this, we need to assume independence.
And in this case, it's clearly not true.
The conditional probability of a man having a mustache conditional on them
having a beard is quite high.
It's about 95%.
So the correct calculation actually gives us a much higher probability.
It's 9%.
So there's definitely reasonable doubt.
Textbook link
This video corresponds to the textbook section on independence, conditional
probability and the multiplication rule.
Errors and clarifications
At 4:32 in the video, it should be 16 out of 51, not 12 out of 51. While the audio is
incorrect, the transcript and the text on screen both now have the correct number.
Note that this calculation only applies to getting blackjack in the order "Ace,
Face/10". We consider the full probability including the possible order "Face/10,
Ace" later when discussing the addition rule.
Key points
Conditional probabilities compute the probability that an event occurs given
information about dependent events. For example, the probability of drawing
a second king given that the first draw is a king is:
Pr(Card 2 is a king∣Card 1 is a king)=3/51
If two events A and B are independent, Pr(A∣B)=Pr(A).
To determine the probability of multiple events occurring, we use
the multiplication rule.
Equations
The multiplication rule for independent events is:
Pr(A and B and C)=Pr(A)×Pr(B)×Pr(C)
The multiplication rule for dependent events considers the conditional probability of
both events occurring:
Pr(A and B)=Pr(A)×Pr(B∣A)
We can expand the multiplication rule for dependent events to more than 2 events:
Pr(A and B and C)=Pr(A)×Pr(B∣A)×Pr(C∣A and B)
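The two worked examples from the video can be checked numerically (the 95% conditional probability for mustaches is the figure quoted in the transcript):

```r
# natural 21 in the order ace, then face card or 10:
# Pr(ace) * Pr(ten-value card | ace)
p_21 <- (4/52) * (16/51)
p_21   # about 0.024, roughly 2%

# beard and mustache example: assuming independence gives
(1/10) * (1/5)    # 0.02
# but with Pr(mustache | beard) = 0.95, the multiplication rule gives
(1/10) * 0.95     # 0.095, about 9%
```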
Section 1: Discrete Probability 1.2 Combinations and Permutations Combinations and
Permutations
RAFAEL IRIZARRY: In our very first example,
we imagine an urn with 5 beads--
3 blue, 2 red.
To review, to compute the probability distribution of 1 draw,
we simply listed out all the possibilities-- there were 5--
and then for each event, we counted how many of these possibilities
were associated with that event.
So for example, for the blue beads the probability is 0.6.
For more complicated examples, however, these computations
are not necessarily straightforward.
For example, what is the probability that, if I
draw 5 cards without replacement, I get all cards
of the same suit, what is called a flush in poker?
Discrete probability teaches us how to make
these computations using mathematics.
Here we focus on how to use R code.
So we're going to use card games as examples.
So let's start by constructing a deck of cards
using R. For this, we will use the functions expand.grid and paste.
We use paste to create strings by joining smaller strings.
For example, if we have the number and the suit for a card
in 2 different variables, we can create the card name using paste like this.
It also works on pairs of vectors.
It performs the operation element-wise.
So if we type this line of code, we get the following result.
The function expand.grid gives us all the combinations of 2 lists.
So for example, if you have blue and black pants
and white, gray, and plaid shirt, all your combinations
can be computed using the expand.grid function like this.
You can see all 6 combinations.
So here's how we generate a deck of cards.
We define the four suits, we define the 13 numbers,
and then we create the deck using expand.grid and then pasting together
the 2 columns that expand.grid creates.
Now that we have a deck constructed, we can now start answering questions
about probability.
Let's start by doing something simple.
Let's double-check that the probability of a king in the first card is 1 in 13.
We simply compute the proportion of possible outcomes
that satisfy our condition.
So we create a vector that contains the four ways we can get a king.
That's going to be the kings variable.
And then we simply check what proportion of the deck is one of these cards
and we get the answer that we expect--
0.076 dot dot dot, which is 1 in 13.
Now, how about the conditional probability of the second card being a king,
given that the first was a king?
Earlier we deduced that if 1 king is already out, then there's 51 left.
So the probability is 3 in 51.
But let's confirm by listing out all possible outcomes.
To do this, we're going to use the combinations and permutations
functions that are available from the gtools package.
The permutations function computes, for any list of size n,
all the different ways we can select r items.
So here's an example--
here all the ways we can choose 2 numbers from the list 1, 2, 3, 4, 5.
Notice that the order matters.
So 3, 1 is different from 1, 3, so it appears in our permutations.
Also notice that 1, 1; 2, 2; and 3, 3 don't
appear, because once we pick a number, it can't appear again.
Optionally for this function permutations, we can add a vector.
So for example, if you want to see 5 random 7-digit phone numbers out
of all possible phone numbers, you could type code like this.
Here we're defining a vector of digits that goes from 0 to 9 rather than 1
through 10.
So these four lines of code generate all phone numbers, picks 5 at random,
and then shows them to you.
To compute all possible ways that we can choose 2 cards when the order matters,
we simply type the following piece of code.
Here we use permutations.
There's 52 cards, we're going to choose 2,
and we're going to select them out of the vector that includes our card
names, which we called deck earlier.
This is going to be a matrix with 2 columns,
and in this case, it's going to have 2,652 rows.
Those are all the permutations.
Now, we're going to define the first card and the second card
by grabbing the first and second columns using this simple piece of code.
And now we can, for example, check how many cases
have a first card that is a king--
that's 204.
And now to find the conditional probability,
we ask what fraction of these 204 have also a king in the second card.
So this case we type the following piece of code.
We add all the cases that have king in the first, king in the second,
and divide by the cases that have a king in the first.
And now we get the answer 0.058 dot dot dot, which is exactly 3 out of 51,
which we had already deduced.
Note that the code we just saw is equivalent to this piece of code,
where we compute the proportions instead of the totals.
And this also gives us the answer that we want, 3 out of 51.
This is an R version of the multiplication rule, which tells us that
the probability of B given A is equal to the proportion, or probability,
of A and B divided by the proportion, or probability, of A.
Now, what if the order does not matter?
For example, in blackjack, if you get an ace and a face card or a 10,
it's called a natural 21, and you win automatically.
If we want to compute the probability of this happening,
we want to enumerate the combinations, not permutations,
since the order doesn't matter.
So if we get an ace and a king, or a king and an ace, it's still a 21.
We don't want to count that twice.
So notice the difference between the permutations function, which
lists all permutations, and the combinations function,
where order does not matter.
This means that 2, 1 doesn't appear because 1, 2 already appeared.
Similarly, 3, 1 and 3, 2 don't appear.
So to compute the probability of a natural 21 in blackjack,
we can do this.
We can define a vector that includes all the aces, a vector that
includes all the face cards, then we generate
all the combinations of picking 2 cards out of 52, and then we simply count.
How often do we get aces and a face card?
And we get the answer 0.048 dot, dot, dot.
Now, notice that in the previous piece of code
we assumed that the aces come first.
This is only because we know the way that combinations
generates and enumerates possibilities.
But if we want to be safe, we can instead
type this code, which considers both possibilities.
We get the same answer, and again, this is
because we know how combinations works and how it lists the possibilities.
Instead of using combinations to deduce the exact probability of a natural 21,
we can also use a Monte Carlo to estimate this probability.
In this case, we draw two cards over and over
and keep track of how many 21's we get.
We can use the function sample to draw two cards without replacement, like this.
Here's 1 hand.
We didn't get a 21 there.
And then check if 1 card is an ace and the other is a face card or a 10.
Now we simply repeat this over and over and we get a very good approximation--
in this case, 0.0488.
Textbook link
Here is a link to the textbook section on combinations and permutations.
Key points
paste() joins two strings and inserts a space in between.
expand.grid() gives the combinations of 2 vectors or lists.
permutations(n,r) from the gtools package lists the different ways
that r items can be selected from a set of n options when order matters.
combinations(n,r) from the gtools package lists the different ways
that r items can be selected from a set of n options when order does not
matter.
Code: Introducing paste() and expand.grid()
# joining strings with paste
number <- "Three"
suit <- "Hearts"
paste(number, suit)
# joining vectors element-wise with paste
paste(letters[1:5], as.character(1:5))
# generating combinations of 2 vectors with expand.grid
expand.grid(pants = c("blue", "black"), shirt = c("white", "grey", "plaid"))
Code: Generating a deck of cards
suits <- c("Diamonds", "Clubs", "Hearts", "Spades")
numbers <- c("Ace", "Deuce", "Three", "Four", "Five", "Six", "Seven", "Eight",
"Nine", "Ten", "Jack", "Queen", "King")
deck <- expand.grid(number = numbers, suit = suits)
deck <- paste(deck$number, deck$suit)
# probability of drawing a king
kings <- paste("King", suits)
mean(deck %in% kings)
Code: Permutations and combinations
Correction: The code shown does not generate all 7 digit phone numbers because
phone numbers can have repeated digits. It generates all possible 7 digit numbers
without repeats.
library(gtools)
permutations(5,2) # ways to choose 2 numbers in order from 1:5
all_phone_numbers <- permutations(10, 7, v = 0:9)
n <- nrow(all_phone_numbers)
index <- sample(n, 5)
all_phone_numbers[index,]
permutations(3,2) # order matters
combinations(3,2) # order does not matter
Code: Probability of drawing a second king given that one king
is drawn
hands <- permutations(52,2, v = deck)
first_card <- hands[,1]
second_card <- hands[,2]
sum(first_card %in% kings)
sum(first_card %in% kings & second_card %in% kings) /
sum(first_card %in% kings)
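The transcript also mentions an equivalent computation using proportions instead of totals. A base-R sketch of that version (using expand.grid to enumerate the ordered pairs so no extra package is needed; gtools::permutations would give the same rows):

```r
suits <- c("Diamonds", "Clubs", "Hearts", "Spades")
numbers <- c("Ace", "Deuce", "Three", "Four", "Five", "Six", "Seven", "Eight",
             "Nine", "Ten", "Jack", "Queen", "King")
grid <- expand.grid(number = numbers, suit = suits)
deck <- paste(grid$number, grid$suit)
kings <- paste("King", suits)

# all ordered pairs of distinct cards: the 52*51 = 2652 permutations
pairs <- expand.grid(first = deck, second = deck, stringsAsFactors = FALSE)
pairs <- pairs[pairs$first != pairs$second, ]

# multiplication rule in proportion form: Pr(B | A) = Pr(A and B) / Pr(A)
mean(pairs$first %in% kings & pairs$second %in% kings) /
  mean(pairs$first %in% kings)   # 3/51
```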
Code: Probability of a natural 21 in blackjack
aces <- paste("Ace", suits)
facecard <- c("King", "Queen", "Jack", "Ten")
facecard <- expand.grid(number = facecard, suit = suits)
facecard <- paste(facecard$number, facecard$suit)
hands <- combinations(52, 2, v=deck) # all possible hands
# probability of a natural 21 given that the ace is listed first in `combinations`
mean(hands[,1] %in% aces & hands[,2] %in% facecard)
# probability of a natural 21 checking for both ace first and ace second
mean((hands[,1] %in% aces & hands[,2] %in% facecard)|(hands[,2] %in%
aces & hands[,1] %in% facecard))
Code: Monte Carlo simulation of natural 21 in blackjack
Note that your exact values will differ because the process is random and the seed
is not set.
# code for one hand of blackjack
hand <- sample(deck, 2)
hand
# code for B=10,000 hands of blackjack
B <- 10000
results <- replicate(B, {
hand <- sample(deck, 2)
(hand[1] %in% aces & hand[2] %in% facecard) | (hand[2] %in%
aces & hand[1] %in% facecard)
})
mean(results)
Section 1: Discrete Probability 1.2 Combinations and Permutations The Birthday Problem
RAFAEL IRIZARRY: Suppose you're in a classroom with 50 people.
If we assume this is a randomly selected group,
what is the chance that at least two people have the same birthday?
Although it is somewhat advanced, we can actually deduce this mathematically,
and we do this later, but now, we're going to use Monte Carlo simulations.
For simplicity, we assume that nobody was born on February 29th.
This actually doesn't change the answer much.
All right, first, note that birthdays can be represented
as numbers between 1 and 365.
So a sample of 50 random birthdays can be obtained simply
using the sample function, like this.
To check if, in this particular set of 50 people we have at least two
with the same birthday, we can use the function duplicated,
which returns true whenever an element of a vector
has already appeared in that vector.
So here's an example.
If I type duplicated(c(1, 2, 3, 1, 4, 3, 5)),
we get TRUE for the 1 and the 3 the second time they appear.
So to check if two birthdays were the same,
we simply use any and duplicated functions, like this.
And in the case that we just generated, we actually do have this happening.
We did have two people, at least two people, with the same birthday.
We get true.
Now, to estimate the probability, we're going to repeat this experiment.
We're going to run a Monte Carlo simulation over and over again.
So what we do is we do it 10,000 times.
We use the replicate function like this.
And when we're done, we get that the probability
of having at least two people with the same birthday
in a group of 50 people is about 97%.
Were you expecting this to be so high?
People tend to underestimate these probabilities,
so it might be an opportunity for you to bet and make
some money off of your friends.
To get an intuition as to why this is so high,
think of what happens as you get close to 365 people.
At this stage, we run out of days, and the probability is 1.
In the next video, we're going to actually compute
this probability for different group sizes
and see how fast it gets close to 1.
Textbook link
Here is a link to the textbook section on the birthday problem.
Key points
duplicated() takes a vector and returns a vector of the same length
with TRUE for any elements that have appeared previously in that vector.
We can compute the probability of shared birthdays in a group of people by
modeling birthdays as random draws from the numbers 1 through 365. We
can then use this sampling model of birthdays to run a Monte Carlo
simulation to estimate the probability of shared birthdays.
Code: The birthday problem
# checking for duplicated bdays in one 50 person group
n <- 50
bdays <- sample(1:365, n, replace = TRUE) # generate n random birthdays
any(duplicated(bdays)) # check if any birthdays are duplicated
# Monte Carlo simulation with B=10000 replicates
B <- 10000
results <- replicate(B, { # returns vector of B logical values
  bdays <- sample(1:365, n, replace = TRUE)
  any(duplicated(bdays))
})
mean(results) # calculates proportion of groups with duplicated bdays
Section 1: Discrete Probability 1.2 Combinations and Permutations sapply
RAFAEL IRIZARRY: Say you want to use what you've just
learned about the birthday problem to bet
with friends about two people having the same birthday in a group of people.
When are the chances larger than 50%?
Larger than 75%?
Let's create a lookup table.
We can quickly create a function to compute this for any group.
We write the function like this.
We'll call it compute prob, and we'll basically
make the calculations for the probability
of two people having the same birthday.
We will use a small Monte Carlo simulation to do it.
Now that we've done this, we want to compute this function,
we want to apply this function to several values of n,
let's say from 1 to 60.
Let's define n as a sequence starting at 1 and ending at 60.
Now, we can use a for loop to apply this function to each value in n,
but it turns out that for loops are rarely the preferred approach in R.
In general, we try to perform operations on entire vectors.
Arithmetic operations, for example, operate
on vectors in an element wise fashion.
We saw this when we learned about R. So if we type x equals 1 through 10,
now x is the vector starting at 1 and ending at 10,
and we compute the square root of x, it actually
computes the square root for each element.
Equally, if we define y to be 1 through 10, and then multiply x by y,
it multiplies each element 1 by 1--
1 times 1, 2 times 2, et cetera.
So there's really no need for for loops.
But not all functions work this way.
You can't just send a vector to any function in R.
For example, the function we just wrote does not work element-wise
since it's expecting a scalar, it's expecting an n.
This piece of code does not do what we want.
If we type compute prob and send it the vector n, we will not get what we want.
We will get just one number.
What we can do instead is use the function sapply.
sapply permits us to perform element-wise operations
on any function.
Here's how it works.
We'll define a simple example for the vector 1 through 10.
If we want to apply the square roots of each element of x,
we can simply type sapply x comma square root,
and it'll apply square root to each element of x.
Of course, we don't need to do this because square root already does that,
but we are using it as a simple example.
So for our case, we can simply type prob equals sapply n-- n is our vector--
and then the function we define compute prob.
And this will assign to each element of prob
the probability of two people having the same birthday for that n.
And now we can very quickly make a plot.
We plot the probability of two people having the same birthday
against the size of the group.
Now, let's compute the exact probabilities
rather than use Monte Carlo simulations.
The function we just defined uses a Monte Carlo simulation,
but we can use what we've learned about probability theory
to compute the exact value.
Not only do we get the exact answer using math,
but the computations are much faster since we
don't have to generate experiments.
We simply get a number.
To make the math simpler for this particular problem,
instead of computing the probability of it happening,
we'll compute the probability of it not happening,
and then we can use the multiplication rule.
Let's start with the first person.
The probability that person 1 has a unique birthday is 1, of course.
All right.
Now let's move on to the second one.
The probability that the second person has
a unique birthday given that person 1 already took one of the days
is 364 divided by 365.
Then for person 3, given that the first two people already
have unique birthdays, that leaves 363 days.
So that probability is 363 divided by 365.
If we continue this way and find the chances of all,
say, 50 people having unique birthdays, we would multiply 1 times 364 divided
by 365, times 363 divided by 365, dot dot
dot, all the way to the 50th element.
Here's the equation.
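The equation shown on the slide is not reproduced in the transcript; written out for 50 people, it is:

```latex
\Pr(\text{all 50 birthdays unique}) =
  1 \times \frac{364}{365} \times \frac{363}{365} \times \cdots \times \frac{365 - 49}{365}
```

Subtracting this product from 1 gives the probability of at least one shared birthday.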
Now, we can easily write a function that does this.
This time we'll call it exact prob.
It takes n as an argument, and it computes
this probability using this simple code.
Now we can compute each probability for each n using sapply again, like this.
And if we plot it, we can see that the Monte Carlo
simulations were almost exactly right.
They were almost exact approximations of the actual value.
Now, notice had it not been possible to compute
the exact probabilities, something that sometimes happens,
we would have still been able to accurately estimate
the probabilities using Monte Carlo.
Textbook links
The textbook discussion of the basics of sapply() can be found in this textbook
section.
The textbook discussion of sapply() for the birthday problem can be found
within the birthday problem section.
Key points
Some functions automatically apply element-wise to vectors, such
as sqrt() and *.
However, other functions do not operate element-wise by default. This
includes functions we define ourselves.
The function sapply(x, f) allows any other function f to be applied
element-wise to the vector x.
The probability of an event happening is 1 minus the probability of that event
not happening:
Pr(event)=1−Pr(no event)
We can compute the probability of shared birthdays mathematically:
Pr(shared birthdays) = 1 − Pr(no shared birthdays)
= 1 − (1 × 364/365 × 363/365 × ... × (365 − n + 1)/365)
Code: Function for birthday problem Monte Carlo simulations
Note that the function body of compute_prob() is the code that we wrote in the
previous video. If we write this code as a function, we can use sapply() to apply
this function to several values of n.
# function to calculate probability of shared bdays across n people
compute_prob <- function(n, B = 10000) {
  same_day <- replicate(B, {
    bdays <- sample(1:365, n, replace = TRUE)
    any(duplicated(bdays))
  })
  mean(same_day)
}
n <- seq(1, 60)
Code: Element-wise operation over vectors and sapply
x <- 1:10
sqrt(x) # sqrt operates on each element of the vector
y <- 1:10
x*y # * operates element-wise on both vectors
compute_prob(n) # does not iterate over the vector n without sapply
x <- 1:10
sapply(x, sqrt) # this is equivalent to sqrt(x)
prob <- sapply(n, compute_prob) # element-wise application of compute_prob to n
plot(n, prob)
Code: Computing birthday problem probabilities with sapply
# function for computing exact probability of shared birthdays for any n
exact_prob <- function(n){
  prob_unique <- seq(365, 365 - n + 1)/365 # vector of fractions for mult. rule
  1 - prod(prob_unique) # calculate prob of no shared birthdays and subtract from 1
}
# applying function element-wise to vector of n values
eprob <- sapply(n, exact_prob)
# plotting Monte Carlo results and exact probabilities on same graph
plot(n, prob) # plot Monte Carlo results
lines(n, eprob, col = "red") # add line for exact prob
Section 1: Discrete Probability 1.2 Combinations and Permutations How Many Monte Carlo
Experiments are Enough?
RAFAEL IRIZARRY: In the examples we have seen,
we have used 10,000 Monte Carlo experiments.
It turns out that this provided very accurate estimates for the examples
we looked at.
In more complex calculations, 10,000 may not be nearly enough.
Also, for some calculations, 10,000 experiments
might not be computationally feasible, and in others it might be more than we need.
In practice, we won't know what the answer
is, so we don't know if our Monte Carlo estimate is accurate.
We know that the larger the number of experiments, we've
been using the letter B to represent that, the better the approximation.
But how big do we need it to be?
This is actually a challenging question, and answering it often requires
advanced theoretical statistics training.
One practical approach we will describe here
is to check for the stability of the estimate.
Here's an example with the birthday problem.
We're going to use n equals 22.
There's 22 people.
So we're going to run a simulation where we estimate
the probability of two people having the same birthday using different sizes
of the Monte Carlo simulation.
So the value of B is going to go from 10, to 20, to 40, to 100, et cetera.
We compute the simulation, and now we look at the values
that we get for each simulation.
Remember, each simulation has a different B,
a different number of experiments.
When we see this graph, we can see that it's wiggling up and down.
That's because the estimate is not stable yet.
It's not such a great estimate.
But as B gets bigger and bigger, eventually it starts to stabilize.
And that's when we start getting a feeling for the fact
that now perhaps we have a large enough number of experiments.
Textbook link
Here is a link to the matching textbook section.
Key points
The larger the number of Monte Carlo replicates B, the more accurate the
estimate.
Determining the appropriate size for B can require advanced statistics.
One practical approach is to try many sizes for B and look for sizes that
provide stable estimates.
Code: Estimating a practical value of B
This code runs Monte Carlo simulations to estimate the probability of shared
birthdays using several B values and plots the results. When B is large enough that
the estimated probability stays stable, then we have selected a useful value of B.
B <- 10^seq(1, 5, len = 100) # defines vector of many B values
compute_prob <- function(B, n = 22){ # function to run Monte Carlo simulation with each B
  same_day <- replicate(B, {
    bdays <- sample(1:365, n, replace = TRUE)
    any(duplicated(bdays))
  })
  mean(same_day)
}
prob <- sapply(B, compute_prob) # apply compute_prob to many values of B
plot(log10(B), prob, type = "l") # plot a line graph of estimates
Section 1: Discrete Probability 1.3 Addition Rule and Monty Hall The Addition Rule
RAFAEL IRIZARRY: Earlier, we showed you how
to compute the probability of a natural 21 in blackjack.
This is getting a face card and an ace in your first draw.
Here, we're going to show you the addition
rule, which gives you another way to compute this probability.
The addition rule tells us that the probability of A or B, right,
we're going to have to make this calculation now,
because you can get to 21 in two ways.
You can get either a face card and then an ace.
Or you can get an ace and then a face card.
So we're asking what's the probability of that or?
A or B?
The addition rule tells us the probability of A or B
is the probability of A plus the probability of B minus the probability of A and B.
To understand why this makes sense, think of a Venn diagram.
If we're computing the probability of this whole thing happening, A or B,
we can add the blue circle plus the pink circle,
and then subtract the middle because we have added it twice
by adding the blue plus the pink.
So it makes sense-- the addition rule makes a lot of sense.
So now let's apply it to the natural 21.
In the case of natural 21, the intersection is empty.
Since both hands can't happen, you can't have both an ace and then a face card,
and then at the same time have a face card and then an ace.
Those two things can't happen at the same time.
So this will be a very easy application of the addition rule.
The probability of an ace followed by a face card we know
is 1 over 13 times 16 divided by 51.
And the probability of a face card followed by an ace
is 16 over 52 times 4 over 51.
These are actually the same, which makes sense due to symmetry.
These two values are actually the same.
In any case, we get the same result that we got before
for the natural 21, which is about 0.05.
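As a quick sanity check (not from the video), the two terms can be added directly in R:

```r
# Pr(ace then facecard) + Pr(facecard then ace); the two terms are equal by symmetry
4/52 * 16/51 + 16/52 * 4/51
# about 0.0483
```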
Textbook link
Here is a link to the textbook section on the addition rule.
Clarification
By "facecard", the professor means a card with a value of 10 (K, Q, J, 10).
Key points
The addition rule states that the probability of event A or event B happening
is the probability of event A plus the probability of event B minus the
probability of both events A and B happening together.
Pr(A or B)=Pr(A)+Pr(B)−Pr(A and B)
Note that (A or B) is the union of events A and B, sometimes written (A ∪ B).
Example: The addition rule for a natural 21 in blackjack
We apply the addition rule where A = drawing an ace then a facecard and B =
drawing a facecard then an ace. Note that in this case, both events A and B cannot
happen at the same time, so Pr(A and B)=0.
Pr(ace then facecard) = 4/52 × 16/51
Pr(facecard then ace) = 16/52 × 4/51
Pr(ace then facecard or facecard then ace) = 4/52 × 16/51 + 16/52 × 4/51 ≈ 0.0483
Section 1: Discrete Probability 1.3 Addition Rule and Monty Hall The Monty Hall Problem
RAFAEL IRIZARRY: We're going to look at one last example
from discrete probability.
It's the Monty Hall problem.
In the 1970s, there was a game show called Let's Make a Deal.
Monty Hall was the host-- this is where the name of the problem comes from.
At some point in the game, contestants were asked to pick one of three doors.
Behind one door, there was a prize.
The other two had a goat behind them.
And this basically meant that you lost.
If the contestant did not pick the prize door on his or her first try,
Monty Hall would open one of the two remaining doors
and show the contestant that there was no prize behind that door.
So you're left with two doors, the one you picked
and one door that you do not know what's behind it.
So then Monty Hall would ask, do you want to switch doors?
What would you do?
We can use probability to show that if you stick to the original door,
your chances of winning a car or whatever big prize are 1 in 3.
But if you switch, your chances double to 2 in 3.
This seems counterintuitive.
Many people incorrectly think both chances are 1 in 2,
since there's only two doors left and there's a prize behind one of them.
You can find detailed explanations-- we're
going to provide links of the mathematics of how you can calculate
that this is, in fact, wrong, that you have a higher chance if you switch.
But here, we're going to use a Monte Carlo simulation to show you
that this is the case.
And this will help you understand how it all works.
Note that the code we're going to show you is longer than it needs to be,
but we're using it as an illustration to help
you understand the logic behind what actually happens here.
So let's start by creating a simulation that imitates the strategy of sticking
to the same door.
Let's go through the code.
We're going to do 10,000 experiments.
So we're going to do this over and over 10,000 times.
First thing we do is we assign the prize to a door.
So the prize is going to be in one of the three doors that we have created.
Then that's going to be in prize door.
Prize door contains the number of the door with the prize behind it.
Then we're going to imitate your pick by taking a random sample of these three
doors.
Now, in the next step, we're going to decide which door you're shown.
You're going to be shown not your door and not the door with a prize.
You're going to be shown the other one.
You stick to your door.
That's what you stick to.
Nothing changed.
All this that we did right above doesn't matter,
because you're sticking to your door.
So now we do this over and over again, and at the end,
we ask is the door you chose, the one you stuck to, is that the prize door?
How often does this happen?
We know it's going to be a 1/3, because none of this procedure
changed anything.
You started by picking one out of three, and nothing changed.
And now you are asked, is the one I picked the prize door?
If we run the simulation, we actually see it happening.
We ran it 10,000 times and the probability of winning
was 0.3357, about 1 in 3.
Now, let's repeat the exercise, but consider the switch strategy.
We start the same way.
We have three doors.
We assign the prizes to the doors at random.
Now, we know which one has the good prize, the car,
but we don't tell the contestant.
So the contestant has to pick one.
It's basically a random pick.
That's what my pick is.
And now, we're going to show the contestant one door.
It can't be the one they chose.
And it can't be the one with the car.
So that only leaves one door, the door with nothing behind it
that was not chosen.
So now, what you're going to do is you're going to switch.
You're going to switch to the door that they didn't show you,
because the one that they did show you had nothing behind it,
so basically what's happening is you are switching from the original that
had a 1 in 3 chances of being the one to whatever is the other option, which
has to have a 2 in 3 chance.
So if we run the simulation, we actually confirm that.
We get that the proportion of times we win is 0.6717, which is about 2/3.
The Monte Carlo estimate confirms the 2 out of 3 calculation.
Textbook section
Here is the textbook section on the Monty Hall Problem.
Error in Monty Hall explanation
At 0:32, the professor incorrectly says that Monty Hall opens one of the two
remaining doors only if the contestant did not pick the prize door on the first try. In
the actual problem, Monty Hall always opens one of the two remaining doors
(never revealing the prize). The Monte Carlo simulation code is correct for the
actual problem.
Key points
Monte Carlo simulations can be used to simulate random outcomes, which
makes them useful when exploring ambiguous or less intuitive problems like
the Monty Hall problem.
In the Monty Hall problem, contestants choose one of three doors that may
contain a prize. Then, one of the doors that was not chosen by the
contestant and does not contain a prize is revealed. The contestant can
then choose whether to stick with the original choice or switch to the
remaining unopened door.
Although it may seem intuitively like the contestant has a 1 in 2 chance of
winning regardless of whether they stick or switch, Monte Carlo simulations
demonstrate that the actual probability of winning is 1 in 3 with the stick
strategy and 2 in 3 with the switch strategy.
For more on the Monty Hall problem, you can watch a detailed explanation
here or read an explanation here.
Code: Monte Carlo simulation of stick strategy
B <- 10000
stick <- replicate(B, {
doors <- as.character(1:3)
prize <- sample(c("car","goat","goat")) # puts prizes in random order
prize_door <- doors[prize == "car"] # note which door has prize
my_pick <- sample(doors, 1) # note which door is chosen
show <- sample(doors[!doors %in% c(my_pick, prize_door)], 1) # open door with no prize that isn't chosen
stick <- my_pick # stick with original door
stick == prize_door # test whether the original door has the prize
})
mean(stick) # probability of choosing prize door when sticking
Code: Monte Carlo simulation of switch strategy
switch <- replicate(B, {
doors <- as.character(1:3)
prize <- sample(c("car","goat","goat")) # puts prizes in random order
prize_door <- doors[prize == "car"] # note which door has prize
my_pick <- sample(doors, 1) # note which door is chosen first
show <- sample(doors[!doors %in% c(my_pick, prize_door)], 1) # open door with no prize that isn't chosen
switch <- doors[!doors %in% c(my_pick, show)] # switch to the door that wasn't chosen first or opened
switch == prize_door # test whether the switched door has the prize
})
mean(switch) # probability of choosing prize door when switching
Section 2: Continuous Probability 2.1 Continuous Probability Continuous Probability
RAFAEL IRIZARRY: Earlier we explained why
when summarizing a list of numeric values such as heights,
it's not useful to construct a distribution that assigns
a proportion to each possible outcome.
Note, for example, that if we measure every single person
in a very large population with extremely high precision,
because no two people are exactly the same height,
we would need to assign a proportion to each observed value
and attain no useful summary at all.
Similarly when defining probability distributions,
it is not useful to assign a very small probability to every single height.
Just like when using distributions to summarize numeric data,
it is much more practical to define a function that operates
on intervals rather than single values.
The standard way of doing this is using the cumulative distribution function.
We previously described the empirical cumulative distribution function--
eCDF--
as a basic summary of a list of numeric values.
As an example, we define the eCDF for heights for male adult students.
Here, we define a vector x containing the male heights to use as an example.
We use this little piece of code.
Now we can define the empirical distribution function
like this very simply.
We just count the number of cases where x is
smaller or equal to a and divide by n.
We take the mean, that's the proportion of cases.
So for every value of a, this gives a proportion of values in the list
x that are smaller or equal to a.
Note that we have not yet introduced probability.
We've been talking about lists of numbers and summarizing these lists.
So let's introduce probability.
For example, let's ask if I pick one of the male students
at random, what is the chance that he is taller than 70.5 inches?
Because every student has the same chance of being picked,
the answer to this question is the proportion
of students that are taller than 70.5.
Using the eCDF, we obtain the answer.
We simply type 1 minus F of 70.5, and we get about 37%.
Once the CDF is defined, we can use this to compute
the probability of any subset.
For example, the probability of a student being
between the height a and the height b is simply f of b minus f of a.
Because we can compute the probability for any possible event this way,
the cumulative probability function defines a probability distribution
for picking a height at random from our vector of heights x.
Textbook links
This video corresponds to the textbook section on continuous probability.
The previous discussion of CDF is from the Data Visualization course. Here is
the textbook section on the CDF.
Key points
The cumulative distribution function (CDF) is a distribution function for
continuous data x that reports the proportion of the data below a for all
values of a:
F(a)=Pr(x≤a)
The CDF is the probability distribution function for continuous variables. For
example, to determine the probability that a male student is taller than 70.5
inches given a vector of male heights x, we can use the CDF:
Pr(x>70.5)=1−Pr(x≤70.5)=1−F(70.5)
The probability that an observation is in between two values a,b is F(b)
−F(a).
Code: Cumulative distribution function
Define x as male heights from the dslabs heights dataset:
library(tidyverse)
library(dslabs)
data(heights)
x <- heights %>% filter(sex=="Male") %>% pull(height)
Given a vector x, we can define a function for computing the CDF of x using:
F <- function(a) mean(x <= a)
1 - F(70.5) # probability of male taller than 70.5 inches
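The interval rule F(b) − F(a) can be sketched the same way. Since the dslabs heights data may not be available everywhere, this example uses simulated heights (a hypothetical normal with mean 69 and SD 3):

```r
set.seed(1)
x <- rnorm(1000, 69, 3)       # simulated male heights in inches (assumed parameters)
F <- function(a) mean(x <= a) # empirical CDF of the simulated data
F(72) - F(68)                 # proportion of heights between 68 and 72 inches
```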
Section 2: Continuous Probability 2.1 Continuous Probability Theoretical Distribution
RAFAEL IRIZARRY: In the data visualization module,
we introduced the normal distribution as a useful approximation
to many naturally occurring distributions,
including that of height.
The cumulative distribution for the normal distribution
is defined by a mathematical formula, which in R can
be obtained with the function pnorm.
We say that a random quantity is normally distributed with average, avg,
and standard deviation, s, if its probability distribution
is defined by f of a equals pnorm a, average, s.
This is the code.
This is useful, because if we are willing to use the normal approximation
for say, height, we don't need the entire dataset
to answer questions such as, what is the probability that a randomly selected
student is taller than 70.5 inches.
We just need the average height and the standard deviation.
Then we can use this piece of code, 1 minus pnorm 70.5 mean of x, sd of x.
And that gives us the answer of 0.37.
The normal distribution is derived mathematically.
Apart from computing the average and the standard deviation,
we don't use data to define it.
Also the normal distribution is defined for continuous variables.
It is not described for discrete variables.
However, for practicing data scientists, pretty much everything we do
involves data, which is technically speaking discrete.
For example, we could consider our adult data categorical
with each specific height a unique category.
The probability distribution would then be
defined by the proportion of students reporting each of those unique heights.
Here is what a plot of that would look like.
This would be the distribution function for those categories.
So each reported height gets a probability
defined by the proportion of students reporting it.
Now while most students rounded up their height to the nearest inch,
others reported values with much more precision.
For example, one student reported his height to be 69.6850393700787.
What is that about?
What's that very, very precise number?
Well, it turns out, that's 177 centimeters.
So the student converted it to inches, and copied and pasted the result
into the place where they had to report their heights.
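The conversion is easy to verify, since one inch is 2.54 centimeters:

```r
177 / 2.54 # 177 centimeters expressed in inches
# about 69.685
```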
The probability assigned to this height is about 0.001.
It's 1 in 708.
However, the probability for 70 inches is 0.12,
much higher than the probability assigned to the more precise value.
But does it really make sense to think that the probability of being exactly 70
inches is so much higher than the probability of being 69.68?
Clearly, it is much more useful for data analytic purposes
to treat this outcome as a continuous numeric variable.
But keeping in mind that very few people, perhaps none,
are exactly 70 inches.
But rather, that people rounded to the nearest inch.
With continuous distributions, the probability of a singular value
is not even defined.
For example, it does not make sense to ask
what is the probability that a normally distributed value is 70.
Instead, we define probabilities for intervals.
So we could ask instead, what is a probability that someone
is between 69.99 and 70.01.
In cases like height in which the data is rounded,
the normal approximation is particularly useful
if we deal with intervals that include exactly one round number.
So for example, the normal distribution is
useful for approximating the proportion of students
reporting between 69.5 and 70.5.
Here are three other examples.
Look at the numbers that are being reported.
This is using the data, the actual data, not the approximation.
Now look at what we get when we use the approximation.
We get almost the same values.
For these particular intervals, the normal approximation is quite useful.
However, the approximation is not that useful for other intervals.
For example, those that don't include an integer.
Here are two examples.
If we use these two intervals, again, this
is the data, look at the approximations now with the normal distribution.
They're not that good.
In general, we call this situation discretization.
Although the true height distribution is continuous,
the reported heights tend to be more common at discrete values,
in this case, due to rounding.
As long as we are aware of how to deal with this reality,
the normal approximation can still be a very useful tool.
Textbook link
This video corresponds to the textbook section on the theoretical distribution and
the normal approximation.
Key points
pnorm(a, avg, s) gives the value of the cumulative distribution
function F(a) for the normal distribution defined by average avg and
standard deviation s.
We say that a random quantity is normally distributed with average avg and
standard deviation s if the approximation pnorm(a, avg, s) holds for all
values of a.
If we are willing to use the normal approximation for height, we can estimate
the distribution simply from the mean and standard deviation of our values.
If we treat the height data as categorical, with each unique height its own
category, the summary is not very useful: integer values are far more common
than expected because of rounding. This is called discretization.
With rounded data, the normal approximation is particularly useful when
computing probabilities of intervals of length 1 that include exactly one
integer.
Code: Using pnorm() to calculate probabilities
Given male heights x:
library(tidyverse)
library(dslabs)
data(heights)
x <- heights %>% filter(sex=="Male") %>% pull(height)
We can estimate the probability that a male is taller than 70.5 inches using:
1 - pnorm(70.5, mean(x), sd(x))
Code: Discretization and the normal approximation
# plot distribution of exact heights in data
plot(prop.table(table(x)), xlab = "a = Height in inches", ylab = "Pr(x = a)")
# probabilities in actual data over length 1 ranges containing an integer
mean(x <= 68.5) - mean(x <= 67.5)
mean(x <= 69.5) - mean(x <= 68.5)
mean(x <= 70.5) - mean(x <= 69.5)
# probabilities in normal approximation match well
pnorm(68.5, mean(x), sd(x)) - pnorm(67.5, mean(x), sd(x))
pnorm(69.5, mean(x), sd(x)) - pnorm(68.5, mean(x), sd(x))
pnorm(70.5, mean(x), sd(x)) - pnorm(69.5, mean(x), sd(x))
# probabilities in actual data over other ranges don't match normal approx as well
mean(x <= 70.9) - mean(x <= 70.1)
pnorm(70.9, mean(x), sd(x)) - pnorm(70.1, mean(x), sd(x))
Section 2: Continuous Probability 2.1 Continuous Probability Probability Density
RAFAEL IRIZARRY: For categorical data, we
can define the probability of a category.
For example, a roll of a die, let's call it x, can be 1, 2, 3, 4, 5, or 6.
The probability of 4 is defined as probability of x equals 4 is 1/6.
The CDF can easily be defined by simply adding up probability.
So F of 4 is the probability of x being less than or equal to 4,
which is the probability of x being 4, or 3, or 2, or 1.
So we just add up those probabilities.
In contrast, for continuous distributions,
the probability of a single value is not defined.
However, there is a theoretical definition
that has a similar interpretation.
This has to do with the probability density.
The probability density at x is defined as the function,
we're going to call it little f of x, such that the probability distribution
big F of a, which is the probability of x being less than or equal to a,
is the integral of all values up to a of little f of x dx.
For those that know calculus, remember, that the integral is related to a sum.
It's the sum of bars with widths that are approximating 0.
If you don't know calculus, you can think of little
f of x as a curve for which the area under the curve up to the value a
gives you the probability of x being less than or equal to a.
For example, to use the normal approximation
to estimate the probability of someone being taller than 76 inches,
we can use the probability density.
So this is a mathematical formula.
Mathematically, the gray area in this figure
represents the probability of x being bigger than 76.
The curve you see is a probability density function
for the normal distribution.
In R you get the probability density function for the normal distribution
using the function dnorm.
D stands for density.
Although it may not be immediately obvious
why knowing about probability densities is useful, understanding this concept
will be essential to those wanting to fit models
to data for which predefined functions are not available.
Textbook link
This video corresponds to the textbook section on probability density.
Key points
The probability of a single value is not defined for a continuous distribution.
The quantity with the most similar interpretation to the probability of a single
value is the probability density function f(x).
The probability density f(x) is defined such that the integral of f(x) over a
range gives the CDF of that range.
F(a) = Pr(X ≤ a) = ∫_{−∞}^{a} f(x) dx
In R, the probability density function for the normal distribution is given
by dnorm(). We will see uses of dnorm() in the future.
Note that dnorm() gives the density function and pnorm() gives the
distribution function, which is the integral of the density function.
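A small check of that relationship (not from the course materials): numerically integrating dnorm() recovers pnorm():

```r
# the area under the standard normal density up to 1.96 equals the CDF at 1.96
integrate(dnorm, -Inf, 1.96)$value # about 0.975
pnorm(1.96)                        # about 0.975
```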
Section 2: Continuous Probability 2.1 Continuous Probability Plotting the Probability Density
Plotting the probability density for the normal distribution
We can use dnorm() to plot the density curve for the normal
distribution. dnorm(z) gives the probability density f(z) of a certain z-score, so we
can draw a curve by calculating the density over a range of possible values of z.
First, we generate a series of z-scores covering the typical range of the normal
distribution. Since we know 99.7% of observations will be within −3≤z≤3, we
can use a value of z slightly larger than 3 and this will cover most likely values of
the normal distribution. Then, we calculate f(z), which is dnorm() of the series of z-
scores. Last, we plot z against f(z).
library(tidyverse)
x <- seq(-4, 4, length = 100)
data.frame(x, f = dnorm(x)) %>%
ggplot(aes(x, f)) +
geom_line()
Here is the resulting plot:
Note that dnorm() gives densities for the standard normal distribution by default.
Densities for alternative normal distributions with mean mu and standard
deviation sigma can be evaluated with:
dnorm(z, mu, sigma)
Section 2: Continuous Probability 2.1 Continuous Probability Monte Carlo Simulations
RAFAEL IRIZARRY: In this video, we're going
to show you how to run Monte Carlo simulations using normally distributed
variables.
R provides a function to generate normally distributed outcomes.
Specifically, the rnorm function takes three arguments-- size; average,
which defaults to 0; standard deviation, which defaults to 1--
and produces these random numbers.
Here's an example of how we can generate data
that looks like our reported heights.
So if our reported heights are in the vector x, we compute their
length, their average, and their standard deviation, and then
use the rnorm function to generate the randomly distributed outcomes.
Not surprisingly, the distribution of these outcomes
looks normal because they were generated to look normal.
This is one of the most useful functions in R,
as it will permit us to generate data that mimics naturally occurring events,
and it'll let us answer questions related
to what could happen by chance by running Monte Carlo simulations.
For example, if we pick 800 males at random,
what is the distribution of the tallest person?
Specifically, we could ask, how rare is it that the tallest person is a seven
footer?
We can use the following Monte Carlo simulation to answer this question.
We're going to run 10,000 simulations, and for each one,
we're going to generate 800 normally distributed values,
pick the tallest one, and return that.
The tallest variable will have these values.
So now we can ask, what proportion of these simulations
return a seven footer as the tallest person?
And we can see that it's a very small number.
Textbook link
This video corresponds to the textbook section on Monte Carlo simulations for
continuous variables.
Key points
rnorm(n, avg, s) generates n random numbers from the normal
distribution with average avg and standard deviation s.
By generating random numbers from the normal distribution, we can
simulate height data with similar properties to our dataset. Here we generate
simulated height data using the normal distribution.
Code: Generating normally distributed random numbers
# define x as male heights from dslabs data
library(tidyverse)
library(dslabs)
data(heights)
x <- heights %>% filter(sex=="Male") %>% pull(height)
# generate simulated height data using normal distribution - both datasets should have n observations
n <- length(x)
avg <- mean(x)
s <- sd(x)
simulated_heights <- rnorm(n, avg, s)
# plot distribution of simulated_heights
data.frame(simulated_heights = simulated_heights) %>%
ggplot(aes(simulated_heights)) +
geom_histogram(color="black", binwidth = 2)
Code: Monte Carlo simulation of tallest person over 7 feet
B <- 10000
tallest <- replicate(B, {
simulated_data <- rnorm(800, avg, s) # generate 800 normally distributed random heights
max(simulated_data) # determine the tallest height
})
mean(tallest >= 7*12) # proportion of times that tallest person exceeded 7 feet (84 inches)
Section 2: Continuous Probability 2.1 Continuous Probability Other Continuous Distributions
RAFAEL IRIZARRY: The normal distribution is not the only useful
theoretical distribution.
Other continuous distributions that we may encounter
are student-t, the chi-squared, the exponential, the gamma,
and the beta distribution.
R provides functions to compute the density, the quantiles,
the cumulative distribution function, and to generate
Monte Carlo simulations for all these distributions.
R uses a convention that lets us remember the names of these functions.
Namely, it uses the letters d for density, q
for quantile, p for the cumulative distribution function, and r for random.
Putting these letters in front of a shorthand for the distribution
gives us the names of these useful functions.
We have already seen the functions dnorm, pnorm, and rnorm.
So these are examples of what we just described.
These are for the normal distribution.
Norm is the nickname or shorthand for the normal distribution.
The function qnorm gives us the quantiles.
For example, we can use the dnorm function to generate this plot.
This is the density function for the normal distribution.
We use the function dnorm.
For the student's t distribution, which has shorthand t,
we can use the functions dt, qt, pt, and rt
to compute the density, the quantiles, the cumulative distribution
function, or to run a Monte Carlo simulation.
Textbook link
This video corresponds to the textbook section on other continuous distributions.
Key points
You may encounter other continuous distributions (Student t, chi-squared,
exponential, gamma, beta, etc.).
R provides functions for density (d), quantile (q), probability distribution (p)
and random number generation (r) for many of these distributions.
Each distribution has a matching shorthand (for example, norm or t)
that is paired with the function prefixes (d, p, q, r) to create
the appropriate function names.
For example, use rt() to generate random numbers for a Monte Carlo
simulation using the Student t distribution.
Code: Plotting the normal distribution with dnorm
Use d to plot the density function of a continuous distribution. Here is the density
function for the normal distribution (shorthand norm):
x <- seq(-4, 4, length.out = 100)
data.frame(x, f = dnorm(x)) %>%
ggplot(aes(x,f)) +
geom_line()
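The same convention extends to the other distributions; a short sketch using the four Student t functions (df = 5 is just an illustrative choice of degrees of freedom):

```r
# The d/p/q/r convention applied to the Student t distribution (shorthand t)
dt(0, df = 5)      # density at 0
pt(2, df = 5)      # Pr(T <= 2), the cumulative distribution function
qt(0.975, df = 5)  # 97.5th percentile
rt(3, df = 5)      # three random draws for a Monte Carlo simulation
```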
Exercise HIGHEST IQ code
# The variable `B` specifies the number of times we want the simulation to run.
B <- 1000
# Use the `set.seed` function to make sure your answer matches the expected result after random number generation.
set.seed(1)
# Create an object called `highestIQ` that contains the highest IQ score from each random distribution of 10,000 people.
highestIQ <- replicate(B, {
simulated_data <- rnorm(10000, 100, 15)
max(simulated_data)
})
# Make a histogram of the highest IQ scores.
hist(highestIQ)
Questions 1 and 2: ACT scores, part 1
The ACT is a standardized college admissions test used in the United States. The
four multi-part questions in this assessment all involve simulating some ACT test
scores and answering probability questions about them.
For the three year period 2016-2018, ACT standardized test scores were
approximately normally distributed with a mean of 20.9 and standard deviation of
5.7. (Real ACT scores are integers between 1 and 36, but we will ignore this detail
and use continuous values instead.)
First we'll simulate an ACT test score dataset and answer some questions about it.
Set the seed to 16, then use rnorm() to generate a normal distribution of 10000
tests with a mean of 20.9 and standard deviation of 5.7. Save these values
as act_scores. You'll be using this dataset throughout these four multi-part
questions.
(IMPORTANT NOTE! If you use R 3.6 or later, you will need to use the
command format set.seed(x, sample.kind = "Rounding") instead
of set.seed(x). Your R version will be printed at the top of the Console
window when you start RStudio.)
Question 1
What is the mean of act_scores?
set.seed(16, sample.kind = "Rounding")
act_scores <- rnorm(10000, 20.9, 5.7)
mean(act_scores)
Question 1c
A perfect score is 36 or greater (the maximum reported score is 36).
In act_scores, how many perfect scores are there out of 10,000 simulated tests?
41
Explanation
The number of perfect scores can be found using the following code:
sum(act_scores >= 36)
Question 1d
In act_scores, what is the probability of an ACT score greater than 30?
Explanation
The probability can be found using the following code:
mean(act_scores > 30)
Question 2
Set x equal to the sequence of integers 1 to 36. Use dnorm to determine the value
of the probability density function over x given a mean of 20.9 and standard
deviation of 5.7; save the result as f_x. Plot x against f_x.
Which of the following plots is correct?
Explanation
The second plot, generated using the following code, is correct:
x <- 1:36
f_x <- dnorm(x, 20.9, 5.7)
data.frame(x, f_x) %>%
ggplot(aes(x, f_x)) +
geom_line()
The first plot is the distribution function rather than the density function. The third
plot fails to include the mean and standard deviation in the dnorm call. The fourth
plot is of Z-score values.
Questions 3 and 4: ACT scores, part 2
In this 3-part question, you will convert raw ACT scores to Z-scores and answer
some questions about them.
Convert act_scores to Z-scores. Recall from Data Visualization (the second course
in this series) that to standardize values (convert values into Z-scores, that is,
values distributed with a mean of 0 and standard deviation of 1), you must subtract
the mean and then divide by the standard deviation. Use the mean and standard
deviation of act_scores, not the original values used to generate random test
scores.
Question 3a
What is the probability of a Z-score greater than 2 (2 standard deviations above the
mean)?
0.0233
Code
z_scores <- (act_scores - mean(act_scores))/sd(act_scores)
mean(z_scores > 2)
Question 3b
What ACT score value corresponds to 2 standard deviations above the mean (Z = 2)?
32.2
Explanation
The score value can be calculated using the following code:
2*sd(act_scores) + mean(act_scores)
Question 3c
A Z-score of 2 corresponds roughly to the 97.5th percentile.
Use qnorm() to determine the 97.5th percentile of normally distributed data with
the mean and standard deviation observed in act_scores.
What is the 97.5th percentile of act_scores?
32.0
Explanation
The 97.5th percentile can be calculated using the following code:
qnorm(.975, mean(act_scores), sd(act_scores))
In this 4-part question, you will write a function to create a CDF for ACT scores.
Write a function that takes a value and produces the probability of an ACT score
less than or equal to that value (the CDF). Apply this function to the range 1 to 36.
Question 4a
What is the minimum integer score such that the probability of that score or lower is at
least .95?
Your answer should be an integer 1-36.
31
Explanation
The minimum score can be calculated using the following code:
cdf <- sapply(1:36, function (x){
mean(act_scores <= x)
})
min(which(cdf >= .95))
Question 4b
Use qnorm() to determine the expected 95th percentile, the value for which the
probability of receiving that score or lower is 0.95, given a mean score of 20.9 and
standard deviation of 5.7.
What is the expected 95th percentile of ACT scores?
30.3
Explanation
The expected 95th percentile can be calculated using the following code:
qnorm(.95, 20.9, 5.7)
Question 4c
As discussed in the Data Visualization course, we can use quantile() to
determine sample quantiles from the data.
Make a vector containing the quantiles for p <- seq(0.01, 0.99, 0.01) , the 1st
through 99th percentiles of the act_scores data. Save these
as sample_quantiles.
In what percentile is a score of 26?
Your answer should be an integer (e.g. 60), not a percent or fraction. Note that a score between the 98th
and 99th percentile should be considered the 98th percentile, for example, and that quantile numbers are
used as names for the vector sample_quantiles .
82
Explanation
The percentile for a score of 26 can be calculated using the following code:
p <- seq(0.01, 0.99, 0.01)
sample_quantiles <- quantile(act_scores, p)
names(sample_quantiles[max(which(sample_quantiles < 26))])
Question 4d
Make a corresponding set of theoretical quantiles using qnorm() over the
interval p <- seq(0.01, 0.99, 0.01) with mean 20.9 and standard deviation
5.7. Save these as theoretical_quantiles. Make a QQ-plot
graphing sample_quantiles on the y-axis versus theoretical_quantiles on
the x-axis.
Which of the following graphs is correct?
Explanation
The fourth graph is correct and can be generated using the following code:
p <- seq(0.01, 0.99, 0.01)
sample_quantiles <- quantile(act_scores, p)
theoretical_quantiles <- qnorm(p, 20.9, 5.7)
qplot(theoretical_quantiles, sample_quantiles) + geom_abline()
Section 3: Random Variables, Sampling Models, and the Central Limit Theorem
Section 3: Random Variables, Sampling Models, and the Central Limit Theorem 3.1 Random
Variables and Sampling Models Random Variables
RAFAEL IRIZARRY: Random variables are numeric outcomes
resulting from a random process.
We can easily generate random variables using
some of the simple examples we have shown,
such as the red and blue bead urn.
For example, define X to be 1 if a bead is blue, and 0 otherwise.
Here's the R code you can write to generate that random variable.
X is a random variable.
Every time we select a new bead, the outcome changes randomly.
Sometimes it's 1, sometimes it's 0.
Here's some examples of how that random variable changes.
We're going to do it three times.
Here is 0, here is 1, and here's 1 again.
In data science, we often deal with data that is affected by chance in some way.
The data comes from a random sample, the data is affected by measurement error,
or the data measures some outcome that is random in nature.
Being able to quantify the uncertainty introduced by randomness
is one of the most important jobs of a data scientist.
Statistical inference offers a framework for doing this, as well
as several practical tools.
The first step is to learn how to mathematically describe
random variables.
We start with games of chance as an illustrative example.
Textbook link
This video corresponds to the textbook section on random variables.
Key points
Random variables are numeric outcomes resulting from random processes.
Statistical inference offers a framework for quantifying uncertainty due to
randomness.
Code: Modeling a random variable
# define random variable x to be 1 if blue, 0 otherwise
beads <- rep(c("red", "blue"), times = c(2, 3))
x <- ifelse(sample(beads, 1) == "blue", 1, 0)
# demonstrate that the random variable is different every time
ifelse(sample(beads, 1) == "blue", 1, 0)
ifelse(sample(beads, 1) == "blue", 1, 0)
ifelse(sample(beads, 1) == "blue", 1, 0)
Section 3: Random Variables, Sampling Models, and the Central Limit Theorem 3.1 Random
Variables and Sampling Models Sampling Models
RAFAEL IRIZARRY: Many data-generation procedures,
those that produce data that we study, can be modeled
quite well as draws from an urn.
For example, we can model the process of polling likely voters as drawing 0's--
Republicans-- and 1's--
Democrats-- from an urn containing the 0 and 1 code for all likely voters.
We'll see that in more detail later.
In epidemiological studies, we often assume that the subjects in our study
are a random sample from the population of interest.
The data related to a specific outcome can be modeled as a random sample
from an urn containing the values for those outcomes
for the entire population of interest.
Similarly, in experimental research, we often assume that the individual
organisms we are studying-- for example, worms, flies, or mice--
are a random sample from a larger population.
Randomized experiments can also be modeled by draws from an urn,
given the way individuals are assigned to groups.
When getting assigned, you draw your group at random.
Sampling models are therefore ubiquitous in data science.
Casino games offer a plethora of examples of real-world situations
in which sampling models are used to answer specific questions.
We will therefore start with such examples.
OK, let's start with this.
Suppose a very small casino hires you to consult on
whether they should set up a roulette wheel.
They want to know if they can make money off it, or if it's too risky
and they might lose.
To keep the example simple, we will assume that 1,000 people will play,
and that the only game you can play is to bet on red or black.
The casino wants to predict how much money they will make or lose.
They want a range of values that are possible,
and in particular, they want to know, what is the chance of losing money?
If this probability is too high, they will
pass on installing roulette wheels, since they
can't take the risk, given that they need to pay their employees
and keep the lights on.
We're going to define a random variable, capital S, that will
represent the casino's total winnings.
Let's start by constructing the urn, the urn we use for our sampling model.
A roulette wheel has 18 red pockets, 18 black pockets, and 2 green ones.
So playing a color in one game of roulette
is equivalent to drawing from this urn.
Let's write some code.
There's 18 black, 18 red, and 2 green.
The 1,000 outcomes from 1,000 people playing
are independent draws from this urn.
If red comes up, the gambler wins, and the casino loses $1,
so we draw a negative 1.
Otherwise, the casino wins $1, and we draw a 1.
We can code 1,000 independent draws using the following code.
Here are the first 10 outcomes of these 1,000 draws.
Because we know the proportions of 1's and negative 1's, inside the urn,
we can generate the draws with one line of code, without defining color.
Here's that line of code.
We call this approach a sampling model, since we
are modeling the random behavior of a roulette with the sampling of draws
from an urn.
The total winnings, capital S, is simply the sum
of these 1,000 independent draws.
So here's a code that generates an example of S.
If you run that code over and over again,
you see that S changes every time.
This is, of course, because S is a random variable.
A very important and useful concept is the probability distribution
of the random variable.
The probability distribution of a random variable
tells us the probability of the observed value falling in any given interval.
So for example, if we want to know the probability that we lose money,
we're asking, what is the probability that S is in the interval S
less than 0?
Note that if we can define a cumulative distribution function--
let's call it f of a, which is equal to the probability of S being less than
or equal to a--
then we'll be able to answer any question related
to the probability of events defined by a random variable S,
including the event S less than 0.
We call f the random variable's distribution function.
We can estimate the distribution function for the random variable S
by using a Monte Carlo simulation to generate many, many realizations
of the random variable.
With the code we're about to write, we run
the experiment of having 1,000 people play roulette over and over.
Specifically, we're going to do this experiment 10,000 times.
Here's the code.
So now we're going to ask, in our simulation,
how often did we get sums smaller or equal to a?
We can get the answer by using this simple code.
This will be a very good approximation of f of a, our distribution function.
In fact, we can visualize the distribution
by creating a histogram showing the probability
f(b) minus f(a) for several intervals (a, b].
Here it is.
Now we can easily answer the casino's question,
how likely is it that we lose money?
We simply ask, how often was S, out of the 10,000 simulations, smaller than 0?
And the answer is, it was only 4.6% of the time.
So it's quite low.
From the histogram, we also see that that distribution
appears to be approximately normal.
If you make a Q-Q plot, you'll confirm that the normal approximation
is close to perfect.
If, in fact, the distribution is normal, then all we need to define
is the distribution's average and standard deviation.
Because we have the original values from which the distribution is created,
we can easily compute these.
The average is 52.5, and the standard deviation is 31.75.
If we add a normal density with this average and standard deviation
to the histogram we created earlier, we see that it matches very well.
This average and this standard deviation have special names.
They are referred to as the expected value
and the standard error of the random variable S.
We will say more about this in the next section.
It actually turns out that statistical theory provides a way
to derive the distribution of a random variable defined as independent draws
from an urn.
Specifically, in our example, we can show that S plus n divided by 2
follows what is known as a binomial distribution.
We therefore do not need to run Monte Carlo simulations,
nor use the normal approximation, to know the probability distribution of S.
We ran the simulation for illustrative purposes.
For the details of the binomial distribution,
you can consult any probability book, or even Wikipedia.
However, we will discuss an incredibly useful approximation
provided by mathematical theory that applies generally
to sums of averages of draws from any urn, the central limit theorem.
Textbook link and additional information
This video corresponds to the textbook section on sampling models.
You can read more about the binomial distribution here.
Key points
A sampling model models the random behavior of a process as the sampling of
draws from an urn.
The probability distribution of a random variable is the probability of the
observed value falling in any given interval.
We can define a CDF F(a)=Pr(S≤a) to answer questions related to the
probability of S being in any interval.
The average of many draws of a random variable is called its expected value.
The standard deviation of many draws of a random variable is called its standard
error.
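Since (S + n)/2, the number of casino wins, follows a binomial distribution, the chance of losing money can also be computed exactly with pbinom(); a hedged sketch:

```r
# (S + n)/2 counts the casino's wins out of n games: Binomial(n, 10/19).
# S < 0 is the same event as "wins <= n/2 - 1", since S = 2*wins - n.
n <- 1000
p_win <- 10/19
pbinom(n/2 - 1, n, p_win)  # exact probability the casino loses money
```

This agrees closely with the Monte Carlo estimate of about 4.6% from the simulation.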
Monte Carlo simulation: Chance of casino losing money on
roulette
We build a sampling model for the random variable S that represents the casino's
total winnings.
# sampling model 1: define urn, then sample
color <- rep(c("Black", "Red", "Green"), c(18, 18, 2)) # define the urn for the sampling model
n <- 1000
X <- sample(ifelse(color == "Red", -1, 1), n, replace = TRUE)
X[1:10]
# sampling model 2: define urn inside sample function by noting probabilities
x <- sample(c(-1, 1), n, replace = TRUE, prob = c(9/19, 10/19)) # 1000 independent draws
S <- sum(x) # total winnings = sum of draws
We use the sampling model to run a Monte Carlo simulation and use the results to
estimate the probability of the casino losing money.
n <- 1000 # number of roulette players
B <- 10000 # number of Monte Carlo experiments
S <- replicate(B, {
X <- sample(c(-1,1), n, replace = TRUE, prob = c(9/19, 10/19)) # simulate 1000 spins
sum(X) # determine total profit
})
mean(S < 0) # probability of the casino losing money
We can plot a histogram of the observed values of S as well as the normal density
curve based on the mean and standard deviation of S.
library(tidyverse)
s <- seq(min(S), max(S), length = 100) # sequence of 100 values across range of S
normal_density <- data.frame(s = s, f = dnorm(s, mean(S), sd(S))) # generate normal density for S
data.frame(S = S) %>% # make data frame of S for histogram
ggplot(aes(S, ..density..)) +
geom_histogram(color = "black", binwidth = 10) +
ylab("Probability") +
geom_line(data = normal_density, mapping = aes(s, f), color = "blue")
Section 3: Random Variables, Sampling Models, and the Central Limit Theorem 3.1 Random
Variables and Sampling Models Distributions versus Probability Distributions
RAFAEL IRIZARRY: Before we continue, let's
make a subtle, yet important distinction and connection
between the distribution of a list of numbers, which we covered in our data
visualization module and a probability distribution, which
we're talking about here.
Previously we described how any list of numbers, let's call it x1 through xn,
has a distribution.
The definition is quite straightforward.
We define capital F of a as a function that
answers the question, what proportion of the list is less than or equal to a.
Because they are useful summaries, when the distribution
is approximately normal, we define the average and the standard deviation.
These are defined with a straightforward operation of the list.
In R, we simply compute the average with mean and the standard deviation
with sd, for example.
A random variable x has a distribution function.
To define this, we do not need a list of numbers.
It's a theoretical concept.
In this case, to define the distribution,
we define capital F of a as a function that answers the question,
what is the probability that x is less than or equal to a.
There is no list of numbers.
However, if x is defined by drawing from an urn with numbers in it,
then there is a list, the list of numbers inside the urn.
In this case, the distribution of that list
is the probability distribution of x and the average and standard deviation
of that list are the expected value and standard errors of the random variable.
Another way to think about it that does not involve an urn
is to run a Monte Carlo simulation and generate
a very large list of outcomes of x.
These outcomes are a list of numbers.
The distribution of this list will be a very good approximation
of the probability distribution of x.
The longer the list, the better the approximation.
The average and standard deviation of this list
will approximate the expected value and standard error of the random variable.
Textbook link
This video corresponds to the textbook section on distributions versus probability
distributions.
Key points
A random variable X has a probability distribution function F(a) that
defines Pr(X≤a) over all values of a.
Any list of numbers has a distribution. The probability distribution function of a
random variable is defined mathematically and does not depend on a list of
numbers.
The results of a Monte Carlo simulation with a large enough number of
observations will approximate the probability distribution of X.
If a random variable is defined as draws from an urn:
The probability distribution function of the random variable is defined
as the distribution of the list of values in the urn.
The expected value of the random variable is the average of values
in the urn.
The standard error of one draw of the random variable is the
standard deviation of the values of the urn.
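These key points can be illustrated with the roulette urn from the previous section; a minimal sketch (the number of simulated draws is an arbitrary choice):

```r
# The urn for bets on red: the casino wins $1 on 20 pockets, loses $1 on 18
urn <- rep(c(1, -1), times = c(20, 18))
# A long Monte Carlo list of outcomes approximates the probability distribution
draws <- sample(urn, 100000, replace = TRUE)
mean(draws)  # approximates the expected value, (20 - 18)/38, about 0.05
sd(draws)    # approximates the standard error of one draw, about 1
```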
Section 3: Random Variables, Sampling Models, and the Central Limit Theorem 3.1 Random
Variables and Sampling Models Notation for Random Variables
RAFAEL IRIZARRY: Note that in statistical textbooks,
capital letters are used to denote random variables
and we follow this convention here.
Lower case letters are used for observed values.
You'll see some notation that includes both.
For example, you'll see events defined as capital X less than
or equal to lowercase x.
Here, capital X is a random variable, making the event random.
And little x is an arbitrary value and not random.
So for example, capital X might represent the number on a die roll--
that's a random value--
and little x will represent an actual value we see.
So in this case, the probability of capital X being equal to little x
is 1 in 6 regardless of the value of little x.
Note that this notation is a bit strange,
because we ask questions about probability,
since big x is not an observed quantity.
Instead, it's a random quantity that we will see in the future.
We can talk about what we expect it to be, what values are probable,
but not what it is.
But once we have the data, we do see a realization of big x,
so data scientists talk about what could have been,
but after we see what actually happened.
Textbook link
This video corresponds to the textbook section on notation for random variables.
Key points
Capital letters denote random variables (X) and lowercase letters denote observed
values (x).
In the notation Pr(X=x), we are asking how frequently the random variable X is
equal to the value x. For example, if x=6, this statement becomes Pr(X=6).
Section 3: Random Variables, Sampling Models, and the Central Limit Theorem 3.1 Random
Variables and Sampling Models Central Limit Theorem
RAFAEL IRIZARRY: The Central Limit Theorem--
or the CLT for short-- tells us that when the number of independent draws--
also called sample size--
is large, the probability distribution of the sum of these draws
is approximately normal.
Because sampling models are used for so many data generation processes,
the CLT is considered one of the most important mathematical insights
in history.
Previously, we discussed that if we know that the distribution
of a list of numbers is approximated by the normal distribution,
all we need to describe the list are the average and the standard deviation.
We also know that the same applies to probability distributions.
If a random variable has a probability distribution that
is approximated with the normal distribution,
then all we need to describe that probability distribution are
the average and the standard deviation,
referred to as the expected value and the standard error.
We have described sampling models for draws.
We will now go over the mathematical theory
that lets us approximate the probability distribution for the sum of draws.
Once we do this, we will be able to help the Casino predict
how much money they will make.
The same approach we use for sum of the draws
will be useful for describing the distribution of averages
and proportions, which we will need to understand, for example, how
polls work.
The first important concept to learn is the expected value.
In statistics books, it is common to use the letter capital E, like this, E of X
equals mu, to denote that the expected value of the random variable X is mu.
Mu is the Greek letter for m, which is the first letter in the word mean, a
synonym for average.
A random variable will vary around an expected value in a way
that if you take the average of many, many draws, the average of the draws
will approximate the expected value, getting closer and closer the more draws
you take.
A useful formula is that the expected value
of the random variable defined by one draw
is the average of the numbers in the urn.
For example, in our urn used to model betting on red on roulette,
we have twenty 1's and 18 negative 1's.
So the expected value is E of X equals 20 minus 18, divided by 38,
which is about $0.05.
It is a bit counterintuitive to say that X
varies around 0.05 when the only values it takes is 1 and minus 1.
An intuitive way to think about the expected value
is that if we play the game over and over,
the Casino wins, on average, $0.05 per game.
Our Monte Carlo Simulation confirms this.
Here we run a million games and we see that the mean of X, which is
a bunch of 1's and negative 1's, is about $0.05.
In general, if the urn has just two possible outcomes-- say, a and b,
with proportions p and 1 minus p respectively,
the average is a times p plus b times 1 minus p.
To see this, notice that if there are n beads in the urn,
then we have n times p a's and n times 1 minus p b's.
And because the average is the sum, we have n times a times p plus n times
b times 1 minus p divided by the total, which is n.
And we get the formula that we just saw.
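The derivation can be checked directly; for the roulette urn, a = 1, b = −1, and p = 20/38:

```r
# The formula a*p + b*(1 - p) matches the plain average of the urn
a <- 1
b <- -1
p <- 20/38
a * p + b * (1 - p)            # about 0.0526, the casino's edge per game
mean(rep(c(a, b), c(20, 18)))  # average of the numbers in the urn, same value
```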
Now, the reason we define the expected value
is because this mathematical definition turns out
to be useful for approximating the probability distribution of sums,
which in turn, is useful to describe the distribution of averages
and proportions.
The first useful fact is that the expected value of the sum of draws
is the number of draws times the average of the numbers in the urn.
So if 1,000 people play roulette, the Casino expects to win,
on average, 1,000 times $0.05, which is $50.
But this is an expected value.
How different can one observation be from the expected value?
The Casino really wants to know this.
What is the range of possibilities?
If negative numbers are too likely, we may not install the roulette wheels.
Statistical theory, once again, answers this question.
The standard error, or SE for short, gives us
an idea of the size of the variation around the expected value.
In statistics books, it is common to use SE of X
to denote the standard error of the random variable X.
If our draws are independent--
That's an important assumption-- then the standard error of the sum
is given by the equation, the square root of the number of draws,
times the standard deviation of the numbers in the urn.
Using the definition of standard deviation,
we can derive with a bit of math, that if an urn contains two values--
a and b, with proportions p and 1 minus p, respectively--
the standard deviation is the absolute value of b minus a times
the square root of p times 1 minus p.
So in our roulette example, the standard deviation of the values inside the urn
is |1 - (-1)| times the square root of 10/19 times 9/19,
or 0.9986 -- so practically 1.
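Plugging into the two-outcome formula |b − a| √(p(1 − p)), with a = 1, b = −1, and p = 20/38 the proportion of 1's, reproduces that value:

```r
a <- 1                          # casino wins $1
b <- -1                         # casino loses $1
p <- 20/38                      # proportion of 1's in the urn
s <- abs(b - a) * sqrt(p * (1 - p))
s                               # about 0.9986, practically 1
```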
The standard error tells us the typical difference between a random variable
and its expectation.
So because 1 draw is obviously the sum of 1 draw,
we can use a formula to calculate that the random variable defined by 1 draw
has an expected value of $0.05 and a standard error of about 1.
This makes sense since we either get a 1 or a minus 1
with 1 slightly favored over the minus 1.
Using the formula, the sum of 1,000 people playing
has standard error of about $32.
So when 1,000 people bet on red, the Casino
is expected to win $50 with a standard error of $32.
So it seems like a safe bet.
But we still really can't answer the question--
How likely is the Casino to lose money?
Here the Central Limit Theorem will help.
The Central Limit Theorem tells us that the distribution of the sum S
is approximated by a normal distribution.
Using the formula, we know that the expected value and standard errors
are $52 and $32, respectively.
Note that the theoretical values match those obtained with the Monte Carlo
simulation we ran earlier.
Using the Central Limit Theorem, we can skip the Monte Carlo simulation
and instead, compute the probability of the Casino losing
money using the approximation.
We write the simple code using the pnorm function and we get the answer.
It's about 5%,
which is in very good agreement with the Monte Carlo simulation we ran.
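The pnorm computation referred to above can be written out like this (a sketch using the expected value and standard error derived earlier):

```r
n <- 1000                         # number of bets placed
mu <- 2/38                        # expected value of one draw
sigma <- 2 * sqrt(90) / 19        # standard deviation of the urn, about 0.9986
prob_lose <- pnorm(0, n * mu, sqrt(n) * sigma)
prob_lose                         # roughly 5%
```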
Section 3: Random Variables, Sampling Models, and the Central Limit Theorem 3.2 The
Central Limit Theorem Continued Averages and Proportions
RAFAEL IRIZARRY: There are some useful mathematical results, some of which
we used previously, that are often used when working with data.
We're going to list them here.
The first, is that the expected value of a sum of random variables
is the sum of the expected values of the individual random variables.
We can write it like this, using mathematical notation.
If the x are drawn from the same urn, then they all
have the same expected value.
We call it mu here.
And therefore, the expected value of the sum
is n times mu, which is another way of writing the result of the sum of draws.
A second property is that the expected value
of a random variable times a non-random constant
is the expected value times that non-random constant.
This is easier to explain with mathematical symbols, which
we show here.
To see why this is intuitive, think of a change of units.
If we change the units of the random variable, say from dollars to cents,
the expectation should change in the same way.
A consequence of these two facts that we described
is that the expected value of the average of draws from the same urn
is the expected value of the urn, call it mu again.
Here is that result written out in mathematical formula.
So the expected value of the average is mu,
the average of the values in the urn.
A third property is that the square of the standard error
of the sum of independent random variables
is the sum of the square of the standard error of each random variable.
This one is easier to understand in its math form.
The standard error of the sum of x is the square root
of the sum of the standard error squared, as you can see here.
Note that the square of the standard error
is referred to as a variance in statistical textbooks.
A fourth property is that the standard error
of a random variable times a non-random constant
is the standard error times that non-random constant.
As with the expectation, we have the following formula.
To see why this is intuitive, again, think of change of units.
A consequence of these previous two properties
is that the standard error of the average of independent draws
from the same urn is the standard deviation
of the urn-- let's call it sigma, the Greek letter for s--
divided by the square root of n.
Here it is in the mathematical form.
And here we are using the properties we just described to derive that result.
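A quick simulation agrees with the σ/√n result; as a sketch we use a standard normal urn (σ = 1), so the standard error of the average of n draws should be 1/√n:

```r
set.seed(1)
n <- 100
# 10,000 averages, each over n independent draws with standard deviation 1
avgs <- replicate(10^4, mean(rnorm(n)))
sd(avgs)          # close to 1/sqrt(100) = 0.1
```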
The last property that we're going to talk about
is that if x is a normally distributed random variable, then
if a and b are non-random constants, then a times X plus b is also
a normally distributed random variable.
Note that what we are doing is changing the units of the random variable
by multiplying by a, and then shifting the center by adding b.
Textbook link
This video corresponds to the textbook section on the statistical properties of
averages.
Key points
Random variable times a constant
The expected value of a random variable multiplied by a constant is that constant
times its original expected value:
E[aX]=aμ
The standard error of a random variable multiplied by a constant is that constant
times its original standard error:
SE[aX]=aσ
Average of multiple draws of a random variable
The expected value of the average of multiple draws from an urn is the expected
value of the urn (μ).
The standard deviation of the average of multiple draws from an urn is the
standard deviation of the urn divided by the square root of the number of
draws (σ/√n).
The sum of multiple draws of a random variable
The expected value of the sum of n draws of a random variable is n times its
original expected value:
E[nX]=nμ
The standard error of the sum of n draws of a random variable is √n times its
original standard error:
SE[nX]=√n σ
The sum of multiple different random variables
The expected value of the sum of different random variables is the sum of the
individual expected values for each random variable:
E[X1+X2+⋯+Xn]=μ1+μ2+⋯+μn
The standard error of the sum of different random variables is the square root of
the sum of squares of the individual standard errors:
SE[X1+X2+⋯+Xn]=√(σ1²+σ2²+⋯+σn²)
Transformation of random variables
If X is a normally distributed random variable and a and b are non-random
constants, then aX+b is also a normally distributed random variable.
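As a sanity check of this last property, scaling and shifting normal draws just rescales the standard deviation and shifts the mean:

```r
set.seed(1)
x <- rnorm(10^5)     # standard normal draws: mean 0, sd 1
y <- 2 * x + 3       # a = 2, b = 3
mean(y)              # approximately b = 3
sd(y)                # approximately a = 2; y is still normally distributed
```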
Section 3: Random Variables, Sampling Models, and the Central Limit Theorem 3.2 The
Central Limit Theorem Continued Law of Large Numbers
RAFAEL IRIZARRY: An important implication of the properties
we described in the previous video is that the standard error
of the average of draws becomes smaller and smaller as the number of draws n
grows larger.
When n is very large, then the standard error is practically 0,
and the average of the draws converges to the average of the urn.
This is known in statistical textbooks as the law of large numbers,
or the law of averages.
Note that the law of averages is sometimes misinterpreted.
For example, if you toss a coin five times and you see heads each time,
you might hear someone argue that the next toss is probably
a tail because of the law of averages.
On average, we should see 50% heads and 50% tails,
so we need more tails to make up for the heads.
A similar argument would be to say that red is due on roulette
after seeing black come up five times in a row.
These events are independent.
So the chance of a coin landing heads is 50%, regardless of the previous five.
Similarly for the roulette outcome.
The law of averages applies only when the number of draws
is very, very large, not in small samples.
After a million tosses, you will definitely see about 50% heads,
regardless of what the first five were.
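A small sketch makes the point: the proportion of heads in the first five tosses can be far from 50%, but over a million tosses it is very close:

```r
set.seed(1)
tosses <- sample(c("H", "T"), 10^6, replace = TRUE)
mean(tosses[1:5] == "H")   # small sample: can be far from 0.5
mean(tosses == "H")        # large sample: very close to 0.5
```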
Another somewhat funny misuse of the law of averages
is in sports, where you sometimes hear TV announcers predict
a player is about to succeed because they have failed a few times in a row,
and they need successes to make up and match their average.
Textbook link
This video corresponds to the textbook section on the law of large numbers.
Key points
The law of large numbers states that as n increases, the standard error of the
average of a random variable decreases. In other words, when n is large, the
average of the draws converges to the average of the urn.
The law of large numbers is also known as the law of averages.
The law of averages only applies when n is very large and events are independent.
It is often misused to make predictions about an event being "due" because it has
happened less frequently than expected in a small sample size.
Section 3: Random Variables, Sampling Models, and the Central Limit Theorem 3.2 The
Central Limit Theorem Continued Averages and Proportions
RAFAEL IRIZARRY: The central limit theorem works when the number of draws
is large.
But large is a relative term.
How big is large?
15, 100, a million?
In many circumstances, as few as 30 draws is enough to make the CLT useful.
In specific instances, as few as 10 is enough.
However, these should not be considered general rules.
Note for example, that when the probability of success is very small,
we need larger sample sizes.
Consider for example, the lottery.
In the lottery, the chances of winning are less than 1 in a million.
Thousands of people play, so the number of draws is very large.
So the Central Limit Theorem should work.
Yet the number of winners, the sum of the draws,
ranges between 0 and, in very extreme cases, 4.
This sum is certainly not well approximated
by the normal distribution.
So the central limit theorem doesn't apply, even
with a very large sample size.
This is generally true when the probability of success is very low.
In these cases, the Poisson distribution is more appropriate.
We do not cover the theory here, but you can
learn about the Poisson distribution in any probability textbook and even
Wikipedia.
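To see the breakdown, we can simulate the number of lottery winners directly; the numbers below (a 1-in-a-million chance of winning and 500,000 tickets) are hypothetical, chosen only to make the success probability tiny:

```r
set.seed(1)
p <- 1/10^6                # chance any one ticket wins
players <- 5 * 10^5        # number of tickets sold
# number of winners in each of 10,000 simulated lotteries
winners <- rbinom(10^4, size = players, prob = p)
table(winners)             # almost all 0's and 1's; nothing like a normal curve
```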
Textbook links and further information
This video corresponds to the textbook section on sample size for CLT.
You can read more about the Poisson distribution here.
Key points
The sample size required for the Central Limit Theorem and Law of Large Numbers
to apply differs based on the probability of success.
If the probability of success is high, then relatively few observations
are needed.
As the probability of success decreases, more observations are
needed.
If the probability of success is extremely low, such as winning a lottery, then the
Central Limit Theorem may not apply even with extremely large sample sizes. The
normal distribution is not a good approximation in these cases, and other
distributions such as the Poisson distribution (not discussed in these courses) may
be more appropriate.
Code from the assessment (Questions 1 and 2: SAT testing). Note that n, a, and b
are defined in the earlier parts of that question:
p <- seq(0.25, 0.95, 0.05)
exp_val <- sapply(p, function(x){
    mu <- n * (a*x + b*(1-x))                    # expected value of the sum
    sigma <- sqrt(n) * abs(b-a) * sqrt(x*(1-x))  # standard error of the sum
    1 - pnorm(35, mu, sigma)                     # probability of scoring over 35
})
min(p[which(exp_val > 0.8)])
Section 3: Random Variables, Sampling Models, and the Central Limit Theorem 3.3 Assessment:
Random Variables, Sampling Models, and the Central Limit Theorem Question 3: Betting on
Roulette
Question 3: Betting on Roulette
A casino offers a House Special bet on roulette, which is a bet on five pockets (00,
0, 1, 2, 3) out of 38 total pockets. The bet pays out 6 to 1. In other words, a losing
bet yields -$1 and a successful bet yields $6. A gambler wants to know the chance
of losing money if he places 500 bets on the roulette House Special.
The following 7-part question asks you to do some calculations related to this
scenario.
Question 3a
What is the expected value of the payout for one bet?
Answer: -0.0789
Explanation
The expected value can be calculated using the following code:
p <- 5/38
a <- 6
b <- -1
mu <- a*p + b*(1-p)
mu
Question 3b
What is the standard error of the payout for one bet?
Answer: 2.37
Explanation
The standard error can be calculated using the following code:
sigma <- abs(b-a) * sqrt(p*(1-p))
sigma
Question 3c
What is the expected value of the average payout over 500 bets?
Remember there is a difference between expected value of the average and expected value of the sum.
Answer: -0.0789
Explanation
The expected value can be calculated using the following code:
mu
Question 3d
What is the standard error of the average payout over 500 bets?
Remember there is a difference between the standard error of the average and standard error of the
sum.
Answer: 0.106
Explanation
The standard error can be calculated using the following code:
n <- 500
sigma/sqrt(n)
Question 3e
What is the expected value of the sum of 500 bets?
Answer: -39.5
Explanation
The expected value can be calculated using the following code:
n*mu
Question 3f
0.0/1.0 punto (calificado)
98
What is the standard error of the sum of 500 bets? incorrecto
52.9
98
Explanation
The standard error can be calculated using the following code:
sqrt(n) * sigma
Question 3g
Use pnorm() with the expected value of the sum and standard error of the sum to calculate the
probability of losing money over 500 bets, Pr(X≤0).
Answer: 0.772
Explanation
The probability can be calculated using the following code:
pnorm(0, n*mu, sqrt(n)*sigma)
Section 4: The Big Short 4.1 The Big Short The Big Short: Interest Rates Explained
RAFAEL IRIZARRY: In a way, the sampling models we've been talking about
are also used by banks to decide interest rates.
Let's see how this could be.
Suppose you run a small bank that has a history of identifying
potential homeowners that can be trusted to make payments.
In fact, historically in a given year only 2%
of your customers default, meaning that they don't pay back the money
you lent them.
However, note that if you simply loan money to everybody without interest,
you'll end up losing money due to this 2%.
Although you know 2% of your clients will probably default,
you don't know which ones.
However, by charging everybody just a bit extra
you can make up for the losses incurred due to the 2%,
and also pay the employees at work to make these loans happen.
You can also make a profit, but if you set the interest rate too high
your clients will go to another bank.
We use all these facts and some probability theory
to decide what interest rates we should charge.
Suppose your bank will give out 1,000 loans for $180,000 each this year.
Also suppose that your bank loses, after adding up
all the costs, $200,000 per foreclosure.
For simplicity, we assume that that includes all operational costs.
A sampling model for this scenario is coded like this.
We either default and lose money, or not default and not lose money.
If we run the simulation, we see that we lose $2.8 million.
Note that the total loss defined by the final sum is a random variable.
Every time you run the code you get a different answer.
This is because defaulting happens with a certain probability--
it's not going to happen for sure.
We can easily construct a Monte Carlo simulation
to get an idea of the distribution of this random variable.
Here's the distribution.
You can see that we're going to lose money because of these people
that default. And here's the distribution of how much money we're
going to lose in millions.
We don't really need a Monte Carlo simulation though.
Using what we've learned, the CLT tells us
that because our losses are a sum of independent draws,
its distribution is approximately normal with expected
value and standard deviation given by the following formula.
We can now set an interest rate to guarantee that on average, we
break even.
Basically, we need to add a quantity x to each loan, which in this case
are represented by draws, so that the expected value is zero.
That means breaking even.
If we define l to be the loss per foreclosure,
we need to set x so that the following formula holds,
which implies that x can be calculated using this R code, which gives us
about 4,081.
This is an interest rate of about 2%.
We still have a problem though.
Although this interest rate guarantees that on average we break even,
there's a 50% chance that we will lose money.
If our bank loses money, we have to close it down.
So we need to pick an interest rate that makes it unlikely for this to happen.
At the same time, if the interest rate is too high
our clients will go to another bank.
So we must be willing to take some risks.
So, let's say that we want our chances
of losing money to be one in 100.
What does x have to be now?
This one is a bit harder.
We want the sum-- let's call it capital S-- to have the probability
of S less than zero to be 0.01.
We know that S is approximately normal.
The expected value of S is given by this formula,
with n the number of draws, which in this case
represents the number of loans.
The standard error is given by this formula.
Because x is positive and l is negative, we
know that the absolute value of x minus l is x minus l.
Note that these are just the formulas we showed earlier,
but using more compact symbols.
Now we're going to use a mathematical trick that
is very common in statistics.
We're going to add and subtract the same quantities to both sides of the event S
less than zero so that the probability does not change,
and we end up with a standard normal random variable
on the left, which will then permit us to write down an equation with only
x as an unknown.
Here it goes.
We want the probability of S being less than zero to be 0.01.
So what we're going to do now is we're going
to subtract the expected value of S and divide by the standard error of S
on both sides of the equation.
All we did was add and divide by the same quantity on both sides.
We did this because now the term on the left
is a standard normal random variable, which we will rename as capital Z.
Now, we will fill in the blanks with the actual formula for expected value
and standard error.
The formula now looks like this.
We have a Z on the left.
That's good, that's going to make things easier later.
The right looks complicated, but remember
that l, p, and n are all known amounts, so eventually we'll
turn them into numbers.
Now, because the term on the left side is a normal random variable
with expected value of zero and standard error one,
it means that the quantity on the right must
be equal to qnorm of 0.01, which is negative 2.32.
Then the equation above will hold true.
Remember that if we set little z to be qnorm of 0.01,
this will give you a value of little z for which the following formula is
true.
The probability of big Z, less than or equal to little z, equal 0.01.
So this means that the right side of the complicated equation
must be equal to qnorm 0.01.
So we have this formula.
The trick works, because we end up with an expression containing
x that we know has to be equal to a quantity depending on things we know.
Solving for x is now simply algebra.
If we do it, we get that the x has to be about 6,249,
which is an interest rate of about 3%, which is still pretty good.
Note also that by choosing this interest rate,
we now have an expected profit per loan of about $2,124,
which is a total expected profit of about $2 million.
We can run a Monte Carlo simulation and check our theoretical approximation.
We do that, and we indeed get that value again.
And again, the probability of profit being less than zero
according to the Monte Carlo simulation is about 1%.
Textbook link
This video corresponds to the textbook section on interest rates.
Correction
At 2:35, the displayed results of the code are incorrect. Here are the correct values:
n*(p*loss_per_foreclosure + (1-p)*0)
[1] -4e+06
sqrt(n)*abs(loss_per_foreclosure)*sqrt(p*(1-p))
[1] 885438
Key points
Interest rates for loans are set using the probability of loan defaults to calculate a
rate that minimizes the probability of losing money.
We can define the outcome of loans as a random variable. We can also define the
sum of outcomes of many loans as a random variable.
The Central Limit Theorem can be applied to fit a normal distribution to the sum of
profits over many loans. We can use properties of the normal distribution to
calculate the interest rate needed to ensure a certain probability of losing money
for a given probability of default.
Code: Interest rate sampling model
n <- 1000
loss_per_foreclosure <- -200000
p <- 0.02
defaults <- sample( c(0,1), n, prob=c(1-p, p), replace = TRUE)
sum(defaults * loss_per_foreclosure)
Code: Interest rate Monte Carlo simulation
B <- 10000
losses <- replicate(B, {
defaults <- sample( c(0,1), n, prob=c(1-p, p), replace = TRUE)
sum(defaults * loss_per_foreclosure)
})
Code: Plotting expected losses
library(tidyverse)
data.frame(losses_in_millions = losses/10^6) %>%
ggplot(aes(losses_in_millions)) +
geom_histogram(binwidth = 0.6, col = "black")
Code: Expected value and standard error of the sum of 1,000
loans
n*(p*loss_per_foreclosure + (1-p)*0) # expected value
sqrt(n)*abs(loss_per_foreclosure)*sqrt(p*(1-p)) # standard error
Code: Calculating interest rates for expected value of 0
We can calculate the amount x to add to each loan so that the expected value is 0
using the equation lp+x(1−p)=0. Note that this equation is the definition of
expected value given a loss per foreclosure l with foreclosure probability p and
profit x if there is no foreclosure (probability 1−p).
We solve for x = −lp/(1−p) and calculate x:
x = - loss_per_foreclosure*p/(1-p)
On a $180,000 loan, this equals an interest rate of:
x/180000
Equations: Calculating interest rate for 1% probability of losing
money
We want to calculate the value of x for which Pr(S<0)=0.01. The expected
value E[S] of the sum of n=1000 loans given our definitions of x, l and p is:
μS = (lp + x(1−p)) × n
And the standard error of the sum of n loans, SE[S], is:
σS = |x − l| √(np(1−p))
Because we know the definition of a Z-score is Z = (x − μ)/σ, we know
that Pr(S<0) = Pr(Z < −μ/σ). Thus, Pr(S<0) = 0.01 equals:
Pr(Z < −{lp + x(1−p)}n / ((x − l)√(np(1−p)))) = 0.01
z <- qnorm(0.01) gives us the value of z for which Pr(Z≤z) = 0.01, meaning:
z = −{lp + x(1−p)}n / ((x − l)√(np(1−p)))
Solving for x gives:
x = −l(np − z√(np(1−p))) / (n(1−p) + z√(np(1−p)))
Code: Calculating interest rate for 1% probability of losing
money
l <- loss_per_foreclosure
z <- qnorm(0.01)
x <- -l*( n*p - z*sqrt(n*p*(1-p)))/ ( n*(1-p) + z*sqrt(n*p*(1-p)))
x    # about 6249
x/180000 # interest rate
loss_per_foreclosure*p + x*(1-p) # expected value of the profit per loan
n*(loss_per_foreclosure*p + x*(1-p)) # expected value of the profit over n loans
Code: Monte Carlo simulation for 1% probability of losing
money
Note that your results will vary from the video because the seed is not set.
B <- 100000
profit <- replicate(B, {
draws <- sample( c(x, loss_per_foreclosure), n,
prob=c(1-p, p), replace = TRUE)
sum(draws)
})
mean(profit) # expected value of the profit over n loans
mean(profit<0) # probability of losing money
Section 4: The Big Short 4.1 The Big Short The Big Short
RAFAEL IRIZARRY: One of our employees points out
that since the bank is making about $2,000 per loan,
you should give out more loans.
Why just n?
You explain that finding those n clients was hard.
You need a group that is predictable, and that keeps the chances of defaults
low.
He then points out that even if the probability of default
is higher, as long as your expected value is positive,
you can minimize your chances of losing money
by increasing n, the number of loans, and relying
on the law of large numbers.
He claims that even if the default rate is twice as high, say 4%,
if we set the rate just a bit higher so that this happens,
you will get a positive expected value.
So if we set the interest rate at 5%, we are
guaranteed a positive expected value of $640 per loan.
And we can minimize our chances of losing money
by simply increasing the number of loans, since the probability of S
being less than 0 is equal to the probability of Z
being less than negative expected value of S divided by standard error S,
with Z a standard normal random variable, as we saw earlier.
And if we do the math, we will see that we can pick an n so
that this probability is 1 in 100.
Let's see how. If we define mu and sigma to be the expected value and standard
deviation of the urn, that is of a single loan,
we have that using the formulas above, E of S, the expected value of S,
is n times mu.
And the standard error of S is the square root of n times sigma.
So if we define little z to be qnorm of 0.01,
we have the following formula, which tells us
that if n is bigger than z squared times sigma squared divided by mu squared,
we are guaranteed to have a probability of less than 0.01 of losing money.
The implication is that as long as mu, the expected value, is positive,
we can find an n, a number of loans, that
minimizes the probability of a loss.
This is a form of the Law of Large Numbers.
When n is large, our average earning per loan
converges to the expected earning, mu.
With x fixed, now we can ask what n do we need for the probability to be 0.01?
In our example, we use the following formula.
And we get that n has to be 22,163 loans.
This is the number of loans we need so that our probability of losing
is about 1%.
And furthermore, we are expected to earn a total of about $14 million.
We can confirm this with a Monte Carlo simulation.
We run the Monte Carlo simulation, and we get the probability of losing money
is about 1%.
This seems like a no brainer.
Your colleague decides to leave your bank,
and start his own high-risk mortgage company.
A few months later, your colleague's bank has gone bankrupt.
A book is written about it and eventually, a movie
made relating to the mistake your friend and many others made.
What happened?
Your friend's scheme was based in part on the following mathematical formula--
by making n large, we minimize the standard error of our per-loan profit.
However, for this rule to hold, the X's must be independent draws.
The fact that one person defaults must be independent
of other people defaulting.
Note, for example, that in the extreme case of averaging
the same draw over and over, the draws are certainly not independent.
In that case, we get a much higher standard error--
it's the square root of n times larger.
To construct a more realistic simulation than the original one your friend ran,
let's assume there could be a global event that
affects everybody with high-risk mortgages,
and changes their probability.
We will assume that with a 50-50 chance, all the probabilities go up or down
slightly to somewhere between 0.03 and 0.05.
But it happens to everybody at once, not just one person.
These draws are no longer independent.
Let's use a Monte Carlo simulation to see what happens under this model.
Note that our expected profit is still large.
We're expected to make about $14 million.
However, the probability of the bank having negative earning
shoots way up to almost 35%.
Even scarier is the probability of losing more than $10 million,
which is 24%.
To understand how this happens, we can look at the distribution
of our random variable.
It doesn't look normal at all.
The theory completely breaks down, and our random variable has much more
variability than expected.
The financial meltdown of 2007 was due, among other things,
to financial experts assuming independence when there was none.
Textbook link
This video corresponds to the textbook section on The Big Short.
Key points
The Central Limit Theorem states that the sum of independent draws of a random
variable follows a normal distribution. However, when the draws are not
independent, this assumption does not hold.
If an event changes the probability of default for all borrowers, then the probability
of the bank losing money changes.
Monte Carlo simulations can be used to model the effects of unknown changes in
the probability of default.
Code: Expected value with higher default rate and interest rate
p <- .04
loss_per_foreclosure <- -200000
r <- 0.05
x <- r*180000
loss_per_foreclosure*p + x*(1-p)
Equations: Probability of losing money
We can express the desired probability of losing money in terms of a standard normal variable Z:
Pr(S<0) = Pr(Z < −E[S]/SE[S]) = Pr(Z < z)
If μ is the expected value of the urn (one loan) and σ is the standard deviation of
the urn (one loan), then E[S] = nμ and SE[S] = √n σ.
As in the previous video, we set the desired probability of losing money to 0.01. In the
first equation, we can see that:
z = −E[S]/SE[S]
It follows that:
z = −nμ/(√n σ) = −√n μ/σ
To find the value of n for which the probability of losing money is at most our desired
value, we require −√n μ/σ ≤ z and solve for n:
n ≥ z²σ²/μ²
Code: Calculating number of loans for desired probability of
losing money
The number of loans required is:
z <- qnorm(0.01)
l <- loss_per_foreclosure
n <- ceiling((z^2*(x-l)^2*p*(1-p))/(l*p + x*(1-p))^2)
n # number of loans required
n*(loss_per_foreclosure*p + x * (1-p)) # expected profit over n loans
Code: Monte Carlo simulation with known default probability
This Monte Carlo simulation estimates the expected profit given a known
probability of default p=0.04. Note that your results will differ from the video
because the seed is not set.
B <- 10000
p <- 0.04
x <- 0.05 * 180000
profit <- replicate(B, {
draws <- sample( c(x, loss_per_foreclosure), n,
prob=c(1-p, p), replace = TRUE)
sum(draws)
})
mean(profit)
Code: Monte Carlo simulation with unknown default probability
This Monte Carlo simulation estimates the expected profit given an unknown
probability of default 0.03≤p≤0.05, modeling the situation where an event
changes the probability of default for all borrowers simultaneously. Note that your
results will differ from the video because the seed is not set.
p <- 0.04
x <- 0.05*180000
profit <- replicate(B, {
new_p <- 0.04 + sample(seq(-0.01, 0.01, length = 100), 1)
draws <- sample( c(x, loss_per_foreclosure), n,
prob=c(1-new_p, new_p), replace = TRUE)
sum(draws)
})
mean(profit) # expected profit
mean(profit < 0) # probability of losing money
mean(profit < -10000000) # probability of losing over $10 million