Bayesian inference,
Naïve Bayes model
http://xkcd.com/1236/
Conditional Probability
Bayes Rule
Rev. Thomas Bayes (1702-1761)
• The product rule gives us two ways to factor a joint probability:
$P(A, B) = P(A \mid B) \, P(B) = P(B \mid A) \, P(A)$
• Therefore,
$P(A \mid B) = \frac{P(B \mid A) \, P(A)}{P(B)}$
• Why is this useful?
– Can update our beliefs about A based on evidence B
• P(A) is the prior and P(A|B) is the posterior
– Key tool for probabilistic inference: can get diagnostic probability
from causal probability
• E.g., P(Cavity = true | Toothache = true) from
P(Toothache = true | Cavity = true)
Bayes Rule example
• Marie is getting married tomorrow, at an outdoor ceremony
in the desert. In recent years, it has rained only 5 days each
year (5/365 = 0.014). Unfortunately, the weatherman has
predicted rain for tomorrow. When it actually rains, the
weatherman correctly forecasts rain 90% of the time. When
it doesn't rain, he incorrectly forecasts rain 10% of the time.
What is the probability that it will rain on Marie's wedding?
Bayes Rule example
• Apply Bayes rule, expanding the denominator with the law of total probability:
$P(\text{Rain} \mid \text{Predict}) = \frac{P(\text{Predict} \mid \text{Rain}) \, P(\text{Rain})}{P(\text{Predict} \mid \text{Rain}) \, P(\text{Rain}) + P(\text{Predict} \mid \neg\text{Rain}) \, P(\neg\text{Rain})}$
$= \frac{0.9 \times 0.014}{0.9 \times 0.014 + 0.1 \times 0.986} = \frac{0.0126}{0.0126 + 0.0986} \approx 0.111$
• The value 0.111 uses the unrounded prior 5/365; the rounded intermediate values shown give about 0.113
Bayes rule: Example
• 1% of men at age forty who participate in routine
screening have cancer. 80% of men with cancer will
get positive tomography. 9.6% of men without
cancer will also get positive tomographies. A man in
this age group had a positive tomography in a
routine screening. What is the probability that he
actually has cancer?
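$P(\text{Cancer} \mid \text{Positive}) = \frac{P(\text{Positive} \mid \text{Cancer}) \, P(\text{Cancer})}{P(\text{Positive} \mid \text{Cancer}) \, P(\text{Cancer}) + P(\text{Positive} \mid \neg\text{Cancer}) \, P(\neg\text{Cancer})}$
$= \frac{0.8 \times 0.01}{0.8 \times 0.01 + 0.096 \times 0.99} = \frac{0.008}{0.008 + 0.095} = 0.0776$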
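Both worked examples above follow the same pattern, so they can be checked with one small function. This is a minimal sketch; the function name `bayes_posterior` and its parameter names are illustrative choices, not part of the slides.

```python
def bayes_posterior(prior, likelihood, false_positive_rate):
    """Return P(H | E) given P(H), P(E | H), and P(E | not H)."""
    # Law of total probability: P(E) = P(E|H) P(H) + P(E|not H) P(not H)
    evidence = likelihood * prior + false_positive_rate * (1.0 - prior)
    # Bayes rule: P(H|E) = P(E|H) P(H) / P(E)
    return likelihood * prior / evidence

# Wedding example: prior 5/365, forecast hit rate 0.9, false alarm rate 0.1.
rain = bayes_posterior(prior=5 / 365, likelihood=0.9, false_positive_rate=0.1)

# Screening example: prior 0.01, sensitivity 0.8, false positive rate 0.096.
cancer = bayes_posterior(prior=0.01, likelihood=0.8, false_positive_rate=0.096)

print(f"P(rain | forecast)   = {rain:.3f}")    # 0.111
print(f"P(cancer | positive) = {cancer:.4f}")  # 0.0776
```

In both cases the posterior stays small because the prior is small: most positive signals come from the much larger group where the hypothesis is false.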
https://xkcd.com/1132/
See also: https://xkcd.com/882/
Probabilistic inference
• Suppose the agent has to make a decision about
the value of an unobserved query variable X given
some observed evidence variable(s) E = e
– Partially observable, stochastic, episodic environment
– Examples: X = {spam, not spam}, e = email message
X = {zebra, giraffe, hippo}, e = image features
Bag of words illustration
US Presidential Speeches Tag Cloud
http://chir.ag/projects/preztags/
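A tag cloud is a picture of a bag-of-words representation: word order is discarded and only the counts are kept. The sketch below shows one simple way to compute such counts; the tokenization rule and the sample sentence are made up for illustration.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Map a document to word counts, ignoring word order and case."""
    # Simplistic tokenizer (an assumption): runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)

# Made-up sample text, not taken from any speech.
bow = bag_of_words("The people of the nation ask what the nation can do")
print(bow.most_common(2))  # [('the', 3), ('nation', 2)]
```

Two documents with the same words in a different order map to the same counts, which is exactly the information a bag-of-words model keeps.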
[Figure: 2016 convention speeches (panels: Clinton, Trump)]
[Figure: 2016 first presidential debate (panels: Trump, Clinton)]
[Figure: 2016 first presidential debate (panels: Trump unique words, Clinton unique words)]
Learning and inference pipeline
• Learning: Training Samples → Features → Training (with Training Labels) → Learned model
• Inference: Test Sample → Features → Learned model → Prediction
Making decisions under uncertainty
• Let action A_t = leave for airport t minutes before flight
– Will A_t succeed, i.e., get me to the airport in time for the flight?
• Problems:
• Partial observability (road state, other drivers' plans, etc.)
• Noisy sensors (traffic reports)
• Uncertainty in action outcomes (flat tire, etc.)
• Complexity of modeling and predicting traffic
• Hence a non-probabilistic approach either
• Risks falsehood: “A25 will get me there on time,” or
• Leads to conclusions that are too weak for decision making:
• A25 will get me there on time if there's no accident on the bridge and it
doesn't rain and my tires remain intact, etc., etc.
• A1440 will get me there on time but I’ll have to stay overnight in the airport
Making decisions under uncertainty
• Suppose the agent believes the following:
P(A25 gets me there on time) = 0.04
P(A90 gets me there on time) = 0.70
P(A120 gets me there on time) = 0.95
P(A1440 gets me there on time) = 0.9999
• Which action should the agent choose?
– Depends on preferences for missing flight vs. time spent waiting
– Encapsulated by a utility function
• The agent should choose the action that maximizes the
expected utility:
P(A_t succeeds) * U(A_t succeeds) + P(A_t fails) * U(A_t fails)
Making decisions under uncertainty
• More generally: the expected utility of an action is defined
as:
$EU(a) = \sum_{\text{outcomes of } a} P(\text{outcome} \mid a) \, U(\text{outcome})$
• Utility theory is used to represent and infer preferences
• Decision theory = probability theory + utility theory
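The airport decision can be sketched as an expected-utility computation. The success probabilities are the ones listed for A25, A90, A120, and A1440; the utility function is a hypothetical choice made only for illustration, since the slides do not specify one.

```python
# Success probabilities from the slides, keyed by t (minutes before the flight).
p_success = {25: 0.04, 90: 0.70, 120: 0.95, 1440: 0.9999}

# Hypothetical utilities (not from the slides): catching the flight is worth
# 1000, missing it costs 1000, and every minute spent waiting costs 1.
def utility(t, success):
    return (1000 if success else -1000) - t

def expected_utility(t):
    """EU(A_t) = P(success) * U(success) + P(failure) * U(failure)."""
    p = p_success[t]
    return p * utility(t, True) + (1 - p) * utility(t, False)

best = max(p_success, key=expected_utility)
for t in p_success:
    print(f"EU(A{t}) = {expected_utility(t):.1f}")
print("best action: leave", best, "minutes before the flight")
```

Under these assumed utilities A120 wins: A1440 is almost certain to succeed, but the waiting cost outweighs the small gain in probability. A different utility function could change the answer.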
Monty Hall problem
• You’re a contestant on a game show. You see three closed
doors, and behind one of them is a prize. You choose one
door, and the host opens one of the other doors and
reveals that there is no prize behind it. Then he offers you a
chance to switch to the remaining door. Should you take it?
http://en.wikipedia.org/wiki/Monty_Hall_problem
Monty Hall problem
• With probability 1/3, you picked the correct door,
and with probability 2/3, picked the wrong door.
If you picked the correct door and then you
switch, you lose. If you picked the wrong door
and then you switch, you win the prize.
• Expected utility of switching:
EU(Switch) = (1/3) * 0 + (2/3) * Prize
• Expected utility of not switching:
EU(Not switch) = (1/3) * Prize + (2/3) * 0
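The 1/3 versus 2/3 argument can be verified by exact enumeration. This sketch assumes the standard rules: the prize is placed uniformly at random, the first pick is independent of it, and the host always opens a door that is neither the pick nor the prize.

```python
from fractions import Fraction

doors = [0, 1, 2]
win_switch = Fraction(0)
win_stay = Fraction(0)

# Enumerate every (prize location, first pick) pair; each has probability 1/9.
for prize in doors:
    for pick in doors:
        p = Fraction(1, 9)
        # The host removes one empty door from the two not picked, so the
        # remaining closed door holds the prize exactly when the pick was wrong.
        if pick == prize:
            win_stay += p
        else:
            win_switch += p

print("P(win | switch) =", win_switch)  # 2/3
print("P(win | stay)   =", win_stay)    # 1/3
```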
Random variables
• We describe the (uncertain) state of the world using
random variables
Denoted by capital letters
– R: Is it raining?
– W: What’s the weather?
– D: What is the outcome of rolling two dice?
– S: What is the speed of my car (in MPH)?
• Just like variables in CSPs, random variables take on
values in a domain
Domain values must be mutually exclusive and exhaustive
– R in {True, False}
– W in {Sunny, Cloudy, Rainy, Snow}
– D in {(1,1), (1,2), … (6,6)}
– S in [0, 200]
Events
• Probabilistic statements are defined over events, or sets
of world states
“It is raining”
“The weather is either cloudy or snowy”
“The sum of the two dice rolls is 11”
“My car is going between 30 and 50 miles per hour”
• Events are described using propositions about random
variables:
R = True
W = “Cloudy” ∨ W = “Snowy”
D ∈ {(5,6), (6,5)}
30 ≤ S ≤ 50
• Notation: P(A) is the probability of the set of world states
in which proposition A holds
Kolmogorov’s axioms of probability
• For any propositions (events) A, B
0 ≤ P(A) ≤ 1
P(True) = 1 and P(False) = 0
P(A ∨ B) = P(A) + P(B) – P(A ∧ B)
– Subtraction accounts for double-counting
• Based on these axioms, what is P(¬A)?
• These axioms are sufficient to completely specify
probability theory for discrete random variables
• For continuous variables, need density functions
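One way to answer the question about P(¬A) is to apply the third axiom to A and ¬A, using the facts that A ∨ ¬A is always true and A ∧ ¬A is always false:

```latex
\begin{align*}
P(A \vee \neg A) &= P(A) + P(\neg A) - P(A \wedge \neg A) \\
1 &= P(A) + P(\neg A) - 0 \\
P(\neg A) &= 1 - P(A)
\end{align*}
```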
Atomic events
• Atomic event: a complete specification of the state of
the world, or a complete assignment of domain values to
all random variables
– Atomic events are mutually exclusive and exhaustive
• E.g., if the world consists of only two Boolean variables
Cavity and Toothache, then there are four distinct atomic
events:
Cavity = false ∧ Toothache = false
Cavity = false ∧ Toothache = true
Cavity = true ∧ Toothache = false
Cavity = true ∧ Toothache = true
Joint probability distributions
• A joint distribution is an assignment of
probabilities to every possible atomic event
Atomic event                            P
Cavity = false ∧ Toothache = false      0.8
Cavity = false ∧ Toothache = true       0.1
Cavity = true ∧ Toothache = false       0.05
Cavity = true ∧ Toothache = true        0.05
– Why does it follow from the axioms of probability that
the probabilities of all possible atomic events must
sum to 1?
• Suppose we have a joint distribution of n random
variables with domain sizes d
– What is the size of the probability table?
– Impossible to write out completely for all but the
smallest distributions
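If all n variables have domain size d, the table has d^n entries, one per atomic event. A short sketch that enumerates the atomic events for three Boolean variables and shows how quickly the count grows; the variable names are only examples.

```python
from itertools import product

# Domains for three Boolean variables (n = 3, d = 2).
domains = {
    "Cavity": [False, True],
    "Toothache": [False, True],
    "Catch": [False, True],
}

# Each atomic event assigns one value to every variable.
atomic_events = list(product(*domains.values()))
print(len(atomic_events))  # d**n = 2**3 = 8

# The table size for n Boolean variables grows exponentially in n.
for n in (10, 20, 30):
    print(n, 2 ** n)
```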
Notation
• P(X1 = x1, X2 = x2, …, Xn = xn) refers to a single entry
(atomic event) in the joint probability distribution
table
– Shorthand: P(x1, x2, …, xn)
• P(X1, X2, …, Xn) refers to the entire joint probability
distribution table
• P(A) can also refer to the probability of an event
– E.g., X1 = x1 is an event
Marginal probability distributions
• From the joint distribution P(X,Y) we can find the
marginal distributions P(X) and P(Y)
P(Cavity, Toothache)
Cavity = false ∧ Toothache = false      0.8
Cavity = false ∧ Toothache = true       0.1
Cavity = true ∧ Toothache = false       0.05
Cavity = true ∧ Toothache = true        0.05
P(Cavity)
Cavity = false      ?
Cavity = true       ?
P(Toothache)
Toothache = false   ?
Toothache = true    ?
• To find P(X = x), sum the probabilities of all
atomic events where X = x:
$P(X = x) = P\big((X = x \wedge Y = y_1) \vee \cdots \vee (X = x \wedge Y = y_n)\big)$
$= P(x, y_1) + \cdots + P(x, y_n) = \sum_{i=1}^{n} P(x, y_i)$
• This is called marginalization (we are
marginalizing out all the variables except X)
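Marginalization over the Cavity/Toothache table can be sketched as follows. The dictionary layout, with keys of the form (cavity, toothache), and the helper name `marginal` are illustrative choices.

```python
from collections import defaultdict

# Joint distribution P(Cavity, Toothache) from the slides,
# keyed by (cavity, toothache).
joint = {
    (False, False): 0.8,
    (False, True): 0.1,
    (True, False): 0.05,
    (True, True): 0.05,
}

def marginal(joint, axis):
    """Sum out every variable except the one at position `axis`."""
    result = defaultdict(float)
    for event, p in joint.items():
        result[event[axis]] += p
    return dict(result)

p_cavity = marginal(joint, axis=0)
p_toothache = marginal(joint, axis=1)
print({k: round(v, 3) for k, v in p_cavity.items()})
print({k: round(v, 3) for k, v in p_toothache.items()})
```

This gives P(Cavity = true) = 0.1 and P(Toothache = true) = 0.15, filling in the question marks in the marginal tables.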
Conditional probability
• Probability of cavity given toothache:
P(Cavity = true | Toothache = true)
• For any two events A and B,
$P(A \mid B) = \frac{P(A \wedge B)}{P(B)} = \frac{P(A, B)}{P(B)}$
[Figure: Venn diagram of P(A) and P(B) with overlap P(A ∧ B)]
Conditional probability
P(Cavity, Toothache)
Cavity = false ∧ Toothache = false      0.8
Cavity = false ∧ Toothache = true       0.1
Cavity = true ∧ Toothache = false       0.05
Cavity = true ∧ Toothache = true        0.05
P(Cavity)
Cavity = false      0.9
Cavity = true       0.1
P(Toothache)
Toothache = false   0.85
Toothache = true    0.15
• What is P(Cavity = true | Toothache = false)?
0.05 / 0.85 = 0.059
• What is P(Cavity = false | Toothache = true)?
0.1 / 0.15 = 0.667
Conditional distributions
• A conditional distribution is a distribution over the values
of one variable given fixed values of other variables
P(Cavity, Toothache)
Cavity = false ∧ Toothache = false      0.8
Cavity = false ∧ Toothache = true       0.1
Cavity = true ∧ Toothache = false       0.05
Cavity = true ∧ Toothache = true        0.05
P(Cavity | Toothache = true)
Cavity = false      0.667
Cavity = true       0.333
P(Cavity | Toothache = false)
Cavity = false      0.941
Cavity = true       0.059
P(Toothache | Cavity = true)
Toothache = false   0.5
Toothache = true    0.5
P(Toothache | Cavity = false)
Toothache = false   0.889
Toothache = true    0.111
Normalization trick
• To get the whole conditional distribution P(X | Y = y)
at once, select all entries in the joint distribution table
matching Y = y and renormalize them to sum to one
P(Cavity, Toothache)
Cavity = false ∧ Toothache = false      0.8
Cavity = false ∧ Toothache = true       0.1
Cavity = true ∧ Toothache = false       0.05
Cavity = true ∧ Toothache = true        0.05
Select: entries with Cavity = false
Toothache = false   0.8
Toothache = true    0.1
Renormalize: P(Toothache | Cavity = false)
Toothache = false   0.889
Toothache = true    0.111
• Why does it work?
$P(x \mid y) = \frac{P(x, y)}{P(y)} = \frac{P(x, y)}{\sum_{x'} P(x', y)}$, since $P(y) = \sum_{x'} P(x', y)$ by marginalization
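The select-and-renormalize procedure can be sketched directly on the two-variable table. The key layout (cavity, toothache) and the helper name `condition` are illustrative, and the helper assumes exactly two variables.

```python
# Joint distribution P(Cavity, Toothache), keyed by (cavity, toothache).
joint = {
    (False, False): 0.8,
    (False, True): 0.1,
    (True, False): 0.05,
    (True, True): 0.05,
}

def condition(joint, axis, value):
    """Distribution of the other variable given (variable at `axis`) = value."""
    # Select: keep only the entries that match the evidence.
    selected = {e: p for e, p in joint.items() if e[axis] == value}
    # Renormalize: the selected entries sum to P(evidence) by marginalization.
    total = sum(selected.values())
    other = 1 - axis
    return {e[other]: p / total for e, p in selected.items()}

# P(Toothache | Cavity = false)
p_toothache_given_no_cavity = condition(joint, axis=0, value=False)
print({k: round(v, 3) for k, v in p_toothache_given_no_cavity.items()})
```

The result matches the renormalized table: 0.889 for Toothache = false and 0.111 for Toothache = true.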
Product rule
• Definition of conditional probability: $P(A \mid B) = \frac{P(A, B)}{P(B)}$
• Sometimes we have the conditional probability and want
to obtain the joint:
$P(A, B) = P(A \mid B) \, P(B) = P(B \mid A) \, P(A)$
Chain rule
• Product rule:
$P(A, B) = P(A \mid B) \, P(B) = P(B \mid A) \, P(A)$
• Chain rule:
$P(A_1, \ldots, A_n) = P(A_1) \, P(A_2 \mid A_1) \, P(A_3 \mid A_1, A_2) \cdots P(A_n \mid A_1, \ldots, A_{n-1})$
$= \prod_{i=1}^{n} P(A_i \mid A_1, \ldots, A_{i-1})$
Independence
• Two events A and B are independent if and only if
P(A ∧ B) = P(A, B) = P(A) P(B)
– In other words, P(A | B) = P(A) and P(B | A) = P(B)
– This is an important simplifying assumption for
modeling, e.g., Toothache and Weather can be
assumed to be independent
• Are two mutually exclusive events independent?
– No, but for mutually exclusive events we have
P(A ∨ B) = P(A) + P(B)
• Conditional independence: A and B are conditionally
independent given C iff
P(A ∧ B | C) = P(A | C) P(B | C)
– Equivalently:
P(A | B, C) = P(A | C) or P(B | A, C) = P(B | C)
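The definition of independence can be tested numerically on a joint table. As a sketch, the check below applies it to the Cavity/Toothache distribution used in the earlier slides; the key layout (cavity, toothache) is an illustrative choice.

```python
import math

# Joint distribution P(Cavity, Toothache), keyed by (cavity, toothache).
joint = {
    (False, False): 0.8,
    (False, True): 0.1,
    (True, False): 0.05,
    (True, True): 0.05,
}

# Marginals obtained by summing out the other variable.
p_cavity = sum(p for (cavity, _), p in joint.items() if cavity)
p_toothache = sum(p for (_, toothache), p in joint.items() if toothache)
p_both = joint[(True, True)]

# Independence requires P(A, B) = P(A) P(B).
independent = math.isclose(p_both, p_cavity * p_toothache)
print(round(p_both, 3), round(p_cavity * p_toothache, 3), independent)
```

Here P(Cavity, Toothache) = 0.05 while P(Cavity) P(Toothache) = 0.015, so the two variables are not independent in this table.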
Conditional independence: Example
• Toothache: boolean variable indicating whether the patient has a
toothache
• Cavity: boolean variable indicating whether the patient has a cavity
• Catch: whether the dentist’s probe catches in the cavity
• If the patient has a cavity, the probability that the probe catches in it
doesn't depend on whether he/she has a toothache
P(Catch | Toothache, Cavity) = P(Catch | Cavity)
• Therefore, Catch is conditionally independent of Toothache given Cavity
• Likewise, Toothache is conditionally independent of Catch given Cavity
P(Toothache | Catch, Cavity) = P(Toothache | Cavity)
• Equivalent statement:
• P(Toothache, Catch | Cavity) = P(Toothache | Cavity) P(Catch | Cavity)
Conditional independence: Example
• How many numbers do we need to represent the joint
probability table P(Toothache, Cavity, Catch)?
2^3 – 1 = 7 independent entries
• Write out the joint distribution using chain rule:
P(Toothache, Catch, Cavity)
= P(Cavity) P(Catch | Cavity) P(Toothache | Catch, Cavity)
= P(Cavity) P(Catch | Cavity) P(Toothache | Cavity)
• How many numbers do we need to represent these
distributions?
1 + 2 + 2 = 5 independent numbers
• In most cases, the use of conditional independence
reduces the size of the representation of the joint
distribution from exponential in n to linear in n
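The factorization can be sketched by building all eight joint entries from five numbers. P(Cavity) and P(Toothache | Cavity) below follow the two-variable table from the earlier slides; the two P(Catch | Cavity) values are made up for illustration, since the slides give no numbers for Catch.

```python
from itertools import product

# Five numbers specify the factored model.
p_cavity = 0.1                                      # P(Cavity = true)
p_toothache_given = {True: 0.5, False: 0.1 / 0.9}   # P(Toothache = true | Cavity)
p_catch_given = {True: 0.9, False: 0.2}             # hypothetical P(Catch = true | Cavity)

def bernoulli(p_true, value):
    """Probability of a Boolean value when P(true) = p_true."""
    return p_true if value else 1.0 - p_true

# P(Toothache, Catch, Cavity) = P(Cavity) P(Catch | Cavity) P(Toothache | Cavity)
joint = {}
for cavity, catch, toothache in product([False, True], repeat=3):
    joint[(toothache, catch, cavity)] = (
        bernoulli(p_cavity, cavity)
        * bernoulli(p_catch_given[cavity], catch)
        * bernoulli(p_toothache_given[cavity], toothache)
    )

print(len(joint), "entries, total probability", round(sum(joint.values()), 10))
```

With one cause and n - 1 conditionally independent effects, all Boolean, this style of model needs 1 + 2(n - 1) numbers instead of 2^n - 1.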
The Birthday problem
• We have a set of n people. What is the probability that
two of them share the same birthday?
• Easier to calculate the probability that n people do not
share the same birthday
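$P(B_1, \ldots, B_n \text{ distinct})$
$= P(B_n \text{ distinct from } B_1, \ldots, B_{n-1} \mid B_1, \ldots, B_{n-1} \text{ distinct}) \, P(B_1, \ldots, B_{n-1} \text{ distinct})$
$= \prod_{i=1}^{n} P(B_i \text{ distinct from } B_1, \ldots, B_{i-1} \mid B_1, \ldots, B_{i-1} \text{ distinct})$
• Each factor is
$P(B_i \text{ distinct from } B_1, \ldots, B_{i-1} \mid B_1, \ldots, B_{i-1} \text{ distinct}) = \frac{365 - i + 1}{365}$
• Therefore,
$P(B_1, \ldots, B_n \text{ distinct}) = \frac{365}{365} \times \frac{364}{365} \times \cdots \times \frac{365 - n + 1}{365}$
$P(B_1, \ldots, B_n \text{ not distinct}) = 1 - \frac{365}{365} \times \frac{364}{365} \times \cdots \times \frac{365 - n + 1}{365}$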
The Birthday problem
• For 23 people, the probability of sharing a
birthday is above 0.5!
http://en.wikipedia.org/wiki/Birthday_problem
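The product formula is easy to evaluate numerically. This sketch assumes 365 equally likely birthdays and independence between people; the function name `p_shared_birthday` is illustrative.

```python
def p_shared_birthday(n, days=365):
    """Probability that at least two of n people share a birthday."""
    p_distinct = 1.0
    # Multiply the factors (days - i + 1) / days for i = 1..n.
    for i in range(1, n + 1):
        p_distinct *= (days - i + 1) / days
    return 1.0 - p_distinct

print(round(p_shared_birthday(22), 4))  # 0.4757
print(round(p_shared_birthday(23), 4))  # 0.5073
```

So 23 is the smallest group size for which a shared birthday is more likely than not under these assumptions.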