Probabilistic Reasoning in AI
Uncertainty:
So far, we have learned knowledge representation using first-order logic and propositional logic
with certainty, which means we were sure about the truth of the predicates. With this knowledge
representation, we might write A→B, which means if A is true then B is true. But consider a
situation where we are not sure whether A is true or not; then we cannot express this
statement. This situation is called uncertainty.
So to represent uncertain knowledge, where we are not sure about the predicates, we need uncertain
reasoning or probabilistic reasoning.
Causes of uncertainty:
Leading causes of uncertainty in the real world include information from unreliable sources,
experimental errors, equipment faults, and environmental variation such as temperature or climate change.
2. Bayesian inference
Bayes' theorem:
Bayes' theorem is also known as Bayes' rule, Bayes' law, or Bayesian reasoning, which
determines the probability of an event with uncertain knowledge.
In probability theory, it relates the conditional probability and marginal probabilities of two random events.
Bayes' theorem was named after the British mathematician Thomas Bayes. The Bayesian
inference is an application of Bayes' theorem, which is fundamental to Bayesian statistics.
Example: If cancer corresponds to one's age then by using Bayes' theorem, we can determine the
probability of cancer more accurately with the help of age.
Bayes' theorem can be derived using the product rule and the conditional probability of event A
with known event B. From the product rule, P(A⋀B) = P(A|B) P(B), and similarly P(A⋀B) = P(B|A) P(A).
Equating the two and dividing by P(B) gives:
P(A|B) = P(B|A) P(A) / P(B)    ...(a)
The above equation (a) is called Bayes' rule or Bayes' theorem. This equation is the basis of
most modern AI systems for probabilistic inference.
It shows the simple relationship between joint and conditional probabilities. Here,
P(A|B) is known as the posterior, which we need to calculate; it is read as the probability of
hypothesis A given that evidence B has occurred.
P(B|A) is called the likelihood: assuming the hypothesis is true, we calculate the
probability of the evidence. P(A) is called the prior probability: the probability of the hypothesis before
observing the evidence. P(B) is called the marginal probability: the probability of the evidence itself.
In equation (a), in general, we can write P(B) = Σi P(Ai)*P(B|Ai), hence Bayes' rule can be written as:
P(Ai|B) = P(B|Ai) P(Ai) / Σk P(Ak) P(B|Ak)
where A1, A2, A3, ..., An is a set of mutually exclusive and exhaustive events.
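The normalization over a set of hypotheses can be sketched in Python; the three hypotheses and all the numbers below are made up purely for illustration:

```python
# Bayes' rule over mutually exclusive, exhaustive hypotheses A1..A3.
# Priors and likelihoods here are illustrative, not from the text.
priors = {"A1": 0.5, "A2": 0.3, "A3": 0.2}        # P(Ai), sums to 1
likelihoods = {"A1": 0.1, "A2": 0.4, "A3": 0.7}   # P(B | Ai)

# Total probability of the evidence: P(B) = sum_i P(Ai) * P(B | Ai)
p_b = sum(priors[h] * likelihoods[h] for h in priors)

# Posterior for each hypothesis: P(Ai | B) = P(B | Ai) * P(Ai) / P(B)
posterior = {h: likelihoods[h] * priors[h] / p_b for h in priors}

print(p_b)        # 0.05 + 0.12 + 0.14 = 0.31
print(posterior)  # posteriors sum to 1
```

Note that the denominator is the same for every hypothesis, so it simply rescales the numerators into a proper distribution.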
Example-1:
Question: What is the probability that a patient has the disease meningitis given that they have a
stiff neck?
A doctor is aware that the disease meningitis causes a patient to have a stiff neck, and it occurs 80%
of the time. He is also aware of some more facts, which are given as follows:
o The known probability that any patient has meningitis is 1/30000.
o The known probability that any patient has a stiff neck is 2%.
Let a be the proposition that the patient has a stiff neck and b be the proposition that the patient has
meningitis. Then we can calculate the following:
P(a|b) = 0.8
P(b) = 1/30000
P(a) = 0.02
Applying Bayes' rule: P(b|a) = P(a|b) P(b) / P(a) = (0.8 × 1/30000) / 0.02 ≈ 0.00133 = 1/750.
Hence, we can assume that about 1 out of 750 patients with a stiff neck has meningitis.
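The arithmetic of Example-1 can be checked with a short Python sketch using exactly the numbers given above:

```python
# Bayes' rule applied to the meningitis / stiff-neck example.
p_a_given_b = 0.8        # P(stiff neck | meningitis), given in the text
p_b = 1 / 30000          # P(meningitis), given in the text
p_a = 0.02               # P(stiff neck), given in the text

# P(b | a) = P(a | b) * P(b) / P(a)
p_b_given_a = p_a_given_b * p_b / p_a
print(p_b_given_a)       # ≈ 0.001333, i.e. 1 in 750
```
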
Example-2:
Question: From a standard deck of playing cards, a single card is drawn. The probability
that the card is a king is 4/52. Calculate the posterior probability P(King|Face), the probability that
the drawn face card is a king.
Solution:
P(King): probability that the card is a king = 4/52 = 1/13
P(Face): probability that the card is a face card = 12/52 = 3/13
P(Face|King): probability of a face card given a king = 1, since every king is a face card
Putting these values into Bayes' rule:
P(King|Face) = P(Face|King) P(King) / P(Face) = (1 × 1/13) / (3/13) = 1/3.
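Example-2 can likewise be verified in a few lines of Python:

```python
# Bayes' rule applied to the playing-card example.
p_king = 4 / 52              # P(King) = 1/13
p_face = 12 / 52             # P(Face) = 3/13 (J, Q, K in each of 4 suits)
p_face_given_king = 1.0      # every king is a face card

# P(King | Face) = P(Face | King) * P(King) / P(Face)
p_king_given_face = p_face_given_king * p_king / p_face
print(p_king_given_face)     # = 1/3
```
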
Applications of Bayes' theorem in AI:
o It is used to calculate the next step of a robot when the already executed step is given.
o Bayes' theorem is helpful in weather forecasting.
o It can solve the Monty Hall problem.
3. Probabilistic reasoning
Probabilistic reasoning:
We use probability in probabilistic reasoning because it provides a way to handle the uncertainty
that results from someone's laziness or ignorance.
In the real world, there are lots of scenarios where the certainty of something is not confirmed,
such as "It will rain today," "the behavior of someone in some situation," or "a match between two
teams or two players." These are probable sentences: we can assume they may happen but we are not
sure about them, so here we use probabilistic reasoning.
In probabilistic reasoning, there are two ways to solve problems with uncertain knowledge:
o Bayes' rule
o Bayesian Statistics
As probabilistic reasoning uses probability and related terms, so before understanding
probabilistic reasoning, let's understand some common terms:
Probability: Probability can be defined as the chance that an uncertain event will occur. It is a
numerical measure of the likelihood that an event will occur. The value of a probability always
lies between 0 and 1: P(A) = 0 indicates total impossibility of event A, and P(A) = 1 indicates
total certainty of event A.
We can find the probability of an uncertain event by using the formula:
P(A) = (number of outcomes in which A occurs) / (total number of outcomes)
and P(¬A) = 1 − P(A), where ¬A is the event that A does not occur.
Sample space: The collection of all possible events is called sample space.
Random variables: Random variables are used to represent the events and objects in the real world.
Prior probability: The prior probability of an event is probability computed before observing new
information.
Posterior Probability: The probability that is calculated after all evidence or information has been
taken into account. It is a combination of the prior probability and the new information.
Conditional probability:
Conditional probability is the probability of an event occurring when another event has already happened.
Let's suppose we want to calculate the probability of event A when event B has already occurred, "the
probability of A under the conditions of B". It is given as:
P(A|B) = P(A⋀B) / P(B)
where P(A⋀B) is the joint probability of A and B, and P(B) is the marginal probability of B.
If the probability of A is given and we need to find the probability of B given A, then it will be given as:
P(B|A) = P(A⋀B) / P(A)
It can be explained by using a Venn diagram: once B has occurred, the sample space is reduced to
set B, and we can calculate the probability of event A given B by dividing P(A⋀B) by P(B).
Example:
In a class, 70% of the students like English and 40% of the students like both English and
mathematics. What percentage of the students who like English also like mathematics?
Solution:
P(Math|English) = P(English⋀Math) / P(English) = 0.4 / 0.7 ≈ 0.57
Hence, 57% of the students who like English also like mathematics.
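The conditional-probability formula for this example translates directly into Python:

```python
# Conditional probability: P(Math | English) = P(English AND Math) / P(English)
p_english = 0.70            # fraction of students who like English
p_english_and_math = 0.40   # fraction who like both subjects

p_math_given_english = p_english_and_math / p_english
print(round(p_math_given_english, 2))   # 0.57
```
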
4. Bayesian networks or Belief networks
A Bayesian belief network is a key technology for dealing with probabilistic events and for
solving problems that involve uncertainty. We can define a Bayesian network as:
"A Bayesian network is a probabilistic graphical model which represents a set of variables
and their conditional dependencies using a directed acyclic graph."
It is also called a Bayes network, belief network, decision network, or Bayesian model.
Bayesian networks are probabilistic, because these networks are built from a probability
distribution, and also use probability theory for prediction and anomaly detection.
Real world applications are probabilistic in nature, and to represent the relationship between
multiple events, we need a Bayesian network. It can also be used in various tasks including
prediction, anomaly detection, diagnostics, automated insight, reasoning, time series
prediction, and decision making under uncertainty.
A Bayesian network can be used for building models from data and experts' opinions, and it consists of two
parts:
o A directed acyclic graph
o A table of conditional probabilities
The generalized form of a Bayesian network that represents and solves decision problems under
uncertain knowledge is known as an influence diagram.
A Bayesian network graph is made up of nodes and Arcs (directed links), where:
o Each node corresponds to the random variables, and a variable can be continuous or discrete.
o Arcs or directed arrows represent the causal relationships or conditional probabilities
between random variables. These directed links connect pairs of nodes in the graph.
A link represents that one node directly influences the other node; if there is no
directed link, the nodes are independent of each other.
o In the above diagram, A, B, C, and D are random variables represented by
the nodes of the network graph.
o If we are considering node B, which is connected with node A by a directed
arrow, then node A is called the parent of Node B.
o Node C is independent of node A.
Note: The Bayesian network graph does not contain any cycles. Hence, it is known as
a directed acyclic graph, or DAG.
The Bayesian network has mainly two components:
o Causal Component
o Actual numbers
Each node in the Bayesian network has a conditional probability distribution P(Xi | Parent(Xi)),
which determines the effect of the parents on that node.
Bayesian network is based on Joint probability distribution and conditional probability. So let's
first understand the joint probability distribution:
Joint probability distribution:
If we have variables x1, x2, x3, ..., xn, then the probabilities of the different combinations of
x1, x2, x3, ..., xn are known as the joint probability distribution.
P[x1, x2, x3, ..., xn] can be written in terms of conditional probabilities using the chain rule:
P[x1, x2, x3, ..., xn] = P[x1 | x2, x3, ..., xn] P[x2 | x3, ..., xn] ... P[xn-1 | xn] P[xn]
In general, for each variable Xi in a Bayesian network, we can write the equation as:
P(Xi | Xi-1, ..., X1) = P(Xi | Parents(Xi))
Let's understand the Bayesian network through an example by creating a directed acyclic graph:
Example: Harry installed a new burglar alarm at his home to detect burglary. The alarm
reliably detects a burglary but also responds to minor earthquakes. Harry has two
neighbors, David and Sophia, who have taken responsibility for informing Harry at
work when they hear the alarm. David always calls Harry when he hears the alarm, but
sometimes he gets confused with the phone ringing and calls then too. On the other
hand, Sophia likes to listen to loud music, so sometimes she fails to hear the alarm. Here
we would like to compute the probability of the burglar alarm going off.
Problem:
Calculate the probability that the alarm has sounded, but neither a burglary nor an
earthquake has occurred, and David and Sophia both called Harry.
Solution:
o The Bayesian network for the above problem is given below. The network structure
shows that Burglary and Earthquake are the parent nodes of Alarm and directly affect
the probability of the alarm going off, while David's and Sophia's calls depend on the
Alarm node.
o The network represents the assumptions that the neighbors do not directly perceive the burglary,
do not notice the minor earthquake, and do not confer before calling.
o The conditional distribution for each node is given as a conditional probabilities table,
or CPT.
o Each row in a CPT must sum to 1 because the entries in the row represent an
exhaustive set of cases for the variable.
o In a CPT, a boolean variable with k boolean parents requires 2^k rows of probabilities. Hence, if there
are two parents, the CPT will contain 4 probability values.
List of all events occurring in this network:
o Burglary (B)
o Earthquake(E)
o Alarm(A)
o David Calls(D)
o Sophia calls(S)
We can write the events of the problem statement in the form of probability as P[D, S, A, B, E], and
rewrite this probability statement using the joint probability distribution:
P[D, S, A, B, E] = P[D|A] P[S|A] P[A|B, E] P[B] P[E]
Let's take the observed prior probabilities for Burglary and Earthquake: P(B) = 0.002, so
P(¬B) = 0.998, and P(E) = 0.001, so P(¬E) = 0.999. The alarm goes off with probability
P(A|¬B, ¬E) = 0.001 when there is neither a burglary nor an earthquake.
The conditional probability that David calls depends on the Alarm node: P(D|A) = 0.91.
The conditional probability that Sophia calls also depends on its parent node Alarm: P(S|A) = 0.75.
From the formula of the joint distribution, we can write the problem statement in the form of a probability
distribution:
P(S, D, A, ¬B, ¬E) = P(S|A) P(D|A) P(A|¬B, ¬E) P(¬B) P(¬E)
= 0.75 × 0.91 × 0.001 × 0.998 × 0.999
= 0.00068045.
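The product can be checked with a short Python sketch. The CPT values below are the ones conventionally used for this burglar-alarm example; they are consistent with the 0.00068045 result stated in the text:

```python
# Joint probability of Sophia calls, David calls, alarm sounds,
# no burglary, no earthquake, via the Bayesian-network chain rule.
p_not_b = 0.998                  # P(not Burglary)
p_not_e = 0.999                  # P(not Earthquake)
p_a_given_not_b_not_e = 0.001    # P(Alarm | no burglary, no earthquake)
p_d_given_a = 0.91               # P(David calls | Alarm)
p_s_given_a = 0.75               # P(Sophia calls | Alarm)

# P(S, D, A, ~B, ~E) = P(S|A) P(D|A) P(A|~B,~E) P(~B) P(~E)
p = p_s_given_a * p_d_given_a * p_a_given_not_b_not_e * p_not_b * p_not_e
print(round(p, 8))               # 0.00068045
```
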
Hence, a Bayesian network can answer any query about the domain by using the joint probability distribution.
There are two ways to compute with the semantics of a Bayesian network, which are given below:
1. Exact inference:
In exact inference, we analytically compute the conditional probability distribution over the
variables of interest.
But sometimes, that’s too hard to do, in which case we can use approximation techniques
based on statistical sampling
General question: What’s the whole probability distribution over variable X given evidence
e, P(X | e)?
In our discrete probability situation, the only way to answer a MAP query is to compute the
probability of x given e for all possible values of x and see which one is greatest
So, in general, we’d like to be able to compute a whole probability distribution over some
variable or variables X, given instantiations of a set of variables e
To answer any query involving a conjunction of variables, sum over the variables not
involved in the query
Given the joint distribution over the variables, we can easily answer any question about the
value of a single variable by summing (or marginalizing) over the other variables.
So, in a domain with four variables, A, B, C, and D, the probability that variable D has value
d is the sum over all possible combinations of values of the other three variables of the joint
probability of all four values. This is exactly the same as the procedure we went through in
the last lecture, where to compute the probability of cavity, we added up the probability of
cavity and toothache and the probability of cavity and not toothache.
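This marginalization procedure can be sketched in Python over a made-up joint table; the four binary variables A, B, C, D and their weights below are invented for illustration:

```python
import itertools

# A made-up joint distribution over four binary variables (A, B, C, D):
# arbitrary positive weights, normalized so the table sums to 1.
raw = {assign: 1 + i
       for i, assign in enumerate(itertools.product([0, 1], repeat=4))}
total = sum(raw.values())
joint = {assign: w / total for assign, w in raw.items()}

# Marginalize: P(D = d) = sum over a, b, c of P(a, b, c, d)
def p_d(d):
    return sum(joint[(a, b, c, d)]
               for a, b, c in itertools.product([0, 1], repeat=3))

print(p_d(0), p_d(1))            # the two values sum to 1
```

This is exact but costs time exponential in the number of summed-out variables, which is precisely the problem variable elimination addresses below.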
In general, we’ll use the first notation, with a single summation indexed by a list of variable
names, and a joint probability expression that mentions values of those variables. But here
we can see the completely written-out definition, just so we all know what the shorthand is
supposed to mean.
In the numerator, here, you can see that we’re only summing over variables A and C, because
b and d are instantiated in the query.
We’re going to learn a general purpose algorithm for answering these joint queries fairly
efficiently. We’ll start by looking at a very simple case to build up our intuitions, then we’ll
write down the algorithm, then we’ll apply it to a more complex case.
Here’s our very simple case. It’s a bayes net with four nodes, arranged in a chain.
So, we know from before that the probability that variable D has some value little d is the
sum over A, B, and C of the joint distribution, with d fixed.
Now, using the chain rule of Bayesian networks, we can write down the joint probability as a
product over the nodes of the probability of each node’s value given the values of its parents.
So, in this case, we get P(d|c) times P(c|b) times P(b|a) times P(a).
This expression gives us a method for answering the query, given the conditional
probabilities that are stored in the net. And this method can be applied directly to any other
bayes net. But there’s a problem with it: it requires enumerating all possible combinations of
assignments to A, B, and C, and then, for each one, multiplying the factors for each node.
That’s an enormous amount of work and we’d like to avoid it if at all possible.
So, we’ll try rewriting the expression into something that might be more efficient to evaluate.
First, we can make our summation into three separate summations, one over each variable.
Then, by distributivity of addition over multiplication, we can push the summations in, so that
the sum over A includes all the terms that mention A, but no others, and so on. It’s pretty
clear that this expression is the same as the previous one in value, but it can be evaluated
more efficiently. We’re still, eventually, enumerating all assignments to the three variables,
but we’re doing somewhat fewer multiplications than before. So this is still not completely
satisfactory.
If you look, for a minute, at the terms inside the summation over A, you’ll see that we’re
doing these multiplications over for each value of C, which isn’t necessary, because they’re
independent of C. Our idea, here, is to do the multiplications once and store them for later
use. So, first, for each value of A and B, we can compute the product, generating a two
dimensional matrix.
Then, we can sum over the rows of the matrix, yielding one value of the sum for each
possible value of b.
We can continue the process in basically the same way. We can look at the summation over b
and see that the only other variable it involves is c. We can summarize those products as a set
of factors, one for each value of c. We’ll call those factors f_2 of c.
We substitute f_2 of c into the formula, remove node b from the diagram, and now we’re
down to a simple expression in which d is known and we have to sum over values of c.
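The chain computation described above can be sketched in Python. The chain A → B → C → D and all CPT numbers below are invented for illustration; f1 and f2 are the intermediate factors from the text:

```python
# Variable elimination on the chain A -> B -> C -> D with binary variables.
p_a = {0: 0.6, 1: 0.4}                                        # P(A)
p_b_a = {(0, 0): 0.7, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.8}  # P(B=b | A=a), key (a, b)
p_c_b = {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.4, (1, 1): 0.6}  # P(C=c | B=b), key (b, c)
p_d_c = {(0, 0): 0.5, (0, 1): 0.5, (1, 0): 0.1, (1, 1): 0.9}  # P(D=d | C=c), key (c, d)

# Eliminate A: f1(b) = sum_a P(a) * P(b | a)
f1 = {b: sum(p_a[a] * p_b_a[(a, b)] for a in (0, 1)) for b in (0, 1)}
# Eliminate B: f2(c) = sum_b f1(b) * P(c | b)
f2 = {c: sum(f1[b] * p_c_b[(b, c)] for b in (0, 1)) for c in (0, 1)}
# Eliminate C: P(d) = sum_c f2(c) * P(d | c)
p_d = {d: sum(f2[c] * p_d_c[(c, d)] for c in (0, 1)) for d in (0, 1)}

print(p_d)    # a proper distribution over D
```

Each factor is a small table over one variable, so the work per step is constant here, instead of enumerating all eight assignments of A, B, C at once.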
Given a Bayesian network and an elimination order X1, ..., Xm for the non-query variables, compute the query as follows:
For i = m downto 1: gather all factors that mention Xi, multiply them together, sum out Xi, and put the resulting factor back into the set of factors.
That was a simple special case. Now we can look at the algorithm in the general case. Let’s
assume that we’re given a Bayesian network and an ordering on the variables that aren’t
fixed in the query. We’ll come back later to the question of the influence of the order, and
how we might find a good one.
We can express the probability of the query variables as a sum over each value of each of the
non- query variables of a product over each node in the network, of the probability that that
variable has the given value given the values of its parents.
So, we’ll eliminate the variables from the inside out. Starting with variable Xm and finishing
with variable X1.
To eliminate variable Xi, we start by gathering up all of the factors that mention Xi, and
removing them from our set of factors. Let’s say there are k such factors.
Now, we make a k+1 dimensional table, indexed by Xi as well as each of the other variables
that are mentioned in our set of factors. We fill each entry with the product of the k factors,
and then sum out Xi, producing a k-dimensional table. This table is our new factor, and we put
a term for it back into our set of factors. Once we've eliminated all the summations, we have
the desired value.
So, we start by eliminating V. We gather the two terms that mention V and see that they also
involve variable T. So, we compute the product for each value of T, and summarize those in
the factor f1 of T.
Now we can substitute that factor in for the summation, and remove the node from the network.
The next variable to be eliminated is X. There is actually only one term involving X, and it
also involves variable A. So, for each value of A, we compute the sum over X of P(x|a). But
wait! We know what this value is! If we fix a and sum over x, these probabilities have to add
up to 1.
So, rather than adding another factor to our expression, we can just remove the whole sum. In
general, the only nodes that will have an influence on the probability of D are its ancestors.
Now, it’s time to eliminate S. We find that there are three terms involving S, and we gather
them into the sum. These three terms involve two other variables, B and L. So we have to
make a factor that specifies, for each value of B and L, the value of the sum of products.
Now we eliminate T. It involves two terms, which themselves involve variables A and L. So
we make a new factor f3 of A and L.
Next we eliminate L. It involves these two factors, which depend on variables A and B. So
we make a new factor, f4 of A and B, and substitute it in. We remove node L, but couple A
and B.
At this point, we could just do the summations over A and B and be done. But to finish out
the algorithm the way a computer would, it’s time to eliminate variable B.
It involves both of our remaining terms, and it seems to depend on variables A and D.
However, in this case, we’re interested in the probability of a particular value, little d of D,
and so the variable d is instantiated. Thus, we can treat it as a constant in this expression, and
we only need to generate a factor over a, which we’ll call f5 of a. And we can now, in some
sense, remove D from our network as well (because we’ve already factored it into our
answer).
Finally, to get the probability that variable D has value little d, we simply sum factor f5 over
all values of a. Yay! We did it.
Let’s see how the variable elimination algorithm performs, both in theory and in practice.
First of all, it’s pretty easy to see that it runs in time exponential in the number of variables
involved in the largest factor. Creating a factor with k variables involves making a k+1
dimensional table. If you have b values per variable, that’s a table of size b^(k+1). To make each
entry, you have to multiply at most n numbers, where n is the number of nodes. We have to do
this for each variable to be eliminated (which is usually close to n). So we have something like
time = O(n^2 b^k).
How big the factors are depends on the elimination order. You’ll see in one of the recitation
exercises just how dramatic the difference in factor sizes can be. A bad elimination order can
generate huge factors.
So, we’d like to use the elimination order that generates the smallest factors. Unfortunately, it
turns out to be NP hard to find the best elimination order.
At least, there are some fairly reasonable heuristics for choosing an elimination order. It’s
usually done dynamically. So, rather than fixing the elimination order in advance, as we
suggested in the algorithm description, you can pick the next variable to be eliminated depending
on the situation. In particular, one reasonable heuristic is to pick the variable to eliminate next
that will result in the smallest factor. This greedy approach won’t always be optimal, but it’s not
usually too bad.
There is one case where Bayes net inference in general, and the variable elimination algorithm in
particular, is fairly efficient, and that's when the network is a polytree. A polytree is a network
with no undirected cycles; that is, a network in which, for any two nodes, there is at most one path
between them. In a polytree,
inference is linear in the size of the network, where the size of the network is defined to be the size
of the largest conditional probability table (or exponential in the maximum number of parents of
any node). In a polytree, the optimal elimination order is to start at the root nodes, and work
downwards, always eliminating a variable that no longer has any parents. In doing so, we never
introduce additional connections into the network.
So, inference in polytrees is efficient, and even in many large non-polytree networks, it’s possible
to keep the factors small, and therefore to do inference relatively efficiently.
When the network is such that the factors are, of necessity, large, we’ll have to turn to a different
class of methods.
2. Approximate inference:
Sampling
Another strategy, which is a theme that comes up also more and more in AI actually, is to say,
well, we didn't really want the right answer anyway. Let's try to do an approximation. And you
can also show that it's computationally hard to get an approximation that's within epsilon of the
answer that you want, but again that doesn't keep us from trying.
So, the other thing that we can do is the stochastic simulation or sampling. In sampling, what we
do is we look at the root nodes of our graph, and attached to this root node is some probability
that A is going to be true, right? Maybe it's .4. So we flip a coin that comes up heads with
probability .4 and see if we get true or false.
We flip our coin, let's say, and we get true for A -- this time. And now, given the assignment of
true to A, we look in the conditional probability table for B given A = true, and that gives us a
probability for B.
Now, we flip a coin with that probability. Say we get False. We enter that into the
table. We do the same thing for C, and let’s say we get True.
Now, we look in the CPT for D given B and C, for the case where B is false and C is true, and we
flip a coin with that probability, in order to get a value for D.
So, there's one sample from the joint distribution of these four variables. And you can just keep
doing this, all day and all night, and generate a big pile of samples, using that algorithm. And now
you can ask various questions.
Estimate:
P*(D|A) = #D,A / #A
Let's say you want to know the probability of D given A. How would you answer - - given all the examples
-- what would you do to compute the probability of D given A? You would just count. You’d
count the number of cases in which A and D were true, and you’d divide that by the number of
cases in which A was true, and that would give you an unbiased estimate of the probability of D
given A. The more samples, the more confidence you’d have that the estimated probability is
close to the true one.
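The sampling-and-counting estimator can be sketched in Python. The network shape follows the four-node A, B, C, D example from the text (here with B and C both children of A, and D a child of B and C); all CPT numbers are invented:

```python
import random

random.seed(1)                             # reproducible samples
P_A = 0.4                                  # P(A = true)
P_B = {True: 0.7, False: 0.2}              # P(B = true | A)
P_C = {True: 0.5, False: 0.9}              # P(C = true | A)
P_D = {(True, True): 0.99, (True, False): 0.8,
       (False, True): 0.6, (False, False): 0.05}  # P(D = true | B, C)

def sample():
    """Draw one sample from the joint, parents before children."""
    a = random.random() < P_A
    b = random.random() < P_B[a]
    c = random.random() < P_C[a]
    d = random.random() < P_D[(b, c)]
    return a, b, c, d

samples = [sample() for _ in range(100000)]
n_a = sum(1 for a, b, c, d in samples if a)
n_da = sum(1 for a, b, c, d in samples if a and d)
est = n_da / n_a                           # P*(D | A) = #D,A / #A
print(round(est, 2))
```

With these CPTs the true value of P(D|A) is 0.724, so the estimate should land close to that; more samples shrink the error.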
Estimation
Exactly because of the process we’re using to generate the samples, the majority of them will be
the typical cases. Oh, it's someone with a cold, someone with a cold, someone with a cold,
someone with a cold, someone with a cold, someone with malaria, someone with a cold, someone
with a cold. So the rare results are not going to come up very often. And so doing this sampling
naively can make it really hard to estimate the probability of a rare event. If it's something that
happens one in ten thousand times, well, you know for sure you're going to need, some number
of tens of thousands of samples to get even a reasonable estimate of that probability.
Imagine that you want to estimate the probability of some disease given -- oh, I don't know --
spots on your tongue and a sore toe. Somebody walks in and they have a really peculiar set of
symptoms, and you want to know what's the probability that they have some disease.
Well, if the symptoms are root nodes, it's easy. If the symptoms were root nodes, you could just
assign the root nodes to have their observed values and then simulate the rest of the network as
before.
But if the symptoms aren't root nodes then if you do naïve sampling, you would generate a giant
table of samples, and you'd have to go and look and say, gosh, how many cases do I have where
somebody has spots on their tongue and a sore toe; and the answer would be, well, maybe zero
or not very many.
There’s a technique called importance sampling, which allows you to draw examples from a
distribution that’s going to be more helpful and then reweight them so that you can still get an
unbiased estimate of the desired conditional probability. It’s a bit beyond the scope of this class
to get into the details, but it’s an important and effective idea.
Recitation Problem
• Do the variable elimination algorithm on the net below using the elimination order A,B,C (that
is, eliminate node C first). In computing P(D=d), what factors do you get?
Here’s the network we started with. We used elimination order C, B, A (we eliminated A first).
Now we’re going to explore what happens when we eliminate the variables in the opposite order.
First, work on the case we did, where we're trying to calculate the probability that node D takes
on a particular value, little d. Remember that little d is a constant in this case. Now, do the
case where we're trying to find the whole distribution over D, so we don't know a particular
value for little d.
Find an elimination order that keeps the factors small for the net below, or show that there is no
such order.
Here’s a pretty complicated graph. But notice that no node has more than 2 parents, so none of
the CPTs are huge. The question is, is this graph hard for variable elimination? More concretely,
can you find an elimination order that results only in fairly small factors? Is there an elimination
order that generates a huge factor?
Bayesian networks (or related models) are often used in computer vision, but they almost always
require sampling. What happens when you try to do variable elimination on a model like the grid
below?
6. Causal Networks: