Module-4
Bayesian Learning
CONTENT
• Introduction
• Bayes theorem
• Bayes theorem and concept learning
• Maximum likelihood and Least Squared Error
Hypothesis
• Maximum likelihood Hypotheses for predicting
probabilities
• Minimum Description Length Principle
• Naive Bayes classifier
• Bayesian belief networks
• EM algorithm
INTRODUCTION
Bayesian learning methods are relevant to the
study of machine learning for two different
reasons; both stem from the probabilistic
approach they take to inference.
• First, Bayesian learning algorithms that calculate
explicit probabilities for hypotheses, such as the
naive Bayes classifier, are among the most
practical approaches to certain types of learning
problems
• The second reason is that they provide a useful
perspective for understanding many learning
algorithms that do not explicitly manipulate
probabilities
Features of Bayesian Learning Methods
• Each observed training example can incrementally decrease or
increase the estimated probability that a hypothesis is correct. This
provides a more flexible approach to learning than algorithms that
completely eliminate a hypothesis if it is found to be inconsistent
with any single example
• Prior knowledge can be combined with observed data to determine
the final probability of a hypothesis. In Bayesian learning, prior
knowledge is provided by asserting (1) a prior probability for each
candidate hypothesis, and (2) a probability distribution over
observed data for each possible hypothesis.
• Bayesian methods can accommodate hypotheses that make
probabilistic predictions
• New instances can be classified by combining the predictions of
multiple hypotheses, weighted by their probabilities.
• Even in cases where Bayesian methods prove computationally
intractable, they can provide a standard of optimal decision making
against which other practical methods can be measured.
Practical difficulty in applying Bayesian
methods
• One practical difficulty in applying Bayesian methods is that
they typically require initial knowledge of many
probabilities. When these probabilities are not known in
advance they are often estimated based on background
knowledge, previously available data, and assumptions
about the form of the underlying distributions.
• A second practical difficulty is the significant computational
cost required to determine the Bayes optimal hypothesis in
the general case. In certain specialized situations, this
computational cost can be significantly reduced.
BAYES THEOREM
• Bayes theorem provides a way to calculate the probability
of a hypothesis based on its prior probability, the
probabilities of observing various data given the
hypothesis, and the observed data itself.
• Notations
• P(h) denotes the initial probability that hypothesis h holds,
before we have observed the training data. P(h) is often
called the prior probability of h and may reflect any
background knowledge about the chance that h is correct
• P(D) prior probability of D, probability that D will be
observed
• P(D|h) probability of observing D given a world in which h
holds
• P(h|D) posterior probability of h, reflects confidence that h
holds after D has been observed
• Bayes theorem is the cornerstone of Bayesian learning
methods because it provides a way to calculate the
posterior probability P(h|D) from the prior probability
P(h), together with P(D) and P(D|h).
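• Bayes theorem:
P(h|D) = P(D|h) P(h) / P(D)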
• P(h|D) increases with P(h) and with P(D|h) according
to Bayes theorem.
• P(h|D) decreases as P(D) increases, because the more
probable it is that D will be observed independent of h,
the less evidence D provides in support of h.
Maximum a Posteriori (MAP) Hypothesis
• In many learning scenarios, the learner considers some set of
candidate hypotheses H and is interested in finding the most
probable hypothesis h ∈ H given the observed data D. Any such
maximally probable hypothesis is called a maximum a posteriori
(MAP) hypothesis.
• Using Bayes theorem to calculate the posterior probability of each
candidate hypothesis, hMAP is a MAP hypothesis provided
hMAP = argmax_{h∈H} P(h|D) = argmax_{h∈H} P(D|h) P(h) / P(D) = argmax_{h∈H} P(D|h) P(h)
• In the final step, P(D) can be dropped because it is a constant independent of h
Maximum Likelihood (ML) Hypothesis
• In some cases, it is assumed that every hypothesis in H is
equally probable a priori (P(hi) = P(hj) for all hi and hj in H).
• In this case the equation for hMAP can be simplified: we need
only consider the term P(D|h) to find the most probable
hypothesis,
hML = argmax_{h∈H} P(D|h)
• P(D|h) is often called the likelihood of the data D given h,
and any hypothesis that maximizes P(D|h) is called a
maximum likelihood (ML) hypothesis
• Examples Solved in class
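• As a concrete illustration, consider the familiar two-hypothesis medical-diagnosis setting (the priors and likelihoods below are the numbers usually quoted for this example and are assumed here). A minimal Python sketch of the MAP computation:

```python
# MAP inference for two hypotheses given a positive lab test.
# Priors and likelihoods are the usual illustrative numbers (assumed).
p_cancer = 0.008            # P(cancer)
p_not_cancer = 0.992        # P(~cancer)
p_pos_given_cancer = 0.98   # P(+ | cancer)
p_pos_given_not = 0.03      # P(+ | ~cancer)

# Unnormalized posteriors P(D|h) * P(h) for the observed data D = "+"
post_cancer = p_pos_given_cancer * p_cancer       # 0.0078
post_not_cancer = p_pos_given_not * p_not_cancer  # 0.0298

# hMAP is the hypothesis with the larger unnormalized posterior
h_map = "cancer" if post_cancer > post_not_cancer else "no cancer"
print(h_map)  # -> "no cancer"

# Normalizing by their sum gives the exact posterior, e.g. P(cancer | +) ~ 0.21
p_cancer_given_pos = post_cancer / (post_cancer + post_not_cancer)
```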
Brute-Force Bayes Concept Learning
• Consider the concept learning problem first introduced in Chapter
2. In particular, assume the learner considers some finite hypothesis
space H defined over the instance space X, in which the task is to
learn some target concept c : X → {0, 1}.
• As usual, we assume that the learner is given some sequence of
training examples ((x1, d1), . . . , (xm, dm)) where xi is some instance
from X and where di is the target value of xi (i.e., di = c(xi)). To
simplify the discussion in this section, we assume the sequence of
instances (x1 . . . xm) is held fixed, so that the training data D can
be written simply as the sequence of target values D = (d1 . . . dm).
It can be shown that this simplification does not alter the main
conclusions of this section.
Brute-Force Bayes Concept Learning
• We can design a straightforward concept learning algorithm to output the
maximum a posteriori hypothesis, based on Bayes theorem, as follows:
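• The algorithm itself, as a minimal Python sketch (the hypothesis space H, the prior P(h), and the likelihood P(D|h) are supplied by the particular learning task; the names below are placeholders):

```python
def brute_force_map_learning(H, prior, likelihood, D):
    """Return the hypothesis h in H that maximizes the posterior P(h|D).

    H          -- finite iterable of candidate hypotheses
    prior      -- function prior(h) returning P(h)
    likelihood -- function likelihood(D, h) returning P(D|h)
    D          -- the observed training data
    """
    # 1. For each h in H, calculate the unnormalized posterior P(D|h) * P(h).
    #    The normalizer P(D) is the same for every h, so the argmax is unchanged.
    posterior = {h: likelihood(D, h) * prior(h) for h in H}
    # 2. Output the hypothesis hMAP with the highest posterior probability.
    return max(posterior, key=posterior.get)
```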
• In order to specify a learning problem for the BRUTE-FORCE
MAP LEARNING algorithm, we must specify what values are
to be used for P(h) and for P(D|h). Let us choose P(h) and
P(D|h) to be consistent with the following assumptions:
1. The training data D is noise free (i.e., di = c(xi))
2. The target concept c is contained in the hypothesis space H
3. We have no a priori reason to believe that any hypothesis is
more probable than any other.
• What values should we specify for P(h)?
– Given no prior knowledge that one hypothesis is more likely
than another, it is reasonable to assign the same prior
probability to every hypothesis h in H.
– Assuming the target concept is contained in H and requiring that
these prior probabilities sum to 1, we get
P(h) = 1 / |H| for all h in H
What choice shall we make for P(D|h)?
• P(D|h) is the probability of observing the target
values D = (d1 . . .dm) for the fixed set of instances
(x1 . . . xm), given a world in which hypothesis h
holds
• Since we assume noise-free training data, the
probability of observing classification di given h is
just 1 if di = h(xi) and 0 if di ≠ h(xi). Therefore,
P(D|h) = 1 if di = h(xi) for all di in D
P(D|h) = 0 otherwise
• Given these choices for P(h) and for P(D|h) we
now have a fully-defined problem for the
above BRUTE-FORCE MAP LEARNING
algorithm.
• As a first step, we have to determine the
posterior probabilities P(h|D)
• To summarize, Bayes theorem implies that the
posterior probability P(h|D) under our
assumed P(h) and P(D|h) is
P(h|D) = 1 / |VS_H,D| if h is consistent with D
P(h|D) = 0 otherwise
• where |VS_H,D| is the number of hypotheses
from H consistent with D (the size of the version space of H with respect to D)
The Evolution of Probabilities Associated with
Hypotheses
• In Figure (a), all hypotheses have the same prior probability.
• In Figures (b) and (c), as training data accumulates, the posterior probability
of inconsistent hypotheses falls to zero, while the total probability
(summing to 1) is shared equally among the remaining consistent
hypotheses.
MAP Hypotheses and Consistent Learners
• A learning algorithm is a consistent learner if it outputs a hypothesis that
commits zero errors over the training examples.
• Every consistent learner outputs a MAP hypothesis, if we assume a
uniform prior probability distribution over H (P(hi) = P(hj) for all i, j), and
deterministic, noise free training data (P(D|h) =1 if D and h are consistent,
and 0 otherwise).
Example:
• Because FIND-S outputs a consistent hypothesis, it will output a MAP hypothesis
under the probability distributions P(h) and P(D|h) defined above.
• Are there other probability distributions for P(h) and P(D|h) under which
FIND-S outputs MAP hypotheses? Yes.
• Because FIND-S outputs a maximally specific hypothesis from the version
space, its output hypothesis will be a MAP hypothesis relative to any prior
probability distribution that favors more specific hypotheses.
• Bayesian framework is a way to characterize
the behaviour of learning algorithms
• By identifying the probability distributions P(h)
and P(D|h) under which an algorithm's output is an
optimal (MAP) hypothesis, the implicit assumptions of
the algorithm can be characterized (Inductive
Bias)
• Inductive inference is modelled by an
equivalent probabilistic reasoning system
based on Bayes theorem
Normal Probability Distribution (Gaussian Distribution)
• A Normal distribution is a smooth, bell-shaped
distribution that can be completely characterized by
its mean μ and its standard deviation σ
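• Its probability density function is
p(x) = (1 / √(2πσ²)) e^(−(x − μ)² / (2σ²))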
MAXIMUM LIKELIHOOD AND LEAST-SQUARED
ERROR HYPOTHESES
• Consider the problem of learning a continuous-
valued target function such as neural network
learning, linear regression, and polynomial curve
fitting
• A straightforward Bayesian analysis will show that
under certain assumptions any learning algorithm
that minimizes the squared error between the
output hypothesis predictions and the training
data will output a maximum likelihood (ML)
hypothesis
Learning A Continuous-Valued Target Function
• Learner L considers an instance space X and a hypothesis space H
consisting of some class of real-valued functions defined over X, i.e., (∀ h
∈ H)[ h : X → R] and training examples of the form <xi,di>
• The problem faced by L is to learn an unknown target function f : X → R
• A set of m training examples is provided, where the target value of each
example is corrupted by random noise drawn according to a Normal
probability distribution with zero mean (di = f(xi) + ei)
• Each training example is a pair of the form (xi ,di ) where di = f (xi ) + ei .
- Here f(xi) is the noise-free value of the target function and ei is a random
variable representing the noise.
- It is assumed that the values of the ei are drawn independently and that
they are distributed according to a Normal distribution with zero mean.
• The task of the learner is to output a maximum likelihood hypothesis, or,
equivalently, a MAP hypothesis assuming all hypotheses are equally
probable a priori.
Learning A Linear Function
• The target function f corresponds to
the solid line.
• The training examples (xi , di ) are
assumed to have Normally
distributed noise ei with zero mean
added to the true target value f (xi ).
• The dashed line corresponds to the
hypothesis hML with least-squared
training error, hence the maximum
likelihood hypothesis.
• Notice that the maximum likelihood
hypothesis is not necessarily identical
to the correct hypothesis, f, because
it is inferred from only a limited
sample of noisy training data
Before showing why a hypothesis that minimizes the
sum of squared errors in this setting is also a maximum
likelihood hypothesis,
let us quickly review probability densities and Normal
distributions
Probability Density for continuous variables
e: a random noise variable generated by a Normal
probability distribution
<x1 . . . xm>: the sequence of instances (as before)
<d1 . . . dm>:the sequence of target values with di = f(xi) + ei
What is a probability density function?
** Just for understanding
• A probability density function (PDF) is a function of a continuous
random variable whose integral over an interval gives the probability
that the value of the variable lies within that interval.
• Equivalently, the value of the PDF at any given sample (point) in the
sample space (the set of possible values taken by the random variable)
can be interpreted as a relative likelihood that the random variable
takes a value near that point.
Why to choose PDF for continuous variables?
• The reason, roughly, is that we wish for the total probability
over all possible values of the random variable to sum to
one. In the case of continuous variables we cannot achieve
this by assigning a finite probability to each of the infinite
set of possible values for the random variable.
• Instead, we speak of a probability density for continuous
variables such as e and require that the integral of this
probability density over all possible values be one.
• In general we will use lower case p to refer to the
probability density function, to distinguish it from a finite
probability P (which we will sometimes refer to as a
probability mass). The probability density p(x0) is defined as the limit,
as ε goes to zero, of 1/ε times the probability that x will take
on a value in the interval [x0, x0 + ε):
p(x0) ≡ lim_{ε→0} (1/ε) P(x0 ≤ x < x0 + ε)
• Using the previous definition of hML we have
hML = argmax_{h∈H} p(D|h)
• Assuming the training examples are mutually independent given
h, we can write p(D|h) as the product of the individual p(di|h):
hML = argmax_{h∈H} Π_{i=1..m} p(di|h)
• Given that the noise ei obeys a Normal distribution with zero mean
and unknown variance σ², each di must also obey a Normal
distribution with variance σ² centered around the true target value
f(xi). Because we are writing the expression for p(D|h), we assume
h is the correct description of f; hence µ = f(xi) = h(xi), and
hML = argmax_{h∈H} Π_{i=1..m} (1 / √(2πσ²)) e^(−(di − h(xi))² / (2σ²))
• We now apply a transformation that is common in maximum
likelihood calculations: rather than maximizing the above
complicated expression, we maximize its (less complicated)
logarithm. This is justified because the logarithm is a monotonic
function of its argument.
• The logarithm turns the product of individual contributions into a
sum of contributions, which is much easier to handle:
hML = argmax_{h∈H} Σ_{i=1..m} [ ln(1 / √(2πσ²)) − (di − h(xi))² / (2σ²) ]
• The first term in this expression is a constant
independent of h and can therefore be discarded, giving
hML = argmax_{h∈H} Σ_{i=1..m} − (di − h(xi))² / (2σ²)
• Maximizing this negative quantity is equivalent to
minimizing the corresponding positive quantity.
• Finally, discarding the constant 1/(2σ²), which is also independent of h,
hML = argmin_{h∈H} Σ_{i=1..m} (di − h(xi))²
• Thus hML is the hypothesis that minimizes the sum of the squared errors
over the training data
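• A minimal numerical sketch of this result (the linear target function, noise level, and use of numpy are assumptions made purely for illustration): under zero-mean Gaussian noise, the ordinary least-squares fit is the maximum likelihood hypothesis.

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy training data: di = f(xi) + ei, with f(x) = 2x + 1 and Gaussian noise ei
x = np.linspace(0.0, 1.0, 20)
d = 2.0 * x + 1.0 + rng.normal(loc=0.0, scale=0.1, size=x.shape)

# Least-squares fit of a line h(x) = w1*x + w0: minimizes sum_i (di - h(xi))^2,
# which by the derivation above is also the maximum likelihood hypothesis hML.
A = np.column_stack([x, np.ones_like(x)])
(w1, w0), *_ = np.linalg.lstsq(A, d, rcond=None)
print(w1, w0)  # close to the true (2, 1), but not exact: only a finite noisy sample
```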
Why is it reasonable to choose the Normal distribution to
characterize noise?
• good approximation of many types of noise in physical
systems
• Central Limit Theorem shows that the sum of a sufficiently
large number of independent, identically distributed
random variables itself obeys a Normal distribution
Note that only noise in the target value is considered; noise in the
attributes describing the instances themselves is not modelled by this analysis
MAXIMUM LIKELIHOOD HYPOTHESES FOR
PREDICTING PROBABILITIES
• In the previous section we determined that
the maximum likelihood hypothesis is the one
that minimizes the sum of squared errors over
the training examples.
• Now we derive an analogous criterion for a
second setting that is common in neural
network learning: learning to predict
probabilities.
Example
• Consider the setting in which we wish to learn a nondeterministic
(probabilistic) function f : X → {0, 1}, which has two discrete
output values. For example, the instance space X might represent
medical patients in terms of their symptoms, and the target
function f (x) might be 1 if the patient survives the disease and 0 if
not.
• Alternatively, X might represent loan applicants in terms of their
past credit history, and f (x) might be 1 if the applicant successfully
repays their next loan and 0 if not.
• In both of these cases we might well expect f to be probabilistic. For
example, among a collection of patients exhibiting the same set of
observable symptoms, we might find that 92% survive, and 8% do
not. This unpredictability could arise from our inability to observe
all the important distinguishing features of the patients, or from
some genuinely probabilistic mechanism in the evolution of the
disease. Whatever the source of the problem, the effect is that we
have a target function f (x) whose output is a probabilistic function
of the input.
• Given this problem setting, we might wish to
learn a neural network (or other real-valued
function approximator) whose output is the
probability that f(x) = 1. In other words, we seek
to learn the target function f′ : X → [0, 1], such
that f′(x) = P(f(x) = 1).
• Let us assume the training data D is of the form D
= {(x1, d1) . . . (xm, dm)}, where di is the observed 0
or 1 value for f (xi).
• Treating both xi and di as random variables, and
assuming that each training example is drawn
independently, we can write P(D|h) as
P(D|h) = Π_{i=1..m} P(xi, di | h)
When xi is independent of h we can rewrite the above expression as
P(D|h) = Π_{i=1..m} P(di | h, xi) P(xi)
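• Carrying the derivation through (the remaining steps are reconstructed here from the standard treatment): since h(xi) is the hypothesized probability that f(xi) = 1, we have P(di | h, xi) = h(xi) if di = 1 and 1 − h(xi) if di = 0, which can be written compactly as
P(di | h, xi) = h(xi)^di (1 − h(xi))^(1−di)
• Dropping the factor P(xi), which is independent of h, and maximizing the logarithm of the resulting product gives
hML = argmax_{h∈H} Σ_{i=1..m} [ di ln h(xi) + (1 − di) ln(1 − h(xi)) ]
• The negative of the quantity being maximized is the cross entropy between the observed values di and the predicted probabilities h(xi); the maximum likelihood hypothesis is therefore the one that minimizes this cross entropy.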
MINIMUM DESCRIPTION LENGTH PRINCIPLE
• Recall from Chapter 3 the discussion of
Occam's razor, a popular inductive bias that
can be summarized as "choose the shortest
explanation for the observed data."
• In that chapter we discussed several
arguments in the long-standing debate
regarding Occam's razor. Here we consider a
Bayesian perspective on this issue and a
closely related principle called the Minimum
Description Length (MDL) principle.
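• For reference, the equation referred to below as Equation (6.16) is the MAP definition rewritten with base-2 logarithms:
hMAP = argmax_{h∈H} [ log2 P(D|h) + log2 P(h) ]
or, equivalently, as a minimization,
hMAP = argmin_{h∈H} [ − log2 P(D|h) − log2 P(h) ]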
• The above Equation (6.16) can be interpreted as a statement that
short hypotheses are preferred, assuming a particular
representation scheme for encoding hypotheses and data. To
explain this, let us introduce a basic result from information theory:
Consider the problem of designing a code to transmit messages
drawn at random, where the probability of encountering message i
is pi.
• We are interested here in the most compact code; that is, we are
interested in the code that minimizes the expected number of bits
we must transmit in order to encode a message drawn at random.
• Clearly, to minimize the expected code length we should assign
shorter codes to messages that are more probable. Shannon and
Weaver (1949) showed that the optimal code (i.e., the code that
minimizes the expected message length) assigns – log2 pi bits to
encode message i .
• We will refer to the number of bits required to encode message i
using code C as the description length of message i with respect
to C, which we denote by Lc(i).
Interpretation of the equation 6.16 from coding theory
• - log2 P(h) is the description length of h under the optimal
encoding for the hypothesis space H. In other words, this is
the size of the description of hypothesis h using this
optimal representation. In our notation, LCH (h) = -
log2P(h), where CH is the optimal code for hypothesis space
H.
• -log2 P(D|h) is the description length of the training data
D given hypothesis h, under its optimal encoding. In our
notation, L_CD|h(D|h) = − log2 P(D|h), where CD|h is the
optimal code for describing data D assuming that both the
sender and receiver know the hypothesis h.
• Therefore we can rewrite Equation (6.16) to show that
hMAP is the hypothesis h that minimizes the sum given by
the description length of the hypothesis plus the
description length of the data given the hypothesis.
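• Written out (following the standard statement of the Minimum Description Length principle), this is
hMAP = argmin_{h∈H} [ L_CH(h) + L_CD|h(D|h) ]
and the MDL principle recommends choosing, for arbitrary encodings C1 (for hypotheses) and C2 (for data given a hypothesis),
hMDL = argmin_{h∈H} [ L_C1(h) + L_C2(D|h) ]
hMDL equals hMAP when C1 and C2 are the optimal encodings CH and CD|h.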
NAIVE BAYES CLASSIFIER
• One highly practical Bayesian learning method is the naive
Bayes learner, often called the naive Bayes classifier. In
some domains its performance has been shown to be
comparable to that of neural network and decision tree
learning.
• The naive Bayes classifier applies to learning tasks where
each instance x is described by a conjunction of attribute
values and where the target function f ( x ) can take on any
value from some finite set V. A set of training examples of
the target function is provided, and a new instance is
presented, described by the tuple of attribute values (a1,
a2, ..., an). The learner is asked to predict the target value,
or classification, for this new instance.
• The Bayesian approach to classifying the new instance is to
assign the most probable target value, VMAP, given the
attribute values (a1, a2, ..., an) that describe the instance.
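• Written out, the value being sought is
VMAP = argmax_{vj∈V} P(vj | a1, a2, ..., an)
which, by Bayes theorem (dropping the denominator P(a1, ..., an), a constant across target values), becomes
VMAP = argmax_{vj∈V} P(a1, a2, ..., an | vj) P(vj)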
Now we could attempt to estimate the two terms in the above expression based on the
training data. It is easy to estimate each of the P(vj) simply by counting the frequency with
which each target value vj occurs in the training data. However, estimating the different
P(a1, a2, ..., an | vj) terms in this fashion is not feasible unless we have a very large set
of training data. The problem is that the number of these terms is equal to the number of
possible instances times the number of possible target values, so we would need to see
every instance in the instance space many times in order to obtain reliable estimates.
• The naive Bayes classifier is based on the
simplifying assumption that the attribute values
are conditionally independent given the target
value. In other words, the assumption is that
given the target value of the instance, the
probability of observing the conjunction a1, a2..
.an, is just the product of the probabilities for
the individual attributes:
• P(a1, a2, ..., an | vj) = Πi P(ai | vj)
• Substituting this into Equation (6.19), we obtain
the approach used by the naive Bayes classifier:
VNB = argmax_{vj∈V} P(vj) Πi P(ai | vj)
where VNB denotes the target value output by the naive Bayes classifier. Notice
that in a naive Bayes classifier the number of distinct P(ai | vj) terms that must
be estimated from the training data is just the number of distinct attribute values
times the number of distinct target values, a much smaller number than if we
were to estimate the P(a1, a2, ..., an | vj) terms as first contemplated.
To summarize, the naive Bayes learning method involves a learning step in which
the various P(vj) and P(ai | vj) terms are estimated, based on their frequencies
over the training data. The set of these estimates corresponds to the learned
hypothesis. This hypothesis is then used to classify each new instance by applying
the rule in above Equation . Whenever the naive Bayes assumption of conditional
independence is satisfied, this naive Bayes classification VNB is identical to the
MAP classification.
Example
• Naïve Bayesian classifier algorithm:
• Step 1: Convert the data set into a frequency
table
• Step 2: Create a likelihood table by computing the relevant
probabilities, e.g., P(Overcast) = 4/14 ≈ 0.29 and
P(Play = Yes) = 9/14 ≈ 0.64.
• Step 3: Now, use Naive Bayesian equation to calculate the
posterior probability for each class. The class with the
highest posterior probability is the outcome of prediction.
• Problem: Players will play if the weather is sunny. Is this
statement correct?
• We can solve it using above discussed method of posterior
probability.
• P(Yes | Sunny) = P( Sunny | Yes) * P(Yes) / P (Sunny)
• Here we have P (Sunny |Yes) = 3/9 = 0.33, P(Sunny) = 5/14
= 0.36, P( Yes)= 9/14 = 0.64
• Now, P(Yes | Sunny) = 0.33 * 0.64 / 0.36 = 0.60, which is
higher than P(No | Sunny) = 0.40, so the prediction is that the players will play.
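• A minimal Python sketch of this computation (the counts are taken from the standard 14-example weather/play table assumed by the example):

```python
# Counts assumed from the standard 14-example weather/play table
n_total = 14
n_yes, n_no = 9, 5      # play = yes / play = no
n_sunny = 5             # sunny days overall
n_sunny_and_yes = 3     # sunny days on which play = yes
n_sunny_and_no = 2      # sunny days on which play = no

p_yes = n_yes / n_total                      # P(Yes)        = 0.64
p_sunny = n_sunny / n_total                  # P(Sunny)      = 0.36
p_sunny_given_yes = n_sunny_and_yes / n_yes  # P(Sunny|Yes)  = 0.33

# Bayes theorem: P(Yes|Sunny) = P(Sunny|Yes) * P(Yes) / P(Sunny)
p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny
p_no_given_sunny = (n_sunny_and_no / n_no) * (n_no / n_total) / p_sunny

print(round(p_yes_given_sunny, 2), round(p_no_given_sunny, 2))  # 0.6 0.4
```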
Bayesian Belief Networks
Difference between Naïve Bayes and BBN-
• The naive Bayes classifier makes significant use of the assumption that the values of the
attributes a1 . . .an, are conditionally independent given the target value v. This assumption
dramatically reduces the complexity of learning the target function. When it is met, the naive
Bayes classifier outputs the optimal Bayes classification.
• However, in many cases this conditional independence assumption is clearly overly
restrictive.
• A Bayesian belief network describes the probability distribution governing a set of variables
by specifying a set of conditional independence assumptions along with a set of conditional
probabilities.
• In contrast to the naive Bayes classifier, which assumes that all the variables are
conditionally independent given the value of the target variable, Bayesian belief networks
allow stating conditional independence assumptions that apply to subsets of the variables.
• Thus, Bayesian belief networks provide an intermediate approach that is less constraining
than the global assumption of conditional independence made by the naive Bayes classifier,
but more tractable than avoiding conditional independence assumptions altogether.
• Bayesian belief networks are an active focus of current research, and a variety of algorithms
have been proposed for learning them and for using them for inference.
Definition -BBN
• In general, a Bayesian belief network describes the
probability distribution over a set of variables. Consider an
arbitrary set of random variables Y1 . . . Yn, where each
variable Yi can take on the set of possible values V(Yi).
• We define the joint space of the set of variables Y to be the
cross product V(Y1) x V(Y2) x . . . V(Yn).
• In other words, each item in the joint space corresponds to
one of the possible assignments of values to the tuple of
variables (Y1 . . . Yn). The probability distribution over this
joint space is called the joint probability distribution.
• The joint probability distribution specifies the probability
for each of the possible variable bindings for the tuple (Y1 .
. . Yn).
• A Bayesian belief network describes the joint probability
distribution for a set of variables.
Conditional Independence
Naïve Bayes Conditional Independence
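• Recall the standard definition: X is conditionally independent of Y given Z if P(X | Y, Z) = P(X | Z) for all values of X, Y, and Z (the definition extends to sets of variables in the obvious way).
• The naive Bayes assumption is the special case in which each attribute Ai is conditionally independent of the remaining attributes given the target value V, so that P(A1, ..., An | V) = Πi P(Ai | V).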
Representation
• A Bayesian belief network (Bayesian network for short) represents the
joint probability distribution for a set of variables. For example, the
Bayesian network in Figure represents the joint probability distribution
over the boolean variables Storm, Lightning, Thunder, ForestFire,
Campfire, and BusTourGroup.
• In general, a Bayesian network represents the joint probability
distribution by specifying a set of conditional independence assumptions
(represented by a directed acyclic graph), together with sets of local
conditional probabilities.
• Each variable in the joint space is represented by a node in the Bayesian
network. For each variable two types of information are specified.
• First, the network arcs represent the assertion that the variable is
conditionally independent of its nondescendants in the network given its
immediate predecessors in the network. We say X is a descendant of Y if
there is a directed path from Y to X.
• Second, a conditional probability table is given for each variable,
describing the probability distribution for that variable given the values of
its immediate predecessors.
The joint probability for any desired assignment of values (y1, . . . , yn) to the tuple of
network variables (Y1, . . . , Yn) can be computed by the formula
P(y1, . . . , yn) = Π_{i=1..n} P(yi | Parents(Yi))
where Parents(Yi) denotes the set of immediate predecessors of Yi in the network.
Note that the values of P(yi | Parents(Yi)) are precisely the values stored in the
conditional probability table associated with node Yi.
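A minimal Python sketch of how such a factored joint probability is evaluated (the fragment of the network and the numeric CPT entries below are illustrative placeholders, not the values from the figure):

```python
# Parents of each boolean variable in a tiny fragment of the example network
parents = {
    "Storm": [],
    "BusTourGroup": [],
    "Campfire": ["Storm", "BusTourGroup"],
}

# Conditional probability tables: P(variable = True | values of its parents).
# The numbers are placeholders for illustration only.
cpt = {
    "Storm": {(): 0.3},
    "BusTourGroup": {(): 0.5},
    "Campfire": {
        (True, True): 0.4, (True, False): 0.1,
        (False, True): 0.8, (False, False): 0.2,
    },
}

def joint_probability(assignment):
    """P(y1, ..., yn) = product over i of P(yi | Parents(Yi))."""
    prob = 1.0
    for var, value in assignment.items():
        parent_values = tuple(assignment[p] for p in parents[var])
        p_true = cpt[var][parent_values]
        prob *= p_true if value else (1.0 - p_true)
    return prob

# e.g. P(Storm=True, BusTourGroup=False, Campfire=False) = 0.3 * 0.5 * 0.9
print(joint_probability({"Storm": True, "BusTourGroup": False, "Campfire": False}))
```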
Inference
• Inference in a Bayesian network is the task of answering
probabilistic queries once the joint distribution over the
network's variables has been specified.
• We might wish to use a Bayesian network to infer the
value of some target variable (e.g., ForestFire) given
the observed values of the other variables. Of course,
given that we are dealing with random variables it will
not generally be correct to assign the target variable a
single determined value. What we really wish to infer is
the probability distribution for the target variable,
which specifies the probability that it will take on each
of its possible values given the observed values of the
other variables.
Expectation–maximization
➢ EM algorithm is an iterative method to find maximum
likelihood estimates of parameters in statistical models,
where the model depends on unobserved latent variables.
(In statistics, latent variables (from Latin: present
participle of lateo (“lie hidden”), as opposed to
observable variables) are variables that are not directly
observed but are rather inferred (through a mathematical
model) from other variables that are observed (directly
measured).)
➢ Iteratively learn probabilistic categorization model from
unsupervised data.
➢ Initially assume a random assignment of examples to
categories ("randomly label" the data)
➢ Learn initial probabilistic model by estimating model
parameters θ from randomly labeled data
➢ Iterate until convergence:
1. Expectation (E-step): Compute P(ci | E) for
each example given the current model, and
probabilistically re-label the examples based
on these posterior probability estimates.
2. Maximization (M-step): Re-estimate the
model parameters, θ, from the
probabilistically re-labeled data.
The EM Algorithm for Gaussian Mixtures
➢ The probability density function for a multivariate Normal distribution is
p(x) = (1 / ((2π)^(k/2) |Σ|^(1/2))) exp( −(1/2) (x − μ)ᵀ Σ⁻¹ (x − μ) )
where μ is the mean, Σ is the covariance matrix, and k
is the dimension of the space where x takes values.
Algorithm:
• An arbitrary initial hypothesis h=<μ1, μ2 ,.., μk> is chosen.
• The EM Algorithm iterates over two steps:
Step 1 (Estimation, E): Calculate the expected value E[zij] of
each hidden variable zij, assuming that the current
hypothesis h=<μ1, μ2 ,.., μk> holds.
Step 2 (Maximization, M): Calculate a new maximum
likelihood hypothesis h′ = <μ1′, μ2′, ..., μk′>, assuming the
value taken on by each hidden variable zij is its expected
value E[zij] calculated in Step 1. Then replace the
hypothesis h = <μ1, μ2, ..., μk> by the new hypothesis
h′ = <μ1′, μ2′, ..., μk′> and iterate.
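For a mixture of k one-dimensional Normals with a known, shared variance σ², the two steps take the usual closed form (reconstructed here under that simplifying assumption):
E-step: E[zij] = exp(−(xi − μj)² / 2σ²) / Σ_n exp(−(xi − μn)² / 2σ²)
M-step: μj ← Σ_i E[zij] xi / Σ_i E[zij]
A minimal runnable Python sketch (the sample data and parameter choices are assumptions made for illustration):

```python
import numpy as np

def em_gaussian_means(x, k, sigma, n_iters=50, seed=0):
    """Estimate the means of a mixture of k 1-D Gaussians with known, equal sigma."""
    rng = np.random.default_rng(seed)
    mu = rng.choice(x, size=k, replace=False)  # arbitrary initial hypothesis h
    for _ in range(n_iters):
        # E-step: expected value E[z_ij] that example i was generated by component j
        resp = np.exp(-((x[:, None] - mu[None, :]) ** 2) / (2 * sigma ** 2))
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate each mean as the responsibility-weighted average
        mu = (resp * x[:, None]).sum(axis=0) / resp.sum(axis=0)
    return mu

# Illustrative data: two clusters centred near -2 and +3, sigma = 1
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 200)])
print(em_gaussian_means(x, k=2, sigma=1.0))  # approximately [-2, 3] (order may vary)
```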
Conclusion
• The above algorithm for estimating the means of
a mixture of k Normal distributions illustrates the
essence of the EM approach: The current
hypothesis is used to estimate the unobserved
variables, and the expected values of these
variables are then used to calculate an improved
hypothesis.
• It can be proved that on each iteration through
this loop, the EM algorithm increases the
likelihood P(D|h) unless it is at a local maximum.
The algorithm thus converges to a locally maximum
likelihood hypothesis for the means (μ1, ..., μk).