Module-4
Bayesian Learning
CONTENT
• Introduction
• Bayes theorem
• Bayes theorem and concept learning
• Maximum likelihood and Least Squared Error
Hypothesis
• Maximum likelihood Hypotheses for predicting
probabilities
• Minimum Description Length Principle
• Naive Bayes classifier
• Bayesian belief networks
• EM algorithm
INTRODUCTION
Bayesian learning methods are relevant to the
study of machine learning for two different
reasons; both stem from the probabilistic
approach they take to inference.
• First, Bayesian learning algorithms that calculate
explicit probabilities for hypotheses, such as the
naive Bayes classifier, are among the most
practical approaches to certain types of learning
problems
• The second reason is that they provide a useful
perspective for understanding many learning
algorithms that do not explicitly manipulate
probabilities
Features of Bayesian Learning Methods
• Each observed training example can incrementally decrease or
increase the estimated probability that a hypothesis is correct. This
provides a more flexible approach to learning than algorithms that
completely eliminate a hypothesis if it is found to be inconsistent
with any single example
• Prior knowledge can be combined with observed data to determine
the final probability of a hypothesis. In Bayesian learning, prior
knowledge is provided by asserting (1) a prior probability for each
candidate hypothesis, and (2) a probability distribution over
observed data for each possible hypothesis.
• Bayesian methods can accommodate hypotheses that make
probabilistic predictions
• New instances can be classified by combining the predictions of
multiple hypotheses, weighted by their probabilities.
• Even in cases where Bayesian methods prove computationally
intractable, they can provide a standard of optimal decision making
against which other practical methods can be measured.
Practical difficulty in applying Bayesian
methods
• One practical difficulty in applying Bayesian methods is that
they typically require initial knowledge of many
probabilities. When these probabilities are not known in
advance they are often estimated based on background
knowledge, previously available data, and assumptions
about the form of the underlying distributions.
• A second practical difficulty is the significant computational
cost required to determine the Bayes optimal hypothesis in
the general case. In certain specialized situations, this
computational cost can be significantly reduced.
BAYES THEOREM
• Bayes theorem provides a way to calculate the probability
of a hypothesis based on its prior probability, the
probabilities of observing various data given the
hypothesis, and the observed data itself.
• Notations
• P(h) denotes the initial probability that hypothesis h holds,
before we have observed the training data. P(h) is often
called the prior probability of h and may reflect any
background knowledge about the chance that h is correct
• P(D) prior probability of D, probability that D will be
observed
• P(D|h) probability of observing D given a world in which h
holds
• P(h|D) posterior probability of h, reflects confidence that h
holds after D has been observed
• Bayes theorem is the cornerstone of Bayesian learning
methods because it provides a way to calculate the
posterior probability P(h|D) from the prior probability
P(h), together with P(D) and P(D|h).
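• Bayes theorem:
P(h|D) = P(D|h) P(h) / P(D)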
• P(h|D) increases with P(h) and with P(D|h) according
to Bayes theorem.
• P(h|D) decreases as P(D) increases, because the more
probable it is that D will be observed independent of h,
the less evidence D provides in support of h.
Maximum a Posteriori (MAP) Hypothesis
• In many learning scenarios, the learner considers some set of
candidate hypotheses H and is interested in finding the most
probable hypothesis h ∈ H given the observed data D. Any such
maximally probable hypothesis is called a maximum a posteriori
(MAP) hypothesis.
• Using Bayes theorem to calculate the posterior probability of each
candidate hypothesis, hMAP is a MAP hypothesis provided
hMAP = argmax_{h∈H} P(h|D) = argmax_{h∈H} P(D|h) P(h) / P(D) = argmax_{h∈H} P(D|h) P(h)
• In the final step, P(D) can be dropped because it is a constant independent of h
Maximum Likelihood (ML) Hypothesis
• In some cases, it is assumed that every hypothesis in H is
equally probable a priori (P(hi) = P(hj) for all hi and hj in H).
• In this case the equation for hMAP can be simplified: we need
only consider the term P(D|h) to find the most probable
hypothesis,
hML = argmax_{h∈H} P(D|h)
• P(D|h) is often called the likelihood of the data D given h,
and any hypothesis that maximizes P(D|h) is called a
maximum likelihood (ML) hypothesis
• Examples Solved in class
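• As a concrete illustration, consider the familiar two-hypothesis medical-diagnosis setting (the priors and likelihoods below are the numbers usually quoted for this example and are assumed here). A minimal Python sketch of the MAP computation:

```python
# MAP inference for two hypotheses given a positive lab test.
# Priors and likelihoods are the usual illustrative numbers (assumed).
p_cancer = 0.008            # P(cancer)
p_not_cancer = 0.992        # P(~cancer)
p_pos_given_cancer = 0.98   # P(+ | cancer)
p_pos_given_not = 0.03      # P(+ | ~cancer)

# Unnormalized posteriors P(D|h) * P(h) for the observed data D = "+"
post_cancer = p_pos_given_cancer * p_cancer       # 0.0078
post_not_cancer = p_pos_given_not * p_not_cancer  # 0.0298

# hMAP is the hypothesis with the larger unnormalized posterior
h_map = "cancer" if post_cancer > post_not_cancer else "no cancer"
print(h_map)  # -> "no cancer"

# Normalizing by their sum gives the exact posterior, e.g. P(cancer | +) ~ 0.21
p_cancer_given_pos = post_cancer / (post_cancer + post_not_cancer)
```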
Brute-Force Bayes Concept Learning
• Consider the concept learning problem first introduced in Chapter
2. In particular, assume the learner considers some finite hypothesis
space H defined over the instance space X, in which the task is to
learn some target concept c : X → {0, 1}.
• As usual, we assume that the learner is given some sequence of
training examples ((x1, d1), . . . , (xm, dm)) where xi is some instance
from X and where di is the target value of xi (i.e., di = c(xi)). To
simplify the discussion in this section, we assume the sequence of
instances (x1 . . . xm) is held fixed, so that the training data D can
be written simply as the sequence of target values D = (d1 . . . dm).
It can be shown that this simplification does not alter the main
conclusions of this section.
Brute-Force Bayes Concept Learning
• We can design a straightforward concept learning algorithm to output the
maximum a posteriori hypothesis, based on Bayes theorem, as follows:
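• The algorithm itself, as a minimal Python sketch (the hypothesis space H, the prior P(h), and the likelihood P(D|h) are supplied by the particular learning task; the names below are placeholders):

```python
def brute_force_map_learning(H, prior, likelihood, D):
    """Return the hypothesis h in H that maximizes the posterior P(h|D).

    H          -- finite iterable of candidate hypotheses
    prior      -- function prior(h) returning P(h)
    likelihood -- function likelihood(D, h) returning P(D|h)
    D          -- the observed training data
    """
    # 1. For each h in H, calculate the unnormalized posterior P(D|h) * P(h).
    #    The normalizer P(D) is the same for every h, so the argmax is unchanged.
    posterior = {h: likelihood(D, h) * prior(h) for h in H}
    # 2. Output the hypothesis hMAP with the highest posterior probability.
    return max(posterior, key=posterior.get)
```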
• In order to specify a learning problem for the BRUTE-FORCE
MAP LEARNING algorithm, we must specify what values are
to be used for P(h) and for P(D|h). Let us choose P(h) and
P(D|h) to be consistent with the following assumptions:
1. The training data D is noise free (i.e., di = c(xi))
2. The target concept c is contained in the hypothesis space H
3. We have no a priori reason to believe that any hypothesis is
more probable than any other.
• What values should we specify for P(h)?
– Given no prior knowledge that one hypothesis is more likely
than another, it is reasonable to assign the same prior
probability to every hypothesis h in H.
– Assuming the target concept is contained in H and requiring that
these prior probabilities sum to 1, we get
P(h) = 1 / |H| for all h in H
What choice shall we make for P(D|h)?
• P(D|h) is the probability of observing the target
values D = (d1 . . .dm) for the fixed set of instances
(x1 . . . xm), given a world in which hypothesis h
holds
• Since we assume noise-free training data, the
probability of observing classification di given h is
just 1 if di = h(xi) and 0 if di ≠ h(xi). Therefore,
P(D|h) = 1 if di = h(xi) for all di in D
P(D|h) = 0 otherwise
• Given these choices for P(h) and for P(D|h) we
now have a fully-defined problem for the
above BRUTE-FORCE MAP LEARNING
algorithm.
• As a first step, we have to determine the
posterior probabilities P(h|D)
• To summarize, Bayes theorem implies that the
posterior probability P(h|D) under our
assumed P(h) and P(D|h) is
P(h|D) = 1 / |VS_H,D| if h is consistent with D
P(h|D) = 0 otherwise
• where |VS_H,D| is the number of hypotheses
from H consistent with D (the size of the version space of H with respect to D)
The Evolution of Probabilities Associated with
Hypotheses
• In Figure (a), all hypotheses have the same prior probability.
• In Figures (b) and (c), as training data accumulates, the posterior probability
of inconsistent hypotheses falls to zero, while the total probability
(summing to 1) is shared equally among the remaining consistent
hypotheses.
MAP Hypotheses and Consistent Learners
• A learning algorithm is a consistent learner if it outputs a hypothesis that
commits zero errors over the training examples.
• Every consistent learner outputs a MAP hypothesis, if we assume a
uniform prior probability distribution over H (P(hi) = P(hj) for all i, j), and
deterministic, noise free training data (P(D|h) =1 if D and h are consistent,
and 0 otherwise).
Example:
• Because FIND-S outputs a consistent hypothesis, it will output a MAP hypothesis
under the probability distributions P(h) and P(D|h) defined above.
• Are there other probability distributions for P(h) and P(D|h) under which
FIND-S outputs MAP hypotheses? Yes.
• Because FIND-S outputs a maximally specific hypothesis from the version
space, its output hypothesis will be a MAP hypothesis relative to any prior
probability distribution that favors more specific hypotheses.
• Bayesian framework is a way to characterize
the behaviour of learning algorithms
• By identifying the probability distributions P(h)
and P(D|h) under which an algorithm's output is an
optimal (MAP) hypothesis, the implicit assumptions of
the algorithm can be characterized (Inductive
Bias)
• Inductive inference is modelled by an
equivalent probabilistic reasoning system
based on Bayes theorem
Normal Probability Distribution (Gaussian Distribution)
• A Normal distribution is a smooth, bell-shaped
distribution that can be completely characterized by
its mean μ and its standard deviation σ
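• Its probability density function is
p(x) = (1 / √(2πσ²)) e^(−(x − μ)² / (2σ²))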
MAXIMUM LIKELIHOOD AND LEAST-SQUARED
ERROR HYPOTHESES
• Consider the problem of learning a continuous-
valued target function such as neural network
learning, linear regression, and polynomial curve
fitting
• A straightforward Bayesian analysis will show that
under certain assumptions any learning algorithm
that minimizes the squared error between the
output hypothesis predictions and the training
data will output a maximum likelihood (ML)
hypothesis
Learning A Continuous-Valued Target Function
• Learner L considers an instance space X and a hypothesis space H
consisting of some class of real-valued functions defined over X, i.e., (∀ h
∈ H)[ h : X → R] and training examples of the form <xi,di>
• The problem faced by L is to learn an unknown target function f : X → R
• A set of m training examples is provided, where the target value of each
example is corrupted by random noise drawn according to a Normal
probability distribution with zero mean (di = f(xi) + ei)
• Each training example is a pair of the form (xi ,di ) where di = f (xi ) + ei .
- Here f(xi) is the noise-free value of the target function and ei is a random
variable representing the noise.
- It is assumed that the values of the ei are drawn independently and that
they are distributed according to a Normal distribution with zero mean.
• The task of the learner is to output a maximum likelihood hypothesis, or,
equivalently, a MAP hypothesis assuming all hypotheses are equally
probable a priori.
Learning A Linear Function
• The target function f corresponds to
the solid line.
• The training examples (xi , di ) are
assumed to have Normally
distributed noise ei with zero mean
added to the true target value f (xi ).
• The dashed line corresponds to the
hypothesis hML with least-squared
training error, hence the maximum
likelihood hypothesis.
• Notice that the maximum likelihood
hypothesis is not necessarily identical
to the correct hypothesis, f, because
it is inferred from only a limited
sample of noisy training data
Before showing why a hypothesis that minimizes the
sum of squared errors in this setting is also a maximum
likelihood hypothesis,
let us quickly review probability densities and Normal
distributions
Probability Density for continuous variables
e: a random noise variable generated by a Normal
probability distribution
<x1 . . . xm>: the sequence of instances (as before)
<d1 . . . dm>:the sequence of target values with di = f(xi) + ei
What is a probability density function?
** Just for understanding
• A probability density function (PDF) is a function of a continuous
random variable whose integral over an interval gives the probability
that the value of the variable lies within that interval.
• Equivalently, the value of the PDF at any given sample (point) in the
sample space (the set of possible values taken by the random variable)
can be interpreted as a relative likelihood that the random variable
takes a value near that point.
Why to choose PDF for continuous variables?
• The reason, roughly, is that we wish for the total probability
over all possible values of the random variable to sum to
one. In the case of continuous variables we cannot achieve
this by assigning a finite probability to each of the infinite
set of possible values for the random variable.
• Instead, we speak of a probability density for continuous
variables such as e and require that the integral of this
probability density over all possible values be one.
• In general we will use lower case p to refer to the
probability density function, to distinguish it from a finite
probability P (which we will sometimes refer to as a
probability mass). The probability density p(x0) is defined as the limit,
as ε goes to zero, of 1/ε times the probability that x will take
on a value in the interval [x0, x0 + ε):
p(x0) ≡ lim_{ε→0} (1/ε) P(x0 ≤ x < x0 + ε)
• Using the previous definition of hML we have
hML = argmax_{h∈H} p(D|h)
• Assuming the training examples are mutually independent given
h, we can write p(D|h) as the product of the individual p(di|h):
hML = argmax_{h∈H} Π_{i=1..m} p(di|h)
• Given that the noise ei obeys a Normal distribution with zero mean
and unknown variance σ², each di must also obey a Normal
distribution with variance σ² centered around the true target value
f(xi). Because we are writing the expression for p(D|h), we assume
h is the correct description of f; hence µ = f(xi) = h(xi), and
hML = argmax_{h∈H} Π_{i=1..m} (1 / √(2πσ²)) e^(−(di − h(xi))² / (2σ²))
• We now apply a transformation that is common in maximum
likelihood calculations: rather than maximizing the above
complicated expression, we maximize its (less complicated)
logarithm. This is justified because the logarithm is a monotonic
function of its argument.
• The logarithm turns the product of individual contributions into a
sum of contributions, which is much easier to handle:
hML = argmax_{h∈H} Σ_{i=1..m} [ ln(1 / √(2πσ²)) − (di − h(xi))² / (2σ²) ]
• The first term in this expression is a constant
independent of h and can therefore be discarded, giving
hML = argmax_{h∈H} Σ_{i=1..m} − (di − h(xi))² / (2σ²)
• Maximizing this negative quantity is equivalent to
minimizing the corresponding positive quantity.
• Finally, discarding the constant 1/(2σ²), which is also independent of h,
hML = argmin_{h∈H} Σ_{i=1..m} (di − h(xi))²
• Thus hML is the hypothesis that minimizes the sum of the squared errors
over the training data
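• A minimal numerical sketch of this result (the linear target function, noise level, and use of numpy are assumptions made purely for illustration): under zero-mean Gaussian noise, the ordinary least-squares fit is the maximum likelihood hypothesis.

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy training data: di = f(xi) + ei, with f(x) = 2x + 1 and Gaussian noise ei
x = np.linspace(0.0, 1.0, 20)
d = 2.0 * x + 1.0 + rng.normal(loc=0.0, scale=0.1, size=x.shape)

# Least-squares fit of a line h(x) = w1*x + w0: minimizes sum_i (di - h(xi))^2,
# which by the derivation above is also the maximum likelihood hypothesis hML.
A = np.column_stack([x, np.ones_like(x)])
(w1, w0), *_ = np.linalg.lstsq(A, d, rcond=None)
print(w1, w0)  # close to the true (2, 1), but not exact: only a finite noisy sample
```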
Why is it reasonable to choose the Normal distribution to
characterize noise?
• good approximation of many types of noise in physical
systems
• Central Limit Theorem shows that the sum of a sufficiently
large number of independent, identically distributed
random variables itself obeys a Normal distribution
Note that only noise in the target value is considered; noise in the
attributes describing the instances themselves is not modelled by this analysis
MAXIMUM LIKELIHOOD HYPOTHESES FOR
PREDICTING PROBABILITIES
• In the previous section we determined that
the maximum likelihood hypothesis is the one
that minimizes the sum of squared errors over
the training examples.
• Now we derive an analogous criterion for a
second setting that is common in neural
network learning: learning to predict
probabilities.
Example
• Consider the setting in which we wish to learn a nondeterministic
(probabilistic) function f : X → {0, 1}, which has two discrete
output values. For example, the instance space X might represent
medical patients in terms of their symptoms, and the target
function f (x) might be 1 if the patient survives the disease and 0 if
not.
• Alternatively, X might represent loan applicants in terms of their
past credit history, and f (x) might be 1 if the applicant successfully
repays their next loan and 0 if not.
• In both of these cases we might well expect f to be probabilistic. For
example, among a collection of patients exhibiting the same set of
observable symptoms, we might find that 92% survive, and 8% do
not. This unpredictability could arise from our inability to observe
all the important distinguishing features of the patients, or from
some genuinely probabilistic mechanism in the evolution of the
disease. Whatever the source of the problem, the effect is that we
have a target function f (x) whose output is a probabilistic function
of the input.
• Given this problem setting, we might wish to
learn a neural network (or other real-valued
function approximator) whose output is the
probability that f(x) = 1. In other words, we seek
to learn the target function f′ : X → [0, 1], such
that f′(x) = P(f(x) = 1).
• Let us assume the training data D is of the form D
= {(x1, d1) . . . (xm, dm)}, where di is the observed 0
or 1 value for f (xi).
• Treating both xi and di as random variables, and
assuming that each training example is drawn
independently, we can write P(D|h) as
P(D|h) = Π_{i=1..m} P(xi, di | h)
When xi is independent of h we can rewrite the above expression as
P(D|h) = Π_{i=1..m} P(di | h, xi) P(xi)
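• Carrying the derivation through (the remaining steps are reconstructed here from the standard treatment): since h(xi) is the hypothesized probability that f(xi) = 1, we have P(di | h, xi) = h(xi) if di = 1 and 1 − h(xi) if di = 0, which can be written compactly as
P(di | h, xi) = h(xi)^di (1 − h(xi))^(1−di)
• Dropping the factor P(xi), which is independent of h, and maximizing the logarithm of the resulting product gives
hML = argmax_{h∈H} Σ_{i=1..m} [ di ln h(xi) + (1 − di) ln(1 − h(xi)) ]
• The negative of the quantity being maximized is the cross entropy between the observed values di and the predicted probabilities h(xi); the maximum likelihood hypothesis is therefore the one that minimizes this cross entropy.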
MINIMUM DESCRIPTION LENGTH PRINCIPLE
• Recall from Chapter 3 the discussion of
Occam's razor, a popular inductive bias that
can be summarized as "choose the shortest
explanation for the observed data."
• In that chapter we discussed several
arguments in the long-standing debate
regarding Occam's razor. Here we consider a
Bayesian perspective on this issue and a
closely related principle called the Minimum
Description Length (MDL) principle.
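• For reference, the equation referred to below as Equation (6.16) is the MAP definition rewritten with base-2 logarithms:
hMAP = argmax_{h∈H} [ log2 P(D|h) + log2 P(h) ]
or, equivalently, as a minimization,
hMAP = argmin_{h∈H} [ − log2 P(D|h) − log2 P(h) ]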
• The above Equation (6.16) can be interpreted as a statement that
short hypotheses are preferred, assuming a particular
representation scheme for encoding hypotheses and data. To
explain this, let us introduce a basic result from information theory:
Consider the problem of designing a code to transmit messages
drawn at random, where the probability of encountering message i
is pi.
• We are interested here in the most compact code; that is, we are
interested in the code that minimizes the expected number of bits
we must transmit in order to encode a message drawn at random.
• Clearly, to minimize the expected code length we should assign
shorter codes to messages that are more probable. Shannon and
Weaver (1949) showed that the optimal code (i.e., the code that
minimizes the expected message length) assigns – log2 pi bits to
encode message i .
• We will refer to the number of bits required to encode message i
using code C as the description length of message i with respect
to C, which we denote by Lc(i).
Interpretation of the equation 6.16 from coding theory
• - log2 P(h) is the description length of h under the optimal
encoding for the hypothesis space H. In other words, this is
the size of the description of hypothesis h using this
optimal representation. In our notation, LCH (h) = -
log2P(h), where CH is the optimal code for hypothesis space
H.
• -log2 P(D|h) is the description length of the training data
D given hypothesis h, under its optimal encoding. In our
notation, L_CD|h(D|h) = − log2 P(D|h), where CD|h is the
optimal code for describing data D assuming that both the
sender and receiver know the hypothesis h.
• Therefore we can rewrite Equation (6.16) to show that
hMAP is the hypothesis h that minimizes the sum given by
the description length of the hypothesis plus the
description length of the data given the hypothesis.
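• Written out (following the standard statement of the Minimum Description Length principle), this is
hMAP = argmin_{h∈H} [ L_CH(h) + L_CD|h(D|h) ]
and the MDL principle recommends choosing, for arbitrary encodings C1 (for hypotheses) and C2 (for data given a hypothesis),
hMDL = argmin_{h∈H} [ L_C1(h) + L_C2(D|h) ]
hMDL equals hMAP when C1 and C2 are the optimal encodings CH and CD|h.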
NAIVE BAYES CLASSIFIER
• One highly practical Bayesian learning method is the naive
Bayes learner, often called the naive Bayes classifier. In
some domains its performance has been shown to be
comparable to that of neural network and decision tree
learning.
• The naive Bayes classifier applies to learning tasks where
each instance x is described by a conjunction of attribute
values and where the target function f ( x ) can take on any
value from some finite set V. A set of training examples of
the target function is provided, and a new instance is
presented, described by the tuple of attribute values (a1,
a2, ..., an). The learner is asked to predict the target value,
or classification, for this new instance.
• The Bayesian approach to classifying the new instance is to
assign the most probable target value, VMAP, given the
attribute values (a1, a2, ..., an) that describe the instance.
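• Written out, the value being sought is
VMAP = argmax_{vj∈V} P(vj | a1, a2, ..., an)
which, by Bayes theorem (dropping the denominator P(a1, ..., an), a constant across target values), becomes
VMAP = argmax_{vj∈V} P(a1, a2, ..., an | vj) P(vj)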
Now we could attempt to estimate the two terms in the above expression based on the
training data. It is easy to estimate each of the P(vj) simply by counting the frequency with
which each target value vj occurs in the training data. However, estimating the different
P(a1, a2, ..., an | vj) terms in this fashion is not feasible unless we have a very large set
of training data. The problem is that the number of these terms is equal to the number of
possible instances times the number of possible target values, so we would need to see
every instance in the instance space many times in order to obtain reliable estimates.
• The naive Bayes classifier is based on the
simplifying assumption that the attribute values
are conditionally independent given the target
value. In other words, the assumption is that
given the target value of the instance, the
probability of observing the conjunction a1, a2..
.an, is just the product of the probabilities for
the individual attributes:
• P(a1, a2, ..., an | vj) = Πi P(ai | vj)
• Substituting this into Equation (6.19), we obtain
the approach used by the naive Bayes classifier:
VNB = argmax_{vj∈V} P(vj) Πi P(ai | vj)
where VNB denotes the target value output by the naive Bayes classifier. Notice
that in a naive Bayes classifier the number of distinct P(ai | vj) terms that must
be estimated from the training data is just the number of distinct attribute values
times the number of distinct target values, a much smaller number than if we
were to estimate the P(a1, a2, ..., an | vj) terms as first contemplated.
To summarize, the naive Bayes learning method involves a learning step in which
the various P(vj) and P(ai | vj) terms are estimated, based on their frequencies
over the training data. The set of these estimates corresponds to the learned
hypothesis. This hypothesis is then used to classify each new instance by applying
the rule in above Equation . Whenever the naive Bayes assumption of conditional
independence is satisfied, this naive Bayes classification VNB is identical to the
MAP classification.
Example
• Naïve Bayesian classifier algorithm:
• Step 1: Convert the data set into a frequency
table
• Step 2: Create a likelihood table by computing the relevant
probabilities, e.g., P(Overcast) = 4/14 ≈ 0.29 and
P(Play = Yes) = 9/14 ≈ 0.64.
• Step 3: Now, use Naive Bayesian equation to calculate the
posterior probability for each class. The class with the
highest posterior probability is the outcome of prediction.
• Problem: Players will play if the weather is sunny. Is this
statement correct?
• We can solve it using above discussed method of posterior
probability.
• P(Yes | Sunny) = P( Sunny | Yes) * P(Yes) / P (Sunny)
• Here we have P (Sunny |Yes) = 3/9 = 0.33, P(Sunny) = 5/14
= 0.36, P( Yes)= 9/14 = 0.64
• Now, P(Yes | Sunny) = 0.33 * 0.64 / 0.36 = 0.60, which is
higher than P(No | Sunny) = 0.40, so the prediction is that the players will play.
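• A minimal Python sketch of this computation (the counts are taken from the standard 14-example weather/play table assumed by the example):

```python
# Counts assumed from the standard 14-example weather/play table
n_total = 14
n_yes, n_no = 9, 5      # play = yes / play = no
n_sunny = 5             # sunny days overall
n_sunny_and_yes = 3     # sunny days on which play = yes
n_sunny_and_no = 2      # sunny days on which play = no

p_yes = n_yes / n_total                      # P(Yes)        = 0.64
p_sunny = n_sunny / n_total                  # P(Sunny)      = 0.36
p_sunny_given_yes = n_sunny_and_yes / n_yes  # P(Sunny|Yes)  = 0.33

# Bayes theorem: P(Yes|Sunny) = P(Sunny|Yes) * P(Yes) / P(Sunny)
p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny
p_no_given_sunny = (n_sunny_and_no / n_no) * (n_no / n_total) / p_sunny

print(round(p_yes_given_sunny, 2), round(p_no_given_sunny, 2))  # 0.6 0.4
```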
Bayesian Belief Networks
Difference between Naïve Bayes and BBN-
• The naive Bayes classifier makes significant use of the assumption that the values of the
attributes a1 . . .an, are conditionally independent given the target value v. This assumption
dramatically reduces the complexity of learning the target function. When it is met, the naive
Bayes classifier outputs the optimal Bayes classification.
• However, in many cases this conditional independence assumption is clearly overly
restrictive.
• A Bayesian belief network describes the probability distribution governing a set of variables
by specifying a set of conditional independence assumptions along with a set of conditional
probabilities.
• In contrast to the naive Bayes classifier, which assumes that all the variables are
conditionally independent given the value of the target variable, Bayesian belief networks
allow stating conditional independence assumptions that apply to subsets of the variables.
• Thus, Bayesian belief networks provide an intermediate approach that is less constraining
than the global assumption of conditional independence made by the naive Bayes classifier,
but more tractable than avoiding conditional independence assumptions altogether.
• Bayesian belief networks are an active focus of current research, and a variety of algorithms
have been proposed for learning them and for using them for inference.
Definition -BBN
• In general, a Bayesian belief network describes the
probability distribution over a set of variables. Consider an
arbitrary set of random variables Y1 . . . Yn, where each
variable Yi can take on the set of possible values V(Yi).
• We define the joint space of the set of variables Y to be the
cross product V(Y1) x V(Y2) x . . . V(Yn).
• In other words, each item in the joint space corresponds to
one of the possible assignments of values to the tuple of
variables (Y1 . . . Yn). The probability distribution over this
joint space is called the joint probability distribution.
• The joint probability distribution specifies the probability
for each of the possible variable bindings for the tuple (Y1 .
. . Yn).
• A Bayesian belief network describes the joint probability
distribution for a set of variables.
Conditional Independence
Naïve Bayes Conditional Independence
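• Recall the standard definition: X is conditionally independent of Y given Z if P(X | Y, Z) = P(X | Z) for all values of X, Y, and Z (the definition extends to sets of variables in the obvious way).
• The naive Bayes assumption is the special case in which each attribute Ai is conditionally independent of the remaining attributes given the target value V, so that P(A1, ..., An | V) = Πi P(Ai | V).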
Representation
• A Bayesian belief network (Bayesian network for short) represents the
joint probability distribution for a set of variables. For example, the
Bayesian network in Figure represents the joint probability distribution
over the boolean variables Storm, Lightning, Thunder, ForestFire,
Campfire, and BusTourGroup.
• In general, a Bayesian network represents the joint probability
distribution by specifying a set of conditional independence assumptions
(represented by a directed acyclic graph), together with sets of local
conditional probabilities.
• Each variable in the joint space is represented by a node in the Bayesian
network. For each variable two types of information are specified.
• First, the network arcs represent the assertion that the variable is
conditionally independent of its nondescendants in the network given its
immediate predecessors in the network. We say X is a descendant of Y if
there is a directed path from Y to X.
• Second, a conditional probability table is given for each variable,
describing the probability distribution for that variable given the values of
its immediate predecessors.
The joint probability for any desired assignment of values (y1, . . . , yn) to the tuple of
network variables (Y1, . . . , Yn) can be computed by the formula
P(y1, . . . , yn) = Π_{i=1..n} P(yi | Parents(Yi))
where Parents(Yi) denotes the set of immediate predecessors of Yi in the network.
Note that the values of P(yi | Parents(Yi)) are precisely the values stored in the
conditional probability table associated with node Yi.
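A minimal Python sketch of how such a factored joint probability is evaluated (the fragment of the network and the numeric CPT entries below are illustrative placeholders, not the values from the figure):

```python
# Parents of each boolean variable in a tiny fragment of the example network
parents = {
    "Storm": [],
    "BusTourGroup": [],
    "Campfire": ["Storm", "BusTourGroup"],
}

# Conditional probability tables: P(variable = True | values of its parents).
# The numbers are placeholders for illustration only.
cpt = {
    "Storm": {(): 0.3},
    "BusTourGroup": {(): 0.5},
    "Campfire": {
        (True, True): 0.4, (True, False): 0.1,
        (False, True): 0.8, (False, False): 0.2,
    },
}

def joint_probability(assignment):
    """P(y1, ..., yn) = product over i of P(yi | Parents(Yi))."""
    prob = 1.0
    for var, value in assignment.items():
        parent_values = tuple(assignment[p] for p in parents[var])
        p_true = cpt[var][parent_values]
        prob *= p_true if value else (1.0 - p_true)
    return prob

# e.g. P(Storm=True, BusTourGroup=False, Campfire=False) = 0.3 * 0.5 * 0.9
print(joint_probability({"Storm": True, "BusTourGroup": False, "Campfire": False}))
```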
Inference
• Inference in a Bayesian network is the task of answering
probabilistic queries once the joint distribution over the
network's variables has been specified.
• We might wish to use a Bayesian network to infer the
value of some target variable (e.g., ForestFire) given
the observed values of the other variables. Of course,
given that we are dealing with random variables it will
not generally be correct to assign the target variable a
single determined value. What we really wish to infer is
the probability distribution for the target variable,
which specifies the probability that it will take on each
of its possible values given the observed values of the
other variables.
Expectation–maximization
➢ EM algorithm is an iterative method to find maximum
likelihood estimates of parameters in statistical models,
where the model depends on unobserved latent variables.
(In statistics, latent variables (from Latin: present
participle of lateo (“lie hidden”), as opposed to
observable variables) are variables that are not directly
observed but are rather inferred (through a mathematical
model) from other variables that are observed (directly
measured).)
➢ Iteratively learn probabilistic categorization model from
unsupervised data.
➢ Initially assume a random assignment of examples to
categories ("randomly label" the data)
➢ Learn initial probabilistic model by estimating model
parameters θ from randomly labeled data
➢ Iterate until convergence:
1. Expectation (E-step): Compute P(ci | E) for
each example given the current model, and
probabilistically re-label the examples based
on these posterior probability estimates.
2. Maximization (M-step): Re-estimate the
model parameters, θ, from the
probabilistically re-labeled data.
The EM Algorithm for Gaussian Mixtures
➢ The probability density function for a multivariate Normal distribution is
p(x) = (1 / ((2π)^(k/2) |Σ|^(1/2))) exp( −(1/2) (x − μ)ᵀ Σ⁻¹ (x − μ) )
where μ is the mean, Σ is the covariance matrix, and k
is the dimension of the space where x takes values.
Algorithm:
• An arbitrary initial hypothesis h=<μ1, μ2 ,.., μk> is chosen.
• The EM Algorithm iterates over two steps:
Step 1 (Estimation, E): Calculate the expected value E[zij] of
each hidden variable zij, assuming that the current
hypothesis h=<μ1, μ2 ,.., μk> holds.
Step 2 (Maximization, M): Calculate a new maximum
likelihood hypothesis h′ = <μ1′, μ2′, ..., μk′>, assuming the
value taken on by each hidden variable zij is its expected
value E[zij] calculated in Step 1. Then replace the
hypothesis h = <μ1, μ2, ..., μk> by the new hypothesis
h′ = <μ1′, μ2′, ..., μk′> and iterate.
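For a mixture of k one-dimensional Normals with a known, shared variance σ², the two steps take the usual closed form (reconstructed here under that simplifying assumption):
E-step: E[zij] = exp(−(xi − μj)² / 2σ²) / Σ_n exp(−(xi − μn)² / 2σ²)
M-step: μj ← Σ_i E[zij] xi / Σ_i E[zij]
A minimal runnable Python sketch (the sample data and parameter choices are assumptions made for illustration):

```python
import numpy as np

def em_gaussian_means(x, k, sigma, n_iters=50, seed=0):
    """Estimate the means of a mixture of k 1-D Gaussians with known, equal sigma."""
    rng = np.random.default_rng(seed)
    mu = rng.choice(x, size=k, replace=False)  # arbitrary initial hypothesis h
    for _ in range(n_iters):
        # E-step: expected value E[z_ij] that example i was generated by component j
        resp = np.exp(-((x[:, None] - mu[None, :]) ** 2) / (2 * sigma ** 2))
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate each mean as the responsibility-weighted average
        mu = (resp * x[:, None]).sum(axis=0) / resp.sum(axis=0)
    return mu

# Illustrative data: two clusters centred near -2 and +3, sigma = 1
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 200)])
print(em_gaussian_means(x, k=2, sigma=1.0))  # approximately [-2, 3] (order may vary)
```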
Conclusion
• The above algorithm for estimating the means of
a mixture of k Normal distributions illustrates the
essence of the EM approach: The current
hypothesis is used to estimate the unobserved
variables, and the expected values of these
variables are then used to calculate an improved
hypothesis.
• It can be proved that on each iteration through
this loop, the EM algorithm increases the
likelihood P(D|h) unless it is at a local maximum.
The algorithm thus converges to a locally maximum
likelihood hypothesis for the means (μ1, ..., μk).