Dynamic Bayesian Multinets
Jeff A. Bilmes
Department of Electrical Engineering
University of Washington
Seattle, WA 98195-2500
Abstract

In this work, dynamic Bayesian multinets are introduced where a Markov chain state at time t determines conditional independence patterns between random variables lying within a local time window surrounding t. It is shown how information-theoretic criterion functions can be used to induce sparse, discriminative, and class-conditional network structures that yield an optimal approximation to the class posterior probability, and therefore are useful for the classification task. Using a new structure-learning heuristic, the resulting models are tested on a medium-vocabulary isolated-word speech recognition task. It is demonstrated that these discriminatively structured dynamic Bayesian multinets, when trained in a maximum likelihood setting using EM, can outperform both HMMs and other dynamic Bayesian networks with a similar number of parameters.

1 Introduction

While Markov chains are sometimes a useful model for sequences, such simple independence assumptions can lead to poor representations of real processes. An alternative and highly successful extension to the Markov chain allows random functions to be applied to each Markov state to yield the hidden Markov model (HMM). As is well known, an HMM is simply one type of dynamic Bayesian network (DBN) [12], or more generally a graphical model [21]. When HMMs are considered as one small instance in this enormous family of models, it is reasonable to suppose that the independence assumptions underlying an HMM can further be relaxed to yield better models still.

The structure of a graphical model, however, is sometimes chosen for an application without ensuring that it matches the underlying process the model is supposed to represent. Using a hidden Markov model to represent speech, for example, is one such instance of pre-specifying an arbitrary model structure for a domain. Such an approach has obvious computational and infrastructural advantages: if the model is kept simple, inference is guaranteed to stay tractable and software tools can be developed and reused many times. The conditional independence properties of a particular model, however, could be sub-optimal for a given task. With a more appropriate model, substantial improvements in classification accuracy, memory requirements, and computational demands could potentially be achieved. While it might be sufficient to hand-specify the model for a given application, a promising approach allows the data itself to determine or at least influence the model.

In the most general case, there are four distinct components of a graphical model: the semantics, the structure, the implementation, and the parameters. There are a variety of different semantics, including directed (Bayesian network) and undirected (Markov random field) models, chain graphs, and other more experimental frameworks. In general, each corresponds to a different family of probability distributions and, based on training data, a semantics could potentially be selected or perhaps even induced anew. Fixing the semantics, obtaining a good model structure is crucial, and is therefore a current active research focus [7, 16, 5, 20, 10]. Fixing the structure, there are a variety of ways to implement¹ the dependencies between random variables, such as conditional probability tables, neural networks, decision trees, Gaussian polynomial regression, and so on. And finally, fixing all of the above, a good assignment of all the parameters must be found. Of course, in each case a Bayesian approach can also be taken, where we use a (potentially uncountably infinite) probabilistically weighted mixture over multiple choices.

¹While this is not standard terminology, a concise way to refer to the representation of the local conditional probability model is simply to use the term "implementation."

The task of learning graphical models can be seen as learning any or all of the above four components given a collection of data, and is akin to the model selection [22] problem known to the statistics community for years. In all cases, the underlying goal is to identify a system for probabilistic inference that is computationally efficient, accurate, and somehow informative about the given problem domain.
In this paper, a class of models called dynamic Bayesian multinets, together with a method to induce their structure for the classification task, is described. In this work, an extension of [3], the problem domain is speech recognition, so it is necessary to use dynamic models. Also, since classification is the goal, it is beneficial to learn class-specific and (as we will see) discriminative structure. And to further improve sparsity (and therefore reduce computational and memory demands) and to represent class-conditional information only where necessary, Bayesian multinets (described in the next section) are used.

Section 2 provides a review of structure learning in Bayesian networks and of Bayesian multinets, and presents the idea of structural discriminability. Section 3 introduces the class of models considered in this work and analyzes their inferential complexity. Section 4 provides three information-theoretic criterion functions that can be used to learn structure, the last of which provably provides an optimal approximation to the local posterior probability. Section 5 introduces the improved pairwise algorithm, a heuristic developed because the above induction procedure is computationally infeasible. Section 6 evaluates this system on a medium-vocabulary speech corpus and shows that when structure is determined using the discriminative induction method and trained using EM, these networks can outperform both HMMs and other dynamic Bayesian networks with a similar number of parameters. But when structure is determined arbitrarily, or without using a discriminative method, the performance is dramatically worse. Finally, Section 7 concludes and discusses future work.

2 Background

2.1 Structure Learning

A fully connected graphical model can represent any probability distribution representable by a sparsely structured one, but there are many important reasons for not using such a fully connected model. These include: 1) sparse network structures have fewer computational and memory requirements; 2) a sparse network is less susceptible to noise in training data (i.e., lower variance) and less prone to overfitting; and 3) the resulting structure might reveal high-level knowledge about the underlying problem domain that was previously drowned out by many extra dependencies. A graphical model should represent a dependence between two random variables only when necessary, where "necessary" depends on the current task. In essence, learning the structure in data is similar to developing an efficient code for the underlying random process, as efficient coding is analogous to probabilistic modeling.

Perhaps the earliest well-known work on structure learning in directed graphical models is [7]. More recent research on this topic may be found in [17, 5, 16, 25, 10, 20, 23, 13].² In general, the task of learning Bayesian networks can be grouped into four categories [10] depending on 1) whether the data is fully observable or contains missing values, and 2) whether the structure of the model is assumed to be known. The easiest case is when the data is fully observable and the model structure is known, whereas the most difficult case is when the data is only partially observable and the structure is unknown or only partially known.

²See, especially, the reviews given in [16, 5, 20].

Note that a general optimization procedure can be used to learn many aspects of a graphical model. Often, learning needs only a maximum likelihood procedure, perhaps with an additional complexity penalty term such as MDL or BIC. Alternatively, a Bayesian approach to learning can be used where no single structure or set of parameters is chosen. For certain classes of networks, the prior and posterior are particularly simple [16]. Alternatively, a risk minimization approach [26] can be applied to the learning problem.

In principle, an optimization procedure could simultaneously cover all four components of a graphical model: semantics, structure, implementation, and parameters. There has, however, been little if any research on methods to learn the best implementation and semantics. The problem becomes inherently difficult because the quality of each component cannot be accurately evaluated without first obtaining good settings for the other three components. The problem becomes more arduous when one begins to consider multi-implementation and multi-semantic models. In practice, therefore, one or more components are typically fixed before any optimization begins.

2.2 Bayesian Multinets

An advantage of Bayesian networks is that they can specify dependencies only when necessary, leading to a significant reduction in the cost of inference. Bayesian multinets [15, 14] further generalize Bayesian networks and can further reduce computation. A multinet can be thought of as a network where edges can appear or disappear depending on the values of certain nodes in the graph, a notion that has been called asymmetric independence assertions [14].
Consider a network with four nodes A, B, C, and Q. In a multinet, the conditional independence properties among A, B, and C might, for example, change for differing values of Q. If Q is binary, and C ⊥ A | {B, Q = 0} but C and A are dependent given {B, Q = 1}, then the joint probability could be written as:

p(A,B,C) = Σ_q p(A,B,C | Q=q) p(Q=q)
         = p(C|B, Q=0) p(B|A, Q=0) p(A|Q=0) p(Q=0) + p(C|B,A, Q=1) p(B|A, Q=1) p(A|Q=1) p(Q=1)

Some examples of multinets include mixtures of tree-dependent distributions [23] and class-conditional naive Bayes classifiers [11, 10].

In general, the statistical dependencies in a multinet could be represented by a regular Bayesian network via specific values of the parameters [14] (e.g., for switching linear Gaussian models, certain parameters could be zero, or for discrete probability tables, hyperplanes could indicate independence between random variables only for certain values of other random variables). In other words, the family of probability distributions representable by Bayesian networks and by Bayesian multinets is the same. In practice, however, a multinet could result in a substantial savings in memory, computation, and necessary sample-size complexity relative to an equivalent Bayesian network.
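To make the asymmetric-independence idea concrete, the following is a minimal numerical sketch (not from the paper) of the four-node example in Python: each value of Q selects a different factorization, and hence a different edge set, and the joint is recovered by mixing the two class-conditional networks. All probability tables are made-up placeholders.

```python
import itertools
import numpy as np

# Each value of Q selects a different class-conditional network: under Q=0,
# C depends only on B; under Q=1, C depends on both A and B.  All tables are
# illustrative placeholders, not values from the paper.
p_Q = np.array([0.6, 0.4])                               # p(Q)
p_A_given_Q = np.array([[0.7, 0.3],                      # p(A | Q=0)
                        [0.2, 0.8]])                     # p(A | Q=1)
p_B_given_AQ = np.array([[[0.9, 0.1], [0.4, 0.6]],       # p(B | A, Q), indexed [q][a][b]
                         [[0.5, 0.5], [0.1, 0.9]]])
p_C_given_BQ0 = np.array([[0.8, 0.2], [0.3, 0.7]])       # q=0: p(C | B), indexed [b][c]
p_C_given_ABQ1 = np.array([[[0.6, 0.4], [0.2, 0.8]],     # q=1: p(C | A, B),
                           [[0.9, 0.1], [0.5, 0.5]]])    # indexed [a][b][c]

def p_abc(a, b, c):
    """p(A=a, B=b, C=c), mixing the two class-conditional factorizations."""
    total = 0.0
    for q in (0, 1):
        p_c = p_C_given_BQ0[b, c] if q == 0 else p_C_given_ABQ1[a, b, c]
        total += p_c * p_B_given_AQ[q, a, b] * p_A_given_Q[q, a] * p_Q[q]
    return total

# The mixture is a proper joint distribution even though the edge set changes with Q.
print(sum(p_abc(a, b, c) for a, b, c in itertools.product((0, 1), repeat=3)))  # ~1.0
```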
2.3 The Classification Task

Many papers on structure learning concentrate on producing networks that best represent statistical dependencies extant in data. When the goal is classification, however, this is not necessarily optimal. Indeed, the class posterior probability will be accurately approximated if sample and class label are considered together, and then jointly optimized in a maximum likelihood procedure, assuming sufficient data. Such a procedure might be wasteful, however, as likelihood scores are dominated by the term containing dependencies only between features (which has a much larger magnitude) rather than by the term containing the class posterior probability. It is this latter term that, according to Bayes decision theory, must be accurately modeled to achieve good classification performance. In [10], this issue was noticed, and both extended versions of naive Bayes classifiers and class-conditional Bayesian multinets were considered, both of which outperformed the naive Bayes classifier on classification tasks.

In a general classification task, additional reductions in computation and increases in sparsity can be achieved by learning a specific network structure for each class, where each class-conditional network represents nothing other than those dependencies, often unique to its class, that help approximate the class posterior probability. This property has been called structural discriminability [2].

3 The Model

In this work, we consider a class of models called dynamic Bayesian multinets (DBMs). They consist of hidden Markov chains that determine local class-conditional Bayesian networks over a window of observations. Equivalently, they consist of extensions to hidden Markov models (HMMs) where additional cross-observation dependencies have been added as a function of the underlying Markov chain. This model is also called a buried Markov model (BMM) [3] because the hidden Markov chain in an HMM is further hidden (buried) by additional cross-observation dependencies.

First some notation: Q_t refers to a Markov state at time t, and Q_{1:T} ≜ {Q_1, Q_2, ..., Q_T} refers to the entire chain.³ X_t will refer to the observation vector at time t, with X_{ti} referring to its ith element. Using this notation, a hidden Markov model is a collection of hidden Q_{1:T} and observation X_{1:T} variables that possess the following conditional independence properties: {X_{t:T}, Q_{t:T}} ⊥ {Q_{1:t-2}, X_{1:t-1}} | Q_{t-1} and X_t ⊥ {Q_{¬t}, X_{¬t}} | Q_t for all t.

³In general, X_{1:t} is matlab-like notation to refer to the set of variables with indices between 1 and t inclusive, X_{<t} ≜ X_{1:(t-1)}, and X_{¬t} = {X_{1:T} \ X_t}.

We generalize this model such that X_{ti} is no longer conditionally independent of all the surrounding observations given Q_t. Relative to an HMM, a DBM has been augmented with chain-conditional cross-observation dependencies between individual observation elements. The probability model becomes:

p(x_{1:T}) = Σ_{q_{1:T}} Π_t p(x_t | z_t(q_t), q_t) p(q_t | q_{t-1})

where z_t(q) ⊆ x_{<t} for all t and q. For example, it could be that z_t(q) = {x_{t-1,3}, x_{t-1,5}, x_{t-2,1}, x_{t-3,9}} and z_t(r) = {x_{t-4,2}, x_{t-9,0}} for r ≠ q. A multinet occurs because z_t(q) is a function of q; if the Markov chain changes, so will the set of dependencies. Specifically, the conditional independence assumption among observation elements becomes X_{ti} ⊥ {X_{¬t} \ Z_{ti}(q)} | {q, Z_{ti}(q)}. This class of model is depicted in Figure 1 for two instantiations of the Markov chain.

Figure 1: Two networks corresponding to two values of the Markov chain, q_{1:T} and q'_{1:T}. The individual elements of the observation vectors are shown explicitly as dark shaded nodes surrounded by boxes. Nodes with diagonal lines correspond to instantiated hidden variables.
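As a concrete, heavily simplified illustration of this probability model, the following Python sketch evaluates p(x_{1:T}) with the standard forward recursion, where the emission p(x_t | z_t(q), q) conditions on a state-specific set of past observation elements. The parent sets, regression matrices, covariances, transition matrix, and the single-Gaussian implementation are all illustrative assumptions rather than values or choices from the paper (the experiments in Section 6 use a mixture of Gaussian linear regressions).

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
N, D, T = 2, 3, 6                          # states, feature dimension, frames
A = np.array([[0.9, 0.1], [0.2, 0.8]])     # p(q_t | q_{t-1}), an assumed transition matrix
pi = np.array([0.5, 0.5])                  # p(q_1)
# z_t(q): for each state q, a list of (time offset, feature index) parents.
parents = {0: [(-1, 0), (-2, 2)], 1: [(-1, 1)]}
B = {q: 0.1 * rng.standard_normal((D, len(parents[q]))) for q in range(N)}
Sigma = {q: np.eye(D) for q in range(N)}
x = rng.standard_normal((T, D))            # a synthetic observation sequence

def emission(t, q):
    """p(x_t | z_t(q), q): Gaussian with mean B_q z_t(q); out-of-range parents -> 0."""
    z = np.array([x[t + dt, i] if t + dt >= 0 else 0.0 for dt, i in parents[q]])
    return multivariate_normal.pdf(x[t], mean=B[q] @ z, cov=Sigma[q])

# Forward recursion: alpha_t(q) = p(x_{1:t}, q_t = q) under the sketched DBM.
alpha = pi * np.array([emission(0, q) for q in range(N)])
for t in range(1, T):
    alpha = (alpha @ A) * np.array([emission(t, q) for q in range(N)])
print("p(x_{1:T}) under the sketched DBM:", alpha.sum())
```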
In general, adding conditional dependencies in a DBN can significantly increase computational and memory complexity. For the junction tree algorithm, the complexity is O(Σ_{i=1}^{T} s(C_i)), where T is the number of resultant cliques in the junction tree and s(C_i) is the size of the state space for clique C_i. For an HMM with T time-steps and N states, there are O(T) cliques, each with at most a state space size of N², resulting in O(TN²).

To determine the complexity of a DBM, first define⁴ an AR-HMM(K) as an HMM but with additional edges pointing from observations X_{t-ℓ} to X_t for ℓ = 1, ..., K (see Figure 2).
Collapsing the vector graphical notation in Figure 1 into a single node, and including a dependency from observations X_τ to X_t in the AR-HMM if for any value of Q_t there exists a dependency between X_{τi} and X_{tj} for any i, j, a DBM with a maximal dependency across K observations can be more generally represented by an AR-HMM(K).

⁴AR-HMM stands for auto-regressive HMM. An AR-HMM can of course use any, possibly non-linear, implementation between observations.

Figure 2: Bayesian network for an AR-HMM(2).

There are three things to note about an AR-HMM(K). First, a moralized AR-HMM(K) is triangulated. This can be seen by induction, the base case being obvious, and the induction step following because 1) a cycle containing edges only contained in the previous step's graph must have a chord by induction, and 2) a cycle containing edges not in the previous step's graph (new edges) must also have a chord because the portion of a cycle not containing the new edges must go through a clique containing nodes adjacent by these new edges. Second, a triangulated-by-moralization AR-HMM(K) has at most two hidden variables in its cliques since no node has more than one hidden variable as a parent, so moralization does not add edges between hidden variables. The remaining clique variables are observations, so the state-space size is only N². Third, such a triangulated AR-HMM(K) has only O(T) cliques. Therefore, the complexity of an AR-HMM(K), and therefore of a DBM, is again only O(TN²) for any fixed K. There is, however, a constant cost associated with the number of additional dependency edges. Incorporating this cost, the complexity becomes O(TN²K), where K is the maximum number of dependency edges per observation. The extra dependency structure is sparse, however, so the computational and memory requirements of a DBM will in practice be much less than its O(TN²K) complexity suggests.

4 Structure Learning in DBMs

Structure learning consists of optimally choosing z_t(q) in p(x_t | z_t(q), q). In this section, three methods are considered. In each case, a fixed upper bound is assumed on the possible number of dependency variables; that is, it is assumed that |z_t(q)| ≤ c for some fixed c > 0. The phrases "choose dependency variables" and "choose dependencies" will be used synonymously. It is assumed that the reader is familiar with information-theoretic constructs [8].

The following theorem will be needed:

Theorem 4.1 (Mutual Information and Likelihood). Given three random variables X, Z^(a), and Z^(b), where I(X; Z^(a)) > I(X; Z^(b)), the likelihood of X given Z^(a) is higher than given Z^(b), for n, the sample size, large enough, i.e.,

(1/n) Σ_{i=1}^{n} log p(x_i | z_i^(a)) > (1/n) Σ_{i=1}^{n} log p(x_i | z_i^(b))     (1)

Proof. Under the assumption, it immediately follows that:

H(X | Z^(a)) < H(X | Z^(b))

Negating and expanding as integrals gives

∫ p(x, z^(a)) log p(x | z^(a)) dx dz^(a) > ∫ p(x, z^(b)) log p(x | z^(b)) dx dz^(b)

or equivalently

E[log p(X | Z^(a))] > E[log p(X | Z^(b))]

where (x_i, z_i^(k)) ~ p(X, Z^(k)) for k ∈ {a, b}. The weak law of large numbers implies that for every ε > 0 there exist n_a and n_b such that for n > max(n_a, n_b),

| (1/n) Σ_{i=1}^{n} log p(x_i | z_i^(k)) + H(X | Z^(k)) | < ε

again for k ∈ {a, b}. Choosing ε < |H(X | Z^(a)) − H(X | Z^(b))| / 2 to get n implies Equation 1. □
Of course, the actual probability distribution p(x|z) is not known, and only an approximation p̂(x|z, θ) is available, where the parameters θ are estimated, often using a method such as maximum likelihood, that decreases D(p(x|z) || p̂(x|z)), the KL-distance between the actual and approximate distributions. If p̂ is close enough to p, then the theorem above still holds. It is therefore assumed that the parameters of p̂ have been estimated well enough so that any differences with p are negligible.

The above theorem is, of course, also true for conditional mutual information [8] such as I(X; Z|Q), or for a particular value of q, I(X; Z|Q = q). Therefore, if I(X; Z^(a)(q)|Q = q) > I(X; Z^(b)(q)|Q = q) for all q, then:

(1/T) Σ_{t=1}^{T} log p(x_t | z_t^(a)(q_t), q_t) > (1/T) Σ_{t=1}^{T} log p(x_t | z_t^(b)(q_t), q_t)

These quantities can be viewed as likelihoods of the data given Viterbi paths q_t of modified HMMs. In the left case, the Viterbi path likelihood is higher. Note that, using a similar argument as in the theorem, and because H(X) ≥ H(X|Z), the analogous inequality holds when the Viterbi path is replaced by some non-Viterbi path r_t, for n large enough. In other words, relative to an HMM, the likelihood of the data for paths other than the Viterbi path does not decrease when adding conditioning variables. The following theorem has therefore been shown.

Theorem 4.2. A DBM with edges added relative to an HMM according to conditional mutual information produces a higher likelihood score than before modification.

The DBM represents statistical relationships contained in the data that are not well represented before modification, which is the reason for the higher likelihood. Augmenting the dependencies according to conditional mutual information therefore defines the first dependency selection rule.
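The scalar quantities behind this first rule, such as I(X_{ti}; Z_j | Q = q), must be estimated from training data. The following Python sketch (not from the paper) uses a simple histogram plug-in estimator on toy data; the number of bins, the sample size, and the synthetic data are assumptions made only for illustration.

```python
import numpy as np

def mutual_information(x, z, bins=8):
    """Histogram plug-in estimate of I(X; Z) in nats from paired scalar samples."""
    joint, _, _ = np.histogram2d(x, z, bins=bins)
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)   # marginal over X bins
    pz = joint.sum(axis=0, keepdims=True)   # marginal over Z bins
    nz = joint > 0                          # skip empty cells to avoid log(0)
    return float(np.sum(joint[nz] * np.log(joint[nz] / (px @ pz)[nz])))

# Toy data: Z carries information about X only under class q = 1.
rng = np.random.default_rng(1)
q = rng.integers(0, 2, size=50000)
z = rng.standard_normal(50000)
x = np.where(q == 1, z + 0.5 * rng.standard_normal(50000),
             rng.standard_normal(50000))

for label in (0, 1):
    idx = q == label
    print(f"I(X; Z | Q={label}) ~", round(mutual_information(x[idx], z[idx]), 3))
print("I(X; Z)        ~", round(mutual_information(x, z), 3))
```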
When the task is classification, however, a higher likelihood does not necessarily correspond to a lower error. Consider two states q and r ≠ q. To achieve a lower error, a modification of the q and r models should increase the average score of the q model in the context of a sample from q more than any increase in the r model in the context of q, for all r ≠ q. The score increases can in fact be negative, thereby decreasing the likelihood of both models, but potentially improving the classification accuracy. Accordingly, the score of a model in a different context can be evaluated using an extended form of conditional mutual information:⁵

I_r(X; Z|q) ≜ ∫ p(x, z|r) log [ p(x, z|q) / ( p(x|q) p(z|q) ) ] dx dz

⁵Called cross-context conditional mutual information in [2].

Therefore, I(X; Z(q)|q) should be large and I_q(X; Z(r)|r) should not be as large for each r. This suggests optimizing the following:⁶

S(X; Z|Q) ≜ Σ_{q,r} p(q) (δ_{qr} − p(r)) I_q(X; Z(r)|r)

where Z = ∪_i Z(i) and δ_{qr} is a Kronecker delta.

⁶Called discriminative conditional mutual information in [2].

This quantity can be further motivated by noticing that the expected class posterior probability can be expanded as follows:

E[log p(Q|X,Z)] = −H(Q|X,Z) = −H(X|Q,Z) − H(Q|Z) + H(X|Z)

so that

E[log p(Q|X,Z)] + H(X|Q) + H(Q) − H(X) = I(X; Z|Q) + I(Q; Z) − I(X; Z)

Furthermore, the conditional entropy can be bounded by

−H(X|Z) = ∫ log ( Σ_r p(x|r,z) p(r|z) ) dP(x,z,q)
        ≥ ∫ Σ_r p(r|z) log p(x|r,z) dP(x,z,q)
        ≈ Σ_{r,q} p(r) p(q) ∫ log p(x|z,r) dP(x,z|q)

where the first inequality follows from Jensen's inequality, and the approximate equality is valid if I(Q; Z) is small. From this, it can be shown [2] that choosing Z^(a) over Z^(b) when S(X; Z^(a)|Q) > S(X; Z^(b)|Q) will increase an upper bound on the expected class posterior probability, and therefore could potentially reduce the Bayes error. This defines a second dependency selection rule.

A generalization that does not require a small I(Q; Z) can be obtained by noticing that I(Q; Z) does not depend on p(x|z, q). Therefore, if Z = ∪_i Z(i) is chosen to maximize I(X; Z|Q) − I(X; Z), the average class posterior probability E[p(Q|X,Z)] will be maximized and therefore optimal for a fixed number of edges (see [2] for a proof).
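As a quick sanity check (not from the paper), the identity E[log p(Q|X,Z)] + H(X|Q) + H(Q) − H(X) = I(X; Z|Q) + I(Q; Z) − I(X; Z) can be verified numerically on an arbitrary discrete joint distribution; the random table below simply stands in for a true distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
p = rng.random((3, 4, 5))      # joint over (Q, X, Z); a random table stands in
p /= p.sum()                   # for the true distribution

def H(axes):
    """Entropy (in nats) of the marginal of p over the given axes."""
    marg = p.sum(axis=tuple(i for i in range(3) if i not in axes))
    return -np.sum(marg * np.log(marg))

H_QXZ, H_QX, H_QZ, H_XZ = H((0, 1, 2)), H((0, 1)), H((0, 2)), H((1, 2))
H_Q, H_X, H_Z = H((0,)), H((1,)), H((2,))

E_log_post = -(H_QXZ - H_XZ)                     # E[log p(Q|X,Z)] = -H(Q|X,Z)
lhs = E_log_post + (H_QX - H_Q) + H_Q - H_X      # ... + H(X|Q) + H(Q) - H(X)
rhs = ((H_QX + H_QZ - H_QXZ - H_Q)               # I(X;Z|Q)
       + (H_Q + H_Z - H_QZ)                      # I(Q;Z)
       - (H_X + H_Z - H_XZ))                     # I(X;Z)
print(round(lhs, 10), round(rhs, 10))            # the two sides agree
```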
The quantity I(X; Z|Q) − I(X; Z) could be called the explaining-away residual (or the EAR measure), and it asks for edges that are more class-conditionally dependent than marginally independent. Moreover, a multinet can result when choosing class-conditional edges according to I(X; Z|Q = q) − I(X; Z) for each q.

Summarizing this section, there are three possible rules that could be used to choose new edges between elements of X_t and previous observations:

Z*(q) = argmax_{Z(q) ⊆ X_{<t}, |Z(q)| ≤ c} I(X; Z(q)|q)     (2)

Z* = argmax_{Z ⊆ X_{<t}, |Z| ≤ c} S(X; Z|Q)     (3)

Z* = argmax_{Z ⊆ X_{<t}, |Z| ≤ c} [ I(X; Z|Q) − I(X; Z) ]     (4)

Equations 3 and 4 produce discriminatively structured networks, in that the underlying dependencies represented by the network are unique to each class. The resulting models achieve a high score in the presence of a sample from the right class, but get a low score in the presence of a different class. More importantly, this can be true even for non-optimal parameter settings since, via the structure, the networks are inherently less capable of achieving high scores for samples of the wrong class. Therefore, along with appropriate complexity penalties, it would be sufficient to learn parameters using likelihood-based methods rather than the more costly risk-minimization procedures [26, 1, 9, 18, 19].
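To make rule 4 concrete, the following Python sketch (an illustration, not code from the paper) scores every candidate parent set of bounded size by the EAR measure I(X; Z|Q) − I(X; Z), computed exactly from a small discrete joint distribution, and keeps the argmax. The random probability table and the axis layout are assumptions standing in for estimated distributions.

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)
p = rng.random((3, 4, 2, 2, 2))   # joint over (Q, X, Z1, Z2, Z3); random stand-in
p /= p.sum()

def entropy(keep):
    """Entropy (nats) of the marginal of p over the axes in `keep`."""
    marg = p.sum(axis=tuple(i for i in range(p.ndim) if i not in keep))
    return -np.sum(marg * np.log(marg))

def ear(z_axes, q=0, x=1):
    """EAR measure I(X; Z|Q) - I(X; Z) for the candidate parent axes `z_axes`."""
    z = tuple(z_axes)
    i_xz_q = entropy((q, x)) + entropy((q,) + z) - entropy((q, x) + z) - entropy((q,))
    i_xz = entropy((x,)) + entropy(z) - entropy((x,) + z)
    return i_xz_q - i_xz

candidates, c = [2, 3, 4], 2      # candidate Z axes and the bound |Z| <= c
best = max((s for k in range(1, c + 1)
            for s in itertools.combinations(candidates, k)), key=ear)
print("selected Z* =", best, " EAR =", round(ear(best), 4))
```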
5 The Improved Pairwise Algorithm

The optimization suggested in the previous section is clearly impractical. In this section, a new computationally efficient heuristic, entitled the improved pairwise algorithm, is introduced. The algorithm approximates the desired quantities using only pairwise conditional mutual information between scalars.

The algorithm is presented in Figure 3 using rule 4. All candidate scalar random variables in X_{1:t} are given and indexed by Z_j. A total of M edges will be added separately for each value q and for each X_{ti}. Therefore, the algorithm might allow for some redundancy if intra-feature dependencies are already modeled. The algorithm first sorts the scalars decreasing by the function f(Z_j) = I(X_{ti}; Z_j|Q = q) − I(X_{ti}; Z_j). The output is Z_{qi}, the set of variables from which a link should be added to X_{ti} under the class q.

INPUT: Z_j (candidates), q, M, X_{ti}
OUTPUT: Z_{qi}
Set Z_{qi} = ∅
Sort the Z_j so that f(Z_1) ≥ f(Z_2) ≥ ...
For j = 1, 2, ..., until f(Z_j) falls below a threshold:
  If Z_j satisfies all of the following three criteria:
    1) I(X_{ti}; Z_j|q) is larger than a threshold
    2) For each Z ∈ Z_{qi}, I(Z_j; Z|Q) < r I(Z_j; X_{ti}|Q)
    3) I(X_{ti}; Z_j) is smaller than a threshold
  then add Z_j to Z_{qi}, and break if |Z_{qi}| > M.

Figure 3: The improved pairwise heuristic: this algorithm chooses the dependency variables for the ith feature position of X_t and for class q.

The algorithm uses three criteria to eliminate candidate edges. The first ensures that the edge to Z_j is actually informative about X_{ti} in the context of q. The second is a redundancy check: it asks for an edge from a variable that has little information in common with the variables already added, as depicted in Figure 4. The degree of allowed redundancy is determined using 0 < r < 1. The third and discriminative criterion ensures the candidate variable does improve the model's scores in the general non-class-conditional case.

Figure 4: The redundancy check: the edge from Z_4 to X is being considered. It will only be added if I(Z_4; Z_i) for i < 4 is small, hopefully reducing the chance that a variable is added for which a class-conditional Markov blanket exists separating Z_4 from X and making Z_4 superfluous.

This algorithm is inexpensive, running in time O(|Q| |X| d²), where |Q| is the number of classes, |X| is the size of the observation vector, and d ≤ |X_{1:t}| is the maximum number of variables that are considered in the selection.
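A possible Python rendering of the Figure 3 heuristic is sketched below, assuming the pairwise mutual-information statistics have already been estimated (for example, with a plug-in estimator like the one sketched earlier). The threshold values, the redundancy factor r, and the example numbers are illustrative assumptions, and the cap of M edges is enforced here with len(Z_qi) >= M.

```python
def improved_pairwise(candidates, I_cond_q, I_marg, I_pair_cond_Q, I_x_cond_Q,
                      M, r=0.5, f_thresh=0.0, cond_thresh=0.01, marg_thresh=0.05):
    """Choose the dependency set Z_qi for one feature X_ti and one class q.

    I_cond_q[j]   : I(X_ti; Z_j | Q = q)
    I_marg[j]     : I(X_ti; Z_j)
    I_pair_cond_Q : {(j, k): I(Z_j; Z_k | Q)} with j < k   (redundancy check)
    I_x_cond_Q[j] : I(Z_j; X_ti | Q)
    The thresholds and redundancy factor r are illustrative assumptions.
    """
    f = lambda j: I_cond_q[j] - I_marg[j]       # sort function from Figure 3
    Z_qi = []
    for j in sorted(candidates, key=f, reverse=True):
        if f(j) < f_thresh:                     # stop once f falls below threshold
            break
        informative = I_cond_q[j] > cond_thresh                     # criterion 1
        non_redundant = all(I_pair_cond_Q[tuple(sorted((j, k)))]
                            < r * I_x_cond_Q[j] for k in Z_qi)      # criterion 2
        discriminative = I_marg[j] < marg_thresh                    # criterion 3
        if informative and non_redundant and discriminative:
            Z_qi.append(j)
            if len(Z_qi) >= M:                  # cap the number of added edges at M
                break
    return Z_qi

# Example with made-up pairwise statistics for four candidate variables:
stats = dict(I_cond_q={0: .30, 1: .25, 2: .20, 3: .05},
             I_marg={0: .01, 1: .20, 2: .02, 3: .01},
             I_x_cond_Q={0: .30, 1: .25, 2: .20, 3: .05},
             I_pair_cond_Q={(0, 1): .01, (0, 2): .25, (0, 3): .01,
                            (1, 2): .02, (1, 3): .01, (2, 3): .01})
print(improved_pairwise([0, 1, 2, 3], M=2, **stats))   # -> [0, 3] with these numbers
```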
6 Experimental Evaluation

In this section, DBMs are evaluated in the context of classifying speech utterances. All experiments are reported using the NYNEX Phonebook database [24]. Phonebook is a large-vocabulary "phonetically-rich isolated-word telephone-speech database." It contains a rich collection of vocabulary words including poly-syllabic words such as "exhaustion," "immobilizing," "sluggishness," and "overambitious," as well as monosyllabic words such as "awe," "biff," and "his."

The quantities I(X; Z|Q = q) are obtained using an initial baseline HMM-based system. The following general training procedure is used for all of the results reported in this section:
1) Train a bootstrap HMM-based system using EM.
2) Compute I(X; Z|Q = q) and I(X; Z) for all pairs within a 200ms range surrounding the current time.
3) Run the improved pairwise heuristic.
4) Train the resulting models again with EM.
5) Test the result.

The reported results all use a mixture of Gaussian linear regression models to implement the dependencies. In this case,

p(x | z(q), q) = Σ_m p(x | m, z(q), q) p(m|q)

where each underlying component is a Gaussian linear regressive process on z using a sparse dependency matrix B_{qm}:

p(x | m, z, q) = |2πΣ_{qm}|^{-1/2} exp( −(1/2) (x − B_{qm}z)' Σ_{qm}^{-1} (x − B_{qm}z) )

Therefore, p(x | m, z, q) is a Gaussian distribution with conditional mean B_{qm}z and covariance matrix Σ_{qm}. This implementation can, to some extent, simulate conditional variance by using mixtures, but it avoids many training complexities since closed-form EM update equations can be derived. Complete details of the experimental setup, training procedure, definitions of test and training sets, topology of the Markov chains, and so on are described in [2].
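For concreteness, the conditional mixture density above might be coded as in the short Python sketch below; the dimensions, mixture weights, regression matrices B_{qm}, and covariances Σ_{qm} are random placeholders (in the paper they are sparse and learned with EM).

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(3)
D, Dz, M = 4, 3, 2                      # feature dim, |z(q)|, mixture components
mix = np.array([0.6, 0.4])              # p(m | q) for one state q
B = [0.2 * rng.standard_normal((D, Dz)) for _ in range(M)]      # B_{qm}, placeholders
Sigma = [np.diag(rng.uniform(0.5, 1.5, D)) for _ in range(M)]   # Sigma_{qm}, placeholders

def cond_density(x, z):
    """p(x | z, q) = sum_m p(m|q) N(x; B_{qm} z, Sigma_{qm}) for one state q."""
    return sum(mix[m] * multivariate_normal.pdf(x, mean=B[m] @ z, cov=Sigma[m])
               for m in range(M))

x, z = rng.standard_normal(D), rng.standard_normal(Dz)
print("p(x | z(q), q) =", cond_density(x, z))
```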
|Vocab|   75      150     300     600     Params / SP
HMM       5.0%    7.0%    9.2%    13.3%   157k / 1
DBM       4.6%    6.5%    8.7%    12.3%   157k / 1
HMM       2.4%    3.5%    5.6%    7.7%    182k / 2
DBM       2.3%    3.4%    5.4%    7.5%    182k / 2
HMM       1.7%    2.8%    4.6%    6.2%    163k / 3
DBM       1.5%    2.6%    4.4%    6.0%    166k / 3
HMM       1.5%    2.7%    4.3%    5.8%    200k / 4
DBM       1.4%    2.6%    4.2%    5.6%    207k / 4

Table 1: Word-error rate results comparing HMMs with DBMs. |Vocab| is the vocabulary size (number of test classes), Params is the number of system parameters, and SP is the number of Markov states per phoneme.

The results, given in Table 1, show the error rate for different topologies of the Markov chain for each phone. The numbers are competitive with those reported using dynamic Bayesian networks on the same speech corpus and training/testing conditions [27], where the best achieved result was 2.7% with a 515k-parameter system on the 75-word vocabulary size case. The performance increases on average as more parameters are added to the system. In each case, however, the DBM outperforms an HMM with a comparable number of parameters and states per phone, and this is true across the different vocabulary sizes.

Case   Type   WER     Params / SP
1      CMI    32.0%   207k / 1
2      AR2    27.6%   207k / 1
3      AR1    20.9%   156k / 1
4      RAND   8.3%    207k / 1

Table 2: Performance of DBMs determined using alternate methods and evaluated on the 75-word vocabulary test case.

A second set of results is given in Table 2 and can be compared to those given in the 75-word column of Table 1. Case 1 shows the performance of a DBM created using Equation 2. This rule adds dependencies that increase the model scores but not the classification accuracy. In fact, the likelihood scores in this case were dramatically larger than before modification. As can be seen, however, the performance dramatically decreases, presumably because the models are not structurally discriminative.

Cases 2 and 3 show the performance when dependencies are added from the previous observation (the two previous for case 2) to the current observation, and the model is therefore not a multinet. The performance is also very poor, indicating again that relaxing the wrong conditional independence properties can dramatically decrease classification accuracy.

Case 4 shows the performance when a different random set of dependencies between observation elements is added for each state. Interestingly, case 4 is much better than the previous cases, suggesting that the most harmful and anti-discriminative dependencies have not been added. The performance, however, is still worse than the baseline HMM.

Several general points can be made from the two tables. Cases 1-3 indicate that dependencies that are added to a model structure to increase a (likelihood) score can cause a dramatic decrease in classification accuracy, even if the structures are augmented in a class-conditional way, as in case 1 above. Note that the likelihood scores for these models are dramatically higher both for the training and testing data, suggesting that overfitting is not the problem. The goal of many model selection methods [6, 22] is to choose a model that provides the best description of the data, but the above suggests that this can be inappropriate for classification. Admittedly, model selection procedures typically include complexity penalty terms (e.g., MDL, BIC, and so on). But these penalties do not select for discriminative structures.

Second, dependencies in a network should not be added just because they are missing.
Cases 2 and 3 add dependencies between adjacent observation vectors, an approach sometimes justified by noting that they are not directly represented by an HMM. But as the performance of these augmented models shows, adding missing dependencies can decrease classification performance.

Third, adding random dependencies does not produce performance as poor as in the previous cases, but neither is there any benefit. It is unlikely that choosing random dependencies, even if q-conditioned, will result in discriminative structure because the selection space is so large. The implications for structure learning methods that search over randomly chosen sets are clear: because of the large search space, it is unlikely that good sets of dependencies will be found in a reasonable amount of time. It seems crucial, therefore, to constrain the random search to those dependencies found to be useful in some way, as has been argued in the past [13].

Finally, as argued in [2], an HMM can approximate a distribution arbitrarily well given enough capacity, enough training data, and enough computation. The results in the tables support this claim, as increasing the number of parameters leads to improved accuracy. The performance improvement obtained by adding more hidden states is dramatic, but the additional discriminative DBM dependencies can provide further improvements.

The results show that the DBM achieves the same or better classification performance with the same parameters, thereby supporting the claim that they have achieved sparser, higher performing, but lower complexity networks.

7 Discussion

In this paper, a class of graphical models is considered that generalizes HMMs. Several methods to automatically learn structure were presented that have optimal properties either by maximizing the likelihood score or (the EAR measure) by maximizing the class posterior probability. A dependence selection heuristic, the improved pairwise algorithm, was introduced, and an implementation was tested using a medium-vocabulary speech corpus, showing that appreciable gains can be obtained when the dependencies are chosen appropriately.

While this paper does not address the problem of learning the hidden structure in networks and uses only a simple Markov chain to represent dynamics [4, 12], for speech it is often sufficient to consider a Markov chain as a probabilistic sequencer over strings of phonetic units. The multinets, which are conditioned on each of these sequences, determine local structure. Ultimately, it is planned to use and learn more complex models of dynamic behavior for those classes of signals that can benefit from it. It is also planned to use the EAR measure to determine more general discriminatively structured Bayesian multinets.

Acknowledgements

This work has benefited from discussions with Nir Friedman, Steven Greenberg, Michael Jordan, Katrin Kirchhoff, Kevin Murphy, Stuart Russell, and Geoff Zweig. The author would like to acknowledge the International Computer Science Institute, at which many of the computational experiments were performed, and thank the three anonymous reviewers for their useful comments.

References

[1] L.R. Bahl, P.F. Brown, P.V. de Souza, and R.L. Mercer. Maximum mutual information estimation of HMM parameters for speech recognition. In Proc. IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing, pages 49-52, Tokyo, Japan, December 1986.
[2] J. Bilmes. Natural Statistical Models for Automatic Speech Recognition. PhD thesis, U.C. Berkeley, Dept. of EECS, CS Division, 1999.
[3] J.A. Bilmes. Buried Markov models for speech recognition. In Proc. IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing, Phoenix, AZ, March 1999.
[4] X. Boyen, N. Friedman, and D. Koller. Discovering the hidden structure of complex dynamic systems. 15th Conf. on Uncertainty in Artificial Intelligence, 1999.
[5] W. Buntine. A guide to the literature on learning probabilistic networks from data. IEEE Trans. on Knowledge and Data Engineering, 8:195-210, 1994.
[6] K.P. Burnham and D.R. Anderson. Model Selection and Inference: A Practical Information-Theoretic Approach. Springer-Verlag, 1998.
[7] C.K. Chow and C.N. Liu. Approximating discrete probability distributions with dependence trees. IEEE Trans. on Info. Theory, 14, 1968.
[8] T.M. Cover and J.A. Thomas. Elements of Information Theory. Wiley, 1991.
[9] Y. Ephraim, A. Dembo, and L. Rabiner. A minimum discrimination information approach for HMM. IEEE Trans. Info. Theory, 35(5):1001-1013, September 1989.
[10] N. Friedman. The Bayesian structural EM algorithm. 14th Conf. on Uncertainty in Artificial Intelligence, 1998.
[11] N. Friedman, D. Geiger, and M. Goldszmidt. Bayesian network classifiers. Machine Learning, 29:131-163, 1997.
[12] N. Friedman, K. Murphy, and S. Russell. Learning the structure of dynamic probabilistic networks. 14th Conf. on Uncertainty in Artificial Intelligence, 1998.
[13] N. Friedman, I. Nachman, and D. Peer. Learning Bayesian network structure from massive datasets: The "sparse candidate" algorithm. 15th Conf. on Uncertainty in Artificial Intelligence, 1999.
[14] D. Geiger and D. Heckerman. Knowledge representation and inference in similarity networks and Bayesian multinets. Artificial Intelligence, 82:45-74, 1996.
[15] D. Heckerman. Probabilistic Similarity Networks. MIT Press, 1991.
[16] D. Heckerman. A tutorial on learning with Bayesian networks. Technical Report MSR-TR-95-06, Microsoft, 1995.
[17] D. Heckerman, D. Geiger, and D.M. Chickering. Learning Bayesian networks: The combination of knowledge and statistical data. Technical Report MSR-TR-94-09, Microsoft, 1994.
[18] B.-H. Juang, W. Chou, and C.-H. Lee. Minimum classification error rate methods for speech recognition. IEEE Trans. on Speech and Audio Signal Processing, 5(3):257-265, May 1997.
[19] S. Katagiri, B.-H. Juang, and C.-H. Lee. Pattern recognition using a family of design algorithms based upon the generalized probabilistic descent method. Proceedings of the IEEE, 1998.
[20] P. Krause. Learning probabilistic networks. Philips Research Labs Tech. Report, 1998.
[21] S.L. Lauritzen. Graphical Models. Oxford Science Publications, 1996.
[22] H. Linhart and W. Zucchini. Model Selection. Wiley, 1986.
[23] M. Meila. Learning with Mixtures of Trees. PhD thesis, MIT, 1999.
[24] J. Pitrelli, C. Fong, S.H. Wong, J.R. Spitz, and H.C. Leung. PhoneBook: A phonetically-rich isolated-word telephone-speech database. In Proc. IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing, 1995.
[25] M. Sahami. Learning limited dependence Bayesian classifiers. In Proc. 2nd Int. Conf. on Knowledge Discovery and Data Mining, 1996.
[26] V. Vapnik. Statistical Learning Theory. Wiley, 1998.
[27] G. Zweig. Speech Recognition with Dynamic Bayesian Networks. PhD thesis, U.C. Berkeley, 1998.