Dynamic Bayesian Multinets
Jeff A. Bilmes
Department of Electrical Engineering
University of Washington
Seattle, WA 98195-2500
Abstract

In this work, dynamic Bayesian multinets are introduced where a Markov chain state at time t determines conditional independence patterns between random variables lying within a local time window surrounding t. It is shown how information-theoretic criterion functions can be used to induce sparse, discriminative, and class-conditional network structures that yield an optimal approximation to the class posterior probability, and therefore are useful for the classification task. Using a new structure-learning heuristic, the resulting models are tested on a medium-vocabulary isolated-word speech recognition task. It is demonstrated that these discriminatively structured dynamic Bayesian multinets, when trained in a maximum likelihood setting using EM, can outperform both HMMs and other dynamic Bayesian networks with a similar number of parameters.

1 Introduction

While Markov chains are sometimes a useful model for sequences, such simple independence assumptions can lead to poor representations of real processes. An alternative and highly successful extension to the Markov chain allows random functions to be applied to each Markov state to yield the hidden Markov model (HMM). As is well known, an HMM is simply one type of dynamic Bayesian network (DBN) [12], or more generally a graphical model [21]. When HMMs are considered as one small instance in this enormous family of models, it is reasonable to suppose that the independence assumptions underlying an HMM can further be relaxed to yield better models still.

The structure of a graphical model, however, is sometimes chosen for an application without ensuring that it matches the underlying process the model is supposed to represent. Using a hidden Markov model to represent speech, for example, is one such instance of pre-specifying an arbitrary model structure for a domain. Such an approach has obvious computational and infrastructural advantages: if the model is kept simple, inference is guaranteed to stay tractable and software tools can be developed and reused many times. The conditional independence properties of a particular model, however, could be sub-optimal for a given task. With a more appropriate model, substantial improvements in classification accuracy, memory requirements, and computational demands could potentially be achieved. While it might be sufficient to hand-specify the model for a given application, a promising approach allows the data itself to determine or at least influence the model.

In the most general case, there are four distinct components of a graphical model: the semantics, the structure, the implementation, and the parameters. There are a variety of different semantics, including directed (Bayesian network) and undirected (Markov random field) models, chain graphs, and other more experimental frameworks. In general, each corresponds to a different family of probability distributions and, based on training data, a semantics could potentially be selected or perhaps even induced anew. Fixing the semantics, obtaining a good model structure is crucial, and is therefore a current active research focus [7, 16, 5, 20, 10]. Fixing the structure, there are a variety of ways to implement¹ the dependencies between random variables, such as conditional probability tables, neural networks, decision trees, Gaussian polynomial regression, and so on. And finally, fixing all of the above, a good assignment of all the parameters must be found. Of course, in each case a Bayesian approach can also be taken, where we use a (potentially uncountably infinite) probabilistically weighted mixture over multiple choices.

¹While this is not standard terminology, a concise way to refer to the representation of the local conditional probability model is simply to use the term "implementation."

The task of learning graphical models can be seen as learning any or all of the above four components given a collection of data, and is akin to the model selection [22] problem known to the statistics community for years. In all cases, the underlying goal is to identify a system for probabilistic inference that is computationally efficient, accurate, and somehow informative about the given problem domain.
In this paper, a class of models called dynamic Bayesian multinets, together with a method to induce their structure for the classification task, is described. In this work, an extension of [3], the problem domain is speech recognition, so it is necessary to use dynamic models. Also, since classification is the goal, it is beneficial to learn class-specific and (as we will see) discriminative structure. And to further improve sparsity (and therefore reduce computational and memory demands) and to represent class-conditional information only where necessary, Bayesian multinets (described in the next section) are used.

Section 2 provides a review of structure learning in Bayesian networks and of Bayesian multinets, and presents the idea of structural discriminability. Section 3 introduces the class of models considered in this work and analyzes their inferential complexity. Section 4 provides three information-theoretic criterion functions that can be used to learn structure, the last of which provably provides an optimal approximation to the local posterior probability. Section 5 introduces the improved pairwise algorithm, a heuristic developed because the above induction procedure is computationally infeasible. Section 6 evaluates this system on a medium-vocabulary speech corpus and shows that when structure is determined using the discriminative induction method and trained using EM, these networks can outperform both HMMs and other dynamic Bayesian networks with a similar number of parameters. But when structure is determined arbitrarily, or without using a discriminative method, the performance is dramatically worse. Finally, Section 7 concludes and discusses future work.

2 Background

2.1 Structure Learning

A fully connected graphical model can represent any probability distribution representable by a sparsely structured one, but there are many important reasons for not using such a fully connected model. These include: 1) sparse network structures have fewer computational and memory requirements; 2) a sparse network is less susceptible to noise in training data (i.e., lower variance) and less prone to overfitting; and 3) the resulting structure might reveal high-level knowledge about the underlying problem domain that was previously drowned out by many extra dependencies. A graphical model should represent a dependence between two random variables only when necessary, where "necessary" depends on the current task. In essence, learning the structure in data is similar to developing an efficient code for the underlying random process, as efficient coding is analogous to probabilistic modeling.

Perhaps the earliest well-known work on structure learning in directed graphical models is [7]. More recent research on this topic may be found in [17, 5, 16, 25, 10, 20, 23, 13].² In general, the task of learning Bayesian networks can be grouped into four categories [10] depending on 1) whether the data is fully observable or contains missing values, and 2) whether the structure of the model is assumed to be known. The easiest case is when the data is fully observable and the model structure is known, whereas the most difficult case is when the data is only partially observable and the structure is unknown or only partially known.

²See, especially, the reviews given in [16, 5, 20].

Note that a general optimization procedure can be used to learn many aspects of a graphical model. Often, learning needs only a maximum likelihood procedure, perhaps with an additional complexity penalty term such as MDL or BIC. Alternatively, a Bayesian approach to learning can be used where no single structure or set of parameters is chosen. For certain classes of networks, the prior and posterior are particularly simple [16]. Alternatively, a risk minimization approach [26] can be applied to the learning problem.

In principle, an optimization procedure could simultaneously cover all four components of a graphical model: semantics, structure, implementation, and parameters. There has, however, been little if any research on methods to learn the best implementation and semantics. The problem becomes inherently difficult because the quality of each component cannot be accurately evaluated without first obtaining good settings for the other three components. The problem becomes more arduous when one begins to consider multi-implementation and multi-semantic models. In practice, therefore, one or more components are typically fixed before any optimization begins.

2.2 Bayesian Multinets

An advantage of Bayesian networks is that they can specify dependencies only when necessary, leading to a significant reduction in the cost of inference. Bayesian multinets [15, 14] further generalize Bayesian networks and can further reduce computation. A multinet can be thought of as a network where edges can appear or disappear depending on the values of certain nodes in the graph, a notion that has been called asymmetric independence assertions [14].
Consider a network with four nodes A, B, C, and Q. In a multinet, the conditional independence properties among A, B, and C might, for example, change for differing values of Q. If Q is binary, and C ⊥ A | {B, Q = 0} but C and A are dependent given {B, Q = 1}, then the joint probability could be written as:

p(A,B,C) = Σ_q p(A,B,C | Q=q) p(Q=q)
         = p(C|B, Q=0) p(B|A, Q=0) p(A|Q=0) p(Q=0) + p(C|B,A, Q=1) p(B|A, Q=1) p(A|Q=1) p(Q=1)

Some examples of multinets include mixtures of tree-dependent distributions [23] and class-conditional naive Bayes classifiers [11, 10].

In general, the statistical dependencies in a multinet could be represented by a regular Bayesian network via specific values of the parameters [14] (e.g., for switching linear Gaussian models, certain parameters could be zero, or for discrete probability tables, hyperplanes could indicate independence between random variables only for certain values of other random variables). In other words, the family of probability distributions representable by Bayesian networks and by Bayesian multinets is the same. In practice, however, a multinet could result in a substantial savings in memory, computation, and necessary sample-size complexity relative to an equivalent Bayesian network.
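To make the asymmetric-independence idea concrete, the following is a minimal numerical sketch (not from the paper) of the four-node example in Python: each value of Q selects a different factorization, and hence a different edge set, and the joint is recovered by mixing the two class-conditional networks. All probability tables are made-up placeholders.

```python
import itertools
import numpy as np

# Each value of Q selects a different class-conditional network: under Q=0,
# C depends only on B; under Q=1, C depends on both A and B.  All tables are
# illustrative placeholders, not values from the paper.
p_Q = np.array([0.6, 0.4])                               # p(Q)
p_A_given_Q = np.array([[0.7, 0.3],                      # p(A | Q=0)
                        [0.2, 0.8]])                     # p(A | Q=1)
p_B_given_AQ = np.array([[[0.9, 0.1], [0.4, 0.6]],       # p(B | A, Q), indexed [q][a][b]
                         [[0.5, 0.5], [0.1, 0.9]]])
p_C_given_BQ0 = np.array([[0.8, 0.2], [0.3, 0.7]])       # q=0: p(C | B), indexed [b][c]
p_C_given_ABQ1 = np.array([[[0.6, 0.4], [0.2, 0.8]],     # q=1: p(C | A, B),
                           [[0.9, 0.1], [0.5, 0.5]]])    # indexed [a][b][c]

def p_abc(a, b, c):
    """p(A=a, B=b, C=c), mixing the two class-conditional factorizations."""
    total = 0.0
    for q in (0, 1):
        p_c = p_C_given_BQ0[b, c] if q == 0 else p_C_given_ABQ1[a, b, c]
        total += p_c * p_B_given_AQ[q, a, b] * p_A_given_Q[q, a] * p_Q[q]
    return total

# The mixture is a proper joint distribution even though the edge set changes with Q.
print(sum(p_abc(a, b, c) for a, b, c in itertools.product((0, 1), repeat=3)))  # ~1.0
```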
2.3 The Classification Task

Many papers on structure learning concentrate on producing networks that best represent statistical dependencies extant in data. When the goal is classification, however, this is not necessarily optimal. Indeed, the class posterior probability will be accurately approximated if sample and class label are considered together, and then jointly optimized in a maximum likelihood procedure, assuming sufficient data. Such a procedure might be wasteful, however, as likelihood scores are dominated by the term containing dependencies only between features (which has a much larger magnitude) rather than by the term containing the class posterior probability. It is this latter term that, according to Bayes decision theory, must be accurately modeled to achieve good classification performance. In [10], this issue was noticed, and both extended versions of naive Bayes classifiers and class-conditional Bayesian multinets were considered, both of which outperformed the naive Bayes classifier on classification tasks.

In a general classification task, additional reductions in computation and increases in sparsity can be achieved by learning a specific network structure for each class, where each class-conditional network represents nothing other than those dependencies, often unique to its class, that help approximate the class posterior probability. This property has been called structural discriminability [2].

3 The Model

In this work, we consider a class of models called dynamic Bayesian multinets (DBMs). They consist of hidden Markov chains that determine local class-conditional Bayesian networks over a window of observations. Equivalently, they consist of extensions to hidden Markov models (HMMs) where additional cross-observation dependencies have been added as a function of the underlying Markov chain. This model is also called a buried Markov model (BMM) [3] because the hidden Markov chain in an HMM is further hidden (buried) by additional cross-observation dependencies.

First some notation: Q_t refers to a Markov state at time t, and Q_{1:T} ≜ {Q_1, Q_2, ..., Q_T} refers to the entire chain.³ X_t will refer to the observation vector at time t, with X_{ti} referring to its ith element. Using this notation, a hidden Markov model is a collection of hidden Q_{1:T} and observation X_{1:T} variables that possess the following conditional independence properties: {X_{t:T}, Q_{t:T}} ⊥ {Q_{1:t-2}, X_{1:t-1}} | Q_{t-1} and X_t ⊥ {Q_{¬t}, X_{¬t}} | Q_t for all t.

³In general, X_{1:t} is matlab-like notation to refer to the set of variables with indices between 1 and t inclusive, X_{<t} ≜ X_{1:(t-1)}, and X_{¬t} = {X_{1:T} \ X_t}.

We generalize this model such that X_{ti} is no longer conditionally independent of all the surrounding observations given Q_t. Relative to an HMM, a DBM has been augmented with chain-conditional cross-observation dependencies between individual observation elements. The probability model becomes:

p(x_{1:T}) = Σ_{q_{1:T}} Π_t p(x_t | z_t(q_t), q_t) p(q_t | q_{t-1})

where z_t(q) ⊆ x_{<t} for all t and q. For example, it could be that z_t(q) = {x_{t-1,3}, x_{t-1,5}, x_{t-2,1}, x_{t-3,9}} and z_t(r) = {x_{t-4,2}, x_{t-9,0}} for r ≠ q. A multinet occurs because z_t(q) is a function of q; if the Markov chain changes, so will the set of dependencies. Specifically, the conditional independence assumption among observation elements becomes X_{ti} ⊥ {X_{¬t} \ Z_{ti}(q)} | {q, Z_{ti}(q)}. This class of model is depicted in Figure 1 for two instantiations of the Markov chain.

Figure 1: Two networks corresponding to two values of the Markov chain, q_{1:T} and q'_{1:T}. The individual elements of the observation vectors are shown explicitly as dark shaded nodes surrounded by boxes. Nodes with diagonal lines correspond to instantiated hidden variables.
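As a concrete, heavily simplified illustration of this probability model, the following Python sketch evaluates p(x_{1:T}) with the standard forward recursion, where the emission p(x_t | z_t(q), q) conditions on a state-specific set of past observation elements. The parent sets, regression matrices, covariances, transition matrix, and the single-Gaussian implementation are all illustrative assumptions rather than values or choices from the paper (the experiments in Section 6 use a mixture of Gaussian linear regressions).

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
N, D, T = 2, 3, 6                          # states, feature dimension, frames
A = np.array([[0.9, 0.1], [0.2, 0.8]])     # p(q_t | q_{t-1}), an assumed transition matrix
pi = np.array([0.5, 0.5])                  # p(q_1)
# z_t(q): for each state q, a list of (time offset, feature index) parents.
parents = {0: [(-1, 0), (-2, 2)], 1: [(-1, 1)]}
B = {q: 0.1 * rng.standard_normal((D, len(parents[q]))) for q in range(N)}
Sigma = {q: np.eye(D) for q in range(N)}
x = rng.standard_normal((T, D))            # a synthetic observation sequence

def emission(t, q):
    """p(x_t | z_t(q), q): Gaussian with mean B_q z_t(q); out-of-range parents -> 0."""
    z = np.array([x[t + dt, i] if t + dt >= 0 else 0.0 for dt, i in parents[q]])
    return multivariate_normal.pdf(x[t], mean=B[q] @ z, cov=Sigma[q])

# Forward recursion: alpha_t(q) = p(x_{1:t}, q_t = q) under the sketched DBM.
alpha = pi * np.array([emission(0, q) for q in range(N)])
for t in range(1, T):
    alpha = (alpha @ A) * np.array([emission(t, q) for q in range(N)])
print("p(x_{1:T}) under the sketched DBM:", alpha.sum())
```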
In general, adding conditional dependencies in a DBN can significantly increase computational and memory complexity. For the junction tree algorithm, the complexity is O(Σ_{i=1}^{T} s(C_i)), where T is the number of resultant cliques in the junction tree and s(C_i) is the size of the state space for clique C_i. For an HMM with T time-steps and N states, there are O(T) cliques, each with at most a state space size of N², resulting in O(TN²).

To determine the complexity of a DBM, first define⁴ an AR-HMM(K) as an HMM but with additional edges pointing from observations X_{t-ℓ} to X_t for ℓ = 1, ..., K (see Figure 2).
Collapsing the vector graphical notation in Figure 1 into a single node, and including a dependency from observations X_τ to X_t in the AR-HMM if for any value of Q_t there exists a dependency between X_{τi} and X_{tj} for any i, j, a DBM with a maximal dependency across K observations can be more generally represented by an AR-HMM(K).

⁴AR-HMM stands for auto-regressive HMM. An AR-HMM can of course use any, possibly non-linear, implementation between observations.

Figure 2: Bayesian network for an AR-HMM(2).

There are three things to note about an AR-HMM(K). First, a moralized AR-HMM(K) is triangulated. This can be seen by induction, the base case being obvious, and the induction step following because 1) a cycle containing edges only contained in the previous step's graph must have a chord by induction, and 2) a cycle containing edges not in the previous step's graph (new edges) must also have a chord because the portion of a cycle not containing the new edges must go through a clique containing nodes adjacent by these new edges. Second, a triangulated-by-moralization AR-HMM(K) has at most two hidden variables in its cliques since no node has more than one hidden variable as a parent, so moralization does not add edges between hidden variables. The remaining clique variables are observations, so the state-space size is only N². Third, such a triangulated AR-HMM(K) has only O(T) cliques. Therefore, the complexity of an AR-HMM(K), and therefore of a DBM, is again only O(TN²) for any fixed K. There is, however, a constant cost associated with the number of additional dependency edges. Incorporating this cost, the complexity becomes O(TN²K), where K is the maximum number of dependency edges per observation. The extra dependency structure is sparse, however, so the computational and memory requirements of a DBM will in practice be much less than its O(TN²K) complexity suggests.

4 Structure Learning in DBMs

Structure learning consists of optimally choosing z_t(q) in p(x_t | z_t(q), q). In this section, three methods are considered. In each case, a fixed upper bound is assumed on the possible number of dependency variables; that is, it is assumed that |z_t(q)| ≤ c for some fixed c > 0. The phrases "choose dependency variables" and "choose dependencies" will be used synonymously. It is assumed that the reader is familiar with information-theoretic constructs [8].

The following theorem will be needed:

Theorem 4.1 (Mutual Information and Likelihood). Given three random variables X, Z^(a), and Z^(b), where I(X; Z^(a)) > I(X; Z^(b)), the likelihood of X given Z^(a) is higher than given Z^(b), for n, the sample size, large enough, i.e.,

(1/n) Σ_{i=1}^{n} log p(x_i | z_i^(a)) > (1/n) Σ_{i=1}^{n} log p(x_i | z_i^(b))     (1)

Proof. Under the assumption, it immediately follows that:

H(X | Z^(a)) < H(X | Z^(b))

Negating and expanding as integrals gives

∫ p(x, z^(a)) log p(x | z^(a)) dx dz^(a) > ∫ p(x, z^(b)) log p(x | z^(b)) dx dz^(b)

or equivalently

E[log p(X | Z^(a))] > E[log p(X | Z^(b))]

where (x_i, z_i^(k)) ~ p(X, Z^(k)) for k ∈ {a, b}. The weak law of large numbers implies that for every ε > 0 there exist n_a and n_b such that for n > max(n_a, n_b),

| (1/n) Σ_{i=1}^{n} log p(x_i | z_i^(k)) + H(X | Z^(k)) | < ε

again for k ∈ {a, b}. Choosing ε < |H(X | Z^(a)) − H(X | Z^(b))| / 2 to get n implies Equation 1. □
Of course, the actual probability distribution p(x|z) is not known, and only an approximation p̂(x|z, θ) is available, where the parameters θ are estimated, often using a method such as maximum likelihood, that decreases D(p(x|z) || p̂(x|z)), the KL-distance between the actual and approximate distributions. If p̂ is close enough to p, then the theorem above still holds. It is therefore assumed that the parameters of p̂ have been estimated well enough so that any differences with p are negligible.

The above theorem is, of course, also true for conditional mutual information [8] such as I(X; Z|Q), or for a particular value of q, I(X; Z|Q = q). Therefore, if I(X; Z^(a)(q)|Q = q) > I(X; Z^(b)(q)|Q = q) for all q, then:

(1/T) Σ_{t=1}^{T} log p(x_t | z_t^(a)(q_t), q_t) > (1/T) Σ_{t=1}^{T} log p(x_t | z_t^(b)(q_t), q_t)

These quantities can be viewed as likelihoods of the data given Viterbi paths q_t of modified HMMs. In the left case, the Viterbi path likelihood is higher. Note that, using a similar argument as in the theorem, and because H(X) ≥ H(X|Z), the analogous inequality holds when the Viterbi path is replaced by some non-Viterbi path r_t, for n large enough. In other words, relative to an HMM, the likelihood of the data for paths other than the Viterbi path does not decrease when adding conditioning variables. The following theorem has therefore been shown.

Theorem 4.2. A DBM with edges added relative to an HMM according to conditional mutual information produces a higher likelihood score than before modification.

The DBM represents statistical relationships contained in the data that are not well represented before modification, which is the reason for the higher likelihood. Augmenting the dependencies according to conditional mutual information therefore defines the first dependency selection rule.
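The scalar quantities behind this first rule, such as I(X_{ti}; Z_j | Q = q), must be estimated from training data. The following Python sketch (not from the paper) uses a simple histogram plug-in estimator on toy data; the number of bins, the sample size, and the synthetic data are assumptions made only for illustration.

```python
import numpy as np

def mutual_information(x, z, bins=8):
    """Histogram plug-in estimate of I(X; Z) in nats from paired scalar samples."""
    joint, _, _ = np.histogram2d(x, z, bins=bins)
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)   # marginal over X bins
    pz = joint.sum(axis=0, keepdims=True)   # marginal over Z bins
    nz = joint > 0                          # skip empty cells to avoid log(0)
    return float(np.sum(joint[nz] * np.log(joint[nz] / (px @ pz)[nz])))

# Toy data: Z carries information about X only under class q = 1.
rng = np.random.default_rng(1)
q = rng.integers(0, 2, size=50000)
z = rng.standard_normal(50000)
x = np.where(q == 1, z + 0.5 * rng.standard_normal(50000),
             rng.standard_normal(50000))

for label in (0, 1):
    idx = q == label
    print(f"I(X; Z | Q={label}) ~", round(mutual_information(x[idx], z[idx]), 3))
print("I(X; Z)        ~", round(mutual_information(x, z), 3))
```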
When the task is classification, however, a higher likelihood does not necessarily correspond to a lower error. Consider two states q and r ≠ q. To achieve a lower error, a modification of the q and r models should increase the average score of the q model in the context of a sample from q more than any increase in the r model in the context of q, for all r ≠ q. The score increases can in fact be negative, thereby decreasing the likelihood of both models, but potentially improving the classification accuracy. Accordingly, the score of a model in a different context can be evaluated using an extended form of conditional mutual information:⁵

I_r(X; Z|q) ≜ ∫ p(x, z|r) log [ p(x, z|q) / ( p(x|q) p(z|q) ) ] dx dz

⁵Called cross-context conditional mutual information in [2].

Therefore, I(X; Z(q)|q) should be large and I_q(X; Z(r)|r) should not be as large for each r. This suggests optimizing the following:⁶

S(X; Z|Q) ≜ Σ_{q,r} p(q) (δ_{qr} − p(r)) I_q(X; Z(r)|r)

where Z = ∪_i Z(i) and δ_{qr} is a Kronecker delta.

⁶Called discriminative conditional mutual information in [2].

This quantity can be further motivated by noticing that the expected class posterior probability can be expanded as follows:

E[log p(Q|X,Z)] = −H(Q|X,Z) = −H(X|Q,Z) − H(Q|Z) + H(X|Z)

so that

E[log p(Q|X,Z)] + H(X|Q) + H(Q) − H(X) = I(X; Z|Q) + I(Q; Z) − I(X; Z)

Furthermore, the conditional entropy can be bounded by

−H(X|Z) = ∫ log ( Σ_r p(x|r,z) p(r|z) ) dP(x,z,q)
        ≥ ∫ Σ_r p(r|z) log p(x|r,z) dP(x,z,q)
        ≈ Σ_{r,q} p(r) p(q) ∫ log p(x|z,r) dP(x,z|q)

where the first inequality follows from Jensen's inequality, and the approximate equality is valid if I(Q; Z) is small. From this, it can be shown [2] that choosing Z^(a) over Z^(b) when S(X; Z^(a)|Q) > S(X; Z^(b)|Q) will increase an upper bound on the expected class posterior probability, and therefore could potentially reduce the Bayes error. This defines a second dependency selection rule.

A generalization that does not require a small I(Q; Z) can be obtained by noticing that I(Q; Z) does not depend on p(x|z, q). Therefore, if Z = ∪_i Z(i) is chosen to maximize I(X; Z|Q) − I(X; Z), the average class posterior probability E[p(Q|X,Z)] will be maximized and therefore optimal for a fixed number of edges (see [2] for a proof).
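As a quick sanity check (not from the paper), the identity E[log p(Q|X,Z)] + H(X|Q) + H(Q) − H(X) = I(X; Z|Q) + I(Q; Z) − I(X; Z) can be verified numerically on an arbitrary discrete joint distribution; the random table below simply stands in for a true distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
p = rng.random((3, 4, 5))      # joint over (Q, X, Z); a random table stands in
p /= p.sum()                   # for the true distribution

def H(axes):
    """Entropy (in nats) of the marginal of p over the given axes."""
    marg = p.sum(axis=tuple(i for i in range(3) if i not in axes))
    return -np.sum(marg * np.log(marg))

H_QXZ, H_QX, H_QZ, H_XZ = H((0, 1, 2)), H((0, 1)), H((0, 2)), H((1, 2))
H_Q, H_X, H_Z = H((0,)), H((1,)), H((2,))

E_log_post = -(H_QXZ - H_XZ)                     # E[log p(Q|X,Z)] = -H(Q|X,Z)
lhs = E_log_post + (H_QX - H_Q) + H_Q - H_X      # ... + H(X|Q) + H(Q) - H(X)
rhs = ((H_QX + H_QZ - H_QXZ - H_Q)               # I(X;Z|Q)
       + (H_Q + H_Z - H_QZ)                      # I(Q;Z)
       - (H_X + H_Z - H_XZ))                     # I(X;Z)
print(round(lhs, 10), round(rhs, 10))            # the two sides agree
```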
The quantity I(X; Z|Q) − I(X; Z) could be called the explaining-away residual (or the EAR measure), and it asks for edges that are more class-conditionally dependent than marginally independent. Moreover, a multinet can result when choosing class-conditional edges according to I(X; Z|Q = q) − I(X; Z) for each q.

Summarizing this section, there are three possible rules that could be used to choose new edges between elements of X_t and previous observations:

Z*(q) = argmax_{Z(q) ⊆ X_{<t}, |Z(q)| ≤ c} I(X; Z(q)|q)     (2)

Z* = argmax_{Z ⊆ X_{<t}, |Z| ≤ c} S(X; Z|Q)     (3)

Z* = argmax_{Z ⊆ X_{<t}, |Z| ≤ c} [ I(X; Z|Q) − I(X; Z) ]     (4)

Equations 3 and 4 produce discriminatively structured networks, in that the underlying dependencies represented by the network are unique to each class. The resulting models achieve a high score in the presence of a sample from the right class, but get a low score in the presence of a different class. More importantly, this can be true even for non-optimal parameter settings since, via the structure, the networks are inherently less capable of achieving high scores for samples of the wrong class. Therefore, along with appropriate complexity penalties, it would be sufficient to learn parameters using likelihood-based methods rather than the more costly risk-minimization procedures [26, 1, 9, 18, 19].
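To make rule 4 concrete, the following Python sketch (an illustration, not code from the paper) scores every candidate parent set of bounded size by the EAR measure I(X; Z|Q) − I(X; Z), computed exactly from a small discrete joint distribution, and keeps the argmax. The random probability table and the axis layout are assumptions standing in for estimated distributions.

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)
p = rng.random((3, 4, 2, 2, 2))   # joint over (Q, X, Z1, Z2, Z3); random stand-in
p /= p.sum()

def entropy(keep):
    """Entropy (nats) of the marginal of p over the axes in `keep`."""
    marg = p.sum(axis=tuple(i for i in range(p.ndim) if i not in keep))
    return -np.sum(marg * np.log(marg))

def ear(z_axes, q=0, x=1):
    """EAR measure I(X; Z|Q) - I(X; Z) for the candidate parent axes `z_axes`."""
    z = tuple(z_axes)
    i_xz_q = entropy((q, x)) + entropy((q,) + z) - entropy((q, x) + z) - entropy((q,))
    i_xz = entropy((x,)) + entropy(z) - entropy((x,) + z)
    return i_xz_q - i_xz

candidates, c = [2, 3, 4], 2      # candidate Z axes and the bound |Z| <= c
best = max((s for k in range(1, c + 1)
            for s in itertools.combinations(candidates, k)), key=ear)
print("selected Z* =", best, " EAR =", round(ear(best), 4))
```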
5 The Improved Pairwise Algorithm

The optimization suggested in the previous section is clearly impractical. In this section, a new computationally efficient heuristic, entitled the improved pairwise algorithm, is introduced. The algorithm approximates the desired quantities using only pairwise conditional mutual information between scalars.

The algorithm is presented in Figure 3 using rule 4. All candidate scalar random variables in X_{1:t} are given and indexed by Z_j. A total of M edges will be added separately for each value q and for each X_{ti}. Therefore, the algorithm might allow for some redundancy if intra-feature dependencies are already modeled. The algorithm first sorts the scalars decreasing by the function f(Z_j) = I(X_{ti}; Z_j|Q = q) − I(X_{ti}; Z_j). The output is Z_{qi}, the set of variables from which a link should be added to X_{ti} under the class q.

INPUT: Z_j (candidates), q, M, X_{ti}
OUTPUT: Z_{qi}
Set Z_{qi} = ∅
Sort the Z_j so that f(Z_1) ≥ f(Z_2) ≥ ...
For j = 1, 2, ..., until f(Z_j) falls below a threshold:
  If Z_j satisfies all of the following three criteria:
    1) I(X_{ti}; Z_j|q) is larger than a threshold
    2) For each Z ∈ Z_{qi}, I(Z_j; Z|Q) < r I(Z_j; X_{ti}|Q)
    3) I(X_{ti}; Z_j) is smaller than a threshold
  then add Z_j to Z_{qi}, and break if |Z_{qi}| > M.

Figure 3: The improved pairwise heuristic: this algorithm chooses the dependency variables for the ith feature position of X_t and for class q.

The algorithm uses three criteria to eliminate candidate edges. The first ensures that the edge to Z_j is actually informative about X_{ti} in the context of q. The second is a redundancy check: it asks for an edge from a variable that has little information in common with the variables already added, as depicted in Figure 4. The degree of allowed redundancy is determined using 0 < r < 1. The third and discriminative criterion ensures the candidate variable does improve the model's scores in the general non-class-conditional case.

Figure 4: The redundancy check: the edge from Z_4 to X is being considered. It will only be added if I(Z_4; Z_i) for i < 4 is small, hopefully reducing the chance that a variable is added for which a class-conditional Markov blanket exists separating Z_4 from X and making Z_4 superfluous.

This algorithm is inexpensive, running in time O(|Q| |X| d²), where |Q| is the number of classes, |X| is the size of the observation vector, and d ≤ |X_{1:t}| is the maximum number of variables that are considered in the selection.
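A possible Python rendering of the Figure 3 heuristic is sketched below, assuming the pairwise mutual-information statistics have already been estimated (for example, with a plug-in estimator like the one sketched earlier). The threshold values, the redundancy factor r, and the example numbers are illustrative assumptions, and the cap of M edges is enforced here with len(Z_qi) >= M.

```python
def improved_pairwise(candidates, I_cond_q, I_marg, I_pair_cond_Q, I_x_cond_Q,
                      M, r=0.5, f_thresh=0.0, cond_thresh=0.01, marg_thresh=0.05):
    """Choose the dependency set Z_qi for one feature X_ti and one class q.

    I_cond_q[j]   : I(X_ti; Z_j | Q = q)
    I_marg[j]     : I(X_ti; Z_j)
    I_pair_cond_Q : {(j, k): I(Z_j; Z_k | Q)} with j < k   (redundancy check)
    I_x_cond_Q[j] : I(Z_j; X_ti | Q)
    The thresholds and redundancy factor r are illustrative assumptions.
    """
    f = lambda j: I_cond_q[j] - I_marg[j]       # sort function from Figure 3
    Z_qi = []
    for j in sorted(candidates, key=f, reverse=True):
        if f(j) < f_thresh:                     # stop once f falls below threshold
            break
        informative = I_cond_q[j] > cond_thresh                     # criterion 1
        non_redundant = all(I_pair_cond_Q[tuple(sorted((j, k)))]
                            < r * I_x_cond_Q[j] for k in Z_qi)      # criterion 2
        discriminative = I_marg[j] < marg_thresh                    # criterion 3
        if informative and non_redundant and discriminative:
            Z_qi.append(j)
            if len(Z_qi) >= M:                  # cap the number of added edges at M
                break
    return Z_qi

# Example with made-up pairwise statistics for four candidate variables:
stats = dict(I_cond_q={0: .30, 1: .25, 2: .20, 3: .05},
             I_marg={0: .01, 1: .20, 2: .02, 3: .01},
             I_x_cond_Q={0: .30, 1: .25, 2: .20, 3: .05},
             I_pair_cond_Q={(0, 1): .01, (0, 2): .25, (0, 3): .01,
                            (1, 2): .02, (1, 3): .01, (2, 3): .01})
print(improved_pairwise([0, 1, 2, 3], M=2, **stats))   # -> [0, 3] with these numbers
```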
6 Experimental Evaluation

In this section, DBMs are evaluated in the context of classifying speech utterances. All experiments are reported using the NYNEX Phonebook database [24]. Phonebook is a large-vocabulary "phonetically-rich isolated-word telephone-speech database." It contains a rich collection of vocabulary words including poly-syllabic words such as "exhaustion," "immobilizing," "sluggishness," and "overambitious," as well as monosyllabic words such as "awe," "biff," and "his."

The quantities I(X; Z|Q = q) are obtained using an initial baseline HMM-based system. The following general training procedure is used for all of the results reported in this section:
1) Train a bootstrap HMM-based system using EM.
2) Compute I(X; Z|Q = q) and I(X; Z) for all pairs within a 200ms range surrounding the current time.
3) Run the improved pairwise heuristic.
4) Train the resulting models again with EM.
5) Test the result.

The reported results all use a mixture of Gaussian linear regression models to implement the dependencies. In this case,

p(x | z(q), q) = Σ_m p(x | m, z(q), q) p(m|q)

where each underlying component is a Gaussian linear regressive process on z using a sparse dependency matrix B_{qm}:

p(x | m, z, q) = |2πΣ_{qm}|^{-1/2} exp( −(1/2) (x − B_{qm}z)' Σ_{qm}^{-1} (x − B_{qm}z) )

Therefore, p(x | m, z, q) is a Gaussian distribution with conditional mean B_{qm}z and covariance matrix Σ_{qm}. This implementation can, to some extent, simulate conditional variance by using mixtures, but it avoids many training complexities since closed-form EM update equations can be derived. Complete details of the experimental setup, training procedure, definitions of test and training sets, topology of the Markov chains, and so on are described in [2].
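For concreteness, the conditional mixture density above might be coded as in the short Python sketch below; the dimensions, mixture weights, regression matrices B_{qm}, and covariances Σ_{qm} are random placeholders (in the paper they are sparse and learned with EM).

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(3)
D, Dz, M = 4, 3, 2                      # feature dim, |z(q)|, mixture components
mix = np.array([0.6, 0.4])              # p(m | q) for one state q
B = [0.2 * rng.standard_normal((D, Dz)) for _ in range(M)]      # B_{qm}, placeholders
Sigma = [np.diag(rng.uniform(0.5, 1.5, D)) for _ in range(M)]   # Sigma_{qm}, placeholders

def cond_density(x, z):
    """p(x | z, q) = sum_m p(m|q) N(x; B_{qm} z, Sigma_{qm}) for one state q."""
    return sum(mix[m] * multivariate_normal.pdf(x, mean=B[m] @ z, cov=Sigma[m])
               for m in range(M))

x, z = rng.standard_normal(D), rng.standard_normal(Dz)
print("p(x | z(q), q) =", cond_density(x, z))
```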
|Vocab|   75      150     300     600     Params / SP
HMM       5.0%    7.0%    9.2%    13.3%   157k / 1
DBM       4.6%    6.5%    8.7%    12.3%   157k / 1
HMM       2.4%    3.5%    5.6%    7.7%    182k / 2
DBM       2.3%    3.4%    5.4%    7.5%    182k / 2
HMM       1.7%    2.8%    4.6%    6.2%    163k / 3
DBM       1.5%    2.6%    4.4%    6.0%    166k / 3
HMM       1.5%    2.7%    4.3%    5.8%    200k / 4
DBM       1.4%    2.6%    4.2%    5.6%    207k / 4

Table 1: Word-error rate results comparing HMMs with DBMs. |Vocab| is the vocabulary size (number of test classes), Params is the number of system parameters, and SP is the number of Markov states per phoneme.

The results, given in Table 1, show the error rate for different topologies of the Markov chain for each phone. The numbers are competitive with those reported using dynamic Bayesian networks on the same speech corpus and training/testing conditions [27], where the best achieved result was 2.7% with a 515k-parameter system on the 75-word vocabulary size case. The performance increases on average as more parameters are added to the system. In each case, however, the DBM outperforms an HMM with a comparable number of parameters and states per phone, and this is true across the different vocabulary sizes.

Case   Type   WER     Params / SP
1      CMI    32.0%   207k / 1
2      AR2    27.6%   207k / 1
3      AR1    20.9%   156k / 1
4      RAND   8.3%    207k / 1

Table 2: Performance of DBMs determined using alternate methods and evaluated on the 75-word vocabulary test case.

A second set of results is given in Table 2 and can be compared to those given in the 75-word column of Table 1. Case 1 shows the performance of a DBM created using Equation 2. This rule adds dependencies that increase the model scores but not the classification accuracy. In fact, the likelihood scores in this case were dramatically larger than before modification. As can be seen, however, the performance dramatically decreases, presumably because the models are not structurally discriminative.

Cases 2 and 3 show the performance when dependencies are added from the previous observation (the two previous for case 2) to the current observation, and the model is therefore not a multinet. The performance is also very poor, indicating again that relaxing the wrong conditional independence properties can dramatically decrease classification accuracy.

Case 4 shows the performance when a different random set of dependencies between observation elements is added for each state. Interestingly, case 4 is much better than the previous cases, suggesting that the most harmful and anti-discriminative dependencies have not been added. The performance, however, is still worse than the baseline HMM.

Several general points can be made from the two tables. Cases 1-3 indicate that dependencies that are added to a model structure to increase a (likelihood) score can cause a dramatic decrease in classification accuracy, even if the structures are augmented in a class-conditional way, as in case 1 above. Note that the likelihood scores for these models are dramatically higher both for the training and testing data, suggesting that overfitting is not the problem. The goal of many model selection methods [6, 22] is to choose a model that provides the best description of the data, but the above suggests that this can be inappropriate for classification. Admittedly, model selection procedures typically include complexity penalty terms (e.g., MDL, BIC, and so on). But these penalties do not select for discriminative structures.

Second, dependencies in a network should not be added just because they are missing.
Cases 2 and 3 add dependencies between adjacent observation vectors, an approach sometimes justified by noting that they are not directly represented by an HMM. But as the performance of these augmented models shows, adding missing dependencies can decrease classification performance.

Third, adding random dependencies does not produce performance as poor as in the previous cases, but neither is there any benefit. It is unlikely that choosing random dependencies, even if q-conditioned, will result in discriminative structure because the selection space is so large. The implications for structure learning methods that search over randomly chosen sets are clear: because of the large search space, it is unlikely that good sets of dependencies will be found in a reasonable amount of time. It seems crucial, therefore, to constrain the random search to those dependencies found to be useful in some way, as has been argued in the past [13].

Finally, as argued in [2], an HMM can approximate a distribution arbitrarily well given enough capacity, enough training data, and enough computation. The results in the tables support this claim, as increasing the number of parameters leads to improved accuracy. The performance improvement obtained by adding more hidden states is dramatic, but the additional discriminative DBM dependencies can provide further improvements.

The results show that the DBM achieves the same or better classification performance with the same parameters, thereby supporting the claim that they have achieved sparser, higher performing, but lower complexity networks.

7 Discussion

In this paper, a class of graphical models is considered that generalizes HMMs. Several methods to automatically learn structure were presented that have optimal properties either by maximizing the likelihood score or (the EAR measure) by maximizing the class posterior probability. A dependence selection heuristic, the improved pairwise algorithm, was introduced, and an implementation was tested using a medium-vocabulary speech corpus, showing that appreciable gains can be obtained when the dependencies are chosen appropriately.

While this paper does not address the problem of learning the hidden structure in networks and uses only a simple Markov chain to represent dynamics [4, 12], for speech it is often sufficient to consider a Markov chain as a probabilistic sequencer over strings of phonetic units. The multinets, which are conditioned on each of these sequences, determine local structure. Ultimately, it is planned to use and learn more complex models of dynamic behavior for those classes of signals that can benefit from it. It is also planned to use the EAR measure to determine more general discriminatively structured Bayesian multinets.

Acknowledgements

This work has benefited from discussions with Nir Friedman, Steven Greenberg, Michael Jordan, Katrin Kirchhoff, Kevin Murphy, Stuart Russell, and Geoff Zweig. The author would like to acknowledge the International Computer Science Institute, at which many of the computational experiments were performed, and thank the three anonymous reviewers for their useful comments.

References

[1] L.R. Bahl, P.F. Brown, P.V. de Souza, and R.L. Mercer. Maximum mutual information estimation of HMM parameters for speech recognition. In Proc. IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing, pages 49-52, Tokyo, Japan, December 1986.
[2] J. Bilmes. Natural Statistical Models for Automatic Speech Recognition. PhD thesis, U.C. Berkeley, Dept. of EECS, CS Division, 1999.
[3] J.A. Bilmes. Buried Markov models for speech recognition. In Proc. IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing, Phoenix, AZ, March 1999.
[4] X. Boyen, N. Friedman, and D. Koller. Discovering the hidden structure of complex dynamic systems. 15th Conf. on Uncertainty in Artificial Intelligence, 1999.
[5] W. Buntine. A guide to the literature on learning probabilistic networks from data. IEEE Trans. on Knowledge and Data Engineering, 8:195-210, 1994.
[6] K.P. Burnham and D.R. Anderson. Model Selection and Inference: A Practical Information-Theoretic Approach. Springer-Verlag, 1998.
[7] C.K. Chow and C.N. Liu. Approximating discrete probability distributions with dependence trees. IEEE Trans. on Info. Theory, 14, 1968.
[8] T.M. Cover and J.A. Thomas. Elements of Information Theory. Wiley, 1991.
[9] Y. Ephraim, A. Dembo, and L. Rabiner. A minimum discrimination information approach for HMM. IEEE Trans. Info. Theory, 35(5):1001-1013, September 1989.
[10] N. Friedman. The Bayesian structural EM algorithm. 14th Conf. on Uncertainty in Artificial Intelligence, 1998.
[11] N. Friedman, D. Geiger, and M. Goldszmidt. Bayesian network classifiers. Machine Learning, 29:131-163, 1997.
[12] N. Friedman, K. Murphy, and S. Russell. Learning the structure of dynamic probabilistic networks. 14th Conf. on Uncertainty in Artificial Intelligence, 1998.
[13] N. Friedman, I. Nachman, and D. Peer. Learning Bayesian network structure from massive datasets: The "sparse candidate" algorithm. 15th Conf. on Uncertainty in Artificial Intelligence, 1999.
[14] D. Geiger and D. Heckerman. Knowledge representation and inference in similarity networks and Bayesian multinets. Artificial Intelligence, 82:45-74, 1996.
[15] D. Heckerman. Probabilistic Similarity Networks. MIT Press, 1991.
[16] D. Heckerman. A tutorial on learning with Bayesian networks. Technical Report MSR-TR-95-06, Microsoft, 1995.
[17] D. Heckerman, D. Geiger, and D.M. Chickering. Learning Bayesian networks: The combination of knowledge and statistical data. Technical Report MSR-TR-94-09, Microsoft, 1994.
[18] B.-H. Juang, W. Chou, and C.-H. Lee. Minimum classification error rate methods for speech recognition. IEEE Trans. on Speech and Audio Signal Processing, 5(3):257-265, May 1997.
[19] S. Katagiri, B.-H. Juang, and C.-H. Lee. Pattern recognition using a family of design algorithms based upon the generalized probabilistic descent method. Proceedings of the IEEE, 1998.
[20] P. Krause. Learning probabilistic networks. Philips Research Labs Tech. Report, 1998.
[21] S.L. Lauritzen. Graphical Models. Oxford Science Publications, 1996.
[22] H. Linhart and W. Zucchini. Model Selection. Wiley, 1986.
[23] M. Meila. Learning with Mixtures of Trees. PhD thesis, MIT, 1999.
[24] J. Pitrelli, C. Fong, S.H. Wong, J.R. Spitz, and H.C. Leung. PhoneBook: A phonetically-rich isolated-word telephone-speech database. In Proc. IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing, 1995.
[25] M. Sahami. Learning limited dependence Bayesian classifiers. In Proc. 2nd Int. Conf. on Knowledge Discovery and Data Mining, 1996.
[26] V. Vapnik. Statistical Learning Theory. Wiley, 1998.
[27] G. Zweig. Speech Recognition with Dynamic Bayesian Networks. PhD thesis, U.C. Berkeley, 1998.