This presentation discusses enhancements to Gaussian Mixture Models (GMMs) for speech recognition, with a focus on handling data complexities and improving performance on unseen data. Key methodologies include robust model training, adjustments to covariance matrices, and the application of Monte Carlo simulations to optimize estimates. Findings, based on tests with real speech data, suggest effective strategies for controlling model complexity and improving performance on unseen data.
International Conference on Acoustics, Speech, and Signal Processing, 1996
Two discriminative techniques are described (and evaluated) for estimating the parameters of the Gaussians in a large vocabulary speech-recognition system. The first technique is based on using a modification of the maximum mutual information (MMI) objective function, and appears to provide no improvement over standard ML estimation. The second technique is based on a heuristic correction of the Gaussian parameters,
IEICE Transactions on Information and Systems, 2006
In recent years, the number of studies investigating new directions in speech modeling that go beyond the conventional HMM has increased considerably. One promising approach is to use Bayesian Networks (BN) as speech models. Full recognition systems based on Dynamic BN as well as acoustic models using BN have been proposed lately. Our group at ATR has been developing a hybrid HMM/BN model, which is an HMM where the state probability distribution is modeled by a BN instead of the commonly used mixtures of Gaussian functions. In this paper, we describe how to use the hybrid HMM/BN acoustic models, with particular emphasis on design and implementation issues. The most essential part of building an HMM/BN model is the choice of the state BN topology. As it is chosen manually, there are some factors that should be considered in this process. They include, but are not limited to, the type of data, the task and the available additional information. When context-dependent models are used, the state-level structure can be obtained by traditional methods. The HMM/BN parameter learning is based on the Viterbi training paradigm and consists of two alternating steps: BN training and HMM transition updates. For recognition, in some cases, BN inference is computationally equivalent to a mixture of Gaussians, which allows the HMM/BN model to be used in existing decoders without any modification. We present two examples of HMM/BN model applications in speech recognition systems. Evaluations under various conditions and for different tasks showed that the HMM/BN model gives consistently better performance than the conventional HMM.
Proceedings of “Verificatori Biometrici” Workshop, organized by Technical University of Cluj-Napoca, Universitas Napocensis Babes-Bolyai, Universitas Medicinae et Farmaciae Napocensis and CNCSIS, Cluj-Napoca, Romania, May
In this paper the GMM speaker model was analyzed from the viewpoint of its phonetic content. Phoneme distribution among clusters represented by Gaussians was studied. Special speaker models were also created using only a part of the training data, in order to identify the most valuable part of speech, for the purpose of speaker identification. Key words: Speaker Identification, Gaussian Mixture Models, Phonetic Analysis
2010
This technical report contains the details of an acoustic modeling approach based on subspace adaptation of a shared Gaussian Mixture Model. This refers to adaptation to a particular speech state; it is not a speaker adaptation technique, although we do later introduce a speaker adaptation technique that is tied to this particular framework. Our model is a large shared GMM whose parameters vary in a subspace of relatively low dimension (e.g. 50), thus each state is described by a vector of low dimension which controls the GMM's means and mixture weights in a manner determined by globally shared parameters. In addition we generalize to having each speech state be a mixture of substates, each with a different vector. Only the mathematical details are provided here; experimental results are being published separately.
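The abstract above does not spell out how the low-dimensional state vector controls the GMM. As a minimal sketch, assuming a linear map for the means and a softmax (log-linear) map for the mixture weights, one state's parameters could be derived as follows. The names `state_gmm`, `M` and `w` are hypothetical and chosen only for illustration.

```python
import math


def state_gmm(v, M, w):
    """Derive one state's GMM means and weights from its low-dimensional vector v.

    M: list of per-Gaussian projection matrices, each D x S (nested lists).
    w: list of per-Gaussian weight-projection vectors, each of length S.
    Both M and w are shared by all states; only v is state-specific.
    """
    # Assumed form: the mean of Gaussian i is the matrix-vector product M[i] v.
    means = [
        [sum(m_ds * v_s for m_ds, v_s in zip(row, v)) for row in M_i]
        for M_i in M
    ]
    # Assumed form: mixture weights are a softmax over the scores w[i] . v.
    scores = [sum(w_s * v_s for w_s, v_s in zip(w_i, v)) for w_i in w]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    return means, weights
```

With this layout the number of state-specific parameters is the length of `v`, regardless of how many Gaussians the shared model contains.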
One of the most successful models for speech recognition has been the HMM with a mixture of Gaussians in the states to generate/capture observations. In this work we show how the addition of a parameter to model higher order moment statistics, such as the kurtosis, can provide improvements to the system. The distributions in which this degree of freedom is integrated are the generalized Gaussians. A method is shown for estimating the parameters of these distributions even if they are embedded in an HMM or a mixture of distributions. Some experimental results are obtained with this method and compared to baseline systems with full and diagonal covariance matrices.
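To make the extra degree of freedom concrete, here is a minimal sketch of the standard univariate generalized Gaussian density and its kurtosis as a function of the shape parameter. The parameter names `alpha` (scale) and `beta` (shape) are assumptions for illustration; the abstract does not state the parameterisation the authors use.

```python
import math


def generalized_gaussian_pdf(x, mu, alpha, beta):
    """Density beta / (2 alpha Gamma(1/beta)) * exp(-(|x - mu| / alpha) ** beta)."""
    norm = beta / (2.0 * alpha * math.gamma(1.0 / beta))
    return norm * math.exp(-((abs(x - mu) / alpha) ** beta))


def kurtosis(beta):
    """Kurtosis depends only on the shape: Gamma(5/b) Gamma(1/b) / Gamma(3/b)^2."""
    return (
        math.gamma(5.0 / beta) * math.gamma(1.0 / beta)
        / math.gamma(3.0 / beta) ** 2
    )
```

Setting `beta = 2` recovers the ordinary Gaussian (kurtosis 3), while `beta = 1` gives the heavier-tailed Laplace distribution (kurtosis 6).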
Lecture Notes in Computer Science, 2013
The estimation of the parameters of a multivariate Gaussian Mixture Model is usually based on a criterion (e.g. Maximum Likelihood) that is focused mostly on training data. Therefore, testing data, which were not seen during the training procedure, may cause problems. Moreover, numerical instabilities can occur (e.g. for low-occupied Gaussians, especially when working with full covariance matrices in high-dimensional spaces). Another question concerns the number of Gaussians to be trained for a specific data set. The approach proposed in this paper can handle all these issues. It is based on the assumption that the training and testing data were generated from the same source distribution. The key part of the approach is to use a criterion based on the source distribution rather than on the training data itself. It is shown how to modify an estimation procedure in order to fit the source distribution better (despite the fact that it is unknown), and subsequently a new estimation algorithm for diagonal as well as full covariance matrices is derived and tested.
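The instability mentioned above is easy to reproduce: a Gaussian that receives a single training point gets a maximum-likelihood variance of exactly zero. The sketch below shows the plain ML estimate for a diagonal covariance together with a variance floor, a common workaround. It is not the source-distribution criterion proposed in the paper, and `var_floor` is a hypothetical parameter.

```python
def ml_diag_gaussian(points, var_floor=1e-3):
    """ML mean and diagonal variance of a point set, with a variance floor.

    Without the floor, a component fitted to one point (or to points that agree
    in some dimension) has zero variance and an unbounded log-likelihood.
    """
    n = len(points)
    dim = len(points[0])
    mean = [sum(p[d] for p in points) / n for d in range(dim)]
    var = [sum((p[d] - mean[d]) ** 2 for p in points) / n for d in range(dim)]
    # Assumed workaround: clamp each variance from below.
    floored = [max(v, var_floor) for v in var]
    return mean, floored
```

A floor keeps the model usable but is chosen by hand, which is the kind of empirical setting the abstract aims to avoid.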
IEEE Transactions on Audio, Speech and Language Processing, 2000
In this paper we study discriminative training of acoustic models for speech recognition under two criteria: maximum mutual information (MMI) and a novel "error weighted" training technique. We present a proof that the standard MMI training technique is valid for a very general class of acoustic models with any kind of parameter tying. We report experimental results for subspace constrained Gaussian mixture models (SCGMMs), where the exponential model weights of all Gaussians are required to belong to a common "tied" subspace, as well as for Subspace Precision and Mean (SPAM) models which impose separate subspace constraints on the precision matrices (i.e. inverse covariance matrices) and means. It has been shown previously that SCGMMs and SPAM models generalize and yield significant error rate improvements over previously considered model classes such as diagonal models, models with semi-tied covariances, and EMLLT (extended maximum likelihood linear transformation) models. We show here that MMI and error weighted training each individually result in over 20% relative reduction in word error rate on a digit task over maximum likelihood (ML) training. We also show that a gain of as much as 28% relative can be achieved by combining these two discriminative estimation techniques.
2014 4th Joint Workshop on Hands-free Speech Communication and Microphone Arrays (HSCMA), 2014
This paper investigates the recently proposed Stranded Gaussian Mixture acoustic Model (SGMM) for Automatic Speech Recognition (ASR). This model extends the conventional hidden Markov model (HMM-GMM) by explicitly introducing dependencies between components of the observation Gaussian mixture densities. The main objective of the paper is to study experimentally how useful SGMM can be for dealing with data that contains different sources of acoustic variability. The first sources of variability studied are age and gender in a quiet environment (TIdigits task including child speech). Second, SGMM modeling is applied to data produced by different speakers and corrupted by non-stationary noise (CHiME 2013 challenge data). Finally, SGMM is applied to the same noisy data, but after performing speech enhancement (i.e., the remaining variability mostly comes from residual noise and different speakers). Although SGMM was originally proposed for robust speech recognition of noisy data, this work found that the model is more efficient for handling speaker variability in a quiet environment.
Interspeech 2007, 2007
A Gaussian mixture optimization method is explored using cross-validation likelihood as an objective function instead of the conventional training set likelihood. The optimization is based on reducing the number of mixture components by selecting and merging a pair of Gaussians step by step based on the objective function, so as to remove redundant components and improve the generality of the model. Cross-validation likelihood is more appropriate for avoiding over-fitting than the conventional likelihood and can be efficiently computed using sufficient statistics. It results in a better Gaussian pair selection and provides a termination criterion that does not rely on empirical thresholds. Large-vocabulary speech recognition experiments on oral presentations show that the cross-validation method gives a smaller word error rate with an automatically determined model size than a baseline training procedure that does not perform the optimization.
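The select-and-merge step can be sketched for one-dimensional components as follows. This is a simplified illustration: it scores each candidate merge by the log-likelihood of a single held-out set, whereas the method described above uses cross-validation likelihood computed from sufficient statistics. The function names are hypothetical.

```python
import math


def merge(c1, c2):
    """Moment-matched merge of two 1-D components given as (weight, mean, variance)."""
    w1, m1, v1 = c1
    w2, m2, v2 = c2
    w = w1 + w2
    m = (w1 * m1 + w2 * m2) / w
    # The merged variance keeps each component's spread plus its offset from m.
    v = (w1 * (v1 + (m1 - m) ** 2) + w2 * (v2 + (m2 - m) ** 2)) / w
    return (w, m, v)


def log_likelihood(gmm, data):
    """Total log-likelihood of the data under a 1-D GMM."""
    total = 0.0
    for x in data:
        p = sum(
            w * math.exp(-0.5 * (x - m) ** 2 / v) / math.sqrt(2.0 * math.pi * v)
            for w, m, v in gmm
        )
        total += math.log(p)
    return total


def best_merge(gmm, held_out):
    """Try every pair and return (score, merged_gmm) with the highest held-out score."""
    best = None
    for i in range(len(gmm)):
        for j in range(i + 1, len(gmm)):
            rest = [c for k, c in enumerate(gmm) if k not in (i, j)]
            candidate = rest + [merge(gmm[i], gmm[j])]
            score = log_likelihood(candidate, held_out)
            if best is None or score > best[0]:
                best = (score, candidate)
    return best
```

Repeating `best_merge` until the held-out score stops improving gives a stopping rule that needs no hand-tuned threshold, which mirrors the termination criterion the abstract describes.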
2006
We develop a framework for large margin classification by Gaussian mixture models (GMMs). Large margin GMMs have many parallels to support vector machines (SVMs) but use ellipsoids to model classes instead of half-spaces. Model parameters are trained discriminatively to maximize the margin of correct classification, as measured in terms of Mahalanobis distances. The required optimization is convex over the model's parameter space of positive semidefinite matrices and can be performed efficiently.
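A minimal sketch of the decision rule and of a hinge-style margin penalty, assuming one ellipsoid per class described by a mean and a precision matrix, and assuming a unit margin. The abstract does not give the exact objective, so `margin_violation` and its `margin` parameter are illustrative only.

```python
def mahalanobis_sq(x, mean, precision):
    """Squared Mahalanobis distance (x - mean)^T P (x - mean)."""
    diff = [a - b for a, b in zip(x, mean)]
    n = len(diff)
    return sum(
        diff[i] * precision[i][j] * diff[j] for i in range(n) for j in range(n)
    )


def classify(x, classes):
    """Pick the class whose ellipsoid centre is nearest in Mahalanobis distance.

    classes maps a label to a (mean, precision) pair.
    """
    return min(classes, key=lambda label: mahalanobis_sq(x, *classes[label]))


def margin_violation(x, label, classes, margin=1.0):
    """Hinge penalty: every wrong class should be at least `margin` farther away."""
    own = mahalanobis_sq(x, *classes[label])
    return sum(
        max(0.0, margin + own - mahalanobis_sq(x, *classes[c]))
        for c in classes
        if c != label
    )
```

The penalty is zero when the example is classified correctly with room to spare, and it grows as the example moves toward a competing class.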