Speech recognition has always been an accepted and expected feature of futuristic computer systems. From HAL in 2001: A Space Odyssey ("Open the pod bay doors, HAL." "I'm sorry, Dave. I'm afraid I can't do that.") to Lieutenant Commander Data in Star Trek: The Next Generation, authors who write of the future anticipate the day when computers can understand, interpret and follow the spoken words of humans.
1999
The Department of Information Science is one of six departments that make up the Division of Commerce at the University of Otago. The department offers courses of study leading to a major in Information Science within the BCom, BA and BSc degrees. In addition to undergraduate teaching, the department is also strongly involved in postgraduate research programmes leading to MCom, MA, MSc and PhD degrees. Research projects in spatial information processing, connectionist-based information systems, software engineering and software development, information engineering and databases, software metrics, distributed information systems, multimedia information systems and information systems security are particularly well supported.
Foundations and Trends® in Signal Processing, 2007
Hidden Markov Models (HMMs) provide a simple and effective framework for modelling time-varying spectral vector sequences. As a consequence, almost all present day large vocabulary continuous speech recognition (LVCSR) systems are based on HMMs.
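The core computation behind HMM-based recognition is evaluating the likelihood of an observation sequence, which the forward algorithm does efficiently by summing over all state paths. A minimal sketch for a discrete-observation toy model (all numbers are hypothetical, for illustration only):

```python
import numpy as np

def forward(pi, A, B, obs):
    """Forward algorithm: likelihood P(O | model) of an observation
    sequence under a discrete-observation HMM with initial probabilities
    pi, transition matrix A, and emission matrix B (states x symbols)."""
    alpha = pi * B[:, obs[0]]          # alpha_1(i) = pi_i * b_i(o_1)
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]  # alpha_t = (alpha_{t-1} A) .* b(o_t)
    return alpha.sum()

# Toy 2-state, 2-symbol model (hypothetical parameters)
pi = np.array([0.6, 0.4])
A  = np.array([[0.7, 0.3],
               [0.4, 0.6]])
B  = np.array([[0.5, 0.5],
               [0.1, 0.9]])
print(forward(pi, A, B, [0, 1, 1]))
```

In a real LVCSR system the discrete emission matrix would be replaced by Gaussian-mixture (or neural-network) state likelihoods over acoustic feature vectors, but the recursion is the same.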
1998
The Hidden Markov Model (HMM) is one of the most successful modeling approaches for acoustic events in speech recognition, and more recently it has proven useful for several problems in biological sequence analysis. Although the HMM is good at capturing the temporal nature of processes such as speech, it has a very limited capacity for recognizing complex patterns involving more than first-order dependencies in the observed data sequences. This is due to the first-order state process and the assumption of state-conditional independence between observations. Artificial Neural Networks (NNs) are almost the opposite: they cannot model dynamic, temporally extended phenomena very well, but are good at static classification and regression tasks. Combining the two frameworks in a sensible way can therefore lead to a more powerful model with better classification abilities.

The overall aim of this work has been to develop a probabilistic hybrid of hidden Markov models and neural networks and to evaluate this model on a number of standard speech recognition tasks. This has resulted in a hybrid called a Hidden Neural Network (HNN), in which the HMM emission and transition probabilities are replaced by the outputs of state-specific neural networks. The HNN framework is characterized by:

Discriminative training: HMMs are commonly trained by the Maximum Likelihood (ML) criterion to model within-class data distributions. In contrast, the HNN is trained by the Conditional Maximum Likelihood (CML) criterion to discriminate between different classes. CML training is in this work implemented by a gradient descent algorithm in which the neural networks are updated by backpropagation of errors calculated by a modified version of the forward-backward algorithm for HMMs.

Global normalization: A valid probabilistic interpretation of the HNN is ensured by normalizing the model globally at the sequence level during CML training.
This is different from the local normalization of probabilities enforced at the state level in standard HMMs.

Flexibility: The global normalization makes the HNN architecture very flexible. Any combination of neural-network-estimated parameters and standard HMM parameters can be used. Furthermore, the global normalization of the HNN gives a large freedom in selecting the architecture and output functions of the neural networks.

Postscript files of this thesis and all the above listed papers can be downloaded from the WWW-server at the Section for Digital Signal Processing. The papers relevant to the work described in this thesis are furthermore included in appendices C-F of this thesis.

Acknowledgments

At this point I would like to thank Steffen Duus Hansen and Anders Krogh for their supervision of my Ph.D. project. I especially wish to express my gratitude to Anders Krogh for the guidance, encouragement and friendship that he extended to me during our almost five years of collaboration. Even during his stay at the Sanger Centre in Cambridge he managed to guide me through the project by always responding to my emails and telephone calls and by inviting me to visit him. Anders' scientific integrity, great intuition, ambition and pleasant company have earned him my respect. Without his encouragement and optimistic faith in this work it might never have come to an end. The staff and Ph.D. students at the Section for Digital Signal Processing are thanked for creating a very pleasant research environment and for the many joyful moments at the office and during conference trips. Thanks also to Mogens Dyrdahl and everybody else involved in maintaining the excellent computing facilities which were crucial for carrying out my research. The Center for Biological Sequence Analysis is also acknowledged for providing CPU time which made some of the computationally intensive evaluations possible. Similarly, Peter Toft is thanked for teaching me to master the force of Linux.
I sincerely wish to express my gratitude to Steve Renals for inviting me to work at the Department of Computer Science, University of Sheffield from February to July 1997. It was a very pleasant and rewarding stay. The Ph.D. students and staff at the Department of Computer Science are acknowledged for their great hospitality and for creating a pleasant research atmosphere. I'm especially grateful to Gethin Williams for the many discussions on hybrid speech recognizers and for proofreading large parts of this thesis. I'm indebted to Gethin for his many valuable comments and suggestions for improving this manuscript. Morten With Pedersen and Kirsten Pedersen are also acknowledged for their comments and suggestions on this manuscript. Morten is furthermore thanked for the many fruitful discussions we've had and for his pleasant company at the office during the years. The speech group, and in particular Christophe Ris, at the Circuit Theory and Signal Processing Lab (TCTS), Faculté Polytechnique de Mons is acknowledged for providing the data necessary to carry out the experiments presented in chapter 9 of this thesis. The Technical University of Denmark is acknowledged for allowing me the opportunity of doing this work. Otto Mønsted's foundation and Valdemar Selmer Trane og Hustru Elisa Trane's foundation are acknowledged for financial support for travel activities. Last but not least I thank my family and friends for their support, love and care during the Ph.D. study. A special heartfelt thanks goes to my wife and little daughter, who helped me maintain my sanity during the study, as I felt myself drowning in ambitions. Without their support this work would not have been possible.
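The global-normalization idea in the HNN abstract can be illustrated with a toy model: let per-state emission and transition weights be arbitrary positive numbers (as unconstrained network outputs would be), and normalize once over all state paths rather than locally at each state. A minimal sketch with hypothetical weights and two states:

```python
import numpy as np
from itertools import product

# Unnormalized weights: rows need NOT sum to one, unlike standard HMM
# probabilities. In an HNN these would be state-specific network outputs.
w_trans = np.array([[2.0, 0.5],
                    [1.0, 3.0]])
w_emit  = np.array([[0.9, 0.2],   # state 0 weights for symbols 0, 1
                    [0.3, 1.5]])  # state 1 weights for symbols 0, 1

def path_score(states, obs):
    """Product of emission and transition weights along one state path."""
    s = w_emit[states[0], obs[0]]
    for t in range(1, len(obs)):
        s *= w_trans[states[t - 1], states[t]] * w_emit[states[t], obs[t]]
    return s

def path_posterior(states, obs):
    """Globally normalized path probability: one partition function Z,
    summed over ALL paths, turns the raw scores into a valid distribution."""
    Z = sum(path_score(p, obs) for p in product(range(2), repeat=len(obs)))
    return path_score(states, obs) / Z
```

Because the normalization happens once at the sequence level, any mix of network-estimated and standard HMM parameters remains probabilistically valid, which is the flexibility the abstract refers to.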
IEICE Transactions on Information and Systems, 2006
In recent years, the number of studies investigating new directions in speech modeling that go beyond the conventional HMM has increased considerably. One promising approach is to use Bayesian Networks (BN) as speech models. Full recognition systems based on Dynamic BNs as well as acoustic models using BNs have been proposed lately. Our group at ATR has been developing a hybrid HMM/BN model, which is an HMM where the state probability distribution is modeled by a BN instead of the commonly used mixtures of Gaussian functions. In this paper, we describe how to use hybrid HMM/BN acoustic models, especially emphasizing some design and implementation issues. The most essential part of HMM/BN model building is the choice of the state BN topology. As it is chosen manually, there are some factors that should be considered in this process. They include, but are not limited to, the type of data, the task and the available additional information. When context-dependent models are used, the state-level structure can be obtained by traditional methods. The HMM/BN parameter learning is based on the Viterbi training paradigm and consists of two alternating steps: BN training and HMM transition updates. For recognition, in some cases, BN inference is computationally equivalent to a mixture of Gaussians, which allows the HMM/BN model to be used in existing decoders without any modification. We present two examples of HMM/BN model applications in speech recognition systems. Evaluations under various conditions and for different tasks showed that the HMM/BN model gives consistently better performance than the conventional HMM.
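To illustrate the remark that state-level BN inference can be computationally equivalent to a mixture of Gaussians: if a state's BN contains one discrete latent variable C with a prior, and the observation is Gaussian conditioned on C, then marginalizing C yields exactly a Gaussian-mixture likelihood. A sketch with toy parameters (not taken from the paper):

```python
import math

def gaussian_pdf(x, mu, var):
    """Univariate Gaussian density."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def bn_state_likelihood(x, priors, mus, variances):
    """Emission likelihood of a state whose BN is: discrete latent C -> obs.
    Summing out C gives sum_c P(C=c) * N(x; mu_c, var_c) -- a plain
    Gaussian mixture, so a standard GMM decoder can evaluate it unchanged."""
    return sum(w * gaussian_pdf(x, m, v)
               for w, m, v in zip(priors, mus, variances))
```

This is why such HMM/BN models can plug into existing decoders without modification: the decoder only ever sees a mixture-of-Gaussians emission score.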
Most modern automatic speech recognition systems make use of acoustic models based on hidden Markov models. To obtain reasonable recognition performance within a large vocabulary framework, the acoustic models usually include a pronunciation model, together with complex parameter tying schemes. In many cases the pronunciation model operates on a phoneme level and is derived independently of the underlying models. In contrast, this work is aimed at improving pronunciation modelling on a sub-phone level in a combined framework. The modelling of pronunciation variation is assumed to be of special importance for recognition of spontaneous speech.
The idea of giving computers the ability to process human language is as old as the idea of computers themselves. This book is about the implementation and implications of that exciting idea. We introduce a vibrant interdisciplinary field with many names corresponding to its many facets, names like speech and language processing, human language technology, natural language processing, computational linguistics, and speech recognition and synthesis. The goal of this new field is to get computers to perform useful tasks involving human language, tasks like enabling human-machine communication, improving human-human communication, or simply doing useful processing of text or speech.
2017
Natural language processing enables computers and machines to understand and speak human languages. Speech recognition is a process in which a computer understands human language and processes further instructions according to what it recognizes. Human languages vary, so the machine or computer needs entirely different algorithms, as human languages differ in various aspects such as sounds, phonemes, words and meanings. Understanding human language is a challenging job, and for this purpose Hidden Markov Models are commonly used, as they have shown promising results in understanding human language. A survey of various studies employing Hidden Markov Models is presented to highlight the importance of HMMs in the process of speech recognition.
Inter-speaker variability, one of the problems faced in speech recognition systems, causes performance degradation when recognizing varied speech spoken by different speakers. The Vocal Tract Length Normalization (VTLN) method is known to improve recognition performance by compensating the speech signal using a specific warping factor. Experiments are conducted using the TIMIT speech corpus and the Hidden Markov Model Toolkit (HTK), together with an implementation of the VTLN method, to show improvement in speaker-independent phoneme recognition. The results show better recognition performance using a Bigram Language Model compared to a Unigram Language Model, with a Phoneme Error Rate (PER) of 28.8% as the best recognition performance for the Bigram and a PER of 38.09% for the Unigram. The best warp factor used for normalization in this experiment is 1.40.
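For concreteness, one common form of VTLN is a piecewise-linear warp of the frequency axis: frequencies are scaled by the warp factor up to a knee, then interpolated so that the warped axis still ends at the band edge. The exact scheme varies between toolkits, so the knee placement below is an assumption for illustration (TIMIT's 16 kHz sampling gives f_max = 8000 Hz):

```python
def vtln_warp(f, alpha, f_max=8000.0, knee_ratio=0.875):
    """Piecewise-linear VTLN warp of frequency f by factor alpha.
    Below the knee the axis is scaled by alpha; above it, a straight
    segment maps the remainder onto [alpha*knee, f_max], so that
    vtln_warp(f_max) == f_max. The knee placement here is one common
    choice, not the only one."""
    knee = knee_ratio * f_max * min(1.0, 1.0 / alpha)
    if f <= knee:
        return alpha * f
    # Linear segment from (knee, alpha*knee) to (f_max, f_max).
    slope = (f_max - alpha * knee) / (f_max - knee)
    return alpha * knee + slope * (f - knee)
```

With the paper's best warp factor alpha = 1.40, a 1000 Hz component maps to 1400 Hz, while the band edge stays fixed at 8000 Hz; alpha = 1.0 leaves the axis unchanged.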
Most of us frequently use speech to communicate with other people. Most of us will also communicate regularly with a computer, but rarely by means of speech. The computer input usually comes from a keyboard or a mouse, and the output goes to the monitor or a printer. Still, in many cases the communication with a computer would be facilitated if speech could be used, if only because most people speak faster than they type. A necessary requirement for this is that the computer is able to recognise our speech: automatic speech recognition (ASR).
Cognitive Science, 2005
Although researchers studying human speech recognition and automatic speech recognition share a common interest in how information processing systems (human or machine) recognize spoken language, there is little communication between the two disciplines. We suggest that this lack of communication follows largely from the fact that research in these related fields has focused on the mechanics of how speech can be recognized. In Marr's terms, emphasis has been on the algorithmic and implementational levels rather than on the computational level. In the present paper, we provide a computational-level analysis of the task of speech recognition which reveals the close parallels between research concerned with human and automatic speech recognition.
International Journal of Man-Machine Studies, 1982
This work describes the lexical, syntactic and semantic processing of a system for recognizing meaningful sentences spoken in Italian. A Transition Network grammar models, in an integrated way, the syntactic and semantic knowledge sources of the robot command protocol used in the experiments.
The demand for intelligent machines that can recognize spoken speech and respond in a natural voice has been driving speech research. Speech recognition systems are challenging because of the nature of language: there are no clear boundaries between words, the beginning and end of each phonetic unit are influenced by neighbouring words, and speech varies between speakers in many ways: male or female, young or senior, loud or low, read or spontaneous, emotional or formal, fast or slow speaking rate; the speech signal can also be affected by environmental noise. To cope with these difficulties, a data-driven statistical approach based on large quantities of spoken data is used. The performance of speech recognition systems is still far worse than that of humans, which is partly caused by the use of poor statistical models. In this paper, a comprehensive study of statistical methods for speech and language processing is presented. The role of signal processing in creating a reliable feature set for the recognizer, and the role of statistical methods in enabling the recognizer to recognize the words of the spoken input sentence as well as the meaning associated with the recognized word sequence, are presented.
International Journal of Machine Learning and Computing, 2018
[Co-authored with Rene J. Perez, Chloe A. Kimble, and Jin Wang (Valdosta State)] We use speech recognition algorithms daily with our phones, computers, home assistants, and more. Each of these systems uses algorithms to convert sound waves into useful data for processing, which is then interpreted by the machine. Some of these machines use older algorithms, while newer systems use neural networks to interpret this data. These systems then produce an output in the form of text. A large amount of training data is needed to make these algorithms and neural networks function effectively.
INTERNATIONAL JOURNAL ON INTEGRATED EDUCATION, 2019
Speech recognition can be regarded as a subfield of computational linguistics and computer science. As an interdisciplinary concept, it has given rise to new technologies and methodologies. Its aim is to recognize and translate spoken words, that is, to achieve the capability of understanding and transcribing what has been said. In recent times the field has received positive attention through deep learning approaches to voice recognition, and such developments show that there is strong market demand for applications of voice recognition. The deployment of speech recognition systems, as the evidence of their analysis methods shows, can be helpful in shaping each individual's future. The computer plays an important role in this process, as all the translated words can be rendered as text.