2015, Jurnal Teknologi
In this paper, we propose a post-decoding system combination approach for automatically transcribing Malay broadcast news. The approach combines the hypotheses produced by parallel automatic speech recognition (ASR) systems, each using a different language model: one a generic-domain model and the other a domain-specific model. The main idea is to exploit the complementary knowledge of the different ASR systems to improve the decoding result. The approach uses the language score and time information to produce a 1-best lattice, and then rescores that lattice to obtain the most likely word sequence as the final output. The proposed approach was compared with a conventional combination approach, recognizer output voting error reduction (ROVER). Our approach improved the word error rate (WER) from 33.9% to 30.6%, an average relative WER improvement of 9.74%, outperforming the conventional ROVER approach.
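The word error rate reported throughout these papers is the word-level Levenshtein (edit) distance between the reference and the hypothesis, divided by the reference length. A minimal sketch of that metric (the function name and string-based interface are illustrative, not from any of the systems described):

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + insertions + deletions) / reference length,
    computed with a standard dynamic-programming edit-distance table."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i          # delete all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j          # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)
```

For example, one substitution in a three-word reference gives a WER of 1/3, and the paper's relative improvement follows as (33.9 − 30.6) / 33.9.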
Large speech and text corpora are crucial to the development of a state-of-the-art speech recognition system. This paper reports on the construction and evaluation of the first Thai broadcast news speech and text corpora. Specifications and conventions used in the transcription process are described in the paper. The speech corpus contains about 17 hours of speech data, while the text corpus was transcribed from around 35 hours of television broadcast news. The characteristics of the corpus were analyzed and are shown in the paper. The speech corpus was split according to the evaluation focus conditions used in the DARPA Hub-4 evaluation. An 18k-word Thai speech recognition system was set up to test with this speech corpus as a preliminary experiment. Acoustic model adaptations were performed to improve the system performance. The best system yielded a word error rate of about 20% for clean and planned speech, and below 30% for the overall condition.
We describe several new research directions we investigated toward the development of our broadcast news transcription system for the 1998 DARPA H4 evaluations. Our goal was to develop significantly faster and smaller speech recognition systems without degrading the word error rate of our 1997 system. We did this through significant algorithmic research that created various new techniques. A sample of these techniques was used to put together our 1998 broadcast news system, which is conceptually much simpler, faster, and smaller, but gives the same word error rate as our 1997 system. In particular, our 1998 system is based on a simple phonetically tied mixture (PTM) model with a total of only 13,000 Gaussians, as compared to the 67,000-Gaussian state-clustered system we used in 1997.
2012
We present a description of the development and evaluation of a first South African broadcast news transcription system. We describe a number of speech resources which have been collected in the resource-scarce South African environment for system development purposes: a 20-hour corpus of South African English (SAE) broadcast news; a 109M-word corpus of South African newspaper text collected for language modelling purposes; and a 60k-word SAE pronunciation dictionary. The development of our system is based on similar state-of-the-art broadcast news transcription systems. Our system uses cross-word triphone HMMs, MF-PLP features, and per-segment cepstral mean and per-bulletin cepstral variance normalisation. Our final system obtains a word error rate of 24.6%. We find that, for newsreader data, Indian and Black South African English accents are recognised more accurately than the speech of White English mother-tongue speakers. However, for the spontaneous speech found in interviews and crossings to other locations, the latter accent is associated with the best results, although for this speech the error rates are high overall. Finally, we consider the recognition of MP3-compressed audio and show that performance only deteriorates at high compression levels.
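The normalisation scheme described above centres the cepstral features within each segment but estimates the variance over a whole bulletin. A minimal sketch of that combination (the function name and the pooling details are illustrative assumptions, not taken from the paper):

```python
import numpy as np

def cmvn(segments):
    """Per-segment cepstral mean normalisation followed by variance
    normalisation estimated over the whole bulletin (all segments pooled).
    Each segment is a (frames x coefficients) array."""
    # subtract each segment's own mean, one value per coefficient
    centred = [seg - seg.mean(axis=0) for seg in segments]
    # a single variance estimate per coefficient over the full bulletin
    pooled = np.vstack(centred)
    std = pooled.std(axis=0)
    std[std == 0.0] = 1.0          # guard against constant coefficients
    return [seg / std for seg in centred]
```

Per-segment means track changing channels and speakers, while the per-bulletin variance gives a more stable estimate than short segments could provide on their own.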
Speech Communication, 2002
Over the last few years, the DARPA-sponsored Hub-4 continuous speech recognition evaluations have advanced speech recognition technology for automatic transcription of broadcast news. In this paper, we report on our research and progress in this domain, with an emphasis on efficient modeling with significantly fewer parameters for faster and more accurate recognition. In the acoustic modeling area, this was achieved through new parameter tying, Gaussian clustering, and mixture weight thresholding schemes. The effectiveness of acoustic adaptation is greatly increased through unsupervised clustering of test data. In language modeling, we explored the use of non-broadcast-news training data, as well as adaptation to topic and speaking styles. We developed an effective and efficient parameter pruning technique for backoff language models that allowed us to cope with ever increasing amounts of training data and expanded N-gram scopes. Finally, we improved our progressive search architecture with more efficient algorithms for lattice generation, compaction, and incorporation of higher-order language models.
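The backoff language models discussed above store explicit probabilities only for observed N-grams; pruned or unseen N-grams fall back to a shorter history, paying a backoff weight. A minimal sketch of that lookup, assuming log10 scores as in ARPA-format models (the table layout and function name are illustrative):

```python
def backoff_logprob(lm, history, word):
    """Score one word with a backoff N-gram model.
    `lm` maps N-gram tuples to (log10 probability, log10 backoff weight)."""
    ngram = history + (word,)
    if ngram in lm:
        return lm[ngram][0]                      # explicit N-gram found
    if history:
        # charge the backoff weight of the abandoned history, then shorten it
        bow = lm.get(history, (0.0, 0.0))[1]
        return bow + backoff_logprob(lm, history[1:], word)
    return float("-inf")                         # out-of-vocabulary word
```

Pruning a low-impact N-gram from `lm` simply reroutes its score through this backoff path, which is why the model can shrink drastically with little loss in accuracy.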
1999
This contribution is a report on ongoing research aiming at the development of a speech recognition system for French, combining a standard HMM recognition tool with a syntactic parser. Because of the very high number of homophones in French, and because several agreement rules spread over an unbounded number of words, we designed our GB-based morphological and syntactic parser to output a correct orthographic form from a lattice of phonemes produced by the front-end HMM recognition system. The resulting lattice of phonemes is processed by the syntactic parser, which selects the best word sequence according to its linguistic knowledge. The originality of this approach is the use of a syntactic parser, tuned to phonetic inputs, for a speech recognition task.
2000
In this paper, our work in developing a Mandarin broadcast news transcription system is described. The main focus of this work is a port of the LIMSI American English broadcast news transcription system to the Chinese Mandarin language. The system consists of an audio partitioner and an HMM-based continuous speech recognizer. The acoustic models were trained on about 24 hours of data from the 1997 Hub4 Mandarin corpus available via LDC. In addition to the transcripts, the language models were trained on Mandarin Chinese News Corpus containing about 186 million characters. We investigate recognition performance as a function of lexical size, with and without tone in the lexicon, and with a topic dependent language model. The transcription character error rate on the DARPA 1997 test set is 18.1% using a lexicon with 3 tone levels and a topic-based language model.
2009 Oriental COCOSDA International Conference on Speech Database and Assessments, 2009
This paper presents the development of the speech, text and pronunciation dictionary resources required to build a large-vocabulary speech recognizer for the Malay language. The project is a collaboration among three universities: USM and MMU from Malaysia, and NTU from Singapore. The Malay speech corpus consists of read speech (speaker independent/dependent and accent independent/dependent) and broadcast news. To date, 90 speakers have been recorded, equal to a total of nearly 70 hours of read speech, and 10 hours of broadcast news from local TV stations in Malaysia have been transcribed. The text corpus consists of 700 Mbytes of data extracted from Malaysia's local news web pages from 1998-2008, and a rule-based G2P tool is developed to generate the pronunciation dictionary.
1998
In this paper the Philips Broadcast News transcription system is described. The Broadcast News task aims at the recognition of "found" speech in radio and television broadcasts without any additional side information, e.g. speaking style or background conditions. The system was derived from the Philips continuous mixture-density crossword HMM system, using MFCC features and Laplacian densities. A segmentation was performed to obtain sentence-like partitions of the broadcasts. Using data-driven clustering, the obtained segments were grouped into clusters with similar acoustic conditions for adaptation purposes. Gender-independent word-internal and crossword triphone models were trained on 70 hours of the HUB4 training data. No focus-condition-specific training was applied. Channel and speaker normalization was done by mean and variance normalization as well as VTN and MLLR. The transcription was produced by an adaptive multiple-pass decoder, starting with phrase-bigram decoding using word-internal triphones and finishing with phrase-trigram decoding using MLLR-adapted crossword models.
2007
The majority of state-of-the-art speech recognition systems make use of system combination. The combination approaches adopted have traditionally been tuned to minimising Word Error Rates (WERs). In recent years there has been growing interest in taking the output from speech recognition systems in one language and translating it into another. This paper investigates the use of cross-site combination approaches in terms of both WER and impact on translation performance. In addition the stages involved in modifying the output from a Speech-to-Text (STT) system to be suitable for translation are described. Two source languages, Mandarin and Arabic, are recognised and then translated using a phrase-based statistical machine translation system into English. Performance of individual systems and cross-site combination using cross-adaptation and ROVER are given. Results show that the best STT combination scheme in terms of WER is not necessarily the most appropriate when translating speech.
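ROVER, the combination baseline mentioned above, first aligns the system outputs into a word transition network and then takes a vote in each slot. The alignment step is the hard part; the voting itself can be sketched as below, assuming the hypotheses are already aligned column-by-column with a null symbol for deletions (the `"@"` marker and function name are illustrative):

```python
from collections import Counter

def rover_vote(aligned_hyps):
    """Majority vote over pre-aligned hypotheses (one word per column).
    '@' marks a null arc (a system hypothesised nothing in that slot)
    and is dropped from the final output when it wins the vote."""
    result = []
    for slot in zip(*aligned_hyps):
        word, _ = Counter(slot).most_common(1)[0]
        if word != "@":
            result.append(word)
    return " ".join(result)
```

The full ROVER scheme also weights votes by word confidence scores, which is one reason a WER-optimal combination need not be optimal for downstream translation, as the paper shows.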