As the Web transforms from a text-only medium into a more multimedia-rich medium, the need arises to perform searches based on the multimedia content. In this paper we present an audio and video search engine to tackle this problem. The engine uses speech recognition technology to index spoken audio and video files from the World Wide Web when no transcriptions are available. If transcriptions (even imperfect ones) are available, we can also take advantage of them to improve the indexing process.
... I must also recognize the other members of my thesis committee, Vijaya Kumar, Raj Reddy, Alejandro Acero and Bishnu Atal. I am indebted to Raj for creating the CMU SPHINX group and providing the research infrastructure used in this thesis. ...
This paper describes a series of cepstral-based compensation procedures that render the SPHINX-II system more robust with respect to acoustical environment. The first algorithm, phone-dependent cepstral compensation, is similar in concept to the previously described MFCDCN method, except that cepstral compensation vectors are selected according to the current phonetic hypothesis, rather than on the basis of SNR or VQ codeword identity. We also describe two procedures to accomplish adaptation of the VQ codebook for new environments, as well as the use of reduced-bandwidth frequency analysis to process telephone-bandwidth speech. Use of the various compensation algorithms in consort produces a reduction of error rates for SPHINX-II by as much as 40 percent relative to the rate achieved with cepstral mean normalization alone, in both development test sets and in the context of the 1993 ARPA CSR evaluations.
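For orientation, the baseline named in this abstract, cepstral mean normalization, can be sketched as subtracting the per-utterance mean from each cepstral coefficient. This is a minimal illustration under stated assumptions, not the SPHINX-II implementation: the function name `cepstral_mean_normalize` and the list-of-frames representation are hypothetical.

```python
from typing import List


def cepstral_mean_normalize(frames: List[List[float]]) -> List[List[float]]:
    """Subtract the per-utterance mean from every cepstral coefficient.

    `frames` is assumed to hold one cepstral vector per analysis frame,
    all of the same dimension (a hypothetical representation).
    """
    if not frames:
        return []
    n_frames = len(frames)
    dim = len(frames[0])
    # Mean of each coefficient over the whole utterance.
    means = [sum(frame[d] for frame in frames) / n_frames for d in range(dim)]
    # Removing the mean cancels a constant offset in the cepstral domain.
    return [[frame[d] - means[d] for d in range(dim)] for frame in frames]


# Toy two-frame, two-coefficient utterance (made-up values).
utterance = [[1.0, 2.0], [3.0, 6.0]]
normalized = cepstral_mean_normalize(utterance)
```

A fixed linear channel adds an approximately constant vector in the cepstral domain, which is why removing the utterance mean serves as a simple baseline; the compensation procedures described in the abstract go beyond this step.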
Support Vector Machines (SVMs) represent a new approach to pattern classification which has recently attracted a great deal of interest in the machine learning community. Their appeal lies in their strong connection to the underlying statistical learning theory, in particular the theory of Structural Risk Minimization. SVMs have been shown to be particularly successful in fields such as image identification and face recognition; in many problems SVM classifiers have been shown to perform much better than other nonlinear classifiers such as artificial neural networks and k-nearest neighbors.
We introduce a new family of environmental compensation algorithms called multivariate Gaussian based cepstral normalization (RATZ). RATZ assumes that the effects of unknown noise and filtering on speech features can be compensated by corrections to the mean and variance of components of Gaussian mixtures, and an efficient procedure for estimating the correction factors is provided. The RATZ algorithm can be implemented to work with or without the use of “stereo” development data that had been simultaneously recorded in the training and testing environments. “Blind” RATZ partially overcomes the loss of information that would have been provided by stereo training through the use of a more accurate description of how noisy environments affect clean speech. We evaluate the performance of the two RATZ algorithms using the CMU SPHINX-II system on the alphanumeric census database and compare their performance with that of previous environmental-robustness algorithms developed at CMU.
In this paper we address the problem of aligning very long (often more than one hour) audio files to their corresponding textual transcripts in an effective manner. We present an efficient recursive technique to solve this problem that works well even on noisy speech signals. The key idea of this algorithm is to turn the forced alignment problem into a recursive speech recognition problem with a gradually restricting dictionary and language model. The algorithm is tolerant to acoustic noise and errors or gaps in the text transcript or audio tracks.
In this paper we introduce a new analytical approach to environment compensation for speech recognition. Previous attempts at solving analytically the problem of noisy speech recognition have either used an overly simplified mathematical description of the effects of noise on the statistics of speech or they have relied on the availability of large environment-specific adaptation sets. Some of the previous methods required the use of adaptation data that consists of simultaneously recorded or "stereo" recordings of clean and degraded speech.
We explore the problem of out-of-vocabulary (OOV) queries in audio indexing systems by comparing three indexing methods on a broadcast news repository containing 75 hours of audio. Our systems are word-based, phoneme-based and a novel system based on syllable-like units called particles. To better examine the performance of these three approaches we use a query set where the percentage of OOVs has been artificially increased to 50%. We additionally investigate whether the combination of the three indexing techniques can yield improvements in retrieval. We explore several simple combination strategies such as weighted combinations. We find that combining word and sub-word based systems results in improved retrieval performance.
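A weighted combination of the kind mentioned in this abstract might be sketched as a linear blend of per-document scores from each index. This is only an illustration: the function names, the score dictionaries, the document identifiers and the weights below are all hypothetical, and the abstract does not state which combination formula or weights were actually used.

```python
from typing import Dict, List


def combine_scores(system_scores: Dict[str, Dict[str, float]],
                   weights: Dict[str, float]) -> Dict[str, float]:
    """Linearly combine per-document retrieval scores from several indexes.

    A document missing from one system's results contributes 0 from it.
    """
    combined: Dict[str, float] = {}
    for system, scores in system_scores.items():
        weight = weights.get(system, 0.0)
        for doc_id, score in scores.items():
            combined[doc_id] = combined.get(doc_id, 0.0) + weight * score
    return combined


def rank(combined: Dict[str, float]) -> List[str]:
    """Return document ids by descending score, ties broken by id."""
    return sorted(combined, key=lambda doc_id: (-combined[doc_id], doc_id))


# Made-up scores for one query against the three index types.
scores = {
    "word": {"doc_a": 1.0},
    "phoneme": {"doc_a": 0.5, "doc_b": 1.0},
    "particle": {"doc_b": 0.5},
}
# Hypothetical weights; in practice these would be tuned on held-out queries.
weights = {"word": 0.5, "phoneme": 0.25, "particle": 0.25}

combined = combine_scores(scores, weights)
ranking = rank(combined)
```

In this toy case the sub-word indexes surface `doc_b`, which the word index misses entirely, as would happen for a query term outside the recognizer vocabulary.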
Citrus tristeza closterovirus (CTV) isolates of several geographical origins were compared for variations in their coat protein (CP) gene by analysis of single-strand conformation polymorphism (SSCP). The CP gene of 17 isolates was reverse transcribed, amplified by polymerase chain reaction (PCR), and 22 clones were inserted into a plasmid vector. These clones were sequenced and found to have between 91.7% and 99.8% sequence homology. Clones were amplified and the PCR products denatured and compared by SSCP analysis in 8% polyacrylamide gels. Using two different electrophoretic conditions, the patterns were different for 16 or 17 clones. Four pairs of clones (T36/T66, P1/Q2, 03/8Q, and E1/E2) differing by 10, 2, 1 and 1 nucleotides, respectively, could not be distinguished using either condition. When these clones were compared by SSCP after digestion with Eco91I (BstEII), three of the pairs (T36/T66, P1/Q2, and 03/8Q) could be differentiated, whereas the clones E1 and E2 (differing by 1 nucleotide) remained indistinguishable. Thus, SSCP analysis combining two electrophoretic conditions and restriction of eight clones with Eco91I allowed discrimination between 21 of the 22 CP gene clones selected. SSCP analysis may provide a procedure to identify and differentiate CTV isolates based on comparisons of several genes or gene regions. It is rapid and cheap and may drastically reduce the amount of sequencing necessary for accurate comparisons.
Genetic variability of citrus tristeza virus (CTV) was studied using the haplotypes detected by single-strand conformation polymorphism (SSCP) analysis of genes p18 and p20 in six virus populations of two origins. The Spanish group included a CTV isolate and subisolates obtained by graft-transmission to different host species. The other included two subisolates aphid-transmitted from a single Japanese isolate. The homozygosity observed for gene p20 was always significantly higher than that expected under neutral evolution, whereas only three populations showed high homozygosity for p18, suggesting stronger host constraints for p20 than for p18. Sequential transmissions of a Spanish isolate to new host species increased the difference between its population and that of the successive subisolates for gene p18, as estimated by the F statistic. Analysis of molecular variance indicated that variation between both groups of populations was not statistically significant, whereas variations between populations of the same group or within populations were significant for both genes studied. Our data indicate that selection affects the haplotype distribution and that adaptation to a new host can be as important as, or more important than, the geographical origin. Variation of the CTV populations after host change or aphid transmission may explain in part the wide biological variability observed among CTV isolates.
We examined the population structure and genetic variation of four genomic regions within and between 30 Citrus tristeza virus (CTV) isolates from Spain and California. Our analyses showed that most isolates contained a population of sequence variants, with one being predominant. Four isolates showed two major sequence variants in some genomic regions. The two major variants of three of these isolates showed very low nucleotide identity to each other but were very similar to those of other isolates, suggesting the possibility of mixed infections with two divergent isolates. Incongruencies of phylogenetic relationships in the different genomic regions and statistical analyses suggested that the genomes of some CTV sequence variants originated by recombination events between diverged sequence variants. No correlation was observed between geographic origin and nucleotide distance, and thus from a genetic view, the Spanish and Californian isolates analyzed here could be considered members of the same population.
Separation and interference of strains from a citrus tristeza virus isolate evidenced by biological activity and double-stranded RNA (dsRNA) analysis
Plant Pathology, 1993
Separation of strains of citrus tristeza virus (CTV), differentiated by their double-stranded RNA (dsRNA) profiles, was obtained by graft-inoculating citron plants from a Mexican lime that had been recently aphid- or graft-inoculated with a mild CTV isolate (T-385). Up to 24 sub-isolates with differing dsRNA profiles were obtained from the aphid-inoculated lime. Some of these sub-isolates induced stronger symptoms in several citrus species than the original T-385 isolate. One sub-isolate, T-385-33, was mild in Mexican lime, but induced stem pitting on sweet orange. Inoculation of this isolate on Mexican lime, sour orange and Eureka lemon induced mild or no symptoms when inoculum was taken from citron, but very severe symptoms when the inoculum was from sweet orange. Mexican lime and sweet orange plants co-inoculated with T-385-33 from sweet orange in combination with the other 23 sub-isolates showed mild symptoms. The results obtained suggest that there is natural cross-protection among sub-isolates in the original T-385 isolate.
Proceedings of The National Academy of Sciences, 1999
Citrus tristeza virus (CTV) populations in citrus trees are unusually complex mixtures of viral genotypes and defective RNAs developed during the long-term vegetative propagation of the virus and by additional mixing by aphid transmission. The viral replication process allows the maintenance of minor amounts of disparate genotypes and defective RNAs in these populations. CTV is a member of the Closteroviridae possessing a positive-stranded RNA genome of ≈20 kilobases that expresses the replicase-associated genes as an ≈400-kDa polyprotein and the remaining 10 3′ genes through subgenomic mRNAs. A full-length cDNA clone of CTV was generated from which RNA transcripts capable of replication in protoplasts were derived. The large size of cDNA hampered its use as a genetic system. Deletion of 10 3′ genes resulted in an efficient RNA replicon that was easy to manipulate. To investigate the origin and maintenance of the genotypes in CTV populations, we tested the CTV replicase for its acceptance of divergent sequences by creating chimeric replicons with heterologous termini and examining their ability to replicate. Exchange of the similar 3′ termini resulted in efficient replication whereas substitution of the divergent (up to 58% difference in sequence) 5′ termini resulted in reduced but significant replication, generally in proportion to the extent of sequence divergence.
Papers by Pedro Moreno