GMM-UBM-based speaker verification relies heavily on well-trained UBMs. In practice, it is often not easy to obtain a UBM that fully matches the acoustic channel in operation. In a previous study, we proposed a novel sequential UBM adaptation approach based on MAP to address this problem. This work extends that study by applying the sequential approach to speaker model adaptation. In addition, we investigate a new feature-space sequential adaptation approach based on feature MAP linear regression (fMAPLR) and compare it with the previously proposed model-space MAP approach. We find that these two approaches are complementary and can be combined to deliver additional performance gains. Experiments conducted on a time-varying speech database demonstrate that the proposed MAP-fMAPLR approach leads to significant EER reductions with two mismatched UBMs (25% and 39%, respectively).
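As background for the adaptation described above, here is a minimal sketch of the standard relevance-MAP update of a GMM mean, which is the building block that a sequential approach can iterate over sessions; the notation (posteriors, relevance factor) is generic and not taken from the paper.

```latex
% Relevance-MAP update of the k-th Gaussian mean, given adaptation frames
% x_1..x_T with posteriors \gamma_k(t) computed under the prior (UBM) model:
\begin{aligned}
n_k &= \sum_{t=1}^{T} \gamma_k(t), \qquad
\bar{x}_k = \frac{1}{n_k} \sum_{t=1}^{T} \gamma_k(t)\, x_t, \\
\hat{\mu}_k &= \frac{n_k}{n_k + \tau}\, \bar{x}_k \;+\; \frac{\tau}{n_k + \tau}\, \mu_k ,
\end{aligned}
```

where μ_k is the prior mean and τ the relevance factor. In a sequential setting, the model adapted on one session serves as the prior for the next; fMAPLR applies an analogous MAP-regularized linear transform in the feature space.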
7th European Conference on Speech Communication and Technology (Eurospeech 2001)
In this paper, a semantic parser with support for ellipsis resolution in a Chinese spoken language dialogue system is proposed. The grammar and parsing strategy of this parser are designed to address the characteristics of spoken language and to support ellipsis resolution. Specifically, it parses the user utterance with a domain-specific semantic grammar based on a template-filling approach. Syntactic constraints extracted by a Generalized LR parser are also used in the parsing process. With a paradigm of two-stage bottom-up parsing and a scoring scheme, the ellipsis resolution module is integrated into the parser seamlessly. The parsing result is represented by a linked structure of semantic frames, which is convenient for both the parser and the subsequent components of the dialogue system.
7th European Conference on Speech Communication and Technology (Eurospeech 2001)
The purpose of this paper is to solve the contextual ellipsis problem that is common in our Chinese spoken dialogue system, EasyNav. A Theme Structure is proposed to describe the attentional state. Its dynamic generation makes it suitable for modeling topic transitions in user-initiative dialogues. By studying the differences and similarities between the ellipsis and anaphora phenomena, we extend the resolution procedure and theory from anaphora to ellipsis. The ellipsis resolution is now based on semantic knowledge and discourse factors rather than syntactic information. The Theme Structure Method proposed in this paper for ellipsis resolution applies uniformly not only to all kinds of elliptical elements but also to particular ellipsis types such as fragmental ellipsis and default ellipsis.
The Speaker and Language Recognition Workshop (Odyssey 2022)
The choice of an optimal time-frequency resolution is usually a difficult but important step in tasks involving speech signal classification, e.g., speech anti-spoofing. The variation in performance across different choices of time-frequency resolution can be as large as that across different model architectures, which makes it difficult to judge where an improvement actually comes from when a new network architecture is introduced as the classifier. In this paper, we propose a multi-resolution front-end for feature extraction in an end-to-end classification framework. Optimal weighted combinations of multiple time-frequency resolutions are learned automatically given the objective of a classification task. Features extracted with different time-frequency resolutions are weighted and concatenated as inputs to the subsequent networks, where the weights are predicted by a learnable neural network inspired by the weighting block in squeeze-and-excitation networks (SENet). Furthermore, we investigate refining the chosen time-frequency resolutions by pruning those with relatively low importance, which reduces the complexity and size of the model. The proposed method is evaluated on the speech anti-spoofing task of ASVspoof 2019, and its superiority is demonstrated by comparison with similar baselines.
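As an illustration of the weighting idea described above, here is a minimal, hypothetical PyTorch sketch of SE-style per-resolution weighting; the module name, tensor shapes, and the way features are concatenated are assumptions, not the paper's actual front-end.

```python
import torch
import torch.nn as nn

class ResolutionWeighting(nn.Module):
    """Learn one weight per time-frequency resolution, in the spirit of a
    squeeze-and-excitation block (illustrative sketch, not the paper's code)."""
    def __init__(self, num_resolutions: int, reduction: int = 2):
        super().__init__()
        hidden = max(1, num_resolutions // reduction)
        self.fc = nn.Sequential(
            nn.Linear(num_resolutions, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_resolutions),
            nn.Sigmoid(),
        )

    def forward(self, feats):
        # feats: list of tensors, each (batch, freq_bins, frames), already
        # brought to a common shape so they can be stacked.
        stacked = torch.stack(feats, dim=1)        # (B, R, F, T)
        squeeze = stacked.mean(dim=(2, 3))         # (B, R): global average pooling
        weights = self.fc(squeeze)                 # (B, R): learned importances
        weighted = stacked * weights[:, :, None, None]
        return weighted.flatten(1, 2)              # (B, R*F, T): concatenated input
```

Resolutions whose learned weights stay consistently small could then be pruned, shrinking the front-end, in line with the refinement step investigated in the paper.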
2005 International Conference on Natural Language Processing and Knowledge Engineering
In this paper, a language-style adaptation method for language models is proposed, based on the differences between spoken and written language. Several interpolation methods based on trigram counts are used for adaptation. An interpolation method considering Katz smoothing computes weights according to the confidence score of a trigram. An adaptation method based on the classification of a trigram's style feature computes weights dynamically according to the trigram's language-style tendency, and several weight generation functions are proposed. Experiments on spoken-language Chinese corpora show that these methods, especially the one considering both a trigram's confidence and its style tendency, achieve a reduction in the Chinese character error rate for pinyin-to-character conversion.
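A minimal sketch of the interpolation idea described above, with a weight that may depend on the trigram itself; the notation is generic and not taken from the paper.

```latex
% Linear interpolation of spoken-style and written-style trigram models with a
% trigram-dependent weight \lambda:
P(w_3 \mid w_1 w_2) \;=\; \lambda(w_1 w_2 w_3)\, P_{\text{spoken}}(w_3 \mid w_1 w_2)
\;+\; \bigl(1 - \lambda(w_1 w_2 w_3)\bigr)\, P_{\text{written}}(w_3 \mid w_1 w_2)
```

Here λ could be computed from the trigram's confidence (e.g., its Katz-smoothed count) and its estimated style tendency, which is the combination the experiments above favor.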
Research on speaker recognition is extending to address vulnerability under in-the-wild conditions, among which genre mismatch is perhaps the most challenging, for instance, enrollment with read speech while testing with conversational or singing audio. This mismatch leads to complex and composite inter-session variations, both intrinsic (e.g., speaking style, physiological status) and extrinsic (e.g., recording device, background noise). Unfortunately, the few existing multi-genre corpora are not only limited in size but are also recorded under controlled conditions, which cannot support conclusive research on the multi-genre problem. In this work, we first publish CN-Celeb, a large-scale multi-genre corpus that includes in-the-wild speech utterances of 3,000 speakers in 11 different genres. Second, using this dataset, we conduct a comprehensive study of the multi-genre phenomenon, in particular the impact of the multi-genre challenge on speaker recognition and how to utilize the valuable multi-genre data more efficiently.
2016 10th International Symposium on Chinese Spoken Language Processing (ISCSLP), 2016
Probabilistic linear discriminant analysis (PLDA) is a popular normalization approach for the i-vector model and has delivered state-of-the-art performance in speaker recognition. A potential problem of the PLDA model, however, is that it essentially assumes Gaussian distributions over speaker vectors, which is not always true in practice. Additionally, its objective function is not directly related to the goal of the task, e.g., discriminating true speakers from imposters. In this paper, we propose a max-margin metric learning approach to address these problems. It learns a linear transform with the criterion that the margin between target and imposter trials is maximized. Experiments conducted on the SRE08 core test show that, compared to PLDA, the new approach obtains comparable or even better performance, even though the scoring is simply a cosine computation.
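The following is a minimal sketch, under assumed notation, of the kind of hinge loss such a max-margin criterion implies: a linear transform W is sought so that the cosine score of a target trial exceeds that of an imposter trial sharing the same enrollment by at least a margin. The function names are illustrative, not the paper's implementation.

```python
import numpy as np

def pairwise_hinge_loss(W, x_enroll, x_target, x_imposter, margin=0.5):
    """Hinge loss for one (target, imposter) trial pair under transform W.
    Scoring is plain cosine similarity on the transformed i-vectors."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

    e, t, i = W @ x_enroll, W @ x_target, W @ x_imposter
    return max(0.0, margin - cos(e, t) + cos(e, i))

# In training, W would be optimized (e.g., by subgradient descent) over many
# sampled trial pairs; at test time the score is simply cos(W @ x_enroll, W @ x_test).
```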
2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), 2017
Tibetan is an important low-resource language in China. A key factor that hinders speech and language research for Tibetan is the lack of resources, particularly free ones. This paper describes our recent progress on Tibetan resource construction supported by the NSFC M2ASR project, including the phone set, the lexicon, and the transcription of a large-scale speech corpus. Following the M2ASR free data program, all the resources are publicly available and free for researchers. We also release a small Tibetan speech database that can be used to build a prototype Tibetan speech recognition system.
APSIPA Transactions on Signal and Information Processing, 2020
Nowadays, the security of ASV systems is gaining increasing attention. As one of the common spoofing methods, replay attacks are easy to mount but difficult to detect. Many researchers focus on designing features to detect the distortions introduced by replay attacks. Constant-Q cepstral coefficients (CQCC), based on the magnitude of the constant-Q transform (CQT), are among the most striking features in the field of replay detection. However, they ignore phase information, which may also be distorted in the replay process. In this work, we propose a CQT-based modified group delay feature (CQTMGD) that can capture the phase information of the CQT. Furthermore, a multi-branch residual convolutional network, ResNeWt, is proposed to distinguish replay attacks from bona fide attempts. We evaluated our proposal on the ASVspoof 2019 physical access dataset. Results show that CQTMGD outperformed the traditional MGD feature, and the fusion with other magnitude-based and phase-based features...
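For reference, a sketch of the standard modified group delay function computed from the DFT, which the CQTMGD feature described above adapts to the constant-Q transform; the parameter names follow common usage and are not taken from the paper.

```latex
% x[n]: signal frame; X(\omega): its spectrum; Y(\omega): spectrum of n\,x[n];
% S(\omega): cepstrally smoothed version of X(\omega); 0 < \alpha, \gamma \le 1.
\tau(\omega) = \frac{X_R(\omega)\,Y_R(\omega) + X_I(\omega)\,Y_I(\omega)}{|S(\omega)|^{2\gamma}},
\qquad
\text{MGD}(\omega) = \frac{\tau(\omega)}{|\tau(\omega)|}\,|\tau(\omega)|^{\alpha}
```

In the CQT-based variant, the DFT spectra are replaced by constant-Q spectra, so the phase-derived information is captured on the same nonuniform frequency scale used by CQCC.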
Fear recognition, which aims at predicting whether a movie segment can induce fear, is a promising area in movie emotion recognition. Research in this area, however, has reached a bottleneck. Difficulties may partly result from imbalanced databases. In this paper, we propose an imbalance-learning-based framework for movie fear recognition. A data rebalance module is adopted before classification. Several sampling methods, including the proposed softsampling and hardsampling, which combine the merits of both undersampling and oversampling, are explored in this module. Experiments are conducted on the MediaEval 2017 Emotional Impact of Movies Task. Compared with the current state of the art, we achieve an improvement of 8.94% in F1, demonstrating the effectiveness of the proposed framework.
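A minimal, hypothetical sketch of the rebalancing idea (undersample the majority class and oversample the minority class toward a common size); it is not the paper's softsampling/hardsampling implementation.

```python
import numpy as np

def rebalance(X, y, rng=None):
    """Undersample the majority class and oversample the minority class
    (with replacement) so both reach an intermediate size."""
    rng = rng or np.random.default_rng(0)
    idx_pos, idx_neg = np.where(y == 1)[0], np.where(y == 0)[0]
    minority, majority = sorted((idx_pos, idx_neg), key=len)
    target = int(np.sqrt(len(minority) * len(majority)))  # geometric mean as a compromise
    keep = np.concatenate([
        rng.choice(majority, size=target, replace=False),  # undersampling
        rng.choice(minority, size=target, replace=True),   # oversampling
    ])
    rng.shuffle(keep)
    return X[keep], y[keep]
```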
2013 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, 2013
Speaker verification suffers from significant performance degradation under emotion variation. In a previous study, we demonstrated that an adaptation approach based on MLLR/CMLLR can provide a significant performance improvement for verification on emotional speech. This paper follows this direction and presents an emotional adaptive training (EAT) approach. The approach iteratively estimates the emotion-dependent CMLLR transformations and retrains the speaker models with the transformed speech, and can therefore make use of emotional enrollment speech to train a stronger speaker model. This is similar to speaker adaptive training (SAT) in speech recognition. The experiments are conducted on an emotional speech database that contains recordings of 30 speakers in 5 emotions. The results demonstrate that the EAT approach provides significant performance improvements over the baseline system, where the neutral enrollment data are used to train the speaker models and the emotional test utterances are verified directly. EAT also significantly outperforms two other emotion-adaptation approaches: (1) a CMLLR-based approach, where the speaker models are trained with the neutral enrollment speech and the emotional test utterances are transformed by CMLLR during verification; and (2) a MAP-based approach, where the emotional enrollment data are used to train emotion-dependent speaker models and the emotional utterances are verified against the emotion-matched models.
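A schematic sketch of the EAT loop as described above; `train_gmm`, `estimate_cmllr`, and `apply_transform` are placeholder names for the model training, CMLLR estimation, and feature transformation steps, not real library calls.

```python
def emotional_adaptive_training(enroll_by_emotion, num_iters=3):
    """enroll_by_emotion: dict mapping an emotion label to its enrollment features."""
    # initial speaker model from all enrollment data pooled together
    speaker_model = train_gmm([f for feats in enroll_by_emotion.values() for f in feats])
    transforms = {}
    for _ in range(num_iters):
        # 1) one CMLLR transform per emotion against the current speaker model
        for emotion, feats in enroll_by_emotion.items():
            transforms[emotion] = estimate_cmllr(speaker_model, feats)
        # 2) map enrollment data into the emotion-normalized space
        normalized = [apply_transform(transforms[e], f)
                      for e, feats in enroll_by_emotion.items() for f in feats]
        # 3) retrain the speaker model on the normalized data
        speaker_model = train_gmm(normalized)
    return speaker_model, transforms
```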
2009 IEEE International Conference on Multimedia and Expo, 2009
Achieving a good balance between matching accuracy and computational efficiency is a key challenge for a Query-by-Humming (QBH) system. In this paper, we propose an n-gram-based fast-match approach. Our n-gram method uses a robust statistical note transcription as well as an error compensation method based on the analysis of frequent transcription errors. The effectiveness of our approach has been evaluated on a relatively large melody database of 5,223 melodies. The experimental results show that when the search space was reduced to only 10% of its original size, 90% of the target melodies were preserved among the candidates and 88% of the system's matching accuracy was retained, with no noticeable additional computation.
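The pruning step described above can be pictured with a simple n-gram inverted index over transcribed note (or interval) sequences; this is an illustrative sketch, not the paper's exact transcription or error compensation scheme.

```python
from collections import defaultdict

def note_ngrams(notes, n=3):
    """Set of n-grams over a transcribed note/interval sequence."""
    return {tuple(notes[i:i + n]) for i in range(len(notes) - n + 1)}

def build_index(melodies, n=3):
    """Inverted index: n-gram -> ids of melodies containing it."""
    index = defaultdict(set)
    for mid, notes in melodies.items():
        for g in note_ngrams(notes, n):
            index[g].add(mid)
    return index

def fast_match(query_notes, index, num_melodies, n=3, keep=0.1):
    """Rank melodies by shared n-gram count; keep the top fraction as candidates
    for the (more expensive) detailed melody match."""
    counts = defaultdict(int)
    for g in note_ngrams(query_notes, n):
        for mid in index.get(g, ()):
            counts[mid] += 1
    ranked = sorted(counts, key=counts.get, reverse=True)
    return ranked[:max(1, int(num_melodies * keep))]
```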
Information about speech units such as vowels, consonants, and syllables can serve as knowledge for text-independent Short Utterance Speaker Recognition (SUSR), in a similar way as in text-dependent speaker recognition. In such tasks, the data for each speech unit, especially at recognition time, is often insufficient. Hence, it is not practical to use the full set of speech units, because some of the units might not be well trained. To solve this problem, a method of using speech unit categories rather than individual phones is proposed for SUSR, wherein similar speech units are grouped together, thereby alleviating the data sparsity problem. We define Vowel, Consonant, and Syllable Categories (VC, CC and SC) with Standard Chinese (Putonghua) as a reference. A speech utterance is recognized into VC, CC and SC sequences, which are used to train Universal Background Models (UBM) for each speech unit category in the training procedure, and to perform speech unit category dependent speaker recogn...
In this paper, a Chinese Spontaneous Telephone Speech Corpus in the flight enquiry and reservation domain (CSTSC-Flight), comprising 6 GB of raw data and about 50 hours of valid speech, is introduced, including its collection and transcription principles and an outline. An analysis of the spoken language phenomena contained in this corpus is then performed. Based on this, four types of grammar rules are proposed so as to cover as many Chinese spoken language phenomena as possible for robust natural language parsing and understanding in spoken dialogue systems.
Many knowledge repositories nowadays contain billions of triplets, i.e., (head-entity, relationship, tail-entity), as relation instances. These triplets form a directed graph with entities as nodes and relationships as edges. However, this kind of symbolic and discrete storage structure makes it difficult to exploit the knowledge to enhance other intelligent applications (e.g., question-answering systems), as many AI-related algorithms prefer to compute on continuous data. Therefore, a series of emerging approaches have been proposed to facilitate knowledge computing by encoding the knowledge graph into a low-dimensional embedding space. TransE is the latest and most promising approach among them, achieving higher performance with fewer parameters by modeling the relationship as a translation vector from the head entity to the tail entity. Unfortunately, it is not flexible enough to handle the various mapping properties of triplets well, even though its authors note the resulting harm to performance. In this paper, we therefore propose a superior model called TransM that leverages the structure of the knowledge graph by pre-calculating a distinct weight for each training triplet according to its relational mapping property. In this way, the objective function treats each triplet according to its own weight. We carry out extensive experiments to compare TransM with the state-of-the-art method TransE and other prior art. The performance of each approach is evaluated in two different application scenarios on several benchmark datasets. Results show that the proposed model significantly outperforms the former ones while keeping parameter complexity as low as TransE's.
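For context, the translation-based scoring that the abstract refers to can be written as follows; the form of the TransM weight is only indicated, since its exact definition is given in the paper.

```latex
% TransE scores a triplet (h, r, t) by how well the relation vector translates
% the head embedding onto the tail embedding:
f_r(h, t) \;=\; \lVert \mathbf{h} + \mathbf{r} - \mathbf{t} \rVert_{L_1/L_2}.
% TransM pre-computes a per-relation weight w_r from the relation's mapping
% property (1-to-1, 1-to-N, N-to-1, N-to-N) and scores
f_r(h, t) \;=\; w_r\,\lVert \mathbf{h} + \mathbf{r} - \mathbf{t} \rVert_{L_1/L_2},
```

so that triplets of relations with many-to-many behavior are penalized less rigidly than under TransE's uniform treatment.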
2014 IEEE China Summit & International Conference on Signal and Information Processing (ChinaSIP), 2014
We propose a fast block-wise and parallel training approach for i-vector systems. The approach divides the loading matrix into groups according to Gaussian components or acoustic feature dimensions and trains the loading matrices of these groups independently and in parallel. The individually trained block matrices can be combined to approximate the original loading matrix, or used to derive independent i-vectors. We tested block-wise training on speaker verification tasks based on NIST SRE data and found that it substantially speeds up training while retaining the quality of the resulting i-vectors.
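A rough, hypothetical sketch of the block-wise idea: split the Gaussian components into groups, train a smaller loading matrix per group (each a much smaller estimation problem that can run in parallel), then either stack the blocks or keep them separate. `train_loading_matrix` is a placeholder for the usual EM estimation, not a real library call.

```python
import numpy as np

def blockwise_train(stats_per_component, num_blocks, ivec_dim):
    """Train one loading-matrix block per group of Gaussian components."""
    groups = np.array_split(np.arange(len(stats_per_component)), num_blocks)
    blocks = []
    for comp_ids in groups:
        sub_stats = [stats_per_component[c] for c in comp_ids]
        blocks.append(train_loading_matrix(sub_stats, ivec_dim))  # independent EM, parallelizable
    return blocks

def combine(blocks):
    """Stack the blocks row-wise to approximate the full loading matrix;
    alternatively, extract one i-vector per block and concatenate them."""
    return np.vstack(blocks)
```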
2009 IEEE International Conference on Multimedia and Expo, 2009
A Query-by-Humming (QBH) system allows users to retrieve songs by singing or humming. In this paper we propose a phrase-level piecewise linear scaling algorithm for melody matching. Musical phrase boundaries are predicted for the query to split it into phrases. The boundaries of the melody fragment corresponding to each phrase are allowed to adjust within a limited scope. The algorithm employs Dynamic Programming and Recursive Alignment to search for the minimal piecewise matching cost using Linear Scaling at the phrase level. Our experimental results on a database of 5,223 melodies show that the proposed algorithm outperforms traditional algorithms, giving significant improvements in top-1 rate of 17.0%, 14.7%, and 4.8% over Linear Scaling, Dynamic Time Warping, and Recursive Alignment, respectively. The results also show that the proposed algorithm is more efficient than the previous algorithms.
Observations on search engine query logs indicate that queries are usually ambiguous. Similar to document ranking, search intents should be ranked to facilitate information search. Previous work attempts to rank intents using only a relevance score. We argue that diversity is also important. In this work, unified models are proposed to rank the intents underlying a query by combining a relevance score and a diversity degree, where the latter is reflected by the non-overlapping ratio of each intent and the aggregated non-overlapping ratio of a set of intents. Three conclusions are drawn from the experimental results. First, diversity plays an important role in intent ranking. Second, URL is more effective than similarity in detecting unique subtopics. Third, the aggregated non-overlapping ratio contributes to similarity-based intent ranking but little to URL-based intent ranking.
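One generic way to combine the two signals described above is a weighted sum of relevance and novelty, with intents selected greedily; the notation is illustrative, not the paper's.

```latex
\text{score}(i \mid q, S) \;=\; \alpha\, \text{rel}(i, q) \;+\; (1 - \alpha)\, \text{novelty}(i \mid S)
```

Here rel(i, q) is the relevance of intent i to query q, S is the set of intents selected so far, novelty(i | S) is the (aggregated) non-overlapping ratio of i with respect to S, and α trades off relevance against diversity.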
International Journal of Pattern Recognition and Artificial Intelligence, 2015
Cross-lingual document clustering is the task of automatically organizing a large collection of multi-lingual documents into a few clusters according to their content or topic. It is well known that the language barrier and translation ambiguity are two challenging issues for cross-lingual document representation. To this end, we propose to represent cross-lingual documents through statistical word senses, which are automatically discovered from a parallel corpus through a novel cross-lingual word sense induction model and a sense clustering method. In particular, the former builds on a sense-based vector space model and the latter leverages a sense-based latent Dirichlet allocation. Evaluation on benchmark datasets shows that the proposed models outperform two state-of-the-art methods for cross-lingual document clustering.