Gaussian mixture models (GMMs) are commonly used in text-independent speaker verification for modeling the spectral distribution of speech. Recent studies have shown the effectiveness of characterizing speaker information using just the mean vectors of the GMM in conjunction with a support vector machine (SVM). This paper advocates the use of the spectral correlation captured by covariance matrices, and investigates its effectiveness compared to, and in complement with, the mean vectors. We examine two approaches, namely homoscedastic and heteroscedastic modeling, for estimating the spectral correlation. We introduce two kernel metrics, namely the Frobenius angle and the log-Euclidean inner product, for measuring the similarity between speech utterances in terms of spectral correlation. Experiments conducted on the NIST 2006 speaker verification task show that approximately 10% relative improvement is achieved by using the spectral correlation in conjunction with the mean vectors.
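The two kernel metrics named above have standard definitions for symmetric positive-definite matrices. A minimal NumPy sketch follows; the function names and toy covariances are illustrative, and the paper's exact normalization may differ:

```python
import numpy as np

def spd_logm(C):
    """Matrix logarithm of a symmetric positive-definite matrix
    via eigendecomposition: V diag(log(eigvals)) V^T."""
    vals, vecs = np.linalg.eigh(C)
    return (vecs * np.log(vals)) @ vecs.T

def frobenius_angle(A, B):
    """Cosine of the Frobenius angle between two matrices:
    <A, B>_F / (||A||_F ||B||_F)."""
    num = np.trace(A.T @ B)
    return num / (np.linalg.norm(A, 'fro') * np.linalg.norm(B, 'fro'))

def log_euclidean_inner(A, B):
    """Log-Euclidean inner product: <log A, log B>_F for SPD A, B."""
    return np.trace(spd_logm(A).T @ spd_logm(B))

# Toy covariance matrices estimated from two synthetic "utterances"
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
Y = rng.standard_normal((200, 5)) + 0.1 * X
Ca, Cb = np.cov(X.T), np.cov(Y.T)
k_frob = frobenius_angle(Ca, Cb)
k_logE = log_euclidean_inner(Ca, Cb)
```

Both similarities are symmetric in their arguments, which is what allows them to serve as kernel metrics between utterances.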
2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017
This paper describes a new database for the assessment of automatic speaker verification (ASV) vulnerabilities to spoofing attacks. In contrast to other recent data collection efforts, the new database has been designed to support the development of replay spoofing countermeasures tailored towards the protection of text-dependent ASV systems from replay attacks under variable recording and playback conditions. Derived from a re-recording of the original RedDots database, the effort is aligned with that in text-dependent ASV and thus well positioned for future assessments of replay spoofing countermeasures, not just in isolation, but in integration with ASV. The paper describes the database design and re-recording, a protocol, and some early spoofing detection results. The new "RedDots Replayed" database is publicly available under a Creative Commons license.
In this paper we study automatic regularization techniques for the fusion of automatic speaker recognition systems. Parameter regularization can dramatically reduce the fusion training time; in addition, it removes the need to split the development set into different folds for cross-validation. We apply a majorization-minimization approach to learning the ridge regression regularizer automatically, and design a similar scheme to learn the LASSO regularization parameter. Experiments show the benefit of automatic regularization.
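As a rough illustration of automatic ridge learning for score fusion, the sketch below uses a classic evidence-style fixed point for the regularizer. This is a stand-in for, not a reproduction of, the paper's majorization-minimization update, and the data and names are invented for the example:

```python
import numpy as np

def auto_ridge(X, y, n_iter=50):
    """Ridge regression whose regularizer lambda is learned
    automatically by an evidence-style fixed point: lambda is set to
    the effective degrees of freedom divided by ||w||^2, alternating
    with the ridge solution for w."""
    n, d = X.shape
    G = X.T @ X
    evals = np.linalg.eigvalsh(G)
    lam = 1.0
    for _ in range(n_iter):
        w = np.linalg.solve(G + lam * np.eye(d), X.T @ y)
        gamma = np.sum(evals / (evals + lam))   # effective dof
        lam = gamma / (w @ w + 1e-12)
    return w, lam

# Fuse three synthetic subsystem score streams into one target score
rng = np.random.default_rng(1)
X = rng.standard_normal((500, 3))
y = X @ np.array([0.6, 0.3, 0.1]) + 0.1 * rng.standard_normal(500)
w, lam = auto_ridge(X, y)
```

No cross-validation folds are needed: the regularizer is re-estimated from the training data itself at every iteration, which is the practical appeal the abstract points to.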
The support vector machine (SVM) equipped with a sequence kernel has been proven to be a powerful technique for speaker verification. A number of sequence kernels have recently been proposed, each motivated from a different perspective with its own mathematical derivation, which makes analytical comparison of kernels difficult. To facilitate such comparisons, we propose a generic structure showing how different levels of cues conveyed by speech utterances, ranging from low-level acoustic features to high-level speaker cues, are characterized within a sequence kernel. We then identify the similarities and differences between the popular generalized linear discriminant sequence (GLDS) and GMM supervector kernels, as well as our own probabilistic sequence kernel (PSK). Furthermore, we enhance the PSK in terms of accuracy and computational complexity. The enhanced PSK gives accuracy competitive with the other two kernels, and fusing all three kernels yields an EER of 4.83% on the 2006 NIST SRE core test.
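For concreteness, here is a simplified sketch of a GLDS-style sequence kernel: each utterance is mapped to the average of a monomial expansion of its frames, and the kernel is an inner product normalized by a background correlation matrix R. The real GLDS expansion includes cross terms and estimates R from background data; this sketch keeps only diagonal terms and uses the identity for R:

```python
import numpy as np

def expand(x):
    """Monomial expansion up to degree 2, diagonal terms only
    (a simplification of the full GLDS expansion)."""
    return np.concatenate(([1.0], x, x ** 2))

def glds_map(frames):
    """Average expanded feature vector of an utterance."""
    return np.mean([expand(f) for f in frames], axis=0)

def glds_kernel(frames_a, frames_b, R_inv):
    """GLDS-style sequence kernel: inner product of the averaged
    expansions, normalized by the inverse background correlation."""
    return glds_map(frames_a) @ R_inv @ glds_map(frames_b)

rng = np.random.default_rng(4)
utt_a = rng.standard_normal((100, 3))   # 100 frames, 3-dim features
utt_b = rng.standard_normal((120, 3))   # utterances may differ in length
R_inv = np.eye(7)                       # identity stands in for a real R
k_ab = glds_kernel(utt_a, utt_b, R_inv)
```

The averaging step is what turns a variable-length frame sequence into a fixed-length representation, which is the common structural element the paper's generic framework exposes across kernels.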
In this paper, we apply a Constrained Maximum a Posteriori Linear Regression (CMAPLR) transformation to the Universal Background Model (UBM) when characterizing each speaker with a supervector. We incorporate the covariance transformation parameters into the supervector in addition to the mean transformation parameters, adopting Maximum Likelihood Linear Regression (MLLR) covariance transformation. The auxiliary function maximization involved in Maximum Likelihood (ML) and Maximum a Posteriori (MAP) estimation is also presented. Our experiments on the 2006 NIST Speaker Recognition Evaluation (SRE) corpus show that the two proposed techniques provide substantial performance improvement.
The I4U consortium was established to facilitate joint entries to the NIST speaker recognition evaluations (SRE). The latest such joint submission was to SRE 2018, in which the I4U submission was among the best-performing systems. SRE'18 also marks the ten-year anniversary of the I4U consortium's participation in the NIST SRE series. The primary objective of the current paper is to summarize the results and lessons learned from the twelve subsystems and their fusion submitted to SRE'18. It is also our intention to present a shared view of the advancements, progress, and major paradigm shifts that we have witnessed as SRE participants in the past decade, from SRE'08 to SRE'18. In this regard, we have seen, among others, a paradigm shift from supervector representations to deep speaker embeddings, and a switch of research challenge from channel compensation to domain adaptation.
This work explores the use of various Deep Neural Network (DNN) architectures for an end-to-end language identification (LID) task. The end-to-end approach has been proven to significantly improve the state of the art in many domains, including speech recognition, computer vision, and genomics. As an end-to-end system, deep learning removes the burden of hand-crafting the feature extraction that is the conventional approach in LID. This versatility is achieved by training a very deep network to learn distributed representations of speech features with multiple levels of abstraction. In this paper, we show that an end-to-end deep learning system can be used to recognize language from speech utterances of various lengths. Our results show that a combination of three deep architectures (feed-forward, convolutional, and recurrent networks) achieves the best performance compared to other network designs. Additionally, we compare our network to a state-of-the-art BNF-based i-vector system on the NIST 2015 Language Recognition Evaluation corpus. Key to our approach is that we address computational and regularization issues in the network structure, allowing us to build a deeper architecture than any previous DNN approach to the language recognition task.
2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015
An i-vector is a low-dimensional, fixed-length representation of a variable-length speech utterance, defined as the posterior mean of a latent variable conditioned on the observed feature sequence of the utterance. The usual assumption is that the prior for the latent variable is non-informative, since for homogeneous datasets there is no gain in generality in using an informative prior. This work shows that when extracting i-vectors for a heterogeneous dataset, containing speech samples recorded from multiple sources, using informative priors instead is applicable and leads to favorable results. Tests carried out on the NIST 2008 and 2010 Speaker Recognition Evaluation (SRE) datasets show that our proposed method beats three baselines: for the short2-short3 core task in SRE'08, five (female) and six (male) out of eight common conditions were beaten, and for the core-core task in SRE'10, five out of nine common conditions were beaten for both genders.
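The role of the prior can be seen in a single-Gaussian toy version of the i-vector posterior. The actual model accumulates Baum-Welch statistics over all mixture components; the names, dimensions, and statistics below are invented for illustration:

```python
import numpy as np

def posterior_ivector(T, Sigma_inv, N, F, mu0, P0):
    """Posterior mean of the latent variable w in the sketch model
    m + T w, given zeroth-order stat N and centered first-order
    stat F, under a Gaussian prior N(mu0, P0).  Setting mu0 = 0 and
    P0 = I recovers the standard (non-informative-prior) i-vector."""
    P0_inv = np.linalg.inv(P0)
    L = P0_inv + N * (T.T @ Sigma_inv @ T)      # posterior precision
    return np.linalg.solve(L, P0_inv @ mu0 + T.T @ Sigma_inv @ F)

rng = np.random.default_rng(2)
T = rng.standard_normal((20, 4))                # total variability matrix
Sigma_inv = np.eye(20)                          # inverse residual covariance
N, F = 50.0, rng.standard_normal(20) * 5        # toy sufficient statistics
w_std = posterior_ivector(T, Sigma_inv, N, F, np.zeros(4), np.eye(4))
w_inf = posterior_ivector(T, Sigma_inv, N, F, 0.5 * np.ones(4), 2 * np.eye(4))
```

An informative prior (`mu0`, `P0`) shifts and re-weights the posterior mean, which is the mechanism the paper exploits for heterogeneous data.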
The total variability model has been shown to be effective for text-independent speaker verification. It provides a tractable way to estimate the so-called i-vector, which describes the speaker and session variability rendered in a whole utterance. In order to extract the local session variability that is neglected by an i-vector, local variability models were proposed, including the Gaussian-oriented and the dimension-oriented local variability models. This paper presents a consolidated study of the total and local variability models and gives a full comparison between them under the same framework. In addition, new extensions are proposed for the existing local variability models. The comparison between the total variability model and the local variability models is carried out through experiments on NIST
2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 2013
Joint factor analysis (JFA) and the identity vector (i-vector) currently represent the dominant techniques in speaker recognition due to their superior performance. Developed relatively earlier, the Gaussian mixture model - support vector machine (GMM-SVM) approach with nuisance attribute projection (NAP) has gradually become less popular. However, when the relevance factor in maximum a posteriori (MAP) estimation of the GMM is adapted to the application data in place of the conventional fixed value, GMM-SVM demonstrates some advantages. In this paper, we conduct a comparative study between GMM-SVM with an adaptive relevance factor and JFA/i-vector under the framework of the Speaker Recognition Evaluation (SRE) formulated by the National Institute of Standards and Technology (NIST).
2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014
This paper describes text-dependent speaker verification as a task involving four classes of trials, depending on whether the target speaker or an impostor pronounces the expected pass-phrase or not. These four classes are used to reformulate the log-likelihood ratio traditionally used in text-independent speaker verification. Three formulations of the alternative hypothesis are considered, leading to three new expressions of the verification score. Experiments performed on the publicly available RSR2015 database show a significant improvement compared to existing baseline scores. A relative gain of up to 61% in terms of minimum cost is achieved when the alternative hypothesis is taken as the union of three sub-hypotheses corresponding to the three existing classes of imposture.
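A union-of-sub-hypotheses score can be sketched as a log-likelihood ratio whose denominator combines the impostor sub-hypothesis likelihoods with priors via log-sum-exp. The priors and likelihood values below are purely illustrative, not the paper's:

```python
import numpy as np

def verification_score(ll_target, ll_impostors, priors=None):
    """Log-likelihood-ratio score in which the alternative hypothesis
    is the union of several impostor sub-hypotheses (e.g. wrong
    speaker, wrong pass-phrase, or both), mixed with prior weights."""
    ll_impostors = np.asarray(ll_impostors, dtype=float)
    if priors is None:                       # uniform priors by default
        priors = np.full(len(ll_impostors), 1.0 / len(ll_impostors))
    # log of the prior-weighted sum of impostor likelihoods
    log_alt = np.logaddexp.reduce(np.log(priors) + ll_impostors)
    return ll_target - log_alt

# Three impostor sub-hypotheses with toy log-likelihoods
score = verification_score(-10.0, [-12.0, -15.0, -20.0])
```

With a single impostor class and prior 1, this collapses to the conventional log-likelihood ratio of text-independent verification.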
The 9th International Symposium on Chinese Spoken Language Processing, 2014
Total variability modeling has been shown to be effective for the text-independent speaker verification task. It provides a tractable way to estimate the so-called i-vector, which describes the speaker and session variability rendered in an utterance. Owing to the low dimensionality of the i-vector, channel compensation techniques such as linear discriminant analysis (LDA) and probabilistic LDA can be readily applied. This paper proposes the local variability modeling technique, the central idea of which is to capture the local variability associated with each individual dimension of the acoustic space. We analyze the latent structure associated with both the i-vector and the local variability vector, and show that the two representations complement each other, based on experiments conducted on the NIST SRE'08 and SRE'10 datasets.
2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 2013
The importance of phonetic variability for short-duration speaker verification is widely acknowledged. This paper assesses the performance of Probabilistic Linear Discriminant Analysis (PLDA) and i-vector normalization for a text-dependent verification task. We show that using a class definition based on both speaker and phonetic content significantly improves the performance of a state-of-the-art system. We also compare four models for computing the verification scores from multiple enrollment utterances and show that PLDA intrinsic scoring obtains the best performance in this context. This study suggests that such a scoring regime remains to be optimized.
2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2011
Gaussian mixture models (GMMs) are commonly used to model the spectral distribution of speech signals for text-independent speaker verification. Mean vectors of the GMM, used in conjunction with a support vector machine (SVM), have been shown to be effective in characterizing speaker information. In addition to the mean vectors, covariance matrices capture the correlation between spectral features, which also represent some salient
2008 6th International Symposium on Chinese Spoken Language Processing, 2008
In this paper, we propose a self-organized clustering method for feature mapping to compensate for channel variation in spoken language recognition. The self-organized clustering is realized by transforming the utterances into Gaussian mixture model (GMM) supervectors and categorizing the supervectors with the k-means algorithm. Based on the language-dependent cluster-of-utterance information of the training databases, the feature mapping parameters are trained for each of the target languages. During recognition, the test utterance is assigned to one of the clusters according to the feature mapping parameters and then transformed into cluster-independent features through feature mapping for a given target language. We show the effectiveness of the proposed self-organized feature mapping scheme on the 2003 National Institute of Standards and Technology (NIST) Language Recognition Evaluation (LRE) using a GMM recognizer.
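The self-organized clustering step amounts to plain k-means over supervectors; a minimal sketch follows, with toy data standing in for real GMM supervectors:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain k-means, as used to self-organize supervectors into
    clusters of utterances (feature-mapping parameters would then
    be trained per cluster)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each center as the mean of its assigned points
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Two well-separated toy "supervector" clusters
rng = np.random.default_rng(3)
A = rng.standard_normal((30, 8)) + 10.0
B = rng.standard_normal((30, 8)) - 10.0
labels, centers = kmeans(np.vstack([A, B]), k=2)
```

At test time an utterance would be assigned to its nearest cluster center, and the cluster's feature-mapping parameters applied.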
2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2012
Voice conversion, the methodology of automatically converting one's utterances to sound as if spoken by another speaker, presents a threat to applications relying on speaker verification. We study the vulnerability of text-independent speaker verification systems to voice conversion attacks using telephone speech. We implemented a voice conversion system with two types of features and non-parallel frame alignment methods, together with five speaker verification systems ranging from simple Gaussian mixture models (GMMs) to a state-of-the-art joint factor analysis (JFA) recognizer. Experiments on a subset of the NIST 2006 SRE corpus indicate that the JFA method is the most resilient against conversion attacks. Even so, it experiences a more than five-fold increase in false acceptance rate, from 3.24% to 17.33%.
2010 7th International Symposium on Chinese Spoken Language Processing, 2010
Gaussian mixture models (GMMs) are commonly used in text-independent speaker verification for modeling the spectral distribution of speech. Recent studies have shown the effectiveness of characterizing speaker information using the mean super-vector obtained by concatenating the mean vectors of the GMM. This paper proposes to use the spatial correlation captured by the covariance matrix of the mean super-vector for speaker
Papers by Aik Lee