Knowledge-Based Systems 115 (2017) 5–14

Contents lists available at ScienceDirect

Knowledge-Based Systems
journal homepage: [Link]/locate/knosys

Deep neural network framework and transformed MFCCs for speaker's age and gender classification

Zakariya Qawaqneh (a), Arafat Abu Mallouh (a), Buket D. Barkana (b,*)

(a) Computer Science and Engineering Department, School of Engineering, University of Bridgeport, Bridgeport, CT 06604, United States
(b) Electrical Engineering Department, School of Engineering, University of Bridgeport, Bridgeport, CT 06604, United States

Article info

Article history:
Received 27 April 2016
Revised 28 September 2016
Accepted 7 October 2016
Available online 20 October 2016

Keywords: Deep neural network; DNN; I-Vector; MFCCs; Speaker age and gender classification

Abstract

Speaker age and gender classification is one of the most challenging problems in speech processing. Although many studies have focused on feature extraction and classifier design for improvement, classification accuracies are still not satisfactory. The key issue in identifying a speaker's age and gender is to generate robust features and to design an in-depth classifier. Age and gender information is concealed in the speaker's speech, which is affected by many factors such as background noise, speech content, and phonetic divergence. The success of the DNN architecture in many applications motivated this work to propose a new speaker age and gender classification system that uses a BNF extractor together with a DNN. This work has two major contributions: the introduction of shared class labels among misclassified classes to regularize the weights in the DNN, and the generation of a transformed MFCCs feature set. The proposed system uses HTK to find tied-state triphones for all utterances, which are used as labels for the output layer in the DNNs for the first time in age and gender classification. The BNF extractor is used to generate the transformed MFCCs features. The performance of the new features is evaluated by two classifiers, DNN and I-Vector. It is observed that the transformed MFCCs are more effective than the traditional MFCCs in speaker age and gender classification. By using the transformed MFCCs, the overall classification accuracies are improved by about 13%.

© 2016 Elsevier B.V. All rights reserved.

Abbreviations: DNN, Deep neural network; aGender, Age-annotated database of German telephone speech; HTK, Hidden Markov model toolkit; MFCCs, Mel frequency cepstral coefficients; RBM, Restricted Boltzmann machine; DBN, Deep belief networks; GMM, Gaussian mixtures models; SVM, Support vector machines; MLLR, Maximum likelihood linear regression; TPP, Tandem posterior probability; UBM, Universal background model; PPR, Parallel phoneme recognizer; MAP, Maximum-a-posteriori; BNF, Bottle-neck feature; BB-RBM, Bernoulli-Bernoulli RBM; GB-RBM, Gaussian-Bernoulli RBM.

* Corresponding author.
E-mail addresses: zqawaqne@[Link] (Z. Qawaqneh), aabumall@[Link] (A.A. Mallouh), bbarkana@[Link] (B.D. Barkana).

1. Introduction

Currently, computerized systems such as language learning, phone ads, criminal cases, and computerized health and educational systems are rapidly spreading and imposing an urgent need for better performance. Such applications can be improved by speakers' age, gender, accent, and emotional state information [1–3]. Age and gender recognition is defined as the extraction of age and gender information from a speaker's speech. A key stage in identifying speakers' age and gender is to extract and select effective features that represent the speaker's characteristics uniquely. Another key stage is classifier design. A classifier uses the extracted features to predict the speaker's age and gender.

Numerous feature sets have been developed and evaluated in the literature for this problem. Those features can be classified into three categories: spectral, prosodic, and glottal features. One of the most recognized feature sets is MFCCs, which represent the spectral characteristics of a speech utterance. MFCCs are widely used in the literature for different speech processing applications such as speech recognition, speaker identification, and noise classification. MFCCs represent the spectrum that is related to vocal tract shape and do not capture prosodic information [4]. The effectiveness of MFCCs comes from their ability to model the vocal tract in the short-time power spectrum. Although previous studies have presented some improvements in this field, the classification of speaker's age and gender still has a big room for improvement. More effective feature sets, especially for short-duration speech utterances, and better classifier designs are required to improve current classification accuracies. There are studies reporting high overall classification accuracies [5] (around 90%); however, these studies either used a small private corpus or predicted a small number of age and gender classes. The aGender database is one of the most challenging databases in speaker age and gender classification since it is text-independent; background noise is present; the number of utterances varies for each class; and there are seven classes. The highest reported classification accuracy for this database is around 60%, achieved by using a combination of several feature sets [6].


In the last few years, DNNs have been used effectively for feature extraction and classification in computer vision [7–9], image processing and classification [8,10], and natural language recognition [11,12]. In 2006, Hinton et al. [13] introduced the RBM as a keystone for training DBNs. Later, Bengio [14] successfully proposed a new way to train DNNs by using autoencoders. A DNN has a deep architecture that transforms rich input features into a strong internal representation [15]. One of the most recent popular techniques is the eigenvoice (I-Vector), which is based on the process of joint factor analysis [16]. Currently, it is considered one of the state-of-the-art techniques in the field of speaker recognition and language detection [17,18]. Eigenvoice adaptation is the main procedure to estimate the I-Vector, which represents a low-dimensional latent factor for each class in a corpus. Test data is scored by a linear strategy that computes the log-likelihood ratio between different classes.

This paper is organized as follows. A brief literature review is provided in Section 2. In Section 3, the methodology of the proposed work is explained. The classifier design is introduced in Section 4. Experimental results and their analysis are presented in Section 5. The conclusion, challenges, and future work follow in Section 6.

2. Literature review

The problem of age and gender classification was studied as early as the 1950s [19], but computer-aided systems for deriving age and gender information from speech have been developed recently [20,21]. Li et al. [22] utilized various acoustic and prosodic methods to improve accuracies by using two or more fusion systems, such as the GMM base, GMM-SVM mean supervector, GMM-SVM-MLLR supervector, GMM-SVM-TPP supervector, and SVM baseline systems. Their GMM system used 13-dimensional MFCCs features and their first and second derivatives per frame as input. Cepstral mean subtraction and variance normalization are performed to get zero mean and unit variance on their database. UBM and MAP techniques [23] are used to model the different age and gender classes in a supervised manner for GMM training. Their system achieved an overall accuracy of 43.1%. The other system proposed by Li et al. [22] is the GMM-SVM mean supervector system, which is considered an acoustic-level approach for speaker age and gender classification. The GMM baseline system is used for extracting features and for training the UBM model. The mean vectors of all the Gaussian components are concatenated to form the GMM supervectors, which are then modeled by SVMs. One of the advantages of their work is the usage of two-stage frameworks as in [24], which solve the limitation of computer memory required by large-database training, instead of directly training a multi-class SVM classifier on all the high-dimensional supervectors. This system achieved a 42.6% overall accuracy for the aGender database.

In the GMM-SVM MLLR system, the MLLR adapts the means of the UBM for each utterance to extract the features of the supervector [25]. An SVM is used to model the resulting MLLR matrix supervector. Dimension reduction on the MLLR supervector space is done by linear discriminant analysis. It is important to mention that the MLLR matrix contains speaker-specific characteristics, and the contents of this transformed matrix are used as feature supervectors for speaker modeling and age and gender recognition. The MLLR system achieved an overall accuracy of 36.2%.

The GMM-SVM TPP supervector is calculated as a probability distribution over all Gaussian components. In this method, the KL-divergence is used to measure the similarity between vectors. The usage of KL-divergence provides discriminative information, which helps in getting information about the age and gender of a speaker. In TPP, UBM models are trained independently; therefore, each UBM component can model some underlying phonetic sounds [26]. The TPP system's overall accuracy is calculated as 37.8%. The SVM baseline system, using 450-dimensional acoustic features [22] and several prosodic features such as F0, F0 envelope, jitter, and shimmer, is designed to capture the age and gender information at the prosodic level. This system achieved an overall accuracy of 44.6% for the aGender database.

In [22] it is shown that combining these methods results in low computational cost. Moreover, a score-level fusion of different numbers of systems is used, where each system contributes complementary information to the others. The highest accuracy (52.7%) is attained when the five systems are combined.

Metze et al. [27] studied different techniques for age and gender classification based on telephone applications. They also compared the performance of their system to human listeners. Their first technique, PPR, is one of the early systems built to deal with automatic sound recognition and language identification problems. The main core of this system is to create a PPR for each class in the age and gender database. They reported that the PPR system performs almost like human listeners, with the disadvantage of losing quality and accuracy on short utterances. Their second technique is based on prosodic features. This technique uses several prosodic features: jitter, shimmer, statistics of the harmonics-to-noise ratio, and several statistics of the fundamental frequency. All these features are utilized and analyzed using a system with two layers. The first layer analyzes the features by using three different neural networks. The second layer processes the output information produced by the first layer by using a dynamic Bayesian network. The system based on prosodic features has shown better performance under variation of the utterance duration. Their third technique is linear prediction analysis, which computes a distance between the formants and the signal spectrum based on the linear prediction cover. The Gaussian distributions of the distance were considered to contain useful information about the age and gender of a speaker. This system failed due to the fact that young and adult speakers have almost the same Gaussian distribution.

This work shares the same goal with previous works in the literature. The previous systems in [22], which are the GMM base, GMM-SVM mean supervector, GMM-SVM-MLLR supervector, GMM-SVM-TPP supervector, and SVM baseline systems, as well as the previous systems in [27], which are the PPR, prosodic feature, and linear prediction analysis systems, used a combination of different popular feature sets to classify speakers' age and gender information. Different from the previous works, our proposed work offers a new feature set that is constructed from the MFCCs and a DNN-based classifier that is designed for speaker age and gender classification. A DNN with a bottleneck layer is used to generate bottleneck features from the MFCCs. These features can be considered a low-dimensional feature set since the bottleneck layer compresses the MFCCs. In addition, a DNN classifier is designed and used instead of combining several classifiers together for a better classification. In [22], it is reported that the highest accuracies are achieved by combining five systems together. On the other hand, our proposed system achieves higher accuracies by using only one classifier.

3. Methodology

In this section, the generation of transformed features and the suggested regularization of DNN weights using shared class labels are explained. We propose an approach to transform existing features into more effective features. MFCCs and their first and second derivatives are used as input features for comparison reasons, since most of the previous studies have used MFCCs features in age and gender classification [6,22,27]; a brief feature-extraction sketch is given below.
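For concreteness, the following is a minimal sketch of how a 39-dimensional frame feature vector (static coefficients plus first and second derivatives) could be computed. It assumes the librosa library; the paper uses one energy coefficient plus 12 MFCCs, whereas here the 0th cepstral coefficient stands in for the energy term, and the 10 ms frame shift is our assumption.

```python
import librosa
import numpy as np

def extract_mfcc_features(wav_path, n_mfcc=13, frame_ms=25, hop_ms=10):
    """Compute 39-dim features per frame: 13 MFCCs + deltas + delta-deltas.

    A sketch only: c0 stands in for the separate energy coefficient
    used in the paper, and hop_ms is an assumed frame shift.
    """
    y, sr = librosa.load(wav_path, sr=None)          # keep native sampling rate
    n_fft = int(sr * frame_ms / 1000)                # 25 ms analysis window
    hop = int(sr * hop_ms / 1000)                    # assumed 10 ms frame shift
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=n_fft, hop_length=hop)
    delta = librosa.feature.delta(mfcc)              # first derivatives
    delta2 = librosa.feature.delta(mfcc, order=2)    # second derivatives
    return np.vstack([mfcc, delta, delta2]).T        # (num_frames, 39)
```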
3.1. Generation of transformed features

New bottleneck features are generated from input features by using a DNN, as shown in Fig. 1. For example, glottal and spectral features can be used to generate a new form of bottleneck features in the speech field.

The DNN that is used to generate these features consists of several hidden layers, one of which has a very small number of units compared to the other layers. The resulting features can be considered a low-dimensional representation, since the bottleneck layer compresses the input features and the output labels to form the new bottleneck features. This is a form of nonlinear dimensionality reduction, since it produces a low-dimensional feature set from the input features based on the nonlinear activation functions used to produce the outputs of the units in the neural network. Recently, the usage of a bottleneck DNN has shown improved results in auto-encoders that reconstruct the input features [28]. In this paper, the bottleneck features are investigated further and used to classify speaker's age and gender.

In this section, the phoneme label extraction and the BNF extraction are introduced. Firstly, the labels are extracted for each frame of all utterances. Then, based on the extracted labels, the BNF extraction generates the transformed MFCCs using a bottleneck layer in a trained DNN.

Fig. 1. The main steps for extracting the BNF features from the input features.

3.1.1. Phoneme label extraction (tied-state triphones)

Usually, each database has a transcript file for each utterance that contains the spoken words. Using the transcript along with the speech audio files, the phonemes are extracted; this process is called the grapheme-to-phoneme phase. The primary function of the HTK toolkit is to build Hidden Markov Models (HMMs) for speech-based tasks such as recognizers [29]. In the field of speech recognition, recognition is performed by mapping the sequence of speech vectors to the desired symbol sequence. Several complications may occur while performing the recognition of speech. For example, the mapping between symbols and speech is not one-to-one; in most cases, a speech vector could be mapped to many symbols. Another complication is the unclear boundary locations between words in speech, which cause incorrect mapping between the speech and the symbols. The HTK tool is designed to address such issues using HMMs. HMMs are used to align phonemes with correct labels, and word isolation is provided to deal with the unclear boundary location problem. In this work, we utilized the HTK tool [29] to find the tied-state triphones, which are used later as labels for the output layer in the DNN.

The steps of finding the tied-state triphones are depicted in Fig. 2 and described below; the context expansion of Step 2 is illustrated in the sketch following Fig. 2.

Step 1: Generate the monophones by considering all of the pronunciations of each utterance in the database. The pronunciation that best matches the speech audio is selected as output.
Step 2: Produce triphones. Monophones are used to produce triphones: the current monophone, X, the previous monophone, L, and the next monophone, R, are processed together.
Step 3: Generate triphones that do not exist in the training data. These are called tied-state triphones.
Step 4: Find the best match between each frame of the speech utterance and the tied-state triphones. The best match will be the phoneme label of the corresponding target frame.

The phoneme labels are commonly used for speech recognition; in this work, they are used to create transformed features. This keeps the phoneme-specific characteristics of each speaker. The phoneme labels also help the DNN to embrace distinctive information in the BNF.

Fig. 2. HTK process for extracting phoneme frame labels.
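The context expansion in Step 2 can be illustrated outside HTK. The sketch below only shows how a monophone sequence is rewritten as L−X+R triphones in HTK-style notation; the actual alignment and state tying are performed by the HTK tools themselves.

```python
def monophones_to_triphones(monophones):
    """Rewrite a monophone sequence as context-dependent triphones.

    Uses HTK-style "L-X+R" notation; utterance boundaries keep the
    available context only. Illustration only, not a replacement
    for HTK's alignment and state-tying steps.
    """
    triphones = []
    for i, phone in enumerate(monophones):
        left = monophones[i - 1] if i > 0 else None
        right = monophones[i + 1] if i < len(monophones) - 1 else None
        name = phone
        if left is not None:
            name = f"{left}-{name}"
        if right is not None:
            name = f"{name}+{right}"
        triphones.append(name)
    return triphones

# Example: a hypothetical phone sequence for the word "speech"
print(monophones_to_triphones(["s", "p", "iy", "ch"]))
# ['s+p', 's-p+iy', 'p-iy+ch', 'iy-ch']
```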


3.1.2. BNF extraction

In this section, we discuss the BNF extraction process. First, we describe the DNN training procedure in its two phases: the generative (unsupervised) phase and the supervised phase. Then, the process of extracting the BNF features based on the trained DNN is explained in the BNF extractor section.

A) DNN training

The first phase is generative. The DNN is pre-trained by using an unsupervised learning technique that employs the RBM. The second phase is discriminative. The DNN is trained by using the back-propagation algorithm in a supervised way. An RBM has an input layer, v (visible layer), where v = {v_1, v_2, …, v_V}, and an output layer, h (hidden layer), where h = {h_1, h_2, …, h_H} [30]. The visible and the hidden layers consist of units. Each unit in the visible layer is connected to all units in the hidden layer. The restriction of this architecture is that there is no connection between the units in the same layer. Two types of RBMs, BB-RBM and GB-RBM [31], are used in this work. In the BB-RBM, the visible and hidden layer unit values are binary, v_i ∈ {0, 1} and h_j ∈ {0, 1}. The energy function of the BB-RBM is defined in Eq. (1):

E(v, h) = − Σ_{i=1}^{V} Σ_{j=1}^{H} v_i h_j w_{ij} − Σ_{i=1}^{V} v_i b_i^v − Σ_{j=1}^{H} h_j b_j^h   (1)

where v_i is the i-th visible unit and h_j is the j-th hidden unit. w_{ij} denotes the weight between the visible unit and the hidden unit, and b_i^v and b_j^h are the biases of the visible unit i and the hidden unit j, respectively. For the GB-RBM, the visible unit values are real, v_i ∈ R, and the hidden unit values are binary, h_j ∈ {0, 1}. The energy function of this model is defined in Eq. (2):

E(v, h) = − Σ_{i=1}^{V} Σ_{j=1}^{H} (v_i / σ_i) h_j w_{ji} + Σ_{i=1}^{V} (v_i − b_i^v)^2 / (2σ_i^2) − Σ_{j=1}^{H} h_j b_j^h   (2)

where σ_i is the standard deviation of the Gaussian noise for the visible unit i. The joint probability distribution associated with a configuration (v, h) is defined in Eq. (3):

p(v, h; θ) = exp(−E(v, h; θ)) / Z   (3)

where θ represents the weights and the biases, and Z is the partition function defined in Eq. (4):

Z = Σ_v Σ_h exp(−E(v, h; θ))   (4)

The RBM is the basic building block of the DBN. It is used as a feature detector and trained in an unsupervised way. The output of a trained RBM is used as input to train another RBM. Training RBMs is very useful for complex problems where the structure of the data is complicated and the implicit features cannot be detected directly [32]. A number of RBMs can be stacked together to represent complex structures and to detect implicit features from the previous RBM representation in the stack. The stacked RBMs represent a generative model called a DBN. The learning algorithm in the DBN is layer-wise and unsupervised. The layer-wise learning helps to find descriptive features that represent correlations between the input data in each layer [33]. The DBN learning algorithm works to optimize the weights between layers. Moreover, it has been shown that initializing the weights between layers in the DBN network enhances the results more than using random weights. Another advantage of DBN training lies in its ability to reduce the effect of over-fitting and under-fitting, both of which are common problems in models with a large number of parameters and deep architectures. After the DBN learning is completed and the weights between the layers in the DBN stack are optimized, the supervised training process is started by adding a final layer of labels on top of the DBN layers. These labels represent the final classes of the whole network. In our work, these labels represent the tied-state triphones of the utterance speech data.

B) BNF extractor

The BNF architecture is generated from a trained DNN in which each layer represents a different internal structure of the input features. In the DNN, the output of each hidden layer produces transformed features. All the layers above the bottleneck layer are removed to produce the BNF extractor, as shown in Fig. 3. Fig. 3 illustrates the proposed bottleneck feature extraction architecture using the phoneme labels. The left side of Fig. 3 shows the pre-training phase in the DBN, consisting of five RBM layers. The first layer is a GB-RBM and the rest are BB-RBMs, with the bottleneck layer located in the middle. The right side of Fig. 3 portrays the DNN architecture, which is formed by adding a softmax output layer on top of the DBN architecture. The weights of the DNN are tuned during the supervised phase.

Introducing a bottleneck layer has many benefits, such as reducing the number of units inside the bottleneck layer, getting rid of redundant values in the input feature set, and reflecting the class labels during the classification process [34,35]. It also helps to capture the descriptive and expressive features of short-time speech utterances [36]. Given a BNF extractor with M layers, the features at the output layer can be extracted using Eq. (5):

l_1(x) = σ( Σ_{n=1}^{N} w (x_n + b_1) )
l_2(x) = σ( Σ_{n=1}^{F_2} w (l_1(x) + b_2) )
⋮
l_M(x) = σ( Σ_{n=1}^{F_M} w (l_{M−1}(x) + b_M) )   (5)

where σ is the logistic function σ(x) = 1/(1 + exp(−x)), x = {x_1, …, x_N} is the feature set vector, and N is the number of input features. l_M is the output of the M-th layer, F is a varying number that represents the input size of each layer in the BNF extractor, w represents the weights between the input and output nodes in each layer, and b represents the bias of each layer. (A numerical sketch of this truncated forward pass is given below, after the figure captions.)

3.2. Regularizing DNN weights using shared class labels

Traditionally, one label is assigned to each class during the regularization of weights. In this work, however, one label is allowed to represent two classes. The two classes sharing the same label are chosen among the most misclassified classes. By sharing the same label, the weights between the DNN layers are forced to converge to an unbiased form with a wider-range representation. Misclassifications between classes are determined by a DNN classifier (Fig. 4A). The two classes having the highest misclassification ratio are chosen to share a label. For example, suppose a database has seven classes, and the highest misclassifications occurred between classes 3 and 5 and between classes 4 and 6. Then five labels are generated: the first label is for class 1, the second label is for class 2, the third label is shared between classes 3 and 5, the fourth label is shared between classes 4 and 6, and the fifth label is for class 7. As shown in Fig. 4B, a second DNN structure calculates the regularized weights. These regularized weights are used as initial weights for the third DNN classifier, as shown in Fig. 4C (a small sketch of the shared-label construction follows below).
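As an illustration of Section 3.2, the sketch below derives a shared-label map from a confusion matrix: the most-confused off-diagonal class pairs are merged into single output labels. It is a minimal sketch under the assumption that the confusion matrix comes from the first DNN in Fig. 4A; the function and variable names are ours, not from the paper.

```python
import numpy as np

def shared_label_map(confusion, num_pairs=2):
    """Map original class ids to (possibly shared) label ids.

    confusion: square matrix, confusion[i, j] = fraction of class i
    predicted as class j. The `num_pairs` most-confused off-diagonal
    pairs share one label each (cf. Fig. 4A/B).
    """
    c = confusion.copy().astype(float)
    np.fill_diagonal(c, 0.0)
    c = c + c.T                      # symmetric confusion between i and j
    merged = {}
    for _ in range(num_pairs):
        i, j = np.unravel_index(np.argmax(c), c.shape)
        merged[j] = i                # class j shares class i's label
        c[i, :] = c[:, i] = c[j, :] = c[:, j] = -1.0  # exclude both classes
    labels, mapping = {}, {}
    for cls in range(confusion.shape[0]):
        root = merged.get(cls, cls)
        if root not in labels:
            labels[root] = len(labels)
        mapping[cls] = labels[root]
    return mapping                   # e.g. 7 classes -> 5 shared labels

# Toy example: classes 2&4 and 3&5 are the most confused (0-indexed)
conf = np.eye(7) * 0.6
conf[2, 4] = conf[4, 2] = 0.3
conf[3, 5] = conf[5, 3] = 0.25
print(shared_label_map(conf))   # {0:0, 1:1, 2:2, 3:3, 4:2, 5:3, 6:4}
```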

Fig. 3. BNF extractor using trained DNN.

Fig. 4. DNN structures. (A) Finding misclassified classes. (B) Training a second DNN with shared class labels to calculate regularized weights. (C) Initializing a third DNN
with regularized weights.
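To make Eq. (5) and the extractor in Fig. 3 concrete, here is a minimal numpy sketch of the BNF forward pass: the trained DNN is truncated at the bottleneck layer and the bottleneck activations are taken as the transformed features. The input and bottleneck sizes follow Section 5.3 (39 × 11 stacked frames, a 39-unit bottleneck); the layer count is abbreviated and the weight values are placeholders.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bnf_extract(x, layers):
    """Forward pass of Eq. (5), truncated at the bottleneck layer.

    x: input feature vector (e.g. 11 stacked frames of 39 features).
    layers: [(W, b), ...] up to and including the bottleneck layer;
    everything above the bottleneck is discarded (Fig. 3).
    """
    h = x
    for W, b in layers:
        h = sigmoid(W @ h + b)
    return h                           # transformed (bottleneck) features

# Placeholder weights with assumed dimensions:
# 429 -> 1024 -> 1024 -> 39 (bottleneck); layers above 39 are removed.
rng = np.random.default_rng(0)
dims = [429, 1024, 1024, 39]
layers = [(rng.standard_normal((dims[i + 1], dims[i])) * 0.01,
           np.zeros(dims[i + 1])) for i in range(len(dims) - 1)]
frame_context = rng.standard_normal(429)          # 39 x 11 stacked frames
print(bnf_extract(frame_context, layers).shape)   # (39,)
```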

4. Classifier design

I-vector and DNN classifiers are used to assess the performance of the transformed MFCCs features. Both are state-of-the-art classifiers that have been used in speaker recognition/verification and language identification [10,12,18].

4.1. I-Vector classifier

The I-Vector is employed as a back-end classifier in our system. The transformed MFCCs feature set is used as the input vector for the classifier, which consists of I-Vector (eigenvoice) extraction, noise removal, and scoring. The I-Vector classifier estimates different classes by using eigenvoice adaptation [37]. The total variability subspace for each utterance is learned from the training data set. Then, the total variability subspace is used to estimate a low-dimensional set from the adapted mean supervectors; these are called identity vectors (I-Vectors). Linear discriminant analysis is applied to reduce the dimension of the extracted I-Vectors by the Fisher criterion [38]. For each utterance, GMM mean vectors are calculated. The UBM supervector M is adapted by stacking the mean vectors of the GMM. It is defined in Eq. (6):

M = m + Tw   (6)

where T represents a low-rank matrix and w represents the required low-dimensional I-Vector. Note that the matrix T is initialized based on the variance of the entire set of utterances in the training database. After the extraction of the I-Vectors, the noise in each I-Vector is removed by Gaussian probabilistic linear discriminant analysis [39]. Finally, given a test utterance, the score between a target class and the test utterance is calculated using the log-likelihood ratio.

4.2. DNN classifier

Recently, the DNN has become one of the most popular classifiers and feature extractors. The DNN classifier consists of more than three layers, including the input and the output layers. In a DNN, each layer is trained based on the features coming from the previous layer's output. Therefore, the further the classifier advances in the training and in the layers, the more complex and generative the generated features become.

A supervised DNN is built to classify the age and gender of each group in the database at the frame level. The input feature set for the network is the frames of each class's utterances, while the output labels represent the classes in the database. After the DNN is trained, all the output activations for each frame of a given class's utterances are accumulated and normalized by performing a feedforward pass on the trained network, to build a model for each class [40]. In the testing process, a new model is created for the test utterance based on the trained network. The cosine similarity between the test utterance model and each class model is computed, and the final classification decision is made by taking the highest cosine similarity (a small scoring sketch follows below).
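A minimal sketch of the scoring rule in Section 4.2: per-class models are built by averaging the network's output activations over a class's frames, a test-utterance model is built the same way, and the decision is the class with the highest cosine similarity. The averaging and length-normalization details are our assumption of one reasonable reading of [40].

```python
import numpy as np

def utterance_model(frame_activations):
    """Average the DNN output activations over an utterance's frames
    and length-normalize the result (assumed normalization)."""
    m = np.mean(frame_activations, axis=0)
    return m / np.linalg.norm(m)

def classify(test_frames, class_models):
    """Pick the class whose model has the highest cosine similarity
    with the test-utterance model (unit vectors: dot = cosine)."""
    t = utterance_model(test_frames)
    scores = {c: float(np.dot(t, m)) for c, m in class_models.items()}
    return max(scores, key=scores.get), scores

# Placeholder data: class_models maps class id -> averaged activations
rng = np.random.default_rng(1)
class_models = {c: utterance_model(rng.random((100, 7))) for c in range(7)}
decision, scores = classify(rng.random((40, 7)), class_models)
print(decision)
```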
5. Experimental results

In this section, the results of the proposed work are presented. These results are obtained after conducting several experiments on a public database. The database is discussed in Section 5.1. Section 5.2 explains the feature set used in this work. In Section 5.3, the settings for the conducted experiments are discussed in detail. Finally, the results of the conducted experiments are presented and discussed in Section 5.4.

5.1. Database

The Database of Age and Gender Annotated Telephone Speech (aGender corpus) consists of 47 hours of prompted and free text. The number of speakers in the database is 945, and it includes 7 mixed classes ranging from 7 to 80 years (Table 1). The number of utterances in the database is 65,364, and the average length of an utterance is 2.58 s. The database is divided into two parts: the training/development set contains 53,076 utterances (770 speakers), while the test set contains 17,332 utterances (175 speakers) [41]. The speech content consists of short commands, single words, and numbers.

Table 1
Age-annotated database of German telephone speech [41].

Class  Age group  Age    Gender
1-C    Children   7–14   Male + Female
2-YF   Young      15–24  Female
3-YM   Young      15–24  Male
4-MF   Middle     25–54  Female
5-MM   Middle     25–54  Male
6-SF   Senior     55–80  Female
7-SM   Senior     55–80  Male

5.2. Feature set

MFCCs are widely used in speech signal processing. In the literature, most speaker age and gender classification works have used MFCCs as input features. For that reason, we chose MFCCs features to evaluate the performance and effectiveness of the proposed approach in generating transformed features; thus the classification accuracies can be compared with previous findings. The overall classification accuracies of the I-Vector and DNN classifiers using the traditional MFCCs and the transformed MFCCs feature sets are presented in Table 2.

5.3. DNN training settings

Each utterance is divided into frames of 25 ms. In total, 39 features, one energy coefficient and 12 MFCCs with their first and second derivatives, are extracted for each frame. The DNN settings used in this work are based on the work in automatic speech recognition in [34,42]. There are 5 hidden layers with 1024 nodes in each layer, except the bottleneck layer, where the number of nodes is 39. The number of nodes in the input layer is equal to the length of the input vector, which has 39 × n features. n is set to 11 after a rigorous trial-and-error process. The 11 sequential frames are the target frame and the previous and next (n−1)/2 frames. The number of nodes in the bottleneck layer is set to the number of input features, which is 39. The number of nodes in the output layer is set to the number of tied-state triphones in the database, which is 4400. The training data is divided into mini batches, each consisting of 1024 utterances. 10 epochs are used for training the GB-RBM over all the training data, while 12 epochs are used for the rest of the BB-RBMs. The learning rate for the GB-RBM and BB-RBMs is 0.0025. In the fine-tuning phase, 12 epochs are used. The learning rate is initially set to 0.1 for the first 6 epochs, and then it is decreased to one-half of its initial value for the remaining epochs.

In this work, the DNN is used as a classifier as well. DNN training is problem dependent; therefore, many experiments should be carried out to find the optimal settings for a successful classification. After extensive experiments, the DNN is built with 5 hidden layers of 1024 nodes each. The input data is the BNF features, and the number of epochs is 16. The learning rate was initially set to 0.1 for the first 3 epochs, then decreased to 0.8 times the old learning rate every two epochs. The momentum value was started at 0.5 for the first 3 epochs and then increased to 0.9 for the remaining epochs. The same settings are used to find the shared labels between the misclassified classes in order to obtain the initial weights for the classifier, as described in Section 3.2.

5.4. Results

Several experiments have been conducted to evaluate the performance of the proposed work. As shown in Table 2, the overall classification accuracy using the transformed MFCCs is 56.13% and 58.98% for the I-vector and DNN classifiers, respectively. On the other hand, the classification accuracies using the traditional MFCCs are calculated as 43.60% and 45.89% for the same classifiers. The classification accuracies of the MF, MM, SF, and SM classes increased drastically. The statistical analysis of MFCCs features in the age and gender problem was studied by Barkana and Zhou [6]. They reported that MFCCs features have a near-identical distribution and flatness for all age groups of female and male speakers; as a result, the recognition of different age groups becomes difficult using MFCCs. The transformed MFCCs that are generated for the first time in this work increased the overall classification accuracy by about 13%. One of the reasons for this improvement is that the transformed MFCCs features represent the prosodic features in addition to spectral features. The involvement of the phoneme labels in the generation of the transformed MFCCs made it possible to grasp the prosodic features of a speaker, such as intonation, stress, tone, and rhythm. Another reason is that the transformed features are the result of using phoneme labels in the training data, which helped to remove noisy or silent frames so that the transformed features are calculated without acoustic background noise.

Fig. 5 shows the receiver operating characteristics (ROC) of the transformed and traditional MFCCs (with random and regularized weights) using the DNN and I-vector classifiers. The ROC curves are calculated by using the one-against-all rule (a small sketch follows below).
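A minimal sketch of the one-against-all ROC/AUC computation behind Fig. 5 and Table 3, assuming scikit-learn. The per-class score matrix is a placeholder, and the overall figure here is a simple unweighted mean across classes, following [43] only in spirit.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc
from sklearn.preprocessing import label_binarize

def one_against_all_auc(y_true, scores, num_classes=7):
    """Per-class ROC/AUC with the one-against-all rule.

    y_true: (n,) integer class labels; scores: (n, num_classes)
    classifier scores (placeholder values in the demo below).
    """
    y_bin = label_binarize(y_true, classes=list(range(num_classes)))
    aucs = {}
    for c in range(num_classes):
        fpr, tpr, _ = roc_curve(y_bin[:, c], scores[:, c])
        aucs[c] = auc(fpr, tpr)
    overall = float(np.mean(list(aucs.values())))  # assumed unweighted mean
    return aucs, overall

rng = np.random.default_rng(2)
y = rng.integers(0, 7, size=500)
s = rng.random((500, 7))
per_class, overall = one_against_all_auc(y, s)
print(overall)   # ~0.5 for random scores
```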

Table 2
The overall classification accuracies of the DNN and I-Vector classifiers using the traditional and the transformed MFCCs (%). Bold values represent the overall performances.

Classifier                    Feature set        C      YF     YM     MF     MM     SF     SM     Overall Acc.
I-vector                      Traditional MFCCs  64.86  57.12  49.01  24.50  27.03  49.91  32.80  43.60
                              Transformed MFCCs  60.33  66.49  48.00  45.46  48.56  56.89  67.15  56.13
DNN with regularized weights  Traditional MFCCs  54.33  52.60  44.80  25.13  42.33  46.13  55.87  45.89
                              Transformed MFCCs  62.23  61.54  53.38  47.69  52.00  64.23  70.77  58.98
DNN with random weights       Traditional MFCCs  56.53  47.27  49.07  27.53  35.33  36.13  53.80  43.67
                              Transformed MFCCs  59.69  60.15  48.85  40.08  52.23  60.92  63.38  55.04

Fig. 5. ROC curves of different classifier scenarios. A) The DNN classifier with regularized weights and the traditional MFCCs. B) The DNN classifier with regularized weights and the transformed MFCCs. C) The DNN classifier with random weights and the traditional MFCCs. D) The DNN classifier with random weights and the transformed MFCCs. E) The I-vector classifier using the traditional MFCCs. F) The I-vector classifier using the transformed MFCCs.

Table 3
Corresponding AUC measurements for classification of speaker's age and gender. Bold values represent the overall performances.

             DNN regularized weights               DNN random weights                  I-vector
Class        Traditional  Transformed MFCCs        Traditional  Transformed MFCCs     Traditional  Transformed MFCCs
C            0.86         0.87                     0.86         0.87                  0.88         0.90
YF           0.81         0.89                     0.82         0.88                  0.88         0.89
YM           0.81         0.89                     0.81         0.87                  0.78         0.80
MF           0.76         0.87                     0.75         0.86                  0.79         0.83
MM           0.85         0.87                     0.85         0.87                  0.76         0.87
SF           0.74         0.89                     0.71         0.88                  0.63         0.68
SM           0.86         0.92                     0.85         0.91                  0.89         0.92
Overall AUC  0.81         0.89                     0.80         0.88                  0.80         0.84

The area under the curve (AUC) for the transformed MFCCs is found to be larger than that for the traditional MFCCs (Table 3 compares the AUC for both sets). The AUC values are calculated as in [43]. The DNN classifier performs better than the I-vector classifier in terms of AUC.

Comparing the classifiers, the DNN classifier performed slightly better than the I-vector classifier. Fig. 6 shows the variance of the weights at each layer in the DNN classifier when using random weights and regularized weights. Higher variance between the weights in each layer is needed to distinguish different classes. As can be seen in Fig. 6, the variance between the weights using shared labels is higher than that of the randomly initialized weights; therefore, the regularized weights converge faster than the random weights for most of the DNN layers (a small sketch of this per-layer variance measurement follows below).

Table 4 presents the confusion matrix obtained using the I-vector classifier. It can be seen that the children (C), young female (YF), and senior (SM, SF) classes are classified with higher accuracies compared to the other classes. The major misclassifications occurred among the same-gender classes. The young female (YF) and senior male (SM) classes have the highest accuracy rates and are correctly classified at 66.49% and 67.15%, respectively. The middle and senior female groups (MF, SF) are classified with accuracies of 45.46% and 56.89%. The children (C) and young male (YM) classes achieved accuracies of 60.33% and 48%.
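The per-layer quantity plotted in Fig. 6 can be computed as in the minimal sketch below: after each epoch, the variance of each layer's weight matrix is recorded for the regularized and the randomly initialized networks, and the two curve families are compared. The bookkeeping structure and the stand-in weights are assumptions for illustration.

```python
import numpy as np

def layer_weight_variances(weight_matrices):
    """Variance of the weights in each layer (one value per layer),
    as tracked per epoch for Fig. 6."""
    return [float(np.var(W)) for W in weight_matrices]

# Assumed bookkeeping: record the variances after every epoch for
# both initializations and compare the resulting curves.
history = {"regularized": [], "random": []}
rng = np.random.default_rng(3)
for epoch in range(8):                      # Fig. 6 plots epochs 1-8
    for init in history:
        # stand-in weights; in practice these come from the two DNNs
        fake_layers = [rng.standard_normal((1024, 1024)) * 0.01
                       for _ in range(5)]
        history[init].append(layer_weight_variances(fake_layers))
print(np.array(history["regularized"]).shape)   # (epochs, layers)
```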

Fig. 6. Variance versus epoch number graphs of regularized and random weights between layers. The x-axis represents the epoch number (1–8), and the y-axis represents the variance (scaled by 1000).

Table 4
Confusion matrix of the I-vector classifier using the transformed MFCCs set (%). Bold values represent the classification accuracies.

Actual \ Predicted   C      YF     YM     MF     MM     SF     SM
C                    60.33  27.90  1.5    4.80   0      2.88   2.59
YF                   21.08  66.49  2.70   6.85   0      2.88   0
YM                   8.89   1.62   48     0.18   18.97  10.99  11.35
MF                   3.60   16.85  2.52   45.46  2.16   29.23  0.18
MM                   3.42   1.26   24.43  4.14   48.56  2.34   15.85
SF                   7.41   11.17  5.23   13.18  1.44   56.89  4.68
SM                   4.50   0.72   11.53  0      15.56  0.54   67.15

Table 5
Confusion matrix of the DNN classifier using the transformed MFCCs set (%). Bold values represent the classification accuracies.

Actual \ Predicted   C      YF     YM     MF     MM     SF     SM
C                    63.23  15.38  4.08   3.31   5.08   4.54   4.38
YF                   15.92  61.54  0      11.08  0.54   10.23  0.69
YM                   1.62   0.62   53.38  2.38   24.46  2.15   15.38
MF                   3.38   16.08  2.15   47.69  0.77   28.85  1.08
MM                   0.69   0.92   21.77  0.85   52     2.23   21.54
SF                   4.69   8.92   1.77   16.23  0.923  64.23  3.23
SM                   0.46   0.31   11.85  0.38   13.69  2.54   70.77

Table 6
Confusion matrix of the DNN classifier using the traditional MFCCs set (%). Bold values represent the classification accuracies.

Actual \ Predicted   C      YF     YM     MF     MM     SF     SM
C                    54.33  22.88  2.67   6.13   0.73   11.13  2.13
YF                   13.00  52.60  0.40   16.47  0.20   16.93  0.40
YM                   0.87   1.00   44.80  2.13   26.20  4.60   20.40
MF                   4.40   26.47  1.73   26.13  1.53   37.67  2.07
MM                   1.07   0.80   30.93  1.40   42.33  2.60   20.87
SF                   4.27   16.20  3.00   23.27  1.20   46.13  5.93
SM                   1.07   0.47   10.98  0.67   26.87  4.07   55.87

Table 5 and Table 6 present the confusion matrices of the DNN classifier using the transformed and traditional MFCCs with regularized weights. In Table 5, the class SM is classified with the highest accuracy (70.77%), while the classes YF, C, and SF are correctly classified with accuracies ranging between 61% and 64%. The classification accuracies of the MM and YM classes are calculated as 52% and 53.38%, respectively. The lowest accuracy, 47.69%, was achieved by the MF class. It is observed that the highest misclassification rates have always occurred between classes with the same gender and close age, or between the children and young female classes.

By comparing the classification accuracies of each class in Table 5 and Table 6, the transformed MFCCs improve the DNN performance by about 10% for the classes C, YF, YM, and MM and by 15–20% for the classes MF, SF, and SM. This observation can also be seen in the AUC measurements in Table 3. In their work, Barkana and Zhou [6] reported that the traditional MFCCs of middle-aged female (MF) speakers and senior female (SF) speakers have very similar characteristics, leading to misclassifications between these two classes. The proposed transformed MFCCs decreased the misclassifications between these two classes significantly, since phoneme labels are used in generating the transformed features. The transformed features contain phoneme-specific characteristics of each speaker in addition to the spectral characteristics.

Richardson et al. stated in their work [44] that features aligned with phonetic labels or posteriors still contain speaker-dependent and phonetically discriminative information that is useful for speaker verification. Sarkar et al. [45] reported that "… The results show that the phonetically discriminative MLP features retain speaker-specific information which is complementary to the short-term cepstral features…". Braun et al. [46] studied the effects of language on estimating a speaker's age. They found that the estimation of the speaker's age was language independent, and the listeners did not gain from their knowledge of the corresponding language. We can conclude that the BNFs, which are based on phonetic labels, retain speaker-dependent information. As a result, BNFs are language independent.

Table 7
Overall performance comparison in speaker's age and gender classification. Bold values represent the performances of the proposed systems in this work.

System                       Overall Acc. (%)
GMM base                     43.1
Mean Super Vector            42.6
MLLR Super Vector            36.2
TPP Super Vector             37.8
SVM Base                     44.6
MFuse 1 + 2                  45.2
MFuse 3 + 4                  40.3
MFuse 1 + 2 + 3 + 4          50.4
MFuse 1 + 2 + 3 + 4 + 5      52.7
BNF-I-vector (This work)     56.13
BNF-DNN (This work)          58.98

The overall accuracies of the previous studies using the aGender database and the MFCCs feature set are listed in Table 7. The classification accuracies for these systems are reported in Li et al. [22]. The highest reported classification accuracy in the literature is 52.7%, which is achieved by the MFuse 1 + 2 + 3 + 4 + 5 classifier, a combination of the GMM base, Mean Supervector, MLLR, TPP, and SVM base systems. The GMM baseline, SVM baseline, and GMM-SVM mean supervector systems have achieved better accuracies than the more complex GMM-SVM MLLR supervector and GMM-SVM TPP supervector systems.

Combining more systems together did not provide higher classification accuracies. In the GMM base system, a 39-dimensional MFCCs feature set per frame is extracted. A UBM along with MAP is used to build the class model in a supervised manner. The overall accuracy for that model is reported as 43.1%. The next system, the Mean Super Vector, used the GMM baseline system to extract the feature set and to train the UBM model. The mean vectors of all the Gaussian components are concatenated to form the GMM supervectors, which are modeled by an SVM. The overall accuracy of this system is stated as 42.6%. In the MLLR supervector system, MLLR supervectors and an SVM are used to train multi-class models and to score the test set. The UBM technique is conducted using the MLLR adaptation for all samples in the training set in order to extract the corresponding MLLR supervectors. The overall accuracy of the MLLR system is reported as 36.2%, as shown in Table 7. Another variation of the GMM-SVM mean supervector method is the TPP supervector, which is calculated as a probability distribution over all Gaussian components. In this method, the KL-divergence [47] is used to measure the similarity between vectors. In TPP, a UBM model is trained independently as the age and gender models. The overall accuracy for this system is given as 37.8%. In the SVM base system, 450-dimensional acoustic features such as F0, jitter, and shimmer, along with the MFCCs feature set per utterance, are extracted. The corresponding features are used as inputs for an SVM classifier, achieving an overall classification accuracy of 44.6%.

The proposed work achieved higher overall accuracies with both the BNF-I-vector and BNF-DNN classifiers (56.13% and 58.98%) compared to previous works on the aGender database in the literature. The transformed MFCCs set proved to be more effective than the traditional MFCCs features in speaker age and gender classification. There are two main reasons behind this improvement. First, introducing phoneme labels to create BNFs for the age and gender problem has a significant impact on the BNFs, which become more discriminative and descriptive. Through the phoneme labels, phonetic components in a speaker's speech signal have been captured and used in detecting the speaker's age and gender information. Second, the regularized weights converged faster and provided higher variance between classes. These improvements boosted the performance of the classifiers.

6. Conclusions, challenges, and future work

The goal of this paper is to improve the classification accuracies in speaker age and gender classification. For this purpose, major contributions are made in the areas of feature extraction and classifier design. First, a novel approach is introduced to generate the transformed MFCCs feature set. Second, the classifier weights are regularized by using shared labels. Although MFCCs are one of the most popular feature sets in speech signal processing, they have proved to be ineffective in speaker age and gender classification in the literature. To improve on the performance of the traditional MFCCs, the transformed MFCCs feature set is generated by using the BNF extractor. In the BNF extractor, phoneme labels are used to capture phonetic components in the speech. We showed that a DNN can be designed and trained to adapt smoothly to the BNF extractor, so that new transformed features can be obtained. The shared labels are used to regularize the weights between DNN layers. The regularized weights provided faster convergence and higher variance between classes. The performance of the transformed MFCCs is evaluated by two classifiers: DNN and I-Vector. The results showed a significant improvement in the classification accuracies. The overall accuracy of the proposed work is 58.98% and 56.13% for the DNN and I-vector, respectively. To the best of our knowledge, our work is the first step in applying DNN techniques and the I-vector classifier to the speaker age and gender classification problem.

Many challenges were encountered during the development of the proposed work. Tied-state triphones needed to be used as labels on the DNN output layer. As a solution, a trained GMM-HMM model was used to generate the required labels. The transcript file for each utterance was also required in the label generation process. It was noticed that some of the utterance transcripts in the aGender database were incomplete or inaccurate; to address this issue, the database transcript files have been fixed and refined. The optimization of the DNN parameters, such as the number of layers, the number of units in each layer, the weight initialization, and the learning rate, is problem-dependent. The optimal parameter settings differ from one problem to another. Therefore, different experiments were conducted to find the optimal settings for each DNN. Optimizing one parameter alone does not optimize the rest of the parameters; thus, the parameters should be tuned together to reach the optimal settings. The computation time for training a DNN depends on different factors: the size of the database (for speech utterances, there were millions of concatenated frames), the number of features for each sample, the number of layers, and the number of epochs. To overcome the limited computation resources, we utilized two NVIDIA TITAN X GPGPU devices connected to a single host to make the computation faster through the parallel power of the GPGPUs.

One possible challenge in this work is the extraction of the tied-state triphones. Most age and gender speech databases do not come with tied-state triphones. As a result, the tied-state triphones should be carefully extracted to implement the proposed work on a database. The extraction process has to satisfy several requirements, such as the availability of transcripts of the speech utterances and special speech software (such as HTK), and it involves many steps that take time. Another possible challenge of this work is to find the classes most misclassified with each other in order to use the shared labels technique.

The proposed work achieved higher classification accuracies than the previous systems in the literature. This work proves that the bottleneck features can offer better speaker-dependent information. In this work, we only utilized MFCCs to generate the bottleneck features. As future work, other time-domain and frequency-domain features, such as fundamental frequency, pitch-range-based features, and linear predictive coefficients, can be used as input features. It is expected that the classification accuracies would be improved further. In addition, different types of deep neural networks, such as convolutional neural networks (CNNs), can be used alone or along with the DNN classifier. In this work, the I-vector was used as a classifier since it is one of the state-of-the-art techniques in several fields such as language identification and speaker verification. We plan to investigate the usage of the extracted BNF features as input for the I-vector in order to model the corresponding I-vector for each utterance. The resulting I-vectors will be used as a new feature set that could be fed to any classifier. Since the I-vector utilizes techniques such as within-class covariance, it might help to represent the speech utterances in a more distinguishable way. We also plan to fine-tune several DNN architectures jointly by using a new cost function, where each DNN will have a different feature set. The DNNs will be trained jointly and simultaneously. This is expected to improve the classification accuracies by representing speech utterances distinctively, since it will utilize more than one feature set for each utterance and include more information about the speaker.

References

[1] M. Black, A. Katsamanis, C.-C. Lee, A.C. Lammert, B.R. Baucom, A. Christensen, et al., Automatic classification of married couples' behavior using audio features, in: INTERSPEECH, 2010, pp. 2030–2033.
[2] P. Nguyen, D. Tran, X. Huang, D. Sharma, Automatic speech-based classification of gender, age and accent, in: B.-H. Kang, D. Richards (Eds.), Knowledge Management and Acquisition for Smart Systems and Services, vol. 6232, Springer Berlin Heidelberg, 2010, pp. 288–299.
[3] T. Schultz, Speaker characteristics, in: C. Müller (Ed.), Speaker Classification I, vol. 4343, Springer Berlin Heidelberg, 2007, pp. 47–74.
[4] S.B. Davis, P. Mermelstein, Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences, Acoust. Speech Signal Process. IEEE Trans. 28 (1980) 357–366.
[5] H.-J. Kim, K. Bae, H.-S. Yoon, Age and gender classification for a home-robot service, in: Robot and Human Interactive Communication, 2007. RO-MAN 2007. The 16th IEEE International Symposium on, 2007, pp. 122–126.
[6] B.D. Barkana, J. Zhou, A new pitch-range based feature set for a speaker's age and gender classification, Appl. Acoust. 98 (2015) 52–61.
[7] M.D. Zeiler, Hierarchical Convolutional Deep Learning in Computer Vision, New York University, 2013.
[8] M. Ranzato, G.E. Hinton, Modeling pixel means and covariances using factorized third-order Boltzmann machines, in: Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, 2010, pp. 2551–2558.
[9] C. Ekanadham, S. Reader, H. Lee, Sparse deep belief net models for visual area V2, Adv. Neural Inf. Process. Syst. 20 (2008).
[10] G.E. Dahl, D. Yu, L. Deng, A. Acero, Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition, Audio Speech Lang. Process. IEEE Trans. 20 (2012) 30–42.
[11] T. Deselaers, S. Hasan, O. Bender, H. Ney, A deep learning approach to machine transliteration, in: Proceedings of the Fourth Workshop on Statistical Machine Translation, 2009, pp. 233–241.
[12] D. Yu, S. Wang, Z. Karam, L. Deng, Language recognition using deep-structured conditional random fields, in: Acoustics Speech and Signal Processing (ICASSP), 2010 IEEE International Conference on, 2010, pp. 5030–5033.
[13] G.E. Hinton, S. Osindero, Y.-W. Teh, A fast learning algorithm for deep belief nets, Neural Comput. 18 (2006) 1527–1554.
[14] Y. Bengio, Learning deep architectures for AI, Found. Trends Mach. Learn. 2 (2009) 1–127.
[15] J.M. Baker, L. Deng, J. Glass, S. Khudanpur, C.-H. Lee, N. Morgan, et al., Developments and directions in speech recognition and understanding, Part 1 [DSP Education], Signal Process. Mag. IEEE 26 (2009) 75–80.
[16] D. Matrouf, N. Scheffer, B.G. Fauve, J.-F. Bonastre, A straightforward and efficient implementation of the factor analysis model for speaker verification, in: INTERSPEECH, 2007, pp. 1242–1245.
[17] M. Senoussaoui, P. Kenny, N. Dehak, P. Dumouchel, An I-Vector extractor suitable for speaker recognition with both microphone and telephone speech, in: Odyssey, 2010, p. 6.
[18] N. Dehak, P.A. Torres-Carrasquillo, D.A. Reynolds, R. Dehak, Language recognition via i-vectors and dimensionality reduction, in: INTERSPEECH, 2011, pp. 857–860.
[19] E.D. Mysak, Pitch and duration characteristics of older males, J. Speech Hearing Res. (1959).
[20] N. Minematsu, M. Sekiguchi, K. Hirose, Automatic estimation of one's age with his/her speech based upon acoustic modeling techniques of speakers, in: Acoustics, Speech, and Signal Processing (ICASSP), 2002 IEEE International Conference on, 2002, pp. I-137–I-140.
[21] C. Muller, F. Wittig, J. Baus, Exploiting speech for recognizing elderly users to respond to their special needs, in: Eighth European Conference on Speech Communication and Technology, 2003.
[22] M. Li, C.-S. Jung, K.J. Han, Combining five acoustic level modeling methods for automatic speaker's age and gender recognition, in: INTERSPEECH, 2010, pp. 2826–2829.
[23] D.A. Reynolds, T.F. Quatieri, R.B. Dunn, Speaker verification using adapted Gaussian mixture models, Digital Signal Process. 10 (2000) 19–41.
[24] M. Li, H. Suo, X. Wu, P. Lu, Y. Yan, Spoken language identification using score vector modeling and support vector machine, in: INTERSPEECH, 2007, pp. 350–353.
[25] A. Stolcke, S.S. Kajarekar, L. Ferrer, E. Shrinberg, Speaker recognition with session variability normalization based on MLLR adaptation transforms, Audio Speech Lang. Process. IEEE Trans. 15 (2007) 1987–1998.
[26] X. Zhang, S. Hongbin, Z. Qingwei, Y. Yonghong, Using a kind of novel phonotactic information for SVM based speaker recognition, IEICE Trans. Inf. Syst. 92 (2009) 746–749.
[27] F. Metze, J. Ajmera, R. Englert, U. Bub, F. Burkhardt, J. Stegmann, et al., Comparison of four approaches to age and gender recognition for telephone applications, in: Acoustics, Speech and Signal Processing, 2007. ICASSP 2007. IEEE International Conference on, 2007, pp. IV-1089–IV-1092.
[28] G.E. Hinton, R.R. Salakhutdinov, Reducing the dimensionality of data with neural networks, Science 313 (2006) 504–507.
[29] S. Young, G. Evermann, M. Gales, T. Hain, D. Kershaw, X. Liu, et al., The HTK Book, vol. 3, Cambridge University Engineering Department, 2002, p. 175.
[30] G. Hinton, Training products of experts by minimizing contrastive divergence, Neural Comput. 14 (2002) 1771–1800.
[31] A.-r. Mohamed, D. Yu, L. Deng, Investigation of full-sequence training of deep belief networks for speech recognition, in: INTERSPEECH, 2010, pp. 2846–2849.
[32] L. Deng, A tutorial survey of architectures, algorithms, and applications for deep learning, APSIPA Trans. Signal Inf. Process. 3 (2015).
[33] A. Mohamed, T.N. Sainath, G. Dahl, B. Ramabhadran, G.E. Hinton, M.A. Picheny, Deep belief networks using discriminative features for phone recognition, in: Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, 2011, pp. 5060–5063.
[34] Y. Bao, H. Jiang, C. Liu, Y. Hu, L. Dai, Investigation on dimensionality reduction of concatenated features with deep neural network for LVCSR systems, in: Signal Processing (ICSP), 2012 IEEE 11th International Conference on, 2012, pp. 562–566.
[35] F. Grézl, M. Karafiát, S. Kontár, J. Cernocky, Probabilistic and bottle-neck features for LVCSR of meetings, in: Acoustics, Speech and Signal Processing, 2007. ICASSP 2007. IEEE International Conference on, 2007, pp. IV-757–IV-760.
[36] F. Grézl, P. Fousek, Optimizing bottle-neck features for LVCSR, in: Acoustics, Speech and Signal Processing, 2008. ICASSP 2008. IEEE International Conference on, 2008, pp. 4729–4732.
[37] P. Kenny, G. Boulianne, P. Ouellet, P. Dumouchel, Joint factor analysis versus eigenchannels in speaker recognition, Audio Speech Lang. Process. IEEE Trans. 15 (2007) 1435–1447.
[38] N. Dehak, P. Kenny, R. Dehak, P. Dumouchel, P. Ouellet, Front-end factor analysis for speaker verification, Audio Speech Lang. Process. IEEE Trans. 19 (2011) 788–798.
[39] P. Kenny, Bayesian speaker verification with heavy-tailed priors, in: Odyssey, 2010, p. 14.
[40] E. Variani, X. Lei, E. McDermott, I. Lopez Moreno, J. Gonzalez-Dominguez, Deep neural networks for small footprint text-dependent speaker verification, in: Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on, 2014, pp. 4052–4056.
[41] F. Burkhardt, M. Eckert, W. Johannsen, J. Stegmann, A database of age and gender annotated telephone speech, in: LREC, 2010.
[42] F. Seide, G. Li, X. Chen, D. Yu, Feature engineering in context-dependent deep neural networks for conversational speech transcription, in: Automatic Speech Recognition and Understanding (ASRU), 2011 IEEE Workshop on, 2011, pp. 24–29.
[43] D.J. Hand, R.J. Till, A simple generalisation of the area under the ROC curve for multiple class classification problems, Mach. Learn. 45 (2001) 171–186.
[44] F. Richardson, D. Reynolds, N. Dehak, A unified deep neural network for speaker and language recognition, 2015, arXiv:1504.00923.
[45] A.K. Sarkar, C.-T. Do, V.-B. Le, C. Barras, Combination of cepstral and phonetically discriminative features for speaker verification, IEEE Signal Process. Lett. 21 (2014) 1040–1044.
[46] A. Braun, L. Cerrato, Estimating speaker age across languages, in: Proceedings of ICPhS, 1999, pp. 1369–1372.
[47] J.R. Hershey, P. Olsen, Approximating the Kullback Leibler divergence between Gaussian mixture models, in: Acoustics, Speech and Signal Processing, 2007. ICASSP 2007. IEEE International Conference on, 2007, pp. IV-317–IV-320.
