1 s2.0 S0950705116303926 Main
Knowledge-Based Systems
journal homepage: [Link]/locate/knosys
Article info

Article history:
Received 27 April 2016
Revised 28 September 2016
Accepted 7 October 2016
Available online 20 October 2016

Keywords:
Deep neural network
DNN
I-Vector
MFCCs
Speaker age and gender classification

Abstract

Speaker age and gender classification is one of the most challenging problems in speech processing. Although many studies have focused on feature extraction and classifier design, classification accuracies are still not satisfactory. The key issue in identifying a speaker's age and gender is to generate robust features and to design an in-depth classifier. Age and gender information is concealed in the speaker's speech, which is affected by many factors such as background noise, speech content, and phonetic divergences. The success of the DNN architecture in many applications motivated this work to propose a new speaker age and gender classification system that uses a BNF extractor together with a DNN. This work has two major contributions: the introduction of shared class labels among misclassified classes to regularize the weights in the DNN, and the generation of a transformed MFCCs feature set. The proposed system uses HTK to find tied-state triphones for all utterances, which are used as labels for the output layer in the DNNs for the first time in age and gender classification. The BNF extractor is used to generate transformed MFCCs features. The new features are evaluated with two classifiers, DNN and I-Vector. It is observed that the transformed MFCCs are more effective than the traditional MFCCs in speaker age and gender classification. By using the transformed MFCCs, the overall classification accuracies are improved by about 13%.
© 2016 Elsevier B.V. All rights reserved.
0950-7051/© 2016 Elsevier B.V. All rights reserved.
6 Z. Qawaqneh et al. / Knowledge-Based Systems 115 (2017) 5–14
is text-independent; background noise is present; the number of utterances varies for each class; and there are seven classes. The highest reported classification accuracy for this database is around 60% by using a combination of several feature sets [6].

In the last few years, DNNs have been used effectively for feature extraction and classification in computer vision [7–9], image processing and classification [8,10], and natural language recognition [11,12]. In 2006, Hinton et al. [13] introduced the RBM for the first time as a keystone for training DBN. Later, Bengio [14] successfully proposed a new way to train DNN by using auto encoders. DNN has a deep architecture that transforms rich input features into a strong internal representation [15]. One of the most recent popular techniques is the eigenvoice (I-Vector), which is based on the process of joint factor analysis [16]. Currently, it is considered one of the state-of-the-art techniques in the field of speaker recognition and language detection [17,18]. Eigenvoice adaptation is the main procedure to estimate the I-Vector, which represents a low-dimensional latent factor for each class in a corpus. Test data is scored by a linear strategy that computes the log-likelihood ratio between different classes.

This paper is organized as follows. A brief literature review is provided in Section 2. In Section 3, the methodology of the proposed work is explained. The classifier design is introduced in Section 4. Experimental results and their analysis are presented in Section 5. The conclusion, challenges, and future work follow in Section 6.

2. Literature review

The problem of age and gender classification was studied as early as the 1950s [19], but computer-aided systems for deriving age and gender information from speech have been developed only recently [20,21]. Li et al. [22] utilized various acoustic and prosodic methods to improve accuracies by using two or more fusion systems such as GMM base, GMM-SVM mean super vector, GMM-SVM-MLLR super vector, GMM-SVM TPP super vector, and SVM baseline system. Their GMM system used 13-dimensional MFCCs features and their first and second derivatives per frame as input. Cepstral mean subtraction and variance normalization are performed to get zero mean and unit variance on their database. UBM and MAP techniques [23] are used to model different age and gender classes in a supervised manner for GMM training. Their system achieved an overall accuracy of 43.1%. The other system proposed by Li et al. [22] is the GMM-SVM mean super vector system, which is considered an acoustic-level approach for speaker's age and gender classification. The GMM baseline system is used for extracting features and for training the UBM model. The mean vectors of all the Gaussian components are concatenated to form the GMM super vectors, which are then modeled by SVMs. One of the advantages of their work is the usage of two-stage frameworks as in [24], which address the computer memory limitation of large database training, instead of directly training a multi-class SVM classifier on all the high-dimensional super vectors. This system achieved a 42.6% overall accuracy for the aGender database.

In the GMM-SVM MLLR system, the MLLR adapts the means of the UBM for each utterance to extract the features of the super vector [25]. SVM is used to model the resulting MLLR matrix super vector. Dimension reduction on the MLLR super vector space is done by linear discriminant analysis. It is important to mention that the MLLR matrix contains speaker-specific characteristics, and the contents of this transformed matrix are used as feature super vectors for speaker modeling and age and gender recognition. The MLLR system achieved an overall accuracy of 36.2%.

The GMM-SVM TPP super vector is calculated as a probability distribution over all Gaussian components. In this method, the KL-divergence is used to measure the similarity between vectors. The usage of KL-divergence provides discriminative information, which helps in getting information about the age and gender of a speaker. In TPP, UBM models are trained independently. Therefore, each UBM component can model some underlying phonetic sounds [26]. The TPP system's overall accuracy is calculated as 37.8%. The SVM baseline system, using 450-dimensional acoustic features [22] and several prosodic features such as F0, F0 envelope, jitter, and shimmer, is designed to capture the age and gender information at the prosodic level. This system achieved an overall accuracy of 44.6% for the aGender database.

In [22] it is shown that combining these methods will result in low computational cost. Moreover, a score-level fusion of different numbers of systems is used. Each system carries information complementary to the other systems. The highest accuracy (52.7%) is attained when the five systems are combined.

Metze et al. [27] studied different techniques for age and gender classification based on telephone applications. They also compared the performance of their system to human listeners. Their first technique, PPR, is one of the early systems built to deal with automatic sound recognition and language identification problems. The core of this system is to create a PPR for each class in the age and gender database. They reported that the PPR system performs almost like human listeners, with the disadvantage of losing quality and accuracy on short utterances. Their second technique is based on prosodic features. This technique uses several prosodic features: jitter, shimmer, statistical information of the harmonics-to-noise ratio, and several statistics of the fundamental frequency. All these features are utilized and analyzed using a system with two layers. The first layer analyzes the features by using three different neural networks. The second layer processes the output of the first layer by using a dynamic Bayesian network. The system based on prosodic features has shown better performance on variation of the utterance duration. Their third technique is the linear prediction analysis, which computes a distance between the formants and the signal spectrum based on the linear prediction cover. The Gaussian distributions of the distance were considered to contain useful information about the age and gender of a speaker. This system failed because young and adult speakers have almost the same Gaussian distribution.

This work shares the same goal with previous works in the literature. The previous systems in [22] (GMM base, GMM-SVM-Mean supervector, GMM-SVM-MLLR supervector, GMM-SVM-TPP supervector, and SVM baseline system), as well as the previous systems in [27] (PPR, prosodic feature, and linear prediction analysis systems), used a combination of different popular feature sets to classify speakers' age and gender information. Unlike the previous works, our proposed work offers a new feature set that is constructed from the MFCCs and a DNN-based classifier that is designed for speaker's age and gender classification. A DNN with a bottleneck layer is used to generate bottleneck features from the MFCCs. These features can be considered a low-dimensional feature set since the bottleneck layer compresses the MFCCs. In addition, a DNN classifier is designed and used instead of combining several classifiers together for a better classification. In [22], it is reported that the highest accuracies are achieved by combining five systems together. On the other hand, our proposed system achieves higher accuracies by using only one classifier.

3. Methodology

In this section, the generation of transformed features and the suggested regularized DNN weights using shared class labels are explained. We propose an approach to transform existing features into more effective features. MFCCs, their first and second derivatives are used as input features for comparison reasons since most
3.1.2. BNF extraction

In this section, we discuss the BNF extraction process. First, we describe the DNN training procedure in its two phases: the generative (unsupervised) and the supervised. Then, the process of extracting the BNF features based on the trained DNN is explained in the BNF extractor section.

A) DNN training

The first phase is generative. The DNN is pre-trained by using an unsupervised learning technique that employs the RBM. The second phase is discriminative. The DNN is trained by using the back-propagation algorithm in a supervised way. An RBM has an input layer, $V$ (visible layer), where $V = \{v_1, v_2, \ldots, v_V\}$, and an output layer, $H$ (hidden layer), where $H = \{h_1, h_2, \ldots, h_H\}$ [30]. The visible and the hidden layers consist of units. Each unit in the visible layer is connected to all units in the hidden layer. The restriction of this architecture is that there is no connection between the units in the same layer. Two types of RBMs, BB-RBM and GB-RBM [31], are used in this work. In the BB-RBM, the visible and hidden layer unit values are binary, $V \in \{0, 1\}$ and $H \in \{0, 1\}$. The energy function of the BB-RBM is defined in Eq. (1)

\[
E(v, h) = -\sum_{i=1}^{V}\sum_{j=1}^{H} v_i h_j w_{ij} - \sum_{i=1}^{V} v_i b^{v}_{i} - \sum_{j=1}^{H} h_j b^{h}_{j} \tag{1}
\]

where $v_i$ is the $i$-th unit of the visible layer and $h_j$ is the $j$-th unit of the hidden layer. $w_{ij}$ denotes the weight between the visible unit and the hidden unit. $b^{v}_{i}$ and $b^{h}_{j}$ are the biases of visible unit $i$ and hidden unit $j$, respectively. For the GB-RBM, the visible unit values are real, where $V \in \mathbb{R}$, and the hidden unit values are binary, where $H \in \{0, 1\}$. The energy function of this model is defined as in Eq. (2)

\[
E(v, h) = -\sum_{i=1}^{V}\sum_{j=1}^{H} \frac{v_i}{\sigma_i} h_j w_{ij} + \sum_{i=1}^{V} \frac{(v_i - b^{v}_{i})^2}{2\sigma_i^2} - \sum_{j=1}^{H} h_j b^{h}_{j} \tag{2}
\]

where $\sigma_i$ is the standard deviation of the Gaussian noise for the visible unit $i$. The joint probability distribution associated with the configuration $(v, h)$ is defined in Eq. (3)

\[
p(v, h; \theta) = \frac{\exp(-E(v, h; \theta))}{Z} \tag{3}
\]

$\theta$ represents the weights and the biases, while $Z$ is the partition function defined as in Eq. (4).

\[
Z = \sum_{v}\sum_{h} \exp(-E(v, h; \theta)) \tag{4}
\]

The RBM is the basic building block in DBN. It is used as a feature detector and trained in an unsupervised way. The output of a trained RBM is used as an input to train another RBM. Training RBM is very useful for complex problems where the structure of the data is complicated and the implicit features could not be detected directly [32]. A number of RBMs could be stacked together to represent complex structures and to detect implicit features from the previous RBM representation in the stack. The stacked RBMs represent a generative model called DBN. The learning algorithm in the DBN is layer-wise and unsupervised. The layer-wise learning helps to find descriptive features that represent correlation between the input data in each layer [33]. The DBN learning algorithm works to optimize the weights between layers. Moreover, it is proved that initializing the weights between layers in the DBN network enhances the results more than if random weights are used. Another advantage of DBN training lies in its ability to reduce the effect of over-fitting and under-fitting, both of which are common problems in models with a large number of parameters and deep architectures. After the DBN learning is completed and the weights between the layers in the DBN stack are optimized, the supervised training process is started by adding a final layer of labels on top of the DBN layers. These labels represent the final classes of the whole network. In our work, these labels represent the tied-state triphones for the utterance speech data.

B) BNF extractor

The BNF architecture is generated from a trained DNN where each layer represents a different internal structure of the input features. In the DNN, the output of each hidden layer produces transformed features. All the layers above the bottleneck layer are removed to produce the BNF extractor as shown in Fig. 3. Fig. 3 explains the proposed bottleneck feature extraction architecture using the phoneme labels. The left side (in Fig. 3) explains the pre-training phase in the DBN consisting of five RBM layers. The first layer is a GB-RBM and the rest are BB-RBM with the bottleneck layer located in the middle. The right side (in Fig. 3) portrays the DNN architecture, which is formed by adding a softmax output layer on top of the DBN architecture. The weights for the DNN are tuned during the supervised phase.

Introducing a bottleneck layer has many benefits, such as reducing the number of units inside the bottleneck layer, getting rid of redundant values from the input feature set, and reflecting the class labels during the classification process [34,35]. It also helps to capture the descriptive and expressive features of short-time speech utterances [36]. Given a BNF extractor with $M$ layers, the features at the output layer can be extracted using Eq. (5).

\[
\begin{cases}
l_1(x) = \sigma\left(\sum_{n=1}^{N} w\,(x_n + b_1)\right) \\
l_2(x) = \sigma\left(\sum_{n=1}^{F_2} w\,(x_n + b_2)\right) \\
\quad\vdots \\
l_M(x) = \sigma\left(\sum_{i=1}^{F_M} w\,(l_{M-1}(x) + b_M)\right)
\end{cases} \tag{5}
\]

where $\sigma$ is the logistic function $\sigma(x) = 1/(1 + \exp(-x))$. $x = \{x_1, \ldots, x_N\}$ is the feature set vector, and $N$ is the number of input features. $l_M$ is the output of the $M$-th layer. $F$ is a varying number that represents the input for each layer in the BNF extractor. $w$ represents the weights between the input and output nodes in each layer. $b$ represents the bias for each layer.

3.2. Regularizing DNN weights using shared class labels

Traditionally, one label is assigned to each class during the regularization of weights. However, in this work, one label is allowed to represent two classes. The two classes sharing the same label are chosen among the most misclassified classes. By sharing the same label, the weights between the DNN layers are forced to converge to an unbiased form with a wider-range representation. Misclassifications between classes are determined by a DNN classifier (Fig. 4A). The two classes having the highest misclassification ratio are chosen to share a label. For example, suppose a database has seven classes, and the highest misclassifications occur between classes 3 and 5 and between classes 4 and 6. Five labels are then generated: the first label is for class 1, the second label is for class 2, the third label is shared between classes 3 and 5, the fourth label is shared between classes 4 and 6, and the fifth label is for class 7. As shown in Fig. 4B, a second DNN structure calculates the regularized weights. These regularized weights are used as initial weights for the third DNN classifier as shown in Fig. 4C.
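The label sharing described in Section 3.2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `most_confused_pairs` and `shared_label_map` are hypothetical, and the "misclassification ratio" between two classes is taken here to be the sum of the two off-diagonal confusion entries, which the text does not specify.

```python
def most_confused_pairs(confusion, n_pairs):
    """Pick n_pairs disjoint class pairs with the largest mutual confusion.

    confusion[a][b] is the share of class a+1 predicted as class b+1.
    Assumption: mutual confusion = confusion[a][b] + confusion[b][a].
    Returned class ids are 1-based, as in the text.
    """
    n = len(confusion)
    scored = sorted(
        ((confusion[a][b] + confusion[b][a], a, b)
         for a in range(n) for b in range(a + 1, n)),
        reverse=True,
    )
    pairs, used = [], set()
    for _, a, b in scored:
        if a not in used and b not in used:
            pairs.append((a + 1, b + 1))
            used.update((a, b))
        if len(pairs) == n_pairs:
            break
    return pairs


def shared_label_map(n_classes, shared_pairs):
    """Map each 1-based class id to a label id; paired classes share one label."""
    partner = {max(a, b): min(a, b) for a, b in shared_pairs}
    mapping, next_label = {}, 1
    for c in range(1, n_classes + 1):
        if c in partner:
            mapping[c] = mapping[partner[c]]  # reuse the partner's label
        else:
            mapping[c] = next_label
            next_label += 1
    return mapping


# The seven-class example from the text: classes (3, 5) and (4, 6) share labels.
mapping = shared_label_map(7, [(3, 5), (4, 6)])
n_labels = len(set(mapping.values()))

# Toy 4-class confusion matrix (made-up numbers, for illustration only).
toy_confusion = [
    [80, 15, 5, 0],
    [20, 70, 5, 5],
    [0, 5, 60, 35],
    [5, 0, 30, 65],
]
toy_pairs = most_confused_pairs(toy_confusion, 1)
```

With the example from the text, the mapping yields five labels, which would be the targets for the second DNN of Fig. 4B; the original seven labels are used again when the third DNN is initialized with the resulting weights.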
Fig. 4. DNN structures. (A) Finding misclassified classes. (B) Training a second DNN with shared class labels to calculate regularized weights. (C) Initializing a third DNN
with regularized weights.
4. Classifier design

I-vector and DNNs are used as classifiers to assess the performance of the transformed MFCCs features. Both are state-of-the-art classifiers that have been used in speaker recognition/verification and language identification [10,12,18].

4.1. I-Vector classifier

I-Vector is employed as a back-end classifier in our system. The transformed MFCCs feature set is used as the input vector for the classifier, which consists of I-Vector (eigenvoice) extraction, noise removal, and scoring. The I-Vector classifier estimates different classes by using eigenvoice adaptation [37]. The total variability subspace for each utterance is learned from the training data set. Then, the total variability subspace is used to estimate a low-dimensional set from the adapted mean super vectors, which is called the identity vector (I-Vector). Linear discriminant analysis is applied to reduce the dimension of the extracted I-Vectors by the Fisher criterion [38]. For each utterance, GMM mean vectors are calculated. The UBM super vector, $M$, is adapted by stacking the mean vectors of the GMM. It is defined in Eq. (6).

\[
M = m + Tw \tag{6}
\]

$T$ represents a low-rank matrix, and $w$ represents the required low-dimensional I-Vector. Note that the matrix $T$ is initialized based on the variance of the entire set of utterances in the training database. After the extraction of the I-Vectors, noise in each I-Vector is removed by Gaussian probabilistic linear discriminant analysis [39]. Finally, given a test utterance, the score between a target class and the test utterance is calculated using the log-likelihood ratio.

4.2. DNN classifier

Recently, the DNN has been considered one of the most popular classifiers and feature extractors. The DNN classifier consists of more than three layers including the input and the output layers. In a DNN, each layer is trained based on the features coming from the previous layer's output. Therefore, the further the classifier advances in the training and in the layers, the more complex and generative the features become.

A supervised DNN is built to classify the age and gender for each group in the database at the frame level. The input feature set for the network is the frames of each class's utterances, while the output labels represent the classes in the database. After the DNN is trained, all the output activations for each frame of a given class's utterances are accumulated and normalized by performing a feedforward pass on the trained network to build a model for each class [40]. In the testing process, a new model is created for the test utterance based on the trained network. The cosine similarity between the test utterance model and each class model is computed. Then the final classification decision is made by taking the highest cosine similarity.
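The scoring step of the DNN classifier can be illustrated with a short Python sketch. It assumes the per-frame output activations are already available as lists of floats; the function names are hypothetical, and normalizing the accumulated activations by their sum is an assumption, since the text does not state which normalization is used.

```python
import math


def build_model(frame_activations):
    """Accumulate per-frame output activations and normalize them.

    Assumption: normalization divides by the total so entries sum to 1.
    """
    n_out = len(frame_activations[0])
    total = [0.0] * n_out
    for frame in frame_activations:
        for k, value in enumerate(frame):
            total[k] += value
    norm = sum(total)
    return [t / norm for t in total]


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def classify(test_frames, class_models):
    """Return the class whose model is most similar to the test utterance model."""
    test_model = build_model(test_frames)
    return max(class_models,
               key=lambda name: cosine_similarity(test_model, class_models[name]))


# Toy activations for two classes and three output units (made-up numbers).
class_models = {
    "C": build_model([[0.8, 0.1, 0.1], [0.7, 0.2, 0.1]]),
    "YF": build_model([[0.1, 0.8, 0.1], [0.2, 0.7, 0.1]]),
}
decision = classify([[0.6, 0.3, 0.1], [0.9, 0.05, 0.05]], class_models)
```

In this toy case the test utterance is assigned to class "C", since its accumulated activations point in nearly the same direction as that class model.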
Table 2
The overall classification accuracies of the DNN and I-Vector classifiers using the traditional and the transformed MFCCs (%). Bold values represent the overall performances.

Classifier                     Features            C      YF     YM     MF     MM     SF     SM     Overall
I-vector                       Traditional MFCCs   64.86  57.12  49.01  24.50  27.03  49.91  32.80  43.60
I-vector                       Transformed MFCCs   60.33  66.49  48.00  45.46  48.56  56.89  67.15  56.13
DNN with regularized weights   Traditional MFCCs   54.33  52.60  44.80  25.13  42.33  46.13  55.87  45.89
DNN with regularized weights   Transformed MFCCs   62.23  61.54  53.38  47.69  52.00  64.23  70.77  58.98
DNN with random weights        Traditional MFCCs   56.53  47.27  49.07  27.53  35.33  36.13  53.80  43.67
DNN with random weights        Transformed MFCCs   59.69  60.15  48.85  40.08  52.23  60.92  63.38  55.04
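As a quick arithmetic check of the "about 13%" improvement quoted in the abstract, the overall accuracies in Table 2 can be compared directly. The snippet below is a minimal sketch that only re-uses the overall values from the table; the dictionary keys are descriptive labels chosen for this example.

```python
# Overall accuracies (%) from Table 2: (traditional MFCCs, transformed MFCCs).
overall = {
    "I-vector": (43.60, 56.13),
    "DNN, regularized weights": (45.89, 58.98),
    "DNN, random weights": (43.67, 55.04),
}

# Absolute gain in percentage points when switching to the transformed MFCCs.
gains = {name: round(new - old, 2) for name, (old, new) in overall.items()}

for name, gain in gains.items():
    print(f"{name}: +{gain:.2f} percentage points")
```

The gains are 12.53, 13.09, and 11.37 percentage points, respectively, so the quoted figure is an absolute (not relative) improvement.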
Fig. 5. ROC curves of different classifier scenarios. (A) The DNN classifier with regularized weights and the traditional MFCCs. (B) The DNN classifier with regularized weights and the transformed MFCCs. (C) The DNN classifier with random weights and the traditional MFCCs. (D) The DNN classifier with random weights and the transformed MFCCs. (E) The I-vector classifier using the traditional MFCCs. (F) The I-vector classifier using the transformed MFCCs.

Table 3
Corresponding AUC measurements for classification of speaker's age and gender. Bold values represent the overall performances.
Traditional MFCCs   Transformed MFCCs   Traditional MFCCs   Transformed MFCCs   Traditional MFCCs   Transformed MFCCs
are calculated by using the one-against-all rule. The area under curve (AUC) for the transformed MFCCs is found to be larger than that for the traditional MFCCs (Table 3 compares the AUC for both sets). The AUC values are calculated as in [43]. The DNN classifier performs better than the I-vector classifier in terms of AUC.

Comparing the classifiers, the DNN classifier performed slightly better than the I-vector classifier. Fig. 6 shows the variance in weights at each layer in the DNN classifier by using random weights and regularized weights. Higher variance between the weights in each layer is needed to distinguish different classes. As can be seen in Fig. 6, the variance between the weights using shared labels is higher than that of the randomly initialized weights; therefore, the regularized weights converge faster than the random weights for most of the DNN layers.

Table 4 presents the confusion matrix obtained with the I-vector classifier. It can be seen that the children (C), young female (YF), and senior (SM, SF) classes are classified with higher accuracies compared to the other classes. The major misclassifications occurred among the same-gender classes. The young female (YF) and senior male (SM) classes have the highest accuracy rates and are correctly classified at 66.49% and 67.15%, respectively. The middle and senior female groups (MF, SF) are classified with accuracies of 45.46% and 56.89%. The children (C) and young male (YM) classes achieved accuracies of 60.33% and 48%.
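The one-against-all AUC evaluation mentioned earlier in this section can be sketched as follows. The paper computes AUC as in [43], which is not reproduced here; this example swaps in the standard rank-based (Mann-Whitney) formulation, and the function name, scores, and labels are made up for illustration.

```python
def auc_one_vs_rest(scores, labels, positive):
    """AUC for one class against all others.

    scores[i] is the classifier score of sample i for the `positive` class;
    labels[i] is the true class of sample i. The AUC equals the probability
    that a random positive sample outscores a random negative one
    (ties count as one half).
    """
    pos = [s for s, y in zip(scores, labels) if y == positive]
    neg = [s for s, y in zip(scores, labels) if y != positive]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy scores for class "C" on five utterances (made-up numbers).
scores = [0.9, 0.8, 0.4, 0.3, 0.2]
labels = ["C", "YF", "C", "SM", "YF"]
auc_c = auc_one_vs_rest(scores, labels, positive="C")
```

Repeating this for each of the seven classes gives one AUC per class, which is the kind of per-class figure a table such as Table 3 would report.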
Table 4
Confusion matrix of the I-vector classifier using the transformed MFCCs set (%). Bold values represent the classification accuracies.

                 Predicted
Actual   C      YF     YM     MF     MM     SF     SM
C        60.33  27.90  1.5    4.80   0      2.88   2.59
YF       21.08  66.49  2.70   6.85   0      2.88   0
YM       8.89   1.62   48     0.18   18.97  10.99  11.35
MF       3.60   16.85  2.52   45.46  2.16   29.23  0.18
MM       3.42   1.26   24.43  4.14   48.56  2.34   15.85
SF       7.41   11.17  5.23   13.18  1.44   56.89  4.68
SM       4.50   0.72   11.53  0      15.56  0.54   67.15

Table 5
Confusion matrix of the DNN classifier using the transformed MFCCs set (%). Bold values represent the classification accuracies.

                 Predicted
Actual   C      YF     YM     MF     MM     SF     SM
C        63.23  15.38  4.08   3.31   5.08   4.54   4.38
YF       15.92  61.54  0      11.08  0.54   10.23  0.69
YM       1.62   0.62   53.38  2.38   24.46  2.15   15.38
MF       3.38   16.08  2.15   47.69  0.77   28.85  1.08
MM       0.69   0.92   21.77  0.85   52     2.23   21.54
SF       4.69   8.92   1.77   16.23  0.923  64.23  3.23
SM       0.46   0.31   11.85  0.38   13.69  2.54   70.77

Table 6
Confusion matrix of the DNN classifier using the traditional MFCCs set (%). Bold values represent the classification accuracies.

                 Predicted
Actual   C      YF     YM     MF     MM     SF     SM

Table 7
Overall performance comparison in speaker's age and gender classification. Bold values represent the performances of the systems proposed by this work.

Table 5 and Table 6 present the confusion matrices of the DNN classifier using the transformed and traditional MFCCs with regularized weights. In Table 5, the class SM is classified with the highest accuracy (70.77%), while the classes YF, C, and SF are correctly classified with accuracies between 61% and 64%. The classification accuracies of the MM and YM classes are calculated as 52% and 53.3%, respectively. The lowest accuracy was achieved by the class MF, at 47.69%. It is observed that the highest misclassification rates have always occurred between classes with the same gender and close age, or between the children and young female classes.

Comparing the classification accuracies of each class in Table 5 and Table 6, the transformed MFCCs improve the DNN performance by about 10% for the classes C, YF, YM, and MM and by 15–20% for the classes MF, SF, and SM. This observation can also be seen in the AUC measurements in Table 3. In their work, Barkana and Zhou [6] reported that traditional MFCCs of the middle-aged female (MF) speakers and senior female speakers have very similar characteristics, leading to misclassifications between these two classes. The proposed transformed MFCCs decreased the misclassifications between these two classes significantly since phoneme labels are used in generating the transformed features. The transformed features contain phoneme-specific characteristics of each speaker in addition to the spectral characteristics.

Richardson et al. stated in their work [44] that features aligned with phonetical labels or posteriors still contain speaker-dependent and phonetically discriminative information that is useful for speaker verification. Sarkar et al. [45] reported that “… The results show that the phonetically discriminative MLP features retain speaker-specific information which is complementary to the short-term cepstral features…” Braun et al. [46] studied the effects of the language on estimating speaker's age. They found that the estimation of the speaker's age was language independent, and the listeners did not gain from their knowledge of the corresponding language. We can conclude that the BNFs which are based on phonetical labels retain speaker-dependent information. As a result, BNFs are language independent.

The overall accuracies of the previous studies using the aGender database and the MFCCs feature set are listed in Table 7. The classification accuracies for these systems are reported in Li et al. [22]. The highest reported classification accuracy in the literature is 52.7%, which is achieved by the MFuse 1 + 2 + 3 + 4 + 5 classifier, a combination of GMM base, Mean Supervector, MLLR, TPP, and SVM base systems. GMM baseline, SVM baseline, and GMM-SVM mean
supervector systems have achieved better accuracies than the more complex GMM-SVM MLLR supervector system and GMM-SVM TPP supervector system. Combining more systems together did not provide higher classification accuracies. In the GMM base system, a 39-dimensional MFCCs feature set per frame is extracted. A UBM along with MAP is used to build the class model in a supervised manner. The overall accuracy for that model is reported as 43.1%. The next system, the Mean Super Vector, used the GMM baseline system to extract the feature set and train the UBM model. The mean vectors of all the Gaussian components are concatenated to form the GMM supervectors, which are modeled by an SVM. The overall accuracy of this system is stated as 42.6%. In the MLLR supervector system, MLLR supervectors and SVM are used to train multi-class models and to score the test set. The UBM technique is conducted using the MLLR adaptation for all samples in the training set in order to extract the corresponding MLLR supervectors. The overall accuracy of the MLLR system is reported as 36.2%, as shown in Table 7. Another variation of the GMM-SVM mean supervector method is the TPP supervector, which is calculated as a probability distribution over all Gaussian components. In this method the KL-divergence [47] is used to measure the similarity between vectors. In TPP, a UBM model is trained independently as age and gender models. The overall accuracy for this system is given as 37.8%. In the SVM-base system, 450-dimensional acoustic features such as F0, jitter, and shimmer, along with the MFCCs feature set per utterance, are extracted. The corresponding features are used as inputs for an SVM classifier, achieving an overall classification accuracy of 44.6%.

The proposed work achieved higher overall accuracies with both the BNF-I-vector and BNF-DNN classifiers (56.13%, 58.98%) compared to previous works for the aGender database in the literature. The transformed MFCCs set proved to be more effective than the traditional MFCCs features in speaker's age and gender classification. There are two main reasons behind this improvement. First, introducing phoneme labels to create BNFs for the age and gender problem has a significant impact on the BNFs, which become more discriminative and descriptive. Through the phoneme labels, phonetic components in a speaker's speech signal are captured and used in detecting the speaker's age and gender information. Second, the regularized weights converged faster and provided higher variance between classes. These improvements boosted the performance of the classifiers.

6. Conclusions, challenges, and future work

The goal of this paper is to improve the classification accuracies in speaker's age and gender classification. For this purpose, major contributions are made to the area of feature extraction and classifier design. First, a novel approach is introduced to generate a transformed MFCCs feature set. Second, classifier weights are regularized by using shared labels. Although MFCCs are one of the most popular feature sets in speech signal processing, they have proved to be ineffective in speaker's age and gender classification in the literature. To improve the performance of the traditional MFCCs, the transformed MFCCs feature set is generated by using the BNF extractor. In the BNF extractor, phoneme labels are used to capture phonetic components in

step to apply DNN techniques and I-vector classifier on speaker's age and gender classification problem.

Many challenges were encountered during the development of the proposed work. Tied-state triphones needed to be used as labels on the DNN output layer. As a solution, a trained GMM-HMM model was used to generate the required labels. The transcript file for each utterance was also required in the label generation process. It was noticed that some of the utterance transcripts in the aGender database were incomplete or inaccurate. To address this issue, the database transcript files have been fixed and refined. The optimization of the DNN parameters, such as the number of layers, number of units in each layer, weight initialization, and learning rate, is problem-dependent. The optimal parameter settings differ from one problem to another. Therefore, different experiments were conducted to find the optimal settings for each DNN. Optimizing one parameter alone does not optimize the rest of the parameters. Thus, the parameters should be tuned together to reach the optimal settings. The computation time for training a DNN depends on different factors: the size of the database (for speech utterances, there were millions of concatenated frames), the number of features for each sample, the number of layers, and the number of epochs. To overcome the limited computation resources, we utilized two NVIDIA TITAN X GPGPU devices connected to a single host to speed up the computation through the parallel power of the GPGPUs.

One possible challenge in this work is the extraction of the tied-state triphones. Most of the age and gender speech databases do not come with tied-state triphones. As a result, the tied-state triphones should be carefully extracted to implement the proposed work on a database. The extraction process has several requirements, such as the transcripts of the speech utterances and special speech software (such as HTK). The extraction process involves many steps that take time. Another possible challenge of this work is to find the classes most often misclassified as each other in order to use the shared labels technique.

The proposed work achieved higher classification accuracies than the previous systems in the literature. This work shows that the bottleneck features can offer better speaker-dependent information. In this work, we only utilized MFCCs to generate the bottleneck features. As future work, other time-domain and frequency-domain features such as fundamental frequency, pitch-range-based features, and linear predictive coefficients can be used as input features. It is expected that the classification accuracies would be improved further. In addition, different types of deep neural networks such as convolutional neural networks (CNNs) can be used alone or along with the DNN classifier. In this work, the I-vector was used as a classifier since it is one of the state-of-the-art techniques in several fields such as language identification and speaker verification. We plan to investigate the usage of the extracted BNF features as input for the I-vector in order to model the corresponding I-vector for each utterance. The resulting I-vectors will be used as a new feature set that could be fed to any classifier. Since the I-vector utilizes techniques such as within-class covariance, it might help to represent the speech utterances in a more distinguishable way. As we plan to fine-tune several DNN architectures jointly by using a new cost function, each DNN will
the speech. We showed that the DNN can be designed and trained have a different feature set. The DNNs will be trained jointly and
to adapt smoothly with the BNF extractor, so that a new trans- simultaneously. It is expected to improve the classification accu-
formed features can be obtained. The shared labels are used to racies by representing speech utterances distinctively, since it will
regularize weights between DNN layers. The regularized weights utilize more than one feature set for each utterance and include
provided faster convergence and higher variance between classes. more information about the speaker.
The performance of the transformed MFCCs is evaluated by two
classifiers: DNN and I-Vector. The results showed a significant im- References
provement in the classification accuracies. The overall accuracy of
[1] M. Black, A. Katsamanis, C.-C. Lee, A.C. Lammert, B.R. Baucom, A. Christensen,
the proposed work is 58.98% and 56.13% for the DNN and I-vector, et al., Automatic classification of married couples’ behavior using audio fea-
respectively. To the best of our knowledge, our work is the first tures, in: INTERSPEECH, 2010, pp. 2030–2033.
14 Z. Qawaqneh et al. / Knowledge-Based Systems 115 (2017) 5–14
[2] P. Nguyen, D. Tran, X. Huang, D. Sharma, Automatic speech-based classification [26] X. Zhang, S. Hongbin, Z. Qingwei, Y. Yonghong, Using a kind of novel phono-
of gender, age and accent, in: B.-H. Kang, D. Richards (Eds.), Knowledge Man- tactic information for SVM based speaker recognition, IEICE Trans. Inf. Syst. 92
agement and Acquisition for Smart Systems and Services, vol. 6232, Springer (2009) 746–749.
Berlin Heidelberg, 2010, pp. 288–299. [27] F. Metze, J. Ajmera, R. Englert, U. Bub, F. Burkhardt, J. Stegmann, et al., "Com-
[3] T. Schultz, Speaker characteristics, in: C. Müller (Ed.), Speaker Classification I, parison of four approaches to age and gender recognition for telephone ap-
vol. 4343, Springer Berlin Heidelberg, 2007, pp. 47–74. plications," in Acoustics, Speech and Signal Processing, 2007. ICASSP 2007. IEEE
[4] S.B. Davis, P. Mermelstein, Comparison of parametric representations for International Conference on, 2007, pp. IV-1089-IV-1092.
monosyllabic word recognition in continuously spoken sentences, Acoust. [28] G.E. Hinton, R.R. Salakhutdinov, Reducing the dimensionality of data with neu-
Speech Signal Process. IEEE Trans. vol. 28 (1980) 357–366. ral networks, Science 313 (2006) 504–507.
[5] H.-J. Kim, K. Bae, H.-S. Yoon, Age and gender classification for a home-robot [29] S. Young, G. Evermann, M. Gales, T. Hain, D. Kershaw, X. Liu, et al., in: The HTK
service, in: Robot and Human interactive Communication, 2007. RO-MAN 2007. Book, vol. 3, Cambridge University Engineering Department, 2002, p. 175.
The 16th IEEE International Symposium on, 2007, pp. 122–126. [30] G. Hinton, Training products of experts by minimizing contrastive divergence,
[6] B.D. Barkana, J. Zhou, A new pitch-range based feature set for a speaker’s age Neural Comput. 14 (2002) 1771–1800.
and gender classification, Appl. Acoust. 98 (2015) 52–61. [31] A.-r. Mohamed, D. Yu, L. Deng, Investigation of full-sequence training of deep
[7] M.D. Zeiler, Hierarchical Convolutional Deep Learning in Computer Vision, New belief networks for speech recognition, in: INTERSPEECH, 2010, pp. 2846–2849.
York University, 2013. [32] L. Deng, A tutorial survey of architectures, algorithms, and applications for
[8] M. Ranzato, G.E. Hinton, Modeling pixel means and covariances using factor- deep learning, APSIPA Transactions on Signal and Information Processing 3
ized third-order Boltzmann machines, in: Computer Vision and Pattern Recog- (2015) null-null.
nition (CVPR), 2010 IEEE Conference on, 2010, pp. 2551–2558. [33] A. Mohamed, T.N. Sainath, G. Dahl, B. Ramabhadran, G.E. Hinton, M.A. Picheny,
[9] C. Ekanadham, S. Reader, H. Lee, Sparse deep belief net models for visual area Deep belief networks using discriminative features for phone recognition, in:
V2, Adv. Neural Inf. Process. Syst. 20 (2008). Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Con-
[10] G.E. Dahl, D. Yu, L. Deng, A. Acero, Context-dependent pre-trained deep neural ference on, 2011, pp. 5060–5063.
networks for large-vocabulary speech recognition, Audio Speech Lang. Process. [34] Y. Bao, H. Jiang, C. Liu, Y. Hu, L. Dai, Investigation on dimensionality reduc-
IEEE Trans. 20 (2012) 30–42. tion of concatenated features with deep neural network for LVCSR systems,
[11] T. Deselaers, S. Hasan, O. Bender, H. Ney, A deep learning approach to machine in: Signal Processing (ICSP), 2012 IEEE 11th International Conference on, 2012,
transliteration, in: Proceedings of the Fourth Workshop on Statistical Machine pp. 562–566.
Translation, 2009, pp. 233–241. [35] F. Grézl, M. Karafiát, S. Kontár, J. Cernocky, Probabilistic and bottle-neck fea-
[12] D. Yu, S. Wang, Z. Karam, L. Deng, Language recognition using deep-structured tures for LVCSR of meetings, Acoustics, Speech and Signal Processing, 2007.
conditional random fields, in: Acoustics Speech and Signal Processing (ICASSP), ICASSP 2007. IEEE International Conference on, 2007 IV-757-IV-760.
2010 IEEE International Conference on, 2010, pp. 5030–5033. [36] F. Grézl, P. Fousek, Optimizing bottle-neck features for LVCSR, in: Acoustics,
[13] G.E. Hinton, S. Osindero, Y.-W. Teh, A fast learning algorithm for deep belief Speech and Signal Processing, 20 08. ICASSP 20 08. IEEE International Confer-
nets, Neural Comput. 18 (2006) 1527–1554. ence on, 2008, pp. 4729–4732.
[14] Y. Bengio, Learning deep architectures for AI, Found. Trends® Mach. Learn. 2 [37] P. Kenny, G. Boulianne, P. Ouellet, P. Dumouchel, Joint factor analysis versus
(2009) 1–127. eigenchannels in speaker recognition, Audio Speech Lang. Process. IEEE Trans.
[15] J.M. Baker, L. Deng, J. Glass, S. Khudanpur, C.-H. Lee, N. Morgan, et al., Devel- 15 (2007) 1435–1447.
opments and directions in speech recognition and understanding, Part 1 [DSP [38] N. Dehak, P. Kenny, R. Dehak, P. Dumouchel, P. Ouellet, Front-end factor anal-
Education], Signal Process. Mag. IEEE 26 (2009) 75–80. ysis for speaker verification, Audio Speech Lang. Process. IEEE Trans. 19 (2011)
[16] D. Matrouf, N. Scheffer, B.G. Fauve, J.-F. Bonastre, A straightforward and effi- 788–798.
cient implementation of the factor analysis model for speaker verification, in: [39] P. Kenny, Bayesian Speaker Verification with Heavy-Tailed Priors, in: Odyssey,
INTERSPEECH, 2007, pp. 1242–1245. 2010, p. 14.
[17] M. Senoussaoui, P. Kenny, N. Dehak, P. Dumouchel, An I-Vector extractor suit- [40] E. Variani, X. Lei, E. McDermott, I. Lopez Moreno, J. Gonzalez-Dominguez, Deep
able for speaker recognition with both microphone and telephone speech, in: neural networks for small footprint text-dependent speaker verification, in:
Odyssey, 2010, p. 6. Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Con-
[18] N. Dehak, P.A. Torres-Carrasquillo, D.A. Reynolds, R. Dehak, Language recog- ference on, 2014, pp. 4052–4056.
nition via i-vectors and dimensionality reduction, in: INTERSPEECH, 2011, [41] F. Burkhardt, M. Eckert, W. Johannsen, J. Stegmann, A database of age and gen-
pp. 857–860. der annotated telephone speech, LREC, 2010.
[19] E.D. Mysak, Pitch and duration characteristics of older males, J. Speech Hearing [42] F. Seide, G. Li, X. Chen, D. Yu, Feature engineering in context-dependent
Res. (1959). deep neural networks for conversational speech transcription, in: Automatic
[20] N. Minematsu, M. Sekiguchi, K. Hirose, Automatic estimation of one’s age with Speech Recognition and Understanding (ASRU), 2011 IEEE Workshop on, 2011,
his/her speech based upon acoustic modeling techniques of speakers, Acous- pp. 24–29.
tics, Speech, and Signal Processing (ICASSP), 2002 IEEE International Confer- [43] D.J. Hand, R.J. Till, A simple generalisation of the area under the ROC curve for
ence on, 2002 I-137-I-140. multiple class classification problems, Mach. Learn. 45 (2001) 171–186.
[21] C. Muller, F. Wittig, J. Baus, Exploiting speech for recognizing elderly users to [44] F. Richardson, D. Reynolds, and N. Dehak, “A unified deep neural network for
respond to their special needs, Eighth European Conference on Speech Com- speaker and language recognition,” 2015, arXiv:1504.00923.
munication and Technology, 2003. [45] A.K. Sarkar, C.-T. Do, V.-B. Le, C. Barras, Combination of cepstral and phoneti-
[22] M. Li, C.-S. Jung, K.J. Han, Combining five acoustic level modeling methods cally discriminative features for speaker verification, IEEE Signal Process. Lett.
for automatic Speaker’s age and gender recognition, in: INTERSPEECH, 2010, 21 (2014) 1040–1044.
pp. 2826–2829. [46] A. Braun, L. Cerrato, Estimating speaker age across languages, in: Proceedings
[23] D.A. Reynolds, T.F. Quatieri, R.B. Dunn, Speaker verification using adapted of ICPhS, 1999, pp. 1369–1372.
Gaussian mixture models, Digital Signal Process. 10 (20 0 0) 19–41. [47] J.R. Hershey, P. Olsen, Approximating the Kullback Leibler divergence be-
[24] M. Li, H. Suo, X. Wu, P. Lu, Y. Yan, Spoken language identification using tween Gaussian mixture models, Acoustics, Speech and Signal Processing,
score vector modeling and support vector machine, in: INTERSPEECH, 2007, 2007. ICASSP 2007. IEEE International Conference on, 2007 IV-317-IV-320.
pp. 350–353.
[25] A. Stolcke, S.S. Kajarekar, L. Ferrer, E. Shrinberg, Speaker recognition with ses-
sion variability normalization based on MLLR adaptation transforms, Audio
Speech Lang. Process. IEEE Trans. 15 (2007) 1987–1998.
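As an illustration of the second challenge noted in Section 6 (finding the classes that are most often misclassified as each other before shared labels can be applied), the following is a minimal sketch and not the procedure used in this work. It assumes that a confusion matrix from a baseline classifier is available. The function name `most_confused_pairs`, the symmetrized row-normalized score, and the toy matrix are hypothetical choices made only for illustration.

```python
import numpy as np


def most_confused_pairs(confusion, top_k=3):
    """Rank unordered class pairs by how often they are mistaken for each other.

    confusion[i, j] is the number of utterances whose true class is i
    and whose predicted class is j (assumed layout, not from the paper).
    """
    confusion = np.asarray(confusion, dtype=float)

    # Row-normalize so that classes with many utterances do not dominate.
    row_totals = confusion.sum(axis=1, keepdims=True)
    rates = np.divide(
        confusion,
        row_totals,
        out=np.zeros_like(confusion),
        where=row_totals > 0,
    )

    # Mutual confusion of (i, j): rate of i predicted as j plus j predicted as i.
    mutual = rates + rates.T

    n_classes = mutual.shape[0]
    pairs = [
        (i, j, float(mutual[i, j]))
        for i in range(n_classes)
        for j in range(i + 1, n_classes)
    ]
    pairs.sort(key=lambda pair: pair[2], reverse=True)
    return pairs[:top_k]


# Toy 3-class confusion matrix with made-up counts (not results from the paper).
toy_confusion = [
    [8, 2, 0],
    [3, 6, 1],
    [0, 1, 9],
]

for class_a, class_b, score in most_confused_pairs(toy_confusion):
    print(f"classes {class_a} and {class_b}: mutual confusion {score:.2f}")
```

Pairs at the top of this ranking would be natural candidates for a shared label. How many pairs to merge, and whether the score should be symmetrized at all, are design choices that this sketch leaves open.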