International Conference on Communication and Signal Processing, April 6-8, 2016, India
Speech to Text Conversion for Multilingual Languages
Yogita H. Ghadage, Sushama D. Shelke
Abstract—The current work presents a multilingual Speech-To-Text conversion system. Conversion is based on the information contained in the speech signal. Speech is the most natural and important form of communication for human beings. A Speech-To-Text (STT) system takes a human speech utterance as input and produces a string of words as output. The objective of this system is to extract, characterize and recognize the information carried by the speech. The proposed system is implemented using the Mel-Frequency Cepstral Coefficient (MFCC) feature extraction technique, with the Minimum Distance Classifier and Support Vector Machine (SVM) methods for speech classification. Speech utterances are pre-recorded and stored in a database, which is divided into two parts: training and testing. Samples from the training database are passed through the training phase and their features are extracted. Combining the features of each sample forms a feature vector, which is stored as a reference. A sample to be tested from the testing part is given to the system and its features are extracted. The similarity between these features and the reference feature vectors is computed, and the words having maximum similarity are given as output. The system is developed in the MATLAB (R2010a) environment.

Index Terms—Mel Frequency Cepstral Coefficients (MFCC), Minimum Distance Classifier, Speech Recognition, Speech-To-Text (STT), Support Vector Machine (SVM).

I. INTRODUCTION

Speech recognition is the procedure of extracting essential information from an input speech signal in order to make an accurate decision about the corresponding text. The speech signal conveys very rich information, such as speaker information and linguistic information, which has inspired many researchers to develop systems that automatically process speech, e.g. speech enhancement, speech synthesis, speech compression, speaker recognition, speech recognition and verification. Speech recognition can be further classified as speaker dependent and speaker independent [1]. With the help of a speech recognition mechanism a computer can follow human voice commands and understand human languages, i.e. speech recognition acts as a good interface for human-computer interaction.

Generally, today's speech recognition technologies are designed for the English language, so illiterate rural communities and educationally under-privileged people are kept away from computer technology. If processing in the native language is made possible, i.e. if computer technologies can understand the native language, then computers will become easy to use for illiterate people, people from rural communities and the educationally under-privileged. Marathi is the native language of Maharashtra. In day-to-day life we use English words while speaking, i.e. most of the time we mix English with the native language. The authors have therefore designed a multilingual Speech-To-Text conversion system in which Marathi, English and Marathi-English mixed speech are the focus. The objective of the proposed system is to design and implement a Speech-To-Text conversion system for the Marathi, English and Marathi-English mixed languages. The system has been developed for a small database which contains 10 Marathi sentences, 3 English sentences and 2 mixed sentences. This work is based on MFCC, SVM and the Minimum Distance Classifier [2].

The outline of the paper is as follows. Section II gives a brief overview of the system. Section III describes the speech database. Section IV explains MFCC feature extraction. Section V covers pattern classification. Section VI describes the experimental setup. Section VII discusses the results. Section VIII concludes the paper, and Section IX outlines future work.

Yogita H. Ghadage is with the Electronics and Telecommunication Engineering Department, NBN Sinhgad School of Engineering, Pune (e-mail: ghadagehyogita@[Link]).

Sushama D. Shelke is with the Electronics and Telecommunication Engineering Department, NBN Sinhgad School of Engineering, Pune (e-mail: [Link]@[Link]).

II. SYSTEM OVERVIEW

Fig. 1. Block diagram of system
Fig. 1 shows the block diagram of the Speech-To-Text conversion system. The system operation is divided into two phases, i.e. training and testing. First, in the training phase, speech utterances of each sentence are recorded. The speech signal is preprocessed and segmented into words, and for each word acoustic features are extracted using the MFCC method. The features of each word form a feature vector, which is stored for reference. In the testing phase the speech utterance to be tested is preprocessed, segmented into words, and features are extracted for each word. These features are compared with the reference feature vectors stored during the training phase, using a combination of SVM and the Minimum Distance Classifier. The word having the minimum difference is given as the recognized word.

III. SPEECH DATABASE

The database is the crucial point in an automatic Speech-To-Text conversion system; for any automatic speech recognition system the first step is to configure the database. The proposed system is implemented on a self-generated database [3]. The whole database is divided into two parts: a training database and a testing database. Its composition is:

Marathi language sentences: 10
English language sentences: 3
Marathi-English mixed sentences: 2
Total sentences: 15
Speakers: 4 (2 male, 2 female)

A. Training Database

The training database contains speech utterances recorded by 4 different users for the 10 Marathi sentences, 3 English sentences and 2 mixed sentences. Each sentence is uttered 10 times by each user, i.e. 40 utterances of each sentence are used to train the system, giving a total of 600 samples. The sentences used in the formation of the database are listed in Tables I, II and III.

TABLE I: MARATHI DATABASE

TABLE II: ENGLISH DATABASE

TABLE III: MARATHI-ENGLISH MIX DATABASE
B. Testing Database

The testing database also contains speech utterances recorded by 4 different users for the 10 Marathi, 3 English and 2 English-Marathi mixed sentences. Each sentence is uttered 20 times by each user, i.e. a total of 1200 samples are used to test the system. Each sample in the training and testing databases is recorded at a sampling frequency of 8 kHz.

IV. MFCC FEATURE EXTRACTION

In any automatic speech recognition system the first and most important step is to extract features, i.e. to identify the components of the speech signal that are useful for identifying the linguistic content, and to discard everything else that carries information such as background noise and emotion [4]. The two main purposes of feature extraction are: first, to compress the input speech signal into features, and second, to obtain features that are insensitive to speech variations, robust to changes in environmental conditions and independent of the speaker.
Fig. 2. Block diagram of MFCC

The steps of MFCC feature extraction are as follows.

A. Pre-emphasis

Pre-emphasis is applied to spectrally flatten the input speech signal. A first-order high-pass FIR filter is used to pre-emphasize the higher frequency components.
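The pre-emphasis step can be written in a single line. The following Python sketch is illustrative only (the paper's own implementation is in MATLAB), and the filter coefficient 0.97 is an assumed conventional value that the paper does not specify.

```python
import numpy as np

def pre_emphasis(signal, alpha=0.97):
    """First-order high-pass FIR pre-emphasis: y[n] = x[n] - alpha * x[n-1]."""
    # alpha = 0.97 is a conventional choice; the paper does not state
    # the coefficient actually used.
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])
```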
B. Framing

An audio signal is constantly changing, so to simplify analysis we assume that on short time scales the signal does not change much, and we therefore frame it into 20-40 ms frames. A Hamming window applied to each frame attenuates information at the start and end of the frame, so to reincorporate this information into the extracted features the frames are overlapped [5].
C. Windowing

Windowing is performed to avoid or reduce the unwanted discontinuities in the speech segment and the spectral distortion introduced by the framing process. The Hamming window is most commonly used in speech recognition.
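Framing and Hamming windowing can be combined in one routine, as in the sketch below. The paper specifies 20-40 ms frames with overlap at 8 kHz; the exact frame length (25 ms) and hop (10 ms) used here are assumptions.

```python
import numpy as np

def frame_and_window(signal, fs=8000, frame_ms=25, hop_ms=10):
    """Split a signal into overlapping frames and apply a Hamming window.

    Assumes the signal is at least one frame long; frame and hop lengths
    are assumed values within the paper's stated 20-40 ms range.
    """
    frame_len = int(fs * frame_ms / 1000)   # samples per frame
    hop_len = int(fs * hop_ms / 1000)       # samples between frame starts
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    window = np.hamming(frame_len)
    frames = np.stack([signal[i * hop_len: i * hop_len + frame_len]
                       for i in range(n_frames)])
    return frames * window                  # one windowed frame per row
```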
D. Discrete Fourier Transform (DFT)

Spectral estimation is done with the DFT, and the FFT is a very efficient algorithm for implementing it. The magnitude frequency response of each frame is obtained after FFT execution: the spectral coefficients of the speech frames are complex numbers containing both magnitude and phase information. The phase information is usually discarded for speech recognition and only the magnitude of the spectral coefficients is retained.
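Discarding the phase amounts to keeping only the FFT magnitude, as in this short sketch; the FFT length of 256 points is an assumption, since the paper does not state one.

```python
import numpy as np

def magnitude_spectrum(frames, nfft=256):
    """N-point FFT of each windowed frame, keeping only the magnitude.

    The phase is discarded, as described above; nfft = 256 is an
    assumed FFT length, not stated in the paper.
    """
    return np.abs(np.fft.rfft(frames, n=nfft, axis=1))
```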
E. Mel Frequency Filtering

Normally each tone with an actual frequency $f$ is measured in Hz. For speech signals, the ability of the human ear to perceive the frequency content of sounds does not follow a linear scale, so for every tone a subjective pitch is measured on a scale called the Mel scale. Below 1000 Hz the Mel scale is approximately linear in frequency, and above 1000 Hz it is logarithmic. The following formula transforms a given linear frequency $f$ in Hz into the corresponding Mel frequency:

$$\mathrm{Mel}(f) = 2595 \cdot \log_{10}\left(1 + \frac{f}{700}\right) \qquad (1)$$

Just as Mel filtering approximates the non-linear frequency characteristics of the human auditory system, the logarithm is used to approximate the loudness non-linearity, i.e. the relationship between the human perception of loudness and the sound intensity. Multiplications in the frequency domain become simple additions after the logarithm.

The log Mel filter bank coefficients are computed from the filter outputs as:

$$s(m) = 20 \log_{10}\left(\sum_{k=0}^{N-1} |X(k)| \, H_m(k)\right), \quad 0 \le m \le M \qquad (2)$$

where $M$ is the number of Mel filters (20 to 40), $X(k)$ is the $N$-point FFT of a specific window frame of the input speech signal, and $H_m(k)$ is the transfer function of the $m$-th Mel filter.

F. Discrete Cosine Transform (DCT)

To transform the Mel coefficients back to the time domain, a discrete cosine transform is performed. The result of this step is called the Mel Frequency Cepstral Coefficients (MFCC). The inverse Fourier transform of the log magnitude of the Fourier transform of a signal is called the cepstrum; since the log Mel filter bank coefficients are real and symmetric, the inverse Fourier transform can be replaced by a DCT to generate the cepstral coefficients. The smooth spectral shape, i.e. the vocal tract shape, is represented by the lower-order cepstral coefficients, while the excitation information is represented by the higher coefficients. The cepstral coefficients are the DCT of the $M$ filter outputs:

$$c(n) = \sum_{m=0}^{M-1} s(m) \cos\left(\frac{\pi n (m + 1/2)}{M}\right) \qquad (3)$$

Typically the first 13 cepstral coefficients are used. The biggest advantage of the MFCC coefficients is that they are less correlated than the log Mel filter bank coefficients.
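A minimal sketch of Eqs. (1)-(3) is given below. The filter-bank construction (triangular filters with centres equally spaced on the Mel scale) is the standard approach and an assumption here; n_filters = 26 is an assumed value within the paper's stated range of 20-40, while keeping 13 coefficients follows the paper.

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)        # Eq. (1)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)      # inverse of Eq. (1)

def mfcc_from_spectrum(mag_frames, fs=8000, nfft=256, n_filters=26, n_ceps=13):
    """Log Mel filter-bank energies (Eq. 2) followed by a DCT (Eq. 3)."""
    # Triangular filters with centre frequencies equally spaced on the Mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((nfft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fbank = np.zeros((n_filters, nfft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, centre, right = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, left:centre] = (np.arange(left, centre) - left) / max(centre - left, 1)
        fbank[m - 1, centre:right] = (right - np.arange(centre, right)) / max(right - centre, 1)
    energies = mag_frames @ fbank.T                                 # filter outputs
    log_energies = 20.0 * np.log10(np.maximum(energies, 1e-10))     # Eq. (2)
    # scipy's DCT-II equals twice the sum in Eq. (3), hence the factor 0.5.
    return 0.5 * dct(log_energies, type=2, axis=1)[:, :n_ceps]      # Eq. (3)
```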
V. PATTERN CLASSIFICATION

A. Minimum Distance Classifier (MDC)

In speech recognition or STT conversion there are mainly two phases: the training phase and the testing phase. For classification, the zero crossing points (ZCPs) corresponding to the different words are precomputed during the training phase and stored as reference ZCPs [6].

The minimum distance classifier computes the Euclidean distance between the zero crossing points of the uttered word and the zero crossing points of the words in the database. The word having the least Euclidean distance is declared as the uttered word. The squared Euclidean distance is given as:

$$d^2(x, p) = \sum_{i=1}^{k} (x_i - p_i)^2 \qquad (4)$$

where $x$ and $p$ are ZCP vectors, i.e. $x$ is the ZCP vector of the uttered word, $p$ is the ZCP vector of a word from the database, and $i$ varies from 1 to $k$ (the number of ZCPs of a particular word). The sum of the squares of the differences between the individual zero crossing points gives the distance between the uttered word and every word in the database, and the word in the database with the least distance is declared as the uttered word.
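The classifier reduces to an argmin over the distances of Eq. (4). This sketch assumes all ZCP vectors share the same length k; the function and argument names are hypothetical.

```python
import numpy as np

def classify_min_distance(zcp_test, zcp_refs, labels):
    """Minimum Distance Classifier over zero-crossing-point vectors.

    zcp_test : ZCP feature vector of the uttered word (length k).
    zcp_refs : one reference ZCP vector per row, one row per trained word.
    labels   : the word label associated with each reference row.
    """
    d2 = np.sum((zcp_refs - zcp_test) ** 2, axis=1)   # squared distances, Eq. (4)
    return labels[int(np.argmin(d2))]                 # word with the least distance
```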
B. Support Vector Machine (SVM)

SVM is one of the most effective methods of pattern classification. SVMs use linear and non-linear separating hyper-planes for data classification: the input is first mapped into a high-dimensional space, and a hyper-plane is then used to distinguish the classes.

The kernel inducing the inner product in the high-dimensional mapping is a crucial aspect of applying SVMs successfully, i.e. a high-dimensional feature space is implicitly introduced by a computationally efficient kernel mapping, and in this feature space the SVM finds a separating surface with a large margin between the training samples of the two classes. A large margin implies a better generalization ability. SVM is a discriminative approach and can classify any fixed-length data vectors, but it cannot readily be applied to tasks involving variable-length data.

The support vector classifier uses the function:

$$f(x) = \alpha^{T} K_s(x) + b \qquad (5)$$

where $K_s(x) = [k(x, s_1), \ldots, k(x, s_d)]^{T}$ is a vector of kernel functions evaluated at the support vectors $s_1, \ldots, s_d$, which are usually a subset of the training data.

The classification rule is defined as:

$$q(x) = \begin{cases} 1 & \text{for } f(x) \ge 0 \\ 2 & \text{for } f(x) < 0 \end{cases} \qquad (6)$$

The multiclass classification function and rule are defined as:

$$f_y(x) = \alpha_y^{T} K_s(x) + b_y, \quad y \in Y \qquad (7)$$

$$q(x) = \arg\max_{y \in Y} f_y(x) \qquad (8)$$
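Given trained parameters, the multiclass decision of Eqs. (7)-(8) is a matrix-vector product followed by an argmax, as sketched below. The paper does not state which kernel was used, so the RBF kernel and its gamma value here are assumptions, as are all names.

```python
import numpy as np

def rbf_kernel(x, s, gamma=0.1):
    # Assumed RBF kernel, purely for illustration; the paper does not
    # specify the kernel actually used.
    return np.exp(-gamma * np.sum((x - s) ** 2))

def svm_multiclass_decision(x, support_vectors, alphas, biases):
    """Evaluate the multiclass rule of Eqs. (7)-(8).

    support_vectors : (d, n_features) array of support vectors s_1..s_d.
    alphas          : (n_classes, d) weight vectors, one row per class y.
    biases          : (n_classes,) bias terms b_y.
    """
    k_s = np.array([rbf_kernel(x, s) for s in support_vectors])  # K_s(x), Eq. (5)
    scores = alphas @ k_s + biases                               # f_y(x), Eq. (7)
    return int(np.argmax(scores))                                # winning class, Eq. (8)
```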
C. SVM-MDC Combination

The proposed system uses the combination of SVM and MDC for classification. It translates large-class problems into small-class problems, i.e. multiclass problems are converted into binary problems, which makes them easier to solve. MDC is mainly used for coarse tuning, and SVM performs the fine tuning.

VI. EXPERIMENTAL SETUP

The system is trained with the training database, and the recorded speech utterances stored in the test database are used to test the system. All utterances are recorded at a sampling frequency of 8 kHz, and the sentence durations range from 3 s to 5 s. The input speech signal is given to the MFCC stage, which converts it into feature vectors, and the minimum distance classifier and support vector machine techniques are used for classification [7]-[9].

The trained speech samples are saved as reference models in a database. Each segmented speech sample of the test speech signal is then passed over the reference models and the minimum distance is computed. Each word is recognized using the minimum distance and the SVM model. The whole system is implemented and tested in MATLAB software.
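One plausible reading of the coarse-to-fine hand-over is sketched below: MDC shortlists the closest reference words and the SVM decides among them. The paper does not describe the hand-over in this much detail, so the shortlist size and the svm_decide interface are assumptions.

```python
import numpy as np

def recognize_word(features, zcp_refs, labels, svm_decide, shortlist=3):
    """Coarse-to-fine classification: MDC shortlists candidates, SVM decides.

    svm_decide(features, candidate_indices) is assumed to return the index
    of the winning candidate; shortlist = 3 is likewise an assumed value.
    """
    d2 = np.sum((zcp_refs - features) ** 2, axis=1)   # MDC: coarse tuning
    candidates = np.argsort(d2)[:shortlist]           # closest reference words
    winner = svm_decide(features, candidates)         # SVM: fine tuning
    return labels[winner]
```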
VII. RESULTS

Fig. 3. Speech waveform of 'Tu kuthe aahes'

Fig. 4. Power spectral density of the signal 'Tu kuthe aahes'
Fig. 5. MFCC filter weights

Fig. 6. MFCC discrete cosine transform matrix

TABLE IV: MARATHI-ENGLISH MIX DATABASE

VIII. CONCLUSION

An accuracy of 93.625% for the Marathi language is achieved by the proposed system using MFCC for feature extraction and the Minimum Distance Classifier and SVM combination for classification. The proposed system achieves higher accuracy than a system using the MFCC feature extraction technique with a CDHMM classifier, which gives an accuracy of 88.80% for the Marathi language. The proposed system achieves 91.6667% accuracy for English and 90.625% accuracy for the English-Marathi mixed language.

IX. FUTURE WORK

This connected-word speech recognition system has been developed for speaker-independent Marathi, English and English-Marathi mixed languages. The work may be extended to other regional languages and towards real-time connected-word speech recognition for multilingual speech.
REFERENCES

[1] Priyanka P. Patil, Sanjay A. Pardeshi, "Marathi Connected Word Speech Recognition System," IEEE First International Conference on Networks & Soft Computing, pp. 314-318, Aug. 2014.
[2] [Link], [Link], "Speech Recognition by Machine: A Review," International Journal of Computer Science and Information Security, vol. 6, no. 3, 2009.
[3] Mathias De Wachter, Mike Matton, Kris Demuynck, Patrick Wambacq, "Template Based Continuous Speech Recognition," IEEE Trans. on Audio, Speech & Language Processing, vol. 15, issue 4, pp. 1377-1390, May 2007.
[4] Vikram.C.M., [Link], "Phoneme Independent Pathological Voice Detection Using Wavelet Based MFCCs, GMM-SVM Hybrid Classifier," IEEE International Conference on Advances in Computing, Communications and Informatics, pp. 929-934, Aug. 2013.
[5] [Link], [Link], Abhishek Karan and [Link], "PSOC based isolated speech recognition system," IEEE International Conference on Communication and Signal Processing, pp. 693-697, April 3-5, 2013, India.
[6] Taabish Gulzar, Anand Singh, Dinesh Kumar Rajoriya and Najma Farooq, "A Systematic Analysis of Automatic Speech Recognition: An Overview," International Journal of Current Engineering and Technology, vol. 4, no. 3, June 2014.
[7] Santosh V. Chapaneri, "Spoken Digits Recognition using Weighted MFCC and Improved Features for Dynamic Time Warping," International Journal of Computer Applications, vol. 40, no. 3, Feb. 2012.
[8] Rashmi C. R., "Review of Algorithms and Applications in Speech Recognition System," International Journal of Computer Science and Information Technologies, vol. 5(4), pp. 5258-5262, 2014.
[9] Shivanker Dev Dhingra, Geeta Nijhawan, Poonam Pandit, "Isolated Speech Recognition Using MFCC and DTW," International Journal of Advanced Research in Electrical, Electronics and Instrumentation Engineering, vol. 2, issue 8, Aug. 2013.