ArabRecognizer: Modern Standard Arabic Speech Recognition Inspired by DeepSpeech2 Utilizing Franco-Arabic
https://doi.org/10.1007/s10772-024-10130-8
Abstract
Speech recognition is a critical task in spoken language applications. Globally known models such as DeepSpeech2 are effective for English speech recognition; however, they are not well-suited for languages like Arabic. This paper is concerned with recognizing the Arabic language, especially Modern Standard Arabic (MSA). It proposes two models that utilize "Franco-Arabic" as an encoding mechanism, along with additional enhancements, to recognize MSA. The first model uses Mel-Frequency Cepstral Coefficients (MFCCs) as input features, while the second employs six sequential Gated Recurrent Unit (GRU) layers. Each model is followed by a fully connected layer with a dropout layer, which helps reduce overfitting. The Connectionist Temporal Classification (CTC) loss is used to calculate the prediction error and to maximize the likelihood of the correct transcription. Two experiments were conducted for each model: the first involved 41 h of continuous speech over 15 epochs, whereas the second utilized 69 h over 30 epochs. The experiments showed that the first model excels in speed while the second excels in accuracy, and both outperformed the well-known DeepSpeech2.
Keywords Franco-Arabic · Modern Standard Arabic · MSA · Speech recognition · Arabic speech recognition · DeepSpeech2 · GRU · CTC loss
1 Introduction

Artificial Intelligence has significantly impacted the media around us, notably images, written words, and especially the processing of audio signals. Many useful ideas, research efforts, and projects have emerged in audio processing, such as speaker identification, speech recognition, gender recognition, and other valuable research (Chollet, 2021). Automatic Speech Recognition (ASR) is the process of transforming human voice signals into words or commands (O'Shaughnessy, 2008).

English ASR studies have received much more interest and attention than Arabic ASR (Abdelhamid et al., 2020). The Arabic language is regarded as one of the oldest, richest, and most varied languages and is ranked as the fourth most spoken language in the world, yet the current research progress is not yet satisfactory compared to other languages (Abdelhamid et al., 2020). There are three classes of the Arabic language (Elmahdy et al., 2009): (1) Classical Arabic is the language of the Quran, the Hadith, and classic Arabic poetry. (2) Arabic Dialects have multiple regional forms used informally for daily spoken communication in different regions. With the rise of social media, Arabic dialects also need to be written. Each Arabic dialect has its own unique set of features and properties, which makes it hard to process data from these different varieties. (3) Modern Standard Arabic (MSA) is based on Classical Arabic with simplified phonology, emphasizing clarity and comprehensibility. MSA is used for formal communication, newspapers, modern books, and news. It serves as an official, well-recognized, and unified language understood by all Arabic speakers. Additionally, it facilitates streamlined processing because it has a lower complexity level than Classical Arabic.
There are many challenges facing Arabic ASR (Haraty & El Ariss, 2007; Rahman et al., 2024), such as its many phonemes (Al-Anzi & AbuZeina, 2022; Ali et al., 2009; Alotaibi, 2008; Essa et al., 2008), its grammatical and morphological complexity (Forsberg, 2003; Hussein et al., 2022; Mohamed et al., 2013), and its intensive need for computational power (Georgescu et al., 2021). Hence, few studies on Arabic ASR have been published due to these challenges.

An Arabic speech recognizer based on the SPHINX-IV framework was presented along with an automatic toolkit able to generate a pronunciation dictionary for both the standard Arabic language and the Holy Qur'an (Hyassat & Abu Zitar, 2006). Three corpora were utilized as datasets: the Holy Qur'an Corpus (HQC-1), the command and control corpus (CAC-1), and the Arabic digits corpus (ADC). The training process employed a Hidden Markov Model (HMM), resulting in a 46.18% Word Error Rate (WER).

An ASR system is presented using a 1200-h speech corpus (AlHanai et al., 2016). The baseline recognizer utilized Gaussian Mixture Models (GMMs) and HMMs, trained on 39-dimensional MFCCs. The study employed a one-to-one mapping between characters and acoustic units, encompassing 960,000 word entries and 38 acoustic units. Feature extraction utilized the KALDI speech recognition toolkit (Povey et al., 2011). Acoustic model training utilized the CNTK toolkit, while language models were built using the SRILM toolkit (Yu et al., 2014). The developed system incorporated a Deep Neural Network (DNN) structure, integrating various techniques such as feed-forward and convolutional layers, time-delay networks, Long Short-Term Memory (LSTM), highway LSTM (H-LSTM), and grid LSTM (GLSTM). Evaluation on the corpus demonstrated significant performance, achieving an 18.3% WER using trained GLSTM models.

Three distinct system architectures inspired by biological methods were developed for Arabic ASR (Hmad & Allen, 2012). These systems were trained using an Arabic phoneme database (APD) manually extracted from the King Abdulaziz Arabic Phonetics Database (KAPD). The mapping converted each Arabic character into one or two English characters using special symbols. The dataset was employed to train and evaluate three different Multilayer Perceptron (MLP) neural network architectures for phoneme recognition. Each system utilized MFCCs for feature extraction and adopted dataset normalization techniques for training and testing. The systems achieved WERs of 47.52%, 44.58%, and 46.63%, respectively.

A comprehensive comparison of advanced speech recognition techniques is presented in Cardinal et al. (2014). The study utilized a corpus comprising 50 h of transcribed audio from the news channel "Al Jazeera" to train various approaches. The mapping employed was word-to-word, taking diacritics into account, which adds complexity and processing time. The DNN/HMM hybrid approach used the Minimum Phone Error (MPE) criterion, trained sequentially with the DNN, and achieved the highest accuracy: the results yielded a WER of 17.86% for broadcast news reports, 29.85% for broadcast conversations, and an overall WER of 25.6%. It is worth mentioning that the study utilized a corpus comprising 50 h of transcribed audio rather than the standard Common Voice dataset.

A study on ASR explored end-to-end approaches using three distinct datasets, MGB2 (https://arabicspeech.org/mgb2/), MGB3 (https://arabicspeech.org/mgb3-asr-2/), and MGB5 (https://arabicspeech.org/mgb5/), yielding WERs of 12.5%, 27.5%, and 33.8%, respectively (Hussein et al., 2022). The research involved partitioning each dataset into specific audio clip lengths for experimental purposes. The model was applied iteratively to each segmented dataset, and the resulting average accuracies were reported. Several critical points require discussion. Firstly, the paper uses 1200 h from MGB2 for training, but the number of hours for development and evaluation is low, limited to only 10 h; it is worth noting that the entire MGB2 dataset comprises only 1,200 h. Secondly, regarding the choice of datasets: MGB2 comprises recordings from "Al Jazeera" broadcasts, encompassing diverse dialects such as those of correspondents, callers, and broadcasters, potentially impacting model accuracy and misleading the model. MGB3 focuses on the Egyptian dialect, limiting the applicability of the ASR system to Egyptian speakers exclusively. Similarly, MGB5 targets the Moroccan dialect. Hence, utilizing an MSA dataset would ensure broader applicability across Arabic speakers. Finally, the complexity and extensive use of special characters in the lookup table pose challenges in model training.

A comprehensive benchmark for evaluating Arabic speech recognition using ASR technologies was introduced (Obaidah et al., 2024). The authors collected a dataset of 132 h of speech data from calls between agents and clients across the Arab region. Five different ASR systems, all designed to support multiple languages, were compared on the collected dataset. The evaluation results were as follows: Whisper, developed by OpenAI, achieved the worst performance with a WER of 83.8%; the Azure API achieved a WER of 71.88%; the Google API achieved a WER of 67.1%; Meta M4T V1, developed by Meta, achieved a WER of 67.8%; and Chirp, developed by Google, achieved the best performance with the lowest WER of 48.9%.

Several baseline sequence-to-sequence deep neural models for Arabic dialects and MSA were designed (Nasr et al., 2023). The models include DeepSpeech2, Bidirectional Long Short-Term Memory (Bi-LSTM) with attention, Bi-LSTM without attention, LSTM with attention, and LSTM without attention. The Bi-LSTM with attention achieved
encouraging results with a 59% WER on the Yemeni speech corpus, 83% WER on the Jordanian speech corpus, and 53% WER on the multi-dialectal Yemeni-Jordanian-Arabic speech corpus. The authors fine-tuned DeepSpeech2 and achieved a WER of 31% for the Yemeni corpus, 68% for the Jordanian corpus, and 30% for the multi-dialectal Arabic corpus. Additionally, that paper demonstrates the fine-tuning of DeepSpeech2 on the Common Voice MSA dataset, with evaluation on 84 h resulting in 86% WER.

To solve the problems mentioned above, such as the use of special characters, the mapping and conversion of Arabic characters, and the choice of an appropriate dataset, this paper presents two proposed models that suggest ideas for solving these problems and the problems of the Arabic language, in order to enhance the models' performance.

The following points summarize the contributions of this paper:

• We have developed two models to address the Arabic speech recognition problem. The first model excels in speed, while the second excels in accuracy.
• Both proposed models consume minimal data for training to achieve good preliminary results, unlike other models that require more than 1000 h.
• Both models have been trained on MSA to enhance their generalizability and usability for all Arabic speakers.
• We use Franco-Arabic in a scientific way to develop a promising field in AI such as speech recognition, even though Franco-Arabic itself is unscientific, has many disadvantages, and negatively affects both the Arabic and English languages.
• We use the most common features in speech recognition, namely Mel-Frequency Cepstral Coefficients (MFCCs) and the spectrogram, and perform a comparative analysis of the results.

The rest of the paper is organized as follows: Sect. 2 presents the proposed methodologies, which include the preprocessing steps, the lookup table, and the proposed models. Section 3 introduces the experimental results, covering the dataset, computational power, and performance measures. Section 4 discusses the advances and limitations of the proposed models. Finally, Sect. 5 introduces the conclusion of this paper.

2 Proposed methodologies

DeepSpeech2 (Amodei et al., 2016) stands out as a formidable model for addressing the challenges facing English speech recognition. Despite its efficacy in English, DeepSpeech2 does not extend its proficiency to the realm of Arabic speech recognition. Consequently, this section delineates the methodology employed for Arabic ASR. The methodology encompasses signal preprocessing, converting Arabic characters to Franco-Arabic characters through a specifically designed lookup table, and introducing two proposed models. These models are inspired by DeepSpeech2 in an attempt to achieve appropriate performance and solve the problems of Arabic ASR.

2.1 Pre-processing

The preprocessing stage is divided into two distinct phases. The first phase undertakes the conversion of letters from Arabic to Franco-Arabic, whereas in the second phase the MFCC and spectrogram features are extracted from the WAV files.

2.1.1 Letter pre-processing

The Arabic language is complicated and presents a formidable challenge for any model to process. This difficulty arises from the presence of numerous special characters and unique script shapes, such as "ئ", "و", "ض", "ث", and "أ", and various compound letters formed by combining multiple shapes. Additionally, each letter takes several forms depending on its position in the word and on whether or not it is connected to another letter. For instance, the letter "ع" is written in one form when it stands alone, in another as in "المع" when it falls at the end of a word, in another as in "ميعاد" when it is connected to another letter, and in yet another as in "عمرو" when it begins a word.

Recently, Franco-Arabic has become more and more widely used due to the emergence of video games and cultural fusion. Franco-Arabic is a method of typing Arabic words using English letters and numbers. Writing in this way causes many problems: it weakens command of the Arabic language among children and youth and degrades the use of proper English. Therefore, this paper employs Franco-Arabic in a scientific manner to harness its full potential. Numerous challenges emerged while constructing the lookup table, foremost among them the existence of Arabic letters that lack counterparts in Franco-Arabic, such as "ؤ" and "ئ". Furthermore, several Arabic letters are represented by a single letter in Franco-Arabic; for example, "ز", "ذ", and "ظ" are all represented as "z" because they have similar phonetics. To solve these problems, additional letters that belong to the English language but are not used in informal Franco-Arabic were added, creating a complete lookup table for a good mapping between letters. Table 1 shows the lookup table used to map between letters.

By using this lookup table, all the Arabic sentences in the dataset can be converted to Franco-Arabic sentences. Figure 1 shows the conversion process from Arabic to Franco-Arabic.
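To make the mapping concrete, the sketch below applies such a lookup table to a sentence. The entries shown are a small, hypothetical subset for illustration only; the paper's actual mapping is in Table 1 (not reproduced here), and the specific disambiguation choices (e.g., mapping "ظ" to an otherwise unused English letter) are assumptions in the spirit described above.

```python
# Minimal sketch of the letter pre-processing phase.
# The real mapping lives in Table 1; these entries are a hypothetical subset.
FRANCO_LOOKUP = {
    "ا": "a",
    "ب": "b",
    "م": "m",
    "ع": "3",   # Franco-Arabic conventionally uses digits for some letters
    "ح": "7",
    "ز": "z",   # several letters share the "z" sound...
    "ذ": "th",  # ...so assumed extra letters/digraphs disambiguate them
    "ظ": "v",   # an English letter unused by informal Franco-Arabic
    " ": " ",
}

def to_franco(sentence: str) -> str:
    """Convert an Arabic sentence to Franco-Arabic, character by character."""
    return "".join(FRANCO_LOOKUP.get(ch, ch) for ch in sentence)
```

Sentences in the dataset would be converted once, before training, so the CTC targets become Franco-Arabic character sequences.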
2.1.2 Signal pre-processing

Each audio clip in the dataset has a different sample rate. During the signal pre-processing phase, all audio files were resampled to a unified rate of 16 kHz. Figure 2 shows the steps of signal preprocessing.

In this paper, MFCCs and spectrograms were used as audio features. MFCCs are specifically designed to model human auditory perception, emphasizing the frequencies that are most important for understanding speech (Moondra & Chahal, 2023), while the spectrogram is the main feature of DeepSpeech2, so we applied it as well. Figure 3 shows how to obtain MFCCs from a waveform signal, and Fig. 4 shows how to obtain a spectrogram from a waveform signal.
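A minimal sketch of this phase is shown below, using librosa as the signal-processing library. The library choice, the number of MFCC coefficients, and the FFT/hop sizes are assumptions; the paper only fixes the unified 16 kHz sample rate.

```python
import librosa
import numpy as np

TARGET_SR = 16000  # unified sample rate used in the paper

def extract_features(wav_path: str):
    # librosa.load resamples every clip to the target rate on load.
    signal, sr = librosa.load(wav_path, sr=TARGET_SR)

    # MFCCs emphasize perceptually important frequencies;
    # 13 coefficients is a common default (assumed, not from the paper).
    mfccs = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13)

    # Magnitude spectrogram via the short-time Fourier transform,
    # the input feature used by DeepSpeech2.
    spectrogram = np.abs(librosa.stft(signal, n_fft=512, hop_length=160))
    return mfccs, spectrogram
```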
2.2 First proposed model (ArabRecognizer 1)

The input feature of DeepSpeech2 is a spectrogram. In the first proposed model, the input feature was changed from a spectrogram to MFCCs. Additionally, we changed the RNN utilized in the acoustic model to a bidirectional GRU. The GRU was introduced in 2014 to fix the vanishing gradient problem faced by standard Recurrent Neural Networks (RNNs), and it outperformed the vanilla RNN. The GRU exhibits many characteristics similar to the LSTM but with lower complexity and faster computation (Rana, 2016). Figure 5 shows the architecture of DeepSpeech2 and the proposed model with its modifications.

In the first proposed model (ArabRecognizer 1), the MFCCs are the input to multiple convolutional layers. This model uses two Conv2D layers with 32 × 32 filters, each followed by a Rectified Linear Unit (ReLU) activation function and a Batch Normalization layer. The first Conv2D layer has a stride of [2] and a kernel size of [11, 41] without padding. The second Conv2D layer has a stride of [1, 2] and a kernel size of [11, 21] without padding. The acoustic model consists of a bidirectional RNN (bi-RNN) using five sequential GRU layers with 512 units each, in addition to a fully connected layer with 1024 units and a ReLU activation function followed by a dropout layer to reduce overfitting. Finally, the output is obtained from the Softmax layer. The Adam optimizer is used to optimize the CTC loss, which is used to calculate the prediction error and maximize the likelihood of the correct transcription. Algorithm 1 shows the pseudo-code for the first proposed model (ArabRecognizer 1).
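The following Keras sketch assembles ArabRecognizer 1 as described above. It is an interpretation rather than the authors' released code: the MFCC dimensionality, dropout rate, and vocabulary size are assumptions, and "same" padding replaces the stated "no padding" so the convolutions remain valid for low-dimensional MFCC inputs.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_arabrecognizer1(n_mfcc: int = 13, vocab_size: int = 40):
    """ArabRecognizer 1 sketch: MFCCs -> 2x (Conv2D + ReLU + BatchNorm)
    -> 5 bidirectional GRU layers -> Dense + Dropout -> Softmax."""
    inputs = layers.Input(shape=(None, n_mfcc), name="mfccs")
    x = layers.Reshape((-1, n_mfcc, 1))(inputs)

    # Conv block 1: kernel [11, 41], stride 2.
    x = layers.Conv2D(32, [11, 41], strides=[2, 2], padding="same",
                      activation="relu")(x)
    x = layers.BatchNormalization()(x)
    # Conv block 2: kernel [11, 21], stride [1, 2].
    x = layers.Conv2D(32, [11, 21], strides=[1, 2], padding="same",
                      activation="relu")(x)
    x = layers.BatchNormalization()(x)

    # Flatten (frequency, channels) into one feature vector per frame.
    x = layers.Reshape((-1, x.shape[-2] * x.shape[-1]))(x)

    # Acoustic model: five sequential bidirectional GRU layers, 512 units each.
    for _ in range(5):
        x = layers.Bidirectional(layers.GRU(512, return_sequences=True))(x)

    x = layers.Dense(1024, activation="relu")(x)
    x = layers.Dropout(0.5)(x)  # dropout rate is an assumption
    # One extra output class for the CTC blank symbol.
    outputs = layers.Dense(vocab_size + 1, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs, name="ArabRecognizer1")

def ctc_loss(y_true, y_pred):
    """CTC loss; assumes zero-padded dense labels and full-length logits."""
    batch = tf.shape(y_pred)[0]
    input_len = tf.fill([batch, 1], tf.shape(y_pred)[1])
    label_len = tf.fill([batch, 1], tf.shape(y_true)[1])
    return tf.keras.backend.ctc_batch_cost(y_true, y_pred, input_len, label_len)

model = build_arabrecognizer1()
model.compile(optimizer="adam", loss=ctc_loss)
```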
2.3 Second proposed model (ArabRecognizer 2)

The second proposed model takes the spectrogram rather than the MFCCs as the input feature. The spectrogram passes through two Conv2D layers, each with 32 × 32 filters. Both layers are followed by a ReLU activation function and a Batch Normalization layer. The first Conv2D layer has a stride of [2, 2] and a kernel size of [11, 41] without padding. The second Conv2D layer has a stride of [1, 2] and a kernel size of [11, 21] without padding. The acoustic model comprises six sequential GRU layers, each with 1024 units, and a feed-forward layer with 2048 units. A ReLU activation function is added, followed by a dropout layer to minimize overfitting. The output is derived from a Softmax layer, and the Adam optimizer is employed for training. The CTC loss function is utilized to calculate the prediction error and maximize the probability of the correct transcription. Figure 6 shows the architecture of the second proposed model (ArabRecognizer 2). Algorithm 2 shows the pseudo-code for the second proposed model (ArabRecognizer 2).
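A matching sketch for ArabRecognizer 2 follows. It differs from the previous one in its spectrogram input, its six GRU layers of 1024 units (written here as unidirectional, reading this paragraph literally), and its 2048-unit feed-forward layer. The number of frequency bins and the other caveats from the previous sketch are again assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_arabrecognizer2(n_freq: int = 193, vocab_size: int = 40):
    """ArabRecognizer 2 sketch: spectrogram -> 2x (Conv2D + ReLU + BatchNorm)
    -> 6 GRU layers (1024 units) -> Dense(2048) + Dropout -> Softmax."""
    inputs = layers.Input(shape=(None, n_freq), name="spectrogram")
    x = layers.Reshape((-1, n_freq, 1))(inputs)

    x = layers.Conv2D(32, [11, 41], strides=[2, 2], padding="same",
                      activation="relu")(x)
    x = layers.BatchNormalization()(x)
    x = layers.Conv2D(32, [11, 21], strides=[1, 2], padding="same",
                      activation="relu")(x)
    x = layers.BatchNormalization()(x)

    x = layers.Reshape((-1, x.shape[-2] * x.shape[-1]))(x)

    # Six sequential GRU layers, 1024 units each.
    for _ in range(6):
        x = layers.GRU(1024, return_sequences=True)(x)

    x = layers.Dense(2048, activation="relu")(x)
    x = layers.Dropout(0.5)(x)  # rate assumed
    outputs = layers.Dense(vocab_size + 1, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs, name="ArabRecognizer2")
```

As with the first model, it would be compiled with the Adam optimizer and the CTC loss shown earlier.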
Table 3 Comparison between proposed models and DeepSpeech2, experiment 2

Model                                       WER    Accuracy   Similarity   Average time per epoch
DeepSpeech2 (Amodei et al., 2016)           46%    54%        83%          46–50 min
First proposed model (ArabRecognizer 1)     44%    56%        84%          39–44 min
Second proposed model (ArabRecognizer 2)    41%    59%        86.5%        53–55 min
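The accuracy column in Table 3 is consistent with 1 − WER, and the similarity column is plausibly a Levenshtein-based string similarity (the paper cites Zhang et al., 2017, on Levenshtein distance). A minimal sketch of these measures under that assumption:

```python
def levenshtein(a, b):
    """Edit distance between two sequences (cf. Zhang et al., 2017)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance over reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    return levenshtein(ref, hyp) / max(len(ref), 1)

def similarity(reference: str, hypothesis: str) -> float:
    """Character-level similarity in [0, 1] derived from edit distance."""
    dist = levenshtein(reference, hypothesis)
    return 1 - dist / max(len(reference), len(hypothesis), 1)
```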
4 Discussion

(Figure: ArabRecognizer 2, Expt. 1 vs. Expt. 2)
DeepSpeech2 performs well in English, but it did not perform well enough in Arabic. Two proposed models were presented for recognizing MSA speech using Franco-Arabic. The use of Franco-Arabic has given promising results that encourage us to continue developing this approach. The models used the most common features in speech recognition, MFCCs and spectrograms. The best performance was obtained with only 69 h of training over 30 epochs: the first proposed model achieved 44% WER, while the second proposed model achieved 41% WER. The experiments showed that the first model demonstrated superior speed, while the second model excelled in accuracy. Both models achieved better WER than DeepSpeech2 and some well-known speech recognition methods, with fewer training hours. This paper demonstrates the possibility of relying on Franco-Arabic as an intermediary in Arabic language recognition systems. In future work, we plan to increase the dataset used and the number of epochs to enhance our models' accuracy. Additionally, we aim to address the challenge of distinguishing similar phonetic letters in Arabic. Finally, we intend to translate the two proposed models into practical real-world applications.
Funding This study was not funded by any organization.

Data availability Available.

Declarations

Conflict of interest The authors declare no conflict of interest.

Ethical approval No experiments involving humans or animals were conducted for this article.

Consent for publication All authors gave their consent.

References

Abdelhamid, A., Alsayadi, H. A., Hegazy, I., & Fayed, Z. T. (2020). End-to-end Arabic speech recognition: A review. Bibliotheca Alexandrina, Sep 2020. Retrieved Dec 12, 2023 from http://research.asu.edu.eg/handle/123456789/178165

Akasheh, W. M., Haider, A. S., Al-Saideen, B., & Sahari, Y. (2024). Artificial intelligence-generated Arabic subtitles: Insights from Veed.io's automatic speech recognition system of Jordanian Arabic. Texto Livre, 17, e46952. https://doi.org/10.1590/1983-3652.2024.46952

Al-Anzi, F. S., & AbuZeina, D. (2022). Synopsis on Arabic speech recognition. Ain Shams Engineering Journal, 13(2), 101534. https://doi.org/10.1016/j.asej.2021.06.020

AlHanai, T., Hsu, W.-N., & Glass, J. (2016). Development of the MIT ASR system for the 2016 Arabic multi-genre broadcast challenge. In 2016 IEEE spoken language technology workshop (SLT) (pp. 299–304), San Diego, CA, December 2016. IEEE. https://doi.org/10.1109/SLT.2016.7846280

Ali, M., Elshafei, M., Al-Ghamdi, M., & Al-Muhtaseb, H. (2009). Arabic phonetic dictionaries for speech recognition. Journal of Information Technology Research, 2(4), 67–80. https://doi.org/10.4018/jitr.2009062905

Alotaibi, Y. (2008). Comparative study of ANN and HMM to Arabic digits recognition systems. Journal of King Abdulaziz University-Engineering Science, 19(1), 43–60. https://doi.org/10.4197/Eng.19-1.3

Amodei, D., Ananthanarayanan, S., Anubhai, R., Bai, J., Battenberg, E., Case, C., Casper, J., Catanzaro, B., Cheng, Q., Chen, G., et al. (2016). Deep Speech 2: End-to-end speech recognition in English and Mandarin. In ICML 2016 (pp. 173–182). Accessed 1/2022.

Cardinal, P., et al. (2014). Recent advances in ASR applied to an Arabic transcription system for Al-Jazeera. In Proceedings of the annual conference of the International Speech Communication Association (Interspeech) (pp. 2088–2092), January 2014.

Chollet, F. (2021). Deep learning with Python (2nd ed.). Manning.

Common Voice dataset. https://commonvoice.mozilla.org/en/datasets. Accessed 2/2022.

Elmahdy, M., Gruhn, R., Minker, W., & Abdennadher, S. (2009). Modern standard Arabic based multilingual approach for dialectal Arabic speech recognition. In 2009 eighth international symposium on natural language processing (pp. 169–174), Bangkok, Thailand, October 2009. IEEE. https://doi.org/10.1109/SNLP.2009.5340923

Essa, E. M., Tolba, A. S., & Elmougy, S. (2008). A comparison of combined classifier architectures for Arabic speech recognition. In 2008 international conference on computer engineering & systems (pp. 149–153), Cairo, Egypt, November 2008. IEEE. https://doi.org/10.1109/ICCES.2008.4772985

Forsberg, M. (2003). Why is speech recognition difficult? Chalmers University of Technology, March 2003 (pp. 1–9).

Georgescu, A.-L., Pappalardo, A., Cucu, H., & Blott, M. (2021). Performance vs hardware requirements in state-of-the-art automatic speech recognition. EURASIP Journal on Audio, Speech, and Music Processing, 2021(1), 28. https://doi.org/10.1186/s13636-021-00217-4

Haraty, R. A., & El Ariss, O. (2007). CASRA+: A colloquial Arabic speech recognition application. American Journal of Applied Sciences, 4(1), 23–32. https://doi.org/10.3844/ajassp.2007.23.32

Hmad, N., & Allen, T. (2012). Biologically inspired continuous Arabic speech recognition. In M. Bramer & M. Petridis (Eds.), Research and development in intelligent systems XXIX (pp. 245–258). Springer.

Hussein, A., Watanabe, S., & Ali, A. (2022). Arabic speech recognition by end-to-end, modular systems and human. Computer Speech & Language, 71, 101272. https://doi.org/10.1016/j.csl.2021.101272

Hyassat, H., & Abu Zitar, R. (2006). Arabic speech recognition using SPHINX engine. International Journal of Speech Technology, 9(3–4), 133–150. https://doi.org/10.1007/s10772-008-9009-1

MGB2 dataset: https://arabicspeech.org/mgb2/

MGB3 dataset: https://arabicspeech.org/mgb3-asr-2/

MGB5 dataset: https://arabicspeech.org/mgb5/

Mohamed, O., Shedeed, H., Tolba, M., & Gadalla, M. (2013). Morpheme-based Arabic language modeling for automatic speech recognition. June 2013.

Moondra, A., & Chahal, P. (2023). Improved speaker recognition for degraded human voice using modified-MFCC and LPC with CNN. IJACSA. https://doi.org/10.14569/IJACSA.2023.0140416

Nasr, S., Duwairi, R., & Quwaider, M. (2023). End-to-end speech recognition for Arabic dialects. Arabian Journal for Science and Engineering, 48(8), 10617–10633. https://doi.org/10.1007/s13369-023-07670-7

Obaidah, Q. A., et al. (2024). A new benchmark for evaluating automatic speech recognition in the Arabic call domain. arXiv, 2024. https://doi.org/10.48550/ARXIV.2403.04280

O'Shaughnessy, D. (2008). Automatic speech recognition: History, methods and challenges. Pattern Recognition, 41(10), 2965–2979. https://doi.org/10.1016/j.patcog.2008.05.008

Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., Hannemann, M., Motlicek, P., Qian, Y., Schwarz, P., et al. (2011). The Kaldi speech recognition toolkit. In IEEE 2011 workshop on automatic speech recognition and understanding. IEEE Signal Processing Society, number EPFL-CONF-192584.

Rahman, A., Kabir, Md. M., Mridha, M. F., Alatiyyah, M., Alhasson, H. F., & Alharbi, S. S. (2024). Arabic speech recognition: Advancement and challenges. IEEE Access, 12, 39689–39716. https://doi.org/10.1109/ACCESS.2024.3376237

Rana, R. (2016). Gated Recurrent Unit (GRU) for emotion classification from noisy speech. arXiv, 2016. https://doi.org/10.48550/ARXIV.1612.07778

Yu, D., Eversole, A., Seltzer, M., Yao, K., Huang, Z., Guenter, B., Kuchaiev, O., Zhang, Y., Seide, F., Wang, H., et al. (2014). An introduction to computational networks and the computational network toolkit. Technical report.

Zhang, S., Hu, Y., & Bian, G. (2017). Research on string similarity algorithm based on Levenshtein Distance. In 2017 IEEE 2nd advanced information technology, electronic and automation control conference (IAEAC) (pp. 2247–2251), Chongqing, China, March 2017. IEEE. https://doi.org/10.1109/IAEAC.2017.8054419

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.