
International Journal of Speech Technology

https://doi.org/10.1007/s10772-024-10130-8

ArabRecognizer: modern standard Arabic speech recognition inspired by DeepSpeech2 utilizing Franco-Arabic

Mohammed M. Nasef · Amr A. Elshall · Amr M. Sauber
Mathematics and Computer Science Department, Faculty of Science, Menoufia University, Shebin el Koom 32511, Egypt

Received: 18 February 2024 / Accepted: 4 July 2024


© The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2024

Abstract
Speech recognition is a critical task in spoken language applications. Globally known models such as DeepSpeech2 are effective for English speech recognition; however, they are not well suited for languages like Arabic. This paper is concerned with recognizing the Arabic language, especially Modern Standard Arabic (MSA). It proposes two models that utilize "Franco-Arabic" as an encoding mechanism, together with additional enhancements, to recognize MSA. The first model uses Mel-Frequency Cepstral Coefficients (MFCCs) as input features, while the second employs six sequential Gated Recurrent Unit (GRU) layers. Each model is followed by a fully connected layer with a dropout layer, which helped reduce overfitting. The Connectionist Temporal Classification (CTC) loss is used to calculate the prediction error and to maximize the likelihood of the correct transcription. Two experiments were conducted for each model. The first experiment involved 41 h of continuous speech over 15 epochs, whereas the second utilized 69 h over 30 epochs. The experiments showed that the first model excels in speed while the second excels in accuracy, and both outperformed the well-known DeepSpeech2.

Keywords Franco-Arabic · Modern Standard Arabic · MSA · Speech recognition · Arabic speech recognition · DeepSpeech2 · GRU · CTC loss

1 Introduction

Artificial Intelligence has significantly impacted the media around us, notably images, written words, and especially the processing of audio signals. Many useful ideas, research efforts, and projects have emerged in audio processing, such as speaker identification, speech recognition, gender recognition, and other notable research (Chollet, 2021). Automatic Speech Recognition (ASR) is the process of transforming human voice signals into words or commands (O'Shaughnessy, 2008).

English ASR studies have received much more interest and attention than Arabic ASR (Abdelhamid et al., 2020). Although the Arabic language is regarded as one of the oldest, richest, and most diverse languages and is ranked as the fourth most spoken language in the world, the current research progress is not yet satisfactory compared to other languages (Abdelhamid et al., 2020). There are three classes of the Arabic language (Elmahdy et al., 2009): (1) Classical Arabic is the language of the Quran, the Hadith, and classic Arabic poetry. (2) Arabic Dialects have multiple regional forms used informally for daily spoken communication in different regions. With the rise of social media, Arabic dialects also need to be written. Each Arabic dialect has its own unique set of features and properties, which makes it hard to process data from these different varieties. (3) Modern Standard Arabic (MSA) is based on Classical Arabic with simplified phonology, emphasizing clarity and comprehensibility. MSA is used for formal communication, newspapers, modern books, and news. It serves as an official, well-recognized, and unified language understood by all Arabic speakers. Additionally, it facilitates streamlined processing by having a lower complexity level compared to Classical Arabic.

There are many challenges facing Arabic ASR (Haraty & El Ariss, 2007; Rahman et al., 2024), such as the large number of phonemes (Al-Anzi & AbuZeina, 2022; Ali et al., 2009; Alotaibi, 2008; Essa et al., 2008), grammatical and morphological complexity (Forsberg, 2003; Hussein et al., 2022; Mohamed et al., 2013), and the intensive need for computational power (Georgescu et al., 2021). Hence, few studies in Arabic ASR have been published due to these challenges.

An Arabic speech recognizer based on the SPHINX-IV framework was proposed together with an automatic toolkit able to generate a pronunciation dictionary for both the standard Arabic language and the Holy Qur'an (Hyassat & Abu Zitar, 2006). Three corpora were utilized as datasets: the Holy Qur'an Corpus (HQC-1), the command and control corpus (CAC-1), and the Arabic digits corpus (ADC). The training process employed a Hidden Markov Model (HMM), resulting in a 46.18% Word Error Rate (WER).

An ASR system was presented using a 1200-h speech corpus. The baseline recognizer utilized Gaussian Mixture Models (GMMs) and HMMs, trained on 39-dimensional MFCCs (AlHanai et al., 2016). The study employed a one-to-one mapping between characters and acoustic units, encompassing 960,000 word entries and 38 acoustic units. Feature extraction utilized the KALDI speech recognition toolkit (Povey et al., 2011). Acoustic model training utilized the CNTK toolkit, while language models were built using the SRILM toolkit (Yu et al., 2014). The developed system incorporated a Deep Neural Network (DNN) structure, integrating various techniques such as feed-forward and convolutional layers, time-delay networks, Long Short-Term Memory (LSTM), highway LSTM (H-LSTM), and grid LSTM (GLSTM). Evaluation on the corpus demonstrated significant performance, achieving an 18.3% WER using trained GLSTM models.

Three distinct system architectures inspired by biological methods were developed for Arabic ASR (Hmad & Allen, 2012). These systems were trained using an Arabic phoneme database (APD) manually extracted from the King Abdulaziz Arabic Phonetics Database (KAPD). The mapping converted each Arabic character into one or two English characters using special symbols. The dataset was employed to train and evaluate three different Multilayer Perceptron (MLP) neural network architectures for phoneme recognition. Each system utilized MFCCs for feature extraction and adopted dataset normalization techniques for training and testing. The systems achieved WERs of 47.52%, 44.58%, and 46.63%, respectively.

A comprehensive comparison of advanced speech recognition techniques is presented in Cardinal et al. (2014). The study utilized a corpus comprising 50 h of transcribed audio from the news channel "Al Jazeera" to train various approaches. The mapping employed was word-to-word, taking diacritics into account in the mapping process, which adds complexity and processing time. The DNN/HMM hybrid approach used the Minimum Phone Error (MPE) criterion, trained sequentially with a DNN, and achieved the highest accuracy; the results yielded a WER of 17.86% for broadcast news reports, 29.85% for broadcast conversations, and an overall WER of 25.6%. It is worth mentioning that the study utilized a corpus comprising 50 h of transcribed audio rather than using the standard Common Voice dataset.

A study on ASR explored end-to-end approaches using three distinct datasets: MGB2 (https://arabicspeech.org/mgb2/), MGB3 (https://arabicspeech.org/mgb3-asr-2/), and MGB5 (https://arabicspeech.org/mgb5/), yielding WERs of 12.5%, 27.5%, and 33.8%, respectively (Hussein et al., 2022). The research involved partitioning each dataset into specific audio clip lengths for experimental purposes. The model was applied iteratively to each segmented dataset, and the resulting average accuracies were reported. Several critical points require discussion. Firstly, the paper uses 1200 h for training from MGB2, but the number of hours for development and evaluation is low and limited to only 10 h; it is worth noting that the entire MGB2 dataset comprises only 1,200 h. Secondly, regarding the choice of datasets, MGB2 comprises recordings from "Al Jazeera" broadcasts, encompassing diverse dialects such as those of correspondents, callers, and broadcasters, potentially impacting model accuracy and misleading it. MGB3 focuses on the Egyptian dialect, limiting the applicability of the ASR system to Egyptian speakers exclusively; similarly, MGB5 targets the Moroccan dialect. Hence, utilizing an MSA dataset would ensure broader applicability across Arabic speakers. Finally, the complexity and extensive use of special characters in the lookup table pose challenges in model training.

A comprehensive benchmark for Arabic speech recognition using ASR technologies was introduced (Obaidah et al., 2024). The authors collected a dataset of 132 h of speech data from calls between agents and clients across the Arab region. Five different ASR systems, all designed to support multiple languages, were utilized to compare their performance on the collected dataset. The evaluation results were as follows: Whisper, developed by OpenAI, achieved the worst performance with a WER of 83.8%, Azure API achieved a WER of 71.88%, Google API achieved a WER of 67.1%, Meta M4T V1, developed by Meta, achieved a WER of 67.8%, and Chirp, developed by Google, achieved the best performance with the lowest WER of 48.9%.
Several baseline sequence-to-sequence deep neural models for Arabic dialects and MSA were designed (Nasr et al., 2023). The models include DeepSpeech2, Bidirectional Long Short-Term Memory (Bi-LSTM) with attention, Bi-LSTM without attention, LSTM with attention, and LSTM without attention. The Bi-LSTM with attention achieved encouraging results with a 59% WER on the Yemeni speech corpus, 83% WER on the Jordanian speech corpus, and 53% WER on the multi-dialectal Yemeni-Jordanian-Arabic speech corpus. The authors fine-tuned DeepSpeech2 and achieved a WER of 31% for the Yemeni corpus, 68% for the Jordanian corpus, and 30% for the multi-dialectal Arabic corpus. Additionally, that paper demonstrates the fine-tuning of DeepSpeech2 on the Common Voice MSA dataset, where the evaluation on 84 h resulted in 86% WER.

To address the problems mentioned above, such as the use of special characters, the mapping and conversion of Arabic characters, and the choice of an appropriate dataset, this paper presents two proposed models that suggest some ideas for solving these problems and the problems of the Arabic language, in order to enhance the models' performance.

The following points summarize the contributions of this paper:

• We have developed two models to address the Arabic speech recognition problem. The first model excels in speed while the second excels in accuracy.
• Both proposed models consume minimal data for training to achieve good preliminary results, unlike other models that require more than 1000 h.
• Both models have been trained on MSA to enhance their generalizability and usability for all Arabic speakers.
• The use of Franco-Arabic in a scientific way to develop a promising field in AI like speech recognition, even though Franco-Arabic is unscientific, has many disadvantages, and negatively affects both the Arabic and English languages.
• The use of the most common features in speech recognition, namely Mel-Frequency Cepstral Coefficients (MFCCs) and the spectrogram, and a comparative analysis of the results.

The rest of the paper is organized as follows: Sect. 2 presents the proposed methodologies, which include the preprocessing steps, the lookup table, and the proposed models. Section 3 introduces the experimental results, covering the dataset, computational power, and performance measures. Section 4 discusses the advantages and limitations of the proposed models. Finally, Sect. 5 introduces the conclusion of this paper.

2 Proposed methodologies

DeepSpeech2 (Amodei et al., 2016) stands out as a formidable model for addressing challenges facing English speech recognition. Despite its efficacy in English, DeepSpeech2 does not extend its proficiency to the realm of Arabic speech recognition. Consequently, this section delineates the methodology employed for Arabic ASR. The methodology encompasses signal preprocessing, converting Arabic characters to Franco-Arabic characters through a specifically designed lookup table, and introducing two proposed models. These models are inspired by DeepSpeech2 in an attempt to achieve appropriate performance and solve the problems of Arabic ASR.

2.1 Pre-processing

The preprocessing stage is divided into two distinct phases. The first phase undertakes the conversion of letters from Arabic to Franco-Arabic, whereas in the second phase the MFCCs and spectrogram features are extracted from the WAV files.

2.1.1 Letter pre-processing

The Arabic language is complicated and presents a formidable challenge for any model to process. This difficulty arises from the presence of numerous special characters and unique script shapes, such as "أ", "ث", "ض", "و", and "ئ", and various compound letters formed by combining multiple shapes. Additionally, each letter takes several forms, and its form depends on its position in the word, as it can be connected to another letter or not. For instance, the letter "ع" is written one way if it stands alone, as in "المع" if it is at the end of the word, as in "ميعاد" if it is connected to another letter, and finally as in "عمرو" if it is at the beginning of the word.

Recently, Franco-Arabic has become more and more widely used since its inception, due to the emergence of video games and cultural fusion. Franco-Arabic is a method of typing Arabic words using English letters and numbers. Writing in this way causes many problems, weakens the command of the Arabic language among children and youth, and reduces the usage of proper English. Therefore, this paper employs Franco-Arabic in a scientific manner to harness its full potential. Numerous challenges emerged while constructing the lookup table, foremost among them the existence of Arabic letters that lack counterparts in Franco-Arabic, such as "ؤ" and "ئ". Furthermore, many letters in the Arabic language are represented by a single letter in Franco-Arabic; for example, "ذ", "ز", and "ظ" are all represented as "z" in Franco-Arabic because they have similar phonetics. To solve these problems, additional characters that belong to the English language but are not used in Franco-Arabic were added, to create a complete lookup table for a good mapping between letters. Table 1 shows the lookup table used to map between letters.

By using this lookup table, all the Arabic sentences in the dataset can be converted to Franco-Arabic sentences. Figure 1 shows the conversion process from Arabic to Franco-Arabic.
2.1.2 Signal pre-processing

Each audio clip in the dataset has a different sample rate. During the signal pre-processing phase, all audio files were resampled to a unified rate of 16 kHz. Figure 2 shows the steps of signal preprocessing.

In this paper, MFCCs and spectrograms were used as audio features, because MFCCs are specifically designed to model human auditory perception, emphasizing the frequencies that are most important for understanding speech (Moondra & Chahal, 2023), and because spectrograms are the main input feature of DeepSpeech2. Figure 3 shows how MFCCs are obtained from a waveform signal, and Fig. 4 shows how a spectrogram is obtained from a waveform signal.

2.2 First proposed model (ArabRecognizer 1)

The input feature of DeepSpeech2 is a spectrogram. In the first proposed model, the input feature was changed from the spectrogram to MFCCs. Additionally, we changed the RNN utilized in the acoustic model to a bidirectional GRU. The GRU was introduced in 2014 to fix the vanishing gradient problem faced by standard Recurrent Neural Networks (RNNs), so the GRU outperforms the vanilla RNN. The GRU exhibits many characteristics similar to the LSTM but with lower complexity and faster computation (Rana, 2016). Figure 5 shows the architecture of the baseline model and the proposed model with its modifications.

Table 1  Lookup table (Arabic letter → Franco-Arabic letter)

ا → a    ذ → z    ظ → 1    و → w
ب → b    ر → r    ع → 3    ي → y
ت → t    ز → Z    غ → 8    ى → i
ث → c    س → s    ف → f    أ → 2
ج → g    ش → 4    ق → q    إ → A
ح → 7    ص → 9    ك → k    ؤ → W
خ → 5    ض → 0    ل → l    ئ → Y
د → d    ط → 6    م → m    ـئـ → X
ن → n    ه → h    ة → H    ء → x

Fig. 1  Mapping process
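To make the mapping process concrete, here is a small sketch of the letter pre-processing step, assuming a Python dictionary built from Table 1; the function name and the handling of unknown characters are illustrative choices, not the authors' code, and the compound medial form "ـئـ" (mapped to "X" in Table 1) would need to be matched before single letters.

```python
# Lookup built from Table 1 (Arabic letter -> Franco-Arabic symbol).
ARABIC_TO_FRANCO = {
    "ا": "a", "ب": "b", "ت": "t", "ث": "c", "ج": "g", "ح": "7", "خ": "5",
    "د": "d", "ذ": "z", "ر": "r", "ز": "Z", "س": "s", "ش": "4", "ص": "9",
    "ض": "0", "ط": "6", "ظ": "1", "ع": "3", "غ": "8", "ف": "f", "ق": "q",
    "ك": "k", "ل": "l", "م": "m", "ن": "n", "ه": "h", "ة": "H", "و": "w",
    "ي": "y", "ى": "i", "أ": "2", "إ": "A", "ؤ": "W", "ئ": "Y", "ء": "x",
    " ": " ",
}

def arabic_to_franco(sentence: str) -> str:
    """Convert an Arabic transcript into its Franco-Arabic encoding,
    character by character; unknown characters are left unchanged."""
    return "".join(ARABIC_TO_FRANCO.get(ch, ch) for ch in sentence)
```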

Fig. 2  Signal preprocessing diagram

Fig. 3  Feature Extraction (MFCCs)

Fig. 4  Feature Extraction (Spectrogram)
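A hedged sketch of the resampling and feature-extraction steps of Figs. 2–4, assuming the librosa library; the exact MFCC and spectrogram parameters (number of coefficients, FFT size, hop length) are not reported in the paper, so the values below are illustrative defaults.

```python
import numpy as np
import librosa

TARGET_SR = 16_000  # all clips are resampled to a unified 16 kHz rate

def load_audio(path: str) -> np.ndarray:
    # librosa resamples to the requested rate while loading
    signal, _ = librosa.load(path, sr=TARGET_SR, mono=True)
    return signal

def extract_mfcc(signal: np.ndarray, n_mfcc: int = 40) -> np.ndarray:
    """MFCC features (time x coefficients), the input of ArabRecognizer 1."""
    mfcc = librosa.feature.mfcc(y=signal, sr=TARGET_SR, n_mfcc=n_mfcc)
    return mfcc.T

def extract_spectrogram(signal: np.ndarray, n_fft: int = 384,
                        hop_length: int = 160) -> np.ndarray:
    """Log-magnitude spectrogram (time x frequency bins), the input of ArabRecognizer 2."""
    stft = librosa.stft(signal, n_fft=n_fft, hop_length=hop_length)
    return np.log(np.abs(stft).T + 1e-10)
```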



Fig. 5  First proposed model (ArabRecognizer 1)

In the first proposed model (ArabRecognizer 1), the MFCCs are the input to multiple convolutional layers. This model uses two Conv2D layers with 32 filters each. Each of them is followed by a Rectified Linear Unit (ReLU) activation function and a Batch Normalization layer. The first Conv2D layer has a stride of [2, 2] and a kernel size of [11, 41] without padding, and the second Conv2D layer has a stride of [1, 2] and a kernel size of [11, 21] without padding. The acoustic model consists of a bidirectional RNN (bi-RNN) using five sequential GRU layers with 512 units each, followed by a fully connected layer with 1024 units and a ReLU activation function, and then a dropout layer to reduce overfitting. Finally, the output is obtained from the Softmax layer. The Adam optimizer is used to optimize the CTC loss, which is used to calculate the prediction error and maximize the likelihood of the correct transcription. Algorithm 1 shows the pseudo-code for the first proposed model (ArabRecognizer 1).

Algorithm 1  ArabRecognizer 1 pseudo code
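Since the pseudo-code figure is not reproduced here, the following is a hedged Keras sketch of the ArabRecognizer 1 architecture as described above: MFCC input, two Conv2D + ReLU + BatchNorm blocks, five bidirectional GRU layers of 512 units, a 1024-unit dense layer with dropout, and a Softmax output trained with the CTC loss. Hyper-parameters not stated in the paper, such as the number of MFCC coefficients, the dropout rate, the learning rate, and the padding mode ("same" is used so the sketch remains shape-safe even though the paper states no padding), are assumptions.

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

def ctc_loss(y_true, y_pred):
    """CTC loss: maximizes the likelihood of the correct transcription."""
    batch_len = tf.cast(tf.shape(y_true)[0], dtype="int64")
    input_len = tf.cast(tf.shape(y_pred)[1], dtype="int64")
    label_len = tf.cast(tf.shape(y_true)[1], dtype="int64")
    input_len = input_len * tf.ones(shape=(batch_len, 1), dtype="int64")
    label_len = label_len * tf.ones(shape=(batch_len, 1), dtype="int64")
    return keras.backend.ctc_batch_cost(y_true, y_pred, input_len, label_len)

def build_arabrecognizer1(n_mfcc=40, vocab_size=37, rnn_units=512):
    inputs = layers.Input(shape=(None, n_mfcc), name="mfcc")
    x = layers.Reshape((-1, n_mfcc, 1))(inputs)

    # Two Conv2D blocks, 32 filters each, each followed by ReLU and BatchNorm
    x = layers.Conv2D(32, kernel_size=[11, 41], strides=[2, 2], padding="same")(x)
    x = layers.ReLU()(x)
    x = layers.BatchNormalization()(x)
    x = layers.Conv2D(32, kernel_size=[11, 21], strides=[1, 2], padding="same")(x)
    x = layers.ReLU()(x)
    x = layers.BatchNormalization()(x)

    # Collapse the frequency and channel axes into one feature vector per frame
    x = layers.Reshape((-1, x.shape[-2] * x.shape[-1]))(x)

    # Acoustic model: five sequential bidirectional GRU layers
    for _ in range(5):
        x = layers.Bidirectional(layers.GRU(rnn_units, return_sequences=True))(x)

    # Fully connected layer with dropout, then Softmax over characters + CTC blank
    x = layers.Dense(1024, activation="relu")(x)
    x = layers.Dropout(0.5)(x)
    outputs = layers.Dense(vocab_size + 1, activation="softmax")(x)

    model = keras.Model(inputs, outputs, name="ArabRecognizer1")
    model.compile(optimizer=keras.optimizers.Adam(1e-4), loss=ctc_loss)
    return model
```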

2.3 Second proposed model (ArabRecognizer 2)

The second proposed model takes the spectrogram rather than the MFCCs as the input feature. The spectrogram passes through two Conv2D layers with 32 filters each. Both layers are followed by a ReLU activation function and a Batch Normalization layer. The first Conv2D layer has a stride of [2, 2] and a kernel size of [11, 41] without padding, and the second Conv2D layer has a stride of [1, 2] and a kernel size of [11, 21] without padding. The acoustic model comprises six sequential GRU layers, each with 1024 units, and a feed-forward layer with 2048 units. A ReLU activation function is added and followed by a dropout layer to minimize overfitting. The output is derived from a Softmax layer, and the "Adam" optimizer is employed for training. The CTC loss function is utilized to calculate the prediction error and maximize the probability of the correct transcription. Figure 6 shows the second proposed model (ArabRecognizer 2) architecture, and Algorithm 2 shows its pseudo-code.

Algorithm 2  ArabRecognizer 2 pseudo code
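Analogously, a hedged sketch of ArabRecognizer 2: the input is the spectrogram instead of MFCCs, the acoustic model uses six GRU layers of 1024 units, and the feed-forward layer grows to 2048 units. It reuses the convolutional front end and the ctc_loss function from the ArabRecognizer 1 sketch above; whether these GRU layers are bidirectional is not stated explicitly, so plain (unidirectional) GRUs are shown, and the number of spectrogram bins is an assumption tied to the feature-extraction sketch.

```python
from tensorflow import keras
from tensorflow.keras import layers
# ctc_loss is the same function defined in the ArabRecognizer 1 sketch above.

def build_arabrecognizer2(n_freq_bins=193, vocab_size=37):
    inputs = layers.Input(shape=(None, n_freq_bins), name="spectrogram")
    x = layers.Reshape((-1, n_freq_bins, 1))(inputs)

    # Same convolutional front end as ArabRecognizer 1 (32 filters each)
    x = layers.Conv2D(32, kernel_size=[11, 41], strides=[2, 2], padding="same")(x)
    x = layers.ReLU()(x)
    x = layers.BatchNormalization()(x)
    x = layers.Conv2D(32, kernel_size=[11, 21], strides=[1, 2], padding="same")(x)
    x = layers.ReLU()(x)
    x = layers.BatchNormalization()(x)
    x = layers.Reshape((-1, x.shape[-2] * x.shape[-1]))(x)

    # Acoustic model: six sequential GRU layers with 1024 units each
    for _ in range(6):
        x = layers.GRU(1024, return_sequences=True)(x)

    # Feed-forward layer with 2048 units, ReLU, and dropout, then Softmax
    x = layers.Dense(2048, activation="relu")(x)
    x = layers.Dropout(0.5)(x)
    outputs = layers.Dense(vocab_size + 1, activation="softmax")(x)

    model = keras.Model(inputs, outputs, name="ArabRecognizer2")
    model.compile(optimizer=keras.optimizers.Adam(1e-4), loss=ctc_loss)
    return model
```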

3 Experimental results

To evaluate the efficacy of the two proposed models, two experiments were conducted comparing DeepSpeech2 with the proposed models to determine their accuracy. This section provides a detailed description of the dataset, the computational resources used, and the metrics employed to assess performance.

3.1 Dataset

The dataset employed in this paper is sourced from "commonvoice.mozilla.org" and is named the "Common Voice" dataset (Common Voice dataset: https://commonvoice.mozilla.org/en/datasets, 2022). It comprises approximately 120,000 audio clips, each ranging in length from 3 to 8 s. It was prepared by 1237 participants of diverse ages and genders, and all contributors spoke Modern Standard Arabic. The audio format of the dataset is "MP3". Multiple versions of the dataset exist, and for this paper, version "9.0" is utilized.

Fig. 6  Second Proposed Model (ArabRecognizer 2)

3.2 Computational power

Computational power constituted one of the prevailing challenges, significantly impacting the workflow. In the initial stages of the study, servers equipped with Central Processing Units (CPUs) were employed. The specifications of these servers were as follows:

3.2.1 The first server

Processor: Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10 GHz, RAM: 256 GB, Number of cores: 16.

3.2.2 The second server

Processor: Intel(R) Xeon(R) Gold 5120 CPU @ 2.20 GHz, RAM: 512 GB, Number of cores: 14.

Considering the aforementioned specifications, the time of each epoch was significantly extended. Consequently, it was necessary to find a Graphics Processing Unit (GPU) source to improve workflow efficiency and accelerate the training process. Therefore, all the models mentioned were trained using Colab Pro, and the reported results stem from the usage of Colab Pro.

Unfortunately, a disconnection issue was encountered in Colab Pro, stemming from resource constraints: Colab Pro stops functioning once the available resources are used up, preventing the model from completing the specified number of epochs and leading to the loss of all the weights. To address this challenge, a solution was devised by implementing a function that saves the weights after each epoch. Consequently, in the event of a disconnection during training, the model can reload the saved weights from the file and resume training from the latest checkpoint.
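A minimal sketch of this per-epoch weight-saving strategy, assuming a Keras model and a hypothetical checkpoint path; the authors describe a custom save/reload function, and the built-in ModelCheckpoint callback shown here is simply one common way to achieve the same effect.

```python
import os
from tensorflow import keras

# Hypothetical path; any persistent storage (e.g. a mounted Google Drive) works.
CKPT_PATH = "checkpoints/arabrecognizer.weights.h5"

def train_with_resume(model, train_ds, val_ds, epochs=30):
    """Save weights after every epoch and resume from the last checkpoint
    if a Colab disconnection interrupted a previous run."""
    if os.path.exists(CKPT_PATH):
        model.load_weights(CKPT_PATH)  # resume from the latest saved weights

    checkpoint_cb = keras.callbacks.ModelCheckpoint(
        filepath=CKPT_PATH,
        save_weights_only=True,   # weights are enough to resume training
        save_freq="epoch",        # write a checkpoint at the end of each epoch
    )
    model.fit(train_ds, validation_data=val_ds,
              epochs=epochs, callbacks=[checkpoint_cb])
```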

Fig. 7  First Experiment Results



Table 2  Comparison between the proposed models and DeepSpeech2 in the first experiment

Model                                          WER    Accuracy   Average Similarity
DeepSpeech2 (Amodei et al., 2016)              72%    28%        71%
The first proposed model (ArabRecognizer 1)    70%    30%        73%
The second proposed model (ArabRecognizer 2)   65%    35%        75%

3.3 Performance measures

In this paper, two metrics were utilized to assess the performance of the output. The primary metric is the Word Error Rate (WER), widely recognized and frequently utilized in speech recognition tasks. The second metric is a "Similarity Measure" employed to assess the similarity between the target sentence and the predicted sentence. WER serves as a prevalent metric for testing the performance of speech recognition or machine translation systems. The inherent challenge in performance measurement arises from the potential disparity in length between the predicted sequence and the target (correct) sequence. WER proves to be a valuable tool for system comparison and for assessing enhancements within a single system. However, it lacks granularity regarding error types (Al-Anzi & AbuZeina, 2022). It is anticipated that substitutions, deletions, and insertions may manifest in words or sequences. The computation of WER can be expressed as indicated in Eq. 1 (Akasheh et al., 2024; Al-Anzi & AbuZeina, 2022):

WER = (S + D + I) / N    (1)

where S is the number of substitutions, D is the number of deletions, I is the number of insertions, and N is the number of words in the target sequence. The model accuracy is then Accuracy = 1 − WER.

In the majority of speech recognition research, the WER is a commonly used metric to evaluate model quality. However, WER can be deemed unfair in certain instances. This unfairness stems from its approach of counting the incorrect words in the predicted sentence: in a scenario where a sentence comprises only two words, with one of them containing a single-letter error, WER evaluates the entire word as incorrect, which means 50% of the sentence is counted as incorrect. To address this limitation, this paper introduces an alternative measure for assessing the similarity between the target and predicted sentences. This new measure is based on the Levenshtein distance (Zhang et al., 2017). The Levenshtein distance is a string metric used to measure the difference between two sequences. It represents the minimum number of single-character edits, such as insertions, deletions, or substitutions, needed to change one word into the other (Akasheh et al., 2024).

Furthermore, the output of the Levenshtein distance represents the count of incorrect characters in the predicted sequence (Zhang et al., 2017). In this method, the target and predicted sequences are aligned together to match their lengths. Subsequently, the Levenshtein distance function is applied to determine the number of characters requiring modification to reach the target sequence, or equivalently, the count of incorrect characters in the predicted sequence. Finally, the function computes the similarity percentage between the two sentences. The equation for sequence similarity is expressed as follows in Eq. 2:

SeqSim(Xt, Yp) = (E / Len) × 100    (2)

where E is the output of the Levenshtein distance function, Len is the length of either sequence after alignment, Xt is the target sequence, and Yp is the predicted sequence.

Fig. 8  Second experiment results

Fig. 9  Training process results using Franco-Arabic
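For illustration, a small Python sketch of both measures under stated assumptions: the word-level edit distance over the number of target words gives Eq. 1, and the character-level Levenshtein output over the aligned length gives the similarity of Eq. 2. The similarity is reported here as the complement of the character error so that a perfect prediction scores 100%, and simple end-padding stands in for the alignment step, since neither detail is spelled out in the paper. The example strings are made up.

```python
def levenshtein(a, b):
    """Minimum number of single-element insertions, deletions, or
    substitutions needed to turn sequence `a` into sequence `b`."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]

def word_error_rate(target, predicted):
    """Eq. 1: WER = (S + D + I) / N over the words of the target sentence."""
    t_words, p_words = target.split(), predicted.split()
    return levenshtein(t_words, p_words) / len(t_words)

def sequence_similarity(target, predicted):
    """Similarity in the spirit of Eq. 2: pad both sequences to the same
    length, count wrong characters with the Levenshtein distance, and
    convert the result into a percentage."""
    length = max(len(target), len(predicted))
    wrong = levenshtein(target.ljust(length), predicted.ljust(length))
    return 100.0 * (1.0 - wrong / length)

# Example usage with made-up Franco-Arabic style transcriptions:
# word_error_rate("ana ra2e7 elgam3a", "ana raye7 elgam3a")       -> 1/3
# sequence_similarity("ana ra2e7 elgam3a", "ana raye7 elgam3a")   -> ~94%
```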

Table 3  Comparison between the proposed models and DeepSpeech2 in the second experiment

Model                                          WER    Accuracy   Average Similarity   Time per epoch
DeepSpeech2 (Amodei et al., 2016)              46%    54%        83%                  46–50 min
The first proposed model (ArabRecognizer 1)    44%    56%        84%                  39–44 min
The second proposed model (ArabRecognizer 2)   41%    59%        86.5%                53–55 min

3.4 First experimental results

In the first experiment, all three models underwent training using 27,000 audio files and were subsequently tested on 3,000 audio files drawn from the aforementioned dataset. Each audio clip had a duration ranging from 3 to 8 s. The cumulative training duration for the three models in this experiment amounted to approximately 38 h of continuous speech, and the models were trained for 15 epochs. This section presents the distinctions among the three models concerning prediction accuracy and the similarity between target and predicted sentences for each model.

The best outcomes in terms of prediction accuracy and sentence similarity were observed with the second proposed model (ArabRecognizer 2). It achieved a prediction accuracy of 35%, with an average similarity of 75%. Figure 7 illustrates the correlation between the reduction in WER and the incremental increase in the number of epochs. Table 2 presents a comparative analysis of the three models in the first experiment, evaluating them based on prediction accuracy and sentence similarity.

Based on the preceding results, the second proposed model exhibited higher prediction accuracy and lower WER. However, the overall results were deemed unsatisfactory. A notable observation from this experiment is the consistent decrease in WER, as evident in Fig. 7. This implies that the three models require more data to further reduce the WER and achieve higher accuracy. With an increase in data, it becomes imperative to correspondingly augment the number of epochs to ensure effective training of the models on the expanded dataset.

3.5 Second experimental results

In the second experiment, the three models underwent training on a dataset comprising 45,000 audio clips and were subsequently evaluated on 5,000 audio files. This implies that the models underwent a cumulative training duration of approximately 63 h of continuous speech, and the three models were trained for 30 epochs. The second proposed model, denoted as "ArabRecognizer 2", achieved the highest prediction accuracy, reaching approximately 59%. Notably, the average similarity between the predicted and target sentences within the test set was approximately 86%. Figure 8 illustrates the distinctions among the three models and their WER reduction rates with the increasing number of epochs. Table 3 presents a comparative analysis of the three models in the second experiment, based on both prediction accuracy and the similarity between the predicted and actual sentences.

Figure 9 shows some sentences that the "ArabRecognizer 2" model predicted during the testing phase using Franco-Arabic, and Fig. 10 shows some of the final results of the "ArabRecognizer 2" model, which achieves 59% prediction accuracy.

Fig. 10  ArabRecognizer 2 final output

Upon analyzing the outcomes of the first and second experiments, increasing the data and the number of epochs resulted in heightened prediction accuracy for both proposed models and an increase in average sentence similarity. Notably, the disparity in data size between the two experiments amounted to 20,000 audio clips, corresponding to approximately 28 h. Despite this modest increase, the models in the second experiment demonstrated significantly improved performance compared to the first experiment. Specifically, the prediction accuracy surged by approximately 25% in the second experiment. Figure 11 illustrates the progress made by the "ArabRecognizer 2" model in both the first and second experiments, displaying a consistent reduction in WER.

Figure 11 shows a comparative analysis between the two experiments. The first experiment involved 30,000 audio clips and 15 epochs, whereas the second experiment incorporated 50,000 audio clips and extended to 30 epochs. The first experiment concluded at 15 epochs due to the emergence of overfitting. Consequently, the dataset size for the second experiment was increased to 50,000 audio clips; the WER then exhibited a continuous decrease up to the 15th epoch without encountering overfitting issues, and the number of epochs was extended to 30, until the overfitting problem appeared again.

The first proposed model, "ArabRecognizer 1", exhibited the shortest time for completing one epoch, with an average runtime falling within the range of 39 to 44 min. This model did not achieve notably high prediction accuracy, but the achieved accuracy remains acceptable. Remarkably, its efficiency in minimizing the time required for one epoch renders it particularly advantageous for online applications where time is a significant factor.
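As an illustration of how Franco-Arabic predictions such as those in Figs. 9 and 10 can be produced and mapped back to Arabic script, here is a hedged sketch assuming a trained Keras model whose Softmax output follows the CTC convention; the character inventory, its index ordering, and the helper names are assumptions, not the authors' code.

```python
from tensorflow import keras

# Assumed output alphabet: the Franco-Arabic symbols of Table 1 plus a space.
# The index order used during training is not given in the paper.
characters = ["a", "b", "t", "c", "g", "7", "5", "d", "z", "r", "Z", "s", "4",
              "9", "0", "6", "3", "8", "f", "q", "k", "l", "m", "n", "h", "H",
              "w", "y", "i", "2", "A", "W", "Y", "X", "x", "1", " "]
idx_to_char = dict(enumerate(characters))

def greedy_ctc_decode(y_pred, input_lengths):
    """Collapse the frame-wise Softmax output into Franco-Arabic strings using
    greedy CTC decoding (repeated symbols merged, blank frames removed)."""
    decoded, _ = keras.backend.ctc_decode(y_pred, input_length=input_lengths,
                                          greedy=True)
    texts = []
    for seq in decoded[0].numpy():
        texts.append("".join(idx_to_char[int(i)] for i in seq if i != -1))
    return texts

# Inverting (an excerpt of) Table 1 turns a prediction back into Arabic script.
FRANCO_TO_ARABIC = {"a": "ا", "l": "ل", "m": "م", "3": "ع", "r": "ر",
                    "w": "و", "y": "ي", " ": " "}

def to_arabic(franco_text):
    return "".join(FRANCO_TO_ARABIC.get(ch, ch) for ch in franco_text)
```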

In the "ArabRecognizer 2" model, the WER exhibited a continuous decrease during the training phase, reaching 41%. Concurrently, the model achieved a prediction accuracy of 59%. Particularly noteworthy is the high degree of similarity observed between the predicted sentences and the target sentences in the test phase, attaining a percentage of 86.5%. Considering these metrics, the "ArabRecognizer 2" model emerges as the most effective among the models presented in this paper, boasting the highest accuracy under the experimental conditions. Given its accuracy in prediction and the notable similarity between sentences, this model represents a desirable choice for utilization in offline applications. Table 4 shows a comparative analysis between the two proposed models and other related work.

Fig. 11  The second proposed model on Expt. 1 vs Expt. 2 (WER per epoch)

Table 4 includes the results of fine-tuning and evaluating five different ASR systems (Obaidah et al., 2024). The Meta M4T, Whisper, Chirp, Google Cloud API, and Azure API models are designed to support multiple languages. The fine-tuning was conducted on a dataset comprising 132 h of continuous speech. The evaluation results were as follows: Whisper exhibited the highest WER at 83.8%, Azure API achieved a WER of 71.88%, Google API attained a WER of 67.1%, Meta M4T V1 recorded a WER of 67.8%, and Chirp outperformed the others with the lowest WER of 48.9%. In addition, DeepSpeech2 was fine-tuned using 84 h of continuous speech (Nasr et al., 2023); this fine-tuning did not perform well, achieving a WER of 86%. The proposed models, in contrast, were trained on 69 h of continuous speech. Finally, the proposed models were superior in accuracy while using a smaller number of training hours: the first proposed model achieved a WER of 44%, and the second proposed model reached the lowest WER of 41%.

Table 4  A comparative analysis between the two proposed models and other related work

Model                                          WER      Hours of continuous speech
Whisper Large V1 (Obaidah et al., 2024)        83.8%    132
Azure API (Obaidah et al., 2024)               71.88%   132
Google API (Obaidah et al., 2024)              67.1%    132
Meta M4T V1 (Obaidah et al., 2024)             67.8%    132
Chirp (Obaidah et al., 2024)                   48.9%    132
DeepSpeech2 (Nasr et al., 2023)                86%      84
The first proposed model (ArabRecognizer 1)    44%      69
The second proposed model (ArabRecognizer 2)   41%      69

4 Discussion

This section presents the advantages and limitations of the proposed models.

4.1 Proposed models' advantages

The proposed models exhibit several advantages that contributed to their attaining high accuracy, even in the presence of resource constraints.

• This paper used Franco-Arabic, which is unscientific and has many disadvantages, yet it achieved very good initial results in our experiments.
• All models were trained on Modern Standard Arabic (MSA), which all Arabic speakers can speak.
• The first proposed model, "ArabRecognizer 1", takes less training time than the original "DeepSpeech2" and achieves 2% higher accuracy than the original model.
• The second proposed model, "ArabRecognizer 2", achieved 5% higher accuracy than "DeepSpeech2" and 3% higher accuracy than "ArabRecognizer 1".

4.2 Proposed models' limitations

• The WER remains suboptimal.
• The proposed models still encounter challenges in distinguishing between letters with similar phonetics in the Arabic language. This problem could be alleviated by increasing the amount of data used.
• The models still need more data to become more accurate.
5 Conclusion

MSA is useful for all Arabic speakers, so this paper attempts to solve the problems of MSA speech recognition. The DeepSpeech2 model was used as a baseline because it is one of the most powerful models in speech recognition, but it did not perform well enough for Arabic. Two proposed models were presented for recognizing MSA speech using Franco-Arabic. The use of Franco-Arabic has given promising results that encourage us to continue developing this approach. The models used the most common features in speech recognition, MFCCs and spectrograms. The best performance was obtained with only 69 h of continuous speech over 30 epochs: the first proposed model achieved 44% WER, while the second proposed model achieved 41% WER. The experiments showed that the first model demonstrated superior speed, while the second model excelled in accuracy. The two models achieved better WER than DeepSpeech2 and some well-known speech recognition methods, with fewer training hours. This paper demonstrates the possibility of relying on Franco-Arabic as an intermediary in Arabic language recognition systems. In future work, we plan to increase the dataset used and the number of epochs to enhance our models' accuracy. Additionally, we aim to address the challenge of distinguishing similar phonetic letters in Arabic. Finally, we intend to translate the two proposed models into practical real-world applications.

Funding This study was not funded by any organization.

Data availability Available.

Declarations

Conflict of interest The authors declare that they have no conflict of interest.

Ethical approval No experiments involving humans or animals were conducted in this article.

Consent for publication All authors gave their consent.

References

Abdelhamid, A., Alsayadi, H. A., Hegazy, I., & Fayed, Z. T. (2020). End-to-end Arabic speech recognition: A review. Bibliotheca Alexandrina. Retrieved Dec 12, 2023 from https://research.asu.edu.eg/handle/123456789/178165
Akasheh, W. M., Haider, A. S., Al-Saideen, B., & Sahari, Y. (2024). Artificial intelligence-generated Arabic subtitles: Insights from Veed.io's automatic speech recognition system of Jordanian Arabic. Texto Livre, 17, e46952. https://doi.org/10.1590/1983-3652.2024.46952
Al-Anzi, F. S., & AbuZeina, D. (2022). Synopsis on Arabic speech recognition. Ain Shams Engineering Journal, 13(2), 101534. https://doi.org/10.1016/j.asej.2021.06.020
AlHanai, T., Hsu, W.-N., & Glass, J. (2016). Development of the MIT ASR system for the 2016 Arabic multi-genre broadcast challenge. In 2016 IEEE spoken language technology workshop (SLT) (pp. 299–304). IEEE. https://doi.org/10.1109/SLT.2016.7846280
Ali, M., Elshafei, M., Al-Ghamdi, M., & Al-Muhtaseb, H. (2009). Arabic phonetic dictionaries for speech recognition. Journal of Information Technology Research, 2(4), 67–80. https://doi.org/10.4018/jitr.2009062905
Alotaibi, Y. (2008). Comparative study of ANN and HMM to Arabic digits recognition systems. Journal of King Abdulaziz University: Engineering Sciences, 19(1), 43–60. https://doi.org/10.4197/Eng.19-1.3
Amodei, D., Ananthanarayanan, S., Anubhai, R., Bai, J., Battenberg, E., Case, C., Casper, J., Catanzaro, B., Cheng, Q., Chen, G., et al. (2016). Deep Speech 2: End-to-end speech recognition in English and Mandarin. In ICML 2016 (pp. 173–182).
Cardinal, P., et al. (2014). Recent advances in ASR applied to an Arabic transcription system for Al-Jazeera. In Proceedings of the annual conference of the International Speech Communication Association (Interspeech) (pp. 2088–2092).
Chollet, F. (2021). Deep learning with Python (2nd ed.). Manning.
Common Voice dataset. https://commonvoice.mozilla.org/en/datasets. Accessed 2/2022.
Elmahdy, M., Gruhn, R., Minker, W., & Abdennadher, S. (2009). Modern standard Arabic based multilingual approach for dialectal Arabic speech recognition. In 2009 eighth international symposium on natural language processing (pp. 169–174). IEEE. https://doi.org/10.1109/SNLP.2009.5340923
Essa, E. M., Tolba, A. S., & Elmougy, S. (2008). A comparison of combined classifier architectures for Arabic speech recognition. In 2008 international conference on computer engineering & systems (pp. 149–153). IEEE. https://doi.org/10.1109/ICCES.2008.4772985
Forsberg, M. (2003). Why is speech recognition difficult? Chalmers University of Technology (pp. 1–9).
Georgescu, A.-L., Pappalardo, A., Cucu, H., & Blott, M. (2021). Performance vs hardware requirements in state-of-the-art automatic speech recognition. EURASIP Journal on Audio, Speech, and Music Processing, 2021(1), 28. https://doi.org/10.1186/s13636-021-00217-4
Haraty, R. A., & El Ariss, O. (2007). CASRA+: A colloquial Arabic speech recognition application. American Journal of Applied Sciences, 4(1), 23–32. https://doi.org/10.3844/ajassp.2007.23.32
Hmad, N., & Allen, T. (2012). Biologically inspired continuous Arabic speech recognition. In M. Bramer & M. Petridis (Eds.), Research and development in intelligent systems XXIX (pp. 245–258). Springer.
Hussein, A., Watanabe, S., & Ali, A. (2022). Arabic speech recognition by end-to-end, modular systems and human. Computer Speech & Language, 71, 101272. https://doi.org/10.1016/j.csl.2021.101272
Hyassat, H., & AbuZitar, R. (2006). Arabic speech recognition using SPHINX engine. International Journal of Speech Technology, 9(3–4), 133–150. https://doi.org/10.1007/s10772-008-9009-1
MGB2 dataset: https://arabicspeech.org/mgb2/
MGB3 dataset: https://arabicspeech.org/mgb3-asr-2/
MGB5 dataset: https://arabicspeech.org/mgb5/
Mohamed, O., Shedeed, H., Tolba, M., & Gadalla, M. (2013). Morpheme-based Arabic language modeling for automatic speech recognition.
Moondra, A., & Chahal, P. (2023). Improved speaker recognition for degraded human voice using modified-MFCC and LPC with CNN. IJACSA. https://doi.org/10.14569/IJACSA.2023.0140416
Nasr, S., Duwairi, R., & Quwaider, M. (2023). End-to-end speech recognition for Arabic dialects. Arabian Journal for Science and Engineering, 48(8), 10617–10633. https://doi.org/10.1007/s13369-023-07670-7

O'Shaughnessy, D. (2008). Automatic speech recognition: History, methods and challenges. Pattern Recognition, 41(10), 2965–2979. https://doi.org/10.1016/j.patcog.2008.05.008
Obaidah, Q. A., et al. (2024). A new benchmark for evaluating automatic speech recognition in the Arabic call domain. arXiv. https://doi.org/10.48550/ARXIV.2403.04280
Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., Hannemann, M., Motlicek, P., Qian, Y., Schwarz, P., et al. (2011). The Kaldi speech recognition toolkit. In IEEE 2011 workshop on automatic speech recognition and understanding. IEEE Signal Processing Society, number EPFL-CONF-192584.
Rahman, A., Kabir, Md. M., Mridha, M. F., Alatiyyah, M., Alhasson, H. F., & Alharbi, S. S. (2024). Arabic speech recognition: Advancement and challenges. IEEE Access, 12, 39689–39716. https://doi.org/10.1109/ACCESS.2024.3376237
Rana, R. (2016). Gated Recurrent Unit (GRU) for emotion classification from noisy speech. arXiv. https://doi.org/10.48550/ARXIV.1612.07778
Yu, D., Eversole, A., Seltzer, M., Yao, K., Huang, Z., Guenter, B., Kuchaiev, O., Zhang, Y., Seide, F., Wang, H., et al. (2014). An introduction to computational networks and the computational network toolkit. Technical report.
Zhang, S., Hu, Y., & Bian, G. (2017). Research on string similarity algorithm based on Levenshtein distance. In 2017 IEEE 2nd advanced information technology, electronic and automation control conference (IAEAC) (pp. 2247–2251). IEEE. https://doi.org/10.1109/IAEAC.2017.8054419

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
