Retracted Article: Audio Visual Automatic Speech Recognition Towards Education For Disabilities
Saswati Debnath1 · Pinki Roy2 · Suyel Namasudra3,4 · Ruben Gonzalez Crespo4

Saswati Debnath: [Link]@[Link] · Pinki Roy: pinkiroy2405@[Link] · Ruben Gonzalez Crespo: [Link]@[Link]

1 Department of Computer Science and Engineering, Alliance University, Bangalore, Karnataka, India
2 Department of Computer Science and Engineering, National Institute of Technology, Silchar, Assam, India
3 Department of Computer Science and Engineering, National Institute of Technology Patna, Patna, Bihar, India
4 Universidad Internacional de La Rioja, Logroño, Spain
Abstract
Education is a fundamental right that enriches everyone's life. However, physically challenged people are often excluded from the general and advanced education system. An Audio-Visual Automatic Speech Recognition (AV-ASR) based system is useful to improve the education of physically challenged people by providing hands-free computing. They can communicate with the learning system through AV-ASR. However, it is challenging to trace the lip correctly for the visual modality. Thus, this paper addresses appearance-based visual features along with a co-occurrence statistical measure for visual speech recognition. Local Binary Pattern-Three Orthogonal Planes (LBP-TOP) and the Grey-Level Co-occurrence Matrix (GLCM) are proposed for extracting visual speech information. The experimental results show that the proposed system achieves 76.60 % accuracy for visual speech and 96.00 % accuracy for audio speech recognition.
Keywords AV-ASR · LBP-TOP · GLCM · MFCC · Clustering algorithm · Supervised learning
Introduction

Learning disabilities, vision problems, and physical disabilities are barriers to achieving one's goals in this digital world. Thus, to improve the learning system, Automatic Speech Recognition (ASR) is very helpful for students with disabilities. The application area of a speech recognition system is widespread in real time. Disabled people can use the ASR system for computing and searching on a hands-free device. They can easily access the computer, improve writing through a speech-to-text mechanism, and increase reading and writing abilities. Automatic speech recognition is widely used in voice dialing, medical documentation, and home appliances (Debnath & Roy, 2018; Revathi et al., 2019). However, the acoustic speech signal can be influenced by the transmission channel, the use of different microphones, different filter characteristics, the limitation of frequency bandwidth, etc. Although there are many approaches to overcome these problems, recognition performance degrades when the audio signal is corrupted by noise. Incorporating visual information into the audio signal increases the possibility of a more reliable recognition system. In automatic speech recognition, visual speech helps to improve the recognition rate when the background noise is high. The AV-ASR system can be used to develop desktop applications that involve web browsing tasks, air traffic control rooms, and hands-free computing, where a user can interact with machines without the use of their hands. Hands-free computing is very useful for disabled people. Although ASR has primarily been used to build such applications, AV-ASR can provide better performance in those application areas in a noisy environment. Thus, providing a better learning platform for physically challenged people is the prime focus of this research.
Production and perception of human speech are bimodal, which includes the analysis of the uttered acoustic signal and the visual cues of the speaker (Dupont & Luettin, 2000). Humans can use visual cues, which is the process of knowing speech by watching the movement of the speaker's lips. Most ASR systems use only the audio signal and ignore visual speech cues (Dupont & Luettin, 2000). It has been successfully proved that visual information of speech improves the robustness to noise in ASR. Therefore, it is very promising to cover the use of visual speech in man-machine interaction systems. However, the extraction of visual information is challenging because the visual articulations are less informative and differ from speaker to speaker (Borde et al., 2014). Moreover, visual information can also be affected by different lighting conditions. Therefore, the detection of informative visual features is still challenging. The development of AV-ASR systems must follow a better understanding of human speech perception. Several issues might be addressed while developing an AV-ASR, including how to extract visual features that are robust to different lighting conditions and what methods can be used to integrate the two information sources.

In this research, audio-visual speech recognition is added to improve the education system for disabled people. Physically challenged people can communicate with the system through audio-visual speech (Erber, 1975). The proposed system processes both the audio and visual speech signals and recognizes the speech. If the audio signal deteriorates, then the system considers the visual modality. For visual speech recognition, this research proposes appearance and co-occurrence statistical measures for visual features. Accessing data is also very important for audio-visual research; a new table-based protocol for data accessing in cloud computing has been proposed (Namasudra & Roy, 2017). The motion of lip movement, i.e., the dynamic feature, provides significant information about visual speech (Sui et al., 2017). Dynamic feature extraction and correlation analysis of features are other important factors for differentiating speech. Correlation analysis of visual features provides discriminatory information about different speech, which has not been addressed by researchers. Gao et al. (2021) proposed a novel approach to residential building load forecasting. Zhao et al. (2009) calculated LBP-TOP features for visual speech recognition, but the co-occurrence values of frames have not been considered, which is very important to distinguish different frames. The appearance-based features extracted from Local Binary Pattern (LBP) and LBP-TOP are sensitive to illumination and pose. Thus, these features are not robust to environment variations. Jafarbigloo and Danyali (2021) proposed a Convolutional Neural Network (CNN) technique in which Long Short-Term Memory (LSTM) has been used for image classification. Rauf et al. (2021) introduced an optimized LSTM technique that has been used to enhance the bat algorithm for COVID-19 forecasting. The visual features should be illumination invariant because the input images can be captured under different lighting conditions; thus, it is essential to address dynamic features which are illumination invariant. To address the above-mentioned issues, a visual speech feature extraction method using appearance-based features is proposed in this paper. A co-occurrence matrix is calculated, which helps to distinguish different movements of the lip. Illumination-invariant grayscale image features are also calculated to extract robust visual features. Thus, the visual features extracted in this research represent the correlation of frames, which is very effective to distinguish the lip movements of different words.

The main contributions of the present paper are as follows:

1. An AV-ASR based digital learning platform for disabled students is the main focus of this paper. New visual speech features are proposed to develop the model.
2. Visual speech features are calculated in a spatio-temporal domain, i.e., LBP-TOP, which also captures the motion of visual features along with the appearance features.
3. A co-occurrence matrix and different GLCM features are calculated from LBP-TOP, which helps to distinguish different features of lip movement.
4. The recognition process is carried out using supervised and unsupervised machine learning.

The paper is arranged as follows: the "Literature Reviews" section gives a review of some related work, and the proposed methodology of AV-ASR is described in the "Proposed Methodology" section. Experimental results and analysis of visual speech, as well as audio speech recognition, are described in the "Performance Analysis" section. The "Conclusions and Future Work" section provides the conclusion of this paper, followed by some future work directions.

Literature Reviews

A brief description of related articles and their pros and cons is given as follows.

Zhao et al. (2009) have introduced a local spatio-temporal descriptors technique for lip reading in visual speech recognition. Spatio-temporal LBP has been extracted from the mouth region and used to describe isolated phrase sequences. LBP has been calculated from three planes by combining all local features at the pixel, block, and volume levels to describe the mouth movement of a speaker. However, the method failed to extract global features as well as the lip geometry of the speaker, which provides the shape of the lip while speaking.

The LBP feature is used for texture image detection, texture classification, static image detection, background subtraction, etc. Nowadays, LBP is efficiently used in visual speech recognition. Ojala et al. (2002) have used the LBP method in three orthogonal planes to represent the mouth movement. Features have been calculated by concatenating LBP on these planes using co-occurrence statistics, etc. The author has segmented the lip region for synchronizing the lip movements with the input audio. An early stage of lip tracking has been done using a color-based method. The main aim of this work has been to develop a system that synchronizes the lips with the input speech. To extract visual features, i.e., visemes, Hue Saturation Value (HSV) and YCbCr color models have been used along with various morphological operations. The synchronization of the lip with the input speech has been implemented in this study, but the viseme features are rotation and illumination variant.

Borde et al. (2014) have proposed a Zernike features extraction technique for visual speech recognition. The Viola–Jones algorithm extracted the mouth area from the image, i.e., mouth localization. Zernike Moments (ZM) and Principal Component Analysis (PCA) techniques have been used for visual speech recognition, and Mel-Frequency Cepstral Coefficients (MFCCs) are extracted for audio speech recognition. The acquired visual speech recognition rate is 63.88 % using ZM and PCA, while the audio recognition rate is 100 % using MFCC. However, ZMs have fewer feature dimensions and are not efficient to represent all the visual features. The authors have used word data comprised of isolated city names, and the recording has been done in a lab environment.

A deep learning architecture-based AV-ASR has been proposed by Noda et al. (2014). For acquiring noise-free audio features, a deep de-noising autoencoder has been used, while a CNN has been used to extract visual speech. Along with the MFCC acoustic feature, the phoneme level has been calculated from the corresponding mouth area. Furthermore, the authors have used a multi-stream Hidden Markov Model (HMM) method to integrate the audio and visual features.

A Multimodal Recurrent Neural Network (multimodal RNN) scheme has been introduced by Feng et al. (2017) to calculate the subsequent features of the acoustic and visual speech signals. In this work, the multimodal RNN has been used for audio recognition, visual recognition, and fusion of audio-visual recognition. The multimodal RNN integrates the output of both modalities by multimodal layers. However, the extracted visual features are not robust to illumination.
Chen et al. (2022) have proposed an image-denoising algorithm based on improved K-singular value decomposition and atom optimization. Ivanko et al. (2021) have presented an experimental analysis of different approaches to audio-visual speech recognition and lip-reading on two independent datasets: the GRID corpus for English and the HAVRUS corpus for Russian. The classic GMM-CHMM technique produced the best recognition results on the HAVRUS database, which is smaller in size. The paper has presented the current state of the audio-visual speech recognition area as well as potential research directions. Kumar et al. (2022) proposed a deep learning technique-based audio-visual speech recognition system for hearing-impaired people. Hearing-challenged students confront several problems, including a lack of skilled sign language facilitators and the expensive cost of assistive technology. Using cutting-edge deep learning models, they have presented a visual speech recognition technique. Azeta et al. (2010) have introduced an intelligent voice-based education system. Intelligent components, such as adaptability and suggestion services, have been supported by the given framework. A prototype of the intelligent Voice-based E-Education System (iVEES) has been created and tested by visually impaired individuals. In the sphere of educational technology, the Speech User Interface is critical. It assists users who are unable to operate a computer using standard input devices, such as a keyboard and mouse. An application has been designed for young children under the age of ten to learn mathematical operations (Shrawankar & Thakare, 2010). This application can also be used as a calculator that is controlled by speech. There are many novel techniques to access data over the internet (Namasudra & Roy, 2015; Namasudra, 2020).

Proposed Methodology

The proposed method consists of audio and visual modalities, which are shown in Fig. 1. The steps included in the proposed model are:

1. ROI detection: ROI detection for visual feature extraction is carried out using LBP.
2. Visual speech feature extraction: LBP-TOP and GLCM statistical features are used.
3. Audio feature extraction: MFCC acoustic feature vectors are used.
4. Classification: K-means and Gaussian Expectation Maximization (GEM) algorithms are used for dimension reduction and classification.
5. Performance measure: a hard threshold technique is used for the clustering algorithm.
6. Further classification is carried out using supervised machine learning techniques. Artificial Neural Network (ANN), Support Vector Machine (SVM), and Naive Bayes (NB) classifiers are used here to carry out the work.

The proposed scheme is divided into three subsections, which are described below.
Visual Speech Feature Extraction

LBP-TOP and GLCM: LBP is efficiently used in facial feature analysis and for recognizing the isolated visual phrases of speech (Jain & Rathna, 2017). The texture description of every single region represents the appearance of that region (Ahonen et al., 2006). By combining all the region descriptors, the LBP method describes the global geometry of the image. Ojala et al. (2002) have introduced a grayscale and rotation invariant operator in LBP. Besides all this research, LBP is nowadays efficiently used in visual speech recognition. Here, the LBP technique is used to find the ROI of the face. An example of the LBP calculation is depicted in Fig. 2.

LBP_{(R,P)} = \sum_{p=0}^{P-1} t(n_p - n_c) \cdot 2^p    (1)

The whole face is divided into ten regions, and after that, features are extracted from every local region. LBP-TOP calculates the lip movement with respect to time. One plane represents the appearance-based features in the spatial domain, whereas the other two planes give the change of the visual features over time and the features of motion in the time domain, respectively. Histograms are generated from these three planes.

The energy of the GLCM is the sum of its squared elements. The range of the energy is [0, 1]; for a constant picture, the energy is 1. The energy is calculated using equation 2.

Energy = \sum_{i,j=0}^{I-1} P(i, j)^2    (2)

where P(i, j) is the I \times I GLCM, i, j = 0, 1, 2, \ldots, I-1 index the distinct gray level intensities, and I is the number of gray levels in the image. Entropy is often used to represent visual texture and is calculated using equation 3.

Entropy = -\sum_{i=0}^{I-1} \sum_{j=0}^{I-1} P(i, j) \ln P(i, j)    (3)

The correlation of the GLCM is calculated using equation 4.

Correlation = \frac{\sum_{i=0}^{I-1} \sum_{j=0}^{I-1} (i \cdot j)\, P(i, j) - \mu_m \mu_n}{\sigma_m \sigma_n}    (4)

where \mu_m, \mu_n and \sigma_m, \sigma_n denote the means and standard deviations of the row and column sums of the GLCM.
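To make equations (2)–(4) concrete, the following minimal Python sketch computes a normalized co-occurrence matrix of an LBP image and the three statistics directly from it. The use of scikit-image for the co-occurrence matrix, the single distance/angle setting, and the random stand-in image are assumptions for illustration, not the authors' implementation.

```python
# A minimal sketch (assumptions: scikit-image for the co-occurrence matrix and an
# 8-bit LBP image as input) of the GLCM statistics in Eqs. (2)-(4).
import numpy as np
from skimage.feature import graycomatrix

def glcm_features(lbp_image, levels=256):
    """Return energy, entropy and correlation of a normalized GLCM."""
    glcm = graycomatrix(lbp_image, distances=[1], angles=[0],
                        levels=levels, symmetric=True, normed=True)
    P = glcm[:, :, 0, 0]                                   # normalized P(i, j)
    i, j = np.meshgrid(np.arange(levels), np.arange(levels), indexing="ij")

    energy = np.sum(P ** 2)                                # Eq. (2)
    entropy = -np.sum(P[P > 0] * np.log(P[P > 0]))         # Eq. (3)
    mu_m, mu_n = np.sum(i * P), np.sum(j * P)              # row/column means
    sigma_m = np.sqrt(np.sum((i - mu_m) ** 2 * P))
    sigma_n = np.sqrt(np.sum((j - mu_n) ** 2 * P))
    correlation = (np.sum(i * j * P) - mu_m * mu_n) / (sigma_m * sigma_n)  # Eq. (4)
    return energy, entropy, correlation

lbp_image = np.random.randint(0, 256, size=(60, 100), dtype=np.uint8)  # stand-in lip ROI
print(glcm_features(lbp_image))
```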
Audio Feature Extraction Using MFCC

The speech is shaped by the vocal tract, tongue, and teeth. This shape determines what type of sound is produced. The shape of the vocal tract is represented in the envelope of the short-time power spectrum, and the job of the MFCCs (Davis & Mermelstein, 1980; Soni et al., 2016) is to represent this envelope. Olivan et al. (2021) proposed a deep learning-based scheme along with the mel-spectrogram to detect music boundaries. For calculating the MFCCs, the following steps are followed:

Step 1: Analyse the speech signal as short frames.
Step 2: A window function is applied after framing.
Step 3: The Discrete Fourier Transform (DFT) is used to convert the signal into the frequency domain.
Step 4: Apply the Mel filter bank.
Step 5: The logarithm of the Mel filter bank energies is taken.
Step 6: Convert the Mel spectrum back to the time domain.

The resultant coefficients are the MFCCs. Here, 19 MFCCs are calculated as the speech feature vector.
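As an illustration of the six steps above, the short sketch below extracts 19 MFCCs per frame. librosa is an assumed library choice, and the synthetic signal and frame settings (25 ms window, 10 ms hop at 16 kHz) are hypothetical; the paper does not specify an implementation.

```python
# A minimal sketch (librosa is an assumption) of computing 19 MFCCs per frame.
import numpy as np
import librosa

sr = 16000
signal = np.random.randn(sr).astype(np.float32)   # stand-in for one recorded digit;
                                                  # a real utterance would come from librosa.load(...)
mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=19,
                            n_fft=400, hop_length=160)   # 25 ms frames, 10 ms hop
print(mfcc.shape)   # (19, number_of_frames): one 19-dimensional vector per frame
```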
K-means (Kanungo et al., 2002) and GEM (Nadif & Govaert, 2005) clustering techniques are used to cluster the audio-visual speech. First, K-means is used as the basic clustering technique, and then GEM is used as the advanced clustering technique. GEM is also used because it is a soft clustering method, and its results are compared with K-means clustering. During training, a threshold is calculated for every digit, and in the testing phase the accuracy is measured for each audio-visual digit. A boosting-aided adaptive cluster-based undersampling approach has been proposed by Devi et al. (2020) for the class imbalance problem.

Revathi and Venkataramani (2009) have explored the effectiveness of perceptual features for recognizing isolated words and continuous speech. Lazli and Boukadoum (2017) have proposed an unsupervised iterative process for regulating a similarity measure to set the number of clusters and their boundaries. This has been developed to overcome the shortcomings of conventional clustering algorithms, such as K-means and Fuzzy C-means, which require prior knowledge of the number of clusters and a similarity measure.

K-means: K-means (Kanungo et al., 2002) is one of the simplest learning algorithms that solve the well-known clustering problem. It is a distance-based clustering algorithm. K-means creates circular clusters because the centroids are updated iteratively using the mean value. However, if the distribution of the data points is not circular, then the K-means algorithm fails to generate proper clusters.

GEM: The GEM learning algorithm resolves the uncertainty about the data points (Nadif & Govaert, 2005). GEM is a distribution-based clustering algorithm and overcomes the shortcomings of distance-based clustering. Its working principle is based on the probability of a data point belonging to a cluster. The Expectation-Maximization (EM) algorithm (Nadif & Govaert, 2005) is used in GEM to find the model parameters. The steps of GEM are discussed below:

Step 1: Initialize the mean \mu_k, covariance \sigma_k, and mixing coefficient \pi_k for each cluster k, and evaluate the initial value of the log-likelihood.
Step 2: Expectation (E) step: the posterior probabilities \gamma_j(x) are calculated using the mean, covariance, and mixture coefficient of the j-th mixture component.
Step 3: Maximization (M) step:

\mu_j = \frac{\sum_{n=1}^{N} \gamma_j(x_n)\, x_n}{\sum_{n=1}^{N} \gamma_j(x_n)}    (8)

where \gamma_j(x_n) is the responsibility assigned to point x_n for cluster j, and N represents the total number of data points.

\sigma_j = \frac{\sum_{n=1}^{N} \gamma_j(x_n)\, (x_n - \mu_j)(x_n - \mu_j)^T}{\sum_{n=1}^{N} \gamma_j(x_n)}    (9)

\pi_j = \frac{1}{N} \sum_{n=1}^{N} \gamma_j(x_n)    (10)

Step 4: Estimate the log-likelihood:

\ln p(X \mid \mu, \sigma, \pi) = \sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k\, P(x_n \mid \mu_k, \sigma_k) \right\}    (11)

Step 5: If not converged, return to Step 2, i.e., the Expectation and Maximization steps.
where \mu, \sigma, and \pi are the overall means, covariances, and mixing coefficients, respectively.

The E-step and the M-step are the two processes of the EM algorithm. The E-step is responsible for providing parameter values that compute the expected values of the latent variables. Based on the latent variables, the M-step updates the parameters of the model.
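A minimal sketch of the two-stage clustering described above: K-means for distance-based clustering, followed by a Gaussian mixture fitted with EM (the GEM stage), whose updates correspond to Eqs. (8)–(11). scikit-learn, the cluster count k = 4, and the synthetic 19-dimensional feature matrix are assumptions for illustration, not the authors' code.

```python
# A minimal sketch (not the authors' code) of K-means followed by an EM-fitted
# Gaussian mixture ("GEM") over synthetic 19-dimensional feature vectors.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
features = rng.normal(size=(200, 19))        # e.g., 19-dimensional MFCC vectors per frame

# Stage 1: distance-based clustering (K-means), k = 4 as in the experiments.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(features)

# Stage 2: distribution-based soft clustering; means are initialized from the
# K-means centroids and refined by the EM updates of Eqs. (8)-(10).
gem = GaussianMixture(n_components=4, covariance_type="full",
                      means_init=kmeans.cluster_centers_, random_state=0).fit(features)

responsibilities = gem.predict_proba(features)          # posterior gamma_j(x_n), E-step
log_likelihood = gem.score(features) * len(features)    # total log-likelihood, Eq. (11)
print(kmeans.labels_[:5], responsibilities[0].round(3), round(log_likelihood, 2))
```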
LE
IC
Algorithm 1: Appearance-based features and their co-occurrence value analysis

Input: Speaker's lip contour.
Output: Hybrid features.

1: Start
2: for i = 1 to m do (m = number of utterances)
3:   for j = 1 to n do (n = number of lip images)
4:     Compute LBP for the XY, XT, and YT planes:

LBP_{(R,P)} = \sum_{p=0}^{P-1} t(n_p - n_c) \cdot 2^p    (12)
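The simplified sketch below illustrates the LBP-TOP idea in Algorithm 1: LBP codes are computed on the XY, XT, and YT planes of a lip-ROI volume and their histograms are concatenated. It uses only the central slice of each plane and scikit-image's LBP, so it is an illustrative approximation under assumed volume shapes and parameters, not the authors' implementation.

```python
# A simplified LBP-TOP sketch (illustrative only): LBP on one XY, XT and YT slice
# of a lip-ROI volume, with the per-plane histograms concatenated.
import numpy as np
from skimage.feature import local_binary_pattern

def lbp_top_histogram(volume, P=8, R=1):
    """volume: (T, H, W) grayscale lip-ROI frames of one utterance."""
    T, H, W = volume.shape
    planes = [
        volume[T // 2, :, :],   # XY plane: appearance at the middle frame
        volume[:, H // 2, :],   # XT plane: horizontal motion over time
        volume[:, :, W // 2],   # YT plane: vertical motion over time
    ]
    bins = P + 2                # number of uniform-LBP codes
    hists = []
    for plane in planes:
        codes = local_binary_pattern(plane, P, R, method="uniform")
        hist, _ = np.histogram(codes, bins=bins, range=(0, bins), density=True)
        hists.append(hist)
    return np.concatenate(hists)    # one descriptor per sub-region/utterance

volume = np.random.randint(0, 256, size=(25, 60, 100), dtype=np.uint8)  # stand-in lip frames
print(lbp_top_histogram(volume).shape)    # (30,) with P = 8
```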
Performance Analysis

Different experiments are carried out to measure the accuracy, and all these experiments are conducted in different modules. For audio speech recognition, 19 MFCC features are extracted, and for visual speech recognition, LBP-TOP along with GLCM features are used. After feature extraction, the threshold value is calculated for the clustering algorithm in a training phase, which is described in the "Performance Measure" section.

Fig. 4 Block diagram of testing for AV-ASR using clustering

Details of Dataset

Two datasets, 'vVISWa' (Borde et al., 2004) and 'CUAVE' (Patterson et al., 2002), are used in this paper for the experiments. Borde et al. (2004) have published a paper about the 'vVISWa' English digit dataset in 2016. The dataset consists of the English digits 0 to 10 recorded in a laboratory. 'vVISWa' consists of 10 speakers, including 6 males and 4 females, and each digit is repeated 10 times by each individual speaker. 'CUAVE' consists of the English digits 0 to 10 recorded from 18 male and 18 female speakers. A total of 1800 words have been recorded from the speakers. The database has been recorded in an isolated sound booth at a resolution of 720 × 480 with the NTSC standard of 29.97 fps, without any head movement.

Performance Measure

The threshold is generated for each digit using the centroids obtained from the clustering algorithm. A codebook is generated from these centroids.
A score is computed by finding the distance between the codebook and the feature vector. This score is considered as a parameter for testing the utterances. Euclidean distance is used to measure the score. The equation for calculating the threshold is:

Threshold = \frac{\mu_1 \sigma_1 + \mu_2 \sigma_2}{\sigma_1 + \sigma_2}    (13)

where \mu_1 is the mean and \sigma_1 is the standard deviation of the tested sample, and \mu_2 is the mean and \sigma_2 is the standard deviation of another randomly selected sample's codebook. This threshold is digit-specific. For robustness, both the claimed and a random sample are considered to generate the threshold. The block diagrams of training and testing are depicted in Figs. 3 and 4, respectively.

The performance measure for ANN, SVM, and NB is calculated as follows:

Recognition\ Rate = \frac{correctly\ identified\ test\ samples}{total\ supplied\ test\ samples} \times 100\,\%    (14)
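A small sketch of how the digit-specific threshold of Eq. (13) and the recognition rate of Eq. (14) could be computed. The score arrays and the acceptance rule (a test utterance is accepted when its distance does not exceed the threshold) are illustrative assumptions, not the authors' exact test procedure.

```python
# A minimal sketch (assumed score arrays) of the digit-specific threshold, Eq. (13),
# and the recognition rate, Eq. (14).
import numpy as np

def digit_threshold(claimed_scores, random_scores):
    """Eq. (13): combine the claimed and a random sample's score statistics."""
    mu1, sigma1 = claimed_scores.mean(), claimed_scores.std()
    mu2, sigma2 = random_scores.mean(), random_scores.std()
    return (mu1 * sigma1 + mu2 * sigma2) / (sigma1 + sigma2)

def recognition_rate(n_correct, n_total):
    """Eq. (14): percentage of correctly identified test samples."""
    return n_correct / n_total * 100.0

rng = np.random.default_rng(1)
claimed = rng.normal(2.0, 0.3, size=50)        # distances to the claimed digit's codebook
random_other = rng.normal(3.5, 0.5, size=50)   # distances to a randomly chosen other codebook
threshold = digit_threshold(claimed, random_other)

test_scores = rng.normal(2.1, 0.4, size=20)
accepted = test_scores <= threshold            # assumed rule: smaller distance -> accepted
print(round(threshold, 3), recognition_rate(int(accepted.sum()), len(test_scores)))
```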
Visual Speech Recognition Using Clustering Method

For the extraction of visual features, the first step is to detect the ROI, and here LBP is used for ROI detection. In this research, the ROI is the speaker's lip contour, and the visual speech features are extracted from the lip contour. Appearance-based features are extracted here along with a second-order statistical analysis for visual speech recognition. The main motivation of the proposed feature extraction method is to capture dynamic visual information and the co-occurrence values of features. Therefore, LBP-TOP along with GLCM features are proposed for visual speech recognition. LBP-TOP divides the ROI into sub-regions; here, each ROI is divided into 10 sub-regions, and the dimension of each sub-region descriptor is 150. Therefore, the total dimension of the feature vector is 150 × 10, i.e., 1500, for each frame after extraction of LBP-TOP. The LBP-TOP features are provided as input for the GLCM calculation, and energy, correlation, contrast, and variance, as well as entropy, are calculated.

Table 1 Accuracy (%) of visual-speech recognition using LBP-TOP with GLCM and K-means clustering ('vVISWa' dataset)

Digit  k=2    k=4    k=8    k=16
0      63.18  65.65  67.77  64.36
1      64.55  66.00  67.50  63.16
2      62.26  62.79  64.33  64.22
3      61.57  64.42  66.77  63.17
4      62.20  63.59  65.10  63.83
5      63.24  64.00  62.75  62.16
6      60.11  66.51  68.01  63.27
7      59.13  66.45  66.00  62.00
8      60.33  63.89  65.15  63.00
9      62.72  62.87  64.41  63.51

Table 2 Accuracy (%) of visual-speech recognition using LBP-TOP with GLCM and GEM clustering ('vVISWa' dataset)

Digit  k=2    k=4    k=8    k=16
0      64.37  66.00  70.76  67.10
1      64.00  67.66  70.00  65.66
2      63.16  67.00  68.56  64.12
3      62.00  66.12  70.31  65.00
4      62.80  65.19  67.00  64.74
5      63.61  64.89  70.51  67.52
6      62.10  67.51  69.05  66.27
7      60.31  65.85  70.00  65.00
8      61.00  64.24  68.10  64.40
9      62.12  64.92  69.11  64.51

Table 3 Accuracy (%) of proposed visual speech recognition using ANN ('vVISWa' dataset)

Exp. no  No. of hidden layers  No. of hidden units  Iterations  System accuracy (%)
1        2                     30, 20               100         67.52
2        2                     40, 30               100         73.12
3        2                     50, 40               100         72.05
4        2                     60, 50               100         70.45
5        2                     70, 60               100         69.12

Table 4 Accuracy (%) of visual speech recognition using SVM ('vVISWa' dataset)

Exp. no  Kernel function              System accuracy (%)
1        Radial Basis Function (RBF)  73.23
2        Linear                       64.22
3        Polynomial                   70.76

Table 5 Accuracy (%) of visual speech recognition using Naive Bayes ('vVISWa' dataset)

Exp. no  Kernel function  System accuracy (%)
1        Normal           72.04
2        Kernel           74.23
Table 6 Accuracy (%) of audio-speech recognition using MFCC and K-means clustering ('vVISWa' dataset)

Digit  k=2    k=4    k=8    k=16
4      92.56  97.00  94.10  92.59
5      91.17  93.15  86.67  89.16
6      89.56  93.25  92.75  90.00
7      90.42  88.00  86.00  85.66
8      90.00  89.45  91.75  90.56
9      92.71  90.91  89.45  91.00

Table 7 Accuracy (%) of audio-speech recognition using MFCC and GEM clustering ('vVISWa' dataset)

Digit  k=2    k=4    k=8    k=16
1      93.75  96.54  96.00  93.21
2      94.67  95.25  95.11  93.38
3      95.10  97.75  92.25  94.13
4      93.45  96.75  93.91  92.82
5      92.65  97.25  94.15  88.22
6      92.23  94.38  91.81  90.37
7      90.67  87.00  88.32  86.62
8      91.33  95.72  92.65  88.26
After feature extraction, ANN (Kuncheva, 2004; Debnath & Roy, 2021), SVM (Debnath & Roy, 2021; Debnath et al., 2021), and NB (Debnath & Roy, 2021) machine learning algorithms are applied to recognize the visual speech. The visual speech is recognized using different numbers of hidden nodes and two hidden layers in the ANN. The system achieves 73.12 % recognition accuracy using 40, 30 hidden nodes. The experiments using SVM and NB are carried out with different kernel functions, and the highest recognition accuracy of visual speech is obtained using the NB classifier, which is 74.23 %. The performance of visual speech recognition is calculated using the 'vVISWa' dataset with different classifiers and shown in Tables 3, 4, and 5, respectively. Figures 5, 6, and 7 represent the results obtained from the 'CUAVE' dataset.

The recognition accuracy of audio speech increases as the cluster size grows from 2 to 4 and drops when the cluster size increases beyond 4. Using GEM, the accuracy of audio speech recognition is more than 92 % with k = 4. Therefore, the system performance depends on the cluster size for both the K-means and GEM clustering methods. The recognition accuracy for audio speech using clustering is presented in Tables 6 and 7.

Table 8 Accuracy (%) of audio-speech recognition using MFCC and ANN ('vVISWa' dataset)

Exp. no  No. of hidden layers  No. of hidden units  Iterations  System accuracy (%)
4        2                     60, 50               100         91.12
5        2                     70, 60               100         90.34

Table 9 Accuracy (%) of audio-speech recognition using MFCC and SVM ('vVISWa' dataset)

Exp. no  Kernel function  System accuracy (%)

Table 10 Accuracy (%) of audio-speech recognition using MFCC and Naive Bayes ('vVISWa' dataset)

Exp. no  Kernel function  System accuracy (%)

Table 11 Comparison of proposed visual speech features with existing features using 'vVISWa' dataset

Feature extraction method                  Accuracy (%)
PZM (Debnath & Roy, 2021)                  74.00
LBP-TOP, GLCM and clustering (Proposed)    72
LBP-TOP, GLCM and NB (Proposed)            74.23

Fig. 8 Results of audio speech recognition using the 'CUAVE' dataset
Audio Speech Recognition Using ANN, SVM, and NB

The recognition of audio speech is also performed using the ANN, SVM, and NB classifiers. For the ANN, 2 hidden layers, 100 iterations, and different numbers of hidden nodes are used in the experiments. The SVM and NB machine learning approaches are applied with different kernel functions. The RBF kernel function of SVM gives 93.55 % recognition accuracy, whereas 92.19 % accuracy is obtained using the kernel function of NB. The accuracy is 91.32 % using 50, 40 hidden nodes with the ANN. Tables 8, 9, and 10 show the performance of the proposed system using ANN, SVM, and NB. Figure 8 represents the accuracy obtained from the 'CUAVE' dataset using both the clustering and supervised learning algorithms.

Integration of Audio-Visual Speech

Decision-level fusion is considered for combining the two systems. The individual word recognition rate is calculated for both audio and visual speech, and then the two modalities are integrated for a better result. If one recognition model fails, then the system considers the result from the other model. Decision fusion provides a better recognition rate for the overall system because each individual word is recognized as a token. For audio speech recognition, the considered threshold rate is more than 85 %, and for visual speech recognition the threshold is more than 70 %. Thus, when the accuracy is greater than or equal to the threshold, the input data is acceptable. Based on the threshold, when one of the recognition systems recognizes the respective input data of audio and visual speech, the system considers that the speech is recognized and provides the output.
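A minimal sketch of the decision-level fusion rule described above: each modality's word-level result is accepted only if its confidence meets its threshold, and the system reports a word when an accepted modality returns it. The 85 % and 70 % thresholds come from the text; the data structures and fallback order are assumptions for illustration.

```python
# A minimal sketch (illustrative data structures) of decision-level fusion: audio is
# trusted above an 85% threshold, visual above 70%, otherwise no decision is made.
AUDIO_THRESHOLD, VISUAL_THRESHOLD = 85.0, 70.0

def fuse(audio_result, visual_result):
    """Each result is a (predicted_word, confidence_percent) pair for one utterance."""
    audio_word, audio_conf = audio_result
    visual_word, visual_conf = visual_result
    if audio_conf >= AUDIO_THRESHOLD:        # audio modality accepted
        return audio_word
    if visual_conf >= VISUAL_THRESHOLD:      # fall back to the visual modality
        return visual_word
    return None                              # neither modality is reliable enough

# Hypothetical outputs for the spoken digit "five" in a noisy environment.
print(fuse(("nine", 62.0), ("five", 74.2)))   # -> "five" (visual decision wins)
```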
Comparative Study

In the proposed method, five GLCM statistics are calculated from the LBP-TOP features: energy, contrast, correlation, variance, and entropy. K-means and GEM are used as unsupervised methods, and ANN, SVM, and NB are used as supervised machine learning methods. It is observed from the experiments that the proposed visual features provide a better recognition rate than the classical feature extraction methods using both supervised and unsupervised methods. The comparison of results with the existing feature extraction methods is shown in Table 11 using the 'vVISWa' dataset and in Fig. 9 using the 'CUAVE' dataset. The proposed method performs well because it calculates statistical values from the appearance features and gives more distinct information for visual speech. LBP has been used by many researchers for visual speech recognition, but it does not capture the dynamic features of lip movement; for that reason, researchers have proposed LBP-TOP to calculate features in three dimensions. However, the co-occurrence values are also important: the statistical values from the co-occurrence matrix give distinguishing features.

Fig. 9 Comparison of proposed visual speech features with existing features using the 'CUAVE' dataset

Conclusions and Future Work

The main focus of this paper is to provide an AV-ASR based education system for physically challenged people. New visual speech features are proposed using LBP-TOP and GLCM. LBP-TOP captures the visual features in a spatio-temporal domain; therefore, the motion of the lip is also captured along with the appearance-based features. Five GLCM features are calculated to distinguish different frames of a particular utterance, as explained in the proposed methodology. After feature extraction, the classification of speech is also a challenging problem. It is observed that the proposed feature extraction method gives a better recognition rate for visual speech using both a clustering algorithm and supervised machine learning algorithms. GEM is more efficient than K-means because it models a Gaussian distribution for clustering, while K-means uses distances to generate clusters. Moreover, K-means fails to generate the right clusters when the distribution of data samples is not circular. Thus, the distribution-based model gives better performance than the distance-based model. Using supervised machine learning, SVM and NB give more accuracy than
ANN for visual speech recognition. In human-to-human communication, speech is a very natural and basic method. The design process for a speech user interface for human-computer interaction is presented in this paper using audio-visual data. In a noisy environment, when the audio signal does not work, disabled people can communicate with the system using the visual speech signal. Further work can be designed on a hybrid visual feature extraction method to extract more robust features for developing an AV-ASR-based education system.

Author Contributions SD is the main author of this paper, who has conceived the idea and discussed it with all co-authors. PR has developed the main algorithms. SN is the corresponding author. He has performed the experiments of this paper. RGC has supervised the entire work, evaluated the performance, and proofread the paper.

References

Ahonen, T., et al. (2006). Face description with local binary patterns: Applications to face recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(12), 2037–2041. [Link]
Azeta, A., et al. (2010). Intelligent voice-based e-education system: A framework and evaluation. International Journal of Computing, 9, 327–334. [Link]
Borde, P., et al. (2004). 'vVISWa': A multilingual multi-pose audio visual database for robust human computer interaction. International Journal of Computer Applications, 137(4), 25–31. [Link]
Chen, R., et al. (2022). Image-denoising algorithm based on improved K-singular value decomposition and atom optimization. CAAI Transactions on Intelligence Technology, 7(1), 117–127. [Link]
Davis, S. B., & Mermelstein, P. (1980). Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Transactions on Acoustics, Speech, and Signal Processing, 28(4), 357–365. [Link]
Debnath, S., et al. (2021). Study of different feature extraction method for visual speech recognition. International Conference on Computer Communication and Informatics (ICCCI), 2021, 1–5. [Link]
Debnath, S., & Roy, P. (2018). Study of speech enabled healthcare technology. International Journal of Medical Engineering and Informatics, 11(1), 71–85. [Link]
Debnath, S., & Roy, P. (2021). Appearance and shape-based hybrid visual feature extraction: Toward audio-visual automatic speech recognition. Signal, Image and Video Processing, 15, 25–32. [Link]
Debnath, S., & Roy, P. (2021). Audio-visual automatic speech recognition using PZM, MFCC and statistical analysis. International Journal of Interactive Multimedia and Artificial Intelligence, 7(2), 121–133. [Link]
Devi, D., et al. (2020). A boosting-aided adaptive cluster-based undersampling approach for treatment of class imbalance problem. International Journal of Data Warehousing and Mining (IJDWM), 16(3), 60–86. [Link]
Dupont, S., & Luettin, J. (2000). Audio-visual speech modeling for continuous speech recognition. IEEE Transactions on Multimedia, 2(3), 141–151. [Link]
Erber, N. P. (1975). Auditory-visual perception of speech. Journal of Speech and Hearing Disorders, 40(4), 481–492. [Link]
Feng, W., et al. (2017). Audio visual speech recognition with multimodal recurrent neural networks. In International Joint Conference on Neural Networks (IJCNN), IEEE, pp. 681–688. [Link]
Galatas, G., et al. (2012). Audio-visual speech recognition using depth information from the Kinect in noisy video conditions. In Proceedings of International Conference on Pervasive Technologies Related to Assistive Environments, ACM, pp. 1–4. [Link]
Gao, J., et al. (2021). Decentralized federated learning framework for the neighborhood: A case study on residential building load forecasting. In Proceedings of the 19th ACM Conference on Embedded Networked Sensor Systems, ACM, pp. 453–459. [Link]
Ivanko, D., et al. (2021). An experimental analysis of different approaches to audio-visual speech recognition and lip-reading. In Proceedings of 15th International Conference on Electromechanics and Robotics, Springer, Singapore, pp. 197–209. [Link]
Jafarbigloo, S. K., & Danyali, H. (2021). Nuclear atypia grading in breast cancer histopathological images based on CNN feature extraction and LSTM classification. CAAI Transactions on Intelligence Technology.
Jain & Rathna (2017). In IEEE Global Conference on Signal and Information Processing (GlobalSIP), IEEE, Montreal, pp. 368–372. [Link]
Jiang, R., et al. (2020). Object tracking on event cameras with offline-online learning.
Kashevnik, A., et al. (2021). Multimodal corpus design for audio-visual speech recognition in vehicle cabin. IEEE Access, 9, 34986–35003. [Link]
Kumar, L. A., et al. (2022). Deep learning based assistive technology on audio visual speech recognition for hearing impaired. International Journal of Cognitive Computing in Engineering, 3, 24–30. [Link]
Kuncheva, I. (2004). Combining pattern classifiers: Methods and algorithms. Wiley.
Lazli, L., & Boukadoum, M. (2017). HMM/MLP speech recognition system using a novel data clustering approach. In IEEE 30th Canadian Conference on Electrical and Computer Engineering (CCECE), IEEE, Windsor. [Link]
Mohanaiah, P., et al. (2013). Image texture feature extraction using GLCM approach. International Journal of Scientific and Research Publications, 3(5), 85.
Nadif, M., & Govaert, G. (2005). Block clustering via the block GEM and two-way EM algorithms. In The 3rd ACS/IEEE International Conference on Computer Systems and Applications, IEEE. [Link]
Namasudra, S., & Roy, P. (2015). Size based access control model in cloud computing. In Proceedings of the International Conference on Electrical, Electronics, Signals, Communication and Optimization, IEEE, Visakhapatnam, pp. 1–4. [Link]
Namasudra, S. (2020). Fast and secure data accessing by using DNA computing for the cloud environment. IEEE Transactions on Services Computing. [Link]
Namasudra, S., & Roy, P. (2017). A new table based protocol for data accessing in cloud computing. Journal of Information Science and Engineering, 33(3), 585–609. [Link]
Noda, K., et al. (2014). Audio-visual speech recognition using deep learning. Applied Intelligence, 42(4), 567. [Link]
Ojala, T., et al. (2002). Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(7), 971–987. [Link]
Olivan, C. H., et al. (2021). Music boundary detection using convolutional neural networks: A comparative analysis of combined input features. International Journal of Interactive Multimedia and Artificial Intelligence, 7(2), 78–88. [Link]
Patterson, E., et al. (2002). CUAVE: A new audio-visual database for multimodal human-computer interface research. In IEEE International Conference on Acoustics, Speech, and Signal Processing, IEEE, Orlando. [Link]
Rauf, H. T., et al. (2021). Enhanced bat algorithm for COVID-19 short-term forecasting using optimized LSTM. Soft Computing, 25(20), 12989–12999. [Link]
Revathi, A., & Venkataramani, Y. (2009). Perceptual features based isolated digit and continuous speech recognition using iterative clustering approach. In First International Conference on Networks & Communications (NetCoM), IEEE, Chennai. [Link]
Revathi, A., et al. (2019). Person authentication using speech as a biometric against play back attacks. Multimedia Tools and Applications, 78(2), 1569–1582. [Link]
Shikha, B., et al. (2020). An extreme learning machine-relevance feedback framework for enhancing the accuracy of a hybrid image retrieval system. International Journal of Interactive Multimedia and Artificial Intelligence, 6(2), 15–27. [Link]
Shrawankar, U., & Thakare, V. (2010). Speech user interface for computer based education system. In International Conference on Signal and Image Processing, pp. 148–152. [Link]
Soni, B., et al. (2016). Text-dependent speaker verification using classical LBG, adaptive LBG and FCM vector quantization. International Journal of Speech Technology, 19(3), 525–536. [Link]
Sui, C., et al. (2017). A cascade gray-stereo visual feature extraction method for visual and audio-visual speech recognition. Speech Communication, 90(1), 89. [Link]
Zhao, G., et al. (2009). Lipreading with local spatiotemporal descriptors. IEEE Transactions on Multimedia, 11(7), 56. [Link]

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.