Journal of Autism and Developmental Disorders (2023) 53:3581–3594


S.I.: Impact of Assistive Technology in Special Education

Retracted Article: Audio-Visual Automatic Speech Recognition Towards Education for Disabilities

Saswati Debnath1 · Pinki Roy2 · Suyel Namasudra3,4 · Ruben Gonzalez Crespo4

Accepted: 14 June 2022 / Published online: 12 July 2022


© The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2022

Abstract
Education is a fundamental right that enriches everyone's life. However, physically challenged people are often debarred from the general and advanced education system. An Audio-Visual Automatic Speech Recognition (AV-ASR) based system is useful for improving the education of physically challenged people by providing hands-free computing. They can communicate with the learning system through AV-ASR. However, it is challenging to trace the lip correctly for the visual modality. Thus, this paper addresses appearance-based visual features along with a co-occurrence statistical measure for visual speech recognition. Local Binary Pattern-Three Orthogonal Planes (LBP-TOP) and the Grey-Level Co-occurrence Matrix (GLCM) are proposed for extracting visual speech information. The experimental results show that the proposed system achieves 76.60% accuracy for visual speech and 96.00% accuracy for audio speech recognition.
Keywords AV-ASR · LBP-TOP · GLCM · MFCC · Clustering algorithm · Supervised learning

* Suyel Namasudra
  suyelnamasudra@[Link]

Saswati Debnath
[Link]@[Link]

Pinki Roy
pinkiroy2405@[Link]

Ruben Gonzalez Crespo
[Link]@[Link]

1 Department of Computer Science and Engineering, Alliance University, Bangalore, Karnataka, India
2 Department of Computer Science and Engineering, National Institute of Technology, Silchar, Assam, India
3 Department of Computer Science and Engineering, National Institute of Technology Patna, Patna, Bihar, India
4 Universidad Internacional de La Rioja, Logroño, Spain

Introduction

Learning disabilities, vision problems, and physical disabilities are barriers to achieving one's goals in this digital world. Thus, to improve the learning system, Automatic Speech Recognition (ASR) is very helpful for students with disabilities. The application area of a speech recognition system is widespread in real time. Disabled people can use an ASR system for computing and searching on a hands-free device. They can easily access the computer, improve writing through a speech-to-text mechanism, and increase reading and writing abilities. Automatic speech recognition is widely used in voice dialing, medical documentation, and home appliances (Debnath & Roy, 2018; Revathi et al., 2019). However, the acoustic speech signal can be influenced by the transmission channel, the use of different microphones, different filter characteristics, the limitation of frequency bandwidth, etc. Although there are many approaches to overcome certain situations and provide robust speech recognition, all of them need to know the pattern and level of noise in advance. Thus, another approach that can be used to recognize speech in the presence of acoustic noise is lipreading. The visual signal can provide speech information when an acoustic signal is corrupted by noise. Incorporating visual information into the audio signal increases the possibility of a more reliable recognition system. In automatic speech recognition, visual speech helps to improve the recognition rate when background noise is high. The AV-ASR system can be used to develop desktop applications that involve web browsing tasks, air traffic control rooms, and hands-free computing, where a user can interact with machines without the use of their hands. Hands-free computing is very useful for disabled people. Although ASR has primarily been used to build such applications, in a noisy environment AV-ASR can provide better performance in those application areas. Thus,
providing a better learning platform for physically challenged people is the prime focus of this research.

Production and perception of human speech are bimodal, involving the analysis of the uttered acoustic signal and the visual cues of the speaker (Dupont & Luettin, 2000). Humans can use visual cues, which is the process of knowing speech by watching the movement of the speaker's lips. Most ASR systems apply only the audio signal and ignore visual speech cues (Dupont & Luettin, 2000). It has been successfully proved that visual information of speech improves the robustness of ASR to noise. Therefore, it is very promising to cover the use of visual speech in man-machine interaction systems. However, the extraction of the visual information is challenging because the visual articulations are less informative and differ from speaker to speaker (Borde et al., 2014). Moreover, visual information can also be affected by different lighting conditions. Therefore, the detection of informative visual features is still challenging. The development of AV-ASR systems must follow a better understanding of human speech perception. The following issues might be addressed while developing an AV-ASR:

1. What are the benefits of visual information in speech perception?
2. Which visual features are the most significant for lip reading?
3. Which features provide the discriminative information to recognize different speech?
4. Whether the extracted visual units are equivalent to phones?
5. Whether the visual features are robust to rotation and scale?
6. Whether the captured visual features are robust to different lighting conditions?
7. What are the methods of integrating two information sources?

The solution to these issues could help to develop the proposed AV-ASR model, which improves the performance of the system. In this proposed system, visual speech recognition is added to improve the education system for disabled people. Physically challenged people can communicate with the system through audio-visual speech (Erber, 1975). The proposed system processes both the audio and visual speech signals and recognizes the speech. If the audio signal deteriorates, then the system will consider the visual modality. For visual speech recognition, this research proposes appearance and co-occurrence statistical measures for visual features. Accessing data is also very important for audio-visual research; a new table based protocol for data accessing in cloud computing has been proposed in earlier research (Namasudra & Roy, 2017). The motion of the lips, i.e., the dynamic feature, provides significant information about visual speech (Sui et al., 2017). Dynamic feature extraction and correlation analysis of features are other important factors for differentiating speech. Correlation analysis of visual features provides discriminatory information about different speech, which has not been addressed by researchers. Gao et al. (2021) proposed a novel approach to residential building load forecasting. Zhao et al. (2009) calculated LBP-TOP features for visual speech recognition, but the co-occurrence values of frames have not been considered, which is very important to distinguish different frames. The appearance-based features extracted from Local Binary Pattern (LBP) and LBP-TOP are sensitive to illumination and pose; thus, these features are not robust to environment variations. Jafarbigloo and Danyali (2021) proposed a Convolutional Neural Network (CNN) technique in which Long Short-Term Memory (LSTM) has been used for image classification. Rauf et al. (2021) introduced an enhanced bat algorithm for COVID-19 short-term forecasting using an optimized LSTM. The visual features should be illumination invariant because the input data is video, and variation in lighting conditions affects the different frames. Color features have been used by many researchers for Region of Interest (ROI) detection and lip-reading (Dave, 2015). However, in visual speech recognition, color-changing features are less informative because of lighting variation. Sometimes the color models are not very efficient due to poor illumination. Variation in illumination and different face positions create difficulties in visual feature analysis (Dupont & Luettin, 2000). Jiang et al. (2020) proposed a novel method for object tracking. Discriminative dynamic features are very important, as the motion of the lip gives visual speech information. Thus, it is essential to address dynamic features which are illumination invariant. To address the above-mentioned issues, a visual speech feature extraction method using appearance-based features and co-occurrence feature analysis is developed. In this paper, visual speech features are calculated in a spatio-temporal domain, which captures the motion of visual features along with the appearance features. A co-occurrence matrix is calculated, which helps to distinguish different movements of the lip. Illumination invariant gray-scale image features are also calculated to extract robust visual features. Thus, the visual features extracted in this research represent the correlation of frames, which is very effective for distinguishing the lip movements of different words.

The main contributions of the present paper are as follows:

1. An AV-ASR based digital learning platform for disabled students is the main focus of this paper. New visual speech features are proposed to develop the model.
2. Visual speech features are calculated in a spatio-temporal domain, i.e., LBP-TOP, which also captures the motion of visual features along with the appearance features.
3. A co-occurrence matrix and different GLCM features are calculated from LBP-TOP, which helps to distinguish different features of lip movement.
4. The recognition process is carried out using supervised and unsupervised machine learning.

The paper is arranged as follows: the "Literature Reviews" section gives a review of some related work, and the proposed methodology of AV-ASR is described in the "Proposed Methodology" section. Experimental results and analysis of visual speech as well as audio speech recognition are described in the "Performance Analysis" section. The "Conclusions and Future Work" section provides the conclusion of this paper, followed by some future work directions.

Literature Reviews

A brief description of related articles and their pros and cons is given as follows:

Zhao et al. (2009) have introduced a local spatio-temporal descriptor technique for lip reading in visual speech recognition. Spatio-temporal LBP has been extracted from the mouth region and used to describe isolated phrase sequences. LBP has been calculated from three planes by combining all local features from the pixel, block, and volume levels to describe the mouth movement of a speaker. However, the method failed to extract global features as well as the lip geometry of the speaker, which provides the shape of the lip while speaking.

The LBP feature is used for texture image detection, texture classification, static image detection, background subtraction, etc. Nowadays, LBP is efficiently used in visual speech recognition. Ojala et al. (2002) have used the LBP method in three orthogonal planes to represent the mouth movement. Features have been calculated by concatenating LBP on these planes using co-occurrence statistics in three directions.

Depth-level information has been used in noisy conditions (Galatas et al., 2012) for visual speech recognition. Discrete Cosine Transform (DCT) and Linear Discriminant Analysis (LDA) techniques have been considered for visual speech for four realistic visual modalities in different noisy conditions. In the noisy background, depth-level visual information provided good recognition accuracy. However, more visual features can be used for noisy data.

Dave (2015) has introduced a lip localization-based feature detection method for segmenting the lip region. Lip localization and tracking are useful in lip-reading, lip synchronization, visual speech recognition, facial animation, etc. The author has segmented the lip region for synchronizing the lip movements with the input audio. An early stage of lip tracking has been done using a color-based method. The main aim of this work has been to develop a system that synchronizes lips with input speech. To extract visual features, i.e., visemes, Hue Saturation Value (HSV) and YCbCr color models have been used along with various morphological operations. The synchronization of the lip with input speech has been implemented in this study, but the viseme features are rotation and illumination variant.

Borde et al. (2014) have proposed a Zernike feature extraction technique for visual speech recognition. The Viola-Jones algorithm extracted the mouth area from the image, i.e., mouth localization. Zernike Moments (ZM) and Principal Component Analysis (PCA) techniques have been used for visual speech recognition, and Mel-Frequency Cepstral Coefficients (MFCCs) are extracted for audio speech recognition. The acquired visual speech recognition rate is 63.88% using ZM and PCA, while the audio recognition rate is 100% using MFCC. However, ZMs have fewer feature dimensions and are not efficient at representing all the visual features. The authors have used word data comprised of isolated city names, and the recording has been done in a lab environment.

Deep learning architecture-based AV-ASR has been proposed by Noda et al. (2014). For acquiring noise-free audio features, a deep de-noising autoencoder has been used, while a CNN has been used to extract visual speech. Along with the MFCC acoustic feature, the phoneme level has been calculated from the corresponding mouth area. Furthermore, the authors have used a multi-stream Hidden Markov Model (HMM) method to integrate audio and visual features.

A Multimodal Recurrent Neural Network (multimodal RNN) scheme has been introduced by Feng et al. (2017) to calculate the subsequent features of the acoustic and visual speech signals. In this work, the multimodal RNN has been used for audio recognition, visual recognition, and fusion of audio-visual recognition. The multimodal RNN integrates the output of both modalities by multimodal layers. However, the extracted visual features are not robust to illumination. Chen et al. (2022) have proposed improved K-singular value decomposition and atom optimization techniques to reduce image noise. Kashevnik et al. (2021) have developed an audio-visual speech recognition scheme for a driver monitoring system. Multimodal speech recognition allows for the use of audio data when video data is unavailable at night, as well as the use of video data in acoustically loud environments such as highways. The main aim of the proposed method is multimodal corpus design. Ivanko et al. (2021) studied different fusion methods for AV-ASR, such as the Gaussian Mixture Model–Continuous Hidden Markov Model (GMM–CHMM), the Deep Neural Network–Hidden Markov Model (DNN–HMM), and end-to-end approaches. The tests have been carried out on two
independent datasets: the GRID corpus for English and the HAVRUS corpus for Russian. The classic GMM–CHMM technique produced the best recognition results on the HAVRUS database, which is smaller in size. The paper has presented the current state of the audio-visual speech recognition area as well as potential research directions. Kumar et al. (2022) proposed a deep learning technique-based audio-visual speech recognition system for hearing impaired people. Hearing challenged students confront several problems, including a lack of skilled sign language facilitators and the expensive cost of assistive technology. Using cutting-edge deep learning models, they have presented a visual speech recognition technique in their paper. Azeta et al. (2010) have introduced an intelligent voice-based education system. Intelligent components, such as adaptability and suggestion services, have been supported by the given framework. A prototype of an intelligent Voice-based E-Education System (iVEES) has been created and tested by visually impaired individuals. In the sphere of educational technology, the speech user interface is critical. It assists users who are unable to operate a computer using standard input devices, such as a keyboard and mouse. The authors have designed an application for young children under the age of ten to learn mathematical operations (Shrawankar & Thakare, 2010). This application can also be used as a calculator that is controlled by speech. There are many novel techniques to access data over the internet (Namasudra & Roy, 2015; Namasudra, 2020).

Proposed Methodology

The proposed method consists of audio and visual modalities, which are shown in Fig. 1. The steps included in the proposed model are:

1. ROI detection: ROI detection for visual feature extraction is carried out using LBP.
2. Visual speech feature extraction: LBP-TOP and GLCM statistical features are used.
3. Audio feature extraction: MFCC acoustic feature vectors are used here.
4. Classification: K-means and Gaussian Expectation Maximization (GEM) algorithms are used for dimension reduction and classification.
5. Performance measure: a hard threshold technique is used for the clustering algorithms.
6. Further classification is carried out using supervised machine learning techniques. Artificial Neural Network (ANN), Support Vector Machine (SVM), and Naive Bayes (NB) classifiers are used here to carry out the work.

The proposed scheme is divided into three subsections, which are described below.

Fig. 1  Block diagram of the proposed AV-ASR model using the clustering method

Visual Speech Feature Extraction

LBP-TOP and GLCM: LBP is efficiently used in facial feature extraction by dividing the face into small regions and extracting features from each region (Ahonen et al., 2006). It works as a local spatiotemporal descriptor to represent
the isolated visual phrases of speech (Jain & Rathna, 2017). The texture description of every single region represents the appearance of that region (Ahonen et al., 2006). By combining all the region descriptors, the LBP method describes the global geometry of the image. Ojala et al. (2002) have introduced a grayscale and rotation invariant operator method in LBP. Beyond all this research, it is nowadays efficiently used in visual speech recognition. Here, the LBP technique is used to find out the ROI of the face. An example of the LBP calculation is depicted in Fig. 2.

Fig. 2  Example of LBP computation

LBP_{(R,P)} = \sum_{p=0}^{P-1} t(n_p - n_c) \cdot 2^p    (1)

where n_p denotes the neighborhood pixels in each block and n_c is the center pixel value; n_p is thresholded by its center pixel value n_c. P is the number of sampling points, with P = 16 for a 4 x 4 block, and R is the radius, with R = 2 for a 4 x 4 block.

The whole face is divided into ten regions, and then features are extracted from every local region. This captures the lip movement with respect to time. One plane represents the appearance-based features in the spatial domain, whereas the other two planes give the change of visual features with time and the features of motion in the time domain, respectively. Histograms are generated from these three planes and the three histograms are added to represent the appearance-based visual feature. The GLCM is a gray-level co-occurrence matrix used to extract second-order statistical texture features (Mohanaiah et al., 2013). Shikha et al. (2020) have used GLCM features to extract different image attributes like texture, color, and shape. In this paper, GLCM is used after LBP-TOP feature extraction. The LBP-TOP matrix for each frame is used as an input for the GLCM calculation. Energy, entropy, correlation, contrast, and variance are derived from the GLCM matrix. Therefore, five GLCM features are used here to differentiate the different frames of a particular utterance. Algorithm 1 represents the proposed feature extraction method.

The energy of the GLCM is the sum of its squared elements. The range of energy is [0, 1]; for a constant picture, the energy is 1. The energy is calculated using equation 2.

Energy = \sum_{i,j=0}^{I-1} P(i,j)^2    (2)

where P(i,j) is the GLCM matrix, i, j = 0, 1, 2, ..., I-1 are the distinct gray level intensities over which the I x I order GLCM is calculated, and I is the number of gray levels in the image. Entropy is often used to represent visual texture, and it is calculated using equation 3.

Entropy = -\sum_{i=0}^{I-1} \sum_{j=0}^{I-1} P(i,j) \log(P(i,j))    (3)

Correlation measures how correlated a pixel is to its neighbor over the entire image. Correlation is calculated from equation 4.

Correlation = \frac{\sum_{i=0}^{I-1} \sum_{j=0}^{I-1} (i \cdot j) P(i,j) - \mu_m \mu_n}{\sigma_m \sigma_n}    (4)

where \mu_m, \mu_n and \sigma_m, \sigma_n denote the means and standard deviations of the row and column sums of the GLCM matrix P(i,j). Contrast and variance are calculated using equations 5 and 6, respectively. Over the entire image, contrast returns a measure of the intensity difference between a pixel and its neighbour.

Contrast = \sum_{i,j=0}^{I-1} |i - j|^2 P(i,j)    (5)

Variance = \sum_{i,j=0}^{I-1} (i - \mu)^2 \log(P(i,j))    (6)

where \mu is the mean of the gray level distribution.
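As an illustration of equations 2 to 6, the short Python sketch below computes the five GLCM statistics from a normalized co-occurrence matrix. It is not the authors' MATLAB implementation; the quantization level, the single pixel offset, and the function names are assumptions made only for this example.

```python
import numpy as np

def glcm(image, levels=8, dx=1, dy=0):
    """Build a normalized gray-level co-occurrence matrix for a single pixel offset.
    `image` must already be quantized to integer values in [0, levels)."""
    P = np.zeros((levels, levels), dtype=np.float64)
    h, w = image.shape
    for y in range(h - dy):
        for x in range(w - dx):
            P[image[y, x], image[y + dy, x + dx]] += 1
    return P / P.sum()  # normalize so entries behave like joint probabilities

def glcm_features(P):
    """Energy, entropy, correlation, contrast, and variance as in Eqs. (2)-(6)."""
    eps = 1e-12                                   # guard against log(0)
    i, j = np.indices(P.shape)
    mu_m, mu_n = (i * P).sum(), (j * P).sum()     # marginal (row/column) means
    sigma_m = np.sqrt(((i - mu_m) ** 2 * P).sum())
    sigma_n = np.sqrt(((j - mu_n) ** 2 * P).sum())
    energy = (P ** 2).sum()                                                      # Eq. (2)
    entropy = -(P * np.log(P + eps)).sum()                                       # Eq. (3)
    correlation = ((i * j * P).sum() - mu_m * mu_n) / (sigma_m * sigma_n + eps)  # Eq. (4)
    contrast = ((i - j) ** 2 * P).sum()                                          # Eq. (5)
    variance = ((i - mu_m) ** 2 * np.log(P + eps)).sum()                         # Eq. (6) as printed above
    return energy, entropy, correlation, contrast, variance

# Example: quantize one stand-in "feature frame" to 8 levels and extract the statistics.
frame = np.random.randint(0, 256, (32, 32))       # placeholder for a real LBP-TOP matrix
quantized = (frame // 32).astype(int)
print(glcm_features(glcm(quantized)))
```

In the proposed pipeline these five values would be computed per frame from the LBP-TOP output, giving a compact co-occurrence descriptor alongside the appearance histograms.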
Audio Feature Extraction Using MFCC

The speech is shaped by the vocal tract, tongue, and teeth. This shape determines what type of sound will be produced. The shape of the vocal tract is represented in the envelope of the short-time power spectrum, and the job of MFCCs (Davis & Mermelstein, 1980; Soni et al., 2016) is to represent this envelope. Olivan et al. (2021) proposed a deep learning-based scheme along with the mel-spectrogram to detect music boundaries. For calculating the MFCCs, the following steps are followed:

Step 1: Analyse the speech signal as short frames.
Step 2: A window function is applied after framing.
Step 3: The Discrete Fourier Transform (DFT) is used to convert the signal into the frequency domain.
Step 4: Apply the Mel filter bank.
Step 5: The logarithm of the Mel filter bank energies is taken.
Step 6: Convert the Mel spectrum back to the time domain.

The resultant coefficients are the MFCCs. Here, 19 MFCCs are calculated as the speech feature vector.

Classification Using Clustering

Classification and dimension reduction are also important tasks in this domain. Clustering is very useful for grouping and classifying data objects, especially for small data sets. Therefore, K-means (Kanungo et al., 2002) and GEM (Nadif & Govaert, 2005) clustering techniques are used to cluster the audio-visual speech. First, K-means is used as the basic clustering technique and then GEM is used as the advanced clustering technique. GEM is also used because it is a soft clustering method, and its results are compared with K-means clustering. During training, the threshold is calculated for every digit, and at the testing phase, the accuracy is measured for each audio-visual digit. A boosting-aided adaptive cluster-based undersampling approach has been proposed by Devi et al. (2020) for the class imbalance problem.

Revathi and Venkataramani (2009) have explored the effectiveness of perceptual features for recognizing isolated words and continuous speech. Lazli and Boukadoum (2017) have proposed an unsupervised iterative process for regulating a similarity measure to set the number of clusters and their boundaries. This has been developed to overcome the shortcomings of conventional clustering algorithms, such as K-means and Fuzzy C-means, which require prior knowledge of the number of clusters and a similarity measure.

K-means: K-means (Kanungo et al., 2002) is one of the simplest learning algorithms that solve the well-known clustering problem. It is a distance-based clustering algorithm. K-means creates clusters with a circular shape because the centroids are updated iteratively using the mean value. However, if the distribution of the data points is not circular, then the K-means algorithm becomes unsuccessful at generating the proper clusters.

GEM: The GEM learning algorithm resolves the uncertainty about the data points (Nadif & Govaert, 2005). GEM is a distribution-based clustering algorithm and overcomes the shortcomings of distance-based clustering. The working principle is based on the probability of a data point belonging to a cluster. The Expectation-Maximization (EM) algorithm (Nadif & Govaert, 2005) is used in GEM to find the model parameters. The steps of GEM are discussed below:

Step 1: Initialize the mean \mu_k, covariance \sigma_k, and mixing coefficient \pi_k for cluster k, the mean \mu_j, covariance \sigma_j, and mixing coefficient \pi_j for cluster j, and evaluate the initial value of the log-likelihood.
Step 2: Expectation (E) step: the posterior probability \gamma_j(x) is calculated using the means, covariances, and mixture coefficients.

\gamma_j(x) = \frac{\pi_k \cdot P(x \mid \mu_k, \sigma_k)}{\sum_{j} \pi_j \cdot P(x \mid \mu_j, \sigma_j)}    (7)

where x denotes the parameters collectively, \pi_k represents the probability of x belonging to the k-th mixture component, and \pi_j represents the probability of x belonging to the j-th mixture component.
Step 3: Maximization (M) step:

\mu_j = \frac{\sum_{n=1}^{N} \gamma_j(x_n) x_n}{\sum_{n=1}^{N} \gamma_j(x_n)}    (8)

where the responsibility of a point x_n is assigned to exactly one cluster and N represents all the data.

\sigma_j = \frac{\sum_{n=1}^{N} \gamma_j(x_n) (x_n - \mu_j)(x_n - \mu_j)^T}{\sum_{n=1}^{N} \gamma_j(x_n)}    (9)

\pi_j = \frac{1}{N} \sum_{n=1}^{N} \gamma_j(x_n)    (10)

Step 4: Estimate the log-likelihood.

\ln p(X \mid \mu, \sigma, \pi) = \sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k P(x_n \mid \mu_k, \sigma_k) \right\}    (11)

Step 5: If not converged, then return to Step 2, i.e., the Expectation and Maximization steps.

Here \mu, \sigma, and \pi are the overall means, covariances, and mixing coefficients, respectively. The E-step and the M-step are the two processes of the EM algorithm. The E-step is responsible for providing parameter values that compute the expected values of the latent variable. Based on the latent variable, the M-step updates the parameters of the model.
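The update rules in equations 7 to 11 can be written compactly in code. The following Python sketch is a minimal, from-scratch EM loop for a Gaussian mixture, assuming full covariance matrices, random initialization, and a fixed number of iterations; it only illustrates the steps above and is not the paper's MATLAB implementation.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gem_fit(X, k, iters=50, seed=0):
    """Minimal EM for a Gaussian mixture (Eqs. 7-11): returns means, covariances, weights."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    mu = X[rng.choice(n, k, replace=False)]                      # Step 1: init means from data
    sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(k)])
    pi = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E-step (Eq. 7): responsibilities gamma[n, j]
        dens = np.column_stack([pi[j] * multivariate_normal.pdf(X, mu[j], sigma[j])
                                for j in range(k)])
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M-step (Eqs. 8-10): update means, covariances, and mixing coefficients
        Nj = gamma.sum(axis=0)
        mu = (gamma.T @ X) / Nj[:, None]
        for j in range(k):
            diff = X - mu[j]
            sigma[j] = (gamma[:, j, None] * diff).T @ diff / Nj[j] + 1e-6 * np.eye(d)
        pi = Nj / n
        # The log-likelihood of Eq. (11), np.log(dens.sum(axis=1)).sum(), can be
        # tracked here and used as the convergence test of Steps 4 and 5.
    return mu, sigma, pi

# Example: cluster synthetic 2-D "feature vectors" into k = 4 soft components.
X = np.vstack([np.random.randn(100, 2) + off for off in ([0, 0], [5, 5], [0, 5], [5, 0])])
means, covs, weights = gem_fit(X, k=4)
print(weights)
```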
Algorithm 1  Appearance-based features and their co-occurrence value analysis

Input: Speaker's lip contour.
Output: Hybrid features.
1: Start
2: for i = 1 to m do (m = number of utterances)
3:   for j = 1 to n do (n = number of lip images)
4:     Compute LBP for the XY, XT, and YT planes:
       LBP_{(R,P)} = \sum_{p=0}^{P-1} t(n_p - n_c) \cdot 2^p    (12)
5:     S_1 = [LBP(XY) + LBP(XT) + LBP(YT)]
6:   end for
7:   for S = S_1 to S_n do
8:     Calculate the GLCM matrix P(i, j)
9:     From the GLCM, calculate the co-occurrence values: energy, entropy, correlation, contrast, and variance using equations 2 to 6
10:    Store the values at P, where P = P_1 to P_n
11:  end for
12:  Compare the values of the matrices P
13:  Q ← store the comparison results
14: end for
15: Stop

Performance Analysis

Experimental Environment

Experiments are conducted in MATLAB, 2015 version. All the results and graphs are calculated in MATLAB. Different experiments are designed for a comprehensive assessment of the proposed system. The experiments are audio speech recognition using two different datasets and visual speech recognition using two different datasets.

For both audio and visual speech recognition, results are obtained using two datasets, and for each dataset, both the clustering and supervised machine learning algorithms are used. ANN, SVM, and NB are used separately to achieve the accuracy. All these experiments are conducted in different modules.

For audio speech recognition, 19 MFCC features are extracted, and for visual speech recognition, LBP-TOP along with GLCM features are used. After feature extraction, the threshold value is calculated for the clustering algorithm in a training phase, which is described in the "Performance Measure" section. In the testing phase, accuracy is measured with respect to the threshold value of each digit. Thus, the recognition rate is calculated for each individual digit for both audio and visual speech.

However, for the supervised machine learning algorithms, the recognition rate is calculated from the correctly identified test samples and the total supplied test samples, which is described in the "Results and Discussion" section.

The system is developed based on a threshold value: if any one module meets the threshold value, then the system will accept the speech sample. Thus, a disabled student can communicate with both audio and visual speech, and the system will provide the output using whichever speech sample is more convenient for the system.
Fig. 3  Block diagram of training for AV-ASR using clustering

Fig. 4  Block diagram of testing for AV-ASR using clustering

Details of Dataset

Two datasets, 'vVISWa' (Borde et al., 2004) and 'CUAVE' (Patterson et al., 2002), are used in this paper for the experiments. Borde et al. (2004) have published a paper about the 'vVISWa' English digit dataset in 2016. The dataset consists of the 0 to 10 English digits recorded in a laboratory. 'vVISWa' consists of 10 speakers, including 6 males and 4 females. Each digit is repeated 10 times by each individual speaker. 'CUAVE' consists of the 0 to 10 English digits recorded from 18 male and 18 female speakers. A total of 1800 words have been recorded from the speakers. The database has been recorded in an isolated sound booth at a resolution of 720 × 480 with the NTSC standard of 29.97 fps, without any head movement.

Performance Measure

The threshold is generated for each digit using the centroids obtained from the clustering algorithm. A codebook is
generated for each repetition of a word, and the score is calculated by finding the distance between the codebook and the feature vector. This score is considered as a parameter for testing the utterances. The Euclidean distance is used to measure the score. The equation for calculating the threshold is:

Threshold = \frac{\mu_1 \sigma_1 + \mu_2 \sigma_2}{\sigma_1 + \sigma_2}    (13)

where \mu_1 is the mean and \sigma_1 is the standard deviation of the tested sample, and \mu_2 is the mean and \sigma_2 is the standard deviation of another randomly selected sample's codebook. This threshold is digit-specific. For robustness, both the claimed and a random sample are considered to generate the threshold. The block diagrams of training and testing are depicted in Figs. 3 and 4, respectively.

The performance measure for ANN, SVM, and NB is calculated using the following method:

Recognition Rate = \frac{\text{correctly identified test samples}}{\text{total supplied test samples}} \times 100\%    (14)
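A minimal Python sketch of equations 13 and 14 is given below; the score arrays and sample counts are made-up values, and the function names are illustrative rather than taken from the paper.

```python
import numpy as np

def digit_threshold(test_scores, other_scores):
    """Digit-specific hard threshold of Eq. (13), combining the claimed and a random codebook."""
    mu1, sigma1 = np.mean(test_scores), np.std(test_scores)
    mu2, sigma2 = np.mean(other_scores), np.std(other_scores)
    return (mu1 * sigma1 + mu2 * sigma2) / (sigma1 + sigma2)

def recognition_rate(correct, total):
    """Recognition rate in percent, Eq. (14)."""
    return correct / total * 100.0

# Example with hypothetical Euclidean scores for one digit.
claimed = np.array([2.1, 2.4, 1.9, 2.2])
random_other = np.array([4.8, 5.1, 4.6, 5.0])
print(digit_threshold(claimed, random_other), recognition_rate(47, 50))
```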
Results and Discussion

Visual Speech Recognition Using Clustering Method

For the extraction of visual features, the primary step is to detect the ROI, and here LBP is used for the detection of the ROI. In this research, the ROI is the speaker's lip contour, and the visual speech features are extracted from the lip contour. Appearance-based features are extracted here along with a second-order statistical analysis for visual speech recognition. The main motivation of the proposed feature extraction method is to capture dynamic visual information and the co-occurrence values of features. Therefore, LBP-TOP along with GLCM features are proposed for visual speech recognition. LBP-TOP divides the ROI into sub-regions; here, each ROI is divided into 10 sub-regions, and the dimension of each sub-region is 150. Therefore, the total dimension of the feature vector is 150 × 10, i.e., 1500 for each frame after extraction of LBP-TOP. The LBP-TOP features are provided as input for the GLCM calculation, and the energy, correlation, contrast, variance, and entropy statistical measures are extracted from these feature matrices. Here, energy gives the sum of the squared elements in the feature matrix, while correlation and contrast measure the correlation and intensity contrast of a pixel to its neighbor. The calculated correlation values can be -1 or 1 for negative or positive correlation. The entropy is inversely proportional to the GLCM energy, and it achieves its largest value when all the elements of the given matrix are equal. After feature extraction, the K-means and GEM clustering algorithms are used for classification as well as dimension reduction. Cluster sizes of 2, 4, 8, and 16 are selected for both the K-means and GEM clustering algorithms. Accuracy increases with the increasing size of the cluster from 2 to 4 and from 4 to 8, but accuracy drops when the cluster size increases from 8 to 16. This is because a higher cluster size gives a scattered representation of the data, which reduces the accuracy. The hard threshold is calculated for each speech sample. Tables 1 and 2 represent the proposed visual speech recognition using K-means and GEM. Here, the 'vVISWa' dataset is used.

Table 1  Accuracy (%) of visual speech recognition using LBP-TOP with GLCM and K-means clustering ('vVISWa' dataset)

Digit   k=2     k=4     k=8     k=16
0       63.18   65.65   67.77   64.36
1       64.55   66.00   67.50   63.16
2       62.26   62.79   64.33   64.22
3       61.57   64.42   66.77   63.17
4       62.20   63.59   65.10   63.83
5       63.24   64.00   62.75   62.16
6       60.11   66.51   68.01   63.27
7       59.13   66.45   66.00   62.00
8       60.33   63.89   65.15   63.00
9       62.72   62.87   64.41   63.51

Table 2  Accuracy (%) of visual speech recognition using LBP-TOP with GLCM and GEM clustering ('vVISWa' dataset)

Digit   k=2     k=4     k=8     k=16
0       64.37   66.00   70.76   67.10
1       64.00   67.66   70.00   65.66
2       63.16   67.00   68.56   64.12
3       62.00   66.12   70.31   65.00
4       62.80   65.19   67.00   64.74
5       63.61   64.89   70.51   67.52
6       62.10   67.51   69.05   66.27
7       60.31   65.85   70.00   65.00
8       61.00   64.24   68.10   64.40
9       62.12   64.92   69.11   64.51
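For readers who want to reproduce the clustering stage, the sketch below shows how K-means and a Gaussian-mixture model (standing in for GEM) could be fitted for cluster sizes 2, 4, 8, and 16 with scikit-learn. The paper's own experiments were run in MATLAB, so the library choice, random seeds, and toy feature matrix are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

# Stand-in for the 1500-dimensional LBP-TOP + GLCM feature vectors of one digit.
features = np.random.rand(100, 1500)

for k in (2, 4, 8, 16):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(features)
    gem = GaussianMixture(n_components=k, covariance_type='diag',
                          random_state=0).fit(features)
    # Cluster centroids (K-means) and component means (GEM) serve as the codebook
    # from which the digit-specific threshold of Eq. (13) is later derived.
    print(k, km.cluster_centers_.shape, gem.means_.shape)
```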
Visual Speech Recognition Using ANN, SVM, and NB

After feature extraction, the ANN (Kuncheva, 2004; Debnath & Roy, 2021), SVM (Debnath & Roy, 2021; Debnath et al., 2021), and NB (Debnath & Roy, 2021) machine learning algorithms are applied to recognize the visual speech. The visual speech is recognized using different numbers of hidden nodes and two hidden layers in the ANN. The system achieves 73.12% recognition accuracy using 40 and 30 hidden nodes. The experiments using SVM and NB are carried out with different kernel functions, and the highest recognition accuracy of visual speech is obtained using the NB classifier, which is 74.23%. The performance of visual speech recognition is calculated using the 'vVISWa' dataset with different classifiers and shown in Tables 3, 4, and 5, respectively. Figures 5, 6, and 7 represent the results obtained from the 'CUAVE' dataset.

Table 3  Accuracy (%) of proposed visual speech recognition using ANN ('vVISWa' dataset)

Exp. no   No. of hidden layers   No. of hidden units   Iterations   System accuracy (%)
1         2                      30,20                 100          67.52
2         2                      40,30                 100          73.12
3         2                      50,40                 100          72.05
4         2                      60,50                 100          70.45
5         2                      70,60                 100          69.12

Table 4  Accuracy (%) of visual speech recognition using SVM ('vVISWa' dataset)

Exp. no   Kernel function               System accuracy (%)
1         Radial Basis Function (RBF)   73.23
2         Linear                        64.22
3         Polynomial                    70.76

Table 5  Accuracy (%) of visual speech recognition using Naive Bayes ('vVISWa' dataset)

Exp. no   Kernel function   System accuracy (%)
1         Normal            72.04
2         Kernel            74.23

Fig. 5  Results of proposed visual speech features and K-means using the 'CUAVE' dataset

Fig. 6  Results of proposed visual speech features and GEM using the 'CUAVE' dataset

Fig. 7  Results of proposed visual speech features using the 'CUAVE' dataset

Audio Speech Recognition Using Clustering

Here, 19 MFCC coefficients are extracted from the audio speech. After feature extraction, classification using a clustering algorithm is performed on the audio data.

The performance of audio speech recognition using K-means clustering is more than 90% for each digit. Accuracy increases with the increasing size of the cluster from 2 to 4 and drops when the cluster size increases from 4 to higher values. Using GEM, the accuracy of audio speech recognition is more than 92% with k = 4. Therefore, system performance depends on the cluster size for both the K-means and GEM clustering methods. The recognition accuracy for audio speech using clustering is represented in Tables 6 and 7.

Table 6  Accuracy (%) of audio speech recognition using MFCC and K-means clustering ('vVISWa' dataset)

Digit   k=2     k=4     k=8     k=16
0       91.78   96.76   95.11   90.21
1       90.67   96.00   94.50   91.22
2       88.34   90.34   90.55   90.56
3       89.01   94.00   93.50   92.47
4       92.56   97.00   94.10   92.59
5       91.17   93.15   86.67   89.16
6       89.56   93.25   92.75   90.00
7       90.42   88.00   86.00   85.66
8       90.00   89.45   91.75   90.56
9       92.71   90.91   89.45   91.00

Table 7  Accuracy (%) of audio speech recognition using MFCC and GEM clustering ('vVISWa' dataset)

Digit   k=2     k=4     k=8     k=16
0       95.22   97.08   96.25   93.11
1       93.75   96.54   96.00   93.21
2       94.67   95.25   95.11   93.38
3       95.10   97.75   92.25   94.13
4       93.45   96.75   93.91   92.82
5       92.65   97.25   94.15   88.22
6       92.23   94.38   91.81   90.37
7       90.67   87.00   88.32   86.62
8       91.33   95.72   92.65   88.26
9       91.75   93.99   90.25   89.30
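As a pointer for reimplementation of the audio front end used above (19 MFCC coefficients per frame), the snippet below extracts such a feature matrix from a WAV file with the librosa package. This is only a hedged stand-in for the MATLAB front end used in the paper; the file name, frame length, and hop size are illustrative assumptions.

```python
import librosa

# Load an utterance (file name is hypothetical) and keep its native sampling rate.
signal, sr = librosa.load("digit_zero_speaker01.wav", sr=None)

# Frame the signal, window it, take the DFT, apply the Mel filter bank, log-compress,
# and decorrelate -- the six steps listed in the MFCC section above.
mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=19,
                            n_fft=int(0.025 * sr), hop_length=int(0.010 * sr))

print(mfcc.shape)   # (19, number_of_frames): one 19-dimensional vector per frame
```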
Audio Speech Recognition Using ANN, SVM, and NB

The recognition of audio speech is also performed using the ANN, SVM, and NB classifiers. For the ANN, 2 hidden layers, 100 iterations, and different numbers of hidden nodes are used for the experiment. The SVM and NB machine learning approaches are applied with different kernel functions. The RBF kernel function of SVM gives 93.55% recognition accuracy, whereas 92.19% accuracy is obtained using the kernel function of NB. The accuracy is 91.32% using 50 and 40 hidden nodes with the ANN. Tables 8, 9, and 10 show the performance of the proposed system using ANN, SVM, and NB. Figure 8 represents the accuracy obtained from the 'CUAVE' dataset using both the clustering and supervised learning algorithms.

Table 8  Accuracy (%) of audio speech recognition using MFCC and ANN ('vVISWa' dataset)

Exp. no   No. of hidden layers   No. of hidden units   Iterations   System accuracy (%)
1         2                      30,20                 100          89.02
2         2                      40,30                 100          90.00
3         2                      50,40                 100          91.32
4         2                      60,50                 100          91.12
5         2                      70,60                 100          90.34

Table 9  Accuracy (%) of audio speech recognition using MFCC and SVM ('vVISWa' dataset)

Exp. no   Kernel function               System accuracy (%)
1         Radial Basis Function (RBF)   93.55
2         Linear                        78.2207
3         Polynomial                    86.76

Table 10  Accuracy (%) of audio speech recognition using MFCC and Naive Bayes ('vVISWa' dataset)

Exp. no   Kernel function   System accuracy (%)
1         Normal            91.51
2         Kernel            92.19

Fig. 8  Results of audio speech recognition using the 'CUAVE' dataset

Integration of Audio-Visual Speech

The integration of the two systems can be done using feature-level fusion and decision-level fusion. Here, decision-level fusion is considered for combining the two systems. The individual word recognition rate is calculated for both audio and visual speech, and then the two modalities are integrated for a better result. If one recognition model fails, then the system considers the result from the other model. Decision fusion provides a better recognition rate for the overall system because each individual word is recognized as a token. For audio speech recognition, the considered threshold rate is more than 85%, and for visual speech recognition, the threshold is more than 70%. Thus, when the accuracy is greater than or equal to the threshold, the input data is acceptable. Based on the threshold, when one of the recognition systems recognizes the respective audio or visual speech input data, the system considers that the speech is recognized and provides the output.
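A minimal sketch of this decision-level fusion rule is shown below, assuming the audio and visual recognizers each return a per-word score in percent; the function and variable names are illustrative, not taken from the paper.

```python
AUDIO_THRESHOLD = 85.0   # percent, as stated for the audio modality
VISUAL_THRESHOLD = 70.0  # percent, as stated for the visual modality

def fuse_decision(audio_score: float, visual_score: float) -> str:
    """Accept the word if either modality meets its threshold; prefer the audio result."""
    if audio_score >= AUDIO_THRESHOLD:
        return "accepted (audio)"
    if visual_score >= VISUAL_THRESHOLD:
        return "accepted (visual)"
    return "rejected"

# Example: audio degraded by noise, visual modality still above its threshold.
print(fuse_decision(audio_score=62.0, visual_score=74.5))   # -> accepted (visual)
```

Treating the two modalities as an OR-decision is what lets the system fall back on lipreading when the acoustic channel is corrupted, which is the behaviour described above.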
Comparative Study

Visual speech features are extracted from the three orthogonal planes to capture the dynamic features. The GLCM matrix is also calculated from LBP-TOP to measure second-order statistical features. The proposed feature set consists of appearance-based visual features, the motion of the lip, and the statistical measures of the feature matrix, namely the energy, contrast, correlation, variance, and entropy. K-means and GEM are used as unsupervised methods, and ANN, SVM, and NB are used as supervised machine learning methods. It is observed from the experiments that the proposed visual features provide a better recognition rate than the classical feature extraction methods using both supervised and unsupervised methods. The comparison of results with the existing feature extraction methods is shown in Table 11 using the 'vVISWa' dataset and in Fig. 9 using the 'CUAVE' dataset. The proposed method performs well because it calculates the statistical values from appearance features and gives more distinct information for visual speech. LBP has been used by many researchers for visual speech recognition, but it does not capture the dynamic features of lip movement; for that reason, researchers have proposed LBP-TOP to calculate features in three dimensions. However, the co-occurrence values are also important, and the statistical values from the co-occurrence matrix give distinguishing features.

Table 11  Comparison of proposed visual speech features with existing features using the 'vVISWa' dataset

Methodology                                   System accuracy (%)
Zernike moment (Borde et al., 2014)           63
LBP-TOP (Zhao et al., 2009)                   59
LBP-TOP and DCT (Sui et al., 2017)            69
PZM (Debnath & Roy, 2021)                     74.00
LBP-TOP, GLCM and clustering (Proposed)       72
LBP-TOP, GLCM and NB (Proposed)               74.23

Fig. 9  Comparison of proposed visual speech features with existing features using the 'CUAVE' dataset

Conclusions and Future Work

The main focus of this paper is to provide an AV-ASR based education system for physically challenged people, because visual speech recognition is beneficial for ASR where background noise is high. To achieve this goal, LBP-TOP and GLCM are proposed for visual speech recognition. LBP-TOP captures the visual features in a spatio-temporal domain; therefore, the motion of the lip is also captured along with the appearance-based features. Five GLCM features are calculated to distinguish the different frames of a particular utterance, as explained in the proposed methodology. After feature extraction, the classification of speech is also a challenging problem. It is observed that the proposed feature extraction method gives a better recognition rate for visual speech using both a clustering algorithm and supervised machine learning algorithms. GEM is more efficient than K-means because it fits a Gaussian distribution for clustering, while K-means calculates distances for generating clusters. Moreover, K-means fails to generate the right clusters when the distribution of the data samples is not circular. Thus, the distribution-based model gives better performance than the distance-based model. Using supervised machine learning, SVM and NB give higher accuracy than ANN for visual speech recognition. In human-to-human communication, speech is a very natural and basic method. The design process for a speech user interface for human-computer interaction is presented in this paper using audio-visual data. In a noisy environment, when the audio signal does not work, disabled people can communicate with the system using the visual speech signal. Further work can be planned on a hybrid visual feature extraction method to extract more robust features for developing an AV-ASR-based education system.

Author Contributions  SD is the main author of this paper, who has conceived the idea and discussed it with all co-authors. PR has developed the main algorithms. SN is the corresponding author. He has performed the experiments of this paper. RGC has supervised the entire work, evaluated the performance, and proofread the paper.

References

Ahonen, T., et al. (2006). Face description with local binary patterns: Applications to face recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(12), 2037–2041. https://doi.org/10.1109/TPAMI.2006.244
Azeta, A., et al. (2010). Intelligent voice-based e-education system: A framework and evaluation. International Journal of Computing, 9, 327–334.
Borde, P., et al. (2004). 'vVISWa': A multilingual multi-pose audio visual database for robust human computer interaction. International Journal of Computer Applications, 137(4), 25–31. https://doi.org/10.5120/ijca2016908696
Borde, P., et al. (2014). Recognition of isolated words using Zernike and MFCC features for audio visual speech recognition. International Journal of Speech Technology, 18(1), 23. https://doi.org/10.1007/s10772-014-9257-1
Chen, R., et al. (2022). Image-denoising algorithm based on improved K-singular value decomposition and atom optimization. CAAI Transactions on Intelligence Technology, 7(1), 117–127. https://doi.org/10.1049/cit2.12044
Dave, N. (2015). A lip localization based visual feature extraction method. Electrical & Computer Engineering, 4(4), 452. https://doi.org/10.14810/ecij.2015.4403
Davis, S., & Mermelstein, P. (1980). Comparison of parametric representation for monosyllabic word recognition in continuously spoken sentences. IEEE Transactions on Acoustics, Speech, and Signal Processing, 28(4), 357–365.
Debnath, S., et al. (2021). Study of different feature extraction method for visual speech recognition. In International Conference on Computer Communication and Informatics (ICCCI), 2021, 1–5.
Debnath, S., & Roy, P. (2018). Study of speech enabled healthcare technology. International Journal of Medical Engineering and Informatics, 11(1), 71–85.
Debnath, S., & Roy, P. (2021). Appearance and shape-based hybrid visual feature extraction: Toward audio-visual automatic speech recognition. Signal, Image and Video Processing, 15, 25–32.
Debnath, S., & Roy, P. (2021). Audio-visual automatic speech recognition using PZM, MFCC and statistical analysis. International Journal of Interactive Multimedia and Artificial Intelligence, 7(2), 121–133.
Devi, D., et al. (2020). A boosting-aided adaptive cluster-based undersampling approach for treatment of class imbalance problem. International Journal of Data Warehousing and Mining (IJDWM), 16(3), 60–86.
Dupont, S., & Luettin, J. (2000). Audio-visual speech modeling for continuous speech recognition. IEEE Transactions on Multimedia, 2(3), 141–151.
Erber, N. P. (1975). Auditory-visual perception of speech. Journal of Speech and Hearing Disorders, 40(4), 481–492. https://doi.org/10.1044/jshd.4004.481
Feng, W., et al. (2017). Audio visual speech recognition with multimodal recurrent neural networks. In International Joint Conference on Neural Networks (IJCNN), IEEE, pp. 681–688.
Galatas, G., et al. (2012). Audio-visual speech recognition using depth information from the Kinect in noisy video conditions. In Proceedings of the International Conference on Pervasive Technologies Related to Assistive Environments, ACM, pp. 1–4. https://doi.org/10.1145/2413097.2413100
Gao, J., et al. (2021). Decentralized federated learning framework for the neighborhood: A case study on residential building load forecasting. In Proceedings of the 19th ACM Conference on Embedded Networked Sensor Systems, ACM, pp. 453–459. https://doi.org/10.1145/3485730.3493450
Ivanko, D., et al. (2021). An experimental analysis of different approaches to audio-visual speech recognition and lip-reading. In Proceedings of the 15th International Conference on Electromechanics and Robotics, Springer, Singapore, pp. 197–209.
Jafarbigloo, S. K., & Danyali, H. (2021). Nuclear atypia grading in breast cancer histopathological images based on CNN feature extraction and LSTM classification. CAAI Transactions on Intelligence Technology, 6(4), 426–439.
Jain, A., & Rathna, G. N. (2017). Visual speech recognition for isolated digits using discrete cosine transform and local binary pattern features. In IEEE Global Conference on Signal and Information Processing, IEEE, Montreal, pp. 368–372.
Jiang, R., et al. (2020). Object tracking on event cameras with offline-online learning. CAAI Transactions on Intelligence Technology, 5(3), 165–171.
Kanungo, T., et al. (2002). An efficient k-means clustering algorithm: Analysis and implementation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(7), 2037–2041. https://doi.org/10.1109/TPAMI.2002.1017616
Kashevnik, A., et al. (2021). Multimodal corpus design for audio-visual speech recognition in vehicle cabin. IEEE Access, 9, 34986–35003.
Kumar, L. A., et al. (2022). Deep learning based assistive technology on audio visual speech recognition for hearing impaired. International Journal of Cognitive Computing in Engineering, 3, 24–30.
Kuncheva, I. (2004). Combining pattern classifiers: Methods and algorithms. Wiley.
Lazli, L., & Boukadoum, M. (2017). HMM/MLP speech recognition system using a novel data clustering approach. In IEEE 30th Canadian Conference on Electrical and Computer Engineering (CCECE), IEEE, Windsor.
Mohanaiah, P., et al. (2013). Image texture feature extraction using GLCM approach. International Journal of Scientific and Research Publications, 3(5), 85.
Nadif, M., & Govaert, G. (2005). Block clustering via the block GEM and two-way EM algorithms. In The 3rd ACS/IEEE International Conference on Computer Systems and Applications, IEEE. https://doi.org/10.1109/AICCSA.2005.1387029
Namasudra, S., & Roy, P. (2015). Size based access control model in cloud computing. In Proceedings of the International Conference on Electrical, Electronics, Signals, Communication and Optimization, IEEE, Visakhapatnam, pp. 1–4.
Namasudra, S. (2020). Fast and secure data accessing by using DNA computing for the cloud environment. IEEE Transactions on Services Computing.
Namasudra, S., & Roy, P. (2017). A new table based protocol for data accessing in cloud computing. Journal of Information Science and Engineering, 33(3), 585–609.
Noda, K., et al. (2014). Audio-visual speech recognition using deep learning. Applied Intelligence, 42(4), 567. https://doi.org/10.1007/s10489-014-0629-7
Ojala, T., et al. (2002). Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(7), 971–987.
Olivan, C. H., et al. (2021). Music boundary detection using convolutional neural networks: A comparative analysis of combined input features. International Journal of Interactive Multimedia and Artificial Intelligence, 7(2), 78–88.
Patterson, E., et al. (2002). CUAVE: A new audio-visual database for multimodal human-computer interface research. In IEEE International Conference on Acoustics, Speech, and Signal Processing, IEEE, Orlando.
Rauf, H. T., et al. (2021). Enhanced bat algorithm for COVID-19 short-term forecasting using optimized LSTM. Soft Computing, 25(20), 12989–12999.
Revathi, A., & Venkataramani, Y. (2009). Perceptual features based isolated digit and continuous speech recognition using iterative clustering approach. In First International Conference on Networks & Communications (NetCoM), IEEE, Chennai.
Revathi, A., et al. (2019). Person authentication using speech as a biometric against play back attacks. Multimedia Tools and Applications, 78(2), 1569–1582.
Shikha, B., et al. (2020). An extreme learning machine-relevance feedback framework for enhancing the accuracy of a hybrid image retrieval system. International Journal of Interactive Multimedia and Artificial Intelligence, 6(2), 15–27.
Shrawankar, U., & Thakare, V. (2010). Speech user interface for computer based education system. In International Conference on Signal and Image Processing, pp. 148–152.
Soni, B., et al. (2016). Text-dependent speaker verification using classical LBG, adaptive LBG and FCM vector quantization. International Journal of Speech Technology, 19(3), 525–536. https://doi.org/10.1007/s10772-016-9346-4
Sui, C., et al. (2017). A cascade gray-stereo visual feature extraction method for visual and audio-visual speech recognition. Speech Communication, 90(1), 89.
Zhao, G., et al. (2009). Lipreading with local spatiotemporal descriptors. IEEE Transactions on Multimedia, 11(7), 56. https://doi.org/10.1109/TMM.2009.2030637

Publisher's Note  Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
