Speech Emotion Recognition Using Deep Learning

Mohamed A. Gismelbari, Ilya I. Vixnin, Gregory M. Kovalev, Eugene E. Gogolev
Dept. of Automatic Control Systems
Saint Petersburg Electrotechnical University “LETI”
Saint Petersburg, Russia
[email protected], [email protected], [email protected], [email protected]

Abstract—This study explores the application of deep learning techniques in recognizing emotional states from spoken language. Specifically, we employ Convolutional Neural Networks (CNNs) and the HuBERT model to analyze the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS). Our findings suggest that deep learning models, particularly the HuBERT model, exhibit significant potential in accurately identifying speech emotions. The models were trained and tested on a dataset containing various emotional expressions, including happiness, sadness, anger, and fear, among others. The experimentation involved preprocessing the audio data, extracting features such as Mel Frequency Cepstral Coefficients (MFCCs), and implementing deep learning architectures for emotion classification. The HuBERT model, with its advanced self-supervised learning mechanism, outperformed traditional CNNs in terms of accuracy and efficiency. This research highlights the importance of selecting appropriate deep learning models and feature sets for the task of speech emotion recognition. Our analysis demonstrates that the HuBERT model, by leveraging contextual information and temporal dynamics in speech, offers a promising approach for developing more sensitive and accurate SER systems. Such systems have potential applications in fields including mental health assessment, interactive voice response systems, and educational software, by enabling machines to understand and respond to human emotions more effectively. The findings of this study contribute to the ongoing discussion in artificial intelligence about best practices for applying deep learning techniques to speech processing tasks.

Keywords—Speech Emotion Recognition, Deep Learning, Convolutional Neural Networks, HuBERT Model, RAVDESS Dataset

I. INTRODUCTION

The advancement of spoken language processing, intersecting with natural language processing, cognitive sciences, and human-machine interaction (HMI), has significantly propelled the development of adaptive and responsive human-machine interfaces. Speech Emotion Recognition (SER) emerges as a critical component in enhancing the naturalness and effectiveness of human-machine dialogue systems. Its relevance spans smart environments, virtual assistants, and call centres, where emotional nuances in speech profoundly influence user interactions. This research deploys deep learning techniques for SER, leveraging the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) for training and evaluating deep neural network (DNN) models. The study begins with an extensive review of recent SER methodologies and advancements. It then constructs a comprehensive SER system pipeline, using deep learning models for emotion detection from audio data. Using Python and deep learning libraries such as TensorFlow and Keras, the research explores common DNN architectures for SER, including Convolutional Neural Networks (CNNs), Long Short-Term Memory networks (LSTMs), and Recurrent Neural Networks (RNNs). The project aims to implement a SER model, assess its performance in terms of accuracy, precision, recall, and F1-score, and fine-tune the model for optimized performance. Additionally, the effectiveness of pre-trained models such as HuBERT and ResNet is compared against custom-built models. Finally, the paper outlines a potential business plan for the commercialization of SER technology, suggesting avenues for integrating emotion recognition capabilities into existing and future products and services. This contribution aims to advance HMI by enabling systems to understand and respond to human emotions more effectively.

Fig. 1. Steps of SER

II. SPEECH EMOTION RECOGNITION SYSTEMS

The field of artificial intelligence (AI) has catalysed significant advancements in human-machine communication, notably through the development of speech emotion recognition (SER). SER has gained prominence for its ability to discern nuanced emotional states from human speech, which is crucial for a variety of applications including entertainment, automotive safety, virtual assistants, healthcare, customer service centres, and e-learning platforms. The integration of SER into conversational systems such as Google Assistant, Siri, and Alexa exemplifies its utility in enhancing user engagement and satisfaction by facilitating more natural and intuitive interactions between humans and machines.

The pursuit of accurate emotion detection from speech presents complex challenges due to the variability of utterances and the subtlety of emotional expressions. Achieving high levels of SER performance requires navigating intricate processes, including the pre-processing of audio data, feature extraction, and the classification of emotions. The effort to refine SER capabilities remains a methodological quest for algorithms that can surpass existing benchmarks, underscoring the intricacy and multidimensional nature of human emotions as conveyed through speech.

III. COMMON AUDIO FEATURES EXTRACTED FOR SOUND CLASSIFICATION PROBLEMS

In any machine learning project focused on audio, the initial step involves collecting audio signal data, which must then be transformed into features suitable for algorithmic processing. A key part of this process is defining and extracting the audio features most relevant to model construction, specifically for audio classification tasks.

A. Mel Frequency Cepstral Coefficients (MFCCs)

Using the Librosa library in Python, one can extract significant features such as Mel Frequency Cepstral Coefficients (MFCCs). MFCCs are pivotal in capturing the timbral and textural qualities of sound in the frequency domain, closely approximating the human auditory system's response. These coefficients are influenced by the shape of the human vocal tract, including the tongue and teeth, and play a vital role in the precise representation of sounds. MFCCs effectively capture the nuanced differences in sound perception through the Mel scale, which accounts for the way humans discern pitch differences across a broad frequency range from 20 Hz to 20 kHz. This ability to reflect the perceptual properties of sound makes MFCCs invaluable for audio classification models, enabling a more accurate and human-like processing of audio data.

Fig. 2. Flowchart for obtaining MFCC coefficients

The Mel scale used in this computation maps a frequency f (in Hz) to a perceptual pitch value:

Mel(f) = 2595 log10(1 + f / 700)    (1)

B. Short-Time Fourier Transform (STFT)

The Short-Time Fourier Transform (STFT) is a method that applies Fourier transforms to portions of a signal to obtain frequency information localized in time. Unlike the standard Fourier transform, which gives an average frequency overview of the whole signal, the STFT can capture how frequency components change over time. This is achieved by dividing the signal into fixed-size frames (e.g., 2048 samples) and transforming each frame separately, making the STFT essential for processing audio data in machine learning applications. The result is a spectrogram that maps time, frequency, and magnitude, providing a comprehensive representation of the signal. This spectrogram serves as the basis for extracting audio features, which are then structured into an array for neural network models to classify sounds.

Fig. 3. STFT of a sound signal

C. Chroma

The chroma feature captures a quality of a pitch class, the "colour" of a musical pitch, which can be decomposed into an octave-invariant value called "chroma" and a "pitch height" indicating the octave the pitch lies in.

Fig. 4. Typical Chroma spectrogram of a sound signal

The audio features described in this section, namely the STFT, MFCCs, and chroma, are therefore extracted from each audio signal, assembled into a feature array, and fed into a model for emotion classification. This defines the deep learning pre-processing pipeline for audio data shown in Fig. 5.

Fig. 5. Deep Learning pre-processing pipeline for Audio Data
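To make the pipeline in Fig. 5 concrete, the following minimal sketch uses Librosa to compute the STFT, MFCC, and chroma features discussed above and to stack them into a single feature vector per file. It is an illustrative sketch rather than the authors' exact script: the 2048-sample frame size matches the text, while the hop length, the number of MFCCs, and the time-averaging of features are assumptions.

```python
import numpy as np
import librosa

def extract_features(path, n_mfcc=40, n_fft=2048, hop_length=512):
    """Load an audio file and return one feature vector built from
    MFCC and chroma statistics (means over time)."""
    signal, sr = librosa.load(path, sr=None)  # keep the native sampling rate

    # Magnitude STFT with fixed-size frames (e.g., 2048 samples per frame)
    stft = np.abs(librosa.stft(signal, n_fft=n_fft, hop_length=hop_length))

    # MFCCs approximate the human auditory response on the Mel scale
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)

    # Chroma features are computed from the STFT magnitude
    chroma = librosa.feature.chroma_stft(S=stft, sr=sr)

    # Average each feature over time and concatenate into one array
    return np.concatenate([mfcc.mean(axis=1), chroma.mean(axis=1)])

# Illustrative call on one RAVDESS file:
# features = extract_features("Actor_01/03-01-05-01-01-01-01.wav")
```

Averaging each feature over time yields a fixed-length vector regardless of clip duration, which is convenient for the dense and convolutional models used later in the paper.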

IV. THE RAVDESS DATASET

The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) contains 1440 speech files: 60 trials per actor × 24 actors. The database features 24 professional actors (12 female, 12 male) vocalizing two lexically matched statements in a neutral North American accent. The speech emotions include calm, happy, sad, angry, fearful, surprised, and disgusted expressions. Each expression is produced at two levels of emotional intensity (normal and strong), with an additional neutral expression.

V. DATA PREPARATION

This section defines the steps through which the audio signal is pre-processed before being fed into the deep learning model for training, testing, and finally emotion classification from speech. These steps are as follows:

1. The sample audio is taken from the RAVDESS dataset after importing the whole dataset into the script. Because this work is ultimately intended for call-centre deployment, only the four emotions considered critical in that industry, namely Neutral, Sad, Angry, and Happy, are extracted from the RAVDESS dataset. These recordings then undergo signal processing before training, testing, and eventual classification.

2. A loop over the RAVDESS directory collects, from the audio folders, the emotion and the gender of each speaker; the label counts of all collected audio files and emotions are then plotted (Fig. 6).

Fig. 6. Label Counts of intended emotions

3. Each of these audio files goes through the feature extraction process discussed in Section III.

4. All the extracted features for these emotions are then concatenated into a data frame, which serves as the input to the CNN model for emotion classification, as sketched below.
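A minimal sketch of steps 1 to 4 is shown below. It assumes the standard RAVDESS file-naming convention, in which the third dash-separated field encodes the emotion and even-numbered actors are female, and it reuses the extract_features helper sketched in Section III; the directory path and column names are illustrative, not the authors' code.

```python
import os
import pandas as pd

# RAVDESS emotion codes; only the four call-centre-relevant classes are kept
EMOTIONS = {"01": "neutral", "03": "happy", "04": "sad", "05": "angry"}

def build_dataset(ravdess_dir):
    rows = []
    for root, _, files in os.walk(ravdess_dir):
        for name in files:
            if not name.endswith(".wav"):
                continue
            parts = name.split(".")[0].split("-")  # e.g. 03-01-05-01-01-01-12
            emotion = EMOTIONS.get(parts[2])       # third field = emotion code
            if emotion is None:                    # skip the other emotions
                continue
            gender = "female" if int(parts[6]) % 2 == 0 else "male"
            features = extract_features(os.path.join(root, name))
            rows.append({"emotion": emotion, "gender": gender,
                         "features": features})
    return pd.DataFrame(rows)

df = build_dataset("RAVDESS/")          # path is illustrative
print(df["emotion"].value_counts())     # label counts, cf. Fig. 6
```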

VI. CNN MODEL TRAINING AND TESTING

A conventional CNN architecture, as described in the literature, was designed for emotion classification. The model incorporates eight convolutional layers, all using the ReLU activation function. To mitigate the risk of overfitting, dropout layers are strategically included. The architecture begins with a first convolutional layer initialized to match the dimensions of the x_train input variable, specifically (218, 1), and culminates in a dense layer with four units, reflecting the number of target emotion classes. The Adam optimizer with a learning rate of 0.0001 drives the learning process, as summarized in Table I. The test split was set at 20% and the training split at 80%.

TABLE I. CNN MODEL ARCHITECTURE
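Table I itself is not reproduced here, so the sketch below shows one plausible Keras realization of the description above: eight Conv1D layers with ReLU activations, dropout for regularization, an input shape of (218, 1), a four-unit softmax output, and the Adam optimizer with a learning rate of 0.0001. The filter counts, kernel size, and pooling placement are assumptions, since the exact layer table is not available.

```python
from tensorflow.keras import layers, models, optimizers

def build_cnn(input_shape=(218, 1), n_classes=4):
    """One plausible realization of the eight-layer Conv1D classifier."""
    model = models.Sequential()
    model.add(layers.Input(shape=input_shape))   # feature vectors of shape (218, 1)

    # Eight Conv1D layers with ReLU; the filter counts are assumptions
    for filters in (64, 64, 128, 128, 256, 256, 512, 512):
        model.add(layers.Conv1D(filters, kernel_size=5, padding="same",
                                activation="relu"))

    model.add(layers.Dropout(0.3))               # dropout against overfitting
    model.add(layers.GlobalAveragePooling1D())   # collapse the time axis
    model.add(layers.Dense(n_classes, activation="softmax"))

    model.compile(optimizer=optimizers.Adam(learning_rate=1e-4),
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

With an 80/20 train/test split, training would then use the settings reported in the next paragraph, e.g. model.fit(x_train, y_train, epochs=80, batch_size=32).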
The model was compiled with categorical cross-entropy loss and trained for 80 epochs with a batch size of 32, reaching a test accuracy of roughly 40%. This accuracy is considered very low for such a task and such a dataset, and it was the maximum reached by this model even after parameter tuning. Hence, another model that could achieve a higher accuracy on the same inputs was considered: the HuBERT model, discussed in the following section.

VII. HUBERT MODEL TRAINING AND TESTING

The HuBERT model, built on the Transformer architecture, is well suited to tasks such as speech emotion detection. It identifies discrete units in speech, enabling a deeper understanding of emotional cues. Training HuBERT for emotion detection on the RAVDESS dataset, which here covers the emotions angry, sad, happy, and neutral, consists of an initial unsupervised pre-training phase in which the model learns from unlabelled data by predicting masked speech segments, developing a generic grasp of speech patterns. Subsequently, it is fine-tuned with labelled data from RAVDESS, focusing on the four specified emotions; this refines the model's predictive capabilities. Testing then validates its performance on unseen data, assessing its precision in emotion identification. HuBERT's architecture and training methodology offer significant potential for improved accuracy in speech emotion recognition, and this is borne out in practice: the HuBERT model reached an accuracy of 83.77%, compared to roughly 40% for the CNN model.

Fig. 7. HuBERT Model Architecture
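The fine-tuning workflow described above can be sketched with the Hugging Face Transformers library, which provides a HuBERT encoder with a sequence-classification head. The checkpoint name, the 16 kHz input assumption, and the omission of the full training loop are simplifications for illustration, not the authors' reported configuration.

```python
import torch
from transformers import AutoFeatureExtractor, HubertForSequenceClassification

LABELS = ["neutral", "happy", "sad", "angry"]

# Pre-trained HuBERT base encoder with a freshly initialized 4-way head
CHECKPOINT = "facebook/hubert-base-ls960"   # assumed checkpoint for the sketch
extractor = AutoFeatureExtractor.from_pretrained(CHECKPOINT)
model = HubertForSequenceClassification.from_pretrained(
    CHECKPOINT, num_labels=len(LABELS))

def predict(waveform, sampling_rate=16000):
    """Classify one mono waveform (1-D array) into one of the four emotions."""
    inputs = extractor(waveform, sampling_rate=sampling_rate,
                       return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return LABELS[int(logits.argmax(dim=-1))]

# Fine-tuning would wrap `model` in a standard PyTorch training loop
# (or the Trainer API) over the labelled RAVDESS recordings.
```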

VIII. RESULTS AND DISCUSSIONS

This section details the performance outcomes of the two Speech Emotion Recognition (SER) models, the custom CNN model and the HuBERT base model, on the amended RAVDESS dataset. Performance evaluation involves confusion matrices, several evaluation metrics, and training accuracy, focusing on the models' ability to discern four specific emotions after label balancing: neutral, happy, sad, and angry. The CNN model, built from convolutional, pooling, and dropout layers, was trained and tested on the cleaned dataset to assess its proficiency in emotion identification. The pre-trained HuBERT model was fine-tuned on the same dataset, and its performance was evaluated in the same way. Notably, the HuBERT base model significantly surpassed the CNN model, achieving 83.77% overall accuracy compared to the CNN's 40.97%. A detailed comparison between the models follows, showcasing their respective strengths and weaknesses in SER through confusion matrices and key evaluation metrics such as precision, recall, and F1-score.

A. Confusion Matrix

Fig. 8. Confusion matrix plots of CNN and HuBERT Model

Examining the confusion matrices in Fig. 8 reveals a clear disparity in performance between the two models, particularly in the diagonal cells, which denote accurate emotion classifications. The colour scale, with darker and lighter shades of blue indicating higher and lower counts respectively, visually suggests the HuBERT model's superior ability in emotion recognition. However, this observation, based primarily on counts of correctly classified samples, does not fully capture overall performance. To attain a more comprehensive analysis, additional evaluation metrics were therefore employed; these allow a deeper examination and comparison of the models and a more grounded determination of which model outperforms the other.

B. Model Predictions on Test Dataset

The accuracy of a model on the test dataset is determined by

Accuracy = (number of correct predictions / total number of predictions) × 100%    (2)

Fig. 9. Model prediction performance of CNN and HuBERT on test dataset

Fig. 9 illustrates that the HuBERT model significantly surpasses the CNN model in accuracy on the test dataset, achieving 129 correct predictions out of 154 total cases. In contrast, the CNN model managed only 63 correct predictions, highlighting the superior performance of the HuBERT model.

• CNN model accuracy on test dataset: 40.97%
• HuBERT model accuracy on test dataset: 83.77%

C. Precision, Recall, F1-score

Fig. 10. Evaluation metrics on the 4 emotions for CNN and HuBERT model

Fig. 10 shows that, based on these evaluation metrics, the HuBERT model significantly outperforms the CNN model, achieving higher scores on all metrics, with a particularly substantial increase in precision and F1-score for the "neutral" and "sad" emotions.
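The metrics reported in this section can be reproduced with scikit-learn; the brief sketch below assumes y_true and y_pred hold the integer-encoded test labels and model predictions, encoded in the order of LABELS (the names are illustrative).

```python
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix)

LABELS = ["neutral", "happy", "sad", "angry"]

def evaluate(y_true, y_pred):
    """Print the confusion matrix, overall accuracy, and per-class
    precision, recall, and F1-score for a set of test predictions."""
    print(confusion_matrix(y_true, y_pred))                  # cf. Fig. 8
    print(f"accuracy: {accuracy_score(y_true, y_pred):.4f}") # cf. eq. (2)
    print(classification_report(y_true, y_pred,
                                target_names=LABELS))        # cf. Fig. 10
```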
IX. CONCLUSION

This research implemented and compared two Speech Emotion Recognition (SER) models, a customized CNN and the HuBERT base model, on the RAVDESS dataset, focusing on four primary emotions: neutral, sad, angry, and happy. Performance analysis revealed that the HuBERT model significantly outperformed the CNN model, achieving 83.77% accuracy versus 40.97%, making it the preferred choice for optimal SER performance.

Key outcomes:
• Development of a comprehensive SER system pipeline, from audio data collection through to emotion recognition.
• Utilization of Python and deep learning libraries (TensorFlow, Keras, PyTorch, Librosa) to build the SER framework.
• Exploration of common neural network models (CNN, LSTM, RNN) and the implementation of a customized CNN model.
• Investigation and adoption of pre-trained models such as HuBERT and ResNet, with HuBERT chosen for its superior performance.
• Fine-tuning of both models to enhance performance, emphasizing the importance of hyperparameter optimization.

Future research directions are proposed to extend the work further, including:
1. Expanding the training data beyond the RAVDESS dataset and aiming to recognize all eight emotions provided in RAVDESS.
2. Exploring and comparing additional models such as ResNet.
3. Developing a real-time SER system for potential commercial deployment.

This study advances SER research, offering a foundation for future exploration. The proposed system and findings have broad potential applications across industries, from healthcare to entertainment, highlighting the significance of advancing SER technology for real-world applications.

