Speech Emotion Recognition Using Deep Learning
Eugene E. Gogolev
Dept. of Automatic Control Systems
Saint Petersburg Electrotechnical
University “LETI”
Saint Petersburg, Russia
[email protected]
Abstract—This study explores the application of deep learning techniques in recognizing emotional states from spoken language. Specifically, we employ Convolutional Neural Networks (CNNs) and the HuBERT model to analyze the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS). Our findings suggest that deep learning models, particularly the HuBERT model, exhibit significant potential in accurately identifying speech emotions. The models were trained and tested on a dataset containing various emotional expressions, including happiness, sadness, anger, and fear, among others. The experimentation involved preprocessing the audio data, feature extraction using Mel Frequency Cepstral Coefficients (MFCCs), and implementing deep learning architectures for emotion classification. The HuBERT model, with its advanced self-supervised learning mechanism, outperformed traditional CNNs in terms of accuracy and efficiency. This research highlights the importance of selecting appropriate deep learning models and feature sets for the task of speech emotion recognition. Our analysis demonstrates that the HuBERT model, by leveraging contextual information and temporal dynamics in speech, offers a promising approach for developing more sensitive and accurate SER systems. These systems have potential applications in various fields, including mental health assessment, interactive voice response systems, and educational software, by enabling machines to understand and respond to human emotions more effectively. The findings of this study contribute to the ongoing discussion in the field of artificial intelligence about best practices for implementing deep learning techniques in speech processing tasks.

for training and evaluating deep neural network (DNN) models. The study embarks on an extensive literature review of recent SER methodologies and advancements. It endeavours to construct a comprehensive SER system pipeline, utilizing deep learning models for emotion detection from audio data. Utilizing Python and deep learning libraries such as TensorFlow and Keras, the research explores architectures such as Convolutional Neural Networks (CNNs), Long Short-Term Memory (LSTM) networks, and Recurrent Neural Networks (RNNs) for SER. The project aims to implement an SER model, assessing its performance based on accuracy, precision, recall, and F1-score, and to fine-tune this model for optimized performance. Additionally, the effectiveness of pre-trained models, like HuBERT and ResNet, is compared against custom-built models. Finally, the paper outlines a potential business plan for the commercialization of SER technology, suggesting avenues for integrating emotional recognition capabilities into existing and future products and services. This contribution aims to advance HMI by enabling systems to more effectively understand and respond to human emotions.
The pursuit of accurate emotion detection from speech presents complex challenges due to the variability of utterances and the subtlety of emotional expressions. Achieving high levels of SER performance necessitates navigating intricate processes, including the pre-processing of audio data, feature extraction, and the classification of emotions. The endeavour to refine SER capabilities continues to be a methodological quest for algorithms that can surpass existing benchmarks, underscoring the intricacy and the multidimensional nature of human emotions as conveyed through speech.

III. COMMON AUDIO FEATURES EXTRACTED FOR SOUND CLASSIFICATION PROBLEMS

In any machine learning project focused on audio, the initial step involves collecting audio signal data, which then must be transformed into features suitable for algorithmic processing. A key part of this process is defining and extracting the most crucial audio features for model construction, specifically for audio classification tasks (a code sketch of this step is given at the end of this section).

B. Short-Time Fourier Transform (STFT)

The STFT of the raw signal provides the basis for extracting audio features, which are then structured into an array for neural network models to classify sounds.

Fig. 3. STFT of a sound signal

C. Chroma

The chroma feature is a quality of a pitch class, referring to the "color" of a musical pitch: a pitch can be decomposed into an octave-invariant value called "chroma" and a "pitch height" that indicates the octave the pitch is in.

Fig. 5. Deep Learning pre-processing pipeline for Audio Data
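The octave-invariant decomposition described in the Chroma subsection can be stated explicitly. The relation below is a standard formulation of that split, offered as an illustrative sketch rather than a reproduction of the paper's own equation; f is the frequency of a tone and f_ref a reference frequency:

\[
  \log_2\!\frac{f}{f_{\mathrm{ref}}} = h + c,
  \qquad
  h = \left\lfloor \log_2\frac{f}{f_{\mathrm{ref}}} \right\rfloor \in \mathbb{Z},
  \qquad
  c = \log_2\frac{f}{f_{\mathrm{ref}}} - h \in [0, 1),
\]

where h is the pitch height (octave index) and c is the chroma.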
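To make the feature-extraction step of Section III concrete, the snippet below is a minimal sketch, assuming the librosa library and a hypothetical file path; it computes the STFT magnitude, MFCC, chroma, and mel features and stacks their time-averaged values into a single vector, one row of the array fed to the neural network models.

import numpy as np
import librosa

def extract_features(path, sr=16000, n_mfcc=40):
    """Return a fixed-length feature vector (MFCC + chroma + mel) for one audio file."""
    y, sr = librosa.load(path, sr=sr)             # load and resample to a common rate
    stft = np.abs(librosa.stft(y))                # magnitude spectrogram (STFT)
    mfcc = np.mean(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc), axis=1)
    chroma = np.mean(librosa.feature.chroma_stft(S=stft, sr=sr), axis=1)
    mel = np.mean(librosa.feature.melspectrogram(y=y, sr=sr), axis=1)
    return np.hstack([mfcc, chroma, mel])         # one row of the model's input array

# Example call with a hypothetical path:
# x = extract_features("RAVDESS/Actor_01/03-01-05-01-01-01-01.wav")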
emotions that are considered critical in such an industry, i.e., Neutral, Sad, Angry, and Happy, are extracted from the RAVDESS dataset. They then undergo signal processing before training and testing, and are eventually classified.

2. A loop was made through the RAVDESS directory to collect the emotion and the gender of the speaker from the audio folders, and a plot of all collected audio files and emotions was produced (a sketch of this labelling step is given after the list).

3. Each of these audio files goes through the feature extraction process discussed in Section III.
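The labelling loop of step 2 can be sketched as follows. This is a minimal illustration, assuming the standard RAVDESS filename convention (the third hyphen-separated field encodes the emotion, the last field the actor, with odd actor numbers male and even ones female) and a hypothetical RAVDESS/ directory layout, not the authors' exact script.

import os

# RAVDESS emotion codes (per the dataset documentation)
EMOTIONS = {"01": "neutral", "02": "calm", "03": "happy", "04": "sad",
            "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised"}
KEEP = {"neutral", "sad", "angry", "happy"}       # the four emotions used here

records = []
for root, _, files in os.walk("RAVDESS"):         # hypothetical dataset root
    for name in files:
        if not name.endswith(".wav"):
            continue
        parts = name.split(".")[0].split("-")     # e.g. 03-01-05-01-01-01-12
        emotion = EMOTIONS.get(parts[2])
        gender = "male" if int(parts[6]) % 2 else "female"
        if emotion in KEEP:
            records.append((os.path.join(root, name), emotion, gender))

# `records` now pairs each file path with its emotion and speaker gender,
# ready for plotting label counts (cf. Fig. 6) and for feature extraction.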
Fig. 6. Label Counts of intended emotions

VII. HUBERT MODEL TRAINING AND TESTING

The HuBERT model, built on the Transformer architecture, is specifically designed for tasks like speech emotion detection. It uniquely identifies discrete elements in speech, enabling a deeper understanding of emotional cues. The training of HuBERT for emotion detection on the RAVDESS dataset, which includes emotions such as angry, sad, happy, and neutral, consists of an initial unsupervised pre-training phase. Here, the model learns from unlabelled data by predicting masked speech segments, developing a generic grasp of speech patterns. Subsequently, it undergoes fine-tuning with labelled data from RAVDESS, focusing on the four specified emotions. This process refines the model's predictive capabilities. Testing then validates its performance on unseen data, assessing its precision in emotion identification. HuBERT's advanced architecture and training methodology offer significant potential for enhanced accuracy in speech emotion recognition tasks, and this is borne out by the results: the HuBERT model reached an accuracy of 83.77%, compared to 40% for the CNN model.
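A minimal sketch of this fine-tuning stage is shown below, assuming the Hugging Face transformers library, the facebook/hubert-base-ls960 checkpoint, and 16 kHz waveforms already loaded as arrays; the actual training script, hyper-parameters, and data split are not reproduced here.

import torch
from transformers import HubertForSequenceClassification, Wav2Vec2FeatureExtractor

LABELS = ["neutral", "sad", "angry", "happy"]

# Pre-trained HuBERT with a fresh classification head for the four emotions
model = HubertForSequenceClassification.from_pretrained(
    "facebook/hubert-base-ls960", num_labels=len(LABELS))

# Feature extractor configured for HuBERT's expected 16 kHz mono input
extractor = Wav2Vec2FeatureExtractor(feature_size=1, sampling_rate=16000,
                                     padding_value=0.0, do_normalize=True,
                                     return_attention_mask=True)

def forward_pass(waveforms, labels):
    """One supervised step on a small batch of 16 kHz waveforms."""
    inputs = extractor(waveforms, sampling_rate=16000,
                       return_tensors="pt", padding=True)
    outputs = model(**inputs, labels=torch.tensor(labels))
    return outputs.loss, outputs.logits           # loss for backprop, logits for metrics

# Training then reduces to the usual loop:
# optimizer.zero_grad(); loss.backward(); optimizer.step()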
A. Confusion Matrix

C. Precision, Recall, F1-score
Fig. 10. Evaluation metrics on the 4 emotions for CNN and HuBERT model
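The metrics reported in Fig. 10 can be reproduced directly from model predictions. The snippet below is an illustrative sketch assuming scikit-learn and arrays y_true / y_pred holding the four emotion labels; it is not the authors' exact evaluation code.

from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_recall_fscore_support)

LABELS = ["neutral", "sad", "angry", "happy"]

def evaluate(y_true, y_pred):
    """Print the confusion matrix and per-class precision, recall, and F1."""
    print("accuracy:", accuracy_score(y_true, y_pred))
    print(confusion_matrix(y_true, y_pred, labels=LABELS))
    prec, rec, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, labels=LABELS, zero_division=0)
    for label, p, r, f in zip(LABELS, prec, rec, f1):
        print(f"{label:>8}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")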