ABSTRACT

The project aims to build machines that can communicate with people through voice, a particularly
active field of study in artificial intelligence and machine learning. Speech is a rich source of information that
carries paralinguistic information in addition to linguistic content, and emotion is one important example of
paralinguistic information that is partly expressed through speech. Machines that can interpret this emotional
information make human-machine communication easier and more natural. In this study, we investigated how
well Convolutional Neural Networks (CNNs) can identify emotions in speech, using broad-band spectrograms
of speech signals as input features. The speech samples were collected from actors who deliberately expressed
specific emotions while speaking. To enhance the model's performance and generalization, we trained and
tested our models on speech datasets in multiple languages, and we applied two levels of data augmentation to
enrich the training data, improving robustness and accuracy in emotion recognition. The dropout method was
used to regularize the networks. The project combines a Convolutional Neural Network with MFCC features
to classify speech signals, and the experiments, implemented in a Python framework, yield improved
classification accuracy.

PROGRAM OUTCOMES (POs)

At the time of graduation, students from the Computer Science and Engineering
Program will possess:

Engineering Graduates will be able to:
1. Engineering knowledge: Apply the knowledge of mathematics, science, engineering fundamentals, and
an engineering specialization to the solution of complex engineering problems.
2. Problem analysis: Identify, formulate, review research literature, and analyze complex engineering
problems reaching substantiated conclusions using first principles of mathematics, natural sciences,
and engineering sciences.
3. Design/development of solutions: Design solutions for complex engineering problems and design
system components or processes that meet the specified needs with appropriate consideration for the public
health and safety, and the cultural, societal, and environmental considerations.
4. Conduct investigations of complex problems: Use research-based knowledge and research
methods including design of experiments, analysis and interpretation of data, and synthesis of the
information to provide valid conclusions.
5. Modern tool usage: Create, select, and apply appropriate techniques, resources,
and modern engineering and IT tools including prediction and modelling to complex engineering
activities with an understanding of the limitations.
6. The engineer and society: Apply reasoning informed by the contextual knowledge to assess
societal, health, safety, legal and cultural issues, and the consequent responsibilities relevant to the
professional engineering practice.
7. Environment and sustainability: Understand the impact of the professional engineering solutions in
societal and environmental contexts, and demonstrate the knowledge of, and need for, sustainable
development.
8. Ethics: Apply ethical principles and commit to professional ethics and responsibilities and norms of the
engineering practice.
9. Individual and team work: Function effectively as an individual, and as a member or leader in
diverse teams, and in multidisciplinary settings.
10. Communication: Communicate effectively on complex engineering activities with the engineering
community and with society at large, such as, being able to comprehend and write effective reports
and design documentation, make effective presentations, and give and receive clear instructions.

11. Project management and finance: Demonstrate knowledge and understanding of the
engineering and management principles and apply these to one’s own work, as a member
and leader in a team, to manage projects and in multidisciplinary environments.
12. Life-long learning: Recognize the need for, and have the preparation and ability to engage in
independent and life-long learning in the broadest context of technological change.

Program Educational Objectives (PEOs)

The graduates of Computer Science and Engineering will be able to demonstrate:

PEO1 - Good communication, leadership and entrepreneurship skills

PEO2 - Expertise in advanced computer technologies to become competitive

PEO3 - The habit of learning and nurturing the research attitude

PEO4 - The ability to work in a team with professional ethics

PROGRAM SPECIFIC OUTCOMES (PSOs)
Engineering Graduates will be able to:
PSO1 - Ability to comprehend the underlying principles and systematic methods for the
development, operation and maintenance of software, using professional engineering practices.
PSO2 - Ability to develop socially acceptable technical solutions to real world problems with
various strategies for sustainable development.
PSO3 - Ability to apply the skills in the areas related to Algorithms, Networking, Web Designing,
Artificial Intelligence, Internet of Things and Data Analytics of various complexities towards
successful employment.

INTRODUCTION:

Speech is a fundamental mode of human communication, conveying not only linguistic content but also
paralinguistic information such as emotions, tone, and stress. Emotion plays a crucial role in daily interactions,
influencing decision-making, social behaviour, and overall communication effectiveness. Understanding
emotions in speech is essential for developing more intelligent and interactive systems, such as virtual
assistants, sentiment analysis tools, and human-computer interaction applications.

Speech Emotion Recognition (SER) is a challenging yet vital research area in artificial intelligence, aiming
to detect emotions based on audio signals, independent of the semantic content. By identifying emotions such
as happiness, sadness, anger, fear, and neutrality, SER systems can improve various AI-driven applications.
Traditionally, emotion classification in speech relied on handcrafted feature extraction and conventional
machine learning models such as Support Vector Machines (SVM) and Random Forest.

These methods often struggle to accurately capture complex variations in speech tone, pitch, and
intensity, leading to lower classification performance. The effectiveness of emotion recognition depends on
extracting meaningful speech features and employing robust classifiers. Spectral and prosodic features, such as
pitch, energy, formants, and frequency variations, contain valuable emotional information. However, manual
feature engineering poses challenges in real-time applications and limits the scalability of SER systems across
different datasets and languages. To overcome these limitations, deep learning approaches, particularly
Convolutional Neural Networks (CNNs), have been widely adopted in SER. CNNs automatically learn relevant
features from raw or transformed audio data, reducing the dependency on manual feature selection.

The proposed system integrates Mel Frequency Cepstral Coefficients (MFCC) feature extraction with a
CNN-based deep learning model to improve speech emotion classification accuracy. The system is trained on
the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS), which provides diverse
emotional speech samples, enhancing model generalization.

OBJECTIVES

The primary objective of this project is to develop an efficient Speech Emotion Recognition (SER) system using
deep learning techniques to classify emotions from speech signals accurately. The system leverages Mel
Frequency Cepstral Coefficients (MFCC) feature extraction and a Convolutional Neural Network (CNN) model
to improve the performance of emotion classification. The following are the key objectives of the project:

• To develop an automated system for emotion recognition from speech – The project aims to build a
robust and intelligent system that can identify human emotions based on voice signals, enhancing the
interaction between humans and machines.
• To implement deep learning techniques for improved classification accuracy – The project utilizes
CNN-based deep learning models, which automatically extract relevant features from speech,
eliminating the need for manual feature engineering.
• To integrate MFCC feature extraction for enhanced speech analysis – By using MFCC, the system
effectively captures the speech characteristics that contain emotional cues, leading to better feature
representation for classification.
• To train and test the model using the RAVDESS dataset – The system is designed to be trained on the
RAVDESS (Ryerson Audio-Visual Database of Emotional Speech and Song) dataset, which consists of
high-quality emotional speech samples, ensuring better generalization and robustness of the model.
• To improve real-time emotion recognition for AI-driven applications – The project aims to develop a
system that can be integrated into real-world applications such as virtual assistants, sentiment analysis,
customer service, and mental health monitoring.
• To enhance the adaptability of SER across different speech variations – The system is designed to
handle diverse speech tones, intensities, and linguistic variations, making it more reliable and effective
across different environments and users.

LITERATURE SURVEY

Speech Emotion Recognition (SER) has gained significant attention in recent years due to its potential
applications in human-computer interaction, mental health assessment, and AI-driven communication.
Traditional approaches relied on handcrafted features and conventional machine learning techniques, whereas
modern advancements integrate deep learning and attention-based models for improved accuracy. This
literature survey explores various methodologies and frameworks developed for SER, focusing on feature
extraction, classification models, and optimization techniques. The discussed research works present
advancements in dynamic temporal modelling, feature selection optimization, deep learning approaches, and
attention mechanisms to enhance the robustness of emotion detection from speech. By examining these studies,
we aim to highlight key challenges, trends, and future opportunities in the field of SER.

Lin and Busso (2021) proposed a chunk-level sequence-to-one dynamic temporal modelling framework
for speech emotion recognition. Their research highlights the importance of temporal dependencies in
emotional speech and introduces a method to divide speech signals into meaningful segments or "chunks." By
processing these segments independently and aggregating the results, the model achieves improved
performance compared to traditional approaches. The study focuses on optimizing recurrent neural networks
(RNNs) and transformer-based models to enhance feature learning from sequential data. The experimental
results demonstrated that the chunk-based approach reduces overfitting and enhances the model's ability to
recognize emotions from varying speech durations. This study establishes a foundation for sequence-aware
SER models, which play a crucial role in real-world applications such as affective computing and virtual
assistant systems.

Kanwal and Asghar (2021) explored a novel approach to improving speech emotion recognition
performance by optimizing feature selection. Their work introduces a clustering-based Genetic Algorithm (GA)
optimization framework that enhances the feature extraction process. Instead of relying on a vast number of
handcrafted features, the proposed system utilizes GA to identify the most discriminative feature subset,
reducing computational complexity while maintaining high accuracy.

SYSTEM ANALYSIS
EXISTING SYSTEM:
Traditional Speech Emotion Recognition (SER) systems rely on classical machine learning
techniques such as Support Vector Machines (SVM), Naïve Bayes, and Random Forest for
classifying emotions from speech signals. These methods heavily depend on handcrafted feature
extraction, where spectral and prosodic features like pitch, formants, energy, and speech rate are
manually selected and fed into the classifier. While these approaches have shown reasonable success,
they struggle with accurately capturing the intricate variations in speech tone, intensity, and
frequency that are crucial for distinguishing emotions. Moreover, these systems often face
difficulties in adapting to different speakers, languages, and background noise, leading to lower
recognition accuracy in real-world applications. Another major drawback of traditional SER models
is their limited ability to generalize across diverse datasets. Since handcrafted features are highly
dataset-specific, models trained on one dataset may not perform well when applied to another.
Additionally, machine learning-based systems require extensive preprocessing and fine-tuning to
achieve optimal results, making them computationally expensive and unsuitable for real-time
applications. Many existing methods lack deep learning capabilities, preventing them from
leveraging hierarchical feature extraction that can capture complex speech patterns. Furthermore, due
to limited dataset generalization and inefficient feature representation, real-time emotion detection
remains a challenge in classical approaches, making them less effective for AI-driven applications
such as virtual assistants, human-computer interaction, and sentiment analysis.
3.1.1 DISADVANTAGES
• Traditional SER models rely on handcrafted features, reducing classification accuracy.
• Machine learning-based approaches struggle with complex variations in speech tones and intensities.
• Existing systems lack deep learning capabilities, limiting their ability to capture intricate emotional
patterns.
• Many models fail in real-time applications due to poor dataset generalization.

PROPOSED SYSTEM
To address the limitations of traditional Speech Emotion Recognition (SER) models, this project
proposes a deep learning-based approach using Convolutional Neural Networks (CNNs) combined
with Mel Frequency Cepstral Coefficients (MFCC) feature extraction. The CNN model is designed
to automatically learn hierarchical speech features from spectrograms, eliminating the need for
manual feature selection. MFCC is used to extract key speech characteristics such as pitch, tone, and
frequency variations, which are essential for identifying emotions like happiness, sadness, anger,
fear, and neutrality. By leveraging deep learning, the proposed system improves classification
accuracy and enhances its adaptability to diverse speech patterns across different speakers and
languages. The RAVDESS dataset is employed to train the model, ensuring that it learns from high-
quality emotional speech recordings. Unlike traditional machine learning approaches that struggle
with dataset generalization, the proposed system utilizes data augmentation and dropout techniques
to enhance model robustness and prevent overfitting. The input speech undergoes preprocessing,
including intensity normalization, feature extraction, and conversion into a spectrogram
representation before being fed into the CNN model. The trained network optimizes weight
adjustments based on extracted features, leading to improved classification performance. This
approach makes the system highly efficient, scalable, and suitable for real-time applications in
domains such as virtual assistants, sentiment analysis, customer service, and mental health
monitoring. By integrating CNN-based classification with MFCC feature extraction, the proposed
system provides a highly accurate, automated, and real-time speech emotion recognition solution.
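
As a concrete illustration of the CNN-with-dropout design described above, the following is a minimal Keras sketch, assuming 40 averaged MFCC coefficients per clip and the eight RAVDESS emotion classes; the layer sizes, dropout rate, and optimizer are illustrative assumptions rather than the project's reported configuration.

```python
# Hedged sketch of a 1D CNN over averaged MFCC features with dropout regularization.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_model(n_mfcc=40, n_classes=8):
    model = models.Sequential([
        layers.Input(shape=(n_mfcc, 1)),            # MFCC vector treated as a 1D sequence
        layers.Conv1D(64, kernel_size=5, activation="relu", padding="same"),
        layers.MaxPooling1D(pool_size=2),
        layers.Conv1D(128, kernel_size=5, activation="relu", padding="same"),
        layers.MaxPooling1D(pool_size=2),
        layers.Dropout(0.3),                        # dropout to reduce overfitting
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.3),
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```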
3.2.1 ADVANTAGES
• Utilizes deep learning with CNN for improved accuracy in speech emotion recognition.
• Automatically extracts features using MFCC, eliminating the need for manual feature
engineering.
• Enhances real-time emotion recognition with better generalization across different speech
variations.
• Improves robustness using data augmentation and dropout techniques to prevent overfitting (a
sketch of such augmentations is given below).
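
The augmentation referred to in the last point above is commonly performed at the waveform level. The following is a minimal sketch using librosa and NumPy; the noise factor, pitch steps, and stretch rate are assumed values for illustration, not the project's actual augmentation settings.

```python
# Hedged sketch of common waveform-level augmentations for speech data.
import numpy as np
import librosa

def add_noise(signal, noise_factor=0.005):
    """Inject light Gaussian noise into the waveform."""
    return signal + noise_factor * np.random.randn(len(signal))

def pitch_shift(signal, sr, n_steps=2):
    """Shift the pitch by a few semitones without changing duration."""
    return librosa.effects.pitch_shift(y=signal, sr=sr, n_steps=n_steps)

def time_stretch(signal, rate=0.9):
    """Slow down (rate < 1) or speed up (rate > 1) the speech."""
    return librosa.effects.time_stretch(y=signal, rate=rate)
```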

MODULE DESCRIPTION:
DATASET ACQUISITION
Speech Emotion Recognition (SER) focuses on identifying the emotional aspects of spoken
language, independent of the actual meaning of the words. While humans naturally and efficiently
recognize emotions during communication, automating this process using computational methods
remains a subject of extensive research. Developing accurate and real-time emotion recognition
systems is crucial for applications such as mobile assistant interactions, call centre customer service
analysis, in-vehicle driver monitoring, aviation safety, and human-machine communication
interfaces. The integration of emotional intelligence in machines enhances their ability to engage
more naturally with users, making interactions feel more human-like and intuitive. In this module,
datasets consisting of multiple speech recordings in WAV format are collected and uploaded for
further processing. These datasets serve as the foundation for training and testing the SER model.
Open-source datasets such as RAVDESS (Ryerson Audio-Visual Database of Emotional Speech and
Song), as well as other publicly available repositories, are utilized to provide diverse emotional
expressions across different speakers. By leveraging such datasets, the system ensures a
comprehensive learning experience, enabling the model to recognize emotions across varied speech
patterns, intonations, and speaker characteristics.
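
A minimal sketch of this acquisition step is shown below, assuming the RAVDESS convention of encoding the emotion as the third hyphen-separated field of each filename (e.g. "03-01-06-01-02-01-12.wav", where "06" denotes fearful); the directory name is a placeholder.

```python
# Hedged sketch: index RAVDESS WAV files and read emotion labels from filenames.
import os

EMOTIONS = {"01": "neutral", "02": "calm", "03": "happy", "04": "sad",
            "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised"}

def load_ravdess_index(root_dir="ravdess_data"):
    samples = []
    for dirpath, _, filenames in os.walk(root_dir):
        for name in filenames:
            if name.endswith(".wav"):
                emotion_id = name.split("-")[2]          # third field: emotion ID
                samples.append((os.path.join(dirpath, name), EMOTIONS[emotion_id]))
    return samples  # list of (file_path, emotion_label) pairs
```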

PREPROCESSING:
Speech emotion recognition relies on a thorough analysis of the speech signal generation
mechanism, wherein certain acoustic features embedded within a speaker’s voice convey emotional
information. The preprocessing stage is vital as it refines and prepares the raw audio data for further
analysis, ensuring that essential features are retained while eliminating noise and redundancies. This
module involves several key steps, including converting input speech files into a uniform WAV
format, normalizing audio signals, and transforming them into a 2D format to facilitate efficient
feature extraction. Our speech emotion recognition system follows a structured approach similar to
conventional pattern recognition systems, comprising four essential stages: speech input, feature
extraction, CNN-based classification, and emotion output. The system allows users to input speech
files in any format and size, which are subsequently converted into a standardized 2D spectrogram
representation for better feature analysis.
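
The preprocessing steps described above can be sketched as follows, assuming librosa is used for loading and spectrogram conversion; the sample rate and number of mel bands are illustrative assumptions.

```python
# Hedged preprocessing sketch: uniform sample rate, peak normalization,
# and conversion to a 2D log-mel spectrogram for the CNN.
import librosa
import numpy as np

def preprocess(path, sr=22050, n_mels=128):
    signal, sr = librosa.load(path, sr=sr)               # resample to a uniform rate
    signal = signal / (np.max(np.abs(signal)) + 1e-9)    # peak normalization
    mel = librosa.feature.melspectrogram(y=signal, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max)          # 2D log-mel representation
```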

SYSTEM DESIGN AND IMPLEMENTATION

SYSTEM ARCHITECTURE:
The architecture of the Speech Emotion Recognition (SER) system consists of four key
phases: data acquisition, feature detection, emotion identification, and classification. In the first
phase, speech files are collected and processed for further analysis. The second phase involves pre-
processing and feature extraction, where important speech characteristics such as Mel-Frequency
Cepstral Coefficients (MFCCs) are extracted to represent the signal. The third phase maps the
extracted features to a labelled emotion dataset to identify patterns. Finally, in the classification
phase, a machine learning or deep learning model assigns an emotion label (such as happy, sad, or
angry) based on the extracted features. This structured approach ensures efficient and accurate
emotion recognition from speech signals.

[System architecture diagram: Phase 1, capturing the data (speech files, data acquisition); Phase 2, feature detection (pre-processing and feature extraction); Phase 3, emotion identification (mapping extracted features to the labelled dataset); Phase 4, classification (emotion name assigned as label).]
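
A minimal sketch tying the four phases together is given below; it reuses the hypothetical helper functions sketched in earlier sections (load_ravdess_index, extract_mfcc, build_model), and the training settings shown are assumptions for illustration.

```python
# Hedged end-to-end pipeline sketch covering the four architecture phases.
import numpy as np

def run_pipeline(root_dir="ravdess_data"):
    samples = load_ravdess_index(root_dir)                     # Phase 1: data acquisition
    X = np.array([extract_mfcc(path) for path, _ in samples])  # Phase 2: feature extraction
    labels = sorted({emotion for _, emotion in samples})
    y = np.array([labels.index(e) for _, e in samples])        # Phase 3: map features to labels
    model = build_model(n_mfcc=X.shape[1], n_classes=len(labels))
    model.fit(X[..., np.newaxis], y, epochs=30, validation_split=0.2)  # Phase 4: classification
    return model, labels
```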

DATA FLOW DIAGRAM:


A data flow diagram shows how data is processed within a system based on inputs and
outputs. Visual symbols are used to represent the flow of information, data sources and destinations,
and where data is stored. Data flow diagrams are often used as a first step toward redesigning a
system. They provide a graphical representation of a system at any level of detail, creating an easy-
to-understand picture of what the system does. A general overview of a system is represented with a
context diagram, also known as a level 0 DFD, which shows a system as a single process. A level 1
diagram provides greater detail, focusing on a system’s main functions. Diagrams that are level 2 or
higher illustrate a system’s functioning with increasing detail. It’s rare for a DFD to go beyond level
2 because of the increasing complexity, which makes it less effective as a communication tool.

The standard DFD symbols represent the following elements:
• An entity: a source of data or a destination for data.
• A process: a task that is performed by the system.
• A data store: a place where data is held between processes.
• A data flow: the movement of data between entities, processes, and data stores.

LEVEL 0
DFD Level 0 is also called a Context Diagram. It's a basic overview of the whole system or
process being analysed or modelled. It's designed to be an at-a-glance view, showing the system as a
single high-level process, with its relationship to external entities. It should be easily understood by a
wide audience, including stakeholders, business analysts, data analysts and developers.

[Level 0 DFD: the USER supplies a WAV file to the Speech Emotion Recognition process, which maps the signals against the trained database.]

LEVEL 1
DFD Level 1 provides a more detailed breakout of pieces of the Context Level Diagram. You
will highlight the main functions carried out by the system, as you break down the high-level process
of the Context Diagram into its sub-processes.

[Level 1 DFD: the USER inputs the speech; features are extracted and mapped against the labelled datasets; if a match is found, the recognized emotion is returned to the USER.]

USE CASE DIAGRAM:

[Use case diagram: the User inputs a speech file; the System performs preprocessing, feature extraction, and emotion classification, and displays the recognized emotion.]

DATASET DESIGN:
The RAVDESS (Ryerson Audio-Visual Database of Emotional Speech and Song) dataset is a
widely used benchmark dataset for Speech Emotion Recognition (SER). It consists of emotionally
expressive speech and song recordings performed by actors, covering a range of emotions. The
dataset is designed for research in affective computing, deep learning, and human-computer
interaction.
DATASET: KAGGLE

Emotion ID Emotion Type Description


01 Neutral Normal speech without strong emotional expression
02 Calm Soft and relaxed speech
03 Happy Cheerful and positive tone
04 Sad Melancholic or sorrowful tone
05 Angry Aggressive and loud tone
06 Fearful Anxious or scared tone
07 Disgust Speech expressing dislike or aversion
08 Surprised Speech indicating shock or amazement

Filename Emotion
[Link] Neutral
[Link] Neutral
[Link] Calm
[Link] Calm
[Link] Calm (Strong)
[Link] Happy
[Link] Happy
[Link] Happy (Strong)
[Link] Sad
[Link] Sad
[Link] Angry
[Link] Angry
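
To illustrate how the labels in the table above (including the "(Strong)" intensity marker) can be derived, the following sketch decodes a RAVDESS-style filename, assuming the standard RAVDESS convention that the third field is the emotion ID and the fourth field is the intensity ("02" = strong); the example filename is illustrative.

```python
# Hedged sketch: decode emotion and intensity from a RAVDESS-style filename.
RAVDESS_EMOTIONS = {"01": "Neutral", "02": "Calm", "03": "Happy", "04": "Sad",
                    "05": "Angry", "06": "Fearful", "07": "Disgust", "08": "Surprised"}

def label_from_filename(name):
    parts = name.replace(".wav", "").split("-")
    emotion = RAVDESS_EMOTIONS[parts[2]]   # third field: emotion ID
    if parts[3] == "02":                   # fourth field: 02 = strong intensity
        emotion += " (Strong)"
    return emotion

# label_from_filename("03-01-02-02-01-01-01.wav") -> "Calm (Strong)"
```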

CONCLUSION
The Speech Emotion Recognition (SER) system plays a crucial role in understanding human emotions
through voice analysis, enhancing various applications such as human-computer interaction, healthcare, call
centres, and affective computing. The proposed system effectively captures speech data, processes it,
extracts significant features, and classifies emotions using advanced machine learning or deep learning
techniques. By mapping extracted features to a well-structured dataset, the system ensures improved
accuracy in emotion detection, contributing to real-time applications where recognizing emotional states is
essential for enhancing user experiences. One of the key strengths of this project is its
ability to overcome the limitations of traditional manual emotion recognition by automating the process
through a structured framework. The integration of feature extraction techniques such as MFCCs, spectral
features, and deep learning classifiers ensures robustness in recognizing subtle variations in speech that
indicate different emotional states. Additionally, the system is designed to handle diverse datasets, making it
adaptable to real-world scenarios where variations in tone, pitch, and intensity affect emotion detection
accuracy. By leveraging an optimized classification model, the system enhances efficiency and reliability,
making it suitable for deployment in real-time environments. In conclusion, the SER system demonstrates a
powerful and scalable approach to emotion recognition from speech data. The advancements in deep
learning and speech processing methodologies provide a solid foundation for future improvements,
including multilingual emotion recognition, cross-cultural emotion analysis, and real-time deployment in
interactive systems. As artificial intelligence continues to evolve, integrating this technology with virtual
assistants, mental health applications, and smart communication systems will further enhance human-
machine interactions. Future work can focus on refining model accuracy, expanding datasets, and improving
computational efficiency to make SER systems even more effective in practical applications.

REFERENCES
[1] Abbaschian, Babak Joze, Daniel Sierra-Sosa, and Adel Elmaghraby. "Deep learning techniques for
speech emotion recognition, from databases to models." Sensors 21.4 (2021).
[2] Heusser, Verena, et al. "Bimodal speech emotion recognition using pre-trained language models." arXiv
preprint arXiv:1912.02610 (2019).
[3] Kanwal, Sofia, and Sohail Asghar. "Speech emotion recognition using clustering-based GA-optimized
feature set." IEEE Access 9 (2021): 125830-125842.
[4] Kumar, Puneet, Vishesh Kaushik, and Balasubramanian Raman. "Towards the Explainability of
Multimodal Speech Emotion Recognition." Proc. Interspeech 2021 (2021): 1748-1752.
[5] Lin, Wei-Cheng, and Carlos Busso. "Chunk-level speech emotion recognition: A general framework of
sequence-to-one dynamic temporal modeling." IEEE Transactions on Affective Computing (2021).
[6] Lieskovská, Eva, et al. "A review on speech emotion recognition using deep learning and attention
mechanism." Electronics 10.10 (2021): 1163.
[7] Pan, Zexu, et al. "Multi-modal attention for speech emotion recognition." arXiv preprint
arXiv:2009.04107 (2020).
[8] Seo, Minji, and Myungho Kim. "Fusing visual attention CNN and bag of visual words for cross-corpus
speech emotion recognition." Sensors 20.19 (2020): 5559
[9] Vryzas, Nikolaos, et al. "Speech emotion recognition for performance interaction." Journal of the Audio
Engineering Society 66.6 (2018): 457-467.
[10] Wani, Taiba Majid, et al. "A comprehensive review of speech emotion recognition systems." IEEE
Access 9 (2021): 47795-47814.
