2013, International Journal of Advanced Robotic Systems
The paper presents a multi-modal emotion recognition system exploiting audio and video (i.e., facial expression) information. The system first processes both sources of information individually to produce corresponding matching scores, and then combines the computed matching scores to obtain a classification decision. For the video part of the system, a novel approach to emotion recognition, relying on image-set matching, is developed. The proposed approach avoids the need for detecting and tracking specific facial landmarks throughout the given video sequence, which represents a common source of error in video-based emotion recognition systems, and therefore adds robustness to the video processing chain. The audio part of the system, on the other hand, relies on utterance-specific Gaussian Mixture Models (GMMs) adapted from a Universal Background Model (UBM) via maximum a posteriori (MAP) estimation. It improves upon the standard UBM-MAP procedure by exploiting gen...
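The UBM-MAP step described above can be sketched as mean-only MAP adaptation of a diagonal-covariance GMM (the classical Reynolds-style recipe). This is a minimal illustration, not the paper's exact procedure; all function and parameter names (e.g. the relevance factor `r`) are assumptions.

```python
import numpy as np

def map_adapt_means(ubm_means, ubm_weights, ubm_covs, frames, r=16.0):
    """Mean-only MAP adaptation of a diagonal-covariance GMM (UBM).

    ubm_means: (K, D) mixture means; ubm_weights: (K,); ubm_covs: (K, D) variances
    frames:    (T, D) feature vectors of one utterance
    r:         relevance factor controlling how far means move toward the data
    """
    # Posterior responsibility of each mixture for each frame (log-domain for stability)
    diff = frames[:, None, :] - ubm_means[None, :, :]                      # (T, K, D)
    log_gauss = -0.5 * np.sum(diff**2 / ubm_covs + np.log(2*np.pi*ubm_covs), axis=2)
    log_post = np.log(ubm_weights) + log_gauss                             # (T, K)
    log_post -= np.logaddexp.reduce(log_post, axis=1, keepdims=True)
    post = np.exp(log_post)
    # Zeroth- and first-order Baum-Welch statistics
    n_k = post.sum(axis=0)                                                 # (K,)
    e_k = (post.T @ frames) / np.maximum(n_k[:, None], 1e-10)              # (K, D)
    # Interpolate UBM means toward the utterance statistics
    alpha = (n_k / (n_k + r))[:, None]
    return alpha * e_k + (1 - alpha) * ubm_means
```

Stacking the adapted means of one utterance yields a fixed-length supervector that can be scored against per-emotion models.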
Bimodal emotion recognition through audiovisual feature fusion has been shown to be superior to each individual modality in the past. Still, synchronization of the two streams is a challenge, as many vision approaches work on a frame basis, as opposed to the audio turn- or chunk-basis. Therefore, late fusion schemes such as simple logic or voting strategies are commonly used for the overall estimation of the underlying affect. However, early fusion is known to be more effective in many other multimodal recognition tasks. We therefore suggest a combined analysis by descriptive statistics of audio and video Low-Level-Descriptors for subsequent static SVM classification. This strategy also allows for a combined feature-space optimization, which is discussed herein. The high effectiveness of this approach is shown on a database of 11.5 h containing six emotional situations in an airplane scenario.
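The core of the early-fusion strategy above is projecting a variable-length stream of frame-level Low-Level-Descriptors to a fixed-length "functional" vector via descriptive statistics, so a static classifier (e.g. an SVM) can be applied. A minimal sketch, with the particular set of statistics chosen for illustration only:

```python
import numpy as np

def lld_functionals(lld):
    """Map frame-level Low-Level-Descriptors (T frames x D descriptors) to a
    fixed-length vector by applying descriptive statistics over time.
    The output length (6*D) is independent of the number of frames T."""
    stats = [
        lld.mean(axis=0), lld.std(axis=0),
        lld.min(axis=0),  lld.max(axis=0),
        np.percentile(lld, 25, axis=0), np.percentile(lld, 75, axis=0),
    ]
    return np.concatenate(stats)
```

For early fusion, audio and video LLDs of one chunk would be concatenated column-wise before the statistics are taken, yielding one joint feature space that can then be optimized as a whole.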
Pattern Recognition (ICPR), …, 2010
Information about the psycho-physical state of the subject is becoming a valuable addition to modern audio or video recognition systems. As well as enabling a better user experience, it can also assist in superior recognition accuracy of the base system. In this article, we present our approach to a multi-modal (audio-video) emotion recognition system. For the audio sub-system, a feature set comprising prosodic, spectral and cepstral features is selected, and a support vector classifier is used to produce the scores for each emotional category. For the video sub-system, a novel approach is presented which does not rely on the tracking of specific facial landmarks and thus eliminates the problems usually caused when the tracking algorithm fails to detect the correct area. The system is evaluated on the eNTERFACE database, and the recognition accuracy of our audio-video fusion is compared to published results in the literature.
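The audio-video fusion described above operates at the score level. A minimal sketch of such late fusion, assuming a simple convex weighted sum of softmax-normalised per-class scores (the weighting scheme is an assumption, not the paper's exact rule):

```python
import numpy as np

def fuse_scores(audio_scores, video_scores, w_audio=0.5):
    """Late (score-level) fusion: combine per-class scores from the audio and
    video subsystems by a convex weighted sum, then pick the best emotion.
    Scores are softmax-normalised first so both modalities share a scale."""
    def softmax(s):
        e = np.exp(s - s.max())
        return e / e.sum()
    fused = w_audio * softmax(np.asarray(audio_scores, float)) \
          + (1 - w_audio) * softmax(np.asarray(video_scores, float))
    return int(np.argmax(fused)), fused
```

The weight `w_audio` would typically be tuned on a held-out set, since the relative reliability of the two subsystems varies per emotion.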
3rd International Conference on Computer Vision Theory and Applications. VISAPP, 2008
Abstract: Bimodal emotion recognition through audiovisual feature fusion has been shown superior over each individual modality in the past. Still, synchronization of the two streams is a challenge, as many vision approaches work on a frame basis opposing audio turn-or chunk-basis. Therefore, late fusion schemes such as simple logic or voting strategies are commonly used for the overall estimation of underlying affect. However, early fusion is known to be more effective in many other multimodal recognition tasks. We therefore ...
International Journal of Emerging Trends in Engineering Research, 2024
Emotion recognition is a pivotal area of research with applications spanning education, healthcare, and intelligent customer service. Multimodal emotion recognition (MER) has emerged as a superior approach by integrating multiple modalities such as speech, text, and facial expressions, offering enhanced accuracy and robustness over unimodal methods. This paper reviews the evolution and current state of MER, highlighting its significance, challenges, and methodologies. We delve into various datasets, including IEMOCAP and MELD, providing a comparative analysis of their strengths and limitations. The literature review covers recent advancements in deep learning techniques, focusing on fusion strategies like early, late, and hybrid fusion. Identified gaps include issues related to data redundancy, feature extraction complexity, and real-time detection. Our proposed methodology leverages deep learning for feature extraction and a hybrid fusion approach to improve emotion detection accuracy. This research aims to guide future studies in addressing current limitations and advancing the field of MER. The main aim of this paper is to review recent methodologies in multimodal emotion recognition, analyze different data fusion techniques, and identify challenges and research gaps.
Multimedia Tools and Applications, 2010
Multimedia content is composed of several streams that carry information in audio, video or textual channels. Classifying and clustering multimedia content requires the extraction and combination of information from these streams. The streams constituting multimedia content are naturally different in terms of scale, dynamics and temporal patterns. These differences make combining the information sources with classic combination techniques difficult. We propose an asynchronous feature-level fusion approach that creates a unified hybrid feature space out of the individual signal measurements. The target space can be used for clustering or classification of the multimedia content. As a representative application, we used the proposed approach to recognize basic affective states from speech prosody and facial expressions. Experimental results over two audiovisual emotion databases with 42 and 12 subjects revealed that the performance of the proposed system is significantly higher than that of the unimodal face-based and speech-based systems, as well as synchronous feature-level and decision-level fusion approaches.
2007
Automatic multimodal recognition of spontaneous emotional expressions is a largely unexplored and challenging problem. In this paper, we explore audio-visual emotion recognition in a realistic human conversation setting, the Adult Attachment Interview (AAI). Based on the assumption that facial expression and vocal expression reflect the same coarse affective states, positive and negative emotion sequences are labeled according to the Facial Action Coding System. Facial texture in the visual channel and prosody in the audio channel are integrated in the framework of an Adaboost multi-stream hidden Markov model (AdaMHMM), in which the Adaboost learning scheme is used to build the component HMM fusion. Our approach is evaluated in AAI spontaneous emotion recognition experiments.
International Journal of Interactive Multimedia and Artificial Intelligence, 2011
Emotions play a crucial role in person-to-person interaction. In recent years, there has been a growing interest in improving all aspects of interaction between humans and computers. The ability to understand human emotions is desirable for the computer in several applications, especially by observing facial expressions. This paper explores ways of human-computer interaction that enable the computer to be more aware of the user's emotional expressions. We present an approach for emotion recognition from facial expression, hand and body posture. Our model uses a multimodal emotion recognition system in which two different models are used, one for facial expression recognition and one for hand and body posture recognition, and the results of both classifiers are then combined by a third classifier which gives the resulting emotion. The multimodal system gives more accurate results than a single-modality or bimodal system.
2021
Emotions play a vital role in the mental and physical activities of human lives. One of the biggest challenges in Human-Computer Interaction is emotion recognition. With the resurgence of the fields of Artificial Intelligence and Machine Learning, a considerable number of studies have been carried out to address the challenge of emotion recognition. The individual heterogeneity of expressing emotions is a key problem that needs to be addressed in accurately detecting the emotional state of an individual. The purpose of this work is to propose a novel ensemble method to predict emotions using a multimodal approach. The presented multimodal approach, with the modalities of facial expressions, voice variations, and speech and social media content, is used to identify seven emotional states: anger, fear, disgust, happiness, sadness, surprise and neutral emotion. In this study, for the facial expression-based emotion recognition and voice variation-based emotion recognition, Dee...
arXiv (Cornell University), 2020
Emotion recognition has a pivotal role in affective computing and in human-computer interaction. Current technological developments lead to increased possibilities of collecting data about the emotional state of a person. In general, human perception of the emotion transmitted by a subject is based on vocal and visual information collected in the first seconds of interaction with the subject. As a consequence, the integration of verbal (i.e., speech) and non-verbal (i.e., image) information seems to be the preferred choice in most current approaches to emotion recognition. In this paper, we propose a multimodal fusion technique for emotion recognition based on combining audiovisual modalities from a temporal window with different temporal offsets for each modality. We show that our proposed method outperforms other methods from the literature and human accuracy ratings. The experiments are conducted over the open-access multimodal dataset CREMA-D.
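The per-modality temporal offsets mentioned above can be illustrated with a small helper that, for a given decision time, returns the (shifted) time span each modality contributes. The window length and offset values below are placeholders, not the paper's tuned settings:

```python
def modality_windows(t, win=2.0, offsets=None):
    """For a decision time t (seconds), return the [start, end) span each
    modality contributes, shifted by a per-modality temporal offset.
    Default offsets are illustrative: video lagging audio by 0.5 s."""
    offsets = offsets or {"audio": 0.0, "video": 0.5}
    return {m: (t - win + off, t + off) for m, off in offsets.items()}
```

Features extracted from these per-modality spans would then be fused as in the paper's audiovisual pipeline; the point of the offsets is that the most informative audio and visual evidence for one emotion label need not be simultaneous.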
This paper proposes a bimodal system for emotion recognition that uses face and speech analysis. Hidden Markov models (HMMs) are used to learn and describe the temporal dynamics of the emotion clues in the visual and acoustic channels. This approach provides a powerful method for fusing the data extracted from the separate modalities. The paper presents the best performing models and the results of the proposed recognition system.
A Pattern Analysis Approach, 2015
The paper describes a novel technique for the recognition of emotions from multimodal data. We focus on the recognition of the six prototypic emotions. The results from facial expression recognition and from emotion recognition from speech are combined using a bi-modal semantic data fusion model that determines the most probable emotion of the subject. Two types of models based on geometric face features are used for facial expression recognition, depending on the presence or absence of speech. In our approach we define an algorithm that is robust to the changes of face shape that occur during regular speech. The influence of phoneme generation on the face shape during speech is removed by using features related only to the eyes and the eyebrows. The paper includes results from testing the presented models.
Affect and emotion in human- …, 2008
IEEE Transactions on Affective Computing, 2017
This paper presents a multimodal emotion recognition system, which is based on the analysis of audio and visual cues. From the audio channel, Mel-Frequency Cepstral Coefficients, Filter Bank Energies and prosodic features are extracted. For the visual part, two strategies are considered. First, facial landmarks' geometric relations, i.e. distances and angles, are computed. Second, we summarize each emotional video into a reduced set of key-frames that are used to visually discriminate between the emotions. To do so, a convolutional neural network is applied to the key-frames summarizing the videos. Finally, the confidence outputs of all the classifiers from all the modalities are used to define a new feature space to be learned for final emotion label prediction, in a late fusion/stacking fashion. The experiments conducted on the SAVEE, eNTERFACE'05, and RML databases show significant performance improvements by our proposed system in comparison to current alternatives, defining the current state-of-the-art in all three databases.
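The stacking step above builds a new feature space from the base classifiers' confidence outputs and trains a meta-classifier on it. A minimal sketch, using a regularised least-squares linear model on one-hot targets as a simple stand-in for the meta-learner (the paper does not prescribe this particular meta-classifier):

```python
import numpy as np

def stack_features(prob_outputs):
    """Concatenate per-class confidence outputs of several base classifiers
    (one (N, C) array per modality/classifier) into a meta-feature space."""
    return np.hstack(prob_outputs)

def fit_linear_meta(X, y, n_classes, reg=1e-3):
    """Regularised least-squares linear meta-classifier on one-hot targets."""
    Y = np.eye(n_classes)[y]
    Xb = np.hstack([X, np.ones((len(X), 1))])          # append bias column
    return np.linalg.solve(Xb.T @ Xb + reg * np.eye(Xb.shape[1]), Xb.T @ Y)

def predict_meta(W, X):
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return (Xb @ W).argmax(axis=1)
```

In the paper's setting, the base outputs would come from the audio classifiers, the geometric-feature classifier, and the key-frame CNN; the meta-model then learns which modality to trust for which emotion.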
Proceedings Third IEEE International Conference on Automatic Face and Gesture Recognition, 1998
Recognizing human facial expression and emotion by computer is an interesting and challenging problem. Many have investigated emotional contents in speech alone, or recognition of human facial expressions solely from images. However, relatively little has been done in combining these two modalities for recognizing human emotions. De Silva et al. [4] studied human subjects' ability to recognize emotions from viewing video clips of facial expressions and listening to the corresponding emotional speech stimuli. They found that humans recognize some emotions better by audio information, and other emotions better by video. They also proposed an algorithm to integrate both kinds of inputs to mimic the human recognition process. While attempting to implement the algorithm, we encountered difficulties which led us to a different approach. We found these two modalities to be complementary. By using both, we show it is possible to achieve higher recognition rates than either modality alone.
Cornell University - arXiv, 2022
This paper proposes a multimodal emotion recognition system based on hybrid fusion that classifies the emotions depicted by speech utterances and corresponding images into discrete classes. A new interpretability technique has been developed to identify the important speech & image features leading to the prediction of particular emotion classes. The proposed system's architecture has been determined through intensive ablation studies. It fuses the speech & image features and then combines speech, image, and intermediate fusion outputs. The proposed interpretability technique incorporates the divide & conquer approach to compute Shapley values denoting each speech & image feature's importance. We have also constructed a large-scale dataset (IIT-R SIER dataset), consisting of speech utterances, corresponding images, and class labels, i.e., 'anger,' 'happy,' 'hate,' and 'sad.' The proposed system has achieved 83.29% accuracy for emotion recognition. The enhanced performance of the proposed system advocates the importance of utilizing complementary information from multiple modalities for emotion recognition.
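For context, the exact Shapley value of a feature averages its marginal contribution over all subsets of the remaining features. The brute-force computation below is feasible only for small feature sets, which is precisely why the paper resorts to a divide & conquer approximation for larger ones; the code is a textbook illustration, not the paper's algorithm:

```python
import itertools, math

def shapley_values(features, value_fn):
    """Exact Shapley value of each feature for a set-function value_fn.
    Cost is exponential in len(features): illustration only."""
    n = len(features)
    phi = {f: 0.0 for f in features}
    for f in features:
        others = [g for g in features if g != f]
        for k in range(n):
            for S in itertools.combinations(others, k):
                # Classical Shapley weight |S|! (n-|S|-1)! / n!
                w = math.factorial(k) * math.factorial(n - k - 1) / math.factorial(n)
                phi[f] += w * (value_fn(frozenset(S) | {f}) - value_fn(frozenset(S)))
    return phi
```

Here `value_fn(S)` would be the model's performance (or prediction score) when only the features in `S` are active; for an additive value function the Shapley value recovers each feature's individual contribution exactly.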
Iranian Journal of Science and Technology, Transactions of Electrical Engineering, 2018
This paper introduces a multimodal emotion recognition system based on two different modalities, i.e., affective speech and facial expression. For affective speech, the common low-level descriptors including prosodic and spectral audio features (i.e., energy, zero crossing rate, MFCC, LPC, PLP and temporal derivatives) are extracted, whereas a novel visual feature extraction method is proposed in the case of facial expression. This method exploits the displacement of specific landmarks across consecutive frames of an utterance for feature extraction. To this end, the time series of temporal variations for each landmark is analyzed individually to extract primary visual features, and then the extracted features of all landmarks are concatenated to construct the final feature vector. The analysis of the displacement signal of the landmarks is performed by the discrete wavelet transform, a widely used mathematical transform in signal processing applications. In order to reduce the complexity of the derived models and improve efficiency, a variety of dimensionality-reduction schemes are applied. Furthermore, to exploit the advantages of multimodal emotion recognition systems, the feature-level fusion of the audio and the proposed visual features is examined. Results of experiments conducted on the SAVEE, RML and eNTERFACE05 databases show the efficiency of the proposed visual feature extraction method in terms of performance criteria.
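The per-landmark analysis above applies a discrete wavelet transform to each displacement time series. A minimal sketch using the Haar wavelet (the paper does not specify which wavelet; Haar is chosen here purely for illustration):

```python
import numpy as np

def haar_dwt(signal, levels=2):
    """Multi-level 1-D Haar wavelet decomposition of a landmark-displacement
    time series. Returns [approximation, detail_L, ..., detail_1]; these
    coefficients can serve as the primary visual features per landmark."""
    a = np.asarray(signal, float)
    coeffs = []
    for _ in range(levels):
        if len(a) % 2:                       # pad to even length
            a = np.append(a, a[-1])
        detail = (a[0::2] - a[1::2]) / np.sqrt(2)
        a      = (a[0::2] + a[1::2]) / np.sqrt(2)   # next-level approximation
        coeffs.append(detail)
    coeffs.append(a)
    return coeffs[::-1]
```

Concatenating such coefficient vectors over all tracked landmarks yields the final visual feature vector, which is then reduced in dimensionality and fused with the audio features.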
2010
Recent advances in human-computer interaction technology go beyond the successful transfer of data between human and machine by seeking to improve the naturalness and friendliness of user interactions. An important augmentation, and potential source of feedback, comes from recognizing the user's expressed emotion or affect. This chapter presents an overview of research efforts to classify emotion using different modalities: audio, visual and audiovisual combined. Theories of emotion provide a framework for defining emotional categories or classes. The first step, then, in the study of human affect recognition involves the construction of suitable databases. The authors describe fifteen audio, visual and audiovisual data sets, and the types of feature that researchers have used to represent the emotional content. They discuss data-driven methods of feature selection and reduction, which discard noise and irrelevant information to maximize the concentration of useful information. They focus on the popular types of classifier that are used to decide to which emotion class a given example belongs, and methods of fusing information from multiple modalities. Finally, the authors point to some interesting areas for future investigation in this field, and conclude. Recent work in the field of emotion recognition involves combining the audio and visual modalities to improve the performance of emotion recognition systems. This has resulted in the recording of audio
… and Innovations 2007: …, 2007
IEEE Transactions on Multimedia, 2013
We present a novel facial expression recognition framework using audio-visual information analysis. We propose to model the cross-modality data correlation while allowing the modalities to be treated as asynchronous streams. We also show that our framework can improve recognition performance while significantly reducing the computational cost, by incorporating auditory information to avoid redundant or insignificant frame processing. In particular, we design a single representative image of an image sequence by weighted sums of registered face images, where the weights are derived using auditory features. We use a still-image-based technique for the expression recognition task. Our framework, however, can be generalized to work with dynamic features as well. We performed experiments using the eNTERFACE'05 audio-visual emotional database containing six archetypal emotion classes: Happy, Sad, Surprise, Fear, Anger and Disgust. We present one-to-one binary classification as well as multi-class classification performances, evaluated using both subject-dependent and subject-independent strategies. Furthermore, we compare multi-class classification accuracies with those of previously published literature using the same database. Our analyses show promising results.
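The single-image construction above is a weighted sum over a registered face sequence. A minimal sketch, assuming the audio-derived weights are simply normalised to sum to one (how the paper maps auditory features to weights is not reproduced here):

```python
import numpy as np

def weighted_face_image(frames, audio_weights):
    """Collapse a registered face-image sequence (T, H, W) into one
    representative image by a weighted sum over time, with per-frame
    weights derived from auditory features (normalised here)."""
    w = np.asarray(audio_weights, float)
    w = w / w.sum()
    return np.tensordot(w, np.asarray(frames, float), axes=1)   # (H, W)
```

Frames with low auditory salience get small weights, so they contribute little to the representative image, which is what lets the framework skip redundant frame processing.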