2013, IEEE Transactions on Multimedia
…
We present a novel facial expression recognition framework using audio-visual information analysis. We propose to model the cross-modality data correlation while allowing the two modalities to be treated as asynchronous streams. We also show that our framework can improve recognition performance while significantly reducing computational cost: by incorporating auditory information, it avoids processing redundant or insignificant frames. In particular, we construct a single representative image for each image sequence as a weighted sum of registered face images, where the weights are derived from auditory features. We use a still-image-based technique for the expression recognition task; the framework, however, can be generalized to work with dynamic features as well. We performed experiments on the eNTERFACE'05 audio-visual emotion database containing six archetypal emotion classes: Happy, Sad, Surprise, Fear, Anger and Disgust. We report one-to-one binary classification as well as multi-class classification performance evaluated using both subject-dependent and subject-independent strategies. Furthermore, we compare multi-class classification accuracies with those of previously published work using the same database. Our analyses show promising results.
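A minimal sketch of the weighted-sum representation described above, assuming the registered face frames are already available as equally sized grayscale arrays and that a per-frame audio score (e.g. short-time energy) stands in for the paper's auditory weights, which are derived from richer features.

import numpy as np

def representative_image(frames, audio_weights):
    """Collapse a registered face-image sequence into one representative image.

    frames        : list of HxW float arrays (registered face images)
    audio_weights : per-frame scores derived from auditory features
                    (assumed non-negative; short-time energy is one option)
    """
    w = np.asarray(audio_weights, dtype=float)
    w = w / (w.sum() + 1e-12)               # normalize weights to sum to 1
    stack = np.stack(frames).astype(float)  # shape (T, H, W)
    return np.tensordot(w, stack, axes=1)   # weighted sum over time

# usage: rep = representative_image(frames, frame_energies)
# 'rep' can then be fed to any still-image expression classifier.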
Pattern Recognition (ICPR), …, 2010
Information about the psycho-physical state of the subject is becoming a valuable addition to modern audio or video recognition systems. As well as enabling a better user experience, it can also improve the recognition accuracy of the base system. In this article, we present our approach to a multi-modal (audio-video) emotion recognition system. For the audio sub-system, a feature set comprising prosodic, spectral and cepstral features is selected, and a support vector classifier is used to produce scores for each emotional category. For the video sub-system, a novel approach is presented that does not rely on tracking specific facial landmarks and thus eliminates the problems that arise when the tracking algorithm fails to detect the correct area. The system is evaluated on the eNTERFACE database, and the recognition accuracy of our audio-video fusion is compared to published results in the literature.
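A minimal sketch of an audio sub-system of this kind, assuming librosa for feature extraction and scikit-learn's SVC for scoring; only MFCC statistics are used here as a stand-in for the paper's full prosodic, spectral and cepstral feature set.

import numpy as np
import librosa
from sklearn.svm import SVC

def cepstral_features(path, sr=16000, n_mfcc=13):
    """Summarize an utterance by the mean and std of its MFCCs (cepstral stand-in)."""
    y, sr = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # shape (n_mfcc, T)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

# X_train: stacked feature vectors, y_train: emotion labels
clf = SVC(kernel="rbf", probability=True)   # probability=True yields per-class scores
# clf.fit(X_train, y_train)
# scores = clf.predict_proba(X_test)        # one score per emotional category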
Bimodal emotion recognition through audiovisual feature fusion has been shown to be superior to each individual modality in the past. Still, synchronizing the two streams is a challenge, as many vision approaches work on a frame basis while audio is processed on a turn or chunk basis. Therefore, late fusion schemes such as simple logic or voting strategies are commonly used for the overall estimation of the underlying affect. However, early fusion is known to be more effective in many other multimodal recognition tasks. We therefore suggest a combined analysis by descriptive statistics of audio and video low-level descriptors for subsequent static SVM classification. This strategy also allows for a combined feature-space optimization, which is discussed herein. The high effectiveness of this approach is shown on a database of 11.5 h containing six emotional situations in an airplane scenario.
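A minimal sketch of this kind of early fusion, assuming per-chunk matrices of audio and video low-level descriptors have already been extracted; descriptive statistics (functionals) bring both streams to one fixed-length vector per chunk so a single static SVM can be trained on the fused space.

import numpy as np
from sklearn.svm import SVC

def functionals(lld):
    """Map an (n_frames, n_descriptors) LLD matrix to fixed-length statistics."""
    return np.concatenate([lld.mean(0), lld.std(0), lld.min(0), lld.max(0)])

def early_fusion_vector(audio_lld, video_lld):
    """Concatenate audio and video functionals into one feature vector per chunk."""
    return np.concatenate([functionals(audio_lld), functionals(video_lld)])

# X = np.stack([early_fusion_vector(a, v) for a, v in chunks])
# SVC(kernel="linear").fit(X, y)   # static SVM on the fused feature space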
Designing systems able to interact with students in a natural manner is a complex and far from solved problem. A key aspect of natural interaction is the ability to understand and appropriately respond to human emotions. This paper details our response to the continuous Audio/Visual Emotion Challenge (AVEC'12), whose goal is to predict four affective signals describing human emotions. The proposed method uses Fourier spectra to extract multi-scale dynamic descriptions of signals characterizing face appearance, head movements and voice. We perform kernel regression with very few representative samples selected via supervised weighted-distance-based clustering, which leads to high generalization power. We also propose a particularly fast regressor-level fusion framework to merge systems based on different modalities. Experiments demonstrate the contribution of each key component of the proposed method, and our results on the challenge data were the highest among 10 international research teams.
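A minimal sketch of the two ideas named above (exemplar-based kernel regression and regressor-level fusion), using a plain Nadaraya-Watson estimator over a small set of hand-picked exemplars; the supervised clustering used to select exemplars and the challenge features themselves are not reproduced here.

import numpy as np

def kernel_regression(x, exemplars_X, exemplars_y, bandwidth=1.0):
    """Nadaraya-Watson prediction from a small set of representative samples."""
    d2 = ((exemplars_X - x) ** 2).sum(axis=1)          # squared distances to exemplars
    w = np.exp(-d2 / (2.0 * bandwidth ** 2))           # Gaussian kernel weights
    return (w @ exemplars_y) / (w.sum() + 1e-12)       # weighted average of targets

def fuse_regressors(predictions, weights):
    """Regressor-level fusion: weighted average of per-modality predictions."""
    w = np.asarray(weights, dtype=float)
    return (w @ np.asarray(predictions)) / w.sum()

# y_hat = fuse_regressors([pred_face, pred_head, pred_voice], [0.5, 0.2, 0.3])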
2013
The paper presents a multi-modal emotion recognition system exploiting audio and video (i.e., facial expression) information. The system first processes both sources of information individually to produce corresponding matching scores and then combines the computed matching scores to obtain a classification decision. For the video part of the system, a novel approach to emotion recognition, relying on image-set matching, is developed.
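A minimal sketch of score-level combination of this kind, assuming each modality already outputs one matching score per emotion class; min-max normalization and a weighted sum are used here as a generic example, not the paper's specific combination rule.

import numpy as np

def normalize(scores):
    """Min-max normalize a vector of per-class matching scores."""
    s = np.asarray(scores, dtype=float)
    rng = s.max() - s.min()
    return (s - s.min()) / rng if rng > 0 else np.zeros_like(s)

def fuse_scores(audio_scores, video_scores, alpha=0.5):
    """Weighted sum of normalized audio and video scores; return the winning class index."""
    fused = alpha * normalize(audio_scores) + (1 - alpha) * normalize(video_scores)
    return int(np.argmax(fused))

# label_index = fuse_scores(audio_scores, video_scores, alpha=0.4)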
A Pattern Analysis Approach, 2015
The paper describes a novel technique for the recognition of emotions from multimodal data. We focus on the recognition of the six prototypic emotions. The results from facial expression recognition and from emotion recognition from speech are combined using a bi-modal semantic data fusion model that determines the most probable emotion of the subject. Two types of models based on geometric face features are used for facial expression recognition, depending on the presence or absence of speech. In our approach we define an algorithm that is robust to the changes of face shape that occur during regular speech. The influence of phoneme generation on face shape during speech is removed by using features related only to the eyes and the eyebrows. The paper includes results from testing the presented models.
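A minimal sketch of speech-robust geometric features of the kind described above, assuming a 68-point facial landmark array in the common iBUG ordering (eyebrows 17-26, eyes 36-47); both the landmark indexing and the specific distances are illustrative assumptions, not the paper's feature set.

import numpy as np

# iBUG 68-point convention (assumed): eyebrows 17-26, eyes 36-47
LEFT_BROW, RIGHT_BROW = range(17, 22), range(22, 27)
LEFT_EYE, RIGHT_EYE = range(36, 42), range(42, 48)

def eye_brow_features(landmarks):
    """Distances between each eyebrow point and its eye center, scale-normalized.

    landmarks : (68, 2) array of (x, y) points for one face
    """
    pts = np.asarray(landmarks, dtype=float)
    left_eye_c = pts[list(LEFT_EYE)].mean(0)
    right_eye_c = pts[list(RIGHT_EYE)].mean(0)
    iod = np.linalg.norm(left_eye_c - right_eye_c) + 1e-12   # inter-ocular distance
    feats = []
    for brow, eye_c in [(LEFT_BROW, left_eye_c), (RIGHT_BROW, right_eye_c)]:
        for i in brow:
            feats.append(np.linalg.norm(pts[i] - eye_c) / iod)  # brow-to-eye distance
    return np.asarray(feats)   # speech-robust features (mouth region excluded)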
Multimedia Tools and Applications, 2014
Access to audiovisual databases that contain enough variety and are richly annotated is essential for assessing the performance of algorithms in affective computing applications, which require emotion recognition from face and/or speech data. Most databases available today have been recorded under tightly controlled environments, are mostly acted and do not contain speech data. We first present a semi-automatic method that can extract audiovisual facial video clips from movies and TV programs in any language. The method is based on automatic detection and tracking of faces in a movie until the face is occluded or a scene cut occurs. We also created a video-based database, named BAUM-2, which consists of annotated audiovisual facial clips in several languages. The collected clips simulate real-world conditions by containing various head poses, illumination conditions, accessories, temporary occlusions and subjects with a wide range of ages. The proposed semi-automatic affective clip extraction method can easily be used to extend the database to contain clips in other languages. We also created an image-based facial expression database from the peak frames of the video clips, named BAUM-2i. Baseline image and … (This work was supported by the Turkish Scientific and Technical Research Council (TUBITAK) under project 110E056.)
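A minimal sketch of the detect-until-lost clip boundary logic described above, assuming OpenCV's stock Haar face detector and a simple frame-difference heuristic as a stand-in for proper scene-cut detection; the paper's actual tracker and cut detector are not reproduced.

import cv2

face_det = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def clip_length(video_path, cut_threshold=40.0):
    """Count frames from the start until the face is lost/occluded or a scene cut occurs."""
    cap = cv2.VideoCapture(video_path)
    prev_gray, n = None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = face_det.detectMultiScale(gray, 1.1, 5)
        cut = prev_gray is not None and cv2.absdiff(gray, prev_gray).mean() > cut_threshold
        if len(faces) == 0 or cut:       # face lost/occluded or scene cut -> clip ends
            break
        prev_gray, n = gray, n + 1
    cap.release()
    return n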
IEEE Transactions on Affective Computing, 2017
This paper presents a multimodal emotion recognition system based on the analysis of audio and visual cues. From the audio channel, Mel-Frequency Cepstral Coefficients, Filter Bank Energies and prosodic features are extracted. For the visual part, two strategies are considered. First, geometric relations between facial landmarks, i.e. distances and angles, are computed. Second, each emotional video is summarized into a reduced set of key-frames selected to visually discriminate between the emotions; a convolutional neural network is then applied to these key-frames. Finally, the confidence outputs of all the classifiers from all the modalities are used to define a new feature space on which the final emotion label prediction is learned, in a late fusion/stacking fashion. The experiments conducted on the SAVEE, eNTERFACE'05, and RML databases show significant performance improvements by our proposed system in comparison to current alternatives, defining the current state of the art on all three databases.
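A minimal sketch of confidence-level stacking of this kind, using scikit-learn classifiers as stand-ins for the paper's audio and visual models; the point is only that per-modality class confidences are concatenated into a new feature space on which a meta-classifier is trained.

import numpy as np
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

# base classifiers, one per modality (hypothetical stand-ins)
audio_clf = SVC(probability=True)
video_clf = SVC(probability=True)
meta_clf = LogisticRegression(max_iter=1000)

def stacked_features(Xa, Xv):
    """Concatenate per-class confidence outputs from both modalities."""
    return np.hstack([audio_clf.predict_proba(Xa), video_clf.predict_proba(Xv)])

# training (ideally with held-out folds for the meta level):
# audio_clf.fit(Xa_train, y_train); video_clf.fit(Xv_train, y_train)
# meta_clf.fit(stacked_features(Xa_val, Xv_val), y_val)
# prediction:
# y_pred = meta_clf.predict(stacked_features(Xa_test, Xv_test))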
Proceedings Third IEEE International Conference on Automatic Face and Gesture Recognition, 1998
Recognizing human facial expression and emotion by computer is an interesting and challenging problem. Many have investigated emotional content in speech alone, or recognition of human facial expressions solely from images. However, relatively little has been done in combining these two modalities for recognizing human emotions. De Silva et al. [4] studied human subjects' ability to recognize emotions from viewing video clips of facial expressions and listening to the corresponding emotional speech stimuli. They found that humans recognize some emotions better from audio information, and other emotions better from video. They also proposed an algorithm to integrate both kinds of inputs to mimic the human recognition process. While attempting to implement the algorithm, we encountered difficulties which led us to a different approach. We found these two modalities to be complementary. By using both, we show it is possible to achieve higher recognition rates than with either modality alone.
IEEE Transactions on Multimedia, 2007
The ability of a computer to detect and appropriately respond to changes in a user's affective state has significant implications for Human-Computer Interaction (HCI). In this paper, we present our efforts toward audio-visual affect recognition on 11 affective states customized for HCI applications (four cognitive/motivational and seven basic affective states) of 20 non-actor subjects. A smoothing method is proposed to reduce the detrimental influence of speech on facial expression recognition. The feature selection analysis shows that subjects tend to use brow movement in the face, and pitch and energy in prosody, to express their affect while speaking. For person-dependent recognition, we apply a voting method to combine the frame-based classification results from both the audio and visual channels; the result shows a 7.5% improvement over the best unimodal performance. For the person-independent test, we apply a multistream HMM to combine the information from the multiple component streams; this test shows a 6.1% improvement over the best component performance.
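A minimal sketch of frame-level voting across the two channels, assuming per-frame label predictions are already available from the audio and visual classifiers; the paper's smoothing step and multistream HMM are not reproduced here.

from collections import Counter

def vote(audio_frame_labels, video_frame_labels):
    """Majority vote over all frame-level decisions from both channels."""
    votes = Counter(audio_frame_labels) + Counter(video_frame_labels)
    return votes.most_common(1)[0][0]     # most frequent label wins

# vote(["happy", "happy", "sad"], ["happy", "neutral"])  ->  "happy"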
2009 Workshop on Applications of Computer Vision (WACV), 2009
We present a bimodal information analysis system for automatic emotion recognition. Our approach is based on the analysis of video sequences and combines facial expressions observed visually with acoustic features to automatically recognize five universal emotion classes: Anger, Disgust, Happiness, Sadness and Surprise. We address the challenges posed during the temporal analysis of the bimodal data and introduce a novel technique for combining the best features of instantaneous and temporal visual recognition systems. We obtain robust appearance-based visual features, which we classify instantaneously and aggregate temporally to improve recognition rates compared to single-frame instantaneous classification. The performance of the system is further boosted by using the complementary audio information for bimodal emotion recognition. We combine the two modalities at both feature and score level to compare the respective joint emotion recognition rates. The emotions are instantaneously classified using a Support Vector Machine and sequentially aggregated based on their classification probabilities. This approach is validated on a posed audio-visual database and a natural interactive database. The experiments performed on these databases provide encouraging results, with the best combined recognition rate being 82%.
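A minimal sketch of the temporal aggregation step described above, assuming each frame has already been scored by a probabilistic SVM; per-frame class probabilities are accumulated over the sequence and the highest cumulative probability decides the label. The feature-level and score-level audio fusion is omitted.

import numpy as np

def aggregate_sequence(frame_probs, class_names):
    """Aggregate per-frame class probabilities into one sequence-level decision.

    frame_probs : (n_frames, n_classes) array, e.g. from SVC(probability=True)
    """
    cumulative = np.asarray(frame_probs).sum(axis=0)   # sum class evidence over time
    return class_names[int(np.argmax(cumulative))]

# label = aggregate_sequence(probs, ["Anger", "Disgust", "Happiness", "Sadness", "Surprise"])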
3rd International Conference on Computer Vision Theory and Applications. VISAPP, 2008
2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2020
Multimedia Tools and Applications, 2010
International Journal of Advanced Computer Science and Applications, 2022
arXiv (Cornell University), 2020
Iranian Journal of Science and Technology, Transactions of Electrical Engineering, 2018
personal.ee.surrey.ac.uk
Proceedings of the …, 2004
IEEE 22nd International Conference on Tools with Artificial Intelligence, 2010
Sensors, 2019
International Journal of Intelligent Systems and Applications in Engineering, 2024
18th International Conference on Pattern Recognition (ICPR'06), 2006
2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), 2021
Mathematical Problems in Engineering