Advanced Techniques in Text-Based Emotion
Recognition and Conversations
Introduction
This lecture extends the discussion on emotion recognition from text, focusing on how to build feature
representations, model conversations, and incorporate recent deep learning advances for dynamic
emotion prediction.
Feature Representation Techniques
Bag of Words (BoW)
Represent documents as histograms of word frequency.
Build a vocabulary (dictionary) from all training documents.
Each document is converted to a vector reflecting the frequency of each vocabulary word.
Example: Documents about Delhi might feature frequent use of "government," "parliament," etc.
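The BoW steps above can be sketched in a few lines of Python; the toy documents and vocabulary here are illustrative, not from the lecture:

```python
from collections import Counter

def build_vocab(docs):
    """Collect the sorted set of all words across the training documents."""
    return sorted({word for doc in docs for word in doc.split()})

def bow_vector(doc, vocab):
    """Histogram of vocabulary-word frequencies for one document."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

docs = ["the government met in parliament",
        "parliament passed the bill"]
vocab = build_vocab(docs)
print(vocab)                      # shared dictionary over all documents
print(bow_vector(docs[0], vocab)) # frequency vector for the first document
```

Every document maps to a vector of the same length (the vocabulary size), which is what makes BoW features directly usable by standard classifiers.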
N-grams
Represent sequences of n consecutive words.
Unigram (n=1): Each word is treated individually.
Bigram (n=2): Pairs of consecutive words (e.g., "Delhi is")
Trigram (n=3): Three-word sequences (e.g., "Delhi is the")
Captures contextual co-occurrence patterns, enhancing semantic inference.
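A minimal n-gram extractor makes the unigram/bigram/trigram distinction concrete (the example sentence is illustrative):

```python
def ngrams(tokens, n):
    """All runs of n consecutive tokens in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "Delhi is the capital".split()
print(ngrams(tokens, 2))  # bigrams:  ('Delhi', 'is'), ('is', 'the'), ...
print(ngrams(tokens, 3))  # trigrams: ('Delhi', 'is', 'the'), ...
```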
TF-IDF (Term Frequency-Inverse Document Frequency)
Term Frequency (TF): Measures how often a word appears in a document.
Inverse Document Frequency (IDF): Penalizes common words (e.g., "the") occurring across many
documents.
Formula: TF-IDF = TF * log(N / M)
N: Total number of documents
M: Number of documents containing the word
Balances saliency and discriminative power.
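The formula above can be computed directly; this sketch uses raw counts for TF, and the toy corpus is invented to show how a ubiquitous word like "the" is zeroed out:

```python
import math

def tf_idf(term, doc, docs):
    """TF-IDF = TF * log(N / M), with raw-count TF as in the formula above."""
    tf = doc.split().count(term)
    n = len(docs)                              # N: total number of documents
    m = sum(term in d.split() for d in docs)   # M: documents containing the term
    return tf * math.log(n / m)

docs = ["the cat sat", "the dog ran", "the cat slept"]
print(tf_idf("the", docs[0], docs))  # in every document -> log(3/3) = 0
print(tf_idf("cat", docs[0], docs))  # in 2 of 3 documents -> log(3/2) > 0
```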
Part-of-Speech (POS) Tagging
Annotate each word with its grammatical role (noun, verb, adjective).
Provides structural and semantic context.
Pointwise Mutual Information (PMI)
Measures association between words based on co-occurrence.
PMI-IR: A variant that estimates word co-occurrence probabilities from search-engine hit counts.
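PMI compares the observed joint probability of two words with what independence would predict; a sketch with invented counts (PMI-IR would replace these counts with search-engine hit counts):

```python
import math

def pmi(count_xy, count_x, count_y, total):
    """PMI(x, y) = log( p(x, y) / (p(x) * p(y)) ), with probabilities
    estimated as relative frequencies from co-occurrence counts."""
    p_xy = count_xy / total
    p_x = count_x / total
    p_y = count_y / total
    return math.log(p_xy / (p_x * p_y))

# Toy counts: of 1000 text windows, word x appears in 100, word y in 50,
# and both together in 20 -- four times more often than chance predicts.
print(pmi(20, 100, 50, 1000))  # log(4), a positive association
```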
Early Emotion Recognition Systems
Example: Blog Mood Classification (Mishne)
Dataset: 815,000 blog posts with mood labels (132 classes) from LiveJournal.
Features Used:
Bag of words
POS tags
PMI and PMI-IR scores
Classifiers trained on these features for mood prediction.
Representation Learning Techniques
Word2Vec (Mikolov et al., 2013)
Learns vector representations of words using neural networks.
CBOW (Continuous Bag of Words):
Predict a target word from surrounding context.
Skip-gram:
Predict surrounding words given a target word.
The input word is a one-hot vector; the learned hidden layer yields a dense, lower-dimensional embedding.
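How skip-gram turns raw text into supervision can be sketched as pair generation: each word becomes a target, and its neighbors within a window become the context words it must predict (CBOW simply inverts the roles). The sentence and window size are illustrative:

```python
def skipgram_pairs(tokens, window=2):
    """(target, context) training pairs: for each target word, every word
    within `window` positions becomes a context word to predict."""
    pairs = []
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

print(skipgram_pairs("emotion recognition from text".split(), window=1))
```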
Applications
Vectorized representation of each word used for pooling and ML classification.
Doc2Vec (Paragraph Vectors): An extension that learns fixed-length representations for full documents.
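The pooling step mentioned above is often just an average: word vectors for a document are mean-pooled into one fixed-length vector for a downstream classifier. A sketch with invented 3-dimensional embeddings:

```python
def mean_pool(word_vectors):
    """Average a document's word embeddings into one fixed-length vector."""
    dim = len(word_vectors[0])
    n = len(word_vectors)
    return [sum(vec[d] for vec in word_vectors) / n for d in range(dim)]

# Toy 3-dimensional embeddings for a two-word document
doc_vecs = [[1.0, 0.0, 2.0],
            [3.0, 2.0, 0.0]]
print(mean_pool(doc_vecs))  # element-wise average of the two vectors
```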
GloVe (Global Vectors for Word Representation)
Learns embeddings from a global word co-occurrence matrix.
Captures fine-grained semantic similarities.
Bag of Concepts
Extends BoW to represent clustered conceptual units rather than individual terms.
Uses Word2Vec + K-means clustering + TF-IDF weighting.
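Once K-means has assigned each word's Word2Vec vector to a cluster, the bag-of-concepts step reduces to a histogram over cluster ids instead of words; TF-IDF weighting can then replace the raw counts. The word-to-cluster assignments below are invented for illustration:

```python
from collections import Counter

# Assumed output of K-means over Word2Vec embeddings: word -> concept cluster
word_to_concept = {"government": 0, "parliament": 0, "minister": 0,
                   "cricket": 1, "stadium": 1}

def bag_of_concepts(doc, word_to_concept, n_concepts):
    """Histogram over concept clusters rather than individual terms."""
    counts = Counter(word_to_concept[w] for w in doc.split()
                     if w in word_to_concept)
    return [counts[c] for c in range(n_concepts)]

print(bag_of_concepts("the parliament and the minister met at the stadium",
                      word_to_concept, n_concepts=2))
```

Words the clustering has never seen are simply skipped here; a real pipeline might map them to a nearest cluster centroid instead.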
Deep Learning Architectures
Example: Kratzwald et al. (2018)
Inputs: Bag of words + word embeddings
Dual stream:
Feedforward feature extractor (BoW + embeddings)
RNN (LSTM) for sequential word modeling
Fusion and classification for emotion prediction
Example: Shelke et al.
Focus: Emotion from social media posts (text + emoticons)
Steps:
Text preprocessing: Tokenization, stop word removal, lemmatization
Emoticon labeling (e.g., joy = 1)
Feature extraction using DepecheMood lexicon
Feature ranking + fusion
Deep Neural Network classification
Semantic Emotion Neural Network (Batbaatar et al.)
Dual stream:
Semantic encoder (RNN) for contextual info
Emotion encoder (CNN) for affective info
Concatenate and classify
BERT and Transformer-Based Models
Attention-based (transformer) models process all tokens in parallel, unlike sequential RNNs
BERT: Bidirectional Encoder Representations from Transformers
Pretrained on masked language modeling and next sentence prediction
Examples:
Huang et al.:
Dual BERT models (FriendBERT and ChatBERT)
Emotion pretraining + fine-tuning on Twitter data
Kumar et al.:
Dual-channel explainable emotion system
CNN-RNN and RNN-CNN pipelines
Explainability via intra-/inter-cluster distance analysis
Emotion Recognition During Conversations
Need for Conversational Modeling
Emotions vary with conversational flow.
Requires contextual tracking of speaker responses over time.
Example: Poria et al.
Emotion dynamics across dialogue turns in a sitcom scene.
Emotions annotated frame-by-frame for each speaker.
DialogueCRN (Hu et al.)
Contextual Reasoning Network for emotion tracking in dialogues
Inputs:
Situation-level context (dialogue history)
Speaker-level context (current utterance)
Outputs:
Fused representation → emotion prediction
Yeh et al.
Interaction-aware network with GRU + attention
Parallel speaker analysis with bi-directional RNNs
Lian et al.
Domain-Adversarial Network for speech + text
Workflow:
Speech-to-text → utterance-level GRU
Temporal attention
Predict emotion at each time step
Conclusion
Text-based emotion recognition has evolved from simple vector models like BoW and TF-IDF to
sophisticated deep learning and transformer-based approaches. Modern systems also consider
dialogue context and speaker roles to dynamically infer emotions over time. As the field matures,
conversational and multimodal affect prediction continues to gain importance, especially in emotionally
intelligent systems.