MMML Tutorial ACL2017
1
Your Instructors
Louis-Philippe Morency
morency@[Link]
Tadas Baltrusaitis
tbaltrus@[Link]
2
CMU Course 11-777: Multimodal Machine Learning
3
Tutorial Schedule
▪ Introduction
▪ What is Multimodal?
▪ Historical view, multimodal vs multimedia
▪ Why multimodal
▪ Multimodal applications: image captioning, video
description, AVSR,…
▪ Core technical challenges
▪ Representation learning, translation, alignment, fusion
and co-learning
Tutorial Schedule
▪ Unimodal representations
▪ Visual representations
▪ Convolutional neural networks
▪ Acoustic representations
▪ Spectrograms, autoencoders
Tutorial Schedule
▪ Multimodal representations
▪ Joint representations
▪ Visual semantic spaces, multimodal autoencoder
▪ Tensor fusion representation
▪ Coordinated representations
▪ Similarity metrics, canonical correlation analysis
▪ Coffee break [20 mins]
Tutorial Schedule
▪ Multimodal fusion
▪ Model free approaches
▪ Early and late fusion, hybrid models
▪ Kernel-based fusion
▪ Multiple kernel learning
▪ Multimodal graphical models
▪ Factorial HMM, Multi-view Hidden CRF
▪ Multi-view LSTM model
What is Multimodal?
11
What is Multimodal?
Multimodal distribution
Sensory Modalities
What is Multimodal?
Modality
The way in which something happens or is experienced.
• Modality refers to a certain type of information and/or the
representation format in which information is stored.
• Sensory modality: one of the primary forms of sensation,
as vision or touch; channel of communication.
Medium (“middle”)
A means or instrumentality for storing or communicating
information; system of communication/transmission.
• Medium is the means whereby this information is
delivered to the senses of the interpreter.
14
Examples of Modalities
Verbal
▪ Lexicon
  ▪ Words
▪ Syntax
  ▪ Part-of-speech
  ▪ Dependencies
▪ Pragmatics
  ▪ Discourse acts
Visual
▪ Gestures
  ▪ Head gestures
  ▪ Eye gestures
  ▪ Arm gestures
▪ Body language
  ▪ Body posture
  ▪ Proxemics
16
Multiple Communities and Modalities
19
The “Behavioral” Era (1970s until late 1980s)
20
Language and Gestures
David McNeill
University of Chicago
Center for Gesture and Speech Research
21
The McGurk Effect (1976)
22
The McGurk Effect (1976)
23
➢ The “Computational” Era (Late 1980s until 2000)
24
➢ The “Computational” Era (Late 1980s until 2000)
25
➢ The “Computational” Era (Late 1980s until 2000)
2) Multimodal/multisensory interfaces
• Multimodal Human-Computer Interaction (HCI)
“Study of how to design and evaluate new computer
systems where humans interact through multiple
modalities, including both input and output modalities.”
26
➢ The “Computational” Era (Late 1980s until 2000)
2) Multimodal/multisensory interfaces
27
➢ The “Computational” Era (Late 1980s until 2000)
2) Multimodal/multisensory interfaces
Rosalind Picard
Affective Computing is
computing that relates to, arises
from, or deliberately influences
emotion or other affective
phenomena.
28
➢ The “Computational” Era (Late 1980s until 2000)
3) Multimedia Computing
[1994-2010]
29
➢ The “Computational” Era (Late 1980s until 2000)
3) Multimedia Computing
Multimedia content analysis
▪ Shot-boundary detection (1991 - )
▪ Parsing a video into continuous camera shots
▪ Still and dynamic video abstracts (1992 - )
▪ Making video browsable via representative frames (keyframes)
▪ Generating short clips carrying the essence of the video content
▪ High-level parsing (1997 - )
▪ Parsing a video into semantically meaningful segments
▪ Automatic annotation (indexing) (1999 - )
▪ Detecting prespecified events/scenes/objects in video
30
Multimodal Computational Models
31
Multimodal Computation Models
32
➢ The “Interaction” Era (2000s)
33
➢ The “Interaction” Era (2000s)
34
➢ The “Interaction” Era (2000s)
35
Multimodal Computational Models
[Timeline of multimodal computational models, 1970–2010, starting with audio-visual speech segmentation]
Our tutorial focuses on this era!
38
➢ The “deep learning” era (2010s until …)
39
➢ The “deep learning” era (2010s until …)
▪ Video description
▪ Visual Question Answering
40
Real-World Tasks Tackled by Multimodal Research
▪ Affect recognition
▪ Emotion
▪ Persuasion
▪ Personality traits
▪ Media description
▪ Image captioning
▪ Video captioning
▪ Visual Question Answering
▪ Event recognition
▪ Action recognition
▪ Segmentation
▪ Multimedia information retrieval
▪ Content based/Cross-media
Core Technical Challenges
42
Core Challenges in “Deep” Multimodal ML
[Link]
A) Joint representations
[Diagram: a single representation learned jointly from Modality 1 and Modality 2]
44
Joint Multimodal Representation
[Example: the spoken word “Wow!” with a tensed voice mapped into a joint representation (multimodal space)]
45
Joint Multimodal Representations
▪ Bimodal Deep Belief Network [Ngiam et al., ICML 2011]
▪ Multimodal Deep Boltzmann Machine [Srivastava and Salakhutdinov, NIPS 2012]
[Figures: depth across video / verbal / multimodal layers; example application: image captioning]
46
Multimodal Vector Space Arithmetic
[Kiros et al., Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models, 2014]
47
Core Challenge 1: Representation
48
Coordinated Representation: Deep CCA
(𝒖*, 𝒗*) = argmax_{𝒖,𝒗} corr(𝒖ᵀ𝑿, 𝒗ᵀ𝒀)
[Figure: Text 𝑿 and Image 𝒀 pass through networks with weights 𝑾𝒙, 𝑾𝒚 to views 𝑯𝒙 and 𝑯𝒚, followed by linear projections 𝑼 (giving 𝒖) and 𝑽 (giving 𝒗)]
Andrew et al., ICML 2013
49
Core Challenge 2: Alignment
B) Implicit alignment
50
Implicit Alignment
Karpathy et al., Deep Fragment Embeddings for Bidirectional Image Sentence Mapping,
[Link]
51
Attention Models for Image Captioning
[Figure: hidden states s0, s1, s2 produce distributions a1, a2, a3 over L image locations; the resulting contexts z1, z2, together with the previous words y1, y2, generate the output words d1, d2]
52
Core Challenge 3: Fusion
A) Model-agnostic approaches
[Diagram: per-modality features fed to one or more classifiers, e.g. early and late fusion]
53
Core Challenge 3: Fusion
54
Core Challenge 4: Translation
Definition: Process of changing data from one modality to another, where the
translation relationship can often be open-ended or subjective.
A Example-based B Model-driven
55
Core Challenge 4 – Translation
[Diagram: prediction is made from Modality 1, while Modality 2 helps during training]
57
Core Challenge 5: Co-Learning
58
Taxonomy of Multimodal Research [ [Link] ]
Tadas Baltrusaitis, Chaitanya Ahuja, and Louis-Philippe Morency, Multimodal Machine Learning: A Survey and Taxonomy
Basic Concepts: Score and Loss Functions
61
Linear Classification (e.g., neural network)
[Example: an input image of size 32×32×3 to be assigned a class label]
62
1) Score Function
f(x_i; W, b) = W x_i + b
Class scores [10×1]; weights W [10×3072]; bias vector b [10×1] (equivalently, a single parameter matrix [10×3073] with the bias folded in)
63
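As an illustration of this score function (not from the original slides; the shapes follow the 32×32×3, 10-class example above), a minimal NumPy sketch:

```python
import numpy as np

# Hypothetical shapes following the slide: 10 classes, 32*32*3 = 3072 input values.
num_classes, num_pixels = 10, 32 * 32 * 3

rng = np.random.default_rng(0)
W = rng.normal(scale=0.01, size=(num_classes, num_pixels))  # weights [10 x 3072]
b = np.zeros(num_classes)                                   # bias vector [10 x 1]

x_i = rng.random(num_pixels)   # one flattened image

def score(x, W, b):
    """Linear score function f(x; W, b) = W x + b -> one score per class."""
    return W @ x + b

print(score(x_i, W, b).shape)  # (10,)
```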
Interpreting a Linear Classifier
[Figure: the planar decision surface f(x) in data space for the simple linear discriminant function W x_i + b > 0; the weight vector w is normal to the plane, which is offset from the origin by −b/‖w‖]
64
Some Notation Tricks – Multi-Label Classification
W = [W₁ W₂ … W_N]
f(x_i; W, b) = W x_i + b,  or  f(x_i; W) = W x_i  (bias folded into W)
Score for class j: f(x_i; W_j, b_j), also written f(x_i; W, b)_j or f_j
66
Interpreting Multiple Linear Classifiers
f(x_i; W_j, b_j) = W_j x_i + b_j
[Figure: per-class score functions (f_airplane, f_car, f_deer, …) visualized on the CIFAR-10 object recognition dataset]
67
Linear Classification: 2) Loss Function
(or cost function or objective)
Logistic function: σ(f) = 1 / (1 + e^(−f))
[Plot: σ(f) rises from 0 to 1, passing through 0.5 at f = 0]
➢ f is the score function
69
First Loss Function: Cross-Entropy Loss
(or logistic loss)
Logistic function: σ(f) = 1 / (1 + e^(−f))
➢ f is the score function
70
First Loss Function: Cross-Entropy Loss
(or logistic loss)
Logistic function: σ(f) = 1 / (1 + e^(−f))
Softmax function (multiple classes): p(y_i | x_i; W) = e^(f_{y_i}) / Σ_j e^(f_j)
71
First Loss Function: Cross-Entropy Loss
(or logistic loss)
Cross-entropy loss:
L_i = −log( e^(f_{y_i}) / Σ_j e^(f_j) )   ← softmax function
Minimizing the negative log likelihood.
72
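A small NumPy sketch of this cross-entropy loss (illustration only; subtracting the maximum score is a standard numerical-stability trick, not something stated on the slide):

```python
import numpy as np

def cross_entropy_loss(f, y):
    """L_i = -log( e^{f_y} / sum_j e^{f_j} ) for one example.

    f : array of class scores, y : index of the correct class.
    """
    f = f - np.max(f)                       # for numerical stability
    log_probs = f - np.log(np.sum(np.exp(f)))
    return -log_probs[y]

scores = np.array([3.2, 5.1, -1.7])         # toy scores
print(cross_entropy_loss(scores, y=0))      # loss when class 0 is correct
```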
Second Loss Function: Hinge Loss
(or max-margin loss or Multi-class SVM loss)
L_i = Σ_{j ≠ y_i} max(0, f_j − f_{y_i} + 1)
(loss for example i: a sum over all incorrect labels of the difference between the incorrect class score and the correct class score, within a margin)
73
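A matching NumPy sketch of this multi-class hinge loss (a sketch, using margin 1 as in the reconstructed formula above):

```python
import numpy as np

def hinge_loss(f, y, margin=1.0):
    """L_i = sum_{j != y} max(0, f_j - f_y + margin) for one example."""
    margins = np.maximum(0.0, f - f[y] + margin)
    margins[y] = 0.0                  # do not count the correct class
    return margins.sum()

scores = np.array([3.2, 5.1, -1.7])
print(hinge_loss(scores, y=0))        # max(0, 5.1-3.2+1) + max(0, -1.7-3.2+1) = 2.9
```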
Second Loss Function: Hinge Loss
(or max-margin loss or Multi-class SVM loss)
e.g. 10
Example:
[Figure: a single neuron — weighted sum W x + b, passed through an activation function, producing output y = f(W x + b)]
Neural Networks – activation function
▪ Tanh – f(x) = tanh(x)
▪ Sigmoid – f(x) = (1 + e^(−x))^(−1)
▪ Linear – f(x) = ax + b
▪ Geometrically
  ▪ The gradient points in the direction of the greatest rate of increase of the function, and its magnitude is the slope of the graph in that direction
▪ More formally, in 1D:
  df(x)/dx = lim_{h→0} [f(x + h) − f(x)] / h
▪ In higher dimensions:
  ∂f/∂x_i (a_1, …, a_n) = lim_{h→0} [f(a_1, …, a_i + h, …, a_n) − f(a_1, …, a_i, …, a_n)] / h
➢ In multiple dimensions, the vector of partial derivatives is called the gradient; for a vector-valued function, the matrix of all partial derivatives is called the Jacobian.
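The limit definition above can be checked numerically with finite differences; a small illustrative helper (not part of the tutorial code):

```python
import numpy as np

def numerical_gradient(f, x, h=1e-5):
    """Approximate the gradient of f at x with central differences."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        grad[i] = (f(x + e) - f(x - e)) / (2 * h)
    return grad

f = lambda x: (x ** 2).sum()          # toy function, true gradient is 2x
x = np.array([1.0, -2.0, 3.0])
print(numerical_gradient(f, x))       # approx [ 2. -4.  6.]
```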
Gradient Computation
Chain rule:
∂y/∂x = (∂y/∂h)(∂h/∂x),   where y = f(h) and h = g(x)
[Figure: x → h → y]
85
Optimization: Gradient Computation
∂y/∂x = Σ_j (∂y/∂h_j)(∂h_j/∂x),   where y = f(h₁, h₂, h₃) and h_j = g(x)
[Figure: x → h₁, h₂, h₃ → y]
86
Optimization: Gradient Computation
∂y/∂x_k = Σ_j (∂y/∂h_j)(∂h_j/∂x_k)   for k = 1, 2, 3,
where y = f(h₁, h₂, h₃) and h_j = g(𝒙)
[Figure: x₁, x₂, x₃ → h₁, h₂, h₃ → y]
87
Optimization: Gradient Computation
Vector representation:
∇_𝒙 y = (∂y/∂x₁, ∂y/∂x₂, ∂y/∂x₃)ᵀ   — the gradient,  where y = f(𝒉) and 𝒉 = g(𝒙)
∇_𝒙 y = (∂𝒉/∂𝒙)ᵀ ∇_𝒉 y
“local” Jacobian ∂𝒉/∂𝒙 (a matrix of size h × x computed from partial derivatives); ∇_𝒉 y is the “backprop” gradient
[Figure: 𝒙 → 𝒉 → y]
88
Backpropagation Algorithm (efficient gradient)
89
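As a tiny illustration of this idea (a hypothetical two-step function, not the tutorial's own example): run the forward pass h = g(x), y = f(h), then multiply local Jacobians backwards.

```python
import numpy as np

# Forward pass: x -> h = W1 x -> y = sum(tanh(h))
rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 4))
x = rng.normal(size=4)

h = W1 @ x
y = np.tanh(h).sum()

# Backward pass (chain rule): dy/dh first, then dy/dx = (dh/dx)^T dy/dh
dy_dh = 1.0 - np.tanh(h) ** 2         # elementwise derivative of tanh
dy_dx = W1.T @ dy_dh                  # the "local" Jacobian of h = W1 x is W1
dy_dW1 = np.outer(dy_dh, x)           # gradient w.r.t. the weights

print(dy_dx.shape, dy_dW1.shape)      # (4,) (3, 4)
```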
How to follow the gradient
Gradient descent:
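A minimal sketch of the standard gradient-descent update w ← w − η·∇_w L (the learning rate and numbers below are arbitrary illustrations, not from the slides):

```python
import numpy as np

def gradient_descent_step(w, grad, lr=0.1):
    """One gradient-descent update: w <- w - lr * grad."""
    return w - lr * grad

w = np.array([1.0, -2.0])
grad = np.array([0.5, -1.0])          # gradient of the loss at w
print(gradient_descent_step(w, grad)) # [ 0.95 -1.9 ]
```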
Unimodal Classification – Language Modality
Word-level classification
Spoken language
Input observation 𝒙𝒊: a “one-hot” vector, dimensionality = number of words in the dictionary
▪ Part-of-speech? (noun, verb, …)
▪ Sentiment? (positive or negative)
▪ Named entity? (names of persons, …)
93
Unimodal Classification – Language Modality
Document-level classification
Written or spoken language
Input observation 𝒙𝒊: a “bag-of-words” vector, dimensionality = number of words in the dictionary
▪ Sentiment? (positive or negative)
94
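A short sketch of the two encodings above (a toy five-word vocabulary stands in for the full dictionary of the slides):

```python
import numpy as np

vocab = {"he": 0, "was": 1, "walking": 2, "away": 3, "because": 4}

def one_hot(word, vocab):
    """Word-level input: a single 1 at the word's index."""
    x = np.zeros(len(vocab))
    x[vocab[word]] = 1.0
    return x

def bag_of_words(document, vocab):
    """Document-level input: count every word that occurs."""
    x = np.zeros(len(vocab))
    for word in document:
        x[vocab[word]] += 1.0
    return x

print(one_hot("walking", vocab))                            # [0. 0. 1. 0. 0.]
print(bag_of_words("he was walking away".split(), vocab))   # [1. 1. 1. 1. 0.]
```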
How to Learn (Better) Language Representations?
Example contexts: “He was walking away because …” / “He was running away because …”
[Figure: each word is a 100,000-d one-hot vector x; a matrix W1 maps x to a 300-d representation and W2 maps it back to a 100,000-d output y that predicts the surrounding words]
Word2vec algorithm: [Link]
How to use these word representations
Goal: a 300-dimensional vector per word
Transform: x’ = x·W (100,000-d one-hot → 300-d embedding)
In the one-hot space: similarity(walking, running) = 0.0
In the embedding space:
Walking: [0.1; 0.0003; 0; …; 0.02; 0.08; 0.05]
Running: [0.1; 0.0004; 0; …; 0.01; 0.09; 0.05]
Similarity = 0.9
Vector space models of words
Trained on the Google news corpus with over 300 billion words
Mikolov et al., “Distributed Representations of Words and Phrases and their Compositionality”, NIPS 2013
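A sketch of how the learned word vectors can be compared with cosine similarity (the short vectors below are made-up stand-ins for the “walking”/“running” embeddings of the slide above):

```python
import numpy as np

def cosine_similarity(a, b):
    """Similarity of two word vectors, in [-1, 1]."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

walking = np.array([0.10, 0.0003, 0.02, 0.08, 0.05])   # toy "embeddings"
running = np.array([0.10, 0.0004, 0.01, 0.09, 0.05])
one_hot_a = np.array([1.0, 0.0, 0.0, 0.0, 0.0])
one_hot_b = np.array([0.0, 1.0, 0.0, 0.0, 0.0])

print(cosine_similarity(one_hot_a, one_hot_b))  # 0.0 - distinct one-hot vectors never overlap
print(cosine_similarity(walking, running))      # close to 1 - similar words, similar vectors
```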
Unimodal Representations: Visual Modality
100
Visual Descriptors
▪ SIFT descriptors
▪ LBP
▪ Optical flow
▪ Gabor jets
Why use Convolutional Neural Networks
▪ Using basic Multi-Layer Perceptrons does not work well for images
▪ Intention: build more abstract representations as we go up every layer
[Figure: feature hierarchy — input pixels → edges/blobs → parts → objects]
Why not just use an MLP for images (1)?
▪ Addition of:
▪ Convolution layer
▪ Pooling layer
▪ Everything else is the same (loss, score and
optimization)
▪ MLP layer is called Fully Connected layer
Convolution in 2D
[Figure: image ∗ kernel = filtered image]
Fully connected layer
Weighted sum
𝑊𝑥 + 𝑏
Activation function
Output
𝑦 = 𝑓(𝑊𝑥 + 𝑏)
Convolution as MLP (1)
▪ Remove activation
Input
Weighted sum
𝑊𝑥 + 𝑏 Kernel 𝒘𝟏 𝒘𝟐 𝒘𝟑
𝑦 = 𝑊𝑥 + 𝑏
Convolution as MLP (2)
Weighted sum
𝑊𝑥 Kernel 𝒘𝟏 𝒘𝟐 𝒘𝟑
𝑦 = 𝑊𝑥
Convolution as MLP (3)
Weighted sum
𝑊𝑥 Kernel 𝒘𝟏 𝒘𝟐 𝒘𝟑
𝑦 = 𝑊𝑥
How do we do convolution in an MLP – recap
W is a banded matrix built by sliding the kernel (w₁, w₂, w₃) along the rows:
W = [ w₁ w₂ w₃ 0  ⋯  0  0  0
      0  w₁ w₂ w₃ ⋯  0  0  0
      ⋮           ⋱        ⋮
      0  0  0  0  ⋯  w₁ w₂ w₃ ]
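A small NumPy sketch of this idea: build the banded matrix from a length-3 kernel and check it against NumPy's sliding correlation (sizes are illustrative; like most deep-learning "convolution" layers, this is technically a correlation, i.e. the kernel is not flipped):

```python
import numpy as np

def conv_as_matrix(kernel, input_len):
    """Banded matrix W such that W @ x equals the 'valid' sliding weighted sums."""
    k = len(kernel)
    out_len = input_len - k + 1
    W = np.zeros((out_len, input_len))
    for i in range(out_len):
        W[i, i:i + k] = kernel
    return W

x = np.arange(6, dtype=float)          # input signal
w = np.array([1.0, 2.0, 3.0])          # kernel (w1, w2, w3)

W = conv_as_matrix(w, len(x))
print(W @ x)                            # sliding weighted sums
print(np.correlate(x, w, mode="valid")) # same result
```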
Pooling layer
y = Σ_{i=1}^{n} x_i e^(αx_i) / Σ_{i=1}^{n} e^(αx_i)
(a soft pooling function: α = 0 gives average pooling, α → ∞ approaches max pooling)
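A sketch of this generalized pooling function, showing how the parameter α interpolates between average and max pooling (toy values):

```python
import numpy as np

def soft_pool(x, alpha):
    """y = sum_i x_i * exp(alpha*x_i) / sum_i exp(alpha*x_i)."""
    w = np.exp(alpha * (x - x.max()))   # shift for numerical stability; ratio is unchanged
    return float((x * w).sum() / w.sum())

x = np.array([0.1, 0.5, 2.0, 1.2])
print(soft_pool(x, alpha=0.0))    # == x.mean() -> average pooling
print(soft_pool(x, alpha=50.0))   # ~= x.max()  -> max pooling
print(x.mean(), x.max())
```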
Example: AlexNet Model
Unimodal Classification – Acoustic Modality
Digitalized acoustic signal
Input observation 𝒙𝒊: spectrogram values (real numbers)
• Sampling rates: 8–96 kHz
• Bit depth: 8, 16 or 24 bits
• Time window size: 20 ms
• Offset: 10 ms
[Figure: spectrogram]
114
Unimodal Classification – Acoustic Modality
Digitalized acoustic signal
Input observation 𝒙𝒊: spectrogram values (real numbers)
• Sampling rates: 8–96 kHz
• Bit depth: 8, 16 or 24 bits
• Time window size: 20 ms
• Offset: 10 ms
Example prediction tasks:
▪ Emotion?
▪ Spoken word?
▪ Voice quality?
[Figure: spectrogram]
115
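A sketch of how such an input could be computed with SciPy, using the 20 ms window / 10 ms offset from the slide (the sine-wave signal and the 16 kHz sampling rate are made-up stand-ins for real audio):

```python
import numpy as np
from scipy.signal import spectrogram

fs = 16000                                    # sampling rate (Hz), assumed
t = np.arange(0, 1.0, 1.0 / fs)
audio = np.sin(2 * np.pi * 440 * t)           # stand-in for a real recording

win = int(0.020 * fs)                         # 20 ms window
hop = int(0.010 * fs)                         # 10 ms offset
freqs, times, spec = spectrogram(audio, fs=fs, nperseg=win, noverlap=win - hop)

print(spec.shape)   # (frequency bins, time frames); each column is one observation x_i
```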
Audio representation for speech recognition
▪ Two parts: an encoder 𝑔 and a decoder 𝑓
▪ 𝒙′ = 𝑓(𝑔(𝒙)) – score function
[Figure: inputs x₁ … x_n → encoder → decoder → reconstructions x′₁ … x′_n]
Autoencoders
▪ 𝑾∗ = 𝑾ᵀ (tied weights: the decoder reuses the transposed encoder weights)
▪ word2vec is actually a bit similar to an autoencoder (except for the “auto” part – it reconstructs context words rather than the input itself)
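A minimal sketch of an autoencoder forward pass with tied weights (W* = Wᵀ) as described above; the layer sizes and the sigmoid non-linearity are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden = 20, 5
W = rng.normal(scale=0.1, size=(n_hidden, n_in))   # encoder weights
b_h, b_out = np.zeros(n_hidden), np.zeros(n_in)

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def autoencode(x):
    h = sigmoid(W @ x + b_h)          # encoder g(x)
    x_rec = sigmoid(W.T @ h + b_out)  # decoder f(h) with tied weights W* = W^T
    return h, x_rec

x = rng.random(n_in)
h, x_rec = autoencode(x)
reconstruction_error = np.sum((x - x_rec) ** 2)    # training would minimize this
print(h.shape, x_rec.shape, reconstruction_error)
```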
Denoising autoencoder
▪ Simple idea: add noise to the input 𝒙, but learn to reconstruct the original (clean) input as 𝒙′
Stacked autoencoders
▪ Each encoding unit has a corresponding decoder
▪ Inference, as before, is a feed-forward structure, but now with more hidden layers
[Figure: 𝒙 → encoder (𝒉₁, 𝒉₂) → decoder (𝒉′₁) → reconstructions 𝒙′₁ … 𝒙′_n]
Stacked autoencoders
▪ Reconstruct using the previously learned decoder mappings
▪ Fine-tune the full network end-to-end
[Figure: 𝒙 → encoder (𝒉₁, 𝒉₂) → decoder (𝒉′₁) → 𝒙′]
Stacked denoising autoencoders
Multimodal Representations
125
Core Challenge: Representation
126
Deep Multimodal Boltzmann machines
[Srivastava and Salakhutdinov, Multimodal Learning with Deep Boltzmann Machines, 2012, 2014]
Deep Multimodal Boltzmann machines
Srivastava and Salakhutdinov, “Multimodal Learning with Deep Boltzmann Machines”, NIPS 2012
Deep Multimodal autoencoders
▪ A deep representation learning approach
▪ A bimodal autoencoder
▪ Used for audio-visual speech recognition
▪ A simple multilayer perceptron will be used to translate from visual (CNN) to language (LSTM)
[Figure: bimodal autoencoder architecture]
132
Multimodal Joint Representation
[Figure: modality-specific layers fused into a joint representation 𝒉𝒎]
134
Unimodal, Bimodal and Trimodal Interactions
135
Multimodal Tensor Fusion Network (TFN)
Models both unimodal and bimodal interactions:
𝒉𝒎 = [𝒉𝒙 ; 1] ⊗ [𝒉𝒚 ; 1]
(the outer product contains the bimodal term 𝒉𝒙 ⊗ 𝒉𝒚 and the unimodal terms 𝒉𝒙 and 𝒉𝒚)
[Figure: text network 𝑿 → 𝒉𝒙, image network 𝒀 → 𝒉𝒚, fused into 𝒉𝒎 with a softmax on top, e.g. for sentiment]
[Zadeh, Jones and Morency, EMNLP 2017]
136
Multimodal Tensor Fusion Network (TFN)
Can be extended to three modalities:
𝒉𝒎 = [𝒉𝒙 ; 1] ⊗ [𝒉𝒚 ; 1] ⊗ [𝒉𝒛 ; 1]
Explicitly models unimodal, bimodal (𝒉𝒙 ⊗ 𝒉𝒚, 𝒉𝒙 ⊗ 𝒉𝒛, 𝒉𝒛 ⊗ 𝒉𝒚) and trimodal (𝒉𝒙 ⊗ 𝒉𝒚 ⊗ 𝒉𝒛) interactions!
[Figure: text 𝑿 → 𝒉𝒙, image 𝒀 → 𝒉𝒚, audio 𝒁 → 𝒉𝒛, combined by an outer product]
[Zadeh, Jones and Morency, EMNLP 2017]
137
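A small sketch of the tensor-fusion idea for three modalities (random vectors stand in for the learned 𝒉𝒙, 𝒉𝒚, 𝒉𝒛; this only builds the fused tensor, not the full network of Zadeh et al.):

```python
import numpy as np

rng = np.random.default_rng(0)
h_x, h_y, h_z = rng.random(4), rng.random(3), rng.random(2)   # per-modality embeddings

# Append the constant 1 so the outer product keeps unimodal and bimodal terms too.
zx = np.concatenate([h_x, [1.0]])
zy = np.concatenate([h_y, [1.0]])
zz = np.concatenate([h_z, [1.0]])

h_m = np.einsum('i,j,k->ijk', zx, zy, zz)   # [h_x;1] (x) [h_y;1] (x) [h_z;1]
print(h_m.shape)            # (5, 4, 3): contains unimodal, bimodal and trimodal products
fused = h_m.reshape(-1)     # flattened, this would feed the downstream prediction layers
```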
Experimental Results – MOSI Dataset
138
Coordinated Multimodal Representations
139
Coordinated Multimodal Representations
[Figure: Text 𝑿 and Image 𝒀 are each encoded by their own network; the two representations are coordinated through a similarity constraint rather than fused into one vector]
140
Coordinated Multimodal Embeddings
[Huang et al., Learning Deep Structured Semantic Models for Web Search using Clickthrough Data, 2013]
Multimodal Vector Space Arithmetic
[Kiros et al., Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models, 2014]
Multimodal Vector Space Arithmetic
[Kiros et al., Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models, 2014]
Canonical Correlation Analysis
Learn two linear projections, one for each view, that are maximally correlated:
(𝒖*, 𝒗*) = argmax_{𝒖,𝒗} corr(𝑯𝒙, 𝑯𝒚) = argmax_{𝒖,𝒗} corr(𝒖ᵀ𝑿, 𝒗ᵀ𝒀)
[Figure: Text 𝑿 → projection 𝑼 → 𝑯𝒙; Image 𝒀 → projection 𝑽 → 𝑯𝒚]
144
Correlated Projection
(𝒖*, 𝒗*) = argmax_{𝒖,𝒗} corr(𝒖ᵀ𝑿, 𝒗ᵀ𝒀)
[Figure: the data 𝑿 and 𝒀 with their first canonical directions 𝒖 and 𝒗]
145
Canonical Correlation Analysis
We want to learn multiple projection pairs (𝒖^(i)ᵀ𝑿, 𝒗^(i)ᵀ𝒀):
147
Canonical Correlation Analysis
maximize: tr(𝑼ᵀ 𝚺𝑿𝒀 𝑽)
subject to: 𝑼ᵀ 𝚺𝑿𝑿 𝑼 = 𝑽ᵀ 𝚺𝒀𝒀 𝑽 = 𝑰  and  𝑼ᵀ 𝚺𝑿𝒀 𝑽 = diag(λ₁, λ₂, λ₃, …)
where Σ = [ 𝚺𝑿𝑿  𝚺𝑿𝒀 ; 𝚺𝒀𝑿  𝚺𝒀𝒀 ] is the joint covariance matrix of the two views.
148
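A compact NumPy sketch of linear CCA via whitening and an SVD (a standard derivation, not the tutorial's own code; the regularizer reg is a small made-up constant added for numerical stability):

```python
import numpy as np

def cca(X, Y, reg=1e-6):
    """X: (dx, n), Y: (dy, n). Returns projections U, V and canonical correlations."""
    n = X.shape[1]
    X = X - X.mean(axis=1, keepdims=True)
    Y = Y - Y.mean(axis=1, keepdims=True)
    Sxx = X @ X.T / n + reg * np.eye(X.shape[0])
    Syy = Y @ Y.T / n + reg * np.eye(Y.shape[0])
    Sxy = X @ Y.T / n

    def inv_sqrt(S):                      # S^{-1/2} via eigendecomposition
        w, Q = np.linalg.eigh(S)
        return Q @ np.diag(1.0 / np.sqrt(w)) @ Q.T

    T = inv_sqrt(Sxx) @ Sxy @ inv_sqrt(Syy)
    A, corrs, Bt = np.linalg.svd(T)
    U = inv_sqrt(Sxx) @ A                 # the u^(i) are the columns of U
    V = inv_sqrt(Syy) @ Bt.T
    return U, V, corrs

rng = np.random.default_rng(0)
X, Y = rng.normal(size=(5, 200)), rng.normal(size=(4, 200))
U, V, corrs = cca(X, Y)
print(corrs)   # canonical correlations, in decreasing order
```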
Deep Canonical Correlation Analysis
argmax_{𝑾𝒙, 𝑾𝒚, 𝑼, 𝑽} corr(𝑯𝒙, 𝑯𝒚)
1. Linear projections maximizing correlation
2. Orthogonal projections
3. Unit variance of the projection vectors
[Figure: Text 𝑿 → network 𝑾𝒙 → view 𝑯𝒙 (projection 𝑼); Image 𝒀 → network 𝑾𝒚 → view 𝑯𝒚 (projection 𝑽)]
Andrew et al., ICML 2013
149
Deep Canonically Correlated Autoencoders (DCCAE)
Jointly optimize the DCCA and autoencoder loss functions
➢ A trade-off between multi-view correlation and reconstruction error from the individual views
[Figure: Text 𝑿 and Image 𝒀 encoded into views 𝑯𝒙, 𝑯𝒚 (projections 𝑼, 𝑽) and decoded back into reconstructions 𝑿′, 𝒀′]
Wang et al., ICML 2015
150
Basic Concepts: Recurrent Neural Networks
151
Feedforward Neural Network
𝒉(t) = tanh(𝑼 𝒙(t))
[Figure: input 𝒙(t) → weights 𝑼 → hidden 𝒉(t) → weights 𝑽 → output]
152
Recurrent Neural Networks
L = Σ_t L(t)
𝒉(t) = tanh(𝑼 𝒙(t) + 𝑾 𝒉(t−1))
[Figure: input 𝒙(t) → 𝑼 → 𝒉(t), with a recurrent connection 𝑾 from 𝒉(t−1)]
153
Recurrent Neural Networks - Unrolling
L = Σ_t L(t)
𝒛(t) = matmult(𝒉(t), 𝑽)
𝒉(t) = tanh(𝑼 𝒙(t) + 𝑾 𝒉(t−1))
[Figure: the network unrolled in time — inputs 𝒙(1) … 𝒙(τ) → hidden states 𝒉(1) … 𝒉(τ) (recurrent weights 𝑾) → outputs 𝒛(t), 𝒚(t) and per-step losses L(t)]
156
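A minimal sketch of this unrolled forward pass in NumPy (randomly initialized U, W, V; the per-step loss is omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out, T = 4, 8, 3, 5
U = rng.normal(scale=0.1, size=(n_hidden, n_in))
W = rng.normal(scale=0.1, size=(n_hidden, n_hidden))
V = rng.normal(scale=0.1, size=(n_out, n_hidden))

xs = rng.normal(size=(T, n_in))           # x(1) ... x(T)
h = np.zeros(n_hidden)                    # h(0)
zs = []
for x_t in xs:
    h = np.tanh(U @ x_t + W @ h)          # h(t) = tanh(U x(t) + W h(t-1))
    zs.append(V @ h)                      # z(t) = V h(t)
print(np.array(zs).shape)                 # (5, 3): one output per timestep
```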
Backpropagation Through Time
∇_{𝒉(τ)} L = (∂𝒛(τ)/∂𝒉(τ))ᵀ ∇_{𝒛(τ)} L = 𝑽ᵀ ∇_{𝒛(τ)} L
157
Backpropagation Through Time
∇_𝑼 L = Σ_t ∇_{𝒉(t)} L · ∂𝒉(t)/∂𝑼
[Figure: the gradient flows back through every timestep at which 𝑼 is applied to 𝒙(t)]
158
Long-term Dependencies
159
Recurrent Neural Networks
[Figure: a single recurrent unit — 𝒙(t) and 𝒉(t−1) are combined through a tanh nonlinearity (output between −1 and +1) to give 𝒉(t), which produces 𝒛(t), 𝒚(t) and the loss L(t)]
160
LSTM ideas: (1) “Memory” Cell and Self Loop
[Hochreiter and Schmidhuber, 1997]
[Figure: the same recurrent unit, but the hidden state now passes through a memory cell 𝒄(t) with a self-loop, so information can be carried across many timesteps]
161
LSTM Ideas: (2) Input and Output Gates
[Hochreiter and Schmidhuber, 1997]
[Figure: sigmoid gates (values between 0 and 1), computed from 𝒙(t) and 𝒉(t−1), control what is written to and read from the memory cell — the input gate and the output gate]
162
LSTM Ideas: (3) Forget Gate [Gers et al., 2000]
[𝒈; 𝒊; 𝒇; 𝒐] = [tanh; sigm; sigm; sigm]( 𝑾 [𝒉(t−1); 𝒙(t)] )
𝒄(t) = 𝒇 ⊙ 𝒄(t−1) + 𝒊 ⊙ 𝒈
𝒉(t) = 𝒐 ⊙ tanh(𝒄(t))
[Figure: the candidate 𝒈 (tanh) enters the cell through the input gate 𝒊, the previous cell state 𝒄(t−1) is scaled by the forget gate 𝒇, and the output gate 𝒐 controls what leaves the cell as 𝒉(t); all gates are sigmoids of 𝒙(t) and 𝒉(t−1)]
163
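A sketch of one LSTM step following the equations above (a single weight matrix W over the concatenated [h(t−1); x(t)], split into the four gate blocks; sizes and initialization are arbitrary):

```python
import numpy as np

sigm = lambda z: 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM update. W has shape (4*H, H + D)."""
    H = h_prev.size
    a = W @ np.concatenate([h_prev, x_t]) + b
    g = np.tanh(a[0:H])            # candidate
    i = sigm(a[H:2*H])             # input gate
    f = sigm(a[2*H:3*H])           # forget gate
    o = sigm(a[3*H:4*H])           # output gate
    c_t = f * c_prev + i * g
    h_t = o * np.tanh(c_t)
    return h_t, c_t

rng = np.random.default_rng(0)
D, H = 4, 6
W = rng.normal(scale=0.1, size=(4 * H, H + D))
b = np.zeros(4 * H)
h, c = np.zeros(H), np.zeros(H)
h, c = lstm_step(rng.normal(size=D), h, c, W, b)
print(h.shape, c.shape)   # (6,) (6,)
```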
Recurrent Neural Network using LSTM Units
[Figure: a chain of LSTM units LSTM(1) … LSTM(τ) over inputs 𝒙(1) … 𝒙(τ), connected by recurrent weights 𝑾; output weights 𝑽 produce 𝒛(τ), 𝒚(τ) and the loss L(τ)]
Deep LSTM Network
[Figure: two stacked LSTM layers — LSTM₁(t) over inputs 𝒙(t) with weights 𝑾₁, feeding LSTM₂(t) with weights 𝑾₂, followed by output weights 𝑽]
166
Deep LSTM Network
[Figure: two stacked LSTM layers — LSTM₁(t) with weights 𝑾₁ over inputs 𝒙(1) … 𝒙(τ), feeding LSTM₂(t) with weights 𝑾₂ and output weights 𝑽, with a loss L(t) at every timestep]
167
Translation and Alignment
168
Core Challenge 4: Translation
Definition: Process of changing data from one modality to another, where the
translation relationship can often be open-ended or subjective.
A Example-based B Model-driven
169
Translation
➢ Speech synthesis
Example-based translation
[Vinyals et al., “Show and Tell: A Neural Image Caption Generator”, CVPR 2015]
173
Visual Question Answering
▪ Tricky to do automatically!
▪ Ideally want humans to evaluate
▪ What do you ask?
▪ Can’t use human evaluation for validating models –
too slow and expensive
▪ Using standard machine translation metrics
instead
▪ BLEU, ROUGE, CIDEr, METEOR
Core Challenge: Alignment
B) Implicit alignment
176
Explicit alignment
177
Temporal sequence alignment
Applications:
- Re-aligning asynchronous
data
- Finding similar data across
modalities (we can estimate
the aligned cost)
- Event reconstruction from
multiple sources
Let’s start unimodal – Dynamic Time Warping
Minimize the alignment cost L(𝒑ˣ, 𝒑ʸ) = Σ_{t=1}^{l} ‖𝒙_{𝒑ˣ_t} − 𝒚_{𝒑ʸ_t}‖₂²
▪ where 𝒑ˣ_t and 𝒑ʸ_t are index vectors of the same length
▪ Finding these indices is called Dynamic Time Warping
Dynamic Time Warping continued
[Figure: the warping path through the pairwise-distance matrix, from (𝒑ˣ₁, 𝒑ʸ₁) to (𝒑ˣ_t, 𝒑ʸ_t)]
DTW alternative formulation
L(𝒑ˣ, 𝒑ʸ) = Σ_{t=1}^{l} ‖𝒙_{𝒑ˣ_t} − 𝒚_{𝒑ʸ_t}‖₂²
Replication doesn’t change the objective! The index vectors can be written as binary selection matrices, so the warped sequences are 𝐗𝐖_x and 𝐘𝐖_y.
Alternative objective:
L(𝑾𝒙, 𝑾𝒚) = ‖𝑿𝑾_x − 𝒀𝑾_y‖²_F
𝑿, 𝒀 – original signals (same #rows, possibly different #columns)
𝑾_x, 𝑾_y – alignment matrices
Frobenius norm: ‖𝑨‖²_F = Σᵢ Σⱼ a²ᵢⱼ
182
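A standard dynamic-programming sketch of DTW for two 1-D sequences (O(n·m) time per pair, which is part of why aligning many sequences gets expensive); the sequences are toy examples:

```python
import numpy as np

def dtw(x, y):
    """Minimal cumulative squared-distance alignment cost between x and y."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2
            D[i, j] = cost + min(D[i - 1, j],      # repeat y_j
                                 D[i, j - 1],      # repeat x_i
                                 D[i - 1, j - 1])  # advance both
    return D[n, m]

x = np.array([0.0, 1.0, 2.0, 3.0, 2.0, 0.0])
y = np.array([0.0, 0.0, 1.0, 2.0, 3.0, 2.0, 1.0, 0.0])   # same shape, different timing
print(dtw(x, y))   # small cost: the sequences align well after warping
```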
DTW - limitations
▪ Computationally complex for m sequences
▪ Sensitive to outliers
▪ Unimodal!
Canonical Correlation Analysis reminder
maximize: tr(𝑼ᵀ 𝚺𝑿𝒀 𝑽)
1. Linear projections maximizing correlation
2. Orthogonal projections
[Figure: projections of X and Y into the coordinated spaces 𝑯𝒙, 𝑯𝒚]
184
Canonical Correlation Analysis reminder
L(𝑼, 𝑽) = ‖𝐔ᵀ𝐗 − 𝐕ᵀ𝐘‖²_F
[Figure: Text 𝑿 → 𝑼 → 𝑯𝒙; Image 𝒀 → 𝑽 → 𝑯𝒚]
Canonical Time Warping
[Canonical Time Warping for Alignment of Human Behavior, Zhou and De la Torre, 2009]
Generalized Time warping
Weizmann
Subject 1: 40 frames
Subject 2: 44 frames
Subject 3: 43 frames
188
Alignment examples (multimodal)
L(𝜽₁, 𝜽₂, 𝑾𝒙, 𝑾𝒚) = ‖𝑓_{𝜽₁}(𝐗)𝐖_x − 𝑓_{𝜽₂}(𝐘)𝐖_y‖²_F
[Figure: an encoder reads the French words “le”, “chien”, “sur”, “la”, “plage”, each as a 1-of-N encoding, and produces a context representation]
Attention Model for Machine Translation
▪ Before, the encoder would just take the final hidden state; now we actually care about the intermediate hidden states
[Figure: at each decoding step, an attention module / gate combines the encoder hidden states 𝒉₁ … 𝒉₅ into a context 𝒛_t; together with the decoder hidden state 𝒔_t it produces the next output word (“Dog”, then “Dog on”, then “Dog on the”, …)]
198
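A sketch of the soft attention step used in these figures: score each encoder state against the current decoder state, take a softmax, and form the context as a weighted sum (dot-product scoring is one common choice; the vectors below are random stand-ins):

```python
import numpy as np

def attend(decoder_state, encoder_states):
    """encoder_states: (T, d); decoder_state: (d,). Returns (weights, context)."""
    scores = encoder_states @ decoder_state          # dot-product attention scores
    scores = scores - scores.max()                   # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()  # softmax over the T positions
    context = weights @ encoder_states               # z_t: weighted sum of h_1..h_T
    return weights, context

rng = np.random.default_rng(0)
H = rng.normal(size=(5, 8))        # encoder hidden states h_1 ... h_5
s = rng.normal(size=8)             # current decoder hidden state s_t
a, z = attend(s, H)
print(a.round(2), z.shape)         # attention distribution over 5 positions, context (8,)
```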
Attention Model for Image Captioning
[Figure: hidden states 𝒔₀, 𝒔₁, 𝒔₂ produce distributions 𝒂₁, 𝒂₂, 𝒂₃ over L image locations; the resulting contexts 𝒛₁, 𝒛₂ and previous words 𝒚₁, 𝒚₂ generate the output words 𝒅₁, 𝒅₂]
199
Attention Model for Image Captioning
Attention Model for Video Sequences
[Figure: a temporal attention gate a_t per frame weights how much each input 𝒙_t updates the hidden state — roughly 𝒉_t = (1 − a_t)·𝒉_{t−1} + a_t·(ReLU-based update of 𝒙_t and 𝒉_{t−1})]
202
Temporal Attention Gated Model (TAGM)
CCV dataset
▪ 20 video categories
▪ Biking, birthday, wedding etc.
[Bar chart: recognition accuracy on CCV for RNN, GRU, LSTM and TAGM (ours)]
205
Multimodal Fusion
206
Multimodal Fusion
Model-free approaches – early fusion
[Diagram: Modality 1, Modality 2, …, Modality n are combined and fed to a single classifier]
Model-free approaches – late fusion
[Diagram: one classifier per modality (Modality 1 … Modality n); their outputs are combined by a fusion mechanism]
▪ Train a unimodal predictor and a multimodal fusion one
▪ Requires multiple training stages
▪ Does not model low-level interactions between modalities
▪ Fusion mechanism can be voting, a weighted sum or an ML approach
Model free approaches – hybrid fusion
[Diagram: an early-fusion classifier over Modality 1 + Modality 2 is combined with per-modality classifiers through a fusion mechanism]
Multiple Kernel Learning
▪ Pick a family of kernels for each modality and learn which kernels are important for the classification task
▪ Generalizes the idea of Support Vector Machines
▪ Works equally well for unimodal and multimodal data; very little adaptation is needed
[Lanckriet 2004]
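A small sketch of the kernel-combination idea behind MKL: build one kernel per modality, combine them with weights (fixed here, learned jointly in true MKL), and train an SVM on the combined kernel via scikit-learn's precomputed-kernel interface (data and weights are made up):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel, linear_kernel

rng = np.random.default_rng(0)
X_audio = rng.normal(size=(40, 10))       # toy features, one row per sample
X_visual = rng.normal(size=(40, 30))
y = rng.integers(0, 2, size=40)

# One kernel per modality, combined with weights (these would be learned in MKL).
K = 0.6 * rbf_kernel(X_audio) + 0.4 * linear_kernel(X_visual)

clf = SVC(kernel="precomputed")
clf.fit(K, y)
print(clf.score(K, y))   # training accuracy on the toy data
```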
Multimodal Fusion for Sequential Data
Multi-View Hidden Conditional Random Field
▪ Modality-private structure: internal grouping of observations
▪ Modality-shared structure: interaction and synchrony
p(y | 𝒙^A, 𝒙^V; 𝜽) = Σ_{𝒉^A, 𝒉^V} p(y, 𝒉^A, 𝒉^V | 𝒙^A, 𝒙^V; 𝜽)
[Figure: a sentiment label y over audio hidden states h^A₁ … h^A₅ and visual hidden states h^V₁ … h^V₅ with observations 𝒙^A_t, 𝒙^V_t; example utterance: “We saw the yellow dog”]
212
Sequence Modeling with LSTM
[Figure: an LSTM chain mapping inputs 𝒙₁ … 𝒙_τ to outputs 𝒚₁ … 𝒚_τ]
213
Multimodal Sequence Modeling – Early Fusion
[Figure: the modalities are concatenated at each timestep and fed to a single LSTM chain producing 𝒚₁ … 𝒚_τ]
214
Multi-View Long Short-Term Memory (MV-LSTM)
[Figure: a Multi-View LSTM chain producing outputs 𝒚₁ … 𝒚_τ]
[Shyam, Morency, et al. Extending Long Short-Term Memory for Multi-View Structured Learning, ECCV, 2016]
215
Multi-View Long Short-Term Memory
[Figure: inside a MV-LSTM unit — each view's input 𝒙(1)_t, 𝒙(2)_t, 𝒙(3)_t has its own MV-sigm gates and memory partition]
[Shyam, Morency, et al. Extending Long Short-Term Memory for Multi-View Structured Learning, ECCV, 2016]
216
Topologies for Multi-View LSTM
Multi-view topologies (design parameters: α – memory from the current view, β – memory from the other views):
▪ View-specific: α = 1, β = 0
▪ Fully-connected: α = 1, β = 1
▪ Coupled: α = 0, β = 1
▪ Hybrid: α = 2/3, β = 1/3
[Figure: for each topology, which previous hidden states 𝒉(1)_{t−1}, 𝒉(2)_{t−1}, 𝒉(3)_{t−1} feed each view's gate 𝒈(k)_t in the MV-LSTM unit]
[Shyam, Morency, et al. Extending Long Short-Term Memory for Multi-View Structured Learning, ECCV, 2016]
217
Multi-View Long Short-Term Memory (MV-LSTM)
[Shyam, Morency, et al. Extending Long Short-Term Memory for Multi-View Structured Learning, ECCV, 2016]
218
Memory Based
219
Memory Based
[Zadeh et al., Memory Fusion Network for Multi-view Sequential Learning, AAAI 2018]
220
Multimodal Machine Learning
[Link]
221