
Tutorial on

Multimodal Machine Learning


Louis-Philippe (LP) Morency
Tadas Baltrusaitis

CMU Multimodal Communication and


Machine Learning Laboratory [MultiComp Lab]

1
Your Instructors

Louis-Philippe Morency
morency@[Link]

Tadas Baltrusaitis
tbaltrus@[Link]

2
CMU Course 11-777: Multimodal Machine Learning

3
Tutorial Schedule

▪ Introduction
▪ What is Multimodal?
▪ Historical view, multimodal vs multimedia
▪ Why multimodal
▪ Multimodal applications: image captioning, video
description, AVSR,…
▪ Core technical challenges
▪ Representation learning, translation, alignment, fusion
and co-learning
Tutorial Schedule

▪ Basic concepts – Part 1


▪ Linear models
▪ Score and loss functions, regularization
▪ Neural networks
▪ Activation functions, multi-layer perceptron
▪ Optimization
▪ Stochastic gradient descent, backpropagation
Tutorial Schedule

▪ Unimodal representations
▪ Visual representations
▪ Convolutional neural networks
▪ Acoustic representations
▪ Spectrograms, autoencoders
Tutorial Schedule

▪ Multimodal representations
▪ Joint representations
▪ Visual semantic spaces, multimodal autoencoder
▪ Tensor fusion representation
▪ Coordinated representations
▪ Similarity metrics, canonical correlation analysis
▪ Coffee break [20 mins]
Tutorial Schedule

▪ Basic concepts – Part 2


▪ Recurrent neural networks
▪ Long Short-Term Memory models
▪ Optimization
▪ Backpropagation through time
Tutorial Schedule

▪ Translation and alignment


▪ Translation applications
▪ Machine translation, image captioning
▪ Explicit alignment
▪ Dynamic time warping, deep canonical time warping
▪ Implicit alignment
▪ Attention models, multiple instance learning
▪ Temporal attention-gated model
Tutorial Schedule

▪ Multimodal fusion
▪ Model free approaches
▪ Early and late fusion, hybrid models
▪ Kernel-based fusion
▪ Multiple kernel learning
▪ Multimodal graphical models
▪ Factorial HMM, Multi-view Hidden CRF
▪ Multi-view LSTM model
What is
Multimodal?
11
What is Multimodal?

Multimodal distribution

➢ Multiple modes, i.e., distinct “peaks”


(local maxima) in the probability
density function
What is Multimodal?

Sensory Modalities
What is Multimodal?

Modality
The way in which something happens or is experienced.
• Modality refers to a certain type of information and/or the
representation format in which information is stored.
• Sensory modality: one of the primary forms of sensation,
as vision or touch; channel of communication.
Medium (“middle”)
A means or instrumentality for storing or communicating
information; system of communication/transmission.
• Medium is the means whereby this information is
delivered to the senses of the interpreter.

14
Examples of Modalities

 Natural language (both spoken and written)


 Visual (from images or videos)
 Auditory (including voice, sounds and music)
 Haptics / touch
 Smell, taste and self-motion
 Physiological signals
▪ Electrocardiogram (ECG), skin conductance
 Other modalities
▪ Infrared images, depth images, fMRI
Multimodal Communicative Behaviors

Verbal
▪ Lexicon: words
▪ Syntax: part-of-speech, dependencies
▪ Pragmatics: discourse acts

Vocal
▪ Prosody: intonation, voice quality
▪ Vocal expressions: laughter, moans

Visual
▪ Gestures: head, eye and arm gestures
▪ Body language: body posture, proxemics
▪ Eye contact: head gaze, eye gaze
▪ Facial expressions: FACS action units, smile, frowning
16
Multiple Communities and Modalities

Psychology Medical Speech Vision

Language Multimedia Robotics Learning


A Historical View
18
Prior Research on “Multimodal”

Four eras of multimodal research


➢ The “behavioral” era (1970s until late 1980s)

➢ The “computational” era (late 1980s until 2000)

➢ The “interaction” era (2000 - 2010)

➢ The “deep learning” era (2010s until …)


❖ Main focus of this tutorial

1970 1980 1990 2000 2010

19
The “Behavioral” Era (1970s until late 1980s)

Multimodal Behavior Therapy by Arnold Lazarus [1973]


➢ 7 dimensions of personality (or modalities)

Multi-sensory integration (in psychology):


• Multimodal signal detection: Independent decisions vs.
integration [1980]
• Infants' perception of substance and temporal synchrony
in multimodal events [1983]
• A multimodal assessment of behavioral and cognitive deficits in
abused and neglected preschoolers [1984]

 TRIVIA: Geoffrey Hinton received his B.A. in Psychology ☺

1970 1980 1990 2000 2010

20
Language and Gestures

David McNeill
University of Chicago
Center for Gesture and Speech Research

“For McNeill, gestures are in effect the speaker’s


thought in action, and integral components of speech,
not merely accompaniments or additions.”

1970 1980 1990 2000 2010

21
The McGurk Effect (1976)

Hearing lips and seeing voices – Nature

1970 1980 1990 2000 2010

22
➢ The “Computational” Era (Late 1980s until 2000)

1) Audio-Visual Speech Recognition (AVSR)


• Motivated by the McGurk effect
• First AVSR System in 1986
“Automatic lipreading to enhance speech recognition”
• Good survey paper [2002]
“Recent Advances in the Automatic Recognition of
Audio-Visual Speech”

 TRIVIA: The first multimodal deep learning paper was about


audio-visual speech recognition [ICML 2011]

1970 1980 1990 2000 2010

24
➢ The “Computational” Era (Late 1980s until 2000)

1) Audio-Visual Speech Recognition (AVSR)

1970 1980 1990 2000 2010

25
➢ The “Computational” Era (Late 1980s until 2000)

2) Multimodal/multisensory interfaces
• Multimodal Human-Computer Interaction (HCI)
“Study of how to design and evaluate new computer
systems where humans interact through multiple
modalities, including both input and output modalities.”

1970 1980 1990 2000 2010

26
➢ The “Computational” Era (Late 1980s until 2000)

2) Multimodal/multisensory interfaces

Glove-Talk: a neural network interface between a data-glove and a
speech synthesizer, by Sidney Fels & Geoffrey Hinton [CHI’95]

1970 1980 1990 2000 2010

27
➢ The “Computational” Era (Late 1980s until 2000)

2) Multimodal/multisensory interfaces
Rosalind Picard
Affective Computing is
computing that relates to, arises
from, or deliberately influences
emotion or other affective
phenomena.

1970 1980 1990 2000 2010

28
➢ The “Computational” Era (Late 1980s until 2000)

3) Multimedia Computing

[1994-2010]

“The Informedia Digital Video Library Project automatically combines speech,


image and natural language understanding to create a full-content searchable
digital video library.”

1970 1980 1990 2000 2010

29
➢ The “Computational” Era (Late 1980s until 2000)

3) Multimedia Computing
Multimedia content analysis
▪ Shot-boundary detection (1991 - )
▪ Parsing a video into continuous camera shots
▪ Still and dynamic video abstracts (1992 - )
▪ Making video browsable via representative frames (keyframes)
▪ Generating short clips carrying the essence of the video content
▪ High-level parsing (1997 - )
▪ Parsing a video into semantically meaningful segments
▪ Automatic annotation (indexing) (1999 - )
▪ Detecting prespecified events/scenes/objects in video

1970 1980 1990 2000 2010

30
Multimodal Computation Models

• Hidden Markov Models [1960s]


[Figure: Hidden Markov Model — hidden states h1…h4 with observations x1…x4]

 Factorial Hidden Markov Models [1996]
 Coupled Hidden Markov Models [1997]

1970 1980 1990 2000 2010

31
Multimodal Computation Models

• Artificial Neural Networks [1940s]

 Backpropagation [1975]
 Convolutional neural networks [1980s]

1970 1980 1990 2000 2010

32
➢ The “Interaction” Era (2000s)

1) Modeling Human Multimodal Interaction


AMI Project [2001-2006, IDIAP]
• 100+ hours of meeting recordings
• Fully synchronized audio-video
• Transcribed and annotated

CHIL Project [Alex Waibel]


• Computers in the Human Interaction Loop
• Multi-sensor multimodal processing
• Face-to-face interactions

 TRIVIA: Samy Bengio started at IDIAP working on AMI project

1970 1980 1990 2000 2010

33
➢ The “Interaction” Era (2000s)

1) Modeling Human Multimodal Interaction


CALO Project [2003-2008, SRI]
• Cognitive Assistant that Learns and Organizes
• Personalized Assistant that Learns (PAL)
• Siri was a spinoff from this project

SSP Project [2008-2011, IDIAP]


• Social Signal Processing
• First coined by Sandy Pentland in 2007
• Great dataset repository: [Link]

 TRIVIA: LP’s PhD research was partially funded by CALO ☺

1970 1980 1990 2000 2010

34
➢ The “Interaction” Era (2000s)

2) Multimedia Information Retrieval


“Yearly competition to
promote progress in
content-based retrieval
from digital video via open,
metrics-based evaluation”
[Hosted by NIST, 2001-2016]

Research tasks and challenges:


• Shot boundary, story segmentation, search
• “High-level feature extraction”: semantic event detection
• Introduced in 2008: copy detection and surveillance events
• Introduced in 2010: Multimedia event detection (MED)

1970 1980 1990 2000 2010

35
Multimodal Computational Models

▪ Dynamic Bayesian Networks


▪ Kevin Murphy’s PhD thesis and Matlab toolbox
▪ Asynchronous HMM for multimodal [Samy Bengio, 2007]

Audio-visual
speech
segmentation

1970 1980 1990 2000 2010


Multimodal Computational Models

▪ Discriminative sequential models


▪ Conditional random fields [Lafferty et al., 2001]

▪ Latent-dynamic CRF [Morency et al., 2007]

1970 1980 1990 2000 2010


➢ The “deep learning” era (2010s until …)

Representation learning (a.k.a. deep learning)


• Multimodal deep learning [ICML 2011]
• Multimodal Learning with Deep Boltzmann Machines [NIPS 2012]
• Visual attention: Show, Attend and Tell: Neural Image Caption
Generation with Visual Attention [ICML 2015]

Key enablers for multimodal research:


• New large-scale multimodal datasets
• Faster computers and GPUs
• High-level visual features
• “Dimensional” linguistic features

Our tutorial focuses on this era!

1970 1980 1990 2000 2010

38
➢ The “deep learning” era (2010s until …)

Many new challenges and multimodal corpora !!


Audio-Visual Emotion Challenge (AVEC, 2011- )
• Emotional dimension estimation
• Standardized training and test sets
• Based on the SEMAINE dataset

Emotion Recognition in the Wild Challenge (EmotiW 2013- )


• Categorical emotion recognition from audio-video clips
• Standardized training and test sets
• Based on the AFEW dataset (movie clips)

1970 1980 1990 2000 2010

39
➢ The “deep learning” era (2010s until …)

Renewal of multimedia content analysis!


▪ Image captioning

▪ Video description
▪ Visual question answering

1970 1980 1990 2000 2010

40
Real-World Tasks Tackled by Multimodal Research

▪ Affect recognition
▪ Emotion
▪ Persuasion
▪ Personality traits
▪ Media description
▪ Image captioning
▪ Video captioning
▪ Visual Question Answering
▪ Event recognition
▪ Action recognition
▪ Segmentation
▪ Multimedia information retrieval
▪ Content based/Cross-media
Core Technical
Challenges
42
Core Challenges in “Deep” Multimodal ML

Tadas Baltrusaitis, Chaitanya Ahuja,


and Louis-Philippe Morency

[Link]

These challenges are non-exclusive.


43
Core Challenge 1: Representation

Definition: Learning how to represent and summarize multimodal data in a way
that exploits the complementarity and redundancy of the modalities.

A Joint representations:
Representation

Modality 1 Modality 2

44
Joint Multimodal Representation

“I like it!” Joyful tone

“Wow!”

Tensed voice
Joint Representation
(Multimodal Space)

45
Joint Multimodal Representations

Audio-visual speech recognition
[Ngiam et al., ICML 2011]
• Bimodal Deep Belief Network

Image captioning
[Srivastava and Salakhutdinov, NIPS 2012]
• Multimodal Deep Boltzmann Machine

Audio-visual emotion recognition
[Kim et al., ICASSP 2013]
• Deep Boltzmann Machine

46
Multimodal Vector Space Arithmetic

[Kiros et al., Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models, 2014]

47
Core Challenge 1: Representation

Definition: Learning how to represent and summarize multimodal data in a way
that exploits the complementarity and redundancy of the modalities.

A Joint representations: B Coordinated representations:


Representation Repres. 1 Repres 2

Modality 1 Modality 2 Modality 1 Modality 2

48
Coordinated Representation: Deep CCA

Learn linear projections that are maximally correlated:

u*, v* = argmax_{u,v} corr(uᵀX, vᵀY)

[Figure: deep networks W_x, W_y map text X and image Y to views H_x, H_y, followed by CCA projections U, V.]
Andrew et al., ICML 2013

49
Core Challenge 2: Alignment

Definition: Identify the direct relations between (sub)elements from two or


more different modalities.
A Explicit Alignment
The goal is to directly find correspondences between elements of different modalities.

B Implicit Alignment
Uses internally latent alignment of modalities in order to better solve a different problem.

[Figure: elements t1…tn of Modality 1 matched to elements of Modality 2 by a fancy algorithm.]

50
Implicit Alignment

Karpathy et al., Deep Fragment Embeddings for Bidirectional Image Sentence Mapping,
[Link]

51
Attention Models for Image Captioning

[Figure: "Show, Attend and Tell" attention model — at each step the decoder state s_t produces a distribution a_t over L image locations, the expectation over the D-dimensional location features gives a context vector z_t, and the decoder emits the next output word y_t.]

52
Core Challenge 3: Fusion

Definition: To join information from two or more modalities to perform a


prediction task.

A Model-Agnostic Approaches

1) Early fusion: concatenate the Modality 1 and Modality 2 features, then apply a single classifier.
2) Late fusion: apply one classifier per modality, then combine their predictions.

53
Core Challenge 3: Fusion

Definition: To join information from two or more modalities to perform a


prediction task.

B Model-Based (Intermediate) Approaches

1) Deep neural networks
2) Kernel-based methods (e.g., multiple kernel learning)
3) Graphical models (e.g., a Multi-View Hidden CRF with audio hidden states h_t^A and visual hidden states h_t^V over observations x_t^A, x_t^V)

54
Core Challenge 4: Translation

Definition: Process of changing data from one modality to another, where the
translation relationship can often be open-ended or subjective.

A Example-based B Model-driven

55
Core Challenge 4 – Translation

Input: transcriptions + audio streams → Output: visual gestures (both speaker and listener gestures)
Marsella et al., Virtual character performance from speech, SIGGRAPH/Eurographics
Symposium on Computer Animation, 2013
Core Challenge 5: Co-Learning

Definition: Transfer knowledge between modalities, including their


representations and predictive models.

Prediction

Modality 1 Modality 2
Help during
training

57
Core Challenge 5: Co-Learning

A Parallel B Non-Parallel C Hybrid

58
Taxonomy of Multimodal Research [ [Link] ]

Representation
▪ Joint
  o Neural networks
  o Graphical models
  o Sequential
▪ Coordinated
  o Similarity
  o Structured

Translation
▪ Example-based
  o Retrieval
  o Combination
▪ Model-based
  o Grammar-based
  o Encoder-decoder
  o Online prediction

Alignment
▪ Explicit
  o Unsupervised
  o Supervised
▪ Implicit
  o Graphical models
  o Neural networks

Fusion
▪ Model agnostic
  o Early fusion
  o Late fusion
  o Hybrid fusion
▪ Model-based
  o Kernel-based
  o Graphical models
  o Neural networks

Co-learning
▪ Parallel data
  o Co-training
  o Transfer learning
▪ Non-parallel data
  o Zero-shot learning
  o Concept grounding
  o Transfer learning
▪ Hybrid data
  o Bridging
Tadas Baltrusaitis, Chaitanya Ahuja, and Louis-Philippe Morency, Multimodal Machine Learning: A Survey and Taxonomy
Multimodal Applications [ [Link] ]

Tadas Baltrusaitis, Chaitanya Ahuja, and Louis-Philippe Morency, Multimodal Machine Learning: A Survey and Taxonomy
Basic Concepts:
Score and Loss
Functions
61
Linear Classification (e.g., neural network)

Image

?
(Size: 32*32*3)

1. Define a (linear) score function


2. Define the loss function (possibly nonlinear)
3. Optimization

62
1) Score Function

Image (size: 32*32*3) → What should be the prediction score for each label class (duck, cat, dog, pig, bird)?

For a linear classifier:

f(x_i; W, b) = W x_i + b

where x_i is the input observation (the i-th element of the dataset, [3072x1]), W are the weights [10x3072], b is the bias vector [10x1], and the output is the vector of class scores [10x1]. In total there are [10x3073] parameters.
63
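As a concrete illustration, a minimal NumPy sketch of this linear score function (illustrative code, not from the tutorial; the array shapes follow the CIFAR-10 example above):

```python
import numpy as np

num_classes, input_dim = 10, 32 * 32 * 3   # CIFAR-10 sized example

W = 0.01 * np.random.randn(num_classes, input_dim)  # weights [10 x 3072]
b = np.zeros(num_classes)                           # bias vector [10]

def score(x_i, W, b):
    """Linear score function f(x_i; W, b) = W x_i + b."""
    return W @ x_i + b

x_i = np.random.rand(input_dim)     # one flattened image
print(score(x_i, W, b).shape)       # (10,) -- one score per class
```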
Interpreting a Linear Classifier

The planar decision surface in data-space for the simple linear discriminant function W x_i + b > 0.

[Figure: decision hyperplane f(x) = 0 with normal vector w; its distance from the origin is -b/‖w‖.]

64
Some Notation Tricks – Multi-Label Classification

W = [W_1; W_2; … ; W_N]  (one row of weights per label)

f(x_i; W, b) = W x_i + b   →   f(x_i; W) = W x_i

Weights × input + bias ([10x3072] [3072x1] + [10x1]) becomes weights × input ([10x3073] [3073x1]):
the bias vector becomes the last column of the weight matrix, and a "1" is appended at the end of the input observation vector.
65
Some Notation Tricks

General formulation of a linear classifier: f(x_i; W, b)

"dog" linear classifier: f(x_i; W_dog, b_dog), or f(x_i; W, b)_dog, or f_dog

Linear classifier for label j: f(x_i; W_j, b_j), or f(x_i; W, b)_j, or f_j
66
Interpreting Multiple Linear Classifiers

f(x_i; W_j, b_j) = W_j x_i + b_j

[Figure: per-class classifiers f_car, f_airplane, f_deer learned on the CIFAR-10 object recognition dataset.]
67
Linear Classification: 2) Loss Function
(or cost function or objective)

Given an image x_i (size: 32*32*3) with label y_i = 2 (dog), the scores f(x_i; W) for this multi-class problem are:

0 (duck): -12.3
1 (cat): 45.6
2 (dog): 98.7
3 (pig): 12.2
4 (bird): -45.3

How do we assign a single number L_i representing how "unhappy" we are about these scores?

The loss function quantifies the amount by which the prediction scores deviate from the actual values.
A first challenge: how to normalize the scores?
68
First Loss Function: Cross-Entropy Loss
(or logistic loss)

Logistic function: σ(f) = 1 / (1 + e^(-f))

[Plot: σ(f) rises from 0 to 1, crossing 0.5 at f = 0; f is the score function.]
69
First Loss Function: Cross-Entropy Loss
(or logistic loss)

Logistic function: σ(f) = 1 / (1 + e^(-f))

Logistic regression (two classes): p(y_i = "dog" | x_i; w) = σ(wᵀ x_i) for a two-class problem.

[Plot: σ(f) rises from 0 to 1, crossing 0.5 at f = 0; f is the score function.]
70
First Loss Function: Cross-Entropy Loss
(or logistic loss)

Logistic function: σ(f) = 1 / (1 + e^(-f))

Logistic regression (two classes): p(y_i = "dog" | x_i; w) = σ(wᵀ x_i) for a two-class problem.

Softmax function (multiple classes): p(y_i | x_i; W) = e^(f_{y_i}) / Σ_j e^(f_j)

71
First Loss Function: Cross-Entropy Loss
(or logistic loss)
Cross-entropy loss (softmax function inside the log):

L_i = -log( e^(f_{y_i}) / Σ_j e^(f_j) )

This is minimizing the negative log likelihood of the correct class.

72
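A minimal NumPy sketch of the softmax cross-entropy loss for one example (illustrative only; the scores reuse the duck/cat/dog example above):

```python
import numpy as np

def cross_entropy_loss(f, y):
    """L_i = -log( e^{f_y} / sum_j e^{f_j} ), computed in a numerically stable way."""
    f = f - np.max(f)                 # shifting scores does not change the softmax
    log_probs = f - np.log(np.sum(np.exp(f)))
    return -log_probs[y]

scores = np.array([-12.3, 45.6, 98.7, 12.2, -45.3])  # duck, cat, dog, pig, bird
print(cross_entropy_loss(scores, y=2))               # small loss: "dog" already dominates
```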
Second Loss Function: Hinge Loss
(or max-margin loss or Multi-class SVM loss)

L_i = Σ_{j ≠ y_i} max(0, f_j - f_{y_i} + Δ)

i.e., the loss for example i sums, over all incorrect labels j, the (thresholded) difference between the incorrect class score f_j and the correct class score f_{y_i}, with margin Δ.

73
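A matching NumPy sketch of the multi-class hinge (SVM) loss, assuming margin Δ = 1 (illustrative only):

```python
import numpy as np

def hinge_loss(f, y, delta=1.0):
    """L_i = sum_{j != y} max(0, f_j - f_y + delta)."""
    margins = np.maximum(0.0, f - f[y] + delta)
    margins[y] = 0.0                  # do not count the correct class
    return np.sum(margins)

scores = np.array([-12.3, 45.6, 98.7, 12.2, -45.3])
print(hinge_loss(scores, y=2))        # 0.0 -- correct class wins by more than the margin
```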
Second Loss Function: Hinge Loss
(or max-margin loss or Multi-class SVM loss)

e.g. 10

Example:

How to find the optimal W?


74
Basic Concepts:
Neural Networks
75
Neural Networks – inspiration

▪ Made up of artificial neurons


Neural Networks – score function

▪ Made up of artificial neurons


▪ Linear function (dot product) followed by a nonlinear
activation function
▪ Example: a multi-layer perceptron
Basic NN building block

▪ Weighted sum followed by an activation function


Input

Weighted sum
𝑊𝑥 + 𝑏

Activation function

Output

𝑦 = 𝑓(𝑊𝑥 + 𝑏)
Neural Networks – activation function

▪ Tanh: f(x) = tanh(x)

▪ Sigmoid: f(x) = (1 + e^(-x))^(-1)

▪ Linear: f(x) = ax + b

▪ ReLU: f(x) = max(0, x), with smooth approximation (softplus) log(1 + exp(x))


▪ Rectifier Linear Units
▪ Faster training - no gradient vanishing
▪ Induces sparsity
Neural Networks – loss function

▪ Already discussed: cross-entropy, Euclidean loss, cosine similarity, etc.
▪ Combine it with the score function to get an end-to-end training objective
▪ As an example, use the Euclidean loss for data point i:

L_i = (f(x_i) - y_i)^2 = ( f_{3;W3}( f_{2;W2}( f_{1;W1}(x_i) ) ) - y_i )^2

▪ The full loss is computed across all training samples:

L = Σ_i (f(x_i) - y_i)^2
Multi-Layer Feedforward Network
Activation functions (individual layers):
f_{1;W1}(x) = σ(W_1 x + b_1)
f_{2;W2}(x) = σ(W_2 x + b_2)
f_{3;W3}(x) = σ(W_3 x + b_3)

Score function:
ŷ_i = f(x_i) = f_{3;W3}( f_{2;W2}( f_{1;W1}(x_i) ) )

Loss function (e.g., Euclidean loss):
L_i = (f(x_i) - y_i)^2

[Figure: input x_i passes through three layers W_1, W_2, W_3 to produce ŷ_i.]
81
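A small NumPy sketch of this three-layer score function and Euclidean loss (illustrative; the layer sizes are arbitrary assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
sizes = [8, 16, 16, 1]                       # input, two hidden layers, output (assumed)
Ws = [rng.normal(0, 0.1, (m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
bs = [np.zeros(m) for m in sizes[1:]]

def f(x):
    """Score function: y = f3(f2(f1(x))), each layer = sigmoid(Wx + b)."""
    h = x
    for W, b in zip(Ws, bs):
        h = sigmoid(W @ h + b)
    return h

x_i, y_i = rng.random(8), np.array([0.7])
L_i = np.sum((f(x_i) - y_i) ** 2)            # Euclidean loss for one data point
print(L_i)
```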
Basic Concepts:
Optimization
82
Optimizing a generic function

▪ We want to find a minimum (or maximum) of a


generic function
▪ How do we do that?
▪ Searching everywhere (global optimum) is
computationally infeasible
▪ We could search randomly from our starting point
(mostly picked at random) – impractical and not
accurate
▪ Instead we can follow the gradient
What is a gradient?

▪ Geometrically
▪ Points in the direction of the greatest rate of increase of the function and
its magnitude is the slope of the graph in that direction

▪ More formally, in 1D:

df(x)/dx = lim_{h→0} [ f(x + h) - f(x) ] / h

▪ In higher dimensions:

∂f/∂x_i (a_1, …, a_n) = lim_{h→0} [ f(a_1, …, a_i + h, …, a_n) - f(a_1, …, a_i, …, a_n) ] / h

➢ In multiple dimensions, the gradient is the vector of partial derivatives; for a vector-valued function, the matrix of partial derivatives is called the Jacobian.
Gradient Computation

Chain rule, for y = f(h) and h = g(x):

∂y/∂x = (∂y/∂h) (∂h/∂x)

85
Optimization: Gradient Computation

Multiple-path chain rule, for y = f(h_1, h_2, h_3) and h_j = g(x):

∂y/∂x = Σ_j (∂y/∂h_j) (∂h_j/∂x)

86
Optimization: Gradient Computation

Multiple-path chain rule, for y = f(h_1, h_2, h_3) and h_j = g(x_1, x_2, x_3):

∂y/∂x_1 = Σ_j (∂y/∂h_j) (∂h_j/∂x_1)
∂y/∂x_2 = Σ_j (∂y/∂h_j) (∂h_j/∂x_2)
∂y/∂x_3 = Σ_j (∂y/∂h_j) (∂h_j/∂x_3)

87
Optimization: Gradient Computation

Vector representation, for y = f(h) and h = g(x):

Gradient: ∇_x y = [ ∂y/∂x_1, ∂y/∂x_2, ∂y/∂x_3 ]

∇_x y = (∂h/∂x)ᵀ ∇_h y

where ∂h/∂x is the "local" Jacobian (a matrix of size h × x computed using partial derivatives) and ∇_h y is the "backprop" gradient.

88
Backpropagation Algorithm (efficient gradient)

Forward pass
▪ Following the graph topology, compute the value of each unit:
  h_1 = f(x; W_1), h_2 = f(h_1; W_2), z = matmult(h_2, W_3), L = -log P(Y = y | z) (cross-entropy)

Backpropagation pass
▪ Initialize the output gradient = 1
▪ Compute each "local" Jacobian matrix using values from the forward pass
▪ Use the chain rule: gradient = "local" Jacobian × "backprop" gradient

89
How to follow the gradient

▪ Many methods for optimization


▪ Gradient Descent (actually the “simplest” one)
▪ Newton methods (use Hessian – second derivative)
▪ Quasi-Newton (use approximate Hessian)
▪ BFGS
▪ LBFGS
▪ Don’t require learning rates (fewer hyperparameters)
▪ But, do not work with stochastic and batch methods so
rarely used to train modern Neural Networks
▪ All of them look at the gradient
▪ Very few non gradient based optimization methods
Parameter Update Strategies

Gradient descent:

θ^(t+1) = θ^(t) - ε_k ∇_θ L

where θ^(t+1) are the new model parameters, θ^(t) the previous parameters, ∇_θ L the gradient of our loss function, and ε_k the learning rate at iteration k.

ε_k = (1 - α) ε_0 + α ε_τ    (decay the learning rate linearly until iteration τ, from initial rate ε_0 to final rate ε_τ)

Extensions:
▪ Stochastic ("mini-batch") gradient descent
▪ Momentum
▪ AdaGrad
▪ RMSProp
91
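A minimal sketch of this update rule with linear learning-rate decay (illustrative; the loss gradient is a stand-in function):

```python
import numpy as np

def lr_schedule(k, tau=100, eps0=0.1, eps_tau=0.001):
    """Linearly decay the learning rate until iteration tau, then keep it constant."""
    alpha = min(k / tau, 1.0)
    return (1 - alpha) * eps0 + alpha * eps_tau

def grad_L(theta):
    return 2 * theta          # stand-in gradient, e.g. for L(theta) = ||theta||^2

theta = np.array([1.0, -2.0])
for k in range(200):
    theta = theta - lr_schedule(k) * grad_L(theta)   # theta_{t+1} = theta_t - eps_k * grad
print(theta)                  # close to the minimum at 0
```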
Unimodal
representations:
Language Modality
92
Unimodal Classification – Language Modality
Word-level classification

Input observation x_i is a "one-hot" vector whose dimensionality equals the number of words in the dictionary (a single 1 at the index of the word, 0 elsewhere), for written or spoken language.

Word-level prediction tasks:
▪ Part-of-speech (noun, verb, …)?
▪ Sentiment (positive or negative)?
▪ Named entity (names of persons, …)?

93
Unimodal Classification – Language Modality
Document-level classification

Input observation x_i is a "bag-of-words" vector whose dimensionality equals the number of words in the dictionary (counts/indicators of the words appearing in the document), for written or spoken language.

Document-level prediction task:
▪ Sentiment (positive or negative)?

94
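A small sketch of both representations over a toy vocabulary (illustrative; the vocabulary is an assumption):

```python
import numpy as np

vocab = ["he", "was", "walking", "running", "away", "because", "movie", "great"]
word_to_idx = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Word-level input: a single 1 at the word's index."""
    x = np.zeros(len(vocab))
    x[word_to_idx[word]] = 1.0
    return x

def bag_of_words(document):
    """Document-level input: counts of each vocabulary word in the document."""
    x = np.zeros(len(vocab))
    for w in document:
        if w in word_to_idx:
            x[word_to_idx[w]] += 1.0
    return x

print(one_hot("walking"))
print(bag_of_words("he was walking away because".split()))
```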
How to Learn (Better) Language Representations?

Distributional hypothesis: approximate the word meaning by its surrounding words.

Words used in a similar context will lie close together

He was walking away because …


He was running away because …

Instead of capturing co-occurrence counts directly,


predict surrounding words of every word
How to Learn (Better) Language Representations?

No activation function -> very fast

[Figure: a word2vec-style network — the one-hot input word (e.g., "walking", 100,000-d) is projected by W1 to a 300-d hidden vector, and W2 predicts the surrounding one-hot words ("He", "was", "away", "because").]

He was walking away because …
He was running away because …
Word2vec algorithm: [Link]
How to use these word representations

If we had a vocabulary of 100,000 words:

Classic NLP: 100,000-dimensional one-hot vectors
Walking: [0; 0; 0; 0; …; 0; 0; 1; 0; …; 0; 0]
Running: [0; 0; 0; 0; …; 0; 0; 0; 0; …; 1; 0]
Similarity = 0.0

Transform: x' = x·W1 (100,000-d → 300-d)

Goal: 300-dimensional dense vectors
Walking: [0.1; 0.0003; 0; …; 0.02; 0.08; 0.05]
Running: [0.1; 0.0004; 0; …; 0.01; 0.09; 0.05]
Similarity = 0.9
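A sketch of the point being made: one-hot vectors of different words are always orthogonal, while projected dense embeddings can be similar (the embedding matrix here is random and the vocabulary is shrunk, purely for illustration — a trained word2vec matrix would place "walking" and "running" close together):

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

vocab_size, embed_dim = 10_000, 300              # shrunk from 100,000 for the example
rng = np.random.default_rng(0)
W1 = rng.normal(size=(vocab_size, embed_dim))    # would normally be learned by word2vec

walking = np.zeros(vocab_size); walking[42] = 1.0
running = np.zeros(vocab_size); running[731] = 1.0

print(cosine(walking, running))                  # 0.0 -- one-hot vectors are orthogonal
print(cosine(walking @ W1, running @ W1))        # nonzero; trained embeddings of related
                                                 # words would be close (e.g. ~0.9)
```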
Vector space models of words

While learning these word representations, we are


actually building a vector space in which all words
reside with certain relationships between them

Encodes both syntactic and semantic relationships

This vector space allows for algebraic operations:

Vec(king) – vec(man) + vec(woman) ≈ vec(queen)


Vector space models of words: semantic relationships

Trained on the Google news corpus with over 300 billion words
Mikolov et al., “Distributed Representations of Words and Phrases and their Compositionality”, NIPS 2013
Unimodal
representations:
Visual Modality
100
Visual Descriptors

Hand-crafted descriptors: image gradients, edge detection, histograms of oriented gradients (HOG), LBP, SIFT descriptors, optical flow, Gabor jets.
Why use Convolutional Neural Networks
▪ Using basic multi-layer perceptrons does not work well for images
▪ The intention is to build a more abstract representation as we go up every layer:
  input pixels → edges/blobs → parts → objects
Why not just use an MLP for images (1)?

▪ MLP connects each pixel in an image to each


neuron
▪ Does not exploit redundancy in image structure
▪ Detecting edges, blobs
▪ Don’t need to treat the top left of image
differently from the center
▪ Too many parameters
▪ For a small 200 × 200 pixel RGB image the first
matrix would have 120000 × 𝑛 parameters for
the first layer alone
▪ MLP does not exploit translation invariance
▪ MLP does not necessarily encourage visual
abstraction
Main differences of CNN from MLP

▪ Addition of:
▪ Convolution layer
▪ Pooling layer
▪ Everything else is the same (loss, score and
optimization)
▪ MLP layer is called Fully Connected layer
Convolution in 2D

[Figure: input image ∗ kernel = feature map]
Fully connected layer

▪ Weighted sum followed by an activation function


Input

Weighted sum
𝑊𝑥 + 𝑏

Activation function

Output

𝑦 = 𝑓(𝑊𝑥 + 𝑏)
Convolution as MLP (1)

▪ Remove activation
Input

Weighted sum
𝑊𝑥 + 𝑏 Kernel 𝒘𝟏 𝒘𝟐 𝒘𝟑

𝑦 = 𝑊𝑥 + 𝑏
Convolution as MLP (2)

▪ Remove redundant links making the matrix W sparse


(optionally remove the bias term)
Input

Weighted sum
𝑊𝑥 Kernel 𝒘𝟏 𝒘𝟐 𝒘𝟑

𝑦 = 𝑊𝑥
Convolution as MLP (3)

▪ We can also share the weights in matrix W not to do


redundant computation
Input

Weighted sum
𝑊𝑥 Kernel 𝒘𝟏 𝒘𝟐 𝒘𝟑

𝑦 = 𝑊𝑥
How do we do convolution in MLP recap

▪ Not a fully connected layer 𝒘𝟏 𝒘𝟐 𝒘𝟑


anymore
▪ Shared weights
▪ Same colour indicates same
(shared) weight

W =
[ w1 w2 w3 0  0  0
  0  w1 w2 w3 0  0
  0  0  w1 w2 w3 0
  ⋮              ⋱
  0  0  0  w1 w2 w3 ]

(a banded matrix in which every row applies the same kernel [w1 w2 w3], shifted by one position)
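A small NumPy check of this equivalence (an illustrative sketch): building the banded, weight-shared matrix W for a 1-D kernel and comparing it with a direct "valid" correlation.

```python
import numpy as np

def conv_as_matrix(kernel, input_len):
    """Build the sparse, weight-shared matrix W so that W @ x matches a 'valid' correlation."""
    k, out_len = len(kernel), input_len - len(kernel) + 1
    W = np.zeros((out_len, input_len))
    for i in range(out_len):
        W[i, i:i + k] = kernel      # same (shared) weights on every row, shifted by one
    return W

x = np.array([1.0, 2.0, 0.0, -1.0, 3.0, 1.0])
kernel = np.array([0.5, 1.0, -0.5])      # [w1, w2, w3]

W = conv_as_matrix(kernel, len(x))
print(W @ x)                                         # matrix view of the convolution
print(np.correlate(x, kernel, mode="valid"))         # same result
```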
Pooling layer

▪ Used for sub-sampling

Pick the maximum value from the input, or use a smooth and differentiable approximation such as:

y = Σ_{i=1}^{n} x_i e^(α x_i) / Σ_{i=1}^{n} e^(α x_i)

(a softmax-weighted average that approaches the max as α grows)
Example: AlexNet Model

▪ Used for object classification task


▪ 1000 way classification task – pick one
Unimodal
representations:
Acoustic Modality
113
Unimodal Classification – Acoustic Modality
Digitalized acoustic signal

Input observation x_i: a vector of digitized signal samples (and/or spectrogram values) for one time window.

• Sampling rates: 8–96 kHz
• Bit depth: 8, 16 or 24 bits
• Time window size: 20 ms
• Offset: 10 ms

Spectrogram
114
Unimodal Classification – Acoustic Modality
Digitalized acoustic signal

Input observation x_i: a vector of digitized signal samples (and/or spectrogram values) for one time window.

• Sampling rates: 8–96 kHz
• Bit depth: 8, 16 or 24 bits
• Time window size: 20 ms
• Offset: 10 ms

Acoustic prediction tasks:
▪ Emotion?
▪ Spoken word?
▪ Voice quality?

Spectrogram
115
Audio representation for speech recognition

▪ Speech recognition systems historically much more


complex than vision systems – language models,
vocabularies etc.
▪ Large breakthrough of using representation learning
instead of hand-crafted features
▪ [Hinton et al., Deep Neural Networks for Acoustic Modeling in Speech
Recognition: The Shared Views of Four Research Groups, 2012]
▪ A huge boost in performance (up to 30% on some
datasets)
Autoencoders

▪ What does auto mean?
  ▪ Greek for self – self-encoding
▪ A feed-forward network intended to reproduce the input
▪ Two parts, an encoder and a decoder:
  ▪ x' = g(f(x)) – score function
  ▪ f – encoder (x → h)
  ▪ g – decoder (h → x')

[Figure: input x_1…x_n encoded to hidden units h_1…h_k, then decoded back to x'_1…x'_n.]
Autoencoders

▪ Mostly follows the neural-network structure
  ▪ A matrix multiplication followed by a sigmoid: encoder f = σ(W x), decoder g = σ(W* h)
  ▪ The decoder activation depends on the type of x:
    ▪ Sigmoid for binary inputs
    ▪ Linear for real-valued inputs
▪ Often we use tied weights to force the sharing of weights in encoder/decoder:
  ▪ W* = Wᵀ
▪ word2vec is actually a bit similar to
autoencoder (except for the auto part)
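A minimal NumPy sketch of a tied-weight autoencoder forward pass and reconstruction loss (illustrative; the sizes and linear decoder output are assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n, k = 20, 5                                   # input and hidden sizes (assumed)
rng = np.random.default_rng(0)
W = rng.normal(0, 0.1, (k, n))                 # encoder weights; decoder uses W.T (tied)

def autoencode(x):
    h = sigmoid(W @ x)                         # encoder: f(x) = sigma(W x)
    x_rec = W.T @ h                            # decoder with tied weights W* = W^T (linear output)
    return x_rec, h

x = rng.random(n)
x_rec, h = autoencode(x)
loss = np.sum((x_rec - x) ** 2)                # reconstruction (Euclidean) loss
print(h.shape, loss)
```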
Denoising autoencoder

▪ Simple idea:
  ▪ Add noise to the input x (giving x̂) but learn to reconstruct the original x (the loss compares x' with the clean x)
▪ Leads to a more robust representation and prevents copying
▪ Learns which relationships are needed to represent a certain x
▪ Different noise is added during each epoch
Stacked autoencoders

▪ Can stack autoencoders as well
▪ Each encoding unit has a corresponding decoder
▪ Inference is a feed-forward pass as before, but now with more hidden layers (x → h_1 → h_2 → h'_1 → x')
Stacked autoencoders

▪ Greedy layer-wise training


▪ Start with training first layer
▪ Learn to encode 𝒙 to 𝒉𝟏 and
to decode 𝒙 from 𝒉𝟏
▪ Use backpropagation
𝒙′
Decoder
𝒉𝟏
Encoder
𝒙
Stacked autoencoders

▪ Map all x's to h_1's (keep the first-layer encoder fixed, discard its decoder for now)
▪ Train the second layer
  ▪ Learn to encode h_1 to h_2 and to decode h_1 from h_2
▪ Repeat for as many layers as desired
Stacked autoencoders

▪ Reconstruct x using the previously learned encoder/decoder mappings
▪ Fine-tune the full network end-to-end
Stacked denoising autoencoders

▪ Can extend this to a denoising model
▪ Add noise when training each of the layers
▪ Often with an increasing amount of noise per layer (e.g., 0.1 for the first, 0.2 for the second, 0.3 for the third)
Multimodal
Representations
125
Core Challenge: Representation

Definition: Learning how to represent and summarize multimodal data in a way
that exploits the complementarity and redundancy of the modalities.

A Joint representations: B Coordinated representations:


Representation Repres. 1 Repres 2

Modality 1 Modality 2 Modality 1 Modality 2

126
Deep Multimodal Boltzmann machines

▪ Generative model ··· softmax


▪ Individual modalities trained like a
DBN
▪ Multimodal representation trained
using Variational approaches
▪ Used for image tagging and cross-
media retrieval
▪ Reconstruction of one modality from
another is a bit more “natural” than in
autoencoder representation
▪ Can actually sample text and images

[Srivastava and Salakhutdinov, Multimodal Learning with Deep Boltzmann Machines, 2012, 2014]
Deep Multimodal Boltzmann machines

Srivastava and Salakhutdinov, “Multimodal Learning with Deep Boltzmann Machines”, NIPS 2012
Deep Multimodal autoencoders

▪ A deep representation
learning approach
▪ A bimodal auto-encoder
▪ Used for Audio-visual speech
recognition

[Ngiam et al., Multimodal Deep Learning, 2011]


Deep Multimodal autoencoders - training

▪ Individual modalities can be


pretrained
▪ RBMs
▪ Denoising Autoencoders
▪ To train the model to
reconstruct the other modality
▪ Use both
▪ Remove audio

[Ngiam et al., Multimodal Deep Learning, 2011]


Deep Multimodal autoencoders - training

▪ Individual modalities can be


pretrained
▪ RBMs
▪ Denoising Autoencoders
▪ To train the model to
reconstruct the other modality
▪ Use both
▪ Remove audio
▪ Remove video
[Ngiam et al., Multimodal Deep Learning, 2011]
Multimodal Encoder-Decoder

▪ Visual modality often


encoded using CNN
▪ Language modality will
be decoded using LSTM

···
▪ A simple multilayer
perceptron will be used
to translate from visual ··· ···
(CNN) to language
(LSTM) ··· ···
Text Image
𝑿 𝒀
132
Multimodal Joint Representation

▪ For supervised learning tasks


▪ Joining the unimodal
e.g. Sentiment
representations:
··· softmax
▪ Simple concatenation
▪ Element-wise multiplication ··· 𝒉𝒎
or summation
▪ Multilayer perceptron 𝒉𝒙 ··· ··· 𝒉𝒚
▪ How to explicitly model
··· ···
both unimodal and
Text Image
bimodal interactions? 𝑿 𝒀
Multimodal Sentiment Analysis

Sentiment intensity [-3, +3], predicted with a softmax on top of a joint representation:

h_m = f( W · [h_x, h_y, h_z] )

where h_x, h_y, h_z are the unimodal text, image and audio representations (from X, Y, Z).

134
Unimodal, Bimodal and Trimodal Interactions

Speaker's behaviors → sentiment intensity

Unimodal cues (ambiguous on their own):
▪ "This movie is sick" → ambiguous!
▪ "This movie is fair"
▪ Smile
▪ Loud voice → ambiguous!

Bimodal interactions (can resolve ambiguity):
▪ "This movie is sick" + smile → resolves ambiguity (bimodal interaction)
▪ "This movie is sick" + frown → resolves ambiguity (bimodal interaction)
▪ "This movie is sick" + loud voice → still ambiguous!

Trimodal interactions:
▪ "This movie is sick" + smile + loud voice
▪ "This movie is fair" + smile + loud voice → different trimodal interactions!

135
Multimodal Tensor Fusion Network (TFN)

Models both unimodal and bimodal interactions via an outer product of the augmented unimodal representations:

h_m = [h_x; 1] ⊗ [h_y; 1]

which contains the bimodal block h_x ⊗ h_y, the unimodal blocks h_x and h_y, and a constant 1. (The appended 1 is important!)

[Figure: softmax sentiment prediction on top of h_m; h_x and h_y come from text X and image Y networks.]

[Zadeh, Jones and Morency, EMNLP 2017]

136
Multimodal Tensor Fusion Network (TFN)

Can be extended to three modalities:

h_m = [h_x; 1] ⊗ [h_y; 1] ⊗ [h_z; 1]

The resulting tensor contains the trimodal block h_x ⊗ h_y ⊗ h_z, the bimodal blocks h_x ⊗ h_y, h_x ⊗ h_z, h_z ⊗ h_y, and the unimodal blocks h_x, h_y, h_z: it explicitly models unimodal, bimodal and trimodal interactions!

[Figure: h_x, h_y, h_z come from text X, image Y and audio Z networks.]

[Zadeh, Jones and Morency, EMNLP 2017]

137
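A NumPy sketch of the tensor fusion operation for two and three modalities (illustrative; the dimensions are assumptions):

```python
import numpy as np

def augment(h):
    """Append the constant 1 so the outer product keeps unimodal terms."""
    return np.concatenate([h, [1.0]])

rng = np.random.default_rng(0)
h_x, h_y, h_z = rng.random(4), rng.random(3), rng.random(2)   # unimodal representations

# Bimodal tensor fusion: contains h_x ⊗ h_y, h_x, h_y and a constant 1
h_m2 = np.outer(augment(h_x), augment(h_y))
print(h_m2.shape)            # (5, 4)

# Trimodal tensor fusion: adds all bimodal blocks and the trimodal block
h_m3 = np.einsum('i,j,k->ijk', augment(h_x), augment(h_y), augment(h_z))
print(h_m3.shape)            # (5, 4, 3); flatten before the prediction layers
```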
Experimental Results – MOSI Dataset

Improvement over State-Of-The-Art

138
Coordinated
Multimodal
Representations
139
Coordinated Multimodal Representations

Learn (unsupervised) two or more coordinated representations from multiple modalities. A loss function is defined to bring these multiple representations closer together, e.g., a similarity metric such as cosine distance.

[Figure: separate text (X) and image (Y) networks whose top-layer representations are tied by the similarity metric.]
140
Coordinated Multimodal Embeddings

[Huang et al., Learning Deep Structured Semantic Models for Web Search using Clickthrough Data, 2013]
Multimodal Vector Space Arithmetic

[Kiros et al., Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models, 2014]
Multimodal Vector Space Arithmetic

[Kiros et al., Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models, 2014]
Canonical Correlation Analysis

“canonical”: reduced to the simplest or clearest schema possible

(1) Learn two linear projections, one for each view, that are maximally correlated:

u*, v* = argmax_{u,v} corr(H_x, H_y) = argmax_{u,v} corr(uᵀX, vᵀY)

[Figure: the projections of X (text) and Y (image) onto u and v are maximally correlated.]
144
Correlated Projection

(1) Learn two linear projections, one for each view, that are maximally correlated:

u*, v* = argmax_{u,v} corr(uᵀX, vᵀY)

[Figure: two views X, Y where the same instances have the same color, with projection directions u and v.]

145
Canonical Correlation Analysis
We want to learn multiple projection pairs (uᵀ_(i) X, vᵀ_(i) Y):

u*_(i), v*_(i) = argmax_{u_(i), v_(i)} corr(uᵀ_(i) X, vᵀ_(i) Y) ≈ uᵀ_(i) Σ_XY v_(i)

(2) We want these multiple projection pairs to be orthogonal ("canonical") to each other:

uᵀ_(i) Σ_XY v_(j) = uᵀ_(j) Σ_XY v_(i) = 0 for i ≠ j

Stacking the projections as U = [u_(1), u_(2), …, u_(k)] and V = [v_(1), v_(2), …, v_(k)], the objective becomes tr(Uᵀ Σ_XY V).
146
Canonical Correlation Analysis

(3) Since this objective function is invariant to scaling, we can constrain the projections to have unit variance:

Uᵀ Σ_XX U = I,  Vᵀ Σ_YY V = I

Canonical Correlation Analysis:

maximize: tr(Uᵀ Σ_XY V)
subject to: Uᵀ Σ_XX U = Vᵀ Σ_YY V = I

147
Canonical Correlation Analysis
maximize: tr(Uᵀ Σ_XY V)
subject to: Uᵀ Σ_XX U = Vᵀ Σ_YY V = I

[Figure: after applying the optimal projections U, V, the joint covariance matrix
Σ = [ Σ_XX  Σ_XY ; Σ_YX  Σ_YY ]
becomes identity on the diagonal blocks and diag(λ_1, λ_2, λ_3, …) — the canonical correlations — on the off-diagonal blocks.]

148
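A compact NumPy sketch of classical (linear) CCA via whitening and an SVD, one standard way of solving the objective above (illustrative; the small regularization term is an added assumption for numerical stability):

```python
import numpy as np

def inv_sqrt(S):
    """Symmetric inverse square root of a covariance matrix."""
    w, Q = np.linalg.eigh(S)
    return Q @ np.diag(1.0 / np.sqrt(w)) @ Q.T

def cca(X, Y, k, reg=1e-6):
    """X: (d_x, n), Y: (d_y, n) with one sample per column. Returns U, V and correlations."""
    X = X - X.mean(axis=1, keepdims=True)
    Y = Y - Y.mean(axis=1, keepdims=True)
    n = X.shape[1]
    Sxx = X @ X.T / (n - 1) + reg * np.eye(X.shape[0])
    Syy = Y @ Y.T / (n - 1) + reg * np.eye(Y.shape[0])
    Sxy = X @ Y.T / (n - 1)
    T = inv_sqrt(Sxx) @ Sxy @ inv_sqrt(Syy)      # whitened cross-covariance
    A, rho, Bt = np.linalg.svd(T)
    U = inv_sqrt(Sxx) @ A[:, :k]                 # projections satisfy U^T Sxx U = I
    V = inv_sqrt(Syy) @ Bt.T[:, :k]              # and V^T Syy V = I
    return U, V, rho[:k]                         # rho: canonical correlations

rng = np.random.default_rng(0)
Z = rng.normal(size=(2, 500))                    # shared signal between the two views
X = np.vstack([Z, rng.normal(size=(3, 500))])
Y = np.vstack([Z + 0.1 * rng.normal(size=(2, 500)), rng.normal(size=(2, 500))])
U, V, rho = cca(X, Y, k=2)
print(np.round(rho, 2))                          # close to 1 for the two shared directions
```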
Deep Canonical Correlation Analysis

Same objective function as CCA, but now over deep nonlinear projections W_x, W_y of each view:

argmax_{U, V, W_x, W_y} corr(H_x, H_y)

(1) Linear projections U, V maximizing correlation
(2) Orthogonal projections
(3) Unit variance of the projection vectors

[Figure: deep networks W_x, W_y map text X and image Y to H_x, H_y; CCA is applied on top.]

Andrew et al., ICML 2013
149
Deep Canonically Correlated Autoencoders (DCCAE)
Jointly optimize the DCCA and autoencoder loss functions:
➢ a trade-off between multi-view correlation and the reconstruction error of the individual views.

[Figure: as in DCCA, with additional decoders reconstructing X' and Y' from H_x and H_y.]

Wang et al., ICML 2015
150
Basic Concepts:
Recurrent Neural
Networks
151
Feedforward Neural Network

L^(t) = -log P(Y = y^(t) | z^(t))
z^(t) = matmult(h^(t), V)
h^(t) = tanh(U x^(t))

[Figure: input x^(t) → hidden h^(t) (weights U) → output z^(t) (weights V) → loss L^(t).]

152
Recurrent Neural Networks

L = Σ_t L^(t)
L^(t) = -log P(Y = y^(t) | z^(t))
z^(t) = matmult(h^(t), V)
h^(t) = tanh(U x^(t) + W h^(t-1))

[Figure: same as the feedforward network, plus a recurrent connection W from h^(t-1) to h^(t).]

153
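A minimal NumPy sketch of this recurrent forward pass and the summed cross-entropy loss (illustrative; the sizes and data are assumptions):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
x_dim, h_dim, n_classes, T = 5, 8, 3, 4
U = rng.normal(0, 0.1, (h_dim, x_dim))     # input-to-hidden
W = rng.normal(0, 0.1, (h_dim, h_dim))     # hidden-to-hidden (recurrent)
V = rng.normal(0, 0.1, (n_classes, h_dim)) # hidden-to-output

xs = rng.random((T, x_dim))
ys = rng.integers(0, n_classes, T)

h = np.zeros(h_dim)
L = 0.0
for t in range(T):
    h = np.tanh(U @ xs[t] + W @ h)         # h^(t) = tanh(U x^(t) + W h^(t-1))
    z = V @ h                              # z^(t)
    L += -np.log(softmax(z)[ys[t]])        # L^(t) = -log P(Y = y^(t) | z^(t))
print(L)
```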
Recurrent Neural Networks - Unrolling

L = Σ_t L^(t),   L^(t) = -log P(Y = y^(t) | z^(t)),   z^(t) = matmult(h^(t), V),   h^(t) = tanh(U x^(t) + W h^(t-1))

[Figure: the recurrence unrolled over time steps 1, 2, 3, …, t, with losses L^(1), L^(2), L^(3), …, L^(t).]

Same model parameters (U, V, W) are used for all time steps.

154
Recurrent Neural Networks – Language models
[Figure: RNN language model — at each step the input is the 1-of-N encoding of the previous word ("START", "dog", "on", "nice") and the output is the distribution over the next word ("dog", "on", "the", "beach").]

➢ Model long-term information


155
Recurrent Neural Networks

L = Σ_t L^(t),   L^(t) = -log P(Y = y^(t) | z^(t)),   z^(t) = matmult(h^(t), V),   h^(t) = tanh(U x^(t) + W h^(t-1))

[Figure: the unrolled recurrence over a full sequence of length τ.]

156
Backpropagation Through Time

L = Σ_t L^(t) = -Σ_t log P(Y = y^(t) | z^(t))

Gradient = "backprop" gradient × "local" Jacobian, computed backwards through the unrolled graph:

∂L/∂L^(t) = 1

(∇_{z^(t)} L)_i = ∂L/∂z_i^(t) = softmax(z^(t))_i - 1_{i = y^(t)}

∇_{h^(τ)} L = ∇_{z^(τ)} L · (∂z^(τ)/∂h^(τ)) = ∇_{z^(τ)} L · V    (last time step)

∇_{h^(t)} L = ∇_{z^(t)} L · (∂z^(t)/∂h^(t)) + ∇_{h^(t+1)} L · (∂h^(t+1)/∂h^(t))    (earlier time steps)

157
Backpropagation Through Time

L = Σ_t L^(t) = -Σ_t log P(Y = y^(t) | z^(t))

Gradient = "backprop" gradient × "local" Jacobian, accumulated over time for the shared parameters:

∇_V L = Σ_t ∇_{z^(t)} L · (∂z^(t)/∂V)
∇_W L = Σ_t ∇_{h^(t)} L · (∂h^(t)/∂W)
∇_U L = Σ_t ∇_{h^(t)} L · (∂h^(t)/∂U)

158
Long-term Dependencies

Vanishing gradient problem for RNNs:


𝒉(𝑡) ~𝑡𝑎𝑛ℎ(𝑾𝒉(𝑡−1) )

➢ The influence of a given input on the hidden layer, and therefore on


the network output, either decays or blows up exponentially as it
cycles around the network's recurrent connections.

159
Recurrent Neural Networks

[Figure: a vanilla RNN unit — h^(t) = tanh(W h^(t-1) + U x^(t)); the tanh squashes activations between -1 and +1, and gradients flowing through it repeatedly can vanish.]

160
LSTM ideas: (1) “Memory” Cell and Self Loop
[Hochreiter and Schmidhuber, 1997]

Long Short-Term Memory (LSTM)

[Figure: the LSTM adds a "memory" cell c^(t) with a self-loop alongside the hidden state h^(t), so information can be carried across many time steps.]

161
LSTM Ideas: (2) Input and Output Gates
[Hochreiter and Schmidhuber, 1997]

[Figure: sigmoid input and output gates (values between 0 and 1, computed from h^(t-1) and x^(t)) multiply what enters and what leaves the memory cell c^(t).]

162
LSTM Ideas: (3) Forget Gate [Gers et al., 2000]
The gates and candidate update are computed jointly from h^(t-1) and x^(t):

[ g; i; f; o ] = [ tanh; sigm; sigm; sigm ]( W · [ h^(t-1); x^(t) ] )

c^(t) = f ⊙ c^(t-1) + i ⊙ g
h^(t) = o ⊙ tanh(c^(t))

where i is the input gate, f the forget gate, o the output gate and g the candidate cell update.

[Figure: the forget gate f multiplies the cell's self-loop, so the network can learn to keep or erase its memory.]

163
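A NumPy sketch of one LSTM step implementing exactly these equations (illustrative; a single joint weight matrix W is assumed, as on the slide):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step: gates from [h_prev; x_t], then cell and hidden updates."""
    d = h_prev.shape[0]
    a = W @ np.concatenate([h_prev, x_t]) + b    # joint pre-activations, size 4*d
    g = np.tanh(a[0:d])                          # candidate update
    i = sigmoid(a[d:2*d])                        # input gate
    f = sigmoid(a[2*d:3*d])                      # forget gate
    o = sigmoid(a[3*d:4*d])                      # output gate
    c_t = f * c_prev + i * g                     # c^(t) = f ⊙ c^(t-1) + i ⊙ g
    h_t = o * np.tanh(c_t)                       # h^(t) = o ⊙ tanh(c^(t))
    return h_t, c_t

rng = np.random.default_rng(0)
x_dim, h_dim = 5, 8
W = rng.normal(0, 0.1, (4 * h_dim, h_dim + x_dim))
b = np.zeros(4 * h_dim)
h, c = np.zeros(h_dim), np.zeros(h_dim)
for x_t in rng.random((10, x_dim)):              # run over a short sequence
    h, c = lstm_step(x_t, h, c, W, b)
print(h.shape, c.shape)
```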
Recurrent Neural Network using LSTM Units

𝐿(1) 𝐿(2) 𝐿(3) 𝐿(𝜏)

𝑦 (1) 𝑦 (2) 𝑦 (3) 𝑦 (𝜏 )


𝒛(𝟏) 𝒛(2) 𝒛(3) 𝒛( 𝜏 )

𝑽
LSTM(1) LSTM(2) LSTM(3) LSTM(𝜏)

𝑾
𝒙(1) 𝒙(2) 𝒙(3) 𝒙( 𝜏 )

Gradient can still be computed using backpropagation!


164
Recurrent Neural Network using LSTM Units

𝐿(𝜏)

𝑦 (𝜏 )
𝒛( 𝜏 )

𝑽
LSTM(1) LSTM(2) LSTM(3) LSTM(𝜏)

𝑾
𝒙(1) 𝒙(2) 𝒙(3) 𝒙( 𝜏 )

Gradient can still be computed using backpropagation!


165
Bi-directional LSTM Network
𝐿(1) 𝐿(2) 𝐿(3) 𝐿(𝜏)

𝑦 (1) 𝑦 (2) 𝑦 (3) 𝑦 (𝜏 )


𝒛(𝟏) 𝒛(2) 𝒛(3) 𝒛( 𝜏 )

𝑽
LSTM(1)
2
LSTM(2)
2
LSTM(3)
2
LSTM(𝜏)
2

𝑾𝟐
LSTM(1) LSTM(2) LSTM(3) LSTM(𝜏)
1 1 1 1

𝑾𝟏
𝒙(1) 𝒙(2) 𝒙(3) 𝒙( 𝜏 )

166
Deep LSTM Network
𝐿(1) 𝐿(2) 𝐿(3) 𝐿(𝜏)

𝑦 (1) 𝑦 (2) 𝑦 (3) 𝑦 (𝜏 )


𝒛(𝟏) 𝒛(2) 𝒛(3) 𝒛( 𝜏 )

𝑽
LSTM(1)
2
LSTM(2)
2
LSTM(3)
2
LSTM(𝜏)
2

𝑾𝟐
LSTM(1) LSTM(2) LSTM(3) LSTM(𝜏)
1 1 1 1

𝑾𝟏
𝒙(1) 𝒙(2) 𝒙(3) 𝒙( 𝜏 )

167
Translation
and Alignment
168
Core Challenge 4: Translation

Definition: Process of changing data from one modality to another, where the
translation relationship can often be open-ended or subjective.

A Example-based B Model-driven

169
Translation

➢ Visual animations Challenges:


I. Different representations
II. Multiple source modalities
III. Open ended translations
➢ Image captioning IV. Subjective evaluation
V. Repetitive processes

➢ Speech synthesis
Example-based translation

▪ Cross-media retrieval – bounded task


▪ Multimodal representation plays a key role here

[Wei et al. 2015]


Example-based translation

▪ Need a way to measure similarity between the modalities


▪ Remember multimodal representations
▪ CCA
▪ Coordinated
▪ Joint
▪ Hashing
▪ Can use pairs of instances to train them and retrieve closest ones
during retrieval stage
▪ Objective and bounded task

[Wang et al. 2014]


Model-based Image captioning with Encoder-Decoder

[Vinyals et al., “Show and Tell: A Neural Image Caption Generator”, CVPR 2015]

173
Visual Question Answering

▪ A very new and exciting task created in part to address evaluation


problems with the above task
▪ Task: given an image and a question, answer the question ([Link])
Evaluation on “Unbounded” Translations

▪ Tricky to do automatically!
▪ Ideally want humans to evaluate
▪ What do you ask?
▪ Can’t use human evaluation for validating models –
too slow and expensive
▪ Using standard machine translation metrics
instead
▪ BLEU, ROUGE, CIDEr, METEOR
Core Challenge: Alignment

Definition: Identify the direct relations between (sub)elements


from two or more different modalities.

Modality 1 Modality 2 A Explicit Alignment

t1 The goal is to directly find


correspondences between elements of
t2 t4 different modalities
Fancy algorithm

t3 t5 B Implicit Alignment

Uses internally latent alignment of


modalities in order to better solve a
tn tn different problem

176
Explicit alignment
177
Temporal sequence alignment

Applications:
- Re-aligning asynchronous
data
- Finding similar data across
modalities (we can estimate
the aligned cost)
- Event reconstruction from
multiple sources
Let’s start unimodal – Dynamic Time Warping

▪ We have two unaligned temporal unimodal signals:
  X = [x_1, x_2, …, x_{n_x}] ∈ R^{d × n_x}
  Y = [y_1, y_2, …, y_{n_y}] ∈ R^{d × n_y}
▪ Find a set of indices that minimizes the alignment difference:

L(pˣ, pʸ) = Σ_{t=1}^{l} ‖ x_{pˣ_t} - y_{pʸ_t} ‖²₂

  where pˣ and pʸ are index vectors of the same length l
▪ Finding these indices is called Dynamic Time Warping
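A standard dynamic-programming sketch of DTW (illustrative; it returns only the minimal alignment cost, with squared Euclidean frame distances as above):

```python
import numpy as np

def dtw_cost(X, Y):
    """X: (d, n_x), Y: (d, n_y). Minimal cumulative alignment cost under DTW."""
    n_x, n_y = X.shape[1], Y.shape[1]
    D = np.full((n_x + 1, n_y + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n_x + 1):
        for j in range(1, n_y + 1):
            cost = np.sum((X[:, i - 1] - Y[:, j - 1]) ** 2)   # ||x_i - y_j||^2
            # monotonic, continuous moves: match, insertion, deletion
            D[i, j] = cost + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    return D[n_x, n_y]

t = np.linspace(0, 2 * np.pi, 60)
X = np.sin(t)[None, :]                                # two versions of the same signal,
Y = np.sin(np.linspace(0, 2 * np.pi, 80))[None, :]    # sampled at different rates
print(dtw_cost(X, Y))                                 # small: DTW warps one onto the other
```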
Dynamic Time Warping continued

▪ The alignment is the lowest-cost path in a cost matrix from (pˣ_1, pʸ_1) to (pˣ_l, pʸ_l)
▪ Restrictions:
  ▪ Monotonicity – no going back in time
  ▪ Continuity – no gaps
  ▪ Boundary conditions – start and end at the same points
  ▪ Warping window – don't get too far from the diagonal
  ▪ Slope constraint – do not insert or skip too much
Dynamic Time Warping continued

▪ The lowest-cost path in the cost matrix is found using dynamic programming, whilst respecting the restrictions above
DTW alternative formulation
L(pˣ, pʸ) = Σ_{t=1}^{l} ‖ x_{pˣ_t} - y_{pʸ_t} ‖²₂

Replicating frames does not change the objective, so the warped signals can be written as X W_x and Y W_y with binary replication (alignment) matrices W_x, W_y. Alternative objective:

L(W_x, W_y) = ‖ X W_x - Y W_y ‖²_F

X, Y – original signals (same number of rows, possibly different numbers of columns)
W_x, W_y – alignment matrices
Frobenius norm: ‖A‖²_F = Σ_i Σ_j a²_{i,j}

182
DTW - limitations

▪ Computationally complex, especially when aligning m sequences jointly

▪ Sensitive to outliers

▪ Unimodal!
Canonical Correlation Analysis reminder
maximize: tr(Uᵀ Σ_XY V)
subject to: Uᵀ Σ_XX U = Vᵀ Σ_YY V = I

(1) Linear projections maximizing correlation
(2) Orthogonal projections
(3) Unit variance of the projection vectors

[Figure: projections of X (text) and Y (image) as in the earlier CCA slide.]

184
Canonical Correlation Analysis reminder

▪ When the data is normalized, CCA is equivalent to minimizing the RMSE of reconstruction
▪ The CCA loss can also be re-written as:

L(U, V) = ‖ Uᵀ X - Vᵀ Y ‖²_F
subject to: Uᵀ Σ_XX U = Vᵀ Σ_YY V = I
Canonical Time Warping

▪ Dynamic Time Warping + Canonical Correlation Analysis = Canonical Time Warping

L(U, V, W_x, W_y) = ‖ Uᵀ X W_x - Vᵀ Y W_y ‖²_F

▪ Allows aligning multi-modal or multi-view data (same modality but from a different point of view)
▪ W_x, W_y – temporal alignment
▪ U, V – cross-modal (spatial) alignment

[Canonical Time Warping for Alignment of Human Behavior, Zhou and De la Torre, 2009]
Generalized Time warping

▪ Generalize to multiple sequences, each possibly of a different modality:

L(U_i, W_i) = Σ_i Σ_j ‖ U_iᵀ X_i W_i - U_jᵀ X_j W_j ‖²_F

▪ W_i – set of temporal alignments
▪ U_i – set of cross-modal (spatial) alignments

(1) Time warping


(2) Spatial embedding

[Generalized Canonical Time Warping, Zhou and De la Torre, 2016, TPAMI]


Alignment examples (unimodal)
CMU Motion Capture
Subject 1: 199 frames
Subject 2: 217 frames
Subject 3: 222 frames

Weizmann

Subject 1: 40 frames
Subject 2: 44 frames
Subject 3: 43 frames

188
Alignment examples (multimodal)

But how to model non-linear alignment functions?


Deep Canonical Time Warping

L(θ_1, θ_2, W_x, W_y) = ‖ f_{θ_1}(X) W_x - f_{θ_2}(Y) W_y ‖²_F

▪ Could be seen as generalization of DCCA and GTW

[Deep Canonical Time Warping, Trigeorgis et al., 2016, CVPR]


Deep Canonical Time Warping

L(θ_1, θ_2, W_x, W_y) = ‖ f_{θ_1}(X) W_x - f_{θ_2}(Y) W_y ‖²_F

▪ The projections are orthogonal (like in DCCA)


▪ Optimization is again iterative:
▪ Solve for alignment (𝑾𝒙 , 𝑾𝒚 ) with fixed projections (𝜽1 , 𝜽2 )
▪ Eigen decomposition
▪ Solve for projections (𝜽1 , 𝜽2 ) with fixed alignment (𝑾𝒙 , 𝑾𝒚 )
▪ Gradient descent
▪ Repeat till convergence

[Deep Canonical Time Warping, Trigeorgis et al., 2016, CVPR]


Implicit alignment
192
Machine Translation

▪ Given a sentence in one language translate it to another

Dog on the beach le chien sur la plage

▪ Not exactly multimodal task – but a good start! Each


language can be seen almost as a modality.
Encoder-Decoder Architecture for Machine Translation
[Cho et al., "Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation", EMNLP 2014]

[Figure: an encoder RNN reads the 1-of-N encodings of "le", "chien", "sur", "la", "plage" into a context vector, which a decoder RNN expands into the target sentence.]
Attention Model for Machine Translation

▪ Before, the encoder would just pass on its final hidden state; now we actually care about the intermediate hidden states.

[Figure: to generate "Dog", an attention module/gate combines the encoder states h_1…h_5 (for "le chien sur la plage") into a context vector z_0, conditioned on the decoder hidden state s_0.]

[Bahdanau et al., "Neural Machine Translation by Jointly Learning to Align and Translate", ICLR 2015]
Attention Model for Machine Translation

▪ Before, the encoder would just pass on its final hidden state; now we actually care about the intermediate hidden states.

[Figure: to generate "on" (after "Dog"), the attention module computes a new context vector z_1 from h_1…h_5, conditioned on the decoder state s_1.]

[Bahdanau et al., "Neural Machine Translation by Jointly Learning to Align and Translate", ICLR 2015]
Attention Model for Machine Translation

▪ Before, the encoder would just pass on its final hidden state; now we actually care about the intermediate hidden states.

[Figure: to generate "the" (after "Dog on"), the attention module computes a context vector z_2 from h_1…h_5, conditioned on the decoder state s_2.]

[Bahdanau et al., "Neural Machine Translation by Jointly Learning to Align and Translate", ICLR 2015]
Attention Model for Machine Translation

198
Attention Model for Image Captioning

[Figure: "Show, Attend and Tell" attention model — at each step the decoder state s_t produces a distribution a_t over L image locations, the expectation over the D-dimensional location features gives a context vector z_t, and the decoder emits the next output word y_t.]

199
Attention Model for Image Captioning
Attention Model for Video Sequences

[Pei, Baltrušaitis, Tax and Morency. Temporal Attention-Gated Model for


Robust Sequence Classification, CVPR, 2017 ]
Temporal Attention-Gated Model (TAGM)

[Figure: Recurrent Attention-Gated Unit — a scalar attention score a_t for each frame gates the update, h_t = (1 - a_t) · h_{t-1} + a_t · ReLU(W x_t + U h_{t-1}); the attention scores a_{t-1}, a_t, a_{t+1} are predicted from the inputs x_{t-1}, x_t, x_{t+1}.]

202
Temporal Attention Gated Model (TAGM)

CCV dataset
▪ 20 video categories
▪ Biking, birthday, wedding, etc.

[Bar chart: mean average precision (y-axis 35–65) on CCV for RNN, GRU, LSTM and TAGM (ours).]

[Pei, Baltrušaitis, Tax and Morency. Temporal Attention-Gated Model for


Robust Sequence Classification, CVPR, 2017]
Temporal Attention Gated Model (TAGM)

[Pei, Baltrušaitis, Tax and Morency. Temporal Attention-Gated Model for


Robust Sequence Classification, CVPR, 2017 ]
Temporal Attention Gated Model (TAGM)

Text-based Sentiment Analysis

[Pei, Baltrušaitis, Tax and Morency. Temporal Attention-Gated Model for


Robust Sequence Classification, CVPR, 2017 ]

205
Multimodal Fusion
206
Multimodal Fusion

▪ Process of joining information from two or more modalities to perform


a prediction
▪ One of the earlier and more established problems
▪ e.g. audio-visual speech recognition, multimedia event detection,
multimodal emotion recognition
▪ Two major types Prediction
▪ Model Free
▪ Early, late, hybrid
▪ Model Based Fancy
algorithm
▪ Kernel Methods
▪ Graphical models
▪ Neural networks
Modality 1 Modality 2 Modality 3
Model free approaches – early fusion

Modality 1

Modality 2
Classifier
Modality n

▪ Easy to implement – just concatenate the features


▪ Exploit dependencies between features
▪ Can end up very high dimensional
▪ More difficult to use if features have different framerates
Model free approaches – late fusion

Modality 1
Classifier

Modality 2 Fusion
Classifier mechanism

Modality n
Classifier
▪ Train one unimodal predictor per modality and a multimodal fusion mechanism on top
▪ Requires multiple training stages
▪ Does not model low-level interactions between modalities
▪ The fusion mechanism can be voting, a weighted sum, or an ML approach
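A toy NumPy sketch contrasting the two model-agnostic strategies (illustrative; the linear "classifiers" and feature sizes are assumptions, and their weights would normally be trained):

```python
import numpy as np

def early_fusion(x_audio, x_visual, W_joint, b_joint):
    """Early fusion: concatenate the unimodal features, then apply one classifier."""
    x = np.concatenate([x_audio, x_visual])
    return W_joint @ x + b_joint                    # joint class scores

def late_fusion(x_audio, x_visual, W_a, b_a, W_v, b_v, weights=(0.5, 0.5)):
    """Late fusion: one classifier per modality, then combine predictions (weighted sum)."""
    scores_a = W_a @ x_audio + b_a
    scores_v = W_v @ x_visual + b_v
    return weights[0] * scores_a + weights[1] * scores_v

rng = np.random.default_rng(0)
d_a, d_v, n_classes = 6, 10, 3                      # assumed feature and label sizes
x_a, x_v = rng.random(d_a), rng.random(d_v)
W_joint, b_joint = rng.normal(size=(n_classes, d_a + d_v)), np.zeros(n_classes)
W_a, b_a = rng.normal(size=(n_classes, d_a)), np.zeros(n_classes)
W_v, b_v = rng.normal(size=(n_classes, d_v)), np.zeros(n_classes)

print(early_fusion(x_a, x_v, W_joint, b_joint))
print(late_fusion(x_a, x_v, W_a, b_a, W_v, b_v))
```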
Model free approaches – hybrid fusion

Modality 1
Classifier

Modality 2
Classifier
Fusion
mechanism

Modality 1
Classifier
Modality 2

▪ Combine benefits of both early and late fusion mechanisms


Multiple Kernel Learning

▪ Pick a family of kernels for each modality and learn which kernels are important for the
classification case
▪ Generalizes the idea of Support Vector Machines
▪ Works as well for unimodal and multimodal data, very little adaptation is needed

[Lanckriet 2004]
Multimodal Fusion for Sequential Data
Multi-View Hidden Conditional Random Field

Modality-private structure
• Internal grouping of observations
Modality-shared structure
• Interaction and synchrony

p(y | xᴬ, xⱽ; θ) = Σ_{hᴬ, hⱽ} p(y, hᴬ, hⱽ | xᴬ, xⱽ; θ)

[Figure: sentiment label y on top of audio hidden states h_1ᴬ…h_5ᴬ and visual hidden states h_1ⱽ…h_5ⱽ over observations x_1ᴬ…x_5ᴬ, x_1ⱽ…x_5ⱽ, for the utterance "We saw the yellow dog".]

➢ Approximate inference using loopy belief propagation

[Song, Morency and Davis, CVPR 2012]

212
Sequence Modeling with LSTM

𝒚𝟏 𝒚𝟐 𝒚𝟑 𝒚𝜏

LSTM(1) LSTM(2) LSTM(3) LSTM(𝜏)

𝒙𝟏 𝒙𝟐 𝒙𝟑 𝒙𝜏

213
Multimodal Sequence Modeling – Early Fusion

[Figure: at each time step t = 1…τ, the features of all three modalities x_t^(1), x_t^(2), x_t^(3) are concatenated and fed into a single LSTM that predicts y_t.]

214
Multi-View Long Short-Term Memory (MV-LSTM)

[Figure: at each time step, the three modality streams x_t^(1), x_t^(2), x_t^(3) are fed into a Multi-View LSTM (MV-LSTM) unit that predicts y_t.]

[Shyam, Morency, et al. Extending Long Short-Term Memory for Multi-View Structured Learning, ECCV, 2016]

215
Multi-View Long Short-Term Memory

[Figure: inside the MV-LSTM unit — multi-view topologies with multiple memory cells, one per view: each view-specific input x_t^(v) and previous hidden state h_{t-1}^(v) produce a candidate g_t^(v) and memory cell c_t^(v), giving view-specific outputs h_t^(1), h_t^(2), h_t^(3).]

[Shyam, Morency, et al. Extending Long Short-Term Memory for Multi-View Structured Learning, ECCV, 2016]

216
Topologies for Multi-View LSTM

Multi-view topologies are controlled by two design parameters:
▪ α: how much memory comes from the current view
▪ β: how much memory comes from the other views

▪ View-specific: α = 1, β = 0
▪ Fully-connected: α = 1, β = 1
▪ Coupled: α = 0, β = 1
▪ Hybrid: α = 2/3, β = 1/3

[Figure: the four gating patterns between the view inputs x_t^(v), previous hidden states h_{t-1}^(v) and the candidate g_t^(1).]

[Shyam, Morency, et al. Extending Long Short-Term Memory for Multi-View Structured Learning, ECCV, 2016]

217
Multi-View Long Short-Term Memory (MV-LSTM)

Multimodal prediction of children engagement

[Shyam, Morency, et al. Extending Long Short-Term Memory for Multi-View Structured Learning, ECCV, 2016]

218
Memory Based

▪ A memory accumulates multimodal information over time, from the representations throughout a source network.
▪ No need to modify the structure of the source network; the memory is only attached to it.

219
Memory Based

[Zadeh et al., Memory Fusion Network for Multi-view Sequential Learning, AAAI 2018]

220
Multimodal Machine Learning

Tadas Baltrusaitis, Chaitanya Ahuja,


and Louis-Philippe Morency

[Link]

221
