Cognitive Computing
Lecture 8
Introduction to Multimodal Machine Learning
Dr. Hany Hanafy Mahmoud
Table of Contents
• What is a multimodal cognitive system?
• Multimodal History
• Multimodal learning
• Core Technical Challenges
• Multimodal Research Task
What is a multimodal cognitive system?
• The science of data that spans multiple sensory modalities
• The approach rests on the theoretical assumption that
cognitive performance can be influenced by other modes
of psychological processing
• E.g., perceptual, emotional, and social processing, and
responses to the physical environment.
https://www.youtube.com/watch?v=VIq5r7mCAyw&t=4131s
Multimodal Communicative Behaviors
• Verbal: what you say
• Vocal: how you say it
• Visual: what your visual behavior looks like
• Verbal:
• Lexicon (words), syntax (part-of-speech tags), …
• Lexical analysis: divides the text into paragraphs, phrases, and
words, and identifies the structure of words in sentences
• Semantic analysis: determines whether the text is meaningful and
attempts to recover its intended meaning.
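The lexical analysis step described above can be illustrated with a minimal sketch. It is not a production tokenizer: the function names, the regular expressions, and the sample text are assumptions made for this example, and sentences stand in for the "phrases" level.

```python
import re

def split_paragraphs(text):
    """Split text into paragraphs on blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

def split_sentences(paragraph):
    """Split a paragraph after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", paragraph) if s.strip()]

def split_words(sentence):
    """Extract lowercase word tokens (letters, digits, apostrophes)."""
    return re.findall(r"[a-z0-9']+", sentence.lower())

# Hypothetical sample text with two paragraphs.
text = "I loved this movie. The acting was great!\n\nThe ending felt rushed."

for paragraph in split_paragraphs(text):
    for sentence in split_sentences(paragraph):
        print(split_words(sentence))
```

Real systems usually rely on a trained tokenizer, since simple rules like these break on abbreviations, numbers, and many languages.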
• Vocal:
• Voice quality
• Intonation
• Vocal expressions (laughter, …)
• Visual:
• Gestures: head gestures, eye gestures
• Body language: arm movements, body posture, proxemics
• Eye contact and head gaze
• Facial expressions: smile, …
• Modality: a certain type of information and its data representation
format.
• Sensory modality: one of the primary forms of sensation, such as
vision, hearing, touch, ...
• Medium: an instrument for storing and communicating
information.
https://www.youtube.com/watch?v=DPkwjgaRvyI&list=PL-Fhd_vrvisMYs8A5j7sj8YW1wHhoJSmW
Multiple Communicates
Examples of Modalities
• Natural language (text & speech)
• Visual (images or videos)
• Auditory (voice, sound or music)
• Smell, taste, touch
• Physiological signals: electrocardiogram (ECG), skin conductance
• Other Modalities: infrared images, depth images, fMRI
Different modalities show diverse qualities, structures, and
representations.
Connection types:
• Correlation: a statistical association or relationship between variables. It reflects
things that appear to behave in a “similar” way.
• Causation: a change in one variable causes a change in another variable, that is,
one thing makes another happen.
• Co-occurrence: the frequency with which two or more entities (such as
words, phrases, or concepts) appear together within a given context, such as a
document. It measures how often entities are found in proximity to each other,
indicating potential relationships or associations between them.
• Association: any relationship between two variables, including linear,
curvilinear, or other non-linear relationships. Therefore, all correlations are associations, but
not all associations are correlations.
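A small sketch can make two of these connection types concrete. It assumes NumPy is available; the toy data, the choice of Pearson's coefficient as the correlation measure, and the helper names `pearson` and `cooccurrence_counts` are illustrative assumptions. The first part shows a perfectly determined but non-linear association whose linear correlation is close to zero; the second counts word co-occurrence within documents.

```python
from collections import Counter
from itertools import combinations

import numpy as np

def pearson(x, y):
    """Pearson correlation coefficient: captures linear association only."""
    return float(np.corrcoef(x, y)[0, 1])

x = np.linspace(-1.0, 1.0, 201)   # symmetric around zero
linear = 2.0 * x + 1.0            # linear relationship with x
curved = x ** 2                   # deterministic but non-linear relationship

r_linear = pearson(x, linear)     # close to 1
r_curved = pearson(x, curved)     # close to 0, although curved depends on x
print(round(r_linear, 3), round(abs(r_curved), 3))

def cooccurrence_counts(documents):
    """Count how often each unordered word pair appears in the same document."""
    counts = Counter()
    for doc in documents:
        words = sorted(set(doc.lower().split()))
        counts.update(combinations(words, 2))
    return counts

# Hypothetical three-document corpus.
docs = ["dog barks loudly", "dog barks", "cat sleeps"]
pairs = cooccurrence_counts(docs)
print(pairs[("barks", "dog")])
```

Neither measure says anything about causation: both only describe how the observed values behave together.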
Multimodal Machine Learning (MMML): the study of
computer algorithms that learn and improve through the use
and experience of data from multiple modalities.
Artificial Intelligence for multimodal data: systems that are able to
demonstrate intelligent capabilities such as understanding,
reasoning, planning, …
New Modality
Representation
Prediction
Multimodal Challenges
Core Multimodal Challenges
1. Representation:
Learning representations that reflect cross-modal interactions
between individual elements across different modalities
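One simple way to build such a representation is to concatenate unimodal feature vectors and project them into a shared space. The sketch below assumes NumPy; the feature dimensions, the random weights, and the function name `joint_representation` are all hypothetical, and the lecture does not prescribe this particular method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical unimodal feature vectors for one video segment.
text_feat = rng.normal(size=300)    # e.g., a word embedding
audio_feat = rng.normal(size=74)    # e.g., acoustic descriptors
visual_feat = rng.normal(size=35)   # e.g., facial-expression features

def joint_representation(features, weight, bias):
    """Concatenate unimodal features and apply one tanh projection layer."""
    fused = np.concatenate(features)
    return np.tanh(weight @ fused + bias)

input_dim = text_feat.size + audio_feat.size + visual_feat.size
hidden_dim = 64

# Randomly initialised parameters; in practice they would be learned.
W = rng.normal(scale=0.05, size=(hidden_dim, input_dim))
b = np.zeros(hidden_dim)

z = joint_representation([text_feat, audio_feat, visual_feat], W, b)
print(z.shape)
```

Because the projection mixes all inputs, each output unit can respond to combinations of text, audio, and visual features, which is the simplest form of cross-modal interaction.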
2. Alignment:
Identifying cross-modal connections between all elements of
multiple modalities, building on the structure of the data.
Most modalities have an internal structure made up of multiple elements.
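As a rough illustration of alignment, the sketch below matches each word of a caption to its most similar image region using cosine similarity. It assumes NumPy and that both modalities have already been embedded in a shared space; the embeddings are hand-made toy values and the names `words`, `regions`, and `cosine_similarity_matrix` are hypothetical.

```python
import numpy as np

def cosine_similarity_matrix(a, b):
    """Cosine similarity between every row of a and every row of b."""
    a_norm = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_norm = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_norm @ b_norm.T

# Toy embeddings in a shared 3-d space (hand-made for illustration).
words = ["dog", "ball", "grass"]
word_emb = np.array([[1.0, 0.1, 0.0],
                     [0.0, 1.0, 0.1],
                     [0.1, 0.0, 1.0]])
regions = ["region_0", "region_1", "region_2"]
region_emb = np.array([[0.0, 0.9, 0.2],    # close to "ball"
                       [0.2, 0.1, 0.8],    # close to "grass"
                       [0.9, 0.0, 0.1]])   # close to "dog"

sim = cosine_similarity_matrix(word_emb, region_emb)
best = sim.argmax(axis=1)   # hard alignment: best region for each word

for word, j in zip(words, best):
    print(word, "->", regions[j])
```

A softmax over each row of `sim` would give a soft alignment instead of the hard argmax used here.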
3. Reasoning:
Combine knowledge through multiple inferential steps,
exploiting multimodal alignment and problem structure
4. Generation:
Learn a generative process to produce raw modalities that
reflect cross-modal interactions, structure, and coherence.
5. Transference:
Transfer knowledge between modalities to help a target modality
that may be noisy or have limited resources.
6. Quantification:
Theoretical study to better understand heterogeneity,
cross-modal interactions, and the multimodal learning process.
Multimodal History
• Behavioral: 1970 till late 1980s
• Computational: late 1980s till 2000
• Interaction: 2000 to 2010
• Deep learning: 2010s until now
• Next era: ?
• Behavioral: 1970 till late 1980s
Gestures are in fact the speaker's thinking in action and integral components of speech, not merely accompaniments or additions.
• Computational: late 1980s till 2000
The goal of affective computing is to create a computing
system capable of perceiving, recognizing, and
understanding human emotions and responding
intelligently, sensitively, and naturally, thus making human–
computer interaction more natural
• 1990 to 202X Timeline:
https://www.youtube.com/watch?v=607EcmU9mFs&list=PL-Fhd_vrvisMYs8A5j7sj8YW1wHhoJSmW&index=3
Multimodal Research Task
Real-world tasks for MMML:
A. Affect recognition: emotion and sentiment recognition
B. Media description: image and video captioning
C. Multimodal QA: image and video QA, visual reasoning
D. Multimodal navigation: language-guided navigation, autonomous
driving
E. Multimodal dialog: grounded dialog
F. Event recognition: action recognition and segmentation
G. Multimedia information retrieval: content-based, cross-media
• Dataset
Datasets: https://www.youtube.com/watch?v=607EcmU9mFs&list=PL-Fhd_vrvisMYs8A5j7sj8YW1wHhoJSmW&index=2
On GitHub: https://github.com/topics/multimodal-datasets
Datasets Affect Recognition
https://www.youtube.com/watch?v=fBYu8I52nVM&list=UULFqlHIJTGYhiwQpNuPU5e2gg&index=54
Cross media Retrieval
A confounding variable is an unmeasured third variable that
influences both the supposed cause and the supposed effect.
Datasets Media Description
Datasets Multimedia QA
Example 1: Select-Additive Learning
A sentiment classification task over verbal, acoustic, and visual modalities. Select-additive learning aims to
improve the generalizability of trained neural networks for multimodal sentiment analysis.
https://arxiv.org/abs/1609.05244
Confounding variables are factors that can influence both the independent and dependent variables in a study, leading to
biased or incorrect conclusions about the relationship between them. In machine learning, addressing confounding variables is
crucial for accurate causal inference and prediction.
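The effect of a confounder can be seen in a small simulation. This is a sketch with NumPy on synthetic data, not the Select-Additive Learning method: the variables `z`, `x`, `y`, the coefficients, and the helper `residual` are assumptions, and the confounder is treated as observed here so that it can be adjusted for.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5000

z = rng.normal(size=n)               # confounder
x = 2.0 * z + rng.normal(size=n)     # supposed cause: driven by z only
y = -1.5 * z + rng.normal(size=n)    # supposed effect: driven by z only, not by x

def residual(v, z):
    """Remove the part of v that a least-squares line in z explains."""
    design = np.column_stack([z, np.ones_like(z)])
    coef, *_ = np.linalg.lstsq(design, v, rcond=None)
    return v - design @ coef

# Raw correlation is strong even though x has no effect on y.
raw_corr = np.corrcoef(x, y)[0, 1]

# Partial correlation after adjusting both variables for z.
partial_corr = np.corrcoef(residual(x, z), residual(y, z))[0, 1]

print(round(raw_corr, 2), round(partial_corr, 2))
```

The raw correlation is clearly negative, while the adjusted one stays near zero, which is why a model trained on `x` alone could learn a relationship that does not generalize.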
Example 2: Word-level Gated Fusion
Multimodal sentiment analysis with the Gated Multimodal Embedding LSTM with Temporal Attention (GME-LSTM(A)) model
https://arxiv.org/abs/1802.00924
GME: Gated Multimodal Embedding
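The general idea of gating modalities at the word level can be sketched as follows. This is only an illustration with NumPy, not the model from the paper: the feature sizes, the hand-set gate values, and the function name `gated_word_embedding` are assumptions, and a real model would learn when to open or close each gate.

```python
import numpy as np

def gated_word_embedding(text_vec, audio_vec, visual_vec, audio_gate, visual_gate):
    """Concatenate word-level features, zeroing any modality whose gate is 0."""
    return np.concatenate([
        text_vec,
        audio_gate * audio_vec,
        visual_gate * visual_vec,
    ])

# Hypothetical word-level features for a three-word utterance.
text = np.ones((3, 4))
audio = np.full((3, 2), 0.5)
visual = np.full((3, 2), 2.0)

# Hand-set gates per word (1 = keep, 0 = drop); a trained controller
# would produce these decisions in a real model.
audio_gates = [1, 0, 1]
visual_gates = [1, 1, 0]

fused = np.stack([
    gated_word_embedding(text[t], audio[t], visual[t],
                         audio_gates[t], visual_gates[t])
    for t in range(3)
])
print(fused.shape)
```

Dropping a modality for a single word lets the rest of the network ignore, for example, acoustic noise at that time step while still using the text.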
Dataset requirements for the project
• The dataset should have at least two modalities
• Teams of 2 or 3 students
• Stages:
• Pre-proposal: define the dataset and research task
• Study work related to your selected research topic
• Experiment with unimodal representations
• Implement & evaluate state-of-the-art model(s)
• Create a GitHub repository that is accessible to course staff
• Each report should describe the task carried out by each team member.
• Make a video that presents the robot in action
• Write a paper.