Visual Perception and Learning in the Open World
CISC3027 Special Topics in Computer and Information Science
Lecture 8: Recognition with More Sensors / Modalities
(part-2)
Instructor: Shu Kong
Email: skong@[Link]
Office: E11 4025
How to specify THAT object so that a robot can get it for you?
We need some interaction: human-computer interaction (HCI)!
Detect object instances using language
What’s next?
• Visual signal alone is not enough.
• Real open-world applications rely on multiple sensors, not just an RGB camera.
Multi-modality / multi-sensor
• RGB + Infrared
• RGB + depth
• RGB + LiDAR
• RGB + IMU
• RGB + language
RGB + Infrared for object detection
• Again, why use infrared?
Tempe, Arizona, Sunday night, March 18, 2018
No person?
A single-modal RGB sensor cannot capture all objects, e.g., under poor illumination.
Thermal captures stronger signatures for objects that emit heat.
What about objects that do not? Let’s fuse modalities.
HDR scenes
• Can we use infrared only without RGB? Why (not)?
Beam splitter for spatial synchronization
How to fuse RGB+Infrared?
[1] Devaguptapu, et al. Borrow from Anywhere: Pseudo Multi-modal Object Detection in Thermal Imagery. CVPRW, 2019
Late fusion of RGB + Infrared
[Figure: detections from modality x1 and modality x2, and their fusion by (a) Pooling, (b) NMS, (c) Average, (d) ProbEn]
(a) Pooling. Naïve approach: pool single-modal detections together.
This will likely produce overlapping detections.
(b) NMS. Remove overlapping detections with Non-Maximum Suppression (NMS).
This is a waste of information: only the top-scoring detection of each overlapping group survives.
(c) Average. To fuse modalities rather than suppress, let’s try averaging the scores of overlapping detections. But the average never exceeds the higher of the two scores, so agreement between modalities lowers the top score.
Intuitively, a probabilistic approach to fusion should boost scores when modalities agree.
(d) ProbEn. We propose probabilistic ensembling (ProbEn), a non-learned approach derived from first principles.
Dalal & Triggs. “Histograms of oriented gradients for human detection”. CVPR, 2005.
Chen, Shi, Ye, Mertz, Ramanan, Kong. “Multimodal Object Detection via Probabilistic Ensembling”. ECCV, 2022
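To make the heuristic baselines (a) to (c) concrete, here is a minimal Python sketch. The detection layout `(x1, y1, x2, y2, score)`, the greedy grouping by IoU, the 0.5 threshold, and all function names are assumptions made for illustration; they are not taken from the cited papers.

```python
from typing import List, Tuple

# Hypothetical layout for this sketch: (x1, y1, x2, y2, score).
Det = Tuple[float, float, float, float, float]


def iou(a: Det, b: Det) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def pool(rgb: List[Det], thermal: List[Det]) -> List[Det]:
    """(a) Pooling: concatenate the detections of both modalities."""
    return list(rgb) + list(thermal)


def group_overlaps(dets: List[Det], thr: float = 0.5) -> List[List[Det]]:
    """Greedily group detections: the highest-scoring remaining detection
    collects every other remaining detection whose IoU with it is >= thr."""
    remaining = sorted(dets, key=lambda d: d[4], reverse=True)
    groups = []
    while remaining:
        top = remaining.pop(0)
        groups.append([top] + [d for d in remaining if iou(top, d) >= thr])
        remaining = [d for d in remaining if iou(top, d) < thr]
    return groups


def nms_fusion(dets: List[Det], thr: float = 0.5) -> List[Det]:
    """(b) NMS: keep only the top-scoring detection of each group."""
    return [g[0] for g in group_overlaps(dets, thr)]


def average_fusion(dets: List[Det], thr: float = 0.5) -> List[Det]:
    """(c) Average: keep the top box, replace its score by the group mean."""
    fused = []
    for g in group_overlaps(dets, thr):
        mean_score = sum(d[4] for d in g) / len(g)
        fused.append(tuple(g[0][:4]) + (mean_score,))
    return fused


# Two modalities agree on roughly the same box.
rgb_dets = [(10.0, 10.0, 50.0, 90.0, 0.7)]
thermal_dets = [(12.0, 11.0, 52.0, 92.0, 0.8)]
pooled = pool(rgb_dets, thermal_dets)
nms_out = nms_fusion(pooled)
avg_out = average_fusion(pooled)
print(len(pooled), nms_out[0][4], round(avg_out[0][4], 2))
```

In this toy case pooling keeps both boxes, NMS keeps only the thermal score, and averaging ends up below the thermal score even though both modalities agree.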
Probabilistic ensembling of RGB & Infrared detections
ProbEn is the optimal fusion strategy given the conditional independence assumption between modality x1 and modality x2:
$$p(x_1 \mid y) = p(x_1 \mid x_2, y)$$
Bayes rule:
$$p(y \mid x_1, x_2) = \frac{p(x_1, x_2 \mid y)\, p(y)}{p(x_1, x_2)} \propto p(x_1, x_2 \mid y)\, p(y)$$
$$\propto \frac{p(x_1 \mid y)\, p(y)\; p(x_2 \mid y)\, p(y)}{p(y)}$$
$$\propto \frac{p(y \mid x_1)\, p(y \mid x_2)}{p(y)}$$
ProbEn
• multiply single-modal probabilities
• divide by the class prior
• re-normalize
class    prior p(y)   RGB p(y|x1)   Thermal p(y|x2)   p(y|x1) p(y|x2) / p(y)   Re-norm
person   0.5          0.7           0.8               0.7*0.8 / 0.5            0.9
car      0.5          0.3           0.2               0.3*0.2 / 0.5            0.1
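The person / car example can be checked numerically. Below is a minimal sketch of the three ProbEn steps (multiply, divide by the prior, re-normalize). The function name `proben` and the extension to a list of more than two modalities are assumptions made for illustration.

```python
import numpy as np


def proben(posteriors, prior):
    """Fuse per-modality class posteriors under conditional independence.

    posteriors: list of 1-D arrays, each p(y | x_i) over the same classes.
    prior:      1-D array p(y) over those classes.
    """
    prior = np.asarray(prior, dtype=float)
    fused = prior.copy()
    for p in posteriors:
        # Each modality contributes the factor p(y | x_i) / p(y).
        fused = fused * (np.asarray(p, dtype=float) / prior)
    # Re-normalize so the fused scores sum to 1.
    return fused / fused.sum()


prior = [0.5, 0.5]         # person, car
p_rgb = [0.7, 0.3]         # p(y | x1)
p_thermal = [0.8, 0.2]     # p(y | x2)
fused = proben([p_rgb, p_thermal], prior)
print(np.round(fused, 3))  # person rounds to about 0.9, car to about 0.1
```

The fused person score is higher than either single-modal score, which is the boost under agreement that averaging cannot give.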
The optimal strategy to fuse is to sum logits
• Not to average softmax scores;
• Not to average logits;
• But to sum logits!
Softmax posterior from the logits s_i of modality i:
$$p(y = k \mid x_i) = \frac{\exp(s_i[k])}{\sum_j \exp(s_i[j])} \propto \exp(s_i[k])$$
Sum logits:
$$p(y = k \mid x_1, x_2) \propto \frac{\exp(s_1[k] + s_2[k])}{p(y = k)}$$
Chen, Shi, Ye, Mertz, Ramanan, Kong. “Multimodal Object Detection via Probabilistic Ensembling”. ECCV, 2022
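A small numerical comparison of the three options, as a sketch only. The logits are chosen so that the single-modal softmax scores are 0.7 and 0.8 for the first class; the function names and the optional `prior` argument are assumptions made for illustration.

```python
import numpy as np


def softmax(s):
    """Numerically stable softmax over a 1-D array of logits."""
    s = np.asarray(s, dtype=float)
    e = np.exp(s - s.max())
    return e / e.sum()


def fuse_sum_logits(s1, s2, prior=None):
    """p(y=k | x1, x2) proportional to exp(s1[k] + s2[k]) / p(y=k).

    With prior=None the class prior is taken to be uniform, in which case
    dividing by it only changes the normalization constant.
    """
    s = np.asarray(s1, dtype=float) + np.asarray(s2, dtype=float)
    if prior is not None:
        s = s - np.log(np.asarray(prior, dtype=float))
    return softmax(s)


s_rgb = np.log([0.7, 0.3])      # softmax(s_rgb) is [0.7, 0.3]
s_thermal = np.log([0.8, 0.2])  # softmax(s_thermal) is [0.8, 0.2]

avg_softmax = (softmax(s_rgb) + softmax(s_thermal)) / 2
avg_logits = softmax((s_rgb + s_thermal) / 2)
sum_logits = fuse_sum_logits(s_rgb, s_thermal)

print(np.round(avg_softmax, 3), np.round(avg_logits, 3), np.round(sum_logits, 3))
```

Both averaging variants stay below the stronger single-modal score of 0.8, while summing logits pushes the agreed class above it.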
ProbEn handles missing modalities.
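One way to read this, sketched below: if a modality has no matching detection, drop its factor from the product, which is the same as treating its posterior as equal to the prior. The fused result then reduces to the remaining modality's posterior. Using `None` to mark a missing modality and the function name are assumptions of this sketch, not details from the slides.

```python
import numpy as np


def proben_with_missing(p_rgb, p_thermal, prior):
    """ProbEn-style fusion where a missing modality is passed as None."""
    prior = np.asarray(prior, dtype=float)
    fused = prior.copy()
    for p in (p_rgb, p_thermal):
        if p is None:
            continue  # missing modality: contributes a factor of 1
        fused = fused * (np.asarray(p, dtype=float) / prior)
    return fused / fused.sum()


prior = [0.5, 0.5]
only_rgb = proben_with_missing([0.7, 0.3], None, prior)
both = proben_with_missing([0.7, 0.3], [0.8, 0.2], prior)
print(np.round(only_rgb, 3), np.round(both, 3))
```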
Probabilistic late-fusion of RGB + Infrared
● ProbEn outperforms heuristic fusion methods, e.g., averaging and NMS.
● ProbEn still improves even when the conditional independence assumption does not hold.
● ProbEn on off-the-shelf detectors achieves a 26% relative improvement!
[Figure: compared architectures. Single-modal detector (feature extractor, detector, detections); MidFusion (RGB-feature extractor and thermal-feature extractor, feature fusion, detector head, detections); detector ensemble (RGB detector and thermal detector, detections, ProbEn).]
[Figure: bar chart of Log-Average Miss Rate on the KAIST dataset (lower is better; axis from 0 to 0.35). Bars: RGB, Thermal, MidFusion, Pooling, NMS, average fusion, ProbEn, ProbEn (3). Prior methods listed: RPN+BDT [CVPRW 2017], TC-DET [ECCV 2020], IATDNN [InfoFusion 2019], IAF RCNN [PR 2019], CIAN [InfoFusion 2019], MSDS [BMVC 2018], AR-CNN [ICCV 2019], MBNet [ECCV 2020], MLPD [RA-L 2021], GAFF [WACV 2021] (0.65), ProbEn (3 w/ GAFF) (0.51).]
Chen, Shi, Ye, Mertz, Ramanan, Kong. “Multimodal Object Detection via Probabilistic Ensembling”. ECCV, 2022
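For readers unfamiliar with the metric in the chart, here is a rough sketch of how a log-average miss rate is commonly computed in pedestrian-detection benchmarks: the geometric mean of the miss rate sampled at nine false-positives-per-image (FPPI) points evenly spaced in log space between 10^-2 and 10^0. This convention is not stated on the slide and is an assumption here, as are the function name and the clipping constant.

```python
import numpy as np

# Assumed reference points: nine FPPI values, log-spaced in [1e-2, 1e0].
REF_FPPI = np.logspace(-2.0, 0.0, num=9)


def log_average_miss_rate(miss_rates):
    """Geometric mean of miss rates sampled at the reference FPPI points."""
    mr = np.clip(np.asarray(miss_rates, dtype=float), 1e-10, 1.0)
    return float(np.exp(np.mean(np.log(mr))))


# Toy curve: the miss rate falls from 0.40 to 0.10 as more false positives
# per image are tolerated (made-up numbers, for illustration only).
toy_miss_rates = np.linspace(0.40, 0.10, num=len(REF_FPPI))
lamr = log_average_miss_rate(toy_miss_rates)
print(round(lamr, 3))
```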
RGB + depth
• Why use depth?
RGB + depth for semantic segmentation
• Why use depth?
• Intuitively, we need to pay attention to far / small objects.