
Visual Question Answering on 360◦ Images

Shih-Han Chou¹,², Wei-Lun Chao³, Wei-Sheng Lai⁵, Min Sun², Ming-Hsuan Yang⁴,⁵
¹University of British Columbia  ²National Tsing Hua University  ³The Ohio State University
⁴University of California at Merced  ⁵Google
Figure 1: An example of our VQA 360◦ dataset. We introduce VQA 360◦, a novel task of visual question answering on
360◦ images, and collect the first real VQA 360◦ dataset, in which each image is annotated with around 11 questions of
five types (marked by different colors). The bounding boxes indicate where to look to infer the answers. Best viewed in
color. Example questions shown in the figure:
"Scene": Q: What room is depicted in the image? A: living room
"Exist": Q: Is there a plant in the bedroom? A: yes
"Counting": Q: How many chairs are there? A: more than 4
"Property": Q: What is the color of the vase at the right of the pictures? A: black
"Spatial": Q: Which side of the tv is the pictures? A: right side

Abstract

In this work, we introduce VQA 360◦, a novel task of visual question answering on 360◦ images. Unlike a normal
field-of-view image, a 360◦ image captures the entire visual content around the optical center of a camera, demanding
more sophisticated spatial understanding and reasoning. To address this problem, we collect the first VQA 360◦ dataset,
containing around 17,000 real-world image-question-answer triplets for a variety of question types. We then study two
different VQA models on VQA 360◦, including one conventional model that takes an equirectangular image (with intrinsic
distortion) as input and one dedicated model that first projects a 360◦ image onto cubemaps and subsequently aggregates
the information from multiple spatial resolutions. We demonstrate that the cubemap-based model with multi-level fusion
and attention diffusion performs favorably against other variants and the equirectangular-based models. Nevertheless, the
gap between the humans' and machines' performance reveals the need for more advanced VQA 360◦ algorithms. We,
therefore, expect our dataset and studies to serve as the benchmark for future development in this challenging task.
Dataset, code, and pre-trained models are available online.[1]

[1] http://aliensunmin.github.io/project/360-VQA/

1. Introduction

Visual question answering (VQA) has attracted significant attention recently across multiple research communities. In
this task, a machine needs to visually perceive the environment, understand human languages, and perform multimodal
reasoning—all of them are essential components to develop modern AI systems. Merely in the past three years, more than
two dozen datasets have been published, covering a wide variety of scenes, language styles, as well as reasoning
difficulties [2, 17, 19, 23, 35, 36, 48]. Together with those datasets are over a hundred algorithms being developed,
consistently shrinking the gap between humans' and machines' performance [4, 16, 24, 25, 26].

Despite such an explosive effort, existing work is constrained in the way a machine visually perceives the world.
Specifically, nearly all the datasets use normal field-of-view (NFOV) images taken by consumer cameras. Convolutional
neural networks (CNNs) that are carefully designed for such images [21, 37] have been necessary to extract powerful
visual features. Nevertheless, NFOV images are not the only way, and very likely not the most efficient way, for a machine
to interact with the world. For example, considering a 360◦ horizontally surrounding scene, the NFOV of a consumer
camera can only capture an 18% portion [42]. Such a fact, together with the reduced price of 360◦ cameras (e.g.,
Ricoh Theta S, Samsung Gear 360, and GoPro Omni), has motivated researchers to dig into 360◦ vision [9, 10, 22, 40].
We could imagine every robot being equipped with a 360◦ camera in the near future. It is thus desirable to extend VQA
to such an informative visual domain.

In this work, we make the first attempt toward VQA on 360◦ images (VQA 360◦). Two major challenges immediately
emerge. First, modern deep learning algorithms are heavily data consuming, yet so far, there is no publicly available
dataset for VQA 360◦. Second, 360◦ (i.e., equirectangular) images have intrinsic distortion and larger spatial coverage,
requiring a novel way to process visual inputs and perform sophisticated spatial reasoning. Specifically, a machine needs
to understand the spatial information in questions, search for answers across the entire 360◦ scene, and finally aggregate
the information to answer.

To resolve the first challenge, we collect the first real VQA 360◦ dataset, using 360◦ images from real-world scenes. Our
dataset contains about 17,000 image-question-answer triplets with human-annotated answers (see an example in Figure 1).
We have carefully taken the bias issue [19, 24], from which many existing VQA datasets suffer, into account in designing
our dataset. We thus expect our dataset to benefit the development of this novel task.

In addition, we study two models to address VQA 360◦. On the one hand, we use equirectangular images as input, similar
to conventional VQA models on NFOV images. On the other hand, to alleviate spatial distortion, we represent an input
360◦ image by six cubemaps [20]. Each map has its own spatial location and suffers less distortion (cf. Figure 2). We
develop a multi-level attention mechanism with spatial indexing to aggregate information from each cubemap while
performing reasoning. In this way, a machine can infer answers at multiple spatial resolutions and locations, effectively
addressing the algorithmic challenge of VQA 360◦. Moreover, the cubemap-based architecture is flexible enough to take
existing (pre-trained) VQA models as backbone feature extractors on cubemaps, effectively fusing multimodal information
and overcoming the limited-data issue.

We conduct extensive empirical studies to evaluate multiple variants of these models. The superior performance of the
cubemap-based model demonstrates the need to explicitly consider intrinsic properties of VQA 360◦, both visually and
semantically. By analyzing the gap between the machine's and the human's performance, we further suggest future
directions to improve algorithms for VQA 360◦.

Our contributions in this work are two-fold:

• We define a novel task named VQA 360◦. We point out the intrinsic difficulties compared to VQA on NFOV images.
We further collect the first real VQA 360◦ dataset, which is designed to include complicated questions specifically for
360◦ images.

• We comprehensively evaluate two kinds of VQA models for VQA 360◦, including one that can effectively handle
spatial distortion while performing multi-level spatial reasoning. We then point out future directions for algorithm design
for VQA 360◦.

Figure 2: 360◦ image and cubemaps. An equirectangular 360◦ image can be represented by six cubemaps (top, bottom,
left side, front, right side, and behind), each corresponding to a spatial location, to reduce spatial distortion.

2. Related Work

VQA models. Visual question answering requires comprehending and reasoning with visual (image) and textual
(question) information [47]. The mainstream of model architectures is to first learn the joint image-question representation
and then predict the answer through multi-way classification. In the first stage, two mechanisms, visual attention
[1, 44, 34] and multimodal fusion [16, 4], have been widely explored. For example, the stacked attention networks
(SANs) [45] were developed to perform multi-round attention for higher-level visual understanding. On the other hand,
Fukui et al. [16] proposed Multimodal Compact Bilinear pooling (MCB) to learn a joint representation, and Ben-Younes
et al. [4] developed a tensor-based Tucker decomposition to efficiently parameterize the bilinear interaction. Recently,
several works [8, 32, 33, 39] extended BERT [15] by developing new pre-training tasks to learn (bidirectional)
transformers [43] for joint image and text representations.

Despite the variety of architectures, most existing methods directly apply CNNs to the whole NFOV image to extract
(local) features, which may not be suitable for 360◦ images. In this paper, we explore a different architecture that extracts
CNN features from the cubemap representations of a 360◦ image and then fuses features across cubemaps. The
cubemap-based model shares some similarity with [1, 45], yet we apply multiple rounds of attention at different spatial
resolutions, one within and one across cubemaps, so as to achieve better spatial understanding.

VQA datasets. There have been over two dozen VQA
datasets on NFOV images published in recent years. Most of them aim for open-ended answering [2, 19, 30], providing
one or multiple correct answers for each pair of image and question [6, 48]. An alternative setting is multiple-choice
answering: a set of candidate answers is provided for each question, in which one of them is correct. Our VQA 360◦
dataset belongs to the first category but focuses on a very different input domain, 360◦ images.

We note that there are two emerging VQA tasks, embodied QA [13] and interactive QA [18], that require a machine to
interact with a 3D environment (e.g., turn right or move closer). Our dataset and task are different in two aspects. First,
we work on real-world scenes, while both of them are on synthetic ones. Second, we take 360◦ images as input while they
take NFOV images; a machine there has to take actions to explore the environment, which is less efficient.

360◦ vision. With the growing popularity of virtual reality (VR) and augmented reality (AR), 360◦ images and videos
have attracted increasing attention lately. One of the interesting problems is to automatically navigate a 360◦
video [22, 40, 42] or create a fast-forward summary [31]. Other research topics include 360◦ video stabilization [29],
compression [41], saliency prediction [9], depth estimation [14], and object detection [11, 40]. Recently, Chou et al. [10]
study visual grounding to localize objects in a 360◦ video for a given narrative, while Chen et al. [7] explore natural
language navigation in 360◦ street environments. In contrast to these tasks, VQA on 360◦ images requires further inferring
the answers according to questions, demanding more sophisticated reasoning about the scene.

3. VQA 360◦ Dataset

We first present the proposed VQA 360◦ dataset to give a clear look at the task and its intrinsic challenges. We begin
with the dataset construction, including image collection, question generation, and answer annotation. We then provide
detailed statistics for our VQA 360◦ dataset.

3.1. Image Collection

We focus on indoor scenes as they are usually dense with content such as objects, making them suitable for developing
algorithms for sophisticated reasoning. In contrast, outdoor scenes, like those in [22, 31, 41, 42], capture certain
(ego-centric) activities and have sparse content, which makes them more suitable for summarization or navigation.

We collect 360◦ images of indoor scenes from two publicly accessible datasets, Stanford 2D-3D [3] and Matterport3D [5].
Both datasets provide useful side information such as scene types and semantic segmentation, which benefits question
generation. There are about 23 different scenes, including common areas in houses (e.g., bathroom, kitchen, bedroom,
etc.) and workplaces (e.g., office, conference room, auditorium, etc.). To maximize the image diversity, we discard images
captured in the same room but with different viewpoints. In total, we collect 744 images from the Stanford 2D-3D dataset
and 746 images from the Matterport3D dataset.

All the 360◦ images are stored in the equirectangular format and resized to 1024 × 512. The equirectangular projection
maps the latitude and longitude of a sphere to horizontal and vertical lines (e.g., a point at the top of the sphere is mapped
to a straight line in the equirectangular image), which inevitably introduces heavy spatial distortion.
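To make the distortion concrete, the snippet below is a minimal NumPy sketch (not from the paper) of the standard
equirectangular parameterization for a W × H image such as 1024 × 512; it shows that each pixel near the poles covers a
far smaller horizontal angular extent than a pixel near the equator, which is the source of the stretching described above.

```python
import numpy as np

def pixel_to_sphere(u, v, width=1024, height=512):
    """Map an equirectangular pixel (u, v) to (longitude, latitude) in radians.

    Longitude spans [-pi, pi) across the image width; latitude spans
    [pi/2, -pi/2] from the top row to the bottom row.
    """
    lon = (u + 0.5) / width * 2.0 * np.pi - np.pi
    lat = np.pi / 2.0 - (v + 0.5) / height * np.pi
    return lon, lat

def horizontal_footprint(v, width=1024, height=512):
    """Angular width (radians) on the sphere covered by one pixel in row v.

    Every row has the same number of pixels, but a row near a pole lies on a
    circle of circumference proportional to cos(latitude), so each of its
    pixels corresponds to a much smaller real-world extent.
    """
    _, lat = pixel_to_sphere(0, v, width, height)
    return (2.0 * np.pi / width) * np.cos(lat)

# A pixel in the middle row covers roughly 0.35 degrees of the scene
# horizontally, while a pixel near the top row covers almost nothing.
print(np.degrees(horizontal_footprint(256)), np.degrees(horizontal_footprint(2)))
```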
3.2. Question Generation

We design several question templates (cf. Table 1) and use them, together with the semantic segmentation and scene types
associated with each 360◦ image[2], to automatically generate questions. Our templates contain five different types:
"scene", "exist", "counting", "property", and "spatial". While imposing templates limits the diversity of questions, the
main purpose of our dataset is to promote VQA on a new visual domain that has larger spatial coverage and complexity.
As illustrated in Figure 1, a 360◦ image can easily contain multiple objects distributed at multiple locations. We thus
specifically design the question templates—either including spatial specifications or asking for spatial reasoning—to
disambiguate the questions and encourage machines to acquire better spatial understanding. For instance, to answer
"What is the color of the vase at the right of the pictures?" in Figure 1, a machine needs to first find the pictures
(rightmost), look to their right to find the vase, and return the color[3]. To answer "Which side of the TV is the pictures?",
a machine needs to detect the TV and the pictures, and then return their relative spatial relation in the scene. Both
examples require visual and spatial understanding at multiple resolutions and locations, which are scarce in existing VQA
datasets on NFOV images (see the supplementary material for details). On average, we create 11 questions per image.

3.3. Answer Annotations & Question Refinements

We resort to human annotators to provide precise answers. We ask 20 in-house annotators to answer the questions in our
dataset. To avoid synonymous words and to ease the process, we offer candidate answers according to the question types
for annotators to select directly. Annotators can also type free-form answers if none of the candidates is applicable. We
note that the automatically generated questions might be irrelevant to the image or lead to ambiguous answers[4]. In such
cases, we instruct the annotators to slightly modify the questions—e.g., by adding spatial specifications—to make them
image-related or identifiable. We also instruct annotators to draw bounding boxes (for a subset of image-question pairs),
which indicate specific objects or locations associated with the answer. Such information facilitates the analysis of model
performance.

[2] We can obtain room types and objects appearing in the scenes.
[3] There are three vases in Figure 1. Adding spatial specifications is thus necessary, and different specifications will lead
to different answers.
[4] For instance, if there are two chairs with different colors, a question "What is the color of the chair?" will lead to
ambiguous answers.
Q type   | Template                                  | Example                                                                  | Answer
Scene    | What room is depicted in the image?       | What room is depicted in the image?                                      | bedroom/...
Exist    | Is/Are there (a) <obj1> ...?              |                                                                          | yes/no
         |   + in the <scene>                        | Is there a chair in the kitchen?                                         |
         |   + <direc>                               | Is there a chair at my right side?                                       |
         |   + <direc> of the <obj2>                 | Is there a chair at the right side of the window?                        |
         |   + <direc> of the <obj2> in the <scene>  | Is there a chair at the right side of the window in the kitchen?         |
Counting | How many <obj1> are ...?                  |                                                                          | 0/1/2/...
         |   + in the <scene>                        | How many chairs are in the kitchen?                                      |
         |   + <direc>                               | How many chairs are at my right side?                                    |
         |   + <direc> of the <obj2>                 | How many chairs are at the right side of the window?                     |
         |   + <direc> of the <obj2> in the <scene>  | How many chairs are at the right side of the window in the kitchen?      |
Property | What is the (<color>) <obj1> made of? /   |                                                                          | plastic/wood/...
         | What is the color of the <obj1> ...?      |                                                                          | red/brown/...
         |   + in the <scene>                        | What is the red sofa in the bedroom made of?                             |
         |   + <direc>                               | What is the red sofa at my right side made of?                           |
         |   + <direc> of the <obj2>                 | What is the color of the sofa at the right of the window?                |
         |   + <direc> of the <obj2> in the <scene>  | What is the color of the sofa at the right of the window in the bedroom? |
Spatial  | Where can I find the <obj1>? /            |                                                                          | in front of you/...
         | Which side of the <obj1> is the <obj2>?   |                                                                          | right side/...
         |   + <color>                               | Where can I find the white flowers?                                      |
         |   + <material>                            | Which side of the white chair is the wooden door?                        |

Table 1: Question templates and examples. We design the above question templates and utilize the scene types and
semantic segmentation of the images to automatically generate questions.
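As a concrete illustration of this template-filling step, the snippet below is a minimal sketch (not the authors' code; the
per-image dictionary, helper names, and slot choices are hypothetical) of how questions of the five types could be
instantiated from a scene label and a segmentation-derived object list:

```python
import random

# Hypothetical per-image side information derived from scene labels and
# semantic segmentation (cf. Section 3.2).
image_info = {
    "scene": "kitchen",
    "objects": ["chair", "window", "vase"],
}

TEMPLATES = {
    "scene":    ["What room is depicted in the image?"],
    "exist":    ["Is there a {obj1} in the {scene}?",
                 "Is there a {obj1} at the {direc} side of the {obj2}?"],
    "counting": ["How many {obj1}s are in the {scene}?",
                 "How many {obj1}s are at the {direc} side of the {obj2}?"],
    "property": ["What is the color of the {obj1} at the {direc} of the {obj2}?"],
    "spatial":  ["Which side of the {obj1} is the {obj2}?"],
}

def generate_questions(info, num_per_type=2):
    """Fill templates with objects/scene/directions; answers come later from annotators."""
    questions = []
    for q_type, templates in TEMPLATES.items():
        for _ in range(num_per_type):
            template = random.choice(templates)
            obj1, obj2 = random.sample(info["objects"], 2)
            questions.append((q_type, template.format(
                obj1=obj1, obj2=obj2,
                scene=info["scene"],
                direc=random.choice(["left", "right"]))))
    return questions

for q_type, question in generate_questions(image_info):
    print(f"[{q_type}] {question}")
```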

Training Validation Test


#images 743 148 599
QA pairs 8227 1756 6962
#unique answers 51 51 53
#Scene type Q 765 150 614
#Counting type Q 1986 495 1934
#Existed type Q 2015 417 1655
#Property type Q 1355 322 1246
#Spatial type Q 2106 372 1513

Table 2: Summary of 360◦ VQA dataset. We summa-


rize the number of images, QA pairs, and unique answers in
Figure 3: Distribution of answers. We balance our dataset
each split of our dataset. We also provide a detailed statistic
such that the answers of the same question type appear uni-
for each type of question.
formly (e.g., “yes/no”, “0/1”, and “right side/left side”).

3.4. Dataset Statistics

Our VQA 360◦ dataset consists of 1,490 images and 16,945 question-answer pairs, which are split into the training,
validation, and test sets with 50%, 10%, and 40% of the images, respectively. We summarize the statistics in Table 2 and
show the distribution of the top 20 answers in Figure 3. We note that each question type has at least 2 corresponding
answers among the top 20. Moreover, answers from the same question type appear a similar number of times (e.g.,
"yes/no", "0/1", "right/left side"), preventing a machine from cheating by predicting the dominant answer. For question
types with a few unique answers, we make sure that the unique answers appear almost uniformly to minimize dataset bias.

4. VQA 360◦ Models

In this section, we study two VQA models, including one dedicated to resolving the inherent challenges in VQA 360◦.

Notations and problem definitions. Given a question q and an image i, a machine needs to generate the answer a. One
common VQA approach is to first extract visual features f_i = F_I(i) and question features f_q = F_Q(q), followed by a
multimodal representation g_{iq} = G(f_i, f_q). The multimodal representation is then input into a classifier C(·) of K
classes, corresponding to the top K frequent answers, to generate the answer a. Representative choices for F_I(·) and
F_Q(·) are CNN and RNN models [45], respectively.
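As a reference point for the notation above, the following is a minimal PyTorch-style sketch (not the exact MLB
architecture used later; layer sizes and the Hadamard-style fusion operator are illustrative assumptions) of the generic
pipeline f_i = F_I(i), f_q = F_Q(q), g_{iq} = G(f_i, f_q), a = C(g_{iq}):

```python
import torch
import torch.nn as nn
import torchvision.models as models

class SimpleVQA(nn.Module):
    """Generic VQA pipeline: CNN image encoder, GRU question encoder,
    a multimodal fusion G, and a K-way answer classifier C."""

    def __init__(self, vocab_size, num_answers=51, hidden=512):
        super().__init__()
        resnet = models.resnet152(weights=None)            # F_I: visual features
        self.cnn = nn.Sequential(*list(resnet.children())[:-2])
        self.img_proj = nn.Linear(2048, hidden)
        self.embed = nn.Embedding(vocab_size, 300)         # F_Q: question features
        self.gru = nn.GRU(300, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, num_answers)   # C: top-K answers

    def forward(self, image, question_tokens):
        f_i = self.cnn(image)                              # (B, 2048, 14, 14) for a 448x448 input
        f_i = self.img_proj(f_i.mean(dim=(2, 3)))          # global pooling for this sketch
        _, h = self.gru(self.embed(question_tokens))       # f_q: (1, B, hidden)
        f_q = h.squeeze(0)
        g_iq = f_i * f_q                                   # G: a simple Hadamard-style fusion
        return self.classifier(g_iq)                       # logits over K answers

model = SimpleVQA(vocab_size=10000)
logits = model(torch.randn(2, 3, 448, 448), torch.randint(0, 10000, (2, 12)))
print(logits.shape)  # torch.Size([2, 51])
```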
Figure 4: VQA 360◦ models. We propose a cubemap-based architecture that first extracts visual features from the
cubemaps of the input 360◦ image and then performs bottom-up multi-level attention and feature aggregation. Panels:
(a) overall model (a cubemap extractor feeds each cubemap, together with the GRU-encoded question, into a shared
CNN + VQA backbone with per-cubemap attention), (b) Tucker fusion for computing the cross-cubemap attention
weights, (c) attention diffusion, and (d) fusion, aggregation, and answer prediction.

4.1. Equirectangular-based Models

As the most common format to store and display a 360◦ image is the equirectangular projection onto a 2D array, we can
directly apply existing (pre-trained) VQA models to VQA 360◦. We take the Multimodal Low-rank Bilinear Attention
Network (MLB) model [26] as an example, which adopts an efficient bilinear interaction for G(f_i, f_q). We first extract
the visual features f_i with a pre-trained ResNet-152 [21] and adopt Gated Recurrent Units (GRU) [12, 28] to extract the
question features f_q. We then input the resulting g_{iq} = G(f_i, f_q) into a fully-connected layer with K output units to
build a K-way classifier C(·). We optimize the whole network using the training set of our VQA 360◦ dataset and set K
to be the number of unique training answers (i.e., 51).

The MLB model G(f_i, f_q) pre-trained on the VQA-1 [2] dataset requires f_i to retain a 14 × 14 spatial resolution,
equivalent to inputting a 448 × 448 image to the ResNet. We thus adopt a few strategies, including cropping or resizing
the original 360◦ image, or inputting the original image while resizing the output ResNet features to a 14 × 14 spatial
resolution with an average pooling layer. We analyze these strategies in Section 5.

Challenges. While the above strategies allow us to exploit VQA models pre-trained on much larger NFOV datasets (e.g.,
VQA-1 [2]), applying CNNs directly to 360◦ images suffers from the inherent spatial distortion [40]. On the other hand,
adopting specifically designed spherical convolutions [40] prevents us from leveraging existing models and pre-trained
weights. An intermediate solution that takes both concerns into account is thus desirable.

Moreover, existing VQA models like MLB [26] and SAN [45] only consider a single visual resolution when performing
feature aggregation in G(f_i, f_q). For 360◦ images that cover a large spatial range, a more sophisticated mechanism that
involves multiple resolutions of feature aggregation is required. To this end, we propose a cubemap-based model to
simultaneously tackle the above challenges.

4.2. Cubemap-based Models

To reduce spatial distortion, we first represent a 360◦ image by six non-overlapping cubemaps, {i^(j)}_{j=1}^{J}, via the
perspective projection (cf. Figure 2; see the supplementary material for details). Each cubemap corresponds to a specific
portion of the 360◦ image with less distortion. Collectively, the cubemaps can recover the original image. This
representation naturally leads to a bottom-up architecture that begins with local region understanding and then performs
global reasoning (cf. Figure 4).

In the first stage, we can apply any existing VQA model, e.g., MLB [26], to each cubemap individually, resulting in J
local multimodal representations:

    g_{iq}^{(j)} = G(f_i^{(j)}, f_q),                                                       (1)

where f_i^{(j)} denotes the visual features of the j-th cubemap.

Bottom-up multi-level attention. In the second stage, the main challenge is to effectively aggregate information from the
cubemaps. While average and max pooling have been widely used, they simply ignore the location associated with each
cubemap. We thus resort to an attention mechanism:

    g_i = \sum_{j=1}^{J} \alpha^{(j)} g_{iq}^{(j)},   s.t.  \alpha^{(j)} \geq 0,  \sum_j \alpha^{(j)} = 1.          (2)

The attention weight α^(j) can be computed according to information about each cubemap, including its location, making
aggregation more flexible. As many existing VQA models already apply an attention mechanism within the input image
[26, 45] (e.g., a cubemap in our case), the attention that aggregates across cubemaps is effectively a second level of
attention at a coarser resolution.
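A minimal sketch of this first stage and of the attention-based aggregation in (1)-(2) is given below (not the authors'
implementation; the backbone, dimensions, and the simple linear scoring layer that computes α^(j) are illustrative
assumptions; the paper's Tucker-fusion variant of the scoring step is described next):

```python
import torch
import torch.nn as nn

class CubemapAggregator(nn.Module):
    """Per-cubemap fusion (Eq. 1) followed by attention-weighted aggregation (Eq. 2).

    `backbone` stands for any per-image VQA feature extractor G(f_i, f_q),
    e.g., an MLB model with its classifier removed (an assumption here).
    """

    def __init__(self, backbone, feat_dim=512, q_dim=512, num_cubemaps=6):
        super().__init__()
        self.backbone = backbone            # shared weights across the J cubemaps
        self.att = nn.Linear(feat_dim + num_cubemaps + q_dim, 1)

    def forward(self, cubemaps, f_q):
        # cubemaps: (B, J, 3, 448, 448); f_q: (B, q_dim)
        B, J = cubemaps.shape[:2]
        # Eq. (1): one multimodal feature per cubemap, shape (B, J, feat_dim)
        g = torch.stack([self.backbone(cubemaps[:, j], f_q) for j in range(J)], dim=1)
        loc = torch.eye(J, device=g.device).expand(B, J, J)    # one-hot location indicator l^(j)
        scores = self.att(torch.cat([g, loc, f_q.unsqueeze(1).expand(B, J, -1)], dim=-1))
        alpha = torch.softmax(scores, dim=1)                   # (B, J, 1), sums to 1 over cubemaps
        return (alpha * g).sum(dim=1)                          # g_i, Eq. (2)

# Usage with a dummy backbone that ignores the question (for illustration only):
backbone = lambda img, q: torch.randn(img.shape[0], 512)
agg = CubemapAggregator(backbone)
g_i = agg(torch.randn(2, 6, 3, 448, 448), torch.randn(2, 512))
print(g_i.shape)  # torch.Size([2, 512])
```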
We apply Tucker fusion T(·,·) [4] to compute the attention weights according to the cubemap feature g_{iq}^{(j)}, the
location indicator l^(j), and the question feature f_q; Tucker fusion has been shown to be effective and efficient in fusing
information from multiple modalities. The resulting α^(j) is as follows,

    \alpha^{(j)} = \mathrm{softmax}\{ T([l^{(j)}, g_{iq}^{(j)}], f_q) \},                    (3)

where [·,·] denotes concatenation. The softmax is performed over j ∈ {1, ..., J}. We use a one-hot vector l^(j) to encode
the cubemap location. In this way, the attention weights can zoom into the cubemap location mentioned in the question.

Attention diffusion. The attention weights computed by (3), however, do not explicitly consider the spatial relationship
across cubemaps. For a question like "Is there a chair at the right side of the window?", we would expect the model to
first attend to the cubemap that contains the window, and then shift its attention to the cubemap at its right. To incorporate
such a capability, we learn a diffusion matrix M(f_q) conditioned on the question f_q: the entry M(f_q)_{u,v} indicates
how much attention is shifted from cubemap v to cubemap u. The resulting formula for g_i in (2) becomes:

    g_i = \sum_{u=1}^{J} \Big( \sum_{v=1}^{J} M(f_q)_{u,v} \, \alpha^{(v)} \Big) g_{iq}^{(u)},   s.t.  \sum_{u=1}^{J} M(f_q)_{u,v} = 1.     (4)

Answer prediction. The resulting feature g_i in (4) or (2) then undergoes another Tucker fusion to extract higher-level
image-question interactions before being input into the classifier C(·). We can also replace g_{iq}^{(j)} in (4) or (2) by
the concatenation of g_{iq}^{(j)} and l^(j) to incorporate location cues into g_i. This strategy is, however, meaningless
for average or max pooling—it simply results in an all-one vector. We illustrate the overall model architecture in Figure 4.
More details are included in the supplementary material.
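The two attention stages in (3)-(4) can be sketched as follows (again not the authors' code: a plain projected Hadamard
interaction stands in for the full Tucker decomposition, and all dimensions are illustrative):

```python
import torch
import torch.nn as nn

class DiffusedCubemapAttention(nn.Module):
    """Question-guided attention over J cubemaps (Eq. 3) with a learned,
    question-conditioned diffusion matrix M(f_q) that shifts attention
    between cubemaps (Eq. 4)."""

    def __init__(self, feat_dim=512, q_dim=512, J=6, hidden=256):
        super().__init__()
        self.J = J
        # Stand-in for Tucker fusion T([l, g], f_q): project both inputs and
        # combine them with an elementwise product before scoring.
        self.proj_g = nn.Linear(feat_dim + J, hidden)
        self.proj_q = nn.Linear(q_dim, hidden)
        self.score = nn.Linear(hidden, 1)
        # M(f_q): J*J entries predicted from the question, normalized per column.
        self.diffusion = nn.Linear(q_dim, J * J)

    def forward(self, g, f_q):
        # g: (B, J, feat_dim) per-cubemap multimodal features; f_q: (B, q_dim)
        B = g.shape[0]
        loc = torch.eye(self.J, device=g.device).expand(B, self.J, self.J)       # l^(j)
        fused = torch.tanh(self.proj_g(torch.cat([loc, g], dim=-1))) * \
                torch.tanh(self.proj_q(f_q)).unsqueeze(1)
        alpha = torch.softmax(self.score(fused).squeeze(-1), dim=1)              # Eq. (3), (B, J)
        M = torch.softmax(self.diffusion(f_q).view(B, self.J, self.J), dim=1)    # columns sum to 1
        diffused = torch.bmm(M, alpha.unsqueeze(-1)).squeeze(-1)                 # shift attention, Eq. (4)
        return (diffused.unsqueeze(-1) * g).sum(dim=1)                           # g_i: (B, feat_dim)

att = DiffusedCubemapAttention()
g_i = att(torch.randn(2, 6, 512), torch.randn(2, 512))
print(g_i.shape)  # torch.Size([2, 512])
```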
5. Experimental Results

5.1. Setup

Variants of cubemap-based models. The cubemap-based model can take any existing VQA model as the backbone. We
choose the MLB model [26], a bilinear multimodal fusion and attention model. We experiment with other VQA
backbones [4, 38] in the supplementary material to demonstrate the applicability of the cubemap-based models. We
remove the fully-connected layer of the original MLB model to extract multimodal features. We apply the pre-trained
MLB model to each cubemap of size 448 × 448, and consider the following three different aggregation schemes before
performing the final answer prediction.

• CUBEMAP-AVGPOOL: apply average pooling on g_{iq}^{(j)}.
• TUCKER: attention weights computed by the Tucker fusion in (3).
• TUCKER&DIFFUSION: attention weights computed by Tucker fusion followed by the diffusion in (4).

Variants of equirectangular-based models. We consider four ways to apply MLB to the equirectangular images (a sketch
of the first three preprocessing strategies follows this list).

• CENTRAL-CROP: resize the shorter side of the image to 448 to preserve the aspect ratio and then crop the image to
448 × 448 to extract ResNet features.
• RESIZE: resize the image to 448 × 448 without any cropping and extract ResNet features.
• RESNET-AVGPOOL: resize the shorter side of the image to 448 and apply an average pooling layer on the ResNet
output to obtain 14 × 14 resolution features.
• DIRECT-SPLIT: split an equirectangular image into 2 × 3 patches, resize each to 448 × 448, apply MLB, and then
apply TUCKER&DIFFUSION to aggregate information for predicting the answer.

Note that the DIRECT-SPLIT and TUCKER&DIFFUSION models have the same architecture but different inputs.
We choose the MLB model [26], a bilinear multimodal fu- ble 3 examines the dataset bias, which predicts the most
sion and attention model. We experiment with other VQA frequent answer of each question type. The inferior results
backbones [4, 38] in the supplementary material to demon- suggest a low language bias in our dataset. Specifically, for
strate the applicability of the cubemap-based models. “exist” type questions that only have two valid answers each
We remove the fully-connected layer of the original (i.e, “yes” or “no”), using language prior is close to random
MLB model to extract multimodal features. We apply the guess. Machines need to rely on images to answer.

1601

Authorized licensed use limited to: Auckland University of Technology. Downloaded on July 14,2020 at 07:46:54 UTC from IEEE Xplore. Restrictions apply.
Model                    Variants            Overall avg   Avg by type   Scene   Exist   Counting   Property   Spatial
Q-TYPE PRIOR             -                   33.50         31.71         25.41   55.47   33.56      21.99      22.14
Equirectangular-based    CENTRAL-CROP        53.39         54.07         60.66   75.00   47.10      50.16      37.45
Equirectangular-based    RESIZE              54.21         55.77         68.46   75.66   47.31      51.48      35.96
Equirectangular-based    RESNET-AVGPOOL      54.47         56.14         69.34   76.81   46.32      50.96      37.25
Equirectangular-based⋆   RESNET-AVGPOOL      54.15         55.55         67.48   77.17   46.17      49.04      37.90
Equirectangular-based    DIRECT-SPLIT        54.77         56.59         71.36   75.75   46.68      49.56      39.62
Cubemap-based            CUBEMAP-AVGPOOL     54.60         56.23         69.17   76.22   46.79      51.72      37.26
Cubemap-based            TUCKER              57.71         59.07         69.89   77.23   46.53      48.24      53.47
Cubemap-based            TUCKER&DIFFUSION    58.66         60.26         72.01   76.34   46.84      50.12      55.98
Cubemap-based⋆           TUCKER&DIFFUSION    54.09         55.54         67.65   76.16   45.91      48.60      39.39

Table 3: Quantitative results on the VQA 360◦ test set. The ⋆ models are trained from scratch on the VQA 360◦ training
set without pre-training on VQA-1. The best result of each column is marked in bold in the original paper.

Equirectangular-based models. As shown in Table 3, the RESNET-AVGPOOL model outperforms CENTRAL-CROP and
RESIZE, indicating the poor applicability of cropping and resizing to 360◦ images. Since 360◦ images have large spatial
coverage, in which objects might be of small size, resizing will miss those small objects while central cropping will lose
50% of the image content.

Cubemaps vs. equirectangular input. One major issue of applying existing VQA models directly to 360◦ images is the
spatial distortion. This is justified by the fact that all the equirectangular-based models are outperformed by all the
cubemap-based models (except the CUBEMAP-AVGPOOL one) in the overall performance. Specifically, by comparing
DIRECT-SPLIT and TUCKER&DIFFUSION, whose main difference is the input, the 3~4% performance gap clearly
reflects the influence of distortion. By looking into different question types, we also observe consistent improvements
from applying cubemaps.

Pre-training. Comparing the models with ⋆ (trained from scratch) and without ⋆ (with pre-training), the pre-trained
weights (from the VQA-1 dataset) benefit the overall performance, especially for the cubemap-based models.

Attention. Applying cubemaps resolves one challenge of VQA 360◦: spatial distortion. We argue that a sophisticated way
to aggregate cubemap features to support spatial reasoning is essential to further boost the performance. This is shown by
the improvement of TUCKER&DIFFUSION and TUCKER over CUBEMAP-AVGPOOL: the former two apply attention
mechanisms guided by questions and cubemap locations for multi-level attention. Specifically, TUCKER&DIFFUSION
outperforms CUBEMAP-AVGPOOL by a notable 3.4% in Avg. by Q type, mostly from the "spatial" question type.
TUCKER&DIFFUSION with spatial diffusion also outperforms TUCKER on all the question types.

Location feature. Concatenating l^(j) with g_{iq}^{(j)} in (2) and (4) enables our model to differentiate cubemaps.
Table 4 compares TUCKER&DIFFUSION and TUCKER with and without l^(j). The location indicator leads to consistent
improvement, especially on the "spatial" type questions.

Model                      Avg.    Avg. by Q type   Spatial
TUCKER (w/o)               53.81   53.81            36.09
TUCKER (w/)                57.71   59.07            53.47
TUCKER&DIFFUSION (w/o)     54.91   56.51            39.13
TUCKER&DIFFUSION (w/)      58.66   60.26            55.98

Table 4: Comparison with (w/) and without (w/o) the location feature.

Model      Overall   Scene   Exist   Counting   Property   Spatial
Human      84.05     88.95   91.79   71.58      89.97      85.25
Machine    59.80     68.89   77.12   49.65      45.81      61.97

Table 5: Results of human evaluation. We also include the machine's performance on the same 1,000 questions to analyze
the gap between humans and machines.

Human Evaluation. We conduct a user study on our VQA 360◦ dataset. We sample 1,000 image-question-answer triplets
from the test set and ask at least two different users to answer each question. To ease the process, we give users five
candidate answers, including the correct answer and four other answers that are semantically related to the question.
There are a total of 50 unique users participating in the user study. We note that the annotators labeling our dataset are not
involved in the human evaluation to avoid any bias.

We summarize the results of the human evaluation and the machine's predictions[5] in Table 5. Humans achieve an
84.05% overall accuracy, which is at the same level as many existing VQA datasets [2, 6, 46] and is much higher than
that of another dataset on indoor images [35], justifying the quality of our VQA 360◦ dataset. Among the five question
types, humans perform relatively poorly on "counting", which makes sense due to the complicated contents of 360◦
images and the possibly small objects. Overall, there is about a 25% performance gap between humans and machines.
The gap is especially large on the "counting", "property", and "spatial" types, suggesting the directions to improve
algorithms so as to match humans' inference abilities.

[5] We use our best cubemap-based model, TUCKER&DIFFUSION.
Figure 5: Visualization of attention. We use the cubemap-based model TUCKER&DIFFUSION as it performs the best.
The digits below the cubemaps indicate the attention across cubemaps; the heat maps indicate the attention within
cubemaps. The four examples shown are: "Which side of the window is the painting?" (GT: right side / Pred: right side),
"What room is depicted in the image?" (GT: hallway / Pred: hallway), "Where can I find the bed?" (GT: at your left side /
Pred: at your left side), and "Which side of the door is the white board?" (GT: left side / Pred: right side).
Qualitative results. We present qualitative results in Figure 5. Besides showing the predicted answers, we visualize the
attention weights across cubemaps (by the digits) and within cubemaps (by the heat maps). The cubemap-based model
with TUCKER&DIFFUSION can zoom in to the cubemaps related to the questions, capture the answer regions, and
aggregate them to predict the final answers. Take the question "Which side of the window is the painting?" for example
(the top-left one in Figure 5). The model puts high attention on the cubemaps with windows and pictures and is able to
infer the relative location. For the question "What room is depicted in the image?" (the top-right of Figure 5), the model
distributes attention to all cubemaps except the top and bottom ones to gather information from them. We also show a
failure case in the bottom-right of Figure 5. The question asks "Which side of the door is the whiteboard?". However, the
model mistakenly recognizes the window as the whiteboard and incorrectly answers "right side".

6. Discussion and Conclusion

We introduce VQA 360◦, a novel VQA task on a challenging visual domain, 360◦ images. We collect the first VQA 360◦
dataset and experiment with multiple VQA models. We then present a multi-level attention model to effectively handle
spatial distortion (via cubemaps) and perform sophisticated reasoning. Experimental results demonstrate the need to
explicitly model the intrinsic properties of 360◦ images, while the noticeable gap between humans' and machines'
performance reveals the difficulty of reasoning on 360◦ images compared to NFOV images.

We surmise that the gap may partially be attributed to the hand-crafted cubemap cropping. On the one hand, objects that
appear around the cubemap boundaries may be split. On the other hand, it requires specifically designed mechanisms
(e.g., the attention diffusion in (4)) to reason about the spatial relationships among cubemaps. These issues likely explain
the human-machine gap on the "counting" and "spatial" questions. Thus, to advance VQA 360◦, we suggest developing
image-dependent cropping that detects objectness regions from the equirectangular images. We also suggest developing a
back-projection-and-inference mechanism that back-projects the detected objects into the 360◦ environment and performs
reasoning accordingly. Besides, the current questions are generated (or initialized) from templates; future work is to
include more human effort to increase the question diversity. We expect our dataset and studies to serve as the benchmark
for future developments.

Acknowledgments. This work is supported in part by NSF CAREER (#1149783) and MOST 108-2634-F-007-006 Joint
Research Center for AI Technology and All Vista Healthcare, Taiwan.
References

[1] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In CVPR, 2018.
[2] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. Lawrence Zitnick, and D. Parikh. VQA: Visual question answering. In ICCV, 2015.
[3] I. Armeni, A. Sax, A. R. Zamir, and S. Savarese. Joint 2D-3D-semantic data for indoor scene understanding. arXiv, 2017.
[4] H. Ben-Younes, R. Cadene, M. Cord, and N. Thome. MUTAN: Multimodal Tucker fusion for visual question answering. In ICCV, 2017.
[5] A. Chang, A. Dai, T. Funkhouser, M. Halber, M. Niessner, M. Savva, S. Song, A. Zeng, and Y. Zhang. Matterport3D: Learning from RGB-D data in indoor environments. In 3DV, 2017.
[6] W.-L. Chao, H. Hu, and F. Sha. Being negative but constructively: Lessons learnt from creating better visual question answering datasets. In NAACL, 2018.
[7] H. Chen, A. Suhr, D. Misra, N. Snavely, and Y. Artzi. Touchdown: Natural language navigation and spatial reasoning in visual street environments. In CVPR, 2019.
[8] Y.-C. Chen, L. Li, L. Yu, A. E. Kholy, F. Ahmed, Z. Gan, Y. Cheng, and J. Liu. UNITER: Learning universal image-text representations. arXiv preprint arXiv:1909.11740, 2019.
[9] H.-T. Cheng, C.-H. Chao, J.-D. Dong, H.-K. Wen, T.-L. Liu, and M. Sun. Cube padding for weakly-supervised saliency prediction in 360◦ videos. In CVPR, 2018.
[10] S.-H. Chou, Y.-C. Chen, K.-H. Zeng, H.-N. Hu, J. Fu, and M. Sun. Self-view grounding given a narrated 360◦ video. In AAAI, 2018.
[11] S.-H. Chou, C. Sun, W.-Y. Chang, W.-T. Hsu, M. Sun, and J. Fu. 360-Indoor: Towards learning real-world objects in 360◦ indoor equirectangular images. arXiv, 2019.
[12] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv, 2014.
[13] A. Das, S. Datta, G. Gkioxari, S. Lee, D. Parikh, and D. Batra. Embodied question answering. In CVPR, 2018.
[14] G. P. de La Garanderie and A. Atapour. Eliminating the blind spot: Adapting 3D object detection and monocular depth estimation to 360◦ panoramic imagery. In ECCV, 2018.
[15] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv, 2018.
[16] A. Fukui, D. H. Park, D. Yang, A. Rohrbach, T. Darrell, and M. Rohrbach. Multimodal compact bilinear pooling for visual question answering and visual grounding. arXiv, 2016.
[17] H. Gao, J. Mao, J. Zhou, Z. Huang, L. Wang, and W. Xu. Are you talking to a machine? Dataset and methods for multilingual image question. In NIPS, 2015.
[18] D. Gordon, A. Kembhavi, M. Rastegari, J. Redmon, D. Fox, and A. Farhadi. IQA: Visual question answering in interactive environments. In CVPR, 2018.
[19] Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, and D. Parikh. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In CVPR, 2017.
[20] N. Greene. Environment mapping and other applications of world projections. IEEE CGA, 1986.
[21] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
[22] H.-N. Hu, Y.-C. Lin, M.-Y. Liu, H.-T. Cheng, Y.-J. Chang, and M. Sun. Deep 360 pilot: Learning a deep agent for piloting through 360◦ sports videos. In CVPR, 2017.
[23] J. Johnson, B. Hariharan, L. van der Maaten, L. Fei-Fei, C. L. Zitnick, and R. Girshick. CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. In CVPR, 2017.
[24] K. Kafle and C. Kanan. An analysis of visual question answering algorithms. In ICCV, 2017.
[25] V. Kazemi and A. Elqursh. Show, ask, attend, and answer: A strong baseline for visual question answering. arXiv, 2017.
[26] J.-H. Kim, K.-W. On, W. Lim, J. Kim, J.-W. Ha, and B.-T. Zhang. Hadamard product for low-rank bilinear pooling. In ICLR, 2017.
[27] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
[28] R. Kiros, Y. Zhu, R. R. Salakhutdinov, R. Zemel, R. Urtasun, A. Torralba, and S. Fidler. Skip-thought vectors. In NIPS, 2015.
[29] J. Kopf. 360◦ video stabilization. ACM TOG, 2016.
[30] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma, M. S. Bernstein, and F.-F. Li. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. IJCV, 2017.
[31] W.-S. Lai, Y. Huang, N. Joshi, C. Buehler, M.-H. Yang, and S. B. Kang. Semantic-driven generation of hyperlapse from 360◦ video. TVCG, 2017.
[32] L. H. Li, M. Yatskar, D. Yin, C.-J. Hsieh, and K.-W. Chang. VisualBERT: A simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557, 2019.
[33] J. Lu, D. Batra, D. Parikh, and S. Lee. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In NeurIPS, 2019.
[34] J. Lu, J. Yang, D. Batra, and D. Parikh. Hierarchical question-image co-attention for visual question answering. In NIPS, 2016.
[35] M. Malinowski and M. Fritz. A multi-world approach to question answering about real-world scenes based on uncertain input. In NIPS, 2014.
[36] M. Ren, R. Kiros, and R. Zemel. Exploring models and data for image question answering. In NIPS, 2015.
[37] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
[38] A. Singh, V. Goswami, V. Natarajan, Y. Jiang, X. Chen, M. Shah, M. Rohrbach, D. Batra, and D. Parikh. Pythia: A platform for vision & language research. In SysML Workshop, NeurIPS, 2018.
[39] W. Su, X. Zhu, Y. Cao, B. Li, L. Lu, F. Wei, and J. Dai. VL-BERT: Pre-training of generic visual-linguistic representations. arXiv preprint arXiv:1908.08530, 2019.
[40] Y.-C. Su and K. Grauman. Learning spherical convolution for fast features from 360◦ imagery. In NIPS, 2017.
[41] Y.-C. Su and K. Grauman. Learning compressible 360◦ video isomers. In CVPR, 2018.
[42] Y.-C. Su, D. Jayaraman, and K. Grauman. Pano2Vid: Automatic cinematography for watching 360◦ videos. In ACCV, 2016.
[43] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In NeurIPS, 2017.
[44] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. In ICML, 2015.
[45] Z. Yang, X. He, J. Gao, L. Deng, and A. Smola. Stacked attention networks for image question answering. In CVPR, 2016.
[46] L. Yu, E. Park, A. C. Berg, and T. L. Berg. Visual Madlibs: Fill in the blank description generation and question answering. In ICCV, 2015.
[47] K.-H. Zeng, T.-H. Chen, C.-Y. Chuang, Y.-H. Liao, J. C. Niebles, and M. Sun. Leveraging video descriptions to learn video question answering. In AAAI, 2017.
[48] Y. Zhu, O. Groth, M. Bernstein, and L. Fei-Fei. Visual7W: Grounded question answering in images. In CVPR, 2016.
