Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts

Soravit Changpinyo, Piyush Sharma, Nan Ding, Radu Soricut


Google Research
{schangpi, piyushsharma, dingnan, rsoricut}@google.com
arXiv:2102.08981v2 [cs.CV] 30 Mar 2021

Abstract

The availability of large-scale image captioning and visual question answering datasets has contributed significantly to recent successes in vision-and-language pre-training. However, these datasets are often collected with overrestrictive requirements inherited from their original target tasks (e.g., image caption generation), which limit the resulting dataset scale and diversity. We take a step further in pushing the limits of vision-and-language pre-training data by relaxing the data collection pipeline used in Conceptual Captions 3M (CC3M) [70] and introduce the Conceptual 12M (CC12M), a dataset with 12 million image-text pairs specifically meant to be used for vision-and-language pre-training. We perform an analysis of this dataset and benchmark its effectiveness against CC3M on multiple downstream tasks with an emphasis on long-tail visual recognition. Our results clearly illustrate the benefit of scaling up pre-training data for vision-and-language tasks, as indicated by the new state-of-the-art results on both the nocaps and Conceptual Captions benchmarks.1

Figure 1: CC12M. Even when the alt-texts do not precisely describe their corresponding Web images, they still provide rich sources for learning long-tail visual concepts such as sumo, mangosteen, and jellyfish. We scale up vision-and-language pre-training data to 12 million by relaxing overly strict filters in Conceptual Captions [70]. Example alt-texts shown in the figure: "<PERSON> was the first US president to attend a tournament in sumo's hallowed Ryogoku Kokugikan arena. (AFP photo)"; "Hand holding a fresh mangosteen"; "#jellyfish #blue #ocean #pretty Sea Turtle Wallpaper, Aquarius Aesthetic, Blue Aesthetic Pastel, The Adventure Zone, Capricorn And <PERSON>, Life Aquatic, Ocean Life, Jellyfish, Marine Life".

1. Introduction

Transfer learning using pre-training and fine-tuning has become a prevalent paradigm in computer vision, natural language processing, and vision-and-language (V+L) research. It has been shown, for instance, that V+L pre-training leads to transferable joint representations that benefit multiple downstream V+L tasks, including visual question answering, image and text retrieval, and referring expression comprehension [55, 49, 21, 77, 3, 74, 88, 48, 56].

What makes V+L pre-training successful? On one hand, this is due to advances in architectures and modeling that are mainly inspired by BERT and similar models in natural language understanding and generation [25, 53, 82, 46, 26, 66]. In particular, the idea of using flexible self-attention mechanisms via high-capacity multi-layer Transformers [78], in combination with self-supervised learning objectives such as masked language modeling [25], has proven to be effective and widely applicable. On the other hand, the availability of large-scale labeled and weakly-labeled data in the V+L domain [61, 20, 43, 70] is truly what enables such models to learn associations between the two modalities.

In both the vision and language communities, one notable trend is that scaling up training data is useful. In contrast, datasets in V+L research remain relatively limited in terms of scale and diversity. The capability of JFT-300M [76] and Instagram [58] over the orders-of-magnitude smaller ImageNet [69] has been put to the test on multiple downstream image classification and object detection tasks. In NLP, the size of pre-training data sources for training deep language models rose from the 20GB BooksCorpus [90] + English Wikipedia in BERT [25], to the 570GB dataset in GPT-3 [12] and the 745GB C4 dataset in T5 [66].

1 Our dataset is available at https://github.com/google-research-datasets/conceptual-12m.

In contrast, V+L datasets are limited in two ways. First, the effective sizes of popular V+L datasets are low. The number of images in these datasets ranges from fewer than a few hundred thousand [84, 20, 44, 28] to several millions [70], with lower text quality as the scale increases. Second, many of the small-sized datasets share the same, limited visual domain; COCO-Captions [20], Visual Genome [44], and VQA2 [27] are (mostly) based on several hundred thousand COCO images [52]. The lack in scale and diversity of visual concepts (with respect to vision/language-only counterparts) makes it hard for V+L models to perform adequately in the wild.

One major reason for these gaps is the difficulty in collecting such datasets. Unlike in image classification, "text" in V+L datasets is longer and less likely to be agreed upon, making the annotation process more costly and time-consuming. One approach to remedy this is to make use of large amounts of the alt-texts accompanying images on the Web. For instance, Sharma et al. introduced Conceptual Captions (CC3M) [70], a dataset of 3.3M ⟨image, caption⟩ pairs that result from a filtering and post-processing pipeline applied to those alt-texts. Despite being automatically collected, CC3M is shown to be effective in both image captioning in the wild [70, 19] and V+L pre-training [55, 49, 21, 77, 3, 74, 88, 48, 56]. In other words, it provides a promising start for large-scale V+L annotations.

In this paper, we explore pushing the limits of V+L data using this approach. Our key insight is that specific downstream V+L tasks (e.g., VQA, image captioning) can be overly restrictive if the goal is to collect large-scale V+L annotations. For instance, CC3M was collected to favor high-precision texts that are fit for the downstream task of image captioning. Yet, we have witnessed this dataset being increasingly adopted for V+L pre-training [55, 21, 3, 74, 88, 48, 56], arguably beyond its original purpose.

We hypothesize that the V+L field could benefit from such an insight, and therefore we introduce Conceptual 12M (CC12M), a high(er)-recall V+L dataset for the purpose of V+L pre-training. By relaxing multiple image and text filters used in CC3M, we obtain a less precise but 4x larger V+L set of ⟨image, text⟩ pairs. We perform an analysis of this dataset and show that it covers a wider range of visual concepts.

We test our hypothesis by benchmarking the effectiveness of CC12M as a pre-training data source on several V+L tasks, in comparison to CC3M. We explore two main pre-training strategies (and more in the Supplementary material): one for vision-to-language generation and the other for vision-and-language matching. Our experiments indicate that scaling up V+L pre-training data has a dramatic positive effect on image captioning, novel object captioning, and (zero-shot) image retrieval.

In summary, our main contributions are:
(a) A public larger-scale V+L pre-training dataset that covers a much wider range of concepts than existing ones.
(b) Evaluation on downstream vision-to-language generation and vision-and-language matching with an emphasis on long-tail recognition that consistently shows the superiority of this dataset over CC3M.
(c) State-of-the-art results on the nocaps (novel object captioning) and Conceptual Captions benchmarks.

2. Vision-and-Language Pre-Training Data

We first review the data collection pipeline for Conceptual Captions 3M (CC3M) outlined in Sect. 3 of [70], which we followed closely. We then describe a series of relaxations and simplifications to the pipeline that results in CC12M, a much larger set of image-text pairs. Finally, we perform an analysis of CC12M in comparison with CC3M and other existing V+L datasets.

2.1. Conceptual Captions 3M: Pipeline for extracting and cleaning Image Alt-Text from the Web

The Conceptual Captions dataset consists of about 3.3M Web images and their corresponding cleaned, hypernymized Alt-texts [70]. This approach leverages a promising source of (weak) supervision for learning correspondence between visual and linguistic concepts: once the pipeline is established, the data collection requires no additional human intervention. It consists of the following 4 steps: (i) image-based filtering based on size, aspect ratio, encoding format and offensive content, (ii) text-based filtering based on language, capitalization, token frequency, pre-defined unwanted phrases, as well as part-of-speech (POS), sentiment/polarity, and adult content detection (using Google Cloud Natural Language APIs), (iii) image-text-based filtering based on the number of image tags (as predicted by Google Cloud Vision APIs) that overlap with the existing text, (iv) text transformations, most notably hypernymization of named entities, including proper names of persons, organizations and locations (e.g., both "Harrison Ford" and "Calista Flockhart" are replaced by "actor"), deletion of time-related spans, and digit replacement (using # as a digit abstraction).

The large-scale nature and the high degree of textual and visual diversity make this dataset particularly suited to V+L pre-training [55, 21, 74, 88, 48, 56].

2.2. CC12M: Relaxing filters for higher recall

Conceptual Captions has been created to work out-of-the-box for training image captioning models, and thus it involves substantial image, text, and image-text filtering and processing to obtain clean, high-precision captions. As a result, this approach comes at the cost of low recall (many potentially useful ⟨image, Alt-text⟩ pairs are discarded).

Dataset | # examples | token/type | caption length
CC3M train | 3,318,333 | 804.8 | 10.3 ± 4.5
CC12M | 12,423,374 | 370.0 | 20.2 ± 16.3

Table 1: Basic statistics of CC12M vs. CC3M.
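The quantities in Table 1 can be recomputed with a single pass over the captions. A minimal sketch, assuming the captions are stored one per line in a plain-text file and using whitespace tokenization (the file name and tokenizer are illustrative, not the exact tooling behind the reported numbers):

```python
from collections import Counter
from statistics import mean, pstdev

def caption_stats(path):
    """Token/type ratio and caption-length statistics, as in Table 1."""
    token_counts = Counter()
    lengths = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            tokens = line.strip().split()  # whitespace tokenization (illustrative)
            if not tokens:
                continue
            token_counts.update(tokens)
            lengths.append(len(tokens))
    num_tokens = sum(token_counts.values())  # total word count
    num_types = len(token_counts)            # vocabulary size
    return {
        "examples": len(lengths),
        "token/type": num_tokens / num_types,
        "caption_length_mean_std": (mean(lengths), pstdev(lengths)),
    }

# Example: caption_stats("cc12m_captions.txt")  # hypothetical file name
```

Under this definition, a lower token/type ratio at a much larger corpus size indicates a longer-tail vocabulary, which is the comparison drawn in Sec. 2.3.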
However, this trade-off may not be optimal if the dataset is to be used primarily for V+L pre-training. Motivated by this, we follow a similar procedure as the one described in [70] but relax some of its filters, and construct the dataset called Conceptual 12M (CC12M), as detailed below.

Filtering. As described above, the construction of CC3M used three main filtering types [70]: image-based, text-based, and image-text-based. To arrive at CC12M, we keep the image-text filtering intact, and relax the unimodal filters only. First, for image-based filtering, we set the maximum ratio of larger to smaller dimension to 2.5 instead of 2. We still keep only JPEG images with size greater than 400 pixels, and still exclude images that trigger pornography detectors. Second, in text-based filtering, we allow text between 3 and 256 words in the alt-text. We still discard candidates with no noun or no determiner, but permit ones without prepositions. We discard the heuristics regarding high unique-word ratio covering various POS tags and word capitalization. We set the maximum fraction of word repetition allowed to 0.2. Given a larger pool of text due to the above relaxations, the threshold for counting a word type as rare is increased from 5 to 20.

Text transformation. The main motivation for CC3M to perform text transformation is that a majority of candidate captions contain ultrafine-grained entities such as proper names (people, venues, locations, etc.), making them extremely difficult to learn as part of the image captioning task. In contrast, we are not restricted by the end task of image caption generation. Our intuition is that relatively more difficult pre-training data would lead to better transferability. We thus do not perform hypernymization or digit substitution as in [70]. The only exception to the "keep alt-texts as raw as possible" rule is performing person-name substitutions, which we identify as necessary to protect the privacy of the individuals in these images. For this step, we use the Google Cloud Natural Language APIs to detect all named entities of type Person, and substitute them by a special token ⟨PERSON⟩. Around 25% of all the alt-texts in CC12M are transformed in this fashion.

2.3. Characteristics of CC12M

We provide an analysis of CC12M along multiple dimensions, focusing on comparing it to the most relevant CC3M. Additional analyses are in the supplementary material.

Basic statistics. As seen in Table 1, CC12M consists of 12.4M image-text pairs2, about 4x larger than CC3M. It has a much lower token (word count) to type (vocab size) ratio, indicating a longer-tail distribution and a higher degree of diversity of the concepts captured. Lastly, the average caption length of CC12M is much longer. This is overall achieved by our relaxation of the filters, especially the text one.

2 Extracted as of May 2020.

Quality. We compute a rough estimate of precision on 100 examples by asking two annotators to rate how well the given alt-text fits the image on a 1-5 scale: 1 (no fit), 2 (barely fit), 3 (somewhat), 4 (good fit, but disfluent language), 5 (perfect). We define precision as the fraction of captions with a score of 4 or above. We see a drop in precision, 76.6% vs. 90.3% as reported for CC3M (Table 2 in [70]). This analysis points to the precision/recall tradeoff in transitioning from CC3M to CC12M. Fig. 1 illustrates such a tradeoff: the "jellyfish" example would have been filtered out from CC3M (due to a high percentage of nouns and a lack of prepositions), but it is included in CC12M.

Visual concept distribution. We use the caption text tokens to represent the visual concepts. The long tail of visual concepts that emerges in CC12M spans many categories, and can be attributed to (1) a dramatic increase in scale, and (2) the absence of fine-grained entity hypernymization. We list some of them here to illustrate this point, in the format of "⟨word⟩ ⟨frequency in CC3M⟩ → ⟨frequency in CC12M⟩": luffy 0 → 152, mangosteen 0 → 212, zanzibar 0 → 1138, sumo 1 → 661, pokemon 1 → 8615, chevrolet 1 → 12181, mehndi 3 → 9218, pooh 4 → 7286, cyberpunk 5 → 5247, keto 6 → 6046, hound 9 → 3392, quiche 50 → 1109, durian 61 → 552, jellyfish 456 → 2901.

We also visualize the head of the distribution in Fig. 2. We observe that "person" becomes much more frequent due to person substitution with the token "⟨PERSON⟩". Moreover, there are fewer "actor", "artist", "(football) player", as a result of removing hypernymization.

Figure 2: Word clouds of top 100 tokens in CC3M (the top cloud) and in CC12M (the bottom cloud).
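To make the text-filter relaxations of Sec. 2.2, which produce this longer tail, more concrete, the sketch below encodes the stated thresholds (3-256 words, a noun and a determiner required but no preposition requirement, at most a 0.2 fraction of repeated words, and a rare-word threshold of 20). It is only an approximation: the actual pipeline relies on the Google Cloud Natural Language APIs for POS tagging and includes further checks, the exact definitions of word repetition and rare-word handling reflect our reading of the text above, and the helper itself is hypothetical.

```python
from collections import Counter

# Thresholds stated in Sec. 2.2 for CC12M (the CC3M pipeline is stricter).
MIN_WORDS, MAX_WORDS = 3, 256
MAX_REPEAT_FRACTION = 0.2
RARE_THRESHOLD = 20  # a token type is treated as rare below this corpus count

def keep_alt_text(tokens, pos_tags, corpus_counts):
    """Rough approximation of the relaxed CC12M text filter (hypothetical helper).

    tokens: lower-cased words of the candidate alt-text.
    pos_tags: one coarse POS tag per token (the paper uses Google Cloud NL APIs).
    corpus_counts: Counter of token frequencies over the whole candidate pool.
    """
    if not (MIN_WORDS <= len(tokens) <= MAX_WORDS):
        return False
    # A noun and a determiner are still required; prepositions no longer are.
    if "NOUN" not in pos_tags or "DET" not in pos_tags:
        return False
    # Word repetition: fraction of tokens beyond the first occurrence of each type.
    repeats = len(tokens) - len(set(tokens))
    if repeats / len(tokens) > MAX_REPEAT_FRACTION:
        return False
    # Candidates containing rare token types are discarded (our reading of the rule).
    if any(corpus_counts[t] < RARE_THRESHOLD for t in tokens):
        return False
    return True
```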

Figure 3: Main Pre-Training Tasks: image captioning (vision-to-language generation) and visual-linguistic matching (vision-and-language understanding). Top: a multi-layer Transformer image encoder (over a global feature, 16 region features, and 16 tags) feeds a multi-layer Transformer text decoder that generates the caption token by token. Bottom: multi-layer Transformer image and text encoders are combined through pooling and fully-connected layers to predict "match or not?".

Downstream task | Train | Eval
Novel object captioning | COCO Captions | nocaps
Novel object captioning | LocNar COCO | LocNar OID
Image captioning | CC3M | CC3M
Zero-shot IR | None | Flickr30K
IR | Flickr30K | Flickr30K
IR | LocNar Flickr30K | LocNar Flickr30K

Table 2: Generation (top) and matching (bottom) tasks and datasets considered in this paper. IR = Image Retrieval.

Finally, we inspect tokens that are unseen in CC3M. We observe that these tokens may occur very frequently in CC12M if they are fine-grained instances such as locations ("france," "africa," "dc," "toronto") or digits ("2019", "10", "2018", "2020"). This is due to the removal of hypernymization and the dropping of time-related span deletion.

Biases. We study the context in which several sensitive terms related to gender, age, race, and ethnicity appear, such as "black"/"white"/"asian"/"african"/"indian", "man"/"woman", "young"/"old", etc. We observe no large biases in the distribution of these terms, either in terms of co-occurrence between sensitive term pairs or with other tokens. Furthermore, we check the distribution of web domains and, similar to visual concepts, we find this to be diverse and long-tail: >100K domains, with >40K contributing >10 samples. We take our preliminary study as a positive indication of no severe biases stemming from particular domains or communities. Finally, we provide a Broader Impact statement in the supplementary material.

3. Evaluating Vision-and-Language Pre-Training Data

The previous section describes CC3M and our CC12M. In this section, we evaluate both datasets on their ability to benefit V+L downstream tasks, measuring the impact from the visual grounding produced under the two settings. For the sake of comparison, we do not include the images that appear in CC3M in CC12M in our experiments.

We focus on the two most fundamental V+L tasks: vision-to-language generation and vision-and-language matching. In both cases, our emphasis is on (i) the simplest setting in which the learning objectives during pre-training and downstream tasks match, and (ii) long-tail recognition and out-of-distribution generalization, as we believe this is where pre-training has the most impact. Fig. 3 and Table 2 summarize our experimental setup, in terms of the downstream tasks and the fine-tuning and evaluation datasets.

3.1. Vision-to-Language Generation

3.1.1 Pre-Training Tasks

We use image captioning (ic) as the pre-training task. The task is to predict the target caption given image features. To train the model parameters, we use the standard cross-entropy loss given the ground-truth caption.

Note that there exist vision-to-language generation pre-training strategies that are different from ours. For instance, Zhou et al. [88] adapt BERT [25] to generate text. As masked language modeling is used for pre-training, there is no decoder and, at inference time, text is generated using the encoder network one token at a time, appending the mask token to the image and the text generated so far. Thus, this approach is inefficient, as the number of passes over the input image is linear in the desired caption length. It is also unclear how to incorporate advanced decoding schemes such as beam search, top-k sampling, or nucleus sampling (see, e.g., [31]) with such an approach. Finally, our experiments (see Supplementary material) suggest that the ic pre-training task is superior to its masked variants and justify using the simple ic learning objective.

3.1.2 Downstream Tasks

Our downstream tasks are selected to measure progress toward solving image captioning in the wild. They also stand to benefit from visual grounding, especially since pre-training, by definition, is expected to cover a wider range of (long-tail) visual concepts than fine-tuning datasets.

nocaps [2] is a recent object-captioning-at-scale benchmark consisting of 4,500 validation and 10,600 test images with 10 hidden reference captions. Unlike in the standard image captioning setting, nocaps's distributions of images during training (COCO Captions) and evaluation (Open Images) are different: the Open Images dataset [45, 42] covers one order of magnitude more objects (600 classes) than COCO [52] (80 classes). This discrepancy defines the challenge: solutions must be able to learn to describe novel concepts from sources external to the COCO training set, such as text corpora, knowledge bases, or object detection datasets. In the Supplementary material, we provide details on the nocaps leaderboard.
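Before turning to additional data sources, we note that the ic objective of Sec. 3.1.1 is simply teacher-forced cross-entropy over the ground-truth caption, conditioned on the image features. A minimal PyTorch-style sketch (the tensor shapes and pad_id default are illustrative, not the configuration used in our experiments):

```python
import torch
import torch.nn.functional as F

def ic_loss(caption_logits, caption_tokens, pad_id=0):
    """Standard teacher-forced cross-entropy for image captioning (ic).

    caption_logits: (batch, seq_len, vocab) scores from the text decoder,
                    conditioned on the image features via the image encoder.
    caption_tokens: (batch, seq_len) ground-truth target token ids.
    """
    vocab = caption_logits.size(-1)
    return F.cross_entropy(
        caption_logits.reshape(-1, vocab),  # flatten batch and time dimensions
        caption_tokens.reshape(-1),         # target token at each position
        ignore_index=pad_id,                # do not penalize padding positions
    )

# Toy usage with random tensors:
logits = torch.randn(2, 5, 100)             # batch=2, 5 steps, vocab=100
targets = torch.randint(1, 100, (2, 5))
loss = ic_loss(logits, targets)
```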

In addition, besides CC3M and CC12M, we also explore using the Open Images Localized Narratives dataset (LocNar) [64] as an alternative "in-domain" (from a visual standpoint) pre-training data source.

Localized Narratives (LocNar) [64] is a collection of datasets with images that are paired with captions obtained by converting speech to text via ASR, followed by manual post-processing3. Inspired by the setting in nocaps, we use the COCO [52] portion (train split of around 130K images) for training/fine-tuning, and the Open Images [45] portion for evaluation (val split of around 40K images). Note that the LocNar captions are much longer than those in standard captioning datasets (41.8 words/caption), setting it apart from nocaps.

3 This dataset also contains mouse traces synchronized with the text, but we do not use the traces here.

Conceptual Captions 3M [70] is our main reference for a V+L pre-training data source. At the same time, the image captioning task on this dataset itself is a valuable benchmark for vision-to-language generation in the wild. Thus, we adopt it as a downstream task for CC12M. This means that, in the case of CC3M, the from-scratch and pre-training settings collapse.

Evaluation metrics. To measure the performance on image caption generation, we consider the standard metrics BLEU-1,4 [62], ROUGE-L [51], METEOR [10], CIDEr-D [79], and SPICE [4].

3.2. Vision-and-Language Matching

3.2.1 Pre-training Tasks

In visual-linguistic matching (vlm), the task takes as input both image and text features and predicts whether the input image and text are matched. To train the model's parameters, we use a contrastive softmax loss, for which the original image-text pairs are used as positive examples, while all other image-text pairs in the mini-batch are used as negative examples [55, 77].

3.2.2 Downstream Tasks

The task of caption-based image retrieval (IR) is to identify a relevant image from a pool given a caption describing its content. The Flickr30K dataset [63] consists of 31,000 images from Flickr, each associated with five captions. Following existing work [47, 55], we use 1,000 images for validation, 1,000 images for testing, and use the rest of the image-text pairs for model training.

We further consider zero-shot caption-based image retrieval [55] on the Flickr30K dataset. The term "zero-shot" refers to the setting in which we discard training data and apply pre-trained models "as-is", i.e., without fine-tuning on the target task.

Finally, we further evaluate our retrieval system on the Localized Narratives dataset [64] (see Sect. 3.1.2). We use the LocNar Flickr30K portion (train split of 30,546 images, and test split of 1,000 images) for training and evaluation.

Evaluation metrics. To measure the performance on image retrieval, we consider the standard metrics Recall@1 (R1), Recall@5 (R5), and Recall@10 (R10).

3.3. Implementation Details

Representing images and texts. We use Graph-RISE [37, 38] to featurize the entire image. We train a Faster-RCNN [68] on Visual Genome [43], with a ResNet101 [29] backbone trained on JFT [30] and fine-tuned on ImageNet [69]. We select the top-16 box proposals and featurize each of them with Graph-RISE, similar to [19]. Inspired by [50], we obtain up to 16 image tags from the Google Cloud Vision APIs, and treat them as text inputs to our model. These global, regional, and tag features end up being represented as a bag of 1+16+16 vectors, serving as bottom-up features [7] for our model.

Model and Learning. For ic-based pre-training and downstream tasks, we follow state-of-the-art architectures that heavily rely on self-attention [78] or similar mechanisms [70, 85, 19, 33, 23]. We implement a Transformer-based encoder-decoder model, using [19] as a starting point. In addition, we encode each feature vector with a deeper embedding layer and apply layer normalization [9]. Following [55], we encode the corners and the area of bounding boxes and apply layer normalization when combining geometric and regional semantic features. These modifications lead to an improved CIDEr score of 100.9 on the CC3M dev benchmark (Table 7), vs. 93.7 as reported by [19]. We describe additional details in the supplementary material, including the infrastructure description, runtime, model size, hyperparameter ranges and tuning methods, and the configuration of the best-performing model.

For the vlm-based pre-training and downstream tasks, we reuse the architecture above but discard the decoder. We use mean pooling to obtain a fixed-length vector for each modality, and compute the product of the transformed (last-layer Transformer encoder representation) image and the transformed text before applying the softmax.
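The vlm objective of Sec. 3.2.1 can be written just as compactly: the mean-pooled image and text vectors are compared with a dot product, and each image is trained to select its own caption against the other captions in the mini-batch (and vice versa). A sketch under the same assumptions as above; the symmetric two-way form and the absence of a temperature are simplifications for illustration, not necessarily the exact formulation used in our implementation:

```python
import torch
import torch.nn.functional as F

def vlm_contrastive_loss(image_vecs, text_vecs):
    """In-batch contrastive softmax loss for visual-linguistic matching (vlm).

    image_vecs, text_vecs: (batch, dim) mean-pooled encoder outputs for the
    aligned image-text pairs of a mini-batch; row i of each tensor is a pair.
    """
    scores = image_vecs @ text_vecs.t()              # (batch, batch) similarity matrix
    targets = torch.arange(scores.size(0))           # the matching pair is on the diagonal
    loss_i2t = F.cross_entropy(scores, targets)      # image -> which text matches?
    loss_t2i = F.cross_entropy(scores.t(), targets)  # text -> which image matches?
    return 0.5 * (loss_i2t + loss_t2i)

# Toy usage:
img = torch.randn(4, 512)
txt = torch.randn(4, 512)
loss = vlm_contrastive_loss(img, txt)
```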

Pretraining data | Finetune on COCO Cap.? | in-domain CIDEr / SPICE | near-domain CIDEr / SPICE | out-of-domain CIDEr / SPICE | overall BLEU-1 / BLEU-4 / METEOR / ROUGE / CIDEr / SPICE
None | yes | 72.8 / 11.1 | 57.1 / 10.2 | 34.1 / 8.3 | 69.8 / 14.5 / 21.9 / 47.9 / 54.7 / 10.0
CC3M | no | 29.2 / 7.4 | 27.5 / 6.9 | 37.3 / 7.4 | 36.0 / 2.8 / 12.6 / 29.1 / 29.7 / 7.1
CC12M | no | 20.7 / 6.9 | 24.1 / 6.9 | 41.6 / 8.0 | 31.8 / 2.9 / 12.1 / 26.8 / 27.1 / 7.2
CC3M | yes | 81.8 / 11.6 | 73.7 / 11.1 | 65.3 / 10.1 | 74.6 / 19.1 / 24.1 / 51.5 / 73.2 / 11.0
CC12M | yes | 88.3 / 12.3 | 86.0 / 11.8 | 91.3 / 11.2 | 78.5 / 23.4 / 25.9 / 54.5 / 87.4 / 11.8
CC3M+CC12M | yes | 92.6 / 12.5 | 88.3 / 12.1 | 94.5 / 11.9 | 79.2 / 24.4 / 26.1 / 55.1 / 90.2 / 12.1

Table 3: Automatic metric scores on the nocaps val set: performance of from-scratch (Row 1), pre-trained (Rows 2-3), and fine-tuned (Rows 4-6) models. CC12M outperforms CC3M by a large margin after fine-tuning (Row 4 vs. 5). Together, they achieve a new best, surpassing 90 CIDEr points on nocaps val. Bold indicates best-to-date, underline indicates second-best.

Figure 4: Qualitative results on nocaps. Each example comes with a caption predicted by the model that is trained on COCO Captions without pre-training (very top, right under the image), as well as captions predicted by models pre-trained on CC3M (middle) and CC12M (bottom), where the left/right column indicates whether the model is fine-tuned on COCO Captions.

4. Experimental Results

4.1. Vision-to-Language Generation

Table 3 shows our results on nocaps. We report in Row 1 the performance of our baseline model without pre-training. Rows 2-3 show the performance of off-the-shelf captioning systems trained on CC3M and CC12M, respectively. This indicates the "raw" power (zero-shot setting) of the pre-trained network in generating captions out of the box. We note that, without fine-tuning on COCO Captions, the model underperforms our baseline numbers on all metrics, which is indicative of the need for the model to learn the COCO captioning style, to which the existing automatic metrics are quite sensitive. In addition, we observe a slightly better performance by CC3M except for BLEU-4 and SPICE. This illustrates the benefit of the data processing and the bias toward high-precision captions present in CC3M.

With a fine-tuned model, the benefit of transfer learning using pre-training on this task is clear (Row 1 vs. Rows 4, 5, 6), with CC12M outperforming CC3M by +14.2 CIDEr points and another +2.8 with CC3M+CC12M. Fig. 4 illustrates this effect; scaling up pre-training data benefits learning multimodal correspondences from a much larger pool of concepts, potentially making the model less susceptible to hallucinations (e.g., guessing "microphone" as it has not seen "bagpipes" in the training set), and also more informative (e.g., choosing "sumo wrestlers" over "men"/"people").

Table 4 compares our best model (ic pre-trained on CC3M+CC12M) to existing state-of-the-art results on nocaps, and shows that ours achieves state-of-the-art performance on CIDEr, outperforming a concurrent work [32] that uses a different pre-training approach directly on the Open Images dataset, which nocaps is based on. Importantly, we observe that the gain in the overall score can be largely attributed to the out-of-domain performance (3rd column). This result indicates that, although the annotation protocol for nocaps primes annotators to mention one or more of the displayed fine-grained ground-truth object classes (e.g., "red panda") present in the image [2], the large scale and natural fine-grainedness of CC12M succeed in correctly learning to generate captions containing such concepts, in spite of being textually out-of-domain.

Following [2], we also report results of our best model on the COCO Captions val2017 split, see Table 5, with 5K and 10K fine-tuning steps. We note that, since we do not rely on techniques such as constrained beam search (CBS) [5, 2] that constrain the model outputs, we do not suffer from the large performance trade-offs seen with the previous solutions (degradation of in-domain performance as out-of-domain performance increases, see each model vs. "reference"). Our result on out-of-domain data, as we vary the number of fine-tuning steps (last two rows), suggests that over-fine-tuning on COCO Captions may incur a cost in terms of poor generalization.

A second set of results is reported in Table 6. We observe that, even when the task requires the generation of much longer captions for LocNar, CC12M achieves superior performance (as measured by CIDEr) compared to CC3M as pre-training data.

nocaps val
Method | in-domain CIDEr / SPICE | near-domain CIDEr / SPICE | out-of-domain CIDEr / SPICE | overall CIDEr / SPICE
UpDown [2] | 78.1 / 11.6 | 57.7 / 10.3 | 31.3 / 8.3 | 55.3 / 10.1
UpDown + CBS [2] | 80.0 / 12.0 | 73.6 / 11.3 | 66.4 / 9.7 | 73.1 / 11.1
UpDown + ELMo + CBS [2] | 79.3 / 12.4 | 73.8 / 11.4 | 71.7 / 9.9 | 74.3 / 11.2
OscarL [50] | 79.9 / 12.4 | 68.2 / 11.8 | 45.1 / 9.4 | 65.2 / 11.4
OscarL + CBS [50] | 78.8 / 12.2 | 78.9 / 12.1 | 77.4 / 10.5 | 78.6 / 11.8
OscarL + SCST + CBS [50] | 85.4 / 11.9 | 84.0 / 11.7 | 80.3 / 10.0 | 83.4 / 11.4
VIVO [32] | 88.8 / 12.9 | 83.2 / 12.6 | 71.1 / 10.6 | 81.5 / 12.2
VIVO + CBS [32] | 90.4 / 13.0 | 84.9 / 12.5 | 83.0 / 10.7 | 85.3 / 12.2
VIVO + SCST + CBS [32] | 92.2 / 12.9 | 87.8 / 12.6 | 87.5 / 11.5 | 88.3 / 12.4
pretrain ic on CC12M | 88.3 / 12.3 | 86.0 / 11.8 | 91.3 / 11.2 | 87.4 / 11.8
pretrain ic on CC3M+CC12M | 92.6 / 12.5 | 88.3 / 12.1 | 94.5 / 11.9 | 90.2 / 12.1
Human | 84.4 / 14.3 | 85.0 / 14.3 | 95.7 / 14.0 | 87.1 / 14.2

nocaps test
UpDown [2] | 74.3 / 11.5 | 56.9 / 10.3 | 30.1 / 8.1 | 54.3 / 10.1
UpDown + ELMo + CBS [2] | 76.0 / 11.8 | 74.2 / 11.5 | 66.7 / 9.7 | 73.1 / 11.2
VIVO + SCST + CBS [32] | 89.0 / 12.9 | 87.8 / 12.6 | 80.1 / 11.1 | 86.6 / 12.4
pretrain ic on CC12M | 82.9 / 11.9 | 85.7 / 12.0 | 85.3 / 11.3 | 85.3 / 11.8
pretrain ic on CC3M+CC12M | 87.2 / 12.3 | 87.4 / 12.1 | 87.2 / 11.4 | 87.3 / 12.0
Human | 80.6 / 15.0 | 84.6 / 14.7 | 91.6 / 14.2 | 85.3 / 14.7

Table 4: Comparison between our best model (in italics, pre-trained on CC12M with ic and fine-tuned on COCO Captions) and existing models, on the nocaps val (top) and test (bottom) splits. Bold indicates best-to-date, underline indicates second-best.
Method | COCO val2017 CIDEr | nocaps val CIDEr
UpDown (reference) | 116.2 | 55.3
UpDown + CBS | 97.7 | 73.1
UpDown + ELMo + CBS | 95.4 | 74.3
no pretrain (reference) | 108.5 | 54.7
pretrain ic on CC12M (5K) | 108.1 | 87.4
pretrain ic on CC12M (10K) | 110.9 | 87.1

Table 5: Performance on the in-domain COCO Captions val2017 split along with the nocaps val split. Our methods are in italics with the number of fine-tuning steps in the parentheses.

Method | CC3M dev CIDEr | CC3M test CIDEr
FRCNN [19] | 89.2 | 94.4
TTIC+BIU (single model) | - | 98.0
Ultra [19] | 93.7 | 98.4
no pretrain | 100.9 | -
pretrain ic on CC12M (no ft) | 39.3 | -
pretrain ic on CC12M | 105.4 | -

Table 7: Performance on the Conceptual Captions (CC3M) benchmark. Our methods are in italics. "ft" stands for fine-tuning. The top two CC3M test CIDEr baseline scores are from the Conceptual Captions Leaderboard as of Nov 15, 2020.

Pretraining data | Finetuning data | LocNar COCO val CIDEr | LocNar OID val CIDEr
None | LocNar COCO | 29.6 | 33.8
CC3M | LocNar COCO | 29.1 | 35.7
CC12M | LocNar COCO | 30.0 | 38.6

Table 6: Novel object captioning on LocNar.

However, the gain is smaller compared to the one observed for nocaps. We attribute this to the fact that injecting novel concepts into longer texts is harder, and also to the fact that LocNar does not use priming in its annotation process, leading to more generic terms in its annotations ("musical instruments" vs. "trumpets").

Finally, we fine-tune our best pre-trained model (ic on CC12M) using CC3M in Table 7, and then evaluate on the dev split. We find that we improve the CIDEr score on the dev split from 100.9 to 105.4 (+4.5 CIDEr points). We note that the model trained on CC12M and evaluated directly on the CC3M dev set (without fine-tuning on the CC3M train split) obtains a low dev CIDEr of 39.3. This again indicates that the additional processing steps done for CC3M (e.g., hypernymization) result in captions that are different enough from the ones in CC12M to require a fine-tuning step.

4.2. Vision-and-Language Matching

Table 8 reports zero-shot and default IR performance on Flickr30K as well as default IR performance on LocNar Flickr30K. The results are consistent with those in vision-to-language generation. First, both CC3M and CC12M are beneficial, improving over "from-scratch" training (Pre-training data as "None") by at least 8.6% and 6.6% in R1 on Flickr30K and LocNar Flickr30K, respectively. Additionally, CC12M significantly outperforms CC3M in all cases. Finally, combining the two datasets (CC3M+CC12M) results in even better performance. We provide qualitative results and additional discussion in the supplementary material.
Pretraining data | Finetuning data | Flickr30K test R1 / R5 / R10
None | Flickr30K | 43.7 / 74.8 / 84.1
CC3M | None | 35.4 / 65.2 / 76.2
CC12M | None | 42.5 / 73.1 / 83.4
CC3M+CC12M | None | 47.1 / 76.4 / 83.4
CC3M | Flickr30K | 52.3 / 81.7 / 88.4
CC12M | Flickr30K | 58.5 / 86.6 / 92.1
CC3M+CC12M | Flickr30K | 61.5 / 87.5 / 92.8

Pretraining data | Finetuning data | LocNar Flickr30K test R1 / R5 / R10
None | LocNar Flickr30K | 54.5 / 85.0 / 91.0
CC3M | LocNar Flickr30K | 61.1 / 88.2 / 93.7
CC12M | LocNar Flickr30K | 70.2 / 92.1 / 95.6
CC3M+CC12M | LocNar Flickr30K | 71.0 / 93.0 / 97.0

Table 8: Image retrieval on Flickr30K and LocNar Flickr30K.

Our zero-shot IR results (the three rows in Table 8 with fine-tuning data as "None") are also competitive with the state-of-the-art, despite the fact that our model is much smaller (6 layers of transformers of hidden layer size 512 with 8 attention heads vs. 12 layers of size 768 with 12 attention heads) and uses late fusion instead of early fusion. In particular, our zero-shot IR on CC3M outperforms the one in ViLBERT [55] (35.4 vs. 31.9 in R1), while the CC12M performance goes up by +7.1% R1 to 42.5, and an additional +4.6% R1 to 47.1 when using CC3M+CC12M, surpassing the "from-scratch" setting.

5. Related Work

V+L Pre-training. V+L pre-training research makes use of existing large-scale datasets with image-text pairs. A majority of these resources are image captioning datasets. CC3M [70] has been the most popular for pre-training [55, 56, 3, 74, 88, 48, 21, 50]. The smaller but less noisy SBU Captions [61] (~1M) and COCO Captions [20] (106K) datasets are also of high interest. Some work [77, 21, 50] uses V+L resources collected for dense captioning or visual question answering (VQA), such as VG [44], VQA2 [27], and GQA [34]. In contrast, CC12M is not collected for specific target tasks, and thus it is an order of magnitude larger than those datasets.4 Furthermore, it is much more visually diverse, especially given the fact that COCO Captions, VG, VQA2, and GQA are built on top of COCO images [52] or its subsets.

4 Recently appearing after we submitted our paper, ALIGN [36], CLIP [65], WIT [72], and WenLan [35] all explore enlarging Web-scale data for V+L pre-training with success (albeit with different focuses), further confirming our intuition that scale is a critical factor.

Objectives in V+L pre-training research are largely influenced by BERT [25]. Masked language modeling has been extended to visual region inputs, while next sentence prediction is analogous to vlm. Based directly upon BERT, V+L pre-training research has largely been focused on V+L understanding [55, 49, 21, 77, 3, 74, 48, 56], with classification or regression tasks that do not involve generation. One exception is UnifiedVL [88], which pre-trains a unified architecture for both image captioning (generation) and VQA (understanding). Our work focuses on simpler objectives and considers one at a time. This allows for a "clean" study of the effect of pre-training data sources. At the same time, we also pre-train vision-to-language generation and the encoder-decoder jointly, as opposed to an encoder-only setup. Our work also shows that ic is a strong objective for vision-to-language generation with respect to the widely-used masking-based objectives. Consistent with our results, ic is successfully adopted for learning visual representations for lower-level vision tasks [24].

Long-tail Visual Recognition in V+L. Addressing long-tail distributions of visual concepts is an important component of V+L systems that generalize, as long and free-form texts exhibit a large number of compositional, fine-grained categories [89, 54, 19]. Our work focuses on downstream testbeds for V+L research that require this adaptation ability. For example, the train-test distribution discrepancy in nocaps exists in both visual (COCO vs. Open Images) and textual domains (80 object classes vs. 600 classes). The same can be said for zero-shot image retrieval [55], in which the model must generalize visually and textually from the pre-training data sources of CC3M or CC12M to Flickr30K. Our work identifies pre-training with large-scale noisy data as a promising solution. In addition, for the task of novel object captioning, our approach works more robustly across in- and out-of-domain scenarios and is simpler than the state-of-the-art techniques that utilize constrained beam search (CBS) [5], finite state machine construction plus CBS [6], generating slot-filling templates [57, 81], and copying mechanisms [83].

6. Conclusion

We introduce the new V+L pre-training resource CC12M, obtained by extending the pipeline in [70]. We show that the scale and diversity of V+L pre-training data matter on both generation and matching, especially on benchmarks that require long-tail recognition such as nocaps. Our results indicate that leveraging noisy Web-scale image-text pairs is a promising direction for V+L research.

Acknowledgments. We thank Peter Anderson for his feedback on an earlier version of the draft, Bo Pang and Zhenhai Zhu for helpful discussions, Sebastian Goodman and Ashish V. Thapliyal for help with model implementation, Chris Alberti for help with the data collection pipeline, and Harsh Agrawal for details on nocaps.
References [15] Ozan Caglayan, Pranava Madhyastha, Lucia Specia, and
Loı̈c Barrault. Probing the need for visual context in mul-
[1] Aishwarya Agrawal, Dhruv Batra, Devi Parikh, and Anirud- timodal machine translation. In NAACL, 2019. 14
dha Kembhavi. Don’t just assume; look and answer: Over- [16] Jize Cao, Zhe Gan, Yu Cheng, Licheng Yu, Yen-Chun Chen,
coming priors for visual question answering. In CVPR, 2018. and Jingjing Liu. Behind the scene: Revealing the secrets of
14 pre-trained vision-and-language models. In ECCV, 2020. 14
[2] Harsh Agrawal, Karan Desai, Yufei Wang, Xinlei Chen, [17] Nicholas Carlini, Chang Liu, Úlfar Erlingsson, Jernej Kos,
Rishabh Jain, Mark Johnson, Dhruv Batra, Devi Parikh, Ste- and Dawn Song. The secret sharer: Evaluating and testing
fan Lee, and Peter Anderson. nocaps: novel object caption- unintended memorization in neural networks. In {USENIX}
ing at scale. In ICCV, 2019. 4, 6, 7 Security, 2019. 12
[3] Chris Alberti, Jeffrey Ling, Michael Collins, and David Re- [18] Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew
itter. Fusion of detected objects in text for visual question Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts,
answering. In EMNLP-IJCNLP, 2019. 1, 2, 8, 12 Tom Brown, Dawn Song, Úlfar Erlingsson, Alina Oprea, and
[4] Peter Anderson, Basura Fernando, Mark Johnson, and Colin Raffel. Extracting training data from large language
Stephen Gould. SPICE: semantic propositional image cap- models. arXiv preprint arXiv:2012.07805, 2020. 12
tion evaluation. In ECCV, 2016. 5 [19] Soravit Changpinyo, Bo Pang, Piyush Sharma, and Radu
[5] Peter Anderson, Basura Fernando, Mark Johnson, and Soricut. Decoupled box proposal and featurization with
Stephen Gould. Guided open vocabulary image captioning ultrafine-grained semantic labels improve image captioning
with constrained beam search. In EMNLP, 2017. 6, 8 and visual question answering. In EMNLP-IJCNLP, 2019.
[6] Peter Anderson, Stephen Gould, and Mark Johnson. 2, 5, 7, 8
Partially-supervised image captioning. In NeurIPS, 2018. 8 [20] Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedan-
tam, Saurabh Gupta, Piotr Dollár, and C. Lawrence Zitnick.
[7] Peter Anderson, Xiaodong He, Chris Buehler, Damien
Microsoft COCO Captions: Data collection and evaluation
Teney, Mark Johnson, Stephen Gould, and Lei Zhang.
server. arXiv preprint arXiv:1504.00325, 2015. 1, 2, 8, 12
Bottom-up and top-down attention for image captioning and
[21] Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy,
visual question answering. In CVPR, 2018. 5
Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu.
[8] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret UNITER: Learning UNiversal Image-TExt Representations.
Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi In ECCV, 2020. 1, 2, 8, 12, 13, 15
Parikh. VQA: Visual question answering. In ICCV, 2015.
[22] Christopher Clark, Mark Yatskar, and Luke Zettlemoyer.
12
Don’t take the easy way out: Ensemble based methods for
[9] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hin- avoiding known dataset biases. In EMNLP-IJCNLP, 2019.
ton. Layer normalization. arXiv preprint arXiv:1607.06450, 14
2016. 5, 15 [23] Marcella Cornia, Matteo Stefanini, Lorenzo Baraldi, and
[10] Satanjeev Banerjee and Alon Lavie. METEOR: An auto- Rita Cucchiara. Meshed-Memory Transformer for Image
matic metric for MT evaluation with improved correlation Captioning. In CVPR, 2020. 5
with human judgments. In ACL Workshops, 2005. 5 [24] Karan Desai and Justin Johnson. VirTex: Learning visual
[11] Tolga Bolukbasi, Kai-Wei Chang, James Zou, Venkatesh representations from textual annotations. In CVPR, 2021. 8
Saligrama, and Adam Kalai. Man is to computer program- [25] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina
mer as woman is to homemaker? debiasing word embed- Toutanova. BERT: Pre-training of deep bidirectional trans-
dings. In NeurIPS, 2016. 12 formers for language understanding. In NAACL, 2019. 1, 4,
[12] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie 8, 14, 16
Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Nee- [26] Sebastian Goodman, Zhenzhong Lan, and Radu Soricut.
lakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Multi-stage pretraining for abstractive summarization. arXiv
Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, preprint arXiv:1909.10599, 2019. 1
Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. [27] Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Ba-
Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, tra, and Devi Parikh. Making the V in VQA matter: El-
Mark Chen, Eric Sigler, Mateusz Litwin, Scottand Gray, evating the role of image understanding in visual question
Benjamin Chess, Jack Clark, Christopher Berner, Sam Mc- answering. In CVPR, 2017. 2, 8, 12
Candlish, Alec Radford, Ilya Sutskever, and Dario Amodei. [28] Danna Gurari, Yinan Zhao, Meng Zhang, and Nilavra Bhat-
Language models are few-shot learners. arXiv preprint tacharya. Captioning images taken by people who are blind.
arXiv:2005.14165, 2020. 1 In ECCV, 2020. 2
[13] Kaylee Burns, Lisa Anne Hendricks, Kate Saenko, Trevor [29] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun.
Darrell, and Anna Rohrbach. Women also snowboard: Over- Deep residual learning for image recognition. In CVPR,
coming bias in captioning models. In ECCV, 2018. 12 2016. 5
[14] Remi Cadene, Corentin Dancette, Hedi Ben-younes, [30] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the
Matthieu Cord, and Devi Parikh. RUBi: Reducing unimodal knowledge in a neural network. In Proceedings of NeurIPS
biases in visual question answering. In NeurIPS, 2019. 14 workshop, 2015. 5

[31] Ari Holtzman, Jan Buys, Maxwell Forbes, and Yejin Choi. [44] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson,
The curious case of neural text degeneration. In ICLR, 2020. Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalan-
4 tidis, Li-Jia Li, David A. Shamma, Michael S. Bernstein, and
[32] Xiaowei Hu, Xi Yin, Kevin Lin, Lijuan Wang, Lei Zhang, Li Fei-Fei. Visual Genome: Connecting language and vision
Jianfeng Gao, and Zicheng Liu. VIVO: Surpassing human using crowdsourced dense image annotations. International
performance in novel object captioning with visual vocabu- Journal of Computer Vision, 123(1):32–73, 2017. 2, 8
lary pre-training. arXiv preprint arXiv:2009.13682, 2020. 6, [45] Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper R. R.
7 Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Ste-
[33] Lun Huang, Wenmin Wang, Jie Chen, and Xiao-Yong Wei. fan Popov, Matteo Malloci, Tom Duerig, and Vittorio Ferrari.
Attention on attention for image captioning. In CVPR, 2019. The open images dataset V4: unified image classification,
5 object detection, and visual relationship detection at scale.
CoRR, abs/1811.00982, 2018. 4, 5
[34] Drew A Hudson and Christopher D Manning. Gqa: a
new dataset for compositional question answering over real- [46] Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin
world images. arXiv preprint arXiv:1902.09506, 2019. 8 Gimpel, Piyush Sharma, and Radu Soricut. Albert: A lite
bert for self-supervised learning of language representations.
[35] Yuqi Huo, Manli Zhang, Guangzhen Liu, Haoyu Lu, Yizhao
In ICLR, 2020. 1
Gao, Guoxing Yang, Jingyuan Wen, Heng Zhang, Baogui
[47] Kuang-Huei Lee, Xi Chen, Gang Hua, Houdong Hu, and Xi-
Xu, Weihao Zheng, et al. WenLan: Bridging vision and
aodong He. Stacked cross attention for image-text matching.
language by large-scale multi-modal pre-training. arXiv
In ECCV, 2018. 5
preprint arXiv:2103.06561, 2021. 8
[48] Gen Li, Nan Duan, Yuejian Fang, Daxin Jiang, and Ming
[36] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh,
Zhou. Unicoder-VL: A universal encoder for vision and lan-
Hieu Pham, Quoc V. Le, Yunhsuan Sung, Zhen Li, and Tom
guage by cross-modal pre-training. In AAAI, 2020. 1, 2, 8,
Duerig. Scaling up visual and vision-language representa-
12
tion learning with noisy text supervision. arXiv preprint
[49] Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh,
arXiv:2102.05918, 2021. 8
and Kai-Wei Chang. VisualBERT: A simple and perfor-
[37] Da-Cheng Juan, Chun-Ta Lu, Zhen Li, Futang Peng, Alek- mant baseline for vision and language. arXiv preprint
sei Timofeev, Yi-Ting Chen, Yaxi Gao, Tom Duerig, An- arXiv:1908.03557, 2019. 1, 2, 8, 12, 13
drew Tomkins, and Sujith Ravi. Graph-RISE: Graph-
[50] Xiujun Li, Xi Yin, Chunyuan Li, Xiaowei Hu, Pengchuan
regularized image semantic embedding. arXiv preprint
Zhang, Lei Zhang, Lijuan Wang, Houdong Hu, Li Dong,
arXiv:1902.10814, 2019. 5
Furu Wei, Yejin Choi, and Jianfeng Gao. Oscar: Object-
[38] Da-Cheng Juan, Chun-Ta Lu, Zhen Li, Futang Peng, Aleksei semantics aligned pre-training for vision-language tasks. In
Timofeev, Yi-Ting Chen, Yaxi Gao, Tom Duerig, Andrew ECCV, 2020. 5, 7, 8, 12
Tomkins, and Sujith Ravi. Ultra fine-grained image semantic [51] Chin-Yew Lin. ROUGE: A package for automatic evaluation
embedding. In WSDM, 2020. 5 of summaries. In Text Summarization Branches Out, 2004. 5
[39] Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and [52] Tsung-Yi Lin, Michael Maire, Serge J. Belongie, Lubomir D.
Tamara Berg. ReferItGame: Referring to objects in pho- Bourdev, Ross B. Girshick, James Hays, Pietro Perona, Deva
tographs of natural scenes. In EMNLP, 2014. 12 Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft
[40] Mert Kilickaya, Aykut Erdem, Nazli Ikizler-Cinbis, and COCO: common objects in context. In ECCV, 2014. 2, 4, 5,
Erkut Erdem. Re-evaluating automatic metrics for image 8
captioning. In EACL, 2017. 16 [53] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Man-
[41] Diederik P. Kingma and Jimmy Ba. Adam: A method for dar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke
stochastic optimization. In ICLR, 2015. 16 Zettlemoyer, and Veselin Stoyanov. RoBERTa: A ro-
[42] Ivan Krasin, Tom Duerig, Neil Alldrin, Vittorio Ferrari, Sami bustly optimized BERT pretraining approach. arXiv preprint
Abu-El-Haija, Alina Kuznetsova, Hassan Rom, Jasper Ui- arXiv:1907.11692, 2019. 1
jlings, Stefan Popov, Shahab Kamali, Matteo Malloci, Jordi [54] Ziwei Liu, Zhongqi Miao, Xiaohang Zhan, Jiayun Wang,
Pont-Tuset, Andreas Veit, Serge Belongie, Victor Gomes, Boqing Gong, and Stella X. Yu. Large-scale long-tailed
Abhinav Gupta, Chen Sun, Gal Chechik, David Cai, Zheyun recognition in an open world. In CVPR, 2019. 8
Feng, Dhyanesh Narayanan, and Kevin Murphy. Open- [55] Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. ViL-
Images: A public dataset for large-scale multi-label and BERT: Pretraining task-agnostic visiolinguistic representa-
multi-class image classification. Dataset available from tions for vision-and-language tasks. In NeurIPS, 2019. 1, 2,
https://g.co/dataset/openimages, 2017. 4 5, 8, 12, 14, 15
[43] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, [56] Jiasen Lu, Vedanuj Goswami, Marcus Rohrbach, Devi
Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalan- Parikh, and Stefan Lee. 12-in-1: Multi-task vision and lan-
tidis, Li-Jia Li, David A. Shamma, Michael Bernstein, and guage representation learning. In CVPR, 2020. 1, 2, 8, 12,
Li Fei-Fei. Visual Genome: Connecting language and vi- 14
sion using crowdsourced dense image annotations. IJCV, [57] Jiasen Lu, Jianwei Yang, Dhruv Batra, and Devi Parikh.
123(1):32–73, 2017. 1, 5, 12 Neural baby talk. In CVPR, 2018. 8

[58] Dhruv Mahajan, Ross B. Girshick, Vignesh Ramanathan, text dataset for multimodal multilingual machine learning.
Kaiming He, Manohar Paluri, Yixuan Li, Ashwin Bharambe, arXiv preprint arXiv:2103.01913, 2021. 8
and Laurens van der Maaten. Exploring the limits of weakly [73] Pierre Stock and Moustapha Cisse. Convnets and imagenet
supervised pretraining. In ECCV, 2018. 1 beyond accuracy: Understanding mistakes and uncovering
[59] Junhua Mao, Jonathan Huang, Alexander Toshev, Oana biases. In ECCV, 2018. 12
Camburu, Alan L. Yuille, and Kevin Murphy. Generation [74] Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu
and comprehension of unambiguous object descriptions. In Wei, and Jifeng Dai. VL-BERT: Pre-training of generic
CVPR, 2016. 12 visual-linguistic representations. In ICLR, 2020. 1, 2, 8,
[60] Milad Nasr, Reza Shokri, and Amir Houmansadr. Compre- 12, 13
hensive privacy analysis of deep learning: Passive and active [75] Alane Suhr, Stephanie Zhou, Ally Zhang, Iris Zhang, Huajun
white-box inference attacks against centralized and federated Bai, and Yoav Artzi. A corpus for reasoning about natural
learning. In IEEE SP, 2019. 12 language grounded in photographs. In ACL, 2018. 12
[61] Vicente Ordonez, Girish Kulkarni, and Tamara L. Berg. [76] Chen Sun, Abhinav Shrivastava, Saurabh Singh, and Abhi-
Im2Text: Describing images using 1 million captioned pho- nav Gupta. Revisiting unreasonable effectiveness of data in
tographs. In NIPS, 2011. 1, 8, 12 deep learning era. In ICCV, 2017. 1
[62] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing [77] Hao Tan and Mohit Bansal. LXMERT: Learning cross-
Zhu. Bleu: A method for automatic evaluation of machine modality encoder representations from transformers. In
translation. In ACL, 2002. 5 EMNLP-IJCNLP, 2019. 1, 2, 5, 8, 12, 15
[63] Bryan A. Plummer, Liwei Wang, Chris M. Cervantes, [78] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszko-
Juan C. Caicedo, Julia Hockenmaier, and Svetlana Lazeb- reit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia
nik. Flickr30k entities: Collecting region-to-phrase corre- Polosukhin. Attention is all you need. In NIPS, 2017. 1, 5
spondences for richer image-to-sentence models. In ICCV, [79] Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi
2015. 5 Parikh. CIDEr: Consensus-based image description evalu-
ation. In CVPR, 2015. 5
[64] Jordi Pont-Tuset, Jasper Uijlings, Soravit Changpinyo, Radu
[80] Tianlu Wang, Jieyu Zhao, Mark Yatskar, Kai-Wei Chang,
Soricut, and Vittorio Ferrari. Connecting vision and lan-
and Vicente Ordonez. Balanced datasets are not enough: Es-
guage with localized narratives. In ECCV, 2020. 5, 16
timating and mitigating gender bias in deep image represen-
[65] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya
tations. In ICCV, 2019. 12
Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry,
[81] Yu Wu, Linchao Zhu, Lu Jiang, and Yi Yang. Decoupled
Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen
novel object captioner. In MM, 2018. 8
Krueger, and Ilya Sutskever. Learning transferable visual
[82] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell,
models from natural language supervision. arXiv preprint
Ruslan Salakhutdinov, and Quoc V Le. XLNet: General-
arXiv:2103.00020, 2021. 8
ized autoregressive pretraining for language understanding.
[66] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee,
arXiv preprint arXiv:1906.08237, 2019. 1
Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and
[83] Ting Yao, Yingwei Pan, Yehao Li, and Tao Mei. Incorpo-
Peter J. Liu. Exploring the limits of transfer learning with a
rating copying mechanism in image captioning for learning
unified text-to-text transformer. JMLR, 21(140):1–67, 2020.
novel objects. In CVPR, 2017. 8
1
[84] Peter Young, Alice Lai, Micah Hodosh, and Julia Hocken-
[67] Sainandan Ramakrishnan, Aishwarya Agrawal, and Stefan maier. From image descriptions to visual denotations: New
Lee. Overcoming language priors in visual question answer- similarity metrics for semantic inference over event descrip-
ing with adversarial regularization. In NeurIPS, 2018. 14 tions. TACL, 2:67–78, 2014. 2
[68] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. [85] Jun Yu, Jing Li, Zhou Yu, and Qingming Huang. Multimodal
Faster R-CNN: Towards real-time object detection with re- transformer with multi-view visual representation for image
gion proposal networks. In NIPS, 2015. 5 captioning. arXiv, 1905.07841, 2019. 5
[69] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, San- [86] Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter J.
jeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Liu. PEGASUS: Pre-training with extracted gap-sentences
Aditya Khosla, Michael Bernstein, Alexander C. Berg, and for abstractive summarization. In ICML, 2020. 14
Li Fei-Fei. ImageNet large scale visual recognition chal- [87] Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Ordonez,
lenge. IJCV, 115(3):211–252, 2015. 1, 5 and Kai-Wei Chang. Men also like shopping: Reducing
[70] Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu gender bias amplification using corpus-level constraints. In
Soricut. Conceptual Captions: A cleaned, hypernymed, im- EMNLP, 2017. 12
age alt-text dataset for automatic image captioning. In ACL, [88] Luowei Zhou, Hamid Palangi, Lei Zhang, Houdong Hu, Ja-
2018. 1, 2, 3, 5, 8, 12 son J. Corso, and Jianfeng Gao. Unified vision-language pre-
[71] Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan training for image captioning and VQA. In AAAI, 2020. 1,
Liu. MASS: Masked sequence to sequence pre-training for 2, 4, 8, 12
language generation. In ICML, 2019. 14 [89] Xiangxin Zhu, Dragomir Anguelov, and Deva Ramanan.
[72] Krishna Srinivasan, Karthik Raman, Jiecao Chen, Michael Capturing long-tail distributions of object subcategories. In
Bendersky, and Marc Najork. WIT: Wikipedia-based image CVPR, 2014. 8

[90] Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In ICCV, 2015. 1

A. Broader Impact

Our publicly-available V+L pre-training resource CC12M has the potential to positively impact multiple vision-and-language tasks. One main aspect that we have identified is a much higher degree of coverage of long-tail visual concepts than previous resources, including CC3M. As a result, we expect models (pre-)trained on our data to be more robust in the wild than before.

In addition, our work could benefit the design of new setups for downstream tasks that shift away from in-domain (e.g., COCO/Visual Genome) to out-of-domain/in-the-wild (e.g., OID), similar to nocaps, on which our work focuses heavily. Such setups could also avoid the use of in-domain data during pre-training, which in some cases results in transfer learning between (almost) identical sets of images, e.g., COCO, Visual Genome (VG), VQA2, VG QA, Visual7W, GQA, GuessWhat, and RefCOCO*.

At the same time, datasets curated from the Web can come with risks such as unsuitable content (adult content, profanity) and unintended privacy leakage [60, 17, 18]. We take the steps described in Sect. 2.2 of the main text to mitigate both of these risks, applying the necessary image and text filtering steps and replacing each person name (celebrities' included) with the special <PERSON> token.

Less specific to Web data are unwanted dataset biases [13, 73, 80] that are prone to amplification by machine learning models [11, 87]. Our preliminary analysis in Sect. 2.3 of the main text sheds light on the degree to which our data exhibits some aspects of these inherent biases, and we suspect that the better coverage of the tail in fact makes this issue less severe. Nevertheless, the users of this data and of the systems trained on it should be aware of such risks and of other ones that might arise.

B. Additional analyses of CC12M

B.1. Out-of-domain (OOD) visual concepts on an expanded list of datasets

We use the 394 nocaps out-of-domain classes as a proxy for OOD visual concepts and analyze popular vision-and-language datasets, in addition to the CC3M and CC12M that we focus on in the main text. These datasets span a wide range of use cases, both in terms of tasks (image-to-text generation, image-and-text matching, visual question answering (VQA), referring expression comprehension, and multimodal verification) and in terms of the stage during which they are used (pre-training, fine-tuning/evaluation, or both).

• CC3M [70] An instance of text is the caption associated with each image url of the training split. It has been used and is currently the most popular V+L pre-training dataset [55, 3, 21, 74, 88, 56, 50].
• CC12M (ours) An instance of text is the caption associated with each image url.
• COCO Captions [20] An instance of text comes from the caption associated with each image of the 2017 training split (five captions per image). This dataset is designed for the task of image captioning, and has been used for caption-based image retrieval as well. It has been used for V+L pre-training [77, 49, 21, 50].
• Visual Genome [43] An instance of text comes from the caption of each region in images of the training split. This dataset aims to connect vision and language through scene graphs and is used for multiple tasks that include, but are not limited to, dense image captioning, visual relationship detection and scene graph parsing, image retrieval and generation, and visual question answering. It has been used for V+L pre-training [77, 21].
• SBU Captions [61] An instance of text is the caption associated with each image url of the “preferred” version of the dataset. This dataset is designed for the task of image captioning. It has been used for V+L pre-training [77, 21, 48, 50].
• VQA2 [27] An instance of text is the question and the answers in each image-question-answers triplet of the train2014 + val2train2014 splits. This dataset is designed for the task of visual question answering (VQA) [8]. It has been used for V+L pre-training [77, 50].
• RefCOCOg [59] An instance of text is the referring expression for each region in images of the training split. This dataset is designed for the task of referring expression comprehension [39].
• NLVR2 [75] An instance of text comes from the caption associated with each pair of images of the training split. This dataset is used for the task called multimodal verification in [56], but is designed for the general task of visual reasoning.

Table 9 summarizes the number of instances whose texts contain OOD visual concepts for all selected datasets. We use both the absolute frequency and the normalized one (per 1M text instances). Essentially, these numbers indicate the degree of OOD coverage. We find that CC12M has many more OOD instances than all other datasets by a large margin (6.7x median and 5.8x mean vs. the second best, CC3M). Moreover, CC12M still prevails even after normalization to account for its size. In other words, CC12M covers these OOD classes better in both absolute and relative senses.
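To make the numbers in Table 9 concrete, the following minimal Python sketch computes, for each dataset, the per-class frequency of the OOD concepts and its per-1M-instances normalization, then reports the median and mean across classes. It assumes simple lowercase whole-word matching; the exact matching rules behind Table 9 (e.g., plural forms or multi-word aliases) may differ, and the function name is ours.

import re
from statistics import mean, median
from typing import Iterable, List

def ood_class_frequencies(texts: Iterable[str], ood_classes: List[str]) -> dict:
    # Count, for each OOD class, how many text instances mention it.
    # Assumption: lowercase whole-word matching on the raw text.
    texts = [t.lower() for t in texts]
    patterns = [re.compile(r"\b" + re.escape(c.lower()) + r"\b") for c in ood_classes]
    counts = [sum(1 for t in texts if p.search(t)) for p in patterns]
    n = max(len(texts), 1)
    per_1m = [1e6 * k / n for k in counts]  # normalized per 1M text instances
    return {"freq_median": median(counts), "freq_mean": mean(counts),
            "freq_per_1m_median": median(per_1m), "freq_per_1m_mean": mean(per_1m)}

# Example: ood_class_frequencies(cc12m_captions, nocaps_ood_classes)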
Dataset | Freq (median, mean) | Freq per 1M (median, mean)
CC3M | 462, 2325.7 | 139.2, 700.8
CC12M | 3110, 13455.8 | 250.3, 1083.1
COCO Captions | 37, 248.6 | 62.3, 417.1
Visual Genome | 133, 1114.47 | 40.7, 341.4
SBU Captions | 121, 798.6 | 121.0, 798.6
VQA2 | 37, 242.0 | 63.8, 417.2
RefCOCOg | 1, 21.2 | 8.8, 186.4
NLVR2 | 4, 79.9 | 11.6, 245.5

Table 9: Statistics of the (normalized) frequency of nocaps’ out-of-domain visual concepts in the texts of popular vision-and-language datasets.

Fig. 5 provides a more complete picture of the normalized frequency of OOD classes in these datasets, at different thresholds. It shows the number of OOD classes (y-axis) with at least K occurrences per 1M captions (x-axis). Evidently, other datasets experience sharper drops as K increases than CC12M (black solid curve). We also find that captioning datasets (solid curves) generally provide better coverage than non-captioning datasets: VQA2, RefCOCOg, and NLVR2 (dashed curves).

Figure 5: Comparison of nocaps’ out-of-domain coverage degree among captioning (solid) and 3 other tasks’ (dashed) datasets (see text for details).

B.2. The impact of the dataset size

We experiment with pre-training on randomly subsampled CC12M, at 25% (3.1M) and 50% (6.2M), and evaluate the pre-trained models on novel object captioning on nocaps and zero-shot IR on Flickr30K. Fig. 6 shows a the-larger-the-better trend, with 25% of CC12M giving rise to performance similar to CC3M.

Figure 6: Performance with sub-sampled CC12M (25% & 50%) on novel object captioning (left, CIDEr on nocaps val) and zero-shot IR (right, recall@1 on Flickr30K test).

C. Qualitative Results for Image Retrieval

Fig. 7 provides qualitative image retrieval results on the Flickr30K dataset: the top-3 images retrieved by the from-scratch model trained on Flickr30K, as well as by two models pre-trained on CC3M and CC12M and then fine-tuned on Flickr30K. We report three cases in which CC12M pre-training helps correct the rankings from the other two models, which we suspect is due to the model getting more familiar with the rare words, highlighted in blue.

Figure 7: Qualitative results for the image retrieval task on Flickr30K given the query text (very top) when the model is not pre-trained (top), pre-trained on CC3M (middle), and pre-trained on CC12M (bottom). The three query texts are “a blond child holding a sword and dressed in a black robe is standing next to a stroller”, “a man sitting on a chair with a beer in his hands roasting something to eat on a wooden stick”, and “a man and a woman hold the front arm of a large tiger that is laying on the ground among various medical devices”.

D. Pre-Training: Data and Method Variants

D.1. Vision-to-Language Pre-Training on LocNar Open Images

Table 10 considers pre-training on LocNar Open Images for the nocaps benchmark. We observe inferior performance to both CC3M and CC12M. We attribute this to the long narratives in LocNar having drastically different styles from those of COCO Captions and nocaps. Furthermore, the data collection protocol in nocaps does not involve priming the annotator to mention the object names present in the image, resulting in more generic terms (instrument vs. guitar). This again highlights the natural fine-grainedness inherent in noisy Web data, especially in the case of a non-hypernymized data source (CC12M).

nocaps val
Pre-training data | in-domain (CIDEr, SPICE) | near-domain (CIDEr, SPICE) | out-of-domain (CIDEr, SPICE) | overall (BLEU1, BLEU4, METEOR, ROUGE, CIDEr, SPICE)
LocNar Open Images | 76.0, 11.6 | 65.9, 10.9 | 48.9, 9.3 | 73.3, 17.4, 23.5, 50.7, 63.9, 10.7
CC3M | 81.8, 11.6 | 73.7, 11.1 | 65.3, 10.1 | 74.6, 19.1, 24.1, 51.5, 73.2, 11.0
CC12M | 88.3, 12.3 | 86.0, 11.8 | 91.3, 11.2 | 78.5, 23.4, 25.9, 54.5, 87.4, 11.8

Table 10: Comparison between pre-training data. LocNar Open Images’ images are from the same visual domain as nocaps. All approaches use the ic pre-training objective.

D.2. Pre-Training Strategies

In the main text, we focus on the image captioning (ic) and the visual-linguistic matching (vlm) learning objectives during both the pre-training and fine-tuning stages. Our motivation here is to keep the setup for evaluating pre-training data as “clean” as possible. However, other pre-training strategies exist in the literature, and we describe them and test their effectiveness in this section.

D.2.1 Masked Vision-to-Language Generation

Given the training image-text pairs, the ic objective predicts the text from the image. The following objectives predict (all or part of) the text from the image and (all or part of) the text. In order to encode both the image and the text, we concatenate the sequence of image feature vectors and the sequence of text token feature vectors, and use the Transformer encoder to encode them [49, 21, 74].
This vanilla fusion is effective, and is shown to consistently outperform the co-attentional transformer layer [55, 56], in which the “query” comes from a different modality than the “key” and “value” (see Sect. 2 and Fig. 2 in [55] for details).

Masked Language Modeling (mlm). We mask a percentage of the input text tokens at random, and predict the target text sequence using the decoder. Following BERT [25], we use a mixed strategy for masking: for each selected token, we replace it with the mask token [MASK] 80% of the time, replace it with a random token 10% of the time, and leave it as is 10% of the time.

Masked Sequence to Sequence Modeling (mass). We apply the same mixed masking strategy as in mlm to the input text tokens, but require that the mask be applied to consecutive tokens (i.e., a contiguous segment). The task is to sequentially predict the masked segment using the decoder. This approach is inspired by MASS [71] and PEGASUS [86].
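The only difference between the two schemes is how the masked positions are selected; both reuse the 80/10/10 replacement rule. Below is a minimal Python sketch of this token-corruption step, assuming a non-empty tokenized caption and a subtoken vocabulary; the helper names are ours, and the real pipeline operates on subtoken ids rather than strings.

import random
from typing import List

MASK = "[MASK]"

def corrupt(tokens: List[str], positions: List[int], vocab: List[str]) -> List[str]:
    # BERT-style mixed replacement: 80% [MASK], 10% random token, 10% unchanged.
    out = list(tokens)
    for i in positions:
        r = random.random()
        if r < 0.8:
            out[i] = MASK
        elif r < 0.9:
            out[i] = random.choice(vocab)
        # else: keep the original token
    return out

def mlm_corrupt(tokens: List[str], vocab: List[str], rate: float) -> List[str]:
    # mlm: positions are selected independently at random at the given rate.
    positions = [i for i in range(len(tokens)) if random.random() < rate]
    return corrupt(tokens, positions, vocab)

def mass_corrupt(tokens: List[str], vocab: List[str], rate: float) -> List[str]:
    # mass: one contiguous segment covering roughly `rate` of the token list.
    span = max(1, round(rate * len(tokens)))
    start = random.randint(0, len(tokens) - span)
    return corrupt(tokens, list(range(start, start + span)), vocab)

In both cases the corrupted sequence is fed to the encoder together with the image features, while the decoder is trained to predict the original target tokens (the full sequence for mlm, the masked segment for mass).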
Results. Table 11 compares the ic, mlm, and mass pre-training objectives. Our main observation is that ic clearly outperforms masked vision-to-language pre-training when the masking rate is low. Overall, ic is competitive with mlm and mass, slightly below mlm[.8] in overall CIDEr, but higher on out-of-domain CIDEr.

In addition, the trend suggests that it is critical that the text masking rate be high enough that the models become less and less reliant on text; that is, that mlm and mass become more similar to the ic task. Note that the widely-used configurations in the VLP literature on vision-and-language understanding are the ones with low text masking rates (0.2 in most cases), which consistently underperform in our generation setup.

We attribute this result to the models’ (over)reliance on text during pre-training, which hurts the quality of their image representations. Supporting evidence for this phenomenon is found in the recent work of [16], which observes that image+text pre-trained models exhibit a preference for attending to text rather than images during inference (in image and text understanding tasks). Additional supporting evidence is the issue of strong language priors (well known in the VQA community), which has led to interest in adversarial test sets and other methods to overcome strong language biases [1, 67, 22, 14]. The same phenomenon has been reported for multi-modal machine translation, where models trained on image+text tend to ignore the image and primarily use the text input [15]. Based on these results, the design of V+L pre-training objectives that are capable of outperforming the image-only ic objective (i.e., overcoming this reliance on language) is an interesting avenue for future work.
Another observation is that mass works significantly better than mlm for lower masking rates. When the masking rates are high, the two objectives become more similar. This suggests the importance of bridging the gap between pre-training and fine-tuning (producing consecutive tokens).

nocaps val
Pre-training objective | in-domain (CIDEr, SPICE) | near-domain (CIDEr, SPICE) | out-of-domain (CIDEr, SPICE) | overall (BLEU1, BLEU4, METEOR, ROUGE, CIDEr, SPICE)
ic | 88.3, 12.3 | 86.0, 11.8 | 91.3, 11.2 | 78.5, 23.4, 25.9, 54.5, 87.4, 11.8
mlm[.1] | 76.4, 11.5 | 68.4, 10.8 | 57.6, 9.6 | 73.0, 18.1, 23.5, 50.6, 67.4, 10.6
mlm[.2] | 79.8, 11.3 | 76.3, 10.9 | 76.2, 10.2 | 76.2, 20.5, 24.1, 52.4, 76.8, 10.8
mlm[.4] | 86.5, 12.3 | 82.7, 11.5 | 86.3, 11.3 | 78.0, 22.7, 25.2, 53.7, 84.0, 11.6
mlm[.8] | 89.3, 12.5 | 87.5, 11.9 | 91.1, 11.3 | 78.7, 23.8, 25.9, 54.4, 88.5, 11.9
mass[.1] | 86.0, 12.1 | 74.8, 11.1 | 71.7, 10.1 | 75.8, 20.5, 24.6, 52.5, 75.8, 11.0
mass[.2] | 84.9, 12.0 | 78.1, 11.2 | 78.6, 10.5 | 76.0, 20.8, 24.7, 52.7, 79.2, 11.2
mass[.4] | 85.7, 11.7 | 83.7, 11.5 | 88.5, 10.9 | 77.3, 22.8, 25.1, 53.6, 85.0, 11.4
mass[.8] | 88.8, 12.2 | 85.1, 11.7 | 87.8, 10.6 | 78.1, 23.7, 25.5, 54.2, 86.2, 11.5

Table 11: Comparison between the ic pre-training and masked V+L pre-training. We consider two masking schemes (mlm and mass) and four masking rates (.1, .2, .4, .8) and report their effects on the nocaps val set.

D.2.2 Image Captioning with Visual-Linguistic Matching or Masked Object Classification

We explore adding auxiliary losses to the main ic objective. First, we define a pre-training task that does not require text.

Masked object classification (moc). We mask one of the visual regions (selected at random), and predict the cluster ID of that region [55, 77, 21]. We use a total of 8192 clusters, obtained via K-means over the training data.

Then, we either add the vlm loss (multiplied by 0.1) or the moc loss (multiplied by 0.1) to the main ic loss; a minimal sketch of this combination is given below.
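The sketch below illustrates the two ingredients just described: assigning each region feature to one of the 8192 K-means centroids (the moc classification target) and adding the weighted auxiliary loss to the ic loss. The function names, and the assumption that the centroids have already been fitted, are ours; this is an illustration, not the actual training code.

import numpy as np

def moc_targets(region_features: np.ndarray, centroids: np.ndarray) -> np.ndarray:
    # region_features: (num_regions, dim); centroids: (8192, dim) from K-means over
    # training-set region features. The cluster index of the masked region is the
    # classification target for moc.
    d2 = ((region_features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)

def pretraining_loss(ic_loss, aux_loss=None, aux_weight=0.1):
    # ic is always the main objective; vlm or moc, if used, is added with weight 0.1.
    return ic_loss if aux_loss is None else ic_loss + aux_weight * aux_loss

# Example: total = pretraining_loss(ic_loss, aux_loss=moc_loss)  # the ic+moc setting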
Results. Table 12 reports the effect of multi-task pre-training on the nocaps val set. We observe a slight improvement when adding moc but a slight drop when adding vlm. This again shows that ic is a good pre-training task to start with. We leave developing advanced auxiliary losses on top of it, as well as multi-task pre-training strategies, for future work.

nocaps val
Pre-training objectives | in-domain (CIDEr, SPICE) | near-domain (CIDEr, SPICE) | out-of-domain (CIDEr, SPICE) | overall (BLEU1, BLEU4, METEOR, ROUGE, CIDEr, SPICE)
ic | 88.3, 12.3 | 86.0, 11.8 | 91.3, 11.2 | 78.5, 23.4, 25.9, 54.5, 87.4, 11.8
ic+vlm | 88.6, 12.3 | 85.8, 11.9 | 90.0, 11.4 | 78.0, 23.1, 25.7, 54.4, 87.1, 11.9
ic+moc | 91.1, 12.4 | 88.4, 12.1 | 93.6, 11.4 | 78.8, 24.6, 26.2, 55.2, 89.9, 12.0

Table 12: Effect of visual-linguistic matching (vlm) and masked object classification (moc) when combined with the ic objective on the nocaps val set.
E. Implementation Details

E.1. Data Preprocessing and Feature Embedding

• Text tokenizer: text is preprocessed with the COCO tokenizer (https://github.com/tylin/coco-caption). We then create a vocabulary of subtokens out of these.
• Text input embedding (during pre-training only): subtoken lookup embeddings of size E = 512 are randomly initialized, followed by Linear(512)-ReLU-Dropout(0.3)-Linear(512).
• Image’s geometric features: two pairs of coordinates (top left and bottom right) and the relative area, represented by relative numbers between 0 and 1. Each of these 5 numbers is linearly projected into an embedding of size E = 512. We concatenate the results to get an embedding of size E x 5 = 2560, followed by Linear(512)-ReLU-Dropout(0.3)-Linear(512).
• Image’s semantic features: each feature vector (a global image feature vector or one of the 16 box image feature vectors) is followed by Linear(512)-ReLU-Dropout(0.3)-Linear(512).
• Image’s combined geometric and semantic features: we first apply LayerNorm [9] to each of the geometric and semantic features. We then add the two and apply Linear(512)-ReLU-Dropout(0.3)-Linear(512)-LayerNorm.
• Image’s tag features: same as the text input embedding.

For the ic objective, we have a bag of 1 + 16 visual feature vectors and up to 16 tag feature vectors, each of size 512. For the vlm objective, where text has to be encoded, we also have a sequence of text (sub)token feature vectors of size 512.
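As an illustration, the geometric and combined embeddings above can be written as the following PyTorch modules. This is a sketch reconstructed from the bullet points (the class and module names are ours), not the actual implementation.

import torch
import torch.nn as nn

E = 512

def mlp(in_dim: int) -> nn.Sequential:
    # Shared projection recipe: Linear(512)-ReLU-Dropout(0.3)-Linear(512).
    return nn.Sequential(nn.Linear(in_dim, E), nn.ReLU(), nn.Dropout(0.3), nn.Linear(E, E))

class GeometricEmbedding(nn.Module):
    # 5 scalars (two corner coordinates plus the relative area), each projected to E,
    # concatenated to 5 * E = 2560, then projected back to E.
    def __init__(self):
        super().__init__()
        self.scalar_proj = nn.ModuleList([nn.Linear(1, E) for _ in range(5)])
        self.out = mlp(5 * E)

    def forward(self, geo: torch.Tensor) -> torch.Tensor:  # geo: (num_boxes, 5) in [0, 1]
        parts = [proj(geo[:, i:i + 1]) for i, proj in enumerate(self.scalar_proj)]
        return self.out(torch.cat(parts, dim=-1))

class CombinedImageEmbedding(nn.Module):
    # LayerNorm each of the geometric and semantic embeddings, add them, then apply
    # Linear-ReLU-Dropout-Linear followed by LayerNorm.
    def __init__(self):
        super().__init__()
        self.norm_geo, self.norm_sem = nn.LayerNorm(E), nn.LayerNorm(E)
        self.out = nn.Sequential(mlp(E), nn.LayerNorm(E))

    def forward(self, geo_emb: torch.Tensor, sem_emb: torch.Tensor) -> torch.Tensor:
        return self.out(self.norm_geo(geo_emb) + self.norm_sem(sem_emb))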
E.2. Model

The ic-based task uses a Transformer encoder-decoder model. The vlm-based task uses two Transformer encoders, one for texts and the other for images.
• Transformer image encoder: number of layers L = 6.
• Transformer image encoder: vocab embedding size E = 512.
• Transformer image encoder: hidden embedding size H = 1024.
• Transformer image encoder: feedforward/filter size F = H x 4 = 4096, following [25].
• Transformer image encoder: number of attention heads A = H / 64 = 8, following [25].
• Transformer text encoder (for vlm only): L, E, H, F, A are the same as for the Transformer image encoder.
• Transformer decoder: L, E, H, F, A are the same as for the Transformer image encoder.
• Transformer decoder: beam search width = 5.
• Transformer decoder: beam search alpha = 0.6.
• Transformer decoder: maximum output length = 36 for all datasets except for LocNar, for which it is set to 180.
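For concreteness, the encoder hyperparameters above map onto a standard PyTorch Transformer encoder as sketched below. This only instantiates the stated sizes; it leaves out the decoder, beam search, and how the size-E input embeddings are mapped to the hidden size H (we assume some projection exists, which the text does not specify).

import torch.nn as nn

L, E, H, F, A = 6, 512, 1024, 4096, 8  # layers, input embedding, hidden, feedforward, heads

image_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=H, nhead=A, dim_feedforward=F),
    num_layers=L,
)
# The text encoder (vlm only) and the decoder use the same L, E, H, F, A; decoding runs
# beam search with width 5, length-penalty alpha 0.6, and a maximum output length of 36
# tokens (180 for LocNar).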
E.3. Training

• Infrastructure: Google Cloud 32-core TPUs.
• Batch size per core: 128 (for a total of 4096).
• Optimizer: Adam [41] with default hyperparameters (except for the initial learning rate; see below).
• Learning rate (initial): see Hyperparameter search below.
• Learning rate (warm-up epochs): 20 for all pre-training and fine-tuning experiments.
• Learning rate (decay rate): 0.95 for all pre-training and fine-tuning experiments.
• Learning rate (decay epochs): 25 for all pre-training and fine-tuning experiments.
• Data augmentation: the set of input visual regions is permuted during training.
• Maximum number of steps: 2M for vision-to-language generation pre-training on both CC12M and CC3M (and CC3M+CC12M). For vision-and-language matching, 1M for CC3M instead. See Hyperparameter search below for fine-tuning experiments.
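Taken together, the learning-rate bullets describe a warm-up phase followed by staged exponential decay. The helper below is a sketch of one plausible reading, assuming linear warm-up and decay measured from the start of training; the exact warm-up shape and decay reference point are not specified in the text.

def learning_rate(epoch: float, init_lr: float = 3.2e-5, warmup_epochs: float = 20.0,
                  decay_rate: float = 0.95, decay_epochs: float = 25.0) -> float:
    # Linear warm-up to init_lr over the first `warmup_epochs` epochs (assumption),
    # then decay by `decay_rate` once every `decay_epochs` epochs.
    if epoch < warmup_epochs:
        return init_lr * epoch / warmup_epochs
    return init_lr * decay_rate ** (epoch / decay_epochs)

# Example: learning_rate(50.0) == 3.2e-5 * 0.95 ** 2.0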
E.4. Evaluation

For nocaps evaluation, we submit inference results to the leaderboard at https://evalai.cloudcv.org/web/challenges/challenge-page/464/overview. Code for all evaluation metrics can be found at https://github.com/nocaps-org/updown-baseline/blob/master/updown/utils/evalai.py. For in-depth discussions of these metrics, see [40].

Participating in the default formulation of the nocaps challenge requires that one (i) does not use the val and test Open Images ground-truth object detection annotations, and (ii) does not use image-caption data collected via additional annotation protocols. We satisfy both requirements: we train our object detector on Visual Genome, and both CC3M and CC12M are automatically harvested from the web (alt-text) and belong to the category of noisy web data, therefore satisfying the second requirement. On the other hand, models that leverage the Open Images Localized Narratives dataset (LocNar) [64] for pre-training belong to the nocaps (XD) leaderboard rather than the default one.

Some of our results on the CC3M benchmark are taken from the leaderboard, which is located at https://ai.google.com/research/ConceptualCaptions/leaderboard?active_tab=leaderboard.

E.5. Hyperparameter search

For pre-training experiments, we do not conduct hyperparameter tuning beyond an initial stage of exploration, as we believe small changes would not considerably affect the downstream performance. For instance, we fix the initial learning rate to 0.000032 and observe that it works consistently well (on the validation set) across scenarios.

For fine-tuning experiments, we focus on tuning one hyperparameter: the initial learning rate. In the case of nocaps, we also lightly tune the maximum number of training steps, as we observe the model overfitting on COCO Captions. In all cases, we make sure to allocate similar resources to any two settings that we compare, such as the pre-training data sources CC3M and CC12M.

For generation, the range for the initial learning rate is {3.2e-9, 3.2e-8, 3.2e-7} and the range for the maximum number of training steps is {5K, 10K}. For matching, the range for the initial learning rate is {3.2e-8, 3.2e-7, 3.2e-6}, while the maximum number of training steps is fixed to 10K.
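The fine-tuning search above is a small grid over two knobs. A sketch of how such a sweep might be driven is shown below, assuming a caller-provided train_and_eval(config) function that fine-tunes with the given configuration and returns the validation metric; all names here are illustrative.

from itertools import product

GENERATION_GRID = {"init_lr": [3.2e-9, 3.2e-8, 3.2e-7], "max_steps": [5_000, 10_000]}
MATCHING_GRID = {"init_lr": [3.2e-8, 3.2e-7, 3.2e-6], "max_steps": [10_000]}

def best_config(grid: dict, train_and_eval) -> dict:
    # Try every combination in the grid and keep the configuration with the highest
    # validation score returned by train_and_eval.
    keys = sorted(grid)
    candidates = [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]
    return max(candidates, key=train_and_eval)

# Example: best = best_config(GENERATION_GRID, train_and_eval)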
