Conceptual 12M
In contrast, V+L datasets are limited in two ways. First, the effective sizes of popular V+L datasets are low. The number of images in these datasets ranges from fewer than a few hundred thousand [84, 20, 44, 28] to several million [70], with lower text quality as the scale increases. Second, many of the small-sized datasets share the same, limited visual domain; COCO-Captions [20], Visual Genome [44], and VQA2 [27] are (mostly) based on several hundred thousand COCO images [52]. The lack of scale and diversity of visual concepts (with respect to vision/language-only counterparts) makes it hard for V+L models to perform adequately in the wild.

One major reason for these gaps is the difficulty in collecting such datasets. Unlike in image classification, "text" in V+L datasets is longer and less likely to be agreed upon, making the annotation process more costly and time-consuming. One approach to remedy this is to make use of the large amounts of alt-texts accompanying images on the Web. For instance, Sharma et al. introduced Conceptual Captions (CC3M) [70], a dataset of 3.3M ⟨image, caption⟩ pairs that result from a filtering and post-processing pipeline applied to those alt-texts. Despite being automatically collected, CC3M has been shown to be effective both for image captioning in the wild [70, 19] and for V+L pre-training [55, 49, 21, 77, 3, 74, 88, 48, 56]. In other words, it provides a promising start for large-scale V+L annotations.

In this paper, we explore pushing the limits of V+L data using this approach. Our key insight is that specific downstream V+L tasks (e.g., VQA, image captioning) can be overly restrictive if the goal is to collect large-scale V+L annotations. For instance, CC3M was collected to favor high-precision texts that are fit for the downstream task of image captioning. Yet, we have witnessed this dataset being increasingly adopted for V+L pre-training [55, 21, 3, 74, 88, 48, 56], arguably beyond its original purpose.

We hypothesize that the V+L field could benefit from such an insight, and therefore we introduce Conceptual 12M (CC12M), a high(er)-recall V+L dataset for the purpose of V+L pre-training. By relaxing multiple image and text filters used in CC3M, we obtain a less precise but 4x larger V+L set of ⟨image, text⟩ pairs. We perform an analysis of this dataset and show that it covers a wider range of visual concepts.

We test our hypothesis by benchmarking the effectiveness of CC12M as a pre-training data source on several V+L tasks, in comparison to CC3M. We explore two main pre-training strategies (and more in the Supplementary material): one for vision-to-language generation and the other for vision-and-language matching. Our experiments indicate that scaling up V+L pre-training data has a dramatic positive effect on image captioning, novel object captioning, and (zero-shot) image retrieval.

In summary, our main contributions are:
(a) A public larger-scale V+L pre-training dataset that covers a much wider range of concepts than existing ones.
(b) Evaluation on downstream vision-to-language generation and vision-and-language matching with an emphasis on long-tail recognition that consistently shows the superiority of this dataset over CC3M.
(c) State-of-the-art results on the nocaps (novel object captioning) and Conceptual Captions benchmarks.

2. Vision-and-Language Pre-Training Data

We first review the data collection pipeline for Conceptual Captions 3M (CC3M) outlined in Sect. 3 of [70], which we followed closely. We then describe a series of relaxations and simplifications to the pipeline that result in CC12M, a much larger set of image-text pairs. Finally, we perform an analysis of CC12M in comparison with CC3M and other existing V+L datasets.

2.1. Conceptual Captions 3M: Pipeline for extracting and cleaning Image Alt-Text from the Web

The Conceptual Captions dataset consists of about 3.3M Web images and their corresponding cleaned, hypernymized Alt-texts [70]. This approach leverages a promising source of (weak) supervision for learning correspondence between visual and linguistic concepts: once the pipeline is established, the data collection requires no additional human intervention. It consists of the following 4 steps: (i) image-based filtering based on size, aspect ratio, encoding format, and offensive content; (ii) text-based filtering based on language, capitalization, token frequency, pre-defined unwanted phrases, as well as part-of-speech (POS), sentiment/polarity, and adult content detection (using Google Cloud Natural Language APIs); (iii) image-text–based filtering based on the number of image tags (as predicted by Google Cloud Vision APIs) that overlap with the existing text; (iv) text transformations, most notably hypernymization of named entities, including proper names of persons, organizations, and locations (e.g., both "Harrison Ford" and "Calista Flockhart" are replaced by "actor"), deletion of time-related spans, and digit replacement (using # as a digit abstraction).

The large-scale nature and the high degree of textual and visual diversity make this dataset particularly suited to V+L pre-training [55, 21, 74, 88, 48, 56].

2.2. CC12M: Relaxing filters for higher recall

Conceptual Captions was created to work out-of-the-box for training image captioning models, and thus it involves substantial image, text, and image-text filtering and processing to obtain clean, high-precision captions. As a result, this approach comes at the cost of low recall (many potentially useful ⟨image, Alt-text⟩ pairs are discarded).
Dataset     | # examples  | token/type | caption length
CC3M train  | 3,318,333   | 804.8      | 10.3 ± 4.5
CC12M       | 12,423,374  | 370.0      | 20.2 ± 16.3
Table 1: Basic statistics of CC12M vs. CC3M.
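As a rough guide (not the exact procedure behind Table 1), the token/type ratio and caption-length statistics can be estimated from the raw captions with simple whitespace tokenization:

from statistics import mean, stdev

def caption_stats(captions):
    # captions: list of caption strings
    tokens = [tok.lower() for cap in captions for tok in cap.split()]
    lengths = [len(cap.split()) for cap in captions]
    token_type_ratio = len(tokens) / len(set(tokens))  # word count / vocab size
    return {
        "num_examples": len(captions),
        "token/type": round(token_type_ratio, 1),
        "caption_length": (round(mean(lengths), 1), round(stdev(lengths), 1)),
    }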
However, this trade-off may not be optimal if the dataset is to be used primarily for V+L pre-training. Motivated by this, we follow a similar procedure to the one described in [70] but relax some of its filters, and construct the dataset called Conceptual 12M (CC12M), as detailed below.

Filtering. As described above, the construction of CC3M used three main filtering types [70]: image-based, text-based, and image-text–based. To arrive at CC12M, we keep the image-text filtering intact and relax the unimodal filters only. First, for image-based filtering, we set the maximum ratio of larger to smaller dimension to 2.5 instead of 2. We still keep only JPEG images with size greater than 400 pixels, and still exclude images that trigger pornography detectors. Second, in text-based filtering, we allow text between 3 and 256 words in the alt-text. We still discard candidates with no noun or no determiner, but permit ones without prepositions. We discard the heuristics regarding high unique-word ratio covering various POS tags and word capitalization. We set the maximum fraction of word repetition allowed to 0.2. Given a larger pool of text due to the above relaxations, the threshold for counting a word type as rare is increased from 5 to 20.
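The relaxed unimodal filters above can be summarized as a single predicate over an ⟨image, alt-text⟩ candidate. The sketch below is illustrative only: the POS-tag set and the pornography-detector flag stand in for the actual classifiers and APIs used in the pipeline, the image-text filter is out of scope, and the word-repetition fraction is one possible reading of that heuristic.

MAX_ASPECT_RATIO = 2.5       # relaxed from 2.0 in CC3M
MIN_DIMENSION = 400          # pixels; unchanged
MIN_WORDS, MAX_WORDS = 3, 256
MAX_WORD_REPETITION = 0.2

def keep_candidate(image, alt_text, pos_tags, triggers_porn_detector):
    # Image-based filtering (relaxed aspect ratio; same size/format/content checks).
    w, h = image.width, image.height          # assumes a PIL-like image object
    if image.format != "JPEG" or min(w, h) <= MIN_DIMENSION:
        return False
    if max(w, h) / min(w, h) > MAX_ASPECT_RATIO or triggers_porn_detector:
        return False
    # Text-based filtering (relaxed length; prepositions no longer required).
    words = alt_text.split()
    if not (MIN_WORDS <= len(words) <= MAX_WORDS):
        return False
    if "NOUN" not in pos_tags or "DET" not in pos_tags:
        return False  # still require a noun and a determiner
    repetition = 1.0 - len(set(w.lower() for w in words)) / len(words)
    if repetition > MAX_WORD_REPETITION:
        return False
    return True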
Text transformation. The main motivation for CC3M to perform text transformation is that a majority of candidate captions contain ultrafine-grained entities such as proper names (people, venues, locations, etc.), making them extremely difficult to learn as part of the image captioning task. In contrast, we are not restricted by the end task of image caption generation. Our intuition is that relatively more difficult pre-training data would lead to better transferability. We thus do not perform hypernymization or digit substitution as in [70]. The only exception to the "keep alt-texts as raw as possible" rule is performing person-name substitutions, which we identify as necessary to protect the privacy of the individuals in these images. For this step, we use the Google Cloud Natural Language APIs to detect all named entities of type Person, and substitute them with a special token ⟨PERSON⟩. Around 25% of all the alt-texts in CC12M are transformed in this fashion.
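A sketch of this person-name substitution step; detect_person_spans is a stand-in for the named-entity detection performed with the Google Cloud Natural Language APIs and is assumed to return character offsets of PERSON mentions.

PERSON_TOKEN = "<PERSON>"

def anonymize(alt_text, detect_person_spans):
    """Replace every detected person-name span with the special <PERSON> token."""
    spans = detect_person_spans(alt_text)           # e.g., [(start, end), ...]
    for start, end in sorted(spans, reverse=True):  # right-to-left keeps offsets valid
        alt_text = alt_text[:start] + PERSON_TOKEN + alt_text[end:]
    return alt_text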
2.3. Characteristics of CC12M

We provide an analysis of CC12M along multiple dimensions, focusing on comparing it to the most relevant CC3M. Additional analyses are in the supplementary material.

Basic statistics. As seen in Table 1, CC12M consists of 12.4M image-text pairs (extracted as of May 2020), about 4x larger than CC3M. It has a much lower token (word count) to type (vocab size) ratio, indicating a longer-tail distribution and a higher degree of diversity in the concepts captured. Lastly, the average caption length of CC12M is much longer. This is overall achieved by our relaxation of the filters, especially the text ones.

Quality. We compute a rough estimate of precision on 100 examples by asking two annotators to rate how well the given alt-text fits the image on a 1–5 scale: 1 (no fit), 2 (barely fit), 3 (somewhat), 4 (good fit, but disfluent language), 5 (perfect). We define precision as the fraction of captions with a score of 4 or above. We see a drop in precision, 76.6% vs. 90.3% as reported for CC3M (Table 2 in [70]). This analysis points to the precision/recall tradeoff in transitioning from CC3M to CC12M. Fig. 1 illustrates such a tradeoff: the "jellyfish" example would have been filtered out from CC3M (due to a high percentage of nouns and a lack of prepositions), but it is included in CC12M.

Visual concept distribution. We use the caption text tokens to represent the visual concepts. The long tail of visual concepts that emerges in CC12M spans many categories, and can be attributed to (1) a dramatic increase in scale, and (2) the absence of fine-grained entity hypernymization. We list some of them here to illustrate this point, in the format "⟨word⟩: ⟨frequency in CC3M⟩ → ⟨frequency in CC12M⟩": luffy 0 → 152, mangosteen 0 → 212, zanzibar 0 → 1138, sumo 1 → 661, pokemon 1 → 8615, chevrolet 1 → 12181, mehndi 3 → 9218, pooh 4 → 7286, cyberpunk 5 → 5247, keto 6 → 6046, hound 9 → 3392, quiche 50 → 1109, durian 61 → 552, jellyfish 456 → 2901.

Figure 2: Word clouds of the top 100 tokens in CC3M (the top cloud) and in CC12M (the bottom cloud).

We also visualize the head of the distribution in Fig. 2. We observe that "person" becomes much more frequent due to person substitution with the token "⟨PERSON⟩". Moreover, there are fewer "actor", "artist", "(football) player", as a result of removing hypernymization.

Finally, we inspect tokens that are unseen in CC3M.
We observe that these tokens may occur very frequently in CC12M if they are fine-grained instances such as locations ("france," "africa," "dc," "toronto") or digits ("2019", "10", "2018", "2020"). This is due to the removal of hypernymization and the dropping of time-related span deletion.

Biases. We study the context in which several sensitive terms related to gender, age, race, and ethnicity appear, such as "black"/"white"/"asian"/"african"/"indian", "man"/"woman", "young"/"old", etc. We observe no large biases in the distribution of these terms, either in terms of co-occurrence between sensitive term pairs or with other tokens. Furthermore, we check the distribution of web domains and, similar to visual concepts, we find it to be diverse and long-tail: >100K domains, with >40K of them contributing >10 samples. We take our preliminary study as a positive indication of no severe biases stemming from particular domains or communities. Finally, we provide a Broader Impact statement in the supplementary material.

3. Evaluating Vision-and-Language Pre-Training Data

The previous section describes CC3M and our CC12M. In this section, we evaluate both datasets on their ability to benefit V+L downstream tasks, measuring the impact of the visual grounding produced under the two settings. For the sake of comparison, we do not include the images that appear in CC3M in CC12M in our experiments.

We focus on the two most fundamental V+L tasks: vision-to-language generation and vision-and-language matching. In both cases, our emphasis is on (i) the simplest setting in which the learning objectives during pre-training and downstream tasks match, and (ii) long-tail recognition and out-of-distribution generalization, as we believe this is where pre-training has the most impact. Fig. 3 and Table 2 summarize our experimental setup, in terms of the downstream tasks and the fine-tuning and evaluation datasets.

Figure 3: Main Pre-Training Tasks: image captioning (vision-to-language generation) and visual-linguistic matching (vision-and-language understanding).

Downstream task          | Train             | Eval
Novel object captioning  | COCO Captions     | nocaps
Novel object captioning  | LocNar COCO       | LocNar OID
Image captioning         | CC3M              | CC3M
Zero-shot IR             | None              | Flickr30K
IR                       | Flickr30K         | Flickr30K
IR                       | LocNar Flickr30K  | LocNar Flickr30K
Table 2: Generation (top) and matching (bottom) tasks and datasets considered in this paper. IR = Image Retrieval.

3.1. Vision-to-Language Generation

3.1.1 Pre-Training Tasks

We use image captioning (ic) as the pre-training task. The task is to predict the target caption given image features. To train the model parameters, we use the standard cross-entropy loss given the ground-truth caption.

Note that there exist vision-to-language generation pre-training strategies that are different from ours. For instance, Zhou et al. [88] adapt BERT [25] to generate text. As masked language modeling is used for pre-training, there is no decoder and, at inference time, text is generated using the encoder network one token at a time, appending the mask token to the image and the text generated so far. Thus, this approach is inefficient, as the number of passes over the input image is linear in the desired caption length. It is also unclear how to incorporate advanced decoding schemes such as beam search, top-k sampling, or nucleus sampling (see, e.g., [31]) with such an approach. Finally, our experiments (see Supplementary material) suggest that the ic pre-training task is superior to its masked variants and justify using the simple ic learning objective.
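As a rough illustration of the ic objective (a sketch, not the exact implementation, which is described in the supplementary material; the encode/decode interfaces and feature shapes are assumptions), one training step amounts to teacher-forced cross entropy over the ground-truth caption given the bag of image features:

import torch
import torch.nn.functional as F

def ic_training_step(model, image_features, caption_ids, pad_id):
    # image_features: [B, 33, D] bag of global + regional + tag vectors
    # caption_ids:    [B, T] ground-truth caption token ids (with BOS/EOS)
    memory = model.encode(image_features)               # [B, 33, H]
    logits = model.decode(caption_ids[:, :-1], memory)  # predict the next token
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        caption_ids[:, 1:].reshape(-1),
        ignore_index=pad_id,
    )
    return loss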
3.1.2 Downstream Tasks

Our downstream tasks are selected to measure progress toward solving image captioning in the wild. They also stand to benefit from visual grounding, especially since pre-training, by definition, is expected to cover a wider range of (long-tail) visual concepts than fine-tuning datasets.

nocaps [2] is a recent object-captioning-at-scale benchmark consisting of 4,500 validation and 10,600 test images with 10 hidden reference captions. Unlike in the standard image captioning setting, nocaps's distributions of images during training (COCO Captions) and evaluation (Open Images) are different: the Open Images dataset [45, 42] covers one order of magnitude more objects (600 classes) than COCO [52] (80 classes). This discrepancy defines the challenge: solutions must be able to learn to describe novel concepts from sources external to the COCO training set, such as text corpora, knowledge bases, or object detection datasets. In the Supplementary material, we provide details on the nocaps leaderboard.
In addition, besides CC3M and CC12M, we also explore using the Open Images Localized Narratives dataset (LocNar) [64] as an alternative "in-domain" (from a visual standpoint) pre-training data source.

Localized Narratives (LocNar) [64] is a collection of datasets with images that are paired with captions obtained by converting speech to text via ASR, followed by manual post-processing³. Inspired by the setting in nocaps, we use the COCO [52] portion (train split of around 130K images) for training/fine-tuning, and the Open Images [45] portion for evaluation (val split of around 40K images). Note that the LocNar captions are much longer than those of standard captioning datasets (41.8 words/caption), setting it apart from nocaps.
³ This dataset also contains mouse traces synchronized with the text, but we do not use the traces here.

Conceptual Captions 3M [70] is our main reference V+L pre-training data source. At the same time, the image captioning task on this dataset itself is a valuable benchmark for vision-to-language generation in the wild. Thus, we adopt it as a downstream task for CC12M. This means that, in the case of CC3M, the from-scratch and pre-training settings collapse.

Evaluation metrics. To measure the performance on image caption generation, we consider the standard metrics BLEU-1,4 [62], ROUGE-L [51], METEOR [10], CIDEr-D [79], and SPICE [4].

3.2. Vision-and-Language Matching

3.2.1 Pre-training Tasks

In visual-linguistic matching (vlm), the task takes as input both image and text features and predicts whether the input image and text are matched. To train the model's parameters, we use a contrastive softmax loss, for which the original image-text pairs are used as positive examples, while all other image-text pairs in the mini-batch are used as negative examples [55, 77].
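The following is a minimal sketch of such an in-batch contrastive softmax loss, assuming mean-pooled encoder outputs scored with a dot product (the actual architecture is described in Sect. 3.3); tensor shapes and interfaces are illustrative.

import torch
import torch.nn.functional as F

def vlm_contrastive_loss(image_enc, text_enc):
    # image_enc, text_enc: [B, L, H] last-layer Transformer encoder outputs
    img = image_enc.mean(dim=1)      # mean pooling over the visual bag
    txt = text_enc.mean(dim=1)       # mean pooling over text tokens
    scores = img @ txt.t()           # [B, B]; diagonal entries are matched pairs
    targets = torch.arange(scores.size(0), device=scores.device)
    # Softmax over the mini-batch: all other pairs act as negatives, in both directions.
    return 0.5 * (F.cross_entropy(scores, targets) +
                  F.cross_entropy(scores.t(), targets))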
3.2.2 Downstream Tasks

The task of caption-based image retrieval (IR) is to identify a relevant image from a pool, given a caption describing its content. The Flickr30K dataset [63] consists of 31,000 images from Flickr, each associated with five captions. Following existing work [47, 55], we use 1,000 images for validation, 1,000 images for testing, and the rest of the image-text pairs for model training.

We further consider zero-shot caption-based image retrieval [55] on the Flickr30K dataset. The term "zero-shot" refers to the setting in which we discard the training data and apply pre-trained models "as-is", i.e., without fine-tuning on the target task.

Finally, we further evaluate our retrieval system on the Localized Narratives dataset [64] (see Sect. 3.1.2). We use the LocNar Flickr30K portion (train split of 30,546 images and test split of 1,000 images) for training and evaluation.

Evaluation metrics. To measure the performance on image retrieval, we consider the standard metrics Recall@1 (R1), Recall@5 (R5), and Recall@10 (R10).
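Assuming one relevant image per query caption and a precomputed caption-by-image score matrix, these metrics can be computed as follows:

import numpy as np

def recall_at_k(scores, gt_index, ks=(1, 5, 10)):
    # scores:   [num_queries, num_images] similarity matrix
    # gt_index: [num_queries] index of the relevant image for each caption
    ranking = np.argsort(-scores, axis=1)  # best-scoring images first
    ranks = np.array([np.where(ranking[i] == gt_index[i])[0][0]
                      for i in range(len(gt_index))])
    return {f"R{k}": float(np.mean(ranks < k)) for k in ks}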
3.3. Implementation Details

Representing images and texts. We use Graph-RISE [37, 38] to featurize the entire image. We train a Faster-RCNN [68] on Visual Genome [43], with a ResNet101 [29] backbone trained on JFT [30] and fine-tuned on ImageNet [69]. We select the top-16 box proposals and featurize each of them with Graph-RISE, similar to [19]. Inspired by [50], we obtain up to 16 image tags from the Google Cloud Vision APIs, and treat them as text inputs to our model. These global, regional, and tag features end up being represented as a bag of 1+16+16 vectors, serving as bottom-up features [7] for our model.
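A sketch of how such a 1+16+16 bag could be assembled, with the feature extractors (Graph-RISE, the Faster-RCNN proposals, and the Cloud Vision tag embeddings) abstracted away as precomputed arrays; the zero-padding behavior and common feature dimensionality are assumptions.

import numpy as np

NUM_REGIONS = 16
NUM_TAGS = 16

def build_visual_bag(global_feat, region_feats, tag_feats):
    # global_feat:  [D]       whole-image Graph-RISE embedding
    # region_feats: [<=16, D] Graph-RISE embeddings of the top box proposals
    # tag_feats:    [<=16, D] embeddings of the Cloud Vision image tags
    def pad(x, n, d):
        out = np.zeros((n, d), dtype=np.float32)
        k = min(len(x), n)
        out[:k] = np.asarray(x)[:k]
        return out
    d = global_feat.shape[-1]
    return np.concatenate([
        global_feat[None, :],
        pad(region_feats, NUM_REGIONS, d),
        pad(tag_feats, NUM_TAGS, d),
    ], axis=0)  # [1 + 16 + 16, D] bottom-up features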
Model and Learning. For ic-based pre-training and downstream tasks, we follow state-of-the-art architectures that heavily rely on self-attention [78] or similar mechanisms [70, 85, 19, 33, 23]. We implement a Transformer-based encoder-decoder model, using [19] as a starting point. In addition, we encode each feature vector with a deeper embedding layer and apply layer normalization [9]. Following [55], we encode the corners and the area of bounding boxes and apply layer normalization when combining geometric and regional semantic features. These modifications lead to an improved CIDEr score of 100.9 on the CC3M dev benchmark (Table 7), vs. 93.7 as reported by [19]. We describe additional details in the supplementary material, including infrastructure description, runtime, model size, hyperparameter ranges and tuning methods, and the configuration of the best-performing model.

For the vlm-based pre-training and downstream tasks, we reuse the architecture above but discard the decoder. We use mean pooling to obtain a fixed-length vector for each modality, and compute the product of the transformed (last-layer Transformer encoder representation) image and the transformed text before applying softmax.

4. Experimental Results

4.1. Vision-to-Language Generation

Table 3 shows our results on nocaps. We report in Row 1 the performance of our baseline model without pre-training. Rows 2-3 show the performance of off-the-shelf captioning systems trained on CC3M and CC12M, respectively. This indicates the "raw" power (zero-shot setting) of the pre-trained network in generating captions out of the box.
Pretraining data | Train or fine-tune on COCO Cap.? | in-domain CIDEr/SPICE | near-domain CIDEr/SPICE | out-of-domain CIDEr/SPICE | overall BLEU1/BLEU4/METEOR/ROUGE/CIDEr/SPICE
None        | yes | 72.8/11.1 | 57.1/10.2 | 34.1/8.3  | 69.8/14.5/21.9/47.9/54.7/10.0
CC3M        | no  | 29.2/7.4  | 27.5/6.9  | 37.3/7.4  | 36.0/2.8/12.6/29.1/29.7/7.1
CC12M       | no  | 20.7/6.9  | 24.1/6.9  | 41.6/8.0  | 31.8/2.9/12.1/26.8/27.1/7.2
CC3M        | yes | 81.8/11.6 | 73.7/11.1 | 65.3/10.1 | 74.6/19.1/24.1/51.5/73.2/11.0
CC12M       | yes | 88.3/12.3 | 86.0/11.8 | 91.3/11.2 | 78.5/23.4/25.9/54.5/87.4/11.8
CC3M+CC12M  | yes | 92.6/12.5 | 88.3/12.1 | 94.5/11.9 | 79.2/24.4/26.1/55.1/90.2/12.1
Table 3: Automatic metric scores on the nocaps val set: performance of from-scratch (Row 1), pre-trained (Rows 2-3), and fine-tuned (Rows 4-6) models. CC12M outperforms CC3M by a large margin after fine-tuning (Row 4 vs. 5). Together, they achieve a new best, surpassing 90 CIDEr points on nocaps val. Bold indicates best-to-date, underline indicates second-best.
Figure 4: Qualitative results on nocaps. Each example comes with a caption predicted by the model that is trained on COCO Captions without pre-training (very top, right under the image), as well as captions predicted by models pre-trained on CC3M (middle) and CC12M (bottom), where the left/right column indicates whether the model is fine-tuned on COCO Captions.
We note that, without fine-tuning on COCO Captions, the model underperforms our baseline numbers on all metrics, which is indicative of the need for the model to learn the COCO captioning style, to which the existing automatic metrics are quite sensitive. In addition, we observe a slightly better performance by CC3M except for BLEU4 and SPICE. This illustrates the benefit of the data processing and the bias toward high-precision captions present in CC3M.

With a fine-tuned model, the benefit of transfer learning using pre-training on this task is clear (Row 1 vs. Rows 4, 5, 6), with CC12M outperforming CC3M by +14.2 CIDEr points and another +2.8 with CC3M+CC12M. Fig. 4 illustrates this effect; scaling up pre-training data benefits learning multimodal correspondences from a much larger pool of concepts, potentially making the model less susceptible to hallucinations (e.g., guessing "microphone" as it has not seen "bagpipes" in the training set), and also more informative (e.g., choosing "sumo wrestlers" over "men"/"people").

Table 4 compares our best model (ic pre-trained on CC3M+CC12M) to existing state-of-the-art results on nocaps, and shows that ours achieves state-of-the-art performance on CIDEr, outperforming a concurrent work [32] that uses a different pre-training approach directly on the Open Images dataset, which nocaps is based on. Importantly, we observe that the gain in the overall score can be largely attributed to the out-of-domain performance (3rd column). This result indicates that, although the annotation protocol for nocaps primes annotators to mention one or more of the displayed fine-grained ground-truth object classes (e.g., "red panda") present in the image [2], the large scale and natural fine-grainedness of CC12M succeed in correctly learning to generate captions containing such concepts, in spite of being textually out-of-domain.

Following [2], we also report results of our best model on the COCO Captions val2017 split, see Table 5, with 5K and 10K fine-tuning steps. We note that, since we do not rely on techniques such as constrained beam search (CBS) [5, 2] that constrain the model outputs, we do not suffer from the large performance trade-offs seen with the previous solutions (degradation of in-domain performance as out-of-domain performance increases; see each model vs. "reference"). Our result on out-of-domain data, as we vary the number of fine-tuning steps (last two rows), suggests that over–fine-tuning on COCO Captions may incur a cost in terms of poor generalization.

A second set of results is reported in Table 6. We observe that, even when the task requires the generation of much longer captions for LocNar, CC12M achieves superior performance (as measured by CIDEr) compared to CC3M as pretraining data. However, the gain is smaller compared to
nocaps val
Method | in-domain CIDEr/SPICE | near-domain CIDEr/SPICE | out-of-domain CIDEr/SPICE | overall CIDEr/SPICE
UpDown [2]                 | 78.1/11.6 | 57.7/10.3 | 31.3/8.3  | 55.3/10.1
UpDown + CBS [2]           | 80.0/12.0 | 73.6/11.3 | 66.4/9.7  | 73.1/11.1
UpDown + ELMo + CBS [2]    | 79.3/12.4 | 73.8/11.4 | 71.7/9.9  | 74.3/11.2
OscarL [50]                | 79.9/12.4 | 68.2/11.8 | 45.1/9.4  | 65.2/11.4
OscarL + CBS [50]          | 78.8/12.2 | 78.9/12.1 | 77.4/10.5 | 78.6/11.8
OscarL + SCST + CBS [50]   | 85.4/11.9 | 84.0/11.7 | 80.3/10.0 | 83.4/11.4
VIVO [32]                  | 88.8/12.9 | 83.2/12.6 | 71.1/10.6 | 81.5/12.2
VIVO + CBS [32]            | 90.4/13.0 | 84.9/12.5 | 83.0/10.7 | 85.3/12.2
VIVO + SCST + CBS [32]     | 92.2/12.9 | 87.8/12.6 | 87.5/11.5 | 88.3/12.4
pretrain ic on CC12M       | 88.3/12.3 | 86.0/11.8 | 91.3/11.2 | 87.4/11.8
pretrain ic on CC3M+CC12M  | 92.6/12.5 | 88.3/12.1 | 94.5/11.9 | 90.2/12.1
Human                      | 84.4/14.3 | 85.0/14.3 | 95.7/14.0 | 87.1/14.2

nocaps test
UpDown [2]                 | 74.3/11.5 | 56.9/10.3 | 30.1/8.1  | 54.3/10.1
UpDown + ELMo + CBS [2]    | 76.0/11.8 | 74.2/11.5 | 66.7/9.7  | 73.1/11.2
VIVO + SCST + CBS [32]     | 89.0/12.9 | 87.8/12.6 | 80.1/11.1 | 86.6/12.4
pretrain ic on CC12M       | 82.9/11.9 | 85.7/12.0 | 85.3/11.3 | 85.3/11.8
pretrain ic on CC3M+CC12M  | 87.2/12.3 | 87.4/12.1 | 87.2/11.4 | 87.3/12.0
Human                      | 80.6/15.0 | 84.6/14.7 | 91.6/14.2 | 85.3/14.7
Table 4: Comparison between our best model (in italics, pre-trained on CC12M with ic and fine-tuned on COCO Captions) and existing models, on the nocaps val (top) and test (bottom) splits. Bold indicates best-to-date, underline indicates second-best.
[Table 5: Method | COCO Captions val2017 CIDEr | nocaps val CIDEr.]
[Table 7: Method | CC3M dev CIDEr | CC3M test CIDEr.]
Flickr30K test
Pretraining data | Finetuning data | R1 | R5 | R10
None        | Flickr30K | 43.7 | 74.8 | 84.1
CC3M        | None      | 35.4 | 65.2 | 76.2
CC12M       | None      | 42.5 | 73.1 | 83.4
CC3M+CC12M  | None      | 47.1 | 76.4 | 83.4
CC3M        | Flickr30K | 52.3 | 81.7 | 88.4
CC12M       | Flickr30K | 58.5 | 86.6 | 92.1
CC3M+CC12M  | Flickr30K | 61.5 | 87.5 | 92.8

LocNar Flickr30K test
Pretraining data | Finetuning data | R1 | R5 | R10
None        | LocNar Flickr30K | 54.5 | 85.0 | 91.0
CC3M        | LocNar Flickr30K | 61.1 | 88.2 | 93.7
CC12M       | LocNar Flickr30K | 70.2 | 92.1 | 95.6
CC3M+CC12M  | LocNar Flickr30K | 71.0 | 93.0 | 97.0
Table 8: Image retrieval on Flickr30K and LocNar Flickr30K.

been extended to visual region inputs, while the next sentence prediction is analogous to vlm. Based directly upon BERT, V+L pre-training research has largely been focused on V+L understanding [55, 49, 21, 77, 3, 74, 48, 56], with classification or regression tasks that do not involve generation. One exception is UnifiedVL [88], which pre-trains a unified architecture for both image captioning (generation) and VQA (understanding). Our work focuses on simpler objectives and considers one at a time. This allows for a "clean" study of the effect of pre-training data sources. At the same time, we also pre-train for vision-to-language generation, i.e., the encoder and decoder jointly, as opposed to an encoder-only setup. Our work also shows that ic is a strong objective for vision-to-language generation with respect to the widely-used masking-based objectives. Consistent with our results, ic is successfully adopted for learning visual representations for lower-level vision tasks [24].
References

[1] Aishwarya Agrawal, Dhruv Batra, Devi Parikh, and Aniruddha Kembhavi. Don't just assume; look and answer: Overcoming priors for visual question answering. In CVPR, 2018.
[2] Harsh Agrawal, Karan Desai, Yufei Wang, Xinlei Chen, Rishabh Jain, Mark Johnson, Dhruv Batra, Devi Parikh, Stefan Lee, and Peter Anderson. nocaps: Novel object captioning at scale. In ICCV, 2019.
[3] Chris Alberti, Jeffrey Ling, Michael Collins, and David Reitter. Fusion of detected objects in text for visual question answering. In EMNLP-IJCNLP, 2019.
[4] Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. SPICE: Semantic propositional image caption evaluation. In ECCV, 2016.
[5] Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. Guided open vocabulary image captioning with constrained beam search. In EMNLP, 2017.
[6] Peter Anderson, Stephen Gould, and Mark Johnson. Partially-supervised image captioning. In NeurIPS, 2018.
[7] Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In CVPR, 2018.
[8] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. VQA: Visual question answering. In ICCV, 2015.
[9] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
[10] Satanjeev Banerjee and Alon Lavie. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In ACL Workshops, 2005.
[11] Tolga Bolukbasi, Kai-Wei Chang, James Zou, Venkatesh Saligrama, and Adam Kalai. Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. In NeurIPS, 2016.
[12] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020.
[13] Kaylee Burns, Lisa Anne Hendricks, Kate Saenko, Trevor Darrell, and Anna Rohrbach. Women also snowboard: Overcoming bias in captioning models. In ECCV, 2018.
[14] Remi Cadene, Corentin Dancette, Hedi Ben-younes, Matthieu Cord, and Devi Parikh. RUBi: Reducing unimodal biases in visual question answering. In NeurIPS, 2019.
[15] Ozan Caglayan, Pranava Madhyastha, Lucia Specia, and Loïc Barrault. Probing the need for visual context in multimodal machine translation. In NAACL, 2019.
[16] Jize Cao, Zhe Gan, Yu Cheng, Licheng Yu, Yen-Chun Chen, and Jingjing Liu. Behind the scene: Revealing the secrets of pre-trained vision-and-language models. In ECCV, 2020.
[17] Nicholas Carlini, Chang Liu, Úlfar Erlingsson, Jernej Kos, and Dawn Song. The secret sharer: Evaluating and testing unintended memorization in neural networks. In USENIX Security, 2019.
[18] Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Úlfar Erlingsson, Alina Oprea, and Colin Raffel. Extracting training data from large language models. arXiv preprint arXiv:2012.07805, 2020.
[19] Soravit Changpinyo, Bo Pang, Piyush Sharma, and Radu Soricut. Decoupled box proposal and featurization with ultrafine-grained semantic labels improve image captioning and visual question answering. In EMNLP-IJCNLP, 2019.
[20] Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO Captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015.
[21] Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. UNITER: Learning UNiversal Image-TExt Representations. In ECCV, 2020.
[22] Christopher Clark, Mark Yatskar, and Luke Zettlemoyer. Don't take the easy way out: Ensemble based methods for avoiding known dataset biases. In EMNLP-IJCNLP, 2019.
[23] Marcella Cornia, Matteo Stefanini, Lorenzo Baraldi, and Rita Cucchiara. Meshed-Memory Transformer for image captioning. In CVPR, 2020.
[24] Karan Desai and Justin Johnson. VirTex: Learning visual representations from textual annotations. In CVPR, 2021.
[25] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL, 2019.
[26] Sebastian Goodman, Zhenzhong Lan, and Radu Soricut. Multi-stage pretraining for abstractive summarization. arXiv preprint arXiv:1909.10599, 2019.
[27] Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In CVPR, 2017.
[28] Danna Gurari, Yinan Zhao, Meng Zhang, and Nilavra Bhattacharya. Captioning images taken by people who are blind. In ECCV, 2020.
[29] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
[30] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. In NeurIPS Workshop, 2015.
[31] Ari Holtzman, Jan Buys, Maxwell Forbes, and Yejin Choi. The curious case of neural text degeneration. In ICLR, 2020.
[32] Xiaowei Hu, Xi Yin, Kevin Lin, Lijuan Wang, Lei Zhang, Jianfeng Gao, and Zicheng Liu. VIVO: Surpassing human performance in novel object captioning with visual vocabulary pre-training. arXiv preprint arXiv:2009.13682, 2020.
[33] Lun Huang, Wenmin Wang, Jie Chen, and Xiao-Yong Wei. Attention on attention for image captioning. In CVPR, 2019.
[34] Drew A. Hudson and Christopher D. Manning. GQA: A new dataset for compositional question answering over real-world images. arXiv preprint arXiv:1902.09506, 2019.
[35] Yuqi Huo, Manli Zhang, Guangzhen Liu, Haoyu Lu, Yizhao Gao, Guoxing Yang, Jingyuan Wen, Heng Zhang, Baogui Xu, Weihao Zheng, et al. WenLan: Bridging vision and language by large-scale multi-modal pre-training. arXiv preprint arXiv:2103.06561, 2021.
[36] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V. Le, Yunhsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. arXiv preprint arXiv:2102.05918, 2021.
[37] Da-Cheng Juan, Chun-Ta Lu, Zhen Li, Futang Peng, Aleksei Timofeev, Yi-Ting Chen, Yaxi Gao, Tom Duerig, Andrew Tomkins, and Sujith Ravi. Graph-RISE: Graph-regularized image semantic embedding. arXiv preprint arXiv:1902.10814, 2019.
[38] Da-Cheng Juan, Chun-Ta Lu, Zhen Li, Futang Peng, Aleksei Timofeev, Yi-Ting Chen, Yaxi Gao, Tom Duerig, Andrew Tomkins, and Sujith Ravi. Ultra fine-grained image semantic embedding. In WSDM, 2020.
[39] Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. ReferItGame: Referring to objects in photographs of natural scenes. In EMNLP, 2014.
[40] Mert Kilickaya, Aykut Erdem, Nazli Ikizler-Cinbis, and Erkut Erdem. Re-evaluating automatic metrics for image captioning. In EACL, 2017.
[41] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
[42] Ivan Krasin, Tom Duerig, Neil Alldrin, Vittorio Ferrari, Sami Abu-El-Haija, Alina Kuznetsova, Hassan Rom, Jasper Uijlings, Stefan Popov, Shahab Kamali, Matteo Malloci, Jordi Pont-Tuset, Andreas Veit, Serge Belongie, Victor Gomes, Abhinav Gupta, Chen Sun, Gal Chechik, David Cai, Zheyun Feng, Dhyanesh Narayanan, and Kevin Murphy. OpenImages: A public dataset for large-scale multi-label and multi-class image classification. Dataset available from https://g.co/dataset/openimages, 2017.
[43] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, Michael Bernstein, and Li Fei-Fei. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. IJCV, 123(1):32–73, 2017.
[44] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, Michael S. Bernstein, and Li Fei-Fei. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123(1):32–73, 2017.
[45] Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper R. R. Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Tom Duerig, and Vittorio Ferrari. The Open Images Dataset V4: Unified image classification, object detection, and visual relationship detection at scale. CoRR, abs/1811.00982, 2018.
[46] Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. ALBERT: A lite BERT for self-supervised learning of language representations. In ICLR, 2020.
[47] Kuang-Huei Lee, Xi Chen, Gang Hua, Houdong Hu, and Xiaodong He. Stacked cross attention for image-text matching. In ECCV, 2018.
[48] Gen Li, Nan Duan, Yuejian Fang, Daxin Jiang, and Ming Zhou. Unicoder-VL: A universal encoder for vision and language by cross-modal pre-training. In AAAI, 2020.
[49] Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. VisualBERT: A simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557, 2019.
[50] Xiujun Li, Xi Yin, Chunyuan Li, Xiaowei Hu, Pengchuan Zhang, Lei Zhang, Lijuan Wang, Houdong Hu, Li Dong, Furu Wei, Yejin Choi, and Jianfeng Gao. Oscar: Object-semantics aligned pre-training for vision-language tasks. In ECCV, 2020.
[51] Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, 2004.
[52] Tsung-Yi Lin, Michael Maire, Serge J. Belongie, Lubomir D. Bourdev, Ross B. Girshick, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
[53] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019.
[54] Ziwei Liu, Zhongqi Miao, Xiaohang Zhan, Jiayun Wang, Boqing Gong, and Stella X. Yu. Large-scale long-tailed recognition in an open world. In CVPR, 2019.
[55] Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In NeurIPS, 2019.
[56] Jiasen Lu, Vedanuj Goswami, Marcus Rohrbach, Devi Parikh, and Stefan Lee. 12-in-1: Multi-task vision and language representation learning. In CVPR, 2020.
[57] Jiasen Lu, Jianwei Yang, Dhruv Batra, and Devi Parikh. Neural baby talk. In CVPR, 2018.
[58] Dhruv Mahajan, Ross B. Girshick, Vignesh Ramanathan, Kaiming He, Manohar Paluri, Yixuan Li, Ashwin Bharambe, and Laurens van der Maaten. Exploring the limits of weakly supervised pretraining. In ECCV, 2018.
[59] Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Camburu, Alan L. Yuille, and Kevin Murphy. Generation and comprehension of unambiguous object descriptions. In CVPR, 2016.
[60] Milad Nasr, Reza Shokri, and Amir Houmansadr. Comprehensive privacy analysis of deep learning: Passive and active white-box inference attacks against centralized and federated learning. In IEEE SP, 2019.
[61] Vicente Ordonez, Girish Kulkarni, and Tamara L. Berg. Im2Text: Describing images using 1 million captioned photographs. In NIPS, 2011.
[62] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: A method for automatic evaluation of machine translation. In ACL, 2002.
[63] Bryan A. Plummer, Liwei Wang, Chris M. Cervantes, Juan C. Caicedo, Julia Hockenmaier, and Svetlana Lazebnik. Flickr30K Entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In ICCV, 2015.
[64] Jordi Pont-Tuset, Jasper Uijlings, Soravit Changpinyo, Radu Soricut, and Vittorio Ferrari. Connecting vision and language with localized narratives. In ECCV, 2020.
[65] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020, 2021.
[66] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. JMLR, 21(140):1–67, 2020.
[67] Sainandan Ramakrishnan, Aishwarya Agrawal, and Stefan Lee. Overcoming language priors in visual question answering with adversarial regularization. In NeurIPS, 2018.
[68] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.
[69] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet large scale visual recognition challenge. IJCV, 115(3):211–252, 2015.
[70] Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual Captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In ACL, 2018.
[71] Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. MASS: Masked sequence to sequence pre-training for language generation. In ICML, 2019.
[72] Krishna Srinivasan, Karthik Raman, Jiecao Chen, Michael Bendersky, and Marc Najork. WIT: Wikipedia-based image text dataset for multimodal multilingual machine learning. arXiv preprint arXiv:2103.01913, 2021.
[73] Pierre Stock and Moustapha Cisse. ConvNets and ImageNet beyond accuracy: Understanding mistakes and uncovering biases. In ECCV, 2018.
[74] Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, and Jifeng Dai. VL-BERT: Pre-training of generic visual-linguistic representations. In ICLR, 2020.
[75] Alane Suhr, Stephanie Zhou, Ally Zhang, Iris Zhang, Huajun Bai, and Yoav Artzi. A corpus for reasoning about natural language grounded in photographs. In ACL, 2018.
[76] Chen Sun, Abhinav Shrivastava, Saurabh Singh, and Abhinav Gupta. Revisiting unreasonable effectiveness of data in deep learning era. In ICCV, 2017.
[77] Hao Tan and Mohit Bansal. LXMERT: Learning cross-modality encoder representations from transformers. In EMNLP-IJCNLP, 2019.
[78] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, 2017.
[79] Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. CIDEr: Consensus-based image description evaluation. In CVPR, 2015.
[80] Tianlu Wang, Jieyu Zhao, Mark Yatskar, Kai-Wei Chang, and Vicente Ordonez. Balanced datasets are not enough: Estimating and mitigating gender bias in deep image representations. In ICCV, 2019.
[81] Yu Wu, Linchao Zhu, Lu Jiang, and Yi Yang. Decoupled novel object captioner. In MM, 2018.
[82] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. XLNet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237, 2019.
[83] Ting Yao, Yingwei Pan, Yehao Li, and Tao Mei. Incorporating copying mechanism in image captioning for learning novel objects. In CVPR, 2017.
[84] Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. TACL, 2:67–78, 2014.
[85] Jun Yu, Jing Li, Zhou Yu, and Qingming Huang. Multimodal transformer with multi-view visual representation for image captioning. arXiv preprint arXiv:1905.07841, 2019.
[86] Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter J. Liu. PEGASUS: Pre-training with extracted gap-sentences for abstractive summarization. In ICML, 2020.
[87] Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Ordonez, and Kai-Wei Chang. Men also like shopping: Reducing gender bias amplification using corpus-level constraints. In EMNLP, 2017.
[88] Luowei Zhou, Hamid Palangi, Lei Zhang, Houdong Hu, Jason J. Corso, and Jianfeng Gao. Unified vision-language pre-training for image captioning and VQA. In AAAI, 2020.
[89] Xiangxin Zhu, Dragomir Anguelov, and Deva Ramanan. Capturing long-tail distributions of object subcategories. In CVPR, 2014.
[90] Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In ICCV, 2015.

A. Broader Impact

Our publicly-available V+L pre-training resource CC12M has the potential to positively impact multiple vision-and-language tasks. One main aspect that we have identified is a much higher degree of coverage of long-tail visual concepts than previous resources, including CC3M. As a result, we expect the models (pre-)trained on our data to be more robust in the wild than before.

In addition, our work could benefit the design of new setups for the downstream tasks that shift away from in-domain (e.g., COCO/Visual Genome) to out-of-domain/in-the-wild (e.g., OID), similar to nocaps, which our work focuses heavily on. The setups could also avoid the use of in-domain data during pre-training, which in some cases results in transfer learning between (almost) identical sets of images, e.g., COCO, Visual Genome (VG), VQA2, VG QA, Visual7W, GQA, GuessWhat, and RefCOCO*.

At the same time, datasets curated from the Web could come with risks such as unsuitable content (adult content, profanity) and unintended privacy leakage [60, 17, 18]. We take the steps in Sect. 2.2 of the main text to mitigate both of these risks by applying the necessary image and text filtering steps and replacing each person name (celebrities' included) with the special <PERSON> token.

Less specific to the Web data are the unwanted dataset biases [13, 73, 80] that are prone to amplification by machine learning models [11, 87]. Our preliminary analysis in Sect. 2.3 of the main text sheds light on the degree to which our data exhibits some aspects of these inherent biases, and we suspect that the better coverage of the tail in fact makes this issue less severe. Nevertheless, the users of this data and of the systems trained on it should be aware of such risks and other ones that might arise.

B. Additional analyses of CC12M

B.1. Out-of-domain (OOD) visual concepts on an expanded list of datasets

We use the 394 nocaps out-of-domain classes as a proxy for OOD visual concepts and analyze popular vision-and-language datasets, in addition to CC3M and CC12M, on which we focus in the main text. These datasets span a wide range of use cases, both in terms of tasks (image-to-text generation, image-and-text matching, visual question answering (VQA), referring expression comprehension, and multimodal verification), and in terms of the stage during which they are used (pre-training, fine-tuning/evaluation, or both).

• CC3M [70] An instance of text is the caption associated with each image url of the training split. It has been used for V+L pre-training and is currently the most popular V+L pre-training dataset [55, 3, 21, 74, 88, 56, 50].
• CC12M (ours) An instance of text is the caption associated with each image url.
• COCO Captions [20] An instance of text comes from the caption associated with each image of the 2017 training split (five captions per image). This dataset is designed for the task of image captioning, and has been used for caption-based image retrieval as well. It has been used for V+L pre-training [77, 49, 21, 50].
• Visual Genome [43] An instance of text comes from the caption of each region in images of the training split. This dataset aims to connect vision and language through scene graphs and is used for multiple tasks that include, but are not limited to, dense image captioning, visual relationship detection and scene graph parsing, image retrieval and generation, and visual question answering. It has been used for V+L pre-training [77, 21].
• SBU Captions [61] An instance of text is the caption associated with each image url of the "preferred" version of the dataset. This dataset is designed for the task of image captioning. It has been used for V+L pre-training [77, 21, 48, 50].
• VQA2 [27] An instance of text is the question and the answers in each image-question-answers triplet of the train2014 + val2train2014 splits. This dataset is designed for the task of visual question answering (VQA) [8]. It has been used for V+L pre-training [77, 50].
• RefCOCOg [59] An instance of text is the referring expression in each region in images of the training split. This dataset is designed for the task of referring expression comprehension [39].
• NLVR2 [75] An instance of text comes from the caption associated with each pair of images of the training split. This dataset is used for the task called multimodal verification in [56], but is designed for the general task of visual reasoning.

Table 9 summarizes the number of instances whose texts contain OOD visual concepts for all selected datasets. We use both the absolute frequency and the normalized one (per 1M text instances). Essentially, these numbers indicate the degree of OOD coverage. We find that CC12M has many more OOD instances than all other datasets by a large margin (6.7x median and 5.8x mean vs. the second best, CC3M). Moreover, CC12M still prevails even after normalization to account for its size. In other words, CC12M covers these OOD classes better in both absolute and relative senses.
Dataset        | Freq (median) | Freq (mean) | Freq per 1M (median) | Freq per 1M (mean)
CC3M           | 462           | 2325.7      | 139.2                | 700.8
CC12M          | 3110          | 13455.8     | 250.3                | 1083.1
COCO Captions  | 37            | 248.6       | 62.3                 | 417.1
Visual Genome  | 133           | 1114.47     | 40.7                 | 341.4
SBU Captions   | 121           | 798.6       | 121.0                | 798.6
VQA2           | 37            | 242.0       | 63.8                 | 417.2
RefCOCOg       | 1             | 21.2        | 8.8                  | 186.4
NLVR2          | 4             | 79.9        | 11.6                 | 245.5
Table 9: Statistics of the (normalized) frequency of nocaps' out-of-domain visual concepts in the texts of popular vision-and-language datasets.

Fig. 5 provides a more complete picture of the normalized frequency of OOD classes in these datasets, at different thresholds. It shows the number of OOD classes (y-axis) with at least K occurrences per 1M captions (x-axis). Evidently, other datasets experience sharper drops as K increases than CC12M (black solid curve). We also find that captioning datasets (solid curves) generally provide better coverage than non-captioning datasets: VQA2, RefCOCOg, and NLVR2 (dashed curves).

B.2. The impact of the dataset size

We experiment with pre-training on randomly subsampled CC12M, 25% (3.1M) and 50% (6.2M), and evaluate the pre-trained models on novel object captioning on nocaps and zero-shot IR on Flickr30K. Fig. 6 shows a "the larger, the better" trend, with 25% of CC12M giving rise to performance similar to CC3M.

C. Qualitative Results for Image Retrieval

Fig. 7 provides qualitative image retrieval results on the Flickr30K dataset: the top-3 images retrieved by the from-scratch model trained on Flickr30K, as well as by two models pre-trained on CC3M and CC12M and then fine-tuned on Flickr30K. We report three cases in which CC12M pre-training helps correct the rankings from the other two models, which we suspect is due to the model being more familiar with the rare words, highlighted in blue.

D.2.1 Masked Vision-to-Language Generation

Given the training image-text pairs, the ic objective predicts the text from the image. The following objectives predict (all or part of) the text from the image and (all or part of) the text. In order to encode both the image and the text, we concatenate the sequence of image feature vectors and the sequence of text token feature vectors, and use the Transformer encoder to encode them [49, 21, 74].
Figure 7: Qualitative results for the image retrieval task on Flickr30K given the query text (very top) when the model is not pre-trained (top), pre-trained on CC3M (middle), and pre-trained on CC12M (bottom). Query texts: "a blond child holding a sword and dressed in a black robe is standing next to a stroller"; "a man sitting on a chair with a beer in his hands roasting something to eat on a wooden stick"; "a man and a woman hold the front arm of a large tiger that is laying on the ground among various medical devices".
Pre-training data  | in-domain CIDEr/SPICE | near-domain CIDEr/SPICE | out-of-domain CIDEr/SPICE | overall BLEU1/BLEU4/METEOR/ROUGE/CIDEr/SPICE
LocNar Open Images | 76.0/11.6 | 65.9/10.9 | 48.9/9.3  | 73.3/17.4/23.5/50.7/63.9/10.7
CC3M               | 81.8/11.6 | 73.7/11.1 | 65.3/10.1 | 74.6/19.1/24.1/51.5/73.2/11.0
CC12M              | 88.3/12.3 | 86.0/11.8 | 91.3/11.2 | 78.5/23.4/25.9/54.5/87.4/11.8
Table 10: Comparison between pre-training data on the nocaps val set. LocNar Open Images's images are from the same visual domain as nocaps. All approaches use the ic pre-training objective.
Pre-training objective | in-domain CIDEr/SPICE | near-domain CIDEr/SPICE | out-of-domain CIDEr/SPICE | overall BLEU1/BLEU4/METEOR/ROUGE/CIDEr/SPICE
ic        | 88.3/12.3 | 86.0/11.8 | 91.3/11.2 | 78.5/23.4/25.9/54.5/87.4/11.8
mlm[.1]   | 76.4/11.5 | 68.4/10.8 | 57.6/9.6  | 73.0/18.1/23.5/50.6/67.4/10.6
mlm[.2]   | 79.8/11.3 | 76.3/10.9 | 76.2/10.2 | 76.2/20.5/24.1/52.4/76.8/10.8
mlm[.4]   | 86.5/12.3 | 82.7/11.5 | 86.3/11.3 | 78.0/22.7/25.2/53.7/84.0/11.6
mlm[.8]   | 89.3/12.5 | 87.5/11.9 | 91.1/11.3 | 78.7/23.8/25.9/54.4/88.5/11.9
mass[.1]  | 86.0/12.1 | 74.8/11.1 | 71.7/10.1 | 75.8/20.5/24.6/52.5/75.8/11.0
mass[.2]  | 84.9/12.0 | 78.1/11.2 | 78.6/10.5 | 76.0/20.8/24.7/52.7/79.2/11.2
mass[.4]  | 85.7/11.7 | 83.7/11.5 | 88.5/10.9 | 77.3/22.8/25.1/53.6/85.0/11.4
mass[.8]  | 88.8/12.2 | 85.1/11.7 | 87.8/10.6 | 78.1/23.7/25.5/54.2/86.2/11.5
Table 11: Comparison between the ic pre-training and masked V+L pre-training. We consider two masking schemes (mlm and mass) and four masking rates (.1, .2, .4, .8) and report their effects on the nocaps val set.
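To make the masking rates in Table 11 concrete, the sketch below corrupts a caption at rate p in the mlm style (independently chosen positions); the mass-style variant would instead mask one contiguous span covering the same fraction of tokens. Token ids and the mask id are illustrative, not the actual vocabulary.

import random

def mask_tokens(token_ids, mask_id, rate):
    """Replace a fraction `rate` of caption tokens with the mask token (mlm-style)."""
    n = max(1, int(round(rate * len(token_ids))))
    positions = random.sample(range(len(token_ids)), n)
    corrupted = list(token_ids)
    targets = {}
    for pos in positions:
        targets[pos] = corrupted[pos]  # original tokens to be predicted
        corrupted[pos] = mask_id
    return corrupted, targets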
Pre-training objectives | in-domain CIDEr/SPICE | near-domain CIDEr/SPICE | out-of-domain CIDEr/SPICE | overall BLEU1/BLEU4/METEOR/ROUGE/CIDEr/SPICE
ic       | 88.3/12.3 | 86.0/11.8 | 91.3/11.2 | 78.5/23.4/25.9/54.5/87.4/11.8
ic+vlm   | 88.6/12.3 | 85.8/11.9 | 90.0/11.4 | 78.0/23.1/25.7/54.4/87.1/11.9
ic+moc   | 91.1/12.4 | 88.4/12.1 | 93.6/11.4 | 78.8/24.6/26.2/55.2/89.9/12.0
Table 12: Effect of visual linguistic matching (vlm) and masked object classification (moc) when combined with the ic objective on the nocaps val set.
vectors and up to 16 tag feature vectors, each of size 512. For the vlm objective, where text has to be encoded, we also have a sequence of text (sub)token feature vectors of size 512.

E.2. Model

The ic-based task uses a Transformer encoder-decoder model. The vlm-based task uses two Transformer encoders, one for texts and the other for images.
• Transformer image encoder: number of layers L = 6.
• Transformer image encoder: vocab embedding size E = 512.
• Transformer image encoder: hidden embedding size H = 1024.
• Transformer image encoder: feedforward/filter size F = H x 4 = 4096, following [25].
• Transformer image encoder: number of attention heads A = H / 64 = 8, following [25].
• Transformer text encoder (for vlm only): L, E, H, F, A are the same as for the Transformer image encoder.
• Transformer decoder: L, E, H, F, A are the same as for the Transformer image encoder.
• Transformer decoder: beam search width = 5.
• Transformer decoder: beam search alpha = 0.6.
• Transformer decoder: maximum output length = 36 for all datasets, except for LocNar, for which it is set to 180.
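For convenience, the hyperparameters above can be collected into a single configuration object (a sketch; the names are ours, not those of the actual implementation):

from dataclasses import dataclass

@dataclass
class TransformerConfig:
    num_layers: int = 6          # L
    embedding_size: int = 512    # E
    hidden_size: int = 1024      # H
    filter_size: int = 4096      # F = H x 4
    num_heads: int = 8           # A = H / 64
    beam_width: int = 5          # decoder only
    beam_alpha: float = 0.6      # decoder only
    max_output_length: int = 36  # 180 for LocNar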
E.3. Training

• Infrastructure: Google Cloud 32-core TPUs.
• Batch size per core: 128 (for a total of 4096).
• Optimizer: Adam [41] with default hyperparameters (except for the initial learning rate; see below).
• Learning rate — Initial: see Hyperparameter search below.
• Learning rate — Warm-up epochs: 20 for all pre-training and fine-tuning experiments.
• Learning rate — Decay rate: 0.95 for all pre-training and fine-tuning experiments.
• Learning rate — Decay epochs: 25 for all pre-training and fine-tuning experiments.
• Data augmentation: the set of input visual regions is permuted during training.
• Maximum number of steps: 2M for vision-to-language generation pre-training on both CC12M and CC3M (and CC3M+CC12M). For vision-and-language matching, 1M for CC3M instead. See Hyperparameter search below for fine-tuning experiments.
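A sketch of the learning-rate schedule implied by the bullets above (warm-up for 20 epochs, then decay by 0.95 every 25 epochs); the linear shape of the warm-up is an assumption.

def learning_rate(initial_lr, epoch, warmup_epochs=20, decay_rate=0.95, decay_epochs=25):
    if epoch < warmup_epochs:
        return initial_lr * (epoch + 1) / warmup_epochs  # assumed linear warm-up
    return initial_lr * decay_rate ** ((epoch - warmup_epochs) / decay_epochs)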
E.4. Evaluation

For nocaps evaluation, we submit inference results to the leaderboard at https://evalai.cloudcv.org/web/challenges/challenge-page/464/overview. Code for all evaluation metrics can be found at https://github.com/nocaps-org/updown-baseline/blob/master/updown/utils/evalai.py. For in-depth discussions of these metrics, see [40].

Participating in the default formulation of the nocaps challenge requires that one (i) does not use the val and test Open Images ground-truth object detection annotations, and (ii) does not use image-caption data collected via additional annotation protocols. We satisfy both requirements: we train our object detector on Visual Genome, and both CC3M and CC12M are automatically harvested from the web (alt-text) and belong to the category of noisy web data, therefore satisfying the second requirement. On the other hand, models that leverage the Open Images Localized Narratives dataset (LocNar) [64] for pre-training belong to the nocaps (XD) leaderboard rather than the default one.

Some of our results on the CC3M benchmark are taken from the leaderboard, which is located at https://ai.google.com/research/ConceptualCaptions/leaderboard?active_tab=leaderboard.

E.5. Hyperparameter search

For pre-training experiments, we do not conduct hyperparameter tuning besides an initial stage of exploration, as we believe small changes would not considerably affect the downstream performance. For instance, we fix the initial learning rate to 0.000032 and observe that it works consistently well (on the validation set) across scenarios.

For fine-tuning experiments, we focus on tuning one hyperparameter: the initial learning rate. In the case of nocaps, we also lightly tune the maximum number of training steps, as we observe the model overfitting on COCO Captions. In all cases, we make sure to allocate similar resources to any two settings that we compare, such as the pre-training data sources CC3M and CC12M.

For generation, the range for the initial learning rate is {3.2e-9, 3.2e-8, 3.2e-7} and the range for the maximum number of training steps is {5K, 10K}. For matching, the range for the initial learning rate is {3.2e-8, 3.2e-7, 3.2e-6}, while the maximum number of training steps is fixed to 10K.