
2016 IEEE Conference on Computer Vision and Pattern Recognition

TGIF: A New Dataset and Benchmark on Animated GIF Description

Yuncheng Li (University of Rochester), Yale Song (Yahoo Research), Liangliang Cao (Yahoo Research), Joel Tetreault (Yahoo Research),
Larry Goldberg (Yahoo Research), Alejandro Jaimes (AiCure), Jiebo Luo (University of Rochester)

Abstract

With the recent popularity of animated GIFs on social media, there is a need for ways to index them with rich metadata. To advance research on animated GIF understanding, we collected a new dataset, Tumblr GIF (TGIF), with 100K animated GIFs from Tumblr and 120K natural language descriptions obtained via crowdsourcing. The motivation for this work is to develop a testbed for image sequence description systems, where the task is to generate natural language descriptions for animated GIFs or video clips. To ensure a high quality dataset, we developed a series of novel quality controls to validate free-form text input from crowdworkers. We show that there is unambiguous association between visual content and natural language descriptions in our dataset, making it an ideal benchmark for the visual content captioning task. We perform extensive statistical analyses to compare our dataset to existing image and video description datasets. Next, we provide baseline results on the animated GIF description task, using three representative techniques: nearest neighbor, statistical machine translation, and recurrent neural networks. Finally, we show that models fine-tuned from our animated GIF description dataset can be helpful for automatic movie description.

Figure 1: Our TGIF dataset contains 100K animated GIFs and 120K natural language descriptions. (a) Online users create GIFs that convey short and cohesive visual stories, providing us with well-segmented video data. (b) We crawl and filter high quality animated GIFs, and (c) crowdsource natural language descriptions ensuring strong visual/textual association.

1. Introduction

Animated GIFs have quickly risen in popularity over the last few years as they add color to online and mobile communication. Different from other forms of media, GIFs are unique in that they are spontaneous (very short in duration), have a visual storytelling nature (no audio involved), and are primarily generated and shared by online users [3]. Despite their rising popularity and unique visual characteristics, there is a surprising dearth of scholarly work on animated GIFs in the computer vision community.

In an attempt to better understand and organize the growing number of animated GIFs on social media, we constructed an animated GIF description dataset which consists of user-generated animated GIFs and crowdsourced natural language descriptions. There are two major challenges to this work: (1) We need a large scale dataset that captures a wide variety of interests from online users who produce animated GIFs; (2) We need automatic validation methods that ensure high quality data collection at scale, in order to deal with noisy user-generated content and annotations. While it is difficult to address these two challenges at once, there has been great progress in recent years in collecting large scale datasets in computer vision [33, 20, 34, 31]. Our work contributes to this line of research by collecting a new large scale dataset for animated GIF description, and by presenting automatic validation methods that ensure high quality visual content and crowdsourced annotations.

Our dataset, Tumblr GIF (TGIF), contains 100K animated GIFs collected from Tumblr, and 120K natural language sentences annotated via crowdsourcing. We developed extensive quality control and automatic validation methods for collecting our dataset, ensuring strong and unambiguous association between GIF and sentence. In addition, we carefully evaluate popular approaches for video description and report several findings that suggest future research directions. It is our goal that our dataset and baseline results will serve as useful resources for future video description and animated GIF research.¹

¹ We use video description, image sequence description, and animated GIF description interchangeably, as they all contain sequences of images.

Our work is in part motivated by the recent work on image and video description [40, 9, 14, 21, 38].
Describing animated GIFs, or image sequences in general, is different from the image captioning task (e.g., MS-COCO [20]) because of the motion information involved between frames. Recent movie description datasets, such as M-VAD [34] and MPII-MD [31], made the first attempt towards this direction by leveraging professionally annotated descriptive video service (DVS) captions from commercial movies. However, as we show later in this paper, such datasets contain certain characteristics not ideal for image sequence description (i.e., poorly segmented video clips, and descriptions with contextual information not available within a provided clip).

We make the following contributions in this paper:
1. We collect a dataset for animated GIF description. We solved many challenges involved in data collection, including GIF filtering, language validation and quality control.
2. We compare our dataset to other image and video description datasets, and find that animated GIFs are temporally well segmented and contain cohesive visual stories.
3. We provide baseline results on our dataset using several existing video description techniques. Moreover, we show that models trained on our dataset can improve performance on the task of automatic movie description.
4. We make our code and dataset publicly available at https://github.com/raingo/TGIF-Release

2. Related Work

There is growing interest in automatic image and video description [20, 34, 31, 44]. We review existing datasets and some of the most successful techniques in this domain.

Datasets. For image captioning, the SBU dataset [25] contains over 1 million captioned images crawled from the web, while the MS-COCO dataset [20] contains 120K images and descriptions annotated via crowdsourcing. The VQA [1] and Visual Madlibs [44] datasets are released for image captioning and visual question answering.

In the video domain, the YouTube2Text dataset [5, 11] contains 2K video clips and 120K sentences. Although originally introduced for the paraphrasing task [5], this dataset is also suitable for video description [11]. The TACoS dataset [30] contains 9K cooking video clips and 12K descriptions, while the YouCook dataset [8] contains 80 cooking videos and 748 descriptions. More recently, the M-VAD [34] and MPII-MD [31] datasets use the descriptive video service (DVS) from commercial movies, which was originally developed to help people with visual impairment understand non-narrative movie scenes. Since the two datasets have similar characteristics, the Large Scale Movie Description Challenge (LSMDC) makes use of both datasets [34, 31]. Our work contributes to the video domain with 1) animated GIFs, which are well-segmented video clips with cohesive stories, and 2) natural language descriptions with strong visual/textual associations.

Techniques. Image and video description has been tackled using established algorithms [42, 10, 25, 17, 32]. Ordonez et al. [25] generate an image caption by finding k nearest neighbor images from 1 million captioned images and summarizing the retrieved captions into one sentence. Rohrbach et al. [32] formulate video description as a translation problem and propose a method that combines semantic role labeling and statistical machine translation.

Recent advances in recurrent neural networks have led to end-to-end image and video description techniques [40, 9, 14, 21, 39, 38, 43, 26, 41]. Venugopalan et al. [39] represent video by mean-pooling image features from frames, while Yao et al. [43] apply a soft-attention mechanism to represent each frame of a video, which is then passed to an LSTM decoder [12] to generate a natural language description. More recently, Venugopalan et al. [38] use an LSTM to encode image sequence dynamics, formulating the problem as sequence-to-sequence prediction. In this paper, we evaluate three representative techniques (nearest neighbor [25], statistical machine translation [32], and LSTMs [38]) and provide benchmark results on our dataset.

2.1. Comparison with LSMDC

In essence, the movie description and animated GIF description tasks both involve translating an image sequence to natural language, so the LSMDC dataset may seem similar to the dataset proposed in this paper. However, there are two major differences. First, our set of animated GIFs was created by online users, while the LSMDC was generated from commercial movies. Second, our natural language descriptions were crowdsourced, whereas the LSMDC descriptions were carried out by descriptive video services (DVS). This led to the following differences between the two datasets²:

Language complexity. Movie descriptions are made by trained professionals, with an emphasis on describing key visual elements. To better serve the target audience of people with visual impairment, the annotators use expressive phrases. However, this level of complexity in language makes the task very challenging. In our dataset, our workers are encouraged to describe major visual content directly, and not to use overly descriptive language. As an example to illustrate the language complexity difference, the LSMDC dataset describes a video clip as "amazed someone starts to play the rondo again.", while for the same clip a crowd worker wrote "a man plays piano as a woman stands and two dogs play."

² Side-by-side comparison examples: https://goo.gl/ZGYIYh

Visual/textual association. Movie descriptions often contain contextual information not available within a single movie clip; they sometimes require having access to other parts of a movie that provide contextual information.
Our descriptions do not have such an issue because each animated GIF is presented to workers without any surrounding context. Our analysis confirmed this, showing that 20.7% of sentences in LSMDC contain at least two pronouns, while in our TGIF dataset this number is 7%.

Scene segmentation. In the LSMDC dataset, video clips are segmented by means of speech alignment, aligning speech recognition results to movie transcripts [31]. This process is error-prone, and the errors are particularly harmful to image sequence modeling because a few irrelevant frames either at the beginning or the end of a sequence can significantly alter the sequence representation. In contrast, our GIFs are by nature well segmented because they are carefully curated by online users to create high quality visual content. Our user study confirmed this; we observe that 15% of the LSMDC movie clips vs. 5% of the animated GIFs are rated as not well segmented.

3. Animated GIF Description Dataset


3.1. Data Collection
We extract a year's worth of GIF posts from Tumblr using the public API³, and clean up the data with four filters: (1) Cartoon. We filter out cartoon content by matching popular animation keywords to user tags. (2) Static. We discard GIFs that show little to no motion (basically static images). To detect static GIFs, we manually annotated 7K GIFs as either static or dynamic, and trained a Random Forest classifier based on C3D features [36]. The 5-fold cross validation accuracy for this classifier is 89.4%. (3) Text. We filter out GIFs that contain text, e.g., memes, by detecting text regions using the Extremal Regions detector [23] and discarding a GIF if the regions cover more than 2% of the image area. (4) Dedup. We compute a 64-bit DCT image hash using pHash [45] and apply multiple index hashing [24] to perform k nearest neighbor search (k = 100) in the Hamming space. A GIF is considered a duplicate if there are more than 10 overlapping frames with other GIFs. On a held-out dataset, the false alarm rate is around 2%.
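To make the dedup step concrete, below is a minimal sketch of a frame-level near-duplicate check in Python. It approximates the 64-bit DCT hash with NumPy/SciPy instead of calling pHash [45], uses a brute-force Hamming comparison rather than multiple index hashing [24], and the per-frame Hamming threshold (max_hamming) is an illustrative assumption; only the 10-overlapping-frames rule comes from the text above.

# Minimal sketch of the frame-level dedup check. The released pipeline uses
# pHash [45] and multi-index hashing [24]; this approximation is illustrative.
import numpy as np
from PIL import Image, ImageSequence
from scipy.fftpack import dct

def dct_hash(frame, hash_size=8):
    """64-bit perceptual hash: 2D DCT of a 32x32 grayscale frame,
    keep the top-left 8x8 low frequencies, threshold at the median."""
    img = frame.convert("L").resize((32, 32), Image.LANCZOS)
    pixels = np.asarray(img, dtype=np.float64)
    freq = dct(dct(pixels, axis=0, norm="ortho"), axis=1, norm="ortho")
    low = freq[:hash_size, :hash_size]
    return (low > np.median(low)).flatten()   # boolean vector of 64 bits

def gif_hashes(path):
    with Image.open(path) as gif:
        return [dct_hash(f.copy()) for f in ImageSequence.Iterator(gif)]

def overlapping_frames(hashes_a, hashes_b, max_hamming=4):
    """Count frames of GIF A that have a near-duplicate frame in GIF B.
    max_hamming is an assumed per-frame Hamming threshold."""
    return sum(
        any(np.count_nonzero(ha != hb) <= max_hamming for hb in hashes_b)
        for ha in hashes_a
    )

def is_duplicate(path_a, path_b, frame_threshold=10):
    # A GIF is flagged as a duplicate if more than 10 of its frames overlap
    # with another GIF (the threshold stated in the paper).
    return overlapping_frames(gif_hashes(path_a), gif_hashes(path_b)) > frame_threshold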
Finally, we manually validate the resulting GIFs to see whether there is any cartoon, static, or textual content. Each GIF is reviewed by at least two annotators. After these steps, we obtain a corpus of 100K clean animated GIFs.

3.2. Data Annotation

We annotated animated GIFs with natural language descriptions using the crowdsourcing service CrowdFlower. We carefully designed our annotation task with various quality control mechanisms to ensure the sentences are both syntactically and semantically of high quality.

A total of 931 workers participated in our annotation task. We allowed workers only from Australia, Canada, New Zealand, the UK and the USA in an effort to collect fluent descriptions from native English speakers. Figure 2 shows the instructions given to the workers. Each task showed 5 animated GIFs and asked the worker to describe each with one sentence. To promote language style diversity, each worker could rate no more than 800 images (0.7% of our corpus). We paid 0.02 USD per sentence; the entire crowdsourcing cost less than 4K USD. We provide details of our annotation task in the supplementary material.

Figure 2: The instructions shown to the crowdworkers.

Syntactic validation. Since the workers provide free-form text, we automatically validate the sentences before submission. We do the following checks: the sentence (1) contains at least 8, but no more than 25, words (white space separated); (2) contains only ASCII characters; (3) does not contain profanity (checked by keyword matching); (4) should be typed, not copy/pasted (checked by disabling copy/paste on the task page); (5) should contain a main verb (checked by using standard POS tagging [35]); (6) contains no named entities, such as the name of an actor/actress, movie, or country (checked by the Named Entity Recognition results from DBpedia Spotlight [6]); and (7) is grammatical and free of typographical errors (checked by LanguageTool⁴).

³ https://www.tumblr.com/docs/en/api/v2
⁴ https://languagetool.org/
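A minimal sketch of how checks (1), (2), (3) and (5) could be implemented is shown below; it uses NLTK as a stand-in for the POS tagger of [35], a placeholder profanity list, and stubs for the DBpedia Spotlight and LanguageTool calls. It is illustrative only, not the validation code used in the actual task.

# Sketch of checks (1)-(3) and (5); (4) is enforced in the task page UI, and
# (6)-(7) call external services (DBpedia Spotlight, LanguageTool), stubbed here.
# Requires the 'punkt' and 'averaged_perceptron_tagger' NLTK data packages.
import nltk

PROFANITY = {"badword1", "badword2"}   # placeholder keyword list

def validate_sentence(sentence):
    words = sentence.split()                                   # (1) 8-25 whitespace-separated words
    if not 8 <= len(words) <= 25:
        return False, "length"
    if not all(ord(c) < 128 for c in sentence):                # (2) ASCII only
        return False, "non-ascii"
    if any(w.lower().strip(".,!?") in PROFANITY for w in words):  # (3) profanity keyword match
        return False, "profanity"
    tags = nltk.pos_tag(nltk.word_tokenize(sentence))          # (5) must contain a main verb
    if not any(tag.startswith("VB") for _, tag in tags):
        return False, "no-verb"
    # (6) named entities and (7) grammar/typos would be checked against
    # DBpedia Spotlight and LanguageTool here.
    return True, "ok"

print(validate_sentence("a man plays piano as a woman stands and two dogs play"))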

This validation pipeline ensures sentences are syntactically good, but it does not ensure their semantic correctness, i.e., there is no guarantee that a sentence accurately describes the corresponding GIF. We therefore designed a semantic validation pipeline, described next.

Semantic validation. Ideally, we would like to validate the semantic correctness of every submitted sentence (as we do for syntactic validation), but doing so is impractical. We turn to the "blacklisting" approach, where we identify workers who underperform and block them accordingly. We annotated a small number of GIFs and used them to measure the performance of workers. We collected a validation dataset with 100 GIFs and annotated each with 10 sentences using CrowdFlower. We carefully hand-picked GIFs whose visual story is clear and unambiguous. After collecting the sentences, we manually reviewed and edited them to make sure they meet our standard.

Using the validation dataset, we measured the semantic relatedness of a sentence to a GIF using METEOR [18], a metric commonly used within the NLP community to measure machine translation quality. We compare a user-provided sentence to the 10 reference sentences using the metric, and accept the sentence if the METEOR score is above a threshold (empirically set at 20%). This filters out junk sentences, e.g., "this is a funny GIF taken in a nice day," but retains sentences with similar semantic meaning as the reference sentences.

We used the validation dataset in both the qualification and the main tasks. In the qualification task, we provided 5 GIFs from the validation dataset and approved a worker if they successfully described at least four of them. In the main task, we randomly mixed one validation question with four main questions; a worker is blacklisted if their overall approval rate on validation questions falls below 80%. Because validation questions are indistinguishable from normal task questions, workers have to continue to maintain a high level of accuracy in order to remain eligible for the task.

As we ran the CrowdFlower task, we regularly reviewed failed sentences and, in the case of a false alarm, we manually added the failed sentence to the reference sentence pool and removed the worker from the blacklist. Rashtchian et al. [28] and Chen et al. [5] used a similar prescreening strategy to approve crowdworkers; our strategy of validating sentences during the main task is unique to our work.
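The acceptance rule and blacklisting logic described above can be sketched as follows. The meteor(...) function is assumed to wrap an existing METEOR [18] implementation returning a score in [0, 1]; it is not defined here. The 20% acceptance threshold and 80% approval-rate cutoff are the values given in the text.

# Sketch of the semantic validation logic; `meteor(hypothesis, references)` is
# an assumed wrapper around an existing METEOR implementation.
METEOR_THRESHOLD = 0.20      # empirically set acceptance threshold
APPROVAL_THRESHOLD = 0.80    # minimum approval rate on validation questions

def accept_sentence(sentence, reference_sentences, meteor):
    """Accept a worker sentence if METEOR against the 10 references >= 20%."""
    return meteor(sentence, reference_sentences) >= METEOR_THRESHOLD

class WorkerRecord:
    """Tracks a worker's hidden validation questions for blacklisting."""
    def __init__(self):
        self.passed = 0
        self.seen = 0

    def update(self, sentence, reference_sentences, meteor):
        self.seen += 1
        if accept_sentence(sentence, reference_sentences, meteor):
            self.passed += 1

    @property
    def blacklisted(self):
        # Blacklist once the approval rate on validation questions drops below 80%.
        return self.seen > 0 and (self.passed / self.seen) < APPROVAL_THRESHOLD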
4. Dataset Analysis

We compare TGIF to four existing image and video description datasets: MS-COCO [20], M-VAD [34], MPII-MD [31], and LSMDC [34, 31].

Descriptive statistics. We divide the 100K animated GIFs into 90K training and 10K test splits. We collect 1 sentence and 3 sentences per GIF for the training and test data, respectively. Therefore, there are about 120K sentences in our dataset. By comparison, the MS-COCO dataset [20] has 5 sentences and 40 sentences for each training and test sample, respectively. The movie datasets have 1 professionally created sentence for each training and test sample. On average, an animated GIF in our dataset is 3.10 seconds long, while a video clip in the M-VAD [34] and the MPII-MD [31] datasets is 6.13 and 3.02 seconds long, respectively.

Table 1 shows descriptive statistics of our dataset and existing datasets, and Table 2 shows the most frequent nouns, verbs and adjectives. Our dataset has more sentences with a smaller vocabulary size. Notably, our dataset has an average term frequency that is 3 to 4 times higher than other datasets. A higher average term frequency means less polymorphism, thus increasing the chances of learning visual/textual associations, an ideal property for image and video description.

Table 1: Descriptive statistics: (a) total number of sentences, (b) vocabulary size, (c) average term frequency, (d) median number of words in a sentence, and (e) average number of shots.

      TGIF      M-VAD    MPII-MD   LSMDC     COCO
(a)   125,781   46,523   68,375    108,470   616,738
(b)   11,806    15,977   18,895    22,898    54,224
(c)   112.8     31.0     34.7      46.8      118.9
(d)   10        6        6         6         9
(e)   2.54      5.45     4.65      5.21      -

Table 2: Top frequent nouns/verbs/adjectives

Noun   man, woman, girl, hand, hair, head, cat, boy, person
Verb   be, look, wear, walk, dance, talk, smile, hold, sit
Adj.   young, black, other, white, long, red, blond, dark
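For reference, the per-dataset quantities in Table 1 rows (a)-(d) can be computed from a list of description strings roughly as follows; whitespace tokenization is a simplification of whatever tokenizer was actually used.

# Small sketch of the corpus statistics reported in Table 1.
from collections import Counter
from statistics import median

def corpus_statistics(sentences):
    tokens = [w.lower() for s in sentences for w in s.split()]
    counts = Counter(tokens)
    return {
        "num_sentences": len(sentences),                     # row (a)
        "vocabulary_size": len(counts),                      # row (b)
        "avg_term_frequency": len(tokens) / len(counts),     # row (c): tokens per vocabulary entry
        "median_sentence_length": median(len(s.split()) for s in sentences),  # row (d)
    }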
Language generality-specificity. Our dataset is annotated by crowdworkers, while the movie datasets are annotated by trained professionals. As a result, the language in our dataset tends to be more general than the movie datasets. To show this, we measure how sentences in each dataset conform to "common language" using an n-gram language model (LM) trained on the Google 1B word corpus [4]. We average the LM score by the number of words in each sentence to avoid the tendency of a longer sentence producing a lower score. Figure 3 shows that our dataset has higher average LM scores even with longer sentences.

Figure 3: The main plot shows the distribution of language model scores averaged by the number of words in each dataset. The box plot shows the distribution of sentence lengths.

Verb characteristics. Identifying verbs (actions) is perhaps one of the most challenging problems in image and video description. In order to understand what types of verbs are used for describing each dataset, we link verbs in each sentence to WordNet using the semantic parser from [31]. Table 3 shows the distribution of top verb categories in each dataset (verb categories refer to the highest-level nodes in the WordNet hierarchy).

Table 3: Top verb categories with most common verbs for each category, and the distribution of verb occurrences on three datasets. Bold faced numbers are discussed in the text. "percp." means "perception", and "comm." means "communication."

Category   motion   contact   body    percp.   comm.
Examples   turn     sit       wear    look     talk
           move     stand     smile   show     wave
           walk     kiss      laugh   stare    speak
           dance    put       blow    watch    point
           shake    open      dress   see      nod
TGIF       30%      17%       11%     8%       7%
LSMDC      31%      21%       3%      12%      4%
COCO       19%      35%       3%      7%       2%

Not surprisingly, the MS-COCO dataset contains more static verbs (contact) compared to the video description datasets, which have more dynamic verbs (motion). This suggests that video contains more temporally dynamic content than static images. Most importantly, our dataset has more "picturable" verbs related to human interactions (body), and fewer abstract verbs (perception) compared to the LSMDC dataset. Because picturable verbs are arguably more visually identifiable than abstract verbs (e.g., walk vs. think), this result suggests that our dataset may provide an ideal testbed for video description.

Quality of segmentation and description. To make qualitative comparisons between the TGIF and LSMDC datasets, we conducted a user study designed to evaluate the quality of segmentation and language descriptions (see Table 4). The first question evaluates how well a video is segmented, while the other two evaluate the quality of text descriptions (how well a sentence describes the corresponding video). In the questionnaire we provided detailed examples for each question to facilitate complete understanding of the questions. We randomly selected 100 samples from each dataset, converted movie clips to animated GIFs, and mixed them in a random order to make them indistinguishable. We recruited 10 people from various backgrounds, and used majority voting to pool the answers from raters. Table 4 shows two advantages of ours over LSMDC: (1) the animated GIFs are carefully segmented to convey a cohesive and self-contained visual story; and (2) the sentences are well associated with the main visual story.

Table 4: Polling results comparing TGIF and LSMDC datasets.

                                                                          TGIF            LSMDC
Q1: Video contains a cohesive, self-contained visual story without any   100.0% ± 0.2%   92.0% ± 1.7%
    frame irrelevant to the main story.
Q2: Sentence accurately describes the main visual story of the video     95.0% ± 1.4%    78.0% ± 2.6%
    without missing information.
Q3: Sentence describes visual content available only within the video.   94.0% ± 1.5%    88.0% ± 2.0%

5. Benchmark Evaluation

We report results on our dataset using three popular techniques used in video description: nearest neighbor, statistical machine translation, and LSTM.

5.1. Evaluation Metrics

We report performance on four metrics often used in machine translation: BLEU [27], METEOR [18], ROUGE [19] and CIDEr [37]. BLEU, ROUGE and CIDEr use only exact n-gram matches, while METEOR uses synonyms and paraphrases in addition to exact n-gram matches. BLEU is precision-based, while ROUGE is recall-based. CIDEr optimizes a set of weights on the TF-IDF match score using human judgments. METEOR uses an F1 score to combine different matching scores. For all four metrics, a larger score means better performance.
5.2. Baseline Methods

The TGIF dataset is randomly split into 80K, 10K and 10K for training, validation and testing, respectively. The automatic animated GIF description methods learn from the training set, and are evaluated on the testing set.

5.2.1 Nearest Neighbor (NN)

We find a nearest neighbor in the training set based on its visual representation, and use its sentence as the prediction result. Each animated GIF is represented using the off-the-shelf Hybrid CNN [46] and C3D [36] models; the former encodes static objects and scenes, while the latter encodes dynamic actions and events. From each animated GIF, we sample one random frame for the Hybrid CNN features and 16 random sequential frames for the C3D features. We then concatenate the two feature representations and determine the most similar instance based on the Euclidean distance.
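A minimal sketch of this retrieval baseline, assuming the Hybrid CNN [46] and C3D [36] feature vectors have already been extracted for every GIF, is given below; feature extraction itself is not shown.

# Sketch of the nearest-neighbor baseline over precomputed features.
import numpy as np

def build_index(cnn_feats, c3d_feats, sentences):
    """cnn_feats, c3d_feats: (N, d1) and (N, d2) arrays for the training GIFs."""
    feats = np.concatenate([cnn_feats, c3d_feats], axis=1)   # concatenate the two representations
    return feats, sentences

def describe(query_cnn, query_c3d, index):
    feats, sentences = index
    query = np.concatenate([query_cnn, query_c3d])
    dists = np.linalg.norm(feats - query, axis=1)             # Euclidean distance to every training GIF
    return sentences[int(np.argmin(dists))]                   # copy the nearest neighbor's sentence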
5.2.2 Statistical Machine Translation (SMT)

Similar to the two-step process of Rohrbach et al. [31], we automatically label an animated GIF with a set of semantic roles using a visual classifier and translate them into a sentence using SMT. We first obtain the semantic roles of words in our training examples by applying a semantic parser [31, 7].
We then train a visual classifier using the same input features as in the NN baseline and the semantic roles as the target variable. We use the multi-label classification model of Read et al. [29] as our visual classifier.

We compare two different databases to represent semantic roles: WordNet [22] and FrameNet [2], which we refer to as SMT-WordNet and SMT-FrameNet, respectively. For SMT-WordNet, we use the same semantic parser of Rohrbach et al. [31] to map the words into WordNet entries (semantic roles), while for SMT-FrameNet we use the frame semantic parser from Das et al. [7]. We use the phrase based model from Koehn et al. [15] to learn the SMT model.

5.2.3 Long Short-Term Memory (LSTM)

We evaluate an LSTM approach using the same setup as S2VT [38]. We also evaluate a number of its variants in order to analyze the effects of different components.

Basic setup. We sample frames at 10 FPS and encode each using a CNN [16]. We then encode the whole sequence using an LSTM. After the encoding stage, a decoder LSTM is initialized with a "BOS" (Beginning of Sentence) token and the hidden states/memory cells from the last encoder LSTM unit. The decoder LSTM generates a description word-by-word using the previous hidden states and words, until an "EOS" (End of Sentence) token is generated. The model weights (CNN and encoder/decoder LSTM) are learned by minimizing the softmax loss L:

L = -\frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \log p(y_t = S_t^i \mid h_{t-1}, S_{t-1}^i),    (1)

where S_t^i is the t-th word of the i-th sentence in the training data, h_t is the hidden state, and y_t is the predicted word at timestamp t. The word probability p(y_t = w) is computed as the softmax of the decoder LSTM output.

At the test phase, the decoder has no ground-truth word from which to infer the next word. There are many inference algorithms for this situation, including greedy, sampling, and beam search. We empirically found that the simple greedy algorithm performs the best. Thus, we use the most likely word at each time step to predict the next word (along with the hidden states).
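The greedy inference loop can be sketched as follows. Here decoder_step is a hypothetical wrapper around one step of the decoder LSTM (it is not part of the released code), returning the softmax distribution over the vocabulary and the updated decoder state; the encoder state is assumed to be computed beforehand.

# Framework-agnostic sketch of greedy decoding.
import numpy as np

def greedy_decode(decoder_step, encoder_state, vocab, bos_id, eos_id, max_len=25):
    state = encoder_state              # hidden states/memory cells from the last encoder unit
    word_id = bos_id                   # decoding starts from the "BOS" token
    output = []
    for _ in range(max_len):
        probs, state = decoder_step(word_id, state)
        word_id = int(np.argmax(probs))            # greedy: take the most likely word
        if word_id == eos_id:                      # stop when "EOS" is generated
            break
        output.append(vocab[word_id])
    return " ".join(output)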
We implemented the system using Caffe [13] on three K80 GPU cards, with the batch size fixed to 16, the learning rate decreasing gradually from 0.1 to 1e-4, and training for 16 epochs (800K iterations) over the training data. The optimization converges at around 600K iterations.

Variants on cropping scheme. We evaluate five variants of cropping schemes for data augmentation. S2VT uses well-adopted spatial cropping [16] for all frames independently. To verify the importance of sequence modeling, we test Single cropping, where we take a single random frame from the entire sequence. No-SP crops 10 patches (2 mirrors of center, bottom/top-left/right) from each frame and averages their CNN features. Spatial cropping is shown to be crucial to achieve translation invariance for image recognition [33]. To achieve a similar invariance effect along the temporal axis, we introduce Tempo, where a subsequence is randomly cropped from the original sequence and used as input for the sequence encoder (instead of the original full sequence); the spatial cropping is also applied to this baseline. S2VT crops patches from different spatial locations across frames. However, this introduces a spatial inconsistency into the LSTM encoder because the cropped location changes over the temporal axis. This may make it difficult to learn the right spatial-temporal dynamics to capture the motion information. Therefore, we introduce Cubic cropping, which adds a spatial consistency constraint to the Tempo version (see Figure 4).

Figure 4: Illustration of three cropping schemes. S2VT crops patches from random locations across all frames in a sequence. Tempo also crops patches from random locations, but from a randomly cropped subsequence. Cubic crops patches from a random location shared across a randomly cropped subsequence.
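A sketch of the Cubic cropping variant is shown below: one random temporal subsequence is taken, and a single spatial crop location is shared across its frames. The subsequence length and crop size are illustrative parameters, not the exact training settings.

# Sketch of Cubic cropping: one temporal crop plus one shared spatial crop.
import numpy as np

def cubic_crop(frames, sub_len, crop_h, crop_w, rng=np.random):
    """frames: (T, H, W, C) array of a GIF's frames."""
    T, H, W, _ = frames.shape
    assert sub_len <= T and crop_h <= H and crop_w <= W
    t0 = rng.randint(0, T - sub_len + 1)           # temporal crop (as in Tempo)
    y0 = rng.randint(0, H - crop_h + 1)            # one spatial location ...
    x0 = rng.randint(0, W - crop_w + 1)            # ... shared across the subsequence
    return frames[t0:t0 + sub_len, y0:y0 + crop_h, x0:x0 + crop_w, :]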

Variants on CNN weight optimization. We evaluate three variants on how the CNN weights are initialized and updated. S2VT sets the weights by pretraining on ImageNet-1K class categories [33] and fixing them throughout. The Rand model randomly initializes the CNN weights and fixes them throughout. To keep the CNN weights fixed, we limit the gradients of the loss function to backpropagate only to the encoder LSTM. Finetune takes the pretrained parameters and finetunes them by backpropagating the gradients all the way down to the CNN part.

5.3. Results and Discussion

Table 5 summarizes the results. We can see that NN performs significantly worse than all other methods across all metrics. The NN copies sentences from the training set; our result suggests the importance of explicitly modeling sequence structure in GIFs and sentences for the TGIF dataset.

Table 5: Benchmark results on three baseline methods and their variants on five different evaluation metrics.

Methods                  BLEU-1  BLEU-2  BLEU-3  BLEU-4  METEOR  ROUGE-L  CIDEr
Nearest Neighbor         25.3    7.6     2.2     0.7     7.0     21.0     1.5
SMT    WordNet           27.8    13.6    6.3     3.0     9.6     26.0     8.9
       FrameNet          34.3    18.1    9.2     4.6     14.1    28.3     10.3
LSTM   S2VT              51.1    31.3    19.1    11.2    16.1    38.7     27.6
  Crop Single            47.0    27.1    15.7    9.0     15.4    36.9     23.8
       No-SP             51.4    32.1    20.1    11.8    16.1    39.1     28.3
       Tempo             49.4    30.4    18.6    10.6    16.1    38.4     26.7
       Cubic             50.9    31.5    19.3    11.1    16.2    38.7     27.6
  CNN  Rand              49.7    27.2    14.5    5.2     13.6    36.6     7.6
       Finetune          52.1    33.0    20.9    12.7    16.7    39.8     31.6

SMT baselines. Our results show that SMT-FrameNet outperforms SMT-WordNet across the board. Does this mean the former should be preferred over the latter? To answer this, we dissect the two-step process of the SMT baseline by analyzing visual classification (image to semantic role) and machine translation (semantic role to sentence) separately. The mean F1 score of visual classification on the test set is only 0.22% for WordNet; for FrameNet it is 2.09%. We also observe poor grammar performance with both variants, as is shown in Figure 5. We believe the poor performance of visual classifiers has contributed to the poor grammar in generated sentences. This is because it makes the distribution of the input to the SMT system inconsistent with the training data. Although nowhere close to the current state-of-the-art image classification performance [33], the difference in mean F1 scores in part explains the better performance of SMT-FrameNet, i.e., the second step (machine translation) receives more accurate classification results as input. We note, however, that there are 6,609 concepts from WordNet that overlap with our dataset, while for FrameNet there are only 696 concepts. So the performance difference could merely reflect the difficulty of learning a visual classifier for WordNet with about 10 times more label categories.

We find a more conclusive answer by analyzing the machine translation step alone: we bypass the visual classification step by using ground-truth semantic roles as input to machine translation. We observe an opposite result: a METEOR score of 21.9% for SMT-FrameNet and 29.3% for SMT-WordNet. This suggests: (1) having a more expressive and larger semantic role vocabulary helps improve performance; and (2) there is huge potential for improvement on SMT-WordNet, perhaps more so than SMT-FrameNet, by improving visual classification of WordNet categories.

LSTM baselines. The LSTM methods significantly outperform the NN and the SMT baselines even with simple CNN features (the NN and SMT baselines use Hybrid CNN and C3D features). This conforms to recent findings that end-to-end sequence learning using deep neural nets outperforms traditional hand-crafted pipelines [43, 38]. By comparing results of different LSTM variants we make three major observations: (1) The fact that Single performs worse than all other LSTM variants (except for Rand) suggests the importance of modeling input sequence structure; (2) The four variants on different cropping schemes (S2VT, No-SP, Tempo, Cubic) perform similarly to each other, suggesting spatial and temporal shift-invariance of the LSTM approaches to the input image sequence; (3) Among the three variants of different CNN weight initialization and update schemes (S2VT, Rand, Finetune), Finetune performs the best. This suggests the importance of having a task-dependent representation in the LSTM baseline.

Qualitative analysis. Figure 5 shows sentences generated using the three baselines and their METEOR scores. The NN appears to capture some parts of the visual components (e.g., (c) "drops" and (d) "white" in Fig. 5), but almost always fails to generate a relevant sentence. On the other hand, SMT-FrameNet appears to capture more detailed semantic roles (e.g., (a) "ball player" and (b) "pool of water"), but most sentences contain syntactic errors. Finally, LSTM-Finetune generates quite relevant and grammatical sentences, but at times fails to capture detailed semantics (e.g., (c) "running through" and (f) "a group of people"). We provide more examples in the supplementary material.

Do we need more training data? Table 6 shows the METEOR score of S2VT on various portions of the training dataset (but on the same test set). Not surprisingly, the performance increases as we use more training data. We see, on the other hand, that the performance plateaus after 80%. We believe this shows our TGIF dataset is already at its capacity to challenge current state-of-the-art models.

Table 6: METEOR scores improve as we use more training data, but plateau after 80% of the training set.

        20%   40%   60%   80%   100%
S2VT    15.0  15.5  15.7  16.1  16.1

Importance of multiple references. Table 7 shows the METEOR score of the three baselines according to different numbers of reference sentences in our test set. We see a clear pattern of increasing performance as we use more references in evaluation. We believe this reflects the fact that there is no clear cut single sentence answer to image and video description, and it suggests that using more references will increase the reliability of evaluation results. We believe the score will eventually converge with more references; we plan to investigate this in the future.

Figure 5: Example animated GIFs and generated sentences from nearest neighbor (N), SMT-FrameNet (S), and LSTM-Finetune (L). The
GT refers to one of the 3 ground truth sentences provided by crowdworkers. The numbers in parentheses show the METEOR score (%) of
each generated sentence. More examples can be found here: https://goo.gl/xcYjjE

Table 7: METEOR scores improve with more reference sentences.

# of references   One    Two    Three
NN                5.0    6.2    7.0
SMT-FrameNet      10.5   12.8   14.1
LSTM-Finetune     12.1   15.0   16.7

5.4. Cross-Dataset Adaptation: GIF to Movies

Finally, we evaluate whether an LSTM trained to describe animated GIFs can be applied to the movie description task. We test three settings (see Table 8). TGIF represents the basic S2VT model trained on the TGIF dataset, while Movie is the S2VT model trained on each movie dataset (M-VAD, MPII-MD, and LSMDC), respectively. Finally, TGIF-to-Movie represents the S2VT model pretrained on TGIF and fine-tuned on each of the movie datasets, respectively. We see that TGIF-to-Movie improves performance on the M-VAD and MPII-MD datasets, and performs comparably on the LSMDC dataset.

Table 8: METEOR scores from cross-dataset experiments.

                M-VAD   MPII-MD   LSMDC
TGIF            3.53    3.92      3.96
Movie           4.99    5.35      5.82
TGIF-to-Movie   5.17    5.42      5.77

6. Conclusions

We presented the Tumblr GIF (TGIF) dataset and showed how we solved multiple obstacles involved in crowdsourcing natural language descriptions, using automatic content filtering for collecting animated GIFs, as well as novel syntactic and semantic validation techniques to ensure high quality descriptions from free-form text input. We also provided extensive benchmark results using three popular video description techniques, and showed promising results on improving movie description using our dataset.

We believe TGIF shows much promise as a research tool for video description and beyond. An animated GIF is simply a limited series of still frames, often without narrative or need for context, and always without audio. So focusing on this constrained content is a more readily accessible bridge to advance research on video understanding than a leap to long-form videos, where the content is complex, with contextual information that is currently far from decipherable automatically. Once the content of animated GIFs is more readily recognizable, the step to video understanding will be more achievable, through adding audio cues, context, storytelling archetypes and other building blocks.

Acknowledgements

This work was supported in part by Yahoo Research, Flickr, and New York State through the Goergen Institute for Data Science at the University of Rochester. We thank Gerry Pesavento, Huy Nguyen and others from Flickr for their support in collecting descriptions via crowdsourcing.
proves performance on the M-VAD and MPII-MD datasets, Gerry Pesavento, Huy Nguyen and others from Flickr for
and performs comparably to the LSMDC dataset. their support in collecting descriptions via crowdsourcing.

4648
References

[1] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh. VQA: Visual question answering. In ICCV, 2015.
[2] C. F. Baker, C. J. Fillmore, and J. B. Lowe. The Berkeley FrameNet project. In ACL, 1998.
[3] S. Bakhshi, D. Shamma, L. Kennedy, Y. Song, P. de Juan, and J. J. Kaye. Fast, cheap, and good: Why animated GIFs engage us. In CHI, 2016.
[4] C. Chelba, T. Mikolov, M. Schuster, Q. Ge, T. Brants, P. Koehn, and T. Robinson. One billion word benchmark for measuring progress in statistical language modeling. In INTERSPEECH, 2014.
[5] D. L. Chen and W. B. Dolan. Collecting highly parallel data for paraphrase evaluation. In ACL, 2011.
[6] J. Daiber, M. Jakob, C. Hokamp, and P. N. Mendes. Improving efficiency and accuracy in multilingual entity extraction. In I-Semantics, 2013.
[7] D. Das, D. Chen, A. F. Martins, N. Schneider, and N. A. Smith. Frame-semantic parsing. Computational Linguistics, 40(1), 2014.
[8] P. Das, C. Xu, R. Doell, and J. Corso. A thousand frames in just a few words: Lingual description of videos through latent topics and sparse object stitching. In CVPR, 2013.
[9] H. Fang, S. Gupta, F. Iandola, R. K. Srivastava, L. Deng, P. Dollár, J. Gao, X. He, M. Mitchell, J. C. Platt, et al. From captions to visual concepts and back. In CVPR, 2015.
[10] A. Farhadi, M. Hejrati, M. A. Sadeghi, P. Young, C. Rashtchian, J. Hockenmaier, and D. Forsyth. Every picture tells a story: Generating sentences from images. In ECCV, 2010.
[11] S. Guadarrama, N. Krishnamoorthy, G. Malkarnenkar, S. Venugopalan, R. Mooney, T. Darrell, and K. Saenko. YouTube2Text: Recognizing and describing arbitrary activities using semantic hierarchies and zero-shot recognition. In ICCV, 2013.
[12] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8), 1997.
[13] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In MM, 2014.
[14] A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In CVPR, 2015.
[15] P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, et al. Moses: Open source toolkit for statistical machine translation. In ACL, 2007.
[16] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[17] G. Kulkarni, V. Premraj, V. Ordonez, S. Dhar, S. Li, Y. Choi, A. C. Berg, and T. Berg. BabyTalk: Understanding and generating simple image descriptions. PAMI, 35(12), 2013.
[18] M. Denkowski and A. Lavie. Meteor Universal: Language specific translation evaluation for any target language. In ACL, 2014.
[19] C.-Y. Lin. ROUGE: A package for automatic evaluation of summaries. In ACL, 2004.
[20] T. Lin, M. Maire, S. J. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
[21] J. Mao, W. Xu, Y. Yang, J. Wang, and A. Yuille. Deep captioning with multimodal recurrent neural networks (m-RNN). arXiv preprint arXiv:1412.6632, 2014.
[22] G. A. Miller. WordNet: A lexical database for English. Communications of the ACM, 38(11), 1995.
[23] L. Neumann and J. Matas. Real-time scene text localization and recognition. In CVPR, 2012.
[24] M. Norouzi, A. Punjani, and D. J. Fleet. Fast exact search in Hamming space with multi-index hashing. PAMI, 36(6), 2014.
[25] V. Ordonez, G. Kulkarni, and T. L. Berg. Im2Text: Describing images using 1 million captioned photographs. In NIPS, 2011.
[26] Y. Pan, T. Mei, T. Yao, H. Li, and Y. Rui. Jointly modeling embedding and translation to bridge video and language. CoRR, abs/1505.01861, 2015.
[27] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. BLEU: A method for automatic evaluation of machine translation. In ACL, 2002.
[28] C. Rashtchian, P. Young, M. Hodosh, and J. Hockenmaier. Collecting image annotations using Amazon's Mechanical Turk. In NAACL HLT, 2010.
[29] J. Read, B. Pfahringer, G. Holmes, and E. Frank. Classifier chains for multi-label classification. Machine Learning, 85(3), 2011.
[30] M. Regneri, M. Rohrbach, D. Wetzel, S. Thater, B. Schiele, and M. Pinkal. Grounding action descriptions in videos. TACL, 2013.
[31] A. Rohrbach, M. Rohrbach, N. Tandon, and B. Schiele. A dataset for movie description. In CVPR, 2015.
[32] M. Rohrbach, W. Qiu, I. Titov, S. Thater, M. Pinkal, and B. Schiele. Translating video content to natural language descriptions. In ICCV, 2013.
[33] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. IJCV, 2015.
[34] A. Torabi, C. Pal, H. Larochelle, and A. Courville. Using descriptive video services to create a large data source for video annotation research. arXiv preprint, 2015.
[35] K. Toutanova and C. D. Manning. Enriching the knowledge sources used in a maximum entropy part-of-speech tagger. In ACL, 2010.
[36] D. Tran, L. D. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3D convolutional networks. In ICCV, 2015.
[37] R. Vedantam, C. L. Zitnick, and D. Parikh. CIDEr: Consensus-based image description evaluation. In CVPR, 2015.
[38] S. Venugopalan, M. Rohrbach, J. Donahue, R. Mooney, T. Darrell, and K. Saenko. Sequence to sequence - video to text. In ICCV, 2015.
[39] S. Venugopalan, H. Xu, J. Donahue, M. Rohrbach, R. J. Mooney, and K. Saenko. Translating videos to natural language using deep recurrent neural networks. In NAACL HLT, 2015.
[40] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. In CVPR, 2015.
[41] K. Xu, J. Ba, R. Kiros, K. Cho, A. C. Courville, R. Salakhutdinov, R. S. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. In ICML, 2015.
[42] B. Z. Yao, X. Yang, L. Lin, M. W. Lee, and S.-C. Zhu. I2T: Image parsing to text description. Proceedings of the IEEE, 98(8), 2010.
[43] L. Yao, A. Torabi, K. Cho, N. Ballas, C. Pal, H. Larochelle, and A. Courville. Describing videos by exploiting temporal structure. In ICCV, 2015.
[44] L. Yu, E. Park, A. C. Berg, and T. L. Berg. Visual Madlibs: Fill in the blank description generation and question answering. In ICCV, 2015.
[45] C. Zauner. Implementation and benchmarking of perceptual image hash functions. 2010.
[46] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva. Learning deep features for scene recognition using Places database. In NIPS, 2014.
