Recent self-supervised approaches have used large-scale image-text datasets to learn powerful representations that transfer to many tasks without finetuning. These methods often assume that there is a one-to-one correspondence between images and their (short) captions. However, many tasks require reasoning about multiple images paired with a long text narrative, such as photos in a news article. In this work, we explore a novel setting where the goal is to learn a self-supervised visual-language representation from longer text paired with a set of photos, which we call visual summaries. In addition, unlike prior work which assumed captions have a literal relation to the image, we assume images have only a loose illustrative correspondence with the text. To explore this problem, we introduce a large-scale multimodal dataset called NewsStories containing over 31M articles, 22M images and 1M videos. We show that state-of-the-art image-text alignment methods are not robust to longer narratives paired with multiple images, and introduce an intuitive baseline that outperforms these methods, e.g., by 10% on zero-shot image-set retrieval on the GoodNews dataset.
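As a concrete illustration of the kind of image-set-to-article alignment objective this setting calls for, below is a minimal sketch of a symmetric contrastive loss between a pooled set of image embeddings and an article-text embedding. The pooling choice, feature dimensions, and temperature are illustrative assumptions, not the paper's actual method.

```python
# Hedged sketch: symmetric InfoNCE between mean-pooled image-set embeddings and
# article-text embeddings. Shapes and the temperature are illustrative assumptions.
import torch
import torch.nn.functional as F

def image_set_text_contrastive_loss(image_feats, text_feats, temperature=0.07):
    """image_feats: (B, N, D) per-article sets of N image embeddings.
       text_feats:  (B, D) article-level text embeddings."""
    set_emb = F.normalize(image_feats.mean(dim=1), dim=-1)    # pool the image set
    txt_emb = F.normalize(text_feats, dim=-1)
    logits = set_emb @ txt_emb.t() / temperature              # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric cross-entropy: match each image set to its article and vice versa.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Example with random features standing in for real encoders.
loss = image_set_text_contrastive_loss(torch.randn(8, 4, 256), torch.randn(8, 256))
```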
We introduce the task of spatially localizing narrated interactions in videos. Key to our approach is the ability to learn to spatially localize interactions with self-supervision on a large corpus of videos with accompanying transcribed narrations. To achieve this goal, we propose a multilayer cross-modal attention network that enables effective optimization of a contrastive loss during training. We introduce a divided strategy that alternates between computing inter- and intra-modal attention across the visual and natural language modalities, which allows effective training via directly contrasting the two modalities' representations. We demonstrate the effectiveness of our approach by self-training on the HowTo100M instructional video dataset and evaluating on a newly collected dataset of localized described interactions in the YouCook2 dataset. We show that our approach outperforms alternative baselines, including shallow co-attention and full cross-modal attention. We also apply our approach to grounding phrases in images with weak supervision on Flickr30K and show that stacking multiple attention layers is effective and, when combined with a word-to-region loss, achieves state of the art on recall-at-one and pointing-hand accuracies.
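The divided inter-/intra-modal strategy can be sketched roughly as follows: each block first lets every modality attend over itself, then lets each modality attend over the other. The layer sizes, number of heads, and residual structure here are assumptions for illustration, not the paper's exact architecture.

```python
# Hedged sketch of a "divided" attention block: intra-modal self-attention followed by
# inter-modal cross-attention for each modality. Dimensions are illustrative.
import torch
import torch.nn as nn

class DividedAttentionBlock(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.intra_v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.intra_t = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.inter_v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.inter_t = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, video, text):
        # Intra-modal: each modality attends over itself.
        video = video + self.intra_v(video, video, video)[0]
        text = text + self.intra_t(text, text, text)[0]
        # Inter-modal: each modality attends over the other.
        video = video + self.inter_v(video, text, text)[0]
        text = text + self.inter_t(text, video, video)[0]
        return video, text

block = DividedAttentionBlock()
v, t = block(torch.randn(2, 30, 256), torch.randn(2, 12, 256))  # (batch, tokens, dim)
```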
2021 IEEE Winter Conference on Applications of Computer Vision (WACV), 2021
The goal of weakly-supervised video moment retrieval is to localize the video segment most relevant to a description without access to temporal annotations during training. Prior work uses co-attention mechanisms to understand relationships between the vision and language data, but they lack contextual information between video frames that can be useful for determining how well a segment relates to the query. To address this, we propose an efficient Latent Graph Co-Attention Network (LoGAN) that exploits fine-grained frame-by-word interactions to jointly reason about the correspondences between all possible pairs of frames, providing context cues absent in prior work. Experiments on the DiDeMo and Charades-STA datasets demonstrate the effectiveness of our approach, where we improve Recall@1 by 5-20% over prior weakly-supervised methods, even boasting an 11% gain over strongly-supervised methods on DiDeMo, while also using significantly fewer model parameters than other co-attention mechanisms.
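A minimal sketch of the frame-by-word interaction idea follows, assuming simple cosine similarities between frame and word features; LoGAN's latent graph and full co-attention layers are not reproduced here.

```python
# Hedged sketch: a frame-by-word similarity matrix and word-attended frame features.
import torch
import torch.nn.functional as F

def frame_by_word(frames, words):
    """frames: (T, D) frame features; words: (L, D) word features."""
    sim = F.normalize(frames, dim=-1) @ F.normalize(words, dim=-1).t()  # (T, L)
    attn = sim.softmax(dim=-1)            # each frame attends over the query words
    attended = attn @ words               # (T, D) word-aware frame representations
    return sim, attended

sim, attended = frame_by_word(torch.randn(20, 128), torch.randn(8, 128))
```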
Given a video and a sentence, the goal of weakly-supervised video moment retrieval is to locate the video segment which is described by the sentence without having access to temporal annotations during training. Instead, a model must learn how to identify the correct segment (i.e., moment) when only being provided with video-sentence pairs. Thus, an inherent challenge is automatically inferring the latent correspondence between visual and language representations. To facilitate this alignment, we propose our Weakly-supervised Moment Alignment Network (wMAN), which exploits a multi-level co-attention mechanism to learn richer multimodal representations. The aforementioned mechanism is comprised of a Frame-By-Word interaction module as well as a novel Word-Conditioned Visual Graph (WCVG). Our approach also incorporates a novel application of positional encodings, commonly used in Transformers, to learn visual-semantic representations that contain contextual information of their relative...
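The positional encodings mentioned above are the standard sinusoidal ones from Transformers; a minimal sketch of adding them to frame features follows, with an arbitrary (even) feature dimension.

```python
# Hedged sketch: sinusoidal positional encodings added to frame features so segment
# representations carry relative temporal order. The dimension is an arbitrary example.
import math
import torch

def sinusoidal_positional_encoding(num_positions, dim):
    pos = torch.arange(num_positions).unsqueeze(1).float()
    div = torch.exp(torch.arange(0, dim, 2).float() * (-math.log(10000.0) / dim))
    pe = torch.zeros(num_positions, dim)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

frames = torch.randn(20, 128)                       # 20 frames, 128-d features
frames = frames + sinusoidal_positional_encoding(20, 128)
```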
Recent advancements in computer vision have provided us with increased capabilities in the field of sport analytics. Action recognition and quality assessment are some of the areas that have benefited from computer vision techniques. Reliable methods have been developed to provide coaches and players with information for making decisions during gameplay and for improving player technique. In basketball, systems can display a heat map for every player on the court as well as each player's best position in a lineup. Computer vision is also being applied to football games to track players' top speeds and positional changes. In this paper, we address the sport of baseball, where we seek to apply computer vision techniques to measure the speed and trajectory of the bat via automated tracking and optical flow algorithms. (U of I Only: undergraduate senior thesis not recommended for open access.)
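As a rough illustration of how optical flow can yield a speed estimate from video, here is a minimal OpenCV sketch. The video path, frame-rate fallback, percentile heuristic, and pixel-to-metre calibration are placeholders; the thesis's actual tracking pipeline may differ.

```python
# Hedged sketch: dense optical flow between consecutive frames, converted to an
# approximate speed using an assumed pixel-to-metre calibration.
import cv2
import numpy as np

cap = cv2.VideoCapture("swing.mp4")          # placeholder video file
fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
metres_per_pixel = 0.005                     # assumed calibration constant

ok, prev = cap.read()
if not ok:
    raise SystemExit("could not read video")
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    magnitude = np.linalg.norm(flow, axis=2)          # per-pixel displacement (pixels)
    peak_px_per_frame = np.percentile(magnitude, 99)  # fast-moving region, e.g. the bat
    speed_mps = peak_px_per_frame * fps * metres_per_pixel
    print(f"approx. peak speed: {speed_mps:.1f} m/s")
    prev_gray = gray
cap.release()
```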
2019 IEEE/CVF International Conference on Computer Vision (ICCV), 2019
Shouldn't language and vision features be treated equally in vision-language (VL) tasks? Many VL approaches treat the language component as an afterthought, using simple language models that are either built upon fixed word embeddings trained on text-only data or are learned from scratch. We believe that language features deserve more attention, and conduct experiments which compare different word embeddings, language models, and embedding augmentation steps on five common VL tasks: image-sentence retrieval, image captioning, visual question answering, phrase grounding, and text-to-clip retrieval. Our experiments provide some striking results: an average embedding language model outperforms an LSTM on retrieval-style tasks, and state-of-the-art representations such as BERT perform relatively poorly on vision-language tasks. From this comprehensive set of experiments we propose a set of best practices for incorporating the language component of VL tasks. To further elevate language features, we also show that knowledge in vision-language problems can be transferred across tasks to gain performance with multi-task training. This multi-task training is applied to a new Graph Oriented Vision-Language Embedding (GrOVLE), which we adapt from Word2Vec using WordNet and an original visual-language graph built from Visual Genome, providing a ready-to-use vision-language embedding: http://ai.bu.edu/grovle.
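The "average embedding" language model referred to above can be sketched as a mean over (padding-masked) word embeddings; the vocabulary size and dimension below are arbitrary examples rather than the paper's settings.

```python
# Hedged sketch: a sentence is represented by the mean of its word embeddings,
# ignoring padding tokens.
import torch
import torch.nn as nn

class AverageEmbedding(nn.Module):
    def __init__(self, vocab_size=10000, dim=300):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim, padding_idx=0)

    def forward(self, token_ids):
        """token_ids: (B, L) padded word indices; returns (B, dim) sentence vectors."""
        emb = self.embed(token_ids)                       # (B, L, dim)
        mask = (token_ids != 0).unsqueeze(-1).float()     # ignore padding positions
        return (emb * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)

model = AverageEmbedding()
sentence_vec = model(torch.randint(1, 10000, (2, 12)))
```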
2019 IEEE/CVF International Conference on Computer Vision (ICCV), 2019
Many real-world tasks require models to compare images along multiple similarity conditions (e.g., similarity in color, category, or shape). Existing methods often reason about these complex similarity relationships by learning condition-aware embeddings. While such embeddings aid models in learning different notions of similarity, they also limit their capability to generalize to unseen categories since they require explicit labels at test time. To address this deficiency, we propose an approach that jointly learns representations for the different similarity conditions and their contributions as a latent variable without explicit supervision. Comprehensive experiments across three datasets, Polyvore-Outfits, Maryland-Polyvore and UT-Zappos50k, demonstrate the effectiveness of our approach: our model outperforms state-of-the-art methods, even those that are strongly supervised with pre-defined similarity conditions, on fill-in-the-blank, outfit compatibility prediction, and triplet prediction tasks. Finally, we show that our model learns different visually-relevant semantic sub-spaces that allow it to generalize well to unseen categories.
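One way to realize condition-aware embeddings whose condition weights are latent rather than supervised is sketched below: per-condition projections combined by a learned softmax over relevance scores. Dimensions, the number of conditions, and the distance formulation are illustrative assumptions, not necessarily the paper's formulation.

```python
# Hedged sketch: condition-specific subspaces with latent (unsupervised) condition
# weights; distances are weighted sums over the condition subspaces.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentConditionEmbedding(nn.Module):
    def __init__(self, in_dim=512, embed_dim=64, num_conditions=5):
        super().__init__()
        self.projections = nn.ModuleList(
            [nn.Linear(in_dim, embed_dim) for _ in range(num_conditions)])
        self.condition_scorer = nn.Linear(in_dim, num_conditions)

    def forward(self, x):
        """x: (B, in_dim) image features; returns (B, C, embed_dim) condition
        embeddings and (B, C) latent condition weights."""
        embeds = torch.stack([p(x) for p in self.projections], dim=1)
        weights = F.softmax(self.condition_scorer(x), dim=-1)
        return embeds, weights

def weighted_distance(embeds_a, weights_a, embeds_b):
    # Squared distance per condition, combined with the learned latent weights.
    per_condition = (embeds_a - embeds_b).pow(2).sum(dim=-1)   # (B, C)
    return (weights_a * per_condition).sum(dim=-1)

model = LatentConditionEmbedding()
ea, wa = model(torch.randn(4, 512))
eb, _ = model(torch.randn(4, 512))
dist = weighted_distance(ea, wa, eb)
```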
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020
Large-scale dissemination of disinformation online intended to mislead or deceive the general population is a major societal problem. Rapid progress in image, video, and natural language generative models has only exacerbated this situation and intensified our need for an effective defense mechanism. While existing approaches have been proposed to defend against neural fake news, they are generally constrained to the very limited setting where articles have only text and metadata such as the title and authors. In this paper, we introduce the more realistic and challenging task of defending against machine-generated news that also includes images and captions. To identify the possible weaknesses that adversaries can exploit, we create the NeuralNews dataset, composed of four different types of generated articles, and conduct a series of human user studies based on it. In addition to the valuable insights gleaned from our user studies, we provide a relatively effective approach based on detecting visual-semantic inconsistencies, which will serve as an effective first line of defense and a useful reference for future work in defending against machine-generated disinformation.
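To make the idea of scoring visual-semantic (in)consistency concrete, here is a hedged sketch that uses OpenAI's CLIP to score image-caption agreement and flag low-scoring pairs. CLIP, the file path, and the threshold are assumptions for illustration and not necessarily the detector proposed in the paper.

```python
# Hedged sketch: flag articles whose image-caption similarity is suspiciously low,
# using a pretrained image-text model (CLIP) as the scorer.
import torch
from PIL import Image
import clip  # https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def consistency_score(image_path, caption):
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    text = clip.tokenize([caption]).to(device)
    with torch.no_grad():
        img_emb = model.encode_image(image)
        txt_emb = model.encode_text(text)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    return (img_emb @ txt_emb.t()).item()

# "article_photo.jpg" is a placeholder; the threshold is an assumption to be tuned
# on validation data.
score = consistency_score("article_photo.jpg", "A crowd gathers outside parliament.")
flagged = score < 0.2
```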