In this paper, we consider the problem of learning a visual representation from the raw spatiotem... more In this paper, we consider the problem of learning a visual representation from the raw spatiotemporal signals in videos for use in action recognition. Our representation is learned without supervision from semantic labels. We formulate it as an unsupervised sequential verification task, i.e., we determine whether a sequence of frames from a video is in the correct temporal order. With this simple task and no semantic labels, we learn a powerful unsupervised representation using a Convolutional Neural Network (CNN). The representation contains complementary information to that learned from supervised image datasets like ImageNet. Qualitative results show that our method captures information that is temporally varying, such as human pose. When used as pre-training for action recognition, our method gives significant gains over learning without external data on benchmark datasets like UCF101 and HMDB51. Our method can also be combined with supervised representations to provide an addi...
2019 IEEE/CVF International Conference on Computer Vision (ICCV), 2019
(b) Results (a) Overview of our method Translation Rotation Scale Instance BBoxes Union (pair) BB... more (b) Results (a) Overview of our method Translation Rotation Scale Instance BBoxes Union (pair) BBox Object Encoder Relative Encoder Final Prediction Image with detections Per object Per pair f Figure 1: (a) Approach Overview: We study the problem of layout estimation in 3D by reasoning about relationships between objects. Given an image and object detection boxes, we first predict the 3D pose (translation, rotation, scale) of each object and the relative pose between each pair of objects. We combine these predictions and ensure consistent relationships between objects to predict the final 3D pose of each object. (b) Output: An example result of our method that takes as input the 2D image and generates the 3D layout.
2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020
Pre-training convolutional neural networks with weaklysupervised and self-supervised strategies i... more Pre-training convolutional neural networks with weaklysupervised and self-supervised strategies is becoming increasingly popular for several computer vision tasks. However, due to the lack of strong discriminative signals, these learned representations may overfit to the pre-training objective (e.g., hashtag prediction) and not generalize well to downstream tasks. In this work, we present a simple strategy -ClusterFit (CF) to improve the robustness of the visual representations learned during pre-training. Given a dataset, we (a) cluster its features extracted from a pretrained network using k-means and (b) re-train a new network from scratch on this dataset using cluster assignments as pseudo-labels. We empirically show that clustering helps reduce the pre-training task-specific information from the extracted features thereby minimizing overfitting to the same. Our approach is extensible to different pretraining frameworks -weak-and self-supervised, modalities -images and videos, and pre-training tasks -object and action classification. Through extensive transfer learning experiments on 11 different target datasets of varied vocabularies and granularities, we show that CF significantly improves the representation quality compared to the state-ofthe-art large-scale (millions / billions) weakly-supervised image and video models and self-supervised image models.
2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017
Compositionality and contextuality are key building blocks of intelligence. They allow us to comp... more Compositionality and contextuality are key building blocks of intelligence. They allow us to compose known concepts to generate new and complex ones. However, traditional learning methods do not model both these properties and require copious amounts of labeled data to learn new concepts. A large fraction of existing techniques, e.g. using late fusion, compose concepts but fail to model contextuality. For example, red in red wine is different from red in red tomatoes. In this paper, we present a simple method that respects contextuality in order to compose classifiers of known visual concepts. Our method builds upon the intuition that classifiers lie in a smooth space where compositional transforms can be modeled. We show how it can generalize to unseen combinations of concepts. Our results on composing attributes, objects as well as composing subject, predicate, and objects demonstrate its strong generalization performance compared to baselines. Finally, we present detailed analysis of our method and highlight its properties.
In this paper, we present an approach for learning a visual representation from the raw spatiotem... more In this paper, we present an approach for learning a visual representation from the raw spatiotemporal signals in videos. Our representation is learned without supervision from semantic labels. We formulate our method as an unsupervised sequential verification task, i.e., we determine whether a sequence of frames from a video is in the correct temporal order. With this simple task and no semantic labels, we learn a powerful visual representation using a Convolutional Neural Network (CNN). The representation contains complementary information to that learned from supervised image datasets like ImageNet. Qualitative results show that our method captures information that is temporally varying, such as human pose. When used as pre-training for action recognition, our method gives significant gains over learning without external data on benchmark datasets like UCF101 and HMDB51. To demonstrate its sensitivity to human pose, we show results for pose estimation on the FLIC and MPII datasets that are competitive, or better than approaches using significantly more supervision. Our method can be combined with supervised representations to provide an additional boost in accuracy.
2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016
Multi-task learning in Convolutional Networks has displayed remarkable success in the field of re... more Multi-task learning in Convolutional Networks has displayed remarkable success in the field of recognition. This success can be largely attributed to learning shared representations from multiple supervisory tasks. However, existing multi-task approaches rely on enumerating multiple network architectures specific to the tasks at hand, that do not generalize. In this paper, we propose a principled approach to learn shared representations in ConvNets using multitask learning. Specifically, we propose a new sharing unit: "cross-stitch" unit. These units combine the activations from multiple networks and can be trained end-to-end. A network with cross-stitch units can learn an optimal combination of shared and task-specific representations. Our proposed method generalizes across multiple tasks and shows dramatically improved performance over baseline methods for categories with few training examples.
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2016
There has been an explosion of work in the vision & language community during the past few years ... more There has been an explosion of work in the vision & language community during the past few years from image captioning to video transcription, and answering questions about images. These tasks have focused on literal descriptions of the image. To move beyond the literal, we choose to explore how questions about an image are often directed at commonsense inference and the abstract events evoked by objects in the image. In this paper, we introduce the novel task of Visual Question Generation (VQG), where the system is tasked with asking a natural and engaging question when shown an image. We provide three datasets which cover a variety of images from object-centric to event-centric, with considerably more abstract training data than provided to state-of-the-art captioning systems thus far. We train and test several generative and retrieval models to tackle the task of VQG. Evaluation results show that while such models ask reasonable questions for a variety of images, there is still a wide gap with human performance which motivates further work on connecting images with commonsense knowledge and pragmatics. Our proposed task offers a new challenge to the community which we hope furthers interest in exploring deeper connections between vision & language.
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2016
We introduce the first dataset for sequential vision-to-language, and explore how this data may b... more We introduce the first dataset for sequential vision-to-language, and explore how this data may be used for the task of visual storytelling. The first release of this dataset, SIND 1 v.1, includes 81,743 unique photos in 20,211 sequences, aligned to both descriptive (caption) and story language. We establish several strong baselines for the storytelling task, and motivate an automatic metric to benchmark progress. Modelling concrete description as well as figurative and social language, as provided in this dataset and the storytelling task, has the potential to move artificial intelligence from basic understandings of typical visual scenes towards more and more human-like understanding of grounded event structure and subjective expression.
Current computer vision systems rely primarily on fixed models learned in a supervised fashion, i... more Current computer vision systems rely primarily on fixed models learned in a supervised fashion, i.e., with extensive manually labelled data. This is appropriate in scenarios in which the information about all the possible visual queries can be anticipated in advance, but it does not scale to scenarios in which new objects need to be added during the operation of the system, as in dynamic interaction with UGVs. For example, the user might have found a new type of object of interest, e.g., a particular vehicle, which needs to be added to the system right away. The supervised approach is not practical to acquire extensive data and to annotate it. In this paper, we describe techniques for rapidly updating or creating models using sparsely labelled data. The techniques address scenarios in which only a few annotated training samples are available and need to be used to generate models suitable for recognition. These approaches are crucial for on-the-fly insertion of models by users and on-line learning.
2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015
We present a semi-supervised approach that localizes multiple unknown object instances in long vi... more We present a semi-supervised approach that localizes multiple unknown object instances in long videos. We start with a handful of labeled boxes and iteratively learn and label hundreds of thousands of object instances. We propose criteria for reliable object detection and tracking for constraining the semi-supervised learning process and minimizing semantic drift. Our approach does not assume exhaustive labeling of each object instance in any single frame, or any explicit annotation of negative data. Working in such a generic setting allow us to tackle multiple object instances in video, many of which are static. In contrast, existing approaches either do not consider multiple object instances per video, or rely heavily on the motion of the objects present. The experiments demonstrate the effectiveness of our approach by evaluating the automatically labeled data on a variety of metrics like quality, coverage (recall), diversity, and relevance to training an object detector.
IEEE Winter Conference on Applications of Computer Vision, 2014
We consider the problem of discovering discriminative exemplars suitable for object detection. Du... more We consider the problem of discovering discriminative exemplars suitable for object detection. Due to the diversity in appearance in real world objects, an object detector must capture variations in scale, viewpoint, illumination etc. The current approaches do this by using mixtures of models, where each mixture is designed to capture one (or a few) axis of variation. Current methods usually rely on heuristics to capture these variations; however, it is unclear which axes of variation exist and are relevant to a particular task. Another issue is the requirement of a large set of training images to capture such variations. Current methods do not scale to large training sets either because of training time complexity or test time complexity . In this work, we explore the idea of compactly capturing taskappropriate variation from the data itself. We propose a two stage data-driven process, which selects and learns a compact set of exemplar models for object detection. These selected models have an inherent ranking, which can be used for anytime/budgeted detection scenarios. Another benefit of our approach (beyond the computational speedup) is that the selected set of exemplar models performs better than the entire set.
Parallel computing using accelerators has gained widespread research attention in the past few ye... more Parallel computing using accelerators has gained widespread research attention in the past few years. In particular, using GPUs for general purpose computing has brought forth several success stories with respect to time taken, cost, power, and other metrics. However, accelerator based computing has significantly relegated the role of CPUs in computation. As CPUs evolve and also offer matching computational resources, it is important to also include CPUs in the computation. We call this the hybrid computing model. Indeed, most computer systems of the present age offer a degree of heterogeneity and therefore such a model is quite natural. We reevaluate the claim of a recent paper by Lee et al. (ISCA 2010). We argue that the right question arising out of Lee et al. (ISCA 2010) should be how to use a CPU+GPU platform efficiently, instead of whether one should use a CPU or a GPU exclusively. To this end, we experiment with a set of 13 diverse workloads ranging from databases, image processing, sparse matrix kernels, and graphs. We experiment with two different hybrid platforms: one consisting of a 6-core Intel i7-980X CPU and an NVidia Tesla T10 GPU, and another consisting of an Intel E7400 dual core CPU with an NVidia GT520 GPU. On both these platforms, we show that hybrid solutions offer good advantage over CPU or GPU alone solutions. On both these platforms, we also show that our solutions are 90% resource efficient on average. Our work therefore suggests that hybrid computing can offer tremendous advantages at not only research-scale platforms but also the more realistic scale systems with significant performance gains and resource efficiency to the large scale user community.
In this paper, we consider the problem of learning a visual representation from the raw spatiotem... more In this paper, we consider the problem of learning a visual representation from the raw spatiotemporal signals in videos for use in action recognition. Our representation is learned without supervision from semantic labels. We formulate it as an unsupervised sequential verification task, i.e., we determine whether a sequence of frames from a video is in the correct temporal order. With this simple task and no semantic labels, we learn a powerful unsupervised representation using a Convolutional Neural Network (CNN). The representation contains complementary information to that learned from supervised image datasets like ImageNet. Qualitative results show that our method captures information that is temporally varying, such as human pose. When used as pre-training for action recognition, our method gives significant gains over learning without external data on benchmark datasets like UCF101 and HMDB51. Our method can also be combined with supervised representations to provide an addi...
2019 IEEE/CVF International Conference on Computer Vision (ICCV), 2019
(b) Results (a) Overview of our method Translation Rotation Scale Instance BBoxes Union (pair) BB... more (b) Results (a) Overview of our method Translation Rotation Scale Instance BBoxes Union (pair) BBox Object Encoder Relative Encoder Final Prediction Image with detections Per object Per pair f Figure 1: (a) Approach Overview: We study the problem of layout estimation in 3D by reasoning about relationships between objects. Given an image and object detection boxes, we first predict the 3D pose (translation, rotation, scale) of each object and the relative pose between each pair of objects. We combine these predictions and ensure consistent relationships between objects to predict the final 3D pose of each object. (b) Output: An example result of our method that takes as input the 2D image and generates the 3D layout.
2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020
Pre-training convolutional neural networks with weaklysupervised and self-supervised strategies i... more Pre-training convolutional neural networks with weaklysupervised and self-supervised strategies is becoming increasingly popular for several computer vision tasks. However, due to the lack of strong discriminative signals, these learned representations may overfit to the pre-training objective (e.g., hashtag prediction) and not generalize well to downstream tasks. In this work, we present a simple strategy -ClusterFit (CF) to improve the robustness of the visual representations learned during pre-training. Given a dataset, we (a) cluster its features extracted from a pretrained network using k-means and (b) re-train a new network from scratch on this dataset using cluster assignments as pseudo-labels. We empirically show that clustering helps reduce the pre-training task-specific information from the extracted features thereby minimizing overfitting to the same. Our approach is extensible to different pretraining frameworks -weak-and self-supervised, modalities -images and videos, and pre-training tasks -object and action classification. Through extensive transfer learning experiments on 11 different target datasets of varied vocabularies and granularities, we show that CF significantly improves the representation quality compared to the state-ofthe-art large-scale (millions / billions) weakly-supervised image and video models and self-supervised image models.
2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017
Compositionality and contextuality are key building blocks of intelligence. They allow us to comp... more Compositionality and contextuality are key building blocks of intelligence. They allow us to compose known concepts to generate new and complex ones. However, traditional learning methods do not model both these properties and require copious amounts of labeled data to learn new concepts. A large fraction of existing techniques, e.g. using late fusion, compose concepts but fail to model contextuality. For example, red in red wine is different from red in red tomatoes. In this paper, we present a simple method that respects contextuality in order to compose classifiers of known visual concepts. Our method builds upon the intuition that classifiers lie in a smooth space where compositional transforms can be modeled. We show how it can generalize to unseen combinations of concepts. Our results on composing attributes, objects as well as composing subject, predicate, and objects demonstrate its strong generalization performance compared to baselines. Finally, we present detailed analysis of our method and highlight its properties.
In this paper, we present an approach for learning a visual representation from the raw spatiotem... more In this paper, we present an approach for learning a visual representation from the raw spatiotemporal signals in videos. Our representation is learned without supervision from semantic labels. We formulate our method as an unsupervised sequential verification task, i.e., we determine whether a sequence of frames from a video is in the correct temporal order. With this simple task and no semantic labels, we learn a powerful visual representation using a Convolutional Neural Network (CNN). The representation contains complementary information to that learned from supervised image datasets like ImageNet. Qualitative results show that our method captures information that is temporally varying, such as human pose. When used as pre-training for action recognition, our method gives significant gains over learning without external data on benchmark datasets like UCF101 and HMDB51. To demonstrate its sensitivity to human pose, we show results for pose estimation on the FLIC and MPII datasets that are competitive, or better than approaches using significantly more supervision. Our method can be combined with supervised representations to provide an additional boost in accuracy.
2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016
Multi-task learning in Convolutional Networks has displayed remarkable success in the field of re... more Multi-task learning in Convolutional Networks has displayed remarkable success in the field of recognition. This success can be largely attributed to learning shared representations from multiple supervisory tasks. However, existing multi-task approaches rely on enumerating multiple network architectures specific to the tasks at hand, that do not generalize. In this paper, we propose a principled approach to learn shared representations in ConvNets using multitask learning. Specifically, we propose a new sharing unit: "cross-stitch" unit. These units combine the activations from multiple networks and can be trained end-to-end. A network with cross-stitch units can learn an optimal combination of shared and task-specific representations. Our proposed method generalizes across multiple tasks and shows dramatically improved performance over baseline methods for categories with few training examples.
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2016
There has been an explosion of work in the vision & language community during the past few years ... more There has been an explosion of work in the vision & language community during the past few years from image captioning to video transcription, and answering questions about images. These tasks have focused on literal descriptions of the image. To move beyond the literal, we choose to explore how questions about an image are often directed at commonsense inference and the abstract events evoked by objects in the image. In this paper, we introduce the novel task of Visual Question Generation (VQG), where the system is tasked with asking a natural and engaging question when shown an image. We provide three datasets which cover a variety of images from object-centric to event-centric, with considerably more abstract training data than provided to state-of-the-art captioning systems thus far. We train and test several generative and retrieval models to tackle the task of VQG. Evaluation results show that while such models ask reasonable questions for a variety of images, there is still a wide gap with human performance which motivates further work on connecting images with commonsense knowledge and pragmatics. Our proposed task offers a new challenge to the community which we hope furthers interest in exploring deeper connections between vision & language.
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2016
We introduce the first dataset for sequential vision-to-language, and explore how this data may b... more We introduce the first dataset for sequential vision-to-language, and explore how this data may be used for the task of visual storytelling. The first release of this dataset, SIND 1 v.1, includes 81,743 unique photos in 20,211 sequences, aligned to both descriptive (caption) and story language. We establish several strong baselines for the storytelling task, and motivate an automatic metric to benchmark progress. Modelling concrete description as well as figurative and social language, as provided in this dataset and the storytelling task, has the potential to move artificial intelligence from basic understandings of typical visual scenes towards more and more human-like understanding of grounded event structure and subjective expression.
Current computer vision systems rely primarily on fixed models learned in a supervised fashion, i... more Current computer vision systems rely primarily on fixed models learned in a supervised fashion, i.e., with extensive manually labelled data. This is appropriate in scenarios in which the information about all the possible visual queries can be anticipated in advance, but it does not scale to scenarios in which new objects need to be added during the operation of the system, as in dynamic interaction with UGVs. For example, the user might have found a new type of object of interest, e.g., a particular vehicle, which needs to be added to the system right away. The supervised approach is not practical to acquire extensive data and to annotate it. In this paper, we describe techniques for rapidly updating or creating models using sparsely labelled data. The techniques address scenarios in which only a few annotated training samples are available and need to be used to generate models suitable for recognition. These approaches are crucial for on-the-fly insertion of models by users and on-line learning.
2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015
We present a semi-supervised approach that localizes multiple unknown object instances in long vi... more We present a semi-supervised approach that localizes multiple unknown object instances in long videos. We start with a handful of labeled boxes and iteratively learn and label hundreds of thousands of object instances. We propose criteria for reliable object detection and tracking for constraining the semi-supervised learning process and minimizing semantic drift. Our approach does not assume exhaustive labeling of each object instance in any single frame, or any explicit annotation of negative data. Working in such a generic setting allow us to tackle multiple object instances in video, many of which are static. In contrast, existing approaches either do not consider multiple object instances per video, or rely heavily on the motion of the objects present. The experiments demonstrate the effectiveness of our approach by evaluating the automatically labeled data on a variety of metrics like quality, coverage (recall), diversity, and relevance to training an object detector.
IEEE Winter Conference on Applications of Computer Vision, 2014
We consider the problem of discovering discriminative exemplars suitable for object detection. Du... more We consider the problem of discovering discriminative exemplars suitable for object detection. Due to the diversity in appearance in real world objects, an object detector must capture variations in scale, viewpoint, illumination etc. The current approaches do this by using mixtures of models, where each mixture is designed to capture one (or a few) axis of variation. Current methods usually rely on heuristics to capture these variations; however, it is unclear which axes of variation exist and are relevant to a particular task. Another issue is the requirement of a large set of training images to capture such variations. Current methods do not scale to large training sets either because of training time complexity or test time complexity . In this work, we explore the idea of compactly capturing taskappropriate variation from the data itself. We propose a two stage data-driven process, which selects and learns a compact set of exemplar models for object detection. These selected models have an inherent ranking, which can be used for anytime/budgeted detection scenarios. Another benefit of our approach (beyond the computational speedup) is that the selected set of exemplar models performs better than the entire set.
Parallel computing using accelerators has gained widespread research attention in the past few ye... more Parallel computing using accelerators has gained widespread research attention in the past few years. In particular, using GPUs for general purpose computing has brought forth several success stories with respect to time taken, cost, power, and other metrics. However, accelerator based computing has significantly relegated the role of CPUs in computation. As CPUs evolve and also offer matching computational resources, it is important to also include CPUs in the computation. We call this the hybrid computing model. Indeed, most computer systems of the present age offer a degree of heterogeneity and therefore such a model is quite natural. We reevaluate the claim of a recent paper by Lee et al. (ISCA 2010). We argue that the right question arising out of Lee et al. (ISCA 2010) should be how to use a CPU+GPU platform efficiently, instead of whether one should use a CPU or a GPU exclusively. To this end, we experiment with a set of 13 diverse workloads ranging from databases, image processing, sparse matrix kernels, and graphs. We experiment with two different hybrid platforms: one consisting of a 6-core Intel i7-980X CPU and an NVidia Tesla T10 GPU, and another consisting of an Intel E7400 dual core CPU with an NVidia GT520 GPU. On both these platforms, we show that hybrid solutions offer good advantage over CPU or GPU alone solutions. On both these platforms, we also show that our solutions are 90% resource efficient on average. Our work therefore suggests that hybrid computing can offer tremendous advantages at not only research-scale platforms but also the more realistic scale systems with significant performance gains and resource efficiency to the large scale user community.
Uploads
Papers by ishan misra