2021
The goal of the Acoustic Question Answering (AQA) task is to answer a free-form text question about the content of an acoustic scene. It was inspired by the Visual Question Answering (VQA) task. In this paper, based on the previously introduced CLEAR dataset, we propose a new benchmark for AQA that emphasizes the specific challenges of acoustic inputs, e.g., variable-duration scenes. We also introduce NAAQA, a neural architecture that leverages specific properties of acoustic inputs. Using 1D convolutions in time and frequency to process 2D spectro-temporal representations of acoustic content shows promising results and enables reductions in model complexity. NAAQA achieves 91.6% accuracy on the AQA task with about 7 times fewer parameters than the previously explored VQA model. We provide a detailed analysis of the results for the different question types. We also study the effectiveness of coordinate maps in this acoustic context and show that time coordinate maps augm...
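The abstract only sketches the architecture at a high level; as a hedged illustration (not the authors' exact NAAQA implementation), processing a spectrogram with separate 1D convolutions along the time and frequency axes, plus a time coordinate map channel, could look roughly like the following PyTorch snippet, where every layer size is an assumption:

```python
import torch
import torch.nn as nn

class TimeFreqBlock(nn.Module):
    """Illustrative block: separate 1D convolutions along the time and
    frequency axes of a (batch, channels, freq, time) spectrogram, plus an
    optional time coordinate map channel. All layer sizes are arbitrary."""

    def __init__(self, in_ch: int, out_ch: int, use_time_coords: bool = True):
        super().__init__()
        self.use_time_coords = use_time_coords
        in_total = in_ch + (1 if use_time_coords else 0)
        # Kernel (1, k): convolves along time only; (k, 1): along frequency only.
        self.time_conv = nn.Conv2d(in_total, out_ch, kernel_size=(1, 5), padding=(0, 2))
        self.freq_conv = nn.Conv2d(out_ch, out_ch, kernel_size=(5, 1), padding=(2, 0))
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.use_time_coords:
            b, _, f, t = x.shape
            # Linear ramp in [-1, 1] along the time axis, broadcast over frequency.
            coords = torch.linspace(-1.0, 1.0, t, device=x.device)
            coords = coords.view(1, 1, 1, t).expand(b, 1, f, t)
            x = torch.cat([x, coords], dim=1)
        return self.act(self.freq_conv(self.act(self.time_conv(x))))

if __name__ == "__main__":
    spec = torch.randn(2, 1, 64, 300)   # (batch, channel, mel bins, time frames)
    out = TimeFreqBlock(1, 16)(spec)
    print(out.shape)                     # torch.Size([2, 16, 64, 300])
```

Decomposing the 2D convolution into a time-only and a frequency-only pass is what enables the parameter reduction mentioned in the abstract; the coordinate channel simply gives the filters an explicit notion of where in the scene they are operating.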
Turkish Journal of Computer and Mathematics Education (TURCOMAT)
The ability of a computer system to understand its surroundings and process information the way a human being does has always been a major focus in the field of Computer Science. One way to achieve this artificial intelligence is Visual Question Answering. Visual Question Answering (VQA) is a trained system that can answer, in natural language, questions associated with a given image. VQA is a generalized system that can be used in any image-based scenario with adequate training on the relevant data. This is achieved with the help of neural networks, particularly Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs). In this study, we compare different approaches to VQA, of which we explore a CNN-based model. With continued progress in the fields of computer vision and question answering systems, Visual Question Answering is becoming an essential system that can handle multiple scenarios with their re...
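Since the abstract describes the CNN + RNN recipe only in words, here is a minimal, hedged sketch of that kind of pipeline (not the study's actual model; all dimensions and the answer vocabulary are placeholders):

```python
import torch
import torch.nn as nn

class SimpleVQA(nn.Module):
    """Minimal CNN + RNN baseline in the spirit described above (illustrative
    only; the vocabulary, layer sizes and answer set are made-up placeholders)."""

    def __init__(self, vocab_size=10000, num_answers=1000, emb=300, hid=512):
        super().__init__()
        # Tiny CNN standing in for a pretrained image encoder.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, hid),
        )
        self.embed = nn.Embedding(vocab_size, emb)
        self.rnn = nn.LSTM(emb, hid, batch_first=True)
        self.classifier = nn.Linear(2 * hid, num_answers)

    def forward(self, image, question_tokens):
        img_feat = self.cnn(image)                          # (B, hid)
        _, (h, _) = self.rnn(self.embed(question_tokens))   # h: (1, B, hid)
        fused = torch.cat([img_feat, h.squeeze(0)], dim=1)  # concatenate modalities
        return self.classifier(fused)

if __name__ == "__main__":
    model = SimpleVQA()
    logits = model(torch.randn(2, 3, 224, 224), torch.randint(0, 10000, (2, 12)))
    print(logits.shape)  # torch.Size([2, 1000])
```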
2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017
Vision and language understanding has emerged as a subject undergoing intense study in Artificial Intelligence. Among many tasks in this line of research, visual question answering (VQA) has been one of the most successful ones, where the goal is to learn a model that understands visual content at region-level details and finds their associations with pairs of questions and answers in the natural language form. Despite the rapid progress in the past few years, most existing work in VQA has focused primarily on images. In this paper, we focus on extending VQA to the video domain and contribute to the literature in three important ways. First, we propose three new tasks designed specifically for video VQA, which require spatio-temporal reasoning from videos to answer questions correctly. Next, we introduce a new large-scale dataset for video VQA named TGIF-QA that extends existing VQA work with our new tasks. Finally, we propose a dual-LSTM based approach with both spatial and temporal attention, and show its effectiveness over conventional VQA techniques through empirical evaluations.
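As a rough, hedged illustration of the temporal-attention idea mentioned above (not the TGIF-QA dual-LSTM model itself), a question vector attending over LSTM-encoded frame features might look like the following sketch, where every dimension is an assumption:

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Illustrative temporal attention for video QA: a question vector attends
    over per-frame features encoded by an LSTM. Not the TGIF-QA architecture."""

    def __init__(self, feat_dim=512, hid=256):
        super().__init__()
        self.frame_lstm = nn.LSTM(feat_dim, hid, batch_first=True)
        self.score = nn.Linear(2 * hid, 1)
        self.out = nn.Linear(2 * hid, hid)

    def forward(self, frame_feats, question_vec):
        # frame_feats: (B, T, feat_dim), question_vec: (B, hid)
        enc, _ = self.frame_lstm(frame_feats)                     # (B, T, hid)
        q = question_vec.unsqueeze(1).expand(-1, enc.size(1), -1)
        alpha = torch.softmax(self.score(torch.cat([enc, q], -1)).squeeze(-1), dim=1)
        summary = (alpha.unsqueeze(-1) * enc).sum(dim=1)          # attended frames
        return self.out(torch.cat([summary, question_vec], dim=-1))

if __name__ == "__main__":
    att = TemporalAttention()
    fused = att(torch.randn(2, 30, 512), torch.randn(2, 256))
    print(fused.shape)  # torch.Size([2, 256])
```

Spatial attention would apply the same scoring idea within each frame, over region features rather than over time steps.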
2018
Visual Question Answering (VQA) requires AI models to comprehend data in two domains, vision and text. Current state-of-the-art models use learned attention mechanisms to extract relevant information from the input domains to answer a certain question. Thus, robust attention mechanisms are essential for powerful VQA models. In this paper, we propose a recurrent attention mechanism and show its benefits compared to the traditional convolutional approach. We perform two ablation studies to evaluate recurrent attention. First, we introduce a baseline VQA model with visual attention and test the performance difference between convolutional and recurrent attention on the VQA 2.0 dataset. Second, we design an architecture for VQA which utilizes dual (textual and visual) Recurrent Attention Units (RAUs). Using this model, we show the effect of all possible combinations of recurrent and convolutional dual attention. Our single model outperforms the first-place winner of the VQA 2016 challenge and, to the best of our knowledge, is the second-best-performing single model on the VQA 1.0 dataset. Furthermore, our model noticeably improves upon the winner of the VQA 2017 challenge. Moreover, we experiment with replacing attention mechanisms in state-of-the-art models with our RAUs and show increased performance.
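To make the convolutional-versus-recurrent attention contrast concrete, the following hedged sketch (not the paper's RAU) scores image regions once with a per-region linear projection and once with a GRU that scans the regions in order; all shapes and modules are assumptions:

```python
import torch
import torch.nn as nn

def conv_attention(regions, question, proj):
    """Convolutional-style attention: a 1x1-conv-like linear projection scores
    each question-modulated image region independently."""
    # regions: (B, N, D), question: (B, D)
    scores = proj(regions * question.unsqueeze(1)).squeeze(-1)   # (B, N)
    return torch.softmax(scores, dim=1)

def recurrent_attention(regions, question, gru, proj):
    """Recurrent-style attention: a GRU scans the regions in sequence, so each
    region's score can depend on the regions already seen."""
    states, _ = gru(regions * question.unsqueeze(1))             # (B, N, H)
    return torch.softmax(proj(states).squeeze(-1), dim=1)

if __name__ == "__main__":
    B, N, D, H = 2, 49, 512, 256
    regions, question = torch.randn(B, N, D), torch.randn(B, D)
    w_conv = conv_attention(regions, question, nn.Linear(D, 1))
    w_rec = recurrent_attention(regions, question,
                                nn.GRU(D, H, batch_first=True), nn.Linear(H, 1))
    print(w_conv.shape, w_rec.shape)  # torch.Size([2, 49]) torch.Size([2, 49])
```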
Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, 2017
In this paper, we present a simple, yet effective, attention and memory mechanism that is reminiscent of Memory Networks and we demonstrate it in question-answering scenarios. Our mechanism is based on four simple premises: a) memories can be formed from word sequences by using convolutional networks; b) distance measurements can be taken at a neuronal level; c) a recursive softmax function can be used for attention; d) extensive weight sharing can help profoundly. We achieve state-of-the-art results in the bAbI tasks, outperforming both Memory Networks and the Differentiable Neural Computer, both in terms of accuracy and stability (i.e. variance) of results.
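The four premises can be illustrated loosely in code; the following sketch is only a toy reading of them (convolutional memories, a per-neuron distance, softmax attention, shared encoders), not the paper's architecture, and every size is an assumption:

```python
import torch
import torch.nn as nn

class ConvMemoryAttention(nn.Module):
    """Loose illustration of the premises above: memories formed by a
    convolution over word embeddings, a per-neuron distance between query and
    memories, and softmax attention over memories. Not the paper's model."""

    def __init__(self, vocab=5000, emb=64, mem=64):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        # Weight sharing: the same embedding/conv encode both story and question.
        self.conv = nn.Conv1d(emb, mem, kernel_size=3, padding=1)

    def encode(self, tokens):
        e = self.embed(tokens).transpose(1, 2)        # (B, emb, L)
        return self.conv(e).transpose(1, 2)           # (B, L, mem)

    def forward(self, story, question):
        memories = self.encode(story)                          # (B, Ls, mem)
        query = self.encode(question).mean(dim=1)              # (B, mem)
        # Per-neuron distance between the query and each memory slot.
        dist = (memories - query.unsqueeze(1)).abs().sum(-1)   # (B, Ls)
        attn = torch.softmax(-dist, dim=1)                     # closer => more weight
        return (attn.unsqueeze(-1) * memories).sum(dim=1)      # attended memory

if __name__ == "__main__":
    m = ConvMemoryAttention()
    out = m(torch.randint(0, 5000, (2, 40)), torch.randint(0, 5000, (2, 8)))
    print(out.shape)  # torch.Size([2, 64])
```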
Computer Vision and Image Understanding
Visual Question Answering (VQA) is a recent problem in computer vision and natural language processing that has garnered a large amount of interest from the deep learning, computer vision, and natural language processing communities. In VQA, an algorithm needs to answer text-based questions about images. Since the release of the first VQA dataset in 2014, additional datasets have been released and many algorithms have been proposed. In this review, we critically examine the current state of VQA in terms of problem formulation, existing datasets, evaluation metrics, and algorithms. In particular, we discuss the limitations of current datasets with regard to their ability to properly train and assess VQA algorithms. We then exhaustively review existing algorithms for VQA. Finally, we discuss possible future directions for VQA and image understanding research.
Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2021
Multimodal IR, spanning text corpora, knowledge graphs and images, called outside knowledge visual question answering (OKVQA), has attracted much recent interest. However, the popular dataset has serious limitations. A surprisingly large fraction of queries do not assess the ability to integrate cross-modal information. Instead, some are independent of the image, some depend on speculation, and some require OCR or are otherwise answerable from the image alone. To add to the above limitations, frequency-based guessing is very effective because of (unintended) widespread answer overlaps between the train and test folds. Overall, it is hard to determine when state-of-the-art systems exploit these weaknesses rather than really infer the answers, because they are opaque and their 'reasoning' process is uninterpretable. An equally important limitation is that the dataset is designed only for quantitative assessment of the end-to-end answer retrieval task, with no provision for assessing the correct (semantic) interpretation of the input query. In response, we identify a key structural idiom in OKVQA, viz., S3 (select, substitute and search), and build a new dataset and challenge around it. Specifically, the questioner identifies an entity in the image and asks a question involving that entity which can be answered only by consulting a knowledge graph or a corpus passage mentioning the entity. Our challenge consists of (i) OKVQA-S3, a subset of OKVQA annotated based on the structural idiom, and (ii) S3VQA, a new dataset built from scratch. We also present a neural but structurally transparent OKVQA system, S3, that explicitly addresses our challenge dataset and outperforms recent competitive baselines. We make our code and data available at https://s3vqa.github.io/.
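To make the select-substitute-search idiom concrete, here is a toy, hedged walk-through in Python; the detector and the knowledge source are stubbed with hard-coded placeholders rather than real components:

```python
"""Toy walk-through of the select-substitute-search idiom described above.
A real system would use an object detector for 'select' and a knowledge-graph
or corpus index for 'search'; both are stubbed here."""

def select(image_regions, question):
    """Select: pick the image entity the question refers to (stubbed)."""
    return image_regions[0]          # e.g. the detected entity "Eiffel Tower"

def substitute(question, placeholder, entity):
    """Substitute: rewrite the question with the grounded entity name."""
    return question.replace(placeholder, entity)

def search(query, knowledge_base):
    """Search: answer the rewritten query from an external source (stubbed
    with a dictionary standing in for a knowledge graph or corpus)."""
    return knowledge_base.get(query, "unknown")

if __name__ == "__main__":
    regions = ["Eiffel Tower"]                      # pretend detector output
    question = "In which city is this landmark?"
    kb = {"In which city is the Eiffel Tower?": "Paris"}
    entity = select(regions, question)
    rewritten = substitute(question, "this landmark", f"the {entity}")
    print(rewritten, "->", search(rewritten, kb))   # ... -> Paris
```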
For this assignment, I have implemented a Visual Question Answering model that achieves an accuracy of 51.97% on the validation set, higher than the 45% accuracy required to receive full credit. To extract features from the questions, I adapt a pre-trained FastText model and apply a 2-layer bidirectional LSTM. To combine the features, I experiment with element-wise multiplication and a convolutional layer. Furthermore, I perform ablation studies by feeding in only the question vector and only the image features, respectively, and by using only averaged word embeddings. I find that using averaged word embeddings does not decrease the accuracy substantially, and that removing the question vector causes a larger decrease in accuracy.
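A hedged sketch of the pipeline described above (pretrained word vectors, a 2-layer bidirectional LSTM, and element-wise fusion with precomputed image features) might look as follows; the random embedding matrix and all dimensions are placeholders, not the assignment's actual values:

```python
import torch
import torch.nn as nn

class FusionVQA(nn.Module):
    """Sketch of the described pipeline (dimensions are assumptions):
    pretrained word embeddings -> 2-layer BiLSTM question encoder, fused with
    precomputed image features by element-wise multiplication."""

    def __init__(self, emb_matrix, img_dim=2048, hid=512, num_answers=1000):
        super().__init__()
        # emb_matrix would normally hold pretrained FastText vectors.
        self.embed = nn.Embedding.from_pretrained(emb_matrix, freeze=True)
        self.lstm = nn.LSTM(emb_matrix.size(1), hid, num_layers=2,
                            bidirectional=True, batch_first=True)
        self.q_proj = nn.Linear(2 * hid, hid)
        self.img_proj = nn.Linear(img_dim, hid)
        self.classifier = nn.Linear(hid, num_answers)

    def forward(self, img_feat, question_tokens):
        out, _ = self.lstm(self.embed(question_tokens))    # (B, L, 2*hid)
        q = self.q_proj(out[:, -1])                        # last time step
        v = self.img_proj(img_feat)
        return self.classifier(q * v)                      # element-wise fusion

if __name__ == "__main__":
    fake_fasttext = torch.randn(10000, 300)                # placeholder vectors
    model = FusionVQA(fake_fasttext)
    logits = model(torch.randn(2, 2048), torch.randint(0, 10000, (2, 14)))
    print(logits.shape)  # torch.Size([2, 1000])
```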
Proceedings of the 20th Annual SIGdial Meeting on Discourse and Dialogue, 2019
Understanding and conversing about dynamic scenes is one of the key capabilities of AI agents that navigate the environment and convey useful information to humans. Video question answering is a specific scenario of such AI-human interaction where an agent generates a natural language response to a question regarding the video of a dynamic scene. Incorporating features from multiple modalities, which often provide supplementary information, is one of the challenging aspects of video question answering. Furthermore, a question often concerns only a small segment of the video, hence encoding the entire video sequence using a recurrent neural network is not computationally efficient. Our proposed question-guided video representation module efficiently generates the token-level video summary guided by each word in the question. The learned representations are then fused with the question to generate the answer. Through empirical evaluation on the Audio Visual Scene-aware Dialog (AVSD) dataset (Alamri et al., 2019a), our proposed models in single-turn and multi-turn question answering achieve state-of-the-art performance on several automatic natural language generation evaluation metrics.
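As a hedged illustration of a token-level, question-guided video summary (not the paper's exact module), each question word can attend over the frame features as in the following sketch, where all dimensions are assumptions:

```python
import torch
import torch.nn as nn

class QuestionGuidedSummary(nn.Module):
    """Illustration of a question-guided video representation: every question
    token attends over the frame features, producing one video summary vector
    per token. Dimensions are assumptions, not the paper's."""

    def __init__(self, frame_dim=512, word_dim=300, hid=256):
        super().__init__()
        self.frame_proj = nn.Linear(frame_dim, hid)
        self.word_proj = nn.Linear(word_dim, hid)

    def forward(self, frames, words):
        # frames: (B, T, frame_dim), words: (B, L, word_dim)
        f = self.frame_proj(frames)                        # (B, T, hid)
        w = self.word_proj(words)                          # (B, L, hid)
        scores = torch.bmm(w, f.transpose(1, 2))           # (B, L, T)
        attn = torch.softmax(scores, dim=-1)
        summary = torch.bmm(attn, f)                       # (B, L, hid): one per token
        return torch.cat([summary, w], dim=-1)             # fuse with the question

if __name__ == "__main__":
    mod = QuestionGuidedSummary()
    out = mod(torch.randn(2, 40, 512), torch.randn(2, 12, 300))
    print(out.shape)  # torch.Size([2, 12, 512])
```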
2019
We introduced the task of acoustic question answering (AQA) in https://arxiv.org/abs/1811.10561. This dataset aims to promote research in the acoustic reasoning area. It comprises acoustic scenes and multiple questions/answers for each of them. Each question is accompanied by a functional program which describes the reasoning steps needed to answer it. The dataset is separated into 3 sets: Training (35,000 acoustic scenes, 1,400,000 questions/answers), Validation (7,500 acoustic scenes, 300,000 questions/answers) and Test (7,500 acoustic scenes, 300,000 questions/answers). The generation code is available at https://github.com/IGLU-CHISTERA/CLEAR-dataset-generation. The dataset can easily be regenerated with a different number of scenes/questions/answers.
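For quick reference, the split sizes quoted above can be tabulated in a few lines of Python; the 40 questions-per-scene ratio follows directly from those numbers:

```python
# Split sizes quoted above, with the resulting questions-per-scene ratio.
splits = {
    "train":      {"scenes": 35_000, "qa_pairs": 1_400_000},
    "validation": {"scenes": 7_500,  "qa_pairs": 300_000},
    "test":       {"scenes": 7_500,  "qa_pairs": 300_000},
}

for name, s in splits.items():
    print(f"{name}: {s['qa_pairs'] // s['scenes']} questions per scene")
# Every split pairs each acoustic scene with 40 questions/answers.
```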