2022, Mathematical Statistician and Engineering Applications
Video captioning refers to the process of predicting a semantically consistent textual description from a given video clip. Although a significant amount of research work exists for video captioning in English, the field is nearly unexplored for Bengali. Therefore, this research aims at generating Bengali captions that plausibly describe the gist of a specific short video. To accomplish this, a Long Short-Term Memory (LSTM) based sequence-to-sequence model is used that takes video frame features as input and generates an analogous textual description. In this study, the Microsoft Research Video Description Corpus (MSVD), an English dataset, is used; therefore, a deep learning-based translator and manual labor are used to convert the English captions into appropriate Bengali ones. Finally, the model's performance is evaluated using the popular evaluation metrics BLEU and TER. The proposed approach achieves BLEU and TER scores of 0.38 and 0.76 respectively, establishing a new benchmark for the Bengali video captioning task.
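The sequence-to-sequence setup described above (an encoder LSTM that folds per-frame features into a state, and a decoder LSTM that emits one word at a time) can be sketched in miniature. The following NumPy sketch uses random, untrained weights and a toy vocabulary; it is illustrative only, not the authors' implementation, and every dimension and name here is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
D_FEAT, D_HID, V = 4, 8, 6                       # frame-feature dim, hidden size, toy vocab size
VOCAB = ["<bos>", "<eos>", "a", "man", "is", "running"]

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, h, c, W, U, b):
    """One LSTM cell update; gate order: input, forget, output, candidate."""
    z = W @ x + U @ h + b
    i, f, o, g = np.split(z, 4)
    c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    h = sigmoid(o) * np.tanh(c)
    return h, c

def make_params(d_in, d_hid):
    # Random, untrained parameters: input weights, recurrent weights, bias.
    return (rng.normal(0, 0.1, (4 * d_hid, d_in)),
            rng.normal(0, 0.1, (4 * d_hid, d_hid)),
            np.zeros(4 * d_hid))

enc = make_params(D_FEAT, D_HID)                 # encoder reads frame features
dec = make_params(V, D_HID)                      # decoder reads one-hot of last word
W_out = rng.normal(0, 0.1, (V, D_HID))           # hidden state -> vocab logits

def caption(frames, max_len=5):
    # Encoder: fold the frame-feature sequence into a final (h, c) state.
    h = c = np.zeros(D_HID)
    for f in frames:
        h, c = lstm_step(f, h, c, *enc)
    # Decoder: greedy decoding, feeding back the one-hot of the previous word.
    word, out = 0, []                            # start from <bos>
    for _ in range(max_len):
        x = np.eye(V)[word]
        h, c = lstm_step(x, h, c, *dec)
        word = int(np.argmax(W_out @ h))
        if VOCAB[word] == "<eos>":
            break
        out.append(VOCAB[word])
    return out

frames = rng.normal(size=(10, D_FEAT))           # 10 frames of toy features
print(caption(frames))
```

With trained weights, the same loop structure produces real captions; here the untrained model simply emits some sequence of vocabulary words.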
arXiv (Cornell University), 2023
This work demonstrates the implementation and use of an encoder-decoder model to perform a many-to-many mapping of video data to text captions. The many-to-many mapping occurs via an input temporal sequence of video frames to an output sequence of words that forms a caption sentence. Data preprocessing, model construction, and model training are discussed. Caption correctness is evaluated using 2-gram BLEU scores across the different splits of the dataset. Specific examples of output captions are shown to demonstrate model generality over the video temporal dimension. Predicted captions are shown to generalize over video action, even in instances where the video scene changes dramatically. Model architecture changes are discussed to improve sentence grammar and correctness.
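A cumulative 2-gram BLEU score, as used for evaluation above, is the geometric mean of unigram and bigram precision multiplied by a brevity penalty. A minimal single-reference, unsmoothed version (an illustrative sketch, not the paper's evaluation code) can be written as:

```python
from collections import Counter
import math

def bleu2(candidate, reference):
    """Cumulative 2-gram BLEU for one candidate against one reference:
    geometric mean of 1- and 2-gram precision times a brevity penalty."""
    precisions = []
    for n in (1, 2):
        cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
        ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
        overlap = sum((cand & ref).values())     # clipped n-gram matches
        total = max(sum(cand.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0                               # no smoothing: any zero precision -> 0
    bp = 1.0 if len(candidate) > len(reference) else \
        math.exp(1 - len(reference) / max(len(candidate), 1))
    return bp * math.exp(sum(math.log(p) for p in precisions) / 2)

cand = "a man is riding a bike".split()
ref = "a man is riding a bicycle".split()
print(round(bleu2(cand, ref), 3))
```

Toolkits such as NLTK provide smoothed, multi-reference variants; the sketch above only shows the core precision/brevity-penalty arithmetic.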
Procedia Computer Science
Automatic image caption generation aims to produce an accurate description of an image in natural language automatically. However, Bangla, the fifth most widely spoken language in the world, lags considerably in research and development in this domain. Besides, while there are many established datasets for image annotation in English, no such resource yet exists for Bangla. Hence, this paper outlines the development of "Chittron", an automatic image captioning system in Bangla. Moreover, to address the dataset availability issue, a collection of 16,000 Bangladeshi contextual images has been accumulated and manually annotated in Bangla. This dataset is then used to train a model which integrates a pre-trained VGG16 image embedding model with stacked LSTM layers. The model is trained to predict the caption from an input image, one word at a time. The results show that the model has successfully learned a working language model and generates captions quite accurately in many cases. The results are evaluated mainly qualitatively; however, BLEU scores are also reported. It is expected that a better result can be obtained with a bigger and more varied dataset.
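Training a captioner to predict one word at a time typically means expanding each caption into (prefix, next-word) supervision pairs. A minimal sketch of that expansion follows; the `<bos>`/`<eos>` markers are conventional assumptions, and this is not the Chittron pipeline itself:

```python
def make_training_pairs(caption_tokens):
    """Expand one caption into (prefix, next-word) supervision pairs, as used
    when a model is trained to predict the caption one word at a time."""
    tokens = ["<bos>"] + caption_tokens + ["<eos>"]
    # Each prefix tokens[:i] supervises the prediction of tokens[i].
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

pairs = make_training_pairs(["a", "dog", "runs"])
for prefix, target in pairs:
    print(prefix, "->", target)
```

At training time each prefix is paired with the image features so the model learns p(next word | image, prefix); at inference the same model is unrolled from `<bos>` until it emits `<eos>`.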
International Journal of Advanced Computer Science and Applications, 2020
Visually impaired individuals face many difficulties in their daily lives. In this study, a video captioning system has been developed for visually impaired individuals to analyze events through real-time images and express them in meaningful sentences. It aims to better understand the problems these individuals experience in their daily lives. For this reason, the opinions and suggestions of disabled individuals within the Altınokta Blind Association (a Turkish organization of blind people) have been collected to produce more realistic solutions to their problems. In this study, MSVD, which consists of 1970 YouTube clips, has been used as the training dataset. First, all clips were muted so that audio was not used in the sentence extraction process. CNN and LSTM architectures have been used to create sentences, and experimental results have been compared using BLEU-4, ROUGE-L, CIDEr, and METEOR.
ArXiv, 2021
Image captioning using an Encoder-Decoder approach, where a CNN serves as the encoder and a sequence generator such as an RNN as the decoder, has proven to be very effective. However, this method has the drawback that the sequence must be processed in order. To overcome this drawback, some researchers have utilized the Transformer model to generate captions from images using English datasets; however, none of them generated captions in Bengali using the Transformer model. As a result, we utilized three different Bengali datasets to generate Bengali captions from images using the Transformer model. Additionally, we compared the performance of the Transformer-based model with a visual attention-based Encoder-Decoder approach. Finally, we compared the results of the Transformer-based model with other models that employed different Bengali image captioning datasets.
Neural networks and deep learning have seen an upsurge of research in the past decade due to their improved results. Generating text from a given image is a crucial task that requires combining two sectors, computer vision and natural language processing, in order to understand an image and represent it using natural language. However, existing works have all been done in a particular lingual domain and on the same sets of data, which leads to the resulting systems performing poorly on images that belong to other locales' geographical contexts. TextMage is a system that is capable of understanding visual scenes belonging to the Bangladeshi geographical context and using its knowledge to represent what it understands in Bengali. Hence, we have trained a model on our previously developed and published dataset named BanglaLekhaImageCaptions. This dataset contains 9,154 images along with two annotations for each image. In order to assess performance, the prop...
Journal of Intelligent & Fuzzy Systems, 2019
Understanding the context of an input image and generating a textual description of it is an active and challenging research topic in computer vision and natural language processing. However, in the case of the Bengali language, the problem is still unexplored. In this paper, we address a standard approach for Bengali image caption generation through subsampling the machine-translated dataset. Later, we use several pre-processing techniques with state-of-the-art CNN-LSTM architecture-based models. The experiment is conducted on the standard Flickr-8K dataset, with several modifications applied to adapt it to the Bengali language. The subsampled training-caption dataset is computed for both Bengali and English for further experiments, with 16 distinct models developed in the entire training process. The trained models for both languages are analyzed with respect to several caption evaluation metrics. Further, we establish a baseline performance in Bengali image captioning, defining the limitations of current word embedding approaches compared to internal local embedding.
International Journal of Advanced Computer Science and Applications
Automatic caption generation from images has become an active research topic in the fields of Computer Vision (CV) and Natural Language Processing (NLP). Machine-generated image captions play a vital role for visually impaired people: converting the caption to speech gives them a better understanding of their surroundings. Though a significant amount of research has been conducted on automatic caption generation in other languages, far too little effort has been devoted to Bangla image caption generation. In this paper, we propose an encoder-decoder based model which takes an image as input and generates the corresponding Bangla caption as output. The encoder network consists of a pretrained image feature extractor called ResNet-50, while the decoder network consists of Bidirectional LSTMs for caption generation. The model has been trained and evaluated using a Bangla image captioning dataset named BanglaLekhaImageCaptions. The proposed model achieved a training accuracy of 91% and BLEU-1, BLEU-2, BLEU-3, BLEU-4 scores of 0.81, 0.67, 0.57, and 0.51 respectively. Moreover, a comparative study of different pretrained feature extractors such as VGG-16 and Xception is presented. Finally, the proposed model has been deployed on an embedded device for analysing the inference time and power consumption.
2020 23rd International Conference on Computer and Information Technology (ICCIT), 2020
There is little research on the linguistic characteristics of the Bengali language. Bengali is spoken by about 193 million people globally and is one of the top ten spoken languages worldwide. In this paper, a CNN and Bidirectional GRU architecture is proposed for producing a natural language caption from an image in the Bengali language. Bangladeshi people may use this study to understand one another better, break language barriers, and increase their cultural understanding. This study would also immensely help many blind people in their daily lives. The encoder-decoder approach was used in this paper for captioning. We used a pre-trained deep CNN, InceptionV3, as the image encoder to interpret, identify, and annotate the dataset's images, and a Bidirectional GRU architecture as the decoder to produce captions. In order to deliver the finest and most subtle Bengali captions from our model, argmax search and beam search are included. We proposed a new dataset named BNATURE that contain...
International Journal of Electrical and Computer Engineering (IJECE), 2022
With the development of today's society, demand for applications using digital cameras grows year by year. However, analyzing large amounts of video data is one of the most challenging issues: in addition to storing the data captured by the camera, intelligent systems are required to quickly analyze the data and react to important situations. In this paper, we use deep learning techniques to build automatic models that describe movements in video. To solve the problem, we use three deep learning models: a sequence-to-sequence model based on a recurrent neural network, a sequence-to-sequence model with attention, and a Transformer model. We evaluate the effectiveness of the approaches based on the results of the three models. To train these models, we use the Microsoft Research Video Description Corpus (MSVD) dataset, including 1970 videos and 85,550 captions translated into Vietnamese. In order to ensure the quality of the descriptions in Vietnamese, we also combine the models with a natural language processing (NLP) model for Vietnamese.
International Journal of Advanced Computer Science and Applications (IJACSA), 2023
Image captioning has become a crucial aspect of contemporary artificial intelligence because it tackles two crucial parts of the AI field: Computer Vision and Natural Language Processing. Currently, Bangla stands as the seventh most widely spoken language globally; because of this, Bangla image captioning has gained recognition for its significant research accomplishments. Many established datasets exist in English, but there is no standard dataset in Bangla. For our research, we have used the BAN-Cap dataset, which contains 8091 images with 40455 sentences. Many effective encoder-decoder and visual attention approaches are used for image captioning, where a CNN is utilized as the encoder and an RNN as the decoder. In this study, however, we propose a transformer-based image captioning model with different pre-trained image feature extraction models, namely ResNet50, InceptionV3, and VGG16, on the BAN-Cap dataset, evaluate its efficiency and accuracy using several performance metrics such as BLEU, METEOR, ROUGE, and CIDEr, and identify the drawbacks of other models.
THE 2ND UNIVERSITAS LAMPUNG INTERNATIONAL CONFERENCE ON SCIENCE, TECHNOLOGY, AND ENVIRONMENT (ULICoSTE) 2021
Video captioning is the process of automatically generating a text description of a given video. It is a type of sequence-to-sequence translation that should consider the spatial and temporal features of the input video data. Recurrent Neural Networks arranged in an encoder-decoder architecture are generally used for this type of problem. Models based on Recurrent Neural Networks suffer from high computational cost, and their output generation depends on the final hidden state vector of the encoder, which cannot represent the entire video effectively. Incorporating an attention mechanism in the video captioning process can improve the efficiency of caption generation. Nowadays focus has shifted from RNN-based networks to the Transformer for the video captioning problem. This paper introduces a Transformer-based network architecture over LSTM-based models for captioning video. This architecture is generally used in language translation models. The Transformer network contains multi-head self-attention and encoder-decoder attention to improve the performance of text description generation. The model shows BLEU scores of 42.4 on MSR-VTT and 53.2 on MSVD, which is better than state-of-the-art models.
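The multi-head self-attention that the Transformer relies on is scaled dot-product attention computed per head and concatenated. A compact NumPy sketch with toy dimensions follows; it is illustrative only, and the projection matrices and sizes are assumptions, not the paper's architecture:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))   # shift for numerical stability
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, Wq, Wk, Wv, Wo, n_heads):
    """Scaled dot-product self-attention over a sequence X of shape (T, d)."""
    T, d = X.shape
    dh = d // n_heads                                  # per-head dimension
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    heads = []
    for h in range(n_heads):
        q = Q[:, h * dh:(h + 1) * dh]
        k = K[:, h * dh:(h + 1) * dh]
        v = V[:, h * dh:(h + 1) * dh]
        scores = q @ k.T / np.sqrt(dh)                 # (T, T): every position attends to all others
        heads.append(softmax(scores) @ v)
    return np.concatenate(heads, axis=1) @ Wo          # merge heads, project back to d

rng = np.random.default_rng(1)
T, d, H = 5, 8, 2                                      # 5 positions, model dim 8, 2 heads
X = rng.normal(size=(T, d))
Wq, Wk, Wv, Wo = (rng.normal(0, 0.1, (d, d)) for _ in range(4))
out = multi_head_self_attention(X, Wq, Wk, Wv, Wo, H)
print(out.shape)
```

Because every position attends to every other position in one step, the sequence need not be processed in order, which is the advantage over RNNs that the abstracts above describe.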
SN Computer Science
Video captioning is the automated generation of natural language phrases that explain the contents of video frames. Because of the incomparable performance of deep learning in the fields of computer vision and natural language processing, research in this field has increased exponentially over the past decade. Numerous approaches, datasets, and measurement metrics have been introduced in the literature, calling for a systematic survey to guide research efforts in this exciting direction. Through statistical analysis, this survey paper focuses mostly on state-of-the-art approaches, emphasizing deep learning models, assessing benchmark datasets on several parameters, and classifying the pros and cons of the various evaluation metrics based on previous works in the deep learning field. This survey shows the most used variants of neural networks for visual and spatio-temporal feature extraction as well as for language generation. The results show that ResNet and VGG are the most used visual feature extractors and the 3D convolutional neural network the most used spatio-temporal feature extractor. Besides that, Long Short-Term Memory (LSTM) has mainly been used as the language model, though nowadays the Gated Recurrent Unit (GRU) and the Transformer are slowly replacing it. Regarding dataset usage, MSVD and MSR-VTT have so far been dominant due to their part in outstanding results among various captioning models. From 2015 to 2020, across all major datasets, models such as Inception-ResNet-v2 + C3D + LSTM, ResNet-101 + I3D + Transformer, and ResNet-152 + ResNext-101 (R3D) + (LSTM, GAN) have achieved by far the best results in video captioning. Despite rapid advancement, our survey reveals that video captioning research still has a long way to go in accessing the full potential of deep learning for classifying and captioning a large number of activities, as well as in creating large datasets covering diversified training video samples.
2015
In this paper, we describe the system for generating textual descriptions of short video clips using recurrent neural networks (RNN), which we used while participating in the Large Scale Movie Description Challenge 2015 in ICCV 2015. Our work builds on static image captioning systems with RNN based language models and extends this framework to videos utilizing both static image features and video-specific features. In addition, we study the usefulness of visual content classifiers as a source of additional information for caption generation. With experimental results we show that utilizing keyframe based features, dense trajectory video features and content classifier outputs together gives better performance than any one of them individually.
International Journal for Research in Applied Science & Engineering Technology (IJRASET), 2022
Researchers in the fields of computer vision and natural language processing have in recent years been concentrating their efforts on automatically developing natural language descriptions for videos. Although video comprehension has a variety of applications, such as video retrieval and indexing, video captioning is a difficult topic to master due to the complex and diverse nature of video content. Understanding the relationship between video content and natural language sentences, on the other hand, is still a work in progress, and several approaches for improved video analysis are being developed. Because of their superior performance and high-speed computing capabilities, deep learning approaches have become the focus of video processing. This research presents an end-to-end deep learning based encoder-decoder network for creating natural language descriptions of video sequences. The use of a CNN-RNN model paired with beam search to generate captions for the MSVD dataset is explored in this study, and we compare the results of the beam search and greedy search approaches. The captions generated by this model are often grammatically incorrect; our paper focuses on improving those grammatical errors using an encoder-decoder model. Grammatical errors include spelling mistakes; incorrect use of articles, prepositions, pronouns, and nouns; and poor sentence construction. Using beam search with k=3, the captions generated by our algorithm achieve a BLEU score of 0.72. After passing the generated captions through a grammar error correction mechanism, the result improves to a BLEU score of 0.76, an increase of 5.55%. The BLEU score decreases as the value of k decreases, but so does the time it takes to generate captions.
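Beam search, as compared against greedy search above, keeps the k highest-scoring partial captions at each step instead of committing to the single best next word. A minimal sketch over a hypothetical toy next-word model follows (illustrative only, not the paper's decoder):

```python
import math

def beam_search(next_log_probs, k, max_len, bos="<bos>", eos="<eos>"):
    """Keep the k highest-scoring partial captions at each decoding step."""
    beams = [([bos], 0.0)]                            # (sequence, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == eos:
                candidates.append((seq, score))       # finished beam carries over unchanged
                continue
            for word, lp in next_log_probs(seq).items():
                candidates.append((seq + [word], score + lp))
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:k]
    return beams[0][0]

# Toy "model": fixed next-word distributions keyed on the last word only.
table = {
    "<bos>": {"a": math.log(0.6), "the": math.log(0.4)},
    "a":     {"man": math.log(0.5), "dog": math.log(0.5)},
    "the":   {"man": math.log(0.9), "dog": math.log(0.1)},
    "man":   {"runs": math.log(1.0)},
    "dog":   {"runs": math.log(1.0)},
    "runs":  {"<eos>": math.log(1.0)},
}
print(beam_search(lambda seq: table[seq[-1]], k=3, max_len=5))
```

On this toy model, greedy decoding (k=1) commits to "a" (probability 0.6) and ends with joint probability 0.30, while beam search with k=3 recovers "the man runs" with joint probability 0.36, illustrating why wider beams can yield better captions at the cost of more decoding time.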
2021
Because the amount of video data increases each day, the need for automatically generating clear captions for it is inevitable. Video captioning makes video more accessible in numerous ways: it allows deaf and hard-of-hearing individuals to watch videos, helps people focus on and remember information more easily, and lets people watch in sound-sensitive environments. Video captioning refers to the task of generating a natural language sentence that explains the content of the input video clips. Events are temporally localized in the video with independent start and end times; at the same time, some events may occur concurrently and overlap in time. Classifying the events into present, past, and future, as well as separating them based on their start and end times, helps identify the order of events. Hence the proposed work develops a captioning system that clearly explains each visual feature present in the image conceptually. The Blended-LSTM (Bl-LSTM) model, with the help of an Xception-based Convolutional Neural Network (CNN) and the Fusion Visual Captioning (FVC) system, achieves this with a BLEU score of 75.9%.
2021
Recent advancements in deep learning have created many opportunities to solve real-world problems that remained unsolved for more than a decade. Automatic caption generation is a major research field, and the research community has done a lot of work on it in common languages like English. Urdu is the national language of Pakistan and is also widely spoken and understood in the subcontinent region of Pakistan and India, yet no work has been done on Urdu caption generation. Our research aims to fill this gap by developing an attention-based deep learning model using sequence modeling techniques specialized for the Urdu language. We have prepared a dataset in Urdu by translating a subset of the "Flickr8k" dataset containing 700 'man' images. We evaluate our proposed technique on this dataset and show that it can achieve a BLEU score of 0.83 in the Urdu language. We improve on the previous state-of-the-art by using better CNN architectures a...
International Journal of Next-Generation Computing, 2021
Video captioning is the process of creating a natural language sentence that summarises the video's contents automatically. Modeling the video's effective temporal composition and effectively integrating that information into a plain language description are both required. It has a variety of applications, including assisting the visually impaired, video subtitling, and video surveillance, among others. Due to the advancement of deep learning in computer vision and natural language processing, there has been a surge in study in this area in recent years. Video captioning is the result of combining these two worlds of computer vision and natural language processing. In this study, we examine and analyse various strategies for addressing this issue, as well as benchmark datasets in terms of domains, repository size, and number of classes; and identify the benefits and drawbacks of various evaluation metrics such as BLEU, METEOR, CIDEr, SPICE, and ROUGE.
2021
Automatic image captioning is the ongoing effort of creating syntactically valid, accurate textual descriptions of an image in natural language with context. The encoder-decoder structure used throughout existing Bengali Image Captioning (BIC) research utilized abstract image feature vectors as the encoder's input. We propose a novel transformer-based architecture with an attention mechanism, using a pre-trained ResNet-101 image encoder for feature extraction. Experiments demonstrate that the language decoder in our technique captures fine-grained information in the caption and, paired with image features, produces accurate and diverse captions on the BanglaLekhaImageCaptions dataset. Our approach outperforms all existing Bengali Image Captioning work and sets a new benchmark by scoring 0.694 on BLEU-1, 0.630 on BLEU-2, 0.582 on BLEU-3, and 0.337 on METEOR.
ArXiv, 2020
In this work, we have introduced Gaussian Smoothen Semantic Features (GSSF) for better semantic selection in Indian regional-language image captioning, along with a procedure that uses existing translations and English crowd-sourced sentences for training. We have shown that this architecture is a promising alternative where resources are scarce. The main contribution of this work is the development of deep learning architectures for the Bengali language (the fifth most widely spoken language in the world), which has completely different grammar and language attributes. We have shown that these work well for complex applications like language generation from image contexts and can diversify the representation by introducing constraints, more extensive features, and unique feature spaces. We also established that we could achieve absolute precision and diversity when we use a smoothened semantic tensor with the traditional LSTM and feature decom...
IOS Press eBooks, 2022
Image captioning has gained a tremendous spotlight in recent years; however, most captioning models generate captions in English. In this paper, we present an image caption generator for the regional language Hindi using ResNet50 and LSTM with an attention module. An experimental study highlights the effect of attention-based learning on the generated Hindi captions. The Flickr8k dataset in Hindi is used to validate the performance of the proposed work in terms of BLEU score.