

International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 08 Issue: 04 | Apr 2021 www.irjet.net p-ISSN: 2395-0072

Image Caption Generation Methodologies


Omkar Shinde1, Rishikesh Gawde2, Anurag Paradkar3
1-3 Student, Department of Information Technology, Vidyalankar Institute of Technology, Mumbai, India
Abstract - Scene understanding has always been an important task in computer vision, and image captioning is one of the major areas of Artificial Intelligence research, since it aims to mimic the human ability to compress an enormous amount of visual information into a few sentences. Image caption generation aims to generate a sentence description for an image. The task aims to provide a short but detailed caption of the image and requires the use of techniques from computer vision and natural language processing. Recent developments in deep learning and the availability of image caption datasets such as Flickr and COCO have enabled significant research in the area. In this paper, we present the methodologies used, such as a multilayer Convolutional Neural Network (CNN), a Recurrent Neural Network (RNN) and a Long Short-Term Memory (LSTM) network, to accurately identify and construct a meaningful caption for a given image.

Key Words: Image Captioning, Computer Vision, Convolutional Neural Network, Recurrent Neural Network, Long Short-Term Memory

1. INTRODUCTION

Problem Statement:

Most people spend hours deciding what to write as a caption; a picture is incomplete without a good caption to go with it. The problem introduces a captioning task, which requires a computer vision system to both localize and describe salient regions in images in natural language. The image captioning task generalizes object detection when the descriptions consist of a single word. Given a set of images and prior knowledge about the content, find the correct semantic label for the entire image(s).

Artificial Intelligence (AI) is now at the heart of the innovation economy, and it is thus also the base for this project. In the recent past, a field of AI, namely Deep Learning, has turned a lot of heads due to its impressive results in terms of accuracy when compared to existing machine learning algorithms. The task of generating a meaningful sentence from an image is difficult but can have great impact, for instance helping the visually impaired to gain a better understanding of images. With the great advancement in computing power and the availability of huge datasets, building models that can generate captions for an image has become possible.

On the other hand, humans are able to easily describe the environments they are in. Given a picture, it is natural for a person to explain an immense amount of detail about the image with a fast glance. Although great development has been made in computer vision, tasks such as recognizing an object, action classification, image classification, attribute classification and scene recognition are possible, but it is a relatively new task to let a computer describe an image that is forwarded to it in the form of a human-like sentence.

2. LITERATURE REVIEW

One of the influential papers by Andrej Karpathy et al. in image captioning divides the task into two steps: mapping sentence snippets to visual regions in the image and then using these correspondences to generate new descriptions (Karpathy and Fei-Fei 2015). The authors use a Region Convolutional Neural Network (RCNN) to represent images as a set of h-dimensional vectors, each representing an object in the image, detected based on 200 ImageNet classes. The authors represent sentences with the help of a Bidirectional Recurrent Neural Network (BRNN) in the same h-dimensional space. Each sentence is a set of h-dimensional vectors representing snippets or words. The use of the BRNN enriches this representation as it learns knowledge about the context of each word in a sentence. The authors find that with such a representation, the final representation of words aligns strongly with the representation of visual regions related to the same concept. They define an alignment score on this representation of words and visual regions and align various words to the same region, generating text snippets, with the help of a Markov Random Field. With the help of these correspondences between image regions and text snippets, the authors train another model that generates text descriptions for new unseen images (Karpathy and Fei-Fei 2015).

The authors train an RNN that takes text snippets and visual regions as inputs and tries to predict the next word in the text based on the words it has seen so far. The image region information is passed to the network as the initial hidden state at the initial time step, and the network learns to predict the log probability of the next most likely word using a softmax classifier. The authors use unique START and END tokens that represent the beginning and end of the sentence, which allows the network to make variable-length predictions. The RNN has 512 nodes in the hidden layer (Karpathy and Fei-Fei 2015).

The network for learning correspondences between visual regions and text words was trained using stochastic gradient descent in batches of 100 image-sentence pairs. The authors used dropout on every layer except the recurrent layers and clipped the element-wise gradients at 5 to prevent gradient explosion. The RNN that generates descriptions for unseen images was trained using RMSprop, which dynamically adjusts the learning rate (Karpathy and Fei-Fei 2015).
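
To make the decoding procedure just described more concrete, the sketch below shows a greedy generation loop in the same spirit: image information seeds the hidden state, a softmax scores the next word, and the START/END tokens bound the variable-length output. It is an illustration only, not the authors' code; the tiny vocabulary, random weights and the step() helper are hypothetical placeholders.

```python
import numpy as np

# Illustrative greedy decoder in the spirit of the Karpathy and Fei-Fei (2015)
# description. Vocabulary, weights and shapes are toy placeholders.
rng = np.random.default_rng(0)
vocab = ["<START>", "<END>", "a", "dog", "on", "grass"]
hidden_size, vocab_size = 512, len(vocab)

# Toy parameters standing in for a trained RNN decoder.
W_xh = rng.standard_normal((hidden_size, vocab_size)) * 0.01
W_hh = rng.standard_normal((hidden_size, hidden_size)) * 0.01
W_hy = rng.standard_normal((vocab_size, hidden_size)) * 0.01

def step(word_id, h):
    """One RNN step: consume a word, update the hidden state, score next words."""
    x = np.zeros(vocab_size)
    x[word_id] = 1.0                                    # one-hot input word
    h = np.tanh(W_xh @ x + W_hh @ h)                    # recurrence
    logits = W_hy @ h
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                                # softmax over the vocabulary
    return probs, h

# The image representation is injected as the initial hidden state
# (random here, purely for illustration).
h = rng.standard_normal(hidden_size) * 0.01
word = vocab.index("<START>")
caption = []
for _ in range(20):                                     # cap the caption length
    probs, h = step(word, h)
    word = int(np.argmax(probs))                        # greedy choice of next word
    if vocab[word] == "<END>":
        break
    caption.append(vocab[word])
print(" ".join(caption))
```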


Kelvin Xu et al. (Xu et al. 2015) use the concept of attention to better describe images. The authors propose models that focus on which area of the image, and which objects in the image, are being given attention, and evaluate these models on different image captioning datasets. The idea behind the approach is that, much like in the human visual system, some parts of the image may be ignored for the task of image description, and only the salient foreground features are considered. The authors use a CNN to learn important features of the image and an LSTM (Long Short-Term Memory network) to generate description text based on a context vector.

Jyoti Aneja et al. in (Aneja, Deshpande, and Schwing 2017) use a convolutional approach to generate description text instead of a simple RNN, and show that their model works on par with RNN and LSTM based approaches.

Andrew Shin et al. (Shin, Ushiku, and Harada 2016) use a second neural network, fine-tuned on text-based sentiment analysis, to generate image descriptions which capture the sentiments in the image. The authors use multi-label learning to learn sentiments associated with each of the images, and then use these sentiments, along with the input from the CNN itself, as inputs to an LSTM to generate sentences which include the sentiment. The LSTM is restricted so that each description contains at least one term from the sentiment vocabulary.

Alexander Mathews et al. (Mathews, Xie, and He 2016) emphasize how only a few image descriptions in most datasets contain words describing sentiments, and most descriptions are factual. The authors propose a model that consists of two CNN + RNN models, each with a specific task. While one model learns to describe the factual content of the image, the other learns to describe the sentiment associated with it, thus providing a framework that learns to generate sentiment-based descriptions even with less image sentiment data.

Quanzeng You et al. in (You, Jin, and Luo 2018) propose approaches to inject sentiment into the descriptions generated by image captioning methods.

Tsung-Yi Lin et al. in (Lin et al. 2014) describe the Microsoft Common Objects in Context dataset that is widely used for benchmarking image captioning models.

3. METHODOLOGIES

3.1 Model Overview:

The proposed model takes an image I as input and is trained to maximize the probability p(S|I), where S is the sequence of words generated by the model and each word St is generated from a dictionary built from the training dataset. The input image I is fed into a deep vision Convolutional Neural Network (CNN), which helps in detecting the objects present in the image. The image encodings are passed on to the language-generating Recurrent Neural Network (RNN), which helps in generating a meaningful sentence for the image, as shown in Fig. 1. An analogy to the model can be drawn with a language translation RNN model, where we try to maximize p(T|S), where T is the translation of the sentence S. However, in our model the encoder RNN, which helps in transforming an input sentence into a fixed-length vector, is replaced by a CNN encoder. Recent research has shown that a CNN can easily transform an input image into a vector.

Fig-1: Flow of Model

For the task of image classification, we use the pretrained model VGG16. The details of the models are discussed in the following sections. A Long Short-Term Memory (LSTM) network follows the pretrained VGG16 and is used for language generation. LSTM differs from traditional neural networks in that a current token depends on the previous tokens for a sentence to be meaningful, and LSTM networks take this factor into account.

Fig-2: Training the model using VGG16 image features

The model built for the project consists of two different input streams, one for the image features and the other for the preprocessed input captions. The image features are passed through a fully connected (dense) layer to get a representation in a different dimension. The input captions are passed through an embedding layer. These two input streams are then merged and passed as inputs to an LSTM layer: the image representation is passed as the initial state of the LSTM, while the caption embeddings are passed as its input. The architecture is shown in Fig. 2.
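
A minimal Keras sketch of the arrangement just described is given below: VGG16 image features projected by a dense layer, caption tokens passed through an embedding layer, and an LSTM whose state is initialised from the image representation. This is an illustrative reconstruction under assumptions, not the authors' exact code; the vocabulary size, caption length, feature dimension and layer widths are assumed values, and TensorFlow/Keras is assumed to be available.

```python
# Hedged sketch of the CNN-encoder / LSTM-decoder captioning model.
from tensorflow.keras.layers import Input, Dense, Embedding, LSTM
from tensorflow.keras.models import Model

vocab_size, max_len, feat_dim, units = 5000, 34, 4096, 256  # assumed sizes

# Image stream: VGG16 fully-connected features projected to the LSTM state size.
img_in = Input(shape=(feat_dim,), name="vgg16_features")
img_state = Dense(units, activation="relu")(img_in)

# Caption stream: word indices embedded into the same dimensionality.
cap_in = Input(shape=(max_len,), name="caption_tokens")
cap_emb = Embedding(vocab_size, units, mask_zero=True)(cap_in)

# The projected image initialises the LSTM state; embedded words are its inputs.
seq = LSTM(units)(cap_emb, initial_state=[img_state, img_state])

out = Dense(vocab_size, activation="softmax")(seq)   # next-word distribution
model = Model(inputs=[img_in, cap_in], outputs=out)
model.compile(loss="categorical_crossentropy", optimizer="adam")
model.summary()
```

Here the projected image vector seeds both the hidden and the cell state of the LSTM, which is one common way of realising the "image as initial state" idea described above.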


3.2 Recurrent Neural Network:

A Recurrent Neural Network is a generalization of a feed-forward neural network that has an internal memory. An RNN is recurrent in nature as it performs the same function for every input of data, while the output for the current input depends on the previous computation. After producing the output, it is copied and sent back into the recurrent network. For making a decision, it considers the current input and the output that it has learned from the previous input.

Fig-3: RNN

A recurrent neural network (RNN) is a class of artificial neural networks where connections between nodes form a directed graph along a temporal sequence. This allows it to exhibit temporal dynamic behavior. Derived from feed-forward neural networks, RNNs can use their internal state (memory) to process variable-length sequences of inputs. This makes them applicable to tasks such as unsegmented, connected handwriting recognition or speech recognition.

The term "recurrent neural network" is used indiscriminately to refer to two broad classes of networks with a similar general structure, where one is finite impulse and the other is infinite impulse. Both classes of networks exhibit temporal dynamic behavior. A finite impulse recurrent network is a directed acyclic graph that can be unrolled and replaced with a strictly feed-forward neural network, while an infinite impulse recurrent network is a directed cyclic graph that cannot be unrolled.

Both finite impulse and infinite impulse recurrent networks can have additional stored states, and the storage can be under direct control of the neural network. The storage can also be replaced by another network or graph if that incorporates time delays or has feedback loops. Such controlled states are referred to as gated states or gated memory, and are part of long short-term memory networks (LSTMs) and gated recurrent units. This is also called a Feedback Neural Network (FNN).

RNNs have a "memory" which remembers all information about what has been calculated. An RNN uses the same parameters for each input as it performs the same task on all the inputs or hidden layers to produce the output. This reduces the complexity of parameters, unlike other neural networks.

The input to the hidden layer in an RNN (with a single hidden layer) is the input vector, along with the output of the hidden layer for the previous time step. The RNN is trained to learn to predict the next word given the current example. This is done multiple times in each iteration, which represents the length of the sequence that the RNN can learn and later predict. The network is trained using backpropagation through time, which adjusts the weights between the hidden layer for a given time step and the next time step. Once trained for various iterations, the RNN can learn to model the sequence (contributors 2018c).

There are problems with RNNs when learning long sequences, in situations such as the language translation of large documents, where it may be necessary to remember only the context over a small time period. For this purpose, Long Short-Term Memory (LSTM) networks are used, where each cell has three gates - input, forget and output - and can learn when to forget the previous context, along with other parameters. The LSTM is trained such that each LSTM cell updates its weights at each time step and all the weights are updated after each iteration. This helps the network learn long sequences and decide which parts of the sequences are related to some context (Trask 2015) (contributors 2018b).
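
The recurrence described above can be written in a few lines of NumPy. The sketch below is purely illustrative, with toy sizes, random weights and no training loop; it only shows how the same parameters are reused at every time step and how the new hidden state depends on the current input and the previous state.

```python
import numpy as np

# Toy forward pass of a single-hidden-layer RNN illustrating the recurrence.
rng = np.random.default_rng(1)
input_size, hidden_size, seq_len = 8, 16, 5

W_xh = rng.standard_normal((hidden_size, input_size)) * 0.1   # input -> hidden
W_hh = rng.standard_normal((hidden_size, hidden_size)) * 0.1  # hidden -> hidden
b_h = np.zeros(hidden_size)

x_seq = rng.standard_normal((seq_len, input_size))  # a dummy input sequence
h = np.zeros(hidden_size)                            # initial hidden state

for t, x_t in enumerate(x_seq):
    # The same parameters are reused at every time step; the new state depends
    # on the current input and the previous hidden state.
    h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
    print(f"step {t}: |h| = {np.linalg.norm(h):.3f}")
```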


3.3 Convolutional Neural Network:

In deep learning, a convolutional neural network (CNN, or ConvNet) is a class of deep neural networks most commonly applied to analyzing visual imagery. They are also known as shift-invariant or space-invariant artificial neural networks (SIANN), based on the shared-weight architecture of the convolution kernels that scan the hidden layers and their translation-invariance characteristics. They have applications in image and video recognition, recommender systems, image classification, image segmentation, medical image analysis, natural language processing, brain-computer interfaces, and financial time series.

Fig-4: CNN

CNNs are regularized versions of multilayer perceptrons. Multilayer perceptrons usually mean fully connected networks, that is, each neuron in one layer is connected to all neurons in the next layer. The "fully-connectedness" of these networks makes them prone to overfitting data. Typical ways of regularization include varying the weights as the loss function gets minimized while randomly trimming connectivity. CNNs take a different approach towards regularization: they take advantage of the hierarchical pattern in data and assemble patterns of increasing complexity using smaller and simpler patterns embossed in the filters. Therefore, on the scale of connectedness and complexity, CNNs are on the lower extreme. Convolutional networks were inspired by biological processes, in that the connectivity pattern between neurons resembles the organization of the animal visual cortex. Individual cortical neurons respond to stimuli only in a restricted region of the visual field known as the receptive field. The receptive fields of different neurons partially overlap such that they cover the entire visual field.

CNNs use relatively little pre-processing compared to other image classification algorithms. This means that the network learns to optimize the filters or convolution kernels that in traditional algorithms are hand-engineered. This independence from prior knowledge and human intervention in feature extraction is a major advantage. Convolutional neural networks are composed of multiple layers of artificial neurons. Artificial neurons, a rough imitation of their biological counterparts, are mathematical functions that calculate the weighted sum of multiple inputs and output an activation value. The behavior of each neuron is defined by its weights. When fed with the pixel values, the artificial neurons of a CNN pick out various visual features.

When you input an image into a ConvNet, each of its layers generates several activation maps. Activation maps highlight the relevant features of the image. Each of the neurons takes a patch of pixels as input, multiplies their color values by its weights, sums them up, and runs the result through the activation function. The first (or bottom) layer of the CNN usually detects basic features such as horizontal, vertical, and diagonal edges. The output of the first layer is fed as input to the next layer, which extracts more complex features, such as corners and combinations of edges. As you move deeper into the convolutional neural network, the layers start detecting higher-level features such as objects, faces, and more.
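
As a concrete illustration of the stacked convolution-and-pooling pattern described above, the following minimal Keras model builds feature maps of increasing abstraction. The layer counts, filter sizes and the 10-class head are arbitrary choices for the sketch, not values taken from the paper.

```python
# A small illustrative ConvNet showing the stacked convolution / pooling pattern.
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

model = Sequential([
    # Early layers learn low-level patterns such as edges.
    Conv2D(32, (3, 3), activation="relu", input_shape=(224, 224, 3)),
    MaxPooling2D((2, 2)),
    # Deeper layers combine them into more complex features (corners, parts).
    Conv2D(64, (3, 3), activation="relu"),
    MaxPooling2D((2, 2)),
    Conv2D(128, (3, 3), activation="relu"),
    MaxPooling2D((2, 2)),
    Flatten(),
    Dense(10, activation="softmax"),  # e.g. an arbitrary 10-class classifier head
])
model.summary()
```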


3.4 Long Short-Term Memory:

Long Short-Term Memory (LSTM) networks are a type of recurrent neural network capable of learning order dependence in sequence prediction problems. This is a behavior required in complex problem domains like machine translation, speech recognition, and more.

Fig-5: LSTM

LSTMs are a complex area of deep learning, and it can be hard to grasp what LSTMs are and how terms like bidirectional and sequence-to-sequence relate to the field; few articulate the promise of LSTMs and how they work more clearly and precisely than the researchers who developed the methods and applied them to new and important problems. Unlike standard feed-forward neural networks, an LSTM has feedback connections. It can process not only single data points (such as images), but also entire sequences of data (such as speech or video). For example, LSTM is applicable to tasks such as unsegmented, connected handwriting recognition, speech recognition and anomaly detection in network traffic or IDSs (intrusion detection systems).

A common LSTM unit is composed of a cell, an input gate, an output gate and a forget gate. The cell remembers values over arbitrary time intervals and the three gates regulate the flow of information into and out of the cell. LSTM networks are well-suited to classifying, processing and making predictions based on time series data, since there can be lags of unknown duration between important events in a time series. LSTMs were developed to deal with the vanishing gradient problem that can be encountered when training traditional RNNs. Relative insensitivity to gap length is an advantage of LSTM over RNNs, hidden Markov models and other sequence learning methods in numerous applications.
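
The gating behaviour just described can be made concrete with a single LSTM cell step written in NumPy. This is an illustrative sketch with toy sizes and random weights (biases omitted for brevity), not the implementation used in the paper.

```python
import numpy as np

# One step of an LSTM cell, written out to show the input, forget and output gates.
rng = np.random.default_rng(2)
input_size, hidden_size = 8, 16

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One weight matrix per gate, plus one for the candidate cell state.
W_i, W_f, W_o, W_c = (rng.standard_normal((hidden_size, input_size + hidden_size)) * 0.1
                      for _ in range(4))

x_t = rng.standard_normal(input_size)   # current input
h_prev = np.zeros(hidden_size)          # previous hidden state
c_prev = np.zeros(hidden_size)          # previous cell state

z = np.concatenate([x_t, h_prev])       # gates see the input and the previous state
i = sigmoid(W_i @ z)                    # input gate: how much new information to write
f = sigmoid(W_f @ z)                    # forget gate: how much of the old cell to keep
o = sigmoid(W_o @ z)                    # output gate: how much of the cell to expose
c_tilde = np.tanh(W_c @ z)              # candidate values for the cell state

c_t = f * c_prev + i * c_tilde          # the cell remembers values over time
h_t = o * np.tanh(c_t)                  # new hidden state / output
print(h_t.shape, c_t.shape)
```
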
3.5 VGG16:

VGG16 is considered one of the excellent vision model architectures to date. The most unique thing about VGG16 is that, instead of having a large number of hyperparameters, its authors focused on convolution layers with 3x3 filters and a stride of 1, always using the same padding, and max-pooling layers with 2x2 filters and a stride of 2. It follows this arrangement of convolution and max-pooling layers consistently throughout the whole architecture. At the end it has 2 FC (fully connected) layers followed by a softmax for output. The 16 in VGG16 refers to its 16 layers that have weights. This is a pretty large network with about 138 million (approx.) parameters.

Fig-6: VGG16

VGG16 is a convolutional neural network model proposed by K. Simonyan and A. Zisserman from the University of Oxford in the paper "Very Deep Convolutional Networks for Large-Scale Image Recognition". The model achieves 92.7% top-5 test accuracy on ImageNet, which is a dataset of over 14 million images belonging to 1000 classes. It was one of the famous models submitted to ILSVRC-2014. It improves over AlexNet by replacing large kernel-sized filters (11 and 5 in the first and second convolutional layers, respectively) with multiple 3×3 kernel-sized filters one after another. VGG16 was trained for weeks using NVIDIA Titan Black GPUs.

VGG-16 is 16 layers deep, and a pretrained version of the network, trained on more than a million images from the ImageNet database, can be loaded directly. The pretrained network can classify images into 1000 object categories, such as keyboard, mouse, pencil, and many animals. As a result, the network has learned rich feature representations for a wide range of images. The network has an image input size of 224-by-224. This network is characterized by its simplicity, using only 3×3 convolutional layers stacked on top of each other in increasing depth. Reducing volume size is handled by max pooling. Two fully-connected layers, each with 4,096 nodes, are then followed by a softmax classifier.
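
Since the pipeline described earlier reuses VGG16 as an image encoder, the hedged Keras sketch below shows one way to load the pretrained network and take the 4,096-dimensional fc2 activations as image features. The image file name is a placeholder, the choice of the fc2 layer is an assumption for illustration, and the snippet assumes TensorFlow/Keras with the ImageNet weights available for download.

```python
# Loading pretrained VGG16 and reusing its penultimate fully-connected layer
# (fc2, 4,096 units) as an image encoder. "example.jpg" is a placeholder file.
import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.preprocessing import image
from tensorflow.keras.models import Model

base = VGG16(weights="imagenet")                        # 16-layer network, ~138M parameters
encoder = Model(inputs=base.input,
                outputs=base.get_layer("fc2").output)   # drop the 1000-way softmax head

img = image.load_img("example.jpg", target_size=(224, 224))  # VGG16 expects 224x224 RGB
x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
features = encoder.predict(x)                           # shape: (1, 4096)
print(features.shape)
```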


4. CONCLUSIONS

In this paper we presented the deep learning techniques used for the image captioning problem. We have presented methodologies such as the Convolutional Neural Network, the Recurrent Neural Network, VGG16, and Long Short-Term Memory models.

The image caption generator has the capability to generate captions both for the images provided during training and for new images. The model takes an image as input and, by analyzing the image, detects the objects present in it and can create a suitable caption for it.

ACKNOWLEDGEMENT

We thank Professor Yash Shah for giving us the opportunity to work on the project and for providing the necessary resources.

REFERENCES

[1] Aarthi, S., and Chitrakala, S. 2017. Scene understanding: a survey. In Computer, Communication and Signal Processing (ICCCSP), 2017 International Conference on, 1–4. IEEE.
[2] Amazon Web Services. 2018. Amazon EC2 P2 instances.
[3] Aneja, J.; Deshpande, A.; and Schwing, A. 2017. Convolutional image captioning. arXiv preprint arXiv:1711.09151.
[4] Wikipedia contributors. 2018a. Convolutional neural network — Wikipedia, the free encyclopedia. [Online; accessed 30-March-2018].
[5] Wikipedia contributors. 2018b. Long short-term memory — Wikipedia, the free encyclopedia. [Online; accessed 30-March-2018].
[6] Wikipedia contributors. 2018c. Recurrent neural network — Wikipedia, the free encyclopedia. [Online; accessed 30-March-2018].
[7] He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770–778.
[8] Brownlee, J. 2017. A gentle introduction to calculating the BLEU score for text in Python.
[9] Karpathy, A., and Fei-Fei, L. 2015. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3128–3137.
[10] Karpathy, A. 2015. The unreasonable effectiveness of recurrent neural networks. Andrej Karpathy's Blog.
[11] Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; and Zitnick, C. L. 2014. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, 740–755. Springer.
[12] Mathews, A. P.; Xie, L.; and He, X. 2016. SentiCap: Generating image descriptions with sentiments. In AAAI, 3574–3580.
[13] Rashtchian, C.; Young, P.; Hodosh, M.; and Hockenmaier, J. 2010. Collecting image annotations using Amazon's Mechanical Turk. In Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk, 139–147. Association for Computational Linguistics.
[14] Rosebrock, A. 2017. ImageNet: VGGNet, ResNet, Inception, and Xception with Keras. PyImageSearch website.
[15] Shin, A.; Ushiku, Y.; and Harada, T. 2016. Image captioning with sentiment terms via weakly-supervised sentiment dataset. In BMVC.
[16] Simonyan, K., and Zisserman, A. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
[17] Shinde, V. D.; Dave, M. P.; Singh, A. M.; and Dubey, A. C. Image caption generator using big data and machine learning.
[18] Trask, A. 2015. Anyone can learn to code an LSTM-RNN in Python (Part 1: RNN). iamtrask.github.io blog.
[19] Vinyals, O.; Toshev, A.; Bengio, S.; and Erhan, D. 2015. Show and tell: A neural image caption generator. In Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on, 3156–3164. IEEE.
[20] Wikipedia contributors. 2018a. BLEU — Wikipedia, the free encyclopedia. [Online; accessed 30-April-2018].
[21] Wikipedia contributors. 2018b. Cross entropy — Wikipedia, the free encyclopedia. [Online; accessed 30-April-2018].
[22] Xu, K.; Ba, J.; Kiros, R.; Cho, K.; Courville, A.; Salakhutdinov, R.; Zemel, R.; and Bengio, Y. 2015. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning, 2048–2057.
