Digital Signal Processing Laboratory (EEE-316)
Image Captioning Using
CNN & LSTM
Uday Kamal Hasib Amin Rajib Al-Sabah Abhishek Shushil
Id : 1406041 Id : 1406045 Id : 1406035 Id : 1406034
Bangladesh University of Engineering and Technology (BUET)
Presentation Outline
● Problem Statement
● Basic building blocks for the network
- CNN
- Transfer Learning
- RNN
- LSTM
● How do we wire them together?
● Code
● Other places this can be implemented
● Interaction & Questions
Problem Overview
Overall Model:
Building Blocks for the Network:
CNN
A convolution layer is a feature detector that automatically learns to filter out irrelevant
information from the input using convolution kernels.
Pooling layers compute the max or average value of a particular feature over a region of
the input data (downsampling the input images). Pooling also helps detect objects in
unusual positions and reduces memory usage.
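The two operations above can be sketched in plain NumPy. This is a toy example, not the network's actual layers: the vertical-edge kernel and the half-dark image are made up for illustration.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D convolution (really cross-correlation, as in CNNs)."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

def max_pool(feat, size=2):
    """Non-overlapping max pooling: keeps the strongest response per region."""
    h, w = feat.shape[0] // size, feat.shape[1] // size
    return feat[:h*size, :w*size].reshape(h, size, w, size).max(axis=(1, 3))

# a vertical-edge kernel applied to a toy image: left half dark, right half bright
img = np.zeros((6, 6)); img[:, 3:] = 1.0
edge = np.array([[-1., 1.]])           # responds where intensity jumps
fmap = conv2d(img, edge)               # shape (6, 5), fires on the edge column
pooled = max_pool(fmap)                # shape (3, 2), edge response survives pooling
print(fmap.shape, pooled.shape)        # → (6, 5) (3, 2)
```

Note how the edge response survives pooling even though the feature map shrinks — this is the "detect objects in unusual positions, reduce memory" point from the slide.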
Building Blocks for the Network:
Transfer Learning
Building Blocks for the Network:
Inception V3
Building Blocks for the Network:
RNN
● As humans, we understand context
● We do not reset our understanding from scratch every time
● Our thoughts have persistence
● Traditional NNs like CNNs have no persistence
● Speech recognition, language modeling, and translation
all require this persistence
RNNs are general computers that can learn algorithms mapping input
sequences to output sequences (flexibly sized vectors). The output
vector's contents are influenced by the entire history of inputs.
Building Blocks for the Network:
LSTM
The LSTM units give the network memory cells with read, write, and reset
operations. During training, the network learns when it should remember data
and when it should throw it away.
Building Blocks for the Network:
LSTM
C_t is the cell state, which flows through the
entire chain.
Building Blocks for the Network:
LSTM
Forget Gate:
Concatenate
Building Blocks for the Network:
LSTM
Input Gate Layer
New contribution to cell state
Classic neuron
Building Blocks for the Network:
LSTM
Update Cell State (memory):
Building Blocks for the Network:
LSTM
Output Gate Layer
Output to next layer
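The four gate slides above correspond to the standard LSTM update equations. A minimal NumPy sketch of one cell step, with toy sizes and random weights standing in for trained parameters:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
D, H = 4, 3                                   # toy input and hidden sizes
# one weight matrix per gate, each acting on [h_{t-1}, x_t] concatenated
Wf, Wi, Wc, Wo = (rng.normal(size=(D + H, H)) * 0.5 for _ in range(4))
bf = bi = bc = bo = np.zeros(H)

def lstm_step(h_prev, c_prev, x):
    z = np.concatenate([h_prev, x])           # concatenate (forget-gate slide)
    f = sigmoid(z @ Wf + bf)                  # forget gate: what to erase
    i = sigmoid(z @ Wi + bi)                  # input gate layer: what to write
    c_tilde = np.tanh(z @ Wc + bc)            # new contribution to cell state
    c = f * c_prev + i * c_tilde              # update cell state (memory)
    o = sigmoid(z @ Wo + bo)                  # output gate layer
    h = o * np.tanh(c)                        # output to next layer / next step
    return h, c

h, c = np.zeros(H), np.zeros(H)
for x in rng.normal(size=(5, D)):             # run a length-5 sequence
    h, c = lstm_step(h, c, x)
print(h, c)
```

The cell state c is only touched through the multiplicative f and i gates, which is what lets the network learn when to remember and when to forget.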
Building Blocks for the Network:
Word Embedding
Embeddings turn textual data (words, sentences, paragraphs) into
high-dimensional vector representations and group semantically similar
data together in a vector space. This lets a computer detect
similarities mathematically.
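A toy illustration of the idea. The 3-dimensional vectors below are made up for the example (real models learn hundreds of dimensions); cosine similarity is one common way to measure closeness in the vector space:

```python
import numpy as np

# hypothetical hand-picked embeddings; real ones are learned during training
emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.8, 0.9, 0.1]),
    "apple": np.array([0.1, 0.2, 0.9]),
}

def cosine(a, b):
    # similarity of directions: near 1.0 = semantically close, near 0 = unrelated
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(emb["king"], emb["queen"]))   # high: semantically similar words
print(cosine(emb["king"], emb["apple"]))   # low: semantically distant words
```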
Final Model:
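How the pieces wire together can be sketched end to end: CNN image features initialise the decoder state, which then emits caption words greedily. This sketch uses random weights, a made-up 5-word vocabulary, and a vanilla-RNN step standing in for the LSTM, so it illustrates only the data flow, not the trained model:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["<start>", "<end>", "a", "boy", "plays"]   # hypothetical tiny vocabulary
V, E, H, F = len(vocab), 8, 16, 10  # vocab, embedding, hidden, image-feature dims

# random weights standing in for trained parameters
We = rng.normal(size=(V, E))        # word embedding table
Wi = rng.normal(size=(F, H))        # image features -> initial hidden state
Wx = rng.normal(size=(E, H))        # embedding -> hidden
Wh = rng.normal(size=(H, H))        # hidden -> hidden (recurrence)
Wo = rng.normal(size=(H, V))        # hidden -> vocabulary scores

def step(h, tok):
    # one decoder step: embed the previous word, update state, pick next word
    h = np.tanh(We[tok] @ Wx + h @ Wh)
    return h, int((h @ Wo).argmax())            # greedy decoding

feat = rng.normal(size=F)           # stands in for the CNN's image feature vector
h = np.tanh(feat @ Wi)              # image conditions the decoder's initial state
tok, caption = 0, []                # start from the <start> token
for _ in range(5):                  # cap the caption length
    h, tok = step(h, tok)
    if vocab[tok] == "<end>":
        break
    caption.append(vocab[tok])
print(" ".join(caption))
```

In the real model the decoder step is an LSTM and the weights come from training on Flickr8k, but the loop — image features in, one word out per step until `<end>` — is the same.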
Training Data:
Flickr8k Dataset:
The dataset contains 8,000 different images, each with 5 human-labelled
captions.
The image is given 5 different captions:
1) A boy runs as others play on a home-made slip and slide.
2) Children in swimming clothes in a field.
3) Little kids are playing outside with a water hose and are sliding down a water slide.
4) Several children are playing outside with a wet tarp on the ground.
5) Several children playing on a homemade water slide.
Training History:
Model’s Performance on Test Data:
Model’s Performance on Real Data:
Generated captions:
● Three people are on a boat in the water
● Three people pose for a picture together
● One man is sitting at a table in front of a restaurant
● A soccer player prepares to kick the ball
● A group of kids play in the water
● A boy hits the ball at a baseball game.
Application:
● Visual-to-text systems for blind people
● Search engines that search medical records via content-based captions
● Auto-tagging of imaging data
● Automatic video tagging and summary generation