Key-shots Based Video Summarization By
Applying Self-Attention Mechanism
PROJECT SYNOPSIS
BACHELOR OF ENGINEERING
Computer Engineering
SUBMITTED BY
Girish Mulmule - 41081
Gaurav Gavhane - 41079
Shivani Kale - 41080
Bhagyashree Vichare - 41082
Yogesh Kadam - 41083
Under the guidance of
Prof. Pradeep Patil
Department of Computer Engineering
P. E. S. Modern College of Engineering,
Pune.
2020-2021
Contents
1 Title
2 Domain
3 Keywords
4 Team
5 Literature Review
6 Objective
7 Problem Statement
8 Scope
9 Brief Description
10 Technical Details
11 Probable Date of Completion
12 References
List of Figures
1 Architecture Diagram
1 Title
Key-shots Based Video Summarization By Applying Self-Attention Mechanism
2 Domain
Machine Learning
3 Keywords
Bi-LSTM (Bi-Directional Long Short Term Memory),
TVSum (Title Based Video Summarization),
AVS (Attentive Video Summarization),
RNN (Recurrent Neural Networks),
OVP (Open Video Project),
SDLC (Software Development Life Cycle),
GAN (Generative Adversarial Network).
4 Team
Group Id: 20
Team Members:
1. Girish Mulmule - 41081
2. Gaurav Gavhane - 41079
3. Shivani Kale - 41080
4. Bhagyashree Vichare - 41082
5. Yogesh Kadam - 41083
5 Literature Review
Based on Zhong Ji, Kailin Xiong, Yanwei Pang, and Xuelong Li, “Video summarization with
attention-based encoder-decoder networks”, Tianjin University, Xi’an Institute of Optics and
Precision Mechanics, CoRR abs/1708.09545, 2018. [1]
Video is inundating Internet social platforms: more than 300 hours of video are uploaded to
YouTube every minute, and browsing these videos is extremely time consuming. It has therefore
become increasingly important to browse, manage, and retrieve videos efficiently. An ideal video
summary provides users the maximum information about the target video in the shortest time. It is
also useful for many other practical applications, such as video indexing, video retrieval, and
event detection. Its main goal is to produce a compact yet comprehensive summary that enables an
efficient browsing experience. The paper proposes a novel video summarization framework named
Attentive encoder-decoder networks for Video Summarization (AVS), in which the encoder uses a
Bidirectional Long Short-Term Memory (BiLSTM) network to encode the contextual information
among the input video frames. For the decoder, two attention-based LSTM networks are explored,
using additive and multiplicative attention scoring functions respectively. The results demonstrate
the superiority of the proposed AVS-based approaches over state-of-the-art approaches, with
remarkable improvements.
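To make the two decoder variants concrete, below is a minimal PyTorch sketch of the two attention
scoring schemes the paper names, additive (Bahdanau-style) and multiplicative (Luong-style). The
class names, dimensions, and interfaces here are illustrative assumptions for this synopsis, not the
authors' exact implementation.

import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    # Additive (Bahdanau-style) scoring: score_t = v^T tanh(W_h h_t + W_s s).
    def __init__(self, enc_dim, dec_dim, attn_dim):
        super().__init__()
        self.W_h = nn.Linear(enc_dim, attn_dim, bias=False)
        self.W_s = nn.Linear(dec_dim, attn_dim, bias=False)
        self.v = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, enc_outputs, dec_state):
        # enc_outputs: (T, enc_dim) per-frame encodings; dec_state: (dec_dim,).
        scores = self.v(torch.tanh(self.W_h(enc_outputs) + self.W_s(dec_state)))
        return torch.softmax(scores.squeeze(-1), dim=0)  # weights over T frames

class MultiplicativeAttention(nn.Module):
    # Multiplicative (Luong-style) scoring: score_t = h_t^T W s.
    def __init__(self, enc_dim, dec_dim):
        super().__init__()
        self.W = nn.Linear(dec_dim, enc_dim, bias=False)

    def forward(self, enc_outputs, dec_state):
        scores = enc_outputs @ self.W(dec_state)  # (T,)
        return torch.softmax(scores, dim=0)

# Example: attend over 60 encoded frames with a 256-d decoder state.
weights = AdditiveAttention(512, 256, 128)(torch.randn(60, 512), torch.randn(256))

In both cases the weights sum to one over the frames, so the decoder's context vector is simply the
weighted sum of the encoder outputs.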
6 Objective
1. Digital videos are nowadays becoming more common in various fields such as education and
entertainment, owing to increased computational power and electronic storage capacity.
2. With the growing size of video collections, a technology is needed to browse through videos
effectively and efficiently without losing their important content.
3. Producing a short summary of a long video without compromising on these points is what
video summarization promises.
7 Problem Statement
To generate a short summary of the content of a longer video by selecting and presenting only
the most informative or essential highlights for potential users, by implementing key-shot based
video summarization using an attention-based mechanism.
8 Scope
1. To create a tool that generates a non-redundant, coherent summary, improving the efficiency
of browsing multimedia documents while reducing redundancy.
2. To achieve better accuracy and better efficiency than existing extractive methods.
9 Brief Description
Figure 1: Architecture Diagram
1. After the video is input, two forms of output are possible. The first is selected key-frames,
where the summarization result is a subset of isolated frames. The second is interval-based
key-shots, where the summary is a set of short intervals along the timeline. Instead of binary
selected/not-selected information, certain datasets provide frame-level importance scores
computed from human annotations; these scores represent the likelihood of each frame being
selected as part of the summary. Our models can make use of all three types of annotation as
learning signals: binary key-frame labels, binary sub-shot labels, and frame-level importance
scores. The selected key-frames are then separated from the unwanted footage and passed
through the encoder, as shown in Figure 1 (a sketch of key-shot selection from frame-level
scores appears after this list).
2. The encoder uses LSTM (Long Short-Term Memory) units. LSTMs are a special kind of
recurrent neural network that are adept at modelling long-range dependencies. At the core of an
LSTM are memory cells c which encode, at every time step, the knowledge of the inputs that
have been observed up to that step. The model is composed of two LSTM layers: one models
the video sequence in the forward direction and the other in the backward direction (see the
encoder sketch after this list).
3. In a common encoder-decoder framework, an encoder converts the input sequence X =
x1, x2, ..., xT into a representation vector V = v1, v2, ..., vT. The architecture of the encoder
depends on the input of the specific application. For instance, in image captioning, a
Convolutional Neural Network (CNN) is a good choice; in machine translation, it is natural
to use an RNN as the encoder, since the input is a variable-length sequence of symbols. When
applied to video summarization, the LSTM is the most suitable choice, since the contextual
information around a specific frame is necessary for generating a video summary. A human
relies on high-level semantic understanding of the video content: usually only after viewing
the whole sequence can she or he decide which frame or shot should be selected for the
summary. For example, when summarizing a basketball game video, only a key ball that
affects the course of the game should be selected for the summary. Since there are many
goals in a basketball game, it is necessary to consider the scenes before and after a goal to
determine whether it is a key ball. Software, however, is not as capable as the human brain
and must rely on various coherence factors to make such a selection.
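As a concrete illustration of step 1, the following sketch turns frame-level importance scores into
an interval-based key-shot summary. It assumes shot boundaries are already available (for example
from a separate shot segmentation step) and follows the common benchmark convention of a 0/1
knapsack under a 15% summary-length budget; the function name and parameters are hypothetical.

import numpy as np

def select_key_shots(frame_scores, shot_bounds, budget_ratio=0.15):
    # frame_scores: (T,) frame-level importance scores.
    # shot_bounds: list of (start, end) frame-index pairs, one per shot.
    # budget_ratio: fraction of the video length the summary may occupy.
    frame_scores = np.asarray(frame_scores)
    budget = int(budget_ratio * len(frame_scores))
    lengths = [end - start for start, end in shot_bounds]
    values = [frame_scores[start:end].mean() for start, end in shot_bounds]

    # Classic 0/1 knapsack over shots: maximize total importance
    # subject to the summary-length budget.
    n = len(shot_bounds)
    dp = np.zeros((n + 1, budget + 1))
    for i in range(1, n + 1):
        for w in range(budget + 1):
            dp[i][w] = dp[i - 1][w]
            if lengths[i - 1] <= w:
                dp[i][w] = max(dp[i][w],
                               dp[i - 1][w - lengths[i - 1]] + values[i - 1])

    # Backtrack to recover which shots were chosen.
    chosen, w = [], budget
    for i in range(n, 0, -1):
        if dp[i][w] != dp[i - 1][w]:
            chosen.append(i - 1)
            w -= lengths[i - 1]
    return sorted(chosen)

# Example: pick shots from a 100-frame video split into three shots.
print(select_key_shots(np.random.rand(100), [(0, 40), (40, 70), (70, 100)]))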
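And for step 2, a minimal PyTorch sketch of the bidirectional LSTM encoder described above,
assuming hypothetical 1024-dimensional per-frame CNN features and a hidden size of 256:

import torch
import torch.nn as nn

class BiLSTMEncoder(nn.Module):
    # Encodes a sequence of per-frame features in both temporal directions.
    def __init__(self, feat_dim=1024, hidden_dim=256):
        super().__init__()
        self.lstm = nn.LSTM(input_size=feat_dim, hidden_size=hidden_dim,
                            bidirectional=True, batch_first=True)

    def forward(self, frame_feats):
        # frame_feats: (batch, T, feat_dim), one feature vector per frame.
        # outputs: (batch, T, 2 * hidden_dim) -- the forward and backward
        # hidden states are concatenated, so each frame's encoding carries
        # context from both the past and the future of the video.
        outputs, _ = self.lstm(frame_feats)
        return outputs

# Example: encode a hypothetical 60-frame clip.
context = BiLSTMEncoder()(torch.randn(1, 60, 1024))  # shape: (1, 60, 512)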
10 Technical Details
Platform
1. Ubuntu 18.04
Software Specification
1. CUDA: 9.0.176
2. cuDNN: 7.1.2
3. Python: 3.5.2
4. PyTorch: 0.4.1
5. NumPy: 1.16.1
6. JSON: 2.0.9
11 Probable Date of Completion
12 References
[1] Zhong Ji, Kailin Xiong, Yanwei Pang, and Xuelong Li, “Video summarization with
attention-based encoder-decoder networks”, Tianjin University, Xi’an Institute of Optics and
Precision Mechanics, CoRR abs/1708.09545, 2018.
[2] Ke Zhang, Wei-Lun Chao, Fei Sha, and Kristen Grauman, “Video summarization with long
short-term memory”, in Proceedings of the European Conference on Computer Vision,
pp. 766–782, 2016.
[3] Behrooz Mahasseni, Michael Lam, and Sinisa Todorovic, “Unsupervised video summarization
with adversarial LSTM networks”, in Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, pp. 1–10, 2017.
[4] Michael Gygli, Helmut Grabner, and Luc Van Gool, “Video summarization by learning
submodular mixtures of objectives”, in Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, pp. 3090–3098, 2015.
[5] Naveed Ejaz, Irfan Mehmood, and Sung Wook Baik, “Efficient visual attention based
framework for extracting key frames from videos”, Signal Processing: Image Communication,
vol. 28, no. 1, pp. 34–44, 2013.
[6] Ke Zhang, Wei-Lun Chao, Fei Sha, and Kristen Grauman, “Summary transfer: Exemplar-based
subset selection for video summarization”, in Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, pp. 1059–1067, 2016.