2022 International Mobile and Embedded Technology Conference (MECON)

Video Transcript Summarizer

Atluri Naga Sai Sri Vybhavi, Laggisetti Valli Saroja, Jahnavi Duvvuru, Jayanag Bayana
Department of Computer Science and Engineering, VR Siddhartha Engineering College, Vijayawada, India
[email protected]  [email protected]  [email protected]  [email protected]

978-1-6654-2020-4/22/$31.00 ©2022 IEEE | DOI: 10.1109/MECON53876.2022.9751991

Abstract— This project proposes a video summarizing system based on natural language processing (NLP) and machine learning to summarize YouTube video transcripts without losing the key elements. The quantity of videos available on web platforms is steadily expanding. The content is made available globally, primarily for educational purposes. Additionally, educational content is available on YouTube, Facebook, Google, and Instagram. A significant issue in extracting information from videos is that, unlike an image, where data can be collected from a single frame, a viewer must watch the entire video to grasp the context. This study aims to shorten the length of the transcript text of the given video. The suggested method involves retrieving transcripts from the video link provided by the user and then summarizing the text using Hugging Face Transformers and pipelining. The built model accepts a video link and the required summary duration as input from the user and generates a summarized transcript as output. According to the results, the final translated text was obtained in less time when compared with other proposed techniques. Furthermore, the video's central concept is accurately present in the final text without any deviations.

Keywords— Video summarizer, NLP, YouTube, Summarizing algorithms, Transcript.

I. INTRODUCTION

The number of YouTube users was estimated to be over 2.3 billion in 2020, and it has been growing every year. Around 300 hours of video are posted to YouTube per minute. For example, there are many TED Talk videos available online in which the speaker speaks for an extended period of time about a specific topic, but finding the content the speaker is most focused on requires watching the entire video. In this study, we propose to employ the Latent Semantic Analysis (LSA) natural language algorithm, which requires fewer processing resources and no training data.

A. Motivation

Summarizing the transcript of YouTube videos helps the user take a quick glance at what is present in the video, and it also increases readability and convenience. We faced this problem mainly during semester-end exams, when we wanted to revise the content in a short time.

B. Objectives

The goal of this work is to present a method for obtaining a summary transcript from a large stream of YouTube videos, as well as an algorithm to convert a YouTube video's voice feed to text and summarize its key elements.

II. LITERATURE SURVEY

[1] This paper uses deep-learning algorithms to perform video summarization. It analyses the critical parts of the video using techniques like pipelining. As a result, it also compares the similarity of the input video and the modified video. Finally, it also determines the accuracy of the summarized text. [2] They propose a model to do the job using natural language techniques like Latent Semantic Analysis. They use the algebraic statistical method and the MoviePy library to attach video strings based on subtitles to obtain the summarized text. One significant advantage of the model is that it requires less processing power and no prior training data. [3] The paper presents a method to reduce traffic on the network by reducing the audio-visual content. It starts by selecting only essential elements from the original videos and then produces the final video, somewhat like a movie. It is mainly useful for real-time businesses such as the entertainment field. [4] This study discusses how to summarize video sequences with the help of deep neural networks and abstractive summarization. A joint model is proposed that allows users to distinguish between useful and unnecessary information and also achieves better results compared with other methods in this context.
[5] By explicitly modelling both segment and video, a scalable deep neural network is suggested for predicting whether a video segment is a desirable segment for consumers. Furthermore, the study focuses on performing scene and action identification in uncut videos in order to discover more relationships between different parts of video comprehension tasks. In addition, the impact of audio and visual characteristics on the summarising task, and how this model differs from previous methods that obtained summaries based on prior knowledge, is discussed. [6] The main difference in the method proposed here compared with other papers is that it first classifies the videos as static or dynamic and then performs the necessary translation of the transcript. The algorithm used here is Shot Boundary Detection. [7] This study focuses on developing a prototype to index videos, mainly lecture videos, employing syntactic similarity measures. Based on dynamic programming techniques, captions are made available along with the video with the help of an auto-caption generator feature. It also provides a survey of existing video summarization methods. [8] A real-time video summarising technique for mobile platforms is proposed in this paper, which analyses the video during live camera recording and creates a summary in real time. The mentioned method analyses intrinsic video data, such as the video stream's contents, as well as associated external metadata, such as the video stream's external camera information. [9] The study discusses how a normal summarization system with basic features is not suitable for the user and why a unique customized system is needed. The suggested method creates a video summary based on the user's preferences and the top-ranked pictures that are semantically relevant. [10] For a similar category of videos, the algorithm uses supervised learning to perform summarization. First, a set of videos is considered such that, based on the summary of one video, the summaries of the other videos in the same subset are generated. Each transition of a video frame is considered a state, and a loss function and cross-validation scores are used. [11] This study researches how video summarizing is the key to meaningful browsing and video entity activities. It also shows how automated video summarization is possible based on accurate predictions of the transitions of the video sequence. [12] The main focus is to summarize a very long video simply and understandably. For this, a hierarchical video transition graph and time-based constraints are used. [13] For compressed wave files, a method is given that uses a random carrier to embed the watermark in the audio signal sequence. After adaptive differential pulse code modulation and before compression, the watermark is embedded lucently in the audio stream. The proposed approach has been built, and its characteristics have been compared to the best known method of audio watermarking. [14] A system is presented for generating elements for feature selection using support vector machines that includes the augmentation of relational notions using a classification-type method and a variety of feature generation strategies. By incorporating new techniques, classification is utilised to boost the productivity of the feature space. Compared to creating features in advance, feature generation at run time resulted in models with higher accuracy.

III. PROPOSED SYSTEM

The process flow diagram for the proposed methodology describes the data preprocessing and all the actions taking place in the proposed methodology. The process flow diagram, shown in Fig 1, describes the various modules: retrieving the video, extracting emotion, obtaining transcripts and subtitles, and summarizing the video transcript.

Fig 1. Process Flow Diagram

From the given input video, transcripts are generated. After that, the data is trained and the transcripts are processed using pipelining, which creates a model using Python's Hugging Face transformers. Finally, the text derived from the transcripts is summarized and displayed as output.

Firstly, from the input video, the corresponding transcripts are obtained in the data classification phase. Next, in the training phase, the tone of the text received from the transcripts is identified, followed by data pipelining techniques to generate the final summarised text. The architecture of the system is shown in the figure.

Fig 2. Architecture Diagram

A. Methodology

After examining them, we noticed that many of the approaches suggested for summarizing videos take considerable training and execution time. As a result, we evaluated how to resolve this issue. Instead of directly creating the text from the video, we used the transcripts of YouTube videos to summarize the text.
1) To begin, we use a Python API to retrieve the transcripts/subtitles for a given YouTube video.
2) Obtain the transcripts using a custom function that will later be used as the feed input for the NLP engine.
3) Perform extractive and abstractive summarization. In particular, perform abstractive text summarization on the transcript produced in the previous module using Hugging Face's transformers package in Python.
4) Lastly, make the user interface easy so that users can interact with and examine the summarized content. A minimal end-to-end sketch of these steps is given below.
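As a rough illustration of steps 1) to 3), the sketch below retrieves a transcript with the youtube_transcript_api package used later in Module 2 and summarizes it with a Hugging Face pipeline. The paper does not name a specific checkpoint, so the pipeline's default summarization model and the 1000-character chunking (taken from the code in Module 3) should be read as illustrative assumptions rather than the authors' exact setup.

    # Illustrative end-to-end sketch; the default checkpoint and chunk size are assumptions.
    from youtube_transcript_api import YouTubeTranscriptApi
    from transformers import pipeline

    youtube_video = "https://www.youtube.com/watch?v=XXXXXXXXXXX"   # placeholder user link
    video_id = youtube_video.split("=")[1]                          # id extraction as in Module 1

    # Steps 1-2: fetch the transcript and join the caption snippets into one text block
    transcript = YouTubeTranscriptApi.get_transcript(video_id)
    result = " ".join(entry["text"] for entry in transcript)

    # Step 3: abstractive summarization, chunked so each call stays within the model's input limit
    summarizer = pipeline("summarization")
    chunks = [result[i:i + 1000] for i in range(0, len(result), 1000)]
    summary = " ".join(summarizer(chunk)[0]["summary_text"] for chunk in chunks)
    print(summary)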

Fig 3. Applied Methodology

B. Algorithm
Module 1: Preparing the input video and obtaining the required transcripts.
Algorithm
Step 1: Firstly, install transformers, as it will help in data preparation.
Step 2: Next, install youtube_transcript_api so that we can get the transcript of the provided video.
Step 3: Give the link of the video to be summarized.
Step 4: Now obtain the video id with the help of the split function as follows:
    video_id = youtube_video.split("=")[1]
Step 5: Then display the video id.
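The simple split in Step 4 assumes a plain watch URL with a single query parameter. As an illustrative aside (not part of the paper's algorithm), the standard library's urllib.parse handles links that carry extra parameters such as &t= or &list=:

    # Optional, more robust id extraction than the plain split; purely illustrative.
    from urllib.parse import urlparse, parse_qs

    youtube_video = "https://www.youtube.com/watch?v=XXXXXXXXXXX&t=42"   # placeholder link
    video_id = parse_qs(urlparse(youtube_video).query)["v"][0]           # value of the v= parameter
    print(video_id)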
Module 2: Fetch the transcripts of the input video into a function so that they can be processed for summarization.
Step 1: Display the input video for confirmation using the IPython.display module.
Step 2: Get the transcript of the specified video as follows:
    transcript = YouTubeTranscriptApi.get_transcript(video_id)
Step 3: Observe the transcript and its contents with transcript[0:5].
Step 4: Now, obtain the text of the initial video before summarization using the following code:
    result = ""
    for i in transcript:
        result += ' ' + i['text']
Step 5: Finally, print the length of the result so that we can see how long the video's text is.
Module 3: Perform summarization methods with the help of pipelining on the transcripts.
In extractive summarization, the model detects and outputs the relevant phrases and sentences from the actual text. In abstractive summarization, the model generates completely new text that is significantly shorter than the original; like a human, it composes new phrases in a different format. This method is implemented using transformers in this project.
Step 1: Use the pipeline function, which will create a model with the help of Hugging Face transformers.
Step 2: Now compare the initial text and its summarized version for every iteration. The related code is:
    num_iters = int(len(result) / 1000)
    summarized_text = []
    for i in range(0, num_iters + 1):
        start = i * 1000
        end = (i + 1) * 1000
        print("input text \n" + result[start:end])
        out = summarizer(result[start:end])
        out = out[0]
        out = out['summary_text']
        print("Summarized text\n" + out)
        summarized_text.append(out)
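Step 1 of Module 3 presumably instantiates the summarizer object used in the loop above; the paper does not state which checkpoint is loaded, so the library default in this sketch is an assumption.

    # Assumed creation of the summarizer used in Module 3; the checkpoint is not
    # specified in the paper, so the pipeline's default summarization model is used here.
    from transformers import pipeline

    summarizer = pipeline("summarization")
    # summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")  # pin a model explicitly

Chunking the transcript into 1000-character windows, as in the loop above, keeps each call within the model's input limit; splitting on sentence boundaries instead would avoid cutting sentences in half.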
Module 4: Output the summarized version of the text of the video.
Step 1: Generate the word count of the summarized text.
Step 2: Print the summarized text, converting it into a string using str() for easy reading.

Fig 4. Screenshot of the resulted output
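A minimal sketch of Module 4, assuming the per-chunk summaries collected in Module 3 are joined before the word count is taken; the paper only names the two steps, so the exact calls here are assumptions.

    # Module 4 sketch (assumed): join the chunk summaries, report the word count, print the text.
    summary = str(" ".join(summarized_text))
    print(len(summary.split()))   # word count of the summarized text
    print(summary)                # final summarized transcript for the user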
The following dataset is used to experiment the
summarization of text. The dataset is classified into six
columns that contain necessary information to perform the
text summarization.


Fig 5. Dataset

D. Requirements
The functional requirements of a software system define what the system should be able to do; they describe the functions of the software system or its modules. A set of inputs to the system under test is compared with the system's outputs to determine its functionality. The functional requirements for this project are as follows: transcript generation, transcript summarization, and text analysis.
Software requirements are:
1. Programming Language: Python
2. Operating System: Windows 7 (minimum)
3. Development Environment: Google Colab
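In a Google Colab notebook, the two third-party packages used by the modules above can be installed as shown below; the PyPI package names are assumed to match the environment the authors used.

    # Assumed one-time setup in Google Colab
    !pip install transformers youtube-transcript-api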
IV. IMPLEMENTATION

When comparing the length of the text received from the original video to the summary text, the results clearly illustrate how the video transcripts are summarized: the length of the text obtained from the original video is reduced by more than 70%. The most difficult step is condensing the transcript without losing crucial points or distorting the sense of the original content. By observing the input and output text, it is certain that no significant information is removed from the input text and that only frequently used and useless terms are eliminated from the output text.

Fig. 6a. Video Used
Fig. 6b. Word count before summarization
Fig. 6c. Model supplied for execution
Fig. 6d. Word count after summarization
Fig. 6e. Summarized text
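The reduction quoted above can be checked directly from the word counts produced in Modules 2 and 4; a small sketch, assuming result holds the original transcript text and summary the joined summarized text:

    # Rough check of the reduction claim (assumes result and summary from the earlier modules)
    reduction = 1 - len(summary.split()) / len(result.split())
    print(f"Transcript shortened by {reduction:.0%}")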
V. CONCLUSIONS AND FUTURE WORK

In this project we presented a solution to summarize the transcripts of YouTube videos, as it is very useful for the user when examining the material, and we introduced techniques to accurately minimize the size of the text. In its approach to the problem, the suggested solution is effective and easy. The proposed strategies have the potential to reduce the length of a transcript while also maintaining its original meaning. These approaches are also responsible for deleting unwanted phrases. Only English-language YouTube videos were evaluated in this study. This study could be expanded by examining a large number of videos from other industries and languages.
REFERENCES
[1] Apostolidis, Evlampios, et al. "Video Summarization Using Deep
Neural Networks: A Survey." arXiv preprint arXiv:2101.06072 (2021).
[2] Sanjana R, et al. “Video Summarization using NLP” International
Research Journal of Engineering and Technology (IRJET). 2021
[3] Priyanka, G., and M. Prasha Meena. "Survey and Evaluation on Video Summarization Techniques." Journal of Critical Reviews 7.8 (2020).
[4] Aniqa Dilawari and Muhammad Usman Ghani Khan, "Abstractive Summarization of Video Sequences," 2019 IEEE.
[5] Yudong Jiang, Kaixu Cui, Bo Peng, Changliang Xu “Comprehensive
Video Understanding: Video Summarization with Content-Based
Video Recommender Design” International Conference on Computer
Vision Workshop (ICCVW), 2019 IEEE

[6] Holly, Smaïli, Kamel, et al. "A first summarization system of a video
in a target language." International Conference on Multimedia and
Network Information System. Springer, Cham, 2018.
[7] Jaiswal, Shubhangi, and Manoj Misra. "Automatic indexing of lecture
videos using syntactic similarity measures." 2018 5th International
Conference on Signal Processing and Integrated Networks (SPIN).
IEEE, 2018.
[8] Pradeep Choudhary, Sowmya P. Munukutla, K. S. Rajesh, Alok S.
Shukla “Real time video summarization on mobile platform”
International Conference on Multimedia and Expo (ICME), 2017 IEEE
[9] Rajkumar Kannan, Gheorghita Ghinea, Sridhar Swaminathan, Suresh
Kannaiyan “Improving video summarization based on user
preferences” 2013 Fourth National Conference on Computer Vision,
Pattern Recognition, Image Processing and Graphics (NCVPRIPG)
[10] Jayanta Basak, Varun Luthra and Santanu Chaudhury “Video
Summarization with Supervised Learning” 2008 IEEE.
[11] Wei Ren, Yuesheng Zhu, “A Video Summarization Approach based
on Machine Learning” International Conference on Intelligent
Information Hiding and Multimedia Signal Processing, 2008 IEEE
[12] Taskiran, Cuneyt M., et al. "Automated video summarization using
speech transcripts." Storage and Retrieval for Media Databases 2002.
Vol. 4676. International Society for Optics and Photonics, 2001.
[13] Rohit Anand, Gulshan Shrivastava, Sachin Gupta, Sheng-Lung Peng,
Nidhi Sindhwani “Audio Watermarking With Reduced Number of
Random Samples” In Handbook of Research on Network Forensics
and Analysis Techniques (pp. 372-394). IGI Global.
[14] Garima Bakshi, Rati Shukla, Vikash Yadav, Aman Dahiya, Rohit
Anand, Nidhi Sindhwani and Harinder Singh “An Optimized Approach
for Feature Extraction in Multi-Relational Statistical Learning” Journal
of Scientific and Industrial Research (JSIR).
