AI based Presentation Creator With Customized Audio Content Delivery

1st Muvazima Mansoor, ECE, PES University, Bengaluru, India, [email protected]
2nd Srikanth Chandar, ECE, PES University, Bengaluru, India, [email protected]
3rd Ramamoorthy Srinath, CSE, PES University, Bengaluru, India, [email protected]
Abstract—In this paper, we propose an architecture to solve a novel problem statement that has stemmed more so in recent times with the increase in demand for virtual content delivery due to the COVID-19 pandemic. Educational institutions, workplaces, research centres, etc. are trying to bridge the gap in communication during these socially distanced times through online content delivery. The trend now is to create presentations and then deliver them over various virtual meeting platforms. The time spent in creating and delivering such presentations is what we try to reduce and eliminate in this paper, which aims to use Machine Learning (ML) algorithms and Natural Language Processing (NLP) modules to automate the process of creating a slide-based presentation from a document, and then use state-of-the-art voice cloning models to deliver the content in the desired author's voice. We consider a structured document such as a research paper to be the content that has to be presented. The research paper is first summarized using BERT summarization techniques and condensed into bullet points that go onto the slides. A Tacotron-inspired architecture with an Encoder, a Synthesizer, and a Generative Adversarial Network (GAN) based Vocoder is used to convey the contents of the slides in the author's voice (or any customized voice). The world is facing a pandemic, and people have had to make significant changes in their lifestyles to adapt to it. Almost all learning has now shifted to online mode, and working professionals now work from the comfort of their homes. Due to the current situation, teachers and professionals have turned to presentations to help them impart information. In this paper, we aim to reduce the considerable amount of time taken in creating a presentation by automating this process and subsequently delivering the presentation in a customized voice, using a content delivery mechanism that can clone any voice from a short audio clip.

Index Terms—Voice Cloning, Generative Adversarial Networks, Summarization, Natural Language Processing, Machine Learning, Tacotron, Transformers.
I. INTRODUCTION

A system or product aiming to serve this exact application is non-existent at the moment. There have been subsets of the application we propose that have garnered significant attention over the last few years. The ideas behind the techniques and algorithms used in summarization models are not entirely new. However, the use of such algorithms in the context of summarizing a research paper and subsequently generating a slide-based presentation is unheard of. The concept of voice cloning is gaining traction, but none that we know of use the idea in the context of automating the process of virtual content delivery. Multiple speech synthesis startups like Lyrebird and Sonantic have obtained sizable grants and investments. Lyrebird aims to offer its voice cloning API so that third parties can make use of the audio mimicry technology for their own needs, whereas Sonantic aims to use its voice cloning feature in video games. The objective of our project is to implement a voice cloning feature that reads out a presentation created by summarizing a research paper, using a custom voice.

II. STRUCTURE

The project involves four sub-problems (a code-level sketch of the pipeline follows Fig. 1):
• Identification of sub-topics from the paper and conversion of these topics into hierarchical bullet points that can go on each slide of a presentation.
• Content generation from these points on each slide.
• Voice recognition: mimicking the style and tone of a chosen voice.
• Presentation of the above content using a customized text-to-content delivery mechanism.

Fig. 1. Structure Flowchart.
III. PAST WORK

A. Transfer Learning from Speaker Verification to Multi-speaker Text-To-Speech Synthesis

Ye Jia, et al. describe a text-to-speech synthesis system built from multiple neural networks that can generate speech audio in multiple voices, including voices not seen during training. [1]
The system consists of three independently trained components:
Encoder: used to generate a fixed-dimensional embedding vector from a few seconds of reference speech. It is trained on a speaker verification task using the LibriSpeech dataset, which consists of untranscribed noisy speech from hundreds of speakers.
Synthesizer: based on the speaker embedding derived from the encoder, the synthesizer generates a Mel spectrogram from the text. The synthesizer model is based on Tacotron 2. [2]
Vocoder: converts the Mel spectrogram generated by the synthesizer into waveform samples in the time domain. The SV2TTS model uses a WaveNet-based vocoder, which is autoregressive.
They demonstrate that the model is capable of transferring the speaker variability knowledge learned by the encoder to the multi-speaker text-to-speech task, and that it is able to generate natural speech from speakers not seen during training.
The authors mention that a large and diverse speaker dataset for training the encoder is very important in order to obtain the best performance. Finally, they show that randomly sampled speaker embeddings can be used to generate the voices of new speakers not seen during training, which indicates that the trained model has learned a speaker representation of good quality.
The SV2TTS model is capable of generating realistic speech from speakers unseen in the training set, implying that the model has learned to utilize a realistic depiction of the space of speaker variation. [6]
The SV2TTS model does not attain human-level naturalness, despite the use of a WaveNet vocoder. This is due to the added difficulty of generating speech for many different speakers given very little data per speaker, and the use of low-quality datasets. [7]
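The key property the speaker-verification objective gives the encoder is geometric: embeddings of utterances from the same speaker end up close together, while embeddings from different speakers do not. A minimal sketch of this intuition, using random stand-in vectors in place of real encoder outputs (the 256-dimensional size matches what SV2TTS reports, but is an assumption here):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Score in [-1, 1]; values near 1 suggest the same speaker."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stand-ins for encoder outputs; a real encoder would produce these
# from a few seconds of reference speech.
emb_ref   = np.random.randn(256)                     # reference utterance
emb_same  = emb_ref + 0.1 * np.random.randn(256)     # same speaker, new clip
emb_other = np.random.randn(256)                     # different speaker

print(cosine_similarity(emb_ref, emb_same))    # close to 1.0
print(cosine_similarity(emb_ref, emb_other))   # near 0.0
```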
B. Tacotron: Towards End-to-End Speech Synthesis

A text-to-speech model usually contains stages such as a text analysis model, an acoustic model, and an audio synthesis module. Constructing these models usually requires extensive domain knowledge and may involve fragile design choices. Yuxuan Wang, et al. present Tacotron, an end-to-end generative TTS model that generates speech directly from characters. The model can be trained from scratch on text and audio pairs with random initialization. [2]
The Tacotron model has several advantages. It reduces the need for arduous feature engineering, which may involve fragile design choices and heuristics. It allows rich conditioning on various attributes, such as speaker, language, or sentiment; this is because conditioning can take place at the initial stages of the model rather than only on specific components, and likewise, adaptation to new data could be easier. Finally, a single model is more likely to be robust than a multi-stage model, where every stage's errors can compound.
These advantages suggest that an end-to-end model could facilitate training on the large amounts of noisy data that are found everywhere. Fig. 2 shows the model architecture.
Text to speech is a large-scale inverse problem: a highly compressed text is decompressed into audio. The same text can correspond to various intonations or speaking styles, which makes this an arduous learning task for the model. For a given input, the model must cope with large variations at the signal level. [10]

Fig. 2. Tacotron Model Architecture.
C. High Fidelity Speech Synthesis With Adversarial Networks

In recent years, Generative Adversarial Networks have developed rapidly and have led to extraordinary advances in the generative modeling of images, yet their application in the audio domain has received little attention; autoregressive models like WaveNet remain the most widely used in the generative modeling of audio signals. To address this gap, Jeff Donahue et al. introduce text-to-speech using a Generative Adversarial Network, or GAN-TTS. [3]
Generative Adversarial Networks form a subgroup of generative models that involves the adversarial training of two networks:
• Generator: tries to generate samples that resemble the reference distribution.
• Discriminator: imparts a useful gradient signal to the generator by differentiating between generated samples and real samples.
Residual blocks are used in the model; the convolutional layers have equal output and input channels, as sketched below. GANs are capable of producing high-precision speech that sounds as natural as that produced by state-of-the-art models, and they are extremely parallelizable due to an efficient feed-forward generator [5], whereas autoregressive models like WaveNet are not parallelizable. Fig. 3 shows the model architecture. [8]

Fig. 3. GAN vocoder Architecture.
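To make the "equal output and input channels" remark concrete, here is a minimal PyTorch sketch of a residual block in the spirit of GAN-TTS. It is a simplified stand-in, not the paper's exact block (which also uses upsampling and conditioning): keeping the channel count unchanged is what makes the skip-connection addition valid.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Simplified dilated-convolution residual block: both convolutions
    map `channels` -> `channels`, so the input can be added back."""
    def __init__(self, channels: int, dilation: int = 1):
        super().__init__()
        self.conv1 = nn.Conv1d(channels, channels, kernel_size=3,
                               padding=dilation, dilation=dilation)
        self.conv2 = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.conv1(self.relu(x))
        h = self.conv2(self.relu(h))
        return x + h  # equal in/out channels make this addition shape-safe

x = torch.randn(1, 64, 1000)                    # (batch, channels, time)
print(ResidualBlock(64, dilation=2)(x).shape)   # torch.Size([1, 64, 1000])
```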
D. Leveraging BERT for Extractive Text Summarization on Lectures

This paper describes a Python-based RESTful service that uses the Bidirectional Encoder Representations from Transformers (BERT) model for text embeddings. For summary selection, the project utilizes K-Means clustering to determine the sentences that are nearest to the centroids. Apart from summarization, the project provides features for the management of lectures and summaries, and supports collaboration by storing content on the cloud. [4]
There are two different types of automatic text summarization:
• Abstractive: abstractive summarization closely resembles human summarization by using a vocabulary beyond the provided text. It abstracts the important points present in the text and is usually smaller in size. Though this approach is very useful, it is arduous to produce automatically: it requires multiple GPUs and takes many days to train.
• Extractive: uses only content from the given text, such as the raw phrases and sentences, to provide a summarization of the text.
Derek Miller uses BERT for the summarization process. BERT is an unsupervised model built on top of the Transformer architecture, and it performs better than previous NLP models for a broad range of functions. In other summarization models, it was not possible to obtain dynamic summary sizes. The BERT model produces sentence embeddings, and these embeddings can be clustered into K clusters, which permits dynamic summary sizes. BERT combines context with the most important sentences and therefore performs much better than methods like TextRank in terms of quality. [15] A minimal sketch of this clustering step follows.
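The sketch below implements the cluster-then-pick-nearest-to-centroid selection with scikit-learn. TF-IDF vectors stand in for BERT sentence embeddings purely to keep the example self-contained; the pipeline described above would substitute BERT embeddings for the vectorizer output.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import pairwise_distances_argmin_min

def extractive_summary(sentences, k=3):
    """Pick k representative sentences: cluster sentence vectors into k
    clusters, then keep the sentence closest to each centroid. k directly
    controls the summary length, which is what makes the size 'dynamic'."""
    # TF-IDF as a stand-in for BERT sentence embeddings (assumption made
    # only so this sketch runs without downloading a model).
    vectors = TfidfVectorizer().fit_transform(sentences).toarray()
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(vectors)
    nearest, _ = pairwise_distances_argmin_min(km.cluster_centers_, vectors)
    # Preserve the original sentence order in the summary.
    return [sentences[i] for i in sorted(set(nearest))]
```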
IV. METHODOLOGY AND SYSTEM DESIGN - TEXT TO SPEECH WITH VOICE CLONING

Fig. 4. Architecture.

A. Speaker Encoder

The speaker encoder block is used to derive an embedding of the user's voice from a short audio clip. Our project uses the speaker encoder model from SV2TTS [1]. In SV2TTS, the speaker encoder network is trained on a speaker verification task using the LibriSpeech dataset, which consists of untranscribed noisy speech from hundreds of speakers. This allows the model to produce an embedding vector of fixed dimensions from only a few seconds of reference speech. Other multi-speaker speech synthesis approaches work by incorporating hundreds of hours of the target speaker's voice in the training dataset; such methods require many hours of transcribed data and do not allow fitting new voices without retraining the computationally heavy model.

B. Synthesizer

Based on the speaker embedding derived from the encoder, the synthesizer generates a Mel spectrogram from text. Our project uses the synthesizer model from Tacotron [2]. Tacotron is an end-to-end generative TTS model that generates speech directly from characters, and it can be trained from scratch on text and audio pairs with random initialization. This approach does not require complex linguistic and acoustic features as input, in contrast to other models like Deep Voice [14] and VoiceLoop [13].
C. Neural Vocoder

A vocoder converts the Mel spectrogram generated by the synthesizer into waveform samples in the time domain. Our project uses a GAN (Generative Adversarial Network) based vocoder [5]. GANs are capable of producing high-precision speech that sounds as natural as that produced by state-of-the-art models, and they are extremely parallelizable due to an efficient feed-forward generator, whereas autoregressive models like WaveNet are not parallelizable.
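At inference time, the three independently trained components chain into a single text-in, audio-out path. The sketch below shows that wiring as a function; the `encoder`, `synthesizer`, and `vocoder` arguments are the loaded models passed in as callables, since the concrete class names and loading code depend on the codebase and are not specified here.

```python
import numpy as np
import soundfile as sf  # writes the generated waveform to disk

def speak_in_cloned_voice(text, reference_wav, encoder, synthesizer, vocoder,
                          out_path="slide.wav", sr=16000):
    """Chain the three pretrained components at inference time.
    The three model arguments are assumed callables (hypothetical
    interfaces, not literal class names from any specific repo)."""
    embedding = encoder(reference_wav)   # few seconds of audio -> fixed-dim vector
    mel = synthesizer(text, embedding)   # text + embedding -> Mel spectrogram
    waveform = vocoder(mel)              # Mel spectrogram -> time-domain samples
    sf.write(out_path, np.asarray(waveform), sr)
    return out_path
```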
V. SUMMARIZATION OF A RESEARCH PAPER AND CREATING A PRESENTATION

Our project uses BERT (Bidirectional Encoder Representations from Transformers) summarization [4] for the summary creation process. BERT is an unsupervised model built on top of the Transformer architecture, and it performs better than previous NLP models for a broad range of functions, including summarization. In other summarization models, it is not possible to obtain dynamic summary sizes. The BERT model produces sentence embeddings, and these embeddings can be clustered into K clusters, which permits dynamic summary sizes. BERT combines context with the most important sentences and therefore performs much better than methods like TextRank in terms of quality.
Our approach is to apply BERT summarization to every section of the research paper, with each summarized section going onto a new page of the presentation. Each page of the presentation also contains a hyperlink that directs the user to the section of the research paper that has been summarized; a sketch of this slide-building step follows. To validate our results, we compared the rouge-l scores of the summaries generated by the BERT summarizer, using author-generated highlights as the reference summary.
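A minimal sketch of the slide-building step with the python-pptx library is shown below, assuming one slide per summarized section. The URL-fragment scheme for the back-link is an assumption made for illustration; any per-section anchor would do.

```python
from pptx import Presentation
from pptx.util import Pt

def build_slides(summaries, paper_url, out_path="paper.pptx"):
    """summaries maps a section title -> list of bullet strings."""
    prs = Presentation()
    layout = prs.slide_layouts[1]  # built-in "Title and Content" layout
    for title, bullets in summaries.items():
        slide = prs.slides.add_slide(layout)
        slide.shapes.title.text = title
        body = slide.placeholders[1].text_frame
        body.clear()
        for i, point in enumerate(bullets):
            para = body.paragraphs[0] if i == 0 else body.add_paragraph()
            run = para.add_run()
            run.text = point
            run.font.size = Pt(20)
        # Hyperlink the last bullet back to the summarized section of the
        # paper (anchor format is a hypothetical convention).
        run.hyperlink.address = f"{paper_url}#{title.replace(' ', '-')}"
    prs.save(out_path)
```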
VI. IMPLEMENTATION

This section provides an overview of how the various machine learning models (the summarizer and all three segments of the voice cloning pipeline) have been tied together into a web-based application. Given that this is not very research-centric, the section will remain brief. Its purpose is to give the reader an idea of the implementation aspect of this research, and of how the timing constraints one may have assumed to be a problem for complex neural networks such as the (GAN-based) cloning model are actually handled. [11], [12]
The architecture is a Python implementation that combines Streamlit and Ngrok, using Google Colab as a codebase to utilize its computational power for running the ML models. Streamlit is a platform/library that enables data scripts, such as the Python modules used in this paper, to be converted into web applications with ease. Ngrok exposes such a web application by providing a secure URL to the localhost server through any NAT or firewall. This also addresses security, as it provides a secure authenticator key for connecting to the localhost (web) application. The architecture, as shown in Fig. 5, is connected to Google Drive as the database. The pretrained ML models (Encoder, Synthesizer, and Vocoder), parsed papers, and the generated PPTs and audio clips are stored in Google Drive. These models are called from Google Colab as per the implementation; the web application is hosted on the localhost using Streamlit, and ngrok is used to securely connect Colab to the localhost. [9] A minimal sketch of this hosting glue is shown below.

Fig. 5. Web Application Architecture.
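The sketch assumes the Streamlit script lives in `app.py` (a placeholder name) and that the `streamlit` and `pyngrok` packages are installed in the Colab runtime.

```python
# Run from a Colab cell: start the Streamlit app locally, then open an
# ngrok tunnel so it is reachable through Colab's NAT/firewall.
import subprocess
from pyngrok import ngrok

# Launch the Streamlit app on localhost:8501 in the background.
proc = subprocess.Popen(["streamlit", "run", "app.py",
                         "--server.port", "8501"])

# Secure public URL that forwards to the local server.
public_url = ngrok.connect(8501)
print("App available at:", public_url)
```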
VII. RESULTS

The architecture proposed in this paper combines elements from different methods into a single voice cloning approach: the encoder is from the SV2TTS model, the synthesizer is inspired by Tacotron, and the vocoder is the GAN-based model. When combined, the architecture produced voice cloning results that match the reference audio to a good degree [13], [14]. To determine the accuracy of the summaries generated by the BERT summarizer, we computed rouge scores of the generated summaries, using the highlights provided by the author as the reference summary. We also computed rouge scores for different summarization models and compared their performance with the BERT summarizer (a sketch of this scoring step follows the list below). The different summarization models used are:

• Abstractive summarization using a TensorFlow- and Keras-based neural network. It performed poorly and gave an f-score of 0.115.

            f        p        r
rouge-1   0.17532  0.14545  0.12188
rouge-2   0.03075  0.06458  0.02072
rouge-l   0.11545  0.29515  0.10673
TABLE I: Rouge scores of abstractive summarization

• Extractive summarization using the TextRank algorithm, an extractive and unsupervised text summarization technique. It gave an f-score of 0.1538.

            f        p        r
rouge-1   0.19631  0.14545  0.30188
rouge-2   0.07453  0.05504  0.11538
rouge-l   0.15384  0.11594  0.22857
TABLE II: Rouge scores of extractive summarization using TextRank

• Extractive summarization using an SVM: we scored each sentence based on the words that overlapped between the main body of the paper and the author-generated summaries, and the top "n" sentences with the highest scores were picked as summary sentences. To do this, we used a Support Vector Machine regressor to predict sentence scores based on document vectors generated by Gensim's doc2vec function. It gave an f-score of 0.16.

            f        p        r
rouge-1   0.20930  0.23076  0.19148
rouge-2   0.05882  0.06976  0.05084
rouge-l   0.16004  0.17948  0.14893
TABLE III: Rouge scores of extractive summarization using SVM

• Extractive summarization using BERT, an unsupervised model built on top of the Transformer architecture. It performed much better than the other summarization models and gave an f-score of 0.45.

            f        p        r
rouge-1   0.442307 0.45098  0.43396
rouge-2   0.15686  0.16     0.15384
rouge-l   0.45714  0.45714  0.45714
TABLE IV: Rouge scores of extractive summarization using BERT
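The f/p/r entries in Tables I-V can be produced with a ROUGE implementation such as the `rouge` Python package, whose output mirrors the table layout (the specific library used in this project is an assumption here; the strings below are placeholders, not real outputs).

```python
from rouge import Rouge

generated = "bert clusters sentence embeddings to pick summary sentences"
reference = "the author highlights serve as the reference summary"

scores = Rouge().get_scores(generated, reference)[0]
for metric in ("rouge-1", "rouge-2", "rouge-l"):
    s = scores[metric]  # dict with f-score, precision, recall
    print(f"{metric}: f={s['f']:.5f} p={s['p']:.5f} r={s['r']:.5f}")
```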
To compute the accuracy of the PPT generated by our application, we compared it with a PPT created by humans, treated as a gold standard. This comparison gave a rouge-1 score of 0.49 and a rouge-l score of 0.35.

            f        p        r
rouge-1   0.49519  0.44557  0.55725
rouge-2   0.17657  0.15885  0.19872
rouge-l   0.35419  0.32051  0.39577
TABLE V: Rouge scores of the generated PPT against a human-generated PPT
The voice cloning results can be found here. As is evident from the voices, the cloning module has successfully copied the fundamental properties of the voices, such as pitch, depth, timbre, and frequency. It does not, however, account for style-specific features such as accent, voice style, and specific pronunciations. The three voices considered for the sake of results are the authors' and Amitabh Bachchan's (a famous celebrity in Bollywood). The cloned voices can be contrasted with the original voices, where available, to notice the similarity.
ACKNOWLEDGMENT

This work was done as part of the Undergraduate Final Year Capstone Project at PES University, Computer Science Department. We acknowledge all the support from PES University, the Computer Science Department, and the Electronics and Communication Department towards this work. For this, we would like to thank Dr. Shylaja S S of PES University and the rest of the mentors from the Computer Science Department for providing us this opportunity and for their continuous support. We extend our thanks to PES University, which provided us a platform that helped us team up and pursue this project. We also thank Dr. Anuradha M for her support.

REFERENCES

[1] Jia, Ye, et al. "Transfer learning from speaker verification to multispeaker text-to-speech synthesis." Advances in Neural Information Processing Systems. 2018.
[2] Wang, Yuxuan, et al. "Tacotron: Towards end-to-end speech synthesis." arXiv preprint arXiv:1703.10135 (2017).
[3] Bińkowski, Mikołaj, et al. "High fidelity speech synthesis with adversarial networks." arXiv preprint arXiv:1909.11646 (2019).
[4] Miller, Derek. "Leveraging BERT for extractive text summarization on lectures." arXiv preprint arXiv:1906.04165 (2019).
[5] Yang, Geng, et al. "Multi-band MelGAN: Faster waveform generation for high-quality text-to-speech." arXiv preprint arXiv:2005.05106 (2020).
[6] Kim, Jaehyeon, et al. "Glow-TTS: A generative flow for text-to-speech via monotonic alignment search." arXiv preprint arXiv:2005.11129 (2020).
[7] Tu, Tao, et al. "Semi-supervised learning for multi-speaker text-to-speech synthesis using discrete speech representation." arXiv preprint arXiv:2005.08024 (2020).
[8] Kumar, Kundan, et al. "MelGAN: Generative adversarial networks for conditional waveform synthesis." Advances in Neural Information Processing Systems. 2019.
[9] Luong, Hieu-Thi, et al. "Training multi-speaker neural text-to-speech systems using speaker-imbalanced speech corpora." arXiv preprint arXiv:1904.00771 (2019).
[10] Deng, Yan, Lei He, and Frank Soong. "Modeling multi-speaker latent space to improve neural TTS: Quick enrolling new speaker and enhancing premium voice." arXiv preprint arXiv:1812.05253 (2018).
[11] Chen, Yutian, et al. "Sample efficient adaptive text-to-speech." arXiv preprint arXiv:1809.10460 (2018).
[12] Nachmani, Eliya, et al. "Fitting new speakers based on a short untranscribed sample." arXiv preprint arXiv:1802.06984 (2018).
[13] Taigman, Yaniv, et al. "VoiceLoop: Voice fitting and synthesis via a phonological loop." arXiv preprint arXiv:1707.06588 (2017).
[14] Arik, Sercan O., et al. "Deep Voice: Real-time neural text-to-speech." arXiv preprint arXiv:1702.07825 (2017).
[15] Collins, Ed, Isabelle Augenstein, and Sebastian Riedel. "A supervised approach to extractive summarisation of scientific papers." arXiv preprint arXiv:1706.03946 (2017).