AI based Presentation Creator With Customized Audio Content Delivery

1st Muvazima Mansoor, ECE, PES University, Bengaluru, India, [email protected]
2nd Srikanth Chandar, ECE, PES University, Bengaluru, India, [email protected]
3rd Ramamoorthy Srinath, CSE, PES University, Bengaluru, India, [email protected]
Abstract—In this paper, we propose an architecture to solve a novel problem statement that has stemmed more so in recent times with the increase in demand for virtual content delivery due to the COVID-19 pandemic. Educational institutions, workplaces, research centres, etc. are trying to bridge the gap in communication during these socially distanced times through online content delivery. The trend now is to create presentations and then deliver them over various virtual meeting platforms. The time spent in creating and delivering such presentations is what we try to reduce and eliminate in this paper, which aims to use Machine Learning (ML) algorithms and Natural Language Processing (NLP) modules to automate the process of creating a slide-based presentation from a document, and then use state-of-the-art voice cloning models to deliver the content in the desired author's voice. We consider a structured document such as a research paper to be the content that has to be presented. The research paper is first summarized using BERT summarization techniques and condensed into bullet points that go onto the slides. A Tacotron-inspired architecture with an Encoder, a Synthesizer, and a Generative Adversarial Network (GAN) based Vocoder is used to convey the contents of the slides in the author's voice (or any customized voice). The world is facing a pandemic, and people have had to make significant changes in their lifestyles to adapt to it. Almost all learning has now shifted to online mode, and working professionals now work from the comfort of their homes. Due to the current situation, teachers and professionals have turned to presentations to help them impart information. In this paper, we aim to reduce the considerable amount of time taken in creating a presentation by automating this process and subsequently delivering the presentation in a customized voice, using a content delivery mechanism that can clone any voice from a short audio clip.

Index Terms—Voice Cloning, Generative Adversarial Networks, Summarization, Natural Language Processing, Machine Learning, Tacotron, Transformers.
I. INTRODUCTION

A system or product aiming to serve this exact application is non-existent at the moment. There have been subsets of the application we propose that have garnered significant attention over the last few years. The ideas behind the techniques and algorithms used in summarization models are not entirely new. However, the use of such algorithms in the context of summarizing a research paper and subsequently generating a slide-based presentation is unheard of. The concept of voice cloning is gaining traction, but none that we know of use the idea in the context of automating the process of virtual content delivery. Multiple speech synthesis startups like Lyrebird and Sonantic have obtained sizable grants and investments. Lyrebird aims to offer its voice cloning API so that third parties can make use of the audio mimicry technology for their own needs, whereas Sonantic aims to use its voice cloning feature in video games. The objective of our project is to implement a voice cloning feature that reads out a presentation created by summarizing a research paper, using a custom voice.

II. STRUCTURE

The project involves four sub-problems (a code-level sketch of the pipeline follows Fig. 1):
• Identification of sub-topics from the paper and conversion of these topics into hierarchical bullet points that can go on each slide of a presentation.
• Content generation from these points on each slide.
• Voice recognition: mimicking the style and tone of a chosen voice.
• Presentation of the above content using a customized text-to-content delivery mechanism.

Fig. 1. Structure Flowchart.
III. PAST WORK

A. Transfer Learning from Speaker Verification to Multi-speaker Text-To-Speech Synthesis

Ye Jia, et al. describe a text-to-speech synthesis system built from multiple neural networks that can generate speech audio in multiple voices, including voices not seen during training. [1]
The system consists of three independently trained components:
Encoder: used to generate a fixed-dimensional embedding vector from a few seconds of reference speech. It is trained on a speaker verification task using the LibriSpeech dataset, which consists of untranscribed noisy speech from hundreds of speakers.
Synthesizer: based on the speaker embedding derived from the encoder, the synthesizer generates a Mel spectrogram from the text. The synthesizer model is based on Tacotron 2. [2]
Vocoder: converts the Mel spectrogram generated by the synthesizer into waveform samples in the time domain. The SV2TTS model uses a WaveNet-based vocoder, which is autoregressive.
They demonstrate that the model is capable of transferring the speaker variability knowledge learned by the encoder to the multi-speaker text-to-speech task, and that it is able to generate natural speech from speakers not seen during training.
The authors mention that a large and diverse speaker dataset for training the encoder is very important in order to obtain the best performance. Finally, they show that randomly sampled speaker embeddings can be used to generate the voices of new speakers not seen during training, which indicates that the trained model has learned a speaker representation of good quality.
The SV2TTS model is capable of generating realistic speech from speakers unseen in the training set, implying that the model has learned to utilize a realistic depiction of the space of speaker variation. [6]
The SV2TTS model does not attain human-level naturalness, despite the use of a WaveNet vocoder. This is due to the added difficulty of generating speech for many different speakers given very little data per speaker, and the use of low-quality datasets. [7]
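The key property the speaker-verification objective gives the encoder is geometric: embeddings of utterances from the same speaker end up close together, while embeddings from different speakers do not. A minimal sketch of this intuition, using random stand-in vectors in place of real encoder outputs (the 256-dimensional size matches what SV2TTS reports, but is an assumption here):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Score in [-1, 1]; values near 1 suggest the same speaker."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stand-ins for encoder outputs; a real encoder would produce these
# from a few seconds of reference speech.
emb_ref   = np.random.randn(256)                     # reference utterance
emb_same  = emb_ref + 0.1 * np.random.randn(256)     # same speaker, new clip
emb_other = np.random.randn(256)                     # different speaker

print(cosine_similarity(emb_ref, emb_same))    # close to 1.0
print(cosine_similarity(emb_ref, emb_other))   # near 0.0
```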
B. Tacotron: Towards End-to-End Speech Synthesis

A text-to-speech model usually contains stages such as a text analysis model, an acoustic model, and an audio synthesis module. Constructing these models usually requires extensive domain knowledge and may involve fragile design choices. Yuxuan Wang, et al. present Tacotron, an end-to-end generative TTS model that generates speech directly from characters. The model can be trained from scratch on text and audio pairs with random initialization. [2]
The Tacotron model has several advantages. It reduces the need for arduous feature engineering, which may involve fragile design choices and heuristics. It allows rich conditioning on various attributes, such as speaker, language, or sentiment; this is because conditioning can take place at the initial stages of the model rather than only on specific components, and likewise, adaptation to new data could be easier. Finally, a single model is more likely to be robust than a multi-stage model, where every stage's errors can compound.
These advantages suggest that an end-to-end model could facilitate training on the large amounts of noisy data that are found everywhere. Fig. 2 shows the model architecture.
Text to speech is a large-scale inverse problem: a highly compressed text is decompressed into audio. The same text can correspond to various intonations or speaking styles, which makes this an arduous learning task for the model. For a given input, the model must cope with large variations at the signal level. [10]

Fig. 2. Tacotron Model Architecture.
C. High Fidelity Speech Synthesis With Adversarial Networks

In recent years, Generative Adversarial Networks have developed rapidly and have led to extraordinary advances in the generative modeling of images, yet their application in the audio domain has received little attention; autoregressive models like WaveNet remain the most widely used in the generative modeling of audio signals. To address this gap, Jeff Donahue et al. introduce text-to-speech using a Generative Adversarial Network, or GAN-TTS. [3]
Generative Adversarial Networks form a subgroup of generative models that involves the adversarial training of two networks:
• Generator: tries to generate samples that resemble the reference distribution.
• Discriminator: imparts a useful gradient signal to the generator by differentiating between generated samples and real samples.
Residual blocks are used in the model; the convolutional layers have equal output and input channels, as sketched below. GANs are capable of producing high-precision speech that sounds as natural as that produced by state-of-the-art models, and they are extremely parallelizable due to an efficient feed-forward generator [5], whereas autoregressive models like WaveNet are not parallelizable. Fig. 3 shows the model architecture. [8]

Fig. 3. GAN vocoder Architecture.
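To make the "equal output and input channels" remark concrete, here is a minimal PyTorch sketch of a residual block in the spirit of GAN-TTS. It is a simplified stand-in, not the paper's exact block (which also uses upsampling and conditioning): keeping the channel count unchanged is what makes the skip-connection addition valid.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Simplified dilated-convolution residual block: both convolutions
    map `channels` -> `channels`, so the input can be added back."""
    def __init__(self, channels: int, dilation: int = 1):
        super().__init__()
        self.conv1 = nn.Conv1d(channels, channels, kernel_size=3,
                               padding=dilation, dilation=dilation)
        self.conv2 = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.conv1(self.relu(x))
        h = self.conv2(self.relu(h))
        return x + h  # equal in/out channels make this addition shape-safe

x = torch.randn(1, 64, 1000)                    # (batch, channels, time)
print(ResidualBlock(64, dilation=2)(x).shape)   # torch.Size([1, 64, 1000])
```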
D. Leveraging BERT for Extractive Text Summarization on Lectures

This paper describes a Python-based RESTful service that uses the Bidirectional Encoder Representations from Transformers (BERT) model for text embeddings. For summary selection, the project utilizes K-Means clustering to determine the sentences that are nearest to the centroids. Apart from summarization, the project provides features for the management of lectures and summaries, and supports collaboration by storing content on the cloud. [4]
There are two different types of automatic text summarization:
• Abstractive: abstractive summarization closely resembles human summarization by using a vocabulary beyond the provided text. It abstracts the important points present in the text and is usually smaller in size. Though this approach is very useful, it is arduous to produce automatically: it requires multiple GPUs and takes many days to train.
• Extractive: uses only content from the given text, such as the raw phrases and sentences, to provide a summarization of the text.
Derek Miller uses BERT for the summarization process. BERT is an unsupervised model built on top of the Transformer architecture, and it performs better than previous NLP models for a broad range of functions. In other summarization models, it was not possible to obtain dynamic summary sizes. The BERT model produces sentence embeddings, and these embeddings can be clustered into K clusters, which permits dynamic summary sizes. BERT combines context with the most important sentences and therefore performs much better than methods like TextRank in terms of quality. [15] A minimal sketch of this clustering step follows.
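The sketch below implements the cluster-then-pick-nearest-to-centroid selection with scikit-learn. TF-IDF vectors stand in for BERT sentence embeddings purely to keep the example self-contained; the pipeline described above would substitute BERT embeddings for the vectorizer output.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import pairwise_distances_argmin_min

def extractive_summary(sentences, k=3):
    """Pick k representative sentences: cluster sentence vectors into k
    clusters, then keep the sentence closest to each centroid. k directly
    controls the summary length, which is what makes the size 'dynamic'."""
    # TF-IDF as a stand-in for BERT sentence embeddings (assumption made
    # only so this sketch runs without downloading a model).
    vectors = TfidfVectorizer().fit_transform(sentences).toarray()
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(vectors)
    nearest, _ = pairwise_distances_argmin_min(km.cluster_centers_, vectors)
    # Preserve the original sentence order in the summary.
    return [sentences[i] for i in sorted(set(nearest))]
```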
IV. METHODOLOGY AND SYSTEM DESIGN - TEXT TO SPEECH WITH VOICE CLONING

Fig. 4. Architecture.

A. Speaker Encoder

The speaker encoder block is used to derive an embedding of the user's voice from a short audio clip. Our project uses the speaker encoder model from SV2TTS [1]. In SV2TTS, the speaker encoder network is trained on a speaker verification task using the LibriSpeech dataset, which consists of untranscribed noisy speech from hundreds of speakers. This allows the model to produce an embedding vector of fixed dimensions from only a few seconds of reference speech. Other multi-speaker speech synthesis approaches work by incorporating hundreds of hours of the target speaker's voice in the training dataset; such methods require many hours of transcribed data and do not allow fitting new voices without retraining the computationally heavy model.

B. Synthesizer

Based on the speaker embedding derived from the encoder, the synthesizer generates a Mel spectrogram from text. Our project uses the synthesizer model from Tacotron [2]. Tacotron is an end-to-end generative TTS model that generates speech directly from characters, and it can be trained from scratch on text and audio pairs with random initialization. This approach does not require complex linguistic and acoustic features as input, in contrast to other models like Deep Voice [14] and VoiceLoop [13].
C. Neural Vocoder

A vocoder converts the Mel spectrogram generated by the synthesizer into waveform samples in the time domain. Our project uses a GAN (Generative Adversarial Network) based vocoder [5]. GANs are capable of producing high-precision speech that sounds as natural as that produced by state-of-the-art models, and they are extremely parallelizable due to an efficient feed-forward generator, whereas autoregressive models like WaveNet are not parallelizable.
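At inference time, the three independently trained components chain into a single text-in, audio-out path. The sketch below shows that wiring as a function; the `encoder`, `synthesizer`, and `vocoder` arguments are the loaded models passed in as callables, since the concrete class names and loading code depend on the codebase and are not specified here.

```python
import numpy as np
import soundfile as sf  # writes the generated waveform to disk

def speak_in_cloned_voice(text, reference_wav, encoder, synthesizer, vocoder,
                          out_path="slide.wav", sr=16000):
    """Chain the three pretrained components at inference time.
    The three model arguments are assumed callables (hypothetical
    interfaces, not literal class names from any specific repo)."""
    embedding = encoder(reference_wav)   # few seconds of audio -> fixed-dim vector
    mel = synthesizer(text, embedding)   # text + embedding -> Mel spectrogram
    waveform = vocoder(mel)              # Mel spectrogram -> time-domain samples
    sf.write(out_path, np.asarray(waveform), sr)
    return out_path
```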
V. SUMMARIZATION OF A RESEARCH PAPER AND CREATING A PRESENTATION

Our project uses BERT (Bidirectional Encoder Representations from Transformers) summarization [4] for the summary creation process. BERT is an unsupervised model built on top of the Transformer architecture, and it performs better than previous NLP models for a broad range of functions, including summarization. In other summarization models, it is not possible to obtain dynamic summary sizes. The BERT model produces sentence embeddings, and these embeddings can be clustered into K clusters, which permits dynamic summary sizes. BERT combines context with the most important sentences and therefore performs much better than methods like TextRank in terms of quality.
Our approach is to apply BERT summarization to every section of the research paper, with each summarized section going onto a new page of the presentation. Each page of the presentation also contains a hyperlink that directs the user to the section of the research paper that has been summarized; a sketch of this slide-building step follows. To validate our results, we compared the rouge-l scores of the summaries generated by the BERT summarizer, using author-generated highlights as the reference summary.
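A minimal sketch of the slide-building step with the python-pptx library is shown below, assuming one slide per summarized section. The URL-fragment scheme for the back-link is an assumption made for illustration; any per-section anchor would do.

```python
from pptx import Presentation
from pptx.util import Pt

def build_slides(summaries, paper_url, out_path="paper.pptx"):
    """summaries maps a section title -> list of bullet strings."""
    prs = Presentation()
    layout = prs.slide_layouts[1]  # built-in "Title and Content" layout
    for title, bullets in summaries.items():
        slide = prs.slides.add_slide(layout)
        slide.shapes.title.text = title
        body = slide.placeholders[1].text_frame
        body.clear()
        for i, point in enumerate(bullets):
            para = body.paragraphs[0] if i == 0 else body.add_paragraph()
            run = para.add_run()
            run.text = point
            run.font.size = Pt(20)
        # Hyperlink the last bullet back to the summarized section of the
        # paper (anchor format is a hypothetical convention).
        run.hyperlink.address = f"{paper_url}#{title.replace(' ', '-')}"
    prs.save(out_path)
```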
VI. IMPLEMENTATION

This section provides an overview of how the various machine learning models (the summarizer and all three segments of the voice cloning pipeline) have been tied together into a web-based application. Given that this is not very research-centric, the section will remain brief. Its purpose is to give the reader an idea of the implementation aspect of this research, and of how the timing constraints one may have assumed to be a problem for complex neural networks such as the (GAN-based) cloning model are actually handled. [11], [12]
The architecture is a Python implementation that combines Streamlit and Ngrok, using Google Colab as a codebase to utilize its computational power for running the ML models. Streamlit is a platform/library that enables data scripts, such as the Python modules used in this paper, to be converted into web applications with ease. Ngrok exposes such a web application by providing a secure URL to the localhost server through any NAT or firewall. This also addresses security, as it provides a secure authenticator key for connecting to the localhost (web) application. The architecture, as shown in Fig. 5, is connected to Google Drive as the database. The pretrained ML models (Encoder, Synthesizer, and Vocoder), parsed papers, and the generated PPTs and audio clips are stored in Google Drive. These models are called from Google Colab as per the implementation; the web application is hosted on the localhost using Streamlit, and ngrok is used to securely connect Colab to the localhost. [9] A minimal sketch of this hosting glue is shown below.

Fig. 5. Web Application Architecture.
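The sketch assumes the Streamlit script lives in `app.py` (a placeholder name) and that the `streamlit` and `pyngrok` packages are installed in the Colab runtime.

```python
# Run from a Colab cell: start the Streamlit app locally, then open an
# ngrok tunnel so it is reachable through Colab's NAT/firewall.
import subprocess
from pyngrok import ngrok

# Launch the Streamlit app on localhost:8501 in the background.
proc = subprocess.Popen(["streamlit", "run", "app.py",
                         "--server.port", "8501"])

# Secure public URL that forwards to the local server.
public_url = ngrok.connect(8501)
print("App available at:", public_url)
```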
VII. RESULTS

The architecture proposed in this paper combines elements from different methods into a single voice cloning approach: the encoder is from the SV2TTS model, the synthesizer is inspired by Tacotron, and the vocoder is the GAN-based model. When combined, the architecture produced voice cloning results that match the reference audio to a good degree [13], [14]. To determine the accuracy of the summaries generated by the BERT summarizer, we computed rouge scores of the generated summaries, using the highlights provided by the author as the reference summary. We also computed rouge scores for different summarization models and compared their performance with the BERT summarizer (a sketch of this scoring step follows the list below). The different summarization models used are:

• Abstractive summarization using a TensorFlow- and Keras-based neural network. It performed poorly and gave an f-score of 0.115.

            f        p        r
rouge-1   0.17532  0.14545  0.12188
rouge-2   0.03075  0.06458  0.02072
rouge-l   0.11545  0.29515  0.10673
TABLE I: Rouge scores of abstractive summarization

• Extractive summarization using the TextRank algorithm, an extractive and unsupervised text summarization technique. It gave an f-score of 0.1538.

            f        p        r
rouge-1   0.19631  0.14545  0.30188
rouge-2   0.07453  0.05504  0.11538
rouge-l   0.15384  0.11594  0.22857
TABLE II: Rouge scores of extractive summarization using TextRank

• Extractive summarization using an SVM: we scored each sentence based on the words that overlapped between the main body of the paper and the author-generated summaries, and the top "n" sentences with the highest scores were picked as summary sentences. To do this, we used a Support Vector Machine regressor to predict sentence scores based on document vectors generated by Gensim's doc2vec function. It gave an f-score of 0.16.

            f        p        r
rouge-1   0.20930  0.23076  0.19148
rouge-2   0.05882  0.06976  0.05084
rouge-l   0.16004  0.17948  0.14893
TABLE III: Rouge scores of extractive summarization using SVM

• Extractive summarization using BERT, an unsupervised model built on top of the Transformer architecture. It performed much better than the other summarization models and gave an f-score of 0.45.

            f        p        r
rouge-1   0.442307 0.45098  0.43396
rouge-2   0.15686  0.16     0.15384
rouge-l   0.45714  0.45714  0.45714
TABLE IV: Rouge scores of extractive summarization using BERT
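The f/p/r entries in Tables I-V can be produced with a ROUGE implementation such as the `rouge` Python package, whose output mirrors the table layout (the specific library used in this project is an assumption here; the strings below are placeholders, not real outputs).

```python
from rouge import Rouge

generated = "bert clusters sentence embeddings to pick summary sentences"
reference = "the author highlights serve as the reference summary"

scores = Rouge().get_scores(generated, reference)[0]
for metric in ("rouge-1", "rouge-2", "rouge-l"):
    s = scores[metric]  # dict with f-score, precision, recall
    print(f"{metric}: f={s['f']:.5f} p={s['p']:.5f} r={s['r']:.5f}")
```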
To compute the accuracy of the PPT generated by our application, we compared it with a PPT created by humans, treated as a gold standard. This comparison gave a rouge-1 score of 0.49 and a rouge-l score of 0.35.

            f        p        r
rouge-1   0.49519  0.44557  0.55725
rouge-2   0.17657  0.15885  0.19872
rouge-l   0.35419  0.32051  0.39577
TABLE V: Rouge scores of the generated PPT against a human-generated PPT
The voice cloning results can be found here. As is evident from the voices, the cloning module has successfully copied the fundamental properties of the voices, such as pitch, depth, timbre, and frequency. It does not, however, account for style-specific features such as accent, voice style, and specific pronunciations. The three voices considered for the sake of results are the authors' and Amitabh Bachchan's (a famous celebrity in Bollywood). The cloned voices can be contrasted with the original voices, where available, to notice the similarity.
ACKNOWLEDGMENT

This work was done as part of the Undergraduate Final Year Capstone Project at PES University, Computer Science Department. We acknowledge all the support from PES University, the Computer Science Department, and the Electronics and Communication Department towards this work. For this, we would like to thank Dr. Shylaja S S of PES University and the rest of the mentors from the Computer Science Department for providing us this opportunity and for their continuous support. We extend our thanks to PES University, which provided us a platform that helped us team up and pursue this project. We also thank Dr. Anuradha M for her support.

REFERENCES

[1] Jia, Ye, et al. "Transfer learning from speaker verification to multispeaker text-to-speech synthesis." Advances in Neural Information Processing Systems. 2018.
[2] Wang, Yuxuan, et al. "Tacotron: Towards end-to-end speech synthesis." arXiv preprint arXiv:1703.10135 (2017).
[3] Bińkowski, Mikołaj, et al. "High fidelity speech synthesis with adversarial networks." arXiv preprint arXiv:1909.11646 (2019).
[4] Miller, Derek. "Leveraging BERT for extractive text summarization on lectures." arXiv preprint arXiv:1906.04165 (2019).
[5] Yang, Geng, et al. "Multi-band MelGAN: Faster waveform generation for high-quality text-to-speech." arXiv preprint arXiv:2005.05106 (2020).
[6] Kim, Jaehyeon, et al. "Glow-TTS: A generative flow for text-to-speech via monotonic alignment search." arXiv preprint arXiv:2005.11129 (2020).
[7] Tu, Tao, et al. "Semi-supervised learning for multi-speaker text-to-speech synthesis using discrete speech representation." arXiv preprint arXiv:2005.08024 (2020).
[8] Kumar, Kundan, et al. "MelGAN: Generative adversarial networks for conditional waveform synthesis." Advances in Neural Information Processing Systems. 2019.
[9] Luong, Hieu-Thi, et al. "Training multi-speaker neural text-to-speech systems using speaker-imbalanced speech corpora." arXiv preprint arXiv:1904.00771 (2019).
[10] Deng, Yan, Lei He, and Frank Soong. "Modeling multi-speaker latent space to improve neural TTS: Quick enrolling new speaker and enhancing premium voice." arXiv preprint arXiv:1812.05253 (2018).
[11] Chen, Yutian, et al. "Sample efficient adaptive text-to-speech." arXiv preprint arXiv:1809.10460 (2018).
[12] Nachmani, Eliya, et al. "Fitting new speakers based on a short untranscribed sample." arXiv preprint arXiv:1802.06984 (2018).
[13] Taigman, Yaniv, et al. "VoiceLoop: Voice fitting and synthesis via a phonological loop." arXiv preprint arXiv:1707.06588 (2017).
[14] Arik, Sercan O., et al. "Deep Voice: Real-time neural text-to-speech." arXiv preprint arXiv:1702.07825 (2017).
[15] Collins, Ed, Isabelle Augenstein, and Sebastian Riedel. "A supervised approach to extractive summarisation of scientific papers." arXiv preprint arXiv:1706.03946 (2017).