Urdu Text to Speech

Department of Computer Science

Muhammad Ramzan NIM-BSCS2020-17


Syeda Marjan Fatima NIM-BSCS2020-36

Session 2020-2024

Final year project report submitted in partial fulfillment of the requirements for the degree of
Bachelor of Science in Computer Science

Namal University,
30-KM, Talagang Road, Mianwali, Pakistan.
[Link]
DECLARATION

The project report titled “Urdu Text to Speech API” is submitted in partial fulfillment of the
requirements for the degree of Bachelor of Science in Computer Science, to the Department of
Computer Science at Namal University, Mianwali, Pakistan.
It is declared that this is original work done by the team members listed below, under the guidance
of our supervisor, Sir Shahzad Arif. No part of this project or its report is plagiarized, and any
help taken from previous work is properly cited.
No part of the work reported here has been submitted in fulfillment of the requirements for any other
degree or qualification at any institute of learning.

Team Members University ID Signatures

Muhammad Ramzan NIM-BSCS2020-17 ____________________

Syeda Marjan Fatima NIM-BSCS2020-36 ____________________

Supervisor

Sir Shahzad Arif

Signatures with date

__________________________

__________________________
Table of Contents
Abstract
1 Introduction
    1.1 Speed
    1.2 Naturalness
    1.3 Being Inclusive
2 System Requirements
    2.1 Functional Requirements
        2.1.1 Text Input
        2.1.2 Voice Synthesis
        2.1.3 Data Collection
        2.1.4 Audio Output
    2.2 Non-functional Requirements
        2.2.1 Performance
        2.2.2 Reliability
        2.2.3 Usability
        2.2.4 User Interface Requirements
3 Literature Review
4 Analysis and Design
    4.1 Software Process Model
    4.2 Conceptual Diagram
    4.3 Data Flow Diagram (DFD)
    4.4 Use Case Diagram
    4.5 Algorithm
    4.6 Implementation Constraints
        4.6.1 Data Collection and Processing
        4.6.2 Computational Resources
        4.6.3 Latency Optimization
        4.6.4 Solution
5 Remaining Work
    5.1 Data Processing and Preparation
    5.2 Model Training and Optimization
    5.3 System Integration and Development
    5.4 Testing and Evaluation
    5.5 Deployment and Maintenance
    5.6 Timeline and Milestones
6 References
Abstract

The main goal of this project is to build a Text-to-Speech (TTS) model that faithfully reproduces
the essential qualities of human speech and matches the best voice processors on the market
today. Neural network architectures such as WaveNet and Tacotron will make this possible.
The aim is voice output that sounds genuinely human rather than machine-generated. A large
part of our work is collecting data that fits this goal. Because the Urdu speech datasets that
already exist are fragmented and skewed, we are building a vocal corpus that is both large and
acoustically diverse. This approach ensures the model is exposed to many different voices,
accents, dialects, and unusual ways of speaking. Our Text-to-Speech (TTS) system needs to
hear real speech in order to become more realistic and flexible than existing systems. We will
implement the system using well-known deep learning frameworks such as TensorFlow and
PyTorch, and use the NLTK library to clean and prepare our text sources so that they align
well with real speech. Our ambition is to change how TTS technology is used altogether; to get
there, we will iterate steadily, guided by extensive research and the data we collect ourselves.
1 Introduction
Imagine reading an engaging Urdu story, but instead of reading the words on a page, you hear
the story come to life in the rich, lively tones of a native speaker. This is the promise of the
Urdu Text-to-Speech (TTS) API we are building: a tool that quickly turns written Urdu into
natural-sounding spoken language.
Existing Urdu TTS options, however, are often slow, making users wait for the magic to happen.
This not only breaks the flow of the experience but also limits how far Urdu audio content can
reach. Our project's goal is to overcome this problem by creating a very fast Urdu TTS API
driven by high-quality voice data that we collect ourselves.
This is what makes our project unique:
1.1 Speed:
Efficiency is our primary concern. Our carefully curated collection of diverse Urdu speech,
combined with algorithmic improvements, will cut response times dramatically, allowing
near-instant speech generation.
1.2 Naturalness:
Our dataset captures the distinctive rhythms and intonations of real Urdu speakers, ensuring
that the generated audio sounds human, engaging, and expressive.
1.3 Being inclusive:
We are dedicated to reflecting how rich and varied the Urdu language is. Our carefully curated
collection covers a range of accents and speaking styles, which makes the API usable by a
wider audience.
This report examines in detail the problems with slow Urdu TTS and how we plan to fix them.
We believe this API can offer new tools to teachers, writers, content creators, and anyone else
who wants to hear Urdu come to life in real time.
2 System Requirements
2.1 Functional Requirements:
2.1.1 Text Input:
Accept Urdu text input in a number of different forms, such as plain text. Understand and
handle UTF-8 and other Unicode encodings of Urdu text.
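As a minimal sketch of the kind of input check this requirement implies, the Python snippet below normalizes incoming text and keeps only Arabic-script characters. The function name and the exact Unicode ranges are our own illustrative choices, not part of the API specification.

```python
import unicodedata

# Arabic-script blocks covering standard Urdu characters (illustrative;
# the ranges used by the real API may differ).
URDU_BLOCKS = [(0x0600, 0x06FF), (0x0750, 0x077F),
               (0xFB50, 0xFDFF), (0xFE70, 0xFEFF)]

def normalize_urdu_text(text: str) -> str:
    """Normalize incoming Urdu text to NFC and strip non-Urdu noise."""
    text = unicodedata.normalize("NFC", text)
    cleaned = []
    for ch in text:
        if ch.isspace() or any(lo <= ord(ch) <= hi for lo, hi in URDU_BLOCKS):
            cleaned.append(ch)
    return "".join(cleaned).strip()

print(normalize_urdu_text("سلام دنیا"))  # same text, NFC-normalized
```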
2.1.2 Voice Synthesis:
Quickly convert processed text into speech that sounds like a natural human voice.
2.1.3 Data Collection:
Accept voice clips from contributors as part of our data collection.
2.1.4 Audio Output:
Produce high-quality audio files in popular formats such as MP3 and WAV.
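For illustration, here is a minimal sketch of how synthesized samples could be packaged as a WAV file using Python's standard wave module. The 22050 Hz sample rate and the silent-audio stand-in are assumptions, and MP3 output would additionally require an external encoder (e.g., ffmpeg).

```python
import wave
import numpy as np

def write_wav(path: str, samples: np.ndarray, sample_rate: int = 22050) -> None:
    """Write mono float samples in [-1, 1] as a 16-bit PCM WAV file."""
    pcm = (np.clip(samples, -1.0, 1.0) * 32767).astype(np.int16)
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)           # mono
        wav.setsampwidth(2)           # 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(pcm.tobytes())

# Example: one second of silence as a stand-in for synthesized speech.
write_wav("output.wav", np.zeros(22050))
```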

2.2 Non-functional Requirements


2.2.1 Performance:
Ensure minimal delay in converting text to speech. Efficiently manage high-volume
requests.
2.2.2 Reliability:
Deliver a seamless audio experience across all devices and platforms. Keep the system
running efficiently with maximum uptime and availability.
2.2.3 Usability:
Enhance user experience and facilitate seamless API integration by providing
comprehensive documentation and easy-to-use tools for voice customization and
selection.
2.2.4 User Interface requirements:
We currently do not require an interface to interact with the API. However, if we were to
develop a web app, we could integrate features such as voice selection panels,
customization options, audio controls, and more.
3 Literature Review
The field of text-to-speech (TTS) technology has undergone a significant transformation in
recent years, largely due to the emergence of new model structures and cutting-edge research.
Among the most noteworthy developments in this field is Google's Tacotron model, which
marked a major step forward in the conversion of text input into voice by utilizing raw
spectrograms. Another significant breakthrough was made by DeepMind with its WaveNet
system, which employed a novel method called dilated convolutions to generate raw audio
waveforms.
TTS technology has advanced considerably due to the development of large datasets and
foundational models. A variety of spoken-word clips have been used to train TTS algorithms,
with the Mozilla Common Voice dataset being a particularly valuable resource for researchers
due to its extensive data in multiple languages.
While many TTS platforms are available, [Link] has become a popular choice among business
users because of its user-friendliness and efficacy in converting text to speech. Synthesia, on
the other hand, is a distinctive platform that pairs visuals with computer-generated speech.
Despite the progress made in this field, there is still much to explore in the realm of synthetic
speech synthesis. Our goal is to build on the foundational models and datasets, such as Tacotron
and LJ Speech, to further develop and enhance TTS technology.
Table 1: Research Work

Tool        TTS conversion time   Voice cloning         Words limit   API
Synthesia   5 sec                 Specific (4 voices)   Yes           No
[Link]      2 sec                 Specific (2 voices)   Yes           No
Our model   1 sec                 For each voice        Yes           Yes


4 Analysis and Design

4.1 Software Process Model:


Our project will employ an Agile development methodology, which is a flexible and iterative
approach to software development. This approach emphasizes the importance of incremental
progress and continuous feedback from stakeholders. By breaking down the project into smaller,
manageable tasks, we can expedite the prototyping process and ensure that we are meeting the
evolving requirements of our users. Additionally, this methodology allows us to effectively
manage risks and respond to any issues that may arise throughout the development process.
Version Control: By utilizing a Git repository for version control, we can guarantee seamless
teamwork and precise documentation of changes made.
4.2 Conceptual Diagram:

Figure 1: Conceptual diagram


4.3 Data Flow Diagram (DFD):

Figure 2: Data flow diagram (DFD)

4.4 Use Case Diagram:


• Text-to-speech conversion for audiobooks and e-learning materials.
• Voiceovers for video and multimedia content.
• Accessibility tools for visually impaired users.
• Real-time speech generation for interactive applications.
Figure 3: Use case diagram

4.5 Algorithm:
Tacotron is a powerful TTS technology that has been developed by Google. It uses a neural
network architecture to convert text into speech that sounds natural and authentic. The system is
designed to capture the nuances of human speech, including cadence and intonation, which has
been pivotal in advancing the quality of TTS technology.
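To make the pipeline concrete, below is a rough Python sketch of the two-stage structure Tacotron popularized: text is first mapped to a mel spectrogram, which a vocoder then converts to a waveform. All function bodies are placeholders; the frame count, 80 mel bands, and 256-sample hop length are common conventions in the literature, not settings from our model.

```python
import numpy as np

def text_to_mel(text: str) -> np.ndarray:
    """Stage 1 (Tacotron-style): predict a mel spectrogram from text.
    Placeholder: a real model encodes characters and decodes mel
    frames with attention."""
    n_frames = max(1, len(text)) * 5   # rough frames-per-character guess
    return np.zeros((80, n_frames))    # 80 mel bands is a common choice

def mel_to_waveform(mel: np.ndarray, sample_rate: int = 22050) -> np.ndarray:
    """Stage 2 (vocoder, e.g. WaveNet): turn mel frames into audio samples.
    Placeholder: returns silence of a plausible length."""
    hop_length = 256                   # samples per mel frame (typical)
    return np.zeros(mel.shape[1] * hop_length)

mel = text_to_mel("سلام")
audio = mel_to_waveform(mel)
```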
WaveNet, a speech synthesis system developed by DeepMind, is renowned for producing
human-like speech from text. It works by using a deep neural network to generate raw audio
waveforms sample by sample. This technology is widely considered a benchmark in speech
synthesis due to its ability to produce high-quality, natural-sounding speech that is difficult
to distinguish from a human voice.
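The PyTorch sketch below illustrates WaveNet's core mechanism: a stack of causal 1-D convolutions whose dilation doubles at each layer, so the receptive field grows exponentially with depth. This is a minimal illustration of the idea, not DeepMind's full gated, residual architecture.

```python
import torch
import torch.nn as nn

class DilatedCausalStack(nn.Module):
    """Stack of 1-D convolutions with exponentially growing dilation,
    left-padded so each output depends only on past samples (causal)."""
    def __init__(self, channels: int = 32, layers: int = 6):
        super().__init__()
        self.convs = nn.ModuleList()
        for i in range(layers):
            dilation = 2 ** i   # 1, 2, 4, ... doubles the receptive field
            self.convs.append(nn.Conv1d(channels, channels, kernel_size=2,
                                        dilation=dilation))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for conv in self.convs:
            pad = conv.dilation[0]          # left padding keeps causality
            x = torch.relu(conv(nn.functional.pad(x, (pad, 0))))
        return x

# One batch, 32 channels, 1000 time steps of dummy audio features.
out = DilatedCausalStack()(torch.zeros(1, 32, 1000))
print(out.shape)  # torch.Size([1, 32, 1000])
```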
4.6 Implementation Constraints:
4.6.1 Data collection and processing:
Creating a voice dataset that is both diverse and of high quality requires substantial
resources.
4.6.2 Computational resources:
To train and execute intricate speech synthesis models, it may be necessary to have
access to high-performance computational resources.
4.6.3 Latency optimization:
It involves the meticulous optimization of algorithms and hardware infrastructure to
achieve a balance between high-quality voice and quick response times.
4.6.4 Solution:
 Our primary focus has been on optimizing the process of gathering a wide range
of voice samples in an effective manner.
 We will investigate several cloud computing alternatives and enhance model
architectures to achieve faster processing using the resources that are accessible.
 We will perform thorough investigation and benchmarking to determine the
most favorable balance between speech quality and response time.
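The following sketch shows how that benchmarking could be run. The `synthesize` argument stands in for whichever model variant is under test; it is a placeholder, not our actual implementation.

```python
import statistics
import time

def benchmark_latency(synthesize, text: str, runs: int = 20) -> dict:
    """Time repeated synthesis calls and report summary statistics (seconds)."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        synthesize(text)
        timings.append(time.perf_counter() - start)
    return {
        "mean": statistics.mean(timings),
        "p95": sorted(timings)[int(0.95 * (len(timings) - 1))],
        "max": max(timings),
    }

# Example with a dummy model that sleeps for 50 ms per call.
print(benchmark_latency(lambda t: time.sleep(0.05), "سلام دنیا"))
```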
5 Remaining Work
Although we have made substantial progress in creating our self-curated Urdu voice dataset,
the highly efficient Urdu TTS API itself is still under development. Presented below is a
detailed plan of the remaining tasks:
5.1 Data Processing and Preparation:
• Annotate the collected voice recordings with phonetic and prosodic information.
• Implement normalization algorithms to guarantee uniformity across recordings.
• Create training, validation, and test sets for the speech synthesis model; a minimal
splitting sketch follows this list.
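The sketch below shows one way the recordings could be partitioned. The 80/10/10 ratio, the fixed seed, and the file-naming scheme are illustrative assumptions, not decisions recorded in this report.

```python
import random

def split_dataset(items, train=0.8, val=0.1, seed=42):
    """Shuffle recording IDs and split into train/validation/test lists."""
    items = list(items)
    random.Random(seed).shuffle(items)   # fixed seed -> reproducible splits
    n_train = int(len(items) * train)
    n_val = int(len(items) * val)
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])

clips = [f"clip_{i:04d}.wav" for i in range(1000)]
train_set, val_set, test_set = split_dataset(clips)
print(len(train_set), len(val_set), len(test_set))  # 800 100 100
```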

5.2 Model Training and Optimization:


• Assess and compare several speech synthesis techniques using the prepared dataset.
• Train the selected model on the training set, tuning hyperparameters for the best
possible performance and naturalness; a skeleton of this loop is sketched below.
• Evaluate the model on the validation set and resolve any problems discovered.
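The skeleton below sketches what this training loop could look like. The hyperparameter values and the model methods (`fit_epoch`, `evaluate`, `save`) are hypothetical placeholders for whichever framework we settle on, not an existing API.

```python
# Illustrative hyperparameters only; the real values will come from tuning.
config = {
    "learning_rate": 1e-3,
    "batch_size": 32,
    "epochs": 100,
}

def train(model, train_set, val_set, config):
    """Skeleton of the loop: fit on train, monitor on validation,
    and keep the best checkpoint seen so far."""
    best_val_loss = float("inf")
    for epoch in range(config["epochs"]):
        train_loss = model.fit_epoch(train_set,
                                     lr=config["learning_rate"],
                                     batch_size=config["batch_size"])
        val_loss = model.evaluate(val_set)
        if val_loss < best_val_loss:        # keep the best checkpoint
            best_val_loss = val_loss
            model.save("best_checkpoint.pt")
        print(f"epoch {epoch}: train={train_loss:.4f} val={val_loss:.4f}")
```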

5.3 System Integration and Development:


• Create the API interface to facilitate user interaction and text entry; a minimal
endpoint sketch follows this list.
• Embed the trained speech synthesis model into the API framework.
• Develop pre-processing and post-processing modules to achieve the best possible
audio output.
• Design and implement performance optimizations to achieve minimal response
times.
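As an illustration, here is a minimal sketch of what the API interface could look like using Flask. The endpoint path is a placeholder, and `synthesize` here returns silent WAV audio in place of real model inference.

```python
import io
import wave

from flask import Flask, request, send_file

app = Flask(__name__)

def synthesize(text: str) -> bytes:
    """Placeholder for model inference: returns one second of silent WAV."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(22050)
        w.writeframes(b"\x00\x00" * 22050)
    return buf.getvalue()

@app.route("/synthesize", methods=["POST"])
def synthesize_endpoint():
    """Accept JSON {"text": "..."} and return the audio as a WAV download."""
    data = request.get_json(force=True, silent=True) or {}
    text = data.get("text", "")
    if not text:
        return {"error": "text field is required"}, 400
    return send_file(io.BytesIO(synthesize(text)),
                     mimetype="audio/wav",
                     download_name="speech.wav")

if __name__ == "__main__":
    app.run(port=5000)
```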

5.4 Testing and Evaluation:


• Perform thorough testing of the system using both the test set and real-world user
scenarios.
• Assess the quality and naturalness of the produced speech using both subjective and
objective criteria.
• Examine response times and pinpoint potential bottlenecks for subsequent
improvement.
5.5 Deployment and Maintenance:
 Select an appropriate hosting platform for the API, taking into consideration performance
and scalability needs.

Table 2: Remaining work

Module Name                        Completion Status
Research Work                      Completed (100%)
Data Collection                    Completed (100%)
Design and Architecture of API     In progress (50%)
Data Preprocessing                 In progress (20%)
Model Building and Training        Remaining
Fine-Tune Model                    Remaining
Testing and API Integration        Remaining
Deployment                         Remaining

5.6 Timeline and Milestones:


6 References
[1] Y. Wang et al., "Tacotron: Towards End-to-End Speech Synthesis," arXiv preprint
arXiv:1703.10135, 2017.

[2] A. van den Oord et al., "WaveNet: A Generative Model for Raw Audio," arXiv preprint
arXiv:1609.03499, 2016.

[3] K. Ito and L. Johnson, "The LJ Speech Dataset," 2017. [Online].
Available: [Link]

[4] R. Ardila et al., "Common Voice: Building a multilingual, open source voice dataset," arXiv
preprint arXiv:1912.06670, 2019.

[5] [Link], "Veed Text-to-Speech," 2021. [Online]. Available: [Link]

[6] Synthesia, "AI Video Generation Platform," 2021. [Online]. Available: [Link]
