Urdu Text to Speech

Department of Computer Science

Muhammad Ramzan NIM-BSCS2020-17


Syeda Marjan Fatima NIM-BSCS2020-36

Session 2020-2024

Final year project report submitted in partial fulfillment of the requirements for the degree of
Bachelor of Science in Computer Science

Namal University,
30-KM, Talagang Road, Mianwali, Pakistan.
[Link]
DECLARATION

The project report titled “Urdu Text to Speech API” is submitted in partial fulfillment of the
requirements for the degree of Bachelor of Science in Computer Science, to the Department of
Computer Science at Namal University, Mianwali, Pakistan.
It is declared that this is original work done by the team members listed below, under the guidance
of our supervisor, Sir Shahzad Arif. No part of this project or its report is plagiarized, and any
help taken from previous work is properly cited.
No part of the work reported here has been submitted in fulfillment of the requirements for any other
degree or qualification at any institute of learning.

Team Members University ID Signatures

Muhammad Ramzan NIM-BSCS2020-17 ____________________

Syeda Marjan Fatima NIM-BSCS2020-36 ____________________

Supervisor

Sir Shahzad Arif

Signatures with date

__________________________

__________________________
Table of Contents
Abstract
1 Introduction
    1.1 Speed
    1.2 Naturalness
    1.3 Being Inclusive
2 System Requirements
    2.1 Functional Requirements
        2.1.1 Text Input
        2.1.2 Voice Synthesis
        2.1.3 Data Collection
        2.1.4 Audio Output
    2.2 Non-functional Requirements
        2.2.1 Performance
        2.2.2 Reliability
        2.2.3 Usability
        2.2.4 User Interface Requirements
3 Literature Review
4 Analysis and Design
    4.1 Software Process Model
    4.2 Conceptual Diagram
    4.3 Data Flow Diagram (DFD)
    4.4 Use Case Diagram
    4.5 Algorithm
    4.6 Implementation Constraints
        4.6.1 Data Collection and Processing
        4.6.2 Computational Resources
        4.6.3 Latency Optimization
        4.6.4 Solution
5 Remaining Work
    5.1 Data Processing and Preparation
    5.2 Model Training and Optimization
    5.3 System Integration and Development
    5.4 Testing and Evaluation
    5.5 Deployment and Maintenance
    5.6 Timeline and Milestones
6 References
Abstract

The main goal of this project is to build a Text-to-Speech (TTS) model that faithfully reproduces
the essential qualities of human speech and matches the best voice processors on the market
today. Neural network architectures such as WaveNet and Tacotron will make this possible.
The aim is voice output that sounds genuinely human rather than machine-generated. A large
part of our work is collecting data that fits this goal. Because the Urdu speech datasets that
already exist are fragmented and skewed, we are building a vocal corpus that is both large and
acoustically diverse. This approach ensures the model is exposed to many different voices,
accents, dialects, and unusual ways of speaking. Our Text-to-Speech (TTS) system needs to
hear real speech in order to become more realistic and flexible than existing systems. We will
implement the system using well-known deep learning frameworks such as TensorFlow and
PyTorch, and use the NLTK library to clean and prepare our text sources so that they align
well with real speech. Our ambition is to change how TTS technology is used altogether; to get
there, we will iterate steadily, guided by extensive research and the data we collect ourselves.
1 Introduction
Imagine reading an engaging Urdu story, but instead of reading the words on a page, you hear
the story come to life in the rich, lively tones of a native speaker. This is the promise of the
Urdu Text-to-Speech (TTS) API we are building: a tool that quickly turns written Urdu into
natural-sounding spoken language.
Existing Urdu TTS options, however, are often slow, making users wait for the magic to happen.
This not only breaks the flow of the experience but also limits how far Urdu audio content can
reach. Our project's goal is to overcome this problem by creating a very fast Urdu TTS API
driven by high-quality voice data that we collect ourselves.
This is what makes our project unique:
1.1 Speed:
Efficiency is our primary concern. Our carefully curated collection of diverse Urdu speech,
combined with algorithmic improvements, will cut response times dramatically, allowing
near-instant speech generation.
1.2 Naturalness:
Our dataset captures the distinctive rhythms and intonations of real Urdu speakers, ensuring
that the generated audio sounds human, engaging, and expressive.
1.3 Being inclusive:
We are dedicated to reflecting how rich and varied the Urdu language is. Our carefully curated
collection covers a range of accents and speaking styles, which makes the API usable by a
wider audience.
This report examines in detail the problems with slow Urdu TTS and how we plan to fix them.
We believe this API can offer new tools to teachers, writers, content creators, and anyone else
who wants to hear Urdu come to life in real time.
2 System Requirements
2.1 Functional Requirements:
2.1.1 Text Input:
Accept Urdu text input in a number of different forms, such as plain text. Understand and
handle UTF-8 and other Unicode encodings of Urdu text.
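As a minimal sketch of the kind of input check this requirement implies, the Python snippet below normalizes incoming text and keeps only Arabic-script characters. The function name and the exact Unicode ranges are our own illustrative choices, not part of the API specification.

```python
import unicodedata

# Arabic-script blocks covering standard Urdu characters (illustrative;
# the ranges used by the real API may differ).
URDU_BLOCKS = [(0x0600, 0x06FF), (0x0750, 0x077F),
               (0xFB50, 0xFDFF), (0xFE70, 0xFEFF)]

def normalize_urdu_text(text: str) -> str:
    """Normalize incoming Urdu text to NFC and strip non-Urdu noise."""
    text = unicodedata.normalize("NFC", text)
    cleaned = []
    for ch in text:
        if ch.isspace() or any(lo <= ord(ch) <= hi for lo, hi in URDU_BLOCKS):
            cleaned.append(ch)
    return "".join(cleaned).strip()

print(normalize_urdu_text("سلام دنیا"))  # same text, NFC-normalized
```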
2.1.2 Voice Synthesis:
Quickly convert processed text into speech that sounds like a natural human voice.
2.1.3 Data Collection:
Accept voice clips from contributors as part of our data collection.
2.1.4 Audio Output:
Produce high-quality audio files in popular formats such as MP3 and WAV.
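For illustration, here is a minimal sketch of how synthesized samples could be packaged as a WAV file using Python's standard wave module. The 22050 Hz sample rate and the silent-audio stand-in are assumptions, and MP3 output would additionally require an external encoder (e.g., ffmpeg).

```python
import wave
import numpy as np

def write_wav(path: str, samples: np.ndarray, sample_rate: int = 22050) -> None:
    """Write mono float samples in [-1, 1] as a 16-bit PCM WAV file."""
    pcm = (np.clip(samples, -1.0, 1.0) * 32767).astype(np.int16)
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)           # mono
        wav.setsampwidth(2)           # 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(pcm.tobytes())

# Example: one second of silence as a stand-in for synthesized speech.
write_wav("output.wav", np.zeros(22050))
```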

2.2 Non-functional Requirements


2.2.1 Performance:
Ensure minimal delay in converting text to speech. Efficiently manage high-volume
requests.
2.2.2 Reliability:
Deliver a seamless audio experience across all devices and platforms. Keep the system
running efficiently with maximum uptime and availability.
2.2.3 Usability:
Enhance user experience and facilitate seamless API integration by providing
comprehensive documentation and easy-to-use tools for voice customization and
selection.
2.2.4 User Interface requirements:
We currently do not require an interface to interact with the API. However, if we were to
develop a web app, we could integrate features such as voice selection panels,
customization options, audio controls, and more.
3 Literature Review
The field of text-to-speech (TTS) technology has undergone a significant transformation in
recent years, largely due to the emergence of new model structures and cutting-edge research.
Among the most noteworthy developments in this field is Google's Tacotron model, which
marked a major step forward in the conversion of text input into voice by utilizing raw
spectrograms. Another significant breakthrough was made by DeepMind with its WaveNet
system, which employed a novel method called dilated convolutions to generate raw audio
waveforms.
TTS technology has advanced considerably due to the development of large datasets and
foundational models. A variety of spoken-word clips have been used to train TTS algorithms,
with the Mozilla Common Voice dataset being a particularly valuable resource for researchers
due to its extensive data in multiple languages.
While many TTS platforms are available, [Link] has become a popular choice among business
users because of its user-friendliness and efficacy in converting text to speech. Synthesia, on
the other hand, is a distinctive platform that pairs visuals with computer-generated speech.
Despite the progress made in this field, there is still much to explore in the realm of synthetic
speech synthesis. Our goal is to build on the foundational models and datasets, such as Tacotron
and LJ Speech, to further develop and enhance TTS technology.
Table 1: Research Work

Tool        TTS conversion time   Voice cloning         Words limit   API
Synthesia   5 sec                 Specific (4 voices)   Yes           No
[Link]      2 sec                 Specific (2 voices)   Yes           No
Our model   1 sec                 For each voice        Yes           Yes


4 Analysis and Design

4.1 Software Process Model:


Our project will employ an Agile development methodology, which is a flexible and iterative
approach to software development. This approach emphasizes the importance of incremental
progress and continuous feedback from stakeholders. By breaking down the project into smaller,
manageable tasks, we can expedite the prototyping process and ensure that we are meeting the
evolving requirements of our users. Additionally, this methodology allows us to effectively
manage risks and respond to any issues that may arise throughout the development process.
Version Control: By utilizing a Git repository for version control, we can guarantee seamless
teamwork and precise documentation of changes made.
4.2 Conceptual Diagram:

Figure 1: Conceptual diagram


4.3 Data Flow Diagram (DFD):

Figure 2: Data flow diagram (DFD)

4.4 Use Case Diagram:


• Text-to-speech conversion for audiobooks and e-learning materials.
• Voiceovers for video and multimedia content.
• Accessibility tools for visually impaired users.
• Real-time speech generation for interactive applications.
Figure 3: Use case diagram

4.5 Algorithm:
Tacotron is a powerful TTS technology that has been developed by Google. It uses a neural
network architecture to convert text into speech that sounds natural and authentic. The system is
designed to capture the nuances of human speech, including cadence and intonation, which has
been pivotal in advancing the quality of TTS technology.
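To make the pipeline concrete, below is a rough Python sketch of the two-stage structure Tacotron popularized: text is first mapped to a mel spectrogram, which a vocoder then converts to a waveform. All function bodies are placeholders; the frame count, 80 mel bands, and 256-sample hop length are common conventions in the literature, not settings from our model.

```python
import numpy as np

def text_to_mel(text: str) -> np.ndarray:
    """Stage 1 (Tacotron-style): predict a mel spectrogram from text.
    Placeholder: a real model encodes characters and decodes mel
    frames with attention."""
    n_frames = max(1, len(text)) * 5   # rough frames-per-character guess
    return np.zeros((80, n_frames))    # 80 mel bands is a common choice

def mel_to_waveform(mel: np.ndarray, sample_rate: int = 22050) -> np.ndarray:
    """Stage 2 (vocoder, e.g. WaveNet): turn mel frames into audio samples.
    Placeholder: returns silence of a plausible length."""
    hop_length = 256                   # samples per mel frame (typical)
    return np.zeros(mel.shape[1] * hop_length)

mel = text_to_mel("سلام")
audio = mel_to_waveform(mel)
```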
WaveNet, a speech synthesis system developed by DeepMind, is renowned for producing
human-like speech from text. It works by using a deep neural network to generate raw audio
waveforms sample by sample. This technology is widely considered a benchmark in speech
synthesis due to its ability to produce high-quality, natural-sounding speech that is difficult
to distinguish from a human voice.
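The PyTorch sketch below illustrates WaveNet's core mechanism: a stack of causal 1-D convolutions whose dilation doubles at each layer, so the receptive field grows exponentially with depth. This is a minimal illustration of the idea, not DeepMind's full gated, residual architecture.

```python
import torch
import torch.nn as nn

class DilatedCausalStack(nn.Module):
    """Stack of 1-D convolutions with exponentially growing dilation,
    left-padded so each output depends only on past samples (causal)."""
    def __init__(self, channels: int = 32, layers: int = 6):
        super().__init__()
        self.convs = nn.ModuleList()
        for i in range(layers):
            dilation = 2 ** i   # 1, 2, 4, ... doubles the receptive field
            self.convs.append(nn.Conv1d(channels, channels, kernel_size=2,
                                        dilation=dilation))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for conv in self.convs:
            pad = conv.dilation[0]          # left padding keeps causality
            x = torch.relu(conv(nn.functional.pad(x, (pad, 0))))
        return x

# One batch, 32 channels, 1000 time steps of dummy audio features.
out = DilatedCausalStack()(torch.zeros(1, 32, 1000))
print(out.shape)  # torch.Size([1, 32, 1000])
```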
4.6 Implementation Constraints:
4.6.1 Data collection and processing:
Creating a voice dataset that is both diverse and of high quality requires substantial
resources.
4.6.2 Computational resources:
To train and execute intricate speech synthesis models, it may be necessary to have
access to high-performance computational resources.
4.6.3 Latency optimization:
It involves the meticulous optimization of algorithms and hardware infrastructure to
achieve a balance between high-quality voice and quick response times.
4.6.4 Solution:
 Our primary focus has been on optimizing the process of gathering a wide range
of voice samples in an effective manner.
 We will investigate several cloud computing alternatives and enhance model
architectures to achieve faster processing using the resources that are accessible.
 We will perform thorough investigation and benchmarking to determine the
most favorable balance between speech quality and response time.
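The following sketch shows how that benchmarking could be run. The `synthesize` argument stands in for whichever model variant is under test; it is a placeholder, not our actual implementation.

```python
import statistics
import time

def benchmark_latency(synthesize, text: str, runs: int = 20) -> dict:
    """Time repeated synthesis calls and report summary statistics (seconds)."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        synthesize(text)
        timings.append(time.perf_counter() - start)
    return {
        "mean": statistics.mean(timings),
        "p95": sorted(timings)[int(0.95 * (len(timings) - 1))],
        "max": max(timings),
    }

# Example with a dummy model that sleeps for 50 ms per call.
print(benchmark_latency(lambda t: time.sleep(0.05), "سلام دنیا"))
```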
5 Remaining Work
Although we have made substantial progress in creating our self-curated Urdu voice dataset,
the highly efficient Urdu TTS API itself is still under development. Presented below is a
detailed plan of the remaining tasks:
5.1 Data Processing and Preparation:
• Annotate the collected voice recordings with phonetic and prosodic information.
• Implement normalization algorithms to guarantee uniformity across recordings.
• Create training, validation, and test sets for the speech synthesis model; a minimal
splitting sketch follows this list.
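The sketch below shows one way the recordings could be partitioned. The 80/10/10 ratio, the fixed seed, and the file-naming scheme are illustrative assumptions, not decisions recorded in this report.

```python
import random

def split_dataset(items, train=0.8, val=0.1, seed=42):
    """Shuffle recording IDs and split into train/validation/test lists."""
    items = list(items)
    random.Random(seed).shuffle(items)   # fixed seed -> reproducible splits
    n_train = int(len(items) * train)
    n_val = int(len(items) * val)
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])

clips = [f"clip_{i:04d}.wav" for i in range(1000)]
train_set, val_set, test_set = split_dataset(clips)
print(len(train_set), len(val_set), len(test_set))  # 800 100 100
```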

5.2 Model Training and Optimization:


• Assess and compare several speech synthesis techniques using the prepared dataset.
• Train the selected model on the training set, tuning hyperparameters for the best
possible performance and naturalness; a skeleton of this loop is sketched below.
• Evaluate the model on the validation set and resolve any problems discovered.
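The skeleton below sketches what this training loop could look like. The hyperparameter values and the model methods (`fit_epoch`, `evaluate`, `save`) are hypothetical placeholders for whichever framework we settle on, not an existing API.

```python
# Illustrative hyperparameters only; the real values will come from tuning.
config = {
    "learning_rate": 1e-3,
    "batch_size": 32,
    "epochs": 100,
}

def train(model, train_set, val_set, config):
    """Skeleton of the loop: fit on train, monitor on validation,
    and keep the best checkpoint seen so far."""
    best_val_loss = float("inf")
    for epoch in range(config["epochs"]):
        train_loss = model.fit_epoch(train_set,
                                     lr=config["learning_rate"],
                                     batch_size=config["batch_size"])
        val_loss = model.evaluate(val_set)
        if val_loss < best_val_loss:        # keep the best checkpoint
            best_val_loss = val_loss
            model.save("best_checkpoint.pt")
        print(f"epoch {epoch}: train={train_loss:.4f} val={val_loss:.4f}")
```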

5.3 System Integration and Development:


• Create the API interface to facilitate user interaction and text entry; a minimal
endpoint sketch follows this list.
• Embed the trained speech synthesis model into the API framework.
• Develop pre-processing and post-processing modules to achieve the best possible
audio output.
• Design and implement performance optimizations to achieve minimal response
times.
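As an illustration, here is a minimal sketch of what the API interface could look like using Flask. The endpoint path is a placeholder, and `synthesize` here returns silent WAV audio in place of real model inference.

```python
import io
import wave

from flask import Flask, request, send_file

app = Flask(__name__)

def synthesize(text: str) -> bytes:
    """Placeholder for model inference: returns one second of silent WAV."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(22050)
        w.writeframes(b"\x00\x00" * 22050)
    return buf.getvalue()

@app.route("/synthesize", methods=["POST"])
def synthesize_endpoint():
    """Accept JSON {"text": "..."} and return the audio as a WAV download."""
    data = request.get_json(force=True, silent=True) or {}
    text = data.get("text", "")
    if not text:
        return {"error": "text field is required"}, 400
    return send_file(io.BytesIO(synthesize(text)),
                     mimetype="audio/wav",
                     download_name="speech.wav")

if __name__ == "__main__":
    app.run(port=5000)
```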

5.4 Testing and Evaluation:


• Perform thorough testing of the system using both the test set and real-world user
scenarios.
• Assess the quality and naturalness of the produced speech using both subjective and
objective criteria.
• Examine response times and pinpoint potential bottlenecks for subsequent
improvement.
5.5 Deployment and Maintenance:
 Select an appropriate hosting platform for the API, taking into consideration performance
and scalability needs.

Table 2: Remaining work

Module Name                        Completion Status
Research Work                      Completed (100%)
Data Collection                    Completed (100%)
Design and Architecture of API     In progress (50%)
Data Preprocessing                 In progress (20%)
Model Building and Training        Remaining
Fine-Tune Model                    Remaining
Testing and API Integration        Remaining
Deployment                         Remaining

5.6 Timeline and Milestones:


6 References
[1] Y. Wang et al., "Tacotron: Towards End-to-End Speech Synthesis," arXiv preprint
arXiv:1703.10135, 2017.

[2] A. van den Oord et al., "WaveNet: A Generative Model for Raw Audio," arXiv preprint
arXiv:1609.03499, 2016.

[3] K. Ito and L. Johnson, "The LJ Speech Dataset," 2017. [Online].
Available: [Link]

[4] R. Ardila et al., "Common Voice: Building a multilingual, open source voice dataset," arXiv
preprint arXiv:1912.06670, 2019.

[5] [Link], "Veed Text-to-Speech," 2021. [Online]. Available: [Link]

[6] Synthesia, "AI Video Generation Platform," 2021. [Online]. Available: [Link]
