Unit 4 (Text-to-Speech Synthesis)
Text normalization
Text normalization is the process of transforming text into a standard format to facilitate
easier processing and analysis, especially in natural language processing (NLP) tasks. It
involves several steps that help to reduce the variability in text data.
These steps are crucial for cleaning and preparing text for tasks like machine learning, sentiment
analysis, or text mining. The extent of normalization depends on the specific use case and the
type of analysis being performed.
PROGRAM
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
# Download required NLTK resources (needed only once)
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
# Sample text
text = "Text normalization, isn't it useful? It's one of the most crucial steps in NLP!"
# Lowercasing
text = text.lower()
# Removing punctuation
text = re.sub(r'[^\w\s]', '', text)
# Tokenization
tokens = nltk.word_tokenize(text)
# Removing stopwords
stop_words = set(stopwords.words('english'))
tokens = [word for word in tokens if word not in stop_words]
# Stemming
stemmer = PorterStemmer()
stemmed_tokens = [stemmer.stem(token) for token in tokens]
# Lemmatization
lemmatizer = WordNetLemmatizer()
lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens]
# Print the results
print("Original Text:", text)
print("Tokens:", tokens)
print("Stemmed Tokens:", stemmed_tokens)
print("Lemmatized Tokens:", lemmatized_tokens)
Output
Original Text: text normalization isnt it useful its one of the most crucial steps in nlp
Tokens: ['text', 'normalization', 'useful', 'one', 'crucial', 'steps', 'nlp']
Stemmed Tokens: ['text', 'normal', 'use', 'one', 'crucial', 'step', 'nlp']
Lemmatized Tokens: ['text', 'normalization', 'useful', 'one', 'crucial', 'step', 'nlp']
Letter-to-sound
1. Rule-Based Systems:
o Early approaches relied on manually created rules that map graphemes (letters or
combinations of letters) to phonemes.
o Example Rule: In the word "cat," the letter "c" is mapped to /k/, "a" to /æ/, and "t"
to /t/ (a minimal rule-table sketch appears after this list).
2. Dictionary-Based Approaches:
o A dictionary contains predefined mappings of words to their phonemic
transcriptions. This works well for common words but struggles with new or rare
words.
o Example: A pronunciation dictionary might map "light" to /laɪt/.
3. Statistical and Machine Learning Approaches:
o These use data-driven methods, like decision trees or HMMs, trained on large
corpora of words and their corresponding phonemes. The system learns patterns
and probabilities to predict phonemes for unseen words.
4. Neural Network Models:
o More recently, neural networks, such as sequence-to-sequence models (e.g.,
RNNs or Transformers), have been used for L2S conversion. These models can
handle irregularities better by learning complex patterns from large datasets.
o Example: A neural network could learn that "ough" maps to /oʊ/ in "though," /uː/
in "through," and /ʌf/ in "rough."
Program
import nltk
from nltk.corpus import cmudict
nltk.download('cmudict')
# Load the CMU Pronouncing Dictionary as a Python dictionary
pron_dict = cmudict.dict()
def get_phonemes(word):
    word = word.lower()
    if word in pron_dict:
        return pron_dict[word]
    else:
        return "Phonemes not found in dictionary."
# Example
word = "through"
phonemes = get_phonemes(word)
print(f"Word: {word}")
print(f"Phonemes: {phonemes}")
OUTPUT
Word: through
Phonemes: [['TH', 'R', 'UW1']]
Here TH, R, and UW1 are the phonemes that correspond to the sounds in "through" according to the
CMU Pronouncing Dictionary (the digit 1 marks primary stress on the vowel).
For more complex letter-to-sound conversion tasks, especially for unseen words or non-standard
spellings, deep learning models such as sequence-to-sequence (encoder-decoder) networks trained on
grapheme-phoneme pairs can be used.
Applications:
• Text-to-Speech (TTS): Converts written text to speech by first converting letters into
phonemes.
• Automatic Speech Recognition (ASR): Uses phoneme models for recognizing speech
and mapping spoken words to text.
• Language Learning Tools: Helps learners by generating phonetic transcriptions of
words.
Prosody, Evaluation
Prosody refers to the rhythm, intonation, and stress patterns in speech that convey meaning,
emotion, and structure. It's an essential aspect of natural language and spoken communication,
affecting how messages are perceived beyond the basic phonetic sounds. Prosody encompasses
several elements:
1. Intonation: The rise and fall of pitch in speech, which can indicate questions, statements,
emotions, or emphasis.
o Example: A rising pitch at the end of a sentence typically signals a question in
English.
2. Stress: The emphasis placed on particular syllables or words, which can highlight important
information or change the meaning of an utterance.
3. Rhythm: The timing and duration patterns of syllables and pauses that give speech its tempo
and flow.
Prosody plays a critical role in speech synthesis (text-to-speech) and speech recognition systems
to make the output sound natural. It helps in disambiguating sentence meanings, expressing
emotions, and maintaining listener engagement.
In text-to-speech systems, prosody helps ensure that synthesized speech sounds natural. Without
proper prosody, speech can sound robotic or unnatural. Modern TTS systems like Google’s
WaveNet or Tacotron apply machine learning models to capture and generate natural prosody.
However, generating accurate prosody is still a challenging task, especially for complex,
expressive speech.
The evaluation of prosody in TTS and other speech systems is important to ensure that the
generated speech is natural and easy to understand. Evaluation typically involves several
methods, both objective and subjective.
1. Objective Evaluation
Objective evaluation methods use mathematical models to assess aspects like timing, pitch, and
stress patterns in the speech. They usually compare the synthesized prosody with a reference
(ground truth).
• Mel Cepstral Distortion (MCD): A metric for measuring the spectral distortion between
synthesized and reference speech (a short computation sketch follows below).
• Pitch and Duration Correlation: Measures how well the pitch and duration of
synthesized speech match the target prosody.
• Perplexity: This is used in language models to evaluate how predictable the prosodic
patterns are.
While objective methods are fast and quantitative, they may not always reflect how natural the
speech sounds to human listeners.
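To make the MCD metric above concrete, the sketch below implements the standard MCD formula with NumPy. It assumes the reference and synthesized mel-cepstra have already been extracted and time-aligned (e.g., with dynamic time warping), which is not shown; the random arrays are placeholders.
import numpy as np
def mel_cepstral_distortion(mcep_ref, mcep_syn):
    """MCD in dB between two aligned mel-cepstral sequences of shape (frames, coeffs),
    excluding the 0th (energy) coefficient."""
    diff = mcep_ref - mcep_syn
    # Standard formula: (10 / ln 10) * sqrt(2 * sum of squared coefficient differences)
    mcd_per_frame = (10.0 / np.log(10)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(np.mean(mcd_per_frame))
# Placeholder cepstra standing in for extracted features
reference = np.random.randn(100, 24)
synthesized = reference + 0.1 * np.random.randn(100, 24)
print(f"MCD: {mel_cepstral_distortion(reference, synthesized):.2f} dB")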
2. Subjective Evaluation
Subjective evaluation is crucial for assessing the naturalness of prosody since prosodic features
like intonation and rhythm are perceptual. These methods involve human listeners rating the
quality of the generated speech.
• Mean Opinion Score (MOS): A common subjective test where listeners rate the
naturalness of the speech on a scale (e.g., from 1 to 5); a minimal scoring sketch follows this list.
• ABX Testing: Involves presenting listeners with two speech samples (A and B) and
asking them to identify which one is more natural or preferable.
• Pairwise Comparison: Listeners are presented with pairs of synthesized speech samples
and asked which one sounds more natural or fluid in terms of prosody.
• Transcription Task: Listeners are asked to transcribe the generated speech, and the
accuracy of transcription can indicate prosodic clarity.
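Since a MOS is simply the average of listener ratings, it can be computed in a couple of lines; the ratings below are made up for illustration.
import numpy as np
# Hypothetical listener ratings (1-5 scale) for one synthesized utterance
ratings = np.array([4, 5, 3, 4, 4, 5, 3, 4])
mos = ratings.mean()
# 95% confidence interval, assuming the mean rating is approximately normally distributed
ci = 1.96 * ratings.std(ddof=1) / np.sqrt(len(ratings))
print(f"MOS = {mos:.2f} ± {ci:.2f}")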
Challenges in Evaluating Prosody
1. Cultural and Linguistic Differences: Different languages and dialects have unique
prosodic patterns, making it difficult to apply a one-size-fits-all evaluation method.
2. Emotion and Expressiveness: Evaluating prosody becomes even more challenging in
systems designed to generate emotional or expressive speech, as it involves more subtle
variations in pitch, rhythm, and stress.
3. Unreliability of Subjective Tests: Human listeners can have inconsistent perceptions,
and large-scale subjective tests are costly and time-consuming.
Approaches to Improving Prosody Generation
• Neural Prosody Models: Use neural networks to model prosodic patterns, such as using
sequence-to-sequence models or attention mechanisms to capture prosodic features.
• Transfer Learning: Pre-trained models on large expressive speech datasets can help
generate more natural prosody for new tasks.
• Fine-tuning on Target Domain: Adapting a TTS model to the prosodic characteristics
of specific domains (e.g., newsreading, dialogue systems) ensures more natural delivery.
SIGNAL PROCESSING
Signal processing is the analysis, manipulation, and interpretation of signals to extract useful
information, enhance their quality, or convert them into a desired format. Signals can be anything
that conveys information, such as sound, images, sensor readings, or data streams, and they can
be represented in various forms like analog (continuous) or digital (discrete).
In the context of speech processing, image processing, and audio processing, signal processing
plays a fundamental role. Below is a broad overview of the different types, applications, and key
techniques involved in signal processing.
Types of Signals:
1. Analog Signals: Continuous signals, like sound waves or light, that vary over time and
take any value in a given range.
o Example: Human speech captured by a microphone.
2. Digital Signals: Discrete-time signals, often derived from the sampling of analog signals,
represented as sequences of numbers (binary).
Key Concepts and Techniques:
1. Sampling: The process of converting an analog signal into a digital one by taking discrete
samples at regular intervals.
o Nyquist-Shannon Sampling Theorem: To avoid losing information, the sampling
rate should be at least twice the highest frequency present in the signal.
2. Quantization: The process of converting the continuous amplitude of a sampled signal
into a finite set of values, which is necessary for digital representation.
3. Fourier Transform (FT): A mathematical tool that converts a time-domain signal (e.g.,
sound wave) into its frequency-domain representation, showing how different frequencies
contribute to the overall signal.
o Fast Fourier Transform (FFT): An efficient algorithm for computing the Fourier
transform of a signal.
4. Filtering: The process of removing or emphasizing certain parts of a signal, such as
eliminating noise or isolating specific frequency components (see the low-pass filtering sketch after this list).
o Low-Pass Filter: Allows low-frequency components to pass while blocking high-
frequency noise.
o High-Pass Filter: Removes low-frequency noise or trends, allowing high-
frequency components to pass.
o Band-Pass Filter: Allows frequencies within a certain range to pass while
attenuating those outside the range.
5. Convolution: A mathematical operation that combines two signals to produce a third
signal. In audio processing, convolution is used for effects like reverb or echo.
o Convolution Theorem: In the frequency domain, convolution of two signals is
equivalent to multiplying their Fourier transforms.
6. Windowing: Applying a window function to a signal to reduce spectral leakage when
performing Fourier transforms, often used in short-time Fourier transform (STFT)
analysis.
o Example: The Hamming window.
7. Time-Frequency Analysis: Techniques like the Wavelet Transform and Short-Time
Fourier Transform (STFT) allow for analyzing signals that vary over time, capturing
both time and frequency information simultaneously.
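To make the filtering idea from point 4 concrete, the sketch below designs a Butterworth low-pass filter with scipy.signal and applies it to a noisy sine wave; the cutoff frequency, filter order, and test signal are illustrative choices only.
import numpy as np
from scipy.signal import butter, filtfilt
fs = 1000                                   # sampling rate in Hz
t = np.arange(0, 1.0, 1 / fs)
# 5 Hz sine wave contaminated with broadband noise
noisy = np.sin(2 * np.pi * 5 * t) + 0.5 * np.random.randn(len(t))
# 4th-order Butterworth low-pass filter with a 20 Hz cutoff
b, a = butter(N=4, Wn=20, btype='low', fs=fs)
# filtfilt runs the filter forward and backward, giving zero phase distortion
smoothed = filtfilt(b, a, noisy)
print("Noisy signal std:   ", round(np.std(noisy), 3))
print("Filtered signal std:", round(np.std(smoothed), 3))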
Applications of Signal Processing:
Communication Systems:
• Modulation: The process of encoding information onto a carrier wave for transmission
(e.g., AM, FM, QAM).
• Error Correction: Detecting and correcting errors in transmitted data using techniques
like Hamming codes or Reed-Solomon codes.
• Equalization: Mitigating the effects of signal distortion during transmission.
Radar and Sonar:
• Target Detection: Signal processing is used to detect and identify objects or targets from
reflected signals.
• Doppler Effect: Calculating speed or movement by analyzing the frequency shift in
signals.
Python Libraries for Signal Processing:
1. NumPy: Provides basic operations for array manipulation and Fourier transforms.
2. SciPy: Contains functions for filtering, FFT, and signal analysis.
o Example: scipy.signal for filtering, convolution, etc.
3. Librosa: A popular library for audio and music analysis.
4. PyWavelets: For wavelet transforms in Python.
5. OpenCV: Used for image and video processing tasks.
Program
import numpy as np
import matplotlib.pyplot as plt
from scipy.fftpack import fft
fs, N = 1000, 1000                               # sampling rate (Hz) and number of samples
t = np.arange(N) / fs
# Composite signal: 50 Hz and 120 Hz sine waves
signal = np.sin(2 * np.pi * 50 * t) + 0.5 * np.sin(2 * np.pi * 120 * t)
fft_signal = fft(signal)                         # frequency-domain representation
frequencies = np.fft.fftfreq(N, d=1 / fs)        # frequency axis in Hz
plt.subplot(2, 1, 1)                             # time-domain plot
plt.plot(t, signal)
plt.title("Time-domain Signal")
plt.xlabel("Time [s]")
plt.subplot(2, 1, 2)                             # frequency-domain plot (positive half only)
plt.plot(frequencies[:N // 2], np.abs(fft_signal[:N // 2]))
plt.title("Frequency-domain Spectrum")
plt.xlabel("Frequency [Hz]")
plt.tight_layout()
plt.show()
This code generates a composite signal with two sine waves and computes its Fast Fourier Transform
(FFT) to display the frequency components.
• The top plot shows the time-domain signal, which is a composite of two sine waves
with frequencies of 50 Hz and 120 Hz.
• The bottom plot shows the frequency-domain spectrum after applying the Fast Fourier
Transform (FFT). You can clearly see two peaks at 50 Hz and 120 Hz, corresponding to
the two frequencies in the original signal.
Concatenative and Parametric Synthesis
Concatenative and parametric approaches are two traditional methods used in text-to-speech
(TTS) systems to synthesize human-like speech. Both methods aim to generate intelligible and
natural-sounding speech from text, but they differ fundamentally in how they achieve this.
Concatenative Synthesis
Concatenative TTS systems build utterances by selecting and joining segments of pre-recorded
natural speech.
Key Features:
• Database of Recorded Speech: The system uses a large corpus of speech recordings,
broken down into smaller units.
• Units of Concatenation: The units could be:
o Phonemes: Smallest sound units (e.g., /k/, /æ/).
o Diphones: Pairs of phonemes that capture the transition between two sounds.
o Syllables or Words: Larger units that provide more context but require larger
databases.
• Unit Selection: The system selects the best-matching speech segments for the given text,
considering factors like phonetic context, prosody, and naturalness.
• Smoothing: When the segments are concatenated, techniques like pitch adjustment and
smoothing at the junctions of segments are used to reduce discontinuities.
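As a toy illustration of joining recorded units with smoothing at the junction, the sketch below crossfades two waveform segments with NumPy; the sine bursts are placeholders standing in for real recorded diphones, not what a production unit-selection system would use.
import numpy as np
def crossfade_concatenate(unit_a, unit_b, fade_samples=200):
    """Join two speech units with a linear crossfade to reduce amplitude discontinuities."""
    fade_out = np.linspace(1.0, 0.0, fade_samples)
    fade_in = np.linspace(0.0, 1.0, fade_samples)
    # Overlap the tail of unit_a with the head of unit_b
    overlap = unit_a[-fade_samples:] * fade_out + unit_b[:fade_samples] * fade_in
    return np.concatenate([unit_a[:-fade_samples], overlap, unit_b[fade_samples:]])
# Placeholder "units": two short sine bursts standing in for recorded segments
fs = 16000
t = np.arange(0, 0.2, 1 / fs)
unit_a = np.sin(2 * np.pi * 120 * t)
unit_b = np.sin(2 * np.pi * 150 * t)
joined = crossfade_concatenate(unit_a, unit_b)
print(joined.shape)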
Pros:
• Naturalness: Since it uses real recorded speech, concatenative synthesis can produce very
natural-sounding speech, especially if the system has a large, well-designed database.
• Efficiency: Once the database is built, synthesis can be relatively fast because the system
simply pieces together pre-recorded speech.
Cons:
• Limited Flexibility: The system can only generate speech that is covered by the available
recorded units. Unusual words or new phonetic sequences may not be synthesized well.
• Database Size: A large database is required to cover a wide variety of phonetic contexts,
which makes the system storage-intensive.
• Junction Artifacts: If not handled carefully, the junctions between concatenated units can
sound unnatural due to pitch, timing, or volume mismatches.
Example:
The Festival speech synthesis system is a popular open-source concatenative TTS engine.
Parametric Synthesis
Parametric TTS systems generate speech by modeling the speech production process. Instead of
concatenating pre-recorded speech, parametric approaches synthesize speech by using statistical
models to control parameters like pitch, duration, and formants (vocal tract resonances) to generate
audio waveforms from scratch.
Key Features:
• Statistical Models: Parametric synthesis often relies on models such as Hidden Markov
Models (HMMs) to predict the speech parameters (e.g., pitch, spectral features) from input
text. More recent systems use Deep Neural Networks (DNNs).
• Feature Extraction: Speech is represented as a set of parameters like pitch, spectral
envelope, and vocal tract characteristics, which are estimated during synthesis.
• Vocoder: A vocoder (voice coder) is used to convert the predicted parameters into an
actual speech waveform. Common vocoders include STRAIGHT and WORLD.
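A minimal sketch of the parametric analysis/synthesis idea is shown below using the WORLD vocoder through the pyworld package (assuming it is installed; the function names follow that package's documented API). In a real parametric TTS system the parameters would be predicted from text by an HMM or DNN rather than extracted from recorded audio, and the random placeholder signal would be a real speech waveform.
import numpy as np
import pyworld as pw
fs = 16000
x = np.random.randn(fs).astype(np.float64)   # placeholder for a real mono speech waveform
f0, t = pw.harvest(x, fs)                    # fundamental frequency (pitch) contour
sp = pw.cheaptrick(x, f0, t, fs)             # smoothed spectral envelope
ap = pw.d4c(x, f0, t, fs)                    # aperiodicity (noise) component
# Modify the parameters (e.g., raise pitch by 20%) and resynthesize a waveform
y = pw.synthesize(f0 * 1.2, sp, ap, fs)
print(y.shape)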
Pros:
• Flexibility: Parametric synthesis can generate speech for any text input, even for words or
phonetic combinations that are not present in the training data.
• Small Footprint: The system does not require storing large databases of speech
recordings, making it more compact and suitable for resource-constrained environments
like mobile devices.
• Consistent Quality: The system can maintain consistency across different voices and
speaking styles by adjusting parameters.
Cons:
• Reduced Naturalness: Vocoded, statistically averaged speech tends to sound muffled or
"buzzy" compared with concatenated natural recordings, because over-smoothing removes fine
spectral detail.
Example:
The HTS (HMM-based Speech Synthesis System) is a parametric speech synthesis engine that
uses Hidden Markov Models for speech generation.
Hybrid Approaches
Many modern systems combine aspects of both concatenative and parametric methods to leverage
their strengths. For instance, unit selection systems (a type of concatenative synthesis) can
incorporate parametric prosody models to ensure more natural intonation and rhythm.
Neural TTS systems, such as WaveNet and Tacotron, represent the latest evolution in speech
synthesis, blending the benefits of parametric and concatenative approaches but addressing many
of their limitations.
• WaveNet (Google): A deep neural network-based model that directly generates raw audio
waveforms. It significantly improves the naturalness of speech by learning complex
temporal dependencies.
• Tacotron (Google): A sequence-to-sequence neural network that converts text directly
into spectrograms, which are then synthesized into speech waveforms.
These neural models have largely superseded traditional concatenative and parametric systems
due to their ability to produce highly natural and flexible speech without the limitations of large
speech databases or oversimplified models.
1. WaveNet (Google DeepMind)
WaveNet is a generative model for producing raw audio waveforms using deep learning.
Introduced by Google DeepMind in 2016, it was a groundbreaking advancement in TTS and audio
generation.
Key Features:
• Natural Sounding Speech: WaveNet produces highly natural and realistic speech by
directly modeling the waveform at the raw level, capturing subtle nuances like breathiness,
pitch variations, and prosody.
• Flexibility: It can be trained to generate different types of voices and even non-speech
sounds like music, making it highly versatile.
WaveNet Architecture:
• Dilated Causal Convolutions: These convolutions are causal (i.e., they only depend on
the previous samples, preserving the temporal structure) and dilated (i.e., they skip certain
inputs to expand the receptive field).
• Probabilistic Modeling: WaveNet predicts the probability distribution of the next audio
sample, conditioned on previous samples, and samples from this distribution to generate
speech.
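A minimal PyTorch sketch of a stack of dilated causal 1-D convolutions, the building block described above, is shown below; the channel sizes and dilation schedule are illustrative, and WaveNet's gated activations, residual/skip connections, and output distribution are omitted.
import torch
import torch.nn as nn
class DilatedCausalConvStack(nn.Module):
    """Stack of dilated causal 1-D convolutions; the receptive field doubles with each layer."""
    def __init__(self, channels=32, layers=6):
        super().__init__()
        self.convs = nn.ModuleList([
            nn.Conv1d(channels, channels, kernel_size=2, dilation=2 ** i)
            for i in range(layers)
        ])
    def forward(self, x):
        # x has shape (batch, channels, time)
        for conv in self.convs:
            pad = conv.dilation[0]                      # left-pad so outputs never see future samples
            x = conv(nn.functional.pad(x, (pad, 0)))
        return x
model = DilatedCausalConvStack()
dummy = torch.randn(1, 32, 1000)                        # 1000 time steps of dummy features
print(model(dummy).shape)                               # torch.Size([1, 32, 1000])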
Advantages of WaveNet:
• Naturalness: Modeling the raw waveform directly captures fine acoustic detail that earlier
concatenative and parametric systems missed.
Challenges of WaveNet:
• Slow Generation: Samples are produced one at a time (autoregressively), so synthesis is
computationally expensive and was originally far slower than real time.
To address the speed issues, later versions like Parallel WaveNet were developed, which sped up
generation by using a non-autoregressive mechanism during inference.
2. Tacotron (Google)
Tacotron is another deep learning-based TTS model from Google that tackles text-to-speech
conversion in a fundamentally different way from WaveNet.
Tacotron maps the input text to a mel-spectrogram with a sequence-to-sequence network and an
attention mechanism; a separate vocoder then converts the spectrogram into a waveform.
Tacotron 2:
• Pairs the sequence-to-sequence spectrogram predictor with a WaveNet-style neural vocoder,
producing speech that approaches human naturalness.
Advantages of Tacotron:
• End-to-End Training: The model learns directly from paired text and audio, without hand-
engineered linguistic features.
Challenges:
• Slow, Autoregressive Inference: Spectrogram frames are generated one step at a time, which
adds latency (the issue FastSpeech was designed to address).
3. FastSpeech (Microsoft)
FastSpeech is an attempt to solve the speed and latency issues inherent in autoregressive models
like Tacotron and WaveNet by using a non-autoregressive architecture.
FastSpeech 2:
• Improves on the original by explicitly predicting duration, pitch, and energy and by training
on ground-truth targets instead of outputs distilled from a teacher model.
Advantages of FastSpeech:
• Speed: FastSpeech is much faster than autoregressive models, making it suitable for real-
time applications.
Challenges:
• Trade-off in Naturalness: Early versions of FastSpeech were slightly less natural than
Tacotron 2, though FastSpeech 2 has significantly closed this gap.
4. VITS
VITS is a recent model that uses variational inference and combines the advantages of both
autoregressive and non-autoregressive models to generate high-quality speech efficiently.
• End-to-End: VITS directly converts text into speech without requiring an intermediate
mel-spectrogram representation.
• Variational Autoencoders (VAEs): VITS employs VAEs to model the variability in
speech, including prosody and speaking style, making it capable of producing expressive
speech.
• High-Quality and Fast Synthesis: VITS delivers high-quality speech synthesis
comparable to autoregressive models like WaveNet but at a much faster speed due to its
non-autoregressive nature.
Advantages of VITS:
• High-Quality: Combines the high naturalness of autoregressive models with the efficiency
of non-autoregressive approaches.
• Direct Text-to-Speech: No need for an intermediate mel-spectrogram, simplifying the
architecture.
Challenges:
• Complexity: VITS involves a more complex training process due to its use of variational
inference and VAEs, which require careful tuning.