Speech Processing
Abstract
Speech is one of the most complex signals an engineer has to handle. It is thus not surprising
that its automatic processing has only recently found a wide market. In this paper we analyze
the latest developments in speech coding, synthesis and recognition, and show why they were
necessary for commercial maturity. Synthesis based on automatic unit selection, robust
recognition systems, and mixed excitation coders are among the topics discussed here.
Introduction
Speech, which is one of the most complex signals an engineer has to handle (although we
would need another article to support this claim), is also the easiest way for humans to
communicate. This is not a paradox: unlike telecommunication signals, speech was not
invented by engineers. It existed long before them. If engineers had been given the task of
designing speech, they surely would not have made it the way it is (chances are we would
speak “sinusoids”, possibly with the help of attached bio-electronic devices, but this, again, is
another paper…). Telecommunication signals are designed in such a way that
loading/unloading them with information can easily be done by a bunch of electronic
components bundled into a simple device¹. Nothing comparable to the way a human brain
sends information to another human brain.
¹ When you think of it, the simple fact that neat signal processing algorithms work so well on telecom signals is
another proof of their intrinsic simplicity.
Contrary to what most students expect when first confronted with speech as a signal, speech
processing is not a sub-area of signal processing. As Allen noted [ALL85]: “These speech
systems provide excellent examples for the study of complex systems, since they raise
fundamental issues in system partitioning, choice of descriptive units, representational
techniques, levels of abstraction, formalisms for knowledge representation, the expression of
interacting constraints, techniques of modularity and hierarchy, techniques for characterizing
the degree of belief in evidence, subjective techniques for the measurement of stimulus
quality, naturalness and preference, the automatic determination of equivalence classes,
adaptive model parameterization, tradeoffs between declarative and procedural
representations, system architectures, and the exploitation of contemporary technology to
produce real-time performance with acceptable cost.” It is therefore not surprising that, from
the first thoughts on digital speech processing in the late 1960s, it took so long for speech
technologies (speech coding, speech synthesis, speech recognition) to come to maturity.
“Acceptance of a new technology by the mass market is almost always a function of utility,
usability, and choice. This is particularly true when using a technology to supply information
where the former mechanism has been a human.” [LEV93]. In the case of speech, utility has
been demonstrated by thousands of years of practice. Usability, however, has only recently
(but how efficiently!) been shown in electronic devices. The best and most widely known
example is the GSM phone, with its LP model-based speech coder, its simple speech
recognition features, and its voice mail (if not its e-mail reading).
In this review paper, we try to give a quick overview of state-of-the-art techniques in speech
synthesis (section 1), speech recognition (section 2), and speech coding (section 3).
[Figure: A diphone-based concatenative synthesizer. A natural language processing (NLP)
module converts the input text (e.g., “dog”) into a phoneme string (_d, do, og, g_) with target
prosody (F0, durations); the DSP module fetches the corresponding diphones from the
diphone database, applies the prosodic modification, and smooths the joints to produce the
output speech waveform.]
Word-level concatenation is impractical because of the large number of units that would have
to be recorded. Moreover, the lack of coarticulation at word boundaries results in unnaturally
connected speech. Syllables and phonemes seem to be linguistically appealing units. However,
there are over 10,000 syllables in English, and while there are only about 40 phonemes, their
simple concatenation produces unnatural speech because it does not account for coarticulation.
The units currently used in concatenative systems are mostly diphones, and sometimes
triphones or half-syllables. A minimum inventory of about 1000 diphones is required to
synthesize unrestricted English text (about 3 minutes of speech, i.e., 5 Mbytes of speech data
at 16 kHz/16 bits). Some diphone-based synthesizers also include multi-phone units of
varying length to better represent highly coarticulated speech (such as in /r/ or /l/ contexts). In
the half-syllable approach, highly coarticulated syllable-internal consonant clusters are treated
as units. However, coarticulation across syllables is not treated very well.
For the concatenation, prosodic modification, and compression of acoustic units, speech
models are usually used. Speech models provide a parametric form for acoustic units. The
task of the speech model is to analyze the inventory of the acoustic units and then compress
the inventory (using speech coding techniques), while maintaining a high quality of the
synthesized speech. During synthesis, the model must be able to perform the following tasks
in real time: concatenate an adequate sequence of parameters; adjust the parameters of the
model so as to match the prosody of the concatenated segments to the prosody imposed by
the language processing module; and finally smooth out concatenation points so as to produce
as few audible discontinuities as possible. It is therefore important to use speech models that
allow easy and high-quality (artifact-free) modification of the fundamental frequency,
segmental durations, and spectral information (magnitude and phase spectrum).
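As a concrete illustration of the last of these tasks, here is a minimal Python sketch of joint
smoothing: the parameter tracks of two concatenated units are pulled toward each other over
a few frames around the joint. The function name and the linear-ramp scheme are illustrative
choices, not a specific published algorithm.

```python
import numpy as np

def concatenate_and_smooth(unit_a, unit_b, n_smooth=4):
    """Concatenate two acoustic units (frames x parameters) and linearly
    interpolate the parameter tracks over the n_smooth frames on each side
    of the joint, so both units meet at the average of the boundary frames.
    Assumes each unit has at least n_smooth frames (sketch only)."""
    a, b = np.asarray(unit_a, float), np.asarray(unit_b, float)
    joined = np.vstack([a, b])
    delta = b[0] - a[-1]                       # parameter jump at the joint
    for k in range(n_smooth):
        c = 0.5 * (1.0 - k / n_smooth) * delta # ramps from delta/2 down to ~0
        joined[len(a) - 1 - k] += c            # pull end of unit A toward B
        joined[len(a) + k]     -= c            # pull start of unit B toward A
    return joined
```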
There has been a considerable amount of research effort directed at the problem of speech
representation for TTS over the last ten years. The advent of linear prediction (LP) has had its
impact on speech coding as well as on speech synthesis. However, the buzziness inherent in LP
degrades perceived voice quality. Other synthesis techniques based on pitch-synchronous
waveform processing have been proposed, such as the Time-Domain Pitch-Synchronous
Overlap-Add (TD-PSOLA) method [MOU90]. TD-PSOLA is currently one of the most
popular concatenation methods. Although TD-PSOLA provides good-quality speech
synthesis, it has limitations related to its non-parametric structure: spectral mismatch at
segmental boundaries, and tonal quality when prosodic modifications are applied to the
concatenated acoustic units. An alternative method is the MultiBand Resynthesis Overlap-
Add (MBROLA; see the MBROLA project homepage:
http://tcts.fpms.ac.be/synthesis/mbrola/) method, which tries to overcome the TD-PSOLA
concatenation problems by using a specially edited inventory, obtained by resynthesizing the
voiced parts of the original inventory with constant harmonic phases and constant pitch
[DUT97, Chapter 10]. Both TD-PSOLA and MBROLA have very low computational cost.
Sinusoidal approaches (e.g., [MAC96]) and hybrid harmonic/stochastic representations
[STY98] have also been proposed for speech synthesis. These models are intrinsically more
powerful than TD-PSOLA and MBROLA for compression, modification and smoothing.
However, they are also about ten times more computationally intensive than TD-PSOLA or
MBROLA. For a formal comparison between different speech representations for text-to-
speech, see [DUT94] and [SYR98].
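The core of TD-PSOLA is easy to sketch. Assuming pitch marks (one per period) are
available for the recorded unit, pitch can be raised or lowered by re-spacing windowed
two-period segments before overlap-adding them. The toy Python function below illustrates
the idea; it ignores the unvoiced-frame handling and duration control of a real implementation,
and its names are hypothetical.

```python
import numpy as np

def td_psola(x, marks, f0_ratio):
    """Minimal TD-PSOLA pitch-modification sketch.
    marks: analysis pitch marks (sample indices, one per period);
    f0_ratio > 1 raises the pitch, < 1 lowers it; duration is preserved."""
    x = np.asarray(x, float)
    marks = np.asarray(marks, int)
    y = np.zeros(len(x))
    t_syn = float(marks[0])
    while t_syn < marks[-2]:
        # pick the analysis mark nearest to the current synthesis instant
        i = int(np.argmin(np.abs(marks - t_syn)))
        i = min(max(i, 1), len(marks) - 2)
        left = marks[i] - marks[i - 1]
        right = marks[i + 1] - marks[i]
        seg = x[marks[i] - left : marks[i] + right]  # two-period segment
        seg = seg * np.hanning(len(seg))             # OLA analysis window
        start = int(t_syn) - left
        if start >= 0 and start + len(seg) <= len(y):
            y[start : start + len(seg)] += seg       # overlap-add
        # synthesis marks are spaced by the modified local pitch period
        t_syn += (marks[i + 1] - marks[i]) / f0_ratio
    return y
```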
[Figure: A corpus-based (unit-selection) synthesizer. The architecture is the same as above,
except that the diphone database is replaced by a very large speech corpus; for each target j,
candidate units i from the corpus are evaluated through a target cost tc(j,i) before prosodic
modification and joint smoothing.]
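In unit selection as formalized by Hunt and Black [HUN96], the best unit sequence minimizes
the sum of target costs tc(j,i), which measure how well candidate unit i matches target j, and
concatenation costs, which measure how well consecutive units join; the optimum is found by
a Viterbi search. A minimal sketch, with user-supplied cost functions (all names illustrative):

```python
import numpy as np

def select_units(targets, candidates, target_cost, concat_cost):
    """Viterbi search over candidate units, minimising the sum of target
    costs tc(t_j, u_i) and concatenation costs cc(u_prev, u).
    candidates[j] is the list of candidate units for target j."""
    cost = [[target_cost(targets[0], u) for u in candidates[0]]]
    back = [[-1] * len(candidates[0])]
    for j in range(1, len(targets)):
        row, ptr = [], []
        for u in candidates[j]:
            tc = target_cost(targets[j], u)
            prev = [cost[j - 1][k] + concat_cost(candidates[j - 1][k], u)
                    for k in range(len(candidates[j - 1]))]
            k_best = int(np.argmin(prev))
            row.append(prev[k_best] + tc)
            ptr.append(k_best)
        cost.append(row)
        back.append(ptr)
    # backtrack the cheapest path through the candidate lattice
    k = int(np.argmin(cost[-1]))
    path = [k]
    for j in range(len(targets) - 1, 0, -1):
        k = back[j][k]
        path.append(k)
    path.reverse()
    return [candidates[j][k] for j, k in enumerate(path)]
```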
Vocabulary Size
The size of the available vocabulary is another key point in speech recognition applications.
Clearly, the larger the vocabulary, the more opportunities the system has to make errors. A
good speech recognition system will therefore make it possible to adapt its vocabulary to the
task it is currently assigned (i.e., possibly enable a dynamic adaptation of its vocabulary). The
difficulty level is usually classified according to Table 1, with a score from 1 to 10, where 1 is
the simplest system (speaker-dependent, able to recognize isolated words in a small
vocabulary of 10 words) and 10 corresponds to the most difficult task (speaker-independent
continuous speech over a large vocabulary of, say, 10,000 words). State-of-the-art speech
recognition systems with acceptable error rates are somewhere in between these two
extremes.
Commonly obtained error rates on speaker-independent isolated-word databases are around
1% for a 100-word vocabulary, 3% for 600 words, and 10% for 8000 words [DER98]. For
speaker-independent continuous speech recognition, error rates are around 15% with a
trigram language model and a 65,000-word vocabulary [YOU97].
[Figure 4: Block diagram of a speech recognition system: the acoustic environment and
transduction equipment, followed by feature extraction, probability estimation, and decoding
with language models, producing recognized sentences.]
Note that the first block, which consists of the acoustic environment plus the transduction
equipment (microphone, preamplifier, filtering, A/D converter), can have a strong effect on
the generated speech representations. For instance, additive noise, room reverberation,
microphone position, and type of microphone can all be associated with this part of the
process. The second block, the feature extraction subsystem, is intended to deal with these
problems, as well as to derive acoustic representations that are both good at separating classes
of speech sounds and effective at suppressing irrelevant sources of variation.
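As a toy illustration of such a front end, the following sketch computes cepstral coefficients
from 25 ms frames taken every 10 ms (i.e., 100 frames per second). Real systems typically
use mel-scaled filterbanks (MFCCs) and add normalization for robustness, both omitted here;
all names are illustrative.

```python
import numpy as np

def cepstral_features(x, fs=8000, frame_ms=25, hop_ms=10, n_ceps=13):
    """Crude stand-in for an ASR front end: Hamming-windowed frames,
    log magnitude spectrum, then the real cepstrum truncated to its
    first n_ceps coefficients."""
    x = np.asarray(x, float)
    frame = int(fs * frame_ms / 1000)
    hop = int(fs * hop_ms / 1000)
    feats = []
    for start in range(0, len(x) - frame, hop):
        w = x[start:start + frame] * np.hamming(frame)
        log_mag = np.log(np.abs(np.fft.rfft(w)) + 1e-8)
        cep = np.fft.irfft(log_mag)      # real cepstrum of the frame
        feats.append(cep[:n_ceps])
    return np.array(feats)               # shape (n_frames, n_ceps)
```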
The next two blocks in Figure 4 illustrate the core acoustic pattern matching operations of
speech recognition. In nearly all ASR systems, a representation of speech, such as a spectral
or cepstral representation, is computed over successive intervals, e.g., 100 times per second.
These representations or speech frames are then compared to the spectra or cepstra of frames
that were used for training, using some measure of similarity or distance. Each of these
comparisons can be viewed as a local match. The global match is a search for the best
sequence of words (in the sense of the best match to the data), and is determined by
integrating many local matches. The local match does not typically produce a single hard
choice of the closest speech class, but rather a group of distances or probabilities
corresponding to possible sounds. These are then used as part of a global search or decoding
to find an approximation to the closest (or most probable) sequence of speech classes, or
ideally to the most likely sequence of words. Another key function of this global decoding
block is to compensate for temporal distortions that occur in normal speech. For instance,
vowels are typically shortened in rapid speech, while some consonants may remain nearly the
same length.
The recognition process is based on statistical models, Hidden Markov Models (HMMs)
[RAB89, RAB93], which are now widely used in speech recognition. An HMM is typically
defined (and represented) as a stochastic finite state automaton (SFSA), assumed to be built
up from a finite set of possible states, each of these states being associated with a specific
probability distribution (or probability density function, in the case of continuous
observations).
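Given per-state (log-)likelihoods from the local match and the transition probabilities of the
SFSA, the most likely state sequence is obtained with the Viterbi algorithm; a compact
sketch:

```python
import numpy as np

def viterbi(log_b, log_A, log_pi):
    """Most likely HMM state sequence.
    log_b:  (T, S) per-frame log-likelihoods b_s(x_t) from the local match
    log_A:  (S, S) log transition probabilities
    log_pi: (S,)   log initial state probabilities"""
    T, S = log_b.shape
    delta = log_pi + log_b[0]
    psi = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_A            # (from-state, to-state)
        psi[t] = np.argmax(scores, axis=0)         # best predecessor
        delta = scores[psi[t], np.arange(S)] + log_b[t]
    states = [int(np.argmax(delta))]               # backtrack from the end
    for t in range(T - 1, 0, -1):
        states.append(int(psi[t][states[-1]]))
    return states[::-1]
```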
Ideally, there should be an HMM for every possible utterance. However, this is clearly
infeasible. A sentence is thus modeled as a sequence of words. Some recognizers operate at
the word level, but if we are dealing with any substantial vocabulary (say, over 100 words or
so), it is usually necessary to further reduce the number of parameters (and, consequently, the
required amount of training material). To avoid the need for a new training phase each time a
new word is added to the lexicon, word models are often composed of concatenated sub-word
units. Any word can be split into acoustic units. Although there are good linguistic arguments
for choosing units such as syllables or demi-syllables, the units most commonly used are
speech sounds (phones), the acoustic realizations of linguistic units called phonemes.
Phonemes are speech sound categories that are meant to differentiate between words in a
language. One or more HMM states are commonly used to model a segment of speech
corresponding to a phone. Word models consist of concatenations of phone or phoneme
models (constrained by pronunciations from a lexicon), and sentence models consist of
concatenations of word models (constrained by a grammar), as sketched below.
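A minimal sketch of this composition step, under the (illustrative) assumption that each phone
model is a small left-to-right HMM stored as a transition matrix plus state labels:

```python
import numpy as np

def compose_word_model(phones, phone_hmms):
    """Concatenate left-to-right phone HMMs into a single word HMM.
    phone_hmms[p] = (A_p, labels_p), with A_p an (n, n) transition matrix
    whose last state exits the phone with probability 1 - A_p[-1, -1].
    Data layout and linking convention are illustrative only."""
    mats = [phone_hmms[p][0] for p in phones]
    labels = [lab for p in phones for lab in phone_hmms[p][1]]
    n = sum(m.shape[0] for m in mats)
    A = np.zeros((n, n))
    offset = 0
    for i, m in enumerate(mats):
        k = m.shape[0]
        A[offset:offset + k, offset:offset + k] = m
        if i + 1 < len(mats):
            # route the phone's exit probability to the next phone's entry
            A[offset + k - 1, offset + k] = 1.0 - m[-1, -1]
        offset += k
    return A, labels

# e.g. word "dog": compose_word_model(["d", "o", "g"], phone_hmms)
```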
Hybrid HMM/ANN systems, in which an artificial neural network (ANN) estimates the HMM
emission probabilities [BOU94], have several advantages:
Model accuracy:
ANN estimation of probabilities does not require detailed assumptions about the form of the
statistical distribution to be modeled, resulting in more accurate acoustic models.
Contextual information:
The ANN estimator can take multiple inputs from a range of speech frames, and the network
will learn something about the correlation between the acoustic inputs. This is in contrast
with more conventional approaches, which assume that successive acoustic vectors are
uncorrelated (an assumption that is clearly wrong).
Discrimination:
ANNs easily accommodate discriminant training, that is, at training time, speech frames
which characterize a given acoustic unit are used to train the corresponding HMM to
recognize these frames, and to train the other HMMs to reject them. Of course, as currently
done in standard HMM/ANN hybrids, discrimination is only local (at the frame level). It
remains that this discriminant training option is clearly closer to the way humans recognize
speech.
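In practice, hybrid HMM/ANN decoding plugs the network outputs into the HMM as follows:
the frame-level posteriors p(q_k|x_t) estimated by the ANN are divided by the class priors
p(q_k) to obtain scaled likelihoods usable as emission scores [BOU94]. A one-function
sketch:

```python
import numpy as np

def scaled_likelihoods(posteriors, priors, eps=1e-10):
    """Convert ANN frame posteriors p(q_k | x_t) into scaled likelihoods
    p(x_t | q_k) / p(x_t) by dividing by the class priors p(q_k), here in
    the log domain; priors are estimated from the training alignments."""
    return np.log(posteriors + eps) - np.log(priors + eps)
```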
SNR    -5 dB    0 dB     10 dB    15 dB    Clean
WER    90.2 %   72.2 %   20.0 %   8.0 %    1.0 %
Table 2: Word error rate (WER) on the Aurora 2 database (continuous digits in noisy
environments) at various signal-to-noise ratios (SNR).
In the case of short-term (frame-based) frequency analysis, even when only a single frequency
component is corrupted (e.g., by frequency-selective additive noise), the whole feature vector
produced by the feature extraction stage of Fig. 4 is generally corrupted, and the performance
of the recognizer is typically severely impaired.
Multi-band speech recognition [DUP00] is one approach explored by many researchers.
Current automatic speech recognition systems treat the incoming signal as one entity. There
are, however, several reasons why we might want to view the speech signal as a multi-stream
input in which each stream contains specific information and is therefore processed (up to
some time range) more or less independently of the others. In multi-band speech recognition,
the input speech is divided into disjoint frequency bands that are treated as separate sources
of information. These can then be merged within an automatic speech recognition (ASR)
system to determine the most likely spoken words. Hybrid HMM/ANN systems provide a
good framework for such problems, where discrimination and the possibility of using
temporal context are important features.
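One simple recombination strategy among the several studied in the multi-band literature
merges the per-band classifier outputs by a weighted product of posteriors, i.e., a weighted
sum in the log domain. A toy sketch, with illustrative names:

```python
import numpy as np

def merge_band_posteriors(band_posteriors, band_weights=None):
    """Merge per-band frame posteriors (each of shape (T, classes)) by a
    weighted log-domain sum (product of posteriors), then renormalise.
    Equal band weights by default; real systems may learn or adapt them."""
    logs = [np.log(p + 1e-10) for p in band_posteriors]
    if band_weights is None:
        band_weights = np.ones(len(logs)) / len(logs)
    merged = np.exp(sum(w * lp for w, lp in zip(band_weights, logs)))
    return merged / merged.sum(axis=1, keepdims=True)
```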
Language Models
Other research aims at improving language models, which are another key component of
speech recognition systems. The language model is the component that incorporates the
syntactic constraints of the language. Most state-of-the-art large-vocabulary speech
recognition systems make use of statistical language models, which are easily integrated with
the other system components. Most probabilistic language models are based on the empirical
paradigm that a good estimate of the probability of a linguistic event can be obtained by
observing this event on a large enough text corpus. The most commonly used models are
n-grams, in which the probability of a sentence is estimated from the conditional probabilities
of each word or word class given the n-1 preceding words or word classes. Such models are
particularly interesting since they are both robust and efficient, but they are limited to
modeling local linguistic structure. Bigram and trigram language models are widely used in
speech recognition systems (e.g., dictation systems).
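As a minimal illustration, the following sketch estimates an add-alpha-smoothed bigram
model from a toy corpus; real systems use much larger corpora and more refined smoothing
(e.g., backoff), and the function names here are illustrative.

```python
from collections import defaultdict

def train_bigram(sentences, alpha=1.0):
    """Bigram language model with add-alpha smoothing:
    P(w2 | w1) = (c(w1 w2) + alpha) / (c(w1) + alpha * |V|)."""
    unigrams, bigrams = defaultdict(int), defaultdict(int)
    vocab = set()
    for s in sentences:
        words = ["<s>"] + s.split() + ["</s>"]   # sentence boundary tokens
        vocab.update(words)
        for w1, w2 in zip(words, words[1:]):
            unigrams[w1] += 1
            bigrams[(w1, w2)] += 1
    V = len(vocab)
    def prob(w1, w2):
        return (bigrams[(w1, w2)] + alpha) / (unigrams[w1] + alpha * V)
    return prob

# e.g. p = train_bigram(["the dog barks", "the cat sleeps"]); p("the", "dog")
```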
One important issue for speech recognition is how to create language models for spontaneous
speech. When recognizing spontaneous speech in dialogs, it is necessary to deal with
extraneous words, out-of-vocabulary words, ungrammatical sentences, disfluencies, partial
words, hesitations, and repetitions. These kinds of variation can degrade recognition
performance. For example, the results obtained on the SwitchBoard database (telephone
conversations) show a recognition accuracy of only 50% for the baseline systems [COH94].
Better language models are presently a major issue and could be obtained by looking beyond
n-grams, by identifying useful linguistic information and integrating more of it. Better
pronunciation modeling will probably enlarge the population that can get acceptable results
from a speech recognition system and therefore strengthen the acceptability of the system.
[Figure 5: Speech quality vs. bit rate for major speech coding techniques. Quality (bad, poor,
fair, good, excellent) is plotted against bit rate (1 to 32 kbps) for: LPC10E (DoD, 1975),
MELP (DoD, 1996) and new MELP (2000), CELP FS-1016 (DoD, 1989), IMBE (Inmarsat,
1991), AMBE (Inmarsat, 1995), HVXC (MPEG-4, 1999), RPE-LTP (GSM-FR, 1988),
LD-CELP (G.728, 1991), ACELP (G.729; G.723, 1995), and ADPCM (G.727, 1984).]
Figure 5 shows, for commercially used standards, the speech quality that is currently
achievable at various bit rates, from 1.2 kbps to 32 kbps, and for narrowband telephone (300-
3400 Hz) speech only. For the sake of clarity, we do not include higher-bandwidth speech
coders dedicated to multimedia applications; information on such coders can be found in
[KLE95]. Notice also that a speech coder cannot be judged on its speech quality alone.
Robustness against harsh environments, end-to-end delay, and immunity to bit errors as well
as to frame loss are equally important, and complexity is always a decisive factor. These
important parameters, however, are not considered in this short paper. We invite the
interested reader to refer to [KLE95, COX96, SPA94, ATA91] for more details. Finally, audio
coding techniques (MPEG, AC3, …) using perceptual properties of the human ear are not
considered here either; see [WWWMPEG] for information on such coders. However, we
cannot ignore the emerging MPEG-4 Natural Audio Coding technique. This is a complete
toolbox able to perform all kinds of coding, from low bit rate speech coding to high-quality
audio coding or music synthesis [WWWMPEG][WWWMPA]. It should be the successor of
the popular MPEG-2 Layer-3 (the real name of MP3) format. It integrates Harmonic Vector
Excitation (HVXC) coding for narrowband speech and multi-rate CELP (in which the number
of parameters in the model may vary) for higher bit rates (see below).
In the next sections, we examine the latest telephone speech coders, from the lowest bit rate to
the highest.
3.1 Vocoders
The linear predictive coder is the ancestor of most of the coders presented in this paper. It is
based on the source-filter speech model. This approach models the vocal tract as a slowly
varying linear filter, whose parameters can be dynamically determined using linear predictive
analysis [KLE95]. The filter is excited either by glottal pulses (modeled as a periodic signal,
for voiced excitation) or by turbulence (modeled as white noise, for unvoiced excitation).
This source-filter model is functionally similar to the human speech production mechanism:
the source excitation represents the stream of air blown through the vocal tract, and the linear
filter models the vocal tract itself. Direct application of this model led to the U.S. Department
of Defense FS-1015 2.4 kbps LPC-10E standard (1975) [WWWDOD]. This standard coder
uses a simple method for pitch detection and voiced/unvoiced decision. The intelligibility of
this coder is acceptable, but it produces coded speech with a somewhat buzzy, mechanical
quality. To improve speech quality, more advanced excitation models were required.
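The two halves of such a vocoder, LP analysis and source-filter synthesis, can be sketched in
a few lines of Python (autocorrelation method, untuned gain handling; the frame and
parameter handling of a real LPC-10 coder is omitted, and the names are illustrative):

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter

def lpc_coefficients(frame, order=10):
    """LP analysis by the autocorrelation method: solve the normal
    equations R a = r for the predictor, giving the all-pole filter 1/A(z)."""
    frame = np.asarray(frame, float)
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    a = solve_toeplitz((r[:order], r[:order]), r[1:order + 1])
    return np.concatenate(([1.0], -a))       # A(z) = 1 - sum_k a_k z^-k

def synthesize_frame(A, voiced, pitch_period, n, gain=1.0):
    """Excite 1/A(z) with a pulse train (voiced) or white noise (unvoiced),
    mirroring the LPC source-filter model."""
    if voiced:
        e = np.zeros(n)
        e[::pitch_period] = 1.0              # one pulse per pitch period
    else:
        e = np.random.randn(n)               # turbulence as white noise
    return lfilter([gain], A, e)
```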
MBE
The Multi-Band Excitation (MBE) coder [GRIF87] is a frequency-domain coder mixing
voiced and unvoiced excitation within the same frame. The MBE model divides the spectrum
into sub-bands at multiples of the pitch frequency and allows a separate voiced/unvoiced
decision for each frequency band of the frame. A specific implementation of the MBE
vocoder was chosen as the INMARSAT-M standard for land mobile satellite service
[WON91]. This version, denoted IMBE, uses 4.15 kbps to code speech, for a gross bit rate of
6.4 kbps (error-correcting code included). Improved quantization of the amplitude envelope
has led to Advanced MBE (AMBE), at a rate of 3.6 kbps (4.8 kbps gross). The AMBE model
was selected as the INMARSAT mini-M standard [DIM95], and also for the declining
IRIDIUM satellite communication system. Other implementations of MBE models and
improvements of the amplitude quantization scheme have been successfully used in mobile
satellite applications [WER95]. The introduction of vector quantization schemes for these
parameters has made it possible to design a coder producing intelligible speech at a bit rate as
low as 800 bps [WER95].
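The per-band decision can be caricatured as follows: within each band, if most of the spectral
energy lies near multiples of F0, the band is declared voiced. This toy rule only hints at the
real MBE analysis, which fits a harmonic model per band; thresholds and names below are
illustrative.

```python
import numpy as np

def band_voicing(frame, f0_hz, fs, n_bands=8, thresh=0.5):
    """Crude per-band voiced/unvoiced sketch in the spirit of MBE: in each
    band, the fraction of energy lying near pitch harmonics decides V/UV."""
    N = len(frame)
    spec = np.abs(np.fft.rfft(frame * np.hanning(N))) ** 2
    freqs = np.fft.rfftfreq(N, 1.0 / fs)
    # bins whose frequency is within 15% of a pitch harmonic
    near_harm = np.abs(freqs / f0_hz - np.round(freqs / f0_hz)) < 0.15
    edges = np.linspace(0, fs / 2, n_bands + 1)
    voiced = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        band = (freqs >= lo) & (freqs < hi)
        e_total = spec[band].sum() + 1e-12
        e_harm = spec[band & near_harm].sum()
        voiced.append(e_harm / e_total > thresh)
    return voiced
```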
MELP
To avoid the hard voiced/unvoiced decision, the Mixed Excitation Linear Prediction (MELP)
coder models the excitation as a combination of periodic and noise-like components with
relative “voicing strengths”, instead of a hard decision. Several fixed bands are considered
across the spectrum. The MELP approach better models frames with mixed voicing, as in the
voiced fricative /z/ for example. The short-term spectrum is modeled with the usual LPC
analysis. In the mid-1990s, a MELP coder was selected to replace the LPC-10E US
Department of Defense federal standard. This implementation performs as well as a CELP
algorithm working at 4.8 kbps. Following this standardization, MELP has been the focus of
much experimentation aimed at reducing the bit rate to 1.7 kbps [MCR98] or 1.2 kbps
[WAN00], and at improving quality [STA99].
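A sketch of the decoder-side mixed excitation, with illustrative band edges and filter lengths:
a pulse train and white noise are band-pass filtered and mixed per band according to the
transmitted voicing strengths.

```python
import numpy as np
from scipy.signal import firwin, lfilter

def mixed_excitation(n, pitch_period, strengths, fs=8000):
    """MELP-style mixed excitation sketch: per band, mix a pulse train and
    noise weighted by the voicing strength in [0, 1] (one strength per band;
    band edges below are illustrative, not the standard's)."""
    pulses = np.zeros(n)
    pulses[::pitch_period] = 1.0
    noise = np.random.randn(n) * 0.1
    edges = [100, 800, 1600, 2400, 3200, 3900]       # Hz, 5 bands
    out = np.zeros(n)
    for k, s in enumerate(strengths):
        h = firwin(65, [edges[k], edges[k + 1]], pass_zero=False, fs=fs)
        out += s * lfilter(h, 1.0, pulses) + (1 - s) * lfilter(h, 1.0, noise)
    return out
```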
HVXC
Harmonic Vector Excitation coders (HVXC) [WWWMPEG] encode the short-term
information using vector quantization of the LPC parameters. The residual of the linear
prediction analysis is vector-quantized for voiced frames, while a CELP scheme (see below)
is used for unvoiced frames. The preferred bit-rate range of HVXC vocoders is 2 kbps to
4 kbps.
Waveform Interpolation
To be complete, we must mention the promising 4.0 kbps waveform interpolation (WI) coder
[GOT99], which several listeners preferred over the MPEG-4 HVXC coder, and also slightly
over the G.723 ACELP. It is based on the fact that the slow variation of pitch-period
waveforms in voiced speech allows them to be downsampled. However, its computational
complexity is higher than that of other coders.
ITU-T G.728
The ITU-T G.728 standard [WWWITU] (1992) is a 16 kbps algorithm for coding telephone-
bandwidth speech for universal applications, using low-delay code-excited linear prediction
(LD-CELP). The G.728 coding algorithm is based on a standard linear prediction analysis-by-
synthesis (LPAS) CELP technique, but several modifications are incorporated to meet the
needs of low-delay, high-quality speech coding. G.728 uses short excitation vectors
(5 samples, or 0.625 ms) and backward-adaptive linear predictors. The algorithmic delay of
the resulting coder is 0.625 ms, resulting in an achievable end-to-end delay of less than 2 ms.
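At the heart of any CELP coder is the analysis-by-synthesis loop: each candidate excitation
vector is passed through the synthesis filter and compared with the target frame, and the
index and gain of the best candidate are transmitted. A bare-bones sketch (real coders add
perceptual weighting, an adaptive codebook for the pitch contribution, and fast search tricks):

```python
import numpy as np
from scipy.signal import lfilter

def celp_search(target, codebook, A):
    """Analysis-by-synthesis codebook search: filter each codeword through
    the LP synthesis filter 1/A(z), compute its optimal gain, and keep the
    (index, gain) pair minimising the squared error against the target."""
    best = (None, None, np.inf)
    for idx, c in enumerate(codebook):
        synth = lfilter([1.0], A, c)                    # filtered codeword
        g = np.dot(target, synth) / (np.dot(synth, synth) + 1e-12)
        err = np.sum((target - g * synth) ** 2)
        if err < best[2]:
            best = (idx, g, err)
    return best[0], best[1]      # transmitted: codebook index and gain
```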
RPE-LTP GSM
An RPE coder has been standardized by the GSM group of ETSI. It is now used as the GSM
full-rate mode in all European GSM networks. The bit rate is 13 kbps for the speech
information; the gross bit rate, including channel error protection, is 22.8 kbps. This standard
is now superseded by the Enhanced Full Rate GSM coder, which is an ACELP coder
[WWWETSI].
ITU-T G.729
This coder is an LPAS coder using an algebraic CELP (ACELP) codebook structure, in
which the prediction filter and the gains are explicitly transmitted, along with the pitch period
estimate. Its originality lies in the algebraic codebook structure, which improves both index
transmission and the best-fit search procedure. This coder is used for the transmission of
telephone-bandwidth speech at 8 kbps. Its main application is Simultaneous Voice and Data
(SVD) modems [WWWITU].
ITU-T G.723
The ITU-T G.723 standard [WWWITU] is used for videoconferencing and Voice over IP
applications (as part of the H.323/H.324 ITU recommendations). It is an ACELP coder with a
dual-rate (5.3 kbps/6.3 kbps) switchable implementation.
Other important CELP standards exist, such as VSELP (IS-54) for American cellular mobile
phones [WWWTIA], IS-641 for the new generation of American digital cellular phones
[WWWTIA], and the Adaptive Multi-Rate coder (AMR, or GSM 06.90) for European cellular
phones [WWWETSI].
3.4 Waveform coders
At the highest bit rates, the reference coder is the ITU-T G.711 64 kbps pulse code
modulation (PCM) standard [WWWITU]. This “coder” is also the simplest: it only involves
the sampling and logarithmic quantization of speech, in such a way as to maintain good
intelligibility and quality. It is still used in most telecommunication standards (e.g., ISDN,
ATM) where bandwidth is less critical than simplicity and universality (both voice and data
on the same network). Both the G.726 adaptive differential PCM (ADPCM) coder and the
embedded G.727 ADPCM coder maintain the same level of speech quality at, basically,
32 kbps. Extensions of these standards down to 16 kbps exist, but they are outperformed by
the newer generation of hybrid coders described above.
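G.711 quantization follows a logarithmic companding law (mu-law in North America and
Japan, A-law in Europe). The sketch below uses the continuous mu-law formula; the actual
standard approximates it with piecewise-linear segments.

```python
import numpy as np

def mu_law_encode(x, mu=255):
    """Continuous mu-law companding followed by 8-bit quantisation,
    giving 64 kbps at 8 kHz sampling; x is normalised to [-1, 1]."""
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return np.round((y + 1) / 2 * 255).astype(np.uint8)

def mu_law_decode(code, mu=255):
    """Inverse companding: expand the 8-bit code back to [-1, 1]."""
    y = code.astype(float) / 255 * 2 - 1
    return np.sign(y) * ((1 + mu) ** np.abs(y) - 1) / mu
```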
Conclusion
Speech technology has experienced a major paradigm shift in the last decade: from “speech
science”, it became “speech (and language) engineering”. This is no coincidence. The
increasing availability of large databases, the existence of organizations responsible for
collecting and redistributing speech and text data (LDC in the US, ELRA in Europe), and the
growing need for algorithms that work in real applications (while the problems they have to
handle are very intricate) require people to act as engineers more than as experts. Currently
emerging products and technologies are certainly less “human-like” than what we expected
(in the sense that speech coding, synthesis, and recognition technologies still make little use
of syntax, semantics, and pragmatics, known to play a major role when humans process
speech), but they do work in real time on today’s machines.
References
[ALL85] Allen, J., (1985), "A Perspective on Man-Machine Communication by Speech",
Proceedings of the IEEE, vol. 73, no. 11, pp. 1541-1550.
[ATA82] B.S. Atal and J.R. Remde, "A new model of LPC excitation for producing natural
sounding speech at low bit rates", Proc. ICASSP, 1982.
[ATA91] B.S. Atal, "Speech Coding", http://cslu.cse.ogi.edu/HLTsurvey/ch10node4.html
(written in 1991).
[BOI00] R. Boite, H. Bourlard, T. Dutoit, J. Hancq, H. Leich, 2000. Traitement de la parole.
Presses polytechniques et universitaires romandes, Lausanne, Suisse, ISBN 2-88074-
388-5, 488pp.
[BOU94] Bourlard H. and Morgan N., "Connectionist Speech Recognition – A Hybrid
Approach", Kluwer Academic Publishers, 1994.
[COH94] Cohen J., Gish H., Flanagan J., "SwitchBoard – The second year", Technical report,
CAIP Workshop in Speech Recognition: Frontiers in speech processing II, July 1994.
[COX96] R.V. Cox, P.K. Kroon, "Low bit rate speech coders for multimedia communication",
IEEE Communications Magazine, vol. 34, no. 12, Dec. 1996.
[DEMOLIN] Lincom demonstrations of vocoders (including MELP) in harsh conditions:
http://www.lincom-asg.com/ssadto/index.html
[DEMOASU] Audio examples from ASU: http://www.eas.asu.edu/~speech/table.html (last
updated 1998).
[DER98] Deroo O., "Modèle dépendant du contexte et fusion de données appliqués à la
reconnaissance de la parole par modèle hybride HMM/MLP", PhD Thesis, Faculté
Polytechnique de Mons, 1998 (http://tcts.fpms.ac.be/publications/phds/deroo/these.zip).
[DIM95] S. Dimolitsas “Evaluation of voice codec performance for Inmarsat mini-M”, ICDSC, 1995
[DUP00] Dupont S., "Etude et développement d'architectures multi-bandes et multi-modales
pour la reconnaissance robuste de la parole", PhD Thesis, Faculté Polytechnique de Mons,
2000.
[DUP97] Dupont S., Bourlard H. and Ris C., "Robust Speech Recognition based on Multi-
stream Features", Proceedings of the ESCA/NATO Workshop on Robust Speech Recognition
for Unknown Communication Channels, Pont-à-Mousson, France, pp. 95-98, 1997.
[DUT94] Dutoit, T. 1994. "High quality text-to-speech synthesis: A comparison of four
candidate algorithms". Proceedings of the International Conference on Acoustics, Speech,
and Signal Processing (ICASSP'94), 565-568. Adelaide, Australia.
[DUT97] Dutoit, T. 1997. An Introduction to Text-to-Speech Synthesis. Dordrecht: Kluwer
Academic Publishers.
[FAQ] FAQ on Speech Processing: http://www-svr.eng.cam.ac.uk/comp.speech or
http://www.itl.atr.co.jp/comp.speech/ (last updated 1997).
[GOT99] O. Gottesman, A. Gersho “ Enhanced waveform interpolative coding at 4 kbps”, IEEE
Speech Coding Workshop, 1999.
[GRIF87] D.W. Griffin “The Multiband Excitation Vocoder”, Ph.D. Dissertation, MIT, Feb 1987
[HUN96] Hunt, A.J. and A.W. Black. 1996. "Unit Selection in a Concatenative Speech Synthesis
System Using a Large Speech Database". Proceedings of the International Conference on
Acoustics, Speech, and Signal Processing (ICASSP’96), vol. 1, 373-376. Atlanta, Georgia.
[KLE95] W.B. Kleijn, K.K. Paliwal (eds.), "Speech Coding and Synthesis", Elsevier, 1995.
[KRO86] P. Kroon, E.F. Deprettere, R.J. Sluyter “Regular pulse excitation – a novel approach to
effective and efficient multipulse coding of speech” IEEE Trans. ASSP, Oct 1986
[LEV93] LEVINSON, S.E., J.P. OLIVE, and J.S. TSCHIRGI, (1993), "Speech Synthesis in
Telecommunications", IEEE Communications Magazine, pp. 46-53.
[MAC96] Macon, M.W. 1996. "Speech Synthesis Based on Sinusoidal Modeling", Ph.D.
Dissertation, Georgia Institute of Technology.
[MCR98] A. McCree and J. De Martin, "A 1.7 kb/s MELP coder with improved analysis and
quantization", Proc. ICASSP, 1998.
[MOU90] Moulines, E. and F. Charpentier. 1990. "Pitch-synchronous waveform processing
techniques for text-to-speech synthesis using diphones". Speech Communication, vol. 9,
no. 5-6.
[RAB89] Rabiner L.R., "A tutorial on Hidden Markov Models and selected applications in
speech recognition", Proceedings of the IEEE, vol. 77, no. 2, pp. 257-285, 1989.
[RAB93] Rabiner L.R. and Juang B.H., "Fundamentals of Speech Recognition", PTR Prentice
Hall, 1993.
[RIC91] Richard M.D. and Lippmann R.P., "Neural network classifiers estimate Bayesian a
posteriori probabilities", Neural Computation, no. 3, pp. 461-483, 1991.
[SCH85] M. Schroeder and B. Atal, "Code-excited linear prediction (CELP): high quality
speech at very low bit rates", Proc. ICASSP, 1985.
[SPA94] A.S. Spanias, "Speech Coding: A Tutorial Review", Proceedings of the IEEE,
vol. 82, no. 10, Oct. 1994.
[STA99] J. Stachurski, A. McCree, V. Viswanathan “High Quality MELP coding at bit rates around
4 kbps”, ICASSP 1999
[STY98] Stylianou, Y. 1998. "Concatenative Speech Synthesis using a Harmonic plus Noise
Model". Proceedings of the 3rd ESCA Speech Synthesis Workshop, 261-266. Jenolan
Caves, Australia.
[SYR98] Syrdal, A., Y. Stylianou, L. Garisson, A. Conkie and J. Schroeter. 1998. "TD-PSOLA
versus Harmonic plus Noise Model in diphone based speech synthesis". Proceedings of
the International Conference on Acoustics, Speech, and Signal Processing (ICASSP’98),
273-276. Seattle, USA.
[SAN97] Van Santen, J.P.H., R. Sproat, J. Olive, and J. Hirschberg, eds. 1997. Progress in
Speech Synthesis. New York, NY: Springer-Verlag.
[WAN00] T. Wang, K. Koishida, V. Cuperman, A. Gersho, "A 1200 bps speech coder based
on MELP", Proc. ICASSP, 2000.
[WER95] B. Wery, S. Deketelaere “Voice coding in the MSBN satellite communication system”,
EUROSPEECH Proc 1995
[WON91] S.W. Wong “Evaluation of 6.4 kbps speech codecs for Inmarsat-M system”, ICASSP Proc.
1991
[WWWDOD] See DDVPC Web Site http://www.plh.af.mil/ddvpc/index.html
[WWWMPEG] “MPEG Official Home page” http://www.cselt.it/mpeg/
[WWWMPA] “The MPEG Audio Web page“ http://www.tnt.uni-hannover.de/project/mpeg/audio/
[WWWITU] Search for the G.723, G.728, G.729 and G.711 standards on the ITU web site
(http://www.itu.int/).
[WWWETSI] Search for GSM 06.10, GSM 06.60 and GSM 06.90 on the ETSI web site
(http://www.etsi.org).
[WWWTIA] TIA/EIA/IS-54 and TIA/EIA/IS-641 standards. See the TIA web site
(http://www.tiaonline.org/).
[YOU97] Young S., Adda-Decker M., Aubert X., Dugast C., Gauvain J.L., Kershaw D.J.,
Lamel L., Leeuwen D.A., Pye D., Robinson A.J., Steeneken H.J.M., Woodland P.C.,
"Multilingual large vocabulary speech recognition: the European SQALE project", Computer
Speech and Language, vol. 11, pp. 73-89, 1997.