Speech Coders for Wireless Communication
Digital representation of the speech waveform
[Figure: x(t) → Sampler → x(n) = x(nT) → Quantizer → quantized x(n). The three stages are continuous-time/continuous-amplitude, discrete-time/continuous-amplitude, and discrete-time/discrete-amplitude.]
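As a minimal numerical sketch of this chain (the 1 kHz test tone and 8-bit uniform quantizer are illustrative assumptions, matching the telephone-speech parameters in the table below):

```python
import numpy as np

fs = 8000            # sampling rate (Hz)
bits = 8             # PCM bits per sample
f0 = 1000.0          # assumed 1 kHz test tone

# Sampler: x(n) = x(nT) with T = 1/fs (discrete-time, continuous-amplitude)
n = np.arange(80)                        # 10 ms of samples
x = np.sin(2 * np.pi * f0 * n / fs)

# Uniform quantizer: 2^bits levels over [-1, 1] (discrete-time, discrete-amplitude)
levels = 2 ** bits
codes = np.round((x + 1.0) / 2.0 * (levels - 1)).astype(int)   # integer codes 0..255
x_hat = codes / (levels - 1) * 2.0 - 1.0                        # reconstructed amplitudes
```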
Courtesy: Communication Networks Research (CNR) Lab. EECS, KAIST
Three acoustic signals
                      Telephone speech   Wideband speech   Wideband audio
Frequency range       300-3,400 Hz*      50-7,000 Hz       10-20,000 Hz
Sampling rate         8 kHz              16 kHz            48 kHz
PCM bits per sample   8                  14                16
PCM bit rate          64 kb/s            224 kb/s          768 kb/s
* 300-3,400 Hz is the bandwidth in Europe; in the United States and Japan it is 200-3,200 Hz.
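The PCM bit rates in the table follow directly from bit rate = sampling rate × bits per sample; a quick check:

```python
# PCM bit rate = sampling rate (samples/s) * bits per sample
signals = {
    "telephone speech": (8000, 8),     # 300-3,400 Hz band
    "wideband speech":  (16000, 14),   # 50-7,000 Hz band
    "wideband audio":   (48000, 16),   # 10-20,000 Hz band
}
for name, (fs, bits) in signals.items():
    print(f"{name}: {fs * bits // 1000} kb/s")   # 64, 224, 768 kb/s
```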
Frequency response of Telephone transmission channel
[Figure: bandwidth comparison — telephone speech up to about 4 kHz, wideband speech up to 7 kHz, CD-quality music up to 20 kHz.]
Talk → A/D → Encoder (compress, record/store) → Storage → Decoder (decompress, play) → D/A → Listen
Source coding techniques
Waveform coding
The reconstructed signal matches the original signal as closely as possible. Robust for a wide range of speakers and noisy environments.
Vocoder
Parametric coding based on the quasi-stationary model of speech production.
Hybrid coding
A combination of waveform coding and vocoding.
Hybrid coders
Multi-Pulse Excitation (MPE)
Efficient at medium bit rates. A sequence of non-uniformly spaced pulses serves as the excitation signal; the amplitudes and positions of the pulses are the excitation parameters.
Regular-Pulse Excitation (RPE)
Efficient at medium bit rates. A sequence of uniformly spaced pulses serves as the excitation signal; the position of the first pulse within a vector and the pulse amplitudes are the excitation parameters.
Code-Excited Linear Prediction (CELP)
Efficient at low bit rates (below 8 kbps). A codebook of excitation sequences is used; the two key issues are the design and the search of the codebook.
[Figure: Examples of excitations — (a) multipulse, (b) regular-pulse, (c) code-excited linear prediction. In CELP the codebook holds N = 2^M codevectors, where M is the number of transmitted bits.]
Speech Compression Standards
- 64 kbps μ-law/A-law PCM (CCITT G.711)
- 64 kbps 7 kHz Subband/ADPCM (CCITT G.722)
- 32 kbps ADPCM (CCITT G.721)
- 16 kbps Low-Delay CELP (CCITT G.728)
- 13 kbps RPE-LTP (GSM 06.10)
- 13 kbps ACELP (GSM 06.60)
- 13 kbps QCELP (US CDMA Cellular)
- 8 kbps QCELP (US CDMA Cellular)
- 8 kbps VSELP (US TDMA Cellular)
- 8 kbps CS-ACELP (ITU G.729)
- 6.7 kbps VSELP (Japan Digital Cellular)
- 6.4 kbps IMBE (Inmarsat Voice Coding Standard)
- 5.3 & 6.4 kbps TrueSpeech Coder (ITU G.723)
- 4.8 kbps CELP (Fed. Standard 1016, STU-III)
- 2.4 kbps LPC (Fed. Standard 1015, LPC-10E)
Performance of speech codec
- Speech quality (SNR/SEGSNR, MOS, etc.)
- Bit rate (bits per second)
- Complexity (MIPS)
- Coding delay (msec)
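As an illustration of the objective quality measures listed above, a minimal sketch of overall and segmental SNR between an original signal and its coded/decoded version (the 20 ms frame length is an assumed choice):

```python
import numpy as np

def snr_db(x, x_hat):
    """Overall SNR in dB between the original x and the reconstruction x_hat."""
    noise = x - x_hat
    return 10 * np.log10(np.sum(x ** 2) / np.sum(noise ** 2))

def segsnr_db(x, x_hat, frame_len=160):
    """Segmental SNR: mean of the per-frame SNRs (160 samples = 20 ms at 8 kHz)."""
    vals = []
    for i in range(0, len(x) - frame_len + 1, frame_len):
        xf = x[i:i + frame_len]
        ef = xf - x_hat[i:i + frame_len]
        if np.sum(xf ** 2) > 0 and np.sum(ef ** 2) > 0:
            vals.append(10 * np.log10(np.sum(xf ** 2) / np.sum(ef ** 2)))
    return float(np.mean(vals))
```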
Requirements of speech codec for digital cellular
- More channel capacity
- Noise immunity
- Encryption
- Reasonable complexity and encoding delay
Vocoders
Anatomy of Speech Organs:
The source of most speech occurs in the larynx. It contains two folds of tissue called the vocal folds or vocal cords, which can open and shut like a pair of fans. The gap between the vocal cords is called the glottis, and as air is forced through the glottis the vocal cords start to vibrate and modulate the air flow. This process is known as phonation. The frequency of vibration determines the pitch of the voice: for a male it is typically in the range 50-200 Hz, while for a female it can be up to 500 Hz.
[Figure: Glottal pulse waveform (Rosenberg, JASA 49, 1971) — amplitude vs. time, showing the opening phase, closing phase, and closure. Period = 12.5 ms, so the fundamental frequency = 1/0.0125 = 80 Hz.]
[Figure: Spectrum of the glottal pulse (intensity vs. frequency). Harmonics are spaced at 80 Hz, corresponding to the pitch period of 12.5 ms.]
[Figure: Spectrum of the glottal pulse filtered by the vocal tract, shown for the vowels /ee/, /ar/ and /uu/ (intensity vs. frequency). Harmonics remain spaced at 80 Hz, corresponding to the pitch period of 12.5 ms.]
Properties of Speech in Brief
Vowels (e.g. "oo" in blue, "o" in spot, "ee" in key, "e" in again):
- Quasi-periodic
- Relatively high signal power
Consonants (e.g. "s" in spot, "k" in key):
- Non-periodic (random)
- Relatively low signal power
"Wrong": /r/ /o/ /ng/
"Moving": /m/ /uu/ /v/ /i/ /ng/
"Southampton": /s/ /ou/ /th/ /aa/ /m/ /p/ /t/ /a/ /n/
Digital speech model
A basic digital model for speech production
[Figure: a periodic signal generator (voiced) or a random signal generator (unvoiced), multiplied by a gain, drives a linear time-variant filter.]
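A minimal sketch of this model (the filter coefficients, gain and pitch period below are illustrative values, not taken from any coder): a pulse train or white noise, scaled by a gain, drives an all-pole, frame-by-frame time-varying filter.

```python
import numpy as np
from scipy.signal import lfilter

frame = 160                      # 20 ms at 8 kHz
a = [1.0, -1.3, 0.9]             # illustrative, stable all-pole coefficients of A(z)
gain = 0.5

def excitation(voiced, pitch_period=100, n=frame):
    """Periodic impulse train for voiced frames, white noise for unvoiced frames."""
    if voiced:
        e = np.zeros(n)
        e[::pitch_period] = 1.0
        return e
    return np.random.randn(n)

# One voiced and one unvoiced frame synthesized through 1/A(z)
speech = np.concatenate([
    lfilter([gain], a, excitation(True)),
    lfilter([gain], a, excitation(False)),
])
```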
Vocoder
Send three kinds of information to the receiver:
(1) voiced or unvoiced signal, (2) if it is voiced, the period of the excitation signal, (3) the parameters of the prediction filter
Vocoder
[Figure: Vocoder encoder/decoder — the encoder performs voiced/unvoiced classification, pitch detection, and filter coefficient determination; the decoder generates the excitation signal and passes it through the digital filter.]
LPC Introduction
These speech coders are called vocoders (voice coders). Basic idea:
Estimate parameters → Encode parameters → Transmit parameters → Decode parameters → Synthesize speech
They usually provide more bandwidth compression than is possible with waveform coding (2400-9600 bps).
Generalities
- LP model
- Parameter estimation
- Typical memory requirements
LP Model
[Figure: LP model — the pitch period drives an impulse generator (voiced branch); a white-noise generator provides the unvoiced branch. The voiced/unvoiced switch selects the excitation, which is scaled by the gain and passed through an all-pole filter (combining the glottal, vocal tract, and lip radiation filters) to produce the speech signal.]
Parameter Estimation
Therefore, for each frame:
- estimate the LP coefficients (the a_i's)
- estimate the gain
- estimate the type of excitation (voiced or unvoiced)
- estimate the pitch
V/UV Estimation
Several methods:
- Energy of the signal
- Zero crossing rate
- Autocorrelation coefficient
Speech Measurements (1)
Zero crossing rate.

Log energy:
E_s = 10 \log_{10}\!\left(\frac{1}{N}\sum_{n=1}^{N} s^{2}(n)\right)

Normalized autocorrelation coefficient:
C_1 = \frac{\sum_{n=1}^{N-1} s(n)\,s(n-1)}{\sqrt{\left(\sum_{n=1}^{N-1} s^{2}(n)\right)\left(\sum_{n=0}^{N-2} s^{2}(n)\right)}}
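A sketch of these three measurements, following the formulas above (the frame is assumed to be a 1-D NumPy array; the decision thresholds for voiced/unvoiced/silence would be tuned separately and are not shown):

```python
import numpy as np

def zero_crossing_rate(s):
    """Fraction of adjacent sample pairs that change sign."""
    return np.mean(np.abs(np.diff(np.sign(s))) > 0)

def log_energy(s, eps=1e-12):
    """Es = 10*log10((1/N) * sum s^2(n)); eps guards against log(0)."""
    return 10 * np.log10(np.sum(s ** 2) / len(s) + eps)

def c1(s):
    """Normalized autocorrelation coefficient at unit sample delay."""
    num = np.sum(s[1:] * s[:-1])
    den = np.sqrt(np.sum(s[1:] ** 2) * np.sum(s[:-1] ** 2))
    return num / den if den > 0 else 0.0
```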
[Figure: Comparison between actual data and V/U/S (voiced/unvoiced/silence) determination results.]
Pitch Detection
Voiced sounds:
- Produced by forcing air through the glottis
- The vocal cords oscillate and modulate the air flow into quasi-periodic pulses
- The pulses excite resonances in the remainder of the vocal tract
- Different sounds are produced as the muscles change the shape of the vocal tract
- The resonant frequencies are the formant frequencies
- The fundamental frequency, or pitch, is the rate of the pulses
Pitch Detection
[Figure: Short sections of voiced speech and unvoiced speech (amplitude vs. sample number).]
Time-domain pitch estimation
- Well studied area
- Variations of the fundamental frequency are evident in the waveform
- Time-domain speech processing should therefore be capable of detecting the pitch frequency
[Figure: Voiced speech waveform (amplitude vs. sample number).]
Pitch Period Estimation Using the Autocorrelation Function
Periodic signals have a periodic auto-correlation function:

R_n(k) = \sum_{m=0}^{N-1-k} \big[x(n+m)\,w'(m)\big]\big[x(n+m+k)\,w'(k+m)\big]

Basic problems in choosing the window length N: speech changes over time (favoring a small N), but the window must contain at least 2 periods of the waveform. Approaches:
- Choose the window long enough to capture the longest expected period
- Adaptive N
- Use the modified short-time auto-correlation function
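A minimal sketch of autocorrelation-based pitch period estimation on one windowed frame (the 50-400 Hz search range at 8 kHz is an assumed choice):

```python
import numpy as np

def pitch_autocorr(frame, fs=8000, fmin=50, fmax=400):
    """Return (pitch period in samples, pitch frequency in Hz) from the strongest
    autocorrelation peak in the plausible lag range; frame is already windowed."""
    kmin, kmax = int(fs / fmax), int(fs / fmin)
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]   # R(k), k >= 0
    k = kmin + int(np.argmax(r[kmin:kmax + 1]))
    return k, fs / k
```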
Pitch Period Estimation Using the Autocorrelation Function (Contd)
The auto-correlation representation retains too much of the information in the speech signal => the auto-correlation function has many peaks.
Spectrum flattening techniques
- Remove the effects of the vocal tract transfer function
- Center clipping: a non-linear transformation whose clipping value depends on the maximum amplitude => strong peak at the pitch frequency
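A sketch of center clipping (the 30% clipping level relative to the frame maximum is a common but assumed choice):

```python
import numpy as np

def center_clip(frame, ratio=0.3):
    """Zero samples within +/-CL and shift the rest toward zero,
    with CL = ratio * max(|frame|)."""
    cl = ratio * np.max(np.abs(frame))
    out = np.zeros_like(frame, dtype=float)
    out[frame > cl] = frame[frame > cl] - cl
    out[frame < -cl] = frame[frame < -cl] + cl
    return out

# The autocorrelation of center_clip(frame) shows a much cleaner peak at the pitch lag.
```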
Fundamental Frequency
F0 estimation (Hess): determining the main period in a quasi-periodic waveform, usually using the autocorrelation function and the average magnitude difference function (AMDF):

\mathrm{AMDF}_t(m) = \frac{1}{N_p}\sum_{n} \left| s_t(n) - s_t(n-m) \right|, \qquad 0 \le n-m,\; n \le L-1

where L is the frame length and N_p is the number of point pairs. (A peak in the ACF and a valley in the AMDF indicate F0.)
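A sketch of AMDF-based F0 estimation following the definition above (the lag search range is again an assumption; the estimate is the lag of the deepest valley):

```python
import numpy as np

def amdf(frame, m):
    """Average magnitude difference at lag m over the valid point pairs."""
    return np.mean(np.abs(frame[m:] - frame[:len(frame) - m]))

def f0_amdf(frame, fs=8000, fmin=50, fmax=400):
    lags = list(range(int(fs / fmax), int(fs / fmin) + 1))
    m = lags[int(np.argmin([amdf(frame, lag) for lag in lags]))]   # valley indicates F0
    return fs / m
```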
Typical Memory Requirements
- Pitch (6 bits)
- Gain (5 bits)
- Model parameters:
  - LP coefficients (8-10 bits): small changes in the LP coefficients result in large changes in the pole positions.
  - Reflection coefficients (6 bits): if |rk| is near 1, quantization causes large distortion.
  - Log-area ratio: a non-linear transformation of the reflection coefficients that expands the scale near |rk| = 1 (see the sketch below).
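A minimal sketch of the log-area-ratio mapping and its inverse (sign conventions vary in the literature; this one uses LAR = log((1 + r)/(1 - r))):

```python
import numpy as np

def lar_from_reflection(r):
    """Expand the scale near |r| = 1, where direct quantization of the
    reflection coefficients would cause large spectral distortion."""
    r = np.asarray(r, dtype=float)
    return np.log((1.0 + r) / (1.0 - r))

def reflection_from_lar(g):
    g = np.asarray(g, dtype=float)
    return np.tanh(g / 2.0)       # inverse of the mapping above
```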
The main difference among LP vocoders is the computation of the excitation source.
LPC-10
[Figure: LPC-10 encoder and decoder.]
Encoder: speech signal → ADC (8 kHz) → speech window (180 samples) → AMDF and zero crossing → pitch frequency (7 bits) and voiced/unvoiced switch (1 bit); LP analysis (covariance method) → reflection coefficients → non-linear warping → LAR coefficients (4 and 5 bits); the parameters are sent over the channel.
Decoder: the pitch period (7 bits) drives the impulse generator; the voiced/unvoiced switch (1 bit) selects between the impulse generator and the white-noise generator; the gain (5 bits) scales the excitation, which is passed through the synthesis filter 1/A(z) built from the 10 reflection coefficients (5 bits for one and 4 bits for the others) to produce the synthesized speech signal.
RELP
A simple vocoder offers poor sound quality and is usually unsatisfactory.
An improvement is to use the prediction error, rather than the periodic pulse train (for voiced signals) or the random noise (for unvoiced signals), to excite the digital filter that reproduces the speech. The prediction error is also called the residual.
This scheme is called Residual Excited Linear Prediction (RELP) coding.
RELP
[Figure: RELP encoder — the speech is passed through the analysis (digital) filter, whose coefficients are determined per block; the residual is quantized and encoded.]
RELP
RELP follows essentially the same idea as DPCM. However, in RELP the speech signal is divided into blocks (20ms/block).
The optimum linear predictor is designed for each block. For each block, the filter coefficients and the prediction error should be sent to the receiver.
In DPCM, the predictor can be fixed or adaptive. Only the prediction error is sent to the receiver.
Modeling of the prediction error
In each block of speech signal (a frame), the prediction error may also be correlated. To decorrelate the prediction error, each frame is further divided into 4 sub-frames (5ms). The prediction error u(n) is then modelled as
u(n) = h\,u(n - M) + e(n)
where M (40<=M<=120) is called the lag, and h is called the gain. They are determined by using the LMSE principle.
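A sketch of the LMSE estimation of the lag M and gain h: for each candidate lag the optimal gain has a closed form, and the lag giving the smallest squared error is kept (the availability of at least 120 past residual samples is assumed).

```python
import numpy as np

def ltp_params(u, history, min_lag=40, max_lag=120):
    """u: current 5 ms sub-frame of the short-term residual (40 samples).
    history: preceding residual samples (at least max_lag of them).
    Returns (M, h) minimizing sum_n (u(n) - h*u(n-M))^2."""
    past = np.concatenate([history, u])
    n0 = len(history)                            # index of u(0) inside 'past'
    best_M, best_h, best_err = min_lag, 0.0, np.inf
    for M in range(min_lag, max_lag + 1):
        d = past[n0 - M : n0 - M + len(u)]       # delayed residual u(n - M)
        denom = np.dot(d, d)
        h = np.dot(u, d) / denom if denom > 0 else 0.0
        err = np.sum((u - h * d) ** 2)
        if err < best_err:
            best_M, best_h, best_err = M, h, err
    return best_M, best_h
```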
Long-term prediction
The decorrelation of the prediction error is called long-term prediction.
[Figure: RELP with long-term prediction — s(n) → analysis filter A(z) (coefficients determined per frame) → u(n) → long-term prediction → e(n) → encoder.]
RPE-LTP
RPE-LTP has been adopted as the speech coding method in the GSM 06.10 standard.
[Figure: RPE-LTP encoder — analysis filter (coefficients determined per frame), long-term prediction, and regular pulse selection and coding.]
RPE-LTP
Speech is sampled at 8 kHz and quantised to 8 bits/sample. The speech signal is pre-processed to remove any DC component and to pre-emphasize the high-frequency components, partly compensating for their low energy. The signal is then divided into frames (20 ms, 160 samples). An eighth-order optimum linear predictor is designed using the Schur algorithm. The reflection coefficients (related to the filter coefficients) are non-linearly mapped to another set of values called log-area ratios (LAR).
RPE-LTP
The 8 LAR parameters are quantized using 6,6,5,5,4,4,3,3 bits.
So a total of 36 bits for the LAR (or for the filter coefficients).
The frame is filtered by this analysis filter to produce u(n). u(n) is then divided into 4 sub-frames (5 ms each, 40 samples). Long-term prediction is performed for each sub-frame; the lag M is quantized to 7 bits and the gain h is represented by 2 bits. Long-term prediction produces e(n).
RPE-LTP
e(n) is down-sampled by a factor of 3. For each sub-frame there are 4 down-sampling patterns, so 2 bits are needed to specify the pattern used. The down-sampled e(n) has 13 samples; the maximum of these is quantized with 6 bits, and the others are normalized and then represented with 3 bits each.
So in each sub-frame, e(n) is represented by 6 + 13*3 = 45 bits. A frame has 4 sub-frames: 4*45 = 180 bits.
The above method is called regular-pulse excitation (RPE); a sketch of the grid selection follows.
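A sketch of the RPE grid selection and pulse coding described above (the 3-bit pulse quantizer here is a plain uniform quantizer of the normalized amplitudes; the GSM quantization tables are not reproduced):

```python
import numpy as np

def rpe_select(e):
    """e: 40-sample long-term residual of one 5 ms sub-frame.
    Choose the down-sample-by-3 grid with maximum energy, then code the 13
    retained pulses as a 6-bit block maximum plus 3-bit normalized amplitudes."""
    grids = [e[k::3][:13] for k in range(4)]                    # 4 candidate patterns
    pattern = int(np.argmax([np.sum(g ** 2) for g in grids]))   # 2 bits
    pulses = grids[pattern]
    xmax = np.max(np.abs(pulses)) + 1e-12                       # quantized with 6 bits
    codes = np.round((pulses / xmax + 1.0) / 2.0 * 7).astype(int)  # 3 bits per pulse
    return pattern, xmax, codes
```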
13 kbps RPE-LTP coder
[Figure: Encoder — input signal → preprocessing → short-term LPC analysis (reflection coefficients, 36 bits / 20 ms) → short-term analysis filter → LTP analysis (LTP parameters) → RPE grid selection and coding (RPE parameters, 13 pulses / 5 ms), with RPE grid decoding and a local synthesis filter closing the prediction loop.
Decoder — reflection coefficients, LTP parameters and RPE parameters → RPE grid decoding → long-term synthesis → short-term synthesis filter → postprocessing → output signal.]
RPE-LTP
Summary
8 LAR coefficients:        36 bits
For each sub-frame:
  pattern code             2
  lag                      7
  gain                     2
  regular pulse            6 + 39 = 45
  total                    56
4 sub-frames:              4 * 56 = 224
Total, one frame:          224 + 36 = 260
Bit rate:                  260 bits / 20 ms = 13 kb/s
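A quick arithmetic check of the frame bit budget above:

```python
lar_bits = 6 + 6 + 5 + 5 + 4 + 4 + 3 + 3            # 8 LAR coefficients -> 36 bits
subframe_bits = 2 + 7 + 2 + (6 + 13 * 3)            # grid, lag, gain, RPE pulses -> 56
frame_bits = lar_bits + 4 * subframe_bits            # 36 + 4*56 = 260 bits per 20 ms
print(frame_bits, "bits / 20 ms =", frame_bits / 0.020 / 1000, "kb/s")   # 13.0 kb/s
```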