SPEECH DEREVERBERATION VIA MAXIMUM-KURTOSIS
SUBBAND ADAPTIVE FILTERING
Bradford W. Gillespie (1), Henrique S. Malvar (2), and Dinei A. F. Florêncio (2)

(1) Department of Electrical Engineering, University of Washington, Seattle, WA 98195, USA
(2) Microsoft Research, One Microsoft Way, Redmond, WA 98052, USA
ABSTRACT

This paper presents an efficient algorithm for high-quality speech capture in applications such as hands-free teleconferencing or voice recording by personal computers. We process the microphone signals by a subband adaptive filtering structure using a modulated complex lapped transform (MCLT), in which the subband filters are adapted to maximize the kurtosis of the linear prediction (LP) residual of the reconstructed speech. In this way, we attain good solutions to the problem of blind speech dereverberation. Experimental results with actual data, as well as with artificially difficult reverberant situations, show very good performance, both in terms of a significant reduction of the perceived reverberation and in terms of improved spectral fidelity.

1. INTRODUCTION

The quality of speech captured by personal computers in business offices is usually degraded by environment noise and by reverberation (caused by the sound waves reflecting off walls and other surfaces). Quasi-stationary noise produced by computer fans and air conditioning can be significantly reduced by spectral subtraction or similar techniques [1]. Reducing the distortion caused by reverberation is a difficult blind deconvolution problem, due to the broadband nature of speech and the high order of the equivalent impulse response from the speaker's mouth to the microphone. The problem is, of course, alleviated by the use of microphone headsets, but those are usually inconvenient to the user.

In this paper we present an efficient algorithm for speech dereverberation using subband adaptive filtering for fast convergence. The key new concept is to control the adaptive subband filters not by a mean-square error criterion, but by a kurtosis metric on LP residuals. In this way, we make efficient use of the a priori knowledge that the signal to be recovered is speech. The algorithm is capable of reducing reverberation even when a single microphone signal is available, but better results are obtained with arrays containing several microphones.

We can model the signal received by the c-th microphone as

    x_c(n) = \mathbf{s}^T(n) \, \mathbf{g}_c(n) + w_c(n)    (1)

where \mathbf{s}(n) = [ s(n-N+1) \cdots s(n) ]^T, with s(n) the clean speech signal to be recovered, w_c(n) are additive noises, and \mathbf{g}_c(n) are the N-tap acoustic impulse responses. For a typical wideband telephony sampling rate of 16 kHz, N can vary from 1,000 to over 4,000.

A simple multi-microphone speech enhancement system is the delay-and-sum beamformer [2], in which an estimate of s(n) is formed by simply averaging the delayed signals x_c(n - L_c). The delays L_c are computed to best enhance the desired speech signal. More efficient approaches have been reported, such as the use of subband envelope estimation [3] and the decomposition of the received microphone signals into minimum-phase and allpass components [4]. Such techniques have shown only modest improvement over the delay-and-sum approach in terms of reverberation reduction. The use of speech models to improve performance has been discussed in many reports, e.g. [5], [6]. In this paper, we extend the use of explicit speech models by optimizing a metric of time-domain signal concentration to control the adaptation of the dereverberation filters. We achieve significant improvement in performance over the delay-and-sum beamformer, both in subjective signal quality and in spectral definition.

2. SPEECH ENHANCEMENT

For clean voiced speech, LP residuals have strong peaks corresponding to glottal pulses, whereas for reverberated speech such peaks are spread in time [6]. A measure of amplitude spread of LP residuals can therefore serve as a reverberation metric. To test this concept, we performed the following experiment: in a standard 11' x 11' office, we collected speech signals played back through a mouth simulator (Brüel & Kjær 4227) with a sampling frequency of 16 kHz, at fourteen locations, 6" to 84" (6" spacing) from a single omnidirectional electret microphone. We computed 10th-order LP residuals over 32 ms (512-sample) frames, and then the final kurtosis as the average of the frame kurtosis. A typical result is shown in Figure 1, for a female speaker in the presence of interfering office noise. We conclude that LP residual kurtosis is a reasonable measure of reverberation.

Our goal is to develop an online adaptive gradient-descent algorithm that maximizes LP residual kurtosis. In other words, we seek to find blind deconvolution filters that make the LP residuals as far as possible from being Gaussian, an idea that has been applied to blind deconvolution problems in underwater acoustics and geophysics [7], [8]. The following sections present our implementation of such an adaptive algorithm. We begin by developing an online single-channel time-domain system. This is readily extended to handle multiple channels. While the approach is easier to describe in the time domain, a frequency-domain implementation leads to better results, and thus we present the details of the frequency-domain multichannel system.
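To illustrate why kurtosis works as a spread measure, the following small sketch (ours, not the paper's code) compares the normalized kurtosis E{e^4}/E^2{e^2} - 3 of an idealized impulse-like residual against a time-spread version of it; the pulse spacing, room-response model, and signal lengths are arbitrary choices made for the example.

```python
import numpy as np

def kurtosis(e):
    """Normalized kurtosis E{e^4} / E^2{e^2} - 3 (zero for a Gaussian)."""
    m2 = np.mean(e ** 2)
    return np.mean(e ** 4) / m2 ** 2 - 3.0

rng = np.random.default_rng(0)

# Idealized clean voiced-speech LP residual: strong, sparse glottal pulses.
clean = np.zeros(4096)
clean[::160] = 1.0                 # one pulse every 10 ms at 16 kHz

# "Reverberated" residual: the same pulses smeared by a synthetic room
# response modeled as white noise under a decaying exponential.
rir = rng.standard_normal(1024) * np.exp(-np.arange(1024) / 200.0)
reverb = np.convolve(clean, rir)[:4096]

# Time-spreading the peaks lowers the kurtosis, so the metric tracks
# the amount of reverberation.
assert kurtosis(clean) > kurtosis(reverb)
```

The sparse pulse train has very high kurtosis, while the smeared version is closer to Gaussian, which mirrors the trend of Figure 1.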
Figure 1. Kurtosis of LP residuals as a reverberation metric (average kurtosis versus distance to microphone, 0 to 84 inches).

2.1 Single Channel Time-Domain Adaptation

This system is shown in Figure 2(a). The received noisy reverberated speech signal is x(n) and its corresponding LP residual is \tilde{x}(n). \mathbf{h}(n) is the L-tap adaptive filter at time n. The output is \tilde{y}(n) = \mathbf{h}^T(n) \tilde{\mathbf{x}}(n), where \tilde{\mathbf{x}}(n) = [ \tilde{x}(n-L+1) \cdots \tilde{x}(n-1) \; \tilde{x}(n) ]^T. An LP synthesis filter yields y(n), the final processed signal. Adaptation of \mathbf{h}(n) is similar to the traditional LMS adaptive filter [9], except that instead of a desired signal we use a feedback function f(n), described below.

A problem with the system in Figure 2(a) is LP reconstruction artifacts. This can be avoided in a simple manner. For small adaptation rates, the system in Figure 2(a) is linear. \mathbf{h}(n) can thus be computed from \tilde{x}(n) but applied directly to x(n), as shown in Figure 2(b). LP reconstruction artifacts are avoided at the small price of running two filters.

To derive the adaptation equations, recall that we desire a filter that maximizes the kurtosis of \tilde{y}(n), given by

    J(n) = E\{\tilde{y}^4(n)\} / E^2\{\tilde{y}^2(n)\} - 3    (2)

where the expectations E\{\cdot\} can be estimated from sample averages. The gradient of J(n) with respect to the current filter is

    \partial J / \partial \mathbf{h} = 4 ( E\{\tilde{y}^2\} E\{\tilde{y}^3 \tilde{\mathbf{x}}\} - E\{\tilde{y}^4\} E\{\tilde{y} \tilde{\mathbf{x}}\} ) / E^3\{\tilde{y}^2\}    (3)

where the dependence on the time n is not written, for simplicity. In a manner similar to [10], we can approximate the gradient by

    \partial J / \partial \mathbf{h} \approx [ 4 ( E\{\tilde{y}^2\} \tilde{y}^2 - E\{\tilde{y}^4\} ) \tilde{y} / E^3\{\tilde{y}^2\} ] \tilde{\mathbf{x}}(n) = f(n) \tilde{\mathbf{x}}(n)    (4)

We refer to f(n) as the feedback function; it is used to control the filter updates. For continuous adaptation, E\{\tilde{y}^2(n)\} and E\{\tilde{y}^4(n)\} are estimated recursively. The final structure of the update equations for a filter that maximizes the kurtosis of the LP residual of the input waveform is then given by

    \mathbf{h}(n+1) = \mathbf{h}(n) + \mu f(n) \tilde{\mathbf{x}}(n)    (5)

where

    f(n) = 4 ( E\{\tilde{y}^2(n)\} \tilde{y}^2(n) - E\{\tilde{y}^4(n)\} ) \tilde{y}(n) / E^3\{\tilde{y}^2(n)\},
    E\{\tilde{y}^2(n)\} = \beta E\{\tilde{y}^2(n-1)\} + (1-\beta) \tilde{y}^2(n), and    (6)
    E\{\tilde{y}^4(n)\} = \beta E\{\tilde{y}^4(n-1)\} + (1-\beta) \tilde{y}^4(n).

The parameter \mu controls the speed of adaptation, and \beta controls the smoothness of the moment estimates.

2.2 Multichannel Time-Domain Adaptation

A multichannel time-domain implementation extends directly from the single-channel system just described. As before, our objective is to maximize the kurtosis of \tilde{y}(n), the LP residual of y(n). In this case, \tilde{y}(n) = \sum_{c=1}^{C} \mathbf{h}_c^T(n) \tilde{\mathbf{x}}_c(n), where C is the number of channels. Extending the analysis of the previous subsection, it is easy to see that the multichannel update equations become

    \mathbf{h}_c(n+1) = \mathbf{h}_c(n) + \mu f(n) \tilde{\mathbf{x}}_c(n)    (7)

where the feedback function f(n) is computed as in (6) using the multichannel \tilde{y}(n). To jointly optimize the filters, each channel is independently adapted using the same feedback function.

2.3 Frequency-Domain Implementation

Direct use of the time-domain LMS-like adaptation equations in (5) and (7) is not recommended, because the large variations in the eigenvalues of the autocorrelation matrices of the input signals may lead to very slow convergence, or no convergence at all in noisy situations [9]. We use a subband adaptive filtering structure based on the modulated complex lapped transform (MCLT), as proposed in [11]. Since each subband signal has an approximately flat spectrum, we expect not only faster convergence but also reduced sensitivity to noise [11]. A multichannel MCLT-based subband version of the structure of Figure 2(b) is shown in Figure 3. Even though that figure shows only two channels, generalization to more channels is straightforward. Also, although two inverse MCLT blocks per channel are shown in Figure 3, it is clear that we can add the channels in the MCLT domain, so that only one IMCLT is needed for y(n) and only one for \tilde{y}(n).

Figure 2. (a) A single-channel online time-domain adaptive algorithm for maximizing kurtosis of the LP residual (LP analysis A(z), adaptive filter \mathbf{h}(n), LP synthesis A^{-1}(z), feedback function f(n)). (b) Equivalent system, which avoids LP reconstruction artifacts: the coefficients of the filter adapted on \tilde{x}(n) are copied to a second filter applied directly to x(n).

Figure 3. A two-channel online frequency-domain adaptive algorithm for speech dereverberation (per-subband filters H_c(s, m), s = 0, ..., S-1, adapted on the LP residuals and copied to the direct signal path). A system with more than two channels extends directly from this one.
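Before moving to the frequency-domain details, the single-channel recursion in (5)-(6) can be sketched directly. This is our own minimal sketch, assuming the input is already an LP residual (the LP analysis/synthesis stages are omitted); the function name, tap count, and the small floor on the E^3{y^2} denominator are our additions, while \mu = 0.0004 and \beta = 0.99 are the values given in Section 2.4.

```python
import numpy as np

def max_kurtosis_filter(x_res, L=64, mu=4e-4, beta=0.99):
    """Adapt an L-tap filter on the LP residual x_res so that the kurtosis
    of its output grows, following Eqs. (4)-(6). The unit-norm constraint
    (Sec. 2.4) removes the blind-deconvolution gain ambiguity."""
    h = np.zeros(L)
    h[0] = 1.0                              # start as a pass-through filter
    Ey2, Ey4 = 1.0, 3.0                     # recursive moment estimates
    for n in range(L - 1, len(x_res)):
        xv = x_res[n - L + 1:n + 1][::-1]   # [x(n), x(n-1), ..., x(n-L+1)]
        y = h @ xv                          # filtered residual y(n)
        Ey2 = beta * Ey2 + (1 - beta) * y * y       # Eq. (6)
        Ey4 = beta * Ey4 + (1 - beta) * y ** 4      # Eq. (6)
        # Feedback function f(n), Eq. (6), with a floor to avoid division
        # blow-up during silence (our addition).
        f = 4.0 * (Ey2 * y * y - Ey4) * y / max(Ey2 ** 3, 1e-12)
        h = h + mu * f * xv                 # LMS-like update, Eq. (5)
        h /= np.linalg.norm(h)              # constant-norm gain constraint
    return h
```

In the paper this recursion is not run full-band as above but per subband, which is what Sections 2.3 and 2.4 develop.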
We assume that the microphone signals are decomposed via MCLTs into M complex subbands. To determine M, we consider the tradeoff that larger M is desired to whiten the subband spectra, whereas smaller M is desired to reduce processing delay. A good compromise is to set M such that the frame length is about 20-40 ms. Each subband s of each channel c is processed by a complex FIR adaptive filter with L taps, H_c(s, m), where m is the MCLT frame index. By considering that the MCLT approximately satisfies the convolution properties of the FFT [11], we can easily map the update equations in (7) to the frequency domain, generating the new update equation

    H_c(s, m+1) = H_c(s, m) + \mu F(s, m) \tilde{X}_c^*(s, m)    (8)

where the superscript * denotes complex conjugation.

Unlike in an LMS formulation, the appropriate feedback function F(s, m) cannot be computed directly in the frequency domain. To compute the MCLT-domain feedback function F(s, m), we generate the reconstructed signal \tilde{y}(n) and compute f(n) from (6). We then compute F(s, m) from f(n) using the MCLT. The overlapping nature of the MCLT introduces a one-frame delay in the computation of F(s, m). Thus, to maintain an appropriate approximation of the gradient, we use the previous input block in the update equation (8), generating our final update equation

    H_c(s, m+1) = H_c(s, m) + \mu F(s, m-1) \tilde{X}_c^*(s, m-1).    (9)

Assuming the learning gain \mu is small enough, the extra delay in the update equation above introduces only a very small error in the final convergence of the filter.

2.4 Implementation Issues

Kurtosis is insensitive to the total energy of the waveform. Therefore, as in most blind deconvolution problems, there is a gain uncertainty. As usual, we can resolve it by maintaining a constant norm of the filter coefficients at each update cycle.

It is interesting to note that, although our optimization criterion of maximizing the kurtosis of the LP residual makes more sense for voiced speech, we have not found a need to restrict this algorithm to adapt only during voiced segments. Continuously adapting the filters, even during unvoiced or silent periods, provides satisfactory results. This is because during those periods the input energy in \tilde{x} is generally small, which reduces the adaptation rate.

For our dereverberation experiments, we obtained good results with the following parameters: \beta = 0.99, \mu = 0.0004, and H_c(s, 0) = [1 \; 0 \; 0 \cdots 0]^T.

3. EXPERIMENTAL RESULTS

We present several experimental results from our proposed algorithm, comparing them to a delay-and-sum beamformer. As performance metrics, we use equalized room impulse responses and spectrograms. We refrained from computing the mean-square error (MSE) between the original and reconstructed signals, because our system is not driven to minimize MSE, and minimum MSE does not necessarily correspond to better-sounding speech.

3.1 Experiment 1

We collected data using a linear microphone array with 3" spacing between elements, at a distance of 7' from the mouth simulator. To understand the performance of the algorithm, we computed the impulse responses from the mouth simulator to each of the four microphone elements by playing two minutes of white noise through the mouth simulator and correlating the received waveform with the transmitted white noise sequence. Without changing the room, ambient noise was collected (by turning on fans and computers) on the same array using the same system configuration. For reference, reverberated noisy speech was also collected by playing a clean female speech signal through the mouth simulator. Finally, synthesized noisy speech was obtained by convolving the clean female speech signal with the impulse responses and adding the real room noise. Simulations were run using both the noisy speech and the synthetic speech, with no difference observed when the SNR was the same. Therefore, by acquiring real ambient noise from the same setup and room, we can realistically simulate (1) while being able to control the signal-to-noise ratio (SNR) and monitor the equalized room impulse response.

We used a 4-channel, 256-subband structure with only one tap in each subband adaptive filter. The results are shown in Figure 4. The equalized impulse response from our proposed approach is more impulsive than the equalized response from the delay-and-sum beamformer. Potentially more significant is the number of zeros in the spectrum of the equalized delay-and-sum impulse response that have been removed by the processing presented here.
The spectrum of the equalized impulse response from our proposed approach is considerably flatter in the important 0.5 kHz to 4 kHz region, compared to that of delay-and-sum.

Figure 4. Results for Experiment 1. Compare the equalized impulse response for a delay-and-sum beamformer to that of our proposed approach in (a) the time domain (the ideal result would be an impulse), and (b) the frequency domain.

3.2 Experiment 2

To test the ability of our proposed algorithm to equalize longer reverberation, we simulate four impulse responses as white noise under a decaying exponential. A 4-channel, 512-subband filter with one tap per band was used. Using these impulse responses, we generate a received signal using the same female speaker and noise segments from Experiment 1. The result of this processing is shown in Figure 5. Listening to the processed waveform, it is possible to hear a dramatic reduction in reverberation after about 5 seconds of adaptation. Figure 5 also shows that most of the spectral details of the original signal are recovered with our algorithm.

Figure 5. Results for Experiment 2. Compare the equalized impulse response for a delay-and-sum beamformer to that of our proposed approach in (a) the time domain (the ideal result would be an impulse), and (b) the frequency domain. The three voiced-speech spectrograms (darker is more intense) in (c) are: original (left), delay-and-sum (center), and our proposed approach (right); the horizontal time window is 1 s, and the vertical range is 0-4 kHz. Note the better spectral definition obtained with the proposed algorithm.
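Since the experiments above use a single complex tap per subband, update (9) reduces to a scalar recursion in each band. The sketch below, our own illustration rather than the paper's system, shows that structure with a plain FFT filterbank standing in for the MCLT (no overlap, no LP analysis/synthesis), with the feedback of (6) computed sample by sample and white noise as a stand-in input; the frame size, seed, and variable names are arbitrary.

```python
import numpy as np

def feedback(y, Ey2, Ey4, beta=0.99):
    """Sample-wise feedback f(n) of Eq. (6); also returns updated moments."""
    f = np.empty_like(y)
    for i, v in enumerate(y):
        Ey2 = beta * Ey2 + (1 - beta) * v * v
        Ey4 = beta * Ey4 + (1 - beta) * v ** 4
        f[i] = 4.0 * (Ey2 * v * v - Ey4) * v / max(Ey2 ** 3, 1e-12)
    return f, Ey2, Ey4

M = 256                                    # number of subbands per frame
mu = 4e-4                                  # learning gain from Sec. 2.4
H = np.ones(M, dtype=complex)              # H_c(s,0): pass-through in every band
X_prev = np.zeros(M, dtype=complex)        # previous frame's input spectrum
F_prev = np.zeros(M, dtype=complex)        # previous frame's feedback spectrum
Ey2, Ey4 = 1.0, 3.0

rng = np.random.default_rng(1)
for _ in range(20):                        # frame loop, white noise as input
    X = np.fft.fft(rng.standard_normal(M))     # analysis (FFT in place of MCLT)
    y = np.fft.ifft(H * X).real                # one-tap filtering per subband
    f, Ey2, Ey4 = feedback(y, Ey2, Ey4)
    F = np.fft.fft(f)                          # feedback spectrum
    H = H + mu * F_prev * np.conj(X_prev)      # Eq. (9): one-frame delay
    H *= np.sqrt(M) / np.linalg.norm(H)        # constant-norm gain constraint
    X_prev, F_prev = X, F
```

The one-frame delay appears as the use of F_prev and X_prev in the update, matching the delayed terms F(s, m-1) and X̃_c*(s, m-1) in (9).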
4. SUMMARY

In this paper we presented a new approach to dereverberate speech. Our approach is based on the principle that the LP residual of reverberated speech (specifically voiced speech) is a time-spread version of the impulse-like LP residual of clean speech. We have shown that a kurtosis metric is effective in measuring reverberation. Computing the gradient of this metric with respect to the deconvolution filters is relatively easy. This yields a final form for the adaptive filters that is simple and LMS-like.

For improved performance, we used a dual-filter structure to avoid LP reconstruction artifacts, and a subband filtering structure based on the MCLT. In that way, convergence is achieved within a few seconds, and the computational complexity is not much higher than that of a standard LMS adaptive filter.

We validated the performance of our system on real-world data, from both stationary and moving (not presented here) sources. It has also been validated on artificially difficult reverberation, with significantly better results than delay-and-sum beamforming.

5. REFERENCES

[1] W. Jiang and H. Malvar, "Adaptive Noise Reduction of Speech Signals," Microsoft Research Technical Report MSR-TR-2000-86, July 2000.
[2] J. L. Flanagan et al., "Computer-steered microphone arrays for sound transduction in large rooms," J. Acoust. Soc. Am., 78(11), pp. 1508-1518, Nov. 1985.
[3] H. Wang and F. Itakura, "An approach to dereverberation using multi-microphone sub-band envelope estimation," Proc. ICASSP, pp. 953-956, 1991.
[4] J. Gonzalez-Rodriguez, J. L. Sanchez-Bote, and J. Ortega-Garcia, "Speech dereverberation and noise reduction with a combined microphone array approach," Proc. ICASSP, pp. 953-956, 2000.
[5] M. Brandstein, "On the use of explicit speech modeling in microphone array applications," Proc. ICASSP, pp. 3613-3616, 1998.
[6] B. Yegnanarayana and P. Satyanarayana Murthy, "Enhancement of reverberant speech using LP residual signal," IEEE Trans. Speech Audio Processing, 8(3), pp. 267-281, May 2000.
[7] R. A. Wiggins, "Minimum entropy deconvolution," Geoexploration, 16, pp. 21-35, 1978.
[8] M. K. Broadhead and L. A. Pflug, "Performance of some sparseness criterion blind deconvolution methods in the presence of noise," J. Acoust. Soc. Am., 107(2), pp. 885-893, Feb. 2000.
[9] S. Haykin, Adaptive Filter Theory. New Jersey: Prentice-Hall, 1996.
[10] O. Tanrikulu and A. G. Constantinides, "Least-mean kurtosis: a novel higher-order statistics based adaptive filtering algorithm," Electronics Letters, 30(3), pp. 189-190, Feb. 1994.
[11] H. Malvar, "A modulated complex lapped transform and its application to audio processing," Proc. ICASSP, pp. 1421-1424, 1999.