BINAURAL ANGULAR SEPARATION NETWORK
Yang Yang, George Sung, Shao-Fu Shih, Hakan Erdogan, Chehung Lee, Matthias Grundmann
Google LLC, U.S.A.
{yanghm,gsung,shaofu,hakanerdogan,chehunglee,grundman}@google.com
ABSTRACT
We propose a neural network model that can separate target speech sources from interfering sources at different angular regions using two microphones. The model is trained with simulated room impulse responses (RIRs) using omnidirectional microphones, without needing to collect real RIRs. By relying on specific angular regions and multiple room simulations, the model utilizes consistent time difference of arrival (TDOA) cues, or what we call delay contrast, to separate target and interference sources while remaining robust in various reverberation environments. We demonstrate that the model not only generalizes to a commercially available device with a slightly different microphone geometry, but also outperforms our previous work, which uses one additional microphone on the same device. The model runs in real time on-device and is suitable for low-latency streaming applications such as telephony and video conferencing.

Index Terms— Multi-channel audio separation, deep neural networks, spatial separation, speech separation, speech enhancement

1. INTRODUCTION

Audio source separation is used in many applications, from voice communication to human-computer interfaces. Previous works on multi-channel speech separation and enhancement focused on using spatial and spectral cues to separate sources arbitrarily mixed with each other. Works such as deep beamforming networks [1, 2], neural beamforming [3, 4, 5, 6, 7], or other direct multi-channel separation methods [8, 9] focus on a more general separation problem and typically do not assume any information about the locations of the sources.

To achieve audio separation, one method is to utilize spatial cues from multiple microphones using what is known as beamforming. Due to the nature of linear processing, conventional beamforming performance depends heavily on the number of microphones and is limited in terms of suppression of interference and enhancement of target signals [10]. In addition, it is not easy to control the angle ranges in beamforming, since main-lobe width and side-lobe levels typically vary with frequency. Beamformer modules have been used in neural beamforming [5, 6, 7] to linearly re-estimate initially separated targets using multi-microphone data, or in GSENet [11] to provide magnitude contrast to a neural network for further refined separation of the target source within a mixture.

There have been works relying explicitly on the locations of sources, such as location-guided separation [12], which aims to separate a source at any possible angle from a mixture with a known microphone geometry, specifically using a circular array. Another related work is distance-based separation [13], which is designed to separate sources within a designated distance from a single microphone. That network used a single microphone and relied on impulse response characteristics for separation rather than on inter-microphone cues. In recent works, angular positions of sources are used to order individual speech sources [14] to avoid Permutation-Invariant Training (PIT) for speech separation. Neural Spectro-Spatial Filtering [15] performs separation of target and interference signals either through location-based ordering similar to [14] or by assuming a single location for a target speech source with multiple possible locations for a noise source. In [16], the authors propose a region-based separation method to separate in-car audio into rectangular regions using a 3-mic linear array. Due to the region shapes, sources do not produce consistent TDOA cues, which can make it harder for the network to separate regional signals.

In this paper, we propose a new end-to-end paradigm, including simulator design, model training, and on-device inference, for two-microphone angular-region-based source separation applications. We name our model BASNet, short for Binaural Angular Separation Network1. The model assumes that the target and interference sources are located within specific angle ranges. This assumption allows the network to implicitly focus on inter-microphone phase differences (IPD) or time difference of arrival (TDOA) information to separate the sources in a reliable manner. TDOA cues remain consistent throughout training because of the fixed target and interference angle ranges, which makes it easy for the network to rely on this information to perform separation. We train the network using room simulations based on the image method. The benefit compared to other methods comes from the network's capability to seamlessly combine spectro-spatial information over all microphones and frequencies, as well as its robustness to the reverberation environment through extensive simulated training. The trained network generalizes well to real-world data recorded in a lab. Additionally, the training process does not require on-device data collection, which is another advantage over previous methods such as GSENet [11]. We show that our method achieves better separation performance than previous neural beamforming methods. In particular, using two microphones with our method provides a significant performance gain over using a single microphone, unlike previous methods such as Sequential Neural Beamforming [5]. In contrast to Location-Based Training [14], which orders output speakers according to their angles, our proposal is based on a fixed range of angles for target and interference and aims to separate target speech from both speech and non-speech interference. Unlike [15], which used simulations with measured or simulated RIRs for evaluation, we evaluate our method on real recorded examples.

1 Audio samples are available at google-research.github.io/seanet/basnet/
Fig. 1: RIR simulation setup for target and interference sources. Target signal sources are confined to [−θ, +θ] and [180° − θ, 180° + θ]; interference sources are confined to [90° − ϕ, 180° − ϕ] and [180° + ϕ, 360° − ϕ]. The noise source can come from any of the 360° directions. The distance between the two microphones is denoted as d.

2. METHOD

2.1. Data Pipeline and Training

To provide the model with spatial cues by contrasting two audio inputs, we design an input simulator that generates room impulse responses (RIRs) to synthesize the two input channels.

The simulator generates rooms with different geometries. For each room, two microphone locations with distance d are randomly sampled, where d follows a predefined distribution. Based on the locations of the microphones, the space is divided into three disjoint regions, as illustrated in Fig. 1: the target signal region consists of points in 3-dimensional space whose position vector2 makes an angle of less than θ with the 0° plane defined by the two microphones. The 0° plane goes through the midpoint of the two microphones and is orthogonal to the line connecting them. The interference signal region contains points whose position vectors are at least ϕ degrees from the 0° plane. Four signal sources are created: two target speech sources are randomly sampled in the target signal region, one interference speech source is sampled in the interference signal region, and a noise source is sampled randomly, unconstrained by the two regions. Delay contrast occurs because the direct-path far-field responses of sources in the target and interference regions consistently fall into distinct ranges of TDOAs (or relative delays), and the network can rely on these cues for separation.
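To make the delay contrast concrete: under the far-field assumption invoked above, the TDOA between the two microphones for a source at angle α from the 0° plane is approximately τ = d · sin(α) / c. The short sketch below is our illustration rather than part of the original pipeline; the speed of sound c ≈ 343 m/s is an assumption on our part, and the 16 kHz rate matches the sample rate used later in the paper. It evaluates the TDOA ranges implied by the geometry parameters in Table 1:

```python
import math

C = 343.0            # assumed speed of sound (m/s)
SAMPLE_RATE = 16000  # Hz, sample rate used in the paper
THETA = 30.0         # target region half-angle from the 0-degree plane (deg)
PHI = 60.0           # minimum interference angle from the 0-degree plane (deg)

def tdoa_samples(mic_distance_m: float, angle_deg: float) -> float:
    """Far-field TDOA (in samples) between two mics for a source at
    `angle_deg` away from the 0-degree plane."""
    tau = mic_distance_m * math.sin(math.radians(angle_deg)) / C
    return tau * SAMPLE_RATE

for d in (0.09, 0.11):  # microphone spacing range from Table 1
    target_max = tdoa_samples(d, THETA)
    interf_min = tdoa_samples(d, PHI)
    interf_max = tdoa_samples(d, 90.0)
    print(f"d={d:.2f} m: target |TDOA| <= {target_max:.2f} samples, "
          f"interference |TDOA| in [{interf_min:.2f}, {interf_max:.2f}] samples")
```

Under these assumptions the target and interference TDOA ranges never overlap (roughly within ±2.1–2.6 samples versus beyond ±3.6 samples at 16 kHz), which is exactly the cue the network can learn to exploit.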
With the room geometry, the 2 microphone locations, and the 4 signal source locations determined, a 4 × 2 RIR matrix3 {r(k,j)}0≤k≤3,0≤j≤1 is created using the image method [17]. The raw audio captures from the two microphones are synthesized following the equations below4:

y0 = s1 ∗ r(0,0) + s2 ∗ r(1,0) + i ∗ r(2,0) + gn · n ∗ r(3,0),
y1 = s1 ∗ r(0,1) + s2 ∗ r(1,1) + i ∗ r(2,1) + gn · n ∗ r(3,1),

where s1, s2, and i are utterances from a speech dataset, and n comes from a noise dataset. With probability p1, the utterance s2 is set to empty; with probability p2, the utterance i is set to empty. The introduction of p1 and p2 ensures that the model can handle both single and multiple target speakers as the separation target, and both the presence and absence of interference speech. To add variation to the signal strengths of the different components, the average power of each of the four components is controlled by normalizing and scaling the signal to follow a sampled dB value, denoted as {gk}0≤k≤3. A global power normalization and scaling is then applied to set the final output power to gglobal. The exact numerical configuration of the data pipeline is reported in Table 1. The ground-truth signal for model training is the non-reverberated version of the input without the noise and interference sources, derived following the equation below5:

t = s1 ∗ anechoic(r(0,0)) + s2 ∗ anechoic(r(1,0)).

Table 1: Data pipeline parameter setup.

Type               Configuration
Geometry           θ = 30°, ϕ = 60°, d ∼ Uniform[0.09 m, 0.11 m]
Signal synthesis   p1 = 0.8, p2 = 0.6,
                   g0 ∼ N(0, 0), g1 ∼ N(−3, 3), g2 ∼ N(−3, 3),
                   g3 ∼ N(−5, 10), gglobal ∼ N(−10, 5)

2 The origin of the 3D Cartesian space is defined as the midpoint between the microphones.
3 The RIRs are normalized so that the magnitude of the largest peak, over all receivers, for each source, is 1.
4 ∗ denotes convolution.
5 anechoic(·) denotes the anechoic version of the RIR, which only contains the strongest path.
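To make the synthesis described above concrete, here is a minimal sketch of the mixing step. It is our illustration rather than the actual training code: the `rirs` array is an assumed input, the second parameter of N(·, ·) is treated as a standard deviation, all sources are assumed pre-trimmed to a common length, and the normalization details only loosely follow the paper's description.

```python
import numpy as np
from scipy.signal import fftconvolve

rng = np.random.default_rng(0)
P1, P2 = 0.8, 0.6  # keep-probabilities for s2 and i (Table 1)

def scale_to_db(x, target_db):
    """Normalize to unit RMS, then scale so the average power is target_db (dB)."""
    rms = np.sqrt(np.mean(x ** 2) + 1e-12)
    return x / rms * 10.0 ** (target_db / 20.0)

def anechoic(rir):
    """Keep only the strongest path of an RIR (footnote 5)."""
    out = np.zeros_like(rir)
    peak = np.argmax(np.abs(rir))
    out[peak] = rir[peak]
    return out

def synthesize_example(s1, s2, i, n, rirs):
    """s1, s2, i, n: 1-D sources of equal length; rirs: (4, 2, rir_len) array
    for sources (s1, s2, i, n) x 2 microphones."""
    if rng.random() > P1:
        s2 = np.zeros_like(s2)      # drop the second target with probability 1 - p1
    if rng.random() > P2:
        i = np.zeros_like(i)        # drop the interference with probability 1 - p2
    gains_db = [rng.normal(0, 0), rng.normal(-3, 3),
                rng.normal(-3, 3), rng.normal(-5, 10)]
    srcs = [scale_to_db(x, g) for x, g in zip((s1, s2, i, n), gains_db)]
    # mixture at the two microphones
    y = np.stack([sum(fftconvolve(s, rirs[k, j]) for k, s in enumerate(srcs))
                  for j in range(2)])
    # one global scaling shared by both channels keeps inter-microphone cues intact
    y = scale_to_db(y, rng.normal(-10, 5))
    # ground truth: targets convolved with the anechoic RIRs of microphone 0
    t = (fftconvolve(srcs[0], anechoic(rirs[0, 0]))
         + fftconvolve(srcs[1], anechoic(rirs[1, 0])))
    return y, t
```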
2.2. Model Architecture

The model uses a convolutional U-Net with an architecture identical to that of GSENet (see Fig. 2 in [11] for details). The inputs to the network are the STFTs of the two raw microphone signals, packed into real and imaginary channels, and the output is the STFT of the reconstructed waveform, which is converted back to a waveform by an inverse STFT. The input STFT and output inverse STFT have a window size (equal to the FFT size) of 320 and a step size of 160. A single-scale STFT reconstruction loss [18] with a window size of 1024 and a step size of 256 is applied to the reconstructed waveform.
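For reference, one plausible form of this single-scale loss is sketched below. We assume an L1 penalty on linear and log magnitudes in the spirit of the spectral loss of [18], restricted to the single 1024/256 scale; the exact terms and weighting used for training are not specified here.

```python
import numpy as np
from scipy.signal import stft

def single_scale_stft_loss(estimate, reference, fs=16000,
                           win=1024, hop=256, eps=1e-7):
    """L1 distance between STFT magnitudes (and log-magnitudes) of two waveforms."""
    _, _, s_est = stft(estimate, fs=fs, nperseg=win, noverlap=win - hop)
    _, _, s_ref = stft(reference, fs=fs, nperseg=win, noverlap=win - hop)
    mag_est, mag_ref = np.abs(s_est), np.abs(s_ref)
    lin = np.mean(np.abs(mag_est - mag_ref))
    log = np.mean(np.abs(np.log(mag_est + eps) - np.log(mag_ref + eps)))
    return lin + log
```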
The network is fully causal, with the coarsest temporal resolution in the U-Net limited to 2 times that of the input. At inference time, the network can be applied in a streaming fashion with a latency of 20 ms (320 samples at 16 kHz) using the streamable library in [19].
2.3. Real-Time Inference

Since the network architecture is identical to GSENet [11], the proposed method inherits the same real-time inference capability (for details see Section 2.3 in [11]), with an average latency of 31.81 ms when profiled on a single CPU core of a Pixel 6 phone with the XNNPACK backend [20].

3. EXPERIMENTS

3.1. Training and Evaluation Dataset

Both the target speech s1, s2 and the interference speech i are sampled from a combination of LibriVox [21] and internal speech datasets. The background noise n is sampled from the Freesound [22] dataset. For all experiments, all files are resampled to a 16 kHz sampling rate.

For evaluation, we use multi-channel audio collected from a Google Pixel Tablet [23] in an ETSI-certified listening room, as shown in Fig. 2. The tablet is docked on the speaker dock and secured on the table, and is referred to as the device under test (DUT). The DUT is equipped with 3 microphones: two of them on the top edge with 0.07 m symmetrical spacing to the center, and one on the right edge. A head-and-torso simulator (HATS) is placed at 0° in front of the DUT to simulate the target speech source. For directional evaluation, 8 individual loudspeakers are placed around the device at 45° increments to represent ambient noise or interfering speech sources. Note that the 0° loudspeaker is placed behind the HATS. For each loudspeaker, DEMAND noise [24] data and a subset of the VCTK speech [25] data are played and recorded from the DUT. The loudspeaker and HATS recordings are done independently and later mixed under various mixture and SNR conditions for model evaluation. To measure the directivity pattern of the processed results, we record another set of audio without the HATS, so that the 0° loudspeaker is unobstructed.

Fig. 2: Listening room setup.

3.2. Evaluation Methods

The comparison baselines use all 3 microphones. MCWF is a DSP-based beamformer based on the linear multi-channel Wiener filter introduced in [5], where a small amount of recorded data is used to derive the beamformer weights. More specifically, the signal covariance matrix is derived from the HATS recording, while the noise covariance matrix is derived from the loudspeaker recordings excluding the 0° and 180° directions (refer to Section 3.2 in [11] for details).

Additionally, we compare against two other baselines: SSENet [11], a single-channel speech enhancement network applied after the beamformer; and GSENet [11], a speech enhancement network that takes both the beamformer output and a raw microphone stream as input.

GSENet leverages the magnitude contrast between its two inputs: the 3-channel beamformed output as the target speech input, and one of the raw microphones as the comparison input. GSENet operates on the assumption that, under the MCWF model, the target beamformed speech from 0° is distortionless while signals from other interference directions are attenuated. BASNet, in contrast, uses only the two symmetrically placed top microphones as raw inputs, and relies mostly on delay contrast, i.e., the consistent time difference of arrival (TDOA) cues, to separate the speech signal at the target angle from the interference angles.
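For readers unfamiliar with the MCWF baseline described above, the sketch below gives one standard per-frequency-bin formulation of a multi-channel Wiener filter computed from measured covariance matrices. It is our illustrative rendering rather than the exact implementation of [5] or [11]; the reference-microphone choice and the diagonal loading are assumptions.

```python
import numpy as np

def mcwf_weights(phi_speech, phi_noise, ref_mic=0, diag_load=1e-6):
    """Per-frequency multi-channel Wiener filter weights.

    phi_speech, phi_noise: (F, M, M) spatial covariance matrices estimated
    from the HATS (speech) and loudspeaker (noise) recordings.
    Returns w of shape (F, M); the beamformed STFT is w^H applied to the mixture.
    """
    F, M, _ = phi_speech.shape
    w = np.zeros((F, M), dtype=complex)
    eye = np.eye(M)
    for f in range(F):
        phi_y = phi_speech[f] + phi_noise[f] + diag_load * eye
        # MWF: w = Phi_y^{-1} Phi_s e_ref, estimating the speech at the reference mic
        w[f] = np.linalg.solve(phi_y, phi_speech[f][:, ref_mic])
    return w

def apply_beamformer(w, stft_mix):
    """stft_mix: (F, T, M) multi-channel STFT -> (F, T) beamformed STFT."""
    return np.einsum('fm,ftm->ft', np.conj(w), stft_mix)
```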
3.3. Evaluation Results

The evaluations are done with respect to two criteria: enhancement and steerability. To evaluate speech enhancement effectiveness, the setup is configured to have the target signal consistently present from the HATS at 0°, while the interference is arbitrarily assigned to a combination of the 8 loudspeakers. For steerability, we use the recordings collected without the HATS to evaluate interference-only performance, and showcase the model's capability to steer to different spatial directions by introducing artificial latency to one of its inputs.

3.3.1. Speech Enhancement with Directional Interference

In Table 2, we evaluate the scenario in which target speech is played from the HATS, and report the BSS-SDR [26] of the model output while interference is played from each of the 8 loudspeakers.
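As a pointer for reproducing the metric, BSS-SDR values of the kind reported in Table 2 can be computed with the mir_eval toolkit [26]; the snippet below is a minimal single-source sketch we added for reference, with placeholder file names.

```python
import numpy as np
import mir_eval

# reference: clean target speech recorded from the HATS (1-D, 16 kHz)
# estimate:  model output for the corresponding mixture, same length
reference = np.load('hats_target.npy')   # placeholder file names
estimate = np.load('model_output.npy')

sdr, sir, sar, _ = mir_eval.separation.bss_eval_sources(
    reference[np.newaxis, :], estimate[np.newaxis, :])
print(f'BSS-SDR: {sdr[0]:.1f} dB')
```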
Table 2: BSS-SDR (dB) [26] of the raw and enhanced speech waveforms with interference coming from different angles, for speech (VCTK) and noise (DEMAND) as interference, at 0 dB and 6 dB interference SNR.

Interference at 0 dB SNR, speech (VCTK) as interference:
Interference angle   0°    45°   90°   135°  180°  225°  270°  315°  avg.
BF (MCWF) [5]        0.5   2.3   3.0   1.8   0.2   1.7   2.2   1.9   1.7
BF + SSENet [11]     0.6   2.4   3.1   1.9   0.2   1.9   2.5   2.1   1.8
BF + GSENet [11]     0.8   7.7   10.5  9.0   0.2   6.1   9.8   9.1   6.7
BASNet (ours)        3.9   11.3  13.1  11.7  -0.1  8.5   12.4  12.0  9.1

Interference at 0 dB SNR, noise (DEMAND) as interference:
Interference angle   0°    45°   90°   135°  180°  225°  270°  315°  avg.
BF (MCWF) [5]        4.1   6.3   5.9   4.4   3.2   4.8   5.7   5.0   4.9
BF + SSENet [11]     10.3  12.6  12.1  10.7  9.6   11.1  12.0  11.3  11.2
BF + GSENet [11]     9.6   13.0  12.7  11.3  8.8   11.8  12.6  11.8  11.4
BASNet (ours)        11.6  14.8  14.0  12.7  10.7  12.5  13.9  13.3  12.9

Interference at 6 dB SNR, speech (VCTK) as interference:
Interference angle   0°    45°   90°   135°  180°  225°  270°  315°  avg.
BF (MCWF) [5]        6.3   7.9   8.6   7.5   6.0   7.4   7.9   7.5   7.4
BF + SSENet [11]     6.3   8.0   8.6   7.5   5.9   7.5   8.0   7.6   7.4
BF + GSENet [11]     6.4   11.4  13.1  12.0  5.9   10.3  12.6  12.0  10.5
BASNet (ours)        8.2   15.2  16.5  15.3  5.9   12.3  15.7  15.3  13.1

Interference at 6 dB SNR, noise (DEMAND) as interference:
Interference angle   0°    45°   90°   135°  180°  225°  270°  315°  avg.
BF (MCWF) [5]        9.5   11.4  11.1  9.8   8.7   10.1  10.9  10.2  10.2
BF + SSENet [11]     13.4  14.7  14.5  13.6  12.8  13.8  14.4  14.0  13.9
BF + GSENet [11]     12.6  14.9  14.8  13.9  11.7  14.2  14.7  14.2  13.9
BASNet (ours)        15.5  18.0  17.4  16.2  14.8  16.2  17.2  16.8  16.5

When the interference is speech, the beamformer (BF) + SSENet performs similarly to the BF alone. Given that SSENet only has access to a mono audio channel, this is the expected behavior, as SSENet is not capable of separating the target speech from the interference speech based on location. In contrast, BF + GSENet delivers an additional 3 dB gain on average over BF + SSENet. To our surprise, BASNet, with only two microphone inputs, delivers another 2.4 dB gain over BF + GSENet, which utilizes three microphones.
When the interference is noise, BF + SSENet and BF + GSENet achieve similar performance, and the BSS-SDR values are close to uniform across all noise directions. In contrast, BASNet outperforms both by 1.5 dB at 0 dB SNR and 2.6 dB at 6 dB SNR, and generally performs better at the 90° and 270° directions, where the delay contrasts from the two microphones are the largest. This demonstrates the model's capability to utilize delay (or TDOA) information to achieve better denoising performance.

3.3.2. Steerable Directivity

In Table 3, we report the reduction in signal energy with only interference signals played from each one of the 8 loudspeakers, without the presence of the HATS, to measure the directivity pattern of the model. We observe that, compared to BF + GSENet, which achieves a wider rejection region (rejection happens not just at 90° and 270°), BASNet achieves much stronger (> 40 dB) rejection at 90° and 270°.

Table 3: Signal energy suppression (dB) when there is only one speech source coming from different angles.
Angle              0°    45°   90°   135°  180°  225°  270°  315°
BF (MCWF) [5]      1.4   3.3   4.1   2.6   2.0   2.8   2.6   2.0
BF + SSENet [11]   1.6   3.7   4.4   2.9   2.1   3.1   2.9   2.3
BF + GSENet [11]   1.6   10.0  18.2  7.3   2.2   16.9  16.0  3.0
BASNet (ours)      0.6   1.0   46.4  0.5   0.0   0.4   44.1  1.6
We postulate that BASNet preserves signals for which there is a small relative delay between the two inputs; therefore, we should be able to steer the direction of its directivity pattern by introducing an artificial delay to one of its inputs. We verify this hypothesis, as shown in Fig. 3: by introducing sample offsets, BASNet can separate speech components from different directions. The ability to steer the focus of the model to different spatial regions allows the model to be dynamically adapted to target speakers using visual cues or manual inputs.

Fig. 3: Directivity patterns with different sample offsets: (a) −4 samples, (b) −2 samples, (c) +2 samples, (d) +4 samples, (e) 0 samples.
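The steering experiment described above amounts to shifting one of the two input channels by an integer number of samples before running the model. The sketch below is our illustration of that idea; `basnet_infer` is a hypothetical placeholder, not a released API.

```python
import numpy as np

def steer_inputs(mic0, mic1, offset_samples):
    """Apply an artificial integer-sample delay to one channel.

    A positive offset delays mic1 relative to mic0 (and vice versa), which
    shifts the region of small inter-channel delay, and hence the preserved
    directions, away from the physical 0-degree plane.
    """
    if offset_samples >= 0:
        mic1 = np.concatenate([np.zeros(offset_samples), mic1])[:len(mic1)]
    else:
        mic0 = np.concatenate([np.zeros(-offset_samples), mic0])[:len(mic0)]
    return mic0, mic1

# Example: reproduce the offsets of Fig. 3 with a hypothetical model call.
# for offset in (-4, -2, 0, +2, +4):
#     x0, x1 = steer_inputs(mic0, mic1, offset)
#     separated = basnet_infer(x0, x1)   # placeholder for the streaming model
```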
4. CONCLUSION

In this work, we propose a model that takes two audio channels as input and relies on the delay contrast between the two to preserve target speech and suppress interfering sources. We show that, on a real device, it achieves state-of-the-art speech enhancement in the case of directional interference. We further demonstrate the steerability of its directivity pattern, which allows the same model to be used to adapt to different target spatial regions. For future work, we plan to explore how to utilize more than 2 microphone inputs, and how to combine magnitude contrast [11] and delay contrast to achieve even stronger enhancement and separation performance.
5. REFERENCES

[1] Xiong Xiao, Shinji Watanabe, Hakan Erdogan, Liang Lu, John Hershey, Michael L. Seltzer, Guoguo Chen, Yu Zhang, Michael Mandel, and Dong Yu, "Deep beamforming networks for multi-channel speech recognition," in ICASSP 2016.
[2] Andong Li, Wenzhe Liu, Chengshi Zheng, and Xiaodong Li, "Embedding and Beamforming: All-Neural Causal Beamformer for Multichannel Speech Enhancement," in ICASSP 2022.
[3] Jahn Heymann, Lukas Drude, and Reinhold Haeb-Umbach, "Neural network based spectral mask estimation for acoustic beamforming," in ICASSP 2016.
[4] Hakan Erdogan, John R. Hershey, Shinji Watanabe, Michael I. Mandel, and Jonathan Le Roux, "Improved MVDR Beamforming Using Single-Channel Mask Prediction Networks," in Interspeech 2016.
[5] Zhong-Qiu Wang, Hakan Erdogan, Scott Wisdom, Kevin Wilson, Desh Raj, Shinji Watanabe, Zhuo Chen, and John R. Hershey, "Sequential multi-frame neural beamforming for speech separation and enhancement," in 2021 IEEE Spoken Language Technology Workshop (SLT).
[6] Yong Xu, Zhuohuang Zhang, Meng Yu, Shi-Xiong Zhang, and Dong Yu, "Generalized spatio-temporal RNN beamformer for target speech separation," in Interspeech 2021.
[7] Zhong-Qiu Wang, Samuele Cornell, Shukjae Choi, Younglo Lee, Byeong-Yeol Kim, and Shinji Watanabe, "TF-GridNet: Integrating full- and sub-band modeling for speech separation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2023.
[8] Yi Luo, Cong Han, Nima Mesgarani, Enea Ceolini, and Shih-Chii Liu, "FaSNet: Low-latency adaptive beamforming for multi-microphone audio processing," in ASRU 2019.
[9] Takuya Yoshioka, Xiaofei Wang, Dongmei Wang, Min Tang, Zirun Zhu, Zhuo Chen, and Naoyuki Kanda, "VarArray: Array-geometry-agnostic continuous speech separation," in ICASSP 2022.
[10] Jacob Benesty, Jingdong Chen, and Yiteng Huang, Microphone Array Signal Processing, vol. 1, Springer Science & Business Media, 2008.
[11] Yang Yang, Shao-Fu Shih, Hakan Erdogan, Jamie Menjay Lin, Chehung Lee, Yunpeng Li, George Sung, and Matthias Grundmann, "Guided Speech Enhancement Network," in ICASSP 2023.
[12] Zhuo Chen, Xiong Xiao, Takuya Yoshioka, Hakan Erdogan, Jinyu Li, and Yifan Gong, "Multi-Channel Overlapped Speech Recognition with Location Guided Speech Extraction Network," in 2018 IEEE Spoken Language Technology Workshop (SLT).
[13] Katharine Patterson, Kevin Wilson, Scott Wisdom, and John R. Hershey, "Distance-Based Sound Separation," in INTERSPEECH 2022.
[14] Hassan Taherian, Ke Tan, and DeLiang Wang, "Location-Based Training for Multi-Channel Talker-Independent Speaker Separation," in ICASSP 2022.
[15] Ke Tan, Zhong-Qiu Wang, and DeLiang Wang, "Neural Spectrospatial Filtering," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 30, pp. 605–621, 2022.
[16] Julian Wechsler, Srikanth Raj Chetupalli, Wolfgang Mack, and Emanuël A. P. Habets, "Multi-Microphone Speaker Separation by Spatial Regions," in ICASSP 2023, pp. 1–5.
[17] Jont B. Allen and David A. Berkley, "Image method for efficiently simulating small-room acoustics," The Journal of the Acoustical Society of America, vol. 65, no. 4, 1979.
[18] Jesse Engel, Lamtharn (Hanoi) Hantrakul, Chenjie Gu, and Adam Roberts, "DDSP: Differentiable Digital Signal Processing," in ICLR 2020.
[19] Oleg Rybakov, Natasha Kononenko, Niranjan Subrahmanya, Mirkó Visontai, and Stella Laurenzo, "Streaming Keyword Spotting on Mobile Devices," in INTERSPEECH 2020.
[20] Marat Dukhan and the XNNPACK team, "XNNPACK," https://github.com/google/XNNPACK, Accessed: 2023-08-30.
[21] "LibriVox - free public domain audio books," https://librivox.org/, Accessed: 2023-09-02.
[22] "Freesound," https://freesound.org/, Accessed: 2023-09-02.
[23] "Google Pixel Tablet," https://store.google.com/product/pixel_tablet_specs, Accessed: 2023-09-02.
[24] Joachim Thiemann, Nobutaka Ito, and Emmanuel Vincent, "The Diverse Environments Multi-channel Acoustic Noise Database (DEMAND): A database of multichannel environmental noise recordings," Proceedings of Meetings on Acoustics, 2013.
[25] Junichi Yamagishi, Christophe Veaux, and Kirsten MacDonald, "CSTR VCTK Corpus: English Multi-speaker Corpus for CSTR Voice Cloning Toolkit (version 0.92)," University of Edinburgh, 2019.
[26] Colin Raffel, Brian McFee, Eric Humphrey, Justin Salamon, Oriol Nieto, Dawen Liang, and Daniel Ellis, "mir_eval: A Transparent Implementation of Common MIR Metrics," in ISMIR 2014.