2023, arXiv (Cornell University)
Accurate estimation of the Room Impulse Response (RIR), which captures an environment's acoustic properties, is important for speech processing and AR/VR applications. We propose AV-RIR, a novel multi-modal multi-task learning approach that accurately estimates the RIR from a given reverberant speech signal and the visual cues of its corresponding environment. AV-RIR builds on a novel neural-codec-based architecture that effectively captures environment geometry and material properties, and solves speech dereverberation as an auxiliary task via multi-task learning. We also propose Geo-Mat features, which augment visual cues with material information, and CRIP, which improves the late reverberation components of the estimated RIR via image-to-RIR retrieval by 86%. Empirical results show that AV-RIR quantitatively outperforms previous audio-only and visual-only approaches, achieving 36%-63% improvement across various acoustic metrics in RIR estimation. It also achieves higher preference scores in human evaluation. As an auxiliary benefit, dereverberated speech from AV-RIR shows competitive performance with the state of the art in various spoken language processing tasks and outperforms prior work on the reverberation-time error score in the real-world AVSpeech dataset. Qualitative examples of both synthesized reverberant speech and enhanced speech are available online.
arXiv (Cornell University), 2022
We propose to characterize and improve the performance of blind room impulse response (RIR) estimation systems in the context of a downstream application scenario, far-field automatic speech recognition (ASR). We first draw the connection between improved RIR estimation and improved ASR performance, as a means of evaluating neural RIR estimators. We then propose a GAN-based architecture that encodes RIR features from reverberant speech and constructs an RIR from the encoded features, and uses a novel energy decay relief loss to optimize for capturing energy-based properties of the input reverberant speech. We show that our model outperforms the state-of-the-art baselines on acoustic benchmarks (by 72% on the energy decay relief and 22% on an early-reflection energy metric), as well as in an ASR evaluation task (by 6.9% in word error rate).
ArXiv, 2021
Measuring the acoustic characteristics of a space is often done by capturing its impulse response (IR), a representation of how a full-range stimulus sound excites it. Recording these IRs is both time-intensive and expensive, and often infeasible for inaccessible locations. This is the first work that generates an IR from a single image, which we call Image2Reverb. The generated IR is then applied to other signals using convolution, simulating the reverberant characteristics of the space shown in the image. We use an end-to-end neural network architecture to generate plausible audio impulse responses from single images of acoustic environments. We evaluate our method both by comparison to ground-truth data and by human expert evaluation, and demonstrate our approach by generating plausible impulse responses from diverse settings and formats, including well-known places, musical halls, rooms in paintings, images from animations and computer games, synthetic environments generated from text, pan...
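The "applied to other signals using convolution" step above is a standard operation and can be sketched in a few lines. This is a minimal illustration with SciPy, not the paper's code; the `apply_reverb` helper, the trimming and peak-normalization choices, and the toy unit-impulse IR are our own assumptions.

```python
import numpy as np
from scipy.signal import fftconvolve

def apply_reverb(dry, ir):
    """Convolve a dry signal with a room impulse response (IR).

    Full convolution has length len(dry) + len(ir) - 1; here the
    result is trimmed to the dry signal's length and peak-normalized.
    """
    wet = fftconvolve(dry, ir)[: len(dry)]
    peak = np.max(np.abs(wet))
    return wet / peak if peak > 0 else wet

# Toy check: a unit-impulse "IR" leaves the signal unchanged
# (up to normalization), since convolution with a delta is identity.
fs = 16000
dry = np.sin(2 * np.pi * 440 * np.arange(fs) / fs)  # 1 s, 440 Hz tone
ir = np.zeros(1000)
ir[0] = 1.0
wet = apply_reverb(dry, ir)
```

A measured or generated IR would simply replace the toy `ir` array; the same call then renders the dry signal as if played in that room.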
2013 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2013
Recently, substantial progress has been made in the field of reverberant speech signal processing, including both single- and multichannel dereverberation techniques and automatic speech recognition (ASR) techniques robust to reverberation. To evaluate state-of-the-art algorithms and obtain new insights regarding potential future research directions, we propose a common evaluation framework, including datasets, tasks, and evaluation metrics, for both speech enhancement and ASR techniques. The proposed framework will be used as a common basis for the REVERB (REverberant Voice Enhancement and Recognition Benchmark) challenge. This paper describes the rationale behind the challenge and provides a detailed description of the evaluation framework and benchmark results.
6th International Conference on Spoken Language Processing (ICSLP 2000)
In this paper, we discuss the use of artificial room reverberation to increase the performance of automatic speech recognition (ASR) systems in reverberant enclosures. Our approach consists of training acoustic models on artificially reverberated speech material. To obtain the desired reverberated speech training database, we propose a reverberating filter whose impulse response is designed to match two high-level acoustic properties of the target reverberant operating environment, namely the early-to-late energy ratio and the reverberation time. Speech recognition experiments in simulated reverberant environments show that recognizers trained on speech reverberated with the proposed method outperform systems trained on clean speech, even when channel normalization methods like CMS and logRASTA-PLP are used. The extension of our approach to multi-style training is also considered.
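Matching a filter to a target early-to-late energy ratio and reverberation time presupposes being able to measure both quantities from an impulse response. A minimal sketch of the standard measurements (our own illustration, not the paper's code): the energy decay curve via Schroeder backward integration, a T60 estimate from the -5 to -25 dB span of that curve, and the early-to-late ratio from a 50 ms split point. The synthetic exponentially decaying noise IR at the end is an assumption for demonstration only.

```python
import numpy as np

def schroeder_edc(ir):
    """Energy decay curve via Schroeder backward integration, in dB re total energy."""
    energy = np.cumsum(ir[::-1] ** 2)[::-1]
    return 10 * np.log10(energy / energy[0])

def estimate_t60(ir, fs):
    """Estimate T60 by a line fit to the -5..-25 dB span of the EDC (T20 method)."""
    edc = schroeder_edc(ir)
    t = np.arange(len(ir)) / fs
    mask = (edc <= -5) & (edc >= -25)
    slope, _ = np.polyfit(t[mask], edc[mask], 1)  # decay rate in dB/s
    return -60.0 / slope                          # time to decay by 60 dB

def early_to_late_ratio(ir, fs, split_ms=50.0):
    """Early-to-late energy ratio in dB (C50 for the default 50 ms split)."""
    n = int(fs * split_ms / 1000)
    return 10 * np.log10(np.sum(ir[:n] ** 2) / np.sum(ir[n:] ** 2))

# Synthetic IR: white noise with an exponential envelope chosen so the
# energy decays at exactly -60/t60_true dB per second.
fs, t60_true = 16000, 0.5
t = np.arange(int(fs * t60_true * 1.5)) / fs
rng = np.random.default_rng(0)
ir = rng.standard_normal(len(t)) * 10 ** (-3 * t / t60_true)
```

Running `estimate_t60(ir, fs)` on this synthetic IR recovers a value close to the 0.5 s used to construct it, which is a quick sanity check for the implementation.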
arXiv (Cornell University), 2018
Speech recognition in highly reverberant real environments remains a major challenge, and an evaluation dataset for this task is needed. This report describes the generation of the Highly-Reverberant Real Environment database (HRRE). This database contains 13.4 hours of data recorded in real reverberant environments and consists of 20 different testing conditions covering a wide range of reverberation times and speaker-to-microphone distances. These evaluation sets were generated by re-recording the clean test set of the Aurora-4 database at five loudspeaker-microphone distances in each of four reverberant conditions.
Speech Communication, 2015
This paper presents a practical technique for automatic speech recognition (ASR) in multiple reverberant environments based on multi-model selection. Multiple ASR models are trained with artificial synthetic room impulse responses (IRs), i.e. simulated room IRs, with different model reverberation times T60, and tested on real room IRs with varying room T60.
EURASIP Journal on Audio, Speech, and Music Processing, 2021
This paper presents a new dataset of measured multichannel room impulse responses (RIRs) named dEchorate. It includes annotations of early echo timings and 3D positions of microphones, real sources, and image sources under different wall configurations in a cuboid room. These data provide a tool for benchmarking recent methods in echo-aware speech enhancement, room geometry estimation, RIR estimation, acoustic echo retrieval, microphone calibration, echo labeling, and reflector position estimation. The dataset is provided with software utilities to easily access, manipulate, and visualize the data as well as baseline methods for echo-related tasks.
This paper presents extended techniques aiming at the improvement of automatic speech recognition (ASR) in single-channel scenarios in the context of the REVERB (REverberant Voice Enhancement and Recognition Benchmark) challenge. The focus is laid on the development and analysis of ASR front-end technologies covering speech enhancement and feature extraction. Speech enhancement is performed using a joint noise reduction and dereverberation system in the spectral domain based on estimates of the noise and late reverberation power spectral densities (PSDs). To obtain reliable estimates of the PSDs, even in acoustic conditions with positive direct-to-reverberation energy ratios (DRRs), we adopt a statistical model of the room impulse response that explicitly incorporates the DRR, in combination with a novel joint estimator for the reverberation time T60 and the DRR. The feature extraction approach is inspired by processing strategies of the auditory system, where an amplitude modulation filterbank is applied to extract temporal modulation information. These techniques were shown to improve the REVERB baseline in our previous work. Here, we investigate whether similar improvements are obtained when using a state-of-the-art ASR framework, and to what extent the results depend on the specific architecture of the back-end. Apart from conventional Gaussian mixture model (GMM)-hidden Markov model (HMM) back-ends, we consider subspace GMM (SGMM)-HMMs as well as deep neural networks in a hybrid system. The speech enhancement algorithm is found to be helpful in almost all conditions, with the exception of deep learning systems in matched training-test conditions. The auditory feature type improves the baseline for all system architectures. The relative word error rate reduction achieved by combining our front-end techniques with current back-ends is 52.7% on average on the REVERB evaluation test set, compared to our original REVERB result.
The recently introduced REVERB challenge includes a reverberant speech recognition task. We focus on state-of-the-art ASR techniques such as discriminative training and various feature transformations, including Gaussian mixture models, subspace Gaussian mixture models, and deep neural networks, in addition to the proposed single-channel dereverberation method with reverberation time estimation and multi-channel beamforming that enhances the direct sound relative to the reflected sound. In addition, because the best-performing system differs from environment to environment, we apply a system combination approach using different features and different types of systems to handle the various environments in the challenge. Moreover, we use our discriminative training technique for system combination, which improves the combination by making the systems complementary. Experiments show the effectiveness of these approaches, reaching 6.76% and 18.60% word error rate on the REVERB simulated and real test sets, respectively, which are 68.8% and 61.5% relative improvements over the baseline.
arXiv (Cornell University), 2018
This paper evaluates the robustness of a DNN-HMM-based speech recognition system in highly reverberant real environments using the HRRE database. The performance of locally-normalized filter bank (LNFB) and Mel filter bank (MelFB) features in combination with Non-negative Matrix Factorization (NMF), Suppression of Slowly-varying components and the Falling edge (SSF), and Weighted Prediction Error (WPE) enhancement methods is discussed and evaluated. Two training conditions were considered: clean and reverberated (Reverb). With Reverb training, WPE with LNFB provides WERs that are on average 3% and 20% lower than SSF and NMF, respectively, and WPE with MelFB provides WERs that are on average 11% and 24% lower than SSF and NMF, respectively. With clean training, which represents a significant mismatch between testing and training conditions, LNFB features clearly outperform MelFB features. The results show that different types of training, parametrization, and enhancement techniques may work better for a specific combination of speaker-microphone distance and reverberation time. This suggests that there could be some degree of complementarity between systems trained with different enhancement and parametrization methods.
IPSN-14 Proceedings of the 13th International Symposium on Information Processing in Sensor Networks, 2014
IEEE/ACM Transactions on Audio, Speech, and Language Processing
arXiv (Cornell University), 2021
2018 26th European Signal Processing Conference (EUSIPCO), 2018
Interspeech 2021, 2021
ArXiv, 2019
2014 14th International Workshop on Acoustic Signal Enhancement (IWAENC), 2014
EURASIP Journal on Advances in Signal Processing, 2015
2018 52nd Asilomar Conference on Signals, Systems, and Computers, 2018
Lecture Notes in Computer Science, 2005
2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 2013
The Journal of the Acoustical Society of America, 2016