Papers by Ken'ichi Kumatani

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021
Hypothesis-level combination between multiple models can often yield gains in speech recognition. However, all models in the ensemble are usually restricted to use the same audio segmentation times. This paper proposes to generalise hypothesis-level combination, allowing the use of different audio segmentation times between the models, by splitting and re-joining the hypothesised N-best lists in time. A hypothesis tree method is also proposed to distribute hypothesis posteriors among the constituent words, to facilitate such splitting when per-word scores are not available. The approach is assessed on a Microsoft meeting transcription task, by performing combination between a streaming first-pass recognition and an offline second-pass recognition. The experimental results show that the proposed approach can yield gains when combining over different segmentation times. Furthermore, the results also show that a combination between a hybrid model and an end-to-end neural network model ...
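
As an illustration of the splitting step, the hedged sketch below (not the paper's implementation) spreads a hypothesis-level log posterior uniformly over its timed words and then splits the hypothesis at a boundary time; the paper's hypothesis tree method distributes the posteriors more carefully, and the data layout here is purely hypothetical.

```python
# Minimal sketch: split a timed N-best hypothesis at a boundary time.
# Hypothetical data layout: a hypothesis is a list of (word, start, end)
# plus a hypothesis-level log posterior; per-word scores are unavailable.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Hypothesis:
    words: List[Tuple[str, float, float]]  # (word, start_time, end_time)
    log_posterior: float

def split_hypothesis(hyp: Hypothesis, boundary: float):
    """Split at `boundary`, distributing the posterior uniformly over words."""
    per_word = hyp.log_posterior / max(len(hyp.words), 1)
    left = [w for w in hyp.words if w[2] <= boundary]   # words ending before the boundary
    right = [w for w in hyp.words if w[2] > boundary]
    return (Hypothesis(left, per_word * len(left)),
            Hypothesis(right, per_word * len(right)))
```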

Traditionally, speech recognizers have used a strictly Bayesian paradigm for finding the best hypothesis from amongst all possible hypotheses for the data at hand that is to be recognized. In fact, the Bayes classification rule has been shown to be optimal when the class distributions represent the true distributions of the data to be classified. In reality, however, this condition is not satisfied: the classifier itself is trained on some training data and may be deployed to recognize data that are different from the training data. The use of entropy as an optimization criterion for various classification tasks has been well established in the literature. In our work, we show that free energy, a thermodynamic concept directly related to entropy, can also be used as an objective criterion in classification. Furthermore, we show how this novel classification scheme can be used in the framework of existing Bayesian classification schemes implemented in current recognizers by simply modifying the class distributions a priori.

2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
This paper presents a novel deep neural network (DNN) architecture with highway blocks (HWs) using a complex discrete Fourier transform (DFT) feature for keyword spotting. In our previous work, we showed that the feed-forward DNN with a time-delayed bottleneck layer (TDB-DNN) directly trained from the audio input outperformed the model with the log-mel filter bank energy (LFBE) feature, given a large amount of training data [1]. However, the deeper structure of such an audio-input DNN makes the optimization problem more difficult, and training can easily fall into poor local minima. In order to alleviate the problem, we propose a new HW network with a time-delayed bottleneck layer (TDB-HW). Our TDB-HW networks can learn a bottleneck feature representation through optimization based on the cross-entropy criterion without the stage-wise training proposed in [1]. Moreover, we use the complex DFT feature as a method of pre-processing. Our experimental results on real data show that the TDB-HW network with the complex DFT feature provides significantly lower miss rates for a range of false alarm rates than the LFBE DNN, yielding approximately 20% relative improvement in the area under the curve (AUC) of the detection error tradeoff (DET) curves for keyword spotting. Furthermore, we investigate the effects of different pre-processing methods for the deep highway network.
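
The building block named above is the highway layer, which mixes a non-linear transformation of the input with the input itself through a learned gate. The PyTorch sketch below shows only this generic highway transform; it is not the paper's TDB-HW architecture, and the layer size and activation are placeholders.

```python
import torch
import torch.nn as nn

class HighwayBlock(nn.Module):
    """Generic highway block: y = g * H(x) + (1 - g) * x, with a learned gate g."""
    def __init__(self, dim: int):
        super().__init__()
        self.transform = nn.Linear(dim, dim)
        self.gate = nn.Linear(dim, dim)

    def forward(self, x):
        h = torch.relu(self.transform(x))   # candidate transformation H(x)
        g = torch.sigmoid(self.gate(x))     # transform gate in (0, 1)
        return g * h + (1.0 - g) * x        # carry the input through where g is small
```

Because the carry path passes the input through unchanged, gradients flow more easily through deep stacks of such blocks, which is the usual motivation for preferring them over plain feed-forward layers in very deep networks.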

2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)
In this work, we develop a technique for training features directly from the single-channel speech waveform in order to improve wake word (WW) detection performance. Conventional speech recognition systems typically extract a compact feature representation based on prior knowledge such as log-mel filter bank energy (LFBE). Such a feature is then used for training a deep neural network (DNN) acoustic model (AM). In contrast, we directly train the WW DNN AM from the single-channel audio data in a stage-wise manner. We first build a feature extraction DNN with a small hidden bottleneck layer, and train this bottleneck feature representation using the same multi-task cross-entropy objective function as we use to train our WW DNNs. Then, the WW classification DNN is trained with input bottleneck features, keeping the feature extraction layers fixed. Finally, the feature extraction and classification DNNs are combined and then jointly optimized. We show the effectiveness of this stage-wise training technique through a set of experiments on real beamformed far-field data. The experimental results show that the audio-input DNN provides significantly lower miss rates for a range of false alarm rates than the LFBE system when a sufficient amount of training data is available, yielding approximately 12% relative improvement in the area under the curve (AUC).
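
The three training stages described above can be outlined in a few lines of PyTorch. The sketch below is only a hedged outline under assumed module shapes and an empty placeholder data loader; the separate output head used to pre-train the bottleneck extractor in stage 1 is omitted for brevity.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for the paper's networks; sizes are placeholders.
feature_dnn = nn.Sequential(nn.Linear(400, 512), nn.ReLU(),
                            nn.Linear(512, 64))           # small bottleneck layer
classifier = nn.Sequential(nn.ReLU(), nn.Linear(64, 2))   # wake word vs. background

def train(params, loader, epochs=1):
    """Placeholder cross-entropy training loop over (x, y) mini-batches."""
    opt = torch.optim.Adam(params)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss_fn(classifier(feature_dnn(x)), y).backward()
            opt.step()

loader = []  # placeholder for the real mini-batch iterator

# Stage 1 (not shown): train feature_dnn with its own output head on the
# multi-task cross-entropy objective to learn the bottleneck representation.
# Stage 2: train the classifier on bottleneck features, extractor frozen.
for p in feature_dnn.parameters():
    p.requires_grad = False
train(classifier.parameters(), loader)
# Stage 3: unfreeze everything and jointly fine-tune both networks.
for p in feature_dnn.parameters():
    p.requires_grad = True
train(list(feature_dnn.parameters()) + list(classifier.parameters()), loader)
```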

Interspeech 2020
In this work, we develop new self-learning techniques with an attention-based sequence-to-sequence (seq2seq) model for automatic speech recognition (ASR). For untranscribed speech data, the hypothesis from an ASR system must be used as a label. However, imperfect ASR results make it difficult for unsupervised learning to consistently improve recognition performance, especially when multiple powerful teacher models are unavailable. In contrast to conventional unsupervised learning approaches, we adopt the multi-task learning (MTL) framework where the n-th best ASR hypothesis is used as the label of each task. The seq2seq network is updated through the MTL framework so as to find the common representation that can cover multiple hypotheses. By doing so, the effect of the hard-decision errors can be alleviated. We first demonstrate the effectiveness of our self-learning methods through ASR experiments on an accent adaptation task between US and British English speech. Our experimental results show that our method can reduce the WER on the British speech data from 14.55% to 10.36% compared to the baseline model trained with the US English data only. Moreover, we investigate the effect of our proposed methods in a federated learning scenario.
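
The MTL objective described above amounts to a weighted sum of per-hypothesis sequence losses. The sketch below is a hedged illustration only; the signature of model(x, y) and the choice of weights (here, normalized hypothesis posteriors from the seed ASR system) are assumptions, not the paper's exact formulation.

```python
import torch

def nbest_mtl_loss(model, x, nbest, weights):
    """Weighted multi-task loss over the N-best hypotheses of one utterance.

    model(x, y) is assumed to return the sequence loss (e.g., label-smoothed
    cross-entropy) of hypothesis y given the acoustic input x; weights could
    be the normalized hypothesis posteriors of the seed recognizer.
    """
    losses = [model(x, y) for y in nbest]            # one task per hypothesis
    return sum(w * l for w, l in zip(weights, losses))
```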

ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 1, 2019
The use of spatial information with multiple microphones can improve far-field automatic speech recognition (ASR) accuracy. However, conventional microphone array techniques degrade speech enhancement performance when there is an array geometry mismatch between design and test conditions. Moreover, such speech enhancement techniques do not always yield ASR accuracy improvement due to the difference between speech enhancement and ASR optimization objectives. In this work, we propose to unify an acoustic model framework by optimizing spatial filtering and long short-term memory (LSTM) layers from multi-channel (MC) input. Our acoustic model subsumes beamformers with multiple types of array geometry. In contrast to deep clustering methods that treat a neural network as a black box tool, the network encoding the spatial filters can process streaming audio data in real time without the accumulation of target signal statistics. We demonstrate the effectiveness of such MC neural networks through ASR experiments on real-world far-field data. We show that our two-channel acoustic model can on average reduce word error rates (WERs) by 13.4% and 12.7% compared to a single-channel ASR system with the log-mel filter bank energy (LFBE) feature under the matched and mismatched microphone placement conditions, respectively. Our results also show that our two-channel network achieves a relative WER reduction of over 7.0% compared to conventional beamforming with seven microphones overall.
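
The elementary operation a spatial filtering layer performs is a frequency-dependent, complex-weighted sum over channels of the multi-channel short-time spectrum. The NumPy sketch below shows only that operation as a fixed filter; in the acoustic model described above the weights would be trainable parameters, and the array shapes here are assumptions.

```python
import numpy as np

def spatial_filter(stft, weights):
    """Apply per-frequency complex beamforming weights to a multi-channel STFT.

    stft:    complex array of shape (channels, frames, freq_bins)
    weights: complex array of shape (channels, freq_bins)
    returns: single-channel output of shape (frames, freq_bins)
    """
    # y(t, f) = sum_m conj(w_m(f)) * x_m(t, f)
    return np.einsum('mf,mtf->tf', np.conj(weights), stft)
```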

ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 1, 2019
Conventional far-field automatic speech recognition (ASR) systems typically employ microphone array techniques for speech enhancement in order to improve robustness against noise or reverberation. However, such speech enhancement techniques do not always yield ASR accuracy improvement because the optimization criterion for speech enhancement is not directly relevant to the ASR objective. In this work, we develop new acoustic modeling techniques that optimize spatial filtering and long short-term memory (LSTM) layers from multi-channel (MC) input based on an ASR criterion directly. In contrast to conventional methods, we incorporate array processing knowledge into the acoustic model. Moreover, we initialize the network with beamformers' coefficients. We investigate the effects of such MC neural networks through ASR experiments on real-world far-field data where users are interacting with an ASR system in uncontrolled acoustic environments. We show that our MC acoustic model can reduce the word error rate (WER) by 16.5% on average compared to a single-channel ASR system with the traditional log-mel filter bank energy (LFBE) feature. Our results also show that our network with the spatial filtering layer on two-channel input achieves a relative WER reduction of 9.5% compared to conventional beamforming with seven microphones.
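
One common choice of beamformer coefficients for initializing a spatial filtering layer is the far-field delay-and-sum beamformer. The sketch below computes such weights from assumed microphone coordinates and a look direction; the abstract does not specify which beamformer design was actually used, so this is an illustration only.

```python
import numpy as np

def delay_and_sum_weights(mic_xyz, look_dir, freqs, c=343.0):
    """Far-field delay-and-sum weights w_m(f) = exp(-j*2*pi*f*tau_m) / M.

    mic_xyz:  (M, 3) microphone coordinates in metres
    look_dir: unit vector pointing towards the source
    freqs:    (F,) analysis frequencies in Hz
    """
    taus = mic_xyz @ look_dir / c                  # per-microphone delays in seconds
    phase = -2j * np.pi * np.outer(taus, freqs)    # (M, F) phase terms
    return np.exp(phase) / len(mic_xyz)
```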

2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015
Traditionally, speech recognizers have used a strictly Bayesian paradigm for finding the best hypothesis from amongst all possible hypotheses for the data to be recognized. The Bayes classification rule has been shown to be optimal when the class distributions represent the true distributions of the data to be classified. In reality, however, this condition is often not satisfied: the classifier itself is trained on some training data and may be deployed to classify data whose statistical characteristics are different from the training data. The Bayes classification rule may result in suboptimal performance under these conditions of mismatch. Classification may benefit from the use of modified classification rules in this case. The use of entropy as an optimization criterion for various classification tasks has been well established in the literature. In this paper we show that free energy, a thermodynamic concept directly related to entropy, can also be used as an objective criterion in classification. Furthermore, we show how this novel classification scheme can be used in the framework of existing Bayesian classification schemes implemented in current speech recognizers by simply modifying the class distributions a priori. Pilot experiments show that minimization of free energy results in more accurate recognition under conditions of mismatch.
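
One way to read the free-energy criterion (a hedged sketch under our own reading, not necessarily the paper's exact formulation) is as a temperature-smoothed score over the per-hypothesis log scores of a class: F = -T log sum_i exp(s_i / T), which tends to the usual best-hypothesis score as T approaches 0, so the standard Bayes/Viterbi decision appears as the zero-temperature special case.

```python
import numpy as np
from scipy.special import logsumexp

def free_energy(log_scores, T=1.0):
    """Free energy F = -T * log sum_i exp(log_scores_i / T).

    log_scores are the per-hypothesis (or per-path) log joint scores of one
    class; as T -> 0 this tends to -max(log_scores), i.e. the usual best-path
    score, so the conventional decision rule is recovered as a special case.
    """
    return -T * logsumexp(np.asarray(log_scores) / T)

# Hypothetical use: pick the class with the minimum free energy instead of
# the maximum best-path score.
```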

Adaptation techniques for speech recognition are very effective in single-speaker scenarios. However, when distant microphones capture overlapping speech from multiple speakers, conventional speaker adaptation methods are less effective. The putative signal for any speaker contains interference from other speakers. Consequently, any adaptation technique adapts the model to the interfering speakers as well, which leads to degradation of recognition performance for the desired speaker. In this work, we develop a new feature-space adaptation method for overlapping speech. We first build a beamformer to enhance speech from each active speaker. After that, we compute speech feature vectors from the output of each beamformer. We then jointly transform the feature vectors from all speakers to maximize the likelihood of their respective acoustic models. Experiments run on the speech separation challenge data collected under the AMI project demonstrate the effectiveness of our adaptation method. An absolute word error rate (WER) reduction of up to 14% was achieved in the case of delay-and-sum beamforming. With minimum mutual information (MMI) beamforming, our adaptation method achieved a WER of 31.5%. To the best of our knowledge, this is the lowest WER reported on this task.
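
The elementary operation behind such feature-space adaptation is an affine transform of the feature vectors scored by the likelihood of the speaker's acoustic model. The sketch below shows that scoring for one speaker against a diagonal-covariance GMM; the joint optimization over all overlapping speakers described above, and the log-determinant Jacobian term used in constrained MLLR-style objectives, are deliberately not shown.

```python
import numpy as np
from scipy.special import logsumexp

def gmm_loglik(x, means, variances, weights):
    """Log-likelihood of frames x (N, D) under a diagonal-covariance GMM."""
    # per-frame, per-component Gaussian log densities, shape (N, K)
    ll = -0.5 * (np.log(2 * np.pi * variances).sum(axis=1)
                 + ((x[:, None, :] - means) ** 2 / variances).sum(axis=2))
    return logsumexp(ll + np.log(weights), axis=1).sum()

def adapted_loglik(A, b, feats, gmm):
    """Score an affine feature-space transform x -> A x + b for one speaker.

    gmm is a (means, variances, weights) tuple.  Constrained MLLR-style
    objectives also add an N * log|det A| term, omitted here for brevity.
    """
    return gmm_loglik(feats @ A.T + b, *gmm)
```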

2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 2013
In prior work, we investigated the application of a spherical microphone array to a distant speech recognition task. In that work, the relative positions of a fixed loudspeaker and the spherical array required for beamforming were measured with an optical tracking device. In the present work, we investigate how these relative positions can be determined automatically for real, human speakers based solely on acoustic evidence. We first derive an expression for the complex pressure field of a plane wave scattering from a rigid sphere. We then use this theoretical field as the predicted observation in an extended Kalman filter whose state is the speaker's current position, the direction of arrival of the plane wave. By minimizing the squared error between the predicted pressure field and that actually recorded, we are able to infer the position of the speaker.
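
The tracking step can be outlined with a generic extended Kalman filter. In the hedged sketch below, h_measure stands in for the theoretical pressure field of a plane wave scattered by the rigid sphere (in practice the complex pressures would be stacked into real and imaginary parts); the motion model, noise covariances, and the numerical Jacobian are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def numerical_jacobian(func, x, eps=1e-6):
    """Forward-difference Jacobian of func at x (both 1-D arrays)."""
    y0 = np.atleast_1d(func(x))
    J = np.zeros((y0.size, x.size))
    for i in range(x.size):
        dx = np.zeros_like(x)
        dx[i] = eps
        J[:, i] = (np.atleast_1d(func(x + dx)) - y0) / eps
    return J

def ekf_step(x, P, z, f_predict, h_measure, Q, R):
    """One extended Kalman filter predict/update step.

    x, P: state (e.g. the speaker's direction of arrival) and its covariance
    z:    observation vector (here, the recorded pressure field)
    """
    x_pred = f_predict(x)                               # state prediction
    F = numerical_jacobian(f_predict, x)
    P_pred = F @ P @ F.T + Q
    H = numerical_jacobian(h_measure, x_pred)           # linearized observation model
    S = H @ P_pred @ H.T + R                            # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)                 # Kalman gain
    x_new = x_pred + K @ (z - h_measure(x_pred))        # correct with the innovation
    P_new = (np.eye(x.size) - K @ H) @ P_pred
    return x_new, P_new
```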

This paper proposes a multiple discrete continuous nested extreme value (MDCNEV) model to analyze household expenditures for transportation-related items in relation to a host of other consumption categories. The model system presented in this paper is capable of providing a comprehensive assessment of how household consumption patterns (including savings) would be impacted by increases in fuel prices or any other household expense. The MDCNEV model presented in this paper is estimated on disaggregate consumption data from the 2002 Consumer Expenditure Survey of the United States. Model estimation results show that a host of household and personal socioeconomic, demographic, and location variables affect the proportion of monetary resources that households allocate to various consumption categories. Sensitivity analysis conducted using the model demonstrates the applicability of the model for quantifying consumption adjustment patterns in response to rising fuel prices. It is found that households adjust their food consumption, vehicular purchases, and savings rates in the short run. In the long term, adjustments are also made to housing choices (expenses), calling for the need to ensure that fuel price effects are adequately reflected in integrated microsimulation models of land use and travel.

This paper presents our approach for automatic speech recognition (ASR) of overlapping speech. Our system consists of two principal components: a speech separation component and a feature estimation component. In the speech separation phase, we first estimate each speaker's position; the speaker location information is then used in a GSC-configured beamformer with a minimum mutual information (MMI) criterion, followed by a Zelinski and binary-masking postfilter, to separate the speech of the different speakers. In the feature estimation phase, neural networks are trained to learn the mapping from the features extracted from the pre-separated speech to those extracted from the close-talking microphone speech signal. The outputs of the neural networks are then used to generate acoustic features, which are subsequently used in acoustic model adaptation and system evaluation. The proposed approach is evaluated through ASR experiments on the PASCAL Speech Separation Challenge II (SSC2) corpus. We demonstrate that our system provides large improvements in recognition accuracy compared with a single distant microphone, and that the performance of the ASR system can be significantly improved through both the MMI beamforming and the feature mapping approaches.
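
The GSC structure referred to above can be summarized in a few lines: the output is a distortionless quiescent path minus a blocking-matrix/active-weight path. The sketch below shows that computation for one subband; the MMI adaptation of the active weight vector and the Zelinski/binary-masking postfilters are not shown, and the steering vector is assumed to be given.

```python
import numpy as np
from scipy.linalg import null_space

def gsc_output(X, d, w_a):
    """Generalized sidelobe canceller output for one frequency bin.

    X:   (M, T) multi-channel subband snapshots
    d:   (M,)   steering vector towards the desired speaker
    w_a: (M-1,) active weight vector (adapted under an MMI criterion in the paper)
    """
    w_q = d / (np.conj(d) @ d)            # distortionless quiescent weights
    B = null_space(np.conj(d)[None, :])   # (M, M-1) blocking matrix with B^H d = 0
    w = w_q - B @ w_a                     # overall GSC weight vector
    return np.conj(w) @ X                 # (T,) beamformed output
```
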
Speech recognition experiments (measured in word error rate) show that the beamforming techniques developed here also successfully suppress interfering sources in reverberant environments, a clear advantage over conventional methods.
We present a new filter bank design method for subband adaptive beamforming. Filter bank design for adaptive filtering poses many problems not encountered in more traditional applications such as subband coding of speech or music. The popular class of perfect reconstruction filter banks is not well suited for applications involving adaptive filtering, because perfect reconstruction is achieved through alias cancellation, which functions correctly only if the outputs of the individual subbands are not subject to arbitrary magnitude scaling and phase shifts.

In this work, we show how the speech recognition performance in a noisy car environment can be improved by combining audiovisual voice activity detection (VAD) with microphone array processing techniques. This is accomplished by enhancing the multi-channel audio signal in the speaker localization step through per-channel power spectral subtraction, whose noise estimates are obtained from the non-speech segments identified by the VAD. This noise reduction step improves the accuracy of the estimated speaker positions and thereby the quality of the beamformed signal in the subsequent array processing step. Audiovisual voice activity detection has the advantage of being more robust in acoustically demanding environments. This claim is substantiated through speech recognition experiments on the AVICAR corpus, where the proposed localization framework gave a WER of 7.1% in combination with delay-and-sum beamforming. This compares to a WER of 8.9% for speaker localization with audio-only VAD, 11.6% without VAD, and 15.6% for a single distant channel.
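
The per-channel power spectral subtraction step can be sketched as follows, with the noise power spectrum estimated from the frames the VAD labels as non-speech. The spectral floor value and the data layout are assumptions for illustration, not the paper's exact settings.

```python
import numpy as np

def spectral_subtraction(stft, vad, floor=0.01):
    """VAD-driven power spectral subtraction for one channel.

    stft: (T, F) complex spectrogram of one channel
    vad:  (T,) boolean, True for speech frames (here, from the audiovisual VAD)
    """
    power = np.abs(stft) ** 2
    noise_psd = power[~vad].mean(axis=0)                         # noise estimate from non-speech frames
    clean_power = np.maximum(power - noise_psd, floor * power)   # subtract with a spectral floor
    gain = np.sqrt(clean_power / np.maximum(power, 1e-12))
    return gain * stft                                           # keep the noisy phase
```
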
In standard microphone array processing for distant speech recognition, the beamformed output is postfiltered to reduce residual noise. Postfiltering is usually performed with a Wiener filter whose parameters are estimated from both the beamformer output and the signals captured at the microphones themselves. Conventional postfiltering methods assume diffuse or incoherent noise at the various microphones in order to estimate these parameters. When the noise does not conform to this assumption, they perform poorly. We propose an alternative postfiltering mechanism that attenuates noise by estimating and separating out the contributions of speech and noise explicitly. Experiments on a corpus of in-car two-channel recordings show that the proposed postfiltering algorithm outperforms conventional postfilters significantly under many noise conditions.
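
For context, the conventional Wiener postfilter applies the gain H = phi_ss / (phi_ss + phi_nn) to the beamformer output. The sketch below shows only that final step and takes the speech and noise power spectral density estimates as given, since how they are estimated is precisely what distinguishes the conventional methods from the approach proposed above.

```python
import numpy as np

def wiener_postfilter(beamformed, speech_psd, noise_psd, eps=1e-12):
    """Apply the Wiener gain H = phi_ss / (phi_ss + phi_nn) to the beamformer output.

    beamformed: (T, F) complex STFT of the beamformer output
    speech_psd, noise_psd: (T, F) or (F,) PSD estimates, taken as given here
    """
    gain = speech_psd / (speech_psd + noise_psd + eps)
    return gain * beamformed
```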

This paper presents a new beamforming method for distant speech recognition. In contrast to conventional beamforming techniques, our beamformer adjusts the active weight vectors so as to make the distribution of the beamformer's outputs as super-Gaussian as possible. This is achieved by maximizing the negentropy of the outputs. In our previous work, the generalized Gaussian probability density function (GG-PDF) for real-valued random variables (RVs) was used to model the magnitude of the speech signal, and the subband components were not directly modeled. Accordingly, it could not represent the distribution of the subband signal faithfully. In this work, we use the GG-PDF for complex RVs in order to model the subband components directly. The appropriate amount of data for adapting the active weight vector is also studied. The performance of the beamforming techniques is investigated through a series of automatic speech recognition experiments on the Multi-Channel Wall Street Journal Audio Visual Corpus (MC-WSJ-AV). The data was recorded with real sensors in a real meeting room, and hence contains noise from computers, fans, and other apparatus in the room. The test data is neither artificially convolved with measured impulse responses nor unrealistically mixed with separately recorded noise.
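
One building block of such an objective is the super-Gaussian log-likelihood of the complex subband outputs. The sketch below gives only the data-dependent part of a circularly symmetric generalized Gaussian log-likelihood; the normalization constant, the entropy terms of the full negentropy criterion, and the scale and shape values are omitted or chosen as placeholders, so this is an illustration rather than the paper's model.

```python
import numpy as np

def gg_log_likelihood(y, alpha=1.0, p=0.5):
    """Data-dependent part of a circular generalized Gaussian log-likelihood.

    y:     complex subband samples (e.g., beamformer outputs in one bin)
    alpha: scale parameter, p: shape parameter; p < 2 gives a super-Gaussian
    (heavy-tailed) model, while p = 2 reduces to the complex Gaussian.
    """
    return -np.sum((np.abs(y) / alpha) ** p)
```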

This paper presents new superdirective beamforming algorithms based on the maximum negentropy (MN) criterion for distant automatic speech recognition. The MN beamformer is configured in the generalized sidelobe canceler structure, and uses the weights derived from a delay-and-sum beamformer as the quiescent weight vector. While satisfying the distortionless constraint in the look direction, it adjusts the active weight vector to make the output maximally super-Gaussian. The current paper proposes to use the weights of a superdirective beamformer as the quiescent vector, which results in improved directivity and noise suppression at lower frequencies. We demonstrate the effectiveness of our approach through far-field speech recognition experiments on the Multi-Channel Wall Street Journal Audio Visual Corpus (MC-WSJ-AV). The technique proposed in the current paper reduces the word error rate (WER) by 56% relative to a single distant microphone baseline, which is a 14% reduction in WER over the original MN beamformer formulation.
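
A standard way to obtain superdirective weights of the kind proposed as the quiescent vector is the MVDR solution under a spherically isotropic (diffuse) noise field, with a small diagonal loading term for robustness. The sketch below computes such weights for one frequency from assumed microphone coordinates and a look direction; the geometry and the loading value are placeholders, not the paper's configuration.

```python
import numpy as np

def superdirective_weights(mic_xyz, look_dir, freq, c=343.0, loading=1e-2):
    """Superdirective (diffuse-noise MVDR) weights for one frequency.

    w = Gamma^-1 d / (d^H Gamma^-1 d), where Gamma is the diffuse-field
    coherence matrix and d the far-field steering vector.
    """
    M = len(mic_xyz)
    dist = np.linalg.norm(mic_xyz[:, None, :] - mic_xyz[None, :, :], axis=2)
    # diffuse coherence: sin(2*pi*f*d/c) / (2*pi*f*d/c); np.sinc(x) = sin(pi*x)/(pi*x)
    gamma = np.sinc(2.0 * freq * dist / c) + loading * np.eye(M)
    taus = mic_xyz @ look_dir / c
    d = np.exp(-2j * np.pi * freq * taus)          # steering vector
    gi_d = np.linalg.solve(gamma, d)
    return gi_d / (np.conj(d) @ gi_d)              # distortionless in the look direction
```
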
This paper presents a new mouth region localization method which uses a Gaussian mixture model (GMM) of feature vectors extracted from mouth region images. Discrete cosine transform (DCT) and principal component analysis (PCA) based feature vectors are evaluated in mouth localization experiments. The new method is suitable for audio-visual speech recognition. This paper also introduces a new database which is available for audio-visual processing. The experimental results show that the proposed system localizes the mouth region with high accuracy (more than 95%), even when the tracking results of preceding frames are unavailable.
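
The basic pipeline described above, DCT features of candidate patches scored by a trained GMM, can be sketched in a few lines. The patch size, number of DCT coefficients, number of mixture components, and the random placeholder patches are all assumptions for illustration; the paper's feature ordering and database are not reproduced here.

```python
import numpy as np
from scipy.fft import dctn
from sklearn.mixture import GaussianMixture

def dct_feature(patch, n_coef=32):
    """Low-order 2-D DCT coefficients of a grey-level candidate patch."""
    coefs = dctn(patch, norm='ortho')
    return coefs.flatten()[:n_coef]        # simple truncation; zig-zag ordering omitted

# Placeholder patches standing in for labelled mouth regions and candidates.
rng = np.random.default_rng(0)
mouth_patches = [rng.random((32, 32)) for _ in range(50)]
candidate_patches = [rng.random((32, 32)) for _ in range(10)]

# Train a GMM on mouth-region features.
train_feats = np.stack([dct_feature(p) for p in mouth_patches])
gmm = GaussianMixture(n_components=8, covariance_type='diag').fit(train_feats)

# Localize: pick the candidate region with the highest GMM log-likelihood.
cand_feats = np.stack([dct_feature(p) for p in candidate_patches])
best = int(np.argmax(gmm.score_samples(cand_feats)))
```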