Papers by Harshavardhan Sundar

ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020
In this paper, we present an end-to-end deep convolutional neural network operating on multi-channel raw audio data to localize multiple simultaneously active acoustic sources in space. Previously reported deep-learning-based approaches work well in localizing a single source directly from multi-channel raw audio, but are not easily extendable to localize multiple sources due to the well-known permutation problem. We propose a novel encoding scheme to represent the spatial coordinates of multiple sources, which facilitates 2D localization of multiple sources in an end-to-end fashion, avoiding the permutation problem and achieving arbitrary spatial resolution. Experiments on a simulated data set and real recordings from the AV16.3 Corpus demonstrate that the proposed method generalizes well to unseen test conditions, and outperforms a recent time difference of arrival (TDOA) based multiple source localization approach reported in the literature.
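The encoding scheme itself is not reproduced in this abstract. As a minimal sketch of the underlying idea, the snippet below represents sources as blobs on a 2D spatial grid, a target that is order-independent and hence permutation-free; the grid size, blob width, and peak threshold are illustrative assumptions, not the paper's actual design.

```python
import numpy as np

def encode_sources(positions, grid_size=32, sigma=1.5):
    """Encode source (x, y) coordinates as Gaussian blobs on a grid.

    The map is order-independent, so it sidesteps the permutation
    problem that per-source regression targets suffer from.
    (Illustrative sketch only; the paper's actual encoding may differ.)
    """
    ys, xs = np.mgrid[0:grid_size, 0:grid_size]
    heat = np.zeros((grid_size, grid_size))
    for (x, y) in positions:
        blob = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
        heat = np.maximum(heat, blob)
    return heat

def decode_sources(heat, threshold=0.5):
    """Recover source coordinates as local maxima above a threshold."""
    peaks = []
    g = heat.shape[0]
    for y in range(g):
        for x in range(g):
            v = heat[y, x]
            if v < threshold:
                continue
            patch = heat[max(0, y - 1):y + 2, max(0, x - 1):x + 2]
            if v >= patch.max():
                peaks.append((x, y))
    return peaks

sources = [(8, 8), (24, 20)]
heat = encode_sources(sources)
print(sorted(decode_sources(heat)))  # → [(8, 8), (24, 20)]
```

A network trained to regress such a map can be decoded the same way, with finer grids giving finer spatial resolution.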
IEEE/ACM Transactions on Audio, Speech, and Language Processing
2016 International Conference on Signal Processing and Communications (SPCOM), 2016

Interspeech 2016, 2016
We address the problem of moving acoustic source localization and automatic camera steering using one-bit measurement of the time-difference of arrival (TDOA) between two microphones in a given array. Given that the camera has a finite field of view (FoV), an algorithm with a coarse estimate of the source location suffices for this purpose. We use a microphone array and develop an algorithm to obtain a coarse estimate of the source location using only one-bit information of the TDOA, its sign, to be precise. One advantage of the one-bit approach is its lower computational complexity, which aids in real-time adaptation and localization of the moving source. We carried out experiments in a reverberant enclosure with a reverberation time (RT60) of 600 ms and analyzed the performance of the proposed approach using a circular microphone array. We report comparisons with a point-source-localization-based automatic camera steering algorithm proposed in the literature. The proposed algorithm turned out to be more accurate in terms of always keeping the moving speaker within the field of view.
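To illustrate how sign-only TDOA information can still yield a coarse direction estimate, the sketch below simulates a far-field source and a circular array: each pairwise sign comparison votes for the microphone that hears the source first, and the winning microphones' direction vectors are summed. The geometry, voting rule, and parameters are assumptions for illustration, not the paper's algorithm.

```python
import numpy as np

def one_bit_doa(mic_angles_deg, source_angle_deg):
    """Coarse DOA from only the SIGNS of pairwise TDOAs.

    Far-field model: the wavefront reaches mic i earlier when
    cos(theta_i - phi) is larger, so the sign of each pairwise TDOA
    says which mic hears the source first -- one bit per pair.
    (Illustrative sketch; not the paper's exact algorithm.)
    """
    th = np.deg2rad(np.asarray(mic_angles_deg, dtype=float))
    phi = np.deg2rad(source_angle_deg)
    advance = np.cos(th - phi)            # arrival-time advance per mic
    # one-bit votes: how many pairwise comparisons each mic "wins"
    wins = np.array([np.sum(advance[i] > advance) for i in range(len(th))])
    # sum mic direction vectors weighted by win counts
    vx = np.sum(wins * np.cos(th))
    vy = np.sum(wins * np.sin(th))
    return np.rad2deg(np.arctan2(vy, vx)) % 360

mics = np.arange(0, 360, 45)              # 8-mic circular array
est = one_bit_doa(mics, 30.0)
print(round(est, 1))                      # coarse, but inside a typical FoV
```

The estimate is deliberately coarse, but that is the point of the abstract's argument: a camera with a finite FoV only needs the right sector, and sign comparisons are far cheaper than full TDOA estimation.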

In this paper, we present a latent variable (LV) framework to identify all the speakers and their keywords given a multi-speaker mixture signal. We introduce two separate LVs to denote active speakers and the keywords uttered. The dependency of a spoken keyword on the speaker is modeled through a conditional probability mass function. The distribution of the mixture signal is expressed in terms of the LV mass functions and speaker-specific-keyword models. The proposed framework admits stochastic models, representing the probability density function of the observation vectors given that a particular speaker uttered a specific keyword, as speaker-specific-keyword models. The LV mass functions are estimated in a maximum likelihood framework using the Expectation Maximization (EM) algorithm. The active speakers and their keywords are detected as modes of the joint distribution of the two LVs. In mixture signals containing two speakers uttering the keywords simultaneously, the proposed ...
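A toy sketch of the latent-variable estimation, with 1-D unit-variance Gaussians standing in for the speaker-specific-keyword models (two speakers, two keywords; all parameters are hypothetical). EM updates only the joint weights, and the modes of the resulting joint distribution flag the active (speaker, keyword) pairs, as in the abstract.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for speaker-specific-keyword models:
# p(x | speaker s, keyword k) = N(x; mu[s, k], 1), 2 speakers x 2 keywords.
mu = np.array([[-6.0, -2.0],
               [ 2.0,  6.0]])

# Mixture frames: speaker 0 utters keyword 0, speaker 1 utters keyword 1.
x = np.concatenate([rng.normal(mu[0, 0], 1.0, 200),
                    rng.normal(mu[1, 1], 1.0, 200)])

def em_joint_weights(x, mu, iters=50):
    """EM over the joint latent-variable weights w[s, k].

    E-step: responsibility of each (speaker, keyword) pair per frame;
    M-step: weights = average responsibility.
    """
    S, K = mu.shape
    w = np.full((S, K), 1.0 / (S * K))
    lik = np.exp(-0.5 * (x[:, None, None] - mu[None]) ** 2)  # N x S x K
    for _ in range(iters):
        r = w[None] * lik
        r /= r.sum(axis=(1, 2), keepdims=True)
        w = r.mean(axis=0)
    return w

w = em_joint_weights(x, mu)
active = [(int(s), int(k)) for s, k in np.argwhere(w > 0.25)]
print(active)  # → [(0, 0), (1, 1)]
```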

2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014
Many state-of-the-art techniques for estimating glottal closure instants (GCIs) use the linear prediction residual (LPR) in one way or another. In this paper, subband analysis of the LPR is proposed to estimate the GCIs. A composite signal is derived as the sum of the envelopes of the subband components of the LPR signal. Appropriately chosen peaks of the composite signal are the GCI candidates. The temporal locations of the candidates are refined using the LPR to obtain the GCIs, which are validated against the GCIs obtained from the electroglottograph signal, recorded simultaneously. The robustness is studied using additive white, babble, and vehicle noises at different signal-to-noise ratios. The proposed method is evaluated using six different databases and compared with three state-of-the-art LPR-based methods. The results show that the performance of the proposed method is comparable to the best of the LPR-based techniques for clean as well as noisy speech.
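A toy sketch of the subband-envelope idea, with a synthetic impulse train standing in for a real LP residual and ideal FFT-mask filters for the subbands. The band edges, noise level, and peak-picking rules below are illustrative assumptions, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)
fs, N = 8000, 800

# Synthetic stand-in for an LP residual: impulses at the "GCIs"
# (hypothetical 100 Hz pitch), plus a little noise.
true_gcis = np.arange(40, N, 80)
lpr = np.zeros(N)
lpr[true_gcis] = 1.0
lpr += 0.001 * rng.standard_normal(N)

def envelope(x):
    """Envelope via the analytic signal (FFT-based Hilbert transform)."""
    X = np.fft.fft(x)
    h = np.zeros(len(x))
    h[0] = h[len(x) // 2] = 1.0     # assumes even length
    h[1:len(x) // 2] = 2.0
    return np.abs(np.fft.ifft(X * h))

def subband(x, f_lo, f_hi):
    """Ideal band-pass filter implemented by zeroing FFT bins."""
    f = np.abs(np.fft.fftfreq(len(x), 1.0 / fs))
    X = np.fft.fft(x)
    X[(f < f_lo) | (f >= f_hi)] = 0.0
    return np.real(np.fft.ifft(X))

# Composite signal: sum of the subband envelopes of the residual.
bands = [(0, 1000), (1000, 2000), (2000, 3000), (3000, 4000)]
composite = sum(envelope(subband(lpr, lo, hi)) for lo, hi in bands)

def pick_peaks(c, min_dist, rel_threshold=0.5):
    """Greedy peak picking: strongest first, enforcing a minimum gap."""
    peaks = []
    for i in np.argsort(c)[::-1]:
        if c[i] < rel_threshold * c.max():
            break
        if all(abs(int(i) - p) >= min_dist for p in peaks):
            peaks.append(int(i))
    return sorted(peaks)

gcis = pick_peaks(composite, min_dist=40)
print(gcis)  # peaks land at the true impulse locations
```

On real speech the residual peaks are far less clean, which is why the paper refines the candidate locations against the LPR itself.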

In many cultures of the world, traditional percussion music uses mnemonic syllables that are representative of the timbres of instruments. These syllables are orally transmitted and often provide a language for percussion in those music cultures. Percussion patterns in these cultures thus have a well-defined representation in the form of these syllables, which can be utilized in several computational percussion pattern analysis tasks. We explore a connected word speech recognition based framework that can effectively utilize the syllabic representation for automatic transcription and recognition of audio percussion patterns. In particular, we consider the case of Beijing opera and present a syllable-level hidden Markov model (HMM) based system for transcription and classification of percussion patterns. The encouraging classification results on a representative dataset of Beijing opera percussion patterns support our approach and provide further insight into the utility of these syllables for computational description of percussion patterns.
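A minimal sketch of syllable-level HMM scoring: two toy discrete HMMs classify a quantized observation sequence via the forward algorithm. The syllable labels and hand-set parameters below are hypothetical; a real system would train GMM-HMMs on audio features.

```python
import numpy as np

def log_forward(obs, pi, A, B):
    """Sequence log-likelihood under a discrete HMM (forward algorithm,
    computed in the log domain to avoid underflow)."""
    log_pi, log_A, log_B = np.log(pi), np.log(A), np.log(B)
    alpha = log_pi + log_B[:, obs[0]]
    for o in obs[1:]:
        alpha = np.logaddexp.reduce(alpha[:, None] + log_A, axis=0) + log_B[:, o]
    return np.logaddexp.reduce(alpha)

# Two toy 2-state left-to-right syllable models over 3 quantized
# timbre symbols (hypothetical labels and parameters).
pi = np.array([0.9, 0.1])
A = np.array([[0.60, 0.40],
              [0.05, 0.95]])
B = {"da": np.array([[0.8, 0.1, 0.1],    # state 0 favors symbol 0
                     [0.1, 0.8, 0.1]]),  # state 1 favors symbol 1
     "bo": np.array([[0.1, 0.1, 0.8],    # state 0 favors symbol 2
                     [0.8, 0.1, 0.1]])}  # state 1 favors symbol 0

def classify(obs):
    """Pick the syllable model with the highest sequence likelihood."""
    return max(B, key=lambda s: log_forward(obs, pi, A, B[s]))

print(classify([0, 0, 1, 1]), classify([2, 2, 0, 0]))  # → da bo
```

Connected-word recognition chains such syllable models into a composite HMM, so a whole percussion pattern is transcribed as the best path through syllable sequences.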

IEEE Signal Processing Letters, 2000
We address the problem of identifying the constituent sources in a single-sensor mixture signal consisting of contributions from multiple simultaneously active sources. We propose a generic framework for mixture signal analysis based on a latent variable approach. The basic idea of the approach is to detect known sources, represented as stochastic models, in a single-channel mixture signal without performing signal separation. A given mixture signal is modeled as a convex combination of known source models, and the weights of the models are estimated from the mixture signal. We show experimentally that these weights indicate the presence or absence of the respective sources. The performance of the proposed approach is illustrated on mixture speech data in a reverberant enclosure. For the task of identifying the constituent speakers using data from a single microphone, the proposed approach is able to identify the dominant source with up to 8 simultaneously active background sources in a room with RT60 = 250 ms, using models obtained from clean speech data, for a source-to-interference ratio (SIR) greater than 2 dB.
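A toy sketch of the convex-combination idea, with fixed 1-D Gaussians standing in for trained source models (three known sources, of which two are active; all numbers are hypothetical). Only the weights are estimated; thresholding them gives presence/absence without any signal separation, as the abstract describes.

```python
import numpy as np

rng = np.random.default_rng(1)

# Three known source models (toy 1-D Gaussians in place of trained GMMs).
means = np.array([-5.0, 0.0, 5.0])       # sources 0, 1, 2

# Single-channel mixture frames: only sources 0 and 2 are active.
x = np.concatenate([rng.normal(-5.0, 1.0, 150),
                    rng.normal(5.0, 1.0, 150)])

# EM for the convex-combination weights (the source models stay fixed).
w = np.full(3, 1.0 / 3.0)
lik = np.exp(-0.5 * (x[:, None] - means[None, :]) ** 2)  # N x 3
for _ in range(50):
    r = w * lik                          # E-step: responsibilities
    r /= r.sum(axis=1, keepdims=True)
    w = r.mean(axis=0)                   # M-step: updated weights

detected = np.flatnonzero(w > 0.1)       # weights flag source presence
print(np.round(w, 2), detected)          # sources 0 and 2 get ~0.5 each
```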

2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 2013
We address the problem of multi-instrument recognition in polyphonic music signals. Individual instruments are modeled within a stochastic framework using Student's-t mixture models (tMMs). We impose a mixture of these instrument models on the polyphonic signal model. No a priori knowledge is assumed about the number of instruments in the polyphony. The mixture weights are estimated in a latent variable framework from the polyphonic data using an Expectation Maximization (EM) algorithm derived for the proposed approach. The weights are shown to indicate instrument activity. The output of the algorithm is an Instrument Activity Graph (IAG), using which it is possible to find out the instruments that are active at a given time. An average F-ratio of 0.75 is obtained for polyphonies containing 2-5 instruments, on an experimental test set of 8 instruments: clarinet, flute, guitar, harp, mandolin, piano, trombone and violin.
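The choice of Student's-t over Gaussian mixtures rests on heavier tails: aberrant frames are penalized far less, so a few outliers do not dominate the likelihood. A quick check of that property, using standardized densities and an illustrative choice of degrees of freedom:

```python
import math

def t_logpdf(x, nu):
    """Log-density of a standardized Student's-t with nu degrees of freedom."""
    return (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
            - 0.5 * math.log(nu * math.pi)
            - (nu + 1) / 2 * math.log(1 + x * x / nu))

def norm_logpdf(x):
    """Log-density of a standard Gaussian."""
    return -0.5 * (x * x + math.log(2 * math.pi))

# A frame 5 sigma out: the t-model penalizes it far less than the
# Gaussian, which is the robustness the tMM choice buys.
print(t_logpdf(5.0, 3.0), norm_logpdf(5.0))
```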

IEEE Transactions on Audio, Speech, and Language Processing, 2000
We address the problem of robust formant tracking in continuous speech in the presence of additive noise. We propose a new approach based on mixture modeling of the formant contours. Our approach consists of two main steps: (i) computation of a pyknogram based on multiband amplitude-modulation/frequency-modulation (AM/FM) decomposition of the input speech; and (ii) statistical modeling of the pyknogram using mixture models. We experiment with both the Gaussian mixture model (GMM) and Student's-t mixture model (tMM) and show that the latter is robust with respect to handling outliers in the pyknogram data, parameter selection, accuracy, and smoothness of the estimated formant contours. Experimental results on simulated data as well as noisy speech data show that the proposed tMM-based approach is also robust to additive noise. We present performance comparisons with a recently developed adaptive filterbank technique proposed in the literature and the classical Burg's spectral estimator technique, which show that the proposed technique is more robust to noise.
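A sketch of why the t-model resists pyknogram outliers: the EM update for a Student's-t location is an iteratively reweighted mean whose weights shrink for far-away points. All data and parameters below are hypothetical, chosen only to contrast the robust estimate with a plain mean.

```python
import numpy as np

rng = np.random.default_rng(2)

# Pyknogram-style frequency samples for one formant band, contaminated
# by stray points (hypothetical numbers, for illustration only).
freqs = np.concatenate([rng.normal(500.0, 30.0, 90),       # formant track
                        rng.uniform(2500.0, 3500.0, 10)])  # outlier points

def t_location(x, nu=3.0, scale=30.0, iters=50):
    """EM for the location of a Student's-t component.

    Each point gets weight (nu + 1) / (nu + z^2): far-away points
    receive tiny weights, so they barely pull the estimate -- the
    mechanism behind the tMM's robustness to pyknogram outliers.
    """
    mu = x.mean()                        # deliberately biased start
    for _ in range(iters):
        z = (x - mu) / scale
        w = (nu + 1.0) / (nu + z ** 2)
        mu = np.sum(w * x) / np.sum(w)
    return mu

mu_t = t_location(freqs)
print(round(freqs.mean()), round(float(mu_t)))  # mean is dragged off; t is not
```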