Papers by Vikramjit Mitra
Effects of feature type, learning algorithm and speaking style for depression detection from speech
2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015
Cross-corpus depression prediction from speech
2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015
Deep convolutional nets and robust features for reverberation-robust speech recognition
2014 IEEE Spoken Language Technology Workshop (SLT), 2014

Medical Imaging 2006: Ultrasonic Imaging and Signal Processing, 2006
Ultrasound imaging is a noninvasive technique well-suited for detecting abnormalities like cysts, lesions and blood clots. In order to use 3D ultrasound to visualize the size and shape of such abnormalities, effective boundary detection methods are needed. A robust boundary detection technique using a nearest neighbor map (NNM) and applicable to multi-object cases has been developed. The algorithm contains three modules: pre-processor, main processor and boundary constructor. The pre-processor detects the object(s) and obtains geometrical as well as statistical information for each object, whereas the main processor uses that information to perform the final processing of the image. These first two modules perform image normalization, thresholding, filtering using median, wavelet, Wiener and morphological operations, estimation and boundary detection of object(s) using the NNM, and calculation of object size and location. The boundary constructor module implements an active contour model that uses information from the previous modules to obtain seed point(s). The algorithm has been found to offer high boundary detection accuracy: 96.4% for single scan plane (SSP) and 97.9% for multiple scan plane (MSP) images. The algorithm was compared with Stick's algorithm and a Gibbs Joint Probability Function-based algorithm and was found to offer shorter execution time with higher accuracy than either. SSP numerically modeled ultrasound images, SSP real ultrasound images, MSP phantom images and MSP numerically modeled ultrasound images were processed. The algorithm provides an area estimate of the target object(s), which, along with position information from the ultrasound transducer, can be used to calculate object volume(s) and for 3D visualization of the object(s).
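The pre-processing stages described above (normalization, thresholding, median filtering, boundary-pixel extraction) can be sketched as follows. This is an illustrative plain-Python reconstruction, not the authors' implementation; the NNM and active-contour stages are omitted, and all function names are ours.

```python
def normalize(img):
    """Scale pixel intensities to [0, 1]."""
    lo = min(min(row) for row in img)
    hi = max(max(row) for row in img)
    return [[(p - lo) / (hi - lo or 1) for p in row] for row in img]

def threshold(img, t=0.5):
    """Binarize: 1 for object, 0 for background."""
    return [[1 if p >= t else 0 for p in row] for row in img]

def median3(img):
    """3x3 median filter, a simple form of speckle suppression."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]          # borders left unchanged
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            win = sorted(img[y + dy][x + dx]
                         for dy in (-1, 0, 1) for dx in (-1, 0, 1))
            out[y][x] = win[4]             # median of the 9 values
    return out

def boundary_pixels(mask):
    """Object pixels with at least one background (or off-image) 4-neighbour."""
    h, w = len(mask), len(mask[0])
    pts = []
    for y in range(h):
        for x in range(w):
            if mask[y][x] and any(
                    not (0 <= y + dy < h and 0 <= x + dx < w)
                    or not mask[y + dy][x + dx]
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1))):
                pts.append((y, x))
    return pts
```

In the full algorithm, the boundary pixels recovered this way seed the NNM estimation and the active contour model.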

Medical Imaging 2006: Image Processing, 2006
Medical ultrasound images are noisy with speckle, acoustic noise and other artifacts. Reduction of speckle in particular is useful for CAD algorithms. We use two algorithms, namely, mean curvature evolution of the ultrasound image surface and a variation of the mean-curvature flow, to reduce speckle. The premise is that when we view the ultrasound image as a surface, the speckle appears as a high-curvature jagged layer over the true objects' intensities and will reduce quickly on curvature evolution. We compare the two speckle reduction algorithms. We apply the speckle reduction to an image of a cyst and a 4-chamber view of the heart. We show significant, if not complete, speckle reduction, while keeping the relevant organ boundaries intact. On the speckle-reduced images, we apply a segmentation algorithm to detect objects. The segmentation algorithm is two-step. In the first step we choose a prior shape and optimize the pose parameters to maximize the number of edge pixels the curve falls on, using gradient ascent. In the second step, a radial motion is used to draw the contour points to the local edges. We apply the algorithm to a cyst and obtain satisfactory results. We compare the total area inside the boundary output by our segmentation algorithm to the total area covered by a hand-drawn boundary of the cyst; the ratio is about 97%.
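The curvature-evolution idea can be illustrated with a standard explicit finite-difference discretization of mean-curvature flow, treating the image as a height surface u(x, y) and evolving it by u_t = κ|∇u|. This is a generic sketch of the technique, not the authors' scheme; the time step and iteration count are illustrative.

```python
def curvature_step(u, dt=0.1, eps=1e-8):
    """One explicit step of mean-curvature flow on image surface u."""
    h, w = len(u), len(u[0])
    out = [row[:] for row in u]            # borders held fixed
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            ux  = (u[y][x + 1] - u[y][x - 1]) / 2.0
            uy  = (u[y + 1][x] - u[y - 1][x]) / 2.0
            uxx = u[y][x + 1] - 2 * u[y][x] + u[y][x - 1]
            uyy = u[y + 1][x] - 2 * u[y][x] + u[y - 1][x]
            uxy = (u[y + 1][x + 1] - u[y + 1][x - 1]
                   - u[y - 1][x + 1] + u[y - 1][x - 1]) / 4.0
            # curvature-times-gradient-magnitude term of the level-set flow
            num = uxx * uy * uy - 2 * ux * uy * uxy + uyy * ux * ux
            out[y][x] = u[y][x] + dt * num / (ux * ux + uy * uy + eps)
    return out

def smooth(u, iters=20):
    """Iterate the flow; high-curvature speckle decays fastest."""
    for _ in range(iters):
        u = curvature_step(u)
    return u
```

High-frequency "jagged" structure has large curvature and is flattened in a few iterations, while smooth, extended intensity edges move far more slowly, which is why organ boundaries survive.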
Damped oscillator cepstral coefficients for robust speech recognition
A step in the realization of a speech recognition system based on gestural phonology and landmarks
Recent improvements in SRI’s Keyword Detection System for Noisy Audio

Robust features and system fusion for reverberation-robust speech recognition
Reverberation in speech degrades the performance of speech recognition systems, leading to higher word error rates. Human listeners can often ignore reverberation, indicating that the auditory system somehow compensates for reverberation degradations. In this work, we present robust acoustic features motivated by knowledge of human speech perception and production, and we demonstrate that these features provide reasonable robustness to reverberation effects compared to traditional mel-filterbank-based features. Using a single-feature system trained with the data distributed through the REVERB 2014 challenge on automatic speech recognition, we show a 12% and 0.2% relative reduction in word error rate (WER) compared to the mel-scale-feature-based baseline system for simulated and real reverberation conditions, respectively. The reduction is more pronounced when three systems are combined, resulting in a relative 20% reduction in WER for the simulated reverberation condition and 11.7% for the real reverberation condition compared to the mel-scale-feature-based baseline system. WER was found to reduce even further with the addition of more systems trained with robust acoustic features. HLDA transformation of features and MLLR adaptation of speaker clusters were also explored in this study, and both were found to improve recognition performance under reverberant conditions.
Evaluating robust features on Deep Neural Networks for speech recognition in noisy and channel mismatched conditions
Strategies for high accuracy keyword detection in noisy channels
The SRI AVEC-2014 Evaluation System
Proceedings of the 4th International Workshop on Audio/Visual Emotion Challenge - AVEC '14, 2014
Improving language identification robustness to highly channel-degraded speech through multiple system fusion
Modulation features for noise robust speaker identification
A noise-robust system for NIST 2012 speaker recognition evaluation
Automatic phonetic segmentation using boundary models

Highly accurate phonetic segmentation using boundary correction models and system fusion
2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014
Accurate phone-level segmentation of speech remains an important task for many subfields of speech research. We investigate techniques for boosting the accuracy of automatic phonetic segmentation based on HMM acoustic-phonetic models. In prior work [25] we were able to improve on state-of-the-art alignment accuracy by employing special phone-boundary HMM models, trained on phonetically segmented training data, in conjunction with a simple boundary-time correction model. Here we present further improved results by using more powerful statistical models for boundary correction that are conditioned on phonetic context and duration features. Furthermore, we find that combining multiple acoustic front-ends gives additional gains in accuracy, and that conditioning the combiner on phonetic context and side information helps. Overall, we reduce segmentation errors on the TIMIT corpus by almost one half, from 93.9% to 96.8% boundary accuracy with a 20-ms tolerance.
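Reduced to its simplest form, the boundary-correction idea amounts to learning a systematic per-context offset between hypothesized and reference boundary times, and subtracting it at test time. The sketch below is a toy version with invented data; the paper's models are conditioned on richer phonetic-context and duration features and fused across front-ends.

```python
from collections import defaultdict

def train_correction(examples):
    """Learn mean boundary-time error per phonetic context.

    examples: iterable of (left_phone, right_phone, hyp_time, ref_time).
    Returns {(left, right): mean(hyp_time - ref_time)}.
    """
    sums = defaultdict(lambda: [0.0, 0])
    for left, right, hyp, ref in examples:
        s = sums[(left, right)]
        s[0] += hyp - ref
        s[1] += 1
    return {ctx: total / n for ctx, (total, n) in sums.items()}

def correct(model, left, right, hyp_time):
    """Subtract the learned context-dependent offset; unseen contexts pass through."""
    return hyp_time - model.get((left, right), 0.0)
```

For example, if /s/-to-vowel boundaries are consistently hypothesized 10 ms late, the model learns a +0.010 s offset for that context and removes it.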

Articulatory features from deep neural networks and their role in speech recognition
2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014
This paper presents a deep neural network (DNN) to extract articulatory information from the speech signal and explores different ways to use such information in a continuous speech recognition task. The DNN was trained to estimate articulatory trajectories from input speech, where the training data is a corpus of synthetic English words generated by the Haskins Laboratories' task-dynamic model of speech production. Speech parameterized as cepstral features was used to train the DNN, where we explored different cepstral features to observe their role in the accuracy of articulatory trajectory estimation. The best feature was used to train the final DNN system, which was then used to predict articulatory trajectories for the training and testing sets of Aurora-4, the noisy Wall Street Journal (WSJ0) corpus. This study also explored the use of hidden variables in the DNN pipeline as a potential acoustic feature candidate for speech recognition, and the results were encouraging. Word recognition results from Aurora-4 indicate that the articulatory features from the DNN improve speech recognition performance when fused with other standard cepstral features; however, when used by themselves, they failed to match the baseline performance.
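The regression set-up, mapping acoustic frames to articulatory trajectory values, can be illustrated with a toy one-hidden-layer network trained by stochastic gradient descent on a squared-error loss. The paper's DNN, features, and training corpus are of course far larger; everything here (network size, learning rate, synthetic target) is illustrative.

```python
import math, random

def mlp_train(X, Y, hidden=8, lr=0.05, epochs=200, seed=0):
    """Fit a 1-hidden-layer tanh MLP regressor by per-sample SGD."""
    rng = random.Random(seed)
    d = len(X[0])
    W1 = [[rng.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(hidden)]
    b1 = [0.0] * hidden
    W2 = [rng.uniform(-0.5, 0.5) for _ in range(hidden)]
    b2 = 0.0
    for _ in range(epochs):
        for x, y in zip(X, Y):
            h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
                 for row, b in zip(W1, b1)]
            pred = sum(w * hi for w, hi in zip(W2, h)) + b2
            err = pred - y                       # gradient of 0.5*(pred-y)^2
            for j in range(hidden):
                dh = err * W2[j] * (1.0 - h[j] * h[j])   # tanh' = 1 - tanh^2
                W2[j] -= lr * err * h[j]
                b1[j] -= lr * dh
                for i in range(d):
                    W1[j][i] -= lr * dh * x[i]
            b2 -= lr * err
    return W1, b1, W2, b2

def mlp_predict(params, x):
    W1, b1, W2, b2 = params
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W1, b1)]
    return sum(w * hi for w, hi in zip(W2, h)) + b2
```

In the paper's pipeline, X would be cepstral feature frames and Y the task-dynamic model's articulatory trajectory values, with one output per articulatory dimension rather than the single output shown here.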

Medium-duration modulation cepstral feature for robust speech recognition
2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014
Studies have shown that the performance of state-of-the-art automatic speech recognition (ASR) systems deteriorates significantly with increased noise levels and channel degradations, when compared to human speech recognition capability. Traditionally, noise-robust acoustic features are deployed to improve speech recognition performance under varying background conditions and compensate for the performance degradations. In this paper, we present the Modulation of Medium Duration Speech Amplitude (MMeDuSA) feature, a composite feature capturing subband speech modulations and a summary modulation. We analyze MMeDuSA's speech recognition performance using SRI International's DECIPHER® large vocabulary continuous speech recognition (LVCSR) system, on noise- and channel-degraded Levantine Arabic speech distributed through the Defense Advanced Research Projects Agency (DARPA) Robust Automatic Transcription of Speech (RATS) program. We also analyzed MMeDuSA's performance on the Aurora-4 noise- and channel-degraded English corpus. Our results from all these experiments suggest that the proposed MMeDuSA feature improved recognition performance under both noisy and channel-degraded conditions in almost all the recognition tasks.
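The general idea behind modulation features, measuring how a subband's amplitude envelope fluctuates over time, can be sketched as below. This is not the MMeDuSA algorithm itself (which uses an auditory filterbank and medium-duration analysis windows); the crude DFT-bin "subband" and the depth measure are our simplifications.

```python
import math

def frame_energies(signal, lo_bin, hi_bin, frame=64, hop=32):
    """Per-frame energy in a DFT-bin range: a crude subband envelope."""
    energies = []
    for start in range(0, len(signal) - frame + 1, hop):
        seg = signal[start:start + frame]
        e = 0.0
        for k in range(lo_bin, hi_bin):
            re = sum(s * math.cos(2 * math.pi * k * n / frame)
                     for n, s in enumerate(seg))
            im = -sum(s * math.sin(2 * math.pi * k * n / frame)
                      for n, s in enumerate(seg))
            e += re * re + im * im
        energies.append(e)
    return energies

def modulation_depth(energies):
    """Normalized variation of the envelope across frames."""
    m = sum(energies) / len(energies)
    var = sum((e - m) ** 2 for e in energies) / len(energies)
    return math.sqrt(var) / (m + 1e-12)
```

A steady tone yields a flat envelope (near-zero depth), while amplitude-modulated speech-like signals yield higher depth; speech carries most of its information in such slow envelope modulations, which is what makes features of this family noise-robust.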

Feature fusion for high-accuracy keyword spotting
2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014
This paper assesses the role of robust acoustic features in spoken term detection (a.k.a. keyword spotting, KWS) under heavily degraded channel and noise-corrupted conditions. A number of noise-robust acoustic features were used, both in isolation and in combination, to train large vocabulary continuous speech recognition (LVCSR) systems, with the resulting word lattices used for spoken term detection. Results indicate that the use of robust acoustic features improved KWS performance with respect to a highly optimized state-of-the-art baseline system. It has been shown that fusion of multiple systems improves KWS performance; however, the number of systems that can be trained is constrained by the number of front-end features. This work shows that, given a number of front-end features, it is possible to train several systems by using the front-end features by themselves along with different feature fusion techniques, which provides a richer set of individual systems. Results from this work show that KWS performance can be improved compared to individual feature-based systems when multiple features are fused with one another, and even further when multiple such systems are combined. Finally, this work shows that fusion of fused and single-feature-based systems provides significant improvement in KWS performance compared to fusion of single-feature-based systems alone.
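The two fusion levels discussed above can be illustrated schematically: early fusion concatenates frame-level feature vectors from different front ends to form a new composite feature stream, while late fusion combines the outputs (e.g. keyword detection scores) of separately trained systems. Both functions below are minimal sketches; real systems add dimensionality reduction on the concatenated features and score normalization before combination.

```python
def fuse_frames(stream_a, stream_b):
    """Early fusion: concatenate per-frame feature vectors from two front ends."""
    if len(stream_a) != len(stream_b):
        raise ValueError("feature streams must be frame-aligned")
    return [fa + fb for fa, fb in zip(stream_a, stream_b)]

def fuse_scores(system_scores, weights=None):
    """Late fusion: weighted average of per-hypothesis scores across systems."""
    n = len(system_scores)
    weights = weights or [1.0 / n] * n
    return [sum(w * scores[i] for w, scores in zip(weights, system_scores))
            for i in range(len(system_scores[0]))]
```

Given k front-end features, early fusion of pairs and subsets yields many more than k trainable systems, which is the "richer set of individual systems" the abstract refers to; those systems' outputs are then combined by late fusion.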