Papers by Hynek Hermansky
8th European Conference on Speech Communication and Technology (Eurospeech 2003)
... timer derivative estimation (such as the 10-frame interval applied in language identification system ... between the performances of both systems for models trained with 1 conversation side, which ... In a cheating experiment, the broad-phonetic categories are obtained by a canonical ...
6th European Conference on Speech Communication and Technology (Eurospeech 1999)
Features for automatic speech recognition (ASR) are typically sampled at about 100 Hz (10 ms analysis step). Recent experiments indicate that the most efficient components of the modulation spectrum of speech for ASR are up to about 16 Hz [1]. Consequently, RASTA processing ...
5th European Conference on Speech Communication and Technology (Eurospeech 1997)
We describe the use of Linear Discriminant Analysis (LDA) for data-driven automatic design of RASTA-like filters. The LDA applied to rather long segments of time trajectories of critical-band energies yields FIR filters to be applied to these time trajectories in the feature extraction module. Frequency responses of the first three discriminant vectors are in principle consistent with the ad hoc designed RASTA, delta, and double-delta filters. On a connected digit task the new features outperform the original RASTA processing.
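As a rough illustration of applying an FIR filter to the time trajectory of each critical-band energy, here is a minimal sketch. The LDA-derived coefficients are not given in this listing, so the standard 5-point regression delta filter is swapped in as a stand-in; the names `DELTA_FIR`, `apply_fir`, and `filter_bands`, and the edge-replication padding, are assumptions made for the example.

```python
# Stand-in taps: the common 5-point regression "delta" filter, listed from
# past (t-2) to future (t+2). These are NOT the LDA-derived filters.
DELTA_FIR = [-0.2, -0.1, 0.0, 0.1, 0.2]


def apply_fir(trajectory, taps):
    """Filter one time trajectory with a centered FIR filter.

    The output has the same length as the input; the edges are padded by
    repeating the first and last samples (an assumption of this sketch).
    """
    half = len(taps) // 2
    padded = [trajectory[0]] * half + list(trajectory) + [trajectory[-1]] * half
    return [
        sum(tap * padded[t + i] for i, tap in enumerate(taps))
        for t in range(len(trajectory))
    ]


def filter_bands(band_trajectories, taps):
    """Apply the same FIR filter independently to every critical band."""
    return [apply_fir(band, taps) for band in band_trajectories]


# Example: a linear ramp has a constant slope of 1 away from the edges.
ramp = [float(t) for t in range(10)]
slope = apply_fir(ramp, DELTA_FIR)
```

In a data-driven design, each discriminant vector would simply replace `DELTA_FIR` as the tap list.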
7th International Conference on Spoken Language Processing (ICSLP 2002)
Our feature extraction module for the Aurora task is based on a combination of a conventional noise suppression technique (Wiener filtering) with our temporal processing techniques (linear discriminant RASTA filtering and nonlinear TempoRAl Pattern (TRAP) classifier). We observe better than 58% relative error improvement on the prescribed Aurora Digit Task, a performance level that is somewhat better than the new ETSI Advanced Feature standard. Furthermore, to test generalization of our approach to an independent test set not available during development, we evaluate performance on American English SpeechDatCar digits and show 10.54% relative improvement over the new ETSI standard.
8th European Conference on Speech Communication and Technology (Eurospeech 2003)
TRAP-based ASR attempts to extract information from rather long (as long as 1 s) and narrow (one critical-band) patches (temporal patterns) from the time-frequency plane. We investigate the effect of combining temporal patterns of logarithmic critical-band energies from several ...
4th European Conference on Speech Communication and Technology (Eurospeech 1995)
A new technique is presented which improves the subjective quality of band-limited speech. The approach is based on a linear model of speech production, in which we independently estimate the spectral envelope and excitation function for a broad-bandwidth speech signal to reconstruct missing frequency components in narrow-bandwidth speech.
5th International Conference on Spoken Language Processing (ICSLP 1998)
The work examines Karhunen-Loeve Transform and Linear Discriminant Analysis as means for designing optimized spectral bases for the projection of the critical-band auditory-like spectrum. 1. INTRODUCTION 1.1. The state-of-art: Typical large vocabulary automatic recognition of speech (ASR) consists of three main components: feature extraction, pattern classification, and language modeling. The feature extraction attempts to reduce the information rate of raw speech data by alleviating...
5th International Conference on Spoken Language Processing (ICSLP 1998)
We provide an analysis of the relative importance of components of the modulation spectrum for speaker verification. The aim is to remove less relevant components and reduce system sensitivity to acoustic disturbances while improving verification accuracy. Spectral components between 0.1 Hz and 10 Hz are found to contain the most useful speaker information. We discuss this result in the context of RASTA processing and cepstral mean subtraction. When ...
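To make the 0.1 Hz to 10 Hz band concrete, the sketch below keeps only those modulation components of a single feature trajectory by zeroing DFT bins outside the band. It is an idealized brick-wall filter for illustration, not the filter used in the paper; the 100 Hz frame rate and the names `FRAME_RATE_HZ`, `dft`, and `bandpass_modulation` are assumptions.

```python
import cmath
import math

FRAME_RATE_HZ = 100.0  # assumed frame rate (10 ms analysis step)


def dft(x):
    """Plain O(n^2) discrete Fourier transform, adequate for short trajectories."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def bandpass_modulation(trajectory, low_hz=0.1, high_hz=10.0,
                        frame_rate=FRAME_RATE_HZ):
    """Keep only modulation frequencies in [low_hz, high_hz]."""
    n = len(trajectory)
    spectrum = dft(trajectory)
    for k in range(n):
        # Bin k and bin n-k carry the same absolute modulation frequency.
        freq = min(k, n - k) * frame_rate / n
        if not (low_hz <= freq <= high_hz):
            spectrum[k] = 0.0
    # Inverse DFT; for real input the imaginary part is rounding noise.
    return [
        (sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
             for k in range(n)) / n).real
        for t in range(n)
    ]


# Example: constant offset + 4 Hz modulation + 30 Hz modulation, 2 s long.
n_frames = 200
trajectory = [
    5.0
    + math.sin(2 * math.pi * 4 * t / FRAME_RATE_HZ)
    + 0.5 * math.sin(2 * math.pi * 30 * t / FRAME_RATE_HZ)
    for t in range(n_frames)
]
filtered = bandpass_modulation(trajectory)
```

Removing the 0 Hz bin here plays the same role as subtracting the mean of the trajectory, which is why the result is naturally discussed alongside cepstral mean subtraction.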
7th International Conference on Spoken Language Processing (ICSLP 2002)
This paper discusses the relevance of the non-uniform frequency resolution used by current speech analysis methods such as Mel frequency analysis and perceptual linear predictive (PLP) analysis. It is shown that linear discriminant analysis of the short-time Fourier spectrum of speech yields spectral basis functions which provide comparatively lower resolution in the high-frequency region of the spectrum. This is consistent with critical-band resolution and is shown to be caused by the spectral properties of vowel sounds. Further, we show that this non-uniform resolution can be traced to the physiology of the speech production mechanism. In ASR experiments, features extracted by the discriminant functions are shown to outperform the conventional features derived by cosine basis functions.
5th International Conference on Spoken Language Processing (ICSLP 1998)
The work proposes a radically different set of features for ASR where TempoRAl Patterns of spectral energies are used in place of the conventional spectral patterns. The approach has several inherent advantages, among them robustness to stationary or slowly varying disturbances. 1. INTRODUCTION 1.1. Spectral features: In 1665 Isaac Newton made the following observation: "The filling of a very deepe flaggon with a constant streame of beere or water ...
6th International Conference on Spoken Language Processing (ICSLP 2000)
The means of the long temporal trajectories of logarithmic critical band energies in the vicinity of an individual phoneme show distinct patterns (TRAPs, Fig. 1) in each critical band for different phonemes. These temporal patterns were successfully used in Automatic Speech ...
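The idea of a mean temporal pattern per phoneme can be sketched as follows for a single critical band. This is a simplified illustration under stated assumptions: every frame carries a phoneme label, a window is centered on each frame, and the default `half_width=50` gives 101 frames (about 1 s at an assumed 100 Hz frame rate). The function name `mean_traps` and its arguments are hypothetical.

```python
from collections import defaultdict


def mean_traps(band_energies, labels, half_width=50):
    """Average the temporal windows centered on frames of each phoneme.

    band_energies: per-frame log energy in ONE critical band.
    labels:        per-frame phoneme label (same length as band_energies).
    Returns {label: mean window of length 2 * half_width + 1}.
    Frames too close to either end to have full context are skipped.
    """
    length = 2 * half_width + 1
    sums = {}
    counts = defaultdict(int)
    for center in range(half_width, len(band_energies) - half_width):
        window = band_energies[center - half_width:center + half_width + 1]
        label = labels[center]
        if label not in sums:
            sums[label] = [0.0] * length
        sums[label] = [s + w for s, w in zip(sums[label], window)]
        counts[label] += 1
    return {
        label: [s / counts[label] for s in total]
        for label, total in sums.items()
    }


# Toy example with a 3-frame window (half_width=1).
toy_energies = [0.0, 0.0, 4.0, 4.0, 0.0, 0.0]
toy_labels = ["a", "a", "b", "b", "a", "a"]
toy_traps = mean_traps(toy_energies, toy_labels, half_width=1)
```

Running the same function once per critical band would give one mean pattern per band and phoneme.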
8th European Conference on Speech Communication and Technology (Eurospeech 2003)
The paper reviews the OGI submission for the NIST 2002 speaker recognition evaluation. It describes the systems submitted for the one- and two-speaker detection tasks and the post-evaluation improvements. In the one-speaker detection system, we present a new design of a data-driven temporal filter. We show that using a few broad phonetic categories improves the performance of the speaker recognition system. In post-evaluation experiments, we show that combinations with complementary features and modeling techniques significantly improve the performance of the GMM-based system. In the two-speaker detection system, we present a structured approach to detecting a speaker in conversations.
5th European Conference on Speech Communication and Technology (Eurospeech 1997)
2nd International Conference on Spoken Language Processing (ICSLP 1992)
2nd European Conference on Speech Communication and Technology (Eurospeech 1991)
... EFFECT OF THE COMMUNICATION CHANNEL IN AUDITORY-LIKE ANALYSIS OF SPEECH (RASTA-PLP) Hynek Hermansky, Nelson Morgan, Aruna Bayya, Phil Kohn ... It can be shown that if the derivative of step (2) is estimated by a simple first difference, and if the ...
6th International Conference on Spoken Language Processing (ICSLP 2000)
Deviating from the conventional Hidden Markov Model-Multi-Layer Perceptron (HMM-MLP) hybrid paradigm of using MLP for classification, the proposed discriminative MLP technique uses MLP as a mapping module for feature extraction for conventional HMM-based systems. ...

6th European Conference on Speech Communication and Technology (Eurospeech 1999)
This paper examines sources of variability in the speech signal using a new technique that is based on a nested spectral analysis of variance (SANOVA). By constructing an ANOVA in the modulation spectral domain, the technique allows a characterization of unwanted variability in the time sequences of logarithmic energy caused by extraneous sources of variability such as additive noise, convolutional noise, and telephone handset transducer. Very low and moderate to high modulation frequencies are shown to be particularly affected by these sources. Verification results for 500 speakers on Switchboard data from the 1998 NIST speaker recognition evaluation are presented to confirm the conclusions. It is shown that bandpass filtering and downsampling of the time sequences of logarithmic energy, compared to conventional highpass filtering, leads to a 13% relative reduction of the EER in mismatched conditions.
6th European Conference on Speech Communication and Technology (Eurospeech 1999)
... Sachin Kajarekar¹, Narendranath Malayath¹ and Hynek Hermansky¹,² ¹Oregon Graduate Institute of Science and Technology, Portland, Oregon, USA. ... variability in the speech signal can be attributed to the following sources: (a) phonetic content, (b) speaker and channel, and (c) ...

8th European Conference on Speech Communication and Technology (Eurospeech 2003)
ABSTRACT Local frequency and time averaging and differentiating operators, using three neighboring points of the critical-band time-frequency plane, are used to process the plane prior to its use in TRAP-based ASR. In that way, five alternative TRAP-based ASR systems (the original one and the time/frequency integrated/differentiated ones) are created. We show that the frequency differentiating operator improves performance of the TRAP-based ASR. 1. Introduction Unlike features which are based on the full short-term spectrum with its short time context, temporal pattern (TRAP) features are based on a narrow-band spectrum with long time context. By breaking the spectrum into individual critical bands and using each critical band independently in the initial stage of the feature extraction, the TRAP-based features can be inherently less sensitive to changes in relative levels of the individual critical bands. Further, by using longer temporal context, all information about underlying linguistic events, which is spread in time due to coarticulation, may be utilized. Initially, single time trajectories of critical band spectral densities in each critical band were used as input vectors in the frequency-localized TRAP probability estimators (3). Thus, the burden of exploiting the useful information in the temporal pattern and alleviating the irrelevant one was fully left on the estimator. Later, attempts at parametrizing the trajectory vectors were made and the critical band spectral density vectors were projected on bases obtained by Principal Component Analysis (PCA) (6) or Linear Discriminant Analysis (LDA) (5), with the resulting reduction of the size of the input vector to the frequency-localized probability estimator.
Recent studies indicate that information extracted from several (up to three) neighboring bands improves performance of the TRAP system (7). Since these studies use PCA of the input vector space, it is possible to investigate the resulting projection basis. Such an inspection reveals that the PCA rotation resembles frequency averaging and frequency differentiating of the neighboring bands with the subsequent projection on cosine transform bases. This observation suggests that a simple pre-processing of a critical-band spectrogram (CRBS) prior to the cosine transformation and the TRAP classification may be beneficial. The current work investigates such modifications of CRBS in the TRAP system and evaluates their individual efficiency as well as their effect in conjunction with the original (i.e. unprocessed) CRBS.
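The three-point averaging and differentiating operators described in this abstract can be sketched roughly as below. The exact kernels are not stated in this listing, so the equal-weight average `(1/3, 1/3, 1/3)` and the central difference `(-1, 0, 1)` are assumptions, as are the function names and the choice to drop edge bands and edge frames.

```python
# crbs[b][t] is the log energy of critical band b at frame t.
AVG = (1 / 3, 1 / 3, 1 / 3)  # assumed 3-point averaging kernel
DIFF = (-1.0, 0.0, 1.0)      # assumed 3-point differentiating kernel


def apply_along_frequency(crbs, kernel):
    """Combine each band with its lower and upper neighbor (edge bands dropped)."""
    n_frames = len(crbs[0])
    return [
        [
            kernel[0] * crbs[b - 1][t]
            + kernel[1] * crbs[b][t]
            + kernel[2] * crbs[b + 1][t]
            for t in range(n_frames)
        ]
        for b in range(1, len(crbs) - 1)
    ]


def apply_along_time(crbs, kernel):
    """Combine each frame with its previous and next frame (edge frames dropped)."""
    return [
        [
            kernel[0] * band[t - 1] + kernel[1] * band[t] + kernel[2] * band[t + 1]
            for t in range(1, len(band) - 1)
        ]
        for band in crbs
    ]


def five_streams(crbs):
    """Return the original CRBS plus its four processed variants."""
    return {
        "original": crbs,
        "freq_avg": apply_along_frequency(crbs, AVG),
        "freq_diff": apply_along_frequency(crbs, DIFF),
        "time_avg": apply_along_time(crbs, AVG),
        "time_diff": apply_along_time(crbs, DIFF),
    }


# Toy spectrogram: 3 bands x 4 frames.
toy_crbs = [
    [0.0, 1.0, 2.0, 3.0],
    [10.0, 11.0, 12.0, 13.0],
    [20.0, 21.0, 22.0, 23.0],
]
streams = five_streams(toy_crbs)
```

Each of the five streams would then feed its own TRAP-based system, matching the five alternatives mentioned in the abstract.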
8th European Conference on Speech Communication and Technology (Eurospeech 2003)
Band-independent categories are investigated for feature estimation in ASR. These categories represent distinct speech-events manifested in frequency-localized temporal patterns of the speech signal. A universal, single estimator is proposed for estimating speech-event ...