Papers by Tuomas Virtanen

Sound events have been shown to affect the emotions of the listener. Recent work in the field of emotion recognition from sound events demonstrates, on one hand, that emotional information can be retrieved automatically from sound events and, on the other hand, that a deeper understanding is needed of how the semantic content of sound events influences the listener's affective state. In this work we present a first, to the best of the authors' knowledge, investigation of the relation between the semantic similarity of sound events and the elicited emotion. To that end, we use two emotionally annotated sound datasets and the Wu-Palmer semantic similarity measure computed over WordNet. Results indicate that semantic content plays a limited role in shaping the listener's affective state overall. However, when the semantic content is matched to specific areas of the arousal-valence space, or when the source's spatial position is also taken into account, the effect of semantic content becomes stronger, especially for cases with medium-to-low valence and medium-to-high arousal, or when the sound source is at lateral positions relative to the listener's head, respectively.
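
As a rough illustration of the similarity measure used above, the sketch below computes Wu-Palmer similarity over WordNet with NLTK. The example labels are hypothetical sound-event tags, not items from the datasets used in the paper.

```python
# A minimal sketch of Wu-Palmer semantic similarity over WordNet via NLTK.
# Requires a one-time: import nltk; nltk.download("wordnet")
from nltk.corpus import wordnet as wn

def wup(label_a: str, label_b: str) -> float:
    """Wu-Palmer similarity between the first noun synsets of two labels."""
    syn_a = wn.synsets(label_a, pos=wn.NOUN)[0]
    syn_b = wn.synsets(label_b, pos=wn.NOUN)[0]
    return syn_a.wup_similarity(syn_b)

print(wup("dog", "cat"))      # semantically close concepts -> high score
print(wup("dog", "thunder"))  # unrelated concepts -> low score
```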

We propose a novel exemplar-based feature enhancement method for automatic speech recognition which uses coupled dictionaries: an input dictionary containing atoms sampled in the modulation (envelope) spectrogram domain and an output dictionary with atoms in the Mel or full-resolution frequency domain. The input modulation representation is chosen for its ability to separate speech from noise and for its relation to human auditory processing. The output representation is one that can be processed by the ASR back-end. The proposed method was evaluated on the AURORA-2 and AURORA-4 databases, and improved word error rates (WER) were obtained compared to a system that uses Mel features in the input exemplars. The paper also proposes a hybrid system combining the baseline and the proposed algorithm on the AURORA-2 database, which yielded further improvement over both algorithms.
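
A minimal NumPy sketch of the coupled-dictionary idea described above, under the assumption that non-negative activations estimated against the input dictionary are reused with a jointly sampled output dictionary; dictionary sizes and contents here are random placeholders, not the paper's setup.

```python
# Coupled-dictionary sketch: decompose in the input feature domain, then
# reconstruct in the output domain with the same activations.
import numpy as np

def nn_activations(D_in, x, n_iter=200, eps=1e-12):
    """Non-negative activations minimizing ||x - D_in @ a|| via
    standard NMF-style multiplicative updates."""
    a = np.ones(D_in.shape[1])
    for _ in range(n_iter):
        a *= (D_in.T @ x) / (D_in.T @ (D_in @ a) + eps)
    return a

rng = np.random.default_rng(0)
D_in = rng.random((40, 500))   # 500 atoms in the input feature domain
D_out = rng.random((26, 500))  # the same 500 atoms in the output domain
x = rng.random(40)             # one noisy input frame

a = nn_activations(D_in, x)    # decompose in the input domain
y = D_out @ a                  # estimate obtained directly in the output domain
```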

The paper considers the task of recognizing phonemes and words from a singing input by using a phonetic hidden Markov model recognizer. The system is targeted at both monophonic singing and singing in polyphonic music. A vocal separation algorithm is applied to separate the singing from polyphonic music. Due to the lack of annotated singing databases, the recognizer is trained on speech and linearly adapted to singing. Global adaptation to singing is found to improve singing recognition performance, and further improvement is obtained by gender-specific adaptation. We also study adaptation with multiple base classes defined by either phonetic or acoustic similarity, and we test phoneme-level and word-level n-gram language models. The phoneme language models are trained on the speech database text; the large-vocabulary word-level language model is trained on a database of textual lyrics. Two applications are presented. The recognizer is used to align textual lyrics to vocals in polyphonic music, obtaining an average error of 0.94 seconds for line-level alignment. A query-by-singing retrieval application based on the recognized words is also constructed; in 57% of the cases, the first retrieved song is the correct one.
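
As a hedged illustration of global linear adaptation, the sketch below fits a single affine transform from speech-trained features toward singing features by least squares on paired frames. Note that the paper's system adapts HMM parameters (MLLR-style) rather than transforming features this way, so this is only an analogy; all arrays are placeholders.

```python
# Global affine adaptation sketch: one transform for all features.
import numpy as np

def fit_global_affine(X_speech, X_singing):
    """Fit W, b so that X_speech @ W.T + b approximates X_singing.
    Both inputs are (n_frames, n_dims) aligned feature matrices."""
    n, d = X_speech.shape
    A = np.hstack([X_speech, np.ones((n, 1))])           # append bias column
    sol, *_ = np.linalg.lstsq(A, X_singing, rcond=None)  # shape (d+1, d)
    return sol[:d].T, sol[d]                             # W, b

rng = np.random.default_rng(1)
X_sp, X_si = rng.random((1000, 13)), rng.random((1000, 13))
W, b = fit_global_affine(X_sp, X_si)
adapted = X_sp @ W.T + b   # speech features moved toward the singing domain
```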

2013 IEEE International Workshop on Machine Learning for Signal Processing (MLSP)
Birds have been widely used as biological indicators in ecological research. They respond quickly to environmental changes and can be used to make inferences about other organisms (e.g., the insects they feed on). Traditional methods for collecting data about birds involve costly human effort. A promising alternative is acoustic monitoring. Recording audio of birds has many advantages over human surveys, including increased temporal and spatial resolution and extent, applicability at remote sites, reduced observer bias, and potentially lower cost. However, reliably identifying bird sounds in real-world audio collected in an acoustic monitoring scenario remains an open problem for signal processing and machine learning. Major challenges include multiple simultaneously vocalizing birds, other sources of non-bird sound (e.g., buzzing insects), and background noise such as wind, rain, and motor vehicles.
This paper proposes a multi-stream speech recognition system that combines information from three complementary analysis methods to improve automatic speech recognition in highly noisy and reverberant environments, as featured in the 2011 PASCAL CHiME Challenge. We integrate word predictions by a bidirectional Long Short-Term Memory recurrent neural network and non-negative sparse classification (NSC) into a multi-stream hidden Markov model, using convolutive non-negative matrix factorization (NMF) for speech enhancement. Our results suggest that NMF-based enhancement and NSC are complementary despite their methodological overlap, reaching up to 91.9% average keyword accuracy on the Challenge test set at signal-to-noise ratios from -6 to 9 dB, the best result reported so far on these data.
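
A minimal sketch of the NMF-based enhancement stream, assuming fixed speech and noise bases and plain (non-convolutive) NMF with multiplicative updates; the paper itself uses convolutive NMF, and all matrices here are placeholders.

```python
# NMF-based speech enhancement sketch with a Wiener-like mask.
import numpy as np

def enhance(V, W_speech, W_noise, n_iter=100, eps=1e-12):
    """V: (freq, frames) noisy magnitude spectrogram.
    W_speech/W_noise: fixed basis matrices learned from training data."""
    W = np.hstack([W_speech, W_noise])
    H = np.ones((W.shape[1], V.shape[1]))
    for _ in range(n_iter):                 # multiplicative updates for H
        H *= (W.T @ V) / (W.T @ (W @ H) + eps)
    S = W_speech @ H[:W_speech.shape[1]]    # speech magnitude estimate
    N = W_noise @ H[W_speech.shape[1]:]     # noise magnitude estimate
    return V * S / (S + N + eps)            # Wiener-like masking of the input
```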
Proc. International Conference on Speech and Language Processing, 2010
This paper proposes to use non-negative matrix factorization based speech enhancement in robust automatic recognition of mixtures of speech and music. We represent the magnitude spectra of noisy speech signals as non-negative weighted linear combinations of speech and noise spectral basis vectors obtained from training corpora of speech and music. We use overcomplete dictionaries consisting of random exemplars of the training data. The method is tested on the Wall Street Journal large ...
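
A small sketch of the overcomplete-dictionary construction mentioned above: atoms are random exemplar frames drawn from the speech and music training spectrograms. All array shapes and contents are illustrative stand-ins, not the paper's configuration.

```python
# Build overcomplete dictionaries by sampling random exemplar frames.
import numpy as np

def sample_exemplars(spectrogram, n_atoms, rng):
    """Pick n_atoms random frames (columns) as dictionary atoms."""
    idx = rng.integers(0, spectrogram.shape[1], size=n_atoms)
    return spectrogram[:, idx]

rng = np.random.default_rng(2)
speech_train = rng.random((257, 10000))  # stand-in magnitude spectrograms
music_train = rng.random((257, 10000))
W_speech = sample_exemplars(speech_train, 2000, rng)
W_music = sample_exemplars(music_train, 2000, rng)
# W_speech / W_music can then serve as the fixed bases in an NMF
# decomposition like the enhancement sketch shown earlier.
```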
IEEE Transactions on Audio, Speech, and Language Processing, Jul 1, 2010
Voice conversion can be formulated as finding a mapping function which transforms the features of the source speaker to those of the target speaker. Gaussian mixture model (GMM)-based conversion is commonly used, but it is prone to overfitting. In this paper, we propose to use partial least squares (PLS)-based transforms in voice conversion. To prevent overfitting, the degrees of freedom in the mapping can be controlled by choosing a suitable number of components. We propose a technique to combine PLS with GMMs, enabling ...
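
A hedged sketch of the PLS mapping step using scikit-learn's PLSRegression, where the number of latent components controls the degrees of freedom, which is the overfitting control described above. The GMM combination proposed in the paper is omitted, and the feature matrices are placeholders.

```python
# PLS-based feature mapping sketch for voice conversion.
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(3)
X_source = rng.random((500, 24))  # aligned source-speaker feature frames
Y_target = rng.random((500, 24))  # aligned target-speaker feature frames

pls = PLSRegression(n_components=8)  # fewer components -> smoother mapping
pls.fit(X_source, Y_target)
Y_converted = pls.predict(X_source)  # converted (target-like) features
```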
IEEE Transactions on Audio, Speech and Language Processing, 2012
A drawback of many voice conversion algorithms is that they rely on linear models and/or require a lot of tuning. In addition, many of them ignore the inherent time dependency between speech features.

In exemplar-based speech enhancement systems, lower-dimensional features are preferred over full-scale DFT features for their reduced computational complexity and their better generalization to unseen cases. However, to obtain a Wiener-like filter for enhancing the noisy DFT representation, the speech and noise estimates obtained in the feature space must be mapped to the DFT space, which yields a low-rank approximation of the estimates and thus a sub-optimal filter. This paper proposes a novel method using coupled dictionaries, in which the exemplars for the chosen feature space and the DFT space are jointly extracted and the estimates are obtained directly in the DFT space following the decomposition in the feature space. Simulation experiments revealed that the proposed approach, where the activations of exemplars calculated at Mel resolution are used directly to obtain the Wiener filter in the DFT space, improves the signal-to-distortion ratio (SDR) compared to the system without coupled dictionaries. To further motivate the use of coupled dictionaries, the paper also investigates the use of modulation envelope features for exemplar-based speech enhancement.
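
A minimal sketch of the coupled-dictionary Wiener filter idea, assuming activations solved in the Mel domain are reused with coupled DFT atoms to build the mask. Dictionary contents here would be random placeholders rather than the jointly extracted exemplars the paper describes.

```python
# Coupled-dictionary Wiener filtering sketch: Mel-domain decomposition,
# DFT-domain speech/noise estimates, Wiener-like mask at full resolution.
import numpy as np

def coupled_wiener(v_mel, v_dft, D_mel, D_dft, n_sp, n_iter=150, eps=1e-12):
    """v_mel/v_dft: one noisy frame in each domain.
    D_mel/D_dft: coupled dictionaries whose first n_sp atoms are speech
    exemplars and the remaining atoms are noise exemplars."""
    a = np.ones(D_mel.shape[1])
    for _ in range(n_iter):                  # Mel-domain decomposition
        a *= (D_mel.T @ v_mel) / (D_mel.T @ (D_mel @ a) + eps)
    s = D_dft[:, :n_sp] @ a[:n_sp]           # DFT-domain speech estimate
    n = D_dft[:, n_sp:] @ a[n_sp:]           # DFT-domain noise estimate
    return v_dft * s / (s + n + eps)         # Wiener-like filtered frame
```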