Papers by Christian Wellekens

The straightforward way to manipulate compressed audio data is to decode, process and re-encode it. This approach incurs considerable algorithmic delay and complexity. To reduce these drawbacks, processing in the compressed domain has been proposed. However, this approach is not easy to apply: problems arise depending on the processing and coder considered. In this thesis, we are interested in perceptual frequency coders such as MPEG-1 and FTR&D TDAC, and in processing such as filtering and mixing. The main application example considered here is audio processing in a multipoint teleconferencing context. The first problem examined in this thesis is filtering in the subband domain. A generic framework was developed that makes it possible to transpose any rational time-domain filter (FIR or IIR) to the subband domain, for any critically sampled, perfect-reconstruction filter bank. This method was applied to sound spatialisation using HRTF filters in the subband domain. The second problem considered is the summation of encoded audio signals. Many constraints arise depending on the coder considered. The main problem for MPEG-1 Layer I-II is determining the psychoacoustic parameters required for bit allocation. To solve this problem, the proposed algorithm estimates the masking thresholds of the individual signals and then combines them. A new bit rate reduction method was also derived from this algorithm. For the FTR&D TDAC coder, the complexity of the summation procedure is reduced by exploiting masking phenomena between the different signals and the coder's particular structure. It takes advantage of the embedded codebook property of the vector quantization used in this case. Compressed-domain processing is illustrated by an implementation in an audio bridge for multipoint teleconferencing. This audio bridge provides mixing, recovery of frames erased by packet loss on networks with non-guaranteed QoS, and management of discontinuous transmission.

Rétroaction à hypothèses multiples pour la reconnaissance robuste de la parole à l'aide d'un réseau de microphones d'entrée
Recognizing speech in real environments becomes harder as the noise level increases and as the speaker moves farther from the microphone. Recent studies have shown that speech quality, in terms of signal-to-noise ratio (SNR), can be improved by using microphone arrays. By exploiting the spatial correlation between the multichannel signals, the array can be steered toward the speaker (beamforming). This can be achieved by exploiting destructive interference between noise channels with the delay-and-sum technique, in which the delays between sensors are estimated and applied to the signal of each channel. In another method, one filter is applied per channel (filter-and-sum); these filters are fixed or adaptive, on a per-channel or even per-frame basis depending on the chosen criterion. In this work, we address the observed problem that an increase in SNR does not automatically lead to an increase in recognition rates...
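The delay-and-sum technique mentioned in this abstract can be illustrated with a minimal sketch. It assumes the inter-sensor delays are already known and are whole numbers of samples; the function name `delay_and_sum` and the toy signals are hypothetical and are not taken from the paper, which also covers delay estimation and filter-and-sum variants.

```python
def delay_and_sum(channels, delays):
    """Average the channels after advancing each one by its integer delay.

    channels: list of equal-length lists of samples, one per microphone.
    delays:   non-negative integer delay (in samples) of each microphone,
              assumed known here rather than estimated.
    """
    if len(channels) != len(delays):
        raise ValueError("need one delay per channel")
    n = len(channels[0])
    out_len = n - max(delays)  # keep only the region where every channel has data
    output = []
    for t in range(out_len):
        total = sum(ch[t + d] for ch, d in zip(channels, delays))
        output.append(total / len(channels))
    return output


# Toy example: the same pulse reaches microphone 2 one sample after microphone 1.
mic1 = [0.0, 1.0, 0.0, 0.0, 0.0]
mic2 = [0.0, 0.0, 1.0, 0.0, 0.0]
aligned = delay_and_sum([mic1, mic2], [0, 1])    # pulse adds coherently
unaligned = delay_and_sum([mic1, mic2], [0, 0])  # pulse is smeared and halved
```

With the correct delays the pulse keeps its full amplitude, while summing without alignment spreads it over two samples at half amplitude. This is the effect that steers the array toward the speaker.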

A fixed scale (typically 25 ms) short-time spectral analysis of speech signals, which are inherently multi-scale in nature [7] (typically vowels last for 40-80 ms while stops last for 10-20 ms), is clearly sub-optimal for time-frequency resolution. In this work, we detect piecewise quasi-stationary speech segments based on the likelihood of that segment, which in turn is estimated from the linear prediction (LP) residual error. A window size equal in length to that of the detected quasi-stationary segment is used to obtain its spectral estimate. Such an approach adaptively chooses the largest possible window size such that the signal remains quasi-stationary within this window and excludes the adjoining quasi-stationary segments from this window. In experiments, it is shown that the proposed multi-scale piecewise stationary spectral analysis based features improve recognition accuracy in clean conditions when compared directly to features based on fixed scale spectral analysis.
This paper compares the Bayesian Information Criterion (BIC) and the Variational Bayesian (VB) approach for scoring speaker clusterings when the number of speakers is unknown. Variational Bayesian learning is a very effective method that allows parameter learning and model selection at the same time. The application we consider here consists in finding the optimal clustering in a conversation where the number of speakers is not known a priori. Experiments are run on synthetic data and on the NIST-1996 HUB-4 evaluation data set. VB learning achieves higher scores in terms of average cluster purity and average speaker purity than ML/BIC.
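To make the BIC baseline concrete, here is a minimal sketch of a BIC-based merge test between two clusters. It assumes each cluster is modeled by a single one-dimensional Gaussian fitted by maximum likelihood, which is a simplification for illustration and not necessarily the model used in the paper. The function names, the `penalty_weight` parameter and the toy data are hypothetical.

```python
import math


def ml_variance(xs):
    """Maximum-likelihood (biased) variance of a list of numbers."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)


def delta_bic(a, b, penalty_weight=1.0):
    """BIC of modeling a and b separately minus BIC of modeling them jointly.

    Each model is a 1-D Gaussian (assumption). A positive value favors two
    separate clusters; a negative value favors merging. Clusters must have
    non-zero variance, otherwise the logarithm is undefined.
    """
    n_a, n_b = len(a), len(b)
    n = n_a + n_b
    # Log-likelihood gain of two Gaussians over one Gaussian.
    gain = 0.5 * (n * math.log(ml_variance(a + b))
                  - n_a * math.log(ml_variance(a))
                  - n_b * math.log(ml_variance(b)))
    # Two extra free parameters (one mean, one variance) in the split model.
    penalty = 0.5 * penalty_weight * 2 * math.log(n)
    return gain - penalty


speaker_1 = [0.0, 1.0, -1.0, 0.5, -0.5]
speaker_1_again = [0.2, -0.8, 0.9, -0.4, 0.6]
speaker_2 = [10.0, 11.0, 9.0, 10.5, 9.5]

split_score = delta_bic(speaker_1, speaker_2)        # well-separated data
merge_score = delta_bic(speaker_1, speaker_1_again)  # similar data
```

In this toy case the well-separated clusters give a positive score (keep them apart) and the similar clusters give a negative one (merge them). In practice the penalty weight is a tuning parameter of the BIC approach.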
Proceedings of the EURASIP Workshop 1990 on Neural Networks

Différentes stratégies pour le suivi du locuteur, Reconnaissances des Formes et Intelligence Artificielle
This work addresses speaker tracking, which is closely related to speaker indexing. The task consists in detecting the recorded segments uttered by a given speaker. In this approach, only the model of the target speaker is available and only the documents uttered by this given speaker are taken into account. In this paper, two different strategies are explored to set up systems for speaker tracking. The first one relies on a speaker indexing tool. Speaker turns are detected in the front-end of the system without any knowledge of possible speakers. Once the signal has been segmented, a classical speaker verification process is applied to each segment and checks whether this segment corresponds to the target speaker. The second solution is derived from a segmental speaker recognition system in which only the decision step is adapted to the task at hand. In this case, the decision on the presence of the target speaker in the recording is based on the whole recorded document. Segments correspon...
L'internationalisation par les télécommunications et pour les télécommunications
The ISCA technical and research workshop : adaptation methods in automatic speech recognition, Sophia-Antipolis, France

Neural Networks, EURASIP Workshop 1990, Sesimbra, Portugal, February 15-17, 1990, Proceedings
Eurasip Workshop, 1990
When are k-nearest neighbor and back propagation accurate for feasible sized sets of examples?.- Complexity theory of neural networks and classification problems.- Generalization performance of overtrained back-propagation networks.- Stability of the random neural network model.- Temporal pattern recognition using EBPS.- Markovian spatial properties of a random field describing a stochastic neural network: Sequential or parallel implementation?.- Chaos in neural networks.- The "moving targets" training algorithm.- Acceleration techniques for the backpropagation algorithm.- Rule-injection hints as a means of improving network performance and learning time.- Inversion in time.- Cellular neural networks: Dynamic properties and adaptive learning algorithm.- Improved simulated annealing, Boltzmann machine, and attributed graph matching.- Artificial dendritic learning.- A neural net model of human short-term memory development.- Large vocabulary speech recognition using neural-fuzzy and concept networks.- Speech feature extraction using neural networks.- Neural network based continuous speech recognition by combining self organizing feature maps and Hidden Markov Modeling.- Ultra-small implementation of a neural halftoning technique.- Application of self-organising networks to signal processing.- A study of neural network applications to signal processing.- Simulation machine and integrated implementation of neural networks.- VLSI implementation of an associative memory based on distributed storage of information.
Amélioration des taux de reconnaissance par filtrage de données
La conception des filtres approximant simultanément des exigences d'affaiblissement et de phase
Thesis, École polytechnique fédérale de Lausanne (EPFL), no. 189 (1974). Reference: doi:10.5075/epfl-thesis-189. Record created on 2005-03-16, modified on 2016-08-08.
2006 IEEE International Conference on Acoustics, Speech and Signal Processing Proceedings
This paper briefly reviews the state of the art on sources of speech variability in automatic speech recognition (ASR) systems. It focuses on some variations within the speech signal that make the ASR task difficult. The variations detailed in the paper are intrinsic to the speech and affect the different levels of the ASR processing chain. For different sources of speech variation, the paper summarizes the current knowledge and highlights specific feature extraction or modeling weaknesses and current trends.
Several recent results demonstrate improvement of recognition scores if some FIR filtering is applied to the trajectories of feature vectors. This paper presents a new approach in which the characteristics of the filters are trained together with the HMM parameters, resulting in improved recognition in first tests. Reestimation formulas are derived for the cut-off frequencies of ideal LP-filters as well as for the impulse response coefficients of a general FIR LP-filter.
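As an illustration of what FIR filtering of feature trajectories means, the sketch below applies a fixed causal FIR filter along the time axis of each feature dimension. It is only a sketch under stated assumptions: the filter taps are chosen by hand rather than trained jointly with HMM parameters as in the paper, and the function name `filter_trajectories` and the toy frames are hypothetical.

```python
def filter_trajectories(frames, taps):
    """Apply a causal FIR filter along time to every feature dimension.

    frames: list of feature vectors (lists of floats), one per time frame.
    taps:   FIR impulse response; taps[k] weights the frame k steps in the past.
    Frames before the start of the sequence are treated as zero.
    """
    n_dims = len(frames[0])
    filtered = []
    for t in range(len(frames)):
        out = [0.0] * n_dims
        for k, h in enumerate(taps):
            if t - k >= 0:
                for d in range(n_dims):
                    out[d] += h * frames[t - k][d]
        filtered.append(out)
    return filtered


# Dimension 0 oscillates quickly, dimension 1 is constant.
frames = [[0.0, 2.0], [4.0, 2.0], [0.0, 2.0], [4.0, 2.0]]
# A two-tap moving average acts as a simple low-pass (LP) filter.
smoothed = filter_trajectories(frames, [0.5, 0.5])
```

After the first frame, the fast oscillation in dimension 0 is flattened to its mean while the constant dimension passes through unchanged, which is the low-pass behavior the abstract refers to.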
Regroupement de modèles Bayesienne variationnelle: applications à la sélection de modèles par méthode variationnelle pour l'indexation audio

Statistical Inference in Multilayer Perceptrons and Hidden Markov Models with Applications in Continuous Speech Recognition
Neurocomputing, 1990
Speech recognition must contend with the statistical and sequential nature of the human speech production system. Hidden Markov Models (HMM) provide a powerful method to cope with both of these, and their use made a breakthrough in speech recognition. However, the a priori choice of a model topology and weak discriminative power limit HMM capabilities. Recently, connectionist models have been recognized as an alternative tool. Their main useful properties lie in their discriminative power while capturing input-output relations. They have also proved useful in dealing with statistical data. However, the sequential aspect remains difficult to handle in connectionist models. The statistical use of a particular classic form of a connectionist system, the Multilayer Perceptron (MLP), is described in the context of the recognition of continuous speech. Relations with Hidden Markov Models are explained and preliminary results are reported.
REMAP for video soundtrack indexing
1997 IEEE International Conference on Acoustics, Speech, and Signal Processing
Indexing of video soundtracks is an important issue for navigation in multimedia databases. Based on word-spotting techniques, it should meet very constraining specifications, namely fast response to queries, concise processed speech information to limit storage memory, speaker-independent mode, and easy characterization of any word by its phonemic spelling. A solution based on phonemic lattices and on a division

Keyword spotting for multimedia document indexing
Multimedia Storage and Archiving Systems II, 1997
We tackle the problem of multimedia indexing using keyword spotting on the spoken part of the data. Word spotting systems for indexing have to meet very hard specifications: short response times to queries, speaker-independent mode, and open vocabulary in order to be able to track any keyword. To meet these constraints, keyword models should be built according to their phonetic spelling and the process should be divided into two parts: preprocessing of the speech signal and query over a lattice of hypotheses. Different classification criteria have been studied for hypothesis generation: frame labeling, maximum likelihood and maximum a posteriori (MAP). The hypothesis probability is computed either through a standard Gaussian model or through a hybrid Hidden Markov Model-Neural Network. The training of the phonemic models is based either on Viterbi alignment or on recursive estimation and maximization of a posteriori probabilities. In the latter, discriminant properties between phonemes are enforced. Tests have been conducted on the TIMIT database as well as on TV news soundtracks. Interesting results have been obtained in time saving for the documentalist. The ultimate goal is to couple the soundtrack indexing with tools for video indexing in order to enhance the robustness of the system.

Speech Communication, 2007
Major progress is being recorded regularly on both the technology and exploitation of Automatic Speech Recognition (ASR) and spoken language systems. However, there are still technological barriers to flexible solutions and user satisfaction under some circumstances. This is related to several factors, such as the sensitivity to the environment (background noise), or the weak representation of grammatical and semantic knowledge. Current research is also emphasizing deficiencies in dealing with variation naturally present in speech. For instance, the lack of robustness to foreign accents precludes the use by specific populations. Also, some applications, like directory assistance, particularly stress the core recognition technology due to the very high active vocabulary (application perplexity). There are actually many factors affecting the speech realization: regional, sociolinguistic, or related to the environment or the speaker herself. These create a wide range of variations that may not be modeled correctly (speaker, gender, speaking rate, vocal effort, regional accent, speaking style, non-stationarity...), especially when resources for system training are scarce. This paper outlines current advances related to these topics.
Explicit correlation in hidden Markov model for speech recognition
Proceedings of ESANN, 1998
As an introduction to a session dedicated to neural networks in speech processing, this paper describes the basic problems faced in automatic speech recognition (ASR). Representation of speech, classification problems, speech unit models, training procedures and criteria are discussed. Why and how neural networks lead to challenging results in ASR is explained.