Papers by Gerasimos Potamianos
The IBM RT06s Evaluation System for Speech Activity Detection in CHIL Seminars
Lecture Notes in Computer Science, 2006

Lecture Notes in Computer Science, 2006
An important step to bring speech technologies into wide deployment as a functional component in man-machine interfaces is to free the users from close-talk or desktop microphones, and enable far-field operation in various natural communication environments. In this work, we consider far-field automatic speech recognition and speech activity detection in conference rooms. The experiments are conducted on the smart room platform provided by the CHIL project. The first half of the paper addresses the development of speech recognition systems for the seminar transcription task. In particular, we look into the effect of combining parallel recognizers in both single-channel and multi-channel settings. In the second half of the paper, we describe a novel algorithm for speech activity detection based on fusing phonetic likelihood scores and energy features. It is shown that the proposed technique is able to handle non-stationary noise events and achieves good performance on the CHIL seminar corpus.
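As a rough illustration of the kind of score fusion the abstract describes, the sketch below combines a per-frame phonetic likelihood score with frame log-energy and thresholds the smoothed result. The weighting, normalization, and median smoothing here are illustrative assumptions, not the paper's actual system.

```python
import numpy as np

def detect_speech(phone_loglik, log_energy, w=0.7, threshold=0.0, smooth=11):
    """Fuse per-frame phonetic likelihood scores with log-energy for
    speech activity detection (illustrative sketch, not the paper's system).

    phone_loglik : (T,) best speech-phone log-likelihood minus the
                   non-speech model log-likelihood, per frame
    log_energy   : (T,) frame log-energy
    """
    # Normalize each score stream to zero mean, unit variance so the
    # fusion weight w is comparable across streams.
    z = lambda x: (x - x.mean()) / (x.std() + 1e-8)
    fused = w * z(phone_loglik) + (1.0 - w) * z(log_energy)

    # Median-filter the fused score to suppress short, non-stationary
    # noise bursts before thresholding.
    pad = smooth // 2
    padded = np.pad(fused, pad, mode="edge")
    smoothed = np.array([np.median(padded[t:t + smooth])
                         for t in range(len(fused))])
    return smoothed > threshold  # True = speech frame
```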

2007 IEEE Workshop on Motion and Video Computing (WMVC'07), 2007
We present a computer vision system to robustly track an object in 3D by combining evidence from multiple calibrated cameras. Its novelty lies in the proposed unified approach to 3D kernel based tracking, which amounts to fusing the appearance features from all available camera sensors, as opposed to tracking the object appearance in the individual 2D views and fusing the results. The elegance of the method resides in its inherent ability to handle problems encountered by various 2D trackers, including scale selection, occlusion, view-dependence, and correspondence across different views. We apply the method on the CHIL project database for tracking the presenter's head during lectures inside smart rooms equipped with four calibrated cameras. Compared to traditional 2D based mean shift tracking approaches, the proposed algorithm results in a 35% relative reduction in overall 3D tracking error and a 70% reduction in the number of tracker re-initializations.
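To make the fusion idea concrete, the following sketch scores candidate 3D positions by summing appearance similarity across all camera views, rather than tracking in each 2D view separately. It uses a grid search over 3D offsets instead of mean-shift iterations for brevity; the function names, histogram similarity choice, and candidate-search scheme are assumptions, not the paper's implementation.

```python
import numpy as np

def project(P, X):
    """Project a 3D point X (3,) with a 3x4 camera matrix P to pixel coords."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

def bhattacharyya(p, q):
    """Similarity between two normalized color histograms."""
    return np.sum(np.sqrt(p * q))

def track_step(X_prev, cameras, target_hists, candidate_offsets, hist_fn):
    """One update of a fused multi-camera kernel tracker (illustrative sketch).

    cameras          : list of 3x4 projection matrices
    target_hists     : per-camera reference color histograms of the target
    candidate_offsets: (N, 3) array of 3D displacements to evaluate
    hist_fn          : hist_fn(cam_index, pixel) -> normalized histogram of
                       the kernel region around `pixel` in that camera's frame
    """
    best_X, best_score = X_prev, -np.inf
    for d in candidate_offsets:
        X = X_prev + d
        # Fuse appearance evidence: sum similarity over all camera views.
        score = sum(bhattacharyya(hist_fn(c, project(P, X)), target_hists[c])
                    for c, P in enumerate(cameras))
        if score > best_score:
            best_X, best_score = X, score
    return best_X
```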

Lecture Notes in Computer Science, 2006
We describe the IBM systems submitted to the NIST RT06s Speech-to-Text (STT) evaluation campaign on the CHIL lecture meeting data for three conditions: multiple distant microphone (MDM), single distant microphone (SDM), and individual headset microphone (IHM). The system building process is similar to that of the IBM conversational telephone speech recognition system. However, the best models for the far-field conditions (SDM and MDM) proved to be the ones that use neither variance normalization nor vocal tract length normalization. Instead, feature-space minimum phone error discriminative training yielded the best results. Due to the relatively small amount of CHIL-domain data, the acoustic models of our systems are built on publicly available meeting corpora, with maximum a posteriori (MAP) adaptation applied twice on CHIL data during training: first to the initial speaker-independent model, and subsequently to the minimum phone error model. For language modeling, we utilized meeting transcripts, text from scientific conference proceedings, and spontaneous telephone conversations. On development data, chosen in our work to be the 2005 CHIL-internal STT evaluation test set, the resulting language model provided a 4% absolute improvement in word error rate (WER) compared to the model used in the previous year's CHIL evaluation. Furthermore, the developed STT system significantly outperformed our previous year's results, reducing close-talking microphone WER from 36.9% to 25.4% on our development set. In the NIST RT06s evaluation campaign, both the MDM and SDM systems scored well; however, the IHM system performed poorly due to unsuccessful cross-talk removal.
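For reference, the MAP adaptation of Gaussian means mentioned above has a standard form; this is the textbook relevance-MAP update, not notation taken from the paper:

```latex
\hat{\mu}_m \;=\; \frac{\tau\,\mu_m + \sum_{t} \gamma_m(t)\,x_t}{\tau + \sum_{t} \gamma_m(t)}
```

where $\mu_m$ is the prior mean of Gaussian $m$ (here, trained on the public meeting corpora), $x_t$ are the CHIL adaptation frames, $\gamma_m(t)$ is the posterior occupancy of component $m$ at frame $t$, and $\tau$ is the relevance factor controlling how strongly the prior is retained.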
A Joint System for Person Tracking and Face Detection
Lecture Notes in Computer Science, 2005
Visual detection and tracking of humans in complex scenes is a challenging problem with a wide range of applications, for example surveillance and human-computer interaction. In many such applications, time-synchronous views from multiple calibrated cameras are available, and both frame-view and space-level human location information is desired. In such scenarios, efficiently combining the strengths of face detection …
In recent work, we have concentrated on the problem of lipreading from non-frontal views (poses). In particular, we have focused on the use of profile views, and proposed two approaches for lipreading on the basis of visual features extracted from such views: (a) direct statistical modeling of the features, namely the use of view-dependent statistical models; and (b) normalization of such features …
A unified approach to multi-pose audio-visual ASR
The vast majority of studies in the field of audio-visual automatic speech recognition (AVASR) assumes frontal images of a speaker's face, but this cannot always be guaranteed in practice. Hence our recent research efforts have concentrated on extracting visual speech information from non-frontal faces, in particular the profile view. The introduction of additional views to an AVASR …
2006 IEEE International Conference on Multisensor Fusion and Integration for Intelligent Systems, 2006
Visual information from a speaker's mouth region is known to improve automatic speech recognition robustness. However, the vast majority of audio-visual automatic speech recognition (AVASR) studies assume frontal images of the speaker's face, which is not always the case in realistic human-computer interaction (HCI) scenarios. One such case of interest is HCI inside smart rooms, equipped with pan-tilt-zoom (PTZ) cameras …
Robust audio-visual speech synchrony detection by generalized bimodal linear prediction
Annual Conference of the International Speech Communication Association, 2009
We study the problem of detecting audio-visual synchrony in video segments containing a speaker in frontal head pose. The problem holds a number of important applications, for example speech source localization, speech activity detection, speaker diarization, speech source separation, and biometric spoofing detection. In particular, we build on earlier work, extending our previously proposed time-evolution model of audio-visual …
It is a common experience in our modern world for humans to be overwhelmed by the complexities of technological artifacts around us and by the attention they demand. While technology provides wonderful support and helpful assistance, it also gives rise to an increased preoccupation with technology itself and with a related fragmentation of attention. But, as humans, we would rather attend to a meaningful dialog and interaction with other humans than to control the operations of machines that serve us. The cause for such …
Audio-visual speech synchronization detection using a bimodal linear prediction model
Computer Vision and Pattern Recognition, 2009
In this work, we study the problem of detecting audio-visual (AV) synchronization in video segments containing a speaker in frontal head pose. The problem holds important applications in biometrics, for example spoofing detection, and it constitutes an important step in AV segmentation necessary for deriving AV fingerprints in multimodal speaker recognition. To attack the problem, we propose a time-evolution model …
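One way to picture a linear-prediction synchrony test of the kind named in the title: fit two least-squares predictors of the current audio feature, one from past audio only and one from past audio plus past video, and take the drop in residual error when video is added as the synchrony score. This is a minimal sketch under those assumptions, not the paper's exact time-evolution model; the feature choices are hypothetical.

```python
import numpy as np

def synchrony_score(audio, video, order=3):
    """Score AV synchrony via bimodal linear prediction (illustrative sketch).

    audio : (T,) 1-D audio feature track (e.g., log-energy)
    video : (T,) 1-D visual feature track (e.g., a mouth-opening measure)
    """
    T = len(audio)
    y = audio[order:]
    # Regressor rows: audio history only, and audio + video history.
    A = np.array([audio[t - order:t] for t in range(order, T)])
    AV = np.array([np.concatenate([audio[t - order:t],
                                   video[t - order:t]])
                   for t in range(order, T)])

    def residual(X):
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        return np.mean((y - X @ coef) ** 2)

    e_a, e_av = residual(A), residual(AV)
    # Relative error reduction from adding video: larger = more synchronous.
    return (e_a - e_av) / (e_a + 1e-12)
```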
Asynchrony modeling for audio-visual speech recognition
Human Language Technology, 2002
We investigate the use of multi-stream HMMs in the automatic recognition of audio-visual speech. Multi-stream HMMs allow the modeling of asynchrony between the audio and visual state sequences at a variety of levels (phone, syllable, word, etc.) and are equivalent to product, or composite, HMMs. In this paper, we consider such models synchronized at the phone boundary level, allowing various …
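For reference, the standard multi-stream emission such models build on is a weighted product over streams (this is the usual textbook formulation, not notation taken from the paper):

```latex
b_{c}(\mathbf{o}_t) \;=\; \prod_{s \in \{a,\,v\}} \big[\, b_{c_s}(\mathbf{o}_{s,t}) \,\big]^{\lambda_s},
\qquad \lambda_a + \lambda_v = 1,
```

where the composite state $c = (c_a, c_v)$ pairs an audio HMM state with a visual one. Allowing $c_a$ and $c_v$ to evolve independently within a unit, but forcing them to resynchronize at its boundary, gives the phone-level (or syllable-, word-level) asynchrony model the abstract refers to.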
Stream confidence estimation for audio-visual speech recognition
We investigate the use of single modality confidence measures as a means of estimating adaptive, local weights for improved audio-visual automatic speech recognition. We limit our work to the toy problem of audio-visual phonetic classification by means of a two-stream Gaussian mixture model (GMM), where each stream models the class conditional audio- or visual-only observation probability, raised to an appropriate exponent.
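A minimal sketch of such confidence-weighted stream fusion, assuming per-class log-likelihoods from each stream's GMM are already computed: the posterior peakedness used as a confidence measure here is one plausible choice, not the paper's exact estimator.

```python
import numpy as np

def classify_av(loglik_audio, loglik_video):
    """Two-stream phonetic classification with confidence-based stream
    exponents (illustrative sketch).

    loglik_audio, loglik_video : (C,) class-conditional log-likelihoods of
    the current observation under the audio and visual GMMs, respectively.
    """
    def confidence(ll):
        # Peakedness (negative entropy, shifted to be >= 0) of the class
        # posterior: 0 for a uniform posterior, log C for a one-hot one.
        p = np.exp(ll - ll.max())
        p /= p.sum()
        return np.log(len(ll)) + np.sum(p * np.log(p + 1e-12))

    ca, cv = confidence(loglik_audio), confidence(loglik_video)
    lam = ca / (ca + cv + 1e-12)  # audio stream exponent, in [0, 1]
    # Exponent weighting of likelihoods = weighted sum in the log domain.
    fused = lam * loglik_audio + (1.0 - lam) * loglik_video
    return np.argmax(fused), lam
```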
Efficient likelihood computation in multi-stream HMM based audio-visual speech recognition
ACM Transactions on Speech and Language Processing, 2004
Multi-stream hidden Markov models have recently been introduced in the field of automatic speech recognition as an alternative to single-stream modeling of sequences of speech informative features. In particular, they have been very successful in audio-visual speech recognition, where features extracted from video of the speaker's lips are also available. However, in contrast to single-stream modeling, …
International Conference on Acoustics, Speech, and Signal Processing, 2001
We propose the use of a hierarchical, two-stage discriminant transformation for obtaining audio-visual features that improve automatic speech recognition. Linear discriminant analysis (LDA), followed by a maximum likelihood linear transform (MLLT), is first applied to MFCC based audio-only features, as well as to visual-only features obtained by a discrete cosine transform of the video region of interest. Subsequently, a …
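A minimal sketch of the two-stage discriminant cascade, using scikit-learn's LDA; the per-stage MLLT rotation is omitted here (it has no off-the-shelf sklearn equivalent), and the dimensionality is an illustrative assumption that presumes the number of phonetic classes exceeds it.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

def fit_hierarchical_lda(audio_feats, video_feats, labels, n_dims=40):
    """Two-stage discriminant audio-visual features (illustrative sketch;
    the MLLT rotation applied after each LDA in the paper is omitted).

    Stage 1: separate LDA projections per modality.
    Stage 2: a second LDA on the concatenated projected streams.
    """
    lda_a = LDA(n_components=n_dims).fit(audio_feats, labels)
    lda_v = LDA(n_components=n_dims).fit(video_feats, labels)
    joint = np.hstack([lda_a.transform(audio_feats),
                       lda_v.transform(video_feats)])
    lda_av = LDA(n_components=n_dims).fit(joint, labels)
    return lda_a, lda_v, lda_av

def transform_av(lda_a, lda_v, lda_av, audio_feats, video_feats):
    """Map raw audio/visual features to the fused discriminant space."""
    joint = np.hstack([lda_a.transform(audio_feats),
                       lda_v.transform(video_feats)])
    return lda_av.transform(joint)
```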
International Conference on Acoustics, Speech, and Signal Processing, 2001
This paper addresses the problem of audio-visual information fusion to provide highly robust speech recognition. We investigate methods that make different assumptions about asynchrony and conditional dependence across streams, and propose a technique based on composite HMMs that can account for stream asynchrony and different levels of information integration. We show how these models can be trained jointly based on maximum likelihood …
We have made significant progress in automatic speech recognition (ASR) for well-defined applications like dictation and medium vocabulary transaction processing tasks in relatively controlled environments. However, for ASR to approach human levels of performance and for speech to become a truly pervasive user interface, we need novel, nontraditional approaches that have the potential of yielding dramatic ASR improvements. Visual speech …
Acoustic fall detection using Gaussian mixture models and GMM supervectors
2009 IEEE International Conference on Acoustics, Speech and Signal Processing, 2009
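As a rough sketch of the GMM-supervector approach named in the title: fit a background GMM (UBM) on general audio, MAP-adapt its means to each segment, and classify the stacked means with a standard classifier. Everything below (component count, relevance factor, linear SVM) is an illustrative assumption, not the paper's system.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.svm import SVC

def gmm_supervector(ubm, frames, relevance=16.0):
    """MAP-adapt the UBM means to one audio segment and stack them into a
    supervector (illustrative sketch of the GMM-supervector approach).

    ubm    : fitted sklearn GaussianMixture (the universal background model)
    frames : (T, D) acoustic feature frames of the segment
    """
    post = ubm.predict_proba(frames)       # (T, M) component occupancies
    n = post.sum(axis=0)                   # (M,) soft counts
    fx = post.T @ frames                   # (M, D) first-order statistics
    alpha = (n / (n + relevance))[:, None]
    # Relevance-MAP mean update: interpolate segment stats with UBM means.
    adapted = alpha * (fx / np.maximum(n[:, None], 1e-8)) \
              + (1.0 - alpha) * ubm.means_
    return adapted.ravel()

# Usage sketch: fit the UBM on background audio, build one supervector per
# segment, then train a standard SVM on the supervectors.
# ubm = GaussianMixture(n_components=64, covariance_type="diag").fit(bg_frames)
# X = np.stack([gmm_supervector(ubm, seg) for seg in segments])
# clf = SVC(kernel="linear").fit(X, fall_labels)
```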