2003, Computer Graphics Forum
Visemes are the visual counterparts of phonemes. Traditionally, the speech animation of 3D synthetic faces involves extraction of visemes from input speech followed by the application of co-articulation rules to generate realistic animation. In this paper, we take a novel approach for speech animation: using visyllables, the visual counterpart of syllables. The approach results in a concatenative visyllable-based speech animation system. The key contribution of this paper lies in two main areas. Firstly, we define a set of visyllable units for spoken English along with the associated phonological rules for valid syllables. Based on these rules, we have implemented a syllabification algorithm that allows segmentation of a given phoneme stream into syllables and subsequently visyllables. Secondly, we have recorded a database of visyllables using a facial motion capture system. The recorded visyllable units are post-processed semi-automatically to ensure continuity at the vowel boundaries of t...
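The syllabification step described above can be illustrated with a minimal sketch based on the maximal-onset principle: each vowel nucleus claims the longest legal consonant cluster to its left. The `VOWELS` and `ONSETS` sets below are illustrative placeholders, not the paper's actual phonological rule set for English.

```python
# Toy syllabifier: maximal-onset principle over a phoneme list.
# VOWELS and ONSETS are illustrative, not the paper's rule set.
VOWELS = {"a", "e", "i", "o", "u"}
ONSETS = {"", "b", "k", "l", "s", "t", "r", "sk", "st", "tr", "bl", "kl", "str"}

def syllabify(phonemes):
    """Split a phoneme list into syllables, assigning intervocalic
    consonants to the longest valid onset of the following syllable."""
    nuclei = [i for i, p in enumerate(phonemes) if p in VOWELS]
    if not nuclei:
        return [phonemes]
    boundaries = [0]
    for prev, cur in zip(nuclei, nuclei[1:]):
        cut = prev + 1                      # default: earliest possible cut
        for j in range(prev + 1, cur + 1):  # try longest onset first
            if "".join(phonemes[j:cur]) in ONSETS:
                cut = j                     # "" is in ONSETS, so this always hits
                break
        boundaries.append(cut)
    boundaries.append(len(phonemes))
    return [phonemes[a:b] for a, b in zip(boundaries, boundaries[1:])]
```

For example, `syllabify(["a", "s", "t", "r", "a"])` yields `a.stra`, since `str` is the longest legal onset for the second nucleus; the resulting syllables would then be mapped to visyllable units from the recorded database.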
Computers & Graphics, 2006
This paper presents a novel approach for the generation of realistic speech synchronized 3D facial animation that copes with anticipatory and perseveratory coarticulation. The methodology is based on the measurement of 3D trajectories of fiduciary points marked on the face of a real speaker during the speech production of CVCV nonsense words. The trajectories are measured from standard video sequences using stereo vision photogrammetric techniques. The first stationary point of each trajectory associated with a phonetic segment is selected as its articulatory target. By clustering according to geometric similarity all articulatory targets of a same segment in different phonetic contexts, a set of phonetic context-dependent visemes accounting for coarticulation is identified. These visemes are then used to drive a set of geometric transformation/deformation models that reproduce the rotation and translation of the temporomandibular joint on the 3D virtual face, as well as the behavior of the lips, such as protrusion, and opening width and height of the natural articulation. This approach is being used to generate 3D speech synchronized animation from both natural and synthetic speech generated by a text-to-speech synthesizer.
Proceedings of Computer Animation 2002 (CA 2002), 2002
Animated talking faces can be generated from a set of predefined face and mouth shapes (visemes) by either concatenation or morphing. Each facial image corresponds to one or more phonemes, which are generated in synchrony with the visual changes. Existing implementations require a full set of facial visemes to be captured or created by an artist before the images can be animated. In this work we generate new, photo-realistic visemes from a single neutral face image by transformation using a set of prototype visemes. This allows us to generate visual speech from photographs and portraits where a full set of visemes is not available.
Visual speech animation, also known as lip synchronization, is the process of matching a speech audio file with the lip movements of a synthetic character. Visual speech is a very demanding task: it is done either fully manually, which is very time-consuming, or with automatic methods based on data analysis. Currently there is still no automatic method that generates an arbitrary sequence of visual speech without requiring further fine-tuning. This research focused on the problem of automatically achieving lip-sync and led to a system that relies on speech recognition to obtain the words from audio and maps them to visual poses, thus automatically obtaining visual speech animation. The system also supports language translation, which allows animation in a language different from the one spoken. Automatic visual speech animation has great impact in the entertainment industry, where it can reduce the time required to produce the animation of talking characters.
2000
We present MikeTalk, a text-to-audiovisual speech synthesizer which converts input text into an audiovisual speech stream. MikeTalk is built using visemes, which are a small set of images spanning a large range of mouth shapes. The visemes are acquired from a recorded visual corpus of a human subject which is specifically designed to elicit one instantiation of each viseme. Using optical flow methods, correspondence from every viseme to every other viseme is computed automatically. By morphing along this correspondence, a smooth transition between viseme images may be generated. A complete visual utterance is constructed by concatenating viseme transitions. Finally, phoneme and timing information extracted from a text-to-speech synthesizer is exploited to determine which viseme transitions to use, and the rate at which the morphing process should occur. In this manner, we are able to synchronize the visual speech stream with the audio speech stream, and hence give the impression of a photorealistic talking face.
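The concatenation-of-transitions idea above can be sketched compactly. A plain linear crossfade stands in here for MikeTalk's optical-flow morph; images are assumed to be float arrays, and all names are illustrative.

```python
import numpy as np

def viseme_transition(src, dst, n_frames):
    """Generate a smooth transition between two viseme images.
    A linear crossfade stands in for the optical-flow morph."""
    frames = []
    for i in range(n_frames):
        t = i / (n_frames - 1)          # 0.0 at src, 1.0 at dst
        frames.append((1.0 - t) * src + t * dst)
    return frames

def concatenate_transitions(visemes, n_frames=5):
    """Build a visual utterance by chaining viseme-to-viseme transitions,
    mirroring the concatenative synthesis strategy described above."""
    stream = []
    for a, b in zip(visemes, visemes[1:]):
        stream.extend(viseme_transition(a, b, n_frames))
    return stream
```

In the full system, `n_frames` would be set per transition from the phoneme timing that the text-to-speech synthesizer provides, so the visual stream stays synchronized with the audio.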
2013
In this paper, we propose a 3D facial animation model for simulating visual speech production in the Romanian language. To the best of our knowledge, this is the first study dedicated to 3D visual speech synthesis for the Romanian language. The model is also capable of generating complex emotion activity for the virtual actors during speech. Using a set of existing 3D key shapes representing various facial expressions, fluid animations describing facial activity during speech pronunciation are provided, taking into account several Romanian language coarticulation effects. A novel mathematical model for defining efficient viseme coarticulation functions is provided. The 3D tongue activity can be closely observed in real time while different words are pronounced in Romanian. The framework proposed in this paper is designed to help people with hearing disabilities learn how to articulate correctly in Romanian, and also works as a training assistant for Romanian language lip-reading....
Communications in Computer and Information Science, 2012
Visual speech animation, or lip synchronization, is the process of matching speech with the lip movements of a virtual character. It is a challenging task because all articulatory movements must be controlled and synchronized with the audio signal. Existing language-independent systems usually require fine tuning by an artist to avoid artefacts appearing in the animation. In this paper, we present a modular visual speech animation framework aimed at speeding up and easing the visual speech animation process as compared with traditional techniques. We demonstrate the potential of the framework by developing the first automatic visual speech animation system for European Portuguese based on the concatenation of visemes. We also present the results of a preliminary evaluation that was carried out to assess the quality of two different phoneme-to-viseme mappings devised for the language.
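A phoneme-to-viseme mapping of the kind evaluated above is a many-to-one table: several acoustically distinct phonemes share one visually indistinguishable mouth shape. The groups below are an illustrative sketch, not the actual European Portuguese mapping from the paper.

```python
# Illustrative many-to-one phoneme-to-viseme table (not the paper's mapping).
PHONEME_TO_VISEME = {
    "p": "bilabial", "b": "bilabial", "m": "bilabial",
    "f": "labiodental", "v": "labiodental",
    "t": "alveolar", "d": "alveolar", "n": "alveolar",
    "a": "open_vowel", "e": "mid_vowel", "i": "spread_vowel",
    "o": "rounded_vowel", "u": "rounded_vowel",
}

def phonemes_to_visemes(phonemes, mapping=PHONEME_TO_VISEME):
    """Map a phoneme stream to a viseme stream, collapsing repeats so
    identical consecutive mouth shapes are not re-keyed."""
    visemes = []
    for p in phonemes:
        v = mapping.get(p, "neutral")   # unknown phonemes fall back to neutral
        if not visemes or visemes[-1] != v:
            visemes.append(v)
    return visemes
```

Comparing two candidate mappings then amounts to swapping the table and re-running the perceptual evaluation on the resulting animations.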
2001
The results reported in this article are an integral part of a larger project aimed at achieving perceptually realistic animations, including the individualized nuances, of three-dimensional human faces driven by speech. The audiovisual system that has been developed for learning the spatio-temporal relationship between speech acoustics and facial animation is described, including video and speech processing, pattern analysis, and MPEG-4 compliant facial animation for a given speaker. In particular, we propose a perceptual transformation of the speech spectral envelope, which is shown to capture the dynamics of articulatory movements. An efficient nearest-neighbor algorithm is used to predict novel articulatory trajectories from the speech dynamics. The results are very promising and suggest a new way to approach the modeling of synthetic lip motion of a given speaker driven by his/her speech. This would also provide clues toward a more general cross-speaker realistic animation.
Speech Communication, 2012
Speech visualization is extended to use animated talking heads for computer assisted pronunciation training. In this paper, we design a data-driven 3D talking head system for articulatory animations with synthesized articulator dynamics at the phoneme level. A database of AG500 EMA-recordings of three-dimensional articulatory movements is proposed to explore the distinctions of producing the sounds. Visual synthesis methods are then investigated, including a phoneme-based articulatory model with a modified blending method. A commonly used HMM-based synthesis is also performed with a Maximum Likelihood Parameter Generation algorithm for smoothing. The 3D articulators are then controlled by synthesized articulatory movements, to illustrate both internal and external motions. Experimental results show the performance of the visual synthesis methods in terms of root-mean-square error. A perception test is then presented to evaluate the 3D animations, with a word identification accuracy of 91.6% among 286 tests and an average realism score of 3.5 (1 = bad to 5 = excellent).
This paper addresses the problem of animating a talking figure, such as an avatar, using speech input only. The proposed system is based on Hidden Markov Models for the acoustic observation vectors of the speech sounds that correspond to each of 16 visually distinct mouth shapes (called visemes). This case study illustrates that it is indeed possible to obtain visually relevant speech segmentation data directly from a purely acoustic observation sequence.
IEEE Transactions on Visualization and Computer Graphics, 2006
We present a novel approach to synthesizing accurate visible speech based on searching and concatenating optimal variable-length units in a large corpus of motion capture data. Based on a set of visual prototypes selected on a source face and a corresponding set designated for a target face, we propose a machine learning technique to automatically map the facial motions observed on the source face to the target face. In order to model the long distance coarticulation effects in visible speech, a large-scale corpus that covers the most common syllables in English was collected, annotated and analyzed. For any input text, a search algorithm to locate the optimal sequences of concatenated units for synthesis is described. A new algorithm to adapt lip motions from a generic 3D face model to a specific 3D face model is also proposed. A complete, end-to-end visible speech animation system is implemented based on the approach. This system is currently used in more than 60 kindergarten through third grade classrooms to teach students to read using a lifelike conversational animated agent. To evaluate the quality of the visible speech produced by the animation system, both subjective evaluation and objective evaluation are conducted. The evaluation results show that the proposed approach is accurate and powerful for visible speech synthesis.
IEEE Transactions on Multimedia, 2005
This article presents an integral system capable of generating animations with realistic dynamics, including the individualized nuances, of 3-D human faces driven by speech acoustics. The system is capable of capturing short phenomena in the orofacial dynamics of a given speaker by tracking the 3-D location of various MPEG-4 facial points through stereovision. A perceptual transformation of the speech spectral envelope and prosodic cues are combined into an acoustic feature vector to predict 3-D orofacial dynamics by means of a nearest-neighbor algorithm. The Karhunen-Loève transformation is used to identify the principal components of orofacial motion, decoupling perceptually natural components from experimental noise. We also present a highly optimized MPEG-4 compliant player capable of generating audio-synchronized animations at 60 frames per second. The player is based on a pseudo-muscle model augmented with a non-penetrable ellipsoidal structure to approximate the skull and the jaw. This structure adds a sense of volume that provides more realistic dynamics than existing simplified pseudo-muscle-based approaches, yet it is simple enough to work at the desired frame rate. Experimental results on an audio-visual database of compact TIMIT sentences are presented to illustrate the performance of the complete system.
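The nearest-neighbor prediction step above can be sketched in a few lines. Array names are illustrative, and the paper's feature extraction (perceptual spectral envelope plus prosodic cues) is assumed to have already produced the acoustic feature vectors.

```python
import numpy as np

def predict_trajectory(acoustic_frames, train_acoustic, train_facial):
    """Nearest-neighbor lookup: match each incoming acoustic feature
    vector to its closest training frame (Euclidean distance) and
    return the corresponding 3-D facial configuration.

    acoustic_frames : (T, d) query feature vectors
    train_acoustic  : (N, d) training feature vectors
    train_facial    : (N, k) facial parameters paired frame-by-frame
    """
    predicted = []
    for frame in acoustic_frames:
        dists = np.linalg.norm(train_acoustic - frame, axis=1)
        predicted.append(train_facial[np.argmin(dists)])
    return np.array(predicted)
```

In the full system the predicted facial parameters would then be projected onto the principal components of orofacial motion to suppress experimental noise before driving the MPEG-4 player.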
Proceedings EC-VIP-MC 2003. 4th EURASIP Conference focused on Video/Image Processing and Multimedia Communications (IEEE Cat. No.03EX667)
This paper addresses the problem of animating a talking figure, such as an avatar, using speech input only. The system that was developed is based on Hidden Markov Models for the acoustic observation vectors of the speech sounds that correspond to each of 16 visually distinct mouth shapes (visemes). The acoustic variability with context was taken into account by building acoustic viseme models that are dependent on the left and right viseme contexts. Our experimental results show that it is indeed possible to obtain visually relevant speech segmentation data directly from the purely acoustic speech signal.
ACM SIGCHI Bulletin, 1987
The rise of a new reality where information is accessed ubiquitously from a variety of different devices demands the development of more efficient and intuitive human-machine interfaces. In this context, speech-synchronized facial animation research aims at the development of interactive talking heads capable of reproducing the natural and intuitive face-to-face communication mechanisms. This paper describes a novel approach to the image-based morphing-between-visemes synthesis strategy through the use of context-dependent visemes capable of modeling speech coarticulation effects. The proposed approach was validated through the implementation of a videorealistic 2D speech-synchronized facial animation system for Brazilian Portuguese based on an image database of just 34 photographs and integrated with a text-to-speech (TTS) synthesizer. Speech intelligibility test results show that the level of videorealism obtained by the system is capable of improving message comprehension when audio is degraded by noise, with results similar to those obtained with a video of a real face. The proposed methodology can be adapted to deliver videorealistic animations on devices with limited memory and processing capacity such as mobile phones, PDAs (Personal Digital Assistants) or digital TV set-top boxes.
2011
We automatically generate CG animations to express the pronunciation movement of speech through articulatory feature (AF) extraction to help learn a pronunciation. The proposed system uses MRI data to map AFs to coordinate values that are needed to generate the animations. By using magnetic resonance imaging (MRI) data, we can observe the movements of the tongue, palate, and pharynx in detail while a person utters words. AFs and coordinate values are extracted by multi-layer neural networks (MLN). Specifically, the system displays animations of the pronunciation movements of both the learner and teacher from their speech in order to show in what way the learner’s pronunciation is wrong. Learners can thus understand their wrong pronunciation and the correct pronunciation method through specific animated pronunciations. Experiments to compare MRI data with the generated animations confirmed the accuracy of articulatory features. Additionally, we verified the effectiveness of using AF ...
2003
Abstract The presence of visual information in addition to audio could improve speech understanding in noisy environments. This additional information could be especially useful for people with impaired hearing who are able to speechread. This paper focuses on the problem of synthesizing the facial animation parameters (FAPs), supported by the MPEG-4 standard for the visual representation of speech, from a narrowband acoustic speech (telephone) signal.
2007
Natural looking lip animation, synchronized with incoming speech, is essential for realistic character animation. In this work, we evaluate the performance of phone and viseme based acoustic units, with and without context information, for generating realistic lip synchronization using HMM based recognition systems. We conclude via objective evaluations that utilization of viseme based units with context information outperforms the other methods.
1986
Animation which uses three dimensional computer graphics relies heavily on geometric transformations over time for the motions of camera and objects. To make a figure walk or make a liquid bubble requires sophisticated motion control not usually available in commercial animation systems. This paper describes a way to animate a model of the human face. The animator can build a sequence of facial movements, including speech, by manipulating a small set of keywords. Synchronized speech is possible because the same encoding of the phonetic elements (segments) is used to drive both the animation and a speech generator. Any facial expression, or string of segments, may be given a name and used as a key element. The final animated sequence is then generated automatically by expansion (if necessary) followed by interpolation between the resulting key frames.
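The keyframe expansion-and-interpolation scheme described above can be sketched as linear interpolation between successive key poses. Each pose is represented here as a dict of facial parameter values; the parameter names are placeholders, not the paper's encoding of phonetic segments.

```python
def interpolate_keyframes(keyframes, steps):
    """Expand a sparse sequence of facial key poses (each a dict of
    parameter values) into a dense frame sequence by linear
    interpolation between consecutive keys, in the spirit of the
    segment-driven keyframe scheme described above."""
    frames = []
    for a, b in zip(keyframes, keyframes[1:]):
        for i in range(steps):
            t = i / steps                                  # 0.0 .. <1.0
            frames.append({k: (1 - t) * a[k] + t * b[k] for k in a})
    frames.append(dict(keyframes[-1]))                     # final key pose
    return frames
```

In the system described, the same string of phonetic segments that drives the speech generator would select the key poses, so audio and animation stay synchronized by construction.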
The Visual Computer, 1988
In this paper we evaluate the effectiveness in conveying speech information of a speech-synchronized facial animation system based on context-dependent visemes. The evaluation procedure is based on an oral speech intelligibility test conducted with, and without, supplementary visual information provided by a real and a virtual speaker. Three situations (audio-only, audio+video and audio+animation) are compared and analysed under five different conditions of noise contamination of the audio signal. The results show that the virtual face driven by context-dependent visemes effectively contributes to speech intelligibility at high noise degradation levels (Signal-to-Noise Ratio (SNR) ≤ -18 dB).