Papers by Alexandra Markó
The aim of the recordings of the so-called longitudinal corpus is a follow-up study of adult speakers with respect to various characteristics of their speech. The participants are native speakers of Hungarian whose speech was first recorded for the BEA database; 10-11 years later, new recordings were made with them following the methodology of the longitudinal corpus. The corpus will be made accessible to speech researchers. The study reviews the earlier longitudinal research on which the methodology of the present corpus was based, and presents the corpus-building work currently in progress.

11th ISCA Speech Synthesis Workshop (SSW 11), 2021
For articulatory-to-acoustic mapping, typically only limited parallel training data is available, making it impossible to apply fully end-to-end solutions like Tacotron2. In this paper, we experimented with transfer learning and adaptation of a Tacotron2 text-to-speech model to improve the final synthesis quality of ultrasound-based articulatory-to-acoustic mapping with a limited database. We use a multi-speaker pre-trained Tacotron2 TTS model and a pre-trained WaveGlow neural vocoder. The articulatory-to-acoustic conversion consists of three steps: 1) from a sequence of ultrasound tongue image recordings, a 3D convolutional neural network predicts the inputs of the pre-trained Tacotron2 model, 2) the Tacotron2 model converts this intermediate representation to an 80-dimensional mel-spectrogram, and 3) the WaveGlow model is applied for final inference. The generated speech retains the timing of the original articulatory data from the ultrasound recording, but the F0 contour and the spectral information are predicted by the Tacotron2 model. The F0 values are independent of the original ultrasound images but represent the target speaker, as they are inferred from the pre-trained Tacotron2 model. In our experiments, we demonstrated that the synthesized speech quality is more natural with the proposed solution than with our earlier model.
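The three-step conversion described above can be sketched as a shape-level data flow. The stub functions below are random placeholders standing in for the trained 3D CNN, Tacotron2, and WaveGlow networks; all dimensions (64x128 ultrasound images, a 512-dimensional intermediate representation, vocoder hop size 256) are illustrative assumptions, not values from the paper.

```python
import numpy as np

# Illustrative sketch of the three-step ultrasound-to-speech pipeline.
# The model stubs are random placeholders, not the trained networks.

rng = np.random.default_rng(0)

def cnn3d_stub(ultrasound, dim=512):
    """Step 1: map a sequence of ultrasound tongue images (T, H, W)
    to the intermediate representation fed to Tacotron2 (T, dim)."""
    t = ultrasound.shape[0]
    flat = ultrasound.reshape(t, -1)
    w = rng.standard_normal((flat.shape[1], dim)) * 0.01
    return flat @ w

def tacotron2_stub(intermediate, n_mels=80):
    """Step 2: convert the intermediate representation to an
    80-dimensional mel-spectrogram (T, 80)."""
    w = rng.standard_normal((intermediate.shape[1], n_mels)) * 0.01
    return intermediate @ w

def waveglow_stub(mel, hop=256):
    """Step 3: neural vocoder inference, mel (T, 80) -> waveform."""
    return rng.standard_normal(mel.shape[0] * hop)

ultrasound = rng.standard_normal((100, 64, 128))   # 100 video frames
mel = tacotron2_stub(cnn3d_stub(ultrasound))
wav = waveglow_stub(mel)
print(mel.shape, wav.shape)                        # (100, 80) (25600,)
```

Note how the frame count of the ultrasound input (here 100) propagates unchanged to the mel-spectrogram: this is what "the generated speech retains the timing of the original articulatory data" means in practice.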

2019 International Joint Conference on Neural Networks (IJCNN), 2019
When using ultrasound video as input, Deep Neural Network-based Silent Speech Interfaces usually rely on the whole image to estimate the spectral parameters required for the speech synthesis step. Although this approach is quite straightforward and permits the synthesis of understandable speech, it has several disadvantages as well. Besides its inability to capture the relations between close regions (i.e. pixels) of the image, this pixel-by-pixel representation of the image is also quite uneconomical. It is easy to see that a significant part of the image is irrelevant for the spectral parameter estimation task, as the information stored by neighbouring pixels is redundant, and the neural network becomes quite large due to the large number of input features. To resolve these issues, in this study we train an autoencoder neural network on the ultrasound image; the estimation of the spectral speech parameters is done by a second DNN, using the activations of the bottleneck layer of the autoencoder network as features. In our experiments, the proposed method proved to be more efficient than the standard approach: the measured normalized mean squared error scores were lower, while the correlation values were higher in each case. Based on the results of a listening test, the synthesized utterances also sounded more natural to native speakers. A further advantage of our proposed approach is that, due to the (relatively) small size of the bottleneck layer, we can utilize several consecutive ultrasound images during estimation without a significant increase in the network size, while significantly increasing the accuracy of parameter estimation.
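The two-stage idea above can be illustrated with a linear stand-in: PCA plays the role of the autoencoder's bottleneck layer (a linear autoencoder learns the same subspace), and codes of several consecutive frames are concatenated into one compact feature vector for the second-stage estimator. Image size, code dimension, and context length here are illustrative assumptions, not the paper's settings.

```python
import numpy as np

# Linear sketch of the two-stage approach: compress each ultrasound
# image to a small bottleneck code, then stack several consecutive
# codes as input features for the spectral-parameter estimator.
# PCA stands in for the autoencoder's bottleneck layer.

rng = np.random.default_rng(1)
images = rng.standard_normal((500, 64 * 128))      # 500 flattened frames

# "Train" the compressor: principal directions of the image data.
mean = images.mean(axis=0)
_, _, vt = np.linalg.svd(images - mean, full_matrices=False)
bottleneck = vt[:64]                               # 64-dim code

def encode(frame):
    """Project one flattened image onto the bottleneck subspace."""
    return (frame - mean) @ bottleneck.T           # shape (64,)

def stack_codes(frames, n_ctx=5):
    """Concatenate codes of n_ctx consecutive frames into one vector."""
    return np.concatenate([encode(f) for f in frames[:n_ctx]])

features = stack_codes(images)
print(features.shape)                              # (320,)
```

Five consecutive 8192-pixel frames would mean 40960 raw input features; after compression the same context fits in 320 values, which is the economy argument made in the abstract.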

It is a commonly held view that not only adjacent speech sounds but also transconsonantal vowels have an effect on each other, as the vowels in V1CV2 sequences are produced with one single underlying diphthongal gesture [7]. It is often hypothesized that the contextual variation of vowels induced by V-to-V coarticulation depends on several factors, e.g., the prosodic position and vowel quality of the target V, the prosodic position of the context/trigger V, and the direction of coarticulation. Some previous studies on non-words showed evidence of increased resistance to coarticulatory effects in prosodically strong positions, i.e., in lexically stressed ([3,6]; acoustic data) and pitch-accented syllables ([1]; articulatory data). Recent real-word studies, however, also revealed that if captured in parallel, acoustic and articulatory resistance tendencies may differ greatly in both magnitude and nature [2]. Articulatory data showed no support for prosodically conditioned coarticulatory aggression in V-to-V coarticulation [1]. As for the direction of coarticulation, carryover coarticulation was found to be stronger than anticipatory coarticulation both in articulation [1,8] and in acoustics [6]. Note that most of the above studies captured V-to-V induced "contextual variation" as quality differences between coarticulated and non-coarticulated tokens. Contextual variation, however, may just as well be interpreted as actual variation, i.e., "dispersion" of vowel tokens observed in the acoustic and articulatory spaces, which has not yet been explored under the conditioning effect of the above factors. Furthermore, the limited amount of available results warrants further exploration of whether the coarticulatory resistance/aggression of the same tokens in the same utterances is detectable in both acoustics and articulation.

It is important to note here that, according to a well-known (but in some respects understudied) hypothesis, the variability/dispersion of V realizations is also affected by the density of the vowel space [4]; therefore, it is safe to assume that none of the effects claimed to influence V-to-V induced variability generalize automatically across languages. In the present study we analyzed synchronously collected acoustic and EMA data and re-tested whether the effect of V-to-V coarticulation depended on directionality (anticipatory/carryover coarticulation), and examined whether coarticulatory resistance and aggression are conditioned prosodically. As opposed to previous studies, we captured coarticulatory effects (or contextual variation) in two ways: i) as acoustic and articulatory differences between coarticulated and non-coarticulated tokens, and ii) as dispersion of the corresponding acoustic and articulatory parameters. We analyzed the /i u/ point vowels in 9 speakers of Hungarian, in the context of the minimally constrained labial stop /p/ in nonsense /pVpVpVpV/ sequences (min. 6 repetitions per speaker), in pitch-accented (ACC) and unaccented (UN) syllables, in coarticulating and non-coarticulating contexts.

ArXiv, 2021
Articulatory information has been shown to be effective in improving the performance of HMM-based and DNN-based text-to-speech synthesis. Speech synthesis research traditionally focuses on text-to-speech conversion, where the input is text or an estimated linguistic representation and the target is synthesized speech. However, a research field that has risen in the last decade is articulation-to-speech synthesis (with the target application of a Silent Speech Interface, SSI), where the goal is to synthesize speech from some representation of the movement of the articulatory organs. In this paper, we extend traditional (vocoder-based) DNN-TTS with articulatory input, estimated from ultrasound tongue images. We compare text-only, ultrasound-only, and combined inputs. Using data from eight speakers, we show that the combined text and articulatory input can have advantages in limited-data scenarios, namely, it may increase the naturalness of synthesized speech compared to single text i...

It is traditionally assumed that geminates undergo degemination when flanked by another consonant in Hungarian. Since in Hungarian duration is considered to be the main acoustic cue to the singleton-geminate opposition, it appears valid to study the phonetic implementation of this process in the acoustic domain. However, previous acoustic analyses led to inconclusive results on the status of the "degeminated" consonant, while articulatory data on Japanese singletons and geminates imply that it is revealing to study degemination on the level of gestural timing. The present study compared the gestural organization of geminate, degeminated, and singleton consonants in heterorganic C-clusters and in intervocalic positions. We obtained EMA data from 10 female speakers of Hungarian (mean age 27.7 years). Consonant duration, plateau durations and tongue rise showed that degemination does not yield realizations equivalent to intervocalic singletons, and geminates and singletons in clusters showe...
Tamás Gábor Csapó, Andrea Deme, Tekla Etelka Gráczi, Alexandra Markó. 1 Dept. of Telecomm. and Media Informatics, Budapest Univ. of Technology and Economics, Budapest, Hungary; 2 MTA-ELTE Lendület Lingual Articulation Research Group, Budapest, Hungary; 3 Department of Phonetics, Eötvös Loránd University, Budapest, Hungary; 4 Research Institute for Linguistics, Hungarian Academy of Sciences, Budapest, Hungary

2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018
State-of-the-art silent speech interface systems apply vocoders to generate the speech signal directly from articulatory data. Most of these approaches concentrate on estimating just the spectral features of the vocoder, and use the original F0, a constant F0 or white noise as excitation. This solution is based on the assumption that the F0 curve is unpredictable from articulatory data that does not contain direct measurements of the vocal fold vibration. Here, we experimented with deep neural networks to perform articulatory-to-acoustic conversion from ultrasound images, with an emphasis on estimating the voicing feature and the F0 curve from the ultrasound input. Contrary to the common belief that F0 is unpredictable, we attained a correlation rate of 0.74 between the original and the predicted F0 curve. What is more, the listening tests revealed that our subjects could not distinguish the sentences synthesized using the DNN-estimated and the original F0 curve, and ranked them as having the same quality.
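The reported 0.74 figure is a correlation between F0 curves. A minimal sketch of such an evaluation, computed over voiced frames only (unvoiced frames carry no F0), might look as follows; the curves here are synthetic stand-ins, not data from the paper.

```python
import numpy as np

# Sketch of the F0 evaluation: Pearson correlation between an
# "original" and a "predicted" F0 curve, restricted to voiced frames.

rng = np.random.default_rng(2)
t = np.linspace(0.0, 1.0, 200)                     # 200 analysis frames
f0_orig = 120 + 20 * np.sin(2 * np.pi * 2 * t)     # reference F0 in Hz
f0_pred = f0_orig + rng.normal(0, 5, t.size)       # noisy DNN estimate

voiced = np.ones(t.size, dtype=bool)
voiced[80:100] = False                             # an unvoiced stretch

# Correlation over voiced frames only.
r = np.corrcoef(f0_orig[voiced], f0_pred[voiced])[0, 1]
print(round(r, 2))
```

Restricting the comparison to voiced frames is the standard choice in this literature, since F0 is undefined where the vocal folds do not vibrate.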

Articulatory organization of geminates in Hungarian. It is traditionally assumed that geminates undergo degemination when flanked by another consonant in Hungarian. As in Hungarian duration is considered to be the main acoustic cue to the singleton-geminate opposition, it appears valid to study the phonetic implementation of this process in the acoustic domain. However, previous acoustic analyses led to inconclusive results on the status of the "degeminated" consonant, while articulatory data on Japanese singletons and geminates imply that it is revealing to study degemination on the level of gestural timing. The present study compared the gestural organization of geminate, degeminated and singleton consonants in heterorganic C-clusters and in intervocalic positions. We obtained EMA data from 10 female speakers of Hungarian (mean age 27.7 years). Consonant durations, plateau durations and tongue rise data showed that degemination does not yield realizations equivalent to intervocalic s...

In the present study three members of the Hungarian vowel inventory (/i/, /u/, /ɒ/) were analysed as a function of prominence, with respect to gender and vowel quality. The theoretically most prominent (stressed and accented) and non-prominent (unstressed and unaccented) realizations were compared in terms of duration, f0, formants, and OQ (measured with two different methods). The last two of these parameters were estimated and analyzed systematically for the first time in the study of Hungarian speech. There was a significant interaction between the effect of prominence and vowel quality: prominence led to longer duration for the vowels /ɒ/ and /i/, but had no significant effect on /u/. We found a three-way interaction between prominence, vowel quality and gender, due to different patterns observed between the two genders in the case of the vowel /i/. Formant analysis based on Euclidean distance from the vowel space centroid did not reveal any significant effect of the degre...
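The Euclidean-distance measure mentioned above can be computed directly from (F1, F2) token coordinates: each vowel token is a point in formant space, and its peripherality is its distance from the centroid of the vowel space. The formant values below are invented for illustration, not data from the study.

```python
import numpy as np

# Sketch of the centroid-distance formant analysis: distance of each
# (F1, F2) token from the centroid of the vowel space.
# All formant values are hypothetical.

tokens = np.array([        # (F1, F2) in Hz
    [300, 2300],           # /i/-like tokens
    [320, 2250],
    [310,  750],           # /u/-like tokens
    [330,  800],
    [650, 1100],           # /ɒ/-like tokens
    [620, 1050],
], dtype=float)

centroid = tokens.mean(axis=0)
dist = np.linalg.norm(tokens - centroid, axis=1)   # one value per token
print(centroid.round(1), dist.round(1))
```

Larger distances indicate more peripheral (often interpreted as more hyperarticulated) realizations; the prominence effect is then tested on these per-token distance values.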

Articulatory studies performed in Hungary date back to the sixties, when various methods were applied in the description of the segment inventory of Hungarian and of several other languages (e.g. Russian, German, English, Polish). Palato- and linguography, labiography, and X-ray were used in the analyses of both typical and atypical speech. However, coarticulation, which requires dynamic methods, was not analysed until recently, when the suitable tools and methods (electromagnetic articulography, ultrasound tongue imaging and electroglottography) also became available in Hungary. The paper presents an overview of the main issues of articulatory studies on Hungarian in the past and the present. It summarizes the main findings of some studies on gemination and degemination, transparent vowels, and the phonatory characteristics of emotion, and gives a couple of examples of possible and future applications.
International Congress of Phonetic Sciences, 2015
Speakers tend to mark the boundaries of larger prosodic units with glottalization and the deceleration of articulation rate. In the present study, the final parts of Hungarian read and spontaneous utterances were analyzed in the temporal domain (compared to the other parts of the utterances) and in terms of glottalization. We investigated how glottalization and deceleration are related to each other in read and spontaneous speech in Hungarian. We also analyzed whether these phenomena depend on the speech mode. Our results revealed a connection between glottalization and deceleration in spontaneous speech, whereas for read speech no such relation could be detected. The speech modes were also found to differ in the frequency of occurrence of glottalization and the magnitude of the deceleration at utterance-final positions.

Glottal marking is well described for adult speakers; however, children's speech has been documented much less thoroughly. The present study analysed the appearance of glottal marking in the read speech of 16 adolescent (16- and 17-year-old) and 16 adult (20- to 45-year-old) speakers, with an equal number of males and females in both age groups. Data were compared in terms of gender as well as age, based on four parameters of frequency of occurrence. The results showed that although the frequency of occurrence of glottal marking in adolescent speech was in general somewhat lower than in adult speech, and the gender-specific differences had not yet appeared, the positional triggers for glottal marking were found to affect the frequency of the phenomenon similarly in the two age groups. The results and further research may contribute to a better understanding of both the appearance of glottal marking and the emergence of gender-specific characteristics of speech.
intervocalic position ("fala" vs. "falha"). The corpus of this study was recorded by four native speakers of Brazilian Portuguese (all women) and ten Hungarian learners of Portuguese as a foreign language (eight of them women). All informants read 56 words embedded in a carrier sentence (Eu disse "___" seis vezes, 'I said "___" six times'). The data were analyzed acoustically by means of formant measurements of the consonants. The results show that, in comparison with the native speakers, the Hungarian learners of Portuguese produce the alveolar lateral /l/, but their production of the palatal lateral /ʎ/ differs from that of the palatal glide /j/ only at the onset of the consonant.