2019 International Conference on Speech Technology and Human-Computer Dialogue (SpeD)
The main challenge in current voice conversion is the tradeoff between speaker similarity and computational complexity. To tackle these problems, this paper introduces a novel sinusoidal model applied to voice conversion (VC) with parallel training data. Conventional source-filter based techniques usually degrade the sound quality and similarity of the converted voice due to parameterization errors and over-smoothing, which leads to a mismatch in the converted characteristics. Therefore, we developed a VC method using a continuous sinusoidal model (CSM), which decomposes the source voice into harmonic components to improve VC performance. In contrast to current VC approaches, our method is motivated by two observations. Firstly, it allows a continuous fundamental frequency (F0) that avoids the alignment errors which may occur in voiced and unvoiced segments and degrade the converted speech; this is important for maintaining high converted speech quality. Secondly, we compare our model with two high-quality modern vocoders (MagPhase and WORLD) applied to VC, and with a vocoder-free VC framework based on a differential Gaussian mixture model that was recently used in the Voice Conversion Challenge 2018. Finally, similarity and intelligibility are evaluated with objective and subjective measures. Experimental results confirmed that the proposed method obtains higher speaker similarity than the conventional methods.
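The continuous-F0 idea in the abstract above can be illustrated with a minimal numpy sketch (not the authors' implementation; the function name and the zeros-mark-unvoiced convention are assumptions for illustration): unvoiced frames are filled by interpolating between the surrounding voiced frames, so the contour never drops to zero.

```python
import numpy as np

def continuous_f0(f0, unvoiced_value=0.0):
    """Interpolate F0 through unvoiced frames so the contour never drops
    to zero (zeros marking unvoiced frames is an assumed convention)."""
    f0 = np.asarray(f0, dtype=float)
    voiced = f0 > unvoiced_value
    if not voiced.any():
        return f0.copy()
    idx = np.arange(len(f0))
    # Linear interpolation between voiced frames; edges hold the nearest voiced value.
    return np.interp(idx, idx[voiced], f0[voiced])
```

A contour such as `[0, 100, 0, 0, 200, 0]` becomes a straight ramp from 100 to 200 Hz across the unvoiced gap, which is what makes frame-wise alignment between source and target contours well defined everywhere.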
Many automatic text-to-speech methods exist today, but in recent years the greatest attention has been paid to statistical parametric speech synthesis, in particular Hidden Markov Model (HMM) based text-to-speech. The quality of HMM-based synthesis approaches that of unit-selection synthesis, currently considered the best, and it has several further advantages: its database takes up little space, new voices can be created without separate recordings, it can express emotions, and the voice character of a given speaker can be reproduced from only a few sentences of recordings. In this article we present the basics of HMM-based speech synthesis, the possibilities of speaker adaptation, the speaker-independent HMM database built for Hungarian, and the speaker-adaptation process for semi-spontaneous Hungarian speech. To evaluate the results, we conducted listening tests with the adaptation of four different voices, which are also described in the article.
With ever-broadening communication possibilities, the load on telephone customer services is growing rapidly. Automating these information services requires producing more and more voice messages, usually with the same announcer, and the announcer's finite capacity limits this. In this article we report on a machine solution that takes most of this work off the announcer's shoulders: they only need to verify how the generated message sounds. The prompt generator is a new speech technology solution of a kind not previously built in Hungary. Its design and development yielded new solutions from computational linguistic, phonetic and informatics perspectives alike. In the optimal case, the system delivers such natural sound quality that the listener does not notice that a machine is speaking.
For articulatory-to-acoustic mapping, typically only limited parallel training data is available, making it impossible to apply fully end-to-end solutions like Tacotron2. In this paper, we experimented with transfer learning and adaptation of a Tacotron2 text-to-speech model to improve the final synthesis quality of ultrasound-based articulatory-to-acoustic mapping with a limited database. We use a multi-speaker pre-trained Tacotron2 TTS model and a pre-trained WaveGlow neural vocoder. The articulatory-to-acoustic conversion consists of three steps: 1) from a sequence of ultrasound tongue image recordings, a 3D convolutional neural network predicts the inputs of the pre-trained Tacotron2 model, 2) the Tacotron2 model converts this intermediate representation to an 80-dimensional mel-spectrogram, and 3) the WaveGlow model is applied for final inference. The generated speech preserves the timing of the original articulatory data from the ultrasound recording, but the F0 contour and the spectral information are predicted by the Tacotron2 model. The F0 values are independent of the original ultrasound images and represent the target speaker, as they are inferred from the pre-trained Tacotron2 model. In our experiments, we demonstrated that the synthesized speech is more natural with the proposed solution than with our earlier model.
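The three-step conversion above can be sketched at the level of tensor shapes. The functions below are deliberately trivial stand-ins for the trained networks, and the 64×128 image size, 512-dimensional intermediate representation and hop length of 256 are illustrative assumptions; only the data flow and shapes reflect the pipeline described in the abstract.

```python
import numpy as np

# Shape-level sketch of the three-step pipeline; each function is a
# stand-in for a trained network, keeping only the tensor shapes honest.
def cnn_encoder(ultrasound):           # (frames, H, W) -> (frames, 512) features
    flat = ultrasound.reshape(len(ultrasound), -1).mean(axis=1, keepdims=True)
    return flat.repeat(512, axis=1)

def tacotron2_decoder(features):       # (frames, 512) -> (frames, 80) mel-spectrogram
    return features[:, :80]

def waveglow_vocoder(mel, hop=256):    # (frames, 80) -> (frames * hop,) waveform
    return np.zeros(len(mel) * hop)

ult = np.random.default_rng(0).random((10, 64, 128))   # 10 ultrasound tongue images
wav = waveglow_vocoder(tacotron2_decoder(cnn_encoder(ult)))
```

The frame count of the ultrasound input fixes the number of mel frames, which is why the timing of the output follows the articulatory recording while the spectral content comes from the downstream models.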
Voice conversion (VC) transforms the speaking style of a source speaker into the speaking style of a target speaker while keeping the linguistic information unchanged. Traditional VC techniques rely on parallel recordings of multiple speakers uttering the same sentences. Earlier approaches mainly find a mapping between a given source–target speaker pair, based on pairs of similar utterances spoken by the different speakers. However, parallel data are expensive and difficult to collect, and non-parallel VC remains an interesting but challenging speech processing task. To address this limitation, we propose a method that allows non-parallel many-to-many voice conversion using a generative adversarial network. To the best of the authors' knowledge, our study is the first to employ a sinusoidal model with continuous parameters to generate converted speech signals. Our method requires only several minutes of training examples, without parallel utterances or time alignment pr...
Convolutional Neural Networks (CNNs) have been applied to various machine learning tasks, such as computer vision, speech technologies and machine translation. One of the main advantages of CNNs is their capability to learn representations from high-dimensional data. End-to-end CNN models have been explored extensively in the computer vision domain, and the approach has also been attempted in other domains. In this paper, a novel end-to-end CNN architecture with residual connections is presented for intent detection, one of the main tasks in building a spoken language understanding (SLU) system. Experiments were carried out on two datasets (ATIS and Snips). The results demonstrate that the proposed model outperforms previous solutions.
Recently, in statistical parametric speech synthesis, we proposed a vocoder using continuous F0 in combination with Maximum Voiced Frequency (MVF), which was successfully used with hidden Markov model (HMM) based text-to-speech (TTS). However, HMMs often generate over-smoothed and muffled synthesized speech. Therefore, we propose here to use a modified version of our continuous vocoder with deep neural networks (DNNs) to further improve its quality. Evaluations of DNN-TTS using the Continuous and WORLD vocoders are also presented. Experimental results from objective and subjective tests have shown that DNN-TTS has higher naturalness than HMM-TTS, and that the proposed framework provides quality similar to the WORLD vocoder, while being simpler in terms of the number of excitation parameters and modelling the voiced/unvoiced speech regions better than the WORLD vocoder.
To date, various speech technology systems have adopted the vocoder approach, a method for synthesizing the speech waveform that plays a major role in the performance of statistical parametric speech synthesis. However, conventional source-filter systems (e.g., STRAIGHT) and sinusoidal models (e.g., MagPhase) tend to produce over-smoothed spectra, which often result in muffled and buzzy synthesized speech in text-to-speech (TTS) systems. WaveNet, one of the models that most closely resembles the human voice, has to generate the waveform in a time-consuming sequential manner with an extremely complex neural network structure, and it needs large quantities of voice data before accurate predictions can be obtained. To motivate a new, alternative approach to these issues, we present an updated synthesizer: a simple signal model that is easy to train and to generate waveforms from, using the Continuous Wavelet Transform (CWT) to characterize and decompose speech features. CWT provides time and frequency resolutions different from those of the short-time Fourier transform. It can also retain the fine spectral envelope and achieve high controllability of a structure closer to human auditory scales. We confirmed through experiments that our speech synthesis system was able to provide natural-sounding synthetic speech and outperformed the state-of-the-art WaveNet vocoder.
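The CWT decomposition mentioned above can be sketched in a few lines. The Ricker (Mexican-hat) wavelet and the particular scales are assumptions for illustration, not the paper's configuration; the point is that each scale yields a filtered version of the feature contour at a different temporal resolution.

```python
import numpy as np

def ricker(points, a):
    """Ricker (Mexican-hat) wavelet of width parameter a."""
    t = np.arange(points) - (points - 1) / 2.0
    amp = 2 / (np.sqrt(3 * a) * np.pi ** 0.25)
    return amp * (1 - (t / a) ** 2) * np.exp(-(t ** 2) / (2 * a ** 2))

def cwt(signal, scales):
    """One output row per scale: the signal seen at that temporal resolution."""
    out = np.empty((len(scales), len(signal)))
    for i, a in enumerate(scales):
        w = ricker(min(10 * int(a), len(signal)), a)
        out[i] = np.convolve(signal, w, mode="same")
    return out

contour = np.sin(np.linspace(0, 2 * np.pi, 100))   # toy feature contour
coeffs = cwt(contour, scales=[1, 2, 4, 8])          # (4, 100) coarse-to-fine bands
```

Summing suitably weighted rows approximately reconstructs the contour, which is what makes the transform usable as a feature decomposition inside a statistical synthesizer.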
Grapheme-to-phoneme (G2P) conversion is the process of generating the pronunciation of words from their written form. It plays an essential role in natural language processing, text-to-speech synthesis and automatic speech recognition systems. In this paper, we investigate convolutional neural networks (CNNs) for G2P conversion and propose a novel CNN-based sequence-to-sequence (seq2seq) architecture. Our approach includes an end-to-end CNN G2P converter with residual connections and, furthermore, a model that uses a convolutional neural network (with and without residual connections) as the encoder and a Bi-LSTM as the decoder. We compare our approach with state-of-the-art methods, including Encoder-Decoder LSTM and Encoder-Decoder Bi-LSTM. Training and inference times, phoneme and word error rates were evaluated on the public CMUDict dataset for US English, and the best performing convolutional neural network-based architecture was also evaluated on the NetTal...
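A residual convolutional block of the kind such an encoder stacks over grapheme embeddings can be sketched as follows. This is a shape-level illustration in plain numpy under assumed dimensions (6 graphemes, 8-dimensional embeddings, kernel width 3), not the trained model.

```python
import numpy as np

def conv1d(x, w):
    """'Same'-padded 1-D convolution: x is (T, C_in), w is (k, C_in, C_out)."""
    k = w.shape[0]
    pad = np.pad(x, ((k // 2, k - 1 - k // 2), (0, 0)))
    return np.stack([np.tensordot(pad[t:t + k], w, axes=([0, 1], [0, 1]))
                     for t in range(x.shape[0])])

def residual_block(x, w):
    """Convolution + ReLU with a skip connection: the residual building
    block a CNN G2P encoder stacks over grapheme embeddings."""
    return x + np.maximum(conv1d(x, w), 0.0)

x = np.random.default_rng(0).normal(size=(6, 8))        # 6 graphemes, 8-dim embeddings
w = np.random.default_rng(1).normal(size=(3, 8, 8)) * 0.1
y = residual_block(x, w)
```

Because the skip connection adds the input back to the convolution output, gradients can bypass each block, which is what lets such encoders be stacked deeply without the training degradation plain deep CNNs suffer.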
Recent studies in text-to-speech synthesis have shown the benefit of using a continuous pitch estimate: one that interpolates the fundamental frequency (F0) even when voicing is not present. However, continuous F0 is still sensitive to additive noise in speech signals and suffers from short-term errors (when it changes rather quickly over time). To alleviate these issues, three adaptive techniques have been developed in this article for achieving a robust and accurate F0: (1) we weight the pitch estimates with the state noise covariance in an adaptive Kalman-filter framework, (2) we iteratively apply time-axis warping to the input frame signal, and (3) we optimize all F0 candidates using an instantaneous-frequency-based approach. Additionally, the second goal of this study is to introduce an extension of a novel continuous speech synthesis system (i.e., one in which all parameters are continuous). We propose adding a new excitation parameter named Harmonic-to-Noise Ratio (HNR) to the voiced...
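The Kalman-filter weighting in technique (1) can be sketched with a scalar random-walk filter. The state and measurement variances below are illustrative assumptions (the article's adaptive scheme tunes them per frame); the sketch only shows how the gain trades raw pitch estimates against the smoothed state.

```python
import numpy as np

def kalman_smooth_f0(f0_obs, q=1.0, r=25.0):
    """Scalar random-walk Kalman filter over frame-wise F0 estimates.
    q: assumed state (pitch-change) variance; r: measurement-noise
    variance. Larger r trusts each raw estimate less."""
    x, p = float(f0_obs[0]), r
    out = np.empty(len(f0_obs))
    for i, z in enumerate(f0_obs):
        p = p + q                  # predict: uncertainty grows between frames
        k = p / (p + r)            # Kalman gain weights the new observation
        x = x + k * (z - x)        # update: pull state toward the estimate
        p = (1 - k) * p
        out[i] = x
    return out
```

An isolated 40 Hz spike in an otherwise flat 100 Hz contour is pulled back toward the track rather than copied through, which is the short-term-error behavior the abstract targets.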
Milan Sečujski 1, Branislav Gerazov 2, Tamás Gábor Csapó 3, Vlado Delić 1, Philip N. Garner 4, Aleksandar Gjoreski 2, David Guennec 5, Zoran Ivanovski 2, Aleksandar Melov 2, Géza Németh 3, A. Stojković 2, and György Szaszák 3. 1 Faculty of Technical Sciences, University of Novi Sad, Serbia. 2 Faculty of Electrical Engineering and Information Technologies, University of Ss. Cyril and Methodius, Skopje, Macedonia.
Speech synthesis is an important modality in Cognitive Infocommunications, the intersection of informatics and the cognitive sciences. Statistical parametric methods have recently gained importance in speech synthesis: the speech signal is decomposed into parameters and later restored from them, with the decomposition implemented by speech coders. We apply a novel codebook-based speech coding method to model the excitation of speech. In the analysis stage, the speech signal is analyzed frame by frame and a codebook of pitch-synchronous excitations is built from the voiced parts. Timing, gain and harmonic-to-noise ratio parameters are extracted and fed into the machine learning stage of hidden Markov model based speech synthesis. During the synthesis stage, the codebook is searched for a suitable element in each voiced frame, and these are concatenated to create the excitation signal, from which the final synthesized speech is created. Our initial experiments show that the model fits well into the statistical parametric speech synthesis framework and in most cases synthesizes speech of better quality than the traditional pulse-noise excitation. (This paper is an extended version of [10].)
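The codebook search in the synthesis stage amounts to a nearest-neighbor lookup: for each voiced frame, the residual whose stored parameters best match the predicted ones is selected and the winners are concatenated. A minimal sketch under assumed 2-dimensional parameter vectors and a toy 3-entry codebook (names and sizes are illustrative, not the paper's configuration):

```python
import numpy as np

def build_excitation(targets, codebook_params, codebook_residuals):
    """For each voiced frame, pick the codebook residual whose parameter
    vector (e.g. gain, harmonic-to-noise ratio) is nearest to the target,
    then concatenate the chosen pitch-synchronous residuals."""
    picks = [int(np.argmin(np.linalg.norm(codebook_params - t, axis=1)))
             for t in targets]
    return np.concatenate([codebook_residuals[i] for i in picks]), picks

# Toy codebook: 3 residual periods of 4 samples, described by 2-D parameters.
params = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
residuals = [np.full(4, float(i)) for i in range(3)]
excitation, picks = build_excitation(np.array([[0.9, 1.1], [2.1, 1.9]]),
                                     params, residuals)
```

Concatenating stored natural residuals instead of synthetic pulses is what lets the excitation keep the fine structure a pulse-noise model smooths away.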
Keywords: speech synthesis, text-to-speech conversion, hidden Markov model. This article presents the technology of hidden Markov model based text-to-speech synthesis and its adaptation to Hungarian. This solution has several advantages: it can produce good-quality speech with a small database size, and in principle it also makes it possible to modify the character and style of the voice and to teach the system to express emotions.
In this paper we present a novel machine learning approach usable for text labeling problems. We illustrate the importance of the problem for Text-to-Speech systems and through that for telecommunication applications. We introduce the proposed method, and demonstrate its effectiveness on the problem of language identification, using three different training sets and large test corpora.
Currently available speech recognisers do not usually work well with elderly speech. This is because several characteristics of speech (e.g. fundamental frequency, jitter, shimmer and harmonics-to-noise ratio) change with age, and because the acoustic models used by speech recognisers are typically trained on speech collected from younger adults only. To develop speech-driven applications capable of successfully recognising elderly speech, this type of speech data is needed for training acoustic models from scratch or for adapting acoustic models trained on younger adults' speech. However, the availability of suitable elderly speech corpora is still very limited. This paper describes an ongoing project to design, collect, transcribe and annotate large elderly speech corpora for four European languages: Portuguese, French, Hungarian and Polish. The Portuguese, French and Polish corpora contain read speech only, whereas the Hungarian corpus also contains spontaneous command-and-control speech. Depending on the language, the corpora contain 76 to 205 hours of speech collected from 328 to 986 speakers aged 60 and over. The final corpora will come with manually verified orthographic transcriptions, as well as annotations for filled pauses, noises and damaged words.
Speech technology has been an area of intensive research worldwide, including Hungary, for several decades. This paper gives a short overview of the challenges and results of the domain, and introduces a vision for the development and application of the technology.
The PaeLife project is a European industry-academia collaboration whose goal is to provide the elderly with easy access to online services that make their life easier and encourage their continued participation in society. To reach this goal, the project partners are developing a multimodal virtual personal life assistant (PLA) offering a wide range of services, from weather information to social networking. This paper presents the multimodal architecture of the PLA, the services it provides, and the work done on the speech input and output modalities, which play a key role in the application.
IEEE Journal of Selected Topics in Signal Processing, 2014
Statistical parametric text-to-speech synthesis is optimized for regular voices and may not create high-quality output for speakers who frequently produce irregular phonation. A number of excitation models have been proposed recently in the hidden Markov model speech synthesis framework, but few of them deal with this phenomenon. The baseline system of this study is our previous residual codebook based excitation model, which uses frames of pitch-synchronous residuals. To model the irregular voice typically occurring at phrase boundaries or sentence endings, two alternative extensions are proposed. The first, rule-based method applies pitch halving, amplitude scaling of residual periods with random factors, and spectral distortion. The second, data-driven approach uses a corpus of residuals extracted from irregularly phonated vowels, with unit selection applied during synthesis. In perception tests on short speech segments, both methods were found to improve on the baseline excitation in preference and in similarity to the original speaker. An acoustic experiment has shown that both methods can synthesize irregular voice that is close to original irregular phonation in terms of open quotient. The proposed methods may contribute to building natural, expressive and personalized speech synthesis systems.
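Two of the rule-based operations above (pitch halving and random amplitude scaling of residual periods) can be sketched directly; the gain range and seeded generator are illustrative assumptions, and spectral distortion is omitted for brevity.

```python
import numpy as np

def irregularize(periods, rng=None, low=0.3, high=0.9):
    """Rule-based irregular-voice sketch: keep every second pitch period
    (doubling the local period length, i.e. pitch halving) and scale the
    kept periods by random amplitude factors."""
    rng = rng or np.random.default_rng(0)
    halved = periods[::2]
    gains = rng.uniform(low, high, size=len(halved))
    return [g * p for g, p in zip(gains, halved)]

periods = [np.ones(8) for _ in range(6)]   # toy pitch-synchronous residual periods
irregular = irregularize(periods)
```

Dropping alternate periods halves the local F0 while the random gains break the regular amplitude pattern, which together approximate the aperiodicity of creaky, irregular phonation.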
2019 International Conference on Speech Technology and Human-Computer Dialogue (SpeD)
The main challenge introduced in current voice conversion is the tradeoff between speaker similar... more The main challenge introduced in current voice conversion is the tradeoff between speaker similarity and computational complexity. To tackle the latter problems, this paper introduces a novel sinusoidal model applied for voice conversion (VC) with parallel training data. The conventional source-filter based techniques usually give sound quality and similarity degradation of the converted voice due to parameterization errors and over smoothing, which leads to a mismatch in the converted characteristics. Therefore, we developed a VC method using continuous sinusoidal model (CSM), which decomposes the source voice into harmonic components to improve VC performance. In contrast to current VC approaches, our method is motivated by two observations. Firstly, it allows continuous fundamental frequency (F0) to avoid alignment errors that may happen in voiced and unvoiced segments and can degrade the converted speech, that is important to maintain a high converted speech quality. We secondly compare our model with two high-quality modern (MagPhase and WORLD) vocoders applied for VC, and one with a vocoder-free VC framework based on a differential Gaussian mixture model that was used recently for the Voice Conversion Challenge 2018. Similarity and intelligibility are finally evaluated in objective and subjective measures. Experimental results confirmed that the proposed method obtained higher speaker similarity compared to the conventional methods.
Napjainkban számos automatikus szövegfelolvasási módszer létezik, de az elmúlt években a legnagyo... more Napjainkban számos automatikus szövegfelolvasási módszer létezik, de az elmúlt években a legnagyobb figyelmet a statisztikai parametrikus beszédkeltési módszer, ezen belül is a rejtett Markov-modell (Hidden Markov Model, HMM) alapú szövegfelolvasás kapta. A HMM-alapú szövegfelolvasás minsége megközelíti a manapság legjobbnak számító elemkiválasztásos szintézisét, és ezen túl számos elnnyel rendelkezik: adatbázisa kevés helyet foglal el, lehetséges új hangokat külön felvételek nélkül létrehozni, érzelmeket kifejezni vele, és már néhány mondatnyi felvétel esetén is lehetséges az adott beszél hangkarakterét visszaadni. Jelen cikkben bemutatjuk a HMM-alapú beszédkeltés alapjait, a beszéladaptációjának lehetségeit, a magyar nyelvre elkészült beszélfüggetlen HMM adatbázist és a beszéladaptáció folyamatát félig spontán magyar beszéd esetén. Az eredmények kiértékelése céljából meghallgatásos tesztet végzünk négy különböz hang adaptációja esetén, melyeket szintén ismertetünk a cikkünkben
Az egyre szélesedő kommunikációs lehetőségekkel rohamosan nő a a telefonos ügyfélszolgálatok terh... more Az egyre szélesedő kommunikációs lehetőségekkel rohamosan nő a a telefonos ügyfélszolgálatok terhelése. A tájékoztatás automatizálásához egyre több hangos üzenetet kell elkészíteni, általában ugyanazzal a bemondóval. Ezt a felolvasó személy véges terhelhetősége korlátozza. A cikkben olyan gépi megoldás lehetőségéről számolunk be, amelyik leveszi a munka nagy részét a bemondó válláról, csak ellenőriznie kell a generált üzenet hangzását. A promptgenerátor olyan új beszédtechnológiai megoldás, amilyent még nem készítettek Magyarországon. Tervezése és fejlesztése mind számítógépes nyelvészeti, mind fonetikai és informatikai szempontból új megoldásokat eredményezett. A rendszer, optimális esetben olyan természetes hangminőséget szolgáltat, hogy a hallgató nem veszi észre, hogy gép beszél.
For articulatory-to-acoustic mapping, typically only limited parallel training data is available,... more For articulatory-to-acoustic mapping, typically only limited parallel training data is available, making it impossible to apply fully end-to-end solutions like Tacotron2. In this paper, we experimented with transfer learning and adaptation of a Tacotron2 text-to-speech model to improve the final synthesis quality of ultrasound-based articulatory-to-acoustic mapping with a limited database. We use a multi-speaker pre-trained Tacotron2 TTS model and a pre-trained WaveGlow neural vocoder. The articulatory-to-acoustic conversion contains three steps: 1) from a sequence of ultrasound tongue image recordings, a 3D convolutional neural network predicts the inputs of the pre-trained Tacotron2 model, 2) the Tacotron2 model converts this intermediate representation to an 80-dimensional mel-spectrogram, and 3) the WaveGlow model is applied for final inference. This generated speech contains the timing of the original articulatory data from the ultrasound recording, but the F0 contour and the spectral information is predicted by the Tacotron2 model. The F0 values are independent of the original ultrasound images, but represent the target speaker, as they are inferred from the pretrained Tacotron2 model. In our experiments, we demonstrated that the synthesized speech quality is more natural with the proposed solutions than with our earlier model.
Voice conversion (VC) transforms the speaking style of a source speaker to the speaking style of ... more Voice conversion (VC) transforms the speaking style of a source speaker to the speaking style of a target speaker by keeping linguistic information unchanged. Traditional VC techniques rely on parallel recordings of multiple speakers uttering the same sentences. Earlier approaches mainly find a mapping between the given source–target speakers, which contain pairs of similar utterances spoken by different speakers. However, parallel data are computationally expensive and difficult to collect. Non-parallel VC remains an interesting but challenging speech processing task. To address this limitation, we propose a method that allows a non-parallel many-to-many voice conversion by using a generative adversarial network. To the best of the authors’ knowledge, our study is the first one that employs a sinusoidal model with continuous parameters to generate converted speech signals. Our method involves only several minutes of training examples without parallel utterances or time alignment pr...
Convolutional Neural Networks (CNNs) have been applied to various machine learn-ing tasks, such a... more Convolutional Neural Networks (CNNs) have been applied to various machine learn-ing tasks, such as computer vision, speech technologies and machine translation. One of the main advantages of CNNs is the representation learning capability from high-dimensional data. End-to-end CNN models have been massively explored in computer vision domain, and this approach has also been attempted in other domains as well. In this paper, a novel end-to-end CNN architecture with residual connections is presented for intent detection, which is one of the main goals for building a spoken language understanding (SLU) system. Experiments on two datasets (ATIS and Snips) were carried out. The results demonstrate that the proposed model outperforms previous solutions.
Recently in statistical parametric speech synthesis, we proposed a vocoder using continuous F0 in... more Recently in statistical parametric speech synthesis, we proposed a vocoder using continuous F0 in combination with Maximum Voiced Frequency (MVF), which was successfully used with hidden Markov model (HMM) based text-to-speech (TTS). However, HMMs often generate over-smoothed and muffled synthesized speech. From this point, we propose here to use the modified version of our continuous vocoder with deep neural networks (DNNs) for further improving its quality. Evaluations between DNNTTS using Continuous and WORLD vocoders are also presented. Experimental results from objective and subjective tests have shown that the DNN-TTS have higher naturalness than HMM-TTS, and the proposed framework provides quality similar to the WORLD vocoder, while being simpler in terms of the number of excitation parameters and models better the voiced/unvoiced speech regions than the WORLD vocoder.
To date, various speech technology systems have adopted the vocoder approach, a method for synthe... more To date, various speech technology systems have adopted the vocoder approach, a method for synthesizing speech waveform that shows a major role in the performance of statistical parametric speech synthesis. However, conventional sourcefilter systems (i.e., STRAIGHT) and sinusoidal models (i.e., MagPhase) tend to produce over-smoothed spectra, which often result in muffled and buzzy synthesized text-to-speech (TTS). WaveNet, one of the best models that nearly resembles the human voice, has to generate a waveform in a time-consuming sequential manner with an extremely complex structure of its neural networks. WaveNet needs large quantities of voice data before accurate predictions can be obtained. In order to motivate a new, alternative approach to these issues, we present an updated synthesizer, which is a simple signal model to train and easy to generate waveforms, using Continuous Wavelet Transform (CWT) to characterize and decompose speech features. CWT provides time and frequency resolutions different from those of the short-time Fourier transform. It can also retain the fine spectral envelope and achieve high controllability of the structure closer to human auditory scales. We confirmed through experiments that our speech synthesis system was able to provide natural-sounding synthetic speech and outperformed the state-of-the-art WaveNet vocoder.
Grapheme-to-phoneme (G2P) conversion is the process of generating pronunciation for words based o... more Grapheme-to-phoneme (G2P) conversion is the process of generating pronunciation for words based on their written form. It has a highly essential role for natural language processing, text-to-speech synthesis and automatic speech recognition systems. In this paper, we investigate convolutional neural networks (CNN) for G2P conversion. We propose a novel CNN-based sequence-to-sequence (seq2seq) architecture for G2P conversion. Our approach includes an end-to-end CNN G2P conversion with residual connections and, furthermore, a model that utilizes a convolutional neural network (with and without residual connections) as encoder and Bi-LSTM as a decoder. We compare our approach with state-of-the-art methods, including Encoder-Decoder LSTM and Encoder-Decoder Bi-LSTM. Training and inference times, phoneme and word error rates were evaluated on the public CMUDict dataset for US English, and the best performing convolutional neural network-based architecture was also evaluated on the NetTal...
Recent studies in text-to-speech synthesis have shown the benefit of using a continuous pitch est... more Recent studies in text-to-speech synthesis have shown the benefit of using a continuous pitch estimate; one that interpolates fundamental frequency (F0) even when voicing is not present. However, continuous F0 is still sensitive to additive noise in speech signals and suffers from short-term errors (when it changes rather quickly over time). To alleviate these issues, three adaptive techniques have been developed in this article for achieving a robust and accurate F0: (1) we weight the pitch estimates with state noise covariance using adaptive Kalman-filter framework, (2) we iteratively apply a time axis warping on the input frame signal, (3) we optimize all F0 candidates using an instantaneous-frequency-based approach. Additionally, the second goal of this study is to introduce an extension of a novel continuous-based speech synthesis system (i.e., in which all parameters are continuous). We propose adding a new excitation parameter named Harmonic-to-Noise Ratio (HNR) to the voiced...
wiln e § ujski 1 D frnislv qerzov 2 D m¡ s q¡ or gsp¡ o 3 D ldo heli¡ 1 D hilip xF qrner 4 D elek... more wiln e § ujski 1 D frnislv qerzov 2 D m¡ s q¡ or gsp¡ o 3 D ldo heli¡ 1 D hilip xF qrner 4 D eleksndr qjoreski 2 D hvid quenne 5 D orn svnovski 2 D eleksndr welov 2 D q¡ ez x¡ emeth 3 D en tojkovi¡ 2 D nd qy¤ orgy zsz¡ k 3 1 pulty of ehnil ienesD niversity of xovi dD eri [email protected] 2 pulty of iletril ingineering nd snformtion ehnologies niversity of sF gyril nd wethodiusD kopjeD wedoni
Speech synthesis is an important modality in Cognitive Infocommunications, which is the intersect... more Speech synthesis is an important modality in Cognitive Infocommunications, which is the intersection of informatics and cognitive sciences. Statistical parametric methods have gained importance in speech synthesis recently. The speech signal is decomposed to parameters and later restored from them. The decomposition is implemented by speech coders. We apply a novel codebook-based speech coding method to model the excitation of speech. In the analysis stage the speech signal is analyzed frame-by-frame and a codebook of pitch synchronous excitations is built from the voiced parts. Timing, gain and harmonic-tonoise ratio parameters are extracted and fed into the machine learning stage of Hidden Markov-model based speech synthesis. During the synthesis stage the codebook is searched for a suitable element in each voiced frame and these are concatenated to create the excitation signal, from which the final synthesized speech is created. Our initial experiments show that the model fits well in the statistical parametric speech synthesis framework and in most cases it can synthesize speech in a better quality than the traditional pulse-noise excitation. (This paper is an extended version of [10].
Keywords: speech synthesis, text-to-speech conversion, hidden Markov model. This article presents the technology of hidden Markov model based text-to-speech synthesis and its adaptation to the Hungarian language. This approach has several advantages: it can produce good-quality speech while requiring only a small database, and in principle it also makes it possible to modify the character and style of the voice and to teach the system to express emotions.
In this paper we present a novel machine learning approach usable for text labeling problems. We illustrate the importance of the problem for text-to-speech systems and, through them, for telecommunication applications. We introduce the proposed method and demonstrate its effectiveness on the problem of language identification, using three different training sets and large test corpora.
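As a hedged illustration of the language-identification task (a textbook baseline, not the paper's own method), a character-bigram profile classifier already captures the idea of labeling text by language:

```python
from collections import Counter

def bigram_profile(text, k=200):
    """Top-k character-bigram profile of a text (padded with spaces
    so word-initial and word-final bigrams are captured)."""
    text = f" {text.lower()} "
    grams = Counter(text[i:i + 2] for i in range(len(text) - 1))
    return {g for g, _ in grams.most_common(k)}

def train(corpora):
    """corpora: {language: training text} -> {language: bigram profile}."""
    return {lang: bigram_profile(txt) for lang, txt in corpora.items()}

def identify(models, text):
    """Label the text with the language whose profile overlaps most
    with the text's own bigram profile."""
    probe = bigram_profile(text)
    return max(models, key=lambda lang: len(models[lang] & probe))
```

Even with tiny training texts this separates dissimilar languages; production systems use larger n-gram ranges and probabilistic scoring.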
Currently available speech recognisers do not usually work well with elderly speech. This is because several characteristics of speech (e.g. fundamental frequency, jitter, shimmer and harmonics-to-noise ratio) change with age, and because the acoustic models used by speech recognisers are typically trained on speech collected from younger adults only. To develop speech-driven applications capable of successfully recognising elderly speech, this type of speech data is needed for training acoustic models from scratch or for adapting acoustic models trained on younger adults' speech. However, the availability of suitable elderly speech corpora is still very limited. This paper describes an ongoing project to design, collect, transcribe and annotate large elderly speech corpora for four European languages: Portuguese, French, Hungarian and Polish. The Portuguese, French and Polish corpora contain read speech only, whereas the Hungarian corpus also contains spontaneous command-and-control type speech. Depending on the language in question, the corpora contain 76 to 205 hours of speech collected from 328 to 986 speakers aged 60 and over. The final corpora will come with manually verified orthographic transcriptions, as well as annotations for filled pauses, noises and damaged words.
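Two of the age-dependent voice measures mentioned above, jitter and shimmer, have simple textbook ("local") definitions: the mean absolute difference between consecutive pitch periods (or per-period peak amplitudes), normalised by the mean. A minimal sketch, with inputs assumed to be already-extracted period lengths and amplitudes:

```python
def local_jitter(periods):
    """Local jitter: mean absolute difference of consecutive pitch
    periods, divided by the mean period length."""
    diffs = [abs(a - b) for a, b in zip(periods, periods[1:])]
    return (sum(diffs) / len(diffs)) / (sum(periods) / len(periods))

def local_shimmer(amplitudes):
    """Local shimmer: the same measure applied to the peak amplitudes
    of consecutive pitch periods."""
    diffs = [abs(a - b) for a, b in zip(amplitudes, amplitudes[1:])]
    return (sum(diffs) / len(diffs)) / (sum(amplitudes) / len(amplitudes))
```

A perfectly steady voice gives zero for both; aged voices typically score higher on each.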
Speech technology has been an area of intensive research worldwide, including Hungary, for several decades. This paper gives a short overview of the challenges and results of the domain, and also introduces a vision of the development and application of the technology.
The PaeLife project is a European industry-academia collaboration whose goal is to provide the elderly with easy access to online services that make their life easier and encourage their continued participation in society. To reach this goal, the project partners are developing a multimodal virtual personal life assistant (PLA) offering a wide range of services from weather information to social networking. This paper presents the multimodal architecture of the PLA, the services provided by the PLA, and the work done in the area of speech input and output modalities, which play a key role in the application.
IEEE Journal of Selected Topics in Signal Processing, 2014
Statistical parametric text-to-speech synthesis is optimized for regular voices and may not create high-quality output for speakers who frequently produce irregular phonation. A number of excitation models have been proposed recently in the hidden Markov model speech synthesis framework, but few of them deal with the occurrence of this phenomenon. The baseline system of this study is our previous residual-codebook-based excitation model, which uses frames of pitch-synchronous residuals. To model the irregular voice typically occurring at phrase boundaries or sentence endings, two alternative extensions are proposed. The first, rule-based method applies pitch halving, amplitude scaling of residual periods with random factors, and spectral distortion. The second, data-driven approach uses a corpus of residuals extracted from irregularly phonated vowels, and unit selection is applied during synthesis. In perception tests of short speech segments, both methods have been found to improve on the baseline excitation in preference and similarity to the original speaker. An acoustic experiment has shown that both methods can synthesize irregular voice that is close to original irregular phonation in terms of open quotient. The proposed methods may contribute to building natural, expressive and personalized speech synthesis systems.
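The rule-based extension can be caricatured in a few lines. This sketch only approximates the described pitch halving and random amplitude scaling of residual periods: pitch halving is imitated by strongly attenuating every second period, spectral distortion is omitted, and all names are illustrative rather than taken from the paper.

```python
import random

def irregularize(residual_periods, scale_range=(0.3, 1.0), seed=None):
    """Toy rule-based irregular-voice transform on a list of
    pitch-synchronous residual periods (each a list of samples).

    Every period receives a random gain; every second period is
    additionally attenuated, roughly imitating pitch halving."""
    rng = random.Random(seed)
    out = []
    for i, period in enumerate(residual_periods):
        gain = rng.uniform(*scale_range)
        if i % 2 == 1:
            gain *= 0.1  # strongly attenuate alternate periods
        out.append([gain * s for s in period])
    return out
```

With the ranges above, attenuated periods are always quieter than their neighbours, which is the audible cue of the glottalized, irregular voice quality being modeled.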
Papers by Géza Németh