This paper reports findings from an analysis of errors made by an automatic speech recogniser trained and tested on speech from European Portuguese children aged 3-10. We expected, and were able, to identify frequent pronunciation error patterns in the children's speech. Furthermore, we were able to correlate some of these pronunciation error patterns with automatic speech recognition errors. The findings reported in this paper are of phonetic interest but will also be useful for improving the performance of automatic speech recognisers aimed at children representing the target population of the study.
This work presents a phonologically and phonetically based description of the regional speech of Beira Interior, Portugal. In particular, the performance of a post-stressed vowel system is shown. We also describe the solutions adopted to build a phonological-phonetic inventory. We explore the view that Quantal Theory and Optimality Theory can support the phonological representation of phone inventories, which could describe languages by using their frequency distribution and provide a feasible alternative to be incorporated in the development of speech systems. The resulting phone inventory is supported by statistical measures obtained from the analysis of an oral corpus. The major goal is to present the allophones that are more or less perceived in the speech continuum, in order to provide a code that an algorithm can match in a language model for Portuguese speech processing, including regional varieties. Our belief is that formalised knowledge of pronunciation variants will also be acquired by analysing very large amounts of data in the future and will ultimately contribute to improving Portuguese multi-pronunciation modelling, taking minority speech into account.
In this paper we describe our work in building an online tool for manually annotating texts in any spoken language with SignWriting in any sign language. The existence of such a tool will allow the creation of parallel corpora between spoken and sign languages that can be used to bootstrap the creation of efficient tools for the Deaf community. As an example, a parallel corpus between English and American Sign Language could be used for training Machine Learning models for automatic translation between the two languages. Clearly, this kind of tool must be designed to ease the task of human annotators, not only by being easy to use, but also by giving smart suggestions as the annotation progresses, in order to save time and effort. By building a collaborative, online, easy-to-use annotation tool for building parallel corpora between spoken and sign languages, we aim to help the development of proper resources for sign languages that can then be used in state-of-the-art...
This paper presents research results on new syllable structures for the Portuguese language. The location of syllable boundaries is a well-known problem with no consensual solution in Portuguese, mainly when acoustic-phonetic constraints are taken into account. Based on various acoustic-phonetic restrictions of Portuguese speech, new syllable splitting rules were studied in a corpus of 400K Portuguese words. All these words were automatically converted into sequences of C (consonants) and V (vowels) and related to the new syllable structures. When the syllable structures were summarised statistically, some mappings confirmed our expectations, others apparently ran against them, and others appeared to be new. These results provide clues about Portuguese syllabification architecture and can also guide the development of improved complementary models for syllable processing in speech applications.
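The conversion of words into C/V sequences mentioned in this abstract can be illustrated with a minimal Python sketch. It is not the authors' procedure: it works on spelling rather than on phones, and the vowel set `VOWELS` and the helper names `cv_pattern` and `pattern_counts` are assumptions made only for illustration.

```python
from collections import Counter

# Assumption: an orthographic vowel set covering the accented vowels of
# Portuguese. The study itself applies acoustic-phonetic criteria, which a
# letter-based mapping like this one cannot capture.
VOWELS = set("aeiouáàâãéêíóôõúü")


def cv_pattern(word: str) -> str:
    """Map each letter to 'V' (vowel) or 'C' (consonant), skipping non-letters."""
    return "".join(
        "V" if ch in VOWELS else "C"
        for ch in word.lower()
        if ch.isalpha()
    )


def pattern_counts(words):
    """Count how often each C/V pattern occurs in a list of words."""
    return Counter(cv_pattern(w) for w in words)
```

Counting the patterns over a word list gives the kind of frequency view the abstract describes, for example `pattern_counts(["casa", "mesa", "pão"])`.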
Hesitations, so-called disfluencies, are a characteristic of spontaneous speech, playing a primary role in its structure and reflecting aspects of language production and the management of inter-communication. In this paper we present HESITA, a database of hesitations in European Portuguese speech, as a relevant basis for studying a variety of speech phenomena. Patterns of hesitations, hesitation distribution according to speaking style, and phonetic properties of the fillers are some of the characteristics we derived from the HESITA database. This database also represents an important resource for improving the naturalness of synthetic speech as well as for robust acoustic modelling in automatic speech recognition. The HESITA database is the output of a project in the speech-processing field for European Portuguese, carried out by an interdisciplinary group that closely combines engineering tools and experience with a linguistic approach.
This paper describes a project with the overall goal of providing a natural interaction communication platform accessible and adapted for all users, especially people with speech impairments and the elderly, by sharing knowledge between Industry and Academia. The platform will adopt the principles of natural user interfaces such as speech, silent speech, gestures and pictograms, among others, and will provide a set of services that allow easy access to social networks, friends and remote family members, thus helping to overcome the social exclusion of people with special needs or impairments. These features will be applied in the context of serious games, virtual reality environments and assisted living scenarios. The project will be executed in the scope of the Marie Curie Action Industry-Academia Partnerships and Pathways and will bring together the knowledge of five partners from three different countries: Portugal, Spain and Turkey. This synergy will b...
This paper introduces the LetsRead Corpus of European Portuguese read speech from children aged 6 to 10. The motivation for creating this corpus stems from the lack of databases with recordings of reading tasks by Portuguese children with different performance levels that include all the common reading-aloud disfluencies. Such a corpus is also essential for developing techniques to fulfill the main objective of the LetsRead project: to automatically evaluate the reading performance of children through the analysis of reading tasks. The collected data amounts to 20 hours of speech from 284 children from private and public Portuguese schools, with each child carrying out two tasks: reading sentences and reading a list of pseudowords, both with varying levels of difficulty throughout the school grades. In this paper, the design of the reading tasks presented to children is described, as well as the collection procedure. Manually annotated data is analyzed according to disfluencies...
This study proposes a model for a phonological description of the speech patterns attested in the variety of Portuguese spoken in the Beira Interior region (in Fundão, particularly). Our goal was to present the main phone prototypes that could be considered in the description of the Portuguese language, taking minority speech into account in particular. Based on analytical work, a spontaneous speech database was collected in order to establish the set of pertinent features in that variety. In accordance with the so-called Functionalist Theory, sounds are considered in terms of the distinctive features through which speech is perceived (and produced), which are correlated with an optimal center-of-gravity region. Therefore, our approach explored the view that Categorical Perception, as well as Quantal Theory and Optimality Theory, could support the phonological system and (allo)phone inventories.
The automatic evaluation of children’s reading performance by detecting and analyzing errors and disfluencies in speech is an important tool for building automatic reading tutors and for complementing the current method of manual evaluation of overall reading ability in schools. A large amount of speech from children reading aloud, rich in errors and disfluencies, is needed to train acoustic, disfluency and pronunciation models for an automatic reading assessment system. This paper describes the acquisition and analysis of a read-aloud speech database of European Portuguese from children aged 6-10, from the first to fourth school grades. Towards the goal of detecting all reading errors and disfluencies, we apply a decoding process to the utterances using flexible word-level lattices that allow syllable-based false starts and repetitions of two or more word sequences. The proposed method proved promising in detecting corrections and repetitions in sentences, and provides an improved alignment of the data, helpful for future annotation tasks. The analysis of the database also shows agreement with government-defined curricular goals for reading.
This paper presents a study of European Portuguese elderly speech, in which the acoustic characteristics of two groups of elderly speakers (aged 60-75 and over 75) are compared with those of young adult speakers (aged 19-30). The correlation between age and a set of 14 acoustic features was investigated, and decision trees were used to establish the relative importance of the features. A greater use of pauses characterized speakers aged 60 and over. For female speakers, speech rate also appeared to correlate with age. For male speakers, jitter distinguished between speakers aged 60-75 and older. The correlation between the features and speech recognition performance was also investigated. Word error rate correlated mostly with the use of pauses, speech rate, and the ratio of long phone realizations. Finally, by comparing the phone sequences used by the recognizer on the most frequent words, we observed that the young adult speakers reduced schwas more than the elderly speakers. This result seems to confirm the common idea that young speakers reduce articulation more than older speakers. Further investigation is needed to confirm this result by determining whether this is due to ageing or to the generation gap.
IEEE/ACM Transactions on Audio, Speech, and Language Processing
This paper proposes an approach to automatically parse children's reading of sentences by detecting word pronunciations and extra content, and to classify words as correctly or incorrectly pronounced. This approach can be directly helpful for automatic assessment of reading level or for automatic reading tutors, where a correct reading must be identified. We propose a first segmentation stage to locate candidate word pronunciations based on allowing repetitions and false starts of a word's syllables. A decoding grammar based solely on syllables allows silence to appear during a word pronunciation. At a second stage, word candidates are classified as mispronounced or not. The feature that best classifies mispronunciations is found to be the log-likelihood ratio between a free phone loop and a word spotting model in the very close vicinity of the candidate segmentation. Additional features are combined in multifeature models to further improve classification, including normalizations of the log-likelihood ratio, derivations from phone likelihoods, and Levenshtein distances between the correct pronunciation and the phonemes recognized through two phoneme recognition approaches. Results show that most extra events were detected (achieving a word error rate close to 2%) and that using automatic segmentation for mispronunciation classification approaches the performance of manual segmentation. Although the log-likelihood ratio from a spotting approach is already a good metric to classify word pronunciations, the combination of additional features provides a relative reduction of the miss rate of 18% (from 34.03% to 27.79% using manual segmentation and from 35.58% to 29.35% using automatic segmentation, at a constant 5% false alarm rate).
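One of the features named in this abstract is the Levenshtein distance between the correct pronunciation and the recognized phonemes. A minimal sketch of the standard dynamic-programming edit distance over phoneme sequences follows, assuming unit costs for insertion, deletion and substitution; the paper's actual cost scheme is not given here, and the function name `levenshtein` and the phone symbols in the usage note are illustrative only.

```python
def levenshtein(reference, hypothesis):
    """Edit distance between two sequences (e.g. lists of phoneme symbols).

    Assumption: insertions, deletions and substitutions all cost 1.
    """
    # prev[j] holds the distance between the reference prefix processed so far
    # and the first j symbols of the hypothesis.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_sym in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_sym in enumerate(hypothesis, start=1):
            substitution = prev[j - 1] + (0 if ref_sym == hyp_sym else 1)
            deletion = prev[j] + 1       # reference symbol missing in hypothesis
            insertion = curr[j - 1] + 1  # extra symbol in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]
```

For instance, comparing a target phone sequence such as `["k", "a", "z", "a"]` with a recognized sequence `["k", "a", "s", "a"]` yields a distance of 1, which could then be normalized by the target length before being used as a classifier feature.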
Evaluating children's reading aloud proficiency is typically a task done by teachers on an individual basis, where reading time and wrong words are marked manually. A computational tool that assists with recording reading tasks, automatically analyzing them and outputting performance-related metrics could be a significant help to teachers. Working towards that goal, this work presents an approach to automatically predict the overall reading aloud ability of primary school children by employing automatic speech processing methods. Reading tasks were designed around sentences and pseudowords, so as to obtain complementary information from the two distinct assignments. A dataset was collected with recordings of 284 children aged 6-10 years reading in native European Portuguese. The most common disfluencies identified include intra-word pauses, phonetic extensions, false starts, repetitions, and mispronunciations. To automatically detect reading disfluencies, we first target extra events by employing task-specific lattices for decoding that allow syllable-based false starts as well as repetitions of words and sequences of words. Then, mispronunciations are detected based on the log-likelihood ratio between the recognized and target words. The opinions of primary school teachers were gathered as ground truth of overall reading aloud performance; the teachers provided 0-5 scores closely related to the expected performance at the end of each grade. To predict these scores, various features were extracted by automatic annotation and regression models were trained. Gaussian process regression proved to be the most successful approach. Feature selection from both sentence and pseudoword tasks gives the closest predictions, with a correlation of 0.944 compared to the teachers' grading.
Compared to the use of manual annotation, where the best models give a correlation of 0.949, there was a relative decrease of only 0.5% when using automatic annotations to extract features. The error rate of predicted scores relative to ground truth also proved to be smaller than the deviation of evaluators' opinions per child.
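The relative decrease quoted here can be checked with a few lines of Python. This is only an arithmetic illustration using the two correlation values stated in the abstract; the helper name `relative_decrease` is an assumption.

```python
def relative_decrease(reference: float, value: float) -> float:
    """Relative decrease of `value` with respect to `reference`, as a fraction."""
    return (reference - value) / reference


# Correlations reported in the abstract: manual vs. automatic annotation.
manual_correlation = 0.949
automatic_correlation = 0.944

decrease_percent = 100 * relative_decrease(manual_correlation, automatic_correlation)
print(f"{decrease_percent:.1f}%")
```

The result, about 0.53%, is consistent with the "only 0.5%" stated in the text.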
To automatically evaluate the performance of children reading aloud or to follow a child's reading in reading tutor applications, different types of reading disfluencies and mispronunciations must be accounted for. In this work, we aim to detect most of these disfluencies in sentence and pseudoword reading. Detecting incorrectly pronounced words, and quantifying the quality of word pronunciations, is arguably the hardest task. We approach the challenge as a two-step process. First, a segmentation using task-specific lattices is performed, while detecting repetitions and false starts and providing candidate segments for words. Then, candidates are classified as mispronounced or not, using multiple features derived from likelihood ratios based on phone decoding and forced alignment, as well as additional meta-information about the word. Several classifiers were explored (linear fit, neural networks, support vector machines) and trained after a feature selection stage to avoid overfitting. Improved results are obtained using feature combination compared to using only the log-likelihood ratio of the reference word (22% versus 27% miss rate at a constant 5% false alarm rate).
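The operating point used in this abstract, miss rate at a constant 5% false alarm rate, can be sketched as follows. This is an illustrative reading, not the authors' evaluation code: it assumes that higher scores mean "more likely mispronounced", that mispronounced words are the target class, and that the threshold is picked from the observed scores. The name `miss_rate_at_false_alarm` is hypothetical.

```python
def miss_rate_at_false_alarm(scores, labels, max_false_alarm=0.05):
    """Return (threshold, false_alarm_rate, miss_rate).

    scores: classifier outputs, higher = more likely mispronounced (assumption).
    labels: True for mispronounced words (targets), False for correct ones.
    A word is flagged as mispronounced when its score >= threshold.
    """
    targets = [s for s, y in zip(scores, labels) if y]
    non_targets = [s for s, y in zip(scores, labels) if not y]
    if not targets or not non_targets:
        raise ValueError("need at least one example of each class")

    # The false alarm rate can only fall as the threshold rises, so the first
    # threshold that satisfies the limit also gives the lowest miss rate.
    for threshold in sorted(set(scores)):
        false_alarm = sum(s >= threshold for s in non_targets) / len(non_targets)
        if false_alarm <= max_false_alarm:
            miss = sum(s < threshold for s in targets) / len(targets)
            return threshold, false_alarm, miss

    # No observed score meets the limit: flag nothing, so every target is missed.
    return float("inf"), 0.0, 1.0
```

Comparing two feature sets then amounts to calling this function on each classifier's scores and comparing the returned miss rates, as in the 22% versus 27% figures above.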
Reading aloud performance in children is typically assessed by teachers on an individual basis, manually marking reading time and incorrectly read words. A computational tool that assists with recording reading tasks, automatically analyzing them and providing performance metrics could be a significant help. Towards that goal, this work presents an approach to automatically predicting the overall reading aloud ability of primary school children (6-10 years old), based on the reading of sentences and pseudowords. The opinions of primary school teachers were gathered as ground truth of performance; the teachers provided 0-5 scores closely related to the expectations at the end of each grade. To predict these scores automatically, features based on reading speed and number of disfluencies were extracted after automatic disfluency detection. Various regression models were trained, with Gaussian process regression giving the best results for automatic features. Feature selection from both sentence and pseudoword reading tasks gave the closest predictions, with a correlation of 0.944. Compared to the use of manual annotation, for which the best correlation was 0.952, automatic annotation was only 0.8% worse. Furthermore, the error rate of predicted scores relative to ground truth was found to be smaller than the deviation of evaluators' opinions per child.
The automatic evaluation of children's reading performance is an important alternative to manual or one-on-one evaluation by teachers or tutors. To do this, it is necessary to detect several types of reading miscues. This work presents an approach to annotating read speech while detecting false starts, repetitions and mispronunciations, three of the most common disfluencies. Using speech data of 6-10-year-old children reading sentences and pseudowords, we apply a two-step process: first, an automatic alignment is performed to get the best possible word-level segmentation and to detect syllable-based false starts and word repetitions by using a strict FST (Finite State Transducer); then, words are classified as mispronounced or not through a likelihood measure of pronunciation that uses phone posterior probabilities estimated by a neural network. This work advances towards obtaining the amount and severity of disfluencies in order to provide a reading ability score computed from several sentence reading tasks.
Proceedings of the 17th International ACM SIGACCESS Conference on Computers & Accessibility - ASSETS '15, 2015
Gaze information has the potential to benefit Human-Computer Interaction (HCI) tasks, particularly when combined with speech. Gaze can improve our understanding of the user's intention as a secondary input modality, or it can be used as the main input modality by users with some level of permanent or temporary impairment. In this paper we describe a multimodal HCI system prototype which supports speech, gaze and the combination of both. The system has been developed for Active Assisted Living scenarios.
To evaluate the reading performance of children, human assessment is usually involved, where a teacher or tutor has to take time to individually estimate the performance in terms of fluency (speed, accuracy and expression). Automatic estimation of reading ability can be an important alternative or complement to the usual methods, and can improve other applications such as e-learning. Techniques must be developed to analyse audio recordings of utterances read by children and detect the deviations from the intended correct reading, i.e., disfluencies. For that goal, a database of 284 European Portuguese children from 6 to 10 years old (1st-4th grades) reading aloud, amounting to 20 hours, was collected in private and public Portuguese schools. This paper describes the design of the reading tasks as well as the data collection procedure. The presence of different types of disfluencies is analysed, as well as reading performance compared to known curricular goals.
It is my duty to express my gratitude to all those who contributed to the completion of this work. I pay grateful tribute to the wise counsel of Professor Telmo dos Santos Verdelho, on whose ready help I was able to rely throughout this work. To Professor Jorge Morais Barbosa, for the generous scholarly and human welcome he has always given me, now as on earlier occasions, granting me access to his library and to his knowledge on the various occasions when he made himself available to discuss with me different aspects of this dissertation. Research work does not always prove smooth and encouraging, but at every moment I found in the Professor a dedicated, enlightened, sensitive and patient supervisor. To my teacher and friend, my sincere thanks. A word of thanks is also due to my teachers at the Faculdade de Letras da Universidade de Coimbra, from whom I received, throughout my university education, invaluable lessons in humanism and wisdom. To my fellow linguists, my thanks for their encouragement and availability. To my friends, very special thanks for the sensitivity with which they kindly encouraged me. To my parents, unsurpassable in everything, and to my husband, for the deeply supportive involvement with which he followed the making of this thesis, there are no words or dedications that can convey my gratitude. To Henrique and Rafael, my sons, to whom I also dedicate this work, because they went without, with great understanding, a more present mother. By way of context, this work is accompanied by four maps, of which numbers two and three are somewhat retouched and aesthetically improved reproductions of the 1897 and 1929 maps drawn up by Leite de Vasconcellos.
This paper reports findings from an analysis of errors made by an automatic speech recogniser tra... more This paper reports findings from an analysis of errors made by an automatic speech recogniser trained and tested with 3-10-year-old European Portuguese children's speech. We expected and were able to identify frequent pronunciation error patterns in the children's speech. Furthermore, we were able to correlate some of these pronunciation error patterns and automatic speech recognition errors. The findings reported in this paper are of phonetic interest but will also be useful for improving the performance of automatic speech recognisers aimed at children representing the target population of the study.
This work presents a phonologically and phonetically based description for Portuguese Beira Inter... more This work presents a phonologically and phonetically based description for Portuguese Beira Interior regional speech. In particular, the performance of a poststressed vowel system is shown. We also describe the solutions adopted to build a phonological-phonetic inventory. We explore the view that Quantal Theory and Optimality Theory can support the phonological representation of phone inventories, which could describe languages by using their frequency distribution and providing a feasible alternative to be incorporated in the development of speech systems. The resulting phone inventory is supported by statistical measures obtained from the analysis of an oral corpus. The major goal is to present the allophones which are more or less perceived in speech continuum in order to give a code with which an algorithm will match in a language model of Portuguese speech processing, including regional varieties. Our belief is that formalised knowledge of pronunciation variants will also be acquired by analysing very large amounts of data in the future and will ultimately contribute to improving Portuguese multi-pronunciation modelling, taking minority speech into account.
In this paper we describe our work in building an online tool for manually annotating texts in an... more In this paper we describe our work in building an online tool for manually annotating texts in any spoken language with SignWriting in any sign language. The existence of such tool will allow the creation of parallel corpora between spoken and sign languages that can be used to bootstrap the creation of efficient tools for the Deaf community. As an example, a parallel corpus between English and American Sign Language could be used for training Machine Learning models for automatic translation between the two languages. Clearly, this kind of tool must be designed in a way that it eases the task of human annotators, not only by being easy to use, but also by giving smart suggestions as the annotation progresses, in order to save time and effort. By building a collaborative, online, easy to use annotation tool for building parallel corpora between spoken and sign languages we aim at helping the development of proper resources for sign languages that can then be used in state-of-the-art...
This paper presents research results on new syllable structures for the Portuguese language. The ... more This paper presents research results on new syllable structures for the Portuguese language. The location of the syllable boundaries is a well known problem of nonconsensual resolution in Portuguese, mainly when the acoustic-phonetic constraints are taken into account. Based on various acoustic-phonetic restrictions of Portuguese speech, new syllable splitting rules were studied in a corpus of 400K Portuguese words. All these words were automatically converted into sequences of C (consonants) and V (vowels) and also related to the new syllable structures. When syllable structures were outlined in a statistical form, some mapping confirmed our expectations, others apparently ran against the expectations and others appeared to be new. These results provide clues about Portuguese syllabification architecture and can also guide the development of improved complementary models for syllable processing in speech applications.
Hesitations, so-called disfluencies, are a characteristic of spontaneous speech, playing a primar... more Hesitations, so-called disfluencies, are a characteristic of spontaneous speech, playing a primary role in its structure, reflecting aspects of the language production and the management of inter-communication. In this paper we intend to present a database of hesitations in European Portuguese speech HESITA as a relevant base of work to study a variety of speech phenomena. Patterns of hesitations, hesitation distribution according to speaking style, and phonetic properties of the fillers are some of the characteristics we extrapolated from the HESITA database. This database also represents an important resource for improvement in synthetic speech naturalness as well as in robust acoustic modelling for automatic speech recognition. The HESITA database is the output of a project in the speech-processing field for European Portuguese held by an interdisciplinary group in intimate articulation between engineering tools and experience and the linguistic approach.
Ab stract. This paper describes a project with the overall goal of providing a natural interactio... more Ab stract. This paper describes a project with the overall goal of providing a natural interaction communication platform accessible and adapted for all users, especially for people with speech impairments and elderly, by sharing knowledge between Industry and Academia. The platform will adopt the principles of natural user interfaces such as speech, silent speech, gestures, pictograms, among others, and will provide a set of services that allow easy access to social networks, friends and remote family members, thus contributing to overcome social-exclusion of people with special needs or impairments. Application of these features will be performed in the context of serious games, virtual reality environments and assisted living scenarios. The project will be executed in the scope of the Marie Curie Action Industry-Academia Partnerships and Pathways and will bring together the knowledge of five partners, from three different countries, Portugal, Spain and Turkey. This synergy will b...
This paper presents a study of European Portuguese elderly speech, in which the acoustic characte... more This paper presents a study of European Portuguese elderly speech, in which the acoustic characteristics of two groups of elderly speakers (aged 60-75 and over 75) are compared with those of young adult speakers (aged 19-30). The correlation between age and a set of 14 acoustic features was investigated, and decision trees were used to establish the relative importance of the features. A greater use of pauses characterized speakers aged 60 and over. For female speakers, speech rate also appeared to correlate with age. For male speakers, jitter distinguished between speakers aged 60-75 and older. The correlation between the features and speech recognition performance was also investigated. Word error rate correlated mostly with the use of pauses, speech rate, and the ratio of long phone realizations. Finally, by comparing the phone sequences used by the recognizer on the most frequent words, we observed that the young adult speakers reduced schwas more than the elderly speakers. This...
This paper introduces the LetsRead Corpus of European Portuguese read speech from 6 to 10 years o... more This paper introduces the LetsRead Corpus of European Portuguese read speech from 6 to 10 years old children. The motivation for the creation of this corpus stems from the inexistence of databases with recordings of reading tasks of Portuguese children with different performance levels and including all the common reading aloud disfluencies. It is also essential to develop techniques to fulfill the main objective of the LetsRead project: to automatically evaluate the reading performance of children through the analysis of reading tasks. The collected data amounts to 20 hours of speech from 284 children from private and public Portuguese schools, with each child carrying out two tasks: reading sentences and reading a list of pseudowords, both with varying levels of difficulty throughout the school grades. In this paper, the design of the reading tasks presented to children is described, as well as the collection procedure. Manually annotated data is analyzed according to disfluencies...
This study proposes a model for a phonological description of the speech patterns attested in the Portuguese language variety spoken in the Beira Interior region (in Fundão, particularly). Our goal was to present the main phone prototypes that could be considered in the description of the Portuguese language, taking minority speech into account in particular. Based on analytic work, a spontaneous speech database was collected in order to establish the pertinent feature set of that variety. In accordance with the so-called Functionalist Theory, sounds are considered in light of the fact that speech is perceived (and produced) in terms of those distinctive features, which are correlated with an optimal-center-of-gravity region. Therefore, our approach explored the view that Categorical Perception, as well as Quantal Theory and Optimality Theory, could support the phonological system and (allo)phone inventories.
The automatic evaluation of children’s reading performance by detecting and analyzing errors and disfluencies in speech is an important tool to build automatic reading tutors and to complement the current method of manual evaluations of overall reading ability in schools. A large amount of speech from children reading aloud, rich in errors and disfluencies, is needed to train acoustic, disfluency and pronunciation models for an automatic reading assessment system. This paper describes the acquisition and analysis of a read-aloud speech database of European Portuguese from children aged 6-10 from the first to fourth school grades. Towards the goal of detecting all reading errors and disfluencies, we apply a decoding process to the utterances using flexible word-level lattices that allow syllable-based false starts and repetitions of two or more word sequences. The proposed method proved promising in detecting corrections and repetitions in sentences, and provides an improved alignment of the data, helpful for future annotation tasks. The analysis of the database also shows agreement with government-defined curricular goals for reading.
This paper presents a study of European Portuguese elderly speech, in which the acoustic characteristics of two groups of elderly speakers (aged 60-75 and over 75) are compared with those of young adult speakers (aged 19-30). The correlation between age and a set of 14 acoustic features was investigated, and decision trees were used to establish the relative importance of the features. A greater use of pauses characterized speakers aged 60 and over. For female speakers, speech rate also appeared to correlate with age. For male speakers, jitter distinguished between speakers aged 60-75 and older. The correlation between the features and speech recognition performance was also investigated. Word error rate correlated mostly with the use of pauses, speech rate, and the ratio of long phone realizations. Finally, by comparing the phone sequences used by the recognizer on the most frequent words, we observed that the young adult speakers reduced schwas more than the elderly speakers. This result seems to confirm the common idea that young speakers reduce articulation more than older speakers. Further investigation is needed to confirm this result by determining whether this is due to ageing or to the generation gap.
IEEE/ACM Transactions on Audio, Speech, and Language Processing
This paper proposes an approach to automatically parse children's reading of sentences by detecting word pronunciations and extra content, and to classify words as correctly or incorrectly pronounced. This approach can be directly helpful for automatic assessment of reading level or for automatic reading tutors, where a correct reading must be identified. We propose a first segmentation stage to locate candidate word pronunciations based on allowing repetitions and false starts of a word's syllables. A decoding grammar based solely on syllables allows silence to appear during a word pronunciation. At a second stage, word candidates are classified as mispronounced or not. The feature that best classifies mispronunciations is found to be the log-likelihood ratio between a free phone loop and a word spotting model in the very close vicinity of the candidate segmentation. Additional features are combined in multifeature models to further improve classification, including: normalizations of the log-likelihood ratio, derivations from phone likelihoods, and Levenshtein distances between the correct pronunciation and recognized phonemes through two phoneme recognition approaches. Results show that most extra events were detected (close to 2% word error rate achieved) and that using automatic segmentation for mispronunciation classification approaches the performance of manual segmentation. Although the log-likelihood ratio from a spotting approach is already a good metric to classify word pronunciations, the combination of additional features provides a relative reduction of the miss rate of 18% (from 34.03% to 27.79% using manual segmentation and from 35.58% to 29.35% using automatic segmentation, at constant 5% false alarm rate).
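One of the features named in this abstract, the Levenshtein distance between the correct pronunciation and the recognized phonemes, can be sketched as follows. This is a minimal illustration only: the function name and the phone symbols in the usage note are hypothetical and are not taken from the paper, and the paper's actual feature extraction may differ (for example in how distances are normalized).

```python
def phoneme_edit_distance(reference, recognized):
    """Levenshtein distance between two phoneme sequences.

    Both arguments are sequences of phone symbols (e.g. lists of strings).
    Substitutions, insertions and deletions all cost 1 (an assumption).
    """
    # previous[j] holds the distance between the reference prefix processed
    # so far and the first j recognized phones.
    previous = list(range(len(recognized) + 1))
    for i, ref_phone in enumerate(reference, start=1):
        current = [i]  # distance from i reference phones to an empty sequence
        for j, rec_phone in enumerate(recognized, start=1):
            substitution = previous[j - 1] + (ref_phone != rec_phone)
            deletion = previous[j] + 1      # reference phone left unmatched
            insertion = current[j - 1] + 1  # extra recognized phone
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]
```

For instance, comparing the hypothetical sequences `["k", "a", "z", "6"]` and `["k", "a", "s", "6"]` counts a single substitution, while an empty recognized sequence costs one deletion per reference phone.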
Evaluating children's reading aloud proficiency is typically a task done by teachers on an individual basis, where reading time and wrong words are marked manually. A computational tool that assists with recording reading tasks, automatically analyzing them and outputting performance-related metrics could be a significant help to teachers. Working towards that goal, this work presents an approach to automatically predict the overall reading aloud ability of primary school children by employing automatic speech processing methods. Reading tasks were designed around sentences and pseudowords, so as to obtain complementary information from the two distinct assignments. A dataset was collected with recordings of 284 children aged 6-10 years reading in native European Portuguese. The most common disfluencies identified include intra-word pauses, phonetic extensions, false starts, repetitions, and mispronunciations. To automatically detect reading disfluencies, we first target extra events by employing task-specific lattices for decoding that allow syllable-based false starts as well as repetitions of words and sequences of words. Then, mispronunciations are detected based on the log likelihood ratio between the recognized and target words. The opinions of primary school teachers, who provided 0-5 scores closely related to the expected performance at the end of each grade, were gathered as ground truth of overall reading aloud performance. To predict these scores, various features were extracted by automatic annotation and regression models were trained. Gaussian process regression proved to be the most successful approach. Feature selection from both sentence and pseudoword tasks gives the closest predictions, with a correlation of 0.944 compared to the teachers' grading.
Compared to the use of manual annotation, where the best models obtained give a correlation of 0.949, there was a relative decrease of only 0.5% for using automatic annotations to extract features. The error rate of predicted scores relative to ground truth also proved to be smaller than the deviation of evaluators' opinion per child.
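The 0.5% figure can be checked with a short calculation. The sketch below assumes that "relative decrease" means the drop in correlation divided by the manual-annotation correlation; the function name is illustrative and not from the paper.

```python
def relative_decrease_percent(baseline, value):
    """Relative decrease of `value` with respect to `baseline`, in percent."""
    return (baseline - value) / baseline * 100.0


# Correlations reported in the abstract: manual vs. automatic annotation.
manual_correlation = 0.949
automatic_correlation = 0.944

decrease = relative_decrease_percent(manual_correlation, automatic_correlation)
print(f"relative decrease: {decrease:.1f}%")
```

Under that assumption, the result rounds to the 0.5% stated in the abstract.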
To automatically evaluate the performance of children reading aloud or to follow a child's reading in reading tutor applications, different types of reading disfluencies and mispronunciations must be accounted for. In this work, we aim to detect most of these disfluencies in sentence and pseudoword reading. Detecting incorrectly pronounced words, and quantifying the quality of word pronunciations, is arguably the hardest task. We approach the challenge as a two-step process. First, a segmentation using task-specific lattices is performed, while detecting repetitions and false starts and providing candidate segments for words. Then, candidates are classified as mispronounced or not, using multiple features derived from likelihood ratios based on phone decoding and forced alignment, as well as additional meta-information about the word. Several classifiers were explored (linear fit, neural networks, support vector machines) and trained after a feature selection stage to avoid overfitting. Improved results are obtained using feature combination compared to using only the log likelihood ratio of the reference word (22% versus 27% miss rate at constant 5% false alarm rate).
Reading aloud performance in children is typically assessed by teachers on an individual basis, manually marking reading time and incorrectly read words. A computational tool that assists with recording reading tasks, automatically analyzing them and providing performance metrics could be a significant help. Towards that goal, this work presents an approach to automatically predicting the overall reading aloud ability of primary school children (6-10 years old), based on the reading of sentences and pseudowords. The opinions of primary school teachers, who provided 0-5 scores closely related to the expectations at the end of each grade, were gathered as ground truth of performance. To predict these scores automatically, features based on reading speed and number of disfluencies were extracted after automatic disfluency detection. Various regression models were trained, with Gaussian process regression giving the best results for automatic features. Feature selection from both sentence and pseudoword reading tasks gave the closest predictions, with a correlation of 0.944. Compared to the use of manual annotation, where the best correlation was 0.952, automatic annotation was only 0.8% worse. Furthermore, the error rate of predicted scores relative to ground truth was found to be smaller than the deviation of evaluators' opinion per child.
The automatic evaluation of children's reading performance is an important alternative to any manual or one-on-one evaluation by teachers or tutors. To do this, it is necessary to detect several types of reading miscues. This work presents an approach to annotate reading speech while detecting false starts, repetitions and mispronunciations, three of the most common disfluencies. Using speech data of 6–10 year old children reading sentences and pseudowords, we apply a two-step process: first, an automatic alignment is performed to get the best possible word-level segmentation and detect syllable-based false starts and word repetitions by using a strict FST (Finite State Transducer); then, words are classified as being mispronounced or not through a likelihood measure of pronunciation by using phone posterior probabilities estimated by a neural network. This work advances towards obtaining the amount and severity of disfluencies to provide a reading ability score computed from several sentence reading tasks.
Proceedings of the 17th International ACM SIGACCESS Conference on Computers & Accessibility - ASSETS '15, 2015
Gaze information has the potential to benefit Human-Computer Interaction (HCI) tasks, particularly when combined with speech. Gaze can improve our understanding of the user intention, as a secondary input modality, or it can be used as the main input modality by users with some level of permanent or temporary impairments. In this paper we describe a multimodal HCI system prototype which supports speech, gaze and the combination of both. The system has been developed for Active Assisted Living scenarios.
To evaluate the reading performance of children, human assessment is usually involved, where a teacher or tutor has to take time to individually estimate the performance in terms of fluency (speed, accuracy and expression). Automatic estimation of reading ability can be an important alternative or complement to the usual methods, and can improve other applications such as e-learning. Techniques must be developed to analyse audio recordings of utterances read by children and detect the deviations from the intended correct reading, i.e., disfluencies. For that goal, a database of 284 European Portuguese children from 6 to 10 years old (1st-4th grades) reading aloud, amounting to 20 hours of speech, was collected in private and public Portuguese schools. This paper describes the design of the reading tasks as well as the data collection procedure. The presence of different types of disfluencies is analysed, as well as reading performance compared to known curricular goals.
I must express my gratitude to all those who contributed to the completion of this work. I pay grateful tribute to the sound advice of Prof. Telmo dos Santos Verdelho, on whose ready help I was able to count throughout this work. To Prof. Jorge Morais Barbosa, for the generous scholarly and human welcome he always gave me, now as on earlier occasions, giving me access to his library and his knowledge on the various occasions when he made himself available to discuss with me several aspects of this dissertation. Research work does not always run smoothly or encouragingly, but at every moment I found in the Professor a dedicated, enlightened, sensitive and patient supervisor. To my teacher and friend, my deepest thanks. A word of thanks is equally due to my teachers at the Faculdade de Letras da Universidade de Coimbra, from whom I received, throughout my university education, invaluable lessons in humanism and wisdom. To my fellow linguists, my thanks for their encouragement and availability. To my friends, a very special thank-you for the sensitivity with which they kindly encouraged me. To my parents, unsurpassable in everything, and to my husband, for the deeply supportive involvement with which he accompanied the making of this thesis, there are no words or dedications that can express my gratitude. To Henrique and Rafael, my sons, to whom I also dedicate this work, because they did without a more present mother, with great understanding. By way of context, this work is accompanied by four maps, maps two and three being somewhat retouched and aesthetically improved reproductions of the maps of 1897 and 1929 drawn up by Leite de Vasconcellos.
Papers by Sara Candeias