Papers by Jelke Bloem

I use a discourse-annotated corpus to demonstrate a new method for identifying potential discourse markers. Discourse markers are often identified manually, but particularly for natural language processing purposes, it is useful to have a more objective, data-driven method of identification. I link this task to the task of identifying co-occurrences of words and constructions, a task where statistical association measures are often used to compute association strengths. I then apply a statistical association measure to the task of discourse marker identification, and present results for several discourse relation types. While the results are noisy due to the limited availability of corpus data, they appear usable after manual correction or as a feature in a classifier. Furthermore, the results highlight a few types of lexical discourse relation cues that are not traditionally considered discourse markers, but still have a clear association with particular discourse relation types.
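The abstract does not specify which association measure is applied, so the following is only an illustrative sketch of this family of measures: it scores a candidate word against a discourse relation type with Dunning's log-likelihood ratio, a standard co-occurrence association measure. All counts in the example are hypothetical.

```python
import math

def log_likelihood_ratio(k11, k12, k21, k22):
    """Dunning's log-likelihood ratio over a 2x2 contingency table.

    k11: candidate word in segments bearing the relation type
    k12: candidate word in segments without it
    k21: all other words in segments bearing the relation type
    k22: all other words in segments without it
    """
    def ll(k, n, p):
        # Binomial log-likelihood of k successes in n trials at rate p,
        # clamped away from 0 and 1 to avoid log(0).
        p = min(max(p, 1e-12), 1.0 - 1e-12)
        return k * math.log(p) + (n - k) * math.log(1.0 - p)

    n1, n2 = k11 + k21, k12 + k22
    p1, p2 = k11 / n1, k12 / n2
    p = (k11 + k12) / (n1 + n2)
    return 2.0 * (ll(k11, n1, p1) + ll(k12, n2, p2)
                  - ll(k11, n1, p) - ll(k12, n2, p))

# Hypothetical counts: 'however' inside vs. outside contrast relations.
print(log_likelihood_ratio(120, 380, 4880, 94620))
```

Words can then be ranked by this score per relation type, with the top of the list containing candidate discourse markers for manual inspection or use as classifier features.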

Proceedings of the Workshop on Computational Linguistics for Linguistic Complexity
This work investigates the application of a measure of surprisal to modeling a grammatical variation phenomenon between near-synonymous constructions. We investigate a particular variation phenomenon, word order variation in Dutch two-verb clusters, where it has been established that word order choice is affected by processing cost. Several multifactorial corpus studies of Dutch verb clusters have used other measures of processing complexity to show that this factor affects word order choice. This previous work allows us to compare the surprisal measure, which is based on constraint satisfaction theories of language modeling, to those previously used measures, which are more directly linked to empirical observations of processing complexity. Our results show that surprisal does not predict the word order choice by itself, but is a significant predictor when used in a measure of uniform information density (UID). This lends support to the view that human language processing is facilitated not so much by predictable sequences of words but more by sequences of words in which information is spread evenly.
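As a rough illustration of the quantities involved: surprisal is the negative log-probability of a word given its context, and one common operationalization of uniform information density is the variance of per-word surprisal, with lower variance meaning information is spread more evenly. The paper's exact UID formulation may differ; this is a minimal sketch with made-up probabilities.

```python
import math

def surprisal(prob):
    """Surprisal in bits: -log2 P(word | context)."""
    return -math.log2(prob)

def uid_variance(word_probs):
    """Variance of per-word surprisal over a clause; lower values
    indicate a more uniform spread of information."""
    s = [surprisal(p) for p in word_probs]
    mean = sum(s) / len(s)
    return sum((x - mean) ** 2 for x in s) / len(s)

# Hypothetical conditional probabilities from a language model:
print(uid_variance([0.20, 0.15, 0.25, 0.18]))  # fairly even
print(uid_variance([0.90, 0.01, 0.80, 0.02]))  # peaks and troughs
```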

We examine a case of word order variation where speakers choose between two near-synonymous constructions partly on the basis of the processing complexity of the construction and its context. When producing two-verb clusters in Dutch, a speaker can choose between two word orders. Previous corpus studies have shown that a wide range of factors are associated with this word order variation. We conducted a large-scale corpus study in order to discover what these factors have in common. The underlying generalization appears to be processing complexity: we show that a variety of factors that are related to verbal cluster word order can also be related to the processing complexity of the cluster's context. This implies that one of the word orders might be easier to process: when processing load is high, speakers will go for the easier option. Therefore, we also investigate which of the two word orders might be easier to process. By testing for associations with factors indicating a higher or lower processing complexity of the verb and its context, we find evidence for the hypothesis that the word order where the main verb comes last is easier to process.

This study discusses evaluation methods for linguists to use when employing an automatically annotated treebank as a source of linguistic evidence. While treebanks are usually evaluated with a general measure over all the data, linguistic studies often focus on a particular construction or a group of structures. To judge the quality of linguistic evidence in this case, it would be beneficial to estimate annotation quality over all instances of a particular construction. I discuss the relative advantages and disadvantages of four approaches to this type of evaluation: manual evaluation of the results, manual evaluation of the text, falling back to simpler annotation, and searching for particular instances of the construction. Furthermore, I illustrate the approaches using an example from Dutch linguistics, two-verb cluster constructions, and estimate precision and recall for this construction on a large automatically annotated treebank of Dutch. From this, I conclude that a combination of approaches on samples from the treebank can be used to estimate the accuracy of the annotation for the construction of interest. This allows researchers to make more definite linguistic claims on the basis of data from automatically annotated treebanks.
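A hedged sketch of how the first two approaches yield estimates: precision can be estimated by manually judging a random sample of automatically retrieved construction instances, and recall by manually annotating a text sample and counting how many of the true instances the automatic annotation found. The numbers below are hypothetical.

```python
def estimate_precision(judgements):
    """'Evaluating the results': fraction of a random sample of
    automatically retrieved construction instances judged correct."""
    return sum(judgements) / len(judgements)

def estimate_recall(true_instances, retrieved_instances):
    """'Evaluating the text': of the construction instances found by
    hand in a text sample, the fraction the annotation retrieved."""
    return retrieved_instances / true_instances

# Hypothetical samples: 94 of 100 retrieved instances judged correct;
# 88 of 92 manually found instances were retrieved automatically.
print(estimate_precision([1] * 94 + [0] * 6))  # 0.94
print(estimate_recall(92, 88))                 # ~0.957
```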
This study discusses lexical preferences as a factor affecting the word order variation in Dutch verbal clusters. There are two grammatical word orders for Dutch two-verb clusters, with no clear meaning difference. Using the method of collostructional analysis, I find significant associations between specific verbs and word orders, and argue that these associations must be encoded in the lexicon as lexical preferences. In my data, the word orders also show some semantic associations, indicating that there might be a meaning difference after all. Based on these findings, I conclude that both word orders are stored in the lexicon as constructions.
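Collostructional analysis conventionally measures the association between a lexeme and a construction with a Fisher exact test over a 2x2 contingency table, reporting -log10 of the p-value as collostruction strength. The sketch below assumes that standard setup; the counts in the example are hypothetical.

```python
import math
from scipy.stats import fisher_exact

def collostruction_strength(verb_in_order, verb_in_other_order,
                            rest_in_order, rest_in_other_order):
    """Association between one verb and one cluster word order via a
    2x2 Fisher exact test, reported as -log10(p) as is conventional
    in collostructional analysis."""
    _, p = fisher_exact([[verb_in_order, verb_in_other_order],
                         [rest_in_order, rest_in_other_order]])
    return -math.log10(p) if p > 0 else math.inf

# Hypothetical counts for one verb across the two cluster orders:
print(collostruction_strength(310, 90, 14690, 9910))
```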

In this work, we demonstrate the application of statistical measures from dialectometry to the study of accented English speech. This new methodology enables a more quantitative approach to the study of accents. Studies on spoken dialect data have shown that a combination of representativeness (the difference between pronunciations within the language variety is small) and distinctiveness (the difference between pronunciations inside and outside the variety is large) is a good way to identify characteristic features of a language variety. We applied this method from dialectology to transcriptions of words from the Speech Accent Archive, while treating L2 English speakers with different L1s as 'varieties'. This yields lists of words that are pronounced characteristically differently in comparison to native accents of English. We discuss English accent characteristics for French, Hungarian and Dutch. We compare the French and Hungarian results to phonological descriptions of those languages to identify the source of the difference. The Dutch results are compared to a judgement study of Dutch accents to evaluate the measure. Knowing about these characteristic features of accents has useful applications in teaching L2 learners of English, since potentially difficult sounds or sound combinations can be identified and addressed based on the learner's native language.
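A hedged sketch of the two criteria: for each word, collect pronunciation distances between speakers within the variety and between speakers inside and outside it. The published dialectometric measures normalize and combine these quantities differently; the simple combination used here is purely illustrative.

```python
def mean(xs):
    return sum(xs) / len(xs)

def representativeness(within_distances):
    """High when pronunciations of a word within the variety are
    close together (negated mean within-variety distance)."""
    return -mean(within_distances)

def distinctiveness(within_distances, outside_distances):
    """High when the word is pronounced differently inside vs.
    outside the variety."""
    return mean(outside_distances) - mean(within_distances)

def characteristicness(within, outside):
    """Illustrative combination: characteristic words score high on
    both criteria."""
    return representativeness(within) + distinctiveness(within, outside)

# Hypothetical per-word distances for one word:
within = [0.10, 0.12, 0.09]          # accented speakers vs. each other
outside = [0.35, 0.40, 0.32, 0.38]   # accented vs. other speakers
print(characteristicness(within, outside))
```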
We aim to demonstrate that agent-based models can be a useful tool for historical linguists, by modeling the historical development of verbal cluster word order in Germanic languages. Our results show that the current order in German may have developed due to increased use of subordinate clauses, while the English order is predicted to be influenced by the grammaticalization of the verb 'to have'. The methodology we use makes few assumptions, making it broadly applicable to other phenomena of language change.
In this study we investigate which factors affect the degree of non-native accent of L2 speakers of English who learned English in school and, in most cases, lived for some time in an anglophone setting. We use data from the Speech Accent Archive containing over 700 speakers with almost 160 different native languages. We show that besides several important predictors, including the age of English onset and length of anglophone residence, the linguistic distance between the speaker's native language and English is a significant predictor of the degree of non-native accent in pronunciation. This study extends an earlier study which only focused on Indo-European L2 learners of Dutch and used a general speaking proficiency measure.

PLoS ONE, 2014
Wieling, M., J. Nerbonne, J. Bloem, C. Gooskens, W. Heeringa, and R. H. Baayen
In this study we develop pronunciation distances based on naive discriminative learning (NDL). Measures of pronunciation distance are used in several subfields of linguistics, including psycholinguistics, dialectology and typology. In contrast to the commonly used Levenshtein algorithm, NDL is grounded in cognitive theory of competitive reinforcement learning and is able to generate asymmetrical pronunciation distances. In a first study, we validated the NDL-based pronunciation distances by comparing them to a large set of native-likeness ratings given by native American English speakers when presented with accented English speech. In a second study, the NDL-based pronunciation distances were validated on the basis of perceptual dialect distances of Norwegian speakers. Results indicated that the NDL-based pronunciation distances matched perceptual distances reasonably well with correlations ranging between 0.7 and 0.8. While the correlations were comparable to those obtained using the Levenshtein distance, the NDL-based approach is more flexible as it is also able to incorporate acoustic information other than sound segments.
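NDL rests on the Rescorla-Wagner learning rule; the sketch below shows one update step under that rule, with illustrative cue and outcome names. The pronunciation distances themselves are then derived from the learned association weights, a step not reproduced here.

```python
def rescorla_wagner_step(weights, cues, outcomes, observed,
                         alpha=0.1, beta=0.1, lam=1.0):
    """One Rescorla-Wagner update, the learning rule underlying NDL.
    `weights` maps (cue, outcome) pairs to association strengths;
    `cues` are the cues present in this learning event and `observed`
    is the outcome that actually occurred."""
    for outcome in outcomes:
        target = lam if outcome == observed else 0.0
        # Total activation of this outcome from all present cues.
        activation = sum(weights.get((cue, outcome), 0.0) for cue in cues)
        delta = alpha * beta * (target - activation)
        for cue in cues:
            weights[(cue, outcome)] = weights.get((cue, outcome), 0.0) + delta

# Toy learning events: letter bigram cues predicting word outcomes.
weights = {}
for _ in range(100):
    rescorla_wagner_step(weights, cues={"#t", "th", "he", "e#"},
                         outcomes={"the", "tea"}, observed="the")
print(weights[("th", "the")])
```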

In this work, we discuss the benefits of using automatically parsed corpora to study language variation. The study of language variation is an area of linguistics in which quantitative methods have been particularly successful. We argue that the large datasets that can be obtained using automatic annotation can help drive further research in this direction, providing sufficient data for the increasingly complex models used to describe variation. We demonstrate this by replicating and extending a previous quantitative variation study that used manually and semi-automatically annotated data. We show that while the study cannot be replicated completely due to limitations of the existing automatic annotation, we can draw at least the same conclusions as the original study. In addition, we demonstrate the flexibility of this method by extending the findings to related linguistic constructions and to another domain of text, using additional data.
With an eye toward measuring the strength of foreign accents in American English, we evaluate the suitability of a modified version of the Levenshtein distance for comparing (the phonetic transcriptions of) accented pronunciations. Although this measure has been used successfully inter alia to study the differences among dialect pronunciations, it has not been applied to studying foreign accents. Here, we use it to compare the pronunciation of non-native English speakers to native American English speech. Our results indicate that the Levenshtein distance is a valid native-likeness measurement, as it correlates strongly (r = -0.81) with the average "native-like" judgments given by more than 1000 native American English raters.
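The classic dynamic-programming Levenshtein distance, plus one common modification, length normalization, is sketched below. The paper's exact modification may differ (for instance, normalizing by alignment length or using graded segment distances instead).

```python
def levenshtein(a, b):
    """Classic dynamic-programming Levenshtein distance between two
    phonetic transcriptions (sequences of segments)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]

def normalized_levenshtein(a, b):
    """One common modification: normalize by the longer transcription
    so that words of different lengths are comparable."""
    return levenshtein(a, b) / max(len(a), len(b))

# Hypothetical IPA transcriptions of an accented vs. native token:
print(normalized_levenshtein("tɪŋk", "θɪŋk"))  # 0.25
```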

We present an automatic animacy classifier for Dutch that can determine the animacy status of nouns - how alive the noun's referent is (human, inanimate, etc.). Animacy is a semantic property that has been shown to play a role in human sentence processing, felicity and grammaticality. Although animacy is not marked explicitly in Dutch, we expect knowledge about animacy to be helpful for parsing, translation and other NLP tasks. Only a few animacy classifiers and animacy-annotated corpora exist internationally. For Dutch, animacy information is only available in the Cornetto lexical-semantic database. We augment this lexical information with context information from the Dutch Lassy Large treebank, to create training data for an animacy classifier that uses a novel kind of context features.
We use the k-nearest neighbour algorithm with distributional lexical features, e.g. how frequently the noun occurs as a subject of the verb 'to think' in a corpus, to decide on the (predominant) animacy class. The size of the Lassy Large corpus makes this possible, and the high level of detail these word association features provide results in accurate Dutch-language animacy classification.
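A minimal sketch of such a classifier, assuming sparse count vectors of syntactic co-occurrence features. The feature names and the cosine-based neighbour ranking here are illustrative; the actual system's features and distance metric may differ.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two sparse feature dicts."""
    dot = sum(u[k] * v.get(k, 0.0) for k in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def knn_classify(noun_features, training_data, k=5):
    """Assign the majority animacy class among the k most similar
    training nouns. Features are e.g. counts of how often the noun
    occurs as subject of particular verbs in a parsed corpus."""
    neighbours = sorted(training_data,
                        key=lambda item: cosine(noun_features, item[0]),
                        reverse=True)[:k]
    return Counter(label for _, label in neighbours).most_common(1)[0][0]

# Hypothetical training nouns with subject-of-verb count features:
train = [({"subj_of:denken": 12, "subj_of:vallen": 1}, "human"),
         ({"subj_of:denken": 0, "subj_of:vallen": 7}, "inanimate"),
         ({"subj_of:denken": 9, "subj_of:vallen": 0}, "human")]
print(knn_classify({"subj_of:denken": 5}, train, k=3))  # 'human'
```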
Proceedings of KONVENS 2012
Crowdsourcing has become an important means for collecting linguistic data. However, the output of web-based experiments is often challenging in terms of spelling, grammar and out-of-dictionary words, and is therefore hard to process with standard NLP tools. Instead of the common practice of discarding data outliers that seem unsuitable for further processing, we introduce an approach that tunes NLP tools such that they can reliably clean and process noisy data collected for a narrow but unknown domain. We demonstrate this by modifying a spell-checker and building a coreference resolution tool to process data for paraphrasing and script learning, and we reach state-of-the-art performance where the original state-of-the-art tools fail.
Conference Presentations by Jelke Bloem
http://www.clarin.eu/sites/default/files/cac2014_submission_34_0.pdf

In this work, we discuss the benefits of using automatically parsed corpora to study language variation.
The study of language variation is an area of linguistics in which quantitative methods have been particularly successful. We argue that the large datasets that can be obtained using automatic annotation can help drive further research in this direction, providing sufficient data for the increasingly complex models used to describe variation. We demonstrate this by replicating and extending a previous quantitative variation study that used manually and semi-automatically annotated data.
We show that while the study cannot be replicated completely due to limitations of the existing automatic annotation, we can draw at least the same conclusions as the original study. In addition, we demonstrate the flexibility of this method by extending the findings to related linguistic constructions and to another domain of text, using additional data.

Word order changes and grammaticalization in Germanic verbal clusters
In this work, we model the historical development of verbal cluster order in Germanic languages. While there is an ongoing debate on the syntactic structure of these clusters, we created a simple model of surface patterns in which we view each order as a separate outcome, with a probability distribution over the outcomes. This type of modeling lets us explore the diverging development of verbal clusters in these languages, taking a reconstruction of the state of verbal clusters in Proto-Germanic as a starting point. The models converge from their manually defined, Proto-Germanic initial probability distribution to a state in which probabilities are distributed based on the features of the model. We then compare the resulting model output with actual Germanic language texts to see how well we have modeled the real state of these languages. We show that the interaction of basic probabilistic choices of constructions with shifting input and shifting preference of constructions may be a key to understanding different word orders as observed in the Germanic languages.
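A hedged sketch of this style of model: each cluster order is an outcome with a probability, and each generation resamples orders from the current distribution, nudged by per-order weights standing in for the model's features. The order labels, the bias mechanism, and the resampling scheme here are all illustrative, not the paper's actual model.

```python
import random

def simulate(prob, generations=1000, sample_size=100, bias=None):
    """Iterate a simple probabilistic choice model: each generation
    samples cluster orders from the current distribution (weighted by
    illustrative per-order bias factors), then re-estimates the
    distribution from the sample. `prob` maps order labels to their
    initial probabilities, e.g. a reconstructed Proto-Germanic state."""
    orders = list(prob)
    bias = bias or {o: 1.0 for o in orders}
    for _ in range(generations):
        sample = random.choices(orders,
                                weights=[prob[o] * bias[o] for o in orders],
                                k=sample_size)
        counts = {o: sample.count(o) for o in orders}
        total = sum(counts.values())
        prob = {o: counts[o] / total for o in orders}
    return prob

# Hypothetical two-order system with a slight preference for one order:
print(simulate({"aux-main": 0.5, "main-aux": 0.5},
               bias={"aux-main": 1.05, "main-aux": 1.0}))
```

Under a small persistent bias, the distribution typically drifts toward one order over many generations, which is the kind of diverging development the model is meant to explore.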