Papers by Aleksandar Kostic

Đorđe Kostić, at that time the director of the Institute, initiated a project of machine translation and automatic speech and text recognition. Professor Kostić, who directed this project, was of the opinion that the problem of automatic text and speech recognition could not be solved algorithmically and that a probabilistic approach was more plausible. However, a probabilistic approach required the compilation of an annotated corpus that would provide precise probability estimates for all aspects of language, from phonology to syntax. This was the impetus that led to the compilation of the Corpus of Serbian Language. The Corpus is diachronic, encompassing the Serbian language from the 12th century to the contemporary language, with 11 million words, each word manually annotated for its grammatical status. There were two principal aspects of this project. On the one hand, it was necessary to build up the system of annotation, while on the other hand the corpus had to be representative in order to provide reliable probability estimates. The annotation implied that each word in the corpus should be defined in terms of its grammatical (i.e. morphological) status. Available grammars of the Serbian (Serbo-Croatian) language did not satisfy this requirement because in some instances they could not provide a strict specification for each word in the corpus (e.g. constituents of some complex verb tenses). To solve this problem, a team of linguists expanded the repertoire of standard grammars, building up a system of annotation that distinguished about 2500 grammatical forms. In order to properly represent the contemporary Serbian language, the Corpus encompassed five functional styles of written language: novels and essays (126 books), poetry (215 books), daily press (Politika), scientific texts (136 books) and political texts.
Each sample consisted of about one million words; in sum, the sample of contemporary Serbian language consisted of about five million words. This, however, was just one segment of the Corpus. As noted, the Corpus is diachronic, divided into five periods. In addition to the sample of the contemporary language it includes the Serbian language from the 12th to the 18th century, the Serbian language of the 18th century and the first half of the 19th century, the complete works of Vuk St. Karadžić, and the Serbian language of the second half of the 19th century. The Corpus was manually annotated at the level of inflectional morphology, with several stages of text preparation and annotation. The final goal was to compile a series of frequency dictionaries that would serve as a probabilistic basis for automatic speech and text recognition.

Psihologija, 2002
Processing of Serbian inflected verbs was investigated in two lexical decision experiments. In the first experiment subjects were presented with five forms of the future tense, while in the second experiment the same verbs were presented in three forms of the present and future tense. The outcome of the first experiment indicates that processing of an inflected verb is determined by the amount of information derived from the average probability per congruent personal pronoun of a particular verb form. This implies that the cognitive system is sensitive neither to verb person per se nor to the gender of the congruent personal pronoun. Results of the second experiment show that for verb forms of different tenses, presented in the same experiment, the amount of information has to be additionally modulated by tense probability. Such an outcome speaks in favor of the cognitive relevance of verb tense.

Psihologija, 2009
The aim of the present study is to establish criteria for the optimal size of a corpus that can provide stable conditional probabilities of morphological and/or syntagmatic types. The optimal corpus size is defined as the smallest sample that generates a probability distribution equal to the distribution derived from a large sample that generates stable probabilities. We refer to the latter distribution as the 'target distribution'. To establish these criteria we varied the sample size, the word sequence size (bigrams and trigrams), the sampling procedure (randomly chosen words and continuous text) and the position of the target word in a sequence. The distributions of conditional probabilities derived from smaller samples were correlated with the target distributions. The sample size at which the probability distribution reaches maximal correlation (r = 1) with the target distribution was taken as optimal. The research was done on Corpus of Serbian languag...

Psihologija, 2008
A reliable language corpus implies a text sample of size n that provides stable probability distributions of linguistic phenomena. The question is what the minimal (i.e. the optimal) text size is at which probabilities of linguistic phenomena become stable. Specifically, we were interested in the probabilities of grammatical forms. We started with an a priori assumption that a text size of 1,000,000 words is sufficient to provide stable probability distributions. We treated a text of this size as a "quasi-population". The probability distribution derived from the "quasi-population" was then correlated with the probability distribution obtained from a minimal sample (32 items) for a given linguistic category (e.g. nouns). The correlation coefficient was treated as a measure of similarity between the two probability distributions. The minimal sample was increased by geometric progression, up to the size where correlation between distribution derived from the quasi-population and th...
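
The procedure described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the doubling ratio, the random sampling without replacement, the function names and the toy case labels are all assumptions made for illustration, and because the abstract is truncated before its stopping rule, the sketch simply reports the correlation at every sample size rather than choosing a cutoff.

```python
import math
import random
from collections import Counter


def distribution(items, categories):
    """Relative frequency of each category, in a fixed category order."""
    counts = Counter(items)
    total = len(items)
    return [counts[c] / total for c in categories]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # a flat distribution carries no correlational signal
    return cov / math.sqrt(vx * vy)


def correlation_by_sample_size(population, start=32, ratio=2, seed=0):
    """Correlate sample distributions with the quasi-population distribution.

    Sample sizes grow geometrically (start, start * ratio, ...) for as long
    as they fit in the population. Returns a list of (size, r) pairs.
    The ratio of 2 is an assumption; the abstract only says "geometric".
    """
    rng = random.Random(seed)
    categories = sorted(set(population))
    target = distribution(population, categories)
    results = []
    size = start
    while size <= len(population):
        sample = rng.sample(population, size)
        r = pearson(distribution(sample, categories), target)
        results.append((size, r))
        size *= ratio
    return results


# Toy quasi-population of hypothetical noun case labels (not corpus data).
toy_rng = random.Random(1)
toy_population = toy_rng.choices(
    ["nom", "gen", "dat", "acc", "ins", "loc"],
    weights=[30, 25, 5, 25, 5, 10],
    k=20000,
)

for size, r in correlation_by_sample_size(toy_population):
    print(size, round(r, 3))
```

Printing the full series of correlations, rather than a single "optimal" size, keeps the sketch neutral about the criterion the study actually used.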

… Investigations In Theoretical …, 2008
Milena Jakić¹, Aleksandar Kostić¹, and Dušica Filipović-Đurđević¹,² ¹Laboratory of Experimental Psychology, Faculty of Philosophy, University of Belgrade, ²Department of Psychology, Faculty of Philosophy, University of ... Experiments will also test the predictions of this theory. ...

Psihologija, 2008
It has been shown that while multiple unrelated meanings of a word (e.g. bank) increase processing latency, polysemy, that is, multiple related word senses (e.g. paper), produces faster responses (Rodd, Gaskell & Marslen-Wilson, 2002; Klepousniotou, 2002). The goal of this study was to explore the effect of polysemy on word processing in Serbian. The outcomes of three lexical decision experiments have shown that polysemous words are processed faster. In addition, lemma frequency and number of related senses did not interact. Finally, a measure that combines lemma frequency and number of related senses into a single metric is proposed. This measure is the information residual, initially applied to derivational morphology (Moscoso del Prado Martín, Kostic & Baayen, 2004). In this study the information residual is the difference between the amount of information (in bits) derived from lemma frequency and the entropy of a polysemic cluster. Since relative frequencies of different word senses of a given word in Serbian are currently not available, maximum entropy (log N) was used as an approximation. The outcome of this study indicates that the cognitive system is sensitive not only to the entropy of derivational clusters, but to that of polysemic clusters as well.
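
As a worked illustration of the measure described in this abstract, the sketch below computes an information residual as the information carried by a lemma minus the maximum entropy of its sense cluster. Several points are assumptions rather than details from the abstract: lemma probability is taken as lemma count divided by corpus size, both logarithms use base 2 so that the result is in bits, and the function names are hypothetical.

```python
import math


def information_bits(lemma_count, corpus_size):
    """Amount of information, in bits, carried by a lemma: -log2(p)."""
    if not 0 < lemma_count <= corpus_size:
        raise ValueError("lemma_count must be in (0, corpus_size]")
    return -math.log2(lemma_count / corpus_size)


def max_entropy_bits(n_senses):
    """Maximum entropy of a cluster of n equiprobable senses: log2(n)."""
    if n_senses < 1:
        raise ValueError("n_senses must be at least 1")
    return math.log2(n_senses)


def information_residual(lemma_count, corpus_size, n_senses):
    """Information from lemma frequency minus the maximum sense entropy."""
    return information_bits(lemma_count, corpus_size) - max_entropy_bits(n_senses)


# Invented numbers: a lemma seen 1024 times in 2**20 tokens, with 4 senses.
# Information is 10 bits and maximum entropy is 2 bits.
print(information_residual(1024, 2**20, 4))  # 8.0
```

With the frequency held constant, a larger number of senses lowers the residual, which is the direction in which the abstract reports faster processing of polysemous words.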

Analogy in grammar: Form …, 2009
In this study we investigated the consequences for word recognition of artificially manipulating the probability distribution of word senses in polysemous words. For this purpose, we performed three visual lexical decision experiments using Serbian polysemous words with controlled distributions of word senses. Our main prediction is that the difference between the experimentally induced distribution of senses of a word and its distribution of senses as observed in language will influence word recognition. Forty-five participants performed three phases of our experiment: Study Phase, Test Phase 1, and Test Phase 2. During the Study Phase, participants performed a self-paced reading-aloud task on a list of sentences containing Serbian polysemous words. In Test Phase 1, immediately after the Study Phase, participants performed visual lexical decision for the ambiguous words that were manipulated in the Study Phase. Finally, on the day immediately following Test Phase 1, participants pe...

Psihologija, 2008
The aim of this study was to investigate whether there is a systematic distinction between associate pairs that constitute categories of lexical relations (e.g. synonyms, antonyms, hyponyms etc.) and categories of associate pairs that have no obvious lexical relation. Proportions of categories of associates were estimated on 80 nouns from the "Associative Dictionary of Serbian Language" (Piper, Dragicevic & Stefanovic, 2005), while frequencies of associates were estimated from the "Frequency Dictionary of Contemporary Serbian Language" (Kostic, Dj., 1999). Categories of associates were divided into two groups: a group of categories that included standard lexical relations and a group that included idiosyncratic associates. Proportions of categories were analyzed with respect to a) the frequency of the noun for which associates were generated and b) whether it was an abstract or a concrete noun. Three measures were used to estimate the proportion of categories: a) number of associates, b) ...

Psihologija, 2006
Changes in probability distributions of individual words and word types were investigated in two samples of daily press spanning fifty years. One sample was derived from the Corpus of Serbian Language (CSL) (Kostic, Dj., 2001) and covers the period between 1945 and 1957; the other was derived from the Ebart Media Documentation (EBR), compiled from seven daily newspapers and five weekly magazines from 2002 and 2003. Each sample consisted of about 1 million words. The obtained results indicate that nouns and adjectives were more frequent in the CSL sample, while verbs and prepositions were more frequent in the EBR sample, suggesting a decrease in sentence length over the last five decades. Conspicuous changes in the probability distribution of individual words were observed for nouns and adjectives, while minimal or no changes were observed for verbs and prepositions. Such an outcome suggests that nouns and adjectives are most susceptible to d...

Psihologija, 2011
In this study we addressed three issues concerning semantic and associative relatedness between two words and how they prime each other. The first issue is whether there is a priming effect of semantic relatedness over and above the effect of associative relatedness. The second issue is how differences in semantic overlap between two words affect priming. In order to specify the semantic overlap we introduce five relation types that differ in the number of common semantic components. Three relation types (synonyms, antonyms and hyponyms) represent semantic relatedness, while two relation types represent associative relatedness, with negligible or no semantic relatedness. Finally, the third issue addressed in this study is whether there is a symmetric priming effect if we swap the position of prime and target, i.e. whether the direction of relatedness between two words affects priming. In two lexical decision experiments we presented five types of word pairs. In both experiments we obtain...

Psihologija, 2009
Previous research demonstrated that processing time was facilitated by the number of related word senses (polysemy) and inhibited by the number of unrelated word meanings (homonymy). This research starts from the findings described by Moscoso del Prado Martín and colleagues, who offered a unique account of the processing of two forms of lexical ambiguity. Applying the techniques they proposed, we calculated the ambiguity measures they introduced for a set of strictly polysemous Serbian nouns. Based on the covariance matrix of the context vectors, we derived the entropy of the equivalent Gaussian distribution, and based on the probability density function of the context vectors, we derived the differential entropy. Negentropy was calculated as the difference between the two. Based on the interpretation that the entropy of the equivalent Gaussian mirrors sense cooperation, or polysemy, while negentropy mirrors meaning competition, or homonymy, we predicted that in the set of strictly polysemous nouns, negentrop...