Aleksandar Kostic

Papers by Aleksandar Kostic

Electronic Corpus of Serbian Language from 12th to 18th Century

Đorđe Kostić, at that time the director of the Institute, initiated a project of machine translation and automatic speech and text recognition. Professor Kostić, who directed this project, was of the opinion that the problem of automatic text and speech recognition could not be solved algorithmically and that a probabilistic approach was more plausible. A probabilistic approach, however, required the compilation of an annotated corpus that would provide precise probability estimates for all aspects of language, from phonology to syntax. This was the impetus for compiling the Corpus of Serbian Language. The Corpus is diachronic, encompassing the Serbian language from the 12th century to the contemporary language, with 11 million words, each word manually annotated for its grammatical status.

There were two principal aspects of this project. On the one hand, it was necessary to build a system of annotation; on the other hand, the corpus had to be representative in order to provide reliable probability estimates. The annotation implied that each word in the corpus should be defined in terms of its grammatical (i.e. morphological) status. Available grammars of the Serbian (Serbo-Croatian) language did not satisfy this requirement because in some instances they could not provide a strict specification for each word in the corpus (e.g. constituents of some complex verb tenses). To solve this problem, a team of linguists expanded the repertoire of standard grammars, building a system of annotation that distinguished about 2500 grammatical forms.

In order to properly represent the contemporary Serbian language, the Corpus encompassed five functional styles of written language: novels and essays (126 books), poetry (215 books), daily press (Politika), scientific texts (136 books) and political texts. Each sample consisted of about one million words, so the sample of contemporary Serbian language amounted to about five million words in total. This, however, was just one segment of the Corpus. As noted, the Corpus is diachronic and is divided into five periods. In addition to the sample of the contemporary language, it includes the Serbian language from the 12th to the 18th century, the Serbian language of the 18th century and the first half of the 19th century, the complete works of Vuk St. Karadžić, and the Serbian language of the second part of the 19th century. The Corpus was manually annotated at the level of inflectional morphology, with several stages of text preparation and annotation. The final goal was to compile a series of frequency dictionaries that would serve as a probabilistic basis for automatic speech and text recognition.
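As a rough illustration of what a frequency dictionary built from an annotated corpus might look like, the following minimal Python sketch counts (word form, grammatical tag) pairs and converts the counts to relative frequencies. The function name, the tag labels and the toy sample are assumptions made for illustration only; they are not the Corpus's actual annotation scheme or data.

```python
from collections import Counter


def frequency_dictionary(annotated_tokens):
    """Map each (word form, tag) pair to its relative frequency in the sample."""
    counts = Counter(annotated_tokens)
    total = sum(counts.values())
    # An empty sample yields an empty dictionary (no division takes place).
    return {pair: n / total for pair, n in counts.items()}


# Hypothetical toy sample: the tags are invented for illustration and do not
# reproduce the roughly 2500 grammatical forms distinguished in the Corpus.
sample = [
    ("kuća", "N.nom.sg"),
    ("kuće", "N.gen.sg"),
    ("kuća", "N.nom.sg"),
    ("kuće", "N.nom.pl"),
]

freq = frequency_dictionary(sample)
```

Keeping the word form and the tag together as one key is what lets the same surface form (here "kuće") receive separate probability estimates for each of its grammatical readings.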
