Parallel Corpora Research Papers

The availability of partially overlapping parallel corpora for a language pair opens up opportunities for automatically comparing, evaluating and improving them. We compare and evaluate the alignment quality of two English-Estonian... more

- by Kaarel Veskis
- Corpus Linguistics, Parallel Corpora

Abstract: Text corpora are tools with a long tradition and numerous applications. Of all the existing types, this work focuses on one in particular: the aligned parallel corpus. Taking as a starting point a corpus... more

Rule-Based Machine Translation (RBMT) and Statistical Machine Translation (SMT) take different approaches to the translation task. RBMT uses linguistic rules between two languages, which are generally built manually by humans, whereas... more

The aim of this paper is to investigate Polish equivalents of English phrasal verbs as found in an English-Polish (E-P) parallel corpus PHRAVERB. Given the semantic idiosyncrasy exhibited by phrasal verbs, it is assumed that the... more

In this paper we present a method for term extraction that can be used in the classroom with translation students. The terms are extracted from a multilingual parallel corpus with the aid of a parallel concordancer, AntPConc. Our work is... more

Rafael Guzmán Tirado, Irina A. Votyakova (ed.), Granada 2013, Tipología léxica. Any form of reproduction, distribution, public communication or transformation of this work may only be carried out with the authorization of its rights holders,... more

5th Meeting of Greek-Speaking Translation Studies Scholars (5η Συνάντηση Ελληνόφωνων Μεταφρασεολόγων), Aristotle University of Thessaloniki (ΑΠΘ), 21-23 May 2015. In this paper we present a method for term extraction that can be used in the classroom with translation students. The terms are extracted from a multilingual parallel corpus with... more

- by eirini chatzikoumi and Elpiniki Margariti
- Terminology, Parallel Corpora, Medical Terminology, Medical translation

The paper discusses the main trends in the development of the parallel corpora within the RNC since 2015. The New languages section deals with seven new language pairs that emerged during this period, their architecture and tagging.... more

The article presents the analysis of etiquette formulas (forms of address, greetings and farewells) used between teachers of Russian as a foreign language and students studying Russian outside Russia. The survey was conducted among 100... more

- by Wojciech Sosnowski
- Russian Language, Forms of address, Parallel Corpora, Lacunarity

"This article presents a corpus-based study of the metaphorical and metonymical use of the words "head" and "heart," together with the Norwegian correspondents "hode" and "hjerte." The continuum between metaphor and metonymy is explored,... more

- by Susan Nacey
- Metaphor, Metonymy, Parallel Corpora, ENPC

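Abstract: In recent decades, corpus linguistics has become a pillar of modern applied linguistics. The aim of this paper is therefore to explore in greater depth some of the cognitive and operational steps of corpus building, in order to promote the idea of corpora among a wide range of researchers. The paper is divided into three main parts: 1. Corpus construction: theory and practice; 2. Levels of processing of corpus texts; 3. Annotation of corpus format attributes... more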

Using corpus data, the article examines the Russian construction of the type пошёл было in comparison with the Belarusian pluperfect (forms of the type пайшоў быў). Some features of the less-studied Belarusian form are identified, above all,... more

- by Dmitri Sitchinava
- Corpus Linguistics, Parallel Corpora

This paper presents a bilingual corpus-based study of the use of several nouns meaning ‘time’ or time units (‘hour’, ‘minute’, ‘moment’) in Bulgarian and Ukrainian. All matching instances of these words in a collection of parallel texts... more

- by Ivan Derzhanski
- Translation Studies, Translation, Bulgarian Language, Parallel Corpora

- by Niraj Aswani
- Word alignment, Parallel Corpora, Hybrid Approach, Edit Distance

Translation is a profession highly connected to technology, and for this reason, most of today's translators are in contact with a variety of tools, services and programs, such as word processors, e-mail, electronic dictionaries, among... more

In this thesis we describe and evaluate a tool for automatic generation of translations for multiword English terms into Spanish from a monolingual specialized Spanish corpus, compiled by means of web crawling. The resulting translations... more

- by Olya Novikova
- Machine Translation, Terminology, Lexical Semantics, Lexicography

The paper reports on a study based on the data drawn from such a corpus. The aim of the study was to find and examine the closest Polish translation equivalents of two semantically related verbs in Czech. The author starts with the... more

This paper concentrates on the verbal moods used after Spanish adverbs expressing potentiality (quizá(s), tal vez, probablemente, posiblemente). With the use of the corpus CREA, we sought to determine whether there is a preference for... more

- by Dana Kratochvílová
- Translation Studies, Spanish, Modality, Corpus Linguistics

We report on a project to annotate biblical texts in order to create an aligned multilingual Bible corpus for linguistic research, particularly computational linguistics, including automatically creating and evaluating translation... more

- by Andrejs Vasiļjevs
- Machine Translation, Parallel Corpora

We present HindEnCorp, a parallel corpus of Hindi and English, and HindMonoCorp, a monolingual corpus of Hindi in their release version 0.5. Both corpora were collected from web sources and preprocessed primarily for the training of... more

- by Vojta Diatka
- Computational Linguistics, Machine Translation, Corpora, Parallel Corpora

Historical linguistics, on its way toward establishing itself as an autonomous discipline, has been unable, or unwilling, to distance itself from the related currents that move and evolve within a more general linguistics and... more

We discuss the elative adjectival prefix _pre-_ in Bulgarian and Ukrainian, variously treated as derivative or inflexional by grammarians and lexicographers. Our investigation, performed on a bilingual corpus of parallel texts, shows that... more

- by Ivan Derzhanski and Olena Siruk
- Reduplication, Bulgarian Language, Parallel Corpora, Evaluative morphology

Canonical question tags feature prominently in spoken English, where they display great versatility. At face value they are meant to elicit a response from a co-participant in the form of (dis)agreement with the proposition to which the... more

- by Lieven Buysse
- Pragmatics, Tag Questions, Parallel Corpora, Contrastive Linguistics

Corpus linguistics is considered one of the foundational linguistic sciences that establish the concept of studying language in its natural environment, away from the logical linguistic analogy that prevailed in the field of linguistic studies for many centuries. Corpus linguistics, which was founded by the English linguist Leech in the second half of the twentieth century, is a science that investigates how to collect natural language texts, prepare them and encode them so that they are suitable for linguistic research and for the study of natural linguistic phenomena across the branches of linguistics, with its modern theories and applications. As one of the methodologies that pave the way for studying natural language objectively, corpus linguistics occupies an advanced position in the field of modern linguistics. The linguistic researcher cannot do without becoming acquainted with the concepts, techniques and applications of this science, and indeed with the ways of building corpora of various sizes to serve particular research purposes.
A corpus, according to the concepts of corpus linguistics, is a linguistic structure with technical specifications and standards that make it capable of accommodating language texts and making them available for general and specialized linguistic research. The texts contained in this structure should be subject to specific rules regarding collection methods, representation ratios, processing methods before and after collection, annotation methodologies, and the methods on which querying and retrieval operations are based according to the requirements of linguistic research...

The Algerian Arabic dialects are under-resourced languages, which lack both corpora and Natural Language Processing (NLP) tools, although they are increasingly used in written form, especially on social media and forums. We aim through... more

- by Mourad Abbas, Karima Meftouh, kamel smaili and slm hrrt
- Statistical Machine Translation, Arabic Dialects, Parallel Corpora

Accessing historical texts is often a challenge because readers either do not know the historical language, or they are challenged by the technological hurdle when such texts are available digitally. Merging corpus linguistic methods and... more

The texts in the RNC are aligned sentence by sentence. The texts kindly offered for use in the RNC by Adrian Barentsen and included in the multilingual Amsterdam Slavic Parallel Aligned Corpus are already aligned... more

Contrastive methods have long been employed in lexicography, in particular in bi- and multilingual dictionary projects. The main rationale for this is the necessity to comprehensively study, i.e. compare and contrast, two or more... more

- by Marek Łukasik
- Lexicology, Vocabulary, Terminology, Conceptual Modelling

- by Roman Roszko and Danuta Roszko
- Bulgarian Language, Parallel Corpora, Lithuanian language, Polish Language

The ACTRES Parallel Corpus (P-ACTRES 2.0) is a bidirectional English-Spanish corpus developed by the ACTRES research group. P-ACTRES 2.0 contains over 4 million words in both directions. From original English texts to their Spanish translations,... more

- by H. Sanjurjo-González
- Spanish, English, Comparable Corpora, Parallel Corpora

Automatic extraction of bilingual lexicons from parallel corpora has been recently exploited to overcome the knowledge acquisition bottleneck in a number of research areas in natural language processing, such as machine translation (MT)... more

This paper describes the first phase of the CEXI project at the University of Bologna in Forlì, involving the selection of the texts to be included in the corpus and decisions about the processing of these texts. The aim of the project is... more

- by Federico Zanettin
- Translation, Parallel Corpora, Corpus-Based Translation

The paper describes semantic properties of Perfect forms in European languages exemplified by a massive parallel corpus. A NeighbourNet distance graph for European Perfects is built. In a separate section, the English Perfect in the... more

- by Dmitri Sitchinava
- Perfect Tense, Parallel Corpora, English Perfect Tenses

In this study we examine the occurrences and correspondences of terms for blood kinship in a Bulgarian–Ukrainian parallel corpus of fiction. All instances of the terms selected for study, matching and non-matching, were located and... more

- by Ivan Derzhanski and Olena Siruk
- Translation Studies, Semantics, Slavic Languages, Corpus Linguistics

... reading and commenting on a draft of this paper.
2 There is no published account of this corpus; for an example of work with it, see ...
3 See ...

This volume presents investigations on topical issues in Computational Linguistics. It is intended for... more

- by Dmitri Sitchinava and Maria Shvedova
- Corpus Linguistics, Linguistic Typology, Ukrainian Linguistics, Parallel Corpora

The present paper is about the project of Russian Learner Translator Corpus, which is currently under development. The paper discusses the feasibility of such a corpus and existing analogues, describes the current status of corpus... more

- by Чепуркова Анна and Andrey Kutuzov
- Translation Studies, Corpus Linguistics, Learner corpora, Parallel Corpora

This paper presents a comparative bilingual corpus-based study of the use of several frequent temporal adverbs and adverbial expressions (‘always’, ‘sometimes’, ‘never’ and their synonyms) in Bulgarian and Ukrainian. The Ukrainian items... more

- by Ivan Derzhanski and Olena Siruk
- Translation Studies, Semantics, Slavic Languages, Corpus Linguistics

This study will examine the prefixed derivatives of the verb of motion (VoM) ходить and analyse their translations into German by focusing on the problem of determining the correct meaning of individual forms and possible irregularities in... more

This paper presents a comparison between Russian prefixed verbs of memory and their Italian equivalents. In particular, analysing a Russian-Italian parallel corpus, we observed the strategies used for the translation of these verbs from... more

Research in the Humanities is predominantly text-based. For centuries scholars have studied documents such as historical manuscripts, literary works, legal contracts, diaries of important personalities, old tax records etc. Manual analysis of such documents is still the dominant research paradigm in the Humanities. However, with the advent of the digital age this is increasingly complemented by approaches that utilise digital resources. More and more corpora are made available in digital form (theatrical plays, contemporary novels, critical literature, literary reviews etc.). This has a potentially profound impact on how research is conducted in the Humanities.
Digitised sources can be searched more easily than traditional, paper-based sources, allowing scholars to analyse texts more quickly and more systematically. Moreover, digital data can also be (semi-)automatically mined: important facts, trends and interdependencies can be detected, complex statistics can be calculated and the results can be visualised and presented to the scholars, who can then delve further into the data for verification and deeper analysis. Digitisation encourages empirical research, opening the road for completely new research paradigms that exploit 'big data' for humanities research. This has also given rise to Digital Humanities (or E-Humanities) as a new research area.
Digitisation is only a first step, however. In their raw form, electronic corpora are of limited use to humanities researchers. The true potential of such resources is only unlocked if corpora are enriched with different layers of linguistic annotation (ranging from morphology to semantics). While corpus annotation can build on a long tradition in (corpus) linguistics and computational linguistics, corpus and computational linguistics on the one side and the Humanities on the other side have grown apart over the past decades.
We believe that a tighter collaboration between people working in the Humanities and the research community involved in developing annotated corpora is now needed because, while annotating a corpus from scratch remains a labour-intensive and time-consuming task, today it is simplified by intensively exploiting prior experience in the field. Such a collaboration is still quite far from being achieved, however, as a gap persists between computational linguists (who sometimes do not involve humanists in ...
The ACRH-2 Co-Chairs and Organisers

- by Svetlozara Leseva
- Parallel Corpora, Corpus Annotation

The article provides information on the development of corpus linguistics in Belarus and Poland. The importance of creating Belarusian-Polish and Polish-Belarusian parallel corpora is noted, a possible algorithm for building... more

- by Uladzimir Koščanka and Radosław Kaleta
- Parallel Corpora, Polish Language, Belarusian language, Parallel Corpus

The article deals with diminutive adjectives, numerals, pronouns and adverbs in a parallel bilingual corpus of Bulgarian and Ukrainian texts. We address some theoretical questions regarding the category of diminutivity in both languages.... more

- by Ivan Derzhanski and Olena Siruk
- Slavic Languages, Bulgarian Language, Diminutives, Parallel Corpora

In this paper we describe an alignment system that aligns English-Hindi texts at the sentence and word level in parallel corpora. We describe a simple sentence length approach to sentence alignment and a hybrid, multi-feature approach to... more

- by Niraj Aswani
- Word alignment, Parallel Corpora, Hybrid Approach, Edit Distance

The present PhD thesis deals with the so-called grey areas that can be found within the Spanish modal system. In these areas, two different types of modality (modal meanings) can occur. We study the relationship that can be found... more

- by Dana Kratochvílová
- Spanish, Romance philology, Modality, Corpus Linguistics
