2005, Recent Advances in …
The choice of natural language technology appropriate for a given language is greatly impacted by density (availability of digitally stored material). More than half of the world speaks medium density languages, yet many of the methods appropriate for high or low density languages yield suboptimal results when applied to the medium density case. In this paper we describe a general methodology for rapidly collecting, building, and aligning parallel corpora for medium density languages, illustrating our main points on the case of Hungarian, Romanian, and Slovenian. We also describe and evaluate the hybrid sentence alignment method we are using.
Acta Cybernetica, 2008
We present an efficient hybrid method for aligning sentences with their translations in a parallel bilingual corpus. The new algorithm combines a length-based method with an anchor matching method that uses Named Entity recognition, pairing the speed of length-based models with the accuracy of anchor finding methods. Because cognates can be found only with extremely low accuracy for the Hungarian-English language pair, we use Named Entities as anchors instead. Thanks to these well-selected anchors, the method was found to outperform the two best sentence alignment algorithms published so far for the Hungarian-English language pair.
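The length-based half of such a hybrid aligner can be illustrated with a small dynamic program. This is a minimal sketch and not the authors' algorithm: the function name `align_by_length`, the bead penalties, and the character-length mismatch cost are all assumptions chosen for illustration. In a hybrid setup, anchor sentences (for example, ones sharing a Named Entity) would presumably split the texts into shorter segments, and a routine like this would run on each segment.

```python
def align_by_length(src, tgt):
    """Align two sentence lists using character lengths only.

    Returns a list of (source_indices, target_indices) beads.
    Penalties and the cost formula are illustrative assumptions.
    """
    inf = float("inf")
    # Allowed bead shapes: (source sentences, target sentences, penalty).
    beads = [(1, 1, 0.0), (1, 0, 5.0), (0, 1, 5.0), (2, 1, 2.0), (1, 2, 2.0)]
    n, m = len(src), len(tgt)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0

    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for di, dj, penalty in beads:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                len_src = sum(len(s) for s in src[i:ni])
                len_tgt = sum(len(t) for t in tgt[j:nj])
                # Relative length mismatch, scaled to be comparable to penalties.
                mismatch = 10.0 * abs(len_src - len_tgt) / max(len_src, len_tgt, 1)
                candidate = cost[i][j] + penalty + mismatch
                if candidate < cost[ni][nj]:
                    cost[ni][nj] = candidate
                    back[ni][nj] = (i, j)

    # Walk the back-pointers from the end to recover the cheapest path.
    result = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        result.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    result.reverse()
    return result


if __name__ == "__main__":
    source = ["short.", "A much longer sentence that was split."]
    target = ["short.", "A much longer", "sentence that was split."]
    print(align_by_length(source, target))
```

A real length-based aligner would replace the ad hoc mismatch cost with a probabilistic length model; the sketch only shows how one-to-one, one-to-many, and skipped sentences compete inside a single search.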
Proceedings of the ACL Workshop on …, 2005
This paper describes an experiment in applying sentence alignment methods to Croatian-English parallel corpora and systematically evaluating their performance within the recall, precision and F-measure framework. Our primary goal is to provide insight and a reference point on sentence alignment accuracy for the Croatian-English language pair. We also extend the scope of (Tadić, 2000), to our knowledge the first experiment dealing with automatic sentence alignment of Croatian-English parallel corpora, by utilizing newly implemented tools, creating corpus subsets defined by genre, and expanding and formalizing its preliminary observations on alignment accuracy. We start by briefly describing and motivating the sentence alignment paradigms of choice and presenting the available language resources, with a subset of the Croatian-English parallel corpus as our primary asset. These descriptions are followed by a formal definition of our testing framework. Results are then discussed in detail, and conclusions are stated along with a brief look at possible future work.
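The recall, precision and F-measure framework mentioned in this abstract can be sketched as follows. The representation of an alignment as a set of (source sentence ids, target sentence ids) pairs and the function name `evaluate_alignment` are assumptions made for illustration; the paper's own formal definition may differ, for instance by scoring partial matches.

```python
def evaluate_alignment(predicted, reference):
    """Score predicted alignment beads against a gold-standard reference.

    Each bead is a pair of tuples: (source sentence ids, target sentence ids).
    A predicted bead counts as correct only if it matches a reference bead exactly.
    """
    predicted, reference = set(predicted), set(reference)
    correct = len(predicted & reference)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(reference) if reference else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


if __name__ == "__main__":
    gold = {((0,), (0,)), ((1,), (1, 2)), ((2,), (3,))}
    system = {((0,), (0,)), ((1,), (1,)), ((2,), (3,))}
    print(evaluate_alignment(system, gold))
```

Exact-match scoring is the strictest choice: a system that finds half of a one-to-two bead gets no credit for it.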
Advances in Intelligent Systems and Computing, 2014
Text alignment is crucial to the accuracy of Machine Translation (MT) systems, some NLP tools, and any other text processing task requiring bilingual data. This research proposes a language-independent sentence alignment approach based on experiments with Polish (a language that is not position-sensitive) and English. The approach was developed on the TED Talks corpus, but can be used for any text domain or language pair. It implements various heuristics for sentence recognition, some of which use synonyms and semantic analysis of text structure as additional information. Data loss is kept to a minimum. The solution is compared to other sentence alignment implementations, and an improvement in MT system score on text processed with the described tool is also shown.
2004
This paper presents a simple way of producing symmetric, phrase-based alignments, combining two single-word based alignments. Our algorithm exploits the asymmetries in the superposition of the two word alignments to detect the phrases that must be aligned as a whole. It was run with baseline word alignments produced by the Giza++ software and improved these alignments. The ability to treat some groups of words as a whole is essential in applications like machine translation. The paper also addresses the difficulty of the alignment evaluation task.
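The idea of superposing two directional word alignments can be sketched as below. This is an illustrative guess at the general technique, not the paper's algorithm: links present in both directions are treated as agreed, links present in only one direction as asymmetric, and connected groups of links in the superposition are read as candidate phrase pairs. The names `superpose` and `phrase_groups` are hypothetical.

```python
from collections import defaultdict


def superpose(src_to_tgt, tgt_to_src):
    """Split two link sets into agreed links and asymmetric links.

    Both inputs are sets of (source_index, target_index) word links.
    """
    first, second = set(src_to_tgt), set(tgt_to_src)
    agreed = first & second
    asymmetric = (first | second) - agreed
    return agreed, asymmetric


def phrase_groups(links):
    """Group links into connected components of the bipartite link graph.

    Each component is returned as (source word indices, target word indices).
    """
    neighbours = defaultdict(set)
    for i, j in links:
        neighbours[("s", i)].add(("t", j))
        neighbours[("t", j)].add(("s", i))

    seen = set()
    groups = []
    for start in sorted(neighbours):
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            component.add(node)
            stack.extend(neighbours[node] - seen)
        source_words = sorted(i for side, i in component if side == "s")
        target_words = sorted(j for side, j in component if side == "t")
        groups.append((source_words, target_words))
    return groups


if __name__ == "__main__":
    forward = {(0, 0), (1, 1), (2, 1)}
    backward = {(0, 0), (1, 1), (1, 2)}
    agreed, asymmetric = superpose(forward, backward)
    print(agreed, asymmetric)
    print(phrase_groups(forward | backward))
```

In the example, the two directions disagree around source words 1 and 2 and target words 1 and 2, so those words end up in one group that would be aligned as a whole.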
2008
India is a multilingual, linguistically dense and diverse country with rich resources of information. Parallel corpora play a major role in multilingual natural language processing, computational linguistics, speech and information retrieval. This paper describes a system for aligning English-Hindi texts in the Gyan-Nidhi corpus at sentence level. The alignment criteria combine linguistic and statistical information with simple heuristics. We use a multi-feature approach with Anusaaraka (a Machine Translation system), a Hindi shallow parser, and Hindi WordNet lookup as the primary technique, drawing on target-language resources to increase alignment accuracy. Other features, such as Named Entities, linguistic information, and notation converters, are used to match words between one-to-many bilingual sentences. Our experiments are based on the Gyan-Nidhi corpus. We obtained 92.06% accuracy for English-to-Hindi sentence alignment with 95.68% precision and 88.09% recall for one-to-many sentence alignment. The study also suggests procedures for aligning parallel translated corpora by using a machine translation system.
International Journal of Engineering Research and Technology (IJERT), 2014
https://www.ijert.org/comparison-of-various-bilingual-sentence-alignment-methods-for-parallel-corpora-development https://www.ijert.org/research/comparison-of-various-bilingual-sentence-alignment-methods-for-parallel-corpora-development-IJERTV3IS040569.pdf Natural Language Processing is the field of AI, linguistics and computer science mainly concerned with interaction between human languages and computers. Machine translation takes input in one language and converts it into another language while preserving its meaning. Existing literary works available in more than one language are useful resources for many practical applications, such as translation studies, language learning aids, writing and data-based machine translation. These bilingual works need to be aligned as precisely as possible in order to be of any use in parallel corpora development, which is a notoriously difficult task. In this paper, the performance of various sentence alignment approaches is compared through experiments conducted on manually built Telugu and Kannada parallel corpora. A few enhancements to the existing methods, aimed at an effective and intelligent sentence alignment process for building parallel corpora, are suggested based on the lessons learned and the experimental results.
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop
Parallel corpora are key to developing good machine translation systems. However, abundant parallel data are hard to come by, especially for languages with a low number of speakers. When rich morphology exacerbates the data sparsity problem, it is imperative to have accurate alignment and filtering methods that can help make the most of what is available by maximising the number of correctly translated segments in a corpus and minimising noise by removing incorrect translations and segments containing extraneous data. This paper sets out a research plan for improving alignment and filtering methods for parallel texts in low-resource settings. We propose an effective unsupervised alignment method to tackle the alignment problem. Moreover, we propose a strategy to supplement state-of-the-art models with automatically extracted information using basic NLP tools to effectively handle rich morphology.
The availability of partially overlapping parallel corpora for a language pair opens up opportunities for automatically comparing, evaluating and improving them. We compare and evaluate the alignment quality of two English-Estonian parallel corpora that have been created independently, but contain overlapping texts. We describe how to determine the overlapping parts and find their alignment similarities that allow us to economize on manual evaluation effort. We also suggest a feature that could be used instead of comparing and manual checking to predict the alignment correctness.
2012
In this paper, we describe an accurate, robust and language-independent algorithm to align paragraphs with their translations in a parallel bilingual corpus. The paragraph alignment was tested on 998 anchors (a combination of 7 books) from the English-Hindi language pair of the Gyan-Nidhi corpus and achieved a precision of 86.86% and a recall of 82.03%.
MatLit : Materialidades da Literatura, 2015