2006
Abstract: Evaluation of MT evaluation measures is limited by inconsistent human judgment data. Nonetheless, machine translation can be evaluated using the well-known measures precision, recall, and their average, the F-measure. The unigram-based F-measure has significantly higher correlation with human judgments than recently proposed alternatives. More importantly, this standard measure has an intuitive graphical interpretation, which can facilitate insight into how MT systems might be improved.
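As a concrete illustration of the measures named above, the sketch below computes unigram precision, recall, and F-measure for a single hypothesis/reference pair. Whitespace tokenization and the example sentences are assumptions for illustration, not details taken from the paper.

```python
# A minimal sketch of unigram precision, recall, and F-measure for one
# hypothesis/reference pair, using clipped (multiset) unigram matches.
from collections import Counter

def unigram_prf(hypothesis: str, reference: str):
    hyp = Counter(hypothesis.split())
    ref = Counter(reference.split())
    # Count each hypothesis word at most as often as it occurs in the reference.
    matches = sum(min(c, ref[w]) for w, c in hyp.items())
    precision = matches / max(sum(hyp.values()), 1)
    recall = matches / max(sum(ref.values()), 1)
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return precision, recall, f_measure

if __name__ == "__main__":
    p, r, f = unigram_prf("the cat sat on the mat", "the cat is on the mat")
    print(f"P={p:.3f} R={r:.3f} F={f:.3f}")
```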
2012
The Framework for the Evaluation of Machine Translation (FEMTI) contains guidelines for building a quality model that is used to evaluate MT systems in relation to the purpose and intended context of use of the systems. Contextual quality models can thus be constructed, but entering into FEMTI the knowledge required for this operation is a complex task. An experiment was set up in order to transfer knowledge from MT evaluation experts into the FEMTI guidelines, by polling experts about the evaluation methods they would use in a particular context, then inferring from the results generic relations between characteristics of the context of use and quality characteristics. The results of this hands-on exercise, carried out as part of a conference tutorial, have served to refine FEMTI’s ‘generic contextual quality model’ and to obtain feedback on the FEMTI guidelines in general.
Any scientific endeavour must be evaluated in order to assess its correctness. In many applied sciences it is necessary to check that the theory adequately matches actual observations. In Machine Translation (MT), evaluation serves two purposes: relative evaluation allows us to check whether one MT technique is better than another, while absolute evaluation gives an absolute measure of performance, e.g. a score of 1 may mean a perfect translation.
Proceedings of the workshop on Human Language Technology - HLT '93, 1993
This paper reports results of the 1992 Evaluation of machine translation (MT) systems in the DARPA MT initiative and results of a Pre-test to the 1993 Evaluation. The DARPA initiative is unique in that the evaluated systems differ radically in languages translated, theoretical approach to system design, and intended end-user application. In the 1992 suite, a Comprehension Test compared the accuracy and interpretability of system and control outputs; a Quality Panel for each language pair judged the fidelity of translations from each source version. The 1993 suite evaluated adequacy and fluency and investigated three scoring methods.
2003
Abstract: Machine translation can be evaluated using precision, recall, and the F-measure. These standard measures have significantly higher correlation with human judgments than recently proposed alternatives. More importantly, the standard measures have an intuitive interpretation, which can facilitate insights into how MT systems might be improved. The relevant software is publicly available.
This paper evaluates the translation quality of machine translation systems for 8 language pairs: translating French, German, Spanish, and Czech to English and back. We carried out an extensive human evaluation which allowed us not only to rank the different MT systems, but also to perform higher-level analysis of the evaluation process. We measured timing and intra- and inter-annotator agreement for three types of subjective evaluation. We measured the correlation of automatic evaluation metrics with human judgments. This meta-evaluation reveals surprising facts about the most commonly used methodologies.
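As a concrete illustration of the inter-annotator agreement side of such a meta-evaluation, here is a minimal sketch of Cohen's kappa for two annotators labelling the same items. The label sequences are invented toy data, not the paper's judgments, and the paper's own agreement statistics may be computed differently.

```python
# A minimal sketch of Cohen's kappa: kappa = (p_o - p_e) / (1 - p_e),
# where p_o is observed agreement and p_e is chance agreement.
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n  # observed agreement
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both annotators pick the same label at random.
    p_e = sum((count_a[lab] / n) * (count_b[lab] / n)
              for lab in set(labels_a) | set(labels_b))
    return (p_o - p_e) / (1 - p_e)

a = ["better", "worse", "better", "equal", "better", "worse"]
b = ["better", "worse", "equal",  "equal", "better", "better"]
print(f"kappa = {cohens_kappa(a, b):.2f}")
```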
Proceedings of the 4th ISLE Workshop on MT …, 2001
Work on comparing a set of linguistic test scores for MT output to a set of the same tests' scores for naturally-occurring target language text (Jones and Rusk 2000) broke new ground in automating MT Evaluation. However, the tests used were selected on an ad hoc basis. In this paper, we report on work to extend our understanding, through refinement and validation, of suitable linguistic tests in the context of our novel approach to MTE. This approach was introduced in Miller and Vanni (2001a) and employs standard, rather than randomly-chosen, tests of MT output quality selected from the ISLE framework as well as a scoring system for predicting the type of information processing task performable with the output. Since the intent is to automate the scoring system, this work can also be viewed as the preliminary steps of algorithm design.
2008
This tutorial offers an introduction to the field of Machine Translation evaluation, and in particular to FEMTI, the Framework for the Evaluation of Machine Translation in ISLE, which groups together a wide range of evaluation metrics, following a contextual evaluation approach. In a practical application of the framework, participants will be shown how to apply FEMTI to an operational example of MT use, in order to construct a well-motivated quality model. The results from the practical exercise will be compared, and a synthesis will be proposed at the end, explaining how feedback from the community can be fed into FEMTI.
2014
Machine translation (MT) has developed into one of the hottest research topics in the natural language processing (NLP) literature. One important issue in MT is how to evaluate an MT system reasonably and determine whether the translation system has improved. The traditional manual judgment methods are expensive, time-consuming, unrepeatable, and sometimes suffer from low agreement. On the other hand, the popular automatic MT evaluation methods have some weaknesses. Firstly, they tend to perform well on language pairs with English as the target language, but poorly when English is the source. Secondly, some methods rely on many additional linguistic features to achieve good performance, which makes the metrics hard to replicate and apply to other language pairs. Thirdly, some popular metrics rely on an incomplete set of factors, which results in low performance on some practical tasks. In this thesis, to address the existing problems, we design novel MT evaluation methods and investigate their performance on different languages. Firstly, we design augmented factors to yield highly accurate evaluation. Secondly, we design a tunable evaluation model in which the weighting of factors can be optimized according to the characteristics of languages. Thirdly, in the enhanced version of our methods, we design concise linguistic features using part-of-speech (POS) tags to show that our methods can yield even higher performance when using external linguistic resources. Finally, we report the practical performance of our metrics in the ACL-WMT workshop shared tasks, which shows that the proposed methods are robust across different languages. In addition, we also present some novel work on quality estimation of MT without using reference translations, including the use of Naïve Bayes (NB) probability models, support vector machine (SVM) classification algorithms, and a discriminative undirected graphical model, the conditional random field (CRF), in addition to feature engineering.
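As an illustration of reference-free quality estimation framed as sentence-level classification, in the spirit of the NB/SVM approaches mentioned above, here is a minimal sketch. The features, toy data, and use of scikit-learn are assumptions for illustration, not the thesis's actual feature set or setup.

```python
# A minimal sketch of reference-free QE as binary classification with an SVM.
import numpy as np
from sklearn.svm import SVC

def qe_features(source: str, translation: str) -> list:
    # Illustrative surface features only: length ratio, digit and comma mismatch.
    src, tgt = source.split(), translation.split()
    return [
        len(tgt) / max(len(src), 1),
        sum(w.isdigit() for w in tgt) - sum(w.isdigit() for w in src),
        translation.count(",") - source.count(","),
    ]

# Toy training data: (source, translation, label) with 1 = acceptable, 0 = poor.
train = [
    ("das ist gut", "this is good", 1),
    ("das ist gut", "this good good good is", 0),
]
X = np.array([qe_features(s, t) for s, t, _ in train])
y = np.array([lab for _, _, lab in train])

clf = SVC(kernel="linear").fit(X, y)
print(clf.predict([qe_features("das ist gut", "this is good")]))
```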
2011
State-of-the-art statistical machine translation (MT) systems have made significant progress towards producing user-acceptable translation output. However, there is still no efficient way for MT systems to inform users which words are likely translated correctly and how confident the system is about the whole sentence. We propose a novel framework to predict word-level and sentence-level MT errors with a large number of novel features. Experimental results show that the MT error prediction accuracy is increased from 69.1 to 72.2 in F-score. The Pearson correlation between the proposed confidence measure and the human-targeted translation edit rate (HTER) is 0.6. Improvements between 0.4 and 0.9 TER reduction are obtained with the n-best list reranking task using the proposed confidence measure. Also, we present a visualization prototype of MT errors at the word and sentence levels with the objective of improving post-editor productivity.
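As an illustration of the kind of meta-evaluation reported above, the sketch below correlates sentence-level confidence scores with HTER using Pearson's r. The numbers are invented toy values, not the paper's data, and the availability of scipy is an assumption.

```python
# A minimal sketch: Pearson correlation between confidence scores and HTER.
from scipy.stats import pearsonr

confidence = [0.91, 0.75, 0.62, 0.40, 0.23]   # higher = more confident
hter       = [0.05, 0.18, 0.30, 0.45, 0.70]   # higher = more post-editing needed

r, p_value = pearsonr(confidence, hter)
print(f"Pearson r = {r:.2f} (p = {p_value:.3f})")
# A strongly negative r means confident sentences need less post-editing;
# papers often report the correlation against (1 - HTER) or its magnitude.
```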
Translation Studies: Theory and Practice
Along with the development and widespread dissemination of translation by artificial intelligence, it is becoming increasingly important to continuously evaluate and improve its quality and to use it as a tool for the modern translator. In our research, we compared five sentences translated from Armenian into Russian and English by Google Translator, Yandex Translator and two models of the translation system of the Armenian company Avromic, to find out how effective these translation systems are when working with Armenian. We also sought to determine how effective it would be to use them as a translation tool and in the learning process, with further editing of the translation. As there is currently no comprehensive and widely accepted method of human evaluation for machine translation, we have developed our own evaluation method and criteria by studying the world's best-known methods of evaluation for automatic translation. We have used the post-editorial distance evaluation criterion as ...
Today Machine Translation (MT) systems are commercially available for a variety of language pairs and in a price range that makes them accessible to non-professionals. Yet there is no standard evaluation for any type of translation system, whether automatic or manual, especially for the commercial systems that involve the Arabic language in the Arabic region. This paper presents a brief survey of MT evaluation methods and their importance. It also presents some approaches for developing a comprehensive evaluation system without any developer cooperation. Although we propose some dimensions for MT systems, we concentrate on evaluating the translation quality of the MT system rather than the system itself. As globalization reaches the Arabic world, with more than 180 million speakers around the world, problems caused by a lack of communication can seriously affect its situation, especially as a receiver of knowledge more than a producer; its need for MT is therefore essential.
Machine Translation: From Real Users …, 2004
Recent research has shown that a balanced harmonic mean (F1 measure) of unigram precision and recall outperforms the widely used BLEU and NIST metrics for Machine Translation evaluation in terms of correlation with human judgments of translation quality. We show that significantly better correlations can be achieved by placing more weight on recall than on precision. While this may seem unexpected, since BLEU and NIST focus on n-gram precision and disregard recall, our experiments show that correlation with human judgments is highest when almost all of the weight is assigned to recall. We also show that stemming is significantly beneficial not just to simpler unigram precision and recall based metrics, but also to BLEU and NIST.
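For concreteness, the recall-weighted harmonic mean discussed above can be written as F_alpha = P·R / (alpha·P + (1−alpha)·R), where alpha is the weight placed on recall: alpha = 0.5 gives the balanced F1, and values near 1.0 put almost all weight on recall. The sketch below is illustrative, and the exact parameterization and values used in the paper may differ.

```python
# A minimal sketch of a recall-weighted harmonic mean of precision and recall.
def weighted_f(precision: float, recall: float, alpha: float = 0.9) -> float:
    if precision == 0.0 or recall == 0.0:
        return 0.0
    return (precision * recall) / (alpha * precision + (1 - alpha) * recall)

print(weighted_f(0.70, 0.60, alpha=0.5))  # balanced F1
print(weighted_f(0.70, 0.60, alpha=0.9))  # recall-heavy weighting
```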
This paper presents a study on human and automatic evaluations of translations in a French-German translation learner corpus. The aim of the paper is to shed light on the differences between MT evaluation scores and approaches to translation evaluation rooted in a closely related discipline, namely translation studies. We illustrate the factors contributing to the human evaluation of translations, contrasting these factors with the results of automatic evaluation metrics such as BLEU and Meteor. By means of a qualitative analysis of human translations we highlight the concept of legitimate variation and attempt to reveal weaknesses of automatic evaluation metrics. We also aim to show that translation studies provide sophisticated concepts for translation quality estimation and error annotation which the automatic evaluation scores do not yet cover.
2014
MT-EQuAl (Machine Translation Errors, Quality, Alignment) is a toolkit for human assessment of Machine Translation (MT) output. MT-EQuAl implements three different tasks in an integrated environment: annotation of translation errors, translation quality rating (e.g. adequacy and fluency, relative ranking of alternative translations), and word alignment. The toolkit is web-based and multi-user, allowing large-scale and remotely managed manual annotation projects. It incorporates a number of project management functions and sophisticated progress monitoring capabilities. The implemented evaluation tasks are configurable and can be adapted to several specific annotation needs. The toolkit is open source and released under the Apache 2.0 license.
arXiv: Computation and Language, 2016
We introduce a Machine Translation (MT) evaluation survey that covers both manual and automatic evaluation methods. The traditional human evaluation criteria mainly include intelligibility, fidelity, fluency, adequacy, comprehension, and informativeness. The advanced human assessments include task-oriented measures, post-editing, segment ranking, and extended criteria, etc. We classify the automatic evaluation methods into two categories: lexical similarity and linguistic features. The lexical similarity methods include edit distance, precision, recall, F-measure, and word order. The linguistic features can be divided into syntactic and semantic features: the syntactic features include part-of-speech tags, phrase types and sentence structures, and the semantic features include named entities, synonyms, textual entailment, paraphrase, semantic roles, and language models. The deep learning models for evaluation are very newly ...
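As a concrete example of the lexical-similarity methods listed in the survey, the sketch below implements word-level edit distance (the core of WER/TER-style metrics) with standard dynamic programming. Whitespace tokenization and the example sentences are assumptions for illustration.

```python
# A minimal sketch of word-level edit distance between hypothesis and reference.
def edit_distance(hyp: list, ref: list) -> int:
    m, n = len(hyp), len(ref)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[m][n]

hyp = "the cat sat on mat".split()
ref = "the cat sat on the mat".split()
print(edit_distance(hyp, ref) / len(ref))  # WER-style normalization
```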
Proceedings of the Second Workshop on …, 2007
Meteor is an automatic metric for Machine Translation evaluation which has been demonstrated to have high levels of correlation with human judgments of translation quality, significantly outperforming the more commonly used Bleu metric. It is one of several automatic metrics used in this year's shared task within the ACL WMT-07 workshop. This paper recaps the technical details underlying the metric and describes recent improvements in the metric. The latest release includes improved metric parameters and extends the metric to support evaluation of MT output in Spanish, French and German, in addition to English.
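As a rough sketch of the score computation recapped in the paper, the published Meteor formulation combines a recall-weighted harmonic mean of unigram precision and recall with a fragmentation penalty based on how many contiguous chunks the matched words form. The fixed constants below are the defaults from the original Meteor description and are used here as assumptions; the WMT-07 release described above makes these parameters tunable and adds stemming and synonymy matching, which this sketch omits.

```python
# A rough sketch of the shape of a Meteor-style score (exact matching only):
#   Fmean   = 10*P*R / (R + 9*P)
#   Penalty = 0.5 * (chunks / matches)**3
#   Score   = Fmean * (1 - Penalty)
def meteor_like_score(precision, recall, chunks, matches):
    if matches == 0 or precision == 0.0 or recall == 0.0:
        return 0.0
    fmean = 10 * precision * recall / (recall + 9 * precision)
    penalty = 0.5 * (chunks / matches) ** 3
    return fmean * (1 - penalty)

# e.g. 8 matched unigrams grouped into 3 contiguous chunks:
print(meteor_like_score(precision=0.8, recall=0.73, chunks=3, matches=8))
```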
Tradumàtica tecnologies de la traducció, 2014
This paper gives a general overview of the main classes of methods for automatic evaluation of Machine Translation (MT) quality, their limitations and their value for professional translators and MT developers. Automated evaluation of MT characterizes the performance of MT systems on a specific text or corpus. Automated scores are expected to correlate with certain parameters of MT quality scored by human evaluators, such as adequacy or fluency of translation. Automated evaluation is now part of the MT development cycle, but it also contributes to fundamental research on MT and to improving MT technology.
Cihan University/ Erbil, 2018
In this study, we attempt to compare two common methods of evaluating machine translation (MT) output: human and automatic MT evaluation. The materials of the study were selected from economic texts. Twenty English sentences and their Persian translations were selected from the book "Translating of Economic Texts" published by Payam-e-Nour University. For the human assessment, 20 MA students of translation studies participated in this study as evaluators. To evaluate the sentences automatically, the BLEU method of MT output evaluation was applied. According to the findings of the study, both evaluation methods lead to the same results; however, human evaluation is more precise than automatic evaluation, while automatic evaluation is faster and less time-consuming than human evaluation.
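For readers unfamiliar with the automatic side of the comparison, here is a minimal, simplified sketch of sentence-level BLEU (modified n-gram precision with a brevity penalty). Real studies normally use standard corpus-level BLEU tooling; the add-epsilon smoothing and example sentences below are assumptions for illustration, not the study's setup.

```python
# A minimal sketch of sentence-level BLEU with crude smoothing.
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu(hypothesis: str, reference: str, max_n: int = 4) -> float:
    hyp, ref = hypothesis.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_ngrams, ref_ngrams = ngrams(hyp, n), ngrams(ref, n)
        # Clipped n-gram matches against the reference.
        overlap = sum(min(c, ref_ngrams[g]) for g, c in hyp_ngrams.items())
        total = max(sum(hyp_ngrams.values()), 1)
        log_precisions.append(math.log((overlap + 1e-9) / total))
    # Brevity penalty: penalize hypotheses shorter than the reference.
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / max(len(hyp), 1))
    return bp * math.exp(sum(log_precisions) / max_n)

print(sentence_bleu("the cat sat on the mat", "the cat is sitting on the mat"))
```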
This paper examines the motivation, design, and practical results of several types of human evaluation tasks for machine translation. In addition to considering annotator performance and task informativeness over multiple evaluations, we explore the practicality of tuning automatic evaluation metrics to each judgment type in a comprehensive experiment using the METEOR-NEXT metric. We present results showing clear advantages of tuning to certain types of judgments and discuss causes of inconsistency when tuning to various judgment data, as well as sources of difficulty in the human evaluation tasks themselves.