2013, Lecture Notes in Computer Science
This paper deals with a new strategy for evaluating a complex Natural Language Processing (NLP) task using the Turing test. Automatic summarization based on sentence compression requires assessing informativeness and modifying inner sentence structures. This is far more closely related to real rephrasing than the plain sentence extraction and ranking paradigm, so new evaluation methods are needed. We propose a novel imitation game to evaluate Automatic Summarization by Compression (ASC). The rationale of this Turing-like evaluation could be applied to many other complex NLP tasks, such as machine translation or text generation. We show that a state-of-the-art ASC system can pass such a test and simulate a human summary in 60% of the cases.
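As a rough illustration of how such an imitation game can be scored, the sketch below computes the fraction of machine compressions that judges label as human-written; the data format and the 60% outcome are invented for the example and do not reproduce the authors' protocol.

```python
# Hedged sketch of scoring a Turing-like imitation game for ASC
# (hypothetical data format; not the authors' actual protocol).

def fooling_rate(judgements):
    """judgements: list of (is_machine_output, judged_as_human) pairs."""
    machine_cases = [judged_human for is_machine, judged_human in judgements if is_machine]
    if not machine_cases:
        return 0.0
    return sum(machine_cases) / len(machine_cases)

# Example: 3 of 5 machine compressions were judged human-written -> 60%.
trials = [(True, True), (True, False), (True, True), (True, True), (True, False),
          (False, True), (False, True)]
print(f"machine compressions judged human: {fooling_rate(trials):.0%}")
```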
Transactions of the Association for Computational Linguistics
The scarcity of comprehensive up-to-date studies on evaluation metrics for text summarization and the lack of consensus regarding evaluation protocols continue to inhibit progress. We address the existing shortcomings of summarization evaluation methods along five dimensions: 1) we re-evaluate 14 automatic evaluation metrics in a comprehensive and consistent fashion using neural summarization model outputs along with expert and crowd-sourced human annotations; 2) we consistently benchmark 23 recent summarization models using the aforementioned automatic evaluation metrics; 3) we assemble the largest collection of summaries generated by models trained on the CNN/DailyMail news dataset and share it in a unified format; 4) we implement and share a toolkit that provides an extensible and unified API for evaluating summarization models across a broad range of automatic metrics; and 5) we assemble and share the largest and most diverse, in terms of model types, collection of human judgments...
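The paper's own toolkit is not reproduced here; as a hedged illustration of metric-based benchmarking, the sketch below scores one candidate summary against a reference with the independent rouge-score package (pip install rouge-score), which is an assumption of this example rather than the authors' API.

```python
# Hedged sketch: scoring a system summary against a reference with ROUGE,
# using the `rouge_score` package; an illustration of metric-based
# benchmarking, not the authors' toolkit API.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

reference = "The committee approved the budget after a short debate."
candidate = "The budget was approved by the committee."

scores = scorer.score(reference, candidate)
for name, result in scores.items():
    print(f"{name}: precision={result.precision:.2f} "
          f"recall={result.recall:.2f} f1={result.fmeasure:.2f}")
```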
2006
We applied a single-document sentence-trimming approach (Trimmer) to the problem of multi-document summarization. Trimmer was designed with the intention of compressing a lead sentence into a space consisting of tens of characters. In our Multi-Document Trimmer (MDT), we use Trimmer to generate multiple trimmed candidates for each sentence. Sentence selection is used to determine which trimmed candidates provide the best combination of topic coverage and brevity. We demonstrate that we were able to port Trimmer easily to this new problem. We also show that MDT generally ranked higher for recall than for precision, suggesting that MDT is currently more successful at finding relevant content than it is at weeding out irrelevant content. Finally, we present an error analysis showing that, while sentence compression is making space for additional sentences, more work is needed in the area of generating and selecting the right candidates.
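As a hedged illustration of the selection step, the sketch below greedily picks trimmed candidates by topic-term coverage gained per character under a length budget; the scoring function and data are invented and do not correspond to MDT's actual selection criterion.

```python
# Hedged sketch: greedy selection over trimmed candidates, trading off
# topic coverage against length (illustrative scoring, not MDT's criterion).
def select(candidates, topic_terms, budget_chars):
    """candidates: list of trimmed sentence strings; topic_terms: set of lowercase terms."""
    covered, chosen, used = set(), [], 0
    pool = list(candidates)
    while pool:
        def gain(c):
            new = {w for w in c.lower().split() if w in topic_terms} - covered
            return len(new) / max(len(c), 1)          # coverage gained per character
        best = max(pool, key=gain)
        if gain(best) == 0 or used + len(best) > budget_chars:
            break
        chosen.append(best)
        covered |= {w for w in best.lower().split() if w in topic_terms}
        used += len(best)
        pool.remove(best)
    return chosen

cands = ["storm hits coast",
         "the powerful storm hits the gulf coast on monday",
         "residents evacuate inland"]
print(select(cands, {"storm", "coast", "residents", "evacuate"}, budget_chars=60))
```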
ArXiv, 2021
Evaluating large summarization corpora using humans has proven to be expensive from both the organizational and the financial perspective. Therefore, many automatic evaluation metrics have been developed to measure summarization quality in a fast and reproducible way. However, most of these metrics still rely on humans and need gold-standard summaries generated by linguistic experts. Since BLANC does not require gold summaries and can supposedly use any underlying language model, we consider its application to the evaluation of summarization in German. This work demonstrates how to adjust the BLANC metric to a language other than English. We compare BLANC scores with crowd and expert ratings, as well as with commonly used automatic metrics, on a German summarization data set. Our results show that BLANC in German is especially good at evaluating informativeness.
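BLANC roughly measures how much a summary helps a masked language model reconstruct tokens of the source document. The sketch below is a simplified BLANC-help-style score using Hugging Face transformers with a German BERT checkpoint; the masking scheme, the filler context, and the model choice are assumptions made for illustration, not the reference blanc implementation.

```python
# Hedged sketch of a BLANC-help style score: how much does prepending the
# summary improve a masked LM's token reconstruction of the document?
# Simplified masking and scoring; not the reference `blanc` package.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

MODEL = "bert-base-german-cased"   # assumed checkpoint for German
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForMaskedLM.from_pretrained(MODEL).eval()

def masked_accuracy(context: str, sentence: str, every: int = 4) -> float:
    """Mask every `every`-th token of `sentence`, condition on `context`, count exact hits."""
    ctx_ids = tok(context, add_special_tokens=False)["input_ids"]
    sent_ids = tok(sentence, add_special_tokens=False)["input_ids"]
    positions = list(range(0, len(sent_ids), every))
    masked = list(sent_ids)
    for p in positions:
        masked[p] = tok.mask_token_id
    ids = [tok.cls_token_id] + ctx_ids + [tok.sep_token_id] + masked + [tok.sep_token_id]
    with torch.no_grad():
        logits = model(torch.tensor([ids])).logits[0]
    offset = len(ctx_ids) + 2                        # skip [CLS], context, [SEP]
    hits = sum(int(logits[offset + p].argmax().item() == sent_ids[p]) for p in positions)
    return hits / max(len(positions), 1)

def blanc_help(document_sentence: str, summary: str) -> float:
    filler = "." * len(summary)                      # length-matched neutral context
    return (masked_accuracy(summary, document_sentence)
            - masked_accuracy(filler, document_sentence))

print(blanc_help("Der Stadtrat hat den Haushalt nach kurzer Debatte verabschiedet.",
                 "Haushalt vom Stadtrat gebilligt."))
```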
The originality of this work lies in tackling text compression with an unsupervised method, based on a deep linguistic analysis and without resorting to a training corpus. This work presents a system for dependency-tree pruning that preserves syntactic coherence and the main informational content, and it led to an operational piece of software named COLIN. Experimental results show that our compressions obtain honorable satisfaction levels, with a mean compression ratio of 38%.
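As a hedged illustration of compression by dependency pruning, the sketch below drops selected modifier subtrees with spaCy; the set of prunable relations and the English pipeline are illustrative choices, not COLIN's actual rules (COLIN relies on a deep linguistic analysis of French).

```python
# Hedged sketch of compression by dependency-subtree pruning (spaCy;
# the prunable relations below are illustrative, not COLIN's rule set).
import spacy

nlp = spacy.load("en_core_web_sm")
PRUNABLE = {"advmod", "amod", "appos", "advcl", "prep", "npadvmod"}

def prune(sentence: str) -> str:
    doc = nlp(sentence)
    dropped = set()
    for token in doc:
        if token.dep_ in PRUNABLE:
            dropped.update(t.i for t in token.subtree)   # drop the whole subtree
    # Note: leftover punctuation is not cleaned up in this sketch.
    return " ".join(t.text for t in doc if t.i not in dropped)

src = "The extremely tired delegates, visibly frustrated, rejected the proposal after a long debate."
out = prune(src)
print(out)
print(f"characters removed: {1 - len(out) / len(src):.0%}")
```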
2009
Text summarization is one of the oldest problems in natural language processing. Popular approaches rely on extracting relevant sentences from the original documents. As a side effect, sentences that are too long but partly relevant are doomed to either not appear in the final summary, or prevent inclusion of other relevant sentences. Sentence compression is a recent framework that aims to select the shortest subsequence of words that yields an informative and grammatical sentence.
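As a toy illustration of the subsequence-selection framing, the sketch below enumerates subsequences of a short sentence and returns the shortest one that keeps a given set of keywords and clears a toy bigram fluency threshold; real compressors use learned models rather than this brute-force search.

```python
# Hedged sketch of compression as subsequence selection: keep the shortest
# subsequence that covers the keywords and looks fluent under a toy bigram
# score (illustrative only, not a real system).
from itertools import combinations

BIGRAMS = {("the", "committee"), ("committee", "approved"), ("approved", "the"),
           ("the", "budget"), ("budget", "quickly")}   # toy fluency model

def fluency(words):
    pairs = list(zip(words, words[1:]))
    return sum(p in BIGRAMS for p in pairs) / max(len(pairs), 1)

def compress(words, keywords, min_fluency=0.5):
    for k in range(1, len(words) + 1):                 # shortest subsequences first
        for idxs in combinations(range(len(words)), k):
            cand = [words[i] for i in idxs]
            if keywords <= set(cand) and fluency(cand) >= min_fluency:
                return cand
    return words

sent = "the committee reportedly approved the new budget quickly".split()
print(" ".join(compress(sent, keywords={"committee", "approved", "budget"})))
```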
2021
ive Summarization methods build a summary by generating new sentences from the information in the original text. Summaries made by humans usually have abstraction values, where sentences in the summary can be new sentences that convey the information from the original article. The automatically generated abstractive summaries are expected to be better and closer to a human-built summary. Algorithms that try to create abstractive summaries have three main challenges: 1) Compression. The following two challenges come to mind when thinking about abstractive summaries, but to meet the need of every summary, the algorithm's output must be shorter than the original text. The algorithm uses the previous training process, which created a model that recognizes important words by topic and tokenizes connection between words. 2) Sentence Fusion. This stage comes together with the compression stage. The compression stage tells the meaningful information from one or more sentences, and the F...
2004
Automatic summaries of text generated through sentence or word extraction have been evaluated by comparing them with manual summaries produced by humans, using numerical evaluation measures based on precision or accuracy. Although sentence extraction has previously been evaluated based only on the precision of individual sentences, sentence concatenations in the summaries should be evaluated as well. We evaluated the appropriateness of sentence concatenations in summaries by using evaluation measures originally applied to word concatenations in summaries produced through word extraction. We found that measures considering sentence concatenation reflect human judgment much better than those based only on the precision of a single sentence.
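As a hedged illustration of the idea, the sketch below contrasts precision over single extracted sentences with precision over adjacent sentence pairs (concatenations); the exact measures used in the paper may differ.

```python
# Hedged sketch: precision over single extracted sentences vs. over adjacent
# sentence pairs ("concatenations"); illustrative, not the paper's measures.
def precision(items, gold):
    return sum(i in gold for i in items) / max(len(items), 1)

def pairs(seq):
    return list(zip(seq, seq[1:]))

system = ["s1", "s4", "s5", "s9"]             # sentence ids in summary order
gold   = ["s1", "s5", "s9"]                   # a human extract, in order

print("single-sentence precision:", precision(system, set(gold)))
print("concatenation precision:  ", precision(pairs(system), set(pairs(gold))))
```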
Abstractive text summarization aims at generating human-like summaries by understanding and paraphrasing the given input content. Recent efforts based on sequence-to-sequence networks only allow the generation of a single summary. However, it is often desirable to accommodate the psycho-linguistic preferences of the intended audience while generating the summaries. In this work, we present a reinforcement learning based approach to generate formality-tailored summaries for an input article. Our novel input-dependent reward function aids in training the model with stylistic feedback on sampled and ground-truth summaries together. Once trained, the same model can generate formal and informal summary variants. Our automated and qualitative evaluations show the viability of the proposed framework.
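As a hedged illustration of stylistic feedback, the sketch below mixes a placeholder formality score with content overlap against the source into a single reward, the kind of quantity a policy-gradient trainer could maximize; the marker lists and weighting are invented and are not the authors' reward function.

```python
# Hedged sketch: a reward mixing stylistic feedback (formality) with content
# overlap; the formality scorer is a placeholder, not the authors' classifier.
FORMAL_MARKERS = {"therefore", "moreover", "approximately", "consequently"}
INFORMAL_MARKERS = {"gonna", "kinda", "stuff", "really"}

def formality(text: str) -> float:
    words = text.lower().split()
    f = sum(w in FORMAL_MARKERS for w in words)
    i = sum(w in INFORMAL_MARKERS for w in words)
    return (f - i) / max(len(words), 1)

def overlap(summary: str, source: str) -> float:
    s, d = set(summary.lower().split()), set(source.lower().split())
    return len(s & d) / max(len(s), 1)

def reward(summary: str, source: str, target_formal: bool, alpha: float = 0.5) -> float:
    style = formality(summary) if target_formal else -formality(summary)
    return alpha * style + (1 - alpha) * overlap(summary, source)

src = "Sales grew by approximately ten percent, therefore the outlook improved."
print(reward("Sales grew, therefore the outlook improved.", src, target_formal=True))
print(reward("Sales really went up and stuff got better.", src, target_formal=True))
```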
In abstract-like summarisation, extracted sentences containing key content are often revised to improve the coherence of the overall summary. In this work, we consider the task of Global Revision, in which a key sentence is revised and supplemented with additional content from the original document. Specifically, this task comprises two subtasks: selecting content, and grammatically ordering content, the latter being the focus of this paper. Using statistical dependency models, we search for a Maximal Spanning (Dependency) Tree that structures recycled words and phrases to form a novel sentence. Combining a modified version of Prim's algorithm with a four-gram language model, we evaluated our system on a sentence regeneration task, obtaining BLEU scores of 0.30, a statistically significant improvement over the baseline.
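As a hedged illustration of the tree-building step, the sketch below grows a maximum-scoring spanning tree over recycled words with a Prim-style greedy expansion; the edge scores are toy values, not probabilities from statistical dependency models, and no four-gram language-model ordering is applied.

```python
# Hedged sketch: Prim-style growth of a maximum-scoring spanning tree over
# recycled words (toy edge scores, no language-model reordering).
def max_spanning_tree(words, score):
    """Greedily attach the highest-scoring (head, dependent) edge until all words are in the tree."""
    in_tree, edges = {words[0]}, []
    remaining = set(words[1:])
    while remaining:
        head, dep = max(((h, d) for h in in_tree for d in remaining),
                        key=lambda e: score(*e))
        edges.append((head, dep, score(head, dep)))
        in_tree.add(dep)
        remaining.remove(dep)
    return edges

SCORES = {("approved", "committee"): 0.9, ("approved", "budget"): 0.8,
          ("budget", "revised"): 0.6, ("committee", "budget"): 0.3}
score = lambda h, d: SCORES.get((h, d), 0.05)

for head, dep, s in max_spanning_tree(["approved", "committee", "budget", "revised"], score):
    print(f"{head} -> {dep}  ({s})")
```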
In this paper we present a method for re-using the human judgements on summary quality provided by the DUC contest. The score to be awarded to an automatic summary is calculated as a function of the scores assigned manually to the most similar summaries for the same document. This approach enhances the standard n-gram-based evaluation of automatic summarization systems by establishing similarities between extractive (vs. abstractive) summaries and by taking advantage of the large quantity of evaluated summaries available from the DUC contest. The utility of this method is exemplified by the improvements achieved on a headline production system.
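As a hedged illustration, the sketch below awards a new summary the similarity-weighted average of the manual scores given to the most similar already-judged summaries for the same document, using unigram overlap as the similarity; this is not the paper's exact function.

```python
# Hedged sketch: transfer manual DUC-style scores from the most similar
# already-judged summaries (unigram-overlap similarity; illustrative only).
def similarity(a: str, b: str) -> float:
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(len(wa | wb), 1)

def transferred_score(candidate, judged, k=2):
    """judged: list of (summary_text, manual_score) pairs for the same document."""
    top = sorted(judged, key=lambda p: similarity(candidate, p[0]), reverse=True)[:k]
    weights = [similarity(candidate, text) for text, _ in top]
    if sum(weights) == 0:
        return sum(score for _, score in judged) / len(judged)
    return sum(w * s for w, (_, s) in zip(weights, top)) / sum(weights)

judged = [("storm damages coastal towns", 4.0),
          ("residents flee as storm nears coast", 3.0),
          ("stock markets rally on earnings", 1.0)]
print(transferred_score("storm hits coastal towns hard", judged))
```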