In this paper we study the Enertex model, which has been applied to fundamental tasks in Natural Language Processing (NLP), including automatic document summarization and topic segmentation. The model is language independent. It is based on the intuitive concept of Textual Energy, inspired by neural networks and the statistical physics of magnetic systems. It can be implemented using simple matrix operations and, unlike PageRank-style algorithms, avoids any iterative process.
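The abstract does not spell out the computation, but one reading of the Hopfield-inspired formulation is that, for a binary sentence-term matrix S, the energy matrix can be obtained as E = S·Sᵀ·S·Sᵀ (signs and constant factors dropped). A minimal sketch under that assumption, ranking sentences by total interaction magnitude:

```python
import numpy as np

def textual_energy_ranking(sentences):
    """Rank sentences by Textual Energy (hedged sketch of the Enertex idea).

    S is a binary sentence-by-term matrix; the energy matrix comes from
    plain matrix products, with no iterative process.
    """
    vocab = sorted({w for s in sentences for w in s.lower().split()})
    index = {w: j for j, w in enumerate(vocab)}
    S = np.zeros((len(sentences), len(vocab)))
    for i, s in enumerate(sentences):
        for w in s.lower().split():
            S[i, index[w]] = 1.0
    # Hopfield-style couplings J = S^T S, energy E = S J S^T (constants dropped)
    E = S @ S.T @ S @ S.T
    scores = np.abs(E).sum(axis=1)   # a sentence's total interaction energy
    return np.argsort(-scores)       # indices, most energetic first

sents = ["the cat sat on the mat",
         "the dog chased the cat",
         "quantum physics is unrelated"]
print(textual_energy_ranking(sents))
```

Note there is no iteration loop anywhere: a few matrix products suffice, which is exactly the contrast with PageRank that the abstract draws.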
Lecture Notes in Computer Science, 2007
In this paper we present a Neural Network approach, inspired by the statistical physics of magnetic systems, to study fundamental problems of Natural Language Processing (NLP). The algorithm models documents as neural networks whose Textual Energy is studied. We obtained good results applying this method to automatic summarization and topic segmentation.
Physica A: Statistical Mechanics and its Applications, 2018
MICAI 2007: Advances …, 2007
In this article we present a hybrid approach to the automatic summarization of Spanish medical texts. There are many systems for automatic summarization based on statistics or on linguistics, but only a few combine both techniques. Our idea is that to produce a good summary we need to use the linguistic aspects of texts, but we should also benefit …
Computing Research Repository, 2008
In this article we present a model of human written text based on a statistical mechanics approach, deriving the potential energy for different parts of the text using a large text corpus.
Automatic Control and Computer Sciences, 2007
A method is proposed for use in the summarization of text-based documents. The method makes it possible to discover latent topical sections and information-rich sentences. Its underlying basis, the clustering of sentences, is formulated mathematically as an integer quadratic programming problem. An algorithm is developed that determines the optimal number of clusters to a specified precision. The synthesis of a neural network for solving the integer quadratic programming problem is described.
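The abstract does not give the exact objective, but a generic integer quadratic program for partitioning n sentences into K clusters, with s_ij a sentence-similarity score, could look like the following (illustrative, not the authors' formulation):

```latex
\max_{x_{ik} \in \{0,1\}} \; \sum_{k=1}^{K} \sum_{i=1}^{n} \sum_{j=1}^{n} s_{ij}\, x_{ik}\, x_{jk}
\qquad \text{s.t.} \qquad \sum_{k=1}^{K} x_{ik} = 1 \quad \forall i,
```

where x_ik = 1 assigns sentence i to cluster k. A Hopfield-style network can relax such a binary program, which is presumably the role of the synthesized neural network.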
Proceedings of EMNLP, 2004
In this paper, we introduce TextRank, a graph-based ranking model for text processing, and show how this model can be successfully used in natural language applications. In particular, we propose two innovative unsupervised methods for keyword and sentence extraction, and show that the results obtained compare favorably with previously published results on established benchmarks.
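For concreteness, here is a minimal sketch of the sentence-ranking step: weighted PageRank by power iteration over a sentence-similarity graph. The similarity matrix and the damping factor d = 0.85 are the usual defaults, not values taken from this paper:

```python
import numpy as np

def textrank(sim, d=0.85, tol=1e-6, max_iter=100):
    """Power iteration for weighted PageRank over a sentence-similarity
    graph, in the spirit of TextRank (a sketch, not the reference code)."""
    n = sim.shape[0]
    W = sim.copy()
    np.fill_diagonal(W, 0.0)
    col = W.sum(axis=0)
    col[col == 0] = 1.0                 # avoid division by zero
    M = W / col                         # column-normalized transition matrix
    r = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        r_new = (1 - d) / n + d * (M @ r)
        if np.abs(r_new - r).sum() < tol:
            break
        r = r_new
    return r

# toy similarity matrix for three sentences
sim = np.array([[0.0, 0.5, 0.1],
                [0.5, 0.0, 0.3],
                [0.1, 0.3, 0.0]])
print(textrank(sim).round(3))
```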
… of the ACL-2002 Workshop on …, 2002
Topic segmentation can be used as a pre-processing step in numerous natural language processing applications. In this short paper, we will discuss how we adapted our segmentation algorithm for automatic summarization. … Human readers are able to construct a mental rep- …
In natural language processing (NLP), text summarization is a challenging task that involves reducing the length of a document while retaining its key points and essential details. In this project, we propose a solution to this problem by leveraging NLP techniques to automatically generate concise summaries from text data. Our approach uses several NLP preprocessing steps, such as part-of-speech tagging, tokenization, and sentence parsing, to analyze the content and structure of the input text. We then apply machine learning algorithms, such as RNNs and transformer models, to learn the important features and patterns of the text data. Based on these learned features, we develop a summarization model that can identify and extract relevant information from the original text. To evaluate the effectiveness of our approach, we will use standard evaluation metrics, such as ROUGE (Recall-Oriented Understudy for Gisting Evaluation), to assess the quality of the generated summaries in terms of their similarity to reference summaries. We will also conduct experiments on different types of text data, such as social media posts, news articles, and research papers, to assess the robustness and generalizability of our summarization model. The proposed project has several potential applications, including news article summarization, document summarization for information retrieval, and social media summarization for sentiment analysis. The outcome of this project is expected to contribute to the field of NLP and machine learning by providing an effective solution for text summarization, helping users quickly and efficiently extract key information from large amounts of text data.
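For the evaluation step, a hedged example using Google's third-party rouge-score package (assuming that implementation; the project text does not name a specific library):

```python
# Requires: pip install rouge-score  (Google's reference ROUGE implementation)
from rouge_score import rouge_scorer

reference = "the cat sat on the mat"
candidate = "a cat was sitting on the mat"

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"],
                                  use_stemmer=True)
scores = scorer.score(reference, candidate)   # score(target, prediction)
for name, s in scores.items():
    print(f"{name}: P={s.precision:.2f} R={s.recall:.2f} F={s.fmeasure:.2f}")
```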
Computación y Sistemas, 2020
This paper aims to show that generating and evaluating summaries are two linked but different tasks even when the same Divergence of the Probability Distribution (DPD) is used in both. This result allows the use of DPD functions for evaluating summaries automatically without references and also for generating summaries without falling into inconsistencies.
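The abstract leaves the particular DPD open; Jensen-Shannon divergence between the unigram distributions of the source and the candidate summary is one common choice, sketched below under that assumption:

```python
from collections import Counter
import numpy as np

def js_divergence(text_a, text_b):
    """Jensen-Shannon divergence between the unigram distributions of two
    texts (one common DPD choice; an assumption, not the paper's exact one)."""
    ca, cb = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    vocab = sorted(set(ca) | set(cb))
    p = np.array([ca[w] for w in vocab], dtype=float); p /= p.sum()
    q = np.array([cb[w] for w in vocab], dtype=float); q /= q.sum()
    m = 0.5 * (p + q)
    def kl(x, y):
        mask = x > 0                    # m > 0 wherever x > 0, so this is safe
        return float(np.sum(x[mask] * np.log2(x[mask] / y[mask])))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

doc = "the model compresses the document while keeping its key content"
summary = "the model keeps key content"
print(round(js_divergence(doc, summary), 3))  # lower = closer distributions
```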
Expert Systems with Applications, 2021
Automatic text summarization aims to cut down readers' time and cognitive effort by reducing the content of a text document without compromising its essence. Ergo, informativeness is the prime attribute of a document summary generated by an algorithm, and selecting sentences that capture the essence of a document is the primary goal of extractive document summarization. In this paper, we employ Shannon's entropy to capture the informativeness of sentences. We employ Non-negative Matrix Factorization (NMF) to reveal probability distributions for computing the entropy of terms, topics, and sentences in latent space. We present an information-theoretic interpretation of the computed entropy, which is the bedrock of the proposed E-Summ algorithm, an unsupervised method for extractive document summarization. The algorithm systematically applies information-theoretic principles to select informative sentences from important topics in the document. The proposed algorithm is generic and fast, and hence amenable to real-time document summarization. Furthermore, it is domain- and collection-independent and agnostic to the language of the document. Benefiting from strictly positive NMF factor …
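A rough sketch of the ingredients named here (NMF factors normalized to probability distributions, Shannon entropy per topic), using scikit-learn; this illustrates the idea, not the E-Summ selection procedure itself:

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import CountVectorizer

sentences = ["the cat sat on the mat",
             "the dog chased the cat",
             "stock markets fell sharply",
             "investors sold shares as markets fell"]

A = CountVectorizer().fit_transform(sentences).toarray().astype(float)  # sentence x term
W = NMF(n_components=2, init="nndsvd", random_state=0).fit_transform(A)  # sentence x topic

# Normalize each topic's column to a probability distribution over sentences
P = W / np.maximum(W.sum(axis=0), 1e-12)
Psafe = np.where(P > 0, P, 1.0)                 # log(1) = 0 contributes nothing
entropy = -np.sum(P * np.log2(Psafe), axis=0)   # one value per topic
print("topic entropies:", entropy.round(3))
# A low-entropy topic is concentrated in few sentences; one selection heuristic
# is to pick, per topic, the sentence with the largest weight in that topic.
print("top sentence per topic:", np.argmax(W, axis=0))
```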
2013 12th Mexican International Conference on Artificial Intelligence, 2013
In this paper we revisit the Textual Energy model. We deal with the two major disadvantages of Textual Energy: the asymmetry of its distribution and the unboundedness of its maximum value. Although this model has been successfully used in several NLP tasks like summarization, clustering and sentence compression, no correction of these problems has been proposed until now. Concerning the maximum value, we analyze the computation of the Textual Energy matrix and conclude that energy values are dominated by lexical richness, growing quadratically with vocabulary size. Using the Box-Cox transformation, we show empirical evidence that a log transformation can correct both problems.
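The suggested correction is easy to reproduce on toy data; a sketch with SciPy, assuming strictly positive energy scores (a fitted Box-Cox lambda near zero supports the log transformation the paper argues for):

```python
import numpy as np
from scipy import stats

# Toy textual-energy scores: heavily right-skewed, unbounded above
energies = np.array([3.0, 5.0, 8.0, 21.0, 340.0, 2900.0])

transformed, lam = stats.boxcox(energies)  # Box-Cox requires positive inputs
print(f"fitted lambda = {lam:.3f}")        # lambda near 0 ~ log transform
print(np.log(energies).round(2))           # the log transform the paper argues for
print(transformed.round(2))
```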
Journal of the American Society for Information Science and Technology, 2012
Multidocument summarization (MDS) aims for each given query to extract compressed and relevant information with respect to the different query-related themes present in a set of documents. Many approaches operate in two steps. Themes are first identified from the set, and then a summary is formed by extracting salient sentences within the different documents of each of the identified themes. Among these approaches, latent semantic analysis (LSA) based approaches rely on spectral decomposition techniques to identify the themes. In this article, we propose a major extension of these techniques that relies on the quantum information access (QIA) framework. The latter is a framework developed for modeling information access based on the probabilistic formalism of quantum physics. The QIA framework not only points out the limitations of the current LSA-based approaches, but motivates a new principled criterion to tackle multidocument summarization that addresses these limitations. As a byproduct, it also provides a way to enhance the LSA-based approaches. Extensive experiments on the DUC 2005, 2006 and 2007 datasets show that the proposed approach consistently improves over both the LSA-based approaches and the systems that competed in the yearly DUC competitions. This demonstrates the potential impact of quantum-inspired approaches to information access in general, and of the QIA framework in particular.
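As background, the classical LSA baseline these approaches extend can be sketched in a few lines: SVD of a sentence-term matrix, then one salient sentence per latent theme. This is the generic baseline, not the QIA method:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

sentences = ["the cat sat on the mat",
             "the dog chased the cat",
             "markets fell after the report",
             "investors reacted as markets fell"]

A = TfidfVectorizer().fit_transform(sentences).toarray()  # sentence x term
# SVD of the sentence-term matrix: rows of U are sentences in latent-theme space
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                                                     # number of themes kept
for theme in range(k):
    best = int(np.argmax(np.abs(U[:, theme])))
    print(f"theme {theme}: {sentences[best]}")
```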
2009
In this article we present a model of human written text based on statistical mechanics considerations. The empirical derivation of the potential energy for the parts of the text and the calculation of the thermodynamic parameters of the system show that the "specific heat" corresponds to the semantic classification of the words in the text, separating keywords, function words and common words. This can give advantages when the model is used in text-searching mechanisms.
2009
Written text is one of the fundamental manifestations of human language, and the study of its universal regularities can give clues about how our brains process information and how we, as a society, organize and share it. Among these regularities, only Zipf's law has been explored in depth. Other basic properties, such as the existence of bursts of rare words in specific documents, have only been studied independently of each other and mainly by descriptive models. As a consequence, there is a lack of understanding of linguistic processes as complex emergent phenomena. Beyond Zipf's law for word frequencies, here we focus on burstiness, Heaps' law describing the sublinear growth of vocabulary size with the length of a document, and the topicality of document collections, which encode correlations within and across documents absent in random null models. We introduce and validate a generative model that explains the simultaneous emergence of all these patterns from simple rules. As a result, we find a connection between the bursty nature of rare words and the topical organization of texts and identify dynamic word ranking and memory across documents as key mechanisms explaining the non-trivial organization of written text. Our research can have broad implications and practical applications in computer science, cognitive science and linguistics.
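Heaps' law, one of the regularities discussed, is straightforward to measure; a toy sketch that estimates the sublinear exponent beta from the log-log slope of vocabulary growth (synthetic Zipfian text, purely illustrative):

```python
import numpy as np

def heaps_curve(words):
    """Vocabulary size V(N) as a function of text length N; Heaps' law
    predicts sublinear growth V ~ N**beta with beta < 1."""
    seen, curve = set(), []
    for w in words:
        seen.add(w)
        curve.append(len(seen))
    return np.array(curve)

# toy Zipf-distributed word stream; replace with a real document
rng = np.random.default_rng(0)
ranks = np.arange(1, 5001)
probs = (1.0 / ranks) / np.sum(1.0 / ranks)
words = rng.choice(ranks, size=20000, p=probs)

V = heaps_curve(words)
N = np.arange(1, len(V) + 1)
beta = np.polyfit(np.log(N[100:]), np.log(V[100:]), 1)[0]  # log-log slope
print(f"estimated Heaps exponent beta ~ {beta:.2f}")
```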
2000
This paper deals with the problem of automatic topic detection in text documents. The proposed method follows a non-linear approach. The method uses a simple clustering algorithm to group semantically related sentences. The distance between two sentences is calculated based on the distance between all nouns that appear in the sentences. The distance between two nouns is calculated using the WordNet thesaurus. An automatic text summarization system using a topic strength method was used to compare the results achieved by the TextTiling algorithm and the proposed method. The initial results show that the proposed method is a promising approach.
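The paper's exact noun-distance measure is not restated here; one standard WordNet choice is path similarity, as in this sketch using NLTK (the helper name is hypothetical, and the WordNet corpus is assumed to be downloaded):

```python
# Requires: pip install nltk, then nltk.download("wordnet")
from nltk.corpus import wordnet as wn

def noun_distance(noun_a, noun_b):
    """Distance between two nouns from WordNet path similarity
    (one standard choice; the paper's exact measure is not specified)."""
    syns_a = wn.synsets(noun_a, pos=wn.NOUN)
    syns_b = wn.synsets(noun_b, pos=wn.NOUN)
    if not syns_a or not syns_b:
        return 1.0                               # unknown words: maximal distance
    sim = max(a.path_similarity(b) or 0.0
              for a in syns_a for b in syns_b)   # best sense pair
    return 1.0 - sim

print(noun_distance("cat", "dog"))      # small distance: close in the hierarchy
print(noun_distance("cat", "economy"))  # large distance
```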
In this work, we suggest a parameterized statistical model (the gamma distribution) for the frequency of word occurrences in long strings of English text and use this model to build a corresponding thermodynamic picture by constructing the partition function. We then use our partition function to compute thermodynamic quantities such as the free energy and the specific heat. In this approach, the parameters of the word frequency model vary from word to word, so that each word has a different corresponding thermodynamics, and we suggest that differences in the specific heat reflect differences in how the words are used in language, differentiating keywords from common and function words. Finally, we apply our thermodynamic picture to the problem of retrieval of texts based on keywords and suggest some advantages over traditional information retrieval methods.
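The thermodynamic quantities mentioned presumably follow the standard statistical-mechanics relations, with the word-specific gamma parameters entering through the partition function:

```latex
Z(\beta) = \sum_i e^{-\beta E_i}, \qquad
F = -\frac{1}{\beta}\,\ln Z, \qquad
U = -\frac{\partial \ln Z}{\partial \beta}, \qquad
C = \frac{\partial U}{\partial T}, \qquad \beta = \frac{1}{k_B T}.
```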
Physics Letters, 2018
Text can be regarded as a complex system, and methods from statistical physics can be used to study it. In this work, by means of statistical physics methods, we reveal new universal behaviors of texts associated with the fractality values of words in a text. The fractality measure indicates the importance of words in a text by considering the distribution pattern of words throughout the text. We observed a power-law relation between the fractality of a text and its vocabulary size for texts and corpora. We also observed this behavior when studying biological data.
Applied Network Science
Recent work has employed information theory in social and complex networks. Studies often discuss entropy in the degree distributions of a network. However, no specific work on entropy exists for clique networks. This work is an extension of a previous study that discussed this topic. We propose a method for calculating the entropy of a clique network and its minimum and maximum values in temporal semantic networks based on the titles of scientific papers. In addition, the critical network of moments was extracted. We use the titles of scientific papers published in Nature and Science over a ten-year period. The results show the diversity of vocabulary over time, based on the entropy values of vertices and edges. In each critical network, we discover the paths that connect important words and an interesting modular structure.
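The paper's precise vertex- and edge-entropy definitions are not reproduced here; as a hedged illustration, this sketch builds a clique network from paper titles with networkx and computes the Shannon entropy of its degree distribution:

```python
# Requires: pip install networkx
from collections import Counter
from itertools import combinations
from math import log2
import networkx as nx

titles = ["quantum entanglement in photonic systems",
          "entanglement entropy of quantum networks",
          "social networks and information diffusion"]

G = nx.Graph()
for t in titles:
    words = set(t.lower().split())
    G.add_edges_from(combinations(sorted(words), 2))  # each title forms a clique

degrees = Counter(dict(G.degree()).values())          # degree -> node count
n = G.number_of_nodes()
H = -sum((c / n) * log2(c / n) for c in degrees.values())
print(f"degree-distribution entropy: {H:.3f} bits")
```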