We consider topic detection without any prior knowledge of category structure or possible categories. Keywords are extracted and clustered based on different similarity measures using the induced k-bisecting clustering algorithm. Evaluation on Wikipedia articles shows that clusters of keywords correlate strongly with the Wikipedia categories of the articles. In addition, we find that a distance measure based on the Jensen-Shannon divergence of probability distributions outperforms the cosine similarity. In particular, a newly proposed term distribution taking co-occurrence of terms into account gives the best results.
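To make the comparison concrete, here is a minimal sketch (not the authors' code) of the two measures the abstract contrasts, assuming each keyword is represented by a vector of co-occurrence counts that can be read either as a term vector or, after normalization, as a probability distribution:

```python
import numpy as np

def cosine_similarity(p, q):
    """Cosine of the angle between two term vectors."""
    return np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q))

def jensen_shannon_distance(p, q):
    """Square root of the Jensen-Shannon divergence of two distributions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    p, q = p / p.sum(), q / q.sum()      # normalize counts to distributions
    m = 0.5 * (p + q)
    def kl(a, b):
        mask = a > 0                     # 0 * log 0 is taken as 0
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))
    return np.sqrt(0.5 * kl(p, m) + 0.5 * kl(q, m))

# Hypothetical co-occurrence counts for two keywords.
p = np.array([10, 5, 0, 1], dtype=float)
q = np.array([8, 4, 2, 0], dtype=float)
print(cosine_similarity(p, q), jensen_shannon_distance(p, q))
```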
Proceedings of the 23rd World Multi-Conference on Systemics, Cybernetics and Informatics: WMSCI 2019 Vol. I, pp. 99-103, 2019
Defining text topicality is often an expensive problem that requires significant resources for text labeling. Though many packages already exist that provide dictionaries of labeled text, synonyms, and Part-of-Speech tagging, the problem is ongoing as language develops and new meanings of words and phrases emerge. This paper proposes a solution to topic labeling of any text in the majority of languages that is cheap in terms of human labor. The methodology uses links to a naturally emerging corpus of labeled text: Wikipedia. Wikipedia categories are processed to extract a weighted set of topic labels for the analyzed text. The approach is evaluated by processing categorized texts and comparing the similarity of the top-ranked topic labels to the text category. The topic labels extracted with this methodology can be used for comparing the similarity of texts, for assessing the completeness of topic coverage in automated marking of essays, and for coding in qualitative text analysis. The paper contributes to the field of NLP by offering a cheap and organically developing method of topical text labeling, and to the work of qualitative analysts by offering a methodology for the analysis of interview transcripts and other unstructured text.
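As an illustration of the general idea, the following hedged sketch queries the public MediaWiki API for the categories of Wikipedia articles matched in a text and weights each category by how many extracted terms link to it; the weighting scheme and the term list are illustrative assumptions, not the paper's exact procedure:

```python
import requests
from collections import Counter

API = "https://en.wikipedia.org/w/api.php"

def categories_for(title):
    """Return the non-hidden Wikipedia categories of an article title."""
    params = {
        "action": "query", "prop": "categories", "clshow": "!hidden",
        "titles": title, "cllimit": "max", "format": "json",
    }
    pages = requests.get(API, params=params).json()["query"]["pages"]
    cats = []
    for page in pages.values():
        for c in page.get("categories", []):
            cats.append(c["title"].removeprefix("Category:"))
    return cats

def topic_labels(terms, k=5):
    """Weight each category by how many extracted terms link to it (assumed scheme)."""
    counts = Counter()
    for term in terms:
        counts.update(categories_for(term))
    return counts.most_common(k)

print(topic_labels(["Topic model", "Natural language processing"]))
```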
News media includes print media, broadcast news, and the internet. Print media comprises newspapers and news magazines, broadcast news comprises radio and television, while the internet offers online newspapers, news blogs, and so on. Online news has become the prevalent form of information on the internet. Often, the same event is depicted differently across news websites or sources due to varied perceptions of the same circumstance. The proposed system collects news data from such diverse sources, captures the varied perceptions, and summarizes and presents them in one place. Another goal of the proposed system is to detect topics accurately in short news data. Previous approaches such as LDA and its variants identify topics efficiently for long texts (news) but fail to do so for short texts (news) because of the data sparsity problem. Short news delivers sophisticated signals and is therefore an important resource for topic modeling, yet acute sparsity and irregularity are prevalent, posing new difficulties for existing topic models such as LDA and its variations. This paper provides a lucid but generic treatment of topic modeling in online news. The system presents a word co-occurrence network based model named WNTM, which works for both long and short news articles by managing the sparsity and imbalance issues simultaneously. WNTM assigns and reassigns (according to probability calculation) a topic to every word in the document rather than modeling topics for every document. It effectively improves the density of the information space without adding much time and space complexity. In this way, the rich context preserved in the word-word space also enables the detection of new and uncommon topics with convincing quality. The system extracts real-time online news data and uses it for system implementation. First, the topic modeling algorithm is applied to the online news data to identify the key topic of the incoming news and the most trending topic. Once the topic of a news item is identified, the system uses the k-means document clustering algorithm to cluster all the latest news associated with a particular topic together, thereby classifying the news by topic. After clustering, a summary is generated from the output, and the summarized news is presented to the user along with its topic.
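A rough sketch of the WNTM idea described above, under the common formulation in which each word's co-occurrence neighbors (collected from sliding windows) form a pseudo-document and an ordinary LDA model is trained over those pseudo-documents; the window size, toy corpus, and use of gensim are assumptions for illustration:

```python
from collections import defaultdict
from gensim import corpora, models   # assumes gensim is installed

def pseudo_documents(docs, window=3):
    """For each word, collect its co-occurrence neighbors from sliding windows."""
    neighbors = defaultdict(list)
    for tokens in docs:
        for i, w in enumerate(tokens):
            for u in tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]:
                neighbors[w].append(u)
    return list(neighbors.values())   # one pseudo-document per word

docs = [["stocks", "fall", "market", "panic"],
        ["market", "rally", "stocks", "surge"]]
pdocs = pseudo_documents(docs)
dictionary = corpora.Dictionary(pdocs)
corpus = [dictionary.doc2bow(d) for d in pdocs]
# Standard LDA over the word-word space instead of the document space.
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2)
print(lda.print_topics())
```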
2008
In this paper, we introduce a new clustering algorithm for obtaining labeled document clusters that accurately identify the topics of a text collection. To determine the topics, our approach relies on both probable term pairs generated from the collection and the estimation of the topic homogeneity associated with term pair clusters. Experimental results obtained over two benchmark text collections demonstrate the utility of this new approach.
2017
Document clustering is a technique that groups documents with similar content from a collection. It can be further extended to extract the topics of each group. Document clustering and topic identification form the backbone of information retrieval, but the size of the documents to be grouped, in terms of number of words, affects these processes negatively. The sparsity of terms in big documents adversely impacts the weight of individual terms and, in turn, the quality of the clusters. This paper presents the application of cluster analysis to a collection of small documents and a collection of big documents for topic identification. Results are presented as comparisons to emphasize the concerns with respect to big documents.
Proceedings of 3rd International Conference on Data Management Technologies and Applications, 2014
Topic extraction has become increasingly important due to its effectiveness in many tasks, including information filtering, information retrieval, and the organization of document collections in digital libraries. Topic detection consists of finding the most significant topics within a document corpus. In this paper we explore a feature-reduction methodology to highlight the most significant topics within a document corpus. We use an approach based on a clustering algorithm (X-means) over the tf-idf matrix computed from the corpus, in which the columns represent the frequency of terms occurring in each document and each row represents a document. To extract the topics, we build n binary problems, where n is the number of clusters produced by an unsupervised clustering approach, and we perform supervised feature selection over them, taking the top features as topic descriptors. We show the results obtained on two different corpora, both in Italian: the first consists of documents of the University of Naples Federico II, the second of a collection of medical records.
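The pipeline can be sketched as follows, substituting plain k-means for the paper's X-means (scikit-learn ships no X-means implementation) and chi-squared scoring as the supervised feature-selection step over the per-cluster binary problems; the toy corpus and parameter values are assumptions:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.feature_selection import chi2

docs = ["patient fever cough", "invoice tuition exam",
        "exam grade student", "fever diagnosis patient"]  # toy corpus

vec = TfidfVectorizer()
X = vec.fit_transform(docs)          # rows = documents, columns = terms
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

terms = np.array(vec.get_feature_names_out())
for k in set(labels):
    # One binary problem per cluster: cluster k vs the rest.
    y = (labels == k).astype(int)
    scores, _ = chi2(X, y)
    # Top-scoring features serve as the cluster's topic descriptors.
    print(f"cluster {k} descriptors:", terms[np.argsort(scores)[::-1][:3]])
```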
2012
Search results clustering (SRC) is a challenging algorithmic problem that requires grouping together the results returned by one or more search engines in topically coherent clusters, and labeling the clusters with meaningful phrases describing the topics of the results included in them.
2010
Weblogs are an important source of information that requires automatic techniques to categorize them into "topic-based" content, to facilitate their future browsing and retrieval. In this paper we propose and illustrate the effectiveness of a new tf.idf measure. The proposed Conf.idf and Catf.idf measures are based solely on the terms-to-concepts-to-categories (TCONCAT) mapping method, which utilizes Wikipedia. The knowledge base Wikipedia is treated as a large-scale web encyclopaedia with a huge number of high-quality articles and categorical indexes. Using this resource, the proposed framework solves the weblog classification problem in two stages. The first stage finds the terms belonging to a unique concept (article) and disambiguates the terms belonging to more than one concept. The second stage determines the categories to which the found concepts belong. Experimental results confirm that the proposed system can efficiently distinguish weblogs belonging to more than one category and performs better than traditional statistical Natural Language Processing (NLP) approaches.
This paper discusses a system for online new event detection as part of the Topic Detection and Tracking (TDT) initiative. Our approach uses a single-pass clustering algorithm, which includes a time-based selection model and a thresholding model. We evaluate two benchmark systems: the first indexes documents by keywords, and the second attempts to perform conceptual indexing through the use of the WordNet thesaurus software. We propose a more complex document/cluster representation using lexical chaining. We believe such a representation will improve the overall performance of our system by allowing us to encapsulate the context surrounding a word and to disambiguate its senses.
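For orientation, here is a minimal sketch of single-pass clustering with a similarity threshold in the TDT setting; the time-based selection model and the lexical-chaining representation proposed in the paper are omitted, and the document vectors and threshold value are assumptions:

```python
import numpy as np

def single_pass(vectors, threshold=0.5):
    """Assign each document to the most similar cluster centroid, or start
    a new cluster (a new event) when no centroid is similar enough."""
    centroids, assignments = [], []
    for v in vectors:
        v = v / np.linalg.norm(v)
        sims = [float(c @ v) for c in centroids]
        if sims and max(sims) >= threshold:
            k = int(np.argmax(sims))
            # Pull the centroid toward the new member and re-normalize.
            centroids[k] = centroids[k] + v
            centroids[k] /= np.linalg.norm(centroids[k])
        else:
            centroids.append(v)          # no match: this is a new event
            k = len(centroids) - 1
        assignments.append(k)
    return assignments

print(single_pass(np.random.rand(5, 10)))
```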
This paper shows how Wikipedia and the semantic knowledge it contains can be exploited for document clustering. We first create a concept-based document representation by mapping the terms and phrases within documents to their corresponding articles (or concepts) in Wikipedia. We also develop a similarity measure that evaluates the semantic relatedness between the concept sets of two documents. We test the concept-based representation and the similarity measure on two standard text document datasets. Empirical results show that although further optimizations could be performed, our approach already improves upon related techniques. This is an author's accepted version of an article published in Proceedings of the 13th Pacific-Asia Conference, PAKDD 2009, Bangkok, Thailand, April 27-29. ©2009 Springer-Verlag Berlin Heidelberg.
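As a stand-in for the paper's semantic-relatedness measure (the abstract does not specify it), plain Jaccard overlap between two documents' Wikipedia concept sets illustrates where such a measure plugs in:

```python
def jaccard(concepts_a, concepts_b):
    """Set overlap between two concept sets; a simple baseline, not the
    paper's actual semantic-relatedness measure."""
    a, b = set(concepts_a), set(concepts_b)
    return len(a & b) / len(a | b) if a | b else 0.0

doc1 = {"Machine learning", "Cluster analysis", "Wikipedia"}
doc2 = {"Cluster analysis", "Document classification", "Wikipedia"}
print(jaccard(doc1, doc2))   # 0.5
```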
2012
The rapid growth in the number of documents available to end users from around the world has led to a greatly increased need for machine understanding of their topics, as well as for automatic grouping of related documents. This constitutes one of the main current challenges in text mining. In this work, a novel technique is proposed to automatically construct a background knowledge structure in the form of a hierarchical ontology, using one of the largest online knowledge repositories: Wikipedia. A novel approach is then presented to automatically identify documents' topics based on the proposed Wikipedia Hierarchical Ontology (WHO). Results show that the proposed model is efficient in identifying documents' topics and promising, as it outperforms the accuracy of other conventional algorithms for document clustering.