2000
This paper describes research into the development of techniques to build effective Topic Tracking systems. Topic tracking involves tracking a given news event in a stream of news stories, i.e., finding all subsequent stories in the news stream that discuss the given event. This research has grown out of the Topic Detection and Tracking (TDT) initiative sponsored by DARPA. The paper describes the results of a topic tracking system designed using traditional IR techniques and outlines a new approach to TDT using lexical chaining, which should improve effectiveness.
1998
Topic Detection and Tracking (TDT) is a DARPA-sponsored initiative to investigate the state of the art in finding and following new events in a stream of broadcast news stories. The TDT problem consists of three major tasks: (1) segmenting a stream of data, especially recognized speech, into distinct stories; (2) identifying those news stories that are the first to discuss a new event occurring in the news; and (3) given a small number of sample news stories about an event, finding all following stories in the stream.
In this paper, we present our recent contributions to text mining, in particular to topic extraction and tracking. After a brief overview of the state of the art, we present a complete system for extracting topics and finding understandable key phrases to label them, together with a platform for fetching information forums (either RSS feeds or Web sites) and for analyzing online discussions. We also describe current work and preliminary results on tracking topics across various information sources and on handling the evolution of topics over time. We raise the crucial issue of validating topic models. A substantial part of the paper is devoted to the future work we intend to pursue.
The rapid proliferation of the World Wide Web has led to an enormous increase in the availability of textual corpora. In this paper, the problem of topic detection and tracking is considered with application to news items. The proposed approach explores two algorithms, Non-Negative Matrix Factorization and a dynamic version of Latent Dirichlet Allocation (DLDA), over discrete time steps and makes it possible to identify topics within storylines as they appear and to track them through time. Moreover, emphasis is given to visualization of and interaction with the results through a graphical tool that works regardless of the approach. Experimental analysis on the Reuters RCV1 corpus and the Reuters 2015 archive shows that the explored approaches can be used effectively to identify topic appearances and their evolution while also allowing for an efficient visualization.
This paper discusses a system for online new event detection as part of the Topic Detection and Tracking (TDT) initiative. Our approach uses a single-pass clustering algorithm, which includes a time-based selection model and a thresholding model. We evaluate two benchmark systems: The first indexes documents by keywords and the second attempts to perform conceptual indexing through the use of the WordNet thesaurus software. We propose a more complex document/cluster representation using lexical chaining. We believe such a representation will improve the overall performance of our system by allowing us to encapsulate the context surrounding a word and to disambiguate its senses.
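The single-pass clustering idea in the abstract above can be sketched roughly as follows. This is a minimal illustration, not the authors' system: the bag-of-words representation, the cosine measure, and the names `single_pass_cluster`, `threshold`, and `window` are all assumptions made for the example.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def single_pass_cluster(stories, threshold=0.3, window=2):
    """Cluster (timestamp, text) pairs in arrival order, one pass only.

    threshold: hypothetical minimum similarity to join a cluster.
    window:    hypothetical time-based selection; clusters not updated
               within this many time units are ignored.
    Returns the clusters and a per-story "new event" flag.
    """
    clusters = []
    new_event = []
    for idx, (t, text) in enumerate(stories):
        vec = Counter(text.lower().split())
        best, best_sim = None, 0.0
        for c in clusters:
            if t - c["last_seen"] > window:
                continue  # cluster is too old to be considered
            sim = cosine(vec, c["centroid"])
            if sim > best_sim:
                best, best_sim = c, sim
        if best is not None and best_sim >= threshold:
            best["centroid"].update(vec)  # fold the story into the cluster
            best["last_seen"] = t
            best["members"].append(idx)
            new_event.append(False)
        else:
            clusters.append({"centroid": Counter(vec), "last_seen": t, "members": [idx]})
            new_event.append(True)
    return clusters, new_event


stories = [
    (0, "earthquake hits city"),
    (1, "earthquake city rescue"),
    (2, "election results announced"),
    (10, "earthquake hits city"),
]
clusters, flags = single_pass_cluster(stories)
```

In this toy run the last story starts a new cluster even though its text matches the first one, because the earlier cluster has fallen outside the time window.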
The Vector Space Model (VSM) has attracted significant research attention in recent years because of its advantages in topic tracking. However, its effectiveness is limited by its inability to reveal that different keywords express the same concept or to capture hidden semantic relations in the text, so the accuracy of topic tracking can hardly be guaranteed. To address these issues, a modified VSM, the Semantic Vector Space Model, is put forward. To establish the model, numerous lexical chains based on HowNet are first built, then sememes of the lexical chains are extracted as characteristics of feature vectors. Afterwards, the initial weight and structural weight of the characteristics are calculated to construct the Semantic Vector Space Model, which encompasses both semantic and structural information. The initial weight is derived from word frequency, while the structural weight is obtained from a designed calculation method: each lexical chain's structural weight is defined as (m + 1)/S, where m is the number of other similar chains and S is the number of reports used for extraction of the lexical chains. Finally, the model is applied to web news topic tracking with satisfactory experimental results, confirming that the method is effective.
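As a small worked example of the structural weight (m + 1)/S defined above, the sketch below simply evaluates the formula. The function name `structure_weight` and the sample numbers are hypothetical; how this weight is combined with the frequency-based initial weight is not stated in the abstract and is not modeled here.

```python
def structure_weight(m, S):
    """Structural weight of a lexical chain: (m + 1) / S.

    m: number of other chains similar to this one.
    S: number of reports the lexical chains were extracted from.
    """
    if S <= 0:
        raise ValueError("S must be a positive number of reports")
    if m < 0:
        raise ValueError("m cannot be negative")
    return (m + 1) / S


# A chain with 2 similar chains, extracted from 6 reports (made-up numbers).
w = structure_weight(2, 6)
```

With these illustrative values the weight is 3/6 = 0.5; a chain with no similar chains in the same six reports would get 1/6.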
1999
The goal of Topic Detection and Tracking (TDT) is to develop automatic methods of identifying topically related stories within a stream of news media. We describe approaches for both detection and tracking based on the well-known idf-weighted cosine coefficient similarity metric. The surprising outcome of this research is that we achieved very competitive results for tracking using a very simple method of feature selection, without word stemming and without a score normalization scheme.
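An idf-weighted cosine coefficient of the kind mentioned above might look like the sketch below. The exact weighting used in the paper is not given in the abstract, so the formula here (term count times log(N/df)) and the names `idf_table` and `idf_cosine` are assumptions for illustration only.

```python
import math
from collections import Counter


def idf_table(docs):
    """Map each term to log(N / df), where df is its document frequency."""
    n = len(docs)
    df = Counter()
    for d in docs:
        df.update(set(d))
    return {t: math.log(n / df[t]) for t in df}


def idf_cosine(a, b, idf):
    """Cosine similarity of two token lists under tf * idf weighting."""
    ta, tb = Counter(a), Counter(b)
    wa = {t: ta[t] * idf.get(t, 0.0) for t in ta}
    wb = {t: tb[t] * idf.get(t, 0.0) for t in tb}
    dot = sum(wa[t] * wb[t] for t in wa if t in wb)
    na = math.sqrt(sum(v * v for v in wa.values()))
    nb = math.sqrt(sum(v * v for v in wb.values()))
    return dot / (na * nb) if na and nb else 0.0


docs = [
    ["quake", "city", "the"],
    ["quake", "rescue", "the"],
    ["vote", "the"],
]
idf = idf_table(docs)
related = idf_cosine(docs[0], docs[1], idf)
unrelated = idf_cosine(docs[0], docs[2], idf)
```

Because "the" occurs in every toy document its idf is zero, so the first and third stories, which share only that word, score 0.0 even without stop-word removal.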
Web mining is the application of data mining techniques to discover patterns from the Web. Topic tracking is one of the technologies that have been developed for, and can be used in, the text mining process. The main purpose of topic tracking is to identify and follow events presented in multiple news sources, including newswires, radio and TV broadcasts. In this paper, a survey of topic tracking techniques is presented.
2008
This paper presents a keyword extraction technique that can be used for tracking topics over time. In our work, keywords are a set of significant words in an article that give readers a high-level description of its contents. Identifying keywords from a large amount of on-line news data is very useful in that it can produce a short summary of news articles. As on-line text documents rapidly increase in number with the growth of the WWW, keyword extraction has become a basis of several text mining applications such as search engines, text categorization, summarization, and topic detection. Manual keyword extraction is an extremely difficult and time-consuming task; in fact, it is almost impossible to extract keywords manually from the news articles published in a single day because of their volume. To make keywords available quickly, we need an automated process that extracts keywords from news articles. We propose an unsupervised keyword extraction technique that includes several variants of the conventional TF-IDF model with reasonable heuristics.
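For orientation, a conventional TF-IDF keyword ranking, the baseline that the abstract's variants build on, can be sketched as below. The specific variants and heuristics of the paper are not described in the abstract and are not reproduced; `top_keywords` and the toy articles are hypothetical.

```python
import math
from collections import Counter


def top_keywords(docs, k=3):
    """Return the k highest-scoring TF-IDF terms for each tokenized document.

    Score = (term count / document length) * log(N / document frequency).
    Ties are broken alphabetically so the output is deterministic.
    """
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))
    result = []
    for d in docs:
        tf = Counter(d)
        scores = {t: (tf[t] / len(d)) * math.log(n / df[t]) for t in tf}
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        result.append(ranked[:k])
    return result


articles = [
    ["flood", "flood", "river", "the"],
    ["vote", "the", "poll"],
    ["the", "river", "bank"],
]
keywords = top_keywords(articles, k=1)
```

In the first toy article "flood" ranks highest because it is frequent there and absent elsewhere, while "the" scores zero since it appears in every article.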
2002
We extend relevance modeling to the link detection task of Topic Detection and Tracking (TDT) and show that it substantially improves performance. Relevance modeling, a statistical language modeling technique related to query expansion, is used to enhance the topic model estimate associated with a news story, boosting the probability of words that are associated with the story even when they do not appear in the story.
2012
Interactive Topic Detection and Tracking (iTDT) refers to TDT work that focuses on user interaction, user evaluation and user interface aspects. This article investigates and identifies elements of the design of an interface that aims to help journalists perform TDT tasks such as tracking and detection. It presents an iTDT interface called Interactive Event Tracking (iEvent) and evaluates the usability of the features introduced.
The Information Retrieval Series, 2002
This paper presents algorithms for Chinese and English-Chinese topic detection. Named entities, other nouns and verbs are cue patterns to relate news stories describing the same event. Lexical translation and name transliteration resolve lexical differences between English and Chinese. A two-threshold scheme determines relevance (irrelevance) between a news story and a topic cluster. Lookahead information deals with ambiguous cases in clustering. The least-recently-used removal strategy models the time factor in such a way that older and unimportant terms will have no effect on clustering. Experimental results show that nouns and verbs as well as the least-recently-used removal strategy outperform other models. The named-entity-only approach performs slightly worse, but it avoids the overhead of the nouns-and-verbs approach with the least-recently-used removal strategy.
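One plausible reading of the two-threshold scheme is sketched below: a story clearly above the upper threshold is relevant to a cluster, one clearly below the lower threshold is irrelevant, and anything in between is left as ambiguous for a later step such as lookahead. The threshold values and the function name `two_threshold_decision` are assumptions; the abstract does not give the actual values or the lookahead procedure.

```python
def two_threshold_decision(similarity, low=0.2, high=0.5):
    """Classify a story-to-cluster similarity with two thresholds.

    low, high: hypothetical thresholds; scores between them are
    treated as ambiguous and deferred rather than decided here.
    """
    if low > high:
        raise ValueError("low threshold must not exceed high threshold")
    if similarity >= high:
        return "relevant"
    if similarity < low:
        return "irrelevant"
    return "ambiguous"


decisions = [two_threshold_decision(s) for s in (0.8, 0.35, 0.05)]
```

The band between the two thresholds is where extra evidence pays off, which is presumably why the authors pair the scheme with lookahead information.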
Information Processing & Management, 2007
In this paper, we propose a new language model, namely, a dependency structure language model, for topic detection and tracking (TDT) to compensate for weakness of unigram and bigram language models. The dependency structure language model is based on the Chow expansion theory and the dependency parse tree generated by a linguistic parser. So, long-distance dependencies can be naturally captured by the dependency structure language model. We carried out extensive experiments to verify the proposed model on topic tracking and link detection in TDT. In both cases, the dependency structure language models perform better than strong baseline approaches.
2003
In this work, we present a new semantic language modeling approach to model news stories in the Topic Detection and Tracking (TDT) task. In the new approach, we build a unigram language model for each semantic class in a news story. We also cast the link detection subtask of TDT as a two-class classification problem in which the features of each sample consist of the generative log-likelihood ratios from each semantic class. We then compute a linear discriminant classifier using the perceptron learning algorithm on the training set. Results on the test set show a marginal improvement over the unigram performance, but are not very encouraging on the whole.
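The classification step described above, a linear discriminant trained with the perceptron rule on per-class log-likelihood ratios, can be illustrated with the toy sketch below. The feature values are invented for the example, and `train_perceptron` and `predict` are hypothetical names; nothing here reproduces the authors' features or results.

```python
def train_perceptron(samples, labels, epochs=20, lr=1.0):
    """Classic perceptron: labels are +1 (linked) or -1 (not linked)."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        errors = 0
        for x, y in zip(samples, labels):
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:  # misclassified (or on the boundary): update
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
                errors += 1
        if errors == 0:
            break  # converged on the training set
    return w, b


def predict(w, b, x):
    """Return +1 if the story pair is predicted linked, else -1."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1


# Each row: made-up log-likelihood ratios, one per semantic class.
X = [[2.0, 1.5], [1.0, 2.5], [-1.5, -0.5], [-2.0, -1.0]]
y = [1, 1, -1, -1]
w, b = train_perceptron(X, y)
predictions = [predict(w, b, x) for x in X]
```

The toy data are linearly separable, so the perceptron converges quickly; real link detection features need not be, which may be one reason the reported gains were marginal.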
2000
We investigate the extraction of linked-object representations (LORs) from text for use in topic tracking. LORs provide us a way to represent relationships between objects found in text. We show the use of naive coreference resolution during the extraction of objects does not provide improvement over the absence of coreference resolution. We investigate the creation of links, or relationships, between objects through closeness of the objects in text for small and large window sizes.
2009
This paper presents the design of a new interface for interactive Topic Detection and Tracking (TDT) called Ievent. It is composed of three main views: a Cluster View, a Document View, and a Named Entity View, which support the user in identifying new events and tracking them in a news stream. The interface has also been designed to test the usefulness of named entity recognition in interactive TDT. We report some initial findings from a user study on the effectiveness of our novel interface.
Proceedings of the 32nd …, 2009
BBN's systems for TDT use probabilistic models for higher accuracy and easy training. They generate measures that are normalized across topics, so that only one threshold is necessary to make decisions. These systems make little or no use of deep linguistic knowledge and are therefore easy to modify for new languages and domains. At the same time, their performance has consistently been in the top tier.
2004
The huge volume of news makes it hard for people to keep up with the latest information, so automatic processing of news information becomes necessary. Topic Detection and Tracking is a research program that deals with this problem. Observations in TDT show that news topics can be described at different sizes, making it hard to define the "correct" granularity.