2009, Lecture Notes in Computer Science
Continuous monitoring of web-based news sources has emerged as a key intelligence task particularly for Homeland Security. We propose a system for web-based news tracking and alerting. Unlike subscription-based alerts, alerting is implemented as a personalized service where the system is trained to recognize potentially important news based on user preferences. Preferences are expressed as combinations of topics and can change dynamically. The system employs Latent Dirichlet Allocation (LDA) for topic discovery and Latent Semantic Indexing (LSI) for alerting.
The rapid proliferation of the World Wide Web has led to an enormous increase in the availability of textual corpora. In this paper, the problem of topic detection and tracking is considered with application to news items. The proposed approach explores two algorithms (Non-Negative Matrix Factorization and a dynamic version of Latent Dirichlet Allocation (DLDA)) over discrete time steps, making it possible to identify topics within storylines as they appear and to track them through time. Moreover, emphasis is given to visualization of and interaction with the results through the implementation of a graphical tool (regardless of the approach). Experimental analysis on the Reuters RCV1 corpus and the Reuters 2015 archive reveals that the explored approaches can be effectively used as tools for identifying topic appearances and their evolution, while at the same time allowing for efficient visualization.
2008 Eighth IEEE International Conference on Data Mining, 2008
This paper presents Online Topic Model (OLDA), a topic model that automatically captures the thematic patterns and identifies emerging topics of text streams and their changes over time. Our approach allows the topic modeling framework, specifically the Latent Dirichlet Allocation (LDA) model, to work in an online fashion such that it incrementally builds an up-to-date model (mixture of topics per document and mixture of words per topic) when a new document (or a set of documents) appears. A solution based on the Empirical Bayes method is proposed. The idea is to incrementally update the current model according to the information inferred from the new stream of data, with no need to access previous data. The dynamics of the proposed approach also provide an efficient means to track topics over time and detect emerging topics in real time. Our method is evaluated both qualitatively and quantitatively using benchmark datasets. In our experiments, OLDA has discovered interesting patterns by analyzing just a fraction of the data at a time. Our tests also demonstrate the ability of OLDA to align topics across epochs, thereby capturing the evolution of topics over time. OLDA is also comparable to, and sometimes better than, the original LDA in predicting the likelihood of unseen documents.
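The incremental idea behind OLDA can be caricatured in a few lines: keep word-topic counts, and fold each new batch into them without revisiting old data. The class below is only a toy stand-in for that workflow, not the authors' Empirical Bayes inference; the hard assignment, the `beta` smoothing value, and the deterministic seeding of unseen words are all illustrative assumptions.

```python
from collections import defaultdict

class NaiveOnlineTopics:
    """Toy sketch of folding new batches into word-topic counts.

    NOT the OLDA algorithm: real OLDA carries the previous posterior
    forward as a prior and runs proper inference per epoch. This class
    only mimics the 'update without revisiting old data' workflow.
    """

    def __init__(self, num_topics, beta=0.01):
        self.K = num_topics
        self.beta = beta  # symmetric smoothing prior (illustrative value)
        self.counts = [defaultdict(float) for _ in range(self.K)]

    def _seed(self, word):
        # deterministic spread of unseen words across topics
        return sum(ord(c) for c in word) % self.K

    def update(self, batch):
        """Fold a new batch of tokenized documents into the model."""
        for doc in batch:
            for word in doc:
                seen = any(self.counts[k].get(word, 0) > 0
                           for k in range(self.K))
                if seen:
                    # reinforce the currently strongest topic for this word
                    best = max(range(self.K),
                               key=lambda k: self.counts[k][word] + self.beta)
                else:
                    best = self._seed(word)
                self.counts[best][word] += 1.0

    def top_words(self, k, n=3):
        """Most frequent words of topic k."""
        return [w for w, _ in sorted(self.counts[k].items(),
                                     key=lambda kv: -kv[1])[:n]]
```

Each call to `update` only touches the incoming batch, which is the property the abstract emphasizes; everything learned earlier survives in `self.counts`.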
News media includes print media, broadcast news and the internet. Print media contains newspapers and news magazines; broadcast news contains radio and television; the internet contains online newspapers, news blogs, etc. Online news has become the prevalent form of information on the internet. Often, the same event or happening is depicted differently in different news websites or sources due to varied perceptions of the same circumstance. The proposed system intends to collect news data from such diverse sources, capture the varied perceptions, summarize them and present them in one place. Another goal of the proposed system is to detect topics accurately in short news data. Previous approaches like LDA and its variants can identify topics efficiently for long texts (news) but fail to do so for short texts (news) due to the data sparsity problem. Since short news items deliver concentrated signals, they are an important resource for topic modeling; however, acute sparsity and irregularity are prevalent, posing new difficulties for existing topic models such as LDA and its variations. In this paper, a lucid but generic treatment of topic modeling in online news is provided. The system presents a word co-occurrence network based model named WNTM, which works for both long and short news articles by managing the sparsity and imbalance issues simultaneously. WNTM is modeled by assigning and reassigning (according to probability calculation) a topic to every word in the document rather than modeling topics for every document. It effectively improves the density of the information space without significantly increasing time and space complexity. In this way, the rich context preserved in the word-word space likewise helps detect new and uncommon topics with convincing quality. The system extracts real-time online news data and uses this data for system implementation.
First, a topic modeling algorithm is applied to this online news data to identify the key topic of the incoming news and the most trending topic. Once the topic of a news item is identified, the system uses the k-means document clustering algorithm to cluster all of the latest news associated with a particular topic together, thereby classifying the news on the basis of topic. After clustering, a summary is generated from the output, and the summarized news is presented to the user along with its topic.
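The clustering step above can be sketched with a tiny k-means over term-frequency vectors using cosine similarity. This is an illustration only, not the paper's implementation: a real system would use TF-IDF weighting and better centroid seeding, and the seeding-by-first-k-documents below is an assumption made for brevity.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(w * b.get(t, 0) for t, w in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def kmeans_cosine(docs, k, iters=10):
    """Tiny k-means over term-frequency vectors with cosine similarity.

    `docs` is a list of token lists; the first k documents seed the
    centroids (an illustrative simplification). Returns one cluster
    index per document.
    """
    vecs = [Counter(d) for d in docs]
    centroids = [Counter(v) for v in vecs[:k]]
    assign = [0] * len(vecs)
    for _ in range(iters):
        # assignment step: nearest centroid by cosine
        assign = [max(range(k), key=lambda c: cosine(v, centroids[c]))
                  for v in vecs]
        # update step: centroid = summed counts of its members
        for c in range(k):
            merged = Counter()
            for v, a in zip(vecs, assign):
                if a == c:
                    merged.update(v)
            if merged:
                centroids[c] = merged
    return assign
```

For example, two election headlines and two sports headlines end up in two separate clusters, which is exactly the grouping the summarizer would then consume.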
2000
This paper describes research into the development of techniques to build effective Topic Tracking systems. Topic tracking involves tracking a given news event in a stream of news stories i.e. finding all subsequent stories in the news stream that discuss the given event. This research has grown out of the Topic Detection and Tracking (TDT) initiative sponsored by DARPA. The paper describes the results of a topic tracking system designed using traditional IR techniques and outlines a new approach to TDT using lexical chaining which should improve effectiveness.
Large, real-time text classification systems are becoming a popular topic. We present a method for automatically extracting correlated news from online media using a dynamic similarity graph, and we use the variation of information as a measure to identify topics, their lifespans and key terms. The presented method has the advantage of requiring no human intervention or training and having no pre-assigned categories, because the categories emerge from the dynamics of the generated network.
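The variation of information mentioned here is a standard information-theoretic distance between two clusterings of the same items: VI = H(A) + H(B) - 2·I(A;B), zero when the clusterings agree. The function below computes it from cluster labels; the graph construction itself is not reproduced.

```python
import math
from collections import Counter

def variation_of_information(labels_a, labels_b):
    """Variation of information between two clusterings of the same items.

    VI = H(A) + H(B) - 2*I(A;B). It is 0 when the two clusterings are
    identical (up to relabeling) and grows as they diverge.
    """
    n = len(labels_a)
    assert n == len(labels_b) and n > 0
    ca, cb = Counter(labels_a), Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    vi = 0.0
    for (a, b), nab in joint.items():
        p_ab = nab / n          # joint probability of cell (a, b)
        p_a, p_b = ca[a] / n, cb[b] / n
        # -p(a,b) * [log p(a,b)/p(a) + log p(a,b)/p(b)]
        vi -= p_ab * (math.log(p_ab / p_a) + math.log(p_ab / p_b))
    return vi
```

Identical clusterings give 0; two independent balanced two-way splits of four items give H(A) + H(B) = 2·ln 2.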
I have put great effort into this project. However, it would not have been possible without the kind support and help of many individuals and organizations, and I would like to extend my sincere thanks to all of them. I am highly indebted to my guide, Prof. S. D. Bandari, for her guidance and constant supervision, for providing the necessary information regarding the project, and for her support in completing it. I would like to express my gratitude towards my parents and the members of my institute for their kind cooperation and encouragement, which helped me complete this project. My thanks and appreciation also go to my colleagues in developing the project and to the people who have willingly helped me with their abilities.
Proceedings of the 20th international conference on World wide web, 2011
News clustering, categorization and analysis are key components of any news portal. They require algorithms capable of dealing with dynamic data to cluster, interpret and to temporally aggregate news articles. These three tasks are often solved separately. In this paper we present a unified framework to group incoming news articles into temporary but tightly-focused storylines, to identify prevalent topics and key entities within these stories, and to reveal the temporal structure of stories as they evolve. We achieve this by building a hybrid clustering and topic model. To deal with the available wealth of data we build an efficient parallel inference algorithm by sequential Monte Carlo estimation. Time and memory costs are nearly constant in the length of the history, and the approach scales to hundreds of thousands of documents. We demonstrate the efficiency and accuracy on the publicly available TDT dataset and data of a major internet news site.
In this paper, we present our recent contributions in the field of text mining, especially with regard to topic extraction and tracking. After a brief overview of the state of the art, we present a complete system for extracting topics and finding understandable key phrases to label these topics; we present a platform for fetching information from forums (either RSS feeds or Web sites) and for analyzing online discussions. We also describe current work and preliminary results on tracking topics through various information sources and on handling the evolution of topics over time. The crucial issue of validating topic models is raised. An important part of the paper is devoted to the future work in which we are interested.
INTERNATIONAL JOURNAL OF RECENT TRENDS IN ENGINEERING & RESEARCH, 2019
We present a topic identification system for news, based upon an evaluation of similarity between topics and a large collection of documents in the news database. Our system is able to provide topics for every news sample. The system implements and compares two topic models, Latent Dirichlet Allocation (LDA) and Latent Semantic Analysis (LSA), on a news database containing eleven thousand documents. The behaviour of the topic models has been examined on the basis of standard metrics: accuracy and the implementation speed of the algorithms.
1998
Topic Detection and Tracking (TDT) is a DARPA-sponsored initiative to investigate the state of the art in finding and following new events in a stream of broadcast news stories. The TDT problem consists of three major tasks: (1) segmenting a stream of data, especially recognized speech, into distinct stories; (2) identifying those news stories that are the first to discuss a new event occurring in the news; and (3) given a small number of sample news stories about an event, finding all following stories in the stream.
Proceedings of the 32nd …, 2009
Web mining is the application of data mining techniques to discover patterns from the Web. Topic tracking is one of the technologies that has been developed for use in the text mining process. The main purpose of topic tracking is to identify and follow events presented in multiple news sources, including newswires, radio and TV broadcasts. In this paper, a survey of topic tracking techniques is presented.
Twitter is a very popular social networking site with simple rules, many interactions, and accessible data. This makes it a great target for studying the possibilities of automated monitoring. This paper aims to realize the concept of the Automated Topic-Focused Monitor (ATM) framework and to address the challenges of implementing such a system. ATM is a system that gathers tweets relevant to a topic, such as sports or politics, in real time, and iteratively adapts the keywords used to gather relevant tweets based on recent history. To adapt the keywords, we use text-based classification combined with a practical greedy selection algorithm. This framework makes it simple to introduce new topics to the system and to train their accompanying classifiers. We then conduct a series of experiments to judge the framework's effectiveness.
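A greedy keyword-selection step of the kind described can be sketched as a set-cover-style loop: repeatedly pick the keyword that covers the most not-yet-covered relevant tweets. This is a generic stand-in under stated assumptions, not the ATM algorithm itself; the classifier that labels tweets as relevant is assumed to exist elsewhere, and the whitespace tokenization is a simplification.

```python
def greedy_keywords(relevant_tweets, budget):
    """Greedily pick up to `budget` keywords, each chosen to cover the
    largest number of relevant tweets not yet covered by earlier picks.

    `relevant_tweets` is a list of tweet strings already judged relevant
    (by an external classifier, not modeled here).
    """
    tweets = [set(t.lower().split()) for t in relevant_tweets]
    uncovered = set(range(len(tweets)))
    vocab = set().union(*tweets) if tweets else set()
    chosen = []
    while uncovered and len(chosen) < budget:
        # marginal gain of a keyword = uncovered tweets containing it
        word = max(vocab,
                   key=lambda w: sum(1 for i in uncovered if w in tweets[i]))
        covered = {i for i in uncovered if word in tweets[i]}
        if not covered:
            break  # no remaining keyword covers anything new
        chosen.append(word)
        uncovered -= covered
        vocab.discard(word)
    return chosen
```

The greedy choice gives the classic (1 - 1/e) approximation guarantee for coverage, which is why this style of selection is practical at streaming scale.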
Telematika, 2021
Online media news portals have the advantage of speed in conveying information about events occurring in society. One way to know what a story is about is from its title: the headline introduces the reader to the news content to be described. From these headlines, the main topics or trends being discussed can be found. A fast and efficient method is needed to find out which topics are trending in the news. One method that can be used to overcome this problem is topic modeling, which helps users quickly understand recent issues. One of the algorithms in topic modeling is Latent Dirichlet Allocation (LDA). The stages of this research were data collection, preprocessing, forming n-grams, dictionary representation, weighting, validating the topic model, forming the topic model, and analyzing the results of topic modeling. The results of LDA topic modeling on news headlines taken from www.detik.com over 8 months (March-October 2020), during the COVID-19 pandemic, showed that the best number of topics produced each month was 3, dominated by news topics about corona cases, positive corona, positive COVID, and COVID-19, with an accuracy of 0.824 (82.4%). The resulting precision and recall values are identical, which is ideal for an information retrieval system.
2012
The media today bombards us with massive amounts of news about events ranging from the mundane to the memorable. This growing cacophony places an ever greater premium on being able to identify significant stories and to capture their salient features. In this paper, we consider the problem of mining on-line news over a certain period to identify the major stories of that time. Major stories are defined as those that were widely reported, persisted for a significant duration, or had a lasting influence on subsequent stories. Recently, some statistical methods have been proposed to extract important information from large corpora, but most of them do not consider the full richness of language or variations in its use across multiple reporting sources. We propose a method to extract major stories from large news corpora using a combination of Latent Dirichlet Allocation and n-gram analysis.
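The n-gram side of such an analysis reduces to counting frequent word sequences across the corpus; the snippet below is a minimal stand-in for that step (combining the resulting phrases with LDA topics is left to the full pipeline, and the toy corpus is made up).

```python
from collections import Counter

def top_ngrams(docs, n=2, k=3):
    """Return the k most frequent word n-grams across tokenized documents.

    `docs` is a list of token lists; n-grams are contiguous windows of
    n tokens within a single document.
    """
    counts = Counter()
    for tokens in docs:
        counts.update(tuple(tokens[i:i + n])
                      for i in range(len(tokens) - n + 1))
    return counts.most_common(k)
```

Frequent bigrams such as named events tend to surface immediately, which is what makes them useful labels for widely reported stories.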
Lecture Notes in Computer Science, 2022
Tracking news stories in documents is a way to deal with the large amount of information that surrounds us every day, to reduce the noise, and to detect emergent topics in news. Since the Covid-19 outbreak, the world has faced a new problem: the infodemic. News article titles are massively shared on social networks, and the analysis of trends and growing topics is complex. Grouping documents into news stories lowers the number of topics to analyse and the amount of information to ingest and/or evaluate. Our study proposes to analyse news tracking with the little information provided by titles on social networks. In this paper, we take advantage of datasets of public news article titles to experiment with news tracking algorithms on short messages. We evaluate clustering performance with a small amount of data per document. We deal with the document representation (sparse with TF-IDF and dense using Transformers [26]), its impact on the results, and why it is key to this type of work. We used a supervised algorithm proposed by Miranda et al. [22] and K-Means to provide evaluations for different use cases. We found that TF-IDF vectors are not always the best ones for grouping documents, and that the algorithms are sensitive to the type of representation. Knowing this, we recommend taking both aspects into account while tracking news stories in short messages. With this paper, we share all the source code and resources we used.
2010
Event tracking is the task of discovering temporal patterns of popular events from text streams. Existing approaches for event tracking have two limitations: scalability and inability to rule out non-relevant portions in text streams. In this study, we propose a novel approach to tackle these limitations. To demonstrate the approach, we track news events across a collection of weblogs spanning a two-month time period.
1999
The goal of Topic Detection and Tracking (TDT) is to develop automatic methods of identifying topically related stories within a stream of news media. We describe approaches for both detection and tracking based on the well-known idf-weighted cosine coefficient similarity metric. The surprising outcome of this research is that we achieved very competitive results for tracking using a very simple method of feature selection, without word stemming and without a score normalization scheme.
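The similarity metric at the heart of this approach is easy to state concretely: weight term counts by inverse document frequency and compare stories with cosine similarity, flagging a story as on-topic when it is close enough to a known sample. The sketch below illustrates that pipeline; the 0.2 threshold, the raw-count term weighting, and the toy corpus are illustrative assumptions, not values from the paper.

```python
import math
from collections import Counter

def idf_weights(corpus):
    """Inverse document frequency over a list of tokenized stories."""
    df = Counter()
    for doc in corpus:
        df.update(set(doc))  # count each term once per document
    n = len(corpus)
    return {t: math.log(n / df[t]) for t in df}

def tfidf_cosine(doc_a, doc_b, idf):
    """idf-weighted cosine coefficient between two tokenized stories."""
    va = {t: c * idf.get(t, 0.0) for t, c in Counter(doc_a).items()}
    vb = {t: c * idf.get(t, 0.0) for t, c in Counter(doc_b).items()}
    dot = sum(w * vb.get(t, 0.0) for t, w in va.items())
    na = math.sqrt(sum(w * w for w in va.values()))
    nb = math.sqrt(sum(w * w for w in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def tracks_topic(on_topic_samples, story, idf, threshold=0.2):
    """Flag a story as on-topic if it is close enough to any sample.
    The threshold is an arbitrary illustration, not a tuned value."""
    return any(tfidf_cosine(s, story, idf) >= threshold
               for s in on_topic_samples)
```

Note that no stemming or score normalization appears anywhere, matching the minimal setup the abstract describes.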
Procedia Computer Science, 2020
In the digital era, it is very important to detect and analyze the topics related to discussions occurring in social media, and to label visited web pages or documents. This information can be very helpful for personalization as well as user satisfaction. Various methods exist that study and process huge amounts of data to provide insights into user behavior. In this paper, we propose a filtering process that enhances topic detection and labelling. The latter aims to compact the result delivered by inferential algorithms such as Latent Dirichlet Allocation and the Dirichlet Mixture Model. Our filtering process relies on word dependency in each contextual use to deliver highly correlated labels. Indeed, we use Word2vec as well as N-grams to eliminate non-significant words in each topic. We also use the Hellinger distance to aggregate redundant words to the appropriate topic. Moreover, we eliminate unreliable topics according to a chosen metric. We combine this proposal with different topic-modeling algorithms. Experiments demonstrate the effectiveness of the association between the inferential model and our filtering process compared to the state of the art. We also use different textual datasets to validate our proposal.
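The Hellinger distance used in the aggregation step is a bounded distance between two discrete probability distributions; for topic models it is commonly applied to topic-word distributions. A minimal implementation, with distributions given as word-to-probability dicts:

```python
import math

def hellinger(p, q):
    """Hellinger distance between two discrete distributions given as
    dicts mapping words to probabilities.

    Ranges from 0 (identical distributions) to 1 (disjoint support),
    which makes it convenient for thresholding topic similarity.
    """
    support = set(p) | set(q)
    s = sum((math.sqrt(p.get(w, 0.0)) - math.sqrt(q.get(w, 0.0))) ** 2
            for w in support)
    return math.sqrt(s / 2.0)
```

Two topics whose word distributions have small Hellinger distance would be candidates for merging their redundant words, in the spirit of the aggregation described above.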
West African Journal of Industrial and Academic Research, 2012
Autonomous agents are software systems situated within, and part of, an environment; they sense that environment and act on it, over time, in pursuit of their own agenda, so as to affect what they sense in the future. Autonomous agents take action without user intervention and operate concurrently, either while the user is idle or while the user is taking other actions. The internet encompasses a large number of documents to which search engines try to provide access. Even for many narrow topics and potential information needs, there are often many web pages online, and the user of a web search engine would prefer the best pages to be returned. An autonomous intelligent agent topic tracker can make decisions on behalf of the user by narrowing the search domain and dramatically decreasing the human-computer interaction required. Previous information retrieval systems usually return a long list of results containing documents with low relevance to the user query. Thus, the goal of this paper is to build an Intelligent Agent Topic Tracking System that employs document concepts to track documents related to the researcher's needs as a publication topic develops. The system refines the user query, retrieves results from a search engine with the help of the Google API, and refines the noisy results using a document-document similarity model and a document component model to find topically similar documents in the document pool indexed by the search engines. In addition, a web structure analysis model uses the hub-and-authority algorithm to evaluate the importance of web pages and to determine their relatedness to a particular topic. Finally, clustering is used to automatically group the document pool into similar topics.