2002
In this paper, a new topic identification method, WSIM, is investigated. It exploits the similarity between words and topics; this measure is a function of word-to-word similarity based on mutual information. The performance of WSIM is compared to the cache model and to the well-known SVM classifier. Their behavior is also studied in terms of recall.
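The word-similarity idea above can be illustrated with pointwise mutual information (PMI) over document co-occurrence. The abstract does not give WSIM's exact formula, so this is only an assumed sketch; the function name is ours:

```python
import math
from collections import Counter
from itertools import combinations

def pmi_scores(documents):
    """Pointwise mutual information for word pairs, estimated from
    document-level co-occurrence: PMI(w1, w2) = log P(w1, w2) / (P(w1) P(w2))."""
    n_docs = len(documents)
    word_df = Counter()   # documents containing each word
    pair_df = Counter()   # documents containing each word pair
    for doc in documents:
        words = set(doc.split())
        word_df.update(words)
        pair_df.update(frozenset(p) for p in combinations(sorted(words), 2))
    scores = {}
    for pair, joint in pair_df.items():
        w1, w2 = tuple(pair)
        p_joint = joint / n_docs
        p1 = word_df[w1] / n_docs
        p2 = word_df[w2] / n_docs
        scores[pair] = math.log(p_joint / (p1 * p2))
    return scores
```

Positive PMI indicates words that co-occur more often than chance, which is the kind of word-to-word similarity signal the measure builds on.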
World Academy of Research in Science and Engineering, 2019
Topic identification is an area of data mining that finds common texts/themes across several documents. It is a data summarization technique that helps to summarize documents. This area is of great interest among researchers, as its real-world applications are very wide. This paper presents a review of topic identification techniques. Existing solutions include text clustering, the latent semantic approach, the probabilistic latent semantic approach, the latent Dirichlet allocation approach, association-rule-based approaches, document clustering, and soft computing approaches. Soft computing techniques, including fuzzy logic, neural networks, support vector machines, ant colony optimization, swarm optimization, and their hybrids, provide a good solution for text clustering. This paper presents a comparative study of different text mining techniques with their strengths and weaknesses. A future direction is also proposed: developing a hybrid approach for topic identification that combines different techniques.
2001
This paper presents several statistical methods for topic identification on two kinds of textual data: newspaper articles and e-mails. Five methods are tested on these two corpora: topic unigrams, cache model, TFIDF classifier, topic perplexity, and weighted model. Our work aims to study these methods by confronting them with very different data. This study proved very fruitful for our research: statistical topic identification methods depend not only on a corpus but also on its type. One of the methods achieves a high topic identification rate on a general newspaper corpus but a markedly lower one on the e-mail corpus. Another method gives the best result on e-mails but does not behave the same way on a newspaper corpus. We also show in this paper that almost all our methods achieve good results in retrieving the first two manually annotated labels.
Annual Conference of the International Speech Communication Association, 2004
This paper describes the constrained minimization approach to combining multiple classifiers in order to improve classification accuracy. Since the errors of individual classifiers in the ensemble should be largely uncorrelated to yield higher classification accuracy, we propose a combination strategy in which the combined classifier's accuracy is a function of the correlation between the classification errors of the individual classifiers. To obtain powerful single classifiers, different techniques are investigated, including support vector machines and the latent semantic indexing (LSI) matrix, a popular vector-space model. We also investigate discriminative training (DT) of the LSI matrix within the constrained minimization approach. DT minimizes the classification error by increasing the score separation of the correct document from competing documents. Experimental evaluation is carried out on a banking call-routing database and on the Switchboard database, with sets of 23 and 67 topics respectively. Results show that the combined classifier we propose outperforms the accuracy of the individual baseline classifiers by 44%.
2007
Recent studies on automatic new topic identification in Web search engine user sessions demonstrated that learning algorithms such as neural networks and regression have been fairly successful at automatic new topic identification. In this study, we investigate whether another learning algorithm, Support Vector Machines (SVM), is successful in identifying topic shifts and continuations. Sample data logs from the Norwegian search engine FAST (currently owned by Overture) and from Excite are used in this study. The findings suggest that support vector machines' performance depends on the characteristics of the dataset to which they are applied.
In this paper we present two well-known methods for topic identification. The first is a TFIDF classifier; the second is a machine-learning approach called Support Vector Machines (SVM). To our knowledge, few works exist on Arabic topic identification, which is why we decided to investigate it in this article. The corpus we used is extracted from the daily Arabic newspaper 'Akhbar Al Khaleej'; it includes 5120 news articles, corresponding to 2,855,069 words, covering four topics: sport, local news, international news, and economy. According to our experiments, the results are encouraging for both the SVM and the TFIDF classifier; however, we noticed the superiority of the SVM classifier and its high capability to distinguish topics.
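The TFIDF-classifier method above can be sketched as follows: topics are represented by TF-IDF vectors and a document is assigned to the most cosine-similar topic. The representation details (smoothing, normalization) are illustrative assumptions, not the paper's exact setup:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors (dicts) for a list of tokenized documents."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    idf = {w: math.log(n / df[w]) + 1.0 for w in df}  # +1 keeps ubiquitous words nonzero
    vecs = []
    for doc in docs:
        tf = Counter(doc)
        vecs.append({w: tf[w] / len(doc) * idf[w] for w in tf})
    return vecs

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(u[w] * v.get(w, 0.0) for w in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def classify(doc_vec, topic_vecs):
    """Assign the topic whose reference vector is most similar to the document."""
    return max(topic_vecs, key=lambda t: cosine(doc_vec, topic_vecs[t]))
```

In practice each topic vector would be built from all training articles of that topic (e.g. a centroid); here a single representative document per topic is enough to show the mechanics.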
2017
Document clustering is a technique that groups documents with similar content from a collection. It can be extended further to extract the topics of each group. Document clustering and topic identification form the backbone of information retrieval, but the size of the documents to be grouped, in terms of number of words, affects these processes negatively. The sparsity of terms in big documents adversely impacts the weight of individual terms and, in turn, the quality of the clusters. This paper presents an application of cluster analysis, for topic identification, to one collection of small documents and one collection of big documents. Results are presented as comparisons to emphasize the concerns with respect to big documents.
INTERNATIONAL JOURNAL OF RECENT TRENDS IN ENGINEERING & RESEARCH, 2019
We present a topic identification system for news, based upon an evaluation of the similarity between topics and a large number of documents in a news database. Our system is able to provide topics for every news sample. The system implements and compares two topic models, Latent Dirichlet Allocation (LDA) and Latent Semantic Analysis (LSA), on a news database containing eleven thousand documents. The behaviour of the topic models has been examined on the basis of standard metrics: accuracy and the implementation speed of the algorithms.
The paper demonstrates how information on text structure can be used to improve performance in identifying topical words in texts, based on a probabilistic model of text categorization. We use texts which are not explicitly structured. A text structure is identified by measuring the similarity between the segments comprising the text and its title. It is shown that a text structure thus identified gives a good clue for finding the parts of the text most relevant to its content. The significance of exploiting structural information for topic identification is demonstrated by a set of experiments conducted on 19 MB of Japanese newspaper articles. The paper also brings concepts from Rhetorical Structure Theory (RST) into the statistical analysis of text structure. Finally, it is shown that information on text structure is more effective for large documents than for small ones.
Procedia Computer Science, 2020
In the digital era, it is very important to detect and analyze the topics related to discussions occurring on social media, and to label visited web pages or documents. This information can be very helpful for personalization as well as user satisfaction. Various methods deal with huge amounts of data to provide insights into user behavior. In this paper, we propose a filtering process that enhances topic detection and labelling. It aims to compact the result delivered by inferential algorithms such as Latent Dirichlet Allocation and the Dirichlet Mixture Model. Our filtering process relies on word dependency in each contextual use to deliver highly correlated labels. Indeed, we use Word2vec as well as N-grams to eliminate non-significant words in each topic. We also use the Hellinger distance to aggregate redundant words to the appropriate topic. Besides, we eliminate non-reliable topics according to a quality metric. We associate this proposal with different topic-modeling algorithms. Experiments demonstrate the effectiveness of the association between an inferential model and our filtering process compared to the state of the art. We also use different textual data to validate our proposal.
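The Hellinger distance mentioned above has a simple closed form over discrete distributions. A minimal sketch, assuming topics are given as word-probability dicts (the paper's exact representation is not stated):

```python
import math

def hellinger(p, q):
    """Hellinger distance between two discrete distributions given as
    dicts mapping word -> probability. Ranges from 0 (identical) to 1 (disjoint)."""
    words = set(p) | set(q)
    s = sum((math.sqrt(p.get(w, 0.0)) - math.sqrt(q.get(w, 0.0))) ** 2 for w in words)
    return math.sqrt(s) / math.sqrt(2)
```

A small distance between two topic word distributions signals redundancy, which is the criterion used to aggregate redundant words into the appropriate topic.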
2013
Using conditional probabilities for automatic new topic identification is an efficient approach compared to other studies of new topic identification, owing to its significant performance as well as its relatively easy implementation. In this paper, we analyze the conditional-probabilities approach for automatic new topic identification and extend it by considering the position of a query, namely the query number, as an input for the computation of conditional probabilities, besides the other, most commonly used parameters (time interval and search pattern). Specifically, we consider four different settings in which these three parameters, as well as their 2-combinations, are used for the conditional probability computations. A performance analysis of the approach under these settings is also presented.
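The counting behind the conditional-probability approach can be sketched as follows, with a generic feature tuple standing in for any combination of time interval, search pattern, and query number; the record format and discretization are our assumptions:

```python
from collections import Counter

def shift_probabilities(records):
    """Estimate P(topic shift | feature tuple) from labeled query transitions.
    Each record is (features, is_shift): features is a hashable tuple such as
    (time_interval_class, search_pattern, query_number)."""
    totals, shifts = Counter(), Counter()
    for features, is_shift in records:
        totals[features] += 1
        if is_shift:
            shifts[features] += 1
    return {f: shifts[f] / totals[f] for f in totals}
```

At prediction time, a transition whose feature tuple has an estimated probability above a threshold is flagged as a topic shift; dropping a component from the tuple yields the 2-combination settings the paper compares.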
2002
This paper focuses on dynamic topic identification for adaptive statistical language modeling in automatic speech recognition (ASR). It proposes a mathematically more correct solution to the cache model presented in [1], which constitutes both a simplification and an improvement of the model. Moreover, an original solution is put forward to overcome some limitations of this model. This new solution takes into account the underlying semantic concepts of the text by introducing word triggers into the cache memory. The identification accuracy is assessed on a newspaper corpus, highlighting a small increase in identification rate: in practice, the original cache model achieves an identification rate of 79.5%, and making use of the triggers improves the identification rate by 1.2%. This study is thus devoted to specifying a possible direction for performing dynamic topic identification well in ASR.
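The cache-model idea above (scoring topics against the most recent words) can be sketched as follows. The per-topic unigram representation, the `cache_size` parameter, and the probability floor are illustrative assumptions, not the paper's exact formulation:

```python
import math
from collections import deque

def cache_topic_scores(history, topic_unigrams, cache_size=200, floor=1e-6):
    """Score each topic by the log-probability its unigram model assigns to the
    words currently in the cache (the most recent words of the recognized text).
    Unknown words get a small floor probability. The triggers extension would
    additionally insert words semantically triggered by the cache contents."""
    cache = deque(history[-cache_size:], maxlen=cache_size)
    return {
        topic: sum(math.log(model.get(w, floor)) for w in cache)
        for topic, model in topic_unigrams.items()
    }
```

The dynamically identified top-scoring topic can then drive language-model adaptation in the ASR decoder.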
International journal of engineering research and technology, 2021
Every day, large quantities of data are collected. As more information becomes available, accessing what we are seeking becomes challenging. We therefore require processes and techniques for organizing, searching, and understanding massive amounts of information. The task of topic modeling is to analyze whole documents to learn the meaningful patterns that exist in them. It is an unsupervised strategy used to identify and monitor words in clusters of texts (known as the "topics"). Through the use of topic analysis models, companies can offload tasks to machines rather than burden employees with too much data. In this paper, we use word embeddings for topic modelling to learn meaningful patterns of words, and k-means clustering to group the words that belong together. We create nine clusters of words from a headline dataset. One application of topic modeling, i.e., sentiment analysis using the VADER algorithm, is also demonstrated in this paper.
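The clustering step described above can be sketched with a plain Euclidean k-means over word-embedding vectors. The paper's exact embedding, distance, and initialization are not specified, so this is only a minimal sketch:

```python
import random

def kmeans(vectors, k, iters=50, seed=0):
    """Plain k-means over embedding vectors (lists of floats).
    Returns a list of cluster indices, one per input vector."""
    rnd = random.Random(seed)
    centroids = [list(v) for v in rnd.sample(vectors, k)]
    assign = [0] * len(vectors)
    for _ in range(iters):
        # assignment step: nearest centroid by squared Euclidean distance
        assign = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centroids[c])))
            for v in vectors
        ]
        # update step: move each centroid to the mean of its members
        for c in range(k):
            members = [v for v, a in zip(vectors, assign) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assign
```

With Word2vec-style embeddings as input and k=9, the resulting clusters correspond to the nine word groups ("topics") the paper extracts from the headline dataset.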
This paper studies topic identification for the Arabic language using two methods. The first is the well-known kNN (k Nearest Neighbors), used as a baseline. The second is the TR-Classifier, based mainly on computing triggers. The experiments show that the TR-Classifier has the advantage of giving better performance than kNN while using much smaller topic vocabularies. TR-Classifier performance is enhanced by jointly increasing the number of triggers and the size of the topic vocabularies. It should be noted that topic vocabularies are used by the TR-Classifier, whereas a general vocabulary, obtained by concatenating those used by the TR-Classifier, is needed for kNN. In addition to the standard Recall and Precision measures used in the evaluation step, we have drawn ROC curves for some topics to illustrate more clearly the difference in performance between the two classifiers. The corpus used in our experiments is downloaded from an ...
2009 IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology, 2009
A software system for topic extraction and automatic document classification is presented. Given a set of documents, the system automatically extracts the mentioned topics and assists the user to select their optimal number. The user-validated topics are exploited to build a model for multi-label document classification. While topic extraction is performed by using an optimized implementation of the Latent Dirichlet Allocation model, multi-label document classification is performed by using a specialized version of the Multi-Net Naive Bayes model. The performance of the system is investigated by using 10,056 documents retrieved from the WEB through a set of queries formed by exploiting the Italian Google Directory. This dataset is used for topic extraction while an independent dataset, consisting of 1,012 elements labeled by humans, is used to evaluate the performance of the Multi-Net Naive Bayes model. The results are satisfactory, with precision being consistently better than recall for the labels associated with the four most frequent topics.
Computational Linguistics and Intelligent …, 2010
This paper proposes a method for using an ontology hierarchy in automatic topic identification. The fundamental idea behind this work is to exploit the hierarchical structure of an ontology in order to find the topic of a text. Keywords extracted from a given text are mapped onto their corresponding concepts in the ontology. By optimizing over the corresponding concepts, we pick the single node among the concept nodes that we believe is the topic of the target text. However, a limited-vocabulary problem is encountered while mapping the keywords onto their corresponding concepts. This forces us to extend the ontology by enriching each of its concepts with new concepts from an external linguistic knowledge base (WordNet). Our intuition is that, with a high number of keywords mapped onto ontology concepts, our topic identification technique can perform at its best.
Suresh Kumar Sharma, Kanchan Jain and Gurpreet Singh Bawa, 2022
In this paper, the concept of document models is discussed with respect to the Bernoulli document approach, which is based on the presence or absence of the primary building blocks of documents, namely tokens. The research primarily deals with how an unstructured dataset consisting of text documents is converted into structured content on a mathematical and statistical foundation, and the topic of conversation is then predicted (or estimated) under Bernoulli assumptions. The application of the Naïve Bayes approach is discussed for the model under consideration. Examples and sample code snippets in R and Python are included for the Bernoulli document model.
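The paper includes its own R and Python snippets; as an independent illustration, here is a minimal Bernoulli Naïve Bayes sketch, assuming Laplace smoothing and uniform topic priors. It uses only the presence or absence of vocabulary tokens, as the Bernoulli document model prescribes:

```python
import math

def train_bernoulli_nb(docs_by_topic, vocab):
    """Per topic, estimate P(word present) with Laplace (add-one) smoothing.
    Documents are sets of tokens, reflecting the presence/absence view."""
    model = {}
    for topic, docs in docs_by_topic.items():
        n = len(docs)
        present = {w: sum(1 for d in docs if w in d) for w in vocab}
        model[topic] = {w: (present[w] + 1) / (n + 2) for w in vocab}
    return model

def predict(model, doc, vocab):
    """Score each topic using both presence and absence of every vocab word,
    which is what distinguishes the Bernoulli model from the multinomial one."""
    words = set(doc)
    best, best_score = None, -math.inf
    for topic, probs in model.items():
        score = sum(
            math.log(probs[w]) if w in words else math.log(1.0 - probs[w])
            for w in vocab
        )
        if score > best_score:
            best, best_score = topic, score
    return best
```

Note that absent words contribute log(1 − P(present)) terms, so a short document is not automatically penalized; this is the defining trait of the Bernoulli variant.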
Lecture Notes in Computer Science, 2003
Topics in 0-1 datasets are sets of variables whose occurrences are positively connected together. Earlier, we described a simple generative topic model. In this paper we show that, given data produced by this model, the lift statistics of attributes can be described in matrix form. We use this result to obtain a simple algorithm for finding topics in 0-1 data. We also show that a problem related to the identification of topics is NP-hard. We give experimental results on the topic identification problem, both on generated and real data.
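The lift statistic the paper builds on can be computed directly from a 0-1 dataset. A minimal sketch; the row format and function name are ours:

```python
def lift(data, a, b):
    """Lift of attributes a and b in a 0-1 dataset (a list of rows, each a
    dict mapping attribute -> 0 or 1):
        lift(a, b) = P(a=1, b=1) / (P(a=1) * P(b=1)).
    Lift > 1 suggests the attributes are positively connected,
    which is the defining property of a topic in this setting."""
    n = len(data)
    pa = sum(row[a] for row in data) / n
    pb = sum(row[b] for row in data) / n
    pab = sum(row[a] * row[b] for row in data) / n
    return pab / (pa * pb)
```

Collecting these pairwise values into a matrix gives the lift structure that the paper's topic-finding algorithm analyzes.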
We present a novel way to identify the representative words that are able to capture the topic of documents for use in text categorization. Our intuition is that not all word n-grams equally represent the topic of a document, and thus using all of them can potentially dilute the feature space. Hence, our aim is to investigate methods for identifying good indexing words, and empirically evaluate their impact on text categorization. To this end, we experiment with five different word sub-spaces: title words, first sentence words, keyphrases, domain-specific words, and named entities. We also test TF•IDF-based unsupervised methods for extracting keyphrases and domain-specific words, and empirically verify their feasibility for text categorization. We demonstrate that using representative words outperforms a simple 1-gram model.
Latent Dirichlet allocation (LDA) and other related topic models are increasingly popular tools for summarization and manifold discovery in discrete data. In the existing system, a novel information filtering model, the Maximum matched Pattern-based Topic Model (MPBTM), is used. Each topic is represented by patterns, which are generated from topic models and organized in terms of their statistical and taxonomic features; the most discriminative and representative patterns, called Maximum Matched Patterns, are proposed to estimate document relevance to the user's information needs in order to filter out irrelevant documents. The Maximum Matched Patterns, which are the largest patterns in each equivalence class that exist in the received documents, are used to calculate the relevant words that represent topics. However, LDA does not capture correlations between topics and does not find the hidden topics in a document. To deal with this problem, the Pachinko Allocation Model (PAM) is proposed. Topic models are a suite of algorithms for uncovering the hidden thematic structure of a collection of documents. PAM improves upon earlier topic models such as LDA by modeling correlations between topics in addition to the word correlations which constitute topics. With this method, the most accurate topics are assigned to each document. PAM provides more flexibility and greater expressive power than latent Dirichlet allocation.
Proceedings of the Thirteenth …, 2009
As in previous years, CoNLL-2009 has a shared task, Syntactic and Semantic Dependencies in Multiple Languages. This is an extension of the CoNLL-2008 shared task to multiple languages (English plus Catalan, Chinese, Czech, German, Japanese and Spanish). Among the new features are compatible evaluation for several languages and their comparison, and learning curves for languages with large datasets. We expect that this major comparative exercise will lead to very enlightening results and discussion that will serve to move the field forward. The Shared Task papers are collected into an accompanying volume of CoNLL-2009. We thank Jan Hajic and the rest of the organizers for their great effort in running the Shared Task.