Clustering is a branch of data mining that groups similar data into collections known as clusters. It can be applied in many fields, and one of its important applications is intelligent text clustering. Traditional text clustering algorithms grouped documents based on keyword matching, meaning that documents were clustered without any descriptive notion of their content; as a result, dissimilar documents could end up in the same cluster. The key solution to this problem is to cluster documents based on semantic similarity, so that documents are grouped by meaning rather than by keywords. In this research, fifty papers that use semantic similarity in different fields have been reviewed, and thirteen of them, published in the last five years and applying semantic similarity to document clustering, have been selected for a deep study. A comprehensive literature review of all the selected papers is given, along with a comparison of their algorithms, tools, and evaluation methods. Finally, an in-depth discussion comparing the works is presented.
Clustering is a useful technique that organizes a large quantity of unordered text documents into a small number of meaningful and coherent clusters. Measuring the similarity or distinctness of two documents is not always a clear-cut problem and depends on the topical affiliation of the documents. For example, when clustering research papers, two documents are regarded as similar if they share similar topics; when clustering is employed on web sites, we are usually more interested in grouping the component pages according to the type of information presented on each page. A variety of similarity or distance measures have been proposed and widely applied, such as cosine similarity, the Pearson correlation coefficient, and Euclidean distance. This paper deals with semantic clustering of text documents written in the Serbian language. The aim is to prepare documents of different formats for clustering, to find keywords in the document set, to cluster the documents based on those keywords, and to find the most appropriate document for a given question.
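The measures named above are standard and easy to state concretely. As a minimal illustration (not taken from the paper itself), cosine similarity and Euclidean distance between two term-frequency vectors can be computed as follows:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length term vectors; 1.0 for
    parallel vectors, 0.0 for orthogonal (no shared terms)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length term vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```

Note that cosine similarity ignores document length (scaling a vector does not change it), which is why it is the usual choice for text, whereas Euclidean distance is length-sensitive.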
Precision = (No. of correctly retrieved documents) / (No. of retrieved documents)
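Together with recall and their harmonic mean (the F-measure used for evaluation throughout the surveyed papers), this formula can be sketched directly:

```python
def precision(n_correct, n_retrieved):
    """Fraction of retrieved documents that are correct."""
    return n_correct / n_retrieved if n_retrieved else 0.0

def recall(n_correct, n_relevant):
    """Fraction of relevant documents that were retrieved."""
    return n_correct / n_relevant if n_relevant else 0.0

def f_measure(p, r):
    """Harmonic mean of precision and recall (F1)."""
    return 2 * p * r / (p + r) if (p + r) else 0.0
```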
2017
The proposed document clustering approach has many practical applications, and there are several possible extensions to this work. One direction is to apply the technique to a specific application area, along with application-specific optimizations, to see the outcome; for example, web search results can be clustered using this approach, and snippets can be generated for each cluster to assess their quality. In the proposed approach, each term, whether from a lexical chain or from topic maps, has an equal effect on the similarity calculation for a pair of documents. Another possible direction is to introduce discriminative feature weighting for the features in this approach, which has shown encouraging results for both text clustering and classification tasks.
Indonesian Journal of Electrical Engineering and Computer Science, 2021
Semantic similarity is the process of identifying semantically relevant data. The traditional way of identifying document similarity relies on synonymous keywords and syntax, whereas semantic similarity finds similar data using the meaning of words. Clustering is the concept of grouping objects that share the same features and properties into a cluster, separate from objects with different features and properties. In semantic document clustering, documents are clustered using semantic similarity techniques together with similarity measurements. One common family of techniques for clustering documents is the density-based algorithms, which use the density of data points as the main strategy for measuring the similarity between them. In this paper, a state-of-the-art survey is presented that analyzes density-based algorithms for clustering documents. Furthermore, the similarity and evaluation measures used with the selected algorithms are investigated to identify the most common ones. The review reveals that the most used density-based algorithms in document clustering are DBSCAN and DPC, and that the most effective similarity measure used with them is cosine similarity, with F-measure for performance and accuracy evaluation.
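To make the DBSCAN-plus-cosine pairing concrete, here is a minimal, self-contained DBSCAN sketch over cosine distance (1 − cosine similarity); the parameter names `eps` and `min_pts` follow standard DBSCAN terminology, and this is an illustration, not the survey's implementation:

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity: 0 for parallel vectors, 1 for orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - (dot / (na * nb) if na and nb else 0.0)

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN: returns a label per point, -1 for noise."""
    n = len(points)
    labels = [None] * n
    cluster = -1

    def neighbours(i):
        return [j for j in range(n)
                if cosine_distance(points[i], points[j]) <= eps]

    for i in range(n):
        if labels[i] is not None:
            continue
        nbrs = neighbours(i)
        if len(nbrs) < min_pts:          # not dense enough: mark as noise
            labels[i] = -1
            continue
        cluster += 1                     # start a new cluster from this core point
        labels[i] = cluster
        queue = [j for j in nbrs if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:          # noise becomes a border point
                labels[j] = cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_nbrs = neighbours(j)
            if len(j_nbrs) >= min_pts:   # expand only from core points
                queue.extend(j_nbrs)
    return labels
```

Because cosine distance depends only on vector direction, documents about the same topic but of different lengths still fall in the same dense region.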
Knowledge and Information Systems, 2011
Document clustering algorithms usually use vector space model (VSM) as their underlying model for document representation. VSM assumes that terms are independent and accordingly ignores any semantic relations between them. This results in mapping documents to a space where the proximity between document vectors does not reflect their true semantic similarity. This paper proposes new models for document representation that capture semantic similarity between documents based on measures of correlations between their terms. The paper uses the proposed models to enhance the effectiveness of different algorithms for document clustering. The proposed representation models define a corpus-specific semantic similarity by estimating measures of term-term correlations from the documents to be clustered. The corpus of documents accordingly defines a context in which semantic similarity is calculated. Experiments have been conducted on thirteen benchmark data sets to empirically evaluate the effectiveness of the proposed models and compare them to VSM and other well-known models for capturing semantic similarity.
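One simple way to realize the idea of term-term correlations — a generalized-VSM-style sketch, not the paper's exact models — is to estimate a correlation matrix C from term co-occurrence across the corpus and score documents by d₁ᵀCd₂ with matching normalization; the choice of cosine between term columns as the correlation estimate is an assumption for illustration:

```python
import math

def _cos(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def term_correlations(doc_term_rows):
    """Corpus-specific term-term correlation matrix C: cosine between the
    occurrence columns of each pair of terms (illustrative estimator)."""
    n_terms = len(doc_term_rows[0])
    cols = [[row[t] for row in doc_term_rows] for t in range(n_terms)]
    return [[_cos(cols[i], cols[j]) for j in range(n_terms)]
            for i in range(n_terms)]

def semantic_similarity(d1, d2, C):
    """sim(d1, d2) = d1' C d2 / sqrt(d1' C d1) sqrt(d2' C d2); with
    C = identity this reduces to plain VSM cosine similarity."""
    n = len(d1)
    num = sum(d1[i] * C[i][j] * d2[j] for i in range(n) for j in range(n))
    n1 = math.sqrt(sum(d1[i] * C[i][j] * d1[j] for i in range(n) for j in range(n)))
    n2 = math.sqrt(sum(d2[i] * C[i][j] * d2[j] for i in range(n) for j in range(n)))
    return num / (n1 * n2) if n1 and n2 else 0.0
```

With correlated terms, two documents sharing no words but using related vocabulary still receive a nonzero similarity, which is exactly the failure mode of plain VSM that the paper targets.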
The use of textual documents is growing rapidly across the internet, email, web pages, reports, journals, and articles, and these documents are stored in electronic database formats. It is challenging to find and access them without proper classification mechanisms. To overcome such difficulties, we propose and develop a semantic document clustering model. Document pre-processing steps and semantic information from WordNet help make the semantic relations in raw text available. Bearing in mind the limitations of traditional clustering algorithms on natural language, we adopt semantic clustering via COBWEB conceptual clustering. Clustering quality and high accuracy were among the most important aims of our research, and we chose F-measure evaluation to ensure the purity of the clustering. However, many challenges still exist, such as high dimensionality, extracting the core semantics from texts, and assigning adequate descriptions to the generated clusters; with the help of the WordNet database, we address these issues. This paper presents the proposed framework and describes our development and evaluation.
Artificial Intelligence and Soft Computing, 2010
Traditionally, clustering was applied to numerical and categorical information. However, textual information is acquiring increasing importance with the appearance of methods for textual data mining. This paper proposes the use of classical clustering algorithms with a mixed function that combines numerical, categorical, and semantic features, where the content of the semantic features is extracted from textual data. As the semantic features must be compared using a semantic similarity function, and several such measures have been developed, this paper analyses and compares the behavior of some of them using WordNet as a background ontology. The different partitions obtained are compared to human classifications in order to see which one better approximates human reasoning. Moreover, the interpretability of the clusters obtained is discussed. The results show that the similarity measures that provide better results on a standard benchmark also provide better and more interpretable partitions.
Computing the similarity between documents is an important operation in text processing. In this paper, a new similarity measure is proposed. To calculate the similarity between two documents with respect to a feature, the proposed measure takes the following three cases into account: I) the same feature appears in both documents, II)
IEEE Xplore, 2022
The Internet's continued growth has resulted in a significant rise in the number of electronic text documents, and grouping these materials into meaningful collections has become crucial. The old approach to document grouping, based on statistical characteristics and categorization, relied on syntactic rather than semantic information. This article introduces a new approach for clustering texts based on their semantic similarity, using an efficient graph-based technique. Document summaries, called synopses, are extracted from the Wikipedia and IMDB databases, and the downloaded documents are grouped after important preprocessing steps, carried out with the NLTK toolkit, make them more convenient to use. Following that, a vector space is modelled using TF-IDF and converted into a numeric TF-IDF matrix, and clustering is accomplished using spectral methods. The results are compared with previous work.
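The TF-IDF step described above is standard and can be sketched in a few lines; this is a minimal illustration of the weighting scheme (term frequency × inverse document frequency), not the article's pipeline, and the spectral clustering stage would then run on the resulting matrix:

```python
import math
from collections import Counter

def tfidf_matrix(docs):
    """docs: list of token lists. Returns (vocab, rows), where rows[d][t]
    is the tf-idf weight of vocab[t] in document d."""
    n = len(docs)
    df = Counter()                       # document frequency per term
    for doc in docs:
        df.update(set(doc))
    vocab = sorted(df)
    idf = {t: math.log(n / df[t]) for t in vocab}
    rows = []
    for doc in docs:
        tf = Counter(doc)
        rows.append([tf[t] / len(doc) * idf[t] for t in vocab])
    return vocab, rows
```

A term that appears in every document gets idf = log(1) = 0 and thus carries no weight, which is the intended down-weighting of uninformative vocabulary.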
International Journal of Advanced Computer Science and Applications, 2016
Document clustering is an unsupervised machine learning method that separates a large, subject-heterogeneous collection (corpus) into smaller, more manageable, subject-homogeneous collections (clusters). Traditional document clustering works by extracting textual features such as terms, sequences, and phrases from documents. These features are independent of each other and do not capture the meaning behind the words in the clustering process. In order to perform semantically viable clustering, we believe the problem of document clustering has two main components: (1) to represent the document in a form that inherently captures the semantics of the text, which may also help reduce the dimensionality of the document, and (2) to define a similarity measure based on lexical, syntactic, and semantic features such that it assigns higher numerical values to document pairs with stronger syntactic and semantic relationships. In this paper, we propose a representation of a document based on three types of features extracted from it: lexical α, syntactic β, and semantic γ. A meta-descriptor for each document is built from these three features, in the order lexical, then syntactic, then semantic. A document-to-document similarity matrix is produced in which each entry contains a three-value vector for lexical α, syntactic β, and semantic γ similarity. The main contributions of this research are: (i) a document-level descriptor using three different textual features (lexical, syntactic, and semantic); (ii) a similarity function using these three features; and (iii) a new candidate clustering algorithm that uses the three components of the similarity measure to guide the clustering process toward more semantically rich clusters.
We performed an extensive series of experiments on standard text mining data sets with external clustering evaluation measures such as F-measure and purity, and obtained encouraging results.
In the age of information technology, textual documents are growing rapidly, both online and offline. These documents range from product information to company profiles, and many sources generate valuable textual information in medical reports, economic analyses, scientific journals, news, blogs, etc. Maintaining and accessing these documents is very difficult without proper classification, a problem that can be overcome by proper document classification. However, only a few documents come pre-classified; the rest must be handled in an unsupervised manner, and in this context clustering is the natural solution. Traditional clustering and textual clustering differ in some respects: relations between words are very important when clustering text, and semantic clustering has proven to be the more appropriate clustering technique for it. This review paper provides an overview of clustering through to semantic document clustering techniques, together with the advantages and disadvantages of various clustering methods.
A major computational burden, while performing document clustering, is the calculation of the similarity measure between a pair of documents. A similarity measure is a function that assigns a real number between 0 and 1 to a pair of documents, depending upon the degree of similarity between them. A value of zero means that the documents are completely dissimilar, whereas a value of one indicates that the documents are practically identical. Traditionally, vector-based models have been used for computing document similarity. Vector-based models represent the features present in documents, but these approaches to similarity, in general, cannot account for the semantics of the document. Documents written in human languages contain contexts, and the words used to describe these contexts are generally semantically related. Motivated by this fact, many researchers have proposed semantic-based similarity measures by utilizing text annotation through external thesauruses like Word...
International Journal of Computer Applications, 2012
Document clustering, one of the traditional data mining techniques, is an unsupervised learning paradigm in which clustering methods try to identify inherent groupings of text documents, producing a set of clusters with high intra-cluster similarity and low inter-cluster similarity. The importance of document clustering stems from the massive volumes of textual documents created. Although numerous document clustering methods have been studied extensively over the years, several challenges remain for increasing clustering quality. In particular, most current document clustering algorithms do not consider semantic relationships, which produces unsatisfactory clustering results. Over the last three to four years, efforts have been made to apply semantics to document clustering. Here, an exhaustive and detailed review of more than thirty semantics-driven document clustering methods is presented. After an introduction to document clustering and its basic requirements for improvement, traditional algorithms are overviewed and semantic similarity measures are explained. The article then discusses algorithms that make a semantic interpretation of documents for clustering. The semantic approach applied, the datasets used, the evaluation parameters, and the limitations and future work of all these approaches are presented in tabular format for easy and quick interpretation.
IEEE Transactions on Knowledge and Data Engineering, 2000
Phrase has been considered as a more informative feature term for improving the effectiveness of document clustering. In this paper, we propose a phrase-based document similarity to compute the pairwise similarities of documents based on the Suffix Tree Document (STD) model. By mapping each node in the suffix tree of STD model into a unique feature term in the Vector Space Document (VSD) model, the phrase-based document similarity naturally inherits the term tf-idf weighting scheme in computing the document similarity with phrases. We apply the phrase-based document similarity to the group-average Hierarchical Agglomerative Clustering (HAC) algorithm and develop a new document clustering approach. Our evaluation experiments indicate that the new clustering approach is very effective on clustering the documents of two standard document benchmark corpora OHSUMED and RCV1. The quality of the clustering results significantly surpasses the results of traditional single-word tf-idf similarity measure in the same HAC algorithm, especially in large document data sets. Furthermore, by studying the property of STD model, we conclude that the feature vector of phrase terms in the STD model can be considered as an expanded feature vector of the traditional single-word terms in the VSD model. This conclusion sufficiently explains why the phrase-based document similarity works much better than the single-word tf-idf similarity measure.
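The group-average HAC stage used above is independent of how the pairwise similarities were produced (phrase-based or otherwise). As a minimal sketch of that stage alone — assuming a precomputed similarity matrix rather than the paper's suffix-tree model — repeatedly merge the pair of clusters with the highest average inter-cluster similarity:

```python
def group_average_hac(sims, k):
    """Agglomerative clustering over a precomputed similarity matrix.
    sims[i][j] in [0, 1]; merges the most similar pair of clusters
    (by average pairwise similarity) until k clusters remain."""
    clusters = [[i] for i in range(len(sims))]

    def avg_sim(a, b):
        return sum(sims[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > k:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                s = avg_sim(clusters[x], clusters[y])
                if best is None or s > best[0]:
                    best = (s, x, y)
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]   # merge y into x
        del clusters[y]
    return clusters
```

Group-average linkage sits between single-link (prone to chaining) and complete-link (prone to over-tight clusters), which is why it is the usual choice for documents.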
2010
Different document representation models have been proposed to measure semantic similarity between documents using corpus statistics. Some of these models explicitly estimate semantic similarity based on measures of correlations between terms, while others apply dimension reduction techniques to obtain latent representation of concepts. This paper proposes new hybrid models that combine explicit and latent analysis to estimate semantic similarity between documents. The proposed models have been used to enhance the performance of document clustering algorithms. Experiments on thirteen benchmark data sets show that hybrid models achieve significant improvement in clustering performance when used with clustering algorithms that are sensitive to errors in estimating document similarity.
2013
Semantic similarity is a way of analyzing the degree of synonymy that exists between word pairs, a measure necessary to detect the degree of relationship within them. To compute the semantic similarity between a word pair, clustering and classification augmented with semantic similarity (CCASS) was developed. CCASS is a novel method that uses page counts and text snippets returned by a search engine. Several similarity measures are defined using the page counts of word pairs, and lexical pattern clustering is applied to the text snippets obtained from the search engine. These are fed to a support vector machine (SVM), which computes the semantic similarity between the word pairs. Based on the value obtained from the SVM, the Simple KMeans clustering algorithm is used to form clusters. New word pairs can be classified after computation of their semantic similarity measure; if a pair does not match the existing clusters, a new cluster may be ...
International Journal of Computer Applications, 2012
With the increasing number of documents available in local or Web repositories, comparison methods have to analyze large document sets with different types and terminologies to produce a response with a minimum number of documents and as much useful content as possible for the user. For large document sets in which each document can contain many pages, it is impossible to compute the similarity using the entire document, requiring solutions that analyze a few meaningful terms in summary form. This article presents TextSSimily, a method that compares documents semantically considering only a short text for comparison (a text summary), using semantics to improve the set of responses and summaries to reduce the time needed to obtain results for large document sets.
2017
Clustering is one of the most important data mining techniques for categorizing a large number of unordered text documents into meaningful and coherent clusters. Most text clustering algorithms do not consider the semantic relationships between words and are unable to recognize and use semantic concepts. In this paper, a new algorithm is presented to cluster texts based on the meanings of words. First, a new method is presented to find semantic relationships between words based on the WordNet ontology; then, the text data are clustered using the proposed method and a hierarchical clustering algorithm. Documents are preprocessed, converted to the vector space model, and then clustered semantically using the proposed algorithm. The experimental results show that the quality and accuracy of the proposed algorithm are more reliable than those of existing hierarchical clustering algorithms.
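WordNet-based word relatedness, which several of the papers above rely on, is commonly measured by the path length between two words through their common hypernym. The sketch below illustrates the idea over a tiny hand-coded hypernym tree (a hypothetical stand-in for WordNet's noun hierarchy; the real measure would use NLTK's WordNet interface):

```python
# Toy hypernym tree: word -> its parent concept (None = root).
HYPERNYM = {
    "dog": "canine", "cat": "feline",
    "canine": "carnivore", "feline": "carnivore",
    "carnivore": "animal", "animal": None,
}

def path_to_root(word):
    """The chain of hypernyms from a word up to the root concept."""
    path = [word]
    while HYPERNYM.get(path[-1]):
        path.append(HYPERNYM[path[-1]])
    return path

def path_similarity(w1, w2):
    """1 / (1 + shortest path length through the lowest common ancestor),
    mirroring the shape of WordNet's path-based similarity."""
    p1, p2 = path_to_root(w1), path_to_root(w2)
    for d1, ancestor in enumerate(p1):
        if ancestor in p2:
            return 1.0 / (1 + d1 + p2.index(ancestor))
    return 0.0
```

Identical words score 1.0, and the score decays as the words sit farther apart in the hierarchy; a document-level semantic similarity can then aggregate such word scores across two documents.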