2016, International Journal of Advanced Computer Science and Applications
Document clustering is an unsupervised machine learning method that separates a large, subject-heterogeneous collection (corpus) into smaller, more manageable, subject-homogeneous collections (clusters). Traditional methods of document clustering work by extracting textual features such as terms, sequences, and phrases from documents. These features are treated as independent of each other and do not capture the meaning behind the words during clustering. To perform semantically viable clustering, we believe the problem of document clustering has two main components: (1) representing the document in a form that inherently captures the semantics of the text, which may also help reduce the dimensionality of the document; and (2) defining a similarity measure based on lexical, syntactic, and semantic features such that it assigns higher values to document pairs with stronger syntactic and semantic relationships. In this paper, we propose a document representation built from three different types of features extracted from a given document: lexical (α), syntactic (β), and semantic (γ). A meta-descriptor for each document is constructed from these three features, in order: first lexical, then syntactic, and finally semantic. A document-to-document similarity matrix is produced in which each entry is a three-valued vector of the lexical α, syntactic β, and semantic γ similarities. The main contributions of this research are: (i) a document-level descriptor using three different text features (lexical, syntactic, and semantic); (ii) a similarity function using these three features; and (iii) a new candidate clustering algorithm that uses the three components of the similarity measure to guide the clustering process toward semantically richer clusters.
We performed an extensive series of experiments on standard text mining data sets with external clustering evaluation measures such as F-Measure and Purity, and obtained encouraging results.
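The three-component (α, β, γ) pairwise similarity described above can be sketched as follows. This is a minimal illustration only: the component functions (token overlap for lexical, bigram overlap as a stand-in for syntactic structure, and a toy synonym map as a stand-in for semantic features) are hypothetical choices, not the authors' actual implementation.

```python
def ngram_set(tokens, n):
    """All contiguous n-grams of a token list, as a set of tuples."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard(s1, s2):
    """Jaccard overlap of two sets (0 when both are empty)."""
    return len(s1 & s2) / len(s1 | s2) if s1 | s2 else 0.0

# Toy semantic normalization table (hypothetical; a real system might use
# WordNet synsets or embeddings instead).
SYNONYMS = {"car": "auto", "automobile": "auto"}

def similarity_vector(doc1, doc2):
    """Return the (alpha, beta, gamma) similarity vector for a document pair."""
    t1, t2 = doc1.lower().split(), doc2.lower().split()
    alpha = jaccard(set(t1), set(t2))                    # lexical: shared tokens
    beta = jaccard(ngram_set(t1, 2), ngram_set(t2, 2))   # syntactic: shared bigrams
    n1 = {SYNONYMS.get(w, w) for w in t1}                # semantic: normalize
    n2 = {SYNONYMS.get(w, w) for w in t2}                # synonyms before overlap
    gamma = jaccard(n1, n2)
    return (alpha, beta, gamma)
```

Each matrix entry in the paper's document-to-document similarity matrix would hold one such three-valued vector; note how the semantic component γ can be high even when lexical overlap α is low (e.g. "car" vs. "automobile").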
2017
There are several possible extensions to this work. The proposed document clustering approach has many practical applications; one direction is to apply the technique to a specific application area, with application-specific optimizations, and examine the outcome. For example, web search results could be clustered using this approach, with snippets generated for each cluster to assess their quality. In the proposed approach, each term, whether it comes from a lexical chain or from a topic map, has an equal effect on the similarity calculation for a pair of documents. Another possible direction is to introduce discriminative feature weighting for the features in this approach; discriminative feature weighting has produced encouraging results for both text clustering and classification tasks.
Dimensionality reduction is a challenging and important problem in text mining: we need to know which features should be retained and which should be discarded, and doing so reduces the processing overhead of text classification and text clustering. Another concern in text clustering and classification is the similarity measure chosen to quantify the degree of similarity between any two text documents. In this paper, we address text clustering and text classification by performing dimensionality reduction using SVD, followed by the use of the proposed similarity measure, an improved version of our previous measure [25, 31]. The proposed measure is used for both supervised and unsupervised learning, and it overcomes the disadvantages of the existing measures [10].
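The SVD-based dimensionality reduction mentioned above can be sketched as follows: a term-document matrix is factored, only the k largest singular values are kept (truncated SVD, as in latent semantic analysis), and similarity is then computed between the reduced document vectors. The matrix and k here are illustrative, not from the paper.

```python
import numpy as np

# Toy term-document count matrix: rows = terms, columns = documents.
A = np.array([
    [2, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 2, 0, 1],
    [0, 0, 1, 2],
], dtype=float)

# Full SVD, then keep only the k largest singular values (truncated SVD / LSA).
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
# Each document becomes a k-dimensional vector in the latent space.
docs_reduced = (np.diag(s[:k]) @ Vt[:k]).T

def cosine(u, v):
    """Cosine similarity of two dense vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Similarity is now computed in the reduced k-dimensional space
# instead of the original (possibly huge) term space.
sim_01 = cosine(docs_reduced[0], docs_reduced[1])
```

The point of the reduction is that downstream clustering or classification operates on k-dimensional vectors rather than vocabulary-sized ones, which is where the processing-overhead saving comes from.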
Computing the similarity between documents is an important operation in text processing. In this paper, a new similarity measure is proposed. To calculate the similarity between two documents with respect to a feature, the proposed measure takes the following three cases into account: I) the feature appears in both documents, II) the feature appears in only one document, and III) the feature appears in neither document.
Clustering is a branch of data mining that involves grouping similar data into collections known as clusters. Clustering can be used in many fields; one important application is intelligent text clustering. Traditional text clustering algorithms grouped documents based on keyword matching, which means documents were clustered without any descriptive notions, so dissimilar documents could end up in the same cluster. The key solution to this problem is to cluster documents based on semantic similarity, where documents are grouped by meaning rather than by keywords. In this research, fifty papers that use semantic similarity in different fields have been reviewed; thirteen of them, which apply semantic similarity to document clustering within the last five years, were selected for deeper study. A comprehensive literature review of all the selected papers is given, together with a comparison of their algorithms, tools, and evaluation methods. Finally, an intensive discussion comparing the works is presented.
Similarity measurement is an important process in text processing: it measures the similarity between two documents. Unlabeled document collections are becoming increasingly large, common, and available; mining such data sets is a major contemporary challenge. Words are used as features, so text documents are often represented as high-dimensional, sparse vectors. Measuring the similarity between words, sentences, paragraphs, and documents is an important component of tasks such as information retrieval, document clustering, word-sense disambiguation, automatic essay scoring, short-answer grading, and text summarization. This paper presents a survey of document clustering.
2010
Document clustering is one of the major techniques for grouping documents automatically: it divides a given set of documents into a certain number of clusters. The first step in this technique is feature extraction from documents. Conventional methods frequently use a set of words containing nouns and verbs as features. Although words are the usual features in a generic clustering framework, some previous research proposes clustering methods using other features based on the vector space model, such as kernel methods and adaptive sprinkling. However, previous research on document clustering has not yet reported a method that appends new feature vectors obtained from the relationships between the existing documents and other documents. We therefore propose a new method for clustering documents that uses these relationships to obtain clusters that are more useful to users. Our method ca...
The exploitation of syntactic structures and semantic background knowledge has always been an appealing subject in the context of data mining, text retrieval, and information management. The usefulness of this kind of information has been shown most prominently in highly specialized tasks, such as text categorization scenarios. So far, however, additional syntactic or semantic information has only been used individually. In this paper, a new principled approach, the concept- and term-based similarity measure, is proposed; it incorporates linguistic and semantic structures using syntactic dependencies and semantic background knowledge. This novel method represents the meaning of texts in a high-dimensional space of concepts derived from WordNet. A number of case studies are included to demonstrate the various aspects of this framework.
Knowledge and Information Systems, 2011
Document clustering algorithms usually use vector space model (VSM) as their underlying model for document representation. VSM assumes that terms are independent and accordingly ignores any semantic relations between them. This results in mapping documents to a space where the proximity between document vectors does not reflect their true semantic similarity. This paper proposes new models for document representation that capture semantic similarity between documents based on measures of correlations between their terms. The paper uses the proposed models to enhance the effectiveness of different algorithms for document clustering. The proposed representation models define a corpus-specific semantic similarity by estimating measures of term-term correlations from the documents to be clustered. The corpus of documents accordingly defines a context in which semantic similarity is calculated. Experiments have been conducted on thirteen benchmark data sets to empirically evaluate the effectiveness of the proposed models and compare them to VSM and other well-known models for capturing semantic similarity.
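The idea above of replacing VSM's term-independence assumption with corpus-estimated term-term correlations can be sketched with a "soft cosine" style comparison: correlate term occurrence profiles across the corpus, then compare documents through that correlation matrix. The correlation estimator here (cosine of term columns) is a simple stand-in, not the paper's exact model.

```python
import numpy as np

# Toy corpus: rows = documents, columns = terms ("car", "auto", "flower").
docs = np.array([
    [1, 0, 0],
    [0, 1, 0],
    [1, 1, 0],   # "car" and "auto" co-occur here
    [0, 0, 2],
], dtype=float)

# Term-term correlations estimated from the corpus itself: cosine of the
# term occurrence profiles (the columns of the term-document matrix).
norms = np.linalg.norm(docs, axis=0)
C = (docs.T @ docs) / np.outer(norms, norms)

def soft_cosine(u, v, C):
    """Document similarity computed through the term-correlation matrix C,
    so related-but-distinct terms still contribute to the score."""
    num = u @ C @ v
    den = np.sqrt(u @ C @ u) * np.sqrt(v @ C @ v)
    return float(num / den) if den else 0.0
```

Under plain VSM cosine, a document containing only "car" and one containing only "auto" would score 0; through C, their corpus-derived co-occurrence gives them a positive similarity, which is the context-specific semantic effect the abstract describes.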
Clustering is a useful technique that organizes a large quantity of unordered text documents into a small number of meaningful and coherent clusters, thereby providing a basis for intuitive and informative navigation and browsing mechanisms. Partitional clustering algorithms have been recognized to be more suitable as opposed to the hierarchical clustering schemes for processing large datasets. A wide variety of distance functions and similarity measures have been used for clustering, such as squared Euclidean distance, cosine similarity, and relative entropy.
Precision = (number of correctly retrieved documents) / (number of retrieved documents)
2016
These days, humans deal with large amounts of data on a regular basis. Such data is generated to meet immediate needs, with no attempt to organize it for later efficient retrieval. Data mining is the concept of extracting knowledge from this enormous amount of data. In the text processing field, there are many techniques to classify and cluster data in structured format based on the similarity between documents. Clustering algorithms require a metric to quantify how different two given documents are; this difference is often measured by a distance measure such as Euclidean distance, cosine similarity, Jaccard correlation, or the similarity measure for text processing, to name a few. In this research work, we experiment with the Euclidean distance, cosine similarity, and similarity measure for text processing distance measures. The effectiveness of these three measures is evaluated on a real-world data set for text classification and clustering ...
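The first two measures compared above are standard and can be sketched directly on bag-of-words count vectors; Jaccard is shown on token sets. These are generic textbook definitions, not the specific experimental setup of the paper.

```python
import math
from collections import Counter

def vectorize(doc, vocab):
    """Bag-of-words count vector for a document over a fixed vocabulary."""
    counts = Counter(doc.split())
    return [counts[w] for w in vocab]

def euclidean(u, v):
    """Euclidean distance between two count vectors (lower = more similar)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine(u, v):
    """Cosine similarity between two count vectors (higher = more similar)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def jaccard(d1, d2):
    """Jaccard similarity on the token sets of two documents."""
    s1, s2 = set(d1.split()), set(d2.split())
    return len(s1 & s2) / len(s1 | s2) if s1 | s2 else 0.0
```

Note the directions differ: Euclidean is a distance (0 means identical), while cosine and Jaccard are similarities (1 means identical), which matters when plugging them into a clustering algorithm.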
WIT Transactions on Information and Communication Technologies, 2003
Knowledge Discovery in Text (KDT) has emerged as a challenging application due to the large amount of textual documents available from heterogeneous sources. An approach to knowledge discovery in text is based on clustering techniques in which the quality of results strongly depends on features extracted from documents and on similarity coefficients defined on them. In this work we present a framework for textual document preprocessing useful for the extraction of relevant features (i.e. lemma and word). Moreover, we define two similarity coefficients to measure "semantic" similarity (i.e. similarity by high-level contents, as they are intended by a human reader) among documents.
A major computational burden, while performing document clustering, is the calculation of similarity measure between a pair of documents. Similarity measure is a function that assigns a real number between 0 and 1 to a pair of documents, depending upon the degree of similarity between them. A value of zero means that the documents are completely dissimilar whereas a value of one indicates that the documents are practically identical. Traditionally, vector-based models have been used for computing the document similarity. The vector-based models represent several features present in documents. These approaches to similarity measures, in general, cannot account for the semantics of the document. Documents written in human languages contain contexts and the words used to describe these contexts are generally semantically related. Motivated by this fact, many researchers have proposed semantic-based similarity measures by utilizing text annotation through external thesauruses like Word...
Clustering is a useful technique that organizes a large quantity of unordered text documents into a small number of meaningful and coherent clusters. Measuring the similarity and distinctness of two documents is not always a clear-cut problem, and it depends on the topical affiliation of the documents. For example, when clustering research papers, two documents are regarded as similar if they share similar topics; when clustering web sites, we are usually more interested in grouping the component pages according to the type of information presented on each page. A variety of similarity or distance measures have been proposed and widely applied, such as cosine similarity, the Pearson correlation coefficient, and Euclidean distance. This paper deals with semantic clustering of text documents written in the Serbian language. The aim is to prepare documents of different formats for clustering, find keywords in the set of documents, cluster the documents based on those keywords, and find the most appropriate document for a given question.
Eighth Sense Research Group
ABSTRACT The explosive growth of information stored in unstructured texts has created great demand for new and powerful tools, such as text mining, to acquire useful information. Document clustering is one of its powerful methods, through which document retrieval, organization, and summarization can be achieved. Text documents are unstructured databases that contain raw data collections, and clustering techniques are used to group text documents according to their similarity. Because there is a huge amount of unstructured data with semantic correlations among its features, such data is difficult to handle. A large number of feature selection methods are used to improve the efficiency and accuracy of the clustering process; feature selection is done by eliminating redundant and irrelevant items from the text document contents. Statistical methods were used in the text clustering and feature selection algorithm. A semantic clustering and feature selection method is proposed to improve the clustering and feature selection mechanism using the semantic relations of the text documents. Keywords:- Clustering, CHIR, CHIRSIM, K-means algorithm
Journal of Statistics and Management Systems, 2019
The need for appropriate application of the various similarity measures for clustering has grown over the years as data keeps increasing massively. Deciding which similarity measure is best, and for what kind of dataset, has been a cumbersome task in data mining, data science, related fields, and organizations that depend heavily on the knowledge extracted from huge data sets to make vital decisions. Because various datasets share common features, a clearer understanding of the various similarity measures for clustering different kinds of datasets is needed. This paper presents a critical review of various similarity measures applied in text and data clustering, and a theoretical comparison is made to check the suitability of the measures for different kinds of data sets.
This paper introduces a measure of similarity between two clusterings of the same dataset produced by two different algorithms, or even by the same algorithm. Measuring the similarity between documents is a crucial operation in the text processing field. This paper proposes a new similarity measure. To compute the similarity between two documents with respect to a feature, the proposed measure takes the following three cases into account: a) the feature appears in both documents, b) the feature appears in only one document, and c) the feature appears in neither document. For the first case, the similarity increases as the difference between the two involved feature values decreases; moreover, the contribution of the difference is normally scaled. For the second case, a fixed value is contributed to the similarity. For the last case, the feature contributes nothing to the similarity. The effectiveness of the measure is evaluated on several real-world data sets for text classification and clustering problems.
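The three-case, per-feature scheme described above can be sketched as follows. The Gaussian scaling for case (a) and the fixed contribution `lam` for case (b) are illustrative choices in the spirit of the abstract, not the paper's exact formulas.

```python
import math

def doc_sim(v1, v2, sds, lam=0.5):
    """Three-case feature-wise similarity between two feature-value vectors.
    sds holds a spread (e.g. standard deviation) per feature for scaling;
    lam is the fixed contribution when a feature appears in only one document."""
    total, active = 0.0, 0
    for x, y, sd in zip(v1, v2, sds):
        if x == 0 and y == 0:
            continue  # case c: absent from both -> no contribution, and the
                      # feature is excluded from the normalization below
        active += 1
        if x > 0 and y > 0:
            # case a: present in both; contribution grows as the value
            # difference shrinks, scaled by the feature's spread
            total += 0.5 * (1 + math.exp(-((x - y) / sd) ** 2))
        else:
            # case b: present in only one document; fixed contribution
            total += lam
    return total / active if active else 0.0
```

Averaging only over `active` features is what lets the measure ignore the (typically vast) set of vocabulary terms absent from both documents, unlike a naive distance over the full sparse vectors.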
The use of textual documents is increasing rapidly across the internet, email, web pages, reports, journals, and articles, and these documents are stored in electronic database formats. It is challenging to find and access them without proper classification mechanisms. To overcome these difficulties, we propose and develop a semantic document clustering model. The document pre-processing steps and semantic information from WordNet help us make the semantic relations in raw text available. Mindful of the limitations of traditional clustering algorithms on natural language, we perform semantic clustering with COBWEB conceptual clustering. Clustering quality and high accuracy were among the most important aims of our research, and we chose F-Measure evaluation to ensure the purity of the clustering. However, many challenges still exist, such as word ambiguity, high dimensionality, extracting core semantics from texts, and assigning adequate descriptions to the generated clusters; with the help of the WordNet database, we address these issues. This paper presents the proposed framework and describes its development and evaluation.
2017
Clustering is one of the most important data mining techniques, categorizing a large number of unordered text documents into meaningful and coherent clusters. Most text clustering algorithms do not consider the semantic relationships between words and cannot recognize and use semantic concepts. In this paper, a new algorithm is presented to cluster texts based on the meanings of words. First, a new method is presented to find semantic relationships between words based on the WordNet ontology; then, text data is clustered using the proposed method and a hierarchical clustering algorithm. Documents are preprocessed, converted to the vector space model, and then clustered semantically using the proposed algorithm. The experimental results show that the quality and accuracy of the proposed algorithm are more reliable than those of existing hierarchical clustering algorithms.