1999, Lecture Notes in Computer Science
Clustering is a data mining method that consists in discovering interesting data distributions in very large databases. Applications of clustering include customer segmentation, catalog design, store layout, and stock market segmentation. In this paper, we consider the problem of discovering similarity-based clusters in a large database of event sequences. We introduce a hierarchical algorithm that uses sequential patterns found in the database to efficiently generate both the clustering model and the data clusters. The algorithm iteratively merges smaller, similar clusters into bigger ones until the requested number of clusters is reached. In the absence of a well-defined metric space, we propose a similarity measure to be used in cluster merging. The advantage of the proposed measure is that no additional access to the source database is needed to evaluate inter-cluster similarities.
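As a rough illustration of the merging loop described above, the sketch below assumes each cluster is summarized by the set of sequential patterns its sequences support, and approximates inter-cluster similarity by Jaccard overlap of those pattern sets (the paper's actual measure may differ); because only the pattern sets are compared, no extra pass over the source database is needed.

```python
# Hypothetical sketch: agglomerative merging of pattern-described clusters.
# Inter-cluster similarity is approximated by Jaccard overlap of the
# sequential-pattern sets attached to each cluster.

def jaccard(a, b):
    return len(a & b) / len(a | b) if (a or b) else 0.0

def merge_clusters(clusters, k):
    """clusters: list of (member_ids, pattern_set); merge until k clusters remain."""
    clusters = [(set(m), set(p)) for m, p in clusters]
    while len(clusters) > k:
        # pick the most similar pair of clusters
        i, j = max(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: jaccard(clusters[ij[0]][1], clusters[ij[1]][1]),
        )
        mi, pi = clusters[i]
        mj, pj = clusters[j]
        merged = (mi | mj, pi | pj)            # union of members and of patterns
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)] + [merged]
    return clusters
```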
Advances in Databases and Information Systems, 1999
Clustering is a data mining method that consists in discovering interesting data distributions in very large databases. Applications of clustering include customer segmentation, catalog design, store layout, and stock market segmentation. In this paper, we consider the problem of discovering similarity-based clusters in a large database of event sequences. We introduce a hierarchical algorithm that uses sequential patterns found ...
Lecture Notes in Computer Science, 2009
Clustering is a widely used unsupervised data analysis technique in machine learning. However, a common requirement amongst many existing clustering methods is that all pairwise distances between patterns must be computed in advance. This makes them computationally expensive and difficult to apply to the large-scale data arising in several applications, such as bioinformatics. In this paper we propose a novel sequential hierarchical clustering technique that initially builds a hierarchical tree from a small fraction of the entire data, while the remaining data is processed sequentially and the tree adapted constructively. Preliminary results with this approach show that the quality of the obtained clusters does not degrade while the computational needs are reduced.
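A minimal sketch of the two-stage idea, under the simplifying assumption that cluster summaries (running centroids) stand in for the hierarchical tree: a small sample seeds the structure, and the remaining points are folded in one at a time without recomputing all pairwise distances. The sampling scheme and nearest-centroid rule are assumptions, not the paper's algorithm.

```python
# Hypothetical sketch: cluster a small sample first, then stream the rest.
import numpy as np

def sequential_cluster(X, sample_size, k, rng=None):
    rng = rng or np.random.default_rng(0)
    idx = rng.choice(len(X), size=sample_size, replace=False)
    sample, rest = X[idx], np.delete(X, idx, axis=0)

    # Stage 1: any hierarchical method on the small sample; for brevity we
    # simply seed k clusters with k sample points.
    centroids = sample[:k].astype(float)
    counts = np.ones(k)

    # Stage 2: process the remaining data sequentially, adapting the nearest cluster.
    for x in np.vstack([sample[k:], rest]):
        j = np.argmin(np.linalg.norm(centroids - x, axis=1))
        counts[j] += 1
        centroids[j] += (x - centroids[j]) / counts[j]   # running-mean update
    return centroids
```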
Conceptual clustering is a discovery process that groups a set of data in such a way that the intra-cluster similarity is maximized and the inter-cluster similarity is minimized. Traditional clustering algorithms employ some measure of distance between data points in n-dimensional space. However, not all data types can be represented in a metric space, and therefore no natural distance function is available for them. We address the problem of clustering sequences of categorical values. We present a measure of similarity for the sequences and an agglomerative hierarchical algorithm that uses frequent sequential patterns found in the database to efficiently generate the resulting clusters. The algorithm iteratively merges smaller, similar clusters into bigger ones until the requested number of clusters is reached.
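To make the pattern-based view of similarity concrete, here is an illustrative sequence-to-sequence measure: two categorical sequences are compared through the frequent sequential patterns they contain rather than through a metric on the raw symbols. The containment test and the Jaccard-style ratio are assumptions for this sketch, not necessarily the measure defined in the paper.

```python
# Hedged sketch of a pattern-based similarity for categorical sequences.

def contains(seq, pattern):
    """True if `pattern` occurs in `seq` as a (possibly gapped) subsequence."""
    it = iter(seq)
    return all(item in it for item in pattern)

def pattern_similarity(s1, s2, frequent_patterns):
    p1 = {p for p in frequent_patterns if contains(s1, p)}
    p2 = {p for p in frequent_patterns if contains(s2, p)}
    union = p1 | p2
    return len(p1 & p2) / len(union) if union else 0.0

patterns = [("login", "search"), ("search", "buy")]
print(pattern_similarity(("login", "search", "buy"),
                         ("login", "browse", "search"), patterns))   # 0.5
```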
Lecture Notes in Computer Science, 2001
Data clustering methods have many applications in the area of data mining. Traditional clustering algorithms deal with quantitative or categorical data points. However, there exist many important databases that store categorical data sequences, where significant knowledge is hidden behind sequential dependencies between the data. In this paper we introduce a problem of clustering categorical data sequences and present an efficient scalable algorithm to solve the problem. Our algorithm implements the general idea of agglomerative hierarchical clustering and uses frequently occurring subsequences as features describing data sequences. The algorithm not only discovers a set of high quality clusters containing similar data sequences but also provides descriptions of the discovered clusters.
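One way to picture the descriptive side mentioned above, assuming each sequence has already been mapped to the set of frequent subsequences it contains (its feature set): the patterns shared by every member of a cluster can double as that cluster's human-readable description. This is an illustrative sketch, not the paper's algorithm.

```python
# Cluster description as the intersection of members' frequent-subsequence features.

def describe_cluster(member_feature_sets):
    """Frequent subsequences shared by every member act as the cluster's description."""
    if not member_feature_sets:
        return []
    return sorted(set.intersection(*map(set, member_feature_sets)))

print(describe_cluster([{("a", "b"), ("b", "c")}, {("a", "b")}]))   # [('a', 'b')]
```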
2016 IEEE International Conference on Automation Science and Engineering (CASE), 2016
Finding frequent patterns is an important problem in data mining. We have devised a method for detecting frequent patterns in event log data. By representing events in a graph structure, we can generate clusters of frequently co-occurring events. This method is compared with basic association mining techniques and found to give a “macro-level” overview of patterns, which is more interpretable. In addition, the graph-based clustering output for frequently co-occurring event sets is substantially smaller than that of association mining, while providing a similar level of information. Therefore, the results are more manageable for practical applications.
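A hedged illustration of the graph idea: events that co-occur in the same log entry are linked, and connected components of the resulting graph are reported as clusters of frequently co-occurring events. The support threshold and the component-based grouping are assumptions for this sketch, not the paper's exact construction.

```python
# Build an event co-occurrence graph and report its connected components.
from collections import Counter
from itertools import combinations

def cooccurrence_clusters(event_sets, min_support=2):
    pair_counts = Counter()
    for events in event_sets:
        pair_counts.update(combinations(sorted(set(events)), 2))

    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), n in pair_counts.items():
        if n >= min_support:
            parent[find(a)] = find(b)          # union frequently linked events

    clusters = {}
    for e in parent:
        clusters.setdefault(find(e), set()).add(e)
    return list(clusters.values())
```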
Data Mining and Knowledge Discovery, 2004
In traditional approaches for clustering market basket type data, relations among transactions are modeled according to the items occurring in these transactions. However, an individual item might induce different relations in different contexts. Since such contexts might be captured by interesting patterns in the overall data, we represent each transaction as a set of patterns by modifying the conventional pattern semantics. By clustering the patterns in the dataset, we infer a clustering of the transactions represented this way. For this, we propose a novel hypergraph model to represent the relations among the patterns. Instead of a local measure that depends only on common items among patterns, we propose a global measure that is based on the co-occurrences of these patterns in the overall data. The success of existing hypergraph partitioning based algorithms in other domains depends on sparsity of the hypergraph and explicit objective metrics. For this reason, we propose a two-phase clustering approach for the above hypergraph, which is expected to be dense. In the first phase, the vertices of the hypergraph are merged in a multilevel algorithm to obtain a large number of high-quality clusters. Here, we propose new quality metrics for merging decisions in hypergraph clustering specifically for this domain. In order to enable the use of existing metrics in the second phase, we introduce a vertex-to-cluster affinity concept to devise a method for constructing a sparse hypergraph based on the obtained clustering. The experiments we have performed show the effectiveness of the proposed framework.
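A rough sketch of the "global" affinity the abstract contrasts with a purely item-based one: two patterns are related by how often they occur together across all transactions, not by how many items they share. The specific ratio below is an assumption, not the paper's metric.

```python
# Global co-occurrence affinity between two patterns over a transaction database.

def global_affinity(pattern_a, pattern_b, transactions):
    both = either = 0
    for t in transactions:
        items = set(t)
        a_in, b_in = pattern_a <= items, pattern_b <= items
        both += a_in and b_in
        either += a_in or b_in
    return both / either if either else 0.0

txns = [{"milk", "bread", "eggs"}, {"milk", "bread"}, {"eggs", "beer"}]
print(global_affinity({"milk", "bread"}, {"eggs"}, txns))   # 1/3
```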
2002
Clustering is the process of grouping a set of objects into classes of similar objects. Although definitions of similarity vary from one clustering model to another, in most of these models the concept of similarity is based on distances, e.g., Euclidean distance or cosine distance. In other words, similar objects are required to have close values on at least a set of dimensions. In this paper, we explore a more general type of similarity. Under the pCluster model we proposed, two objects are similar if they exhibit a coherent pattern on a subset of dimensions. For instance, in DNA microarray analysis, the expression levels of two genes may rise and fall synchronously in response to a set of environmental stimuli. Although the magnitude of their expression levels may not be close, the patterns they exhibit can be very much alike. Discovery of such clusters of genes is essential in revealing significant connections in gene regulatory networks. E-commerce applications, such as collaborative filtering, can also benefit from the new model, which captures not only the closeness of values of certain leading indicators but also the closeness of (purchasing, browsing, etc.) patterns exhibited by the customers. Our paper introduces an effective algorithm to detect such clusters, and we perform tests on several real and synthetic data sets to show its effectiveness.
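The coherence test behind a pattern-based model of this kind can be written down directly: for two objects and two dimensions, the score compares how the two objects change between the dimensions, and a set of dimensions is coherent when every such 2x2 submatrix stays within a user threshold delta. The sketch follows the commonly cited pCluster (pScore) formulation and is hedged accordingly; the paper's exact definitions may differ.

```python
# Illustrative pScore-style coherence check for two objects on a set of dimensions.
from itertools import combinations

def pscore(d_xa, d_xb, d_ya, d_yb):
    return abs((d_xa - d_xb) - (d_ya - d_yb))

def coherent(x, y, dims, delta):
    """True if objects x, y (dicts: dim -> value) are coherent on dims within delta."""
    return all(
        pscore(x[a], x[b], y[a], y[b]) <= delta
        for a, b in combinations(dims, 2)
    )

gene1 = {"c1": 2.0, "c2": 5.0, "c3": 9.0}
gene2 = {"c1": 12.1, "c2": 15.0, "c3": 19.2}    # same rise/fall shape, shifted in magnitude
print(coherent(gene1, gene2, ["c1", "c2", "c3"], delta=0.5))   # True
```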
2010
Bioinformatics emerged as a challenging new area of research and brought forth numerous computational problems. Here computers are used to gather, store, analyze and merge biological data. In this paper, the problem of clustering interval-scaled data and sequence data is analyzed in a new approach using hierarchical sequence clustering. In sequence clustering, it is necessary to find the similarity or distance between each pair of sequences. To find the similarity between sequences, the Probabilistic Suffix Tree data structure can be used. An agglomerative algorithm based on UPGMA (Unweighted Pairwise Group Average Method) cluster analysis is introduced, which requires O(n^3) total computing time. Then a new algorithm using the new approach is introduced with O(n^2) computing time. The results of this new algorithm are compared with UPGMA cluster analysis.
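To make the O(n^3) baseline concrete, here is a naive UPGMA-style (average-linkage) merge loop: at each of the n-1 merges the closest pair of clusters is found by scanning average pairwise distances. The paper's O(n^2) variant and the probabilistic-suffix-tree distances are not reproduced; the distance matrix is assumed to be given.

```python
# Naive average-linkage (UPGMA-style) agglomeration over a precomputed distance matrix.
import numpy as np

def upgma(dist, k):
    """dist: symmetric (n x n) distance matrix; stop when k clusters remain."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > k:
        best, pair = np.inf, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # average pairwise distance between the two clusters
                d = np.mean([dist[a][b] for a in clusters[i] for b in clusters[j]])
                if d < best:
                    best, pair = d, (i, j)
        i, j = pair
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```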
International journal of computer applications, 2015
The objective of data mining is to extract information from large amounts of data and convert it into a form that can be used further. It offers several functionalities, among which clustering is the focus of this paper. Clustering is basically an unsupervised learning task in which the categories the data should be placed into are not known a priori. It is a process of grouping a set of abstract objects so that objects in one cluster are highly similar to each other and dissimilar to objects in other clusters. Clustering can be performed by a number of different methods, such as partitioning-based, hierarchy-based, density-based, grid-based, model-based, and constraint-based clustering. This survey paper reviews clustering and its different techniques, with a special focus on hierarchical clustering. A number of hierarchical clustering methods that have recently been developed are described here, with the goal of providing useful references to fundamental concepts accessible to the broad community of clustering practitioners.
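As a short, self-contained example of the hierarchical (agglomerative) clustering the survey focuses on, the snippet below uses SciPy's standard routines on a small synthetic dataset; the two-blob data and the average-linkage choice are only illustrative.

```python
# Agglomerative clustering with SciPy: build the merge tree, then cut it into 2 clusters.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.vstack([np.random.randn(20, 2) + [0, 0],
               np.random.randn(20, 2) + [5, 5]])
Z = linkage(X, method="average")                  # bottom-up merge tree
labels = fcluster(Z, t=2, criterion="maxclust")   # cut the tree into 2 flat clusters
print(labels)
```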
Lecture Notes in Computer Science, 2005
We propose a mining framework that supports the identification of useful patterns based on incremental data clustering. Given the popularity of Web news services, we focus our attention on news streams mining. News articles are retrieved from Web news services, and processed by data mining tools to produce useful higher-level knowledge, which is stored in a content description database. Instead of interacting with a Web news service directly, by exploiting the knowledge in the database, an information delivery agent can present an answer in response to a user request. A key challenging issue within news repository management is the high rate of document insertion. To address this problem, we present a sophisticated incremental hierarchical document clustering algorithm using a neighborhood search. The novelty of the proposed algorithm is the ability to identify meaningful patterns (e.g., news events, and news topics) while reducing the amount of computations by maintaining cluster structure incrementally. In addition, to overcome the lack of topical relations in conceptual ontologies, we propose a topic ontology learning framework that utilizes the obtained document hierarchy. Experimental results demonstrate that the proposed clustering algorithm produces high-quality clusters, and a topic ontology provides interpretations of news topics at different levels of abstraction.
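A hedged sketch of the incremental idea: each arriving article is compared with existing cluster summaries and either absorbed or started as a new cluster, so the structure is maintained without reclustering the whole stream. The cosine threshold and the flat (non-hierarchical) clusters are simplifications of the paper's neighborhood-search algorithm.

```python
# Incremental single-pass document clustering over centroid summaries.
import numpy as np

def cosine(a, b):
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    return float(a @ b / (na * nb)) if na and nb else 0.0

class IncrementalClusterer:
    def __init__(self, threshold=0.4):
        self.threshold = threshold
        self.centroids, self.sizes = [], []

    def add(self, doc_vector):
        if self.centroids:
            sims = [cosine(c, doc_vector) for c in self.centroids]
            j = int(np.argmax(sims))
            if sims[j] >= self.threshold:
                n = self.sizes[j]
                self.centroids[j] = (self.centroids[j] * n + doc_vector) / (n + 1)
                self.sizes[j] = n + 1
                return j
        self.centroids.append(doc_vector.astype(float))
        self.sizes.append(1)
        return len(self.centroids) - 1
```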
IECON'01. 27th Annual Conference of the IEEE Industrial Electronics Society (Cat. No.37243)
The problem of clustering multidimensional data with similar properties has been targeted in the literature. In this paper, the authors have concentrated on the drawback of one of the widely used methods, mountain clustering. A method that overcomes this problem is proposed. The method is tested on examples and the results are graphically depicted.
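Since the abstract concerns mountain clustering, here is a brief sketch of the baseline method it critiques: a grid of candidate centers is scored by a "mountain" of nearby data density, and the best peak is taken as a cluster center. The exponential kernel and grid construction follow the common Yager-Filev style formulation; the drawback is visible in the grid growing exponentially with dimension, and the paper's proposed fix is not shown.

```python
# Mountain-function scoring of grid candidates to find one cluster center.
import numpy as np
from itertools import product

def mountain_peak(X, grid_points_per_dim=10, alpha=5.0):
    lo, hi = X.min(axis=0), X.max(axis=0)
    axes = [np.linspace(lo[d], hi[d], grid_points_per_dim) for d in range(X.shape[1])]
    best_score, best_v = -np.inf, None
    for v in product(*axes):            # grid size grows exponentially with dimension
        v = np.array(v)
        score = np.sum(np.exp(-alpha * np.linalg.norm(X - v, axis=1) ** 2))
        if score > best_score:
            best_score, best_v = score, v
    return best_v
```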
Clustering of inherently sequential data sets is useful for various purposes. Over the years, many methods have been developed for clustering objects of a sequential nature according to their similarity. However, these methods tend to have a computational complexity that is at least quadratic in the number of sequences. Also, clustering algorithms often require that the entire dataset be kept in the computer memory. In this paper, we present a novel algorithm for mining constraint-based clustered sequential patterns (CBCSP), which clusters only the sequential data of interest to the user by applying recency, monetary, and compactness constraints. The algorithm thus generates a compact set of clusters of sequential patterns according to user interest by applying the constraints during the mining process, which minimizes the I/O cost involved. The proposed algorithm basically applies the well-known K-means clustering algorithm along with prefix-projected database construction to the set of sequential patterns. In this approach, the method first performs clustering based on a novel similarity function and then captures only the user-interesting sequential patterns in each cluster using a sequential pattern mining algorithm that employs the pattern-growth method. The proposed work reduces the search space, as the user-intended sequential patterns tend to be discovered in the resulting list. Through experimental evaluation under various simulated conditions, the proposed method is shown to deliver excellent performance and to lead to reasonably good clusters.
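An illustrative pre-filter for the constraints named in the abstract, applied before any clustering or pattern mining. The field names and thresholds are placeholders; the paper's exact definitions of recency, monetary, and compactness may be formulated differently.

```python
# Filter customer event sequences by recency, monetary, and compactness constraints.

def satisfies_constraints(seq, now, max_recency_days, min_monetary, max_span_days):
    """seq: list of (timestamp_days, amount, item) events for one customer."""
    times = [t for t, _, _ in seq]
    recency = now - max(times)                 # days since the last event
    monetary = sum(a for _, a, _ in seq)       # total spend in the sequence
    compactness = max(times) - min(times)      # time span covered by the sequence
    return (recency <= max_recency_days
            and monetary >= min_monetary
            and compactness <= max_span_days)

def filter_sequences(db, now, **limits):
    return [s for s in db if satisfies_constraints(s, now, **limits)]
```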
Many agencies construct catalogs of the hundreds of seismic events that occur daily around the world. The Ground-Based Nuclear Explosion Monitoring Research and Development (GNEMRD) program merges these catalogs together into a composite catalog containing multiple descriptions of the same seismic event, one from each catalog of interest. The merging process requires associating seismic events in individual catalogs (herein called origins) that are independent estimates of the same seismic event. In this paper we describe application of classical cluster analysis techniques that provide a straightforward and robust solution to this merging problem. The resulting algorithm is much simpler to tune than the rule-based methodology used by EvLoader, which is the application currently used to merge catalogs in the GNEMRD program. For this study, we used a simple agglomerative hierarchical clustering technique to create clusters of similar origins where the various origins in a cluster re...
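A minimal sketch of the association step described above: origins whose times and epicenters fall within fixed windows are grouped as the same event. The thresholds and the simple single-pass grouping are assumptions for illustration; the study's agglomerative procedure is more general.

```python
# Group catalog origins that plausibly describe the same seismic event.

def associate_origins(origins, max_dt_sec=60.0, max_dist_deg=1.0):
    """origins: list of dicts with 'time', 'lat', 'lon'; returns a list of groups."""
    groups = []
    for o in sorted(origins, key=lambda x: x["time"]):
        for g in groups:
            ref = g[0]
            if (abs(o["time"] - ref["time"]) <= max_dt_sec
                    and abs(o["lat"] - ref["lat"]) <= max_dist_deg
                    and abs(o["lon"] - ref["lon"]) <= max_dist_deg):
                g.append(o)
                break
        else:
            groups.append([o])
    return groups
```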
In data mining and knowledge discovery, pattern discovery (PD) is used to find significant correlations among events. PD typically produces an overwhelming number of patterns. Since there are too many patterns, it is difficult to use them to further explore or analyze the data. To address these problems, a new method is proposed that simultaneously clusters the discovered patterns and their associated data. It is referred to as "Simultaneous pattern and data clustering using a Modified K-means Algorithm". One important property of the proposed method is that each pattern cluster is explicitly associated with a corresponding data cluster. The modified K-means algorithm is used to cluster patterns and their associated data. After the clusters are found, each of them can be further explored and analyzed individually. The proposed method reduces the number of iterations needed to cluster the given data. Experimental results using the proposed algorithm on a group of randomly constructed data sets are very promising.
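Below is a plain K-means over binary pattern-indicator vectors, as a stand-in for the "Modified K-means" the abstract refers to (whose modifications are not specified here). Each row of X marks which discovered patterns a record matches, so pattern clusters and data clusters are tied to the same representation.

```python
# Basic K-means over pattern-indicator vectors (0/1 rows), NumPy only.
import numpy as np

def kmeans(X, k, iters=20, rng=None):
    rng = rng or np.random.default_rng(0)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(iters):
        # assign each row to its nearest center
        labels = np.argmin(((X[:, None, :] - centers) ** 2).sum(axis=2), axis=1)
        # recompute centers from the assigned rows
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers
```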
Al-Azhar University Engineering Journal, JAUES, 11th International Conference, 2010
Clustering is an important data mining technique that groups similar data records; recently, categorical transaction clustering has received more attention. In this research we study the problem of categorical data clustering for transactional data characterized by high dimensionality and large volume. We propose a novel algorithm for clustering transactional data called F-Tree, which is based on the idea of the FP-tree, one of the fastest approaches to frequent itemset mining. The simple idea behind F-Tree is to generate small, highly pure clusters and then merge them, which makes it fast and dynamic when clustering large, high-dimensional transactional datasets. We also present a new solution to the overlapping problem between clusters by defining a new criterion function based on the probability of overlap between weighted items. Our experimental evaluation on real datasets shows that: firstly, F-Tree is effective in finding interesting clusters; secondly, the use of the tree structure reduces the clustering time for large datasets with many attributes; thirdly, the proposed evaluation metric efficiently handles the overlap of transaction items and produces high-quality clustering results. Finally, we conclude that merging small, pure clusters increases the purity of the resulting clusters and reduces clustering time more than generating clusters directly from the dataset and then refining them.
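A sketch of an overlap score between two transaction clusters in the spirit of the criterion described above: items are weighted by their within-cluster frequency, and the score measures how much weight the two clusters place on shared items. This exact formula is an assumption, not the paper's criterion function.

```python
# Weighted-item overlap between two clusters of transactions.
from collections import Counter

def item_weights(cluster):
    counts = Counter(item for txn in cluster for item in txn)
    total = sum(counts.values())
    return {i: c / total for i, c in counts.items()}

def overlap(cluster_a, cluster_b):
    wa, wb = item_weights(cluster_a), item_weights(cluster_b)
    shared = set(wa) & set(wb)
    return sum(min(wa[i], wb[i]) for i in shared)   # in [0, 1]; high values flag merge candidates
```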
Undergraduate Topics in Computer Science, 2011
Clustering techniques are widely used and important nowadays, and their importance tends to increase as the amount of data grows and the processing power of computers increases. Clustering applications are used extensively in various fields such as artificial intelligence, pattern recognition, economics, ecology, psychiatry and marketing. Several algorithms and methods have been developed for the clustering problem, but the challenge of finding new algorithms and processes for extracting knowledge with improved accuracy and efficiency always remains. This dilemma motivated us to develop a new algorithm and process for clustering problems. Several other issues also exist; for example, cluster analysis can contribute to compressing the information contained in the data. In several cases, the amount of available data is very large and its processing becomes very demanding. Clustering can be used to partition a data set into a number of "interesting" clusters; then, instead of processing the data set as a whole, we work with the representatives of the defined clusters, and data compression is achieved. Cluster analysis is applied to the data set, and the resulting clusters are characterized by the features of the patterns that belong to them. Unknown patterns can then be classified into specific clusters based on their similarity to the clusters' features, and useful knowledge related to the data can be extracted [1].
2003
Pattern-based clustering is important in many applications, such as DNA micro-array data analysis, automatic recommendation systems and target marketing systems. However, pattern-based clustering in large databases is challenging. On the one hand, there can be a huge number of clusters, many of which can be redundant and thus make the pattern-based clustering ineffective. On the other hand, previously proposed methods may not be efficient or scalable in mining large databases.
International Journal of Computer Applications, 2018
Pattern mining is an important field of data mining. A fundamental task of data mining is to explore the database to find sequential and frequent patterns. In recent years, data mining has shifted its focus to designing methods for discovering patterns that match user expectations. In this regard, various types of pattern mining methods have been proposed: frequent pattern mining, sequential pattern mining, temporal pattern mining, and constraint-based pattern mining. Pattern mining has various useful real-life applications such as market basket analysis, e-learning, social network analysis, web page click sequences, and bioinformatics. This paper presents a survey of the various types of pattern mining. The main goal of this paper is to present both an introduction to pattern mining and a survey of various algorithms, challenges and research opportunities. This paper discusses not only the problems of pattern mining and its related applications, but also the extensions and possible future improvements in this field.
A plethora of algorithms exist for clustering to discover actionable knowledge from large data sources. Given unlabeled data objects, clustering is an unsupervised learning task that finds natural groups of similar objects. Each cluster is a subset of objects that exhibit high similarity. Cluster quality is high when clusters feature the highest intra-cluster similarity and the lowest inter-cluster similarity. The quality of clusters is influenced by the similarity measure employed for grouping objects, and clustering quality is measured by the ability of the clustering technique to unearth latent trends distributed in the data. Clustering is ubiquitous in real-world applications such as market research, discovering web access patterns, document classification, image processing, pattern recognition, earth observation, banking, and insurance, to name a few. Clustering algorithms differ in the type of data they handle, the measure of similarity, computational efficiency, linkage methods, soft or hard clustering, and so on. Employing a clustering technique correctly depends on one's technical know-how of the various kinds of clustering algorithms and the scenarios in which they are suitable. Towards this end, in this paper we explore clustering algorithms in terms of computational efficiency, measure of similarity, speed, and performance.
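A small helper that makes the quality notion in this paragraph concrete: average similarity inside each cluster versus average similarity between clusters, using cosine similarity on members and centroids. The particular averaging choices are illustrative, not a standard named validity index.

```python
# Intra-cluster vs inter-cluster similarity for a given clustering.
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def cluster_quality(clusters):
    """clusters: list of 2-D arrays (one array of row vectors per cluster)."""
    centroids = [c.mean(axis=0) for c in clusters]
    intra = np.mean([cosine(x, centroids[i]) for i, c in enumerate(clusters) for x in c])
    pairs = [(i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
    inter = np.mean([cosine(centroids[i], centroids[j]) for i, j in pairs]) if pairs else 0.0
    return intra, inter     # good clusterings: high intra, low inter
```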