2003
8 pages
Pattern-based clustering is important in many applications, such as DNA micro-array data analysis, automatic recommendation systems and target marketing systems. However, pattern-based clustering in large databases is challenging. On the one hand, there can be a huge number of clusters, many of them redundant, which makes pattern-based clustering ineffective. On the other hand, previously proposed methods may not be efficient or scalable in mining large databases.
2002
Clustering is the process of grouping a set of objects into classes of similar objects. Although definitions of similarity vary from one clustering model to another, in most of these models the concept of similarity is based on distances, e.g., Euclidean distance or cosine distance. In other words, similar objects are required to have close values on at least a set of dimensions. In this paper, we explore a more general type of similarity. Under the pCluster model we proposed, two objects are similar if they exhibit a coherent pattern on a subset of dimensions. For instance, in DNA microarray analysis, the expression levels of two genes may rise and fall synchronously in response to a set of environmental stimuli. Although the magnitude of their expression levels may not be close, the patterns they exhibit can be very much alike. Discovery of such clusters of genes is essential in revealing significant connections in gene regulatory networks. E-commerce applications, such as collaborative filtering, can also benefit from the new model, which captures not only the closeness of values of certain leading indicators but also the closeness of (purchasing, browsing, etc.) patterns exhibited by the customers. Our paper introduces an effective algorithm to detect such clusters, and we perform tests on several real and synthetic data sets to show its effectiveness.
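The notion of a "coherent pattern on a subset of dimensions" can be illustrated with a small sketch. The abstract does not spell out a formula, so the code below assumes the usual statement of the pCluster model: for two objects x, y and two dimensions a, b, the pScore of the 2x2 submatrix is |(d_xa - d_xb) - (d_ya - d_yb)|, and a set of objects and dimensions forms a δ-pCluster when every such 2x2 submatrix has pScore at most δ. The function names and the expression values are hypothetical, and the exhaustive check is only a definition test, not the detection algorithm the paper introduces.

```python
from itertools import combinations


def p_score(d_xa, d_xb, d_ya, d_yb):
    """pScore of the 2x2 submatrix formed by objects x, y and dimensions a, b."""
    return abs((d_xa - d_xb) - (d_ya - d_yb))


def is_delta_pcluster(data, objects, dims, delta):
    """Return True if every 2x2 submatrix of (objects, dims) has pScore <= delta."""
    for x, y in combinations(objects, 2):
        for a, b in combinations(dims, 2):
            if p_score(data[x][a], data[x][b], data[y][a], data[y][b]) > delta:
                return False
    return True


# Hypothetical expression levels: rows are genes, columns are conditions.
expression = [
    [10, 20, 15, 40],     # gene 0
    [110, 120, 115, 90],  # gene 1: gene 0 shifted by 100 on conditions 0-2
    [30, 5, 50, 10],      # gene 2: unrelated pattern
]

# Genes 0 and 1 are far apart in magnitude but rise and fall together
# on conditions 0-2, so they pass the check on that subset of dimensions.
print(is_delta_pcluster(expression, [0, 1], [0, 1, 2], delta=0))
# Adding condition 3 breaks the coherence.
print(is_delta_pcluster(expression, [0, 1], [0, 1, 2, 3], delta=0))
```

A distance-based measure such as Euclidean distance would place genes 0 and 1 far apart, which is the gap the pattern-based definition is meant to close.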
Genomics & Informatics, 2012
Mining interesting patterns from DNA sequences is one of the most challenging tasks in bioinformatics and computational biology. Maximal contiguous frequent patterns are preferable for expressing the function and structure of DNA sequences and hence can capture the common data characteristics among related sequences. Biologists are interested in finding frequent orderly arrangements of motifs that are responsible for similar expression of a group of genes. In order to reduce mining time and complexity, however, most existing sequence mining algorithms either focus on finding short DNA sequences or require explicit specification of sequence lengths in advance. The challenge is to find longer sequences without specifying sequence lengths in advance. In this paper, we propose an efficient approach to mining maximal contiguous frequent patterns from large DNA sequence datasets. The experimental results show that our proposed approach is memory-efficient and mines maximal contiguous frequent patterns within a reasonable time.
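To make the target of this abstract concrete, here is a brute-force sketch of what "maximal contiguous frequent pattern" means. It assumes that support is the number of sequences containing a substring and that a frequent substring is maximal when no other frequent substring contains it; both are illustrative assumptions, and the function name and sample sequences are hypothetical. The enumeration is quadratic in sequence length, so it only illustrates the definition and is not the memory-efficient approach the paper proposes.

```python
def maximal_contiguous_frequent(sequences, min_support):
    """Return the maximal contiguous patterns found in >= min_support sequences."""
    counts = {}
    for seq in sequences:
        # Collect each distinct substring once per sequence.
        seen = {seq[i:j] for i in range(len(seq)) for j in range(i + 1, len(seq) + 1)}
        for sub in seen:
            counts[sub] = counts.get(sub, 0) + 1
    frequent = {p for p, c in counts.items() if c >= min_support}
    # Keep only patterns that are not contained in a different frequent pattern.
    return sorted(p for p in frequent if not any(p != q and p in q for q in frequent))


# Hypothetical DNA fragments.
dna = ["ACGTAC", "TACGTT", "GACGTA"]
print(maximal_contiguous_frequent(dna, min_support=3))  # ['ACGT', 'TA']
```

Note that no pattern length is given in advance: the length of each reported pattern falls out of the maximality test.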
Knowledge and Information Systems, 2002
Efficient algorithms to mine frequent patterns are crucial to many tasks in data mining. Since the Apriori algorithm was proposed in 1994, several methods have been proposed to improve its performance. However, most still adopt its candidate set generation-and-test approach. In addition, many methods do not generate all frequent patterns, making them inadequate for deriving association rules. We propose a pattern decomposition (PD) algorithm that can significantly reduce the size of the dataset on each pass, making it more efficient to mine all frequent patterns in a large dataset. The proposed algorithm avoids the costly process of candidate set generation and saves time by reducing the dataset. Our empirical evaluation shows that the algorithm outperforms Apriori by one order of magnitude and is faster than FP-tree.
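For context, the candidate set generation-and-test approach that this abstract contrasts itself with can be sketched as follows. This is a minimal, unoptimized rendering of the Apriori baseline, not the PD algorithm; the function name and the sample transactions are hypothetical. Each round generates size-k candidates from the frequent (k-1)-itemsets, then tests them with a full pass over the dataset.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    # Pass 1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    frequent = dict(current)
    k = 2
    while current:
        # Generate: join frequent (k-1)-itemsets and prune any candidate
        # that has an infrequent (k-1)-subset.
        candidates = set()
        for a, b in combinations(list(current), 2):
            union = a | b
            if len(union) == k and all(
                frozenset(s) in current for s in combinations(union, k - 1)
            ):
                candidates.add(union)
        # Test: one full pass over the dataset per level.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {s: c for s, c in counts.items() if c >= min_support}
        frequent.update(current)
        k += 1
    return frequent


# Hypothetical market-basket data.
baskets = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
result = apriori(baskets, min_support=3)
print(sorted("".join(sorted(s)) for s in result))  # ['a', 'ab', 'ac', 'b', 'bc', 'c']
```

Every level rescans the full dataset, which is the cost that shrinking the dataset on each pass is meant to reduce.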
Parallel and Distributed Systems, 2006. …, 2006
When computationally feasible, mining extremely large databases produces tremendously large numbers of frequent patterns. In many cases, it is impractical to mine those datasets due to their sheer size; not only the extent of the existing patterns, but mainly the magnitude of the search space. Many approaches have been suggested, such as sequential mining for maximal patterns or searching for all frequent patterns in parallel. So far, those approaches are still not genuinely effective for mining extremely large datasets.
Expert Systems with Applications, 2017
Mining maximal frequent patterns (MFPs) is an approach that limits the number of frequent patterns (FPs) to help intelligent systems operate efficiently. Many approaches have been proposed for mining MFPs, but the complexity of the problem is enormous. Therefore, the run time and memory usage are still large. Recently, the N-list structure has been proposed and verified to be very effective for mining FPs, frequent closed patterns, and top-rank-k FPs.
In data mining and knowledge discovery, pattern discovery (PD) is used to find significant correlations among events. PD typically produces an overwhelming number of patterns, which makes it difficult to use them to further explore or analyze the data. To address this problem, a new method is proposed that simultaneously clusters the discovered patterns and their associated data. It is referred to as "Simultaneous pattern and data clustering using Modified K-means Algorithm". One important property of the proposed method is that each pattern cluster is explicitly associated with a corresponding data cluster. A modified K-means algorithm is used to cluster patterns and their associated data. After clusters are found, each of them can be further explored and analyzed individually. The proposed method reduces the number of iterations needed to cluster the given data. The experimental results using the proposed algorithm with a group of randomly constructed data sets are very promising.
Information Sciences, 2009
IEEE Transactions on Knowledge and Data Engineering, 2000
Biosequences such as proteins, RNA and DNA are made up of sequences of amino acids or nucleotides. The binding of biosequences among themselves is important for governing many biological processes of a living organism. The bindings are maintained by short segments of these biosequences, known as functional elements. Due to the importance of these functional elements, their presence is well conserved throughout evolution, allowing them to be discovered as patterns. As sequencing technologies continue to improve, biosequence data have become available in abundance. It would thus be convenient and cost-effective if functional elements could be discovered from biosequence data computationally in an unsupervised manner, without the need for prior knowledge or costly preprocessing. In this paper, we aim to give a brief review of an unsupervised pattern discovery tool known as Aligned Pattern Clustering (or its software WeMine™). It was developed to facilitate the discovery and analysis of patterns in biosequences, and has been applied to 1) unsupervised identification of protein binding sites; 2) revealing functioning subgroup characteristics; and 3) identification of intra-protein, inter-protein and protein-DNA binding sites. In the era of ever-expanding biosequence data, we believe that this unsupervised pattern discovery approach would provide a reliable, robust, and scalable method for scientific discovery and applications by leveraging the growing volume of biosequences.
2008
Knowledge discovery, or extracting knowledge from large amounts of data, is a desirable task in competitive businesses. Data mining is an essential step in the knowledge discovery process. Frequent patterns play an important role in data mining tasks such as clustering, classification, prediction and association analysis. However, mining all frequent patterns leads to a massive number of patterns. A reasonable solution is to identify the maximal frequent patterns, which form the smallest representative set of patterns from which all frequent patterns can be generated. This research proposes a new method for mining maximal frequent patterns. The method includes an efficient database encoding technique, a novel tree structure called PC_Tree and the PCMiner algorithm. Experimental results verify the compactness and performance.
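The claim that maximal frequent patterns are a representative set for all frequent patterns can be checked on a toy example. The sketch below assumes itemset patterns, where a frequent itemset is maximal if no proper superset is frequent; the function names and sample itemsets are hypothetical, and the code has nothing to do with PC_Tree or PCMiner internals, which the abstract does not describe.

```python
from itertools import combinations


def maximal_itemsets(frequent):
    """Keep only frequent itemsets that have no frequent proper superset."""
    frequent = {frozenset(s) for s in frequent}
    return {s for s in frequent if not any(s < other for other in frequent)}


def expand(maximal):
    """Regenerate every frequent itemset as a non-empty subset of a maximal one."""
    out = set()
    for m in maximal:
        for k in range(1, len(m) + 1):
            out.update(frozenset(c) for c in combinations(m, k))
    return out


# Hypothetical frequent itemsets produced by some miner.
frequent_sets = [
    {"a"}, {"b"}, {"c"}, {"d"},
    {"a", "b"}, {"a", "c"}, {"b", "c"},
    {"a", "b", "c"},
]

maximal = maximal_itemsets(frequent_sets)
print(sorted("".join(sorted(s)) for s in maximal))  # ['abc', 'd']
print(expand(maximal) == {frozenset(s) for s in frequent_sets})  # True
```

Here eight frequent itemsets collapse to two maximal ones. The expansion recovers which itemsets are frequent but not their support counts, so recounting is still needed when supports matter.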
2003
Efficient mining of frequent patterns from large databases has been an active area of research since it is the most expensive step in association rules mining. In this paper, we present an algorithm for finding complete frequent patterns from very large dense datasets in a cluster environment. The data needs to be distributed to the nodes of the cluster only once and the mining can be performed in parallel many times with different parameter settings for minimum support. The algorithm is based on a master-slave scheme where a coordinator controls the data parallel programs running on a number of nodes of the cluster. The parallel program was executed on a cluster of Alpha SMPs. The performance of the algorithm was studied on small and large dense datasets. We report the results of the experiments that show both speed up and scale up of our algorithm along with our conclusions and pointers for further work.