2011, 2011 11th International Conference on Intelligent Systems Design and Applications
Due to the dramatic increase of data volumes in different applications, it is becoming infeasible to keep these data on one centralized machine. It is becoming more and more natural to deal with distributed databases and networks. That is why distributed data mining techniques have been introduced. One of the most important data mining problems is data clustering. While many clustering algorithms exist for centralized databases, there is a lack of efficient algorithms for distributed databases. In this paper, an efficient algorithm is proposed for clustering distributed databases. The proposed methodology employs an iterative optimization technique to achieve a better clustering objective. The experimental results reported in this paper show the superiority of the proposed technique over a recently proposed algorithm based on a distributed version of the well-known K-Means algorithm (Datta et al. 2009) [1].
TJPRC, 2013
Data clustering is the process of putting similar data into groups. A clustering algorithm partitions a data set into several groups such that the similarity within a group is larger than the similarity among groups. The central issue is to propose a new data mining algorithm that yields a better cluster configuration than previous algorithms. Determining the most appropriate cluster configuration is a challenging problem, and it is addressed in this paper.
IEEE Transactions on Knowledge and Data Engineering, 2009
Data-intensive Peer-to-Peer (P2P) networks are finding an increasing number of applications. Data mining in such P2P environments is a natural extension. However, common monolithic data mining architectures do not fit well in such environments, since they typically require centralizing the distributed data, which is usually not practical in a large P2P network. Distributed data mining algorithms that avoid large-scale synchronization or data centralization offer an alternative. This paper considers the distributed K-means clustering problem where the data and computing resources are distributed over a large P2P network. It offers two algorithms which produce an approximation of the result produced by the standard centralized K-means clustering algorithm. The first is designed to operate in a dynamic P2P network and can produce clusterings by "local" synchronization only. The second algorithm uses uniformly sampled peers and provides analytical guarantees regarding the accuracy of clustering on a P2P network. Empirical results show that both algorithms perform well compared to their centralized counterparts, at a modest communication cost.
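As a rough illustration of K-means with only "local" synchronization, the sketch below has each peer run one K-means update on its own points and then combine the result with the models of its immediate neighbours. This is a simplified, hypothetical rendering and not the algorithm from the paper: the function names `local_step` and `sync_with_neighbours` and the count-weighted averaging rule are assumptions made for the example.

```python
import math

def nearest(point, centroids):
    """Index of the centroid closest to `point` (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda j: math.dist(point, centroids[j]))

def local_step(points, centroids):
    """One K-means update on a single peer's data.

    Returns (new_centroids, counts). A centroid that attracts no local
    points is left unchanged.
    """
    k, dim = len(centroids), len(centroids[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for p in points:
        j = nearest(p, centroids)
        counts[j] += 1
        for d in range(dim):
            sums[j][d] += p[d]
    new_centroids = [
        [s / counts[j] for s in sums[j]] if counts[j] else list(centroids[j])
        for j in range(k)
    ]
    return new_centroids, counts

def sync_with_neighbours(own, neighbours):
    """Count-weighted average of a peer's model with its neighbours' models.

    Each model is a (centroids, counts) pair as returned by local_step.
    Only direct neighbours are consulted; no global coordinator is involved.
    """
    models = [own] + list(neighbours)
    own_centroids, _ = own
    k, dim = len(own_centroids), len(own_centroids[0])
    merged = []
    for j in range(k):
        total = sum(counts[j] for _, counts in models)
        if total == 0:
            merged.append(list(own_centroids[j]))  # nobody saw this cluster
            continue
        merged.append([
            sum(counts[j] * cents[j][d] for cents, counts in models) / total
            for d in range(dim)
        ])
    return merged
```

In a full run, every peer would alternate `local_step` and `sync_with_neighbours` until its centroids stop moving; how termination is detected in a dynamic network is left out of this sketch.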
For business and real-time applications, large data sets are used to extract unknown patterns, an approach termed data mining. Clustering and classification algorithms are used to classify unlabeled data from a large data set in a supervised or unsupervised manner. The assignment of a set of observations into subsets (called clusters), so that observations in the same cluster are similar in some way, is termed cluster analysis or clustering. Through these algorithms, the inferences of the clustering process and its competence in domain applications are determined. The research work deals with the K-means and MCL algorithms, which are among the most representative clustering algorithms.
Lecture Notes in Computer Science, 2007
Many parallel and distributed clustering algorithms have already been proposed. Most of them are based on the aggregation of local models according to some collected local statistics. In this paper, we propose a lightweight distributed clustering algorithm based on a minimum variance increase criterion, which requires very limited communication overhead. We also introduce the notion of distributed perturbation to improve the globally generated clustering. We show that this algorithm improves the quality of the overall clustering and manages to find the real structure and number of clusters of the global dataset.
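The minimum variance increase criterion mentioned above can be illustrated with the standard Ward merge cost: merging two clusters of sizes n_a and n_b with centroids c_a and c_b raises the total within-cluster sum of squares by n_a * n_b / (n_a + n_b) * ||c_a - c_b||^2. The sketch assumes that each site shares only (size, centroid) pairs as its local statistics; the function names are hypothetical, and the paper's actual merging and perturbation steps may differ.

```python
from itertools import combinations

def variance_increase(size_a, centroid_a, size_b, centroid_b):
    """Increase in within-cluster sum of squares caused by merging a and b."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(centroid_a, centroid_b))
    return size_a * size_b / (size_a + size_b) * sq_dist

def cheapest_merge(summaries):
    """Find the pair of clusters whose merge increases variance the least.

    `summaries` is a list of (size, centroid) pairs. Returns (i, j, cost).
    """
    return min(
        (
            (i, j, variance_increase(*summaries[i], *summaries[j]))
            for i, j in combinations(range(len(summaries)), 2)
        ),
        key=lambda candidate: candidate[2],
    )

def merge(summary_a, summary_b):
    """Summary (size, centroid) of the union of two clusters."""
    (size_a, centroid_a), (size_b, centroid_b) = summary_a, summary_b
    size = size_a + size_b
    centroid = tuple(
        (size_a * x + size_b * y) / size for x, y in zip(centroid_a, centroid_b)
    )
    return size, centroid
```

Because only sizes and centroids are exchanged, the communication cost per cluster is a handful of numbers regardless of how many points the cluster contains.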
Proceedings of the Twenty-First …, 2010
Several algorithms have been recently developed for distributed data clustering, which are applied when data cannot be concentrated on a single machine, for instance because of privacy reasons or due to network bandwidth limitations, or because of the huge amount of distributed ...
Applied Soft Computing, 2017
Clustering is one of the important data mining issues, especially for large and distributed data analysis. Distributed computing environments such as Peer-to-Peer (P2P) networks involve separated/scattered data sources, distributed among the peers. Owing to the unpredictable growth and dynamic nature of P2P networks, the data held by peers are constantly changing. Due to the high volume of computing and communications and to privacy concerns, these types of data should be processed in a distributed way and without central management. Today, most applications of P2P systems focus on unstructured P2P systems. In unstructured P2P networks, spreading gossip is a simple and efficient method of communication, which can adapt to dynamic conditions in these networks. Recently, some algorithms with different pros and cons have been proposed for data clustering in P2P networks. In this paper, by combining a novel method for extracting the representative data, a gossip-based protocol and a new centralized clustering method, a Gossip Based Distributed Clustering algorithm for P2P networks called GBDC-P2P is proposed. The GBDC-P2P algorithm is suitable for data clustering in unstructured P2P networks and it adapts to the dynamic conditions of these networks. In the GBDC-P2P algorithm, peers perform the data clustering operation with a distributed approach, only through communications with their neighbours. GBDC-P2P does not need to rely on a central server and it runs asynchronously. Evaluation results demonstrate the superior performance of the GBDC-P2P algorithm. Also, a comparative analysis with other well-established methods illustrates the efficiency of the proposed method.
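To illustrate why gossip suits unstructured networks, here is a minimal pairwise-averaging sketch. It is a generic gossip primitive, not the GBDC-P2P protocol itself: in each round a randomly chosen peer and one of its neighbours replace their values with their mean, so every peer's value drifts toward the global average without any central coordinator. The ring topology, the round count and the function name `gossip_average` are assumptions made for the example.

```python
import random

def gossip_average(values, neighbours, rounds, seed=0):
    """Pairwise gossip averaging.

    values:     initial value held by each peer (list of floats)
    neighbours: dict mapping a peer index to the list of peers it can contact
    rounds:     number of pairwise exchanges to simulate
    """
    rng = random.Random(seed)
    state = list(values)
    for _ in range(rounds):
        i = rng.randrange(len(state))      # a peer wakes up
        j = rng.choice(neighbours[i])      # and contacts one neighbour
        mean = (state[i] + state[j]) / 2
        state[i] = state[j] = mean         # both keep the average
    return state

# Hypothetical 8-peer ring: each peer only knows its two adjacent peers.
n = 8
ring = {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}
estimates = gossip_average([float(i) for i in range(n)], ring, rounds=2000)
```

Each exchange preserves the sum of the two values involved, so the network-wide total never changes while the differences between peers shrink. The same primitive can be applied to sums and counts to estimate centroids.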
Information Sciences, 2006
This paper describes a technique for clustering homogeneously distributed data in a peer-to-peer environment like sensor networks. The proposed technique is based on the principles of the K-Means algorithm. It works in a localized asynchronous manner by communicating with the neighboring nodes. The paper offers extensive theoretical analysis of the algorithm that bounds the error in the distributed clustering process compared to the centralized approach that requires downloading all the observed data to a single site. Experimental results show that, in contrast to the case when all the data is transmitted to a central location for application of the conventional clustering algorithm, the communication cost (an important consideration in sensor networks which are typically equipped with limited battery power) of the proposed approach is significantly smaller. At the same time, the accuracy of the obtained centroids is high and the number of samples which are incorrectly labeled is also small.
2016
In many popular applications, large amounts of data are distributed among multiple sources. Analyzing this data and identifying clusters is challenging due to storage, processing, and transmission costs. DCluster is a decentralized clustering algorithm capable of clustering distributed and dynamic data sets. Nodes continuously cooperate through decentralized gossip-based communication to maintain summarized views of the data set. The summarized view is the basis for executing the clustering algorithms to produce approximations of the final clustering results. DCluster can cluster a data set which is dispersed among a large number of nodes in a distributed environment. In DCluster the complete data set is clustered in a fully decentralized fashion, such that each node obtains an accurate clustering model without collecting the whole data set.
IEEE Internet Computing, 2000
Distributed data mining deals with the problem of data analysis in environments with distributed data, computing nodes, and users. Peer-to-peer computing is emerging as a new distributed computing paradigm for many novel applications that involve exchange of information among a large number of peers with little centralized coordination. Peer-to-peer file sharing, peer-to-peer electronic commerce, and peer-to-peer monitoring based on a network of sensors are some examples. This paper offers an overview of distributed data mining applications and algorithms for peer-to-peer environments. It describes both exact and approximate distributed data mining algorithms that work in a decentralized manner. It illustrates these approaches for the problem of computing and monitoring clusters in the data residing at the different nodes of a peer-to-peer network.
… Conference on Data Mining (DMIN'07), USA, 2007
Nowadays, huge amounts of data are naturally collected at distributed sites for various reasons, and moving these data through the network to extract useful knowledge is almost infeasible for either technical or policy reasons. Furthermore, classical parallel algorithms cannot be applied, especially in loosely coupled environments. This requires developing scalable distributed algorithms that can return the global knowledge by aggregating local results in an effective way. In this paper we propose a distributed algorithm based on independent local clustering processes and a global merging step based on minimum variance increases, which requires only a limited communication overhead. We also introduce the notion of distributed sub-cluster perturbation to improve the globally generated distribution. We show that this algorithm improves the quality of clustering compared to classical local centralized ones and is able to find the real nature or distribution of the global data.
2012
Clustering has become an increasingly important task in modern application domains such as marketing and purchasing assistance, multimedia, molecular biology etc. The goal of clustering is to decompose or partition a data set into groups such that both the intra-group similarity and the inter-group dissimilarity are maximized. In many applications, the size of the data that needs to be clustered is much more than what can be processed at a single site. Further, the data to be clustered could be inherently distributed. The increasing demand to scale up to these massive data sets which are inherently distributed over networks with limited bandwidth and computational resources has led to methods for parallel and distributed data clustering. In this thesis, we present a cohesive framework for cluster identification and outlier detection for distributed data. The core idea is to generate independent local models and combine the local models at a central server to obtain global clusters. ...
International Journal of Advanced Computer Science and Applications, 2019
Privacy and security have always been a concern that prevents the sharing of data and impedes the success of many projects. Distributed knowledge computing, if done correctly, plays a key role in solving such a problem. The main goal is to obtain valid results while ensuring the non-disclosure of data. Density-based clustering is a powerful algorithm in analyzing uncertain data that naturally occur and affect the performance of many applications like location-based services. Nowadays, a huge number of datasets have been introduced for researchers which involve high-dimensional data points with varying densities. Such datasets contain data points with high-density regions surrounded by data points with sparse density. The existing clustering approaches handle these situations inefficiently, especially in the context of distributed data. In this paper, we design a new decomposable density-based clustering algorithm for distributed datasets (DDBC). DDBC utilizes the concept of mutual k-nearest neighbor relationship to cluster distributed datasets with different density. The proposed DDBC algorithm is capable of preserving the privacy and security of data on each site by requiring a minimal number of transmissions to other sites.
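The mutual k-nearest neighbor relationship mentioned above holds between two points when each one appears in the other's list of k nearest neighbours. The brute-force sketch below illustrates only that relationship on a single site; it is not the DDBC algorithm, and the function names `k_nearest` and `mutual_knn_pairs` are hypothetical.

```python
import math

def k_nearest(points, i, k):
    """Indices of the k points closest to points[i], excluding i itself."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: math.dist(points[i], points[j]))
    return set(others[:k])

def mutual_knn_pairs(points, k):
    """All index pairs (i, j), i < j, that are in each other's k-NN lists."""
    knn = [k_nearest(points, i, k) for i in range(len(points))]
    return {
        (i, j)
        for i in range(len(points))
        for j in knn[i]
        if i < j and i in knn[j]
    }
```

A point on the sparse fringe of a dense region may list a dense point as its nearest neighbour without being listed in return, so it forms no mutual pair. This asymmetry is what makes the relationship useful when densities vary.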
2004
Clustering can be defined as the process of partitioning a set of patterns into disjoint and homogeneous meaningful groups, called clusters. The growing need for distributed clustering algorithms is attributed to the huge size of databases that is common nowadays. In this paper we propose a modification of a recently proposed algorithm, namely k-windows, that is able to achieve high quality results in distributed computing environments.
2015 2nd IEEE International Conference on Spatial Data Mining and Geographical Knowledge Services (ICSDM), 2015
Distributed data mining techniques, and distributed clustering in particular, have been widely used in the last decade because they deal with very large and heterogeneous datasets which cannot be gathered centrally. Current distributed clustering approaches normally generate global models by aggregating local results that are obtained on each site. While this approach analyses the datasets at their locations, the aggregation phase is complex and time consuming, and may produce incorrect and ambiguous global clusters and therefore incorrect knowledge. In this paper we propose a new clustering approach for very large spatial datasets that are heterogeneous and distributed. The approach is based on the K-means algorithm but it generates the number of global clusters dynamically, so it is not necessary to fix the number of clusters in advance. Moreover, this approach uses a very sophisticated aggregation phase, designed in such a way that the final clusters are compact and accurate while the overall process is efficient in time and memory allocation. Preliminary results show that the proposed approach scales up well in terms of running time and result quality. We also compared it to two other clustering algorithms, BIRCH and CURE, and show that this approach is much more efficient than the two algorithms.
Proceedings of the 2007 SIAM International Conference on Data Mining, 2007
In distributed data mining models, adopting a flat node distribution model can affect scalability. To address the problem of modularity, flexibility and scalability, we propose a hierarchically-distributed peer-to-peer architecture and algorithm for data clustering (HP2PC). The architecture is based on a multi-layer overlay network of peer neighborhoods. Supernodes, which act as representatives of neighborhoods, are recursively grouped to form higher level neighborhoods. Peers at a certain level of the hierarchy cooperate within their respective neighborhoods to perform clustering. Using this model, we can partition the clustering problem in a modular way, solve each part individually, then successively combine clusterings up the hierarchy where increasingly global solutions are computed. The algorithm was applied to a distributed document clustering problem and achieved decent speedup with comparable clustering quality to the centralized approach.
2020
Graph clustering is one of the key techniques for understanding the structures present in networks. In addition to clusters, the detection of bridges and outliers is also a critical task, as it plays an important role in the analysis of networks. Recently, several graph clustering methods have been developed and used in multiple application domains such as biological network analysis, recommendation systems and community detection. Most of these algorithms are based on the structural clustering algorithm. Yet this kind of algorithm relies on structural similarity, which requires parsing all of the graph's edges in order to compute it. The high cost of similarity computation makes this algorithm better suited to small graphs, without significant support for large-scale networks. In this paper, we propose a novel distributed graph clustering algorithm based on structural graph clustering. The experimental results show the efficiency in terms of r...
Categorizing the different types of data over a network is still an important research issue in the field of distributed clustering. There are different types of data, such as news, social networks and education. All of this text data is available in different resources. In the searching process, the server has to gather information about the keyword from different resources. As the system scales, this process places a growing burden on resources. We therefore introduce a framework that consists of an efficient grouping method and efficiently clusters text in the form of documents. It ensures that more text documents are clustered faster.
2015
Over years of research, clustering has been one of the most traditional data mining techniques for knowledge extraction from mass data storage and high-dimensional datasets. Unpreprocessed data (articles, prepositions, conjunctions, adverbs, etc.) may lead to irrelevant clusters. In this paper we propose feature set extraction through preprocessing of the input documents. Document weights are computed based on term frequency and inverse document frequency, and a mutation-based centroid is then computed for each iteration, except the initial iteration, during clustering of the documents.
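The term frequency and inverse document frequency weighting mentioned above can be sketched as follows. This is one common TF-IDF variant (term count divided by document length, multiplied by log(N / df)); the abstract does not say which variant the authors use, and the stop-word list and the function name `tfidf` are illustrative assumptions.

```python
import math
from collections import Counter

# Illustrative stop-word list; a real preprocessing step would use a fuller one.
STOP_WORDS = {"a", "an", "the", "of", "and", "in", "on", "to", "is"}

def tfidf(documents):
    """Return one {term: weight} dict per document.

    weight = (term count / document length) * log(N / document frequency)
    """
    tokenised = [
        [w for w in doc.lower().split() if w not in STOP_WORDS]
        for doc in documents
    ]
    n_docs = len(tokenised)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    weights = []
    for tokens in tokenised:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty documents
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights
```

Terms that occur in every document receive weight zero, and stop words never reach the weighting step, which is the effect the preprocessing described above is meant to achieve.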
The design of a distributed database environment is a major research issue. With characteristics such as robustness and the ability to scale, the Peer-to-Peer Distributed Database architecture has the potential to handle data efficiently. This work proposes an improved methodology for clustering sites based on a locality reference value in a Peer-to-Peer architecture, to address issues in the fragmentation and allocation phases of database design. It draws inspiration from previous work on predicate-based fragmentation and introduces a clustering approach for drafting the database architecture and allocating the fragmented data across the sites.