2011 10th International Conference on Machine Learning and Applications and Workshops, 2011
Efficient extraction of useful knowledge from such data is still a challenge, especially when the data are distributed, heterogeneous, and of varying quality depending on the local infrastructure. To reduce the overhead cost, most existing distributed clustering approaches generate global models by aggregating the local results obtained on each individual node. The complexity and quality of the solutions depend highly on the quality of this aggregation. In this respect, we propose an approach to distributed density-based clustering that both reduces the communication overhead due to data exchange and improves the quality of the global models by considering the shapes of the local clusters. Preliminary results show that this algorithm is very promising.
International Journal of Advanced Computer Science and Applications, 2019
Privacy and security have always been concerns that prevent the sharing of data and impede the success of many projects. Distributed knowledge computing, if done correctly, plays a key role in solving this problem: the main goal is to obtain valid results while ensuring the non-disclosure of data. Density-based clustering is a powerful approach for analyzing the uncertain data that naturally occur in, and affect the performance of, many applications such as location-based services. Nowadays, many datasets available to researchers involve high-dimensional data points with varying densities: they contain high-density regions surrounded by data points of sparse density. Existing clustering approaches handle these situations inefficiently, especially in the context of distributed data. In this paper, we design a new decomposable density-based clustering algorithm for distributed datasets (DDBC). DDBC utilizes the concept of the mutual k-nearest-neighbor relationship to cluster distributed datasets with varying densities. The proposed DDBC algorithm is capable of preserving the privacy and security of the data on each site by requiring a minimal number of transmissions to other sites.
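As a rough, centralized illustration of the mutual k-nearest-neighbor idea only (the function names are ours, and the paper's distributed, privacy-preserving DDBC protocol is not reproduced here): two points are linked when each is among the other's k nearest neighbors, and the connected components of that graph form the clusters.

```python
# Minimal sketch of mutual k-NN clustering (not the authors' DDBC protocol):
# link i and j iff each is in the other's k-NN list, then take connected components.
import numpy as np
from collections import defaultdict

def mutual_knn_clusters(X, k=5):
    n = len(X)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise distances
    np.fill_diagonal(d, np.inf)                                 # a point is not its own neighbor
    knn = np.argsort(d, axis=1)[:, :k]                          # k nearest neighbors per point
    knn_sets = [set(row) for row in knn]
    adj = defaultdict(list)
    for i in range(n):
        for j in knn[i]:
            if i in knn_sets[j]:                                # mutual k-NN relationship
                adj[i].append(j)
    labels, cid = [-1] * n, 0
    for s in range(n):                                          # connected components via DFS
        if labels[s] != -1:
            continue
        stack = [s]
        while stack:
            u = stack.pop()
            if labels[u] != -1:
                continue
            labels[u] = cid
            stack.extend(adj[u])
        cid += 1
    return labels

X = np.random.rand(200, 2)
print(mutual_knn_clusters(X, k=6)[:10])
```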
2003
While data mining algorithms invariably operate on centralized data, in practice related information is often acquired and stored at geographically distributed locations due to organizational or operational constraints. Centralizing such data before analysis may not be desirable because of computational or bandwidth costs. In some cases, it may not even be possible due to a variety of real-life constraints, including security, privacy, the proprietary nature of data/software, and the accompanying ownership and legal issues. This paper briefly describes how one can achieve distributed clustering in two different settings that impose severe constraints on the data or knowledge that can be shared among (local) data sites. The first allows only the cluster labels of individual objects to be shared, but not their attributes. The second disallows sharing of the attributes or cluster labels of individual objects altogether; in this case, generative (probabilistic) models of the local data are used to generate "virtual samples" that are then used to obtain a "global" solution. Applications are identified for both of these settings.
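The second setting can be sketched as follows. This is a minimal illustration under an assumption of our own, namely a single Gaussian per site; the paper's generative models may be considerably richer.

```python
# Hedged sketch of the "virtual samples" setting: each site shares only the
# parameters of a local generative model (here, one Gaussian per site, a
# simplification); the server samples from those models and clusters the samples.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

def local_model(X):
    """A site reports only summary parameters, never raw records."""
    return X.mean(axis=0), np.cov(X, rowvar=False)

def virtual_samples(params, n=500):
    mean, cov = params
    return rng.multivariate_normal(mean, cov, size=n)

# two sites with private data
site_a = rng.normal(loc=0.0, scale=1.0, size=(1000, 2))
site_b = rng.normal(loc=5.0, scale=1.0, size=(1000, 2))

virtual = np.vstack([virtual_samples(local_model(s)) for s in (site_a, site_b)])
labels = KMeans(n_clusters=2, n_init=10).fit_predict(virtual)  # "global" solution
```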
Categorizing the different types of data over a network is still an important research issue in the field of distributed clustering. Data comes in many forms, such as news, social networks, and education, and this text data is available from different resources. During the search process, the server has to gather information about the keyword from those different resources; because of the scale involved, this process places a heavy burden on them. We therefore introduce a framework consisting of an efficient grouping method that clusters text, in the form of documents, efficiently. It guarantees that large numbers of text documents are clustered faster.
Proceedings of the 31st Annual ACM Symposium on Applied Computing - SAC '16, 2016
Clustering is a fundamental task in knowledge discovery and data mining. It aims to discover the unknown nature of data by grouping together data objects that are most similar. While hundreds of clustering algorithms have been proposed, many are complex and do not scale well as more data become available, making them inadequate for analyzing very large datasets. In addition, many clustering algorithms are sequential, and thus inherently difficult to parallelize. We propose PatchWork, a novel clustering algorithm that addresses these issues. PatchWork is a distributed density clustering algorithm with linear computational complexity and linear horizontal scalability. It presents several characteristics desirable in knowledge discovery: in particular, it does not require the number of clusters a priori, and it offers natural protection against outliers and noise. In addition, PatchWork makes it possible to discover spatially large clusters instead of dense clusters only. PatchWork relies on the map/reduce paradigm to parallelize computations and was implemented using Apache Spark, the distributed computation framework. As a result, PatchWork can cluster a billion points in only a few minutes, a 40x improvement over the distributed implementation of k-means in Spark MLlib.
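A single-machine sketch of the grid-based intuition behind this kind of density clustering (the actual PatchWork runs on Apache Spark; the cell size and density threshold names below are ours): map each point to a grid cell, reduce by counting, keep dense cells, and merge adjacent dense cells into clusters.

```python
# Illustrative sketch of grid-based density clustering in a map/reduce style,
# written in plain Python so the logic is visible (not the PatchWork codebase).
from collections import Counter, deque

def grid_cluster(points, cell_size=1.0, min_pts=5):
    # "map": assign each point to a grid cell; "reduce": count points per cell
    counts = Counter((int(x // cell_size), int(y // cell_size)) for x, y in points)
    dense = {c for c, n in counts.items() if n >= min_pts}
    labels, cid = {}, 0
    for cell in dense:                         # merge adjacent dense cells (BFS)
        if cell in labels:
            continue
        queue = deque([cell])
        while queue:
            cx, cy = queue.popleft()
            if (cx, cy) in labels or (cx, cy) not in dense:
                continue
            labels[(cx, cy)] = cid
            queue.extend((cx + dx, cy + dy)
                         for dx in (-1, 0, 1) for dy in (-1, 0, 1))
        cid += 1
    # each point inherits its cell's label (-1 = noise)
    return [labels.get((int(x // cell_size), int(y // cell_size)), -1)
            for x, y in points]

pts = [(0.1, 0.2), (0.3, 0.4), (0.2, 0.1), (0.15, 0.35), (0.05, 0.45), (5.0, 5.0)]
print(grid_cluster(pts, cell_size=1.0, min_pts=5))  # [0, 0, 0, 0, 0, -1]
```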
Engineering Applications of Artificial Intelligence, 2006
In this paper we address confidentiality issues in distributed data clustering, particularly the inference problem. We present the KDEC-S algorithm for distributed data clustering, which is shown to provide mining results while preserving the confidentiality of the original data. We also present a confidentiality framework with which we can state the confidentiality level of KDEC-S. The underlying idea of KDEC-S is to use an approximation of density estimation such that the original data cannot be reconstructed beyond a given extent.
… Conference on Data Mining (DMIN'07), USA, 2007
Nowadays, huge amounts of data are naturally collected at distributed sites, and moving these data across the network to extract useful knowledge is often unfeasible for either technical or policy reasons. Furthermore, classical parallel algorithms cannot be applied, especially in loosely coupled environments. This calls for scalable distributed algorithms able to return the global knowledge by aggregating local results in an effective way. In this paper we propose a distributed algorithm based on independent local clustering processes and a global merging step based on minimum variance increase, which requires limited communication overhead. We also introduce the notion of distributed sub-cluster perturbation to improve the globally generated distribution. We show that this algorithm improves the quality of clustering compared to classical centralized ones and is able to find the real global nature or distribution of the data.
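The minimum-variance-increase merging can be sketched as follows, under assumptions of our own: each local sub-cluster is summarized by a count and a mean, and pairs are merged greedily with a Ward-style criterion. The paper's exact merging rule and its perturbation step are not reproduced here.

```python
# Hedged sketch: merge local sub-cluster summaries by smallest variance increase
# (Ward-style criterion), stopping when the cheapest merge exceeds a threshold.
import numpy as np

def ward_increase(c1, c2):
    n1, m1 = c1
    n2, m2 = c2
    return (n1 * n2) / (n1 + n2) * np.sum((m1 - m2) ** 2)

def merge_subclusters(subclusters, max_increase):
    """subclusters: list of (count, mean) summaries sent by the local sites."""
    clusters = list(subclusters)
    while len(clusters) > 1:
        pairs = [(ward_increase(clusters[i], clusters[j]), i, j)
                 for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
        inc, i, j = min(pairs)
        if inc > max_increase:          # stop: merging would inflate variance
            break
        (n1, m1), (n2, m2) = clusters[i], clusters[j]
        merged = (n1 + n2, (n1 * m1 + n2 * m2) / (n1 + n2))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters

subs = [(100, np.array([0.0, 0.0])), (80, np.array([0.2, 0.1])), (120, np.array([5.0, 5.0]))]
print(merge_subclusters(subs, max_increase=50.0))   # first two merge; third stays apart
```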
2016
In many popular applications, large amounts of data are distributed among multiple sources. Analyzing such data and identifying clusters is challenging due to storage, processing, and transmission costs. We present a decentralized clustering algorithm called DCluster, which is capable of clustering distributed and dynamic data sets. Nodes continuously cooperate through decentralized gossip-based communication to maintain summarized views of the data set. These summarized views form the basis for executing the clustering algorithm and producing approximations of the final clustering results. DCluster can cluster a data set that is dispersed among a large number of nodes in a distributed environment: the complete data set is clustered in a fully decentralized fashion, such that each node obtains an accurate clustering model without collecting the whole data set.
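The gossip mechanism might look roughly like this (assumed mechanics for illustration only, not DCluster's actual protocol): each node keeps a weighted-centroid summary and repeatedly averages it with a random peer's, so every node drifts toward a shared view of the data.

```python
# Minimal gossip-averaging sketch: nodes exchange and merge summaries pairwise.
import random
import numpy as np

class Node:
    def __init__(self, points):
        self.weight = len(points)                  # summary: (weight, centroid)
        self.centroid = np.mean(points, axis=0)

    def gossip(self, peer):
        w = self.weight + peer.weight
        c = (self.weight * self.centroid + peer.weight * peer.centroid) / w
        self.weight = peer.weight = w / 2          # split the combined mass
        self.centroid = peer.centroid = c          # both adopt the merged view

rng = np.random.default_rng(2)
nodes = [Node(rng.normal(loc=i % 3, size=(50, 2))) for i in range(10)]
for _ in range(30):                                # gossip rounds
    a, b = random.sample(nodes, 2)
    a.gossip(b)

spread = max(np.linalg.norm(n.centroid - nodes[0].centroid) for n in nodes)
print(f"max centroid disagreement after gossip: {spread:.3f}")
```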
2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA), 2016
Clustering techniques are very attractive for extracting and identifying patterns in datasets. However, their application to very large spatial datasets presents numerous challenges, such as high-dimensional data, heterogeneity, and the high complexity of some algorithms. For instance, some algorithms may have linear complexity but require domain knowledge to determine their input parameters. Distributed clustering techniques constitute a very good alternative for the big data challenges (e.g., volume, variety, veracity, and velocity). Usually these techniques consist of two phases: the first phase generates local models or patterns, and the second aggregates the local results to obtain global models. While the first phase can be executed in parallel on each site and is therefore efficient, the aggregation phase is complex, time consuming, and may produce incorrect and ambiguous global clusters and therefore incorrect models. In this paper we propose a new distributed clustering approach to deal efficiently with both phases: the generation of local results and the generation of global models by aggregation. For the first phase, our approach is capable of analyzing the datasets located on each site using different clustering techniques. The aggregation phase is designed in such a way that the final clusters are compact and accurate while the overall process is efficient in time and memory allocation. For the evaluation, we use two well-known clustering algorithms: k-means and DBSCAN. One of the key outputs of this distributed clustering technique is that the number of global clusters is dynamic; it does not need to be fixed in advance. Experimental results show that the approach is scalable and produces high-quality results.
Journal of Software Engineering and Applications, 2010
Finding clusters in data is a challenging problem, especially when the clusters are of widely varying shapes, sizes, and densities. Herein, a new scalable clustering technique that addresses all of these issues is proposed. In data mining, the purpose of data clustering is to identify useful patterns in the underlying dataset. Within the last several years, many clustering algorithms have been proposed in this area of research. Among all these proposed methods, density-based clustering methods are the most important due to their high ability to detect clusters of arbitrary shape. Moreover, these methods often show good noise-handling capability, where clusters are defined as regions of typical density separated by regions of low or no density. In this paper, we aim at enhancing the well-known algorithm DBSCAN to make it scalable and able to discover clusters from uneven datasets in which clusters are regions of homogeneous density. We achieve the scalability of the proposed algorithm by using the k-means algorithm to obtain an initial partition of the dataset, applying the enhanced DBSCAN on each partition, and then using a merging process to obtain the actual natural number of clusters in the underlying dataset. The proposed algorithm thus consists of three stages. Experimental results using synthetic datasets show that the proposed clustering algorithm is faster and more scalable than the enhanced DBSCAN counterpart.
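The three-stage pipeline could be sketched with stock scikit-learn components; note that the paper uses an enhanced DBSCAN, for which plain DBSCAN stands in here, and the merging stage, where the paper's contribution lies, is only indicated.

```python
# Sketch of the partition-then-cluster pipeline: k-means partitioning (stage 1),
# DBSCAN per partition (stage 2), merging across partitions (stage 3, omitted).
import numpy as np
from sklearn.cluster import KMeans, DBSCAN

def partitioned_dbscan(X, n_partitions=4, eps=0.3, min_samples=5):
    part = KMeans(n_clusters=n_partitions, n_init=10).fit_predict(X)   # stage 1
    labels = np.full(len(X), -1)
    offset = 0
    for p in range(n_partitions):                                      # stage 2
        idx = np.where(part == p)[0]
        local = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X[idx])
        keep = local != -1
        labels[idx[keep]] = local[keep] + offset
        offset += local.max() + 1 if keep.any() else 0
    # stage 3 would merge clusters split across partition borders
    return labels

X = np.vstack([np.random.randn(200, 2), np.random.randn(200, 2) + 5])
print(set(partitioned_dbscan(X)))
```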
arXiv: Computation, 2020
In many modern applications, there is interest in analyzing enormous data sets that cannot be easily moved across computers or loaded into memory on a single computer. In such settings, it is very common to be interested in clustering. Existing distributed clustering algorithms are mostly distance- or density-based, without a likelihood specification, precluding the possibility of formal statistical inference. We introduce a nearly embarrassingly parallel algorithm using a Bayesian finite mixture of mixtures model for distributed clustering, which we term distributed Bayesian clustering (DIB-C). DIB-C can flexibly accommodate data sets with various shapes (e.g. skewed or multi-modal). With data randomly partitioned and distributed, we first run Markov chain Monte Carlo in an embarrassingly parallel manner to obtain local clustering draws and then refine across nodes for a final cluster estimate based on any loss function on the space of partitions. DIB-C can also provide a posterior p...
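The embarrassingly parallel stage can be illustrated as follows. This is a loose stand-in: maximum-likelihood Gaussian mixtures per shard instead of the paper's MCMC draws, and the cross-node refinement step is omitted entirely.

```python
# Sketch of the "embarrassingly parallel" local-clustering stage only: each
# worker fits a mixture model to its randomly assigned shard independently.
import numpy as np
from concurrent.futures import ProcessPoolExecutor
from sklearn.mixture import GaussianMixture

def fit_shard(shard):
    gm = GaussianMixture(n_components=3, n_init=3).fit(shard)
    return gm.means_, gm.weights_          # local clustering summary

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    data = rng.normal(size=(12_000, 2))
    shards = np.array_split(rng.permutation(data), 4)   # random partitioning
    with ProcessPoolExecutor() as pool:
        local_fits = list(pool.map(fit_shard, shards))  # one fit per "node"
    # a refinement step across nodes would follow here
```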
2004
Clustering can be defined as the process of partitioning a set of patterns into disjoint and homogeneous meaningful groups, called clusters. The growing need for distributed clustering algorithms is attributed to the huge database sizes that are common nowadays. In this paper we propose a modification of a recently proposed algorithm, namely k-windows, that is able to achieve high-quality results in distributed computing environments.
2016 IEEE 16th International Conference on Data Mining Workshops (ICDMW), 2016
In this paper, we present a new distributed clustering approach for spatial datasets, based on an innovative and efficient aggregation technique. The approach consists of two phases: 1) a local clustering phase, where each node performs clustering on its local data, and 2) an aggregation phase, where the local clusters are aggregated to produce global clusters. The approach is characterised by the fact that the local clusters are represented in a simple and efficient way, and the aggregation phase is designed in such a way that the final clusters are compact and accurate while the overall process is efficient in both response time and memory allocation. We evaluated the approach on different datasets and compared it to well-known clustering techniques. The experimental results show that our approach is very promising and outperforms all of those algorithms.
Communications in Computer and Information Science, 2018
The analysis of big data requires powerful, scalable, and accurate data analytics techniques that traditional data mining and machine learning approaches do not offer as a whole. Therefore, new data analytics frameworks are needed to deal with the big data challenges, such as the volume, velocity, veracity, and variety of the data. Distributed data mining constitutes a promising approach for big data sets, as they are usually produced at distributed locations, and processing them on their local sites significantly reduces response times, communication costs, etc. In this paper, we study the performance of a distributed clustering technique called Dynamic Distributed Clustering (DDC). DDC has the ability to remotely generate clusters and then aggregate them using an efficient aggregation algorithm. The technique is developed for spatial datasets. We evaluated DDC using two types of communication (synchronous and asynchronous) and tested it using various load distributions. The experimental results show that the approach achieves super-linear speed-up, scales up very well, and can take advantage of recent programming models, such as the MapReduce model, as its results are not affected by the type of communication.
2015 2nd IEEE International Conference on Spatial Data Mining and Geographical Knowledge Services (ICSDM), 2015
Distributed data mining techniques, and distributed clustering in particular, have been widely used in the last decade because they deal with very large and heterogeneous datasets that cannot be gathered centrally. Current distributed clustering approaches normally generate global models by aggregating local results obtained on each site. While this approach analyses the datasets at their locations, the aggregation phase is complex, time consuming, and may produce incorrect and ambiguous global clusters and therefore incorrect knowledge. In this paper we propose a new clustering approach for very large spatial datasets that are heterogeneous and distributed. The approach is based on the k-means algorithm but generates the number of global clusters dynamically; it is not necessary to fix the number of clusters in advance. Moreover, this approach uses a very sophisticated aggregation phase, designed in such a way that the final clusters are compact and accurate while the overall process is efficient in time and memory allocation. Preliminary results show that the proposed approach scales up well in terms of running time and result quality. We also compared it to two other clustering algorithms, BIRCH and CURE, and show clearly that our approach is much more efficient than both.
2012
Clustering has become an increasingly important task in modern application domains such as marketing and purchasing assistance, multimedia, molecular biology, etc. The goal of clustering is to decompose or partition a data set into groups such that both the intra-group similarity and the inter-group dissimilarity are maximized. In many applications, the size of the data that needs to be clustered is much more than what can be processed at a single site. Further, the data to be clustered could be inherently distributed. The increasing demand to scale up to these massive data sets, which are inherently distributed over networks with limited bandwidth and computational resources, has led to methods for parallel and distributed data clustering. In this thesis, we present a cohesive framework for cluster identification and outlier detection for distributed data. The core idea is to generate independent local models and combine the local models at a central server to obtain global clusters. ...
ArXiv, 2018
Density-based clustering techniques are used in a wide range of data mining applications. One of their most attractive features consists in not requiring prior knowledge of the number of clusters that a dataset contains, nor of their shape. In this paper we propose a new algorithm named Linear DBSCAN (Lin-DBSCAN), a simple approach to clustering inspired by the density model introduced with the well-known algorithm DBSCAN. Designed to minimize the computational cost of density-based clustering on geospatial data, Lin-DBSCAN features a linear time complexity that makes it suitable for real-time applications on low-resource devices. Lin-DBSCAN uses a discrete version of the density model of DBSCAN that takes advantage of a grid-based scan-and-merge approach. The name of the algorithm stems exactly from its main features outlined above. The algorithm was tested with well-known data sets. Experimental results prove the efficiency and the validity of this approach over DBSCAN ...
DBSCAN, a density-based clustering method for multi-dimensional points, was proposed in 1996.
TJPRC, 2013
Data clustering is the process of putting similar data into groups. A clustering algorithm partitions a data set into several groups such that the similarity within a group is larger than the similarity among groups. The central issue is to propose a new data mining algorithm that produces a better cluster configuration than previous algorithms. The issue of determining the most appropriate cluster configuration is a challenging one, and is addressed in this paper.