2015, 2015 2nd IEEE International Conference on Spatial Data Mining and Geographical Knowledge Services (ICSDM)
Distributed data mining techniques, and distributed clustering in particular, have been widely used in the last decade because they deal with very large and heterogeneous datasets which cannot be gathered centrally. Current distributed clustering approaches normally generate global models by aggregating local results that are obtained on each site. While this approach analyses the datasets at their locations, the aggregation phase is complex, time consuming and may produce incorrect and ambiguous global clusters and therefore incorrect knowledge. In this paper we propose a new clustering approach for very large spatial datasets that are heterogeneous and distributed. The approach is based on the K-means algorithm but it generates the number of global clusters dynamically, so the number of clusters does not need to be fixed in advance. Moreover, this approach uses a very sophisticated aggregation phase. The aggregation phase is designed in such a way that the final clusters are compact and accurate while the overall process is efficient in time and memory allocation. Preliminary results show that the proposed approach scales up well in terms of running time and result quality. We also compared it to two other clustering algorithms, BIRCH and CURE, and we show clearly that this approach is much more efficient than the two algorithms.
2016 IEEE 16th International Conference on Data Mining Workshops (ICDMW), 2016
In this paper, we present a new approach to distributed clustering for spatial datasets, based on an innovative and efficient aggregation technique. This distributed approach consists of two phases: 1) a local clustering phase, where each node performs a clustering on its local data, and 2) an aggregation phase, where the local clusters are aggregated to produce global clusters. This approach is characterised by the fact that the local clusters are represented in a simple and efficient way. The aggregation phase is designed in such a way that the final clusters are compact and accurate while the overall process is efficient in both response time and memory allocation. We evaluated the approach with different datasets and compared it to well-known clustering techniques. The experimental results show that our approach is very promising and outperforms all those algorithms.
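The two-phase structure described above can be illustrated with a minimal sketch. The abstract does not say how local clusters are represented or how they are merged, so everything here is an assumption made for illustration: each local cluster is reduced to a (centroid, size) pair, and two summaries are merged when their centroids lie within a hypothetical `merge_radius`. The function names `summarise` and `aggregate` are also hypothetical.

```python
from math import dist


def summarise(points):
    """Reduce one local cluster to a compact (centroid, size) summary."""
    n = len(points)
    dims = len(points[0])
    centroid = tuple(sum(p[d] for p in points) / n for d in range(dims))
    return centroid, n


def aggregate(summaries, merge_radius):
    """Merge local summaries whose centroids are within merge_radius.

    Assumption: centroid distance is the merge rule. Union-find groups
    summaries that are linked directly or through a chain of neighbours.
    """
    parent = list(range(len(summaries)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(summaries)):
        for j in range(i + 1, len(summaries)):
            if dist(summaries[i][0], summaries[j][0]) <= merge_radius:
                parent[find(i)] = find(j)

    groups = {}
    for i, summary in enumerate(summaries):
        groups.setdefault(find(i), []).append(summary)

    merged = []
    for members in groups.values():
        total = sum(size for _, size in members)
        dims = len(members[0][0])
        # Size-weighted mean of the member centroids.
        centroid = tuple(
            sum(c[d] * s for c, s in members) / total for d in range(dims)
        )
        merged.append((centroid, total))
    return merged


# Two nodes: node A holds two local clusters, node B holds one.
node_a = [summarise([(0, 0), (0, 2)]), summarise([(10, 10), (10, 12)])]
node_b = [summarise([(1, 0), (1, 2)])]
global_clusters = aggregate(node_a + node_b, merge_radius=2.0)
```

Only the summaries travel between nodes in this sketch, not the raw points, and the number of global clusters falls out of the merge step rather than being fixed in advance.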
International Journal of Parallel, Emergent and Distributed Systems, 2018
Clustering techniques are very attractive for identifying and extracting patterns of interest from datasets. However, their application to very large spatial datasets presents numerous challenges such as high dimensionality, heterogeneity, and the high complexity of some algorithms. Distributed clustering techniques constitute a very good alternative for addressing the Big Data challenges (e.g., Volume, Variety, Veracity, and Velocity). In this paper, we developed and implemented a Dynamic Parallel and Distributed clustering (DPDC) approach that can analyse Big Data within a reasonable response time and produce accurate results, by using existing and current computing and storage infrastructure, such as cloud computing. The DPDC approach consists of two phases. The first phase is fully parallel and generates local clusters, and the second phase aggregates the local results to obtain global clusters. The aggregation phase is designed in such a way that the final clusters are compact and accurate while the overall process is efficient in time and memory allocation. DPDC was thoroughly tested and compared to the well-known clustering algorithms BIRCH and CURE. The results show that the approach not only produces high-quality results but also scales up very well by taking advantage of the Hadoop MapReduce paradigm or any distributed system.
Communications in Computer and Information Science, 2018
The analysis of big data requires powerful, scalable, and accurate data analytics techniques that the traditional data mining and machine learning do not have as a whole. Therefore, new data analytics frameworks are needed to deal with the big data challenges such as volumes, velocity, veracity, variety of the data. Distributed data mining constitutes a promising approach for big data sets, as they are usually produced in distributed locations, and processing them on their local sites will reduce significantly the response times, communications, etc. In this paper, we propose to study the performance of a distributed clustering, called Dynamic Distributed Clustering (DDC). DDC has the ability to remotely generate clusters and then aggregate them using an efficient aggregation algorithm. The technique is developed for spatial datasets. We evaluated the DDC using two types of communications (synchronous and asynchronous), and tested using various load distributions. The experimental results show that the approach has super-linear speed-up, scales up very well, and can take advantage of the recent programming models, such as MapReduce model, as its results are not affected by the types of communications.
ArXiv, 2017
In this paper we propose a new approach for Big Data mining and analysis. This new approach works well on distributed datasets and deals with the data clustering task of the analysis. The approach consists of two main phases. The first phase executes a clustering algorithm on local data, assuming that the datasets were already distributed among the system processing nodes. The second phase deals with the aggregation of local clusters to generate global clusters. This approach not only generates local clusters on each processing node in parallel, but also facilitates the formation of global clusters without prior knowledge of the number of clusters, which many partitioning clustering algorithms require. In this study, this approach was applied to spatial datasets. The proposed aggregation phase is very efficient and does not involve the exchange of large amounts of data between the processing nodes. The experimental results show that the approach has super linear speed up, scales up very ...
2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA), 2016
Clustering techniques are very attractive for extracting and identifying patterns in datasets. However, their application to very large spatial datasets presents numerous challenges such as high-dimensional data, heterogeneity, and the high complexity of some algorithms. For instance, some algorithms may have linear complexity but they require domain knowledge in order to determine their input parameters. Distributed clustering techniques constitute a very good alternative for addressing the big data challenges (e.g., Volume, Variety, Veracity, and Velocity). Usually these techniques consist of two phases. The first phase generates local models or patterns and the second one aggregates the local results to obtain global models. While the first phase can be executed in parallel on each site and is therefore efficient, the aggregation phase is complex, time consuming and may produce incorrect and ambiguous global clusters and therefore incorrect models. In this paper we propose a new distributed clustering approach to deal efficiently with both phases: generation of local results and generation of global models by aggregation. For the first phase, our approach is capable of analysing the datasets located in each site using different clustering techniques. The aggregation phase is designed in such a way that the final clusters are compact and accurate while the overall process is efficient in time and memory allocation. For the evaluation, we use two well-known clustering algorithms: K-Means and DBSCAN. One of the key outputs of this distributed clustering technique is that the number of global clusters is dynamic; it does not need to be fixed in advance. Experimental results show that the approach is scalable and produces high quality results.
TJPRC, 2013
Data clustering is a process of putting similar data into groups. A clustering algorithm partitions a data set into several groups such that the similarity within a group is larger than among groups. The central issue is to propose a new data mining algorithm that yields a better cluster configuration than previous algorithms. The issue of determining the most appropriate cluster configuration is a challenging one, and is addressed in this paper.
In Proc. of the 20th VLDB Conference, 1994
Spatial data mining is the discovery of interesting relationships and characteristics that may exist implicitly in spatial databases. In this paper, we explore whether clustering methods have a role to play in spatial data mining.
For business and real-time applications, large data sets are used to extract unknown patterns, which is termed the data mining approach. Clustering and classification algorithms are used to classify the unlabeled data from the large data set in a supervised and unsupervised manner. The assignment of a set of observations into subsets (called clusters) so that observations in the same cluster are similar in some way is termed cluster analysis or clustering. Through these algorithms the inferences of the clustering process and its domain application competence are determined. The research work deals with the K-means and MCL algorithms, which are most delegated clustering algorithms.
Lecture Notes in Computer Science, 2007
Many parallel and distributed clustering algorithms have already been proposed. Most of them are based on the aggregation of local models according to some collected local statistics. In this paper, we propose a lightweight distributed clustering algorithm based on a minimum variance increase criterion, which requires a very limited communication overhead. We also introduce the notion of distributed perturbation to improve the globally generated clustering. We show that this algorithm improves the quality of the overall clustering and manages to find the real structure and number of clusters of the global dataset.
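As a rough illustration of a minimum-variance-increase merge rule, the sketch below uses the standard Ward-style quantity: merging two clusters with sizes n_a, n_b and centroids c_a, c_b raises the total within-cluster sum of squares by n_a * n_b / (n_a + n_b) * ||c_a - c_b||^2. Whether the paper uses exactly this quantity is an assumption; the abstract only names the criterion. The function names and the (centroid, size) summary format are hypothetical.

```python
from itertools import combinations


def variance_increase(a, b):
    """Increase in within-cluster sum of squares if summaries a and b merge.

    Each summary is a (centroid, size) pair, so only these small
    statistics would need to be exchanged between sites.
    """
    (centroid_a, size_a), (centroid_b, size_b) = a, b
    squared_gap = sum((x - y) ** 2 for x, y in zip(centroid_a, centroid_b))
    return size_a * size_b / (size_a + size_b) * squared_gap


def best_merge(summaries):
    """Return the index pair whose merge increases the variance the least."""
    return min(
        combinations(range(len(summaries)), 2),
        key=lambda pair: variance_increase(summaries[pair[0]], summaries[pair[1]]),
    )


summaries = [((0.0, 0.0), 10), ((1.0, 0.0), 10), ((5.0, 5.0), 2)]
pair = best_merge(summaries)  # the two large, adjacent sub-clusters
```

Because the criterion depends only on sizes and centroids, it fits the limited communication overhead the abstract emphasises.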
2006
In this paper, a novel clustering algorithm is proposed to address the clustering problem within both spatial and non-spatial domains by employing a fusion-based approach. The motivation for this work is to overcome the limitations of the existing spatial clustering methods. In most conventional spatial clustering algorithms, the similarity measurement mainly takes the geometric attributes into consideration. However, in many real applications, there is a need to fuse the information from both the spatial and the non-spatial attributes. The goal of our approach is to create and optimize clusters, such that the data objects satisfy both spatial and non-spatial similarity constraints. The proposed algorithm first captures the spatial cores having the highest structure and then employs an iterative, heuristic mechanism to determine the optimal number of spatial cores and non-spatial clusters that exist in the data. Such a fusion-based framework allows for comparing clusters in spatial and non-spatial contexts. The correctness and efficiency of the proposed clustering algorithm is demonstrated on real world data sets.
2016
In many popular applications large amounts of data are distributed among multiple sources. Analysing this data and identifying clusters is challenging due to storage, processing, and transmission costs. DCluster is a decentralized clustering algorithm capable of clustering distributed and dynamic data sets. Nodes continuously cooperate through decentralized gossip-based communication to maintain summarized views of the data set. The summarized view is a basis for executing the clustering algorithms to produce approximations of the final clustering results. DCluster can cluster a data set which is dispersed among a large number of nodes in a distributed environment. In DCluster the complete data set is clustered in a fully decentralized fashion, such that each node obtains an accurate clustering model without collecting the whole data set.
ipcsit.com
Spatial clustering has been recognized as a primary data mining method for knowledge discovery in spatial databases. In this paper, we have analyzed an efficient method for the fusion of the outputs of the various clusterers, with fewer computations. We have discussed our proposed layered merging technique for spatial datasets and used it in our clustering combination technique. A voting procedure is normally used to assign labels to the clusters and to resolve the correspondence problem when partitions are produced by different clusterers. We have eliminated the need for such voting by using matching groups. Based on the cardinality and the set intersections, the most likely cluster groups across different clusterers are grouped as matching pairs. When more than 50 percent of the clusterers agree upon the groupings, they are resolved into the final partition. In our method, as we travel down the layered merge, we calculate a degree of agreement (DOA) factor, based on the count of agreed clusterers. Using the updated DOA at every layer, the movement of unresolved, unsettled data elements is handled at a much reduced computational cost. An added advantage of this approach is the reuse of the knowledge gained in previous layers, thereby yielding better cluster accuracy and robustness.
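The idea of pairing clusters through set intersections can be sketched as follows. This is only an illustrative assumption about the matching step: each cluster from one clusterer is paired with the cluster from another clusterer that shares the most members. The layered merge and the DOA factor are not modelled, and the name `match_clusters` is hypothetical.

```python
def match_clusters(partition_a, partition_b):
    """Pair each cluster in partition_a with its largest-overlap cluster in partition_b.

    Partitions map a cluster label to the set of element ids it contains.
    """
    matches = {}
    for label_a, members_a in partition_a.items():
        matches[label_a] = max(
            partition_b,
            key=lambda label_b: len(members_a & partition_b[label_b]),
        )
    return matches


# Two clusterers label the same six elements differently.
clusterer_a = {"a0": {1, 2, 3}, "a1": {4, 5, 6}}
clusterer_b = {"b0": {4, 5}, "b1": {1, 2, 3, 6}}
pairs = match_clusters(clusterer_a, clusterer_b)
```

Here element 6 is the kind of unsettled item that the two clusterers place differently; a full scheme would decide its final cluster from the agreement among several clusterers.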
International Journal of Advanced Computer Science and Applications, 2019
Privacy and security have always been a concern that prevents the sharing of data and impedes the success of many projects. Distributed knowledge computing, if done correctly, plays a key role in solving such a problem. The main goal is to obtain valid results while ensuring the non-disclosure of data. Density-based clustering is a powerful technique for analyzing uncertain data that naturally occur and affect the performance of many applications like location-based services. Nowadays, a huge number of datasets have been made available to researchers which involve high-dimensional data points with varying densities. Such datasets contain high-density regions surrounded by data points with sparse density. The existing clustering approaches handle these situations inefficiently, especially in the context of distributed data. In this paper, we design a new decomposable density-based clustering algorithm for distributed datasets (DDBC). DDBC utilizes the concept of the mutual k-nearest neighbor relationship to cluster distributed datasets with different densities. The proposed DDBC algorithm is capable of preserving the privacy and security of data on each site by requiring a minimal number of transmissions to other sites.
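The mutual k-nearest neighbor relationship mentioned above has a simple definition: two points are mutual neighbors when each appears among the other's k nearest neighbors. The sketch below illustrates only that relationship on a single site; it is not the DDBC algorithm, and the function names are hypothetical.

```python
from math import dist


def k_nearest(points, i, k):
    """Indices of the k points closest to points[i], excluding i itself."""
    others = sorted(
        (j for j in range(len(points)) if j != i),
        key=lambda j: dist(points[i], points[j]),
    )
    return set(others[:k])


def mutual_knn_pairs(points, k):
    """Pairs (i, j), i < j, where each point is among the other's k nearest."""
    neighbours = [k_nearest(points, i, k) for i in range(len(points))]
    return {
        (i, j)
        for i in range(len(points))
        for j in neighbours[i]
        if i < j and i in neighbours[j]
    }


points = [(0, 0), (1, 0), (3, 0), (10, 0)]
pairs_k1 = mutual_knn_pairs(points, k=1)
```

With k = 1 the isolated point at (10, 0) has a nearest neighbor but is nobody's nearest neighbor, so it forms no mutual pair. This symmetry is what lets the relationship adapt to regions of different density.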
World Academy of Research in Science and Engineering
Spatial clustering can be defined as the process of grouping objects with certain dimensions into groups such that objects within a group exhibit similar characteristics when compared to those in other groups. It is an important part of spatial data mining since it provides insights into the distribution of data and the characteristics of spatial clusters. Spatial clustering methods are mainly categorized into four types: hierarchical, partitional, density based and grid based. These categorizations are based on the specific criteria used in grouping similar objects. In this paper, we will introduce each of these categories and present some representative algorithms for them. In addition, we will compare these algorithms in terms of four factors: time complexity, inputs, handling of higher dimensions and handling of irregularly shaped clusters.
2004
Clustering can be defined as the process of partitioning a set of patterns into disjoint and homogeneous meaningful groups, called clusters. The growing need for distributed clustering algorithms is attributed to the huge size of databases that is common nowadays. In this paper we propose a modification of a recently proposed algorithm, namely k-windows, that is able to achieve high quality results in distributed computing environments.
… Conference on Data Mining (DMIN'07), USA, 2007
Nowadays, huge amounts of data are naturally collected at distributed sites for various reasons, and moving these data through the network to extract useful knowledge is almost unfeasible for either technical or policy reasons. Furthermore, classical parallel algorithms cannot be applied, especially in loosely coupled environments. This calls for scalable distributed algorithms able to return the global knowledge by aggregating local results in an effective way. In this paper we propose a distributed algorithm based on independent local clustering processes and a global merging step based on minimum variance increases, which requires a limited communication overhead. We also introduce the notion of distributed sub-cluster perturbation to improve the globally generated distribution. We show that this algorithm improves the quality of clustering compared to classical local centralized ones and is able to find the real global nature or distribution of the data.
International Journal on Computer Science and Engineering
Spatial data mining is the process of mining spatial and spatiotemporal data in order to discover otherwise hidden and unknown patterns and trends in data. Its application areas include business prospecting, store prospecting, hospital prospecting and automobile insurance. This paper includes a survey of spatial data mining, its types, techniques and roles in the field of research. Clustering, classification and choropleth display have been the main focus of the paper. All kinds of clustering techniques, such as partitioning based, hierarchical based, grid based, model based and density based, have been covered here. The outcome of the survey is a consolidated study of all the above techniques and methods of spatial data mining, including a detailed summary, advantages, disadvantages, comparison and distinctions among several techniques. No survey of this type has existed to date. Our main objective behind the paper was therefore to provide a standardized and uniform platform for all the techniques of spatial data mining.
International Journal of Business Intelligence and Data Mining, 2008
Regionalisation, a prominent problem from social geography, could be solved by a classification algorithm for grouping spatial objects. A typical task is to find spatially compact and dense regions of arbitrary shape with a homogeneous internal distribution of social variables. Grouping a set of homogeneous spatial units to compose a larger region can be useful for sampling procedures as well as many applications, e.g., direct mailing. It would be helpful to have specific purpose regions, depending on the kind of homogeneity one is interested in. In this paper, we propose an algorithm combining the 'spatial density' clustering approach and a covariance-based method to inductively find spatially dense and non-spatially homogeneous clusters of arbitrary shape.
Pattern Recognition Letters, 2005
Spatial clustering, which groups similar spatial objects into classes, is an important component of spatial data mining [Han and Kamber, Data Mining: Concepts and Techniques, 2000]. Due to its immense applications in various areas, spatial clustering has been a highly active topic in data mining research, with fruitful, scalable clustering methods developed recently. These spatial clustering methods can be classified into four categories: partitioning methods, hierarchical methods, density-based methods and grid-based methods. Clustering large data sets of high dimensionality has always been a serious challenge for clustering algorithms. Many recently developed clustering algorithms have attempted to handle either data with a very large number of records or data sets with a very high number of dimensions. This new clustering method, GCHL (a Grid-Clustering algorithm for High-dimensional very Large spatial databases), combines a novel density-grid based clustering with an axis-parallel partitioning strategy to identify areas of high density in the input data space. The algorithm works as well in the feature space of any data set. The method operates on a limited memory buffer and requires at most a single scan through the data. We demonstrate the high quality of the obtained clustering solutions, their capability of discovering concave/deeper and convex/higher regions, their robustness to outliers and noise, and the excellent scalability of GCHL.