2012
Clustering has become an increasingly important task in modern application domains such as marketing, purchasing assistance, multimedia, and molecular biology. The goal of clustering is to decompose or partition a data set into groups such that both the intra-group similarity and the inter-group dissimilarity are maximized. In many applications, the size of the data to be clustered far exceeds what can be processed at a single site. Furthermore, the data to be clustered may be inherently distributed. The increasing demand to scale up to these massive data sets, which are inherently distributed over networks with limited bandwidth and computational resources, has led to methods for parallel and distributed data clustering. In this thesis, we present a cohesive framework for cluster identification and outlier detection for distributed data. The core idea is to generate independent local models and combine the local models at a central server to obtain global clusters. ...
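The local-model/global-merge idea described in this abstract can be sketched as follows. This is an illustrative Python sketch, not the thesis's actual implementation: it assumes k-means as the local clusterer and a weighted k-means over the shipped centroids at the server, so only centroids and counts, never raw points, leave a site.

```python
import numpy as np

def local_kmeans(X, k, iters=20, seed=0):
    """Run plain Lloyd's k-means on one site's data; return the local model
    (centroids and their member counts) -- the only data sent to the server."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centroids) ** 2).sum(-1), axis=1)
        centroids = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                              else centroids[j] for j in range(k)])
    counts = np.bincount(labels, minlength=k)
    return centroids, counts

def global_merge(local_models, k_global, iters=20, seed=0):
    """Server side: cluster the local centroids, weighting each centroid
    by how many points it represents, to obtain global centroids."""
    C = np.vstack([c for c, _ in local_models])
    w = np.concatenate([n for _, n in local_models]).astype(float)
    rng = np.random.default_rng(seed)
    G = C[rng.choice(len(C), k_global, replace=False)]
    for _ in range(iters):
        lab = np.argmin(((C[:, None] - G) ** 2).sum(-1), axis=1)
        G = np.array([np.average(C[lab == j], axis=0, weights=w[lab == j])
                      if (lab == j).any() else G[j] for j in range(k_global)])
    return G
```

A usage pattern would be: each site calls `local_kmeans` on its own partition, ships the `(centroids, counts)` pair, and the server calls `global_merge` on the collected models.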
2012
Special thanks go to my supervisors in Lisbon and Stavanger, Prof. Paulo Urbano and Prof. Chunming Rong, who agreed to supervise this master's thesis project at two different universities in two countries far apart, for their guidance and suggestions, their patience with my writing process, and their invaluable corrections.
2004
Clustering can be defined as the process of partitioning a set of patterns into disjoint, homogeneous, and meaningful groups, called clusters. The growing need for distributed clustering algorithms is attributed to the huge size of the databases that are common nowadays. In this paper we propose a modification of a recently proposed algorithm, namely k-windows, that is able to achieve high-quality results in distributed computing environments.
Lecture Notes in Computer Science, 2010
We propose a distributed approach addressing the problem of distance-based outlier detection in very large data sets. The presented algorithm is based on the concept of an outlier detection solving set [1], a small subset of the data set that can be provably used for predicting novel outliers. The algorithm exploits parallel computation in order to meet two basic needs: (i) reducing the run time with respect to the centralized version and (ii) dealing with distributed data sets. The former goal is achieved by decomposing the overall computation into cooperating parallel tasks. Besides preserving the correctness of the result, the proposed schema exhibited excellent performance; indeed, experimental results showed that the run time scales well with the number of nodes. The latter goal is accomplished by executing each of these parallel tasks on only a portion of the entire data set, so that the proposed algorithm is suitable for use over distributed data sets. Importantly, while solving the distance-based outlier detection task in the distributed scenario, our method computes an outlier detection solving set of the overall data set of the same quality as that computed by the corresponding centralized method.
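For context, the distance-based outlier definition that solving-set methods build on ranks points by the distance to their k-th nearest neighbor. A minimal centralized sketch of that scoring rule (illustrative names; this is the baseline notion, not the solving-set algorithm itself):

```python
import numpy as np

def top_n_outliers(X, k, n):
    """Score each point by the distance to its k-th nearest neighbor
    (a common distance-based outlier score) and return the indices of
    the n highest-scoring points."""
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    np.fill_diagonal(D, np.inf)          # a point is not its own neighbor
    kth = np.sort(D, axis=1)[:, k - 1]   # distance to the k-th nearest neighbor
    return np.argsort(kth)[::-1][:n]
```

This full pairwise-distance version costs O(N^2) space and time, which is exactly why solving-set and distributed schemes are needed at scale.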
— For business and real-time applications, large data sets are used to extract unknown patterns, a process termed data mining. Clustering and classification algorithms are used to label data from large data sets in supervised and unsupervised manners. The assignment of a set of observations into subsets (called clusters), such that observations in the same cluster are similar in some way, is termed cluster analysis or clustering. Through these algorithms, the inferences of the clustering process and its competence in a given application domain are determined. The research work deals with the k-means and MCL algorithms, which are among the most representative clustering algorithms.
TJPRC, 2013
Data clustering is the process of putting similar data into groups. A clustering algorithm partitions a data set into several groups such that the similarity within a group is larger than that among groups. The central issue is to propose a new data mining algorithm that yields a better cluster configuration than previous algorithms. Determining the most appropriate cluster configuration is a challenging problem, and it is addressed in this paper.
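The quality criterion stated here, within-group similarity larger than among-group similarity, can be made concrete with a simple cohesion/separation measure. A sketch (the specific measure is an assumption for illustration; the paper's actual criterion may differ):

```python
import numpy as np

def cohesion_separation(X, labels):
    """Mean pairwise distance within clusters vs. between clusters.
    A good cluster configuration has intra well below inter."""
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    same = labels[:, None] == labels[None, :]
    diff = ~same                      # pairs in different clusters
    np.fill_diagonal(same, False)     # exclude self-pairs from intra
    intra = D[same].mean()
    inter = D[diff].mean()
    return intra, inter
```

Comparing `intra` against `inter` (or their ratio) gives a single number for ranking candidate configurations.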
Categorizing the different types of data spread over a network is still an important research issue in the field of distributed clustering. There are different types of data, such as news, social networks, and education, and all of this text data is available in different resources. In the searching process, the server has to gather information about the keyword from different resources; because of the scale involved, this process places a heavy burden on those resources. We therefore introduce a framework that consists of an efficient grouping method and efficiently clusters text in the form of documents. It guarantees that more text documents are clustered faster.
International Journal of Modern Trends in Engineering and Research, 2014
In this paper, a distributed method is introduced for detecting distance-based outliers in very large data sets. The approach is based on the concept of the outlier detection solving set, a small subset of the data set that can also be employed for predicting novel outliers. The method exploits parallel computation in order to obtain vast time savings. Indeed, beyond preserving the correctness of the result, the proposed schema exhibits excellent performance. From the theoretical point of view, for common settings, our algorithm is expected to be at least three orders of magnitude faster than the classical nested-loop-like approach to detecting outliers. Experimental results show that the algorithm is efficient and that its running time scales quite well with an increasing number of nodes. We also discuss a variant of the basic strategy that reduces the amount of data to be transferred, in order to improve both the communication cost and the overall runtime. Importantly, the solving set computed in a distributed environment has the same quality as that produced by the corresponding centralized method.
2021
Distributed clustering algorithms have proven to be effective in dramatically reducing execution time. However, distributed environments are characterized by a high rate of failure. Nodes can easily become unreachable. Furthermore, it is not guaranteed that messages are delivered to their destination. As a result, fault tolerance mechanisms are of paramount importance to achieve resiliency and guarantee continuous progress. In this paper, a fault-tolerant distributed k-means algorithm is proposed on a grid of commodity machines. Machines in such an environment are connected in a peer-to-peer fashion and managed by a gossip protocol with the actor model used as the concurrency model. The fact that no synchronization is needed makes it a good fit for parallel processing. Using the passive replication technique for the leader node and the active replication technique for the workers, the system exhibited robustness against failures. The results showed that the distributed k-means algor...
2016
Outlier detection in high-dimensional data presents numerous challenges resulting from the “curse of dimensionality.” A prevailing view is that distance concentration, i.e., the tendency of distances in high-dimensional data to become indistinguishable, hinders the detection of outliers by making distance-based methods label all points as almost equally good outliers. In this paper, we provide evidence supporting the opinion that such a view is too simple, by demonstrating that distance-based methods can produce more contrasting outlier scores in high-dimensional settings. Furthermore, we show that high dimensionality can have a different impact, by reexamining the notion of reverse nearest neighbors in the unsupervised outlier-detection context. Namely, it was recently observed that the distribution of points’ reverse-neighbor counts becomes skewed in high dimensions, resulting in the phenomenon known as hubness. We provide insight into how...
2014
Outlier detection is a fundamental issue in data mining; specifically, it has been used to detect and remove anomalous objects from data. Outliers arise due to mechanical faults, changes in system behaviour, fraudulent behaviour, network intrusions, or human errors. Firstly, this thesis presents a theoretical overview of outlier detection approaches. A novel outlier detection method, called the Clustering Outlier Removal (COR) algorithm, is then proposed and analyzed. It provides efficient outlier detection and data clustering capabilities in the presence of outliers, and is based on filtering the data after the clustering process. The algorithm is divided into two stages. The first stage performs the k-means process. The main objective of the second stage is the iterative removal of objects that are far away from their cluster centroids. The removal occurs according to a chosen threshold. Finally, we provide experimental results from the application of our algori...
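The two-stage structure described here (k-means, then iterative removal of far-from-centroid objects) can be sketched as follows. This is a sketch in the spirit of the abstract, not the COR algorithm itself: the specific threshold rule (a multiple of the mean point-to-centroid distance) and the deterministic initialization are assumptions made for illustration.

```python
import numpy as np

def cluster_and_remove_outliers(X, k, threshold=3.0, rounds=5, iters=20):
    """Stage 1: k-means. Stage 2: drop points whose distance to their
    centroid exceeds `threshold` times the mean such distance, then
    re-cluster; repeat until nothing is removed."""
    keep = np.arange(len(X))
    for _ in range(rounds):
        Y = X[keep]
        C = Y[:k].copy()                      # deterministic init for simplicity
        for _ in range(iters):
            lab = np.argmin(((Y[:, None] - C) ** 2).sum(-1), axis=1)
            C = np.array([Y[lab == j].mean(axis=0) if (lab == j).any() else C[j]
                          for j in range(k)])
        d = np.sqrt(((Y - C[lab]) ** 2).sum(-1))
        ok = d <= threshold * d.mean()        # threshold rule (assumption)
        if ok.all():
            break
        keep = keep[ok]
    outliers = np.setdiff1d(np.arange(len(X)), keep)
    return keep, outliers, C
```

The output separates the indices retained for clustering from those flagged as outliers, matching the abstract's goal of clustering and outlier detection in one procedure.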
Existing studies in data mining mostly focus on outlier detection over data with a single clustering algorithm. There are many clustering methods available in data mining. Values or objects that are similar to each other are organized into a group, called a cluster, while values or objects that do not comply with the model or general behavior of the data are called outliers; outliers can be detected by clustering. Many algorithms have been developed for clustering, with partitional and hierarchical clustering being the two best-known families. Comparing the two, the majority of hierarchical algorithms are computationally complex and consume much memory, whereas the majority of partitional clustering algorithms require only linear time with better effectiveness, although their clustering quality is not as good as that of hierarchical algorithms. Hierarchical and partitional clustering algorithms thus each have advantages over the other, so in our proposed algorithm we integrate the partitional algorithm k-modes (suited to categorical data sets) with the hierarchical clustering algorithm CURE (suited to large data sets, robust to outliers, and able to identify clusters with non-spherical shapes). We plan to implement this algorithm in the MapReduce framework so that its execution time is improved.
International Journal of Advanced Computer Science and Applications, 2019
Privacy and security have always been concerns that prevent the sharing of data and impede the success of many projects. Distributed knowledge computing, if done correctly, plays a key role in solving such a problem. The main goal is to obtain valid results while ensuring the non-disclosure of data. Density-based clustering is a powerful approach for analyzing the uncertain data that occur naturally and affect the performance of many applications, such as location-based services. Nowadays, a huge number of datasets are available to researchers that involve high-dimensional data points with varying densities. Such datasets contain data points in high-density regions surrounded by data points with sparse density. The existing clustering approaches handle these situations inefficiently, especially in the context of distributed data. In this paper, we design a new decomposable density-based clustering algorithm for distributed datasets (DDBC). DDBC utilizes the concept of the mutual k-nearest neighbor relationship to cluster distributed datasets with varying densities. The proposed DDBC algorithm is capable of preserving the privacy and security of data on each site by requiring a minimal number of transmissions to other sites.
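The mutual k-nearest neighbor relationship that DDBC builds on is easy to state concretely: two points are mutual k-nearest neighbors when each appears in the other's k-nearest-neighbor list. A centralized sketch of that relation (illustrative only; DDBC itself computes it across sites with minimal transmissions):

```python
import numpy as np

def mutual_knn_pairs(X, k):
    """Return the set of pairs (i, j), i < j, such that j is among i's
    k nearest neighbors AND i is among j's k nearest neighbors."""
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    np.fill_diagonal(D, np.inf)          # exclude self from neighbor lists
    nn = np.argsort(D, axis=1)[:, :k]    # each row: indices of k nearest
    knn = [set(row) for row in nn]
    return {(i, int(j)) for i in range(len(X)) for j in knn[i]
            if i < j and i in knn[j]}
```

Because the relation is symmetric by construction, it is less sensitive to density differences than a plain kNN graph: a sparse-region point is not linked to a dense cluster merely because the dense cluster is its nearest neighborhood.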
2011 10th International Conference on Machine Learning and Applications and Workshops, 2011
Efficient extraction of useful knowledge from such data is still a challenge, mainly when the data is distributed and heterogeneous, with quality varying according to the corresponding local infrastructure. To reduce the overhead cost, most of the existing distributed clustering approaches generate global models by aggregating local results obtained on each individual node. The complexity and quality of the solutions depend highly on the quality of the aggregation. In this respect, we propose a distributed density-based clustering approach that both reduces the communication overhead due to data exchange and improves the quality of the global models by considering the shapes of local clusters. Preliminary results show that this algorithm is very promising.
Outlier detection is a fundamental issue in data mining; specifically, it has been used to detect and remove anomalous objects from data. In this paper, we describe what cluster analysis is, along with its advantages and limitations, followed by a study of clustering methods for outlier detection.
2009
Clustering is an established data mining technique for grouping objects based on similarity. For sensor networks, one aims at grouping sensor measurements into groups of similar measurements. As sensor networks have limited resources in terms of available memory and energy, a major requirement for sensor clustering is efficient computation on the sensor nodes. Since communication is a dominant energy-consuming task, it has to be reduced for better energy efficiency. Considering memory, one has to reduce the amount of information stored on each sensor node.
Engineering Applications of Artificial Intelligence, 2006
In this paper we address confidentiality issues in distributed data clustering, particularly the inference problem. We present the KDEC-S algorithm for distributed data clustering, which is shown to provide mining results while preserving the confidentiality of the original data. We also present a confidentiality framework with which we can state the confidentiality level of KDEC-S. The underlying idea of KDEC-S is to use an approximation of density estimation such that the original data cannot be reconstructed beyond a given extent.
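The density-estimation-based exchange underlying the KDEC family can be sketched in one dimension: each site evaluates a kernel density estimate of its data on a shared grid and transmits only those grid values, never the raw points; the server sums the local estimates to obtain the global density. This sketch uses a plain Gaussian kernel; KDEC-S itself transmits a sanitized approximation of the estimate, which is the part this sketch omits.

```python
import numpy as np

def local_density(X, grid, h):
    """Gaussian kernel density estimate of one site's 1-D data X,
    evaluated on a shared grid. Only these grid values leave the site."""
    d2 = (grid[:, None] - X[None, :]) ** 2
    return np.exp(-d2 / (2 * h * h)).sum(axis=1)

# Server side: the global (unnormalized) density is simply the sum of
# the local grid-value vectors received from each site:
#   global_density = sum(local_density(X_s, grid, h) for X_s in sites)
```

Clusters can then be identified centrally from the modes of the summed density, without any site disclosing individual data points.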
… Conference on Data Mining (DMIN'07), USA, 2007
Nowadays, huge amounts of data are naturally collected at distributed sites for a variety of reasons, and moving these data through the network to extract useful knowledge is almost unfeasible, for either technical or policy reasons. Furthermore, classical parallel algorithms cannot be applied, especially in loosely coupled environments. This requires developing scalable distributed algorithms able to return the global knowledge by aggregating local results in an effective way. In this paper we propose a distributed algorithm based on independent local clustering processes and a global merging step based on minimum variance increases, which requires only limited communication overhead. We also introduce the notion of distributed sub-cluster perturbation to improve the globally generated distribution. We show that this algorithm improves the quality of clustering compared to classical centralized ones and is able to find the real global structure and distribution of the data.
Lecture Notes in Computer Science, 2007
Many parallel and distributed clustering algorithms have already been proposed. Most of them are based on the aggregation of local models according to some collected local statistics. In this paper, we propose a lightweight distributed clustering algorithm based on a minimum variance increase criterion, which requires only a very limited communication overhead. We also introduce the notion of distributed perturbation to improve the globally generated clustering. We show that this algorithm improves the quality of the overall clustering and manages to find the real structure and number of clusters of the global dataset.
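A minimum-variance-increase merging step of the kind these two abstracts describe can be computed from local statistics alone: merging two clusters increases the within-cluster sum of squares by an amount (Ward's criterion) that depends only on their sizes and means, so no raw data needs to be exchanged. The greedy merge loop below is an illustrative sketch, not the papers' exact procedure.

```python
import numpy as np

def variance_increase(n1, m1, n2, m2):
    """Increase in within-cluster sum of squares when merging clusters
    with sizes n1, n2 and means m1, m2 (Ward's criterion):
    n1*n2/(n1+n2) * ||m1 - m2||^2."""
    m1, m2 = np.asarray(m1, float), np.asarray(m2, float)
    return n1 * n2 / (n1 + n2) * ((m1 - m2) ** 2).sum()

def merge_local_clusters(stats, max_increase):
    """Greedily merge the pair with the smallest variance increase until
    that increase exceeds max_increase. `stats` is a list of (n, mean)."""
    stats = [(n, np.asarray(m, float)) for n, m in stats]
    while len(stats) > 1:
        pairs = [(variance_increase(*stats[i], *stats[j]), i, j)
                 for i in range(len(stats)) for j in range(i + 1, len(stats))]
        best, i, j = min(pairs)
        if best > max_increase:
            break
        (n1, m1), (n2, m2) = stats[i], stats[j]
        merged = (n1 + n2, (n1 * m1 + n2 * m2) / (n1 + n2))
        stats = [s for t, s in enumerate(stats) if t not in (i, j)] + [merged]
    return stats
```

Since each site only ships its sub-cluster sizes and means, the communication cost is proportional to the number of local clusters, not to the data size.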
Data mining, in general, deals with the discovery of non-trivial, hidden, and interesting knowledge from different types of data. With the development of information technologies, the number of databases, as well as their dimension and complexity, grows rapidly, so automated analysis of large amounts of information is necessary. The analysis results are then used for decision making by a human or a program. One of the basic problems of data mining is outlier detection, which in some cases is similar to the classification problem. For example, the main concern of clustering-based outlier detection algorithms is to find clusters and outliers, where outliers are often regarded as noise that should be removed in order to make the clustering more reliable. In this thesis, the ability to detect outliers is improved by using a combined perspective from outlier detection and cluster identification. In the proposed work, four methods are compared: k-means, k-medoids, iterative k-means, and a density-based method. Unlike traditional clustering-based methods, the proposed algorithm provides much more efficient outlier detection and data clustering capabilities in the presence of outliers, which motivates the comparison. The purpose of our method is not only to produce a data clustering but at the same time to find outliers in the resulting clusters. The goal is to model an unknown nonlinear function based on observed input-output pairs. The whole simulation of this proposed work has been carried out in the MATLAB environment.