2019, Canadian Conference on Computational Geometry
Given a set of points in the plane, the unit clustering problem asks for a minimum-size set of unit disks that covers the whole input set. We study the unit clustering problem in a distributed setting, where the input data is partitioned among several machines. We present a (3 + ε)-approximation algorithm for the problem in the Euclidean plane, and a (4 + ε)-approximation algorithm under the general L_p metric (p ≥ 1). We also study the capacitated version of the problem, where each cluster has a limited capacity for covering points. We present a distributed algorithm for the capacitated version that achieves an approximation factor of 4 + ε in the L_2 plane, and a factor of 5 + ε in the general L_p metric. We also provide some complementary lower bounds.
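A common building block for such distributed covering problems (not necessarily the algorithm of this paper) is a single-pass grid heuristic: snap each point to a grid whose cells fit inside one unit disk, so every machine can cover its local points independently, and the union of occupied cells yields an O(1)-approximate cover. A minimal sketch, assuming disks of radius 1 and an illustrative helper name `grid_cover`:

```python
import math

def grid_cover(points, radius=1.0):
    """Cover points with disks by snapping to a grid whose square
    cells have diagonal 2*radius, so each occupied cell fits inside
    one disk centered at the cell's center. Returns the disk centers
    (one per occupied cell)."""
    side = radius * math.sqrt(2)  # cell side so the diagonal equals 2*radius
    cells = {(math.floor(x / side), math.floor(y / side)) for x, y in points}
    return [((i + 0.5) * side, (j + 0.5) * side) for i, j in cells]

# Each machine can run grid_cover on its own partition; deduplicating
# the returned centers by cell gives a valid cover whose size is within
# a constant factor of optimal.
print(grid_cover([(0.1, 0.2), (0.3, 0.4), (5.0, 5.0)]))
```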
Lecture Notes in Computer Science, 2008
In this paper, we consider the clustering of resources on large-scale platforms. More precisely, we target parallel applications consisting of independent tasks, where each task is to be processed on a different cluster. In this context, each cluster should be large enough to hold and process a task, and the maximal distance between two hosts belonging to the same cluster should be small in order to minimize the latencies of intra-cluster communications. This corresponds to maximum bin covering with an extra distance constraint. We describe a distributed approximation algorithm that computes resource clusterings with coordinates in Q in O(log² n) steps and O(n log n) messages, where n is the overall number of hosts. We prove that this algorithm provides an approximation ratio of 1/3.
2020
We study a weighted balanced version of the k-center problem, where each center has a fixed capacity, and each element has an arbitrary demand. The objective is to assign the demands of the elements to the centers so that the total demand assigned to each center does not exceed its capacity, while the maximum distance between centers and their assigned elements is minimized. We present a deterministic O(1)-approximation algorithm for this generalized version of the k-center problem in the distributed setting, where data is partitioned among a number of machines. Our algorithm substantially improves the approximation factor of the current best randomized algorithm available for the problem. We also show that the approximation factor of our algorithm can be improved to 5 + ε when the underlying metric space has bounded doubling dimension.
ACM Transactions on Algorithms, 2010
We continue the study of the online unit clustering problem, introduced by Chan and Zarrabi-Zadeh (Workshop on Approximation and Online Algorithms 2006, LNCS 4368, pp. 121–131, Springer, 2006). We design a deterministic algorithm with a competitive ratio of 7/4 for the one-dimensional case. This is the first deterministic algorithm that beats the bound of 2. It also has a better competitive ratio than the previous randomized algorithms. Moreover, we provide the first non-trivial deterministic lower bound, improve the randomized lower bound, and prove the first lower bounds for higher dimensions.
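For context, the bound of 2 that this result beats is achieved by a simple grid strategy: assign each arriving point to the cluster of the unit grid cell containing it. A minimal sketch of that 2-competitive baseline (not the 7/4 algorithm of the paper, which is more involved):

```python
import math

class GridClustering:
    """Online unit clustering baseline on the real line: each point
    joins the cluster of its unit grid cell floor(x). Every cluster
    fits in a unit interval [i, i+1), and since an optimal unit
    interval spans at most two cells, the algorithm is 2-competitive."""
    def __init__(self):
        self.clusters = {}  # grid cell -> list of points

    def insert(self, x):
        self.clusters.setdefault(math.floor(x), []).append(x)

    def num_clusters(self):
        return len(self.clusters)

g = GridClustering()
for x in [0.2, 0.9, 1.1, 2.5]:
    g.insert(x)
print(g.num_clusters())  # 3 clusters: cells 0, 1, and 2
```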
Proceedings of the Twenty-First Annual ACM- …, 2010
This paper introduces a polynomial-time approximation scheme for the metric Correlation Clustering problem, when the number of clusters returned is bounded by k. Consensus Clustering is a fundamental aggregation problem with considerable applications, and it is analysed here as a metric variant of the Correlation Clustering problem. The PTAS exploits a connection between Correlation Clustering and the k-cut problems. This requires the introduction of a new rebalancing technique, based on minimum-cost perfect matchings, to provide clusters of the required sizes.
Lecture Notes in Computer Science, 2005
In many modern application areas, high-dimensional feature vectors are used to model complex real-world objects. Often these objects reside on different local sites. In this paper, we present a general approach for extracting knowledge from distributed data sets without transmitting all data from the local clients to a server site. In order to keep the transmission cost low, we first determine suitable local feature vector approximations, which are sent to the server. Each feature vector is thereby approximated as precisely as possible with a specified number of bytes. In order to extract knowledge from these approximations, we introduce a suitable distance function between the feature vector approximations. In a detailed experimental evaluation, we demonstrate the benefits of our new feature vector approximation technique for the important area of distributed clustering. In particular, we show that the combination of standard clustering algorithms and our feature vector approximation technique outperforms specialized approaches for distributed clustering when using high-dimensional feature vectors.
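The abstract does not fix a particular encoding, but one plausible instantiation of "approximate each feature vector with a specified number of bytes" is uniform scalar quantization: store a per-vector range and quantize each coordinate to one byte, then compare dequantized vectors. A minimal sketch under that assumption (the paper's actual scheme and distance function may differ):

```python
import numpy as np

def approximate(vec):
    """Quantize each coordinate of a float vector to one byte,
    keeping (min, scale) for dequantization. Only one possible
    encoding; illustrative, not the paper's exact scheme."""
    lo, hi = float(vec.min()), float(vec.max())
    scale = (hi - lo) or 1.0
    codes = np.round((vec - lo) / scale * 255).astype(np.uint8)
    return codes, lo, scale

def dequantize(approx):
    codes, lo, scale = approx
    return lo + codes.astype(np.float64) / 255.0 * scale

def approx_distance(a, b):
    """Distance between two approximations: Euclidean distance of
    the dequantized vectors."""
    return float(np.linalg.norm(dequantize(a) - dequantize(b)))

x, y = np.random.rand(64), np.random.rand(64)
print(approx_distance(approximate(x), approximate(y)))
```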
2004
Clustering can be defined as the process of partitioning a set of patterns into disjoint, homogeneous, and meaningful groups, called clusters. The growing need for distributed clustering algorithms is attributed to the huge databases that are common nowadays. In this paper we propose a modification of a recently proposed algorithm, namely k-windows, that is able to achieve high-quality results in distributed computing environments.
Journal of The ACM, 2010
We present a general approach for designing approximation algorithms for a fundamental class of geometric clustering problems in arbitrary dimensions. More specifically, our approach leads to simple randomized algorithms for the k-means, k-median and discrete k-means problems that yield (1 + ε)-approximations with probability ≥ 1/2 and running times of O(2^((k/ε)^O(1)) dn). These are the first algorithms for these problems whose running times are linear in the size of the input (nd for n points in d dimensions), assuming k and ε are fixed. Our method is general enough to be applicable to clustering problems satisfying certain simple properties, and is likely to have further applications. Interestingly, the center in the optimal solution to the 1-mean problem is the same as the center of mass of the points. However, in the case of the 1-median problem, also known as the Fermat-Weber problem, no such closed form is known. We show that despite the lack of such a closed form, we can obtain an approximation to the optimal 1-median in O(1) time (independent of the number of points). There are many useful variations of these clustering problems; for example, in the discrete versions of these problems, the centers that we seek should belong to the input set of points.
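To make the 1-mean/1-median contrast concrete: the optimal 1-mean center is simply the centroid, whereas for 1-median a sampling idea in the spirit of (though not identical to) the paper's O(1)-time result is to evaluate a few random candidate centers against a random evaluation sample. A minimal sketch under those assumptions:

```python
import random

def centroid(points):
    """Exact optimal 1-mean center: the coordinate-wise mean."""
    d = len(points[0])
    return tuple(sum(p[i] for p in points) / len(points) for i in range(d))

def approx_1median(points, samples=50):
    """Heuristic 1-median: among randomly sampled candidate centers,
    pick the one minimizing total distance to a random evaluation
    sample. Sampling-based like the paper's method, but the exact
    procedure and its guarantees differ there."""
    cands = random.sample(points, min(samples, len(points)))
    evals = random.sample(points, min(samples, len(points)))
    def cost(c):
        return sum(sum((a - b) ** 2 for a, b in zip(c, p)) ** 0.5
                   for p in evals)
    return min(cands, key=cost)

pts = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(1000)]
print(centroid(pts))
print(approx_1median(pts))
```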
We study the online unit clustering problem introduced by Chan and Zarrabi-Zadeh at WAOA 2006. The problem in one dimension is as follows: given a sequence of points on the real line, partition the points into clusters, each enclosable by a unit interval, with the objective of minimizing the number of clusters used. In this paper, we give a brief survey of the existing algorithms for this problem, and compare their efficiency in practice by implementing all deterministic and randomized algorithms proposed thus far for this problem in the literature. In addition, we introduce two new deterministic algorithms that achieve better average performance ratios in practice.
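When measuring performance ratios empirically, the offline optimum for the one-dimensional problem is easy to compute with a left-to-right sweep: open a new unit interval at each leftmost uncovered point. A minimal sketch of such a reference solver (the paper's benchmarking harness is not shown here):

```python
def offline_optimum(points):
    """Optimal number of clusters for 1-D unit clustering: sweep left
    to right, opening a unit interval [p, p+1] at each point not
    covered by the current interval. A standard exchange argument
    shows this greedy sweep is optimal offline."""
    clusters = 0
    right_end = float("-inf")  # right end of the current interval
    for p in sorted(points):
        if p > right_end:
            clusters += 1
            right_end = p + 1.0
    return clusters

print(offline_optimum([0.2, 0.9, 1.1, 2.5]))  # 2: [0.2,1.2] and [2.5,3.5]
```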
Computational Learning Theory, 1999
Clustering problems arise in a wide range of application settings. We shall briefly survey some recent work on approximation algorithms for two clustering problems, the k-median problem and the uncapacitated facility location problem. In a clustering problem, the input consists of a collection of points, and the aim is to partition the points into disjoint sets with the property that…
IEEE Transactions on Knowledge and Data Engineering, 2009
Data-intensive peer-to-peer (P2P) networks are finding an increasing number of applications. Data mining in such P2P environments is a natural extension. However, common monolithic data mining architectures do not fit well in such environments, since they typically require centralizing the distributed data, which is usually not practical in a large P2P network. Distributed data mining algorithms that avoid large-scale synchronization or data centralization offer an alternative. This paper considers the distributed K-means clustering problem, where the data and computing resources are distributed over a large P2P network. It offers two algorithms which produce an approximation of the result produced by the standard centralized K-means clustering algorithm. The first is designed to operate in a dynamic P2P network and can produce clusterings by "local" synchronization only. The second algorithm uses uniformly sampled peers and provides analytical guarantees regarding the accuracy of clustering on a P2P network. Empirical results show that both algorithms demonstrate good performance compared to their centralized counterparts at modest communication cost.
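The "local synchronization" idea can be pictured as each peer running a K-means step on its own data and then merging centroid statistics only with its direct neighbors. A minimal single-round sketch of that pattern (an illustration of the general approach, not the paper's exact protocol):

```python
import numpy as np

def local_kmeans_step(data, centroids):
    """One K-means step on a peer's local data: return per-cluster
    coordinate sums and counts (sufficient statistics)."""
    sums = np.zeros_like(centroids)
    counts = np.zeros(len(centroids))
    for x in data:
        j = int(np.argmin(((centroids - x) ** 2).sum(axis=1)))
        sums[j] += x
        counts[j] += 1
    return sums, counts

def neighbor_average(stats):
    """Merge the sufficient statistics of a peer and its neighbors
    and recompute centroids, so peers talk only to neighbors."""
    sums = sum(s for s, _ in stats)
    counts = sum(c for _, c in stats)
    return sums / np.maximum(counts, 1)[:, None]

rng = np.random.default_rng(0)
peers = [rng.normal(size=(100, 2)) for _ in range(3)]  # 3 peers' data
centroids = rng.normal(size=(2, 2))                    # k = 2
stats = [local_kmeans_step(d, centroids) for d in peers]
print(neighbor_average(stats))  # next-round centroids for this peer
```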
2011 10th International Conference on Machine Learning and Applications and Workshops, 2011
Efficiently extracting useful knowledge from distributed data is still a challenge, especially when the data is heterogeneous and of varying quality depending on its local infrastructure. To reduce the overhead cost, most existing distributed clustering approaches generate global models by aggregating the local results obtained on each individual node. The complexity and quality of the solutions depend highly on the quality of this aggregation. In this respect, we propose an approach for distributed density-based clustering that both reduces the communication overhead due to data exchange and improves the quality of the global models by taking the shapes of the local clusters into account. Preliminary results show that this algorithm is very promising.
Categorizing different types of data over a network is still an important research issue in the field of distributed clustering. Text data of different kinds, such as news, social networks, and education, is available across many resources. In the search process, the server has to gather information about a keyword from these different resources, and scalability issues make this a heavy burden on them. We therefore introduce a framework consisting of an efficient grouping method that clusters text documents efficiently, ensuring that large collections of documents can be clustered faster.
2021
Distributed clustering algorithms have proven to be effective in dramatically reducing execution time. However, distributed environments are characterized by a high rate of failure: nodes can easily become unreachable, and it is not guaranteed that messages are delivered to their destination. As a result, fault-tolerance mechanisms are of paramount importance to achieve resiliency and guarantee continuous progress. In this paper, a fault-tolerant distributed k-means algorithm is proposed on a grid of commodity machines. Machines in such an environment are connected in a peer-to-peer fashion and managed by a gossip protocol, with the actor model used as the concurrency model. The fact that no synchronization is needed makes it a good fit for parallel processing. Using the passive replication technique for the leader node and the active replication technique for the workers, the system exhibited robustness against failures. The results showed that the distributed k-means algorithm…
Machine Learning, 2000
Clustering is a fundamental problem in unsupervised learning, and has been studied widely both as a problem of learning mixture models and as an optimization problem. In this paper, we study clustering with respect to the k-median objective function, a natural formulation of clustering in which we attempt to minimize the average distance to cluster centers. One of the main contributions of this paper is a simple but powerful sampling technique, which we call successive sampling, that could be of independent interest. We show that our sampling procedure can rapidly identify a small set of points (of size just O(k log(n/k))) that summarize the input points for the purpose of clustering. Using successive sampling, we develop an algorithm for the k-median problem that runs in O(nk) time for a wide range of values of k and is guaranteed, with high probability, to return a solution with cost at most a constant factor times optimal. We also establish a lower bound of Ω(nk) on any randomized constant-factor approximation algorithm for the k-median problem that succeeds with even a negligible (say 1/100) probability. The best previous upper bound for the problem was Õ(nk), where the Õ-notation hides polylogarithmic factors in n and k. The best previous lower bound of Ω(nk) applied only to deterministic k-median algorithms. While we focus our presentation on the k-median objective, all our upper bounds are valid for the k-means objective as well. In this context our algorithm compares favorably to the widely used k-means heuristic, which requires O(nk) time for just one iteration and provides no useful approximation guarantees.
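The summarize-then-cluster idea can be illustrated with a simplified stand-in for successive sampling: draw a small sample, solve k-median on the sample, and keep the resulting centers. The sketch below uses plain uniform sampling with a single-swap local search, which lacks the guarantees of the paper's successive-sampling procedure but shows the overall pipeline:

```python
import random

def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def kmedian_cost(points, centers):
    return sum(min(dist(p, c) for c in centers) for p in points)

def summarize_then_cluster(points, k, sample_size=200):
    """Simplified pipeline: (1) summarize the input by a uniform
    sample (the paper uses successive sampling, which gives provable
    guarantees); (2) pick k centers from the summary by greedy swaps."""
    summary = random.sample(points, min(sample_size, len(points)))
    centers = summary[:k]
    improved = True
    while improved:  # single-swap local search on the summary
        improved = False
        for i in range(k):
            for cand in summary:
                trial = centers[:i] + [cand] + centers[i + 1:]
                if kmedian_cost(summary, trial) < kmedian_cost(summary, centers):
                    centers, improved = trial, True
    return centers

pts = [(random.gauss(dx, 1), random.gauss(dx, 1))
       for dx in (0, 10) for _ in range(500)]
print(summarize_then_cluster(pts, k=2, sample_size=50))
```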
2005
We generalize the k-means algorithm presented by the authors and show that the resulting algorithm can solve a larger class of clustering problems that satisfy certain properties (existence of a random sampling procedure and tightness). We prove these properties for the k-median and the discrete k-means clustering problems, resulting in O(2^((k/ε)^O(1)) dn)-time (1 + ε)-approximation algorithms for these problems. These are the first algorithms for these problems whose running time is linear in the size of the input (nd for n points in d dimensions) and independent of the dimension in the exponent, assuming k and ε are fixed. A key ingredient of the k-median result is a (1 + ε)-approximation algorithm for the 1-median problem with running time O(2^((1/ε)^O(1)) d). The previous best known algorithm for this problem had linear running time.
ACM Transactions on Sensor Networks, 2006
We consider the problem of finding a minimum connected dominating set (CDS) in unit disk graphs.
In this paper, we consider the problem of distributed spectral clustering, wherein the data to be clustered is (horizontally) partitioned over a set of interconnected agents with limited connectivity. In order to solve it, we consider the equivalent problem of reconstructing the Euclidean distance matrix of pairwise distances among the joint set of data points. This is obtained in a fully decentralized fashion, making use of an innovative distributed gradient-based procedure: at every agent, we interleave gradient steps on a low-rank factorization of the distance matrix with local averaging steps over its neighbors' current estimates. The procedure can be applied to any spectral clustering algorithm, including normalized and unnormalized variations, for multiple choices of the underlying Laplacian matrix. Experimental evaluations demonstrate that the solution is competitive with a fully centralized solver, where data is collected beforehand on a (virtual) coordinating agent.
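The interleaving can be sketched as follows: each agent keeps a low-rank factor X of the embedding, averages it with its neighbors' copies, then takes a gradient step fitting its locally observed pairwise distances. A minimal single-agent update under those assumptions (names like `observed` and `step` are illustrative, not from the paper):

```python
import numpy as np

def agent_update(X, neighbor_Xs, observed, step=0.01):
    """One round at a single agent.
    X           : (n, r) local estimate of the embedding factor
    neighbor_Xs : list of the neighbors' current (n, r) estimates
    observed    : dict {(i, j): squared distance} known locally
    Interleaves a consensus averaging step with a gradient step on
    sum_(i,j) (||X_i - X_j||^2 - d_ij^2)^2 over the observed pairs."""
    X = sum([X] + neighbor_Xs) / (1 + len(neighbor_Xs))  # averaging step
    grad = np.zeros_like(X)
    for (i, j), d2 in observed.items():
        diff = X[i] - X[j]
        err = diff @ diff - d2      # residual on this matrix entry
        grad[i] += 4 * err * diff   # d/dX_i of the squared residual
        grad[j] -= 4 * err * diff
    return X - step * grad

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 2))
obs = {(0, 1): 1.0, (2, 3): 4.0}
X = agent_update(X, [rng.normal(size=(5, 2))], obs)
print(X.shape)
```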
During my graduate career, there were many individuals who helped me become a better researcher as well as a better person. First and foremost, I would like to express my gratitude to my advisor Greg Plaxton for his valuable guidance and boundless enthusiasm. His attitude toward problem solving and research in general has been truly enlightening, and I consider myself lucky to have had the opportunity to work with him for the last few years. I can remember many marathon research meetings that were fueled by Greg's ideas and enthusiasm. Whether he was in Austin or on a leave of absence several thousand miles away, Greg was always available and willing to contribute. All of the results in this dissertation were coauthored with Greg. I am also deeply indebted to my family for providing gentle, but firm, guidance through the times when I was unsure of where I was headed.
arXiv (Cornell University), 2023
In the Max-k-Diameter problem, we are given a set of points in a metric space, and the goal is to partition the input points into k parts such that the maximum pairwise distance between points in the same part of the partition is minimized. The approximability of the Max-k-Diameter problem was studied in the eighties, culminating in the work of Feder and Greene [STOC'88], wherein they showed it is NP-hard to approximate within a factor better than 2 in the ℓ_1 and ℓ_∞ metrics, and NP-hard to approximate within a factor better than 1.969 in the Euclidean metric. This complements the celebrated factor-2 polynomial-time approximation algorithm for the problem in general metrics (Gonzalez [TCS'85]; Hochbaum and Shmoys [JACM'86]). Over the last couple of decades, there has been increased interest from the algorithmic community in studying the approximability of various clustering objectives when the number of clusters is fixed. In this setting, the framework of coresets has yielded PTASes for most popular clustering objectives, including k-means, k-median, k-center, k-minsum, and so on. In this paper, rather surprisingly, we prove that even when k = 3, the Max-k-Diameter problem is NP-hard to approximate within a factor of 1.5 in the ℓ_1 metric (and the Hamming metric) and NP-hard to approximate within a factor of 1.304 in the Euclidean metric. Our main conceptual contribution is the introduction of a novel framework called cloud systems, which embed hypergraphs into ℓ_p metric spaces such that the chromatic number of the hypergraph is related to the quality of the Max-k-Diameter clustering of the embedded point set. Our main technical contributions are the constructions of nontrivial cloud systems in the Euclidean and ℓ_1 metrics using extremal geometric structures.
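The factor-2 algorithm of Gonzalez referenced above is the farthest-point heuristic: repeatedly pick the point farthest from the centers chosen so far, then assign every point to its nearest center; the resulting partition has maximum diameter at most twice optimal in any metric. A minimal sketch:

```python
def gonzalez(points, k, dist):
    """Gonzalez's farthest-point clustering: a 2-approximation for
    minimizing the maximum cluster diameter in any metric space."""
    centers = [points[0]]
    while len(centers) < k:
        # Pick the point farthest from its nearest current center.
        centers.append(max(points,
                           key=lambda p: min(dist(p, c) for c in centers)))
    # Assign every point to its nearest center.
    clusters = [[] for _ in centers]
    for p in points:
        j = min(range(k), key=lambda i: dist(p, centers[i]))
        clusters[j].append(p)
    return clusters

l1 = lambda a, b: sum(abs(x - y) for x, y in zip(a, b))
pts = [(0, 0), (0, 1), (10, 10), (11, 10), (20, 0)]
print(gonzalez(pts, 3, l1))
```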