
Applied Computing and Intelligence
3(2): 145–179
DOI: 10.3934/aci.2023008
https://www.aimspress.com/journal/aci

Received: 06 October 2023
Accepted: 17 October 2023
Published: 27 October 2023

Research article

Balanced k-means revisited

Rieke de Maeyer¹, Sami Sieranoja²,* and Pasi Fränti²

¹ Saarland Informatics Campus, Saarland University, Saarbrücken, Germany
² Machine Learning Group, School of Computing, University of Eastern Finland, Joensuu, Finland
* Correspondence: Email: [email protected].

Academic Editor: Chih-Cheng Hung


Abstract: The k-means algorithm aims at minimizing the variance within clusters without considering
the balance of cluster sizes. Balanced k-means defines the partition as a pairing problem that enforces
the cluster sizes to be strictly balanced, but the resulting algorithm is impractically slow, O(n^3).
Regularized k-means addresses the problem using a regularization term that includes a balance
parameter. It works reasonably well when balanced cluster sizes are a mandatory requirement but does
not generalize well to soft balance requirements. In this paper, we revisit the k-means algorithm as a
two-objective optimization problem with two contradicting goals: to minimize the variance within
clusters and to minimize the difference in cluster sizes. The proposed algorithm implements a
balance-driven variant of k-means which initially focuses only on minimizing the variance but adds
more weight to the balance constraint in each iteration. The resulting balance degree is not determined
by a control parameter that has to be tuned, but by the point of termination, which can be precisely
specified by a balance criterion.
Keywords: clustering; k-means; balanced k-means; balance-constrained; soft balance

1. Introduction

The clustering problem is to partition objects into separate groups (called clusters) so that objects
within one cluster are more similar to each other than objects in different clusters [1]. Since the middle
of the 20th century, thousands of algorithms addressing the clustering problem have been published [2].
One of the most popular clustering algorithms is the k-means algorithm [2, 3]. It was first proposed
by [4] and [5]. This clustering algorithm aims to build k disjoint clusters such that the sum of squared
distances between the data points and their representatives is minimized. The representatives, called
centroids, are determined by the mean of the data points belonging to a cluster. As a distance function,
the Euclidean distance is used. The number of clusters k has to be set by the user.
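
To make this procedure concrete, here is a minimal sketch of the standard k-means iteration in Python; the function name, defaults, and initialization are illustrative choices of ours, not details from the paper.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain k-means: minimize the sum of squared Euclidean distances
    between data points and the centroid (mean) of their cluster."""
    rng = np.random.default_rng(seed)
    # Initialize centroids with k distinct data points chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update step: each centroid becomes the mean of its cluster;
        # an empty cluster keeps its previous centroid.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # converged: assignments no longer change the centroids
        centroids = new_centroids
    return labels, centroids
```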

Figure 1. Balance constraints can lead to a different clustering result. There are two balanced
clusterings with different SSE values (SSE = 53.7 and SSE = 34.5, left and middle) and one
unconstrained clustering optimized for SSE (SSE = 26.0, right).

Figure 1 demonstrates the case where minimizing the SSE with and without a balance constraint leads
to different optimal clusterings.
To optimize both aims, two different approaches exist: hard-balanced clustering, also called
balance-constrained clustering, and soft-balanced clustering, also called balance-driven clustering.
The two approaches differ in how they weigh the two objectives. Hard-balanced clustering strictly
requires cluster size balance, with the minimization of the SSE serving only as a secondary criterion.
Soft-balanced clustering considers the balance of the cluster sizes as an aim but not as a mandatory
requirement. It seeks a compromise between the two goals, e.g., by weighting them or by using a
heuristic that minimizes the SSE but indirectly creates more balanced clusters than the standard
k-means algorithm [7, 8].
In this paper, we propose a balanced clustering algorithm based on the k-means algorithm. Its main
principle is an increasing penalty term, which is added to the assignment function of the k-means
algorithm and favors assigning objects to smaller clusters. Because of the increasing penalty term, the
resulting balance degree of a clustering is not determined by a rather non-intuitive parameter, but by
the point at which the algorithm terminates. In this way, the desired balance degree can be specified
precisely. However, even when the algorithm has found a clustering with the desired balance degree, it
may still be possible to improve the quality of the clustering (SSE) by iterating the algorithm further
while keeping the last penalty term fixed.
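
The following Python sketch illustrates this principle under simplifying assumptions of ours: the penalty added in the assignment step is taken to be proportional to the current cluster size, its weight grows by a fixed step per iteration, and termination is triggered by a simple size-difference criterion. The paper's exact cost function, schedule, and termination details may differ.

```python
import numpy as np

def balance_driven_kmeans(X, k, max_iter=200, penalty_step=0.1, max_diff=1, seed=0):
    """Sketch of an increasing-penalty balanced k-means: assignment cost is
    squared distance plus penalty times the current cluster size.
    Note: penalty_step should be scaled to the magnitude of the data."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    penalty = 0.0  # start as plain k-means, with no balance pressure
    for _ in range(max_iter):
        # Assignment step with an additive, size-dependent penalty that
        # favors smaller clusters; sizes are updated point by point.
        sizes = np.zeros(k)
        labels = np.empty(len(X), dtype=int)
        for i, x in enumerate(X):
            cost = ((centroids - x) ** 2).sum(axis=1) + penalty * sizes
            j = int(cost.argmin())
            labels[i] = j
            sizes[j] += 1
        # Update step: recompute centroids as cluster means.
        centroids = np.array([
            X[labels == c].mean(axis=0) if sizes[c] > 0 else centroids[c]
            for c in range(k)
        ])
        # Stop once the desired balance criterion is reached.
        if sizes.max() - sizes.min() <= max_diff:
            break
        penalty += penalty_step  # add more weight to balance each iteration
    return labels, centroids
```

As noted above, once the balance criterion is met one could continue iterating with the last penalty value fixed to further reduce the SSE.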
There are many applications for clustering that rely on a balanced distribution of the objects, i.e.,
a distribution in which every cluster contains exactly or approximately the same number of objects.
Balanced clustering can be used in the division step of divide-and-conquer algorithms to provide equal-
sized partitions [7]. In load balancing algorithms, balanced clustering can help to avoid unbalanced
energy consumption in networking [8, 14, 15] or to balance the workload of salesmen in the multiple
traveling salesmen problem [16]. In the clustering of documents, articles, or photos, and in the creation
of domain-specific ontologies, balanced clustering can improve the resulting hierarchies by generating a
more balanced view of the objects to facilitate navigation and browsing [17]. In retail chains, balanced
clustering can be used to segment customers into equal-sized groups so that the same amount of
marketing resources is spent on each segment, or to group similar products into categories of specified
sizes to match units of shelf or floor space [17]. A cost function leading to more balanced cluster sizes
was used in [18] to allow manual investigation of the content of the diagnosis clusters.
A fast O(N^1.5) time divide-and-conquer algorithm for planar minimum spanning tree (MST) in [19]
assumed that the points are distributed equally among the clusters. However, this assumption does not
hold if there is even a single large cluster, and the time complexity of the algorithm grows to O(N^2).


optimization is used by [24], which also makes it possible to provide bounds on the suboptimality of
the given solution. The fuzzy c-means algorithm is applied by [25] before the resulting partial
memberships and the given size constraints are used to finally assign the data points to the clusters. A
basic variable neighborhood search heuristic following the less-is-more approach was proposed by [6].
This heuristic performs a local descent method to explore neighbors, which are obtained by swapping
points between different clusters of the current best solution. Recently, [26] proposed a memetic
algorithm combining a crossover operator to generate offspring with a responsive threshold search that
alternates between two different search procedures to optimize the solution locally. A greedy
randomized adaptive search procedure combined with a strategic oscillation approach to alternate
between feasible and infeasible solutions is used by [27].

2.2. Soft-balanced clustering


A popular approach for the soft-balanced clustering problem is the use of a multiplicative or additive
bias in the assignment function of the standard k-means algorithm. First, Banerjee and Ghosh [28]
proposed to use the frequency-sensitive competitive learning method, in which competitive units (here,
clusters competing for data points) are penalized in proportion to how frequently they win, with the
aim of making all units participate. Banerjee and Ghosh [28] applied this method by introducing a multiplicative
bias term in the objective function of the standard k-means algorithm, which weights the distance
between a data point and a centroid depending on the number of data points already assigned to the
cluster. In this way, smaller clusters are favored in the assignment step.
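
A minimal sketch of such a frequency-sensitive, multiplicative-bias assignment might look as follows; the exact bias term, the online count update, and the initialization of the counts are our assumptions rather than the formulation of [28].

```python
import numpy as np

def frequency_sensitive_assignment(X, centroids):
    """Assign points one at a time, weighting each squared distance by the
    number of points the cluster has already won; frequent winners pay more."""
    k = len(centroids)
    counts = np.ones(k)  # start at one so every cluster has a nonzero cost
    labels = np.empty(len(X), dtype=int)
    for i, x in enumerate(X):
        d2 = ((centroids - x) ** 2).sum(axis=1)
        j = int((counts * d2).argmin())  # multiplicative, size-dependent bias
        labels[i] = j
        counts[j] += 1  # penalize this cluster in future assignments
    return labels
```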
They also provided a theoretical background for their approach. The k-means algorithm implicitly
assumes that the overall distribution of the data points can be decomposed into a mixture of isotropic
Gaussians with a uniform prior. Banerjee and Ghosh [28] followed the idea of shrinking the Gaussians
in proportion to the number of data points assigned to them, dividing the covariance matrix of each
cluster by its number of data points. Maximizing the log-likelihood of a
data point with respect to this framework leads to the multiplicative bias [12, 28].
A similar approach was presented by [29]. They also adopted the assumption that a data point is
distributed according to a mixture of isotropic Gaussians with uniform prior. But instead of changing
the shape of the clusters by shrinking the Gaussians, they adjusted their prior probabilities such that
they decrease exponentially in the number of data points assigned to them. Thus, the more data points a
Gaussian contains, the lower its prior probability becomes. Maximizing the log-likelihood of a data
point with respect to this framework results in an additive bias. Liu et al. [30] complemented their work
by providing the objective function and adding a theoretical analysis with respect to convergence and
bounds in terms of bicriteria approximation.
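
Schematically, the two bias types modify the assignment rule of k-means roughly as follows, where n_j denotes the number of data points currently assigned to cluster j and λ is a balance weight; the notation is illustrative, not that of the original papers.

```latex
% Multiplicative bias (shrunken Gaussians) vs. additive bias (shrunken priors);
% n_j and \lambda are illustrative symbols, not the original papers' notation.
j_{\mathrm{mult}}(x) = \arg\min_{j}\; n_j \,\lVert x - c_j \rVert^2
\qquad
j_{\mathrm{add}}(x) = \arg\min_{j}\; \lVert x - c_j \rVert^2 + \lambda\, n_j
```

In both cases the cost of joining a cluster grows with its current size, so smaller clusters are favored; the multiplicative form rescales the distance itself, while the additive form shifts it by a size-dependent offset.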
Further algorithms use the least-squares linear regression method combined with a balance constraint
that aims at minimizing the variance of the cluster sizes [14, 31]. The least-squares regression error is
minimized in each iteration so that the accuracy of the estimated hyperplanes, which partition the
data into clusters, improves step by step.
Li et al. [32] proposed an algorithm following the approach of the exclusive lasso. This method
models a situation in which variables within the same group compete with each other [33]. They
computed the exclusive lasso of the cluster indicator matrix, which equals the sum of the squared cluster
sizes, and used it as a balance constraint by adding it as a bias to the objective function of the standard
k-means algorithm.
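
Since the exclusive lasso of a hard cluster indicator matrix reduces to the sum of squared cluster sizes, which is minimal when all clusters are equally large, the biased objective is easy to sketch; the weight gamma below is an illustrative assumption, not a parameter from [32].

```python
import numpy as np

def exclusive_lasso_objective(X, centroids, labels, gamma=1.0):
    """Standard k-means SSE plus the exclusive lasso of the hard cluster
    indicator matrix, which equals the sum of squared cluster sizes."""
    sse = ((X - centroids[labels]) ** 2).sum()
    sizes = np.bincount(labels, minlength=len(centroids)).astype(float)
    return sse + gamma * (sizes ** 2).sum()  # gamma trades SSE for balance
```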

