
Applied Computing and Intelligence
3(2): 145–179
DOI: 10.3934/aci.2023008
https://www.aimspress.com/journal/aci

Received: 06 October 2023
Accepted: 17 October 2023
Published: 27 October 2023

Research article

Balanced k-means revisited

Rieke de Maeyer¹, Sami Sieranoja²,* and Pasi Fränti²

¹ Saarland Informatics Campus, Saarland University, Saarbrücken, Germany
² Machine Learning Group, School of Computing, University of Eastern Finland, Joensuu, Finland
* Correspondence: Email: [email protected].

Academic Editor: Chih-Cheng Hung


Abstract: The k-means algorithm aims at minimizing the variance within clusters without considering
the balance of cluster sizes. Balanced k-means defines the partition as a pairing problem that enforces
the cluster sizes to be strictly balanced, but the resulting algorithm is impractically slow, O(n^3).
Regularized k-means addresses the problem using a regularization term that includes a balance
parameter. It works reasonably well when balanced cluster sizes are a mandatory requirement but does
not generalize well to soft balance requirements. In this paper, we revisit the k-means algorithm as a
two-objective optimization problem with two contradicting goals: to minimize the variance within
clusters and to minimize the difference in cluster sizes. The proposed algorithm implements a
balance-driven variant of k-means which initially focuses only on minimizing the variance but adds
more weight to the balance constraint in each iteration. The resulting balance degree is not determined
by a control parameter that has to be tuned, but by the point of termination, which can be precisely
specified by a balance criterion.
Keywords: clustering; k-means; balanced k-means; balance-constrained; soft balance

1. Introduction

The clustering problem is to partition objects into separate groups (called clusters) so that objects
within one cluster are more similar to each other than objects in different clusters [1]. Since the middle
of the 20th century, thousands of algorithms addressing the clustering problem have been published [2].
One of the most popular clustering algorithms is the k-means algorithm [2, 3]. It was first proposed
by [4] and [5]. This clustering algorithm aims to build k disjoint clusters such that the sum of squared
distances between the data points and their representatives is minimized. The representatives, called
centroids, are determined by the mean of the data points belonging to a cluster. As a distance function,
the Euclidean distance is used. The number of clusters k has to be set by the user.
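
To make this procedure concrete, here is a minimal sketch of the standard k-means iteration in Python; the function name, defaults, and initialization are illustrative choices of ours, not details from the paper.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain k-means: minimize the sum of squared Euclidean distances
    between data points and the centroid (mean) of their cluster."""
    rng = np.random.default_rng(seed)
    # Initialize centroids with k distinct data points chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update step: each centroid becomes the mean of its cluster;
        # an empty cluster keeps its previous centroid.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # converged: assignments no longer change the centroids
        centroids = new_centroids
    return labels, centroids
```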

Figure 1. Balance constraints can lead to a different clustering result. There are two balanced
clusterings with different SSE values (SSE = 53.7 and SSE = 34.5, left and middle) and one
unconstrained clustering optimized for SSE (SSE = 26.0, right).

Figure 1 demonstrates the case where minimizing the SSE with and without a balance constraint leads
to different optimal clusterings.
To optimize both aims, two different approaches exist: hard-balanced clustering, also called
balance-constrained clustering, and soft-balanced clustering, also called balance-driven clustering.
The two approaches differ in how they weigh the two objectives. Hard-balanced clustering strictly
requires cluster size balance, with the minimization of the SSE serving only as a secondary criterion.
Soft-balanced clustering considers the balance of the cluster sizes as an aim but not as a mandatory
requirement. It seeks a compromise between the two goals, e.g., by weighting them or by using a
heuristic that minimizes the SSE but indirectly creates more balanced clusters than the standard
k-means algorithm [7, 8].
In this paper, we propose a balanced clustering algorithm based on the k-means algorithm. Its main
principle is an increasing penalty term, which is added to the assignment function of the k-means
algorithm and favors assigning objects to smaller clusters. Because of the increasing penalty term, the
resulting balance degree of a clustering is not determined by a rather non-intuitive parameter, but by
the point at which the algorithm terminates. In this way, the desired balance degree can be specified
precisely. However, even when the algorithm has found a clustering with the desired balance degree, it
may still be possible to improve the quality of the clustering (SSE) by iterating the algorithm further
while keeping the last penalty term fixed.
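
The following Python sketch illustrates this principle under simplifying assumptions of ours: the penalty added in the assignment step is taken to be proportional to the current cluster size, its weight grows by a fixed step per iteration, and termination is triggered by a simple size-difference criterion. The paper's exact cost function, schedule, and termination details may differ.

```python
import numpy as np

def balance_driven_kmeans(X, k, max_iter=200, penalty_step=0.1, max_diff=1, seed=0):
    """Sketch of an increasing-penalty balanced k-means: assignment cost is
    squared distance plus penalty times the current cluster size.
    Note: penalty_step should be scaled to the magnitude of the data."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    penalty = 0.0  # start as plain k-means, with no balance pressure
    for _ in range(max_iter):
        # Assignment step with an additive, size-dependent penalty that
        # favors smaller clusters; sizes are updated point by point.
        sizes = np.zeros(k)
        labels = np.empty(len(X), dtype=int)
        for i, x in enumerate(X):
            cost = ((centroids - x) ** 2).sum(axis=1) + penalty * sizes
            j = int(cost.argmin())
            labels[i] = j
            sizes[j] += 1
        # Update step: recompute centroids as cluster means.
        centroids = np.array([
            X[labels == c].mean(axis=0) if sizes[c] > 0 else centroids[c]
            for c in range(k)
        ])
        # Stop once the desired balance criterion is reached.
        if sizes.max() - sizes.min() <= max_diff:
            break
        penalty += penalty_step  # add more weight to balance each iteration
    return labels, centroids
```

As noted above, once the balance criterion is met one could continue iterating with the last penalty value fixed to further reduce the SSE.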
There are many applications for clustering that rely on a balanced distribution of the objects, i.e.,
a distribution in which every cluster contains exactly or approximately the same number of objects.
Balanced clustering can be used in the division step of divide-and-conquer algorithms to provide equal-
sized partitions [7]. In load balancing algorithms, balanced clustering can help to avoid unbalanced
energy consumption in networking [8, 14, 15] or to balance the workload of salesmen in the multiple
traveling salesmen problem [16]. In the clustering of documents, articles, or photos, and in the creation
of domain-specific ontologies, balanced clustering can improve the resulting hierarchies by generating a
more balanced view of the objects to facilitate navigation and browsing [17]. In retail chains, balanced
clustering can be used to segment customers into equal-sized groups so that the same amount of
marketing resources is spent on each segment, or to group similar products into categories of specified
sizes to match units of shelf or floor space [17]. A cost function leading to more balanced cluster sizes
was used in [18] to allow manual investigation of the content of the diagnosis clusters.
A fast O(N^1.5) time divide-and-conquer algorithm for planar minimum spanning tree (MST) in [19]
assumed that the points are distributed equally among the clusters. However, this assumption does not
hold if there is even a single large cluster, and the time complexity of the algorithm grows to O(N^2).


optimization is used by [24], which also makes it possible to provide bounds on the suboptimality of
the given solution. The fuzzy c-means algorithm is applied by [25] before the resulting partial
memberships and the given size constraints are used to finally assign the data points to the clusters. A
basic variable neighborhood search heuristic following the less-is-more approach was proposed by [6].
This heuristic performs a local descent method to explore neighbors, which are obtained by swapping
points between different clusters of the current best solution. Recently, [26] proposed a memetic
algorithm combining a crossover operator to generate offspring with a responsive threshold search that
alternates between two different search procedures to optimize the solution locally. A greedy
randomized adaptive search procedure combined with a strategic oscillation approach to alternate
between feasible and infeasible solutions is used by [27].

2.2. Soft-balanced clustering


A popular approach for the soft-balanced clustering problem is the use of a multiplicative or additive
bias in the assignment function of the standard k-means algorithm. First, Banerjee and Ghosh [28]
proposed to use the frequency-sensitive competitive learning method, in which competitive units (here,
clusters competing for data points) are penalized in proportion to how frequently they win, with the
aim of making all units participate. Banerjee and Ghosh [28] applied this method by introducing a multiplicative
bias term in the objective function of the standard k-means algorithm, which weights the distance
between a data point and a centroid depending on the number of data points already assigned to the
cluster. In this way, smaller clusters are favored in the assignment step.
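
A minimal sketch of such a frequency-sensitive, multiplicative-bias assignment might look as follows; the exact bias term, the online count update, and the initialization of the counts are our assumptions rather than the formulation of [28].

```python
import numpy as np

def frequency_sensitive_assignment(X, centroids):
    """Assign points one at a time, weighting each squared distance by the
    number of points the cluster has already won; frequent winners pay more."""
    k = len(centroids)
    counts = np.ones(k)  # start at one so every cluster has a nonzero cost
    labels = np.empty(len(X), dtype=int)
    for i, x in enumerate(X):
        d2 = ((centroids - x) ** 2).sum(axis=1)
        j = int((counts * d2).argmin())  # multiplicative, size-dependent bias
        labels[i] = j
        counts[j] += 1  # penalize this cluster in future assignments
    return labels
```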
They also provided a theoretical background for their approach. The k-means algorithm implicitly
assumes that the overall distribution of the data points can be decomposed into a mixture of isotropic
Gaussians with a uniform prior. Banerjee and Ghosh [28] followed the idea of shrinking the Gaussians
in proportion to the number of data points assigned to them, dividing the covariance matrix of each
cluster by its number of data points. Maximizing the log-likelihood of a
data point with respect to this framework leads to the multiplicative bias [12, 28].
A similar approach was presented by [29]. They also adopted the assumption that a data point is
distributed according to a mixture of isotropic Gaussians with uniform prior. But instead of changing
the shape of the clusters by shrinking the Gaussians, they adjusted their prior probabilities such that
they decrease exponentially in the number of data points assigned to them. Thus, the more data points a
Gaussian contains, the lower its prior probability becomes. Maximizing the log-likelihood of a data
point with respect to this framework results in an additive bias. Liu et al. [30] complemented their work
by providing the objective function and adding a theoretical analysis with respect to convergence and
bounds in terms of bicriteria approximation.
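
Schematically, the two bias types modify the assignment rule of k-means roughly as follows, where n_j denotes the number of data points currently assigned to cluster j and λ is a balance weight; the notation is illustrative, not that of the original papers.

```latex
% Multiplicative bias (shrunken Gaussians) vs. additive bias (shrunken priors);
% n_j and \lambda are illustrative symbols, not the original papers' notation.
j_{\mathrm{mult}}(x) = \arg\min_{j}\; n_j \,\lVert x - c_j \rVert^2
\qquad
j_{\mathrm{add}}(x) = \arg\min_{j}\; \lVert x - c_j \rVert^2 + \lambda\, n_j
```

In both cases the cost of joining a cluster grows with its current size, so smaller clusters are favored; the multiplicative form rescales the distance itself, while the additive form shifts it by a size-dependent offset.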
Further algorithms use the least-squares linear regression method combined with a balance constraint
that aims at minimizing the variance of the cluster sizes [14, 31]. The least-squares regression error is
minimized in each iteration so that the accuracy of the estimated hyperplanes, which partition the
data into clusters, improves step by step.
Li et al. [32] proposed an algorithm following the approach of the exclusive lasso. This method
models a situation in which variables within the same group compete with each other [33]. They
computed the exclusive lasso of the cluster indicator matrix, which equals the sum of the squared cluster
sizes, and used it as a balance constraint by adding it as a bias to the objective function of the standard
k-means algorithm.
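
Since the exclusive lasso of a hard cluster indicator matrix reduces to the sum of squared cluster sizes, which is minimal when all clusters are equally large, the biased objective is easy to sketch; the weight gamma below is an illustrative assumption, not a parameter from [32].

```python
import numpy as np

def exclusive_lasso_objective(X, centroids, labels, gamma=1.0):
    """Standard k-means SSE plus the exclusive lasso of the hard cluster
    indicator matrix, which equals the sum of squared cluster sizes."""
    sse = ((X - centroids[labels]) ** 2).sum()
    sizes = np.bincount(labels, minlength=len(centroids)).astype(float)
    return sse + gamma * (sizes ** 2).sum()  # gamma trades SSE for balance
```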

