Client Clustering for Energy-Efficient Clustered Federated Learning in Wireless Networks

Jieming Bian ([email protected]), University of Miami, Coral Gables, FL, USA
Jie Xu* ([email protected]), University of Miami, Coral Gables, FL, USA
ABSTRACT
Clustered Federated Learning (FL) is a novel approach for handling data heterogeneity in FL and training personalized ML models for clients. However, existing research overlooks network constraints and objectives when implementing clustered FL in a wireless network. Since clients experience varying energy costs when connected to different servers, the cluster formation greatly impacts system energy efficiency. To address this, we present an energy-efficient client clustering problem that optimizes FL performance while minimizing energy costs. We introduce a new metric, the effective number of clients, to predict cluster FL performance based on size and data heterogeneity, guiding the clustering optimization. To find optimal client clusters, we propose an energy-efficient clustering algorithm using parallel Gibbs Sampling. Simulations and data experiments demonstrate that our algorithm achieves tunable trade-offs between data similarity (and thus FL performance) and energy cost.

KEYWORDS
Federated Learning, Wireless Networks

ACM Reference Format:
Jieming Bian and Jie Xu. 2023. Client Clustering for Energy-Efficient Clustered Federated Learning in Wireless Networks. In Adjunct Proceedings of the 2023 ACM International Joint Conference on Pervasive and Ubiquitous Computing & the 2023 ACM International Symposium on Wearable Computing (UbiComp/ISWC '23 Adjunct), October 08–12, 2023, Cancun, Quintana Roo, Mexico. ACM, New York, NY, USA, 6 pages. https://doi.org/10.1145/3594739.3612913

*This work is partially supported by NSF under grants 2006630, 2033681, 2029858, 2044991 and 2319780.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
UbiComp/ISWC '23 Adjunct, October 08–12, 2023, Cancun, Quintana Roo, Mexico
© 2023 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 979-8-4007-0200-6/23/10 . . . $15.00
https://doi.org/10.1145/3594739.3612913

1 INTRODUCTION
Today, an enormous amount of data is generated daily from end-user devices like personal computers, smartphones, and Internet-of-Things (IoT) devices. Leveraging machine learning (ML) algorithms such as deep learning, this data can be harnessed for applications like recommendation systems, image recognition, and natural language processing. As privacy concerns mount, local storage and processing of data on end-users' own devices have become increasingly appealing. Hence, distributed learning, and in particular Federated Learning (FL) [4, 7], has become a significant area of research, aiming to fully exploit on-device storage and computing capabilities.

A salient feature of FL is data heterogeneity, or non-i.i.d. data, which is inherent as data is generated by and stored on individual clients. Exploiting this heterogeneity can greatly aid in building personalized ML models, but it is challenged by the limited volume of data each client has. Even though it is common to train a single ML model via FL using all devices' data, this may not provide personalized predictions or achieve high performance for all clients. To tackle non-i.i.d. data, clustered FL [2, 6, 9] has recently been explored to train personalized ML models using data similarity among clients. However, previous work on clustered FL mainly treated it as a pure ML problem, overlooking network constraints when clustering clients in a real-world wireless network environment.

In this paper, we examine, for the first time, the formation of client clusters in clustered FL in a distributed wireless network. The model considers that both clients and the parameter servers of clusters are distributed over the wireless network, leading to varying wireless channel conditions. Given that clients upload local model updates to associated parameter servers via wireless channels, their transmission energy consumption can vary significantly depending on which server, and hence which cluster, they join. As clients are energy-limited and numerous rounds are often needed for FL to converge, the formation of clusters plays a crucial role in the energy efficiency of the clustered FL system in wireless networks.

The paper focuses on the "one-shot" client clustering problem, where clustering is performed only once before FL starts and remains unchanged throughout the learning course. Even though dynamically adjusting clusters might yield better outcomes, it can be energy-inefficient due to substantial extra data exchange and computation. However, some of the techniques developed here could also be readily applied in a periodic clustering setting. The main contributions of this paper include:

• In the context of one-shot clustering, we introduce a new metric called the effective number of clients that captures both the number of participating clients and their data heterogeneity. This metric, which has been empirically shown to correlate with FL performance on non-i.i.d. data, is used as the basis for client clustering.
• We formulate an energy-efficient client clustering problem considering both the data similarity among clients and the total energy cost. Despite being a variation of the correlation clustering problem and hence NP-hard [1], we develop a new algorithm based on Gibbs sampling that can solve this problem with a performance guarantee.
2 FEDERATED LEARNING AND NON-I.I.D. DATA

2.1 Federated Learning Preliminaries
In a typical FL system, there is a single parameter server and a number of 𝑁 participating clients, denoted by N = {1, ..., 𝑁}, with distributed datasets. Let 𝐷𝑛 be the dataset of client 𝑛 ∈ N. FL aims to solve an empirical risk minimization problem

\min_{w} l(w) = \min_{w} \frac{1}{N} \sum_{n \in \mathcal{N}} l_n(w)    (1)

where 𝑤 is the ML model variable that one would like to optimize, l_n(w) = \frac{1}{|D_n|} \sum_{z \in D_n} l(w; z) is the empirical loss function at client 𝑛, and 𝑙(𝑤; 𝑧) is the loss function evaluated at model 𝑤 and data sample 𝑧 = (𝑥; 𝑦), with 𝑥 being the feature and 𝑦 being the label.

A typical FL algorithm, such as FedAvg [4], runs in an iterative manner. In each iteration 𝑡, the server sends the current global model w^{t-1} to the clients. Each client then updates its local model w_n^t based on the global model w^{t-1} with its local dataset 𝐷𝑛. The updated local models are then sent back to the server. Finally, the server aggregates the received local models, e.g., by averaging them, to obtain a new global model w^t. The iterative procedure continues until a certain termination criterion is met.

2.2 Non-i.i.d. Degree and Effective Client Number
A key challenge in FL is the non-i.i.d. data among the participating clients. In this paper, we focus on label distribution heterogeneity. Suppose there are a total of 𝑑 labels in the label set Y; then client 𝑛's label distribution is denoted by a vector 𝑃𝑛 ∈ R^d, with 𝑃𝑛(𝑦) being the probability that a data sample has label 𝑦.

For two label distributions 𝑃𝑛 and 𝑃𝑚, let Δ(𝑃𝑛, 𝑃𝑚) be a function that measures their difference. This function can have many specific choices. For example, the Kullback-Leibler (KL) divergence [5] is a common function to measure the distribution difference; in this case,

\Delta_{\mathrm{KL}}(P_n, P_m) = \sum_{y \in \mathcal{Y}} P_n(y) \log\left(\frac{P_n(y)}{P_m(y)}\right)    (2)

Alternatively, we can simply calculate the distance between 𝑃𝑛 and 𝑃𝑚 in a norm space as follows:

\Delta_p(P_n, P_m) = \|P_n - P_m\|_p    (3)

where ‖·‖_p denotes the l_p-norm. Clearly, in both cases, the difference between a distribution and itself is zero, i.e., Δ(𝑃𝑛, 𝑃𝑛) = 0. We keep the probability divergence function Δ(·, ·) general in the remainder of this paper, and write Δ(𝑃𝑛, 𝑃𝑚) = Δ𝑛,𝑚 for short. Note that Δ need not be symmetric; for example, for the KL divergence, Δ_KL(𝑃𝑛, 𝑃𝑚) ≠ Δ_KL(𝑃𝑚, 𝑃𝑛) in general. With Δ, we define the non-i.i.d. degree of the datasets of a set of clients.

Definition 1 (Non-I.I.D. Degree). Consider a set of 𝑁 clients N = {1, ..., 𝑁} and their respective datasets {𝐷𝑖}_{𝑖∈N}. The non-i.i.d. degree of this set is defined as the average pair-wise distribution difference:

\Theta_{\mathcal{N}} = \frac{1}{N^2} \sum_{n \in \mathcal{N}} \sum_{m \in \mathcal{N}} \Delta_{n,m}    (4)

Intuitively, a higher degree of non-i.i.d. data negatively impacts the performance of FL. Thus, also taking the number of clients 𝑁 into account, we hypothesize that the FL performance is positively correlated with the following metric:

N \cdot (1 - \alpha \cdot \Theta_{\mathcal{N}}) \triangleq \eta(\mathcal{N})    (5)

where 𝛼 ≥ 0 is a constant depending on the specific FL task and the choice of the function Δ. We call this metric the effective number of clients, and denote it by 𝜂(N) for a client set N. On the one hand, having more clients participating in FL improves the FL performance. On the other hand, a larger non-i.i.d. degree among these clients hampers the FL performance. Just like the number of clients 𝑁 in the i.i.d. data case, the effective number of clients 𝜂(N) can be used to predict the FL performance achieved by a set of clients with non-i.i.d. datasets. This metric will serve as the basis for the client clustering problem to predict the performance of clustered FL. Note that when the data is i.i.d. among the clients, namely Θ_N = 0, the effective number of clients 𝜂(N) reduces to the actual number of clients 𝑁.
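To make the metric concrete, the following Python sketch computes the non-i.i.d. degree in (4) and the effective number of clients in (5) from the clients' label distributions. It only illustrates the definitions above: the function names, the NumPy-based layout, and the epsilon smoothing used to keep the KL divergence in (2) finite are our own assumptions, not code from the paper.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """Delta_KL(P_n, P_m) from Eq. (2); eps avoids log(0) and division by zero."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    return float(np.sum(p * np.log(p / q)))

def non_iid_degree(label_dists, delta=kl_divergence):
    """Theta_N from Eq. (4): average pairwise distribution difference."""
    N = len(label_dists)
    total = sum(delta(label_dists[n], label_dists[m])
                for n in range(N) for m in range(N))
    return total / N ** 2

def effective_num_clients(label_dists, alpha):
    """eta(N) = N * (1 - alpha * Theta_N) from Eq. (5)."""
    N = len(label_dists)
    return N * (1 - alpha * non_iid_degree(label_dists))

# Example: two clients with identical label distributions plus one very different client.
P = [np.array([0.5, 0.5]), np.array([0.5, 0.5]), np.array([0.99, 0.01])]
print(effective_num_clients(P, alpha=0.5))   # smaller than the raw client count of 3
```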
3 MODEL AND FORMULATION

3.1 System Model
We consider a wireless network consisting of 𝐾 base stations (BSes), indexed by K = {1, ..., 𝐾}, and 𝑁 users, indexed by N = {1, ..., 𝑁}. Users play the role of FL clients while the BSes play the role of FL parameter servers; hence we use users and clients, and BSes and servers, interchangeably hereafter. Each client 𝑛 ∈ N has a local dataset 𝐷𝑛 with size |𝐷𝑛| and data distribution 𝑃𝑛. The clients are distributed over the network at different distances from the servers. Depending on the clients' and servers' locations, each client 𝑛 can access a subset of servers K𝑛 ⊆ K. We assume that clients are in fixed locations during the time of interest, and let ℎ𝑛,𝑘 denote the average channel condition between client 𝑛 and server 𝑘 ∈ K𝑛.

We consider a clustered FL architecture where the clients are divided into (at most) 𝐾 different clusters before FL starts, and clients within the same cluster collaboratively train an ML model via FL. Without loss of generality, we let server 𝑘 be the parameter server for cluster 𝑘. Hence, clients in cluster 𝑘 have to download the global model from server 𝑘 and upload their local models to server 𝑘 in every FL iteration. Let 𝑎𝑛 ∈ K𝑛 denote the server association decision variable for client 𝑛, which represents the server it is associated with and hence the cluster it joins. We collect the server association decisions of all clients in 𝒂 = (𝑎1, ..., 𝑎𝑁). Denote the set of clients in cluster 𝑘 by C𝑘(𝒂), which can be computed from the association decisions as C𝑘(𝒂) = {𝑛 : 𝑎𝑛 = 𝑘}. Note that it is possible that a cluster is empty and no clients are associated with its corresponding server.

3.2 Client Energy Cost
Participating in FL incurs energy costs to the clients, which are often energy-limited. The client energy cost consists of two parts: computation energy cost and (upload) transmission energy cost. Since the computation energy cost does not depend on which cluster the client joins, we consider only the transmission energy cost in this paper.

We consider a synchronized FL setting where the parameter server performs local model aggregation after it receives the local models from all clients in the cluster. We assume a maximal delay constraint 𝑇 on the training of each FL round. The delay constraints can vary among the FL tasks of different clusters, but in this paper we assume a common 𝑇 for all clusters, since the extension to heterogeneous delay constraints is straightforward. We assume that clients have heterogeneous computing capabilities, and denote the computing speed, in terms of the CPU frequency, of client 𝑛 by 𝑓𝑛. Depending on which server a client 𝑛 is connected with, the client can adjust its upload transmission power 𝑝𝑛 to minimize the total energy cost while meeting the delay constraint. In particular, the computation time of client 𝑛 can be calculated as T_n^{cp} = \frac{\beta g |D_n|}{f_n}, where 𝑔 is the number of CPU cycles required for computing one data sample and 𝛽 is the number of local training iterations in one learning round. These two parameters both depend on the specific FL algorithm adopted and its configuration (e.g., mini-batch size and number of epochs). To meet the delay constraint 𝑇, the transmission time for client 𝑛 to upload the local model updates is therefore T_n^{tx} = T - T_n^{cp}. Let 𝑆𝑛 denote the uploading data size; then the minimum required transmission rate is r_n \triangleq \frac{S_n}{T - T_n^{cp}}. Following Shannon's channel capacity equation, if client 𝑛 is associated with server 𝑘, then it must set its transmission power 𝑝𝑛 so that

B \log_2\left(1 + \frac{h_{n,k} p_n}{B N_0}\right) = r_n    (6)

where 𝐵 is the bandwidth for each client and 𝑁0 the noise power spectral density. Solving for 𝑝𝑛 in the above equation, the transmission energy of client 𝑛 is thus

E_n(a_n) = (T - T_n^{cp}) \cdot \frac{B N_0 (2^{r_n / B} - 1)}{h_{n, a_n}}    (7)

The different channel conditions between client 𝑛 and the servers result in different energy costs, which defines a preference order over the servers for the client to connect to. Thus, the wireless network effect plays an indispensable role in forming client clusters.

3.3 Problem Formulation
For a given cluster C ⊆ N, let w^*_{\mathcal{C}} be the optimal solution that solves the empirical risk minimization with respect to the datasets of the clients in C, i.e.,

w^*_{\mathcal{C}} = \arg\min_{w} l_{\mathcal{C}}(w) = \arg\min_{w} \frac{1}{|\mathcal{C}|} \sum_{n \in \mathcal{C}} l_n(w)    (8)

Therefore, for a given clustering outcome {C1, ..., C𝐾}, the total loss of the entire clustered FL system given the optimal model weights {w^*_{\mathcal{C}_1}, ..., w^*_{\mathcal{C}_K}} can be written as

L(w^*_{\mathcal{C}_1}, ..., w^*_{\mathcal{C}_K}) = \sum_{k \in \mathcal{K}} \sum_{n \in \mathcal{C}_k} l_n(w^*_{\mathcal{C}_k}) = \sum_{k \in \mathcal{K}} |\mathcal{C}_k| \frac{1}{|\mathcal{C}_k|} \sum_{n \in \mathcal{C}_k} l_n(w^*_{\mathcal{C}_k}) = \sum_{k \in \mathcal{K}} |\mathcal{C}_k| \cdot l_{\mathcal{C}_k}(w^*_{\mathcal{C}_k})

Considering both the total learning loss and the energy cost, the client clustering problem is formulated as a total cost minimization problem as follows:

\min_{\boldsymbol{a}} \; \gamma L(w^*_{\mathcal{C}_1}, ..., w^*_{\mathcal{C}_K} \mid \boldsymbol{a}) + (1 - \gamma) \sum_{n \in \mathcal{N}} E_n(a_n)    (9)

where 𝛾 ∈ [0, 1] is a tunable weight that trades off between the learning loss and the energy cost.

Unfortunately, the above cost minimization problem cannot be solved directly: there is no way to obtain the optimal model parameters w^*_{\mathcal{C}_1}, ..., w^*_{\mathcal{C}_K} and evaluate the total learning loss unless FL finishes training, yet client clustering has to be performed before FL starts training, which creates a contradiction. To overcome this challenge, we have to somehow predict the FL performance that a given cluster of clients would achieve by performing FL together. Thus, we leverage the effective client number metric, which considers both the cluster size and its non-i.i.d. degree, and formulate an alternative optimization problem as follows:

\max_{\boldsymbol{a}} \; \gamma \sum_{k \in \mathcal{K}} |\mathcal{C}_k(\boldsymbol{a})| \cdot \eta(\mathcal{C}_k(\boldsymbol{a})) - (1 - \gamma) \sum_{n \in \mathcal{N}} E_n(a_n)    (10)

Essentially, we replace the loss function l_{\mathcal{C}_k}(w^*_{\mathcal{C}_k}) with the negative effective number of clients 𝜂(C𝑘(𝒂)). Hence, instead of minimizing the empirical loss function directly, we aim to maximize a weighted sum of the per-cluster effective client numbers while considering the clients' energy costs.
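Putting Sections 3.2 and 3.3 together, the sketch below first computes the per-client transmission energy of Eq. (7) by inverting the rate constraint (6), and then evaluates the surrogate objective (10) for a candidate association vector. It is a minimal illustration under assumed helper names (it reuses `effective_num_clients` from the earlier sketch) and assumed unit conventions; it is not the authors' implementation.

```python
import numpy as np

def transmission_energy(h_nk, T, f_n, D_n, S_n, B, N0, g, beta):
    """E_n(a_n) from Eq. (7): upload energy of client n when it joins server k.

    h_nk : channel gain to server k       B    : per-client bandwidth (Hz)
    T    : per-round delay budget (s)     N0   : noise power spectral density (W/Hz)
    f_n  : CPU frequency (cycles/s)       g    : CPU cycles per data sample
    D_n  : local dataset size (samples)   S_n  : upload size (bits)
    beta : local iterations per round
    """
    T_cp = beta * g * D_n / f_n                   # computation time T_n^cp
    T_tx = T - T_cp                               # time left for the upload
    if T_tx <= 0:
        return np.inf                             # the delay constraint cannot be met
    r_n = S_n / T_tx                              # minimum required rate
    p_n = B * N0 * (2 ** (r_n / B) - 1) / h_nk    # invert the Shannon rate, Eq. (6)
    return p_n * T_tx                             # energy = power x transmission time

def clustering_objective(a, label_dists, energy, gamma, alpha):
    """Surrogate objective of problem (10) for an association vector a.

    a      : a[n] = server/cluster index chosen by client n
    energy : energy[n][k] = E_n(k), e.g. precomputed with transmission_energy()
    """
    a = np.asarray(a)
    similarity = 0.0
    for k in np.unique(a):
        members = np.where(a == k)[0]
        cluster_dists = [label_dists[n] for n in members]
        similarity += len(members) * effective_num_clients(cluster_dists, alpha)
    energy_cost = sum(energy[n][a[n]] for n in range(len(a)))
    return gamma * similarity - (1 - gamma) * energy_cost
```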
4 ENERGY-EFFICIENT CLUSTERING ALGORITHM

4.1 Extreme Cases
When 𝛾 = 0, problem (10) simplifies to

\min_{\boldsymbol{a}} \sum_{n \in \mathcal{N}} E_n(a_n)    (11)

wherein each client 𝑛 connects to the server with the best channel condition, i.e., a_n^* = \arg\max_{k \in \mathcal{K}_n} h_{n,k}. However, this can result in poor FL performance due to grouping clients with diverse data distributions.

At the other extreme, 𝛾 = 1, problem (10) reduces to

\max_{\boldsymbol{a}} \sum_{k \in \mathcal{K}} |\mathcal{C}_k(\boldsymbol{a})| \cdot \eta(\mathcal{C}_k(\boldsymbol{a}))    (12)

By the definition of the effective client number, the objective function can be rewritten as

\sum_{k \in \mathcal{K}} |\mathcal{C}_k(\boldsymbol{a})| \cdot \eta(\mathcal{C}_k(\boldsymbol{a})) = \sum_{k \in \mathcal{K}} |\mathcal{C}_k(\boldsymbol{a})|^2 \left(1 - \alpha \cdot \Theta_{\mathcal{C}_k(\boldsymbol{a})}\right) = \sum_{k \in \mathcal{K}} |\mathcal{C}_k(\boldsymbol{a})|^2 \left(1 - \frac{\alpha}{|\mathcal{C}_k(\boldsymbol{a})|^2} \sum_{n \in \mathcal{C}_k(\boldsymbol{a})} \sum_{m \in \mathcal{C}_k(\boldsymbol{a})} \Delta_{n,m}\right) = \sum_{k \in \mathcal{K}} \sum_{n \in \mathcal{C}_k(\boldsymbol{a})} \sum_{m \in \mathcal{C}_k(\boldsymbol{a})} (1 - \alpha \cdot \Delta_{n,m})

With this conversion, the effective client number maximization problem can be transformed into a correlation clustering problem whose affinity matrix W ∈ R^{N×N} is W_{n,m} = 1 − 𝛼 · Δ_{n,m}, ∀𝑛, 𝑚. In other words, consider that there exists an edge between any pair of nodes (i.e., clients in our case) 𝑛 and 𝑚, and the edge weight is W_{n,m}. The correlation clustering problem aims to partition the set of nodes into 𝐾 clusters so that the intra-cluster weights are maximized. Mathematically, the problem is

\max_{\boldsymbol{v}} \sum_{n, m \in \mathcal{N}} W_{n,m} \sum_{k \in \mathcal{K}} v_{n,k} v_{m,k} \quad \text{s.t.} \quad v_{n,k} \in \{0, 1\}, \forall n, k; \quad \sum_{k} v_{n,k} = 1, \forall n

Note that \sum_{k \in \mathcal{K}} v_{n,k} v_{m,k} = 1 if and only if nodes 𝑛 and 𝑚 belong to the same cluster.

It is well known that optimizing the correlation clustering problem is NP-hard, and classic methods take a continuous and convex relaxation approach to approximate the solution. Specifically, let V = [v_{n,k}]_{n \in \mathcal{N}, k \in \mathcal{K}} denote the partition matrix and define X = VV^T as the binary adjacency matrix, where X_{n,m} = 1 if and only if 𝑛 and 𝑚 belong to the same cluster. The quadratic objective above, namely \sum_{n,m} W_{n,m} [VV^T]_{n,m}, then becomes linear with respect to 𝑋, i.e., \sum_{n,m} W_{n,m} X_{n,m}. To ensure that 𝑋 can be decomposed to recover 𝑉, a semi-definite constraint is imposed, and the problem becomes

\max_{X} \sum_{n,m} W_{n,m} X_{n,m} \quad \text{s.t.} \quad X \succeq 0    (13)

This problem is a semi-definite program and hence easy to solve. Once (the continuous version of) 𝑋 is obtained, it is factorized into VV^T, and (the continuous version of) 𝑉 is then converted to binary values via proper rounding to produce the clustering solution.

4.2 General Case
The general case of the energy-efficient client clustering problem is much more challenging than the two extreme cases. Because of the additional energy cost term in the objective function and the fact that a client may not be able to access all servers, the continuous and convex relaxation approach used in the previous subsection is no longer applicable. In addition, the relaxation approach only produces approximate clustering solutions with no performance guarantee. In this subsection, we propose a new energy-efficient client clustering algorithm.

To describe our algorithm, we define the "utility" of a client 𝑛 as

u_n(\boldsymbol{a}) = \gamma \sum_{m : a_m = a_n} (1 - \alpha \cdot \Delta_{n,m}) - (1 - \gamma) E_n(a_n)    (14)

The client utility takes into account the aggregated data distribution similarity (or dissimilarity) between the client itself and all other clients in the same cluster, as well as the energy cost of connecting to the server of this cluster. It is easy to verify that the objective function in problem (10) can be decomposed as a sum of client utilities. Hence, the energy-efficient client clustering problem is converted into a sum utility maximization problem as follows:

\max_{\boldsymbol{a}} \sum_{n \in \mathcal{N}} u_n(a_n, \boldsymbol{a}_{-n})    (15)

where by convention \boldsymbol{a}_{-n} = (a_1, ..., a_{n-1}, a_{n+1}, ..., a_N) denotes the server association actions of all clients except 𝑛.

Our algorithm is developed based on the Gibbs Sampling (GS) technique [3]. GS generates the probability distribution of the server association actions by sweeping through the clients, sampling each client's server association action from a conditional distribution while the remaining clients keep their server association actions fixed. According to the theory of Markov chain Monte Carlo, it is guaranteed that the probability distribution of the server association action profile 𝒂 approximated by GS is proportional to \exp(\sum_n u_n(\boldsymbol{a}) / \tau), where 𝜏 is an algorithm parameter. Such a probability measure is known as the Gibbs distribution. Furthermore, performing GS while reducing 𝜏 can result in the globally maximum system utility.

Specifically, in every iteration 𝑡, a client 𝑖 is picked according to some prescribed order. GS then evaluates the achievable sum utility \sum_n u_n(a_i, \boldsymbol{a}^{t-1}_{-i}) by varying client 𝑖's action a_i \in \mathcal{K}_i while fixing the other clients' actions \boldsymbol{a}^{t-1}_{-i} from the previous iteration. Next, GS updates client 𝑖's action according to the Gibbs distribution, i.e.,

a_i^t \leftarrow k, \quad \text{with prob.} \quad \frac{\exp\left(\sum_{n \in \mathcal{N}} u_n(k, \boldsymbol{a}^{t-1}_{-i}) / \tau\right)}{\sum_{k' \in \mathcal{K}_i} \exp\left(\sum_{n \in \mathcal{N}} u_n(k', \boldsymbol{a}^{t-1}_{-i}) / \tau\right)}    (16)

Intuitively, an action that yields a higher utility is chosen with higher probability. The other clients keep their actions from the previous iteration, i.e., a_j^t \leftarrow a_j^{t-1}, \forall j \neq i.

While sequential GS converges to the global optimal solution with probability 1 [3], it can be inefficient, as it takes too long to complete one round of action updates when the number of clients is large. Therefore, we design a parallel GS algorithm to speed up convergence by exploiting the structure of the client clustering problem. The key to enabling parallelization is the observation that a client's utility does not depend on far-away clients that do not share common reachable servers. Specifically, we say that two clients 𝑛 and 𝑚 are neighbors if they have at least one common reachable server, i.e., \mathcal{K}_n \cap \mathcal{K}_m \neq \emptyset. Denote the neighborhood of client 𝑛 by \Gamma(n) = \{m \in \mathcal{N} : \mathcal{K}_n \cap \mathcal{K}_m \neq \emptyset\}. Hence, the utility of a client satisfies

u_n(a_n, \boldsymbol{a}_{-n}) = u_n(a_n, \boldsymbol{a}_{\Gamma(n)})    (17)

In other words, the server association actions of clients outside Γ(𝑛) have no impact on client 𝑛's utility, as they will never be in the same cluster as client 𝑛. Likewise, client 𝑛's server association action has no impact on the utilities of clients outside Γ(𝑛). Therefore, by canceling out the utilities of these clients in both the numerator and the denominator of (16), client 𝑖's action can be updated equivalently as

a_i^t \leftarrow k, \quad \text{with prob.} \quad \frac{\exp\left(\sum_{n \in \Gamma(i)} u_n(k, \boldsymbol{a}^{t-1}_{\Gamma(i)}) / \tau\right)}{\sum_{k' \in \mathcal{K}_i} \exp\left(\sum_{n \in \Gamma(i)} u_n(k', \boldsymbol{a}^{t-1}_{\Gamma(i)}) / \tau\right)}    (18)

With this observation, we can divide the clients into 𝑍 < 𝑁 groups so that no two clients in the same group are neighbors, and update all clients in a group at the same time. Finding the minimum value of 𝑍 is equivalent to a graph coloring problem, which we do not discuss in detail in this paper. In our implementation, we use the sequential coloring algorithm [8] to generate the colors and divide the clients into groups. The difference between sequential GS and parallel GS is illustrated in Fig. 1. In Proposition 1, we formally state the convergence and optimality of the proposed algorithm.

Figure 1: Illustration of parallel GS.

Proposition 1. Let {𝜏𝑡} be a decreasing sequence of control parameters over the iterations. As 𝜏𝑡 → 0, the proposed algorithm converges to the global optimal solution of (15) with probability 1.
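A compact sketch of the resulting clustering procedure is given below: a client utility following Eq. (14), the Gibbs update of Eq. (16)/(18), and the parallel variant that first groups non-neighboring clients using a greedy sequential coloring. The structure follows the description above, but the concrete helper names, data layout, and the coloring details are our own assumptions rather than the paper's reference code.

```python
import numpy as np

def utility(n, a, W, energy, gamma):
    """u_n(a) from Eq. (14); W[n, m] = 1 - alpha * Delta_{n,m}."""
    same_cluster = np.where(a == a[n])[0]
    return gamma * W[n, same_cluster].sum() - (1 - gamma) * energy[n][a[n]]

def sample_action(i, a, reachable, W, energy, gamma, tau, scope, rng):
    """Sample a new server for client i from the Gibbs distribution, Eq. (16)/(18).
    `scope` is the set of clients whose utilities enter the sums (all of N, or Gamma(i))."""
    scores = []
    for k in reachable[i]:
        a_try = a.copy()
        a_try[i] = k
        scores.append(sum(utility(n, a_try, W, energy, gamma) for n in scope))
    scores = np.asarray(scores) / tau
    probs = np.exp(scores - scores.max())          # shifted softmax for numerical stability
    probs /= probs.sum()
    return rng.choice(reachable[i], p=probs)

def greedy_coloring(neighbors):
    """Greedy sequential coloring of the neighbor graph; same-colored clients share
    no reachable server and can therefore be updated in parallel."""
    color = {}
    for n in range(len(neighbors)):
        used = {color[m] for m in neighbors[n] if m in color}
        c = 0
        while c in used:
            c += 1
        color[n] = c
    groups = {}
    for n, c in color.items():
        groups.setdefault(c, []).append(n)
    return list(groups.values())

def parallel_gibbs_sweep(a, reachable, neighbors, W, energy, gamma, tau, rng):
    """One sweep of parallel GS: each color group samples new actions via Eq. (18)."""
    for group in greedy_coloring(neighbors):
        new = {i: sample_action(i, a, reachable, W, energy, gamma, tau,
                                scope=list(neighbors[i]) + [i], rng=rng)
               for i in group}
        for i, k in new.items():                   # apply the whole group's updates together
            a[i] = k
    return a

# Sequential GS is the same update applied one client at a time with scope = all clients;
# lowering tau over the sweeps (simulated-annealing style) drives the action profile toward
# the maximizer of the sum utility, as stated in Proposition 1.
```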
5 SIMULATIONS
In this section, we evaluate the proposed client clustering algorithm in simulations. We simulate a wireless network with 9 BSes (servers) deployed in a grid layout. 60 users (clients) are randomly distributed over a 500 m x 500 m area. The server/client transmission range is 200 m. The path loss model is 128.1 + 37.6 log10(𝑑), where 𝑑 is the distance measured in km, and the standard deviation of shadow fading is 8 dB. The noise power spectral density 𝑁0 is set to −174 dBm/Hz.

Algorithm Convergence. We first study the convergence of the proposed parallel GS-based algorithm in Fig. 2 (𝛾 = 0.75 and 𝜏 = 0.1). As illustrated, our parallel GS-based algorithm converges after about 40 iterations and is much faster than the conventional sequential GS-based algorithm.

Figure 2: Sequential/Parallel GS. Figure 3: Trade-off.

Learning vs. Energy Trade-off. We investigate the trade-off between the learning performance and the system energy cost. As can be seen in Fig. 3, by increasing the preference parameter 𝛾, the intra-cluster data similarity (in terms of the weighted sum of per-cluster effective numbers of clients) becomes higher. In other words, clients with similar data distributions tend to be clustered together. However, this is achieved at a higher energy cost, as the clients may have to connect to farther-away servers. On the other hand, decreasing 𝛾 achieves higher energy efficiency, but the data distributions of the clients in the formed clusters may be very diverse, thereby degrading the FL performance. Fig. 4 further shows the obtained model accuracy and learning loss on the Fashion-MNIST dataset for different values of 𝛾. As can be seen, the FL performance is correlated with data similarity and hence exhibits a similar trade-off as before.

Figure 4: Accuracy and Loss. Figure 5: CCR v.s. GS.

Comparison with Continuous Convex Relaxation. We compare the GS-based algorithm with the continuous convex relaxation (CCR) method in (13). Because CCR cannot handle the energy cost or partial access to the server set, we let 𝛾 = 0 and consider a setting where clients can access all servers for a fair comparison. Fig. 5 shows the normalized utility gap as 𝛼 (the parameter in 𝜂(N)) is varied. As can be seen, our algorithm outperforms CCR, as expected. When 𝛼 is small (i.e., data heterogeneity plays a smaller role), the optimal clustering outcome is a grand cluster, which is relatively easy to find; in these cases, the performance of our algorithm and that of CCR is similar.
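For readers who want to reproduce a comparable setup, the snippet below shows one way to turn the simulated path-loss model above into the linear-scale channel gains ℎ𝑛,𝑘 used by the energy model. The log-normal shadowing term and the function name are our assumptions about an otherwise standard dB-to-linear conversion, not code from the paper.

```python
import numpy as np

def channel_gain(distance_km, shadow_std_db=8.0, rng=np.random.default_rng(0)):
    """Channel gain for the simulated model:
    path loss (dB) = 128.1 + 37.6 * log10(d [km]), with 8 dB log-normal shadowing."""
    pl_db = 128.1 + 37.6 * np.log10(distance_km) + rng.normal(0.0, shadow_std_db)
    return 10 ** (-pl_db / 10)      # convert dB attenuation to a linear-scale gain

# Example: gain between a client and a server 150 m (0.15 km) apart.
print(channel_gain(0.15))
```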
6 CONCLUSIONS
This paper made an initial attempt to study client clustering for clustered FL in a wireless network, taking into account the network constraints and objectives. Based on the effective number of clients, we formulated an energy-efficient clustering optimization problem and developed a new algorithm to find the optimal/stable clusters. Several future research directions are worth exploring to further advance this study. First, dynamic client clustering may discover more accurate client correlations based on intermediate learning outcomes; such information may be incorporated into our formulation to perform periodic (re-)clustering. Second, for mobile clients, the energy cost of associating with different servers becomes more difficult to evaluate. Third, when the BSes have limited bandwidth resources, client clustering may have to be performed jointly with resource allocation.

REFERENCES
[1] Nikhil Bansal, Avrim Blum, and Shuchi Chawla. 2004. Correlation clustering. Machine Learning 56, 1 (2004), 89–113.
[2] Avishek Ghosh, Jichan Chung, Dong Yin, and Kannan Ramchandran. 2020. An efficient framework for clustered federated learning. arXiv preprint arXiv:2006.04088 (2020).
[3] John Hammersley. 2013. Monte Carlo Methods. Springer Science & Business Media.
[4] Jakub Konecny, H. Brendan McMahan, Felix X. Yu, Peter Richtarik, Ananda Theertha Suresh, and Dave Bacon. 2016. Federated learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492 (2016).
[5] Solomon Kullback. 1997. Information Theory and Statistics. Courier Corporation.
[6] Yishay Mansour, Mehryar Mohri, Jae Ro, and Ananda Theertha Suresh. 2020. Three approaches for personalization with applications to federated learning. arXiv preprint arXiv:2002.10619 (2020).
[7] Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. 2017. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics. PMLR, 1273–1282.
[8] John Mitchem. 1976. On various algorithms for estimating the chromatic number of a graph. Comput. J. 19, 2 (1976), 182–183.
[9] Felix Sattler, Klaus-Robert Muller, and Wojciech Samek. 2020. Clustered federated learning: Model-agnostic distributed multitask optimization under privacy constraints. IEEE Transactions on Neural Networks and Learning Systems (2020).