0% found this document useful (0 votes)

15 views39 pages

Diffusion Adaptation Strategies For Distributed Optimization and Learning Over Networks

Pdf

Uploaded by

Huy Hà Xuân

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

15 views39 pages

Diffusion Adaptation Strategies For Distributed Optimization and Learning Over Networks

Pdf

Uploaded by

Huy Hà Xuân

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 39

1

Diffusion Adaptation Strategies for Distributed

Optimization and Learning over Networks
Jianshu Chen, Student Member, IEEE, and Ali H. Sayed, Fellow, IEEE
arXiv:1111.0034v3 [math.OC] 12 May 2012

Abstract

We propose an adaptive diffusion mechanism to optimize a global cost function in a distributed

manner over a network of nodes. The cost function is assumed to consist of a collection of individual
components. Diffusion adaptation allows the nodes to cooperate and diffuse information in real-time; it
also helps alleviate the effects of stochastic gradient noise and measurement noise through a continu-
ous learning process. We analyze the mean-square-error performance of the algorithm in some detail,
including its transient and steady-state behavior. We also apply the diffusion algorithm to two problems:
distributed estimation with sparse parameters and distributed localization. Compared to well-studied
incremental methods, diffusion methods do not require the use of a cyclic path over the nodes and
are robust to node and link failure. Diffusion methods also endow networks with adaptation abilities
that enable the individual nodes to continue learning even when the cost function changes with time.
Examples involving such dynamic cost functions with moving targets are common in the context of
biological networks.

Index Terms

Distributed optimization, diffusion adaptation, incremental techniques, learning, energy conservation,

biological networks, mean-square performance, convergence, stability.

I. I NTRODUCTION

We consider the problem of optimizing a global cost function in a distributed manner. The cost function
is assumed to consist of the sum of individual components, and spatially distributed nodes are used to

Manuscript received October 30, 2011; revised March 15, 2012. This work was supported in part by NSF grants CCF-1011918
and CCF-0942936. Preliminary results related to this work are reported in the conference presentations [1] and [2].
The authors are with Department of Electrical Engineering, University of California, Los Angeles, CA 90095. Email: {jshchen,
sayed}@ee.ucla.edu.

November 27, 2024 DRAFT

seek the common minimizer (or maximizer) through local interactions. Such problems abound in the
context of biological networks, where agents collaborate with each other via local interactions for a
common objective, such as locating food sources or evading predators [3]. Similar problems are common
in distributed resource allocation applications and in online machine learning procedures. In the latter
case, data that are generated by the same underlying distribution are processed in a distributed manner
over a network of learners in order to recover the model parameters (e.g., [4], [5]).
There are already a few of useful techniques for the solution of optimization problems in a distributed
manner [6]–[24]. Most notable among these methods is the incremental approach [6]–[9] and the con-
sensus approach [10]–[23]. In the incremental approach, a cyclic path is defined over the nodes and
data are processed in a cyclic manner through the network until optimization is achieved. However,
determining a cyclic path that covers all nodes is known to be an NP-hard problem [25] and, in addition,
cyclic trajectories are prone to link and node failures. When any of the edges along the path fails, the
sharing of data through the cyclic trajectory is interrupted and the algorithm stops performing. In the
consensus approach, vanishing step-sizes are used to ensure that nodes reach consensus and converge
to the same optimizer in steady-state. However, in time-varying environments, diminishing step-sizes
prevent the network from continuous learning and optimization; when the step-sizes die out, the network
stops learning. In earlier publications [26]–[35], and motivated by our work on adaptation and learning
over networks, we introduced the concept of diffusion adaptation and showed how this technique can be
used to solve global minimum mean-square-error estimation problems efficiently both in real-time and
in a distributed manner. In the diffusion approach, information is processed locally and simultaneously
at all nodes and the processed data are diffused through a real-time sharing mechanism that ripples
through the network continuously. Diffusion adaptation was applied to model complex patterns of behavior
encountered in biological networks, such as bird flight formations [36] and fish schooling [3]. Diffusion
adaptation was also applied to solve dynamic resource allocation problems in cognitive radios [37],
to perform robust system identification [38], and to implement distributed online learning in pattern
recognition applications [5].
This paper generalizes the diffusive learning process and applies it to the distributed optimization
of a wide class of cost functions. The diffusion approach will be shown to alleviate the effect of
gradient noise on convergence. Most other studies on distributed optimization tend to focus on the
almost-sure convergence of the algorithms under diminishing step-size conditions [6], [7], [39]–[43], or
on convergence under deterministic conditions on the data [6]–[8], [15]. In this article we instead examine
the distributed algorithms from a mean-square-error perspective at constant step-sizes. This is because

November 27, 2024 DRAFT

constant step-sizes are necessary for continuous adaptation, learning, and tracking, which in turn enable
the resulting algorithms to perform well even under data that exhibit statistical variations, measurement
noise, and gradient noise.
This paper is organized as follows. In Sec. II, we introduce the global cost function and approximate
it by a distributed optimization problem through the use of a second-order Taylor series expansion. In
Sec. III, we show that optimizing the localized alternative cost at each node k leads naturally to diffusion
adaptation strategies. In Sec. IV, we analyze the mean-square performance of the diffusion algorithms
under statistical perturbations when stochastic gradients are used. In Sec. V, we apply the diffusion
algorithms to two application problems: sparse distributed estimation and distributed localization. Finally,
in Sec. VI, we conclude the paper.
Notation. Throughout the paper, all vectors are column vectors except for the regressors {uk,i }, which
are taken to be row vectors for simplicity of notation. We use boldface letters to denote random quantities
(such as uk,i ) and regular font letters to denote their realizations or deterministic variables (such as uk,i ).
We write E to denote the expectation operator. We use diag{x1 , . . . , xN } to denote a diagonal matrix
consisting of diagonal entries x1 , . . . , xN , and use col{x1 , . . . , xN } to denote a column vector formed
by stacking x1 , . . . , xN on top of each other. For symmetric matrices X and Y , the notation X ≤ Y
denotes Y − X ≥ 0, namely, that the matrix difference Y − X is positive semi-definite.

II. P ROBLEM F ORMULATION

The objective is to determine, in a collaborative and distributed manner, the M ×1 column vector wo
that minimizes a global cost of the form:
N
X
J glob (w) = Jl (w) (1)
l=1

where Jl (w), l = 1, 2, . . . , N , are individual real-valued functions, defined over w ∈ RM and assumed to
be differentiable and strictly convex. Then, J glob (w) in (1) is also strictly convex so that the minimizer
wo is unique [44]. In this article we study the important case where the component functions {Jl (w)} are
minimized at the same wo . This case is common in practice; situations abound where nodes in a network
need to work cooperatively to attain a common objective (such as tracking a target, locating the source
of chemical leak, estimating a physical model, or identifying a statistical distribution). This scenario is
also frequent in the context of biological networks. For example, during the foraging behavior of an
animal group, each agent in the group is interested in determining the same vector wo that corresponds
to the location of the food source or the location of the predator [3]. This scenario is equally common

November 27, 2024 DRAFT

in online distributed machine learning problems, where data samples are often generated from the same
underlying distribution and they are processed in a distributed manner by different nodes (e.g., [4], [5]).
The case where the {Jl (w)} have different individual minimizers is studied in [45]; this situation is more
challenging to study. Nevertheless, it is shown in [45] that the same diffusion strategies (18)–(19) of this
paper are still applicable and nodes would converge instead to a Pareto-optimal solution.
Our strategy to optimize the global cost J glob (w) in a distributed manner is based on three steps.
First, using a second-order Taylor series expansion, we argue that J glob (w) can be approximated by an
alternative localized cost that is amenable to distributed optimization — see (11). Second, each individual
node optimizes this alternative cost via a steepest-descent procedure that relies solely on interactions
within the neighborhood of the node. Finally, the local estimates for wo are spatially combined by each
node and the procedure repeats itself in real-time.
To motivate the approach, we start by introducing a set of nonnegative coefficients {cl,k } that satisfy:
N
X
cl,k = 1, cl,k = 0 if l ∈
/ Nk , l = 1, 2, . . . , N (2)
k=1

where Nk denotes the neighborhood of node k (including node k itself); the neighbors of node k consist
of all nodes with which node k can share information. Each cl,k represents a weight value that node
k assigns to information arriving from its neighbor l. Condition (2) states that the sum of all weights
leaving each node l should be one. Using the coefficients {cl,k }, we can express J glob (w) from (1) as
N
X
J glob (w) = Jkloc (w) + Jlloc (w) (3)
l6=k

where
X
Jkloc (w) , cl,k Jl (w) (4)
l∈Nk

In other words, for each node k , we are introducing a new local cost function, Jkloc (w), which corresponds
to a weighted combination of the costs of its neighbors. Since the {cl,k } are all nonnegative and each
Jl (w) is convex, then Jkloc (w) is also a convex function (actually, the Jkloc (w) will be guaranteed to be
strongly convex in our treatment in view of Assumption 1 further ahead).
Now, each Jlloc (w) in the second term of (3) can be approximated via a second-order Taylor series
expansion as:

Jlloc (w) ≈ Jlloc (wo ) + kw − wo k2Γl (5)

November 27, 2024 DRAFT

J1 (w)
Nk
1 J8 (w)
J2 (w) 8
a1;k ak;1
2
c1;k ck;1

k J k (w)

J3 (w) 3
7
4
9
J7 (w)
J4 (w) J9 (w)
J5 (w) 5 J6 (w)
6

Fig. 1. A network with N nodes; a cost function Jk (w) is associated with each node k. The set of neighbors of node k is
denoted by Nk ; this set consists of all nodes with which node k can share information.

where Γl = 12 ∇2w Jlloc (wo ) is the (scaled) Hessian matrix relative to w and evaluated at w = wo , and the
notation kak2Σ denotes aT Σa for any weighting matrix Σ. The analysis in the subsequent sections will
show that the second-order approximation (5) is sufficient to ensure mean-square convergence of the
resulting diffusion algorithm. Now, substituting (5) into the right-hand side of (3) gives:
X X
J glob (w) ≈ Jkloc (w)+ kw−wo k2Γl + Jlloc (wo ) (6)
l6=k l6=k
The last term in the above expression does not depend on the unknown w. Therefore, we can ignore it
so that optimizing J glob (w) is approximately equivalent to optimizing the following alternative cost:
0
X
J glob (w) , Jkloc (w) + kw − wo k2Γl (7)
l6=k

III. I TERATIVE D IFFUSION S OLUTION

Expression (7) relates the original global cost (1) to the newly-defined local cost function Jkloc (w).
The relation is through the second term on the right-hand side of (7), which corresponds to a sum of
quadratic terms involving the minimizer wo . Obviously, wo is not available at node k since the nodes wish
to estimate wo . Likewise, not all Hessian matrices Γl are available to node k . Nevertheless, expression (7)
suggests a useful approximation that leads to a powerful distributed solution, as we proceed to explain.
0
Our first step is to replace the global cost J glob (w) by a reasonable localized approximation for it at
every node k . Thus, initially we limit the summation on the right-hand side of (7) to the neighbors of
node k and introduce the cost function:
0
Jkglob (w) , Jkloc (w) +
X
kw − wo k2Γl (8)
l∈Nk \{k}

November 27, 2024 DRAFT

Compared with (7), the last term in (8) involves only quantities that are available in the neighborhood of
node k . The argument involving steps (5)–(8) therefore shows us one way by which we can adjust the
earlier local cost function Jkloc (w) defined in (4) by adding to it the last term that appears in (8). Doing
0
so, we end up replacing Jkloc (w) by Jkglob (w), and this new localized cost function preserves the second
term in (3) up to a second-order approximation. This correction will help lead to a diffusion step (see
(14)–(15)).
Now, observe that the cost in (8) includes the quantities {Γl }, which belong to the neighbors of node k .
These quantities may or may not be available. If they are known, then we can proceed with (8) and rely
on the use of the Hessian matrices Γl in the subsequent development. Nevertheless, the more interesting
situation in practice is when these Hessian matrices are not known beforehand (especially since they
depend on the unknown wo ). For this reason, in this article, we approximate each Γl in (8) by a multiple
of the identity matrix, say,

Γl ≈ bl,k IM (9)

for some nonnegative coefficients {bl,k }; observe that we are allowing the coefficient bl,k to vary with
the node index k . Such approximations are common in stochastic approximation theory and help reduce
the complexity of the resulting algorithms — see [44, pp.20–28] and [46, pp.142–147]. Approximation
(9) is reasonable since, in view of the Rayleigh-Ritz characterization of eigenvalues [47], we can always
bound the weighted squared norm kw − wo k2Γl by the unweighted squared norm as follows

λmin (Γl ) · kw−wo k2 ≤ kw−wo k2Γl ≤ λmax (Γl ) · kw−wo k2

Thus, we replace (8) by

00
Jkglob (w) , Jkloc (w) +
X
bl,k kw − wo k2 (10)
l∈Nk \{k}

As the derivation will show, we do not need to worry at this stage about how the scalars {bl,k } are
selected; they will be embedded into other combination weights that the designer selects. If we replace
Jkloc (w) by its definition (4), we can rewrite (10) as
00
Jkglob (w) =
X X
cl,k Jl (w) + bl,k kw−wo k2 (11)
l∈Nk l∈Nk \{k}

Observe that cost (11) is different for different nodes; this is because the choices of the weighting scalars
{cl,k , bl,k } vary across nodes k ; moreover, the neighborhoods vary with k . Nevertheless, these localized
cost functions now constitute the important starting point for the development of diffusion strategies for
the online and distributed optimization of (1).

November 27, 2024 DRAFT

00
Each node k can apply a steepest-descent iteration to minimize Jkglob (w) by moving along the negative
direction of the gradient (column) vector of the cost function, namely,
X
wk,i = wk,i−1 − µk cl,k ∇w Jl (wk,i−1 )
l∈Nk
X
− µk 2bl,k (wk,i−1 − wo ), i≥0 (12)
l∈Nk \{k}

where wk,i denotes the estimate for wo at node k at time i, and µk denotes a small constant positive
step-size parameter. While vanishing step-sizes, such as µk (i) = 1/i, can be used in (12), we consider in
this paper the case of constant step-sizes. This is because we are interested in distributed strategies that
are able to continue adapting and learning. An important question to address therefore is how close each
of the wk,i gets to the optimal solution wo ; we answer this question later in the paper by means of a
mean-square-error convergence analysis (see expression (88)). It will be seen then that the mean-square-
error (MSE) of the algorithm will be of the order of the step-size; hence, sufficiently small step-sizes
will lead to sufficiently small MSEs.
Expression (12) adds two correction terms to the previous estimate, wk,i−1 , in order to update it to
wk,i . The correction terms can be added one at a time in a succession of two steps, for example, as:
X
ψk,i = wk,i−1 − µk cl,k ∇w Jl (wk,i−1 ) (13)
l∈Nk
X
wk,i = ψk,i − µk 2bl,k (wk,i−1 − wo ) (14)
l∈Nk \{k}

Step (13) updates wk,i−1 to an intermediate value ψk,i by using a combination of local gradient vectors.
Step (14) further updates ψk,i to wk,i by using a combination of local estimates. However, two issues
arise while examining (14):

(a) First, iteration (14) requires knowledge of the optimizer wo . However, all nodes are running similar
updates to estimate the wo . By the time node k wishes to apply (14), each of its neighbors would
have performed its own update similar to (13) and would have available their intermediate estimates,
{ψl,i }. Therefore, we replace wo in (14) by ψl,i . This step helps diffuse information over the network
and brings into node k information that exists beyond its immediate neighborhood; this is because
each ψl,i is influenced by data from the neighbors of node l. We observe that this diffusive term
arises from the quadratic approximation (5) we have made to the second term in (3).
(b) Second, the intermediate value ψk,i in (13) is generally a better estimate for wo than wk,i−1
since it is obtained by incorporating information from the neighbors through (13). Therefore, we

November 27, 2024 DRAFT

further replace wk,i−1 in (14) by ψk,i . This step is reminiscent of incremental-type approaches to
optimization, which have been widely studied in the literature [6]–[9].
Performing the substitutions described in items (a) and (b) into (14), we obtain:
X
wk,i = ψk,i − µk 2bl,k (ψk,i − ψl,i ) (15)
l∈Nk \{k}

Now introduce the coefficients

X
al,k , 2µk bl,k (l 6= k), ak,k , 1−µk 2bl,k (16)
l∈Nk \{k}

Note that the {al,k } are nonnegative for l 6= k and ak,k ≥ 0 for sufficiently small step-sizes. Moreover,
the coefficients {al,k } satisfy
N
X
al,k = 1, al,k = 0 if l ∈
/ Nk (17)
l=1
Using (16) in (15), we arrive at the following Adapt-then-Combine (ATC) diffusion strategy (whose
structure is the same as the ATC algorithm originally proposed in [29]–[31] for mean-square-error
estimation):
X
ψk,i = wk,i−1 − µk cl,k ∇w Jl (wk,i−1 )
l∈Nk
X (18)
wk,i = al,k ψl,i
l∈Nk

To run algorithm (18), we only need to select combination coefficients {al,k , cl,k } satisfying (2) and (17),
respectively; there is no need to worry about the intermediate coefficients {bl,k } any more, since they
have been blended into the {al,k }. The ATC algorithm (18) involves two steps. In the first step, node k
receives gradient vector information from its neighbors and uses it to update its estimate wk,i−1 to an
intermediate value ψk,i . All other nodes in the network are performing a similar step and generating their
intermediate estimate ψl,i . In the second step, node k aggregates the estimates {ψl,i } of its neighbors and
generates wk,i . Again, all other nodes are performing a similar step. Similarly, if we reverse the order of
steps (13) and (14) to implement (12), we can motivate the following alternative Combine-then-Adapt
(CTA) diffusion strategy (whose structure is similar to the CTA algorithm originally proposed in [26]–[32]
for mean-square-error estimation):
X
ψk,i−1 = al,k wl,i−1
l∈Nk
X (19)
wk,i = ψk,i−1 − µk cl,k ∇w Jl (ψk,i−1 )
l∈Nk

November 27, 2024 DRAFT

Adaptive diffusion strategies of the above ATC and CTA types were first proposed and extended in [26]–
[34] for the solution of distributed mean-square-error, least-squares, and state-space estimation problems
over networks. The special form of ATC strategy (18) for minimum-mean-square-error estimation is listed
further ahead as Eq. (57) in Example 3; the same strategy as (57) also appeared in [48] albeit with a
vanishing step-size sequence to ensure convergence towards consensus. A special case of the diffusion
strategy (19) (corresponding to choosing cl,k = 0 for l 6= k and ck,k = 1, i.e., without sharing gradient
information) was used in the works [39], [40], [43] to solve distributed optimization problems that require
all nodes to reach agreement about wo by relying on step-sizes that decay to zero with time. Diffusion
recursions of the forms (18) and (19) are more general than these earlier investigations in a couple of
respects. First, they do not only diffuse the local estimates, but they can also diffuse the local gradient
vectors. In other words, two sets of combination coefficients {al,k , cl,k } are used. Second, the combination
weights {al,k } are not required to be doubly stochastic (which would require both the rows and columns
of the weighting matrix A = [al,k ] to add up to one; as seen from (17), we only require the entries on
the columns of A to add up to one). Finally, and most importantly, the step-size parameters {µk } in (18)
and (19) are not required to depend on the time index i and are not required to vanish as i → ∞. Instead,
they can assume constant values, which is critical to endow the network with continuous adaptation and
learning abilities (otherwise, when step-sizes die out, the network stops learning). Constant step-sizes
also endow networks with tracking abilities, in which case the algorithms can track time changes in the
optimal wo .
Constant step-sizes will be shown further ahead to be sufficient to guarantee agreement among the
nodes when there is no noise in the data. However, when measurement noise and gradient noise are
present, using constant step-sizes does not force the nodes to attain agreement about wo (i.e., to converge
to the same wo ). Instead, the nodes will be shown to tend to individual estimates for wo that are within
a small mean-square-error (MSE) bound from the optimal solution; the bound will be proportional to the
step-size so that sufficiently small step-sizes lead to small MSE values. Multi-agent systems in nature
behave in this manner; they do not require exact agreement among their agents but allow for fluctuations
due to individual noise levels (see [3], [36]). Giving individual nodes this flexibility, rather than forcing
them to operate in agreement with the remaining nodes, ends up leading to nodes with enhanced learning
abilities.
Before proceeding to a detailed analysis of the performance of the diffusion algorithms (18)–(19), we
note that these strategies differ in important ways from traditional consensus-based distributed solutions,

November 27, 2024 DRAFT

which are of the following form [10], [14], [15], [18]:

X
wk,i = al,k wk,i−1 − µk (i) · ∇w Jl (wk,i−1 ) (20)
l∈Nk

usually with a time-variant step-size sequence, µk (i), that decays to zero. For example, if we set C ,
[cl,k ] = I in the CTA algorithm (19) and substitute the combination step into the adaptation step, we
obtain:
!
X X
wk,i = al,k wk,i−1 − µk ∇w Jl al,k wk,i−1 (21)
l∈Nk l∈Nk

Thus, note that the gradient vector in (21) is evaluated at ψk,i−1 , while in (20) it is evaluated at wk,i−1 .
Since ψk,i−1 already incorporates information from neighbors, we would expect the diffusion algorithm
to perform better. Actually, it is shown in [49] that, for mean-square-error estimation problems, diffusion
strategies achieve higher convergence rate and lower mean-square-error than consensus strategies due to
these differences in the dynamics of the algorithms.

IV. M EAN -S QUARE P ERFORMANCE A NALYSIS

The diffusion algorithms (18) and (19) depend on sharing local gradient vectors ∇w Jl (·). In many
cases of practical relevance, the exact gradient vectors are not available and approximations are instead
used. We model the inaccuracy in the gradient vectors as some random additive noise component, say,
of the form:

∇
b w Jl (w) = ∇w Jl (w) + vl,i (w) (22)

where vl,i (·) denotes the perturbation and is often referred to as gradient noise. Note that we are using
a boldface symbol v to refer to the gradient noise since it is generally stochastic in nature.

Example 1. Assume the individual cost Jl (w) at node l can be expressed as the expected value of a
certain loss function Ql (·, ·), i.e., Jl (w) = E{Ql (w, xl,i )}, where the expectation is with respect to the
randomness in the data samples {xl,i } that are collected at node l at time i. Then, if we replace the true
gradient ∇w Jl (w) with its stochastic gradient approximation ∇
b w Jl (w) = ∇w Ql (w, xl,i ), we find that

the gradient noise in this case can be expressed as

vl,i (w) = ∇w Ql (w, xl,i ) − ∇w E{Ql (w, xl,i )} (23)

November 27, 2024 DRAFT

Using the perturbed gradient vectors (22), the diffusion algorithms (18)–(19) become the following:
X
ψk,i = wk,i−1 −µk cl,k ∇
b w Jl (wk,i−1 )
l∈Nk
(ATC) X (24)
wk,i = al,k ψl,i
l∈Nk

X
ψk,i−1 = al,k wl,i−1
l∈Nk
(CTA) X (25)
wk,i = ψk,i−1 −µk cl,k ∇
b w Jl (ψk,i−1 )
l∈Nk

Observe that, starting with (24)–(25), we will be using boldface letters to refer to the various estimate
quantities in order to highlight the fact that they are also stochastic in nature due to the presence of the
gradient noise.
Given the above algorithms, it is necessary to examine their performance in light of the approximation
steps (6)–(15) that were employed to arrive at them, and in light of the gradient noise (22) that seeps
into the recursions. A convenient framework to carry out this analysis is mean-square analysis. In this
framework, we assess how close the individual estimates wk,i get to the minimizer wo in the mean-square-
error (MSE) sense. In practice, it is not necessary to force the individual agents to reach agreement and
to converge to the same wo using diminishing step-sizes. It is sufficient for the nodes to converge within
acceptable MSE bounds from wo . This flexibility is beneficial and is common in biological networks; it
allows nodes to learn and adapt in time-varying environments without the forced requirement of having
to agree with neighbors.
The main results that we derive in this section are summarized as follows. First, we derive conditions on
the constant step-sizes to ensure boundedness and convergence of the mean-square-error for sufficiently
small step-sizes — see (80) and (106) further ahead. Second, despite the fact that nodes influence each
other’s behavior, we are able to quantify the performance of every node in the network and to derive
closed-form expressions for the mean-square performance at small step-sizes — see (106)–(108). Finally,
as a special case, we are able to show that constant step-sizes can still ensure that the estimates across
all nodes converge to the optimal wo and reach agreement in the absence of noise — see Theorem 2.
Motivated by [31], we address the mean-square-error performance of the adaptive ATC and CTA
diffusion strategies (24)–(25) by treating them as special cases of a general diffusion structure of the

November 27, 2024 DRAFT

following form:
N
X
φk,i−1 = p1,l,k wl,i−1 (26)
l=1
N
X h i
ψk,i = φk,i−1 − µk sl,k ∇w Jl (φk,i−1 ) + vl,i (φk,i−1 ) (27)
l=1
N
X
wk,i = p2,l,k ψl,i (28)
l=1

The coefficients {p1,l,k }, {sl,k }, and {p2,l,k } are nonnegative real coefficients corresponding to the {l, k}-
th entries of three matrices P1 , S , and P2 , respectively. Different choices for {P1 , P2 , S} correspond to
different cooperation modes. For example, the choice P1 = I , P2 = I and S = I corresponds to the
non-cooperative case where nodes do not interact. On the other hand, the choice P1 = I , P2 = A = [al,k ]
and S = C = [cl,k ] corresponds to ATC [29]–[31], while the choice P1 = A, P2 = I and S = C
corresponds to CTA [26]–[31]. We can also set S = I in ATC and CTA to derive simplified versions
that have no gradient exchange [29]. Furthermore, if in CTA (P2 = I ), we enforce P1 = A to be doubly
stochastic, set S = I , and use a time-decaying step-size parameter (µk (i) → 0), then we obtain the
unconstrained version used by [39], [43]. The matrices {P1 , P2 , S} are required to satisfy:

P1T 1 = 1, P2T 1 = 1, S 1 = 1 (29)

where the notation 1 denotes a vector whose entries are all equal to one.

A. Error Recursions

We first derive the error recursions corresponding to the general diffusion formulation in (26)–(28).
Introduce the error vectors:

φ̃k,i , wo −φk,i , ψ̃k,i , wo −ψk,i , w̃k,i , wo −wk,i (30)

Then, subtracting both sides of (26)–(28) from wo gives:

N
X
φ̃k,i−1 = p1,l,k w̃l,i−1 (31)
l=1
N
X h i
ψ̃k,i = φ̃k,i−1 + µk sl,k ∇w Jl (φk,i−1 ) + vl,i (φk,i−1 ) (32)
l=1
N
X
w̃k,i = p2,l,k ψ̃l,i (33)
l=1

November 27, 2024 DRAFT

Expression (32) still includes terms that depend on φk,i−1 and not on the error quantity, φ̃k,i−1 . We
can find a relation in terms of φ̃k,i−1 by calling upon the following result from [44, p.24] for any
twice-differentiable function f (·):
Z 1
2
∇f (y) = ∇f (x) + ∇ f x+t(y−x) dt (y − x) (34)
0

where ∇2 f (·) denotes the Hessian matrix of f (·) and is symmetric. Now, since each component function
Jl (w) has a minimizer at wo , then, ∇w Jl (wo ) = 0 for l = 1, 2, . . . , N . Applying (34) to Jl (w) using
x = wo and y = φk,i−1 , we get

∇w Jl (φk,i−1 )
Z 1
o
= ∇w Jl (w ) − ∇2w Jl o
w − tφ̃k,i−1 dt φ̃k,i−1
0

, −Hl,k,i−1 φ̃k,i−1 (35)

where we are introducing the symmetric random matrix

Z 1
Hl,k,i−1 , ∇2w Jl wo − tφ̃k,i−1 dt (36)
0

Observe that one such matrix is associated with every edge linking two nodes (l, k); observe further that
this matrix changes with time since it depends on the estimate at node k . Substituting (35)–(36) into (32)
leads to:
N
" #
X
ψ̃k,i = IM − µk sl,k Hl,k,i−1 φ̃k,i−1
l=1
N
X
+ µk sl,k vl,i (φk,i−1 ) (37)
l=1

We introduce the network error vectors, which collect the error quantities across all nodes:
     
 φ̃1,i   ψ̃1,i   w̃1,i 
 .   .   . 
     
φ̃i ,  ..  , ψ̃i ,  ..  , w̃i ,  ..  (38)
     
     
φ̃N,i ψ̃N,i w̃N,i

November 27, 2024 DRAFT

and the following block matrices:

P1 = P1 ⊗ IM , P2 = P2 ⊗ IM (39)

S = S ⊗ IM , M = Ω ⊗ IM (40)

Ω = diag {µ1 , . . . , µN } (41)

N
X n o
D i−1 = diag sl,1 Hl,1,i−1 , · · · , sl,N Hl,N,i−1 (42)
l=1
N
X n o
gi = col sl,1 vl,i (φ1,i−1 ), · · · ,sl,N vl,i (φN,i−1 ) (43)
l=1

where the symbol ⊗ denotes Kronecker products [50]. Then, recursions (31), (37) and (33) lead to:

w̃i = P2T [IM N − MD i−1 ]P1T w̃i−1 + P2T Mgi (44)

To proceed with the analysis, we introduce the following assumption on the cost functions and gradient
noise, followed by a lemma on Hl,k,i−1 .

Assumption 1 (Bounded Hessian). Each component cost function Jl (w) has a bounded Hessian matrix,
i.e., there exist nonnegative real numbers λl,min and λl,max such that λl,min ≤ λl,max and that for all w:

λl,min IM ≤ ∇2w Jl (w) ≤ λl,max IM (45)

Furthermore, the {λl,min }N

l=1 satisfy
N
X
sl,k λl,min > 0, k = 1, 2, . . . , N (46)
l=1

Condition (46) ensures that the local cost functions {Jkloc (w)} defined earlier in (4) are strongly convex
and, hence, have a unique minimizer at wo .

Assumption 2 (Gradient noise). There exist α ≥ 0 and σv2 ≥ 0 such that, for all w ∈ Fi−1 and for all
i, l:

E {vl,i (w) | Fi−1 } = 0 (47)

E kvl,i (w)k2 ≤ αEkwo − wk2 + σv2

(48)

where Fi−1 denotes the past history (σ−field) of estimates {wk,j } for j ≤ i − 1 and all k .

November 27, 2024 DRAFT

Lemma 1 (Bound on Hl,k,i−1 ). Under Assumption 1, the matrix Hl,k,i−1 defined in (36) is a nonnegative-
definite matrix that satisfies:

λl,min IM ≤ Hl,k,i−1 ≤ λl,max IM (49)

Proof: It suffices to prove that λl,min ≤ xT Hl,k,i−1 x ≤ λl,max for arbitrary M × 1 unit Euclidean norm
vectors x. By (36) and (45), we have
Z 1
T
x Hl,k,i−1 x = xT ∇2w Jl wo − tφ̃k,i−1 x dt
0
Z 1
≤ λl,max dt = λl,max
0

In a similar way, we can verify that xT Hl,k,i−1 x ≥ λl,min .

In distributed subgradient methods (e.g., [15], [39], [43]), the norms of the subgradients are usually
required to be uniformly bounded. Such assumption is restrictive in the unconstrained optimization of
differentiable functions. Assumption 1 is more relaxed in that it allows the gradient vector ∇w Jl (w)
to have unbounded norm (e.g., quadratic costs). Furthermore, condition (48) allows the variance of the
gradient noise to grow no faster than Ekwo − wk2 . This condition is also more general than the uniform
bounded assumption used in [39] (Assumptions 5.1 and 6.1), which requires instead:

Ekvl,i (w)k2 ≤ σv2 , E kvl,i (w)k2 |Fi−1 ≤ σv2

(50)

Furthermore, condition (48) is similar to condition (4.3) in [51, p.635]:

h i
E kvl,i (w)k2 |Fi−1 ≤ α k∇w Jl (w)k2 + 1

(51)

which is a combination of the “relative random noise” and the “absolute random noise” conditions defined
in [44, pp.100–102]. Indeed, we can derive (48) by substituting (35) into (51), taking expectation with
respect to Fi−1 , and then using (49).

Example 2. Such a mix of “relative random noise” and “absolute random noise” is of practical impor-
tance. For instance, consider an example in which the loss function at node l is chosen to be of the
following quadratic form:

Ql (w, {ul,i , dl (i)}) = |dl (i) − ul,i w|2

for some scalars {dl (i)} and 1 × M regression vectors {ul,i }. The corresponding cost function is then:

Jl (w) = E|dl (i) − ul,i w|2 (52)

November 27, 2024 DRAFT

Assume further that the data {ul,i , dl (i)} satisfy the linear regression model

dl (i) = ul,i wo + zl (i) (53)

where the regressors {ul,i } are zero mean and independent over time with covariance matrix Ru,l =
E{uTl,i ul,i }, and the noise sequence {zk (j)} is also zero mean, white, with variance σz,k
2 , and independent

of the regressors {ul,i } for all l, k, i, j . Then, using (53) and (23), the gradient noise in this case can be
expressed as:

vl,i (w) = 2(Ru,l − uTl,i ul,i )(wo − w) − 2uTl,i zl (i) (54)

It can easily be verified that this noise satisfies both conditions stated in Assumption 2, namely, (47) and
also:

E kvl,i (w)k2

≤ 4EkRu,l −uTl,i ul,i k2 · Ekwo −wk2 +4σz,l

2
Tr(Ru,l ) (55)

for all w ∈ Fi−1 . Note that both relative random noise and absolute random noise components appear in
(55) and are necessary to model the statistical gradient perturbation even for quadratic costs. Such costs,
and linear regression models of the form (53), arise frequently in the context of adaptive filters — see,
e.g., [9], [26]–[33], [36], [46], [52]–[55].

Example 3. Quadratic costs of the form (52) are common in mean-square-error estimation for linear
regression models of the type (53). If we use the instantaneous approximations as is common in the context
of stochastic approximation and adaptive filtering [44], [46], [52], then the actual gradient ∇w Jl (w) can
be approximated by

∇
b w Jl (w) = ∇w Ql (w, {ul,i , dl (i)})

= −2uTl,i [dl (i) − ul,i w] (56)

Substituting into (24)–(25), and assuming C = I for illustration purposes only, we arrive at the following
ATC and CTA diffusion strategies originally proposed and extended in [26]–[31] for the solution of
distributed mean-square-error estimation problems:

ψk,i = wk,i−1 +2µk uTk,i [dk (i)−uk,i wk,i−1 ]

(ATC) X (57)
wk,i = al,k ψl,i
l∈Nk

November 27, 2024 DRAFT

X
ψk,i−1 = al,k wl,i−1
(CTA) l∈Nk (58)
wk,i = ψk,i−1 +2µk uTk,i [dk (i)−uk,i ψk,i−1 ]

B. Variance Relations

The purpose of the mean-square analysis in the sequel is to answer two questions in the presence
of gradient perturbations. First, how small the mean-square error, Ekw̃k,i k2 , gets as i → ∞ for any
of the nodes k . Second, how fast this error variance tends towards its steady-state value. The first
question pertains to steady-state performance and the second question pertains to transient/convergence
rate performance. Answering such questions for a distributed algorithm over a network is a challenging
task largely because the nodes influence each other’s behavior: performance at one node diffuses through
the network to the other nodes as a result of the topological constraints linking the nodes. The approach
we take to examine the mean-square performance of the diffusion algorithms is by studying how the
variance Ekw̃k,i k2 , or a weighted version of it, evolves over time. As the derivation will show, the
evolution of this variance satisfies a nonlinear relation. Under some reasonable assumptions on the noise
profile, and the local cost functions, we will be able to bound these error variances as well as estimate
their steady-state values for sufficiently small step-sizes. We will also derive closed-form expressions that
characterize the network performance. The details are as follows.
Equating the squared weighted Euclidean norm of both sides of (44), applying the expectation operator
and using using (47), we can show that the following variance relation holds:
n o
Ekw̃i k2Σ = E kw̃i−1 k2Σ0 + EkP2T Mgi k2Σ
(59)
Σ0 = P1 [IM N −MD i−1 ]P2 ΣP2T [IM N −MD i−1 ]P1T
where Σ is a positive semi-definite weighting matrix that we are free to choose. The variance expression
(59) shows how the quantity Ekw̃i k2Σ evolves with time. Observe, however, that the weighting matrix
on w̃i−1 on the right-hand side of (59) is a different matrix, denoted by Σ0 , and this matrix is actually
random in nature (while Σ is deterministic). As such, result (59) is not truly a recursion. Nevertheless,
it is possible, under a small step-size approximation, to rework variance relations such as (59) into a
recursion by following certain steps that are characteristic of the energy conservation approach to mean-
square analysis [46].
The first step in this regard would be to replace Σ0 by its mean EΣ0 . However, the matrix Σ0 depends on
the {Hl,k,i−1 } via D i−1 (see (42)). It follows from the definition of Hl,k,i−1 in (36) that Σ0 is dependent

November 27, 2024 DRAFT

on φ̃k,i−1 as well, which in turn is a linear combination of the {w̃l,i−1 }. Therefore, the main challenge to
continue from (59) is that Σ0 depends on w̃i−1 . For this reason, we cannot apply directly the traditional
step of replacing Σ0 in the first equation of (59) by EΣ0 as is typically done in the study of stand-alone
adaptive filters to analyze their transient behavior [46, p.345]; in the case of conventional adaptive filters,
the matrix Σ0 is independent of w̃i−1 . To address this difficulty, we shall adjust the argument to rely
on a set of inequality recursions that will enable us to bound the steady-state mean-square-error at each
node — see Theorem 1 further ahead.
The procedure is as follows. First, we note that kxk2 is a convex function of x, and that expressions
(31) and (33) are convex combinations of {w̃l,i−1 } and {ψ̃l,i }, respectively. Then, by Jensen’s inequality
[56, p.77] and taking expectations, we obtain
N
X
Ekφ̃k,i−1 k2 ≤ p1,l,k Ekw̃l,i−1 k2 (60)
l=1
N
X
Ekw̃k,i k2 ≤ p2,l,k Ekψ̃l,i k2 (61)
l=1

for k = 1, . . . , N . Next, we derive a variance relation for (37). Equating the squared Euclidean norms of
both sides of (37), applying the expectation operator, and using (47) from Assumption 2, we get
n o N 2
X
Ekψ̃k,i k = E kφ̃k,i−1 k2Σk,i−1 +µ2k E
2
sl,k vl,i (φk,i−1 ) (62)
l=1

where
" N
#2
X
Σk,i−1 = IM −µk sl,k Hl,k,i−1 (63)
l=1

We call upon the following two lemmas to bound (62).

Lemma 2 (Bound on Σk,i−1 ). The weighting matrix Σk,i−1 defined in (63) is a symmetric, positive
semi-definite matrix, and satisfies:

0 ≤ Σk,i−1 ≤ γk2 IM (64)

where
N N
( )
X X
γk , max 1−µk sl,k λl,max , 1−µk sl,k λl,min (65)
l=1 l=1

November 27, 2024 DRAFT

Proof: By definition (63) and the fact that Hl,k,i−1 is symmetric — see definition (36), the matrix
IM−µk N
P
l=1 sl,k Hl,k,i−1 is also symmetric. Hence, its square, Σk,i−1 , is symmetric and also nonnegative-

definite. To establish (64), we first use (49) to note that:

N N
!
X X
IM −µk sl,k Hl,k,i−1 ≥ 1−µk sl,k λl,max IM (66)
l=1 l=1
N N
!
X X
IM −µk sl,k Hl,k,i−1 ≤ 1−µk sl,k λl,min IM (67)
l=1 l=1
PN
The matrix IM − µk l=1 sl,k Hl,k,i−1 may not be positive semi-definite because we have not specified
a range for µk yet; the expressions on the right-hand side of (66)–(67) may still be negative. However,
inequalities (66)–(67) imply that the eigenvalues of IM − µk N
P
l=1 sl,k Hl,k,i−1 are bounded as:
N N
!
X X
λ IM − µk sl,k Hl,k,i−1 ≥ 1 − µk sl,k λl,max (68)
l=1 l=1
N N
!
X X
λ IM − µk sl,k Hl,k,i−1 ≤ 1 − µk sl,k λl,min (69)
l=1 l=1

By definition (63), Σk,i−1 is the square of the symmetric matrix IM −µk N

P
l=1 sl,k Hl,k,i−1 , meaning that
" N
!#2
X
λ (Σk,i−1 ) = λ IM − µk sl,k Hl,k,i−1 ≥0 (70)
l=1

Substituting (68)–(69) into (70) leads to

λ (Σk,i−1 )
( N 2 N 2)
X X
≤ max 1 − µk sl,k λl,max , 1 − µk sl,k λl,min (71)
l=1 l=1

which is equivalent to (64).

Lemma 3 (Bound on noise combination). The second term on the right-hand-side of (62) satisfies:
N 2 h i
X
E sl,k vl,i (φk,i−1 ) ≤ kSk21 · αEkφ̃k,i−1 k2 +σv2 (72)
l=1

where kSk1 denotes the 1-norm of the matrix S (i.e., the maximum absolute column sum).

November 27, 2024 DRAFT

Proof: Applying Jensen’s inequality [56, p.77], it holds that

N 2
X
E sl,k vl,i (φk,i−1 )
l=1
N N 2
X 2 X sl,k
= sl,k ·E PN vl,i (φk,i−1 )
l=1 l=1 l=1 sl,k
N N
X 2 X sl,k
≤ sl,k · PN Ekvl,i (φk,i−1 )k2
l=1 l=1 l=1 s l,k
N
X XN
= sl,k · sl,k Ekvl,i (φk,i−1 )k2
l=1 l=1
N
X 2 h i
2 2
≤ sl,k · αEkφ̃k,i−1 k + σv (73)
l=1
h i
≤ kSk21 · αEkφ̃k,i−1 k2 + σv2 (74)

where inequality (73) follows by substituting (48), and (74) is obtained using the fact that kSk1 is the
maximum absolute column sum and that the entries {sl,k } are nonnegative.

Substituting (64) and (72) into (62), we obtain:

Ekψ̃k,i k2 ≤ (γk2 +µ2k αkSk21 )· Ekφ̃k,i−1 k2 +µ2k kSk21 σv2 (75)

for k = 1, . . . , N . Finally, introduce the following network mean-square-error vectors (compare with
(38)):
     
Ekφ̃ k2 Ekψ̃ k2 Ek w̃ k 2
 1,i   1,i   1,i 
.. .. ..
     
Xi =  , Y = , W =
     
 . 

i 
 . 

i 
 . 

     
Ekφ̃N,i k2 Ekψ̃N,i k2 Ekw̃N,i k2
and the matrix

Γ = diag γ12 + µ21 αkSk21 , . . . , γN

2
+ µ2N αkSk21

(76)

Then, (60)–(61) and (75) can be written as





 Xi−1 P1T Wi−1



Yi ΓXi−1 + σv2 kSk21 Ω2 1 (77)





W P T Y

i 2 i

November 27, 2024 DRAFT

where the notation x y denotes that the components of vector x are less than or equal to the
corresponding components of vector y . We now recall the following useful fact that for any matrix
F with nonnegative entries,

x y ⇒ Fx Fy (78)

This is because each entry of the vector F y − F x = F (y − x) is nonnegative. Then, combining all three
inequalities in (77) leads to:

Wi P2T ΓP1T Wi−1 + σv2 kSk21 · P2T Ω2 1 (79)

C. Mean-Square Stability

Based on (79), we can now prove that, under certain conditions on the step-size parameters {µk }, the
mean-square-error vector Wi is bounded as i → ∞, and we use this result in the next subsection to
evaluate the steady-state MSE for sufficiently small step-sizes.

Theorem 1 (Mean-Square Stability). If the step-sizes {µk } satisfy the following condition:
( )
2σk,max 2σk,min
0 < µk < min 2 , 2 (80)
σk,max +αkSk21 σk,min +αkSk21

for k = 1, . . . , N , where σk,max and σk,min are defined as

N
X N
X
σk,max , sl,k λl,max , σk,min , sl,k λl,min (81)
l=1 l=1

then, as i → ∞,

max µ2k · kSk21 σv2
1≤k≤N
lim sup kWi k∞ ≤ (82)
i→∞ 1 − max (γk2 + µ2k αkSk21 )
1≤k≤N

where kxk∞ denotes the maximum absolute entry of vector x.

Proof: See Appendix A.

If we let α=0 and σv2 =0 in Theorem 1, and examine the arguments leading to it, we conclude the
validity of the following result, which establishes the convergence of the diffusion strategies (24)–(25)
in the absence of gradient noise (i.e., using the true gradient rather than stochastic gradient — see (18)
and (19)).

November 27, 2024 DRAFT

Theorem 2 (Convergence in Noise-free Case). If there is no gradient noise, i.e., α = 0 and σv2 = 0, then
the mean-square-error vector becomes the deterministic vector Wi = col{kw̃1,i k2 , · · · , kw̃N,i k2 }, and its
entries converge to zero if the step-sizes {µk } satisfy the following condition:
2
0 < µk < (83)
σk,max
for k = 1, . . . , N , where σk,max was defined in (81).

We observe that, in the absence of noise, the deterministic error vectors, w̃k,i , will tend to zero as
i → ∞ even with constant (i.e., non-vanishing) step-sizes. This result implies the interesting fact that, in
the noise-free case, the nodes can reach agreement without the need to impose diminishing step-sizes.

D. Steady-State Performance

Expression (80) provides a condition on the step-size parameters {µk } to ensure the mean-square
stability of the diffusion strategies (24)–(25). At the same time, expression (82) gives an upper bound on
how large Wi can be at steady-state. Since the ∞-norm of a vector is defined as the largest absolute value
of its entries, then (82) bounds the MSE of the worst-performing node in the network. We can derive
closed-form expressions for MSEs when the step-sizes are assumed to be sufficiently small. Indeed, we
first conclude from (82) that for step-sizes that are sufficiently small, each wk,i will get closer to wo at
steady-state. To verify this fact, assume the step-sizes are small enough so that the nonnegative factor γk
that was defined earlier in (65) becomes
N
X
γ k = 1 − µk sl,k λl,min = 1 − µk σk,min (84)
l=1

where σk,min was given by (81). Substituting (84) into (82), we obtain:

lim sup kWi k∞

i→∞
!
max µ2k · kSk21 σv2
1≤k≤N
≤ ( )
1− max (1−µk σk,min )2 +µ2k αkSk21
1≤k≤N
!
max µ2k · kSk21 σv2
1≤k≤N
≤ ( " #)
2
min µk 2σk,min − µk (σk,min + αkSk21 )
1≤k≤N

November 27, 2024 DRAFT

kSk21 σv2 µ2max

≤ ( )· (85)
µmin
2
min 2σk,min −µk (σk,min +αkSk21 )
1≤k≤N

where

µmax , max µk , µmin , min µk (86)

1≤k≤N 1≤k≤N

For sufficiently small step-sizes, the denominator in (85) can be approximated as

2
2σk,min −µk (σk,min +αkSk21 ) ≈ 2σk,min (87)

Substituting into (85), we get

kSk21 σv2 µ2
lim sup kWi k∞ ≤ · max (88)
i→∞ 2 min σk,min µmin
1≤k≤N

Therefore, if the step-sizes are sufficiently small, the MSE of each node becomes small as well. This
result is clear when all nodes use the same step-sizes such that µmax = µmin = µ. Then, the right-hand
side of (88) is on the order of O(µ), as indicated. It follows that {w̃k,i } are small in the mean-square-error
sense at small step-sizes, which also means that the mean-square value of φ̃k,i−1 is small because it is
a convex combination of {w̃k,i } (recall (31)). Then, by definition (36), in steady-state (for large enough
i), the matrix Hl,k,i−1 can be approximated by:
Z 1
Hl,k,i−1 ≈ ∇2 Jl (wo )dt = ∇2 Jl (wo ) (89)
0

In this case, the matrix Hl,k,i−1 is not random anymore and is not dependent on the error vector φ̃k,,i−1 .
Accordingly, in steady-state, the matrix D i−1 that was defined in (42) is not random anymore and it
becomes
N
X n o
2 o 2 o
D i−1 ≈ D∞ , diag sl,1 ∇w Jl (w ), · · · ,sl,N ∇w Jl (w ) (90)
l=1

As a result, in steady-state, the original error recursion (44) can be approximated by

w̃i = P2T [IM N − MD∞ ]P1T w̃i−1 + P2T Mgi (91)

Taking expectations of both sides of (91), we obtain the following mean-error recursion

Ew̃i = P2T [IM N − MD∞ ]P1T · Ew̃i−1 , i→∞ (92)

which converges to zero if the matrix

B , P2T [IM N − MD∞ ]P1T (93)

November 27, 2024 DRAFT

is stable. The stability of B can be guaranteed when the step-sizes are sufficiently small (or chosen
according to (80)) — see the proof in Appendix C. Therefore, in steady-state, we have

lim Ew̃i = 0 (94)

i→∞

Next, we determine an expression (rather than a bound) for the MSE. To do this, we need to evaluate the
covariance matrix of the gradient noise vector gi . Recall from (43) that gi depends on {φk,i−1 }, which
is close to wo at steady-state for small step-sizes. Therefore, it is sufficient to determine the covariance
matrix of gi at wo . We denote this covariance matrix by:

Rv , E{gi giT }
φk,i−1 =wo
"
N
#
 X n o
=E col sl,1 vl,i (wo ), · · · , sl,N vl,i (wo )

l=1
# 
N o T
"
X n
× col sl,1 vl,i (wo ), · · · , sl,N vl,i (wo ) (95)

l=1

In practice, we can evaluate Rv from the expressions of {vl,i (wo )}. For example, for the case of the
quadratic cost (52), we can substitute (54) into (95) to evaluate Rv .
Returning to the last term in the first equation of (59), we can evaluate it as follows:

EkP2T Mgi k2Σ = EgiT MP2 ΣP2T Mgi

= Tr ΣP2T ME{gi giT }MP2

= Tr ΣP2T MRv MP2

(96)

Using (90), the matrix Σ0 in (59) becomes a deterministic quantity as well, and is given by:

Σ0 ≈ P1 [IM N − MD∞ ]P2 ΣP2T [IM N − MD∞ ]P1T (97)

Substituting (96) and (97) into (59), an approximate variance relation is obtained for small step-sizes:

Ekw̃i k2Σ ≈ Ekw̃i−1 k2Σ0 + Tr ΣP2T MRv MP2

(98)

Σ0 ≈ P1 [IM N−MD∞ ]P2 ΣP2T [IM N−MD∞ ]P1T (99)

Let σ = vec(Σ) denote the vectorization operation that stacks the columns of a matrix Σ on top of each
other. We shall use the notation kxk2σ and kxk2Σ interchangeably to denote the weighted squared Euclidean
norm of a vector. Using the Kronecker product property [57, p.147]: vec(U ΣV ) = (V T ⊗ U )vec(Σ), we
can vectorize Σ0 in (99) and find that its vector form is related to Σ via the following linear relation:

November 27, 2024 DRAFT

σ 0 , vec(Σ0 ) ≈ Fσ , where, for sufficiently small steps-sizes (so that higher powers of the step-sizes can
be ignored), the matrix F is given by

F , P1 [IM N −MD∞ ]P2 ⊗ P1 [IM N −MD∞ ]P2 (100)

Here, we used the fact that M and D∞ are block diagonal and symmetric. Furthermore, using the property
Tr(ΣX) = vec(X T )T σ , we can rewrite (98) as
T
Ekw̃i k2σ ≈ Ekw̃i−1 k2F σ + vec P2T MRv MP2 σ

(101)

It is shown in [46, pp.344–346] that recursion (101) converges to a steady-state value if the matrix F
is stable. This condition is guaranteed when the step-sizes are sufficiently small (or chosen according to
(80)) — see Appendix C. Finally, denoting

Ekw̃∞ k2σ , lim Ekw̃i k2σ (102)

i→∞

and letting i → ∞, expression (101) becomes

T
Ekw̃∞ k2σ ≈ Ekw̃∞ k2F σ + vec P2T MRv MP2

σ

so that
T
Ekw̃∞ k2(I−F )σ ≈ vec P2T MRv MP2

σ (103)

Expression (103) is a useful result: it allows us to derive several performance metrics through the proper
selection of the free weighting parameter σ (or Σ). First, to be able to evaluate steady-state performance
metrics from (103), we need (I − F) to be invertible, which is guaranteed by the stability of matrix F
— see Appendix C. Given that (I − F) is a stable matrix, we can now resort to (103) and use it to
evaluate various performance metrics by choosing proper weighting matrices Σ (or σ ), as it was done in
[31] for the mean-square-error estimation problem. For example, the MSE of any node k can be obtained
by computing Ekw̃∞ k2T with a block weighting matrix T that has an identity matrix at block (k, k) and
zeros elsewhere:

Ekw̃k,∞ k2 = Ekw̃∞ k2T (104)

Denote the vectorized version of this matrix by tk , i.e.,

tk , vec(diag(ek ) ⊗ IM ) (105)

November 27, 2024 DRAFT

where ek is a vector whose k th entry is one and zeros elsewhere. Then, if we select σ in (103) as
σ = (I − F)−1 tk , the term on the left-hand side becomes the desired Ekw̃k,∞ k2 and MSE for node k
is therefore given by:
T
MSEk ≈ vec P2T MRv MP2 (I − F)−1 tk

(106)

This value for MSEk is actually the k th entry of W∞ defined as

W∞ , lim Wi (107)
i→∞

Then, we arrive at an expression for W∞ (as opposed to the bound for it in (82), as was explained earlier;
expression (108) is derived under the assumption of sufficiently small step-sizes):
n T o
W∞ ≈ IN ⊗ vec P2T MRv MP2 (I−F)−1 t (108)

where t = col{t1 , . . . , tN }. If we are interested in the network MSE, then the weighting matrix of
Ekw̃∞ k2T should be chosen as T = IM N /N . Let q denote the vectorized version of IM N , i.e.,

q , vec(IM N ) (109)

and select σ in (103) as σ = (I −F)−1 q/N . The network MSE is then given by
N
1 X
MSE , MSEk
N
k=1 (110)
1 T
≈ vec P2T MRv MP2 (I − F)−1 q
N
The approximate expressions (108) and (110) hold when the step-sizes are small enough so that (90)
holds. In the next section, we will see that they are consistent with the simulation results.

V. S IMULATION R ESULTS

In this section we illustrate the performance of the diffusion strategies (24)–(25) by considering two
applications. We consider a randomly generated connected network topology with a cyclic path. There
are a total of N = 10 nodes in the network, and nodes are assumed connected when they are close
enough geographically. In the simulations, we consider two applications: a regularized least-mean-squares
estimation problem with sparse parameters, and a collaborative localization problem.

November 27, 2024 DRAFT

A. Distributed Estimation with Sparse Data

Assume each node k has access to data {Uk,i , dk,i }, generated according to the following model:

dk,i = Uk,i wo + zk,i (111)

where {Uk,i } is a sequence of K × M i.i.d. Gaussian random matrices. The entries of each Uk,i have
zero mean and unit variance, and zk,i ∼ N (0, σz2 IK ) is the measurement noise that is temporally and
spatially white and is independent of Ul,j for all k, l, i, j . Our objective is to estimate wo from the data
set {Uk,i , dk,i } in a distributed manner. In many applications, the vector wo is sparse such as

wo = [1 0 . . . 0 1]T ∈ RM

One way to search for sparse solutions is to consider a global cost function of the following form [58]:
N
X
J glob (w) = Ekdl,i − Ul,i wk22 + ρR(w) (112)
l=1
where R(w) and ρ are the regularization function and regularization factor, respectively. A popular
choice is R(w) = kwk1 , which helps enforce sparsity and is convex [58]–[63]. However, this choice
is non-differentiable, and we would need to apply sub-gradient methods [44, pp.138–144] for a proper
implementation. Instead, we use the following twice-differentiable approximation for kwk1 :
M p
X
R(w) = [w]2m + 2 (113)
m=1
where [w]m denotes the m-th entry of w, and is a small number. We see that, as goes to zero,
R(w) ≈ kwk1 . Obviously, R(w) is convex, and we can apply the diffusion algorithms to minimize (112)
in a distributed manner. To do so, we decompose the global cost into a sum of N individual costs:
ρ
Jl (w) = Ekdl,i − Ul,i wk22 + R(w) (114)
N
for l = 1, . . . , N . Then, using algorithms (18) and (19), each node k would update its estimate of wo by
using the gradient vectors of {Jl (w)}l∈Nk , which are given by:
T T

∇w Jl (w) = 2E Ul,i Ul,i w − 2E Ul,i dl,i
ρ
+ ∇w R(w) (115)
N
However, the nodes are assumed to have access to measurements {Ul,i , dl,k } and not to the second-

TU
order moments E Ul,i T
l,i and E Ul,i dl,i . In this case, nodes can use the available measurements to

approximate the gradient vectors in (24) and (25) as:

b w Jl (w) = 2U T [Ul,i w−dl,i ]+ ρ ∇w R(w)

∇ (116)
l,i
N

November 27, 2024 DRAFT

where
#T
[w]1 [w]M
"
∇w R(w) = p[w]2 + 2 ··· q (117)
1 [w]2M + 2

In the simulation, we set M = 50, K = 5, σv2 = 1, and wo = [1 0 . . . 0 1]T . We apply both diffusion and
incremental methods to solve the distributed learning problem, where the incremental approach [6]–[9]
uses the following construction to determine wi :

• Start with ψ0,i = wi−1 at the node at the beginning of the incremental cycle.
• Cycle through the nodes k = 1, . . . , N :

ψk,i = ψk−1,i − µ∇
b w Jk (ψk−1,i ) (118)

• Set wi ← ψN,i .
• Repeat.

The results are averaged over 100 trials. The step-sizes for ATC, CTA and non-cooperative algorithms are
set to µ = 10−3 , and the step-size for the incremental algorithm is set to µ = 10−3 /N . This is because the
incremental algorithm cycles through all N nodes every iteration. We therefore need to ensure the same
convergence rate for both algorithms for a fair comparison [35]. For ATC and CTA strategies, we use
simple averaging weights for the combination step, and for ATC and CTA with gradient exchange, we use
Metropolis weights for {cl,k } to combine the gradients (see Table III in [31] for the definitions of averaging
weights and Metropolis weights). We use expression (110) to evaluate the theoretical performance of the
diffusion strategies. As a remark, expression (110) gives the MSE with respect to the minimizer of the
cost J glob (w) in (112). In this example, the minimizer of the cost (112), denoted as ŵo , is biased away
from the model parameter wo in (111) when the regularization factor γ 6= 0. To evaluate the theoretical
MSE with respect to wo , we use
N
1 X
MSD = lim Ekwo − wk,i k2
i→∞ N
k=1
N
1 X
= Ekwo − ŵo k2 + lim Ekŵo − wk,i k2 (119)
i→∞ N
k=1

where the second term in (119) can be evaluated by expression (110) with wo replaced by ŵo . Moreover,
in the derivation of (119), we used the fact that limi→∞ E(ŵo −wk,i ) = 0 to eliminate the cross term, and
this result is due to (94) with wo there replaced by ŵo . Fig. 2(a) shows the learning curves for different
algorithms for γ = 2 and = 10−3 . We see that the diffusion and incremental schemes have similar
performance, and both of them have about 10 dB gain over the non-cooperation case. To examine the

November 27, 2024 DRAFT

Steady state MSE (dB)

5 −10
Incremental
0 ATC (S=I) −15 ²=1
CTA (S=I)
Network MSE (dB)

Incremental
-5 ATC (S=C) −20
ATC (S=I)
CTA (S=C) ATC (S=I), theory
-10 Non-cooperative −25 CTA(S=I)
0 2 4 6 8 10 12 14 16 18 20
Regularization factor γ CTA(S=I), theory
-15 ATC (S=C)

Steady state MSE (dB)

−10
ATC (S=C), theory
-20 CTA (S=C)
−15 ² = 10−2 CTA (S=C), theory
-25 −20

−25
-30
0 500 1000 1500 2000 0 2 4 6 8 10 12 14 16 18 20
Number of Iterations Regularization factor γ

(a) Learning curves (γ = 2 and = 10−3 ). (b) Steady-state MSD (µ = 10−3 ).

Fig. 2. Transient and steady-state performance of distributed estimation with sparse parameters.

impact of the parameter and the regularization factor γ , we show the steady-state MSE for different
values of γ and in Fig. 2(b). When is small ( = 10−2 ), adding a reasonable regularization (γ = 1 ∼ 4)
decreases the steady-state MSE. However, when is large ( = 1), expression (113) is no longer a good
approximation for kwk1 , and regularization does not improve the MSE.

B. Distributed Collaborative Localization

The previous example deals with a convex cost (112). Now, we consider a localization problem that
has a non-convex cost function and apply the same diffusion strategies to its solution. Assume each node
is interested in locating a common target located at wo = [0 0]T . Each node k knows its position xk and
has a noisy measurement of the squared distance to the target:

dk (i) = kwo − xk k2 + zk (i), k = 1, 2, . . . , N

2 ) is the measurement noise of node k at time i. The component cost function
where zk (i) ∼ N (0, σz,k
Jk (w) at node k is chosen as
1 2
Jk (w) = E dk (i) − kw − xk k2 (120)
4
where we multiply by 1/4 here to eliminate a factor of 4 that will otherwise appear in the gradient. If each
node k minimizes Jk (w) individually, it is not possible to solve for wo . Therefore, we use information
from other nodes, and instead seek to minimize the following global cost:
N
1X 2
J glob (w) = E dk (i) − kw − xk k2 (121)
4
k=1

November 27, 2024 DRAFT

This problem arises, for example, in cellular communication systems, where multiple base-stations are
interested in locating users using the measured distances between themselves and the user. Diffusion
algorithms (18) and (19) can be applied to solve the problem in a distributed manner. Each node k would
update its estimate of wo by using the gradient vectors of {Jl (w)}l∈Nk , which are given by:

∇w Jl (w) = −Edl (i) (w − xl ) + kw − xl k2 (w − xl ) (122)

However, the nodes are assumed to have access to measurements {dl (i), xl } and not to Edl (i). In this
case, nodes can use the available measurements to approximate the gradient vectors in (24) and (25) as:

b w Jl (w) = −dl (i)(w − xl ) + kw − xl k2 (w − xl )

∇ (123)

If we do not exchange the local gradients with neighbors, i.e., if we set S = I , then the base-stations
only share the local estimates of the target position wo with their neighbors (no exchange of {xl }l∈Nk ).
We first simulate the stationary case, where the target stays at wo . In Fig. 3(a), we show the MSE curves
2 = 1. We
for non-cooperative, ATC, CTA, and incremental algorithms. The noise variance is set to σz,k
set the step-sizes to µ = 0.0025/N for the incremental algorithm, and µ = 0.0025 for other algorithms.
For ATC and CTA strategies, we use simple averaging for the combination step {al,k }, and for ATC
and CTA with gradient exchange, we use Metropolis weights for {cl,k } to combine the gradients. The
performance of CTA and ATC algorithms are close to each other, and both of them are close to the
incremental scheme. In Fig. 3(b), we show the steady state MSE with respect to different values of µ.
We also use expression (110) to evaluate the theoretical performance of the diffusion strategies. As the
step-size becomes small, the performances of diffusion and incremental algorithms are close, and the
MSE decreases as µ decreases. Furthermore, we see that exchanging only local estimates (S = I ) is
enough for localization, compared to the case of exchanging both local estimates and gradients (S = C ).
Next, we apply the algorithms to a non-stationary scenario, where the target moves along a trajectory,
as shown in Fig. 3(c). The step-size is set to µ = 0.01 for diffusion algorithms, and to µ = 0.01/N
for the incremental approach. To see the advantage of using a constant step-size for continuous tracking,
we also simulate the vanishing step-size version of the algorithm from [39], [43] (µk,i = 0.01/i). The
diffusion algorithms track well the target but not the non-cooperative algorithm and the algorithm from
[39], [43], because a decaying step-size is not helpful for tracking. The tracking performance is shown
in Fig. 3(d).

November 27, 2024 DRAFT

555
Incremental

0 -5 ATCATC
CTA (S=C)
(S=I)
(S=C) ATC (S=C)
CTA (S=C)
Network MSE

-10 CTACTA (S=C)

(S=I)
Non-cooperation −15 CTA (S=C)
Network MSE (dB)

-10 Non-cooperation
-5 -10 ATCNon-cooperation
(S=C)
AverageNetwork

−20 Non−cooperative
-15 CTA (S=C) ATC (S=I), theory
-15 −25
-10 -15 Non-cooperative CTA (S=I), theory
-20 −30 ATC (S=C), theory
-20
Average

-15 -20 CTA (S=C), theory

−35
-25
-25
-20 -25 −40
-30
-30
-25 -30000 500 1000 1500 2000 −2 −3
500
500 1000 1500 2000 10 10
Number 1000
Number
1500
of Iterations 2000
Step−size µ
Number of
of Iterations
Iterations
-30
0
(a) Learning 500 1000
curves for stationary 1500
target 2000
(µ = 0.0025). (b) Steady-state performance for stationary target.
Number of Iterations
Target Trajectory 20
6
Incremental
ATC (S=I)
4 CTA (S=I) 10
Incremental
Network MSE (dB)
ATC (S=C)
CTA (S=C) ATC (S=I)
2
Approach in [39,43] 0 CTA (S=I)
y (km)

Non−cooperative ATC (S=C)

0 CTA (S=C)
-10 Approach in [39,43]
−2 Non-cooperative

-20
−4

−6 -30
−6 −4 −2 0 2 4 6 0 200 400 600 800 1000
x (km) Number of Iterations
(c) Tracking a moving-target by node 1 (µ = 0.01). (d) Learning curves for moving target (µ = 0.01).

Fig. 3. Performance of distributed localization for stationary and moving targets. Diffusion strategies employ constant step-sizes,
which enable continuous adaptation and learning even when the target moves (which corresponds to a changing cost function).

VI. C ONCLUSION

This paper proposed diffusion adaptation strategies to optimize global cost functions over a network
of nodes, where the cost consists of several components. Diffusion adaptation allows the nodes to
solve the distributed optimization problem via local interaction and online learning. We used gradient
approximations and constant step-sizes to endow the networks with continuous learning and tracking
abilities. We analyzed the mean-square-error performance of the algorithms in some detail, including
their transient and steady-state behavior. Finally, we applied the scheme to two examples: distributed
sparse parameter estimation and distributed localization. Compared to incremental methods, diffusion

November 27, 2024 DRAFT

strategies do not require a cyclic path over the nodes, which makes them more robust to node and link
failure.

A PPENDIX A
P ROOF OF M EAN -S QUARE S TABILITY

Taking the ∞−norm of both sides of (79), we obtain

kWi k∞ ≤ kP2T ΓP1T k∞ ·kWi−1 k∞ +σv2 kSk21 ·kP2T k∞ ·kΩk2∞

≤ kP2T k∞ · kΓk∞ · kP1T k∞ · kWi−1 k∞

+ σv2 kSk21 · kP2T k∞ · kΩk2∞

= kΓk∞ ·kWi−1 k∞ + max µ2k ·σv2 kSk21 (124)
1≤k≤N

where we used the fact that kP1T k∞ = kP2T k∞ = 1 because each row of P1T and P2T sums up to one.
Moreover, from (76), we have

kΓk∞ = max (γk2 + µ2k αkSk21 ) (125)

1≤k≤N

Iterating (124), we obtain

kWi k∞ ≤ kΓki∞ · kW0 k∞

i−1
X
+ max µ2k · σv2 kSk21 kΓkj∞ (126)
1≤k≤N
j=0

We are going to show further ahead that condition (80) guarantees kΓk∞ < 1. Now, given that kΓk∞ < 1,
the first term on the right hand side of (126) converges to zero as i → ∞, and the second term on the
right-hand side of (126) converges to:
i−1
X σv2 kSk21
lim σv2 kSk21 kΓkj∞ = (127)
i→∞ 1 − kΓk∞
j=0

Therefore, we establish (82) as follows:

σ 2 kSk2
lim sup kWi k∞ ≤ max µ2k · v 1
i→∞ 1≤k≤N 1 − kΓk∞
!
max µ2k · kSk21 σv2
1≤k≤N
= (128)
1 − max (γk2 + µ2k αkSk21 )
1≤k≤N

November 27, 2024 DRAFT

The only fact that remains to prove is to show that (80) ensures kΓk∞ < 1. From (125), we see that
the condition kΓk∞ < 1 is equivalent to requiring:

γk2 + µ2k αkSk21 < 1, k = 1, . . . , N. (129)

Then, using (65), this is equivalent to:

N
X 2
1 − µk sl,k λl,max + µ2k αkSk21 < 1 (130)
l=1
N
X 2
1 − µk sl,k λl,min + µ2k αkSk21 < 1 (131)
l=1
for k = 1, . . . , N . Recalling the definitions for σk,max and σk,min in (81) and solving these two quadratic
inequalities with respect to µk , we arrive at:
2σk,max 2σk,min
0 < µk < 2 , 0 < µk <
+ αkSk21
σk,max 2
σk,min + αkSk21
and we are led to (80).

A PPENDIX B
B LOCK M AXIMUM N ORM OF A M ATRIX

Consider a block matrix X with blocks of size M × M each. Its block maximum norm is defined as
[35]:
kXxkb,∞
kXkb,∞ , max (132)
x6=0 kxkb,∞
where the block maximum norm of a vector x , col{x1 , . . . , xN }, formed by stacking N vectors of size
M each on top of each other, is defined as [35]:

kxkb,∞ , max kxk k (133)

1≤k≤N

where k · k denotes the Euclidean norm of its vector argument.

Lemma 4 (Block maximum norm). If a block diagonal matrix X , diag{X1 , . . . , XN } ∈ RN M ×N M

consists of N blocks along the diagonal with dimension M × M each, then the block maximum norm of
X is bounded as

kXkb,∞ ≤ max kXk k (134)

1≤k≤N

in terms of the 2-induced norms of {Xk } (largest singular values). Moreover, if X is symmetric, then
equality holds in (134).

November 27, 2024 DRAFT

Proof: Note that Xx = col{X1 x1 ,. . . ,XN xN }. Evaluating the block maximum norm of vector Xx leads
to

kXxkb,∞ = max kXk xk k

1≤k≤N

≤ max kXk k · kxk k

1≤k≤N

≤ max kXk k · max kxk k (135)

1≤k≤N 1≤k≤N

Substituting (135) and (133) into (132), we establish (134) as

kXxkb,∞
kXkb,∞ , max
x6=0 kxkb,∞
max1≤k≤N kXk k · max1≤k≤N kxk k
≤ max
x6=0 max1≤k≤N kxk k

= max kXk k (136)

1≤k≤N

Next, we prove that, if all the diagonal blocks of X are symmetric, then equality should hold in (136).
To do this, we only need to show that there exists an x0 6= 0, such that
kXx0 kb,∞
= max kXk k (137)
kx0 kb,∞ 1≤k≤N

which would mean that

kXxkb,∞
kXkb,∞ , max
x6=0 kxkb,∞
kXx0 kb,∞
≥
kx0 kb,∞

= max kXk k (138)

1≤k≤N

Then, combining inequalities (136) and (138), we would obtain desired equality that

kXkb,∞ = max kXk k (139)

1≤k≤N

when X is block diagonal and symmetric. Thus, without loss of generality, assume the maximum in
(137) is achieved by X1 , i.e.,

max kXk k = kX1 k

1≤k≤N

For a symmetric Xk , its 2-induced norm kXk k (defined as the largest singular value of Xk ) coincides with
the spectral radius of Xk . Let λ0 denote the eigenvalue of X1 of largest magnitude, with the corresponding
right eigenvector given by z0 . Then,

max kXk k = |λ0 |, X1 z0 = λ0 z0

1≤k≤N

November 27, 2024 DRAFT

We select x0 = col{z0 , 0, . . . , 0}. Then, we establish (137) by:

kXx0 kb,∞ kcol{X1 z0 , 0, . . . , 0}kb,∞
=
kx0 kb,∞ kcol{z0 , 0, . . . , 0}kb,∞
kX1 z0 k kλ0 z0 k
= = = |λ0 | = max kXk k
kz0 k kz0 k 1≤k≤N

A PPENDIX C
S TABILITY OF B AND F

Recall the definitions of the matrices B and F from (93) and (100):

B = P2T [IM N − MD∞ ]P1T (140)

F = P1 [IM N − MD∞ ]P2 ⊗ P1 [IM N − MD∞ ]P2

= BT ⊗ BT (141)

From (140)–(141), we obtain (see Theorem 13.12 from [57, p.141]):

ρ(F) = ρ(B T ⊗ B T ) = [ρ(B T )]2 = [ρ(B)]2 (142)

where ρ(·) denotes the spectral radius of its matrix argument. Therefore, the stability of the matrix F is
equivalent to the stability of the matrix B , and we only need to examine the stability of B . Now note
that the block maximum norm (see the definition in Appendix B) of the matrix B satisfies

kBkb,∞ ≤ kIM N − MD∞ kb,∞ (143)

since the block maximum norms of P1 and P2 are one (see [35, p.4801]):

P1T b,∞
= 1, P2T b,∞
=1 (144)

Moreover, by noting that the spectral radius of a matrix is upper bounded by any matrix norm (Theorem
5.6.9, [50, p.297]) and that IM N − MD∞ is symmetric and block diagonal, we have

ρ(B) ≤ kIM N − MD∞ kb,∞ = ρ(IM N − MD∞ ) (145)

Therefore, the stability of B is guaranteed by the stability of IM N − MD∞ . Next, we call upon the
following lemma to bound kIM N −MD∞ kb,∞ .

Lemma 5 (Norm of IM N −MD∞ ). It holds that the matrix D∞ defined in (90) satisfies

kIM N −MD∞ kb,∞ ≤ max γk (146)

1≤k≤N

November 27, 2024 DRAFT

where γk is defined in (65).

Proof: Since D∞ is block diagonal and symmetric, IM N − MD∞ is also block diagonal with blocks
{IM −µk Dk,∞ }, where Dk,∞ denotes the k th diagonal block of D∞ . Then, from (134) in Lemma 4 in
Appendix B, it holds that

kIM N −MD∞ kb,∞ = max kIM −µk Dk,∞ k (147)

1≤k≤N

By the definition of D∞ in (90), and using condition (45) from Assumption 1, we have
N N
! !
X X
sl,k λl,min · IM ≤ Dk,∞ ≤ sl,k λl,max · IM
l=1 l=1

which implies that

N
!
X
IM − µk Dk,∞ ≥ 1 − µk sl,k λl,max · IM (148)
l=1
N
!
X
IM − µk Dk,∞ ≤ 1 − µk sl,k λl,min · IM (149)
l=1

Thus, kIM −µk Dk,∞ k ≤ γk . Substituting into (147), we get (146).

Substituting (146) into (145), we get:

ρ(B) ≤ max γk (150)

1≤k≤N

As long as max γk < 1, then all the eigenvalues of B will lie within the unit circle. By the definition
1≤k≤N
of γk in (65), this is equivalent to requiring

|1 − µk σk,max | < 1, |1 − µk σk,min | < 1

for k = 1, . . . , N , where σk,max and σk,min are defined in (81). These conditions are satisfied if we
choose µk such that

0 < µk < 2/σk,max , k = 1, . . . , N (151)

which is obviously guaranteed for sufficiently small step-sizes (and also by condition (80)).

R EFERENCES

[1] J. Chen, S.-Y. Tu, and A. H. Sayed, “Distributed optimization via diffusion adaptation,” in Proc. IEEE International
Workshop on Comput. Advances Multi-Sensor Adaptive Process. (CAMSAP), Puerto Rico, Dec. 2011, pp. 281–284.
[2] J. Chen and A. H. Sayed, “Performance of diffusion adaptation for collaborative optimization,” in Proc. IEEE International
Conf. Acoustics, Speech and Signal Process. (ICASSP), Kyoto, Japan, March 2012, pp. 1–4.

November 27, 2024 DRAFT

[3] S.-Y. Tu and A. H. Sayed, “Mobile adaptive networks,” IEEE J. Sel. Topics. Signal Process., vol. 5, no. 4, pp. 649–664,
Aug. 2011.
[4] O. Dekel, R. Gilad-Bachrach, O. Shamir, and L. Xiao, “Optimal distributed online prediction,” in Proc. International Conf.
Machin. Learning (ICML), Bellevue, USA, June 2011, pp. 713–720.
[5] Z. J. Towfic, J. Chen, and A. H. Sayed, “Collaborative learning of mixture models using diffusion adaptation,” in Proc.
IEEE Workshop on Mach. Learning Signal Process. (MLSP), Beijing, China, Sep. 2011, pp. 1–6.
[6] D. P. Bertsekas, “A new class of incremental gradient methods for least squares problems,” SIAM J. Optim., vol. 7, no. 4,
pp. 913–926, 1997.
[7] A. Nedic and D. P. Bertsekas, “Incremental subgradient methods for nondifferentiable optimization,” SIAM J. Optim.,
vol. 12, no. 1, pp. 109–138, 2001.
[8] M. G. Rabbat and R. D. Nowak, “Quantized incremental algorithms for distributed optimization,” IEEE J. Sel. Areas
Commun., vol. 23, no. 4, pp. 798–808, Apr. 2005.
[9] C. G. Lopes and A. H. Sayed, “Incremental adaptive strategies over distributed networks,” IEEE Trans. Signal Process.,
vol. 55, no. 8, pp. 4064–4077, Aug. 2007.
[10] D. P. Bertsekas and J. N. Tsitsiklis, Parallel and Distributed Computation: Numerical Methods, 1st edition. Athena
Scientific, Singapore, 1997.
[11] J. N. Tsitsiklis and M. Athans, “Convergence and asymptotic agreement in distributed decision problems,” IEEE Trans.
Autom. Control, vol. 29, no. 1, pp. 42–50, Jan. 1984.
[12] J. N. Tsitsiklis, D. P. Bertsekas, and M. Athans, “Distributed asynchronous deterministic and stochastic gradient optimization
algorithms,” IEEE Trans. Autom. Control, vol. 31, no. 9, pp. 803–812, Sep. 1986.
[13] S. Barbarossa and G. Scutari, “Bio-inspired sensor network design,” IEEE Signal Process. Mag., vol. 24, no. 3, pp. 26–35,
May 2007.
[14] A. Nedic and A. Ozdaglar, “Cooperative distributed multi-agent optimization,” in Convex Optimization in Signal Processing
and Communications, Y. Eldar and D. Palomar, Eds., pp. 340–386, 2009.
[15] ——, “Distributed subgradient methods for multi-agent optimization,” IEEE Trans. Autom. Control, vol. 54, no. 1, pp.
48–61, Jan. 2009.
[16] I. D. Schizas, A. Ribeiro, and G. B. Giannakis, “Consensus in ad hoc WSNs with noisy links—Part I: Distributed estimation
of deterministic signals,” IEEE Trans. Signal Process., vol. 56, no. 1, pp. 350–364, 2008.
[17] S. Kar and J. M. F. Moura, “Sensor networks with random links: Topology design for distributed consensus,” IEEE Trans.
Signal Process., vol. 56, no. 7, pp. 3315–3326, July 2008.
[18] ——, “Convergence rate analysis of distributed gossip (linear parameter) estimation: Fundamental limits and tradeoffs,”
IEEE J. Sel. Topics. Signal Process., vol. 5, no. 4, pp. 674–690, Aug. 2011.
[19] A. G. Dimakis, S. Kar, J. M. F. Moura, M. G. Rabbat, and A. Scaglione, “Gossip algorithms for distributed signal
processing,” Proc. of the IEEE, vol. 98, no. 11, pp. 1847–1864, 2010.
[20] R. Olfati-Saber and R. M. Murray, “Consensus problems in networks of agents with switching topology and time-delays,”
IEEE Trans. Autom. Control, vol. 49, no. 9, pp. 1520–1533, Sep. 2004.
[21] T. C. Aysal, M. E. Yildiz, A. D. Sarwate, and A. Scaglione, “Broadcast gossip algorithms for consensus,” IEEE Trans.
Signal Process., vol. 57, no. 7, pp. 2748–2761, 2009.
[22] S. Sardellitti, M. Giona, and S. Barbarossa, “Fast distributed average consensus algorithms based on advection-diffusion
processes,” IEEE Trans. Signal Process., vol. 58, no. 2, pp. 826–842, Feb. 2010.

November 27, 2024 DRAFT

[23] L. Xiao, S. Boyd, and S. Lall, “A scheme for robust distributed sensor fusion based on average consensus,” in Proc. Int.
Symp. Information Processing Sensor Networks (IPSN), Los Angeles, CA, Apr. 2005, pp. 63–70.
[24] C. Eksin and A. Ribeiro, “Network optimization with heuristic rational agents,” in Proc. Asilomar Conf. on Signals Systems
Computers, Pacific Grove, CA, Nov. 2011, pp. 1–5.
[25] R. M. Karp, “Reducibility among combinational problems,” Complexity of Computer Computations (R. E. Miller and J.
W. Thatcher, Eds.), pp. 85–104, 1972.
[26] C. G. Lopes and A. H. Sayed, “Distributed processing over adaptive networks,” in Proc. Adaptive Sensor Array Processing
Workshop, MIT Lincoln Laboratory, MA, June 2006, pp. 1–5.
[27] C. Lopes and A. Sayed, “Diffusion least-mean squares over adaptive networks,” in IEEE ICASSP, vol. 3, Honolulu, HI,
Apr. 2007, pp. 917–920.
[28] A. H. Sayed and C. G. Lopes, “Adaptive processing over distributed networks,” IEICE Trans. Fund. Electron., Commun.
Comput. Sci., vol. E90-A, no. 8, pp. 1504–1510, Aug. 2007.
[29] C. G. Lopes and A. H. Sayed, “Diffusion least-mean squares over adaptive networks: Formulation and performance
analysis,” IEEE Trans. Signal Process., vol. 56, no. 7, pp. 3122–3136, July 2008.
[30] F. S. Cattivelli and A. H. Sayed, “Diffusion LMS algorithms with information exchange,” in Proc. Asilomar Conf. Signals,
Syst. Comput., Pacific Grove, CA, Nov. 2008, pp. 251–255.
[31] ——, “Diffusion LMS strategies for distributed estimation,” IEEE Trans. Signal Process., vol. 58, no. 3, pp. 1035–1048,
March 2010.
[32] F. S. Cattivelli, C. G. Lopes, and A. H. Sayed, “A diffusion RLS scheme for distributed estimation over adaptive networks,”
in Proc. IEEE Workshop on Signal Process. Advances Wireless Comm. (SPAWC), Helsinki, Finland, June 2007, pp. 1–5.
[33] ——, “Diffusion recursive least-squares for distributed estimation over adaptive networks,” IEEE Trans. Signal Process.,
vol. 56, no. 5, pp. 1865–1877, May 2008.
[34] F. S. Cattivelli and A. H. Sayed, “Diffusion strategies for distributed Kalman filtering and smoothing,” IEEE Trans. Autom.
Control, vol. 55, no. 9, pp. 2069–2084, Sep. 2010.
[35] N. Takahashi, I. Yamada, and A. H. Sayed, “Diffusion least-mean squares with adaptive combiners: Formulation and
performance analysis,” IEEE Trans. Signal Process., vol. 58, no. 9, pp. 4795–4810, Sep. 2010.
[36] F. S. Cattivelli and A. H. Sayed, “Modeling bird flight formations using diffusion adaptation,” IEEE Trans. Signal Process.,
vol. 59, no. 5, pp. 2038–2051, May 2011.
[37] P. Di Lorenzo, S. Barbarossa, and A. H. Sayed, “Bio-inspired swarming for dynamic radio access based on diffusion
adaptation,” in Proc. European Signal Process. Conf. (EUSIPCO), Aug. 2011, pp. 1–6.
[38] S. Chouvardas, K. Slavakis, and S. Theodoridis, “Adaptive robust distributed learning in diffusion sensor networks,” IEEE
Trans. Signal Process., vol. 59, no. 10, pp. 4692–4707, 2011.
[39] S. S. Ram, A. Nedic, and V. V. Veeravalli, “Distributed stochastic subgradient projection algorithms for convex
optimization,” J. Optim. Theory Appl., vol. 147, no. 3, pp. 516–545, 2010.
[40] P. Bianchi, G. Fort, W. Hachem, and J. Jakubowicz, “Convergence of a distributed parameter estimator for sensor networks
with local averaging of the estimates,” in Proc. IEEE ICASSP, Prague, Czech, May 2011, pp. 3764–3767.
[41] D. P. Bertsekas, “Incremental gradient, subgradient, and proximal methods for convex optimization: A survey,” LIDS
Technical Report, MIT, no. 2848, 2010.
[42] V. S. Borkar and S. P. Meyn, “The ODE method for convergence of stochastic approximation and reinforcement learning,”
SIAM J. Control Optim., vol. 38, no. 2, pp. 447–469, 2000.

November 27, 2024 DRAFT

[43] K. Srivastava and A. Nedic, “Distributed asynchronous constrained stochastic optimization,” IEEE J. Sel. Topics. Signal
Process., vol. 5, no. 4, pp. 772–790, Aug. 2011.
[44] B. Polyak, Introduction to Optimization. Optimization Software, NY, 1987.
[45] J. Chen and A. H. Sayed, “Distributed Pareto-optimal solutions via diffusion adaptation,” in Proc. IEEE Statistical Signal
Process. Workshop (SSP), Ann Arbor, MI, Aug. 2012.
[46] A. H. Sayed, Adaptive Filters. Wiley, NJ, 2008.
[47] G. H. Golub and C. F. Van Loan, Matrix Computations (3rd Edition). Johns Hopkins University Press, 1996.
[48] S. S. Stankovic, M. S. Stankovic, and D. M. Stipanovic, “Decentralized parameter estimation by consensus based stochastic
approximation,” IEEE Trans. Autom. Control, vol. 56, no. 3, pp. 531–543, Mar. 2011.
[49] S.-Y. Tu and A. H. Sayed, “Diffusion networks outperform consensus networks,” in Proc. IEEE Statistical Signal Processing
Workshop (SSP), Ann Arbor, MI, Aug. 2012.
[50] R. Horn and C. Johnson, Matrix Analysis. Cambridge University Press, 1990.
[51] D. P. Bertsekas and J. N. Tsitsiklis, “Gradient convergence in gradient methods with errors,” SIAM J. Optim., vol. 10,
no. 3, pp. 627–642, 2000.
[52] S. Haykin, Adaptive Filter Theory, 2nd Edition. Prentice Hall, 2002.
[53] J. Arenas-Garcia, M. Martinez-Ramon, A. Navia-Vazquez, and A. R. Figueiras-Vidal, “Plant identification via adaptive
combination of transversal filters,” Signal Processing, vol. 86, no. 9, pp. 2430–2438, 2006.
[54] M. Silva and V. Nascimento, “Improving the tracking capability of adaptive filters via convex combination,” IEEE Trans.
Signal Process., vol. 56, no. 7, pp. 3137–3149, 2008.
[55] S. Theodoridis, K. Slavakis, and I. Yamada, “Adaptive learning in a world of projections,” IEEE Signal Process. Mag.,
vol. 28, no. 1, pp. 97–123, Jan. 2011.
[56] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge University Press, 2004.
[57] A. J. Laub, Matrix Analysis for Scientists and Engineers. Society for Industrial and Applied Mathematics (SIAM), PA,
2005.
[58] P. Di Lorenzo, S. Barbarossa, and A. H. Sayed, “Sparse diffusion LMS for distributed adaptive estimation,” in Proc. IEEE
ICASSP, Kyoto, Japan, March 2012, pp. 1–4.
[59] R. Tibshirani, “Regression shrinkage and selection via the lasso,” J. Royal Statist. Soc. B, pp. 267–288, 1996.
[60] R. G. Baraniuk, “Compressive sensing,” IEEE Signal Process. Mag., vol. 24, no. 4, pp. 118–121, Mar. 2007.
[61] E. Candes, M. Wakin, and S. Boyd, “Enhancing sparsity by reweighted `1 minimization,” J. Fourier Anal. Appl., vol. 14,
no. 5, pp. 877–905, 2008.
[62] G. Mateos, J. A. Bazerque, and G. B. Giannakis, “Distributed sparse linear regression,” IEEE Trans. Signal Process.,
vol. 58, no. 10, pp. 5262–5276, 2010.
[63] Y. Kopsinis, K. Slavakis, and S. Theodoridis, “Online sparse system identification and signal reconstruction using projections
onto weighted `1 balls,” IEEE Trans. Signal Process., vol. 59, no. 3, pp. 936–952, Mar. 2011.

November 27, 2024 DRAFT

Multidimensional Newton-Raphson Consensus For Distributed Convex Optimization
No ratings yet
Multidimensional Newton-Raphson Consensus For Distributed Convex Optimization
6 pages
Zhang 2023 Distributed 2
No ratings yet
Zhang 2023 Distributed 2
5 pages
Distributed Subgradient Methods for Optimization
No ratings yet
Distributed Subgradient Methods for Optimization
28 pages
Corrigendum To "Balance of Communication and Convergence: Predefined-Time Distributed Optimization Based On Zero-Gradient-Sum"
No ratings yet
Corrigendum To "Balance of Communication and Convergence: Predefined-Time Distributed Optimization Based On Zero-Gradient-Sum"
11 pages
Linear Convergence in Decentralized Optimization
No ratings yet
Linear Convergence in Decentralized Optimization
34 pages
Riemannian Diffusion for Distributed PCA
No ratings yet
Riemannian Diffusion for Distributed PCA
5 pages
Distributed Zero-Order Algorithms for Nonconvex Optimization
No ratings yet
Distributed Zero-Order Algorithms for Nonconvex Optimization
41 pages
Consensus-Based Optimization Methods Converge Globally: Abstract
No ratings yet
Consensus-Based Optimization Methods Converge Globally: Abstract
32 pages
Recovery Guarantees For Distributed-OMP
No ratings yet
Recovery Guarantees For Distributed-OMP
47 pages
Augmented Lagrangian Approach To Design of Structured Optimal State Feedback Gains
No ratings yet
Augmented Lagrangian Approach To Design of Structured Optimal State Feedback Gains
7 pages
Stats 102B Cheat Sheet
No ratings yet
Stats 102B Cheat Sheet
4 pages
Stochastic Search Optimization
No ratings yet
Stochastic Search Optimization
317 pages
Communication Limits in Distributed Convex Optimization
No ratings yet
Communication Limits in Distributed Convex Optimization
9 pages
Extra: An Exact First-Order Algorithm For Decentralized Consensus Optimization
No ratings yet
Extra: An Exact First-Order Algorithm For Decentralized Consensus Optimization
23 pages
1 s2.0 S0165168416303747 Main
No ratings yet
1 s2.0 S0165168416303747 Main
7 pages
Distributed Journal Final
No ratings yet
Distributed Journal Final
29 pages
A New Diffusion Variable Spatial Regularized QRRLS Algorithm
No ratings yet
A New Diffusion Variable Spatial Regularized QRRLS Algorithm
5 pages
Federated Learning
No ratings yet
Federated Learning
36 pages
Applications of Fixed Point Theory To Distributed Optimization R
No ratings yet
Applications of Fixed Point Theory To Distributed Optimization R
123 pages
A Lagrange Multiplier Method For Distributed Optimization Based On Multi-Agent Network With Private and Shared Information
No ratings yet
A Lagrange Multiplier Method For Distributed Optimization Based On Multi-Agent Network With Private and Shared Information
9 pages
2014 Sparsity-Aware Adaptive Link Combination Approach Over Distributed Networks
No ratings yet
2014 Sparsity-Aware Adaptive Link Combination Approach Over Distributed Networks
3 pages
Overview of Traditional Optimization Methods
No ratings yet
Overview of Traditional Optimization Methods
17 pages
Distributed Gaussian Mean Estimation Under Communication Constraints: Optimal Rates and Communication-E Cient Algorithms
No ratings yet
Distributed Gaussian Mean Estimation Under Communication Constraints: Optimal Rates and Communication-E Cient Algorithms
63 pages
Convergence of Distributed Adaptive Optimization With Local Updates
No ratings yet
Convergence of Distributed Adaptive Optimization With Local Updates
66 pages
A Distributed Optimization Scheme For State Estimation of Nonlinear Networks With Norm-Bounded Uncertainties
No ratings yet
A Distributed Optimization Scheme For State Estimation of Nonlinear Networks With Norm-Bounded Uncertainties
8 pages
Distributed Multi-Agent Optimization Based On An Exact Penalty Method With Equality and Inequality Constraints
No ratings yet
Distributed Multi-Agent Optimization Based On An Exact Penalty Method With Equality and Inequality Constraints
8 pages
BolognaniZampieri NECSYS2010
No ratings yet
BolognaniZampieri NECSYS2010
8 pages
Optimization Techniques Explained
No ratings yet
Optimization Techniques Explained
56 pages
A Collaborative Neurodynamic Approach To Multiple-Objective Distributed Optimization
No ratings yet
A Collaborative Neurodynamic Approach To Multiple-Objective Distributed Optimization
12 pages
Adaptive Consensus Gradients Aggregation For Scaled Distributed Training
No ratings yet
Adaptive Consensus Gradients Aggregation For Scaled Distributed Training
13 pages
7 Unnamed 11 04 2023
No ratings yet
7 Unnamed 11 04 2023
135 pages
Nature Inspired Computing Lec 1
No ratings yet
Nature Inspired Computing Lec 1
36 pages
Adaptive Stochastic Conjugate Gradient For Machine Learning
No ratings yet
Adaptive Stochastic Conjugate Gradient For Machine Learning
14 pages
Distributed Optimal Control for Consensus
No ratings yet
Distributed Optimal Control for Consensus
8 pages
Nonlinear Optimization Course Overview
No ratings yet
Nonlinear Optimization Course Overview
75 pages
Optimization PSO
No ratings yet
Optimization PSO
57 pages
Online Learning For IoT Optimization A FrankWolfe Adam-Based Algorithm
No ratings yet
Online Learning For IoT Optimization A FrankWolfe Adam-Based Algorithm
10 pages
Entropy Search for Global Optimization
No ratings yet
Entropy Search for Global Optimization
33 pages
Symmetry 16 01289
No ratings yet
Symmetry 16 01289
18 pages
Majorization Minimization Methods For Distributed Pose Graph Optimization
No ratings yet
Majorization Minimization Methods For Distributed Pose Graph Optimization
21 pages
MATLAB Optimal Control Solutions
No ratings yet
MATLAB Optimal Control Solutions
13 pages
Speeding Up Distributed Machine Learning Using Codes
No ratings yet
Speeding Up Distributed Machine Learning Using Codes
16 pages
Lec 30
No ratings yet
Lec 30
22 pages
PDF 4
No ratings yet
PDF 4
11 pages
Stability and Convergence Properties of Dynamic Average Consensus Estimators
No ratings yet
Stability and Convergence Properties of Dynamic Average Consensus Estimators
6 pages
Nonlinear Optimization: Benny Yakir
No ratings yet
Nonlinear Optimization: Benny Yakir
38 pages
Nonlinear Optimization with MATLAB
No ratings yet
Nonlinear Optimization with MATLAB
38 pages
Op Tim Ization
No ratings yet
Op Tim Ization
4 pages
A Modified Newton's Method For Solving Nonlinear Programing Problems
No ratings yet
A Modified Newton's Method For Solving Nonlinear Programing Problems
15 pages
EE227C Course Notes: Convex Optimization
No ratings yet
EE227C Course Notes: Convex Optimization
122 pages
Ee227c Notes PDF
No ratings yet
Ee227c Notes PDF
122 pages
Continuous Optimization Methods
No ratings yet
Continuous Optimization Methods
126 pages
LSCOMO大规模凸优化使用单调算子 (Book)
No ratings yet
LSCOMO大规模凸优化使用单调算子 (Book)
348 pages
Will Bilevel Optimizers Benefit From Loops: Kaiyi Ji, Mingrui Liu, Yingbin Liang and Lei Ying June 2, 2022
No ratings yet
Will Bilevel Optimizers Benefit From Loops: Kaiyi Ji, Mingrui Liu, Yingbin Liang and Lei Ying June 2, 2022
32 pages
Distributed Recursive Least-Squares: Stability and Performance Analysis
No ratings yet
Distributed Recursive Least-Squares: Stability and Performance Analysis
30 pages
GRG Method with Search Directions
100% (1)
GRG Method with Search Directions
14 pages
Article: Projection-Based Curve Pattern Search For Black-Box Optimization Over Smooth Convex Sets
No ratings yet
Article: Projection-Based Curve Pattern Search For Black-Box Optimization Over Smooth Convex Sets
20 pages
Optimization Methods 1755746982
No ratings yet
Optimization Methods 1755746982
116 pages
FALLSEM2022-23 SWE1011 ETH VL2022230101053 2022-11-02 Reference-Material-I
No ratings yet
FALLSEM2022-23 SWE1011 ETH VL2022230101053 2022-11-02 Reference-Material-I
142 pages
Linear Programming Problem Formulation
No ratings yet
Linear Programming Problem Formulation
42 pages
Optimization Problem Essentials
No ratings yet
Optimization Problem Essentials
2 pages
Chapter 14 PDF
No ratings yet
Chapter 14 PDF
82 pages
1140-Article Text-13994-2-10-20240715
No ratings yet
1140-Article Text-13994-2-10-20240715
11 pages
Designing Effective Service Systems
No ratings yet
Designing Effective Service Systems
33 pages
ML Unit-I
No ratings yet
ML Unit-I
34 pages
Monotone Comparative Statics
No ratings yet
Monotone Comparative Statics
10 pages
Deterministic Optimization Models in Operations Research
No ratings yet
Deterministic Optimization Models in Operations Research
68 pages
NNDL Unit 3: Deep Learning Overview
No ratings yet
NNDL Unit 3: Deep Learning Overview
17 pages
Taguchi Methods for Quality Control
No ratings yet
Taguchi Methods for Quality Control
19 pages
MIT15 053S13 Lec1 PDF
No ratings yet
MIT15 053S13 Lec1 PDF
36 pages
Multiple Work Scheduling
No ratings yet
Multiple Work Scheduling
25 pages
Linear Programming Model Formulation
No ratings yet
Linear Programming Model Formulation
54 pages
Horngrens Cost Accounting A Managerial Emphasis 16th Edition Datar Test Bank Download
No ratings yet
Horngrens Cost Accounting A Managerial Emphasis 16th Edition Datar Test Bank Download
77 pages
00 Lecture 1 - DM
No ratings yet
00 Lecture 1 - DM
9 pages
Machine Learning in Compiler Optimisation
No ratings yet
Machine Learning in Compiler Optimisation
24 pages
SEE705 Lecture Week 9
No ratings yet
SEE705 Lecture Week 9
18 pages
Genetic Bee Colony
No ratings yet
Genetic Bee Colony
22 pages
Optimization Examples: For More Information See Aspen Plus User Guide Chapter 22, Optimization
No ratings yet
Optimization Examples: For More Information See Aspen Plus User Guide Chapter 22, Optimization
2 pages
Model-Based Deep Learning
No ratings yet
Model-Based Deep Learning
35 pages
2marks ML
No ratings yet
2marks ML
3 pages
EDAV Assignment2 Devyani2
No ratings yet
EDAV Assignment2 Devyani2
7 pages
Aspen HYSYS SQP Optimization
No ratings yet
Aspen HYSYS SQP Optimization
24 pages
DNN Merged Sugata
No ratings yet
DNN Merged Sugata
243 pages
Optimizing Generative AI by Backpropagating Language Model Feedback
No ratings yet
Optimizing Generative AI by Backpropagating Language Model Feedback
18 pages
Stat 618 Bayesian Statistics Lecture Notes - American University
No ratings yet
Stat 618 Bayesian Statistics Lecture Notes - American University
187 pages
Power System Simulation Lab software-NEPLAN Features PDF
100% (1)
Power System Simulation Lab software-NEPLAN Features PDF
14 pages
Robust Quality: Powerful Integration of Data Science and Process Engineering Rajesh Jugulum - The Ebook in PDF and DOCX Formats Is Ready For Download
100% (3)
Robust Quality: Powerful Integration of Data Science and Process Engineering Rajesh Jugulum - The Ebook in PDF and DOCX Formats Is Ready For Download
59 pages
1st Quiz in Management Science
No ratings yet
1st Quiz in Management Science
3 pages
1 Introduction and Formulation
No ratings yet
1 Introduction and Formulation
34 pages

Diffusion Adaptation Strategies For Distributed Optimization and Learning Over Networks

Uploaded by

Diffusion Adaptation Strategies For Distributed Optimization and Learning Over Networks

Uploaded by

1

Diffusion Adaptation Strategies for Distributed

We propose an adaptive diffusion mechanism to optimize a global cost function in a distributed

Distributed optimization, diffusion adaptation, incremental techniques, learning, energy conservation,

November 27, 2024 DRAFT

November 27, 2024 DRAFT

II. P ROBLEM F ORMULATION

November 27, 2024 DRAFT

Jlloc (w) ≈ Jlloc (wo ) + kw − wo k2Γl (5)

November 27, 2024 DRAFT

III. I TERATIVE D IFFUSION S OLUTION

November 27, 2024 DRAFT

λmin (Γl ) · kw−wo k2 ≤ kw−wo k2Γl ≤ λmax (Γl ) · kw−wo k2

Thus, we replace (8) by

November 27, 2024 DRAFT

November 27, 2024 DRAFT

Now introduce the coefficients

November 27, 2024 DRAFT

November 27, 2024 DRAFT

which are of the following form [10], [14], [15], [18]:

IV. M EAN -S QUARE P ERFORMANCE A NALYSIS

the gradient noise in this case can be expressed as

vl,i (w) = ∇w Ql (w, xl,i ) − ∇w E{Ql (w, xl,i )} (23)

November 27, 2024 DRAFT

November 27, 2024 DRAFT

P1T 1 = 1, P2T 1 = 1, S 1 = 1 (29)

φ̃k,i , wo −φk,i , ψ̃k,i , wo −ψk,i , w̃k,i , wo −wk,i (30)

Then, subtracting both sides of (26)–(28) from wo gives:

November 27, 2024 DRAFT

, −Hl,k,i−1 φ̃k,i−1 (35)

where we are introducing the symmetric random matrix

November 27, 2024 DRAFT

and the following block matrices:

Ω = diag {µ1 , . . . , µN } (41)

w̃i = P2T [IM N − MD i−1 ]P1T w̃i−1 + P2T Mgi (44)

λl,min IM ≤ ∇2w Jl (w) ≤ λl,max IM (45)

Furthermore, the {λl,min }N

E {vl,i (w) | Fi−1 } = 0 (47)

E kvl,i (w)k2 ≤ αEkwo − wk2 + σv2

November 27, 2024 DRAFT

λl,min IM ≤ Hl,k,i−1 ≤ λl,max IM (49)

In a similar way, we can verify that xT Hl,k,i−1 x ≥ λl,min .

Ekvl,i (w)k2 ≤ σv2 , E kvl,i (w)k2 |Fi−1 ≤ σv2

Furthermore, condition (48) is similar to condition (4.3) in [51, p.635]:

Ql (w, {ul,i , dl (i)}) = |dl (i) − ul,i w|2

Jl (w) = E|dl (i) − ul,i w|2 (52)

November 27, 2024 DRAFT

dl (i) = ul,i wo + zl (i) (53)

vl,i (w) = 2(Ru,l − uTl,i ul,i )(wo − w) − 2uTl,i zl (i) (54)

≤ 4EkRu,l −uTl,i ul,i k2 · Ekwo −wk2 +4σz,l

= −2uTl,i [dl (i) − ul,i w] (56)

ψk,i = wk,i−1 +2µk uTk,i [dk (i)−uk,i wk,i−1 ]

November 27, 2024 DRAFT

November 27, 2024 DRAFT

We call upon the following two lemmas to bound (62).

0 ≤ Σk,i−1 ≤ γk2 IM (64)

November 27, 2024 DRAFT

definite. To establish (64), we first use (49) to note that:

By definition (63), Σk,i−1 is the square of the symmetric matrix IM −µk N

Substituting (68)–(69) into (70) leads to

which is equivalent to (64).

November 27, 2024 DRAFT

Proof: Applying Jensen’s inequality [56, p.77], it holds that

Substituting (64) and (72) into (62), we obtain:

Ekψ̃k,i k2 ≤ (γk2 +µ2k αkSk21 )· Ekφ̃k,i−1 k2 +µ2k kSk21 σv2 (75)

Γ = diag γ12 + µ21 αkSk21 , . . . , γN

Then, (60)–(61) and (75) can be written as

November 27, 2024 DRAFT

Wi  P2T ΓP1T Wi−1 + σv2 kSk21 · P2T Ω2 1 (79)

for k = 1, . . . , N , where σk,max and σk,min are defined as

where kxk∞ denotes the maximum absolute entry of vector x.

Proof: See Appendix A.

November 27, 2024 DRAFT

lim sup kWi k∞

November 27, 2024 DRAFT

kSk21 σv2 µ2max

Wi P2T ΓP1T Wi−1 + σv2 kSk21 · P2T Ω2 1 (79)

(a) Learning curves (γ = 2 and = 10−3 ). (b) Steady-state MSD (µ = 10−3 ).