Mean Robust Optimization Framework
https://doi.org/10.1007/s10107-024-02170-4
Abstract
Robust optimization is a tractable and expressive technique for decision-making
under uncertainty, but it can lead to overly conservative decisions when pessimistic
assumptions are made on the uncertain parameters. Wasserstein distributionally robust
optimization can reduce conservatism by being data-driven, but it often leads to very
large problems with prohibitive solution times. We introduce mean robust optimization, a general framework that combines the best of both worlds by providing a trade-off between computational effort and conservatism. We propose uncertainty sets constructed based on clustered data rather than on observed data points directly, thereby significantly reducing problem size. By varying the number of clusters, our method
bridges between robust and Wasserstein distributionally robust optimization. We show
finite-sample performance guarantees and explicitly control the potential additional
pessimism introduced by any clustering procedure. In addition, we prove conditions
for which, when the uncertainty enters linearly in the constraints, clustering does not
affect the optimal solution. We illustrate the efficiency and performance preservation
of our method on several numerical examples, obtaining multiple orders of magnitude
speedups in solution time with little-to-no effect on the solution quality.
Irina Wang and Bartolomeo Stellato are supported by the NSF CAREER Award ECCS 2239771.
Bartolomeo Stellato
[email protected]
Irina Wang
[email protected]
Cole Becker
[email protected]
Bart Van Parys
[email protected]
1 Introduction
Robust optimization (RO) and distributionally robust optimization (DRO) are popular tools for decision-making under uncertainty due to their high expressiveness and
versatility. The main idea of RO is to define an uncertainty set and to minimize the
worst-case cost across possible uncertainty realizations in that set. However, while
RO often leads to tractable formulations, it can be overly-conservative [42]. To reduce
conservatism, DRO takes a probabilistic approach, by modeling the uncertainty as
a random variable following a probability distribution known only to belong to an
uncertainty set (also called ambiguity set) of distributions. In both RO and DRO, the
choice of the uncertainty or ambiguity set can greatly influence the quality of the solution. Good-quality uncertainty sets can lead to excellent practical performance, while ill-chosen sets can lead to overly conservative actions and intractable computations.
Traditional approaches design uncertainty sets based on theoretical assumptions
on the uncertainty distributions [1, 3, 5, 13]. While these methods have been quite
successful, they rely on a priori assumptions that are difficult to verify in practice. On
the other hand, the last decade has seen an explosion in the availability of data, which
has brought a shift in focus from a priori assumptions on the probability distributions
to data-driven methods in operations research and decision sciences. In RO and DRO,
this new paradigm has fostered data-driven methods where uncertainty sets are shaped
directly from data [10]. In data-driven DRO, a popular choice of the ambiguity set is
the ball of distributions whose Wasserstein distance to a nominal distribution is at most ε > 0 [24, 28, 29, 35]. When the reference distribution is an empirical distribution,
the associated Wasserstein DRO can be formulated as a convex minimization problem
where the number of constraints grows linearly with the number of data-points [24].
While less conservative than RO, data-driven DRO can lead to very large formulations
that are intractable, especially in mixed-integer optimization (MIO).
A common idea to reduce the dimensionality of data-driven decision-making problems is to use clustering techniques from machine learning. While clustering has
recently appeared in various works within the stochastic programming literature [7,
17, 23, 34], the focus has been on the improvement of and comparisons to the sample
average approximation (SAA) approach and not in a distributionally robust sense. In
contrast, recent approaches in the DRO literature cluster data into partitions and either
build moment-based uncertainty sets for each partition [19, 39], or enrich Wasserstein
DRO formulations with partition-specific information (e.g., relative weights) [25].
While these approaches are promising, clustering is still used as a pre-processing
heuristic on the data-sets in DRO, without a clear understanding of how it affects
the conservatism of the optimal solutions. In particular, choosing the right clustering
parameters to carefully balance computational tractability and out-of-sample performance is still an unsolved challenge.
We present mean robust optimization (MRO), a data-driven method that, via machine
learning clustering, bridges between RO and Wasserstein DRO.
– We design the uncertainty set for RO as a ball around clustered data. Without clustering, our formulation corresponds to the finite convex reformulation in Wasserstein DRO. With one cluster, our formulation corresponds to the classical RO approach. The number of clusters is a tunable parameter that provides a tradeoff between the worst-case objective value and computational efficiency, which includes both speed and memory usage.
– We provide probabilistic guarantees of constraint satisfaction for our method,
based on the quality of the clustering procedure.
– We derive bounds on the effect of clustering in case of constraints with concave
and maximum-of-concave dependency on the uncertainty. In addition, we show
that, when constraints are linearly affected by the uncertainty, clustering does not
affect the solution nor the probabilistic guarantees.
– We show on various numerical examples that, thanks to our clustering procedure, our approach provides multiple orders of magnitude speedups over classical approaches while guaranteeing the same probability of constraint satisfaction. The code to reproduce our results is available at https://github.com/stellatogrp/mro_experiments.
where β > 0 is the specified probability of constraint violation. From now on, the probabilistic guarantee of constraint satisfaction refers to (2).
where we partition D_N into K disjoint subsets C_k, and d̄_k is the centroid of the k-th subset, for k = 1, …, K. The weight w_k > 0 of each subset equals the proportion of points in the subset, i.e., w_k = |C_k|/N. We choose p to be an integer exponent, selected depending on the other parameters to ensure satisfaction of the probability guarantee (2). When p = 2 and S = R^m, the set is an ellipsoid in R^{Km} whose center is formed by stacking together all d̄_k into a single vector in R^{Km}. When we additionally have K = N or K = 1, this ellipsoid becomes a ball in R^{Nm} or R^m, respectively.
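As an illustration (our sketch, using scikit-learn, and not necessarily the implementation in the paper's repository), the centroids d̄_k and weights w_k can be computed from the data with an off-the-shelf K-means routine; the dataset below is a placeholder.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_uncertainty_data(D, K, seed=0):
    """Return K-means centroids d_bar_k and weights w_k = |C_k| / N."""
    N = D.shape[0]
    km = KMeans(n_clusters=K, n_init=10, random_state=seed).fit(D)
    centroids = km.cluster_centers_                     # shape (K, m)
    weights = np.bincount(km.labels_, minlength=K) / N  # w_k = |C_k| / N
    return centroids, weights

# hypothetical dataset: N = 500 samples in R^5, clustered into K = 10 subsets
D_N = np.random.default_rng(0).normal(size=(500, 5))
d_bar, w = cluster_uncertainty_data(D_N, K=10)
```

The centroids and weights (together with ε) are the only data the clustered uncertainty set requires, which is why the resulting problem size no longer scales with N.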
Case p = ∞. In the case where p = ∞, the set we consider becomes

U(K, ε) = { u = (v_1, …, v_K) ∈ S^K | max_{k=1,…,K} ‖v_k − d̄_k‖ ≤ ε },

where the constraints for the individual v_k become decoupled. This decoupling follows the result for the Wasserstein type p = ∞ metric [30, Equation 2], as our uncertainty set is analogous to the set of all distributions within Wasserstein-∞ distance of d̄. Note that, if any of the decoupled constraints are violated, then lim_{p→∞} ∑_{k=1}^K w_k ‖v_k − d̄_k‖^p ≥ ε^p, and the summation constraint is violated.
For both cases, when K = 1, we have a simple uncertainty set: U(1, ε) = { v ∈ S | ‖v − d̄‖ ≤ ε }, a ball of radius ε around the empirical mean d̄ of the dataset. This is equivalent to the uncertainty set of traditional RO, as it is of the same dimension as the uncertain parameter.
ḡ(u, x) = ∑_{k=1}^K w_k g(v_k, x),        (3)

where g is defined in the original constraint (1) and the w_k are the weights from above.
Subsequently, x̂_N is the solution to the robust optimization problem

minimize    f(x)
subject to  ḡ(u, x) ≤ 0   ∀u ∈ U(K, ε),        (MRO)

where f is the objective function. We call this problem the mean robust optimization (MRO) problem. Given the problem data, we formulate the uncertainty set from clustered data using machine learning, with the choices of K and ε made experimentally. Then, we solve the MRO problem to arrive at a data-driven solution x̂_N which satisfies the probabilistic guarantee (2).
We solve the MRO problem using a direct convex reformulation, following usual
techniques for RO problems in existing literature [3, 11, 35], with adaptations made
for the MRO setup, as well as a reformulation derived for the case p = ∞. We include
simple examples for completeness.
Case p ≥ 1. In the case where p ≥ 1, the MRO problem can be rewritten as

minimize    f(x)
subject to  { maximize_{v_1,…,v_K ∈ S}  ∑_{k=1}^K w_k g(v_k, x)
              subject to  ∑_{k=1}^K w_k ‖v_k − d̄_k‖^p ≤ ε^p }  ≤ 0,        (4)

minimize    f(x)
subject to  ∑_{k=1}^K w_k s_k ≤ 0
            [−g]^*(z_k − y_k, x) + σ_S(y_k) − z_k^T d̄_k + φ(q) λ ‖z_k/λ‖_*^q + λ ε^p ≤ s_k,   k = 1, …, K        (5)
            λ ≥ 0,
The full derivation of the dualized constraint is delayed to Appendix A. Since the dual of the constraint becomes a minimization problem, any feasible solution with objective value less than or equal to 0 will satisfy the constraint, so we can remove the minimization to arrive at the above form. While traditionally we take the supremum instead of maximizing, here the supremum is always achieved as we assume g to be upper-semicontinuous. For specific examples of the conjugate forms of different g, see Bertsimas and den Hertog [11, Section 2.5] and Beck [2, Chapter 4].
When K is set to N, w_k is 1/N, and this formulation is analogous to the convex reduction of the worst-case problem for Wasserstein DRO, which we will introduce in Sect. 3.
We observe from [35, Section 2.2, Remark 1] that lim_{q→∞} φ(q) λ ‖z_k/λ‖_*^q = 0 if ‖z_k‖_* ≤ λ and = ∞ otherwise. Therefore, when p = 1 the reformulation becomes

minimize    f(x)
subject to  ∑_{k=1}^K w_k s_k ≤ 0
            [−g]^*(z_k − y_k, x) + σ_S(y_k) − z_k^T d̄_k + λ ε ≤ s_k,   k = 1, …, K        (6)
            λ ≥ 0,   ‖z_k‖_* ≤ λ,   k = 1, …, K.
minimize    f(x)
subject to  { maximize_{v_1,…,v_K ∈ S}  ∑_{k=1}^K w_k g(v_k, x)
              subject to  ‖v_k − d̄_k‖ ≤ ε,  k = 1, …, K }  ≤ 0,        (7)

minimize    f(x)
subject to  ∑_{k=1}^K w_k s_k ≤ 0
            [−g]^*(z_k − y_k, x) + σ_S(y_k) − z_k^T d̄_k + λ_k ε ≤ s_k,   k = 1, …, K        (8)
            ‖z_k‖_* ≤ λ_k,   k = 1, …, K,
Remark 1 (Case p = ∞ is the limit of case p ≥ 1) In terms of the primal problem, (7)
is the limiting case of (4) as p → ∞. In terms of the reformulated problem with
dualized constraints, problem (8) is the limiting case of (5). The proof, delayed to
Appendix C, extends the ideas stated in [14, Proposition 3].
Example with affine constraints. Consider a single affine constraint of the form

(a + Pu)^T x ≤ b,        (9)
minimize    f(x)
subject to  a^T x − b + φ(q) λ ‖P^T x/λ‖_*^q + λ ε^p + (P^T x)^T ∑_{k=1}^K w_k d̄_k ≤ 0        (10)
            λ ≥ 0,

where the number of variables or constraints does not depend on K. Since the vector ∑_{k=1}^K w_k d̄_k is the average of the data-points in D_N for any K ∈ {1, …, N}, this formulation corresponds to always choosing K = 1. We also note that, for both p = 1 and p = ∞, the subsequent reformulation (11) can be viewed as the robust counterpart when the uncertainty set is a norm ball of radius ε centered at (1/N) ∑_{i=1}^N d_i. If d̄ = 0, the constraint can be simplified even further, obtaining a^T x + ε ‖P^T x‖_* ≤ b, which corresponds to the robust counterpart in RO with norm uncertainty sets [11, Section 2.3], [3, Chapter 2].
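For concreteness, here is a minimal CVXPY sketch (ours, with placeholder data, dimensions, and a Euclidean norm ball centered at the data average) of this norm-ball robust counterpart:

```python
import cvxpy as cp
import numpy as np

# placeholder problem data
n, m, N = 10, 5, 200
rng = np.random.default_rng(0)
D = rng.normal(size=(N, m))            # dataset of uncertainty realizations
a = rng.normal(size=n)
P = rng.normal(size=(n, m))            # constraint (a + P u)^T x <= b, u in R^m
b = 10.0
c = rng.normal(size=n)                 # objective coefficients
eps = 0.1

d_bar = D.mean(axis=0)                 # (1/N) * sum_i d_i

x = cp.Variable(n)
# robust counterpart: a^T x + (P^T x)^T d_bar + eps * ||P^T x||_* <= b
robust_con = a @ x + (P.T @ x) @ d_bar + eps * cp.norm(P.T @ x, 2) <= b
prob = cp.Problem(cp.Minimize(c @ x), [robust_con, x >= -1, x <= 1])
prob.solve()
print(prob.status, prob.value)
```

Here the dual norm of the Euclidean norm is again the Euclidean norm, and the constraint size does not change with K or N, in line with the discussion above.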
with each −g_j being proper, convex, and lower-semicontinuous in u for all x. When we take J = 1, we arrive back at the formulations given in Sect. 2. Note that any problem with multiple uncertain constraints g_j(u, x), j = 1, …, J, where we assume the usual conditions on g_j, can be combined to create a joint constraint of this maximum-of-concave form. As mentioned in Sect. 2.1, this can also be used to model CVaR constraints, which have a maximum-of-concave analytical form. The intuitive constraint to formulate is then

ḡ(u, x) = ∑_{k=1}^K w_k max_{j≤J} g_j(v_k, x) = max_{(j_1,…,j_K) ∈ G} ∑_{k=1}^K w_k g_{j_k}(v_k, x),        (12)

where in the last expression we brought the maximum outside the summation by defining the set G of all possible choices of (j_1, …, j_K). This set has size J^K, as each cluster k has J possible pieces in the maximization function. We perform this switch in order to attain a reformulation akin to (4), where we have a single inner maximization problem for the MRO constraint. As these indices are hard to express, we seek an alternative formulation. We turn to Section 4.2 of Esfahani and Kuhn [24],
ḡ(u, α, x) = ∑_{k=1}^K ∑_{j=1}^J α_{jk} g_j(v_{jk}, x).        (13)

Note that the single concave case given previously follows when we take J = 1. All parameters are defined as in the single concave case, and A denotes the set of admissible weights α (with ∑_{j=1}^J α_{jk} = w_k for each k and α ≥ 0).
Case p = ∞. In the case where p = ∞, the set we consider becomes

U(K, ε) = { u = (v_{11}, …, v_{JK}) ∈ S^{J×K} | ∃α ∈ A,  max_{k=1,…,K} ∑_{j=1}^J (α_{jk}/w_k) ‖v_{jk} − d̄_k‖ ≤ ε }.
Following these changes, x̂_N is again the solution to the robust optimization problem (MRO), defined now with the generalized uncertainty set.
Solving the robust problem. We give the direct reformulation approach for solving the generalized problem for p ≥ 1. The case p = ∞ is delayed to Appendix B. We write the MRO problem as the optimization problem

minimize    f(x)
subject to  { maximize_{v_{11},…,v_{JK} ∈ S, α ∈ A}  ∑_{k=1}^K ∑_{j=1}^J α_{jk} g_j(v_{jk}, x)
              subject to  ∑_{k=1}^K ∑_{j=1}^J α_{jk} ‖v_{jk} − d̄_k‖^p ≤ ε^p }  ≤ 0,        (14)
minimize    f(x)
subject to  ∑_{k=1}^K w_k s_k ≤ 0
            [−g_j]^*(z_{jk} − y_{jk}, x) + σ_S(y_{jk}) − z_{jk}^T d̄_k + φ(q) λ ‖z_{jk}/λ‖_*^q + λ ε^p ≤ s_k        (15)
            k = 1, …, K,   j = 1, …, J
            λ ≥ 0,
minimize    f(x)
subject to  sup_{Q ∈ P_N} E_Q[g(u, x)] ≤ 0,        (16)

where the ambiguity set P_N contains, with high confidence, all distributions that could have generated the training samples D_N, such that the probabilistic guarantee (2) is satisfied. Wasserstein DRO constructs P_N as a ball of radius ε with respect to the Wasserstein metric around the empirical distribution P̂_N = ∑_{i=1}^N δ_{d_i}/N, where δ_{d_i} denotes the Dirac distribution concentrating unit mass at d_i ∈ R^m. Specifically, we write P_N = B_ε^p(P̂_N) = { Q ∈ M(S) | W_p(P̂_N, Q) ≤ ε }, where M(S) is the set of probability distributions supported on S satisfying a light-tailed assumption (more details in Sect. 3.1), and

W_p(Q, Q′) = inf_Π ( ∫_{S^2} ‖u − u′‖^p Π(du, du′) )^{1/p}.

Here, p is any integer greater than 1, and Π is any joint distribution of u and u′ with marginals Q and Q′.
When K = N, the constraint of the DRO problem (16) is equivalent to the constraint of (MRO). In particular, for the case p ≥ 1, the dual of the constraint of (16) is equivalent to the dual of the constraint of (14), with w_k = 1/N [35, 51]. Similarly, in the case where p = ∞, the dual of the constraint of (16) is equivalent to the dual of the constraint of (17). We can then rewrite the Wasserstein DRO problem as (14), the MRO problem, when K = N.
Our approach can be viewed as a form of Wasserstein DRO, with the difference that, when K < N, we deal with the clustered dataset. We form P_N as a ball around the empirical distribution P̂_K of the centroids of our clustered data, P̂_K = ∑_{k=1}^K w_k δ_{d̄_k}, where w_k is the proportion of data in cluster k. This formulation allows for the reduction
of the sample size while preserving key properties of the sample, which translates directly into a reduction in the number of constraints and variables while maintaining high-quality solutions.
Following the parallels between MRO and Wasserstein DRO, we now show that the
conditions for satisfying the probabilistic guarantees are also analogous.
Theorem (1) estimates the probability that the unknown data-generating distribution P lies outside the Wasserstein ball B_ε^p(P̂_N), which is our ambiguity set. Thus, we can estimate the smallest radius ε such that the Wasserstein ball contains the true distribution with probability 1 − β, for some target β ∈ (0, 1). We equate the right-hand side to β, and solve for the ε_N(β) that provides the desired guarantees for Wasserstein DRO [24, Theorem 3.5].
Case p = ∞. When p = ∞, Bertsimas et al. [14, Section 6] note that the light-
tailed assumption is no longer sufficient. Wasserstein DRO satisfies (2) under stronger
assumptions, as given in the following theorem.
We can again equate the right-hand side to β and find ε_N(β). We extend this result to the clustered set in MRO.
Theorem 3 (MRO finite sample guarantee) Assume the light-tailed assumption holds when p ≥ 1, and the corresponding assumptions hold when p = ∞. If β ∈ (0, 1), η_N(K) is the average p-th powered distance of data-points in D_N from their assigned cluster centers, and x̂_N is the optimal solution to (MRO) with uncertainty set U(K, ε_N(β) + η_N(K)^{1/p}), then the finite sample guarantee (2) holds.
Proof Compared with Wasserstein DRO, MRO has to account for the additional difference between the two empirical distributions P̂_N and P̂_K. If we introduce a new parameter, η_N(K), defined as

η_N(K) = (1/N) ∑_{k=1}^K ∑_{i ∈ C_k} ‖d_i − d̄_k‖^p,

the average p-powered distance, with respect to the norm used in the Wasserstein metric, of all data-points in D_N from their assigned cluster centers d̄_k, we notice that

W_p(P̂_K, P̂_N)^p = inf_Π ∫_{S^2} ‖u − u′‖^p Π(du, du′)      (Π any joint dist. of P̂_K, P̂_N)
                 ≤ ∑_{k=1}^K (|C_k|/N) ∫_S ‖u − d̄_k‖^p P̂_N(u | u′ = d̄_k)(du)
                 ≤ ∑_{k=1}^K (|C_k|/N) (1/|C_k|) ∑_{i ∈ C_k} ‖d_i − d̄_k‖^p = η_N(K),

where we have replaced the integral with a finite sum, as the distributions are discrete. Therefore, by Theorems 1, 2 and the triangle inequality [20],

P( P ∈ B^p_{ε_N(β) + η_N(K)^{1/p}}(P̂_K) ) ≥ 1 − β,

which implies U(K, ε_N(β) + η_N(K)^{1/p}) contains all possible realizations of uncertainty with probability 1 − β, so the finite sample guarantee (2) holds.
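As a small illustration (our sketch, not from the paper's code), the radius adjustment of Theorem 3 can be computed directly from the clustered data; here ε_N(β) is treated as a given input, and the Euclidean norm is used so that, for p = 2, the distances match the K-means objective.

```python
import numpy as np
from sklearn.cluster import KMeans

def adjusted_radius(D, K, eps_N_beta, p=2, seed=0):
    """Return eps_N(beta) + eta_N(K)^(1/p), where eta_N(K) is the average
    p-th powered distance of each point to its assigned cluster center."""
    km = KMeans(n_clusters=K, n_init=10, random_state=seed).fit(D)
    dists = np.linalg.norm(D - km.cluster_centers_[km.labels_], axis=1)
    eta_N_K = np.mean(dists ** p)
    return eps_N_beta + eta_N_K ** (1.0 / p)

D_N = np.random.default_rng(0).normal(size=(500, 5))   # placeholder dataset
eps_total = adjusted_radius(D_N, K=10, eps_N_beta=0.05, p=2)
```

Since η_N(K) shrinks as K grows (and vanishes at K = N), the extra conservatism introduced by clustering is explicitly controlled by the quality of the clustering.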
the objective is only minimally affected. With this goal, we would like to minimize the average squared distance of the points in each cluster to their cluster center,

D(K)⋆ = minimize D(K) = minimize (1/N) ∑_{k=1}^K ∑_{d_i ∈ C_k} ‖d_i − d̄_k‖_2^2,

where d̄_k is the mean of the points in cluster C_k. While the best performance is attained with D(K)⋆, in practice we work with the approximation D(K), where C_k is decided by a clustering algorithm. This value upper bounds D(K)⋆. A well-known algorithm is K-means [32], where we create K clusters by iteratively solving a least-squares problem. From here on we use only D(K), and note that for the case p = 2, we have D(K) = η_N(K) from Theorem 3.
In this section, we then show the effects of clustering on the worst-case value of the
constraint function in (MRO). We prove two sets of results, corresponding to g given
as a single concave function, and as a more general maximum-of-concave function,
which includes the maximum-of-affine function.
Quantifying the clustering effect. We calculate the difference between the worst-case values of the constraint in (MRO) for different K,

ḡ^K(x) = maximize_{u_1,…,u_K ∈ S}  ∑_{k=1}^K (|C_k|/N) g(u_k, x)
         subject to  ∑_{k=1}^K (|C_k|/N) ‖u_k − d̄_k‖^p ≤ ε^p.        (MRO-K)
When K = N, ḡ^N(x) is akin to the constraint value for traditional Wasserstein DRO. We also denote by ḡ^{N∗}(x) the value of the constraint without the support constraints u_1, …, u_N ∈ S. From here on, when we mention that the support affects the worst-case constraint value, we refer to situations where at least one of the constraints u_i ∈ S for i = 1, …, N is binding. Formally, the support does not affect the worst-case constraint value if ḡ^N(x) = ḡ^{N∗}(x) for any x feasible for the DRO problem. We note a sufficient but not necessary condition for the support to not affect the worst-case constraint value: the situation in which the support does not affect the uncertainty set, which is defined as

{ u ∈ R^{N×m} : (1/N) ∑_{i=1}^N ‖u_i − d_i‖^p ≤ ε^p } = { u ∈ S^N : (1/N) ∑_{i=1}^N ‖u_i − d_i‖^p ≤ ε^p }.
If the support satisfies this condition, we can conclude that ḡ^N(x) = ḡ^{N∗}(x) for any x feasible for the DRO problem, and obtain improved bounds below. While the condition depends on the location of the data-points, it is acceptable, as this is a condition we can check given data to potentially improve the following bounds, without having to solve the MRO problem. We also define the L-smooth condition, which is needed in the subsequent theorems.
Theorem 4 With the same x and ε, and for any integer p ≥ 1, we always have ḡ^N(x) ≤ ḡ^K(x).
The proof is delayed to Appendix D. The results also hold for p = ∞, as we have shown in Remark 1 that the case p = ∞ is the limit of the case p ≥ 1, and these results hold under the limit.
Let Δ be the maximum difference in constraint value resulting from relaxing the support constraint on the MRO uncertainty sets, subject to x being feasible for problem (MRO), Δ = max_{x ∈ X}(ḡ^{N∗}(x) − ḡ^N(x)). As we assume Assumption 1 to hold, combined with the smoothness of g, we note that when solving for ḡ^{N∗}(x), the chosen v_i values without the support constraint will still remain in the domain dom_u g. Refer to a similar argument in Appendix D (ii) for details. The function ḡ^{N∗}(x) − ḡ^N(x) is then continuous in x and everywhere defined for x ∈ X; thus, maximizing with respect to X, a compact set, the value Δ is finite. Then, we observe that ḡ^K(x) − ḡ^N(x) ≤ Δ + (L/2)D(K) for all such x, so the smaller the D(K) (i.e., the higher-quality the clustering procedure), the smaller the increase in the worst-case constraint value. In addition, the value Δ is independent of K, as it only depends on ḡ^{N∗}(x) and ḡ^N(x).
Corollary 1 Consider the problem where g is itself the objective function we would like to minimize and X ⊆ R^n represents the constraints, which are deterministic. Then, (L/2)D(K) + Δ upper bounds the difference in optimal values of the MRO problem with K and N clusters.
ḡ^K(x) = minimize_{λ≥0, γ≥0, z, s}  ∑_{k=1}^K (|C_k|/N) s_k
         subject to  [−g_j]^*(z_{jk} − H^T γ_{jk}) + γ_{jk}^T (h − H d̄_k) − z_{jk}^T d̄_k + λ ε^p
                     + φ(q) λ ‖z_{jk}/λ‖_*^q ≤ s_k,   k = 1, …, K,   j = 1, …, J,        (MRO-K-Dual)

where the variables y_{jk} from (15) are replaced by H^T γ_{jk}, with γ_{jk} ≥ 0, due to the polyhedral support S = {u | Hu ≤ h}. Similarly, for p = ∞,

ḡ^K(x) = minimize_{λ≥0, γ≥0, z, s}  ∑_{k=1}^K (|C_k|/N) s_k
         subject to  [−g_j]^*(z_{jk} − H^T γ_{jk}) + γ_{jk}^T (h − H d̄_k) − z_{jk}^T d̄_k + λ_k ε ≤ s_k,   k = 1, …, K,   j = 1, …, J,
                     ‖z_{jk}‖_* ≤ λ_k,   k = 1, …, K,   j = 1, …, J.        (MRO-K-Dual-∞)
The following theorems hold for both p ≥ 1 and p = ∞.
where δ(K, z, γ) = (1/N) ∑_{k=1}^K ∑_{i ∈ C_k} max_{j≤J} ((−z_{jk} − H^T γ_{jk})^T (d_i − d̄_k)), and z, γ are the dual variables. ḡ^{N∗}(x) is the value of the problem without support constraints.
The proof is delayed to Appendix E. Due to the nonconvex and nonconcave nature of maximum-of-concave functions, the lower bound now involves an extra term δ(K, z, γ). However, when g is a maximum-of-affine function, which is convex, we know ḡ^{N∗}(x) to be an upper bound on ḡ^K(x).
Corollary 3 When g is the maximum of affine functions with domain dom_u g_j = R^m and polyhedral support S = {u | Hu ≤ h}, for the same x and ε,

ḡ^N(x) − δ(K, z, γ) ≤ ḡ^K(x) ≤ ḡ^{N∗}(x).

This follows from the fact that L_j = 0 for all affine functions g_j.
Uncertain objective. When the uncertainty is in the objective, Theorem 5 and Corollary 3 quantify the possible difference in optimal values between K and N clusters. We let Δ = max_x (ḡ^{N∗}(x) − ḡ^N(x)), subject to x being feasible for problem (MRO). Note that this is only needed for the upper bound.
Corollary 4 Consider the problem where g is itself the objective function we would like to minimize and X ⊆ R^n represents the constraints, which are deterministic. Then, δ(K, z, γ) upper bounds the possible decrease in optimal values of the MRO problem with K clusters compared with that of N clusters. Similarly, max_{j≤J}(L_j/2)D(K) + Δ upper bounds the possible increase.
Uncertain constraints. When the uncertainty is in the constraints, the difference between ḡ^N(x) and ḡ^K(x) as given in Theorem 5 no longer directly reflects the difference in optimal values. Instead, clustering affects the feasible set for x as follows. For any x̂, in the case ḡ^N(x̂) ≥ ḡ^K(x̂), ḡ^K(x̂) can be at most δ(K, z, γ) lower in value than ḡ^N(x̂). Since both values are constrained to be nonpositive in (MRO), the feasible region of the MRO problem with K clusters may be less restricted than that of N clusters. This indirectly allows MRO with K clusters to obtain a smaller optimal value. On the other hand, in the case ḡ^N(x̂) ≤ ḡ^K(x̂), ḡ^K(x̂) can be at most max_{j≤J}(L_j/2)D(K) + Δ higher in value than ḡ^N(x̂). This indirectly leads MRO with K clusters to obtain a larger optimal value.
Choosing K . When the uncertain constraint is affine and S does not affect the worst-
case constraint value, the number of clusters K does not affect the final solution,
so it is always best to choose K = 1. When S affects the worst-case constraint
6 Numerical examples
We consider the capital budgeting problem in [4, Section 4.2], where we select a
portfolio of investment projects maximizing the total net present value (NPV) of the
portfolio, while the weighted sum of the projects is less than a total budget θ . The NPV
for all projects is η(u) ∈ R^n, where for each project j, η_j(u) is the sum of discounted cash flows F_{jt} over the years t = 0, …, T, i.e., η_j(u) = ∑_{t=0}^T F_{jt}/(1 + u_j)^t. Here, u_j ∈ R_+ is the discount rate of project j.

Fig. 1 D(K) vs. K for all experiments. The red lines are the suggested values

We formulate the uncertain function to be minimized as

g(u, x) = −η(u)^T x,
where x = (x_1, …, x_n) ∈ {0, 1}^n is the indicator for selecting each project. The discount rate u_j is subject to uncertainty, as it depends on several factors, such as the interest rate of the country where project j is located and the level of return the decision-maker requires to compensate for the risk. The function g is concave and monotonically increasing in u, and we can define a domain u ≥ 0 so that Assumption 1 and Theorem 4 apply. The robust problem becomes

minimize_{x, τ}    τ
subject to  ḡ(u, x) ≤ τ   ∀u ∈ U(K, ε)
            h^T x ≤ θ,   x ∈ {0, 1}^n,
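A small sketch (ours, with placeholder cash flows and discount rates) of evaluating the uncertain function g(u, x) = −η(u)^T x for a given scenario:

```python
import numpy as np

def eta(F, u):
    """eta_j(u) = sum_{t=0}^T F_jt / (1 + u_j)^t for each project j."""
    T = F.shape[1] - 1
    t = np.arange(T + 1)
    return np.sum(F / (1.0 + u[:, None]) ** t, axis=1)

def g(u, x, F):
    """Uncertain function g(u, x) = -eta(u)^T x (to be minimized)."""
    return -eta(F, u) @ x

# placeholder instance: 4 projects, horizon T = 3
F = np.array([[-10.0, 4.0, 4.0, 4.0],
              [ -8.0, 3.0, 3.0, 3.0],
              [-12.0, 5.0, 5.0, 5.0],
              [ -6.0, 2.0, 3.0, 2.0]])
u = np.array([0.05, 0.08, 0.03, 0.10])   # uncertain discount rates
x = np.array([1, 0, 1, 1])               # selected projects
print(g(u, x, F))
```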
Fig. 2 Capital budgeting. Descriptions are given in the beginning of Sect. 6. The difference in objective values (bottom left) is calculated as ḡ^K(x) − ḡ^N(x), and the theoretical bound is (L/2)D(K) from Corollary 1
K = 120∗ denotes the formulation without the support constraint. Therefore, while ḡ^{N∗}(x) slightly upper bounds ḡ^N(x), we can approximate their difference Δ ≈ 0 for small enough ε, for which the upper bound (L/2)D(K) thus holds. In fact, in this example, even for larger ε where we observe Δ > 0, the actual difference between ḡ^K and ḡ^N is bounded by (L/2)D(K). We see that the elbow of the upper bound is at K = 2, and the true difference follows the same trend, matching the suggestion from Fig. 1. Therefore, setting K = 2 is the optimal decision, with a time reduction of 2 orders of magnitude, and a complexity reduction from 26,626 variables and 12,000 power cones to 666 variables and 200 power cones.
We consider a market that forbids short-selling and has m assets as in [24]. Daily returns of these assets are given by the random vector d = (d_1, …, d_m) ∈ R^m. The percentage weights (of the total capital) invested in each asset are given by the decision vector x = (x_1, …, x_n) ∈ R^n. We restrict our selection to at most θ assets, given by the 0-norm cardinality constraint below. The distribution P is unknown, but we have observed a historical dataset D_N. Our objective is to minimize the CVaR with respect to variable x,

minimize    CVaR(−u^T x, α)
subject to  1^T x = 1,   x ≥ 0,   ‖x‖_0 ≤ θ,
which represents the average of the α fraction of largest portfolio losses that occur. In other words, the CVaR term seeks to ensure that the expected magnitude of portfolio losses, when they occur, is low. The objective has an analytical form with an extra variable τ, given as [24, 46]: E_P[τ + (1/α) max{−u^T x − τ, 0}]. From this, we obtain g as the maximum of affine functions.
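As an illustration of this analytical form (our sketch, using the standard Rockafellar–Uryasev representation), note that τ + (1/α) max{−u^T x − τ, 0} = max{τ, (1 − 1/α)τ − (1/α) u^T x}, a maximum of two functions affine in u; minimizing its empirical expectation over τ recovers the sample CVaR:

```python
import numpy as np

def cvar_objective(tau, x, samples, alpha):
    """Empirical value of E[tau + (1/alpha) * max(-u^T x - tau, 0)]."""
    losses = -samples @ x                      # portfolio losses -u^T x
    return tau + np.mean(np.maximum(losses - tau, 0.0)) / alpha

# placeholder data: 1000 return samples for 5 assets, equal weights
rng = np.random.default_rng(0)
samples = rng.normal(0.001, 0.02, size=(1000, 5))
x = np.ones(5) / 5
alpha = 0.2

taus = np.linspace(-0.1, 0.1, 2001)
print(min(cvar_objective(t, x, samples, alpha) for t in taus))   # sample CVaR
```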
We examine the classic facility location problem [9, 33]. Consider a set of n potential facilities and m customers. Variable x ∈ {0, 1}^n describes whether or not we construct each facility i, for i = 1, …, n, with cost c_i. In addition, we would like to satisfy the uncertain demand u ∈ R^m at minimal cost. We define the variable X ∈ R^{n×m}, where X_{ij} corresponds to the portion of the demand of customer j shipped from facility i, with corresponding cost C_{ij}. Furthermore, r ∈ R^n represents the production capacity of each facility, and u ∈ R^m represents the uncertain demand from each customer. For each customer j, X_j represents the proportion of goods shipped from any facility to that customer, which sums to 1. For each facility i, (X^T)_i represents the proportion of goods shipped to any customer. Putting this all together, we obtain the objective to minimize, c^T x + tr(C^T X), subject to constraints 1^T X_j = 1, j = 1, …, m, as well as multiple affine uncertain capacity constraints,

g_i(u, x) = (X^T)_i u − r_i x_i ≤ 0,   i = 1, …, n,
Fig. 3 Sparse portfolio. Descriptions are given in the beginning of Sect. 6. The difference in objective values (bottom left) is calculated as ḡ^N(x) − ḡ^K(x), and the theoretical bound is δ(K, z, γ) from Corollary 3
Note that δ(K, z, γ) upper bounds the difference in constraint values, and only indirectly affects the objective values through restrictions on the feasible region. Therefore, it is not an upper bound on the difference in objective values, merely an estimate. We cannot directly compare this upper bound against the change in constraint values, as the constraint value will always be near 0 at optimality. We thus compare it against the change in objective values.

Fig. 4 Facility location. Descriptions are given in the beginning of Sect. 6. The difference in objective values (bottom left) is calculated as Obj(N) − Obj(K), and we compare it to the theoretical upper bound δ(K, z, γ) on the worst-case constraint value ḡ^N(x) − ḡ^K(x), from Corollary 3
We consider a 2-item newsvendor problem where, at the beginning of each day, the vendor orders x ∈ R_+^2 products at prices h = (4, 5). These products will be sold at prices c = (5, 6.5), until either the uncertain demand u or the inventory x is exhausted. The objective function to minimize is the sum of the ordering cost minus the revenue,
and take the minimum between the generated values and 40. For the upper bound δ(K, z, γ) on ḡ^N(x) − ḡ^K(x) from Corollary 3, we have (1/N) ∑_{k=1}^K ∑_{i ∈ C_k} max_{j≤4} ((−c̃_j − C^T γ_{jk})^T (d_i − d̄_k)), where c̃_1 = 0, c̃_2 = c_1 e_1, c̃_3 = c_2 e_2, c̃_4 = c.
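A minimal sketch (ours, with placeholder shapes) of evaluating this bound given clustered data and dual variables: here C is assumed to be the matrix defining the polyhedral support, and the γ_{jk} are assumed to be available, e.g., from solving (MRO-K-Dual) with K clusters.

```python
import numpy as np

def delta_bound(D, labels, centroids, c_tilde, C, gamma):
    """delta(K, z, gamma) = (1/N) sum_i max_j (-c_tilde_j - C^T gamma_{jk})^T (d_i - dbar_k),
    where k is the cluster assigned to data-point i."""
    N, J = D.shape[0], len(c_tilde)
    total = 0.0
    for i in range(N):
        k = labels[i]
        diff = D[i] - centroids[k]
        total += max((-c_tilde[j] - C.T @ gamma[j, k]) @ diff for j in range(J))
    return total / N

# placeholder example: m = 2 items, J = 4 pieces, K = 3 clusters, box support 0 <= u <= 40
rng = np.random.default_rng(0)
D = rng.uniform(0, 40, size=(100, 2))
labels = rng.integers(0, 3, size=100)
centroids = np.array([D[labels == k].mean(axis=0) for k in range(3)])
c = np.array([5.0, 6.5])
c_tilde = [np.zeros(2), c[0] * np.eye(2)[0], c[1] * np.eye(2)[1], c]
C = np.vstack([np.eye(2), -np.eye(2)])       # support constraints C u <= h
gamma = np.abs(rng.normal(size=(4, 3, 4)))   # placeholder dual variables
print(delta_bound(D, labels, centroids, c_tilde, C, gamma))
```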
Results To examine the effect of the outlier, in Fig. 6, we compare, for K = 10 and
K = 100, the objectives and tradeoff curves for the three methods. We note that,
when the outlier is averaged with other data-points, the final in-sample objective may
be improved, as the centroid moves closer to the non-outlier points. We observe that
MRO, in which the outlier may be clustered with other points, offers a lower in-sample
objective than AUG-MRO, in which the outlier is considered its own cluster. MRO has
in fact offered protection against the outlier. And, as expected, ROB-MRO, where the
outlier point is removed, yields the best in-sample results. Regardless of the method,
we note that the final out-of-sample tradeoff curves are near-identical. Comparing the
plots for K = 10 and K = N = 100, the difference between MRO and ROB-MRO
for K = 10 is not larger than the difference for K = 100, which shows that, while
removing outliers a priori may be helpful, the effect of outliers will not be worse for
MRO compared to classic Wasserstein DRO.
In Fig. 7, we compare the in- and out-of-sample objective values for MRO. While setting K = 1 yields suboptimal results, we note that for K = 5 and above, we can achieve similar performance as setting K = 100. We note that the upper bound on ḡ^N(x) − ḡ^K(x), given in Corollary 3, holds for MRO. We again note that the bounds we observe do not depend on ḡ^{N∗}(x), so it is irrelevant whether or not the support has an effect on the worst-case constraint value. Regardless, we see that the support only minimally affects the worst-case constraint value, and only at higher values of ε. Overall, choosing K = 5, we obtain an order of magnitude computational speed-up.
7 Conclusions
Fig. 6 Newsvendor. Comparing the three methods for K = 10 and K = 100. Left: in-sample objective values vs. ε. Right: objective value vs. β
Fig. 7 Newsvendor, MRO. Descriptions are given in the beginning of Sect. 6. The difference in objective values (bottom left) is calculated as ḡ^N(x) − ḡ^K(x), and the theoretical bound is δ(K, z, γ) from Corollary 3
We applied the Lagrangian in the first equality. Then, as the summation is over upper-
semicontinuous functions g j (v jk , x) concave in v jk , we applied the Von Neumann-
Fan minimax theorem [38] to interchange the inf and the sup. Next, we rewrite the
formulation using an epigraph trick, and make a change of variables.
= { sup_{α ∈ A} inf_{λ ≥ 0, s}  ∑_{k=1}^K s_k
    subject to  sup_{v_{11},…,v_{JK} ∈ S}  ∑_{j=1}^J α_{jk}(g_j(v_{jk}, x) − λ c_k(v_{jk})) ≤ s_k,   k = 1, …, K,

= { sup_{α ∈ A} inf_{λ ≥ 0, s}  ∑_{k=1}^K s_k
    subject to  sup_{α_{11} v_{11} ∈ α_{11} S, …, α_{JK} v_{JK} ∈ α_{JK} S}  ∑_{j=1}^J α_{jk}(g_j((α_{jk} v_{jk})/α_{jk}, x) − λ c_k((α_{jk} v_{jk})/α_{jk})) ≤ s_k,   k = 1, …, K.
For the new functions defined, we applied the definition of conjugate functions. We also define the characteristic function χ_S(v), with χ_S(v) = 0 if v ∈ S, and = ∞ otherwise.

∑_{j=1}^J α_{jk} [−g_j + χ_S + λ c_k]^*(0) ≤ s_k,   k = 1, …, K.
Next, borrowing results from Esfahani and Kuhn [24, Theorem 4.2], Rockafellar and Wets [41, Theorem 11.23(a), p. 493], and Zhen et al. [51, Lemma B.8], with regard to the conjugate functions of infimal convolutions and p-norm balls, we note that:

= { sup_{α ∈ A} inf_{λ ≥ 0, s, z, y}  ∑_{k=1}^K s_k
    subject to  ∑_{j=1}^J α_{jk}([−g_j]^*(z_{jk} − y_{jk}, x) + σ_S(y_{jk}) − z_{jk}^T d̄_k + φ(q) λ ‖z_{jk}/λ‖_*^q + λ ε^p) ≤ s_k,
                k = 1, …, K.
Taking the supremum over α, and noting that ∑_{j=1}^J α_{jk} = w_k for all k, we arrive at

= { inf_{λ ≥ 0, s, z, y}  ∑_{k=1}^K s_k
    subject to  w_k([−g_j]^*(z_{jk} − y_{jk}, x) + σ_S(y_{jk}) − z_{jk}^T d̄_k + φ(q) λ ‖z_{jk}/λ‖_*^q + λ ε^p) ≤ s_k,   k = 1, …, K,   j = 1, …, J,
We again give the general proof for J ≥ 1. For the case J = 1, refer to the simpler
proof [47, Appendix B]. When p = ∞, we have
minimize    f(x)
subject to  { maximize_{v_{11},…,v_{JK} ∈ S, α ∈ A}  ∑_{k=1}^K ∑_{j=1}^J α_{jk} g_j(v_{jk}, x)
              subject to  ∑_{j=1}^J (α_{jk}/w_k) ‖v_{jk} − d̄_k‖ ≤ ε,  k = 1, …, K }  ≤ 0,        (17)

minimize    f(x)
subject to  ∑_{k=1}^K w_k s_k ≤ 0
            [−g_j]^*(z_{jk} − y_{jk}, x) + σ_S(y_{jk}) − z_{jk}^T d̄_k + λ_k ε ≤ s_k        (18)
            k = 1, …, K,   j = 1, …, J
            ‖z_{jk}‖_* ≤ λ_k,   k = 1, …, K,   j = 1, …, J,
We have again formulated the Lagrangian and applied the minimax theorem to the sum of concave functions in v_{jk}. Now, we can rewrite this in epigraph form,

= { sup_{α ∈ A} inf_{λ ≥ 0, s}  ∑_{k=1}^K s_k
    subject to  sup_{v_{11},…,v_{JK} ∈ S}  λ_k ε + ∑_{j=1}^J α_{jk} g_j(v_{jk}, x) − λ_k (α_{jk}/w_k) ‖v_{jk} − d̄_k‖ ≤ s_k,   k = 1, …, K,

then use the definition of the dual norm to rewrite the constraints,

= { sup_{v_{11},…,v_{JK} ∈ S}  λ_k ε + ∑_{j=1}^J min_{‖z_{jk}‖_* ≤ λ_k/w_k}  α_{jk} g_j(v_{jk}, x) − α_{jk} z_{jk}^T (v_{jk} − d̄_k) ≤ s_k,   k = 1, …, K,

= { sup_{α_{11} v_{11} ∈ α_{11} S, …, α_{JK} v_{JK} ∈ α_{JK} S}  λ_k ε + ∑_{j=1}^J min_{‖z_{jk}‖_* ≤ λ_k/w_k}  α_{jk} g_j((α_{jk} v_{jk})/α_{jk}, x) − α_{jk} z_{jk}^T (v_{jk} − d̄_k) ≤ s_k,   k = 1, …, K.
Note that rescaling λ_k did not affect the value of the problem, as minimizing λ_k ε is equivalent to minimizing λ_k ε w_k, since w_k > 0. Lastly, taking the supremum over α, we arrive at

= { inf_{λ ≥ 0, s, z, y}  ∑_{k=1}^K s_k
    subject to  w_k(λ_k ε + [−g_j]^*(z_{jk} − y_{jk}, x) + σ_S(y_{jk}) − z_{jk}^T d̄_k) ≤ s_k,   k = 1, …, K,   j = 1, …, J,
                ‖z_{jk}‖_* ≤ λ_k,   k = 1, …, K,   j = 1, …, J.
We prove the dual equivalence. For a proof of primal equivalence, see the appendix
of [47].
ḡ^K(x; p)
= { minimize  ∑_{k=1}^K w_k s_k
    subject to  λ ≥ 0, z_k ∈ R^m, y_k ∈ R^m, s_k ∈ R,  k = 1, …, K,
                [−g]^*(z_k − y_k) + σ_S(y_k) − z_k^T d_k + φ(q) λ ‖z_k/λ‖_*^q + λ ε^p ≤ s_k,  k = 1, …, K
≥ { minimize  ∑_{k=1}^K w_k s_k
    subject to  λ_k ≥ 0, z_k ∈ R^m, y_k ∈ R^m, s_k ∈ R,  k = 1, …, K,
                [−g]^*(z_k − y_k) + σ_S(y_k) − z_k^T d_k + φ(q) λ_k ‖z_k/λ_k‖_*^q + λ_k ε^p ≤ s_k,  k = 1, …, K
= { minimize  ∑_{k=1}^K w_k s_k
    subject to  z_k ∈ R^m, y_k ∈ R^m, s_k ∈ R,  k = 1, …, K,
                [−g]^*(z_k − y_k) + σ_S(y_k) − z_k^T d_k + ε ‖z_k‖_* ≤ s_k,  k = 1, …, K,

where the first equality is established in Appendix A and the second equality follows from Lemma 1. Remark that the inequality in the second step simply follows as we introduce λ_k and do not impose that λ_k = λ for all k = 1, …, K. Hence, considering the limit for p tending to infinity now gives us lim inf_{p→∞} ḡ^K(x; p) ≥ ḡ^K(x; ∞). It remains to prove the reverse, lim sup_{p→∞} ḡ^K(x; p) ≤ ḡ^K(x; ∞).
Second, we have for any p > 1 with 1/p + 1/q = 1 that

ḡ^K(x; p)
≤ { minimize  ∑_{k=1}^K w_k s_k
    subject to  z_k ∈ R^m, y_k ∈ R^m, s_k ∈ R,  k = 1, …, K,
                [−g]^*(z_k − y_k) + σ_S(y_k) − z_k^T d_k
                + φ(q) ((q−1)/q)^{1−q} ε (max_{k′=1,…,K} ‖z_{k′}‖_*)^{1−q} ‖z_k‖_*^q
                + ((q−1)/q) ε^{1/(1−q)} (max_{k′=1,…,K} ‖z_{k′}‖_*) ε^p ≤ s_k,  k = 1, …, K,
                (q − 1)^{1/4} ≤ ‖z_k‖_* ≤ (q − 1)^{−1/4},  k = 1, …, K,

which follows from the choice λ_k = ((q−1)/q) ε^{1/(1−q)} max_{k′=1,…,K} ‖z_{k′}‖_* and by imposing the restrictions (q − 1)^{1/4} ≤ ‖z_k‖_* ≤ (q − 1)^{−1/4} for all k = 1, …, K. Next, using the identities φ(q)((q−1)/q)^{1−q} = ((q − 1)^{q−1}/q^q)((q−1)/q)^{1−q} = 1/q and ε^{p + 1/(1−q)} = ε (since p + 1/(1−q) = 1), as well as pulling ε ‖z_k‖_* > 0 out of the last two terms, we can rewrite the first constraints as

[−g]^*(z_k − y_k) + σ_S(y_k) − z_k^T d_k
+ ε ‖z_k‖_* ( (1/q)(max_{k′=1,…,K} ‖z_{k′}‖_* / ‖z_k‖_*)^{1−q} + ((q−1)/q)(max_{k′=1,…,K} ‖z_{k′}‖_* / ‖z_k‖_*) ) ≤ s_k,  k = 1, …, K.
ḡ^K(x; p) ≤ { minimize  ∑_{k=1}^K w_k s_k
              subject to  z_k ∈ R^m, y_k ∈ R^m, s_k ∈ R,  k = 1, …, K,
                          [−g]^*(z_k − y_k) + σ_S(y_k) − z_k^T d_k + ε ‖z_k‖_* D(q) ≤ s_k,  k = 1, …, K,
                          (q − 1)^{1/4} ≤ ‖z_k‖_* ≤ (q − 1)^{−1/4},  k = 1, …, K.
Let

ḡ_u^K(x; p) = { minimize  ∑_{k=1}^K w_k s_k
                subject to  z_k ∈ R^m, y_k ∈ R^m, s_k ∈ R,  k = 1, …, K,
                            [−g]^*(z_k − y_k) + σ_S(y_k) − z_k^T d_k + ε ‖z_k‖_* D(p/(p−1)) ≤ s_k,  k = 1, …, K        (20)
                            (p − 1)^{−1/4} ≤ ‖z_k‖_* ≤ (p − 1)^{1/4},  k = 1, …, K.
Hence, as q = p/(p−1) and q − 1 = 1/(p−1), we have ḡ^K(x; p) ≤ ḡ_u^K(x; p) for all p > 1. Hence, taking the limit p → ∞ we have ḡ^K(x; ∞) ≤ lim sup_{p→∞} ḡ^K(x; p). In fact, as the function D(p/(p−1)) defined in Lemma 2 is nonincreasing for all p sufficiently large, this implies that ḡ_u^K(x; p) is nonincreasing for p sufficiently large, and hence we have ḡ^K(x; ∞) ≤ lim sup_{p→∞} ḡ^K(x; p) ≤ lim inf_{p→∞} ḡ_u^K(x; p) = lim_{p→∞} ḡ_u^K(x; p). We now prove here that lim_{p→∞} ḡ_u^K(x; p) = lim inf_{p→∞} ḡ_u^K(x; p) ≤ ḡ^K(x; ∞). Consider any feasible sequence {(z_k^t, y_k^t, s_k^t = [−g]^*(z_k^t − y_k^t) + σ_S(y_k^t) − (z_k^t)^T d_k + ε ‖z_k^t‖_*)}_{t≥1} in the optimization problem characterizing ḡ^K(x; ∞) in Equation (19), so that lim_{t→∞} ∑_{k=1}^K w_k s_k^t = ḡ^K(x; ∞). Let z̃_k^t ∈ arg max{ ‖z‖_* | z ∈ R^m, ‖z − z_k^t‖_* ≤ 1/t } for all t ≥ 1 and k = 1, …, K, and observe that

lim_{p→∞} ḡ_u^K(x; p) = lim_{t→∞} ḡ_u^K(x; p_t) = lim_{t→∞} ∑_{k=1}^K w_k s̃_k^t
= lim_{t→∞} ∑_{k=1}^K w_k ([−g]^*(z̃_k^t − ỹ_k^t) + σ_S(ỹ_k^t) − (z̃_k^t)^T d_k + ε ‖z̃_k^t‖_* D(p_t/(p_t − 1)))
≤ lim_{t→∞} ∑_{k=1}^K w_k ([−g]^*(z_k^t − y_k^t) + σ_S(y_k^t) − (z_k^t)^T d_k + ε ‖z_k^t‖_* D(p_t/(p_t − 1)))
  + ∑_{k=1}^K w_k (max_{s∈S} ‖s‖ + ‖d_k‖ + ε D(p_t/(p_t − 1)))/t
≤ lim_{t→∞} ∑_{k=1}^K w_k ([−g]^*(z_k^t − y_k^t) + σ_S(y_k^t) − (z_k^t)^T d_k + ε ‖z_k^t‖_* D(p_t/(p_t − 1)))
= lim_{t→∞} ∑_{k=1}^K w_k ([−g]^*(z_k^t − y_k^t) + σ_S(y_k^t) − (z_k^t)^T d_k + ε ‖z_k^t‖_* + ε ‖z_k^t‖_* (D(p_t/(p_t − 1)) − 1))
≤ lim_{t→∞} ∑_{k=1}^K w_k ([−g]^*(z_k^t − y_k^t) + σ_S(y_k^t) − (z_k^t)^T d_k + ε ‖z_k^t‖_* + ε (p_t − 1)^{1/4} (D(p_t/(p_t − 1)) − 1))
≤ lim_{t→∞} ∑_{k=1}^K w_k s_k^t = ḡ^K(x; ∞).
To establish the third inequality, observe first that −(z̃_k^t)^T d_k = −(z_k^t)^T d_k − (z̃_k^t − z_k^t)^T d_k ≤ −(z_k^t)^T d_k + ‖z̃_k^t − z_k^t‖_* ‖d_k‖ ≤ −(z_k^t)^T d_k + ‖d_k‖/t. Second, remark that we have
Lemma 1 We have

min_{λ≥0} φ(q) λ ‖z/λ‖_*^q + λ ε^p = ε ‖z‖_*

for any p > 1 and q > 1 for which 1/p + 1/q = 1, φ(q) = (q − 1)^{q−1}/q^q, and ε > 0.
Proof Remark that, as the objective function λ ↦ φ(q) λ ‖z/λ‖_*^q + λ ε^p is continuous and we have lim_{λ→0} φ(q) λ ‖z/λ‖_*^q + λ ε^p = lim_{λ→∞} φ(q) λ ‖z/λ‖_*^q + λ ε^p = ∞ as ε > 0, there must exist a minimizer λ⋆ ∈ arg min_{λ≥0} φ(q) λ ‖z/λ‖_*^q + λ ε^p with λ⋆ > 0. The first-order optimality condition gives

(1 − q) φ(q) λ^{−q} ‖z‖_*^q + ε^p = 0   ⟺   λ⋆ = ((q − 1)/q) ε^{1/(1−q)} ‖z‖_*.

Hence, we have

min_{λ≥0} φ(q) λ^{1−q} ‖z‖_*^q + λ ε^p = φ(q) (λ⋆)^{1−q} ‖z‖_*^q + λ⋆ ε^p
 = φ(q) ((q − 1)/q)^{1−q} ε ‖z‖_*^{1−q} ‖z‖_*^q + ((q − 1)/q) ε^{1/(1−q)} ‖z‖_* ε^p
 = φ(q) ((q − 1)^{1−q}/q^{1−q}) ε ‖z‖_* + ((q − 1)/q) ε^{p + 1/(1−q)} ‖z‖_*
 = ((q − 1)^{q−1}/q^q)((q − 1)^{1−q}/q^{1−q}) ε ‖z‖_* + ((q − 1)/q) ε ‖z‖_*
 = (1/q) ε ‖z‖_* + ((q − 1)/q) ε ‖z‖_* = ε ‖z‖_*,

where we used

p + 1/(1−q) = 1/(1 − 1/q) + 1/(1−q) = q/(q − 1) + 1/(1−q) = (−q)/(1−q) + 1/(1−q) = (1−q)/(1−q) = 1.
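A quick numerical check of Lemma 1 (our sketch, with the Euclidean norm so that it coincides with its dual norm, and placeholder values of p, ε, and z):

```python
import numpy as np
from scipy.optimize import minimize_scalar

p, q = 3.0, 1.5                          # 1/p + 1/q = 1
eps = 0.7
z = np.array([1.0, -2.0, 0.5])
phi = (q - 1) ** (q - 1) / q ** q        # phi(q) = (q-1)^(q-1) / q^q

obj = lambda lam: phi * lam * np.linalg.norm(z / lam) ** q + lam * eps ** p
res = minimize_scalar(obj, bounds=(1e-6, 100.0), method="bounded")
print(res.fun, eps * np.linalg.norm(z))  # both approximately eps * ||z||
```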
Proof The objective function is convex in t. Convex functions attain their maximum
on the extreme points of their domain. The limits can be verified using standard
manipulations.
D Proof of Theorem 4
We prove (i) ḡ^N(x) ≤ ḡ^K(x), (ii) ḡ^K(x) ≤ ḡ^{N∗}(x) + (L/2)D(K), and (iii) when the support constraint does not affect the worst-case value, ḡ^K(x) ≤ ḡ^N(x) + (L/2)D(K).
Proof of (i). We begin with a feasible solution v_1, …, v_N of (MRO-K) with K = N. Then, for K < N, we set u_k = ∑_{i∈C_k} v_i/|C_k| for each of the K clusters. We see that u_k with k = 1, …, K satisfies the constraints of (MRO-K) for K < N, as

∑_{k=1}^K (|C_k|/N) ‖u_k − d̄_k‖^p = ∑_{k=1}^K (|C_k|/N) ‖ ∑_{i∈C_k} v_i/|C_k| − ∑_{i∈C_k} d_i/|C_k| ‖^p
 ≤ ∑_{k=1}^K (|C_k|/N)(1/|C_k|) ∑_{i∈C_k} ‖v_i − d_i‖^p
 = (1/N) ∑_{k=1}^K ∑_{i∈C_k} ‖v_i − d_i‖^p ≤ ε^p,

where we have applied the triangle inequality, Jensen's inequality for the convex function f(x) = x^p, and the constraint of (MRO-K) with K = N. In addition, since the support S is convex, for every k our constructed u_k, as the average of selected points v_i ∈ S, must also be within S. The same applies with respect to the domain of g.
Since we have shown that the u_k's satisfy the constraints for (MRO-K) with K < N, they form a feasible solution. We now show that for this pair of feasible solutions, in terms of the objective value, ḡ^K(x) ≥ ḡ^N(x). By assumption, g is concave in the uncertain parameter, so by Jensen's inequality,

∑_{k=1}^K (|C_k|/N) g( (1/|C_k|) ∑_{i∈C_k} v_i, x ) ≥ ∑_{k=1}^K (|C_k|/N)(1/|C_k|) ∑_{i∈C_k} g(v_i, x)
∑_{k=1}^K (|C_k|/N) g(u_k, x) ≥ (1/N) ∑_{i=1}^N g(v_i, x).

Since this holds true for u_k's constructed from any feasible solution v_1, …, v_N, we must have ḡ^K(x) ≥ ḡ^N(x).
Proof of (ii). Next, we prove ḡ^K(x) ≤ ḡ^{N∗}(x) + (L/2)D(K) by making use of the L-smooth condition on −g. We first solve (MRO-K) with K < N to obtain a feasible solution u_1, …, u_K. We then set Δ_k = u_k − d̄_k for each k ≤ K, and set v_i = d_i + Δ_k ∀i ∈ C_k, k = 1, …, K. These satisfy the constraint of (MRO-K) with
K = N, as

(1/N) ∑_{i=1}^N ‖v_i − d_i‖^p = (1/N) ∑_{k=1}^K ∑_{i∈C_k} ‖Δ_k‖^p = ∑_{k=1}^K (|C_k|/N) ‖u_k − d̄_k‖^p ≤ ε^p.
Since the constraints are satisfied, the constructed v_1, …, v_N are a valid solution for the problem defining ḡ^{N∗}(x), i.e., (MRO-K) with K = N and without support constraints. We note that these v_i's are also in the domain of g, given that the uncertain data D_N is in the domain of g. For monotonically increasing functions g (e.g., log(u), 1/(1 + u)), we must have Δ_k = u_k − d̄_k ≥ 0 in the solution of (MRO-K), as the maximization of g over u_k will lead to u_k ≥ d̄_k. Therefore, v_i = d_i + Δ_k is also in the domain, as the L-smooth and concave function g with only a potential lower bound will not have holes in its domain above the lower bound. For monotonically decreasing functions g, the same logic applies with a nonpositive Δ_k. We now make use of the convexity and L-smoothness [2, Theorem 5.8] of −g: ∀v_1, v_2 ∈ S, λ ∈ [0, 1],
g(λ v_1 + (1 − λ) v_2) ≤ λ g(v_1) + (1 − λ) g(v_2) + (L/2) λ(1 − λ) ‖v_1 − v_2‖_2^2,

g( (1/|C_k|) v_1 + ((|C_k| − 1)/|C_k|) v̄_2 ) ≤ (1/|C_k|) g(v_1) + ((|C_k| − 1)/|C_k|) g(v̄_2) + (L/2)(1/|C_k|)((|C_k| − 1)/|C_k|) ‖v_1 − v̄_2‖_2^2,

where v̄_2 = (1/(|C_k| − 1)) ∑_{i∈C_k, i≠1} v_i. Note that v_1 − v̄_2 = d_1 − (1/(|C_k| − 1)) ∑_{i∈C_k, i≠1} d_i, as they share the same Δ_k. The next iteration will be applied to g(v̄_2), and so on. For each cluster k, this results in:

g( (1/|C_k|) ∑_{i∈C_k} v_i, x ) ≤ (1/|C_k|) ∑_{i∈C_k} g(v_i, x) + (L/(2|C_k|)) ∑_{i=2}^{|C_k|} ((i − 1)/i) ‖d_i − ∑_{j=1}^{i−1} d_j/(i − 1)‖_2^2
g(d̄_k + Δ_k, x) ≤ (1/|C_k|) ∑_{i∈C_k} g(v_i, x) + (L/(2|C_k|)) ∑_{i∈C_k} ‖d_i − d̄_k‖_2^2
g(u_k, x) ≤ (1/|C_k|) ∑_{i∈C_k} g(v_i, x) + (L/(2|C_k|)) ∑_{i∈C_k} ‖d_i − d̄_k‖_2^2,
where we used the equivalence ∑_{i=2}^{|C_k|} ((i − 1)/i) ‖d_i − ∑_{j=1}^{i−1} d_j/(i − 1)‖_2^2 = ∑_{i∈C_k} ‖d_i − d̄_k‖_2^2. Now, summing over all clusters, we have

∑_{k=1}^K (|C_k|/N) g(u_k, x) ≤ (1/N) ∑_{i=1}^N g(v_i, x) + (L/2) D(K).

Since this holds for any feasible solution of (MRO-K) with K < N, we must have ḡ^K(x) ≤ ḡ^{N∗}(x) + (L/2) D(K).
E Proof of Theorem 5
Proof of the lower bound. We use the dual formulations of the MRO constraints. For the case p ≥ 1, we first solve (MRO-K-Dual) with K < N to obtain dual variables z_{jk}, γ_{jk}. For each data label i in cluster C_k, for all clusters k = 1, …, K, and for all pieces j = 1, …, J, if we set the dual variables of data-point i equal to those of its assigned cluster (i.e., z_{ji} = z_{jk} and γ_{ji} = γ_{jk}), we obtain a feasible solution whose objective differs from that of the K-cluster solution by

δ(K, z, γ) = (1/N) ∑_{i=1}^N s_i − ∑_{k=1}^K (|C_k|/N) s_k = (1/N) ∑_{k=1}^K ∑_{i∈C_k} max_{j≤J} {(−z_{jk} − H^T γ_{jk})^T (d_i − d̄_k)}.

We note that δ(K, z, γ) ≥ ∑_{k=1}^K (|C_k|/N)(−z_{1k} − H^T γ_{1k})^T (1/|C_k|) ∑_{i∈C_k} (d_i − d̄_k) = ∑_{k=1}^K (|C_k|/N)(−z_{1k} − H^T γ_{1k})^T 0 = 0. The constructed feasible solution for (MRO-K-Dual) with K = N gives an upper bound on its optimal value, since it is a minimization problem. We then have ḡ^N(x) − ḡ^K(x) ≤ δ(K, z, γ), which translates to ḡ^N(x) − δ(K, z, γ) ≤ ḡ^K(x).
Now, for p = ∞, the same procedure can be applied; we obtain a solution to (MRO-K-Dual-∞) with K < N, and construct a feasible solution for (MRO-K-Dual-∞) with K = N. We modify the variables in the same manner, with the exception of λ, which is now set by λ_i = λ_k. The definition of δ(K, z, γ) remains unchanged, so the same result follows.
Proof of the upper bound. We use the primal formulations of the MRO constraints.
Similar to the single-concave proof, for p ≥ 1, we first solve the MRO problem with
K clusters to obtain a feasible solution u 11 , . . . , u J K , α11 , . . . , α J K . Note that the
existence of α removes the need to analyze which function g j attains the maximum
for each cluster. We then set Δ_{jk} = u_{jk} − d̄_k for each k ≤ K, and set v_{ji} = d_i + Δ_{jk}, α_{ji} = α_{jk}/|C_k| ∀i ∈ C_k, k = 1, …, K. These satisfy the constraint of the problem with N clusters and without support constraints, as

∑_{i=1}^N ∑_{j=1}^J α_{ji} ‖v_{ji} − d_i‖^p = ∑_{k=1}^K ∑_{j=1}^J ∑_{i∈C_k} α_{ji} ‖Δ_{jk}‖^p = ∑_{k=1}^K ∑_{j=1}^J α_{jk} ‖u_{jk} − d̄_k‖^p ≤ ε^p.

For p = ∞, we repeat the process of obtaining and modifying a feasible solution, and observe, for i = 1, …, N,

∑_{j=1}^J (α_{ji}/(1/N)) ‖v_{ji} − d_i‖ = ∑_{j=1}^J (α_{jk}/(|C_k|/N)) ‖Δ_{jk}‖ = ∑_{j=1}^J (α_{jk}/w_k) ‖u_{jk} − d̄_k‖ ≤ ε.
Then, for the objective values, we have

∑_{k=1}^K ∑_{j=1}^J α_{jk} g_j(u_{jk}, x) ≤ ∑_{k=1}^K ∑_{j=1}^J ∑_{i∈C_k} α_{ji} g_j(v_{ji}, x) + ∑_{k=1}^K ∑_{j=1}^J ∑_{i∈C_k} α_{ji} (max_{j≤J} L_j/2) ‖d_i − d̄_k‖_2^2
 ≤ ∑_{i=1}^N ∑_{j=1}^J α_{ji} g_j(v_{ji}, x) + max_{j≤J} (L_j/(2N)) ∑_{i=1}^N ‖d_i − d̄_k‖_2^2.
Since this holds for all feasible solutions of the problem with K clusters, we conclude that ḡ^K(x) ≤ ḡ^{N∗}(x) + max_{j≤J}(L_j/2) D(K).
Acknowledgements The simulations presented in this article were performed on computational resources
managed and supported by Princeton Research Computing, a consortium of groups including the Princeton
Institute for Computational Science and Engineering (PICSciE) and the Office of Information Technology’s
High Performance Computing Center and Visualization Laboratory at Princeton University. We would like
to thank Daniel Kuhn for the useful feedback and for pointing us to the literature on scenario reduction
techniques.
Declarations
Conflict of interest The authors have no relevant financial or non-financial interests to disclose.
References
1. Bandi, C., Bertsimas, D.: Tractable stochastic analysis in high dimensions via robust optimization.
Math. Program. 134(1), 23–70 (2012)
2. Beck, A.: First-Order Methods in Optimization. SIAM-Society for Industrial and Applied Mathematics,
Philadelphia (2017)
3. Ben-Tal, A., El Ghaoui, L., Nemirovski, A.: Robust Optimization. Princeton University Press, Princeton
(2009)
4. Ben-Tal, A., den Hertog, D., Vial, J.P.: Deriving robust counterparts of nonlinear uncertain inequalities.
Math. Program. 149(1–2), 265–299 (2015)
5. Ben-Tal, A., Nemirovski, A.: Robust solutions of Linear Programming problems contaminated with
uncertain data. Math. Program. 88(3), 411–424 (2000)
6. Ben-Tal, A., Nemirovski, A.: Selected topics in robust convex optimization. Math. Program. 112,
125–158 (2008)
7. Beraldi, P., Bruni, M.E.: A clustering approach for scenario tree reduction: an application to a stochastic
programming portfolio optimization problem. TOP 22(3), 934–949 (2014)
8. Bertsimas, D., Brown, D.B., Caramanis, C.: Theory and applications of robust optimization. SIAM
Rev. 53(3), 464–501 (2011)
9. Bertsimas, D., den Hertog, D., Pauphilet, J.: Probabilistic guarantees in robust optimization. SIAM J.
Optim. 31(4), 2893–2920 (2021)
10. Bertsimas, D., Gupta, V., Kallus, N.: Data-driven robust optimization. Math. Program. 167(2), 235–292
(2018)
11. Bertsimas, D., den Hertog, D.: Robust and Adaptive Optimization. Dynamic Ideas, Belmont (2022)
12. Bertsimas, D., Mundru, N.: Optimization-based scenario reduction for data-driven two-stage stochastic
optimization. Oper. Res. (2022)
13. Bertsimas, D., Sim, M.: The price of robustness. Oper. Res. 52(1), 35–53 (2004)
14. Bertsimas, D., Sturt, B., Shtern, S.: A data-driven approach to multistage stochastic linear optimization.
Manag. Sci. (2022)
15. Bradley, B.O., Taqqu, M.S.: Financial Risk and Heavy Tails, Handbooks in Finance, vol. 1. North-
Holland, Amsterdam (2003)
16. Carmona, R.A.: Rsafd: Statistical Analysis of Financial Data in R (2020). R package version 1.2
17. Chen, L.: Clustering of sample average approximation for stochastic program (2015)
18. Chen, R., Paschalidis, I.: Distributionally robust learning. Found. Trends Optim. 4(1–2), 1–243 (2020)
19. Chen, Z., Sim, M., Xiong, P.: Robust stochastic optimization made easy with RSOME. Manag. Sci.
66(8), 3329–3339 (2020)
20. Clement, P., Desch, W.: An elementary proof of the triangle inequality for the Wasserstein metric.
Proc. Am. Math. Soc. 136(1), 333–339 (2008)
21. Delage, E., Ye, Y.: Distributionally robust optimization under moment uncertainty with application to
data-driven problems. Oper. Res. 58(3), 595–612 (2010)
22. Dupačová, J., Gröwe-Kuska, N., Römisch, W.: Scenario reduction in stochastic programming. Math.
Program. 95(3), 493–511 (2003)
23. Emelogu, A., Chowdhury, S., Marufuzzaman, M., Bian, L., Eksioglu, B.: An enhanced sample average
approximation method for stochastic optimization. Int. J. Prod. Econ. 182, 230–252 (2016)
24. Esfahani, P.M., Kuhn, D.: Data-driven distributionally robust optimization using the Wasserstein met-
ric: performance guarantees and tractable reformulations. Math. Program. 171, 115–166 (2018)
25. Esteban, P.A., Morales, J.M.: Partition-based distributionally robust optimization via optimal transport
with order cone constraints. 4OR 20(3), 465–497 (2022)
26. Fabiani, F., Goulart, P.: The optimal transport paradigm enables data compression in data-driven robust
control. In: 2021 American Control Conference (ACC), pp. 2412–2417 (2021)
27. Fournier, N., Guillin, A.: On the rate of convergence in Wasserstein distance of the empirical measure.
Probab. Theory Relat. Fields 162(3), 707–738 (2015)
28. Gao, R.: Finite-sample guarantees for Wasserstein distributionally robust optimization: Breaking the
curse of dimensionality. CoRR (2020)
29. Gao, R., Kleywegt, A.: Distributionally robust stochastic optimization with Wasserstein distance. Math.
Oper. Res. 48, 603–655 (2023)
30. Givens, C.R., Shortt, R.M.: A class of Wasserstein metrics for probability distributions. Mich. Math.
J. 31(2), 231–240 (1984)
31. Goh, J., Sim, M.: Distributionally robust optimization and its tractable approximations. Oper. Res.
58(4–part–1), 902–917 (2010)
32. Hartigan, J.A., Wong, M.A.: Algorithm AS 136: A k-means clustering algorithm. J. R. Stat. Soc. Ser.
C (Appl. Stat.) 28(1), 100–108 (1979)
33. Holmberg, K., Rönnqvist, M., Yuan, D.: An exact algorithm for the capacitated facility location prob-
lems with single sourcing. Eur. J. Oper. Res. 113(3), 544–559 (1999)
34. Jacobson, D., Hassan, M., Dong, Z.S.: Exploring the effect of clustering algorithms on sample average
approximation. In: 2021 Institute of Industrial and Systems Engineers (IISE) Annual Conference &
Expo (2021)
35. Kuhn, D., Esfahani, P.M., Nguyen, V., Shafieezadeh-Abadeh, S.: Wasserstein Distributionally Robust
Optimization: Theory and Applications in Machine Learning, pp. 130–166 (2019)
36. Liu, Y., Yuan, X., Zhang, J.: Discrete approximation scheme in distributionally robust optimization.
Numer. Math. Theory Methods Appl. 14(2), 285–320 (2021)
37. MOSEK ApS: The MOSEK optimization toolbox. Version 9.3. (2022)
38. Neumann, J.V.: Zur theorie der gesellschaftsspiele. Math. Ann. 100, 295–320 (1928)
39. Perakis, G., Sim, M., Tang, Q., Xiong, P.: Robust pricing and production with information partitioning
and adaptation. Manag. Sci. (2023)
40. Rockafellar, R.T., Uryasev, S.: Conditional value-at-risk for general loss distributions. J. Bank. Finance
26(7), 1443–1471 (2002)
41. Rockafellar, R.T., Wets, R.J.: Variational analysis. Grundlehren der mathematischen Wissenschaften
(1998)
42. Roos, E., den Hertog, D.: Reducing conservatism in robust optimization. INFORMS J. Comput. 32(4),
1109–1127 (2020)
43. Rujeerapaiboon, N., Schindler, K., Kuhn, D., Wiesemann, W.: Scenario reduction revisited: funda-
mental limits and guarantees. Math. Program. 191(1), 207–242 (2022)
44. Thorndike, R.: Who belongs in the family? Psychometrika 18(4), 267–276 (1953)
45. Trillos, N., Slepčev, D.: On the rate of convergence of empirical measures in ∞-transportation distance.
Can. J. Math. 67, 1358–1383 (2014)
46. Uryasev, S., Rockafellar, R.T.: Conditional Value-at-Risk: Optimization Approach. Springer, New York
(2001)
47. Wang, I., Becker, C., Van Parys, B., Stellato, B.: Mean robust optimization. arXiv (2023)
48. Wang, Z., Wang, P., Ye, Y.: Likelihood robust optimization for data-driven problems. CMS 13(2),
241–261 (2016)
49. Wiesemann, W., Kuhn, D., Sim, M.: Distributionally robust convex optimization. Oper. Res. 62(6),
1358–1376 (2014)
50. Xu, H., Caramanis, C., Mannor, S.: A distributional interpretation of robust optimization. Math. Oper.
Res. 37(1), 95–110 (2012)
51. Zhen, J., Kuhn, D., Wiesemann, W.: A unified theory of robust and distributionally robust optimization
via the primal-worst-equals-dual-best principle (2021). https://doi.org/10.1287/opre.2021.0268
52. Zymler, S., Kuhn, D., Rustem, B.: Distributionally robust joint chance constraints with second-order
moment information. Math. Program. 137(1), 167–198 (2013)
Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps
and institutional affiliations.
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under
a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted
manuscript version of this article is solely governed by the terms of such publishing agreement and applicable
law.