Mathematical Programming

https://doi.org/10.1007/s10107-024-02170-4

FULL LENGTH PAPER


Series A

Mean robust optimization

Irina Wang1 · Cole Becker2 · Bart Van Parys3 · Bartolomeo Stellato1

Received: 27 December 2023 / Accepted: 1 November 2024


© Springer-Verlag GmbH Germany, part of Springer Nature and Mathematical Optimization Society 2024

Abstract
Robust optimization is a tractable and expressive technique for decision-making
under uncertainty, but it can lead to overly conservative decisions when pessimistic
assumptions are made on the uncertain parameters. Wasserstein distributionally robust
optimization can reduce conservatism by being data-driven, but it often leads to very
large problems with prohibitive solution times. We introduce mean robust optimiza-
tion, a general framework that combines the best of both worlds by providing a trade-off
between computational effort and conservatism. We propose uncertainty sets con-
structed based on clustered data rather than on observed data points directly, thereby
significantly reducing problem size. By varying the number of clusters, our method
bridges between robust and Wasserstein distributionally robust optimization. We show
finite-sample performance guarantees and explicitly control the potential additional
pessimism introduced by any clustering procedure. In addition, we prove conditions
for which, when the uncertainty enters linearly in the constraints, clustering does not
affect the optimal solution. We illustrate the efficiency and performance preservation
of our method on several numerical examples, obtaining multiple orders of magnitude
speedups in solution time with little-to-no effect on the solution quality.

Keywords Robust optimization · Distributionally robust optimization · Data-driven


optimization · Machine learning · Clustering · Probabilistic guarantees

Irina Wang and Bartolomeo Stellato are supported by the NSF CAREER Award ECCS 2239771.

Bartolomeo Stellato (corresponding author)
[email protected]
Irina Wang
[email protected]
Cole Becker
[email protected]
Bart Van Parys
[email protected]

1 Department of Operations Research and Financial Engineering, Princeton University, Princeton, USA
2 Operations Research Center, Massachusetts Institute of Technology, Cambridge, USA
3 Stochastics Group, Centrum Wiskunde en Informatica, Amsterdam, The Netherlands


Mathematics Subject Classification 90C17 · 90C25 · 90C46

1 Introduction

Robust optimization (RO) and distributionally robust optimization (DRO) are popu-
lar tools for decision-making under uncertainty due to their high expressiveness and
versatility. The main idea of RO is to define an uncertainty set and to minimize the
worst-case cost across possible uncertainty realizations in that set. However, while
RO often leads to tractable formulations, it can be overly-conservative [42]. To reduce
conservatism, DRO takes a probabilistic approach, by modeling the uncertainty as
a random variable following a probability distribution known only to belong to an
uncertainty set (also called ambiguity set) of distributions. In both RO and DRO, the
choice of the uncertainty or ambiguity set can greatly influence the quality of the solu-
tion. Good-quality uncertainty sets can lead to excellent practical performance while
ill-chosen sets can lead to overly-conservative actions and intractable computations.
Traditional approaches design uncertainty sets based on theoretical assumptions
on the uncertainty distributions [1, 3, 5, 13]. While these methods have been quite
successful, they rely on a priori assumptions that are difficult to verify in practice. On
the other hand, the last decade has seen an explosion in the availability of data, which
has brought a shift in focus from a priori assumptions on the probability distributions
to data-driven methods in operations research and decision sciences. In RO and DRO,
this new paradigm has fostered data-driven methods where uncertainty sets are shaped
directly from data [10]. In data-driven DRO, a popular choice of the ambiguity set is
the ball of distributions whose Wasserstein distance to a nominal distribution is at most
ε > 0 [24, 28, 29, 35]. When the reference distribution is an empirical distribution,
the associated Wasserstein DRO can be formulated as a convex minimization problem
where the number of constraints grows linearly with the number of data-points [24].
While less conservative than RO, data-driven DRO can lead to very large formulations
that are intractable, especially in mixed-integer optimization (MIO).
A common idea to reduce the dimensionality of data-driven decision-making prob-
lems is to use clustering techniques from machine learning. While clustering has
recently appeared in various works within the stochastic programming literature [7,
17, 23, 34], the focus has been on the improvement of and comparisons to the sample
average approximation (SAA) approach and not in a distributionally robust sense. In
contrast, recent approaches in the DRO literature cluster data into partitions and either
build moment-based uncertainty sets for each partition [19, 39], or enrich Wasserstein
DRO formulations with partition-specific information (e.g., relative weights) [25].
While these approaches are promising, clustering is still used as a pre-processing
heuristic on the data-sets in DRO, without a clear understanding of how it affects
the conservatism of the optimal solutions. In particular, choosing the right clustering
parameters to carefully balance computational tractability and out-of-sample perfor-
mance is still an unsolved challenge.


1.1 Our contributions

We present mean robust optimization (MRO), a data-driven method that, via machine
learning clustering, bridges between RO and Wasserstein DRO.
– We design the uncertainty set for RO as a ball around clustered data. With-
out clustering, our formulation corresponds to the finite convex reformulation
in Wasserstein DRO. With one cluster, our formulation corresponds to the clas-
sical RO approach. The number of clusters is a tunable parameter that provides
a tradeoff between the worst-case objective value and computational efficiency,
which includes both speed and memory usage.
– We provide probabilistic guarantees of constraint satisfaction for our method,
based on the quality of the clustering procedure.
– We derive bounds on the effect of clustering in case of constraints with concave
and maximum-of-concave dependency on the uncertainty. In addition, we show
that, when constraints are linearly affected by the uncertainty, clustering does not
affect the solution nor the probabilistic guarantees.
– We show on various numerical examples that, thanks to our clustering proce-
dure, our approach provides multiple orders of magnitude speedups over classical
approaches while guaranteeing the same probability of constraint satisfaction. The
code to reproduce our results is available at https://github.com/stellatogrp/mro_experiments.

1.2 Related work

Robust optimization. RO deals with decision-making problems where some of the


parameters are subject to uncertainty. The idea is to restrict data perturbations to a
deterministic uncertainty set, then optimize the worst-case performance across all
realizations of this uncertainty. For a detailed overview of RO, we refer to the survey
papers by Ben-Tal and Nemirovski [6] and Bertsimas et al. [8], as well as the books by
Ben-Tal et al. [3] and Bertsimas and den Hertog [11]. These approaches, while pow-
erful, may be overly-conservative, and there exists a tradeoff between conservatism
and constraint violation [42].
Distributionally robust optimization. DRO minimizes the worst-case expected loss
over a probabilistic ambiguity set characterized by certain known properties of the
true data-generating distribution. Based on the type of ambiguity set, existing liter-
ature on DRO can roughly be defined in two categories. Ambiguity sets of the first
type contain all distributions that satisfy certain moment constraints [21, 31, 49, 52].
In many cases such ambiguity sets possess a tractable formulation, but have also been
criticized for yielding overly conservative solutions [48]. Ambiguity sets of the second
type enjoy the interpretation of a ball of distributions around a nominal distribution,
often the empirical distribution on the observed samples. Wasserstein uncertainty sets
are one particular example [24, 28, 29, 35] and enjoy both a tractable primal as well
as a tractable dual formulation. We refer to the work by Chen and Paschalidis [18]
for a thorough overview of DRO, and to the work by Zhen et al. [51] for a gen-
eral theory on convex dual reformulations. When the ambiguity set is well chosen,


DRO formulations enjoy strong out-of-sample statistical performance guarantees. As


these statistical guarantees are typically not very sharp, in practice the radius of the
uncertainty set is chosen through time consuming cross-validation [28]. At the same
time, DRO has the downside of being more computationally expensive than traditional
robust approaches. We observe for instance that the number of constraints in Wasser-
stein DRO formulations scales linearly with the number of samples, which can become
practically prohibitive especially when integer variables are involved. Our proposed
method addresses this problem by reducing the number of constraints through cluster-
ing. While many works have recently emerged on the construction of DRO ambiguity
sets through the partitioning of data [19, 25, 39], or the discretization of the underly-
ing distribution [36], there still exists a gap in the literature. In particular, theoretical
bounds on the change in problem performance as affected by the number of clusters,
as well as by the quality of the cluster assignment, remain largely unexplored. In this
work, we fill the gap by providing such insights.

Data-driven robust optimization. Data-driven optimization has been well-studied, with


various techniques to learn the unknown data-generating distribution before formulat-
ing the uncertainty set. Bertsimas et al. [10] construct the ambiguity set as a confidence
region for the unknown data-generating distribution P using several statistical hypoth-
esis tests. By pairing a priori assumptions on P with different statistical tests, they
obtain various data-driven uncertainty sets, each with its own geometric shape, com-
putational properties, and modeling power. We, however, use machine learning in the
form of clustering algorithms to preserve the geometric shape of the dataset, without
explicitly learning and parametrizing the unknown distribution.

Distributionally robust optimization as a robust program. Gao and Kleywegt [29]


consider a robust formulation of Wasserstein DRO similar to our mean robust opti-
mization, but without the idea of dataset reduction. Given N samples and a positive
integer K , they introduce an approximation of Wasserstein DRO by defining a new
ambiguity set as a subset of the standard Wasserstein DRO set, containing all distri-
butions supported on N K points with equal probability 1/(N K ), as opposed to the
standard set supported on N points. In this work, however, we study how to reduce
the number of variables and constraints.

Robust optimization as a distributionally robust optimization program. Xu et al. [50]


take inspiration from sample-based optimization problems to investigate probabilis-
tic interpretations of RO. They generalize the ideas of Delage and Ye [21], that the
solution to a robust optimization problem is the solution to a special DRO problem,
where the distributional set contains all distributions whose support is contained in the
uncertainty set. In a related vein, Bertsimas et al. [14] show that, under a particular
construction of the uncertainty sets, multi-stage stochastic linear optimization can be
interpreted as Wasserstein-∞ DRO. We establish a similar equivalence between RO
and DRO, focusing especially on Wasserstein- p ambiguity sets for all p. We develop
an easily interpretable construction of the primal constraints and uncertainty sets, and
prove that p = ∞ is a limiting case of p ≥ 1. This provides a natural extension of the
equivalence proved in [14, Proposition 3].


Probabilistic guarantees in robust and distributionally optimization. Bertsimas et al.


[9] propose a disciplined methodology for deriving probabilistic guarantees for solu-
tions of robust optimization problems with specific uncertainty sets and objective
functions. They derive an a posteriori guarantee to compensate for the conservatism of
a priori uncertainty bounds. Esfahani and Kuhn [24] obtain finite-sample guarantees
for Wasserstein DRO for selecting the radius ε of order N^{−1/max{2,m}}, where N is
the number of samples and m is the dimension of the problem data, while Gao [28]
derives finite-sample guarantees for Wasserstein DRO for selecting ε of order N^{−1/2}
under specific assumptions. We provide theoretical results of a similar vein, with a
slightly increased ε to compensate for information lost through clustering and achieve
the same probabilistic guarantees. Our theoretical guarantees hold for the Wasserstein-p
distance for all p ≥ 1 and p = ∞, and are independent of the uncertain function to
minimize. These bounds, however, following the literature, are theoretical in nature
and not tight in practice, typically resulting in a conservative ε. The final ε values are
usually chosen through empirical experimentation, in which case our formulation,
by being lower dimensional, is overall much faster to solve.
Clustering in stochastic optimization. Clustering in stochastic optimization is closely
related to the idea of scenario reduction. First introduced by Dupačová et al. [22],
scenario reduction seeks to approximate, with respect to a probability metric, an N -
point distribution with a distribution with a smaller number of points. In particular,
Rujeerapaiboon et al. [43] analyze worst-case bounds on the scenario reduction
approximation error with respect to the Wasserstein metric, for initial distributions
constrained to a unit ball. They provide constant-factor approximation algorithms for
K-medians and K-means clustering [32]. Later, Bertsimas and Mundru [12] apply
this idea to two-stage stochastic optimization problems, and provide an alternating-
minimization method for finding optimal reduced scenarios under the modified
objective. They also provide performance bounds on the stochastic optimization prob-
lem for different scenarios. Jacobson et al. [34], Emelogu et al. [23], Beraldi et al.
[7], and Chen [17] apply a similar idea of clustering to reduce the sample/scenario
size, then compare the results against the classical SAA approach where the sample
size is not reduced. In MRO, we adapt and extend the scenario reduction approach to
Wasserstein DRO, where upon fixing the reduced scenario points to ones found by the
clustering algorithm, we allow for variation around these reduced points. We then pro-
vide performance bounds on the DRO problem depending on the number of clusters.
Data compression in data-driven problems. Fabiani and Goulart [26] compress data for
robust control problems by minimizing the Wasserstein-1 distance between the origi-
nal and compressed datasets, and observe a slight loss in performance in exchange for
reduced computation time. While related, this is orthogonal to our approach of using
machine learning clustering to reduce the dataset, where we include results and theoret-
ical bounds for a more general set of robust optimization problems with Wasserstein- p
distance, and demonstrate conditions under which no performance loss is necessary.


2 Mean robust optimization

2.1 The problem

We consider an uncertain constraint of the form,


g(u, x) ≤ 0, (1)
where x ∈ X ⊆ Rn is the optimization variable and X is a compact set, u ∈ Rm
is an uncertain parameter, and −g(u, x) is proper, convex, and lower-semicontinuous
in u for all x. Throughout this paper, we assume the support S of u to live within the
domain of g for the variable u, which we will refer to as domu g, i.e., S ⊆ domu g.
We assume domu g is independent of x, and that the following assumption holds.
Assumption 1 The domain domu g is Rm . Otherwise, g is either element-wise mono-
tonically increasing in u and only has a (potentially) lower-bounded domain, or
element-wise monotonically decreasing in u and only has a (potentially) upper-
bounded domain.
This assumption on the domain and monotonicity of g is very common in practice as it
is satisfied by linear and quadratic functions, as well as other common functions (e.g.,
log(u), and 1/(1 + u)). In Sect. 2.4, we extend our results to the case where g is the maximum
of concave functions, each satisfying the aforementioned conditions.
The RO approach defines an uncertainty set U ⊆ Rm and forms the robust coun-
terpart as g(u, x) ≤ 0, ∀u ∈ U, where the uncertainty set is chosen so that for
any solution x, the above holds with a certain probability. We define this in terms of
expectation,
EP (g(u, x)) ≤ 0,
where P is the unknown distribution of the uncertainty u.
Risk measures. Expectation constraints can represent popular risk measures, and can
imply constraints commonly used in chance-constrained programming (CCP). In CCP,
the probabilistic constraint considered is P(g(u, x) ≤ 0) ≥ 1 − α, which corresponds
to the value at risk being nonpositive, i.e.,
VaR(g(u, x), α) = inf{γ | P(g(u, x) ≤ γ ) ≥ 1 − α} ≤ 0.
Unfortunately, except in very special cases, the value at risk function is intractable [46].
A tractable approximation of the value at risk is the conditional value at risk [40,
46], defined as CVaR(g(u, x), α) = inf τ {E(τ + (1/α)(g(u, x) − τ )+ )}, where
(a)+ = max{a, 0}. This expression can be modeled through our approach, by writing
CVaR(g(u, x), α) = inf τ {E(ĝ(u, x, τ ))}, where ĝ(u, x, τ ) = τ + (1/α)(g(u, x) −
τ )+ is the maximum of concave functions, which we study in Sects. 2.4, 6.2, and 6.3.
It is well known from [46] that the relationship between these probabilistic guarantees
of constraint satisfaction is
CVaR(g(u, x), α) ≤ 0 ⇒ VaR(g(u, x), α) ≤ 0 ⇐⇒ P(g(u, x) ≤ 0) ≥ 1 − α.
Therefore, our expectation constraint implies common chance constraints.
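As a small illustration (ours, not from the paper), the CVaR integrand τ + (1/α)(g(u, x) − τ)₊ and the two-piece maximum ĝ(u, x, τ) described above coincide pointwise, which is easy to verify numerically; the sampled values of g(u, x) below are placeholders.

```python
import numpy as np

# Placeholder samples of g(u, x) for a fixed x; any values work for this identity check.
rng = np.random.default_rng(0)
g_vals = rng.normal(size=1000)
alpha, tau = 0.2, 0.5

hinge_form = tau + (1.0 / alpha) * np.maximum(g_vals - tau, 0.0)        # tau + (1/alpha)(g - tau)_+
max_of_concave = np.maximum(tau + (1.0 / alpha) * (g_vals - tau), tau)  # max of the two affine pieces

assert np.allclose(hinge_form, max_of_concave)  # the two expressions agree pointwise
print("empirical CVaR-type objective at this tau:", hinge_form.mean())
```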


Finite-sample guarantees. In data-driven optimization, while P is unknown, it is par-


tially observable through a finite set of N independent samples of the random vector
u. We denote this training dataset by D N = {di }i≤N ⊆ S, and note that it is gov-
erned by P N , the product distribution supported on S N . A data-driven solution of a
robust optimization problem is a feasible decision x̂ N ∈ Rn found using the data-
driven uncertainty set U, which in turn is constructed by the training dataset D N .
Specifically, the feasible decision and data-driven uncertainty set U must imply the
probabilistic guarantee

P^N( E_P(g(u, x̂_N)) ≤ 0 ) ≥ 1 − β,    (2)

where β > 0 is the specified probability of constraint violation. From now on, the
probabilistic guarantees of constraint satisfaction refers to (2).

2.2 Our approach

To meet the probabilistic guarantees outlined above, we construct x̂_N to satisfy particular
constraints, with respect to a particular uncertainty set.
Case p ≥ 1. In the case where p ≥ 1, the set we consider takes the form

\[
\mathcal{U}(K, \epsilon) = \Bigl\{ u = (v_1, \dots, v_K) \in \mathcal{S}^K \;\Big|\; \sum_{k=1}^{K} w_k \| v_k - \bar{d}_k \|^p \le \epsilon^p \Bigr\},
\]

where we partition D_N into K disjoint subsets C_k, and d̄_k is the centroid of the k-th
subset, for k = 1, . . . , K. The weight w_k > 0 of each subset is equal to
the proportion of points in the subset, i.e., w_k = |C_k|/N. We choose p to be an
integer exponent, and ε will be chosen depending on the other parameters to ensure
satisfaction of the probability guarantee (2). When p = 2 and S = R^m, the set is
an ellipsoid in R^{Km} with center formed by stacking all d̄_k into a single
vector in R^{Km}. When we additionally have K = N or K = 1, this ellipsoid
becomes a ball in R^{Nm} or R^m, respectively.
Case p = ∞. In the case where p = ∞, the set we consider becomes

\[
\mathcal{U}(K, \epsilon) = \Bigl\{ u = (v_1, \dots, v_K) \in \mathcal{S}^K \;\Big|\; \max_{k=1,\dots,K} \| v_k - \bar{d}_k \| \le \epsilon \Bigr\},
\]

where the constraints for individual v_k become decoupled. This decoupling follows the
result for the Wasserstein type p = ∞ metric [30, Equation 2], as our uncertainty set is
analogous to the set of all distributions within Wasserstein-∞ distance of d̄. Note that, if
any of the decoupled constraints is violated, then lim_{p→∞} Σ_{k=1}^K w_k ||v_k − d̄_k||^p ≥ ε^p,
and the summation constraint is violated.
For both cases, when K = 1, we have a simple uncertainty set: U(1, ε) =
{v ∈ S | ||v − d̄|| ≤ ε}, a ball of radius ε around the empirical mean of the dataset.
This is equivalent to the uncertainty set of traditional RO, as it is of the same dimension


m as the uncertain parameter. In addition, when K = N and w_k = 1/N, both cases
resemble the ambiguity sets of Wasserstein-p DRO.
Having defined the uncertainty set, we now introduce the constraint

\[
\bar{g}(u, x) = \sum_{k=1}^{K} w_k\, g(v_k, x), \tag{3}
\]

where g is defined in the original constraint (1) and w_k are the weights from above.
Subsequently, x̂_N is the solution to the robust optimization problem

\[
\begin{array}{ll}
\text{minimize} & f(x) \\
\text{subject to} & \bar{g}(u, x) \le 0 \quad \forall u \in \mathcal{U}(K, \epsilon),
\end{array} \tag{MRO}
\]

where f is the objective function. We call this problem the mean robust optimization
(MRO) problem. Given the problem data, we formulate the uncertainty set from clustered
data using machine learning, with K and ε chosen experimentally.
Then, we solve the MRO problem to arrive at a data-driven solution x̂_N which satisfies
the probabilistic guarantee (2).
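In practice, the centroids d̄_k and weights w_k = |C_k|/N that define U(K, ε) can be computed with any off-the-shelf clustering routine. The following sketch uses scikit-learn's KMeans as one possible choice; it illustrates the construction above and is not necessarily the implementation in the paper's repository.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_data(D, K, seed=0):
    """Cluster an (N, m) dataset; return centroids dbar_k and weights w_k = |C_k| / N."""
    N = D.shape[0]
    km = KMeans(n_clusters=K, n_init=10, random_state=seed).fit(D)
    centroids = km.cluster_centers_                 # shape (K, m)
    weights = np.bincount(km.labels_, minlength=K) / N
    return centroids, weights

# Illustrative synthetic data only.
D = np.random.default_rng(1).normal(size=(120, 5))
dbar, w = cluster_data(D, K=4)
```

With K = N every point becomes its own centroid with weight 1/N (the Wasserstein DRO case), while K = 1 reduces to the sample mean with weight one (the classical RO case).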

2.3 Solving the robust problem

We solve the MRO problem using a direct convex reformulation, following usual
techniques for RO problems in existing literature [3, 11, 35], with adaptations made
for the MRO setup, as well as a reformulation derived for the case p = ∞. We include
simple examples for completeness.
Case p ≥ 1. In the case where p ≥ 1, the MRO problem can be rewritten as

\[
\begin{array}{ll}
\text{minimize} & f(x) \\
\text{subject to} & \left\{
\begin{array}{ll}
\underset{v_1,\dots,v_K \in \mathcal{S}}{\text{maximize}} & \sum_{k=1}^{K} w_k\, g(v_k, x) \\
\text{subject to} & \sum_{k=1}^{K} w_k \| v_k - \bar{d}_k \|^p \le \epsilon^p
\end{array}
\right\} \le 0,
\end{array} \tag{4}
\]

which, by dualizing the inner maximization problem, is reformulated as:

\[
\begin{array}{ll}
\text{minimize} & f(x) \\
\text{subject to} & \sum_{k=1}^{K} w_k s_k \le 0 \\
& [-g]^*(z_k - y_k, x) + \sigma_{\mathcal{S}}(y_k) - z_k^T \bar{d}_k + \phi(q)\, \lambda \left\| z_k/\lambda \right\|_*^q + \lambda \epsilon^p \le s_k, \quad k = 1, \dots, K \\
& \lambda \ge 0,
\end{array} \tag{5}
\]

with variables λ ∈ R, s_k ∈ R, z_k ∈ R^m, and y_k ∈ R^m. Here, [−g]^*(z, x) =
sup_{u∈dom_u g} { z^T u − [−g(u, x)] } is the conjugate of −g, σ_S(z) = sup_{u∈S} z^T u is the support
function of S ⊆ R^m, ||·||_* is the dual norm of ||·||, and φ(q) = (q − 1)^{(q−1)}/q^q
for q > 1 [35, Theorem 8]. Note that q satisfies 1/p + 1/q = 1, i.e., q = p/(p − 1).
The support function σ_S is also the conjugate of χ_S, which is defined as χ_S(u) = 0
if u ∈ S, and ∞ otherwise. The proof of the derivation and strong duality of the


constraint is delayed to Appendix A. Since the dual of the constraint becomes a mini-
mization problem, any feasible solution that with objective less than or equal to 0 will
satisfy the constraint, so we can remove the minimization to arrive at the above form.
While traditionally we take the supremum instead of maximizing, here the supremum
is always achieved as we assume g to be upper-semicontinuous. For specific examples
of the conjugate forms of different g, see Bertsimas and den Hertog [11, Section 2.5]
and Beck [2, Chapter 4].
When K is set to N, w_k is 1/N, and this is of an analogous form to the convex
reduction of the worst-case problem for Wasserstein DRO, which we will introduce in
Sect. 3.
We observe from [35, Section 2.2, Remark 1] that lim_{q→∞} φ(q) λ ||z_k/λ||_*^q = 0 if
||z_k||_* ≤ λ, and = ∞ otherwise. Therefore, when p = 1 the reformulation becomes

\[
\begin{array}{ll}
\text{minimize} & f(x) \\
\text{subject to} & \sum_{k=1}^{K} w_k s_k \le 0 \\
& [-g]^*(z_k - y_k, x) + \sigma_{\mathcal{S}}(y_k) - z_k^T \bar{d}_k + \lambda \epsilon \le s_k, \quad k = 1, \dots, K \\
& \lambda \ge 0, \quad \|z_k\|_* \le \lambda, \quad k = 1, \dots, K.
\end{array} \tag{6}
\]

Case p = ∞. In the case where p = ∞, the MRO problem can be rewritten as

\[
\begin{array}{ll}
\text{minimize} & f(x) \\
\text{subject to} & \left\{
\begin{array}{ll}
\underset{v_1,\dots,v_K \in \mathcal{S}}{\text{maximize}} & \sum_{k=1}^{K} w_k\, g(v_k, x) \\
\text{subject to} & \| v_k - \bar{d}_k \| \le \epsilon, \quad k = 1, \dots, K
\end{array}
\right\} \le 0,
\end{array} \tag{7}
\]

which has a reformulation where the constraint above is dualized,

\[
\begin{array}{ll}
\text{minimize} & f(x) \\
\text{subject to} & \sum_{k=1}^{K} w_k s_k \le 0 \\
& [-g]^*(z_k - y_k, x) + \sigma_{\mathcal{S}}(y_k) - z_k^T \bar{d}_k + \lambda_k \epsilon \le s_k, \quad k = 1, \dots, K \\
& \|z_k\|_* \le \lambda_k, \quad k = 1, \dots, K,
\end{array} \tag{8}
\]

with variables λ_k ∈ R, s_k ∈ R, z_k ∈ R^m, and y_k ∈ R^m. The proof is delayed to Appendix B.

Remark 1 (Case p = ∞ is the limit of case p ≥ 1) In terms of the primal problem, (7)
is the limiting case of (4) as p → ∞. In terms of the reformulated problem with
dualized constraints, problem (8) is the limiting case of (5). The proof, delayed to
Appendix C, extends the ideas stated in [14, Proposition 3].

Example with affine constraints. Consider a single affine constraint of the form

(a + Pu)T x ≤ b, (9)

where a ∈ Rn , P ∈ Rn×m , and b ∈ R. In other words, g(u, x) = (a + Pu)T x − b,


and the support set is S = Rm . Note that, in this case, yk must be 0 for the support
function σ S (yk ) to be finite. We compute the conjugate as [−g]∗ (z, x) = supu z T u +


(a + Pu)^T x − b = a^T x − b if z + P^T x = 0, and ∞ otherwise. Substituting σ_S(y_k)
and [−g]^*(z_k − y_k, x) into (5), we note that y_k = 0 and z_k = −P^T x, i.e., z_k is
independent of k. By combining the K constraints in (5), we arrive at the form for
general p ≥ 1

\[
\begin{array}{ll}
\text{minimize} & f(x) \\
\text{subject to} & a^T x - b + \phi(q)\,\lambda \left\| P^T x / \lambda \right\|_*^q + \lambda \epsilon^p + (P^T x)^T \sum_{k=1}^{K} w_k \bar{d}_k \le 0 \\
& \lambda \ge 0,
\end{array} \tag{10}
\]

where the number of variables or constraints does not depend on K. Since the vector
Σ_{k=1}^K w_k d̄_k is the average of the data-points in D_N for any K ∈ {1, . . . , N}, this
formulation corresponds to always choosing K = 1. We also note that, for both p = 1
and p = ∞, the reformulation becomes

\[
\begin{array}{ll}
\text{minimize} & f(x) \\
\text{subject to} & a^T x - b + \epsilon \left\| P^T x \right\|_* + (P^T x)^T \sum_{k=1}^{K} w_k \bar{d}_k \le 0,
\end{array} \tag{11}
\]

which can be viewed as the robust counterpart when the uncertainty set is a norm ball
of radius ε centered at (1/N) Σ_{i=1}^N d_i. If d̄ = 0, the constraint can be simplified even
further, obtaining a^T x + ε ||P^T x||_* ≤ b, which corresponds to the robust counterpart
in RO with norm uncertainty sets [11, Section 2.3], [3, Chapter 2].
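As a concrete instance, the robust counterpart (11) with the Euclidean norm (which is its own dual norm) is a single second-order-cone constraint. The CVXPY sketch below encodes it directly; all data and the objective are placeholders for illustration, not the paper's experiments.

```python
import cvxpy as cp
import numpy as np

# Placeholder problem data.
n, m, N = 10, 5, 120
rng = np.random.default_rng(0)
a, P, b = rng.normal(size=n), rng.normal(size=(n, m)), 1.0
data = rng.normal(size=(N, m))
d_bar = data.mean(axis=0)      # equals sum_k w_k * dbar_k for any clustering
eps = 0.1

x = cp.Variable(n)
# Robust counterpart (11): a^T x - b + eps * ||P^T x||_2 + (P^T x)^T d_bar <= 0
robust_con = a @ x - b + eps * cp.norm(P.T @ x, 2) + (P.T @ x) @ d_bar <= 0
prob = cp.Problem(cp.Minimize(cp.sum_squares(x)), [robust_con])  # placeholder objective
prob.solve()
```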

2.4 Maximum-of-concave constraint function

We now consider a more general maximum-of-concave function

g(u, x) = max_{j≤J} g_j(u, x),

with each −g j being proper, convex, and lower-semicontinuous in u for all x. When we
take J = 1, we arrive back at the formulations given in Sect. 2. Note that any problem
with multiple uncertain constraints g j (u, x), j = 1, . . . , J , where we assume the
usual conditions on g j , can be combined to create a joint constraint of this maximum-
of-concave form. As mentioned in Sect. 2.1, this can also be used to model CVaR
constraints, which has a maximum-of-concave analytical form. The intuitive constraint
to formulate is then


K 
K
ḡ(u, x) = wk max g j (vk , x) = max wk g jk (vk , x), (12)
j≤J ( j1 ,..., j K )∈G
k=1 k=1

where in the last expression we brought the maximum outside the summation by
defining the set G of all possible choices of (j_1, . . . , j_K). This set has size J^K, as
each cluster k has J possible pieces in the maximization function. We perform this
switch in order to attain a reformulation akin to (4), where we have a single inner
maximization problem for the MRO constraint. As these indices are hard to express,
we seek an alternative formulation. We turn to Section 4.2 of Esfahani and Kuhn [24],


on the attainment of the worst-case distribution of Wasserstein DRO for a maximum-


of-concave function; Wasserstein DRO is closely related to our formulations, as will
be explored in Sect. 3. Adapting the ideas from Theorem 4.4 of Esfahani and Kuhn
[24], we note that the worst-case constraint value for any x can in fact be attained by
maximizing the function

\[
\bar{g}(u, \alpha, x) = \sum_{k=1}^{K} \sum_{j=1}^{J} \alpha_{jk}\, g_j(v_{jk}, x) \tag{13}
\]

over u in the uncertainty set, which is to be defined, and α ∈ 𝒜, with 𝒜 =
{α | Σ_{j=1}^J α_{jk} = w_k, α_{jk} ≥ 0 ∀k, j}. For each constituent function g_j, the uncertainty
set then contains a set of vectors (v_{j1}, . . . , v_{jK}), where there exists a set of
parameters (α_{j1}, . . . , α_{jK}) to denote the fraction of mass assigned to that function
and those vectors. The total amount of mass assigned for each cluster remains the
weight of the cluster, i.e., w_k = Σ_{j=1}^J α_{jk}. Note that choosing the j-th function as
the maximum function for cluster k, i.e., the index j_k in (12), is equivalent to setting
α_{jk} = w_k and α_{j'k} = 0 for all j' ≠ j in (13). We adopt this formulation instead
of (12), as it easily allows us to apply the Von Neumann–Fan minimax theorem [38]
in the dual reformulation (see Appendices A and B), and the existence of α instead of
a maximization over g_j is useful for a proof in Sect. 4.2.
The uncertainty set is then as follows.
Case p ≥ 1. In the case where p ≥ 1, we have

\[
\mathcal{U}(K, \epsilon) = \Bigl\{ u = (v_{11}, \dots, v_{JK}) \in \mathcal{S}^{J \times K} \;\Big|\; \exists \alpha \in \mathcal{A}, \; \sum_{k=1}^{K} \sum_{j=1}^{J} \alpha_{jk} \| v_{jk} - \bar{d}_k \|^p \le \epsilon^p \Bigr\}.
\]

Note that the single concave case given previously follows when we take J = 1. All
parameters are defined as in the single concave case.
Case p = ∞. In the case where p = ∞, the set we consider becomes

\[
\mathcal{U}(K, \epsilon) = \Bigl\{ u = (v_{11}, \dots, v_{JK}) \in \mathcal{S}^{J \times K} \;\Big|\; \exists \alpha \in \mathcal{A}, \; \max_{k=1,\dots,K} \sum_{j=1}^{J} \frac{\alpha_{jk}}{w_k} \| v_{jk} - \bar{d}_k \| \le \epsilon \Bigr\}.
\]

Following these changes, x̂ N is again the solution to the robust optimization prob-
lem (MRO), defined now with the generalized uncertainty set.

Solving the robust problem. We give the direct reformulation approach for solving the
generalized problem for p ≥ 1. The case p = ∞ is delayed to Appendix B. We write
the MRO problem as the optimization problem

\[
\begin{array}{ll}
\text{minimize} & f(x) \\
\text{subject to} & \left\{
\begin{array}{ll}
\underset{v_{11},\dots,v_{JK} \in \mathcal{S},\; \alpha \in \mathcal{A}}{\text{maximize}} & \sum_{k=1}^{K} \sum_{j=1}^{J} \alpha_{jk}\, g_j(v_{jk}, x) \\
\text{subject to} & \sum_{k=1}^{K} \sum_{j=1}^{J} \alpha_{jk} \| v_{jk} - \bar{d}_k \|^p \le \epsilon^p
\end{array}
\right\} \le 0,
\end{array} \tag{14}
\]


and, by dualizing the inner maximization problem, arrive at the reformulation:

\[
\begin{array}{ll}
\text{minimize} & f(x) \\
\text{subject to} & \sum_{k=1}^{K} w_k s_k \le 0 \\
& [-g_j]^*(z_{jk} - y_{jk}, x) + \sigma_{\mathcal{S}}(y_{jk}) - z_{jk}^T \bar{d}_k + \phi(q)\, \lambda \left\| z_{jk}/\lambda \right\|_*^q + \lambda \epsilon^p \le s_k, \\
& \qquad k = 1, \dots, K, \quad j = 1, \dots, J \\
& \lambda \ge 0,
\end{array} \tag{15}
\]

with variables λ ∈ R, s_k ∈ R, z_{jk} ∈ R^m, and y_{jk} ∈ R^m. The proof is delayed to
Appendix A. In addition, when K is set to N and w_k = 1/N, this is also of an
analogous form to the convex reduction of the worst-case problem for Wasserstein
DRO, given in Sect. 3.

3 Links to Wasserstein distributionally robust optimization

Distributionally robust optimization (DRO) solves the problem

\[
\begin{array}{ll}
\text{minimize} & f(x) \\
\text{subject to} & \sup_{\mathbb{Q} \in \mathcal{P}_N} \mathbb{E}_{\mathbb{Q}}(g(u, x)) \le 0,
\end{array} \tag{16}
\]

where the ambiguity set 𝒫_N contains, with high confidence, all distributions that could
have generated the training samples D_N, such that the probabilistic guarantee (2) is
satisfied. Wasserstein DRO constructs 𝒫_N as a ball of radius ε with respect to the
Wasserstein metric around the empirical distribution P̂_N = Σ_{i=1}^N δ_{d_i}/N, where δ_{d_i}
denotes the Dirac distribution concentrating unit mass at d_i ∈ R^m. Specifically, we
write 𝒫_N = B_ε^p(P̂_N) = {Q ∈ M(S) | W_p(P̂_N, Q) ≤ ε}, where M(S) is the set
of probability distributions supported on S satisfying a light-tailed assumption (more
details in Sect. 3.1), and

\[
W_p(\mathbb{Q}, \mathbb{Q}') = \inf_{\Pi} \left( \int_{\mathcal{S}^2} \| u - u' \|^p \, \Pi(\mathrm{d}u, \mathrm{d}u') \right)^{1/p}.
\]

Here, p is any integer greater than 1, and Π is any joint distribution of u and u' with
marginals Q and Q'.
When K = N , the constraint of the DRO problem (16) is equivalent to the constraint
of (MRO). In particular, for case p ≥ 1, the dual of the constraint of (16) is equivalent
to the dual of the constraint of (14), with wk = 1/N [35, 51]. Similarly, in the case
where p = ∞, the dual of the constraint of (16) is equivalent to the dual of the
constraint of (17). We can then rewrite the Wasserstein DRO problem as (14), the
MRO problem, when K = N .
Our approach can be viewed as a form of Wasserstein DRO, with the difference that,
when K < N, we deal with the clustered dataset. We form 𝒫_N as a ball around the
empirical distribution P̂_K of the centroids of our clustered data, P̂_K = Σ_{k=1}^K w_k δ_{d̄_k},
where wk is the proportion of data in cluster k. This formulation allows for the reduction


of the sample size while preserving key properties of the sample, which translates
directly to a reduction in the number of constraints and variables, while maintaining
high quality solutions.

3.1 Satisfying the probabilistic guarantees

Following the parallels between MRO and Wasserstein DRO, we now show that the
conditions for satisfying the probabilistic guarantees are also analogous.

Case p ≥ 1. Wasserstein DRO satisfies (2) if the data-generating distribution, supported
on a convex and closed set S, satisfies a light-tailed assumption [24, 27]: there
exists an exponent a > 0 and t > 0 such that A = E_P(exp(t ||u||^a)) = ∫_S exp(t ||u||^a) P(du) < ∞.
We refer to the following theorem.

Theorem 1 (Measure concentration [27, Theorem 2]) If the light-tailed assumption
holds, we have P^N(W_p(P, P̂_N) ≥ ε) ≤ φ(p, N, ε), where φ is an exponentially
decaying function of N.

Theorem 1 estimates the probability that the unknown data-generating distribution
P lies outside the Wasserstein ball B_ε^p(P̂_N), which is our ambiguity set. Thus, we can
estimate the smallest radius ε such that the Wasserstein ball contains the true distribution
with probability 1 − β, for some target β ∈ (0, 1). We equate the right-hand side
to β, and solve for the ε_N(β) that provides us the desired guarantees for Wasserstein
DRO [24, Theorem 3.5].

Case p = ∞. When p = ∞, Bertsimas et al. [14, Section 6] note that the light-
tailed assumption is no longer sufficient. Wasserstein DRO satisfies (2) under stronger
assumptions, as given in the following theorem.

Theorem 2 (Measure concentration, p = ∞ [45, Theorem 1.1]) Let the support


S ⊂ Rm of the data-generating distribution be a bounded, connected, open set with
Lipschitz boundary. Let P be a probability measure on S with density ρ : S →
(0, ∞), such that there exists λ ≥ 1 for which 1/λ ≤ ρ(x) ≤ λ, ∀x ∈ S. Then,
P^N(W_∞(P, P̂_N) ≥ ε) ≤ φ(N, ε), where φ is an exponentially decaying function of
N.

We can again equate the right-hand side to β and find ε_N(β). We extend this result to
the clustered set in MRO.

Theorem 3 (MRO finite sample guarantee) Assume the light-tailed assumption holds
when p ≥ 1, and the corresponding assumptions hold when p = ∞. If β ∈ (0, 1),
η N (K ) is the average p-th powered distance of data-points in D N from their assigned
cluster centers, and x̂ N is the optimal solution to (MRO) with uncertainty set
U(K, ε_N(β) + η_N(K)^{1/p}), then the finite sample guarantee (2) holds.

Proof Compared with Wasserstein DRO, MRO has to account for the additional difference
between the two empirical distributions P̂_N and P̂_K. If we introduce a new
parameter, η_N(K), defined as

\[
\eta_N(K) = \frac{1}{N} \sum_{k=1}^{K} \sum_{i \in C_k} \| d_i - \bar{d}_k \|^p,
\]

the average p-powered distance, with respect to the norm used in the Wasserstein
metric, of all data-points in D_N from their assigned cluster centers d̄_k, we notice that

\[
\begin{aligned}
W_p(\hat{\mathbb{P}}_K, \hat{\mathbb{P}}_N)^p
&= \inf_{\Pi} \int_{\mathcal{S}^2} \| u - u' \|^p \, \Pi(\mathrm{d}u, \mathrm{d}u') \qquad (\Pi \text{ any joint dist. of } \hat{\mathbb{P}}_K, \hat{\mathbb{P}}_N) \\
&\le \sum_{k=1}^{K} \frac{|C_k|}{N} \int_{\mathcal{S}} \| u - \bar{d}_k \|^p \, \hat{\mathbb{P}}_N(\mathrm{d}u \mid u' = \bar{d}_k) \\
&\le \sum_{k=1}^{K} \frac{|C_k|}{N} \, \frac{1}{|C_k|} \sum_{i \in C_k} \| d_i - \bar{d}_k \|^p = \eta_N(K),
\end{aligned}
\]

where we have replaced the integral with a finite sum, as the distributions are discrete.
Therefore, by Theorems 1, 2 and the triangle inequality [20],

\[
W_p(\mathbb{P}, \hat{\mathbb{P}}_K) \le W_p(\mathbb{P}, \hat{\mathbb{P}}_N) + W_p(\hat{\mathbb{P}}_K, \hat{\mathbb{P}}_N) \le \epsilon_N(\beta) + \eta_N(K)^{1/p},
\]

with probability at least 1 − β. We thus have

\[
\mathbb{P}^N\bigl(\mathbb{P} \in \mathcal{B}^p_{\epsilon_N(\beta) + \eta_N(K)^{1/p}}(\hat{\mathbb{P}}_K)\bigr) \ge 1 - \beta,
\]

which implies U(K, ε_N(β) + η_N(K)^{1/p}) contains all possible realizations of uncertainty
with probability 1 − β, so the finite sample guarantee (2) holds. □
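As an illustration of Theorem 3 (a sketch assuming K-means clustering and the Euclidean norm), η_N(K) can be computed directly from the data and used to inflate a given radius ε_N(β), which we treat as already available, e.g., from Theorem 1 or cross-validation.

```python
import numpy as np
from sklearn.cluster import KMeans

def eta_N(D, K, p=2, seed=0):
    """Average p-th powered distance of each point to its assigned cluster center."""
    km = KMeans(n_clusters=K, n_init=10, random_state=seed).fit(D)
    dists = np.linalg.norm(D - km.cluster_centers_[km.labels_], axis=1)
    return np.mean(dists ** p)

# Inflate an assumed-known Wasserstein radius to preserve guarantee (2).
D = np.random.default_rng(2).normal(size=(120, 5))   # placeholder data
p, K, eps_N_beta = 2, 4, 0.05                        # placeholder values
eps_mro = eps_N_beta + eta_N(D, K, p) ** (1.0 / p)
```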


4 Worst-case value of the uncertain constraint

In the previous section, we proposed a theoretical increase in ε to maintain the same
finite sample guarantee before and after clustering. However, a question remains:
what is the extent of the effects of clustering if we don't increase ε? In this section, we
thus approach the analysis in a different manner: keeping ε constant, we quantify the
change in the worst-case value of the constraint function that arises from clustering. In
fact, for select cases, our results suggest there is no need to increase  after clustering;
under specific curvature conditions on the constraint function g, we may obtain a more
conservative, or even unchanged solution after clustering, in which case the original
finite sample guarantee is retained.
We begin with a remark on the clustering value attained. The MRO approach is
closely centered around the concept of clustering to reduce sample size while main-
taining sample diversity. We wish to cluster points that are close together, such that


the objective is only minimally affected. With this goal, we would like to minimize
the average distance of the points in each cluster to their data-center,

\[
D(K)^\star = \underset{C_1, \dots, C_K}{\text{minimize}}\; D(K) = \underset{C_1, \dots, C_K}{\text{minimize}}\; \frac{1}{N} \sum_{k=1}^{K} \sum_{d_i \in C_k} \| d_i - \bar{d}_k \|_2^2,
\]

where d̄_k is the mean of the points in cluster C_k. While the best performance is attained
with D(K)^⋆, in practice we work with the approximation D(K), where C_k is decided
by a clustering algorithm. This value upper bounds D(K)^⋆. A well-known algorithm
is K-means [32], where we create K clusters by iteratively solving a least-squares
problem. From here on we use only D(K), and note that for the case p = 2, we have
D(K) = η_N(K) from Theorem 3.
In this section, we then show the effects of clustering on the worst-case value of the
constraint function in (MRO). We prove two sets of results, corresponding to g given
as a single concave function, and as a more general maximum-of-concave function,
which includes the maximum-of-affine function.

4.1 Single concave function

Quantifying the clustering effect. We calculate the difference between the worst-case
value of the constraint in (MRO), for different K,

\[
\begin{array}{lll}
\bar{g}_K(x) = & \underset{u_1, \dots, u_K \in \mathcal{S}}{\text{maximize}} & \displaystyle\sum_{k=1}^{K} \frac{|C_k|}{N}\, g(u_k, x) \\
& \text{subject to} & \displaystyle\sum_{k=1}^{K} \frac{|C_k|}{N} \| u_k - \bar{d}_k \|^p \le \epsilon^p.
\end{array} \tag{MRO-K}
\]

When K = N , ḡ N (x) is akin to the constraint value for traditional Wasserstein DRO.
We also denote by ḡ_N^*(x) the value of the constraint without the support con-
straints u 1 , . . . , u N ∈ S. From here on, when we mention that the support affects the
worst-case constraint value, we refer to situations where at least one of the constraints
u_i ∈ S for i = 1, . . . , N is binding. Formally, the definition is ḡ_N(x) ≠ ḡ_N^*(x) for
some x feasible for the DRO problem. We note a sufficient but not necessary condition
for the support to not affect the worst-case constraint value: the situation in which the
support doesn’t affect the uncertainty set, which is defined as

 

N
N ×m
u∈R : (1/N ) u i − di p
≤ p

i=1
 

N
N ×m
= u∈S : (1/N ) u i − di p
≤ p
.
i=1


If the support satisfies this condition, we can conclude that ḡ N (x) = ḡ N ∗ (x) for any x
feasible for the DRO problem, and obtain improved bounds below. While the condition
depends on the location of the data-points, it is acceptable, as this is a condition we
can check given data to potentially improve the following bounds, without having to
solve the MRO problem. We also define the L-smooth condition, which is needed in
the subsequent theorems.

Definition 1 (L-smoothness) A differentiable function g(u, x) is L-smooth on its
domain, with constant L, with respect to the ℓ₂-norm and for a given x, if

\[
\| \nabla g(v, x) - \nabla g(u, x) \|_2 \le L \| u - v \|_2, \quad \forall u, v \in \operatorname{dom}_u g.
\]

With these definitions, we can prove the following relations.

Theorem 4 With the same x and ε, and for any integer p ≥ 1, we always have

ḡ_N(x) ≤ ḡ_K(x).

Suppose that Assumption 1 holds, and −g is L-smooth according to Definition 1.
Then, with the same x and ε, and for any integer p ≥ 1, we always have

ḡ_K(x) ≤ ḡ_N^*(x) + (L/2)D(K).

The proof is delayed to Appendix D. The results also hold for p = ∞, as we have
shown in Remark 1 that the case p = ∞ is the limit of the case p ≥ 1, and these
results hold under the limit.
Let Δ be the maximum difference in constraint value resulting from relaxing
the support constraint on the MRO uncertainty sets, subject to x being feasible for
problem (MRO), Δ = max_{x∈X} (ḡ_N^*(x) − ḡ_N(x)). As we assume Assumption 1 to
hold, combined with the smoothness of g, we note that when solving for ḡ_N^*(x),
the chosen v_i values without the support constraint will still remain in the domain
dom_u g. Refer to a similar argument in Appendix D (ii) for details. The function
ḡ_N^*(x) − ḡ_N(x) is then continuous in x and everywhere defined for x ∈ X; thus,
maximizing over the compact set X, the value Δ is finite. Then, we observe
that ḡ_K(x) − ḡ_N(x) ≤ Δ + (L/2)D(K) for all such x, so the smaller the D(K)
(i.e., the higher-quality the clustering procedure), the smaller the increase in the worst-case
constraint value. In addition, the value Δ is independent of K, as it only depends on
ḡ_N^*(x) and ḡ_N(x).

Remark 2 While Δ could be constructed to be arbitrarily bad, in practice, we expect
our relevant range of ε to be small enough such that the difference is insignificant. We
can approximate Δ ≈ 0 and use the upper bound (L/2)D(K), as this bound is often
not tight. See Sects. 6.3 and 6.1 for examples.

Uncertain objective. When the uncertainty is in the objective, Theorem 4 quantifies
the difference in optimal values.


Corollary 1 Consider the problem where g is itself the objective function we would like
to minimize and X ⊆ Rn represents the constraints, which are deterministic. Then,
(L/2)D(K) + Δ upper bounds the difference in optimal values of the MRO problem
with K and N clusters.

Uncertain constraints. When the uncertainty is in the constraints, the difference
between ḡ_K(x) and ḡ_N(x) no longer directly reflects the difference in optimal values.
Instead, clustering creates a restriction on the feasible set for x as follows. For the
same x̂, ḡ_K(x̂) takes a greater value than ḡ_N(x̂). Since both of them are constrained
to be nonpositive in (MRO), the feasible region with K clusters is smaller.
Affine dependence on uncertainty. As a special case, when g is affine in u, L = 0, so
we observe the following corollary.

Corollary 2 (Clustering with affine dependence on the uncertainty) If g(u, x) is affine


in u and the worst-case constraint value is not affected by the support constraint, then
clustering makes no difference to the optimal value and optimal solution to (MRO).
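The mechanism behind Corollary 2 is visible in reformulation (10): the data enter only through Σ_k w_k d̄_k, which equals the overall sample mean for every clustering. A quick numerical check (our illustration, with K-means standing in for an arbitrary clustering procedure):

```python
import numpy as np
from sklearn.cluster import KMeans

D = np.random.default_rng(6).normal(size=(200, 4))    # placeholder data
for K in (1, 5, 20, 200):
    km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(D)
    w = np.bincount(km.labels_, minlength=K) / len(D)
    weighted_centroid = w @ km.cluster_centers_        # sum_k w_k * dbar_k
    assert np.allclose(weighted_centroid, D.mean(axis=0))
```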

4.2 Maximum-of-concave functions

We now consider the more general case of a maximum-of-concave constraint function,
g(u, x) = max_{j≤J} g_j(u, x), subject to a polyhedral support, S = {u | Hu ≤ h}. For
p ≥ 1, we make use of the dual of the optimization problem in the constraint of (14),
defined for various K,

\[
\begin{array}{lll}
\bar{g}_K(x) = & \underset{\lambda \ge 0,\, \gamma \ge 0,\, z,\, s}{\text{minimize}} & \displaystyle\sum_{k=1}^{K} (|C_k|/N)\, s_k \\
& \text{subject to} & [-g_j]^*(z_{jk} - H^T \gamma_{jk}) + \gamma_{jk}^T (h - H \bar{d}_k) - z_{jk}^T \bar{d}_k + \lambda \epsilon^p \\
& & \quad + \phi(q)\, \lambda \left\| z_{jk}/\lambda \right\|_*^q \le s_k, \quad k = 1, \dots, K, \; j = 1, \dots, J,
\end{array} \tag{MRO-K-Dual}
\]

where the variables y_{jk} from (15) are replaced by H^T γ_{jk}, with γ_{jk} ≥ 0, due to the
specific form of the polyhedral support. Similarly, for p = ∞, we define

\[
\begin{array}{lll}
\bar{g}_K(x) = & \underset{\lambda \ge 0,\, \gamma \ge 0,\, z,\, s}{\text{minimize}} & \displaystyle\sum_{k=1}^{K} (|C_k|/N)\, s_k \\
& \text{subject to} & [-g_j]^*(z_{jk} - H^T \gamma_{jk}) + \gamma_{jk}^T (h - H \bar{d}_k) - z_{jk}^T \bar{d}_k + \lambda_k \epsilon \le s_k, \\
& & \quad k = 1, \dots, K, \; j = 1, \dots, J, \\
& & \| z_{jk} \|_* \le \lambda_k, \quad k = 1, \dots, K, \; j = 1, \dots, J.
\end{array} \tag{MRO-K-Dual-$\infty$}
\]

The following theorems hold for both p ≥ 1 and p = ∞.

Theorem 5 When g is the maximum of concave functions with domain dom_u g_j = R^m
and polyhedral support S = {u | Hu ≤ h}, and where each −g_j is L_j-smooth
according to Definition 1, we have, for the same x and ε,

\[
\bar{g}_N(x) - \delta(K, z, \gamma) \le \bar{g}_K(x) \le \bar{g}_N^*(x) + \max_{j \le J} (L_j/2)\, D(K),
\]

where δ(K, z, γ) = (1/N) Σ_{k=1}^K Σ_{i∈C_k} max_{j≤J}((−z_{jk} − H^T γ_{jk})^T (d_i − d̄_k)), and
z, γ are the dual variables. ḡ_N^*(x) is the value of the problem without support constraints.

The proof is delayed to Appendix E. Due to the nonconvex and nonconcave nature
of maximum-of-concave functions, the lower bound now involves an extra term
δ(K , z, γ ). However, when g is a maximum-of-affine function, which is convex, we
know ḡ N ∗ (x) to be an upper bound on ḡ K (x).
Corollary 3 When g is the maximum of affine functions with domain domu g j = Rm
and polyhedral support S = {u | Hu ≤ h}, for the same x and ε,

ḡ_N(x) − δ(K, z, γ) ≤ ḡ_K(x) ≤ ḡ_N^*(x).

This follows from the fact that L j = 0 for all affine functions g j .
Uncertain objective. When the uncertainty is in the objective, Theorem 5 and Corollary 3
quantify the possible difference in optimal values between K and N clusters.
We let Δ = max_x (ḡ_N^*(x) − ḡ_N(x)), subject to x being feasible for problem (MRO).
Note that this is only needed for the upper bound.
Corollary 4 Consider the problem where g is itself the objective function we would like
to minimize and X ⊆ Rn represents the constraints, which are deterministic. Then,
δ(K , z, γ ) upper bounds the possible decrease in optimal values of the MRO problem
with K clusters compared with that of N clusters. Similarly, max_{j≤J}(L_j/2)D(K) + Δ
upper bounds the possible increase.
Uncertain constraints. When the uncertainty is in the constraints, the difference
between ḡ_N(x) and ḡ_K(x) as given in Theorem 5 no longer directly reflects the difference
in optimal values. Instead, clustering affects the feasible set for x as follows.
For any x̂, in the case ḡ_N(x̂) ≥ ḡ_K(x̂), ḡ_K(x̂) can be at most δ(K, z, γ) lower in
value than ḡ_N(x̂). Since both values are constrained to be nonpositive in (MRO),
the feasible region of the MRO problem with K clusters may be less restricted than
that of N clusters. This indirectly allows MRO with K clusters to obtain a smaller
optimal value. On the other hand, in the case ḡ_N(x̂) ≤ ḡ_K(x̂), ḡ_K(x̂) can be at most
max_{j≤J}(L_j/2)D(K) + Δ higher in value than ḡ_N(x̂). This indirectly lets MRO with
K clusters obtain a larger optimal value.

5 Parameter selection and outliers

Choosing K . When the uncertain constraint is affine and S does not affect the worst-
case constraint value, the number of clusters K does not affect the final solution,
so it is always best to choose K = 1. When S affects the worst-case constraint
value, there is a difference of at most Δ between setting K = 1 and K = N, which
can often be approximated as Δ ≈ 0 for small ε. Therefore, setting K = 1 remains the
recommendation. When the constraint is concave, we choose K to obtain a reasonable
upper bound on ḡ K (x), as described in Theorem 4. This upper bound depends linearly
on D(K ), the clustering value, so by choosing the elbow of the plot of D(K ), we
choose a cluster number that, while being a reasonably low value, best conforms to
the shape of the underlying distribution. When the constraint is maximum-of-concave,
the bounds given in Theorem 5 are also related to D(K ). The upper bound is related
in the same manner as above. For the lower bound, note that we wish for δ(K , z, γ ),
which is nonnegative from Appendix E, to take lower values. By definition, δ(K , z, γ )
is linearly dependent on the differences di − d̄k , which are smaller when D(K ) takes
lower values. Therefore, while γ and z are unknown, we can attempt to lower the
possible value of δ(K , z, γ ) by choosing a reasonably low D(K ). The elbow method
has been commonly used in machine learning problems pertaining to the choice of hyperparameters,
especially for K-means, and can be traced back to Thorndike [44] in
1953. Note that, by directly returning D(K ) and examining the elbow as an initial
step, this procedure can be completed in the clustering step without having to solve
the downstream optimization problem. To further improve the choice of K , or if the
elbow is unclear, cross-validation may be used for low K values or K values around
the elbow. No matter if the uncertainty lies in the objective or the constraints, this
bound will inform us of the potential difference between different K .
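A minimal sketch of the elbow procedure just described, assuming K-means as in the paper; the discrete second-difference rule for locating the elbow is only one simple heuristic, and cross-validation around the returned K is still advisable.

```python
import numpy as np
from sklearn.cluster import KMeans

def elbow_K(D, K_max=15, seed=0):
    """Compute D(K) = inertia / N for K = 1..K_max and return the K of largest curvature."""
    N = D.shape[0]
    Ks = np.arange(1, K_max + 1)
    DK = np.array([KMeans(n_clusters=int(K), n_init=10, random_state=seed).fit(D).inertia_ / N
                   for K in Ks])
    curvature = DK[:-2] - 2 * DK[1:-1] + DK[2:]   # discrete second difference
    return int(Ks[1:-1][np.argmax(curvature)]), DK

D = np.random.default_rng(4).normal(size=(120, 5))   # placeholder data
K_elbow, DK = elbow_K(D)
```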
Choosing ε. While we have outlined theoretical results in Theorem 3 for choosing ε,
in practice, we experimentally select ε through cross-validation to arrive at the desired
guarantee. Therefore, while the theoretical bounds suggest choosing a larger ε when
we cluster, this may not be the case experimentally. In fact, for concave g, we may
even choose a smaller ε, due to the increase in the level of conservatism for small
K. On the other hand, for maximum-of-concave g, there is the possibility of needing
a larger ε, as smaller K may lead to less conservative solutions. However, for both
cases, we show a powerful result in the upcoming numerical examples: although for
the same ε, MRO with K clusters differs in conservatism from Wasserstein DRO (N
clusters), there are cases where we can tune ε such that MRO and DRO provide almost
identical tradeoffs between objective values and probabilistic guarantees, such that no
loss in performance results from choosing a smaller cluster number K.
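Schematically, the ε cross-validation described above can be organized as follows; solve_mro and g_bar are hypothetical stand-ins for the problem-specific MRO reformulation and the empirical evaluation of the constraint, and the acceptance rule below is just one simple way to enforce the target guarantee empirically.

```python
import numpy as np

def tune_epsilon(folds, eps_grid, K, solve_mro, g_bar, beta=0.1):
    """Pick the smallest-objective epsilon whose empirical guarantee matches (2).

    folds:   list of (train, valid) dataset splits
    solve_mro(train, K, eps) -> (x_hat, objective)   # hypothetical MRO solver
    g_bar(valid, x_hat)      -> empirical mean of g(u, x_hat) on validation data
    """
    best = None
    for eps in eps_grid:
        results = [solve_mro(tr, K, eps) for tr, va in folds]
        # fraction of folds whose out-of-sample mean constraint value is nonpositive
        ok = np.mean([g_bar(va, x) <= 0 for (x, _), (tr, va) in zip(results, folds)])
        obj = np.mean([o for _, o in results])
        if ok >= 1 - beta and (best is None or obj < best[1]):
            best = (eps, obj)
    return best
```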
Data with outliers. When the provided dataset contains outliers, one might imagine that
the centroids created by the clustering algorithm will be biased towards the outliers.
While this is true, the weights of the outliers will not increase through clustering, thus
the effect of outliers on these clustered Wasserstein balls is not worse than their effect
on the original Wasserstein balls, which include the Wasserstein ball around the outlier
point. In fact, by clustering the outlier point with other points, MRO offers protection
against the outlier. We demonstrate this in the numerical experiment in Sect. 6.4,
where we compare three methods: MRO, MRO with outlier removal, and MRO with
the outlier considered as its own cluster.


6 Numerical examples

We now illustrate the computational performance and robustness of the proposed


method on various numerical examples. All the code to reproduce our experiments is
available, in Python, at
https://github.com/stellatogrp/mro_experiments.
We run the experiments on the Princeton Institute for Computational Science and
Engineering (PICSciE) facility with 20 parallel 2.4 GHz Skylake cores. We solve all
optimization problems with the MOSEK [37] optimizer with default settings. In Sect. 6.1,
we demonstrate the performance of MRO when the uncertain constraint is concave.
In Sect. 6.2, 6.3, and 6.4, we demonstrate the performance of MRO for maximum-of-
affine uncertainty.
The calculated in-sample objective value and out-of-sample expected values, as
well as the out-of-sample probability of constraint violation, are averaged over 50
independent runs of each experiment. For each run, we generate evaluation data of
the same size N as the training dataset. For numerical examples with an uncertain
objective, the probability of constraint violation is measured as the probability the
average out-of-sample value is above the in-sample value. For numerical examples
with an uncertain constraint, the probability of constraint violation is measured as the
probability the average constraint value is above zero. For all experiments, we plot the
following.
1. In-sample objective values and out-of-sample expected values vs. ε for different
K. We use solid lines for the in-sample objective value, and dotted lines for the
out-of-sample expected value.
2. Objective value vs. β for different K; each point represents the solution for the ε
achieving the smallest objective value. Starred K values indicate the formulation
without support constraints.
3. The difference in the value of the uncertain objective between using K and N clus-
ters, compared with the theoretical upper bound from Theorems/Corollaries 4, 5, 3.
We use solid lines for the actual difference, and dotted lines for the upper bounds.
4. Solve time for select K and ε values.
We also plot the clustering value D(K ) over K , and use the elbow method to suggest
an a priori K value to perform cross-validation around.

6.1 Capital budgeting

We consider the capital budgeting problem in [4, Section 4.2], where we select a
portfolio of investment projects maximizing the total net present value (NPV) of the
portfolio, while the weighted sum of the projects is less than a total budget θ . The NPV
for all projects is η(u) ∈ R^n, where for each project j, η_j(u) is the sum of discounted
cash flows F_{jt} over the years t = 0, . . . , T, i.e., η_j(u) = Σ_{t=0}^T F_{jt}/(1 + u_j)^t. Here,
u j ∈ R+ is the discount rate of project j. We formulate the uncertain function to be
minimized as

Fig. 1 D(K) vs. K for all experiments. The red lines are the suggested values

g(u, x) = −η(u)^T x,

where x = (x1 , . . . , xn ) ∈ {0, 1}n is the indicator for selecting each project. The
discount rate u j is subject to uncertainty, as it depends on several factors, such as
the interest rate of the country where project j is located and the level of return
the decision-maker wants as compensation for the risk. The function g is concave and
monotonically increasing in u, and we can define a domain u ≥ 0 so that Assumption 1
and Theorem 4 apply. The robust problem becomes

\[
\begin{array}{ll}
\underset{x, \tau}{\text{minimize}} & \tau \\
\text{subject to} & \bar{g}(u, x) \le \tau \quad \forall u \in \mathcal{U}(K, \epsilon) \\
& h^T x \le \theta, \quad x \in \{0, 1\}^n,
\end{array}
\]

where h ∈ Rn is the vector of project weights. We solve the convex reformulation


obtained through applying (5), for p = 2, which gives rise to a number of power-cone
constraints proportional to the number of clusters.
Problem setup. We set n = 20, N = 120, T = 5. We generate F jt from a
uniform distribution on [0.1, 0.5 + 0.004t] for j = 1, . . . , n, t = 0, . . . , T .
For all j, h j is generated from a uniform distribution on [1, 3 − 0.5 j], and the
total budget θ is set to be 12. We generate uncertain data from two slightly dif-
ferent uniform distributions, to simulate two different sets of predictions on the
discount rates. The first half is generated on [0.005 j, 0.02 j], and the other half on
[0.01j, 0.025j], for all j. We calculate an upper bound on the L-smoothness parameter,
L = ‖∇²_u Σ_{j=1}^n Σ_{t=0}^T F_{jt} (x̂_N)_j (1 + u_j)^{−t}‖_{2,2} ≤ ‖Σ_{j=1}^n Σ_{t=0}^T t(t+1) F_{jt} (x̂_N)_j‖_{2,2},
for each data-driven solution x̂_N.
Results. We observe in Fig. 2 that using two clusters is enough to achieve performance
almost identical to that of using 120 clusters. Although from the left image, we see
that K = 2 slightly upper bounds K = 120, from the right, their tradeoffs between
the objective value and relevant constraint violation probability (β ≤ 0.2) are largely
the same, so we can always tune  to achieve the same performance and guarantees.
Notice that the results for K = 120 and K = 120* are near identical for small ε, where

Fig. 2 Capital budgeting. Descriptions are given in the beginning of Sect. 6. The difference in objective
values (bottom left) is calculated as ḡ_K(x) − ḡ_N(x), and the theoretical bound is (L/2)D(K) from Corollary 1

K = 120* is the formulation without the support constraint. Therefore, while ḡ_N^*(x)
slightly upper bounds ḡ_N(x), we can approximate their difference Δ ≈ 0 for small
enough ε, for which the upper bound (L/2)D(K) thus holds. In fact, in this example,
even for larger ε where we observe Δ > 0, the actual difference between ḡ_K and ḡ_N
is bounded by (L/2)D(K).
and the true difference follows the same trend, matching the suggestion from Fig. 1.
Therefore, setting K = 2 is the optimal decision, with a time reduction of 2 orders
of magnitude, and a complexity reduction from 26,626 variables and 12,000 power
cones to 666 variables and 200 power cones.

6.2 Sparse portfolio optimization

We consider a market that forbids short-selling and has m assets as in [24]. Daily
returns of these assets are given by the random vector d = (d1 , . . . , dm ) ∈ Rm . The
percentage weights (of the total capital) invested in each asset are given by the decision
vector x = (x1 , . . . , xn ) ∈ Rn . We restrict our selection to at most θ assets, given by
the 0-th norm cardinality constraint below. The distribution P is unknown, but we have
observed a historical dataset D N . Our objective is to minimize the CVaR with respect
to variable x,

minimize CVaR(−u T x, α)
subject to 1T x = 1, x ≥ 0, x 0 ≤ θ,

123
Mean robust optimization

which represents the average of the α largest portfolio losses that occur. In other words,
the CVaR term seeks to ensure that the expected magnitude of portfolio losses, when
they occur, is low.
 The objective has an analytical
 form with an extra variable τ given
as [24, 46]: EP τ + α1 max{−u T x − τ, 0} . From this, we obtain g as the maximum
of affine functions,

g(u, x) = max{(−1/α)x T u + (1 − 1/α)τ, τ }.

We then apply the convex reformulation given in Appendix B, for p = ∞.


Problem setup. We take stock data from the past 10 years of S&P500, and generate
synthetic data from their fitted general Pareto distributions. We choose a generalized
Pareto fit over a normal distribution as it better models the heavy tails of the returns [15].
See the Github repository for the code, which uses the "Rsafd" R package [16]. We let
α = 20%, m = 50 stocks, and generate a dataset size of N = 1000. Our portfolio can
include at most θ = 5 stocks. For the upper bound δ(K , z, γ ) on ḡ N (x) − ḡ K (x), we
note the special structure of this problem, where one of the affine pieces is independent
of u, to arrive at a bound δ(K , z, γ ) = maxk {maxi {(d̄k − di )T x/α}}.
Results In Fig. 3, while setting K to smaller values lead to a decrease in the optimal
value across , we note that for K = 5 and above, we can already achieve a tradeoff
curve between the optimal value and probability of constraint satisfaction that is similar
to that of K = 1000, and setting K = 10 brings it slightly closer. In the plots of D(K )
and of the upper bound on the difference, we also note that the elbow is around K = 5.
We thus recommend choosing K through cross validation around 5, as tuning  for
these small K gives 1–3 orders of magnitude time reduction.

6.3 Facility location

We examine the classic facility location problem [9, 33]. Consider a set of n potential
facilities, and m customers. Variable x ∈ {0, 1}n describes whether or not we construct
each facility i for i = 1, . . . , n, with cost ci . In addition, we would like to satisfy the
uncertain demand u ∈ Rm at minimal cost. We define variable X ∈ Rn×m where X i j
corresponding to the portion of the demand of customer j shipped from facility i with
corresponding cost Ci j . Furthermore, r ∈ Rn represents the production capacity for
each facility, and u ∈ Rm represents the uncertain demand from each customer. For
each customer j, X j represents the proportion of goods shipped from any facility to
that customer, which sums to 1. For each facility i, (X T )i represents the proportion
of goods shipped to any customer. Putting this all together, we obtain the objective to
minimize, c T x + tr(C T X ), subject to constraints 1T X j = 1, j = 1, . . . , m, as well
as multiple affine uncertain capacity constraints,

gi (u, x) = (X T )i u − ri xi ≤ 0 i = 1, . . . , n,

which we combine to create a single maximum-of-affine constraint, g(u, x) =


maxi≤n ((X T )i u − ri xi ) ≤ 0. Now, to ensure a high probability of constraint sat-

123
I. Wang et al.

Fig. 3 Sparse portfolio. Descriptions are given in the beginning of Sect. 6. The difference in objective values
(bottom left) is calculated as ḡ N (x) − ḡ K (x), and the theoretical bound is δ(K , z, γ ) from Corollary 3

isfaction, we use the CVaR reformulation,


 
g(u, x, τ ) = τ + (1/α) max max((X T )i u − ri xi − τ ), 0 ≤ 0,
i≤n

where we add the auxiliary variable τ . We assume a polyhedral support S = {u |


H u ≤ b} for the demand, and apply the convex reformulation given in Appendix B,
for p = ∞.
Problem setup. To generate data, we set n = 5 facilities, m = 25 customers, and
N = 50 data samples. For the CVaR reformulation, we set α = 20%. We set costs c =
(46.68, 58.81, 30, 42.09, 35.87), and generate the two coordinates of each customer’s
location from a uniform distribution on [0, 15]. We then calculate C as the 2 distance
between each pair of customers. We set production capacities r = (33, 26, 41, 26, 22).
We assume the demand d is supported between 1 and 6, written as H u ≤ b, where
H = [−I I ]T and b is the concatenation of a vector of −1’s of length m and a
vector of 6’s of length m. We generate demands as the combination of two normal
distributions. Half of the data is generated with mean μ1 = 3 and variance σ1 = 0.9,
the second half has mean μ1 = 4 and variance σ1 = 0.8. We then project the demands
onto (1, 6). For the upper bound δ(K , z, γ ) on ḡ N (x) − ḡ K (x) from Corollary 3, we
K
have (1/N ) k=1 i∈Ck max(max j≤J (((1/α)X [i] − H γ jk ), 0) (di − d̄k )). Note
T T

that this upper bounds the difference in constraint values, and only indirectly affects
the objective values through restrictions on the feasible region. Therefore, it is not
an upper bound on the difference in objective value, merely an estimate. We cannot
directly compare this upper bound against the change in constraint values, as the

123
Mean robust optimization

Fig. 4 Facility location. Descriptions are given in the beginning of Sect. 6. The difference in objective values
(bottom left) is calculated as Obj(N ) - Obj(K ), and we compare it to the theoretical upper bound δ(K , z, γ )
on the worst-case constraint value ḡ N (x) − ḡ K (x), from Corollary 3

constraint value will always be near 0 for optimality. We thus compare it against the
change in objective values.

Results As expected of maximum-of-affine g, we note in Fig. 4 that setting K to


smaller values lead to a decrease in the optimal value across different  values. While
K = 1 yields poor performance in terms of the probability of constraint violation, we
observe that K = 2 already yields a tradeoff between the objective and probability of
constraint violation close to that of K = 50. Through cross-validation with different
K , we select K = 5, which provides a tradeoff curve closer to optimality. As this
problem has uncertainty in the constraints and not the objective, the bounds given in
Corollary 3 do not directly reflect the difference in the objective values. However, they
do give a reference value and inform us of the general trend of the difference. In this
case, they still upper bound the actual difference, as shown in Fig. 4. We note that the
bounds we use do not depend on ḡ N ∗ (x), so it is irrelevant whether or not the support
has an affect on the worst-case constraint value. Overall, choosing K = 5 leads to a
time reduction of an order of magnitude while achieving near-optimal performance.

6.4 Newsvendor problem

We consider a 2-item newsvendor problem where, at the beginning of each day, the
vendor orders x ∈ R2+ products at price h = (4, 5). These products will be sold at the
prices c = (5, 6.5), until either the uncertain demand u or inventory x is exhausted.
The objective function to minimize is the sum of the ordering cost minus the revenue,

123
I. Wang et al.

Fig. 5 Newsvendor. data-points


and the outlier at (0,0)

h T x − c T min{x, u}, from which we obtain the maximum-of-affine uncertain function


g to minimize,

g(u, x) = h T x + max(−c1 x1 − c2 x2 , −c1 x1 − c2 u 2 , −c1 u 1 − c2 x2 , −c1 u 1 − c2 u 2 ).

We assume a polyhedral support S = {u | Cu ≤ b}, and obtain the convex


reformulation, for p = 1, by applying (15).
For this problem, we consider the effects of outliers on the performance of MRO.
Therefore, we consider the data to have an outlier at (0, 0), the worst-case value of the
support set. In Fig. 5, we show a set of generated data along with this outlier point.
We consider three ways to solve the problem.
1. MRO, where we directly apply MRO to the dataset with the outlier.
2. ROB-MRO, where we perform preliminary analysis on the dataset to remove the
outlier point, then apply MRO to the cleaned dataset.
3. AUG-MRO, where we perform the clustering step on data without the outlier, then
define an augmented distribution supported on K + 1 points, where the extra point
is the outlier point (0,0), with weight 1/N . The weights of the other clusters are
adjusted accordingly.
Problem setup. To generate data, we set N = 100 data samples. We assume demand
is supported between 0 and 40, which we write as Cu ≤ b, where C = [−I I ]T
and b = (0, 0, 0, 0, 40, 40, 40, 40). We allow non-integer demand to allow for more
variance in the data. We generate the demand from a log-normal distribution, where
the underlying normal distribution has parameters
   
3.0 0.3 −0.1
μ= , = ,
2.8 −0.1 0.2

and take the minimum between the generated values and 40. For the upper bound
K
δ(K , z, γ ) on ḡ N (x) − ḡ K (x) from Corollary 3, we have (1/N ) k=1 i∈Ck
max j≤4 ((−c̃ j − C T γ jk )T (di − d̄k )), where c̃1 = 0, c̃2 = c1 e1 , c̃3 = c2 e2 , c̃4 = c.
Results To examine the effect of the outlier, in Fig. 6, we compare, for K = 10 and
K = 100, the objectives and tradeoff curves for the three methods. We note that,
when the outlier is averaged with other data-points, the final in-sample objective may
be improved, as the centroid moves closer to the non-outlier points. We observe that
MRO, in which the outlier may be clustered with other points, offers a lower in-sample
objective than AUG-MRO, in which the outlier is considered its own cluster. MRO has

123
Mean robust optimization

in fact offered protection against the outlier. And, as expected, ROB-MRO, where the
outlier point is removed, yields the best in-sample results. Regardless of the method,
we note that the final out-of-sample tradeoff curves are near-identical. Comparing the
plots for K = 10 and K = N = 100, the difference between MRO and ROB-MRO
for K = 10 is not larger than the difference for K = 100, which shows that, while
removing outliers a priori may be helpful, the effect of outliers will not be worse for
MRO compared to classic Wasserstein DRO.
In Fig. 7, we compare the in and out-of-sample objective values for MRO. While
setting K = 1 yields suboptimal results, we note that for K = 5 and above, we can
achieve similar performance as setting K = 100. We note that the upper bound on
ḡ N (x) − ḡ K (x), given in Corollary 3, holds for MRO. We again note that bounds
we observe do not depend on ḡ N ∗ (x), so it is irrelevant whether or not the support
has an affect on the worst-case constraint value. Regardless, we see that the support
only minimally affects the worst-case constraint value, at only at higher values of .
Overall, choosing K = 5, we obtain an order of magnitude computational speed-up.

7 Conclusions

We have presented mean robust optimization (MRO), a new data-driven methodology


for decision-making under uncertainty that bridges robust and distributionally robust
optimization while preserving rigorous probabilistic guarantees. By clustering the
dataset before performing MRO, we solve an efficient and computationally tractable
formulation with limited performance degradation. In particular, we showed that when
the constraints are affine in the uncertainty, clustering does not affect the optimal
value of the objective. When the constraint is concave or maximum-of-concave in the
uncertainty, we directly quantified the change in worst-case constraint value that is
caused by clustering. For problems with objective uncertainty, this directly bounds
the change in the optimal value caused by clustering. We demonstrated this result
through a set of numerical examples, where we observed the possibility of tuning the
size of the uncertainty set such that using a small number of clusters achieves near-
identical performance of traditional DRO, with much higher computational efficiency.
In the final example, we also demonstrated that MRO offers protection against outliers
compared to Wasserstein DRO.

A Proof of the constraint reformulation in (15)

We give a prove for the general case of maximum-of-concave functions. When J = 1,


we take α1k = wk , for all k. For a simpler proof for the case of single-concave
functions, refer to [47, Appendix A].

123
I. Wang et al.

Fig. 6 Newsvendor. Comparing the three methods for K = 10 and K = 100. Left: in-sample objective
values vs . Right: objective value vs β

Fig. 7 Newsvendor, MRO. Descriptions are given in the beginning of Sect. 6. The difference in objective
values (bottom left) is calculated as ḡ N (x)− ḡ K (x), and the theoretical bound is δ(K , z, γ ) from Corollary 3

123
Mean robust optimization

To simplify notation, we define ck (v jk ) = v jk − d̄k p −  p . Then, starting from


the inner optimization problem of (14):

⎨ sup K
k=1
J
j=1 α jk g j (v jk , x)
v11 ,...,v J K ∈S,α∈
⎩ subject to K J
k=1 j=1 α jk ck (v jk ) ≤0

K J K J
= sup inf k=1 j=1 α jk g j (v jk , x) − λ k=1 j=1 α jk ck (v jk )
v11 ,...,v J K ∈S,α∈ λ≥0

K J K J
= sup λ≥0 inf sup k=1 j=1 α jk g j (v jk , x) − λ k=1 j=1 α jk ck (v jk ),
α∈ v11 ,...,v J K ∈S

We applied the Lagrangian in the first equality. Then, as the summation is over upper-
semicontinuous functions g j (v jk , x) concave in v jk , we applied the Von Neumann-
Fan minimax theorem [38] to interchange the inf and the sup. Next, we rewrite the
formulation using an epigraph trick, and make a change of variables.


⎨ sup λ≥0,s
inf K
k=1 sk
α∈
=

⎩subject to sup J
j=1 α jk (g j (v jk , x) − λck (v jk )) ≤ sk k = 1, . . . , K ,
v11 ,...,v J K ∈S
⎧ K

⎪ sup inf k=1 sk

⎨α∈ λ≥0,s
J
= subject to sup j=1 α jk (g j ((α jk v jk )/α jk , x)

⎪ α11 v11 ∈α11 S,...,α J K v J K ∈α J K S


−λck ((α jk v jk )/α jk )) ≤ sk k = 1, . . . , K .

In the last step, we rewrote v jk = (α jk v jk )/α jk , and maximized over α jk v jk ∈ α jk S.


In the case αi j > 0, the terms in the summation are unchanged, and maximizing over
v jk is equivalent to maximizing over α jk v jk . In the case α jk = 0, we have in the
transformed formulation (α jk v jk )/α jk = 0/0 = 0, and α jk (g j (0, x) − λck (0)) =
0(g j (0, x) − λck (0)) = 0. In the original formulation, the term α jk (g j (v jk , x) −
λck (v jk )) is also equivalent to 0. As the terms in the summation are 0 regard-
less of the value of v jk , maximizing over v jk is equivalent to maximizing over
0. Therefore, we note that the optimal value remains unchanged. Next, we make
substitutions h jk = α jk v jk , and define functions g j (h jk , x) = α jk g j (h jk /α jk , x),
ck (h jk ) = α jk ck (h jk /α jk ).


⎨ sup λ≥0,s
inf K
k=1 sk
α∈
=

⎩subject to sup J
j=1 g j (h jk , x) − λck (h jk ) ≤ sk k = 1, . . . , K ,
h 11 ∈α11 S,...,h J K ∈α J K S

⎨ sup inf K
k=1 sk
= α∈ λ≥0,s
⎩subject to J  + χα jk S + λck ]∗ (0) ≤ sk k = 1, . . . , K .
j=1 [−g j

For the new functions defined, we applied the definition of conjugate functions. We
also define the characteristic function χ S (v) with χ S (v) = 0 if v ∈ S; = ∞ otherwise.

123
I. Wang et al.

Now, using the conjugate form f ∗ (y) = αg ∗ (y) of a right-scalar-multiplied function


f (x) = αg(x/α), and noting that χα jk S (h) takes the same value as α jk χ S (h/α jk ), we
can rewrite the constraint as


J
α jk [−g j + χ S + λck ]∗ (0) ≤ sk k = 1, . . . , K .
j=1

Next, borrowing results from Esfahani et al. [24, Theorem 4.2], Rockafellar and Wets
[41, Theorem 11.23(a), p. 493], and Zhen et al. [51, Lemma B.8], with regards to the
conjugate functions of infimal convolutions and p-norm balls, we note that:

α jk [(−g j + χ S + λck )]∗ (0) = α jk inf ([−g j ]∗ (z jk − y jk , x)


y jk ,z jk

+ σ S (y jk ) + [λck ]∗ (−z jk )),

[λck ]∗ (−z jk ) = sup(−z Tjk v jk − λ v jk − d̄k p


+ λ p )
v jk
q
= −z Tjk d̄k + φ(q)λ z jk /λ ∗ + λ p .

Substituting this in, we have

⎧ K

⎪ sup inf k sk

⎪ λ≥0,s,z,y
⎨α∈ J ∗
subject to j=1 α jk ([−g j ] (z jk − y jk, x) q

⎪ +σ S (y jk ) − z jk d̄k + φ(q)λ z jk /λ∗ + λ p ) ≤ sk
T



k = 1, . . . , K .

J
Taking the supremum over α, noting that j=1 α jk = wk for all k, we arrive at



⎪ inf K
k sk
⎨ λ≥0,s,z,y
⎪ subject to wk ([−g j ]∗ (z jk − y jk , x) + σ S (y jk ) − z Tjk d̄k

⎩  q
+ φ(q)λ z jk /λ∗ + λ p ) ≤ sk k = 1, . . . , K , j = 1, . . . , J ,

which is equivalent to (15).

B Reformulation of the maximum-of-concave case for p = ∞

We again give the general proof for J ≥ 1. For the case J = 1, refer to the simpler
proof [47, Appendix B]. When p = ∞, we have

123
Mean robust optimization

minimize ⎧f (x) ⎫
⎨ maximize K
k=1
J
j=1 α jk g(v jk , x) ⎬
subject to v11 ,...,v J K ∈S,α∈ ≤ 0,
⎩ subject to J
v jk − d̄k ≤ , k = 1, . . . , K ⎭
j=1 (α jk /wk )
(17)

which has a reformulation where the constraint above is dualized,

minimize f (x)
K
subject to k=1 wk sk ≤ 0
[−g j ]∗ (z jk − y jk , x) + σ S (y jk ) − z Tjk d̄k + λk  ≤ sk (18)
  k = 1, . . . , K , j = 1, . . . , J
z jk  ≤ λk k = 1, . . . , K , j = 1, . . . , J ,

with new variables sk ∈ R, z jk ∈ Rm , and y jk ∈ Rm . We prove this by starting from


the inner optimization problem of (17):

⎨ sup K
k=1
J
j=1 α jk g j (v jk , x)
v11 ,...,v J K ∈S,α∈
⎩ subject to J
j=1 (α jk /wk ) v jk − d̄k ≤ , k = 1, . . . , K

⎨ sup inf K
k=1 (
J
j=1 α jk g j (v jk , x)
= 11
v ,...,v JK ∈S,α∈ λ≥0
⎩ +λk ( − Jj=1 (α jk /wk ) v jk − d̄k ))

⎨ sup inf sup K
k=1 (
J
j=1 α jk g j (v jk , x)
= α∈ λ≥0 v 11 ,...,v JK ∈S
⎩ +λ ( − J (α /w ) v − d̄ ))
k j=1 jk k jk k

We have again formulated the Lagrangian and applied the minmax theorem to the sum
of concave functions in v jk . Now, we can rewrite this in epigraph form,
⎧ K

⎪ sup inf k=1 sk

⎨α∈ λ≥0,s
J
= subject to sup λk  + j=1 α jk g j (v jk , x)

⎪ v11 ,...,v J K ∈S


−λk (α jk /wk ) v jk − d̄k ≤ sk k = 1, . . . , K ,

then use the definition of the dual norm to rewrite the constraints,

⎨ sup λk  + Jj=1 min α jk g j (v jk , x)
v11 ,...,v J K ∈S z jk ∗ ≤λk /wk
⎩ −α jk z Tjk (v jk − d̄k ) ≤ sk k = 1, . . . , K ,

⎨ sup λk  + J
j=1 min α jk g j
= α11 v11 ∈α11 S,...,α J K v J K ∈α J K S z jk ∗ ≤λk /wk
⎩ ((α jk v jk )/α jk , x) − α jk z Tjk (v jk − d̄k ) ≤ sk k = 1, . . . , K .

123
I. Wang et al.

In the last step, we again rewrote v jk = (α jk v jk )/α jk inside functions g j . Using


similar logic as in Appendix A, we note that this does not change the optimal value
for both α jk > 0 and α jk = 0, as taking the supremum over the transformed variables
is equivalent to taking the supremum over the original variables. For conciseness, we
omit the steps of creating the right-scalar-multiplied function and transformations. For
details, please refer to Appendix A. We apply the definition of conjugate functions
and arrive at the optimization problem
⎧ K

⎪ sup inf k=1 sk

⎪ α∈ λ≥0,s,z

subject to λk  + Jj=1 α jk [−g j + χ S ]∗ (−z jk , x) + α jk z Tjk d̄k ≤ sk

⎪ k = 1, . . . , K



z jk ∗ ≤ λk /wk k = 1, . . . , K , j = 1, . . . , J .

Now, substituting λk = λk wk , z jk = −z jk , and substituting in the conjugate functions


derived in Appendix A, we have
⎧ K

⎪ sup inf k=1 sk

⎪ α∈ λ≥0,s,z
⎨ J ∗
= subject to λk wk  + j=1 α jk ([−g j ] (z jk − y jk , x) + σ S (y jk )

⎪ −z Tjk d̄k ) ≤ sk k = 1, . . . , K



z jk ∗ ≤ λk , k = 1, . . . , K , j = 1, . . . , J .

Note that rescaling λk did not affect value of the problem, as minimizing λk is equiv-
alent to minimizing λk wk , as wk > 0. Lastly, taking the supremum over α, we arrive
at
⎧ K

⎪ inf k=1 sk

⎪ λ≥0,s,z
⎨ ∗
= subject to wk (λk  + [−g j ] (z jk − y jk , x) + σ S (y jk ) − z jk d̄k ) ≤ sk
T

⎪ k = 1, . . . , K , j = 1, . . . , J



z jk ∗ ≤ λk , k = 1, . . . , K , j = 1, . . . , J .

C Proof of the dual problem reformulation as p → ∞

We prove the dual equivalence. For a proof of primal equivalence, see the appendix
of [47].

Theorem 6 Let S be a bounded set. Define here


⎧ K



minimize k=1 wk sk
subject to λ ≥ 0, z k ∈ Rm , yk ∈ Rm , sk ∈ Rm k = 1, . . . , K ,
ḡ K (x; ∞) = (19)

⎪ [−g]∗ (z k − yk ) + σ S (yk ) − z kT dk +  z k ∗ ≤ sk

k = 1, . . . , K .

123
Mean robust optimization

Then, lim p→∞ ḡ K (x; p) = ḡ K (x; ∞) for any x ∈ X .


Proof First, from Equation (5) we have for any p > 1 that

ḡ K (x; p)
⎧ K



minimize k=1 wk sk
subject to λ ≥ 0, z k ∈ Rm , yk ∈ Rm , sk ∈ Rm k = 1, . . . , K ,
=

⎪ [−g]∗ (z k − yk ) + σ S (yk ) − z kT dk
⎩ q
+φ(q)λ z k /λ ∗ + λ p ≤ sk k = 1, . . . , K
⎧ K



minimize k=1 wk sk
subject to λk ≥ 0, z k ∈ Rm , yk ∈ Rm , sk ∈ Rm k = 1, . . . , K ,


⎪ [−g]∗ (z k − yk ) + σ S (yk ) − z kT dk
⎩ q
+φ(q)λk z k /λk ∗ + λk  p ≤ sk k = 1, . . . , K
⎧ K



minimize k=1 wk sk
subject to z k ∈ Rm , yk ∈ Rm , sk ∈ Rm k = 1, . . . , K ,
=

⎪ [−g]∗ (z k − yk ) + σ S (yk ) − z kT dk

+ z k ∗ ≤ sk k = 1, . . . , K

where the first equality is established in Appendix A and the second equality follows
from Lemma 1. Remark that the inequality in the second step simply follows as we
introduce λk and do not impose that λk = λ for all k = 1, . . . , K . Hence, considering
the limit for p tending to infinity gives us now lim inf p→∞ ḡ K (x; p) ≥ ḡ K (x; ∞). It
remains to prove the reverse lim sup p→∞ ḡ K (x; p) ≤ ḡ K (x; ∞).
Second, we have for any p > 1 with 1/ p + 1/q = 1 that

ḡ K (x; p)


⎪ minimize K
k=1 wk sk



⎪ subject to z k ∈ Rm , yk ∈ Rm , sk ∈ Rm k = 1, . . . , K ,



⎨ [−g]∗ (z k − yk ) + σ S (yk ) − z kT dk + φ(q)
 1−q
≤ 1
q−1 1−q
 K q

⎪ max 
k =1 z k  ∗ zk ∗

⎪ 
q


⎪ + q−1
1
 1−q maxkK =1 z k  ∗  p ≤ sk k = 1, . . . , K




q
(q − 1)1/4 ≤ z k ∗ ≤ (q − 1)−1/4 k = 1, . . . , K ,
 1 
q−1 1−q
which follows from the choice λk = q  maxkK =1 zk ∗ and by imposing the
restrictions (q − 1)1/4 ≤ zk ∗ ≤ (q − 1)−1/4 for all k = 1, . . . , K . Next, using
 1−q  1−q
the identities φ(q) q−1
q = (q − 1)(q−1) /q q q−1
q = 1/q and p + 1−q 1
=
−q 1−q
1
1 + 1
= 1
+ 1
= −q+1 + 1
= = 1, as well as pulling  z k ∗ >0
p
1−q 1− q1 1−q 1−q 1−q
out of the last two terms, we can rewrite the first constraints as

⎨ [−g]∗ (z k − yk) + σ S (yk ) − z kT dk 
 1−q
z  zk
⎩ + z k ∗ q1 maxkK =1 zk ∗ + q−1 K
q maxk  =1 z k

≤ sk k = 1, . . . , K .
k ∗ ∗

123
I. Wang et al.

Then, we note that maxkK =1 z k  ∗ / z k ∗ ≥ z k ∗ / z k ∗ = 1 and maxkK =1 z k  ∗/


z k ∗ ≤ (q − 1)−1/2 , and hence we can apply Lemma 2 to obtain

⎧ K



minimize k=1 wk sk

⎨ subject to z k ∈ R , yk ∈ Rm , sk ∈ Rm k = 1, . . . , K ,
m

ḡ (x; p) ≤
K
[−g]∗ (z k − yk ) + σ S (yk ) − z kT dk



⎪ + z k ∗ D(q) ≤ sk k = 1, . . . , K ,

(q − 1)1/4 ≤ z k ∗ ≤ (q − 1)−1/4 k = 1, . . . , K .

Let
⎧ K

⎪ minimize k=1 wk sk



⎨ subject to z k ∈ R m
, yk ∈ Rm , sk ∈ Rm k = 1, . . . , K ,

[−g] (z k − yk ) + σ S(yk ) − z kT dk
ḡu (x; p) =
K
(20)

⎪ p

⎪ + z k ∗ D p−1 ≤ sk k = 1, . . . , K


( p − 1)−1/4 ≤ z k ∗ ≤ ( p − 1)1/4 k = 1, . . . , K .

Hence, as q = p/( p −1) and q −1 = 1/( p −1) we have ḡ K (x; p) ≤ ḡuK (x; p) for all
p > 1. Hence, taking the limit p → ∞ we have ḡ K (x; ∞) ≤ lim sup p→∞ ḡ K (x; p).
 
p
In fact, as the function D p−1 defined in Lemma 2 is nonincreasing for all p suf-
ficiently large this implies that ḡuK (x; p) is nonincreasing for p sufficiently large
and hence we have ḡ K (x; ∞) ≤ lim sup p→∞ ḡ K (x; p) ≤ lim inf p→∞ ḡuK (x; p) =
lim p→∞ ḡuK (x; p). We now prove here that lim p→∞ ḡuK (x; p) = lim inf p→∞ ḡuK (x;
p) ≤ ḡ K (x; ∞). Consider ∗ t
 t  any feasible sequence {(z k , yk , sk = [−g] (z k K− yk ) +
t t t t
t t T  
σ S (yk )−(z k ) dk + z k ∗ )}t≥1 in the optimization problem characterizing ḡ (x; ∞)
K
in Equation (19) so that limt→∞ k=1 wk skt = ḡ K (x; ∞). Let z̃ kt ∈ arg max{ z ∗ |
z ∈ R , z − z k ∗ ≤ 1/t} for all t ≥ 1 and k = 1, . . . , K and observe that
m t

z̃ kt ∗ = 1/t + z kt ∗ ≥ 1/t. Consider now an increasing sequence { pt }t≥1 so that


( pt −1)1/4 ≥ maxk=1 K z̃ kt ∗ and ( pt −1)−1/4 ≤ 1/t. Finally observe that the auxiliary
∗ t
 t  {(z̃ k , ỹk = yk + (z̃ k − z k ), s̃k = [−g] (z̃ k − ỹk ) + σ S ( ỹk ) − (z̃ k ) dk +
sequence t t t t t t t t t T
 
 z̃ k ∗ D ( pt /( pt − 1)))}t≥1 is by construction feasible in the minimization problem
characterizing the function ḡuK (x; pt ) in Equation (20). Hence, finally, we have


K
lim guK (x; p) = lim guK (x; pt ) = lim wk s̃kt
p→∞ t→∞ t→∞
k=1

K    
= lim wk [−g]∗ (z̃ kt − ỹkt ) + σ S ( ỹkt ) − (z̃ kt )T dk +  z̃ kt ∗ D ( pt /( pt − 1))
t→∞
k=1

K    
≤ lim wk [−g]∗ (z kt − ykt ) + σ S (ykt ) − (z kt )T dk +  z kt ∗ D ( pt /( pt − 1))
t→∞
k=1

123
Mean robust optimization


K  
+ wk max s + dk +  D ( pt /( pt − 1)) /t
s∈S
k=1

K    
≤ lim wk [−g]∗ (z kt − ykt ) + σ S (ykt ) − (z kt )T dk +  z kt ∗ D ( pt /( pt − 1))
t→∞
k=1

K
= lim wk ([−g]∗ (z kt − ykt ) + σ S (ykt ) − (z kt )T dk
t→∞
k=1
   
+  z kt ∗ +  z kt ∗ (D ( pt /( pt − 1)) − 1))

K
≤ lim wk ([−g]∗ (z kt − ykt ) + σ S (ykt ) − (z kt )T dk
t→∞
k=1
 
+  z kt ∗ + ( pt − 1)1/4 (D ( pt /( pt − 1)) − 1))

K
≤ lim wk sk = ḡ K (x; ∞).
t→∞
k=1

To establish the third inequality observe first that −(z̃ kt )T dk = −(z kt )T dk − (z̃ kt −
z kt )T dk ≤ −(z kt )T dk + z̃ kt − z kt ∗ dk ≤ −(z kt )T dk + dk /t. Second, remark that
we have

σ S ( ỹkt ) = σ S (ykt + (z̃ kt − z kt )) ≤ max s T (ykt + (z̃ kt − z kt ))


s∈S
≤ max s T ykt + max s T (z̃ kt − z kt )
s∈S s∈S
≤max s yk + s z̃ kt − z kt ∗ ≤ max s T ykt + 1/t max s ,
T t
s∈S s∈S s∈S

 t z̃ t − zt t ≤ 1/t. Lemma1/4


as 2 guarantees that limt→∞ D ( pt /( pt − 1)) = 1. Finally,
z  ≤ z̃  ≤ ( pt − 1) and
k ∗ k ∗

lim ( pt − 1)1/4 (D ( pt /( pt − 1)) − 1) = lim ( p − 1)1/4 (D ( p/( p − 1)) − 1)


t→∞ p→∞

= lim (q − 1)−1/4 (D(q) − 1) = 0


q→1

with 1/ p + 1/q = 1 using again Lemma 2. 




Lemma 1 We have
q
min φ(q)λ z/λ ∗ + λ p = z ∗ 
λ≥0

for any p > 1 and q > 1 for which 1/ p + 1/q = 1, φ(q) = (q − 1)q−1 /q q and
 > 0.
q
Proof Remark that as the objective function λ → φ(q)λ z/λ ∗ + λ p is continuous
q q
and we have limλ→0 φ(q)λ z/λ ∗ + λ p = limλ→∞ φ(q)λ z/λ ∗ + λ p = ∞ as

123
I. Wang et al.

 > 0 there must exist a minimizer λ ∈ minλ≥0 φ(q)λ z/λ ∗ +λ p with λ > 0. The
q

necessary and sufficient first-order convex optimality conditions of the minimization


problem guarantee
−q
λ ∈ min φ(q)λ z/λ
q q
∗ + λ p ⇐⇒ (1 − q)φ(q)λ z  + p = 0
λ≥0
−q
⇐⇒  p = (q − 1)φ(q)λ z  ⇐⇒ λ = [(q − 1)φ(q)]1/q z   − p/q
q

q − 1 1−q
1
⇐⇒ λ =  z 
q

where we exploit that 1/ p + 1/q = 1 and φ(q) = (q − 1)q−1 /q q . Indeed, we have


 1/q
[(q − 1)φ(q)]1/q = (q − 1)q /q q = (q − 1)/q,
p 1 1 1 1
− =−1 =− =− = .
q pq
(1 − 1/q)q q −1 1−q

Hence, we have
q 1−q q
min φ(q)λ1−q z  + λ p = φ(q)λ z  + λ  p
λ≥0
   
(q − 1)1−q 1−q q q − 1 1−q
1
= φ(q)  z  z  +  z  p
q 1−q q
(q − 1)1−q q − 1 p+ 1−q
1
= φ(q) 1−q
 z +  z 
q q
(q − 1)1−q q −1
= φ(q)  z +  z 
q 1−q q
(q − 1)q−1 (q − 1)1−q q −1 1 q −1
=  z +  z =  z  +  z 
qq q 1−q q q q
 
1 q −1
= +  z = z 
q q

where we exploit that 1/ p + 1/q = 1 and φ(q) = (q − 1)q−1 /q q . Indeed, we have

1 1 1 1 1 −q 1 1−q
p+ = 1 + = + = + = =1
1−q p
1−q 1− 1
q
1−q −q + 1 1 − q 1−q

establishing the claim. 



Lemma 2 Let q > 1 then
 √ 
1 1−q q − 1 1 1 q −1
max
√ t + t = D(q) = max 1, +
t∈[1,1/ q−1] q q q (q − 1)(1−q)/2 q

with limq→1 D(q) = 1 and limq→1 (q − 1)1/4 (D(q) − 1) = 0.

123
Mean robust optimization

Proof The objective function is convex in t. Convex functions attain their maximum
on the extreme points of their domain. The limits can be verified using standard
manipulations. 


D Proof of Theorem 4

We prove (i) ḡ N (x) ≤ ḡ K (x), (ii) ḡ K (x) ≤ ḡ N ∗ (x) + (L/2)D(K ), and (iii) when the
support constraint does not affect the worst-case value, ḡ K (x) ≤ ḡ N (x) + (L/2)D(K ).
Proof of (i). We begin with a feasible solution v1 , . . . , v N of (MRO-K) with K = N .
Then for K < N , we set u k = i∈Ck vi /|Ck | for each of the K clusters. We see u k
with k = 1, . . . , K satisfies the constraints of (MRO-K) for K < N , as

 

K
|Ck |   
K
|Ck |  vi di  p
u k − d̄k  p =  i∈Ck

i∈Ck 
N N  |Ck | |Ck | 
k=1 k=1

K
|Ck |  1
≤ vi − di p
N |Ck |
k=1 i∈Ck


K
1 
= vi − di p
≤  p,
N
k=1 i∈Ck

where we have applied triangle inequality, Jensen’s inequality for the convex function
f (x) = x p , and the constraint of (MRO-K) with K = N . In addition, since the
support S is convex, for every k our constructed u k , as the average of select points vi
∈ S, must also be within S. The same applies with respect to the domain of g.
Since we have shown that the u k ’s satisfies the constraints for (MRO-K) with
K < N , it is a feasible solution. We now show that for this pair of feasible solutions,
in terms of the objective value, ḡ K (x) ≥ ḡ N (x). By assumption, g is concave in the
uncertain parameter, so by Jensen’s inequality,
⎛ ⎞

K
|Ck | 1  
K
|Ck | 1 
g⎝ vi , x ⎠ ≥ g(vi )
N |Ck | N |Ck |
k=1 i∈Ck k=1 i∈Ck


K
|Ck | 1 
g(u k , x) ≥ g(vi ).
N N
k=1 i∈N

Since this holds true for u k ’s constructed from any feasible solution vi , . . . , v N , we
must have ḡ K (x) ≥ ḡ N (x).
Proof of (ii). Next, we prove ḡ K (x) ≤ ḡ N ∗ (x) + (L/2)D(K ) by making use of
the L-smooth condition on −g. We first solve (MRO-K) with K < N to obtain a
feasible solution u 1 , . . . , u k . We then set k = u k − d̄k for each k ≤ K , and set
vi = di + k ∀i ∈ Ck , k = 1, . . . , K . These satisfy the constraint of (MRO-K) with

123
I. Wang et al.

K = N and no support constraints (which we will refer to as K = N ∗ from here on),


as

1  1  
N K K
|Ck |
vi − di p
= k p
= u k − d̄k p
≤  p.
N N N
i=1 k=1 i∈Ck k=1

Since the constraints are satisfied, the constructed vi . . . v N are a valid solution
for (MRO-K) with K = N ∗ . We note that these vi ’s are also in the domain of g,
given that the uncertain data D N is in the domain of g. For monotonically increasing
functions g, (e.g., log(u), 1/(1 + u)), we must have k = u k − d¯k ≥ 0 in the solu-
tion of (MRO-K), as the maximization of g over u k will lead to u k ≥ d¯k . Therefore,
vi = di + k is also in the domain, as the L-smooth and concave function g with
only a potential lower bound will not have holes in its domain above the lower bound.
For monotonically decreasing functions g, the same logic applies with a nonpositive
k . We now make use of the convex and L-smooth conditions [2, Theorem 5.8] on
−g : ∀v1 , v2 ∈ S, λ ∈ [0, 1],

L
g(λv1 + (1 − λ)v2 ) ≤ λg(v1 ) + (1 − λ)g(v2 ) + λ(1 − λ) v1 − v2 22 ,
2

which, we can apply iteratively, with the first iteration being

 
1 |Ck | − 1 1 |Ck | − 1
g v1 + v̄2 ≤ g(v1 ) + g(v̄2 )
|Ck | |Ck | |Ck | |Ck |
L 1 |Ck | − 1
+ v1 − v̄2 22 ,
2 |Ck | |Ck |

where v̄2 = |Ck1|−1 i∈Ck ,i=1 vi . Note that v1 − v̄2 = d1 − |Ck1|−1 i∈Ck ,i=1 di , as
they share the same k . The next iteration will be applied to g(v̄2 ), and so on. For
each cluster k, this results in:

⎛ ⎞  2
  |Ck |
 −  i−1 
1 1 L i 1  j=1 d j

g⎝ vi , x ⎠ ≤ g(vi , x) + di − 
|Ck | |Ck | 2|Ck | i  i −1 
i∈Ck i∈Ck i=2 2
1  L 
g(d̄k + k , x) ≤ g(vi , x) + di − d̄k 2
2
|Ck | 2|Ck |
i∈Ck i∈Ck
1  L 
g(u k , x) ≤ g(vi , x) + di − d̄k 2,
2
|Ck | 2|Ck |
i∈Ck i∈Ck

123
Mean robust optimization

 2
|Ck |  i−1 
where we used the equivalence i=2 ((i − 1)/i) di − j=1 d j /(i − 1) =
2
i∈Ck di − d̄k 2.
2 Now, summing over all clusters, we have


K 
N
(|Ck |/N )g(u k , x) ≤ 1/N g(vi , x) + (L/2)D(K ).
k=1 i=1

Since this holds for any feasible solution of (MRO-K) with K < N , we must have
ḡ K (x) ≤ ḡ N ∗ (x) + (L/2)D(K ).

E Proof of Theorem 5

Proof of the lower bound. We use the dual formulations of the MRO constraints. For
the case p ≥ 1, we first solve (MRO-K-Dual) with K < N to obtain dual variables
z jk , γ jk . For each data label i in cluster Ck , for all clusters k = 1, . . . , K , and for all
pieces j = 1, . . . , J , if we set

λ = λ, z ji = z jk , γ ji = γ jk , si = sk + max{(−z jk − H T γ jk )T (di − d̄k )},


j

we have obtained a valid solution for (MRO-K-Dual) with K = N . The increase in


the objective value from ḡ K (x) to that of ḡ N (x), i.e. ḡ N (x) − ḡ K (x), is


N 
K
δ(K , z, γ ) = (1/N ) si − (|Ck |/N )sk
i=1 k=1

K 
= (1/N ) max{(−z jk − H T γ jk )T (di − d̄k )}.
j
k=1 i∈Ck

K
We note that δ(K , z, γ ) ≥ (|Ck |/N ) k=1 (−z 1k − H T γ1k )T (1/|Ck |) i∈Ck (di −
K
d̄k ) = (|Ck |/N ) k=1 (−z 1k − H T γ1k )T 0 = 0. The constructed feasible solution
for (MRO-K-Dual) with K = N is an upper bound for its optimal solution, since it is
a minimization problem. We then have ḡ N (x)− ḡ K (x) ≤ δ(K , z, γ ), which translates
to ḡ N (x) − δ(K , z, γ ) ≤ ḡ K (x).
Now, for p = ∞, the same procedure can be applied; we obtain a solu-
tion to (MRO-K-Dual-∞) with K < N , and construct a feasible solution
for (MRO-K-Dual-∞) with K = N . We modify the variables in the same man-
ner, with the exception of λ, which is now set by λi = λk . The definition of δ(K , z, γ )
remains unchanged, so the same result follows.
Proof of the upper bound. We use the primal formulations of the MRO constraints.
Similar to the single-concave proof, for p ≥ 1, we first solve the MRO problem with
K clusters to obtain a feasible solution u 11 , . . . , u J K , α11 , . . . , α J K . Note that the
existence of α removes the need to analyze which function g j attains the maximum

123
I. Wang et al.

for each cluster. We then set  jk = u jk − d̄k for each k ≤ K , and set v ji = di +  jk ,
α ji = α jk /|Ck | ∀i ∈ Ck , k = 1, . . . , K . These satisfy the constraint of the problem
with N clusters and without support constraints, as


N 
J 
K 
J  
K 
J
α ji v ji − di p
= α ji  jk p
= α jk u jk − d̄k p
≤  p.
i=1 j=1 k=1 j=1 i∈Ck k=1 j=1

For p = ∞, we repeat the process of obtaining and modifying a feasible solution, and
observe, for i = 1, . . . , N ,


J 
J
(α ji /(1/N )) v ji − di = (α jk /(|Ck |/N ))  jk
j=1 j=1


J
= (α jk /wk ) u jk − d̄k ≤ .
j=1

Therefore, we also have constraint satisfaction for p = ∞. The constructed solutions


for both cases remain in domu g, following the arguments in the proof of (ii) in
Appendix D.
Next, the objective functions for p ≥ 1 and p = ∞ are identical, so the same
analysis applies. For each cluster k and function g j , using the L-smooth condition on
−g j , we observe
⎛ ⎞ ⎛
1   1
α jk g j ⎝ v ji , x ⎠ ≤ α jk ⎝ g j (v ji , x)
|Ck | |Ck |
i∈Ck i∈Ck
 2 ⎞
i −1 
|Ck |
Lj 
i−1
 j=1 d j
 ⎠
+ di − 
2|Ck | i  i −1 
i=2 2
 α ji L j 
α jk g j (d̄k + k , x) ≤ α ji g j (v ji , x) + di − d̄k 2.
2
2
i∈Ck i∈Ck

Then, summing over all the clusters and functions, we have


K 
J 
K 
J 
α jk g j (u jk , x) ≤ α ji g j (v ji , x)
k=1 j=1 k=1 j=1 i∈Ck


K 
J
α ji max j≤J L j 
+ di − d̄k 2
2
2
k=1 j=1 i∈Ck


N 
J 
N
≤ α ji g(v ji , x) + max(L j /2N ) di − d̄k 2.
2
j≤J
i=1 j=1 i=1

123
Mean robust optimization

Since this holds for all feasible solutions of the problem with K clusters, we conclude
that

ḡ K (x) ≤ ḡ N ∗ (x) + max(L j /2)D(K ).


j≤J

Acknowledgements The simulations presented in this article were performed on computational resources
managed and supported by Princeton Research Computing, a consortium of groups including the Princeton
Institute for Computational Science and Engineering (PICSciE) and the Office of Information Technology’s
High Performance Computing Center and Visualization Laboratory at Princeton University. We would like
to thank Daniel Kuhn for the useful feedback and for pointing us to the literature on scenario reduction
techniques.

Declarations
Conflict of interest The authors have no relevant financial or non-financial interests to disclose.

References
1. Bandi, C., Bertsimas, D.: Tractable stochastic analysis in high dimensions via robust optimization.
Math. Program. 134(1), 23–70 (2012)
2. Beck, A.: First-Order Methods in Optimization. SIAM-Society for Industrial and Applied Mathematics,
Philadelphia (2017)
3. Ben-Tal, A., El Ghaoui, L., Nemirovski, A.: Robust Optimization. Princeton University Press, Princeton
(2009)
4. Ben-Tal, A., den Hertog, D., Vial, J.P.: Deriving robust counterparts of nonlinear uncertain inequalities.
Math. Program. 149(1–2), 265–299 (2015)
5. Ben-Tal, A., Nemirovski, A.: Robust solutions of Linear Programming problems contaminated with
uncertain data. Math. Program. 88(3), 411–424 (2000)
6. Ben-Tal, A., Nemirovski, A.: Selected topics in robust convex optimization. Math. Program. 112,
125–158 (2008)
7. Beraldi, P., Bruni, M.E.: A clustering approach for scenario tree reduction: an application to a stochastic
programming portfolio optimization problem. TOP 22(3), 934–949 (2014)
8. Bertsimas, D., Brown, D.B., Caramanis, C.: Theory and applications of robust optimization. SIAM
Rev. 53(3), 464–501 (2011)
9. Bertsimas, D., den Hertog, D., Pauphilet, J.: Probabilistic guarantees in robust optimization. SIAM J.
Optim. 31(4), 2893–2920 (2021)
10. Bertsimas, D., Gupta, V., Kallus, N.: Data-driven robust optimization. Math. Program. 167(2), 235–292
(2018)
11. Bertsimas, D., den Hertog, D.: Robust and Adaptive Optimization. Dynamic Ideas, Belmont (2022)
12. Bertsimas, D., Mundru, N.: Optimization-based scenario reduction for data-driven two-stage stochastic
optimization. Oper. Res. (2022)
13. Bertsimas, D., Sim, M.: The price of robustness. Oper. Res. 52(1), 35–53 (2004)
14. Bertsimas, D., Sturt, B., Shtern, S.: A data-driven approach to multistage stochastic linear optimization.
Manag. Sci. (2022)
15. Bradley, B.O., Taqqu, M.S.: Financial Risk and Heavy Tails, Handbooks in Finance, vol. 1. North-
Holland, Amsterdam (2003)
16. Carmona, R.A.: Rsafd: Statistical Analysis of Financial Data in R (2020). R package version 1.2
17. Chen, L.: Clustering of sample average approximation for stochastic program (2015)
18. Chen, R., Paschalidis, I.: Distributionally robust learning. Found. Trends Optim. 4(1–2), 1–243 (2020)
19. Chen, Z., Sim, M., Xiong, P.: Robust stochastic optimization made easy with RSOME. Manag. Sci.
66(8), 3329–3339 (2020)

123
I. Wang et al.

20. Clement, P., Desch, W.: An elementary proof of the triangle inequality for the Wasserstein metric.
Proc. Am. Math. Soc. 136(1), 333–339 (2008)
21. Delage, E., Ye, Y.: Distributionally robust optimization under moment uncertainty with application to
data-driven problems. Oper. Res. 58(3), 595–612 (2010)
22. Dupačová, J., Gröwe-Kuska, N., Römisch, W.: Scenario reduction in stochastic programming. Math.
Program. 95(3), 493–511 (2003)
23. Emelogu, A., Chowdhury, S., Marufuzzaman, M., Bian, L., Eksioglu, B.: An enhanced sample average
approximation method for stochastic optimization. Int. J. Prod. Econ. 182, 230–252 (2016)
24. Esfahani, P.M., Kuhn, D.: Data-driven distributionally robust optimization using the Wasserstein met-
ric: performance guarantees and tractable reformulations. Math. Program. 171, 115–166 (2018)
25. Esteban, P.A., Morales, J.M.: Partition-based distributionally robust optimization via optimal transport
with order cone constraints. 4OR 20(3), 465–497 (2022)
26. Fabiani, F., Goulart, P.: The optimal transport paradigm enables data compression in data-driven robust
control. In: 2021 American Control Conference (ACC), pp. 2412–2417 (2021)
27. Fournier, N., Guillin, A.: On the rate of convergence in Wasserstein distance of the empirical measure.
Probab. Theory Relat. Fields 162(3), 707–738 (2015)
28. Gao, R.: Finite-sample guarantees for Wasserstein distributionally robust optimization: Breaking the
curse of dimensionality. CoRR (2020)
29. Gao, R., Kleywegt, A.: Distributionally robust stochastic optimization with Wasserstein distance. Math.
Oper. Res. 48, 603–655 (2023)
30. Givens, C.R., Shortt, R.M.: A class of Wasserstein metrics for probability distributions. Mich. Math.
J. 31(2), 231–240 (1984)
31. Goh, J., Sim, M.: Distributionally robust optimization and its tractable approximations. Oper. Res.
58(4–part–1), 902–917 (2010)
32. Hartigan, J.A., Wong, M.A.: Algorithm AS 136: A k-means clustering algorithm. J. R. Stat. Soc. Ser.
C (Appl. Stat.) 28(1), 100–108 (1979)
33. Holmberg, K., Rönnqvist, M., Yuan, D.: An exact algorithm for the capacitated facility location prob-
lems with single sourcing. Eur. J. Oper. Res. 113(3), 544–559 (1999)
34. Jacobson, D., Hassan, M., Dong, Z.S.: Exploring the effect of clustering algorithms on sample average
approximation. In: 2021 Institute of Industrial and Systems Engineers (IISE) Annual Conference &
Expo (2021)
35. Kuhn, D., Esfahani, P.M., Nguyen, V., Shafieezadeh-Abadeh, S.: Wasserstein Distributionally Robust
Optimization: Theory and Applications in Machine Learning, pp. 130–166 (2019)
36. Liu, Y., Yuan, X., Zhang, J.: Discrete approximation scheme in distributionally robust optimization.
Numer. Math. Theory Methods Appl. 14(2), 285–320 (2021)
37. MOSEK ApS: The MOSEK optimization toolbox. Version 9.3. (2022)
38. Neumann, J.V.: Zur theorie der gesellschaftsspiele. Math. Ann. 100, 295–320 (1928)
39. Perakis, G., Sim, M., Tang, Q., Xiong, P.: Robust pricing and production with information partitioning
and adaptation. Manag. Sci. (2023)
40. Rockafellar, R.T., Uryasev, S.: Conditional value-at-risk for general loss distributions. J. Bank. Finance
26(7), 1443–1471 (2002)
41. Rockafellar, R.T., Wets, R.J.: Variational analysis. Grundlehren der mathematischen Wissenschaften
(1998)
42. Roos, E., den Hertog, D.: Reducing conservatism in robust optimization. INFORMS J. Comput. 32(4),
1109–1127 (2020)
43. Rujeerapaiboon, N., Schindler, K., Kuhn, D., Wiesemann, W.: Scenario reduction revisited: funda-
mental limits and guarantees. Math. Program. 191(1), 207–242 (2022)
44. Thorndike, R.: Who belongs in the family? Psychometrika 18(4), 267–276 (1953)
45. Trillos, N., Slepčev, D.: On the rate of convergence of empirical measures in ∞-transportation distance.
Can. J. Math. 67, 1358–1383 (2014)
46. Uryasev, S., Rockafellar, R.T.: Conditional Value-at-Risk: Optimization Approach. Springer, New York
(2001)
47. Wang, I., Becker, C., Van Parys, B., Stellato, B.: Mean robust optimization. arXiv (2023)
48. Wang, Z., Wang, P., Ye, Y.: Likelihood robust optimization for data-driven problems. CMS 13(2),
241–261 (2016)
49. Wiesemann, W., Kuhn, D., Sim, M.: Distributionally robust convex optimization. Oper. Res. 62(6),
1358–1376 (2014)

123
Mean robust optimization

50. Xu, H., Caramanis, C., Mannor, S.: A distributional interpretation of robust optimization. Math. Oper.
Res. 37(1), 95–110 (2012)
51. Zhen, J., Kuhn, D., Wiesemann, W.: A unified theory of robust and distributionally robust optimization
via the primal-worst-equals-dual-best principle (2021). https://doi.org/10.1287/opre.2021.0268
52. Zymler, S., Kuhn, D., Rustem, B.: Distributionally robust joint chance constraints with second-order
moment information. Math. Program. 137(1), 167–198 (2013)

Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps
and institutional affiliations.

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under
a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted
manuscript version of this article is solely governed by the terms of such publishing agreement and applicable
law.

123

You might also like