
Differentiability and Approximation of Probability Functions under

Gaussian Mixture Models: A Bayesian Approach

Gonzalo Contador · Pedro Pérez-Aros · Emilio Vilches


arXiv:2411.02721v1 [[Link]] 5 Nov 2024

Received: date / Accepted: date

Abstract In this work, we study probability functions associated with Gaussian mixture models. Our
primary focus is on extending the use of spherical radial decomposition for multivariate Gaussian random
vectors to the context of Gaussian mixture models, which are not inherently spherical but only condition-
ally so. Specifically, the conditional probability distribution, given a random parameter of the random
vector, follows a Gaussian distribution, allowing us to apply Bayesian analysis tools to the probability
function. This assumption, together with spherical radial decomposition for Gaussian random vectors,
enables us to represent the probability function as an integral over the Euclidean sphere. Using this
representation, we establish sufficient conditions to ensure the differentiability of the probability function
and provide an integral representation of its gradient. Furthermore, leveraging the Bayesian decomposi-
tion, we approximate the probability function using random sampling over the parameter space and the
Euclidean sphere. Finally, we present numerical examples that illustrate the advantages of this approach
over classical approximations based on random vector sampling.

Keywords stochastic programming · chance constraint · probability functions · Gaussian mixture models · Bayesian statistics

The first author was partially supported by ANID-Chile grant Exploración 13220097. The second and third authors were
partially supported by Centro de Modelamiento Matemático (CMM), ACE210010 and FB210005, BASAL funds for center
of excellence, as well as ANID-Chile grants MATH-AMSUD 23-MATH-17, ECOS-ANID ECOS320027, Fondecyt Regular
1220886, Fondecyt Regular 1240120, and Exploración 13220097. Additionally, the second author was supported by ANID-
Chile grants MATH-AMSUD 23-MATH-09 and Fondecyt Regular 1240335.

Gonzalo Contador
Universidad Técnica Federico Santa Marı́a, Santiago, Chile. E-mail: [Link]@[Link]
Pedro Pérez-Aros
Departamento de Ingenierı́a Matemática and Centro de Modelamiento Matemático (CNRS UMI 2807), Universidad de
Chile, Santiago, Chile. E-mail: pperez@[Link]
Emilio Vilches
Instituto de Ciencias de la Ingenierı́a, Universidad de O’Higgins, Rancagua, Chile.
Centro de Modelamiento Matemático (CNRS UMI 2807), Santiago, Chile. E-mail: [Link]@[Link]

Mathematics Subject Classification (2000) Primary: 90C15, 65K10 · Secondary: 62F15

1 Introduction

A chance-constrained optimization problem is a mathematical program of the form:

min f (x) s.t. P (g(x, ξ) ≤ 0) ≥ p, (1)

where f : Rn → R is the objective function, ξ is an m-dimensional random vector defined on a probability space, g : Rn × Rm → R represents an inequality constraint, and p ∈ (0, 1) is a reliability parameter.
In this context, a vector x ∈ Rn is feasible for the optimization problem (1) if and only if the random
inequality g(x, ξ) ≤ 0 is satisfied with probability of at least p. For classical monographs on optimization
problems with probabilistic constraints, we refer to [16, 21].
To efficiently compute numerical solutions for chance-constrained optimization problems, access to both
the values and gradients of the probability function, denoted as Φ(x) := P (g(x, ξ) ≤ 0), is crucial. A
typical procedure for computing this function involves sampling the random vector ξ using standard
Monte Carlo techniques and then approximating the actual probability value through a sample average.
The Law of Large Numbers guarantees that this approach converges towards the real probability as the
number of sampled values of ξ approaches infinity. However, this approach presents notable challenges,
especially when the random inequality g(x, ξ) ≤ 0 has a nonlinear structure with respect to the variable
ξ. Such a sampling procedure introduces inaccuracies in the probability computations, complicating the
solution process for the optimization problem. This complexity arises from the need not only to compute
probability values but also to determine gradients of probability functions across multiple iterations of
various optimization methods, ultimately making this approach impractical for addressing problem (1).
Hence, there is a clear need for an efficient procedure to compute both the values and gradients of prob-
ability functions. In this context, some authors have utilized the so-called spherical radial decomposition
as an effective method for computing probability function values. This decomposition has proven to be
especially promising for studying first-order variational information of the probability function. Specifically, it provides insights into gradients when the probability function is smooth and subgradients in the non-differentiable case. Most works related to this approach involve the analysis of elliptically symmetric
distributions, such as multivariate Gaussian distributions (see, e.g., [23–27] and the references therein).
Nevertheless, various random phenomena exist for which elliptically symmetric distributions are not an adequate model. Our motivation here is to provide a framework for solving chance-constrained problems in which ξ does not exhibit an inherently spherical structure as described above but does so conditionally,
based on the realization of a latent variable, as in the case of Gaussian mixture models. This framework
leverages known results from the spherical case and extends them to the conditionally Gaussian scenario.
Gaussian mixture models are probabilistic models used to represent normally distributed subpopulations
within an overall population [12]. A Gaussian mixture model with K components is parameterized by a distribution of nonnegative mixture component weights c and each component's mean and covariance matrix, with density given by h(ξ) = Σ_{i=1}^{K} ci hi(ξ). Here, the weights satisfy Σ_{i=1}^{K} ci = 1 and can be viewed as either learned or as an a priori distribution over components, where ci = P(ξ comes from component i) [18].
The function hi represents the density of a Gaussian vector with mean µi and covariance matrix Σi .

Extensions of this model to a combination of uncountably many Gaussian distributions, weighted by a probability density, are well-established in Bayesian analysis [17].
A recent advancement in chance-constrained optimization is the introduction of a Bayesian framework that accounts for uncertainty in P by placing a prior distribution over the elements P ∈ P of a set of probability measures; this also allows practitioners to incorporate their beliefs about the overall random structure of the problem. Authors have employed specific conjugate structures to derive
closed-form posterior constraint distributions in optimization problems related to staffing [1], hydraulic
engineering [3], and portfolio allocation [10]. More recently, approximations to posterior distributions
have been developed and shown to be asymptotically optimal and consistent in a statistical sense [7].
Contributions by [9] and [22] have incorporated Bayesian analysis techniques in numerical optimization,
using observations of random phenomena ξ1 , . . . , ξn to approximate the true constraint distribution.
In most cases, however, the decision x (a 'here-and-now' decision) must be made before the random
phenomena are observed. For example, in [5], the design of a mechanical structure (encoded by x), which
is constructed once and permanently, must withstand future random forces ξ with high probability.
Our proposal leverages a Bayesian formulation of the uncertainty set to derive a unified radial probability
function. This approach transforms the original probability function into an expectation over conditional
probability functions. Unlike the worst-case scenario probability formulation (which maximizes over all
possible realizations of probability estimates without considering the observer’s ability to estimate the
likelihood of these conditional probabilities occurring; see, e.g., [24, 26]), our method facilitates the study
of the differentiability of the probability function by combining classical differentiation techniques under
integration with spherical radial decomposition. Additionally, it enables the estimation of the true prob-
ability function through empirical approximations based on sampling from the parameter set (Bayesian
approximation) and the unit sphere (spherical radial decomposition).
The paper is organized as follows. In Section 2, we introduce the main notation used throughout the work
and provide the mathematical formulation of our approach. In Section 3, we examine the differentiability
of the probability function under Gaussian mixture models and present an integral representation using
the spherical radial decomposition (see Theorem 1). Section 4 is dedicated to formally establishing that
an approximation based on an i.i.d. sample over the set of parameters and the m-dimensional unit sphere,
with ad hoc probability distributions, converges uniformly over compact sets under mild assumptions,
both in terms of function values and gradient information (see Theorem 2). In Section 5, we illustrate
our results with a numerical example. For clarity, the proofs of the main results are presented in the appendix.

2 Notation and Preliminaries

In this paper, we adopt the standard notation commonly used in variational analysis, optimization,
statistics, and probability theory. For more details and comprehensive coverage, we refer the reader to
the monographs [2, 14, 16, 20, 21].

2.1 Normal Distribution and Spherical Radial Decomposition

It is known that if ξ has a (multivariate) normal distribution ξ ∼ N(µ, Σ), then ξ admits a so-called spherical radial decomposition, given by

ξ = µ + RLΘ,

where R follows a χ-distribution with m degrees of freedom, Θ follows a uniform distribution ζΘ over the m-dimensional unit sphere S^{m−1} := {z ∈ Rm : Σ_{i=1}^m z_i² = 1}, and L is the (unique) Cholesky factor of Σ, such that LL⊤ = Σ.
Now, given a probability function φ(x) := P(g(x, ξ) ≤ 0) for ξ ∼ N(µ, Σ), the above decomposition allows us to rewrite φ as

φ(x) = ∫_{S^{m−1}} e(x, v) dζΘ(v),

where ζΘ is the uniform measure on S^{m−1}, and e(x, v) represents the probability of satisfying the constraint for a given x when Θ = v is fixed. Specifically,

e(x, v) = ∫_{{r ≥ 0 : g(x, µ + rLv) ≤ 0}} χ(r) dr, (2)

where χ is the density of a χ-distribution with m degrees of freedom, defined by

χ(r) := (2^{1−m/2} / Γ(m/2)) r^{m−1} exp(−r²/2). (3)
For further details, see [24, 25] and the references therein.
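For illustration, the following R sketch (ours, not part of the paper; the constraint g(x, z) = ∥z∥² − ∥x∥² − 2 anticipates the example of Section 5, and names such as e_xv are hypothetical) compares the spherical radial estimate of φ(x) with a naive Monte Carlo estimate for a bivariate Gaussian.

# Minimal sketch: spherical radial estimate of P(g(x, xi) <= 0) for
# xi ~ N(mu, Sigma) in R^2 with g(x, z) = ||z||^2 - (||x||^2 + 2).
set.seed(1)
mu    <- c(0, 0.5)
Sigma <- diag(2)
L     <- t(chol(Sigma))                  # lower triangular, L %*% t(L) = Sigma
x     <- 1.0                             # decision variable (scalar here)
b     <- x^2 + 2                         # threshold in g(x, z) = ||z||^2 - b

N     <- 1e4
theta <- runif(N, 0, 2 * pi)
V     <- cbind(cos(theta), sin(theta))   # i.i.d. uniform directions on S^1

# For a direction v, e(x, v) = P(R <= rho) with R ~ chi_2 (Rayleigh), where rho
# is the positive root of ||mu + r L v||^2 = b (here L is the identity).
e_xv <- function(v) {
  bq  <- sum(mu * v); cq <- sum(mu^2) - b
  rho <- -bq + sqrt(bq^2 - cq)           # positive root exists since ||mu||^2 < b
  1 - exp(-rho^2 / 2)                    # chi_2 cdf at rho
}
phi_srd <- mean(apply(V, 1, e_xv))

# Naive Monte Carlo estimate for comparison
xi        <- matrix(rnorm(2 * N), N, 2) %*% t(L) + matrix(mu, N, 2, byrow = TRUE)
phi_naive <- mean(rowSums(xi^2) <= b)
c(spherical = phi_srd, naive = phi_naive)

Both estimates agree, but the spherical radial one varies smoothly with x, which is the property exploited in Sections 3 and 4.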

2.2 Gaussian Mixture Models

A finite Gaussian mixture model in Rm can be written as a density p of the form

ξ ∼ p( · | µ1, . . . , µk, Σ1, . . . , Σk, w1, . . . , wk) = Σ_{i=1}^{k} wi N(µi, Σi),

where k ∈ N is a fixed integer, µ1, . . . , µk are the means and Σ1, . . . , Σk are the covariances of k Gaussian distributions, and w1, . . . , wk are the mixing proportions, which are positive and sum to 1. In this way, the model can be interpreted as ξ having a multivariate normal distribution with parameters µi and Σi with probability wi, for i = 1, . . . , k.
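As a small illustration (ours, assuming the mvtnorm package; the two components below are arbitrary), drawing from a finite mixture amounts to sampling a component label with probabilities w and then sampling from the selected Gaussian, while the density is the weighted sum of the component densities.

library(mvtnorm)                      # dmvnorm / rmvnorm

w     <- c(0.3, 0.7)
mu    <- list(c(0, -1), c(0, 1))
Sigma <- list(diag(2), 0.5 * diag(2))

rgmm <- function(n) {                 # sample n points from the mixture
  lab <- sample(seq_along(w), n, replace = TRUE, prob = w)
  t(sapply(lab, function(i) rmvnorm(1, mu[[i]], Sigma[[i]])))
}
dgmm <- function(z) {                 # mixture density at a point z
  sum(sapply(seq_along(w), function(i) w[i] * dmvnorm(z, mu[[i]], Sigma[[i]])))
}

xi <- rgmm(1000)
dgmm(c(0, 0))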
To obtain the infinite limit, the authors in [17] place priors on the (one-dimensional) component means and group variances, given Gaussian and Gamma distributions, respectively, and on the mixing proportion vector w = (w1, . . . , wk)⊤, given a Dirichlet prior. Subsequent work also incorporates uncertainty quantification
on the number of categories [11]. We will generalize this framework so that µ and Σ are selected according
to a density m over a subset of feasible means and covariance matrices, specifically,
ξ|µ, Σ ∼ N (µ, Σ) and (µ, Σ) ∼ m.
This formulation will be detailed in Section 2.3.

2.3 Mathematical Formulation and Spherical Radial Decomposition for Gaussian Mixture Models

Let us consider a probability space (Ω, S, P), a random vector (ξ, C) : Ω → Ξ × C ⊂ Rm+p , and a set
of conditional probability measures P := {Pc (A) = E(1A |C = c) : c ∈ C} over Ξ, where the parameter
space C ⊂ Rp is a Borel measurable set. We denote by m the measure induced by the second coordinate
of (ξ, C) over the Borel σ-algebra B(C), that is,

m(A) = P (ω ∈ Ω : C(ω) ∈ A) . (4)

Hence, the chance constraint in (1) can be written as

P(g(x, ξ) ≤ 0) = ∫_C Pc(g(x, ξ) ≤ 0) dm(c),

and the chance-constrained problem (1) is then rewritten as what we will refer to from now on as a Bayesian Chance Constraint (BCC) problem with confidence level p:

min f(x)  s.t.  P(g(x, ξ) ≤ 0) = ∫_C Pc(g(x, ξ) ≤ 0) dm(c) ≥ p.

We observe that the above formulation coincides with (1) when the distribution of ξ is fully specified,
i.e., when P = {P} and m = δP . In the context of an infinite Gaussian mixture model, we consider a
probability space (C, F, m) where the random variable C has probability distribution m as defined in (4).
In what follows, we consider measurable functions z : C → Rm and L : C → Rm×m such that

a) For each c ∈ C, conditional on the event C = c, the vector ξ ∼ N (z(c), Σ(c)), i.e.,

Pc(ξ ∈ A) := P(ξ ∈ A | C = c) = ∫_A hc(z) dz for all Borel sets A ⊂ Rm,

where

hc(z) := (1/√((2π)^m det Σ(c))) exp(−(1/2)(z − z(c))⊤ Σ^{−1}(c)(z − z(c))).

We refer to [23, 26] for more details.


b) There exists η0 > 0 such that

∥z(c)∥ + ∥L(c)∥ ≤ η0 for all c ∈ C, (5)

where L(c) is the Cholesky factor of Σ(c), i.e., Σ(c) := L(c)⊤L(c).

Hence, the expression hc (ξ) can be understood as the conditional density of ξ given C = c, which allows
us to define the conditional probability function as
φc(x) := Pc(g(x, ξ) ≤ 0) = P(g(x, ξ) ≤ 0 | C = c) = ∫_{{z : g(x, z) ≤ 0}} hc(z) dz. (6)

Condition a) guarantees that φc is well-defined for almost every c ∈ C (see, e.g., [4]), and from (6), it
follows that
P(g(x, ξ) ≤ 0) = ∫_C Pc(g(x, ξ) ≤ 0) dm(c) = E(φC(x)).

It is worth emphasizing that the probability defined in (6) can be interpreted as the probability of
satisfying the constraint, calculated after randomly selecting a Gaussian distribution according to m.
Applying the spherical radial decomposition on the event C = c for a Gaussian random variable ξ, as
described in Section 2.1, we obtain the following parameterized decomposition:

ξ|{C = c} = z(c) + RL(c)Θ.

Moreover, it is clear that Θ and R are independent, with Θ uniform over the sphere S^{m−1} and R following a χ-distribution with m degrees of freedom.
The conditional probability function (6) can be expressed as

φc(x) = ∫_{S^{m−1}} ec(x, v) dζΘ(v),

where ec := e(c, ·, ·) and e : C × S^{m−1} × Rn → R ∪ {+∞} is the parametric radial probability function in (2) on the event C = c; specifically,

ec(x, v) = ∫_{{r ≥ 0 : g(x, rL(c)v + z(c)) ≤ 0}} χ(r) dr. (7)

In what follows, we say that x̄ is a z-uniform Slater point if there exists an open neighborhood U of x̄
and γ0 > 0 such that
g(x, z(c)) < −γ0 , for all x ∈ U and c ∈ C. (8)

3 Gradient of Probability Functions under Gaussian Mixture Models

In this section, we study the following (unconditional) probability function:

Φ(x) := E(φC(x)) = ∫_C φc(x) dm(c), (9)

where φc(x) is defined in (6).


The following result, proven in the appendix, establishes that the above probability function is well-
defined.

Proposition 1 Assume that z and Σ are measurable functions. Then, the parametric radial probability
function e : C ×Sm−1 ×Rn → R∪{+∞}, defined in (7), is F ⊗B(Sm−1 )⊗B(Rn ) measurable. Furthermore,
for any (c, v) ∈ C × Sm−1 the function ec (·, v) is continuous at any z-uniform Slater point (8).

Before establishing the differentiability of the probability function defined in (9), we need to introduce a
condition that allows us to control the gradient of g locally at x̄, but globally on (z, c) ∈ Rm × C. In our
GMM setting, we consider the so-called exponential growth condition, first introduced in [24]. Without
this condition, differentiability may fail even for Gaussian distributions (see, e.g., [6, 24]).

Definition 1 (Exponential growth condition) We say that a function g satisfies the exponential
growth condition at x̄ if one can find constants ε, a, b > 0 such that
∥∇x g(x, z)∥ ≤ a exp(b∥z∥) for all (x, z) ∈ Bε (x̄) × Rm .

The following result proves the continuous differentiability of the probability function (9) and provides
an integral formula for its gradient.

Theorem 1 Let g : Rn × Rm → R be a continuously differentiable function satisfying the exponential growth condition at x̄. Assume that g is convex in the second argument and x̄ is a z-uniform Slater point as in (8). Then, the probability function Φ defined in (9) is continuously differentiable on a neighborhood U′ of x̄, and

∇Φ(x) = ∫_{S^{m−1}×C} ∇x ec(x, v) d(ζΘ ⊗ m)(v, c), for all x ∈ U′. (10)

The proof of the above result will follow directly from a more general result presented in the appendix,
which provides an explicit representation for computing the integral formula in (10) in terms of the
function g, along with a detailed technical analysis of the estimation process for the integrand (see
Theorem 3). We refer to (17) for an explicit formula for ∇x ec (x, v) in terms of the data of the problem.

4 Approximation of Probability Functions via Monte Carlo Simulation

In this section, we propose a sampling method to approximate the probability function Φ in (9) and
its gradient. As Theorem 1 guarantees continuity of Φ, our goal is to obtain an expression that remains
continuous in x. Therefore, we avoid relying on naive i.i.d.-based Monte Carlo approximations of the form
Φ̃N(x) = (1/N) Σ_{i=1}^{N} 1_{{g(x, ξi) ≤ 0}}. (11)

Instead, we propose a Monte Carlo approximation to estimate Φ(x) and ∇Φ(x) that relies on an i.i.d. sample {(vi, ci)}_{i∈N}. Alternatively, if directly sampling from ζΘ ⊗ m is challenging, we can use asymptotic surrogate Markov Chain Monte Carlo (MCMC) methods [8, 13], as such a sequence has the same asymptotic distribution as an i.i.d. ζΘ ⊗ m sequence after a sufficiently large number of steps [19, Theorem 7.4]. By averaging the radial probability function (7) across the sample, we define the empirical spherical approximation to (9) and its gradient (10) as

ΦN(x) = (1/N) Σ_{i=1}^{N} e_{ci}(x, vi)  and  ∇ΦN(x) = (1/N) Σ_{i=1}^{N} ∇x e_{ci}(x, vi). (12)
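A schematic R implementation of (12) for m = 2 (ours, not the authors' companion script; the helper names make_Phi_N, sample_c, z_of_c and L_of_c are hypothetical) computes each e_{ci}(x, vi) from the radius function (13), located here by a one-dimensional root search; the gradient estimator is obtained analogously from the explicit formula (17) in the appendix.

# The user supplies g(x, z), a sampler for c ~ m (scalar c assumed), the
# conditional mean z(c) and the Cholesky factor L(c).  For each sampled pair
# (v_i, c_i), e_{c_i}(x, v_i) = P(R <= rho) with R ~ chi_2, i.e. 1 - exp(-rho^2/2).
make_Phi_N <- function(g, sample_c, z_of_c, L_of_c, N, r_max = 50) {
  cs    <- sample_c(N)                               # c_i ~ m, i.i.d.
  theta <- runif(N, 0, 2 * pi)
  Vs    <- cbind(cos(theta), sin(theta))             # Theta_i uniform on S^1
  rho   <- function(x, v, c) {                       # radius function (13)
    h <- function(r) g(x, z_of_c(c) + r * as.vector(L_of_c(c) %*% v))
    if (h(r_max) <= 0) return(Inf)                   # an "infinite direction"
    uniroot(h, c(0, r_max))$root                     # h(0) < 0 by the Slater condition (8)
  }
  function(x) {                                      # Phi_N as a function of x
    r <- sapply(seq_len(N), function(i) rho(x, Vs[i, ], cs[i]))
    mean(1 - exp(-r^2 / 2))                          # infinite rho contributes 1
  }
}

For the example of Section 5 one would take sample_c <- function(n) 2 * rbeta(n, delta, delta) - 1, z_of_c <- function(c) c(0, c), and L_of_c <- function(c) diag(2).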

Hence, under mild regularity and growth conditions, the hypotheses in [21, Theorem 9.60] are satisfied,
leading to the following convergence theorem, which is proved in the appendix.

Theorem 2 Let {(vi, ci)}_{i=1}^N be an i.i.d. sequence obtained from ζΘ ⊗ m or from a surrogate Markov Chain method with stationary distribution ζΘ ⊗ m. Consider a compact set X ⊂ Rn such that, for every x ∈ X, condition (8) and the exponential growth condition hold at x. Then, with probability 1, the approximations ΦN(x) and ∇ΦN(x) defined in (12) converge uniformly on X to Φ and ∇Φ, respectively, as N → ∞.

The above theorem enables us to derive asymptotic distributional results that assess the consistency of the approximations in (12) and establish the asymptotic normality of their errors with controlled variance.
The result is the following:

Corollary 1 Under the assumptions of Theorem 2, for all x ∈ X, the following holds:
√N [ΦN(x) − Φ(x)] →d N(0, σx²) and √N [∇ΦN(x) − ∇Φ(x)] →d N(0, Σx²) as N → +∞,

where

σx² := ∫_{S^{m−1}×C} ec(x, v)² d(ζΘ ⊗ m)(v, c) − Φ(x)²,

Σx² := ∫_{S^{m−1}×C} [∇x ec(x, v)][∇x ec(x, v)]⊤ d(ζΘ ⊗ m)(v, c) − ∇Φ(x)∇Φ(x)⊤.

Proof Fix x ∈ X. We consider the case where {(vi, ci)}_{i=1}^N is an i.i.d. sample from ζΘ ⊗ m; the case of MCMC is similar (see, e.g., [19]). Indeed, the sequence e_{ci}(x, vi) is i.i.d. with mean Φ(x) and variance σx². In the same way, the sequence ∇x e_{ci}(x, vi) is i.i.d. with mean ∇Φ(x) and covariance matrix Σx².
Therefore, the result follows from the Central Limit Theorem [2, Theorem 27.2].
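In practice, Corollary 1 also yields an asymptotic confidence interval for Φ(x) from the same sample used in (12); a minimal R sketch (ours; the vector e is assumed to contain the evaluated values e_{ci}(x, vi)):

phi_conf_int <- function(e, alpha = 0.05) {
  N  <- length(e)
  mu <- mean(e)                          # Phi_N(x)
  se <- sd(e) / sqrt(N)                  # plug-in estimate of sigma_x / sqrt(N)
  mu + c(-1, 1) * qnorm(1 - alpha / 2) * se
}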

The following result establishes convergence in probability for our approximations.

Corollary 2 Under the assumptions of Theorem 2, for all x ∈ X, the following holds:

ΦN (x) →P Φ(x) and ∇ΦN (x) →P ∇Φ(x) as N → ∞.

Proof Due to Corollary 1, we have ΦN(x) − Φ(x) = OP(N^{−1/2}) = oP(1); hence, for all ε > 0,

P(|ΦN(x) − Φ(x)| > ε) → 0 as N → ∞.

The proof is similar for the case of the gradient.

The naive approximation Φ̃N in (11) is also asymptotically normal. However, for x ∈ X, its asymptotic variance is σ̃x² := Φ(x) − Φ(x)². Since ec(x, v) ≥ ec(x, v)² for any (c, v), it follows that Φ(x) ≥ ∫_{S^{m−1}×C} ec(x, v)² d(ζΘ ⊗ m)(v, c). Hence, σ̃x² ≥ σx². Therefore, estimating the probability function by ΦN is uniformly more efficient than doing so via Φ̃N.

5 Estimation of the Probability Function and Its Gradient: A Numerical Example

In this section, we illustrate the quality of the proposed approximations.


Let C ∈ [−1, 1] be a random variable such that D := (C + 1)/2 follows, for some δ > 0, a Beta(δ, δ) distribution, that is,

m(c) := (Γ(2δ) / (2^{2δ−1} Γ(δ)²)) (1 − c²)^{δ−1} for c ∈ [−1, 1],

where Γ(·) denotes the Gamma function. Let ξ ∼ N((0, C)⊤, I₂), with I₂ the 2 × 2 identity matrix, and suppose that g : Rn × R² → R is given by

g(x, ξ) = ∥ξ∥² − ∥x∥² − 2.
Hence, on the one hand, the probability (9) becomes

Φ(x) = ∫_{−1}^{1} P(g(x, ξ) ≤ 0 | C = c) dm(c) = (Γ(2δ) / (2^{2δ−1} Γ(δ)²)) ∫_{−1}^{1} (1 − c²)^{δ−1} F_χ^{2,c²}(∥x∥² + 2) dc,

where F_χ^{k,λ} denotes the cumulative distribution function of a non-central chi-squared random variable with k degrees of freedom and noncentrality parameter λ, and f_χ^{k,λ} denotes its corresponding density. Furthermore, the derivative of the probability function is therefore given by

∇Φ(x) = ( (Γ(2δ) / (2^{2δ−2} Γ(δ)²)) ∫_{−1}^{1} (1 − c²)^{δ−1} f_χ^{2,c²}(∥x∥² + 2) dc ) x.
On the other hand,

g(x, (0, c)⊤) ≤ −1 for all x ∈ Rn and c ∈ [−1, 1],

so assumption (8) is satisfied with γ0 = 1/2. Moreover,

∥∇x g(x, ξ)∥ = 2∥x∥ ≤ 2(∥x̄∥ + ε) for all x ∈ Bε(x̄) and ξ ∈ R²,

which implies the exponential growth condition from Definition 1 at any point x̄.


The spherical radial decomposition for ξ is ξ = (0, c)⊤ + RL(c)Θ, where L(c) is the identity matrix and R² ∼ χ²₂, i.e., R has the Rayleigh distribution [15]. The function defined in (3) takes the form χ(r) = r e^{−r²/2} for r > 0. Moreover, it is clear that for fixed (c, v), we have

g(x, (rv1, c + rv2)⊤) ≤ 0 ⟺ r ≤ ρc(x, v) := −cv2 + √(∥x∥² + 2 − c²v1²),

which implies that

ec(x, v) = 1 − e^{−ρc(x,v)²/2}  and  ∇x ec(x, v) = ρc(x, v) e^{−ρc(x,v)²/2} x / √(∥x∥² + 2 − c²v1²).
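A minimal R sketch of this example (ours, independent of the companion script; variable names are our own) draws the sample {(vi, ci)}, evaluates ΦN and its derivative from the closed forms above, and compares them with the true values obtained by numerically integrating the non-central chi-squared cdf and density.

set.seed(1)
delta <- 2.5; N <- 100
cs <- 2 * rbeta(N, delta, delta) - 1          # c_i: D ~ Beta(delta, delta), C = 2D - 1
th <- runif(N, 0, 2 * pi)
v1 <- cos(th); v2 <- sin(th)                  # Theta_i uniform on S^1

Phi_N <- function(x) {                        # empirical spherical approximation (12)
  rho <- -cs * v2 + sqrt(x^2 + 2 - cs^2 * v1^2)
  mean(1 - exp(-rho^2 / 2))
}
dPhi_N <- function(x) {                       # its derivative (scalar x >= 0)
  s   <- sqrt(x^2 + 2 - cs^2 * v1^2)
  rho <- -cs * v2 + s
  mean(rho * exp(-rho^2 / 2) * x / s)
}
Phi_true <- function(x) {                     # Phi(x) by numerical integration
  integrate(function(c) dbeta((c + 1) / 2, delta, delta) / 2 *
              pchisq(x^2 + 2, df = 2, ncp = c^2), -1, 1)$value
}
dPhi_true <- function(x) {                    # Phi'(x) by numerical integration
  2 * x * integrate(function(c) dbeta((c + 1) / 2, delta, delta) / 2 *
                      dchisq(x^2 + 2, df = 2, ncp = c^2), -1, 1)$value
}

xs <- seq(0, 4, by = 0.5)
cbind(x = xs, PhiN = sapply(xs, Phi_N), Phi = sapply(xs, Phi_true),
      dPhiN = sapply(xs, dPhi_N), dPhi = sapply(xs, dPhi_true))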

The GitHub page [Link] contains simulation results and an R script for conducting further simulations with user-defined parameters. Figure 1 presents simulation-based
results for x ∈ [0, 4], δ = 2.5 and a sample size N = 100, demonstrating that the proposed approximation
ΦN in (12) is smooth and offers significantly higher accuracy as an estimator of the true probability

function compared to the naive estimator (11), as shown in the leftmost plot. Unlike Φ̃N , the smooth
profile of ΦN admits a continuous derivative ∇ΦN , which reasonably approximates the gradient of the
probability function ∇Φ, as illustrated in the plot on the right. The highest empirical error in the gradient approximation is less than 0.0007, and this high accuracy is achieved across the entire
domain with a relatively small sample size.

[Figure 1: left panel "Constraint probability" showing the estimates ΦN and Φ̃N together with the true Φ as functions of x; right panel "Gradient of probability function" showing ∇ΦN and the true ∇Φ.]

Fig. 1 Estimates for N = 100 and true values of the probability function Φ (left) and its derivative Φ′ (right). Naive estimator (11) in blue, empirical spherical approximation (12) in red, and true functions in black.

Declarations

Data availability No datasets were generated or analyzed during the current study.
Conflict of interest The authors have no conflict of interest to declare.

A Well-Posedness of the Probability Function

Proof of Proposition 1. Observe that the function (x, v, c, r) ↦ χ(r) 1_{{r′ ≥ 0 : g(x, r′L(c)v + z(c)) ≤ 0}}(r) is measurable. Hence, using Fubini's Theorem we get that

(c, v, x) ↦ ∫_{{r′ ≥ 0 : g(x, r′L(c)v + z(c)) ≤ 0}} χ(r) dr

is measurable, which proves the measurability of e. Now, let us fix (c, v) ∈ C × S^{m−1} and let x ∈ Rn be a z-uniform Slater point. Consider a sequence xk → x and define the functions

ηk(r) := χ(r) 1_{{r′ ≥ 0 : g(xk, r′L(c)v + z(c)) ≤ 0}}(r)  and  η(r) := χ(r) 1_{{r′ ≥ 0 : g(x, r′L(c)v + z(c)) ≤ 0}}(r).

First, note that |ηk (r)| ≤ χ(r) for all r ∈ [0, +∞). Furthermore, due to the convexity of g with respect to its second
argument, it is straightforward to show that ηk converges almost everywhere to η as k → ∞. Therefore, by the Lebesgue
Dominated Convergence theorem, we have that
ec(x, v) = ∫_ℝ η(r) dr = lim_{k→∞} ∫_ℝ ηk(r) dr = lim_{k→∞} ec(xk, v),

which implies that the function ec (·, v) is continuous at x.

B Differentiability of the Bayesian probability function

Define the sets of finite and infinite directions as follows:

F (x) := {(v, c) ∈ Sm−1 × C : ∃r > 0 s.t. g(x, z(c) + rL(c)v) > 0},
I(x) := {(v, c) ∈ Sm−1 × C : ∀r > 0 g(x, z(c) + rL(c)v) < 0}.

Furthermore, we introduce the radius function ρ : Rn × S^{m−1} × C → R+, defined by

ρc (x, v) := ρ(x, v, c) := sup{r ≥ 0 : g(x, rL(c)v + z(c)) ≤ 0}. (13)

The following proposition is a direct adaptation of the arguments presented in [27, Lemmas 3.2, 3.3 and 3.4] (see
also [25]).

Lemma 1 Suppose that condition (8) holds at x̄. Then, for every (x, v, c) ∈ U × Sm−1 × C, the set {r ≥ 0 : g(x, rL(c)v +
z(c)) ≤ 0} = [0, ρc (x, v)] (with the convention that [0, ∞] = [0, ∞)), and

⟨∇z g(x, z(c) + ρc(x, v)L(c)v), L(c)v⟩ ≥ −g(x, z(c)) / ρc(x, v). (14)

Moreover, the function e, defined in (7), satisfies the identity

ec(x, v) = ∫_0^{ρc(x,v)} χ(r) dr, for every (x, v, c) ∈ Bε(x̄) × S^{m−1} × C. (15)

Furthermore, the function ρ is B(Rn ) ⊗ B(Sm−1 ) ⊗ F measurable, and for every sequence (xk , vk ) → (x, v), we have that
ρc (xk , vk ) → ρc (x, v) for all c ∈ C. In addition, if ρc (x, v) < +∞, then the function ρc is continuously differentiable with
respect to x, with gradient
∇x ρc(x, v) = − ∇x g(x, z(c) + ρc(x, v)L(c)v) / ⟨∇z g(x, z(c) + ρc(x, v)L(c)v), L(c)v⟩. (16)

Proof The equality


{r ≥ 0 : g(x, rL(c)v + z(c)) ≤ 0} = [0, ρc (x, v)]
follows from the convexity of g with respect to z, which directly implies (15) (see, e.g., [26, Lemma 2.4] or [24, Lemma 3.2]
for further details). Furthermore, the measurability of ρ follows from the equality: for ν ≥ 0,

{(x, v, c) ∈ U × S^{m−1} × C : ρ(x, v, c) > ν} = {(x, v, c) ∈ U × S^{m−1} × C : g(x, νL(c)v + z(c)) < 0},

where the set on the right-hand side of the equality is measurable due to the continuity of g and measurability of z and L.

Now, we establish a lower bound for the inner product between the partial gradient of g with respect to z and the directions
over the sphere. This technical inequality follows from the convexity of the function g (with respect to z), and its proof is
obtained by adapting the arguments from [27, Lemma 3.3] to our setting.

Proposition 2 Suppose that (8) holds, and consider (x, v, c) ∈ U × S^{m−1} × C with (v, c) ∈ F(x). Let r0 > 0 be such that {r ≥ 0 : g(x, rL(c)v + z(c)) ≤ 0} = [0, r0]. Then ⟨∇z g(x, z(c) + r0 L(c)v), L(c)v⟩ ≥ −r0^{−1} g(x, z(c)) > 0.

The following result pertains to a technical estimate of the gradients of the (parameterized) radius function defined in
(13). Since we cannot assume a priori that the radial probability function is differentiable, the proof of this result relies on
techniques from variational analysis and generalized differentiation. We refer to [14, 20] for the primary notation and the
tools used in the proof.

Theorem 3 Let x̄ be a z-uniform Slater point, and suppose that g satisfies the exponential growth condition at x̄. Then, there exists ε > 0 such that for all (x, v, c) ∈ Bε(x̄) × S^{m−1} × C the function ec is continuously differentiable with respect to x, with gradient

∇x ec(x, v) = χ(ρc(x, v)) ∇x ρc(x, v) if (v, c) ∈ F(x), and ∇x ec(x, v) = 0 if (v, c) ∈ I(x). (17)

Furthermore, there exists κ̂ ≥ 0 such that

∥∇x ec(x, v)∥ ≤ κ̂, for all (x, v, c) ∈ Bε(x̄) × S^{m−1} × C. (18)

Proof Let U be the neighborhood in (8). We divide the proof into three claims.
Claim 1: For all (x, v, c) ∈ U × Sm−1 × C, the following inclusions hold:

∂ˆx ec (x, v) ⊂ {χ(ρc (x, v))∇x ρc (x, v)} for (v, c) ∈ F (x) and ∂ˆx ec (x, v) ⊂ {0} for (v, c) ∈ I(x), (19)

where ∂ˆx ec (x, v) denotes the Fréchet subdifferential of the function ec with respect to x.
Proof of Claim 1: Indeed, let x ∈ U . Then, by Lemma 1, we have that (17) holds whenever (v, c) ∈ F (x). On the other
hand, if (v, c) ∈ I(x), then ec (x, v) = 1, which implies that ec attains its maximum. It is then straightforward to show that
∂ˆx ec (x, v) ⊂ {0}.
Claim 2: There exist γ > 0 and κ̂ ≥ 0 such that

∂ˆx ec (x, v) ⊂ κ̂B, for all (x, v, c) ∈ Bγ (x̄) × Sm−1 × C.

Proof of Claim 2: Let ε, a, b > 0 be the parameters in the definition of the exponential growth condition. By shrinking ε
if necessary, we can assume that Bε (x̄) ⊂ U , where U is the neighborhood of x̄ from (8). Now, consider x ∈ Bε (x̄) and
x∗ ∈ ∂ˆx ec (x, v). Then, using (3), (16), and (19), we get

∥x∗∥ ≤ χ(ρc(x, v)) ∥∇x ρc(x, v)∥ = c0 ρc(x, v)^{m−1} exp(−ρc(x, v)²/2) ∥∇x g(x, z(c) + ρc(x, v)L(c)v)∥ / |⟨∇z g(x, z(c) + ρc(x, v)L(c)v), L(c)v⟩|,

where c0 = 2^{1−m/2}/Γ(m/2). Now, as a consequence of (5), (8), (14), and the exponential growth condition, we have that

∥x∗∥ ≤ c0 γ0^{−1} ρc(x, v)^m exp(−ρc(x, v)²/2) a exp(bρc(x, v)∥L(c)∥) exp(b∥z(c)∥)
     ≤ a c0 γ0^{−1} exp(bη0) ρc(x, v)^m exp(−ρc(x, v)²/2) exp(bη0 ρc(x, v)). (20)

Consequently,

∥x∗∥ ≤ a c0 γ0^{−1} exp(bη0) sup{r^m exp(−r²/2) exp(bη0 r) : r ≥ 0} =: κ̂.

This concludes the proof of the claim.


Claim 3: For all (v, c) ∈ Sm−1 × C, the function x 7→ ec (x, v) is continuously differentiable over Bγ/2 (x̄). Consequently,
(17) and (18) hold.
Proof of Claim 3: Fix (v, c) ∈ Sm−1 × C. First, using [14, Theorem 4.15], we conclude that x 7→ ec (x, v) is Lipschitz
continuous around any x ∈ Bε/2 (x̄). Second, due to Lemma 1, it is enough to prove continuous differentiability when
(v, c) ∈ I(x). Indeed, consider a sequence xk → x and x∗k ∈ ∂ˆx ec (xk , v); we will prove that x∗k → 0. If (v, c) ∈ I(xk ),
we have that ∥x∗k ∥ = 0, so we can assume that (v, c) ∈ F (xk ) for all k ∈ N. Then, by Lemma 1 and using the estimate
given in (20), we have that x∗k → 0. Finally, [14, Theorem 4.17] implies that x 7→ ec (x, v) is strictly differentiable at x with
∂x ec(x, v) = {0}, and from the arbitrariness of x, we obtain its continuous differentiability. ⊓⊔

Proof of Theorem 1 By using (18) and the classical theorem on the interchange of integration and differentiation, we conclude
that the formula (10) holds within a suitable neighborhood of x̄. Furthermore, considering the integrand in (10) and the
uniform boundedness in (18), we can establish the local continuity of the gradient around x̄. ⊓⊔

C Approximation of Bayesian Probability Functions

Proof of Theorem 2: On the one hand, notice that for every x ∈ X, the following inequality holds:

ec(x, v) ≤ 1, for all (c, v) ∈ C × S^{m−1}.

On the other hand, by (18) and a classical compactness argument, we can ensure that there exist κ̂ > 0 and an open
set U ′ ⊃ X such that

∥∇x ec (x, v)∥ ≤ κ̂, for all (x, v, c) ∈ U ′ × Sm−1 × C.

For rk a positive sequence converging to 0 and arbitrary x̄ ∈ X, define

∆1,k(v, c) = sup_{x ∈ B_{rk}(x̄)} |ec(x, v) − ec(x̄, v)|  and  ∆2,k(v, c) = sup_{x ∈ B_{rk}(x̄)} |∇x ec(x, v) − ∇x ec(x̄, v)|.

By Theorem 3, we have that for j = 1, 2, as k → ∞, ∆j,k (v, c) → 0 (ζΘ ⊗ m) almost surely. Since ∆1,k (v, c) ≤ 2 and
∆2,k (v, c) ≤ 2κ̂, the dominated convergence theorem implies that for j = 1, 2,
lim_{k→∞} ∫_{C×S^{m−1}} ∆j,k(v, c) d(ζΘ ⊗ m)(v, c) = ∫_{C×S^{m−1}} lim_{k→∞} ∆j,k(v, c) d(ζΘ ⊗ m)(v, c) = 0.

By the Strong Law of Large Numbers [2, Theorem 6.2] (or equivalently by [19, Theorem 7.4] when using an MCMC-type sampler), it follows that for fixed k ∈ N and x̄ ∈ X, as N → ∞, almost surely both

sup_{x ∈ B_{rk}(x̄)} |ΦN(x) − ΦN(x̄)| ≤ N^{−1} Σ_{i=1}^{N} ∆1,k(ci, vi) → E(∆1,k),

and

sup_{x ∈ B_{rk}(x̄)} |∇ΦN(x) − ∇ΦN(x̄)| ≤ N^{−1} Σ_{i=1}^{N} ∆2,k(ci, vi) → E(∆2,k).

As the right-hand sides above converge to 0 as k → ∞, for any ε > 0, we can find k̃ such that

sup_{|x−x̄|<r_{k̃}} |ΦN(x) − ΦN(x̄)| ≤ ε/3  and  sup_{|x−x̄|<r_{k̃}} |∇ΦN(x) − ∇ΦN(x̄)| ≤ ε/3.

Since the sets B_{r_{k̃}}(x̄) for x̄ ∈ X cover X, the compactness of X implies that there exists a finite collection x̄1, . . . , x̄F and K = max{k̃1, · · · , k̃F} such that B_{rK}(x̄1), . . . , B_{rK}(x̄F) cover X. Therefore, for x̄ ∈ X, there exist indices 1 ≤ j0 ≤ F and 1 ≤ j1 ≤ F such that

|ΦN(x̄) − ΦN(x̄j0)| ≤ sup_{|x−x̄j0|<rK} |ΦN(x) − ΦN(x̄j0)| ≤ ε/3,

and

|∇ΦN(x̄) − ∇ΦN(x̄j1)| ≤ sup_{|x−x̄j1|<rK} |∇ΦN(x) − ∇ΦN(x̄j1)| ≤ ε/3.

The compactness of X and continuity of Φ and its gradient ensure that, possibly enlarging F and K, for each x̄ ∈ X, there exists 1 ≤ j ≤ F such that

sup_{|x−x̄j|<rK} |Φ(x) − Φ(x̄j)| ≤ ε/3  and  sup_{|x−x̄j|<rK} |∇Φ(x) − ∇Φ(x̄j)| ≤ ε/3.

For each 1 ≤ j ≤ F, ΦN(x̄j) = N^{−1} Σ_{i=1}^{N} e_{ci}(x̄j, vi) and ∇ΦN(x̄j) = N^{−1} Σ_{i=1}^{N} ∇x e_{ci}(x̄j, vi) converge almost surely to Φ(x̄j) and ∇Φ(x̄j), respectively. Thus, we can find k̂ such that |ΦN(x̄j) − Φ(x̄j)| ≤ ε/3 and |∇ΦN(x̄j) − ∇Φ(x̄j)| ≤ ε/3 whenever N ≥ k̂, for every j. Finally, by the triangle inequality, it follows that, for arbitrary ε > 0 and all N sufficiently large,

sup_{x∈X} |ΦN(x) − Φ(x)| ≤ ε  and  sup_{x∈X} |∇ΦN(x) − ∇Φ(x)| ≤ ε.

Thus, the uniform convergence of ΦN and ∇ΦN is proven. ⊓⊔


References

1. T. Aktekin and T. Ekin. Stochastic call center staffing with uncertain arrival, service and abandonment rates: A
bayesian perspective. Naval Research Logistics (NRL), 63(6):460–478, 2016.
2. P. Billingsley. Probability and Measure. John Wiley and Sons, second edition, 1986.
3. N. Chitsazan, H. V. Pham, and F. T.-C. Tsai. Bayesian chance-constrained hydraulic barrier design under geological
structure uncertainty. Groundwater, 53(6):908–919, 2015.
4. A. M. Faden. The existence of regular conditional probabilities: necessary and sufficient conditions. Ann. Probab.,
13(1):288–298, 1985.
5. T. González Grandón, R. Henrion, and P. Pérez-Aros. Dynamic probabilistic constraints under continuous random
distributions. Math. Program., pages 1065–1096, 2020.
6. A. Hantoute, R. Henrion, and P. Pérez-Aros. Subdifferential characterization of probability functions under Gaussian
distribution. Math. Program., 174(1-2 (B)):167–194, 2019.
7. P. Jaiswal, H. Honnappa, and V. A. Rao. Bayesian joint chance constrained optimization: Approximations and statistical
consistency. SIAM J. Optim., 33(3):1968–1995, 2023.
8. G. L. Jones. On the Markov chain central limit theorem. Probability Surveys, 1:299–320, 2004.
9. J. Kirschner, I. Bogunovic, S. Jegelka, and A. Krause. Distributionally robust bayesian optimization. In S. Chiappa
and R. Calandra, editors, Proceedings of the Twenty Third International Conference on Artificial Intelligence and
Statistics, volume 108 of Proc. Mach. Learn. Res., pages 2174–2184. PMLR, 26–28 Aug 2020.
10. T. L. Lai, H. Xing, and Z. Chen. Mean–variance portfolio optimization when means and covariances are unknown.
Ann. Appl. Stat., 5(2A):798–823, 2011.
11. A. Matza and Y. Bistritz. Infinite Gaussian mixture modeling with an improved estimation of the number of clusters.
Proceedings of the AAAI Conference on Artificial Intelligence, 35(10):8921–8929, 2021.
12. G. McLachlan and D. Peel. Finite mixture models. Wiley Ser. Probab. Stat. Wiley, New York, 2000.
13. K. L. Mengersen. MCMC convergence diagnostics: a review. Bayesian statistics, 6:415–440, 1999.
14. B. S. Mordukhovich. Variational analysis and applications. Springer Monogr. Math. Springer, Cham, 2018.
15. A. Papoulis and S.U. Pillai. Probability, Random Variables, and Stochastic Processes. McGraw-Hill Series in Electrical
Engineering. Communications and Information Theory. Tata McGraw-Hill, 2002.
16. A. Prékopa. Stochastic programming, volume 324 of Mathematics and its Applications. Kluwer Academic Publishers
Group, Dordrecht, 1995.
17. C. Rasmussen. The infinite Gaussian mixture model. In S. Solla, T. Leen, and K. Müller, editors, Advances in Neural
Information Processing Systems, volume 12, pages 554–560. MIT Press, 04 2000.

18. D. Reynolds. Gaussian mixture models. In S. Z. Li and A. K. Jain, editors, Encyclopedia of Biometrics, pages 827–832.
Springer US, Boston, MA, 2015.
19. C.P. Robert and G. Casella. Monte Carlo statistical methods. Springer Verlag, 2004.
20. R.T. Rockafellar and R. J.-B. Wets. Variational Analysis, volume 317 of Grundlehren der mathematischen Wis-
senschaften. Springer Verlag Berlin, 3rd edition, 2009.
21. A. Shapiro, D. Dentcheva, and A. Ruszczyński. Lectures on stochastic programming, volume 9 of MOS-SIAM Ser.
Optim. SIAM, Philadelphia, PA., second edition, 2014. Modeling and theory.
22. A. Shapiro, E. Zhou, and Y. Lin. Bayesian distributionally robust optimization. SIAM J. Optim., 33(2):1279–1304,
2023.
23. W. van Ackooij, I. Aleksovska, and M. Munoz-Zuniga. (Sub-)differentiability of probability functions with elliptical
distributions. Set-Valued Var. Anal., 26(4):887–910, 2018.
24. W. van Ackooij and R. Henrion. Gradient formulae for nonlinear probabilistic constraints with Gaussian and Gaussian-
like distributions. SIAM J. Optim., 24(4):1864–1889, 2014.
25. W. van Ackooij and R. Henrion. (Sub-) Gradient formulae for probability functions of random inequality systems under
Gaussian distribution. SIAM J. Uncertain. Quantif., 5(1):63–87, 2017.
26. W. van Ackooij and P. Pérez-Aros. Generalized differentiation of probability functions acting on an infinite system of
constraints. SIAM J. Optim., 29(3):2179–2210, 2019.
27. W. van Ackooij and P. Pérez-Aros. Generalized differentiation of probability functions: Parameter dependent sets given
by intersections of convex sets and complements of convex sets. Appl. Math. Optim., 85(1):2, Jan 2022.
