
On Theory for BART

Veronika Ročková and Enakshi Saha


University of Chicago

Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics (AISTATS) 2019, Naha, Okinawa, Japan. PMLR: Volume 89. Copyright 2019 by the author(s).

Abstract

Ensemble learning is a statistical paradigm built on the premise that many weak learners can perform exceptionally well when deployed collectively. The BART method of Chipman et al. (2010) is a prominent example of Bayesian ensemble learning, where each learner is a tree. Due to its impressive performance, BART has received a lot of attention from practitioners. Despite its wide popularity, however, theoretical studies of BART have begun emerging only very recently. Laying down a foundation for the theoretical analysis of Bayesian forests, Rockova and van der Pas (2017) showed optimal posterior concentration under conditionally uniform tree priors. These priors deviate from the actual priors implemented in BART. Here, we study the exact BART prior and propose a simple modification so that it also enjoys optimality properties. To this end, we dive into branching process theory. We obtain tail bounds for the distribution of total progeny under heterogeneous Galton-Watson (GW) processes using their connection to random walks. We conclude with a result stating the optimal rate of convergence for BART.

1 Bayesian Machine Learning

Bayesian Machine Learning and Bayesian Non-parametrics share the same objective: increasing flexibility necessary to address very complex problems using a Bayesian approach with minimal subjective input. While the two fields can be, to some extent, regarded as synonymous, their emphasis is quite different. Bayesian non-parametrics has subsumed a theoretical field focused on studying frequentist properties of posterior objects in infinite-dimensional parameter spaces. Bayesian machine learning, on the other hand, has been primarily concerned with developing scalable tools for computing such posterior objects. In this work, we bridge these two fields by providing theoretical insights into one of the workhorses of Bayesian machine learning, the BART method.

Bayesian Additive Regression Trees (BART) are one of the more widely used Bayesian prediction tools and their popularity continues to grow. Compared to its competitors (e.g. Gaussian processes, random forests or neural networks), BART requires considerably less tuning, while maintaining robust and relatively scalable performance (BART R package of McCulloch (2017), bartMachine R package of Bleich et al. [2014], top-down particle filtering of Lakshminarayanan et al. [2013]). BART has been successfully deployed in many prediction tasks, often outperforming its competitors (see predictive comparisons on 42 data sets in Chipman et al. [2010]). More recently, its flexibility and stellar prediction have been capitalized on in causal inference tasks for heterogeneous/average treatment effect estimation (Hill [2011], Hahn et al. [2017] and references therein). BART has also served as a springboard for various incarnations and extensions including: Monotone BART (Chipman et al. [2016]), Heteroscedastic BART (Pratola et al. [2017]), treed Gaussian processes (Gramacy and Lee [2008]) and dynamic trees (Taddy et al. [2011]), to list a few. Related non-parametric constructions based on recursive partitioning have proliferated in the Bayesian machine learning community for modeling relational data (Mondrian process of Roy and Teh [2008], Mondrian forests of Lakshminarayanan et al. [2014]). In short, BART continues to have a decided impact on the field of Bayesian non-parametrics/machine learning.

Despite its widespread popularity, however, the theory has not caught up with its applications. The first theoretical results were obtained only very recently. As a precursor to these developments, Coram and Lalley [2006] obtained a consistency result for Bayesian histograms in binary regression with a single predictor. van der Pas and Rockova [2017] provided a posterior
concentration result for Bayesian regression histograms in Gaussian non-parametric regression, also with one predictor. Rockova and van der Pas [2017] (further referred to as RP17) then extended their study to trees and forests in a high-dimensional setup where p > n and where variable selection uncertainty is present. They obtained the first theoretical results for Bayesian CART, showing optimal posterior concentration (up to a log factor) around a ν-Hölder continuous regression function (i.e. Hölder smooth with smoothness 0 < ν ≤ 1). Going further, they also showed optimal performance for Bayesian forests, both in additive and non-additive regression. Linero and Yang [2017] obtained similar results for Bayesian ensembles, but for fractional posteriors (raised to a power). The proof of RP17, on the other hand, relies on a careful construction of sieves and applies to regular posteriors. Building on RP17, Linero and Yang [2017] subsequently obtained results for actual posteriors as well. In addition, Linero and Yang [2017] do not study step functions (the essence of Bayesian CART and BART) but aggregated smooth kernels, allowing for ν > 1. Liu et al. [2018] obtained model selection consistency results (for variable and regularity selection) for Bayesian forests.

Albeit related, the tree priors studied in RP17 are not the actual priors deployed in BART. Here, we develop new tools for the analysis of the actual BART prior and obtain parallel results to those in RP17. To begin, we dive into branching process theory to characterize aspects of the distribution of total progeny under heterogeneous Galton-Watson processes. Revisiting several useful facts about Galton-Watson processes, including their connection to random walks, we derive a new prior tail bound for the tree size under the BART prior. With our proving strategy, the actual prior of Chipman et al. [2010] does not appear to penalize large trees aggressively enough. We suggest a very simple modification of the prior by altering the splitting probability. With this minor change, the prior is shown to induce the right amount of regularization and optimal speed of posterior convergence.

The paper is structured as follows. Section 2 revisits trees and forests in the context of non-parametric regression and discusses the BART prior. Section 3 reviews the notion of posterior concentration. Section 4 discusses Galton-Watson processes and their connection to Bayesian CART. Section 5 is concerned with tail bounds on total progeny. Sections 6 and 7 describe prior and posterior concentration properties of BART. Section 8 wraps up with a discussion.

2 The Appeal of Trees/Forests

The data setup under consideration consists of Y_i ∈ R, a set of one-dimensional outputs, and x_i = (x_{i1}, ..., x_{ip})' ∈ [0,1]^p, a set of high-dimensional inputs for 1 ≤ i ≤ n. Our statistical framework is non-parametric regression, which characterizes the input-output relationship through

    Y_i = f_0(x_i) + ε_i,   ε_i ~ iid N(0, 1),                                  (1)

where f_0 : [0,1]^p → R is an unknown regression function. A regression tree can be used to reconstruct f_0 via a mapping f_{T,β} : [0,1]^p → R so that f_{T,β}(x) ≈ f_0(x) for x ∉ {x_i}_{i=1}^n. Each such mapping is essentially a step function

    f_{T,β}(x) = Σ_{k=1}^K β_k I(x ∈ Ω_k)                                       (2)

underpinned by a tree-shaped partition T = {Ω_k}_{k=1}^K and a vector of step heights β = (β_1, ..., β_K)'. The vector β represents quantitative guesses of the average outcome inside each cell. Each partition T consists of rectangles obtained by recursively applying a splitting rule (an axis-parallel bisection of the predictor space). We focus on binary tree partitions, where each internal node (box) is split into two children (formal definition below).

Definition 2.1. (A Binary Tree Partition) A binary tree partition T = {Ω_k}_{k=1}^K consists of K rectangular cells Ω_k obtained with K − 1 successive recursive binary splits of the form {x_j ≤ c} vs {x_j > c} for some j ∈ {1, ..., p}, where the splitting value c is chosen from observed values {x_{ij}}_{i=1}^n.

Partitioning is intended to increase within-node homogeneity of outcomes. In the traditional CART method (Breiman et al. [1984]), the tree is obtained by "greedy growing" (i.e. sequential optimization of some impurity criterion) until homogeneity cannot be substantially improved. The tree growing process is often followed by "optimal pruning" to increase generalizability. Prediction is then determined by terminal nodes of the pruned tree and takes the form either of a class level in classification problems, or the average of the response variable in least squares regression problems (Breiman et al. [1984]).

In tree ensemble learning, each constituent is designed to be a weak learner, addressing a slightly different aspect of the prediction problem. These trees are intended to be shallow and are woven into a forest mapping

    f_{E,B}(x) = Σ_{t=1}^T f_{T_t,β_t}(x),                                      (3)
where each f_{T_t,β_t}(x) is of the form (2), E = {T_1, ..., T_T} is an ensemble of trees and B = {β_1, ..., β_T}' is a collection of jump sizes for the T trees. Random forests obtain each tree learner from a bootstrapped version of the data. Here, we consider a Bayesian variant, the BART method of Chipman et al. [2010], which relies on the posterior distribution over f_{E,B} to reconstruct the unknown regression function f_0.
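Before turning to the prior, here is a minimal Python sketch (our own illustration, not code from any BART implementation) of the step function (2) and the sum-of-trees mapping (3); representing a partition as a list of axis-aligned cells with explicit lower/upper bounds is a choice made purely for this example.

```python
import numpy as np

# Each cell Omega_k is a (p, 2) array of [lower, upper) bounds; a tree is a
# (cells, beta) pair; an ensemble is a list of such pairs.

def step_function(x, cells, beta):
    """Evaluate f_{T,beta}(x) = sum_k beta_k * I(x in Omega_k), cf. equation (2)."""
    for omega, b in zip(cells, beta):
        if np.all(x >= omega[:, 0]) and np.all(x < omega[:, 1]):
            return b
    return 0.0  # x falls outside the partition

def forest(x, trees):
    """Evaluate the sum-of-trees mapping f_{E,B}(x), cf. equation (3)."""
    return sum(step_function(x, cells, beta) for cells, beta in trees)

# Toy partition of [0,1]^2 into two cells split at x_1 = 0.5, with heights (-1, 1).
cells = [np.array([[0.0, 0.5], [0.0, 1.0]]), np.array([[0.5, 1.0], [0.0, 1.0]])]
beta = np.array([-1.0, 1.0])
print(step_function(np.array([0.7, 0.2]), cells, beta))                       # -> 1.0
print(forest(np.array([0.7, 0.2]), [(cells, beta), (cells, 0.5 * beta)]))     # -> 1.5
```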
2.1 Bayesian Trees and Forests

Bayesian CART was introduced as a Bayesian alternative to CART, where regularization/stabilization is obtained with a prior rather than with pruning (Chipman et al. [1998], Denison et al. [1998]). The prior distribution is assigned over a class of step functions

    F = {f_{E,B}(x) of the form (3) for some E and B}

in a hierarchical manner.

The BART prior by Chipman et al. [2010] assumes that the number of trees T is fixed. The authors recommend a default choice T = 200, which was seen to provide good results. Next, the tree components (T_t, β_t) are a-priori independent of each other in the sense that

    π(E, B) = Π_{t=1}^T π(T_t) π(β_t | T_t),                                    (4)

where π(T_t) is the prior probability of a partition T_t and π(β_t | T_t) is the prior distribution over the jump sizes.

2.1.1 Prior on Partitions π(T)

In BART and Bayesian CART of Chipman et al. [1998], the prior over trees is specified implicitly as a tree-generating stochastic process, described as follows (a simulation sketch is given after this list):

1. Start with a single leaf (a root node) [0,1]^p.

2. Split a terminal node, say Ω_t, with probability

       p_split(Ω_t) = α / (1 + d(Ω_t))^γ                                        (5)

   for some α ∈ (0,1) and γ ≥ 0, where d(Ω_t) is the depth of the node Ω_t in the tree architecture.

3. If the node Ω_t splits, assign a splitting rule and create left and right children nodes. The splitting rule consists of picking a split variable j uniformly from available directions {1, ..., p} and picking a split point c uniformly from available data values x_{1j}, ..., x_{nj}. Non-uniform priors can also be used to favor splitting values that are thought to be more important. For example, splitting values can be given more weight towards the center and less weight towards the edges.
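The following sketch simulates the tree-generating process above. It is a schematic illustration under simplifying assumptions of our own (a node whose drawn split would leave an empty child simply remains terminal); it is not the sampler used in any BART package.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_tree_partition(X, alpha=0.95, gamma=2.0):
    """Grow one binary tree partition from the prior above: a node at depth d is
    split with probability alpha/(1 + d)^gamma (equation (5)); the split variable
    and split value are drawn uniformly as in step 3.  Returns the list of leaves,
    each leaf being (depth, boolean mask of the observations falling into it)."""
    n, p = X.shape
    queue, leaves = [(0, np.ones(n, dtype=bool))], []
    while queue:
        depth, mask = queue.pop(0)
        if rng.random() < alpha / (1.0 + depth) ** gamma and mask.sum() >= 2:
            j = rng.integers(p)                    # split variable, uniform over {1,...,p}
            c = rng.choice(X[mask, j])             # split value, uniform over observed values
            left, right = mask & (X[:, j] <= c), mask & (X[:, j] > c)
            if left.any() and right.any():
                queue += [(depth + 1, left), (depth + 1, right)]
                continue
        leaves.append((depth, mask))               # node remains terminal
    return leaves

X = rng.uniform(size=(100, 3))
print("number of leaves K =", len(sample_tree_partition(X)))
```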


2.1.2 Prior on Step Heights π(β | T)

Given a tree partition T_t with K_t steps, we consider iid Gaussian jumps

    π(β_t | T_t) = Π_{k=1}^{K_t} φ(β_tk; 0, 1/T),

where φ(x; 0, σ²) is a Gaussian density with mean 0 and variance σ². Chipman et al. [2010] recommend first shifting and rescaling the Y_i's so that the observed transformed values range from −0.5 to 0.5. Then they assign a conjugate normal prior β_tk ~ N(0, σ²), where σ = 0.5/(k√T) for some suitable value of k. This is to ensure that the prior assigns substantial probability to the range of the Y_i's.
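As a small illustration of this default calibration (a sketch; the choice k = 2 below is ours, made only for concreteness):

```python
import numpy as np

def rescale_outputs(y):
    """Shift and rescale y so that the transformed values range from -0.5 to 0.5."""
    return (y - y.min()) / (y.max() - y.min()) - 0.5

def draw_step_heights(K_t, T, k=2.0, rng=np.random.default_rng(0)):
    """Draw K_t iid step heights beta_tk ~ N(0, sigma^2) with sigma = 0.5/(k*sqrt(T))."""
    sigma = 0.5 / (k * np.sqrt(T))
    return rng.normal(0.0, sigma, size=K_t)

y = np.array([3.1, 7.4, 5.0, 6.2])
print(rescale_outputs(y))                 # transformed outputs in [-0.5, 0.5]
print(draw_step_heights(K_t=5, T=200))    # small jumps when the ensemble is large
```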
The BART prior also involves an inverse chi-squared distribution on the residual variance in (1). While the case of random variance can be incorporated in our analysis (de Jonge and van Zanten [2013]), we will for simplicity assume that the residual variance is fixed and equal to one.

Existing theoretical work for Bayesian forests (RP17) is available for a different prior on tree partitions T. Their analysis assumes a hierarchical prior consisting of (a) a prior on the size of a tree K and (b) a uniform prior over trees of size K. This prior is equalitarian in the sense that trees with the same number of leaves are a-priori equally likely regardless of their topology. The prior on the number of leaves K is a very important ingredient for regularization. We will study aspects of its distribution under the actual BART prior in later sections.

3 Bayesian Non-parametrics Lens

One way of assessing the quality of a Bayesian procedure is by studying the learning rate of its posterior, i.e. the speed at which the posterior distribution shrinks around the truth as n → ∞. These statements are ultimately framed in a frequentist way, describing the typical behavior of the posterior under the true generative model P_{f_0}^(n). Posterior concentration rate results have been valuable for the proposal and calibration of priors. In infinite-dimensional parameter spaces, such as the one considered here, seemingly innocuous priors can lead to inconsistencies (Cox [1993],
Diaconis and Freedman [1986]) and far more care has to be exercised to come up with well-behaved priors.

The Bayesian approach requires placing a prior measure Π(·) on F, the parameter space consisting of qualitative guesses of f_0. Given observed data Y^(n) = (Y_1, ..., Y_n)', inference about f_0 is then carried out via the posterior distribution

    Π(A | Y^(n)) = ∫_A Π_{i=1}^n Π_f(Y_i | x_i) dΠ(f) / ∫_F Π_{i=1}^n Π_f(Y_i | x_i) dΠ(f)   for all A ∈ B,

where B is a σ-field on F and where Π_f(Y_i | x_i) is the likelihood function for the output Y_i under f.

In Bayesian non-parametrics, one of the usual goals is determining how fast the posterior probability measure concentrates around f_0 as n → ∞. This speed can be assessed by inspecting the size of the smallest ‖·‖_n-neighborhoods around f_0 that contain most of the posterior probability (Ghosal and van der Vaart [2007]), where ‖f‖_n² = (1/n) Σ_{i=1}^n f(x_i)². For a diameter ε > 0 and some M > 0, we denote with

    A_{ε,M} = {f_{E,B} ∈ F : ‖f_{E,B} − f_0‖_n ≤ M ε}

the Mε-neighborhood centered around f_0. We say that the posterior distribution concentrates at speed ε_n → 0 such that n ε_n² → ∞ when

    Π(A^c_{ε_n,M_n} | Y^(n)) → 0   in P_{f_0}^(n)-probability as n → ∞          (6)

for any M_n → ∞, where A^c stands for the complement of the set A. Posterior consistency statements are a bit weaker, where ε_n in (6) is replaced with a fixed neighborhood ε > 0. We will position our results using ε_n = n^{−ν/(2ν+p)} log^{1/2} n, the near-minimax rate for estimating a p-dimensional ν-smooth function. We will also assume that f_0 is Hölder continuous, i.e. ν-Hölder smooth with 0 < ν ≤ 1. The limitation ν ≤ 1 is an unavoidable consequence of using step functions to approximate smooth f_0 and can be avoided with smooth kernel methods (Linero and Yang [2017]).

The statement (6) can be proved by verifying the following three conditions (suitably adapted from Theorem 4 of Ghosal and van der Vaart [2007]):

    sup_{ε > ε_n} log N(ε/36; A_{ε,1} ∩ F_n; ‖·‖_n) ≤ n ε_n²                     (7)
    Π(A_{ε_n,1}) ≥ e^{−d n ε_n²}                                                 (8)
    Π(F \ F_n) = o(e^{−(d+2) n ε_n²})                                            (9)

for some d > 2. In (7), N(ε; Ω; d) is the ε-covering number of a set Ω for a semimetric d, i.e. the minimal number of d-balls of radius ε needed to cover the set Ω. A few remarks are in order. Condition (9) ensures that the prior zooms in on smaller, and thus more manageable, sets of models F_n by assigning only a small probability outside these sets. Condition (7) is known as "the entropy condition" and controls the combinatorial richness of the approximating sets F_n. Finally, condition (8) requires that the prior charges an ε_n-neighborhood of the true function. Results of type (6) quantify not only the typical distance between a point estimator (posterior mean/median) and the truth, but also the typical spread of the posterior around the truth. These results are typically the first step towards further uncertainty quantification statements.

4 The Galton-Watson Process Prior

The Galton-Watson (GW) process provides a mathematical representation of an evolving population of individuals who reproduce and die subject to laws of chance. Binary tree partitions T under the prior (5) can be thought of as realizations of such a branching process. Below, we review some terminology of branching processes and link them to Bayesian CART. We denote with Z_t the population size at time t (i.e. the number of nodes in the t-th layer of the tree). The process starts at time t = 0 with a single individual, i.e. Z_0 = 1. At time t, each member is split independently of one another into a random number of offspring. Let Y_ti denote the number of offspring produced by the i-th individual of the t-th generation and let g_t(s) be the associated probability generating function. A binary tree is obtained when each node has either zero or two offspring, as characterized by

    g_t(s) = s^0 P(Y_t1 = 0) + s² P(Y_t1 = 2),   0 ≤ s ≤ 1.                      (10)

A homogeneous GW process is obtained when all Y_ti's are iid. A heterogeneous GW process is a generalization where the offspring distribution is allowed to vary according to the generations, i.e. the variables Y_ti are independent but non-identical. The Bayesian CART prior of Chipman et al. [1998] can be framed as a heterogeneous GW process, where the probability of splitting a node (generating offspring) depends on the depth t of the node in the tree. In particular, using (5) one obtains for 0 < α < 1 and γ > 0

    P(Y_t1 = 2) = 1 − P(Y_t1 = 0) = α / (1 + t)^γ.                               (11)

The population size at time t satisfies Z_t = Σ_{i=1}^{Z_{t−1}} Y_ti and its expectation can be written as

    E Z_t = E[E(Z_t | Z_{t−1})] = (2α)^t [(t + 1)!]^{−γ}.

Since E Z_1 < 1 under (11), the process is subcritical and thereby it dies out with probability one. This
means that the random sequence {Z_t} consists of zeros for all but a finite number of t's. The overall number of nodes in the tree (all ancestors in the family pedigree)

    X = Σ_{t=0}^∞ Z_t                                                            (12)

is thus finite with probability one. The number of leaves (bottom nodes) K can be related to X through

    K = (X + 1)/2                                                                (13)

and satisfies

    T_ex + 1 ≤ K ≤ 2^{T_ex},                                                     (14)

where T_ex = min{t : Z_t = 0} is the time of extinction. In (14), we have used the fact that T_ex − 1 is the depth of the tree, where the lower bound is obtained with asymmetric trees with only one node split at each level and the upper bound is obtained with symmetric full binary trees (all nodes are split at each level).
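A short Monte Carlo sketch (our own illustration) makes the quantities Z_t, X and K = (X + 1)/2 concrete and can be used to check the expectation formula E Z_t = (2α)^t [(t + 1)!]^{−γ} empirically; the values α = 0.95 and γ = 2 are the usual BART defaults, used here only as an example.

```python
import numpy as np
from math import factorial

rng = np.random.default_rng(1)

def simulate_layers(alpha=0.95, gamma=2.0, max_gen=100):
    """Simulate the layer sizes Z_0, Z_1, ... of the heterogeneous GW process:
    following (11), Z_t is formed by giving each of the Z_{t-1} members two
    offspring with probability alpha/(1 + t)^gamma (and zero otherwise)."""
    Z, t = [1], 1
    while Z[-1] > 0 and t <= max_gen:
        births = rng.random(Z[-1]) < alpha / (1.0 + t) ** gamma
        Z.append(2 * int(births.sum()))
        t += 1
    return Z

draws = [simulate_layers() for _ in range(20_000)]
X = np.array([sum(Z) for Z in draws])            # total progeny, equation (12)
K = (X + 1) // 2                                 # number of leaves, equation (13)
print("mean Z_2, simulated:", np.mean([Z[2] if len(Z) > 2 else 0 for Z in draws]))
print("mean Z_2, formula:  ", (2 * 0.95) ** 2 * factorial(3) ** -2.0)
print("estimated P(K > 5): ", (K > 5).mean())
```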
Regularization is an essential remedy against overfitting and Bayesian procedures have a natural way of doing so through a prior. In the context of trees, the key regularization element is the prior on the number of bottom leaves K, which is completely characterized by the distribution of total progeny X via (13). Using this connection, in the next section we study the tail bounds of the distribution π(K) implied by the Galton-Watson process.

5 Bayesian Tree Regularization

If we knew ν, the optimal (rate-minimax) choice of the number of tree leaves would be K ≍ K_ν = n^{p/(2ν+p)} (RP17). When ν is unknown, one can do almost as well (sacrificing only a log factor in the convergence rate) using a suitable prior π(K). As noted by Coram and Lalley [2006], the tail behavior of π(K) is critical for controlling the vulnerability/resilience to overfitting. The anticipation is that with smooth f_0, more rapid posterior concentration takes place when π(K) has a heavier tail. However, too heavy tails make it easier to overfit when the true function is less smooth. To achieve an equilibrium, Denison et al. [1998] suggest the Poisson distribution (constrained to N\{0}), which satisfies

    P(K > k) ≲ e^{−C_K k log k}   for some C_K > 0.                              (15)

Under this prior, one can show that P(K > C K_ν | Y^(n)) → 0 in P_{f_0}^(n)-probability (RP17). The posterior thus does not overshoot the oracle K_ν too much.

In the BART prior, the distribution π(K) is implicitly defined through the GW process rather than directly through (15). In order to see whether BART induces a sufficient amount of regularization, we first need to obtain a tail bound of π(K) under the GW process and show that it decays fast enough. One seemingly simple remedy would be to set γ = 0 (which coincides with the homogeneous GW case) and α = c/n with some c > 0. Standard branching process theory then implies Π(K > k) ≲ e^{−C_K k log n}. This prior is more aggressive than (15). Moreover, letting the split probability p_split(Ω_k) decay with sample size is counterintuitive. By choosing α = c, on the other hand, one obtains Π(K > k) ≲ e^{−C_K k}, which is not aggressive enough.

While the homogeneous GW processes have been studied quite extensively, the literature on tail bounds for heterogeneous GW processes (for when γ ≠ 0) has been relatively deserted. We first review one interesting approach in the next section and then come up with a new bound in Section 5.2.

5.1 Tail Bounds à la Agresti

Agresti [1975] obtained bounds for the extinction time distribution of branching processes with independent non-identically distributed environmental random variables Y_ti.

Theorem 5.1. [Agresti, 1975] Consider the heterogeneous Galton-Watson branching process with offspring p.g.f.'s {g_j(s); j ≥ 0} satisfying g_j''(1) < ∞ for j ≥ 0. Denote P_t = Π_{j=0}^{t−1} g_j'(1). Then

    P(T_ex > t) ≤ [ P_t^{−1} + (1/2) Σ_{j=0}^{t−1} g_j''(0)/(g_j'(1) P_{j+1}) ]^{−1}.     (16)

Using this result, we can obtain a tail bound on the extinction time under the Bayesian CART prior.

Corollary 5.1. For the heterogeneous Galton-Watson branching process with offspring p.g.f.'s (10) with (11) we have

    P(T_ex > t) < C_0 ( t^γ / (2α e^γ) )^{−t}                                    (17)

for a positive constant C_0 that depends on α and γ.

Proof. We have g_0(s) = s and for j ≥ 1

    g_j(s) = 1 − α(1 + j)^{−γ} + s² α(1 + j)^{−γ},
    g_j'(s) = 2 s α(1 + j)^{−γ},
    g_j''(s) = 2 α(1 + j)^{−γ}.

Thus we have g_0'(1) = 1 and g_j'(1) = g_j''(0) = 2α(1 + j)^{−γ} for j ≥ 1. Then we can write

    P_t^{−1} = Π_{i=0}^{t−1} (1 + i)^γ / (2α)^t = (t!)^γ / (2α)^t                (18)
and

    Σ_{j=0}^{t−1} g_j''(0)/(g_j'(1) P_{j+1}) = Σ_{j=0}^{t−1} 1/P_{j+1} = Σ_{j=1}^{t} (j!)^γ/(2α)^j > (t!)^γ/(2α)^t.

Using (18) and the fact that t! > (t/e)^t e, we can upper-bound the right hand side of (16) with C_0 [t^γ/(e^γ 2α)]^{−t}.

Remark 5.1. A simpler bound on the extinction time can be obtained using Markov's inequality as follows: P(T_ex > t) = P(Z_t ≥ 1) ≤ E Z_t ≤ (2α)^t [(t + 1)!]^{−γ}.

Using the upper bound in (14) we immediately conclude that

    P(K > k) < P(T_ex > log₂ k) ≤ C_0 ( (log₂ k)^γ / (2α e^γ) )^{−log₂ k}.

This decay, however, is not fast enough as we would ideally like to show (15). We try a different approach in the next section.

5.2 Trees as Random Walks

There is a curious connection between branching processes and random walks (see e.g. Dwass [1969]). Suppose that a binary tree T is revealed in the following node-by-node exploration process: one exhausts all nodes in generation d before revealing nodes in generation d + 1. Namely, nodes are implicitly numbered (and explored) according to their priority and this is done in a top-down manner according to their layer and a left-to-right manner within each layer (i.e. Ω_0 is the root node and, if split, Ω_1 and Ω_2 are the two children (left and right), etc.).

Nodes that are waiting to be explored can be organized in a queue Q. We say that a node is active at time t if it resides in the queue. Starting with one active node at t = 0 (the root node), at each time t we deactivate (remove from Q) the node with the highest priority (lowest index) and add its children to Q. Letting S_t be the number of active nodes at time t, one finds that {S_t} satisfies

    S_t = S_{t−1} − 1 + Y_t,   t ≥ 1,

and S_0 = 1, where the Y_t are sampled from the offspring distribution. For the homogeneous GW process, S_t is an actual random walk where the Y_t are iid with probability generating function (10). For the heterogeneous GW process, S_t is not strictly a random walk in the sense that the Y_t's are not iid. Nevertheless, using this construction one can see that the total population X equals the first time the queue is empty:

    X = min{t ≥ 0 : S_t = 0}.

Linking Galton-Watson trees to random walk excursions in this way, one can obtain a useful tail bound of the distribution of the population size X. While perhaps not surprising, we believe that this bound is new, as we could not find any equivalent in the literature.

Lemma 5.1. Denote by X the total population size (12) arising from the heterogeneous Galton-Watson process. Then we have for any c > 0

    P(X > k) ≤ e^{−k c + (e^{2c} − 1) μ},                                        (19)

where μ = Σ_{i=1}^k p_i and p_i = p_split(Ω_i), where nodes Ω_i are ordered in a top-down left-to-right fashion.

Proof. For k > 0, we can write

    P(X > k) ≤ P(S_k > 0) = P( Σ_{i=1}^k Y_i > k − 1 ),

where X is the number of all nodes (internal and external) in the tree and Y_i has a two-point distribution characterized by P(Y_i = 2) = 1 − P(Y_i = 0) = p_i. Using the Chernoff bound, one deduces that for any c > 0

    P( Σ_{i=1}^k Y_i > k − 1 ) ≤ e^{−k c} E[ e^{c Σ_{i=1}^k Y_i} ] = e^{−k c} Π_{i=1}^k [ p_i e^{2c} + 1 − p_i ] ≤ e^{−k c + (e^{2c} − 1) μ},

where μ = Σ_{i=1}^k p_i.

The goal throughout this section has been to understand whether the Bayesian CART prior of Chipman et al. [1998] yields (15) for some C_K > 0. The prior assumes p_i = α/(1 + d(Ω_i))^γ. Choosing c = (log k)/2 in (19), the right hand side will be smaller than e^{−a k log k}, for some suitable 0 < a < 1/2, as long as μ ≤ (1/2 − a) log k. We note that

    μ = Σ_{i=1}^k p_i < Σ_{d=1}^{⌈log₂ k⌉} 2^d α/(1 + d)^γ.

Because the split probability p_i decreases only polynomially in the depth of Ω_i, this is not enough to ensure μ < (1/2 − a) log k. The optimal decay, however, will be guaranteed if we instead choose

    p_split(Ω) ∝ α^{d(Ω)}   for some 0 < α < 1/2.                                (20)

To conclude, from our considerations it is not clear that the Bayesian CART prior of Chipman et al. [1998] has the optimal tail-bound decay. The following corollary certifies that the optimal tail behavior can be obtained with a suitable modification of p_split(Ω).
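As an empirical companion to this discussion (a Monte Carlo sketch of our own, not part of the paper's argument), the node-by-node exploration above translates directly into a sampler for the total progeny X, which can be used to estimate the prior tail P(K > k) under the original split probability (5) and under the modification (20); the parameter values below are illustrative only.

```python
from collections import deque
import numpy as np

rng = np.random.default_rng(4)

def num_leaves(p_split, max_nodes=100_000):
    """Explore a tree node by node (top-down, left-to-right) as in Section 5.2:
    the queue size plays the role of S_t, and the number of explored nodes when
    the queue first empties is the total progeny X.  Returns K = (X + 1)/2."""
    queue, explored = deque([0]), 0          # queue holds node depths; root has depth 0
    while queue and explored < max_nodes:    # cap only guards against runaway trees
        d = queue.popleft()
        explored += 1
        if rng.random() < p_split(d):        # Y = 2: node splits into two children
            queue.extend([d + 1, d + 1])
    return (explored + 1) // 2

original = lambda d: 0.95 / (1.0 + d) ** 2.0   # split probability (5), default alpha, gamma
modified = lambda d: 0.45 ** d                 # split probability (20) with alpha = 0.45

for name, prior in [("original (5)", original), ("modified (20)", modified)]:
    K = np.array([num_leaves(prior) for _ in range(20_000)])
    print(name, ": P(K > 10) ≈", (K > 10).mean(), ",  P(K > 20) ≈", (K > 20).mean())
```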
Figure 1: The k-d trees in two dimensions at various resolution levels.

Corollary 5.2. Under the Bayesian CART prior of Chipman et al. [1998] with (20), we obtain (15).

Proof. Follows from the considerations above and from (13).

6 Prior Concentration for BART

One of the prerequisites for optimal posterior concentration (6) is optimal prior concentration (Condition (8)). This condition ensures that there is enough prior support around the truth. It can be verified by constructing one approximating tree and by showing that it has enough prior mass. RP17 use the k-d approximating tree (Remark 3.1), which is a balanced full binary tree which partitions [0,1]^p into nearly identical rectangles (in sufficiently regular designs). This tree can be regarded as the most regular partition that can be obtained by splitting at observed values and has been studied rigorously in the literature (see e.g. Verma et al. [2009]). A formal definition of the k-d tree is below and a few two-dimensional examples¹ (at various resolution levels) are in Figure 1.

¹ Source: https://salzis.wordpress.com/2014/06/28/

Definition 6.1. (k-d tree partition) The k-d tree partition with p × s layers is constructed by cycling s times over coordinate directions {1, ..., p}. At each tree level, all nodes are split along the same axis. For a given direction j ∈ {1, ..., p}, each internal node will be split at a median of the point set (along the j-th axis). Each split thus roughly halves the number of points inside the cell.

After s rounds of splits on each variable, all K terminal nodes have at least ⌊n/K⌋ observations, where K = 2^{s p}.
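To make Definition 6.1 concrete, the following sketch (our own illustration) builds the k-d tree partition over a point cloud by cycling through the coordinates and splitting every cell at the median of the points it contains:

```python
import numpy as np

def kd_tree_partition(X, s):
    """Return the cells (as boolean masks over the rows of X) of the k-d tree
    partition with p*s layers: cycle s times over the p coordinates and split
    each cell at the median along the current coordinate (Definition 6.1)."""
    n, p = X.shape
    cells = [np.ones(n, dtype=bool)]
    for layer in range(p * s):
        j = layer % p                        # all nodes in a layer split along the same axis
        new_cells = []
        for mask in cells:
            c = np.median(X[mask, j])
            new_cells.append(mask & (X[:, j] <= c))
            new_cells.append(mask & (X[:, j] > c))
        cells = new_cells
    return cells                             # K = 2^(p*s) cells

rng = np.random.default_rng(3)
X = rng.uniform(size=(256, 2))
cells = kd_tree_partition(X, s=2)            # K = 2^(2*2) = 16 cells
print(len(cells), "cells; points per cell:", sorted(int(m.sum()) for m in cells))
```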

The k-d tree partitions are thus balanced in light of Definition 2.4 of Rockova and van der Pas [2017] (i.e. they have roughly the same number of observations inside). The k-d tree construction is instrumental in establishing optimal prior/posterior concentration. Lemma 3.2 of RP17 shows that there exists a step function supported by a k-d partition that safely approximates f_0 with an error smaller than a constant multiple of the minimax rate. The approximating k-d tree partition, denoted with T̂, has K̂ steps, where K̂ ≍ n ε_n²/log n when p ≲ log^{1/2} n (as shown in Section 8.3 of RP17 and detailed in the proof of Theorem 7.1).

In order to complete the proof of posterior concentration for the Bayesian CART under the Galton-Watson process prior, we need to show that π(T̂) ≥ e^{−c_1 n ε_n²} for some c_1 > 0. This is verified in the next lemma.

Lemma 6.1. Denote with T̂ the k-d tree partition described above. Assume the heterogeneous Galton-Watson process tree prior with p_split(Ω_k) ∝ α^{d(Ω_k)} for some suitable 1/n ≤ α < 1/2. Assume p ≲ log^{1/2} n. Then we have for some suitable c_1 > 0

    π(T̂) ≥ e^{−c_1 n ε_n²}.

Proof. By construction, the k-d tree T̂ has K̂ = 2^{p×s} leaves and p × s layers for some s ∈ N, where p is the number of predictors. In addition, the k-d tree is complete and balanced (i.e. every layer d, including the last one, has the maximal number 2^d of nodes). Since there are K̂ − 1 internal nodes and each splitting rule is chosen with probability at least 1/(p n), we have

    π(T̂) ≥ (1 − α^{s p})^{K̂} / (p n)^{K̂−1} · Π_{d=0}^{log₂ K̂ − 1} α^{2^d}
          ≥ (1 − α^{s p})^{K̂} / (p n)^{K̂−1} · α^{K̂−1}
          ≥ [α(1 − α)]^{K̂} (1/(p n))^{K̂−1}
          > e^{−K̂ log(2n) − (K̂−1) log(p n)}.

Since p ≲ log^{1/2} n and K̂ ≍ n ε_n²/log n, we can lower-bound the above with e^{−c_1 n ε_n²} for some c_1 > 0.

For the actual BART method (similarly as in Theorem 5.1 of RP17), one needs to find an approximating tree ensemble and show that it has enough prior support. The approximating ensemble can be found in Lemma 10.1 of RP17 and consists of Ê = {T̂_1, ..., T̂_T} tree partitions obtained by chopping off branches of T̂. The number of trees T is fixed and the trees T̂_t will not overlap much when 1 ≤ T ≤ K̂/2. The default BART choice T = 200 safely satisfies this as long as p > 9. The little trees T̂_t have K̂_t leaves and satisfy
log₂ K̂ + 1 ≤ K̂_t ≤ K̂ (depending on the choice of T). Using Lemma 6.1 and the fact that the trees are a-priori independent (from (4)) and that T is fixed, we then obtain

    π(Ê) ≥ e^{−Σ_{t=1}^T [K̂_t log 2n + (K̂_t − 1) log(p n)]} > e^{−T K̂ log 2n − T(K̂−1) log(p n)} > e^{−c_2 n ε_n²}

for some c_2 > 0. The BART prior thus concentrates enough mass around the truth. Condition (8) also requires verification that the prior on jump sizes concentrates around the forest sitting on Ê. This follows directly from Section 9.2 of RP17. We detail the steps in the proof of Theorem 7.1.

7 Posterior Concentration for BART

We now have all the ingredients needed to state the posterior concentration result for BART. The result is different from Theorem 5.1 of RP17 because here we (a) assume that T is fixed, (b) assume the branching process prior on T and (c) do not have subset selection uncertainty. We will treat the design as fixed and regular according to Definition 3.3 of RP17.

Definition 7.1. Denote by T̂ = {Ω̂_k}_{k=1}^K the k-d tree with K = 2^{s×p} bottom nodes. We say that a dataset {x_i}_{i=1}^n is regular if

    max_{1≤k≤K} diam(Ω̂_k) < M Σ_{k=1}^K μ(Ω̂_k) diam(Ω̂_k)                       (21)

for all s ∈ N and for some M > 0, where

    diam(Ω̂_k) = max_{x,y ∈ Ω̂_k ∩ {x_i}_{i=1}^n} ‖x − y‖₂.

The design regularity assumption translates as follows: the data points inside the cells of a k-d tree (Figure 1) have similar diameters (i.e. maximal interpoint distances inside the cell). The k-d tree is the most regular partition one can obtain by splitting at data points. If the cell diameters of the k-d tree are similar, the dataset is without outliers and/or isolated clouds of points. This is a mild and reasonable requirement (see RP17 for more discussion). Moreover, this assumption allows for correlated x's and holds trivially for fixed designs on a regular grid.

Theorem 7.1. (Posterior Concentration for BART) Assume that f_0 is ν-Hölder continuous with 0 < ν ≤ 1, where ‖f_0‖_∞ ≲ log^{1/2} n. Assume a regular design {x_i}_{i=1}^n where p ≲ log^{1/2} n. Assume the BART prior with T fixed and with p_split(Ω_t) = α^{d(Ω_t)} for 1/n ≤ α < 1/2. With ε_n = n^{−ν/(2ν+p)} log^{1/2} n we have

    Π( f_{E,B} ∈ F : ‖f_0 − f_{E,B}‖_n > M_n ε_n | Y^(n) ) → 0

for any M_n → ∞ in P_{f_0}^(n)-probability, as n, p → ∞.

Proof. Appendix.

Theorem 7.1 has very important implications. It provides a frequentist theoretical justification for BART, claiming that the posterior is wrapped around the truth and its learning rate is near-optimal. As a by-product, one also obtains a statement which supports the empirical observation that BART is resilient to overfitting.

Corollary 7.1. Under the assumptions of Theorem 7.1 we have

    Π( ∪_{t=1}^T {K_t > C n^{p/(2ν+p)}} | Y^(n) ) → 0

in P_{f_0}^(n)-probability, as n, p → ∞, for a suitable constant C > 0.

Proof. The proof follows from the proof of Theorem 7.1 and Lemma 1 of Ghosal and van der Vaart [2007].

In other words, the posterior distribution rewards ensembles that consist of small trees whose size does not overshoot the optimal number of steps K_ν = n^{p/(2ν+p)} by much. In this way, the posterior is fully adaptive to unknown smoothness, not overfitting in the sense of split overuse.

8 Discussion

In this work, we propose a variant of the BART prior by modifying the split probability. This new prior (a) is shown to yield optimal posterior concentration (in the ‖·‖_n sense), (b) can be just as easily implemented and (c) might enhance the performance of BART in practice.

We have built on Rockova and van der Pas [2017] to show the optimal posterior convergence rate of the BART method. Similar results have been obtained for other Bayesian non-parametric constructions such as Polya trees (Castillo [2017]), Gaussian processes (van der Vaart and van Zanten [2008], Castillo [2008]) and deep ReLU neural networks (Polson and Rockova [2018]). Up to now, the increasing popularity of BART has relied on its practical performance across a wide variety of problems. The goal of this and future theoretical developments is to establish BART as a rigorous statistical tool with solid theoretical guarantees. Similar guarantees have been obtained for variants of the traditional forests/trees by multiple authors including Gordon and Olshen [1980, 1984], Donoho [1997], Biau et al. [2008], Scornet et al. [2015], Wager and Guenther [2015].
References

A. Agresti. On the extinction times of varying and random environment branching processes. Journal of Applied Probability, 12(1):39–46, 1975.

G. Biau, L. Devroye, and G. Lugosi. Consistency of random forests and other averaging classifiers. The Journal of Machine Learning Research, 9:2015–2033, 2008.

J. Bleich, A. Kapelner, E.I. George, and S.T. Jensen. Variable selection for BART: an application to gene regulation. The Annals of Applied Statistics, 4(3):1750–1781, 2014.

L. Breiman, J. Friedman, C.J. Stone, and R.A. Olshen. Classification and Regression Trees (Wadsworth Statistics/Probability). Chapman and Hall/CRC, 1984.

I. Castillo. Lower bounds for posterior rates with Gaussian process priors. Electronic Journal of Statistics, 2:1281–1299, 2008.

I. Castillo. Pólya tree posterior distributions on densities. In Annales de l'Institut Henri Poincaré, Probabilités et Statistiques, volume 53, pages 2074–2102. Institut Henri Poincaré, 2017.

H. Chipman, E.I. George, and R.E. McCulloch. BART: Bayesian additive regression trees. The Annals of Applied Statistics, 4(1):266–298, 2010.

H.A. Chipman, E.I. George, and R.E. McCulloch. Bayesian CART model search. Journal of the American Statistical Association, 93(443):935–948, 1998.

H.A. Chipman, E.I. George, R.E. McCulloch, and T.S. Shively. High-dimensional nonparametric monotone function estimation using BART. arXiv preprint arXiv:1612.01619, 2016.

M. Coram and S.P. Lalley. Consistency of Bayes estimators of a binary regression function. The Annals of Statistics, 34(3):1233–1269, 2006.

D.D. Cox. An analysis of Bayesian inference for nonparametric regression. The Annals of Statistics, pages 903–923, 1993.

R. de Jonge and J.H. van Zanten. Semiparametric Bernstein-von Mises for the error standard deviation. Electronic Journal of Statistics, 7(1):217–243, 2013.

D. Denison, B. Mallick, and A. Smith. A Bayesian CART algorithm. Biometrika, 85(2):363–377, 1998.

P. Diaconis and D. Freedman. On the consistency of Bayes estimates. The Annals of Statistics, 14(1):1–26, 1986.

D. Donoho. CART and best-ortho-basis: a connection. Annals of Statistics, 25:1870–1911, 1997.

M. Dwass. The total progeny in a branching process and a related random walk. Journal of Applied Probability, 6(3):682–686, 1969.

S. Ghosal and A.W. van der Vaart. Convergence rates of posterior distributions for noniid observations. The Annals of Statistics, 35(1):192–223, 2007.

L. Gordon and R. Olshen. Consistent nonparametric regression from recursive partitioning schemes. Journal of Multivariate Analysis, 10:611–627, 1980.

L. Gordon and R. Olshen. Almost sure consistent nonparametric regression from recursive partitioning schemes. Journal of Multivariate Analysis, 15:147–163, 1984.

R.B. Gramacy and H. Lee. Bayesian treed Gaussian process models with an application to computer modeling. Journal of the American Statistical Association, 103(483):1119–1130, 2008.

P.R. Hahn, J.S. Murray, and C.M. Carvalho. Bayesian regression tree models for causal inference: regularization, confounding, and heterogeneous effects. 2017.

J.L. Hill. Bayesian nonparametric modeling for causal inference. Journal of Computational and Graphical Statistics, 20(1):217–240, 2011.

B. Lakshminarayanan, D. Roy, and Y.W. Teh. Top-down particle filtering for Bayesian decision trees. In International Conference on Machine Learning, 2013.

B. Lakshminarayanan, D.M. Roy, and Y.W. Teh. Mondrian forests: Efficient online random forests. In Advances in Neural Information Processing Systems (NIPS), 2014.

A.R. Linero and Y. Yang. Bayesian regression tree ensembles that adapt to smoothness and sparsity. arXiv preprint arXiv:1707.09461, 2017.

Y. Liu, V. Rockova, and Y. Wang. ABC variable selection with Bayesian forests. arXiv preprint arXiv:1806.02304, 2018.

N. Polson and V. Rockova. Posterior concentration for sparse deep learning. Advances in Neural Information Processing Systems (NIPS), 2018.

M. Pratola, H.A. Chipman, E.I. George, and R.E. McCulloch. Heteroscedastic BART using multiplicative regression trees. arXiv preprint arXiv:1709.07542, 2017.

V. Rockova and S.L. van der Pas. Posterior concentration for Bayesian regression trees and their ensembles. arXiv preprint arXiv:1708.08734, 2017.

D.M. Roy and Y.W. Teh. The Mondrian process. In Advances in Neural Information Processing Systems (NIPS), 2008.

E. Scornet, G. Biau, and J.P. Vert. Consistency of random forests. Annals of Statistics, 43:1716–1741, 2015.

M.A. Taddy, R.B. Gramacy, and N.G. Polson. Dynamic trees for learning and design. Journal of the American Statistical Association, 106(493):109–123, 2011.

S. van der Pas and V. Rockova. Bayesian dyadic trees and histograms for regression. Advances in Neural Information Processing Systems (NIPS), 2017.

A.W. van der Vaart and J.H. van Zanten. Rates of contraction of posterior distributions based on Gaussian process priors. The Annals of Statistics, 36(3):1435–1463, 2008.

N. Verma, S. Kpotufe, and S. Dasgupta. Which spatial partition trees are adaptive to intrinsic dimension? In Proceedings of the twenty-fifth conference on uncertainty in artificial intelligence, pages 565–574. AUAI Press, 2009.

S. Wager and W. Guenther. Adaptive concentration of regression trees with application to random forests. Manuscript, 2015.
