XGBoost: A Scalable Tree
Boosting System
Tianqi Chen and Carlos Guestrin, University of Washington
XGBoost
eXtreme Gradient Boosting
Of the 29 Kaggle challenge winning solutions published in 2015, 17 used XGBoost
8 of these used XGBoost alone; most of the others combined XGBoost with deep neural nets
KDDCup 2015
Every single top 10 finisher used XGBoost
XGBoost Applications
Store sales prediction
High energy physics event classification
Web text classification
Customer behavior prediction
Motion detection
Ad click through rate prediction
Malware classification
Product categorization
Hazard risk prediction
Massive on-line course dropout rate prediction
Properties of XGBoost
Single most important factor in its success: scalability
Due to several important systems and algorithmic optimizations
1. Highly scalable end-to-end tree boosting system
2. Theoretically justified weighted quantile sketch for efficient proposal calculation
3. Novel sparsity-aware algorithm for parallel tree learning
4. Effective cache-aware block structure for out-of-core tree learning
What is “tree boosting”?
Given a dataset with n examples and m features
A tree ensemble uses K additive functions to predict the output
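Written out, this is Eq. 1 of the paper:

```latex
% Tree ensemble: the sum of K regression trees, each mapping x_i to a leaf weight
\hat{y}_i = \phi(x_i) = \sum_{k=1}^{K} f_k(x_i), \qquad f_k \in \mathcal{F},
\quad \mathcal{F} = \{\, f(x) = w_{q(x)} \,\} \ \ \big(q : \mathbb{R}^m \to T,\ w \in \mathbb{R}^T\big)
```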
What is “gradient boosting”?
Minimize a regularized objective: training loss plus a penalty on tree complexity
Take a 2nd-order approximation of the objective at each boosting step and remove constant terms
This yields a scoring function to evaluate the quality of a tree structure
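For reference, the corresponding formulas from the paper:

```latex
% Regularized objective: training loss plus tree-complexity penalty
\mathcal{L}(\phi) = \sum_i l(\hat{y}_i, y_i) + \sum_k \Omega(f_k),
\qquad \Omega(f) = \gamma T + \tfrac{1}{2}\lambda \lVert w \rVert^2

% 2nd-order approximation at step t with constants removed
% (g_i and h_i are the first and second derivatives of the loss)
\tilde{\mathcal{L}}^{(t)} = \sum_{i=1}^{n}\Big[g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^{2}(x_i)\Big] + \Omega(f_t)

% Scoring function for a fixed tree structure q, with I_j the instance set of leaf j
\tilde{\mathcal{L}}^{(t)}(q) = -\tfrac{1}{2}\sum_{j=1}^{T}
  \frac{\big(\sum_{i\in I_j} g_i\big)^2}{\sum_{i\in I_j} h_i + \lambda} + \gamma T
```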
Split-finding algorithms
Exact greedy
Computationally demanding
Enumerates all possible splits for continuous features
Approximate
Proposes candidate splits according to percentiles of the feature distributions
Maps continuous features to buckets split by the candidate points
Aggregates statistics and finds the best split among the proposals (see the sketch below)
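A minimal sketch of the exact greedy search on a single feature, scoring each candidate split with the gain formula from the paper (function and variable names here are illustrative, not from the XGBoost code base):

```python
import numpy as np

def best_split_on_feature(feature_values, grad, hess, lam=1.0, gamma=0.0):
    """Exact greedy split search on one feature (NumPy arrays as inputs).

    Scans candidate splits in sorted feature order and scores each with
    gain = 1/2 * [G_L^2/(H_L+lam) + G_R^2/(H_R+lam) - (G_L+G_R)^2/(H_L+H_R+lam)] - gamma
    """
    order = np.argsort(feature_values)
    g, h = grad[order], hess[order]
    G, H = g.sum(), h.sum()                  # gradient/hessian totals for the node
    G_L = H_L = 0.0
    best_gain, best_threshold = -np.inf, None

    for i in range(len(order) - 1):          # candidate split between positions i and i+1
        G_L += g[i]
        H_L += h[i]
        G_R, H_R = G - G_L, H - H_L
        gain = 0.5 * (G_L**2 / (H_L + lam) + G_R**2 / (H_R + lam)
                      - G**2 / (H + lam)) - gamma
        if gain > best_gain:
            best_gain = gain
            best_threshold = 0.5 * (feature_values[order[i]]
                                    + feature_values[order[i + 1]])
    return best_gain, best_threshold
```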
Comparison of split-finding
Two variants of the approximate algorithm
Global: candidate splits are proposed once, during initial tree construction, and reused at all levels
Local: candidates are re-proposed after each split
Shrinkage and column subsampling
Shrinkage
Scales newly added weights by a factor η
Reduces influence of each individual tree
Leaves space for future trees to improve model
Similar to learning rate in stochastic optimization
Column subsampling
Subsample features
Used in Random Forests
Prevents overfitting more effectively than row-sampling
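A usage example with the xgboost Python package; `eta` (shrinkage) and `colsample_bytree` (column subsampling) are the library's parameter names, while the data below is synthetic and purely illustrative:

```python
import numpy as np
import xgboost as xgb

# Synthetic regression data, just for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = X[:, 0] * 2.0 + rng.normal(scale=0.1, size=1000)

dtrain = xgb.DMatrix(X, label=y)
params = {
    "objective": "reg:squarederror",
    "eta": 0.1,               # shrinkage: scale each new tree's weights by 0.1
    "colsample_bytree": 0.8,  # column subsampling: use 80% of features per tree
    "max_depth": 4,
}
booster = xgb.train(params, dtrain, num_boost_round=100)
```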
Sparsity-aware split finding
Treats missing values and other sources of sparsity (frequent zeros, one-hot encoding) uniformly
Learns a “default” direction at each tree node; entries with missing values follow it
Split enumeration only visits the non-missing entries
Compare to a naive “dense” method that iterates over every entry (see the sketch below)
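A simplified sketch of the idea (my own illustration, not the paper's algorithm verbatim): evaluate each candidate threshold on the non-missing entries only, then send the missing entries' gradient mass left or right, whichever scores higher, and record that choice as the default direction.

```python
import numpy as np

def structure_score(G_L, H_L, G_R, H_R, lam):
    # Split quality up to constants, following the paper's gain formula.
    return G_L**2 / (H_L + lam) + G_R**2 / (H_R + lam)

def split_with_default(x, grad, hess, threshold, lam=1.0):
    """Evaluate one candidate threshold on a feature containing NaNs.

    Non-missing entries are partitioned by the threshold; the gradient mass
    of the missing entries is then sent either left or right, and the better
    of the two choices becomes the learned default direction for this split.
    """
    present = ~np.isnan(x)
    go_left = present & (x < threshold)
    go_right = present & ~go_left

    G_L, H_L = grad[go_left].sum(), hess[go_left].sum()
    G_R, H_R = grad[go_right].sum(), hess[go_right].sum()
    G_miss, H_miss = grad[~present].sum(), hess[~present].sum()

    score_left = structure_score(G_L + G_miss, H_L + H_miss, G_R, H_R, lam)
    score_right = structure_score(G_L, H_L, G_R + G_miss, H_R + H_miss, lam)

    if score_left >= score_right:
        return score_left, "left"
    return score_right, "right"
```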
How does this work?
Features need to be in sorted order to determine splits
Concept of blocks
Compressed column (CSC) format
Each column sorted by corresponding feature value
Exact greedy algorithm: all the data in a single block
Data are sorted once before training and used subsequently in this format
Feature transformations in blocks
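A toy sketch of the column-block idea, assuming NumPy inputs (class and method names are mine; the real implementation lives in XGBoost's C++ core):

```python
import numpy as np

class SortedColumnBlock:
    """Toy version of the column block: for each feature, store the
    (feature value, row index) pairs sorted by value, computed once up front.

    Split finding can then do a single linear scan per feature and look up
    each row's gradient statistics by its stored row index.
    """
    def __init__(self, X):
        self.columns = []
        for j in range(X.shape[1]):
            col = X[:, j]
            present = ~np.isnan(col)          # missing entries are not stored
            rows = np.where(present)[0]
            order = np.argsort(col[present])
            self.columns.append((col[present][order], rows[order]))

    def scan(self, j, grad, hess):
        """Linear scan over feature j in sorted order, yielding running
        gradient/hessian sums for split evaluation."""
        values, rows = self.columns[j]
        G = H = 0.0
        for v, r in zip(values, rows):
            G += grad[r]
            H += hess[r]
            yield v, G, H
```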
More on blocks
Data is stored in multiple blocks, and these blocks reside on disk
An independent thread pre-fetches blocks into an in-memory buffer, so computation overlaps with disk reading
Block Compression
Each column is compressed before being written to disk, and decompressed on-the-fly when
read from disk into a prefetched buffer
Cuts down on disk I/O
Block Sharding
Data is sharded across multiple disks in an alternating manner (when several disks are available)
A pre-fetcher thread is assigned to each disk to read its data into an in-memory buffer
Cache-aware access
Exact Greedy Algorithm Approximate Algorithms
Allocate an internal buffer in each thread Choice of block size is critical
Fetch gradient statistics Small block size results in small workloads
for each thread
Perform accumulation in mini-batch
Large block size results in cache misses as
Reduces runtime overhead when number
gradient statistics do not fit in cache
of rows is large
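A rough sketch of the mini-batch accumulation trick for the exact algorithm (my own illustration, not XGBoost's actual implementation): gather the needed gradient statistics into a small contiguous buffer before accumulating, instead of accumulating through scattered row indices directly.

```python
import numpy as np

def accumulate_cache_aware(sorted_rows, grad, hess, batch_size=4096):
    """Accumulate gradient statistics for rows visited in feature-sorted order.

    The row indices are non-contiguous, so each mini-batch is first gathered
    into a small contiguous buffer (cache-friendly), then summed.
    """
    G = H = 0.0
    buf_g = np.empty(batch_size)
    buf_h = np.empty(batch_size)
    for start in range(0, len(sorted_rows), batch_size):
        idx = sorted_rows[start:start + batch_size]
        n = len(idx)
        buf_g[:n] = grad[idx]   # gather into the contiguous buffer
        buf_h[:n] = hess[idx]
        G += buf_g[:n].sum()
        H += buf_h[:n].sum()
    return G, H
```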
Cache-aware access: results for the exact and approximate algorithms
Results: out of core
Results: distributed
Results: scalability
Demonstration
https://arogozhnikov.github.io/2016/06/24/gradient_boosting_explained.html
Conclusions
Novel sparsity-aware algorithm for handling sparse data
Theoretical guarantees for weighted quantile sketching for approximate learning
Cache access patterns, data compression, and data sharding techniques
http://arxiv.org/abs/1603.02754