XGBoost tree boosting for machine learning
When reading the AutoML benchmark I saw that XGBoost was one of the most recommended models, so I decided to learn about it and read this paper:
XGBoost: A Scalable Tree Boosting System. Tianqi Chen, Carlos Guestrin, KDD ’16, August 13-17, 2016, San Francisco, CA, USA https://dl.acm.org/doi/10.1145/2939672.2939785
The article dates from 2016 and reports that at the time the algorithm had already been used successfully in many competitions, obtaining top placements. Among the 29 challenge-winning solutions published on Kaggle's blog during 2015, 17 used XGBoost. Of these, eight used XGBoost alone to train the model, while most of the others combined XGBoost with neural nets. A similar situation is reported for the KDD Cup 2015. The algorithm scales up to billions of training samples; it is open source, available at https://github.com/dmlc/xgboost, and still actively developed.
The algorithm consists in finding a set of trees: in each tree an internal node holds a condition on the input features, and each leaf holds a real number. Given a new input, each tree is evaluated and yields one weight, and all the weights are summed to give the final result, i.e. the classification/prediction we are looking for. Given this structure, it is easy to see that the evaluation can be parallelized very effectively, giving good performance.
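To make this concrete, here is a minimal sketch of how such an ensemble could be evaluated. Names and structure are my own illustration, not the library's actual code:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Node:
    # internal node: tests "x[feature] < threshold"; leaf: holds a weight
    feature: Optional[int] = None
    threshold: Optional[float] = None
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    weight: float = 0.0

def tree_predict(node: Node, x: List[float]) -> float:
    # walk down the tree until a leaf, then return its weight
    while node.feature is not None:
        node = node.left if x[node.feature] < node.threshold else node.right
    return node.weight

def ensemble_predict(trees: List[Node], x: List[float]) -> float:
    # the prediction is the sum of one leaf weight per tree, so every
    # tree can be evaluated independently (and hence in parallel)
    return sum(tree_predict(tree, x) for tree in trees)

# a one-split tree: x[0] < 0.5 ? +0.3 : -0.1
tree = Node(feature=0, threshold=0.5,
            left=Node(weight=0.3), right=Node(weight=-0.1))
print(ensemble_predict([tree, tree], [0.2]))  # 0.6
```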
The paper describes many optimizations made to make the algorithm fast. It is quite common for input data to be sparse, for instance because one-hot encoding has been used, or because many values are absent. This has been taken into account, and the authors report that the sparsity-aware algorithm runs 50 times faster than the basic version. The input data are prepared by sorting them in ascending order and dividing the inputs into blocks. Choosing the right block size avoids cache misses in data access, and also allows the analysis to run on multiple cores/servers. Prepared data can also be reused many times, avoiding paying the preparation cost again and again. The data blocks can be compressed and the work divided between reading threads and processing threads: the reading threads read and decompress the blocks, while the processing threads process them. The authors also suggest sharding the input data over multiple disks, to parallelize the access and improve performance. For instance the authors report:
“We can find that compression helps to speed up computation by factor of three, and sharding into two disks further gives 2x speedup. For this type of experiment, it is important to use a very large dataset to drain the system file cache for a real out-of-core setting… Our final method is able to process 1.7 billion examples on a single machine.”
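Going back to the pre-sorted blocks, here is a rough numpy sketch of the idea as I understand it; it only illustrates the one-time sorting, not the actual compressed sparse-column block format:

```python
import numpy as np

# toy stand-in for the input matrix: n samples, m features
X = np.random.rand(1000, 5)

# pay the sorting cost once: for every feature, store the row indices
# in ascending order of that feature's value; split finding can then
# scan each column in sorted order at every iteration without re-sorting
sorted_idx = np.argsort(X, axis=0)   # shape (n, m), one column per feature

# e.g. the rows of X ordered by feature 2:
rows_by_feature_2 = sorted_idx[:, 2]
```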
All these details testify to the amount of work and interest behind this library.
But how does this library build a model from the input data? Let's first introduce some notation:
n is the number of input samples
m is the number of input features (dimensions)
Each input sample is composed of <x_1, x_2, …, x_m, y> where y is the value to be predicted
i is the index over the n samples
K is the number of trees used to predict y
T is the number of leaves in a tree
w is the weight attached to a leaf
t is the learning iteration
The function that is optimized by this algorithm is
$$ \mathcal{L}(\phi) = \sum_{i} l(\hat{y}_i, y_i) + \sum_{k} \Omega(f_k), \qquad \Omega(f) = \gamma T + \frac{1}{2} \lambda \lVert w \rVert^2 $$
So the loss is the difference between the actual and predicted values plus a regularization term that penalizes the number of leaves T used in the trees and the square of the leaf weights. This regularization was added to penalize complex models with large weights, and it is effective at preventing overfitting.
The algorithm iteratively adds one new tree at each step, choosing the tree that most reduces the loss function L.
$$ \mathcal{L}^{(t)} \simeq \sum_{i=1}^{n} \left[ l(y_i, \hat{y}_i^{(t-1)}) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t) $$
Above you see the approximation used in the algorithm: g_i and h_i are the first and second order derivatives of the loss with respect to the previous prediction ŷ^(t−1), and f_t is the tree function we are looking for at iteration t. Frankly I do not understand how you can get these derivatives; I will have to read the bibliography to understand how it is possible to calculate them.
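That said, for a simple loss the derivatives have a closed form: they are taken with respect to the previous iteration's prediction, which is already known, so g_i and h_i end up being plain numbers per sample. A sketch for the squared-error loss l(y, ŷ) = (y − ŷ)², with my own helper name, not an XGBoost API:

```python
import numpy as np

def squared_error_grad_hess(y: np.ndarray, y_prev: np.ndarray):
    # for l(y, p) = (y - p)^2, evaluated at the previous prediction p:
    # g = dl/dp = 2 (p - y), h = d2l/dp2 = 2
    g = 2.0 * (y_prev - y)
    h = 2.0 * np.ones_like(y)
    return g, h

# at iteration t the previous prediction is known, so g_i and h_i are
# concrete per-sample numbers
y = np.array([1.0, 0.0, 3.0])
y_prev = np.array([0.5, 0.2, 2.0])
g, h = squared_error_grad_hess(y, y_prev)
```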
After some steps the authors show how to calculate the optimal weight for a new tree leaf, and how much a proposed tree will reduce L:
$$ w_j^{*} = - \frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda} $$

$$ \tilde{\mathcal{L}}^{(t)}(q) = - \frac{1}{2} \sum_{j=1}^{T} \frac{\left( \sum_{i \in I_j} g_i \right)^2}{\sum_{i \in I_j} h_i + \lambda} + \gamma T $$
Here I_j is the set of inputs that end up in the j-th leaf of the proposed tree. The algorithm iteratively builds a new tree by adding more and more splits, calculates the improvement of each candidate, and in the end chooses the best tree. To decide the splits the authors introduce a quantile-based approach, so that candidate split points are spread evenly over the data distribution.
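To see how the formulas above turn into a split search, here is a toy version of the exact greedy scan over one feature; it follows the paper's gain formula but is my own simplified sketch (dense data, no quantiles, duplicate feature values ignored):

```python
import numpy as np

def split_gain(G_L, H_L, G_R, H_R, lam, gamma):
    # loss reduction of a split:
    # 1/2 [G_L^2/(H_L+lam) + G_R^2/(H_R+lam) - (G_L+G_R)^2/(H_L+H_R+lam)] - gamma
    def score(G, H):
        return G * G / (H + lam)
    return 0.5 * (score(G_L, H_L) + score(G_R, H_R)
                  - score(G_L + G_R, H_L + H_R)) - gamma

def leaf_weight(G, H, lam):
    # optimal leaf value w* = -G / (H + lam), from the formula above
    return -G / (H + lam)

def best_split_for_feature(x, g, h, lam=1.0, gamma=0.0):
    # scan the samples in ascending feature order, keeping running sums
    # of g and h for the left side; the right side is the complement
    order = np.argsort(x)
    G, H = g.sum(), h.sum()
    G_L = H_L = 0.0
    best_gain, best_threshold = -np.inf, None
    for i in order[:-1]:
        G_L += g[i]
        H_L += h[i]
        gain = split_gain(G_L, H_L, G - G_L, H - H_L, lam, gamma)
        if gain > best_gain:
            best_gain, best_threshold = gain, x[i]
    return best_gain, best_threshold
```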
The sparse-data optimization consists in choosing a default direction at each split in the trees: each time an input value is missing, you follow that direction. The best direction is decided branch by branch, and of course it is the one that minimizes L.
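As I understand it, the sparsity-aware search just scores both default directions and keeps the better one. Reusing split_gain from the sketch above (again my own illustration, not the library's code):

```python
def default_direction(G_miss, H_miss, G_L, H_L, G_R, H_R, lam, gamma):
    # add the statistics of the missing-value samples to the left child,
    # then to the right child, and keep whichever gives the larger gain
    gain_left = split_gain(G_L + G_miss, H_L + H_miss, G_R, H_R, lam, gamma)
    gain_right = split_gain(G_L, H_L, G_R + G_miss, H_R + H_miss, lam, gamma)
    if gain_left >= gain_right:
        return "left", gain_left
    return "right", gain_right
```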
In the end a good tree is found by enumerating possibilities and using only multiplications and sums: so you can create a good model with a very large set of trees instead of stacking layers and layers of neurons.