XGBoost for R introduction

Introduction

XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable. It implements machine learning algorithms under the Gradient Boosting framework. XGBoost provides parallel tree boosting (also known as GBDT or GBM) that solves many data science problems in a fast and accurate way. The same code runs on major distributed environments (Hadoop, SGE, MPI) and can solve problems beyond billions of examples.

For an introduction to the concept of gradient boosting, see the tutorial Introduction to Boosted Trees in XGBoost’s online docs.

For more details about XGBoost’s features and usage, see the online documentation which contains more tutorials, examples, and details.

This short vignette outlines the basic usage of the R interface for XGBoost, assuming the reader has some familiarity with the underlying concepts behind statistical modeling with gradient-boosted decision trees.

Building a predictive model

At its core, XGBoost consists of a C++ library which offers bindings for different programming languages, including R. The R package for XGBoost provides an idiomatic interface similar to those of other statistical modeling packages using an x/y design, as well as a lower-level interface that interacts more directly with the underlying core library and which is similar to those of other language bindings like Python; plus various helpers to interact with its model objects, such as by plotting their feature importances or converting them to other formats.

The main function of interest is xgboost(x, y, ...), which calls the XGBoost model building procedure on observed data of covariates/features/predictors “x”, and a response variable “y” - it should feel familiar to users of packages like glmnet or ncvreg:

library(xgboost)
data(ToothGrowth)

y <- ToothGrowth$supp # the response which we want to model/predict
x <- ToothGrowth[, c("len", "dose")] # the features from which we want to predict it
model <- xgboost(x, y, nthreads = 1, nrounds = 2)
model
## XGBoost model object
## Call:
##   xgboost(x = x, y = y, nrounds = 2, nthreads = 1)
## Objective: binary:logistic
## Number of iterations: 2
## Number of features: 2
## Classes: OJ, VC

In this case, the “y” response variable that was supplied is a “factor” type with two classes (“OJ” and “VC”) - hence, XGBoost builds a binary classification model for it based on the features “x”, by finding a maximum likelihood estimate (similar to the family="binomial" model from R’s glm function) through rule buckets obtained from the sum of two decision trees (from nrounds=2). From this model, we can then predict probabilities, log-odds, or the class with the highest likelihood, among others:

predict(model, x[1:6, ], type = "response") # probabilities for y's last level ("VC")
##         1         2         3         4         5         6 
## 0.6596265 0.5402158 0.6596265 0.6596265 0.6596265 0.4953500
predict(model, x[1:6, ], type = "raw")      # log-odds
##           1           2           3           4           5           6 
##  0.66163027  0.16121151  0.66163027  0.66163027  0.66163027 -0.01860031
predict(model, x[1:6, ], type = "class")    # class with highest probability
## [1] VC VC VC VC VC OJ
## Levels: OJ VC
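
To relate these outputs: the probabilities returned with type = "response" are just the logistic transformation of the log-odds returned with type = "raw", which a quick check confirms:

# sanity check: applying the logistic function to the log-odds recovers the probabilities
all.equal(
    plogis(predict(model, x[1:6, ], type = "raw")),
    predict(model, x[1:6, ], type = "response"),
    tolerance = 1e-6
)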

Compared to R’s glm function which follows the concepts of “families” and “links” from GLM theory to fit models for different kinds of response distributions, XGBoost follows the simpler concept of “objectives” which mix both of them into one, and which just like glm, allow modeling very different kinds of response distributions (e.g. discrete choices, real-valued numbers, counts, censored measurements, etc.) through a common framework.

XGBoost will automatically determine a suitable objective for the response given its object class (one can pass factors for classification, numeric vectors for regression, Surv objects from the survival package for survival, etc. - see ?xgboost for more details), but this can be controlled manually through an objective parameter, based on the kind of model that is desired:

data(mtcars)

y <- mtcars$mpg
x <- mtcars[, -1]
model_gaussian <- xgboost(x, y, nthreads = 1, nrounds = 2) # default is squared loss (Gaussian)
model_poisson <- xgboost(x, y, objective = "count:poisson", nthreads = 1, nrounds = 2)
model_abserr <- xgboost(x, y, objective = "reg:absoluteerror", nthreads = 1, nrounds = 2)
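
Since each of these models optimizes a different loss function, their predictions for the same observations will differ somewhat:

# compare predictions from the three objectives on the first few cars
cbind(
    gaussian = predict(model_gaussian, x[1:3, ]),
    poisson  = predict(model_poisson, x[1:3, ]),
    abserr   = predict(model_abserr, x[1:3, ])
)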

Note: the objective must match with the type of the “y” response variable - for example, classification objectives for discrete choices require “factor” types, while regression models for real-valued data require “numeric” types.
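
For instance, a binary response stored as zeros and ones would need to be converted to a “factor” before fitting a classifier - a small sketch reusing the ToothGrowth data:

# classification objectives expect a factor, so a 0/1 numeric vector is recoded first
y_binary <- as.integer(ToothGrowth$supp == "VC")
model_binary <- xgboost(
    ToothGrowth[, c("len", "dose")],
    factor(y_binary, labels = c("OJ", "VC")), # passing 'y_binary' as-is would be rejected here
    objective = "binary:logistic",
    nthreads = 1, nrounds = 2
)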

Model parameters

XGBoost models allow a large degree of control over how they are built. By their nature, gradient-boosted decision tree ensembles are able to capture very complex patterns between features in the data and a response variable, which also means they can suffer from overfitting if not controlled appropriately.

For best results, one needs to find suitable parameters for the data being modeled. Note that XGBoost does not adjust its default hyperparameters based on the data, and different datasets will require vastly different hyperparameters for optimal predictive performance.

For example, for a small dataset like “ToothGrowth”, which has only two features and 60 observations, the defaults from XGBoost are overkill and lead to severe overfitting - for such data, one might instead want smaller trees (i.e. more conservative decision rules, capturing simpler patterns) and fewer of them.

Parameters can be controlled by passing additional arguments to xgboost(). See ?xgb.params for details about what parameters are available to control.

y <- ToothGrowth$supp
x <- ToothGrowth[, c("len", "dose")]
model_conservative <- xgboost(
    x, y, nthreads = 1,
    nrounds = 5,
    max_depth = 2,
    reg_lambda = 0.5,
    learning_rate = 0.15
)
pred_conservative <- predict(
    model_conservative,
    x
)
pred_conservative[1:6] # most of the predicted probabilities are now closer to 0.5
##         1         2         3         4         5         6 
## 0.6509258 0.4822042 0.6509258 0.6509258 0.6509258 0.4477925
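
For reference, these can be compared against the predictions of the earlier model that was fitted with the default parameters:

predict(model, x)[1:6] # same rows, default-parameter model from before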

XGBoost also allows calculating evaluation metrics for model quality over boosting rounds, with a wide variety of built-in metrics available to use. It’s possible to automatically set aside a fraction of the data to use as an evaluation set, from which one can then visually monitor progress and overfitting:

xgboost(
    x, y, nthreads = 1,
    eval_set = 0.2,
    monitor_training = TRUE,
    verbosity = 1,
    eval_metric = c("auc", "logloss"),
    nrounds = 5,
    max_depth = 2,
    reg_lambda = 0.5,
    learning_rate = 0.15
)
## [1]  train-auc:0.762238  train-logloss:0.660152  eval-auc:0.531250   eval-logloss:0.736025 
## [2]  train-auc:0.762238  train-logloss:0.637286  eval-auc:0.531250   eval-logloss:0.749103 
## [3]  train-auc:0.798077  train-logloss:0.614964  eval-auc:0.640625   eval-logloss:0.733579 
## [4]  train-auc:0.851399  train-logloss:0.595922  eval-auc:0.640625   eval-logloss:0.752773 
## [5]  train-auc:0.854895  train-logloss:0.578727  eval-auc:0.640625   eval-logloss:0.740198
## XGBoost model object
## Call:
##   xgboost(x = x, y = y, nrounds = 5, max_depth = 2, learning_rate = 0.15, 
##     reg_lambda = 0.5, verbosity = 1, monitor_training = TRUE, 
##     eval_set = 0.2, eval_metric = c("auc", "logloss"), nthreads = 1)
## Objective: binary:logistic
## Number of iterations: 5
## Number of features: 2
## Classes: OJ, VC

Examining model objects

XGBoost model objects for the most part consist of a pointer to a C++ object where most of the information is held and which is interfaced through the utility functions and methods in the package, but they also contain some R attributes that can be retrieved (and new ones added) through attributes():

attributes(model)
## $call
## xgboost(x = x, y = y, nrounds = 2, nthreads = 1)
## 
## $params
## $params$objective
## [1] "binary:logistic"
## 
## $params$nthread
## [1] 1
## 
## $params$verbosity
## [1] 0
## 
## $params$seed
## [1] 0
## 
## $params$validate_parameters
## [1] TRUE
## 
## 
## $names
## [1] "ptr"
## 
## $class
## [1] "xgboost"     "xgb.Booster"
## 
## $metadata
## $metadata$y_levels
## [1] "OJ" "VC"
## 
## $metadata$n_targets
## [1] 1
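
Since these are ordinary R attributes, new ones can be attached with base R if desired - an arbitrary illustration:

attr(model, "fitted_on") <- Sys.Date() # attach a custom R attribute
attributes(model)$fitted_on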

In addition to R attributes (which can be arbitrary R objects), a model may also keep some standardized C-level attributes that one can access and modify (but which can only be in JSON format):

xgb.attributes(model)
## list()

(they are empty for this model)
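
They can be set through the xgb.attributes() replacement method, with values being stored as strings inside the C++ object - for example:

xgb.attributes(model) <- list(note = "fitted on ToothGrowth")
xgb.attributes(model)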

… but usually, when it comes to getting something out of a model object, one would want to do so through the built-in utility functions. Some examples:

xgb.importance(model)
##    Feature      Gain     Cover Frequency
##     <char>     <num>     <num>     <num>
## 1:     len 0.7444265 0.6830449 0.7333333
## 2:    dose 0.2555735 0.3169551 0.2666667
xgb.model.dt.tree(model)
##      Tree  Node     ID Feature Split    Yes     No Missing        Gain
##     <int> <int> <char>  <char> <num> <char> <char>  <char>       <num>
##  1:     0     0    0-0     len  19.7    0-1    0-2     0-2  5.88235283
##  2:     0     1    0-1    dose   1.0    0-3    0-4     0-4  2.50230217
##  3:     0     2    0-2    dose   2.0    0-5    0-6     0-6  2.50230217
##  4:     0     3    0-3     len   8.2    0-7    0-8     0-8  5.02710962
##  5:     0     4    0-4    Leaf    NA   <NA>   <NA>    <NA>  0.36000001
##  6:     0     5    0-5    Leaf    NA   <NA>   <NA>    <NA> -0.36000001
##  7:     0     6    0-6     len  29.5    0-9   0-10    0-10  0.93020594
##  8:     0     7    0-7    Leaf    NA   <NA>   <NA>    <NA>  0.36000001
##  9:     0     8    0-8     len  10.0   0-11   0-12    0-12  0.60633492
## 10:     0     9    0-9     len  24.5   0-13   0-14    0-14  0.78028417
## 11:     0    10   0-10    Leaf    NA   <NA>   <NA>    <NA>  0.15000001
## 12:     0    11   0-11    Leaf    NA   <NA>   <NA>    <NA> -0.30000001
## 13:     0    12   0-12     len  13.6   0-15   0-16    0-16  2.92307687
## 14:     0    13   0-13    Leaf    NA   <NA>   <NA>    <NA>  0.06666667
## 15:     0    14   0-14    Leaf    NA   <NA>   <NA>    <NA> -0.17142859
## 16:     0    15   0-15    Leaf    NA   <NA>   <NA>    <NA>  0.20000002
## 17:     0    16   0-16    Leaf    NA   <NA>   <NA>    <NA> -0.30000001
## 18:     1     0    1-0     len  19.7    1-1    1-2     1-2  3.51329851
## 19:     1     1    1-1    dose   1.0    1-3    1-4     1-4  1.63309026
## 20:     1     2    1-2    dose   2.0    1-5    1-6     1-6  1.65485406
## 21:     1     3    1-3     len   8.2    1-7    1-8     1-8  3.56799269
## 22:     1     4    1-4    Leaf    NA   <NA>   <NA>    <NA>  0.28835031
## 23:     1     5    1-5    Leaf    NA   <NA>   <NA>    <NA> -0.28835031
## 24:     1     6    1-6     len  26.7    1-9   1-10    1-10  0.22153124
## 25:     1     7    1-7    Leaf    NA   <NA>   <NA>    <NA>  0.30163023
## 26:     1     8    1-8     len  11.2   1-11   1-12    1-12  0.25236940
## 27:     1     9    1-9     len  24.5   1-13   1-14    1-14  0.44972166
## 28:     1    10   1-10    Leaf    NA   <NA>   <NA>    <NA>  0.05241550
## 29:     1    11   1-11    Leaf    NA   <NA>   <NA>    <NA> -0.21860033
## 30:     1    12   1-12    Leaf    NA   <NA>   <NA>    <NA> -0.03878851
## 31:     1    13   1-13    Leaf    NA   <NA>   <NA>    <NA>  0.05559399
## 32:     1    14   1-14    Leaf    NA   <NA>   <NA>    <NA> -0.13160129
##      Tree  Node     ID Feature Split    Yes     No Missing        Gain
##         Cover
##         <num>
##  1: 15.000000
##  2:  7.500000
##  3:  7.500000
##  4:  4.750000
##  5:  2.750000
##  6:  2.750000
##  7:  4.750000
##  8:  1.500000
##  9:  3.250000
## 10:  3.750000
## 11:  1.000000
## 12:  1.000000
## 13:  2.250000
## 14:  1.250000
## 15:  2.500000
## 16:  1.250000
## 17:  1.000000
## 18: 14.695991
## 19:  7.308470
## 20:  7.387520
## 21:  4.645680
## 22:  2.662790
## 23:  2.662790
## 24:  4.724730
## 25:  1.452431
## 26:  3.193249
## 27:  2.985818
## 28:  1.738913
## 29:  1.472866
## 30:  1.720383
## 31:  1.248612
## 32:  1.737206
##         Cover
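
The importance table shown above can also be visualized with the plotting helper bundled in the package, which draws a bar chart of the chosen importance measure:

imp <- xgb.importance(model)
xgb.plot.importance(imp, measure = "Gain")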

Other features

XGBoost supports many additional features on top of its traditional gradient-boosting framework, including, among others:

  • Building decision tree models with characteristics such as per-feature monotonicity constraints or interaction constraints.
  • Calculating feature contributions in individual predictions (this and the previous item are sketched in the example after this list).
  • Using custom objectives and custom evaluation metrics.
  • Fitting linear models.
  • Fitting models on GPUs and/or on data that doesn’t fit in RAM (“external memory”).
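
As a brief sketch of the first two items (assuming, per ?xgb.params and ?predict.xgboost, that monotone_constraints accepts a numeric vector with one entry per column of “x” and that type = "contrib" is among the supported prediction types):

y <- ToothGrowth$supp
x <- ToothGrowth[, c("len", "dose")]
model_mono <- xgboost(
    x, y, nthreads = 1, nrounds = 5,
    max_depth = 2,
    monotone_constraints = c(-1, 1) # constrain 'len' to be decreasing and 'dose' increasing
)
head(predict(model_mono, x, type = "contrib")) # per-observation feature contributions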

See the online documentation - particularly the tutorials section - for a glimpse over further functionalities that XGBoost offers.

The low-level interface

In addition to the xgboost(x, y, ...) function, XGBoost also provides a lower-level interface for creating model objects through the function xgb.train(), which resembles the xgb.train functions in XGBoost’s other language bindings.

This xgb.train() interface exposes additional functionalities (such as user-supplied callbacks or external-memory data support) and performs fewer data validations and castings compared to the xgboost() function interface.

Some key differences between the two interfaces:

  • Unlike xgboost() which takes R objects such as matrix or data.frame as inputs, the function xgb.train() uses XGBoost’s own data container called “DMatrix”, which can be created from R objects through the function xgb.DMatrix(). Note that there are other “DMatrix” constructors too, such as “xgb.QuantileDMatrix()”, which might be more beneficial for some use-cases.
  • A “DMatrix” object may contain a mixture of features/covariates, the response variable, observation weights, base margins, among others; and unlike xgboost(), requires its inputs to have already been encoded into the representation that XGBoost uses behind the scenes - for example, while xgboost() may take a factor object as “y”, xgb.DMatrix() requires instead a binary response variable to be passed as a vector of zeros and ones.
  • Hyperparameters are passed as function arguments in xgboost(), while they are passed as a named list to xgb.train().
  • The xgb.train() interface keeps less metadata about its inputs - for example, it will not add levels of factors as column names to estimated probabilities when calling predict.

Example usage of xgb.train():

data("agaricus.train")
dmatrix <- xgb.DMatrix(
    data = agaricus.train$data,  # a sparse CSC matrix ('dgCMatrix')
    label = agaricus.train$label, # zeros and ones
    nthread = 1
)
booster <- xgb.train(
    data = dmatrix,
    nrounds = 10,
    params = xgb.params(
        objective = "binary:logistic",
        nthread = 1,
        max_depth = 3
    )
)

data("agaricus.test")
dmatrix_test <- xgb.DMatrix(agaricus.test$data, nthread = 1)
pred_prob <- predict(booster, dmatrix_test)
pred_raw <- predict(booster, dmatrix_test, outputmargin = TRUE)

Model objects produced by xgb.train() have class xgb.Booster, while model objects produced by xgboost() have class xgboost, which is a subclass of xgb.Booster. Their predict methods also take different arguments - for example, predict.xgboost has a type parameter, while predict.xgb.Booster controls the type of prediction through separate logical arguments (such as outputmargin) - but as xgboost is a subclass of xgb.Booster, methods for xgb.Booster can be called on xgboost objects if needed.
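
For instance, raw margin (log-odds) predictions are requested with type = "raw" for xgboost objects, and with outputmargin = TRUE for xgb.Booster objects:

predict(model, x[1:3, ], type = "raw")
predict(booster, dmatrix_test, outputmargin = TRUE)[1:3]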

Utility functions in the XGBoost R package will work with both model classes - for example:

xgb.importance(model)
##    Feature      Gain     Cover Frequency
##     <char>     <num>     <num>     <num>
## 1:     len 0.7444265 0.6830449 0.7333333
## 2:    dose 0.2555735 0.3169551 0.2666667
xgb.importance(booster)
##                            Feature         Gain        Cover  Frequency
##                             <char>        <num>        <num>      <num>
##  1:                      odor=none 0.6083687503 0.3459792871 0.16949153
##  2:                stalk-root=club 0.0959684807 0.0695742744 0.03389831
##  3:                     odor=anise 0.0645662853 0.0777761744 0.10169492
##  4:                    odor=almond 0.0542574659 0.0865120182 0.10169492
##  5:               bruises?=bruises 0.0532525762 0.0535293301 0.06779661
##  6:              stalk-root=rooted 0.0471992509 0.0610565707 0.03389831
##  7:        spore-print-color=green 0.0326096192 0.1418126308 0.16949153
##  8:                      odor=foul 0.0153302980 0.0103517575 0.01694915
##  9: stalk-surface-below-ring=scaly 0.0126892940 0.0914230316 0.08474576
## 10:                gill-size=broad 0.0066973198 0.0345993858 0.10169492
## 11:                   odor=pungent 0.0027091458 0.0032193586 0.01694915
## 12:           population=clustered 0.0025750464 0.0015616374 0.03389831
## 13:  stalk-color-below-ring=yellow 0.0016913567 0.0173903519 0.01694915
## 14:        spore-print-color=white 0.0012798160 0.0008031107 0.01694915
## 15:             gill-spacing=close 0.0008052948 0.0044110809 0.03389831

While xgboost() aims to provide a user-friendly interface, there are still many situations where one should prefer the xgb.train() interface - for example:

  • For latency-sensitive applications (e.g. when serving models in real time), xgb.train() will have a speed advantage, as it performs fewer validations, conversions, and metadata post-processing steps.
  • If you are developing an R package that depends on XGBoost, xgb.train() will provide a more stable interface (less subject to changes) and will have lower time/memory overhead.
  • If you need functionalities that are not exposed by the xgboost() interface - for example, if your dataset does not fit into the computer’s RAM, it’s still possible to construct a DMatrix from it if the data is loaded in batches through xgb.ExtMemDMatrix().