XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable. It implements machine learning algorithms under the Gradient Boosting framework. XGBoost provides parallel tree boosting (also known as GBDT or GBM) that solves many data science problems in a fast and accurate way. The same code runs on major distributed environments (Hadoop, SGE, MPI) and can solve problems with billions of examples and beyond.
For an introduction to the concept of gradient boosting, see the tutorial Introduction to Boosted Trees in XGBoost’s online docs.
For more details about XGBoost’s features and usage, see the online documentation which contains more tutorials, examples, and details.
This short vignette outlines the basic usage of the R interface for XGBoost, assuming the reader has some familiarity with the underlying concepts behind statistical modeling with gradient-boosted decision trees.
At its core, XGBoost consists of a C++ library which offers bindings for different programming languages, including R. The R package for XGBoost provides an idiomatic interface similar to those of other statistical modeling packages using an x/y design, as well as a lower-level interface that interacts more directly with the underlying core library and is similar to those of other language bindings like Python. It also includes various helpers for interacting with its model objects, such as plotting their feature importances or converting them to other formats.
The main function of interest is xgboost(x, y, ...), which calls the XGBoost model-building procedure on observed data of covariates/features/predictors “x” and a response variable “y” - it should feel familiar to users of packages like glmnet or ncvreg:
library(xgboost)
data(ToothGrowth)
y <- ToothGrowth$supp # the response which we want to model/predict
x <- ToothGrowth[, c("len", "dose")] # the features from which we want to predict it
model <- xgboost(x, y, nthreads = 1, nrounds = 2)
model
## XGBoost model object
## Call:
## xgboost(x = x, y = y, nrounds = 2, nthreads = 1)
## Objective: binary:logistic
## Number of iterations: 2
## Number of features: 2
## Classes: OJ, VC
In this case, the “y” response variable that was supplied is a “factor” type with two classes (“OJ” and “VC”) - hence, XGBoost builds a binary classification model for it based on the features “x”, by finding a maximum likelihood estimate (similar to the family="binomial" model from R’s glm function) through rule buckets obtained from the sum of two decision trees (from nrounds=2), from which we can then predict probabilities, log-odds, or the class with highest likelihood, among others:
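For instance, through the predict method (a plausible sketch of the calls behind the outputs below; the "raw" and "class" values for the type argument of predict.xgboost are assumptions here):
predict(model, x)[1:6]                 # probabilities
predict(model, x, type = "raw")[1:6]   # log-odds (raw boosting scores)
predict(model, x, type = "class")[1:6] # class with highest likelihood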
## 1 2 3 4 5 6
## 0.6596265 0.5402158 0.6596265 0.6596265 0.6596265 0.4953500
## 1 2 3 4 5 6
## 0.66163027 0.16121151 0.66163027 0.66163027 0.66163027 -0.01860031
## [1] VC VC VC VC VC OJ
## Levels: OJ VC
Compared to R’s glm function, which follows the concepts of “families” and “links” from GLM theory to fit models for different kinds of response distributions, XGBoost follows the simpler concept of “objectives”, which mix both of them into one, and which, just like glm, allow modeling very different kinds of response distributions (e.g. discrete choices, real-valued numbers, counts, censored measurements, etc.) through a common framework.
XGBoost will automatically determine a suitable objective for the response given its object class (one can pass factors for classification, numeric vectors for regression, Surv objects from the survival package for survival, etc. - see ?xgboost for more details), but this can be controlled manually through an objective parameter based on the kind of model that is desired:
data(mtcars)
y <- mtcars$mpg
x <- mtcars[, -1]
model_gaussian <- xgboost(x, y, nthreads = 1, nrounds = 2) # default is squared loss (Gaussian)
model_poisson <- xgboost(x, y, objective = "count:poisson", nthreads = 1, nrounds = 2)
model_abserr <- xgboost(x, y, objective = "reg:absoluteerror", nthreads = 1, nrounds = 2)
Note: the objective must match the type of the “y” response variable - for example, classification objectives for discrete choices require “factor” types, while regression objectives for real-valued data require “numeric” types.
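For instance (a minimal sketch with arbitrarily chosen columns, not part of the examples above), recoding a numeric 0/1 column as a factor switches the automatically selected objective from regression to binary classification:
y_class <- factor(mtcars$am, labels = c("automatic", "manual")) # factor -> classification
x_class <- mtcars[, c("hp", "wt")]                              # numeric features
model_binary <- xgboost(x_class, y_class, nthreads = 1, nrounds = 2)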
XGBoost models allow a large degree of control over how they are built. By their nature, gradient-boosted decision tree ensembles are able to capture very complex patterns between features in the data and a response variable, which also means they can suffer from overfitting if not controlled appropriately.
For best results, one needs to find suitable parameters for the data being modeled. Note that XGBoost does not adjust its default hyperparameters based on the data, and different datasets will require vastly different hyperparameters for optimal predictive performance.
For example, for a small dataset like “ToothGrowth”, which has only two features and 60 observations, the defaults from XGBoost are overkill and lead to severe overfitting - for such data, one might instead want smaller trees (i.e. more conservative decision rules, capturing simpler patterns) and fewer of them.
Parameters can be controlled by passing additional arguments to xgboost(). See ?xgb.params for details about what parameters are available to control.
y <- ToothGrowth$supp
x <- ToothGrowth[, c("len", "dose")]
model_conservative <- xgboost(
x, y, nthreads = 1,
nrounds = 5,
max_depth = 2,
reg_lambda = 0.5,
learning_rate = 0.15
)
pred_conservative <- predict(
model_conservative,
x
)
pred_conservative[1:6] # probabilities are all closer to 0.5 now
## 1 2 3 4 5 6
## 0.6509258 0.4822042 0.6509258 0.6509258 0.6509258 0.4477925
XGBoost also supports calculating evaluation metrics for model quality over boosting rounds, with a wide variety of built-in metrics available to use. It’s possible to automatically set aside a fraction of the data as an evaluation set, from which one can then visually monitor progress and overfitting:
xgboost(
x, y, nthreads = 1,
eval_set = 0.2,
monitor_training = TRUE,
verbosity = 1,
eval_metric = c("auc", "logloss"),
nrounds = 5,
max_depth = 2,
reg_lambda = 0.5,
learning_rate = 0.15
)
## [1] train-auc:0.762238 train-logloss:0.660152 eval-auc:0.531250 eval-logloss:0.736025
## [2] train-auc:0.762238 train-logloss:0.637286 eval-auc:0.531250 eval-logloss:0.749103
## [3] train-auc:0.798077 train-logloss:0.614964 eval-auc:0.640625 eval-logloss:0.733579
## [4] train-auc:0.851399 train-logloss:0.595922 eval-auc:0.640625 eval-logloss:0.752773
## [5] train-auc:0.854895 train-logloss:0.578727 eval-auc:0.640625 eval-logloss:0.740198
## XGBoost model object
## Call:
## xgboost(x = x, y = y, nrounds = 5, max_depth = 2, learning_rate = 0.15,
## reg_lambda = 0.5, verbosity = 1, monitor_training = TRUE,
## eval_set = 0.2, eval_metric = c("auc", "logloss"), nthreads = 1)
## Objective: binary:logistic
## Number of iterations: 5
## Number of features: 2
## Classes: OJ, VC
XGBoost model objects for the most part consist of a pointer to a C++ object where most of the information is held and which is interfaced through the utility functions and methods in the package, but they also contain some R attributes that can be retrieved (and new ones added) through attributes():
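For example, for the first ToothGrowth model fitted earlier (a plausible call for the output shown below):
attributes(model)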
## $call
## xgboost(x = x, y = y, nrounds = 2, nthreads = 1)
##
## $params
## $params$objective
## [1] "binary:logistic"
##
## $params$nthread
## [1] 1
##
## $params$verbosity
## [1] 0
##
## $params$seed
## [1] 0
##
## $params$validate_parameters
## [1] TRUE
##
##
## $names
## [1] "ptr"
##
## $class
## [1] "xgboost" "xgb.Booster"
##
## $metadata
## $metadata$y_levels
## [1] "OJ" "VC"
##
## $metadata$n_targets
## [1] 1
In addition to R attributes (which can be arbitrary R objects), a model may also keep some standardized C-level attributes that one can access and modify (but which can only hold JSON-format content):
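For example, via the xgb.attributes() accessor (a minimal sketch for the output shown below):
xgb.attributes(model)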
## list()
(they are empty for this model)
… but usually, when it comes to getting something out of a model object, one would want to do so through the built-in utility functions. Some examples:
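Retrieving per-feature importances (a plausible call for the output shown below):
xgb.importance(model = model)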
## Feature Gain Cover Frequency
## <char> <num> <num> <num>
## 1: len 0.7444265 0.6830449 0.7333333
## 2: dose 0.2555735 0.3169551 0.2666667
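Or dumping the fitted trees as a data.table (again, a plausible call for the output shown below):
xgb.model.dt.tree(model = model)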
## Tree Node ID Feature Split Yes No Missing Gain
## <int> <int> <char> <char> <num> <char> <char> <char> <num>
## 1: 0 0 0-0 len 19.7 0-1 0-2 0-2 5.88235283
## 2: 0 1 0-1 dose 1.0 0-3 0-4 0-4 2.50230217
## 3: 0 2 0-2 dose 2.0 0-5 0-6 0-6 2.50230217
## 4: 0 3 0-3 len 8.2 0-7 0-8 0-8 5.02710962
## 5: 0 4 0-4 Leaf NA <NA> <NA> <NA> 0.36000001
## 6: 0 5 0-5 Leaf NA <NA> <NA> <NA> -0.36000001
## 7: 0 6 0-6 len 29.5 0-9 0-10 0-10 0.93020594
## 8: 0 7 0-7 Leaf NA <NA> <NA> <NA> 0.36000001
## 9: 0 8 0-8 len 10.0 0-11 0-12 0-12 0.60633492
## 10: 0 9 0-9 len 24.5 0-13 0-14 0-14 0.78028417
## 11: 0 10 0-10 Leaf NA <NA> <NA> <NA> 0.15000001
## 12: 0 11 0-11 Leaf NA <NA> <NA> <NA> -0.30000001
## 13: 0 12 0-12 len 13.6 0-15 0-16 0-16 2.92307687
## 14: 0 13 0-13 Leaf NA <NA> <NA> <NA> 0.06666667
## 15: 0 14 0-14 Leaf NA <NA> <NA> <NA> -0.17142859
## 16: 0 15 0-15 Leaf NA <NA> <NA> <NA> 0.20000002
## 17: 0 16 0-16 Leaf NA <NA> <NA> <NA> -0.30000001
## 18: 1 0 1-0 len 19.7 1-1 1-2 1-2 3.51329851
## 19: 1 1 1-1 dose 1.0 1-3 1-4 1-4 1.63309026
## 20: 1 2 1-2 dose 2.0 1-5 1-6 1-6 1.65485406
## 21: 1 3 1-3 len 8.2 1-7 1-8 1-8 3.56799269
## 22: 1 4 1-4 Leaf NA <NA> <NA> <NA> 0.28835031
## 23: 1 5 1-5 Leaf NA <NA> <NA> <NA> -0.28835031
## 24: 1 6 1-6 len 26.7 1-9 1-10 1-10 0.22153124
## 25: 1 7 1-7 Leaf NA <NA> <NA> <NA> 0.30163023
## 26: 1 8 1-8 len 11.2 1-11 1-12 1-12 0.25236940
## 27: 1 9 1-9 len 24.5 1-13 1-14 1-14 0.44972166
## 28: 1 10 1-10 Leaf NA <NA> <NA> <NA> 0.05241550
## 29: 1 11 1-11 Leaf NA <NA> <NA> <NA> -0.21860033
## 30: 1 12 1-12 Leaf NA <NA> <NA> <NA> -0.03878851
## 31: 1 13 1-13 Leaf NA <NA> <NA> <NA> 0.05559399
## 32: 1 14 1-14 Leaf NA <NA> <NA> <NA> -0.13160129
## Tree Node ID Feature Split Yes No Missing Gain
## Cover
## <num>
## 1: 15.000000
## 2: 7.500000
## 3: 7.500000
## 4: 4.750000
## 5: 2.750000
## 6: 2.750000
## 7: 4.750000
## 8: 1.500000
## 9: 3.250000
## 10: 3.750000
## 11: 1.000000
## 12: 1.000000
## 13: 2.250000
## 14: 1.250000
## 15: 2.500000
## 16: 1.250000
## 17: 1.000000
## 18: 14.695991
## 19: 7.308470
## 20: 7.387520
## 21: 4.645680
## 22: 2.662790
## 23: 2.662790
## 24: 4.724730
## 25: 1.452431
## 26: 3.193249
## 27: 2.985818
## 28: 1.738913
## 29: 1.472866
## 30: 1.720383
## 31: 1.248612
## 32: 1.737206
## Cover
XGBoost supports many additional features on top of its traditional gradient-boosting framework. See the online documentation - particularly the tutorials section - for a glimpse of these further functionalities.
In addition to the xgboost(x, y, ...) function, XGBoost also provides a lower-level interface for creating model objects through the function xgb.train(), which resembles the equivalent xgb.train functions in other language bindings of XGBoost.
This xgb.train() interface exposes additional functionalities (such as user-supplied callbacks or external-memory data support) and performs fewer data validations and castings compared to the xgboost() function interface.
Some key differences between the two interfaces:
- Unlike xgboost(), which takes R objects such as matrix or data.frame as inputs, xgb.train() uses XGBoost’s own data container called “DMatrix”, which can be created from R objects through the function xgb.DMatrix(). Note that there are other “DMatrix” constructors too, such as xgb.QuantileDMatrix(), which might be more beneficial for some use-cases.
- A “DMatrix” object, unlike xgboost(), requires its inputs to have already been encoded into the representation that XGBoost uses behind the scenes - for example, while xgboost() may take a factor object as “y”, xgb.DMatrix() requires instead a binary response variable to be passed as a vector of zeros and ones.
- Parameters are passed as direct function arguments to xgboost(), while they are passed as a named list to xgb.train().
- The xgb.train() interface keeps less metadata about its inputs - for example, it will not add levels of factors as column names to estimated probabilities when calling predict.
Example usage of xgb.train():
data("agaricus.train")
dmatrix <- xgb.DMatrix(
data = agaricus.train$data, # a sparse CSC matrix ('dgCMatrix')
label = agaricus.train$label, # zeros and ones
nthread = 1
)
booster <- xgb.train(
data = dmatrix,
nrounds = 10,
params = xgb.params(
objective = "binary:logistic",
nthread = 1,
max_depth = 3
)
)
data("agaricus.test")
dmatrix_test <- xgb.DMatrix(agaricus.test$data, nthread = 1)
pred_prob <- predict(booster, dmatrix_test)
pred_raw <- predict(booster, dmatrix_test, outputmargin = TRUE)
Model objects produced by xgb.train() have class xgb.Booster, while model objects produced by xgboost() have class xgboost, which is a subclass of xgb.Booster. Their predict methods also take different arguments - for example, predict.xgboost has a type parameter, while predict.xgb.Booster controls this through binary arguments - but as xgboost is a subclass of xgb.Booster, methods for xgb.Booster can be called on xgboost objects if needed.
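A small sketch contrasting the two predict interfaces (the "raw" value for the type argument is an assumption here; outputmargin is the binary argument already used in the xgb.train() example above):
head(predict(model, x, type = "raw"))                     # 'xgboost' interface
head(predict(booster, dmatrix_test, outputmargin = TRUE)) # 'xgb.Booster' interface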
Utility functions in the XGBoost R package will work with both model classes - for example:
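Calling xgb.importance() on the xgboost-class model fitted on ToothGrowth (a plausible call for the output shown below):
xgb.importance(model = model)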
## Feature Gain Cover Frequency
## <char> <num> <num> <num>
## 1: len 0.7444265 0.6830449 0.7333333
## 2: dose 0.2555735 0.3169551 0.2666667
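… and on the xgb.Booster-class model fitted on the agaricus data:
xgb.importance(model = booster)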
## Feature Gain Cover Frequency
## <char> <num> <num> <num>
## 1: odor=none 0.6083687503 0.3459792871 0.16949153
## 2: stalk-root=club 0.0959684807 0.0695742744 0.03389831
## 3: odor=anise 0.0645662853 0.0777761744 0.10169492
## 4: odor=almond 0.0542574659 0.0865120182 0.10169492
## 5: bruises?=bruises 0.0532525762 0.0535293301 0.06779661
## 6: stalk-root=rooted 0.0471992509 0.0610565707 0.03389831
## 7: spore-print-color=green 0.0326096192 0.1418126308 0.16949153
## 8: odor=foul 0.0153302980 0.0103517575 0.01694915
## 9: stalk-surface-below-ring=scaly 0.0126892940 0.0914230316 0.08474576
## 10: gill-size=broad 0.0066973198 0.0345993858 0.10169492
## 11: odor=pungent 0.0027091458 0.0032193586 0.01694915
## 12: population=clustered 0.0025750464 0.0015616374 0.03389831
## 13: stalk-color-below-ring=yellow 0.0016913567 0.0173903519 0.01694915
## 14: spore-print-color=white 0.0012798160 0.0008031107 0.01694915
## 15: gill-spacing=close 0.0008052948 0.0044110809 0.03389831
While xgboost() aims to provide a user-friendly interface, there are still many situations where one should prefer the xgb.train() interface - for example:
- In speed-sensitive applications, xgb.train() will have a speed advantage, as it performs fewer validations, conversions, and post-processings with metadata.
- In programmatic or non-interactive usage, xgb.train() will provide a more stable interface (less subject to changes) and will have lower time/memory overhead.
- Some functionalities are not available through the xgboost() interface - for example, if your dataset does not fit into the computer’s RAM, it’s still possible to construct a DMatrix from it if the data is loaded in batches through xgb.ExtMemDMatrix().