Reading the XGBoost vignette:

"We are using the train data. As explained above, both data and label are stored in a list."

"In a sparse matrix, cells containing 0 are not stored in memory. Therefore, in a dataset mainly made of 0, memory size is reduced. It is very common to have such a dataset."

After that, the vignette also explains how to work with a dense matrix.
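To make the memory point concrete, here's a toy illustration (my own sketch, not from the vignette) using the Matrix package, which is the sparse format xgboost's R interface accepts:

library(Matrix)

# A mostly-zero matrix stored densely vs. as a sparse dgCMatrix,
# which keeps only the nonzero entries
dense <- matrix(0, nrow = 1000, ncol = 100)
dense[sample(length(dense), 500)] <- 1
sparse_mat <- Matrix(dense, sparse = TRUE)

object.size(dense)      # every cell stored
object.size(sparse_mat) # roughly only the 500 nonzero cells stored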
I have a data frame derived from text data, so it's very sparse: most values are zero. I've been passing the data frame to XGBoost and it's taking a long time to run, though maybe that's expected. I'm training on 1M observations and 92 variables, using hosted RStudio with 64 GB of RAM and 15 processors (when I monitor in the terminal I see XGBoost using all available processors too).
My question is: do I have to apply some kind of transformation to my data frame to make it a sparse matrix?
library(tidyverse)
library(caret)
library(xgboost)

## xgboost
# set up the parameter grid (stopped searching over permutations of
# parameters because training was taking so long)
xgb_grid <- expand.grid(
  nrounds          = 150,
  eta              = 0.3, # default 0.3; previously verified best on a 100k sample
  max_depth        = 6,   # default 6; previously verified best on a 100k sample
  gamma            = 0,   # default 0
  colsample_bytree = 1,   # default 1
  min_child_weight = 1,   # default 1
  subsample        = 1    # default 1
)

# fit an xgboost model
print("begin xgb")
mod_xgb <- train(
  cluster ~ .,
  data      = select(training_data, -id),
  method    = "xgbTree",
  trControl = train_control, # defined earlier
  na.action = na.pass,
  tuneGrid  = xgb_grid,
  metric    = "Kappa"
)
> str(training_data)
'data.frame': 1000000 obs. of 92 variables:
$ violat : num 0 0 0 0 0 0 0 0 0 0 ...
$ found : num 0 0 0 0 0 0 0 0 0 0 ...
$ person : num 0 0 0 0 0 0 0 0 0 0 ...
$ theft : num 0 0 0 1 0 0 0 0 0 0 ...
$ theft_from : num 0 0 0 0 0 0 0 0 0 0 ...
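Nearly every value is zero. A quick check like this (a one-off sketch; assumes all predictor columns are numeric, as str() suggests) shows the fraction of zero cells:

# fraction of zero cells among the predictors
mean(as.matrix(select(training_data, -id, -cluster)) == 0)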
I'm asking because I wonder: if I convert my data frame training_data to a sparse matrix for XGBoost, will the model train faster?
How can I make training_data a sparse matrix to pass to XGBoost?
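For reference, this is the kind of conversion I have in mind (an untested sketch; the names X, y, dtrain, and mod are placeholders, sparse.model.matrix comes from the Matrix package, and xgb.DMatrix/xgb.train from xgboost). Since I'm not sure caret's train accepts a sparse matrix directly, the sketch hands the result straight to xgboost:

library(Matrix)

# Build a sparse model matrix from the predictors; -1 drops the intercept
# column, and any factor columns would be one-hot encoded. Note that
# model.matrix drops rows containing NAs by default, unlike the
# na.action = na.pass used in the caret call above.
X <- sparse.model.matrix(cluster ~ . - 1, data = select(training_data, -id))

# xgboost wants numeric labels; for multi-class they must start at 0
y <- as.numeric(as.factor(training_data$cluster)) - 1

# wrap in xgboost's own container and train on it directly
dtrain <- xgb.DMatrix(data = X, label = y)
mod <- xgb.train(
  params = list(objective = "multi:softmax",
                num_class = length(unique(y)),
                eta       = 0.3,
                max_depth = 6),
  data    = dtrain,
  nrounds = 150
)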