
I've been reading the XGBoost vignette. It says:

"We are using the train data. As explained above, both data and label are stored in a list."

"In a sparse matrix, cells containing 0 are not stored in memory. Therefore, in a dataset mainly made of 0, memory size is reduced. It is very usual to have such dataset."

After that, the vignette explains how to work with a dense matrix too.
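Just to check my understanding of the sparse format, here's a toy comparison I put together with the Matrix package (the exact sizes will vary; the numbers are only illustrative):

library(Matrix)

# mostly-zero matrix: dense stores every cell, sparse stores only the nonzeros
dense <- matrix(0, nrow = 1000, ncol = 100)
dense[sample(length(dense), 500)] <- 1      # ~0.5% of cells nonzero
sparse <- Matrix(dense, sparse = TRUE)      # dgCMatrix representation

object.size(dense)    # ~800 KB for the dense numeric matrix
object.size(sparse)   # only a few KB for the sparse version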

I have a data frame derived from text data, so it's very sparse: most values are zero. I've been passing the data frame to XGBoost and it's taking a long time to run, though maybe that's expected. I'm training on 1M observations and 92 variables on a hosted RStudio instance with 64 GB of RAM and 15 processors (when I monitor in the terminal I can see XGBoost using all available processors too).

My question is, do I have to make some kind of transformation to my data frame to make it a sparse matrix?

library(tidyverse)
library(caret)
library(xgboost)

## xgboost
# set up parameter search
xgb_grid <- expand.grid(  # stopped searching over multiple parameter values because training was taking so long
  nrounds = 150,
  eta = 0.3, # default 0.3; previously verified 0.3 was best model with 100k sample
  max_depth = 6, # default 6; previously verified 6 was best model with 100k sample
  gamma = 0, #default = 0
  colsample_bytree = 1, # default = 1
  min_child_weight = 1, # default = 1
  subsample = 1 # default = 1
)

# fit an xgboost model
print("begin xgb")
mod_xgb <- train(
  cluster ~ .,
  data = select(training_data, -id),
  method = "xgbTree",
  trControl = train_control,  # train_control defined earlier (not shown)
  na.action = na.pass,
  tuneGrid = xgb_grid,
  metric = "Kappa"
)

> str(training_data)
'data.frame':   1000000 obs. of  92 variables:
 $ violat          : num  0 0 0 0 0 0 0 0 0 0 ...
 $ found           : num  0 0 0 0 0 0 0 0 0 0 ...
 $ person          : num  0 0 0 0 0 0 0 0 0 0 ...
 $ theft           : num  0 0 0 1 0 0 0 0 0 0 ...
 $ theft_from      : num  0 0 0 0 0 0 0 0 0 0 ...

I'm asking because I wonder: if I convert my data frame training_data to a sparse matrix before passing it to XGBoost, will the model train faster?

How can I make training_data a sparse matrix to pass to XGBoost?


1 Answer


The Matrix package provides sparse.model.matrix() for creating a sparse model matrix. It may help to remove NAs from your data before creating the sparse matrix, so that the dependent variable y has the same length as the sparse matrix when you feed both into the xgboost function.

I also tend to record the factor levels in my training data so that, when it comes to predicting on an unseen test dataset, I can make sure the test data has the same factor levels as the training data. This ensures the test data matrix will have the same dimensions as the training matrix.
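A rough sketch of what I mean (train_df and test_df are placeholders for your own training and test data frames):

# capture the training factor levels, then re-apply them to the test data
# (train_df / test_df are placeholders, not variables from the question)
train_levels <- lapply(Filter(is.factor, train_df), levels)
for (col in names(train_levels)) {
  test_df[[col]] <- factor(test_df[[col]], levels = train_levels[[col]])
}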

Example from mtcars:

library(Matrix)

f <- mpg ~ hp + as.factor(cyl)
trainMatrix <- sparse.model.matrix(f, mtcars)
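From there you can pass the sparse matrix to xgboost together with a label vector of the same length. A minimal sketch, using mpg as the label and placeholder parameters:

library(xgboost)

# the label must line up row-for-row with trainMatrix
y <- mtcars$mpg

dtrain <- xgb.DMatrix(data = trainMatrix, label = y)
bst <- xgb.train(params = list(objective = "reg:squarederror"),  # "reg:linear" on older xgboost versions
                 data = dtrain, nrounds = 10)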