I'm not sure if xgboost
's many nice features can be combined in the way that I need (?), but what I'm trying to do is to run a Random Forest with sparse data predictors on a multi-class dependent variable.
I know that xgboost
can do any 1 of those things:
- Random Forest via tweaking
xgboost
parameters:
bst <- xgboost(data = train$data, label = train$label, max.depth = 4, num_parallel_tree = 1000, subsample = 0.5, colsample_bytree =0.5, nround = 1, objective = "binary:logistic")
bst <- xgboost(data = sparse_matrix, label = output_vector, max.depth = 4,
eta = 1, nthread = 2, nround = 10,objective = "binary:logistic")
- Multinomial (multiclass) dependent variable models via
multi:softmax
ormulti:softprob
xgboost(data = data, label = multinomial_vector, max.depth = 4,
eta = 1, nthread = 2, nround = 10,objective = "multi:softmax")
However, I run into an error regarding non-conforming length when I try to do all of them at once:
sparse_matrix <- sparse.model.matrix(TripType~.-1, data = train)
Y <- train$TripType
bst <- xgboost(data = sparse_matrix, label = Y, max.depth = 4, num_parallel_tree = 100, subsample = 0.5, colsample_bytree =0.5, nround = 1, objective = "multi:softmax")
Error in xgb.setinfo(dmat, names(p), p[[1]]) :
The length of labels must equal to the number of rows in the input data
length(Y)
[1] 647054
length(sparse_matrix)
[1] 66210988200
nrow(sparse_matrix)
[1] 642925
The length error I'm getting is comparing the length of my single multi-class dependent vector (let's call it n) to the length of the sparse matrix index, which I believe is j * n for j predictors.
The specific use case here is the Kaggle.com Walmart competition (the data is public, but very large by default -- about 650,000 rows and several thousand candidate features). I've been running multinomial RF models on it via H2O, but it sounds like a lot of other folks have been using xgboost
, so I wonder if this is possible.
If it's not possible, then I wonder if one could/should estimate each level of the dependent variable separately and try to come the results?