2
votes

I have a time series with monthly granularity and 7 months of data in total. I'm trying to predict the profitability of the 7th month by training on the first six months, using an 80/20 split on the data. XGBoost gives an extremely low RMSE, which I haven't been able to get from other algorithms, and that makes me suspicious. When I check which features are the most important, I get numbers instead of a list of feature names, which makes me suspect I'm not feeding the data into the algorithm correctly. My apologies for the noob question, but I guess I kind of am one. Help will be much appreciated.

require(caTools)
require(Matrix)
require(data.table)
require(xgboost)
set.seed(111) 
sample = sample.split(new_flat$SUBSCRIPTION_ID, SplitRatio = .80)
train = subset(new_flat, sample == TRUE)
train <- subset( train, select = -SUBSCRIPTION_ID ) #Removing Subscription_id
test = subset(new_flat, sample == FALSE)
test <- subset( test, select = -SUBSCRIPTION_ID ) #Removing Subscription_id
target=test$Total_MARGIN_7 #Value I want to predict in the test set
dtrain <- xgb.DMatrix(data = as.matrix(train), label = train[,7]) # I think this is the problem here
dtest <- xgb.DMatrix(data = as.matrix(test), label = test[,7]) # I think this is the problem here

bst <- xgboost(data = dtrain, max_depth = 5, eta = 1, nrounds = 20, 
               objective = "reg:linear")
pred <- predict(bst, dtest)
mean(pred)
RMSE <- sqrt(mean((as.numeric(target) - pred)^2)) # Yes as.numeric is redundant here
RMSE
I am not sure if XGBoost is a good algorithm for time series. Can you show some sample data? - Dinesh.hmn
Are you getting feature numbers as outputs or something else? - Rohan
I can't share the data unfortunately, and you might very well be right that XGBoost isn't the best for time series, but I'm just giving it a try. - ljourney
Your description and code aren't coherent. Why are you performing a random split for train and test on a time series? I'd suggest visualizing your data. If it involves seasonality and trend, then check the forecast package. - Karthik Arumugham

1 Answer

0
votes

Extremely "good" performance is often a sign of target leakage in the input data. Here, the dependent variable (`Total_MARGIN_7`, column 7) is included in the feature matrix, so the model can simply read off the answer. It has to be removed when building the DMatrix:

dtrain <- xgb.DMatrix(data = as.matrix(train)[,-7], label = train[,7])
dtest <- xgb.DMatrix(data = as.matrix(test)[,-7], label = test[,7])
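For completeness, a minimal sketch of the corrected pipeline follows. It assumes, as in your code, that column 7 of `train`/`test` is `Total_MARGIN_7`; passing the remaining column names to `xgb.importance` should also give you feature names instead of numbers:

```r
# Build feature matrices without the target column (assumed to be column 7)
train_x <- as.matrix(train)[, -7]
test_x  <- as.matrix(test)[, -7]

dtrain <- xgb.DMatrix(data = train_x, label = as.matrix(train)[, 7])
dtest  <- xgb.DMatrix(data = test_x,  label = as.matrix(test)[, 7])

bst <- xgboost(data = dtrain, max_depth = 5, eta = 1, nrounds = 20,
               objective = "reg:linear")

pred <- predict(bst, dtest)
sqrt(mean((as.matrix(test)[, 7] - pred)^2))  # RMSE on the held-out rows

# Feature names come through once they are attached to the matrix columns
xgb.importance(feature_names = colnames(train_x), model = bst)
```

Expect the RMSE to rise noticeably once the target is excluded; the previous near-zero error was the leakage, not model skill.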