I have a time series with monthly granularity and 7 months of data in total. I'm trying to predict the profitability of the 7th month by training on the first six months, and I do an 80/20 split on the data. XGBoost is giving an extremely low RMSE that I haven't been able to get from any other algorithm, which makes me a bit suspicious. So I decided to check which features are the most important, and the importance output shows numbers instead of a list of feature names. That makes me suspect I'm not feeding the data into the algorithm correctly. My apologies for the noob question, but I guess I kind of am one. Help will be much appreciated.
require(caTools)
require(Matrix)
require(data.table)
require(xgboost)
set.seed(111)
sample = sample.split(new_flat$SUBSCRIPTION_ID, SplitRatio = .80)
train = subset(new_flat, sample == TRUE)
train <- subset( train, select = -SUBSCRIPTION_ID ) #Removing Subscription_id
test = subset(new_flat, sample == FALSE)
test <- subset( test, select = -SUBSCRIPTION_ID ) #Removing Subscription_id
target <- test$Total_MARGIN_7 # Value I want to predict in the test set
dtrain <- xgb.DMatrix(data = as.matrix(train), label = train[,7]) # I think this is the problem here
dtest <- xgb.DMatrix(data = as.matrix(test), label = test[,7]) # I think this is the problem here
bst <- xgboost(data = dtrain, max_depth = 5, eta = 1, nrounds = 20,
objective = "reg:linear")
pred <- predict(bst, dtest)
mean(pred)
RMSE <- sqrt(mean((as.numeric(target) - pred)^2)) # Yes as.numeric is redundant here
RMSE
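My current guess, from reading the xgb.DMatrix documentation, is that I'm passing the whole table (target column included) in as the feature matrix, so the model can simply read off the answer, which would explain the suspiciously low RMSE. Below is a sketch of what I think the data feeding should look like instead (untested; Total_MARGIN_7 is the target column in my data). Does this look right, and would it also make xgb.importance() return feature names instead of numbers?

# Sketch of my attempted fix (untested): keep the target separate from the
# features so the label column is not also a predictor, and keep the column
# names so xgb.importance() can report feature names rather than indices.
train_label <- train$Total_MARGIN_7
test_label  <- test$Total_MARGIN_7

train_features <- as.matrix(subset(train, select = -Total_MARGIN_7))
test_features  <- as.matrix(subset(test,  select = -Total_MARGIN_7))

dtrain <- xgb.DMatrix(data = train_features, label = train_label)
dtest  <- xgb.DMatrix(data = test_features,  label = test_label)

bst  <- xgboost(data = dtrain, max_depth = 5, eta = 1, nrounds = 20,
                objective = "reg:linear")
pred <- predict(bst, dtest)
sqrt(mean((test_label - pred)^2)) # RMSE on the held-out rows

xgb.importance(feature_names = colnames(train_features), model = bst)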