I'm working with Airbnb's data, available here on Kaggle, and predicting the country in which each user will book their first trip, using an XGBoost model with almost 600 features in R. Running the algorithm through 50 rounds of 5-fold cross-validation, I obtained 100% accuracy every time. After fitting the model to the training data and predicting on a held-out test set, I again obtained 100% accuracy. These results can't be real. There must be something wrong with my code, but so far I haven't been able to figure out what.

I've included a section of my code below. It's based on this article. When I follow along with the article (using the article's data and copying its code), I get results similar to the article's. However, when I apply the same code to Airbnb's data, I consistently obtain 100% accuracy. I have no clue what is going on. Am I using the xgboost package incorrectly? Your help and time are appreciated.
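# packages used below
library(xgboost)
library(caret) # createDataPartition(), confusionMatrix()
library(dplyr) # %>% and mutate()
# set up the data
# train is the data frame of features plus the target variable to predict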
full_variables <- data.matrix(train[,-1]) # country_destination removed
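# 0-based integer class ids, as xgboost expects; factor() also covers the case where the column was read as character
full_label <- as.numeric(factor(train$country_destination)) - 1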
# training data
train_index <- caret::createDataPartition(y = train$country_destination, p = 0.70, list = FALSE)
train_data <- full_variables[train_index, ]
train_label <- full_label[train_index[,1]]
train_matrix <- xgb.DMatrix(data = train_data, label = train_label)
# test data
test_data <- full_variables[-train_index, ]
test_label <- full_label[-train_index[,1]]
test_matrix <- xgb.DMatrix(data = test_data, label = test_label)
# 5-fold CV
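# number of target classes (assumed definition; not shown in the original snippet)
classes <- length(unique(full_label))
params <- list(objective = "multi:softprob",
               num_class = classes,
               eta = 0.3,
               max_depth = 6)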
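cv_model <- xgb.cv(params = params,
                   data = train_matrix,
                   nrounds = 50,
                   nfold = 5,
                   early_stopping_rounds = 1, # xgb.cv expects early_stopping_rounds, not early_stop_round
                   verbose = FALSE,
                   maximize = FALSE, # the default multiclass metrics (merror, mlogloss) are minimized
                   prediction = TRUE)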
# out of fold predictions
out_of_fold_p <- data.frame(cv_model$pred) %>%
  mutate(max_prob = max.col(., ties.method = "last"), # column index of the most probable class
         label = train_label + 1)                     # back to 1-based class ids
head(out_of_fold_p)
# confusion matrix
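# caret expects the predictions first (data) and the true labels second (reference)
confusionMatrix(data = factor(out_of_fold_p$max_prob),
                reference = factor(out_of_fold_p$label),
                mode = "everything")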
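Perfect accuracy in both cross-validation and on a held-out set is often a sign that some feature encodes the target. Below is a minimal base-R sketch of a sanity check for that, run on a made-up toy data frame rather than the Airbnb data. The helper `find_leaky_columns()` and the toy columns `age` and `dest_code` are hypothetical names for illustration, not part of xgboost or caret, and the check only catches columns that are an exact one-to-one recoding of the label.

```r
# Hypothetical helper: names of feature columns that are a one-to-one recoding of the label.
find_leaky_columns <- function(features, label) {
  n_labels <- length(unique(label))
  leaky <- vapply(features, function(col) {
    # number of distinct labels observed for each distinct value of the column
    labels_per_value <- tapply(as.character(label), as.character(col),
                               function(l) length(unique(l)))
    # every value maps to exactly one label, and there are no more values than labels
    # (the second condition keeps id-like columns from being flagged)
    all(labels_per_value == 1) && length(labels_per_value) <= n_labels
  }, logical(1))
  names(features)[leaky]
}

# Made-up toy data: dest_code is a disguised copy of the target
toy <- data.frame(
  country_destination = c("US", "FR", "US", "NDF", "FR", "NDF"),
  age                 = c(25, 31, 25, 40, 52, 31),
  dest_code           = c(3, 1, 3, 2, 1, 2)
)

# Select features by name, so the target is excluded wherever it sits in the frame
feature_names <- setdiff(names(toy), "country_destination")
leaky <- find_leaky_columns(toy[feature_names], toy$country_destination)
print(leaky)
```

If a check like this flags a column in the real feature matrix, that column would be worth inspecting before trusting any accuracy figure.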
A sample of the data I used for this can be found here, and loaded by running this code:
library(RCurl)
x <- getURL("https://raw.githubusercontent.com/loshita/Senior_project/master/train.csv")
y <- read.csv(text = x)
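Once the sample is loaded, it may be worth confirming where `country_destination` sits in the data frame and how its classes are encoded, since the modelling code drops the target by position and passes a class count to `num_class`. The sketch below uses a small made-up inline CSV in place of the real file. The `id` and `age` columns and their values are invented for illustration.

```r
# Made-up stand-in for the real CSV; only the column name country_destination comes from the question
csv_text <- "id,age,country_destination
a1,25,US
a2,31,NDF
a3,40,NDF
a4,52,FR"
y <- read.csv(text = csv_text, stringsAsFactors = TRUE)

# Position of the target column; y[, -1] only removes it if this is 1
target_pos <- which(names(y) == "country_destination")
print(target_pos)

# Class distribution and the class count that num_class would need
class_counts <- table(y$country_destination)
print(class_counts)
n_classes <- nlevels(y$country_destination)

# 0-based labels, following the same as.numeric(factor) - 1 convention as the code above
labels0 <- as.numeric(y$country_destination) - 1
print(sort(unique(labels0)))
```

In this toy frame the target is the third column, so dropping column 1 would leave it among the features.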
caret example, where the model is fit to the training data, and then used for predictions using the test data. – Maurits Evers