
I'm working with Airbnb's data, available here on Kaggle, and predicting the countries users will book their first trips to with an XGBoost model and almost 600 features in R. Running the algorithm through 50 rounds of 5-fold cross-validation, I obtained 100% accuracy each time. After fitting the model to the training data and predicting on a held-out test set, I also obtained 100% accuracy. These results can't be real. There must be something wrong with my code, but so far I haven't been able to figure it out. I've included a section of my code below. It's based on this article. Following along with the article (using the article's data and copying the code), I get results similar to the article's. However, applying the same approach to Airbnb's data, I consistently obtain 100% accuracy. I have no clue what is going on. Am I using the xgboost package incorrectly? Your help and time are appreciated.

# set up the data
library(xgboost)
library(caret)
library(dplyr)

# train is the data frame of features with the target variable to predict
full_variables <- data.matrix(train[,-1]) # country_destination removed
full_label <- as.numeric(train$country_destination) - 1

# training data 
train_index <- caret::createDataPartition(y = train$country_destination, p = 0.70, list = FALSE)
train_data <- full_variables[train_index, ]
train_label <- full_label[train_index[,1]]
train_matrix <- xgb.DMatrix(data = train_data, label = train_label)

# test data 
test_data <- full_variables[-train_index, ]
test_label <- full_label[-train_index[,1]]
test_matrix <- xgb.DMatrix(data = test_data, label = test_label)

# 5-fold CV
params <- list("objective" = "multi:softprob",
               "num_class" = classes,
               eta = 0.3, 
               max_depth = 6)
cv_model <- xgb.cv(params = params,
               data = train_matrix,
               nrounds = 50,
               nfold = 5,
               early_stop_round = 1,
               verbose = F,
               maximize = T,
               prediction = T)

# out of fold predictions 
out_of_fold_p <- data.frame(cv_model$pred) %>%
  mutate(max_prob = max.col(., ties.method = "last"),
         label = train_label + 1)
head(out_of_fold_p)

# confusion matrix
confusionMatrix(factor(out_of_fold_p$label), 
                factor(out_of_fold_p$max_prob),
                mode = "everything")

A sample of the data I used can be loaded by running this code:

library(RCurl)
x <- getURL("https://raw.githubusercontent.com/loshita/Senior_project/master/train.csv")
y <- read.csv(text = x)
Unless the code you provide is incomplete, you don't actually use your test data for model predictions. You only assess the in-sample error/accuracy. See this link for a worked-through XGBoost + caret example, where the model is fit to the training data and then used for predictions on the test data. - Maurits Evers

1 Answer


If you are using the train_users_2.csv.zip available on Kaggle, then the problem is that you are not removing country_destination from the train data set: it is at position 16, not 1.

which(colnames(train) == "country_destination")
#output
16

Column 1 is id, which is unique for every observation and should also be removed.

length(unique(train[,1])) == nrow(train)
#output
TRUE

When I run your code with the following modification:

full_variables <- data.matrix(train[,-c(1, 16)]) # drop id and country_destination

library(xgboost)

params <- list("objective" = "multi:softprob",
               "num_class" = length(unique(train_label)),
               eta = 0.3, 
               max_depth = 6)
cv_model <- xgb.cv(params = params,
                   data = train_matrix,
                   nrounds = 50,
                   nfold = 5,
                   early_stop_round = 1,
                   verbose = T,
                   maximize = T,
                   prediction = T)

With the above settings I obtain a cross-validation test error of about 0.12.
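
If you want to see where that number comes from, the per-round CV metrics are stored in the object xgb.cv returns; a minimal sketch, assuming a reasonably recent xgboost version where the log lives in evaluation_log:

# Per-round train/test metrics averaged over the 5 folds; the last rows show
# the final cross-validated error.
tail(cv_model$evaluation_log)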

out_of_fold_p <- data.frame(cv_model$pred) %>%
  mutate(max_prob = max.col(., ties.method = "last"),
         label = train_label + 1)

head(out_of_fold_p[,13:14], 20)
#output
   max_prob label
1         8     8
2        12    12
3        12    10
4        12    12
5        12    12
6        12    12
7        12    12
8        12    12
9         8     8
10       12     5
11       12     2
12        2    12
13       12    12
14       12    12
15       12    12
16        8     8
17        8     8
18       12     5
19        8     8
20       12    12

So to sum up: you did not remove the target (y) from the features (x).
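
A simple way to guard against this kind of leakage is to drop the id and the target by name rather than by position; a minimal sketch, using the column names from above:

# Drop id and country_destination by name, so a reordered CSV cannot
# silently leak the target into the feature matrix.
drop_cols <- c("id", "country_destination")
full_variables <- data.matrix(train[, !(colnames(train) %in% drop_cols)])
full_label <- as.numeric(train$country_destination) - 1
stopifnot(!("country_destination" %in% colnames(full_variables)))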

EDIT: after downloading the true train set and playing around with it, I can say the accuracy really is 100% in 5-fold CV. Not only that, it is achieved with only 22 features (and possibly fewer).

model <- xgboost(params = params,
                 data = train_matrix,
                 nrounds = 50,
                 verbose = T,
                 maximize = T)

This model also gets 100% accuracy on the test set:

pred <- predict(model, test_matrix)
pred <- matrix(pred, ncol=length(unique(train_label)), byrow = TRUE)
out_of_fold_p <- data.frame(pred) %>%
  mutate(max_prob = max.col(., ties.method = "last"),
         label = test_label + 1)

sum(out_of_fold_p$max_prob != out_of_fold_p$label) #0 errors
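
Equivalently, you can confirm this with the same confusionMatrix call used in the question (a sketch, assuming caret is loaded):

# Overall accuracy on the held-out test set; reports 1 here.
caret::confusionMatrix(factor(out_of_fold_p$max_prob),
                       factor(out_of_fold_p$label),
                       mode = "everything")$overall["Accuracy"]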

Now let's check which features are discriminatory:

xgb.plot.importance(importance_matrix = xgb.importance(colnames(train_matrix), model))

(feature importance plot)
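
If you prefer numbers to the plot, the importance table can also be inspected directly (a sketch; xgb.importance returns one row per feature the trees actually use, ranked by Gain):

importance <- xgb.importance(colnames(train_matrix), model)
head(importance)   # top features by Gain
nrow(importance)   # number of features actually used by the model (22 here)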

Now, if you run xgb.cv with just these features:

train_matrix <- xgb.DMatrix(
  data  = train_data[, which(colnames(train_data) %in%
                             xgboost::xgb.importance(colnames(train_matrix), model)$Feature)],
  label = train_label)

set.seed(1)
cv_model <- xgb.cv(params = params,
                   data = train_matrix,
                   nrounds = 50,
                   nfold = 5,
                   early_stop_round = 1,
                   verbose = T,
                   maximize = T,
                   prediction = T)

you will also attain 100% accuracy on the test folds.

The reason lies partly in the very large class imbalance:

table(train_label)
train_label
  0   1   2   3   4   5   6   7   8   9  10  11 
  3  10  12  13  36  16  19 856   7  73   3 451 
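
For perspective, the two largest classes (7 and 11) account for most of the ~1500 observations, so even a majority-class baseline would score fairly well; a quick check on the table above:

prop.table(table(train_label))                   # class proportions
max(table(train_label)) / length(train_label)    # always predicting class 7 gives ~0.57 accuracy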

The other part of the reason is that the minority classes are very easily distinguished by a single dummy variable:

library(tidyr)    # gather
library(ggplot2)

gg <- data.frame(train_data[, which(colnames(train_data) %in%
                                    xgb.importance(colnames(train_matrix), model)$Feature)],
                 label = as.factor(train_label))

gg %>%
  as.tibble() %>%
  select(1:9, 11, 12, 15:21, 23) %>%
  gather(key, value, 1:18) %>%
  ggplot() +
  geom_bar(aes(x = label)) +
  facet_grid(key ~ value) +
  theme(strip.text.y = element_text(angle = 90))

(bar charts of label counts for the most important features, faceted by feature and 0/1 value)

Based on the 0/1 distributions of the 22 most important features, it looks to me like any tree model would be able to achieve very good, if not 100%, accuracy.

One would expect classes 0 and 10 to be problematic for 5-fold CV, since there is a chance that all of their observations fall into a single fold, in which case the model would never see them during training for that fold. That would be possible if the folds were drawn by purely random sampling, but it does not happen with xgb.cv:

lapply(cv_model$folds, function(x){
  table(train_label[x])})
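
A quick way to verify this directly is to check that, for every fold, the training portion (everything outside the held-out indices) still contains all of the classes; a minimal sketch:

# TRUE for a fold means its training data (the complement of the held-out
# indices) covers every class in train_label.
sapply(cv_model$folds, function(idx) {
  length(unique(train_label[-idx])) == length(unique(train_label))
})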