3
votes

I was trying to oversample my dataset using SMOTE, and I keep running into this error.

    trainSM <- SMOTE(conversion ~ ., train, perc.over = 1000, perc.under = 200)

    Error in matrix(unlist(value, recursive = FALSE, use.names = FALSE), nrow = nr, : length of 'dimnames' [2] not equal to array extent

My dataset is as follows:

          conversion horizon length_of_stay guests rooms price comp_price
            (dbl)   (int)          (int)  (int) (int) (int)      (int)
  1           1     193              2      2     1   199        210
  2           1     263              2      2     1   171         88
  3           1     300              3      2     1   164        164
  4           1      70              4      2     1    76         80
  5           1      65              6      2     2   260        260
  6           1      50              3      2     1   171        176
  7           1       4              3      2     1   158        167
  8           1      29              3      2     1   171        171
  9           0     130              1      2     1   161        160
  10          0      26              2      2     1   110        110

I have tried using only numerical predictors and only categorical predictors, but had no luck with either.

Any help/guidance is greatly appreciated.

1 Answer

10
votes

Passing a data.frame that is a tibble into DMwR::SMOTE() will throw this error. You can work around it by using as.data.frame(your_train_data) to 'un-tibble' your data.frame:

    trainSM <- SMOTE(conversion ~ ., as.data.frame(train), perc.over = 1000, perc.under = 200)

The issue is that SMOTE() uses single-bracket subsetting (data[, tgt]) and expects it to return a vector. Tibbles (i.e. a data.frame turned into a tibble::data_frame) are much stricter about return values: single-bracket subsetting always returns a data frame, even if the result is only a single column or a single value.
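To see the difference in isolation, here is a minimal sketch that assumes the tibble package is installed. The toy data and the names `df` and `tb` are made up for illustration; they are not the asker's data.

```r
library(tibble)  # assumption: tibble is installed

# Toy data, invented for this illustration only
df <- data.frame(conversion = factor(c(1, 1, 0)), price = c(199, 171, 161))
tb <- as_tibble(df)

# Base data.frame: selecting a single column with [, j] drops to a vector
class(df[, "conversion"])    # "factor"

# Tibble: the same call keeps a one-column tibble
class(tb[, "conversion"])    # "tbl_df" "tbl" "data.frame"

# Double brackets extract the column as a vector from either one
class(tb[["conversion"]])    # "factor"
```

Code written against base data.frames often assumes the first behaviour, which is why it can break when handed a tibble.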

Here's the problematic part of the SMOTE() source code:

    # The idea here is to determine which level of the response variable appears least.
    # Unfortunately, if data is a tibble, then data[, tgt] returns a data frame,
    # which of course doesn't have any levels, so the value of minCl is always NULL
    minCl <- levels(data[, tgt])[which.min(table(data[, tgt]))]

    # this is where the error is thrown--you're testing a data frame against NULL
    minExs <- which(data[, tgt] == minCl)
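The first of those two lines can be tried outside of SMOTE() on a small example. This is only a sketch: the data is invented, `tgt` is a stand-in for the column index that SMOTE() works out from the formula, and the response is a factor because `levels()` only returns something for factors.

```r
library(tibble)  # assumption: tibble is installed

# Invented data; conversion is a factor so that levels() has something to return
train_tb <- tibble(conversion = factor(c(1, 1, 1, 0)),
                   price      = c(199, 171, 164, 161))
tgt <- 1  # stand-in for the response column index

# On the tibble, data[, tgt] is a one-column tibble, so there are no levels
levels(train_tb[, tgt])      # NULL

# After as.data.frame(), data[, tgt] is a plain factor and the lookup works
train_df <- as.data.frame(train_tb)
minCl <- levels(train_df[, tgt])[which.min(table(train_df[, tgt]))]
minCl                        # "0"
minExs <- which(train_df[, tgt] == minCl)
minExs                       # 4
```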