1 vote

I am building a CART regression tree model in R using the party package, but I get an error message saying the levels do not match when I try to apply the model to the testing dataset.

I have spent the past week reading through threads on the forum but still couldn't find the right solution to my problem, so I am reposting the question here using fake examples I made up. Can someone explain the error message and suggest a solution?

My training dataset has about 1,000 records and my testing dataset about 150. There are no NA or blank fields in either dataset.

My CART model, built with ctree from the party package, is:

mytree <- ctree(Rate ~ Bank + Product + Salary, data = data_train)

data_train example:

Rate  Bank  Product  Salary    
1.5    A     aaa     100000
0.6    B     abc      60000
3      C     bac      10000
2.1    D     cba      50000
1.1    E     cca      80000

data_test example:

Rate  Bank  Product   Salary
2.0    A     cba       80000
0.5    D     cca      250000
0.8    E     cba      120000
2.1    C     abc       65000

levels(data_train$Bank): A, B, C, D, E

levels(data_test$Bank): A, D, E, C

I tried to set the factors to the same levels using the following code:

> is.factor(data_test$Bank)
[1] TRUE
# (made sure Bank and Product are factors in both datasets)

> levels(data_test$Bank) <- union(levels(data_test$Bank), levels(data_train$Bank))
> levels(data_test$Product) <- union(levels(data_test$Product), levels(data_train$Product))

However, when I try to run a prediction on the testing dataset, I get the following error:

> fit1<- predict(mytree,newdata=data_test)

Error in checkData(oldData, RET) : 
  Levels in factors of new data do not match original data

I have also tried the following method, but it alters the values in my testing dataset:

levels(data_test$Bank) <- levels(data_train$Bank)

The data_test table is altered:

Rate  Bank (altered)  Bank (original)
2.0    A               A
0.5    B               D
0.8    C               E
2.1    D               C
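For reference, a minimal reproduction of that relabeling (using the Bank values above): levels<- replaces the level labels by position rather than remapping the stored values.

# Bank levels in data_test are in order of appearance: A, D, E, C
x <- factor(c("A", "D", "E", "C"), levels = c("A", "D", "E", "C"))

# assigning new labels replaces the old ones by position;
# it does not remap the underlying values
levels(x) <- c("A", "B", "C", "D", "E")
x
# [1] A B C D   (D silently became B, E became C, C became D)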
Comment from Bakaburg: I provided a workaround solution here: stackoverflow.com/a/58728434/380403. Instead of partitioning the data, I give 0 weights to the test observations while training the model; ctree will ignore those observations during training but will retain the factor information.
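A minimal sketch of that zero-weights workaround, assuming Bank and Product are factors and that train and test rows are stacked into one data frame (party::ctree accepts a non-negative integer weights vector):

library(party)

# stack train and test so both share one set of factor levels
# (rbind on data frames expands factor levels as necessary)
data_all <- rbind(data_train, data_test)

# weight 1 for training rows, 0 for test rows: ctree ignores
# zero-weight observations when fitting, but their factor
# levels remain part of the model frame
w <- c(rep(1L, nrow(data_train)), rep(0L, nrow(data_test)))

mytree <- ctree(Rate ~ Bank + Product + Salary, data = data_all, weights = w)

# predict for the test rows only
fit1 <- predict(mytree, newdata = data_all[w == 0L, ])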

2 Answers

2 votes

You might try rebuilding your factors from a common vector of levels instead of assigning new levels to existing factors. Unlike levels<-, which relabels the stored values by position, factor(x, levels = ...) keeps each value as-is and only changes the set of allowed levels, so train and test end up with levels that match in both content and order. Here's an example:

# start the party
library(party)

# create training data sample
# (stringsAsFactors = TRUE because, as of R 4.0, data.frame() keeps
#  strings as character by default, and this approach needs factors)
data_train <- data.frame(Rate = c(1.5, 0.6, 3, 2.1, 1.1),
                         Bank = c("A", "B", "C", "D", "E"),
                         Product = c("aaa", "abc", "bac", "cba", "cca"),
                         Salary = c(100000, 60000, 10000, 50000, 80000),
                         stringsAsFactors = TRUE)

# create testing data sample
data_test <- data.frame(Rate = c(2.0, 0.5, 0.8, 2.1),
                        Bank = c("A", "D", "E", "C"),
                        Product = c("cba", "cca", "cba", "abc"),
                        Salary = c(80000, 250000, 120000, 65000),
                        stringsAsFactors = TRUE)

# get the union of levels between train and test for Bank and Product
bank_levels <- union(levels(data_test$Bank), levels(data_train$Bank))
product_levels <- union(levels(data_test$Product), levels(data_train$Product))

# rebuild Bank with union of levels
data_test$Bank <- with(data_test, factor(Bank, levels = bank_levels)) 
data_train$Bank <- with(data_train, factor(Bank, levels = bank_levels)) 

# rebuild Product with union of levels
data_test$Product <- with(data_test, factor(Product, levels = product_levels)) 
data_train$Product <- with(data_train, factor(Product, levels = product_levels)) 

# fit the model
mytree <- ctree(Rate ~ Bank + Product + Salary, data = data_train)

# generate predictions
fit1 <- predict(mytree, newdata = data_test)

> fit1
     Rate
[1,] 1.66
[2,] 1.66
[3,] 1.66
[4,] 1.66
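Note that all four predictions are identical because, with only five training rows, ctree finds no significant split; every test observation falls into the root node and is predicted with the overall mean of Rate, (1.5 + 0.6 + 3 + 2.1 + 1.1) / 5 = 1.66.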
0 votes

I am using ctree as the example, but this is really just about using factors carefully, so it applies to any algorithm (randomForest, etc.) that relies strictly on factor levels.

This is all about understanding how R stores and uses factor levels. If we apply the same factor levels as the training data (and in the same order), we can predict with a pre-trained ctree model, even without combining train and test data.
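A minimal sketch of why the order matters too: R stores a factor as integer codes into its levels vector, so reordering the levels changes the underlying codes even when the labels are identical.

x <- c("low", "high")
as.integer(factor(x, levels = c("low", "high")))   # [1] 1 2
as.integer(factor(x, levels = c("high", "low")))   # [1] 2 1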

There is actually no need to combine train and test data to predict with the ctree (party) package. You may not have the luxury of memory and processing power at production run time, and pre-trained models relieve us of the burden of rebuilding models on huge training data in the production environment.

Step 1: While building the model, store the factor levels for each applicable column of the training data:

var_list <- colnames(dtrain)
for(var in var_list)
{
  if(class(dtrain[,var]) == 'character')
  {
    print(var)

    #Fill blanks with "None" to keep the factor levels consistent
    dtrain[dtrain[,var] == '',var] <- 'None'

    col_name_levels <- unique(dtrain[,var])

    #Make sure you have sorted the column levels     
    col_name_levels <- sort(col_name_levels, decreasing = FALSE)

    #Make as factors
    dtrain[,var] <- factor(dtrain[,var], levels = col_name_levels, ordered=TRUE)

    print(levels(dtrain[,var]))

    #This is the trick: Store the exact levels in a CSV which is much easier to load than the whole train data later in prediction phase    
    write.csv(levels(dtrain[,var]), paste0(getwd(),'/Output CSVs/',var,'_levels.csv'), row.names = FALSE)
  }
}
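A side note on the file format: write.csv applied to a bare character vector coerces it to a one-column data frame whose column is named x, which is why the prediction code below reads the levels back as col_name_levels$x. For the question's Bank column, Bank_levels.csv would contain:

"x"
"A"
"B"
"C"
"D"
"E"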


# also store the column names and data types for detecting later
for(col_name in colnames(dtrain))
{
  abc <- data.frame('col_name' = col_name,'class_colname' = paste(class(dtrain[,col_name]), collapse = ' '))

  if(!exists('col_name_type_list'))
  {
    col_name_type_list <- abc
  }else
  {
    col_name_type_list <- rbind(col_name_type_list, abc)
  }
}

#Store for checking later
write.csv(col_name_type_list, filepath, row.names = FALSE)

Then, in the prediction phase (in the production environment), just read those stored levels for each column of the test data, discard the rows that contain new levels (ctree will not be able to predict for them anyway), and use the remaining rows for prediction.

###############Now in test prediction ###########################


#Read the column list of train data (stored earlier)
col_name_type_list_dtrain <- read.csv( filepath, header = TRUE)


for(i in 1:nrow(col_name_type_list_dtrain))
{
  col_name <- col_name_type_list_dtrain[i,]$col_name
  class_colname <- col_name_type_list_dtrain[i,]$class_colname

  if(class_colname == 'numeric')
  {
    dtest[,col_name] <- as.numeric(dtest[,col_name])
  }

  if(class_colname == 'ordered factor')
  {

    #Now use the column factor levels stored from train
    #(note: the path must be built from col_name, not the leftover
    # training-loop variable var)
    if (exists('col_name_levels')) remove(col_name_levels)
    col_name_levels <- read.csv(paste0(getwd(), '/Output CSVs/', col_name, '_levels.csv'), header = TRUE)

    col_name_levels <- as.character(col_name_levels$x)
    print(col_name)
    print('Pre-Existing levels detected')
    print(NROW(col_name_levels))

    #Drop new rows which are not in train; the model cant predict for them
    rows_before_dropping <- nrow(dtest)
    print('Adjusting levels to train......')
    dtest <- dtest[dtest[,col_name] %in% col_name_levels,]
    rows_after_dropping <- nrow(dtest)

    cat('\nDropped Rows for adjusting ',col_name,': ',(rows_before_dropping - rows_after_dropping),'\n')

    #Convert to factors
    dtest[,col_name] <- factor(dtest[,col_name], levels=col_name_levels, ordered=TRUE)

    print(dtest[,col_name])
  }
}
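For completeness, a minimal sketch of persisting the pre-trained model alongside the level CSVs, using base R's saveRDS/readRDS (the file name and the question's formula are placeholders here):

library(party)

# training environment: fit once and serialize the model
mytree <- ctree(Rate ~ Bank + Product + Salary, data = dtrain)
saveRDS(mytree, file = 'mytree.rds')

# production environment: load the pre-trained model and predict
# on the test data whose factor levels were adjusted above
mytree <- readRDS('mytree.rds')
fit1 <- predict(mytree, newdata = dtest)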