Random Forest with party package cannot handle categorical predictors with more than 4 levels

Question

I am trying to run a random forest model using the party package. My response variable (10 levels) is a classification of lake types (I am interested in which factors influence the clustering of lakes based on water quality attributes). My predictors include both continuous and categorical variables. One categorical variable has 4 levels; the other has 8 levels (the US state the lake is located in). Whenever I include the second categorical variable in the model, I get the following error:

Error in model@fit(data, ...) : error code 1 from Lapack routine 'dgesdd'.

I've narrowed it down to this: the cforest routine in the party package does not seem to run when a categorical predictor has more than 4 levels. I'm not sure whether this holds for other datasets or is just a characteristic of mine. Google suggests that the error code might be associated with convergence issues. Is anyone aware of limitations in the cforest routine with respect to the number of levels in a categorical predictor (e.g. randomForest from the randomForest package has a limit of 32 levels)? I haven't seen anything explicitly discussing this for the party package. One solution would be to recode this factor into separate dummy variables, but I would like to avoid that. Based on the characteristics of my data (correlated predictors, factors with different numbers of levels, a mix of continuous and categorical data), cforest appears to be recommended over randomForest.
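For reference, the dummy-variable recoding mentioned above could look roughly like the sketch below. It uses a toy data frame rather than my real data; the `state` column and its level names are hypothetical stand-ins for factor2, and `model.matrix` from base R does the expansion.

```r
# Toy data: `state` is a hypothetical stand-in for factor2 (8 levels).
toy = data.frame(
  state = factor(rep(c("IA", "ME", "MI", "MN", "NY", "OH", "VT", "WI"), each = 2)),
  pred1 = seq_len(16)
)

# Expand the factor into one 0/1 indicator column per level.
# The "- 1" drops the intercept so that no level is absorbed into a baseline.
dummies = model.matrix(~ state - 1, data = toy)

# Replace the original factor with its indicator columns.
toy.dummy = cbind(toy["pred1"], as.data.frame(dummies))
str(toy.dummy)
```

Each indicator column would then enter the cforest formula as its own predictor, which is why variable importance would be reported per state rather than for the factor as a whole.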

Any insight would be greatly appreciated.

Link to a dummy dataset (real data just limited number of variables): https://dl.dropboxusercontent.com/u/8554679/newdata.csv

library(RCurl)
library(party)
x = getURL("https://dl.dropboxusercontent.com/u/8554679/newdata.csv")
new.data = read.csv(text = x,header=TRUE)
new.data$response = as.factor(new.data$response)
new.data$factor1 = as.factor(new.data$factor1)
new.data$factor2 = as.factor(new.data$factor2)

set.seed(1123)
data.controls = cforest_unbiased(ntree=500, mtry=3)
data.cforest = cforest(response ~ factor1 + pred1 + pred2 + pred3 + pred4 + factor2 + pred5 + pred6 + pred7,data=new.data,controls=data.controls)

#executing this results in the following error: Error in model@fit(data, ...) : error code 1 from Lapack routine 'dgesdd'

#remove factor2 which has 8 levels from the formula
data.cforest = cforest(response ~ factor1 + pred1 + pred2 + pred3 + pred4 + pred5 + pred6 + pred7,data=new.data,controls=data.controls)

levels(new.data$factor2)
#arbitrarily reassign factor2 levels such that there are only 4 levels
#I've tried levels between 8 and 4 and it turns out it only works if factors have <=4 levels

random.rows = sample(x=c(1:nrow(new.data)),size=nrow(new.data),replace=FALSE)
new.data$factor2 = NA
new.data$factor2[random.rows[1:120]] = 1
new.data$factor2[random.rows[121:241]] = 2
new.data$factor2[random.rows[242:362]] = 3
new.data$factor2[random.rows[363:483]] = 4
new.data$factor2 = as.factor(new.data$factor2)
levels(new.data$factor2)

data.cforest = cforest(response ~ factor1 + pred1 + pred2 + pred3 + pred4 + factor2 + pred5 + pred6 + pred7,data=new.data,controls=data.controls)
#model runs fine.

sessionInfo() output, as requested:

sessionInfo()
R version 3.0.3 (2014-03-06)
Platform: x86_64-apple-darwin10.8.0 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    grid      stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] party_1.0-13      modeltools_0.2-21 strucchange_1.5-0 sandwich_2.3-0    zoo_1.7-11            RCurl_1.95-4.1   
[7] bitops_1.0-6     

loaded via a namespace (and not attached):
[1] coin_1.0-23       lattice_0.20-29   mvtnorm_0.9-99992 splines_3.0.3     survival_2.37-7   tools_3.0.3

Why are you opposed to dummying out state? It's not like that factor is ordered or anything. — matt_k
I'm new to random forest models, and initially I wasn't sure whether results would be harder to interpret with 8 variables representing state instead of a single variable in a variable importance plot. I've been looking further into examples and this may not be a well-founded concern. — nrlottig
Can you post your sessionInfo()? Your code works fine for me. — nograpes
nograpes: I'll post my session info above. When I looked closer at the data after your comment, I realized that the data file I posted was the one produced after collapsing factor2 to 4 levels. I've updated the data file so it is correct. Thanks. — nrlottig
I am also having this issue running cforest with categorical predictors. I get the error 'Error in model@fit(data, ...) : error code 1 from Lapack routine 'dgesdd''. Strangely going from a 500 tree ensemble to a 10 tree ensemble got rid of the error -- however, a 10 tree ensemble obviously won't cut it. None of the solutions offered below seem to work for me. — Darren

Markus · Accepted Answer · 2017-04-28T12:52:14

A late answer, but still an answer: I had the same problem and solved it by closing and reopening RStudio. It seems to me that it was a conflict between the caret and party packages, which were both loaded. As soon as I loaded only the party package, the problem was gone.
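If restarting is inconvenient, something like the following might have the same effect. This is only a sketch, and it assumes the conflict really does come from caret being attached in the same session; a full restart of R remains the more reliable option.

```r
# Detach caret if it is currently attached.
# Assumption: having caret attached alongside party is what triggers the error.
if ("package:caret" %in% search()) {
  detach("package:caret", unload = TRUE, character.only = TRUE)
}

# Confirm that caret is no longer on the search path.
caret.attached = "package:caret" %in% search()
print(caret.attached)

# Then load only party (library(party)) and rerun the cforest call.
```

Note that `unload = TRUE` may only warn rather than fully unload the namespace if another loaded package depends on caret, in which case restarting R is the safer route.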
