I am trying to run a random forest model using the party package. My response variable (10 levels) is a classification value for different lake types (I am interested in which factors influence the clustering of lakes based on water quality attributes). My predictor variables include both continuous and categorical variables. One categorical variable has 4 levels; the other has 8 levels (the US state the lake is located in). Whenever I include the second categorical variable in the model, I get the following error:
Error in model@fit(data, ...) : error code 1 from Lapack routine 'dgesdd'.
I've been able to narrow it down to the fact that the cforest routine in the party package does not seem to run when a categorical predictor has more than 4 levels. I'm not sure whether this is true for other datasets or just a characteristic of mine. Google suggests that the error code might be associated with convergence issues. Is anyone aware of limitations in the cforest routine with respect to the number of categorical predictor levels (e.g. randomForest from the randomForest package has a limit of 32 levels)? I haven't seen anything explicitly discussing this for the party package. One solution would be to recode this factor into separate dummy variables, but I would like to avoid that. Based on the characteristics of my data (correlated predictors, factors with different numbers of levels, a mix of continuous and categorical data), cforest appears to be recommended over randomForest.
Any insight would be greatly appreciated.
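For reference, the dummy-variable recoding mentioned above could look roughly like the sketch below. It uses a small made-up data frame (toy, with hypothetical state codes and values) rather than my real data, and base R's model.matrix to build one 0/1 column per level. It only illustrates the workaround and is not a claim that it resolves the error.

```r
# Hypothetical toy data standing in for new.data (names and values are made up)
toy = data.frame(
  factor2 = factor(c("MN", "WI", "MI", "MN", "NY", "WI")),
  pred1 = c(1.2, 3.4, 2.2, 0.7, 5.1, 2.9)
)

# One 0/1 indicator column per level; "- 1" drops the intercept so no level is omitted
dummies = model.matrix(~ factor2 - 1, data = toy)

# Replace the factor with its indicator columns
toy.recoded = cbind(toy["pred1"], as.data.frame(dummies))
str(toy.recoded)
```

The indicator columns are plain numeric predictors, so the model then sees several binary variables instead of one multi-level factor, which is exactly the change in interpretation I would like to avoid.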
Link to a dummy dataset (real data just limited number of variables): https://dl.dropboxusercontent.com/u/8554679/newdata.csv
library(RCurl)
library(party)
x = getURL("https://dl.dropboxusercontent.com/u/8554679/newdata.csv")
new.data = read.csv(text = x,header=TRUE)
new.data$response = as.factor(new.data$response)
new.data$factor1 = as.factor(new.data$factor1)
new.data$factor2 = as.factor(new.data$factor2)
set.seed(1123)
data.controls = cforest_unbiased(ntree=500, mtry=3)
data.cforest = cforest(response ~ factor1 + pred1 + pred2 + pred3 + pred4 + factor2 + pred5 + pred6 + pred7,data=new.data,controls=data.controls)
#executing this results in the following error: Error in model@fit(data, ...) : error code 1 from Lapack routine 'dgesdd'
#remove factor2 which has 8 levels from the formula
data.cforest = cforest(response ~ factor1 + pred1 + pred2 + pred3 + pred5 + pred6 + pred7,data=new.data,controls=data.controls)
levels(new.data$factor2)
#arbitrarily reassign factor2 levels such that there are only 4 levels
#I've tried levels between 8 and 4 and it turns out it only works if factors have <=4 levels
random.rows = sample(x=c(1:nrow(new.data)),size=nrow(new.data),replace=FALSE)
new.data$factor2 = NA
new.data$factor2[random.rows[1:120]] = 1
new.data$factor2[random.rows[121:241]] = 2
new.data$factor2[random.rows[242:362]] = 3
new.data$factor2[random.rows[363:483]] = 4
new.data$factor2 = as.factor(new.data$factor2)
levels(new.data$factor2)
data.cforest = cforest(response ~ factor1 + pred1 + pred2 + pred3 + pred4 + factor2 + pred5 + pred6 + pred7,data=new.data,controls=data.controls)
#model runs fine.
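The reassignment above hardcodes row ranges for my 483 rows. A more general sketch of the same idea, which assigns rows at random to k roughly equal-sized levels for any number of rows, is shown below. The data frame toy and the column name factor2.k are hypothetical and only there for illustration.

```r
set.seed(1123)

# Hypothetical toy data: 50 rows with an 8-level factor
toy = data.frame(
  factor2 = factor(sample(letters[1:8], 50, replace = TRUE), levels = letters[1:8])
)

# Number of levels to try (e.g. anything from 4 to 8)
k = 4

# Repeat 1..k to the number of rows, then shuffle so assignment is arbitrary
toy$factor2.k = factor(sample(rep_len(1:k, nrow(toy))))

nlevels(toy$factor2.k)
table(toy$factor2.k)
```

Changing k makes it easy to repeat the check for each number of levels between 4 and 8 without editing the row indices by hand.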
sessionInfo() output, as requested:
sessionInfo()
R version 3.0.3 (2014-03-06)
Platform: x86_64-apple-darwin10.8.0 (64-bit)
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats4 grid stats graphics grDevices utils datasets methods base
other attached packages:
[1] party_1.0-13 modeltools_0.2-21 strucchange_1.5-0 sandwich_2.3-0 zoo_1.7-11 RCurl_1.95-4.1
[7] bitops_1.0-6
loaded via a namespace (and not attached):
[1] coin_1.0-23 lattice_0.20-29 mvtnorm_0.9-99992 splines_3.0.3 survival_2.37-7 tools_3.0.3
sessionInfo()? Your code works fine for me. – nograpes