I have two data frames:
df_bad
  die_y bin
1    11  JD
2    13   I
df_good
  die_y bin
1    11  JD
2    13   I
I run logistic regression:
model_lr <- train(bin ~., data = df_bad, method = 'glm', family = 'binomial')
model_lr <- train(bin ~., data = df_good, method = 'glm', family = 'binomial')
The second succeeds (df_good was created directly):
df_good <- data.frame(die_y = c(11, 13), bin = as.factor(c('JD', 'I')))
The first fails (df_bad was sliced from a larger data frame) with the error: One or more factor levels in the outcome has no data: 'BA', 'dU', 'other', 'TT', 'XD'
Since the two data frames look identical to me, how does the algorithm have any knowledge of other potential factor values that aren't in the data? This whole mess started with errors in the original data, so I figured I would pare it down to a workable dataset and go from there. Instead, the algorithm seems to "remember" what I had and uses that as another excuse to fail. Even removing the original source data doesn't change the outcome. What gives? How can I make the algorithm forget what came before? TIA
help("droplevels"). – Roland
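To illustrate what the comment is pointing at: a factor column stores its full set of levels as an attribute, and subsetting rows does not prune that set, so a sliced data frame can print identically to a freshly built one while still carrying the unused levels. Below is a minimal sketch using base R only. The names df_full and df_fixed and the extra rows are hypothetical stand-ins for the original larger data frame, which is not shown in the question.

```r
# Hypothetical stand-in for the larger source data frame (assumption:
# 'bin' is a factor with more levels than survive the slice)
df_full <- data.frame(
  die_y = c(11, 13, 15, 17),
  bin   = factor(c("JD", "I", "BA", "TT"))
)

# Slicing rows keeps every level of the factor, including unused ones
df_bad <- df_full[1:2, ]
print(df_bad)             # prints just like df_good
print(levels(df_bad$bin)) # still lists "BA" and "TT"
print(table(df_bad$bin))  # the unused levels show up with a count of 0

# droplevels() removes the levels that have no observations
df_fixed <- droplevels(df_bad)
print(levels(df_fixed$bin))
```

Running str(df_bad) next to str(df_good) is a quick way to see the difference that print() hides. Assuming the leftover levels are the only problem, the same train() call should then accept df_fixed.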