
I have two data frames:

df_bad

  die_y bin
1    11  JD
2    13   I

df_good

  die_y bin
1    11  JD
2    13   I

I run logistic regression:

model_lr <- train(bin ~., data = df_bad, method = 'glm', family = 'binomial')

model_lr <- train(bin ~., data = df_good, method = 'glm', family = 'binomial')

The second succeeds (it was created directly):

df_good <- data.frame(die_y = c(11, 13), bin = as.factor(c('JD', 'I')))

The first fails (it was sliced from a larger data frame) with error: One or more factor levels in the outcome has no data: 'BA', 'dU', 'other', 'TT', 'XD'

Since the two data frames look identical to me, how does the algorithm know about other potential factor values that aren't in the data? This whole mess started with errors in the original data, so I figured I would pare the original data down to a workable dataset and go from there. But the algorithm seems to "remember" what I had and uses that as another excuse to fail. Even removing the original source data doesn't change the outcome. What gives? How can I make the algorithm forget what came before? TIA

Read help("droplevels"). - Roland
Data frames store all of the levels of your factors, even if you don't have cases for each level. As Roland said, use droplevels. - Hack-R
Many thanks. I added df_bad <- droplevels(df_bad) and it worked. - ds_practicioner

1 Answer


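Adding this line before the train() call fixed it:

df_bad <- droplevels(df_bad)

A sliced data frame keeps every level of its factor columns, including levels that no longer have any rows. droplevels() removes the unused ones.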
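To see why the sliced frame "remembers" the old values, here is a minimal base R sketch. The larger frame df_full and its die_y values are hypothetical, made up for illustration; only the level names come from the error message in the question. It assumes the original bin column was a factor.

```r
# Hypothetical larger data frame (values invented for illustration);
# the level names are the ones listed in the error message.
df_full <- data.frame(
  die_y = c(11, 13, 15, 17, 19, 21, 23),
  bin   = factor(c("JD", "I", "BA", "dU", "other", "TT", "XD"))
)

# Slicing keeps the factor's full level set, even though
# only 'JD' and 'I' are still present in the rows.
df_bad <- df_full[1:2, ]
nlevels(df_bad$bin)   # number of levels is unchanged by slicing
table(df_bad$bin)     # unused levels appear with a count of 0

# The levels live inside the factor itself, so deleting the
# source data frame does not change anything.
# droplevels() removes levels that have no rows.
df_bad <- droplevels(df_bad)
levels(df_bad$bin)    # only the levels actually present remain
```

Printing the two frames hides the difference because print() shows only the values; comparing levels(df_bad$bin) with levels(df_good$bin) before the droplevels() call would likely have revealed it.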