Are “cached” values causing logistic regression to fail?

Question

I have two data frames:

df_bad

  die_y bin
1    11  JD
2    13   I

df_good

  die_y bin
1    11  JD
2    13   I

I run logistic regression:

library(caret)  # train() comes from the caret package

model_lr <- train(bin ~ ., data = df_bad, method = 'glm', family = 'binomial')

model_lr <- train(bin ~ ., data = df_good, method = 'glm', family = 'binomial')

The second call succeeds (df_good was created directly):

df_good <- data.frame(die_y = c(11, 13), bin = as.factor(c('JD', 'I')))

The first call fails (df_bad was sliced from a larger data frame) with the error: One or more factor levels in the outcome has no data: 'BA', 'dU', 'other', 'TT', 'XD'

Since it appears to me the data frames are identical, how does the algorithm have any knowledge of other potential factor values that aren't in the data? This whole mess started with errors in the original data so I figured I would try to pare the original data down to a workable dataset and go from there, except the algorithm seems to "remember" what I had and use that as another excuse to fail. Even removing the original source data doesn't change the outcome. What gives? How can I make the algorithm forget what came before? TIA

Dataframes store all of the levels of your factors in them, even if you don't have cases for each level. As Roland said, use droplevels. — Hack-R
Many thanks - added the following code: df_bad <- droplevels(df_bad) and it worked — ds_practicioner
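
To make the point in the comments concrete, here is a minimal base R sketch of how a sliced data frame keeps the levels of its parent. The larger data frame `df_full` and its `die_y` values are invented for illustration; only the level names are taken from the error message in the question.

```r
# Hypothetical "larger" data frame; the values are made up for illustration.
df_full <- data.frame(
  die_y = c(11, 13, 7, 9, 5, 8, 12),
  bin   = factor(c("JD", "I", "BA", "dU", "other", "TT", "XD"))
)

# Slice out two rows, as described in the question.
df_bad <- df_full[1:2, ]

# The printed rows show only JD and I ...
print(df_bad)

# ... but the factor column still carries every level from df_full.
levels(df_bad$bin)

# table() makes the unused levels visible as zero counts.
table(df_bad$bin)

# Removing the source data does not help: the levels live inside df_bad.
rm(df_full)
nlevels(df_bad$bin)
```

This is likely why the two data frames look identical when printed but behave differently: the level set is an attribute of the factor, not something derived from the rows that happen to be present.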

ds_practicioner · Accepted Answer · 2016-11-07T21:27:45

1 vote

I added the following line before calling train(), and it works:

df_bad <- droplevels(df_bad)
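
For anyone who wants to check the effect before refitting, here is a small sketch of what droplevels() does. The data frame below is a stand-in built by hand (the asker's real data is not shown), with the unused levels from the error message attached explicitly.

```r
# Hypothetical stand-in for the sliced data frame, with leftover levels.
df_bad <- data.frame(
  die_y = c(11, 13),
  bin   = factor(c("JD", "I"),
                 levels = c("BA", "dU", "I", "JD", "other", "TT", "XD"))
)

nlevels(df_bad$bin)      # 7 levels, 5 of them unused

# droplevels() on a data frame drops unused levels from every factor column.
df_fixed <- droplevels(df_bad)
nlevels(df_fixed$bin)    # 2 levels remain

# Alternative for a single column: rebuild the factor from its own values.
df_bad$bin <- factor(df_bad$bin)
nlevels(df_bad$bin)      # 2 levels remain
```

Calling droplevels() on the whole data frame is convenient when several factor columns were sliced at once; rebuilding one column with factor() is enough when only the outcome is affected.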