2
votes

I am using a logistic regression model for some predictive analyses. We have about 25 predictor variables and 1 binary outcome (Y/N) variable. I am modeling the probability that the outcome is "Y".

I have 400,000 records in my training data set and the same number in the scoring set. The probability of a "Y" in the training set is 0.1%. The C statistic for the model as output by SAS is 0.97, which is very good.

When I run the model on my scoring set, my "positive predictive value," which is the ratio of the correctly identified "Y" to the total "Y", is less than 1, which makes my model useless. Can anybody suggest how I could improve the positive predictive value?
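For clarity of terminology: PPV (precision) is conventionally true positives over *predicted* positives, while the ratio described here (correct "Y" over total actual "Y") is recall (sensitivity). A toy calculation with made-up counts (illustrative only, not from the actual scoring set) shows how both can be low on a 0.1%-prevalence dataset even when the C statistic is high:

```python
# Hypothetical confusion-matrix counts for a scoring set with a rare
# positive class (all numbers are illustrative, not real results).
tp = 150   # predicted Y, actually Y
fp = 2850  # predicted Y, actually N
fn = 250   # predicted N, actually Y

ppv = tp / (tp + fp)     # precision: of all predicted Y, how many are really Y
recall = tp / (tp + fn)  # sensitivity: of all actual Y, how many were found

print(f"PPV    = {ppv:.3f}")     # 150 / 3000 = 0.050
print(f"Recall = {recall:.3f}")  # 150 / 400  = 0.375
```

With very rare positives, even a small false-positive *rate* produces many false positives in absolute terms, which drags PPV down while the C statistic (a ranking measure) stays high.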

2
You need different data. You could send me your data and I could guess N for every datapoint and I would be correct 99.9% of the time. - gobrewers14
This question isn't really appropriate for Stack Overflow. It would be better suited for Cross Validated, as it is about statistical model building, not programming. If your intent with this question is SAS programming, I suggest including code and clarifying your intent. - Alex A.
I would assume the ratio should be less than one. Greater than one implies you predicted more Y than are actually present, which is obviously wrong. Do you mean less than 0.01 or something else? - Joe

2 Answers

1
votes

Assuming your predictive value is below what you'd like it to be, meaning your model has high variance (it predicts well on the training set but poorly on the validation set), you should consider some basic options:

  • Increase the complexity of your model. It's possible your model simply isn't complex enough for the data. Add more predictor variables, or combinations of predictor variables, or polynomial variables.

  • Increase the number of training examples. It's possible you don't have enough training data to fit your model reliably. A typical split is 60% training / 20% validation / 20% test; a 50%/50% split may be insufficient (although 400,000 records is usually plenty).

  • Perhaps your training examples and your validation set aren't truly random samples of your population. For example, if the training set is 2011 data and the validation set is 2012 data, there may be year-to-year variation your model doesn't account for.
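On the last two points, a random split into train/validation/test guards against the sets coming from different slices of the population. A minimal stdlib-only sketch (Python for illustration; the question itself concerns SAS):

```python
import random

def split_data(records, seed=0):
    """Randomly split records 60% train / 20% validation / 20% test,
    so that all three sets are drawn from the same population."""
    records = list(records)
    random.Random(seed).shuffle(records)  # fixed seed for reproducibility
    n = len(records)
    n_train = int(0.6 * n)
    n_val = int(0.2 * n)
    train = records[:n_train]
    val = records[n_train:n_train + n_val]
    test = records[n_train + n_val:]
    return train, val, test

# 800,000 record IDs, matching the combined size mentioned in the question.
train, val, test = split_data(range(800_000))
print(len(train), len(val), len(test))  # 480000 160000 160000
```

With a time-structured dataset (2011 vs 2012), shuffling before splitting mixes both years into every subset, which is exactly what the third bullet recommends checking.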

0
votes

The recall (sensitivity) of your algorithm is low because the class distribution is highly skewed: with only 0.1% positives, the model can do well overall by predicting "N" almost everywhere. For logistic regression, you could assign a very high cost to misclassifying a positive example and a much lower cost to misclassifying a negative one. Hope that helps!
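The cost idea above can be sketched as a weighted log-loss: scale the gradient contribution of each positive example by a weight. This is a minimal from-scratch Python illustration (not SAS, and the toy data and weight value are invented for demonstration):

```python
import math

def weighted_logreg(X, y, pos_weight, lr=0.1, epochs=2000):
    """Minimal logistic regression trained by gradient descent, where each
    positive example's loss gradient is scaled by pos_weight. A large
    pos_weight makes missing a rare 'Y' far more costly than a false alarm."""
    n_feat = len(X[0])
    w = [0.0] * n_feat
    b = 0.0
    for _ in range(epochs):
        gw = [0.0] * n_feat
        gb = 0.0
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            cost = pos_weight if yi == 1 else 1.0
            err = cost * (p - yi)  # weighted gradient of the log-loss
            for j in range(n_feat):
                gw[j] += err * xi[j]
            gb += err
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, gw)]
        b -= lr * gb / len(X)
    return w, b

def predict_prob(w, b, x):
    return 1.0 / (1.0 + math.exp(-(sum(wj * xj for wj, xj in zip(w, x)) + b)))

# Toy set: 1 positive among 10 examples, mimicking a rare "Y" outcome.
X = [[1.0]] + [[-1.0]] * 9
y = [1] + [0] * 9

w1, b1 = weighted_logreg(X, y, pos_weight=1.0)  # equal costs
w9, b9 = weighted_logreg(X, y, pos_weight=9.0)  # missed positives cost 9x

p_unweighted = predict_prob(w1, b1, [1.0])
p_weighted = predict_prob(w9, b9, [1.0])
print(p_unweighted, p_weighted)  # weighted model ranks the positive higher
```

Weighting positives by roughly the inverse of their prevalence pulls the decision boundary toward the minority class, trading some false positives for better recall.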