I am using a logistic regression model for some predictive analyses. We have about 25 predictor variables and 1 binary outcome (Y/N) variable. I am modeling the probability that the outcome is "Y".
I have 400,000 records in my training data set and the same number in the scoring set. The probability of a "Y" in the training set is 0.1%. The C statistic for the model as output by SAS is 0.97, which is very good.
When I run the model on my scoring set, my "positive predictive value," which is the ratio of correctly identified "Y" records to the total number of records predicted as "Y", is less than 1, which makes my model useless. Can anybody suggest how I could improve the positive predictive value?
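
To make the issue concrete, here is a rough sketch of how positive predictive value depends on the event rate. It uses the standard identity PPV = TP / (TP + FP), written in terms of sensitivity, specificity, and prevalence. The 0.1% prevalence comes from the training set described above; the sensitivity and specificity pairs are hypothetical values chosen only for illustration and are not results from the actual model. The function name `ppv` is also just an illustrative choice.

```python
def ppv(sensitivity, specificity, prevalence):
    """Positive predictive value from sensitivity, specificity and prevalence.

    PPV = TP / (TP + FP), with both terms expressed as fractions of all records.
    """
    true_pos = sensitivity * prevalence                    # share of records that are true positives
    false_pos = (1.0 - specificity) * (1.0 - prevalence)   # share of records that are false positives
    return true_pos / (true_pos + false_pos)


prevalence = 0.001  # 0.1% of records are "Y", as in the training set

# Hypothetical operating points (assumed values, not taken from the SAS output).
for sens, spec in [(0.90, 0.95), (0.90, 0.99), (0.90, 0.999)]:
    print(f"sensitivity={sens:.2f} specificity={spec:.3f} "
          f"PPV={ppv(sens, spec, prevalence):.4f}")
```

Under these assumed numbers, PPV stays low until specificity is very close to 1, because the "N" records outnumber the "Y" records by roughly 1000 to 1. That suggests a high C statistic and a low PPV can coexist at this event rate, and that the choice of classification cutoff is one thing worth examining.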