Assessing glm by seeing how well it describes a different dataset in R

Question

I've created a logistic model using glm with ~10 predictors and a binary response variable. The model was fitted to a subset of my full dataset (~8000 observations): I randomly selected 3000 observations, put these in a new dataset (newdata), and fitted the glm to newdata.

In order to assess the model, I would like to see how well the model describes the data in a different dataset (testdata) which has a random selection of e.g. ~1000 observations from the full dataset. How would I go about doing this in R?

I have computed confidence intervals for the coefficients and looked at Wald statistics and likelihood-ratio tests (LRT) to assess the statistical significance of my model, but I would also like to see how well it describes a randomly chosen selection of the full dataset.

Thanks a bunch!

James King · Accepted Answer · 2014-05-04T14:21:53

There are several possible approaches. To evaluate the model out of sample, you first have to pick a performance metric. Suppose it is mean squared error (MSE), your fitted model is called m, and your test set is called test; then you would use:

mean((test$response - predict(m, newdata = test, type = "response"))^2)
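For a fuller picture, here is a minimal end-to-end sketch of that idea. The question's data are not shown, so the data are simulated, and the names full, train, x1 and x2 are hypothetical stand-ins. The training and test rows are drawn so that they do not overlap, which is what makes the check out of sample. For predicted probabilities against a 0/1 response, this MSE is also known as the Brier score.

```r
set.seed(1)

# Simulated stand-in for the full dataset (hypothetical columns x1, x2, response)
n <- 8000
full <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
full$response <- rbinom(n, 1, plogis(-0.5 + 1.2 * full$x1 - 0.8 * full$x2))

# Shuffle the row indices once, then take disjoint blocks for training and testing
idx <- sample(n)
train <- full[idx[1:3000], ]
test  <- full[idx[3001:4000], ]

# Fit the logistic regression on the training rows only
m <- glm(response ~ x1 + x2, data = train, family = binomial)

# Predicted probabilities for the held-out rows
p_test <- predict(m, newdata = test, type = "response")

# MSE of the predicted probabilities (Brier score); lower is better
brier <- mean((test$response - p_test)^2)

# Baseline for comparison: always predict the training-set event rate
brier_null <- mean((test$response - mean(train$response))^2)

c(model = brier, null = brier_null)
```

Comparing against the constant-rate baseline gives the raw MSE some context: a model that does not beat it on the test set is not adding predictive value.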

For logistic regression you could calculate the binomial deviance on the test set instead of using MSE. Or you could use the area under the ROC curve (AUC), or equivalently the Gini coefficient (2 × AUC − 1); AUC is available in the ROCR package. You might also want to do cross-validation rather than a single out-of-sample test, which can be done with cvTools::cvFit.
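As a rough illustration of these alternatives, the sketch below computes the test-set binomial deviance, the AUC and Gini coefficient, and a 5-fold cross-validated AUC. To stay self-contained it uses only base R rather than ROCR or cvTools: the AUC comes from the rank-sum (Mann-Whitney) identity and the cross-validation loop is written by hand. The data are simulated, and the names d, x1, x2 and auc_fun are assumptions made for the example.

```r
set.seed(2)

# Simulated data (hypothetical columns x1, x2, response)
n <- 4000
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$response <- rbinom(n, 1, plogis(-0.5 + 1.2 * d$x1 - 0.8 * d$x2))
train <- d[1:3000, ]
test  <- d[3001:4000, ]

m <- glm(response ~ x1 + x2, data = train, family = binomial)
p <- predict(m, newdata = test, type = "response")
y <- test$response

# Out-of-sample binomial deviance: -2 * log-likelihood of the test responses.
# Probabilities are clamped away from 0 and 1 to avoid log(0).
eps <- 1e-15
p_c <- pmin(pmax(p, eps), 1 - eps)
test_deviance <- -2 * sum(y * log(p_c) + (1 - y) * log(1 - p_c))

# AUC via the rank-sum identity: the probability that a randomly chosen
# positive case gets a higher predicted probability than a negative one
auc_fun <- function(p, y) {
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  (sum(rank(p)[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}
auc  <- auc_fun(p, y)
gini <- 2 * auc - 1

# Hand-written 5-fold cross-validation of the AUC over all rows
k <- 5
fold <- sample(rep(1:k, length.out = nrow(d)))
cv_auc <- sapply(1:k, function(i) {
  fit <- glm(response ~ x1 + x2, data = d[fold != i, ], family = binomial)
  p_i <- predict(fit, newdata = d[fold == i, ], type = "response")
  auc_fun(p_i, d$response[fold == i])
})

c(deviance = test_deviance, auc = auc, gini = gini, cv_auc = mean(cv_auc))
```

The spread of cv_auc across folds gives a sense of how much a single train/test split could vary, which is the main reason to prefer cross-validation over one held-out sample.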
