Predict function for logistic regression returning results for entire dataset not just training dataset

Question

I am having an issue where the predict function for my logistic regression model returns predictions for the entire dataset instead of just the test data. My test data has 6,931 rows. Here is my model:

test_model <- glm(status2~grade +  verified  + term + income + revolRatio +totalAcc + totalRevLim + accOpen24 ,data=Loan_test,family="binomial")

and here is my predict function:

probabilities <- predict(test_model,newdata=Loan_test, type="response")

Any help on what I am doing wrong is appreciated.

Ok here is what I changed it to using my training dataset which has ~ 27000 rows:

test_model <- glm(status2~grade +  verified  + term + income + revolRatio +totalAcc + totalRevLim + accOpen24 ,data=Loan_training,family="binomial")

probabilities <- predict(test_model,newdata=Loan_test, type="response")

but probabilities still contains 34000+ rows.

You built your model on what appears to be your test data. Is that what you meant to do? — camille
A reproducible example would help. Otherwise it's unclear what the other data is that you expect the model to be using, since you've both built the model and are now trying to predict based on the same dataset which is labeled as being a test set — camille
How is this? This is my original model, built using the training dataset: Loantrain_Four <- glm(status2~grade + verified + term + income + revolRatio +totalAcc + totalRevLim + accOpen24 ,data=Loan_training,family="binomial") — Jim Ryan

Len Greski · Accepted Answer · 2020-04-06T02:24:17

To predict against a hold-out data set, one should split the initial data into training and test data frames. Since the OP's comments note that separate training and test data frames already exist, we'll simply use the training data frame to build the model and make predictions against the test data frame.

# use training data for model
test_model <- glm(status2~grade +  verified  + term + income + revolRatio +totalAcc + totalRevLim + accOpen24 ,
                  data=Loan_training,family="binomial")

#make predictions using hold out data (test)
probabilities <- predict(test_model,newdata=Loan_test, type="response")
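Since the loan data isn't available, here is a minimal sketch of the same pattern on simulated data. The data frame names (loans, loan_train_demo, loan_test_demo) and all values are made up for illustration; only two of the predictor names from the post are reused. It checks that predict() returns one value per row of newdata.

```r
# Simulated stand-in for the loan data (hypothetical values, not the OP's data)
set.seed(42)
n <- 500
loans <- data.frame(income = rnorm(n), totalAcc = rnorm(n))
loans$status2 <- rbinom(n, 1, plogis(0.5 * loans$income - 0.3 * loans$totalAcc))

# 80/20 split into training and hold-out rows
train_rows <- sample(seq_len(n), size = floor(0.8 * n))
loan_train_demo <- loans[train_rows, ]
loan_test_demo  <- loans[-train_rows, ]

# Fit on the training rows only
fit_demo <- glm(status2 ~ income + totalAcc,
                data = loan_train_demo, family = "binomial")

# Predict on the hold-out rows only
probs_demo <- predict(fit_demo, newdata = loan_test_demo, type = "response")

# The two numbers printed here should be equal
nrow(loan_test_demo)
length(probs_demo)
stopifnot(length(probs_demo) == nrow(loan_test_demo))
```

Running the same nrow() / length() comparison on Loan_test and probabilities is a quick way to confirm whether predict() really used the test data frame.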

A complete example: prediction with binomial regression

Here is a complete working example using the South African Heart Disease data from the ElemStatLearn package. It shows that when we split a data frame into test and training sets, fit a binomial model with glm(), and make predictions with both the test and training data frames, the number of predictions equals the number of rows in the data frame passed to predict().

library(ElemStatLearn)
data(SAheart)
set.seed(801248)
train = sample(1:dim(SAheart)[1],size=dim(SAheart)[1]*.6,replace=F)
trainSA = SAheart[train,]
nrow(trainSA)
testSA = SAheart[-train,]
nrow(testSA)

At this point we can see that there are differing numbers of rows in trainSA and testSA.

> nrow(trainSA)
[1] 277
> testSA = SAheart[-train,]
> nrow(testSA)
[1] 185
>

Next, we fit a binomial generalized linear model with glm().

modFit <- glm(chd ~ age + alcohol + obesity + tobacco + typea + ldl,
            data=trainSA,
            family="binomial")

When we make predictions on both the test and training data frames, we note that the lengths of the output vectors match the numbers of rows in the data frames passed to predict().

predicted_test <- predict(modFit,testSA)
length(predicted_test)
predicted_train <- predict(modFit,trainSA)
length(predicted_train)

...and the output:

> length(predicted_test)
[1] 185
> predicted_train <- predict(modFit,trainSA)
> length(predicted_train)
[1] 277

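Finally, we demonstrate that the results of predict() differ between the two data frames by calculating the misclassification rate for each. Note that predict() was called above without type="response", so the predictions are on the link (log-odds) scale and the 0.5 cutoff in missClass() is applied to log-odds rather than probabilities.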

missClass = function(values,prediction){sum(((prediction > 0.5)*1) != values)/length(values)}
# Classification errors on TrainSA
missClass(trainSA$chd,predicted_train)
# Classification Errors on TestSA
missClass(testSA$chd,predicted_test)

...and the output:

> missClass(trainSA$chd,predicted_train)
[1] 0.2924188
> # Classification Errors on TestSA
> missClass(testSA$chd,predicted_test)
[1] 0.2594595
>
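For a binomial glm, predict() defaults to type = "link", which returns log-odds; type = "response" returns probabilities. The sketch below uses simulated data (the names d, fit, and miss_class are hypothetical, not from the example above) to show how the two scales relate and how a 0.5 cutoff would be applied to probabilities.

```r
# Simulated data for illustration only
set.seed(123)
n <- 300
d <- data.frame(x = rnorm(n))
d$y <- rbinom(n, 1, plogis(0.8 * d$x))

fit <- glm(y ~ x, data = d, family = "binomial")

link_pred <- predict(fit, newdata = d)                     # default: log-odds
prob_pred <- predict(fit, newdata = d, type = "response")  # probabilities

# The inverse logit of the log-odds reproduces the probabilities
stopifnot(isTRUE(all.equal(plogis(link_pred), prob_pred)))

# Misclassification rate with a 0.5 cutoff on the probability scale
miss_class <- function(values, prob) mean((prob > 0.5) != values)
miss_class(d$y, prob_pred)
```

A probability of 0.5 corresponds to log-odds of 0, so a cutoff of 0.5 on link-scale predictions is a stricter rule than a cutoff of 0.5 on probabilities.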

CONCLUSION: The code in the original post is somehow referencing the full data frame when predict() is called, but we can't see how because the post does not include a minimal reproducible example.
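One well-known way to get this symptom is a model formula that hard-codes a data frame with $ instead of using bare column names with data=. Nothing in the post confirms that this is what happened here; the sketch below, on simulated data with hypothetical names (full, bad_fit, good_fit), only illustrates the mechanism. In that case predict() evaluates the variables from the hard-coded data frame, ignores the rows of newdata, and issues a warning about mismatched row counts.

```r
# Simulated data for illustration only
set.seed(7)
n <- 200
full <- data.frame(x = rnorm(n))
full$y <- rbinom(n, 1, plogis(full$x))
train_df <- full[1:150, ]
test_df  <- full[151:200, ]

# Pitfall: the formula refers to full$x directly, so predict() cannot
# substitute the x column from newdata
bad_fit  <- glm(full$y ~ full$x, family = "binomial")
bad_pred <- suppressWarnings(
  predict(bad_fit, newdata = test_df, type = "response")
)
length(bad_pred)    # one value per row of full, not per row of test_df

# Bare column names plus data= let predict() look up x in newdata
good_fit  <- glm(y ~ x, data = train_df, family = "binomial")
good_pred <- predict(good_fit, newdata = test_df, type = "response")
length(good_pred)   # one value per row of test_df
```

If predict() printed a warning along the lines of "'newdata' had ... rows but variables found have ... rows", that points to this kind of problem. Checking nrow(Loan_test) and names(Loan_test) directly is also worthwhile, in case the test data frame is not what it is assumed to be.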
