To predict against a holdout data set, split the initial data into training and test data frames. Since the comments on the original post note that separate training and test data frames already exist, we'll simply use the training data frame to build the model and make predictions against the test data frame.
# use training data for model
test_model <- glm(status2 ~ grade + verified + term + income + revolRatio + totalAcc + totalRevLim + accOpen24,
                  data = Loan_training, family = "binomial")
# make predictions using holdout data (test)
probabilities <- predict(test_model, newdata = Loan_test, type = "response")
A complete example: prediction with binomial regression
Here is a complete working example using the South African Heart Disease data from the ElemStatLearn package. It shows that when we split a data frame into test and training sets, fit a binomial model with glm(), and make predictions on both the test and training data frames, the number of predictions equals the number of rows in the data frame passed to predict().
library(ElemStatLearn)
data(SAheart)
set.seed(801248)
train = sample(1:dim(SAheart)[1], size = dim(SAheart)[1] * .6, replace = FALSE)
trainSA = SAheart[train,]
nrow(trainSA)
testSA = SAheart[-train,]
nrow(testSA)
At this point we can see that there are differing numbers of rows in trainSA and testSA.
> nrow(trainSA)
[1] 277
> testSA = SAheart[-train,]
> nrow(testSA)
[1] 185
>
Next, we fit a binomial generalized linear model with glm().
modFit <- glm(chd ~ age + alcohol + obesity + tobacco + typea + ldl,
data=trainSA,
family="binomial")
When we make predictions on both the test and training data frames, the lengths of the output vectors match the numbers of rows in the data frames passed to predict().
predicted_test <- predict(modFit,testSA)
length(predicted_test)
predicted_train <- predict(modFit,trainSA)
length(predicted_train)
...and the output:
> length(predicted_test)
[1] 185
> predicted_train <- predict(modFit,trainSA)
> length(predicted_train)
[1] 277
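Finally, we demonstrate that the results of predict() differ between the two data frames by calculating the misclassification rate for each. Note that predict() on a glm object returns values on the link (log-odds) scale by default, so the 0.5 cutoff below is applied to log-odds, which corresponds to a probability of about 0.62. To cut at a probability of 0.5 instead, pass type="response" to predict().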
missClass = function(values,prediction){sum(((prediction > 0.5)*1) != values)/length(values)}
# Classification errors on TrainSA
missClass(trainSA$chd,predicted_train)
# Classification Errors on TestSA
missClass(testSA$chd,predicted_test)
...and the output:
> missClass(trainSA$chd,predicted_train)
[1] 0.2924188
> # Classification Errors on TestSA
> missClass(testSA$chd,predicted_test)
[1] 0.2594595
>
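The difference between the two prediction scales can be checked directly. The following is a minimal sketch on simulated data (the data frame `sim` and its columns are hypothetical and are not from the original post); it only illustrates how the default `type = "link"` output relates to `type = "response"`.

```r
# Simulated data, for illustration only (not the SAheart or loan data)
set.seed(1)
n <- 200
sim <- data.frame(x = rnorm(n))
sim$y <- rbinom(n, 1, plogis(0.8 * sim$x))

fit <- glm(y ~ x, data = sim, family = "binomial")

# Default is type = "link": values are log-odds and are not confined to [0, 1]
log_odds <- predict(fit, newdata = sim)

# type = "response" returns fitted probabilities in [0, 1]
probs <- predict(fit, newdata = sim, type = "response")

# The two scales are related by the inverse logit, plogis()
same_scale <- isTRUE(all.equal(unname(plogis(log_odds)), unname(probs)))

# Cutting log-odds at 0 gives the same classes as cutting probabilities at 0.5
same_classes <- identical(unname(log_odds > 0), unname(probs > 0.5))
```

If a 0.5 probability cutoff is intended, either threshold the `type = "response"` output at 0.5 or threshold the default output at 0.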
CONCLUSION: The code in the original post appears to be passing the original (training) data frame to predict() in some way, but we can't see how because the post does not include a minimal reproducible example.
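Without a reproducible example the actual cause can't be confirmed, but two common ways that predict() ends up returning one value per training row are sketched below. This is an assumption about what might be happening, not a diagnosis of the original code; the data and the names `dat`, `train_df`, and `test_df` are hypothetical.

```r
# Simulated data, for illustration only
set.seed(2)
n <- 100
dat <- data.frame(x = rnorm(n))
dat$y <- rbinom(n, 1, plogis(dat$x))
train_df <- dat[1:60, ]
test_df  <- dat[61:100, ]

fit <- glm(y ~ x, data = train_df, family = "binomial")

# Correct usage: the argument is named newdata, so we get one value per test row
n_ok <- length(predict(fit, newdata = test_df, type = "response"))

# Pitfall 1: 'data' is not an argument of predict.glm(). It is absorbed by
# '...' and ignored, so the fitted values for the training rows are returned.
n_wrong_arg <- length(predict(fit, data = test_df, type = "response"))

# Pitfall 2: the formula refers to columns with $, so predict() looks up
# train_df$x again instead of using the columns of newdata (R warns about
# the row mismatch; the warning is suppressed here to keep the output short).
fit_dollar <- glm(train_df$y ~ train_df$x, family = "binomial")
n_dollar <- length(suppressWarnings(
  predict(fit_dollar, newdata = test_df, type = "response")
))

c(correct = n_ok, wrong_argument = n_wrong_arg, dollar_formula = n_dollar)
```

In both pitfalls the length of the result matches nrow(train_df) rather than nrow(test_df), which is the symptom described in the question.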
Loantrain_Four <- glm(status2 ~ grade + verified + term + income + revolRatio + totalAcc + totalRevLim + accOpen24, data = Loan_training, family = "binomial")
- Jim Ryan