
While using the predict function in R to get the predictions from a Random Forest model, I misspecified the training data as newdata as follows:

RF1pred <- predict(RF1, newdata=TrainS1, type = "class")

Used like this, I get extremely high accuracy and AUC, which I am sure is not right, but I couldn't find a good explanation for it. This thread is the closest I got, but I can't say I fully understand the explanation there.

If someone could elaborate, I will be grateful.

Thank you!

EDIT: Important to note: I get sensible accuracy and AUC if I run the prediction without specifying a dataset at all, like so:

RF1pred <- predict(RF1, type = "class")

If a new dataset is not explicitly specified, isn't the training data used for prediction? Hence, shouldn't I get the same results from both lines of code?

EDIT2: Here is some sample code with random data that illustrates the point. When predicting without specifying newdata, the AUC is 0.4893. When newdata=train is explicitly specified, the AUC is 0.7125.

# Generate sample data
set.seed(15)
train <- data.frame(x1=sample(0:1, 100, replace=T), x2=rpois(100,10), y=sample(0:1, 100, replace=T))

# Build random forest
library(randomForest)
model <- randomForest(x1 ~ x2, data=train)
pred1 <- predict(model)                   # no newdata supplied
pred2 <- predict(model, newdata = train)  # training data passed explicitly

# Calculate AUC
library(ROCR)
ROCRpred1 <- prediction(pred1, train$x1)
AUC <- as.numeric(performance(ROCRpred1, "auc")@y.values)
AUC  # 0.4893
ROCRpred2 <- prediction(pred2, train$x1)
AUC <- as.numeric(performance(ROCRpred2, "auc")@y.values)
AUC  # 0.7125
I think that previous question does answer yours. You get such high accuracy because you are applying the derived algorithm to the data from which it was derived. In other words, you are running an in-sample test of model fit. – ulfelder
OK, I should have mentioned that I get normal results when I skip the newdata option. If newdata is not explicitly specified, isn't the algorithm again applied to the same (training) data? – DGenchev
What package/function are you using to run Random Forests? – ulfelder
I am using the randomForest package. – DGenchev
Are you doing regression or classification? – ulfelder

1 Answer


If you look at the documentation for predict.randomForest, you will see that if you do not supply a new data set, you get the out-of-bag (OOB) predictions: each training observation is predicted using only the trees that did not have it in their bootstrap sample. Because the OOB predictions approximate how the model would perform on an independent data set, the resulting accuracy and AUC are much more realistic (although still not a substitute for a real, independently collected validation set). When you pass the training data as newdata, every tree, including those that were fit on a given observation, is used to predict it, so you are running an in-sample test and the performance looks inflated.
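
To see this directly, here is a minimal sketch that rebuilds the simulated train data and model from the question. It relies on the fact that a fitted randomForest object stores its OOB predictions in the predicted component, so predict(model) with no newdata should simply return those values, while predict(model, newdata = train) gives in-sample predictions that differ:

library(randomForest)

# same simulated data and model as in the question
set.seed(15)
train <- data.frame(x1 = sample(0:1, 100, replace = TRUE),
                    x2 = rpois(100, 10),
                    y  = sample(0:1, 100, replace = TRUE))
model <- randomForest(x1 ~ x2, data = train)

# predict() without newdata returns the OOB predictions,
# which the fitted object also stores in model$predicted
all.equal(unname(predict(model)), unname(model$predicted))    # TRUE

# with newdata = train, every tree scores every row (in-sample),
# so these predictions generally differ from the OOB ones
all.equal(unname(predict(model, newdata = train)),
          unname(model$predicted))                            # reports a mean relative difference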