My training dataset (train) is a data frame with n features and an additional column with the outcomes y. I built three individual models, for example:
m1 <- train(y ~ ., data = train, method = "lda")
m2 <- train(y ~ ., data = train, method = "rf")
m3 <- train(y ~ ., data = train, method = "gbm")
With the test dataset (test), which naturally includes the outcomes y, I can evaluate the quality of these individual models:
pred1 <- predict(m1, newdata = test)
pred2 <- predict(m2, newdata = test)
pred3 <- predict(m3, newdata = test)
If I apply each individual model to a data frame DATA_TO_PREDICT (the outcomes are unknown) with 5 examples, the output is, as expected, 5 predictions per model:
predict(m1, DATA_TO_PREDICT)
predict(m2, DATA_TO_PREDICT)
predict(m3, DATA_TO_PREDICT)
Now I would like to build a combined model with the R caret package, using random forest on the predictions of the individual models:
DF <- data.frame(pred1, pred2, pred3, y = test$y)
MODEL <- train(y ~ ., data = DF, method = "rf")
I can observe that the accuracy of the combined model is higher than that of the individual models:
predMODEL <- predict(MODEL, DF)
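For reference, this is roughly how the accuracy comparison can be computed. It is a minimal sketch in base R on made-up factors; `truth` and `pred` are hypothetical stand-ins for `DF$y` and `predMODEL`, and the values are not results from my data.

```r
# Toy stand-ins for the real objects (hypothetical values, not real results)
truth <- factor(c("a", "b", "a", "b", "a"), levels = c("a", "b"))
pred  <- factor(c("a", "b", "b", "b", "a"), levels = c("a", "b"))

# Accuracy = share of predictions that match the known outcome
accuracy <- mean(pred == truth)
accuracy
```

With the real objects this would be `mean(predMODEL == DF$y)`, or `confusionMatrix(predMODEL, DF$y)` from caret for a fuller summary. Note that predMODEL here is computed on the same rows of DF that MODEL was trained on, so the figure is probably optimistic.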
But if I apply the combined model to DATA_TO_PREDICT (the outcomes are unknown), the output is not 5 predictions but a long vector of repeated results with more than a hundred entries. I used:
predict(MODEL, newdata = DATA_TO_PREDICT)
EXAMPLE:
Here is a concrete example where the output is wrong. I want to predict 4 new observations, but I get a result with dozens of outputs:
library(caret)
library(gbm)
set.seed(10)
library(AppliedPredictiveModeling)
data(AlzheimerDisease)
adData <- data.frame(diagnosis, predictors)
inTrain <- createDataPartition(adData$diagnosis, p = 3/4)[[1]]
training <- adData[inTrain, ]
testing <- adData[-inTrain, ]
# Hold back the first 4 rows of testing as "new" data; the rest is used to build the combined model
inTEST <- 5:nrow(testing)
test <- testing[inTEST, ]
DATA_TO_PREDICT <- testing[-inTEST, ]
m1 <- train(diagnosis ~ ., data=training, method="rf")
m2 <- train(diagnosis ~ ., data=training, method="gbm")
m3 <- train(diagnosis ~ ., data=training, method="lda")
p1 <- predict(m1, newdata = test)
p2 <- predict(m2, newdata = test)
p3 <- predict(m3, newdata = test)
DF <- data.frame(p1, p2, p3, diagnosis = test$diagnosis)
MODEL <- train(diagnosis ~ ., data = DF, method = "rf")
predMODEL <- predict(MODEL, DF)
Then I apply the combined model to DATA_TO_PREDICT:
pred1 <- predict(m1, DATA_TO_PREDICT)
pred2 <- predict(m2, DATA_TO_PREDICT)
pred3 <- predict(m3, DATA_TO_PREDICT)
DF2 <- data.frame(pred1, pred2, pred3)
predict(MODEL, newdata = DF2)
Note that DATA_TO_PREDICT has only 4 examples, yet the output has 78 entries:
[1] Control Control Control Control Control Control Control Control
[9] Control Control Control Control Control Control Control Control
[17] Control Control Control Control Control Control Control Control
[25] Control Control Control Control Control Control Control Control
[33] Control Control Control Control Control Control Control Control
[41] Control Control Control Control Control Control Control Control
[49] Control Control Control Control Control Control Control Control
[57] Control Control Control Control Control Control Control Control
[65] Control Control Control Control Control Control Control Control
[73] Control Control Control Control Control Control
Levels: Impaired Control
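One possible cause worth checking (an assumption, not a confirmed diagnosis): MODEL was trained on columns named p1, p2, p3, while DF2 carries pred1, pred2, pred3, and 78 is exactly the number of rows in test. The sketch below uses base R's `lm` on made-up data rather than caret, and all names in it (`x`, `train_df`, `new_bad`, `new_ok`) are hypothetical. It shows that when `newdata` lacks a variable from the formula, R can silently fall back to an object of the same name in the formula's environment and return one prediction per element of that object.

```r
set.seed(1)
# Made-up data: 78 rows, mirroring the size of `test` in the example
x <- rnorm(78)
y <- 2 * x + rnorm(78)
train_df <- data.frame(x = x, y = y)
fit <- lm(y ~ x, data = train_df)

# 4 new rows whose column name differs from the training column
new_bad <- data.frame(x_new = rnorm(4))
# `x` is not found in new_bad, so the 78-element `x` from the formula's
# environment is used instead; R only issues a warning about the row count
out_bad <- suppressWarnings(predict(fit, newdata = new_bad))
length(out_bad)  # 78, not 4

# Same 4 rows, now with the column name the model was trained on
new_ok <- data.frame(x = new_bad$x_new)
out_ok <- predict(fit, newdata = new_ok)
length(out_ok)   # 4
```

The analogous check in the example would be to compare `names(DF)` with `names(DF2)` and, if they differ, build the new data frame as `data.frame(p1 = pred1, p2 = pred2, p3 = pred3)` before calling `predict(MODEL, newdata = ...)`.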