My training dataset (train) is a data frame with n features and an additional column holding the outcome y. I built three individual models, for example:

m1 <- train(y ~ ., data = train, method = "lda")
m2 <- train(y ~ ., data = train, method = "rf")
m3 <- train(y ~ ., data = train, method = "gbm")

With the test dataset (test), which also contains the outcome y, I can evaluate the quality of these individual models:

pred1 <- predict(m1, newdata = test)
pred2 <- predict(m2, newdata = test)
pred3 <- predict(m3, newdata = test)

If I apply each individual model to a data frame DATA_TO_PREDICT (whose outcomes are unknown) with 5 examples, the output is, as expected, 5 predictions per model:

predict(m1, DATA_TO_PREDICT)
predict(m2, DATA_TO_PREDICT)
predict(m3, DATA_TO_PREDICT)

Now I would like to build a combined model with the caret package in R, using a random forest:

DF <- data.frame(pred1, pred2, pred3, y = test$y)
MODEL <- train(y ~ ., data = DF, method = "rf")

I can observe that the accuracy of the combined model has increased:

predMODEL <- predict(MODEL, DF)

But if I apply the combined model to DATA_TO_PREDICT (whose outcomes are unknown), the output contains not 5 predictions but a long vector of more than a hundred repeated results. I used:

predict(MODEL, newdata = DATA_TO_PREDICT)

EXAMPLE:

Here is a concrete example where the output is wrong: I want predictions for 4 new rows, but I get dozens of outputs:

library(caret)
library(gbm)
set.seed(10)
library(AppliedPredictiveModeling)
data(AlzheimerDisease)
adData = data.frame(diagnosis,predictors)
inTrain = createDataPartition(adData$diagnosis, p = 3/4)[[1]]
training = adData[ inTrain,]
testing = adData[-inTrain,]

inTEST <- (5:nrow(testing))
test <- testing[inTEST,]
DATA_TO_PREDICT <- testing[-inTEST,]

m1 <- train(diagnosis ~ ., data=training, method="rf")
m2 <- train(diagnosis ~ ., data=training, method="gbm")
m3 <- train(diagnosis ~ ., data=training, method="lda")
p1 <- predict(m1, newdata = test)
p2 <- predict(m2, newdata = test)
p3 <- predict(m3, newdata = test)

DF <- data.frame(p1, p2, p3, diagnosis = test$diagnosis)
MODEL <- train(diagnosis ~ ., data = DF, method = "rf")
predMODEL <- predict(MODEL, DF)

Then, when I apply the combined model:

pred1 <- predict(m1, DATA_TO_PREDICT)
pred2 <- predict(m2, DATA_TO_PREDICT)
pred3 <- predict(m3, DATA_TO_PREDICT)
DF2 <- data.frame(pred1, pred2, pred3)
predict(MODEL, newdata = DF2) 

Note that DATA_TO_PREDICT has only 4 rows, yet the output is:

  [1] Control Control Control Control Control Control Control Control
  [9] Control Control Control Control Control Control Control Control
 [17] Control Control Control Control Control Control Control Control
 [25] Control Control Control Control Control Control Control Control
 [33] Control Control Control Control Control Control Control Control
 [41] Control Control Control Control Control Control Control Control
 [49] Control Control Control Control Control Control Control Control
 [57] Control Control Control Control Control Control Control Control
 [65] Control Control Control Control Control Control Control Control
 [73] Control Control Control Control Control Control
 Levels: Impaired Control

1 Answer

This happens because MODEL was trained on the predictions of the three individual models (p1, p2 and p3 in the example, computed on the test data), whereas in the last step MODEL is given DATA_TO_PREDICT, which contains the raw observations rather than predictions. The predictions of the individual models for DATA_TO_PREDICT have to be computed first and then passed to MODEL as newdata, in a data frame whose column names match those used for training (p1, p2 and p3).

# (Beginning of the example omitted)
DF <- data.frame(p1, p2, p3, diagnosis = test$diagnosis)
# This trains a model with predictions as inputs:
MODEL <- train(diagnosis ~ ., data = DF, method = "rf")

# This is missing ----------------------
# To get the inputs for the ensemble model
# the predictions for DATA_TO_PREDICT are needed
p1b <- predict(m1, newdata = DATA_TO_PREDICT)
p2b <- predict(m2, newdata = DATA_TO_PREDICT)
p3b <- predict(m3, newdata = DATA_TO_PREDICT)
DFb <- data.frame(p1b, p2b, p3b)
colnames(DFb) <- c("p1", "p2", "p3")
#----------------------------------------

predMODEL <- predict(MODEL, DFb)
# [1] Control Control Control Control
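
A note on why the example in the question returns 78 values even though the base-model predictions were computed first: DF2 has the columns pred1, pred2 and pred3, while MODEL was trained on p1, p2 and p3. As far as I understand it, when a formula variable is missing from newdata, R falls back to an object of that name in the workspace, which here would be the 78 test-set predictions. I assume caret's formula interface behaves the same way, but I have not traced it through the caret source. The sketch below reproduces the effect with base R's lm() on made-up data; the toy vectors and the names fit, new_bad and new_good are purely illustrative and not part of the original example.

```r
set.seed(1)

# Toy stand-ins for the stacked inputs: 20 "training" rows in the workspace
p1 <- rnorm(20)
y  <- 2 * p1 + rnorm(20)
DF <- data.frame(p1, y)
fit <- lm(y ~ p1, data = DF)

# Four new rows, once with the wrong and once with the right column name
new_bad  <- data.frame(pred1 = c(0, 1, 2, 3))
new_good <- data.frame(p1    = c(0, 1, 2, 3))

# "p1" is not a column of new_bad, so it is taken from the workspace instead
# (predict.lm warns that newdata had 4 rows but the variables found have 20)
n_bad  <- length(suppressWarnings(predict(fit, newdata = new_bad)))
n_good <- length(predict(fit, newdata = new_good))

n_bad   # 20
n_good  # 4
```

Renaming the columns, as colnames(DFb) <- c("p1", "p2", "p3") does above, avoids that fallback and yields one prediction per new row.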