
I am new to caret and R and have a question regarding stacking. I am getting the following error message from confusionMatrix:

"Error in table(data, reference, dnn = dnn, ...) : all arguments must have the same length"

Can you please advise how to resolve this? Details below.


Step 1. I divided the data into training, validation and test sets. Using the train function, I created model objects from the training set using 3 methods (rf, gbm, lda).

trCtrl <- trainControl(method = 'cv', number = 3)

rfObj <- train(diagnosis ~ ., data = training, method = "rf", trControl = trCtrl)
gbmObj <- train(diagnosis ~ ., data = training, method = "gbm", trControl = trCtrl, verbose = FALSE)
ldaObj <- train(diagnosis ~ ., data = training, method = "lda")

Step 2. Using the objects above, I made predictions on the validation set and collected them in a data frame. This data frame has 4 columns: 3 from the predictions and one from validation$diagnosis.

rfPred <- predict(rfObj, newdata = validation)
gbmPred <- predict(gbmObj, newdata = validation)
ldaPred <- predict(ldaObj, newdata = validation)

metaTrain <- data.frame(rfPred, gbmPred, ldaPred, diagnosis = validation$diagnosis)

Using the train function on this data frame using Random Forests, I created the final object.

metaObj <- train(diagnosis~., data = metaTrain, method = "rf", trControl = trCtrl)

Step 3. Finally, I got predictions on the test set using the objects created in Step 1 and combined them into a data frame (3 columns). Then, using the object created in Step 2, I predicted on this new data frame.

rfPredTest <- predict(rfObj, newdata = testing)
gbmPredTest <- predict(gbmObj, newdata = testing)
ldaPredTest <- predict(ldaObj, newdata = testing)
finalDF <- data.frame(rfPredTest, gbmPredTest, ldaPredTest)
finalPred <- predict(metaObj, newdata = finalDF)

I am getting the following message here: "'newdata' had 82 rows but variables found have 44 rows."

And then -

confusionMatrix(finalPred, testing$diagnosis)

I am getting the following error:

Error in table(data, reference, dnn = dnn, ...) : all arguments must have the same length

Can you please let me know what I am doing wrong? Thank you for your help with this.


1 Answer


Please provide a reproducible example to help the community help you. I assume you are tuning a classification model (given the use of method = "lda"). If you are new to caret, read the package vignette, which is very well written.

library(caret)

df <- iris
#split data into train, validation and test
ind <- createFolds(df$Species,k=3)
train <- df[ind$Fold1,]
val <- df[ind$Fold2,]
test <- df[ind$Fold3,]

Due to the resampling procedure used inside train, a separate validation set is not strictly necessary; training and test sets would be enough. However, here I continue along the lines of your code.

#set the resampling method (3-fold cross-validation)
trCtrl <- trainControl(method = 'cv', number = 3)

rfObj <- train(Species ~ .,
               data = train,
               method = "rf",
               trControl = trCtrl)

gbmObj <- train(Species ~ .,
                data = train,
                method = "gbm",
                trControl = trCtrl,
                verbose = FALSE)

ldaObj <- train(Species ~ ., data = train, method = "lda", trControl = trCtrl)

By default, train tests 3 different values for each tuning parameter (e.g. 3 values of mtry for random forests; 3 values each of interaction.depth and n.trees, for a total of 3^2 = 9 combinations, for stochastic gradient boosting, etc.). Now that we have fitted the models, we can compare them:

resamps <- resamples(list(RF = rfObj,
                      GBM = gbmObj,
                      LDA = ldaObj))
summary(resamps)
bwplot(resamps, layout = c(2, 1))
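If you want train to search more candidates than the default 3, you can widen the search with the tuneLength argument or pass an explicit grid via tuneGrid. A minimal sketch, continuing with the rf model above (the mtry values here are just illustrative):

# try 5 candidate values of each tuning parameter instead of 3
rfObjWide <- train(Species ~ ., data = train, method = "rf",
                   trControl = trCtrl, tuneLength = 5)

# or supply an explicit grid; mtry is the only tuning parameter for "rf"
rfGrid <- expand.grid(mtry = c(1, 2, 3, 4))
rfObjGrid <- train(Species ~ ., data = train, method = "rf",
                   trControl = trCtrl, tuneGrid = rfGrid)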

Note that for every model object the best model has already been refit on all the training data, regardless of the chosen resampling method, using the best tuning parameter set found during resampling (you can access the final model with ldaObj$finalModel; by default, calling predict on a caret model uses the $finalModel).
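You can inspect which tuning parameters won and how each candidate performed. A short sketch using the rfObj fitted above:

rfObj$bestTune     # the winning tuning parameter combination (here, mtry)
rfObj$results      # resampled performance for every candidate tried
rfObj$finalModel   # the underlying randomForest object refit on all training data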

The following steps of your analysis are unusual to me; however, as above, I continue along the lines of your code. The mistake is in finalPred <- predict(metaObj, newdata = finalDF): you built metaObj with the variables in metaTrain (i.e. rfPred, gbmPred, ldaPred), so the metaObj model expects those same variable names. Because finalDF uses different names (rfPredTest, etc.), predict cannot find the expected columns in newdata and instead picks up the rfPred, gbmPred, ldaPred objects from your workspace, which have the validation set's length. That is the source of the "'newdata' had 82 rows but variables found have 44 rows" warning. You can set the expected names with colnames:

rfPred <- predict(rfObj, newdata = val)
gbmPred <- predict(gbmObj, newdata = val)
ldaPred <- predict(ldaObj, newdata = val)

metaTrain <- data.frame(rfPred, gbmPred, ldaPred, Species = val$Species)

metaObj <- train(Species ~ ., data = metaTrain, method = "rf", trControl = trCtrl)

rfPredTest <- predict(rfObj, newdata = test)
gbmPredTest <- predict(gbmObj, newdata = test)
ldaPredTest <- predict(ldaObj, newdata = test)
finalDF <- data.frame(rfPredTest, gbmPredTest, ldaPredTest)

# here we set the same colnames as the metaTrain object
colnames(finalDF) <- colnames(metaTrain[, 1:3])

finalPred <- predict(metaObj, newdata = finalDF)

confusionMatrix(finalPred, test$Species)

Now everything works like a charm.
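As an aside, and going beyond the original question: the caretEnsemble package automates this kind of stacking using the out-of-fold resampled predictions, which is why no separate validation set is needed. A rough sketch (note this is an assumption on my part, not the asker's workflow, and older caretEnsemble versions only supported two-class or regression problems, so check your version before trying it on iris):

library(caretEnsemble)

# fit the base learners on shared resampling folds
models <- caretList(Species ~ ., data = train,
                    trControl = trainControl(method = "cv", number = 3,
                                             savePredictions = "final"),
                    methodList = c("rf", "lda"))

# stack them with a random-forest meta-learner
stack <- caretStack(models, method = "rf")
predict(stack, newdata = test)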