
I have a dataset with 4669 observations and 15 variables.

I am using a random forest to predict whether a particular product will be accepted or not.

In my latest data, the output variable takes the values "Yes", "NO", and "" (empty).

I want to predict whether these "" observations will be Yes or No.

I am using the following code.

library(randomForest)

# Toy data: "" marks the observations whose outcome is unknown
outputvar <- c("Yes", "NO", "Yes", "NO", "", "")
inputvar1 <- c("M", "M", "F", "F", "M", "F")
inputvar2 <- c(34, 35, 45, 60, 34, 23)  # ages as numbers, not strings

# build the data frame directly; data.frame(cbind(...)) would coerce
# everything, including the ages, to character first
data <- data.frame(outputvar, inputvar1, inputvar2, stringsAsFactors = TRUE)

# turn "" into NA so it is not treated as a third class
data$outputvar <- factor(data$outputvar, exclude = "")

# random 70/30 train/test split
ind0 <- sample(2, nrow(data), replace = TRUE, prob = c(0.7, 0.3))
train0 <- data[ind0 == 1, ]
test0  <- data[ind0 == 2, ]

fit1 <- randomForest(outputvar ~ ., data = train0, na.action = na.exclude)
print(fit1)
plot(fit1)
p1 <- predict(fit1, train0)
fit1$confusion

p2 <- predict(fit1, test0)

t <- table(prediction = p2, actual = test0$outputvar)
t

The above code runs perfectly. The data frame shown here is only a sample, since I am not allowed to share the original data.

As you can see, I have split my data 70/30 into training and test sets; in my case this gives 3293 training observations and 1377 test observations.

When I calculate the confusion matrix for the test set, I find that it covers only 1363 observations; 14 observations are left out.

Also, I inspected the table of predictions for the test set. All the NAs are replaced with Yes or NO.

My question is: why does my confusion matrix contain fewer observations than the test set?

Are the Yes and No values that replaced those NAs in my prediction table real predictions?

I am new to R, and any information would be helpful.


1 Answer


You seem a little confused regarding several elementary issues here...

To start with, training data with the dependent variable missing (here outputvar) make no sense; if we don't have the actual outcome for a sample, we cannot use it for training, and we should simply remove it from the training set (save for some rather extreme approaches, where one tries to impute such samples before feeding them to the classifier).

Second, although you seem to imply (kind of...) that the 2 samples with missing outputvar are the unknown samples you are trying to predict, in practice (i.e. in your code) you are not using them as such: since the sample function you use to split your data into training & test subsets is random, it can easily happen that at least one (or even both) of these 2 samples ends up in your training set, where of course it will be of no use.

Third, even if in some runs you end up indeed with these 2 samples in your test set, you cannot of course calculate any confusion matrix, since you do need the ground truth (real labels) for doing so.

All in all, data samples without the true label, like your last 2 here, are useful neither for training nor for any kind of evaluation, such as a confusion matrix; they belong in neither the training set nor the test set.
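A minimal sketch of the intended workflow, using the toy data from the question (the names `labeled` and `unlabeled` are my own): split off the rows with a missing label *before* the train/test split, and keep them aside purely for prediction.

```r
outputvar <- c("Yes", "NO", "Yes", "NO", "", "")
inputvar1 <- c("M", "M", "F", "F", "M", "F")
inputvar2 <- c(34, 35, 45, 60, 34, 23)
data <- data.frame(outputvar, inputvar1, inputvar2, stringsAsFactors = TRUE)
data$outputvar <- factor(data$outputvar, exclude = "")  # "" becomes NA

labeled   <- data[!is.na(data$outputvar), ]  # usable for train/test
unlabeled <- data[ is.na(data$outputvar), ]  # to be predicted only

# train/test split is now over labeled rows only
ind <- sample(2, nrow(labeled), replace = TRUE, prob = c(0.7, 0.3))
train <- labeled[ind == 1, ]
test  <- labeled[ind == 2, ]
# fit on `train`, evaluate on `test`, and finally call
# predict(fit, unlabeled) for the samples with unknown outcome
```

This way the 2 unknown samples can never leak into the training set, and the confusion matrix is computed only over rows that actually have a true label.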

The above code runs perfectly

Not always; due to the random nature of the sample function, you may easily end up with train/test splits that make the classifier impossible to run:

> source('~/.active-rstudio-document')  # your code verbatim
Error in randomForest.default(m, y, ...) : 
  Need at least two classes to do classification.
> train0
  outputvar inputvar1 inputvar2
1       Yes         M        34
5      <NA>         M        34

Try re-running the code yourself several times to see this (since no random seed is set, each run will in principle be different; even the lengths of your training & test sets will vary between runs!).
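If you want the split (and hence the error, or its absence) to be reproducible between runs, fix the RNG state before calling sample; the value 1234 below is just an arbitrary choice:

```r
set.seed(1234)  # any fixed integer makes the split reproducible
ind0 <- sample(2, 6, replace = TRUE, prob = c(0.7, 0.3))
ind0  # the same vector of 1s and 2s on every run with this seed
```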

When I am calculating my Confusion matrix for test data set, I could find that it has calculated only for 1363 observations and 14 observations are left.

Given what you have shown as a sample, a good guess here is that you do not have the true labels for these 14 observations. And since the confusion matrix comes from a comparison of the predictions versus the actual labels, when the latter are missing the comparison is impossible, and these samples are naturally omitted from the confusion matrix.
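This is easy to verify in isolation: table() silently drops pairs where either entry is NA (its default is useNA = "no"), which is exactly why test samples with missing true labels vanish from the matrix:

```r
pred   <- factor(c("Yes", "NO", "Yes"), levels = c("NO", "Yes"))
actual <- factor(c("Yes", NA,   "NO"), levels = c("NO", "Yes"))

table(prediction = pred, actual = actual)       # counts only 2 of the 3 pairs
sum(table(prediction = pred, actual = actual))  # 2, not 3

# passing useNA = "ifany" makes the dropped pair visible as an <NA> column
table(prediction = pred, actual = actual, useNA = "ifany")
```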

Also, I visualised the table for the predicted matrix with test data set. All those NA are replaced with Yes or NO.

It is not quite clear what exactly you mean here; but if you mean that you ran predict on your test set and did not get any NAs in the predictions, this is exactly as expected. As explained above, the "missing entries" in your confusion matrix are not due to missing predictions, but due to missing true labels.