I have a dataset with 4669 observations and 15 variables.
I am using a random forest to predict whether a particular product will be accepted or not.
In my latest data, the output variable takes the values "Yes", "NO" and "" (empty).
I want to predict whether each of these "" rows should be Yes or NO.
I am using the following code:
library(randomForest)
outputvar <- c("Yes", "NO", "Yes", "NO", "" , "" )
inputvar1 <- c("M", "M", "F", "F", "M", "F")
inputvar2 <- c("34", "35", "45", "60", "34", "23")
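# Build the data frame directly; cbind() would coerce every column to character
data <- data.frame(outputvar, inputvar1 = factor(inputvar1), inputvar2 = as.numeric(inputvar2))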
data$outputvar <- factor(data$outputvar, exclude = "")
ind0 <- sample(2, nrow(data), replace = TRUE, prob = c(0.7,0.3))
train0 <- data[ind0==1, ]
test0 <- data[ind0==2, ]
fit1 <- randomForest(outputvar~., data=train0, na.action = na.exclude)
print(fit1)
plot(fit1)
p1 <- predict(fit1, train0)
fit1$confusion
p2 <- predict(fit1, test0)
t <- table(prediction = p2, actual = test0$outputvar)
t
The code above runs without errors. The data frame shown is only a sample, since I am not allowed to share the original data.
As you can see, I split the data into training and test sets at roughly 70% and 30%. In my run, the test set has 1377 observations and the training set has 3293.
When I compute the confusion matrix for the test set, it covers only 1363 observations; the other 14 are left out.
I also inspected the predictions for the test set. Every row whose actual value is NA has a predicted value of Yes or NO.
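One way to check where the 14 rows go is to compare the table totals with and without NA counting. This is only a minimal sketch with made-up vectors standing in for p2 and test0$outputvar (the names predicted and actual and their values are hypothetical). It assumes the gap comes from rows whose actual value is NA, which I have not confirmed on the original data. In base R, table() leaves out NA values by default, and useNA = "ifany" keeps them.

```r
# Made-up stand-ins for p2 and test0$outputvar (hypothetical values)
predicted <- factor(c("Yes", "NO", "Yes", "NO", "Yes"), levels = c("NO", "Yes"))
actual    <- factor(c("Yes", "NO", NA, "NO", NA), levels = c("NO", "Yes"))

# Default table() drops every pair in which either value is NA
tab_default <- table(prediction = predicted, actual = actual)

# useNA = "ifany" adds an NA column, so every row is counted
tab_full <- table(prediction = predicted, actual = actual, useNA = "ifany")

sum(tab_default)    # rows with a known actual value
sum(tab_full)       # all rows, including those with NA as the actual value
sum(is.na(actual))  # rows with no actual value
```

If the difference between the two totals equals sum(is.na(test0$outputvar)) on the real data, the missing rows would be exactly the ones with an empty output value.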
My question is: why does the confusion matrix cover fewer observations than the test set?
And are the Yes and NO values predicted for those NA rows real predictions?
I am new to R, and any information would be helpful.