Queries in splitting train and test data in Random Forest

Question

I have a data frame with 15 variables and 4669 observations.

I am using random forest for modelling. My goal is to predict whether a particular product will be accepted by the customer or not.

So, my output variable has factors of "Yes", "No" and "".

My question is: is it possible for me to predict this "" as Yes or No with random forest?

Sample data looks like this:

Outputvar <- c("Yes", "Yes", "No", "NO", "", "")
Inputvar1 <- c("M", "F", "F", "M", "F", "M")
Inputvar2 <- c("34","25","40","50","60","34")
data <- data.frame(cbind(Outputvar,Inputvar2,Inputvar1))

I am new to R, so if my understanding is wrong, could anyone explain what could be done?

EDIT: this is the code I have tried so far:

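library(randomForest)  # package name is case-sensitive
# exclude = NULL keeps NA as a factor level
data$Outputvar <- factor(data$Outputvar, exclude = NULL)
# Randomly assign each row to train (~70%) or test (~30%)
ind0 <- sample(2, nrow(data), replace = TRUE, prob = c(0.7, 0.3))
train0 <- data[ind0 == 1, ]
test0 <- data[ind0 == 2, ]
# Column names are case-sensitive: the outcome column is Outputvar
fit1 <- randomForest(Outputvar ~ ., data = train0)
print(fit1)
plot(fit1)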

EDIT2: class counts are NO: 3536, Yes: 1061, "": 72

Try to add: data$Outputvar <- factor(data$Outputvar, exclude=NULL) — Marco Sandri
@MrSmithGoesToWashington Actually, my question was: is it possible for me to predict those nulls as Yes or No with random forest? — Mikz
Be careful, you have "No" and "NO" categories in your data$Outputvar. You should correct this issue. — Marco Sandri

desertnaut · Accepted Answer · 2018-02-22T00:20:50

My target from my data set is to predict is a particular product will be accepted by the customer or not.

so, my output variable has factors of "Yes", "No" and "".

Well, no. The actual context here is:

Your output variable has only two levels, "Yes" and "No"; for part of your available dataset you don't have the value of the outcome (""), and that is what you want to predict.
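To make that framing concrete, here is a minimal sketch in base R that treats "" as a missing outcome instead of a third class. It reuses the toy values from the question (with Inputvar2 stored as numbers, which is my assumption); the names labelled and unlabelled are only illustrative.

```r
# Toy data mirroring the question's sample (illustrative values only)
Outputvar <- c("Yes", "Yes", "No", "NO", "", "")
Inputvar1 <- c("M", "F", "F", "M", "F", "M")
Inputvar2 <- c(34, 25, 40, 50, 60, 34)  # assumption: age-like numeric input
data <- data.frame(Outputvar, Inputvar1, Inputvar2, stringsAsFactors = FALSE)

# Treat the empty string as a missing outcome rather than a class
data$Outputvar[data$Outputvar == ""] <- NA

# Merge inconsistent spellings such as "NO" into "No"
is_no <- !is.na(data$Outputvar) & toupper(data$Outputvar) == "NO"
data$Outputvar[is_no] <- "No"

# Two-level factor; the NA rows are the ones to predict later
data$Outputvar <- factor(data$Outputvar, levels = c("No", "Yes"))
data$Inputvar1 <- factor(data$Inputvar1)

# Hypothetical names for the two parts of the dataset
labelled   <- data[!is.na(data$Outputvar), ]
unlabelled <- data[is.na(data$Outputvar), ]

print(table(data$Outputvar, useNA = "ifany"))
```

After this step the outcome has exactly two levels, and the rows with an unknown outcome are kept apart from anything used for fitting.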

My question is, Is it possible for me to predict this "" , as Yes or No in random Forest ?

In principle, yes - this is exactly what classifiers, such as Random Forest, are made for. Very generally speaking, you need to train your model using only the samples for which the outcome (Yes/No) is actually available (you may hold out part of these labelled samples as a test set, in order to evaluate your model performance); after that, you can use predict() on the rest of your dataset to predict the missing outcomes.
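A rough sketch of that workflow, assuming the randomForest package is installed. The data below is synthetic and every name in it (labelled, unlabelled, train, test, Predicted) is hypothetical, as is the 70/30 split, so treat it as an outline rather than a recipe for the real 4669-row dataset.

```r
library(randomForest)  # assumes the randomForest package is installed

set.seed(42)

# Synthetic stand-in for the real data; names and values are hypothetical
n <- 300
data <- data.frame(
  Inputvar1 = factor(sample(c("M", "F"), n, replace = TRUE)),
  Inputvar2 = round(runif(n, 18, 70))
)
# Outcome loosely tied to Inputvar2 so the model has something to learn
noisy <- data$Inputvar2 + rnorm(n, sd = 10)
data$Outputvar <- factor(ifelse(noisy > 45, "Yes", "No"), levels = c("No", "Yes"))
# Pretend the outcome is unknown for the last 30 rows
data$Outputvar[(n - 29):n] <- NA

# Only rows with a known outcome take part in training and evaluation
labelled   <- data[!is.na(data$Outputvar), ]
unlabelled <- data[is.na(data$Outputvar), ]

# 70/30 split of the labelled rows only
train_idx <- sample(nrow(labelled), size = floor(0.7 * nrow(labelled)))
train <- labelled[train_idx, ]
test  <- labelled[-train_idx, ]

fit <- randomForest(Outputvar ~ ., data = train)

# Evaluate on the held-out labelled rows
test_pred <- predict(fit, newdata = test)
print(table(predicted = test_pred, actual = test$Outputvar))

# Predict Yes/No for the rows whose outcome was missing
predictors <- c("Inputvar1", "Inputvar2")
unlabelled$Predicted <- predict(fit, newdata = unlabelled[, predictors])
```

The confusion table on the test rows gives a first idea of how far the predictions for the unlabelled rows can be trusted.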

Of course, this is just a 4-line summarization of a composite process, which involves many steps and sub-steps that cannot be analyzed in detail here, but hopefully gives you a (very) high level view of the issue (which, arguably, is what you are asking). My answer to your other relevant question should also be useful.
