
Overview

I am classifying documents using the random forest implementation in the ranger package for R.

The problem I am facing is that the system expects every feature in the train set to also be present in the real-time data set. That is not achievable in practice, so I cannot predict classes for real-time text.

Procedure followed

Aim: to predict which class (i.e., OutputClass) a description belongs to.

Each piece of information, such as the description and the features, is converted into a document-term matrix.

Document-term matrix of the train set

                rpm    Velocity    Speed    OutputClass

      doc1       1        0          1      fan
      doc2       1        1          1      fan
      doc3       1        0          1      refrigerator
      doc4       1        1          1      washing machine
      doc5       1        1          1      washing machine

Now train the model using the above matrix

fit <- ranger(trainingColumnNames, data = trainset)
save(fit, file = "C:/TrainedObject.rda")

Next, I use the stored object to predict the class of real-time descriptions.

load("C:/TrainedObject.rda")

Again, construct the document-term matrix for RealTimeData.

                Velocity    Speed    OutputClass

      doc5         0          1      fan
      doc6         1          1      fan
      doc7         0          1      refrigerator
      doc8         1          1      washing machine
      doc9         1          1      washing machine

In the real-time data there is no term or feature named "rpm". So the moment I call the predict function

predict(fit, RealTimeData)

it throws an error saying that rpm is missing.

In practice, it is impossible for the real-time data to contain every term or feature of the train set every time.

I tried both random forest implementations in R (ranger and randomForest), with predict parameters such as newdata, predict.all, and treetype.

None of these parameters made it possible to predict when features are missing from the real-time data.

Could someone please help me solve this issue?

Thanks in advance.

I don't think this is a problem with random forest specifically; I bet it's a problem for all (or nearly all) algorithms. One option would be to train a model each time with only the features you have in your real-time data. Another would be to add the missing features to the real-time data and populate them either with missing values or with the mean of the corresponding column from the training data. There are more complicated solutions, but that's a good place to start. - Tchotchke
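The second option in the comment above could be sketched in base R roughly as follows. The helper name align_to_training is hypothetical, and filling absent terms with 0 is an assumption: in a document-term matrix a term that never occurs has a count of 0, but the training-column mean mentioned in the comment could be passed as fill instead.

```r
# Hypothetical helper: give new data the same feature columns as the train set.
# Assumption: an absent term is filled with `fill` (0 by default).
align_to_training <- function(newdata, train_features, fill = 0) {
  missing_cols <- setdiff(train_features, names(newdata))
  for (col in missing_cols) {
    newdata[[col]] <- rep(fill, nrow(newdata))
  }
  # Keep only the training features, in training order;
  # terms the model never saw are dropped.
  newdata[, train_features, drop = FALSE]
}

# Feature names taken from the train set in the question.
train_features <- c("rpm", "Velocity", "Speed")

RealTimeData <- data.frame(
  Velocity = c(0, 1, 0, 1, 1),
  Speed    = c(1, 1, 1, 1, 1),
  row.names = c("doc5", "doc6", "doc7", "doc8", "doc9")
)

aligned <- align_to_training(RealTimeData, train_features)
```

The aligned data frame could then be passed to predict(fit, data = aligned) in place of RealTimeData. Saving the training feature names alongside the fitted model makes this step possible at prediction time.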

1 Answer


predict expects every feature that was provided to ranger during training. So if a feature is missing from the test set, you can either remove the problematic feature from the train set and run ranger again, or fill in the missing values. For the latter solution, you may want to have a look at the mice package.
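The first route in the answer above, dropping features the real-time data cannot supply and refitting, might look like the sketch below. The data frame is rebuilt from the train set in the question, the names realtime_features, shared, reduced_train, and fit_reduced are illustrative, and the formula OutputClass ~ . is an assumption about how the model was specified. The ranger call is guarded so that the rest runs without the package installed.

```r
# Train set rebuilt from the question (class labels as a factor).
trainset <- data.frame(
  rpm      = c(1, 1, 1, 1, 1),
  Velocity = c(0, 1, 0, 1, 1),
  Speed    = c(1, 1, 1, 1, 1),
  OutputClass = factor(c("fan", "fan", "refrigerator",
                         "washing machine", "washing machine"))
)

# Features that the real-time data is able to supply.
realtime_features <- c("Velocity", "Speed")

# Keep only training features that also exist at prediction time.
shared <- intersect(setdiff(names(trainset), "OutputClass"), realtime_features)
reduced_train <- trainset[, c(shared, "OutputClass"), drop = FALSE]

# Refit on the reduced feature set (only if ranger is available).
if (requireNamespace("ranger", quietly = TRUE)) {
  fit_reduced <- ranger::ranger(OutputClass ~ ., data = reduced_train)
}
```

The trade-off is that any signal carried by the dropped terms is lost, and the model has to be refit whenever the set of available features changes.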