
Let's say I have a classification problem, and want to use the randomForest package in R to solve this.

In my training set I want to add a third variable, var3, which is the product of var1 and var2. Thus, the training set would consist of var1, var2, var3 and the class variable.

However, this var3 is not present in the test set, and consequently, randomForest will give an error when I want to test my model:

random forest Error in eval(expr, envir, enclos) : object var3 not found

Is there any way to get around this?


1 Answer


Use the formula interface to build the product in the model specification:

    class_var ~ var1 + var2 + I(var1 * var2)

Wrapping the product in `I()` makes R compute the arithmetic product instead of treating `*` as the formula interaction operator. In a formula, a bare `var1 * var2` expands to `var1 + var2 + var1:var2`, and the interaction term is not the numerical product if either variable is a character or factor. When both variables are numeric, the interaction is the product, so `I()` changes very little. There can be problems with this: when the values of var1 and var2 are large, the errors around the estimates may dominate the calculation. I suppose it might be safer to run this:

    class_var ~ var1 + var2 + I(scale(var1 * var2))
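
To see why the formula approach avoids the missing-column error, here is a minimal sketch. The toy data values are made up for illustration; only the column names `var1`, `var2` and `class_var` come from the question. The `model.frame` step mimics what a formula-based `predict` method generally does with `newdata`: it rebuilds the product from `var1` and `var2`, so the test set never needs a `var3` column. The `randomForest` part runs only if the package is installed.

```r
# Toy data: column names follow the question, values are invented.
set.seed(1)
train <- data.frame(var1 = runif(50), var2 = runif(50))
train$class_var <- factor(ifelse(train$var1 * train$var2 > 0.25, "a", "b"))

# The test set has only var1 and var2, no precomputed product column.
test <- data.frame(var1 = c(0.2, 0.9), var2 = c(0.5, 0.8))

# The product is part of the model specification, not of the data.
form <- class_var ~ var1 + var2 + I(var1 * var2)

# Drop the response and rebuild the predictors from the new data.
# The I(var1 * var2) column is computed on the fly from test$var1 and test$var2.
test_frame <- model.frame(delete.response(terms(form)), data = test)
print(test_frame)

# With randomForest available, the same formula serves for fitting and predicting.
if (requireNamespace("randomForest", quietly = TRUE)) {
  fit <- randomForest::randomForest(form, data = train)
  pred <- predict(fit, newdata = test)
  print(pred)
}
```

Because the derived term lives in the formula, the training and test data frames can keep the same two raw columns.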