Using the answer to Spark 1.5.1, MLLib Random Forest Probability, I was able to train a random forest using ml.classification.RandomForestClassifier, and process a holdout dataframe with the trained random forest.
The problem I have is that I would like to save this trained random forest to process any dataframe (with the same features as the training set) in the future.
The classification example on this page uses mllib.tree.model.RandomForestModel. It shows how to save the trained forest, but to the best of my understanding the model can only be trained on (and later applied to) a LabeledPoint RDD. The issue I have with the LabeledPoint RDD is that it can only contain the label double and the features vector, so I would lose all the non-label/non-feature columns that I need for other purposes.
So I guess I need a way to either save the result of ml.classification.RandomForestClassifier, or construct a LabeledPoint RDD that can retain columns other than the label and features required by the forest trained through mllib.tree.model.RandomForestModel.
Does anyone know why both the ML and MLlib libraries exist, rather than just one of them?
Many thanks for reading my question, and thanks in advance for any solutions/suggestions.