Using the answer to "Spark 1.5.1, MLLib Random Forest Probability", I was able to train a random forest using ml.classification.RandomForestClassifier and score a holdout DataFrame with the trained forest.

The problem is that I would like to save this trained random forest so that I can apply it in the future to any DataFrame with the same features as the training set.

The classification example on this page uses mllib.tree.model.RandomForestModel. It shows how to save the trained forest but, to the best of my understanding, that forest can only be trained on (and later applied to) a LabeledPoint RDD. The issue I have with a LabeledPoint RDD is that it can only contain the label (a double) and the features vector, so I would lose all the non-label/non-feature columns that I need for other purposes.

So I guess I need a way either to save the result of ml.classification.RandomForestClassifier, or to construct a LabeledPoint RDD that can retain columns other than the label and features required by the forest trained through mllib.tree.model.RandomForestModel.

Does anyone know why both the ML and MLlib libraries exist, rather than just one of them?

Many thanks for reading my question, and thanks in advance for any solutions/suggestions.

1 Answer

I'll just reuse what the Spark programming guide says:

The spark.ml package aims to provide a uniform set of high-level APIs built on top of DataFrames that help users create and tune practical machine learning pipelines.

In Spark, the core abstraction is the RDD. There is an excellent paper on that topic if you are interested; I can add the link to it later.

Then came MLlib, which was an independent library at first and was later absorbed into the Spark project. Nevertheless, all the machine learning algorithms in Spark are written on RDDs.

Then the DataFrame abstraction was added to the project, and with it came the need for a more practical way of building machine learning applications, one that includes transformers, evaluators and, most importantly, pipelines.

Data engineers and data scientists should not need to study the underlying technology in order to use it. Hence the abstraction.

You can use both, but keep in mind that all the algorithms you use from ML are implemented in MLlib and then wrapped in an abstraction for easier use.
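As for saving the forest itself, here is a minimal sketch of both routes rather than a definitive recipe. It assumes Spark 2.x, since as far as I'm aware the spark.ml random forest models only gained `write`/`load` in 2.0, so the first route would not work on 1.5.1. It also assumes a local master and made-up toy data. The object name `SaveForestSketch`, the `id` column (standing in for your extra columns) and the temp paths are purely illustrative.

```scala
import java.nio.file.Files

import org.apache.spark.ml.classification.{RandomForestClassificationModel, RandomForestClassifier}
import org.apache.spark.ml.linalg.{Vectors => MLVectors}
import org.apache.spark.mllib.linalg.{Vectors => MLlibVectors}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.model.RandomForestModel
import org.apache.spark.sql.SparkSession

object SaveForestSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("save-forest-sketch").master("local[*]").getOrCreate()
    import spark.implicits._

    // Hypothetical toy rows: (id, label, features). "id" stands in for the
    // non-label/non-feature columns the question wants to keep.
    val rows = Seq(
      (1L, 0.0, Array(0.0, 1.0)),
      (2L, 1.0, Array(1.0, 0.0)),
      (3L, 0.0, Array(0.1, 0.9)),
      (4L, 1.0, Array(0.9, 0.2))
    )
    // Fresh temp directory, so neither save call hits an already existing path.
    val dir = Files.createTempDirectory("rf-sketch")

    // Route 1 (spark.ml, assumed Spark 2.x): transform() appends its output
    // columns to the DataFrame, so "id" is still there afterwards.
    val df = rows.map { case (id, label, xs) => (id, label, MLVectors.dense(xs)) }
      .toDF("id", "label", "features")
    val mlPath = dir.resolve("ml-forest").toString
    new RandomForestClassifier().setNumTrees(10).fit(df).write.save(mlPath)
    val mlModel = RandomForestClassificationModel.load(mlPath)
    mlModel.transform(df).select("id", "probability", "prediction").show()

    // Route 2 (spark.mllib): keep each LabeledPoint in a pair keyed by "id",
    // train on the values only, and predict with mapValues so the key survives.
    val keyed = spark.sparkContext.parallelize(rows).map { case (id, label, xs) =>
      (id, LabeledPoint(label, MLlibVectors.dense(xs)))
    }
    val mllibPath = dir.resolve("mllib-forest").toString
    RandomForest
      .trainClassifier(keyed.values, 2, Map[Int, Int](), 10, "auto", "gini", 4, 32)
      .save(spark.sparkContext, mllibPath)
    val mllibModel = RandomForestModel.load(spark.sparkContext, mllibPath)
    keyed.mapValues(lp => mllibModel.predict(lp.features)).collect().foreach(println)

    spark.stop()
  }
}
```

The key in route 2 does not have to be a single id. Any serializable value, such as a tuple or case class holding the other columns, can ride along in the same position.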