Difference between org.apache.spark.ml.classification and org.apache.spark.mllib.classification

Question

I'm writing a Spark application and would like to use the algorithms in MLlib. In the API docs I found two different classes for the same algorithm.

For example, there is a LogisticRegression in org.apache.spark.ml.classification and also a LogisticRegressionWithSGD in org.apache.spark.mllib.classification.

The only difference I can find is that the one in org.apache.spark.ml inherits from Estimator and can be used in cross-validation. I'm confused about why they are placed in different packages. Does anyone know the reason? Thanks!

yjshen · Accepted Answer · 2015-05-14T07:44:35

The reason is covered in a JIRA ticket.

And from the design doc:

MLlib now covers a basic selection of machine learning algorithms, e.g., logistic regression, decision trees, alternating least squares, and k-means. The current set of APIs contains several design flaws that prevent us moving forward to address practical machine learning pipelines, make MLlib itself a scalable project.

The new set of APIs will live under org.apache.spark.ml, and o.a.s.mllib will be deprecated once we migrate all features to o.a.s.ml.
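To make the Estimator point from the question concrete, here is a minimal sketch of cross-validating the org.apache.spark.ml LogisticRegression. It assumes Spark 2.x or later (where ML vectors live in org.apache.spark.ml.linalg) and a local master; the toy data, the parameter grid, and the helper name `crossValidatedMetrics` are made up for illustration and are not from the design doc. The mllib classes, by contrast, train on an RDD of LabeledPoint and do not plug into CrossValidator.

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
import org.apache.spark.sql.SparkSession

object MlCrossValidationSketch {

  // Hypothetical helper: fits a cross-validated logistic regression on toy
  // data and returns the average metric for each candidate parameter setting.
  def crossValidatedMetrics(spark: SparkSession): Array[Double] = {
    // Toy data, invented for illustration: (label, features)
    val rows = (0 until 40).map { i =>
      val label = if (i % 2 == 0) 0.0 else 1.0
      (label, Vectors.dense(label * 2.0 + (i % 5) * 0.1, (i % 3) * 0.5))
    }
    val training = spark.createDataFrame(rows).toDF("label", "features")

    // LogisticRegression is an Estimator, so it can be handed to CrossValidator.
    val lr = new LogisticRegression().setMaxIter(10)

    // Two candidate regularization values (an arbitrary choice).
    val paramGrid = new ParamGridBuilder()
      .addGrid(lr.regParam, Array(0.01, 0.1))
      .build()

    val cv = new CrossValidator()
      .setEstimator(lr)
      .setEvaluator(new BinaryClassificationEvaluator())
      .setEstimatorParamMaps(paramGrid)
      .setNumFolds(3)

    // fit() trains on each fold for each parameter setting and keeps the best model.
    cv.fit(training).avgMetrics
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ml-vs-mllib-sketch")
      .master("local[2]") // assumption: running locally
      .getOrCreate()
    try {
      println(crossValidatedMetrics(spark).mkString(", "))
    } finally {
      spark.stop()
    }
  }
}
```

The returned array has one entry per parameter combination in the grid, so with the grid above it has two entries.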
