0 votes

I'm new to Mahout-Samsara and I'm trying to understand the "domain" of the different projects and how they relate to each other. I understand that Apache Mahout-Samsara deprecates many of the old MapReduce algorithms, and that things will now be built on engines like Apache Flink, Spark, or H2O (based on the introduction of the "Apache Mahout: Beyond MapReduce" book).

I want to try some recommender algorithms, but I'm not sure what's new and what's deprecated. I see links referring to spark-rowsimilarity and spark-itemsimilarity (I don't understand whether these describe an off-the-shelf algorithm or a design... probably a design, since they aren't listed at mahout.apache.org/users/basics/algorithms.html... anyway...).

And at the same time, Apache Flink (or is it Spark MLlib?) implements the ALS algorithm for recommendation (Machine Learning for Flink and Spark MLlib).

General questions:

  • Are the algorithms on mahout.apache.org deprecated and being migrated to Flink / Spark MLlib, so that the ML libraries and support in Flink / Spark MLlib will grow?

  • Is Flink / Spark MLlib intended to be more of an engine, or an engine plus an algorithm library with good support for the algorithms?

Specific question:

  • I want to try a recommender algorithm as a 'gray box' (part 'black box' because I don't want to get too deep into the math; part 'white box' because I want to tweak the model and the math as much as I need to improve results).

  • I'm not interested in other ML algorithms yet. I thought about starting with what's off-the-shelf and then modifying the ALS implementation in MLlib (see the sketch after this list). Would that be a good approach? Any other suggestions?
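
For reference, this is the kind of off-the-shelf starting point I have in mind: a minimal sketch against the RDD-based MLlib ALS API. The file name, rating format, and hyperparameters here are placeholders I made up, not anything MLlib prescribes.

```scala
// Minimal sketch: off-the-shelf ALS from Spark MLlib (RDD-based API).
// "ratings.csv" and its userId,itemId,rating layout are assumptions.
import org.apache.spark.mllib.recommendation.{ALS, Rating}
import org.apache.spark.{SparkConf, SparkContext}

object AlsSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("als-sketch"))

    // One "userId,itemId,rating" triple per line.
    val ratings = sc.textFile("ratings.csv").map { line =>
      val Array(user, item, rating) = line.split(',')
      Rating(user.toInt, item.toInt, rating.toDouble)
    }

    // Positional args: rank, iterations, lambda (regularization).
    // These are the knobs I'd expect to tweak later.
    val model = ALS.train(ratings, 10, 10, 0.01)

    // Top-5 recommendations for one user, as a quick smoke test.
    model.recommendProducts(1, 5).foreach(println)

    sc.stop()
  }
}
```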

2 Answers

1 vote

I've been working on ML on Flink for a while now, I've done my fair share of scouting, and I keep an eye on what is going on in this ecosystem. What you're asking implies a rational coordination between projects that simply doesn't exist. Algorithms get reimplemented over and over, and from what I see it's easier to do that than to integrate with different frameworks. Samsara is actually one of the most portable solutions, but it's good for only a few applications.

Are the algorithms on mahout.apache.org deprecated and being migrated to Flink / Spark MLlib, so that the ML libraries and support in Flink / Spark MLlib will grow?

This, as I said, would require a coordination between projects that simply isn't there.

Is Flink / Spark MLlib intended to be more of an engine, or an engine plus an algorithm library with good support for the algorithms?

They should be the former in an ideal ecosystem, but they will keep building their own ML libraries for commercial reasons: compute engines with ML libraries out of the box sell really well. Actually, I'm working full time on Flink ML not because I believe it's necessarily the best way to do ML on Flink, but because, right now, it's something Flink needs in order to be sold in many environments.

@pferrel suggested PredictionIO, which is excellent software, but there are many alternatives under development: for example, Beam is designing a machine learning API to generalize over the implementations of its different runners (Flink, Spark, H2O, and so on). Other alternatives are data analysis platforms like KNIME, RapidMiner and others, which can build pipelines over Spark or other Big Data tools.

1 vote

spark-itemsimilarity and spark-rowsimilarity are command-line accessible drivers. They are based on classes in Mahout-Samsara. The pages describing them are for running code that has been supported since v0.10.0.

The link https://mahout.apache.org/users/basics/algorithms.html shows which algorithms are supported on which compute engine. Anything in the "MapReduce" column is in line for deprecation.

That said, Mahout-Samsara is less a collection of algorithms than pre-0.10.0 Mahout was. It now has an R-like DSL, which includes generalized tensor math, from which most of the Mahout-Samsara algorithms have been built. So think of Mahout as a "roll-your-own math and algorithms" tool, where everything you build is scalable on your choice of compute engine. The engines themselves are also available natively, so you don't have to use only the abstracted DSL.
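
To make the "roll-your-own" point concrete, here is a minimal sketch of the DSL on the Spark binding; the context setup and the toy matrix are just illustrative assumptions, not a prescribed recipe.

```scala
// Minimal sketch of the Samsara R-like DSL running on the Spark backend.
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.scalabindings.RLikeOps._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.sparkbindings._

object DslSketch extends App {
  // Distributed context backed by Spark; swapping the binding changes engines.
  implicit val ctx = mahoutSparkContext("local[2]", "samsara-sketch")

  // Small in-core matrix built with the dense() helper.
  val a = dense((1.0, 2.0), (3.0, 4.0), (5.0, 6.0))

  // Wrap it as a distributed row matrix (DRM) and do engine-agnostic math.
  val drmA = drmParallelize(a, numPartitions = 2)
  val drmAtA = drmA.t %*% drmA   // A' * A, computed on the chosen backend

  // Pull the small result back in-core to inspect it.
  println(drmAtA.collect)

  ctx.close()
}
```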

Regarding how Mahout-Samsara relates to MLlib or any other algorithm library: there will be overlap, and either can be used in your code interchangeably.

Regarding recommenders, the new SimilarityAnalysis.cooccurrence implements a major innovation called cross-occurrence, which allows a recommender to ingest almost anything known about a user or a user's context, and even accounts for item-content similarity. The Mahout-Samsara part is the engine for Correlated Cross-Occurrence. See some slides describing the algorithm here: http://www.slideshare.net/pferrel/unified-recommender-39986309
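
If you'd rather call this from Scala than through the spark-itemsimilarity driver, a hedged sketch looks roughly like the following; how you load the IndexedDatasets, and the exact defaults and return shape of cooccurrencesIDSs, should be checked against the Mahout version you use.

```scala
// Hedged sketch of driving Correlated Cross-Occurrence from Scala.
// SimilarityAnalysis lives in org.apache.mahout.math.cf (Mahout 0.10+);
// loading the IndexedDatasets is left out and assumed to happen elsewhere.
import org.apache.mahout.math.cf.SimilarityAnalysis
import org.apache.mahout.math.indexeddataset.IndexedDataset

// primaryActions: e.g. purchases; secondaryActions: e.g. detail-page views.
def ccoIndicators(primaryActions: IndexedDataset,
                  secondaryActions: IndexedDataset): List[IndexedDataset] = {
  // The first dataset is the conversion action the recommender optimizes for;
  // each additional dataset is tested (LLR) for cross-occurrence against it.
  SimilarityAnalysis.cooccurrencesIDSs(Array(primaryActions, secondaryActions))
  // Returned list: head = item-item cooccurrence indicators for the primary
  // action, tail = cross-occurrence indicators for each secondary action.
}
```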

There is a full, end-to-end implementation of this using the PredictionIO framework (PIO itself is now a proposed Apache incubator project). It is mature and can be installed using these instructions: https://github.com/actionml/cluster-setup/blob/master/install.md