0 votes

I'm new to Mahout-Samsara and I'm trying to understand the "domain" of the different projects and how they relate to each other. I understand that Apache Mahout-Samsara deprecates many of the old MapReduce algorithms, and that things will now be built on engines like Apache Flink, Spark, or H2O (based on the introduction of the "Apache Mahout: Beyond MapReduce" book).

I want to try some recommender algorithms, but I'm not sure what's new and what's deprecated. I see links referring to spark-rowsimilarity and spark-itemsimilarity (I don't understand whether these describe an off-the-shelf algorithm or a design... probably a design, since they aren't listed at mahout.apache.org/users/basics/algorithms.html... anyway...).

And at the same time, Apache Flink (or is it Spark MLlib?) implements the ALS algorithm for recommendation (Machine Learning for Flink and Spark MLlib).

General questions:

  • Are the algorithms on mahout.apache.org deprecated and being migrated to Flink / Spark MLlib, so that the ML libraries and support in Flink / Spark MLlib will grow?

  • Is Flink / Spark MLlib intended to be more of an engine, or an engine plus an algorithm library with good support for the algorithms?

Specific question:

  • I want to try a recommender algorithm as a 'gray box' (part 'black box' because I don't want to get too deep into the math; part 'white box' because I want to tweak the model and the math as much as I need to improve results).

  • I'm not interested in other ML algorithms yet. I thought about starting with what's off-the-shelf and then modifying the ALS implementation in MLlib (see the sketch after this list). Would that be a good approach? Any other suggestions?
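
For reference, this is the kind of off-the-shelf starting point I have in mind: a minimal sketch against the RDD-based MLlib ALS API. The file name, rating format, and hyperparameters here are placeholders I made up, not anything MLlib prescribes.

```scala
// Minimal sketch: off-the-shelf ALS from Spark MLlib (RDD-based API).
// "ratings.csv" and its userId,itemId,rating layout are assumptions.
import org.apache.spark.mllib.recommendation.{ALS, Rating}
import org.apache.spark.{SparkConf, SparkContext}

object AlsSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("als-sketch"))

    // One "userId,itemId,rating" triple per line.
    val ratings = sc.textFile("ratings.csv").map { line =>
      val Array(user, item, rating) = line.split(',')
      Rating(user.toInt, item.toInt, rating.toDouble)
    }

    // Positional args: rank, iterations, lambda (regularization).
    // These are the knobs I'd expect to tweak later.
    val model = ALS.train(ratings, 10, 10, 0.01)

    // Top-5 recommendations for one user, as a quick smoke test.
    model.recommendProducts(1, 5).foreach(println)

    sc.stop()
  }
}
```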

2 Answers

1 vote

I've been working on ML on Flink for a while now, I've done my fair share of scouting, and I keep an eye on what is going on in this ecosystem. What you're asking implies a rational coordination between projects that simply doesn't exist. Algorithms get reimplemented over and over, and from what I see it's easier to do that than to integrate with different frameworks. Samsara is actually one of the most portable solutions, but it's good for only a few applications.

Are the algorithms on mahout.apache.org deprecated and being migrated to Flink / Spark MLlib, so that the ML libraries and support in Flink / Spark MLlib will grow?

This, as I said, would require a coordination between projects that simply isn't there.

Is Flink / Spark MLlib intended to be more of an engine, or an engine plus an algorithm library with good support for the algorithms?

They should be the former in an ideal ecosystem, but they will keep building their own ML libraries for commercial reasons: compute engines with ML libraries out of the box sell really well. Actually, I'm working full time on Flink ML not because I believe it's necessarily the best way to do ML on Flink, but because, right now, it's something Flink needs in order to be sold in many environments.

@pferrel suggested PredictionIO, which is excellent software, but there are many alternatives under development: for example, Beam is designing a machine learning API to generalize over the implementations of its different runners (Flink, Spark, H2O, and so on). Other alternatives are data analysis platforms like KNIME, RapidMiner and others, which can build pipelines over Spark or other Big Data tools.

1 vote

spark-itemsimilarity and spark-rowsimilarity are command-line accessible drivers. They are based on classes in Mahout-Samsara. The pages describing them are for running code that has been supported since v0.10.0.

The link https://mahout.apache.org/users/basics/algorithms.html shows which algorithms are supported on which compute engine. Anything in the "MapReduce" column is in line for deprecation.

That said, Mahout-Samsara is less a collection of algorithms than pre-0.10.0 Mahout was. It now has an R-like DSL, which includes generalized tensor math, from which most of the Mahout-Samsara algorithms have been built. So think of Mahout as a "roll-your-own math and algorithms" tool, where everything you build is scalable on your choice of compute engine. The engines themselves are also available natively, so you don't have to use only the abstracted DSL.
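
To make the "roll-your-own" point concrete, here is a minimal sketch of the DSL on the Spark binding; the context setup and the toy matrix are just illustrative assumptions, not a prescribed recipe.

```scala
// Minimal sketch of the Samsara R-like DSL running on the Spark backend.
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.scalabindings.RLikeOps._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.sparkbindings._

object DslSketch extends App {
  // Distributed context backed by Spark; swapping the binding changes engines.
  implicit val ctx = mahoutSparkContext("local[2]", "samsara-sketch")

  // Small in-core matrix built with the dense() helper.
  val a = dense((1.0, 2.0), (3.0, 4.0), (5.0, 6.0))

  // Wrap it as a distributed row matrix (DRM) and do engine-agnostic math.
  val drmA = drmParallelize(a, numPartitions = 2)
  val drmAtA = drmA.t %*% drmA   // A' * A, computed on the chosen backend

  // Pull the small result back in-core to inspect it.
  println(drmAtA.collect)

  ctx.close()
}
```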

Regarding how Mahout-Samsara relates to MLlib or any other algorithm library: there will be overlap, and either can be used in your code interchangeably.

Regarding recommenders, the new SimilarityAnalysis.cooccurrence implements a major innovation called cross-occurrence, which allows a recommender to ingest almost anything known about a user or a user's context, and even accounts for item-content similarity. The Mahout-Samsara part is the engine for Correlated Cross-Occurrence. See some slides describing the algorithm here: http://www.slideshare.net/pferrel/unified-recommender-39986309
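
If you'd rather call this from Scala than through the spark-itemsimilarity driver, a hedged sketch looks roughly like the following; how you load the IndexedDatasets, and the exact defaults and return shape of cooccurrencesIDSs, should be checked against the Mahout version you use.

```scala
// Hedged sketch of driving Correlated Cross-Occurrence from Scala.
// SimilarityAnalysis lives in org.apache.mahout.math.cf (Mahout 0.10+);
// loading the IndexedDatasets is left out and assumed to happen elsewhere.
import org.apache.mahout.math.cf.SimilarityAnalysis
import org.apache.mahout.math.indexeddataset.IndexedDataset

// primaryActions: e.g. purchases; secondaryActions: e.g. detail-page views.
def ccoIndicators(primaryActions: IndexedDataset,
                  secondaryActions: IndexedDataset): List[IndexedDataset] = {
  // The first dataset is the conversion action the recommender optimizes for;
  // each additional dataset is tested (LLR) for cross-occurrence against it.
  SimilarityAnalysis.cooccurrencesIDSs(Array(primaryActions, secondaryActions))
  // Returned list: head = item-item cooccurrence indicators for the primary
  // action, tail = cross-occurrence indicators for each secondary action.
}
```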

There is a full, end-to-end implementation of this using the PredictionIO framework (PIO itself is now a proposed Apache incubator project). It is mature and can be installed using these instructions: https://github.com/actionml/cluster-setup/blob/master/install.md