Spark/Mllib Train many GaussianMixture models in a distributed way

Question

I've been playing around with the Gaussian Mixture Models provided for spark/mllib.

I found it really nice to generate a GaussianMixture from an enormous number of vectors/points. However, this is not always the case in ML. Very often you do not need to generate a model from numberless vectors, but to generate a numberless models -each one- from a few vectors (i.e., building a GMM for each user of a database with hundred of users).

At this point, I do not know how to proceed with the mllib, as I cannot see an easy way to distribute in both by users and by data.

Example:

Let featuresByUser = RDD[user, List[Vectors]], 

the natural way to train a GMM for each user might be something like

featuresByUser.mapValues(
    feats => new GaussianMixture.set(nGaussians).run(sc.parallelize(feats))
)

However, it is well-known that this is forbidden in spark. The inside sc.parallelize is not in the driver, so this leads to an error.

So the question are,

should the Mllib methods accept Seq[Vector] as input apart from RDD[Vector] Thus, the programmer could choose one of the other depending on the problem.

Is there any other workaround that I'm missing to deal with this case (using mllib)?

sgvd sgvd · Accepted Answer · 2015-10-23T09:37:12

Mllib unfortunately is currently not meant to create many models, but only one at the time, which was confirmed at a recent Spark meetup in London.

What you can do is launch a separate job for each model in a separate thread in the driver. This is described in the job scheduling documentation. So you would create one RDD per user and run a Gaussian mixture on each, running the 'action' that makes the thing run for each on a separate thread.

Another option, if the amount of data per user fits on one instance, you can do a Gaussian mixture on each user with something else than Mllib. This approach was described in the meetup in a case where sklearn was used within PySpark to create multiple models. You'd do something like:

val users: List[Long] = getUsers
val models = sc.parallelize(users).map(user => {
  val userData = getDataForUser(user)
  buildGM(userData)
})

Spark/Mllib Train many GaussianMixture models in a distributed way

1 Answers