Scala Spark - using RDD with mllib

Question

I have data in the form of RDD[(List[Double], List[Double])], for example:

// sc is the SparkContext
val sampleData = sc.parallelize(Seq(
    (List(1.1, 1.2, 1.3), List(1.1, 1.5, 1.2)),
    (List(3.0, 3.3, 3.3), List(3.1, 3.2, 3.6))
))

I would like to call Statistics.corr(a, b) for each row, where a is the row's first List[Double] and b is its second List[Double].

The result I would like is two correlation values from corr(): one for (1.1, 1.2, 1.3) with (1.1, 1.5, 1.2), and one for (3.0, 3.3, 3.3) with (3.1, 3.2, 3.6).

My attempted solution is:

Statistics.corr(sampleData.flatMap(_._1), sampleData.flatMap(_._2))

This gives me a single correlation for (1.1, 1.2, 1.3, 3.0, 3.3, 3.3) and (1.1, 1.5, 1.2, 3.1, 3.2, 3.6), which is not what I want.

Matthew Graves · Accepted Answer · 2015-09-21T20:05:08

This calls for map, not flatMap, since you want to keep the rows of the RDD separate.

Unfortunately, I'm not yet aware of a serializable correlation function that operates on two List[Double]s. The first place I checked was the Pearson correlation from Apache Commons, but it's not serializable. You may have to write your own function (but I'd spend some more effort looking first). Once you have a correlation function, use it as follows:

sampleData.map(x => correlation(x._1,x._2))

This will still be an RDD, and it will have no reference to the original row it came from besides the order, so you may want to pass the original data along (or, at least, whatever id it used to have).
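If you do end up writing your own function, a minimal sketch might look like the following. The object name Correlation and the method pearson are hypothetical, not part of MLlib or Apache Commons. It computes the plain Pearson coefficient using only Scala collections, so the closure passed to map should have nothing non-serializable to capture.

```scala
// Hypothetical helper (not an MLlib or Apache Commons API):
// Pearson correlation of two equal-length lists of doubles.
object Correlation {
  def pearson(xs: List[Double], ys: List[Double]): Double = {
    require(xs.nonEmpty && xs.length == ys.length,
      "lists must be non-empty and of equal length")
    val n = xs.length
    val meanX = xs.sum / n
    val meanY = ys.sum / n
    // Sum of products of deviations from the means (unnormalised covariance)
    val cov = xs.zip(ys).map { case (x, y) => (x - meanX) * (y - meanY) }.sum
    // Sums of squared deviations (unnormalised variances)
    val varX = xs.map(x => math.pow(x - meanX, 2)).sum
    val varY = ys.map(y => math.pow(y - meanY, 2)).sum
    // Evaluates to NaN when either list is constant (zero variance)
    cov / math.sqrt(varX * varY)
  }
}
```

To keep each result attached to the row it came from, return a pair from the map, for example sampleData.map { case (a, b) => ((a, b), Correlation.pearson(a, b)) }, or carry an id alongside the two lists if your rows have one.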
