6
votes

I have a list of Double values stored like this:

JavaRDD<Double> myDoubles

I would like to compute the mean of this list. According to the documentation:

All of MLlib’s methods use Java-friendly types, so you can import and call them there the same way you do in Scala. The only caveat is that the methods take Scala RDD objects, while the Spark Java API uses a separate JavaRDD class. You can convert a Java RDD to a Scala one by calling .rdd() on your JavaRDD object.

On the same page, I see the following code:

val MSE = valuesAndPreds.map{case(v, p) => math.pow((v - p), 2)}.mean()

As I understand it, this is equivalent (in terms of types) to

Double MSE = RDD<Double>.mean()

So I tried to compute the mean of my JavaRDD like this:

myDoubles.rdd().mean()

However, this does not work and gives me the following error: The method mean() is undefined for the type RDD<Double>. I also found no mention of this function in the Scala RDD documentation. Am I misunderstanding something, or is something else going on?

2
What do you mean "it doesn't work"? Is that the specific error message you see? - Daniel Darabos
Thanks! Scala is crazy like that. The mean method is on DoubleRDDFunctions, but can be used on RDD[Double]. It is also on JavaDoubleRDD, so that's what you need to get. - Daniel Darabos
(I don't know the Java API, so I cannot be more specific, sorry.) - Daniel Darabos
Excellent! I'd rather leave posting the answer to you. I don't even know how to test that line. - Daniel Darabos

2 Answers

10
votes

It's actually quite simple: mean() is defined on the JavaDoubleRDD class, not on JavaRDD<Double>. I did not find a direct cast from JavaRDD<Double> to JavaDoubleRDD, but in my case none was needed.

Indeed, this line in Scala

val mean = valuesAndPreds.map{case(v, p) => (v - p)}.mean()

can be expressed in Java as

double mean = valuesAndPreds.mapToDouble(tuple -> tuple._1 - tuple._2).mean();
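
For the JavaRDD<Double> in the question, the same mapToDouble call can be used to obtain a JavaDoubleRDD. The following is a minimal sketch, assuming Java 8 and a local Spark master; the class name MeanExample, the helper method mean and the sample values are hypothetical and only serve as illustration.

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaDoubleRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

// Hypothetical class name, used only for this illustration.
public class MeanExample {

    // mean() is not available on JavaRDD<Double>, so first map each
    // boxed Double to a primitive double to obtain a JavaDoubleRDD.
    static double mean(JavaRDD<Double> values) {
        JavaDoubleRDD doubles = values.mapToDouble(Double::doubleValue);
        return doubles.mean();
    }

    public static void main(String[] args) {
        // Assumption: a local master is enough for a quick check.
        SparkConf conf = new SparkConf()
                .setAppName("MeanExample")
                .setMaster("local[*]");

        // JavaSparkContext is Closeable, so try-with-resources stops it.
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Made-up sample data standing in for myDoubles.
            JavaRDD<Double> myDoubles =
                    sc.parallelize(Arrays.asList(1.0, 2.0, 3.0, 4.0));
            System.out.println(mean(myDoubles));
        }
    }
}
```

With this approach there is no need to go through .rdd() at all: the conversion stays inside the Java API.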
0
votes

Don't forget to add import org.apache.spark.SparkContext._ at the top of your Scala file. Also make sure you are calling mean() on an RDD[Double].