2 votes

I am running into a frustrating issue while trying to use groupByKey or any other function of a PairRDD or MappedRDD. Whatever I do, I always end up with a plain RDD, and I don't know how to convert it (I was fairly sure the conversion would be detected automatically by Scala). My code is the following:

val broadcastedDistanceMeasure = sc.broadcast(dbScanSettings.distanceMeasure)
val distances = input.cartesian(input)
  .filter { case (p1, p2) => p1 != p2 }
  .map { case (p1, p2) =>
    p1 -> broadcastedDistanceMeasure.value.distance(p1, p2)
  }

where input is an RDD. The resulting type, according to both Eclipse and sbt run, is a plain RDD, so I cannot perform a groupByKey operation. If I try almost the same code in the Spark shell, however, I get a MappedRDD.
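
To illustrate, the call I would like to make is simply:

distances.groupByKey()

but it does not compile, since groupByKey is not defined on the plain RDD type.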

This is my build.sbt file:

scalaVersion := "2.10.4"

libraryDependencies += "org.apache.spark" % "spark-core_2.10" % "1.1.0"

Can anybody help me?

Thanks.

Regards.

Marco

1 - Perhaps you don't see the groupByKey operation just because you are NOT importing org.apache.spark.SparkContext._ - ale64bit

1 Answer

3 votes

Inside the IDE you will never see the MappedRDD type for an RDD, because the richer interfaces are provided through implicit conversions in the Spark Scala API. If you look, for example, at the source of SparkContext, you will see the implicit conversions from the plain RDD to the specialized, richer interfaces such as PairRDDFunctions; functions such as groupByKey become available through these conversions. So, in short, I think you only need to import org.apache.spark.SparkContext._ in order to achieve what you want.
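
For example, something along these lines should compile once the import is in place (a sketch based on your snippet; input, sc and broadcastedDistanceMeasure are taken from your code):

import org.apache.spark.SparkContext._ // brings rddToPairRDDFunctions into scope

val distances = input.cartesian(input)
  .filter { case (p1, p2) => p1 != p2 }
  .map { case (p1, p2) => p1 -> broadcastedDistanceMeasure.value.distance(p1, p2) }

// With the import in scope, distances (an RDD of key/value pairs) is
// implicitly wrapped in PairRDDFunctions, so groupByKey is available:
val grouped = distances.groupByKey()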

In this particular case, I think the specific conversion is

implicit def rddToPairRDDFunctions[K: ClassTag, V: ClassTag](rdd: RDD[(K, V)]) =
    new PairRDDFunctions(rdd)

which wraps the RDD in a PairRDDFunctions, which in turn provides the groupByKey operation.
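
If you want to make the conversion explicit, you can also apply the wrapper by hand; this is just what the implicit conversion does for you (a sketch, assuming distances is an RDD of pairs whose key and value types have ClassTags in scope):

import org.apache.spark.rdd.PairRDDFunctions

// explicitly wrapping the pair RDD, equivalent to the implicit conversion
val pairOps = new PairRDDFunctions(distances)
val grouped = pairOps.groupByKey()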

Hope it helped.