I'm new to Spark and Scala and have run into a compile error. Say we have an RDD whose elements are maps, built like this:

val rawData = someRDD.map { row =>
    // some ops
    Map(
      "A" -> someInt_var1,  // Int
      "B" -> someInt_var2,  // Int
      "C" -> somelong_var   // Long
    )
}

Then I want to get histogram information for these values. Here is my code:

rawData.map{row => row.get("A")}.histogram(10)

And the compile error says:

value histogram is not a member of org.apache.spark.rdd.RDD[Option[Any]]

I'm wondering why rawData.map{row => row.get("A")} has type org.apache.spark.rdd.RDD[Option[Any]], and how to transform it into an RDD[Int]. I tried this:

rawData.map{row => row.get("A")}.map{_.toInt}.histogram(10)

But it fails to compile:

value toInt is not a member of Option[Any]

I'm totally confused and would appreciate any help.

1 Answer

You get an Option because Map.get returns an Option: Some(value) if the key exists in the Map and None if it doesn't. The Any in Option[Any] comes from the mixed value types in the Map: you have both Int and Long, so the compiler infers their common supertype. In my case it infers AnyVal rather than Any.
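To see this without Spark, here is a minimal sketch in plain Scala. The names row, a and missing are only for illustration, and the type annotations are written out so the compiler checks them:

```scala
// A map whose values mix Int and Long: the value type is their
// common supertype, AnyVal.
val row: Map[String, AnyVal] = Map("A" -> 1, "B" -> 2, "C" -> 4L)

// Map.get wraps the result in an Option, so the static type is
// Option[AnyVal], not Int.
val a: Option[AnyVal] = row.get("A")

// A key that is not in the map ("Z" is a made-up key) gives None.
val missing: Option[AnyVal] = row.get("Z")
```

Neither Option nor AnyVal has a toInt method, which is why the second attempt in the question does not compile either.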

One possible solution is to use getOrElse, which gets rid of the Option by supplying a default value when the key doesn't exist. If you are sure A's value is always an Int, you can then convert it from AnyVal to Int with asInstanceOf[Int].

A simplified example follows:

val rawData = sc.parallelize(Seq(Map("A" -> 1, "B" -> 2, "C" -> 4L)))

rawData.map(_.get("A"))
// res6: org.apache.spark.rdd.RDD[Option[AnyVal]] = MapPartitionsRDD[9] at map at <console>:27

rawData.map(_.getOrElse("A", 0).asInstanceOf[Int]).histogram(10)
// res7: (Array[Double], Array[Long]) = (Array(1.0, 1.0),Array(1))
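As a possible alternative to getOrElse plus asInstanceOf (a different technique from the one above), you could pattern match and keep only the values that really are Ints. The sketch below uses plain Scala collections so it runs without Spark; rows and aValues are illustrative names, and the second map is invented to show a missing key:

```scala
// Two sample rows; the second one has no "A" key.
val rows: Seq[Map[String, AnyVal]] = Seq(
  Map("A" -> 1, "B" -> 2, "C" -> 4L),
  Map("B" -> 5, "C" -> 6L)
)

// collect with a partial function keeps only the entries that match
// Some(<an Int>) and unwraps them; None and non-Int values are dropped.
val aValues: Seq[Int] = rows.map(_.get("A")).collect { case Some(i: Int) => i }
```

RDD also has a collect overload that takes a partial function, so the same expression on rawData should yield an RDD[Int] on which histogram(10) is available. The trade-off is that rows without "A" are skipped instead of being counted as the default 0, which may or may not be what you want in the histogram.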