
For unit testing purposes I am building my own HBase Result object as follows:

import scala.collection.JavaConverters._
import org.apache.hadoop.hbase.{ Cell, KeyValue }
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.util.Bytes

val row = Bytes.toBytes( "row01" )
val cf = Bytes.toBytes( "cf" )
val cell1 = new KeyValue( row, cf, Bytes.toBytes( "v1" ), Bytes.toBytes( "file1" ) )
val cell2 = new KeyValue( row, cf, Bytes.toBytes( "v2" ), Bytes.toBytes( "file2" ) )

// Result.create expects a java.util.List[Cell]
val cells: List[Cell] = List( cell1, cell2 )

val result = Result.create( cells.asJava )

Now I want to put this into an RDD via the SparkContext, like:

val sparkContext = new org.apache.spark.SparkContext( conf )
val rdd = sparkContext.parallelize( List( result ) )

However, as soon as I try to access the RDD via foreach, like

rdd.foreach { x => x }

I get the famous Spark "Task not serializable" exception.

Does anyone know of a better way to create an RDD[Result]?


1 Answer


Result is not serializable, so if you want an RDD[Result] you have to produce the Results on the executors themselves from some other, serializable input. (Of course, actions like collect or first, which would ship Results between nodes, still won't work.) For example:

val rdd0 = sparkContext.parallelize( List( ("row", "cf") ) )

val rdd = rdd0.map { case (str1, str2) =>
  val row = Bytes.toBytes( str1 )
  val cf = Bytes.toBytes( str2 )
  val cell1 = new KeyValue( row, cf, Bytes.toBytes( "v1" ), Bytes.toBytes( "file1" ) )
  val cell2 = new KeyValue( row, cf, Bytes.toBytes( "v2" ), Bytes.toBytes( "file2" ) )

  // built on the executor, so the Result itself is never serialized
  val cells: List[Cell] = List( cell1, cell2 )

  Result.create( cells.asJava )
}
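
To get data back out, do the same in reverse: map each Result to serializable values on the executors and only collect those. A minimal sketch, assuming the "cf" family and "v1" qualifier used above:

// extract plain strings on the executors; only these
// serializable values are sent back to the driver
val values = rdd.map { result =>
  Bytes.toString( result.getValue( Bytes.toBytes( "cf" ), Bytes.toBytes( "v1" ) ) )
}

values.collect().foreach( println )   // prints "file1"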