I used hbase-spark to record pv/uv in my spark-streaming project. Then when I killed the app and restart it, I got following exception while checkpoint-recover:
16/03/02 10:17:21 ERROR HBaseContext: Unable to getConfig from broadcast java.lang.ClassCastException: [B cannot be cast to org.apache.spark.SerializableWritable at com.paitao.xmlife.contrib.hbase.HBaseContext.getConf(HBaseContext.scala:645) at com.paitao.xmlife.contrib.hbase.HBaseContext.com$paitao$xmlife$contrib$hbase$HBaseContext$$hbaseForeachPartition(HBaseContext.scala:627) at com.paitao.xmlife.contrib.hbase.HBaseContext$$anonfun$com$paitao$xmlife$contrib$hbase$HBaseContext$$bulkMutation$1.apply(HBaseContext.scala:457) at com.paitao.xmlife.contrib.hbase.HBaseContext$$anonfun$com$paitao$xmlife$contrib$hbase$HBaseContext$$bulkMutation$1.apply(HBaseContext.scala:457) at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:898) at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:898) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1839) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1839) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) at org.apache.spark.scheduler.Task.run(Task.scala:88) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745)
I checked the code of HBaseContext, It uses a broadcast to store the HBase configuration.
class HBaseContext(@transient sc: SparkContext,
@transient config: Configuration,
val tmpHdfsConfgFile: String = null) extends Serializable with Logging {
@transient var credentials = SparkHadoopUtil.get.getCurrentUserCredentials()
@transient var tmpHdfsConfiguration: Configuration = config
@transient var appliedCredentials = false
@transient val job = Job.getInstance(config)
TableMapReduceUtil.initCredentials(job)
// <-- broadcast for HBaseConfiguration here !!!
var broadcastedConf = sc.broadcast(new SerializableWritable(config))
var credentialsConf = sc.broadcast(new SerializableWritable(job.getCredentials()))
...
When the checkpoint-recover, it tried to access this broadcast value in its getConf func:
if (tmpHdfsConfiguration == null) {
try {
tmpHdfsConfiguration = configBroadcast.value.value
} catch {
case ex: Exception => logError("Unable to getConfig from broadcast", ex)
}
}
Then the exception raised. My question is: is it possible to recover the broadcasted value from checkpoint in a spark application? All we have some other solution to re-broadcast the value after recovering?
Thanks for any feedback!