2 votes

I'm trying to query data loaded into an HBase table using SparkSQL/DataFrames. My cluster is based on Cloudera CDH 6.2.0 (Spark version 2.4.0 and HBase version 2.1.0).

Following this guide, I selected my HBase service in the HBase Service property of my Spark service. This added the following jars to my Spark classpath:

/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/lib/hbase/bin/../lib/client-facing-thirdparty/audience-annotations-0.5.0.jar
/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/lib/hbase/bin/../lib/client-facing-thirdparty/commons-logging-1.2.jar
/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/lib/hbase/bin/../lib/client-facing-thirdparty/findbugs-annotations-1.3.9-1.jar
/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/lib/hbase/bin/../lib/client-facing-thirdparty/htrace-core4-4.2.0-incubating.jar
/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/lib/hbase/bin/../lib/shaded-clients/hbase-shaded-mapreduce-2.1.0-cdh6.2.0.jar
/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/lib/hbase/hbase-spark.jar

I then started the spark-shell. Following this example, which uses this Spark-HBase Connector, I managed to load and retrieve data from HBase and put it into a DataFrame, roughly as in the sketch below (the table name and column mapping are placeholders for my real ones; spark and $ are provided by the shell):
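
// Sketch of the read via the hbase-spark data source; the "person" table
// and the c:email mapping are placeholders.
val df = spark.read
  .format("org.apache.hadoop.hbase.spark")
  .option("hbase.table", "person")
  .option("hbase.columns.mapping", "name STRING :key, email STRING c:email")
  .option("hbase.spark.use.hbasecontext", false)
  .load()

// Any query that pushes a filter down to HBase then fails, e.g.:
df.filter($"email" === "john@example.com").show()

When I query this DataFrame, using either SparkSQL or the DataFrame API, I get the following exception: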

java.lang.NoSuchMethodError: org.apache.hadoop.hbase.util.ByteStringer.wrap([B)Lcom/google/protobuf/ByteString;
  at org.apache.hadoop.hbase.spark.SparkSQLPushDownFilter.toByteArray(SparkSQLPushDownFilter.java:256)
  at org.apache.hadoop.hbase.spark.datasources.SerializedFilter$$anonfun$toSerializedTypedFilter$1.apply(HBaseTableScanRDD.scala:267)
  at org.apache.hadoop.hbase.spark.datasources.SerializedFilter$$anonfun$toSerializedTypedFilter$1.apply(HBaseTableScanRDD.scala:267)
  at scala.Option.map(Option.scala:146)
  at org.apache.hadoop.hbase.spark.datasources.SerializedFilter$.toSerializedTypedFilter(HBaseTableScanRDD.scala:267)
  at org.apache.hadoop.hbase.spark.datasources.HBaseTableScanRDD$$anonfun$1.apply(HBaseTableScanRDD.scala:88)
  at org.apache.hadoop.hbase.spark.datasources.HBaseTableScanRDD$$anonfun$1.apply(HBaseTableScanRDD.scala:80)
  at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
  at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
  at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
  at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
  at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
  at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
  ...

I tried to start the spark-shell 'as is', without passing the above connector, and the result is the same. I read that this issue can be caused by mismatched protocol buffer versions, but I don't know how to resolve it.
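
One way to check where each side of the failing call comes from is to locate the classes from inside spark-shell (a diagnostic sketch only, not a fix):

// Print which jar each class is loaded from; if ByteStringer and ByteString
// come from jars built against different protobuf versions, the
// NoSuchMethodError above follows.
Seq("org.apache.hadoop.hbase.util.ByteStringer",
    "com.google.protobuf.ByteString").foreach { name =>
  val src = Class.forName(name).getProtectionDomain.getCodeSource
  println(s"$name -> ${Option(src).map(_.getLocation).getOrElse("unknown")}")
}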

1
If that's indeed the issue, then the solution is as simple as using consistent protocol buffer versions. However, I do not know much about HBase, so I might be wrong. I would look into the protocol buffer versions if I were you and search for differences. – Lajos Arpad

1 Answer

0 votes

We had the same issue with CDH 6.3.3 and ended up compiling Hortonworks shc-core from source. So far it has worked with CDH 6.3.3 without any issues; a sketch of how we use it is below.
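
For reference, a minimal sketch of reading a table through shc-core from spark-shell, assuming the locally built jar is passed via --jars; the catalog (table, columns, types) is a placeholder, and spark and $ are provided by the shell:

import org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog

// Placeholder catalog: map the row key and one column family to fields.
val catalog = """{
  |"table":{"namespace":"default", "name":"person"},
  |"rowkey":"key",
  |"columns":{
  |  "name":{"cf":"rowkey", "col":"key", "type":"string"},
  |  "email":{"cf":"c", "col":"email", "type":"string"}
  |}
  |}""".stripMargin

val df = spark.read
  .options(Map(HBaseTableCatalog.tableCatalog -> catalog))
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .load()

// Filters that previously hit the NoSuchMethodError now run fine.
df.filter($"email" === "john@example.com").show()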