
I've set up access to an external relational store (PostgreSQL) via my Spark/Hive deployment. I can read this table via Hive/Beeline, but the read fails via Spark SQL from a PySpark3 Jupyter notebook because it's unable to find JdbcStorageHandler. I've tried to add the appropriate jars in a couple of ways, but hit the same stack trace across the board. Any advice on which jar and version I need, and where exactly I should put it, for this to work? Stack trace:

java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Error in loading storage handler.org.apache.hive.storage.jdbc.JdbcStorageHandler
..
..
java.lang.ClassNotFoundException: org.apache.hive.storage.jdbc.JdbcStorageHandler

In terms of getting Hive/Beeline to work: I did as described in the JDBC Storage Handler documentation. I hit a few jar dependency problems while doing this, but resolved them by adding the hive-jdbc-2.0.0.jar and postgresql-42.2.12.jar jars after launching Beeline; I can now successfully read data directly from the relational store from Beeline.
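For reference, the working Beeline session was roughly the following (a sketch: the jar paths, table definition, and connection details are illustrative placeholders, not my actual values):

ADD JAR /path/to/hive-jdbc-2.0.0.jar;
ADD JAR /path/to/postgresql-42.2.12.jar;

CREATE EXTERNAL TABLE pg_example (id INT, name STRING)
STORED BY 'org.apache.hive.storage.jdbc.JdbcStorageHandler'
TBLPROPERTIES (
  "hive.sql.database.type" = "POSTGRES",
  "hive.sql.jdbc.driver" = "org.postgresql.Driver",
  "hive.sql.jdbc.url" = "jdbc:postgresql://myhost:5432/mydb",
  "hive.sql.dbcp.username" = "myuser",
  "hive.sql.dbcp.password" = "mypassword",
  "hive.sql.table" = "example"
);

SELECT * FROM pg_example;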

Some things I've tried:

  1. Add the jars listed above with spark.jars.packages in the notebook sparkmagic conf (see the sketch after this list). hive-jdbc 2.0.0 installs cleanly but yields the aforementioned error. I also tried hive-jdbc 3.1.0, but it errors out and does not install; I was a little confused about how to assess version compatibility here, so this might be a distraction.
  2. Launch spark-sql on the cluster directly and add the hive-jdbc-2.0.0.jar (successfully). Same stack trace.
  3. Add the Apache Hive libraries (hive-jdbc and the Postgres driver) across the cluster during cluster creation.
  4. Look around the rest of /usr/hdp for hive-jdbc, of which there are a variety of versions (beneath zeppelin, spark2, oozie, hive-hcatalog, hive, ranger-admin).
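For item 1, the notebook configuration looked roughly like this (a sketch, assuming sparkmagic's %%configure cell; the Maven coordinates are for the versions mentioned above):

%%configure -f
{
  "conf": {
    "spark.jars.packages": "org.apache.hive:hive-jdbc:2.0.0,org.postgresql:postgresql:42.2.12"
  }
}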

Environment details:

  • running on Azure HDInsight
  • Spark 2.4 (HDI 4.0)

1 Answer


Please copy hive-jdbc-handler.jar to the $SPARK_HOME/standalone-metastore directory on all nodes:

cp /usr/hdp/current/hive-client/lib/hive-jdbc-handler.jar /usr/hdp/current/spark2-client/standalone-metastore/
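If you don't want to log into each node by hand, a loop along these lines works (a sketch: the worker hostnames are hypothetical, and it assumes your user can write to the target directory; on HDInsight a script action is the more usual way to run a command on every node):

# hypothetical worker hostnames; substitute your cluster's
for host in wn0-mycluster wn1-mycluster wn2-mycluster; do
  scp /usr/hdp/current/hive-client/lib/hive-jdbc-handler.jar \
      "$host":/usr/hdp/current/spark2-client/standalone-metastore/
done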

After that, launch spark-shell and run a quick test:

sudo -u spark spark-shell --master yarn --jars /usr/hdp/current/hive-client/lib/hive-jdbc-handler.jar
scala> spark.sql("select * from table_name").show()

If you get the error below, there is a known issue filed for this; you need to backport that fix from Spark 3.0.0 to your cluster's Spark version (for example, Spark 2.3.2):

Caused by: java.lang.IllegalArgumentException: requirement failed: length (-1) cannot be negative
  at scala.Predef$.require(Predef.scala:224)
  at org.apache.spark.rdd.InputFileBlockHolder$.set(InputFileBlockHolder.scala:70)
  at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:226)
  at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:214)
  at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:94)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
  at org.apache.spark.scheduler.Task.run(Task.scala:109)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  at java.lang.Thread.run(Thread.java:748)

https://issues.apache.org/jira/browse/SPARK-27259
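For the backport itself, the outline is roughly as follows (a sketch: the commit hash for the SPARK-27259 fix is a placeholder you would look up on the JIRA or its linked pull request, and the build profiles depend on your distribution):

git clone https://github.com/apache/spark.git
cd spark
git checkout v2.3.2
git cherry-pick <commit-for-SPARK-27259>   # placeholder: find the actual fix commit via the JIRA
./dev/make-distribution.sh --name custom-jdbc-fix --tgz -Pyarn -Phadoop-2.7

Then swap the resulting Spark jars into your cluster's Spark installation.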