I'm looking into establishing a JDBC connection to Spark for use from R/Python. I know that pyspark and SparkR are both available, but those seem better suited to interactive analysis, particularly since they reserve cluster resources for the user. I'm thinking of something more analogous to the Tableau ODBC Spark connection: something more lightweight (as I understand it) for supporting simple random access. While this seems possible, and there is some documentation, it isn't clear (to me) what the JDBC driver requirements are.

Should I use org.apache.hive.jdbc.HiveDriver, as I do to establish a Hive connection, since Hive and Spark SQL served via Thrift seem closely linked? And should I swap out the hadoop-common dependency needed for my Hive connection (which uses the HiveServer2 port) for some Spark-specific dependency (when using hive.server2.thrift.http.port)?
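For what it's worth, the JDBC URL shape is the same `jdbc:hive2://` form whether you point at HiveServer2 or a Spark Thrift Server; the difference for the HTTP-transport case is that `transportMode` and `httpPath` must be set in the URL. A minimal sketch of the two forms (hostnames, ports, and the `cliservice` path are placeholders, not values from my cluster):

```java
// Sketch: JDBC URL forms for a Spark Thrift Server (or HiveServer2).
// Hostnames and ports below are placeholders -- check your cluster's
// hive.server2.thrift.port / hive.server2.thrift.http.port settings.
public class SparkJdbcUrls {
    // Binary Thrift transport (the default): same URL form as plain HiveServer2.
    public static String binaryUrl(String host, int port, String db) {
        return "jdbc:hive2://" + host + ":" + port + "/" + db;
    }

    // HTTP Thrift transport: transportMode and httpPath must be given explicitly.
    public static String httpUrl(String host, int port, String db) {
        return "jdbc:hive2://" + host + ":" + port + "/" + db
                + ";transportMode=http;httpPath=cliservice";
    }

    public static void main(String[] args) {
        System.out.println(binaryUrl("spark-thrift.example.com", 10015, "default"));
        System.out.println(httpUrl("spark-thrift.example.com", 10016, "default"));
    }
}
```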

Also, since most of the connection functionality appears to come from Hive, what is the key thing that causes Spark SQL to be used as the query engine rather than Hive?

1 Answer

As it turned out, the URL I needed did not match the Hive database host URL listed in Ambari. I came across the correct URL in an example showing how to connect to my cluster specifically. Given the proper URL, I was able to establish a connection using the HiveDriver without issue.
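The working setup, then, was just the standard Hive JDBC driver pointed at the right endpoint. A minimal sketch of that connection (the URL, user, and query here are placeholders, not the actual cluster details; it requires hive-jdbc and its transitive dependencies on the classpath):

```java
// Minimal sketch: querying a Spark Thrift Server via the Hive JDBC driver.
// The URL below is a placeholder -- use the Thrift endpoint your cluster
// actually exposes, which (as noted above) may differ from the Hive host
// shown in Ambari.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class SparkThriftQuery {
    static final String URL = "jdbc:hive2://spark-thrift.example.com:10015/default";

    public static void main(String[] args) {
        try {
            // Load the driver; needs the hive-jdbc jar on the classpath.
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            try (Connection conn = DriverManager.getConnection(URL, "user", "");
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery("SELECT 1")) {
                while (rs.next()) {
                    System.out.println(rs.getInt(1));
                }
            }
        } catch (Exception e) {
            // No cluster or missing driver jar lands here.
            System.err.println("Connection failed: " + e);
        }
    }
}
```

Because the Spark Thrift Server speaks the HiveServer2 protocol, the driver class and this code are identical to a plain Hive connection; what selects Spark SQL as the engine is simply which server the URL points at.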