I'm looking into establishing a JDBC Spark connection to use from R/Python. I know that pyspark and SparkR are both available, but those seem more appropriate for interactive analysis, particularly since they reserve cluster resources for the user. I'm thinking of something more analogous to the Tableau ODBC Spark connection: something more lightweight (as I understand it) that supports simple random access. While this seems possible, and there is some documentation, it isn't clear (to me) what the JDBC driver requirements are.
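For concreteness, this is roughly the kind of lightweight, random-access usage I have in mind from the Python side. It's only a sketch: the host, port, credentials, and jar path are placeholders, and I'm assuming the JayDeBeApi package plus a Hive JDBC driver jar would be the way to reach a Thrift endpoint over JDBC:

    import jaydebeapi

    # Placeholder host/port for whatever Thrift server ends up serving Spark SQL
    conn = jaydebeapi.connect(
        "org.apache.hive.jdbc.HiveDriver",                # the driver class in question
        "jdbc:hive2://spark-thrift-host:10000/default",   # placeholder URL
        ["my_user", "my_password"],                       # placeholder credentials
        jars=["/path/to/hive-jdbc-standalone.jar"],       # plus whatever else is required
    )

    cursor = conn.cursor()
    cursor.execute("SELECT * FROM some_table LIMIT 10")   # simple random access
    rows = cursor.fetchall()
    cursor.close()
    conn.close()

The point is just connect, run a small query, fetch, and disconnect, without holding a long-lived Spark session and its executors the way a pyspark/SparkR shell would.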
Should I use org.apache.hive.jdbc.HiveDriver, as I do to establish a Hive connection, since Hive and Spark SQL via Thrift seem closely linked? Should I swap out the hadoop-common dependency needed for my Hive connection (which uses the HiveServer2 port) for some Spark-specific dependency (when using hive.server2.thrift.http.port)?
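To make the dependency question concrete, here is how I currently picture the two variants. The hosts, ports, and jar paths are placeholders, and I'm not certain the jar list is what's actually required; that is precisely the question:

    # Binary transport, the way my existing Hive connection works
    # (hive.server2.thrift.port, typically 10000):
    url_binary = "jdbc:hive2://thrift-host:10000/default"

    # HTTP transport (hive.server2.thrift.http.port), which as I understand it
    # uses the same HiveDriver class but needs transportMode/httpPath parameters:
    url_http = ("jdbc:hive2://thrift-host:10001/default;"
                "transportMode=http;httpPath=cliservice")

    # Jars I currently put on the classpath for the Hive connection; the question
    # is whether hadoop-common should be swapped for a Spark-specific artifact:
    jars = [
        "/path/to/hive-jdbc-standalone.jar",
        "/path/to/hadoop-common.jar",
    ]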
Also, since most of the connection functionality seems to leverage Hive, what is the key thing that causes Spark SQL to be used as the query engine instead of Hive?
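Part of what makes this confusing is that, as far as I can tell, the client side would look identical in both cases. Whether the port belongs to HiveServer2 or to the Spark Thrift Server (started with sbin/start-thriftserver.sh, if my understanding is right), the connection code would presumably be the same, with only the server behind the port determining which engine runs the query (hosts/ports below are placeholders):

    import jaydebeapi

    # Same driver class and URL shape either way; only the host/port
    # and whatever is listening on them differ.
    conn_hive  = jaydebeapi.connect("org.apache.hive.jdbc.HiveDriver",
                                    "jdbc:hive2://hiveserver2-host:10000/default")
    conn_spark = jaydebeapi.connect("org.apache.hive.jdbc.HiveDriver",
                                    "jdbc:hive2://spark-thrift-host:10000/default")

Is that the right mental model, or is there more to it?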