2 votes

I have a Databricks cluster running on some VMs. My organisation has a Hadoop cluster with a bunch of data in it that I want. I have no access to the Hadoop cluster itself, just a JDBC URL (all my permissions have been sorted out; they gave me just a URL).

I can open a database management tool (DBeaver) on my local machine and query Hive tables successfully.

However, I am struggling to query Hive tables using Databricks and PySpark. It seems that to set the connection string for the HiveContext I would normally write it in the hive-site.xml file, but Databricks doesn't give me this option.

I am on Hive 2.1.1 & Databricks 6.4 (includes Apache Spark 2.4.5, Scala 2.11)

Now I am at a loss as to how to simply connect to my Hive database.

# Spark context sc is implicit in Databricks notebooks
from pyspark.sql import HiveContext  # in Spark 2.x, HiveContext lives in pyspark.sql
hive_context = HiveContext(sc)

# I want to be able to do something like
df = hive_context.sql("SELECT...")
@thebluephantom I followed this tutorial already, and A) it doesn't give any information about the PySpark syntax involved, and B) I ran into errors because of driver compatibility; the tutorial assumes the Hive database is stored in Azure and specifies drivers for that use case. I don't know how to determine suitable drivers. Is that something you know how to fix? - Alan
Sysadmin guys do that. I am actually installing a new system, but with no Hive. In Spark 2 it is just a session parameter and it should be all set up... - thebluephantom
Cool... hmm, I don't fully understand what you mean there about sysadmins. But maybe I'll ask... After playing around I found that the Databricks UI lets me specify session parameters for the HiveContext. The parameter in question is spark.hadoop.javax.jdo.option.ConnectionDriverName... this is where the above tutorial specifies SQL drivers. Given that I know my ConnectionURL, is it possible to figure out what the ConnectionDriverName should be? - Alan
I may have misinterpreted. - thebluephantom
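
On the driver-name question in the comments above: the ConnectionDriverName follows from the scheme of the JDBC URL, not from where the data lives. A rough mapping, as a sketch only (these are the standard class names for each driver; verify them against whatever driver JAR is actually attached to your cluster):

# JDBC URL scheme -> typical JDBC driver class (standard names; verify against your JAR)
DRIVER_BY_SCHEME = {
    "jdbc:hive2": "org.apache.hive.jdbc.HiveDriver",    # HiveServer2
    "jdbc:mysql": "com.mysql.cj.jdbc.Driver",           # MySQL-backed metastore
    "jdbc:postgresql": "org.postgresql.Driver",         # PostgreSQL-backed metastore
}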

2 Answers

0 votes

This is the most anti-climactic answer, but all along it was a firewall issue inside my organisation, not the jar files. My organisation permits and blocks access to and from different sources as needed. As embarrassing as this is, please believe me that it was non-obvious from the error!

Databricks comes with many of these JAR files preloaded, so this code would have been sufficient to establish a connection all along.

import java.sql.DriverManager

// Open a JDBC connection to HiveServer2 using the preloaded hive2 driver
val connection = DriverManager.getConnection("jdbc:hive2://hostname:10000/dbname", "usrname", "psswrd")
connection.isClosed()  // returns false if the connection was established
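
Since the question was asked in PySpark terms, the equivalent read through Spark's JDBC data source would look roughly like this. This is a sketch, not the answer's tested code: the driver class name and the table name some_table are assumptions, and Spark's generic JDBC reader is known to be finicky with the Hive dialect (e.g. it may return column names prefixed with the table name).

# Minimal PySpark sketch; spark is the implicit SparkSession in Databricks
df = (spark.read
      .format("jdbc")
      .option("url", "jdbc:hive2://hostname:10000/dbname")
      .option("driver", "org.apache.hive.jdbc.HiveDriver")  # assumed driver class
      .option("user", "usrname")
      .option("password", "psswrd")
      .option("dbtable", "some_table")  # hypothetical table name
      .load())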

This came after days of me Googling errors and chasing clues around various forums, including SO, to no avail. It took a conversation with a Databricks SME to pin it down as a network issue.

0 votes

Specifying the configuration for hive.metastore.uris during the SparkSession builder call will solve your problem if you are running on any other distribution, but for Databricks you have to edit the Spark config for your cluster.
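
For the non-Databricks case, a minimal sketch (thrift://metastore-host:9083 is a placeholder for your metastore URI; note this route needs network access to the metastore's Thrift port, which a HiveServer2-only JDBC URL may not give you):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("remote-hive")
         .config("hive.metastore.uris", "thrift://metastore-host:9083")  # placeholder
         .enableHiveSupport()
         .getOrCreate())

spark.sql("SHOW DATABASES").show()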

You can refer to the answers below:

How to connect Spark SQL to remote Hive metastore (via thrift protocol) with no hive-site.xml?

How to add Spark configuration in Databricks cluster
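
For the Databricks side, the cluster's Spark config box takes plain space-separated key-value lines. A sketch under the same placeholder assumptions (2.1.1 matches the Hive version from the question):

spark.sql.hive.metastore.version 2.1.1
spark.hadoop.hive.metastore.uris thrift://metastore-host:9083

On most setups you will also need spark.sql.hive.metastore.jars pointing at Hive JARs matching that version; the second link above covers how to set these on the cluster.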