0
votes

I could not find anything on this after hours of searching Google, so I hope I can get some ideas for my problem here.

I am trying to get data from a remote hive cluster using spark2. I have followed:

  1. How to connect to a Hive metastore programmatically in SparkSQL?
  2. How to connect to remote hive server from spark

And I was able to connect to the remote hive metastore successfully.

However, my problem starts when I execute a query against the remote Hive, e.g. spark.sql("select count(*) from table"). I get an "unknown host: ns-bigdata" error, where ns-bigdata is the cluster name of the remote cluster.

What else am I missing here? Do I also need to specify where hive.metastore.warehouse.dir is, e.g. hdfs://local-cluster:8020/user/hive/warehouse?

Thanks in advance.
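
One way to narrow down the "unknown host: ns-bigdata" error is to check, from the machine running Spark, which of the names involved actually resolve. The sketch below uses only the Python standard library; the helper name resolve_host is made up for illustration. Note that if ns-bigdata is a logical HDFS nameservice rather than a real host (an assumption, the thread does not confirm it), it is not expected to resolve through DNS at all.

```python
import socket


def resolve_host(hostname):
    """Return the sorted IP addresses `hostname` resolves to, or [] if it does not resolve."""
    try:
        infos = socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        # Name resolution failed (unknown host, or no resolver reachable).
        return []
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})
```

Calling resolve_host("remote.hive.domain") and resolve_host("ns-bigdata") on the Spark driver and on a worker node shows whether the metastore host resolves while the cluster name does not.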

2
Sounds like your DNS server is not working. Try using IP addresses. - OneCricketeer
I don't think it's DNS, as my Spark session is able to connect to the remote Hive metastore by hostname, i.e. spark.config("spark.hadoop.hive.metastore.uri", "thrift://remote.hive.domain:9083"). - Kok-Lim Wong
That's just a string. The connection is not attempted until you actually run a query. - OneCricketeer
Try running a simpler query, spark.sql("show databases").show(), to make sure the connection is fine. If that works, include the database name in the query as well: spark.sql("select count(*) from database.table"). Also, to be clear: the machine where you are running spark2-submit or spark2-shell is not part of the cluster "ns-bigdata", correct? - yammanuruarun
After some thought, I think @cricket_007 may be right. When I run a query, Hive probably tries to access the warehouse directory in HDFS to check the schema but cannot find it, because my Spark cluster doesn't know where ns-bigdata is. I'll see if I can get the IP of ns-bigdata and put it in the hosts file of my cluster. - Kok-Lim Wong
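
If ns-bigdata turns out to be an HDFS high-availability nameservice (an assumption; the thread only calls it the cluster name), a hosts-file entry alone would not help, because the name is resolved through HDFS client settings rather than DNS. A minimal sketch of those settings as Spark options follows. The function name and the NameNode addresses are hypothetical, and the real values would come from the remote cluster's hdfs-site.xml.

```python
def ha_nameservice_conf(nameservice, namenodes):
    """Build Spark options that describe a remote HDFS HA nameservice.

    `namenodes` maps a NameNode ID (e.g. "nn1") to its "host:port" RPC address.
    The "spark.hadoop." prefix makes Spark copy each key into the Hadoop configuration.
    """
    prefix = "spark.hadoop."
    conf = {
        prefix + "dfs.nameservices": nameservice,
        prefix + "dfs.ha.namenodes." + nameservice: ",".join(namenodes),
        prefix + "dfs.client.failover.proxy.provider." + nameservice:
            "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider",
    }
    for nn_id, address in namenodes.items():
        key = prefix + "dfs.namenode.rpc-address.%s.%s" % (nameservice, nn_id)
        conf[key] = address
    return conf


# Hypothetical NameNode hosts, for illustration only.
example = ha_nameservice_conf(
    "ns-bigdata",
    {"nn1": "namenode1.remote.domain:8020", "nn2": "namenode2.remote.domain:8020"},
)
for key, value in sorted(example.items()):
    print(key, "=", value)
```

Each pair can then be passed to SparkSession.builder.config(key, value) before the session is created. If the local cluster has its own nameservice, dfs.nameservices would need to list both names, separated by a comma.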

2 Answers

0
votes

The Hive server URL is in hive-site.xml. Can you try using that? Also check that hive-site.xml is present in the conf/ directory of Spark.
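
To see what a given hive-site.xml actually contains, a small standard-library sketch like the following can list its properties. The function names are made up for illustration, and the file is assumed to use the usual Hadoop layout of property elements with name and value children.

```python
import xml.etree.ElementTree as ET


def parse_hive_site(xml_text):
    """Return the name/value pairs of a Hadoop-style XML config as a dict."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name:
            # A missing or empty <value> becomes an empty string.
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


def read_hive_site(path):
    """Read and parse a hive-site.xml file from disk."""
    with open(path, "rb") as handle:
        return parse_hive_site(handle.read())
```

For example, read_hive_site("conf/hive-site.xml").get("hive.metastore.uris") returns the metastore URI the file configures, or None if the property is absent. The same lookup works for hive.metastore.warehouse.dir.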

0
votes

The real cause was that the customer had not set up their Kerberos cert in the Hive Thrift server for cross-realm authentication. We ended up using Impala over JDBC instead.