1 vote

Using Spark 2.4 and Hive 3.1.0 on HDP 3.1, I am trying to read a managed table from Hive using Spark SQL, but it returns an empty DataFrame, while it reads an external table without any problem. How can I read a Hive managed table with Spark SQL?

Note: The Hive managed table is not empty when read through the Hive client.

1- I tried storing the table as both ORC and Parquet, and reading failed in both cases.

2- I failed to read it using HWC (the Hive Warehouse Connector); see the sketch after this list.

3- I failed to read it using JDBC.
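For reference, a typical HWC read on HDP 3.x looks roughly like the sketch below. This is a minimal example, not a verified fix: it assumes the HDP-shipped pyspark_llap package is on the Spark path and that spark.sql.hive.hiveserver2.jdbc.url is configured for the cluster; the table name comes from the question.

from pyspark_llap import HiveWarehouseSession

# Build an HWC session on top of the existing SparkSession.
# HWC goes through HiveServer2/LLAP, which is what allows it
# to read Hive managed (ACID) tables on HDP 3.x.
hive = HiveWarehouseSession.session(spark).build()

df = hive.executeQuery('SELECT * FROM nhic_poc.nhic_data_sample_formatted')
df.show(5)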

import os
from pyspark.sql import SparkSession

# Impersonate the hdfs user when talking to HDFS
os.environ["HADOOP_USER_NAME"] = 'hdfs'

spark = SparkSession\
    .builder\
    .appName('NHIC')\
    .config('spark.sql.warehouse.dir', 'hdfs://192.168.1.65:50070/user/hive/warehouse')\
    .config("hive.metastore.uris", "thrift://192.168.1.66:9083")\
    .enableHiveSupport()\
    .getOrCreate()

HiveTableName = 'nhic_poc.nhic_data_sample_formatted'
data = spark.sql('select * from ' + HiveTableName)

The expected result is a DataFrame containing the table's data, but the DataFrame actually returned is empty.

1
Can you first try a simple spark.sql("describe formatted nhic_poc.nhic_data_sample_formatted").show(1000, False) and see if it works? – mazaneicha
It returns the formatted table description. – MMamdouh
Can you try changing the NameNode port in spark.sql.warehouse.dir from 50070 (http) to 8020 (rpc), or removing this config completely (hive.metastore.uris should be sufficient to use an external metastore)? Next, I'd verify spark.sql.hive.metastore.version against your actual Hive metastore. – mazaneicha
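For instance, the port change suggested above would only alter one line of the original builder (a sketch; 8020 is the default NameNode RPC port and should be verified for this cluster):

spark = SparkSession\
    .builder\
    .appName('NHIC')\
    .config('spark.sql.warehouse.dir', 'hdfs://192.168.1.65:8020/user/hive/warehouse')\
    .config("hive.metastore.uris", "thrift://192.168.1.66:9083")\
    .enableHiveSupport()\
    .getOrCreate()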

1 Answer

0
votes

Could you check whether your Spark environment is over-configured?

Try running the code with the environment's default configurations by removing these lines from your code:

os.environ["HADOOP_USER_NAME"] = 'hdfs'

.config('spark.sql.warehouse.dir', 'hdfs://192.168.1.65:50070/user/hive/warehouse')

.config("hive.metastore.uris", "thrift://192.168.1.66:9083")