
I have tried to access a Hive ORC transactional table (which has underlying delta files on HDFS) using PySpark, but I'm not able to read the transactional table through sparkContext/hiveContext.

/mydim/delta_0117202_0117202

/mydim/delta_0117203_0117203


1 Answer


Officially, Spark does not yet support Hive ACID tables. As a workaround, take a full or incremental dump of the ACID table into a regular Hive ORC/Parquet partitioned table, then read that data with Spark.
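The dump workaround above can be sketched as follows. The table names (`mydb.mydim`, `mydb.mydim_snapshot`) are hypothetical placeholders, and the example assumes a cluster with a Hive metastore reachable from Spark:

```python
# Step 1 (run in Hive/beeline, NOT in Spark): materialize the ACID table
# into a plain ORC table. CTAS resolves the base/delta files into regular
# ORC files that Spark can read. Hypothetical table names:
#
#   CREATE TABLE mydb.mydim_snapshot STORED AS ORC AS
#   SELECT * FROM mydb.mydim;

from pyspark.sql import SparkSession

# Step 2: read the snapshot from Spark. enableHiveSupport() wires the
# session to the Hive metastore.
spark = (SparkSession.builder
         .appName("read-acid-snapshot")
         .enableHiveSupport()
         .getOrCreate())

# The snapshot is a regular (non-ACID) table, so this read succeeds.
df = spark.table("mydb.mydim_snapshot")
df.show()
```

Note the copy is a point-in-time snapshot: rows inserted or updated in the ACID table after the dump will not appear until the dump is refreshed.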

There is an open JIRA, SPARK-15348, to add support for reading Hive ACID tables.

  • If you run a major compaction on the ACID table (from Hive), Spark is able to read the base_XXX directories, but not the delta directories; SPARK-16996 addresses this.

  • There are some workarounds to read ACID tables using SPARK-LLAP, as mentioned in this link.

  • I think that starting from HDP 3.x, the Hive Warehouse Connector (HWC) is able to read Hive ACID tables.
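For the HWC route, a minimal sketch looks like the following. It assumes an HDP 3.x cluster with the HWC jar on the Spark classpath and `spark.sql.hive.hiveserver2.jdbc.url` configured; the table name `mydb.mydim` is a placeholder:

```python
from pyspark.sql import SparkSession
# pyspark_llap ships with the Hive Warehouse Connector on HDP 3.x.
from pyspark_llap import HiveWarehouseSession

spark = SparkSession.builder.appName("hwc-read").getOrCreate()

# Build an HWC session bound to the Spark session.
hive = HiveWarehouseSession.session(spark).build()

# executeQuery goes through HiveServer2/LLAP, which understands the ACID
# base/delta layout, so transactional tables are readable this way.
df = hive.executeQuery("SELECT * FROM mydb.mydim")
df.show()
```

Unlike the dump workaround, this reads the live ACID table, but it ties you to the HWC/LLAP stack rather than plain Spark SQL.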