
I have tried to access a Hive ORC transactional table (which has underlying delta files on HDFS) using PySpark, but I'm not able to read the transactional table through sparkContext/hiveContext.

/mydim/delta_0117202_0117202

/mydim/delta_0117203_0117203


1 Answer


Officially, Spark does not yet support Hive ACID tables. As a workaround, take a full or incremental dump of the ACID table into a regular Hive ORC/Parquet partitioned table, then read that data with Spark.
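The dump workaround above can be sketched as follows. The table names (`mydb.mydim`, `mydb.mydim_snapshot`) are hypothetical placeholders, and the example assumes a cluster with a Hive metastore reachable from Spark:

```python
# Step 1 (run in Hive/beeline, NOT in Spark): materialize the ACID table
# into a plain ORC table. CTAS resolves the base/delta files into regular
# ORC files that Spark can read. Hypothetical table names:
#
#   CREATE TABLE mydb.mydim_snapshot STORED AS ORC AS
#   SELECT * FROM mydb.mydim;

from pyspark.sql import SparkSession

# Step 2: read the snapshot from Spark. enableHiveSupport() wires the
# session to the Hive metastore.
spark = (SparkSession.builder
         .appName("read-acid-snapshot")
         .enableHiveSupport()
         .getOrCreate())

# The snapshot is a regular (non-ACID) table, so this read succeeds.
df = spark.table("mydb.mydim_snapshot")
df.show()
```

Note the copy is a point-in-time snapshot: rows inserted or updated in the ACID table after the dump will not appear until the dump is refreshed.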

There is an open JIRA, SPARK-15348, to add support for reading Hive ACID tables.

  • If you run a major compaction on the ACID table (from Hive), Spark is able to read the base_XXX directories, but not the delta directories; SPARK-16996 addresses this.

  • There are some workarounds to read ACID tables using SPARK-LLAP, as mentioned in this link.

  • I think that starting from HDP 3.x, the Hive Warehouse Connector (HWC) is able to read Hive ACID tables.
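For the HWC route, a minimal sketch looks like the following. It assumes an HDP 3.x cluster with the HWC jar on the Spark classpath and `spark.sql.hive.hiveserver2.jdbc.url` configured; the table name `mydb.mydim` is a placeholder:

```python
from pyspark.sql import SparkSession
# pyspark_llap ships with the Hive Warehouse Connector on HDP 3.x.
from pyspark_llap import HiveWarehouseSession

spark = SparkSession.builder.appName("hwc-read").getOrCreate()

# Build an HWC session bound to the Spark session.
hive = HiveWarehouseSession.session(spark).build()

# executeQuery goes through HiveServer2/LLAP, which understands the ACID
# base/delta layout, so transactional tables are readable this way.
df = hive.executeQuery("SELECT * FROM mydb.mydim")
df.show()
```

Unlike the dump workaround, this reads the live ACID table, but it ties you to the HWC/LLAP stack rather than plain Spark SQL.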