
Just a quick question. I'm trying to run a Spark 1.6.0 program that loads a Hive table concurrently. Is an insert statement via hiveContext.sql("insert . . .") the way to go? I want to ensure table locking during the write, because from what I've seen in the Spark documentation, table locking and atomicity are not guaranteed when saving a DataFrame:
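
To be concrete, what I have in mind is something like the following, where my_table and staging_table are just placeholder names:

    // Hypothetical sketch of the insert-via-SQL approach in question;
    // both table names are placeholders.
    hiveContext.sql("INSERT INTO TABLE my_table SELECT * FROM staging_table")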

"Save operations can optionally take a SaveMode, that specifies how to handle existing data if present. It is important to realize that these save modes do not utilize any locking and are not atomic. Additionally, when performing a Overwrite, the data will be deleted before writing out the new data."

How can I ensure atomicity or locking of a Hive table in Spark whenever accessing or inserting data into that table?

Any suggestions would be greatly appreciated. Thank you very much.


1 Answer


The solution depends on what you need atomic writing for.

One of the simplest options is to use a partitioned external table:

  1. In the Spark job, write the DataFrame not to the table but to a new HDFS directory.
  2. Once the write is complete, add a new partition to the table pointing to that directory (see the sketch after this list).
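
A minimal sketch of those two steps, assuming Spark 1.6 with a HiveContext and a hypothetical external table db.events partitioned by a dt string column; the paths, table name, and column names are placeholders for your own setup:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    val sc = new SparkContext(new SparkConf().setAppName("partition-load"))
    val hiveContext = new HiveContext(sc)

    // Some DataFrame to be loaded; the source here is just an example.
    val df = hiveContext.read.json("/data/incoming/events.json")

    // Step 1: write to a fresh HDFS directory, not to the table itself.
    val stagingDir = "/warehouse/external/events/dt=2016-05-01"
    df.write.parquet(stagingDir)

    // Step 2: once the write has fully succeeded, register the directory
    // as a new partition. Adding a partition is a quick metastore-only
    // operation, so readers see either the old table state or the complete
    // new partition, never a half-written one.
    hiveContext.sql(
      s"""ALTER TABLE db.events ADD IF NOT EXISTS
         |PARTITION (dt='2016-05-01') LOCATION '$stagingDir'""".stripMargin)

If the job fails midway, the staging directory can simply be deleted and the job rerun; the table never references it until the ALTER TABLE statement runs.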