We are trying to write into a Hive table from Spark using the saveAsTable function. I want to know whether saveAsTable drops and recreates the Hive table every time. If it does, is there any other Spark function that will just truncate and load the table, instead of dropping and recreating it?
2 Answers
It depends on which .mode value you specify:
overwrite --> Spark drops the table first, then recreates it
append --> inserts the new data into the table
1. Drop if exists / create if not exists the default.spark1 table in Parquet format:
>>> df.write.mode("overwrite").saveAsTable("default.spark1")
2. Drop if exists / create if not exists the default.spark1 table in ORC format:
>>> df.write.format("orc").mode("overwrite").saveAsTable("default.spark1")
3. Append the new data to the existing data in the table (doesn't drop/recreate the table):
>>> df.write.format("orc").mode("append").saveAsTable("default.spark1")
Achieve Truncate and Load using Spark:
Method1:-
You can register your DataFrame as a temp table and then run an INSERT OVERWRITE statement against the target table:
>>> df.registerTempTable("temp") -- register df as a temp table
>>> spark.sql("insert overwrite table default.spark1 select * from temp") -- overwrite the target table
This method works for both internal and external tables.
Method2:-
For internal tables, we can truncate the table first and then append the data. This way we are not recreating the table, just appending new data to it.
>>> spark.sql("truncate table default.spark1")
>>> df.write.format("orc").mode("append").saveAsTable("default.spark1")
This method only works for internal tables.
Even for external tables there is a workaround: temporarily change the table properties so the table can be truncated.
Let's assume default.spark1 is an external table:
--change external table to internal table
>>> spark.sql("alter table default.spark1 set tblproperties('EXTERNAL'='FALSE')")
--once the table is internal then we can run truncate table statement
>>> spark.sql("truncate table default.spark1")
--change back the table as External table again
>>> spark.sql("alter table default.spark1 set tblproperties('EXTERNAL'='TRUE')")
--then append data to the table
>>> df.write.format("orc").mode("append").saveAsTable("default.spark1")
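The four steps above can be wrapped in a small helper that just sequences the same statements. This is only a sketch; truncate_external_table_statements is a made-up name, and in a real session you would still run each statement via spark.sql:

```python
def truncate_external_table_statements(table: str) -> list:
    """Return the SQL statements that temporarily flip an external table
    to internal, truncate it, and flip it back to external."""
    return [
        "alter table %s set tblproperties('EXTERNAL'='FALSE')" % table,
        "truncate table %s" % table,
        "alter table %s set tblproperties('EXTERNAL'='TRUE')" % table,
    ]

# In a real session:
# for stmt in truncate_external_table_statements("default.spark1"):
#     spark.sql(stmt)
# df.write.format("orc").mode("append").saveAsTable("default.spark1")
```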
You can also use insertInto("table"), which doesn't recreate the table.
The main difference from saveAsTable is that insertInto expects the table to already exist, and it matches columns by position instead of by name.