I'm trying to write a DataFrame into a Hive table (on S3) in Overwrite mode (necessary for my application) and need to decide between two methods of `DataFrameWriter` (Spark / Scala). From what I can read in the documentation, `df.write.saveAsTable` differs from `df.write.insertInto` in the following respects:
- `saveAsTable` uses column-name-based resolution while `insertInto` uses position-based resolution
- In Append mode, `saveAsTable` pays more attention to the underlying schema of the existing table to make certain resolutions
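To make sure I understand that first point, here's a hypothetical example (the database, table, and column names are all made up by me): the existing Hive table is assumed to have schema `(id INT, name STRING)`, and the DataFrame's columns come in the opposite order.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
import spark.implicits._

// DataFrame columns are (name, id) -- the reverse of the table's order
val df = Seq(("alice", 1), ("bob", 2)).toDF("name", "id")

// Name-based resolution: 'name' and 'id' should land in the matching
// table columns despite the different ordering
df.write.mode(SaveMode.Append).saveAsTable("mydb.people")

// Position-based resolution: 'name' values would be lined up against
// the 'id' column and vice versa (an error, or silently mixed-up data?)
df.write.insertInto("mydb.people")
```

If that reading is right, `insertInto` puts the burden of column ordering entirely on the caller.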
Overall, it gives me the impression that `saveAsTable` is just a smarter version of `insertInto`. Alternatively, depending on the use-case, one might prefer `insertInto`.
But does each of these methods come with caveats of its own, like a performance penalty for `saveAsTable` (since it packs in more features)? Are there any other differences in their behaviour apart from what is stated (not very clearly) in the docs?
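For context, this is the write pattern my application needs (database/table names are placeholders), so I'm effectively choosing between these two lines:

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

val df: DataFrame = ??? // the DataFrame produced by my application

// Option 1: saveAsTable in Overwrite mode
df.write.mode(SaveMode.Overwrite).saveAsTable("mydb.mytable")

// Option 2: insertInto in Overwrite mode (assuming the mode applies here)
df.write.mode(SaveMode.Overwrite).insertInto("mydb.mytable")
```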
EDIT-1
The documentation says this regarding `insertInto`:

> Inserts the content of the DataFrame to the specified table

and this for `saveAsTable`:

> In the case the table already exists, behavior of this function depends on the save mode, specified by the mode function
Now I can list my doubts:

- Does `insertInto` always expect the table to exist?
- Do `SaveMode`s have any impact on `insertInto`?
- If the above answer is yes, then
  - what's the difference between `saveAsTable` with `SaveMode.Append` and `insertInto`, given that the table already exists?
  - does `insertInto` with `SaveMode.Overwrite` make any sense?

Expressed as code, the last two doubts look like this (same placeholder names as above).
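```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

val df: DataFrame = ??? // same DataFrame as before

// Doubt: with mydb.mytable already existing, are these two equivalent?
df.write.mode(SaveMode.Append).saveAsTable("mydb.mytable")
df.write.insertInto("mydb.mytable")

// Doubt: is Overwrite meaningful for insertInto at all,
// and if so, what exactly would it overwrite?
df.write.mode(SaveMode.Overwrite).insertInto("mydb.mytable")
```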