I'm trying to write a DataFrame into a Hive table (on S3) in Overwrite mode (necessary for my application) and need to decide between two methods of `DataFrameWriter` (Spark / Scala). From what I can read in the documentation, `df.write.saveAsTable` differs from `df.write.insertInto` in the following respects:
- `saveAsTable` uses column-name-based resolution while `insertInto` uses position-based resolution
- In Append mode, `saveAsTable` pays more attention to the underlying schema of the existing table to make certain resolutions
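To make sure I understand that first point, here's a hypothetical example (the database, table, and column names are all made up by me): the existing Hive table is assumed to have schema `(id INT, name STRING)`, and the DataFrame's columns come in the opposite order.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
import spark.implicits._

// DataFrame columns are (name, id) -- the reverse of the table's order
val df = Seq(("alice", 1), ("bob", 2)).toDF("name", "id")

// Name-based resolution: 'name' and 'id' should land in the matching
// table columns despite the different ordering
df.write.mode(SaveMode.Append).saveAsTable("mydb.people")

// Position-based resolution: 'name' values would be lined up against
// the 'id' column and vice versa (an error, or silently mixed-up data?)
df.write.insertInto("mydb.people")
```

If that reading is right, `insertInto` puts the burden of column ordering entirely on the caller.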
Overall, it gives me the impression that `saveAsTable` is just a smarter version of `insertInto`. Alternatively, depending on the use-case, one might prefer `insertInto`.
But does each of these methods come with caveats of its own, like a performance penalty for `saveAsTable` (since it packs in more features)? Are there any other differences in their behaviour apart from what is stated (not very clearly) in the docs?
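For context, this is the write pattern my application needs (database/table names are placeholders), so I'm effectively choosing between these two lines:

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

val df: DataFrame = ??? // the DataFrame produced by my application

// Option 1: saveAsTable in Overwrite mode
df.write.mode(SaveMode.Overwrite).saveAsTable("mydb.mytable")

// Option 2: insertInto in Overwrite mode (assuming the mode applies here)
df.write.mode(SaveMode.Overwrite).insertInto("mydb.mytable")
```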
EDIT-1
The documentation says this regarding `insertInto`:

> Inserts the content of the DataFrame to the specified table

and this for `saveAsTable`:

> In the case the table already exists, behavior of this function depends on the save mode, specified by the mode function
Now I can list my doubts:

- Does `insertInto` always expect the table to exist?
- Do `SaveMode`s have any impact on `insertInto`?
- If the above answer is yes, then
  - what's the difference between `saveAsTable` with `SaveMode.Append` and `insertInto`, given that the table already exists?
  - does `insertInto` with `SaveMode.Overwrite` make any sense?

Expressed as code, the last two doubts look like this (same placeholder names as above).
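```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

val df: DataFrame = ??? // same DataFrame as before

// Doubt: with mydb.mytable already existing, are these two equivalent?
df.write.mode(SaveMode.Append).saveAsTable("mydb.mytable")
df.write.insertInto("mydb.mytable")

// Doubt: is Overwrite meaningful for insertInto at all,
// and if so, what exactly would it overwrite?
df.write.mode(SaveMode.Overwrite).insertInto("mydb.mytable")
```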