
I have a huge table of roughly 20 billion records, and the source I read as input is the target Parquet file itself.

Every day I receive an incoming delta file that should update existing records in the target folder and append new data.

Using Spark SQL DataFrames, is there a way to read and update particular partitions of the Parquet file?


1 Answer


I find the question a little unclear: the title talks about overwriting, but the body of the text talks about appending. I guess it depends on interpretation, anyway.

Also, I am not sure whether you mean a Hive table or just a file, but this works fine as an example:

df.write.format("parquet").mode("append").save("/user/mapr/123/SO.parquet")

In this case you can append data any number of times to a directory (not a Hive-registered table), and the DataFrame writer does it all.

If you want to overwrite, then this suffices as well, but you need to include the original data in the DataFrame if you do not want to lose it, since overwrite replaces everything at that path:

df.write.format("parquet").mode("overwrite").save("/user/mapr/123/SO.parquet")

It may well be that what you want, i.e. updating existing records and appending new ones in a single step, is not possible with the DataFrame writer alone. In that case you would need your own difference analyzer and a few lines of code.
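
For illustration, a rough sketch of such a hand-rolled upsert in PySpark, assuming a single key column named "id" and hypothetical paths (adjust to your own schema):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

target = spark.read.parquet("/user/mapr/123/SO.parquet")
delta = spark.read.parquet("/user/mapr/123/delta.parquet")

# drop target rows whose key appears in the delta, then add the delta rows;
# this covers both updated records and brand-new ones
merged = target.join(delta.select("id"), on="id", how="left_anti").unionByName(delta)

# write to a staging location and swap directories afterwards, since
# overwriting the directory being read in the same job is not safe
merged.write.format("parquet").mode("overwrite").save("/user/mapr/123/SO_merged.parquet")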