
I have data frames with timestamp columns. I convert the timestamp to a date, partition by that date, and append the result to a growing parquet dataset every day.

If I append a dataset with timestamps from, say, 2021-04-19 01:00:01 to 2021-04-19 13:00:00, it is written to the partition DATE=2021-04-19.

If later in the day I append another dataset with timestamps from 2021-04-19 15:00:00 to 2021-04-19 20:00:00, will it overwrite the previous partition, which had data from 1 am to 1 pm, or will it actually append to it?

I use the syntax:

df.write.mode('append').partitionBy("DATE").parquet("s3://path")
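
For concreteness, here is a minimal sketch of the daily job (the column names and the local output path are placeholders; my real job writes to S3):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# "ts" and "sensor1" are hypothetical column names for illustration
df = spark.createDataFrame(
    [("2021-04-19 01:00:01", 1.1), ("2021-04-19 13:00:00", 1.2)],
    ["ts", "sensor1"],
)

# Derive the partition column from the timestamp, then append
df = df.withColumn("DATE", F.to_date("ts"))
df.write.mode("append").partitionBy("DATE").parquet("/tmp/growing_parquet")  # stand-in for s3://path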
Why don't you try it and tell us what happens? - mck

2 Answers


From Save Modes in the Spark documentation:

Append: When saving a DataFrame to a data source, if data/table already exists, contents of the DataFrame are expected to be appended to existing data.

Thus, it does what you were expecting. Here is a toy example to check the behaviour:

# Two batches whose timestamps all fall on the same DATE partition
data_batch_1 = [("2021-04-19", "2021-04-19 01:00:01", 1.1), 
                ("2021-04-19", "2021-04-19 13:00:00", 1.2)]

data_batch_2 = [("2021-04-19", "2021-04-19 15:00:00", 2.1), 
                ("2021-04-19", "2021-04-19 20:00:00", 2.2)]

col_names = ["DATE", "ts", "sensor1"]

df_batch_1 = spark.createDataFrame(data_batch_1, col_names)
df_batch_2 = spark.createDataFrame(data_batch_2, col_names)

# Local path standing in for the S3 location
s3_path = "/tmp/67163237/"

Save Batch 1

df_batch_1.write.mode("append").partitionBy("DATE").parquet(s3_path)
spark.read.parquet(s3_path).show()
+-------------------+-------+----------+
|                 ts|sensor1|      DATE|
+-------------------+-------+----------+
|2021-04-19 01:00:01|    1.1|2021-04-19|
|2021-04-19 13:00:00|    1.2|2021-04-19|
+-------------------+-------+----------+

Save Batch 2

df_batch_2.write.mode("append").partitionBy("DATE").parquet(s3_path)
spark.read.parquet(s3_path).show()
+-------------------+-------+----------+
|                 ts|sensor1|      DATE|
+-------------------+-------+----------+
|2021-04-19 15:00:00|    2.1|2021-04-19|
|2021-04-19 01:00:01|    1.1|2021-04-19|
|2021-04-19 20:00:00|    2.2|2021-04-19|
|2021-04-19 13:00:00|    1.2|2021-04-19|
+-------------------+-------+----------+
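
To double-check that the second write added new files rather than replacing the partition, you can list the partition directory (this assumes the toy example's local path; on S3 you would inspect the prefix instead):

import os

# After both appends, part files from each batch coexist in the partition
print(sorted(os.listdir(s3_path + "DATE=2021-04-19")))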

Following mck's excellent suggestion (you won't know until you try it), I did, and as feared, it basically overwrote the entire partition with the new data.

I thought about it and decided to always re-stream the previous day's data and overwrite that partition. This works in my case because I have access to 5 days of buffer data that I can re-pull. But this solution won't work for those whose transient data stays around for only a few hours or so.
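
A minimal sketch of that daily re-stream, using Spark's dynamic partition overwrite mode (Spark 2.3+; the buffer path and column name are assumptions for illustration):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Spark 2.3+: overwrite only the partitions present in the incoming data,
# instead of truncating the whole table
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

# Hypothetical re-pull of the full previous day from the 5-day buffer
df_day = spark.read.parquet("/tmp/buffer/2021-04-19")  # stand-in path
df_day = df_day.withColumn("DATE", F.to_date("ts"))    # "ts" is illustrative

# Replaces only DATE=2021-04-19; other partitions are untouched
df_day.write.mode("overwrite").partitionBy("DATE").parquet("s3://path")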