
I have data frames with timestamp columns. I convert the timestamp to a date, partition by that date, and append the result to a growing parquet dataset every day.

If I append a dataset with timestamps from, say, 2021-04-19 01:00:01 to 2021-04-19 13:00:00, it is written to the partition DATE=2021-04-19.

If later in the day I append another dataset with timestamps from 2021-04-19 15:00:00 to 2021-04-19 20:00:00, will it overwrite the previous partition, which had data from 1 am to 1 pm, or will it actually append to it?

I use the syntax:

df.write.mode('append').partitionBy("DATE").parquet("s3://path")
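
For concreteness, here is a minimal sketch of the daily job (the column names and the local output path are placeholders; my real job writes to S3):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# "ts" and "sensor1" are hypothetical column names for illustration
df = spark.createDataFrame(
    [("2021-04-19 01:00:01", 1.1), ("2021-04-19 13:00:00", 1.2)],
    ["ts", "sensor1"],
)

# Derive the partition column from the timestamp, then append
df = df.withColumn("DATE", F.to_date("ts"))
df.write.mode("append").partitionBy("DATE").parquet("/tmp/growing_parquet")  # stand-in for s3://path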
Why don't you try it and tell us what happens? - mck

2 Answers


From Save Modes in the Spark documentation:

Append: When saving a DataFrame to a data source, if data/table already exists, contents of the DataFrame are expected to be appended to existing data.

Thus, it does what you were expecting. Here is a toy example to check the behaviour:

# Two batches whose timestamps all fall on the same DATE partition
data_batch_1 = [("2021-04-19", "2021-04-19 01:00:01", 1.1), 
                ("2021-04-19", "2021-04-19 13:00:00", 1.2)]

data_batch_2 = [("2021-04-19", "2021-04-19 15:00:00", 2.1), 
                ("2021-04-19", "2021-04-19 20:00:00", 2.2)]

col_names = ["DATE", "ts", "sensor1"]

df_batch_1 = spark.createDataFrame(data_batch_1, col_names)
df_batch_2 = spark.createDataFrame(data_batch_2, col_names)

# Local path standing in for the S3 location
s3_path = "/tmp/67163237/"

Save Batch 1

df_batch_1.write.mode("append").partitionBy("DATE").parquet(s3_path)
spark.read.parquet(s3_path).show()
+-------------------+-------+----------+
|                 ts|sensor1|      DATE|
+-------------------+-------+----------+
|2021-04-19 01:00:01|    1.1|2021-04-19|
|2021-04-19 13:00:00|    1.2|2021-04-19|
+-------------------+-------+----------+

Save Batch 2

df_batch_2.write.mode("append").partitionBy("DATE").parquet(s3_path)
spark.read.parquet(s3_path).show()
+-------------------+-------+----------+
|                 ts|sensor1|      DATE|
+-------------------+-------+----------+
|2021-04-19 15:00:00|    2.1|2021-04-19|
|2021-04-19 01:00:01|    1.1|2021-04-19|
|2021-04-19 20:00:00|    2.2|2021-04-19|
|2021-04-19 13:00:00|    1.2|2021-04-19|
+-------------------+-------+----------+
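
To double-check that the second write added new files rather than replacing the partition, you can list the partition directory (this assumes the toy example's local path; on S3 you would inspect the prefix instead):

import os

# After both appends, part files from each batch coexist in the partition
print(sorted(os.listdir(s3_path + "DATE=2021-04-19")))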

Following mck's excellent suggestion (you won't know until you try it), I did, and as feared, it basically overwrote the entire partition with the new data.

I thought about it and decided to always re-stream the previous day's data and overwrite that partition. This works in my case because I have access to 5 days of buffer data that I can re-pull. But this solution won't work for those whose transient data stays around for only a few hours or so.
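
A minimal sketch of that daily re-stream, using Spark's dynamic partition overwrite mode (Spark 2.3+; the buffer path and column name are assumptions for illustration):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Spark 2.3+: overwrite only the partitions present in the incoming data,
# instead of truncating the whole table
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

# Hypothetical re-pull of the full previous day from the 5-day buffer
df_day = spark.read.parquet("/tmp/buffer/2021-04-19")  # stand-in path
df_day = df_day.withColumn("DATE", F.to_date("ts"))    # "ts" is illustrative

# Replaces only DATE=2021-04-19; other partitions are untouched
df_day.write.mode("overwrite").partitionBy("DATE").parquet("s3://path")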