My Spark Structured Streaming job continuously generates parquet files, which I want to delete once they expire (let's say after 30 days).
I store my parquet data partitioned by event date in RFC3339/ISO8601 format, so that housekeeping can be done fairly easily at the HDFS level with a cron job (delete all parquet folders whose partitionKey < oldestAllowedAge by plain string comparison).
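For context, the write side looks roughly like this. This is only a minimal sketch, not my actual job: the Kafka source, the paths, and the way the partitionKey column is derived are placeholders, the relevant part is the partitioned parquet sink that also produces _spark_metadata.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.functions;
import org.apache.spark.sql.streaming.StreamingQuery;

public class Ingest {
    public static void main(String[] args) throws Exception {
        SparkSession session = SparkSession.builder().appName("ingest").getOrCreate();

        // Example source; only the sink matters for this question.
        Dataset<Row> events = session.readStream()
                .format("kafka")
                .option("kafka.bootstrap.servers", "broker:9092")
                .option("subscribe", "events")
                .load()
                // derive the ISO-8601 date string that serves as the partition key
                .withColumn("partitionKey", functions.date_format(functions.col("timestamp"), "yyyy-MM-dd"));

        // Each micro-batch is appended under partitionKey=<date>/ directories,
        // and Spark additionally maintains <path>/_spark_metadata.
        StreamingQuery query = events.writeStream()
                .format("parquet")
                .option("path", "hdfs:///data/events")                  // placeholder path
                .option("checkpointLocation", "hdfs:///checkpoints/events")
                .partitionBy("partitionKey")
                .start();

        query.awaitTermination();
    }
}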
However, since I introduced Structured Streaming, Spark writes metadata to a folder named _spark_metadata next to the data itself. If I now simply delete the expired HDFS files and run a Spark batch job on the entire dataset, the job fails with file-not-found errors: the batch job reads the metadata and expects the already deleted files to still exist.
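To illustrate, the failing batch job is essentially just the following (again a sketch with placeholder paths): because _spark_metadata exists under the root path, Spark takes the file listing from that log instead of listing HDFS, so parquet files that the cron job already removed are still expected and the scan throws FileNotFoundException.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class BatchAnalysis {
    public static void main(String[] args) {
        SparkSession session = SparkSession.builder().appName("batch-analysis").getOrCreate();

        // Spark resolves the file list from hdfs:///data/events/_spark_metadata,
        // not from the actual HDFS directory listing, so externally deleted
        // files cause a FileNotFoundException during the scan.
        Dataset<Row> all = session.read().parquet("hdfs:///data/events");
        all.groupBy("partitionKey").count().show();
    }
}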
The easy solution is to just disable the creation of the _spark_metadata directory, as described here: disabling _spark_metadata in Structured streaming in spark 2.3.0. But since I don't want to lose read performance for my regular batch analysis, I wonder whether there isn't a better solution.
I then thought I could just use Spark itself for the deletion, so that it removes the parquet HDFS files AND updates the metadata. However, simply executing
session.sql(String.format("DELETE FROM parquet.`%s` WHERE partitionKey < '%s'", path.toString(), oldestAllowedPartitionAge));
doesn't work: DELETE is sadly an unsupported operation in Spark...
Is there any solution that lets me delete the old data while keeping the _spark_metadata folder working?