My Spark Structured Streaming job continuously generates parquet files, which I want to delete once they expire (let's say after 30 days).
I store my parquet data partitioned by event date in RFC3339/ISO8601 format, so that housekeeping can be done fairly easily at the HDFS level with a cron job (delete all parquet folders whose partitionKey < oldestAllowedAge by plain string comparison).
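For context, the write side looks roughly like this. This is only a minimal sketch, not my actual job: the Kafka source, the paths, and the way the partitionKey column is derived are placeholders, the relevant part is the partitioned parquet sink that also produces _spark_metadata.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.functions;
import org.apache.spark.sql.streaming.StreamingQuery;

public class Ingest {
    public static void main(String[] args) throws Exception {
        SparkSession session = SparkSession.builder().appName("ingest").getOrCreate();

        // Example source; only the sink matters for this question.
        Dataset<Row> events = session.readStream()
                .format("kafka")
                .option("kafka.bootstrap.servers", "broker:9092")
                .option("subscribe", "events")
                .load()
                // derive the ISO-8601 date string that serves as the partition key
                .withColumn("partitionKey", functions.date_format(functions.col("timestamp"), "yyyy-MM-dd"));

        // Each micro-batch is appended under partitionKey=<date>/ directories,
        // and Spark additionally maintains <path>/_spark_metadata.
        StreamingQuery query = events.writeStream()
                .format("parquet")
                .option("path", "hdfs:///data/events")                  // placeholder path
                .option("checkpointLocation", "hdfs:///checkpoints/events")
                .partitionBy("partitionKey")
                .start();

        query.awaitTermination();
    }
}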
However, since I introduced Structured Streaming, Spark writes metadata to a folder named _spark_metadata next to the data itself. If I now simply delete the expired HDFS files and run a Spark batch job on the entire dataset, the job fails with file-not-found errors: the batch job reads the metadata and expects the already deleted files to still exist.
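To illustrate, the failing batch job is essentially just the following (again a sketch with placeholder paths): because _spark_metadata exists under the root path, Spark takes the file listing from that log instead of listing HDFS, so parquet files that the cron job already removed are still expected and the scan throws FileNotFoundException.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class BatchAnalysis {
    public static void main(String[] args) {
        SparkSession session = SparkSession.builder().appName("batch-analysis").getOrCreate();

        // Spark resolves the file list from hdfs:///data/events/_spark_metadata,
        // not from the actual HDFS directory listing, so externally deleted
        // files cause a FileNotFoundException during the scan.
        Dataset<Row> all = session.read().parquet("hdfs:///data/events");
        all.groupBy("partitionKey").count().show();
    }
}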
The easy solution is to just disable the creation of the _spark_metadata directory, as described here: disabling _spark_metadata in Structured streaming in spark 2.3.0. But since I don't want to lose read performance for my regular batch analysis, I wonder whether there isn't a better solution.
I then thought I could just use Spark itself for the deletion, so that it removes the parquet HDFS files AND updates the metadata. However, simply executing
session.sql(String.format("DELETE FROM parquet.`%s` WHERE partitionKey < '%s'", path.toString(), oldestAllowedPartitionAge));
doesn't work: DELETE is sadly an unsupported operation in Spark...
Is there any solution that lets me delete the old data while keeping the _spark_metadata folder working?