Parquet File Output Sink - Spark Structured Streaming

Question

Wondering what (and how to modify) triggers a Spark Sturctured Streaming Query (with Parquet File output sink configured) to write data to the parquet files. I periodically feed the Stream input data (using StreamReader to read in files), but it does not write output to Parquet file for each file provided as input. Once I have given it a few files, it tends to write a Parquet file just fine.

I am wondering how to control this. I would like to be able force a new write to Parquet file for every new file provided as input. Any tips appreciated!

Note: I have maxFilesPerTrigger set to 1 on the Read Stream call. I am also seeing the Streaming Query process the single input file, however a single file on input does not appear to result in the Streaming Query writing the output to the Parquet file

mekmek mekmek · Accepted Answer · 2019-03-29T20:21:24

After further analysis, and working with the ForEach output sink using the default Append Mode, I believe the issue I was running into was the combination of the Append mode along with the Watermarking feature.

After re-reading https://spark.apache.org/docs/2.2.1/structured-streaming-programming-guide.html#starting-streaming-queries It appears that when the Append mode is used with a watermark set, the Spark structured steaming will not write out aggregation results to the Result table until the watermark time limit has past. Append mode does not allow updates to records, so it must wait for the watermark to past, to ensure no change the row...

I believe - the Parquet File sink does not allow for the Update mode, howver after switching to the ForEach output Sink, and using the Update mode, I observed data coming out the sink as I expected. Essentially for each record in, at least one record out, with no delay (as was observed before).

Hopefully this is helpful to others.

Parquet File Output Sink - Spark Structured Streaming

1 Answers