
I'm reading appends to a Delta table in Azure storage, and something strange is happening. The cluster is not under any real load, but the offset checkpoint advances very slowly. Looking at the individual offsets being written, the progress per micro-batch is minuscule. For example, a streaming checkpoint that is 200 versions behind the end of the commit log writes an offset that catches it up by only 1-3 versions, not all 200 (or 200 minus however many versions were written during the interval). Yes, the job is running very far behind; that's how I noticed.

For reference, the job that appends to the Delta table produces a new version roughly every three minutes. The job that reads from the table runs on a one-hour interval. Yet the offsets produced by the reader advance only 1 to 3 versions per batch, not the ~20 versions that a one-hour interval should cover.
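
To make the comparison concrete, here is a rough sketch (PySpark, in a Databricks notebook where spark and dbutils are available) of how I'm measuring the gap. The checkpoint and table paths are placeholders, and treating the last line of the offset file as the Delta source's offset JSON is my reading of the checkpoint layout, not something from the docs:

    import json
    from delta.tables import DeltaTable

    checkpoint_dir = "/mnt/checkpoints/my_stream"  # placeholder
    table_path = "/mnt/delta/my_table"             # placeholder

    # Latest committed version of the source Delta table.
    latest_version = (
        DeltaTable.forPath(spark, table_path)
        .history(1)
        .select("version")
        .collect()[0][0]
    )

    # Most recent offset file written by the streaming query's checkpoint.
    offset_files = [f for f in dbutils.fs.ls(checkpoint_dir + "/offsets")
                    if f.name.isdigit()]
    newest = max(offset_files, key=lambda f: int(f.name))
    raw = dbutils.fs.head(newest.path)

    # The final line of the offset file holds the Delta source's offset JSON,
    # which shows how far the stream has actually read.
    print("table version:", latest_version)
    print("stream offset:", json.loads(raw.strip().splitlines()[-1]))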

What's going on here? Is there a way to see how the micro-batch size is being decided?


1 Answer


After digging around for a while, I came across this documentation: https://docs.databricks.com/delta/delta-streaming.html

Of note:

You can ... [c]ontrol the maximum size of any micro-batch that Delta Lake gives to streaming by setting the maxFilesPerTrigger option. This specifies the maximum number of new files to be considered in every trigger. The default is 1000. (emphasis mine)

With the default cap of 1,000 files per trigger, a stream consumes only a few versions per micro-batch when each version contains hundreds of new files, regardless of how far behind it is.
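
If the goal is to let each micro-batch cover more of the backlog, that option can be set directly on the streaming read. A minimal sketch in PySpark; the table path and the value 10000 are placeholders, and only the option name and its default of 1000 come from the linked documentation:

    # Raise maxFilesPerTrigger so a single micro-batch can span more versions.
    # The default of 1000 files per trigger is what caps progress to a few
    # versions when each commit adds many files.
    stream = (
        spark.readStream
        .format("delta")
        .option("maxFilesPerTrigger", 10000)   # default: 1000
        .load("/mnt/delta/my_table")           # placeholder path
    )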