I'm implementing a Dataflow pipeline that reads messages from Pubsub and writes TableRows into BigQuery (BQ) using Apache Beam SDK 2.0.0 for Java.
This is the related portion of code:
tableRowPCollection
.apply(BigQueryIO.writeTableRows().to(this.tableId)
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
This code generates a group of tasks under the hood in the Dataflow pipeline. One of these tasks is the GroupByKey. This task is accumulating elements in the pipeline as can be seen in this print screen: GBK elements accumulation image. After reading the docs I suspect that this issue relates to the Window config. But I could not find a way of modifying the Window configuration since it is implicitly created by Window.Assign transform inside the Reshuffle task.
Is there a way of setting the window parameters and/or attaching triggers to this implicit Window or should I create my own DoFn that inserts a TableRow in BQ?
Thanks in advance!
[Update]
I left the pipeline running for a day approximately and after that the GroupByKey
subtask became faster and the number of elements coming in and coming out approximated to each other (sometimes were the same). Furthermore, I also noticed that the Watermark
got closer to the current date and was increasing faster. So the "issue" was solved.