I am using Apache Beam 2.13.0 with GCP Dataflow runner.
I have a problem with streaming ingest to BigQuery from a batch pipeline:
PCollection<BigQueryInsertError> stageOneErrors =
destinationTableSelected
.apply("Write BQ Attempt 1",
BigQueryIO.<KV<TableDestination, TableRow>>write()
.withMethod(STREAMING_INSERTS)
.to(new KVTableDestination())
.withFormatFunction(new KVTableRow())
.withExtendedErrorInfo()
.withFailedInsertRetryPolicy(InsertRetryPolicy.neverRetry())
.withCreateDisposition(CreateDisposition.CREATE_NEVER)
.withWriteDisposition(WriteDisposition.WRITE_APPEND))
.getFailedInsertsWithErr();
The error:
Shutting down JVM after 8 consecutive periods of measured GC thrashing.
Memory is used/total/max = 15914/18766/18766 MB,
GC last/max = 99.17/99.17 %, #pushbacks=0, gc thrashing=true.
Heap dump not written.
Same code working in the streaming mode correctly (if the with explicit method setting omitted).
The code works on reasonably small datasets (less than 2 million records). Fails on 2,5 million plus.
On the surface it appears to be a similar problem to the one described here: Shutting down JVM after 8 consecutive periods of measured GC thrashing
Creating a separate question to add additional details.
Is there anything I could do to fix this? Looks like the issue is within the BigQueryIO component itself - GroupBy key fails.