I am trying to write a 79 GB file into a single table in Google Cloud Spanner. The table has one string key column, 13 other string columns, and 4 string-array columns. I am using this example from the Google Cloud Dataflow Spanner examples, modified for the table I created. On the 79 GB file that threw:

io.grpc.StatusRuntimeException: INVALID_ARGUMENT: The transaction contains too many mutations.

(It worked fine with a small example file.) Nothing had been written to the database by the Dataflow job.
To fix that, I followed the advice suggested here and set the write batch size to 1 KB with SpannerIO.write().withBatchSizeBytes(1024).
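For reference, the write step of my pipeline looks roughly like the sketch below. The project/instance/database IDs, bucket path, table and column names, and the line-parsing logic are simplified placeholders rather than my exact code; the relevant part is the SpannerIO.write() configuration at the end.

```java
import com.google.cloud.spanner.Mutation;
import java.util.Arrays;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.gcp.spanner.SpannerIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.TypeDescriptor;

public class SpannerBulkLoad {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply("ReadLines", TextIO.read().from("gs://my-bucket/my-79gb-file.txt"))
     .apply("ToMutations", MapElements
         .into(TypeDescriptor.of(Mutation.class))
         .via((String line) -> {
           // Placeholder parsing: split each line and map fields to columns.
           String[] f = line.split("\t");
           return Mutation.newInsertOrUpdateBuilder("my_table")
               .set("key_col").to(f[0])
               .set("string_col_1").to(f[1])
               // ... 12 more string columns ...
               .set("array_col_1").toStringArray(Arrays.asList(f[14].split(",")))
               // ... 3 more ARRAY<STRING> columns ...
               .build();
         }))
     .apply("WriteToSpanner", SpannerIO.write()
         .withProjectId("my-project")
         .withInstanceId("my-instance")
         .withDatabaseId("my-database")
         .withBatchSizeBytes(1024)); // the 1 KB batch size I set to avoid the exception

    p.run().waitUntilFinish();
  }
}
```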
That caused the Dataflow job to autoscale to more than 300 workers, and the job graph suggested it would take over 18 hours to complete. Is this expected behavior for a dataset of this size? Is the batch size the issue? If so, what batch size is recommended to avoid the "too many mutations" exception while keeping load times low?