2 votes

I am trying to write a 79G file into a single table in Google Spanner. The table has 1 string key column, 13 string columns, and 4 string array columns. I am using this example from the Google Cloud Dataflow Spanner examples, modified for the table I created. That threw an io.grpc.StatusRuntimeException: INVALID_ARGUMENT: The transaction contains too many mutations. on the 79G file (it worked fine with a small example file). Nothing had been written to the database by the Dataflow job.

To fix that, I followed the advice suggested here and set the batch size of the write to 1KB with SpannerIO.write().withBatchSizeBytes(1024).
That caused the Dataflow job to autoscale to >300 workers, and the graph suggested it would take >18 hours. Is this expected behavior for a dataset of this size? Is the issue the batch size? If so, what size is recommended to avoid the too-many-mutations exception while keeping load times low?
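For reference, a minimal sketch of how such a write step is typically configured (it assumes a PCollection<Mutation> named mutations has already been built upstream; the project, instance, and database IDs are placeholders):

import org.apache.beam.sdk.io.gcp.spanner.SpannerIO;

// Write the mutations to Spanner in batches of about 1 KB, matching the
// withBatchSizeBytes(1024) setting described above.
mutations.apply("WriteToSpanner",
    SpannerIO.write()
        .withProjectId("my-project")     // placeholder
        .withInstanceId("my-instance")   // placeholder
        .withDatabaseId("my-database")   // placeholder
        .withBatchSizeBytes(1024));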

This will be fixed with github.com/apache/beam/pull/4860. The current limit is 20k mutated column values (including indexes); if you know the size of a record, you can estimate an appropriate batch size. I'd also recommend capping the maximum number of workers using the --maxNumWorkers flag. See cloud.google.com/dataflow/pipelines/specifying-exec-params – Mairbek Khadikov
Unfortunately the record size varies a lot because of array lengths and sparse data, but I will try setting the maxNumWorkers flag. Thanks! – Kavitha Srinivas
One thing you can do, if you have a lot of null values, is to skip null fields in the mutation builder so they are not counted as mutations. I think you can safely try batch sizes of 10-100K in this case. – Mairbek Khadikov
Skipping null fields in the mutation helped a lot; this time the load went through without any exceptions (and with no batch size limits). Thanks for your help. – Kavitha Srinivas
@MairbekKhadikov post your comments as an answer so the post owner can accept it, and delete the community wiki answer. – VictorGGl

1 Answer

1 vote

This will be fixed with the pull request at github.com/apache/beam/pull/4860. The current limit is 20k mutated column values (including indexes); if you know the size of a record, you can estimate an appropriate batch size. I'd also recommend capping the maximum number of workers using the --maxNumWorkers flag.
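If it helps, the worker cap can also be set programmatically rather than on the command line; here is a minimal sketch assuming the Dataflow runner (the value 20 is only illustrative):

import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

// Parse the usual pipeline arguments, then cap autoscaling; this has the
// same effect as passing --maxNumWorkers=20 on the command line.
DataflowPipelineOptions options =
    PipelineOptionsFactory.fromArgs(args).withValidation().as(DataflowPipelineOptions.class);
options.setMaxNumWorkers(20);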

One thing you can do, if you have a lot of null values, is to skip null fields in the mutation builder so they are not counted as mutations. I think you can safely try batch sizes of 10-100K in this case.
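As a rough sketch of that pattern in the DoFn that builds the mutations (the Record type, table name, and column names below are hypothetical stand-ins for whatever the parse step produces; only the skip-nulls logic reflects the advice above):

import java.io.Serializable;
import java.util.List;
import com.google.cloud.spanner.Mutation;
import org.apache.beam.sdk.transforms.DoFn;

// Hypothetical record type standing in for the parsed input row.
class Record implements Serializable {
  String id;          // key column, always present
  String name;        // one of the string columns
  List<String> tags;  // one of the string array columns
}

class RecordToMutationFn extends DoFn<Record, Mutation> {
  @ProcessElement
  public void processElement(ProcessContext c) {
    Record r = c.element();
    Mutation.WriteBuilder b =
        Mutation.newInsertOrUpdateBuilder("my_table").set("id").to(r.id);
    // Only set columns that actually have a value, so null/empty cells
    // do not count toward the per-commit mutation limit.
    if (r.name != null) {
      b.set("name").to(r.name);
    }
    if (r.tags != null && !r.tags.isEmpty()) {
      b.set("tags").toStringArray(r.tags);
    }
    c.output(b.build());
  }
}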