I'm having huge performance issues with Datastore write speed. Most of the time it stays under 100 elements/s.
I was able to achieve speeds of around 2600 elements/s when benchmarking the write speed on my local machine using the Datastore client (com.google.cloud:google-cloud-datastore) and running batched writes in parallel.
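For reference, the local benchmark was roughly along these lines (a simplified sketch, not my exact code; the batch size, thread count, and the "payload" property are illustrative):

    import com.google.cloud.datastore.Datastore;
    import com.google.cloud.datastore.DatastoreOptions;
    import com.google.cloud.datastore.Entity;
    import com.google.cloud.datastore.Key;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.UUID;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    public class DatastoreWriteBenchmark {
      static final int BATCH_SIZE = 500; // Datastore's per-commit entity limit
      static final int THREADS = 8;      // illustrative parallelism level

      public static void main(String[] args) throws InterruptedException {
        Datastore datastore = DatastoreOptions.getDefaultInstance().getService();
        ExecutorService pool = Executors.newFixedThreadPool(THREADS);
        for (int t = 0; t < THREADS; t++) {
          pool.submit(() -> {
            List<Entity> batch = new ArrayList<>(BATCH_SIZE);
            for (int i = 0; i < BATCH_SIZE; i++) {
              Key key = datastore.newKeyFactory()
                  .setKind("someKind")
                  .newKey(UUID.randomUUID().toString());
              batch.add(Entity.newBuilder(key).set("payload", "some data").build());
            }
            long start = System.currentTimeMillis();
            datastore.put(batch.toArray(new Entity[0])); // one batched commit
            System.out.printf("wrote %d entities in %d ms%n",
                BATCH_SIZE, System.currentTimeMillis() - start);
          });
        }
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.MINUTES);
      }
    }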
I've set up a simple Apache Beam pipeline using the Java API. Here's its graph:
Here's the speed when running without the Datastore node:
It is much faster this way. Everything points to DatastoreV1.Write being the bottleneck in this pipeline, judging by both the speed of the pipeline without the write node and the wall time of DatastoreV1.Write compared to the wall time of the other nodes.
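For context, the write node is wired up essentially like this (sketch; the step name is illustrative, and 'entities' is a placeholder for the output of the upstream nodes):

    import org.apache.beam.sdk.io.gcp.datastore.DatastoreIO;

    // 'entities' is a PCollection of com.google.datastore.v1.Entity
    // produced by the upstream nodes; the project id comes from pipeline options.
    entities.apply("WriteToDatastore",
        DatastoreIO.v1().write().withProjectId(options.getProject()));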
Approaches I've tried to solve this:
• Increasing the number of initial workers (tried 1 and 10, with no noticeable difference; see the sketch after this list for how I set them). Dataflow scales the number of workers down to 1 after some time (probably after the first 2 nodes finish processing). Based on that I suspect that DatastoreIO.v1().write() does not run its workers in parallel. Why, though?
• Making sure everything runs in the same location: the GCP project, the Dataflow pipeline workers & metadata, and storage are all set to us-central. This is suggested here.
• Trying different entity key generation strategies (per this post). Currently I'm using this approach: Key.Builder keyBuilder = DatastoreHelper.makeKey("someKind", UUID.randomUUID().toString()); I'm not perfectly certain this generates keys that are distributed evenly enough, but even if it doesn't, I wouldn't expect the performance to be this low.
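Regarding the first bullet, this is roughly how I set the worker counts (sketch, assuming the standard DataflowPipelineOptions flags; whether pinning maxNumWorkers or disabling autoscaling would prevent the downscaling is part of what I'm asking):

    import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
    import org.apache.beam.runners.dataflow.options.DataflowPipelineWorkerPoolOptions.AutoscalingAlgorithmType;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    DataflowPipelineOptions options = PipelineOptionsFactory.fromArgs(args)
        .withValidation()
        .as(DataflowPipelineOptions.class);
    options.setNumWorkers(10);     // initial worker count (tried 1 and 10)
    options.setMaxNumWorkers(10);  // upper bound for autoscaling
    // options.setAutoscalingAlgorithm(AutoscalingAlgorithmType.NONE); // would pin the count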
Please note, I was unable to use the provided Apache Beam & Google libraries without workarounds: I had to force the google-api-client version to 1.22.0 and Guava to 23.0 due to their dependency conflicts (see for example https://github.com/GoogleCloudPlatform/DataflowJavaSDK/issues/607).
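Concretely, the pins look something like this in my pom.xml (sketch; coordinates per my understanding of the linked issue, your build layout may differ):

    <!-- Forced versions to work around the dependency conflicts from the linked issue -->
    <dependencyManagement>
      <dependencies>
        <dependency>
          <groupId>com.google.api-client</groupId>
          <artifactId>google-api-client</artifactId>
          <version>1.22.0</version>
        </dependency>
        <dependency>
          <groupId>com.google.guava</groupId>
          <artifactId>guava</artifactId>
          <version>23.0</version>
        </dependency>
      </dependencies>
    </dependencyManagement>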
Looking at the DatastoreV1.Write node log:
It pushes batches of 500 entities roughly every 5 s, which is not very fast: 500 entities per ~5 s works out to about 100 entities/s, which matches the overall throughput I'm seeing.
Overall it looks like DatastoreIO.v1().write() is slow and its workers are not being run in parallel. Any idea how to fix this, or what the cause could be?