0
votes

I am using Apache Beam to read data from Google Cloud Datastore with the help of Beam's own io.gcp.datastore.v1.datastoreio Python APIs.

I run my pipeline on Google Cloud Dataflow.

I want to ensure that my workers are not overloaded with data.

How can I read data in batches or ensure using some other mechanism that my workers are not pulling a huge amount of data in one go?

1 Answer

0
votes

Dataflow automatically does this for you. By default, datastoreio splits the results of your query into chunks of roughly 64 MB. If you want smaller pieces, pass the num_splits parameter to the ReadFromDatastore initializer to specify how many splits the query should be broken into.
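
As a minimal sketch, a pipeline that sets num_splits explicitly could look like the following; the project ID 'my-project', the kind 'MyKind', and the split count of 32 are placeholders, not values from your setup:

```python
import apache_beam as beam
from apache_beam.io.gcp.datastore.v1.datastoreio import ReadFromDatastore
from google.cloud.proto.datastore.v1 import query_pb2

# Placeholder project ID and kind -- replace with your own.
PROJECT = 'my-project'

# Build a protobuf query over the kind you want to read.
query = query_pb2.Query()
query.kind.add().name = 'MyKind'

with beam.Pipeline() as p:
    entities = (
        p
        # num_splits controls how many pieces the query is broken into;
        # a higher value means smaller bundles per worker.
        | 'ReadFromDatastore' >> ReadFromDatastore(
            project=PROJECT,
            query=query,
            num_splits=32))
```

With the default of num_splits=0, the connector estimates a split count from the size of your data, so you only need to set it yourself when the automatic estimate leaves bundles too large for your workers.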