0
votes

I am using Apache Beam to read data from Google Cloud Datastore with the help of Beam's own io.gcp.datastore.v1.datastoreio Python APIs.

I run my pipeline on Google Cloud Dataflow.

I want to ensure that my workers are not overloaded with data.

How can I read data in batches or ensure using some other mechanism that my workers are not pulling a huge amount of data in one go?

1 Answer

0
votes

Dataflow automatically does this for you. By default, datastoreio splits the results of your query into chunks of roughly 64 MB. If you want smaller pieces, pass the num_splits parameter to the ReadFromDatastore initializer to specify how many splits the query should be broken into.
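
As a minimal sketch, a pipeline that sets num_splits explicitly could look like the following; the project ID 'my-project', the kind 'MyKind', and the split count of 32 are placeholders, not values from your setup:

```python
import apache_beam as beam
from apache_beam.io.gcp.datastore.v1.datastoreio import ReadFromDatastore
from google.cloud.proto.datastore.v1 import query_pb2

# Placeholder project ID and kind -- replace with your own.
PROJECT = 'my-project'

# Build a protobuf query over the kind you want to read.
query = query_pb2.Query()
query.kind.add().name = 'MyKind'

with beam.Pipeline() as p:
    entities = (
        p
        # num_splits controls how many pieces the query is broken into;
        # a higher value means smaller bundles per worker.
        | 'ReadFromDatastore' >> ReadFromDatastore(
            project=PROJECT,
            query=query,
            num_splits=32))
```

With the default of num_splits=0, the connector estimates a split count from the size of your data, so you only need to set it yourself when the automatic estimate leaves bundles too large for your workers.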