Can Google Cloud DataFlow be used as a Task Queue to process multiple data in parallel?

Question

We are currently evaluating our options on Google Cloud Platform for a solution that works as follows. We expect a lot of messages from our application and intend to queue these transactions using Google Cloud Pub/Sub. A typical message can contain multiple JSON objects, like this:

{
 groupId: "3003030330",
 groupTitle: "Multiple Payments Processing",
 transactions: [
   {id: "3030303" , amount: "2000" , to: "XXXX-XXX"},
   {id: "3030304" , amount: "5000" , to: "XXXX-XXX"},
   {id: "3030304" , amount: "5000" , to: "XXXX-XXX"},
 ]
}

We need to pass each of these transactions to our payment gateway synchronously and in parallel using Google Cloud Dataflow, then collate the responses into a separate PCollection and write it to another Pub/Sub topic. What I am unsure about is whether Google Cloud Dataflow is the most efficient and scalable solution to this problem, or whether we should instead use the Kubernetes HorizontalPodAutoscaler to scale based on the number of messages in the Pub/Sub queue. Any ideas and thoughts would be appreciated.
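To make the per-message work concrete, here is a minimal sketch in plain Python, independent of both Dataflow and Kubernetes. The names `call_gateway` and `process_message` are hypothetical, and the gateway call is a stub standing in for a blocking request to the real payment gateway.

```python
import json
from concurrent.futures import ThreadPoolExecutor


def call_gateway(transaction):
    """Hypothetical stand-in for a blocking payment-gateway call."""
    # A real implementation would make a synchronous request here.
    return {"id": transaction["id"], "amount": transaction["amount"], "status": "OK"}


def process_message(payload, gateway=call_gateway, max_workers=8):
    """Split one message into transactions, call the gateway for each
    transaction in parallel, and collate the responses."""
    message = json.loads(payload)
    transactions = message["transactions"]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order, so responses line up with transactions.
        responses = list(pool.map(gateway, transactions))
    return {"groupId": message["groupId"], "responses": responses}


payload = json.dumps({
    "groupId": "3003030330",
    "groupTitle": "Multiple Payments Processing",
    "transactions": [
        {"id": "3030303", "amount": "2000", "to": "XXXX-XXX"},
        {"id": "3030304", "amount": "5000", "to": "XXXX-XXX"},
    ],
})
result = process_message(payload)
print(json.dumps(result, indent=2))
```

In a Dataflow pipeline, the split would roughly correspond to a flat-map step over `transactions` followed by a step that makes the gateway call. On Kubernetes, the same function could be the body of a Pub/Sub subscriber callback.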

It looks like you are asking for an opinion on which product best fits your scenario. Since Stack Overflow focuses on questions about errors or coding advice, I would recommend posting your question in Google Groups, where other users and Googlers can give you a better reference. - rsantiago

Héctor Neri · Accepted Answer · 2018-09-19T00:31:20

By default, Cloud Dataflow can autoscale from 1 to 1,000 instances, each with 4 vCPUs, 15 GB of memory, and a 420 GB Persistent Disk. So, if you have enough quota, you can scale up to 4,000 cores, 15,000 GB of memory, and 420 TB of storage.
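The ceiling quoted above is just the per-worker defaults multiplied by the worker count. A small sketch of that arithmetic follows; the function name `fleet_ceiling` is made up for illustration, and the per-worker figures are the ones stated in the paragraph above.

```python
def fleet_ceiling(workers, vcpus_per_worker=4, memory_gb_per_worker=15,
                  disk_gb_per_worker=420):
    """Return (vCPUs, memory in GB, disk in TB) for a given worker count."""
    total_vcpus = workers * vcpus_per_worker
    total_memory_gb = workers * memory_gb_per_worker
    total_disk_tb = workers * disk_gb_per_worker / 1000  # decimal terabytes
    return total_vcpus, total_memory_gb, total_disk_tb


# Full default autoscaling range (1,000 workers).
full = fleet_ceiling(1000)
# A smaller, quota-limited fleet of 100 workers for comparison.
limited = fleet_ceiling(100)
print(full, limited)
```

If your quota only allows 100 workers, for example, the same calculation gives 400 vCPUs, 1,500 GB of memory, and 42 TB of disk.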

However, there is currently a beta release of Streaming Engine, which provides more responsive autoscaling as incoming data volume varies by moving pipeline execution out of the worker VMs and into the Cloud Dataflow service backend. As a result, it works best with smaller worker machine types and uses less storage space.
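As a rough sketch, the options for such a job could be assembled as below. The flag names are the ones I understand recent Apache Beam Python SDK releases to accept, and they may differ between SDK versions, so check the documentation for yours. The helper `streaming_engine_args`, the project ID, the region, the worker cap, and the machine type are all illustrative assumptions.

```python
def streaming_engine_args(project, region, max_workers=20):
    """Build command-line style options for a streaming Dataflow job.

    The returned list is intended to be passed to Beam's PipelineOptions;
    Beam itself is not imported here, so this only assembles the flags.
    """
    return [
        "--runner=DataflowRunner",
        f"--project={project}",          # hypothetical project ID
        f"--region={region}",
        "--streaming",
        "--enable_streaming_engine",     # flag name may vary by SDK version
        "--autoscaling_algorithm=THROUGHPUT_BASED",
        f"--max_num_workers={max_workers}",
        "--worker_machine_type=n1-standard-2",  # illustrative smaller worker
    ]


args = streaming_engine_args("my-project", "us-central1")
print(" ".join(args))
```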
