
I recently ran a batch job on Dataflow, and the pipeline stopped with an I/O error ("IOError: No space left on device").

Increasing the disk size on the worker nodes solved the problem, but the amount of data being processed is not very large, so I would not have expected the disk to fill up.

Therefore, I would like to know how Dataflow works so that I can better understand the incident.

My questions are as follows.

  • What is the architecture of Cloud Dataflow? I would also like to know where it is documented.
  • What is the flow of a Dataflow job before it is launched?

My guess is that pipelines and jobs are managed on a managed Kubernetes cluster and that the jobs are executed on the user's VM instances, since the Dataflow logs include kubelet and docker logs.

Any information would be appreciated.


1 Answer

  1. What is the architecture of Cloud Dataflow?

Google Cloud Dataflow is one of the Apache Beam runners, and it is built on top of Google Compute Engine (GCE): when you run a Dataflow job, it is executed on one or more GCE instances. When the job is launched, the Apache Beam SDK is installed on each worker along with any other libraries you specify, and then the pipeline is executed. For a Dataflow job, you can specify the GCE machine type as well as the disk size. Depending on the data being processed, the number of worker VMs can change over time.
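As a rough illustration of the machine type and disk size settings mentioned above, here is a minimal sketch that only assembles command-line flags using the standard library. The helper name `build_dataflow_flags`, its defaults, and the project, region, and bucket values are made up for the example; the flag names are the Beam Python SDK worker options as I understand them, so check them against the SDK version you use.

```python
from typing import List, Optional


def build_dataflow_flags(
    project: str,
    region: str,
    temp_location: str,
    machine_type: str = "n1-standard-2",  # assumed default, for illustration only
    disk_size_gb: int = 50,               # assumed default, for illustration only
    num_workers: int = 1,
    max_num_workers: Optional[int] = None,
) -> List[str]:
    """Assemble pipeline flags for a Dataflow job (hypothetical helper)."""
    if disk_size_gb <= 0:
        raise ValueError("disk_size_gb must be positive")

    flags = [
        "--runner=DataflowRunner",
        f"--project={project}",
        f"--region={region}",
        f"--temp_location={temp_location}",
        # Worker VM shape: GCE machine type and persistent disk size per worker.
        f"--worker_machine_type={machine_type}",
        f"--disk_size_gb={disk_size_gb}",
        f"--num_workers={num_workers}",
    ]
    # Upper bound for autoscaling, only added when the caller sets one.
    if max_num_workers is not None:
        flags.append(f"--max_num_workers={max_num_workers}")
    return flags


if __name__ == "__main__":
    flags = build_dataflow_flags(
        project="my-project",                # placeholder value
        region="us-central1",                # placeholder value
        temp_location="gs://my-bucket/tmp",  # placeholder value
        disk_size_gb=100,
        max_num_workers=5,
    )
    print(" ".join(flags))
```

In a Beam Python pipeline, a list like this would typically be passed to `PipelineOptions(flags)` from `apache_beam.options.pipeline_options`, or given directly on the command line when launching the pipeline script.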

There is also a feature called Dataflow Shuffle, which can be used for the shuffle phase of transforms such as GroupByKey and Combine. With it, the shuffle is executed on a managed service rather than on the Dataflow worker VMs (there are still VMs underneath, but they are hidden from you). This way, the shuffle can be significantly faster.
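To make the shuffle phase more concrete, here is a toy sketch in plain Python of what a GroupByKey has to do conceptually: route every record to a partition by key, then collect the values per key. This is not how Dataflow implements it; the function names and the CRC-based partitioning are assumptions for illustration. The point is that all grouped data has to be held somewhere in between, which is on the worker VMs when the shuffle runs there and in the managed service when Dataflow Shuffle is used.

```python
import zlib
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def partition_for(key: str, num_partitions: int) -> int:
    """Pick a partition for a key with a stable hash (CRC32)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def shuffle(
    records: Iterable[Tuple[str, int]], num_partitions: int
) -> List[Dict[str, List[int]]]:
    """Route each (key, value) record to its partition and group values by key."""
    partitions: List[Dict[str, List[int]]] = [
        defaultdict(list) for _ in range(num_partitions)
    ]
    for key, value in records:
        partitions[partition_for(key, num_partitions)][key].append(value)
    return [dict(p) for p in partitions]


def group_by_key(
    records: Iterable[Tuple[str, int]], num_partitions: int = 3
) -> Dict[str, List[int]]:
    """Merge the per-partition groups; each key lives in exactly one partition."""
    merged: Dict[str, List[int]] = {}
    for partition in shuffle(records, num_partitions):
        merged.update(partition)
    return merged


if __name__ == "__main__":
    data = [("a", 1), ("b", 2), ("a", 3), ("c", 4)]
    print(group_by_key(data))
```

Every value for a key ends up in the same partition, so the intermediate storage needed grows with the grouped data rather than with the size of any single input file.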

  2. What is the flow of a Dataflow job before it is launched?

If you want to know about the flow of a Dataflow job, I recommend going through link 1 in the list below.

Additional Information

If you want to know about the programming model for Apache Beam, see link 2 in the list below.

Google Cloud has also added multi-language Dataflow pipelines (Runner v2), enabled by a new, faster architecture. If you want to explore Runner v2, see link 3 in the list below.

Please find all the links below:

  1. https://cloud.google.com/dataflow/docs/guides/deploying-a-pipeline
  2. https://cloud.google.com/dataflow/docs/concepts/beam-programming-model
  3. https://cloud.google.com/blog/products/data-analytics/multi-language-sdks-for-building-cloud-pipelines