0 votes

I have read that Google Cloud Dataflow pipelines, which are based on the Apache Beam SDK, can be run with Spark or Flink.

I have some Dataflow pipelines currently running on GCP using the default Cloud Dataflow runner, and I want to run them using the Spark runner, but I don't know how.

Is there any documentation or guide about how to do this? Any pointers will help.

Thanks.


2 Answers

1 vote

I'll assume you're using Java, but the equivalent process applies with Python.

You need to migrate your pipeline to use the Apache Beam SDK, replacing your Google Dataflow SDK dependency with:

<dependency>
  <groupId>org.apache.beam</groupId>
  <artifactId>beam-sdks-java-core</artifactId>
  <version>2.4.0</version>
</dependency>

Then add the dependency for the runner you wish to use:

<dependency>
  <groupId>org.apache.beam</groupId>
  <artifactId>beam-runners-spark</artifactId>
  <version>2.4.0</version>
</dependency>

Then pass --runner=SparkRunner when submitting the pipeline to specify that this runner should be used.
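
As a rough sketch of what that looks like in code (the class name, file paths, and transforms below are placeholders, not anything from your existing pipelines), the runner is simply read from the command-line arguments, so the same pipeline code works on Dataflow, Spark, or the direct runner:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class MyBeamPipeline {
  public static void main(String[] args) {
    // Picks up --runner=SparkRunner (and any other pipeline options) from the command line.
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();

    Pipeline pipeline = Pipeline.create(options);
    pipeline
        .apply("ReadLines", TextIO.read().from("input.txt"))   // placeholder input
        .apply("WriteLines", TextIO.write().to("output"));     // placeholder output

    pipeline.run().waitUntilFinish();
  }
}

Launching it with --runner=SparkRunner (plus any Spark-specific options the Spark runner accepts, such as the Spark master) then executes the same code on Spark instead of Cloud Dataflow.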

See https://beam.apache.org/documentation/runners/capability-matrix/ for the full list of runners and comparison of their capabilities.

0 votes

Thanks to multiple tutorials and pieces of documentation scattered across the web, I was finally able to form a coherent picture of how to use the Spark runner with any Beam SDK based pipeline.

I have documented the entire process here for future reference: http://opreview.blogspot.com/2018/07/running-apache-beam-pipeline-using.html.