0 votes

I have read that Google Cloud Dataflow pipelines, which are based on the Apache Beam SDK, can be run with Spark or Flink.

I have some Dataflow pipelines currently running on GCP using the default Cloud Dataflow runner, and I want to run them using the Spark runner, but I don't know how.

Is there any documentation or guide about how to do this? Any pointers will help.

Thanks.


2 Answers

1 vote

I'll assume you're using Java, but the equivalent process applies with Python.

You need to migrate your pipeline to use the Apache Beam SDK, replacing your Google Dataflow SDK dependency with:

<dependency>
  <groupId>org.apache.beam</groupId>
  <artifactId>beam-sdks-java-core</artifactId>
  <version>2.4.0</version>
</dependency>

Then add the dependency for the runner you wish to use:

<dependency>
  <groupId>org.apache.beam</groupId>
  <artifactId>beam-runners-spark</artifactId>
  <version>2.4.0</version>
</dependency>

Then pass --runner=SparkRunner when submitting the pipeline to specify that this runner should be used.
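
As a rough sketch of what that looks like in code (the class name, file paths, and transforms below are placeholders, not anything from your existing pipelines), the runner is simply read from the command-line arguments, so the same pipeline code works on Dataflow, Spark, or the direct runner:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class MyBeamPipeline {
  public static void main(String[] args) {
    // Picks up --runner=SparkRunner (and any other pipeline options) from the command line.
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();

    Pipeline pipeline = Pipeline.create(options);
    pipeline
        .apply("ReadLines", TextIO.read().from("input.txt"))   // placeholder input
        .apply("WriteLines", TextIO.write().to("output"));     // placeholder output

    pipeline.run().waitUntilFinish();
  }
}

Launching it with --runner=SparkRunner (plus any Spark-specific options the Spark runner accepts, such as the Spark master) then executes the same code on Spark instead of Cloud Dataflow.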

See https://beam.apache.org/documentation/runners/capability-matrix/ for the full list of runners and comparison of their capabilities.

0 votes

Thanks to multiple tutorials and pieces of documentation scattered across the web, I was finally able to form a coherent picture of how to use the Spark runner with any Beam SDK based pipeline.

I have documented the entire process here for future reference: http://opreview.blogspot.com/2018/07/running-apache-beam-pipeline-using.html.