We currently use Google's Cloud Dataflow SDK (1.6.0) to run Dataflow jobs on GCP, but we are considering moving to the Apache Beam SDK (0.1.0). We would still run our jobs on GCP using the Dataflow service. Has anyone gone through this transition, and do you have any advice? Are there any compatibility issues, and is this move encouraged by GCP?
2 Answers
Formally, Beam is not yet supported on Dataflow (although that is certainly what we are working towards). We recommend staying with the Dataflow SDK, especially if an SLA or support is important to you. That said, our tests show that Beam runs on Dataflow, and although that may break at any time, you are certainly welcome to try it at your own risk.
Update: As of Dataflow SDK 2.0, the Dataflow SDKs are based on Beam (https://cloud.google.com/dataflow/release-notes/release-notes-java-2). Both Beam and the Dataflow SDKs are currently supported on Cloud Dataflow.
You can run Beam SDK pipelines on Dataflow now. See:
https://beam.apache.org/documentation/runners/dataflow/
You'll need to add the Dataflow runner dependency to your pom.xml, and probably pass a few command-line options, as explained on that page.
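For reference, here is a minimal sketch of what the pom.xml entry might look like, assuming a Maven build. The artifact is the Beam Dataflow runner; the `beam.version` property is a placeholder I've introduced for illustration, so set it to whichever Beam release you actually use.

```xml
<!-- Sketch only: beam.version is a placeholder property, not taken from the answer above. -->
<dependency>
  <groupId>org.apache.beam</groupId>
  <artifactId>beam-runners-google-cloud-dataflow-java</artifactId>
  <version>${beam.version}</version>
</dependency>
```

In recent Beam releases the command-line options are typically along the lines of `--runner=DataflowRunner`, `--project=<your-project-id>`, and `--tempLocation=gs://<your-bucket>/tmp`. Option and runner names have changed between versions, so treat the linked page as the authoritative list for the release you are on.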