2 votes

It's really confusing that every Google document for Dataflow now says it's based on Apache Beam and directs me to the Beam website. Also, if I look for the GitHub project, the Google Dataflow project is empty and everything points to the Apache Beam repo. Say I now need to create a pipeline. From what I read on Apache Beam, I would write: from apache_beam.options.pipeline_options. However, if I go with google-cloud-dataflow, I get the error: no module named 'options'. It turns out I should use from apache_beam.utils.pipeline_options instead. So it looks like google-cloud-dataflow is on an older Beam version and is going to be deprecated?
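For reference, here is a minimal sketch of the two import styles side by side (I'm assuming PipelineOptions is the class being imported; adjust to whatever you actually need):

# Current Apache Beam SDK:
from apache_beam.options.pipeline_options import PipelineOptions

# Older google-cloud-dataflow distribution:
from apache_beam.utils.pipeline_options import PipelineOptions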

Which one should I pick to develop my Dataflow pipeline?


2 Answers

5 votes

Ended up finding the answer in the Google Dataflow Release Notes:

The Cloud Dataflow SDK distribution contains a subset of the Apache Beam ecosystem. This subset includes the necessary components to define your pipeline and execute it locally and on the Cloud Dataflow service, such as:

  • The core SDK
  • DirectRunner and DataflowRunner
  • I/O components for other Google Cloud Platform services

The Cloud Dataflow SDK distribution does not include other Beam components, such as:

  • Runners for other distributed processing engines
  • I/O components for non-Cloud Platform services

Version 2.0.0 is based on a subset of Apache Beam 2.0.0.
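To make that concrete, here is a minimal pipeline sketch that runs locally with the DirectRunner; the runner flag and transforms are placeholders, and passing DataflowRunner plus the usual GCP project/staging options would submit the same pipeline to the Cloud Dataflow service (this assumes the newer import path and Python 3):

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# DirectRunner executes the pipeline locally; swap in
# '--runner=DataflowRunner' (plus --project, --temp_location, etc.)
# to run on the Cloud Dataflow service instead.
options = PipelineOptions(['--runner=DirectRunner'])

with beam.Pipeline(options=options) as p:
    (p
     | 'Create' >> beam.Create(['hello', 'world'])
     | 'Upper' >> beam.Map(lambda word: word.upper())
     | 'Print' >> beam.Map(print))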

0 votes

Yes, I've had this issue recently when testing outside of GCP. This link helps determine what you need when it comes to apache-beam. If you run the command below, you will get no GCP components:

$ pip install apache-beam

If you run this, however, you will get all the Cloud Platform components:

$ pip install apache-beam[gcp]
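A quick way to check which flavor you ended up with is to try importing one of the GCP I/O modules; a small sketch (the exact failure mode may vary by Beam version):

try:
    # This module is importable only when the gcp extra is installed;
    # a plain apache-beam install typically raises ImportError here.
    from apache_beam.io.gcp import bigquery  # noqa: F401
    print('GCP components are available')
except ImportError:
    print('GCP components missing; try: pip install apache-beam[gcp]')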

As an aside, I use the Anaconda distribution for almost all of my Python coding and package management. As of 7/20/17 you cannot use the Anaconda repos to install the necessary GCP components. I'm hoping to work with the Continuum folks to have this resolved, not just for Apache Beam but also for TensorFlow.