1 vote

Are there plans to enable Cloud Dataflow to write data to Cloud Bigtable? Is it even possible?

Adding a custom Sink to handle the IO would probably be the cleanest choice.

As a workaround, I tried connecting to a Bigtable cluster (in the same project) from a simple DoFn, opening the connection and table in startBundle and closing them in finishBundle.

I also added the bigtable-hbase jar (0.1.5) to the classpath and a modified version of hbase-site.xml to the resources folder, which gets picked up.
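
For concreteness, here is a minimal sketch of that workaround, assuming the HBase 1.x client API and the classic Dataflow SDK; the table, column family, and qualifier names are placeholders:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    import com.google.cloud.dataflow.sdk.transforms.DoFn;
    import com.google.cloud.dataflow.sdk.values.KV;

    // Sketch of the workaround described above: one HBase connection per
    // bundle, opened in startBundle and closed in finishBundle. The
    // hbase-site.xml on the classpath supplies the Bigtable settings
    // (project, cluster, zone) that route these calls through bigtable-hbase.
    public class BigtableWriteFn extends DoFn<KV<String, String>, Void> {
      private transient Connection connection;
      private transient Table table;

      @Override
      public void startBundle(Context c) throws IOException {
        Configuration config = HBaseConfiguration.create();
        connection = ConnectionFactory.createConnection(config);
        table = connection.getTable(TableName.valueOf("my-table")); // placeholder
      }

      @Override
      public void processElement(ProcessContext c) throws IOException {
        Put put = new Put(Bytes.toBytes(c.element().getKey()));
        put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col"), // placeholders
            Bytes.toBytes(c.element().getValue()));
        table.put(put);
      }

      @Override
      public void finishBundle(Context c) throws IOException {
        table.close();
        connection.close();
      }
    }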

When running in the cloud, I get an "NPN/ALPN extensions not installed" exception.

When running locally, I get an exception stating that ComputeEngineCredentials cannot find the metadata server, despite having set GOOGLE_APPLICATION_CREDENTIALS to the generated JSON key file.

Any help would be greatly appreciated.

I do get that NPN/ALPN extensions not installed error. Let's see what has to be done to correct it. – The Coder
We are currently working on providing support for Cloud Bigtable as both a source and a sink in Cloud Dataflow, but I do not yet have a concrete timeline to share with you. – jkff
We'll open-source a ParDo() example next week. – Les Vogel - Google DevRel
@jkff That's great! Can you give a ballpark estimate? Are we talking days, weeks, months? – codemoped
I have to be very vague because a production-ready Bigtable connector depends on resolving several issues at the intersection of different teams, which is difficult to predict (you hit one of those issues), and on the prioritization of other tasks. It's definitely not days, but hopefully not months either. Sorry I couldn't be more helpful about this. – jkff

2 Answers

4 votes

We now have a Cloud Bigtable / Dataflow connector. You can see more at: https://cloud.google.com/bigtable/docs/dataflow-hbase
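
A minimal write pipeline with that connector might look like the sketch below; the project, cluster, zone, and table values are placeholders, and the configuration builder follows the connector's (bigtable-hbase-dataflow) documented pattern of that era, so exact method names may vary by version:

    import org.apache.hadoop.hbase.client.Mutation;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    import com.google.cloud.bigtable.dataflow.CloudBigtableIO;
    import com.google.cloud.bigtable.dataflow.CloudBigtableTableConfiguration;
    import com.google.cloud.dataflow.sdk.Pipeline;
    import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
    import com.google.cloud.dataflow.sdk.transforms.Create;
    import com.google.cloud.dataflow.sdk.transforms.DoFn;
    import com.google.cloud.dataflow.sdk.transforms.ParDo;

    public class BigtableConnectorExample {
      public static void main(String[] args) {
        CloudBigtableTableConfiguration config =
            new CloudBigtableTableConfiguration.Builder()
                .withProjectId("my-project")   // placeholder
                .withClusterId("my-cluster")   // placeholder
                .withZoneId("us-central1-b")   // placeholder
                .withTableId("my-table")       // placeholder
                .build();

        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
        // One-time setup so the connector can write from the workers.
        CloudBigtableIO.initializeForWrite(p);

        p.apply(Create.of("alpha", "beta"))
            .apply(ParDo.of(new DoFn<String, Mutation>() {
              @Override
              public void processElement(ProcessContext c) {
                // Turn each element into an HBase Put (a Mutation).
                Put put = new Put(Bytes.toBytes(c.element()));
                put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("value"),
                    Bytes.toBytes(c.element()));
                c.output(put);
              }
            }))
            .apply(CloudBigtableIO.writeToTable(config));

        p.run();
      }
    }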

0 votes

Cloud Bigtable requires the NPN/ALPN networking jar, which is currently not installed on the Dataflow workers, so accessing Cloud Bigtable directly from a ParDo won't work.

One possible workaround is to use the HBase REST API: set up a REST server on a VM outside of Dataflow that accesses Cloud Bigtable. These instructions might help.

You could then issue REST requests to this REST server. This could get somewhat complicated if you're sending a lot of requests (i.e., processing large amounts of data), since you'd need to set up multiple instances of your REST server and load balance across them.
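
As a rough illustration, a single-cell write through the HBase REST (Stargate) JSON interface could look like the following; the host name rest-proxy-vm, the table, and the column names are placeholders, and 8080 is the REST server's default port:

    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;
    import java.util.Base64;

    public class HBaseRestPutExample {
      public static void main(String[] args) throws Exception {
        // Stargate expects row keys, column names, and cell values base64-encoded.
        String row = b64("row1");
        String column = b64("cf:col");
        String value = b64("hello");
        String body = "{\"Row\":[{\"key\":\"" + row + "\",\"Cell\":[{\"column\":\""
            + column + "\",\"$\":\"" + value + "\"}]}]}";

        // PUT /<table>/<row> against the REST proxy VM.
        URL url = new URL("http://rest-proxy-vm:8080/my-table/row1");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("PUT");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);
        try (OutputStream os = conn.getOutputStream()) {
          os.write(body.getBytes(StandardCharsets.UTF_8));
        }
        System.out.println("HTTP " + conn.getResponseCode()); // expect 200 on success
      }

      private static String b64(String s) {
        return Base64.getEncoder().encodeToString(s.getBytes(StandardCharsets.UTF_8));
      }
    }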