
I've got a bunch of data being generated in AWS S3, with PUT notifications sent to SQS whenever a new file arrives. I'd like to load the contents of these files into BigQuery, so I'm setting up a simple ETL pipeline in Google Cloud Dataflow. However, I can't figure out how to integrate Dataflow with any service it doesn't already support out of the box (Pub/Sub, Google Cloud Storage, etc.).

The GDF docs say:

In the initial release of Cloud Dataflow, extensibility for Read and Write transforms has not been implemented.

I think I can confirm this: I tried writing a Read transform, basing an SqsIO class on the provided PubsubIO class, and couldn't figure out how to make it work.

So I've been looking at writing a custom source for Dataflow, but I can't wrap my head around how to adapt a Source to polling SQS for changes. It doesn't really seem like the right abstraction anyway, but that wouldn't matter much if I could just get it working.

Additionally, it looks like I'd have to do some work to download the S3 files themselves (I tried creating a Reader for that as well, with no luck because of the extensibility limitation mentioned above).
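To make the S3 part concrete, here's roughly the processing step I have in mind once notifications are flowing, sketched as a plain DoFn. The class name, the bucket handling, and the assumption that each input element is an S3 object key extracted from the notification are all mine:

import com.amazonaws.services.s3.AmazonS3Client;
import com.amazonaws.services.s3.model.S3Object;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.common.io.CharStreams;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// Hypothetical DoFn: assumes each input element is an S3 object key taken
// from the PUT notification, and emits the object's contents as a string.
class DownloadFromS3Fn extends DoFn<String, String> {
  private final String bucket;

  DownloadFromS3Fn(String bucket) {
    this.bucket = bucket;
  }

  @Override
  public void processElement(ProcessContext c) throws Exception {
    // Creating a client per element is wasteful but keeps the sketch simple;
    // credentials come from the default AWS provider chain.
    AmazonS3Client s3 = new AmazonS3Client();
    S3Object object = s3.getObject(bucket, c.element());
    try (InputStreamReader reader =
        new InputStreamReader(object.getObjectContent(), StandardCharsets.UTF_8)) {
      c.output(CharStreams.toString(reader));
    }
  }
}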

Basically, I'm stuck. Any suggestions for integrating SQS and S3 with Dataflow would be very appreciated.

Comments:

jkff: "Right now there is no way to use SQS as an input source, however we're about to publish an API that will allow you to do just that, similar to the custom source API for bounded sources which you have already looked at. Stay tuned!"

jkff: "Dataflow Java SDK now includes an API for defining custom unbounded sources: github.com/GoogleCloudPlatform/DataflowJavaSDK/blob/master/sdk/…"

skboro: "Can you point me to the new location in apache/beam?"

1 Answer


The Dataflow Java SDK now includes an API for defining custom unbounded sources:

https://github.com/GoogleCloudPlatform/DataflowJavaSDK/blob/master/sdk/src/main/java/com/google/cloud/dataflow/sdk/io/UnboundedSource.java

This can be used to implement a custom SQS Source.
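For example, a minimal sketch of an SQS source built on that API might look like the following. The class names, the use of the AWS SDK's AmazonSQSClient, and the checkpoint-then-delete strategy are my own assumptions for illustration, not an official implementation:

import com.amazonaws.services.sqs.AmazonSQSClient;
import com.amazonaws.services.sqs.model.Message;
import com.google.cloud.dataflow.sdk.coders.Coder;
import com.google.cloud.dataflow.sdk.coders.SerializableCoder;
import com.google.cloud.dataflow.sdk.coders.StringUtf8Coder;
import com.google.cloud.dataflow.sdk.io.UnboundedSource;
import com.google.cloud.dataflow.sdk.options.PipelineOptions;
import org.joda.time.Instant;
import java.io.IOException;
import java.io.Serializable;
import java.util.*;

// Illustrative unbounded source that polls an SQS queue and emits message bodies.
public class SqsSource extends UnboundedSource<String, SqsSource.SqsCheckpoint> {
  private final String queueUrl;

  public SqsSource(String queueUrl) { this.queueUrl = queueUrl; }

  @Override
  public List<SqsSource> generateInitialSplits(int desiredNumSplits, PipelineOptions options) {
    // SQS load-balances across competing consumers, so every split can poll the same queue.
    return Collections.nCopies(desiredNumSplits, this);
  }

  @Override
  public UnboundedReader<String> createReader(PipelineOptions options, SqsCheckpoint checkpoint) {
    // Messages covered by an unfinalized checkpoint reappear on the queue after
    // their visibility timeout, so a restarted reader can simply begin fresh.
    return new SqsReader();
  }

  @Override public Coder<SqsCheckpoint> getCheckpointMarkCoder() { return SerializableCoder.of(SqsCheckpoint.class); }
  @Override public Coder<String> getDefaultOutputCoder() { return StringUtf8Coder.of(); }
  @Override public void validate() {}

  // Records messages handed to the pipeline but not yet durably committed;
  // finalizing deletes them from the queue so they are not redelivered.
  public static class SqsCheckpoint implements UnboundedSource.CheckpointMark, Serializable {
    private final String queueUrl;
    private final List<String> receiptHandles;

    SqsCheckpoint(String queueUrl, List<String> receiptHandles) {
      this.queueUrl = queueUrl;
      this.receiptHandles = receiptHandles;
    }

    @Override
    public void finalizeCheckpoint() throws IOException {
      AmazonSQSClient sqs = new AmazonSQSClient(); // default credential chain
      for (String handle : receiptHandles) {
        sqs.deleteMessage(queueUrl, handle);
      }
    }
  }

  private class SqsReader extends UnboundedReader<String> {
    private final AmazonSQSClient sqs = new AmazonSQSClient();
    private final Deque<Message> buffer = new ArrayDeque<>();
    private final List<String> pendingHandles = new ArrayList<>();
    private Message current;

    @Override public boolean start() { return advance(); }

    @Override
    public boolean advance() {
      if (buffer.isEmpty()) {
        // An empty receive just means "nothing available right now";
        // the runner will call advance() again later.
        buffer.addAll(sqs.receiveMessage(queueUrl).getMessages());
      }
      current = buffer.poll();
      if (current == null) { return false; }
      pendingHandles.add(current.getReceiptHandle());
      return true;
    }

    @Override public String getCurrent() { return current.getBody(); }

    // SQS gives no ordering guarantees, so this sketch uses processing time
    // for both element timestamps and the watermark.
    @Override public Instant getCurrentTimestamp() { return Instant.now(); }
    @Override public Instant getWatermark() { return Instant.now(); }

    @Override
    public SqsCheckpoint getCheckpointMark() {
      SqsCheckpoint mark = new SqsCheckpoint(queueUrl, new ArrayList<>(pendingHandles));
      pendingHandles.clear();
      return mark;
    }

    @Override public SqsSource getCurrentSource() { return SqsSource.this; }
    @Override public void close() { sqs.shutdown(); }
  }
}

A pipeline could then consume it with something like p.apply(Read.from(new SqsSource(queueUrl))), using com.google.cloud.dataflow.sdk.io.Read. Note that this sketch leans on SQS's at-least-once redelivery rather than restoring precise reader state from a checkpoint, so downstream logic should tolerate duplicate messages.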