I've got a bunch of data being generated in AWS S3, with PUT notifications sent to SQS whenever a new file arrives in S3. I'd like to load the contents of these files into BigQuery, so I'm working on setting up a simple ETL pipeline in Google Cloud Dataflow. However, I can't figure out how to integrate Dataflow with any service that it doesn't already support out of the box (Pub/Sub, Google Cloud Storage, etc.).
From what I've read: "In the initial release of Cloud Dataflow, extensibility for Read and Write transforms has not been implemented."
I think I can confirm this: I tried to write a Read transform (basing an SqsIO class on the provided PubsubIO class) and couldn't figure out how to make it work.
So I've been looking at writing a custom source for Dataflow instead, but I can't wrap my head around how to adapt a Source to poll SQS for changes. It doesn't really seem like the right abstraction anyway, but I wouldn't care much if I could get it working.
Additionally, it looks like I'd have to do some work to download the S3 files (I tried creating a Reader for that as well, with no luck because of the limitation mentioned above).
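To make the shape of the problem concrete, here is a minimal sketch of the one step that doesn't depend on Dataflow: turning the body of a single SQS message into the list of S3 objects to download. It assumes the message body is the standard S3 event notification JSON (a `Records` array with `s3.bucket.name` and a URL-encoded `s3.object.key`). The function name `s3_objects_from_notification` is hypothetical, and the actual SQS receive and S3 download calls are deliberately left out.

```python
import json
from urllib.parse import unquote_plus


def s3_objects_from_notification(body):
    """Return (bucket, key) pairs for newly created objects in one message body.

    Assumes `body` is an S3 event notification as delivered to SQS.
    """
    event = json.loads(body)
    objects = []
    # Bodies without a "Records" array (e.g. S3's test event) yield nothing.
    for record in event.get("Records", []):
        # Only react to object creation events such as "ObjectCreated:Put".
        if not record.get("eventName", "").startswith("ObjectCreated:"):
            continue
        s3 = record["s3"]
        # Object keys arrive URL-encoded, with spaces encoded as "+".
        key = unquote_plus(s3["object"]["key"])
        objects.append((s3["bucket"]["name"], key))
    return objects
```

What I can't work out is where logic like this, plus the polling and the download, is supposed to live inside a Dataflow pipeline.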
Basically, I'm stuck. Any suggestions for integrating SQS and S3 with Dataflow would be much appreciated.