I am trying to build something like the Apache Beam/Dataflow pipeline detailed at the end of this article: https://cloud.google.com/blog/products/gcp/how-to-process-weather-satellite-data-in-real-time-in-bigquery. The GOES-16 dataset I am trying to download from is: https://console.cloud.google.com/storage/browser/gcp-public-data-goes-16?authuser=3&project=fire-neural-network-285603&prefix=.

I could create a Pub/Sub topic, stream the object notifications into it, and then use Apache Beam to consume from that topic, but this seems backwards to me: consuming from Pub/Sub means running a streaming Dataflow job more or less forever, since I always want new data, and that ends up costing a lot.

Is there a way I can use Apache Beam to download directly from the Cloud Storage bucket whenever it updates, without having to deal with Pub/Sub? Something like this:
import os
import apache_beam as beam

# g2j is the GOES-to-JPEG helper module from the article's sample code.
p = beam.Pipeline(runner, options=opts)
(p
 | 'events' >> beam.io.ReadStringsFromGoogleCloud(bucketname)  # <-- looking for this; doesn't exist
 | 'filter' >> beam.FlatMap(lambda message: g2j.only_infrared(message))
 | 'to_jpg' >> beam.Map(lambda objectid:
     g2j.goes_to_jpeg(
         objectid, lat, lon, bucket,
         'goes/{}_{}/{}'.format(lat, lon, os.path.basename(objectid).replace('.nc', '.jpg'))
     ))
)
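The closest built-in I've found is apache_beam.io.fileio.MatchContinuously (experimental, in recent Beam releases), which polls a file pattern on an interval and emits newly matched files. A rough sketch of what I mean, where the glob pattern and the interval are just guesses on my part:

import apache_beam as beam
from apache_beam.io import fileio
from apache_beam.options.pipeline_options import PipelineOptions

# MatchContinuously yields an unbounded PCollection, so this is a streaming
# pipeline; it seems to have the same always-on cost problem as Pub/Sub.
opts = PipelineOptions(streaming=True)

with beam.Pipeline(options=opts) as p:
    (p
     # Poll the public GOES-16 bucket every 60s for newly matched objects.
     # The glob is my guess at the ABI L1b layout; adjust to the product needed.
     | 'match' >> fileio.MatchContinuously(
         'gs://gcp-public-data-goes-16/ABI-L1b-RadC/**', interval=60)
     # FileMetadata.path is the gs:// object path (the objectid above).
     | 'paths' >> beam.Map(lambda meta: meta.path)
     | 'debug' >> beam.Map(print))

But if I'm reading it right, that still leaves me with a job that runs forever.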
Any help is appreciated; if I'm going about this completely wrong, let me know!
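For reference, here is roughly what I understand the Pub/Sub version from the article to look like (the topic name is a placeholder and g2j is again the article's helper module; I haven't verified the exact topic or message format):

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder topic: the article describes subscribing to notifications of
# new GOES-16 objects; substitute the real topic name here.
TOPIC = 'projects/YOUR_PROJECT/topics/goes16-updates'

opts = PipelineOptions(streaming=True)
with beam.Pipeline(options=opts) as p:
    (p
     | 'events' >> beam.io.ReadFromPubSub(topic=TOPIC)
     | 'decode' >> beam.Map(lambda b: b.decode('utf-8'))  # payloads arrive as bytes
     | 'filter' >> beam.FlatMap(g2j.only_infrared)        # g2j from the article's code
     | 'debug' >> beam.Map(print))                        # then the same to_jpg step as above

This is the always-on streaming job whose cost I'd like to avoid.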