
I am trying to employ something like the Apache Beam/Dataflow pipeline detailed at the end of this article: https://cloud.google.com/blog/products/gcp/how-to-process-weather-satellite-data-in-real-time-in-bigquery. The GOES-16 dataset I am trying to download from is: https://console.cloud.google.com/storage/browser/gcp-public-data-goes-16?authuser=3&project=fire-neural-network-285603&prefix=. I could create a Pub/Sub topic, stream object notifications to it, and then use Apache Beam to read from that topic, but this seems backwards to me: to consume the Pub/Sub topic I have to run a streaming Dataflow job that runs pretty much forever (since I always want new data), which ends up costing a lot. Is there a way I can use Apache Beam to download directly from the cloud bucket whenever it updates, without having to deal with Pub/Sub? Something like this:

import os
import apache_beam as beam
# g2j, runner, opts, bucketname, lat, lon, and bucket come from the
# helper code in the referenced article

p = beam.Pipeline(runner, options=opts)
(p
    | 'events' >> beam.io.ReadStringsFromGoogleCloud(bucketname)  # <-- looking for this
    | 'filter' >> beam.FlatMap(lambda message: g2j.only_infrared(message))
    | 'to_jpg' >> beam.Map(lambda objectid:
        g2j.goes_to_jpeg(
            objectid, lat, lon, bucket,
            'goes/{}_{}/{}'.format(lat, lon, os.path.basename(objectid).replace('.nc', '.jpg'))
        ))
)

Any help is appreciated; if I'm going about this completely wrong, let me know!


1 Answer


For streaming data, Cloud Storage --> Pub/Sub --> Dataflow is the better option: Cloud Storage publishes a notification for each new object, and a streaming Dataflow job consumes that topic. Since it is a stream, the job will run forever; that is inherent to streaming, not something Pub/Sub adds.
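For a bucket you own, you can attach notifications with gsutil notification create and read them in Beam with the built-in ReadFromPubSub source (for the public GOES-16 bucket, which you don't own, the article you link reads from an existing public notification topic instead; check it for the exact topic name). A minimal sketch, assuming a hypothetical topic goes16-updates and that g2j, lat, lon, and bucket are defined as in your question:

# Hypothetical sketch: stream GCS object notifications into your pipeline.
# Prerequisite, on a bucket you own (topic name is a placeholder):
#   gsutil notification create -t goes16-updates -f json gs://YOUR_BUCKET
import os

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

opts = PipelineOptions()
opts.view_as(StandardOptions).streaming = True  # Pub/Sub reads require streaming mode

topic = 'projects/YOUR_PROJECT/topics/goes16-updates'  # hypothetical topic

with beam.Pipeline(options=opts) as p:
    (p
        | 'events' >> beam.io.ReadFromPubSub(topic=topic)    # raw bytes per notification
        | 'decode' >> beam.Map(lambda b: b.decode('utf-8'))  # JSON notification text
        | 'filter' >> beam.FlatMap(lambda message: g2j.only_infrared(message))
        | 'to_jpg' >> beam.Map(lambda objectid:
            g2j.goes_to_jpeg(
                objectid, lat, lon, bucket,
                'goes/{}_{}/{}'.format(lat, lon,
                    os.path.basename(objectid).replace('.nc', '.jpg'))
            ))
    )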

If it is a batch workload, you can instead trigger a Cloud Function from Cloud Storage and have the function push that message to Pub/Sub.
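A minimal sketch of that function, assuming the same hypothetical goes16-updates topic and the standard google.storage.object.finalize trigger (YOUR_PROJECT and YOUR_BUCKET are placeholders):

# main.py: Cloud Function that fires on each newly finalized object.
# Deploy (placeholders are hypothetical):
#   gcloud functions deploy notify --runtime python39 \
#     --trigger-resource YOUR_BUCKET --trigger-event google.storage.object.finalize
import json

from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path('YOUR_PROJECT', 'goes16-updates')

def notify(event, context):
    """Publish the bucket and object name of each new object to Pub/Sub."""
    payload = json.dumps({'bucket': event['bucket'], 'name': event['name']})
    publisher.publish(topic_path, data=payload.encode('utf-8'))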