I am given the URL of a Google Cloud Storage bucket. I have to:
Use the URL to acquire a list of blobs in that bucket
For each blob I make some GCS API calls to get information about the blob (blob.size, blob.name, etc.)
For each blob I also have to read it, find something inside it, and add that to the values obtained from the GCS API calls
For each blob I have to write the values found in steps 2 and 3 to BigQuery
I have thousands of blobs, so this needs to be done with Apache Beam (which I've been recommended to use).
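Without Beam, the per-blob work (steps 1 to 3) would be something like the sketch below using the google-cloud-storage client; the bucket name and find_value() are just placeholders for my actual scanning logic:

```python
from google.cloud import storage


def find_value(contents: bytes) -> str:
    # Placeholder: replace with whatever scanning the blob contents actually needs.
    return contents[:20].decode("utf-8", errors="replace")


client = storage.Client()
rows = []
for blob in client.bucket("my-bucket").list_blobs():   # step 1: list the blobs
    row = {
        "name": blob.name,                              # step 2: GCS metadata
        "size": blob.size,
        "updated": blob.updated.isoformat(),
    }
    contents = blob.download_as_bytes()                 # step 3: read the blob
    # (older client versions use download_as_string() instead)
    row["found_value"] = find_value(contents)
    rows.append(row)
```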
My idea of the pipeline is something like this:
Take the bucket URL and make it into a PCollection
Using that PCollection, obtain the list of blobs as a new PCollection
Create a PCollection with the metadata of those blobs
Perform a transform that takes in that PCollection of metadata dictionaries, goes into each blob, scans for a value, and returns a new PCollection of dictionaries holding the metadata values plus this new value
Write this to BigQuery.
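Tying that together, the pipeline I have in mind would look roughly like the untested sketch below (Beam Python SDK); the bucket name, the BigQuery table and schema, and find_value() are all placeholders:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from google.cloud import storage


def find_value(contents: bytes) -> str:
    # Placeholder for the real scanning logic, as above.
    return contents[:20].decode("utf-8", errors="replace")


def list_blob_names(bucket_name):
    # Build the client inside the function so it never has to be pickled and
    # shipped to workers; emit (bucket, name) pairs rather than Blob objects
    # for the same reason.
    client = storage.Client()
    for blob in client.bucket(bucket_name).list_blobs():
        yield (bucket_name, blob.name)


def scan_blob(element):
    bucket_name, blob_name = element
    client = storage.Client()
    blob = client.bucket(bucket_name).get_blob(blob_name)
    contents = blob.download_as_bytes()  # download_as_string() on older clients
    # Returning a dict per blob gives a PCollection of dicts, which is what
    # WriteToBigQuery expects.
    return {
        "name": blob.name,
        "size": blob.size,
        "found_value": find_value(contents),
    }


with beam.Pipeline(options=PipelineOptions()) as p:
    (
        p
        | "StartWithBucket" >> beam.Create(["my-bucket"])
        | "ListBlobs" >> beam.FlatMap(list_blob_names)
        | "ScanBlobs" >> beam.Map(scan_blob)
        | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
            table="my_project:my_dataset.my_table",
            schema="name:STRING,size:INTEGER,found_value:STRING",
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        )
    )
```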
It's particularly hard for me to work out how to return a dictionary as a PCollection.
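My current understanding is that whatever a DoFn's process() yields becomes the elements of the output PCollection, so yielding one dict per blob should give me a PCollection of dicts, along these lines (untested, values are dummies), but I'm not sure this is the idiomatic way:

```python
import apache_beam as beam


class ScanBlobFn(beam.DoFn):
    def process(self, element):
        bucket_name, blob_name = element
        # ... fetch the metadata and scan the blob contents here ...
        # Each value yielded here becomes one element of the output
        # PCollection, so the result is a PCollection of dicts.
        yield {"name": blob_name, "size": 0, "found_value": "..."}
```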
What I've read:
https://beam.apache.org/documentation/programming-guide/#composite-transforms
https://medium.com/@rajeshhegde/data-pipeline-using-apache-beam-python-sdk-on-dataflow-6bb8550bf366
Any suggestions, specifically about how to take in that bucket name and return a PCollection of blobs, are greatly appreciated.