I have a background service which produces files in Google Cloud Storage. Once it is done, it writes a marker file to the output folder.
In my flow I need to get the list of these files and start a Dataproc Spark job with that list. The processing is not real-time and takes tens of minutes.
GCS has a notification system that can publish notifications to the Pub/Sub service.
In GCS, a file .../feature/***/***.done will be created to mark completion of the service job.
- Can I subscribe to new files in GCS by wildcard?
Once the file is created, the notification is delivered to the Pub/Sub service.
I believe I can write a Cloud Function that reads this notification, somehow extracts the location of the modified file, and lists all files in that folder. It would then publish another message to Pub/Sub with all the required information.
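To make the idea concrete, here is a rough sketch of such a function, not a tested implementation. It assumes the notification arrives with the `eventType`, `bucketId` and `objectId` attributes that GCS attaches to its Pub/Sub notifications. As far as I know, a notification config can only filter by object name prefix, so the wildcard check is done inside the function. The pattern `*/feature/*/*.done` and the helper names are hypothetical, and listing the folder assumes the `google-cloud-storage` package is installed.

```python
import fnmatch
import posixpath

# Hypothetical pattern; the real path is elided in the question.
# Note: fnmatch's "*" also matches "/", so this is looser than a shell glob.
DONE_PATTERN = "*/feature/*/*.done"


def folder_of_done_marker(attributes, pattern=DONE_PATTERN):
    """Return (bucket, folder_prefix) for a newly created marker file, else None."""
    if attributes.get("eventType") != "OBJECT_FINALIZE":
        return None  # ignore deletes, metadata updates, etc.
    name = attributes.get("objectId", "")
    if not fnmatch.fnmatchcase(name, pattern):
        return None  # not a marker file
    return attributes["bucketId"], posixpath.dirname(name) + "/"


def list_folder(bucket, prefix):
    """List gs:// URIs under the prefix (assumes google-cloud-storage is installed)."""
    from google.cloud import storage

    client = storage.Client()
    return [f"gs://{bucket}/{blob.name}"
            for blob in client.list_blobs(bucket, prefix=prefix)]
```

In a Pub/Sub-triggered function, `attributes` would come from `event["attributes"]`, and the result of `list_folder` could be published as the follow-up message.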
- Is it possible to start a Dataproc job from a Pub/Sub notification?
Ideally, I would like to use batch jobs instead of a streaming job to reduce costs. This would mean that Pub/Sub initiates the job, rather than a streaming job pulling new messages from Pub/Sub.
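One way this might be wired up is a second Pub/Sub-triggered Cloud Function that submits a batch Spark job through the Dataproc API. The sketch below is illustrative only: it assumes the message body is JSON of the form `{"files": [...]}` (a made-up format), that the `google-cloud-dataproc` package is available, and that a cluster already exists. The project, region, cluster, class and jar names are all placeholders.

```python
import base64
import json


def decode_file_list(event):
    """Extract the file list from a Pub/Sub event (assumed JSON: {"files": [...]})."""
    payload = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
    return payload["files"]


def build_spark_job(cluster_name, main_class, jar_uri, input_files):
    """Build the job description that the Dataproc submit call expects."""
    return {
        "placement": {"cluster_name": cluster_name},
        "spark_job": {
            "main_class": main_class,
            "jar_file_uris": [jar_uri],
            "args": list(input_files),  # file URIs are passed as program arguments
        },
    }


def submit(project_id, region, job):
    """Submit the job (assumes google-cloud-dataproc is installed)."""
    from google.cloud import dataproc_v1

    client = dataproc_v1.JobControllerClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    )
    return client.submit_job(
        request={"project_id": project_id, "region": region, "job": job}
    )


def on_pubsub_message(event, context):
    """Entry point for a Pub/Sub-triggered function; all names below are placeholders."""
    job = build_spark_job(
        cluster_name="my-cluster",
        main_class="com.example.ProcessFiles",
        jar_uri="gs://my-bucket/jars/process-files.jar",
        input_files=decode_file_list(event),
    )
    submit("my-project", "us-central1", job)
```

With this shape nothing runs between messages, so the cost is that of the batch job itself (plus the cluster, if it is kept alive).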