I am looking for some guidance on designing the architecture for a simple ETL job. I have already built a working solution, but I am looking for ways to improve it, or for an alternative architecture to present.
Here is my use case:
- Source data is uploaded in CSV format to Cloud Storage from a mobile device
- Process the data and convert it to JSON format
- Use a big-data storage solution for analytics
- Use a visualization solution to display the data
For this, I built a solution where the user uploads the source CSV data to Cloud Storage. A Cloud Function monitors the bucket for changes and triggers a Dataflow pipeline, which batch-processes the file, converts the records to JSON, and writes them to BigQuery for analysis. Finally, I use Data Studio to visualize the data in my BigQuery tables.
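To make that concrete, here is a minimal sketch of the trigger function, a 1st-gen background Cloud Function fired on object finalize, which launches a Dataflow template via the Dataflow REST API. The project ID, template path, and the `inputFile` parameter name are placeholders for illustration, not the exact values from my setup:

```python
# main.py -- background Cloud Function triggered by
# google.storage.object.finalize on the upload bucket.
from googleapiclient.discovery import build

PROJECT = "my-project"                                # placeholder project ID
TEMPLATE_PATH = "gs://my-bucket/templates/csv-to-bq"  # placeholder template

def gcs_trigger(event, context):
    """Launch the Dataflow template for each finalized CSV upload."""
    file_name = event["name"]
    if not file_name.endswith(".csv"):
        return  # ignore non-CSV objects

    dataflow = build("dataflow", "v1b3")
    body = {
        "jobName": f"csv-to-bq-{context.event_id}",
        "parameters": {
            # inputFile is a parameter the template is assumed to accept
            "inputFile": f"gs://{event['bucket']}/{file_name}",
        },
    }
    dataflow.projects().templates().launch(
        projectId=PROJECT, gcsPath=TEMPLATE_PATH, body=body
    ).execute()
```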
Here's my workflow:
Cloud Storage -> Cloud Functions (trigger) -> Cloud Dataflow -> BigQuery -> Data Studio
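And here is a minimal sketch of the Dataflow (Apache Beam, Python SDK) stage that does the CSV-to-JSON conversion and the BigQuery write. The bucket path, table name, and three-column schema (`id,value,ts`) are illustrative assumptions, not my real schema:

```python
# pipeline.py -- batch pipeline: read CSV lines, emit dicts, write to BigQuery.
import csv
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def parse_csv_line(line):
    """Turn one CSV row into a dict (a BigQuery-ready JSON record)."""
    row = next(csv.reader([line]))
    return {"id": row[0], "value": row[1], "ts": row[2]}

def run():
    opts = PipelineOptions()  # project/region/runner come from CLI flags
    with beam.Pipeline(options=opts) as p:
        (
            p
            | "ReadCSV" >> beam.io.ReadFromText(
                "gs://my-bucket/uploads/*.csv", skip_header_lines=1)
            | "ParseToJSON" >> beam.Map(parse_csv_line)
            | "WriteToBQ" >> beam.io.WriteToBigQuery(
                "my-project:analytics.events",
                schema="id:STRING,value:STRING,ts:TIMESTAMP",
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            )
        )

if __name__ == "__main__":
    run()
```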
What alternative architectures could I use to achieve this? Is Cloud Pub/Sub an option for batch processing? How about using Apache Kafka for the pipeline processing?