
I am looking for some guidance on building an architecture for a simple ETL job. I have already built a solution but I am looking for ways to improve or try an alternate architecture to present.

Here is my use case:

  1. Source data is uploaded in CSV format to Cloud Storage from a mobile device
  2. Process the data and convert it to JSON format
  3. Use a big data storage solution for analytics
  4. Use a visualization solution to display the data

For this, I built a solution where the user uploads source data in CSV format to Cloud Storage. I use Cloud Functions to monitor changes in my Cloud Storage bucket and trigger a Dataflow pipeline to batch-process the data and store it (in JSON format) in BigQuery for analysis. Lastly, I use Data Studio to visualize the information in my BigQuery tables.

Here's my workflow:

Cloud Storage -> Cloud Functions (trigger) -> Cloud Dataflow -> BigQuery -> Data Studio
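The CSV-to-JSON conversion step in the middle of this workflow can be sketched in plain Python. This is an illustrative stand-in for what the Dataflow pipeline does, not the pipeline itself; the sample column names (`device_id`, `value`) are hypothetical:

```python
# Sketch of the CSV -> JSON conversion step performed inside the pipeline.
# Column names in the sample data are hypothetical; the real schema comes
# from the uploaded CSV files.
import csv
import io
import json

def csv_to_ndjson(csv_text: str) -> str:
    """Convert CSV text (with a header row) to newline-delimited JSON,
    the format BigQuery accepts for load jobs."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return "\n".join(json.dumps(row) for row in reader)

sample = "device_id,value\nphone-1,42\nphone-2,7\n"
print(csv_to_ndjson(sample))
```

Newline-delimited JSON (one object per line) is the shape BigQuery expects for JSON loads, which is why the rows are joined rather than wrapped in a JSON array.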

What alternative architectures could I use to achieve this? Is Cloud Pub/Sub an option for batch processing? How about using Apache Kafka for the pipeline processing?
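On the Pub/Sub question: instead of a direct Cloud Functions storage trigger, Cloud Storage can publish object-change notifications to a Pub/Sub topic, and a subscriber (a Cloud Function or a pipeline) consumes them. A minimal sketch of the subscriber side, assuming the standard Cloud Storage notification attributes (`eventType`, `bucketId`, `objectId`):

```python
# Sketch of the Pub/Sub alternative: Cloud Storage publishes a notification
# per object event, and the subscriber filters for newly finalized uploads.
# The attribute names follow the Cloud Storage Pub/Sub notification format;
# the bucket and object values below are hypothetical examples.

def parse_gcs_notification(message_attributes):
    """Return (bucket, object) for a new-object notification, else None."""
    if message_attributes.get("eventType") != "OBJECT_FINALIZE":
        return None  # ignore deletes, metadata updates, etc.
    return message_attributes["bucketId"], message_attributes["objectId"]

# Example attributes as delivered with a notification message:
attrs = {
    "eventType": "OBJECT_FINALIZE",
    "bucketId": "my-upload-bucket",   # hypothetical bucket name
    "objectId": "uploads/data.csv",   # hypothetical object path
}
print(parse_gcs_notification(attrs))
```

The trade-off is that Pub/Sub adds durable, replayable delivery between the upload and the processing step, at the cost of one more component compared with the direct trigger.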

There's nothing wrong with your solution. It's actually very good: scalable, event-driven, NoOps'y, etc. Why exactly are you asking for alternate architectures? — Graham Polley
@MikhailBerlyant: there are some written guidelines about conflict of interest. I think they are pretty good - as long as you declare your interest as a developer evangelist or whatever, you should be OK. Answers may give you more space to do that than comments, of course. (My advice on your giving feedback about these rules is: remain calm. Too much fury will encourage your interlocutors to switch off, no matter how reasonable your core case. That's the human condition, which is best accepted.) — halfer
The architecture that I am using fits the bill and works as expected. I am always looking to improve and to check whether I have missed any corner cases. Also, one of my requirements is to present an alternative architecture, which is why I am asking for guidance here. — Andy Cooper
The architecture is good. We use the exact same one. Been running it in prod for years. No problems. Don't fix what isn't broken! — Graham Polley

1 Answer


Your architecture seems fine. I have built numerous data lake solutions on AWS with a more or less similar architecture. I occasionally use DynamoDB to store information that a Lambda function (which creates the pipeline dynamically) reads before launching pipelines - things like AMI IDs, instance types, etc.

You can use Cloud Datastore in place of DynamoDB.
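The pattern described above - keeping pipeline launch parameters in a key-value store instead of hard-coding them - can be sketched as follows. The dict here is an in-memory stand-in for a Cloud Datastore kind; the entity name and fields (template path, machine type, worker count) are hypothetical examples:

```python
# Illustrative stand-in for the Datastore/DynamoDB pattern: pipeline
# parameters live in a key-value store and are looked up by the function
# that launches the pipeline. The dict below stands in for a Datastore
# kind; all names and values are hypothetical.

PIPELINE_CONFIG = {
    "csv-ingest": {
        "template": "gs://my-templates/csv_to_bq",  # hypothetical path
        "machine_type": "n1-standard-2",
        "max_workers": 4,
    },
}

def get_pipeline_config(pipeline_name):
    """Fetch launch parameters for a pipeline. A real implementation
    would query Cloud Datastore here instead of a local dict."""
    try:
        return PIPELINE_CONFIG[pipeline_name]
    except KeyError:
        raise ValueError(f"no config stored for pipeline {pipeline_name!r}")

print(get_pipeline_config("csv-ingest"))
```

The benefit is the same as in the AWS setup: changing a machine type or template path becomes a data update rather than a code deploy.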