0
votes

I need to implement a microservice which is fairly simple in terms of logic and architecture, but needs to handle around 305k requests per second.

All it's going to do is ingest JSON data, validate it against simple rules, and record it to Google Cloud Storage as JSON files. There are lots of Google Cloud services and APIs available, but it's hard for me to pick the proper stack and pipeline because I don't have much experience with them or with high load.

There is an example I'm looking at: https://cloud.google.com/pubsub/docs/pubsub-dataflow

The flow is the following:

PubSub > Dataflow > Cloud Storage

It does exactly what I need (except data validation), but it looks like Dataflow is limited to Java and Python, and I'd rather use PHP.
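
For reference, that pipeline is short to express in Apache Beam. Here is a minimal sketch, assuming the Beam Java SDK running on Dataflow; the project, subscription and bucket names, the window size, and the one-line validation rule are all placeholders:

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.options.StreamingOptions;
    import org.apache.beam.sdk.transforms.Filter;
    import org.apache.beam.sdk.transforms.windowing.FixedWindows;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.joda.time.Duration;

    public class PubSubToGcs {
      public static void main(String[] args) {
        // launch with --runner=DataflowRunner --project=... --region=...
        StreamingOptions options =
            PipelineOptionsFactory.fromArgs(args).withValidation().as(StreamingOptions.class);
        options.setStreaming(true);

        Pipeline pipeline = Pipeline.create(options);
        pipeline
            .apply("ReadFromPubSub", PubsubIO.readStrings()
                .fromSubscription("projects/my-project/subscriptions/my-sub"))
            // placeholder validation rule: keep only payloads that look like JSON objects
            .apply("Validate", Filter.by((String json) -> json.trim().startsWith("{")))
            // group messages into 1-minute windows so each file covers a fixed interval
            .apply("Window", Window.into(FixedWindows.of(Duration.standardMinutes(1))))
            .apply("WriteToGcs", TextIO.write()
                .to("gs://my-bucket/raw/events")
                .withSuffix(".json")
                .withWindowedWrites()
                .withNumShards(10)); // unbounded writes need an explicit shard count
        pipeline.run();
      }
    }

Windowed writes like this group many messages into a few objects per window instead of one object per request.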

Another relevant example is https://medium.com/google-cloud/cloud-run-using-pubsub-triggers-2db74fc4ac6d

It uses Cloud Run, which supports PHP, and PubSub to trigger the Cloud Run workload. So it goes like:

PubSub > Cloud Run 

and working with Cloud Storage from Cloud Run looks pretty simple.
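
For illustration, a push-triggered Cloud Run service along those lines could look like this minimal Java sketch (the bucket name and the validation rule are placeholder assumptions, and the envelope parsing is deliberately crude; a real service would use a JSON library):

    import com.google.cloud.storage.BlobId;
    import com.google.cloud.storage.BlobInfo;
    import com.google.cloud.storage.Storage;
    import com.google.cloud.storage.StorageOptions;
    import com.sun.net.httpserver.HttpServer;
    import java.io.InputStream;
    import java.net.InetSocketAddress;
    import java.nio.charset.StandardCharsets;
    import java.util.Base64;
    import java.util.UUID;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class PubSubPushHandler {
      private static final Storage storage = StorageOptions.getDefaultInstance().getService();
      // crude extraction of the base64 "data" field from the PubSub push envelope
      private static final Pattern DATA = Pattern.compile("\"data\"\\s*:\\s*\"([^\"]+)\"");

      public static void main(String[] args) throws Exception {
        int port = Integer.parseInt(System.getenv().getOrDefault("PORT", "8080"));
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/", exchange -> {
          try (InputStream in = exchange.getRequestBody()) {
            String envelope = new String(in.readAllBytes(), StandardCharsets.UTF_8);
            Matcher m = DATA.matcher(envelope);
            if (m.find()) {
              String json = new String(Base64.getDecoder().decode(m.group(1)),
                  StandardCharsets.UTF_8);
              // placeholder validation rule; replace with your real checks
              if (json.trim().startsWith("{")) {
                BlobId blobId = BlobId.of("my-bucket", "raw/" + UUID.randomUUID() + ".json");
                storage.create(
                    BlobInfo.newBuilder(blobId).setContentType("application/json").build(),
                    json.getBytes(StandardCharsets.UTF_8));
              }
            }
          }
          exchange.sendResponseHeaders(204, -1); // a 2xx response acks the message
          exchange.close();
        });
        server.start();
      }
    }

Note that this writes one object per message, so at ~305k requests per second it would create ~305k objects per second; grouping messages into fewer, larger files needs something extra (windowing in Dataflow, or batching in the service).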

Am I on the right track? Can something like the above work for me, or do I need something different?

1
Do you want to create 1 file per request, or to group the requests into batches (for example, 1 file per minute)? What is the purpose of your files? What will you do with them afterwards? – guillaume blaquiere
The best option would be grouping messages into fixed-size intervals (as happens in the second example). The files serve as raw data storage for later use with BigQuery, but that's not essential for now. Right now it's important to pick the proper services. Should we listen to requests using App Engine or Cloud Run, or is it better to publish directly to PubSub (and what goes next, GAE, GCR)? – Vlad

1 Answer

1
votes

My first intuition when I saw 305k requests per second and PubSub was this pattern:

PubSub > Dataflow > Bigtable

Your answer to my question validates the choice of Bigtable, because you can query Bigtable tables from BigQuery for later analysis.

Of course, it's expensive, but you get a very scalable system.

An alternative, if your process fits within the BigQuery streaming quotas, is to stream directly into BigQuery instead of Bigtable.

PubSub > Dataflow > BigQuery
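
This variant only swaps the sink of the Beam pipeline. A minimal sketch of the BigQuery version, assuming a pre-created table with a single raw_json STRING column (project, subscription, and table names are placeholders; the Bigtable variant would use BigtableIO as the sink instead):

    import com.google.api.services.bigquery.model.TableRow;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
    import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.options.StreamingOptions;
    import org.apache.beam.sdk.transforms.MapElements;
    import org.apache.beam.sdk.values.TypeDescriptor;

    public class PubSubToBigQuery {
      public static void main(String[] args) {
        StreamingOptions options =
            PipelineOptionsFactory.fromArgs(args).withValidation().as(StreamingOptions.class);
        options.setStreaming(true);

        Pipeline pipeline = Pipeline.create(options);
        pipeline
            .apply("ReadFromPubSub", PubsubIO.readStrings()
                .fromSubscription("projects/my-project/subscriptions/my-sub"))
            // wrap each raw payload in a row; a real pipeline would parse the JSON fields
            .apply("ToTableRow", MapElements.into(TypeDescriptor.of(TableRow.class))
                .via((String json) -> new TableRow().set("raw_json", json)))
            .apply("StreamToBigQuery", BigQueryIO.writeTableRows()
                .to("my-project:my_dataset.raw_events")
                .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
                .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
                .withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS));
        pipeline.run();
      }
    }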

The problem with a solution built on Cloud Run or App Engine is that you need to run a process externally (for example, with Cloud Scheduler), and in this process you perform a loop that pulls messages from the PubSub subscription. You will run into several difficulties (a sketch of such a pull loop follows the list):

  • PubSub provides at-least-once delivery, so duplicate messages can be a concern. Dataflow manages this automatically.
  • The memory limits of App Engine and Cloud Run can be an issue, especially if your language is not memory efficient.
  • Pulling velocity can be a concern, and parallelism can be a challenge.
  • Pulling duration is limited to a few minutes (because of the maximum request duration on Cloud Run and App Engine), so you have to exit gracefully and wait for the next Cloud Scheduler trigger to start pulling from PubSub again.
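
To make those limitations concrete, here is a minimal sketch of such a scheduler-triggered pull process, using the Java Pub/Sub client library (the subscription name and the 9-minute pulling budget are placeholder assumptions):

    import com.google.cloud.pubsub.v1.Subscriber;
    import com.google.pubsub.v1.ProjectSubscriptionName;
    import java.util.concurrent.TimeUnit;

    public class ScheduledPuller {
      public static void main(String[] args) throws Exception {
        ProjectSubscriptionName subscription =
            ProjectSubscriptionName.of("my-project", "my-sub");

        // at-least-once delivery: this callback must be idempotent,
        // because the same message can arrive twice
        Subscriber subscriber = Subscriber.newBuilder(subscription, (message, consumer) -> {
          String json = message.getData().toStringUtf8();
          // ... validate and write to Cloud Storage here ...
          consumer.ack();
        }).build();

        subscriber.startAsync().awaitRunning();
        // pull for a bounded window, then exit gracefully before the request
        // deadline; the next Cloud Scheduler trigger starts the loop again
        TimeUnit.MINUTES.sleep(9);
        subscriber.stopAsync().awaitTerminated();
      }
    }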

EDIT

I forgot that you didn't want to code in Java or Python. I can propose 2 alternatives if your process is really simple:

Personal opinion: the coding language doesn't matter; use the right tool for the right job. Using Cloud Run or App Engine for this will create a much more unstable and hard-to-maintain system than learning how to write 10 lines of Java code.