0
votes

I need to implement a microservice which is fairly simple in terms of logic and architecture, but needs to handle around 305k requests per second.

All it's going to do is ingest JSON data, validate it against simple rules, and record it to Google Cloud Storage as JSON files. There are lots of Google Cloud services and APIs available, but it's hard for me to pick the proper stack and pipeline because I don't have much experience with them or with high load.

There is an example I'm looking at: https://cloud.google.com/pubsub/docs/pubsub-dataflow

The flow is the following:

PubSub > Dataflow > Cloud Storage

It does exactly what I need (except data validation), but it looks like Dataflow is limited to Java and Python, and I'd rather use PHP.
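
For reference, that pipeline is short to express in Apache Beam. Here is a minimal sketch, assuming the Beam Java SDK running on Dataflow; the project, subscription and bucket names, the window size, and the one-line validation rule are all placeholders:

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.options.StreamingOptions;
    import org.apache.beam.sdk.transforms.Filter;
    import org.apache.beam.sdk.transforms.windowing.FixedWindows;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.joda.time.Duration;

    public class PubSubToGcs {
      public static void main(String[] args) {
        // launch with --runner=DataflowRunner --project=... --region=...
        StreamingOptions options =
            PipelineOptionsFactory.fromArgs(args).withValidation().as(StreamingOptions.class);
        options.setStreaming(true);

        Pipeline pipeline = Pipeline.create(options);
        pipeline
            .apply("ReadFromPubSub", PubsubIO.readStrings()
                .fromSubscription("projects/my-project/subscriptions/my-sub"))
            // placeholder validation rule: keep only payloads that look like JSON objects
            .apply("Validate", Filter.by((String json) -> json.trim().startsWith("{")))
            // group messages into 1-minute windows so each file covers a fixed interval
            .apply("Window", Window.into(FixedWindows.of(Duration.standardMinutes(1))))
            .apply("WriteToGcs", TextIO.write()
                .to("gs://my-bucket/raw/events")
                .withSuffix(".json")
                .withWindowedWrites()
                .withNumShards(10)); // unbounded writes need an explicit shard count
        pipeline.run();
      }
    }

Windowed writes like this group many messages into a few objects per window instead of one object per request.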

Another relevant example is https://medium.com/google-cloud/cloud-run-using-pubsub-triggers-2db74fc4ac6d

It uses Cloud Run, which supports PHP, and PubSub to trigger the Cloud Run workload. So it goes like:

PubSub > Cloud Run 

and working with Cloud Storage from Cloud Run looks pretty simple.
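
For illustration, a push-triggered Cloud Run service along those lines could look like this minimal Java sketch (the bucket name and the validation rule are placeholder assumptions, and the envelope parsing is deliberately crude; a real service would use a JSON library):

    import com.google.cloud.storage.BlobId;
    import com.google.cloud.storage.BlobInfo;
    import com.google.cloud.storage.Storage;
    import com.google.cloud.storage.StorageOptions;
    import com.sun.net.httpserver.HttpServer;
    import java.io.InputStream;
    import java.net.InetSocketAddress;
    import java.nio.charset.StandardCharsets;
    import java.util.Base64;
    import java.util.UUID;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class PubSubPushHandler {
      private static final Storage storage = StorageOptions.getDefaultInstance().getService();
      // crude extraction of the base64 "data" field from the PubSub push envelope
      private static final Pattern DATA = Pattern.compile("\"data\"\\s*:\\s*\"([^\"]+)\"");

      public static void main(String[] args) throws Exception {
        int port = Integer.parseInt(System.getenv().getOrDefault("PORT", "8080"));
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/", exchange -> {
          try (InputStream in = exchange.getRequestBody()) {
            String envelope = new String(in.readAllBytes(), StandardCharsets.UTF_8);
            Matcher m = DATA.matcher(envelope);
            if (m.find()) {
              String json = new String(Base64.getDecoder().decode(m.group(1)),
                  StandardCharsets.UTF_8);
              // placeholder validation rule; replace with your real checks
              if (json.trim().startsWith("{")) {
                BlobId blobId = BlobId.of("my-bucket", "raw/" + UUID.randomUUID() + ".json");
                storage.create(
                    BlobInfo.newBuilder(blobId).setContentType("application/json").build(),
                    json.getBytes(StandardCharsets.UTF_8));
              }
            }
          }
          exchange.sendResponseHeaders(204, -1); // a 2xx response acks the message
          exchange.close();
        });
        server.start();
      }
    }

Note that this writes one object per message, so at ~305k requests per second it would create ~305k objects per second; grouping messages into fewer, larger files needs something extra (windowing in Dataflow, or batching in the service).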

Am I on the right track? Can something like the above work for me, or do I need something different?

1
Do you want to create 1 file per request, or to group the requests into batches (for example, 1 file per minute)? What is the purpose of your files? What will you do with them afterwards? – guillaume blaquiere
The best option would be grouping messages into fixed-size intervals (as happens in the second example). The files serve as raw data storage for later use with BigQuery, but that's not essential for now. Right now it's important to pick the proper services. Should we listen to requests using App Engine or Cloud Run, or is it better to publish directly to PubSub (and what goes next, GAE, GCR)? – Vlad

1 Answer

1
votes

My first intuition when I saw 305k requests per second and PubSub was this pattern:

PubSub > Dataflow > Bigtable

Your answer to my question validates the choice of Bigtable, because you can query Bigtable tables from BigQuery for later analysis.

Of course, it's expensive, but you get a very scalable system.

An alternative, if your process fits within the BigQuery streaming quotas, is to stream directly into BigQuery instead of Bigtable.

PubSub > Dataflow > BigQuery
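
This variant only swaps the sink of the Beam pipeline. A minimal sketch of the BigQuery version, assuming a pre-created table with a single raw_json STRING column (project, subscription, and table names are placeholders; the Bigtable variant would use BigtableIO as the sink instead):

    import com.google.api.services.bigquery.model.TableRow;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
    import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.options.StreamingOptions;
    import org.apache.beam.sdk.transforms.MapElements;
    import org.apache.beam.sdk.values.TypeDescriptor;

    public class PubSubToBigQuery {
      public static void main(String[] args) {
        StreamingOptions options =
            PipelineOptionsFactory.fromArgs(args).withValidation().as(StreamingOptions.class);
        options.setStreaming(true);

        Pipeline pipeline = Pipeline.create(options);
        pipeline
            .apply("ReadFromPubSub", PubsubIO.readStrings()
                .fromSubscription("projects/my-project/subscriptions/my-sub"))
            // wrap each raw payload in a row; a real pipeline would parse the JSON fields
            .apply("ToTableRow", MapElements.into(TypeDescriptor.of(TableRow.class))
                .via((String json) -> new TableRow().set("raw_json", json)))
            .apply("StreamToBigQuery", BigQueryIO.writeTableRows()
                .to("my-project:my_dataset.raw_events")
                .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
                .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
                .withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS));
        pipeline.run();
      }
    }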

The problem with a solution built on Cloud Run or App Engine is that you need to run a process externally (for example, with Cloud Scheduler), and in this process you perform a loop that pulls messages from the PubSub subscription. You will run into several difficulties (a sketch of such a pull loop follows the list):

  • PubSub provides at-least-once delivery, so duplicate messages can be a concern. Dataflow manages this automatically.
  • The memory limits of App Engine and Cloud Run can be an issue, especially if your language is not memory efficient.
  • Pulling velocity can be a concern, and parallelism can be a challenge.
  • Pulling duration is limited to a few minutes (because of the maximum request duration on Cloud Run and App Engine), so you have to exit gracefully and wait for the next Cloud Scheduler trigger to start pulling from PubSub again.
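
To make those limitations concrete, here is a minimal sketch of such a scheduler-triggered pull process, using the Java Pub/Sub client library (the subscription name and the 9-minute pulling budget are placeholder assumptions):

    import com.google.cloud.pubsub.v1.Subscriber;
    import com.google.pubsub.v1.ProjectSubscriptionName;
    import java.util.concurrent.TimeUnit;

    public class ScheduledPuller {
      public static void main(String[] args) throws Exception {
        ProjectSubscriptionName subscription =
            ProjectSubscriptionName.of("my-project", "my-sub");

        // at-least-once delivery: this callback must be idempotent,
        // because the same message can arrive twice
        Subscriber subscriber = Subscriber.newBuilder(subscription, (message, consumer) -> {
          String json = message.getData().toStringUtf8();
          // ... validate and write to Cloud Storage here ...
          consumer.ack();
        }).build();

        subscriber.startAsync().awaitRunning();
        // pull for a bounded window, then exit gracefully before the request
        // deadline; the next Cloud Scheduler trigger starts the loop again
        TimeUnit.MINUTES.sleep(9);
        subscriber.stopAsync().awaitTerminated();
      }
    }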

EDIT

I forgot that you didn't want to code in Java or Python. I can propose 2 alternatives if your process is really simple:

Personal opinion: the coding language doesn't matter; use the right tool for the right job. Using Cloud Run or App Engine for this will create a much more unstable and hard-to-maintain system than learning how to write 10 lines of Java code.