We have a NodeJS API hosted on Google Kubernetes Engine, and we'd like to start logging events into BigQuery.
I can see three different ways of doing that:
- Insert each event directly into BigQuery using the Node BigQuery SDK in the API, as described under "Streaming Insert Examples" here: https://cloud.google.com/bigquery/streaming-data-into-bigquery or here: https://github.com/googleapis/nodejs-bigquery/blob/7d7ead644e1b9fe8428462958dbc9625fe6c99c8/samples/tables.js#L367 (a rough sketch of this is below the list)
- Publish each event to a Cloud Pub/Sub topic, then write a Cloud Dataflow pipeline to stream that into BigQuery (in Java or Python only, it seems), like here: https://blog.doit-intl.com/replacing-mixpanel-with-bigquery-dataflow-and-kubernetes-b5f844710674 or here: https://github.com/bomboradata/pubsub-to-bigquery
- Publish each event to a Pub/Sub topic from the API, but instead of Dataflow use a custom worker process that subscribes to the Pub/Sub topic on one side and streams into BQ on the other, like here: https://github.com/GoogleCloudPlatform/kubernetes-bigquery-python/blob/master/pubsub/pubsub-pipe-image/pubsub-to-bigquery.py or here: https://github.com/mchon89/Google_PubSub_BigQuery/blob/master/pubsub_to_bigquery.py (the publish side from the API, shared by this option and the previous one, is also sketched below)
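To make Option 1 concrete, here is roughly what I imagine it looking like in the API (just a sketch: the `analytics` dataset, `events` table and event shape are made-up names, and the exact client API may differ between versions of `@google-cloud/bigquery`):

```js
// Option 1 sketch: stream each event straight into BigQuery from the API process.
// Dataset "analytics" and table "events" are hypothetical and assumed to already exist.
const {BigQuery} = require('@google-cloud/bigquery');

const table = new BigQuery().dataset('analytics').table('events');

async function logEvent(event) {
  // raw: true lets us set an insertId, which BigQuery uses for best-effort
  // de-duplication if we retry the same insert after a transient error.
  await table.insert(
    [{insertId: event.id, json: event}],
    {raw: true}
  );
}
```

The open question for me is whether calling this inline on every API request is acceptable latency-wise, or whether it should be pushed to a separate process (see the questions below).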
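For Options 2 and 3 the API side would be the same: it only publishes the event to a topic. Something like this (again a sketch; the `events` topic name is made up and the Pub/Sub client API has changed between versions):

```js
// Options 2/3 sketch: the API just publishes each event to a Pub/Sub topic.
// Topic "events" is a hypothetical name.
const {PubSub} = require('@google-cloud/pubsub');

const topic = new PubSub().topic('events');

async function logEvent(event) {
  // Pub/Sub payloads are raw bytes, so serialize the event as JSON.
  await topic.publish(Buffer.from(JSON.stringify(event)));
}
```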
For this particular use case we don't need to do any transforms; we'd just send events that are already in the right format. But we may later have other use cases where we'll need to sync tables from our main datastore (MySQL) into BQ for analytics, so maybe starting with Dataflow straight away is worth it?
A few questions:
- Option 1 (sending each event straight to BQ) seems the simplest if you don't have any transforms to do. Is it just as fast and reliable as publishing to a Pub/Sub topic? I'm mainly concerned about latency and error/duplication handling (https://cloud.google.com/bigquery/troubleshooting-errors#streaming). Maybe this is better done in a separate process?
- For Option 2, are there any Dataflow "presets" that don't require you to write custom code when all you need is to read from Pub/Sub and write reliably into BQ with no transforms (maybe just deduplication / error handling)?
- Are there any drawbacks to having a simple custom worker (Option 3) that reads from Pub/Sub, then streams into BQ and does all the error handling / retrying itself (see the sketch below this list)?
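To make Option 3 concrete, the kind of worker I have in mind is roughly this (sketch only: the `events-sub` subscription, `analytics` dataset and `events` table are made-up names, and a real worker would need to distinguish bad payloads from transient BigQuery errors):

```js
// Option 3 sketch: a standalone worker that subscribes to Pub/Sub and streams into BigQuery.
// Subscription "events-sub", dataset "analytics" and table "events" are hypothetical names.
const {PubSub} = require('@google-cloud/pubsub');
const {BigQuery} = require('@google-cloud/bigquery');

const subscription = new PubSub().subscription('events-sub');
const table = new BigQuery().dataset('analytics').table('events');

subscription.on('message', async (message) => {
  try {
    const row = JSON.parse(message.data.toString());
    // Reuse the Pub/Sub message id as insertId so redeliveries don't create duplicate rows.
    await table.insert([{insertId: message.id, json: row}], {raw: true});
    message.ack();
  } catch (err) {
    // nack() asks Pub/Sub to redeliver later; a real worker would dead-letter
    // unparseable messages instead of retrying them forever.
    console.error('insert failed', err);
    message.nack();
  }
});

subscription.on('error', (err) => console.error('subscription error', err));
```

This is exactly the error handling / retrying / deduplication work I'd rather not reinvent if Dataflow already gives it to us.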
App -> PubSub -> Dataflow (streaming) -> BigQuery. That's the recommended pattern from Google, and the most fault-tolerant and scalable. You'll also get Stackdriver logging hooked up to the pipelines, etc. – Graham Polley