20 votes

We have a NodeJS API hosted on Google Kubernetes Engine, and we'd like to start logging events into BigQuery.

I can see 3 different ways of doing that:

  1. Insert each event directly into BigQuery using the Node BigQuery SDK in the API (as described under "Streaming Insert Examples" here: https://cloud.google.com/bigquery/streaming-data-into-bigquery or here: https://github.com/googleapis/nodejs-bigquery/blob/7d7ead644e1b9fe8428462958dbc9625fe6c99c8/samples/tables.js#L367) - see the sketch just after this list
  2. Publish each event to a Cloud Pub/Sub topic, then write a Cloud Dataflow pipeline to stream that into BigQuery (in Java or Python only, it seems), like here: https://blog.doit-intl.com/replacing-mixpanel-with-bigquery-dataflow-and-kubernetes-b5f844710674 or here: https://github.com/bomboradata/pubsub-to-bigquery
  3. Publish each event to a Pub/Sub topic from the API, but instead of Dataflow use a custom worker process that subscribes to the Pub/Sub topic on one side and streams into BQ on the other, like here: https://github.com/GoogleCloudPlatform/kubernetes-bigquery-python/blob/master/pubsub/pubsub-pipe-image/pubsub-to-bigquery.py or here: https://github.com/mchon89/Google_PubSub_BigQuery/blob/master/pubsub_to_bigquery.py
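
To make this concrete, here is roughly what option 1 would look like with the Node SDK (dataset, table and field names below are just placeholders, not our real schema):

    // Minimal sketch of option 1: stream each event straight into BigQuery
    // from the Node API. Dataset/table/field names are placeholders.
    const { BigQuery } = require('@google-cloud/bigquery');

    const table = new BigQuery().dataset('analytics').table('events');

    async function trackEvent(event) {
      // With { raw: true }, each row carries an insertId that BigQuery uses
      // for best-effort deduplication when a request is retried.
      await table.insert(
        [{ insertId: event.id, json: { name: event.name, userId: event.userId, ts: event.ts } }],
        { raw: true }
      );
    }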

For this particular use case, we don't need to do any transforms and will just send events already in the right format. But we may later have other use cases where we'll need to sync tables from our main datastore (MySQL) into BQ for analytics, so maybe starting with Dataflow straight away is worth it?

A few questions :

  • Option 1 (sending single events straight to BQ) seems simplest if you don't have any transforms to do. Is it just as fast and reliable as publishing to a Pub/Sub topic? I'm mainly concerned about latency and error/duplication handling (https://cloud.google.com/bigquery/troubleshooting-errors#streaming). Maybe this is better done in a separate process?
  • For Option 2, are there any Dataflow "presets" that don't require you to write custom code when all you need is to read from Pub/Sub and send reliably into BQ with no transforms (maybe just deduplication / error handling)?
  • Are there any drawbacks to having a simple custom worker (Option 3) that reads from Pub/Sub, then streams into BQ and does all the error handling / retrying etc.? I've included a rough sketch of such a worker below.
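
Here is the kind of option 3 worker I have in mind (subscription, dataset and table names are placeholders), pulling from Pub/Sub and only acking once a row has been accepted by BigQuery:

    // Rough sketch of the option 3 worker: pull events from Pub/Sub and
    // stream them into BigQuery, acking only after a successful insert so
    // failed messages get redelivered. Names below are placeholders.
    const { PubSub } = require('@google-cloud/pubsub');
    const { BigQuery } = require('@google-cloud/bigquery');

    const subscription = new PubSub().subscription('events-bq-worker');
    const table = new BigQuery().dataset('analytics').table('events');

    subscription.on('message', async (message) => {
      try {
        const event = JSON.parse(message.data.toString());
        // Using the Pub/Sub message ID as insertId gives best-effort
        // deduplication in BigQuery if the message is redelivered.
        await table.insert([{ insertId: message.id, json: event }], { raw: true });
        message.ack(); // only ack once the row is in BigQuery
      } catch (err) {
        console.error('Insert failed, message will be redelivered', err);
        message.nack();
      }
    });

    subscription.on('error', (err) => console.error('Subscription error', err));
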
Comment from Graham Polley (3 upvotes): Option 1 won't scale without some sort of producer/consumer pattern, i.e. using a queue to process events asynchronously. You also won't be able to handle errors properly, i.e. back-off-and-retry. Use: App -> PubSub -> Dataflow (streaming) -> BigQuery. That's the recommended pattern from Google, and the most fault-tolerant and scalable. You'll also get Stackdriver logging hooked up to the pipelines, etc.

3 Answers

4 votes

For Option 2, yes, there is a preset called a Google-provided Template that moves data from Pub/Sub to BigQuery without you having to write any code.

You can learn more about how to use this Google-provided Template, and others, in the Cloud Dataflow documentation.
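
For example, launching it from the gcloud CLI looks roughly like this at the time of writing (the job name, project, topic and output table are placeholders):

    gcloud dataflow jobs run events-pubsub-to-bq \
      --gcs-location gs://dataflow-templates/latest/PubSub_to_BigQuery \
      --region us-central1 \
      --parameters inputTopic=projects/MY_PROJECT/topics/events,outputTableSpec=MY_PROJECT:analytics.events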

3 votes

Another option is to export the logs using a log sink. Right from the Stackdriver Logging UI, you can specify BigQuery (or other destinations) for your logs. Since your Node API is running in Kubernetes, you just need to log messages to stdout and they'll automatically get written to Stackdriver.

Reference: https://cloud.google.com/logging/docs/export/configure_export_v2
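
The application side can then be as simple as writing one JSON line per event to stdout; the GKE logging agent turns it into a structured Stackdriver entry, and the sink exports it to BigQuery. A rough sketch (field names are just an example; the exported BigQuery schema follows the LogEntry structure, e.g. jsonPayload.eventName, not these keys directly):

    // One structured JSON line per event on stdout. On GKE this becomes a
    // Stackdriver (Cloud Logging) entry that a log sink can export to
    // BigQuery. Field names here are illustrative only.
    function logEvent(eventName, payload) {
      console.log(JSON.stringify({
        severity: 'INFO', // picked up as the log entry severity by the agent
        eventName,
        ...payload,
        time: new Date().toISOString(),
      }));
    }

    logEvent('signup', { userId: '1234', plan: 'free' });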

2 votes

I took a look at this, and it feels like the answers are somewhat lacking. Here is what I can tell you about the Pros and Cons of each approach:

  1. Writing a custom program (via the Node BQ API or a custom worker process) has a few pitfalls when it comes to exactly-once guarantees. Specifically, if you write your own worker, you will need to do extra work to checkpoint progress, and to make sure that no elements are dropped or duplicated in case of runtime errors or if your worker process dies.

  2. If your requirements change (e.g. performing BQ streaming inserts becomes too expensive), Dataflow's Java SDK seamlessly supports either option: streaming inserts, or the cheaper alternative of running multiple load jobs into BQ instead of streaming inserts; it also handles multiple data sources well.

  3. Dataflow provides autoscaling in case your data volume increases.

With that in mind, I would say:

  • If your use case is relatively simple, and you are okay with very rare data points being dropped on worker restarts, then a custom-written Node / Python application should do the trick for you.

  • If your use case involves only streaming PubSub to BQ, but you must make sure no data is dropped, check out the Template provided by Andrew which does exactly this.

  • If your use case is likely to be more complex than that, you may look into writing your own pipeline (and use the Template code as inspiration!).