
I'm building a Change Data Capture pipeline that reads data from a MySQL database and creates a replica in BigQuery. I'll be pushing the changes to Pub/Sub and using Dataflow to transfer them to Google Cloud Storage. I have been able to figure out how to stream the changes, but I need to run batch processing for a few tables in my database.

Can Dataflow be used to run a batch job while reading from an unbounded source like Pub/Sub? Can I run this batch job to transfer data from Pub/Sub to Cloud Storage and then load that data into BigQuery? I want a batch job because a streaming job costs more.

How many files per table do you generate per day with Dataflow? Is your MySQL database on Cloud SQL? What delay can you accept between the MySQL update (and the Pub/Sub message publication) and the data being available in BigQuery? – guillaume blaquiere
1. I'm a little confused, what file are you talking about? Each table will have a corresponding topic in Pub/Sub and a corresponding job in Dataflow. 2. The database is hosted on Amazon RDS. 3. Since this is a batch process, I'll be running it, say, once a week, so this delay can be large since I don't want to stream these changes in real time. @guillaumeblaquiere – Ekansh Bansal
Is it possible to not involve GCS at all and directly send data from Pub/Sub to BQ, but as a batch job and not a streaming job? – Ekansh Bansal

1 Answer


Thank you for the clarification.

First, when you use Pub/Sub in Dataflow (the Beam framework), it is only supported in streaming mode:

Cloud Pub/Sub sources and sinks are currently supported only in streaming pipelines, during remote execution.

If your process doesn't need real time, you can skip Dataflow and save money. You can use Cloud Functions or Cloud Run for the process that I propose below (App Engine also works, but it's not my first recommendation).

In both cases, create a process (Cloud Run or Cloud Function) that is triggered periodically (every week?) by Cloud Scheduler.

Solution 1

  • Connect your process to the pull subscription
  • Every time you read a message (or a chunk of messages, for example 1,000), stream it into BigQuery (see the sketch after this list). -> However, streaming writes are not free in BigQuery ($0.05 per GB)
  • Loop until the queue is empty. Set the timeout to the max value (9 minutes with Cloud Functions, 15 minutes with Cloud Run) to prevent any timeout issue
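
A minimal sketch of solution 1 in Python, assuming hypothetical project, subscription, and table names, and assuming each Pub/Sub message body is a JSON object that already matches the BigQuery table schema:

```python
import json

from google.cloud import bigquery, pubsub_v1

PROJECT_ID = "my_project"                      # hypothetical project
SUBSCRIPTION = "mysql-cdc-sub"                 # hypothetical pull subscription
TABLE_ID = "my_project.cdc_dataset.my_table"   # hypothetical table

subscriber = pubsub_v1.SubscriberClient()
bq = bigquery.Client()
sub_path = subscriber.subscription_path(PROJECT_ID, SUBSCRIPTION)

while True:
    # Synchronous pull of up to 1,000 messages (may return fewer).
    response = subscriber.pull(
        request={"subscription": sub_path, "max_messages": 1000}
    )
    if not response.received_messages:
        break  # queue is empty

    rows = [json.loads(m.message.data) for m in response.received_messages]

    # Streaming insert: billed at ~$0.05 per GB.
    errors = bq.insert_rows_json(TABLE_ID, rows)
    if errors:
        raise RuntimeError(f"BigQuery insert errors: {errors}")

    # Acknowledge only after the rows are safely in BigQuery.
    subscriber.acknowledge(
        request={
            "subscription": sub_path,
            "ack_ids": [m.ack_id for m in response.received_messages],
        }
    )
```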

Solution 2

  • Connect your process to the pull subscription
  • Read a chunk of messages (for example 1,000) and keep them in memory (in an array).
  • Loop until the queue is empty. Set the timeout to the max value (9 minutes with Cloud Functions, 15 minutes with Cloud Run) to prevent any timeout issue. Also set the memory to the max value (2 GB) to prevent out-of-memory crashes.
  • Create a load job into BigQuery from your in-memory data array (see the sketch after this list). -> Here the load job is free, but you are limited to 1,000 load jobs per day and per table.
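
A minimal sketch of solution 2, reusing the same hypothetical names as above; rows are buffered in memory and pushed with a single (free) load job:

```python
import json

from google.cloud import bigquery, pubsub_v1

PROJECT_ID = "my_project"                      # hypothetical project
SUBSCRIPTION = "mysql-cdc-sub"                 # hypothetical pull subscription
TABLE_ID = "my_project.cdc_dataset.my_table"   # hypothetical table

subscriber = pubsub_v1.SubscriberClient()
bq = bigquery.Client()
sub_path = subscriber.subscription_path(PROJECT_ID, SUBSCRIPTION)

buffered_rows = []
ack_ids = []

while True:
    response = subscriber.pull(
        request={"subscription": sub_path, "max_messages": 1000}
    )
    if not response.received_messages:
        break
    buffered_rows.extend(
        json.loads(m.message.data) for m in response.received_messages
    )
    ack_ids.extend(m.ack_id for m in response.received_messages)

if buffered_rows:
    # One load job for the whole batch: free, but capped at 1,000 jobs/day/table.
    job = bq.load_table_from_json(
        buffered_rows,
        TABLE_ID,
        job_config=bigquery.LoadJobConfig(autodetect=True),
    )
    job.result()  # wait for the load to finish before acknowledging

    # Acknowledge in chunks to stay under the Pub/Sub request size limit.
    for i in range(0, len(ack_ids), 1000):
        subscriber.acknowledge(
            request={"subscription": sub_path, "ack_ids": ack_ids[i : i + 1000]}
        )
```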

However, this solution can fail if your app plus the data size is larger than the max memory value. An alternative is to create a file in GCS every, for example, 1 million rows (depending on the size and memory footprint of each row). Name the files with a unique prefix, for example the date of the day (YYYYMMDD-tempFileXX), and increment the XX at each file creation. Then create a load job, not from data in memory, but from the data in GCS, with a wildcard in the file name (gs://myBucket/YYYYMMDD-tempFile*). That way, all the files that match the prefix will be loaded; a sketch of this variant follows below.
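
A minimal sketch of the GCS variant, assuming a hypothetical bucket "myBucket" and newline-delimited JSON files named with the day's date as a prefix:

```python
import datetime
import json

from google.cloud import bigquery, storage

BUCKET = "myBucket"                                 # hypothetical bucket
TABLE_ID = "my_project.cdc_dataset.my_table"        # hypothetical table
prefix = datetime.date.today().strftime("%Y%m%d")   # e.g. "20240115"

gcs = storage.Client()
bq = bigquery.Client()
bucket = gcs.bucket(BUCKET)


def flush_to_gcs(rows, file_index):
    """Write one chunk of rows as newline-delimited JSON: YYYYMMDD-tempFileXX."""
    blob = bucket.blob(f"{prefix}-tempFile{file_index:02d}")
    blob.upload_from_string("\n".join(json.dumps(r) for r in rows))


# ... pull messages as in the previous sketches, calling flush_to_gcs()
# every ~1 million rows, then load everything with one wildcard load job:
load_job = bq.load_table_from_uri(
    f"gs://{BUCKET}/{prefix}-tempFile*",
    TABLE_ID,
    job_config=bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
        autodetect=True,
    ),
)
load_job.result()  # wait for completion; the temp files can then be deleted
```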

Recommendation: Pub/Sub messages are kept for up to 7 days in a subscription. I recommend triggering your process at least every 3 days, so you have time to react and debug before messages are deleted from the subscription.

Personal experience: Streaming writes into BigQuery are cheap for a low volume of data. For a few cents, I recommend considering the 1st solution if you can pay for it. The management and the code are simpler!