1 vote

Looking for some advice on how best to architect/design and build our pipeline.

After some initial testing, we're not getting the results that we were expecting. Maybe we're just doing something stupid, or our expectations are too high.

Our data/workflow:

  • Google DFP writes our adserver logs (CSV compressed) directly to GCS (hourly).
  • A day's worth of these logs has in the region of 30-70 million records, and about 1.5-2 billion for the month.
  • Perform transformation on 2 of the fields, and write the row to BigQuery.
  • The transformation involves running 3 regex operations (due to increase to around 50) against 2 of the fields, which produces new fields/columns (see the sketch after this list).
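
Roughly, the transform looks like the DoFn below (Beam-style Java; the field positions, output column names, and regexes are simplified placeholders rather than our real ones):

```java
// Sketch of the per-record transform: parse one CSV log line, run a couple of
// regexes over two fields, and emit a BigQuery TableRow. Field positions,
// output column names, and the patterns themselves are made-up placeholders.
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.transforms.DoFn;

public class TransformLogLine extends DoFn<String, TableRow> {

  // Compile the patterns once, not per element.
  private static final Pattern DEVICE = Pattern.compile("(?i)(iphone|ipad|android)");
  private static final Pattern COUNTRY = Pattern.compile("country=([A-Z]{2})");

  @ProcessElement
  public void processElement(ProcessContext c) {
    String[] fields = c.element().split(",", -1);  // one DFP log record
    String userAgent = fields[5];                  // hypothetical positions
    String customTargeting = fields[9];

    TableRow row = new TableRow()
        .set("time", fields[0])
        .set("ad_unit", fields[1]);

    Matcher m = DEVICE.matcher(userAgent);
    row.set("device", m.find() ? m.group(1).toLowerCase() : "other");

    m = COUNTRY.matcher(customTargeting);
    row.set("country", m.find() ? m.group(1) : "unknown");

    c.output(row);
  }
}
```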

What we've got running so far:

  • Built a pipeline that reads a day's worth of the files from GCS (31.3m records) and uses a ParDo to perform the transformation (we thought we'd start with just a day, but our requirement is to process months & years too).
  • DoFn input is a String, and its output is a BigQuery TableRow.
  • The pipeline is executed in the cloud with instance type "n1-standard-1" (1 vCPU), as we think 1 vCPU per worker is adequate given that the transformation is not overly complex or CPU-intensive, i.e. it's just a mapping of Strings to Strings (the pipeline wiring is sketched below).
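
Wiring-wise, the pipeline is essentially the following (Beam-style Java; bucket, project, dataset and table names are placeholders, and it reuses the DoFn sketched above):

```java
// Rough shape of the current pipeline. Bucket, project, dataset and table
// names are placeholders, not our real ones.
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.ParDo;

public class DfpLogsToBigQuery {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply("ReadLogs", TextIO.read()
            .from("gs://my-bucket/dfp/2015/06/01/*.csv.gz"))   // gzip handled by extension
     .apply("Transform", ParDo.of(new TransformLogLine()))     // DoFn sketched above
     .apply("WriteToBQ", BigQueryIO.writeTableRows()
            .to("my-project:adserver.dfp_logs")                // table created beforehand
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

    p.run();
  }
}
```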

We've run the job using a few different worker configurations to see how it performs:

  1. 5 workers (5 vCPUs) took ~17 mins
  2. 5 workers (10 vCPUs) took ~16 mins (in this run we bumped the instance up to "n1-standard-2" to double the cores and see whether it improved performance)
  3. 50 min and 100 max workers with autoscale set to "BASIC" (50-100 vCPUs) took ~13 mins
  4. 100 min and 150 max workers with autoscale set to "BASIC" (100-150 vCPUs) took ~14 mins

Would those times be in line with what you would expect for our use case and pipeline?


2 Answers

2 votes

You can also write the output to files and then load them into BigQuery using the command line or the console. You'd probably save some money on instance uptime; this is what I've been doing after running into issues with the Dataflow/BigQuery interface. Also, in my experience there is some overhead in bringing instances up and tearing them down (it can be 3-5 minutes). Do you include that time in your measurements as well?
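
As a rough sketch of what I mean (placeholder paths, column names, and schema; it assumes the same TransformLogLine DoFn sketched in the question): write the transformed rows to GCS as CSV and then load those files in one batch load job instead of streaming them in.

```java
// Variant of the pipeline that writes transformed rows to GCS as CSV files
// instead of going through the BigQuery sink. Paths, column names, and the
// schema are placeholders, and TransformLogLine is the hypothetical DoFn
// sketched in the question.
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.TypeDescriptors;

public class DfpLogsToGcsFiles {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply("ReadLogs", TextIO.read().from("gs://my-bucket/dfp/2015/06/01/*.csv.gz"))
     .apply("Transform", ParDo.of(new TransformLogLine()))
     .apply("ToCsv", MapElements.into(TypeDescriptors.strings())
         .via((TableRow row) -> String.join(",",
             (String) row.get("time"), (String) row.get("ad_unit"),
             (String) row.get("device"), (String) row.get("country"))))
     .apply("WriteFiles", TextIO.write()
         .to("gs://my-bucket/dfp-transformed/2015-06-01/part")
         .withSuffix(".csv"));

    p.run();

    // Once the job is done, load the shards as a single batch load job, e.g.:
    //   bq load --source_format=CSV adserver.dfp_logs \
    //       "gs://my-bucket/dfp-transformed/2015-06-01/part*.csv" \
    //       time:STRING,ad_unit:STRING,device:STRING,country:STRING
  }
}
```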

1 vote

BigQuery has a write limit of 100,000 rows per second per table, i.e. 6 million rows per minute. At 31M rows of input, that alone works out to roughly 5 minutes of flat-out writes. Once you add back the per-element processing time and the synchronization time of the graph (read from GCS -> dispatch -> ...), your numbers look about right.

We are working on a table sharding model so you can write across a set of tables and then use table wildcards within BigQuery to aggregate across them (a common model for the typical BigQuery streaming use case). I know the BigQuery folks are also looking at raising the table streaming limits, but there's nothing official to share yet.

Net-net, increasing the number of instances is not going to get you much more throughput right now.

Another approach - in the meantime, while we work on improving the BigQuery sync - would be to shard your reads using pattern matching via TextIO and then run X separate pipelines targeting X tables. Might be a fun experiment. :-)
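
A hypothetical sketch of that experiment (made-up file patterns, table names, and shard scheme): launch the same job N times, each copy reading a disjoint slice of the GCS files and writing to its own table (dfp_logs_0 .. dfp_logs_9), then aggregate across them with a table wildcard in BigQuery.

```java
// Hypothetical sharded version of the pipeline: launch this job once per
// shard, each run reading a disjoint file pattern and writing to its own
// table. File patterns, table names, and the shard scheme are all made up.
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.options.Default;
import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.ParDo;

public class ShardedDfpPipeline {

  public interface ShardOptions extends PipelineOptions {
    @Description("Shard id, e.g. the last digit of the hourly file name")
    @Default.String("0")
    String getShard();
    void setShard(String value);
  }

  public static void main(String[] args) {
    ShardOptions opts =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(ShardOptions.class);
    Pipeline p = Pipeline.create(opts);

    p.apply("ReadShard", TextIO.read()
            .from("gs://my-bucket/dfp/2015/06/01/*_" + opts.getShard() + ".csv.gz"))
     .apply("Transform", ParDo.of(new TransformLogLine()))
     .apply("WriteShardTable", BigQueryIO.writeTableRows()
            .to("my-project:adserver.dfp_logs_" + opts.getShard())  // pre-created table
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

    p.run();
  }
}
```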

Make sense?