Looking for some advice on how best to architect and build our pipeline.
After some initial testing, we're not getting the results that we were expecting. Maybe we're just doing something stupid, or our expectations are too high.
Our data/workflow:
- Google DFP writes our ad server logs (compressed CSV) directly to GCS, hourly.
- A day's worth of these logs contains somewhere in the region of 30-70 million records, and a full month around 1.5-2 billion.
- We perform a transformation on 2 of the fields and write each row to BigQuery.
- The transformation applies 3 regex operations (due to grow to around 50) to those 2 fields, producing new fields/columns (see the sketch after this list).
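For concreteness, a stripped-down version of our transformation DoFn is below (Dataflow Java SDK; the CSV field positions, regex patterns and output column names here are illustrative placeholders rather than our real ones):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import com.google.api.services.bigquery.model.TableRow;
import com.google.cloud.dataflow.sdk.transforms.DoFn;

/** Maps one CSV log line to one BigQuery row, deriving extra columns via regex. */
public class LogLineToTableRowFn extends DoFn<String, TableRow> {

  // Patterns are compiled once per DoFn instance rather than once per element.
  private static final Pattern AD_UNIT_PATTERN = Pattern.compile("^([^/]+)/");
  private static final Pattern TARGETING_PATTERN = Pattern.compile("key=([^;]+)");

  @Override
  public void processElement(ProcessContext c) {
    // Illustrative only: the real logs have many more columns.
    String[] fields = c.element().split(",", -1);

    TableRow row = new TableRow()
        .set("ad_unit_raw", fields[3])
        .set("custom_targeting_raw", fields[7]);

    // Derive the new columns from the two fields via regex.
    Matcher adUnit = AD_UNIT_PATTERN.matcher(fields[3]);
    if (adUnit.find()) {
      row.set("ad_unit_top_level", adUnit.group(1));
    }
    Matcher targeting = TARGETING_PATTERN.matcher(fields[7]);
    if (targeting.find()) {
      row.set("targeting_key", targeting.group(1));
    }

    c.output(row);
  }
}
```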
What we've got running so far:
- Built a pipeline that reads a day's worth of files from GCS (31.3m records) and uses a ParDo to perform the transformation (we thought we'd start with just a day, but our requirement is to process months and years too).
- The DoFn's input is a String and its output is a BigQuery TableRow (a stripped-down sketch of the pipeline wiring follows this list).
- The pipeline is executed in the cloud with instance type "n1-standard-1" (1 vCPU), as we think 1 vCPU per worker is adequate given that the transformation is neither overly complex nor CPU-intensive, i.e. it is just a mapping of Strings to Strings.
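Stripped down, the pipeline wiring looks roughly like this (bucket path, table name and schema are placeholders; project, runner and staging location come in via command-line args):

```java
import java.util.Arrays;

import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableSchema;
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.BigQueryIO;
import com.google.cloud.dataflow.sdk.io.TextIO;
import com.google.cloud.dataflow.sdk.options.DataflowPipelineOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.transforms.ParDo;

public class DfpLogsToBigQuery {
  public static void main(String[] args) {
    // --project, --runner, --stagingLocation etc. are passed on the command line.
    DataflowPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(DataflowPipelineOptions.class);

    // Placeholder schema: the real one mirrors the DFP log columns plus the derived fields.
    TableSchema schema = new TableSchema().setFields(Arrays.asList(
        new TableFieldSchema().setName("ad_unit_raw").setType("STRING"),
        new TableFieldSchema().setName("custom_targeting_raw").setType("STRING"),
        new TableFieldSchema().setName("ad_unit_top_level").setType("STRING"),
        new TableFieldSchema().setName("targeting_key").setType("STRING")));

    Pipeline p = Pipeline.create(options);
    p.apply(TextIO.Read.from("gs://our-bucket/dfp-logs/2016-01-01/*.csv.gz"))
     .apply(ParDo.of(new LogLineToTableRowFn()))
     .apply(BigQueryIO.Write.to("our-project:dfp.ad_server_logs")
         .withSchema(schema)
         .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
         .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

    p.run();
  }
}
```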
We've run the job using a few different worker configurations to see how it performs (these are set via pipeline options, as sketched after this list):
- 5 workers (5 vCPUs) took ~17 mins
- 5 workers (10 vCPUs) took ~16 mins (for this run we bumped the instance type up to "n1-standard-2" to double the cores per worker and see whether that improved performance)
- 50 min and 100 max workers with autoscaling set to BASIC (50-100 vCPUs) took ~13 mins
- 100 min and 150 max workers with autoscaling set to BASIC (100-150 vCPUs) took ~14 mins
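For reference, the worker settings are just Dataflow pipeline options applied before the pipeline is created, roughly as below; the values shown correspond to the third run, and the same settings can equally be passed as --workerMachineType, --numWorkers, --maxNumWorkers and --autoscalingAlgorithm on the command line:

```java
import com.google.cloud.dataflow.sdk.options.DataflowPipelineOptions;
import com.google.cloud.dataflow.sdk.options.DataflowPipelineWorkerPoolOptions.AutoscalingAlgorithmType;

public class WorkerConfig {
  /** Applied to the options before Pipeline.create(options). */
  static void configureWorkers(DataflowPipelineOptions options) {
    options.setWorkerMachineType("n1-standard-1");

    // Fixed-size runs: only the worker count is set (5 for the first two runs).
    // Autoscaling runs: BASIC autoscaling between an initial and a maximum worker count.
    options.setAutoscalingAlgorithm(AutoscalingAlgorithmType.BASIC);
    options.setNumWorkers(50);      // initial workers (the "min" in the runs above)
    options.setMaxNumWorkers(100);  // autoscaling ceiling
  }
}
```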
Would those times be in line with what you would expect for our use case and pipeline?