Google Cloud DataFlow Randomize WritetoBigQuery

Question

I have succesfully implemented a dataflow pipeline that writes to BigQuery. This pipeline is transforming data for a Cloud ML Engine job. However, I noticed that the rows that have been written are ordered (or at least grouped) by the labels of my data. By this, I mean that they visually appear to be organized in some way (that is not completely random). Then when I export the table to sharded .csv's in GCS, each sharded .csv is essentially ordered. This means that the data cannot be entered into TensorFlow randomly since TF grabs one .csv at a time and the .csv's themselves are not random bags or rows.

Can anybody explain why the BigQuery table written by the apache beam pipeline would appear to be non-random if the original input data was randomized? Is there any way to force a shuffle/randomization of rows before writing to BigQuery? I need to ensure that the training data is completely random before being loaded into the ML model.

jkff jkff · Accepted Answer · 2017-10-16T21:12:05

BigQuery tables don't have the concept of order or grouping, they are just a bag of rows; if one needs ordering or grouping, one writes a query with an ORDER BY or GROUP BY clause. If you have code that reads rows from BigQuery and requires these rows to be read in random order, you can do something like https://www.oreilly.com/learning/repeatable-sampling-of-data-sets-in-bigquery-for-machine-learning

Google Cloud DataFlow Randomize WritetoBigQuery

1 Answers