I have a large data set of millions of records across 5 tables. I am flattening the tables and trying to upload them to BigQuery as one batch job. I will be using a Ruby script to connect to MySQL, run the query, and batch upload the results into BigQuery.
I will use this wrapper to connect to BigQuery: https://github.com/abronte/BigQuery
And this gem to connect to MySQL: https://rubygems.org/gems/mysql
The idea is to query 100k records from MySQL at a time and upload them, but I don't want to hit the streaming limits.
The following limits apply for streaming data into BigQuery:
Maximum row size: 1 MB
HTTP request size limit: 10 MB
Maximum rows per second: 100,000 rows per second, per table. Exceeding this amount will cause quota_exceeded errors.
Maximum rows per request: 500
Maximum bytes per second: 100 MB per second, per table. Exceeding this amount will cause quota_exceeded errors.
Source: https://cloud.google.com/bigquery/streaming-data-into-bigquery
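To make the limits concrete, here is a rough sketch of how I imagine splitting each 100k-row result into requests. It is plain Ruby and does not call either library. The helper name `batches_within_limits` and the constants are my own, and I am assuming that the JSON size of the rows is a reasonable approximation of the request size and that "MB" means 1,000,000 bytes.

```ruby
require 'json'

# Limits taken from the streaming documentation quoted above.
# Assumption: "MB" is treated as 1,000,000 bytes to stay conservative.
MAX_ROWS_PER_REQUEST = 500
MAX_REQUEST_BYTES    = 10 * 1000 * 1000
MAX_ROW_BYTES        = 1 * 1000 * 1000

# Hypothetical helper: splits rows (an array of hashes) into batches that
# respect the row-count and size limits. The JSON size of each row is only
# an approximation of what it contributes to the real HTTP request body.
def batches_within_limits(rows,
                          max_rows: MAX_ROWS_PER_REQUEST,
                          max_bytes: MAX_REQUEST_BYTES,
                          max_row_bytes: MAX_ROW_BYTES)
  batches = []
  current = []
  current_bytes = 0

  rows.each do |row|
    row_bytes = JSON.generate(row).bytesize
    if row_bytes > max_row_bytes
      raise ArgumentError, "row is #{row_bytes} bytes, limit is #{max_row_bytes}"
    end

    # Start a new batch when adding this row would break either limit.
    batch_full = current.size >= max_rows || current_bytes + row_bytes > max_bytes
    if !current.empty? && batch_full
      batches << current
      current = []
      current_bytes = 0
    end

    current << row
    current_bytes += row_bytes
  end

  batches << current unless current.empty?
  batches
end
```

Each batch would then go out as one insert request. Since the real request carries some overhead beyond the row data, I would probably pass a smaller `max_bytes` to leave headroom.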
Questions:
(1) Am I re-inventing the wheel, or is there something out there that does this already?
(2) Is there an easy way to mark what was uploaded to BigQuery to prevent duplicates?
(3) Is there any way to avoid hitting these limits?
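For question (2), one idea I am considering: as far as I can tell, the underlying streaming API (`tabledata.insertAll`) accepts an optional `insertId` per row, which BigQuery uses for best-effort de-duplication. A minimal sketch of deriving a stable ID, assuming every flattened row has a unique `id` column (that column name and both helper names are hypothetical, and I don't know whether the wrapper above exposes `insertId`):

```ruby
require 'digest'

# Hypothetical helper: the same table name and primary key always produce
# the same ID, so a retried upload would send an identical insertId.
def insert_id_for(table, primary_key)
  Digest::SHA1.hexdigest("#{table}:#{primary_key}")
end

# Wraps each row in the { insertId, json } shape used by the rows array of
# an insertAll request body. Assumes each row has the key column present.
def to_insert_all_rows(table, rows, key: 'id')
  rows.map do |row|
    { 'insertId' => insert_id_for(table, row.fetch(key)), 'json' => row }
  end
end
```

Is something like this reliable enough, or do I need to track uploaded ranges myself on the MySQL side?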