I'm trying to figure out the most efficient way to load data from Cloud Storage into BigQuery. By efficient I mean:

- Preferably I don't want to split the data into smaller sets and upload each subset separately.
- Uncompressing the data so I can leverage the 5 TB uncompressed limit of the `bq` tool is also not practical for my purpose, which is to upload more than 30 datasets, each larger than 4 GB compressed, into separate BigQuery tables.
I've looked into the following alternatives:

1. `bq load` - has a 4 GB limit per compressed file (sketch below).
2. Split the data, convert it to a pandas DataFrame, convert that to JSON, and load it into BQ - this requires splitting, which I'd like to avoid given the number of datasets I have to upload and the fact that they all have different row sizes, so automating the task at hand won't be trivial.
3. Cloud Dataflow?
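For context, here's roughly what option 1 looks like for me with the Python client (the bucket path, table ID, and schema settings below are placeholders, and autodetect is just for illustration). This is the path that hits the 4 GB per-file limit with gzipped input:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Placeholder job config: CSV with a header row, schema autodetected.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,
)

# Placeholder URI and table ID; each matched .csv.gz file
# must stay under the 4 GB compressed-file limit.
uri = "gs://my-bucket/dataset1/*.csv.gz"
load_job = client.load_table_from_uri(
    uri,
    "my-project.my_dataset.my_table",
    job_config=job_config,
)
load_job.result()  # wait for the load job to finish
```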
I haven't looked into the details of Dataflow yet, but my overall question is this: for large datasets, which is the better option - splitting the data and running batch uploads, or streaming the data, i.e. inserting one row at a time from Cloud Storage into BQ?
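To make the streaming half of that concrete, by "insert one row at a time" I mean something along these lines with the Python client (the table ID and row contents are placeholders; in practice each row would be parsed from the file in Cloud Storage):

```python
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.my_dataset.my_table"  # placeholder

# Placeholder rows; in my case these would be read one at a time
# from the compressed file sitting in Cloud Storage.
rows_to_insert = [
    {"col_a": "value1", "col_b": 42},
    {"col_a": "value2", "col_b": 7},
]

# Streaming insert; returns a list of per-row errors, empty on success.
errors = client.insert_rows_json(table_id, rows_to_insert)
if errors:
    print("Streaming insert errors:", errors)
```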