0 votes

I'm trying to figure out the most efficient way to load data from Cloud Storage into BigQuery. By efficient I mean:

- Preferably I don't want to split the data into smaller sets and upload each subset separately.
- Uncompressing the data so I can leverage the 5 TB per-file limit of the bq tool is also not practical for my purpose, which is to upload more than 30 compressed (>4 GB) datasets into separate BigQuery tables.

I've looked into the following alternatives:

1. bq load - it has a 4 GB limit on compressed data.
2. Split the data, convert it to a pandas dataframe, convert that to JSON, and load it into BQ (see the sketch below) - this requires splitting, which I would like to avoid given the number of datasets I have to upload; they all have different row sizes, so automating the task in hand would not be trivial.
3. Cloud Dataflow?
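For context, a rough sketch of what option 2 would look like with the Python client library (the file, dataset and table names are placeholders, and the chunk size would need tuning per dataset):

```python
# Rough sketch of option 2: read the CSV in chunks with pandas, convert each
# chunk to JSON rows, and batch-load them into the same BigQuery table.
# File, dataset and table names below are placeholders.
import pandas as pd
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my_project.my_dataset.my_table"  # placeholder

job_config = bigquery.LoadJobConfig(
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    autodetect=True,
)

# The chunk size has to be tuned per dataset because row sizes differ,
# which is the part that is hard to automate across 30+ datasets.
for chunk in pd.read_csv("my_data.csv.gz", chunksize=1_000_000):
    rows = chunk.to_dict(orient="records")              # dataframe -> JSON rows
    job = client.load_table_from_json(rows, table_id, job_config=job_config)
    job.result()                                        # wait for each batch load
```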

I haven't looked into the details of Dataflow, but my overall question is this: for large datasets, which is the better option - splitting the data and running batch uploads, or streaming it, i.e. inserting one row at a time, from Cloud Storage into BQ?

In your case, I would 100% vote against streaming. – Mikhail Berlyant
Note that .gz files are not a shardable compression format, so using very large .gz files for data storage is going to present problems in any large-scale data analysis framework. Even if 100 GB .gz files are supported, it's going to take a long time for a single worker to process that file. – Michael Sheldon

1 Answer

1 vote

Almost certainly, the simplest option will be to decompress the data, copy it into GCS, and load it into BigQuery from there. All of your other suggestions are either going to be more expensive, more work, or less likely to scale.

The only downsides of using decompressed input data are that the network upload takes longer and you'll pay more to store it in GCS. That said, it only needs to be in GCS for a few minutes while you load it into BigQuery.
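If it helps, here is a minimal sketch of that workflow using the Python client libraries, assuming the compressed file is available locally; the bucket, dataset and table names are placeholders:

```python
# Minimal sketch: decompress locally, stage the uncompressed file in GCS,
# load it into BigQuery, then delete the temporary GCS object.
# Bucket, dataset and table names are placeholders.
import gzip
import shutil

from google.cloud import bigquery, storage

bucket_name = "my-staging-bucket"             # placeholder
table_id = "my_project.my_dataset.my_table"   # placeholder

# 1. Decompress, so the 4 GB limit on compressed files no longer applies.
with gzip.open("my_data.csv.gz", "rb") as src, open("my_data.csv", "wb") as dst:
    shutil.copyfileobj(src, dst)

# 2. Copy the uncompressed file into GCS.
blob = storage.Client().bucket(bucket_name).blob("staging/my_data.csv")
blob.upload_from_filename("my_data.csv")

# 3. Load it into BigQuery straight from GCS.
client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    autodetect=True,        # or supply an explicit schema
    skip_leading_rows=1,    # assuming a header row
)
load_job = client.load_table_from_uri(
    f"gs://{bucket_name}/staging/my_data.csv", table_id, job_config=job_config
)
load_job.result()           # wait for the load job to finish

# 4. The file only needs to live in GCS while the load runs.
blob.delete()
```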