
I'm having trouble loading huge data into BigQuery.

In GCS, I have many huge files, like this:

gs://bucket/many_folders/yyyy/mm/dd/many_files.gz

I want to load them into BigQuery, so first I tried:

bq load --source_format=NEWLINE_DELIMITED_JSON \
  --ignore_unknown_values \
  --max_bad_records=2100000000 \
  --nosync \
  project:dataset.table \
  gs://bucket/* \
  schema.txt

which failed because it exceeded the "max_bad_records" limit (the files are an aggregation of many types of logs, so they produce many errors).
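(Since --nosync only submits the job, the errors show up later; a minimal sketch of how to check them afterwards, with JOB_ID as a placeholder:)

# list recent jobs, then inspect the failed one
bq ls -j -n 10
bq show --format=prettyjson -j JOB_ID   # status.errors shows which records were rejected and why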

Then I calculated and found that I need to narrow the "*" to one directory per load, like:

bq load --source_format=NEWLINE_DELIMITED_JSON \
  --ignore_unknown_values \
  --max_bad_records=2100000000 \
  --nosync \
  project:dataset.table \
  gs://bucket/many_folders/yyyy/mm/dd/* \
  schema.txt

so that each load job stays within the max_bad_records limitation.
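In other words, one load job per day folder, roughly like this (just a sketch, looping over a single month as an example):

# submit one asynchronous load job per day directory
for day in $(gsutil ls gs://bucket/many_folders/yyyy/mm/); do
  bq load --source_format=NEWLINE_DELIMITED_JSON \
    --ignore_unknown_values \
    --max_bad_records=2100000000 \
    --nosync \
    project:dataset.table \
    "${day}*" \
    schema.txt
done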

But I found this is very slow (because of the parallel-run limitation in BigQuery), and it also exceeds the daily load job limit. I would prefer not to go with this option.

Any ideas for solving this situation? I want to load this data as fast as possible.

Thank you for reading.

You may use Dataflow to read your data from Cloud Storage and load it into BigQuery. There are several examples of how to read from multiple paths here: (1), (2). – Yurci
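For example, Google's "Text Files on Cloud Storage to BigQuery" Dataflow template covers this pattern; a rough sketch of launching it with gcloud (the parameter names, the UDF file, and the paths below are assumptions to verify against the template's documentation):

gcloud dataflow jobs run load-gcs-logs \
  --gcs-location gs://dataflow-templates/latest/GCS_Text_to_BigQuery \
  --parameters \
inputFilePattern=gs://bucket/many_folders/*/*/*/*.gz,\
outputTable=project:dataset.table,\
JSONPath=gs://bucket/schema.json,\
javascriptTextTransformGcsPath=gs://bucket/transform.js,\
javascriptTextTransformFunctionName=transform,\
bigQueryLoadingTemporaryDirectory=gs://bucket/tmp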

1 Answer


I solved it by loading the GCS data as a single column. Then, as a next step, I parsed the data inside BigQuery.
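A minimal sketch of that approach (the 0x01 delimiter, the staging table name raw_table, and the JSON field names are assumptions; the idea is to pick a field delimiter that never occurs in the logs so that every line lands in a single STRING column, then parse it with SQL):

# load each line into a single STRING column named "raw"
# (assumes the 0x01 byte never appears in the data)
bq load --source_format=CSV \
  --field_delimiter=$'\x01' \
  --quote="" \
  --nosync \
  project:dataset.raw_table \
  "gs://bucket/many_folders/*" \
  raw:STRING

# then parse the JSON lines inside BigQuery
bq query --use_legacy_sql=false '
  SELECT
    JSON_EXTRACT_SCALAR(raw, "$.timestamp") AS ts,
    JSON_EXTRACT_SCALAR(raw, "$.message")   AS message
  FROM `project.dataset.raw_table`
  WHERE raw LIKE "{%"   -- keep only the JSON-looking log lines
'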