
I'm having trouble loading huge data into BigQuery.

In GCS, I have many huge files, like this:

gs://bucket/many_folders/yyyy/mm/dd/many_files.gz

I want to load them into BigQuery, so first I tried:

bq load --source_format=NEWLINE_DELIMITED_JSON \
  --ignore_unknown_values \
  --max_bad_records=2100000000 \
  --nosync \
  project:dataset.table \
  gs://bucket/* \
  schema.txt

which failed because it exceeded the "max_bad_records" limit (the files are an aggregation of many types of logs, so they produce many errors).
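(Since --nosync only submits the job, the errors show up later; a minimal sketch of how to check them afterwards, with JOB_ID as a placeholder:)

# list recent jobs, then inspect the failed one
bq ls -j -n 10
bq show --format=prettyjson -j JOB_ID   # status.errors shows which records were rejected and why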

Then I calculated and found that I need to narrow the "*" to one directory per load, like:

bq load --source_format=NEWLINE_DELIMITED_JSON \
  --ignore_unknown_values \
  --max_bad_records=2100000000 \
  --nosync \
  project:dataset.table \
  gs://bucket/many_folders/yyyy/mm/dd/* \
  schema.txt

so that each load job stays within the max_bad_records limitation.
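In other words, one load job per day folder, roughly like this (just a sketch, looping over a single month as an example):

# submit one asynchronous load job per day directory
for day in $(gsutil ls gs://bucket/many_folders/yyyy/mm/); do
  bq load --source_format=NEWLINE_DELIMITED_JSON \
    --ignore_unknown_values \
    --max_bad_records=2100000000 \
    --nosync \
    project:dataset.table \
    "${day}*" \
    schema.txt
done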

But I found this is very slow (because of the parallel-run limitation in BigQuery), and it also exceeds the daily load job limit. I would prefer not to go with this option.

Any ideas for solving this situation? I want to load this data as fast as possible.

Thank you for reading.

You may use Dataflow to read your data from Cloud Storage and load it into BigQuery. There are several examples of how to read from multiple paths here: (1), (2). – Yurci
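For example, Google's "Text Files on Cloud Storage to BigQuery" Dataflow template covers this pattern; a rough sketch of launching it with gcloud (the parameter names, the UDF file, and the paths below are assumptions to verify against the template's documentation):

gcloud dataflow jobs run load-gcs-logs \
  --gcs-location gs://dataflow-templates/latest/GCS_Text_to_BigQuery \
  --parameters \
inputFilePattern=gs://bucket/many_folders/*/*/*/*.gz,\
outputTable=project:dataset.table,\
JSONPath=gs://bucket/schema.json,\
javascriptTextTransformGcsPath=gs://bucket/transform.js,\
javascriptTextTransformFunctionName=transform,\
bigQueryLoadingTemporaryDirectory=gs://bucket/tmp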

1 Answer


I solved it by loading the GCS data as a single column. Then, as a next step, I parsed the data inside BigQuery.
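A minimal sketch of that approach (the 0x01 delimiter, the staging table name raw_table, and the JSON field names are assumptions; the idea is to pick a field delimiter that never occurs in the logs so that every line lands in a single STRING column, then parse it with SQL):

# load each line into a single STRING column named "raw"
# (assumes the 0x01 byte never appears in the data)
bq load --source_format=CSV \
  --field_delimiter=$'\x01' \
  --quote="" \
  --nosync \
  project:dataset.raw_table \
  "gs://bucket/many_folders/*" \
  raw:STRING

# then parse the JSON lines inside BigQuery
bq query --use_legacy_sql=false '
  SELECT
    JSON_EXTRACT_SCALAR(raw, "$.timestamp") AS ts,
    JSON_EXTRACT_SCALAR(raw, "$.message")   AS message
  FROM `project.dataset.raw_table`
  WHERE raw LIKE "{%"   -- keep only the JSON-looking log lines
'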