
I'm trying to import more than 200M records from different computers (n=20) into my BigQuery table via the Python client. Each computer runs a job (with multiple rows) every 10 seconds:

from google.cloud import bigquery
import os
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = os.getcwd() + '/api.json'

print('Import job started, total rows:' + str(len(queries)))
client = bigquery.Client()
for q in queries:
    results = client.query(q)
    for err in results:
        print(err)

But I'm getting the following error:

google.api_core.exceptions.Forbidden: 403 Exceeded rate limits: too many table dml insert operations for this table. For more information, see https://cloud.google.com/bigquery/troubleshooting-errors

The data is generated at run time, so I have to import it at run time as well. I'm also not sure whether BigQuery is a good fit for this. Spanner seems better suited, but it would cost me too much.

How can I avoid this error? Thank you very much.

Don't use DML for this job; instead use streaming inserts, which support up to 1M rows per second. – Pentium10

1 Answer


There are 4 major ways to insert data into BigQuery tables.

  1. Batch load a set of data records (see the sketch after this list).
  2. Stream individual records or batches of records.
  3. Use queries to generate new data and append or overwrite the results to a table.
  4. Use a third-party application or service.
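
For reference, option 1 (batch loading) also avoids DML limits entirely. A minimal sketch using a load job; the table ID and the record fields below are placeholders, not your actual schema:

from google.cloud import bigquery

client = bigquery.Client()

# Placeholder table ID -- replace with your own project.dataset.table.
table_id = "your-project.your_dataset.your_table"

# Hypothetical batch of records collected on one machine.
records = [
    {"full_name": "Phred Phlyntstone", "age": 32},
    {"full_name": "Wylma Phlyntstone", "age": 29},
]

# Run a batch load job; autodetect infers the schema,
# or pass an explicit schema in LoadJobConfig instead.
job_config = bigquery.LoadJobConfig(autodetect=True)
load_job = client.load_table_from_json(records, table_id, job_config=job_config)
load_job.result()  # Wait for the load job to finish.
print("Loaded {} rows.".format(load_job.output_rows))

Load jobs are not subject to the DML insert rate limit, but they fit best when records can be accumulated into larger batches.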

I think you are using the 3rd option, DML INSERT. It's not designed for large-scale, high-frequency data loading use cases.

For your use case, the 2nd option, streaming inserts, seems like a good fit.

Example

from google.cloud import bigquery

# Construct a BigQuery client object.
client = bigquery.Client()

# TODO(developer): Set table_id to the ID of table to append to.
# table_id = "your-project.your_dataset.your_table"

rows_to_insert = [
    {"full_name": "Phred Phlyntstone", "age": 32},
    {"full_name": "Wylma Phlyntstone", "age": 29},
]

errors = client.insert_rows_json(table_id, rows_to_insert)  # Make an API request.
if errors == []:
    print("New rows have been added.")
else:
    print("Encountered errors while inserting rows: {}".format(errors))

You can find more details here: https://cloud.google.com/bigquery/streaming-data-into-bigquery
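
Applied to your setup, the change is essentially to send the generated rows with insert_rows_json instead of issuing DML INSERT queries. A rough sketch, assuming each 10-second job has its data available as a list of dicts called rows_batch (a hypothetical name) whose keys match your table's column names:

from google.cloud import bigquery
import os

os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = os.getcwd() + '/api.json'
client = bigquery.Client()

# TODO: replace with your own project.dataset.table.
table_id = "your-project.your_dataset.your_table"

# rows_batch is a hypothetical list of dicts generated at run time,
# e.g. [{"col_a": 1, "col_b": "x"}, ...], built instead of DML INSERT strings.
errors = client.insert_rows_json(table_id, rows_batch)
if errors:
    print("Encountered errors while inserting rows: {}".format(errors))

Streaming inserts have much higher quotas than DML statements (as the comment above notes, up to 1M rows per second), so the 10-second jobs on each machine should no longer hit this rate limit.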