
So I've been trying to stream data from the Google Search Console (GSC) API to BigQuery in real time.

The data are retrieved from the GSC API and streamed into the BigQuery streaming buffer. However, I experience high latency before the streaming buffer is flushed (up to 2 hours or more), so the data stays in the streaming buffer but not in the table. The data are also not visible in the preview, and the table size shows as 0 B with 0 rows (even after waiting more than a day I still see 0 B, although there are more than 0 rows).

Another issue is that, some time after the data is stored in the table (table size and number of rows are correct), it simply disappears from the table and reappears in the streaming buffer (I only saw this once). -> This was explained by the second bullet in shollyman's answer.

What I want is to have the data in the table in real time. According to the documentation this seems possible, but it doesn't work in my case (the 2 hours of delay stated above).
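As a sanity check I tried confirming that the buffered rows are at least visible to queries (they should be, even while the preview shows 0 rows). Here is a minimal sketch using the same API; the parse_count helper and the response literal are my own illustration of the jobs.query row encoding:

```python
def parse_count(query_response):
    """Extract the single COUNT(*) value from a jobs.query response,
    which encodes each row as {'f': [{'v': value}, ...]}."""
    return int(query_response['rows'][0]['f'][0]['v'])

# Running the query requires credentials (same client object as below):
# query_response = build('bigquery', 'v2', cache_discovery=False).jobs().query(
#     projectId=projectid,
#     body={'query': 'SELECT COUNT(*) FROM dataset.table',
#           'useLegacySql': False}).execute()

# Example of the response shape:
query_response = {'kind': 'bigquery#queryResponse',
                  'rows': [{'f': [{'v': '1234'}]}]}
print(parse_count(query_response))   # → 1234
```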

Here's the code responsible for that part:

import uuid

from googleapiclient.discovery import build

# Build the BigQuery client once, outside the loop
bigquery = build('bigquery', 'v2', cache_discovery=False)

for row in response['rows']:
    keys = ','.join(row['keys'])

    # Stream one row at a time to BigQuery via tabledata.insertAll
    row_to_stream = {'keys': keys, 'f1': row['f1'], 'f2': row['f2'],
                     'ctr': row['ctr'], 'position': row['position']}
    insert_all_data = {
        "kind": "bigquery#tableDataInsertAllRequest",
        "skipInvalidRows": True,
        "ignoreUnknownValues": True,
        'rows': [{
            'insertId': str(uuid.uuid4()),
            'json': row_to_stream,
        }]
    }

    bigquery.tabledata().insertAll(
        projectId=projectid,
        datasetId=dataset_id,
        tableId=tableid,
        body=insert_all_data).execute(num_retries=5)
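One thing worth noting: insertAll reports per-row failures in an insertErrors field of the response instead of raising an exception, so silently dropped rows are easy to miss. A small helper to surface them (the name check_insert_response is my own; the sample response mirrors the API's documented shape):

```python
def check_insert_response(response):
    """Return a list of (row_index, error_message) pairs from an
    insertAll response; an empty list means every row was accepted."""
    errors = []
    for entry in response.get('insertErrors', []):
        for err in entry.get('errors', []):
            errors.append((entry['index'], err.get('message', '')))
    return errors

# Example: a response where row 0 was rejected
resp = {'kind': 'bigquery#tableDataInsertAllResponse',
        'insertErrors': [{'index': 0,
                          'errors': [{'reason': 'invalid',
                                      'message': 'no such field: f3'}]}]}
print(check_insert_response(resp))   # → [(0, 'no such field: f3')]
```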

I've seen questions on here that seem very similar to mine, but I haven't really found an answer. I therefore have two questions.

1. What could cause this issue?

Also, I'm new to GCP, and I've seen other options (at least they seemed like options to me) for real-time streaming of data to BigQuery (e.g., using Pub/Sub, and a few projects around real-time Twitter data analysis).

2. How do you pick the best option for a particular task?

1 Answer
  • By default, the BigQuery web UI doesn't automatically refresh the state of a table. There is a Refresh button when you click into the details of a table; it should show you updated size information for both managed storage and the streaming buffer (displayed below the main table details). Rows in the buffer are available to queries, but the Preview button may not show results until some data is extracted from the streaming buffer to managed storage.

  • I suspect the case where you observed data disappearing from managed storage and appearing back in the streaming buffer may have been a case where the table was deleted and recreated with the same name, or was truncated in some fashion and streaming restarted. Data doesn't transition from managed storage back to the buffer.

  • Deciding what technology to use for streaming depends on your needs. Pub/Sub is a great choice when you have multiple consumers of the information (multiple pub/sub subscribers consuming the same stream of messages independently), or you need to apply additional transformations of the data between the producer and consumer. To get the data from pub/sub to BigQuery, you'll still need a subscriber to write the messages into BigQuery, as the two have no direct integration.
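To see whether rows are actually sitting in the buffer, tables.get returns a streamingBuffer section with estimated row and byte counts while the buffer is non-empty. A minimal sketch of reading it (the summarize_buffer helper is my own; the sample resource follows the bigquery#table shape):

```python
def summarize_buffer(table_resource):
    """Summarize the streamingBuffer section of a bigquery#table
    resource returned by tables.get; None means the buffer is empty."""
    buf = table_resource.get('streamingBuffer')
    if buf is None:
        return None
    return {'rows': int(buf.get('estimatedRows', 0)),
            'bytes': int(buf.get('estimatedBytes', 0))}

# Fetching the resource requires credentials (same client as in the question):
# table = bigquery.tables().get(
#     projectId=projectid, datasetId=dataset_id, tableId=tableid).execute()

# Example resource shape while data is still buffered:
table = {'kind': 'bigquery#table',
         'numRows': '0', 'numBytes': '0',
         'streamingBuffer': {'estimatedRows': '1500',
                             'estimatedBytes': '96000',
                             'oldestEntryTime': '1534241500000'}}
print(summarize_buffer(table))   # → {'rows': 1500, 'bytes': 96000}
```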