
I'm trying to set up a Dataflow job to write data from a Pub/Sub topic to a BigQuery table. I've clicked "Export to BigQuery" in the Pub/Sub topic console and taken the steps detailed below. Once the job is created, the job graph shows a box called "WriteSuccessfulRecords" whose time info ramps up and up, and the Log Viewer reports endless messages like this:

Operation ongoing in step WriteSuccessfulRecords/StreamingInserts/StreamingWriteTables/StreamingWrite for at least 55m00s without outputting or completing in state finish
  at java.base/jdk.internal.misc.Unsafe.park(Native Method)
  at java.base/java.util.concurrent.locks.LockSupport.park(LockSupport.java:194)
  at java.base/java.util.concurrent.FutureTask.awaitDone(FutureTask.java:447)
  at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:190)
  at app//org.apache.beam.sdk.io.gcp.bigquery.BigQueryServicesImpl$DatasetServiceImpl.insertAll(BigQueryServicesImpl.java:817)
  at app//org.apache.beam.sdk.io.gcp.bigquery.BigQueryServicesImpl$DatasetServiceImpl.insertAll(BigQueryServicesImpl.java:882)
  at app//org.apache.beam.sdk.io.gcp.bigquery.StreamingWriteFn.flushRows(StreamingWriteFn.java:143)
  at app//org.apache.beam.sdk.io.gcp.bigquery.StreamingWriteFn.finishBundle(StreamingWriteFn.java:115)
  at app//org.apache.beam.sdk.io.gcp.bigquery.StreamingWriteFn$DoFnInvoker.invokeFinishBundle(Unknown Source)

When I click through WriteSuccessfulRecords I end up at the "StreamingWrite" box showing the same time (what does this time mean?). The "Running" time in the WriteSuccessfulRecords box (and StreamingWrite, etc.) is currently over 2 days, even though I created the job about an hour ago; a previous attempt reached close to 100 hours with no output.

My BigQuery table exists as an empty table, with the schema matching the data expected from Pub/Sub. I've copied the table ID from the BigQuery details tab into the appropriate box in the Dataflow setup (the format is project-id:dataset.table-name). The BigQuery dataset is in the same region as the Dataflow job, although I'm not sure how relevant that is. My Cloud Storage temp location is also valid; again, I've copied the storage location into the Dataflow setup.
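For reference, the table spec and the dataset location can be double-checked with the bq CLI along these lines (the project, dataset and table names below are placeholders):

  # Show table metadata, including the dataset location and the schema.
  # "my-project", "my_dataset" and "my_table" are placeholder names.
  bq show --format=prettyjson my-project:my_dataset.my_table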

Other Dataflow setup info (a rough gcloud equivalent of this configuration is sketched after the list):

  1. I'm using the template "Pub/Sub Topic to BigQuery".
  2. Input Pub/Sub topic is projects/[pubsub project id]/topics/[topic name]
  3. We use a Shared VPC so I've specified the full path which looks like https://www.googleapis.com/compute/v1/projects/[pubsub project id]/regions/europe-west2/subnetworks/[subnet name]
  4. I've also specified the service account email address.
  5. The Worker Region is set to the same region as BigQuery and Pub/Sub, in case that's relevant.
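Roughly, the gcloud equivalent of what I've configured in the console would look something like the sketch below. All bracketed values are placeholders, and the exact template path and parameter names may differ from what the console actually uses:

  # Rough gcloud equivalent of the console setup above (bracketed values are placeholders).
  gcloud dataflow jobs run [job-name] \
    --gcs-location=gs://dataflow-templates/latest/PubSub_to_BigQuery \
    --region=europe-west2 \
    --subnetwork=https://www.googleapis.com/compute/v1/projects/[pubsub-project-id]/regions/europe-west2/subnetworks/[subnet-name] \
    --service-account-email=[service-account-email] \
    --staging-location=gs://[bucket]/temp \
    --parameters=inputTopic=projects/[pubsub-project-id]/topics/[topic-name],outputTableSpec=[project-id]:[dataset].[table-name]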

Is there anything obvious I've missed here? What next steps should I take to get this Dataflow job working?

thanks in advance,

Tony

Does your service account have permissions to read from Pub/Sub and write to BigQuery? Are your Dataflow workers in the same region as your subnetwork (europe-west2)? When you click on the "ReadPubSubTopic" step, do you see positive throughput in the graph on the right side? – Peter Kim
Hi Peter, thanks for the reply. Happy to confirm in the affirmative to all your queries, with the one exception being that I hadn't granted BigQuery permissions to my service account (the Pub/Sub and Dataflow permissions were already there). I've added the following: BigQuery Admin, BigQuery Connection Service Agent, BigQuery Data Transfer Service Agent. Between them they've enabled the write from Dataflow to BigQuery, but I can't help feeling that's too much permission. Do you know of a minimal role for least privilege for this use case? Thanks! – Tony B

1 Answer


It turned out that the service account required some additional permissions to write to BigQuery.

I'm not sure what the minimum roles for streaming inserts are, but my guess would be roles/bigquery.dataEditor and roles/bigquery.user. The IAM page in the Cloud Console (/iam-admin/iam) can analyze which permissions a service account actually uses, so you could start from BigQuery Admin and narrow down from there. Another strategy is to look at the Dataflow logs [1] to see which permissions the service account is missing.
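If you want to try the narrower roles, granting them would look something like the sketch below; the project ID and service account email are placeholders, and I haven't verified that these two roles alone are enough for this template:

  # Grant the guessed minimal roles to the Dataflow worker service account.
  # [project-id] and [dataflow-sa] are placeholders.
  gcloud projects add-iam-policy-binding [project-id] \
    --member="serviceAccount:[dataflow-sa]@[project-id].iam.gserviceaccount.com" \
    --role="roles/bigquery.dataEditor"

  gcloud projects add-iam-policy-binding [project-id] \
    --member="serviceAccount:[dataflow-sa]@[project-id].iam.gserviceaccount.com" \
    --role="roles/bigquery.user"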

[1] "BigQuery insertAll error, retrying: Access Denied: Table project:dataset.table User does not have bigquery.tables.get permission for table project:dataset.table."