I'm trying to set up a Dataflow job to write data from a Pub/Sub topic to a BigQuery table. I've clicked "Export To BigQuery" in the Pub/Sub topic console and taken the steps detailed below. Once the job is created, the job graph shows a "WriteSuccessfulRecords" box whose time counter climbs and climbs, and the Log Viewer reports endless messages like this:
Operation ongoing in step WriteSuccessfulRecords/StreamingInserts/StreamingWriteTables/StreamingWrite for at least 55m00s without outputting or completing in state finish
at java.base@[JDK version]/jdk.internal.misc.Unsafe.park(Native Method)
at java.base@[JDK version]/java.util.concurrent.locks.LockSupport.park(LockSupport.java:194)
at java.base@[JDK version]/java.util.concurrent.FutureTask.awaitDone(FutureTask.java:447)
at java.base@[JDK version]/java.util.concurrent.FutureTask.get(FutureTask.java:190)
at app//org.apache.beam.sdk.io.gcp.bigquery.BigQueryServicesImpl$DatasetServiceImpl.insertAll(BigQueryServicesImpl.java:817)
at app//org.apache.beam.sdk.io.gcp.bigquery.BigQueryServicesImpl$DatasetServiceImpl.insertAll(BigQueryServicesImpl.java:882)
at app//org.apache.beam.sdk.io.gcp.bigquery.StreamingWriteFn.flushRows(StreamingWriteFn.java:143)
at app//org.apache.beam.sdk.io.gcp.bigquery.StreamingWriteFn.finishBundle(StreamingWriteFn.java:115)
at app//org.apache.beam.sdk.io.gcp.bigquery.StreamingWriteFn$DoFnInvoker.invokeFinishBundle(Unknown Source)
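As far as I can tell from that trace, the worker thread is simply parked inside FutureTask.get(), i.e. finishBundle is waiting for a streaming-insert call that never returns. Just to illustrate the shape of what I think I'm seeing (a toy example of a caller blocking on a future, not the actual Beam code):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class BlockedOnFuture {
  public static void main(String[] args) throws Exception {
    ExecutorService executor = Executors.newSingleThreadExecutor();

    // Stand-in for the BigQuery insertAll call that apparently never completes.
    Future<String> insertResult = executor.submit(() -> {
      TimeUnit.HOURS.sleep(1); // simulate a call that effectively hangs
      return "done";
    });

    try {
      // The caller parks here, which is what Unsafe.park / FutureTask.get
      // in the stack trace looks like to me.
      insertResult.get(5, TimeUnit.SECONDS);
    } catch (TimeoutException e) {
      System.out.println("Still waiting on the insert after 5s, same as the job");
    } finally {
      executor.shutdownNow();
    }
  }
}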
When I click through WriteSuccessfulRecords I end up at the "StreamingWrite" box, which shows the same time (what does this time actually mean?). The "Running" time on the WriteSuccessfulRecords (and StreamingWrite, etc.) box is currently over 2 days, even though I created the job about an hour ago. It has previously reached close to 100 hours with no output.
My BigQuery table exists as an empty table with the schema of the data expected from Pub/Sub. I've copied the table ID from the BigQuery Details tab and pasted it into the appropriate box in the Dataflow setup (the format is project-id:dataset.table-name). The BigQuery dataset is in the same region as the Dataflow job, although I'm not sure how relevant that is. My Cloud Storage temp location is also valid; again, I've copied it straight into the Dataflow setup.
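As a next sanity check I'm planning to stream a single test row into the table directly with the BigQuery client library, roughly as below, to rule out a table or permissions problem (the bracketed project, dataset, table and column names are placeholders for my real ones):

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.InsertAllRequest;
import com.google.cloud.bigquery.InsertAllResponse;
import com.google.cloud.bigquery.TableId;
import java.util.Map;

public class StreamingInsertCheck {
  public static void main(String[] args) {
    // Bracketed values are placeholders for my real project, dataset, table and column.
    TableId tableId = TableId.of("[project id]", "[dataset]", "[table name]");
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

    // A single row that matches the table schema.
    Map<String, Object> row = Map.of("[some column]", "test-value");

    InsertAllResponse response =
        bigquery.insertAll(InsertAllRequest.newBuilder(tableId).addRow(row).build());

    if (response.hasErrors()) {
      // Streaming inserts report per-row errors rather than throwing.
      response.getInsertErrors().forEach((index, errors) ->
          System.out.println("Row " + index + ": " + errors));
    } else {
      System.out.println("Streaming insert accepted");
    }
  }
}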
Other Dataflow setup info (a rough code sketch of the whole setup follows this list):
- I'm using the template "Pub/Sub Topic to BigQuery".
- Input Pub/Sub topic is projects/[project id]/topics/[topic name]
- We use a Shared VPC, so I've specified the full subnetwork path, which looks like https://www.googleapis.com/compute/v1/projects/[pubsub project id]/regions/europe-west2/subnetworks/[subnet name]
- I've also specified the service account email address.
- My Worker Region is also set to the same region as BigQuery and Pub/Sub, in case that's relevant.
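For context, my understanding is that the template boils down to something like the Beam pipeline and Dataflow options below. This is only a sketch of the standard Pub/Sub-to-BigQuery flow as I understand it, not the template's actual source, and every bracketed value is a placeholder for my real settings.

import com.google.api.services.bigquery.model.TableRow;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import org.apache.beam.runners.dataflow.DataflowRunner;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.WriteDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.TableRowJsonCoder;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;

public class PubSubToBigQuerySketch {
  public static void main(String[] args) {
    DataflowPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).as(DataflowPipelineOptions.class);
    // These mirror what I've entered in the template UI; bracketed values are placeholders.
    options.setRunner(DataflowRunner.class);
    options.setRegion("europe-west2");
    options.setSubnetwork(
        "https://www.googleapis.com/compute/v1/projects/[pubsub project id]"
            + "/regions/europe-west2/subnetworks/[subnet name]");
    options.setServiceAccount("[service account email]");
    options.setTempLocation("gs://[temp bucket]/temp");

    Pipeline pipeline = Pipeline.create(options);
    pipeline
        // Read raw JSON strings published to the topic.
        .apply("ReadFromPubSub",
            PubsubIO.readStrings().fromTopic("projects/[project id]/topics/[topic name]"))
        // Convert each JSON message into a BigQuery TableRow.
        .apply("JsonToTableRow", ParDo.of(new DoFn<String, TableRow>() {
          @ProcessElement
          public void processElement(ProcessContext c) throws Exception {
            try (ByteArrayInputStream in =
                new ByteArrayInputStream(c.element().getBytes(StandardCharsets.UTF_8))) {
              c.output(TableRowJsonCoder.of().decode(in));
            }
          }
        }))
        // Stream the rows into the existing (empty) table.
        .apply("WriteToBigQuery",
            BigQueryIO.writeTableRows()
                .to("[project id]:[dataset].[table name]")
                .withCreateDisposition(CreateDisposition.CREATE_NEVER)
                .withWriteDisposition(WriteDisposition.WRITE_APPEND));

    pipeline.run();
  }
}

I've used CREATE_NEVER and WRITE_APPEND in the sketch because my table already exists and is empty, but again, that's my reading of the setup rather than the template's confirmed behaviour.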
Is there anything obvious I've missed with this setup? What next steps should I take to make progress?
Thanks in advance,
Tony