2
votes

I am running a job on Google Dataflow, written with Apache Beam, that reads from a BigQuery table and from files, transforms the data, and writes it into other BigQuery tables. The job "usually" succeeds, but sometimes I randomly get a NullPointerException when reading from the BigQuery table, and the job fails:

(288abb7678892196): java.lang.NullPointerException
at org.apache.beam.sdk.io.gcp.bigquery.BigQuerySourceBase.split(BigQuerySourceBase.java:98)
at com.google.cloud.dataflow.worker.runners.worker.WorkerCustomSources.splitAndValidate(WorkerCustomSources.java:261)
at com.google.cloud.dataflow.worker.runners.worker.WorkerCustomSources.performSplitTyped(WorkerCustomSources.java:209)
at com.google.cloud.dataflow.worker.runners.worker.WorkerCustomSources.performSplitWithApiLimit(WorkerCustomSources.java:184)
at com.google.cloud.dataflow.worker.runners.worker.WorkerCustomSources.performSplit(WorkerCustomSources.java:161)
at com.google.cloud.dataflow.worker.runners.worker.WorkerCustomSourceOperationExecutor.execute(WorkerCustomSourceOperationExecutor.java:47)
at com.google.cloud.dataflow.worker.runners.worker.DataflowWorker.executeWork(DataflowWorker.java:341)
at com.google.cloud.dataflow.worker.runners.worker.DataflowWorker.doWork(DataflowWorker.java:297)
at com.google.cloud.dataflow.worker.runners.worker.DataflowWorker.getAndPerformWork(DataflowWorker.java:244)
at com.google.cloud.dataflow.worker.runners.worker.DataflowBatchWorkerHarness$WorkerThread.doWork(DataflowBatchWorkerHarness.java:125)
at com.google.cloud.dataflow.worker.runners.worker.DataflowBatchWorkerHarness$WorkerThread.call(DataflowBatchWorkerHarness.java:105)
at com.google.cloud.dataflow.worker.runners.worker.DataflowBatchWorkerHarness$WorkerThread.call(DataflowBatchWorkerHarness.java:92)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

I cannot figure out what this is connected to. When I clear the temp directory and re-upload my template, the job passes again.

The way I read from BQ is simply with:

BigQueryIO.read().fromQuery()
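
(Roughly, in fuller form the read looks like the sketch below. The query string here is only a placeholder; the actual query is built from a pipeline option, as shown in the code in my answer with more details further down.)

// Placeholder query for illustration only -- the real one comes from a ValueProvider
// (see my answer below).
PCollection<TableRow> rows = p.apply("Read Transactions",
        BigQueryIO.read().fromQuery("SELECT transaction_id FROM my_dataset.transactions"));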

I would greatly appreciate any help.

Anyone?

3
Are you adding an actual query to your source? Or are you just calling fromQuery() without any parameters? Also, Read is not a function, but an internal class. – Pablo
Are you running the pipeline directly, or using the runner to create a template and then running that? – Ben Chambers
Why don't you just run it locally and debug it? – Graham Polley
Are you doing any ParDo transform on the data read from BigQuery? If yes, please provide the code snippet showing exactly where you are getting the NullPointerException. – Manoj Kumar

3 Answers

3
votes

I ended up filing a bug in the Google issue tracker. After a longer conversation with a Google employee and their investigation, it turned out that it doesn't make sense to use templates with Dataflow batch jobs that read from BigQuery, because such a template can only be executed once.

To quote: "for BigQuery batch pipelines, templates can only be executed once, as the BigQuery job ID is set at template creation time. This restriction will be removed in a future release for the SDK 2, but when I cannot say. Creating Templates: https://cloud.google.com/dataflow/docs/templates/creating-templates#pipeline-io-and-runtime-parameters"

It would still be good if the error were clearer than a NullPointerException.

Anyway, I hope this helps someone in the future.

Here is the issue if someone is interested in whole conversation: https://issuetracker.google.com/issues/63124894

2
votes

I ran into this issue as well, and after digging around it turns out that the restriction has been removed in version 2.2.0. However, that version has not been officially released yet. You can follow its progress on their JIRA project (it seems there's only one issue left).

But if you want to use it now, you can compile it yourself, which isn't difficult: check out the source code from their GitHub mirror, check out the tag v2.2.0-RC4, and run mvn clean install. Then modify your project's dependencies in pom.xml to point to version 2.2.0 instead.

From 2.2.0 onwards, if you want to use BigQueryIO in a template, you need to call withTemplateCompatibility():

BigQueryIO
    .readTableRows() // read() has been deprecated in 2.2.0
    .withTemplateCompatibility() // You need to add this
    .fromQuery(options.getInputQuery())

I'm currently using 2.2.0 for my project, and it works fine so far.
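
For completeness, here is a minimal sketch of how such a template-compatible read could be wired up end to end. The options interface and the inputQuery parameter are illustrative only, not taken from the question:

import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.ValueProvider;
import org.apache.beam.sdk.values.PCollection;

public class TemplateCompatibleRead {

    // Runtime parameter, so the query is resolved at execution time
    // instead of being baked into the template.
    public interface Options extends PipelineOptions {
        ValueProvider<String> getInputQuery();
        void setInputQuery(ValueProvider<String> value);
    }

    public static void main(String[] args) {
        Options options = PipelineOptionsFactory.fromArgs(args).withValidation().as(Options.class);
        Pipeline p = Pipeline.create(options);

        PCollection<TableRow> rows = p.apply("Read from BigQuery",
                BigQueryIO.readTableRows()
                        .withTemplateCompatibility() // allows the template to be executed more than once
                        .fromQuery(options.getInputQuery()));

        // ... further transforms and a write would go here ...

        p.run();
    }
}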

1
vote

OK, let me give a bit more detail.

  • The job is uploaded as a template and run on Google Dataflow.
  • The job usually succeeds - that's why I doubt there is something wrong with the actual code. The exception comes from the source; it looks like bqServices.getDatasetService(bqOptions) returns null in BigQuerySourceBase.
  • Yes, I do provide the actual query.

Below is the DAG of my job. As you can see, this run succeeded. It processed more than 2 million rows exported from BQ and 1.5 million rows from CSV files, and wrote 800k rows back to BigQuery (the numbers are correct). The job basically works as expected (when it works). The top-left step ("Read Transactions") is the one that runs the query on BQ, and it is that step that sometimes fails for no apparent reason.

Successful run - Beam DAG

Below is the same job when it failed with the NullPointerException on the BQ source.

Failed run - Beam DAG

I'm not sure how helpful a code snippet will be in this case, but this is the part that runs the query:

PCollection<Transaction> transactions = p
        .apply("Read Transactions", BigQueryIO.read().fromQuery(createTransactionQuery(options)))
        .apply("Map to Transaction", MapElements.via(new TableRowToTransactionFn()));

PCollection<KV<String, Transaction>> transactionsPerMtn = transactions
        .apply("Filter Transactions Without MTN", Filter.by(t -> t.transactionMtn != null))
        .apply("Map Transactions to MTN key", MapElements
                .into(TypeDescriptors.kvs(TypeDescriptors.strings(), TypeDescriptor.of(Transaction.class)))
                .via(t -> KV.of(t.transactionMtn, t)));

And below is the method that builds the query:

private ValueProvider<String> createTransactionQuery(TmsPipelineOptions options) {
    return NestedValueProvider.of(options.getInputTransactionTable(), table -> {
        StringBuilder sb = new StringBuilder();
        sb.append(
                "SELECT transaction_id, transaction_mtn, transaction_folio_number, transaction_payer_folio_number FROM ");
        sb.append(table);
        return sb.toString();
    });
}

I believe there is some kind of bug in the BigQuery source that leads to problems like this. I just cannot nail down what is causing it, since it happens randomly. Like I wrote, the last time I encountered it I just cleared the temp dir on GCS and re-uploaded my template (without any code changes), and the job started working again.