2 votes

I have a Cloud Dataflow job that's stuck in the initiation phase, before running any application logic. I tested this by adding a log output statement inside the processElement step, but it's not appearing in the logs, so it seems the step isn't being reached.
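
For concreteness, here's a minimal sketch (Dataflow Java SDK; the class and logger names are hypothetical) of the kind of log statement described:

    import com.google.api.services.bigquery.model.TableRow;
    import com.google.cloud.dataflow.sdk.transforms.DoFn;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    // Hypothetical DoFn illustrating the debugging approach: if this log
    // line never appears in Cloud Logging, processElement isn't running.
    public class LoggingFn extends DoFn<TableRow, TableRow> {
        private static final Logger LOG = LoggerFactory.getLogger(LoggingFn.class);

        @Override
        public void processElement(ProcessContext c) {
            LOG.info("processElement reached: {}", c.element());
            c.output(c.element());
        }
    }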

All I can see in the logs is the following message, which appears every minute:

logger: Starting supervisor: /etc/supervisor/supervisord_watcher.sh: line 36: /proc//oom_score_adj: Permission denied

And these messages, which loop every few seconds:

VM is healthy? true.

http: TLS handshake error from 172.17.0.1:38335: EOF

Job is in state JOB_STATE_RUNNING, will check again in 30 seconds.

The job ID is 2015-09-14_06_30_22-15275884222662398973, though I have an additional two jobs (2015-09-14_05_59_30-11021392791304643671, 2015-09-14_06_08_41-3621035073455045662) that I started this morning and which have the same problem.

Any ideas on what might be causing this?

1
All the worker log messages are expected and consistent with normal operation, so they don't explain why your job is stuck. – Jeremy Lewi
Thanks Jeremy. I suspect the problem is with the construction of the job itself, which loops through a bunch of data and calls ProcessContext.output() a lot. Probably not the ideal way to have written it. – Darren Olivier
Can you elaborate on what you mean by "loops through a bunch of data and calls output()"? If the data is coming in from the input to the DoFn, this shouldn't be a problem (since it happens on the worker, after construction of the job). Or is the data coming from a field in the DoFn, or somehow being serialized to the worker some other way? – Ben Chambers
The job itself runs through approximately 50 million rows from a BigQuery table, then for each row it runs output() about 300 times. – Darren Olivier
To clarify: the 50 million rows are the input to the DoFn, but within the DoFn there's a for-each loop over a simple array data structure of roughly 300 elements that outputs a result for most of them. So the DoFn itself outputs roughly 300 TableRow instances per input row. – Darren Olivier
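
A minimal sketch of the fan-out shape described in the comments (the class name and output schema are hypothetical): each input row drives a loop of roughly 300 iterations, each of which calls output().

    import com.google.api.services.bigquery.model.TableRow;
    import com.google.cloud.dataflow.sdk.transforms.DoFn;

    // Hypothetical fan-out DoFn: roughly 300 outputs per input row. The loop
    // runs on the workers after the job has been constructed, so by itself
    // it shouldn't delay job startup.
    public class FanOutFn extends DoFn<TableRow, TableRow> {
        private static final int FAN_OUT = 300;

        @Override
        public void processElement(ProcessContext c) {
            for (int i = 0; i < FAN_OUT; i++) {
                // Placeholder output; the real job derives a row from each
                // of the ~300 array elements.
                c.output(new TableRow().set("index", i));
            }
        }
    }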

1 Answer

2 votes

It sounds like your pipeline has a BigQuery source followed by a DoFn. Before running your DoFn (and therefore reaching your log statement), the pipeline runs a BigQuery export job to create a snapshot of the data in GCS. This ensures that the pipeline gets a consistent view of the data contained in the BigQuery tables.
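
In other words, a pipeline shaped roughly like this (a sketch; the table spec is a placeholder, reusing the hypothetical FanOutFn sketched above) won't invoke the DoFn until the export of the source table to GCS finishes:

    import com.google.cloud.dataflow.sdk.Pipeline;
    import com.google.cloud.dataflow.sdk.io.BigQueryIO;
    import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
    import com.google.cloud.dataflow.sdk.transforms.ParDo;

    public class ExportThenProcess {
        public static void main(String[] args) {
            Pipeline p = Pipeline.create(
                PipelineOptionsFactory.fromArgs(args).withValidation().create());
            // Reading from BigQuery first exports the table to GCS; the
            // ParDo below only starts once that snapshot is complete.
            p.apply(BigQueryIO.Read.from("my-project:my_dataset.my_table"))
             .apply(ParDo.of(new FanOutFn()));
            p.run();
        }
    }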

It seems this BigQuery export job for your table took a long time. Unfortunately, there isn't a progress indicator for the export process. If you run the pipeline again and let it run longer, the export should complete, after which your DoFn will start running.

We are looking into improving the user experience of the export job as well as figuring out why it took longer than we expected.