2
votes

We are trying to debug a seemingly partially-stalled Apache Beam job running on Google Cloud Dataflow. Our job reads messages from PubSub, transforms them in various ways, and streams the results to several BigQuery tables. Parts of the job are still active - a couple of our tables are being updated. But other parts seem stalled, with the last table update many hours ago (2:35am). Unfortunately we see no useful exceptions in the logs. We have only a small number of user-generated log messages, emitted once a minute, and these have stopped too, with the last one at 2:35am. An hour or so later, Dataflow increased the number of workers per its autoscaling policy, presumably keying off the backlog in the stalled parts of the pipeline.
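
For context, the pipeline has roughly this shape - the class name, subscription, table, and transform below are simplified placeholders, not our real code, and in reality there are several transform branches and output tables:

    import com.google.api.services.bigquery.model.TableRow;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
    import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
    import org.apache.beam.sdk.options.PipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;

    public class FlowIngestionSketch {
      public static void main(String[] args) {
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
        Pipeline p = Pipeline.create(options);

        p.apply("ReadFromPubSub",
                PubsubIO.readStrings().fromSubscription(
                    "projects/our-project/subscriptions/our-subscription"))
         .apply("Transform", ParDo.of(new DoFn<String, TableRow>() {
            @ProcessElement
            public void processElement(ProcessContext c) {
              // Placeholder for our real transforms (decompression, parsing, etc.).
              c.output(new TableRow().set("payload", c.element()));
            }
          }))
         .apply("WriteToBigQuery",
                BigQueryIO.writeTableRows()
                    .to("our-project:our_dataset.our_table")
                    .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
                    .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

        p.run();
      }
    }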

Without useful logs, the only leads I have are that

  • some of the workers seem to have a java process stuck at 100% CPU
  • looking at /var/log/dataflow/windmill/ on these workers shows WARNING and ERROR log files last updated at 2:36am, with messages like

    W0811 02:35:43.005868    19 work_service_client.cc:958] flowingestion-gcla-081020-08101355-256d-harness-jmb5 Unable to update setup work item 5076700766800503996 error: DEADLINE_EXCEEDED: Http(504) Gateway Timeout
    E0811 02:36:12.814573   208 work_service_client.cc:689] flowingestion-gcla-081020-08101355-256d-harness-jmb5 Lost lease for work with id 1911643509568450683
    

and

    E0811 02:36:12.821274   208 work_service_client.cc:689] flowingestion-gcla-081020-08101355-256d-harness-jmb5 Lost lease for work with id 8994368075509118540
    E0811 02:36:12.821322   208 work_service_client.cc:689] flowingestion-gcla-081020-08101355-256d-harness-jmb5 Lost lease for work with id 8994368075509118575

Does anyone have any advice on where to go from here?

If anyone from the Google Cloud team is able to take a look, our job ID is 2017-08-10_13_55_26-6781083469468673800.


1 Answer

3
votes

We figured out that the problem stemmed from our own code...

One of the stages in our pipeline attempted to decompress its input from PubSub. Something went wrong, and the decompression became stuck in a CPU-bound loop.
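
For illustration only - this is not our actual transform - here is one way a java.util.zip decompression loop can end up spinning like that, plus a variant that fails fast instead (class and method names are made up):

    import java.io.ByteArrayOutputStream;
    import java.util.zip.DataFormatException;
    import java.util.zip.Inflater;

    public class DecompressSketch {

      // Failure mode: if the payload is truncated (or needs a preset dictionary),
      // inflate() keeps returning 0 without ever setting finished(), so this loop
      // never exits and pins a CPU core - with no exception and no log output.
      static byte[] decompressNaive(byte[] compressed) throws DataFormatException {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        while (!inflater.finished()) {
          int n = inflater.inflate(buffer);
          out.write(buffer, 0, n);
        }
        inflater.end();
        return out.toByteArray();
      }

      // Safer variant: treat "no progress" as a hard error instead of spinning.
      static byte[] decompress(byte[] compressed) throws DataFormatException {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        try {
          while (!inflater.finished()) {
            int n = inflater.inflate(buffer);
            if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
              throw new DataFormatException("truncated or unreadable payload");
            }
            out.write(buffer, 0, n);
          }
        } finally {
          inflater.end();
        }
        return out.toByteArray();
      }
    }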

To determine this, we did the following:

  • using the Google Compute Engine web interface, we looked at the CPU history of each worker over the last several hours. Of those that had been running since the start of the Apache Beam pipeline, a handful showed a sharp increase in CPU usage around 2:35am(!)
  • we connected to one of these instances using ssh, ran top, and found a Java process at 100% CPU.
  • we couldn't get a stack trace with jstack - it reported that the process didn't "seem to be a Hotspot VM". We could send SIGQUIT to the pid to trigger a thread dump, but the JVM writes that dump to its stdout, and file descriptor 1 was connected to a pipe we couldn't read. So we attached to the pid with strace -f -s 256 -o strace.out, sent SIGQUIT, and then reconstructed the thread dumps from the write calls captured in strace.out (a rough sketch of that reconstruction follows this list).
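
For anyone curious about that last step: strace records each write() the process makes, with the written bytes C-escaped, so the thread dumps the JVM printed in response to SIGQUIT can be stitched back together from the write(1, ...) and write(2, ...) lines. A rough sketch of that reassembly (simplified - it ignores interleaved "<unfinished ...>"/resumed syscall lines):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Stitch back together what the traced process wrote to stdout/stderr
    // (fds 1 and 2), based on the write() lines in strace's output file.
    public class StraceDumpReassembler {

      // e.g.  208   write(1, "Full thread dump ...\n\t at com.example...", 4096) = 4096
      private static final Pattern WRITE_LINE = Pattern.compile(
          "^\\d+\\s+write\\([12], \"(.*)\"(?:\\.\\.\\.)?, \\d+\\)\\s*=.*$");

      public static void main(String[] args) throws IOException {
        String path = args.length > 0 ? args[0] : "strace.out";
        StringBuilder dump = new StringBuilder();
        for (String line : Files.readAllLines(Paths.get(path))) {
          Matcher m = WRITE_LINE.matcher(line);
          if (m.matches()) {
            dump.append(unescape(m.group(1)));
          }
        }
        System.out.print(dump);
      }

      // Undo strace's C-style escaping (\n, \t, \r, \", \\ and octal escapes).
      private static String unescape(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
          char c = s.charAt(i);
          if (c != '\\' || i + 1 >= s.length()) {
            out.append(c);
            continue;
          }
          char next = s.charAt(++i);
          switch (next) {
            case 'n':  out.append('\n'); break;
            case 't':  out.append('\t'); break;
            case 'r':  out.append('\r'); break;
            case '"':  out.append('"');  break;
            case '\\': out.append('\\'); break;
            default:
              if (next >= '0' && next <= '7') {          // octal escape, e.g. \302
                int end = i;
                while (end < s.length() && end - i < 3
                       && s.charAt(end) >= '0' && s.charAt(end) <= '7') {
                  end++;
                }
                out.append((char) Integer.parseInt(s.substring(i, end), 8));
                i = end - 1;
              } else {
                out.append(next);                        // unknown escape: keep as-is
              }
          }
        }
        return out.toString();
      }
    }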

The output showed one interesting thread (out of more than 300) running in our own code. That revealed the problem.

I would love to hear if anyone has a more polished way of doing this :-)