We are trying to debug a partially stalled Apache Beam job running on Google Cloud Dataflow. The job reads messages from Pub/Sub, transforms them in various ways, and streams the results to several BigQuery tables. Parts of the job are still active: a couple of our tables are being updated. Other parts appear stalled, with the last table update many hours ago (2:35am). Unfortunately we see no useful exceptions in the logs. We emit only a small number of user-generated log messages, once a minute, and these have also stopped; the last one was at 2:35am. An hour or so later, the Dataflow service increased the number of workers per its autoscaling policy, presumably keying off the backlog in the stalled parts of the pipeline.
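For context, the job has roughly the following shape. This is a minimal sketch rather than our actual code: the class, subscription, table, and transform names are placeholders, and the real job fans out into several branches like this one, each writing to its own table.

    import com.google.api.services.bigquery.model.TableRow;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.WriteDisposition;
    import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;

    public class FlowIngestionSketch {
      public static void main(String[] args) {
        // Run with --runner=DataflowRunner --streaming=true and the usual GCP options.
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        p.apply("ReadFromPubSub",
                PubsubIO.readStrings()
                    .fromSubscription("projects/my-project/subscriptions/flow-events"))
            .apply("ToTableRow", ParDo.of(new DoFn<String, TableRow>() {
              @ProcessElement
              public void processElement(ProcessContext c) {
                // The real job parses and transforms the message; this just wraps it.
                c.output(new TableRow().set("payload", c.element()));
              }
            }))
            .apply("WriteToBigQuery",
                BigQueryIO.writeTableRows()
                    .to("my-project:flow.events")
                    .withCreateDisposition(CreateDisposition.CREATE_NEVER)
                    .withWriteDisposition(WriteDisposition.WRITE_APPEND));

        p.run();
      }
    }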
Without useful logs, the only leads I have are:
- Some of the workers seem to have a java process stuck at 100% CPU (we plan to grab a thread dump; see the sketch after the logs below).
- Looking at /var/log/dataflow/windmill/ on these workers shows a WARNING and an ERROR log last updated at 2:36am, with messages like:
W0811 02:35:43.005868 19 work_service_client.cc:958] flowingestion-gcla-081020-08101355-256d-harness-jmb 5 Unable to update setup work item 5076700766800503996 error: DEADLINE_EXCEEDED: Http(504) Gateway Timeout
E0811 02:36:12.814573 208 work_service_client.cc:689] flowingestion-gcla-081020-08101355-256d-harness-jmb 5 Lost lease for work with id 1911643509568450683
and
E0811 02:36:12.821274 208 work_service_client.cc:689] flowingestion-gcla-081020-08101355-256d-harness-jmb 5 Lost lease for work with id 8994368075509118540
E0811 02:36:12.821322 208 work_service_client.cc:689] flowingestion-gcla-081020-08101355-256d-harness-jmb 5 Lost lease for work with id 8994368075509118575
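Our next step, absent better ideas, is to SSH into one of the affected workers and take a thread dump of the hot java process (e.g. with jstack <pid>) to see where it is spinning. We are also considering baking a periodic stack-trace dump into the job itself, so the next stall captures this automatically. Here is a minimal sketch of our own (the class name is hypothetical, not a Beam or Dataflow API); it would be started once per worker, e.g. from a DoFn's @Setup method:

    import java.util.Map;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    // Hypothetical helper: periodically logs every thread's stack trace from
    // inside the worker JVM, so a future stall shows what the hot thread is doing.
    public class StackTraceHeartbeat {
      public static void start() {
        ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor(r -> {
              Thread t = new Thread(r, "stack-trace-heartbeat");
              t.setDaemon(true); // don't keep the worker alive on shutdown
              return t;
            });
        scheduler.scheduleAtFixedRate(
            StackTraceHeartbeat::dumpAllStacks, 5, 5, TimeUnit.MINUTES);
      }

      private static void dumpAllStacks() {
        StringBuilder sb = new StringBuilder("=== thread dump ===\n");
        for (Map.Entry<Thread, StackTraceElement[]> e :
            Thread.getAllStackTraces().entrySet()) {
          Thread t = e.getKey();
          sb.append(String.format(
              "\"%s\" id=%d state=%s%n", t.getName(), t.getId(), t.getState()));
          for (StackTraceElement frame : e.getValue()) {
            sb.append("    at ").append(frame).append('\n');
          }
        }
        System.err.println(sb);
      }
    }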
Does anyone have any advice on where to go from here?
If anyone from the Google Cloud team is able to take a look, our job ID is 2017-08-10_13_55_26-6781083469468673800.