0
votes

I am using Apache Beam 2.22.0 (Java SDK) and want to log metrics and write them to a GCS bucket after a batch pipeline finishes execution. I have tried calling result.waitUntilFinish() followed by the intended code (a minimal sketch follows the list below):

  1. DirectRunner - the GCS object is created as expected and the logs appear on the console.
  2. DataflowRunner - the GCS object is created, but the logs written after the pipeline finishes do not appear in Stackdriver.
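
Roughly, my launcher code looks like the following minimal sketch (the bucket name, object path, and exact metrics handling are placeholders):

    import java.nio.charset.StandardCharsets;

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.PipelineResult;
    import org.apache.beam.sdk.metrics.MetricQueryResults;
    import org.apache.beam.sdk.metrics.MetricsFilter;

    import com.google.cloud.storage.BlobId;
    import com.google.cloud.storage.BlobInfo;
    import com.google.cloud.storage.Storage;
    import com.google.cloud.storage.StorageOptions;

    public class MetricsToGcs {
      public static void main(String[] args) {
        Pipeline pipeline = Pipeline.create();
        // ... build the batch pipeline here ...

        PipelineResult result = pipeline.run();
        result.waitUntilFinish(); // block until the batch job completes

        // Query all committed counters from the finished job.
        MetricQueryResults metrics =
            result.metrics().queryMetrics(MetricsFilter.builder().build());
        StringBuilder report = new StringBuilder();
        metrics.getCounters().forEach(counter ->
            report.append(counter.getName()).append(" = ")
                  .append(counter.getCommitted()).append('\n'));

        // Write the report to GCS ("my-bucket" is a placeholder).
        Storage storage = StorageOptions.getDefaultInstance().getService();
        storage.create(
            BlobInfo.newBuilder(BlobId.of("my-bucket", "metrics/report.txt")).build(),
            report.toString().getBytes(StandardCharsets.UTF_8));
      }
    }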

Problem: When a Dataflow template (staged on GCS) is created for the same pipeline, neither the GCS object is created nor do any logs appear when the job is launched from the template.


2 Answers

1
votes

What you are doing is the correct way of getting a signal for when the pipeline is done. Aside from waitUntilFinish(), there is no direct API in Apache Beam that lets you get that signal from within the running pipeline.

For your logging problem, you need to call the Cloud Logging API from your code. The pipeline itself is submitted to the Dataflow service and runs on GCE VMs, which log to Cloud Logging, but the code outside of your pipeline (everything after waitUntilFinish()) runs locally on the machine that launched the job, so its log output never reaches Stackdriver on its own.
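
As a minimal sketch, the launcher process can write a log entry with the google-cloud-logging client (the log name and message below are placeholders):

    import java.util.Collections;

    import com.google.cloud.MonitoredResource;
    import com.google.cloud.logging.LogEntry;
    import com.google.cloud.logging.Logging;
    import com.google.cloud.logging.LoggingOptions;
    import com.google.cloud.logging.Payload.StringPayload;
    import com.google.cloud.logging.Severity;

    public class PostRunLogger {
      public static void main(String[] args) throws Exception {
        try (Logging logging = LoggingOptions.getDefaultInstance().getService()) {
          LogEntry entry =
              LogEntry.newBuilder(StringPayload.of("Pipeline finished, metrics written to GCS"))
                  .setSeverity(Severity.INFO)
                  .setLogName("pipeline-post-run") // placeholder log name
                  .setResource(MonitoredResource.newBuilder("global").build())
                  .build();
          // write() sends the entry to Cloud Logging no matter where
          // this launcher code happens to be running.
          logging.write(Collections.singleton(entry));
        }
      }
    }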

See Perform action after Dataflow pipeline has processed all data for a little more information.

0
votes

It is possible to export the logs from your Dataflow job to Google Cloud Storage, BigQuery, or Pub/Sub. To do that, you can use the Cloud Logging console, the Cloud Logging API, or gcloud logging to route the desired log entries to a specific sink.

In summary, to use the log export:

  1. Create a sink, selecting Google Cloud Storage as the sink service (or one of the other destinations).
  2. Within the sink, define a query to filter your logs (optional).
  3. Choose the export destination.

Afterwards, every time Cloud Logging receives new matching entries, it adds them to the sink. Note that only new entries are exported; log entries written before the sink was created are not exported retroactively.
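
As a sketch, the same sink can also be created programmatically with the google-cloud-logging client (the sink name, bucket, and job name below are placeholders; resource.type="dataflow_step" is the resource type for Dataflow worker logs):

    import com.google.cloud.logging.Logging;
    import com.google.cloud.logging.LoggingOptions;
    import com.google.cloud.logging.Sink;
    import com.google.cloud.logging.SinkInfo;
    import com.google.cloud.logging.SinkInfo.Destination.BucketDestination;

    public class CreateLogSink {
      public static void main(String[] args) throws Exception {
        try (Logging logging = LoggingOptions.getDefaultInstance().getService()) {
          SinkInfo sinkInfo =
              SinkInfo.newBuilder("dataflow-logs-sink", BucketDestination.of("my-log-bucket"))
                  // Optional filter: only export logs from one Dataflow job.
                  .setFilter("resource.type=\"dataflow_step\" AND resource.labels.job_name=\"my-job\"")
                  .build();
          Sink sink = logging.create(sinkInfo);
          System.out.println("Created sink: " + sink.getName());
        }
      }
    }

Keep in mind that the sink's writer identity must be granted write access to the destination bucket before any entries are actually exported.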

While you did not mention whether you are using custom metrics, I should point out that you should follow the metric naming rules, described here. Otherwise, they won't show up in Stackdriver.
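
For example, a minimal sketch of a custom counter in a DoFn (the namespace and metric name are placeholders; sticking to letters, digits, and underscores keeps them within the naming rules):

    import org.apache.beam.sdk.metrics.Counter;
    import org.apache.beam.sdk.metrics.Metrics;
    import org.apache.beam.sdk.transforms.DoFn;

    public class CountingFn extends DoFn<String, String> {
      // Placeholder namespace and name; avoid special characters so
      // Dataflow can forward the metric to Stackdriver.
      private final Counter processed =
          Metrics.counter("my_namespace", "records_processed");

      @ProcessElement
      public void processElement(ProcessContext c) {
        processed.inc();
        c.output(c.element());
      }
    }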