
I'm using Google Dataflow with templates: a template is deployed to GCS by the CI server (Continuous Integration), and later a gcloud dataflow jobs run command is used to start a batch job from that template. Now, within the pipeline itself, I would like to know the start time of this exact job (to use in the names of the output files).

Is this kind of introspection possible in Beam/Dataflow? Is it possible to get the job name and start time of the job from within the job itself? (That is, from the code that executes on the Dataflow worker VMs?)

Thank you!


1 Answer


It can be done, but it's a bit tricky with the current implementation of the template feature.

For the job id, you can follow this code snippet: https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/master/src/main/java/com/google/cloud/teleport/spanner/ExportTransform.java#L178

In that code, the job id is propagated as a side input, but it should also work without a side input.
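As a rough sketch of the pattern from the linked ExportTransform: read the job id from the worker's pipeline options inside a DoFn (so it runs at execution time, not at template-creation time) and turn it into a singleton side input. This assumes DataflowWorkerHarnessOptions.getJobId() is available on the workers, as in the linked code; exact class and method names may vary across Beam versions.

```java
import org.apache.beam.runners.dataflow.options.DataflowWorkerHarnessOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.View;
import org.apache.beam.sdk.values.PCollectionView;

public class JobIdSideInput {
  // Builds a singleton side-input view containing the current job id.
  public static PCollectionView<String> jobIdView(Pipeline p) {
    return p
        .apply("Seed", Create.of("seed"))
        .apply("ReadJobId", ParDo.of(new DoFn<String, String>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            // Runs on a worker, so the worker harness options carry the
            // job id of this particular execution of the template.
            c.output(c.getPipelineOptions()
                .as(DataflowWorkerHarnessOptions.class)
                .getJobId());
          }
        }))
        .apply("AsSingleton", View.asSingleton());
  }
}
```

Any DoFn that needs the job id (e.g. one naming output files) can then take this view via .withSideInputs(...) and call c.sideInput(view).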

For the job start time, there are two ways:

1. Parse it out of the job id, whose leading timestamp is in Pacific time. I agree this is a bit fragile.
2. Get the current time on a worker and pass it down as a side input, following the same pattern as the link above.
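For option 1, a minimal sketch of the parsing, assuming the job id starts with a "yyyy-MM-dd_HH_mm_ss" creation timestamp in US Pacific time (e.g. "2018-12-04_10_15_23-..."); the sample id below is made up for illustration:

```java
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;

public class JobIdTime {
  // Extracts the start time encoded at the front of a Dataflow job id.
  // Assumes the id begins with "yyyy-MM-dd_HH_mm_ss" in Pacific time.
  static ZonedDateTime startTimeFromJobId(String jobId) {
    String stamp = jobId.substring(0, 19); // "yyyy-MM-dd_HH_mm_ss" is 19 chars
    LocalDateTime local = LocalDateTime.parse(
        stamp, DateTimeFormatter.ofPattern("yyyy-MM-dd_HH_mm_ss"));
    return local.atZone(ZoneId.of("America/Los_Angeles"));
  }

  public static void main(String[] args) {
    // Hypothetical job id, for illustration only.
    ZonedDateTime t = startTimeFromJobId("2018-12-04_10_15_23-1234567890123456789");
    System.out.println(t.toInstant()); // start time as a UTC instant
  }
}
```

The fragility the answer mentions is exactly this: the code breaks if the job id format or its time zone ever changes, which is why the side-input approach is safer.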

Thanks.