6
votes

I have a Beam application that runs successfully locally with the DirectRunner and prints all the log messages in my code to my local console. But when I run it in the Google Cloud Dataflow environment, I still see those log messages on my local console, yet they don't show up in the Google Cloud Console for the Dataflow job, nor in its Stackdriver logging page.

Here is what I did to run my code with the Dataflow runner from my local console:

mvn compile exec:java -Dexec.mainClass=... \
                      -Dexec.args="..." \
                      -Pdataflow-runner

and all the logs come back on this local console. But when I go to the Google Cloud Console in my browser and search for the logs of my Dataflow job, I don't see the logs from my LOGGER.info(msg) calls anywhere. I only see logs related to the Dataflow pipeline itself.

So I wonder whether my Beam application runs in two separate parts, such that the part of the main class outside the pipeline runs locally and only the pipeline code is sent to Google Cloud for execution there, and hence the log statements that are not inside the pipeline code never become available in the Google Cloud Dataflow logs.


2 Answers

3
votes

You are correct, the main program does not run on Google Cloud - it only constructs the pipeline and submits it to the Dataflow service.

You can easily confirm this by stepping through your main program in a debugger: it is a regular Java program, and one of the things that happens during its execution is the pipeline.run() call, which under the hood packages the steps of the pipeline built so far into an HTTP request to the Dataflow service saying "here's a specification of a pipeline, please run this". If this call didn't happen, or if the network were down, Dataflow would never even learn that your program exists.

Dataflow is just that - a service that responds to HTTP requests - it is not a different way to run Java programs, so it has no way of knowing about anything in your program that your program isn't explicitly sending to it; for example, it has no way of knowing about your log statements.
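To make that concrete, here is a minimal sketch (the class name and transform contents are made up for illustration): the LOGGER.info calls in main() only ever print to your local console, while the one inside the DoFn executes on Dataflow workers and is therefore what appears in the job's worker logs.

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class LoggingWhereDemo {
  private static final Logger LOGGER = LoggerFactory.getLogger(LoggingWhereDemo.class);

  public static void main(String[] args) {
    // Runs in your local JVM: this line only ever appears on your local console.
    LOGGER.info("Constructing the pipeline");

    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply(Create.of("a", "b", "c"))
     .apply(ParDo.of(new DoFn<String, String>() {
       @ProcessElement
       public void processElement(ProcessContext c) {
         // Runs on Dataflow workers: this is what shows up in the job's worker logs.
         LOGGER.info("Processing element {}", c.element());
         c.output(c.element());
       }
     }));

    // Packages the pipeline and submits it to the Dataflow service over HTTP.
    p.run();

    // Also local only: Dataflow never sees this statement.
    LOGGER.info("Pipeline submitted");
  }
}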

Moreover, if you use templates, then the execution of your main program is completely decoupled from execution of the pipeline: the main program submits the pipeline template and finishes, and you can request to run the template with different parameters later, possibly multiple times or not at all.
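For the classic-template case, the decoupling is visible in the pipeline options. Roughly speaking (this is a sketch: the bucket path is a placeholder, and required options such as project, region and staging location are left out), the main program just stages the template and exits:

import org.apache.beam.runners.dataflow.DataflowRunner;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class StageTemplate {
  public static void main(String[] args) {
    DataflowPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).as(DataflowPipelineOptions.class);
    options.setRunner(DataflowRunner.class);
    // Hypothetical GCS path for the staged template.
    options.setTemplateLocation("gs://my-bucket/templates/my-template");

    Pipeline p = Pipeline.create(options);
    // ... build the pipeline as usual ...

    // With templateLocation set, run() stages the template and returns;
    // no job starts until someone launches the template later.
    p.run();
  }
}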

1
vote

For anyone looking for this answer in 2020: Dataflow worker logs (things inside DoFns and such) have moved to the "Logs Explorer", and Stackdriver has been renamed (its logging features are now Cloud Logging).

Go to Google Cloud Console.

Logging > Logs Explorer

Log Field: Dataflow Steps

Log Name: dataflow.googleapis.com/worker
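If you prefer typing a query instead of clicking through the log-fields panel, a filter along these lines in the Logs Explorer query box should surface the same worker logs (PROJECT_ID and the job ID are placeholders for your own values):

resource.type="dataflow_step"
resource.labels.job_id="<YOUR_JOB_ID>"
logName="projects/<PROJECT_ID>/logs/dataflow.googleapis.com%2Fworker"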