
My Dataflow job has to download a file from a remote server. I want to cache the file on the worker machine so the job doesn't keep re-downloading the same file.

I tried to do this in the setup method, but it seems setup is called once per thread, and multiple threads can call it in parallel (I can't find documentation on this, but based on my experience the job tries to write the file data concurrently, which produces malformed data).

Is there a way to perform one-time setup when a worker machine is launched?

I also checked Apache Beam: DoFn.Setup equivalent in Python SDK, but I believe that one focuses on per-thread setup.

I have one question: do you want to download these files before running the pipeline, or after the job starts? - aga
Before running the pipeline. - Kazuki
Shared indeed seems useful! - Kazuki

1 Answer


The Beam model doesn't include a callback for when a VM is created, because the model makes no guarantees about the runtime environment. However, because Dataflow runs user code in containers, you have two options:

1. Build a custom worker container image that already contains the file (or fetches it when the container starts).
2. Use the Python SDK's `Shared` utility (`apache_beam.utils.shared.Shared`) so the file is loaded once per worker process and shared across threads.

The first gives you direct control over the container image, and it works for all languages. The second only works for Python.
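For the container-image route, the idea is to bake the file in at build time so every worker starts with it already on disk. A sketch of such a Dockerfile, where the base image tag, the source URL, and the destination path are all placeholders you would substitute:

```dockerfile
# Hypothetical custom Dataflow worker image based on the Beam SDK image.
FROM apache/beam_python3.10_sdk:2.48.0

# Bake the remote file into the image at build time, so no worker
# ever has to download it while the job is running.
ADD https://example.com/path/to/file /opt/data/file
```

The pipeline is then launched with this image as the worker container (via Dataflow's custom-container support), and the `DoFn` simply reads `/opt/data/file` from local disk.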