Is it OK to use a for loop for step order in Apache Beam?

Question

I have several functions (or transforms): func1, func2, func3, ...

and a dict that holds the functions

FUNCS = {
    '1': func1,
    '2': func2,
    ...
}

What I'm thinking about is passing a parameter funcs that accepts a string of digits, then using a for loop to go through funcs and apply the corresponding functions in that order.

For example:

say I pass funcs="1321", then the functions are executed as:

with beam.Pipeline() as p:
    lines = (
        p
        | "read file" >> beam.io.ReadFromText('gs://some/inputData.txt')
    )
    for f in funcs:  # "1321"
        lines = lines | FUNCS[f]()  # plus some other params

and the functions are executed in order: func1, func3, func2, func1.

Is there any difference compared with:

with ...
    lines = ...
    lines = lines | func1 | func3 | func2 | func1

It's possible, I think, but is it a good idea? Does it have any drawback for Beam's parallelism?

The true question is:

Does the pipeline get built first and THEN executed?

Will the for loop and the hard-coded steps above end up with the same pipeline? What effect does the for loop have on efficiency and on the final result?

I'm using a Flex Template on Google Dataflow, by the way.

Pablo Pablo · Accepted Answer · 2020-08-12T00:12:28

That's a smart question.

The short answer is: Yes, the pipeline is built first, and then executed.

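The pipeline is only executed when you exit the with block. Everything inside the block, including your for loop, just adds steps to the pipeline graph; no data is processed at that point.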

Your for loop is completely fine.
