1
votes

I'm really new to Kettle. While using the "Set Variables" step in my transformation, I read this: "all steps in a Kettle transformation run in parallel". I'm wondering how this is possible.

For example, I have a transformation which has only two steps: A reads data from a CSV file, and B writes that data to an XML file. If these two steps run in parallel, how can B write the data to XML before A has read it?

Any answers would be appreciated.


1 Answer

5
votes

It is exactly what it says. When a transformation starts, all steps start at the same time. Each step then has an input "buffer", or rowset, which is generally 50k rows.

So as the first step reads rows, it puts them into the buffer, and the next step starts processing those rows while the first step is still reading.

And so on down the line.

In your example, once the first batch of rows has been read from the CSV, the XML step starts writing those rows while the CSV step is still reading the next batch.
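To make the buffering idea concrete, here is a minimal Python sketch of two "steps" that start together and communicate through a bounded buffer. It is only an analogy, not how Kettle is implemented: the names `reader`, `writer`, `run_pipeline` and `ROWSET_SIZE` are made up for illustration, and the buffer is kept tiny so the effect is easy to see.

```python
import queue
import threading

ROWSET_SIZE = 5   # hypothetical buffer size, kept tiny for the demo
DONE = object()   # sentinel that marks the end of the input


def reader(rows, out_buffer, log):
    """Step A: put rows into the bounded buffer; blocks while it is full."""
    for row in rows:
        log.append(("read", row))
        out_buffer.put(row)
    out_buffer.put(DONE)


def writer(in_buffer, written, log):
    """Step B: take rows from the buffer as soon as they are available."""
    while True:
        row = in_buffer.get()
        if row is DONE:
            break
        written.append(row)
        log.append(("write", row))


def run_pipeline(rows):
    buffer = queue.Queue(maxsize=ROWSET_SIZE)
    written, log = [], []
    steps = [
        threading.Thread(target=reader, args=(rows, buffer, log)),
        threading.Thread(target=writer, args=(buffer, written, log)),
    ]
    for step in steps:   # both steps are started up front
        step.start()
    for step in steps:
        step.join()
    return written, log


if __name__ == "__main__":
    written, log = run_pipeline(list(range(20)))
    print(written)
    print(log)
```

Because the buffer holds only a few rows, the reader cannot finish before the writer has started: the log shows reads and writes interleaved, yet every row is still written in the order it was read.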

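That's why the Set Variables step must be used in a previous transformation, with the two transformations tied together by a job. Because all steps start at the same time, other steps in the same transformation can't rely on the variable having been set before they run.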

One of the key things when performance tuning a PDI job is to identify which step in the chain is the slowest. Thankfully, the step performance metrics make this pretty easy!

Additionally, you can run multiple copies of a step if you want to, e.g. for steps writing to a database.
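As a rough illustration of why extra copies help with a slow step, the Python sketch below runs several copies of a consumer in parallel. It is a simplification under stated assumptions: here all copies pull from one shared buffer, which is not necessarily how Kettle hands rows to step copies, and `run_with_copies`, `writer_copy`, `copies` and `rowset_size` are hypothetical names.

```python
import queue
import threading

DONE = object()   # sentinel that tells one copy to stop


def writer_copy(in_buffer, written, lock):
    """One copy of the slow step; several of these run side by side."""
    while True:
        row = in_buffer.get()
        if row is DONE:
            break
        with lock:               # protect the shared output list
            written.append(row)


def run_with_copies(rows, copies=3, rowset_size=5):
    buffer = queue.Queue(maxsize=rowset_size)
    written, lock = [], threading.Lock()
    workers = [
        threading.Thread(target=writer_copy, args=(buffer, written, lock))
        for _ in range(copies)
    ]
    for worker in workers:
        worker.start()
    for row in rows:             # the upstream step feeding the buffer
        buffer.put(row)
    for _ in workers:            # one stop signal per copy
        buffer.put(DONE)
    for worker in workers:
        worker.join()
    return written


if __name__ == "__main__":
    print(sorted(run_with_copies(list(range(50)))))
```

Every row is processed exactly once, but the order in which rows come out is no longer guaranteed, which is worth keeping in mind before running multiple copies of a step whose output order matters.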