
I have an ETL requirement like:

I need to fetch around 20000 records from a table and process each record separately. (Processing each record involves a couple of steps, such as creating a table for that record and inserting some data into it.) For the prototype I implemented it with two jobs (with corresponding transformations). Rather than a table, I created a simple empty file per record. But even this simple case doesn't work smoothly. (When I do create a table for each record, Kettle exits after 5000 records.)

Flow

When I run this, Kettle slows down and then hangs after 2000-3000 files; the processing does eventually complete after a long time, but Kettle appears to stall at some point. Is my design approach right? When I replace the write-to-file with the actual requirement, i.e. creating a new table (through the SQL script step) for each id and inserting data into it, Kettle exits after 5000 records. What do I need to do so that the flow works? Increase the Java memory (Xmx is already at 2 GB)? Is there any other configuration I can change, or is there another way? Extra time isn't a constraint, but the flow should work.
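For context, here is a minimal sketch of what the per-id SQL script step might run. The table and column names (record_${ID}, source_table, attribute) are hypothetical, and it assumes the looping job passes the current id to the transformation as a variable, with variable substitution enabled in the SQL script step:

    -- Hypothetical per-id SQL; ${ID} is substituted by Kettle before execution.
    CREATE TABLE record_${ID} (
      id        INTEGER,
      attribute VARCHAR(255)
    );

    -- Copy this id's data from the (hypothetical) source table into the new table.
    INSERT INTO record_${ID} (id, attribute)
    SELECT id, attribute
    FROM source_table
    WHERE id = ${ID};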

My initial guess was that, since we are not storing any data, the prototype at least should work smoothly. I am using Kettle 3.2.

It is hard to guess how your flow is implemented from just fragments rather than the complete picture. What kind of error do you get? An OutOfMemory exception? – simar
If you have complex data processing, including file-system calls for each row, you should at least limit the number of rows in one row set sent to the transformation. Configure this in the transformation properties -> "Miscellaneous" tab -> number of rows in rowset. – simar
When you use 'Copy rows to result', all records are kept in memory, so if the data is big you will most probably get an OutOfMemory error. Redesign to get rid of copying rows between jobs; copy only when the number of rows is small, and otherwise keep the work inside transformations. – Dzmitry Prakapenka
Thanks, guys! There is not much in the flow: simply fetching ids (around 10000, which are copied to result rows) from the DB and looping through them with "Execute for each input row". In the processing I am just creating a file. It seems Kettle doesn't release the memory: when I watch it with a memory tool like VisualVM/JConsole, there is a linear rise in memory that gets released only when the job with the loop ends. – user6786116
This link helps a bit: forums.pentaho.com/… but adding a dummy transformation that clears the list of result rows/files seems to slow things down. – user6786116

1 Answer


I seem to remember this is a known issue/restriction, which is why job looping is deprecated these days.

Are you able to rebuild the job using the Transformation Executor and/or Job Executor steps? You can execute any number of rows via those steps.

These steps have their own issues, namely that you have to handle errors explicitly, but it's worth a try just to see if you can achieve what you want. It's a slightly different mindset, but a nicer way to build loops than the job approach.