
Good day,

I have a Pentaho Kettle file that runs as a batch job.

Basically, this file consists of 2 main parts. The first part reads from an input file (a txt file) and stores the rows in table1. The second part is the same as the first: it reads the same input file and stores the rows in table2.

This batch was working fine until I fed it a 20MB input file. It now requires more than 7 hours to finish the job.

Below are some test cases I have run:

15360 records, 1.4MB: 2 minutes and 20 seconds (140 seconds total)
30720 records, 2.8MB: 7 minutes and 30 seconds (450 seconds total)
61440 records, 5.5MB: 26 minutes and 55 seconds (1615 seconds total)
250000 records, 20MB: 7 hours and 30 minutes (27000 seconds total)

In the log, I found that a few steps account for most of the time consumed, namely:

1. Text file input
2. Select values
3. Modified Java Script Value

Both main parts contain these 3 Kettle steps. For the 20MB input file, the first part takes only around 7 minutes, but the second part takes more than 7 hours.

I have been looking at this for quite a long time and still can't find out what the problem is.

Kindly advise.

Try disabling the Table output step and check the speed. Something as simple as that should be able to process more than 100k rows per second; your numbers are 1000x slower than they should be. If that solves the speed issue (250k records shouldn't take more than a couple of seconds), the problem is most likely the Table output: either the DB connection is slow, or the commit size is too small, or something like that. – nsousa

1 Answer


There might be multiple reasons (I assume). First of all, try to optimize steps like "Select Values" and "Modified Java Script Value". Some performance tuning tips are given here.
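For illustration, this is the kind of per-row script that tends to make a "Modified Java Script Value" step slow (a sketch only; the field names are made up and are not taken from your transformation):

    // Sketch of a typical per-row script inside a "Modified Java Script Value" step.
    // Field names (first_name, last_name, amount) are hypothetical, for illustration only.
    // Every line below runs once for each incoming row, so keep it as short as possible.

    var full_name    = first_name + " " + last_name;  // simple string concatenation
    var amount_cents = amount * 100;                   // simple arithmetic

    // full_name and amount_cents become new output fields via the step's field grid.
    // Work this simple is usually faster in built-in steps such as "Concat fields"
    // or "Calculator" than in JavaScript.

If your PDI version has the "Compatibility mode?" checkbox in that step, leaving it ticked is another common cause of a slow JavaScript step.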

Also, you may try to increase the Java memory in pan.sh:

[screenshot of pan.sh showing the JAVAMAXMEM setting]

Change JAVAMAXMEM to a higher value such as 1024.
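For reference, the line to look for is roughly the following (a sketch only; the exact default value and the surrounding lines differ between PDI versions):

    # Excerpt from pan.sh (sketch; exact contents vary by Kettle/PDI version).
    # JAVAMAXMEM sets the maximum JVM heap size, in MB, for the command-line tools.
    JAVAMAXMEM="1024"    # raised from the shipped default (for example 512)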

Hope these changes help :)