0
votes

I'm new to PDI; I'm using PDI 7. I have an Excel input with 6 rows and want to insert it into a PostgreSQL database. My transformation is: Excel Input --> PostgreSQL Bulk Loader (2 steps only).

Condition 1: When I run the transformation, the PostgreSQL Bulk Loader step never stops and nothing is inserted into my PostgreSQL database.

Condition 2: So I added an "Insert/Update" step after the PostgreSQL Bulk Loader, and all the data was inserted into PostgreSQL, which means success, but the bulk loader keeps running.

[Screenshot: my transformation]

From all the sources I can find, they only need the input and the Bulk Loader step, and after the transformation finishes, the bulk loader is "Finished" (mine stays "Running"). So I want to ask: how do I do this properly for PostgreSQL? Did I skip something important? Thanks.


4 Answers

0
votes

I did some experiments.

Environment:

  • DB: PostgreSQL v9.5, x64
  • PDI Kettle v5.2.0
  • PDI Kettle default JVM settings (512 MB)
  • Data source: DBF file with over 2_215_000 rows
  • PDI and the database on the same localhost
  • Table truncated on each run
  • PDI Kettle restarted on each run (to avoid heavy CPU load from GC due to the huge number of rows)

Results are below to help you make a decision:

  1. Bulk loader: average over 150_000 rows per second; total around 13-15 s

  2. Table Output (SQL inserts): average 11_500 rows per second; total around 3 min 18 s

  3. Table Output (batch inserts, batch size 10_000): average 28_000 rows per second; total around 1 min 30 s

  4. Table Output (batch inserts in 5 threads, batch size 3_000): average 7_600 rows per second per thread, i.e. around 37_000 rows per second; total time around 59 s

The advantage of the bulk loader is that it doesn't fill the JVM's memory: all data is streamed to the psql process immediately.

Table Output fills the JVM memory with data. After around 1_600_000 rows the memory is full and GC kicks in. The CPU is then loaded up to 100% and the speed slows down significantly. That is why it is worth playing with the batch size to find the value that gives the best performance (bigger is better), although at some point it causes GC overhead.

Last experiment: the memory given to the JVM is enough to hold the data. This can be tweaked via the PENTAHO_DI_JAVA_OPTIONS variable. I set the JVM heap size to 1024 MB and increased the batch size.

  1. Table Output (batch inserts in 5 threads, batch size 10_000): average 12_500 rows per second per thread, i.e. around 60_000 rows per second in total; total time around 35 s

Now it is much easier to make a decision. But you have to keep in mind that Kettle/PDI and the database were located on the same host. If the hosts are different, network bandwidth can play some role in performance.


1
votes

The PostgreSQL bulk loader used to be only experimental. Haven't tried it in some time. Are you sure you need it? If you're loading from Excel, it's unlikely you'll have enough rows to warrant use of a bulk loader.

Try just the regular Table Output step. If you're only inserting, you shouldn't need the Insert/Update step either.

0
votes

To insert just 7 rows you don't need a bulk loader. The bulk loader is designed to load huge amounts of data. It uses the native psql client. The psql client transfers data much faster since it uses all the features of the binary protocol without any of the restrictions of the JDBC specification. JDBC is used in other steps, like Table Output. Most of the time Table Output is sufficient.
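
For reference, the Table Output step sends parameterized inserts over JDBC; a rough sketch of what it prepares and executes looks like this (table and column names are hypothetical):

```sql
-- roughly what Table Output prepares and executes over JDBC for each row/batch
-- (hypothetical table and columns)
INSERT INTO target_table (id, name, amount)
VALUES (?, ?, ?);
```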

The PostgreSQL Bulk Loader step just builds the data from the incoming steps into CSV format in memory and passes it to the psql client.
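
Under the hood this is essentially a PostgreSQL COPY fed from a CSV stream; a minimal sketch of the equivalent command (table name hypothetical) is:

```sql
-- what the psql client effectively runs, with the generated CSV piped to its stdin
COPY target_table FROM STDIN WITH (FORMAT csv);
```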

0
votes

Slow Insert/Update step

Why should you avoid using Insert/Update when a huge amount of data is processed or you are limited by time?

Let's look at the documentation:

The Insert/Update step first looks up a row in a table using one or more lookup keys. If the row can't be found, it inserts the row. If it can be found and the fields to update are the same, nothing is done. If they are not all the same, the row in the table is updated.

The above means that for each row in the stream the step will execute 2 queries: a lookup first, and then an update or an insert. The PDI Kettle source code shows that a PreparedStatement is used for all queries: insert, update and lookup.
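
As a rough sketch, for every incoming row the step therefore runs something like the two statements below (table, key and column names are hypothetical):

```sql
-- 1) lookup by the configured key(s)
SELECT name, amount FROM target_table WHERE id = ?;

-- 2) then, depending on the lookup result, either an update...
UPDATE target_table SET name = ?, amount = ? WHERE id = ?;
-- ...or an insert
INSERT INTO target_table (id, name, amount) VALUES (?, ?, ?);
```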

So if this step is the bottleneck, try to figure out what exactly is slow:

  • Is the lookup slow? (Run the lookup query manually on the database against sample data. Check whether it is slow. Do the columns used to find the corresponding row in the database have an index?)
  • Is the update slow? (Run the update query manually on the database against sample data. Check whether it is slow. Does the update's WHERE clause use an index on the lookup fields? See the SQL sketch after this list.)
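
A quick way to check both points on the database side is a sketch like this (assuming a hypothetical target_table with lookup key id):

```sql
-- see whether the lookup uses an index or falls back to a sequential scan
EXPLAIN ANALYZE SELECT name, amount FROM target_table WHERE id = 12345;

-- if it is a sequential scan, an index on the lookup column(s) usually helps
CREATE INDEX IF NOT EXISTS idx_target_table_id ON target_table (id);
```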

In any case this step is slow, since it requires a lot of network communication and data processing in Kettle.

The only way to make it faster is to load all the data into a "temp" table in the database and call a function that upserts the data, or just use a simple SQL step in the job to do the same.
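
As a sketch of that approach on PostgreSQL 9.5+ (staging table, key and column names are hypothetical), the SQL step could run a single upsert statement like:

```sql
-- upsert everything loaded into the staging table in one statement;
-- requires a unique constraint or primary key on target_table (id)
INSERT INTO target_table (id, name, amount)
SELECT id, name, amount
FROM temp_table
ON CONFLICT (id) DO UPDATE
SET name   = EXCLUDED.name,
    amount = EXCLUDED.amount;
```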