0
votes

I'm new to PDI; I'm using PDI 7. I have an Excel input with 6 rows and want to insert it into a PostgreSQL database. My transformation is: Excel Input --> PostgreSQL Bulk Loader (2 steps only).

Condition 1: When I run the transformation, the PostgreSQL Bulk Loader step never stops and nothing is inserted into my PostgreSQL database.

Condition 2: So I added an "Insert/Update" step after the PostgreSQL Bulk Loader, and all the data was inserted into PostgreSQL, which means success, but the bulk loader keeps running.

[Screenshot: my transformation]

From all the sources I can find, they only need the input and the Bulk Loader step, and after the transformation finishes, the bulk loader is "Finished" (mine stays "Running"). So I want to ask: how do I do this properly for PostgreSQL? Did I skip something important? Thanks.


4 Answers

0
votes

I did some experiments.

Environment:

  • DB: PostgreSQL v9.5, x64
  • PDI Kettle v5.2.0
  • PDI Kettle default JVM settings (512 MB)
  • Data source: DBF file with over 2_215_000 rows
  • PDI and the database on the same localhost
  • Table truncated on each run
  • PDI Kettle restarted on each run (to avoid heavy CPU load from GC due to the huge number of rows)

Results are below to help you make a decision:

  1. Bulk loader: average over 150_000 rows per second; total around 13-15 s

  2. Table Output (SQL inserts): average 11_500 rows per second; total around 3 min 18 s

  3. Table Output (batch inserts, batch size 10_000): average 28_000 rows per second; total around 1 min 30 s

  4. Table Output (batch inserts in 5 threads, batch size 3_000): average 7_600 rows per second per thread, i.e. around 37_000 rows per second; total time around 59 s

The advantage of the bulk loader is that it doesn't fill the JVM's memory: all data is streamed to the psql process immediately.

Table Output fills the JVM memory with data. After around 1_600_000 rows the memory is full and GC kicks in. The CPU is then loaded up to 100% and the speed slows down significantly. That is why it is worth playing with the batch size to find the value that gives the best performance (bigger is better), although at some point it causes GC overhead.

Last experiment: the memory given to the JVM is enough to hold the data. This can be tweaked via the PENTAHO_DI_JAVA_OPTIONS variable. I set the JVM heap size to 1024 MB and increased the batch size.

  1. Table Output (batch inserts in 5 threads, batch size 10_000): average 12_500 rows per second per thread, i.e. around 60_000 rows per second in total; total time around 35 s

Now it is much easier to make a decision. But you have to keep in mind that Kettle/PDI and the database were located on the same host. If the hosts are different, network bandwidth can play some role in performance.


1
votes

The PostgreSQL bulk loader used to be only experimental. Haven't tried it in some time. Are you sure you need it? If you're loading from Excel, it's unlikely you'll have enough rows to warrant use of a bulk loader.

Try just the regular Table Output step. If you're only inserting, you shouldn't need the Insert/Update step either.

0
votes

To insert just 7 rows you don't need a bulk loader. The bulk loader is designed to load huge amounts of data. It uses the native psql client. The psql client transfers data much faster since it uses all the features of the binary protocol without any of the restrictions of the JDBC specification. JDBC is used in other steps, like Table Output. Most of the time Table Output is sufficient.
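
For reference, the Table Output step sends parameterized inserts over JDBC; a rough sketch of what it prepares and executes looks like this (table and column names are hypothetical):

```sql
-- roughly what Table Output prepares and executes over JDBC for each row/batch
-- (hypothetical table and columns)
INSERT INTO target_table (id, name, amount)
VALUES (?, ?, ?);
```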

The PostgreSQL Bulk Loader step just builds the data from the incoming steps into CSV format in memory and passes it to the psql client.
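
Under the hood this is essentially a PostgreSQL COPY fed from a CSV stream; a minimal sketch of the equivalent command (table name hypothetical) is:

```sql
-- what the psql client effectively runs, with the generated CSV piped to its stdin
COPY target_table FROM STDIN WITH (FORMAT csv);
```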

0
votes

Slow Insert/Update step

Why should you avoid using Insert/Update when a huge amount of data is processed or you are limited by time?

Let's look at the documentation:

The Insert/Update step first looks up a row in a table using one or more lookup keys. If the row can't be found, it inserts the row. If it can be found and the fields to update are the same, nothing is done. If they are not all the same, the row in the table is updated.

The above means that for each row in the stream the step will execute 2 queries: a lookup first, and then an update or an insert. The PDI Kettle source code shows that a PreparedStatement is used for all queries: insert, update and lookup.
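
As a rough sketch, for every incoming row the step therefore runs something like the two statements below (table, key and column names are hypothetical):

```sql
-- 1) lookup by the configured key(s)
SELECT name, amount FROM target_table WHERE id = ?;

-- 2) then, depending on the lookup result, either an update...
UPDATE target_table SET name = ?, amount = ? WHERE id = ?;
-- ...or an insert
INSERT INTO target_table (id, name, amount) VALUES (?, ?, ?);
```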

So if this step is the bottleneck, try to figure out what exactly is slow:

  • Is the lookup slow? (Run the lookup query manually on the database against sample data. Check whether it is slow. Do the columns used to find the corresponding row in the database have an index?)
  • Is the update slow? (Run the update query manually on the database against sample data. Check whether it is slow. Does the update's WHERE clause use an index on the lookup fields? See the SQL sketch after this list.)
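
A quick way to check both points on the database side is a sketch like this (assuming a hypothetical target_table with lookup key id):

```sql
-- see whether the lookup uses an index or falls back to a sequential scan
EXPLAIN ANALYZE SELECT name, amount FROM target_table WHERE id = 12345;

-- if it is a sequential scan, an index on the lookup column(s) usually helps
CREATE INDEX IF NOT EXISTS idx_target_table_id ON target_table (id);
```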

In any case this step is slow, since it requires a lot of network communication and data processing in Kettle.

The only way to make it faster is to load all the data into a "temp" table in the database and call a function that upserts the data, or just use a simple SQL step in the job to do the same.
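
As a sketch of that approach on PostgreSQL 9.5+ (staging table, key and column names are hypothetical), the SQL step could run a single upsert statement like:

```sql
-- upsert everything loaded into the staging table in one statement;
-- requires a unique constraint or primary key on target_table (id)
INSERT INTO target_table (id, name, amount)
SELECT id, name, amount
FROM temp_table
ON CONFLICT (id) DO UPDATE
SET name   = EXCLUDED.name,
    amount = EXCLUDED.amount;
```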