Let's say I have a CSV file with 100 million rows. I import that CSV file into Pentaho Kettle and want to write all rows into a PostgreSQL database. What is the fastest insert transformation? I have tried the normal Table Output step and the PostgreSQL Bulk Loader (which is way faster than Table Output). But it is still too slow. Is there a faster way than using the PostgreSQL Bulk Loader?
1 Answer
Considering that the PostgreSQL Bulk Loader runs COPY table_name FROM STDIN, there is nothing faster for loading data into Postgres. A multi-value INSERT is slower, and plain single-row INSERTs are slowest of all. So you can't make the load itself faster.
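For reference, a minimal sketch of the statement involved (the table and column names are illustrative placeholders, not taken from your transformation):

```sql
-- Roughly what the PostgreSQL Bulk Loader issues under the hood:
-- rows are streamed to the server over a single COPY command.
-- my_table and its columns are hypothetical.
COPY my_table (col1, col2, col3) FROM STDIN WITH (FORMAT csv);
```

You can exercise the same code path from psql with \copy my_table FROM 'data.csv' WITH (FORMAT csv), which wraps COPY ... FROM STDIN on the client side.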
To speed up COPY you can:

    set commit_delay to 100000;
    set synchronous_commit to off;

and apply other server-side tricks (like dropping indexes before loading); see the sketch below.
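Putting it together, a sketch of a load session, assuming a hypothetical table my_table with a single index (all names are illustrative):

```sql
-- Session-level settings: they affect only this connection and
-- revert automatically when it closes.
-- Note: changing commit_delay may require superuser privileges.
SET commit_delay TO 100000;
SET synchronous_commit TO off;

-- Rebuilding an index once after the load is usually far cheaper
-- than maintaining it for each of 100 million inserted rows.
DROP INDEX IF EXISTS my_table_col1_idx;

-- The bulk load itself (this is what the Bulk Loader step issues):
COPY my_table FROM STDIN WITH (FORMAT csv);

-- Recreate the index after the data is in:
CREATE INDEX my_table_col1_idx ON my_table (col1);
```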
NB:
- there is a very old but still relevant depesz post on this
- pgloader most probably won't work with Pentaho Kettle, but it is worth checking out
Update:
https://www.postgresql.org/docs/current/static/runtime-config-wal.html
synchronous_commit (enum)
Specifies whether transaction commit will wait for WAL records to be written to disk before the command returns a “success” indication to the client. Valid values are on, remote_apply, remote_write, local, and off. The default, and safe, setting is on. When off, there can be a delay between when success is reported to the client and when the transaction is really guaranteed to be safe against a server crash. (The maximum delay is three times wal_writer_delay.) Unlike fsync, setting this parameter to off does not create any risk of database inconsistency: an operating system or database crash might result in some recent allegedly-committed transactions being lost, but the database state will be just the same as if those transactions had been aborted cleanly. So, turning synchronous_commit off can be a useful alternative when performance is more important than exact certainty about the durability of a transaction.
(emphasis mine)
Also note that I recommend using SET at the session level, so if Pentaho Kettle does not allow you to set config options before running commands on Postgres, you can use the pgbouncer connect_query for the specific user/database pair, or think of some other trick. And if you can't set synchronous_commit per session and decide to change it per database or per user instead (so that it applies to the Kettle connection), don't forget to set it back to on after the load is over; see the sketch below.
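A sketch of the per-database alternative, assuming a hypothetical database named mydb (the same works per role with ALTER ROLE ... SET):

```sql
-- Applies to all new connections to this database, including Kettle's:
ALTER DATABASE mydb SET synchronous_commit TO off;

-- ... run the bulk load ...

-- Restore the safe default once the load is over
-- (ALTER DATABASE mydb RESET synchronous_commit; also works):
ALTER DATABASE mydb SET synchronous_commit TO on;
```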