0
votes

I need to insert large amount of data into an sqlite database. The data is received continuously.

What is better: to prepare statements, begin a transaction, do inserts, after certain amount of inserts end the transaction, repeat the cycle or cache the data, once in a while open the db, do bulk insert, close db, continue caching?

Are there any other tips how to do it in the most efficient way?

1
If by "bulk insert" you mean construct one humongous INSERT statement with a zillion VALUES entries, then that's almost certainly not the way to go. Neither would be lots of implicit-transaction INSERT statements. That leaves prepared-statement-inserts in a batched transaction, in which case whether the source of the data comes from "the incoming stream" or a cache probably doesn't make much difference. - TripeHound
If your table has indices, and other factors allow it, dropping them and (re)creating them at the end has been known to be faster than maintaining them throughout the bulk update. But as noted above, there are a lot of factors that can affect the speed, so benchmarking different approaches is always recommended. - TripeHound
Accumulating data in RAM will be much more efficient than doing so in HDD. So you have to consider your memory requirements and possibility to lose data during accumulation in RAM. - Megamozg

1 Answers

1
votes

The only difference between your two cases is that you're closing the database connection. This is usually not a good idea because it drops the page cache. Closing the connection might be useful only if you have an extremely small amount of memory and really need it for something else.

The important thing is to put many inserts into a single transaction, but you're already doing that.

The question is how many inserts belong into a transaction. If you can do the receiving and the database operations in parallel, then you should simply insert all data that you've received since the last batch. (This will result in a continous stream of database operations.) Caching data for a longer timer does not make sense unless you want to avoid doing too much disk I/O.

If you have certain performance requirements, then you must measure yourself.