Original question
Background
It is well-known that SQLite needs to be fine tuned to achieve insert speeds on the order of 50k inserts/s. There are many questions here regarding slow insert speeds and a wealth of advice and benchmarks.
There are also claims that SQLite can handle large amounts of data, with reports of 50+ GB not causing any problems with the right settings.
I have followed the advice here and elsewhere to achieve these speeds and I'm happy with 35k-45k inserts/s. The problem I have is that all of the benchmarks only demonstrate fast insert speeds with < 1m records. What I am seeing is that insert speed seems to be inversely proportional to table size.
Issue
My use case requires storing 500m to 1b tuples (`[x_id, y_id, z_id]`) over a few years (1m rows / day) in a link table. The values are all integer IDs between 1 and 2,000,000. There is a single index on `z_id`.
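For concreteness, here is a minimal sketch of that link table as it might be created from Perl with DBI/DBD::SQLite; the table name `links` and the index name are assumptions, since the actual schema isn't shown:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use DBI;

# Hypothetical schema for the link table described above: three integer
# IDs per row and a single index on z_id. Table and index names are assumed.
my $dbh = DBI->connect("dbi:SQLite:dbname=links.db", "", "",
                       { RaiseError => 1, AutoCommit => 1 });

$dbh->do(q{
    CREATE TABLE IF NOT EXISTS links (
        x_id INTEGER NOT NULL,
        y_id INTEGER NOT NULL,
        z_id INTEGER NOT NULL
    )
});
$dbh->do("CREATE INDEX IF NOT EXISTS idx_links_z ON links (z_id)");

$dbh->disconnect;
```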
Performance is great for the first 10m rows, ~35k inserts/s, but by the time the table has ~20m rows, performance starts to suffer. I'm now seeing about 100 inserts/s.
The size of the table is not particularly large. With 20m rows, the size on disk is around 500MB.
The project is written in Perl.
Question
Is this the reality of large tables in SQLite or are there any secrets to maintaining high insert rates for tables with > 10m rows?
Known workarounds which I'd like to avoid if possible
- Drop the index, add the records, and re-index: This is fine as a workaround, but doesn't work when the DB still needs to be usable during updates. It won't work to make the database completely inaccessible for x minutes / day
- Break the table into smaller subtables / files: This will work in the short term and I have already experimented with it. The problem is that I need to be able to retrieve data from the entire history when querying, which means that eventually I'll hit the 62-table attachment limit. Attaching, collecting results in a temp table, and detaching hundreds of times per request seems to be a lot of work and overhead, but I'll try it if there are no other alternatives (a sketch of that pattern follows after this list).
- Set `SQLITE_FCNTL_CHUNK_SIZE`: I don't know C (?!), so I'd prefer to not learn it just to get this done. I can't see any way to set this parameter using Perl though.
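Here is a rough sketch of the attach / temp-table / detach pattern mentioned in the second workaround, assuming each shard file holds a `links` table with the same columns; the shard file names and the `z_id` being looked up are made up for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use DBI;

# Hypothetical attach / collect / detach loop: each shard is attached,
# its matching rows are copied into a temp table, and it is detached
# before the next one, so the attachment limit is never approached.
my @shard_files = glob("shards/*.db");
my $z_id        = 42;

my $dbh = DBI->connect("dbi:SQLite:dbname=:memory:", "", "",
                       { RaiseError => 1, AutoCommit => 1 });
$dbh->do("CREATE TEMP TABLE results (x_id INTEGER, y_id INTEGER, z_id INTEGER)");

for my $file (@shard_files) {
    $dbh->do("ATTACH DATABASE " . $dbh->quote($file) . " AS shard");
    $dbh->do("INSERT INTO results SELECT x_id, y_id, z_id FROM shard.links WHERE z_id = ?",
             undef, $z_id);
    $dbh->do("DETACH DATABASE shard");
}

my $rows = $dbh->selectall_arrayref("SELECT x_id, y_id, z_id FROM results");
print scalar(@$rows), " matching rows collected\n";
$dbh->disconnect;
```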
UPDATE
Following Tim's suggestion that an index was causing increasingly slow insert times despite SQLite's claims that it is capable of handling large data sets, I performed a benchmark comparison with the following settings:
- inserted rows: 14 million
- commit batch size: 50,000 records
- `cache_size` pragma: 10,000
- `page_size` pragma: 4,096
- `temp_store` pragma: memory
- `journal_mode` pragma: delete
- `synchronous` pragma: off
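Roughly, these settings translate into Perl/DBI as in the sketch below; the table name, the randomly generated placeholder rows, and the exact batching loop are assumptions, but the pragma values and the 50,000-row commit batches match the list above:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use DBI;

# Apply the benchmark pragmas, then insert in 50,000-row transactions.
my $dbh = DBI->connect("dbi:SQLite:dbname=links.db", "", "",
                       { RaiseError => 1, AutoCommit => 1 });

$dbh->do("PRAGMA page_size    = 4096");   # only takes effect on a fresh/empty database
$dbh->do("PRAGMA cache_size   = 10000");
$dbh->do("PRAGMA temp_store   = MEMORY");
$dbh->do("PRAGMA journal_mode = DELETE");
$dbh->do("PRAGMA synchronous  = OFF");

$dbh->do("CREATE TABLE IF NOT EXISTS links (x_id INTEGER, y_id INTEGER, z_id INTEGER)");

# Placeholder data: random integer IDs in the 1..2,000,000 range.
my @rows = map { [ map { int(rand(2_000_000)) + 1 } 1 .. 3 ] } 1 .. 100_000;

my $sth   = $dbh->prepare("INSERT INTO links (x_id, y_id, z_id) VALUES (?, ?, ?)");
my $count = 0;

$dbh->begin_work;
for my $row (@rows) {
    $sth->execute(@$row);
    if (++$count % 50_000 == 0) {         # commit every 50k inserts, as in the benchmark
        $dbh->commit;
        $dbh->begin_work;
    }
}
$dbh->commit;
$dbh->disconnect;
```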
In my project, as in the benchmark results below, a file-based temporary table is created and SQLite's built-in support for importing CSV data is used. The temporary table is then attached to the receiving database and sets of 50,000 rows are inserted with an `insert-select` statement. Therefore, the insert times do not reflect file-to-database insert times, but rather table-to-table insert speed. Taking the CSV import time into account would reduce the speeds by 25-50% (a very rough estimate; it doesn't take long to import the CSV data).
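A sketch of what that attach-and-insert-select step could look like from Perl follows; the staging database and table names, and the use of rowid ranges to carve out the 50,000-row sets, are assumptions since the post doesn't show the exact mechanism:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use DBI;

# Attach the CSV-staging database and copy its rows into the receiving
# database in 50,000-row slices, delimited here by rowid ranges.
my $dbh = DBI->connect("dbi:SQLite:dbname=links.db", "", "",
                       { RaiseError => 1, AutoCommit => 1 });

$dbh->do("ATTACH DATABASE 'staging.db' AS staging");

my ($max_rowid) = $dbh->selectrow_array("SELECT max(rowid) FROM staging.staging_rows");
$max_rowid //= 0;
my $batch = 50_000;

for (my $lo = 1; $lo <= $max_rowid; $lo += $batch) {
    my $hi = $lo + $batch - 1;
    $dbh->do(q{
        INSERT INTO links (x_id, y_id, z_id)
        SELECT x_id, y_id, z_id
        FROM staging.staging_rows
        WHERE rowid BETWEEN ? AND ?
    }, undef, $lo, $hi);
}

$dbh->do("DETACH DATABASE staging");
$dbh->disconnect;
```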
Clearly having an index causes the slowdown in insert speed as table size increases.
It's quite clear from the data above that Tim's answer is the correct one, rather than the assertions that SQLite just can't handle it. Clearly it can handle large datasets if indexing that dataset is not part of your use case. I have been using SQLite as a backend for a logging system for a while now, which does not need to be indexed, so I was quite surprised at the slowdown I experienced.
Conclusion
If anyone finds themselves wanting to store a large amount of data using SQLite and have it indexed, using shards may be the answer. I eventually settled on using the first three characters of an MD5 hash of a unique column in `z` to determine assignment to one of 4,096 databases. Since my use case is primarily archival in nature, the schema will not change and queries will never require shard walking. There is a limit to database size since extremely old data will be reduced and eventually discarded, so this combination of sharding, pragma settings, and even some denormalisation gives me a nice balance that will, based on the benchmarking above, maintain an insert speed of at least 10k inserts / second.
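For reference, that shard-selection rule can be expressed in a few lines of Perl; the key passed in and the file-naming scheme are assumptions:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# The first three hex digits of an MD5 hash cover 0x000..0xfff,
# i.e. one of 4,096 possible shard databases.
sub shard_file_for {
    my ($unique_key) = @_;
    my $prefix = substr(md5_hex($unique_key), 0, 3);   # e.g. "a3f"
    return "shards/links_$prefix.db";
}

print shard_file_for("some-unique-z-value"), "\n";
```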
Comments
- `cache_size` is set to 10000. All inserts are in one transaction with commits every 50k inserts. These values were chosen after extensive testing to find the fastest insert speed (35k-45k inserts/s), but that speed drops as the number of records increases past about 10m. – user918938
- `table_z` is then linked to `table_y` and `table_x` via `z_id` through this table. So, to answer your question, it helps find individual records. – user918938