15
votes

Background

I would like to insert 1-million records to SQLite using Python. I tried a number of ways to improve it but it is still not so satisfied. The database load file to memory using 0.23 second (search pass below) but SQLite 1.77 second to load and insert to file.

Environment

Intel Core i7-7700 @ 3.6GHz
16GB RAM
Micron 1100 256GB SSD, Windows 10 x64
Python 3.6.5 Minconda
sqlite3.version 2.6.0

GenerateData.py

I generate the 1 million test input data with the same format as my real data.

import time
start_time = time.time()
with open('input.ssv', 'w') as out:
    symbols = ['AUDUSD','EURUSD','GBPUSD','NZDUSD','USDCAD','USDCHF','USDJPY','USDCNY','USDHKD']
    lines = []
    for i in range(0,1*1000*1000):
        q1, r1, q2, r2 = i//100000, i%100000, (i+1)//100000, (i+1)%100000
        line = '{} {}.{:05d} {}.{:05d}'.format(symbols[i%len(symbols)], q1, r1, q2, r2)
        lines.append(line)
    out.write('\n'.join(lines))
print(time.time()-start_time, i)

input.ssv

The test data looks like this.

AUDUSD 0.00000 0.00001
EURUSD 0.00001 0.00002
GBPUSD 0.00002 0.00003
NZDUSD 0.00003 0.00004
USDCAD 0.00004 0.00005
...
USDCHF 9.99995 9.99996
USDJPY 9.99996 9.99997
USDCNY 9.99997 9.99998
USDHKD 9.99998 9.99999
AUDUSD 9.99999 10.00000
// total 1 million of lines, taken 1.38 second for Python code to generate to disk

Windows correctly shows 23,999,999 bytes file size.

Baseline Code InsertData.py

import time
class Timer:
    def __enter__(self):
        self.start = time.time()
        return self
    def __exit__(self, *args):
        elapsed = time.time()-self.start
        print('Imported in {:.2f} seconds or {:.0f} per second'.format(elapsed, 1*1000*1000/elapsed)) 

with Timer() as t:
    with open('input.ssv', 'r') as infile:
        infile.read()

Basic I/O

with open('input.ssv', 'r') as infile:
    infile.read()

Imported in 0.13 seconds or 7.6 M per second

It tests the read speed.

with open('input.ssv', 'r') as infile:
    with open('output.ssv', 'w') as outfile:
        outfile.write(infile.read()) // insert here

Imported in 0.26 seconds or 3.84 M per second

It tests the read and write speed without parsing anything

with open('input.ssv', 'r') as infile:
    lines = infile.read().splitlines()
    for line in lines:
        pass # do insert here

Imported in 0.23 seconds or 4.32 M per second

When I parse the data line by line, it achieves a very high output.

This gives us a sense about how fast the IO and string processing operations on my testing machine.

1. Write File

outfile.write(line)

Imported in 0.52 seconds or 1.93 M per second

2. Split to floats to string

tokens = line.split()
sym, bid, ask = tokens[0], float(tokens[1]), float(tokens[2])
outfile.write('{} {:.5f} {%.5f}\n'.format(sym, bid, ask)) // real insert here

Imported in 2.25 seconds or 445 K per second

3. Insert Statement with autocommit

conn = sqlite3.connect('example.db', isolation_level=None)
c.execute("INSERT INTO stocks VALUES ('{}',{:.5f},{:.5f})".format(sym,bid,ask))

When isolation_level = None (autocommit), program takes many hours to complete (I could not wait for such a long hours)

Note the output database file size is 32,325,632 bytes, which is 32MB. It is bigger than the input file ssv file size of 23MB by 10MB.

4. Insert Statement with BEGIN (DEFERRED)

conn = sqlite3.connect('example.db', isolation_level=’DEFERRED’) # default
c.execute("INSERT INTO stocks VALUES ('{}',{:.5f},{:.5f})".format(sym,bid,ask))

Imported in 7.50 seconds or 133,296 per second

This is the same as writing BEGIN, BEGIN TRANSACTION or BEGIN DEFERRED TRANSACTION, not BEGIN IMMEDIATE nor BEGIN EXCLUSIVE.

5. Insert by Prepared Statement

Using the transaction above gives a satisfactory results but it should be noted that using Python’s string operations is undesired because it is subjected to SQL injection. Moreover using string is slow compared to parameter substitution.

c.executemany("INSERT INTO stocks VALUES (?,?,?)", [(sym,bid,ask)])

Imported in 2.31 seconds or 432,124 per second

6. Turn off Synchronous

Power failure corrupts the database file when synchronous is not set to EXTRA nor FULL before data reaches the physical disk surface. When we can ensure the power and OS is healthy, we can turn synchronous to OFF so that it doe not synchronized after data handed to OS layer.

conn = sqlite3.connect('example.db', isolation_level='DEFERRED')
c = conn.cursor()
c.execute('''PRAGMA synchronous = OFF''')

Imported in 2.25 seconds or 444,247 per second

7. Turn off journal and so no rollback nor atomic commit

In some applications the rollback function of a database is not required, for example a time series data insertion. When we can ensure the power and OS is healthy, we can turn journal_mode to off so that rollback journal is disabled completely and it disables the atomic commit and rollback capabilities.

conn = sqlite3.connect('example.db', isolation_level='DEFERRED')
c = conn.cursor()
c.execute('''PRAGMA synchronous = OFF''')
c.execute('''PRAGMA journal_mode = OFF''')

Imported in 2.22 seconds or 450,653 per second

8. Using in-memory database

In some applications writing data back to disks is not required, such as applications providing queried data to web applications.

conn = sqlite3.connect(":memory:")

Imported in 2.17 seconds or 460,405 per second

9. Faster Python code in the loop

We should consider to save every bit of computation inside an intensive loop, such as avoiding assignment to variable and string operations.

9a. Avoid assignment to variable

tokens = line.split()
c.executemany("INSERT INTO stocks VALUES (?,?,?)", [(tokens[0], float(tokens[1]), float(tokens[2]))])

Imported in 2.10 seconds or 475,964 per second

9b. Avoid string.split()

When we can treat the space separated data as fixed width format, we can directly indicate the distance between each data to the head of data. It means line.split()[1] becomes line[7:14]

c.executemany("INSERT INTO stocks VALUES (?,?,?)", [(line[0:6], float(line[7:14]), float(line[15:]))])

Imported in 1.94 seconds or 514,661 per second

9c. Avoid float() to ?

When we are using executemany() with ? placeholder, we don’t need to turn the string into float beforehand.

executemany("INSERT INTO stocks VALUES (?,?,?)", [(line[0:6], line[7:14], line[15:])])

Imported in 1.59 seconds or 630,520 per second

10. The fastest full functioned and robust code so far

import time
class Timer:    
    def __enter__(self):
        self.start = time.time()
        return self
    def __exit__(self, *args):
        elapsed = time.time()-self.start
        print('Imported in {:.2f} seconds or {:.0f} per second'.format(elapsed, 1*1000*1000/elapsed))
import sqlite3
conn = sqlite3.connect('example.db')
c = conn.cursor()
c.execute('''DROP TABLE IF EXISTS stocks''')
c.execute('''CREATE TABLE IF NOT EXISTS stocks
             (sym text, bid real, ask real)''')
c.execute('''PRAGMA synchronous = EXTRA''')
c.execute('''PRAGMA journal_mode = WAL''')
with Timer() as t:
    with open('input.ssv', 'r') as infile:
        lines = infile.read().splitlines()
        for line in lines:
            c.executemany("INSERT INTO stocks VALUES (?,?,?)", [(line[0:6], line[7:14], line[15:])])
        conn.commit()
        conn.close()

Imported in 1.77 seconds or 564,611 per second

Possible to get faster?

I have a 23MB file with 1 million records composing of a piece of text as symbol name and 2 floating point number as bid and ask. When you search pass above, the test result shows a 4.32 M inserts per second to plain file. When I insert to a robust SQLite database, it drops to 0.564 M inserts per second. What else you may think of to make it even faster in SQLite? What if not SQLite but other database system?

1
Did I get the question right: Half a million inserts per second into SQLite is too slow for you?Klaus D.
@KlausD. The database load file to memory using 0.23 second (search pass above) but SQLite 1.77 second to load and insert to file. Not too slow but I would like to make it faster. See if you may tell if it may be quite closed to the software bottleneck or any method to optimize it.etoricky
Unluckily performance optimization is not a topic for SO. You might find help at Code Review or the database related sibling sites.Klaus D.
Great research! 6 & 7 did the trick for me. I was using in-memory database before, but disabling the safeguards got me to similar speeds on an SSD with REPLACE INTO.siikamiika

1 Answers

1
votes

If python's interpreter is actually a significant factor in timing (section 9) vs SQLite performance, you may find PyPy to improve performance significantly (Python's sqlite3 interface is implemented in pure python.) Here not much is done in pure python, but if you were doing more string operations or had for loops then it is worth it to switch from CPython.

Obviously if performance outside SQLite really matters you can try writing in a faster language like C/C++. Multi-threading may or may not help depending on how the database locks are implemented.