On my setup, PostgreSQL 9.2.2 seems to error out when trying to load a large csv file into a table.
The size of the csv file is ~9GB.
Here's the SQL statement I'm using to do the bulk load:
copy chunksBase (chunkId, Id, chunk, chunkType) from 'path-to-csv.csv' delimiters ',' csv
Here's the error I get after a few minutes:
pg.ProgrammingError: ERROR: out of memory
DETAIL: Cannot enlarge string buffer containing 1073723635 bytes by 65536 more bytes.
CONTEXT: COPY chunksbase, line 47680536
It looks like the buffer can't be enlarged past exactly 1GB, which makes me think this could be a postgresql.conf issue.
Here are the uncommented lines in postgresql.conf:
bash-3.2# cat postgresql.conf | perl -pe 's/^[ \t]*//' | grep -v '^#' | sed '/^$/d'
log_timezone = 'US/Central'
datestyle = 'iso, mdy'
timezone = 'US/Central'
lc_messages = 'en_US.UTF-8' # locale for system error message
lc_monetary = 'en_US.UTF-8' # locale for monetary formatting
lc_numeric = 'en_US.UTF-8' # locale for number formatting
lc_time = 'en_US.UTF-8' # locale for time formatting
default_text_search_config = 'pg_catalog.english'
default_statistics_target = 50 # pgtune wizard 2012-12-02
maintenance_work_mem = 768MB # pgtune wizard 2012-12-02
constraint_exclusion = on # pgtune wizard 2012-12-02
checkpoint_completion_target = 0.9 # pgtune wizard 2012-12-02
effective_cache_size = 9GB # pgtune wizard 2012-12-02
work_mem = 72MB # pgtune wizard 2012-12-02
wal_buffers = 8MB # pgtune wizard 2012-12-02
checkpoint_segments = 16 # pgtune wizard 2012-12-02
shared_buffers = 3GB # pgtune wizard 2012-12-02
max_connections = 80 # pgtune wizard 2012-12-02
bash-3.2#
Nothing that explicitly sets a buffer to 1GB.
What's going on here? Even if the solution is to increase a buffer in postgresql.conf, why does postgres appear to load the entire csv file into RAM on a single copy call? Loading large csv files has to be a common task, and I can't be the first person to run into this problem, so I would have expected postgres to chunk the bulk load so that the buffer limit is never hit in the first place.
As a workaround, I'm splitting the csv into smaller files and then calling copy for each file. This seems to be working fine, but it's not a particularly satisfying solution, because now I have to maintain split versions of each large csv that I want to load into postgres. There has to be a better way to bulk load a large csv file into postgres.
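For reference, here's a minimal sketch of the splitting workaround, assuming psql access to the database and a headerless csv (the database name, chunk size, and paths are placeholders):
# split the csv into pieces of roughly 5 million lines each
split -l 5000000 path-to-csv.csv /tmp/chunk_
# load each piece with a separate client-side \copy
for f in /tmp/chunk_*; do
    psql -d mydb -c "\copy chunksBase (chunkId, Id, chunk, chunkType) from '$f' with csv"
done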
EDIT1: I'm in the process of verifying that the csv file isn't malformed in any way, by trying to load each of the split csv files into postgres. If they all load, then the problem is probably not a malformed csv. I've already found a few issues; I'm not sure yet whether they're what's causing the string buffer error when loading the large csv.
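One rough check for an unmatched " character (as suggested in the comment below), assuming none of my fields legitimately contain embedded newlines (quoted fields spanning multiple lines would trip this up), is to flag lines with an odd number of double quotes:
# gsub returns the number of " characters on each line; flag lines where that count is odd
awk '{ if (gsub(/"/, "\"") % 2) print "line " FNR ": odd number of \" characters" }' path-to-csv.csv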
See the COPY command's documentation on CSV handling for details. Does your CSV have an unmatched " character? – willglynn