How to parse large CSV files in a way that is robust to unclosed double quote characters?

Question

I'm trying to parse large CSV files (large here means the CSV files are frequently larger than main memory). I process the CSV row-by-row as a stream, which allows me to deal with those large files.

The CSV RFC (RFC 4180) specifies that everything between a pair of double quote characters is treated as a single field, so delimiters inside it are escaped:

Fields containing line breaks (CRLF), double quotes, and commas should be enclosed in double-quotes. For example:

"aaa","b CRLF bb","ccc" CRLF zzz,yyy,xxx

Every now and then, my application has to deal with malformed CSV files that contain a double quote character that is never closed. The CSV parser then tries to read the entire rest of the file, starting from that double quote character, into one field. Because my files can be large, this can cause memory issues.

I want to make my parsing solution robust to such cases by detecting the problem and aborting the parse. One thing that might help is that I know the typical length of my fields, so I might be able to use an upper bound on the field length.

Does anyone know a way to parse large CSV files that may contain unclosed double quote characters, such that the file is parsed when possible, and parsing is aborted without first consuming all available memory when an unclosed double quote is present? My current solution uses OpenCSV, but I would have no problem switching if that would help solve this.

Why not read and parse row by row? (Or does CRLF mean there is a line break in the field?) Then you would not need to read the rest of the file if the closing double quote is missing in one row. Could a field also contain the field delimiter character? Is "ccc" CRLF zzz one field or two fields? Maybe you should provide some valid examples and some exceptional examples, and say how they should be interpreted. — SubOptimal
You should implement some sort of validation for each field if possible; otherwise a correct quote closure can be detected only at the end of the line. — Francesco Pirrone
Double quotes are used as encapsulation delimiters for some value fields in CSV. What do you want to do with a lone opening quote: discard it, or treat it as part of the next value? — SanyaLuc
@SubOptimal I parse row by row. The problem is that a double quote means everything after it is escaped until the next double quote, and that includes line breaks. — Niek Tax
@NiekTax In this case you need to implement some validation at the field level. For example, if you know the field is defined to be at most 50 characters long and you read the 51st character without finding a closing double quote, you can reject it as invalid. In the end you are left with the point EJP already mentioned: you can never know where exactly the missing quote was supposed to be. And if the field delimiter itself is a valid character in an escaped field, it gets even harder to guess. Take pen, paper and a few broken lines and try to write down rules that reliably detect the fields. — SubOptimal

user207421 · Accepted Answer · 2015-05-19T10:20:41

Reject them.

The problem is insoluble, except by heuristics such as maximum field lengths, but then what? You can never know where exactly the missing quote was supposed to be.
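For what it's worth, the maximum-field-length heuristic can be sketched as a small streaming reader that refuses to buffer more than a fixed number of characters per field. This is only an illustration under stated assumptions, not OpenCSV's API: the names `BoundedCsvReader` and `FieldTooLongException` and the limit passed to the constructor are hypothetical, the delimiter is assumed to be a comma, and stray quotes inside unquoted fields are handled leniently. It detects that something is wrong and stops; it does not try to guess where the missing quote belonged.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: streams CSV records and aborts once a field grows past a limit. */
final class BoundedCsvReader {

    /** Signals a field longer than the limit, e.g. because a quote was never closed. */
    static final class FieldTooLongException extends IOException {
        FieldTooLongException(String message) { super(message); }
    }

    private final Reader in;          // pass a BufferedReader for real files
    private final int maxFieldLength; // assumed upper bound on a legitimate field

    BoundedCsvReader(Reader in, int maxFieldLength) {
        this.in = in;
        this.maxFieldLength = maxFieldLength;
    }

    /** Returns the next record, or null at end of input. */
    List<String> readRecord() throws IOException {
        List<String> fields = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        boolean inQuotes = false;    // currently inside a quoted field
        boolean quoteSeen = false;   // previous char was a '"' inside a quoted field
        boolean sawAnything = false; // distinguishes end of input from an empty record
        int c;
        while ((c = in.read()) != -1) {
            sawAnything = true;
            char ch = (char) c;
            if (inQuotes) {
                if (quoteSeen) {
                    quoteSeen = false;
                    if (ch == '"') { append(field, '"'); continue; } // "" is an escaped quote
                    inQuotes = false; // the earlier quote closed the field; handle ch below
                } else if (ch == '"') {
                    quoteSeen = true;
                    continue;
                } else {
                    append(field, ch); // delimiters and line breaks are data here
                    continue;
                }
            }
            if (ch == '"') {
                inQuotes = true;
            } else if (ch == ',') {
                fields.add(field.toString());
                field.setLength(0);
            } else if (ch == '\n') {
                fields.add(field.toString());
                return fields;
            } else if (ch != '\r') { // CR outside quotes is dropped so CRLF ends a record
                append(field, ch);
            }
        }
        if (inQuotes && !quoteSeen) {
            throw new IOException("Unclosed quote at end of input");
        }
        if (!sawAnything) {
            return null;
        }
        fields.add(field.toString());
        return fields;
    }

    /** Appends one character, rejecting the input if the field would exceed the limit. */
    private void append(StringBuilder field, char ch) throws FieldTooLongException {
        if (field.length() >= maxFieldLength) {
            throw new FieldTooLongException(
                "Field exceeds " + maxFieldLength + " characters; possibly an unclosed quote");
        }
        field.append(ch);
    }

    public static void main(String[] args) throws IOException {
        String csv = "a,b\n\"never closed,c\nd,e\n";
        BoundedCsvReader reader = new BoundedCsvReader(new StringReader(csv), 10);
        try {
            for (List<String> rec; (rec = reader.readRecord()) != null; ) {
                System.out.println(rec);
            }
        } catch (FieldTooLongException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```

Because at most `maxFieldLength` characters are ever held for a single field, an unclosed quote makes the reader fail after reading a bounded amount of input rather than after swallowing the rest of the file. The number of fields per record is not bounded in this sketch, and records that were already returned before the failure may need to be discarded by the caller.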

Reject them.
