I'm trying to parse large CSV files ("large" here meaning frequently larger than main memory). I process each file row by row as a stream, which lets me handle files of that size.
The CSV RFC (RFC 4180) defines the double quote character as enclosing everything up to the matching closing quote in a single field (thus escaping delimiters):
Fields containing line breaks (CRLF), double quotes, and commas should be enclosed in double-quotes. For example:
"aaa","b CRLF bb","ccc" CRLF zzz,yyy,xxx
Every now and then, my application needs to deal with improper CSV files that contain a double quote character which is never closed. This results in the CSV parser trying to read everything from that double quote onward into a single field, which, since my files can be large, might cause memory issues.
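For concreteness, here is a made-up illustration of the failure: the quote opened in the second field of the first record is never closed, so a parser following the RFC treats the entire rest of the file as that one field:

```
aaa,"bbb,ccc
zzz,yyy,xxx
(...many more rows, all swallowed into the unclosed field...)
```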
What I want to do is make my parsing solution robust to such cases by somehow detecting these problems and aborting parsing when they occur. One thing that might help is that I know the typical length of my fields, so I might be able to do something with an upper bound on the field length.
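To sketch the kind of thing I have in mind (hypothetical code, not what I currently run): a `Reader` wrapper that tracks quote state character by character and throws once any single field exceeds a configurable bound, so whatever parser reads through it aborts before buffering a huge field. The class name and the simplified quote handling are my own assumptions:

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;

/**
 * Hypothetical sketch: fails fast once any single CSV field exceeds
 * maxFieldLength characters, instead of letting the downstream parser
 * buffer the rest of the file after an unclosed quote.
 */
public final class BoundedFieldReader extends FilterReader {

    private final int maxFieldLength;
    private boolean inQuotes = false;
    private int fieldLength = 0;

    public BoundedFieldReader(Reader in, int maxFieldLength) {
        super(in);
        this.maxFieldLength = maxFieldLength;
    }

    @Override
    public int read() throws IOException {
        int c = super.read();
        if (c == -1) {
            return c;
        }
        if (inQuotes) {
            if (c == '"') {
                inQuotes = false; // a following '"' ("" escape) re-enters below
            } else {
                fieldLength++;
            }
        } else if (c == '"') {
            inQuotes = true;
        } else if (c == ',' || c == '\n') {
            fieldLength = 0; // field or record boundary resets the counter
        } else if (c != '\r') {
            fieldLength++;
        }
        if (fieldLength > maxFieldLength) {
            throw new IOException("Field longer than " + maxFieldLength
                    + " chars - likely an unclosed quote");
        }
        return c;
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        // Funnel bulk reads through read() so the state machine sees
        // every character. Slow, but fine for a sketch.
        int i = 0;
        for (; i < len; i++) {
            int c = read();
            if (c == -1) {
                return i == 0 ? -1 : i;
            }
            buf[off + i] = (char) c;
        }
        return i;
    }
}
```

My existing OpenCSV code could presumably read through something like `new CSVReader(new BoundedFieldReader(new FileReader(file), 10_000))` (10,000 being a placeholder for my expected upper bound), so that `readNext()` surfaces the `IOException` instead of running out of memory.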
Does anyone know a way to parse CSV files that is robust to large files containing unclosed double quote characters, such that it parses the file when possible and aborts, without first consuming all available memory, when an unclosed double quote is present? My current parsing solution uses OpenCSV, but I would have no problem switching if that would help solve it.
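Since I am open to switching libraries: univocity-parsers appears to have a built-in limit, `setMaxCharsPerColumn`, which, as far as I understand from its documentation, makes the parser throw instead of buffering an unbounded field. A sketch of what I imagine the usage would look like (file name and limit are placeholders):

```java
import com.univocity.parsers.common.TextParsingException;
import com.univocity.parsers.csv.CsvParser;
import com.univocity.parsers.csv.CsvParserSettings;

import java.io.FileReader;

public class BoundedCsvParsing {
    public static void main(String[] args) throws Exception {
        CsvParserSettings settings = new CsvParserSettings();
        // Abort when a single field grows past this many characters;
        // 10_000 is just a placeholder for my expected upper bound.
        settings.setMaxCharsPerColumn(10_000);

        CsvParser parser = new CsvParser(settings);
        try (FileReader reader = new FileReader("data.csv")) {
            parser.beginParsing(reader);
            String[] row;
            while ((row = parser.parseNext()) != null) {
                process(row); // stream row by row, as before
            }
        } catch (TextParsingException e) {
            // Thrown when the column limit is exceeded, e.g. because of
            // an unclosed quote; parsing stops without exhausting memory.
            System.err.println("Aborted: " + e.getMessage());
        }
    }

    private static void process(String[] row) { /* ... */ }
}
```

Is this a reasonable approach, or is there a better-established way to handle this?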
"ccc" CLRF zzz
one field or two fields? Maybe you should provide some valid examples and some exceptional examples and how the should be interpreted. – SubOptimalYou can never know where exactly the missing quote was supposed to be.
And if the field delimiter itself is a valid character in an escaped field it get even harder to guess. Take pen, paper and few broken lines and try to write down the rules to reliable detect the fields. – SubOptimal