4
votes

I'm trying to parse large CSV files (large here means the CSV files are frequently larger than main memory). I process the CSV row-by-row as a stream, which allows me to deal with those large files.

The RFC on CSV files specifies that everything between a pair of double quote characters is treated as a single field (thus escaping delimiters):

  1. Fields containing line breaks (CRLF), double quotes, and commas should be enclosed in double-quotes. For example:

    "aaa","b CRLF
    bb","ccc" CRLF
    zzz,yyy,xxx

Every now and then, my application needs to deal with improper CSV files that contain a double quote character that is never closed. This results in the CSV parser trying to read the whole remainder of the file, starting from this double quote character, into one field, which, as my files can be large, might cause memory issues.

What I want to do is make my parsing solution robust to such cases by somehow detecting these problems and aborting the parse when they occur. One thing that might help is that I know the typical length of my fields, so I might be able to do something with an upper bound on the field length.

Does anyone know a way to parse CSV files that is robust to large files that may contain unclosed double quote characters, such that it parses the file when possible and aborts without first consuming all available memory when an unclosed double quote is present? My current parsing solution uses OpenCSV, but I would have no problem switching if that would help solve it.

2
Why not read and parse row by row? (Or does CRLF mean there is a line break in the field?) Then you would not need to read the rest of the file if the closing double quote is missing in one row. Could a field also contain the field delimiter character? Is "ccc" CRLF zzz one field or two fields? Maybe you should provide some valid examples and some exceptional examples, and how they should be interpreted. - SubOptimal
You should implement some sort of validation for each field if possible; otherwise, a correct quote closure can be detected only at the end of the line. - Francesco Pirrone
Double quotes are used as encapsulation delimiters for some value fields in CSV. What do you want to do with a single unclosed quote: throw it out, or use it as part of your next value? - SanyaLuc
@SubOptimal I parse row by row. The problem is that a double quote means everything after it is escaped until the next double quote, so this includes line breaks. - Niek Tax
@NiekTax In that case you need to implement some validation at the field level. E.g., if you know the field is defined to be at most 50 characters long and you read the 51st character without finding a closing double quote, you can reject it as invalid. In the end you are left with the point EJP already mentioned: you can never know where exactly the missing quote was supposed to be. And if the field delimiter itself is a valid character in an escaped field, it gets even harder to guess. Take pen, paper, and a few broken lines and try to write down the rules to reliably detect the fields. - SubOptimal

2 Answers

1
votes

Reject them.

The problem is insoluble, except by heuristics such as maximum field lengths, but then what? You can never know where exactly the missing quote was supposed to be.

Reject them.
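If you do go with the maximum-field-length heuristic, rejecting does not have to cost memory. Below is a minimal sketch, assuming comma-delimited input, `""` as the quote escape, and a limit you choose yourself. `BoundedCsvScanner`, `scan` and `FieldTooLongException` are hypothetical names made up for this illustration; they are not part of OpenCSV or any other library. The scanner only tracks the quote state and the length of the current field, so it uses constant memory and aborts as soon as a field grows past the limit.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

/** Hypothetical helper for illustration; not part of OpenCSV or any library. */
class BoundedCsvScanner {

    /** Thrown as soon as a field grows past the configured limit. */
    static class FieldTooLongException extends IOException {
        final long line; // 1-based line on which the offending field started

        FieldTooLongException(long line, int maxFieldLength) {
            super("Field starting on line " + line + " exceeds " + maxFieldLength
                    + " characters; probably an unclosed double quote");
            this.line = line;
        }
    }

    /**
     * Streams through the input once, keeping only the quote state and the
     * length of the current field. Returns the number of records seen.
     */
    static long scan(Reader source, int maxFieldLength) throws IOException {
        Reader in = new BufferedReader(source);
        boolean inQuotes = false;
        boolean pending = false;   // current record has content not yet counted
        int fieldLength = 0;
        long line = 1, fieldStartLine = 1, records = 0;
        int c;
        while ((c = in.read()) != -1) {
            if (c == '\n') {
                line++;
            }
            if (c == '"') {
                // An escaped quote ("") toggles twice, leaving the state unchanged.
                inQuotes = !inQuotes;
                pending = true;
            } else if (!inQuotes && (c == ',' || c == '\n')) {
                // Field boundary: reset the counter for the next field.
                fieldLength = 0;
                fieldStartLine = line;
                if (c == '\n') {
                    if (pending) {
                        records++;
                    }
                    pending = false;
                } else {
                    pending = true;
                }
            } else if (inQuotes || c != '\r') {
                // Ordinary field content (CR outside quotes is ignored).
                pending = true;
                if (++fieldLength > maxFieldLength) {
                    throw new FieldTooLongException(fieldStartLine, maxFieldLength);
                }
            }
        }
        if (inQuotes) {
            throw new IOException("Unclosed double quote in field starting on line "
                    + fieldStartLine);
        }
        return pending ? records + 1 : records;
    }

    public static void main(String[] args) throws IOException {
        String valid = "\"aaa\",\"b\r\nbb\",\"ccc\"\r\nzzz,yyy,xxx\r\n";
        System.out.println("records: " + scan(new StringReader(valid), 50));

        String broken = "a,\"never closed\r\n" + "x,y,z\r\n".repeat(100);
        try {
            scan(new StringReader(broken), 50);
        } catch (FieldTooLongException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

You could run this as a cheap pre-check before handing the file to your real parser, or fold the same counter into your own reading loop. The reported line is only where the oversized field started, which is a hint for whoever has to fix the file, not proof of where the missing quote belongs.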

1
votes

Use the CSV parser in uniVocity-parsers. It can even parse broken quote escapes. Try this example:

import java.io.*;
import java.util.*;

import com.univocity.parsers.csv.*;

public class Test {

    public static void main(String ... args){
        CsvParserSettings settings = new CsvParserSettings();
        settings.getFormat().setLineSeparator("\r\n");
        settings.setParseUnescapedQuotes(true); // THIS IS IMPORTANT FOR YOU
        CsvParser parser = new CsvParser(settings);

        String line1 = "something,\"a quoted value \"with unescaped quotes\" can be parsed\", something\r\n";
        System.out.println("Input line: " + line1);

        String line2 = "\"after the newline \r\n you will find \" more stuff\r\n";
        System.out.println("Input line: " + line2);

        List<String[]> allLines = parser.parseAll(new StringReader(line1 + line2));


        int count = 0;
        for(String[] line : allLines){
            System.out.println("Line " + ++count);
            for(String element : line){
                System.out.println("\t" + element);

            }
            System.out.println();
        }
    }
}

This will produce:

Input line: something,"a quoted value "with unescaped quotes" can be parsed", something

Input line: "after the newline 
 you will find " more stuff

Line 1
    something
    a quoted value "with unescaped quotes" can be parsed
    something

Line 2
    after the newline 
 you will find " more stuff

Note that line 2 yields one multiline record containing a quote character. The parser assumes the quote is part of the value in this case, because you are parsing unescaped quotes and the input is broken.

Now, if you change this line to disallow broken quotes:

settings.setParseUnescapedQuotes(false);

You will get:

Exception in thread "main" com.univocity.parsers.common.TextParsingException: com.univocity.parsers.common.TextParsingException - Unescaped quote character '"' inside quoted value of CSV field. To allow unescaped quotes, set 'parseUnescapedQuotes' to 'true' in the CSV parser settings. Cannot parse CSV input.
Internal state when error was thrown: line=0, charIndex=29, content parsed=a quoted value 

Disclosure: I am the author of this library. It's open-source and free (Apache V2.0 license).