I have to read a CSV file. The file can use any delimiter, and its fields may or may not be enclosed by "\"". The file should also be parsed according to RFC4180. (I know that in RFC4180 the delimiter is ",", but a user should also be able to read a file delimited by "|", for example.)

public List<List<String>> readFileAsListOfList(File file, String delimiter, String lineEnding, String enclosure) throws Exception {
        if (!file.exists()) {
            throw new Exception("File doesn't exist.");
        }
        if (!file.isFile()) {
            throw new Exception("File must be a file.");
        }

        List<List<String>> fileContent = new ArrayList<>();
        CSVFormat csvFormat = CSVFormat.RFC4180.withDelimiter(delimiter.charAt(0)).withEscape(lineEnding.charAt(0));
        if (StringUtils.isNotEmpty(enclosure)) {
            csvFormat.withQuote(enclosure.charAt(0));
        } else {
            csvFormat.withQuote(null);
        }
        System.out.println(csvFormat);
        for (CSVRecord rec : csvFormat.parse(new FileReader(file))) {
            // Use a fresh list per record so rows do not share one growing list.
            List<String> lineContent = new ArrayList<>();
            for (String field : rec) {
                lineContent.add(field);
            }
            fileContent.add(lineContent);
        }
        return fileContent;
    }

If the file is not enclosed and contains a line like

aaa|bbb|"|ccc

I get the following error:

Exception in thread "main" java.lang.IllegalStateException: IOException reading next record: java.io.IOException: (startline 120707) EOF reached before encapsulated token finished
    at org.apache.commons.csv.CSVParser$1.getNextRecord(CSVParser.java:530)
    at org.apache.commons.csv.CSVParser$1.hasNext(CSVParser.java:540)
    at com.ids.dam.pim.validation.CSVFileReaderApache.readFileAsListOfList(CSVFileReaderApache.java:61)
    at com.ids.dam.pim.validation.CSVFileReaderApache.main(CSVFileReaderApache.java:78)
Caused by: java.io.IOException: (startline 120707) EOF reached before encapsulated token finished
    at org.apache.commons.csv.Lexer.parseEncapsulatedToken(Lexer.java:288)
    at org.apache.commons.csv.Lexer.nextToken(Lexer.java:158)
    at org.apache.commons.csv.CSVParser.nextRecord(CSVParser.java:586)
    at org.apache.commons.csv.CSVParser$1.getNextRecord(CSVParser.java:527)
    ... 3 more

I think this is because my CSVFormat still uses a double quote as the enclosure, since that is the default in RFC4180.

Printing out the format gives the following:

Delimiter=<|> Escape=<L> QuoteChar=<"> RecordSeparator=<
> SkipHeaderRecord:false

To me, this means I can override the default delimiter with CSVFormat.RFC4180.withDelimiter(delimiter.charAt(0)... but I cannot set the enclosure to null.

Is there a way to set the enclosure to null while still using RFC4180?

1 Answer

Quoting is always optional in CSV, and the quote character can be chosen just as the delimiter can. If you know that your file uses a | delimiter and no quotes, you should build your CSVFormat that way. And beware: withOption(...) does not apply the option to the current CSVFormat but returns a new one that is the same as the original except that it has the option set. From the Apache CSVFormat Javadoc:

public CSVFormat withQuoteMode(QuoteMode quoteModePolicy)

Returns a new CSVFormat with the output quote policy of the format set to the specified value.
...

Returns: A new CSVFormat that is equal to this but with the specified quote policy

You should use:

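    CSVFormat csvFormat = CSVFormat.RFC4180.withDelimiter(delimiter.charAt(0))
            .withEscape(lineEnding.charAt(0));
    if (StringUtils.isNotEmpty(enclosure)) {
        // Reassign: withQuote returns a new CSVFormat instead of modifying this one.
        csvFormat = csvFormat.withQuote(enclosure.charAt(0));
    } else {
        // A null quote character disables quoting when parsing;
        // QuoteMode only controls how output is quoted.
        csvFormat = csvFormat.withQuote((Character) null);
    }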
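
To illustrate why the result of a with...() call has to be assigned back, here is a minimal, self-contained sketch. The Format class below is a hypothetical stand-in written only for this example; it is not the Commons CSV class, it merely follows the same immutable "return a new object" convention described above.

```java
public class WitherDemo {

    /** Hypothetical immutable format object (illustration only, not Commons CSV). */
    static final class Format {
        private final char delimiter;
        private final Character quote; // null means quoting is disabled

        Format(char delimiter, Character quote) {
            this.delimiter = delimiter;
            this.quote = quote;
        }

        /** Returns a NEW Format with the given quote; the receiver stays unchanged. */
        Format withQuote(Character newQuote) {
            return new Format(delimiter, newQuote);
        }

        Character getQuote() {
            return quote;
        }
    }

    /** Mirrors the question's code: the returned object is thrown away. */
    static Character discardResult() {
        Format format = new Format('|', '"');
        format.withQuote(null); // no effect on 'format'
        return format.getQuote();
    }

    /** Mirrors the fix: the returned object replaces the old reference. */
    static Character keepResult() {
        Format format = new Format('|', '"');
        format = format.withQuote(null);
        return format.getQuote();
    }

    public static void main(String[] args) {
        System.out.println("discarded:  " + discardResult());
        System.out.println("reassigned: " + keepResult());
    }
}
```

In the first method the quote character is still the double quote afterwards, which matches the QuoteChar=<"> seen in the printed format in the question; only the second method ends up with quoting disabled.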