3
votes

I am trying to parse a CSV file using OpenCSV. One of the columns stores the data in YAML serialized format and is quoted because it can have comma inside it. It also has quotes inside it, so it is escaped by putting two quotes. I am able to parse this file easily in Ruby, but with OpenCSV I am not able to parse it fully. It is a UTF-8 encoded file.

Here is my Java snippet which is trying to read the file

CSVReader reader = new CSVReader(new InputStreamReader(new FileInputStream(csvFilePath), "UTF-8"), ',', '\"', '\\');

Here are 2 lines from this file. First line is not being parsed properly and is getting split at ""[Fair Trade Certified]"" because of escaped double quotes I guess.

1061658767,update,1196916,Product,28613099,Product::Source,"---
product_attributes:
-
- :name: Ornaments
  :brand_id: 49120
  :size: each
  :alcoholic: false
  :details: ""[Fair Trade Certified]""
  :gluten_free: false
  :kosher: false
  :low_fat: false
  :organic: false
  :sugar_free: false
  :fat_free: false
  :vegan: false
  :vegetarian: false
",,2015-11-01 00:06:19.796944,,,,,,
1061658768,create,,,28613100,Product::Source,"---
product_id:
retailer_id:
store_id:
source_id: 333790
locale: en_us
source_type: Product::PrehistoricProductDatum
priority: 1
is_definition:
product_attributes:
",,2015-11-01 00:06:19.927948,,,,,,
2
The "standard" for CSV files is RFC4180, but is not always followed. It includes quoting fields with commas, and transforming inner quotes to two quotes. Googling "RFC4180 java parser" turns up some possibilities.Paul
With OpenCSV you are unable to parse it. Credit where credit is due.user207421
@EJP don't know what you imply by that :) But anyways, using a parser compatible with RFC4180 fixed it.invinc4u
Thanks @Paul. Using a compatible RFC4180 parser worked for me !!invinc4u

2 Answers

2
votes

First off I am glad the FastCSV worked for you but I ran the suspected substring and ran it through the 3.9 openCSV and it worked with both the CsvParser and the RFC4180Parser. Could you please give a little detail on how it did not parse and/or try it with 3.9 openCSV to see if you get the same issue and then try with the configuration below.

Here are the tests that I used:

CSVParser:

@Test
public void parseBigStringFromStackOverflowWithMultipleQuotesInLine() throws IOException {

    String bigline = "28613099,Product::Source,\"---\n" +
            "product_attributes:\n" +
            "-\n" +
            "- :name: Ornaments\n" +
            "  :brand_id: 49120\n" +
            "  :size: each\n" +
            "  :alcoholic: false\n" +
            "  :details: \"\"[Fair Trade Certified]\"\"\n" +
            "  :gluten_free: false\n" +
            "  :kosher: false\n" +
            "  :low_fat: false\n" +
            "  :organic: false\n" +
            "  :sugar_free: false\n" +
            "  :fat_free: false\n" +
            "  :vegan: false\n" +
            "  :vegetarian: false\n" +
            "\",,2015-11-01 00:06:19.796944";

    String suspectString = "---\n" +
            "product_attributes:\n" +
            "-\n" +
            "- :name: Ornaments\n" +
            "  :brand_id: 49120\n" +
            "  :size: each\n" +
            "  :alcoholic: false\n" +
            "  :details: \"[Fair Trade Certified]\"\n" +
            "  :gluten_free: false\n" +
            "  :kosher: false\n" +
            "  :low_fat: false\n" +
            "  :organic: false\n" +
            "  :sugar_free: false\n" +
            "  :fat_free: false\n" +
            "  :vegan: false\n" +
            "  :vegetarian: false\n" ;

    StringReader stringReader = new StringReader(bigline);

    CSVReaderBuilder builder = new CSVReaderBuilder(stringReader);
    CSVReader csvReader = builder.withFieldAsNull(CSVReaderNullFieldIndicator.BOTH).build();

    String item[] = csvReader.readNext();

    assertEquals(5, item.length);
    assertEquals("28613099", item[0]);
    assertEquals("Product::Source", item[1]);
    assertEquals(suspectString, item[2]);
}

RFC4180Parser

def 'parse big line from stackoverflow with complex string'() {
    given:
    RFC4180ParserBuilder builder = new RFC4180ParserBuilder()
    RFC4180Parser parser = builder.build()
    String bigline = "28613099,Product::Source,\"---\n" +
            "product_attributes:\n" +
            "-\n" +
            "- :name: Ornaments\n" +
            "  :brand_id: 49120\n" +
            "  :size: each\n" +
            "  :alcoholic: false\n" +
            "  :details: \"\"[Fair Trade Certified]\"\"\n" +
            "  :gluten_free: false\n" +
            "  :kosher: false\n" +
            "  :low_fat: false\n" +
            "  :organic: false\n" +
            "  :sugar_free: false\n" +
            "  :fat_free: false\n" +
            "  :vegan: false\n" +
            "  :vegetarian: false\n" +
            "\",,2015-11-01 00:06:19.796944"

    String suspectString = "---\n" +
            "product_attributes:\n" +
            "-\n" +
            "- :name: Ornaments\n" +
            "  :brand_id: 49120\n" +
            "  :size: each\n" +
            "  :alcoholic: false\n" +
            "  :details: \"[Fair Trade Certified]\"\n" +
            "  :gluten_free: false\n" +
            "  :kosher: false\n" +
            "  :low_fat: false\n" +
            "  :organic: false\n" +
            "  :sugar_free: false\n" +
            "  :fat_free: false\n" +
            "  :vegan: false\n" +
            "  :vegetarian: false\n"

    when:
    String[] values = parser.parseLine(bigline)

    then:
    values.length == 5
    values[0] == "28613099"
    values[1] == "Product::Source"
    values[2] == suspectString
}
2
votes

The solution was to use a RFC4180 compatible CSV parser, as suggested by Paul. I had used CSVReader from OpenCSV which didn't work or maybe I couldn't get it to work properly.

I used FastCSV, a RFC4180 CSV parser, and it worked seamlessly.

File file = new File(csvFilePath);
CsvReader csvReader = new CsvReader();
CsvContainer csv = csvReader.read(file, StandardCharsets.UTF_8);
for (CsvRow row : csv.getRows()) {
    System.out.println(row.getFieldCount());  
}