
I'm trying to use the Univocity format auto-detection for parsing this CSV table:

HEADER1, HEADER2, HEADER3
11, 12, 13
21, 22, 23
31, 32, 33

As you can see, the table contains the same number of commas ',' and spaces ' '. The problem is that the delimiter-detection heuristic prefers the ' ' over the ',' character.

So in this case the detected separator is the space ' ', and the cell values come out wrong because the comma is kept as part of the value (for example, "11," instead of "11").

I saw that setDelimiterDetectionEnabled accepts the candidate delimiters in order of priority, but I couldn't make it work.

I call it like this: setDelimiterDetectionEnabled(true, ',', ' '), but it still chooses the space as the delimiter.

If I remove 1 space in the CSV table (so there would be more commas than spaces) the comma is chosen as delimiter.
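To make the tie concrete, here is a minimal sketch of a naive count-based guess with priority-ordered tie-breaking. It is only an illustration of the behaviour I expected from the priority list, not Univocity's actual detection algorithm; the object and method names (DelimiterTieSketch, countCandidates, pickDelimiter) are made up for this example.

```scala
// Illustrative only: a naive count-based delimiter guess.
// This is NOT Univocity's algorithm; all names here are hypothetical.
object DelimiterTieSketch {
  // Count how often each candidate delimiter occurs in the sample text.
  def countCandidates(sample: String, candidates: Seq[Char]): Map[Char, Int] =
    candidates.map(c => c -> sample.count(_ == c)).toMap

  // Pick the most frequent candidate; on a tie, the candidate listed
  // first in `priority` wins. Assumes `priority` is non-empty.
  def pickDelimiter(sample: String, priority: Seq[Char]): Char = {
    val counts = countCandidates(sample, priority)
    val best = counts.values.max
    priority.find(c => counts(c) == best).get
  }

  // The table from the question.
  val sample: String =
    """HEADER1, HEADER2, HEADER3
      |11, 12, 13
      |21, 22, 23
      |31, 32, 33""".stripMargin
}
```

With this table both characters occur 8 times, so pickDelimiter(sample, Seq(',', ' ')) falls back to the priority order and returns the comma. Removing a single space breaks the tie in favour of the comma regardless of the order.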

This is the code. It's in Scala, but I don't think that is relevant because the library is written in Java:

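import java.io.File
import com.univocity.parsers.csv.{CsvParser, CsvParserSettings}

val settings = new CsvParserSettings
// Enable detection and list the candidate delimiters in priority order: comma first, then space
settings.setDelimiterDetectionEnabled(true, ',', ' ')
val parser = new CsvParser(settings)
val spaceAndCommaTable = new File("/home/pr/SPACE_AND_COMMA.csv")
val parsed = parser.parseAll(spaceAndCommaTable, "UTF-8")
// The detected format is only available after parsing
val format = parser.getDetectedFormat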

I expected format.getDelimiter to return the comma ',', but the detected delimiter is the space ' '.


1 Answer


Author of the library here. I've just fixed this and will release the final version 2.8.3 tomorrow to include the adjustment needed for this to work. For testing, you can already use the latest 2.8.3-SNAPSHOT.
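As a stopgap on versions before 2.8.3, and only if the delimiter is known in advance, detection can be skipped by setting the delimiter explicitly. The following is a minimal sketch under that assumption, not a description of the fix itself; the object name ExplicitDelimiterSketch and the inline table are just for illustration.

```scala
import java.io.StringReader
import com.univocity.parsers.csv.{CsvParser, CsvParserSettings}

// Sketch: bypass auto-detection by fixing the delimiter up front.
// Assumes the delimiter is known to be a comma.
object ExplicitDelimiterSketch {
  // Same table as in the question, inlined so the example is self-contained.
  val table: String =
    "HEADER1, HEADER2, HEADER3\n11, 12, 13\n21, 22, 23\n31, 32, 33\n"

  def parse(): java.util.List[Array[String]] = {
    val settings = new CsvParserSettings
    settings.getFormat.setDelimiter(',')       // no detection, always split on commas
    settings.getFormat.setLineSeparator("\n")  // the inline table uses \n line endings
    settings.setIgnoreLeadingWhitespaces(true) // drop the space that follows each comma
    new CsvParser(settings).parseAll(new StringReader(table))
  }
}
```

Because leading whitespace is ignored, the space after each comma does not end up in the parsed values.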

Thank you