2
votes

I am trying to parse a comma separated string using:

val array = input.split(",")

Then I noticed that some input lines have commas inside quoted fields:

data0, "data1", data2, data3, "data4-1, data4-2, data4-3", data5

Note that the data is not very clean: some fields are inside quotation marks while others are not.


How do I split such a line into:

array(0) = data0
array(1) = data1
array(2) = data2
array(3) = data3
array(4) = data4-1, data4-2, data4-3
array(5) = data5
4
Parsing CSV files can be notoriously tricky because of how the format treats quotes, and commas and quotes embedded in quoted values. I recommend pulling in a library that is well regarded for dealing robustly with all the edge cases. Options you could consider include scala-csv and traversable-csv, or a Java library like opencsv. - Shadowlands
Otherwise, if you don't want to or can't use a library, you could look at this SO answer or this SO answer to see how others have tackled roll-your-own CSV parsers. - Shadowlands
@Shadowlands Could you please summarize your comments in an answer? I think you have shown many valuable approaches that others can benefit from. Thanks. - Martin Senne
@MartinSenne Sure, happy to make it an answer (although I don't have much further to add). - Shadowlands

4 Answers

6
votes

As per my comments:

Parsing CSV files can be notoriously tricky because of how the format treats quotes, and commas and quotes embedded in quoted values. I recommend pulling in a library that is well regarded for dealing robustly with all the edge cases.

Options you could consider include scala-csv and traversable-csv, or a Java library like opencsv.

Otherwise, if you don't want to or can't use a library, you could look at this SO answer or this SO answer to see how others have tackled roll-your-own CSV parsers.
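
For a sense of what a roll-your-own parser involves, here is a minimal sketch in plain Scala. It is not taken from any of the libraries or answers mentioned above. The names `CsvLine` and `split` are hypothetical, and the sketch assumes a single line with no embedded newlines, double quotes as the only quote character, a doubled quote (`""`) as the escape for a literal quote, and that whitespace around every field (including whitespace just inside the quotes) can be trimmed.

```scala
// Hypothetical helper for illustration only; a real CSV library covers far more edge cases.
object CsvLine {
  /** Splits one line on commas that are outside double quotes. */
  def split(line: String): Vector[String] = {
    val fields = Vector.newBuilder[String]
    val current = new StringBuilder
    var inQuotes = false
    var i = 0
    while (i < line.length) {
      val c = line.charAt(i)
      if (inQuotes) {
        if (c == '"') {
          if (i + 1 < line.length && line.charAt(i + 1) == '"') {
            current.append('"') // doubled quote inside a quoted field: keep one literal quote
            i += 1
          } else {
            inQuotes = false // closing quote
          }
        } else {
          current.append(c) // commas are ordinary characters while inside quotes
        }
      } else {
        c match {
          case '"' => inQuotes = true // opening quote, not part of the value
          case ',' =>
            fields += current.toString.trim // assumption: surrounding whitespace is noise
            current.clear()
          case _ => current.append(c)
        }
      }
      i += 1
    }
    fields += current.toString.trim // last field has no trailing comma
    fields.result()
  }
}
```

Applied to the line from the question, `CsvLine.split` should return six fields, with `data4-1, data4-2, data4-3` kept together as one value. Multi-line quoted fields, other separators, and malformed quoting are left out on purpose, which is a large part of why a library is the safer choice.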

0
votes

I would recommend using a CSV library to parse CSV data - the format is a mess and painful to get right.

I would suggest kantan.csv, mainly because I'm the author but also because it lets you go a bit further than turning a CSV stream into a list of arrays of strings. Take, for example, the following input:

1,Foo,2.0
2,Bar,false

Using kantan.csv, you can write:

import java.io.File
import kantan.csv.ops._

// ',' is the column separator; false means the file has no header row.
new File("path/to/csv").asUnsafeCsvRows[(Int, String, Either[Float, Boolean])](',', false)

Calling toList on the result will yield:

List((1,Foo,Left(2.0)), (2,Bar,Right(false)))

Note how the last column is either a float or a boolean, and this is captured in the type of each element of the iterator.

0
votes

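Below is my solution for parsing a CSV row. It assumes ';' as the separator and does not handle separators that appear inside quoted fields:

// Split on the separator, then strip the surrounding quotes from each field.
String[] res = row.split(";");
for (int i = 0; i < res.length; i++) {
    res[i] = deQuotes(res[i]);
}
return res;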

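The surrounding quotes are removed with a regular expression:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Matches a field that is wrapped in double quotes and captures its content.
static final Pattern PATTERN_DE_QUOTES = Pattern.compile("^\"(.*)\"$");

static String deQuotes(String s) {
    Matcher matcher = PATTERN_DE_QUOTES.matcher(s);
    if (matcher.find()) {
        // Inside a quoted field, a doubled quote stands for one literal quote.
        return matcher.group(1).replaceAll("\"\"", "\"");
    }
    return s;
}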

I hope it will help you.

-1
votes

You can actually split that line with a regular expression.

val s = """data0, "data1", data2, data3, "data4-1, data4-2, data4-3", data5"""

"""((".*?")|('.*?')|[^"',]+)+""".r.findAllIn(s).foreach(println)

By the way, any library that can parse CSV files can also parse a single CSV line: just wrap the string in a StringReader.
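
One caveat with the regex in this answer: each match keeps its leading whitespace and its surrounding quotes, so the sample line yields ` "data1"` rather than `data1`. A small sketch of a clean-up step, using only the standard library, could look like the following. The names `RegexFields` and `fields` are hypothetical, and the sketch assumes that a field wrapped in one pair of matching quotes should lose exactly that pair.

```scala
object RegexFields {
  // Same pattern as in the answer above.
  private val Field = """((".*?")|('.*?')|[^"',]+)+""".r

  // Hypothetical helper: trims each match and drops one pair of surrounding quotes.
  def fields(line: String): List[String] =
    Field.findAllIn(line).map { raw =>
      val t = raw.trim
      val quoted = t.length >= 2 &&
        ((t.head == '"' && t.last == '"') || (t.head == '\'' && t.last == '\''))
      if (quoted) t.substring(1, t.length - 1) else t
    }.toList
}
```

Note that this pattern skips completely empty fields (as in `a,,b`), because it needs at least one character to produce a match, so it is only suitable when empty columns cannot occur.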