0
votes

I have inherited some code that uses regular expressions to parse CSV-formatted data. Until now it didn't need to cope with empty string fields, but the requirements have changed and empty fields are now possible.

I have changed the regular expression from this:

new Regex("((?<field>[^\",\\r\\n]+)|\"(?<field>([^\"]|\"\")+)\")(,|(?<rowbreak>\\r\\n|\\n|$))");

to this

new Regex("((?<field>[^\",\\r\\n]*)|\"(?<field>([^\"]|\"\")*)\")(,|(?<rowbreak>\\r\\n|\\n|$))");

(i.e. I have changed the + to *)

The problem is that I now get an extra empty field at the end, e.g. "ID,Name,Description" returns four fields: "ID", "Name", "Description" and "".

Can anyone spot why?

I'm not familiar with C#, but isn't there a package/module/class which is able to parse CSV? - Felix Kling
Not out of the box @Felix, but there are approximately 1.72 billion implementations out there. - Jamiec
What did you want to handle: Id,,Name? - xanatos
Id,,Name is one example, yes. I would also want to handle ,Name and Id, - Simon Williams

3 Answers

2
votes

This one:

var rx = new Regex("((?<=^|,)(?<field>)(?=,|$)|(?<field>[^\",\\r\\n]+)|\"(?<field>([^\"]|\"\")*)\")(,|(?<rowbreak>\\r\\n|\\n|$))");

I moved the handling of blank fields into a separate, third alternative. The quoted empty field "" is already handled by the second (?<field>) block of your code (the quoted one), so the cases left to handle are these four:

,
,Id
Id,
Id,,Name

And this one should do it:

(?<=^|,)(?<field>)(?=,|$)

An empty field must be preceded by the beginning of the row (^) or by a comma, must be of length zero (nothing is captured by (?<field>)), and must be followed by a comma or by the end of the line ($).
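
To check the four cases, here is a minimal sketch that runs the pattern above over each of them. The class name CsvFieldDemo and the helper SplitRow are made up for this illustration. It assumes one row per call, because without RegexOptions.Multiline the ^ and $ anchors refer to the start and end of the whole input string.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class CsvFieldDemo
{
    // The pattern from this answer, unchanged.
    private static readonly Regex Rx = new Regex(
        "((?<=^|,)(?<field>)(?=,|$)|(?<field>[^\",\\r\\n]+)|\"(?<field>([^\"]|\"\")*)\")(,|(?<rowbreak>\\r\\n|\\n|$))");

    // Hypothetical helper: returns the "field" capture of every match in a single row.
    public static string[] SplitRow(string row)
    {
        return Rx.Matches(row)
                 .Cast<Match>()
                 .Select(m => m.Groups["field"].Value)
                 .ToArray();
    }

    public static void Main()
    {
        foreach (var row in new[] { ",", ",Id", "Id,", "Id,,Name", "ID,Name,Description" })
        {
            var fields = SplitRow(row);
            // Fields are joined with '|' so that empty ones stay visible.
            Console.WriteLine("{0} -> {1} field(s): [{2}]", row, fields.Length, string.Join("|", fields));
        }
    }
}
```

With this pattern the first three rows should each give two fields and the last two rows three fields, with no trailing extra field for "ID,Name,Description".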

1
votes

I would suggest using the FileHelpers library. It is easy to use, does its job, and will make your code much easier to maintain.

1
votes

The problem with your regex is that it can now match the empty string. $ works a little like a lookahead: it guarantees that the match is at the end of the string, but it is not part of the match and consumes no characters.

So when you have "ID,Name,Description", your first match is

"ID," and the rest is "Name,Description"

Then the next match is

"Name," and the rest is "Description"

The next match:

"Description" and the rest is ""

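So the regex is tried once more at the end of the string. There the field part matches zero characters and $ still succeeds, so the final match is the empty string, which is your extra field.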
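
The sequence above can be reproduced by printing the position and length of each match. This is only a sketch: the class name MatchTraceDemo and the helper Trace are made up for the illustration, and the pattern is the modified one from the question.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class MatchTraceDemo
{
    // The modified pattern from the question (with * instead of +).
    private static readonly Regex Rx = new Regex(
        "((?<field>[^\",\\r\\n]*)|\"(?<field>([^\"]|\"\")*)\")(,|(?<rowbreak>\\r\\n|\\n|$))");

    // Hypothetical helper: one "index:length:field" entry per match.
    public static List<string> Trace(string input)
    {
        var lines = new List<string>();
        foreach (Match m in Rx.Matches(input))
        {
            lines.Add(m.Index + ":" + m.Length + ":" + m.Groups["field"].Value);
        }
        return lines;
    }

    public static void Main()
    {
        foreach (var line in Trace("ID,Name,Description"))
        {
            Console.WriteLine(line);
        }
    }
}
```

The last entry should be a zero-length match at index 19, which is the end of the input. That is the match that yields the extra empty field.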