2
votes

I am using regex to parse data from an OCR'd document and I am struggling to match the scenarios where a 1000s comma separator has been misread as a dot, and also where the dot has been misread as a comma!

For example, the true value 1234567.89 is printed as 1,234,567.89 but may be misread as:

1.234,567.89

1,234.567.89

1,234,567,89

etc

I could probably sort this in C# but I'm sure that a regex could do it. Any regex-wizards out there that can help?

UPDATE:

I realise this is a pretty dumb question, as the regex to catch all of these is pretty straightforward; the real issue is how I choose to interpret the match, which will be done in C#. Thanks - sorry to waste your time on this!

I will accept Dmitry's answer as it is close to what I was looking for. Thank you.

3
What do you want to capture, the wrong ones or the right ones? - Alfie Goodacre
Sorry, of course. I actually want to capture all of them, including ones that have a missing 1000s comma separator. So I've probably answered this myself and it's not really a regex issue at all. Doh. - Mark Chidlow
I think the problem is infeasible if you don't have an idea of what the number should be. How can you decide whether it is correct to interpret it as "," or "."? - Felice Pollano
If you know a number always has 2 decimal places, why not just remove all commas and dots, and then divide the number by 100? - Jonesopolis
@Jonesopolis - I have a mixture of decimal places. So it's more that the last point or comma should be a point and I ignore the previous commas/points! - Mark Chidlow

3 Answers

3
votes

Please note that there is an ambiguity, since:

  123,456 // thousand separator 
  123.456 // decimal separator

are both possible readings (123456 and 123.456). However, some cases can be detected as errors:

  1. Too many decimal separators 123.456.789
  2. Wrong order 123.456,789
  3. Wrong digits count 123,45

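So we can set up a rule: a separator is treated as the decimal separator only if it is the last one and is not followed by exactly three digits. If the last separator is followed by exactly three digits (the ambiguous case above), we trust the character as read: "," is a thousand separator and "." is a decimal separator. All the other separators are treated as thousand separators: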

  1?234?567?89
   ^   ^   ^
   |   |   the last one, followed by two digits (not three), thus decimal 
   |   not the last one, thus thousand  
   not the last one, thus thousand

Now let's implement a routine

  // requires: using System; using System.Linq;
  private static String ClearUp(String value) {
    String[] chunks = value.Split(',', '.');

    // No separators: nothing to clean up
    if (chunks.Length <= 1)
      return value;

    String last = chunks[chunks.Length - 1];
    String head = String.Concat(chunks.Take(chunks.Length - 1));

    // Last chunk is not exactly three characters long:
    // definitely a decimal separator (e.g. "123,45")
    if (last.Length != 3)
      return head + "." + last;

    // Exactly three characters after the last separator: may be decimal or thousand,
    // so trust the character itself (value[value.Length - 4] is the last separator here)
    if (value[value.Length - 4] == ',')
      return String.Concat(chunks);   // ',' -> thousand separator
    else
      return head + "." + last;       // '.' -> decimal separator
  }

Now let's try some tests:

   String[] data = new String[] {
     // your tests
     "1.234,567.89",
     "1,234.567.89",
     "1,234,567,89",

     // my tests
     "123,456", // "," is treated as a thousand separator 
     "123.456", // "." is treated as a decimal separator 
   };

   String report = String.Join(Environment.NewLine, data
    .Select(item => String.Format("{0} -> {1}", item, ClearUp(item))));

   Console.Write(report);

The outcome is:

   1.234,567.89 -> 1234567.89
   1,234.567.89 -> 1234567.89
   1,234,567,89 -> 1234567.89
   123,456 -> 123456
   123.456 -> 123.456
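
Once a value has been normalized to digits with a single "." as above, it still has to be turned into a number. A minimal sketch of that step follows; the class name `NormalizedNumberParser` is hypothetical and not part of the answer. It assumes the input is already in the cleaned-up form, and it parses with the invariant culture so that "." is read as the decimal point regardless of the machine's regional settings.

```csharp
using System;
using System.Globalization;

public static class NormalizedNumberParser
{
    // Parses a string such as "1234567.89" (digits with at most one '.').
    // Assumption: the input has already been cleaned up, so no thousand
    // separators, signs or whitespace are expected here.
    public static decimal Parse(string normalized)
    {
        return decimal.Parse(
            normalized,
            NumberStyles.AllowDecimalPoint,
            CultureInfo.InvariantCulture);
    }
}
```

Using `decimal` rather than `double` keeps values such as 1234567.89 exact, which is usually what you want for figures read from a document.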

1
votes

Try this Regex:

\b[\.,\d][^\s]*\b

\b = word boundary; [\.,\d] = a single dot, comma or digit; [^\s]* = any run of non-whitespace characters; \b = closing word boundary.
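
As a rough illustration of how a pattern like this might be used from C# to pull candidate numbers out of OCR text, here is a sketch. It does not use the pattern above; it swaps in a stricter, hypothetical one (`\d+(?:[.,]\d+)*`) that only accepts digits with single dots or commas between them. The class and method names are likewise made up for the example.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class NumberTokenFinder
{
    // Hypothetical pattern (not the one from the answer): one or more digits,
    // followed by any number of groups made of a single '.' or ',' plus digits.
    private static readonly Regex Candidate = new Regex(@"\d+(?:[.,]\d+)*");

    // Returns every substring of text that looks like a (possibly misread) number.
    public static string[] Find(string text)
    {
        return Candidate.Matches(text)
                        .Cast<Match>()
                        .Select(m => m.Value)
                        .ToArray();
    }
}
```

Because each separator must be followed by a digit, sentence punctuation after a number (a trailing "." or ",") is not included in the match. Deciding which separator is the decimal one is still a separate step.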

1
votes

Responding to update/comments: you do not need regex to do this. Instead, if you can isolate the number string from the surrounding spaces, you can pull it into a string-array using Split(',','.'). Based on the logic you outlined above, you could then use the last element of the array as the fractional part, and concatenate the first elements together for the whole part. (Actual code left as an exercise...) This will even work if the ambiguous-dot-or-comma is the last character in the string: the last element in the split-array will be empty.

Caveat: This will only work if there is always a decimal point--otherwise, you would not be able to differentiate logically between a thousands-place comma and a decimal with thousandths.
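
For what it's worth, the split-and-rejoin idea described above could be sketched roughly as follows. The names `SplitNormalizer` and `Normalize` are hypothetical, and the sketch relies on the assumption in the caveat: whenever the string contains a separator, the last one is the decimal point.

```csharp
using System;
using System.Linq;

public static class SplitNormalizer
{
    // Assumes the last separator is always the decimal point (see the caveat):
    // everything before it is the whole part, everything after it the fraction.
    public static string Normalize(string value)
    {
        string[] parts = value.Split(',', '.');

        // No separator at all: nothing to fix.
        if (parts.Length == 1)
            return value;

        string whole = string.Concat(parts.Take(parts.Length - 1));
        string fraction = parts[parts.Length - 1];

        // A trailing separator leaves an empty fraction; return the whole part only.
        return fraction.Length == 0 ? whole : whole + "." + fraction;
    }
}
```

With this sketch, "1.234,567.89", "1,234.567.89" and "1,234,567,89" all normalize to "1234567.89". A value such as "123,456" becomes "123.456", which is exactly the case the caveat warns about.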