How can I perform a culture-sensitive "starts-with" operation from the middle of a string?

Question

I have a requirement which is relatively obscure, but it feels like it should be possible using the BCL.

For context, I'm parsing a date/time string in Noda Time. I maintain a logical cursor for my position within the input string. So while the complete string may be "3 January 2013" the logical cursor may be at the 'J'.

Now, I need to parse the month name, comparing it against all the known month names for the culture:

Culture-sensitively
Case-insensitively
Just from the point of the cursor (not later; I want to see if the cursor is "looking at" the candidate month name)
Quickly
... and I need to know afterwards how many characters were used

The current code to do this generally works, using CompareInfo.Compare. It's effectively like this (just for the matching part - there's more code in the real thing, but it's not relevant to the match):

internal bool MatchCaseInsensitive(string candidate, CompareInfo compareInfo)
{
    return compareInfo.Compare(text, position, candidate.Length,
                               candidate, 0, candidate.Length, 
                               CompareOptions.IgnoreCase) == 0;
}

However, that relies on the candidate and the region we compare being the same length. Fine most of the time, but not fine in some special cases. Suppose we have something like:

// U+00E9 is a single code point for e-acute
var text = "x b\u00e9d y";
int position = 2;
// e followed by U+0301 still means e-acute, but from two code points
var candidate = "be\u0301d";

Now my comparison will fail. I could use IsPrefix:

if (compareInfo.IsPrefix(text.Substring(position), candidate,
                         CompareOptions.IgnoreCase))

but:

That requires me to create a substring, which I'd really rather avoid. (I'm viewing Noda Time as effectively a system library; parsing performance may well be important to some clients.)
It doesn't tell me how far to advance the cursor afterwards
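
To make the two approaches concrete, here is a minimal sketch that puts them side by side. The class and method names (PrefixDemo, MatchWithCompare, MatchWithIsPrefix) are hypothetical and only for illustration; the length guard in the first method is an addition so the sketch doesn't throw on short input.

```csharp
using System;
using System.Globalization;

public static class PrefixDemo
{
    // Same-length comparison, as in MatchCaseInsensitive above: it looks at exactly
    // candidate.Length chars of text, so equivalent text of a different length cannot match.
    public static bool MatchWithCompare(string text, int position,
                                        string candidate, CompareInfo compareInfo)
    {
        // Guard added for this sketch: Compare throws if the range runs past the end.
        if (text.Length - position < candidate.Length)
        {
            return false;
        }
        return compareInfo.Compare(text, position, candidate.Length,
                                   candidate, 0, candidate.Length,
                                   CompareOptions.IgnoreCase) == 0;
    }

    // Prefix test on the remainder of the text: allocates a substring and
    // does not report how many chars of text were consumed.
    public static bool MatchWithIsPrefix(string text, int position,
                                         string candidate, CompareInfo compareInfo)
    {
        return compareInfo.IsPrefix(text.Substring(position), candidate,
                                    CompareOptions.IgnoreCase);
    }
}
```

Calling both with the "x b\u00e9d y" text at position 2 and the "be\u0301d" candidate is where they should disagree, although the exact behaviour depends on the culture data available to the runtime.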

In reality, I strongly suspect this won't come up very often... but I'd really like to do the right thing here. I'd also really like to be able to do it without becoming a Unicode expert or implementing it myself :)

(Raised as bug 210 in Noda Time, in case anyone wants to follow any eventual conclusion.)

I like the idea of normalization. I need to check that in detail for a) correctness and b) performance. Assuming I can make it work correctly, I'm still not sure whether it would be worth changing overall - it's the sort of thing which will probably never actually come up in real life, but could hurt the performance of all my users :(

I've also checked the BCL - which doesn't appear to handle this properly either. Sample code:

using System;
using System.Globalization;

class Test
{
    static void Main()
    {
        var culture = (CultureInfo) CultureInfo.InvariantCulture.Clone();
        var months = culture.DateTimeFormat.AbbreviatedMonthNames;
        months[10] = "be\u0301d";
        culture.DateTimeFormat.AbbreviatedMonthNames = months;

        var text = "25 b\u00e9d 2013";
        var pattern = "dd MMM yyyy";
        DateTime result;
        if (DateTime.TryParseExact(text, pattern, culture,
                                   DateTimeStyles.None, out result))
        {
            Console.WriteLine("Parsed! Result={0}", result);
        }
        else
        {
            Console.WriteLine("Didn't parse");
        }
    }
}

Changing the custom month name to just "bed" with a text value of "bEd" parses fine.

Okay, a few more data points:

The cost of using Substring and IsPrefix is significant but not horrible. On a sample of "Friday April 12 2013 20:28:42" on my development laptop, it changes the number of parse operations I can execute in a second from about 460K to about 400K. I'd rather avoid that slowdown if possible, but it's not too bad.
Normalization is less feasible than I thought, because it's not available in Portable Class Libraries. I could potentially use it just for non-PCL builds, allowing the PCL builds to be a little less correct. Testing for normalization (string.IsNormalized) takes performance down to about 445K calls per second, which I can live with. I'm still not sure it does everything I need it to: for example, a month name containing "ß" should match "ss" in many cultures, I believe, and normalizing doesn't do that.

While I understand your desire to avoid the performance hit of creating a substring, it might be best to do so, but earlier in the game by shifting everything to a chosen unicode normalization form FIRST and then knowing you can walk "point-by-point". Probably D-form. — IDisposable
@IDisposable: Yes, I did wonder about that. Obviously I can normalize the month names themselves beforehand. At least I can do the normalization just once. I wonder if the normalization procedure checks whether anything needs to be done first. I don't have much experience in normalization - definitely one avenue to look into. — Jon Skeet
If your text isn't too long, you could do if (compareInfo.IndexOf(text, candidate, position, options) == position). msdn.microsoft.com/en-us/library/ms143031.aspx But if text is very long that's going to waste a lot of time searching beyond where it needs to. — Jim Mischel
Just bypass using the String class at all in this instance and use a Char[] directly. You'll end up writing more code, but that's what happens when you want high performance... or maybe you should be programming in C++/CLI ;-) — intrepidis
Will CompareOptions.IgnoreNonSpace not take care of this automagically for you? It looks to me (from the docco, not in a position to test from this iPad sorry!) as though this might be a (the?) use-case for that option. "Indicates that the string comparison must ignore nonspacing combining characters, such as diacritics." — Sepster

Esailija · Accepted Answer · 2013-04-14T16:22:54

I'll consider the problem of many<->one/many casemappings first and separately from handling different Normalization forms.

For example:

x heiße y
  ^--- cursor

This matches heisse, but advancing by the candidate's length then moves the cursor one character too far. And:

x heisse y
  ^--- cursor

This matches heiße, but advancing by the candidate's length then leaves the cursor one character short.

This will apply to any character that doesn't have a simple one-to-one mapping.

You would need to know the length of the substring that was actually matched, but Compare, IndexOf, etc. throw that information away. It could be possible with regular expressions, but the implementation doesn't do full case folding and so doesn't match ß to ss/SS in case-insensitive mode, even though .Compare and .IndexOf do. And it would probably be costly to create new regexes for every candidate anyway.
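
A note for readers on newer runtimes: .NET 5 and later add a span-based CompareInfo.IsPrefix overload that reports the matched length through an out parameter. That postdates this answer, so treat the following as a sketch under that assumption. CursorMatcher and TryMatch are hypothetical names.

```csharp
using System;
using System.Globalization;

public static class CursorMatcher
{
    // Hypothetical helper: true when candidate is a culture-sensitive,
    // case-insensitive prefix of text starting at position. matchLength receives
    // the number of chars of text that the match consumed.
    public static bool TryMatch(string text, int position, string candidate,
                                CompareInfo compareInfo, out int matchLength)
    {
        // AsSpan is a view over the existing string; no substring is allocated.
        ReadOnlySpan<char> remaining = text.AsSpan(position);
        return compareInfo.IsPrefix(remaining, candidate.AsSpan(),
                                    CompareOptions.IgnoreCase, out matchLength);
    }
}
```

On a match, the cursor would advance by matchLength rather than by candidate.Length, which is the distinction the heiße/heisse examples turn on.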

The simplest solution to this is to just internally store strings in case folded form and do binary comparisons with case folded candidates. Then you can move the cursor correctly with just .Length since the cursor is for internal representation. You also get most of the lost performance back from not having to use CompareOptions.IgnoreCase.
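
As a rough sketch of that idea, assuming both the text and the candidates have already been folded by the same function and are in the same normalization form: the Fold method below is only a placeholder (ToUpperInvariant is not full case folding and leaves ß alone), and FoldedMatcher and TryMatchFolded are hypothetical names.

```csharp
using System;

public static class FoldedMatcher
{
    // Placeholder only: NOT full case folding. A real implementation would
    // come from a case-folding library.
    public static string Fold(string s)
    {
        return s.ToUpperInvariant();
    }

    // Ordinal (binary) prefix test at position. Both arguments must already be folded.
    // On success, charsUsed is how far to advance the cursor in the folded text.
    public static bool TryMatchFolded(string foldedText, int position,
                                      string foldedCandidate, out int charsUsed)
    {
        charsUsed = 0;
        if (foldedText.Length - position < foldedCandidate.Length)
        {
            return false; // not enough text left to hold the candidate
        }
        if (string.CompareOrdinal(foldedText, position,
                                  foldedCandidate, 0, foldedCandidate.Length) != 0)
        {
            return false;
        }
        charsUsed = foldedCandidate.Length;
        return true;
    }
}
```

Because the cursor indexes the folded text rather than the original input, advancing by foldedCandidate.Length is safe here.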

Unfortunately there is no case fold function built-in and the poor man's case folding doesn't work either because there is no full case mapping - the ToUpper method doesn't turn ß into SS.

For example, this works in Java (and even in JavaScript), given a string that is in Normal Form C:

//Poor man's case folding.
//There are some edge cases where this doesn't work
public static String toCaseFold( String input, Locale cultureInfo ) {
    return input.toUpperCase(cultureInfo).toLowerCase(cultureInfo);
}

Fun to note that Java's ignore case comparison doesn't do full case folding like C#'s CompareOptions.IgnoreCase. So they are opposite in this regard: Java does full casemapping, but simple case folding - C# does simple casemapping, but full case folding.

So it's likely that you need a 3rd party library to case fold your strings before using them.

Before doing anything you have to be sure that your strings are in normal form C. You can use this preliminary quick check optimized for Latin script:

public static bool MaybeRequiresNormalizationToFormC(string input)
{
    if( input == null ) throw new ArgumentNullException("input");

    int len = input.Length;
    for (int i = 0; i < len; ++i)
    {
        if (input[i] > 0x2FF)
        {
            return true;
        }
    }

    return false;
}

This gives false positives but not false negatives. I don't expect it to slow down 460k parses/s at all when using Latin script characters, even though it needs to be performed on every string. With a false positive you would use IsNormalized to get a true negative/positive, and only after that normalize if necessary.
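
Put together, the three steps might look like the sketch below. NormalizationHelper and EnsureFormC are hypothetical names, and the sketch assumes string.IsNormalized and string.Normalize are available on the target platform (the question notes they are not in Portable Class Libraries).

```csharp
using System;
using System.Text;

public static class NormalizationHelper
{
    // Quick check from above: chars at or below U+02FF never need changes for Form C.
    public static bool MaybeRequiresNormalizationToFormC(string input)
    {
        if (input == null) throw new ArgumentNullException("input");

        for (int i = 0; i < input.Length; ++i)
        {
            if (input[i] > 0x2FF)
            {
                return true;
            }
        }
        return false;
    }

    // Returns input unchanged when it is already in Form C; otherwise a normalized copy.
    public static string EnsureFormC(string input)
    {
        // Step 1: cheap scan, no false negatives.
        if (!MaybeRequiresNormalizationToFormC(input))
        {
            return input;
        }
        // Step 2: possible false positive, so ask the framework for a definite answer.
        if (input.IsNormalized(NormalizationForm.FormC))
        {
            return input;
        }
        // Step 3: only now pay for the normalization itself.
        return input.Normalize(NormalizationForm.FormC);
    }
}
```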
