Iterating through Unicode codepoints character by character

Question

I've got a series of Unicode codepoints. What I really need to do is iterate through these codepoints as a series of characters, not a series of codepoints, and determine properties of each individual character, e.g. whether it is a letter.

For example, imagine that I was writing a Unicode-aware textbox, and the user entered a Unicode character that spans more than one codepoint, such as "e with diacritic" entered as a base letter followed by a combining mark. I know that this specific character can also be represented as a single codepoint, and can be normalized to that form, but I don't think that's possible in the general case. How could I implement backspace? It obviously can't just erase the last codepoint, because the user might have just entered a character made of more than one codepoint.

How can I iterate over a bunch of Unicode codepoints as characters?

Edit: The Break Iterators offered by ICU appear to be pretty much what I need. However, I'm not using ICU, so any references on how to implement my own equivalent functionality would be an accepted answer.

Another edit: It turns out that the Windows API does indeed offer this functionality. MSDN just isn't very good about putting all the string functions in one place. CharNext is the function I'm looking for.

How do you define "character" in this context? Something that translates to a single visual grapheme? — Nicol Bolas
Unless and until you define character in terms of code points, no answer is possible. Unicode defines only two things: code points and extended grapheme clusters. It does not define character. Please rephrase your question in terms of code points and/or extended grapheme clusters, or else define your terms with sufficient precision as to make possible a programmatic solution, which you have not yet bothered to do. — tchrist
@tchrist: Did you really have to go and post the same comment on every answer? I got it by reading it once. — Puppy
"However, I'm not using ICU" Really, you should. This is after all what it is for. In order to do what BreakIterator does, you will need to be able to query the properties of unicode points to know if one can break between them or not. And that requires basically downloading the Unicode specification and building a table of codepoint ranges for different properties. Or you can just use ICU, which does it for you. — Nicol Bolas
@tchrist: This is a bit late, but according to the Unicode Standard version 6.2, page 11: "Characters are the abstract representations of the smallest components of written language that have semantic value. They represent primarily, but not exclusively, the letters, punctuation, and other signs that constitute natural language text and technical notation." The document then goes on to provide a table illustrating the difference between a glyph and a character. If this is not a definition, then I don't know what is. — Talia

bmargulies · Accepted Answer · 2011-11-26T22:07:21

Use the ICU library.

http://site.icu-project.org/

for example:

http://icu-project.org/apiref/icu4c/classUnicodeString.html#ae3ffb6e15396dff152cb459ce4008f90

is the function that returns the character at a particular character offset in a string.
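
For anyone who cannot use ICU, here is a minimal sketch of the idea in Python, using only the standard `unicodedata` module. It is not what ICU's break iterators do: it assumes a "character" is one base codepoint plus any combining marks that follow it, which is only a rough approximation of the extended grapheme cluster rules in UAX #29 (it ignores ZWJ emoji sequences, Hangul jamo, regional indicators, CR LF, and more). The names `is_extender`, `clusters`, `backspace`, and `is_letter` are made up for this illustration.

```python
import unicodedata


def is_extender(cp: str) -> bool:
    """True if cp attaches to the preceding codepoint (simplified rule)."""
    # Mn = nonspacing mark, Me = enclosing mark, Mc = spacing combining mark.
    return unicodedata.category(cp) in ("Mn", "Me", "Mc")


def clusters(text: str):
    """Yield approximate clusters: a base codepoint plus its trailing marks."""
    current = ""
    for cp in text:
        # A non-mark codepoint starts a new cluster, so flush the old one.
        if current and not is_extender(cp):
            yield current
            current = ""
        current += cp
    if current:
        yield current


def backspace(text: str) -> str:
    """Remove the last approximate cluster rather than the last codepoint."""
    parts = list(clusters(text))
    return "".join(parts[:-1])


def is_letter(cluster: str) -> bool:
    """Classify a cluster by the general category of its base codepoint."""
    return unicodedata.category(cluster[0]).startswith("L")


if __name__ == "__main__":
    text = "ae\u0301"  # 'a', then 'e' followed by U+0301 COMBINING ACUTE ACCENT
    for cluster in clusters(text):
        print([hex(ord(cp)) for cp in cluster], is_letter(cluster))
    print(len(backspace(text)))
```

With this sketch, backspace on "a" followed by "e" plus a combining acute accent removes both the "e" and the accent in one step. A real implementation would need the grapheme break property tables from the Unicode Character Database, which is the work ICU already does for you.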
