0
votes

How do I remove accents from a UTF-8 encoded string? There are hundreds of answers that either use some library function or use conversion tables.

I am looking for the actual algorithm (the idea behind it and why it works), not a ready to use implementation.

My goal is to count individual characters in a UTF-8-encoded string (so that, for example, utf8_strlen("Vypočítávání") = 12). I would like to count the length of any string, including Chinese or Klingon.

I already know how to count multibyte characters: if the current byte's MSB is 1, then I know that some more bytes will be present. Looking at the leading byte's prefix, I can tell that:

  • 110xxxxx means one more byte will follow,
  • 1110xxxx two more,
  • 11110xxx three.

(We can assume that the string is encoded correctly, i.e. the sequence is a valid UTF-8 stream. That means that those bytes will actually follow.)

I read one byte and I know how many more follow to make up a single Unicode codepoint, so I can skip those (again, the stream is valid) and increment the running count accordingly.
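The counting scheme described above can be sketched as follows (a minimal example in Python, assuming a valid UTF-8 stream as stated; the function name is made up for illustration):

```python
def utf8_codepoint_count(data: bytes) -> int:
    """Count Unicode codepoints in a valid UTF-8 byte stream."""
    count = 0
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:      # 0xxxxxxx: single-byte sequence (ASCII)
            i += 1
        elif b < 0xE0:    # 110xxxxx: one continuation byte follows
            i += 2
        elif b < 0xF0:    # 1110xxxx: two continuation bytes follow
            i += 3
        else:             # 11110xxx: three continuation bytes follow
            i += 4
        count += 1        # one codepoint per lead byte
    return count

utf8_codepoint_count("Vypočítávání".encode("utf-8"))  # 12 codepoints
```

Note that on malformed input (e.g. landing on a stray continuation byte) this would miscount; the validity assumption is doing real work here.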

How would I do the same for combining characters? That is, is there a straightforward way to tell whether a codepoint is an accent (such as the háček in č, the cedilla in ç, or any strange curve in Chinese)? If there is, then I am looking forward to skipping those, too.

Thanks a lot!

2
Yes, there is a straightforward way to do that. Unfortunately, that straightforward way isn't an algorithm, but a table lookup in the huge data tables available as part of the Unicode standard. They specify all sorts of properties for every code point, including the one you're after. - jalf
It’s unclear what you are asking. The title does not match the content of the question. The question seems to imply that accents are represented using combining characters, but in most cases, they are not. And processing bytes is really irrelevant here; it is at a completely different level, conceptually and in programming. And you have not defined what you mean by the length of a string. - Jukka K. Korpela

2 Answers

2
votes

You have to actually decode the UTF-8 sequences into Unicode codepoints (i.e. convert UTF-8 to UTF-32), then you can manipulate the codepoints as needed, then re-encode the remaining codepoints back to UTF-8 if needed.

Since you already know how to parse each UTF-8 octet to detect each sequence's byte count, simply take each complete 1-4 byte sequence, parse the remaining bits into a 32-bit value, and look up that value in the Unicode character database to find out whether it is an accent, diacritic, or other combining character, then act accordingly. You should also normalize the decoded codepoint values to make the combining characters easier to detect or skip.
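The "look it up in the Unicode tables" step does not have to be hand-rolled: most runtimes ship the Unicode character database. A minimal sketch in Python, where `unicodedata.combining()` exposes the canonical combining class (non-zero means combining mark); the function name is illustrative:

```python
import unicodedata

def count_base_characters(s: str) -> int:
    """Count codepoints, skipping combining marks."""
    # Normalize to NFD so precomposed characters like 'č' decompose
    # into base letter + combining mark; then count only codepoints
    # whose canonical combining class is 0 (i.e. base characters).
    decomposed = unicodedata.normalize("NFD", s)
    return sum(1 for cp in decomposed if unicodedata.combining(cp) == 0)

count_base_characters("Vypočítávání")  # 12, whether input is NFC or NFD
```

Thanks to the normalization step, "č" as a single precomposed codepoint and "c" + U+030C (combining caron) both count as one character. An equivalent property check is whether the codepoint's general category is Mn/Mc/Me.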

2
votes

To do this right, you'll have to read TR29 (UAX #29, Unicode Text Segmentation), segment the string into "grapheme clusters", then count the number of clusters.