How do I remove accents from a UTF-8 encoded string? There are hundreds of answers that either use some library function or use conversion tables.
I am looking for the actual algorithm (the idea behind it and why it works), not a ready-to-use implementation.
My goal is to count individual characters in a UTF-8-encoded string (so that, for example, utf8_strlen("Vypočítávání") = 12). I would like to count the length of any string, including Chinese or Klingon.
I already know how to count multibyte characters: if the current byte's MSB is 1, then I know that more bytes will follow. Looking at the next few bits of that byte, I can tell that:
110xxxxx means one more byte will follow, 1110xxxx two more, and 11110xxx three.
(We can assume that the string is encoded correctly, i.e. the sequence is a valid UTF-8 stream. That means that those bytes will actually follow.)
I read one byte and I know how many follow that designate a single Unicode codepoint, so I can skip those (again, the stream is valid) and increment the intermediate sum accordingly.
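To make the part I already have concrete, here is a minimal sketch in C of the codepoint counting described above. The function name utf8_count_codepoints is only illustrative, and the code assumes valid, NUL-terminated UTF-8 input, so it does no error checking:

```c
#include <stddef.h>

/* Counts Unicode codepoints (not bytes, and not user-perceived characters)
 * in a NUL-terminated string. Assumes the input is valid UTF-8, so the
 * continuation bytes announced by each lead byte are simply skipped. */
size_t utf8_count_codepoints(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t count = 0;

    while (*p != 0) {
        if (*p < 0x80) {
            p += 1;                      /* 0xxxxxxx: single-byte (ASCII) */
        } else if ((*p & 0xE0) == 0xC0) {
            p += 2;                      /* 110xxxxx: one more byte follows */
        } else if ((*p & 0xF0) == 0xE0) {
            p += 3;                      /* 1110xxxx: two more bytes follow */
        } else {
            p += 4;                      /* 11110xxx: three more bytes follow
                                            (valid input assumed) */
        }
        count++;                         /* one codepoint consumed */
    }
    return count;
}
```

With precomposed letters this returns 12 for "Vypočítávání". If the č is instead stored as c followed by U+030C (combining caron), the same word counts as 13, which is exactly the case I do not know how to handle.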
How would I do the same for combining characters? That is, is there a straightforward way to tell whether a codepoint is an accent for example (such as háček in č or cedilla in ç or any strange curve in Chinese)?
If there is, I would like to skip them, too.
Thanks a lot!