I want to index the characters of a UTF-8 string that does not necessarily contain only ASCII characters. I want the same kind of behavior I get in JavaScript:
> str = "lλך" // i.e. Latin ell, Greek lambda, Hebrew final kaf
'lλך'
> str[0]
'l'
> str[1]
'λ'
> str[2]
'ך'
Following the advice of UTF-8 Everywhere, I am representing my mixed character-length string just like any other string in C, and not using wchars.
The problem is that, in C, one cannot access the 16th character of a string: only the 16th byte. Because λ is encoded with two bytes in UTF-8, I have to access the 16th and 17th bytes of the string in order to print out one λ.
For reference, the output of:
#include <stdio.h>

int main(void) {
    char word_with_greek[] = "this is lambda:_λ";
    printf("%s\n", word_with_greek);
    printf("The 0th character is: %c\n", word_with_greek[0]);
    printf("The 15th character is: %c\n", word_with_greek[15]);
    /* λ occupies two bytes (indices 16 and 17), so both must be printed together. */
    printf("The 16th character is: %c%c\n", word_with_greek[16], word_with_greek[17]);
    return 0;
}
is:
this is lambda:_λ
The 0th character is: t
The 15th character is: _
The 16th character is: λ
Is there an easy way to break up the string into characters? It does not seem too difficult to write a function which breaks a string into wchars, but I imagine that someone has already written this and I just cannot find it.
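To make the goal concrete, here is a minimal sketch of indexing by code point directly on the byte string, without converting to wchars. It assumes the input is valid, NUL-terminated UTF-8 and relies on the fact that continuation bytes always have the bit pattern 10xxxxxx. The names `is_continuation`, `utf8_length` and `utf8_at` are hypothetical, not from any library, and the sketch counts code points, which is not always the same thing as user-perceived characters (combining marks are counted separately).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: a byte of the form 10xxxxxx continues a multi-byte
 * sequence; every other byte starts a new code point. */
int is_continuation(unsigned char b) {
    return (b & 0xC0) == 0x80;
}

/* Hypothetical helper: number of code points in a NUL-terminated string.
 * Assumes valid UTF-8; no validation is performed. */
size_t utf8_length(const char *s) {
    assert(s != NULL);
    size_t count = 0;
    for (; *s != '\0'; ++s) {
        if (!is_continuation((unsigned char)*s)) {
            ++count;
        }
    }
    return count;
}

/* Hypothetical helper: pointer to the first byte of the n-th code point
 * (0-based), or NULL if the string has n or fewer code points.
 * This is a linear scan, so indexing costs O(length), not O(1). */
const char *utf8_at(const char *s, size_t n) {
    assert(s != NULL);
    for (; *s != '\0'; ++s) {
        if (!is_continuation((unsigned char)*s)) {
            if (n == 0) {
                return s;
            }
            --n;
        }
    }
    return NULL;
}
```

With the string from the program above, `utf8_at(word_with_greek, 16)` would point at byte 16, the first of the two bytes of λ, and `utf8_length(word_with_greek)` would be 17 even though the string occupies 18 bytes.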
"which breaks a string into wchars"
Don't. Just don't. wchars are not decoded UTF-8 characters. wchars are another encoding. If you want to or need to decode, then decode to a UTF-32 string. – tkausl
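As a rough illustration of what decoding to UTF-32 might involve, the sketch below converts a NUL-terminated UTF-8 string into an array of `uint32_t` code points, after which ordinary array indexing works per code point. The function names `utf8_decode_one` and `utf8_to_utf32` are hypothetical. The decoder is deliberately simplified: it checks lead and continuation byte patterns but, as an assumption of this sketch, does not reject overlong encodings or surrogate values, which a production decoder should.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode the code point starting at s into *out.
 * Returns the number of bytes consumed, or 0 on malformed input.
 * Simplified: overlong forms and surrogates are NOT rejected. */
size_t utf8_decode_one(const unsigned char *s, uint32_t *out) {
    size_t len;
    uint32_t cp;
    if (s[0] < 0x80) {                    /* 0xxxxxxx: ASCII */
        *out = s[0];
        return 1;
    } else if ((s[0] & 0xE0) == 0xC0) {   /* 110xxxxx: 2-byte sequence */
        len = 2; cp = s[0] & 0x1F;
    } else if ((s[0] & 0xF0) == 0xE0) {   /* 1110xxxx: 3-byte sequence */
        len = 3; cp = s[0] & 0x0F;
    } else if ((s[0] & 0xF8) == 0xF0) {   /* 11110xxx: 4-byte sequence */
        len = 4; cp = s[0] & 0x07;
    } else {
        return 0;                         /* stray continuation or invalid lead */
    }
    for (size_t i = 1; i < len; ++i) {
        /* A NUL terminator fails this check too, so we never read past it. */
        if ((s[i] & 0xC0) != 0x80) {
            return 0;
        }
        cp = (cp << 6) | (uint32_t)(s[i] & 0x3F);
    }
    *out = cp;
    return len;
}

/* Hypothetical helper: decode src into dst, which has room for cap code
 * points. Returns the number of code points written, or (size_t)-1 if the
 * input is malformed or dst is too small. */
size_t utf8_to_utf32(const char *src, uint32_t *dst, size_t cap) {
    assert(src != NULL && dst != NULL);
    const unsigned char *s = (const unsigned char *)src;
    size_t n = 0;
    while (*s != '\0') {
        uint32_t cp;
        size_t used = utf8_decode_one(s, &cp);
        if (used == 0 || n == cap) {
            return (size_t)-1;
        }
        dst[n++] = cp;
        s += used;
    }
    return n;
}
```

For the three-character string from the JavaScript example, this yields three code points, with `dst[1]` equal to 0x3BB (λ). The trade-off against scanning the UTF-8 bytes in place is memory: four bytes per code point in exchange for constant-time indexing.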