2
votes

I want to index the characters in a UTF-8 string which does not necessarily contain only ASCII characters. I want the same kind of behavior I get in JavaScript:

> str = "lλך" // i.e. Latin ell, Greek lambda, Hebrew lamedh
'lλך'
> str[0]
'l'
> str[1]
'λ'
> str[2]
'ך'

Following the advice of UTF-8 Everywhere, I am representing my mixed character-length string just like any other string in C, and not using wchars.

The problem is that, in C, one cannot access the 16th character of a string: only the 16th byte. Because λ is encoded with two bytes in UTF-8, I have to access the 16th and 17th bytes of the string in order to print out one λ.

For reference, the output of:

#include <stdio.h>

int main () {
  char word_with_greek[] = "this is lambda:_λ";
  printf("%s\n",word_with_greek);
  printf("The 0th character is: %c\n", word_with_greek[0]);
  printf("The 15th character is: %c\n",word_with_greek[15]);
  printf("The 16th character is: %c%c\n",word_with_greek[16],word_with_greek[17]);
  return 0;
}

is:

this is lambda:_λ
The 0th character is: t
The 15th character is: _
The 16th character is: λ

Is there an easy way to break up the string into characters? It does not seem too difficult to write a function which breaks a string into wchars, but I imagine that someone has already written this, yet I cannot find it.

"which breaks a string into wchars" Don't. Just don't. wchars are not decoded UTF-8 characters; wchars are another encoding. If you want to/need to decode, then decode to a UTF-32 string. – tkausl
Just about everything about Unicode is non-trivial. So I suggest you try to find a library (there are a few) to help you. Asking for libraries is off-topic here though, but you could try on the Software Recommendations Stack Exchange site. – Some programmer dude
Your question is slightly contradictory. You state that you want to iterate through a string. That's a very natural concept for UTF-8: you have a pointer into the string and a function that tells you how many bytes to skip to get to the next character. But then you want to access the 16th character. That's indexed access, not iterating. Check your requirements again. Indexed access is usually not needed; most likely it's just an old habit from how you have implemented string processing in the past. – Codo
Iteration is for string processing, but iteration doesn't need an index. The index of the character to process is usually irrelevant; you just want to access the character at the iterator's current position. Indexed access is inefficient with Unicode (see Schlemiel the Painter's Algorithm), while iteration with the concept of a current position is fast (see the sketch after these comments). – Codo
There is no fast and easy definition of character. Is oᷔ a single character? Why or why not? How about o͡e: how many are there? – n. 1.8e9-where's-my-share m.
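
To illustrate the iteration Codo describes, here is a minimal sketch (the helper name utf8_skip is made up for this example): it looks at the lead byte of each UTF-8 sequence to find out how many bytes the current character occupies, prints that many bytes, and advances the pointer. It assumes the source file and the terminal both use UTF-8, and it does not validate continuation bytes.

#include <stdio.h>

/* Length in bytes of the UTF-8 sequence that starts with this lead byte.
   No validation of continuation bytes - just enough for iteration. */
static int utf8_skip(unsigned char lead) {
    if (lead < 0x80) return 1;            /* 0xxxxxxx: ASCII             */
    if ((lead & 0xE0) == 0xC0) return 2;  /* 110xxxxx: 2-byte sequence   */
    if ((lead & 0xF0) == 0xE0) return 3;  /* 1110xxxx: 3-byte sequence   */
    if ((lead & 0xF8) == 0xF0) return 4;  /* 11110xxx: 4-byte sequence   */
    return 1;                             /* invalid lead byte: skip one */
}

int main(void) {
    const char *p = "this is lambda:_λ";
    while (*p != '\0') {
        int len = utf8_skip((unsigned char)*p);
        printf("character: %.*s (%d byte%s)\n", len, p, len, len == 1 ? "" : "s");
        p += len;
    }
    return 0;
}

On the question's string this prints one character per line, ending with λ reported as 2 bytes; no character index is ever computed.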

2 Answers

0
votes

It depends on what your Unicode characters can be. Most strings are restricted to the Basic Multilingual Plane (BMP). If yours are (not by accident but because of their very nature: at least no risk of emoji...), you can use char16_t to represent any character. By the way, wchar_t is at least as large as char16_t, so in that case it is safe to use it.

If your script can contain emoji characters, or other characters outside the BMP, or simply if you are unsure, the only foolproof way is to convert everything to char32_t, because any Unicode character (at least in 2019...) has a code point that fits in less than 32 bits.

Converting from UTF-8 to 32-bit (or 16-bit) Unicode is not that hard and can be coded by hand; Wikipedia contains enough information for it. But you will find tons of libraries where this is already coded and tested, notably the excellent libiconv, and the C11 version of the C standard library contains functions for UTF-8 conversions. Not as nice, but usable.
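
As a hedged sketch of the C11 route: <uchar.h> provides mbrtoc32, which decodes one multibyte character into a char32_t. Note that mbrtoc32 converts from the current locale's multibyte encoding, so this example assumes a UTF-8 locale (common on modern Linux systems); it is not guaranteed to be UTF-8 everywhere.

#include <locale.h>
#include <stdio.h>
#include <string.h>
#include <uchar.h>   /* C11: char32_t, mbrtoc32 */

int main(void) {
    setlocale(LC_ALL, "");   /* assumes the environment locale is UTF-8 */

    const char *s = "this is lambda:_λ";
    size_t n = strlen(s);
    mbstate_t state = {0};
    size_t i = 0;

    while (i < n) {
        char32_t c32;
        size_t rc = mbrtoc32(&c32, s + i, n - i, &state);
        if (rc == (size_t)-1 || rc == (size_t)-2) {  /* invalid or truncated input */
            fprintf(stderr, "bad sequence at byte %zu\n", i);
            return 1;
        }
        /* rc == 0 means a decoded null byte; the (size_t)-3 case (extra output,
           no input consumed) is ignored here since it does not arise for UTF-8. */
        printf("U+%04lX\n", (unsigned long)c32);
        i += (rc == 0) ? 1 : rc;
    }
    return 0;
}

On the sample string this prints one code point per line, ending with U+03BB for λ.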

0
votes

You should consult the code behind Emacs, because Emacs not only implemented all possible conversion functions, but also implemented them well, better than almost all other editors.

Start reading from the API concerning multibyte chars and see how they are implemented.