4 votes

I know how to convert a character string into a byte array using a particular encoding, but how do I convert the character indexes to byte indexes (in Java)?

For instance, in UTF-32, character index i is byte index 4 * i because every UTF-32 character is 4 bytes wide. But in UTF-8, most English characters are 1 byte wide, characters in most other scripts are 2 or 3 bytes wide, and a few are 4 bytes wide. For a given string and encoding, how would I get an array of starting byte indexes for each character?

Here's an example of what I mean. The string "Hello مرحبا こんにちは" in UTF-8 has the following indexes: [0, 1, 2, 3, 4, 5, 6, 8, 10, 12, 14, 16, 17, 20, 23, 26, 29] because the Latin characters are 1 byte each, the Arabic characters are 2 bytes each, and the Japanese characters are 3 bytes each. (Before the cumulative sum, the array is [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 1, 3, 3, 3, 3, 3].)

Is there a library function in Java that computes these index positions? It needs to be efficient, so I shouldn't convert each character to a separate byte array just to query its length. Is there an easy way to compute it myself, from some knowledge of Unicode? It should be possible to do in one pass, by recognizing special bytes that indicate the width of the next character.
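To make the efficiency concern concrete, here's a sketch of the naive approach I want to avoid (my own illustration, not a proposed solution): re-encoding the prefix before each char index and measuring its byte length. It is quadratic in the string length, and prefixes that split a surrogate pair are mis-measured by the encoder's replacement behavior.

```java
import java.nio.charset.StandardCharsets;

class NaiveUtf8Indexes {
    // Inefficient baseline: re-encode the prefix up to each char index
    // and take its byte length. O(n^2) overall.
    static int[] naiveUtf8ByteIndexes(String s) {
        int[] byteIndexes = new int[s.length()];
        for (int i = 0; i < s.length(); i++) {
            byteIndexes[i] = s.substring(0, i)
                              .getBytes(StandardCharsets.UTF_8).length;
        }
        return byteIndexes;
    }
}
```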

You're asking two separate questions here, one about Python and one about Java. It would be better to ask them separately, so you can "accept" the best answer to each one. – Dawood ibn Kareem

Okay, in this question I'll just ask about and accept answers about Java. Python information in a comment would just be a welcome bonus. – Jim Pivarski

OK, are you aware that in Java, all strings are stored internally in UTF-16? – Dawood ibn Kareem

ps: if you must, then the Description section of en.wikipedia.org/wiki/UTF-8 will give you the info needed to parse UTF-8 into Unicode characters... – thebjorn

I'm using a library that does extended POSIX regular expressions on byte arrays (it has no concept of a Java String), and it returns matches as indexes into the byte array. One option is to call str.getBytes("utf-32be") and divide all indexes it gives me by 4. I'd like to use Java's internal encoding (UTF-16BE?) for efficiency, in which the factor is usually, but not always, 2. I want it to always work. – Jim Pivarski

1 Answer

8 votes

I think this can do what you want:

static int[] utf8ByteIndexes(String s) {
    int[] byteIndexes = new int[s.length()];
    int sum = 0;                             // running UTF-8 byte offset
    for (int i = 0; i < s.length(); i++) {
        byteIndexes[i] = sum;
        int c = s.codePointAt(i);
        if (Character.charCount(c) == 2) {
            // Supplementary code point: the second char of the
            // surrogate pair gets the same byte offset.
            i++;
            byteIndexes[i] = sum;
        }
        // UTF-8 width by code point range
        if (c <=     0x7F) sum += 1; else
        if (c <=    0x7FF) sum += 2; else
        if (c <=   0xFFFF) sum += 3; else
        if (c <= 0x10FFFF) sum += 4; else
        throw new AssertionError("unreachable: codePointAt() never exceeds 0x10FFFF");
    }
    return byteIndexes;
}

Given a Java string, it returns an array of the UTF-8 byte indexes corresponding to each char in the String.

System.out.println(Arrays.toString(utf8ByteIndexes("Hello مرحبا こんにちは")));

Output:

[0, 1, 2, 3, 4, 5, 6, 8, 10, 12, 14, 16, 17, 20, 23, 26, 29]

Unicode characters above U+FFFF, those that don't fit in Java's 16-bit char type, are a bit of a nuisance. For example, the Christmas tree emoji U+1F384 (🎄) is encoded as two Java chars (a surrogate pair). For those, the function above returns the same byte index for both chars:

System.out.println(Arrays.toString(utf8ByteIndexes("x🎄y")));

Output:

[0, 1, 1, 5]

The overall cumulative byte count is still correct, though (the emoji takes 4 bytes when encoded in UTF-8).
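If one entry per Java char is awkward, a per-code-point variant (my own sketch, not part of the answer above) sidesteps the surrogate-pair duplication by stepping through code points:

```java
class CodePointIndexes {
    // One UTF-8 byte offset per Unicode code point (not per Java char).
    static int[] utf8ByteIndexesByCodePoint(String s) {
        int[] byteIndexes = new int[s.codePointCount(0, s.length())];
        int sum = 0, out = 0;
        for (int i = 0; i < s.length(); i += Character.charCount(s.codePointAt(i))) {
            int c = s.codePointAt(i);
            byteIndexes[out++] = sum;
            if (c <=   0x7F) sum += 1; else   // UTF-8 width by range
            if (c <=  0x7FF) sum += 2; else
            if (c <= 0xFFFF) sum += 3; else
                             sum += 4;
        }
        return byteIndexes;
    }
}
```

For "x🎄y" this yields one offset per code point, [0, 1, 5], instead of the per-char [0, 1, 1, 5].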