I know how to convert a character string into a byte array using a particular encoding, but how do I convert the character indexes to byte indexes (in Java)?
For instance, in UTF-32, character index i is byte index 4 * i, because every UTF-32 character is 4 bytes wide. But in UTF-8, most English characters are 1 byte wide, characters in most other scripts are 2 or 3 bytes wide, and a few are 4 bytes wide. For a given string and encoding, how would I get an array of starting byte indexes for each character?
Here's an example of what I mean. The string "Hello مرحبا こんにちは" in UTF-8 has the following starting byte indexes: [0, 1, 2, 3, 4, 5, 6, 8, 10, 12, 14, 16, 17, 20, 23, 26, 29], because the Latin characters and spaces are 1 byte each, the Arabic characters are 2 bytes each, and the Japanese characters are 3 bytes each. (Before the cumulative sum, the array of character widths is [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 1, 3, 3, 3, 3, 3].)
Is there a library function in Java that computes these index positions? It needs to be efficient, so I shouldn't convert each character to a separate byte array just to query its length. Is there an easy way to compute it myself, from some knowledge of Unicode? It should be possible in one pass, since in UTF-8 each character's lead byte indicates its width.
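One way to sketch the one-pass idea: instead of inspecting lead bytes, walk the string by code point and apply the UTF-8 width rules directly (1 byte below U+0080, 2 below U+0800, 3 below U+10000, 4 otherwise). This avoids encoding anything. The class and method names here are my own, not a library API; I'm assuming the two halves of a surrogate pair should share the offset of the character they encode.

```java
import java.util.Arrays;

public class Utf8Offsets {
    /** Starting UTF-8 byte index for each char index of s (hypothetical helper).
     *  Both chars of a surrogate pair get the offset of the code point they form. */
    static int[] utf8ByteOffsets(String s) {
        int[] offsets = new int[s.length()];
        int bytes = 0;
        for (int i = 0; i < s.length(); ) {
            offsets[i] = bytes;
            int cp = s.codePointAt(i);
            // UTF-8 width from the code point value
            bytes += cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
            if (Character.charCount(cp) == 2) {
                offsets[i + 1] = offsets[i];  // low surrogate shares the offset
            }
            i += Character.charCount(cp);
        }
        return offsets;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(utf8ByteOffsets("Hello مرحبا こんにちは")));
        // → [0, 1, 2, 3, 4, 5, 6, 8, 10, 12, 14, 16, 17, 20, 23, 26, 29]
    }
}
```

This is O(n) with a single array allocation; no per-character getBytes calls.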
The matching runs on a String, and it returns matches as indexes of the byte array. One option is to call str.getBytes("utf-32be") and divide all the indexes it gives me by 4. I'd like to use Java's internal encoding (UTF-16BE?) for efficiency, in which the factor is usually, but not always, 2. I want it to always work. – Jim Pivarski