Convert char pos of UnicodeString to byte pos in a utf8 string

Question

I use Scintilla and set it's encoding to utf8 (and this is the only way to make it compatible with Unicode characters, if I understand it correctly). With this set up, when talking about a positions in the text Scintilla means byte positions.

The problem is, I use UnicodeString in the rest of my program, and when I need to select a particular rang in the Scintilla editor, I need to convert from char pos of the UnicodeString to byte pos in a utf8 string that's corresponding to the UnicodeString. How can I do that easily? Thanks.

PS, when I found ByteToCharIndex I thought it's what I need, however, according to its document and the result of my testing, it only works If the system uses a multi-byte character system (MBCS).

Are you sure ByteToCharIndex doesn't work? I wouldn't be surprised if the text of the documentation predates Delphi 2009, when AnsiString changed to carry its own code page. Now that AnsiString includes a code page, the function should be able to tell whether the string is encoded as MBCS, SBCS, or UTF-8 instead of relying on the system setting. — Rob Kennedy
@RobKennedy - It does not work; more than that, the Windows function CharNextExA does not work with UTF8 too. — kludg

kludg kludg · Accepted Answer · 2012-04-30T17:46:37

You should parse UTF8 strings yourself using UTF8 description. I have written a quick UTF8 analog of ByteToCharIndex and tested on cyrillic string:

function UTF8PosToCharIndex(const S: UTF8String; Index: Integer): Integer;
var
  I: Integer;
  P: PAnsiChar;

begin
  Result:= 0;
  if (Index <= 0) or (Index > Length(S)) then Exit;
  I:= 1;
  P:= PAnsiChar(S);
  while I <= Index do begin
    if Ord(P^) and $C0 <> $80 then Inc(Result);
    Inc(I);
    Inc(P);
  end;
end;

const TestStr: UTF8String = 'abФЫВА';

procedure TForm1.Button2Click(Sender: TObject);
begin
  ShowMessage(IntToStr(UTF8PosToCharIndex(TestStr, 1))); // a = 1
  ShowMessage(IntToStr(UTF8PosToCharIndex(TestStr, 2))); // b = 2
  ShowMessage(IntToStr(UTF8PosToCharIndex(TestStr, 3))); // Ф = 3
  ShowMessage(IntToStr(UTF8PosToCharIndex(TestStr, 5))); // Ы = 4
  ShowMessage(IntToStr(UTF8PosToCharIndex(TestStr, 7))); // В = 5
end;

The reverse function is no problem too:

function CharIndexToUTF8Pos(const S: UTF8String; Index: Integer): Integer;
var
  P: PAnsiChar;

begin
  Result:= 0;
  P:= PAnsiChar(S);
  while (Result < Length(S)) and (Index > 0) do begin
    Inc(Result);
    if Ord(P^) and $C0 <> $80 then Dec(Index);
    Inc(P);
  end;
  if Index <> 0 then Result:= 0;  // char index not found
end;

Convert char pos of UnicodeString to byte pos in a utf8 string

3 Answers