1
votes

I use Scintilla and set its encoding to UTF-8 (as far as I understand, this is the only way to make it compatible with Unicode characters). With this setup, whenever Scintilla talks about a position in the text, it means a byte position.

The problem is that I use UnicodeString in the rest of my program, so when I need to select a particular range in the Scintilla editor, I have to convert a character position in the UnicodeString to a byte position in the corresponding UTF-8 string. How can I do that easily? Thanks.

PS: when I found ByteToCharIndex I thought it was what I needed. However, according to its documentation and my own testing, it only works if the system uses a multi-byte character set (MBCS).

3
Are you sure ByteToCharIndex doesn't work? I wouldn't be surprised if the text of the documentation predates Delphi 2009, when AnsiString changed to carry its own code page. Now that AnsiString includes a code page, the function should be able to tell whether the string is encoded as MBCS, SBCS, or UTF-8 instead of relying on the system setting. - Rob Kennedy
@RobKennedy - It does not work; what's more, the Windows function CharNextExA does not work with UTF-8 either. - kludg
Yes, it doesn't work, as Serg confirmed. I tried it too. - Edwin Yip

3 Answers

3
votes

You should parse UTF-8 strings yourself, following the UTF-8 encoding rules. I wrote a quick UTF-8 analog of ByteToCharIndex and tested it on a Cyrillic string:

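// Returns the 1-based index of the character that the byte at the
// 1-based position Index belongs to, or 0 if Index is out of range.
function UTF8PosToCharIndex(const S: UTF8String; Index: Integer): Integer;
var
  I: Integer;
  P: PAnsiChar;

begin
  Result:= 0;
  if (Index <= 0) or (Index > Length(S)) then Exit;
  I:= 1;
  P:= PAnsiChar(S);
  while I <= Index do begin
    // Continuation bytes look like 10xxxxxx; every other byte starts a character.
    if Ord(P^) and $C0 <> $80 then Inc(Result);
    Inc(I);
    Inc(P);
  end;
end;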

const TestStr: UTF8String = 'abФЫВА';

procedure TForm1.Button2Click(Sender: TObject);
begin
  ShowMessage(IntToStr(UTF8PosToCharIndex(TestStr, 1))); // a = 1
  ShowMessage(IntToStr(UTF8PosToCharIndex(TestStr, 2))); // b = 2
  ShowMessage(IntToStr(UTF8PosToCharIndex(TestStr, 3))); // Ф = 3
  ShowMessage(IntToStr(UTF8PosToCharIndex(TestStr, 5))); // Ы = 4
  ShowMessage(IntToStr(UTF8PosToCharIndex(TestStr, 7))); // В = 5
end;

The reverse function is just as simple:

function CharIndexToUTF8Pos(const S: UTF8String; Index: Integer): Integer;
var
  P: PAnsiChar;

begin
  Result:= 0;
  P:= PAnsiChar(S);
  while (Result < Length(S)) and (Index > 0) do begin
    Inc(Result);
    if Ord(P^) and $C0 <> $80 then Dec(Index);
    Inc(P);
  end;
  if Index <> 0 then Result:= 0;  // char index not found
end;
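
To make the bit test in both functions easier to follow, here is a minimal, self-contained sketch of the same idea. The names IsCodePointStart and CountCodePoints are made up for this illustration and are not part of the RTL. The test string is built from character codes and passed through UTF8Encode so that the result does not depend on the encoding of the source file. It assumes a desktop Delphi 2009+ compiler (or Free Pascal in Delphi mode).

```pascal
program Utf8LeadByteDemo;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

// Hypothetical helper: True for any byte that starts a code point,
// i.e. any byte that is NOT a continuation byte of the form 10xxxxxx.
function IsCodePointStart(B: Byte): Boolean;
begin
  Result := (B and $C0) <> $80;
end;

// Hypothetical helper: counts code points by counting the bytes that start one.
function CountCodePoints(const S: UTF8String): Integer;
var
  I: Integer;
begin
  Result := 0;
  for I := 1 to Length(S) do
    if IsCodePointStart(Ord(S[I])) then
      Inc(Result);
end;

var
  U: UnicodeString;
  S: UTF8String;
begin
  // 'ab' followed by U+0424 and U+042B (two Cyrillic letters)
  U := 'ab';
  U := U + WideChar($0424) + WideChar($042B);
  S := UTF8Encode(U);
  WriteLn(Length(S));          // 6 bytes: 1 + 1 + 2 + 2
  WriteLn(CountCodePoints(S)); // 4 code points
end.
```

Note that this counts Unicode code points, like the two functions above. A UnicodeString index counts UTF-16 code units instead, so the two only agree as long as the text contains no characters outside the Basic Multilingual Plane.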
1
votes

With great respect to Serg, I wrote a function based on his code. I am posting it here as a separate answer in the hope that it is helpful to others too; Serg's answer remains the accepted one.

{Return the index (1-based) of the first byte of the character (Unicode code point) specified by aCharIdx (1-based) in aUtf8Str. Return 0 if there is no such character.

Code is amended by Edwin Yip based on code written by SO member Serg (https://stackoverflow.com/users/246408/serg)

ref 1: https://stackoverflow.com/a/10388131/133516

ref 2: http://sergworks.wordpress.com/2012/05/01/parsing-utf8-strings/ }

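function CharPosToUTF8BytePos(const aUtf8Str: UTF8String; const aCharIdx:
    Integer): Integer;
var
  p: PAnsiChar;
  charCount: Integer;
begin
  p:= PAnsiChar(aUtf8Str);
  Result:= 0;
  charCount:= 0;
  while (Result < Length(aUtf8Str)) do
  begin
    // IsUTF8LeadChar is not declared in this post; it is assumed to return
    // True for any byte that is not a continuation byte (10xxxxxx).
    if IsUTF8LeadChar(p^) then
      Inc(charCount);

    if charCount = aCharIdx then
      Exit(Result + 1);

    Inc(p);
    Inc(Result);
  end;
  Result:= 0;  // character index not found
end;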
0
votes

Both UTF-8 and UTF-16 (which UnicodeString uses) are variable-length encodings. A given Unicode code point is encoded in UTF-8 as 1 to 4 single-byte code units, and in UTF-16 as either 1 or 2 two-byte code units, depending on the code point's numeric value. The only way to translate a position in a UTF-16 string into a position in the equivalent UTF-8 string is to decode the UTF-16 code units preceding the position back to their original Unicode code point values and then re-encode them as UTF-8 code units (or at least work out how many UTF-8 bytes each one needs).
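
As a rough sketch of that translation, the function below walks the UTF-16 code units before the requested index and adds up how many UTF-8 bytes each code point needs, without building the UTF-8 string. Utf16IndexToUtf8BytePos is a hypothetical name chosen for this example, and the handling of out-of-range indexes (returning 0) is my own assumption. It should compile with Delphi 2009+ or Free Pascal in Delphi mode.

```pascal
program Utf16ToUtf8PosDemo;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

// Hypothetical helper. Returns the 1-based byte position, in the UTF-8
// encoding of S, at which the character at 1-based UTF-16 index CharIdx
// starts. Returns 0 if CharIdx is out of range. If CharIdx points at the
// second half of a surrogate pair, the position after the pair is returned.
function Utf16IndexToUtf8BytePos(const S: UnicodeString; CharIdx: Integer): Integer;
var
  I: Integer;
  C: Word;
begin
  Result := 0;
  if (CharIdx < 1) or (CharIdx > Length(S)) then Exit;
  Result := 1;
  I := 1;
  while I < CharIdx do
  begin
    C := Ord(S[I]);
    if C < $80 then
      Inc(Result, 1)                 // ASCII: 1 byte
    else if C < $800 then
      Inc(Result, 2)                 // U+0080..U+07FF: 2 bytes
    else if (C >= $D800) and (C <= $DBFF) and (I < Length(S)) and
            (Ord(S[I + 1]) >= $DC00) and (Ord(S[I + 1]) <= $DFFF) then
    begin
      Inc(Result, 4);                // surrogate pair: one code point, 4 bytes
      Inc(I);                        // skip the low surrogate
    end
    else
      Inc(Result, 3);                // rest of the BMP: 3 bytes
    Inc(I);
  end;
end;

var
  U: UnicodeString;
begin
  // 'ab' followed by U+0424, U+042B, U+0412 (three Cyrillic letters)
  U := 'ab';
  U := U + WideChar($0424) + WideChar($042B) + WideChar($0412);
  WriteLn(Utf16IndexToUtf8BytePos(U, 3)); // 3
  WriteLn(Utf16IndexToUtf8BytePos(U, 4)); // 5
  WriteLn(Utf16IndexToUtf8BytePos(U, 5)); // 7

  // 'a', then U+1F600 as a surrogate pair, then 'b'
  U := 'a';
  U := U + WideChar($D83D) + WideChar($DE00) + 'b';
  WriteLn(Utf16IndexToUtf8BytePos(U, 4)); // 6
end.
```

The result is 1-based, like Delphi string indexes. Scintilla positions are zero-based byte offsets, so subtract 1 before passing the value to the editor.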

It sounds like you would be better off rewriting the code that interacts with Scintilla to use UTF8String instead of UnicodeString. Then you won't have to translate between UTF-8 and UTF-16 positions at that layer anymore. When interacting with the rest of your code, you can convert between UTF8String and UnicodeString as needed.
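
If a position still has to cross that boundary occasionally, one simple option is to let the RTL do the conversion: encode the text before the character with UTF8Encode and take its length. This is only a sketch; PrefixBytePos is a hypothetical name, it does no range checking, and it allocates a temporary string on every call, so it is better suited to occasional lookups than to tight loops.

```pascal
program Utf8PrefixDemo;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

// Hypothetical helper: 1-based UTF-8 byte position of the character at
// 1-based index CharIdx, found by encoding everything before it.
// Assumes CharIdx does not point into the middle of a surrogate pair.
function PrefixBytePos(const S: UnicodeString; CharIdx: Integer): Integer;
begin
  Result := Length(UTF8Encode(Copy(S, 1, CharIdx - 1))) + 1;
end;

var
  U: UnicodeString;
begin
  // 'ab' followed by U+0424 and U+042B (two Cyrillic letters)
  U := 'ab';
  U := U + WideChar($0424) + WideChar($042B);
  WriteLn(PrefixBytePos(U, 1)); // 1
  WriteLn(PrefixBytePos(U, 4)); // 5: 'a' + 'b' + 2 bytes for U+0424, plus 1
end.
```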