In Delphi XE2, when I use the AssignFile and ReadLn() routines, I can only read and display Unicode characters from a UTF-8 encoded file correctly when the system locale is English.

Where it fails
If I set the system locale for non-Unicode applications to Korean (codepage 949, I think) and repeat the same read, some of the bytes in my UTF-8 multi-byte sequences get replaced with $3F. This only happens with ReadLn, not with TFile.ReadAllText(aFilename, TEncoding.UTF8) or TFileStream.Read().

The test
1. I create a text file, UTF-8 without BOM (Notepad++), with the following characters (hex equivalent shown on the second line):

테스트
ed 85 8c ec 8a a4 ed 8a b8

2. Write a Delphi XE2 Windows form application with a TMemo control:

    procedure TForm1.ReadFile(aFilename:string);
    var
      gFile     : TextFile;
      gLine     : RawByteString;
      gWideLine : string;
    begin
      AssignFile(gFile, aFilename);
      // Open before entering the try block so that CloseFile is only
      // called on a file that was actually opened.
      Reset(gFile);
      try
        Memo1.Clear;
        while not EOF(gFile) do
        begin
          // Read one line as (supposedly) raw bytes...
          ReadLn(gFile, gLine);
          // ...then decode those bytes as UTF-8.
          gWideLine := UTF8ToWideString(gLine);
          Memo1.Lines.Add(gWideLine);
        end;
      finally
        CloseFile(gFile);
      end;
    end;

3. I inspect the contents of gLine before performing the UTF8ToWideString conversion, and under English (US) locale Windows it is:

    $ED $85 $8C $EC $8A $A4 $ED $8A $B8

As an aside, if I read the same file saved with a BOM, I get the correct 3-byte preamble, and the output after the UTF-8 decode is the same. All OK so far!

  1. Switch Windows 7 (x64) to use Korean as the codepage for applications without Unicode support (Region and Language --> Administrative tab --> Change system locale --> Korean (Korea)). Restart the computer.

  2. Read same file (UTF8 w/o BOM) with above application and gLine now has hex value:

    $3F $8C $EC $8A $A4 $3F $3F

    Output in TMemo: ?�스??

  3. Hypothesis: ReadLn() (and Read(), for that matter) is attempting to interpret the UTF-8 sequences as Korean multi-byte sequences (i.e. it tries to interpret $ED $85, can't, and so substitutes a question mark, $3F).

  4. Use TFileStream to read in exactly the expected number of bytes (9 w/o BOM) and the hex in memory is now exactly:

    $ED $85 $8C $EC $8A $A4 $ED $8A $B8

    Output in TMemo: 테스트 (perfect!)
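
For reference, the TFileStream read in the last step can be sketched roughly as below. This is a minimal sketch rather than the code from the original application: ReadUtf8File is a hypothetical helper name, it assumes the whole file fits in memory, and it does not strip a BOM if one is present.

```delphi
uses
  System.SysUtils, System.Classes;

// Hypothetical helper (not part of the original application):
// reads the file as raw bytes, bypassing TextFile entirely,
// and decodes those bytes as UTF-8.
function ReadUtf8File(const aFilename: string): string;
var
  Stream : TFileStream;
  Bytes  : TBytes;
begin
  Stream := TFileStream.Create(aFilename, fmOpenRead or fmShareDenyWrite);
  try
    SetLength(Bytes, Stream.Size);
    // ReadBuffer raises an exception if fewer bytes than requested are read.
    if Length(Bytes) > 0 then
      Stream.ReadBuffer(Bytes[0], Length(Bytes));
  finally
    Stream.Free;
  end;

  if Length(Bytes) = 0 then
    Result := ''
  else
    // No codepage conversion takes place before this explicit decode.
    Result := TEncoding.UTF8.GetString(Bytes);
end;
```

Because the bytes never pass through the TextFile routines, the system locale for non-Unicode applications should play no part in the result.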

Problem: Laziness - I've a lot of legacy routines that parse potentially large files line by line and I wanted to be sure I didn't need to write a routine to manually read until new lines for each of these files.

Question(s):

  1. Why is Read() not returning me the exact byte string as found in the file? Is it because I'm using a TextFile type and so Delphi is doing a degree of interpretation using the non-unicode codepage?

  2. Is there a built in way to read a UTF8 encoded file line by line?
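
One way to probe the hypothesis behind question 1 without changing the system locale is to push the same nine bytes through codepage 949 and back. The console sketch below does that with TEncoding.GetEncoding(949). It assumes codepage 949 is installed, and it only approximates whatever ReadLn does internally, so treat it as an experiment rather than a description of the RTL.

```delphi
program Cp949RoundTrip;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

var
  Korean       : TEncoding;
  Source       : TBytes;
  RoundTripped : TBytes;
  B            : Byte;
begin
  // The UTF-8 bytes of the test string from the file.
  Source := TBytes.Create($ED, $85, $8C, $EC, $8A, $A4, $ED, $8A, $B8);

  // Assumption: codepage 949 is available on this machine.
  // Instances returned by GetEncoding must be freed by the caller.
  Korean := TEncoding.GetEncoding(949);
  try
    // Decode the bytes as if they were Korean text, then encode them
    // again. Bytes that are not valid in codepage 949 cannot survive
    // this round trip unchanged.
    RoundTripped := Korean.GetBytes(Korean.GetString(Source));

    for B in RoundTripped do
      Write(IntToHex(B, 2), ' ');
    Writeln;
  finally
    Korean.Free;
  end;
end.
```

If the printed bytes differ from the input in the same places as the ReadLn result, that would support the idea that a codepage conversion is being applied to the file contents.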

Update:

Just came across Rob Kennedy's solution to this post which reintroduces me to TStreamReader, which answers the question about graceful reading of UTF8 files line by line.

Readln, and indeed Writeln, don't properly support Unicode encodings. This question is related. - David Heffernan

1 Answer

Is there a built in way to read a UTF8 encoded file line by line?

Use TStreamReader. It has a ReadLine() method.

    procedure TForm1.ReadFile(aFilename:string);
    var
      gFile     : TStreamReader;
      gLine     : string;
    begin
      Memo1.Clear;
      gFile := TStreamReader.Create(aFilename, TEncoding.UTF8, True);
      try
        while not gFile.EndOfStream do
        begin
          gLine := gFile.ReadLine;
          Memo1.Lines.Add(gLine);
        end;
      finally
        gFile.Free;
      end;
    end;

With that said, this particular example can be greatly simplified:

    procedure TForm1.ReadFile(aFilename:string);
    begin
      Memo1.Lines.LoadFromFile(aFilename, TEncoding.UTF8);
    end;