Why does ReadLn mis-interpret UTF8 text when non-unicode page is Korean (949)?

Question

In Delphi XE2, using the AssignFile and ReadLn() routines, I can only read and display Unicode characters (from a UTF8 encoded file) correctly when the system locale is English.

Where it fails
If I set the system locale for non-Unicode applications to Korean (codepage 949, I think) and repeat the same read, some of my UTF8 multi-byte sequences get replaced with $3F. This happens only with ReadLn, not with TFile.ReadAllText(aFilename, TEncoding.UTF8) or TFileStream.Read().

The test
I create a text file, UTF8 w/o BOM (Notepad++), with the following characters (hex equivalent shown on the second line):

테스트
ed 85 8c ec 8a a4 ed 8a b8

I write a Delphi XE2 Windows Forms application with a TMemo control:

procedure TForm1.ReadFile(aFilename:string);
var
  gFile     : TextFile;
  gLine     : RawByteString;
  gWideLine : string;
begin
  AssignFile(gFile, aFilename);
  // Open before the try block so CloseFile only runs on a file that was opened
  Reset(gFile);
  try
    Memo1.Clear;
    while not EOF(gFile) do
    begin
      // Read one line as raw bytes, then decode those bytes as UTF8
      ReadLn(gFile, gLine);
      gWideLine := UTF8ToWideString(gLine);
      Memo1.Lines.Add(gWideLine);
    end;
  finally
    CloseFile(gFile);
  end;
end;

I inspect the contents of gLine before performing the UTF8ToWideString conversion, and under English / US locale Windows it is:

$ED $85 $8C $EC $8A $A4 $ED $8A $B8

As an aside, if I read the same file with a BOM I get the correct 3 byte preamble and the output when the UTF8 decode is performed is the same. All OK so far!

I switch Windows 7 (x64) to use Korean as the codepage for applications without Unicode support (Region and Language --> Administrative tab --> Change system locale --> Korean (Korea)) and restart the computer.
I read the same file (UTF8 w/o BOM) with the above application and gLine now has the hex value:

$3F $8C $EC $8A $A4 $3F $3F

Output in TMemo: ?�스??
My hypothesis is that ReadLn() (and Read(), for that matter) attempts to interpret the UTF8 sequences as Korean multi-byte sequences (i.e. it tries to interpret $ED $85, can't, and so substitutes a question mark, $3F).
If I use TFileStream to read in exactly the expected number of bytes (9 w/o BOM), the hex in memory is exactly:

$ED $85 $8C $EC $8A $A4 $ED $8A $B8

Output in TMemo: 테스트 (perfect!)
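
For reference, the TFileStream route described above can be sketched roughly as follows. This is only a minimal sketch, not the exact code from the test: the unit name Utf8FileUtils and the function name ReadUtf8File are made up for illustration, and it assumes the file is small enough to load into memory in one go.

```delphi
unit Utf8FileUtils; // hypothetical unit name, for illustration only

interface

function ReadUtf8File(const aFilename: string): string;

implementation

uses
  System.SysUtils, System.Classes;

// Reads the whole file as raw bytes and decodes them as UTF8.
// No ANSI codepage is involved, so the system locale does not matter.
function ReadUtf8File(const aFilename: string): string;
var
  gStream : TFileStream;
  gBytes  : TBytes;
begin
  Result := '';
  gStream := TFileStream.Create(aFilename, fmOpenRead or fmShareDenyWrite);
  try
    // Assumption: the file fits comfortably in memory
    SetLength(gBytes, gStream.Size);
    if Length(gBytes) > 0 then
      gStream.ReadBuffer(gBytes[0], Length(gBytes));
  finally
    gStream.Free;
  end;
  // A BOM, if present, is not stripped here and would be decoded as U+FEFF
  if Length(gBytes) > 0 then
    Result := TEncoding.UTF8.GetString(gBytes);
end;

end.
```

This returns the whole file as one string, so it does not by itself solve the line-by-line requirement below.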

Problem: Laziness - I've a lot of legacy routines that parse potentially large files line by line and I wanted to be sure I didn't need to write a routine to manually read until new lines for each of these files.

Question(s):

Why is Read() not returning me the exact byte string as found in the file? Is it because I'm using a TextFile type and so Delphi is doing a degree of interpretation using the non-unicode codepage?
Is there a built in way to read a UTF8 encoded file line by line?

Update:

Just came across Rob Kennedy's solution to this post which reintroduces me to TStreamReader, which answers the question about graceful reading of UTF8 files line by line.

Readln, and indeed Writeln don't properly support Unicode encodings. This question is related — David Heffernan

Remy Lebeau · Accepted Answer · 2015-03-21T23:45:45

Is there a built in way to read a UTF8 encoded file line by line?

Use TStreamReader. It has a ReadLine() method.

    procedure TForm1.ReadFile(aFilename:string);
    var
      gFile     : TStreamReader;
      gLine     : string;
    begin
      Memo1.Clear;
      gFile := TStreamReader.Create(aFilename, TEncoding.UTF8, True);
      try
        while not gFile.EndOfStream do
        begin
          gLine := gFile.ReadLine;
          Memo1.Lines.Add(gLine);
        end;
      finally
        gFile.Free;
      end;
    end;

With that said, this particular example can be greatly simplified:

    procedure TForm1.ReadFile(aFilename:string);
    begin
      Memo1.Lines.LoadFromFile(aFilename, TEncoding.UTF8);
    end;
