In Delphi XE2 I can only read and display Unicode characters (from a UTF8 encoded file) when the system locale is English using the AssignFile and ReadLn() routines.
Where it fails
If I set the system locale for non-Unicode applications to Korean (codepage 949, I think) and repeat the same read, some bytes of my UTF8 multi-byte sequences get replaced with $3F. This only applies to using ReadLn and not when using TFile.ReadAllText(aFilename, TEncoding.UTF8) or TFileStream.Read().
The test
1. I create a text file, UTF8 w/o BOM (Notepad++), with the following characters (hex equivalent shown on second line):
테스트
ed 85 8c ec 8a a4 ed 8a b8
Write a Delphi XE2 Windows Forms application with a TMemo control:

    procedure TForm1.ReadFile(aFilename: string);
    var
      gFile     : TextFile;
      gLine     : RawByteString;
      gWideLine : string;
    begin
      AssignFile(gFile, aFilename);
      try
        Reset(gFile);
        Memo1.Clear;
        while not EOF(gFile) do
        begin
          ReadLn(gFile, gLine);
          gWideLine := UTF8ToWideString(gLine);
          Memo1.Lines.Add(gWideLine);
        end;
      finally
        CloseFile(gFile);
      end;
    end;
I inspect the contents of gLine before performing the UTF8ToWideString conversion, and under English / US locale Windows it is: $ED $85 $8C $EC $8A $A4 $ED $8A $B8
As an aside, if I read the same file with a BOM I get the correct 3 byte preamble and the output when the UTF8 decode is performed is the same. All OK so far!
Switch Windows 7 (x64) to use Korean as the codepage for applications without Unicode support (Region and Language --> Administrative tab --> Change system locale --> Korean (Korea)). Restart computer.
Read the same file (UTF8 w/o BOM) with the above application and gLine now has hex value: $3F $8C $EC $8A $A4 $3F $3F
Output in TMemo: ?�스??
Hypothesis: ReadLn() (and Read() for that matter) is attempting to map the UTF8 sequences as Korean multibyte sequences (i.e. it tries to interpret $ED $85, can't, and so substitutes a question mark, $3F).

Use TFileStream to read in exactly the expected number of bytes (9 w/o BOM) and the hex in memory is now exactly: $ED $85 $8C $EC $8A $A4 $ED $8A $B8
Output in TMemo: 테스트 (perfect!)
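For reference, the TFileStream read described above can be sketched roughly as follows. This is a minimal sketch, not the exact code I ran: the function name ReadUtf8FileRaw is made up for illustration, and it assumes the file is small enough to load in one go. It reads the raw bytes untouched and only then decodes them as UTF8, so the system codepage never gets involved.

```pascal
uses
  System.SysUtils, System.Classes;

// Hypothetical helper (name is an assumption): reads the whole file as raw
// bytes and decodes them as UTF8 in a single step.
function ReadUtf8FileRaw(const aFilename: string): string;
var
  Stream: TFileStream;
  Bytes: TBytes;
begin
  Result := '';
  Stream := TFileStream.Create(aFilename, fmOpenRead or fmShareDenyWrite);
  try
    SetLength(Bytes, Stream.Size);
    if Length(Bytes) > 0 then
    begin
      // ReadBuffer raises an exception if fewer bytes than requested are read.
      Stream.ReadBuffer(Bytes[0], Length(Bytes));
      // No BOM handling here: a leading BOM, if present, is decoded as U+FEFF.
      Result := TEncoding.UTF8.GetString(Bytes);
    end;
  finally
    Stream.Free;
  end;
end;
```

Calling Memo1.Text := ReadUtf8FileRaw(aFilename) would show the whole file at once, but it does not give me the line-by-line parsing my legacy routines rely on.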
Problem: Laziness - I've a lot of legacy routines that parse potentially large files line by line and I wanted to be sure I didn't need to write a routine to manually read until new lines for each of these files.
Question(s):
Why is Read() not returning me the exact byte string as found in the file? Is it because I'm using a TextFile type and so Delphi is doing a degree of interpretation using the non-Unicode codepage?

Is there a built-in way to read a UTF8 encoded file line by line?
Update:
Just came across Rob Kennedy's solution to this post which reintroduces me to TStreamReader, which answers the question about graceful reading of UTF8 files line by line.
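A minimal sketch of what line-by-line reading with TStreamReader might look like, under my own assumptions: the procedure name LoadUtf8Lines and the console wrapper are hypothetical, and passing True for DetectBOM is my choice so that files with and without a BOM are both handled.

```pascal
program ReadUtf8Lines;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.Classes;

// Hypothetical helper (name is an assumption): appends each line of a UTF8
// encoded file to aLines, decoding with TEncoding.UTF8.
procedure LoadUtf8Lines(const aFilename: string; aLines: TStrings);
var
  Reader: TStreamReader;
begin
  // Third argument is DetectBOM: a UTF8 preamble, if present, is skipped.
  Reader := TStreamReader.Create(aFilename, TEncoding.UTF8, True);
  try
    while not Reader.EndOfStream do
      aLines.Add(Reader.ReadLine);
  finally
    Reader.Free;
  end;
end;

var
  Lines: TStringList;
begin
  Lines := TStringList.Create;
  try
    // The file name is taken from the first command-line argument.
    LoadUtf8Lines(ParamStr(1), Lines);
    Writeln(Lines.Count, ' line(s) read');
  finally
    Lines.Free;
  end;
end.
```

In the form application above, the same helper could be called as LoadUtf8Lines(aFilename, Memo1.Lines), which keeps the per-line loop that the legacy parsing routines expect.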
Readln, and indeed Writeln, don't properly support Unicode encodings. This question is related – David Heffernan