2
votes

I am parsing very large files (Unicode - Delphi 2009), and I have a very efficient routine for doing so using PChar variables, as outlined in the Stack Overflow question "What is the fastest way to Parse a line in Delphi?"

Everything was working great until I ran into a file that had some embedded hex:00 characters in it. This character signals the end of a PChar string and my parsing stops at that point.

However, when you load the file, as in:

FileStream := TFileStream.Create(Filename, fmOpenRead or fmShareDenyWrite);
Size := FileStream.Size;

then you find that Size is much larger than the amount of text the parser processed. If you open the file with Notepad, it loads to the end of the file rather than stopping at the first hex:00 as the PChar parsing does.

How can I read to the end of the file while still using PChar parsing without slowing down my reading/parsing too much?

3
It's difficult to answer without seeing the actual code that uses the PChars. It depends on how you handle them: string functions will always stop at the first zero character, as that is the end of the string by definition. On the other hand, PChars are only typed pointers to memory. You could handle them as normal pointers and not stop at the first zero, storing the length elsewhere instead. - Chris
@Chris - the code is very similar to the accepted answer in the other Stack Overflow question I refer to above. Specifically, I've got lines like: while (cp^ > #0) and (cp^ <= #32) do - lkessler
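
The point made in the comments, that null-terminated routines stop at the first #0 while a Delphi string keeps its own length, can be seen in a small sketch. The helper name LengthSeenByPChar is hypothetical and exists only for illustration; StrLen comes from SysUtils.

```pascal
// Requires SysUtils in the uses clause (for StrLen).

// Hypothetical helper: the length a null-terminated routine would report.
function LengthSeenByPChar(const S: string): Integer;
begin
  // StrLen counts characters up to, but not including, the first #0.
  Result := StrLen(PChar(S));
end;
```

For S := 'ab'#0'cd', Length(S) is 5 because the string stores its length separately, while LengthSeenByPChar(S) is 2 because the scan ends at the embedded #0.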

3 Answers

5
votes

The code in the accepted answer to your other question stops as soon as it reaches a #0 character. To fix that, you just need to save the end of the input and check against that instead. The updated code would look something like this:

type
  TLexer = class
  private
    FData: string;
    FTokenStart: PChar;
    FCurrPos: PChar;
    FEndPos: PChar;                                         // << New
    function GetCurrentToken: string;
  public
    constructor Create(const AData: string);
    function GetNextToken: Boolean;
    property CurrentToken: string read GetCurrentToken;
  end;

{ TLexer }

constructor TLexer.Create(const AData: string);
begin
  FData := AData;
  FCurrPos := PChar(FData);
  FEndPos := FCurrPos + Length(AData);                      // << New
end;

function TLexer.GetCurrentToken: string;
begin
  SetString(Result, FTokenStart, FCurrPos - FTokenStart);
end;

function TLexer.GetNextToken: Boolean;
var
  cp: PChar;
begin
  cp := FCurrPos; // copy to local to permit register allocation

  // skip whitespace
  while (cp <> FEndPos) and (cp^ <= #32) do                 // << Changed
    Inc(cp);

  // terminate at end of input
  Result := cp <> FEndPos;                                  // << Changed

  if Result then
  begin
    FTokenStart := cp;
    Inc(cp);
    while (cp <> FEndPos) and (cp^ > #32) do                // << Changed
      Inc(cp);
  end;

  FCurrPos := cp;
end;
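
As a quick way to check the end-pointer technique in isolation, here is a minimal sketch that only counts tokens. CountTokens is a hypothetical name and not part of the class above; it uses the same rule that anything at or below #32, including an embedded #0, is treated as a separator.

```pascal
// Hypothetical helper: counts whitespace-separated tokens, bounded by an
// end pointer instead of a null terminator.
function CountTokens(const AData: string): Integer;
var
  cp, EndPos: PChar;
begin
  Result := 0;
  cp := PChar(AData);              // valid even for an empty string
  EndPos := cp + Length(AData);    // one past the last character

  while cp <> EndPos do
  begin
    // skip separators, which here include embedded #0 characters
    while (cp <> EndPos) and (cp^ <= #32) do
      Inc(cp);
    if cp = EndPos then
      Break;

    // a token starts here; advance to its end
    Inc(Result);
    while (cp <> EndPos) and (cp^ > #32) do
      Inc(cp);
  end;
end;
```

With this rule, CountTokens('one'#0'two three') returns 3, whereas a loop that treats #0 as the end of input would see only the first token.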

2
votes

If you reach a #0 character, but you haven't consumed all the characters in the file, then keep going. How you keep going depends on how you were deciding to stop in the first place.

The question you referenced has this code:

while (cp^ > #0) and (cp^ <= #32) do
  Inc(cp);

// using null terminator for end of file
Result := cp^ <> #0;

That will obviously stop at a null character. If you don't want a null character to denote the end of the file, then don't stop at null characters. Stop after consuming all the characters instead. You'll have to know how many characters to expect, and keep track of how many characters you've seen.

nChars := Length(FData);
nCharsSeen := 0;
while (nCharsSeen < nChars) and (cp^ <= #32) do begin
  Inc(cp);
  Inc(nCharsSeen);
end;

// using character count for end of file
Result := nCharsSeen < nChars;

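The referenced answer was parsing a string, so I've used Length to learn the number of characters. If you're parsing a file, then use something like TFileStream.Size instead; note that Size is measured in bytes, so divide it by SizeOf(Char) if you are counting 2-byte Unicode characters.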
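
One way to get a known character count for a file is to read the whole file into a string first, so that Length covers any embedded #0 characters. The sketch below makes two loud assumptions: the file is UTF-16 little-endian (matching Delphi 2009's Char), and no byte-order mark is handled. LoadUtf16File is a hypothetical name.

```pascal
// Requires Classes (TFileStream) and SysUtils (fmOpenRead, fmShareDenyWrite).

// Hypothetical helper. ASSUMPTION: the file contains raw UTF-16LE text.
// A leading BOM, if present, is kept as the first character.
function LoadUtf16File(const AFileName: string): string;
var
  Stream: TFileStream;
  CharCount: Integer;
begin
  Stream := TFileStream.Create(AFileName, fmOpenRead or fmShareDenyWrite);
  try
    // Size is in bytes; a trailing odd byte, if any, is ignored.
    CharCount := Integer(Stream.Size div SizeOf(Char));
    SetLength(Result, CharCount);
    if CharCount > 0 then
      Stream.ReadBuffer(Result[1], CharCount * SizeOf(Char));
  finally
    Stream.Free;
  end;
end;
```

Length of the returned string can then serve as nChars in the loop above, regardless of how many #0 characters the file contains.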

1
votes

I took the code from the accepted answer to your earlier question and modified it slightly by adding two additional variables:

FPosInt: NativeUInt;
FSize: NativeUInt;

FSize is initialized with the string length in the constructor (a string variable has its length stored, while a PChar does not). FPosInt is the zero-based index of the current character in the input. The additional code in the constructor:

FSize := Length(FData);
FPosInt := 0;

The relevant part in the GetNextToken function then does not stop at the first zero byte anymore but continues until the last character of the string is reached:

// skip whitespace; this test could be converted to an unsigned int
// subtraction and compare for only a single branch
while (cp^ <= #32) and (FPosInt < FSize) do
  begin
  Inc(cp);
  Inc(FPosInt);
  end;

// end of file is reached if the position counter has reached the filesize
Result := FPosInt < FSize;

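I switched the two conditions in the while statement, as they are evaluated from left to right and the first one is the one that will evaluate to false more often. Reading cp^ before the bounds check is safe here because a Delphi string always carries a terminating #0 after its last character; with a raw buffer that has no terminator, the bounds check would have to come first.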


An alternative approach does not count the number of characters but saves the start position of the pointer. In the constructor:

FSize := Length(FData) * SizeOf(Char); // in bytes, to match the byte offsets computed from the pointers
FStartPos := NativeUInt(FCurrPos);

And in GetNextToken:

// skip whitespace; this test could be converted to an unsigned int
// subtraction and compare for only a single branch
while (cp^ <= #32) and ((NativeUInt(cp) - FStartPos) < FSize) do
  Inc(cp);

// end of file is reached if the position counter has reached the filesize
Result := (NativeUInt(cp) - FStartPos) < FSize;
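
The comparison above works with pointer values cast to integers, which gives an offset in bytes, so the length it is compared against must also be in bytes. In Delphi 2009 a Char occupies two bytes, so byte offsets and character offsets differ. A small sketch of the distinction, with hypothetical helper names:

```pascal
// Hypothetical helper: offset in characters. Subtracting two PChar values
// yields a count of Char elements, not bytes.
function CharOffset(StartPtr, cp: PChar): Integer;
begin
  Result := cp - StartPtr;
end;

// Hypothetical helper: offset in bytes, as in the integer-cast comparison.
function ByteOffset(StartPtr, cp: PChar): NativeUInt;
begin
  Result := NativeUInt(cp) - NativeUInt(StartPtr);
end;
```

For a pointer advanced three characters past the start, CharOffset returns 3 and ByteOffset returns 3 * SizeOf(Char). Comparing CharOffset against Length(FData) directly is an alternative that avoids the casts.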