4
votes

I am using Embarcadero's RAD Studio Delphi (10.2.3) and have encountered a memory issue while reading very large text files (7 million+ lines; every line is different, and lines range from 1 to ~200 characters). I am fairly new to Delphi programming, so I scoured SO and Google for help before posting.

I originally used a TStringList and read the file with its LoadFromFile method, but this failed spectacularly once the processed text files became large enough. I then switched to a TStreamReader and used ReadLine to populate the TStringList, following the basic code found here:

TStringList.LoadFromFile - Exceptions with Large Text Files

Code Example:

//MyStringList.LoadFromFile(filename);
Reader := TStreamReader.Create(filename, true);
try
  MyStringList.BeginUpdate;
  try
    MyStringList.Clear;
    while not Reader.EndOfStream do
      MyStringList.Add(Reader.ReadLine);
  finally
    MyStringList.EndUpdate;
  end;
finally
  Reader.Free;
end;

This worked great until the files I needed to process became huge (~7 million lines or more). It appears that the TStringList grows so large that the process runs out of memory. I say "appears" because I don't actually have access to the file being run; all error information comes from my customer by email, which makes this problem even more difficult, as I can't simply debug it in the IDE.

The code is compiled as 32-bit and I am unable to use the 64-bit compiler. I can't include a database system or the like, either; unfortunately, I am under some tight restrictions. I need to load every line to look for patterns and to compare lines against other lines to find "patterns within patterns." I apologize for being vague here.

The bottom line is this: is there a way to access every line of the text file without using a TStringList, or perhaps a better way to manage the TStringList's memory?

Maybe there is a way to load a specific block of lines from the StreamReader into the TStringList (e.g., read and process the first 100,000 lines, then the next 100,000, and so on) instead of everything at once? I think I could then write something to handle the possible "inter-block" patterns.

Thanks in advance for any and all help and suggestions!

***** EDITED WITH UPDATE *****

Ok, here is the basic solution that I need to implement:

var
  filename: string;
  sr: TStreamReader;
  sl: TStringList;
  total, blocksize: Integer;
begin
  filename := 'thefilenamegoeshere';
  total := 0;         // Total number of lines read from the file
  blocksize := 10000; // The number of lines per "block"
  sl := TStringList.Create;
  try
    // Pre-allocate one block's worth of entries, not an estimate for the
    // whole file -- only one block is ever held in memory at a time.
    sl.Capacity := blocksize;
    sr := TStreamReader.Create(filename, True);
    try
      while not sr.EndOfStream do
      begin
        sl.Clear;
        while sl.Count < blocksize do
        begin
          sl.Add(sr.ReadLine);
          Inc(total);
          if sr.EndOfStream then
            Break;
        end;
        // Handle the current block of lines here
      end;
    finally
      sr.Free;
    end;
  finally
    sl.Free;
  end;
end;

I have some test code that I will use to refine my routines, but this seems to be relatively fast, efficient, and sufficient. I want to thank everyone for their responses that got my gray matter firing!

1
Perhaps try IMAGE_FILE_LARGE_ADDRESS_AWARE? docwiki.embarcadero.com/RADStudio/Tokyo/en/… - Ville Krumlinde
If you don't want to make use of the 64-bit address space then you'll need to redesign your code to avoid having to load the entire file into memory. It's that simple. Exactly how you do that will depend on the details that you have but we do not. But it's hard to see past the fact that if the data can't fit in memory at once, then you need to avoid trying to fit it into memory at once. - David Heffernan
@VilleKrumlinde - Thanks for the suggestion. I have set the LARGE_ADDRESS flag in my software, though I don't know if the customer has done everything on their end. They are running Windows 10 Pro 64-bit, though even that is through an emulator on a Linux box sometimes (depending on the particular user). And the users don't have Admin privileges, thus making all this even more complicated. - Ric Crooks
You just read the first N lines, deal with them. Then read the next N lines, deal with them and so on. You can use a variable to count how many lines you have read. - David Heffernan
@RicCrooks - As a small point, your solution screams for an anonymous method to be inserted where you wrote // Handle the current block of lines here. Keep your methods cohesive: one method should deal with breaking a large file into blocks according to some scheme, the other should deal with processing the block. It will result in more maintainable code, and also make it easier to swap in a new file-processing methodology in the future. - Dave Novo
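Dave Novo's refactoring could be sketched roughly like this (a sketch only: ProcessFileInBlocks and its parameter names are hypothetical; TProc<TStringList> is declared in System.SysUtils):

uses
  System.Classes, System.SysUtils;

procedure ProcessFileInBlocks(const AFileName: string; ABlockSize: Integer;
  AHandleBlock: TProc<TStringList>);
var
  sr: TStreamReader;
  sl: TStringList;
begin
  sl := TStringList.Create;
  try
    sl.Capacity := ABlockSize;
    sr := TStreamReader.Create(AFileName, True);
    try
      while not sr.EndOfStream do
      begin
        sl.Clear;
        while sl.Count < ABlockSize do
        begin
          sl.Add(sr.ReadLine);
          if sr.EndOfStream then
            Break;
        end;
        AHandleBlock(sl); // all block processing lives in the callback
      end;
    finally
      sr.Free;
    end;
  finally
    sl.Free;
  end;
end;

The caller then supplies the processing step as an anonymous method, e.g.:

ProcessFileInBlocks('thefilenamegoeshere', 10000,
  procedure(Block: TStringList)
  begin
    // look for patterns in Block here
  end);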

1 Answer

0
votes

As a (very) quick fix, you can try TALStringList (just replace TStringList with TALStringList in your code) from https://github.com/Zeus64/alcinoe. It's not a very clean way to go, but TALStringList keeps its data in UTF-8, halving the memory used by Delphi's default UTF-16 strings. As you have 7,000,000 lines of around 100 chars each, that means around 700 MB, which can work in a 32-bit process.
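The arithmetic behind that estimate (ignoring per-string heap overhead, which in practice adds a fair amount on top):

7,000,000 lines x ~100 chars x 2 bytes/char (UTF-16) = ~1.4 GB
7,000,000 lines x ~100 chars x 1 byte/char  (UTF-8)  = ~0.7 GB

A 32-bit process with IMAGE_FILE_LARGE_ADDRESS_AWARE set gets up to 4 GB of address space on 64-bit Windows (2 GB without it), so ~0.7 GB of string data is plausible where ~1.4 GB plus overhead is not.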