10
votes

I'm reading in a large text file with 1.4 million lines that is 24 MB in size (average 17 characters a line).

I'm using Delphi 2009 and the file is ANSI but gets converted to Unicode upon reading, so you could fairly say the text, once converted, is 48 MB in size.

( Edit: I found a much simpler example ... )

I'm loading this text into a simple StringList:

  AllLines := TStringList.Create;
  AllLines.LoadFromFile(Filename);

I found that the lines of data seem to take much more memory than their 48 MB.

In fact, they use 155 MB of memory.

I don't mind Delphi using 48 MB or even as much as 60 MB allowing for some memory management overhead. But 155 MB seems excessive.

This is not a fault of StringList. I previously tried loading the lines into a record structure, and I got the same result (160 MB).

I don't see or understand what could be causing Delphi or the FastMM memory manager to use 3 times the amount of memory necessary to store the strings. Heap allocation can't be that inefficient, can it?

I've debugged this and researched it as far as I can. Any ideas as to why this might be happening, or ideas that might help me reduce the excess usage would be much appreciated.

Note: I am using this "smaller" file as an example. I am really trying to load a 320 MB file, but Delphi is asking for over 2 GB of RAM and running out of memory because of this excess string requirement.

Addendum: Marco Cantu just came out with a White Paper on Delphi and Unicode. Delphi 2009 has increased the overhead per string from 8 bytes to 12 bytes (plus maybe 4 more for the actual pointer to the string). An extra 16 bytes per 17x2 = 34-byte line adds almost 50%. But I'm seeing over 200% overhead. What could the extra 150% be?
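For concreteness, here is a back-of-the-envelope estimate of the per-line cost, assuming the 12-byte Delphi 2009 string header mentioned above, a 2-byte UTF-16 null terminator, a 4-byte pointer in the list's array, and a hypothetical 16-byte heap granularity (the granularity figure is an assumption, not a measured value):

  function EstimatedBytesPerLine(CharCount: Integer): Integer;
  const
    StringHeader   = 12;  // codepage, element size, refcount, length (Delphi 2009)
    NullTerminator = 2;   // trailing #0 in UTF-16
    ListPointer    = 4;   // slot in the TStringList array
    Granularity    = 16;  // assumed heap rounding; the real value may differ
  begin
    // Round the string's heap block up, then add the list's pointer to it.
    Result := ((CharCount * 2 + StringHeader + NullTerminator + Granularity - 1)
               div Granularity) * Granularity + ListPointer;
  end;

For a 17-character line this gives 48 + 4 = 52 bytes, roughly 1.5x the 34-byte payload — so headers and rounding explain part of the overhead, but not all of what's being observed.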


Success!! Thanks to all of you for your suggestions. You all got me thinking. But I'll have to give Jan Goyvaerts credit for the answer, since he asked:

...why are you using TStringList? Must the file really be stored in memory as separate lines?

That led me to the solution that instead of loading the 24 MB file as a 1.4 million line StringList, I can group my lines into natural groups my program knows about. So this resulted in 127,000 lines loaded into the string list.

Now each line averages 190 characters instead of 17. The overhead per StringList line is the same, but now there are far fewer lines.
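A minimal sketch of that grouping step, assuming a hypothetical IsGroupStart predicate that recognizes the first line of each natural group (the names and the #1 separator are illustrative, not from the original code):

  Groups := TStringList.Create;
  Reader := TStreamReader.Create(Filename);  // Delphi 2009's buffered text reader
  try
    Current := '';
    while not Reader.EndOfStream do
    begin
      Line := Reader.ReadLine;
      if IsGroupStart(Line) and (Current <> '') then
      begin
        Groups.Add(Current);  // flush the finished group as one long string
        Current := '';
      end;
      if Current <> '' then
        Current := Current + #1;  // #1 as an assumed in-group separator
      Current := Current + Line;
    end;
    if Current <> '' then
      Groups.Add(Current);      // don't forget the last group
  finally
    Reader.Free;
  end;

Reading line by line with TStreamReader avoids ever holding the full 1.4-million-line list in memory during the grouping pass.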

When I apply this to the 320 MB file, it no longer runs out of memory and now loads in less than 1 GB of RAM. (And it only takes about 10 seconds to load, which is pretty good!)

There will be a little bit extra processing to parse the grouped lines, but it shouldn't be noticeable in real time processing of each group.

(In case you were wondering, this is a genealogy program, and this may be the last step I needed to allow it to load all the data about one million people in a 32-bit address space in less than 30 seconds. So I've still got a 20 second buffer to play with to add the indexes into the data that will be required to allow display and editing of the data.)

How do you measure the memory it takes? I hope not with the Mem Usage column from Task Manager. That is not measuring what you might think it is. – Lars Truijens
For memory measurement, I use GlobalMemoryStatusEx. See: msdn.microsoft.com/en-us/library/aa366589(VS.85).aspx – lkessler
You should check how much memory is actually used in Delphi. The Delphi MM will suballocate the larger blocks it obtains from the OS, and release them to the OS only when possible (fragmentation and the like can prevent it), so what Windows sees and what Delphi does can be different. If you use the full FastMM library available from SourceForge, it has facilities to query the MM allocation, giving you a deeper look at what's going on. Otherwise you could use a memory profiler (e.g. AQTime) to check it and see what allocated memory, when, and why. – Mad Hatter
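For reference, a sketch of the GlobalMemoryStatusEx measurement mentioned in the comments above (the helper name is mine; note this reports address-space figures for the whole process, not the heap-internal view that FastMM's own facilities can give):

  uses Windows;

  function UsedVirtualMB: Cardinal;
  var
    Status: TMemoryStatusEx;
  begin
    Status.dwLength := SizeOf(Status);  // required before the call
    GlobalMemoryStatusEx(Status);
    // Total minus available virtual address space = what the process has mapped.
    Result := Cardinal((Status.ullTotalVirtual - Status.ullAvailVirtual)
                       div (1024 * 1024));
  end;

Calling this before and after LoadFromFile and taking the difference gives the figure the question is based on.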

8 Answers

10
votes

You asked me personally to answer your question here. I don't know the precise reason why you're seeing such high memory usage, but you need to remember that TStringList does a lot more than just load your file: it needs to read the file into memory, convert it from Ansi to Unicode, split it into one string for each line, and stuff those lines into an array that will be reallocated many times. Each of those steps requires memory, and each can contribute to memory fragmentation.

My question to you is why are you using TStringList? Must the file really be stored in memory as separate lines? Are you going to modify the file in-memory, or just display parts of it? Keeping the file in memory as one big chunk and scanning the whole thing with regular expressions that match the parts you want will be more memory efficient than storing separate lines.

Also, must the whole file be converted to Unicode? While your application is Unicode, your file is Ansi. My general recommendation is to convert Ansi input to Unicode as soon as possible, because doing so saves CPU cycles. But when you have 320 MB of Ansi data that will stay Ansi data, memory consumption will be the bottleneck. Try keeping the file as Ansi in memory, and only convert the parts you'll actually be displaying to the user.

If the 320 MB file isn't a data file you're extracting certain information from, but a data set you want to modify, consider converting it into a relational database, and let the database engine worry how to manage the huge set of data with limited RAM.

8
votes

What if you made your original record use AnsiString? That would chop it in half immediately. Just because Delphi defaults to UnicodeString doesn't mean you have to use it.
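The change is just the field type, sketched here with hypothetical names:

  type
    TLineRec = record
      Text: AnsiString;  // 1 byte per character instead of UnicodeString's 2
      // ... other per-line fields unchanged ...
    end;

Assigning a UnicodeString to an AnsiString field converts it (with a compiler warning about possible data loss for characters outside the Ansi codepage), so the rest of the code can stay largely as-is.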

Additionally, if you know the exact length of each string (within a character or two), it might even be better to use short strings and shave off a few more bytes.

I am curious whether there might be a better way to accomplish what you are trying to do. Loading 320 MB of text into memory might not be the best solution, even if you can get it down to requiring only 320 MB.

6
votes

I'm using Delphi 2009 and the file is ANSI but gets converted to Unicode upon reading, so you could fairly say the text, once converted, is 48 MB in size.

Sorry, but I don't understand this at all. If you have a need for your program to be Unicode, surely the file being "ANSI" (it must have some character set, like WIN1252 or ISO8859_1) isn't the right thing. I'd first convert it to be UTF8. If the file does not contain any chars >= 128 it won't change a thing (it will even be the same size), but you are prepared for the future.

Now you can load it into UTF8 strings, which will not double your memory consumption. On-the-fly-conversion of the few strings that can be visible on the screen at the same time to the Delphi Unicode string will be slower, but given the smaller memory footprint your program will perform much better on systems with little (free) memory.

Now if your program still consumes too much memory with TStringList, you can always use TStrings or even IStrings in your program, and write a class that implements IStrings or inherits from TStrings and does not keep all the lines in memory. Some ideas that come to mind:

  1. Read the file into a TMemoryStream, and maintain an array of pointers to the first characters of the lines. Returning a string is then easy: you only need to return the proper substring between the start of the line and the start of the next one, with the CR and LF stripped.

  2. If this still consumes too much memory, replace the TMemoryStream with a TFileStream, and do not maintain an array of char pointers, but an array of file offsets for the line starts.

  3. You could also use the Windows API functions for memory-mapped files. That allows you to work with memory addresses instead of file offsets, but does not consume as much memory as the first idea.
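A minimal sketch of idea 1, assuming the file is plain Ansi text with LF or CRLF line endings (the class and method names are illustrative; a real version would preallocate the offset array instead of growing it one element at a time):

  type
    TBigFileLines = class
    private
      FData: TMemoryStream;       // the whole file, kept as Ansi bytes
      FStarts: array of Integer;  // byte offset of each line start
    public
      constructor Create(const Filename: string);
      destructor Destroy; override;
      function Count: Integer;
      function Line(Index: Integer): string;  // Ansi -> Unicode on demand
    end;

  constructor TBigFileLines.Create(const Filename: string);
  var
    P: PAnsiChar;
    i: Integer;
  begin
    FData := TMemoryStream.Create;
    FData.LoadFromFile(Filename);
    P := FData.Memory;
    if FData.Size > 0 then
    begin
      SetLength(FStarts, 1);
      FStarts[0] := 0;
    end;
    for i := 0 to FData.Size - 2 do
      if P[i] = #10 then
      begin
        SetLength(FStarts, Length(FStarts) + 1);  // grow-by-one: fine for a sketch
        FStarts[High(FStarts)] := i + 1;
      end;
  end;

  destructor TBigFileLines.Destroy;
  begin
    FData.Free;
    inherited;
  end;

  function TBigFileLines.Count: Integer;
  begin
    Result := Length(FStarts);
  end;

  function TBigFileLines.Line(Index: Integer): string;
  var
    StartPos, EndPos: Integer;
    P: PAnsiChar;
    S: AnsiString;
  begin
    P := FData.Memory;
    StartPos := FStarts[Index];
    if Index < High(FStarts) then
      EndPos := FStarts[Index + 1]
    else
      EndPos := FData.Size;
    // Strip the trailing CR/LF from the slice.
    while (EndPos > StartPos) and (P[EndPos - 1] in [#13, #10]) do
      Dec(EndPos);
    SetString(S, P + StartPos, EndPos - StartPos);
    Result := string(S);  // only this one line is converted to Unicode
  end;

The memory cost is the file size plus 4 bytes per line for the offsets — no per-line string headers at all until a line is actually requested.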

4
votes

By default, Delphi 2009's TStringList reads a file as ANSI, unless there is a Byte Order Mark to identify the file as something else, or if you provide an encoding as the optional second parameter of LoadFromFile.
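For example, to force a particular read encoding instead of relying on BOM detection:

  AllLines.LoadFromFile(Filename, TEncoding.Default);  // system Ansi codepage
  // or TEncoding.UTF8, if the file is actually UTF-8 without a BOM

Either way the strings end up as Unicode in memory; the parameter only controls how the bytes on disk are interpreted.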

So if you are seeing that the TStringList is taking up more memory than you think, then something else is going on.

3
votes

Are you by any chance compiling the program with FastMM sources from sourceforge and with FullDebugMode defined? In that case, FastMM is not really releasing unused memory blocks, which would explain the problem.

1
votes

Are you relying on Windows to tell you how much memory the program is using? It's notorious for overstating the memory used by a Delphi app.

I do see plenty of extra memory use in your code, though.

Your record structure is 20 bytes--if there is one such record per line you're looking at more data for the records than for the text.

Furthermore, a string has an inherent 4 byte overhead--another 25%.

I believe there is a certain amount of allocation granularity in Delphi's heap handling but I don't recall what it is at present. Even at 8 bytes (two pointers for a linked list of free blocks) you're looking at another 25%.

Note that we are already up to over a 150% increase.

1
votes

Part of it could be the block allocation algorithm. As your list grows, it increases the amount of memory allocated with each chunk. I haven't looked at it in a long time, but I believe it goes something like doubling the last allocation each time it runs out of room. When you start to deal with lists that large, your allocations are also much larger than you ultimately need.

EDIT - As lkessler pointed out, this increase is actually only 25%, but it should still be considered part of the problem. If you're just beyond the tipping point, there could be an enormous block of memory allocated to the list that isn't being used.
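One way to sidestep that tipping point, if the final line count is roughly known, is to presize the list. Note that LoadFromFile clears the list first (which resets Capacity), so presizing only helps when you add the lines yourself, e.g. with Delphi 2009's TStreamReader:

  AllLines := TStringList.Create;
  AllLines.Capacity := 1400000;  // approximate expected line count
  Reader := TStreamReader.Create(Filename);
  try
    while not Reader.EndOfStream do
      AllLines.Add(Reader.ReadLine);
  finally
    Reader.Free;
  end;

With the capacity set up front, the internal array is allocated once instead of being reallocated and copied repeatedly as the list grows.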

0
votes

Why are you loading that amount of data into a TStringList? The list itself will have some overhead. Maybe TTextReader could help you.