4
votes

To summarize, I have an application that takes a set of input files, produces a tree from the data, and then writes it out to a text file as XML.

Currently, the entire tree is stored in memory before it is written out, because during parsing we need to reference arbitrary nodes in the tree to fetch or update their values.

The problem we are facing occurs when the tree becomes too big to store entirely in memory. The tree itself is very flat, with only 4-6 levels of depth. It looks something like this:

Root
   Group
      Record
         Data
         Data
      Record
         Data
         Data
         Data
         ...
      ...
   Group
      Record
      ...

There will always be one Root node, and each node only has one type of child. However, there is no particular order in which nodes are added: depending on how the data is formatted, you might add records to different groups, and you might add data to different records (as opposed to fully building one record for one group before moving on to another).

My first suggestion was to just throw more memory at the machine. We're running the tool on a 64-bit Windows machine, so if we're running out of memory, then we just need to get more memory. But that suggestion wasn't taken.

The next idea I had was to write out nodes whenever the tree was taking up too much space in memory, but because data can be added to a particular record at any time, it becomes difficult to determine whether we are actually done with a record, especially if we later need to refer back to a record that has already been written out.

There are several other options, such as optimizing the way the tree is designed (since each node takes up a fairly large amount of memory), but for this question I would like to know techniques for building and exporting large trees.

4
If it's too big to fit in memory, why are you outputting it to one single XML file? It's not inherently a bad approach, but it does seem like the wrong tool for the job. – Robin Green

4 Answers

3
votes

There are two ways to approach this problem, in my opinion.

  • Learn more about how the data is actually used, identify specific access patterns, and come up with the most efficient design for that particular use case.

  • Treat the data as a black box, assuming that any pattern of accessing or changing the data is equally likely.

You're not giving us much to go on for the first approach, so we have to assume the latter. The concept of a cache comes to mind as a solution. There are different types of cache, but the basic concept is that you keep as much as possible in memory, and once you exceed a certain limit, you persist and remove from memory the part that was used least often or least recently.

When you do that, you may choose to keep the actual tree structure in memory and purge only node contents, or to purge both the node contents and the tree structure itself. If you have a large but bounded number of nodes, it's probably better to keep the tree structure and make the "purged" nodes as lightweight as possible. However, if the number of nodes in the tree is virtually unlimited, then you might consider purging entire subtrees.

The latter approach can be very efficient for use cases where tree access typically happens subtree by subtree rather than completely at random.
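As a minimal sketch of that eviction idea, assuming node contents can be persisted independently of the tree structure: Java's LinkedHashMap in access order gives least-recently-used eviction almost for free. NodeData, persist(), and the capacity are hypothetical placeholders.

    import java.util.LinkedHashMap;
    import java.util.Map;

    class NodeData { /* node contents; placeholder */ }

    // LRU cache of node contents: when the cache grows past its capacity,
    // the least recently used node is persisted to disk and evicted.
    class NodeCache extends LinkedHashMap<Long, NodeData> {
        private final int capacity;

        NodeCache(int capacity) {
            super(16, 0.75f, true);   // true = access order, i.e. LRU
            this.capacity = capacity;
        }

        @Override
        protected boolean removeEldestEntry(Map.Entry<Long, NodeData> eldest) {
            if (size() > capacity) {
                persist(eldest.getKey(), eldest.getValue());   // spill to disk first
                return true;                                   // then evict from memory
            }
            return false;
        }

        private void persist(long id, NodeData data) {
            // write the node's contents to disk, keyed by id (not shown)
        }
    }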

If you provide more information about the data and patterns of use, we might be able to come up with a more detailed suggestion.

3
votes

1) The first option that comes to my mind is storing all (parent, child) pairs in a database, and then exploring it recursively to build the XML from it.
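A sketch of that option, assuming an SQLite database reached through the sqlite-jdbc driver; the nodes table, its columns, and root id 0 are all hypothetical. The XML is streamed straight to the output file rather than built in memory.

    import java.io.BufferedWriter;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.sql.*;
    import java.util.ArrayList;
    import java.util.List;

    public class DbTreeExport {

        // Emit one node and, recursively, all of its children.
        static void emit(Connection db, long id, String tag, BufferedWriter out)
                throws SQLException, IOException {
            out.write("<" + tag + ">");
            // Collect the children first so no ResultSet stays open across recursion.
            List<Long> ids = new ArrayList<>();
            List<String> tags = new ArrayList<>();
            try (PreparedStatement ps = db.prepareStatement(
                    "SELECT id, tag FROM nodes WHERE parent = ? ORDER BY id")) {
                ps.setLong(1, id);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        ids.add(rs.getLong("id"));
                        tags.add(rs.getString("tag"));
                    }
                }
            }
            for (int i = 0; i < ids.size(); i++) {
                emit(db, ids.get(i), tags.get(i), out);
            }
            out.write("</" + tag + ">");
        }

        public static void main(String[] args) throws Exception {
            try (Connection db = DriverManager.getConnection("jdbc:sqlite:tree.db");
                 BufferedWriter out = Files.newBufferedWriter(Paths.get("out.xml"))) {
                emit(db, 0, "Root", out);   // assume node id 0 is the root
            }
        }
    }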

2) Another option is doing it bottom-up, by scanning the complete input three times (once for each layer, starting with records as parents and ending with the root). Each layer is stored on disk as a set of XML files, one for each node. Then, when building the next level up in the tree, the children's files can simply be appended to their corresponding parent files (because they are guaranteed to be fully populated). This requires maintaining two in-memory indexes, one for the current level and one for the level beneath it; these indexes point to the files.
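A sketch of the append step in option 2, assuming every child file is already fully populated. The childToParent map stands in for the two in-memory indexes described above; all paths are hypothetical.

    import java.io.IOException;
    import java.io.OutputStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;
    import java.util.Map;

    public class LayerMerge {
        // Concatenate each complete child file onto the end of its parent's file.
        static void appendChildren(Map<Path, Path> childToParent) throws IOException {
            for (Map.Entry<Path, Path> e : childToParent.entrySet()) {
                try (OutputStream parent = Files.newOutputStream(e.getValue(),
                        StandardOpenOption.CREATE, StandardOpenOption.APPEND)) {
                    Files.copy(e.getKey(), parent);   // stream the child file onto its parent
                }
            }
        }
    }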

3
votes
  1. Create a working directory on disk.

  2. Read through the data. When you encounter a group you haven't seen before, create a subdirectory for it in the working directory. When you see a record you haven't seen before, create a file for it in the appropriate group directory. When you see some data, append it to the appropriate file. Carry on until you've finished reading through the data.

  3. Walk the file tree, and concatenate the contents of the record files to the output file, with the appropriate tags for the root and the groups.

  4. Delete the working directory, and everything in it.
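A minimal sketch of steps 2 and 3, assuming each datum arrives as a (group, record, data) triple; the paths and tag names are hypothetical, and real code would also escape the data for XML.

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;

    public class SpoolToDisk {
        static final Path WORK = Paths.get("work");

        // Step 2: append one piece of data to its record's file, creating the
        // group directory and record file the first time they are seen.
        static void add(String group, String record, String data) throws IOException {
            Path dir = WORK.resolve(group);
            Files.createDirectories(dir);               // no-op if it already exists
            Files.writeString(dir.resolve(record + ".part"),
                    "<Data>" + data + "</Data>\n",
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        }

        // Step 3: walk the working directory and concatenate the record files
        // into the output, wrapping them in the Group and Root tags.
        static void export(Path out) throws IOException {
            try (var xml = Files.newBufferedWriter(out);
                 var groups = Files.list(WORK)) {
                xml.write("<Root>\n");
                for (Path group : (Iterable<Path>) groups::iterator) {
                    xml.write("<Group>\n");
                    try (var records = Files.list(group)) {
                        for (Path record : (Iterable<Path>) records::iterator) {
                            xml.write("<Record>\n");
                            xml.write(Files.readString(record));
                            xml.write("</Record>\n");
                        }
                    }
                    xml.write("</Group>\n");
                }
                xml.write("</Root>\n");
            }
        }
    }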

If you hang on to the file handles in memory, and reuse them between the writing and reading phases (this means using either a RandomAccessFile or a MappedByteBuffer, both of which you can write, read, and rewind), and you don't flush them at any point, then you leave the problem of disk IO and caching entirely in the hands of the operating system (well, and the runtime libraries and so on). If the OS decides that the best thing for this particular execution of the program is to write some of the data to disk, it will do so. If it can all fit in memory, it will keep it all in memory. It will be able to batch the writes so they are nice and big, and thus efficient. It will be able to prefetch the reads, which are sequential traversals of a set of files, and thus predictable. If your operating system is good, this will be about as efficient a solution as this problem admits. If it's Windows, it might not be.
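For illustration, a tiny sketch of that handle-reuse idea with RandomAccessFile: write without flushing, then rewind the same handle for the read phase (the file name is hypothetical).

    import java.io.IOException;
    import java.io.RandomAccessFile;

    public class RewindDemo {
        public static void main(String[] args) throws IOException {
            try (RandomAccessFile raf = new RandomAccessFile("record-0001.tmp", "rw")) {
                raf.writeBytes("<Data>alpha</Data>\n");   // writing phase
                raf.writeBytes("<Data>beta</Data>\n");

                raf.seek(0);                              // rewind for the reading phase
                String line;
                while ((line = raf.readLine()) != null) {
                    System.out.println(line);             // concatenate into the output
                }
            }
        }
    }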

An extra trick would be to write the data to the files not as XML, but in some more compact intermediate format, converting it to XML only for the final output. That would make more efficient use of cache and bandwidth.
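For example, a hypothetical record file could hold length-prefixed strings written via DataOutputStream, with the XML tags added only during the final export.

    import java.io.*;

    public class CompactSpool {
        public static void main(String[] args) throws IOException {
            // Spool phase: compact, no XML tag overhead yet.
            try (DataOutputStream out = new DataOutputStream(
                    new BufferedOutputStream(new FileOutputStream("record.bin")))) {
                out.writeUTF("alpha");
                out.writeUTF("beta");
            }
            // Export phase: wrap each datum in tags as it is read back.
            try (DataInputStream in = new DataInputStream(
                    new BufferedInputStream(new FileInputStream("record.bin")))) {
                while (in.available() > 0) {
                    System.out.println("<Data>" + in.readUTF() + "</Data>");
                }
            }
        }
    }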

2
votes

If I were you, I would persist the tree while handling it, instead of storing it in memory. By persistence, I mean anything from a flat text file to a relational database to a graph-oriented NoSQL solution.

Then you can create a basic access layer, in the form of a Java bean, for the CRUD operations you need to perform, and there you go.
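The shape of such an access layer might look something like this hypothetical DAO, which hides whatever store (flat file, relational database, NoSQL) sits behind it.

    // Hypothetical access layer for the persisted tree.
    public interface NodeDao {
        long   create(long parentId, String tag);   // returns the new node's id
        String read(long id);                       // fetch a node's value
        void   update(long id, String value);       // update a node in place
        void   delete(long id);
    }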

Another option would be to build the XML file dynamically, thus skipping the building of the tree in memory altogether. For that purpose, you could leverage SAX, which handles XML as a stream of events rather than loading the entire document into memory.
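A minimal sketch of event-based output using a SAX TransformerHandler: each element is written as soon as its events fire, so the document never has to fit in memory as a whole. The element names mirror the tree in the question; out.xml is hypothetical.

    import java.io.File;
    import javax.xml.transform.OutputKeys;
    import javax.xml.transform.sax.SAXTransformerFactory;
    import javax.xml.transform.sax.TransformerHandler;
    import javax.xml.transform.stream.StreamResult;
    import org.xml.sax.helpers.AttributesImpl;

    public class SaxStreamOut {
        public static void main(String[] args) throws Exception {
            SAXTransformerFactory f =
                    (SAXTransformerFactory) SAXTransformerFactory.newInstance();
            TransformerHandler h = f.newTransformerHandler();
            h.getTransformer().setOutputProperty(OutputKeys.INDENT, "yes");
            h.setResult(new StreamResult(new File("out.xml")));

            AttributesImpl none = new AttributesImpl();
            h.startDocument();
            h.startElement("", "Root", "Root", none);
            h.startElement("", "Group", "Group", none);
            h.startElement("", "Record", "Record", none);
            h.startElement("", "Data", "Data", none);
            char[] text = "value".toCharArray();
            h.characters(text, 0, text.length);       // element text
            h.endElement("", "Data", "Data");
            h.endElement("", "Record", "Record");
            h.endElement("", "Group", "Group");
            h.endElement("", "Root", "Root");
            h.endDocument();                          // flushes the output
        }
    }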