4
votes

To summarize, I have an application that takes a set of input files, produces a tree from the data, and then writes it out to a text file as XML.

Currently, the entire tree is stored in memory before it is written out, because during parsing we need to reference arbitrary nodes in the tree to fetch or update their values.

The problem we are facing occurs when the tree becomes too big to store entirely in memory. The tree itself is very flat, with only 4-6 levels of depth. It looks something like this:

Root
   Group
      Record
         Data
         Data
      Record
         Data
         Data
         Data
         ...
      ...
   Group
      Record
      ...

There will always be one Root node, and each node only has one type of child. However, there is no particular order in which nodes are added: depending on how the data is formatted, you might add records to different groups, and you might add data to different records (as opposed to fully building one record for one group before moving on to another).

My first suggestion was to just throw more memory at the machine. We're running the tool on a 64-bit Windows machine, so if we're running out of memory, then we just need to get more memory. But that suggestion wasn't taken.

The next idea I had was to write out nodes whenever the tree was taking up too much space in memory, but because data can be added to a particular record at any time, it becomes difficult to determine whether we are actually done with a record, especially if we later need to refer back to a record that has already been written out.

There are several other options, such as optimizing the way the tree is designed (since each node takes up a fairly large amount of memory), but for this question I would like to know techniques for building and exporting large trees.

4
If it's too big to fit in memory, why are you outputting it to one single XML file? It's not inherently a bad approach, but it does seem like the wrong tool for the job. – Robin Green

4 Answers

3
votes

There are two ways to approach this problem, in my opinion.

  • Learn more about how the data is actually used, identify specific access patterns, and come up with the most efficient design for that particular use case.

  • Treat the data as a black box, assuming that any pattern of accessing or changing the data is equally likely.

You're not giving us much to go on for the first approach, so we have to assume the latter. The concept of a cache comes to mind as a solution. There are different types of cache, but the basic concept is that you keep as much as possible in memory, and once you exceed a certain limit, you persist and remove from memory the part that was used least often or least recently.

When you do that, you may choose to keep the actual tree structure in memory and purge only node contents, or to purge both the node contents and the tree structure itself. If you have a large but bounded number of nodes, it's probably better to keep the tree structure and make the "purged" nodes as lightweight as possible. However, if the number of nodes in the tree is virtually unlimited, then you might consider purging entire subtrees.

The latter approach can be very efficient for use cases where tree access typically happens subtree by subtree rather than completely at random.
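As a minimal sketch of that eviction idea, assuming node contents can be persisted independently of the tree structure: Java's LinkedHashMap in access order gives least-recently-used eviction almost for free. NodeData, persist(), and the capacity are hypothetical placeholders.

    import java.util.LinkedHashMap;
    import java.util.Map;

    class NodeData { /* node contents; placeholder */ }

    // LRU cache of node contents: when the cache grows past its capacity,
    // the least recently used node is persisted to disk and evicted.
    class NodeCache extends LinkedHashMap<Long, NodeData> {
        private final int capacity;

        NodeCache(int capacity) {
            super(16, 0.75f, true);   // true = access order, i.e. LRU
            this.capacity = capacity;
        }

        @Override
        protected boolean removeEldestEntry(Map.Entry<Long, NodeData> eldest) {
            if (size() > capacity) {
                persist(eldest.getKey(), eldest.getValue());   // spill to disk first
                return true;                                   // then evict from memory
            }
            return false;
        }

        private void persist(long id, NodeData data) {
            // write the node's contents to disk, keyed by id (not shown)
        }
    }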

If you provide more information about the data and patterns of use, we might be able to come up with a more detailed suggestion.

3
votes

1) The first option that comes to my mind is storing all (parent, child) pairs in a database, and then exploring it recursively to build the XML from it.
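A sketch of that option, assuming an SQLite database reached through the sqlite-jdbc driver; the nodes table, its columns, and root id 0 are all hypothetical. The XML is streamed straight to the output file rather than built in memory.

    import java.io.BufferedWriter;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.sql.*;
    import java.util.ArrayList;
    import java.util.List;

    public class DbTreeExport {

        // Emit one node and, recursively, all of its children.
        static void emit(Connection db, long id, String tag, BufferedWriter out)
                throws SQLException, IOException {
            out.write("<" + tag + ">");
            // Collect the children first so no ResultSet stays open across recursion.
            List<Long> ids = new ArrayList<>();
            List<String> tags = new ArrayList<>();
            try (PreparedStatement ps = db.prepareStatement(
                    "SELECT id, tag FROM nodes WHERE parent = ? ORDER BY id")) {
                ps.setLong(1, id);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        ids.add(rs.getLong("id"));
                        tags.add(rs.getString("tag"));
                    }
                }
            }
            for (int i = 0; i < ids.size(); i++) {
                emit(db, ids.get(i), tags.get(i), out);
            }
            out.write("</" + tag + ">");
        }

        public static void main(String[] args) throws Exception {
            try (Connection db = DriverManager.getConnection("jdbc:sqlite:tree.db");
                 BufferedWriter out = Files.newBufferedWriter(Paths.get("out.xml"))) {
                emit(db, 0, "Root", out);   // assume node id 0 is the root
            }
        }
    }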

2) Another option is doing it bottom-up, by scanning the complete input three times (once for each layer, starting with records as parents and ending with the root). Each layer is stored on disk as a set of XML files, one for each node. Then, when building the next level up in the tree, the children's files can simply be appended to their corresponding parent files (because they are guaranteed to be fully populated). This requires maintaining two in-memory indexes, one for the current level and one for the level beneath it; these indexes point to the files.
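A sketch of the append step in option 2, assuming every child file is already fully populated. The childToParent map stands in for the two in-memory indexes described above; all paths are hypothetical.

    import java.io.IOException;
    import java.io.OutputStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;
    import java.util.Map;

    public class LayerMerge {
        // Concatenate each complete child file onto the end of its parent's file.
        static void appendChildren(Map<Path, Path> childToParent) throws IOException {
            for (Map.Entry<Path, Path> e : childToParent.entrySet()) {
                try (OutputStream parent = Files.newOutputStream(e.getValue(),
                        StandardOpenOption.CREATE, StandardOpenOption.APPEND)) {
                    Files.copy(e.getKey(), parent);   // stream the child file onto its parent
                }
            }
        }
    }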

3
votes
  1. Create a working directory on disk.

  2. Read through the data. When you encounter a group you haven't seen before, create a subdirectory for it in the working directory. When you see a record you haven't seen before, create a file for it in the appropriate group directory. When you see some data, append it to the appropriate file. Carry on until you've finished reading through the data.

  3. Walk the file tree, and concatenate the contents of the record files to the output file, with the appropriate tags for the root and the groups.

  4. Delete the working directory, and everything in it.
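A minimal sketch of steps 2 and 3, assuming each datum arrives as a (group, record, data) triple; the paths and tag names are hypothetical, and real code would also escape the data for XML.

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;

    public class SpoolToDisk {
        static final Path WORK = Paths.get("work");

        // Step 2: append one piece of data to its record's file, creating the
        // group directory and record file the first time they are seen.
        static void add(String group, String record, String data) throws IOException {
            Path dir = WORK.resolve(group);
            Files.createDirectories(dir);               // no-op if it already exists
            Files.writeString(dir.resolve(record + ".part"),
                    "<Data>" + data + "</Data>\n",
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        }

        // Step 3: walk the working directory and concatenate the record files
        // into the output, wrapping them in the Group and Root tags.
        static void export(Path out) throws IOException {
            try (var xml = Files.newBufferedWriter(out);
                 var groups = Files.list(WORK)) {
                xml.write("<Root>\n");
                for (Path group : (Iterable<Path>) groups::iterator) {
                    xml.write("<Group>\n");
                    try (var records = Files.list(group)) {
                        for (Path record : (Iterable<Path>) records::iterator) {
                            xml.write("<Record>\n");
                            xml.write(Files.readString(record));
                            xml.write("</Record>\n");
                        }
                    }
                    xml.write("</Group>\n");
                }
                xml.write("</Root>\n");
            }
        }
    }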

If you hang on to the file handles in memory, and reuse them between the writing and reading phases (this means using either a RandomAccessFile or a MappedByteBuffer, both of which you can write, read, and rewind), and you don't flush them at any point, then you leave the problem of disk IO and caching entirely in the hands of the operating system (well, and the runtime libraries and so on). If the OS decides that the best thing for this particular execution of the program is to write some of the data to disk, it will do so. If it can all fit in memory, it will keep it all in memory. It will be able to batch the writes so they are nice and big, and thus efficient. It will be able to prefetch the reads, which are sequential traversals of a set of files, and thus predictable. If your operating system is good, this will be about as efficient a solution as this problem admits. If it's Windows, it might not be.
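For illustration, a tiny sketch of that handle-reuse idea with RandomAccessFile: write without flushing, then rewind the same handle for the read phase (the file name is hypothetical).

    import java.io.IOException;
    import java.io.RandomAccessFile;

    public class RewindDemo {
        public static void main(String[] args) throws IOException {
            try (RandomAccessFile raf = new RandomAccessFile("record-0001.tmp", "rw")) {
                raf.writeBytes("<Data>alpha</Data>\n");   // writing phase
                raf.writeBytes("<Data>beta</Data>\n");

                raf.seek(0);                              // rewind for the reading phase
                String line;
                while ((line = raf.readLine()) != null) {
                    System.out.println(line);             // concatenate into the output
                }
            }
        }
    }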

An extra trick would be to write the data to the files not as XML, but in some more compact intermediate format, converting it to XML only for the final output. That would make more efficient use of cache and bandwidth.
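For example, a hypothetical record file could hold length-prefixed strings written via DataOutputStream, with the XML tags added only during the final export.

    import java.io.*;

    public class CompactSpool {
        public static void main(String[] args) throws IOException {
            // Spool phase: compact, no XML tag overhead yet.
            try (DataOutputStream out = new DataOutputStream(
                    new BufferedOutputStream(new FileOutputStream("record.bin")))) {
                out.writeUTF("alpha");
                out.writeUTF("beta");
            }
            // Export phase: wrap each datum in tags as it is read back.
            try (DataInputStream in = new DataInputStream(
                    new BufferedInputStream(new FileInputStream("record.bin")))) {
                while (in.available() > 0) {
                    System.out.println("<Data>" + in.readUTF() + "</Data>");
                }
            }
        }
    }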

2
votes

If I were you, I would persist the tree while handling it, instead of storing it in memory. By persistence, I mean anything from a flat text file to a relational database to a graph-oriented NoSQL solution.

Then you can create a basic access layer, in the form of a Java bean, for the CRUD operations you need to perform, and there you go.
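The shape of such an access layer might look something like this hypothetical DAO, which hides whatever store (flat file, relational database, NoSQL) sits behind it.

    // Hypothetical access layer for the persisted tree.
    public interface NodeDao {
        long   create(long parentId, String tag);   // returns the new node's id
        String read(long id);                       // fetch a node's value
        void   update(long id, String value);       // update a node in place
        void   delete(long id);
    }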

Another option would be to build the XML file dynamically, thus skipping the building of the tree in memory altogether. For that purpose, you could leverage SAX, which handles XML as a stream of events rather than loading the entire document into memory.
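A minimal sketch of event-based output using a SAX TransformerHandler: each element is written as soon as its events fire, so the document never has to fit in memory as a whole. The element names mirror the tree in the question; out.xml is hypothetical.

    import java.io.File;
    import javax.xml.transform.OutputKeys;
    import javax.xml.transform.sax.SAXTransformerFactory;
    import javax.xml.transform.sax.TransformerHandler;
    import javax.xml.transform.stream.StreamResult;
    import org.xml.sax.helpers.AttributesImpl;

    public class SaxStreamOut {
        public static void main(String[] args) throws Exception {
            SAXTransformerFactory f =
                    (SAXTransformerFactory) SAXTransformerFactory.newInstance();
            TransformerHandler h = f.newTransformerHandler();
            h.getTransformer().setOutputProperty(OutputKeys.INDENT, "yes");
            h.setResult(new StreamResult(new File("out.xml")));

            AttributesImpl none = new AttributesImpl();
            h.startDocument();
            h.startElement("", "Root", "Root", none);
            h.startElement("", "Group", "Group", none);
            h.startElement("", "Record", "Record", none);
            h.startElement("", "Data", "Data", none);
            char[] text = "value".toCharArray();
            h.characters(text, 0, text.length);       // element text
            h.endElement("", "Data", "Data");
            h.endElement("", "Record", "Record");
            h.endElement("", "Group", "Group");
            h.endElement("", "Root", "Root");
            h.endDocument();                          // flushes the output
        }
    }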