Removing elements from lazy sequences in a large Clojure tree structure, avoiding head retention

Question

Problem Description

For processing large data structures in Clojure, lazy sequences offer a nice, idiomatic approach. One needs to be cautious to avoid head retention, though.

I struggle with processing a large tree structure like this:

                 R                                         Root
       __________|____________________
       A                   B         C, D, E, ...          1st Level Children
_______|_______     _______|_______
X Y Y ... Y X Y     X Y Y ... Y X Y                        2nd Level Children

All nodes are maps with a key :content. The value of any :content is a lazy seq with all the children of that node.
The whole tree does not fit into memory. There are too many Y items on the 2nd level.
The whole tree excluding the Y items fits into memory.

After processing the tree, I would like to end up with a new tree, where all Y nodes were removed:

           R
     ______|__________________
     A             B         C, D, E, ...
_____|___     _____|___
X X ... X     X X ... X

Example code and further explanation

;; Generating example data
;;;;;;;;;;;;;;;;;;;;;;;;;;

(defn root [content]
  {:tag :root :content content})

(defn lazy-elements [n tag content]
  (lazy-seq (repeat n {:tag tag :content content})))

(defn level-1 [content]
  (lazy-elements 3 :A content))

(defn level-2 [n]
  (concat (lazy-elements 10 :X '(:leaf))
          (lazy-elements n :Y '(:leaf))))

(defn remove-nodes [node]
  (remove #(= (:tag %) :Y) node))


;; Illustrating usage
;;;;;;;;;;;;;;;;;;;;;

;; runs and runs and runs... and eventually returns correctly
(defn valid-run []
  (->> (root (level-1 (level-2 1e8)))
       :content
       first
       :content
       remove-nodes))

;; Does not terminate properly, runs out of memory
(defn invalid-run []
  (->> (root (level-1 (level-2 1e8)))
       :content
       (map :content)       ; source of head retention
       (map remove-nodes)))

(Gist available on GitHub)

The second example will crash (depending on available memory, one might need to adjust the number of level 2 items). Mapping over the :content of all level 1 items introduces a reference which causes head retention issues when cycling through all the content items to remove the unwanted :Y items.

I was able to use data from something like valid-run, put that into a var holding mutable state, doing that for all the relevant nodes and after that stitching all the data together again. But I am very unhappy with that approach because of having to depend on mutability and having to use some quite imperative code to merge the data in the end (running through indices of lists for example).

Question

How can this be achieved in a functional, declarative style? Ideally I would want to avoid having to use mutable state as well as being too imperative (e.g. stitching together collections using indices and such).

Resources

The following articles and snippets are interesting reads about aspects of that problem:

Some more background

Eventually I need this to process large XML files. Large means >1GB and parsing that into a tree will not work on the available memory. From that XML I want to put some elements into bucket A (let's say a database table) and all the rest of the XML tree into bucket B. The XML structure of course should be retained for the extracted subtrees.

Instead of parsing the XML into a tree, I could also process the XML as an event stream, for example via data.xml/source-seq. However, this would mean loosing the XML tree semantics. Would work, but not be beautiful. But maybe there are other approaches to handling that XML in the first place.

You're constructing a new tree. Can you write it out as you create it, and then free what you've already written? At level i, you have a lazy sequence of nodes. Choose the next node, and descend into it. Once you've finished with that node, write out its replacement tree and free it and all of the tree below it. Descending means iterating that process, except for the writing part. The idea is to hold in memory only the subtree that you're examining, and to write out and forget what you've already processed, lazily bringing into memory new subtrees. Not sure if that all works. — Mars
Another: You're cleaning Y's from the leaf nodes only, right? At the same level? If so, construct a sequence of sequences of leaf nodes, representing those that are under the same node. These should have identifiers, so that you can attach them back to their parent nodes after you've stripped out the Y's. i.e.: You walk the tree first only to construct the sequence of sequences of leaf nodes. Then you remove Y's. Then you walk the tree again to create a new tree with the new sets of leaf nodes. Write out subtrees to keep from having the new tree all in memory. Again, not sure ... — Mars
My suggestions might be horrible ideas. Just off the top of my head. — Mars
I have to say I have not completely grokked the question, but I have had success with trees in clojure using a lazy zipper approach, refs here: josf.info/blog/2014/04/14/seqs-of-clojure-zippers and josf.info/blog/2014/04/14/seqs-of-clojure-zippers . These make handling trees very "functional" and immutable. — TWJW
I'd just like to note that this is an excellent question. An interesting technical problem, explained well with background, examples, and even diagrams; and runnable code snippets to reproduce the exact problem. The only thing it's missing would be a sample stacktrace, so that nobody has to run it locally to see the error that happens. — amalloy

amalloy amalloy · Accepted Answer · 2015-09-21T04:08:44

The problem is that your level-2 nodes all have pointers to the same in-memory lazy sequence, and then you map over that sequence multiple times. You would have the same problem if you just made valid-run process both the first and second node - the number of nodes doesn't much matter, because you blow the heap with any two nodes. In a real application, where you've read these nodes from a database or a file or whatever, they will point to different objects, which you can handle lazily, each in turn.

If you generate more representative example data (ie, the same data but without the structural sharing), you can GC each node as you process it:

(defn root' [content]
  (fn []
    {:tag :root :content (content)}))

(defn lazy-elements' [n tag content]
  (repeatedly n (fn [] {:tag tag :content (content)})))

(defn level-1' [content]
  (fn []
    (lazy-elements' 3 :A content)))

(defn level-2' [n]
  (fn []
    (concat (lazy-elements' 10 :X (fn [] '(:leaf)))
            (lazy-elements' n :Y (fn [] '(:leaf))))))

(defn remove-nodes [node]
  (remove #(= (:tag %) :Y) node))

(defn run []
  (let [root-builder (root' (level-1' (level-2' 1e8)))]
    (->> (root-builder)
         :content
         (map :content)       
         (map remove-nodes))))

user> (pprint (run))
(({:tag :X, :content (:leaf)}
  {:tag :X, :content (:leaf)}
  {:tag :X, :content (:leaf)}
  {:tag :X, :content (:leaf)}
  {:tag :X, :content (:leaf)}
  {:tag :X, :content (:leaf)}
  {:tag :X, :content (:leaf)}
  {:tag :X, :content (:leaf)}
  {:tag :X, :content (:leaf)}
  {:tag :X, :content (:leaf)})
 ({:tag :X, :content (:leaf)}
  {:tag :X, :content (:leaf)}
  {:tag :X, :content (:leaf)}
  {:tag :X, :content (:leaf)}
  {:tag :X, :content (:leaf)}
  {:tag :X, :content (:leaf)}
  {:tag :X, :content (:leaf)}
  {:tag :X, :content (:leaf)}
  {:tag :X, :content (:leaf)}
  {:tag :X, :content (:leaf)})
 ({:tag :X, :content (:leaf)}
  {:tag :X, :content (:leaf)}
  {:tag :X, :content (:leaf)}
  {:tag :X, :content (:leaf)}
  {:tag :X, :content (:leaf)}
  {:tag :X, :content (:leaf)}
  {:tag :X, :content (:leaf)}
  {:tag :X, :content (:leaf)}
  {:tag :X, :content (:leaf)}
  {:tag :X, :content (:leaf)}))

Since we're just generating example content, I've adjusted all your node builders to take, rather than an object they should store N copies of, a function they should call N times to get N distinct objects. And rather than returning a node, they return a function that, when called, produces a copy of that node; this allows them to compose just as nicely as your original versions, just needing one extra function call at the outer level. If you actually have distinct objects already, as I suspect you would in a real application, you can just use your original functions as written.