Problem Description
For processing large data structures in Clojure, lazy sequences offer a nice, idiomatic approach. One needs to be cautious to avoid head retention, though.
I struggle with processing a large tree structure like this:
R Root
__________|____________________
A B C, D, E, ... 1st Level Children
_______|_______ _______|_______
X Y Y ... Y X Y X Y Y ... Y X Y 2nd Level Children
- All nodes are maps with a key
:content
. The value of any:content
is a lazy seq with all the children of that node. - The whole tree does not fit into memory. There are too many
Y
items on the 2nd level. - The whole tree excluding the
Y
items fits into memory.
After processing the tree, I would like to end up with a new tree, where all Y
nodes were removed:
R
______|__________________
A B C, D, E, ...
_____|___ _____|___
X X ... X X X ... X
Example code and further explanation
;; Generating example data
;;;;;;;;;;;;;;;;;;;;;;;;;;
(defn root [content]
{:tag :root :content content})
(defn lazy-elements [n tag content]
(lazy-seq (repeat n {:tag tag :content content})))
(defn level-1 [content]
(lazy-elements 3 :A content))
(defn level-2 [n]
(concat (lazy-elements 10 :X '(:leaf))
(lazy-elements n :Y '(:leaf))))
(defn remove-nodes [node]
(remove #(= (:tag %) :Y) node))
;; Illustrating usage
;;;;;;;;;;;;;;;;;;;;;
;; runs and runs and runs... and eventually returns correctly
(defn valid-run []
(->> (root (level-1 (level-2 1e8)))
:content
first
:content
remove-nodes))
;; Does not terminate properly, runs out of memory
(defn invalid-run []
(->> (root (level-1 (level-2 1e8)))
:content
(map :content) ; source of head retention
(map remove-nodes)))
The second example will crash (depending on available memory, one might need to adjust the number of level 2 items). Mapping over the
:content
of all level 1 items introduces a reference which causes head retention issues when cycling through all the content items to
remove the unwanted :Y
items.
I was able to use data from something like valid-run
, put that into a var holding mutable state, doing that for all the relevant nodes
and after that stitching all the data together again. But I am very unhappy with that approach because of having to depend on mutability
and having to use some quite imperative code to merge the data in the end (running through indices of lists for example).
Question
How can this be achieved in a functional, declarative style? Ideally I would want to avoid having to use mutable state as well as being too imperative (e.g. stitching together collections using indices and such).
Resources
The following articles and snippets are interesting reads about aspects of that problem:
- XML manipulation in Clojure
- Rich Hickey on incidental complexity of head retention
- Related Clojure documentation
- Head retention funkiness in a nutshell
Some more background
Eventually I need this to process large XML files. Large means >1GB and parsing that into a tree will not work on the available memory. From that XML I want to put some elements into bucket A (let's say a database table) and all the rest of the XML tree into bucket B. The XML structure of course should be retained for the extracted subtrees.
Instead of parsing the XML into a tree, I could also process the XML as an event stream, for example via data.xml/source-seq. However, this would mean loosing the XML tree semantics. Would work, but not be beautiful. But maybe there are other approaches to handling that XML in the first place.