6
votes

This is perhaps related to functional data structures, but I found no tags about this topic.

Say I have a syntax tree type Tree, which is organised as a DAG by simply sharing common sub expressions. For example,

data Tree = Val Int | Plus Tree Tree

example :: Tree
example = let x = Val 42 in Plus x x

Then, on this syntax tree type, I have a pure function simplify :: Tree -> Tree, which, when given the root node of a Tree, simplifies the whole tree by first simplifying the children of that root node, and then handle the operation of the root node itself.

Since simplify is a pure function, and some nodes are shared, we expect not to call simplify multiple times on those shared nodes.

Here comes the problem. The whole data structure is invariant, and the sharing is transparent to the programmer, so it seems impossible to determine whether or not two nodes are in fact the same nodes.

The same problem happens when handling the so-called “tying-the-knot” structures. By tying the knot, we produce a finite data representation for an otherwise infinite data structure, e.g. let xs = 1 : xs in xs. Here xs itself is finite, but calling map succ on it does not necessarily produce a finite representation.

These problems can be concluded as such: when the data is organised in an invariant directed graph, how do we avoid revisiting the same node, doing duplicated work, or even resulting in non-termination when the graph happened to be cyclic?

Some ideas that I have thought of:

  1. Extend the Tree type to Tree a, making every nodes hold an extra a. When generating the graph, associate each node with a unique a value. The memory address should have worked here, despite that the garbage collector may move any heap object at any time.
  2. For the syntax tree example, we may store a STRef (Maybe Tree) in every node for the simplified version, but this might not be extensible, and injects some implementation detail of a specific operation to the whole data structure itself.
1
We can't do this in pure code: referential transparency mandates that there is no safe way to detect "shared subtrees" from "identical subtrees". If you are OK with using unsafe stuff, and promise to be extremely careful so that you do not break referential transparency, you can unsafely compare the underlying pointers. Not recommended unless you really like to live dangerously :-). Perhaps you could instead decorate your tree nodes with unique IDs and use those to detect sharing, or use STRefs.chi
@chi Even if I cannot tell whether two nodes were the same one, is there any way I can do some transformation on the tree (e.g. fmap f), while (in general) preserving its shape? Or I would not be able to take a Tree () with shared nodes in it, and replace the ()s with STRefs without changing its shape. Then I would have to carry all those STRefs ever since the tree is created, which means I will need to wrap the whole stuff in the ST monad.Ruifeng Xie
In principle, there could exist a sharing-aware fmap in the libraries that, relying on unsafe pointer equality, preserves the sharing (or at leats some of it) in the original tree (and calls f fewer times). I don't think there is such a thing at the moment. Ditto for other traversals.chi

1 Answers

5
votes

This is a problem with a lot of research behind it. In general, you cannot observe sharing in a pure language like Haskell, due to referential transparency. But in practice, you can safely observe sharing so long as you restrict yourself to doing the observing in the IO monad. Andy Gill (one of the legends from the old Glasgow school of FP!) has written a wonderful paper about this about 10 years ago:

http://ku-fpg.github.io/files/Gill-09-TypeSafeReification.pdf

Very well worth reading, and the bibliography will give you pointers to prior art in this area and many suggested solutions, from "poor-man's morally-safe" approaches to fully monadic knot-tying techniques. In my mind, Andy's solution and the corresponding reify package in Hackage:

https://hackage.haskell.org/package/data-reify

are the most practical solutions to this problem. And I can tell from experience that they work really well in practice.