
Let’s say I have a trie (prefix tree) with a total limit of 10 nodes. I’m limiting it to 10 nodes to simulate memory being exceeded. (If I cannot load the entire tree into memory, the remaining total − 10 nodes are stored on disk.)

I now insert a new string into the trie that will cause the tree to exceed the 10 node limit, so now it’s time for the LRU cache to evict the least recently accessed node from the trie.

Let’s say the tree contains the words hello, help, hi and the LRU node is “h”. This would mean I need to delete “h” from the trie, which will delete the entire tree in this case. My confusion lies in also updating the cache itself to delete all the children. How does this work in this case?

I assume the cache has nodes like “h”, “he”, “hel”, “help”, etc. If I delete the “h” node, I assume I also need to delete everything in the cache prefixed with “h”? This whole approach seems really inefficient.
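To make the setup concrete, here is a rough sketch of an alternative I considered: track recency per inserted word instead of per node, so eviction always removes a whole word and then prunes nodes that no longer lead to any word. All class and method names here (`LruTrie`, `Node`, etc.) are made up for illustration, and the “spill to disk” step is left as a comment:

```java
import java.util.*;

// Hypothetical sketch: LRU eviction applied to whole words rather than
// to individual trie nodes. Not a real library class.
class LruTrie {
    static class Node {
        Map<Character, Node> children = new HashMap<>();
        boolean isWord;
    }

    private final Node root = new Node();
    private final int maxWords;
    // accessOrder=true makes the LinkedHashMap iterate least-recently-used first
    private final LinkedHashMap<String, Boolean> lru =
            new LinkedHashMap<>(16, 0.75f, true);

    LruTrie(int maxWords) { this.maxWords = maxWords; }

    void insert(String word) {
        Node cur = root;
        for (char c : word.toCharArray())
            cur = cur.children.computeIfAbsent(c, k -> new Node());
        cur.isWord = true;
        lru.put(word, Boolean.TRUE);                 // mark as most recently used
        if (lru.size() > maxWords) {
            String victim = lru.keySet().iterator().next(); // least recently used
            lru.remove(victim);
            remove(root, victim, 0);                 // in a real system: spill to disk here
        }
    }

    boolean contains(String word) {
        Node cur = root;
        for (char c : word.toCharArray()) {
            cur = cur.children.get(c);
            if (cur == null) return false;
        }
        if (cur.isWord) lru.get(word);               // touch to refresh recency
        return cur.isWord;
    }

    // Removes a word, pruning nodes that no longer lead to any word.
    // Returns true if the given node itself became removable.
    private boolean remove(Node node, String word, int i) {
        if (i == word.length()) {
            node.isWord = false;
        } else {
            Node child = node.children.get(word.charAt(i));
            if (child != null && remove(child, word, i + 1))
                node.children.remove(word.charAt(i));
        }
        return !node.isWord && node.children.isEmpty();
    }
}
```

With a limit of 2 words, inserting “hello”, “help”, “hi” evicts “hello” and prunes the now-unused “hell”/“hello” nodes, while the shared “h”/“he”/“hel” prefix survives because “help” still uses it.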

Why would you have a trie with a limit on the number of nodes that deletes "older" nodes automatically? What would that mean? Clearly deleting "h" doesn't make sense - how can "h" even be the LRU node when it's the root of your tree? How can it be less recently used than a child of it? - Erwin Bolwidt
The 10 node limit is a hypothetical simulation to mimic limiting the number of nodes I can load into memory at once. If I insert “hello” into a prefix trie, I insert “h”, then “he”, then “hel”, “hell”, “hello”. In this case, the newest inserted node is the “o” at the end of “hello” and the oldest would be “h”, no? - John Lippson
LRU is Least Recently Used; since you haven't retrieved anything from the trie, I'd say they all have the same age. It doesn't seem very sensible to remove anything but a leaf node, as removing any non-leaf node would automatically remove all of its children too. But it doesn't seem very sensible to remove individual nodes at all - only whole strings that you inserted into the trie. A trie doesn't normally preserve the original strings that were inserted, though, so it'd be hard to remove them unless you modify the data structure (maybe by counting how many strings each node is used by). - Erwin Bolwidt
Is there a more sensible way to read/write pieces of a prefix trie to disk and back into memory as the memory limit is reached? I assumed some LRU policy based on memory usage would be how it works, but the actual eviction of nodes is really confusing me. I didn’t think they all had the same age, because I thought an LRU cache was typically represented as a doubly linked list. - John Lippson

1 Answer


One thing to keep in mind when talking about a cache is that it is a redundant data structure, whose only goal is to speed up data fetches.
So, when a piece of data is evicted from the cache, it has no consequence (other than on execution speed) for the program which uses this data, because the data will then simply be fetched from main memory. So, in any case, your trie will have the exact same behavior, regardless of which pieces of it are located in the cache.

This is very important, because it allows us to code in high-level languages, such as Java, without caring about the replacement policy of the cache implemented by the processor. If that were not the case, it would be a nightmare, because we would have to take into account all the existing (and future?) replacement policies implemented in processors. Not to mention that these policies are not as simple as LRU (there are cache sets, which divide the cache into 'lines', and their behavior is pretty much tied to their physical structure as well), and that the place a piece of data will be located in the cache depends on its address in main memory, which will not necessarily be the same for each execution of the code.

In short, the two things you mention (trie nodes in Java, and LRU cache policies) are too far apart (one is very, very low-level programming, the other high-level). That is why we rarely, if ever, consider their interactions.
If you implement a trie in Java, your job is to be sure that it works well in all situations, that it is well designed so maintenance will be easier (possible), and that it is readable so other programmers can work on it some day. If it still runs too slowly, you can try to optimize it (after determining where the bottlenecks are, never before).
But if you want to link your trie to cache hits/misses and replacement policies, you will have to read the bytecode translation of your implementation (done by the JVM).

PS: in your post, you talk of simulating memory being exceeded. There is no such thing for a program. When the cache is full, we fill up the main memory. When the main memory is full, operating systems usually reserve a part of the hard drive to play the role of main memory (we call it swapping, and when it happens, the computer is as good as frozen). When the swap is full, programs crash. All of them.
In the 'mind' of a program, the operating system gives it an absolutely gigantic amount of memory (which is virtual, but for the program it's as good as real), that will never be filled up. The program itself is not 'conscious' of the way memory is managed, or of the amount of memory left, for a lot of good reasons (security, the guarantee that all programs will get a fair share of the resources...).