15 votes

Hash-table-based dict/map structures provide O(1) lookup time. Yet I keep seeing implications that in Elixir, finding a matching function head is faster than looking up something in a map.

For instance, Elixir's String.Unicode compiles a list of unicode characters into many function heads, so that asking "what's the upcase version of é" is answered by finding the function head for upcase that expects "é".
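The technique looks roughly like this (a minimal toy sketch, not the actual String.Unicode implementation, which generates thousands of clauses from the Unicode data files at compile time):

```elixir
# A toy version of the function-head-per-character approach.
# One clause per known mapping; unknown bytes pass through unchanged.
defmodule TinyUpcase do
  def upcase("é" <> rest), do: "É" <> upcase(rest)
  def upcase("à" <> rest), do: "À" <> upcase(rest)
  def upcase(<<byte, rest::binary>>), do: <<byte>> <> upcase(rest)
  def upcase(""), do: ""
end

IO.puts(TinyUpcase.upcase("éx"))  # => "Éx"
```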

I don't know why this would be faster or more memory efficient than having a single upcase function head that looks up "é" in a map.

Similarly, when showing how to build an I18n library in "Metaprogramming Elixir", Chris McCord gives each translation key its own function head, and says:

“By generating function heads for each translation mapping, we again let the Virtual Machine take over for fast lookup.”
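The pattern he describes can be sketched without the macros by writing the clauses out by hand (the book generates them from a keyword list at compile time; the locales and keys here are invented for illustration):

```elixir
# Hypothetical translations compiled into one function head each,
# in the spirit of the Metaprogramming Elixir I18n example.
defmodule I18nSketch do
  def t("en", "flash.hello"), do: "Hello!"
  def t("fr", "flash.hello"), do: "Bonjour !"
  def t(_locale, key), do: key  # fallback: return the key itself
end
```

Looking up a translation is then just a function call, e.g. `I18nSketch.t("fr", "flash.hello")`.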

Do maps in Elixir not provide O(1) lookup? Is finding a matching function head O(1)? Why opt for compiling a static list to many function heads instead of just storing it in a map?

Just remember that complexity analysis is not speed. One O(1) function may run ten times faster than another. In any case, hashing is not O(1). It may approach or amortize to O(1) under certain circumstances (probabilistically O(1)), but collisions will always push it towards O(n). — paxdiablo
@paxdiablo I understand that complexity != speed, and that's a good point. As far as collisions, I don't see how they push toward O(N). A hash table will limit the number of keys per bucket — e.g., maybe a max of 10. So lookup in a bucket is never more than 10 checks and hence O(1), right? With a good hashing algo the need to rehash should get less and less frequent — e.g., each time we rehash we have to do twice the work as last time, but it took us twice as long to have to do that, so it amortizes to O(1). If somehow new keys keep going in the same bucket, it's a bad hashing algo, right? — Nathan Long
@Nathan, if you're allowed to limit the data, all algorithms become O(1). Complexity analysis is for arbitrary data. You can push the problem further into the distance by, for example, using hashes of hashes but, sans data limitations, the problem will always be there. — paxdiablo
@paxdiablo :) Yes, but I'm fairly sure that limiting the number of keys per bucket is necessary for hash tables to get O(1) performance. I'm not saying limit the number of items in the hash, just that we rehash everything when buckets get too full. I've implemented a hash and written about it at nathanmlong.com/2015/10/reimplementing-rubys-hash — maybe skip down to "Did We Achieve O(1)?" I think maybe we're misunderstanding each other somehow — you sound like you know what you're talking about and I'm pretty sure I do too. :) — Nathan Long

1 Answer

23 votes

Elixir maps aren't flat hash-table-based datastructures. Maps with at most 31 entries are stored as sorted key-value lists; when a map reaches 32 entries, it switches to a hash trie with O(log n) lookup. A flat, immutable hash table would be very expensive to update, and updating is the primary way you "build up" one of these datastructures: changing a single value would require copying the entire table just to produce a new map with one more item. Instead, they're based on Rich Hickey's persistent hashmaps, a kind of hash array mapped trie.
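You can get a rough sense of the two representations from the BEAM side. This sketch assumes `:erts_debug.size/1`, which reports a term's heap size in words (the exact numbers vary across OTP releases, so none are shown here):

```elixir
# 31 entries: stored as a flat, sorted small map.
small = Map.new(1..31, fn i -> {i, i} end)
# 40 entries: stored as a hash array mapped trie.
large = Map.new(1..40, fn i -> {i, i} end)

IO.inspect({:erts_debug.size(small), :erts_debug.size(large)})
```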

Function-head matching is O(n) in the number of clauses in the worst case, but it can be optimized by the compiler in ways I don't fully understand; in some cases it can turn a set of clause patterns into a decision tree. But because clauses don't have a total order, and must be tried in the order in which they are defined, the amount of optimization is fairly limited. You might simply be getting lucky by hitting clauses that sit early in the match order, or in a shallow part of such a tree.
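Clause order mattering is easy to see with a made-up function (assuming nothing about the compiler's internals): `0` is also an integer, but the first matching clause wins, so the compiler can't freely reorder them.

```elixir
defmodule OrderDemo do
  # Clauses are tried top to bottom; the first match wins.
  def classify(0), do: :zero
  def classify(n) when is_integer(n), do: :integer
  def classify(_), do: :other
end
```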

Each step of a head match is also quite lightweight and highly optimized, whereas maps are still quite new and there are performance optimizations yet to be had. The hash function isn't cheap if the key is complex or nested (simple integers, by contrast, hash very quickly). And in your Unicode example, I'd bet the standard, by the way it assigns code points, puts many commonly used characters near the front. That probably makes it easy for the VM to optimize and gives you very good lookup times; I'm sure if you looked up characters from an obscure alphabet, your lookup times would change.

But the reason you wouldn't dynamically generate a module with function-head matching as a general way to look up data is that it thrashes global state, in particular the code server. Some of the lookups might be faster, but the speedup would likely evaporate as the datastructure grew.