8
votes

In OCaml, polymorphic comparison is implemented by walking the runtime representation of values which consists of immediates and pointers to blocks.

According to Real World Ocaml, a polymorphic variant with no parameters is just stored as an unboxed integer. Excerpt reproduced here for convenience.

A polymorphic variant without any parameters is stored as an unboxed integer and so only takes up one word of memory, just like a normal variant. This integer value is determined by applying a hash function to the name of the variant. The hash function isn't exposed directly by the compiler, but the type_conv library from Core provides an alternative implementation: ...

However, polymorphic comparison does not appear to operate on the value of the integer and appears to respect the lexicographic ordering of the name of the polymorphic variant (in the top-level at least).

# List.sort Pervasives.compare
     [ `L ; `K ; `J ; `I ; `H ; `G ; `F ; `E ; `D; `C ; `B; `A ];; 
[`A; `B; `C; `D; `E; `F; `G; `H; `I; `J; `K; `L]

There is one small wrinkle: the length of the representation seems to be weighted most in the ordering.

# List.sort compare  [ `BBBB; `AAAA; `AAA; `ABA; `BB; `ZZ; `AA ];; 
[`AA; `BB; `ZZ; `AAA; `ABA; `AAAA; `BBBB]

How is OCaml pulling this off? How is the information OCaml needs to sort the variants lexicographically still present at runtime? Shouldn't a polymorphic variant without any arguments be indistinguishable from an ordinary integer?

Did the OCaml implementers pick a hash function that, by coincidence/design, has this behavior for short variant names?

1
Just for the sake of stating the obvious, I strongly suggest you don't rely on such property in actual code.Richard-Degenne
@RichouHunter I was just surprised that the comparison order isn’t “totally unpredictable”.Gregory Nisbet
Sure, I could tell. I just don't want a future reader thinking about using this.Richard-Degenne

1 Answers

8
votes

Due to its construction the hash function preserves the order for short strings. But this is not a general property.

# List.sort compare [`AAAAAAA; `BAAAAAA; `CAAAAAA];;
- : [> `AAAAAAA | `BAAAAAA | `CAAAAAA ] list =
       [`BAAAAAA; `CAAAAAA; `AAAAAAA]
#

The hashing code looks like this for OCaml 4.06.0:

CAMLexport value caml_hash_variant(char const * tag)
{
  value accu;
  for (accu = Val_int(0); *tag != 0; tag++)
    accu = Val_int(223 * Int_val(accu) + *((unsigned char *) tag));
#ifdef ARCH_SIXTYFOUR
  accu = accu & Val_long(0x7FFFFFFFL);
#endif
  /* Force sign extension of bit 31 for compatibility between 32 and 64-bit
     platforms */
  return (int32_t) accu;
}

It seems to me that for short strings of characters with codes less than 223 this will tend to preserve lexical order.