Up until recently my answer would have been very close to Jon Skeet's here. However, I recently started a project which used power-of-two hash tables, that is hash tables where the size of the internal table is 8, 16, 32, etc. There's a good reason for favouring prime-number sizes, but there are some advantages to power-of-two sizes too.
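(As a rough illustration of why the table size matters, and not code from the project itself: a prime-sized table reduces a hash to a bucket index with a modulo, which effectively every bit of the hash can influence, while a power-of-two table can use a single bitmask, which only ever looks at the low bits. The sizes 31 and 32 below are arbitrary examples.)
// Illustration only; 31 and 32 are arbitrary example table sizes.
static int PrimeBucket(int hash) => (hash & 0x7FFFFFFF) % 31; // modulo: any bit of the hash can change the bucket
static int PowerOfTwoBucket(int hash) => hash & (32 - 1);     // mask: only the low 5 bits ever matter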
And those hash tables pretty much sucked. So after a bit of experimentation and research I started re-hashing my hashes with the following:
public static int ReHash(int source)
{
    unchecked
    {
        // Initialise two words of state from the input and fixed constants.
        ulong c = 0xDEADBEEFDEADBEEF + (ulong)source;
        ulong d = 0xE2ADBEEFDEADBEEF ^ c;
        // Mix; each `x << n | x >> -n` is a left rotation by n bits (see the * footnote below).
        ulong a = d += c = c << 15 | c >> -15;
        ulong b = a += d = d << 52 | d >> -52;
        c ^= b += a = a << 26 | a >> -26;
        d ^= c += b = b << 51 | b >> -51;
        a ^= d += c = c << 28 | c >> -28;
        b ^= a += d = d << 9 | d >> -9;
        c ^= b += a = a << 47 | a >> -47;
        d ^= c += b << 54 | b >> -54;
        a ^= d += c << 32 | c >> 32;
        a += d << 25 | d >> -25;
        // Return 32 bits of the final mixed value.
        return (int)(a >> 1);
    }
}
And then my power-of-two hash table didn't suck any more.
This disturbed me though, because the above shouldn't work. Or more precisely, it shouldn't work unless the original GetHashCode() was poor in a very particular way.
Re-mixing a hash code can't improve a great hash code, because the only possible effect is that we introduce a few more collisions.
Re-mixing a hash code can't improve a terrible hash code, because the only possible effect is that we change, e.g., a large number of collisions on value 53 into a large number of collisions on value 183,487,291.
Re-mixing a hash code can only improve a hash code that did at least fairly well in avoiding absolute collisions throughout its range (2³² possible values) but badly at avoiding collisions when modulo'd down for actual use in a hash table. While the simpler modulo of a power-of-two table made this more apparent, it was also having a negative effect with the more common prime-number tables, that just wasn't as obvious (the extra work in rehashing would outweigh the benefit, but the benefit would still be there).
Edit: I was also using open addressing, which would also have increased the sensitivity to collisions, perhaps more so than the fact it was power-of-two.
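To make that concrete, here's a minimal sketch of exactly the kind of hash code re-mixing can rescue (the 1,000-element input and the table size are invented for illustration, and it assumes the ReHash method above): the values are all distinct across the full 32-bit range, but their low bits are constant, so a power-of-two mask piles everything into one bucket until the bits are re-mixed.
// using System; using System.Linq;
// Hypothetical worst case: 1,000 hash codes, all distinct over the full
// 32-bit range, but whose low 12 bits are always zero.
int[] hashes = Enumerable.Range(0, 1000).Select(i => i << 12).ToArray();

const int tableSize = 1024; // power of two, so the bucket is hash & (tableSize - 1)
int rawBuckets = hashes.Select(h => h & (tableSize - 1)).Distinct().Count();
int mixedBuckets = hashes.Select(h => ReHash(h) & (tableSize - 1)).Distinct().Count();

Console.WriteLine(rawBuckets);   // 1: every value lands in bucket 0
Console.WriteLine(mixedBuckets); // several hundred: roughly what random placement would give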
And well, it was disturbing how much the string.GetHashCode() implementations in .NET (or study here) could be improved this way (on the order of tests running about 20-30 times faster due to fewer collisions) and more disturbing how much my own hash codes could be improved (much more than that).
All the GetHashCode() implementations I'd coded in the past, and indeed used as the basis of answers on this site, were much worse than I'd thought. Much of the time it was "good enough" for many of the uses, but I wanted something better.
So I put that project to one side (it was a pet project anyway) and started looking at how to produce a good, well-distributed hash code in .NET quickly.
In the end I settled on porting SpookyHash to .NET. Indeed the code above is a fast-path version of using SpookyHash to produce a 32-bit output from a 32-bit input.
Now, SpookyHash is not a nice, quick-to-remember piece of code. My port of it is even less so, because I hand-inlined a lot of it for better speed*. But that's what code reuse is for.
Then I put that project to one side, because just as the original project had produced the question of how to produce a better hash code, so that project produced the question of how to produce a better .NET memcpy.
Then I came back, and produced a lot of overloads to easily feed just about all of the native types (except decimal†) into a hash code.
It's fast, for which Bob Jenkins deserves most of the credit because his original code I ported from is faster still, especially on 64-bit machines which the algorithm is optimised for‡.
The full code can be seen at https://bitbucket.org/JonHanna/spookilysharp/src but consider that the code above is a simplified version of it.
However, since it's now already written, one can make use of it more easily:
public override int GetHashCode()
{
    var hash = new SpookyHash();
    hash.Update(field1);
    hash.Update(field2);
    hash.Update(field3);
    return hash.Final().GetHashCode();
}
It also takes seed values, so if you need to deal with untrusted input and want to protect against Hash DoS attacks you can set a seed based on uptime or similar, and make the results unpredictable by attackers:
private static long hashSeed0 = Environment.TickCount;
private static long hashSeed1 = DateTime.Now.Ticks;

public override int GetHashCode()
{
    // Produce different hashes every time this application is restarted,
    // but remain consistent within each run, so attackers have a harder
    // time DoSing the hash tables.
    var hash = new SpookyHash(hashSeed0, hashSeed1);
    hash.Update(field1);
    hash.Update(field2);
    hash.Update(field3);
    return hash.Final().GetHashCode();
}
*A big surprise in this is that hand-inlining a rotation method that returned (x << n) | (x >> -n) improved things. I would have been sure that the jitter would have inlined that for me, but profiling showed otherwise.
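(For what it's worth, that helper was just a left-rotate along the lines below; the name Rot64 is mine, not necessarily what the ported code uses. It leans on the fact that C# only uses the low six bits of the shift count for a 64-bit operand, so x >> -n is effectively x >> (64 - n).)
// Sketch of such a rotation helper (the name is illustrative).
// For a ulong, C# masks the shift count to its low 6 bits, so x >> -n
// becomes x >> ((-n) & 63), i.e. x >> (64 - n) for 0 < n < 64,
// which makes the whole expression a left rotation by n bits.
private static ulong Rot64(ulong x, int n)
{
    return x << n | x >> -n;
}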
†decimal isn't native from the .NET perspective, though it is from the C# perspective. The problem with it is that its own GetHashCode() treats precision as significant, while its own Equals() does not. Both are valid choices, but not mixed like that. In implementing your own version, you need to choose to do one or the other, but I can't know which you'd want.
‡By way of comparison: if used on a string, SpookyHash on 64 bits is considerably faster than string.GetHashCode() on 32 bits, which is slightly faster than string.GetHashCode() on 64 bits, which is considerably faster than SpookyHash on 32 bits, though that last is still fast enough to be a reasonable choice.
Comments:

Guidelines and rules for GetHashCode, written by Eric Lippert. I hope it would be helpful for others. – rene

GetHashCode() is used in very many implementations of Equals(). That's what I meant with that statement. GetHashCode() inside Equals() is often used as a shortcut to determine inequality, because if two objects have different hash codes they have to be objects that are not equal, and the rest of the equality check doesn't have to be executed. – bitbonk

GetHashCode() and Equals() need to look at all fields of both objects (Equals has to do this if the hash codes are equal or not checked). Because of this, a call to GetHashCode() inside Equals() is often redundant and could reduce performance. Equals() may also be able to short-circuit, making it much faster - however, in some cases the hash codes may be cached, making the GetHashCode() check faster and so worthwhile. See this question for more. – NotEnoughData