Friday, December 07, 2007

Hashtables and Wedding Days

While using a commercial STL hash implementation a few years ago, my program appeared to hang (the old halting problem dilemma; is it in an infinite loop or just slow?) on a hash table insert. I got a trace of the key/value pairs inserted and wrote a minimal program only doing the inserts. That hung too. Sent it off to the vendor who found a bug in his bucket rebalancing code.

He sent back a patch for the implementation with the advice to get a better hash function, since a lot of the inserts were hashing to the same location. Using the key object's address as the hash value was a poor choice: it was word aligned, and the lower 2 bits were always zero. So we changed the hash function to right-shift the address, and with the patched code, we were off and running again.

Jeff Atwood draws an analogy of the birthday paradox and the probability of two objects hashing to the same location. The problem with my poor hash function resembles the Wedding Day Paradox, as I'm calling it. I recently had a conversation with a coworker:

Me: My anniversary is coming up.
Coworker: Mine too! How many years?
Me: Four.
Coworker: Me too. When were you married?
Me: October, how about you?

We quickly discovered we both had our weddings fall on the same day of the same year. We realized this should not surprise us, since most weddings occur on a Saturday. Knowing that we were both married in the same month of the same year gives only 4 or 5 choices. This effect is even more significant than the seasonal variations of birthdays with more concentration in some ranges than in others.

I've filed this under bugs, so let this serve as a cautionary tale if you undertake to do your own hash functions; consider well the distribution of your keys. Also, if you write a hash table implementation, include in your test suite a scenario where the hash function is really dumb.

No comments: