April 26, 2010

Hashing is one of my favorite areas of computer science, and one of the most misunderstood.  The academic approach to hashing for use in hash tables and the practical approaches tend to create a lot of confusion.  It is also confusing that cryptographic hashing and the hashing for hash tables are both referred to as "hashing".  I'll do my best to explain what I feel is important about hashing.

What is a Hashing Function 

A hashing function takes data of a larger size (fixed or variable) and returns a number in a fixed range.  For example, a function that returns the first character in a string would be a simple hashing function.  The important part is that the returned value lands in a fixed range; in our example we know we will always get back a single character, and that allows us to do various things, like collate the strings by their first character.
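The first-character example above can be sketched in a few lines (Python here, purely for illustration; the names are made up for this sketch):

```python
# A minimal "hash" that maps any string into a fixed range: its first
# character.  That fixed range is what lets us collate strings by bucket.
def first_char_hash(s):
    # Empty strings get a sentinel bucket; everything else hashes to s[0].
    return s[0] if s else ""

names = ["Carol", "Cliff", "Eve"]
buckets = {}
for name in names:
    buckets.setdefault(first_char_hash(name), []).append(name)
# "Carol" and "Cliff" share the "C" bucket; "Eve" lands in "E"
```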

What Makes a Good Hashing Function

It really depends on what you're planning to do with it.  There are three main branches of use: cryptographic, integrity checking, and hash tables.

For cryptographic hashes I'd recommend using well known and well studied hashing algorithms and following procedures outlined by experts in their use.  Attempting to build your own cryptographic hashing algorithm usually ends badly unless you've got a PhD in mathematics.  Even then, several cryptographic hashing algorithms from the past have since been deemed too weak to continue using, most prominently MD5.  Cryptographic hashes tend to take a long time to compute, so be sure you really need one before using one.  The most common usage is ensuring that data hasn't been tampered with.  A good cryptographic hashing function is one where it is difficult to create two different inputs that result in the same output.
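The tamper-detection use case looks roughly like this (a sketch using Python's standard hashlib with SHA-256, one of the well-studied algorithms the advice above points toward; the message contents are invented):

```python
import hashlib

# Hash the data once and keep the digest.  Any later change to the data,
# however small, produces a different digest.
original = b"transfer $10 to Alice"
digest = hashlib.sha256(original).hexdigest()

tampered = b"transfer $10 to Mallory"
assert hashlib.sha256(original).hexdigest() == digest   # unchanged data verifies
assert hashlib.sha256(tampered).hexdigest() != digest   # tampered data does not
```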

Integrity checking hashes, like http://en.wikipedia.org/wiki/Cyclic_redundancy_check (CRC), are typically used for checking that data hasn't been corrupted by accident (as opposed to being corrupted or changed by an evildoer). They are typically fast to compute since they don't have to protect from intentional attacks like cryptographic hashes do.  Depending on the typical type of corruption for the media being used you'd select a different hashing function.
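Accidental-corruption checking with a CRC can be sketched like this (Python's zlib.crc32, with a single flipped bit standing in for corruption in transit):

```python
import zlib

data = b"hello, world"
checksum = zlib.crc32(data)

# Flip one bit in the first byte to simulate accidental corruption.
corrupted = bytes([data[0] ^ 0x01]) + data[1:]
assert zlib.crc32(data) == checksum        # intact data matches
assert zlib.crc32(corrupted) != checksum   # single-bit damage is caught
```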

The final usage of hash functions is for hash tables.  We can use a hash function to map an identifier (key) into a fixed range.  If we have a set of data where each piece of data is associated with a key we can use the hash function to subset our data into many smaller groups.  Now if we want to test if a certain key exists in our data we can first apply the hash function to the key to locate a subset to search.  Since searching the smaller subset is faster than searching the whole set we can greatly speed up searching.  Academics will tell you that a good hashing function for a hash table is one that distributes keys from the set into evenly sized subsets.  Practical usage places the speed at which the hashing function runs as being more important for many cases; especially cases where the number of queries being performed vastly outnumbers the number of keys in the set.  I generally recommend using hashing functions borrowed from the integrity checking family since they tend to be fast and do a reasonable job of distributing keys evenly into subsets. Even better is to use fast hashing functions designed for hash tables, like MurmurHash.
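Putting those pieces together, a bare-bones hash table looks something like this sketch (CRC-32 as the borrowed integrity-checking hash, as suggested above; the bucket count and key names are arbitrary choices for the example):

```python
import zlib

NUM_BUCKETS = 16

def bucket_of(key):
    # Map any string key into the fixed range [0, NUM_BUCKETS)
    return zlib.crc32(key.encode()) % NUM_BUCKETS

table = [[] for _ in range(NUM_BUCKETS)]

def insert(key, value):
    table[bucket_of(key)].append((key, value))

def lookup(key):
    # Hash first to pick a subset, then search only that small subset
    for k, v in table[bucket_of(key)]:
        if k == key:
            return v
    return None

insert("Carol", 31)
insert("Cliff", 42)
```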

Handling Collisions

Let's say we have a hash table with 26 subsets based on the first letter of the names of people in the set.  If we have a Carol and a Cliff in the set they'd both land in the C subset.  At this point we have a few options: we can use an indexed collection of some sort to store the subsets, or we can change to another hashing function and make a whole new hash table that doesn't contain the collision; perhaps using the first two characters of the name.  In practice both of these options are often used.  If the subsets are small enough we store the subset in an indexed collection (simple binary tree, splay tree, red-black tree, etc.).  If a subset starts to get too large we switch to a hashing function that produces a larger range of subsets and create a new hash table.
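The "grow the range" option can be sketched like so (one-letter buckets rebuilt into two-letter buckets; the names are invented for the example):

```python
# Rebuild a table of name lists under a hash with a larger range.
def rehash(table, new_hash):
    new_table = {}
    for bucket in table.values():
        for name in bucket:
            new_table.setdefault(new_hash(name), []).append(name)
    return new_table

one = lambda name: name[:1]   # original hash: first letter
two = lambda name: name[:2]   # larger range: first two letters

table = {}
for name in ["Carol", "Cliff", "Cathy"]:
    table.setdefault(one(name), []).append(name)
# all three names collide in the "C" bucket; rehashing splits them:
table = rehash(table, two)
# "Ca" now holds Carol and Cathy, "Cl" holds Cliff
```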

The extreme step of creating a new table can be optimized if we impose some additional notions on our hashing function.  If we compute a hash that is as large as we'll ever make the hash table (or larger) and also store that hash with each entry added to the hash table we can use a limited amount of the hash to select a subset of a smaller size.  The easiest method for doing this is to create a hash function that returns a register sized (say 32-bit) value and then only use so many of the bits returned to index into a set of subsets.  Academics are probably objecting to this because using a number of subsets that isn't prime tends to lead to higher collision rates.  Unfortunately obtaining the modulus of the hash against a prime number is very slow when compared to masking off a few bits.  The other advantage is that on average adding another bit of the full hash value to the amount used to determine which subset an entry lives in will result in the entry already being in the correct subset half of the time; since adding a zero value doesn't change the result.
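That bit-masking scheme, and the "half the entries stay put" property, can be demonstrated in a small sketch (CRC-32 as the full 32-bit hash; the key names and bucket-bit counts are arbitrary for the example):

```python
import zlib

def full_hash(key):
    # A register-sized (32-bit) hash stored once per entry
    return zlib.crc32(key.encode())

bits = 4
mask = (1 << bits) - 1          # use only the low 4 bits: 16 subsets
keys = [f"key{i}" for i in range(1000)]
old_bucket = {k: full_hash(k) & mask for k in keys}

# Grow the table by using one more bit of the stored hash: 32 subsets.
new_mask = (1 << (bits + 1)) - 1
stayed = sum(1 for k in keys if full_hash(k) & new_mask == old_bucket[k])
# an entry keeps its subset exactly when its new top bit is 0,
# which happens for roughly half of the keys
```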

Comparing Hash Tables to Other Options

I've been told by many people that they use fancy trees (splay, red-black) because they are faster than hash tables.  While that might be true when compared with some implementations of a hash table I think they could still benefit from a hash table.  The hash table implementation they are comparing to is either using a bad hashing function (i.e. slow), a bad collision handling method, or a very small data set.  I can say this because any of those fancy tree techniques can be used as the basis for the collision handling method used in a hash table.  Since computing a CRC and comparing two keys run in approximately the same time a hash table using a fancy tree for handling collisions would have the same worst-case order performance of the fancy tree with one additional comparison added but would be effectively operating on 1/k'th the number of entries where k was the number of subsets used by the hash table.

To put it simply, if I use a technique to store and search 26 million names I can use the same technique to store 1 million names 26 times and use the first letter of a name to decide which of the 26 instances I continue on to search in.  As long as the additional time taken to compute the hash value is less than the savings gained from reducing the search set to 1/26th of its original size the hash table is ahead, and taking the first letter of a name takes virtually no time at all.  So it seems logical the hash table approach will be faster.  (Yes, I know, the 1/26th original size is a lie since not many names begin with the letter 'Q' as opposed to the letter 'E'.  Still, the subsets will be considerably smaller, and that's what matters.)

Hash tables do take up some extra space in memory.  For each subset you have an additional pointer stored in an array that you follow to get to the structure used to handle collisions.  You also will likely store the full hash with each entry, taking up another register's worth of space for each entry. One side benefit you gain from storing the full hash with each entry is the ability to do a quick early out test when comparing keys since if the hashes don't match the keys won't match.  This can save you a significant amount of time since comparing large keys can be time consuming.
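The early-out test looks roughly like this sketch (again using CRC-32 as the stored full hash; the function name is made up for the example):

```python
import zlib

# Compare stored hashes before comparing keys: mismatched hashes
# guarantee mismatched keys, skipping an expensive full comparison.
def keys_equal(stored_hash, stored_key, probe_key):
    probe_hash = zlib.crc32(probe_key.encode())
    if probe_hash != stored_hash:
        return False                    # cheap reject: hashes differ
    return stored_key == probe_key      # hashes match: confirm with full compare

carol_hash = zlib.crc32(b"Carol")
```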

Don't Reinvent the Wheel

If you're lucky enough to be using C# you already have access to a really decent hash table by another name: Dictionary<K,V>.  C# has hashing so ingrained into it that every object supports the GetHashCode call.  (It's up to the object's author to implement it correctly, normally done by xor'ing the GetHashCode results of all of its data members.)  C# also has the advantage that strings are immutable, so in theory the underlying system can cache the result of computing the hash in the GetHashCode call rather than recomputing it each time.
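The "xor the members' hashes" pattern described above for GetHashCode looks like this sketch (shown in Python, where __hash__ plays the same role; the Point class is invented for the example):

```python
# Combine member hashes with xor, the simple scheme mentioned above.
class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y

    def __eq__(self, other):
        return (self.x, self.y) == (other.x, other.y)

    def __hash__(self):
        # xor of the members' hashes
        return hash(self.x) ^ hash(self.y)
```

One caveat worth knowing: xor is order-insensitive, so Point(1, 2) and Point(2, 1) hash to the same value; equal objects still hash equally, which is the property hash tables require.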

C++0x is adding unordered_map (long available in non-standard form as hash_map) to the standard library, and STLport has an implementation.  If you're using another language you can find various implementations from the Wikipedia article on hash tables: http://en.wikipedia.org/wiki/Hash_table.

For a really good fast hashing algorithm you can use http://sites.google.com/site/murmurhash/.

Reinvent the Wheel

Of course if you have some spare time you can learn a lot from building your own implementation.  There are also certain properties that are hard to obtain off the shelf.  For instance, if you want to pre-bake your hash table into a format that can be memory mapped as a single read-only blob of memory (making it easily shared by all threads accessing it) you'll probably have to create your own hash table system.  Learning is always a good thing, so if you have the time give it a try.  Just be sure to test anything thoroughly before using it for real world applications. Read up about testing hashes here http://www.serice.net/testing_hash_functions/.