April 26, 2010
Hashing is one of my favorite areas of computer science, and one of the most misunderstood ones. The gap between the academic approach to hashing for hash tables and the practical approaches tends to create a lot of confusion. It is also confusing that cryptographic hashing and hashing for hash tables are both referred to simply as "hashing". I'll do my best to explain what I feel is important about hashing.
For cryptographic hashes I'd recommend using well-known and well-studied hashing algorithms and following the procedures outlined by experts in their use. Attempting to build your own cryptographic hashing algorithm usually ends badly unless you've got a PhD in mathematics. Even then, several cryptographic hashing algorithms from the past have since been deemed too weak to continue using, most prominently MD5. Cryptographic hashes tend to take a long time to compute, so be sure you really need one before using one. The most common usage is ensuring that data hasn't been tampered with. A good cryptographic hashing function is one where it is difficult to create two different inputs that result in the same output.
Integrity checking hashes, like cyclic redundancy checks (http://en.wikipedia.org/wiki/Cyclic_redundancy_check, CRC), are typically used for checking that data hasn't been corrupted by accident (as opposed to being corrupted or changed by an evildoer). They are typically fast to compute since they don't have to protect against intentional attacks the way cryptographic hashes do. Depending on the typical type of corruption for the media being used, you'd select a different hashing function.
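To make this concrete, here is a minimal bitwise sketch of one common CRC variant (the reflected 0xEDB88320 polynomial used by zlib, gzip and PNG); real implementations usually use a lookup table for speed, but the logic is the same.

```cpp
#include <cstdint>
#include <cstddef>

// Bitwise CRC-32 using the reflected polynomial 0xEDB88320
// (the variant used by zlib, gzip and PNG).
uint32_t crc32(const unsigned char* data, size_t len) {
    uint32_t crc = 0xFFFFFFFFu;            // initial value: all ones
    for (size_t i = 0; i < len; ++i) {
        crc ^= data[i];                    // fold the next byte in
        for (int bit = 0; bit < 8; ++bit)  // process one bit at a time
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;                           // final inversion
}
```

Feeding this the standard test string "123456789" yields the well-known check value 0xCBF43926 for this CRC variant.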
The final usage of hash functions is for hash tables. We can use a hash function to map an identifier (key) into a fixed range. If we have a set of data where each piece of data is associated with a key, we can use the hash function to subset our data into many smaller groups. Now if we want to test whether a certain key exists in our data, we can first apply the hash function to the key to locate a subset to search. Since searching the smaller subset is faster than searching the whole set, we can greatly speed up searching. Academics will tell you that a good hashing function for a hash table is one that distributes keys from the set into evenly sized subsets. Practical usage places the speed at which the hashing function runs as being more important for many cases; especially cases where the number of queries being performed vastly outnumbers the number of keys in the set. I generally recommend using hashing functions borrowed from the integrity checking family since they tend to be fast and do a reasonable job of distributing keys evenly into subsets. Even better is to use fast hashing functions designed for hash tables, like MurmurHash.
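A sketch of the key-to-subset mapping, using FNV-1a as the hash (my choice for illustration; it's a simple, fast string hash often used for hash tables, not something the article prescribes):

```cpp
#include <cstdint>
#include <cstddef>
#include <string>

// FNV-1a: a simple, fast 32-bit string hash.
uint32_t fnv1a(const std::string& key) {
    uint32_t h = 2166136261u;        // FNV offset basis
    for (unsigned char c : key) {
        h ^= c;                      // mix in one byte
        h *= 16777619u;              // FNV prime
    }
    return h;
}

// Map a key into one of bucket_count subsets.
size_t bucket_for(const std::string& key, size_t bucket_count) {
    return fnv1a(key) % bucket_count;
}
```

A lookup then only has to search the entries stored in `bucket_for(key, bucket_count)` rather than the whole set.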
The extreme step of creating a new table when a hash table grows can be optimized if we impose some additional requirements on our hashing function. If we compute a hash that is as large as we'll ever make the hash table (or larger), and also store that hash with each entry added to the hash table, we can use a limited portion of the hash to select a subset in a smaller table. The easiest method for doing this is to create a hash function that returns a register-sized (say 32-bit) value and then only use so many of the bits returned to index into the set of subsets. Academics are probably objecting to this because using a number of subsets that isn't prime tends to lead to higher collision rates. Unfortunately, obtaining the modulus of the hash against a prime number is very slow when compared to masking off a few bits. The other advantage is that, on average, adding another bit of the full hash value to the amount used to determine which subset an entry lives in will leave the entry already in the correct subset half of the time, since adding a zero bit doesn't change the result.
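The masking trick looks like this (a sketch; the function name is mine). With a power-of-two bucket count, selecting a subset is a single AND instead of a much slower modulus by a prime:

```cpp
#include <cstdint>
#include <cstddef>

// Select a bucket by masking off the low bits of the stored full hash.
// pow2_buckets must be a power of two, so (pow2_buckets - 1) is an
// all-ones mask over exactly the bits we want to keep.
size_t bucket_index(uint32_t full_hash, size_t pow2_buckets) {
    return full_hash & (pow2_buckets - 1);
}
```

When the table doubles from 16 to 32 buckets, `bucket_index` simply uses one more bit of the stored hash; an entry stays in its old bucket exactly when that new bit is zero, which happens half the time on average.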
To put it simply, if I use a technique to store and search 26 million names, I can use the same technique to store 1 million names 26 times and use the first letter of a name to decide which of the 26 instances I continue on to search in. As long as the additional time taken to compute the hash value is less than the savings gained from reducing the search set to 1/26th of its original size, the hash table is ahead, and taking the first letter of a name takes virtually no time at all. So it seems logical the hash table approach will be faster. (Yes, I know, the 1/26th original size is a lie since not many names begin with the letter 'Q' as opposed to the letter 'E'. Still, the subsets will be considerably smaller, and that's what matters.)
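The first-letter scheme can be sketched directly (assuming plain ASCII names; the helper name is mine):

```cpp
#include <array>
#include <cstddef>
#include <string>
#include <vector>

// Split names into 26 subsets keyed by first letter, so a later search
// only has to scan one subset instead of the whole list.
std::array<std::vector<std::string>, 26>
bucket_names(const std::vector<std::string>& names) {
    std::array<std::vector<std::string>, 26> buckets;
    for (const std::string& n : names) {
        if (!n.empty())
            buckets[(n[0] | 0x20) - 'a'].push_back(n);  // fold to lowercase, index 0-25
    }
    return buckets;
}
```

Here "taking the first letter" is the (trivially cheap) hash function, and the 26 vectors are the subsets.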
Hash tables do take up some extra space in memory. For each subset you have an additional pointer stored in an array that you follow to get to the structure used to handle collisions. You also will likely store the full hash with each entry, taking up another register's worth of space per entry. One side benefit you gain from storing the full hash with each entry is the ability to do a quick early-out test when comparing keys, since if the hashes don't match the keys won't match. This can save you a significant amount of time since comparing large keys can be time consuming.
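The early-out test is just a comparison of the stored hashes before the (potentially expensive) key comparison; a minimal sketch, with the struct and function names being mine:

```cpp
#include <cstdint>
#include <string>

struct Entry {
    uint32_t hash;    // full hash stored alongside the key
    std::string key;
};

// If the stored hashes differ, the keys cannot be equal, so we skip
// the expensive full key comparison entirely.
bool matches(const Entry& e, uint32_t query_hash, const std::string& query_key) {
    return e.hash == query_hash && e.key == query_key;
}
```

Because `&&` short-circuits, a mismatched hash never touches the key bytes at all.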
C++0x is adding hash tables to the standard library in the form of std::unordered_map and std::unordered_set.
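Using the new container looks like this (the data here is just made-up example values):

```cpp
#include <string>
#include <unordered_map>

// Build a small hash table mapping names to ages using the C++0x
// std::unordered_map container.
std::unordered_map<std::string, int> make_table() {
    std::unordered_map<std::string, int> ages;
    ages["alice"] = 30;  // hashed insert
    ages["bob"] = 25;
    return ages;
}
```

Lookups use the key's hash to find the right bucket, exactly as described above, without you having to write any of the machinery yourself.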
For a really good, fast hashing algorithm, take a look at MurmurHash: http://sites.google.com/site/murmurhash/.