Understanding perfect minimal hashing

Let $h$ be a hash function that takes a key $k$ from some universe of keys $U$ and maps it to some set of integers $\{ 0, 1,\dots,n-1\}$.  More formally, $\forall k \in U, h(k) \to \{0, 1, \dots, n-1\}$. In the simplest hash, we can have an array $V$ of $n$ elements. The value $h(k)$ is used as the index of that array such that $V[h(k)]$ is the value corresponding to that hash key.

The hash data structure is designed to give us average $O(1)$ time on inserting, deleting, and fetching – that is, given a key $k$ return the value $h(k)$. While an array also has these same $O(1)$ average time for inserting, deleting, and fetching, this only works if you can afford to allocate memory for every possible key.  Whereas an array has space proportional to the size of the universe of keys, a hash has space proportional to the size of keys used.

In general, the size of the universe of keys $U$ is much larger than the size of the set of buckets $n$.  That means there may be two keys that hash to the same value. $\exists x,y : x \neq y \land h(x) = h(y)$.  This is called a hash collision.  Common strategies for handling collisions include chaining, where the value stored in $V$ is a linked-list of all collied values, and open addressing, where multiple indices in $V$ are checked for a single key $k$ .

Load factor describes what percentage of hash buckets $n$ are occupied.  Assuming that the hash keys $h(k)$ are randomly distributed we will see collisions long before we hit a 100% load factor.  Even with the strategies mentioned above, we will eventually need to rehash to avoid worst-case performance. Rehashing means reconstructing the hash by picking a new hash function $h$, increasing the number of buckets $n$ or both.

A hash is considered perfect when there are no collisions between the keys.  A hash is considered minimal when the load factor is at 100% – that is, when the number of buckets $n$ is equal to the number of keys in the hash.  A perfect minimal hash guarantees at worst-case O(1) inserting, deleting, and fetching.  If we know exactly what keys will be in the hash, we can find a hash function $h$ such that we generate a perfect minimal hash.  Finding this hash function is called perfect minimal hash function generation. One site mentions that part of his algorithm for finding perfect minimal hash functions in NP-complete.  I don’t know if this is true generally of perfect minimal hash function generation.  I’ve found two programs that will generate a perfect minimal hash function from a static set of keywords – the GNU program gperf and cmph.

Gperf is an older GNU utility that generates C or C++ code that contains a hash function and the lookup table.  Gperf has tons of options to specify how to go about the search for a perfect hash function and how to output the code.  The manual is fairly extensive and the author’s paper describes the implementation details.

CMPH is dual-licensed LGPL and MPL program that is actively developed by four researchers in the field. The concepts page gives a brief overview of the theory while the subpages describes the different algorithms used.  The authors also compare their tool to gperf saying that gperf is used for small key sets and generates perfect hashes while CMPH is geared towards very large sets and perfect minimal hashes.  How large is “large set”?  Quoth the main page: “[CMPH] has been used successfully for constructing minimal perfect hash functions for sets with more than 100 million of keys, and we intend to expand this number to the order of billion of keys”.  CMPH also gives some empirical data on average key size and load-factor for each algorithm used.

Why would you want to use a perfect hash function generator?  If you are operating in an environment with tight memory or space constraints, you may use a perfect hash function generator to squeeze some extra performance out.  If you know your keys in advance and are parsing a lot of text, it may be worthwhile to generate a perfect hash function.  And for key sets on the order of 100 million to a billion, it may be worthwhile to investigate CMPH.

Posted in Research | Comments Off on Understanding perfect minimal hashing