In computer science, a family of hash functions is said to be k-independent, k-wise independent or k-universal if selecting a function at random from the family guarantees that the hash codes of any designated k keys are independent random variables (see precise mathematical definitions below). Such families allow good average case performance in randomized algorithms or data structures, even if the input data is chosen by an adversary. The trade-offs between the degree of independence and the efficiency of evaluating the hash function are well studied, and many k-independent families have been proposed.
See also: Hash function.
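The goal of hashing is usually to map keys from some large domain (universe) $U$ into a smaller range, such as $m$ bins (labelled $[m]=\{0,\dots,m-1\}$). In the analysis of randomized algorithms and data structures, it is often desirable for the hash codes of the keys to behave like independent random choices in $[m]$, which no fixed, deterministic hash function can guarantee when an adversary chooses the keys.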
The solution to these problems is to pick a function randomly from a large family of hash functions. The randomness in choosing the hash function can be used to guarantee some desired random behavior of the hash codes of any keys of interest. The first definition along these lines was universal hashing, which guarantees a low collision probability for any two designated keys. The concept of $k$-independent hashing strengthens the guarantees of random behavior to families of $k$ designated keys.
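The strictest definition, introduced by Wegman and Carter under the name "strongly universal$_k$ hash family", is the following. A family of hash functions $H=\{h:U\to[m]\}$ is $k$-independent if, for any $k$ distinct keys $(x_1,\dots,x_k)\in U^k$ and any $k$ hash codes (not necessarily distinct) $(y_1,\dots,y_k)\in[m]^k$, we have:
$$\Pr_{h\in H}\left[h(x_1)=y_1\land\cdots\land h(x_k)=y_k\right]=m^{-k}$$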
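This definition is equivalent to the following two conditions:
1. for any fixed $x\in U$, as $h$ is drawn randomly from $H$, $h(x)$ is uniformly distributed in $[m]$;
2. for any fixed, distinct keys $x_1,\dots,x_k\in U$, as $h$ is drawn randomly from $H$, $h(x_1),\dots,h(x_k)$ are independent random variables.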
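For a very small universe the definition can be checked by brute force. The sketch below is illustrative only: the helper name `is_k_independent` and the two toy families are assumptions made for this example, and each hash function is stored as a plain dictionary from key to hash code. It shows that the family of all functions passes the test, while the family of constant functions is uniform on single keys but fails for pairs.

```python
from fractions import Fraction
from itertools import permutations, product


def is_k_independent(family, universe, m, k):
    """Return True if every k distinct keys hit every k-tuple of hash codes
    with probability exactly m**-k when h is drawn uniformly from `family`."""
    target = Fraction(1, m ** k)
    for keys in permutations(universe, k):          # ordered tuples of distinct keys
        for codes in product(range(m), repeat=k):   # codes need not be distinct
            hits = sum(
                all(h[x] == y for x, y in zip(keys, codes)) for h in family
            )
            if Fraction(hits, len(family)) != target:
                return False
    return True


universe = [0, 1, 2]
m = 2

# Every function U -> [m], each stored as a dict from key to hash code.
all_functions = [
    dict(zip(universe, values))
    for values in product(range(m), repeat=len(universe))
]

# Only the constant functions: each h(x) is uniform, but the codes are fully correlated.
constant_functions = [{x: c for x in universe} for c in range(m)]

print(is_k_independent(all_functions, universe, m, 3))       # True
print(is_k_independent(constant_functions, universe, m, 1))  # True
print(is_k_independent(constant_functions, universe, m, 2))  # False
```

The check enumerates every function in the family, so it is only practical for toy parameters.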
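Often it is inconvenient to achieve the perfect joint probability of $m^{-k}$ due to rounding issues. One may instead define a $(\mu,k)$-independent family as one satisfying:
$$\forall \text{ distinct } (x_1,\dots,x_k)\in U^k,\ \forall (y_1,\dots,y_k)\in[m]^k:\quad \Pr_{h\in H}\left[h(x_1)=y_1\land\cdots\land h(x_k)=y_k\right]\le\mu/m^k$$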
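Observe that, even if $\mu$ is close to 1, the hash codes $h(x_i)$ are no longer guaranteed to be independent random variables, which is often a problem in the analysis of randomized algorithms. A common alternative way of dealing with rounding issues is therefore to prove that the hash family is close in statistical distance to a $k$-independent family, which allows the independence properties to be used as a black box.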
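The rounding issue can be made concrete on a toy family. The sketch below assumes the linear family $h_{a,b}(x)=((ax+b)\bmod p)\bmod m$ with hypothetical parameters $p=7$ and $m=3$, chosen only because they are small enough to enumerate. It computes the largest joint probability over all pairs of distinct keys and the smallest $\mu$ for which the family is $(\mu,2)$-independent.

```python
from fractions import Fraction
from itertools import product
from math import ceil

p, m = 7, 3  # hypothetical toy parameters: prime modulus and number of bins


def h(a, b, x):
    """Linear hash over the integers modulo p, then reduced to the m bins."""
    return ((a * x + b) % p) % m


def max_joint_probability(x1, x2):
    """Largest Pr[h(x1)=y1 and h(x2)=y2] over all code pairs, (a, b) uniform mod p."""
    counts = {}
    for a, b in product(range(p), repeat=2):
        pair = (h(a, b, x1), h(a, b, x2))
        counts[pair] = counts.get(pair, 0) + 1
    return Fraction(max(counts.values()), p * p)


worst = max(
    max_joint_probability(x1, x2)
    for x1 in range(p)
    for x2 in range(p)
    if x1 != x2
)
mu = worst * m * m  # smallest mu with worst <= mu / m**2

print(worst)                                     # 9/49
print(mu)                                        # 81/49
print(mu == Fraction(ceil(p / m) * m, p) ** 2)   # True
```

Here the perfect value would be $m^{-2}=1/9$, but because 7 is not a multiple of 3, bin 0 receives three of the seven residues and the worst pair of hash codes has probability $9/49$.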
The original technique for constructing $k$-independent hash functions, given by Carter and Wegman, was to select a large prime number $p$, choose $k$ random numbers modulo $p$, and use these numbers as the coefficients of a polynomial of degree at most $k-1$ whose values modulo $p$ are used as the value of the hash function. All polynomials of the given degree modulo $p$ are equally likely, and any polynomial is uniquely determined by any $k$-tuple of argument-value pairs with distinct arguments, from which it follows that any $k$-tuple of distinct arguments is equally likely to be mapped to any $k$-tuple of hash values.[1]
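The polynomial construction can be sketched in a few lines. The class name `PolynomialHash`, the default Mersenne prime $2^{61}-1$ and the use of Python's non-cryptographic `random` module are assumptions made for illustration. Keys are assumed to be integers in $[0,p)$, and the output lies in $[0,p)$; reducing it further to $m$ bins introduces the rounding issue discussed earlier.

```python
import random


class PolynomialHash:
    """Random polynomial of degree at most k - 1 over the integers modulo a prime p."""

    def __init__(self, k, p=(1 << 61) - 1, rng=None):
        rng = rng if rng is not None else random.Random()
        self.p = p
        # k independent uniform coefficients modulo p, constant term first.
        self.coeffs = [rng.randrange(p) for _ in range(k)]

    def __call__(self, x):
        # Horner's rule: evaluate the polynomial at x modulo p.
        acc = 0
        for c in reversed(self.coeffs):
            acc = (acc * x + c) % self.p
        return acc


h = PolynomialHash(k=4, rng=random.Random(2024))  # one draw from a 4-independent family
print(h(12345) == h(12345))  # True: a drawn function is deterministic
print(0 <= h(12345) < h.p)   # True
```

Evaluating the function costs $k-1$ multiplications modulo $p$, which is the efficiency side of the trade-off against the degree of independence.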
In general the polynomial can be evaluated in any finite field. Besides the fields modulo a prime, a popular choice is the field of size $2^n$.
See main article: Tabulation hashing. Tabulation hashing is a technique for mapping keys to hash values by partitioning each key into bytes, using each byte as the index into a table of random numbers (with a different table for each byte position), and combining the results of these table lookups by a bitwise exclusive or operation. Thus, it requires more randomness in its initialization than the polynomial method, but avoids possibly slow multiplication operations. Simple tabulation hashing is 3-independent but not 4-independent. Variations of tabulation hashing can achieve higher degrees of independence by performing table lookups based on overlapping combinations of bits from the input key, or by applying simple tabulation hashing iteratively.
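A minimal sketch of simple tabulation hashing follows, assuming 32-bit keys split into four bytes. The class name `SimpleTabulationHash` and the parameter `out_bits` are hypothetical, and Python's `random` module stands in for a source of random table entries. The final lines illustrate why the scheme is not 4-independent: for four keys that pair up on every byte position, the fourth hash code is determined by the other three.

```python
import random


class SimpleTabulationHash:
    """Simple tabulation hashing for 32-bit keys split into four bytes."""

    def __init__(self, out_bits=32, rng=None):
        rng = rng if rng is not None else random.Random()
        # One table of 256 random out_bits-bit words per byte position.
        self.tables = [
            [rng.getrandbits(out_bits) for _ in range(256)] for _ in range(4)
        ]

    def __call__(self, key):
        result = 0
        for position, table in enumerate(self.tables):
            byte = (key >> (8 * position)) & 0xFF  # extract one byte of the key
            result ^= table[byte]                  # combine lookups with XOR
        return result


h = SimpleTabulationHash(rng=random.Random(7))

# In these four keys each byte value occurs an even number of times at its
# position, so every table entry cancels and the XOR of the codes is always 0.
keys = [0x0101, 0x0102, 0x0201, 0x0202]
print(h(keys[0]) ^ h(keys[1]) ^ h(keys[2]) ^ h(keys[3]))  # 0
```

The initialization draws 4 × 256 random words, compared with only $k$ coefficients for the polynomial method, but evaluation uses only shifts, masks, table lookups and XORs.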
The notion of k-independence can be used to distinguish between collision resolution strategies in hash tables, according to the level of independence required to guarantee constant expected time per operation.
For instance, hash chaining takes constant expected time even with a 2-independent family of hash functions, because the expected time to perform a search for a given key is bounded by the expected number of collisions that key is involved in. By linearity of expectation, this expected number equals the sum, over all other keys in the hash table, of the probability that the given key and the other key collide. Because the terms of this sum only involve probabilistic events involving two keys, 2-independence is sufficient to ensure that this sum has the same value that it would for a truly random hash function.[1]
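The argument can be written out as a short calculation. As assumptions for this illustration, let $S$ be the set of $n$ keys stored in a table with $m$ bins and let $x\in S$ be the key being searched for. Under a 2-independent family, each pair of distinct keys collides with probability exactly $1/m$:

$$\mathbb{E}\bigl[\#\{y\in S\setminus\{x\} : h(y)=h(x)\}\bigr]=\sum_{y\in S\setminus\{x\}}\Pr_{h\in H}\left[h(x)=h(y)\right]=\sum_{y\in S\setminus\{x\}}\sum_{z\in[m]}\Pr_{h\in H}\left[h(x)=z\land h(y)=z\right]=(n-1)\cdot m\cdot m^{-2}=\frac{n-1}{m}$$

When the number of bins is proportional to the number of keys, this expectation is bounded by a constant.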
Double hashing is another method of hashing that requires a low degree of independence. It is a form of open addressing that uses two hash functions: one to determine the start of a probe sequence, and the other to determine the step size between positions in the probe sequence. As long as both of these are 2-independent, this method gives constant expected time per operation.
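The probe sequence of double hashing can be sketched as follows. The function name `probe_sequence` and the two fixed linear functions are hypothetical stand-ins; in practice `h1` and `h2` would be drawn at random from 2-independent families. The sketch also assumes that the table size `m` is prime, so that every non-zero step size visits all slots.

```python
def probe_sequence(key, h1, h2, m):
    """Yield the m table positions that double hashing inspects for `key`.

    Assumes m is prime, so every step size in 1..m-1 visits all m slots.
    """
    start = h1(key) % m
    step = 1 + h2(key) % (m - 1)   # force a non-zero step size
    for i in range(m):
        yield (start + i * step) % m


# Hypothetical stand-ins with fixed coefficients, for illustration only.
m = 11
h1 = lambda x: (3 * x + 5) % 13
h2 = lambda x: (7 * x + 2) % 13

print(list(probe_sequence(4, h1, h2, m)))
# [4, 9, 3, 8, 2, 7, 1, 6, 0, 5, 10]
```

An insertion would place the key in the first empty slot along this sequence, and a search would follow the same sequence until it finds the key or an empty slot.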
On the other hand, linear probing, a simpler form of open addressing where the step size is always one, can be guaranteed to work in constant expected time per operation with a 5-independent hash function, and there exist 4-independent hash functions for which it takes logarithmic time per operation.
For cuckoo hashing, the required degree of independence is not known as of 2021. In 2009 it was shown[4] that $O(\log n)$-independence suffices.
Kane, Nelson and Woodruff improved the Flajolet–Martin algorithm for the distinct elements problem in 2010.[7] To give an $\varepsilon$-approximation to the correct answer, they need a $\tfrac{\log 1/\varepsilon}{\log\log 1/\varepsilon}$-independent hash function.
The Count sketch algorithm for dimensionality reduction requires two hash functions, one 2-independent and one 4-independent.
The Karloff–Zwick algorithm for the MAX-3SAT problem can be implemented with 3-independent random variables.
The MinHash algorithm can be implemented using a $\log\tfrac{1}{\epsilon}$-independent hash function.