The hash join is an example of a join algorithm and is used in the implementation of a relational database management system. All variants of hash join algorithms involve building hash tables from the tuples of one or both of the joined relations, and subsequently probing those tables so that only tuples with the same hash code need to be compared for equality in equijoins.
Hash joins are typically more efficient than nested loops joins, except when the probe side of the join is very small. They require an equijoin predicate (a predicate comparing records from one table with those from the other table using a conjunction of equality operators '=' on one or more columns).
The classic hash join algorithm for an inner join of two relations proceeds as follows:
The first phase is usually called the "build" phase, while the second is called the "probe" phase. Similarly, the join relation on which the hash table is built is called the "build" input, whereas the other input is called the "probe" input.
This algorithm is simple, but it requires that the smaller join relation fits into memory, which is sometimes not the case. A simple approach to handling this situation proceeds as follows:
r
R
r
S
R
S
This is essentially the same as the block nested loop join algorithm. This algorithm may scan
S
A better approach is known as the "grace hash join", after the GRACE database machine for which it was first implemented.
This algorithm avoids rescanning the entire
S
R
S
It is possible that one or more of the partitions still does not fit into the available memory, in which case the algorithm is recursively applied: an additional orthogonal hash function is chosen to hash the large partition into sub-partitions, which are then processed as before. Since this is expensive, the algorithm tries to reduce the chance that it will occur by forming the smallest partitions possible during the initial partitioning phase.
The hybrid hash join algorithm is a combination of the classical hash join and grace hash join. It uses minimal amount of memory for partitioning like in grace hash join and uses the remaining memory to initialize a classical hash join during partitioning phase. During the partitioning phase, the hybrid hash join uses the available memory for two purposes:
R
S
R
R
Note that this algorithm is memory-sensitive, because there are two competing demands for memory (the hash table for partition 0, and the output buffers for the remaining partitions). Choosing too large a hash table for partition 0 might cause the algorithm to recurse because one of the non-zero partitions is too large to fit into memory.
Hash joins can also be evaluated for an anti-join predicate (a predicate selecting values from one table when no related values are found in the other). Depending on the sizes of the tables, different algorithms can be applied:
This is more efficient when the NOT IN table is smaller than the FROM table.
This is more efficient when the NOT IN table is larger than the FROM table.
Hash semi-join is used to return the records found in the other table. Unlike the plain join, it returns each matching record from the leading table only once, regardless of how many matches there are in the IN table.
As with the anti-join, semi-join can also be left and right:
The records are returned right after they produced a hit. The actual records from the hash table are ignored.
This is more efficient when the IN table is smaller than the FROM table.
With this algorithm, each record from the hash table (that is, FROM table) can only be returned once, since it is removed after being returned.
This is more efficient when the IN table is larger than the FROM table.