A rolling hash (also known as recursive hashing or rolling checksum) is a hash function where the input is hashed in a window that moves through the input.
A few hash functions allow a rolling hash to be computed very quickly—the new hash value is rapidly calculated given only the old hash value, the old value removed from the window, and the new value added to the window—similar to the way a moving average function can be computed much more quickly than other low-pass filters; and similar to the way a Zobrist hash can be rapidly updated from the old hash value.
One of the main applications is the Rabin–Karp string search algorithm, which uses the rolling hash described below. Another popular application is the rsync program, which uses a checksum based on Mark Adler's adler-32 as its rolling hash. Low Bandwidth Network Filesystem (LBFS) uses a Rabin fingerprint as its rolling hash. FastCDC (Fast Content-Defined Chunking) uses a compute-efficient Gear fingerprint as its rolling hash.
At best, rolling hash values are pairwise independent[1] or strongly universal. They cannot be 3-wise independent, for example.
The Rabin–Karp string search algorithm is often explained using a rolling hash function that only uses multiplications and additions:
H=c1ak-1+c2ak-2+c3ak-3+...+cka0
a
c1,...,ck
In order to avoid manipulating huge
H
n
a
n
n
Removing and adding characters simply involves adding or subtracting the first or last term. Shifting all characters by one position to the left requires multiplying the entire sum
H
a
H
a
a
a-1
H
The Rabin fingerprint is another hash, which also interprets the input as a polynomial, but over the Galois field GF(2). Instead of seeing the input as a polynomial of bytes, it is seen as a polynomial of bits, and all arithmetic is done in GF(2) (similarly to CRC32). The hash is the result of the division of that polynomial by an irreducible polynomial over GF(2). It is possible to update a Rabin fingerprint using only the entering and the leaving byte, making it effectively a rolling hash.
Because it shares the same author as the Rabin–Karp string search algorithm, which is often explained with another, simpler rolling hash, and because this simpler rolling hash is also a polynomial, both rolling hashes are often mistaken for each other. The backup software restic uses a Rabin fingerprint for splitting files, with blob size varying between and .[2]
Hashing by cyclic polynomial[3] - sometimes called Buzhash - is also simple and it has the benefit of avoiding multiplications, using barrel shifts instead. It is a form of tabulation hashing: it presumes that there is some hash function
h
[0,2L)
s
s(101)=011
⊕
H=sk-1(h(c1)) ⊕ sk-2(h(c2)) ⊕ \ldots ⊕ s(h(ck-1)) ⊕ h(ck),
where the multiplications by powers of two can be implemented by binary shifts. The result is a number in
[0,2L)
Computing the hash values in a rolling fashion is done as follows. Let
H
H
H\leftarrows(H)
c1
k
sk(h(c1))
H\leftarrows(H) ⊕ sk(h(c1)) ⊕ h(ck+1),
where
ck+1
Hashing by cyclic polynomials is strongly universal or pairwise independent: simply keep the first
L-k+1
H
k-1
H → H ÷ 2k-1
One of the interesting use cases of the rolling hash function is that it can create dynamic, content-based chunks of a stream or file. This is especially useful when it is required to send only the changed chunks of a large file over a network: a simple byte addition at the front of the file would normally cause all fixed size windows to become updated, while in reality, only the first "chunk" has been modified.[4]
A simple approach to making dynamic chunks is to calculate a rolling hash, and if the hash value matches an arbitrary pattern (e.g. all zeroes) in the lower N bits (with a probability of , given the hash has a uniform probability distribution) then it’s chosen to be a chunk boundary. Each chunk will thus have an average size of bytes. This approach ensures that unmodified data (more than a window size away from the changes) will have the same boundaries.
Once the boundaries are known, the chunks need to be compared by cryptographic hash value to detect changes.[5] The backup software Borg uses the Buzhash algorithm with a customizable chunk size range for splitting file streams.[6]
Such content-defined chunking is often used for data deduplication.[4] [6]
Several programs, including gzip (with the --rsyncable
option) and rsyncrypto, do content-based slicing based on this specific (unweighted) moving sum:[7]
S(n)=
n | |
\sum | |
i=n-8195 |
ci,
H(n)=S(n)\mod4096,
where
S(n)
n
ci
i
H(n)
S(n)
Shifting the window by one byte simply involves adding the new character to the sum and subtracting the oldest character (no longer in the window) from the sum.
For every
n
H(n)==0
n
n+1
Chunking is a technique to divide a data stream into a set of blocks, also called chunks. Content-defined chunking (CDC) is a chunking technique in which the division of the data stream is not based on fixed chunk size, as in fixed-size chunking, but on its content.
The Content-Defined Chunking algorithm needs to compute the hash value of a data stream byte by byte and split the data stream into chunks when the hash value meets a predefined value. However, comparing a string byte-by-byte will introduce the heavy computation overhead. FastCDC [8] proposes a new and efficient Content-Defined Chunking approach. It uses a fast rolling Gear hash algorithm,[9] skipping the minimum length, normalizing the chunk-size distribution, and last but not the least, rolling two bytes each time to speed up the CDC algorithm, which can achieve about 10X higher throughput than Rabin-based CDC approach.[10]
The basic version pseudocode is provided as follows:
algorithm FastCDC input: data buffer src, data length n, output: cut point i MinSize ← 2KB // split minimum chunk size is 2 KB MaxSize ← 64KB // split maximum chunk size is 64 KB Mask ← 0x0000d93003530000LL fp ← 0 i ← 0 // buffer size is less than minimum chunk size if n ≤ MinSize then return n if n ≥ MaxSize then n ← MaxSize // Skip the first MinSize bytes, and kickstart the hash while i < MinSize do fp ← (fp << 1) + Gear[''src''[''i'']] i ← i + 1 while i < n do fp ← (fp << 1) + Gear[''src''[''i'']] if !(fp & Mask) then return i i ← i + 1 return i
Where Gear array is a pre-calculated hashing array. Here FastCDC uses Gear hashing algorithm which can calculate the rolling hashing results quickly and keep the uniform distribution of the hashing results as Rabin. Compared with the traditional Rabin hashing algorithm, it achieves a much faster speed. Experiments suggest that it can generate nearly the same chunk size distribution in the much shorter time (about 1/10 of rabin-based chunking [10]) when segmenting the data stream.
All rolling hash functions can be computed in time linear in the number of characters and updated in constant time when characters are shifted by one position. In particular, computing the Rabin-Karp rolling hash of a string of length
k
O(k)
O(k)