Approximate Membership Query Filters (hereafter, AMQ filters) comprise a group of space-efficient probabilistic data structures that support approximate membership queries. An approximate membership query answers whether an element is in a set or not with a false positive rate of
\epsilon
Bloom filters are the most known AMQ filter, but there are other AMQ filters that support additional operations or have different space requirements.
AMQ filters have numerous applications, mainly in distributed systems and databases. There, they are often used to avoid network request or I/O operations that result from requesting elements that do not exist.
The approximate membership query problem is to store information about a set of elements in a space-efficient way. The goal is to answer queries about whether an element is in the set or not, while constraining false positives to a maximal probability
\epsilon
An AMQ filter lookup will determine whether an element is definitely not in the set or probably in the set.
In other words, if the filter represents a set and we are interested in a value, then the lookup function applied to behaves as follows:
s\inS
s\notinS
1-\epsilon
A false positive is a lookup of an element that is not part of the set, but where the lookup returns true. The probability of this happening is the false positive rate
\epsilon
After an element is inserted the lookup for this element must return true. Dynamic AMQ filters support inserting elements one at a time without rebuilding the data structure. Other AMQ filters have to be rebuilt after each insertion. Those are called static AMQ filters.
There is a tradeoff between storage size and the false positive rate
\epsilon
log2(1/\epsilon)
nlog2(1/\epsilon)(1+o(1))
n
There are different ways to solve the approximate membership query problem. The most known data structure are Bloom filters, but there are other data structures that perform better for some false positive rates and space requirements, support additional operations, or have other insertion and lookup times. Below we describe some well known AMQ filters.
A Bloom filter is a bit array of
m
k
m
k
true
is returned. To reduce the false positive rate, the number of hash functions and m
The idea of quotient filters is to hash an element and to split its fingerprint into the
r
dR
dQ
The space used by quotient filters is comparable to Bloom filters, but quotient filters can be merged without affecting their false positive rate.
Cuckoo filters are based on cuckoo hashing, but only fingerprints of the elements are stored in the hash table. Each element has two possible locations. The second location is calculated based on the first location and the fingerprint of the element. This is necessary to enable moving already inserted elements if both possible slots for an element are full.
After reaching a load threshold the insertion speed of cuckoo filter degrades. It is possible that an insertion fails, and the table must be rehashed. Whereas Bloom filters have always constant insertion time, but as the load factor increases the false positive rate increases as well.
A cuckoo filter supports deleting elements in the case where we know for certain that the element was in fact previously inserted. This is an advantage over Bloom filters and quotient filters which do not support this operation.
Xor filters[3] are static AMQ filters that are based on a Bloomier filter and use the idea of perfect hash tables. Similar to cuckoo filters, they save fingerprints of the elements in a hash table. The idea is that a query for an element
x
h0,h1,h2
x
1.23log2(1/\epsilon)
The disadvantage of this filter is that the data structure has to be rebuilt if additional elements are added. They are used in applications where no elements have to added afterwards and space is of importance.
Typical applications of AMQ filters are distributed systems and database systems. The AMQ filter functions as a proxy to the set of keys of a database or remote memory. Before a presumably slow query to the database or to remote memory is performed, the AMQ filter is used to give an approximate answer as to whether the key is in the database or in remote memory. The slow query is only performed when the AMQ filter returns true. Only in the case of a (hopefully rare) false positive is an unnecessary I/O or remote access performed. The applications are numerous and include package and resource routing, P2P networks, and distributed caching.[4]
AMQ filters are often used as an in-memory data structure to avoid expensive disk accesses. One application is Log-structured merge-trees or LSM trees. They have a fast in-memory component and one or multiple components on a disk which are trees themselves. Elements are inserted into the in-memory component until it reaches its maximal size, than the in-memory component is merged with the disk components. To speed up the lookup many LSM trees implement AMQ filters like Bloom filters or quotient filters. Those filters approximate for each component which elements are stored in it. LSM trees are used in databases like Apache AsterixDB, Bigtable, HBase, LevelDB, SQLite4.
Networks offer a lot of applications for AMQ filters. They are used to approximate a set of data that is located on a different servers. In many cases those AMQ filters can be seen as immutable. Even if the set on the remote server changes the AMQ filter is often not updated right away, but some false positives are tolerated. One example of this application is web cache sharing. If a proxy has a cache miss it wants to determine if another proxy has the requested data. Therefore, the proxy must know or at least approximate if another proxy holds the requested web page. This can be archived by periodically broadcasting a static AMQ filter of the URLs of the web pages a proxy has cached instead of broadcasting URL lists. In this setting, false negatives can occur if the cache changed in between periodic updates.
The same concept can be applied to P2P networks. AMQ filters can be used to approximate what is stored at each node of the network. The filter can be filled with ids or keywords of the actual documents of the nodes. False positives only lead to some unnecessary requests. AMQ filters have further applications in P2P networks for example finding the difference or intersection between sets stored on different nodes.