BK-tree explained

A BK-tree is a metric tree suggested by Walter Austin Burkhard and Robert M. Keller, specifically adapted to discrete metric spaces. For simplicity, consider an integer-valued discrete metric $d(x, y)$. A BK-tree is then defined as follows. An arbitrary element $a$ is selected as the root node. The root node may have zero or more subtrees. The $k$-th subtree is recursively built from all elements $b$ such that $d(a, b) = k$. BK-trees can be used for approximate string matching in a dictionary.
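As a concrete instance of an integer-valued metric suited to approximate string matching, the sketch below computes the Levenshtein distance (the metric used in the examples of this article) with the standard dynamic-programming recurrence. The function name levenshtein is an illustrative choice, not something prescribed by the BK-tree definition.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of insertions, deletions and substitutions."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]  # distance between a[:i] and the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


if __name__ == "__main__":
    print(levenshtein("book", "books"))  # 1
    print(levenshtein("book", "cake"))   # 4
```

Any other function works in its place as long as it returns non-negative integers and satisfies the metric axioms, in particular the triangle inequality.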

Example

This picture depicts the BK-tree for the set $W$ of words, obtained by using the Levenshtein distance:

- each node $u$ is labeled by a string $w_u \in W$;
- each arc $(u, v)$ is labeled by $d_{uv} = d(w_u, w_v)$, where $w_u$ denotes the word assigned to $u$.

The BK-tree is built so that:

- for every node $u$ of the BK-tree, the weights assigned to its egress arcs are distinct;
- for every arc $e = (u, v)$ labeled by $k$, each descendant $v'$ of $v$ satisfies $d(w_u, w_{v'}) = k$.

- Example 1: Consider the arc from "book" to "books". The distance between "book" and any word in the subtree rooted at "books" is equal to 1;
- Example 2: Consider the arc from "books" to "boo". The distance between "books" and any word in the subtree rooted at "boo" is equal to 2.
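Both examples can be checked by computing the Levenshtein distances directly. The words used below are those that the lookup walkthrough later in this article places under "books" and "boo"; the listing is an illustrative check, not a full description of the picture.

```latex
\begin{align*}
d(\text{book}, \text{books}) &= d(\text{book}, \text{boo}) = d(\text{book}, \text{boon}) = d(\text{book}, \text{cook}) = 1 \\
d(\text{books}, \text{boo}) &= d(\text{books}, \text{boon}) = d(\text{books}, \text{cook}) = 2
\end{align*}
```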

Insertion

The insertion primitive is used to populate a BK-tree $t$ according to a discrete metric $d$.

Input:

- $t$: the BK-tree ($d_{uv}$ denotes the weight assigned to an arc $(u, v)$; $w_u$ denotes the word assigned to a node $u$);
- $d$: the discrete metric used by $t$ (e.g. the Levenshtein distance);
- $w$: the element to be inserted into $t$.

Output:

- The node of $t$ corresponding to $w$.

Algorithm:

If $t$ is empty:
- Create a root node $r$ in $t$
- $w_r \leftarrow w$
- Return $r$

Set $u$ to the root of $t$

While $u$ exists:
- $k \leftarrow d(w_u, w)$
- If $k = 0$:
- - Return $u$
- Find $v$, the child of $u$ such that $d_{uv} = k$
- If $v$ is not found:
- - Create the node $v$
- - $w_v \leftarrow w$
- - Create the arc $(u, v)$
- - $d_{uv} \leftarrow k$
- - Return $v$
- $u \leftarrow v$
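The insertion steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the names Node, BKTree and levenshtein are hypothetical, each node stores its egress arcs in a dictionary keyed by arc weight, and the Levenshtein distance stands in for the metric.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


def levenshtein(a: str, b: str) -> int:
    """Integer-valued metric used as the default (an assumption of this sketch)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


@dataclass
class Node:
    word: str
    # Maps an arc weight d_uv to the child v; keys are distinct by construction.
    children: Dict[int, "Node"] = field(default_factory=dict)


class BKTree:
    def __init__(self, metric: Callable[[str, str], int] = levenshtein) -> None:
        self.metric = metric
        self.root: Optional[Node] = None

    def insert(self, word: str) -> Node:
        """Insert word and return the node that holds it."""
        if self.root is None:              # empty tree: word becomes the root
            self.root = Node(word)
            return self.root
        u = self.root
        while True:
            k = self.metric(u.word, word)
            if k == 0:                     # word is already stored at u
                return u
            v = u.children.get(k)
            if v is None:                  # no arc of weight k: attach a new leaf
                v = Node(word)
                u.children[k] = v
                return v
            u = v                          # descend along the arc of weight k


if __name__ == "__main__":
    tree = BKTree()
    for w in ["book", "books", "cake", "boo", "boon", "cook"]:
        tree.insert(w)
    print(sorted(tree.root.children))      # [1, 4]
```

Keying the children by arc weight makes the "find the child with weight k" step a single dictionary lookup and enforces that the egress arcs of a node carry distinct weights.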

Lookup

Given a searched element $w$, the lookup primitive traverses the BK-tree to find the element closest to $w$. The key idea is to restrict the exploration of $t$ to nodes that can still improve on the best candidate found so far, by taking advantage of the BK-tree organization and of the triangle inequality (cut-off criterion).

Input:

- $t$: the BK-tree;
- $d$: the corresponding discrete metric (e.g. the Levenshtein distance);
- $w$: the searched element;
- $d_{max}$: the maximum distance allowed between the best match and $w$, defaults to $+\infty$.

Output:

- $w_{best}$: the element stored in $t$ that is closest to $w$ according to $d$, or $\perp$ if not found.

Algorithm:

If $t$ is empty:
- Return $\perp$

Create $S$, a set of nodes to process, and insert the root of $t$ into $S$

$(w_{best}, d_{best}) \leftarrow (\perp, d_{max})$

While $S \ne \emptyset$:
- Pop an arbitrary node $u$ from $S$
- $d_u \leftarrow d(w, w_u)$
- If $d_u < d_{best}$:
- - $(w_{best}, d_{best}) \leftarrow (w_u, d_u)$
- For each egress arc $(u, v)$:
- - If $|d_{uv} - d_u| < d_{best}$: (cut-off criterion)
- - - Insert $v$ into $S$

Return $w_{best}$
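The lookup steps above can be sketched in Python as follows. The names Node, BKTree and levenshtein are hypothetical, the Levenshtein distance is assumed as the metric, None plays the role of $\perp$, and a plain list used as a stack stands in for the set of nodes to process (any pop order is allowed by the algorithm).

```python
import math
from dataclasses import dataclass, field
from typing import Dict, Optional


def levenshtein(a: str, b: str) -> int:
    """Integer-valued metric assumed by this sketch."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


@dataclass
class Node:
    word: str
    children: Dict[int, "Node"] = field(default_factory=dict)  # arc weight -> child


class BKTree:
    def __init__(self) -> None:
        self.root: Optional[Node] = None

    def insert(self, word: str) -> None:
        """Insertion, included only so that the sketch is self-contained."""
        if self.root is None:
            self.root = Node(word)
            return
        u = self.root
        while True:
            k = levenshtein(u.word, word)
            if k == 0:
                return
            if k not in u.children:
                u.children[k] = Node(word)
                return
            u = u.children[k]

    def lookup(self, word: str, d_max: float = math.inf) -> Optional[str]:
        """Return the stored word closest to word, or None if there is none."""
        if self.root is None:
            return None
        to_process = [self.root]
        best, d_best = None, d_max
        while to_process:
            u = to_process.pop()
            d_u = levenshtein(word, u.word)
            if d_u < d_best:
                best, d_best = u.word, d_u
            for d_uv, v in u.children.items():
                if abs(d_uv - d_u) < d_best:   # cut-off criterion
                    to_process.append(v)
        return best


if __name__ == "__main__":
    tree = BKTree()
    for w in ["book", "books", "cake", "boo", "boon", "cook"]:
        tree.insert(w)
    print(tree.lookup("cool"))  # cook
```

Because both comparisons are strict, as in the pseudocode, a stored word at distance exactly $d_{max}$ is not returned: in this sketch, looking up "cool" with d_max=1 yields None even though "cook" is at distance 1.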

Example of the lookup algorithm

Consider the example 8-node BK-tree shown above and set $w =$ "cool". $S$ is initialized to contain the root of the tree, which is subsequently popped as the first value of $u$, with $w_u =$ "book". Further, $d_u = 2$ since the distance from "book" to "cool" is 2, and $d_{best} = 2$ as this is the best (i.e. smallest) distance found thus far. Next, each outgoing arc from the root is considered in turn: the arc from "book" to "books" has weight 1, and since $|1 - 2| = 1$ is less than $d_{best} = 2$, the node containing "books" is inserted into $S$ for further processing. The next arc, from "book" to "cake", has weight 4, and since $|4 - 2| = 2$ is not less than $d_{best} = 2$, the node containing "cake" is not inserted into $S$. Therefore, the subtree rooted at "cake" is pruned from the search, as no word in that subtree can be closer to "cool" than the current best. To see why this pruning is correct, notice that a candidate word $c$ appearing in the subtree of "cake" and having distance less than 2 to "cool" would violate the triangle inequality. The triangle inequality requires that, among the three pairwise distances (viewed as sides of a triangle), no two sum to less than the third. Here, however, the distance from "cool" to "book" (which is 2) plus the distance from "cool" to $c$ (which is less than 2) cannot reach or exceed the distance from "book" to $c$ (which is 4, as it is for every word in the subtree of "cake"). Therefore, it is safe to disregard the entire subtree rooted at "cake".
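The pruning argument can also be written as a short chain of inequalities. Here $c$ stands for any word stored in the subtree rooted at "cake"; the first line is the BK-tree property of the arc of weight 4, the second is the triangle inequality.

```latex
\begin{align*}
d(\text{book}, c) &= 4 \\
d(\text{book}, c) &\le d(\text{book}, \text{cool}) + d(\text{cool}, c) = 2 + d(\text{cool}, c) \\
\Rightarrow \quad d(\text{cool}, c) &\ge 4 - 2 = 2 = d_{best}
\end{align*}
```

So no word in that subtree can beat the current best distance of 2, which is exactly what the cut-off test $|4 - 2| < d_{best}$ rejects.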

Next, the node containing "books" is popped from $S$, and now $d_u = 3$, the distance from "cool" to "books". As $d_u > d_{best}$, $d_{best}$ remains set at 2 and the single outgoing arc from the node containing "books" is considered: this arc, to "boo", has weight 2, and since $|2 - 3| = 1 < d_{best} = 2$, "boo" is inserted into $S$. Next, the node containing "boo" is popped from $S$, and $d_u = 2$, the distance from "cool" to "boo". This again does not improve upon $d_{best} = 2$. Each outgoing arc from "boo" is now considered: the arc from "boo" to "boon" has weight 1, and since $|2 - 1| = 1 < d_{best} = 2$, "boon" is added to $S$. Similarly, since $|2 - 2| = 0 < d_{best}$, "cook" is also added to $S$.

The two remaining elements of $S$ are then considered in arbitrary order: suppose the node containing "cook" is popped first, improving $d_{best}$ to distance 1; then the node containing "boon" is popped last, which has distance 2 from "cool" and therefore does not improve the best result. Finally, "cook" is returned as the answer $w_{best}$, with $d_{best} = 1$.

References

W. Burkhard and R. Keller. Some approaches to best-match file searching. CACM, 1973.
R. Baeza-Yates, W. Cunto, U. Manber, and S. Wu. Proximity matching using fixed queries trees. In M. Crochemore and D. Gusfield, editors, 5th Combinatorial Pattern Matching, LNCS 807, pages 198–212, Asilomar, CA, June 1994.
R. Baeza-Yates and G. Navarro. Fast Approximate String Matching in a Dictionary. Proc. SPIRE'98.

External links

A BK-tree implementation in Common Lisp with test results and performance graphs.
An explanation of BK-Trees and their relationship to metric spaces http://blog.notdot.net/2007/4/Damn-Cool-Algorithms-Part-1-BK-Trees
An explanation of BK-Trees with an implementation in C# http://nullwords.wordpress.com/2013/03/13/the-bk-tree-a-data-structure-for-spell-checking/
A BK-tree implementation in Lua https://profan.github.io/lua-bk-tree/
A BK-tree implementation in Python https://github.com/benhoyt/pybktree
