In computer science, an optimal binary search tree (Optimal BST), sometimes called a weight-balanced binary tree,[1] is a binary search tree which provides the smallest possible search time (or expected search time) for a given sequence of accesses (or access probabilities). Optimal BSTs are generally divided into two types: static and dynamic.
In the static optimality problem, the tree cannot be modified after it has been constructed. In this case, there exists some particular layout of the nodes of the tree which provides the smallest expected search time for the given access probabilities. Various algorithms exist to construct or approximate the statically optimal tree given the information on the access probabilities of the elements.
In the dynamic optimality problem, the tree can be modified at any time, typically by permitting tree rotations. The tree is considered to have a cursor starting at the root which it can move or use to perform modifications. In this case, there exists some minimal-cost sequence of these operations which causes the cursor to visit every node in the target access sequence in order. The splay tree is conjectured to have a constant competitive ratio compared to the dynamically optimal tree in all cases, though this has not yet been proven.
In the static optimality problem as defined by Knuth, we are given a set of n ordered elements and a set of 2n+1 probabilities. We will denote the elements a1 through an and the probabilities A1 through An and B0 through Bn. Ai is the probability of a search being done for element ai (a successful search). For 1 ≤ i < n, Bi is the probability of a search being done for an element strictly between ai and ai+1 (an unsuccessful search), B0 is the probability of a search for an element strictly less than a1, and Bn is the probability of a search for an element strictly greater than an. These 2n+1 probabilities cover all possible searches, and therefore add up to one.
The static optimality problem is the optimization problem of finding the binary search tree that minimizes the expected search time, given the 2n+1 probabilities. As the number of possible trees on a set of n elements is (2n choose n) · 1/(n+1), which is exponential in n, brute-force search is not usually a feasible solution.
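The count (2n choose n)/(n+1) is the nth Catalan number, and its exponential growth can be checked directly. A small sketch (class name illustrative):

```java
// Sketch: the number of distinct binary trees on n keys is the nth Catalan
// number, binomial(2n, n) / (n + 1). Class name is illustrative.
public class CatalanCount {
    public static long catalan(int n) {
        // Build binomial(2n, n) incrementally; every intermediate value is
        // itself a binomial coefficient, so the division stays exact.
        long c = 1;
        for (int i = 0; i < n; i++) {
            c = c * (2L * n - i) / (i + 1);
        }
        return c / (n + 1);
    }

    public static void main(String[] args) {
        System.out.println(catalan(3));   // 5 distinct trees on 3 keys
        System.out.println(catalan(10));  // 16796 distinct trees on 10 keys
    }
}
```

Already at n = 10 there are 16,796 candidate trees, which is why the dynamic programming approach below matters.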
In 1971, Knuth published a relatively straightforward dynamic programming algorithm capable of constructing the statically optimal tree in only O(n²) time. In this work, Knuth extended and improved the dynamic programming algorithm of Edgar Gilbert and Edward F. Moore, introduced in 1958.[3] Gilbert and Moore's algorithm required O(n³) time; Knuth reduced this to O(n²) by observing that the index of the optimal root for a range of keys always lies between the optimal root indices of the two ranges obtained by dropping the first or the last key.
To see this, consider what Knuth calls the "weighted path length" of a tree. The weighted path length of a tree of n elements is the sum of the lengths of all 2n+1 possible search paths, weighted by their respective probabilities. The tree with the minimal weighted path length is, by definition, statically optimal.
But weighted path lengths have an interesting property. Let E be the weighted path length of a binary tree, EL the weighted path length of its left subtree, and ER the weighted path length of its right subtree. Also let W be the sum of all the probabilities in the tree. Observe that when either subtree is attached to the root, the depth of each of its elements (and thus each of its search paths) is increased by one. Also observe that the root itself has a depth of one. This means that the difference in weighted path length between a tree and its two subtrees is exactly the sum of every single probability in the tree, leading to the following recurrence:

E = EL + ER + W
This recurrence leads to a natural dynamic programming solution. Let Ei,j be the weighted path length of the statically optimal search tree for all values between ai and aj, let Wi,j be the total weight of that tree, and let Ri,j be the index of its root. The algorithm can be built using the following formulas:

Ei,i−1 = Wi,i−1 = Bi−1   for 1 ≤ i ≤ n + 1
Wi,j = Wi,j−1 + Aj + Bj
Ei,j = min over i ≤ r ≤ j of (Ei,r−1 + Er+1,j + Wi,j)   for 1 ≤ i ≤ j ≤ n

The naive implementation of this recurrence takes O(n³) time, but Knuth's paper includes additional observations (in particular, the monotonicity of Ri,j) that reduce the running time to O(n²).
In addition to its dynamic programming algorithm, Knuth proposed two heuristics (or rules) to produce nearly optimal binary search trees. Studying nearly optimal binary search trees was necessary since Knuth's algorithm's time and space complexity can be prohibitive when n is substantially large.
Knuth's rules can be seen as the following:

1. Rule I (root-max): place the most frequently accessed element at the root of the tree, then proceed similarly on the subtrees.
2. Rule II (bisection): choose the root so that the total weights of the left and right subtrees are as nearly equal as possible, then proceed similarly on the subtrees.
Knuth's heuristics produce nearly optimal binary search trees in O(n log n) time using O(n) space.
While the O(n²) time taken by Knuth's algorithm is substantially better than the exponential time required for a brute-force search, it is still too slow to be practical when the number of elements in the tree is very large.
In 1975, Kurt Mehlhorn published a paper proving important properties regarding Knuth's rules. Mehlhorn's major results state that only one of Knuth's heuristics (Rule II) always produces nearly optimal binary search trees. On the other hand, the root-max rule could often lead to very "bad" search trees based on the following simple argument.
Let n = 2^k − 1, let Ai = 2^(−k) + εi with ε1 > ε2 > … > εn and Σ εi = 2^(−k), and let all the Bi be zero. Considering the weighted path length P of the tree produced by the root-max rule: because the access probabilities are strictly decreasing, the rule repeatedly places the remaining most probable (and smallest-keyed) element at the root. Thus, the resulting tree will be a tree that grows only on the right side (except for the deepest level of the tree), and the left side will always have terminal nodes. This tree has a path length on the order of n and, when compared with a balanced search tree (with path length on the order of log n), will perform substantially worse for the same frequency distribution.
In addition, Mehlhorn improved Knuth's work and introduced a much simpler algorithm that uses Rule II and closely approximates the performance of the statically optimal tree in only O(n) time. The algorithm follows the same idea as the bisection rule, choosing the tree's root so as to balance the total weight (by probability) of the left and right subtrees most closely. The strategy is then applied recursively on each subtree.
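This recursive weight-balancing strategy can be sketched as follows. The sketch assumes only successful-search probabilities a[1..n] (all Bi taken as zero), and all names are illustrative. For clarity the balance point is found by a linear scan, which gives O(n²) worst-case total time; Mehlhorn obtains O(n) overall by locating it with a two-sided exponential search instead.

```java
// Sketch of the weight-balancing (bisection) heuristic: pick the root that
// most nearly balances the probability mass of the left and right subtrees,
// then recurse. Names are illustrative; a[] is 1-indexed with a[0] unused.
public class WeightBalanced {
    // Fills depth[i] with the depth (root = 1) assigned to key i.
    public static int[] depths(double[] a) {
        int n = a.length - 1;
        double[] prefix = new double[n + 1];  // prefix[i] = a[1] + ... + a[i]
        for (int i = 1; i <= n; i++) prefix[i] = prefix[i - 1] + a[i];
        int[] depth = new int[n + 1];
        build(prefix, 1, n, 1, depth);
        return depth;
    }

    static void build(double[] prefix, int lo, int hi, int d, int[] depth) {
        if (lo > hi) return;
        int best = lo;
        double bestDiff = Double.MAX_VALUE;
        for (int r = lo; r <= hi; r++) {
            // Probability mass strictly left and strictly right of r in lo..hi.
            double left = prefix[r - 1] - prefix[lo - 1];
            double right = prefix[hi] - prefix[r];
            double diff = Math.abs(left - right);
            if (diff < bestDiff) { bestDiff = diff; best = r; }
        }
        depth[best] = d;
        build(prefix, lo, best - 1, d + 1, depth);
        build(prefix, best + 1, hi, d + 1, depth);
    }
}
```

On a uniform distribution this reduces to the usual median split, producing a balanced tree.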
That this strategy produces a good approximation can be seen intuitively by noting that the weights of the subtrees along any path form something very close to a geometrically decreasing sequence. In fact, this strategy generates a tree whose weighted path length is at most

2 + H / (1 − log(√5 − 1))

where H is the entropy of the probability distribution. Since no optimal binary search tree can ever do better than a weighted path length of

(1/log 3) · H = H / log 3,

this approximation is very close.
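To make these bounds concrete, the following sketch (names illustrative) computes the entropy H of a distribution and evaluates both bounds numerically, with logarithms taken base 2:

```java
// Sketch: evaluate Mehlhorn's upper bound 2 + H/(1 - log2(sqrt(5) - 1)) and
// the lower bound H/log2(3) for a sample probability distribution.
// Class and method names are illustrative.
public class EntropyBounds {
    public static double log2(double x) { return Math.log(x) / Math.log(2); }

    public static double entropy(double[] p) {
        double h = 0;
        for (double pi : p) if (pi > 0) h -= pi * log2(pi);
        return h;
    }

    public static void main(String[] args) {
        double[] p = {0.25, 0.25, 0.25, 0.25};  // uniform: H = 2 bits
        double h = entropy(p);
        double upper = 2 + h / (1 - log2(Math.sqrt(5) - 1));
        double lower = h / log2(3);
        System.out.printf("H=%.3f upper=%.3f lower=%.3f%n", h, upper, lower);
    }
}
```

For the uniform distribution on four keys, H = 2, the lower bound is about 1.26, and Mehlhorn's guarantee is about 4.88; the actual balanced tree achieves a weighted path length of 2, well inside both.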
See main article: Garsia–Wachs algorithm. In the special case that all of the Ai are zero, the optimal tree can be found in O(n log n) time with the Garsia–Wachs algorithm.
The following code snippet determines the cost of an optimal binary search tree, given a set of keys and the probability of each key being the search key:

public static float calculateOptimalSearchTree(int numNodes, float[] probabilities, int[][] roots)
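One way this method might be completed is sketched below. The assumptions are not from the original: probabilities[1..numNodes] are taken as the successful-search probabilities Ai, all unsuccessful-search probabilities Bi are taken as zero, and roots[i][j] records the root chosen for keys i..j so the tree can be reconstructed afterwards. This is the naive O(n³) form of the recurrence, without Knuth's O(n²) speedup.

```java
// Sketch completing the snippet above under the assumptions stated in the
// lead-in: A_i probabilities only (B_i = 0), 1-indexed arrays.
public class OptimalBST {
    public static float calculateOptimalSearchTree(int numNodes, float[] probabilities, int[][] roots) {
        int n = numNodes;
        float[][] e = new float[n + 2][n + 1];  // e[i][j]: optimal weighted path length, keys i..j
        float[][] w = new float[n + 2][n + 1];  // w[i][j]: total weight of keys i..j
        // Empty ranges e[i][i-1] and w[i][i-1] stay zero, since the B_i are zero here.
        for (int len = 1; len <= n; len++) {
            for (int i = 1; i + len - 1 <= n; i++) {
                int j = i + len - 1;
                w[i][j] = w[i][j - 1] + probabilities[j];
                e[i][j] = Float.MAX_VALUE;
                for (int r = i; r <= j; r++) {
                    // E[i][j] = min over roots r of (E[i][r-1] + E[r+1][j] + W[i][j])
                    float cost = e[i][r - 1] + e[r + 1][j] + w[i][j];
                    if (cost < e[i][j]) {
                        e[i][j] = cost;
                        roots[i][j] = r;
                    }
                }
            }
        }
        return e[1][n];
    }
}
```

For example, with three keys of probabilities 0.2, 0.6, 0.2, the method returns a cost of 1.4 and records key 2 as the overall root, matching the intuition that the most probable key should sit at depth one.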
There are several different definitions of dynamic optimality, all of which are effectively equivalent to within a constant factor in terms of running-time. The problem was first introduced implicitly by Sleator and Tarjan in their paper on splay trees, but Demaine et al. give a very good formal statement of it.
In the dynamic optimality problem, we are given a sequence of accesses x1, ..., xm on the keys 1, ..., n. For each access, we are given a pointer to the root of our BST and may use the pointer to perform any of the following operations:

1. Move the pointer to the left child of the current node.
2. Move the pointer to the right child of the current node.
3. Move the pointer to the parent of the current node.
4. Perform a single rotation on the current node and its parent.

(It is the presence of the fourth operation, which rearranges the tree during the accesses, which makes this the dynamic optimality problem.)
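These pointer operations can be modeled on a simple node type with parent pointers, as in the following sketch (all class and method names are illustrative):

```java
// Sketch of the dynamic-BST cost model: a cursor that can move to a child or
// parent, or rotate the current node with its parent. Names are illustrative.
class Node {
    int key;
    Node left, right, parent;
    Node(int key) { this.key = key; }
}

class Cursor {
    Node p;  // the pointer; starts at the root before each access
    Cursor(Node root) { p = root; }

    void moveLeft()  { p = p.left; }
    void moveRight() { p = p.right; }
    void moveUp()    { p = p.parent; }

    // Single rotation of the current node with its parent; p moves one level up.
    void rotate() {
        Node q = p.parent, g = q.parent;
        if (q.left == p) {             // p is a left child: right rotation
            q.left = p.right;
            if (p.right != null) p.right.parent = q;
            p.right = q;
        } else {                       // p is a right child: left rotation
            q.right = p.left;
            if (p.left != null) p.left.parent = q;
            p.left = q;
        }
        p.parent = g;
        q.parent = p;
        if (g != null) { if (g.left == q) g.left = p; else g.right = p; }
    }
}
```

Charging one unit per operation, the cost of an access is the number of cursor moves and rotations performed before the cursor rests on the target node.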
For each access, our BST algorithm may perform any sequence of the above operations as long as the pointer eventually ends up on the node containing the target value xi. The time it takes a given dynamic BST algorithm to perform a sequence of accesses is equivalent to the total number of such operations performed during that sequence. Given any sequence of accesses on any set of elements, there is some minimum total number of operations required to perform those accesses. We would like to come close to this minimum.
While it is impossible to implement this "God's algorithm" without foreknowledge of exactly what the access sequence will be, we can define OPT(X) as the number of operations it would perform for an access sequence X, and we can say that an algorithm is dynamically optimal if, for any X, it performs X in time O(OPT(X)) (that is, it has a constant competitive ratio).
There are several data structures conjectured to have this property, but none proven. It is an open problem whether there exists a dynamically optimal data structure in this model.
See main article: Splay tree.
The splay tree is a form of binary search tree invented in 1985 by Daniel Sleator and Robert Tarjan on which the standard search tree operations run in O(log n) amortized time.
See main article: Tango tree.
The tango tree is a data structure proposed in 2004 by Erik D. Demaine, Dion Harmon, John Iacono, and Mihai Pătrașcu which has been proven to perform any sufficiently long access sequence X in time O(log log n · OPT(X)). While this is not dynamically optimal, its competitive ratio of log log n is still very small for all reasonable values of n.
In 2013, John Iacono published a paper which uses the geometry of binary search trees to provide an algorithm which is dynamically optimal if any binary search tree algorithm is dynamically optimal.[7] Nodes are interpreted as points in two dimensions, and the optimal access sequence is the smallest arborally satisfied superset of those points. Unlike splay trees and tango trees, Iacono's data structure is not known to be implementable in constant time per access sequence step, so even if it is dynamically optimal, it could still be slower than other search tree data structures by a non-constant factor.
The interleave lower bound is an asymptotic lower bound on dynamic optimality.