In computer science and probability theory, a random binary tree is a binary tree selected at random from some probability distribution on binary trees. Different distributions have been used, leading to different properties for these trees.
Random binary trees have been used for analyzing the average-case complexity of data structures based on binary search trees. For this application it is common to use random trees formed by inserting nodes one at a time according to a random permutation. The resulting trees are very likely to have logarithmic depth and logarithmic Strahler number. The treap and related balanced binary search trees use update operations that maintain this random structure even when the update sequence is non-random.
Other distributions on random binary trees include the uniform discrete distribution in which all distinct trees are equally likely, distributions on a given number of nodes obtained by repeated splitting, binary tries and radix trees for random data, and trees of variable size generated by branching processes.
For random trees that are not necessarily binary, see random tree.
A binary tree is a rooted tree in which each node may have up to two children (the nodes directly below it in the tree), and those children are designated as being either left or right. It is sometimes convenient instead to consider extended binary trees in which each node is either an external node with zero children, or an internal node with exactly two children. A binary tree that is not in extended form may be converted into an extended binary tree by treating all its nodes as internal, and adding an external node for each missing child of an internal node. In the other direction, an extended binary tree with at least one internal node may be converted back into a non-extended binary tree by removing all its external nodes. In this way, these two forms are almost entirely equivalent for the purposes of mathematical analysis, except that the extended form allows a tree consisting of a single external node, which does not correspond to anything in the non-extended form. For the purposes of computer data structures, the two forms differ, as the external nodes of the first form may be represented explicitly as objects in a data structure.
In a binary search tree the internal nodes are labeled by numbers or other ordered values, called keys, arranged so that an inorder traversal of the tree lists the keys in sorted order. The external nodes remain unlabeled. Binary trees may also be studied with all nodes unlabeled, or with labels that are not given in sorted order. For instance, the Cartesian tree data structure uses labeled binary trees that are not necessarily binary search trees.
A random binary tree is a random tree drawn from a certain probability distribution on binary trees. In many cases, these probability distributions are defined using a given set of keys, and describe the probabilities of binary search trees having those keys. However, other distributions are possible, not necessarily generating binary search trees, and not necessarily giving a fixed number of nodes.
For any sequence of distinct ordered keys, one may form a binary search tree in which each key is inserted in sequence as a leaf of the tree, without changing the structure of the previously inserted keys. The position for each insertion can be found by a binary search in the previous tree. The random permutation model, for a given set of keys, is defined by choosing the sequence randomly from the permutations of the set, with each permutation having equal probability.
For instance, if the three keys 1,3,2 are inserted into a binary search tree in that sequence, the number 1 will sit at the root of the tree, the number 3 will be placed as its right child, and the number 2 as the left child of the There are six different permutations of the keys 1,2, and 3, but only five trees may be constructed from them. That is because the permutations 2,1,3 and 2,3,1 form the same tree. Thus, this tree has probability
\tfrac26=\tfrac13
For any key
x
n
x
2logn+O(1)
log
O
x
y
y
x
y
x
y
[x,y]
x
\tfrac12
x
\tfrac13
x
2logn+O(1)
x
The longest root-to-leaf path, in a random binary search tree, is longer than the expected path length, but only by a constant factor. Its length, for a tree with
n
\beta
0<\beta<1
In the random permutation model, each key except the smallest and largest has probability
\tfrac13
\tfrac12
n\ge2
(n+1)/3
The Strahler number of vertices in any tree is a measure of the complexity of the subtrees under those vertices. A leaf (external node) has Strahler number one. For any other node, the Strahler number is defined recursively from the Strahler numbers of its children. In a binary tree, if two children have different Strahler numbers, the Strahler number of their parent is the larger of the two child numbers. But if two children have equal Strahler numbers, their parent has a number that is greater by one. The Strahler number of the whole tree is the number at the root node. For
n
log3n+O(1)
log3n+o(logn)
In applications of binary search tree data structures, it is rare for the keys to be inserted without deletion in a random order, limiting the direct applications of random binary trees. However, algorithm designers have devised data structures that allow arbitrary insertions and deletions to preserve the property that the shape of the tree is random, as if the keys had been inserted randomly.
If a given set of keys is assigned numeric priorities (unrelated to their values), these priorities may be used to construct a Cartesian tree for the numbers, the binary search tree that would result from inserting the keys in priority order. By choosing the priorities to be independent random real numbers in the unit interval, and by maintaining the Cartesian tree structure using tree rotations after any insertion or deletion of a node, it is possible to maintain a data structure that behaves like a random binary search tree. Such a data structure is known as a treap or a randomized binary search tree.[2]
Variants of the treap including the zip tree and zip-zip tree replace the tree rotations by "zipping" operations that split and merge trees, and that limit the number of random bits that need to be generated and stored alongside the keys. The result of these optimizations is still a tree with a random structure, but one that does not exactly match the random permutation model.[3]
The number of binary trees with
n
n=1,2,3,...
n
n
n
log4n+O(1)
Due to their large heights, this model of equiprobable random trees is not generally used for binary search trees. However, it has other applications, including:
An algorithm of Jean-Luc Rémy generates a uniformly random binary tree of a specified size in time linear in the size, by the following process. Start with a tree consisting of a single external node. Then, while the current tree has not reached the target size, repeatedly choose one of its nodes (internal or external) uniformly at random. Replace the chosen node by a new internal node, having the chosen node as one of its children (equally likely left or right), and having a new external node as its other child. Stop when the target size is reached.[4]
The Galton–Watson process describes a family of distributions on trees in which the number of children at each node is chosen randomly, independently of other nodes. For binary trees, two versions of the Galton–Watson process are in use, differing only in whether an extended binary tree with only one node, an external root node, is allowed:
p
1-p
p
p=\tfrac12
The probability
p=\tfrac12
p\le\tfrac12
p>\tfrac12
Another way to generate the same trees is to make a sequence of coin flips, with probability
p
1-p
Because the number of internal nodes equals the number of heads in this coin flip sequence, all trees with a given number
n
p
p
p
p
p
p=\tfrac12
i
p<\tfrac12
p=\tfrac12
any particular tree with
n
Galton–Watson processes were originally developed to study the spread and extinction of human surnames, and have been widely applied more generally to the dynamics of human or animal populations. These processes have been generalized to models where the probability of being an internal or external node at a given level of the tree (a generation, in the population dynamics application) is not fixed, but depends on the number of nodes at the previous level. A version of this process, with the critical has been studied as a model for speciation, where it is known as the critical branching process. In this process, each species has an exponentially distributed lifetime, and over the course of its lifetime produces child species at a rate equal to the lifetime. When a child is produced, the parent continues as the left branch of the evolutionary tree, and the child becomes the right branch.
Another application of critical Galton–Watson trees (in the version where the root must be internal) arises in the Karger–Stein algorithm for finding minimum cuts in graphs, using a recursive edge contraction process. This algorithm calls itself twice recursively, with each call having probability at least
\tfrac12
n
2log2n
O(n2log3n)
Devroye and Robson consider a related continuous-time random process in which each external node is eventually replaced by an internal node with two external children, at an exponentially distributed time after its first appearance as an external node. The number of external nodes in the tree, at any time, is modeled by a simple birth process or Yule process in which the members of a population give birth at a constant rate: giving birth to one child, in the Yule process, corresponds to being replaced by two children, in Devroye and Robson's model. If this process is stopped at any fixed time, the result is a binary tree of a random size (depending on the stopping time), distributed according to the random permutation model for that size. Devroye and Robson use this model as part of an algorithm to quickly generate trees in the random permutation model, described by their numbers of nodes at each depth rather than by their exact structure. A discrete variant of this process starts with a tree consisting of a single external node, and repeatedly replaces a randomly-chosen external node by an internal node with two external children. Again, if this is stopped at a fixed time (with a fixed size), the resulting tree is distributed according to the random permutation model for that size.
Another form of binary tree, the binary trie or digital search tree, has a collection of binary numbers labeling some of its external nodes. The internal nodes of the tree represent prefixes of their binary representations that are shared by two or more of the numbers. The left and right children of an internal node are obtained by extending the corresponding prefix by one more bit, a zero or a one bit respectively. If this extension does not match any of the given numbers, or it matches only one of them, the result is an external node; otherwise it is another internal node. Random binary tries have been studied, for instance for sets of random real numbers generated independently in the unit interval. Despite the fact that these trees may have some empty external nodes, they tend to be better balanced than random binary search trees. For
n
log2n
2log2n
A variant of the trie, the radix tree or compressed trie, eliminates empty external nodes and their parent internal nodes. The remaining internal nodes correspond to prefixes for which both possible extensions, by a zero or a one bit, are used by at least one of the randomly chosen numbers. For a radix tree for
n
Luc Devroye and Paul Kruszewski describe a recursive process for constructing random binary trees with
n
x
(0,1)
xn
x
n
x
x