PH-tree
Type: tree, map
Invented: 2014
Space complexity: O(n) average, O(n) worst case
Search: O(log n) average, O(log n) worst case
Insert: O(log n) average, O(log n) worst case
Delete: O(log n) average, O(log n) worst case
The PH-tree is a tree data structure used for spatial indexing of multi-dimensional data (keys) such as geographical coordinates, points, feature vectors, rectangles or bounding boxes. The PH-tree is a space-partitioning index with a structure similar to that of a quadtree or octree. However, unlike quadtrees, it uses a splitting policy based on tries, similar to that of Crit bit trees, which relies on the bit representation of the keys. The bit-based splitting policy, combined with the use of different internal representations for nodes, provides scalability with high-dimensional data. The bit-representation splitting policy also imposes a maximum depth, thus avoiding degenerate trees and the need for rebalancing.
The basic PH-tree is a spatial index that maps keys, which are d-dimensional vectors of integers, to user-defined values. The PH-tree is a multi-dimensional generalization of a Crit bit tree in the sense that a Crit bit tree is equivalent to a PH-tree with d = 1. A d-dimensional PH-tree is a tree of nodes where each node partitions space by subdividing it into 2^d quadrants.
Some other structural properties of PH-trees are that every inner node has at least two entries, so a tree with n keys has fewer than n nodes and O(n) space, and that the maximum depth is fixed by the bit width of the keys, independent of d.
Similar to most quadtrees, the PH-tree is a hierarchy of nodes where every node splits the space in all dimensions. Thus, a node can have up to 2^d subnodes, one per quadrant.
The PH-tree uses the bits of the multi-dimensional keys to determine their position in the tree. All keys that have the same leading bits are stored in the same branch of the tree.
For example, in a node at level L, to determine the quadrant where a key should be inserted (or removed or looked up), the operation looks at the L's bit of each dimension of the key. For a 3D node with 8 quadrants (forming a cube), the L's bit of the first dimension of the key determines whether the target quadrant is on the left or the right of the cube, the L's bit of the second dimension determines whether it is at the front or the back, and the L's bit of the third dimension determines bottom vs. top, see picture.
Example with three 1D keys with 8-bit values:

k0 = {1}_10 = {00000001}_2
k1 = {4}_10 = {00000100}_2
k2 = {35}_10 = {00100011}_2

Inserting k0 and k1 into an empty tree creates a single node at level L = 5, because bit 5 (counting from the most significant bit) is the first bit in which the two keys differ. Inserting k2, which first differs from the other keys at bit 2, adds a new node at L = 2 above the existing node; one quadrant of the new node references the existing node at L = 5 and another quadrant holds k2.
With 2D keys every node has 2^d = 4 quadrants. The quadrant address h of a key within a node is assembled from one bit per dimension, taken at the node's level. For example, a 2D key k0 whose relevant bits are 0 in both dimensions maps to h = {00}_2, while a key k1 with bits 0 and 1 maps to h = {01}_2. The address h determines the position of the entry inside the node.
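As a sketch of this address calculation (the function name, tuple-based keys, and the 8-bit key width are illustrative assumptions, not a specific implementation), the bit at the node's level is taken from each dimension of the key and combined into h:

```python
WIDTH = 8  # assumed key width in bits; real implementations typically use 64

def extract_h(key, level):
    """Combine the bit at 'level' (0 = most significant bit) of each
    dimension of 'key' into the quadrant address h."""
    bit_pos = WIDTH - 1 - level
    h = 0
    for coordinate in key:
        h = (h << 1) | ((coordinate >> bit_pos) & 1)
    return h

# 1D keys from the example above: k0 = 1 and k1 = 4 first differ at level 5
print(extract_h((1,), 5))   # 0
print(extract_h((4,), 5))   # 1
# a 2D key combines one bit per dimension, here h = {01}_2 = 1
print(extract_h((1, 4), 5)) # 1
```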
The ordering of the entries in a node always follows Z-ordering. Entries in a node can, for example, be stored in fixed-size arrays of size 2^d. This gives O(1) access to the entry at a given address h, but the memory per node grows with O(2^d), which becomes wasteful for high d.
Another solution is to store entries in a sorted collection, such as dynamic arrays and/or B-trees. This slows down lookup operations to O(log n_entries), where n_entries is the number of entries in the node, but reduces the memory per node to O(n_entries).
The original implementation aimed for minimal memory consumption by switching between fixed and dynamic array representations depending on which uses less memory. Other implementations (https://github.com/tzaeschke/phtree, https://github.com/improbable-eng/phtree-cpp) do not switch dynamically but use, for example, fixed arrays for d ≲ 4, dynamic arrays for d ≲ 8, and B-trees for larger d.
Lookup, insertion and removal operations all work very similarly: find the correct node, then perform the operation on that node. Window queries and k-nearest-neighbor searches are more complex.
The Lookup operation determines whether a key exists in the tree. It walks down the tree and checks at every node whether it contains a candidate subnode or a user value that matches the key.
function lookup(key) is
    entry ← get_root_entry    // if the tree is not empty, the root entry contains a root node
    while entry != NIL && entry.is_subnode do
        node ← entry.get_node
        entry ← node.get_entry(key)
    repeat
    return entry              // entry can be NIL
function get_entry(key) is
    node ← current node
    h ← extract_bits_at_depth(key, node.get_depth)
    entry ← node.get_entry_at(h)
    return entry              // entry can be NIL
The Insert operation inserts a new key-value pair into the tree unless the key already exists. The operation traverses the tree like the Lookup function and then inserts the key into the appropriate node. There are several cases to consider:
function insert(node, key, value) is
    level ← node.get_level                    // level is 0 for the root
    h ← extract_bits_at_level(key, level)
    entry ← node.get_entry(h)
    if entry = NIL then
        // Case 1: the quadrant is empty, insert a new entry
        node.set_entry(h, create_entry(key, value))
    else if entry.get_key = key then
        // Case 2: collision, there is already an entry with this key
        return failed_insertion
    else
        // Case 3: the quadrant is occupied by an entry with a different key
        // (or by a subnode with a non-matching prefix); split it
        level_diff ← get_level_of_difference(key, entry.get_key)
        entry_new ← create_entry(key, value)
        // new subnode with the existing entry and the new entry
        subnode_new ← create_node(level_diff, entry, entry_new)
        node.set_entry(h, subnode_new)
    end if
    return
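The lookup and insert cases can be sketched as a toy implementation. This is a simplified model under stated assumptions: 8-bit keys, dict-based nodes, and a stored representative key per subnode for the prefix check; all names are illustrative, not the authors' code.

```python
WIDTH = 8  # assumed key width in bits

def h_at(key, level):
    # quadrant address: one bit per dimension, taken at 'level' (0 = MSB)
    bit = WIDTH - 1 - level
    h = 0
    for c in key:
        h = (h << 1) | ((c >> bit) & 1)
    return h

def diff_level(k1, k2):
    # first level at which the two keys differ in any dimension
    for level in range(WIDTH):
        bit = WIDTH - 1 - level
        if any(((a >> bit) ^ (b >> bit)) & 1 for a, b in zip(k1, k2)):
            return level
    return WIDTH  # keys are identical

class PHTree:
    def __init__(self):
        self.root = None

    def insert(self, key, value):
        if self.root is None:
            self.root = {'level': 0, 'prefix': key, 'quadrants': {}}
        node = self.root
        while True:
            h = h_at(key, node['level'])
            entry = node['quadrants'].get(h)
            if entry is None:
                node['quadrants'][h] = ('leaf', key, value)  # case 1: empty quadrant
                return True
            if entry[0] == 'sub' and diff_level(key, entry[1]['prefix']) >= entry[1]['level']:
                node = entry[1]  # the key matches the subnode's prefix: descend
                continue
            if entry[0] == 'leaf' and entry[1] == key:
                return False     # case 2: collision, the key already exists
            # case 3: split at the level where the key first differs from the entry
            old_key = entry[1] if entry[0] == 'leaf' else entry[1]['prefix']
            lvl = diff_level(key, old_key)
            sub = {'level': lvl, 'prefix': key, 'quadrants': {}}
            sub['quadrants'][h_at(old_key, lvl)] = entry
            sub['quadrants'][h_at(key, lvl)] = ('leaf', key, value)
            node['quadrants'][h] = ('sub', sub)
            return True

    def lookup(self, key):
        node = self.root
        while node is not None:
            entry = node['quadrants'].get(h_at(key, node['level']))
            if entry is None:
                return None
            if entry[0] == 'leaf':
                return entry[2] if entry[1] == key else None
            node = entry[1]
            if diff_level(key, node['prefix']) < node['level']:
                return None  # the key deviates from the subnode's prefix
        return None
```

Inserting the 1D example keys 1, 4 and 35 reproduces the structure described above: keys 1 and 4 end up in a subnode at level 5, and inserting 35 splits at level 2.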
Removal works inversely to insertion, with the additional constraint that any subnode has to be removed if fewer than two entries remain; the remaining entry is moved to the parent node.
Window queries are queries that return all keys that lie inside a rectangular axis-aligned hyper-box. They can be defined by two d-dimensional points, min and max, that denote the lower and upper corners of the query box.
function query(node, min, max, result_list) is
    foreach entry ← node.get_entries do
        if entry.is_subnode then
            if entry.get_prefix >= min and entry.get_prefix <= max then
                query(entry.get_subnode, min, max, result_list)
            end if
        else
            if entry.get_key >= min and entry.get_key <= max then
                result_list.add(entry)
            end if
        end if
    repeat
    return
In order to accurately estimate query time complexity, the analysis needs to include the dimensionality d. Comparing a key with the min/max corners of the query box requires O(d) time, so checking all n_entries entries of a node costs O(d · n_entries). In addition, a node has up to 2^d quadrants, so for large d it is not feasible to examine every quadrant individually.
The idea is to find minimum and maximum values for the quadrant addresses h such that quadrants outside the range [h_min, h_max] cannot overlap the query box and can be skipped. Let C be the center of the node, i.e. the d-dimensional point at which the node splits the space. For every dimension i with 0 ≤ i < d, bit i of h_min and h_max is derived by comparing min_i and max_i of the query box with C_i:

h_min,i = (min_i ≥ C_i)
h_max,i = (max_i ≥ C_i)

A bit of h_min is 1 if no quadrant with a 0 at that position can overlap the query box, and a bit of h_max is 0 if no quadrant with a 1 at that position can overlap it. As a consequence, all quadrants with h < h_min or h > h_max can be excluded from the search.
function query(node, min, max, result_list) is
    h_min ← calculate h_min
    h_max ← calculate h_max
    for each entry ← node.get_entries_range(h_min, h_max) do
        [...]
    repeat
    return
Calculating h_min and h_max requires O(2d) = O(d) comparisons. With the search restricted to the quadrants in [h_min, h_max], the query time per node is reduced to O(d + d · n_entries), where n_entries is now only the number of entries in that range.
Between h_min and h_max, not every quadrant actually overlaps the query box; h_min and h_max are only bounds on the range of candidate addresses. A quadrant with address h overlaps the query box exactly if every 0-bit of h occurs at a position where h_min has a 0, and every 1-bit of h occurs at a position where h_max has a 1. Verifying this bit by bit over d bits costs O(d), but for d ≤ 64 the quadrant addresses fit into a single 64-bit integer and the check can be performed with two bitwise operations in O(1):
function is_overlap(h, h_min, h_max) is
    return (h | h_min) & h_max = h
function query(node, min, max, result_list) is
    h_min ← calculate h_min
    h_max ← calculate h_max
    for each entry ← node.get_entries_range(h_min, h_max) do
        h ← entry.get_h
        if (h | h_min) & h_max = h then
            [...]
        end if
    repeat
    return
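A sketch of the address-range computation and the constant-time overlap test. The function names and the explicit node-center parameter are illustrative assumptions; a real node derives its center from its prefix and level.

```python
def calc_h_min_max(center, qmin, qmax):
    """Derive h_min and h_max for a node split at 'center' and a query
    box [qmin, qmax]; bit i of the address comes from dimension i."""
    h_min = h_max = 0
    for c, lo, hi in zip(center, qmin, qmax):
        h_min = (h_min << 1) | (1 if lo >= c else 0)
        h_max = (h_max << 1) | (1 if hi >= c else 0)
    return h_min, h_max

def is_overlap(h, h_min, h_max):
    # quadrant h overlaps the query box iff every 1-bit of h_min is set
    # in h and every 1-bit of h is permitted by h_max
    return (h | h_min) & h_max == h

# node centered at (8, 8); the query box covers the lower half of
# dimension 0 and the upper half of dimension 1, so only h = {01}_2 overlaps
h_min, h_max = calc_h_min_max((8, 8), (2, 9), (5, 12))
print([h for h in range(4) if is_overlap(h, h_min, h_max)])  # [1]
```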
The resulting time complexity is O(d + n_entries) for filtering the entries of a node, compared to O(d · n_entries) for comparing every entry's full key.
For higher dimensions with larger nodes it is also possible to avoid iterating through all addresses h in [h_min, h_max] and instead jump directly from one overlapping quadrant to the next. Starting from a current address h_input, the next overlapping address is found by temporarily setting all bits at positions where h_max has a 0 (so that the increment carries over these invalid positions), incrementing by 1, and then masking the result back into the valid range:
function increment_h(h_input, h_min, h_max) is
    h_out ← h_input | (~h_max)      // pre-mask: fill invalid 1-positions so the carry skips them
    h_out ← h_out + 1               // increment
    h_out ← (h_out & h_max) | h_min // post-mask: clear invalid 1-bits, set required 1-bits
    return h_out
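The same function in Python, where integers are unbounded, so an explicit d-bit mask replaces the fixed machine word assumed above (the extra d parameter is an assumption of this sketch):

```python
def increment_h(h_input, h_min, h_max, d):
    """Return the next quadrant address after h_input that satisfies
    h_min/h_max; wraps around after the last valid address."""
    mask = (1 << d) - 1
    h_out = h_input | (mask & ~h_max)  # pre-mask: set invalid positions so the carry skips them
    h_out = (h_out + 1) & mask         # increment
    return (h_out & h_max) | h_min     # post-mask: clear forbidden 1-bits, set required 1-bits

# d = 3, bit 0 must be 1 (h_min) and bit 1 must be 0 (h_max):
# the only valid addresses are 0b001 and 0b101
print(bin(increment_h(0b001, 0b001, 0b101, 3)))  # 0b101
```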
Again, for d ≤ 64 this requires only a few bitwise operations and thus O(1) time. The query time per node is then reduced to O(d + n_overlapping_quadrants), where only quadrants that actually overlap the query box are visited.
k-nearest-neighbor searches can be implemented using standard algorithms.
The PH-tree can only store integer values. Floating point values can trivially be stored as integers by casting them. However, the authors also propose an approach without loss of precision.

Converting a floating point value into an integer value (and back) without loss of precision can be achieved by simply interpreting the 32 or 64 bits of the floating point value as a 32- or 64-bit integer. Due to the way that IEEE 754 encodes floating point values, the resulting integer values have the same ordering as the original floating point values, at least for positive values. Ordering for negative values can be achieved by inverting the non-sign bits.
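This encoding can be sketched in Python with the standard struct module, assuming 64-bit IEEE 754 doubles (the function names are illustrative):

```python
import struct

NON_SIGN_BITS = 0x7FFFFFFFFFFFFFFF  # all bits except the sign bit

def encode(value):
    """Map a double to a 64-bit signed integer with the same ordering."""
    bits = struct.unpack(">q", struct.pack(">d", value))[0]
    # positive doubles already order correctly; for negative doubles
    # (sign bit set, so 'bits' < 0) invert the non-sign bits
    return bits if bits >= 0 else bits ^ NON_SIGN_BITS

def decode(code):
    """Inverse of encode; lossless for every double."""
    bits = code if code >= 0 else code ^ NON_SIGN_BITS
    return struct.unpack(">d", struct.pack(">q", bits))[0]

values = [float("-inf"), -1.5, -0.0, 0.0, 1.5, float("inf")]
codes = [encode(v) for v in values]
print(codes == sorted(codes))                        # True: ordering preserved
print(all(decode(encode(v)) == v for v in values))   # True: lossless round-trip
```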
Example implementations are available in Java and C++ (see the repositories referenced above).
Encoding (and the inverse decoding) is lossless for all floating point values. The resulting ordering works well in practice, including ±infty and NaN (NaN values sort beyond ±infty). One caveat concerns -0.0: it is encoded as a distinct value that sorts immediately below 0.0. A query range [0.0, 10.0] therefore does not match a key with the value -0.0; to include -0.0, the query must use [-0.0, 10.0].
In order to store volumes (axis-aligned hyper-boxes) as keys, implementations typically use corner representation, which converts the two d-dimensional corners of a box into a single 2d-dimensional key:

k = {min_0, max_0, min_1, max_1, ..., min_{d-1}, max_{d-1}}
This works trivially for lookup, insert and remove operations. Window queries need to be converted from two d-dimensional corners, min and max, into two 2d-dimensional query keys. For a window query operation that matches all boxes that lie completely inside the query box, the query keys are:

k_min = {min_0, min_0, min_1, min_1, ..., min_{d-1}, min_{d-1}}
k_max = {max_0, max_0, max_1, max_1, ..., max_{d-1}, max_{d-1}}

For a window query operation that matches all boxes that intersect with the query box, the query keys are:

k_min = {-infty, min_0, -infty, min_1, ..., -infty, min_{d-1}}
k_max = {max_0, +infty, max_1, +infty, ..., max_{d-1}, +infty}
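A sketch of the key construction for intersection queries. The helper names are illustrative, and the inf parameter stands in for the ±infty sentinels (i.e. the extreme values of the integer key domain):

```python
def box_to_key(bmin, bmax):
    """Corner representation: {min_0, max_0, min_1, max_1, ...}."""
    key = []
    for lo, hi in zip(bmin, bmax):
        key += [lo, hi]
    return key

def intersection_query_keys(qmin, qmax, inf):
    """Query keys matching every box that intersects [qmin, qmax]."""
    k_min, k_max = [], []
    for lo, hi in zip(qmin, qmax):
        k_min += [-inf, lo]
        k_max += [hi, inf]
    return k_min, k_max

def in_window(key, k_min, k_max):
    # a key matches when it lies inside the window in every dimension
    return all(a <= k <= b for k, a, b in zip(key, k_min, k_max))

# a 2D box spanning (4..6, 1..2) intersects the query box (0..5, 0..5)
key = box_to_key((4, 1), (6, 2))
k_min, k_max = intersection_query_keys((0, 0), (5, 5), 999)
print(in_window(key, k_min, k_max))  # True
```

The intersection condition per dimension is bmin_i ≤ qmax_i and bmax_i ≥ qmin_i, which is exactly what the interleaved k_min/k_max bounds express.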
In high dimensions, nodes typically hold far fewer than 2^d entries, so the sparse node representations keep memory at O(n) while basic operations remain O(log n). The PH-tree has been reported to work well for high-dimensional data, with evaluations using dimensionalities such as d = 50 and d = 100.
Research has reported fast add/remove/exact-match operations with large and fast-changing datasets. Window queries have been shown to work well, especially for small query windows or large datasets.
The PH-tree is mainly suited for in-memory use. The size of its nodes (number of entries) is fixed, while persistent storage tends to benefit from indexes with configurable node size that can be aligned with the page size on disk. This is easier to achieve with other spatial indexes, such as R-trees.