In computer science, frequent subtree mining is the problem of finding all patterns in a given database whose support (a metric related to its number of occurrences in other subtrees) is over a given threshold.[1] It is a more general form of the maximum agreement subtree problem.[2]
Frequent subtree mining is the problem of trying to find all of the patterns whose "support" is over a certain user-specified level, where "support" is calculated as the number of trees in a database which have at least one subtree isomorphic to a given pattern.[3]
The problem of frequent subtree mining has been formally defined as:
Given a threshold minfreq, a class of trees
l{C}
P\preceqT
P,T\inl{C}
l{D}\subseteql{C}
l{P}\subsetl{C}
l{P}
\forallP\inl{P}: freq(P,l{D})=\sum\nolimitsT
where is an anti-monotone function such that if
P'\preceqP
\forallT\inl{C}: d(P',T)\geqd(P,T).
In 2002, Mohammed J. Zaki introduced TreeMiner, an efficient algorithm for solving the frequent subtree mining problem, which used a "scope list" to represent tree nodes and which was contrasted with PatternMatcher, an algorithm based on pattern matching.[4]
A sub-tree
S=(Vs,Es)
T=(V,E)
Vs\subseteqV
Es\subseteqE
A sub-tree
S=(Vs,Es)
T=(V,E)
Vs\subseteqV
The support of a sub-tree is the number of trees in a database that contains the sub-tree. A sub-tree is frequent if its support is not less than a user-specified threshold (often denoted as minsup). The goal of TreeMiner is to find all embedded sub-trees that have support at least the minimum support.
There are several different ways of encoding a tree structure. TreeMiner uses string representations of trees for efficient tree manipulation and support counting. Initially the string is set to
\varnothing
Two k-sub-trees are said to be in the same prefix equivalence class if the string representation of them are identical up to the (k-1)-th node. In other words, all elements in a prefix equivalence class only differ by the last node. For example, two trees with string representation A B -1 C -1 and A B -1 D -1 are in the prefix equivalence class A B with elements (C, 0) and (D,0). An element of a prefix class is specified by the node label paired with the 0-based depth first index of the node it is attached to. In this example, both elements of prefix class A B are attached to the root, which has an index of 0.
The scope of a node A is given by a pair of numbers
[l,r]
Frequent sub-tree patterns follow the anti-monotone property. In other words, the support of a k-sub-tree is less than or equal to the support of its (k-1)-sub-trees. Only super patterns of known frequent patterns can possibly be frequent. By utilizing this property, k-sub-trees candidates can be generated based on frequent (k-1)-sub-trees through prefix class extension. Let C be a prefix equivalence class with two elements (x,i) and (y,j). Let C' be the class representing the extension of element (x,i). The elements of C' are added by performing join operation on the two (k-1)-sub-trees in C. The join operation on (x,i) and (y,j) is defined as the following.
i>j
i=j
i<j
This operation is repeated for any two ordered, but not necessarily distinct elements in C to construct the extended prefix classes of k-sub-trees.
TreeMiner performs depth first candidate generation using scope-list representation of sub-trees to facilitate faster support counting. A k-sub-tree S can be representation by a triplet (t,m,s) where t is the tree id the sub-tree comes from, m is the prefix match label, and s the scope of the last node in S. Depending on how S occurs in different trees across the database, S can have different scope-list representation. TreeMiner defines scope-list join that performs class extension on scope-list representation of sub-trees. Two elements (x,i) and (y,j) can be joined if there exists two sub-trees
(tx,mx,sx)
(ty,my,sy)
tx=ty,mx=my,sy\subsetsx
i=j
tx=ty,mx=my,sy>sx
i>j
By keeping track of distinct tree ids used in the scope-list tests, the support of sub-trees can be calculated efficiently.
Domains in which frequent subtree mining is useful tend to involve complex relationships between data entities: for instance, the analysis of XML documents often requires frequent subtree mining. Another domain where this is useful is the web usage mining problem: since the actions taken by users when visiting a web site can be recorded and categorized in many different ways, complex databases of trees need to be analyzed with frequent subtree mining. Other domains in which frequent subtree mining is useful include computational biology,[5] [6] RNA structure analysis, pattern recognition, bioinformatics,[7] and analysis of the KEGG GLYCAN database.[8]
Checking whether a pattern (or a transaction) supports a given subgraph is an NP-complete problem, since it is an NP-complete instance of the subgraph isomorphism problem.[9] Furthermore, due to combinatorial explosion, according to Lei et al., "mining all frequent subtree patterns becomes infeasible for a large and dense tree database".[10]