An anatree [1] is a data structure designed to solve anagrams. Solving an anagram means finding a word that can be formed from a given list of letters. Such problems are commonly encountered in word games like Scrabble and in newspaper crossword puzzles. The wordwheel variant adds the condition that the central letter must appear in every word formed from the given set. Other conditions may be imposed on the frequency (number of appearances) of each letter in the input string. These problems are classified as constraint satisfaction problems in the computer science literature.
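The wordwheel condition can be illustrated with a brute-force filter over a small word list. This is only a sketch of the problem statement, not of the anatree itself; the function names `can_form` and `wordwheel` and the sample lexicon are assumptions made for illustration.

```python
from collections import Counter

def can_form(word, letters):
    """Return True if `word` can be spelled from the multiset `letters`."""
    need = Counter(word)
    have = Counter(letters)
    return all(have[ch] >= n for ch, n in need.items())

def wordwheel(letters, centre, lexicon):
    """Words in `lexicon` that use only `letters` and contain `centre`."""
    return [w for w in lexicon if centre in w and can_form(w, letters)]

# Hypothetical lexicon: "near" lacks the central letter, "treat" needs two t's.
sample = ["tree", "near", "treat", "ant"]
print(wordwheel("anatree", "t", sample))  # ['tree', 'ant']
```

The filter scans the whole lexicon for every query, which is the cost that indexed structures such as the anatree aim to avoid.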
An anatree is represented as a directed tree which contains a set of words ($W$) encoded as strings in some alphabet. The internal vertices are labelled with letters of the alphabet and the leaves contain words. The edges are labelled with non-negative integers. An anatree has the property that the sum of the edge labels from the root to a leaf is the length of the word stored at that leaf. If the internal vertices along a path are labelled $\alpha_1, \alpha_2, \ldots, \alpha_l$ and the edges along that path are labelled $n_1, n_2, \ldots, n_l$, then the words reached by the path contain $n_1$ occurrences of $\alpha_1$, $n_2$ occurrences of $\alpha_2$, and so on.
A mixed anatree is an anatree in which the internal vertices also store words. A mixed anatree can hold words of varying lengths, whereas in a regular anatree all words have the same length.
A number of data structures have been proposed to solve anagrams in constant time. Two of the most commonly used data structures are the alphabetic map and the frequency map.
The alphabetic map maintains a hash table of all the possible words in the language (this is referred to as the lexicon). For a given input string, the letters are sorted in alphabetical order, and the sorted string serves as a key that maps onto the matching words in the hash table. Finding an anagram therefore requires sorting the letters and looking up the result in the hash table. The sorting can be done in linear time with counting sort, and hash table lookups can be done in constant time. For example, given the word ANATREE, the alphabetic map would produce a mapping of
$\{\mathrm{AAEENRT} \to \{\text{anatree}\}\}$
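A minimal sketch of an alphabetic map, assuming a Python dictionary as the hash table. The helper names `alphabetic_key` and `build_alphabetic_map` are hypothetical, and the built-in `sorted` (a comparison sort) stands in for the counting sort mentioned above.

```python
from collections import defaultdict

def alphabetic_key(word):
    """Sort the letters of `word` to obtain its lookup key."""
    return "".join(sorted(word.upper()))

def build_alphabetic_map(lexicon):
    """Map each sorted-letter key to the set of words sharing it."""
    table = defaultdict(set)
    for word in lexicon:
        table[alphabetic_key(word)].add(word)
    return table

table = build_alphabetic_map(["anatree", "dog", "god"])
print(alphabetic_key("anatree"))  # AAEENRT
print(sorted(table["DGO"]))       # ['dog', 'god']
```

Storing a set per key lets several anagrams of the same letters share one entry.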
A frequency map also stores the list of all possible words in the lexicon in a hash table. For a given input string, the frequency map records the frequency (number of appearances) of each letter and uses these counts to perform a lookup in the hash table. The worst-case execution time is found to be linear in the size of the lexicon. For example, given the word ANATREE, the frequency map would produce a mapping of
$f(A) \to 2,\ f(E) \to 2,\ f(N) \to 1,\ f(R) \to 1,\ f(T) \to 1$
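The letter counts can be computed and used as a lookup key along the following lines. This is a sketch under assumptions: the names `frequency_key` and `build_frequency_map` are hypothetical, and encoding the counts as a sorted tuple is one possible way to make them hashable, not necessarily the encoding used in the literature.

```python
from collections import Counter, defaultdict

def frequency_key(word):
    """Return the letter frequencies of `word` as a hashable, sorted tuple."""
    counts = Counter(word.upper())
    return tuple(sorted(counts.items()))

def build_frequency_map(lexicon):
    """Map each frequency key to the set of words that share it."""
    table = defaultdict(set)
    for word in lexicon:
        table[frequency_key(word)].add(word)
    return table

print(frequency_key("anatree"))
# (('A', 2), ('E', 2), ('N', 1), ('R', 1), ('T', 1))
```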
The construction of an anatree begins by selecting a label for the root and partitioning the words based on that label. This process is repeated recursively for all the labels of the tree. Anatree construction is non-canonical: for a given set of words, the resulting anatree differs depending on the labels chosen, starting with the root. The choice of labels greatly affects the performance of the anatree.
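The recursive partitioning can be sketched as follows. The nested-dictionary representation, the function name `build_anatree`, and the choice of taking labels in alphabetical order are all assumptions made for illustration; other label orders give different trees for the same words.

```python
from collections import defaultdict

def build_anatree(words, labels):
    """Partition `words` by how often each label occurs, one label per level."""
    if not labels:
        # Leaf: every word here has the same count for every label.
        return {"words": sorted(words)}
    label, rest = labels[0], labels[1:]
    groups = defaultdict(list)
    for word in words:
        groups[word.count(label)].append(word)  # edge label = frequency
    return {
        "label": label,
        "children": {n: build_anatree(g, rest) for n, g in groups.items()},
    }

words = ["dog", "god", "dig"]
labels = sorted(set("".join(words)))  # ['d', 'g', 'i', 'o'], an assumed order
tree = build_anatree(words, labels)
print(tree["label"])  # d
```

Because every letter of the lexicon is used as a label on each path, the edge labels from the root to a leaf add up to the length of the words stored there (for "dog": 1 + 1 + 0 + 1 = 3).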
The following are some heuristics for choosing labels:
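One heuristic scores each candidate label $\alpha$ over the word set $W$ by considering, for each count $n$, the words that contain $n$ $\alpha$s. The score has the form $D_\alpha = \sum_n \ldots$, with terms that involve $|W|$.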
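To find a word in an anatree, start at the root and, at each vertex, follow the edge whose label equals the frequency of that vertex's letter in the input string, until a leaf is reached. The leaf contains the required word. For example, consider the anatree in the figure: to find the word "dog" from the input string "ogd", start at the root and follow the edge labelled 1, because the input string contains 1 occurrence of "d", and continue in the same way until a leaf is reached.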
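The lookup procedure can be sketched as below. The block includes a compact builder so that it runs on its own; the dictionary layout, the function names `build_anatree` and `find_anagrams`, and the alphabetical label order are assumptions for illustration and do not correspond to the tree in the figure.

```python
from collections import defaultdict

def build_anatree(words, labels):
    """Build a nested-dict anatree, one label per level (assumed layout)."""
    if not labels:
        return {"words": sorted(words)}
    groups = defaultdict(list)
    for word in words:
        groups[word.count(labels[0])].append(word)
    return {
        "label": labels[0],
        "children": {n: build_anatree(g, labels[1:]) for n, g in groups.items()},
    }

def find_anagrams(tree, letters):
    """Follow, at each vertex, the edge matching the label's frequency."""
    node = tree
    while "label" in node:
        node = node["children"].get(letters.count(node["label"]))
        if node is None:
            return []  # no word has this frequency for the label
    # Path labels sum to the word length, so a length mismatch means the
    # input contains letters that never appear as labels.
    return [w for w in node["words"] if len(w) == len(letters)]

words = ["dog", "god", "dig"]
tree = build_anatree(words, sorted(set("".join(words))))
print(find_anagrams(tree, "ogd"))  # ['dog', 'god']
```

A query such as "cat" stops at the root, since no stored word contains zero occurrences of "d".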
A lexicon that stores $w$ words, each up to $l$ characters long, over an alphabet $O$ has the following space requirements:
Data Structure | Space Requirements
---|---
Alphabetic Map | $O(w l \log \lvert O \rvert)$
Frequency Map | $O(w(\lvert O \rvert \log l + l \log \lvert O \rvert))$
Anatree | $O(w \lvert O \rvert (\log \lvert O \rvert + l))$
The worst-case execution time of an anatree is $O(\lvert w \rvert (l + w \lvert O \rvert^2))$.