Suffix automaton explained
In computer science, a suffix automaton is an efficient data structure for representing the substring index of a given string which allows the storage, processing, and retrieval of compressed information about all its substrings. The suffix automaton of a string
is the smallest
directed acyclic graph with a dedicated initial vertex and a set of "final" vertices, such that paths from the initial vertex to final vertices represent the suffixes of the string.
. The
state graph of a suffix automaton is called a directed acyclic word graph (DAWG), a term that is also sometimes used for any
deterministic acyclic finite state automaton.
Suffix automata were introduced in 1983 by a group of scientists from the University of Denver and the University of Colorado Boulder. They suggested a linear time online algorithm for its construction and showed that the suffix automaton of a string
having length at least two characters has at most
states and at most
transitions. Further works have shown a close connection between suffix automata and
suffix trees, and have outlined several generalizations of suffix automata, such as compacted suffix automaton obtained by compression of nodes with a single outgoing arc.
Suffix automata provide efficient solutions to problems such as substring search and computation of the largest common substring of two and more strings.
History
The concept of suffix automaton was introduced in 1983 by a group of scientists from University of Denver and University of Colorado Boulder consisting of Anselm Blumer, Janet Blumer, Andrzej Ehrenfeucht, David Haussler and Ross McConnell, although similar concepts had earlier been studied alongside suffix trees in the works of Peter Weiner, Vaughan Pratt and Anatol Slissenko. In their initial work, Blumer et al. showed a suffix automaton built for the string
of length greater than
has at most
states and at most
transitions, and suggested a linear
algorithm for automaton construction.
In 1983, Mu-Tian Chen and Joel Seiferas independently showed that Weiner's 1973 suffix-tree construction algorithm while building a suffix tree of the string
constructs a suffix automaton of the reversed string
as an auxiliary structure. In 1987, Blumer
et al. applied the compressing technique used in suffix trees to a suffix automaton and invented the compacted suffix automaton, which is also called the compacted directed acyclic word graph (CDAWG). In 1997,
Maxime Crochemore and Renaud Vérin developed a linear algorithm for direct CDAWG construction. In 2001, Shunsuke Inenaga
et al. developed an algorithm for construction of CDAWG for a set of words given by a
trie.
Definitions
Usually when speaking about suffix automata and related concepts, some notions from formal language theory and automata theory are used, in particular:
that is used to construct words. Its elements are called "characters";
- "Word" is a finite sequence of characters
\omega=\omega1\omega2...\omegan
. "Length" of the word
is denoted as
;
- "Formal language" is a set of words over given alphabet;
- "Language of all words" is denoted as
(where the "*" character stands for
Kleene star), "empty word" (the word of zero length) is denoted by the character
;
\alpha=\alpha1\alpha2...\alphan
and
\beta=\beta1\beta2...\betam
is denoted as
or
and corresponds to the word obtained by writing
to the right of
, that is,
\alpha\beta=\alpha1\alpha2...\alphan\beta1\beta2...\betam
;
- "Concatenation of languages"
and
is denoted as
or
and corresponds to the set of pairwise concatenations
AB=\{\alpha\beta:\alpha\inA,\beta\inB\}
;
may be represented as
, where
\alpha,\beta,\gamma\in\Sigma*
, then words
,
and
are called "prefix", "suffix" and "
subword" (substring) of the word
correspondingly;
and
(with
) then
is said to "occur" in
as a subword. Here
and
are called left and right positions of occurrence of
in
correspondingly.
Automaton structure
lA=(\Sigma,Q,q0,F,\delta)
, where:
is an "alphabet" that is used to construct words,
is a set of automaton "
states",
is an "initial" state of automaton,
is a set of "final" states of automaton,
\delta:Q x \Sigma\mapstoQ
is a
partial "transition" function of automaton, such that
for
and
is either undefined or defines a transition from
over character
.
Most commonly, deterministic finite automaton is represented as a directed graph ("diagram") such that:
- Set of graph vertices corresponds to the state of states
,
- Graph has a specific marked vertex corresponding to initial state
,
- Graph has several marked vertices corresponding to the set of final states
,
- Set of graph arcs corresponds to the set of transitions
,
- Specifically, every transition is represented by an arc from
to
marked with the character
. This transition also may be denoted as
.
In terms of its diagram, the automaton recognizes the word
\omega=\omega1\omega2...\omegam
only if there is a path from the initial vertex
to some final vertex
such that concatenation of characters on this path forms
. The set of words recognized by an automaton forms a language that is set to be recognized by the automaton. In these terms, the language recognized by a suffix automaton of
is the language of its (possibly empty) suffixes.
Automaton states
See main article: DFA minimization. "Right context" of the word
with respect to language
is a set
[\omega]R=\{\alpha:\omega\alpha\inL\}
that is a set of words
such that their concatenation with
forms a word from
. Right contexts induce a natural
equivalence relation
on the set of all words. If language
is recognized by some deterministic finite automaton, there exists unique up to
isomorphism automaton that recognizes the same language and has the minimum possible number of states. Such an automaton is called a minimal automaton for the given language
.
Myhill–Nerode theorem allows it to define it explicitly in terms of right contexts:
In these terms, a "suffix automaton" is the minimal deterministic finite automaton recognizing the language of suffixes of the word
. The right context of the word
with respect to this language consists of words
, such that
is a suffix of
. It allows to formulate the following lemma defining a
bijection between the right context of the word and the set of right positions of its occurrences in
:
For example, for the word
and its subword
, it holds
and
. Informally,
is formed by words that follow occurrences of
to the end of
and
is formed by right positions of those occurrences. In this example, the element
corresponds with the word
while the word
corresponds with the element
.
It implies several structure properties of suffix automaton states. Let
, then:
and
have at least one common element
, then
and
have a common element as well. It implies
is a suffix of
and therefore
endpos(\beta)\subsetendpos(\alpha)
and
. In aforementioned example,
, so
is a suffix of
and thus
[cab]R=\{a\}\subset\{a,acaba\}=[ab]R
and
endpos(cab)=\{6\}\subset\{2,6\}=endpos(ab)
;
, then
endpos(\alpha)=endpos(\beta)
, thus
occurs in
only as a suffix of
. For example, for
and
it holds that
and
endpos(b)=endpos(ab)=\{2,6\}
;
and
is a suffix of
such that
|\alpha|\leq|\gamma|\leq|\beta|
, then
[\alpha]R=[\gamma]R=[\beta]R
. In the example above
and it holds for "intermediate" suffix
that
.
Any state
of the suffix automaton recognizes some continuous
chain of nested suffixes of the longest word recognized by this state.
"Left extension"
\overset{\scriptstyle{\leftarrow}}{\gamma}
of the string
is the longest string
that has the same right context as
. Length
|\overset{\scriptstyle{\leftarrow}}{\gamma}|
of the longest string recognized by
is denoted by
. It holds:
"Suffix link"
of the state
is the pointer to the state
that contains the largest suffix of
that is not recognized by
.
In this terms it can be said
recognizes exactly all suffixes of
\overset{\scriptstyle{\leftarrow}}{\alpha}
that is longer than
and not longer than
. It also holds:
Connection with suffix trees
See main article: Suffix tree. A "prefix tree" (or "trie") is a rooted directed tree in which arcs are marked by characters in such a way no vertex
of such tree has two out-going arcs marked with the same character. Some vertices in trie are marked as final. Trie is said to recognize a set of words defined by paths from its root to final vertices. In this way prefix trees are a special kind of deterministic finite automata if you perceive its root as an initial vertex. The "suffix trie" of the word
is a prefix tree recognizing a set of its suffixes. "A
suffix tree" is a tree obtained from a suffix trie via the compaction procedure, during which consequent edges are merged if the
degree of the vertex between them is equal to two.
By its definition, a suffix automaton can be obtained via minimization of the suffix trie. It may be shown that a compacted suffix automaton is obtained by both minimization of the suffix tree (if one assumes each string on the edge of the suffix tree is a solid character from the alphabet) and compaction of the suffix automaton. Besides this connection between the suffix tree and the suffix automaton of the same string there is as well a connection between the suffix automaton of the string
and the suffix tree of the reversed string
.
Similarly to right contexts one may introduce "left contexts"
[\omega]L=\{\beta\in\Sigma*:\beta\omega\inL\}
, "right extensions"
\overset{\scriptstyle{ → }}{\omega~}
corresponding to the longest string having same left context as
and the equivalence relation
. If one considers right extensions with respect to the language
of "prefixes" of the string
it may be obtained:
, which implies the suffix link tree of the string
and the suffix tree of the string
are isomorphic:
Similarly to the case of left extensions, the following lemma holds for right extensions:
Size
A suffix automaton of the string
of length
has at most
states and at most
transitions. These bounds are reached on strings
and
correspondingly. This may be formulated in a stricter way as
where
and
are the numbers of transitions and states in automaton correspondingly.
Construction
Initially the automaton only consists of a single state corresponding to the empty word, then characters of the string are added one by one and the automaton is rebuilt on each step incrementally.
State updates
After a new character is appended to the string, some equivalence classes are altered. Let
be the right context of
with respect to the language of
suffixes. Then the transition from
to
after
is appended to
is defined by lemma:
After adding
to the current word
the right context of
may change significantly only if
is a suffix of
. It implies equivalence relation
is a refinement of
. In other words, if
, then
. After the addition of a new character at most two equivalence classes of
will be split and each of them may split in at most two new classes. First, equivalence class corresponding to
empty right context is always split into two equivalence classes, one of them corresponding to
itself and having
as a right context. This new equivalence class contains exactly
and all its suffixes that did not occur in
, as the right context of such words was empty before and contains only empty word now.
Given the correspondence between states of the suffix automaton and vertices of the suffix tree, it is possible to find out the second state that may possibly split after a new character is appended. The transition from
to
corresponds to the transition from
to
in the reversed string. In terms of suffix trees it corresponds to the insertion of the new longest suffix
into the suffix tree of
. At most two new vertices may be formed after this insertion: one of them corresponding to
, while the other one corresponds to its direct ancestor if there was a branching. Returning to suffix automata, it means the first new state recognizes
and the second one (if there is a second new state) is its suffix link. It may be stated as a lemma:
It implies that if
(for example, when
didn't occur in
at all and
), then only the equivalence class corresponding to the empty right context is split.
Besides suffix links it is also needed to define final states of the automaton. It follows from structure properties that all suffixes of a word
recognized by
are recognized by some vertex on
suffix path
of
. Namely, suffixes with length greater than
lie in
, suffixes with length greater than
but not greater than
lie in
and so on. Thus if the state recognizing
is denoted by
, then all final states (that is, recognizing suffixes of
) form up the sequence
(last,link(last),link2(last),...)
.
Transitions and suffix links updates
After the character
is appended to
possible new states of suffix automaton are
and
. Suffix link from
goes to
and from
it goes to
. Words from
occur in
only as its suffixes therefore there should be no transitions at all from
while transitions to it should go from suffixes of
having length at least
and be marked with the character
. State
is formed by subset of
, thus transitions from
should be same as from
. Meanwhile, transitions leading to
should go from suffixes of
having length less than
and at least
, as such transitions have led to
before and corresponded to seceded part of this state. States corresponding to these suffixes may be determined via traversal of suffix link path for
.
Construction of the suffix automaton for the word abbcbc |
---|
| |
| |
| | |
Construction algorithm
Theoretical results above lead to the following algorithm that takes character
and rebuilds the suffix automaton of
into the suffix automaton of
:
- The state corresponding to the word
is kept as
;
- After
is appended, previous value of
is stored in the variable
and
itself is reassigned to the new state corresponding to
;
- States corresponding to suffixes of
are updated with transitions to
. To do this one should go through
, until there is a state that already has a transition by
;
- Once the aforementioned loop is over, there are 3 cases:
- If none of states on the suffix path had a transition by
, then
never occurred in
before and the suffix link from
should lead to
;
- If the transition by
is found and leads from the state
to the state
, such that
, then
does not have to be split and it is a suffix link of
;
- If the transition is found but
, then words from
having length at most
should be segregated into new "clone" state
;
- If the previous step was concluded with the creation of
, transitions from it and its suffix link should copy those of
, at the same time
is assigned to be common suffix link of both
and
;
- Transitions that have led to
before but corresponded to words of the length at most
are redirected to
. To do this, one continues going through the suffix path of
until the state is found such that transition by
from it doesn't lead to
.
The whole procedure is described by the following pseudo-code:
function : define assign assign while is undefined: assign define if : assign else if : assign else: define assign assign assign while : assign
Here
is the initial state of the automaton and
is a function creating new state for it. It is assumed
,
,
and
are stored as global variables.
Complexity
Complexity of the algorithm may vary depending on the underlying structure used to store transitions of the automaton. It may be implemented in
with
memory overhead or in
with
memory overhead if one assumes that memory allocation is done in
. To obtain such complexity, one has to use the methods of
amortized analysis. The value of
strictly reduces with each iteration of the cycle while it may only increase by as much as one after the first iteration of the cycle on the next
add_letter call. Overall value of
never exceeds
and it is only increased by one between iterations of appending new letters that suggest total complexity is at most linear as well. The linearity of the second cycle is shown in a similar way.
Generalizations
The suffix automaton is closely related to other suffix structures and substring indices. Given a suffix automaton of a specific string one may construct its suffix tree via compacting and recursive traversal in linear time. Similar transforms are possible in both directions to switch between the suffix automaton of
and the suffix tree of reversed string
. Other than this several generalizations were developed to construct an automaton for the set of strings given by trie, compacted suffix automation (CDAWG), to maintain the structure of the automaton on the sliding window, and to construct it in a bidirectional way, supporting the insertion of a characters to both the beginning and the end of the string.
Compacted suffix automaton
As was already mentioned above, a compacted suffix automaton is obtained via both compaction of a regular suffix automaton (by removing states which are non-final and have exactly one out-going arc) and the minimization of a suffix tree. Similarly to the regular suffix automaton, states of compacted suffix automaton may be defined in explicit manner. A two-way extension
\overset{\scriptstyle{\longleftrightarrow}}{\gamma}
of a word
is the longest word
, such that every occurrence of
in
is preceded by
and succeeded by
. In terms of left and right extensions it means that two-way extension is the left extension of the right extension or, which is equivalent, the right extension of the left extension, that is
. In terms of two-way extensions compacted automaton is defined as follows:
Two-way extensions induce an equivalence relation which defines the set of words recognized by the same state of compacted automaton. This equivalence relation is a transitive closure of the relation defined by , which highlights the fact that a compacted automaton may be obtained by both gluing suffix tree vertices equivalent via
\overset{\scriptstyle{\leftarrow}}{\alpha}=\overset{\scriptstyle{\leftarrow}}{\beta}
relation (minimization of the suffix tree) and gluing suffix automaton states equivalent via
\overset{\scriptstyle{ → }}{\alpha}=\overset{\scriptstyle{ → }}{\beta}
relation (compaction of suffix automaton). If words
and
have same right extensions, and words
and
have same left extensions, then cumulatively all strings
,
and
have same two-way extensions. At the same time it may happen that neither left nor right extensions of
and
coincide. As an example one may take
,
and
, for which left and right extensions are as follows:
\overset{\scriptstyle{ → }}{\alpha}=\overset{\scriptstyle{ → }}{\beta}=ab=\overset{\scriptstyle{\leftarrow}}{\beta}=\overset{\scriptstyle{\leftarrow}}{\gamma}
, but
\overset{\scriptstyle{ → }}{\gamma}=b
and
\overset{\scriptstyle{\leftarrow}}{\alpha}=a
. That being said, while equivalence relations of one-way extensions were formed by some continuous chain of nested prefixes or suffixes, bidirectional extensions equivalence relations are more complex and the only thing one may conclude for sure is that strings with the same two-way extension are
substrings of the longest string having the same two-way extension, but it may even happen that they don't have any non-empty substring in common. The total number of equivalence classes for this relation does not exceed
which implies that compacted suffix automaton of the string having length
has at most
states. The amount of transitions in such automaton is at most
.
Suffix automaton of several strings
Consider a set of words
. It is possible to construct a generalization of suffix automaton that would recognize the language formed up by suffixes of all words from the set. Constraints for the number of states and transitions in such automaton would stay the same as for a single-word automaton if you put
. The algorithm is similar to the construction of single-word automaton except instead of
state, function
add_letter would work with the state corresponding to the word
assuming the transition from the set of words
\{\omega1,...,\omegai,...,\omegak\}
to the set
\{\omega1,...,\omegaix,...,\omegak\}
.
This idea is further generalized to the case when
is not given explicitly but instead is given by a
prefix tree with
vertices. Mohri
et al. showed such an automaton would have at most
and may be constructed in linear time from its size. At the same time, the number of transitions in such automaton may reach
, for example for the set of words
T=\{\sigma1,a\sigma1,
...,an\sigma1,an\sigma2,...,an\sigmak\}
over the alphabet
\Sigma=\{a,\sigma1,...,\sigmak\}
the total length of words is equal to
, the number of vertices in corresponding suffix trie is equal to
and corresponding suffix automaton is formed of
states and
transitions. Algorithm suggested by Mohri mainly repeats the generic algorithm for building automaton of several strings but instead of growing words one by one, it traverses the trie in a
breadth-first search order and append new characters as it meet them in the traversal, which guarantees amortized linear complexity.
Sliding window
Some compression algorithms, such as LZ77 and RLE may benefit from storing suffix automaton or similar structure not for the whole string but for only last
its characters while the string is updated. This is because compressing data is usually expressively large and using
memory is undesirable. In 1985, Janet Blumer developed an algorithm to maintain a suffix automaton on a sliding window of size
in
worst-case and
on average, assuming characters are distributed independently and
uniformly. She also showed
complexity cannot be improved: if one considers words construed as a concatenation of several
words, where
, then the number of states for the window of size
would frequently change with jumps of order
, which renders even theoretical improvement of
for regular suffix automata impossible.
The same should be true for the suffix tree because its vertices correspond to states of the suffix automaton of the reversed string but this problem may be resolved by not explicitly storing every vertex corresponding to the suffix of the whole string, thus only storing vertices with at least two out-going edges. A variation of McCreight's suffix tree construction algorithm for this task was suggested in 1989 by Edward Fiala and Daniel Greene; several years later a similar result was obtained with the variation of Ukkonen's algorithm by Jesper Larsson. The existence of such an algorithm, for compacted suffix automaton that absorbs some properties of both suffix trees and suffix automata, was an open question for a long time until it was discovered by Martin Senft and Tomasz Dvorak in 2008, that it is impossible if the alphabet's size is at least two.
One way to overcome this obstacle is to allow window width to vary a bit while staying
. It may be achieved by an approximate algorithm suggested by Inenaga et al. in 2004. The window for which suffix automaton is built in this algorithm is not guaranteed to be of length
but it is guaranteed to be at least
and at most
while providing linear overall complexity of the algorithm.
Applications
Suffix automaton of the string
may be used to solve such problems as:
- Counting the number of distinct substrings of
in
on-line,
- Finding the longest substring of
occurring at least twice in
,
and
in
,
- Counting the number of occurrences of
in
in
,
- Finding all occurrences of
in
in
, where
is the number of occurrences.
It is assumed here that
is given on the input after suffix automaton of
is constructed.
Suffix automata are also used in data compression, music retrieval and matching on genome sequences.
References
Bibliography
- .
- .
- .
- free. .
- .
- .
- .
- .
- .
- .
- .
- .
- .
- .
- .
- .
- .
- .
- .
- .
- .
- .
- .
- .
- .
- .
- .
External links