Ukkonen Builder
Ukkonen’s algorithm is very similar to McCreight’s.
McCreight1976 |
Ukkonen1995 |
English |
---|---|---|
step |
phase |
round |
locus |
implicit state |
node + substring |
locus node |
explicit state |
node |
branching state |
internal node |
|
final state |
leaf |
|
arc |
transition function |
edge |
suffix link |
suffix function |
suffix pointer |
head |
active point |
|
contracted locus |
canonical representation |
parent node |
scan |
search |
|
rescan |
canonize |
fast-forward search |
The differences are: In round \(i\) Ukkonen inserts character \(t_i\) into the tree while McCreight inserts the whole suffix \(t_i\dots t_n\). Ukkonen’s algorithm does not need to know the characters following \(t_i\). It can build a tree from a Python generator and be ready in constant time after the last character was generated.
McCreight’s algorithm is easier to understand. McCreight inserts one leaf and at most one internal node per round. It is not clear beforehand how many leafs and internal nodes a round in Ukkonen’s will insert: in Fig. 2 of Ukkonen’s paper, the last round creates 2 internal nodes and 3 new leaves, while both preceding rounds created nothing.
Some annotations to Ukkonen’s paper follow:
Ukkonen identifies subsequences by two indices: \(k\) and \(p\). In a generalized suffix
tree these are not sufficient to uniquely identify a subsequence, we also need to know
to which sequence \(k\) and \(p\) are referring. So instead of using the parameter pair
\((k,p)\) we use three parameters: \((S,k,p)\). We also prefer to name them: S, start,
end
.
We also have two problems with indices: The first is that Python indices start with 0
while Ukkonen’s start with 1. The second is that Python indices point to one beyond the
end while Ukkonen’s point to the end. In conclusion: Ukkonen’s \(1..2\) becomes Python’s
[0:2]
.
Ukkonen treats the suffix trie as an automaton and thus uses transition function to describe what a mere mortal would call an edge, and explicit state \(s\) to describe what the rest of us would call a node. Edges in a trie are labeled by one character only. A branching state is an internal node, a final state is a leaf node. Every node in the trie represents a substring and every leaf node represents a suffix. The converse is true. The node representing the string \(abc\) is called \(\overline{abc}\).
A suffix tree is a compressed suffix trie without those internal nodes that have only one child. Edges in a tree are labeled by more than one character. As a consequence those states in the trie that had their nodes removed become implicit states in the tree. Those that still retain their nodes are explicit states. Explicit substrings always end on a node, implicit ones end in the middle of an edge (but can be made explict by splitting the edge).
States are represented by the triple: \((s,(k,p))\), which is a node and a substring
going out of the node. There may be more than one such representation. Remember that
we use node, S, start, end
instead.
The canonical representation of a state is: the state itself and an empty substring if explicit, the nearest node above the state and the shortest possible substring if implicit.
The transition function \(g'(s,(k,p)) = r\) starts at node \(s\) and ends at the state \(r\), which may be a node or an implicit state (in the middle of an edge).
The suffix function \(f(\bar x) = \bar y\) is a link present at each internal node, that points to another node that encodes the same suffix minus the first symbol. Following suffix links you get from: cacao -> acao -> cao -> ao -> a -> root -> aux.
An open transition is any edge going into a leaf. A leaf’s end
grows
automatically whenever a new symbol is added to the tree.
The boundary path of a trie is the chain of suffix links starting at the deepest state and ending at aux. In a tree the states on the boundary path may be implicit. To follow an implicit boundary path we use substep A and substep B from McCreight’s algorithm.
The active point and end point are states on the boundary path. Active point and end point may be the same point. The active point is the first state on the boundary path that is not a leaf, the end point ist the first state that already has the transition we trying to are add, and if we follow that transition we reach the next active point.
See also
Ukkonen’s suffix tree algorithm in plain English https://stackoverflow.com/questions/9452701/
A tree builder that uses Ukkonen’s Algorithm.
This module implements Ukkonen’s algorithm to build a suffix tree in linear time, adapted to generalized suffix trees.
- class Builder
Builds the suffix-tree using Ukkonen’s Algorithm.
- __init__()
- transition(s: Internal, k: int) Tuple[Node, Sequence[Hashable], int, int]
Follow the \(s,S,k\) transition.
Let \(g'(s,(k',p')) = s'\) be the \(t_k\)-transition from \(s\).
- Parameters:
s (node.Internal) – the state \(s\)
k (int) – \(k\)
- Return tuple:
\(s,S,k,p\)
- test_and_split(s: Internal, start: int, end: int, t: Hashable) Tuple[bool, Internal]
Test if endpoint and eventually split a transition.
Return True, if s is the end point.
“Tests whether or not a state with canonical reference pair \((s,(k,p))\) is the end point, that is, a state that in \(STrie(T^{i-1})\) would have a \(t_i\)-transition. Symbol \(t_i\) is given as input parameter \(t\). The test result is returned as the first output parameter. If \((s,(k,p))\) is not the end point, then state \((s,(k,p))\) is made explicit (if not already so) by splitting a transition. The explicit state is returned as the second output parameter.” – [Ukkonen1995]
- canonize(s: Internal, start: int, end: int) Tuple[Internal, int]
Canonize a reference pair.
This has the same function as substep b in McCreight’s algorithm.
“Given a reference pair \((s,(k,p))\) for some state \(r\), it finds and returns state \(s'\) and left link \(k'\) such that \((s',(k',p))\) is the canonical reference pair for \(r\). State \(s'\) is the closest explicit ancestor of \(r\) (or \(r\) itself if \(r\) is explicit). Therefore the string that leads from \(s'\) to \(r\) must be a suffix of the string \(t_k\dots t_p\) that leads from \(s\) to \(r\). Hence the right link \(p\) does not change but the left link \(k\) can become \(k', k' \geq k\).” – [Ukkonen1995]
- update(s: Internal, start: int, end: List[int]) Tuple[Internal, int]
Update the tree.
“[…] transforms \(STree(T^{i-1})\) into \(STree(T^i)\) by inserting the \(t_i\)-transitions in the second group. The procedure uses procedure canonize mentioned above, and procedure test-and-split that tests whether or not a given reference pair refers to the end point. If it does not then the procedure creates and returns an explicit state for the reference pair provided that the pair does not already represent an explicit state. Procedure update returns a reference pair for the end point \(s_{j'}\) (actually only the state and the left pointer of the pair, as the second pointer remains \(i-1\) for all states on the boundary path).” – [Ukkonen1995]
\(s,(k,i-1)\) is the canonical reference pair for the active point.
- Returns:
a reference pair for the endpoint \(s_{j\prime}\).
- build(root: Internal, id_: object, S: Iterable[Hashable]) None
Add a sequence to the tree.
- Parameters:
root (node.Internal) – the root node of the tree
id (Id) – the id of the sequence
S (IterSymbols) – an iterator over the symbols