Ukkonen Builder

Ukkonen’s algorithm is very similar to McCreight’s.

McCreight1976

Ukkonen1995

English

step

phase

round

locus

implicit state

node + substring

locus node

explicit state

node

branching state

internal node

final state

leaf

arc

transition function

edge

suffix link

suffix function

suffix pointer

head

active point

contracted locus

canonical representation

parent node

scan

search

rescan

canonize

fast-forward search

The differences are: In round \(i\) Ukkonen inserts character \(t_i\) into the tree while McCreight inserts the whole suffix \(t_i\dots t_n\). Ukkonen’s algorithm does not need to know the characters following \(t_i\). It can build a tree from a Python generator and be ready in constant time after the last character was generated.

McCreight’s algorithm is easier to understand. McCreight inserts one leaf and at most one internal node per round. It is not clear beforehand how many leafs and internal nodes a round in Ukkonen’s will insert: in Fig. 2 of Ukkonen’s paper, the last round creates 2 internal nodes and 3 new leaves, while both preceding rounds created nothing.

Some annotations to Ukkonen’s paper follow:

Ukkonen identifies subsequences by two indices: \(k\) and \(p\). In a generalized suffix tree these are not sufficient to uniquely identify a subsequence, we also need to know to which sequence \(k\) and \(p\) are referring. So instead of using the parameter pair \((k,p)\) we use three parameters: \((S,k,p)\). We also prefer to name them: S, start, end.

We also have two problems with indices: The first is that Python indices start with 0 while Ukkonen’s start with 1. The second is that Python indices point to one beyond the end while Ukkonen’s point to the end. In conclusion: Ukkonen’s \(1..2\) becomes Python’s [0:2].

Ukkonen treats the suffix trie as an automaton and thus uses transition function to describe what a mere mortal would call an edge, and explicit state \(s\) to describe what the rest of us would call a node. Edges in a trie are labeled by one character only. A branching state is an internal node, a final state is a leaf node. Every node in the trie represents a substring and every leaf node represents a suffix. The converse is true. The node representing the string \(abc\) is called \(\overline{abc}\).

A suffix tree is a compressed suffix trie without those internal nodes that have only one child. Edges in a tree are labeled by more than one character. As a consequence those states in the trie that had their nodes removed become implicit states in the tree. Those that still retain their nodes are explicit states. Explicit substrings always end on a node, implicit ones end in the middle of an edge (but can be made explict by splitting the edge).

States are represented by the triple: \((s,(k,p))\), which is a node and a substring going out of the node. There may be more than one such representation. Remember that we use node, S, start, end instead.

The canonical representation of a state is: the state itself and an empty substring if explicit, the nearest node above the state and the shortest possible substring if implicit.

The transition function \(g'(s,(k,p)) = r\) starts at node \(s\) and ends at the state \(r\), which may be a node or an implicit state (in the middle of an edge).

The suffix function \(f(\bar x) = \bar y\) is a link present at each internal node, that points to another node that encodes the same suffix minus the first symbol. Following suffix links you get from: cacao -> acao -> cao -> ao -> a -> root -> aux.

An open transition is any edge going into a leaf. A leaf’s end grows automatically whenever a new symbol is added to the tree.

The boundary path of a trie is the chain of suffix links starting at the deepest state and ending at aux. In a tree the states on the boundary path may be implicit. To follow an implicit boundary path we use substep A and substep B from McCreight’s algorithm.

The active point and end point are states on the boundary path. Active point and end point may be the same point. The active point is the first state on the boundary path that is not a leaf, the end point ist the first state that already has the transition we trying to are add, and if we follow that transition we reach the next active point.

See also

[Ukkonen1995]

Ukkonen’s suffix tree algorithm in plain English https://stackoverflow.com/questions/9452701/

A tree builder that uses Ukkonen’s Algorithm.

This module implements Ukkonen’s algorithm to build a suffix tree in linear time, adapted to generalized suffix trees.

class Builder

Builds the suffix-tree using Ukkonen’s Algorithm.

__init__()
transition(s: Internal, k: int) Tuple[Node, Sequence[Hashable], int, int]

Follow the \(s,S,k\) transition.

Let \(g'(s,(k',p')) = s'\) be the \(t_k\)-transition from \(s\).

Parameters:
Return tuple:

\(s,S,k,p\)

test_and_split(s: Internal, start: int, end: int, t: Hashable) Tuple[bool, Internal]

Test if endpoint and eventually split a transition.

Return True, if s is the end point.

“Tests whether or not a state with canonical reference pair \((s,(k,p))\) is the end point, that is, a state that in \(STrie(T^{i-1})\) would have a \(t_i\)-transition. Symbol \(t_i\) is given as input parameter \(t\). The test result is returned as the first output parameter. If \((s,(k,p))\) is not the end point, then state \((s,(k,p))\) is made explicit (if not already so) by splitting a transition. The explicit state is returned as the second output parameter.” – [Ukkonen1995]

canonize(s: Internal, start: int, end: int) Tuple[Internal, int]

Canonize a reference pair.

This has the same function as substep b in McCreight’s algorithm.

“Given a reference pair \((s,(k,p))\) for some state \(r\), it finds and returns state \(s'\) and left link \(k'\) such that \((s',(k',p))\) is the canonical reference pair for \(r\). State \(s'\) is the closest explicit ancestor of \(r\) (or \(r\) itself if \(r\) is explicit). Therefore the string that leads from \(s'\) to \(r\) must be a suffix of the string \(t_k\dots t_p\) that leads from \(s\) to \(r\). Hence the right link \(p\) does not change but the left link \(k\) can become \(k', k' \geq k\).” – [Ukkonen1995]

update(s: Internal, start: int, end: List[int]) Tuple[Internal, int]

Update the tree.

“[…] transforms \(STree(T^{i-1})\) into \(STree(T^i)\) by inserting the \(t_i\)-transitions in the second group. The procedure uses procedure canonize mentioned above, and procedure test-and-split that tests whether or not a given reference pair refers to the end point. If it does not then the procedure creates and returns an explicit state for the reference pair provided that the pair does not already represent an explicit state. Procedure update returns a reference pair for the end point \(s_{j'}\) (actually only the state and the left pointer of the pair, as the second pointer remains \(i-1\) for all states on the boundary path).” – [Ukkonen1995]

\(s,(k,i-1)\) is the canonical reference pair for the active point.

Returns:

a reference pair for the endpoint \(s_{j\prime}\).

build(root: Internal, id_: object, S: Iterable[Hashable]) None

Add a sequence to the tree.

Parameters:
  • root (node.Internal) – the root node of the tree

  • id (Id) – the id of the sequence

  • S (IterSymbols) – an iterator over the symbols