McCreight Builder
The original and mathematically precise exposition (in blockquotes) is taken from the excellent paper by McCreight [McCreight1976]. The comments are by me, trying to understand the paper and translate it into language suited for mere mortals.
First McCreight defines:
- extension
An extension of a string \(\bs\alpha\) is any string of which \(\bs\alpha\) is a prefix.
- locus
The locus of a string is the node at the end of the path named by the string. The locus may not exist.
- extended locus
The extended locus of a string \(\bs\alpha\) is the locus of the shortest extension of \(\bs\alpha\) whose locus is defined. This may be the locus itself.
- contracted locus
The contracted locus of a string \(\bs\alpha\) is the locus of the longest prefix of \(\bs\alpha\) whose locus is defined. This may be the locus itself.
- \(\suf{\imath}\)
is the suffix of \(\bf S\) beginning at character position \(i\) (beginning with 1).
- \(\head{\imath}\)
\(\head{\imath}\) is the longest prefix of \(\suf{\imath}\) which is also a prefix of \(\suf{\jmath}\) for some \(j < i\). Equivalently, \(\head{\imath}\) is the longest prefix of \(\suf{\imath}\) whose extended locus exists within the tree \(T_{\imath-1}\).
- \(\tail{\imath}\)
We define \(\tail{\imath}\) as \(\suf{\imath}-\head{\imath}\).
The McCreight algorithm inserts all suffixes of \(\bf S\) in order beginning with the longest.
In each step it inserts one suffix and creates one leaf node and at most one internal node. For each internal node it will also create a suffix link in the next step.
To insert one suffix \(\suf{}\), we search the tree from the root until we get a mismatch. If the mismatch happens on a node, we call that node the \(\head{}\). If the mismatch happens in the middle of an edge, we split the edge and insert a new node, and call this new node the \(\head{}\). Then we append the \(\tail{}\) to the \(\head{}\). This procedure is slow because we have to search starting from the root over and over again.
McCreight shows us how to warp from \(\head{\imath-1}\) to \(\head{\imath}\) in linear time using suffix links and “rescanning”. He demonstrates how suffix links can be set and exploited in linear time:
LEMMA 1. If \(\head{\imath-1}\) can be written as \(\bf x\bs{\delta}\) for some character \(\bf x\) and some (possibly empty) string \(\bs\delta\), then \(\bs\delta\) is a prefix of \(\head{\imath}\).
PROOF. By induction on \(i\). Suppose \(\head{\imath-1}=\bf x\bs{\delta}\). This means that there is a \(j,\, j < i\), such that \(\bf x\bs{\delta}\) is a prefix both of \(\suf{\imath-1}\) and of \(\suf{\jmath-1}\). Thus \(\bs\delta\) is a prefix both of \(\suf{\imath}\) and of \(\suf{\jmath}\). Therefore by the definition of \(\head{}\), \(\bs\delta\) is a prefix of \(\head{\imath}\). \(\quad\square\)
Rephrasing that for mere mortals:
LEMMA 1: If the locus of \(\head{\imath-1}\) was not the root node, then in step \(i\) we will find or create a node \(c\) that is on the path to \(\head{\imath}\). Therefore in step \(i\) it will be possible link to \(\head{\imath-1}\) to \(c\) with a suffix link.
PROOF: at the end of step \(i-1\) the locus of \(\head{\imath-1}\) can exists only if \(\suf{\imath-1}\) and some previous suffix \(\suf{\jmath-1}\) diverged at \(\bf x\bs{\delta}\). But then, in step \(i\), \(\suf{\imath}\) must diverge with the previously inserted \(\suf{\jmath}\) at \(\bs\delta\).
To exploit this relationship, auxiliary links are added to our tree structure. From each nonterminal node which is the locus of \(\bf x\bs{\delta}\), where \(\bf x\) is a character and \(\bs\delta\) is a string, a suffix link is introduced pointing to the locus of \(\bs\delta\). (Note that the locus of \(\bs\delta\) is never within the subtree rooted at the locus of \(\bf x\bs{\delta}\).) […] These links enable the algorithm in step \(i\) to do a short-cut search for the locus of \(\head{\imath}\) beginning at the locus of \(\head{\imath-1}\) which it has visited in the previous step. The following semiformal presentation of the algorithm will prove by induction on \(i\) that
\(P1\): in tree \(T_{\imath}\) only the locus of \(\head{\imath}\) could fail to have a valid suffix link, and that
\(P2\): in step \(i\) the algorithm visits the contracted locus of \(\head{\imath}\) in \(T_{\imath-1}\).
Properties \(P1\) and \(P2\) clearly obtain if \(i = 1\). Now suppose \(i > 1\). In step \(i\) the algorithm does the following:
It follows a description of the algorithm:
Substep A. Let us write down \(\head{\imath-1}\) as the concatenation of three strings \(\bs{\chi\alpha\beta}\). Let us also write down \(\head{\imath}\) in the same way as \(\bs{\alpha\beta\gamma}\).
\(\bs\chi\) was the first character in \(\suf{\imath-1}\) and is of no more importance when we insert \(\suf{\imath}\). \(\bs\alpha\) is that part of \(\suf{\imath}\) we can short-circuit with a suffix link. \(\bs\beta\) is the next part of \(\suf{\imath}\) which we can fast-forward by “rescanning”. \(\bs\gamma\) is the third part of \(\suf{\imath}\) for which we have to use slow “scanning”. (And \(\tail{\imath}\) is the last part of \(\suf{\imath}\).)
We go to the first ancestor-or-self of \(\head{\imath-1}\) that has a suffix link. (That would be \(\bs{\chi\alpha}\).) We then follow the suffix link to the locus of \(\bs\alpha\) and call that node \(c\). Goto substep B.
Note that if we set the suffix link of root to root itself we don’t have to special case an empty \(\bs\alpha\).
Substep A. First the algorithm identifies three strings \(\bs\chi\), \(\bs\alpha\), and \(\bs\beta\), with the following properties:
\(\head{\imath-1}\) can be represented as \(\bs{\chi\alpha\beta}\).
If the contracted locus of \(\head{\imath-1}\) in \(T_{\imath-2}\) is not the root, it is the locus of \(\bs{\chi\alpha}\). Otherwise \(\bs\alpha\) is empty.
\(\bs\chi\) is a string of at most one character which is empty only if \(\head{\imath+1}\) is empty.
Lemma 1 guarantees that \(\head{\imath}\) may be represented as \(\bs{\alpha\beta\gamma}\) for some (possibly empty) string \(\bs\gamma\). The algorithm now chooses the appropriate case of the following two:
\(\bs\alpha\) empty: The algorithm calls the root node \(c\) and goes to substep B. Note that \(c\) is defined as the locus of \(\bs\alpha\).
\(\bs\alpha\) nonempty: By definition, the locus of \(\bs{\chi\alpha}\) must have existed in the tree \(T_{\imath-2}\). By \(P1\) the suffix link of that (nonterminal) locus node must be defined in tree \(T_{\imath-1}\), since the node itself must have been constructed before step \(i-1\). By \(P2\) the algorithm visited that node in step \(i-1\). The algorithm follows its suffix link to a nonterminal node called \(c\) and goes to substep B. Note that \(c\) is defined as the locus of \(\bs\alpha\).
Substep B. This is the second time-saving trick devised by McCreight.
From the node \(c\), the locus of \(\bs\alpha\), we search the locus of \(\bs{\alpha\beta}\). We save time by what McCreight calls “rescanning”: We already know that all characters in \(\bs\beta\) must match so we need to examine only the first character of each edge (to decide which edge to follow).
If we overshoot the length of \(\bs\beta\) we have to split the edge as usual. When found or created, we call that node \(d\). Goto Substep C. Note that if \(\bs\gamma\) is not empty the edge will already be split.
Substep B. This is called “rescanning,” and is the key idea of algorithm M. Since \(\bs{\alpha\beta\gamma}\) is defined as \(\head{\imath}\), by the definition of \(\head{}\) we know that the extended locus of \(\bs{\alpha\beta}\) exists in the tree \(T_{\imath-1}\). This means that there is a sequence of arcs downward from node \(c\) (the locus of \(\bs\alpha\)) which spells out some extension of \(\bs\beta\). To rescan \(\bs\beta\) the algorithm finds the child arc \(\bs\rho\) of \(c\) which begins with the first character of \(\bs\beta\) and leads to a node which we shall call \(f\). It compares the lengths of \(\bs\beta\) and \(\bs\rho\). If \(\bs\beta\) is longer, then a recursive rescan of \(\beta - \rho\) is begun from node \(f\). If \(\bs\beta\) is the same length or shorter, then \(\bs\beta\) is a prefix of \(\bs\rho\), the algorithm has found the extended locus of \(\bs{\alpha\beta}\) and the rescan is complete. It has been accomplished in time linear in the number of nodes encountered. A new nonterminal node is constructed to be the locus of \(\bs{\alpha\beta}\) if one does not already exist. The algorithm calls \(d\) the locus of \(\bs{\alpha\beta}\) and goes to substep C. (Note that substep B constructs a new node to be the locus of \(\bs{\alpha\beta}\) only if \(\bs\gamma\) is empty.)
Substep C. Back to the naive algorithm.
First we set the suffix link on \(\bs{\chi\alpha\beta}\) to point to \(\bs{\alpha\beta}\). Then from the locus of \(\bs{\alpha\beta}\) we scan the slow way to \(\bs{\alpha\beta\gamma}\) comparing each character until we get a mismatch. If the mismatch happens on a node, we call that node \(\head{\imath}\). If the mismatch happens in the middle of an edge, we split the edge and insert a new node, and call the new node \(\head{\imath}\). Then we append the unmatched rest \(\tail{\imath}\) to \(\head{\imath}\). Increment \(i\) and goto Substep A.
Substep C. This is called “scanning.” If the suffix link of the locus of \(\bs{\chi\alpha\beta}\) is currently undefined, the algorithm first defines that link to point to node \(d\). This and inductive hypothesis establish the truth of \(P1\) in \(T_{\imath}\). Then the algorithm begins searching from node \(d\) deeper into the tree to find the extended locus of \(\bs{\alpha\beta\gamma}\). The major difference between scanning and rescanning is that in rescanning the length of \(\bs\beta\) is known beforehand (because it has already been scanned), while in scanning the length of \(\bs\gamma\) is not known beforehand. Thus the algorithm must travel downward into the tree in response to the characters of \(\tail{\imath}\) (of which \(\bs\gamma\) is a prefix) one by one from left to right. When the algorithm “falls out of the tree” (as constraint \(S1\) guarantees that it must), it has found the extended locus of \(\bs{\alpha\beta\gamma}\). The last node of \(T_{\imath-1}\) encountered in this downward trek of rescanning and scanning is the contracted locus of \(\head{\imath}\) in \(T_{\imath-1}\) ; this establishes the truth of \(P2\). A new nonterminal node is constructed to be the locus of \(\bs{\alpha\beta\gamma}\) if one does not already exist. Finally a new arc labeled \(\tail{\imath}\), is constructed from the locus of \(\bs{\alpha\beta\gamma}\) to a new terminal node. Step \(i\) is now finished.
A tree builder that uses McCreight’s Algorithm.
This module implements McCreight’s algorithm to build a suffix tree in linear time, adapted to generalized suffix trees.
- class Builder
Builds the suffix-tree using McCreight’s Algorithm.
- build(root: Internal, id_: object, S: Iterable[Hashable]) None
Add a sequence to the tree.
- Parameters:
root (node.Internal) – the root node of the tree
id (Id) – the id of the sequence
S (Symbols) – the sequence to insert