McCreight Builder
The original and mathematically precise exposition (in blockquotes) is taken from the excellent paper by McCreight [McCreight1976]. The comments are by me, trying to understand the paper and translate it into language suited for mere mortals.
First McCreight defines:
- extension
An extension of a string
is any string of which is a prefix. - locus
The locus of a string is the node at the end of the path named by the string. The locus may not exist.
- extended locus
The extended locus of a string
is the locus of the shortest extension of whose locus is defined. This may be the locus itself. - contracted locus
The contracted locus of a string
is the locus of the longest prefix of whose locus is defined. This may be the locus itself. is the suffix of
beginning at character position (beginning with 1).
is the longest prefix of which is also a prefix of for some . Equivalently, is the longest prefix of whose extended locus exists within the tree . We define
as .
The McCreight algorithm inserts all suffixes of
In each step it inserts one suffix and creates one leaf node and at most one internal node. For each internal node it will also create a suffix link in the next step.
To insert one suffix
McCreight shows us how to warp from
LEMMA 1. If
can be written as for some character and some (possibly empty) string , then is a prefix of . PROOF. By induction on
. Suppose . This means that there is a , such that is a prefix both of and of . Thus is a prefix both of and of . Therefore by the definition of , is a prefix of .
Rephrasing that for mere mortals:
LEMMA 1: If the locus of
PROOF: at the end of step
To exploit this relationship, auxiliary links are added to our tree structure. From each nonterminal node which is the locus of
, where is a character and is a string, a suffix link is introduced pointing to the locus of . (Note that the locus of is never within the subtree rooted at the locus of .) […] These links enable the algorithm in step to do a short-cut search for the locus of beginning at the locus of which it has visited in the previous step. The following semiformal presentation of the algorithm will prove by induction on that
: in tree only the locus of could fail to have a valid suffix link, and that
: in step the algorithm visits the contracted locus of in . Properties
and clearly obtain if . Now suppose . In step the algorithm does the following:
It follows a description of the algorithm:
Substep A. Let us write down
We go to the first ancestor-or-self of
Note that if we set the suffix link of root to root itself we don’t have to special case
an empty
Substep A. First the algorithm identifies three strings
, , and , with the following properties:
can be represented as . If the contracted locus of
in is not the root, it is the locus of . Otherwise is empty.
is a string of at most one character which is empty only if is empty. Lemma 1 guarantees that
may be represented as for some (possibly empty) string . The algorithm now chooses the appropriate case of the following two:
empty: The algorithm calls the root node and goes to substep B. Note that is defined as the locus of .
nonempty: By definition, the locus of must have existed in the tree . By the suffix link of that (nonterminal) locus node must be defined in tree , since the node itself must have been constructed before step . By the algorithm visited that node in step . The algorithm follows its suffix link to a nonterminal node called and goes to substep B. Note that is defined as the locus of .
Substep B. This is the second time-saving trick devised by McCreight.
From the node
If we overshoot the length of
Substep B. This is called “rescanning,” and is the key idea of algorithm M. Since
is defined as , by the definition of we know that the extended locus of exists in the tree . This means that there is a sequence of arcs downward from node (the locus of ) which spells out some extension of . To rescan the algorithm finds the child arc of which begins with the first character of and leads to a node which we shall call . It compares the lengths of and . If is longer, then a recursive rescan of is begun from node . If is the same length or shorter, then is a prefix of , the algorithm has found the extended locus of and the rescan is complete. It has been accomplished in time linear in the number of nodes encountered. A new nonterminal node is constructed to be the locus of if one does not already exist. The algorithm calls the locus of and goes to substep C. (Note that substep B constructs a new node to be the locus of only if is empty.)
Substep C. Back to the naive algorithm.
First we set the suffix link on
Substep C. This is called “scanning.” If the suffix link of the locus of
is currently undefined, the algorithm first defines that link to point to node . This and inductive hypothesis establish the truth of in . Then the algorithm begins searching from node deeper into the tree to find the extended locus of . The major difference between scanning and rescanning is that in rescanning the length of is known beforehand (because it has already been scanned), while in scanning the length of is not known beforehand. Thus the algorithm must travel downward into the tree in response to the characters of (of which is a prefix) one by one from left to right. When the algorithm “falls out of the tree” (as constraint guarantees that it must), it has found the extended locus of . The last node of encountered in this downward trek of rescanning and scanning is the contracted locus of in ; this establishes the truth of . A new nonterminal node is constructed to be the locus of if one does not already exist. Finally a new arc labeled , is constructed from the locus of to a new terminal node. Step is now finished.
A tree builder that uses McCreight’s Algorithm.
This module implements McCreight’s algorithm to build a suffix tree in linear time, adapted to generalized suffix trees.
- class Builder
Builds the suffix-tree using McCreight’s Algorithm.
- build(root: Internal, id_: object, S: Iterable[Hashable]) None
Add a sequence to the tree.
- Parameters:
root (node.Internal) – the root node of the tree
id (Id) – the id of the sequence
S (Symbols) – the sequence to insert