McCreight Builder

The original and mathematically precise exposition (in blockquotes) is taken from the excellent paper by McCreight [McCreight1976]. The comments are by me, trying to understand the paper and translate it into language suited for mere mortals.

First McCreight defines:

extension

An extension of a string α is any string of which α is a prefix.

locus

The locus of a string is the node at the end of the path named by the string. The locus may not exist.

extended locus

The extended locus of a string α is the locus of the shortest extension of α whose locus is defined. This may be the locus itself.

contracted locus

The contracted locus of a string α is the locus of the longest prefix of α whose locus is defined. This may be the locus itself.

sufı

is the suffix of S beginning at character position i (beginning with 1).

headı

headı is the longest prefix of sufı which is also a prefix of sufȷ for some j<i. Equivalently, headı is the longest prefix of sufı whose extended locus exists within the tree Tı1.

tailı

We define tailı as sufıheadı.

The McCreight algorithm inserts all suffixes of S in order beginning with the longest.

In each step it inserts one suffix and creates one leaf node and at most one internal node. For each internal node it will also create a suffix link in the next step.

To insert one suffix suf, we search the tree from the root until we get a mismatch. If the mismatch happens on a node, we call that node the head. If the mismatch happens in the middle of an edge, we split the edge and insert a new node, and call this new node the head. Then we append the tail to the head. This procedure is slow because we have to search starting from the root over and over again.

McCreight shows us how to warp from headı1 to headı in linear time using suffix links and “rescanning”. He demonstrates how suffix links can be set and exploited in linear time:

LEMMA 1. If headı1 can be written as xδ for some character x and some (possibly empty) string δ, then δ is a prefix of headı.

PROOF. By induction on i. Suppose headı1=xδ. This means that there is a j,j<i, such that xδ is a prefix both of sufı1 and of sufȷ1. Thus δ is a prefix both of sufı and of sufȷ. Therefore by the definition of head, δ is a prefix of headı.

Rephrasing that for mere mortals:

LEMMA 1: If the locus of headı1 was not the root node, then in step i we will find or create a node c that is on the path to headı. Therefore in step i it will be possible link to headı1 to c with a suffix link.

PROOF: at the end of step i1 the locus of headı1 can exists only if sufı1 and some previous suffix sufȷ1 diverged at xδ. But then, in step i, sufı must diverge with the previously inserted sufȷ at δ.

To exploit this relationship, auxiliary links are added to our tree structure. From each nonterminal node which is the locus of xδ, where x is a character and δ is a string, a suffix link is introduced pointing to the locus of δ. (Note that the locus of δ is never within the subtree rooted at the locus of xδ.) […] These links enable the algorithm in step i to do a short-cut search for the locus of headı beginning at the locus of headı1 which it has visited in the previous step. The following semiformal presentation of the algorithm will prove by induction on i that

P1: in tree Tı only the locus of headı could fail to have a valid suffix link, and that

P2: in step i the algorithm visits the contracted locus of headı in Tı1.

Properties P1 and P2 clearly obtain if i=1. Now suppose i>1. In step i the algorithm does the following:

It follows a description of the algorithm:

Substep A. Let us write down headı1 as the concatenation of three strings χαβ. Let us also write down headı in the same way as αβγ.

χ was the first character in sufı1 and is of no more importance when we insert sufı. α is that part of sufı we can short-circuit with a suffix link. β is the next part of sufı which we can fast-forward by “rescanning”. γ is the third part of sufı for which we have to use slow “scanning”. (And tailı is the last part of sufı.)

We go to the first ancestor-or-self of headı1 that has a suffix link. (That would be χα.) We then follow the suffix link to the locus of α and call that node c. Goto substep B.

Note that if we set the suffix link of root to root itself we don’t have to special case an empty α.

Substep A. First the algorithm identifies three strings χ, α, and β, with the following properties:

  1. headı1 can be represented as χαβ.

  2. If the contracted locus of headı1 in Tı2 is not the root, it is the locus of χα. Otherwise α is empty.

  3. χ is a string of at most one character which is empty only if headı+1 is empty.

Lemma 1 guarantees that headı may be represented as αβγ for some (possibly empty) string γ. The algorithm now chooses the appropriate case of the following two:

  • α empty: The algorithm calls the root node c and goes to substep B. Note that c is defined as the locus of α.

  • α nonempty: By definition, the locus of χα must have existed in the tree Tı2. By P1 the suffix link of that (nonterminal) locus node must be defined in tree Tı1, since the node itself must have been constructed before step i1. By P2 the algorithm visited that node in step i1. The algorithm follows its suffix link to a nonterminal node called c and goes to substep B. Note that c is defined as the locus of α.

Substep B. This is the second time-saving trick devised by McCreight.

From the node c, the locus of α, we search the locus of αβ. We save time by what McCreight calls “rescanning”: We already know that all characters in β must match so we need to examine only the first character of each edge (to decide which edge to follow).

If we overshoot the length of β we have to split the edge as usual. When found or created, we call that node d. Goto Substep C. Note that if γ is not empty the edge will already be split.

Substep B. This is called “rescanning,” and is the key idea of algorithm M. Since αβγ is defined as headı, by the definition of head we know that the extended locus of αβ exists in the tree Tı1. This means that there is a sequence of arcs downward from node c (the locus of α) which spells out some extension of β. To rescan β the algorithm finds the child arc ρ of c which begins with the first character of β and leads to a node which we shall call f. It compares the lengths of β and ρ. If β is longer, then a recursive rescan of βρ is begun from node f. If β is the same length or shorter, then β is a prefix of ρ, the algorithm has found the extended locus of αβ and the rescan is complete. It has been accomplished in time linear in the number of nodes encountered. A new nonterminal node is constructed to be the locus of αβ if one does not already exist. The algorithm calls d the locus of αβ and goes to substep C. (Note that substep B constructs a new node to be the locus of αβ only if γ is empty.)

Substep C. Back to the naive algorithm.

First we set the suffix link on χαβ to point to αβ. Then from the locus of αβ we scan the slow way to αβγ comparing each character until we get a mismatch. If the mismatch happens on a node, we call that node headı. If the mismatch happens in the middle of an edge, we split the edge and insert a new node, and call the new node headı. Then we append the unmatched rest tailı to headı. Increment i and goto Substep A.

Substep C. This is called “scanning.” If the suffix link of the locus of χαβ is currently undefined, the algorithm first defines that link to point to node d. This and inductive hypothesis establish the truth of P1 in Tı. Then the algorithm begins searching from node d deeper into the tree to find the extended locus of αβγ. The major difference between scanning and rescanning is that in rescanning the length of β is known beforehand (because it has already been scanned), while in scanning the length of γ is not known beforehand. Thus the algorithm must travel downward into the tree in response to the characters of tailı (of which γ is a prefix) one by one from left to right. When the algorithm “falls out of the tree” (as constraint S1 guarantees that it must), it has found the extended locus of αβγ. The last node of Tı1 encountered in this downward trek of rescanning and scanning is the contracted locus of headı in Tı1 ; this establishes the truth of P2. A new nonterminal node is constructed to be the locus of αβγ if one does not already exist. Finally a new arc labeled tailı, is constructed from the locus of αβγ to a new terminal node. Step i is now finished.

A tree builder that uses McCreight’s Algorithm.

This module implements McCreight’s algorithm to build a suffix tree in linear time, adapted to generalized suffix trees.

class Builder

Builds the suffix-tree using McCreight’s Algorithm.

build(root: Internal, id_: object, S: Iterable[Hashable]) None

Add a sequence to the tree.

Parameters:
  • root (node.Internal) – the root node of the tree

  • id (Id) – the id of the sequence

  • S (Symbols) – the sequence to insert