Tree

A Generalized Suffix Tree.

class Tree(data: Dict[object, Sequence[Hashable]] | None = None, *, builder: Builder | type | str | None = None)

A generalized suffix tree.

The key feature of the suffix tree is that for any leaf \(i\), the concatenation of the edgle-labels on the path from the root to leaf \(i\) exactly spells out the suffix of \(S\) that starts at point \(i\). That is, it spells out \(S[i..m]\).

Initialize and interrogate the tree:

>>> from suffix_tree import Tree
>>> tree = Tree({"A": "xabxac"})
>>> tree.find("abx")
True
>>> tree.find("abc")
False
>>> tree = Tree({"A": "xabxac", "B": "awyawxacxz"})
>>> for id_, path in tree.find_all("xac"):
...     print(id_, ":", str(path))
...
A : x a c $
B : x a c x z $
>>> tree.find_all("abc")
[]
>>> from suffix_tree import naive
>>> tree = Tree({"A": "xabxac"}, builder=naive.Builder)
>>> tree.find("abx")
True
>>> tree.find("abc")
False
__init__(data: Dict[object, Sequence[Hashable]] | None = None, *, builder: Builder | type | str | None = None)

Initialize and optionally build the tree.

Parameters:
  • data (Dict[Id, Symbols]) – a dictionary of ids to sequences of symbols or None

  • builder (builder.Builder) – a builder (default = suffix_tree.ukkonen.Builder)

  • progress (Callable) – a progress function (default = None). The function gets called at regular intervals during tree construction with one parameter: the current phase.

add(id_: object, S: Sequence[Hashable], *, builder: Builder | type | str | None = None)

Add a sequence of symbols to the tree.

Parameters:
  • id (object) – an object (probably a str or int) serving as an id for the sequence

  • S (Sequence) – a sequence of symbols

  • builder (builder.Builder) – a builder (default = suffix_tree.ukkonen.Builder)

  • progress (Callable) – a progress function (default = None). The function gets called at regular intervals during tree construction with one parameter: the current phase.

>>> from suffix_tree import Tree
>>> tree = Tree()
>>> tree.add("A", "xabxac")
>>> tree.add("B", "awyawxawxz")
>>> tree.find("abx")
True
>>> tree.find("awx")
True
>>> tree.find("abc")
False
find_path(path: Path) Tuple[Node, int, Node | None]

Find a path in the tree.

See: suffix_tree.node.Internal.find_path()

find(S: Sequence[Hashable]) bool

Find a sequence in the tree.

Parameters:

S (Sequence) – a sequence of symbols

Returns:

True if the sequence was found in the tree.

>>> from suffix_tree import Tree
>>> tree = Tree({"A": "xabxac"})
>>> tree.find("abx")
True
>>> tree.find("abc")
False
find_all(S: Sequence[Hashable]) List[Tuple[object, Path]]

Find all occurences of a sequence in a tree.

Parameters:

S (Sequence) – a sequence of symbols

Returns:

a list of positions

>>> from suffix_tree import Tree
>>> tree = Tree({"A": "xabxac"})
>>> for id_, path in tree.find_all("ab"):
...     print(id_, ":", str(path))
...
A : a b x a c $
>>> tree.find_all("abc")
[]
find_id(id_: object, S: Sequence[Hashable]) bool

Find a sequence in the tree.

Parameters:
  • id (object) – the id of a sequence in the tree

  • S (Sequence) – a given sequence of symbols

Returns:

True if the given sequence was found in the tree in the sequence labeled with id_.

>>> from suffix_tree import Tree
>>> tree = Tree({"A": "xabxac", "B": "awyawxawxz"})
>>> tree.find_id("A", "abx")
True
>>> tree.find_id("B", "awx")
True
>>> tree.find_id("B", "abx")
False
pre_order(f) None

Walk the tree visiting each node before its children.

Parameters:

f (Callable) – the visitor function

post_order(f) None

Walk the tree visiting each node after its children.

Parameters:

f (Callable) – the visitor function

common_substrings() List[Tuple[int, int, Path]]

Compute a common substring table.

Suppose we have \(K\) strings.

Definition For each \(k\) between 2 and \(K\), we define \(l(k)\) to be the length of the longest substring common to at least \(k\) of the strings.

We want to compute a table of \(K - 1\) entries, where entry \(k\) gives \(l(k)\) and also points to one of the common substrings of that length. — [Gusfield1997] §7.6, page 127ff

Returns:

a list of \(K - 1\) tuples containing \(k\), \(l(k)\), and one of the common substrings of that length, which may be None.

>>> from suffix_tree import Tree
>>> tree = Tree(
...     {
...         "A": "sandollar",
...         "B": "sandlot",
...         "C": "handler",
...         "D": "grand",
...         "E": "pantry",
...     }
... )
>>> for k, lk, path in tree.common_substrings():
...     print(k, lk, path)
...
2 4 s a n d
3 3 a n d
4 3 a n d
5 2 a n
maximal_repeats() List[Tuple[int, Path]]

Get a list of the maximal repeats in the tree.

Definition A maximal pair in a string \(S\) is a pair of identical substrings \(\alpha\) and \(\beta\) in \(S\) such that the character to immediate left (right) of \(\alpha\) is different from the character to the immediate left (right) of \(\beta\). That is, extending \(\alpha\) and \(\beta\) in either direction would destroy the equality of the two strings. A maximal repeat \(\alpha\) is a substring of \(S\) that occurs in a maximal pair in \(S\). — [Gusfield1997] §7.12, page 143ff.

Returns:

a list of tuples of \(C\) and path, where \(C\) is the number of distinct sequences in the tree that contain the maximal repeat.

>>> from suffix_tree import Tree
>>> tree = Tree({"A": "xabxac", "B": "awyawxawxz"})
>>> for C, path in sorted(tree.maximal_repeats()):
...     print(C, path)
...
1 a w
1 a w x
2 a
2 x
2 x a