Collation Algorithm

The library uses an enhancement of the Needleman-Wunsch algorithm by Gotoh [Gotoh1982]. This section provides a very high level overview of the algorithm.

Phase 1 - Build Table

In phase 1 the algorithm builds a table. For example this is the table built for the two strings: the quick brown fox jumps over the lazy dog and sick fox is crazy.

sickfoxiscrazy
0.000.00
0.00
-0.50-0.50
-0.50
-0.70-0.70
-0.70
-0.90-0.90
-0.90
-1.10-1.10
-1.10
the
-0.50-0.50
-0.50
0.00-0.70
-0.70
-0.50-0.90
-0.50
-0.70-1.10
-0.70
-0.90-1.30
-0.90
quick
-0.70-0.70
-0.70
-0.04-0.50
-0.90
0.00-1.00
-0.54
-0.50-1.20
-0.50
-0.70-1.40
-0.70
brown
-0.90-0.90
-0.90
-0.70-0.54
-1.10
-0.04-0.50
-1.04
0.00-1.00
-0.54
-0.50-1.20
-0.50
fox
-1.10-1.10
-1.10
-0.90-0.74
-1.30
0.46-0.54
-1.24
-0.04-0.50
-0.04
0.00-1.00
-0.24
jumps
-1.30-1.30
-1.30
-1.10-0.94
-1.50
-0.74-0.04
-1.44
0.64-0.54
-0.54
-0.04-0.50
0.14
over
-1.50-1.50
-1.50
-1.30-1.14
-1.70
-0.94-0.24
-1.64
-0.040.14
-0.74
0.64-0.36
-0.36
the
-1.70-1.70
-1.70
-1.50-1.34
-1.90
-1.14-0.44
-1.84
-0.24-0.06
-0.94
0.140.14
-0.56
lazy
-1.90-1.90
-1.90
-1.70-1.54
-2.10
-1.34-0.64
-2.04
-0.44-0.26
-1.14
0.40-0.06
-0.76
dog
-2.10-2.10
-2.10
-1.90-1.74
-2.30
-1.54-0.84
-2.24
-0.64-0.46
-1.34
-0.26-0.10
-0.96

Every cell in the table contains three values: \(D\), \(P\), and \(Q\), and an arrow, like this:

DP
Q

We define the score \(S\) for each cell as:

\[S = \max(D, P, Q)\]

The grayed cells in the first row and first column are initialized using the gap start and gap extension penalties. The numbers for each remaining cell are calculated using only values from the three cells, to the top-left, the top, and the left, of the current cell:

\[D = S_↖ + \mbox{similarity}(word_←, word_↑)\]
\[P = \max(S_↑ + openingpenalty, P_↑ + extensionpenalty)\]
\[Q = \max(S_← + openingpenalty, Q_← + extensionpenalty)\]

Finally the arrow in the current cell is set to point to that cell which yielded the highest of the current cell’s \(D\), \(P\), and \(Q\) values.

Phase 2 - Backtrack

When the table is thus completed, two empty sequences are created. Then the algorithm starts backtracking from the last (bottom-right) cell following the arrows until it reaches the first (top-left) cell. If the arrow points:

the word in the row header is added to the first sequence, a hyphen is added to the second sequence,

the word in the row header is added to the first sequence, the word in the column header is added to the second sequence,

a hyphen is added to the first sequence, the word in the column header is added to the second sequence.

sickfoxiscrazy
0.000.00
0.00
-0.50-0.50
-0.50
-0.70-0.70
-0.70
-0.90-0.90
-0.90
-1.10-1.10
-1.10
the
-0.50-0.50
-0.50
0.00-0.70
-0.70
-0.50-0.90
-0.50
-0.70-1.10
-0.70
-0.90-1.30
-0.90
quick
-0.70-0.70
-0.70
-0.04-0.50
-0.90
0.00-1.00
-0.54
-0.50-1.20
-0.50
-0.70-1.40
-0.70
brown
-0.90-0.90
-0.90
-0.70-0.54
-1.10
-0.04-0.50
-1.04
0.00-1.00
-0.54
-0.50-1.20
-0.50
fox
-1.10-1.10
-1.10
-0.90-0.74
-1.30
0.46-0.54
-1.24
-0.04-0.50
-0.04
0.00-1.00
-0.24
jumps
-1.30-1.30
-1.30
-1.10-0.94
-1.50
-0.74-0.04
-1.44
0.64-0.54
-0.54
-0.04-0.50
0.14
over
-1.50-1.50
-1.50
-1.30-1.14
-1.70
-0.94-0.24
-1.64
-0.040.14
-0.74
0.64-0.36
-0.36
the
-1.70-1.70
-1.70
-1.50-1.34
-1.90
-1.14-0.44
-1.84
-0.24-0.06
-0.94
0.140.14
-0.56
lazy
-1.90-1.90
-1.90
-1.70-1.54
-2.10
-1.34-0.64
-2.04
-0.44-0.26
-1.14
0.40-0.06
-0.76
dog
-2.10-2.10
-2.10
-1.90-1.74
-2.30
-1.54-0.84
-2.24
-0.64-0.46
-1.34
-0.26-0.10
-0.96

Finally the two sequences are reversed and printed.

thequickbrownfoxjumpsoverthelazydog
-sick-foxis--crazy-

Parameters

The algorithm can be customized by setting:

  • a word comparison (similarity) function,

  • the starting gap penalty,

  • the gap opening penalty,

  • and the gap extension penalty.