Collation Algorithm ~~~~~~~~~~~~~~~~~~~ The library uses an enhancement of the Needleman-Wunsch algorithm by Gotoh [Gotoh1982]_. This section provides a very high level overview of the algorithm. Phase 1 - Build Table --------------------- In phase 1 the algorithm builds a table. For example this is the table built for the two strings: *the quick brown fox jumps over the lazy dog* and *sick fox is crazy.* .. raw:: html :file: _static/super-collator-phase1.html Every cell in the table contains three values: `D`, `P`, and `Q`, and an arrow, like this: .. raw:: html :align: center
DP
Q
We define the score `S` for each cell as: .. math:: S = \max(D, P, Q) The grayed cells in the first row and first column are initialized using the *gap start* and *gap extension* penalties. The numbers for each remaining cell are calculated using only values from the three cells, to the top-left, the top, and the left, of the current cell: .. math:: D = S_↖ + \mbox{similarity}(word_←, word_↑) .. math:: P = \max(S_↑ + openingpenalty, P_↑ + extensionpenalty) .. math:: Q = \max(S_← + openingpenalty, Q_← + extensionpenalty) Finally the arrow in the current cell is set to point to that cell which yielded the highest of the current cell's `D`, `P`, and `Q` values. Phase 2 - Backtrack ------------------- When the table is thus completed, two empty sequences are created. Then the algorithm starts backtracking from the last (bottom-right) cell following the arrows until it reaches the first (top-left) cell. If the arrow points: ↑ the word in the row header is added to the first sequence, a hyphen is added to the second sequence, ↖ the word in the row header is added to the first sequence, the word in the column header is added to the second sequence, ← a hyphen is added to the first sequence, the word in the column header is added to the second sequence. .. raw:: html :file: _static/super-collator-phase2.html Finally the two sequences are reversed and printed. .. raw:: html :file: _static/super-collator-result.html Parameters ---------- The algorithm can be customized by setting: - a word comparison (similarity) function, - the starting gap penalty, - the gap opening penalty, - and the gap extension penalty.