Collation Tool Overview

This overview describes the collation tool and the pre-processing of the TEI files.

Pre-Processing of the TEI files

We extract every chapter of every capitular from all manuscripts and store them in the Postgres database. The text stored in the database is already normalized.

If a manuscript contains more than one copy of a chapter, all copies are extracted. If one or more correcting hands were active in the chapter, the original and each corrected version are extracted.

The online collation tool knows about all versions and offers them to the user.

Manuscript files(XML+TEI)publ/mss/*.xmlCronMakefilemss-extract-chapters-txt.xslPreprocessed files(XML)publ/cache/collation/*.xmlimport_data.pyDatabase(Postgres)

Data flow during pre-processing

The Makefile is run by cron at regular intervals.

The Makefile knows all the dependencies between the files and runs the appropriate tools to keep the database up-to-date with the manuscript files.

All preprocessed files can be found in the publ/cache/collation directory. The preprocessed files are normalized, eg. have the letter V replaced by U.

The import_data.py script imports the preprocessed text files into the database.

Collation Tool

The collation tool consists of two parts: one frontend written in JavaScript and using the Vue.js library, and one backend application server written in Python and using the super-collator library.

The application server retrieves the chapters from the database and collates them. The results of the collation are sent in json to the frontend that does the formatting for display.

BackendDatabase(Postgres)API Server(Python)Super-Collator(Python library)Frontend(Javascript)

Data flow during collation

The collation unit is the chapter, so that only short texts need to be collated, saving much processing time.

The Wordpress collation plugin delivers the Javascript client to the user. After that, all communication happens directly between the client and the application server.

Collation Algorithm

The application server uses an enhancement of the Needleman-Wunsch algorithm by Gotoh [Gotoh1982]. This section provides a very high level overview of the algorithm.

Phase 1 - Build Table

In phase 1 the algorithm builds a table. For example this is the table built for the two strings: the quick brown fox jumps over the lazy dog and sick fox is crazy.

sickfoxiscrazy
0.000.00
0.00
-0.50-0.50
-0.50
-0.70-0.70
-0.70
-0.90-0.90
-0.90
-1.10-1.10
-1.10
the
-0.50-0.50
-0.50
0.00-0.70
-0.70
-0.50-0.90
-0.50
-0.70-1.10
-0.70
-0.90-1.30
-0.90
quick
-0.70-0.70
-0.70
-0.04-0.50
-0.90
0.00-1.00
-0.54
-0.50-1.20
-0.50
-0.70-1.40
-0.70
brown
-0.90-0.90
-0.90
-0.70-0.54
-1.10
-0.04-0.50
-1.04
0.00-1.00
-0.54
-0.50-1.20
-0.50
fox
-1.10-1.10
-1.10
-0.90-0.74
-1.30
0.46-0.54
-1.24
-0.04-0.50
-0.04
0.00-1.00
-0.24
jumps
-1.30-1.30
-1.30
-1.10-0.94
-1.50
-0.74-0.04
-1.44
0.64-0.54
-0.54
-0.04-0.50
0.14
over
-1.50-1.50
-1.50
-1.30-1.14
-1.70
-0.94-0.24
-1.64
-0.040.14
-0.74
0.64-0.36
-0.36
the
-1.70-1.70
-1.70
-1.50-1.34
-1.90
-1.14-0.44
-1.84
-0.24-0.06
-0.94
0.140.14
-0.56
lazy
-1.90-1.90
-1.90
-1.70-1.54
-2.10
-1.34-0.64
-2.04
-0.44-0.26
-1.14
0.40-0.06
-0.76
dog
-2.10-2.10
-2.10
-1.90-1.74
-2.30
-1.54-0.84
-2.24
-0.64-0.46
-1.34
-0.26-0.10
-0.96

Every cell in the table contains three values: \(D\), \(P\), and \(Q\), and an arrow, like this:

DP
Q

We define the score \(S\) for each cell as:

\[S = \max(D, P, Q)\]

The grayed cells in the first row and first column are initialized using the gap start and gap extension penalties. The numbers for each remaining cell are calculated using only values from the three cells, to the top-left, the top, and the left, of the current cell:

\[D = S_↖ + \mbox{similarity}(word_←, word_↑)\]
\[P = \max(S_↑ + openingpenalty, P_↑ + extensionpenalty)\]
\[Q = \max(S_← + openingpenalty, Q_← + extensionpenalty)\]

Finally the arrow in the current cell is set to point to that cell which yielded the highest of the current cell’s \(D\), \(P\), and \(Q\) values.

Phase 2 - Backtrack

When the table is thus completed, two empty sequences are created. Then the algorithm starts backtracking from the last (bottom-right) cell following the arrows until it reaches the first (top-left) cell. If the arrow points:

the word in the row header is added to the first sequence, a hyphen is added to the second sequence,

the word in the row header is added to the first sequence, the word in the column header is added to the second sequence,

a hyphen is added to the first sequence, the word in the column header is added to the second sequence.

sickfoxiscrazy
0.000.00
0.00
-0.50-0.50
-0.50
-0.70-0.70
-0.70
-0.90-0.90
-0.90
-1.10-1.10
-1.10
the
-0.50-0.50
-0.50
0.00-0.70
-0.70
-0.50-0.90
-0.50
-0.70-1.10
-0.70
-0.90-1.30
-0.90
quick
-0.70-0.70
-0.70
-0.04-0.50
-0.90
0.00-1.00
-0.54
-0.50-1.20
-0.50
-0.70-1.40
-0.70
brown
-0.90-0.90
-0.90
-0.70-0.54
-1.10
-0.04-0.50
-1.04
0.00-1.00
-0.54
-0.50-1.20
-0.50
fox
-1.10-1.10
-1.10
-0.90-0.74
-1.30
0.46-0.54
-1.24
-0.04-0.50
-0.04
0.00-1.00
-0.24
jumps
-1.30-1.30
-1.30
-1.10-0.94
-1.50
-0.74-0.04
-1.44
0.64-0.54
-0.54
-0.04-0.50
0.14
over
-1.50-1.50
-1.50
-1.30-1.14
-1.70
-0.94-0.24
-1.64
-0.040.14
-0.74
0.64-0.36
-0.36
the
-1.70-1.70
-1.70
-1.50-1.34
-1.90
-1.14-0.44
-1.84
-0.24-0.06
-0.94
0.140.14
-0.56
lazy
-1.90-1.90
-1.90
-1.70-1.54
-2.10
-1.34-0.64
-2.04
-0.44-0.26
-1.14
0.40-0.06
-0.76
dog
-2.10-2.10
-2.10
-1.90-1.74
-2.30
-1.54-0.84
-2.24
-0.64-0.46
-1.34
-0.26-0.10
-0.96

Finally the two sequences are reversed and printed.

thequickbrownfoxjumpsoverthelazydog
-sick-foxis--crazy-

Parameters

The algorithm can be customized by setting:

  • a word comparison (similarity) function,

  • the starting gap penalty,

  • the gap opening penalty,

  • and the gap extension penalty.

Word Comparison Function

The word comparison function returns a similarity value between 0 and 1, 0 being totally different and 1 being completely equal. The chosen function is not critical to the functioning of the aligner. The similarity should increase with the desirability of the alignment, but otherwise there are no fixed rules.

In the current implementation the similarity is calculated as follows:

All words in the input texts are split into sets of trigrams (sometimes called 3-shingles). The trigrams are obtained by first prefixing and suffixing the word with two spaces respectively, then cutting the resulting string into all possible strings of length 3. This means that all trigrams partially overlap each other.

To calculate the similarity between two words three sets are built: the set of trigrams in word a, the set of trigrams in word b, and the set of trigrams common to both words. The similarity is then given by the formula:

\[\mbox{similarity}(a,b)= \frac{2|set_{a} \cap set_{b}|}{|set_a| + |set_b|}\]

The factor of 2 was added to bring the similarity of identical words to 1.

This is sometimes called the Sørensen–Dice coefficient.

An example calculation follows:

⎵⎵h ⎵hl hlo lod odo dou ouu uui uic ico co⎵ o⎵⎵ ⎵⎵l ⎵lu lud udo dou oui uic ico co⎵ o⎵⎵ hlodouuico ludouico Similarity: 2 * 5 / (12 + 10) = 0.454545

Calculating similarity using trigrams

The similarity based on trigrams was chosen because its calculation can be done in \(\mathcal{O}(n)\) time whereas a similarity based on Levenshtein distance needs \(\mathcal{O}(n^2)\) time. The sets of trigrams for each input word are calculated only once and if you presort the trigrams in these sets (to be implemented), the common set can be found in \(\mathcal{O}(n)\) time.

Optimizations yet to be implemented: in a first step gather all trigrams in all input texts, give each one an integer id, and later operate on the ids only. Maybe hash each trigram onto a value 0..63 and build a bitmask for each word, later operate on the masks only.

References

[Gotoh1982]

Gotoh, O. 1982, An Improved Algorithm for Matching Biological Sequences, J. Mol. Biol. 162, 705-708 http://jaligner.sourceforge.net/references/gotoh1982.pdf