We extract every chapter of every capitulary from all manuscripts and store the
chapters in the Postgres database. The text stored in the database is already normalized.
If a manuscript contains more than one copy of a chapter, all copies are extracted. If
one or more correcting hands were active in the chapter, the original and each corrected
version are extracted.
The online collation tool knows about all versions and offers them to the user.
A cron job runs the Makefile at regular intervals. The Makefile knows all the
dependencies between the files and runs the appropriate tools to keep the database
up-to-date with the manuscript files.
All preprocessed files can be found in the publ/cache/collation directory. The
preprocessed files are normalized, e.g. the letter V is replaced by U.
The import_data.py script imports the preprocessed text files into the
database.
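Purely as an illustration of this preprocessing step (only the V-to-U replacement is
taken from the description above; the function name and the other rules are
assumptions), a normalization pass might look like this:

    def normalize(text):
        """Illustrative sketch of a normalization pass, not the actual tool."""
        text = text.lower()             # assumed: fold case first
        text = text.replace('v', 'u')   # the V -> U replacement mentioned above
        return ' '.join(text.split())   # assumed: collapse runs of whitespace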
The collation tool consists of two parts: a frontend written in JavaScript using the
Vue.js library, and a backend application server written in Python using the
super-collator library.
The application server retrieves the chapters from the database and collates them. The
results of the collation are sent as JSON to the frontend, which formats them for
display.
The collation unit is the chapter, so that only short texts need to be collated,
saving much processing time.
The WordPress collation plugin delivers the JavaScript client to the user.
After that, all communication happens directly between the client and the
application server.
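To make this division of labour concrete, the exchange could look roughly like the
sketch below; every field name and identifier in it is a hypothetical placeholder, not
the actual API of the application server:

    # Hypothetical request sent by the JavaScript client: collate the
    # versions of one chapter that the user selected.
    request = {
        'chapter': 'BK.139_2',          # hypothetical chapter identifier
        'witnesses': ['ms-a', 'ms-b'],  # hypothetical witness identifiers
    }
    # Hypothetical JSON response from the application server: one row of
    # tokens per witness, with '-' marking the gaps left by the alignment.
    response = {
        'table': [
            {'witness': 'ms-a', 'tokens': ['the', 'quick', 'brown', 'fox']},
            {'witness': 'ms-b', 'tokens': ['-', 'sick', '-', 'fox']},
        ],
    }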
The application server uses an enhancement of the Needleman-Wunsch algorithm by Gotoh
[Gotoh1982]. This section provides a very high-level overview of the algorithm.
In phase 1 the algorithm builds a table. For example, this is the table built for the
two strings "the quick brown fox jumps over the lazy dog" and "sick fox is crazy":
          |                     | sick                | fox                 | is                  | crazy
    ------+---------------------+---------------------+---------------------+---------------------+---------------------
          |  0.00  0.00  0.00 ⏹ | -0.50 -0.50 -0.50 ← | -0.70 -0.70 -0.70 ← | -0.90 -0.90 -0.90 ← | -1.10 -1.10 -1.10 ←
    the   | -0.50 -0.50 -0.50 ↑ |  0.00 -0.70 -0.70 ↖ | -0.50 -0.90 -0.50 ← | -0.70 -1.10 -0.70 ← | -0.90 -1.30 -0.90 ←
    quick | -0.70 -0.70 -0.70 ↑ | -0.04 -0.50 -0.90 ↖ |  0.00 -1.00 -0.54 ↖ | -0.50 -1.20 -0.50 ← | -0.70 -1.40 -0.70 ←
    brown | -0.90 -0.90 -0.90 ↑ | -0.70 -0.54 -1.10 ↑ | -0.04 -0.50 -1.04 ↖ |  0.00 -1.00 -0.54 ↖ | -0.50 -1.20 -0.50 ←
    fox   | -1.10 -1.10 -1.10 ↑ | -0.90 -0.74 -1.30 ↑ |  0.46 -0.54 -1.24 ↖ | -0.04 -0.50 -0.04 ← |  0.00 -1.00 -0.24 ↖
    jumps | -1.30 -1.30 -1.30 ↑ | -1.10 -0.94 -1.50 ↑ | -0.74 -0.04 -1.44 ↑ |  0.64 -0.54 -0.54 ↖ | -0.04 -0.50  0.14 ←
    over  | -1.50 -1.50 -1.50 ↑ | -1.30 -1.14 -1.70 ↑ | -0.94 -0.24 -1.64 ↑ | -0.04  0.14 -0.74 ↑ |  0.64 -0.36 -0.36 ↖
    the   | -1.70 -1.70 -1.70 ↑ | -1.50 -1.34 -1.90 ↑ | -1.14 -0.44 -1.84 ↑ | -0.24 -0.06 -0.94 ↑ |  0.14  0.14 -0.56 ↑
    lazy  | -1.90 -1.90 -1.90 ↑ | -1.70 -1.54 -2.10 ↑ | -1.34 -0.64 -2.04 ↑ | -0.44 -0.26 -1.14 ↑ |  0.40 -0.06 -0.76 ↖
    dog   | -2.10 -2.10 -2.10 ↑ | -1.90 -1.74 -2.30 ↑ | -1.54 -0.84 -2.24 ↑ | -0.64 -0.46 -1.34 ↑ | -0.26 -0.10 -0.96 ↑
Every cell in the table contains three values, \(D\), \(P\), and \(Q\), and an arrow;
within each cell above they are printed in that order:

    D  P  Q  ↖
We define the score \(S\) for each cell as:
\[S = \max(D, P, Q)\]
The cells in the first row and first column are initialized using the gap start
and gap extension penalties (0.5 and 0.2 in the example above). The numbers for each
remaining cell are calculated using only values from the three cells, to the top-left,
the top, and the left, of the current cell:

\[D_{i,j} = S_{i-1,j-1} + \operatorname{similarity}(a_i, b_j)\]
\[P_{i,j} = \max(S_{i-1,j} - g_s,\; P_{i-1,j} - g_e)\]
\[Q_{i,j} = \max(S_{i,j-1} - g_s,\; Q_{i,j-1} - g_e)\]

where \(g_s\) is the gap start penalty, \(g_e\) the gap extension penalty, \(a_i\) the
word in the row header, and \(b_j\) the word in the column header. The arrow records
which of the three values yields the cell's score \(S\): ↖ for \(D\), ↑ for \(P\), and
← for \(Q\).
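A minimal Python sketch of this table-filling phase, written directly from the
recurrences above (the actual implementation lives in the super-collator library and may
differ in details such as tie-breaking):

    import math

    def build_table(a, b, similarity, gap_start=0.5, gap_extend=0.2):
        """Phase 1: fill the D, P, Q values and the arrow for every cell."""
        m, n = len(a), len(b)
        D = [[-math.inf] * (n + 1) for _ in range(m + 1)]
        P = [[-math.inf] * (n + 1) for _ in range(m + 1)]
        Q = [[-math.inf] * (n + 1) for _ in range(m + 1)]
        arrow = [[''] * (n + 1) for _ in range(m + 1)]

        def S(i, j):
            # the score of a cell is the maximum of its three values
            return max(D[i][j], P[i][j], Q[i][j])

        D[0][0] = P[0][0] = Q[0][0] = 0.0
        arrow[0][0] = '⏹'
        # first row and first column: one gap start, then gap extensions
        for j in range(1, n + 1):
            D[0][j] = P[0][j] = Q[0][j] = -gap_start - gap_extend * (j - 1)
            arrow[0][j] = '←'
        for i in range(1, m + 1):
            D[i][0] = P[i][0] = Q[i][0] = -gap_start - gap_extend * (i - 1)
            arrow[i][0] = '↑'

        # every remaining cell needs only the top-left, top, and left cells
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                D[i][j] = S(i - 1, j - 1) + similarity(a[i - 1], b[j - 1])
                P[i][j] = max(S(i - 1, j) - gap_start, P[i - 1][j] - gap_extend)
                Q[i][j] = max(S(i, j - 1) - gap_start, Q[i][j - 1] - gap_extend)
                best = max(D[i][j], P[i][j], Q[i][j])
                # break ties in favour of gaps, as in the example table
                if best == Q[i][j]:
                    arrow[i][j] = '←'
                elif best == P[i][j]:
                    arrow[i][j] = '↑'
                else:
                    arrow[i][j] = '↖'
        return arrow

Running this with the two example sentences split into words, and with the trigram
similarity described further below, recomputes the example table; floating-point
rounding can make individual tie-broken arrows differ.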
When the table is thus completed, two empty sequences are created. Then the algorithm
starts backtracking from the last (bottom-right) cell following the arrows until it
reaches the first (top-left) cell. If the arrow points:
↑ — the word in the row header is added to the first sequence and a hyphen to the
second sequence;
↖ — the word in the row header is added to the first sequence and the word in the
column header to the second sequence;
← — a hyphen is added to the first sequence and the word in the column header to the
second sequence.
In the example, backtracking starts at the dog/crazy cell in the bottom-right corner
and follows the arrows through the cells dog/crazy (↑), lazy/crazy (↖), the/is (↑),
over/is (↑), jumps/is (↑), fox/is (←), fox/fox (↖), brown/sick (↑), quick/sick (↖),
and the cell of the row "the" in the first column (↑), until it reaches the ⏹ marker
in the top-left cell.
Finally the two sequences are reversed and printed. For the example above this yields
the alignment:

    the  quick  brown  fox  -   jumps  over  the  lazy   dog
    -    sick   -      fox  is  -      -     -    crazy  -
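The backtracking phase can be sketched in a few lines of Python, continuing the hedged
sketch from phase 1 (it consumes the arrow table returned by build_table above):

    def backtrack(a, b, arrow):
        """Phase 2: follow the arrows from the bottom-right cell to ⏹."""
        first, second = [], []
        i, j = len(a), len(b)
        while arrow[i][j] != '⏹':
            if arrow[i][j] == '↑':    # row word aligned against a gap
                first.append(a[i - 1])
                second.append('-')
                i -= 1
            elif arrow[i][j] == '↖':  # row word aligned with column word
                first.append(a[i - 1])
                second.append(b[j - 1])
                i -= 1
                j -= 1
            else:                     # '←': gap aligned against column word
                first.append('-')
                second.append(b[j - 1])
                j -= 1
        # the sequences were built back to front
        first.reverse()
        second.reverse()
        return first, second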
The word comparison function returns a similarity value between 0 and 1, 0 being totally
different and 1 being completely equal. The chosen function is not critical to the
functioning of the aligner. The similarity should increase with the desirability of the
alignment, but otherwise there are no fixed rules.
In the current implementation the similarity is calculated as follows:
All words in the input texts are split into sets of trigrams (sometimes called
3-shingles). The trigrams are obtained by padding the word with two spaces on each side
and then cutting the resulting string into all possible substrings of length 3. This
means that consecutive trigrams partially overlap: the word fox, for example, yields
the trigrams "  f", " fo", "fox", "ox ", and "x  ".
To calculate the similarity between two words three sets are built: the set \(A\) of
trigrams in word a, the set \(B\) of trigrams in word b, and the set \(C\) of trigrams
common to both words. The similarity is then given by the formula:

\[\text{similarity} = \frac{2\,|C|}{|A| + |B|}\]

which is the Sørensen-Dice coefficient of the two trigram sets.
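A direct Python sketch of this similarity, following the description above (the
production code is in the super-collator library):

    def trigrams(word):
        """The set of overlapping trigrams of a word padded with two spaces."""
        padded = '  ' + word + '  '
        return {padded[i:i + 3] for i in range(len(padded) - 2)}

    def similarity(a, b):
        """Dice similarity of two words' trigram sets, between 0 and 1."""
        ta, tb = trigrams(a), trigrams(b)
        return 2.0 * len(ta & tb) / (len(ta) + len(tb))

For example, similarity('quick', 'sick') is 2 × 3 / (7 + 6) ≈ 0.46: the three common
trigrams are "ick", "ck " and "k  ", and this is exactly the bonus added to \(D\) in
the quick/sick cell of the example table.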
The similarity based on trigrams was chosen because it can be calculated in
\(\mathcal{O}(n)\) time, whereas a similarity based on the Levenshtein distance needs
\(\mathcal{O}(n^2)\) time. The sets of trigrams for each input word are calculated
only once, and if the trigrams in these sets are presorted (still to be implemented),
the common set can be found in \(\mathcal{O}(n)\) time.
Optimizations yet to be implemented: in a first step, gather all trigrams in all
input texts and give each one an integer id, then operate on the ids only.
Alternatively, hash each trigram onto a value in the range 0..63 and build a bitmask
for each word, then operate on the masks only; a sketch of this idea follows.
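The bitmask idea could look like the following sketch (the hash function and the choice
of 64 bits are assumptions, and the result is only an approximation because distinct
trigrams may collide on the same bit):

    import zlib

    def trigram_mask(word):
        """Hash every trigram of the word onto a value 0..63 and set that bit."""
        padded = '  ' + word + '  '
        mask = 0
        for i in range(len(padded) - 2):
            # assumed hash: CRC-32 of the trigram, reduced to 6 bits
            mask |= 1 << (zlib.crc32(padded[i:i + 3].encode()) & 63)
        return mask

    def approx_similarity(mask_a, mask_b):
        """Approximate the trigram similarity by counting common bits."""
        common = (mask_a & mask_b).bit_count()  # Python 3.10+
        return 2.0 * common / (mask_a.bit_count() + mask_b.bit_count())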