Greek New Testament

Towards a global stemma of the Greek New Testament textual tradition

Technological issues and solutions.

CCeH Cologne — Marcello Perathoner <marcello@perathoner.de>

DiXiT/ESTS Antwerp 2016

Current Situation

A system based on MySQL with scripts in Perl, Python and PHP.
Evolved from an even more primitive system based on FoxPro (dBase).

Problems

It is slow.
It is hard to evolve because it is poorly specified.

Path Forward

Standardize and modernize toolchain
Create visual user-interface
Specify and document

Python

Standardize on one language: Python

Easy to learn
Many scientific libraries
- Numpy (fast numeric calculations on large datasets)
- Scikit-learn (clustering, dimensionality reduction)
- Biopython (phylogenetic trees)
Good for gluing things together

Manuscript Affinity

One problem we face is calculating manuscript affinity (similarity).

$\text{Affinity} = {\text{No. of equal passages}\over{\text{No. of passages defined in both mss.}}}$
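As a minimal sketch, the formula can be evaluated for a single pair of manuscripts in plain Python. The function name `affinity` is hypothetical, and the sketch assumes that a reading of 0 marks a passage where the manuscript has no text, as in the NumPy code later in this talk. The sample values are the first eight passages of two rows from the Readings Matrix slide.

```python
def affinity(readings_a, readings_b):
    """Share of agreeing passages among those defined in both manuscripts."""
    defined_both = 0
    equal = 0
    for a, b in zip(readings_a, readings_b):
        if a > 0 and b > 0:  # assumption: 0 marks a passage with no text
            defined_both += 1
            if a == b:
                equal += 1
    # Avoid dividing by zero when the manuscripts share no passage.
    return equal / defined_both if defined_both else 0.0


# Readings of two manuscripts at eight passages (0 = not defined).
ms_a = [1, 1, 2, 1, 1, 1, 1, 1]
ms_b = [0, 0, 1, 1, 1, 2, 1, 1]

result = affinity(ms_a, ms_b)  # 4 equal out of 6 passages defined in both
```

Here six passages are defined in both manuscripts and four of them agree, so the affinity is 4/6.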

185 manuscripts
7450 variant passages (in ACTA)
17020 ways to pair manuscripts (185 × 184 / 2)
126,799,000 comparisons of readings (17020 × 7450)
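The figures above can be checked with a few lines of Python. This is only the arithmetic, using the manuscript and passage counts given in the list; the variable names are illustrative.

```python
from math import comb

n_mss = 185        # manuscripts
n_passages = 7450  # variant passages in Acts

n_pairs = comb(n_mss, 2)              # unordered pairs of manuscripts
n_comparisons = n_pairs * n_passages  # one comparison per pair and passage

print(n_pairs)        # 17020
print(n_comparisons)  # 126799000
```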

This problem size is not tractable in SQL.

Numpy

We want to do affinity calculation in RAM for speed.

Numpy is a NUMerical PYthon library for multi-dimensional arrays and matrices.

There are:

$185 \text{ mss.} \times 7450 \text{ variants} = 1378250 \text{ readings}$

Matrix size: about 1.4 MB at one byte per reading. Fits easily in RAM.
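The memory estimate can be reproduced as follows. This sketch assumes one unsigned byte per reading (`dtype=np.uint8`), which is what the 1.4 MB figure implies; the element type actually used in the project is not stated here.

```python
import numpy as np

# Assumption: one unsigned byte per reading (reading numbers below 256).
readings_matrix = np.zeros((185, 7450), dtype=np.uint8)

print(readings_matrix.size)    # 1378250 readings
print(readings_matrix.nbytes)  # 1378250 bytes, about 1.4 MB
```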

Readings Matrix

Input: the Readings Matrix: 185 × 7450.

$\renewcommand\arraystretch{1.3} \begin{blockarray}{c>{\enskip}*{9}{c}<{\enskip}} & 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & \cdots \\ \noalign{\medskip} \begin{block}{c(>{\enskip}*{9}{c}<{\enskip})} \textrm{A} & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & \cdots \\ \mathfrak{P}_8 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 & \cdots \\ \mathfrak{P}_{29} & 1 & 1 & 2 & 1 & 1 & 1 & 1 & 1 & \cdots \\ \mathfrak{P}_{33} & 0 & 0 & 1 & 1 & 1 & 2 & 1 & 1 & \cdots \\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \ddots \\ \ell2010 & 1 & 1 & 0 & 1 & 1 & 1 & 3 & 1 & \cdots \\ \end{block} \end{blockarray}$
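A small excerpt of such a matrix might be held in NumPy like this. The values are the first eight passages of the five rows shown above; the ASCII labels and the `row_of` lookup table are hypothetical helpers for illustration, not part of the project.

```python
import numpy as np

# ASCII stand-ins for the manuscript sigla shown in the matrix above.
labels = ["A", "P8", "P29", "P33", "l2010"]

# One row per manuscript, one column per passage; 0 = passage not defined.
readings = np.array([
    [1, 1, 1, 1, 1, 1, 1, 1],  # A
    [0, 0, 0, 0, 0, 0, 1, 1],  # P8
    [1, 1, 2, 1, 1, 1, 1, 1],  # P29
    [0, 0, 1, 1, 1, 2, 1, 1],  # P33
    [1, 1, 0, 1, 1, 1, 3, 1],  # l2010
], dtype=np.uint8)

# Hypothetical lookup from label to row index.
row_of = {label: i for i, label in enumerate(labels)}

# Reading of P29 at passage 3 (column index 2, since columns are 0-based).
reading = readings[row_of["P29"], 2]
```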

Affinity Matrix

Result: the Affinity Matrix: 185 × 185.

$\renewcommand\arraystretch{1.3} \begin{blockarray}{c>{\enskip}*{6}{c}<{\enskip}} & \textrm{A} & \mathfrak{P}_8 & \mathfrak{P}_{29} & \mathfrak{P}_{33} & \cdots & \ell2010 \\ \noalign{\medskip} \begin{block}{c(>{\enskip}*{6}{c}<{\enskip})} \textrm{A} & 1 & a_{1,2} & a_{1,3} & a_{1,4} & \cdots & a_{1,185} \\ \mathfrak{P}_8 & a_{2,1} & 1 & a_{2,3} & a_{2,4} & \cdots & a_{2,185} \\ \mathfrak{P}_{29} & a_{3,1} & a_{3,2} & 1 & a_{3,4} & \cdots & a_{3,185} \\ \mathfrak{P}_{33} & a_{4,1} & a_{4,2} & a_{4,3} & 1 & \cdots & a_{4,185} \\ \vdots & \vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\ \ell2010 & a_{185,1} & a_{185,2} & a_{185,3} & a_{185,4} & \cdots & 1 \\ \end{block} \end{blockarray}$

$a_{m,n} = a_{n,m}$

Ancestry Matrix

The Ancestry Matrix: 185 × 185.

$\renewcommand\arraystretch{1.3} \begin{blockarray}{c>{\enskip}*{6}{c}<{\enskip}} & \textrm{A} & \mathfrak{P}_8 & \mathfrak{P}_{29} & \mathfrak{P}_{33} & \cdots & \ell2010 \\ \noalign{\medskip} \begin{block}{c(>{\enskip}*{6}{c}<{\enskip})} \textrm{A} & 0 & a_{1,2} & a_{1,3} & a_{1,4} & \cdots & a_{1,185} \\ \mathfrak{P}_8 & a_{2,1} & 0 & a_{2,3} & a_{2,4} & \cdots & a_{2,185} \\ \mathfrak{P}_{29} & a_{3,1} & a_{3,2} & 0 & a_{3,4} & \cdots & a_{3,185} \\ \mathfrak{P}_{33} & a_{4,1} & a_{4,2} & a_{4,3} & 0 & \cdots & a_{4,185} \\ \vdots & \vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\ \ell2010 & a_{185,1} & a_{185,2} & a_{185,3} & a_{185,4} & \cdots & 0 \\ \end{block} \end{blockarray}$

$a_{m,n} \ne a_{n,m} \text{ or } a_{m,n} = a_{n,m} = 0$

$a_{m,n} > 0\;\text{if N is older than M and }\mathrm{length} (N) \ge \mathrm{length} (M)/2$

Definition Matrix Plot

Shows where the manuscripts have text.
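The data behind such a plot could be derived as below. This is a sketch under the assumption that a reading of 0 means "no text at this passage", as in the NumPy code later in this talk; the three sample rows and the `coverage` summary are illustrative only.

```python
import numpy as np

# Excerpt of a readings matrix: rows are manuscripts, columns are passages.
readings = np.array([
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0, 0, 1, 1],
    [0, 0, 1, 1, 1, 2, 1, 1],
], dtype=np.uint8)

# True where a manuscript has text at a passage.
definition_matrix = readings > 0

# Number of defined passages per manuscript.
coverage = definition_matrix.sum(axis=1)
```

Passing `definition_matrix` to an image plotting routine would then give one pixel per manuscript and passage.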

Affinity Matrix Plot

Shows how similar the manuscripts are.

Ancestry Matrix Plot

Move along a row to find the ancestors of the ms.
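Reading a row can be sketched as follows, relying only on the rule that a positive entry in row M, column N marks N as an ancestor of M. The labels, the matrix values and the helper `ancestors` are all hypothetical; what the magnitude of an entry means is not specified here, so the sketch only tests for values greater than zero.

```python
import numpy as np

labels = ["A", "ms1", "ms2", "ms3"]  # hypothetical manuscript labels

# Hypothetical ancestry matrix: entry [m, n] > 0 means n is an ancestor of m.
ancestry = np.array([
    [0, 0, 0, 0],
    [5, 0, 0, 0],
    [7, 3, 0, 0],
    [9, 0, 4, 0],
])


def ancestors(ms):
    """Return the labels of all manuscripts marked as ancestors of ms."""
    row = ancestry[labels.index(ms)]
    return [labels[n] for n in np.nonzero(row > 0)[0]]


result = ancestors("ms2")  # ['A', 'ms1']
```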

The Python Code

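import numpy as np

# readings_matrix: 185 × 7450 integer array, 0 = passage not defined in that ms.
n = readings_matrix.shape[0]
defined_matrix = np.zeros ((n, n), dtype = int)
equal_matrix   = np.zeros ((n, n), dtype = int)

for i in range (n):
    readings_i = readings_matrix[i]
    defined_i  = np.greater (readings_i, 0)

    for j in range (i, n):
        readings_j = readings_matrix[j]
        defined_j  = np.greater (readings_j, 0)

        # passages where both mss. have text, and where they also agree
        defined_both = np.logical_and (defined_i, defined_j)
        equal = np.logical_and (
            defined_both, np.equal (readings_i, readings_j))
        # affinity is symmetric: fill both triangles
        defined_matrix[i,j] = defined_matrix[j,i] = np.sum (defined_both)
        equal_matrix[i,j]   = equal_matrix[j,i]   = np.sum (equal)

# 0/0 raises an 'invalid' warning, so silence that as well as 'divide'
with np.errstate (divide = 'ignore', invalid = 'ignore'):
    affinity_matrix = equal_matrix / defined_matrix
    affinity_matrix[defined_matrix == 0] = 0.0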
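The nested loop above could also be replaced by matrix products. The following is an alternative sketch, not the code used in the project: the function name `affinity_vectorized` is hypothetical, and it assumes readings are small non-negative integers with 0 meaning "not defined". It counts agreements one reading value at a time and fills the whole symmetric matrix, including the diagonal.

```python
import numpy as np


def affinity_vectorized(readings_matrix):
    """Affinity of every pair of rows, computed with matrix products."""
    # 1 where a manuscript has text at a passage, else 0.
    defined = (readings_matrix > 0).astype(np.int64)
    # Entry [i, j] counts the passages defined in both i and j.
    defined_matrix = defined @ defined.T

    # Entry [i, j] counts the passages where i and j share the same reading.
    equal_matrix = np.zeros_like(defined_matrix)
    for k in range(1, int(readings_matrix.max()) + 1):
        has_k = (readings_matrix == k).astype(np.int64)
        equal_matrix += has_k @ has_k.T

    # Pairs with no common passage would divide 0 by 0; set them to 0.
    with np.errstate(divide="ignore", invalid="ignore"):
        affinity = equal_matrix / defined_matrix
    affinity[defined_matrix == 0] = 0.0
    return affinity


sample = np.array([
    [1, 1, 2, 1, 1, 1, 1, 1],
    [0, 0, 1, 1, 1, 2, 1, 1],
])
result = affinity_vectorized(sample)
```

The loop now runs once per distinct reading number instead of once per manuscript pair.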

Speed

As measured on my laptop:

Calculation of affinity with numpy	3s
Calculation of ancestry with numpy	8s
Writing 17020 records into the MySQL database	40s
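The 17020 records correspond to one row per manuscript pair. The sketch below shows how such rows could be taken from the upper triangle of the affinity matrix and written in one batch. It uses Python's built-in `sqlite3` as a stand-in for MySQL; the table name, the column names and the random placeholder values are assumptions made for illustration.

```python
import sqlite3

import numpy as np

n = 185
# Placeholder values standing in for a real affinity matrix.
affinity_matrix = np.random.default_rng(0).random((n, n))

# Indices of the upper triangle without the diagonal: one entry per pair.
i_idx, j_idx = np.triu_indices(n, k=1)
records = [
    (int(i), int(j), float(affinity_matrix[i, j]))
    for i, j in zip(i_idx, j_idx)
]

# In-memory SQLite database as a stand-in for the MySQL server.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE affinity (ms1 INTEGER, ms2 INTEGER, affinity REAL)")
con.executemany("INSERT INTO affinity VALUES (?, ?, ?)", records)
con.commit()

count = con.execute("SELECT COUNT(*) FROM affinity").fetchone()[0]
print(count)  # 17020
```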

Conclusion

We have rebuilt a complex system in less than 6 months using free software tools.

We will continue to work towards a better user interface.

We will provide installer packages on GitHub.