Greek New Testament

Towards a global stemma of the Greek New Testament textual tradition

Technological issues and solutions.

CCeH Cologne — Marcello Perathoner <marcello@perathoner.de>

DiXiT/ESTS Antwerp 2016

Current Situation

  • A system based on MySQL with scripts in Perl, Python and PHP.
  • Evolved from an even more primitive system based on FoxPro (dBase).

Problems

  • It is slow.
  • It is hard to evolve because it is poorly specified.

Path Forward

  • Standardize and modernize the toolchain
  • Create a visual user interface
  • Specify and document the system

Python

Standardize on one language: Python

  • Easy to learn
  • Many scientific libraries
    • NumPy (fast numeric calculations on large datasets)
    • scikit-learn (clustering, dimensionality reduction)
    • Biopython (phylogenetic trees)
  • Good for gluing things together

Manuscript Affinity

One problem we face is to calculate manuscript affinity (similarity).

\text{Affinity} = {\text{No. of equal passages}\over{\text{No. of passages defined in both mss.}}}

  • 185 manuscripts
  • 7450 variant passages (in ACTA)
  • 17020 ways to pair manuscripts
  • 126,799,000 comparisons of readings (17020 pairs × 7450 passages)
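
The pair and comparison counts above follow from simple combinatorics. A quick check, assuming every unordered pair of manuscripts is compared once at every variant passage (the variable names are illustrative only):

```python
from math import comb

n_mss = 185        # manuscripts
n_passages = 7450  # variant passages

# unordered pairs of distinct manuscripts: n * (n - 1) / 2
n_pairs = comb(n_mss, 2)

# one comparison of readings per pair and passage
n_comparisons = n_pairs * n_passages

print(n_pairs, n_comparisons)
```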

This problem size is not tractable in SQL.

NumPy

We want to do the affinity calculation in RAM for speed.

NumPy is a NUMerical PYthon library for multi-dimensional arrays and matrices.

There are:

185 \text{ mss.} \times 7450 \text{ variants} = 1378250 \text{ readings}

Matrix size at one byte per reading: about 1.4 MB. It fits easily in RAM.
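
The size estimate can be reproduced by allocating an array of the same shape. This is a minimal sketch that assumes one unsigned byte per reading (the `uint8` dtype is an assumption, chosen because it matches the 1.4 MB figure):

```python
import numpy as np

# One unsigned byte per reading is an assumption; it allows reading
# labels 0..255, with 0 meaning "no text at this passage".
readings_matrix = np.zeros((185, 7450), dtype=np.uint8)

print(readings_matrix.size)    # number of readings
print(readings_matrix.nbytes)  # bytes of RAM used by the array data
```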

Readings Matrix

Input: the Readings Matrix: 185 × 7450.

\renewcommand\arraystretch{1.3}

\begin{blockarray}{c>{\enskip}*{9}{c}<{\enskip}}
                    & 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & \cdots \\
            \noalign{\medskip}
\begin{block}{c(>{\enskip}*{9}{c}<{\enskip})}
  \textrm{A}        & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & \cdots \\
  \mathfrak{P}_8    & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 & \cdots \\
  \mathfrak{P}_{29} & 1 & 1 & 2 & 1 & 1 & 1 & 1 & 1 & \cdots \\
  \mathfrak{P}_{33} & 0 & 0 & 1 & 1 & 1 & 2 & 1 & 1 & \cdots \\
  \vdots            & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \ddots \\
  \ell2010          & 1 & 1 & 0 & 1 & 1 & 1 & 3 & 1 & \cdots \\
\end{block}
\end{blockarray}

Affinity Matrix

Result: the Affinity Matrix: 185 × 185.

\renewcommand\arraystretch{1.3}

\begin{blockarray}{c>{\enskip}*{6}{c}<{\enskip}}
   & \textrm{A} & \mathfrak{P}_8 & \mathfrak{P}_{29} & \mathfrak{P}_{33} & \cdots & \ell2010 \\
   \noalign{\medskip}
\begin{block}{c(>{\enskip}*{6}{c}<{\enskip})}
   \textrm{A}        & 1         & a_{1,2}   & a_{1,3}   & a_{1,4}   & \cdots  & a_{1,185}  \\
   \mathfrak{P}_8    & a_{2,1}   & 1         & a_{2,3}   & a_{2,4}   & \cdots  & a_{2,185}  \\
   \mathfrak{P}_{29} & a_{3,1}   & a_{3,2}   & 1         & a_{3,4}   & \cdots  & a_{3,185}  \\
   \mathfrak{P}_{33} & a_{4,1}   & a_{4,2}   & a_{4,3}   & 1         & \cdots  & a_{4,185}  \\
   \vdots            & \vdots    & \vdots    & \vdots    & \vdots    & \ddots  & \vdots     \\
   \ell2010          & a_{185,1} & a_{185,2} & a_{185,3} & a_{185,4} & \cdots  & 1  \\
\end{block}
\end{blockarray}

a_{m,n} = a_{n,m}
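
To make the affinity formula and its symmetry concrete, here is a small worked sketch. It uses only the first eight columns of four rows shown on the Readings Matrix slide, so the resulting values are illustrative and not the real affinities. The helper name `affinity` is hypothetical.

```python
import numpy as np

# First eight columns of four rows from the Readings Matrix slide;
# 0 means the manuscript has no text at that passage.
readings = np.array([
    [1, 1, 1, 1, 1, 1, 1, 1],  # A
    [0, 0, 0, 0, 0, 0, 1, 1],  # P8
    [1, 1, 2, 1, 1, 1, 1, 1],  # P29
    [0, 0, 1, 1, 1, 2, 1, 1],  # P33
], dtype=np.uint8)


def affinity(row_m, row_n):
    """Equal passages divided by passages defined in both rows."""
    defined_both = (row_m > 0) & (row_n > 0)
    equal = defined_both & (row_m == row_n)
    n_defined = int(defined_both.sum())
    # no shared passages: report 0.0 instead of dividing by zero
    return float(equal.sum()) / n_defined if n_defined else 0.0


print(affinity(readings[0], readings[2]))  # A vs P29: 7 of 8 equal
print(affinity(readings[2], readings[3]))  # P29 vs P33: 4 of 6 equal
```

Swapping the two arguments gives the same value, because both the "defined in both" and the "equal" tests are symmetric.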

Ancestry Matrix

The Ancestry Matrix: 185 × 185.

\renewcommand\arraystretch{1.3}

\begin{blockarray}{c>{\enskip}*{6}{c}<{\enskip}}
   & \textrm{A} & \mathfrak{P}_8 & \mathfrak{P}_{29} & \mathfrak{P}_{33} & \cdots & \ell2010 \\
   \noalign{\medskip}
\begin{block}{c(>{\enskip}*{6}{c}<{\enskip})}
   \textrm{A}        & 0         & a_{1,2}   & a_{1,3}   & a_{1,4}   & \cdots  & a_{1,185}  \\
   \mathfrak{P}_8    & a_{2,1}   & 0         & a_{2,3}   & a_{2,4}   & \cdots  & a_{2,185}  \\
   \mathfrak{P}_{29} & a_{3,1}   & a_{3,2}   & 0         & a_{3,4}   & \cdots  & a_{3,185}  \\
   \mathfrak{P}_{33} & a_{4,1}   & a_{4,2}   & a_{4,3}   & 0         & \cdots  & a_{4,185}  \\
   \vdots            & \vdots    & \vdots    & \vdots    & \vdots    & \ddots  & \vdots     \\
   \ell2010          & a_{185,1} & a_{185,2} & a_{185,3} & a_{185,4} & \cdots  & 0  \\
\end{block}
\end{blockarray}

a_{m,n} \ne a_{n,m} \text{ or } a_{m,n} = a_{n,m} = 0

a_{m,n} > 0\;\text{if N older than M and }\mathrm{length} (N) \ge \mathrm{length} (M)/2
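
The condition above can be evaluated for all pairs at once with broadcasting. The following is only a sketch of the condition itself: the slides do not say how "older" is determined or which value a non-zero entry takes, so `older` and `length` are hypothetical inputs invented for illustration.

```python
import numpy as np

# Hypothetical inputs for three manuscripts (not taken from the project):
# older[m, n] is True if manuscript n counts as older than manuscript m;
# length[k] is the number of passages where manuscript k has text.
older = np.array([
    [False, False, False],
    [True,  False, False],
    [True,  True,  False],
])
length = np.array([7450, 300, 7000])

# length(N) >= length(M) / 2, evaluated for every (m, n) pair at once
long_enough = length[np.newaxis, :] >= length[:, np.newaxis] / 2

# entries of the Ancestry Matrix where the condition holds
may_be_ancestor = older & long_enough

print(may_be_ancestor)
```

In this toy input the second manuscript is older than the third but far too short, so it is excluded as a potential ancestor.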

Definition Matrix Plot

_images/mss-definition.png

Shows where the manuscripts have text.

Affinity Matrix Plot

_images/affinity-00.png

Shows how similar the manuscripts are.

Ancestry Matrix Plot

_images/ancestry-00.png

Move along a row to find the ancestors of the ms.

The Python Code

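import numpy as np

# readings_matrix: 185 x 7450 integer array, 0 = no text at this passage
defined_matrix = np.zeros ((185, 185))
equal_matrix   = np.zeros ((185, 185))

for i in range (0, 185):
    readings_i = readings_matrix[i]
    defined_i  = np.greater (readings_i, 0)

    # visit each pair of manuscripts once (upper triangle only)
    for j in range (i + 1, 185):
        readings_j = readings_matrix[j]
        defined_j  = np.greater (readings_j, 0)

        # passages where both manuscripts have text
        defined_both = np.logical_and (defined_i, defined_j)
        # ... and where they also agree on the reading
        equal = np.logical_and (
            defined_both, np.equal (readings_i, readings_j))
        defined_matrix[i,j] = np.sum (defined_both)
        equal_matrix[i,j] = np.sum (equal)

# 0 / 0 triggers the 'invalid' warning, so silence it along with 'divide'
with np.errstate (divide = 'ignore', invalid = 'ignore'):
    affinity_matrix = equal_matrix / defined_matrix
    affinity_matrix[defined_matrix == 0] = 0.0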
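
The double loop above visits each pair of manuscripts once. As a possible alternative, and not the method used in the project, the same counts can be obtained with matrix products: one product of the "has text" mask with its transpose, plus one product per reading label. The function name `affinity_vectorized` and the random demo data are assumptions made for this sketch.

```python
import numpy as np


def affinity_vectorized(readings_matrix):
    """Return (equal, defined, affinity) matrices for all pairs of rows."""
    defined = (readings_matrix > 0).astype(np.int64)

    # defined_matrix[i, j]: passages where both i and j have text
    defined_matrix = defined @ defined.T

    # equal_matrix[i, j]: passages where i and j share the same reading,
    # accumulated one reading label at a time (label 0 is skipped)
    equal_matrix = np.zeros_like(defined_matrix)
    for label in np.unique(readings_matrix):
        if label == 0:
            continue
        has_label = (readings_matrix == label).astype(np.int64)
        equal_matrix += has_label @ has_label.T

    # divide only where at least one passage is shared; elsewhere keep 0.0
    affinity_matrix = np.zeros(defined_matrix.shape, dtype=float)
    np.divide(equal_matrix, defined_matrix,
              out=affinity_matrix, where=defined_matrix > 0)
    return equal_matrix, defined_matrix, affinity_matrix


# small random matrix standing in for the real 185 x 7450 data
rng = np.random.default_rng(0)
demo = rng.integers(0, 4, size=(6, 50)).astype(np.uint8)
equal_matrix, defined_matrix, affinity_matrix = affinity_vectorized(demo)
print(affinity_matrix.round(2))
```

Unlike the loop, this variant fills the whole matrix, including the lower triangle and the diagonal, which matches the symmetric layout on the Affinity Matrix slide.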

Speed

As measured on my laptop:

Calculation of affinity with NumPy               3 s
Calculation of ancestry with NumPy               8 s
Writing 17020 records into the MySQL database   40 s

Conclusion

We have rebuilt a complex system in less than 6 months using free software tools.

We will continue exploring towards a better user interface.

We will provide installer packages on GitHub.