CCeH Cologne — Marcello Perathoner <marcello@perathoner.de>
DiXiT/ESTS Antwerp 2016
Problems
Standardize on one language: Python
One problem we face is calculating manuscript affinity (similarity).
A problem of this size is not tractable in SQL.
We want to do affinity calculation in RAM for speed.
NumPy is a NUMerical PYthon library for multi-dimensional arrays and matrices.
There are:
Matrix size: 1.4 MB. Easy fit.
Input: the Readings Matrix: 185 × 7450.
Result: the Affinity Matrix: 185 × 185.
The Ancestry Matrix: 185 × 185.
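The 1.4 MB figure can be checked with a short sketch. It assumes each reading is stored as one unsigned byte (dtype uint8), which the slides do not state but which matches the quoted size. The names N_MANUSCRIPTS and N_COLUMNS are illustrative only.

```python
import numpy as np

# Assumption: one unsigned byte per reading, with 0 meaning "undefined".
N_MANUSCRIPTS = 185   # rows of the Readings Matrix
N_COLUMNS = 7450      # columns of the Readings Matrix

readings_matrix = np.zeros((N_MANUSCRIPTS, N_COLUMNS), dtype=np.uint8)

# nbytes is rows * columns * bytes per element.
readings_bytes = readings_matrix.nbytes
readings_megabytes = readings_bytes / 1e6

print(readings_bytes, "bytes =", round(readings_megabytes, 1), "MB")
```

With a wider dtype the size grows in proportion (for example, 4 bytes per reading would give about 5.5 MB), which still fits in RAM easily.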
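import numpy as np

# readings_matrix: 185 × 7450 array of readings, 0 = reading not defined
defined_matrix = np.zeros ((185, 185))
equal_matrix   = np.zeros ((185, 185))

for i in range (0, 185):
    readings_i = readings_matrix[i]
    defined_i = np.greater (readings_i, 0)
    for j in range (i + 1, 185):
        readings_j = readings_matrix[j]
        defined_j = np.greater (readings_j, 0)
        # columns where both manuscripts have a reading
        defined_both = np.logical_and (defined_i, defined_j)
        # columns where both have the same reading
        equal = np.logical_and (
            defined_both, np.equal (readings_i, readings_j))
        defined_matrix[i,j] = np.sum (defined_both)
        equal_matrix[i,j] = np.sum (equal)

# 0 / 0 raises an 'invalid' warning, not 'divide', so silence both
with np.errstate (divide = 'ignore', invalid = 'ignore'):
    affinity_matrix = equal_matrix / defined_matrix
    affinity_matrix[defined_matrix == 0] = 0.0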
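The same counts can also be obtained without the explicit double loop. The following is a sketch of an alternative, not the method used in the project: it replaces the pairwise loop with matrix products, one per distinct reading value. The function name affinity_vectorized and the toy data are hypothetical. Unlike the loop, it fills the whole symmetric matrix, including the diagonal.

```python
import numpy as np

def affinity_vectorized(readings_matrix):
    """Return (defined_matrix, equal_matrix, affinity_matrix) for all row pairs.

    Assumes, as in the loop version, that a reading > 0 is defined
    and that 0 means "no reading".
    """
    defined = (readings_matrix > 0).astype(np.int64)
    # Entry [i, j] counts the columns where rows i and j are both defined.
    defined_matrix = defined @ defined.T

    equal_matrix = np.zeros_like(defined_matrix)
    for value in np.unique(readings_matrix):
        if value <= 0:
            continue  # skip the "undefined" marker
        has_value = (readings_matrix == value).astype(np.int64)
        # Entry [i, j] counts the columns where both rows carry this value.
        equal_matrix += has_value @ has_value.T

    with np.errstate(divide='ignore', invalid='ignore'):
        affinity_matrix = equal_matrix / defined_matrix
    affinity_matrix[defined_matrix == 0] = 0.0
    return defined_matrix, equal_matrix, affinity_matrix

# Toy data: 3 rows, 4 columns; row 2 has no readings at all.
toy = np.array([[1, 2, 0, 1],
                [1, 0, 0, 2],
                [0, 0, 0, 0]], dtype=np.uint8)
toy_defined, toy_equal, toy_affinity = affinity_vectorized(toy)
print(toy_affinity)
```

In the toy data, rows 0 and 1 are both defined in two columns and agree in one of them, so their affinity is 0.5. The row without readings gets affinity 0 with every row.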
As measured on my laptop:
Calculation of affinity with NumPy         |  3 s
Calculation of ancestry with NumPy         |  8 s
Writing 17020 records into MySQL database  | 40 s
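The record count matches the number of unordered manuscript pairs, 185 × 184 / 2 = 17020. Reading it as one record per pair is an assumption, since the slides do not say what a record contains. A minimal sketch that enumerates those pairs:

```python
import numpy as np

N_MANUSCRIPTS = 185

# Index pairs (i, j) with i < j, i.e. the strict upper triangle
# that the nested affinity loop visits.
rows, cols = np.triu_indices(N_MANUSCRIPTS, k=1)
n_pairs = len(rows)

print(n_pairs)
```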
We have rebuilt a complex system in less than 6 months using free software tools.
We will continue working towards a better user interface.
We will provide installer packages on GitHub.