Normalization of the CBGM Database

Normalization of the CBGM Database

CCeH Cologne — Marcello Perathoner <marcello@perathoner.de>

Akademie der Wissenschaften und der Literatur | Mainz 19.10.2017

Diagnosis

The database had problems because:

  • The database was built by experts in the problem domain with no training in database design. (Spreadsheet Syndrome)
  • The design was severely limited by early database technology, but was not revised when technology got better.

'Structure' of the Legacy Database

UML Database Graph

4 tables for each chapter of Acts. No enforced relations between tables.

  • 28 tables with negative apparatus,
  • 28 tables with lacunae,
  • 28 tables with local stemmata,
  • 28 tables relating manuscripts to local stemmata.

Problem: Compound Columns

  hs   | labez |  labezsuf   |               lesart
-------+-------+-------------+-------------------------------------
 A     | a     |             | ηγαγεν τω ισραηλ σωτηρα ιησουν
 04    | b     |             | ηγειρεν τω ισραηλ σωτηρα ιησουν
 2298  | d     |             | τω ισραηλ ηγειρεν σωτηρα ιησουν
 05    | h     |             | ηγειρεν τω ισραηλ σωτηρα τον ιησουν
 1739  | zw    | a/b/d/h_f   | τω ισραηλ σωτηρα ιησουν

Atomic columns are required for referential integrity.

Solution: Make columns atomic

Unroll labezsuf and add a certainty column:

  hs  | labez | labezsuf | certainty |               lesart
------+-------+----------+-----------+-------------------------------------
 A    | a     |          |         1 | ηγαγεν τω ισραηλ σωτηρα ιησουν
 04   | b     |          |         1 | ηγειρεν τω ισραηλ σωτηρα ιησουν
 2298 | d     |          |         1 | τω ισραηλ ηγειρεν σωτηρα ιησουν
 05   | h     |          |         1 | ηγειρεν τω ισραηλ σωτηρα τον ιησουν
 1739 | a     |          |      0.25 | τω ισραηλ σωτηρα ιησουν
 1739 | b     |          |      0.25 | τω ισραηλ σωτηρα ιησουν
 1739 | d     |          |      0.25 | τω ισραηλ σωτηρα ιησουν
 1739 | h     | f        |      0.25 | τω ισραηλ σωτηρα ιησουν

Problem: Data Inconsistency

Update anomaly: two different lemmas for one passage.

UML Database Graph
  anfadr  |  endadr  |              lemma               |  hs  | labez |              lesart
----------+----------+----------------------------------+------+-------+----------------------------------
 50109012 | 50109020 | επηρθη και νεφελη υπελαβεν αυτον | A    | a     | επηρθη και νεφελη υπελαβεν αυτον
 50109012 | 50109020 | επηρθη και νεφελη υπελαβεν αυτων | 05C1 | b     | νεφελη υπεβαλεν αυτον και επηρθη
 50109012 | 50109020 | επηρθη και νεφελη υπελαβεν αυτων | 05*  | c     | νεφελη υπεβαλεν αυτον και απηρθη

Problem: Difficult Updates

Update anomaly: Hard to change lemma. Even harder to insert or delete word in lemma.

UML Database Graph

Update anomaly: Hard to merge manuscripts.

UML Database Graph

Problem: Insertion Anomalies

We have a manuscript name and a project internal manuscript number.

Problem: we can assign a manuscript number only if we also know a reading.

UML Database Graph

Solution: Split off Passages Table

By splitting off a passages table we have one lemma entry per passage.

UML Database Graph

Easy to change lemma. Moderately easy to insert or delete word in lemma.

Solution: Split off Manuscripts Table

By splitting off a manuscripts table we can declare manuscripts first and add readings later.

UML Database Graph

Solution: Views

Define views for often used queries.

UML Database Graph
CREATE VIEW my_view AS
SELECT p.anfadr, p.endadr,
       ms.hs, ms.hsnr,
       a.labez, a.lesart
FROM apparatus   a
JOIN passages    p  USING (pass_id)
JOIN manuscripts ms USING (ms_id);
SELECT * FROM my_view;

  anfadr  |  endadr  |   hs   |  hsnr  | labez |              lesart
----------+----------+--------+--------+-------+-----------------------------------
 51615036 | 51615044 | 1642   | 316420 | c     | εις τον οικον μου μεινατε
 51707026 | 51707030 | 307    | 303070 | a     | ετερον λεγοντες ειναι
 51713024 | 51713024 | 1735   | 317350 | a     | τη
 52224004 | 52224010 | 915    | 309150 | b     | αυτον ο χιλιαρχος εισαγεσθαι
 51412022 | 51412024 | 1251   | 312510 | a     | αυτος ην

Problem: Dependent Columns

Problem: need to update many columns if one column changes.

UML Database Graph
  anfadr  | buch | kapanf | versanf | wortanf
----------+------+--------+---------+---------
 51341049 |    5 |     13 |      41 |      49
 51527004 |    5 |     15 |      27 |       4
 50516028 |    5 |      5 |      16 |      28
 52324006 |    5 |     23 |      24 |       6
 50828012 |    5 |      8 |      28 |      12

Solution: Functions

Eliminate dependent columns by using a function to compute the column value.

UML Database Graph
CREATE FUNCTION adr2chapter (adr INTEGER) RETURNS INTEGER AS $$
  SELECT ((adr / 100000) %% 100)
$$;
SELECT anfadr, adr2chapter (anfadr) AS chapter;

  anfadr  | chapter
----------+---------
 51341049 |      13
 51527004 |      15
 50516028 |       5
 52324006 |      23
 50828012 |       8

Conclusion

UML Database Graph

Advantages of the normalized database structure:

  • many database operations are easier to implement,
  • there’s one obvious place to update every datum,
  • you cannot put the database into an inconsistent state by accident.