Collections

We want to find collections of capitularies, currently very vaguely defined as capitularies that are often copied together.

The python script cluster.py reads the Wordpress database and writes a Gephi graph file. The graph is then imported into Gephi and laid out using a force field algorithm. The resulting plot is used to visually identify potential collections of capitularies.

Algorithm

Description of the algorithm used by the cluster.py script.

We define \(K\) as the number of capitularies and \(D\) as the number of documents.

The number of occurrences of capitulary \(k\) in document \(d\) is referred to as term frequency and is denoted:

\[\mbox{tf}_{k,d}\]

The term frequency of capitularies is either 0 (if not contained in the document) or 1 (if contained in the document). Technically, a document may contain more than one copy of the same capitulary, but we ignore that for our calculations.

The number of documents that include the capitulary \(k\) is referred to as its document frequency and is denoted:

\[\mbox{df}_k\]

The inverse document frequency of capitulary \(k\) is defined as:

\[\mbox{idf}_k = \log { D \over \mbox{df}_k }.\]

We assign a weight to each pair of capitulary and document \(k \times d\), given by:

\[\mbox{tf-idf}_{k,d} = \mbox{tf}_{k,d} \times \mbox{idf}_k\]

We define the document vector \(\vec{V}(d)\) as:

\[\begin{pmatrix} \mbox{tf-idf}_{1,d} & \mbox{tf-idf}_{2,d} & \dots & \mbox{tf-idf}_{K,d} \end{pmatrix}\]

the vector of the weights of all capitularies relative to the document \(d\).

The Euclidean length \(\vert\vec{V}(d)\vert\) of a document vector \(\vec{V}(d)\) is defined as:

\[\sqrt{\sum_{i=1}^K\vec{V}_i^2(d)}\]

Because the term frequency can only be 0 or 1, this is simply the square root of the number of capitularies in the document.

The cosine similarity of two documents \(d_1\) and \(d_2\), which are here represented by their document vectors \(\vec{V}(d_1)\) and \(\vec{V}(d_2)\), is now calculated as:

\[\mbox{sim}(d_1,d_2)= \frac{\vec{V}(d_1)\cdot \vec{V}(d_2)}{\vert\vec{V}(d_1)\vert \vert\vec{V}(d_2)\vert}\]

where \(\cdot\) represents the dot product.

The cosine similarities of all pairs of documents are entered into a similarity (affinity) matrix. This matrix is used as the input to the graph layout software.

The document affinity graph (Gephi Force Atlas).

This algorithm is also used to get the similarity between capitularies, instead of documents, by switching capitularies with documents.

The capitulary affinity graph (Gephi Force Atlas).