Full-text Search Overview

Full-text search is provided by the Meta Search Wordpress plugin, which is implemented on top of Apache Solr, a full-text search engine based on Apache Lucene.

Full-text Extraction

There are 3 main categories of text that we index in Solr.

  • The Latin text of the capitularies is extracted using mss-extract-chapters-txt.xsl, stored in the directory publ/cache/collation/, and imported into Solr with the command import_data.py --solr. These are the same texts used by the collation tool. The German notes in the capitularies are extracted along with the Latin text.

  • The material in the teiHeader and the “editorial preface to the transcription” are indexed into Solr using the command import_solr.py --mss.

  • The Wordpress pages are extracted directly from the Wordpress database and indexed into Solr using the command import_solr.py --wordpress.
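Whatever the source, each category ultimately ends up as JSON documents posted to Solr's update endpoint. The following sketch shows what such a request could look like; the core name, field names, and id scheme are assumptions for illustration and are not taken from the actual import scripts:

```python
import json
import urllib.request

# Assumed Solr core name; the real installation may use a different one.
SOLR_UPDATE_URL = "http://localhost:8983/solr/capitularia/update?commit=true"

def make_update_payload(docs):
    """Serialize a list of documents for Solr's JSON update endpoint."""
    return json.dumps(docs).encode("utf-8")

def index_documents(docs):
    """POST documents to Solr (requires a running Solr instance)."""
    req = urllib.request.Request(
        SOLR_UPDATE_URL,
        data=make_update_payload(docs),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Hypothetical document for one extracted chapter.
doc = {
    "id": "bk-nr-139_2",                       # assumed id scheme
    "category": "chapter",
    "text_la": "Si quis homicidium fecerit ...",
    "ms_id": "st-gallen-sb-733",
}
payload = make_update_payload([doc])
```

Posting the same payload again with the same `id` replaces the document, which is what makes the nightly re-import idempotent.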

The command for updating the Solr database is: make solr-import.

That same command is run nightly by cron.
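The exact schedule lives in the VM's crontab; a hypothetical entry producing such a nightly run could look like this (the time, working directory, and log path are assumptions):

```shell
# Hypothetical crontab entry: run the Solr import every night at 02:30.
30 2 * * *  cd /var/www/capitularia && make solr-import >> /var/log/solr-import.log 2>&1
```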

[Diagram: manuscript files (TEI, publ/mss/*.xml) → extracted chapters (TEI, publ/cache/collation/*.xml) → Solr via import_data.py --solr; manuscript files → Solr via import_solr.py --mss; Wordpress (on MariaDB server) → Solr via import_solr.py --wordpress]

Data flow during text extraction

Metadata Extraction

We extract the metadata from the manuscript files and store it in the Postgres database on the Capitularia VM. The process is similar to the pre-processing done for the Collation Tool.

[Diagram: manuscript files (TEI, publ/mss/*.xml) → corpus file (TEI, publ/cache/lists/corpus.xml) via saxon corpus.xml → database (Postgres) via import_data.py --mss]

Data flow during metadata extraction

The Makefile is run by cron on the Capitularia VM at regular intervals.

The Makefile knows all the dependencies between the files and runs the appropriate tools to keep the database up to date with the manuscript files.
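As an illustration only, the dependency chain described above could be expressed in Makefile terms roughly like this; the actual rules, stylesheet names, and recipes in the project Makefile differ:

```make
# Hypothetical sketch; paths follow the text above, recipes are assumptions.
MSS    := $(wildcard publ/mss/*.xml)
CORPUS := publ/cache/lists/corpus.xml

# Rebuild corpus.xml whenever any manuscript file changes.
$(CORPUS): $(MSS)
	# run saxon to regenerate corpus.xml (exact invocation assumed)
	saxon ... > $@

# Re-import the metadata once corpus.xml is newer than the last import.
mss-import: $(CORPUS)
	import_data.py --mss
```

Because make compares timestamps, an unchanged manuscript file triggers no work, which is what makes the frequent cron runs cheap.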

The intermediate file publ/cache/lists/corpus.xml contains all (useful) metadata from all manuscript files but no text.

The import_data.py script scans the corpus.xml file and imports all the metadata it finds into the database.
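A minimal sketch of such a scan using Python's standard library follows; the TEI structure shown and the field mapping are assumptions, and the database insert that the real script performs is omitted:

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

def extract_metadata(corpus_xml):
    """Yield one metadata dict per <TEI> element in a corpus file."""
    root = ET.fromstring(corpus_xml)
    for tei in root.iter("{http://www.tei-c.org/ns/1.0}TEI"):
        title = tei.find(".//tei:titleStmt/tei:title", TEI_NS)
        yield {
            "xml_id": tei.get("{http://www.w3.org/XML/1998/namespace}id"),
            "title": title.text if title is not None else None,
        }

# Tiny stand-in for publ/cache/lists/corpus.xml (structure is assumed).
SAMPLE = """\
<teiCorpus xmlns="http://www.tei-c.org/ns/1.0">
  <TEI xml:id="st-gallen-sb-733">
    <teiHeader><fileDesc><titleStmt>
      <title>St. Gallen, Stiftsbibliothek, 733</title>
    </titleStmt></fileDesc></teiHeader>
  </TEI>
</teiCorpus>
"""
records = list(extract_metadata(SAMPLE))
```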

Geodata Extraction

Geodata is stored in the file publ/mss/lists/capitularia_geo.xml. This file is periodically processed with import_data.py --geoplaces and its contents are stored in the database. The “places” tree in the meta search dialog is also built from this data.
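As a sketch of how such a file might be turned into the nested places tree, consider the following; the XML structure below is an assumption, and the real capitularia_geo.xml may be organized differently:

```python
import xml.etree.ElementTree as ET

NS = {"tei": "http://www.tei-c.org/ns/1.0"}

def place_tree(elem):
    """Recursively convert nested tei:place elements into a dict tree."""
    name = elem.find("tei:placeName", NS)
    return {
        "name": name.text if name is not None else None,
        "children": [place_tree(p) for p in elem.findall("tei:place", NS)],
    }

# Stand-in for publ/mss/lists/capitularia_geo.xml (structure assumed).
SAMPLE = """\
<listPlace xmlns="http://www.tei-c.org/ns/1.0">
  <place>
    <placeName>Francia</placeName>
    <place><placeName>Neustria</placeName></place>
    <place><placeName>Austrasia</placeName></place>
  </place>
</listPlace>
"""
root = ET.fromstring(SAMPLE)
tree = [place_tree(p) for p in root.findall("tei:place", NS)]
```

Nesting the tei:place elements directly mirrors the hierarchy the meta search dialog needs, so no extra grouping step is required after parsing.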