Full-text Search Overview

Full-text search is provided by the Meta Search Wordpress plugin and is implemented with Apache Solr, a full-text search engine based on Apache Lucene.

Full-text Extraction

We index three main categories of text in Solr:

The Latin text of the capitularies is extracted using mss-extract-chapters-txt.xsl, stored in the directory publ/cache/collation/, and imported into Solr using the command import_data.py --solr. These are the same texts that the collation tool uses. The German notes in the capitularies are extracted along with the Latin text.
The material in the teiHeader and the “editorial preface to the transcription” are indexed into Solr using the command import_solr.py --mss.
The Wordpress pages are extracted directly from the Wordpress database and indexed into Solr using the command import_solr.py --wordpress.

The command for updating the Solr database is: make solr-import.

That same command is run nightly by cron.

[Figure: Data flow during text extraction]

Metadata Extraction

We extract the metadata from the manuscript files and store it in the Postgres database on the Capitularia VM. The process is similar to the pre-processing done for the Collation Tool.

[Figure: Data flow during metadata extraction]

The Makefile is run by cron on the Capitularia VM at regular intervals.

The Makefile knows all the dependencies between the files and runs the appropriate tools to keep the database up-to-date with the manuscript files.

The intermediate file publ/cache/lists/corpus.xml contains all (useful) metadata from all manuscript files but no text.

The import_data.py script scans the corpus.xml file and imports all the metadata it finds into the database.
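As a rough illustration of this scanning step, the sketch below walks a corpus file and collects one metadata record per manuscript. The layout it assumes (a teiCorpus root holding TEI elements, each with an xml:id and a titleStmt title), the function name scan_corpus, and the record keys are assumptions made for the example; they are not the actual structure of corpus.xml or the real import_data.py code. The database insert is left out.

```python
import xml.etree.ElementTree as ET

# Namespaces in Clark notation, as ElementTree expects them.
TEI = "{http://www.tei-c.org/ns/1.0}"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def scan_corpus(xml_text: str) -> list[dict]:
    """Collect one metadata record per TEI element (hypothetical layout)."""
    root = ET.fromstring(xml_text)
    rows = []
    for tei in root.iter(f"{TEI}TEI"):
        title = tei.find(f".//{TEI}titleStmt/{TEI}title")
        rows.append({
            "ms_id": tei.get(XML_ID),  # assumed manuscript identifier
            "title": title.text if title is not None else None,
        })
    return rows


# Made-up sample data, only to show the shape of the result.
SAMPLE = """<teiCorpus xmlns="http://www.tei-c.org/ns/1.0">
  <TEI xml:id="ms-a"><teiHeader><fileDesc><titleStmt>
    <title>Manuscript A</title>
  </titleStmt></fileDesc></teiHeader></TEI>
  <TEI xml:id="ms-b"><teiHeader/></TEI>
</teiCorpus>"""

records = scan_corpus(SAMPLE)
```

Each record could then be written to the database with an ordinary parameterized INSERT; a manuscript without a title simply yields None for that field.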

Geodata Extraction

Geodata is stored in the file publ/mss/lists/capitularia_geo.xml. This file is periodically processed with import_data.py --geoplaces and its content is stored in the database. The “places” tree in the meta search dialog is also built from this data.

Search

The flow of a user’s search request is as follows:

The Meta Search applet in the browser sends the request to the Meta Search plugin on the web server.
The Wordpress plugin adds the user’s permissions (i.e. whether the user is logged in to Wordpress or not) and then sends the search query to the application server.
The application server queries the Solr server.
The Solr server does the actual search and returns the results as JSON.
The applet in the browser formats the JSON results and displays them to the user.
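The query step of this flow can be sketched as follows. This is a minimal illustration of how an application server might turn a search term and the user's login state into a Solr request, not the project's actual code. The function name build_solr_url, the base URL, and the status field with its publish value are hypothetical; q, fq, and wt are standard parameters of Solr's select handler.

```python
from urllib.parse import urlencode


def build_solr_url(base_url: str, query: str, logged_in: bool) -> str:
    """Build a Solr select URL, restricting anonymous users (assumed rule)."""
    params = [("q", query), ("wt", "json")]
    if not logged_in:
        # Hypothetical filter: anonymous users only see published documents.
        params.append(("fq", "status:publish"))
    return f"{base_url}/select?{urlencode(params)}"


# Hypothetical core URL, used only for illustration.
BASE = "http://localhost:8983/solr/capitularia"

anonymous_url = build_solr_url(BASE, "karolus magnus", logged_in=False)
member_url = build_solr_url(BASE, "karolus magnus", logged_in=True)
```

Putting the permission check into a filter query keeps it separate from the user's search terms, so it restricts the result set without affecting relevance scoring.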

Searches in the Latin texts of the manuscript bodies use stemming and trigram similarity. Exact matches get a boost, so they rank above trigram matches. To stem Latin, we wrote a custom Latin stemmer for Lucene.
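To make the idea of trigram similarity with an exact-match boost concrete, here is a small self-contained sketch. It is not how Solr scores documents internally: it uses Jaccard similarity over character trigrams and an additive boost, and the function names and the boost value are assumptions chosen for illustration.

```python
def trigrams(word: str) -> set[str]:
    """Return the character trigrams of a word, padded with spaces."""
    padded = f"  {word.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def similarity(a: str, b: str) -> float:
    """Jaccard similarity of two trigram sets, between 0.0 and 1.0."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


def rank(query: str, candidates: list[str], exact_boost: float = 2.0) -> list[str]:
    """Sort candidates by similarity; exact matches get an assumed boost."""
    def score(candidate: str) -> float:
        boost = exact_boost if candidate.lower() == query.lower() else 0.0
        return similarity(query, candidate) + boost
    return sorted(candidates, key=score, reverse=True)


ranked = rank("capitulare", ["capitularia", "episcopus", "capitulare"])
```

Here the exact form capitulare ranks first, the near match capitularia second, and the unrelated episcopus last, which mirrors the ordering described above.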

Searches in Mordek and Wordpress posts use more traditional search methods such as stemming.

[Figure: Components used in searching]

[Figure: Data flow while searching]