Full-text Search Overview
Full-text search is provided by the Meta Search Wordpress plugin. It is implemented with Apache Solr, a full-text search engine based on Apache Lucene.
Full-text Extraction
We index three main categories of text in Solr:
The Latin text of the capitularies is extracted using mss-extract-chapters-txt.xsl, stored in the directory publ/cache/collation/, and imported into Solr using the command import_data.py --solr. These are the same texts used by the collation tool. The German notes in the capitularies are extracted along with the Latin text.
The material in the teiHeader and the “editorial preface to the transcription” are indexed into Solr using the command import_solr.py --mss.
The Wordpress pages are extracted directly from the Wordpress database and indexed into Solr using the command import_solr.py --wordpress.
The command for updating the Solr database is: make solr-import.
That same command is run nightly by cron.
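As a rough illustration of what indexing a batch of documents into Solr involves, the sketch below builds a POST request for Solr's JSON update handler. It is not taken from import_solr.py: the core name capitularia, the server URL, and the field names id, category and text are all assumptions made for this example.

```python
import json
import urllib.request


def build_solr_update(solr_url, core, docs):
    """Build (but do not send) a POST request for Solr's JSON update handler."""
    url = f"{solr_url.rstrip('/')}/{core}/update?commit=true"
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical documents, one for each category of indexed text.
docs = [
    {"id": "ms-a/chapter-1", "category": "latin", "text": "..."},
    {"id": "ms-a/header", "category": "mss", "text": "..."},
    {"id": "wp/42", "category": "wordpress", "text": "..."},
]

request = build_solr_update("http://localhost:8983/solr", "capitularia", docs)
# urllib.request.urlopen(request) would send the batch to a running Solr server.
```

The request is only constructed here, so the sketch can be run without a Solr server.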
Metadata Extraction
We extract the metadata from the manuscript files and store them in the Postgres database on the Capitularia VM. The process is similar to the pre-processing done for the Collation Tool.
The Makefile is run by cron on the Capitularia VM at regular intervals. The Makefile knows all the dependencies between the files and runs the appropriate tools to keep the database up-to-date with the manuscript files.
The intermediate file publ/cache/lists/corpus.xml contains all (useful) metadata from all manuscript files but no text.
The import_data.py script scans the corpus.xml file and imports all the metadata it finds into the database.
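The scan-and-import step can be pictured with the following minimal sketch. It is not the actual import_data.py: the XML layout (manuscript, title and date elements), the table name, and the use of an in-memory SQLite database in place of Postgres are all assumptions chosen to keep the example self-contained.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical stand-in for corpus.xml: metadata only, no text.
SAMPLE_CORPUS = """\
<corpus>
  <manuscript id="ms-a"><title>Codex A</title><date>825</date></manuscript>
  <manuscript id="ms-b"><title>Codex B</title><date>850</date></manuscript>
</corpus>
"""


def import_metadata(xml_text, conn):
    """Insert or replace one row per <manuscript>; return the number of rows read."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS manuscripts "
        "(ms_id TEXT PRIMARY KEY, title TEXT, date TEXT)"
    )
    root = ET.fromstring(xml_text)
    rows = [
        (ms.get("id"), ms.findtext("title"), ms.findtext("date"))
        for ms in root.iter("manuscript")
    ]
    conn.executemany(
        "INSERT OR REPLACE INTO manuscripts (ms_id, title, date) VALUES (?, ?, ?)",
        rows,
    )
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")  # SQLite stands in for the Postgres database
count = import_metadata(SAMPLE_CORPUS, conn)
titles = [row[0] for row in conn.execute("SELECT title FROM manuscripts ORDER BY ms_id")]
```

Because rows are replaced by primary key, running the import again after the manuscript files change updates existing entries instead of duplicating them.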
Geodata Extraction
Geodata is stored in the file publ/mss/lists/capitularia_geo.xml. This file is periodically processed with import_data.py --geoplaces and its content is stored in the database. The “places” tree in the meta search dialog is also built using this data.
Search
The flow of a user’s search request is as follows:
The Meta Search applet in the browser sends the request to the Meta Search plugin on the web server.
The Wordpress plugin adds the user’s permissions (i.e. whether she is logged in to Wordpress or not) and then sends the search query to the application server.
The application server queries the Solr server.
The Solr server does the actual search and returns the result as JSON.
The applet in the browser formats the JSON results and displays them to the user.
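The middle steps of this flow can be sketched as two small functions: one that attaches the login status to the request, and one that turns the request into Solr query parameters. The function names, the request dictionary, and the status:publish filter are hypothetical. How the real plugin and application server use the permissions is not described here, so the filter only illustrates one plausible use.

```python
from urllib.parse import parse_qs, urlencode


def add_permissions(request, logged_in):
    """Plugin step: annotate the browser's request with the user's login status."""
    return {**request, "logged_in": bool(logged_in)}


def build_solr_query(request):
    """Application server step: turn the request into a Solr query string."""
    params = [("q", request["query"]), ("wt", "json")]
    if not request["logged_in"]:
        # Hypothetical filter field: anonymous users see published documents only.
        params.append(("fq", "status:publish"))
    return urlencode(params)


browser_request = {"query": "capitulare missorum"}

anonymous = parse_qs(build_solr_query(add_permissions(browser_request, False)))
member = parse_qs(build_solr_query(add_permissions(browser_request, True)))
```

The parameters q, fq and wt are standard Solr query parameters. Everything else in the sketch is illustrative.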
Searches in the Latin texts of the manuscript bodies are done by stemming and trigram similarity. Exact results get a boost, so they show up before trigram results. To stem Latin we wrote a custom Latin stemmer for Lucene.
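To show how trigram similarity with a boost for exact matches can order results, here is a minimal sketch in plain Python. It does not reproduce Solr's scoring or the custom Latin stemmer. The padding scheme, the Jaccard-style similarity, the boost factor and the threshold are all assumptions chosen for the example.

```python
def trigrams(word):
    """Return the set of character trigrams of a padded, lower-cased word."""
    padded = f"  {word.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def similarity(a, b):
    """Shared trigrams divided by all distinct trigrams of both words."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


def rank(query, candidates, exact_boost=2.0, threshold=0.3):
    """Order candidates by similarity to the query, boosting exact matches."""
    scored = []
    for word in candidates:
        score = similarity(query, word)
        if word.lower() == query.lower():
            score *= exact_boost  # exact hits sort before fuzzy ones
        if score >= threshold:
            scored.append((score, word))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [word for _, word in scored]


results = rank("episcopus", ["episcopis", "comes", "episcopus"])
```

In this sketch the exact form episcopus is ranked first, the inflected form episcopis follows as a trigram match, and comes is dropped because it shares no trigrams with the query.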
Searches in Mordek and Wordpress posts use more traditional search methods like stemming.