
Meta Search

Description of the meta search widget.

Metadata Extraction

We extract the metadata from the manuscript files and store them in the Postgres database on the Capitularia VM. The process is similar to the pre-processing done for the Collation Tool.

[Diagram: cron runs the Makefile on the Capitularia VM; corpus.xsl aggregates the manuscript files (publ/mss/*.xml, XML+TEI) into corpus.xml in publ/cache/lists/; import.py loads the metadata into the Postgres database.]

Data flow during metadata extraction

The Makefile is run by cron on the Capitularia VM at regular intervals.

The Makefile knows all the dependencies between the files and runs the appropriate tools to keep the database up-to-date with the manuscript files.

The intermediate corpus.xml file contains all (useful) metadata from all manuscript files, but no text. The corpus.xml file can be found in the cache/lists directory.

The import.py script scans the corpus.xml file and imports all the metadata it finds into the database.
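As an illustration, here is a minimal sketch of the kind of scan import.py performs on corpus.xml. The toy corpus, the xml:ids, and the fields extracted are invented for this example and do not reflect the actual Capitularia TEI schema; the real script writes to Postgres rather than returning rows.

```python
# Hypothetical sketch: scan a TEI corpus and collect one metadata row
# per <TEI> element.  Element names and ids are illustrative only.
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"
XMLNS = "{http://www.w3.org/XML/1998/namespace}"

def extract_metadata(corpus_xml: str):
    """Yield one (xml:id, title) row per <TEI> element in the corpus."""
    root = ET.fromstring(corpus_xml)
    for tei in root.iter(f"{TEI}TEI"):
        ms_id = tei.get(f"{XMLNS}id")
        title = tei.findtext(f".//{TEI}title") or ""
        yield ms_id, title.strip()

# A toy two-manuscript corpus, for illustration only.
corpus = """\
<teiCorpus xmlns="http://www.tei-c.org/ns/1.0">
  <TEI xml:id="paris-bn-lat-4613"><teiHeader><fileDesc><titleStmt>
    <title>Paris, BN lat. 4613</title>
  </titleStmt></fileDesc></teiHeader></TEI>
  <TEI xml:id="vatikan-reg-lat-263"><teiHeader><fileDesc><titleStmt>
    <title>Vatikan, Reg. lat. 263</title>
  </titleStmt></fileDesc></teiHeader></TEI>
</teiCorpus>
"""

rows = list(extract_metadata(corpus))
print(rows)
```

In the real pipeline each row would become an INSERT (or upsert) into the capitularia schema instead of being printed.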

Fulltext Extraction

The TEI files are already pre-processed as described under the Collation Tool, and the plain text of every chapter is stored in the database.

Geodata Extraction

Geodata is stored in the file publ/mss/lists/capitularia_geo.xml. This file is periodically processed with import --geoplaces and its contents are stored in the database. The “places” tree in the meta search dialog is also built from this data.
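The following sketch illustrates how such a “places” tree could be assembled from flat place records. The (id, parent, name) tuples stand in for the entries of capitularia_geo.xml; the real file's schema and place names may differ, so treat this purely as an illustration of the tree-building step.

```python
# Hypothetical flat place records: (id, parent id, display name).
# These stand in for capitularia_geo.xml entries; names are invented.
places = [
    ("europe",  None,      "Europe"),
    ("francia", "europe",  "Francia"),
    ("paris",   "francia", "Paris"),
    ("italia",  "europe",  "Italia"),
]

def build_tree(records):
    """Turn flat (id, parent, name) records into a nested tree."""
    nodes = {pid: {"name": name, "children": []} for pid, _, name in records}
    roots = []
    for pid, parent, _ in records:
        # Attach each node to its parent, or to the root list.
        (nodes[parent]["children"] if parent else roots).append(nodes[pid])
    return roots

tree = build_tree(places)
print(tree[0]["name"], [c["name"] for c in tree[0]["children"]])
```

The meta search dialog would then render such a structure as a collapsible tree of places.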

Search

To get good full text search results in the absence of a standardized orthography, all full text search is done by similarity: the plain text of the chapters is split into trigrams and the trigrams are indexed.
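A minimal sketch of trigram similarity, in the style of the Postgres pg_trgm module (which pads each word and scores by shared trigrams over total distinct trigrams). The padding convention and the sample words are assumptions for illustration; the actual search runs inside Postgres on indexed trigrams, not in Python.

```python
def trigrams(word: str) -> set:
    """Split a word into 3-grams, with pg_trgm-style space padding."""
    w = "  " + word.lower() + " "
    return {w[i:i + 3] for i in range(len(w) - 2)}

def similarity(a: str, b: str) -> float:
    """Share of trigrams two words have in common (0.0 .. 1.0)."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)

# Medieval spellings of the same name still score as similar:
print(round(similarity("karolus", "carolus"), 2))
```

This is why a query for “karolus” also finds chapters spelling the name “carolus”: the two forms share most of their trigrams, so their similarity score is well above zero even though the strings are not equal.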

The Meta Search Wordpress plugin sends the search query to the application server, which does the actual search. The app server finds the matching chapters and the ids of their manuscripts, but it does not know which Wordpress pages contain the text of those manuscripts. The plugin looks up those pages in the Wordpress database using the manuscript ids and sends the search results page to the user.
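The two-step lookup described above can be sketched as follows. The dictionaries stand in for the Postgres full text index and the Wordpress page table, and all ids, texts, and URLs are invented; the real plugin and app server of course talk to their databases over SQL and HTTP.

```python
# Stand-ins for the two databases; contents are invented examples.
fulltext_index = {"paris-bn-lat-4613": "karolus magnus rex francorum"}
wp_pages = {"paris-bn-lat-4613": "/mss/paris-bn-lat-4613/"}

def app_server_search(query, index):
    """Step 1 (app server): return the ids of matching manuscripts."""
    return [ms_id for ms_id, text in index.items() if query in text]

def plugin_resolve(ms_ids, pages):
    """Step 2 (plugin): map manuscript ids to Wordpress page URLs."""
    return [pages[ms_id] for ms_id in ms_ids if ms_id in pages]

hits = plugin_resolve(app_server_search("karolus", fulltext_index), wp_pages)
print(hits)
```

The split keeps the app server independent of Wordpress: it only ever deals in manuscript ids, while the plugin owns the mapping from ids to pages.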

[Diagram: the Frontend (Javascript) talks to the Plugin (Wordpress), which queries the API Server (Python) on the VM, which in turn queries the Postgres database.]

Data flow meta search


© Copyright 2018-22 CCeH - Licensed under the GNU GPL v3 or later.
