
Virtual Language Observatory (VLO)

Versions and deployment plan

Version | In testing | In beta | In production | Milestone | Changes
4.9.0 (in development) 2020-05-28 https://github.com/clarin-eric/VLO/milestone/8
...
4.5.1 2018-09-20 maintenance CHANGES
4.5.0 2018-07-13 milestone CHANGES
4.4.1 2018-05-01 2018-05-01 maintenance CHANGES
4.4.0 2018-04-30 milestone CHANGES
...
4.3.2 2018-01-08 maintenance CHANGES
4.3.1 2017-12-?? maintenance CHANGES
4.3.0 2017-11-10 2017-12-?? milestone CHANGES
4.2.1 2017-08-09 milestone CHANGES
4.2.0 2017-07-12 milestone CHANGES
4.1.0 2017-02-06 2017-04-06 milestone CHANGES
4.0.2 (current) 2016-12-16 2016-12-21 milestone CHANGES
4.0.1 2016-08-04 maintenance CHANGES
4.0.0 2016-07-07 milestone CHANGES
3.4.1 2016-04-18 maintenance CHANGES
3.4.0 2016-02-19 2016-03-15 milestone CHANGES
3.3 2015-06-01 2015-08-21 (3.3), 2015-10-29 (3.3.2) 2015-09-30 milestone CHANGES
3.2 2015-05-27 2015-06-01 2015-07-08 milestone CHANGES
3.1 2015-03-11 2015-03-11 2015-03-24 milestone CHANGES
3.0.1 2014-06-16 2014-06-17 2014-06-17 milestone CHANGES
3.0 2014-05-15 2014-05-13 2014-05-20 milestone CHANGES
2.18 2014-01-17 2014-02-06 milestone CHANGES
2.17 2013-11-12 2013-11-15 milestone 2.17
2.16 2013-10-24 2013-11-04 milestone CHANGES
2.15 rolled back to 2.14 (memory problems) rolled back to 2.14 (memory problems) CHANGES
2.14 May 2013 June 2014 CHANGES

Usage

  • End users will browse to the web application (e.g. https://vlo.clarin.eu)
  • The import can be started from the command line or scheduled via cron. For instructions, see README
    • If need be, the Solr index can be flushed by removing the Solr data directory's content (for the exact location, see table below)
  • The Solr interface will typically only be available locally, for usage by the web app and the importer
  • Docker images will be available for:
    • The VLO web app
    • The VLO importer
    • The Solr back end (this can be a configurable general-purpose Solr image)
  • A docker compose configuration will be available to combine the above into a ready-to-use solution

Technical notes

Deployment/configuration

Upgrade instructions are included in the UPGRADE.txt file.

VLO is deployed to vlo.clarin.eu for production and beta-vlo.clarin.eu for beta. A Tomcat server hosts the web application, and a stand-alone Solr back end is configured to index the VLO data.

Dockerised setup

Ideally, the application is run and managed via the VLO docker compose configuration. It ties together two images: CLARIN-ERIC/docker-vlo and CLARIN-ERIC/docker-solr. The former provides both the web app and the importer, and connects to the latter to read from and write to the index. For the Solr container to work in this setup, it must be provisioned with the right configuration. In the docker compose configuration, this is done by attaching a volume to the Solr container that is initialised with the right Solr home content by a short-lived instance of the VLO image. One could also use a host mount for this purpose. Furthermore, the project introduces an optional nginx proxy that provides access to the web app and various static content, and takes care of compression and caching. For details, see the documentation of the VLO docker compose projects and the two docker image projects.

Non-dockerised setup

The application bundle, which includes the importer, the web app, application configuration and Solr configuration in the form of a prepared Solr home directory, is to be deployed to directories under /srv/webapps/vlo, with a persistent symlink to the current version. The symlink is used in the Tomcat configurations, so that these do not need to change when the VLO version is updated, apart from version-specific configuration requirements (as documented in the upgrade instructions). A Solr instance can be installed and configured according to the instructions in the documentation of the vlo_solr subproject.
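
The symlink-based layout can be sketched in shell as follows. This is a minimal illustration, not the project's own deployment script: the per-version directory name (vlo-VERSION) and the symlink name (current) are assumptions, and the base directory is passed in as an argument so the function can be tried out in a scratch directory before pointing it at /srv/webapps/vlo.

```shell
#!/bin/sh
# Hypothetical helper: point BASE_DIR/current at BASE_DIR/vlo-VERSION.
# The names "vlo-VERSION" and "current" are illustrative assumptions.
deploy_version() {
    base=$1
    version=$2
    target="$base/vlo-$version"

    # Refuse to switch to a version that has not been unpacked yet.
    if [ ! -d "$target" ]; then
        echo "missing directory: $target" >&2
        return 1
    fi

    # -s: symbolic link, -f: replace an existing link,
    # -n: do not follow an existing "current" link into its directory.
    ln -sfn "vlo-$version" "$base/current"
}
```

Calling `deploy_version /srv/webapps/vlo 4.9.0` would then repoint the symlink, leaving any Tomcat configuration that refers to the symlink untouched.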

Configuration

Most of the configuration takes place in the VloConfig.xml file (the template, with Maven properties as placeholders, can be found in the sources). Its format is read by both the importer and the web app and has options for both, so a single file can be shared. In addition, the web app has a set of context parameters to configure the location of the VloConfig.xml that should be used, and to optionally override the Solr base URL. The configuration file typically references (via XLink) an external list of data roots that are environment specific.

See the package's deployment instructions (DEPLOYMENT.md) for details.

Deployment environments

Environment Test Beta Production
Server rs236235.rs.hosteurope.de beta-vlo-clarin.esc.rzg.mpg.de rs238144.rs.hosteurope.de
URL http://alpha-vlo.clarin.eu http://beta-vlo.clarin.eu https://vlo.clarin.eu
Dockerised Yes Yes Yes
Locations
Docker compose dir /home/twagoo/docker/compose_vlo /home/deploy/vlo/compose_vlo /home/deploy/vlo/compose_vlo
Config
deleteAllFirst false false false
maxDaysInSolr 7 7 7

Solr configuration

A matching Solr configuration is provided with each release. Solr should be configured to use the appropriate directory within the expanded distribution package as its "Solr home" location; this is described in detail in the deployment instructions.

A configuration or schema change can be applied to a running instance without having to restart the Solr server. This can be useful, for example, in the exceptional case that a 'hotfix' has to be applied to production as soon as possible. The way to do this is by reloading the core, in our case the 'vlo-index' one. On the host, make the following request (assuming that the Solr server is listening on port 8983):

curl "http://localhost:8983/solr/admin/cores?action=RELOAD&core=vlo-index"

This should not affect active requests and therefore end users should not notice any effect other than the new configuration being applied for requests triggered after a successful reloading of the core. More information can be found in the Solr documentation.
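
For repeated use, the reload request can be wrapped in a small script. The sketch below is only an illustration: the function names and the SOLR_BASE_URL variable are made up for this example, and it assumes that Solr answers a failed reload with an HTTP error status, which curl's --fail option turns into a non-zero exit code. Besides RELOAD it also uses the CoreAdmin STATUS action to inspect the core afterwards.

```shell
#!/bin/sh
# Assumed default; override SOLR_BASE_URL if Solr listens elsewhere.
SOLR_BASE_URL="${SOLR_BASE_URL:-http://localhost:8983/solr}"

# Build a CoreAdmin URL for the given action and core name.
core_admin_url() {
    printf '%s/admin/cores?action=%s&core=%s' "$SOLR_BASE_URL" "$1" "$2"
}

# Ask Solr to reload the named core; returns non-zero on HTTP errors.
reload_core() {
    curl --silent --show-error --fail "$(core_admin_url RELOAD "$1")"
}

# Print the status report of the named core.
core_status() {
    curl --silent --show-error --fail "$(core_admin_url STATUS "$1")"
}
```

After sourcing the script, `reload_core vlo-index && core_status vlo-index` reloads the core and prints its status only if the reload request succeeded.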

Logs

  • Import logs
    • Kibana: production/beta/testing
    • On disk: at ${VLO_BASE}/log/vlo-importer.log
      • Docker: inside the vlo_vlo-web_1 container
        • $VLO_BASE is available; by default, VLO_BASE=/opt/vlo
  • VLO web app
    • Kibana: production (change clarin_host for other instances)
    • On disk: at ${CATALINA_HOME}/logs/vlo.log
      • Docker: inside the vlo_vlo-web_1 container
        • $CATALINA_HOME is available; by default, CATALINA_HOME=/srv/tomcat8
  • Harvesting/import pipeline
    • On our production/beta/testing instances, there are logs in /var/log/vlo
    • This includes link checker db update, triggering of harvest and harvest viewer indexation, VLO import and VLO post-import steps (sitemap, stats)
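
In the dockerised setup, the on-disk logs listed above can be read from the host with docker exec. The following is a rough sketch under a few assumptions: the function names are hypothetical, the container is assumed to provide sh and tail, and the defaults for VLO_BASE and CATALINA_HOME are the ones given in the list above. The path expression is deliberately left unexpanded on the host so that the variables are resolved inside the container.

```shell
#!/bin/sh
# Container name as listed above; override CONTAINER for other setups.
CONTAINER="${CONTAINER:-vlo_vlo-web_1}"

# Print the (unexpanded) in-container path expression for a log.
log_path_expr() {
    case "$1" in
        importer) printf '%s' '${VLO_BASE:-/opt/vlo}/log/vlo-importer.log' ;;
        webapp)   printf '%s' '${CATALINA_HOME:-/srv/tomcat8}/logs/vlo.log' ;;
        *)        echo "unknown log: $1" >&2; return 1 ;;
    esac
}

# Show the last N lines (default 100) of the chosen log.
show_log() {
    path_expr=$(log_path_expr "$1") || return 1
    # The variable in $path_expr is expanded by the shell inside the container.
    docker exec "$CONTAINER" sh -c "tail -n ${2:-100} \"$path_expr\""
}
```

For example, `show_log importer 50` prints the last 50 lines of the import log, and `show_log webapp` prints the last 100 lines of the web app log.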

Data sources and mapping

Data sources: see CmdiDataSources

Mapping CMDI > VLO fields and facets

Value mapping

Normalisation/harmonisation and other (cross-facet) value mapping is based on value mapping definitions that are pulled from GitHub. See the VLO-mapping project. ACDH in Vienna has a fork for maintenance and development of these maps.

Development

The VLO's sources are on GitHub! Additional information can be found in the README file.

Architecture

  • Solr server holds the document index; Solrj client is used in the importer and web app
  • Importer processes (parsing + post-processing) CMDI documents retrieved by the OAI harvester and creates the Solr indexes (facets + full text search)
  • Wicket + Spring for the front-end
    • Pages and custom components and models for wicket
    • Spring wiring through eu.clarin.cmdi.vlo.config.VloSpringConfig

Maven project structure (all with eu.clarin.cmdi groupId):

  • vlo reactor project with some shared properties (library versions) and dependencies (logging, unit testing); it has the following child modules:
    • vlo-commons has some common classes for configuration (VloConfig and factories), constants and utility methods;
    • vlo-vocabularies has vocabularies and maps for value curation/normalisation/harmonisation
    • vlo-solr provides a prepared Solr home directory for a Solr server, and a script to build and run Solr;
    • vlo-importer builds as a runnable importer that reads CMDI instances from configured locations and creates an index into a deployed Solr instance;
    • vlo-web-app WAR project for the VLO front-end that uses the facets and fields in the populated Solr index;
    • vlo-distribution an assembler project that takes the output of vlo-solr, vlo-importer and vlo-web-app and combines these into a single distribution package that can be used for deployment on an arbitrary server.
    • vlo-sitemap a stand-alone tool for creating a sitemap with all records and static pages in the VLO
  • vlo-statistics a stand-alone tool for sending statistics to a statsd server

Configuration

VloConfig (in vlo-commons) is a POJO (only getters and setters, no reading or writing logic). There are factories for different ways of creating the configuration.

For the web app, the ServletVloConfigFactory is used. It reads the configuration from file but also processes context parameters, which can override the parameters in the configuration file.

Some options configurable through VloConfig.xml:

  • locations of metadata to be imported
  • various solr parameters
  • parameters that control the size of the wicket cache
  • the fields that need to be hidden as technical fields or ignored altogether
  • the facets that need to be shown in the search interface
  • the name of the collection facet (or leave out to remove separate collection facet)
  • base URLs for federated content search
  • language code URLs
  • ...

Localisation

Localisation uses Wicket's i18n support; see the Wicket Guide.

  • Page or component specific resource strings in .properties files in eu.clarin.cmdi.vlo.wicket.pages and eu.clarin.cmdi.vlo.wicket.panels.*
  • Field names in /src/main/resources/fieldNames.properties
  • Resource types in /src/main/resources/resourceTypes.properties

Release and distribution

The project vlo-distribution is set up to produce a distribution package for easy deployment on build. It depends on the other projects in the VLO reactor (parent) project.

Instructions on how to prepare for release and distribution are available in the README file (e.g. README in trunk).

Testing

Test VM

Test versions of the VLO can be deployed to the following server: alpha-vlo-clarin.esc.rzg.mpg.de. It is publicly accessible. To access it over SSH, get an account for con01.rzg.mpg.de (contact Florian Kaiser) or ask the developer or system administrators for help. From this host, an SSH connection to root@alpha-vlo-clarin.esc.rzg.mpg.de can be made using a key pair, without the need for a password. The host has Apache and Tomcat configured, with reverse proxies in Apache that make Tomcat accessible from outside.

End user testing

A test plan for end-user testing of the VLO front-end has been implemented at MPI. Currently, the MPI no longer does testing for CLARIN. A new test plan is under development (see SoftwareTesting).

Design

Page / components hierarchy

The following diagram presents the composition of the pages of the VLO, showing the pages and their most important components (hence it is not a complete picture). It represents the structure of the web application at version 3.0.0, which forms the basis for later versions up to and including the 4.x releases.

  • Where not shown explicitly, the cardinality of the composition relations is 1:1
  • The red lines represent possible navigation routes. The SimpleSearchPage is the default entry page

Pages/components hierarchy class diagram

Models lifecycle

Selection models

Flow of the query and facet selection and derivatives (based on VLO 3.0.0):

Document and field models

Flow of the Solr document model, Solr field model, and derivatives (based on VLO 3.0.0):

See also

Tickets

See VLO issues on GitHub.

Old tickets (Trac)

Meetings and discussion

See the meetings page.

Misc (old)

Background

  • In general: see this paper for an overview about how the VLO works

language codes

Demo scenario

Google Earth > Language Information > Hoava > Hoava data > lexicon > nice scanned word lists

Multimodal Corpus: VLO: search 'SVC', select 'SVC' from the result list and look at Corpus information and Metadata, maybe click on the first link to the landing page in the BAS repo, go back to the result list (Back button), select 'i003' from the result list, scroll down in the link list to the second movie link (type mpg), click on it and authenticate via AAI; the movie plays.

Corpus of printed historical texts (Deutsches Textarchiv): Country: Germany > Collection: Deutsches Textarchiv > Genre: Gebrauchsliteratur (= functional literature) > name: Davidis, Henriette: Praktisches Kochbuch für die gewöhnliche und feinere Küche. 4. Aufl. Bielefeld, 1849. > Follow link to Landing Page

Inspiration sources

OLAC facet browser: http://dla.library.upenn.edu/dla/olac/index.html

Last modified on 06/18/20 07:29:27