wiki:VLO/CMDI data workflow framework

Version 16 (modified by go.sugimoto@oeaw.ac.at, 9 years ago)

Note

The same content is also available as a Google Doc: https://docs.google.com/document/d/1YRbD7URQ9FRk3qGQE54QHO3wOAMTRnCPopcV4moW9iM/edit?usp=sharing Some images are omitted here because they are too big to fit. Readers are encouraged to use the Google Doc for the most up-to-date and comprehensive information.

Introduction

This document outlines the current state of the VLO data workflow and suggests how to optimise it within the central services. It can also be seen as a more detailed and comprehensive view of the existing diagram https://trac.clarin.eu/attachment/wiki/MDQAS/MDflow.png. The main idea is to introduce a Dashboard that connects to the different modules developed in CLARIN. In particular, these include the OAI harvester and the curation module, two of the core modules needed to realise the Dashboard.

Current VLO data workflow

(Figure 1. Current VLO data workflow (may not be 100% accurate))
Figure 1 illustrates the current state of the data workflow. (It may not be 100% accurate and will be revised later if needed.) The workflow is well established, but not optimised. A data provider may use a metadata authoring tool hosted at one of the CLARIN centres. Typical examples are ARBIL in the Netherlands, COMEDI developed in Norway, and the DSpace submission form developed in the Czech Republic. Such a tool provides an easy-to-use web GUI in which a CMDI profile can be imported or generated and metadata records can be created. When a profile is generated, the degree of automation in the interaction with the Component Registry may vary. For example, when COMEDI users need to create a brand-new profile, they first have to register it in the Component Registry in order to import it into COMEDI.
Each of the tools is more or less tightly integrated with the underlying repository, where the metadata is stored together with the actual resources in one digital object. The metadata is exposed via an OAI-PMH endpoint, from which it is fetched by the VLO harvester on a regular basis. However, while the authoring tools try to provide local control over metadata quality (offering custom auto-complete functionality and various consistency checks), VLO data ingestion lacks a common, formal and rigorous mechanism for controlling metadata quality, and the VLO team is struggling to cope with this. These applications have only a limited ability to synchronise and interoperate with four further CLARIN services, namely Centre Registry, Component Registry, CLAVAS, and CCR. In particular, CLAVAS is not used as the authoritative source of controlled vocabularies. There is also almost no feedback from the VLO team (automatic or manual) after data ingestion, so data providers have to invest considerable effort in improving metadata quality through individual consultation.
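The harvesting step described above can be illustrated with a minimal Python sketch. It only relies on the OAI-PMH protocol itself (the ListRecords verb, the metadataPrefix and resumptionToken arguments, and the OAI-PMH 2.0 XML namespace). The function names, the example base URL and the use of "cmdi" as the metadata prefix are assumptions made for illustration; this is not the code of the actual VLO harvester.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# XML namespace defined by the OAI-PMH 2.0 specification.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_records_url(base_url, metadata_prefix="cmdi", resumption_token=None):
    """Build a ListRecords request URL.

    The default prefix "cmdi" is an assumption; an endpoint announces the
    prefixes it really supports via the ListMetadataFormats verb.
    """
    if resumption_token:
        # In OAI-PMH, resumptionToken is an exclusive argument:
        # it must not be combined with metadataPrefix.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def parse_list_records(xml_text):
    """Return (record identifiers, resumption token or None) from a response."""
    root = ET.fromstring(xml_text)
    identifiers = [
        el.text for el in root.findall(".//oai:header/oai:identifier", OAI_NS)
    ]
    token_el = root.find(".//oai:resumptionToken", OAI_NS)
    # An empty resumptionToken element marks the end of the list.
    token = token_el.text if token_el is not None and token_el.text else None
    return identifiers, token
```

A harvester would request list_records_url(base_url), parse the response, and repeat the request with the returned token until no token is left.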
OLAC and CMDI are the two formats that can be imported into the VLO environment; the former is converted to CMDI by a predefined mapping. Once the CMDI is ready, it is ingested into the Solr/Lucene index. The ingestion is governed by a set of configuration files: facetConcepts.xml, which maps elements to facets (via concepts), and a set of text files that define the normalisation of values. These files are the essence of the CMDI-VLO facet mapping and are, in principle, edited manually by the VLO curators. The processed data is indexed and published seamlessly on the VLO website, where end users can browse and search it. The VLO curators also have difficulty controlling data quality, because they have to manually edit the raw files (XML, CSV or similar) for concept mapping, value mapping and normalisation, in conjunction with the external CLARIN services. They also need to examine the outcome on the public website to check data integrity.
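The two-stage mapping described above (element to concept to facet, followed by value normalisation) can be sketched in Python as follows. This is a deliberately simplified illustration: plain dictionaries stand in for facetConcepts.xml and the normalisation text files, whose real formats are not reproduced here, and every element path, concept identifier and value variant below is hypothetical.

```python
# Hypothetical stand-ins for the VLO configuration files.
# None of these paths, concept ids or variants come from the real files.
ELEMENT_TO_CONCEPT = {
    "/CMD/Components/Example/Language": "concept:languageName",
    "/CMD/Components/Other/Lang": "concept:languageName",
}
CONCEPT_TO_FACET = {"concept:languageName": "language"}
VALUE_NORMALISATION = {
    "language": {"english": "English", "eng": "English", "en": "English"},
}


def to_facets(record):
    """Map {element path: raw value} to {facet: [normalised values]}."""
    facets = {}
    for path, raw in record.items():
        # Stage 1: element -> concept -> facet.
        concept = ELEMENT_TO_CONCEPT.get(path)
        facet = CONCEPT_TO_FACET.get(concept)
        if facet is None:
            continue  # elements without a mapped concept are not indexed
        # Stage 2: normalise the value; unknown variants pass through trimmed.
        table = VALUE_NORMALISATION.get(facet, {})
        value = table.get(raw.strip().lower(), raw.strip())
        values = facets.setdefault(facet, [])
        if value not in values:
            values.append(value)
    return facets
```

In this sketch, two different elements that share a concept end up in the same facet, and spelling variants such as " ENG " and "English" collapse into a single facet value. Keeping the mapping tables as data rather than code mirrors the situation described above, in which curators edit configuration files rather than the importer itself.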

Reference

The initial idea was developed in another document, https://docs.google.com/document/d/1OoxDEFoZKhmotk7tbrElqcn79acKnj4T897sNctMYH8/edit?usp=sharing, which also outlines ideas for the more distant future. It is always good to take long-term strategies and visions into account when developing the most urgent features, so that moving on to the next step is less time-consuming.
