Changes between Version 15 and Version 16 of VLO/CMDI data workflow framework


Timestamp: 11/11/15 09:52:58
Author: go.sugimoto@oeaw.ac.at
Comment: --

== Current VLO data workflow ==
Removed (v15):
{{{(may not be accurate. to be precisely produced soon)}}}

Added (v16):
[[Image()]]
{{{(Figure 1. Current VLO data workflow (may not be 100% accurate))}}}
[[BR]]
Figure 1 illustrates the current state of the data workflow. (It may not be 100% accurate, but will be modified later if needed.) The workflow is well established, but not optimised. A data provider may use a metadata authoring tool hosted at one of the CLARIN centres; typical examples are ARBIL in the Netherlands, COMEDI developed in Norway, and the submission form of DSpace as developed in the Czech Republic. Such a tool provides an easy-to-use web GUI where a CMDI profile can be imported or generated and metadata records can be created. When generating a profile, the degree of interaction with the Component Registry varies: for example, when COMEDI users need to create a brand-new profile, they first have to register it in the Component Registry in order to import it into COMEDI. Each of the tools integrates more or less tightly with the underlying repository, where the metadata is stored together with the actual resources in one digital object. The metadata is exposed via an OAI-PMH endpoint, from which it is fetched by the VLO harvester on a regular basis. However, while the authoring tools try to provide local control over metadata quality (offering custom auto-complete functionality and various consistency checks), a common, formal and rigorous quality-control mechanism for VLO data ingestion is lacking, which the VLO team is struggling to cope with. The ability of these applications to synchronise and interoperate with four further CLARIN services, namely the Centre Registry, the Component Registry, CLAVAS, and the CCR, is limited. In particular, CLAVAS is not used as an authoritative source of controlled vocabularies. There is also almost no feedback (automatic or manual) from the VLO team after data ingestion, so data providers have to put considerable effort into improving metadata quality through individual consultation.
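As a rough illustration of the harvesting step described above, the sketch below fetches one page of CMDI records from an OAI-PMH endpoint. It is not the actual CLARIN harvester: the endpoint URL and the {{{metadataPrefix}}} value are placeholders (the prefix differs per repository), and {{{resumptionToken}}} paging is omitted for brevity.
{{{#!python
# Minimal sketch of pulling CMDI records over OAI-PMH (not CLARIN's harvester).
# ENDPOINT and the metadataPrefix are hypothetical; paging is left out.
import urllib.request
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"
ENDPOINT = "https://repository.example.org/oai"  # placeholder endpoint

def list_cmdi_records(endpoint, prefix="cmdi"):
    """Yield the <metadata> payload of each record in one ListRecords page."""
    url = f"{endpoint}?verb=ListRecords&metadataPrefix={prefix}"
    with urllib.request.urlopen(url) as response:
        tree = ET.parse(response)
    for record in tree.iter(OAI + "record"):
        payload = record.find(OAI + "metadata")
        if payload is not None:  # deleted records carry no <metadata>
            yield payload

for payload in list_cmdi_records(ENDPOINT):
    print(ET.tostring(payload, encoding="unicode")[:200])
}}}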
[[BR]]
OLAC and CMDI are the two formats that can be imported into the VLO environment; the former is converted to CMDI by a predefined mapping. Once CMDI is ready, it is ingested into the Solr/Lucene index, governed by a set of configuration files: facetConcepts.xml, which deals with the mapping of elements to facets (via concepts), and a set of text files defining the normalisation of values. These files are the essence of the CMDI-to-VLO facet mapping and are, in principle, edited manually by the VLO curators. The processed data is then indexed and published on the VLO website, where end users can browse and search it. The VLO curators also have a hard time controlling data quality, because they have to manually edit the raw files (XML or CSV alike) for concept mapping and for value mapping and normalisation, in conjunction with the external CLARIN services. They also need to examine the outcomes on the public website to check the data integrity.
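To make the two mapping steps concrete, the sketch below imitates, in simplified form and not as the actual VLO code, what those configuration files govern: routing a metadata element to a facet via its concept link (as facetConcepts.xml does) and normalising the raw value via a lookup table (as the text map files do). The concept handle and the normalisation entries are invented examples.
{{{#!python
# Simplified imitation of the VLO facet mapping and value normalisation;
# the concept URI and the lookup entries below are invented.
CONCEPT_TO_FACET = {
    # concept URI (as linked from the CMDI profile) -> VLO facet
    "http://hdl.handle.net/11459/CCR_C-0000_example": "languageCode",
}

VALUE_NORMALISATION = {
    # per-facet lookup tables, like the plain-text map files
    "languageCode": {"nld": "Dutch", "Dutch; Flemish": "Dutch"},
}

def map_element(concept_uri, raw_value):
    """Return (facet, normalised value) for one metadata element, or None."""
    facet = CONCEPT_TO_FACET.get(concept_uri)
    if facet is None:
        return None  # concept not mapped to any facet: value is not indexed
    table = VALUE_NORMALISATION.get(facet, {})
    return facet, table.get(raw_value, raw_value)  # fall back to raw value

print(map_element("http://hdl.handle.net/11459/CCR_C-0000_example", "nld"))
# -> ('languageCode', 'Dutch')
}}}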
== Reference ==