Changes between Version 6 and Version 7 of VLO/CMDI data workflow framework


Ignore:
Timestamp:
11/05/15 14:58:26 (9 years ago)
Author:
go.sugimoto@oeaw.ac.at
Comment:

--

Legend:

Unmodified
Added
Removed
Modified
  • VLO/CMDI data workflow framework

    v6 v7  
    1414{{{(may not be accurate. to be precisely produced soon)}}}
    1515
     16[[Image(Copy of Current VLO data workflow.png​)]]
     17
    1618Figure 1 illustrates the current state of data workflow. (It may not be 100% accurate, but will be modified later, if needed) It is well established, but not optimised. A data provider may use a metadata authoring tool hosted at one of the CLARIN centres. The typical examples are ARBIL in the Netherlands, COMEDI developed in Norway and the submission form of DSpace as developed in Czech. It provides an easy-to-use GUI web interface where a CMDI profile can be imported or generated and metadata records can be created. Each of the tools more less tightly integrates with the underlying repository, where the metadata is stored together with the actual resources in one digital object. The metadata is exposed via an OAI-PMH endpoint from where it is fetched by VLO harvester on a regular basis. However, while the authoring tools try to provide a local control over the quality of the metadata (offering custom auto-complete functionality and various consistency checks), a common, formal and rigorous mechanism is lacking for the authoring tools to control the quality of metadata which VLO team is struggling to cope with. The ability of these applications  is limited to synchronise and interoperate with four extra CLARIN services, namely Centre Registry, Component Registry, CLAVAS, and CCR. In particular, CLAVAS is not used as authoritative source of controlled vocabularies. At the moment, the data providers are required to make some effort to improve the metadata quality by consultation.
    1719
    1820OLAC and CMDI are the two formats allowed to be imported into VLO environment, and the former is converted to CMDI by a predefined mapping. When CMDI is ready, it is being ingested into the solr/lucene index, governed by a set of configuration files: facetConcepts.xml dealing with the mapping of elements to facets (via concepts) and a set of text files defining the normalisation of values. These files are the essence of the CMDI-VLO facet mapping, and, in principle, edited manually by the VLO curators. The processed data will be indexed and published seamlessly on the VLO website, where the end users can browse and search data. The VLO curators also have some difficult time to control the data quality, because they have to manually edit raw files (XML or CSV alike) of concept mapping and value mapping and normalisation, in conjunction with the external CLARIN services. They also need to examine the outcomes on the public website to check the data integrity.
    19 
    20 [[Image()]]
    21 
    2221
    2322== Phase I - VLO Dashboard data management system and extended CLARIN centre MD authoring tool ==
     
    2524{{{VLO Dashboard will manage the whole data ingestion and publication pipeline, communicating with extra CLARIN services. MD authoring tool will connect to the extra CLARIN services}}}
    2625
    27 [[Image()]]
     26[[Image(Go - VLO data workflow(3).png​)]]
    2827
    2928In this phase, the main tasks are to develop 1) VLO Dashboard data management system and 2) enhanced MD authoring tool. In terms of implementation, it seems most natural to ask VLO central developers to work on 1) and CLARIN centers to work on 2), simply due to the current responsibilities of the local and central development, so that the two can be developed in parallel, if doable.
     
    5554* Send an email to the data provider (eg data quality report)
    5655
     56[[Image(Dashboard_UI_MockUp.png​)]]
     57
    5758As well as the single data set actions, there will be batch processing options. The users can search, sort, and filter the data in this table view. In addition, s/he can select multiple data sets by clicking the checkboxes on the left. With this table view, the user should be able to see the overall statistics (and/or selected data sets), including the number of datasets, countries, data providers, the status, the number of records indexed and distributed. Those figures are extremely important for the performance indicators (for CLARIN board etc). The user should also be able to export the statistics as PDF and CSV or directly into Google Spreadsheet (this is useful to work with internet traffic statistics of Google Analytics or Piwik).
    5859
     
    6061
    6162In addition to the data management view of the Dashboard, it can offer a very simple GUI tool (eg a web form interface) to create and edit the concept mapping and the value mapping and normalisation (See figures below). It should translate the data filled into XSLT files (or equivalent) with which the data transformation/mapping will be carried out. Direct XML editing is also possible. The tool will reduce the work involving CSV, TXT, XML, XSLT by introducing a simple yet powerful collaborative web service within the Dashboard framework. Desirably, the editing in the form will not just typing a value itself, but provide a modal window to search and select from relevant extra services (CCR, CLAVAS) (note: there is no mock-up below for this function), so that the curator can double-check with and controlled by the mapping and normalisation values via API/auto-complete.
     63
     64[[Image()]]
     65[[Image(ConceptMappingInterfaceForm2)]]
     66[[Image(ConceptMappingInterfaceXML.png)]]
     67Concept mapping interface in line with the XML code structure
     68
     69[[Image(MappingInterfaceForm.png)]]
     70[[Image(MappingInterfaceFormSelected.png)]]
     71[[Image(MappingInterfaceForm.2.png)]]
     72[[Image(MappingInterfaceXML.png)]]
     73Value mapping and normalisation interface in line with the XML code structure
    6274
    6375'''The enhanced MD authoring tool''' will communicate with the extra CLARIN services including the (Centre Registry), Component Registry, CLAVAS, and CCR. The base of this tool already exists in different CLARIN centres. COMEDI in Norway and DSpace in Czech/Poland are two of the good examples. The new tool may include the functionalities as follows (but not limited to):