Changes between Version 3 and Version 4 of VLO/CMDI data workflow framework


Timestamp: 11/05/15 14:31:20
Author: go.sugimoto@oeaw.ac.at
'''The Dashboard''' is the key development of the core VLO framework. It will integrate the entire data ingestion pipeline into one place, creating a user-friendly GUI web interface with which VLO curators can work on data management much more efficiently and coherently, in a uniform manner. Data integrity will be far better guaranteed across the complex VLO data life cycle within one environment. The Dashboard approach is based on the well-known OAIS model, encompassing the three information packages: Submission Information Package (SIP), Archival Information Package (AIP), and Dissemination Information Package (DIP). It offers a very intuitive data management view, illustrating the step-by-step process of the entire data life cycle, from harvesting, converting, and validating to indexing and distributing. Users without strong technical skills should be able to use it much as they would organise a mailbox in an email client. The functionalities should include (but are not limited to):

 * List of the datasets (OAI-PMH sets) bundled per data provider and per CLARIN centre/country (MUST)
 * Status and statistics of the sets within the ingestion pipeline (errors, progress indicator) (MUST) (export as PDF, XML, CSV etc. (SHOULD))
 * Simple visualisation of the statistics, including pie charts, bar charts etc. (COULD)
 * Browse the data quality reports per set (MUST) (export as PDF, XML, CSV etc. (SHOULD))
 * Send an email with the data quality report to a data provider/CLARIN centre (SHOULD) (automatic email (COULD))
 * Edit the concept mapping (MUST)
 * Edit the value mapping and normalisation (MUST)
 * Manual data management (deactivate indexing of sets, delete data sets, harvest data sets) (MUST)
 * Synchronise the Component Registry, CLAVAS, and CCR with the data sets (MUST)
 * Browse the log files of the VLO systems (COULD)
 * Browse the Piwik web traffic monitoring (COULD) (per data set (COULD))
 * Link checker which lists broken links (per set) (COULD)
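To make the progress-indicator idea concrete, here is a minimal Python sketch of how a per-set status record might track the pipeline stages; all class, field, and stage names are illustrative assumptions, not the actual implementation:

```python
from dataclasses import dataclass, field
from enum import Enum

class Stage(Enum):
    """Ingestion pipeline stages, mirroring the OAIS SIP -> AIP -> DIP flow."""
    HARVESTED = "harvested"       # SIP received via OAI-PMH
    CONVERTED = "converted"       # converted to CMDI
    VALIDATED = "validated"       # validated against the CMDI schema
    INDEXED = "indexed"           # indexed for the VLO
    DISTRIBUTED = "distributed"   # DIP exposed (OAI-PMH, LOD, ...)

@dataclass
class SetStatus:
    """Status record for one OAI-PMH set, as the Dashboard might track it."""
    set_id: str
    provider: str
    country: str
    completed: set = field(default_factory=set)
    errors: list = field(default_factory=list)

    def mark(self, stage: Stage) -> None:
        """Record a completed pipeline stage (idempotent)."""
        self.completed.add(stage)

    def progress(self) -> float:
        """Fraction of pipeline stages completed, for a progress indicator."""
        return len(self.completed) / len(Stage)

# Hypothetical set identifier for illustration only.
status = SetStatus("oai:example:set1", "LINDAT", "CZ")
status.mark(Stage.HARVESTED)
status.mark(Stage.CONVERTED)
print(f"{status.set_id}: {status.progress():.0%} complete")
```

A record like this would back both the green/red status circles and the per-set progress bar described above.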

The next figure visualises the idea of the Dashboard, in which the VLO curators can monitor the whole ingestion process at a glance. OAI-PMH data sets are listed as rows and can easily be categorised per country and data provider. Following the ID and title, there is the date of the latest update (which can be the harvesting date or the latest action). The status of data processing is clearly visible with green and red circles (harvested, converted to CMDI, validated against CMDI). When a set is indexed and published, the number of records is shown. When the data is distributed in a repository (e.g. OAI-PMH, LOD), this is also indicated. Last, but not least, the data quality is indicated by stars. Different actions will be selectable according to the status of the data. Some ideas for the actions are:
 * (Re-)harvest the data set
 * Disable indexing
 * Delete the data set
 * Show the data quality report (various statistics, e.g. mapping coverage, facet coverage, controlled vocabulary coverage, broken links) (download as PDF, CSV, etc.)
 * Show the error messages (download as PDF etc.)
 * Show the metadata sets
 * Show the schema/profile (with links to the Component Registry, CLAVAS, and CCR)
 * Send an email to the data provider (e.g. data quality report)
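Since the selectable actions depend on the status of the data, the gating logic could be sketched as follows; the function and action names are hypothetical:

```python
# Illustrative mapping from a set's processing status to the actions the
# Dashboard would offer; names are assumptions, not the actual implementation.
def available_actions(harvested: bool, indexed: bool) -> list:
    actions = ["reharvest"]  # (re-)harvesting is always possible
    if harvested:
        # These actions only make sense once data for the set exists.
        actions += ["show_quality_report", "show_errors", "show_metadata",
                    "show_profile", "email_provider", "delete"]
    if indexed:
        # Indexing can only be disabled for sets that are currently indexed.
        actions.append("disable_indexing")
    return actions

print(available_actions(harvested=True, indexed=False))
```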
In addition to the single-data-set actions, there will be batch processing options. Users can search, sort, and filter the data in this table view. They can also select multiple data sets by clicking the checkboxes on the left. With this table view, the user should be able to see the overall statistics (and/or those of selected data sets), including the number of data sets, countries, and data providers, the status, and the number of records indexed and distributed. Those figures are extremely important as performance indicators (for the CLARIN board etc.). The user should also be able to export the statistics as PDF and CSV or directly into a Google Spreadsheet (this is useful for working with internet traffic statistics from Google Analytics or Piwik).
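The CSV export of the overview statistics could be sketched as follows, using only the standard library; the field names are assumptions for illustration:

```python
import csv
import io

def export_statistics(sets: list) -> str:
    """Serialise per-set statistics rows to CSV text (field names assumed)."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=["set_id", "provider", "country", "records_indexed"])
    writer.writeheader()
    writer.writerows(sets)
    return buf.getvalue()

# Hypothetical example rows, not real VLO figures.
sets = [
    {"set_id": "set1", "provider": "LINDAT", "country": "CZ",
     "records_indexed": 1200},
    {"set_id": "set2", "provider": "COMEDI", "country": "NO",
     "records_indexed": 350},
]
print(export_statistics(sets))
print("total records:", sum(s["records_indexed"] for s in sets))
```

The same rows could feed a PDF renderer or the Google Sheets upload mentioned above; CSV is the lowest common denominator.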
Be aware that the Dashboard will provide manual data processing functions, but these are just a complementary service to the automatic data processing (as currently implemented). It will serve as a very important support for the automatic processing, because the VLO curators can monitor the data and manually interact with it, whenever needed, without any knowledge of the behind-the-scenes code and scripts.
     
'''The enhanced MD authoring tool''' will communicate with the extra CLARIN services, including the Centre Registry, Component Registry, CLAVAS, and CCR. The basis of this tool already exists in different CLARIN centres: COMEDI in Norway and DSpace in the Czech Republic and Poland are two good examples. The new tool may include the following functionalities (but is not limited to them):

 * Import Component Registry CMDI profiles (especially the recommended profiles, which will be defined by the curation team soon) and select among them to create metadata (MUST)
 * Export a new CMDI profile defined by a data provider to the Component Registry and save it there (MUST)
 * Use controlled CMDI components/elements (mandatory fields, occurrence etc.) when defining/editing a profile (MUST)
 * Browse CCR concepts and create a link when defining/editing a profile (MUST)
 * Simulate the VLO display based on the defined profile (MUST)
 * Import and export the XML Schema of a CMDI profile (SHOULD)
 * Use controlled vocabularies (CLAVAS) when filling in metadata records (MUST)
 * Browse the data quality report based on the metadata created. If the data is not compliant, it will include warnings explaining the consequences. (SHOULD)
 * The MD authoring tool (including the enhanced functionalities above) will be modularised and split from the local application, so that it can be re-used by other CLARIN data providers and centres, with the future possibility of inclusion in the CLARIN central environment (see Phase III). (COULD)
 * Link checker which lists broken links (COULD)
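The compliance warnings could work along these lines. This is a minimal sketch with a hard-coded mandatory-element list standing in for a real Component Registry profile; the element names and the sample record are invented for illustration:

```python
import xml.etree.ElementTree as ET

# Stand-in for the mandatory elements a real CMDI profile (fetched from the
# Component Registry as XSD) would define; these names are hypothetical.
MANDATORY = ["Title", "Description", "ResourceType"]

def missing_mandatory(record_xml: str) -> list:
    """Return the mandatory element names absent from a metadata record,
    so the authoring tool can warn the author before export."""
    root = ET.fromstring(record_xml)
    present = {el.tag for el in root.iter()}
    return [name for name in MANDATORY if name not in present]

record = ("<CMD><Title>My corpus</Title>"
          "<ResourceType>corpus</ResourceType></CMD>")
print("missing:", missing_mandatory(record))
```

A full implementation would validate against the profile's XML Schema instead; this only illustrates how a per-record warning list might be produced.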

[Mockup will be inserted here]
     
As an option, Phase III could be implemented with a Content Management System (CMS). Strictly speaking, this is not part of the normal VLO data workflow; however, it would be an important development plan, as it is certainly related to VLO record issues (e.g. commenting, persistent identifiers). The CMS may work especially well as an end-user management system. If the VLO requires more user-oriented services, a CMS may make it easier for developers to manage registered users and user-generated content such as tagging, commenting, bookmarking, saved searches, uploads, forums, and other social network functionalities. It may also help to quickly build a multilingual website. For more details, please look at Implementing CMS for VLO.

'''Issues to be considered'''
Local repository applications (COMEDI, DSpace) still need to remain operational, physically separated from the central environment, unless everything is to be integrated into CLARIN. Only the MD authoring tool/module needs to be extracted and attached to the central environment, where metadata storage also takes place. The CMDI data might then be pushed to the local repository. >> Check how the CLARIN centre repositories work and modify the workflow accordingly.
There should be different levels of access permission to the VLO Dashboard. Although some members of the CLARIN team have multiple roles, it is envisaged that there should be at least three different roles for the Dashboard: data providers/CLARIN centres, who work on their own data ingestion; VLO curators, who work on the overall data ingestion of all the data providers; and the coordinators (Centre Registry, Component Registry, CLAVAS, CCR, Persistent Identifiers), who work on the maintenance of the corresponding services. It is recommended that the last two roles be merged, because their jobs are closely related. On top of the three roles, there should be admin accounts, assigned to the lead VLO developer(s), which can do everything in the central management. >> Clarify the roles of the three stakeholders.
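The role model could be sketched as a simple role-to-permission mapping; the role and permission names below are assumptions for illustration, not a finalised design:

```python
# Hypothetical role-to-permission mapping for the Dashboard roles described
# above; "*" marks the admin wildcard. All names are illustrative.
ROLES = {
    "data_provider": {"view_own_sets", "edit_own_metadata",
                      "view_quality_report"},
    "vlo_curator":   {"view_all_sets", "edit_mapping", "reharvest",
                      "disable_indexing", "view_quality_report"},
    "admin":         {"*"},
}

def can(role: str, permission: str) -> bool:
    """Check whether a role is granted a permission."""
    granted = ROLES.get(role, set())
    return "*" in granted or permission in granted

print(can("vlo_curator", "reharvest"))    # curator may reharvest
print(can("data_provider", "reharvest"))  # provider may not
```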
It is very important that all the technical implementation in this document follows the '''VLO guidelines and recommendations''', which will indicate what must be done to organise and manage the metadata for the sake of the end-users.