Version 2 (modified by xnrn@gmx.net, 10 years ago)

added notes on the discussion during the meeting

Curation Taskforce, F2F meeting at the CAC 2014, 2014-10-24

  • What?
    • Status and Plan for Metadata Curation
  • Who?
  • When?
    • 24 October 2014: 17.00 - 18.00 CET
  • Where?
    • Room Botswana at the "Kontakt der Kontinenten" conference hotel, Soesterberg, NL

Agenda

  • Presentation of the tasks of the taskforce
  • Overview of the existing partial solutions to the tasks
  • Report from the CLARIN-D VLO-taskforce (to be confirmed)
  • Review of the proposal for Value Curation/Normalization
  • Brainstorming about a Curation Module #679
  • Set up work groups + next milestones

Discussion

Data categories

  • There has to be clear information for metadata creators/providers on "required" data categories, i.e. a set of data categories that are strongly recommended and that affect how the records are displayed in the VLO. Still, it has to be taken into account that different resource types require different kinds of description (so not all metadata can meaningfully fill the default required data categories). The information about required data categories should be reflected in the Best Practices Guide (recommending profiles that link to the given data categories).
  • Besides the currently used facets, there are a few candidates for new facets, e.g. access or availability. This would also go well with the newly provided "wizard for licensing" developed by the legal committee
  • Current data and feedback from users show that the semantics of the data categories are not always clear. The CLARIN-D VLO taskforce is currently working on clearer definitions for the facets and should have them ready by December.

Facet values

The inconsistencies in the facet values currently found in the VLO are a major issue and a main(?) source of frustration for users. They are therefore something the curation taskforce should give its utmost attention.

An overview of the values in the individual facets in the current VLO has been examined and was also briefly presented during the meeting. The following (mostly already known) issues were identified:

  • there are trivial problems such as differences in case ('Text' and 'text') that can easily be solved by case folding in post-processing
  • many values appear in multiple facets; this is partly justified ('germany' can be both a country and a subject), but often it is the result of ambiguous facet semantics, differing interpretations, etc. This confuses users, affects recall, and should be eliminated.
  • most of the problematic facets (Resource Class / Format / Modality / Genre / Subject / Keyword) have a manageable number of distinct values (a few hundred), except for the Subject facet
  • the Subject facet has over 58,000 distinct values, many of them singletons, so it is obviously de facto a free-text field. However, many values (or substrings of values) would make good keywords in a more specialized facet. It would therefore be worthwhile to try to extract these keywords from Subject and use them in a more narrowly defined facet.
  • also, although at first sight the Subject facet may seem like a complete mess, when drilling down along the other facets (e.g. looking into individual collections) Subject suddenly starts to make sense, offering values that nicely partition the given subset. This confirms what one would expect, i.e. that individual collections are consistent in their own classifications and the "mess" arises only because these local classifications are not aligned with each other.
    This implies that Subject can still be useful as a facet at deeper levels of search (when the dataset is already restricted by other dimensions).
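
The checks behind these observations (case variants, values shared between facets, singletons) can be sketched roughly as follows. This is only an illustration: the sample records, the `fold` helper and the `facet_report` function are hypothetical, and real input would have to be extracted from the VLO index.

```python
from collections import Counter, defaultdict

# Hypothetical sample of (facet, value) pairs, not real VLO data.
records = [
    ("format", "Text"),
    ("format", "text"),
    ("country", "Germany"),
    ("subject", "germany"),
    ("subject", "folk tales"),
    ("subject", "Folk tales"),
    ("subject", "dialect survey 1957"),
]


def fold(value: str) -> str:
    """Trim, collapse whitespace and case-fold a value for comparison."""
    return " ".join(value.split()).casefold()


def facet_report(pairs):
    """Return per-facet counts, values shared across facets, and singletons."""
    counts = defaultdict(Counter)  # facet -> folded value -> occurrences
    facets_of = defaultdict(set)   # folded value -> facets it appears in
    for facet, value in pairs:
        key = fold(value)
        counts[facet][key] += 1
        facets_of[key].add(facet)
    # Values that occur in more than one facet (candidates for ambiguity).
    shared = {v: sorted(f) for v, f in facets_of.items() if len(f) > 1}
    # Values that occur exactly once within a facet.
    singletons = {
        facet: sorted(v for v, n in c.items() if n == 1)
        for facet, c in counts.items()
    }
    return counts, shared, singletons


counts, shared, singletons = facet_report(records)
print(shared)
print(singletons["subject"])
```

Running the report per collection instead of over the whole index would be one way to check the observation that Subject values partition individual collections well.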

Workflow

  • The issue of moving from the Data Category Registry ISOcat to CLAVAS was also raised. For the curation taskforce this implies that all curation work will be done in CLAVAS, which is basically a welcome setup (as OpenSKOS and the underlying SKOS data model are much simpler than ISOcat and the data categories model).
  • The details of the actual value curation process still need to be worked out, but the basic approach is to:
    1. extract existing values (per facet) from VLO
    2. (possibly do some automatic merging)
    3. convert and import them as concepts + labels (SKOS data model) into CLAVAS
    4. collaboratively clean up the vocabularies (merge synonyms and spelling variants as altLabel of the same concept)
    5. export the vocabularies, convert and apply them in the ingestion process of the VLO
    6. enjoy the wonderful new clean facets in the VLO ;)
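
Steps 2 to 5 above could look roughly like the following sketch. It deliberately uses plain Python objects instead of CLAVAS/OpenSKOS, whose import and export interfaces are not described here; the `Concept` class and all function names are hypothetical, and only the prefLabel/altLabel idea is taken from SKOS.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Concept:
    """Minimal stand-in for a SKOS concept: one prefLabel, many altLabels."""
    pref_label: str
    alt_labels: set = field(default_factory=set)


def auto_merge(values):
    """Step 2: group raw values that differ only in case or whitespace."""
    groups = {}
    for v in values:
        key = " ".join(v.split()).casefold()
        groups.setdefault(key, Counter())[v] += 1
    concepts = []
    for variants in groups.values():
        # Most frequent spelling becomes the prefLabel; ties go alphabetically.
        pref = min(variants, key=lambda s: (-variants[s], s))
        concepts.append(Concept(pref, set(variants) - {pref}))
    return concepts


def merge_concepts(concepts, keep, absorb):
    """Step 4: a curator decides that 'absorb' is a synonym of 'keep'."""
    by_label = {c.pref_label: c for c in concepts}
    target, source = by_label[keep], by_label[absorb]
    target.alt_labels |= {source.pref_label} | source.alt_labels
    return [c for c in concepts if c is not source]


def build_lookup(concepts):
    """Step 5: map every known label to the prefLabel of its concept."""
    lookup = {}
    for c in concepts:
        for label in {c.pref_label} | c.alt_labels:
            lookup[label] = c.pref_label
    return lookup


def normalise(value, lookup):
    """Applied at ingestion: unknown values pass through unchanged."""
    return lookup.get(value, value)


raw = ["Text", "text", "text", "written", "Audio"]  # step 1, hypothetical values
concepts = auto_merge(raw)
concepts = merge_concepts(concepts, keep="text", absorb="written")
lookup = build_lookup(concepts)
print(normalise("Text", lookup), normalise("written", lookup))
```

Letting unknown values pass through unchanged keeps ingestion from failing on new records; whether such values should instead be flagged is a design decision the taskforce still has to make.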

After this has been done for the first time, there should be a method in place to extract previously unseen values from incoming new records, so that they can be integrated into the existing vocabularies.
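
A minimal sketch of such a check, assuming the curated labels are available as a plain set of strings (the function name, the label set and the incoming values are all hypothetical):

```python
from collections import Counter


def unseen_values(incoming, known_labels):
    """Count incoming values that match no known label.

    Matching ignores case and extra whitespace, so 'AUDIO' is treated
    as already covered by the label 'Audio'.
    """
    known = {" ".join(label.split()).casefold() for label in known_labels}
    unseen = Counter()
    for value in incoming:
        if " ".join(value.split()).casefold() not in known:
            unseen[value] += 1
    return unseen


known_labels = {"text", "Text", "written", "Audio"}
incoming = ["text", "AUDIO", "video", "video", "Sign language"]

# Most frequent unseen values first, as a review queue for curators.
review_queue = unseen_values(incoming, known_labels).most_common()
print(review_queue)
```

Sorting by frequency is only one possible prioritisation; it puts the values that affect the most records at the top of the curators' list.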

Further steps

  • (immediate) employ the (already curated) organisations vocabulary for the normalization of the organisation facet, either implemented as a PostProcessor or by trying to use Solr's internal mechanism
  • CLARIN-D VLO-Taskforce will deliver new definitions for the facets around December
  • agree on "required" data categories (concepts)
  • continue the analysis of the "problematic" data categories, set up a workflow for collaborative curation (involving CLAVAS)

Documents