
Curation Taskforce, F2F meeting at the CAC 2014, 2014-10-24

  • What?
    • Status and Plan for Metadata Curation
  • Who?
  • When?
    • 24 October 2014: 17.00 - 18.00 CET
  • Where?
    • Room Botswana at the "Kontakt der Kontinenten" conference hotel, Soesterberg, NL

Agenda

  • Presentation of the tasks of the taskforce
  • Overview of the existing partial solutions to the tasks
  • Report from the CLARIN-D VLO-taskforce
  • Review of the proposal for Value Curation/Normalization
  • Brainstorming about a Curation Module #679
  • Set up working groups + next milestones

Discussion

Data categories

  • There has to be clear information for metadata creators/providers on "required" data categories, i.e. a set of data categories that are strongly recommended and that affect how the records are displayed in the VLO. The information about required data categories should be reflected in the Best Practices Guide (recommending profiles that link to the given data categories).
  • Still, it has to be taken into account that different resource types require different types of descriptions (and thus not all metadata records can meaningfully fill the default required data categories). One solution could be custom views in the VLO (by resource type or collection). The metadata search engine developed by the Meertens Institute already implements such a custom display on a per-collection basis, which could serve as inspiration (or even be reused right away).
  • Besides the currently used facets, there are a few candidates for new facets, e.g. access or availability. This would also go well with the newly provided "wizard for licensing" developed by the legal committee.
  • Current data and feedback from users show that the semantics of the data categories are not always clear. The CLARIN-D VLO taskforce is currently working on clearer definitions for the facets and should have them ready by December.

Facet values

The inconsistencies in the facet values currently found in the VLO are a major issue and a main(?) source of frustration for users. The curation taskforce should therefore give this issue its utmost attention.

An overview of the values currently found in the individual facets was examined and briefly presented during the meeting. The following (mostly already known) issues were identified:

  • There are trivial problems like case differences ('Text' vs. 'text') that can easily be solved by case folding in post-processing.
  • Many values appear in multiple facets. This is partly justified ('germany' can be both a country and a subject), but it is often the result of ambiguous facet semantics, differing interpretations, etc. It confuses users and affects recall, so we have to try to make the values more consistent.
  • Most of the problematic facets (Resource Class / Format / Modality / Genre / Subject / Keyword) have a manageable number of distinct values (a few hundred), except for the Subject facet.
  • The Subject facet has over 58,000 distinct values, many of them singletons, and is obviously a de facto free-text field. However, many values (or substrings of values) would make good keywords in a more specialized facet, so it would be worthwhile to extract them from Subject and use them in a more narrowly defined facet.
  • Also, even though at first sight the Subject facet may seem like a complete mess, when drilling down along the other facets (e.g. looking into individual collections) Subject suddenly starts to make sense, offering values that nicely partition the given subset. This confirms what one would expect, i.e. that individual collections are internally consistent in their classifications and that the "mess" only arises because the individual classifications are local and not aligned with each other.
    This implies that Subject can still be a useful facet, but only at deeper levels of search (when the dataset is already restricted by other dimensions).
  • We should also experiment with flattening the facets further, i.e. put all the values from the different facets into one "bag of words" facet, display that as a tag cloud or similar, and see how users react. (These values should still be normalized, but we would circumvent the problem of overlapping semantics (and value domains) of the individual, more specific facets.)
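
The checks discussed in this list (case folding, values shared between facets, singletons in a free-text facet) can be illustrated with a minimal Python sketch. The facet names and sample values below are invented for illustration and are not taken from the VLO index; the function names are hypothetical and do not correspond to any existing VLO code.

```python
from collections import Counter, defaultdict


def fold(value):
    """Normalize a raw facet value: trim, collapse whitespace, fold case."""
    return " ".join(value.split()).casefold()


def analyse(facets):
    """Count folded values per facet and find values shared between facets.

    `facets` maps a facet name to the list of raw values found in it.
    """
    counts = {name: Counter(fold(v) for v in values)
              for name, values in facets.items()}
    seen_in = defaultdict(set)
    for name, counter in counts.items():
        for value in counter:
            seen_in[value].add(name)
    shared = {v: sorted(names) for v, names in seen_in.items() if len(names) > 1}
    return counts, shared


def singletons(counter):
    """Values occurring exactly once, a hint that a facet is used as free text."""
    return sorted(v for v, n in counter.items() if n == 1)


# Invented sample data, for illustration only.
sample = {
    "resourceClass": ["Text", "text", "Corpus"],
    "country": ["Germany", "germany "],
    "subject": ["germany", "Dialect survey 1952", "text"],
}
counts, shared = analyse(sample)
```

Running the same analysis on a subset restricted to a single collection would be one way to check the observation that Subject values become consistent at deeper levels of search.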

Workflow

  • The details of the actual value curation process still need to be worked out, but the basic approach is to:
    1. extract existing values (per facet) from VLO
    2. (possibly do some automatic merging)
    3. convert and import them as concepts + labels (SKOS data model) into CLAVAS
    4. collaboratively clean up the vocabularies (merge synonyms and spelling variants as altLabel of the same concept)
    5. export the vocabularies, convert and apply them in the ingestion process of the VLO
    6. enjoy the wonderful new clean facets in the VLO ;)
  • After this has been done for the first time, there should be a method in place to extract previously unseen values from incoming new records, which should then be integrated into the existing vocabularies.
  • The issue of moving from the Data Category Registry ISOcat to CLAVAS has also been raised. For the curation taskforce this implies that all curation work will be done in CLAVAS, which is basically a welcome setup (as OpenSKOS and the underlying data model SKOS are much simpler than ISOcat and the data categories model).
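
The core of this workflow (concepts with a preferred label and alternative labels, applied during ingestion, with unseen values set aside for the next curation round) can be sketched as follows. This is only an illustration under stated assumptions: it uses plain Python data structures rather than the CLAVAS/OpenSKOS API, and the class, function names and sample vocabulary are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    """A SKOS-style concept: one preferred label plus alternative labels."""
    pref_label: str
    alt_labels: set = field(default_factory=set)


def build_lookup(concepts):
    """Map every case-folded label (preferred or alternative) to its preferred label."""
    lookup = {}
    for concept in concepts:
        for label in {concept.pref_label, *concept.alt_labels}:
            lookup[label.casefold()] = concept.pref_label
    return lookup


def normalize(values, lookup):
    """Split incoming values into normalized ones and ones not yet in the vocabulary."""
    normalized, unseen = [], []
    for value in values:
        key = value.strip().casefold()
        if key in lookup:
            normalized.append(lookup[key])
        else:
            # Candidate for the next collaborative clean-up round.
            unseen.append(value)
    return normalized, unseen


# Hypothetical vocabulary after a manual clean-up round.
vocabulary = [
    Concept("text", {"Text", "written text"}),
    Concept("audio", {"sound", "Audio recording"}),
]
lookup = build_lookup(vocabulary)
normalized, unseen = normalize(["Text", "Sound", "video"], lookup)
```

In this sketch the unseen list plays the role of the values that would have to be fed back into the vocabulary; how that feedback loop is actually organised was left open at the meeting.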

Further steps

  • (immediate) employ the (already curated) organisations vocabulary for the normalization of the organisation facet, either implemented as a PostProcessor or by trying to use Solr's internal mechanism #683
  • CLARIN-D VLO-Taskforce will deliver new definitions for the facets ~ in December
  • inspect the recommendations provided by Jan Odijk in his evaluation and respond to them
  • agree on "required" data categories (concepts)
  • continue the analysis of the "problematic" data categories, set up a workflow for collaborative curation (involving CLAVAS)
  • set up a dev instance of VLO for experimenting with the facets, ingest processing and display

Documents

Last modified on 11/03/14 13:11:19