Changes between Version 1 and Version 2 of Taskforces/Meeting20141024


Timestamp:
11/01/14 21:13:56 (10 years ago)
Author:
xnrn@gmx.net
Comment:

added notes on the discussion during the meeting

  • Taskforces/Meeting20141024

    v1 v2  
    1919* Setup work groups + next milestones
    2020
     21== Discussion
     22
     23=== Data categories
     24
     25* Metadata creators/providers need clear information on "required" data categories, i.e.
     26  a set of data categories that are strongly recommended and that affect how the records are displayed in the VLO. Still, it has to be taken into account that different resource types require different types of descriptions (and thus not all metadata can meaningfully fill the default required data categories). The information about required data categories should be reflected in the Best Practices Guide (recommending profiles that link to the given data categories).
     27* Besides the currently used facets, there are a few candidates for new facets, e.g. `access` or `availability`. These would also fit well with the newly provided "wizard for licensing" developed by the legal committee.
     28* Current data and feedback from users show that the semantics of the data categories are not always clear. The CLARIN-D VLO taskforce is currently working on clearer definitions for the facets and should have them ready by December.
     29
     30=== Facet values
     31The inconsistencies in the facet values currently found in the VLO are a major issue and possibly the main source of frustration for users. The curation taskforce should therefore give this issue its utmost attention.
     32
     33An overview of the values in the individual facets in the current VLO has been examined and was briefly presented during the meeting. The following (mostly already known) issues were identified:
     34
     35  * there are trivial problems like case differences (`'Text'` and `'text'`) that can be solved easily in post-processing
     36  * many values appear in multiple facets; even though this is partly justified (`'germany'` can be both `country` and `subject`), it is often the result of ambiguous facet semantics, differing interpretations, etc. This confuses users, reduces recall, and should be eliminated.
     37  * most of the [[https://trac.clarin.eu/wiki/Taskforces/Curation/ValueNormalization#ResourceTypeResourceClassMediaTypeFormatModalityGenreSubjectKeywords|problematic facets]] (Resource Class / Format / Modality / Genre / Subject / Keyword) have a manageable number of distinct values (a few hundred), except for the `Subject` facet
     38  * The `Subject` facet has over 58,000 distinct values, many of them singletons; it is de facto a free-text field. However, there are many values (or partial strings in values) that would make good keywords in a more specialized facet. So it would be worthwhile to try to extract these keywords from `Subject` and use them in a more narrowly defined facet.
     39  * Also, even though at first sight the `Subject` facet may seem like a complete mess, when drilling down along the other facets (e.g. looking into individual collections) `subject` suddenly starts to make sense, offering values that nicely partition the given subset. This confirms what one would expect, i.e. that individual collections are consistent in their classifications and the "mess" only arises because the individual classifications are local and not aligned with each other.[[BR]]
     40   This implies that `subject` can still be useful as a facet at deeper levels of search (when the dataset is already restricted by other dimensions).
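
The kinds of checks listed above (distinct values, singletons, case variants, values shared between facets) could be computed roughly as in the following Python sketch. It is only an illustration under assumptions: the function names `profile_facet` and `shared_values` and the sample data are invented here, and the facet values are assumed to be available as plain lists of strings already extracted from the VLO.

```python
from collections import Counter, defaultdict


def profile_facet(values):
    """Return (distinct count, singleton count, case-variant groups) for one facet."""
    counts = Counter(values)
    singletons = sum(1 for n in counts.values() if n == 1)
    # Group distinct values that differ only in case, e.g. 'Text' and 'text'.
    by_folded = defaultdict(set)
    for value in counts:
        by_folded[value.casefold()].add(value)
    case_variants = {k: sorted(v) for k, v in by_folded.items() if len(v) > 1}
    return len(counts), singletons, case_variants


def shared_values(facets):
    """Return the case-folded values that occur in more than one facet,
    mapped to the sorted list of facets they occur in."""
    seen = defaultdict(set)
    for facet, values in facets.items():
        for value in values:
            seen[value.casefold()].add(facet)
    return {v: sorted(f) for v, f in seen.items() if len(f) > 1}


# Invented sample data, for illustration only (not taken from the VLO).
sample = {
    "resourceClass": ["Text", "text", "text", "audio"],
    "country": ["germany", "Germany"],
    "subject": ["germany", "phonetics"],
}

print(profile_facet(sample["resourceClass"]))
print(shared_values(sample))
```

A report like this only flags candidates; whether a shared value such as `'germany'` is legitimate in both facets still needs a human decision.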
     41
     42== Workflow
     43
     44 * The issue of moving from the Data Category Registry ISOcat to CLAVAS was also raised.
     45   For the curation task force this implies that all curation work will be done in CLAVAS, which is basically a welcome setup (as OpenSKOS and the underlying data model SKOS are much simpler than ISOcat and the data categories model).
     46 * The details of the actual value curation process still need to be worked out, but the basic approach is to:
     47  1. extract existing values (per facet) from VLO
     48  1. (possibly do some automatic merging)
     49  1. convert and import them as concepts + labels (SKOS data model) into CLAVAS
     50  1. collaboratively clean up the vocabularies (merge synonyms and spelling variants as `altLabel` of the same concept)
     51  1. export the vocabularies, convert and apply them in the ingestion process of the VLO
     52  1. enjoy the wonderful new clean facets in the VLO ;)
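
As a rough illustration of the automatic merging and conversion steps, the following Python sketch groups raw facet values into concepts with one preferred label and alternative labels, and writes them out as simplified SKOS in Turtle. Everything specific here is an assumption made for the example: the function names, the `http://example.org/vocab/` base URI, the rule of picking the most frequent spelling as preferred label, and the omission of language tags. The actual import format expected by CLAVAS is not specified in these notes.

```python
import json
from collections import Counter, defaultdict


def build_concepts(values):
    """Automatic merge: group values that differ only in case or whitespace
    and pick the most frequent spelling as the preferred label."""
    groups = defaultdict(Counter)
    for value in values:
        cleaned = " ".join(value.split())  # trim and collapse whitespace
        if cleaned:
            groups[cleaned.casefold()][cleaned] += 1
    concepts = []
    for key in sorted(groups):
        # Descending frequency, then alphabetical, to keep the result deterministic.
        spellings = sorted(groups[key], key=lambda s: (-groups[key][s], s))
        concepts.append({"prefLabel": spellings[0], "altLabels": spellings[1:]})
    return concepts


def to_turtle(concepts, base="http://example.org/vocab/"):
    """Serialize concepts as simplified SKOS Turtle (hypothetical base URI)."""
    lines = ["@prefix skos: <http://www.w3.org/2004/02/skos/core#> .", ""]
    for i, concept in enumerate(concepts, start=1):
        lines.append(f"<{base}c{i}> a skos:Concept ;")
        labels = [("skos:prefLabel", concept["prefLabel"])]
        labels += [("skos:altLabel", alt) for alt in concept["altLabels"]]
        for j, (prop, label) in enumerate(labels):
            end = " ." if j == len(labels) - 1 else " ;"
            # json.dumps is used here as a simple way to quote and escape the literal.
            lines.append(f"    {prop} {json.dumps(label, ensure_ascii=False)}{end}")
        lines.append("")
    return "\n".join(lines)


concepts = build_concepts(["Text", "text", "text", " audio "])
print(to_turtle(concepts))
```

The automatic step deliberately merges only trivially equivalent spellings; real synonyms would still be merged by hand during the collaborative clean-up.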
     53
     54After this has been done for the first time, there should be a method in place to extract previously unseen values from incoming new records, so that they can be integrated into the existing vocabularies.
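
One possible shape for such a method is sketched below in Python: incoming values are mapped to the preferred label of a curated vocabulary where a label matches, and everything else is collected as "unseen" for later curation. The names `build_lookup` and `normalize`, the dictionary layout of the vocabulary, and the matching rule (case-insensitive, whitespace-collapsed) are assumptions for this example, not the VLO's actual ingestion code.

```python
def _key(label):
    """Matching key: collapse whitespace and ignore case."""
    return " ".join(label.split()).casefold()


def build_lookup(concepts):
    """Map every known label (preferred or alternative) to its preferred label."""
    lookup = {}
    for concept in concepts:
        for label in [concept["prefLabel"], *concept["altLabels"]]:
            lookup[_key(label)] = concept["prefLabel"]
    return lookup


def normalize(values, lookup):
    """Return (normalized values, values not covered by the vocabulary)."""
    normalized, unseen = [], []
    for value in values:
        preferred = lookup.get(_key(value))
        if preferred is None:
            normalized.append(value)  # keep the original value untouched
            unseen.append(value)      # queue it for manual curation
        else:
            normalized.append(preferred)
    return normalized, unseen


# Invented vocabulary and incoming values, for illustration only.
vocabulary = [{"prefLabel": "text", "altLabels": ["Text", "written text"]}]
lookup = build_lookup(vocabulary)
print(normalize(["TEXT", "Written  Text", "video"], lookup))
```

Keeping unmatched values unchanged means no information is lost during ingestion; the unseen list is what would be fed back into the vocabulary clean-up.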
     55
     56== Further steps
     57
     58* (immediate) employ the (already curated) organisations vocabulary for the normalization of the organisation facet,
     59  either implemented as a `PostProcessor` or by using [[https://trac.clarin.eu/wiki/Taskforces/Curation/ValueNormalization#SOLR-synonyms|Solr's internal mechanism]]
     60* the CLARIN-D VLO taskforce will deliver new definitions for the facets, approximately in December
     61* agree on "required" data categories (concepts)
     62* continue the analysis of the "problematic" data categories, set up a workflow for collaborative curation
     63  (involving ''CLAVAS'')
     64
     65
    2166== Documents ==
    2267