Changes between Version 1 and Version 2 of Taskforces/Meeting20141024

Timestamp:: 11/01/14 21:13:56 (10 years ago)
Author:: xnrn@gmx.net
Comment:: added notes on the discussion during the meeting

Taskforces/Meeting20141024

-                      v1
+                      v2
 * Setup work groups + next milestones
+== Discussion
+=== Data categories
+* There has to be clear information for metadata creators/providers on "required" data categories, i.e.
+  a set of data categories that are strongly recommended and have implications for how the records are displayed in the VLO. Still, it has to be taken into account that different resource types require different types of descriptions (and thus not all metadata can meaningfully fill the default required data categories). The information about required data categories should be reflected in the Best Practices Guide (recommending profiles that link to the given data categories).
+* Besides the currently used facets, there are a few candidates for new facets, e.g. `access` or `availability`. These would also fit well with the newly provided "wizard for licensing" developed by the legal committee.
+* Current data and feedback from users show that the semantics of the data categories are not always clear. The CLARIN-D VLO taskforce is currently working on clearer definitions for the facets and should have them ready by December.
+=== Facet values
+The inconsistencies in the facet values currently found in the VLO are a major issue and a main(?) source of frustration for the users. The curation taskforce should therefore direct its utmost attention to this issue.
+An overview of the current values in the individual facets has been examined and was also briefly presented during the meeting. The following (mostly already known) issues were identified:
+  * there are trivial problems like case differences (`'Text'` and `'text'`) that can be solved easily by case folding in the post-processing
+  * many values appear in multiple facets; even though this is partly justified (`'germany'` can be both `country` and `subject`), it is often the result of ambiguity in the facets' semantics, differing interpretations, etc. This confuses users, has an impact on recall and should be eliminated.
+  * most of the [[https://trac.clarin.eu/wiki/Taskforces/Curation/ValueNormalization#ResourceTypeResourceClassMediaTypeFormatModalityGenreSubjectKeywords|problematic facets]] (Resource Class / Format / Modality / Genre / Subject / Keyword) have a manageable number of distinct values (a few hundred), except for the `Subject` facet
+  * the `Subject` facet has over 58,000 distinct values, many of them singletons, so it is de facto a free-text field. However, there are many values (or partial strings in values) that would make good keywords in a more specialized facet. So it would be worthwhile to try to extract these keywords from `Subject` and use them in a more narrowly defined facet.
+  * Also, even though at first sight the `Subject` facet may seem like a complete mess, when drilling down along the other facets (e.g. looking into individual collections) `subject` suddenly starts to make sense, offering values that nicely partition the given subset. This confirms what one would expect, i.e. that individual collections are internally consistent in their classifications and the "mess" only arises because the individual classifications are local and not aligned with each other.[[BR]]
+   This implies that `subject` can still be useful as a facet on deeper levels of search (when the dataset is already restricted by other dimensions)
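The first two issues in the list above (case variants and values shared between facets) can be detected mechanically. The following is a minimal sketch in Python, not the VLO's actual post-processing: the function names and the sample data are hypothetical, and it assumes the facet values have already been exported as plain lists of strings per facet.

```python
from collections import Counter, defaultdict


def fold_case(values):
    """Group values that differ only in case; map each group to its most frequent spelling."""
    groups = defaultdict(Counter)
    for value in values:
        cleaned = value.strip()
        groups[cleaned.casefold()][cleaned] += 1
    # Pick the most common surface form per group (ties broken alphabetically).
    return {
        key: min(forms, key=lambda form: (-forms[form], form))
        for key, forms in groups.items()
    }


def values_in_multiple_facets(facets):
    """Return {case-folded value: sorted facet names} for values found in more than one facet."""
    seen = defaultdict(set)
    for facet, values in facets.items():
        for value in values:
            seen[value.strip().casefold()].add(facet)
    return {value: sorted(names) for value, names in seen.items() if len(names) > 1}


# Hypothetical sample data, not actual VLO contents.
sample = {
    "resourceType": ["Text", "text", "text", "Audio"],
    "country": ["Germany", "germany"],
    "subject": ["germany", "linguistics"],
}

canonical = fold_case(sample["resourceType"])
overlaps = values_in_multiple_facets(sample)
print(canonical)
print(overlaps)
```

Values reported by `values_in_multiple_facets` still need a human decision, since some overlaps (such as a country name that is also a subject) are legitimate.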
+== Workflow
+ * The issue of moving from the Data Category Registry ISOcat to CLAVAS has also been raised.
+   For the curation task force this implies that all curation work will be done in CLAVAS, which is basically a welcome setup (as OpenSKOS and the underlying data model SKOS are much simpler than ISOcat and the data categories model).
+ * The details of the actual value curation process still need to be worked out, but the basic approach is to:
+   1. extract existing values (per facet) from VLO
+   1. (possibly do some automatic merging)
+   1. convert and import them as concepts + labels (SKOS data model) into CLAVAS
+   1. collaboratively clean up the vocabularies (merge synonyms and spelling variants as `altLabel` of the same concept)
+   1. export the vocabularies, convert and apply them in the ingestion process of the VLO
+   1. enjoy the wonderful new clean facets in the VLO ;)
+After this has been done for the first time, there should be a method in place to extract previously unseen values from incoming new records, so that they can be integrated into the existing vocabularies.
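The last two steps of the workflow, together with the detection of unseen values, can be sketched as follows. This is only an illustration under stated assumptions: `Concept` is a hypothetical in-memory stand-in for a SKOS concept with one `prefLabel` and several `altLabel` values, the vocabulary content is made up, and no CLAVAS/OpenSKOS API is used or implied.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    """Hypothetical stand-in for a SKOS concept: one prefLabel, any number of altLabels."""
    pref_label: str
    alt_labels: set = field(default_factory=set)


def build_lookup(concepts):
    """Map every label (case-folded) to the prefLabel of its concept."""
    lookup = {}
    for concept in concepts:
        for label in {concept.pref_label, *concept.alt_labels}:
            lookup[label.casefold()] = concept.pref_label
    return lookup


def normalize(values, lookup):
    """Split incoming values into normalized values and previously unseen ones."""
    normalized, unseen = [], []
    for value in values:
        key = value.strip().casefold()
        if key in lookup:
            normalized.append(lookup[key])
        else:
            unseen.append(value)
    return normalized, unseen


# Hypothetical curated vocabulary for a single facet.
vocabulary = [
    Concept("audio recording", {"audio", "sound recording"}),
    Concept("text", {"written text"}),
]

lookup = build_lookup(vocabulary)
normalized, unseen = normalize(["Audio", "Written Text", "video"], lookup)
print(normalized)
print(unseen)
```

The `unseen` list is what would be handed back to the curators for integration into the vocabulary.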
+== Further steps
+* (immediate) employ the (already curated) organisations vocabulary for the normalization of the organisation facet.
+  This could be implemented either as a `PostProcessor` or by using [[https://trac.clarin.eu/wiki/Taskforces/Curation/ValueNormalization#SOLR-synonyms|Solr's internal mechanism]]
+* CLARIN-D VLO-Taskforce will deliver new definitions for the facets ~ in December
+* agree on "required" data categories (concepts)
+* continue the analysis of the "problematic" data categories, set up a workflow for collaborative curation
+  (involving ''CLAVAS'')
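For the Solr-based option in the first step above, one possible route is to generate a synonyms file from the curated vocabulary. The sketch below assumes Solr's explicit-mapping synonym syntax (`variant1, variant2 => canonical`, with commas inside a term escaped by a backslash). The organisation names are invented, and whether multi-word names are matched as intended depends on the analyzer configured for the field, which is not covered here.

```python
def escape(term):
    """Escape commas, which otherwise separate terms in a synonyms line."""
    return term.replace(",", "\\,")


def synonym_lines(vocabulary):
    """Yield one explicit-mapping line per canonical name that has variants."""
    for canonical, variants in sorted(vocabulary.items()):
        if variants:
            left = ", ".join(escape(variant) for variant in sorted(variants))
            yield f"{left} => {escape(canonical)}"


# Hypothetical organisation vocabulary: canonical name -> known variants.
organisations = {
    "Example Institute for Linguistics": {"EIL", "Example Inst. for Linguistics"},
    "Institute X, Department Y": {"IXDY"},
    "Example University": set(),
}

lines = list(synonym_lines(organisations))
print("\n".join(lines))
```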
 == Documents ==