= Curation of element/field values

[[PageOutline]]

This aspect of metadata curation revolves around the problem of high variability of the values in metadata elements with controlled (but not closed) vocabularies, like `organisation names`, `resource type`, `subject` and similar. We encounter a) spelling variants and b) synonyms. This has a serious impact on the discoverability of the resources, especially when using catalogues with faceted search like the VLO.

This issue has been discussed on multiple occasions ([[VLO Taskforce#Vocabularies]], [[VLO-Taskforce/Meeting-2014-10-15]], see especially the elaborate evaluation of the usability of facets in VLO by Jan Odijk in the attachment) and there are numerous attempts to solve it - see below.

== Existing pieces

=== Added support for external vocabularies in CMDI 1.2

In the new version of CMDI a feature was added that allows indicating/utilising external vocabularies as value domains for CMDI elements and attributes: [[CMDI 1.2/Vocabularies]]. The new feature is strongly rooted in the CLAVAS initiative.

=== CLAVAS

CLARIN Vocabulary Alignment Service - an instance of [[http://openskos.org|OpenSKOS]], a vocabulary repository and editor, run by the [[https://openskos.meertens.knaw.nl/clavas/|Meertens Institute]]. See more under CmdiClavasIntegration, [[Vocabularies#CLAVAS]].

Next to a comprehensive editor, it offers programmatic access to the data either via
* the [[https://openskos.meertens.knaw.nl/clavas/api|REST-Service]] or
* [[https://openskos.meertens.knaw.nl/clavas/oai-pmh?verb=Identify|OAI-PMH]] - sample requests: [[https://openskos.meertens.knaw.nl/clavas/oai-pmh?verb=ListSets|ListSets]], [[https://openskos.meertens.knaw.nl/clavas/oai-pmh?verb=GetRecord&metadataPrefix=oai_rdf&identifier=095fb37c-7d5f-7aef-18ef-9ecbc95eb65e|get single record]] (for an organisation, from the meertens:VLO-orgs set)
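For illustration, a minimal Python sketch of a REST lookup against CLAVAS - a sketch under assumptions: `find-concepts` is the operation linked above, but the Solr-style JSON response layout and its field names are assumed, not guaranteed for this instance:

{{{
#!python
# Query the CLAVAS OpenSKOS REST API for concepts matching a label.
# Assumption: the instance answers /find-concepts with Solr-style JSON.
import requests

CLAVAS_API = "https://openskos.meertens.knaw.nl/clavas/api"

def find_concepts(query):
    """Look up SKOS concepts whose labels match `query`."""
    resp = requests.get(CLAVAS_API + "/find-concepts",
                        params={"q": query, "format": "json"})
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    # Print URI and preferred label of each hit (field names assumed).
    for doc in find_concepts("Meertens").get("response", {}).get("docs", []):
        print(doc.get("uri"), doc.get("prefLabel"))
}}}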
=== VLO

VLO already has some pre- and post-processing code: [[BR]]
https://svn.clarin.eu/vlo/trunk/vlo_preprocessor/ [[BR]]
https://trac.clarin.eu/browser/vlo/trunk/vlo-importer/src/main/java/eu/clarin/cmdi/vlo/importer

pertaining to:
* Continent
* Country
* !NationalProject
* Organisation
* Year?

In the preprocessing only `Organisation` seems to be handled (and this rather rudimentarily - only two organisations are listed): https://svn.clarin.eu/vlo/trunk/vlo_preprocessor/OrganisationControlledVocabulary.xml

{{{
#!xml
<!-- element names here are illustrative; each canonical name groups its spelling variants -->
<Organisation name="MPI Nijmegen">
    <Variant>Max Planck Institut fuer Psycholinguistik, Nijmegen, Nl.</Variant>
    ...
    <Variant>Max Planck Institute for Psycholinguisticsc</Variant>
    <Variant>Max-Planck-Institüt für Psycholinguïstik</Variant>
</Organisation>
<Organisation name="Abidjan : L'Universite d'Abidjan">
    ...
    <Variant>Abidjan, Ivory Coast : Université Abidjan</Variant>
</Organisation>
}}}

There seems to be some configuration going on in the [[https://trac.clarin.eu/browser/vlo/trunk/vlo-commons/src/main/resources/VloConfig.xml|VloConfig.xml]] file, among it these values:

{{{
#!xml
http://catalog.clarin.eu/ds/ComponentRegistry/rest/registry/components/clarin.eu:cr1:c_1271859438104/xml
http://infra.clarin.eu/service/language/info.php?code=
...
}}}

A few notes on data processing in VLO:
* What is the difference between pre- and postprocessing? (When does post-processing happen?)[[BR]] Preprocessing is done before the ingestion step into VLO, i.e. in preprocessing individual MD records come in and individual (but "better") MD records come out (that are then ingested).[[BR]] In [[https://trac.clarin.eu/browser/vlo/trunk/vlo-importer/src/main/java/eu/clarin/cmdi/vlo/importer/PostProcessor.java|postprocessing]], "such a postprocessor is called on a single facet value after this facet value is inserted into the solr document during import."
* What is the procedure to extend the pre-/post-processing? Can this be made more dynamic? (hook in XSLs in the configuration, no re-build)[[BR]] In its current architecture VLO was not designed to easily inject code (like XSL stylesheets) via configuration. Basically one has to implement another postprocessing class (implementing the [[https://trac.clarin.eu/browser/vlo/trunk/vlo-importer/src/main/java/eu/clarin/cmdi/vlo/importer/PostProcessor.java|PostProcessor]] interface) and hook it into the [[https://trac.clarin.eu/browser/vlo/trunk/vlo-importer/src/main/java/eu/clarin/cmdi/vlo/importer/MetadataImporter.java|MetadataImporter]]. So it is not that hard, but it requires rebuilding the code. (A sketch of such a per-facet normalizer follows after this list.)
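The actual post-processors are Java classes; purely as an illustration of the idea (not VLO's actual code), a minimal Python sketch of a per-facet value normalizer driven by a mapping table:

{{{
#!python
# Illustration only: a per-facet normalizer in the spirit of VLO's
# PostProcessor classes, which receive a single facet value and return
# the normalized replacement. The mapping entries are made-up examples.
ORGANISATION_MAP = {
    # variant (lower-cased)                     -> canonical form
    "max planck institut fuer psycholinguistik":   "MPI Nijmegen",
    "max planck institute for psycholinguistics":  "MPI Nijmegen",
}

def postprocess_organisation(value):
    """Return the normalized facet value; fall back to the original."""
    return ORGANISATION_MAP.get(value.strip().lower(), value)

assert postprocess_organisation("Max Planck Institute for Psycholinguistics") == "MPI Nijmegen"
}}}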
=== SOLR - synonyms

Apache Solr, the indexer underlying the VLO, has some native provisions for handling synonymous values: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory

Sample configuration in the `schema.xml` file:

{{{
#!xml
<!-- illustrative sample: a field type that reduces synonyms at index time -->
<fieldType name="text_facet" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
                ignoreCase="true" expand="false"
                tokenizerFactory="solr.KeywordTokenizerFactory"/>
    </analyzer>
</fieldType>
}}}

It is important to pass the `@tokenizerFactory` also to the `SynonymFilter`, otherwise multi-word terms are not handled correctly (or at all). Also, obviously, the analysis has to happen at index time, not query time.

Note also the `@expand` attribute, which allows either reducing all synonymous terms to one, or expanding them (searching for any one of the synonymous terms then matches all of them). For facets we want to reduce the terms, but for search we could do the expansion.

The input is a file with a trivial syntax:

{{{
# either a list of synonyms, reduced to the first value:
text, written, texts, text corpus, written corpus
# or an explicit one-way mapping:
spoken, Spoken Language Corpora => speech
}}}

=== DASISH - jmd-scripts

Within the [[http://dasish.eu|DASISH]] project one task was to establish a joint metadata domain for CLARIN, DARIAH and CESSDA. The catalogue is available under http://ckan.dasish.eu/ckan/. The setup is very similar to that of CLARIN (harvester -> mapping -> normalization -> ingest into indexer). For this processing pipeline a set of scripts is maintained on GitHub: [[https://github.com/DASISH/jmd-scripts/tree/master/workflow-scripts|DASISH/jmd-scripts]]

One step in this workflow is `harmonization`, with `md-harmonizer.py` doing the work, based again on a simple input file:

{{{
#GroupName,,datasetName,,facetName,,old_value,,new_value,,action
*,,*,,Language,,*,,*,,replace
*,,*,,Country,,*,,*,,replace
*,,*,,CreationDate,,*,,UTC,,changeDateFormat
*,,*,,Subject,,[Paris],,Paris,,replace
*,,*,,Subject,,AARDEWERK,,Pottery,,replace
}}}

Here too, the setup seems to be at a rather preliminary stage; the input is just a small sample.

''Note:'' One step in the workflow is obviously the schema-level mapping - here the [[https://github.com/TheLanguageArchive/md-mapper|TLA/md-mapper]] is directly reused with [[https://github.com/DASISH/md-mapping|DASISH-specific configuration]]. However, as this pertains to the concept mapping on the schema level, it is only of limited interest here.

=== CMD2RDF

In an initiative to provide the CMD records also as RDF (see [[CMD2RDF]]), one (crucial) task is to transform string values to entities (cf. Ďurčo & Windhouwer, 2014; cf. Ďurčo, 2013, chapter 6.2 Mapping Field Values to Semantic Entities, p. 80). A first attempt for organisation names: [[https://github.com/ekoi/cmd2rdf/blob/dev-refactoring/src/main/resources/xsl/addOrganisationEntity.xsl|addOrganisationEntity.xsl]], configured via [[https://github.com/ekoi/cmd2rdf/blob/dev-refactoring/src/main/resources/cmd2rdf-jobs.xml|cmd2rdf-jobs.xml]].

Basically it is a lookup in a SKOS dictionary injected as a parameter, matching on all label attributes (`@altLabel`, `@hiddenLabel`, `@prefLabel`). The IRI of the matching entity (`@rdf:about`) is used to construct a parallel object property (next to the data property encoding the original literal value). For the discussed normalization of values in the CMD elements we would use `@skos:prefLabel`, but should also encode the entity URI in `@ValueConceptLink`.
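The same label-to-IRI lookup, sketched in Python with rdflib (the linked XSLT is the actual implementation; the file name `organisations.rdf` is a placeholder):

{{{
#!python
# Sketch of the lookup done by addOrganisationEntity.xsl: index every SKOS
# label (pref/alt/hidden) of a vocabulary by its concept IRI.
from rdflib import Graph
from rdflib.namespace import SKOS

def build_label_index(skos_file):
    g = Graph()
    g.parse(skos_file)  # a SKOS/RDF dump of the organisations vocabulary
    index = {}
    for prop in (SKOS.prefLabel, SKOS.altLabel, SKOS.hiddenLabel):
        for concept, label in g.subject_objects(prop):
            index[str(label).strip().lower()] = str(concept)
    return index

# index = build_label_index("organisations.rdf")  # placeholder file name
# index.get("max planck institute for psycholinguistics")
# -> IRI of the matching skos:Concept, usable for @ValueConceptLink
}}}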
=== DARIAH-DE

A lot of work on reference data is going on in DARIAH-DE: [[https://dev2.dariah.eu/wiki/download/attachments/14651583/M%204.3.1%20Normdatens%C3%A4tze.pdf?version=1&modificationDate=1409228442038&api=v2|M 4.3.1 Normdatensätze]] - a deliverable (in German) detailing the activities.

Two highlights:
* an SRU endpoint to access the GND (Gemeinsame Normdatei - the Integrated Authority File of the German National Library)
* the [[http://143.93.114.137|Labeling System]] - a system that allows creating project-specific but still interoperable vocabularies. The interoperability is ensured by linking the vocabularies to reference resources. (Still have to see how this works in practice, but it sounds promising.)

= Proposed common/harmonized approach

There are (at least) three aspects to this (that would make for good sub-tasks):
1. curation of controlled vocabularies
2. application of the vocabularies on the harvested data (normalization step between harvest and ingest)
3. use of the vocabularies for metadata authoring

=== Curation of vocabularies

For selected categories, vocabularies are collaboratively maintained in a common vocabulary repository. A possible workflow has already been tested within the [[CmdiClavasIntegration|CLAVAS project]] (Vocabulary Service for CLARIN) for the organisation names: the list of organisation names taken from VLO has been manually (pre-)normalized and imported into the vocabulary repository ([[https://openskos.meertens.knaw.nl/|OpenSKOS]]), so that now individual organisations (and departments) are modelled as `skos:Concept` and their spelling variants as `skos:altLabel`. The initial import is just a bootstrapping step to seed the vocabulary. Further work (further normalization, adding new entities) happens in the vocabulary repository/editor, which allows collaborative work, provides a REST API, etc.

The CLAVAS vocabulary repository also already provides the list of language codes (`ISO-639-3`) and the closed data categories (with their values) from ''ISOcat'', and it is open to support further vocabularies. (See also the [[http://www.clarin.eu/node/3780|meeting on OpenSKOS and controlled vocabularies]].)

=== Normalization step in the processing pipeline (exploitation tools - VLO)

There are some simple things we can do:
* fold case! (`text => Text`)
* remove/reduce punctuation?
* better mappings?[[BR]] Part of the "mess" in the facets may stem from lumping too disparate fields into one index/facet. VLO allows blacklisting explicit XPaths: [[https://trac.clarin.eu/browser/vlo/trunk/vlo-commons/src/main/resources/facetConcepts.xml|facetConcepts.xml]]
* support multiselect on facets[[BR]] not in the internal processing, but something we could do on the side of the interface for a better user experience

But the main focus is on applying curated controlled vocabularies on the (harvested) data. We have two options (a sketch of the preprocessing variant follows after the remarks below):
* '''preprocessing'''[[BR]] some XSLT run on every CMD record[[BR]] output: a new version of the CMD record with the value normalized based on `@skos:prefLabel` + with the URI of the entity from `@rdf:about` in `@ValueConceptLink` (don't throw away the URI identifier once you have it)
* '''postprocessing'''[[BR]] specific normalization code for the values of individual facets/solr indexes:
> called on a single facet value after this facet value is inserted into the solr document during import.

A few remarks on the normalization task:
* ensure the normalization step is transparent for the user![[BR]] The user needs to see (on demand, in the record-detail view) the original value, so that she can inspect the normalization result. (This requires duplicating the fields: `resourceType -> resourceType_normalized`.)
* conveying also the URI provides a basis for i18n of the facet values via a multilingual SKOS-based vocabulary (in some distant future)
* the vocabularies maintained in the vocabulary repository could be queried via the REST service of OpenSKOS ([[https://openskos.meertens.knaw.nl/clavas/api#find-concepts|find concepts]]); however, in practice whole vocabularies will rather be fetched (via OAI-PMH) and cached on the side of the client application (VLO) for performance reasons
* the client application will also need to convert the vocabularies to its local format
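A minimal sketch of the preprocessing variant - the proposal above calls for an XSLT; it is shown here in Python/lxml only for brevity. The element name `organisationName`, the lookup table and the URI are illustrative assumptions:

{{{
#!python
# Rewrite one CMD record: replace known value variants with the skos:prefLabel
# and record the entity URI in the ValueConceptLink attribute.
from lxml import etree

NORMALIZATION = {
    # variant (lower-cased) -> (skos:prefLabel, concept URI); made-up entries
    "max planck institut fuer psycholinguistik":
        ("MPI Nijmegen", "http://example.org/org/mpi-nijmegen"),
}

def normalize_record(path_in, path_out):
    tree = etree.parse(path_in)
    for el in tree.iter("{*}organisationName"):   # illustrative element name
        hit = NORMALIZATION.get((el.text or "").strip().lower())
        if hit:
            el.text = hit[0]                       # normalized value
            el.set("ValueConceptLink", hit[1])     # keep the entity URI
    tree.write(path_out, xml_declaration=True, encoding="utf-8")
}}}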
=== Metadata authoring

Authoring tools use the vocabulary repository as a source for the suggest functionality (autocomplete) offered to the user. This is now catered for in CMDI 1.2, which allows associating an element with an external vocabulary. (AFAIK MPI's MD editor Arbil is planned to support the OpenSKOS vocabularies.)

The editor should write both the preferred label and the URI of the given entity into the metadata element (into the content of the element and the `@ValueConceptLink` attribute, respectively). Records created in this manner would (ideally) not need any extra processing on the exploitation side.

== Categories/Facets

VLO currently provides the following facets (configured in [[https://trac.clarin.eu/browser/vlo/trunk/vlo-commons/src/main/resources/VloConfig.xml|VloConfig.xml]]):

{{{
language
collection
resourceClass
continent
country
modality
genre
subject
format
organisation
nationalProject
keywords
dataProvider
}}}

=== Organisation names

As stated above, a lot of work has already been done on organisation names. The data is already ingested in the CLAVAS vocabulary repository. Under the collection [[https://openskos.meertens.knaw.nl/clavas/api/collections/meertens:VLO-orgs.html|OpenSKOS-CLARIN Organizations Vocabulary]] there are two distinct concept schemes:
* [[https://openskos.meertens.knaw.nl/clavas/api/concept/5b92e7cd-cab2-7594-5a59-b3bad3b7c06f.html|Organisations]] |2098 entries|
* [[https://openskos.meertens.knaw.nl/clavas/api/concept/fedf0e07-db3a-af19-4eed-ec70d04a6f70.html|CurateOrganisations]] |49 entries| - still needs to be looked at

=== Language

Language seems basically solved. The 2- and 3-letter codes are resolved to the language name, based on the corresponding components ([http://catalog.clarin.eu/ds/ComponentRegistry/rest/registry/components/clarin.eu:cr1:c_1271859438109/xml language2LetterCode], [http://catalog.clarin.eu/ds/ComponentRegistry/rest/registry/components/clarin.eu:cr1:c_1271859438110/xml language3LetterCode]) - except for the case of [http://catalog.clarin.eu/vlo/search?4&fq=language:nl_NL nl_NL] #679.

=== Resource Type / Resource Class / Media Type / Format / Modality / Genre / Subject / Keywords

There has been intensive discussion on ResourceType recently. See the separate page [Taskforces/Curation/ValueNormalization/ResourceType ValueNormalization/ResourceType] for the details. The information below will be included there as well.

This is the most problematic area (Jan Odijk: "complete mess"). Even though the underlying data categories have different semantics, many values appear in multiple facets, suggesting that the distinction is not at all clear and the semantic borders are fuzzy. (Or that the current mappings are too "optimistic".)

An overview (incomplete) of the discussed facets and/or related data categories follows. ''Note:'' We need to keep in mind that every facet is already a conglomeration of multiple data categories (+ custom XPaths, + blacklisted XPaths). In this overview, not all data categories for a given facet are listed.[[BR]] (TODO: In the current curation process it may be worthwhile to dig down and inspect the contribution of individual data categories (and XPaths) to a given facet.)

||= facet =||= data categories =||= number of values in the definition =||= ~ number of distinct values in the facet (2014-10-20) =||
|| resourceType || [http://www.isocat.org/datcat/DC-3806 isocat:DC-3806] (resourceClass), [http://purl.org/dc/terms/type dcterms:type] || 13 / 12 || 372 ||
|| format || [http://www.isocat.org/datcat/DC-2570 isocat:DC-2570] (mediaType) || 8 || 116 ||
|| modality || [http://www.isocat.org/datcat/DC-2490 isocat:DC-2490] || open || 149 ||
|| genre || [http://www.isocat.org/datcat/DC-2470 isocat:DC-2470], [http://www.isocat.org/datcat/DC-3899 isocat:DC-3899] || 15 || 1307 ||
|| subject || [http://purl.org/dc/terms/subject dcterms:subject] || open || 58310 ||
|| keywords || [http://www.isocat.org/datcat/DC-5436 isocat:DC-5436] (metadata tag) || open || 267 ||
||||||||= ''not in VLO as facet, but potentially related:'' =||
||  || [[http://purl.org/dc/dcmitype/|DCMI Type Vocabulary]] || 12 ||  ||
||  || applicationType [[http://www.isocat.org/datcat/DC-3786|isocat:DC-3786]] || 5 ||  ||
||  || corpusType [[http://www.isocat.org/datcat/DC-3822|isocat:DC-3822]] || 11 ||  ||
||  || dataType [[http://www.isocat.org/datcat/DC-4181|isocat:DC-4181]] || open ||  ||
||  || datatype [[http://www.isocat.org/datcat/DC-1662|isocat:DC-1662]] || open ||  ||
||  || [#tadirah TaDiRAH] [[https://github.com/dhtaxonomy/TaDiRAH/blob/master/reference/objects.md|TaDiRAH Research Objects]] || 36 ||  ||

(Theoretically also [[http://www.isocat.org/datcat/DC-4418|research approach]] and [[http://www.isocat.org/datcat/DC-4422|research design]].)

Of these, `Format` seems the least problematic, consisting mostly of MIME types. `Subject` seems the most problematic (with 58310+ distinct values), but many values are codes from established hierarchical classification systems like DDC, where one could map to higher-level classes to reduce the variety. Still, this is a real challenge (or mess ;).
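To illustrate the roll-up idea, a small sketch mapping DDC notations to the ten DDC main classes (the class labels are standard DDC; treating every value with a leading digit as a DDC notation is a simplifying assumption):

{{{
#!python
# Reduce DDC notations (e.g. "811.54") to their ten main classes, collapsing
# thousands of distinct subject codes into a handful of facet values.
DDC_MAIN_CLASSES = {
    "0": "Computer science, information & general works",
    "1": "Philosophy & psychology",
    "2": "Religion",
    "3": "Social sciences",
    "4": "Language",
    "5": "Science",
    "6": "Technology",
    "7": "Arts & recreation",
    "8": "Literature",
    "9": "History & geography",
}

def ddc_main_class(notation):
    """Map a DDC notation to its main class; None for non-DDC values."""
    notation = notation.strip()
    if notation and notation[0].isdigit():
        return DDC_MAIN_CLASSES[notation[0]]
    return None

assert ddc_main_class("811.54") == "Literature"
}}}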
We should also take into account that the CMD profile usually already encodes a resource type (collection, !TextCorpus, !BilingualDictionary, etc.).

There is also an elaborate [[#D5C-2|LRT-Taxonomy]] (Carlson et al., 2010) made by us(!) - I mean CLARIN people - as one of the first deliverables of CLARIN, back in 2010. In search for a starting point, it may be worth having a look at that. An initiative outside CLARIN that could be relevant is the [http://vocabularies.coar-repositories.org/documentation/resource_types/ COAR resource type vocabulary].

An ad-hoc attempt at mapping values between the three related(?) data categories:

||= resourceClass =||= DCMIType =||= mediaType =||
|| corpus || Text || text, document ||
|| resource bundle || Collection, Dataset ||  ||
|| tool || !InteractiveResource, (Software) ||  ||
||  || !MovingImage || video ||
||  || Sound || audio ||
||  || Image, !StillImage || image, drawing ||
|| unknown ||  || unknown ||
|| experimental data ||  ||  ||
|| fieldwork material ||  ||  ||
|| grammar ||  ||  ||
|| lexicon ||  ||  ||
|| other ||  ||  ||
|| survey data ||  ||  ||
|| teaching material ||  ||  ||
|| test data ||  ||  ||
|| tool chain ||  ||  ||
||  || Event ||  ||
||  || Service ||  ||
||  || !PhysicalObject ||  ||
||  ||  || unspecified ||

Even this most simple exercise shows that the definitions are far from complete/comprehensive enough to be usable as a basis for normalisation. The reality check (comparison with the actual values in the data) comes out even worse.

One (admittedly radical) approach would be not to try to pick apart the chaos present in the data, but rather to accept it: throw away the data categories and lump all these fields together into a "tag" field (or another facet with little semantics). We would still normalize the values, but instead of providing many facets we would basically work with a bag of words, offering the user a tag cloud, content-first search style.

=== Cases

Collecting problematic/questionable cases here:
* [[http://catalog.clarin.eu/vlo/record?fq=resourceClass:Data+Provider&docId=11022_47_0000-0000-249E-6_64_type_61_dataprovider_38_id_61_89|ResourceType:"Data Provider"]] #678
* [http://catalog.clarin.eu/vlo/search?4&fq=language:nl_NL nl_NL] #679

== References

Odijk, J. (2014). Discovering Resources in CLARIN: Problems and Suggestions for Solutions. (find attached)

Ďurčo, M., & Windhouwer, M. (2014). From CLARIN Component Metadata to Linked Open Data. In LDL 2014, LREC Workshop, Reykjavik, Iceland, p. 23. [[http://www.lrec-conf.org/proceedings/lrec2014/workshops/LREC2014Workshop-LDL2014%20Proceedings.pdf|LDL2014 proceedings.pdf]]

Ďurčo, M. (2013). SMC4LRT - Semantic Mapping Component for Language Resources and Technology. Technical University, Vienna. Retrieved from http://permalink.obvsg.at/AC11178534

[=#D5C-2] Carlson, R., Caselli, T., Elenius, K., Gaiffe, B., House, D., Hinrichs, E., Quochi, V., Simov, K., & Vogel, I. (2010). Language Resources and Tools Survey and Taxonomy and Criteria for the Quality Assessment. CLARIN deliverable D5C-2. http://www-sk.let.uu.nl/u/D5C-2.pdf

[=#tadirah] TaDiRAH: Taxonomy of Digital Research Activities in the Humanities (https://github.com/dhtaxonomy/TaDiRAH/), developed within DARIAH and divided into three main aspects: Activities, Objects, Techniques.