wiki:Taskforces/Curation/ValueNormalization

Curation of element/field values.

This aspect of metadata curation revolves around the problem of high variability of the values in metadata elements with controlled (but not closed) vocabularies, like organisation names, resource type, subject and similar. We encounter a) spelling variants and b) synonyms. This has a serious impact on the discoverability of the resources, especially when using catalogues with faceted search like the VLO.

This issue has been discussed on multiple occasions (VLO Taskforce#Vocabularies?, VLO-Taskforce/Meeting-2014-10-15; see especially the elaborate evaluation of the usability of facets in the VLO by Jan Odijk in the attachment), and there have been numerous attempts to solve it - see below.

Existing pieces

Added support for external vocabularies in CMDI 1.2

In the new version of CMDI a new feature was added that allows indicating/utilising external vocabularies as value domains for CMDI elements and CMDI attributes.

CMDI 1.2/Vocabularies

The new feature is strongly rooted in the CLAVAS initiative.

CLAVAS

CLARIN Vocabulary Alignment Service - an instance of OpenSKOS, a vocabulary repository and editor, run by the Meertens Institute.

See more under CmdiClavasIntegration, Vocabularies#CLAVAS?

Next to a comprehensive editor, it offers programmatic access to the data either via a RESTful API or via OAI-PMH.
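
For illustration, a minimal sketch of querying the REST interface from Python; the base URL and the exact query parameters are assumptions and may differ for the actual CLAVAS instance (only the find-concepts service name is taken from the notes further below):

import requests

# Look up candidate concepts by label via the OpenSKOS "find-concepts" service.
# Base URL and query syntax are illustrative assumptions, not the real endpoint.
CLAVAS_API = "https://openskos.example.org/api"  # hypothetical base URL

def find_concepts(label):
    """Return candidate concepts whose labels match the given string."""
    response = requests.get(
        CLAVAS_API + "/find-concepts",
        params={"q": 'prefLabel:"%s" OR altLabel:"%s"' % (label, label),
                "format": "json"},
        timeout=10)
    response.raise_for_status()
    return response.json()

# e.g. find_concepts("MPI Nijmegen")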

VLO

VLO already has some pre- and post-processing code:
https://svn.clarin.eu/vlo/trunk/vlo_preprocessor/
https://trac.clarin.eu/browser/vlo/trunk/vlo-importer/src/main/java/eu/clarin/cmdi/vlo/importer

pertaining to:

  • Continent
  • Country
  • NationalProject
  • Organisation
  • Year?

In the preprocessing only Organisation seems to be handled (and rather rudimentarily - only two organisations are listed):

https://svn.clarin.eu/vlo/trunk/vlo_preprocessor/OrganisationControlledVocabulary.xml

<Organisations>
    <Organisation name="Max Planck Institute for Psycholinguistics">
        <Variation>MPI Nijmegen</Variation>
        <Variation>Max Planck Institut fuer Psycholinguistik, Nijmegen, Nl.</Variation>
         ...
        <Variation>Max Planck Institute for Psycholinguisticsc</Variation>
        <Variation>Max-Planck-Institüt für Psycholinguïstik</Variation>
    </Organisation>
    <Organisation name="University of Abidjan">
        <Variation>Abidjan : L'Universite d'Abidjan</Variation>
         ...
        <Variation>Abidjan, Ivory Coast : Université Abidjan</Variation>
    </Organisation>
</Organisations>
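
As an illustration, a minimal sketch of how such a mapping file could be turned into a variant-to-canonical-name lookup (the file name is taken from above; this is not the actual VLO preprocessor code):

import xml.etree.ElementTree as ET

# Build a lookup from spelling variants to the canonical organisation name,
# based on the XML structure shown above. Illustrative sketch only.
def load_organisation_map(path="OrganisationControlledVocabulary.xml"):
    lookup = {}
    for org in ET.parse(path).getroot().findall("Organisation"):
        name = org.get("name")
        lookup[name.lower()] = name
        for variation in org.findall("Variation"):
            if variation.text:
                lookup[variation.text.strip().lower()] = name
    return lookup

# org_map = load_organisation_map()
# org_map.get("mpi nijmegen")   # -> "Max Planck Institute for Psycholinguistics"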

There seems to be some configuration going on in the VloConfig.xml file:

    <countryComponentUrl>http://catalog.clarin.eu/ds/ComponentRegistry/rest/registry/components/clarin.eu:cr1:c_1271859438104/xml</countryComponentUrl>
    <languageLinkPrefix>http://infra.clarin.eu/service/language/info.php?code=</languageLinkPrefix>
   ...

A few notes on data processing in VLO:

  • What is the difference between pre- and postprocessing? (When does post-processing happen?)
    Preprocessing is done before the ingestion step into the VLO, i.e. individual md records come in and individual (but "better") md records come out (that are then ingested).
    In Postprocessing "such a postprocessor is called on a single facet value after this facet value is inserted into the solr document during import."
  • What is the procedure to extend the pre-/post-processing? Can this be made more dynamic? (hook in XSLs in the configuration, no re-build)
    In its current architecture, the VLO was not designed to easily inject code (like XSL stylesheets) via configuration. Basically, one has to implement another postprocessing class (implementing the PostProcessor interface) and hook it into the MetadataImporter. So it's not that hard, but it requires rebuilding the code.

SOLR - synonyms

Apache Solr, the indexer underlying the VLO, has some native provisions for handling synonymous values.

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory

Sample configuration in the schema.xml file:

 <fieldType name="text_normalized" class="solr.TextField" positionIncrementGap="100">
     <analyzer type="index">
         <tokenizer class="solr.KeywordTokenizerFactory"/>
         <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"
                 tokenizerFactory="solr.KeywordTokenizerFactory" />
         <!-- <filter class="solr.LowerCaseFilterFactory"/> -->
     </analyzer>
 </fieldType>
...

  <field name="resourceClass_normalized" type="text_normalized" indexed="true" stored="true" multiValued="true" />
...
  <copyField source="resourceClass" dest="resourceClass_normalized" />

It is important to pass the @tokenizerFactory to the SynonymFilter as well, otherwise multi-word terms are not handled correctly (or at all). Also, the analysis obviously has to happen at index time, not query time. Note also the @expand attribute, which allows either reducing all synonymous terms to one, or expanding them (a search for any one of the synonymous terms then matches all of them). For facets we want to reduce the terms, but for search we could use expansion.

The input is a file with a trivial syntax:

# either list of synonyms, reduced to the first value:
text, written, texts, text corpus, written corpus
# or explicit one-way mapping:
spoken, Spoken Language Corpora => speech
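
Such a synonyms file could also be generated from a curated vocabulary instead of being maintained by hand. A minimal sketch (the mapping and the output file name are only illustrative):

# Generate a synonyms file in the explicit one-way mapping form shown above
# from a preferred-term -> variants dictionary. Illustrative example data.
synonyms = {
    "text": ["written", "texts", "text corpus", "written corpus"],
    "speech": ["spoken", "Spoken Language Corpora"],
}

with open("synonyms.txt", "w", encoding="utf-8") as out:
    for preferred, variants in synonyms.items():
        out.write(", ".join(variants) + " => " + preferred + "\n")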

DASISH - jmd-scripts

Within the DASISH project, one task was to establish a joint metadata domain for CLARIN, DARIAH and CESSDA. The catalogue is available at http://ckan.dasish.eu/ckan/ .

The setup is very similar to that of CLARIN (harvester -> mapping -> normalization -> ingest into indexer). For this processing pipeline a set of scripts is maintained on GitHub: DASISH/jmd-scripts

One step in this workflow is harmonization, with md-harmonizer.py doing the work, again based on a simple input file:

#GroupName,,datasetName,,facetName,,old_value,,new_value,,action
*,,*,,Language,,*,,*,,replace
*,,*,,Country,,*,,*,,replace
*,,*,,CreationDate,,*,,UTC,,changeDateFormat
*,,*,,Subject,,[Paris],,Paris,,replace
*,,*,,Subject,,AARDEWERK,,Pottery,,replace

Here too, the setup seems to be at a preliminary stage; the input is just a small sample.
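
As an illustration of what such a rule does, a minimal sketch of applying "replace" rules to a harvested record; the record structure and function name are hypothetical, this is not the md-harmonizer.py code, and wildcard old_value rules are not covered:

# Apply "replace" rules of the form shown above to a record represented as a
# dict of facet -> list of values. Hypothetical sketch only.
def apply_replace_rules(record, rules):
    for group, dataset, facet, old_value, new_value, action in rules:
        if action != "replace" or facet not in record:
            continue
        record[facet] = [new_value if value == old_value else value
                         for value in record[facet]]
    return record

rules = [
    ("*", "*", "Subject", "[Paris]", "Paris", "replace"),
    ("*", "*", "Subject", "AARDEWERK", "Pottery", "replace"),
]
# apply_replace_rules({"Subject": ["AARDEWERK", "[Paris]"]}, rules)
# -> {"Subject": ["Pottery", "Paris"]}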

Note: One step in the workflow is obviously the schema-level mapping - here the TLA/md-mapper is directly reused with a DASISH-specific configuration. However, as this pertains to concept mapping at the schema level, it is of only limited interest here.

CMD2RDF

In an initiative to provide the CMD records also as RDF (see CMD2RDF), one (crucial) task is to transform string values to entities (cf. Ďurčo, Windhouwer, 2014; cf. Ďurčo, 2013, chapter 6.2 Mapping Field Values to Semantic Entities, pp. 80).

A first attempt for organisation names: addOrganisationEntity.xsl. Configuration via cmd2rdf-jobs.xml

Basically, it is a lookup in a SKOS dictionary (injected as a parameter), matching on all label properties (@altLabel, @hiddenLabel, @prefLabel). The IRI of the matching entity (@rdf:about) is used to construct a parallel object property (next to the data property encoding the original literal value).

For the normalization of values in CMD elements discussed here, we would use @skos:prefLabel, but should also encode the entity URI in @ValueConceptLink.

DARIAH-DE

A lot of work on reference data is going on in DARIAH-DE.

M 4.3.1 Normdatensätze - a deliverable (in German) detailing the activities

Two highlights:

  • a SRU endpoint to access the GND (Gemeinsame Normdatei - the Integrated Authority File of the German National Library)
  • the Labeling System - a system that allows creating project-specific but still interoperable vocabularies. The interoperability is ensured by linking the vocabularies to reference resources. (Still have to see how this works in practice, but it sounds promising.)

Proposed common/harmonized approach

There are (at least) three aspects to this (that would make for good sub-tasks):

  1. curation of controlled vocabularies
  2. application of the vocabularies on the harvested data (normalization step between harvest and ingest)
  3. use of the vocabularies for metadata authoring

Curation of vocabularies

For selected categories vocabularies are collaboratively maintained in a common vocabulary repository.

A possible workflow has already been tested within the CLAVAS project (Vocabulary Service for CLARIN) for organisation names: the list of organisation names taken from the VLO has been manually (pre-)normalized and imported into the vocabulary repository (OpenSKOS), so that individual organisations (and departments) are now modelled as skos:Concept and their spelling variants as skos:altLabel. The initial import is just a bootstrapping step, to seed the vocabulary. Further work (further normalization, adding new entities) happens in the Vocabulary Repository/Editor?, which allows collaborative work, provides a REST API, etc.

The CLAVAS vocabulary repository also already provides the list of language codes (ISO 639-3) and the closed data categories (with their values) from ISOcat, and it is open to supporting further vocabularies (see also the meeting on OpenSKOS and controlled vocabularies).

Normalization step in the processing pipeline (exploitation tools - VLO)

There are some simple things we can do:

  • fold case! (text => Text)
  • remove/reduce punctuation?
  • better mappings?
    Part of the "mess" in the facets may stem from lumping too disparate fields into one index/facet. The VLO allows blacklisting explicit XPaths: facetConcepts.xml
  • support multiselect on facets
    not in the internal processing, but something we could do on the side of the interface for better user experience

But the main focus is on applying curated controlled vocabularies to the (harvested) data. We have two options:

  • preprocessing
    some XSLT run on every CMD record
    output: a new version of the CMD record, with values normalized based on @skos:prefLabel and with the URI of the entity from @rdf:about in @ValueConceptLink (don't throw away the URI identifier once you have it); see the sketch after this list
  • postprocessing
    specific normalization code for the values of an individual facet/Solr index:

    called on a single facet value after this facet value is inserted into the solr document during import.
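
The core of both options is the same lookup against a curated SKOS vocabulary. A minimal sketch of that lookup (in Python rather than XSLT or Java, for brevity; the file name, the RDF/XML serialization and the exact matching strategy are assumptions):

import xml.etree.ElementTree as ET

RDF = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}"
SKOS = "{http://www.w3.org/2004/02/skos/core#}"

# Build a label -> (prefLabel, concept URI) lookup from a SKOS RDF/XML dump
# and use it to normalize a single value. File name and matching strategy
# (exact, case-insensitive) are assumptions.
def load_skos_lookup(path="clavas-organisations.rdf"):
    lookup = {}
    for concept in ET.parse(path).getroot().iter(SKOS + "Concept"):
        uri = concept.get(RDF + "about")
        pref = concept.findtext(SKOS + "prefLabel")
        for tag in ("prefLabel", "altLabel", "hiddenLabel"):
            for label in concept.findall(SKOS + tag):
                if label.text:
                    lookup[label.text.strip().lower()] = (pref, uri)
    return lookup

def normalize(value, lookup):
    """Return (normalized value, concept URI); unmatched values pass through."""
    return lookup.get(value.strip().lower(), (value, None))

# lookup = load_skos_lookup()
# normalize("MPI Nijmegen", lookup)

The normalized value (the prefLabel) would replace the element/facet content, while the concept URI would go into @ValueConceptLink, as described above; the same lookup could equally be used inside a postprocessor class.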

A few remarks on the normalization task:

  • ensure the normalization step is transparent for the user
    the user needs to see (on demand, in the record-detail view) the original value, so that they can inspect the normalization result. (This requires duplicating the fields: resourceType -> resourceType_normalized.)
  • conveying also the URI provides a basis for i18n of the facet values via a multilingual SKOS-based vocabulary (in some distant future)
  • the vocabularies maintained in the vocabulary repository could be queried via the REST service of OpenSKOS (find-concepts); however, in practice whole vocabularies will rather be fetched (via OAI-PMH) and cached on the side of the client application (VLO) for performance reasons.
  • The client application will also need to convert the vocabularies to its local format.

Metadata authoring tools

Use the vocabulary repository as a source for suggest functionality (autocomplete) for the user. This is now catered for in CMDI 1.2, which allows associating an element with an external vocabulary. (AFAIK, MPI's metadata editor Arbil is planned to support the OpenSKOS vocabularies.)

The editor should write both the preferred label and the URI for the given entity into the metadata element (into the content of the element and the @ValueConceptLink attribute, respectively).

Records created in this manner (ideally) wouldn't need any extra processing on the exploitation-side.

Categories/Facets?

VLO currently provides the following facets (configured in VloConfig.xml):

language
collection 
resourceClass 
continent 
country 
modality 
genre 
subject 
format 
organisation 
nationalProject 
keywords 
dataProvider

Organisation names

As stated above, a lot of work has already been done on organisation names. The data is already ingested in the CLAVAS vocabulary repository. Under the collection OpenSKOS-CLARIN Organizations Vocabulary there are two distinct concept schemes.

Language

Language seems basically solved. The 2- and 3-letter codes are resolved to the language name, based on the corresponding components (language2LetterCode, language3LetterCode).

Except for the case of nl_NL #679

Resource Type / Resource Class / Media Type / Format / Modality / Genre / Subject / Keywords

There has been intensive discussion on ResourceType? recently. See the separate page ValueNormalization/ResourceType for details. The information below will be included there as well.

This is the most problematic area (Jan Odijk: "complete mess"). Even though the underlying data categories have different semantics, many values appear in multiple facets, suggesting that the distinction is not at all clear and the semantic borders are fuzzy. (Or that the current mappings are too "optimistic".)

Overview (incomplete) of the discussed facets and/or related data categories.

Note: We need to keep in mind that every facet is already a conglomeration of multiple data categories (+ custom XPaths, + blacklisted XPaths). In this overview, not all data categories for a given facet are listed.
(TODO: In the current curation process it may be worthwhile to dig down and inspect the contribution of individual data categories (and XPaths) to a given facet.)

facet            data categories                                 # values in definition   ~ # distinct values in facet (2014-10-20)
resourceType     isocat:DC-3806 (resourceClass), dcterms:type    13 / 12                  372
format           isocat:DC-2570 (mediaType)                      8                        116
modality         isocat:DC-2490                                  open                     149
genre            isocat:DC-2470, isocat:DC-3899                  15                       1307
subject          dcterms:subject                                 open                     58310
keywords         isocat:DC-5436 (metadata tag)                   open                     267

Not in the VLO as a facet, but potentially related:

                 DCMI Type Vocabulary                            12
applicationType  isocat:DC-3786                                  5
corpusType       isocat:DC-3822                                  11
dataType         isocat:DC-4181                                  open
datatype         isocat:DC-1662                                  open
TaDiRAH          TaDiRAH Research Objects                        36

(theoretically also, research approach and research design.)

Of these, Format seems the least problematic, consisting mostly of MIME types.

Subject seems the most problematic (with 58310+ distinct values), but many values are codes from established hierarchical classification systems like DDC, where one could map to higher-level classes to reduce the variety. Still, this is a real challenge (or mess ;).

We should also take into account that the CMDProfile usually encodes a resource type (collection, TextCorpus, BilingualDictionary, etc.).

There is also an elaborate LRT Taxonomy (Carlson et al. 2010) made by us(!) - I mean CLARIN people - as one of the first deliverables of CLARIN back in 2010. In search of a starting point, it may be worth having a look at that. An initiative outside CLARIN that could be relevant is the COAR resource type vocabulary.

An ad-hoc attempt at mapping values between the three related(?) data categories:

resourceClass        DCMIType                          mediaType
corpus               Text                              text, document
resource bundle      Collection, Dataset
tool                 InteractiveResource, (Software)
                     MovingImage                       video
                     Sound                             audio
                     Image, StillImage                 image, drawing
unknown                                                unknown
experimental data
fieldwork material
grammar
lexicon
other
survey data
teaching material
test data
tool chain
                     Event
                     Service
                     PhysicalObject
                                                       unspecified

Even this simplest of exercises shows that the definitions are far from complete/comprehensive enough to be usable as a basis for normalisation. Even worse is the reality check (the comparison with the actual values in the data).

One (admittedly radical) approach would be not to try to pick apart the chaos present in the data, but rather to accept it: throw away the data categories and lump all these fields together into a "tag" field (or another facet with little semantics). We would still normalize the values, but instead of providing many facets, we would basically work with a bag of words, offering the user a tag cloud, in a content-first search style.

Cases

Problematic/questionable cases are collected here.

References

Odijk, J. (2014). Discovering Resources in CLARIN: Problems and Suggestions for Solutions. (find attached)

Ďurčo, M., & Windhouwer, M. (2014). From CLARIN Component Metadata to Linked Open Data. In Proceedings of LDL 2014 (LREC Workshop), Reykjavik, Iceland, p. 23. (LDL2014 proceedings.pdf)

Ďurčo, M. (2013). SMC4LRT - Semantic Mapping Component for Language Resources and Technology. Technical University, Vienna. Retrieved from http://permalink.obvsg.at/AC11178534

Carlson, R., Caselli, T., Elenius, K., Gaiffe, B., House, D., Hinrichs, E., Quochi, V., Simov, K., & Vogel, I. (2010). Language Resources and Tools Survey and Taxonomy and Criteria for the Quality Assessment. D5C-2. http://www-sk.let.uu.nl/u/D5C-2.pdf

TaDiRAH
Taxonomy of Digital Research Activities in the Humanities (https://github.com/dhtaxonomy/TaDiRAH/) developed within DARIAH, divided into three main aspects: Activities, Objects, Techniques