Curation of element/field values.
This aspect of metadata curation revolves around the problem of high variability of the values in metadata elements with controlled (but not closed) vocabularies, like organisation names, resource type, subject and similar. We encounter a) spelling variants and b) synonyms. This has a serious impact on the discoverability of the resources, especially when using catalogues with faceted search like the VLO.
This issue has been discussed on multiple occasions (VLO Taskforce#Vocabularies?, VLO-Taskforce/Meeting-2014-10-15; see especially the elaborate evaluation of the usability of facets in the VLO by Jan Odijk in the attachment) and there have been numerous attempts to solve it - see below.
Existing pieces
Added support for External vocabularies in CMDI 1.2
In the new version of CMDI a new feature was added that allows indicating/utilising external vocabularies as value domains for CMDI elements and CMDI attributes.
The new feature is strongly rooted in the CLAVAS initiative.
CLAVAS
CLARIN Vocabulary Alignment Service - an instance of OpenSKOS, a vocabulary repository and editor, run by the Meertens Institute
See more under CmdiClavasIntegration, Vocabularies#CLAVAS?
Next to a comprehensive editor, it offers programmatic access to the data either via
- REST-Service or
- OAI-PMH - sample requests: ListSets, get single record (for an organisation, from the meertens:VLO-orgs set)
VLO
VLO already has some pre- and post-processing code:
https://svn.clarin.eu/vlo/trunk/vlo_preprocessor/
https://trac.clarin.eu/browser/vlo/trunk/vlo-importer/src/main/java/eu/clarin/cmdi/vlo/importer
pertaining to:
- Continent
- Country
- NationalProject
- Organisation
- Year?
In the preprocessing only Organisation seems to be handled (and rather rudimentarily - only two organisations are listed):
https://svn.clarin.eu/vlo/trunk/vlo_preprocessor/OrganisationControlledVocabulary.xml
<Organisations>
  <Organisation name="Max Planck Institute for Psycholinguistics">
    <Variation>MPI Nijmegen</Variation>
    <Variation>Max Planck Institut fuer Psycholinguistik, Nijmegen, Nl.</Variation>
    ...
    <Variation>Max Planck Institute for Psycholinguisticsc</Variation>
    <Variation>Max-Planck-Institüt für Psycholinguïstik</Variation>
  </Organisation>
  <Organisation name="University of Abidjan">
    <Variation>Abidjan : L'Universite d'Abidjan</Variation>
    ...
    <Variation>Abidjan, Ivory Coast : Université Abidjan</Variation>
  </Organisation>
</Organisations>
There seems to be some configuration going on in the VloConfig.xml file:
<countryComponentUrl>http://catalog.clarin.eu/ds/ComponentRegistry/rest/registry/components/clarin.eu:cr1:c_1271859438104/xml</countryComponentUrl>
<languageLinkPrefix>http://infra.clarin.eu/service/language/info.php?code=</languageLinkPrefix>
...
A few notes on data processing in VLO:
- What is the difference between pre- and postprocessing?
(When does post-processing happen?)
Preprocessing is done before the ingestion step into the VLO, i.e. individual metadata records come in and individual (but "better") metadata records come out, which are then ingested.
In Postprocessing "such a postprocessor is called on a single facet value after this facet value is inserted into the solr document during import."
- What is the procedure to extend the pre-/post-processing?
Can this be made more dynamic? (hook in XSLs in the configuration, no re-build)
In its current architecture the VLO was not designed to easily inject code (like XSL stylesheets) via configuration. Basically one has to implement another postprocessing class (implementing the PostProcessor interface) and hook it into the MetadataImporter. So it is not that hard, but it requires rebuilding the code.
SOLR - synonyms
Apache Solr, the indexer underlying the VLO, has some native provisions for handling synonymous values.
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory
Sample configuration in the schema.xml
file:
<fieldType name="text_normalized" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true"
            expand="false" tokenizerFactory="solr.KeywordTokenizerFactory" />
    <!-- <filter class="solr.LowerCaseFilterFactory"/> -->
  </analyzer>
</fieldType>
...
<field name="resourceClass_normalized" type="text_normalized" indexed="true" stored="true" multiValued="true" />
...
<copyField source="resourceClass" dest="resourceClass_normalized" />
It is important to pass the @tokenizerFactory also to the SynonymFilter, otherwise multi-word terms are not handled at all. Also, the analysis should obviously run at index time, not query time.
Note also the @expand attribute, which allows either reducing all synonymous terms to one, or expanding the terms (searching for any one of the synonymous terms would match all of them). For facets we want to reduce the terms, but for search we could expand them.
The input is a file with trivial syntax:
# either a list of synonyms, reduced to the first value:
text, written, texts, text corpus, written corpus
# or an explicit one-way mapping:
spoken, Spoken Language Corpora => speech
DASISH - jmd-scripts
Within the DASISH project one task was to establish a joint metadata domain for CLARIN, DARIAH and CESSDA. The catalogue is available under http://ckan.dasish.eu/ckan/ .
The setup is very similar to that of CLARIN (harvester -> mapping -> normalization -> ingest into indexer). For this processing pipeline a set of scripts is maintained on github: DASISH/jmd-scripts
One step in this workflow is harmonization
with md-harmonizer.py
doing the work based again on a simple input file:
#GroupName,,datasetName,,facetName,,old_value,,new_value,,action
*,,*,,Language,,*,,*,,replace
*,,*,,Country,,*,,*,,replace
*,,*,,CreationDate,,*,,UTC,,changeDateFormat
*,,*,,Subject,,[Paris],,Paris,,replace
*,,*,,Subject,,AARDEWERK,,Pottery,,replace
Also here, this setup seems to be at a rather preliminary stage; the input is just a small sample.
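To make the rule format concrete, here is a minimal Python sketch of how such replace rules could be applied to a record. The function names and record layout are invented for illustration; this is not the actual code from jmd-scripts:

```python
# Hypothetical sketch of rule application in the style of md-harmonizer.py.
# Record layout ({facet: value}) and function names are assumptions.

RULES = """\
#GroupName,,datasetName,,facetName,,old_value,,new_value,,action
*,,*,,Language,,*,,*,,replace
*,,*,,Subject,,[Paris],,Paris,,replace
*,,*,,Subject,,AARDEWERK,,Pottery,,replace
"""

def load_rules(text):
    """Parse ',,'-separated rule lines into tuples, skipping comments."""
    rules = []
    for line in text.splitlines():
        if not line or line.startswith('#'):
            continue
        group, dataset, facet, old, new, action = line.split(',,')
        rules.append((group, dataset, facet, old, new, action))
    return rules

def harmonize(record, rules):
    """Apply 'replace' rules to a {facet: value} record; '*' acts as a wildcard."""
    out = dict(record)
    for group, dataset, facet, old, new, action in rules:
        if action != 'replace':
            continue  # other actions (e.g. changeDateFormat) not sketched here
        for key, value in out.items():
            if facet in ('*', key) and old in ('*', value) and new != '*':
                out[key] = new
    return out

record = {'Subject': 'AARDEWERK', 'Language': 'dut'}
print(harmonize(record, load_rules(RULES)))
# -> {'Subject': 'Pottery', 'Language': 'dut'}
```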
Note: One step in the workflow is obviously the schema-level mapping - here the TLA/md-mapper is being directly reused with DASISH-specific configuration. However as this pertains to the concept mapping on the schema level, it is only of limited interest here.
CMD2RDF
In an initiative to provide the CMD records also as RDF (see CMD2RDF), one (crucial) task is to transform string values to entities (cf. Ďurčo, Windhouwer, 2014; cf. Ďurčo, 2013, chapter 6.2 Mapping Field Values to Semantic Entities, pp. 80).
A first attempt for organisation names: addOrganisationEntity.xsl. Configuration via cmd2rdf-jobs.xml
Basically it is a lookup in a SKOS dictionary (injected as a parameter), matching on all label attributes (@altLabel, @hiddenLabel, @prefLabel). The IRI of the matching entity (@rdf:about) is used to construct a parallel object property (next to the data property encoding the original literal value).
For the discussed normalization of values in the CMD elements we would use @skos:prefLabel, but should also encode the entity URI in @ValueConceptLink.
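A minimal Python sketch of this kind of label lookup (the actual implementation is the XSLT stylesheet above; the concept data and function names here are invented for illustration):

```python
# Sketch: match a raw value against all SKOS labels and return the
# preferred label plus the entity URI. The data below is invented.
concepts = [
    {
        'uri': 'http://example.org/org/MPI',  # would come from @rdf:about
        'prefLabel': 'Max Planck Institute for Psycholinguistics',
        'altLabel': ['MPI Nijmegen'],
        'hiddenLabel': ['Max Planck Institute for Psycholinguisticsc'],
    },
]

def build_index(concepts):
    """Map every known label (pref, alt, hidden) to its concept."""
    index = {}
    for c in concepts:
        labels = [c['prefLabel']] + c.get('altLabel', []) + c.get('hiddenLabel', [])
        for label in labels:
            index[label.lower()] = c
    return index

def normalize(value, index):
    """Return (prefLabel, uri) for a known value, or (value, None) otherwise."""
    c = index.get(value.lower())
    if c is None:
        return value, None
    return c['prefLabel'], c['uri']

index = build_index(concepts)
print(normalize('MPI Nijmegen', index))
# -> ('Max Planck Institute for Psycholinguistics', 'http://example.org/org/MPI')
```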
DARIAH-DE
A lot of work on reference data is going on in DARIAH-DE
M 4.3.1 Normdatensätze - a deliverable (in German) detailing the activities
Two highlights:
- a SRU endpoint to access the GND (Gemeinsame Normdatei - the Integrated Authority File of the German National Library)
- the Labeling System - a system that allows creating project-specific but still interoperable vocabularies. The interoperability is ensured by linking the vocabularies to reference resources. (Still have to see how this works in practice, but it sounds promising.)
Proposed common/harmonized approach
There are (at least) three aspects to this (that would make for good sub-tasks):
- curation of controlled vocabularies
- application of the vocabularies on the harvested data (normalization step between harvest and ingest)
- use of the vocabularies for metadata authoring
Curation of vocabularies
For selected categories vocabularies are collaboratively maintained in a common vocabulary repository.
A possible workflow has already been tested within the CLAVAS project (Vocabulary Service for CLARIN) for the organisation names: the list of organisation names taken from the VLO has been manually (pre-)normalized and imported into the vocabulary repository (OpenSKOS), so that now individual organisations (and departments) are modelled as skos:Concept and their spelling variants as skos:altLabel. The initial import is just a bootstrapping step to seed the vocabulary. Further work (further normalization, adding new entities) happens in the Vocabulary Repository/Editor?, which allows collaborative work, provides a REST API etc.
The CLAVAS vocabulary repository also already provides the list of Language Codes (ISO-639-3
) and the closed data categories (with their values) from ISOcat. And it is open to support further vocabularies. (see also meeting on OpenSKOS and controlled vocabularies)
Normalization step in the processing pipeline (exploitation tools - VLO)
There are some simple things we can do:
- fold case! (text => Text)
- remove/reduce punctuation?
- better mappings?
Partly the problem of "mess" in the facets may stem from lumping too disparate fields into one index/facet. The VLO allows blacklisting explicit XPaths: facetConcepts.xml
- support multiselect on facets
Not in the internal processing, but something we could do on the side of the interface for a better user experience.
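The first two of these simple steps can be sketched in a few lines of Python; the exact rules (what counts as punctuation, how to capitalize) are of course up for discussion:

```python
# Sketch of the "simple things": case folding and punctuation reduction.
import re
import string

def fold_case(value):
    """text => Text (capitalize only the first letter, keep the rest)."""
    return value[:1].upper() + value[1:] if value else value

def reduce_punctuation(value):
    """Strip punctuation characters and collapse runs of whitespace."""
    stripped = value.translate(str.maketrans('', '', string.punctuation))
    return re.sub(r'\s+', ' ', stripped).strip()

print(fold_case('text'))                          # -> Text
print(reduce_punctuation('  spoken, (corpus) '))  # -> spoken corpus
```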
But the main focus is on applying curated controlled vocabularies on the (harvested) data. We have two options:
- preprocessing
some XSLT run on every CMD record
output: a new version of the CMD record with the value normalized based on @skos:prefLabel, plus the URI of the entity from @rdf:about in @ValueConceptLink
(Don't throw away the URI-identifier, once you have it.)
- postprocessing
specific normalization code for the values of an individual facet/solr-index: "called on a single facet value after this facet value is inserted into the solr document during import."
A few remarks to the normalization task:
- ensure the normalization step is transparent for the user[[BR]]
The user needs to see (on demand, in the record-detail view) the original value, so that she can inspect the normalization result. (This requires duplicating the fields: resourceType -> resourceType_normalized)
- conveying also the URI provides a basis for i18n of the facet values via a multilingual SKOS-based vocabulary (in some distant future)
- the vocabularies maintained in the vocabulary repository could be queried via the REST service of OpenSKOS (find concepts); in practice, however, whole vocabularies will rather be fetched (via OAI-PMH) and cached on the side of the client application (VLO) for performance reasons.
- The client application will also need to convert the vocabularies to its own local format.
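A sketch of such a conversion step, assuming the vocabulary has been fetched as SKOS RDF/XML. The sample data and the target "local format" (a flat lookup table) are illustrative assumptions; a real client would page through OAI-PMH resumption tokens and cache the result on disk:

```python
# Sketch: convert fetched SKOS RDF/XML into a flat local lookup table.
import xml.etree.ElementTree as ET

SKOS = 'http://www.w3.org/2004/02/skos/core#'
RDF = 'http://www.w3.org/1999/02/22-rdf-syntax-ns#'

# In reality this would be fetched via OAI-PMH (ListRecords) and cached:
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
                     xmlns:skos="http://www.w3.org/2004/02/skos/core#">
  <skos:Concept rdf:about="http://example.org/org/MPI">
    <skos:prefLabel>Max Planck Institute for Psycholinguistics</skos:prefLabel>
    <skos:altLabel>MPI Nijmegen</skos:altLabel>
  </skos:Concept>
</rdf:RDF>"""

def to_local_format(rdf_xml):
    """Build a {label: (prefLabel, uri)} table from SKOS RDF/XML."""
    table = {}
    root = ET.fromstring(rdf_xml)
    for concept in root.iter('{%s}Concept' % SKOS):
        uri = concept.get('{%s}about' % RDF)
        pref = concept.findtext('{%s}prefLabel' % SKOS)
        for el in concept:
            if el.tag in ('{%s}prefLabel' % SKOS, '{%s}altLabel' % SKOS):
                table[el.text] = (pref, uri)
    return table

table = to_local_format(SAMPLE)
print(table['MPI Nijmegen'])
# -> ('Max Planck Institute for Psycholinguistics', 'http://example.org/org/MPI')
```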
Metadata authoring tools
Use the vocabulary repository as a source for suggest functionality (autocomplete) for the user. This is now catered for in CMDI 1.2, which allows associating an element with an external vocabulary. (AFAIK MPI's MD editor Arbil is planned to support the OpenSKOS vocabularies.)
The editor should write both the preferred label and the URI for given entity into the metadata element (into the content of the element and the @ValueConceptLink
attribute respectively).
Records created in this manner (ideally) wouldn't need any extra processing on the exploitation-side.
Categories/Facets?
The VLO currently provides the following facets (configured in VloConfig.xml):
language, collection, resourceClass, continent, country, modality, genre, subject, format, organisation, nationalProject, keywords, dataProvider
Organisation names
As stated above, a lot of work has already been done on organisation names. The data is currently already ingested in the CLAVAS vocabulary repository. Under the collection OpenSKOS-CLARIN Organizations Vocabulary there are two distinct concept schemes:
- Organisations |2098 entries|
- CurateOrganisations |49 entries| - still needs to be looked at
Language
Language seems basically solved. The 2- and 3-letter codes are being resolved to the name, based on the corresponding components ( language2LetterCode, language3LetterCode).
Except for the case of nl_NL #679
Resource Type / Resource Class / Media Type / Format / Modality / Genre / Subject / Keywords
There has been intensive discussion on ResourceType? recently. See separate page on ValueNormalization/ResourceType for the details. Information below will be included there as well.
This is the most problematic area (Jan Odijk: "complete mess"). Even though the underlying data categories have different semantics, many values appear in multiple facets, suggesting that the distinction is not at all clear and the semantic borders are fuzzy. (Or that the current mappings are too "optimistic".)
An overview (incomplete) of the discussed facets and/or related data categories.
Note: We need to keep in mind that every facet is already a conglomeration of multiple data categories (+ custom XPaths, + blacklisted XPaths). In this overview, not all the data categories for a given facet are listed.
(TODO: In the current curation process it may be worthwhile to dig down and inspect the contribution of individual data categories (and XPaths) to a given facet.)
facet | data categories | number of values in the definition | ~ number of distinct values in the facet (2014-10-20) |
---|---|---|---|
resourceType | isocat:DC-3806 (resourceClass) dcterms:type | 13 / 12 | 372 |
format | isocat:DC-2570 (mediaType) | 8 | 116 |
modality | isocat:DC-2490 | open | 149 |
genre | isocat:DC-2470 isocat:DC-3899 | 15 | 1307 |
subject | dcterms:subject | open | 58310 |
keywords | isocat:DC-5436 (metadata tag) | open | 267 |
not in VLO as facet, but potentially related: | |||
DCMI Type Vocabulary | 12 | ||
applicationType isocat:DC-3786 | 5 | ||
corpusType isocat:DC-3822 | 11 | ||
dataType isocat:DC-4181 | open | ||
datatype isocat:DC-1662 | open | ||
TaDiRAH TaDiRAH Research Objects | 36 | ||
(theoretically also, research approach and research design.)
Of these, Format seems the least problematic, being mostly MIME types.
Subject seems the most problematic (with over 58310 distinct values), but many values are codes from established hierarchical classification systems like DDC, where one could map to higher-level classes to reduce the variety. Still, this is a real challenge (or mess ;).
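To illustrate the idea of mapping to higher-level classes, here is a sketch that rolls DDC notations up to their main (hundreds) class. The main-class labels are the standard DDC ones; treating every numeric subject value as a DDC notation is of course a simplification for the sake of the example:

```python
# Sketch: reduce subject variety by rolling DDC codes up to their main class.
DDC_MAIN = {
    '0': 'Computer science, information & general works',
    '1': 'Philosophy & psychology',
    '2': 'Religion',
    '3': 'Social sciences',
    '4': 'Language',
    '5': 'Science',
    '6': 'Technology',
    '7': 'Arts & recreation',
    '8': 'Literature',
    '9': 'History & geography',
}

def roll_up(subject):
    """Map a DDC notation like '415.3' to its main class; pass other values through."""
    code = subject.strip()
    if code and code[0].isdigit():
        return DDC_MAIN[code[0]]
    return subject

print(roll_up('415.3'))    # -> Language
print(roll_up('Pottery'))  # -> Pottery
```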
We should also take into account that the CMDProfile usually encodes a resource type (collection, TextCorpus, BilingualDictionary, etc.).
There is also an elaborate LRT-Taxonomy (Carlson et al. 2010) made by us(!) - I mean CLARIN people - as one of the first deliverables of CLARIN back in 2010. When searching for a starting point, it may be worth having a look at that. An initiative outside CLARIN that could be relevant is the COAR resource type vocabulary.
An ad-hoc attempt at mapping values between the three related(?) data categories:
resourceClass | DCMIType | mediaType |
---|---|---|
corpus | Text | text, document |
resource bundle | Collection, Dataset | |
tool | InteractiveResource, (Software) | |
MovingImage | video | |
Sound | audio | |
Image, StillImage | image, drawing | |
unknown | unknown | |
experimental data | ||
fieldwork material | ||
grammar | ||
lexicon | ||
other | ||
survey data | ||
teaching material | ||
test data | ||
tool chain | ||
Event | ||
Service | ||
PhysicalObject | ||
unspecified |
Even this simple exercise shows that the definitions are far from complete/comprehensive enough to be usable as a basis for normalisation. The reality check (comparison with the actual values in the data) looks even worse.
One (admittedly radical) approach would be not to try to pick apart the chaos present in the data, but rather to accept it: throw away the data categories and lump all these fields together into a "tag" field (or another facet with little semantics). We would still normalize the values, but instead of providing many facets we would basically work with a bag of words, offering the user a tag cloud, content-first search style.
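A sketch of what such a "bag of tags" could look like; the field names and sample records are invented for illustration:

```python
# Sketch: lump the fuzzy facets into one normalized tag bag and derive
# tag-cloud weights from value frequencies.
from collections import Counter

FUZZY_FACETS = ['resourceType', 'genre', 'subject', 'keywords', 'modality']

def to_tags(record):
    """Collect all values of the fuzzy facets into one lower-cased tag list."""
    tags = []
    for facet in FUZZY_FACETS:
        for value in record.get(facet, []):
            tags.append(value.strip().lower())
    return tags

records = [
    {'resourceType': ['Corpus'], 'genre': ['discourse']},
    {'subject': ['corpus'], 'keywords': ['Discourse']},
]

cloud = Counter(tag for r in records for tag in to_tags(r))
print(cloud.most_common())
# -> [('corpus', 2), ('discourse', 2)]
```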
Cases
Here we collect problematic/questionable cases.
References
Odijk, J. (2014). Discovering Resources in CLARIN: Problems and Suggestions for Solutions (find attached)
Ďurčo, M., & Windhouwer, M. (2014). From CLARIN Component Metadata to Linked Open Data. In LDL 2014, LREC Workshop, Reykjavik, Iceland, p. 23. (LDL2014 proceedings.pdf)
Ďurčo, M. (2013). SMC4LRT - Semantic Mapping Component for Language Resources and Technology. Technical University, Vienna. Retrieved from http://permalink.obvsg.at/AC11178534
Carlson, R., Caselli, T., Elenius, K., Gaiffe, B., House, D., Hinrichs, E., Quochi, V., Simov, K., & Vogel, I. (2010). Language Resources and Tools Survey and Taxonomy and Criteria for the Quality Assessment. D5C-2. http://www-sk.let.uu.nl/u/D5C-2.pdf
- TaDiRAH - Taxonomy of Digital Research Activities in the Humanities (https://github.com/dhtaxonomy/TaDiRAH/), developed within DARIAH, divided into three main aspects: Activities, Objects, Techniques
Attachments (1)
- Searching with the VLOwithApp_Odijk.pdf (444.9 KB) - Extensive evaluation of the VLO's usability by Jan Odijk