= VLO Taskforce (CLARIN-D) =

[[PageOutline(2-4)]]

 * [[attachment:VLO_Tischvorlage_2014-01-07.pdf|Summary of VLO metadata quality issues and curation plans]]

== VLO Facets ==

 * [[attachment:facets.ods|List of facets desired by CLARIN-D centers]]
 * [[attachment:VLO_facets.pdf|List of facets with ISO-CAT info (incomplete)]]

== Exploiting ISOcat data categories ==

Currently (January 2014) the VLO relies on two separate strategies for selecting metadata (MD) fields to populate its facets:

 1. ISOcat data categories (DCs) associated with an MD field in the underlying CMDI profile
 2. a set of centre-specific XPaths for CMDI instances to
   * explicitly select MD fields for inclusion in a facet that was not matched by an ISOcat DC (white list)
   * explicitly discard MD fields that were selected via their ISOcat DC (black list)

Ideally, strategy 1 should suffice for proper MD selection for the facets. To fix the current state of the MD instances and the VLO, we have to answer the following questions:

 1. Does the VLO rely on appropriate, i.e. sufficiently concretely defined, ISOcat DCs? DCs like http://www.isocat.org/datcat/DC-2482 (language ID) or http://www.isocat.org/datcat/DC-2484 (language name) are semantically too vague: they do not allow differentiating between, for example, the language a resource is written in and the language of an actor in a transcribed recording.
   * Solution: Only use narrowly defined DCs for VLO facets.
   * Task: Evaluate the current mapping of ISOcat DCs to VLO facets.
 2. Do CMDI profiles use sufficiently concrete ISOcat DCs?
   * Task: Create an overview of the DCs actually used by the centres.
   * Task: Evaluate the whitelist/blacklist XPaths with respect to why they are needed (to select vaguely defined DCs? to select MD fields that don't have a DC associated? anything else?) [[herold|Axel Herold]]: I will prepare such a list.
   * Task: Propose the adoption of container DCs for profiles that rely on vaguely defined DCs.

== Vocabularies ==

A first draft of recommended vocabularies (and in some cases components) was attached to this page (Gruppe Vokabular_v3.pdf, in German). Several main problems have to be addressed; so far they have prevented a more complete recommendation:

 * missing precise specification of most facets
 * lack of agreement between the definition/vocabulary of elements in CMDI and the linked ISOcat data categories
 * many ISOcat definitions are problematic (too narrow, self-referential definitions, no explicit conceptual domain etc.)
 * a more powerful CMDI-based mechanism for specifying (external) vocabularies will only be part of CMDI 1.2

== Relationships ==

The CMDI metadata framework provides a generic mechanism to represent (directional) relationships between objects: `ResourceProxy`. A `ResourceProxy` specifies the `ResourceType` (one of "Landing Page", "Resource", "Metadata") and the target of a relationship (`ResourceRef`), which usually contains a URL, sometimes a PID (as a special case of a URL). A `ResourceProxy` is equipped with an id attribute, which can be, and is being, used for specifying further information, such as a descriptive, human-readable anchor text and its semantics (partOf, versionOf, source, ...) in the component part of a CMDI record. Finally, for some kinds of relationships, there exist dedicated elements in the `Resources` section of a CMDI record:

 * `IsPartOf` for specifying one direction of a partOf relationship.
 * possibly for dealing with versioning (TODO: link to discussion on versioning)
 * `ResourceRelation` for specifying the semantics (?) of all other kinds of relationships (see also: https://trac.clarin.eu/wiki/CMDI%201.2/Resource%20proxies/ResourceRelation)

This huge, combinatorial design space for representing relationships has been creatively exhausted by the CLARIN-D centers. As a consequence, the VLO is basically agnostic w.r.t. relationships.
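
The generic mechanism described above (a `ResourceProxy` with an id, referenced from the component part of the record) could be exploited roughly as in the following sketch. It is only an illustration under stated assumptions: the sample record, its handles and the component names `ExampleCorpus` and `ExampleRecording` are made up, the namespace is the one used by CMDI 1.1 records, and the function `resolve_relationships` is a hypothetical helper, not part of the VLO.

```python
import xml.etree.ElementTree as ET

CMD_NS = "http://www.clarin.eu/cmd/"  # namespace used by CMDI 1.1 records

# Hypothetical, heavily simplified record: handles and component names are invented.
SAMPLE = """<CMD xmlns="http://www.clarin.eu/cmd/">
  <Resources>
    <ResourceProxyList>
      <ResourceProxy id="rp1">
        <ResourceType>Resource</ResourceType>
        <ResourceRef>http://hdl.handle.net/0000/example-1</ResourceRef>
      </ResourceProxy>
      <ResourceProxy id="rp2">
        <ResourceType>Metadata</ResourceType>
        <ResourceRef>http://hdl.handle.net/0000/example-2</ResourceRef>
      </ResourceProxy>
    </ResourceProxyList>
  </Resources>
  <Components>
    <ExampleCorpus>
      <ExampleRecording ref="rp1">
        <Name>Interview 1</Name>
      </ExampleRecording>
    </ExampleCorpus>
  </Components>
</CMD>"""


def resolve_relationships(xml_text):
    """Return (component name, resource type, target) for every component
    that points to a ResourceProxy via its ref attribute."""
    root = ET.fromstring(xml_text)
    ns = {"cmd": CMD_NS}

    # Index all proxies by their id attribute.
    proxies = {}
    for proxy in root.findall(".//cmd:ResourceProxy", ns):
        rtype = proxy.findtext("cmd:ResourceType", default="", namespaces=ns)
        target = proxy.findtext("cmd:ResourceRef", default="", namespaces=ns)
        proxies[proxy.get("id")] = (rtype.strip(), target.strip())

    links = []
    components = root.find("cmd:Components", ns)
    if components is not None:
        for elem in components.iter():
            # ref may hold several whitespace-separated ids.
            for proxy_id in elem.get("ref", "").split():
                if proxy_id in proxies:
                    rtype, target = proxies[proxy_id]
                    local_name = elem.tag.split("}", 1)[-1]  # drop the namespace
                    links.append((local_name, rtype, target))
    return links


if __name__ == "__main__":
    for link in resolve_relationships(SAMPLE):
        print(link)
```

The sketch only resolves references; it does not interpret the semantics (partOf, versionOf, ...) of a relationship, because these are encoded differently by each centre, as the following sections show.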
In the following, some of the existing representations are analyzed, with the explicit goal of narrowing the design space and making some kinds of relationships more useful for the VLO. For more information on the status of relationships in the discussion on CMDI 1.2, see also https://docs.google.com/spreadsheet/ccc?key=0Avyg_78eBoF4dFUxR2VpR01XRFEzSUVUb2tXcFduSXc&usp=sharing#gid=0

=== HZSK ===

HZSK represents all resources of a corpus in one CMDI record. Thus, the target of a relationship always has the type (ResourceType) Resource. The individual resources are further specified in the component part of the CMDI record by referring to the `ResourceProxy` via its id attribute.

Example: http://virt-fedora.multilingua.uni-hamburg.de/drupal/fedora/repository/cmdi:demo/cmdi/metadata.xml

{{{
Resource
http://hdl.handle.net/11858/00-248C-0000-000E-0181-F
...
Rudi Völler Wutausbruch HIAT (simplified)
...
}}}

=== BAS ===

BAS represents the resources of a corpus by several CMDI records, and employs a variety of approaches to represent relationships:

 a. The relationship between the CMDI record for a corpus and its parts is specified explicitly in one direction by means of CMDI's built-in `IsPartOf` element.
 b. Relationships with a target of type Resource are further specified in the component part of the CMDI record by referring to the `ResourceProxy` via its id attribute.
 c. Relationships between individual components, such as from to are represented by referring to the target's id attribute as well.

Example: https://clarin.phonetik.uni-muenchen.de/BASRepository/Public/Corpora/ZIPTEL/ZIPTEL.2.cmdi.xml

{{{
Metadata
https://clarin.phonetik.uni-muenchen.de/BASRepository/Public/Corpora/ZIPTEL/0001.2.cmdi.xml
}}}

http://catalog.clarin.eu/oai-harvester/cmdi-providers/harvested/results/cmdi/Bayerisches_Archiv_f_r_Sprachsignale/oai_BAS_repo_Corpora_ZIPTEL_0001.xml or https://clarin.phonetik.uni-muenchen.de/BASRepository/Public/Corpora/ZIPTEL/0001.2.cmdi.xml

{{{
Resource
https://clarin.phonetik.uni-muenchen.de/BASRepository/Corpora/ZIPTEL/0001/z10001z2.dea
...
https://clarin.phonetik.uni-muenchen.de/BASRepository/Public/Corpora/ZIPTEL/ZIPTEL.2.cmdi.xml
...
audio3 un-supervised answering of a question prompted via telephone
...
...
question answering unspecified unspecified 0001
...
}}}

=== IDS-Mannheim ===

IDS-Mannheim represents resources by several CMDI records, and employs a variety of approaches to represent relationships:

 a. The first version of the historical newspaper corpus MKHZ represents relationships by `ResourceProxy` as well as by OLAC-Dcmi-Terms elements, where both point to a PID (see for example http://repos.ids-mannheim.de/fedora/objects/clarind-ids:mkhz.000000/datastreams/CMDI/content).
 b. The second version of the historical newspaper corpus represents relationships by `ResourceProxy` and further specifies the semantics (and anchor text) of the relationship in the component part of the CMDI record (see for example http://repos.ids-mannheim.de/fedora/objects/clarin-ids:mkhz1.00000/datastreams/CMDI/content). The underlying conceptual model is depicted in the figure below.
 c.
    Relationships in the corpora of spoken language are represented by `ResourceProxy` elements only, and partOf relationships are further specified by means of CMDI's built-in `IsPartOf` element (see for example http://repos.ids-mannheim.de/fedora/objects/clarin-ids:folk.FOLK_S_00248.cmdi/datastreams/CMDI/content).

[[Image(mkhz-representation.jpg)]]

=== Leipzig ===

Leipzig's approach to representing relationships is similar to option b of IDS (and BAS). The differences are as follows:

 a. For each relationship there exists a separate Component, which already has a built-in attribute ref of type idrefs.
 b. The description of a relationship is structured rather than just a simple anchor text.
 c. Inverse relationships are not represented explicitly.

Example:

{{{
Metadata
http://hdl.handle.net/11858/00-229C-0000-0001-B06F-3@type=dataprovider&id=1
...
1
LCC data provider "www.shortnews.de" in resource with identifier 11858/00-229C-0000-0001-B06F-3
Data provider of the Leipzig Corpora Collection: www.shortnews.de
}}}

=== IMS Stuttgart ===

This is a list of the relations which we would like to represent in the CMDI records, and how we deal with them so far. Because we were not sure about the recommendations on how to use the 'Resources' component, we have not applied it yet. We are of course interested in using it, so that some of the relations might also be exploited in the VLO:

 a.
   * Relation type: tool -- modular tool component (parameter file, language model, lexicon, ...)
   * Current approach: Using the CMDI component 'Prereqisites' (clarin.eu:cr1:c_1290431694521) to state that a tool needs an additional component. The tool components are described in their own CMDI record (profile not in public section yet).
   * Examples:
     * !TreeTagger (Metadata: http://hdl.handle.net/11858/00-247C-0000-0022-C698-E) --> Italian parameter file (Achim Stein)
     * !TreeTagger (Metadata: http://hdl.handle.net/11858/00-247C-0000-0022-C698-E) --> Italian parameter file (Marco Baroni)
     * !TreeTagger (Metadata: http://hdl.handle.net/11858/00-247C-0000-0022-C698-E) --> English parameter file
     * Mate Tools Parser (Metadata: http://hdl.handle.net/11858/00-247C-0000-0022-C697-0) --> French model
     * Mate Tools Parser (Metadata: http://hdl.handle.net/11858/00-247C-0000-0022-C697-0) --> English model
     * !BitPar (Metadata: http://hdl.handle.net/11858/00-247C-0000-0022-F7AF-1) --> Lexicon DE
     * !BitPar (Metadata: http://hdl.handle.net/11858/00-247C-0000-0022-F7AF-1) --> Grammar EN
 b.
   * Relation type: trained model -- data set on which the model was trained
   * Current approach: Optional CMDI component '!BasedOn' as part of the profile to describe tool components (CMDI component not in public section yet).
   * Examples:
     * Dutch parameter file for !TreeTagger --> trained on Eindhoven corpus
     * German grammar for !BitPar --> extracted from Tiger treebank
 c.
   * Relation type: versioning
   * Current approach: Element 'Version' in CMDI component '!GeneralInfo' (clarin.eu:cr1:c_1290431694495) and, where applicable, a common part in the '!ResourceName' to implicitly relate different versions.
   * No examples yet, but this will change in the future. Affects all types of resources: corpora, lexicons, tools, web services, ...
 d.
   * Relation type: tool -- web service
   * We keep getting mails asking for information and publications on the tools 'behind' the web services. This also affects the action item regarding the CMDI profiles for web services.

Further notes:

 * Making these relations explicit will also mean more maintenance effort to keep the MD records up to date. We should therefore discuss where this makes sense and where it creates too much overhead.
 * Some relationships are between 'clearly distinct' resources: 'Source' was mentioned in the Tischvorlage (handout), others might include:
   * 'A is based on B', e.g. when a trainable component (parameter file, language model, ...) has been trained on a specific corpus, or a collection of technical terms has been extracted based on the frequencies of a specific corpus resource (see relation type b.)
   * 'A can be used with B', e.g. when a tool can use an (additional) knowledge base (lexicon, language model, ...), or a corpus is encoded for a specific query engine (see relation type a.)
   * 'A was used for the creation of B', e.g. when manual annotations were added with the help of a specific system, ...

At the moment some of these relationships can be expressed in the component section of the CMDI records (e.g. 'Derivation', 'Source') but are not exploited in the VLO.

=== Summary on Representation of Relationships ===

TBD, in the form of a table of design options.

== Example Profiles ==

== Related Work ==

[[http://www.datacite.org/|DataCite]] has goals similar to those of CLARIN: sharing and persistently identifying research data. For the description of this data it uses a fixed schema, which is somewhat inspired by Dublin Core. In terms of CMDI, the schema comprises both facets and relationships. For more information see the [[http://schema.datacite.org/|DataCite Schema Repository]].