Representation and exploitation of relations among resources

General

The CMDI metadata framework (1.1) provides a generic mechanism for representing (directional) relationships between objects: <ResourceProxy>. A <ResourceProxy> specifies the <ResourceType> (one of LandingPage, Resource, Metadata, SearchPage, SearchService) and the target <ResourceRef> of a relationship, which usually contains a URL and sometimes a PID (as a special case of a URL). Each <ResourceProxy> carries an id attribute. In practice, the component part of a CMDI record refers to this id to attach further information, such as a descriptive, human-readable anchor text and the semantics of the relationship (partOf, versionOf, source, ...). Finally, for some kinds of relationships, there are dedicated elements in the <Resources> section of a CMDI record:

  • <IsPartOfList> for specifying one direction of a partOf relationship.
  • <JournalFileProxyList>, possibly for dealing with versioning (TODO: link to discussion on versioning).
  • <ResourceRelationList> for specifying the semantics (?) of all other kinds of relationships (see also: CMDI 1.2/Resource proxies/ResourceRelation).
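
As a rough illustration of the mechanism described above, the following Python sketch reads the <ResourceProxy> entries of a record into simple tuples. The sample fragment, its ids and its example.org URLs are invented for illustration. The fragment also omits the XML namespace that real CMDI records declare, so a real harvester would have to take the namespace into account.

```python
import xml.etree.ElementTree as ET
from typing import NamedTuple, Optional

# Hypothetical, namespace-free fragment (ids and URLs are made up).
SAMPLE = """
<Resources>
  <ResourceProxyList>
    <ResourceProxy id="p1">
      <ResourceType mimetype="text/xml">Metadata</ResourceType>
      <ResourceRef>http://example.org/part1.cmdi.xml</ResourceRef>
    </ResourceProxy>
    <ResourceProxy id="p2">
      <ResourceType>LandingPage</ResourceType>
      <ResourceRef>http://example.org/corpus.html</ResourceRef>
    </ResourceProxy>
  </ResourceProxyList>
</Resources>
"""


class Proxy(NamedTuple):
    id: str                  # value of the id attribute
    kind: str                # content of <ResourceType>
    mimetype: Optional[str]  # optional mimetype attribute
    ref: str                 # content of <ResourceRef> (URL or PID)


def read_proxies(xml_text: str) -> list[Proxy]:
    """Collect every <ResourceProxy> found anywhere in the fragment."""
    root = ET.fromstring(xml_text.strip())
    proxies = []
    for el in root.iter("ResourceProxy"):
        rtype = el.find("ResourceType")
        rref = el.find("ResourceRef")
        proxies.append(
            Proxy(el.get("id"), rtype.text.strip(), rtype.get("mimetype"), rref.text.strip())
        )
    return proxies


proxies = read_proxies(SAMPLE)
```

Note that nothing in such a tuple says what the relationship means. That information has to come from one of the dedicated elements or from the component part.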

The CLARIN-D centers have creatively explored this huge, combinatorial design space for representing relationships. As a consequence, the VLO cannot rely on any single encoding and is basically agnostic with respect to relationships.

In the following, some of the existing representations are analyzed, with the explicit goal of narrowing the design space and making some kinds of relationships more useful for the VLO. For some more information on the status of relationships in the discussion on CMDI 1.2 see also https://docs.google.com/spreadsheet/ccc?key=0Avyg_78eBoF4dFUxR2VpR01XRFEzSUVUb2tXcFduSXc&usp=sharing#gid=0

Current center specific solutions

HZSK

HZSK represents all resources of a corpus in one CMDI record. Thus, the target of a relationship (<ResourceRef>) always has the <ResourceType> Resource. The individual resources are further specified in the component part of the CMDI record, which refers to the <ResourceProxy> via its id attribute.

Example:

http://virt-fedora.multilingua.uni-hamburg.de/drupal/fedora/repository/cmdi:demo/cmdi/metadata.xml

<ResourceProxy id="ACDIMB865D8-D9BA-7F9B-E652-D00D960850B4">
  <ResourceType mimetype="text/xml">Resource</ResourceType>
  <ResourceRef>http://hdl.handle.net/11858/00-248C-0000-000E-0181-F</ResourceRef>
</ResourceProxy>
...
<HZSKTranscription ComponentId="clarin.eu:cr1:c_1345561703658" ref="ID8392DD18-04C3-9DC7-A7F5-2FA8A3639EA4">
  <Name>Rudi Völler Wutausbruch</Name>
  <TranscriptionConvention>HIAT (simplified)</TranscriptionConvention>
  ...
</HZSKTranscription>

BAS

BAS represents the resources of a corpus by several CMDI records, and employs a variety of approaches to represent relationships:

  1. The relationship between the CMDI record for a corpus and its parts is specified explicitly in one direction by means of CMDI's built-in <IsPartOf> element.
  2. Relationships with a target of type Resource are further specified in the component part of the CMDI record by referring to the <ResourceProxy> via its id attribute.
  3. Relationships between individual components, such as from <media-file> to <media-session-actor>, are likewise represented by referring to the target's id attribute.

Example:

https://clarin.phonetik.uni-muenchen.de/BASRepository/Public/Corpora/ZIPTEL/ZIPTEL.2.cmdi.xml

<ResourceProxy id="c_0000000001">
  <ResourceType mimetype="text/xml">Metadata</ResourceType>
  <ResourceRef>https://clarin.phonetik.uni-muenchen.de/BASRepository/Public/Corpora/ZIPTEL/0001.2.cmdi.xml</ResourceRef>
</ResourceProxy>

http://catalog.clarin.eu/oai-harvester/cmdi-providers/harvested/results/cmdi/Bayerisches_Archiv_f_r_Sprachsignale/oai_BAS_repo_Corpora_ZIPTEL_0001.xml or https://clarin.phonetik.uni-muenchen.de/BASRepository/Public/Corpora/ZIPTEL/0001.2.cmdi.xml

<ResourceProxy id="r_0000000001">
  <ResourceType mimetype="audio/raw">Resource</ResourceType>
  <ResourceRef>https://clarin.phonetik.uni-muenchen.de/BASRepository/Corpora/ZIPTEL/0001/z10001z2.dea</ResourceRef>
</ResourceProxy>
...
<IsPartOfList>
   <IsPartOf>https://clarin.phonetik.uni-muenchen.de/BASRepository/Public/Corpora/ZIPTEL/ZIPTEL.2.cmdi.xml</IsPartOf>
</IsPartOfList>
...
<media-file actor-ref="s_0000000001" ref="r_0000000001">
  <Type>audio</Type><Quality>3</Quality>
  <RecordingConditions>un-supervised answering of a question prompted via telephone</RecordingConditions>
...
</media-file>
...
<media-session-actor id="s_0000000001">
  <Role>question answering</Role> 
  <Name>unspecified</Name>
  <FullName>unspecified</FullName>
  <Code>0001</Code>
  ...
</media-session-actor>
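
The following Python sketch shows how a consumer might follow the ref and actor-ref attributes of the BAS example. It is only an illustration: the <record> wrapper element is hypothetical, the fragment is abbreviated and namespace-free, and treating both attributes as whitespace-separated lists of ids is an assumption.

```python
import xml.etree.ElementTree as ET

# Abbreviated from the BAS example above; the <record> wrapper is hypothetical.
RECORD = """
<record>
  <ResourceProxy id="r_0000000001">
    <ResourceType mimetype="audio/raw">Resource</ResourceType>
    <ResourceRef>https://clarin.phonetik.uni-muenchen.de/BASRepository/Corpora/ZIPTEL/0001/z10001z2.dea</ResourceRef>
  </ResourceProxy>
  <media-file actor-ref="s_0000000001" ref="r_0000000001">
    <Type>audio</Type>
  </media-file>
  <media-session-actor id="s_0000000001">
    <Role>question answering</Role>
    <Code>0001</Code>
  </media-session-actor>
</record>
"""


def index_by_id(root):
    """Map every id attribute in the record to the element carrying it."""
    return {el.get("id"): el for el in root.iter() if el.get("id") is not None}


def resolve(root, element, attribute):
    """Return the elements referenced by a whitespace-separated id list."""
    by_id = index_by_id(root)
    targets = []
    for token in (element.get(attribute) or "").split():
        if token not in by_id:
            raise KeyError(f"dangling reference {token!r} in @{attribute}")
        targets.append(by_id[token])
    return targets


root = ET.fromstring(RECORD.strip())
media_file = root.find("media-file")
proxy = resolve(root, media_file, "ref")[0]          # the <ResourceProxy>
actor = resolve(root, media_file, "actor-ref")[0]    # the <media-session-actor>
resource_url = proxy.findtext("ResourceRef")
actor_code = actor.findtext("Code")
```

Raising an error on dangling references is a design choice of this sketch. A harvester might prefer to log them and continue.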

IDS-Mannheim

IDS-Mannheim represents resources by several CMDI records, and employs a variety of approaches to represent relationships:

  1. The first version of the historical newspaper corpus MKHZ represents relationships by <ResourceProxy> as well as by OLAC-Dcmi-Terms elements such as <hasPart>, where both point to a PID (see for example http://repos.ids-mannheim.de/fedora/objects/clarind-ids:mkhz.000000/datastreams/CMDI/content).
  2. The second version of the historical newspaper corpus represents relationships by <ResourceProxy> and further specifies the semantics (and anchor text) of the relationship in the component part of the CMDI record (see for example http://repos.ids-mannheim.de/fedora/objects/clarin-ids:mkhz1.00000/datastreams/CMDI/content). The underlying conceptual model is depicted in the figure below.
  3. Relationships in the corpora of spoken language are represented by <ResourceProxy> elements only, and partOf relationships are further specified by means of CMDI's built-in <IsPartOf> element (see for example http://repos.ids-mannheim.de/fedora/objects/clarin-ids:folk.FOLK_S_00248.cmdi/datastreams/CMDI/content).

[Figure: conceptual model of the MKHZ representation]

Leipzig

Leipzig's approach to representing relationships is similar to option 2 of IDS (and of BAS). The differences are as follows:

  1. For each relationship there exists a separate Component, which already has a built-in attribute ref of type idrefs.
  2. The description of a relationship is structured rather than just a simple anchor text.
  3. Inverse relationships are not represented explicitly.

Example:

<ResourceProxy id="ulei-11858-00-229C-0000-0001-B06F-3-component-dataprovider-1">
  <ResourceType mimetype="text/xml">Metadata</ResourceType>
  <ResourceRef>http://hdl.handle.net/11858/00-229C-0000-0001-B06F-3@type=dataprovider&amp;id=1</ResourceRef>
</ResourceProxy>
...
<LCC_DataProvider ComponentId="clarin.eu:cr1:c_1381926654509" ref="ulei-11858-00-229C-0000-0001-B06F-3-component-dataprovider-1">
  <Id>1</Id>
  <Name>LCC data provider "www.shortnews.de" in resource with identifier 11858/00-229C-0000-0001-B06F-3</Name>
  <Description xml:lang="eng">Data provider of the Leipzig Corpora Collection: www.shortnews.de</Description>
</LCC_DataProvider>

IMS Stuttgart

This is a list of the relations that we would like to represent in the CMDI records, and how we deal with them so far. Because we were not sure about the recommendations on how to use the 'Resources' component, we have not applied it yet. We are of course interested in using it, so that some of the relations might also be exploited in the VLO:

    • Relation type: trained model -- data set on which the model was trained
      • Current approach: Optional CMDI component 'BasedOn' as part of the profile to describe tool components (CMDI component not in public section yet).
      • Examples:
        • Dutch parameter file for TreeTagger --> trained on Eindhoven corpus
        • German grammar for BitPar --> extracted from Tiger treebank
    • Relation type: versioning
      • Current approach: Element 'Version' in CMDI component 'GeneralInfo' (clarin.eu:cr1:c_1290431694495) and where applicable a common part in the 'ResourceName' to implicitly relate different versions.
      • No examples yet, but this will change in future. Affects all types of resources: corpora, lexicons, tools, web services, ...
    • Relation type: tool -- web service
      • We keep getting mails asking for information and publications on the tools 'behind' the web services. This also affects the action item regarding the CMDI profiles for web services.

Further notes:

  • Making these relations explicit will also mean more maintenance effort to keep the metadata records up to date, so we should discuss where this makes sense and where it creates too much overhead.
  • Some relationships are between 'clearly distinct' resources: 'Source' was mentioned in the Tischvorlage, others might include:
    • 'A is based on B', e.g. when a trainable component (parameter file, language model, ...) has been trained on a specific corpus, or a collection of technical terms has been extracted based on the frequencies of a specific corpus resource (see relation type b.)
    • 'A can be used with B', e.g. when a tool can use an (additional) knowledge base (lexicon, language model, ...), or a corpus is encoded for a specific query engine (see relation type a.)
    • 'A was used for the creation of B', e.g. when manual annotations were added with the help of a specific system, ...
    At the moment, some of these relationships can be expressed in the component section of the CMDI records (e.g. 'Derivation', 'Source'), but they are not exploited in the VLO.

Summary on Representation of Relationships

The conceptual model for relationships is fairly simple:

A relationship

  1. relates two or more resources,
  2. has a semantic type (and possibly semantic roles for the related resources, such as source/target)
  3. has some human-readable description, ranging from simple text to something more structured.
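
The three aspects listed above can be written down as a small data structure. This is only a sketch of the conceptual model, not an encoding used by any center: the class and field names are invented, and "this-record" is a hypothetical placeholder for the record that carries the relationship.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Participant:
    target: str                  # proxy id, URL or PID of a related resource
    role: Optional[str] = None   # optional semantic role, e.g. "source"


@dataclass(frozen=True)
class Relationship:
    participants: tuple[Participant, ...]   # aspect 1: two or more resources
    semantic_type: Optional[str] = None     # aspect 2: e.g. a concept URI
    description: Optional[str] = None       # aspect 3: human-readable text

    def __post_init__(self):
        if len(self.participants) < 2:
            raise ValueError("a relationship relates two or more resources")

    def is_fully_specified(self) -> bool:
        """True if both a semantic type and a description are present."""
        return self.semantic_type is not None and self.description is not None


# A bare ResourceProxy only yields aspect 1.
bare = Relationship((Participant("this-record"), Participant("c_0000000001")))

# A proxy that is further specified in the component part covers all three.
described = Relationship(
    participants=(
        Participant("this-record"),
        Participant("clarind_ids_mkhz1_005720", role="source"),
    ),
    semantic_type="http://purl.org/dc/terms/source",
    description="TUSTEP kuriert (zip, 13.3 MB)",
)
```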

While the built-in mechanisms in CMDI cannot express all aspects of the conceptual model, data providers have found ways to combine CMDI's built-in mechanisms with the full flexibility provided by CMDI components to encode this conceptual model or some variants of it.

This initial attempt at a summary probably cannot do justice to the full solution space.

Relationships between Internal Resources

These relationships are established between resources within a single CMDI record.

Specified by ResourceRelationList

For a discussion of this option see the CMDI 1.2 Issue https://trac.clarin.eu/wiki/CMDI%201.2/Resource%20proxies/ResourceRelation (e.g. solution 2).

I am currently not aware of any data provider in CLARIN-D using this option.

Specified in the Components Part

For an example see https://trac.clarin.eu/wiki/%20VLO%20Taskforce#BAS, second example, which establishes a relationship media-file between a Resource with id=r_0000000001 and an Actor with id=s_0000000001 by way of a component (clarin.eu:cr1:c_1336550377512).

The Resource is listed in the ResourceProxyList, the Actor is specified as an Element in the component part.

The semantic type of this relationship is not specified explicitly, i.e. the component media-file is not associated with a data category.

Relationships from a Resource to other Resources

As the title indicates, these relationships are established from a resource of type Metadata (a CMDI record) to other Resources (of Type Metadata or Resource).

ResourceProxyList

A ResourceProxyList is the most straightforward way to express such relationships. However, it provides no direct means of specifying the semantic type of a relationship or a human-readable description (items 2 and 3 of the conceptual model above).

There seems to be an implicit assumption among some CMDI illuminati that the semantic type of such relationships is something like hasPart. To represent the inverse, CMDI currently provides IsPartOf as a built-in.

ResourceProxyList + Component for further information

It has become evident that there is a need to represent many more kinds of relationships between Resources than just hasPart/isPartOf.

One popular profile (OLAC_DcmiTerms) lists the following relationships, with the corresponding dcr:datcat:

At least one data provider (IDS-Mannheim) has thus chosen to represent the additional semantics of a relationship in the component part, by reusing the semantics of the underlying dcr:datcat categories. See for example: http://hdl.handle.net/10932/00-01B8-AE41-41A4-DC01-5

This approach represents (a) the reference to the resource via a (not supported) attribute ref pointing to the id of a <ResourceProxy>, (b) the semantic role of the relationship via the dcr:datcat attribute in the underlying component specification, and (c) the human-readable description via the content of the element.

The description of (planned) relationships at IMS Stuttgart (https://trac.clarin.eu/wiki/%20VLO%20Taskforce#IMSStuttgart) warrants some further analysis, but generally their overall design pattern seems to be in line with the design pattern followed in this section.

ResourceProxyList + IsPartOfList

TBD: example: http://hdl.handle.net/10932/00-01B8-EDCB-0B8B-D501-C

Summary

There is an obvious need for representing and using more information about relationships than currently supported by the built-in mechanisms in CMDI.

The key question is whether this need is to be fulfilled by (a) extending built-in encodings in CMDI ("easy" for the VLO) or by (b) using CMDI's extensibility features (components + iso/dcr:cat) in a more principled manner (more difficult for the VLO).

For a rather elaborate discussion of the pros and cons of (a) vs. (b) see https://trac.clarin.eu/wiki/CMDI%201.2/Resource%20proxies/ResourceRelation

Proposal

In light of the current representational alternatives sketched above, the discussion at https://trac.clarin.eu/wiki/CMDI%201.2/Resource%20proxies/ResourceRelation, and the discussion in http://www.clarin.eu/system/files/AP3-007-CMDI_and_granularity.pdf, IDS proposes the following "unified" approach to representing relationships.

ResourceProxy Graph must be a DAG

ResourceProxies are used exclusively for representing the constituent structure of a (composite) resource, such as a corpus consisting of subcorpora (or of individual documents with their own metadata). To allow sharing of subresources among resources, the ResourceProxy graph should not be constrained to be a forest of trees, but should rather be a directed acyclic graph (DAG), which is not necessarily connected.

The inverse relationship should (must?) be represented by means of IsPartOfList. (Axel: Why this redundancy? The inverse can always be derived.) (Peter: Indeed, for the VLO, which has a full harvest of all repository metadata, this can in principle be derived. However, when some agent is just accessing the CMDI record for some part of a resource, it does not know to which resource(s) this part belongs. I am fine with dropping this as a requirement and instead just allowing isPartOf relationships to be represented in the component part, like any other relationship. In this case we could also consider dropping IsPartOfList in the header.)

This seems to be consistent with e.g. https://trac.clarin.eu/attachment/wiki/CMDI%201.2/header.pdf, albeit more restrictive.

It has the following implications:

The VLO can rank/sort resources (of type Metadata) by their "depth" in the ResourceProxy graph. "Depth" is defined as the length of the shortest path between the resource and a node of the graph that has no incoming ResourceProxy links (a root).
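
One way to compute this "depth" is a breadth-first search that starts from all roots at once. The sketch below is not part of the VLO. The graph is given as a hypothetical dictionary mapping each record to the records its ResourceProxies point to, and all node names are invented.

```python
from collections import deque


def depths(links: dict[str, list[str]]) -> dict[str, int]:
    """Shortest distance of every reachable node from the nearest root.

    A root is a node without incoming links. Nodes that cannot be reached
    from any root (which only happens if the graph has a cycle) get no entry.
    """
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    has_incoming = {t for targets in links.values() for t in targets}
    roots = nodes - has_incoming

    depth = {root: 0 for root in roots}
    queue = deque(sorted(roots))
    while queue:
        node = queue.popleft()
        for child in links.get(node, []):
            if child not in depth:          # first visit is the shortest path
                depth[child] = depth[node] + 1
                queue.append(child)
    return depth


# Hypothetical graph: doc1 is shared by two subcorpora and by other_corpus.
LINKS = {
    "corpus": ["sub_a", "sub_b"],
    "sub_a": ["doc1"],
    "sub_b": ["doc1"],
    "other_corpus": ["doc1"],
}
result = depths(LINKS)
```

In this example doc1 ends up at depth 1 rather than 2, because it is also a direct part of other_corpus. Sharing of subresources therefore means that depth depends on the whole harvest and not on a single corpus.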

Quality assurance should check the ResourceProxy graph for acyclicity, and all referenced ResourceProxies for availability (dereferenceability).
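
The acyclicity part of such a check could be sketched with Kahn's algorithm, which repeatedly removes nodes without incoming links. Whatever cannot be removed is blocked by a cycle. As before, the dictionary representation and the node names are assumptions made for illustration. The availability check needs network access and is left out.

```python
def blocked_by_cycles(links: dict[str, list[str]]) -> set[str]:
    """Return the nodes that lie on a cycle or are reachable only through one.

    An empty result means the graph is a DAG.
    """
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)

    indegree = {node: 0 for node in nodes}
    for targets in links.values():
        for target in targets:
            indegree[target] += 1

    ready = [node for node in nodes if indegree[node] == 0]
    remaining = set(nodes)
    while ready:
        node = ready.pop()
        remaining.discard(node)
        for target in links.get(node, []):
            indegree[target] -= 1
            if indegree[target] == 0:
                ready.append(target)
    return remaining


ok = blocked_by_cycles({"corpus": ["sub_a", "sub_b"], "sub_a": ["doc1"], "sub_b": ["doc1"]})
bad = blocked_by_cycles({"a": ["b"], "b": ["c"], "c": ["a", "d"]})
```

Here ok is empty, while bad contains a, b and c (the cycle) as well as d, which can only be reached through the cycle.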

All references to ResourceProxies should be PIDs. This is currently not the case for all centers.
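
A very simple check along these lines is sketched below. It assumes that the requirement applies to the <ResourceRef> targets and that "PID" means a Handle, written either with the hdl: prefix or as a URL on hdl.handle.net. Both are assumptions of this sketch, and other PID schemes would need their own rules.

```python
from urllib.parse import urlparse

# Assumption: only Handle PIDs are recognized here.
HANDLE_RESOLVERS = {"hdl.handle.net"}


def looks_like_pid(ref: str) -> bool:
    """Heuristic test whether a <ResourceRef> value is a Handle PID."""
    if ref.startswith("hdl:"):
        return True
    parsed = urlparse(ref)
    return parsed.scheme in ("http", "https") and parsed.netloc.lower() in HANDLE_RESOLVERS


# Two <ResourceRef> values taken from the examples on this page.
hzsk_ref = "http://hdl.handle.net/11858/00-248C-0000-000E-0181-F"
bas_ref = "https://clarin.phonetik.uni-muenchen.de/BASRepository/Public/Corpora/ZIPTEL/0001.2.cmdi.xml"
checks = {ref: looks_like_pid(ref) for ref in (hzsk_ref, bas_ref)}
```

The HZSK reference passes this check and the BAS reference does not, which matches the observation that not all centers use PIDs yet.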

Relationships in the Component Part

Both relationships expressed by ResourceProxies and other relationships may be (further) specified in the component part.

A possible best practice to this end could be as follows: (adapted from: http://hdl.handle.net/10932/00-01B8-AE41-41A4-DC01-5)

<source ref="clarind_ids_mkhz1_005720">TUSTEP kuriert (zip, 13.3 MB)</source>

if the related resource is listed in the ResourceProxyList

<source href="http://hdl.handle.net/10932/00-01B8-AE41-8824-DE01-B">TUSTEP kuriert (zip, 13.3 MB)</source>

if the related resource is not listed in the ResourceProxyList (because of the DAG-constraint).

Here the semantics of the element "source" is defined in the profile specification by the category: http://purl.org/dc/terms/source. The target of the relationship is either the id of a <ResourceProxy>, if available (@ref), or a PID (@href).
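
A consumer such as the VLO could read both variants with the same small routine. The Python sketch below is only an illustration of the proposed practice. The lookup tables are hypothetical: in particular, the URL stored for the proxy id is invented, since the real <ResourceRef> of that proxy is not shown here, and the element-to-category mapping would in reality come from the profile specification.

```python
import xml.etree.ElementTree as ET

# Element name -> category, as it would be defined in the profile specification.
CONCEPTS = {"source": "http://purl.org/dc/terms/source"}

# Proxy id -> <ResourceRef> value. The URL is a made-up placeholder.
PROXIES = {"clarind_ids_mkhz1_005720": "http://example.org/pid/of-proxy-005720"}


def read_relation(xml_text, proxies=PROXIES, concepts=CONCEPTS):
    """Return (semantic type, target, description) for one relation element."""
    el = ET.fromstring(xml_text)
    ref, href = el.get("ref"), el.get("href")
    if ref is not None:
        target = proxies[ref]      # @ref: look up the ResourceProxy
    elif href is not None:
        target = href              # @href: the PID is given directly
    else:
        raise ValueError("relation element needs @ref or @href")
    return concepts.get(el.tag), target, (el.text or "").strip()


via_proxy = read_relation(
    '<source ref="clarind_ids_mkhz1_005720">TUSTEP kuriert (zip, 13.3 MB)</source>'
)
via_pid = read_relation(
    '<source href="http://hdl.handle.net/10932/00-01B8-AE41-8824-DE01-B">'
    "TUSTEP kuriert (zip, 13.3 MB)</source>"
)
```

Both calls yield the same semantic type and description and differ only in how the target is found.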

Last modified on 11/29/14 14:24:37