Changeset 3818 for CMDI-Interoperability


Timestamp:
10/20/13 21:22:19 (11 years ago)
Author:
vronk
Message:

cleaned up paper, reflecting latest additions to general.ttl (:Resource)

Location:
CMDI-Interoperability/CMD2RDF/trunk
Files:
2 edited

  • CMDI-Interoperability/CMD2RDF/trunk/data/general.ttl

    r3817 r3818  
    7171#COMMENT Matej: I still yearn for something like cmdm:Resource and cmdm:MDRecord
    7272#COMMENT Menzo: Added cmdm:Resource
     73
     74#COMMENT matej: isn't there a loop now? Is it meant like this?:
     75# <lr1> a  :Resource; :hasResourceType :Resource.
     76# <lr1.cmd> a  :Resource; :hasResourceType  :Metadata.
     77                               
  • CMDI-Interoperability/CMD2RDF/trunk/docs/papers/2014-LREC/CMD2RDF.tex

    r3816 r3818  
    7171\maketitle
    7272%
    73 \begin{abstract}
    74 The hype/trend to Web of Data...
    75 
    76 Although semantic interoperability has been one of the main motivation for CLARIN Component Metadata Infrastructure, until now there has been no work on the obvious -- bringing CMDI to Semantic Web. We believe that providing the whole of CMD data  as Linked Open Data linked with external semantic resources, will allow to fully exploit the power of semantic technologies and opens a new level of processing and exploring of CMD data. In this paper, we propose an expression of the whole of the CMD data domain (from meta model to individual metadata records) in RDF.
    77 
    78 \commentx{Menzo: I don't think we can express CMD data automatically as an ontology. For that too many semantics are still hidden in CMDI. We are building blocks (e.g., RR/CLAVAS) that might enable us to do so in the future, but I think its better now to go for CMD as LOD linked into the LOD cloud ...}
    79 
    80 \end{abstract}
     73%\begin{abstract}
     74%\end{abstract}
    8175%
    8276\begin{keywords}
     
    8579\end{keywords}
    8680%
    87 \section{Introduction}
    88 %
    89 \commentx{Not sure how much of the introduction, CMD explain + Status of the data domain we want, may and need to reuse between the two papers...}
    90 
    91 
    92 The hype/trend to Web of Data...
    93 
    94 In this paper, we lay out how individual parts of the CMD framework can be expressed in RDF interlinked with existing external semantic resources (ontologies, knowledge bases, vocabularies). This conversion lays a foundation / is groundwork for providing the original dataset as a \emph{Linked Open Data} nucleus within the \emph{Web of Data}\cite{TimBL2006} as well as for real semantic (ontology-driven) search and exploration of the data.
     81\section{Motivation}
     82%
     83Although semantic interoperability has been one of the main motivations for the CLARIN Component Metadata Infrastructure, until now there has been no work on the obvious -- bringing CMDI to the Semantic Web. We believe that providing the whole of the CMD data as Linked Open Data, linked with external semantic resources, will open a whole new level of processing and exploring of the CMD data by employing the power of semantic technologies. In this paper, we lay out how individual parts of the CMD framework can be expressed in RDF and interlinked with existing external semantic resources (ontologies, knowledge bases, vocabularies).
     84%This conversion lays a foundation / is groundwork for providing the original dataset as a \emph{Linked Open Data} nucleus within the \emph{Web of Data}\cite{TimBL2006} as well as for real semantic (ontology-driven) search and exploration of the data.
    9585
    9686%
     
    131121\section{CMD to RDF}
    132122\label{sec:cmd2rdf}
    133 In this section, RDF encoding is proposed for all levels of the CMD data domain:
     123In the following, an RDF encoding is proposed for all levels of the CMD data domain:
    134124
    135125\begin{itemize}
     
    144134The main entity of the meta model is the CMD component, modelled as \code{rdfs:Class}. A CMD profile is basically a CMD component with some extra features, implying a specialization relation. It may seem natural to translate a CMD element to an RDF property (as it holds the literal value), but given its complexity (e.g. attributes, relation to the containing component) it too has to be an \code{rdfs:Class}. The actual literal value is a property of the given element of type \code{cmdm:ElementValue}. For values that can be mapped to entities defined in external vocabularies/semantic resources, the references to these entities are expressed in parallel properties of type \code{cmdm:ElementEntity}. The attributes are modelled analogously with \code{cmdm:Attribute, cmdm:AttributeValue, cmdm:AttributeEntity}.
    145135
    146 The containment relation between components and elements is expressed with a dedicated property \code{cmdm:contains}, again analogously for attributes of individual components and elements \code{cmdm:containsAttribute}.
     136The containment relation between components and elements is expressed with a dedicated property \code{cmdm:contains}; attributes of individual components and elements are bound with \code{cmdm:containsAttribute}.
    147137
    148138\label{table:rdf-spec}
     
    156146%cmdm:Attribute & a & rdfs:Class .  \\
    157147\\
    158 \multicolumn{3}{l}{\# basic CMD nexting}  \\
     148\multicolumn{3}{l}{\# basic CMD nesting}  \\
    159149cmdm:contains & a & rdf:Property ; \\
    160150        & rdfs:domain & cmdm:Component ; \\
     
    191181\noindent
    192182These entities are used for modelling the actual profiles, components and elements as they are defined in the Component Registry.
    193 For stand-alone/top components, the IDs as issued by Component Registry can be used as entity IRIs. For ``inner'' components (that are defined as part of another component) and elements the identifier is a concatenation of the parent top component and dot-path to given component/element (Actor: \code{cr:clarin.eu:cr1:c\_1271859438197/rdf\#Actor\_Languages.Actor\_Language}).
    194 
    195 \commentx{Matej: shouldn't we add the name of the component in the IRI for human-readability?
    196 similar to how it is generated in profile XSDs: \textless xs:simpleType name="simpletype-MimeType-clarin.eu.cr1.c\_1290431694511"\textgreater }
    197 
    198 \commentx{Menzo: the IRI is the exact path into the CR to get the RDF representation for the profile/component. I think it should stay like that because you need to be able to fetch it to get, for example, the dcr:datcat mappings. Actually the profile/component name is there as its (in general) the first component name after the '\#'.}
     183For stand-alone/top components, the IRI is the exact path into the CR to get the RDF representation for the profile/component. For ``inner'' components (that are defined as part of another component) and elements, the identifier is a concatenation of the parent top component IRI and the dot-path to the given component/element (Actor: \code{cr:clarin.eu:cr1:c\_1271859438197/rdf\#Actor.Actor\_Languages.Actor\_Language}).\footnote{For the sake of readability, we will collapse the component IRIs, stripping them of the identifier and referring to them just by their name, prefixed with \code{cmd:}.}
    199184
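The IRI scheme described above can be illustrated with a minimal Python sketch. It is not part of the CMD2RDF stylesheets; the function name `inner_component_iri` is hypothetical, and the `#` separator is assumed from the Actor example in the text.

```python
from typing import Sequence


def inner_component_iri(top_component_iri: str, path: Sequence[str]) -> str:
    """Build the IRI of an inner component or element.

    top_component_iri: IRI of the stand-alone (top) component, i.e. the
        path into the Component Registry that returns its RDF representation.
    path: component/element names from the top component down to the
        target, joined with dots after the '#' (assumed from the example).
    """
    if not path:
        # No inner path: the top component is identified by its own IRI.
        return top_component_iri
    return f"{top_component_iri}#{'.'.join(path)}"


# Reproduces the Actor example given in the text.
print(inner_component_iri(
    "cr:clarin.eu:cr1:c_1271859438197/rdf",
    ["Actor", "Actor_Languages", "Actor_Language"],
))
```

Keeping the top component IRI as the part before the fragment means the identifier stays fetchable from the Component Registry, which is the property argued for in the text.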
    200185\label{table:rdf-cmd}
     
    208193
    209194\commentx{Menzo: we need more context for inner components. In the example LanguageName looks well defined, but take a Component/Element like Title. Is it the title of a book or the title of a person. Only when the semantics are clear, e.g., with a dcr:datcat, one can ignore the context and collapse all Components/Elements to a single RDF class/property.}
    210 \commentx{Matej: wouldn't that be remedied by cmdm:contains? or is it too much inferencing?}
    211195
    212196\begin{notex}
    213197Menzo: inner components don't have IDs so I propose a path built from the context up to a shareable component (we need some nice term for that, in the TDS I called it a top notion so maybe a top component). The cmd prefix also needs to be bound to a component specific URI. This URI contains the top component ID, e.g., \furl{http://catalog.clarin.eu/ds/ComponentRegistry/rest/registry/components/clarin.eu:cr1:c\_1271859438125/rdf}.
    214198\end{notex}
     199\commentx{Matej: see above - treated sufficiently?}
    215200
    216201\subsection{Data Categories}
     
    231216\end{example3}
    232217
    233 \begin{comment}
    234 Encoding data categories as annotation properties is in contrast to the common approach seen with dublincore terms
    235 used usually directly as data properties:
    236 
    237 \begin{example3}
    238 <lr1> & dc:title & "Language Resource 1"
    239 \end{example3}
    240 
    241 However, e argue against direct mapping of complex data categories to data properties and in favour of modelling data categories as annotation properties,
    242 In a specific (OWL 2) application the relation with the data categories can be expressed as \code{owl:equivalentClass} for classes, \code{owl:equivalentProperty} for properties or \code{owl:sameAs} for individuals:
    243 
    244 \begin{example3}
    245 \#myPOS & owl:equivalentClass & isocat:DC-1345. \\
    246 \#myPOS & owl:equivalentProperty & isocat:DC-1345. \\
    247 \#myNoun & owl:sameAs & isocat:DC-1333. \\
    248 \end{example3}
    249  
    250 \end{comment}
    251218
    252219\subsection{RELcat - Ontological relations}
    253 As described in \ref{CMDI}, relations between data categories are not stored directly in the \xne{ISOcat} DCR, but rather in a dedicated module the Relation Registry \xne{RELcat}. The relations here are grouped into relation sets and stored as RDF triples\cite{SchuurmanWindhouwer2011}. A sample relation from the \xne{CMDI} relation set expressing a number of equivalences between \xne{ISOcat} data categories and \xne{dublincore} terms:
     220\commentx{for now we could probably skip all of relcat (although it is the future of semantic mapping ;) - we spare something for the next paper.}
     221
     222As described in \ref{CMDI}, relations between data categories are not stored directly in the \xne{ISOcat} DCR, but rather in a dedicated module, the Relation Registry \xne{RELcat}. The relations here are grouped into relation sets and stored as RDF triples \cite{SchuurmanWindhouwer2011}. The following sample from the \xne{CMDI} relation set expresses a number of equivalences between \xne{ISOcat} data categories and \xne{dublincore} terms:
    254223
    255224\begin{example3}
     
    265234
    266235\commentx{Menzo: I would use owl:sameAs rdfs:subPropertyOf rel:sameAs. I see the rel:* properties as an upper layer of a taxonomy of relation types. The RELcat types are loose and the OWL ones specific, hence the subtyping. In RELcat you might also query multiple graphs with multiple vocabularies; the various 'same-as' properties then still need to be distinguishable, but the general rel:sameAs needs to be created.}
    267 
    268 \commentx{Matej: strip this stipulations - rest of the subsection or just short referrer to SPIN rules ?}
    269 \begin{comment}
    270 Is this correct:
    271 ?? That means, that to be able to infer that a value in a CMD element also pertains to a given data category, e.g.:
    272 
    273 \begin{example2}
    274  cmd:PublicationYear = 2012 $\rightarrow$ & dc:created = 2012
    275 \end{example2}
    276 
    277 \commentx{Menzo: yes. I do have some of the SPIN rules somewhere to generate those. My idea is that one takes a dcr:datcat annotated graph. This can be using OWL or SKOS or any other RDF vocabulary. This base graph should have been expanded depending on the reasoning one uses, i.e., all entailments are in place. The dcr:datcat can then be translated into rel:sameAs and all equivalences get expanded, so one can also query using ISOcat DCs.}
    278 
    279 \noindent
    280 following facts need to be present in the ontology :
    281 
    282 \begin{example3}
    283 <lr1> & cmd:PublicationYear & 2012\^{}\^{}xs:year \\
    284 cmd:PublicationYear &  owl:equivalentProperty & isocat:DC-2538 \\
    285 isocat:DC-2538 & rel:sameAs & dc:created \\
    286 owl:sameAs & rdfs:subPropertyOf &  rel:sameAs \\
    287 $\rightarrow$ \\
    288 <lr1> & dc:created & 2012\^{}\^{}xs:year \\
    289 \end{example3}
    290 \end{comment}
    291236
    292237
     
    295240In the next step, we want to express the individual CMD instances, the metadata records.
    296241
     242We provide a generic top level class for all resources (including metadata records), the \code{cmdm:Resource},
     243with \code{cmdm:hasResourceType} and  \code{cmdm:hasMimeType} predicates to type the resources.
     244
     245\commentx{not sure about the use of \code{cmdm:Resource}. both as rdf:type and :hasResourcetype? - see comment in general.ttl}
     246
     247\begin{example3}
     248<lr1> & a  & cmdm:Resource; \\
     249& cmdm:hasResourceType & cmdm:Resource. \\
     250<lr1.cmd> & a  & cmdm:Resource; \\
     251& cmdm:hasResourceType & cmdm:Metadata. \\
     252\end{example3}
     253
    297254\subsubsection {Resource Identifier}
    298 
    299 \commentx{Matej: I still yearn for something like cmdm:Resource and cmdm:MDRecord}
    300 \begin{example3}
    301 <lr1> & a  & cmdm:Resource; \\
    302 <lr1.cmd> & a & cmdm:MDRecord;
    303 \end{example3}
    304255
    305256It seems natural to use the PID of a Language Resource ( \code{<lr1>} ) as the resource identifier for the subject in the RDF representation. While this seems semantically sound, not every resource has to have a PID. (This is especially the case for ``virtual'' resources like collections, that are solely defined by their constituents and don't have any data on their own.) As a fall-back the PID of the MD record ( \code{<lr1.cmd>}  from \code{cmd:MdSelfLink} element) could be used as the resource identifier.
    306257If identifiers are present for both resource and metadata, the relationship between the resource and the metadata record can be expressed as an annotation using the \xne{OpenAnnotation} vocabulary\furl{http://openannotation.org/spec/core/core.html\#Motivations}.
    307 (Note also, that one MD record can describe multiple resources, this can be also easily accommodated in OpenAnnotation:
     258(Note that one MD record can describe multiple resources; this can also be easily accommodated in OpenAnnotation.)
    308259
    309260\commentx{Menzo: also there can be multiple resource proxies. Maybe we can use an RDF list?}
     
    359310\begin{example3}
    360311\_:actor1  & a & cmd:Actor . \\
    361 ?? <lr1> ? & cmd:contains & \_:actor1 . \\
    362 ?? <lr1.cmd> ? & cmd:contains & \_:actor1 . \\
     312\_:actor1lang1  & a & cmd:Actor.Actor\_Language . \\
     313\_:actor1  & cmd:contains & \_:actor1lang1 . \\
     314\end{example3}
     315
     316Additionally, we have to hook the top component to its containing metadata record.
     317
     318\begin{example3}
     319\_:coll1 & a & cmd:collection. \\
     320\_:coll1 & cmdm:describesResource & <lr1.cmd> . \\
    363321\end{example3}
    364322
     
    367325As explained before, CMD elements have to be typed as \code{rdfs:Class}, the actual value expressed as \code{cmds:ElementValue} property and the corresponding data category expressed as annotation property.
    368326
    369 While generating triples with literal values seems straightforward, the more challenging but also more valuable aspect is to generate object property triples with the literal values mapped to semantic entities. Following example show the whole chain of statements from metamodel to literal value. The mapping process is detailed in \ref{sec:values2entities}.
     327While generating triples with literal values seems straightforward, the more challenging but also more valuable aspect is to generate object property triples (predicate \code{cmdm:hasElementEntity}) with the literal values mapped to semantic entities. The following example shows the whole chain of statements from the metamodel to the literal value and the corresponding semantic entity. The mapping process is detailed in the next section, \ref{sec:values2entities}.
    370328
    371329\begin{example3}
     
    391349\end{example3}
    392350
    393 \begin{comment}
    394 \begin{example3}
    395 cmd:timeCoverage  & a   & cmds:Element \\
    396 cmd:timeCoverageValue & a & cmds:ElementValue \\
    397 cmd:timeCoverage  & dcr:datcat  & isocat:DC-2502 \\
    398 <lr1> & cmd:contains & \_:timeCoverage1 \\
    399 \_:timeCoverage1 & a & cmd:timeCoverage \\
    400 \_:timeCoverage1 & cmd:timeCoverageValue & "19th century" \\
    401 \end{example3}
    402 
    403 \commentx{Menzo: no need to repeat dcr:datcat in the instance.}
    404 
    405 \begin{example3}
    406 \var{cmds:Element} & \var{cmds:ElementValue\_?} & \var{xsd:anyURI}\\
    407 \_:organisation1 & cmd:OrganisationValue\_? & <org1> \\
    408 \end{example3}
    409 
    410 \begin{notex}
    411 Don't we need a separate property (predicate) for the triples with object properties pointing to entities,
    412 i.e. \code{cmd:Organisation\_} additionally to \code{cmd:Organisation}
    413 \end{notex}
    414 \end{comment}
    415 
    416351
    417352%%%%%%%%%%%%%%%%%
     
    421356\commentx{this is probably definitely too much for one abstract  - so we could just announce the need for this mapping process.}
    422357
    423 This task is a prerequisite to be able to express also the CMD instance data in RDF. The main idea is to find entities in selected reference datasets (controlled vocabularies, ontologies) matching the literal values in the metadata records. The obtained entity identifiers are further used to generate new RDF triples. It involves following steps:
     358This task is a prerequisite to be able to express also the CMD instance data in RDF. The main idea is to find entities in selected reference datasets (controlled vocabularies, ontologies) matching the literal values in the metadata records. The obtained entity identifiers are further used to generate new RDF triples, representing outbound links.
     359
     360It involves the following steps:
    424361
    425362\begin{enumerate}
     
    434371% This is almost equivalent to the definition of ontology mapping function as given by \cite{EhrigSure2004, amrouch2012survey}: ``for each concept (node) in ontology A [tries to] find a corresponding concept (node), which has the same or similar semantics, in ontology B and vice verse''.
    435372
    436 The transformation of the data has been partly described in previous section. It can be trivially automatically converted into RDF triples as :
    437 
    438 \begin{example3}
    439 \_:organisation1 & \multicolumn{2}{l}{cmd:hasOrganisationElementValue \quad 'MPI'\^{}\^{}xs:string ;} \\
    440 \end{example3}
    441 
    442 However for the needs of the mapping task we propose to reduce and rewrite to retrieve distinct concept-value pairs:
     373\subsubsection{Identify vocabularies}
     374
     375One generic way to indicate vocabularies for given metadata fields or data categories, currently being discussed in the CMD community, is to use a dedicated annotation property in the schema or data category definition (tentatively labeled \code{@clavas:vocabulary}, cf: \emph{CMD 1.2}).
     376
     377The primary providers of relevant vocabularies are \xne{ISOcat} and \xne{CLAVAS} – a service for managing and providing vocabularies in SKOS format. However, in general we have to consider a number of different sources.
     378
     379\subsubsection{Extract input data}
     380Starting from the literal triples as defined in the previous section (\code{cmdm:hasElementValue}), we aggregate the element values to retrieve distinct \emph{concept-value pairs}:
    443381
    444382\begin{example3}
     
    447385\end{example3}
    448386
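The aggregation step can be sketched in Python as follows. This is only an illustration under assumptions: the input is taken to be a flat list of (concept, literal) tuples already extracted from the \code{cmdm:hasElementValue} triples, the function name `distinct_concept_value_pairs` is hypothetical, and the occurrence counts are an addition that the text does not require.

```python
from collections import Counter
from typing import Iterable, Tuple


def distinct_concept_value_pairs(
    element_values: Iterable[Tuple[str, str]]
) -> Counter:
    """Collapse (concept, literal) occurrences into distinct pairs.

    Each key of the returned Counter is one distinct concept-value pair;
    the count says how often it occurred in the input (an assumption made
    here so that frequent values could be prioritised in the lookup).
    """
    return Counter(element_values)


# Hypothetical input: the same literal occurring in several records.
occurrences = [
    ("cmd:Organisation", "MPI"),
    ("cmd:Organisation", "MPI"),
    ("cmd:Organisation", "Max Planck Institute"),
]
pairs = distinct_concept_value_pairs(occurrences)
for (concept, literal), count in pairs.items():
    print(concept, repr(literal), count)
```

Working on distinct pairs rather than on every triple means each literal is looked up once per concept, however many records contain it.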
    449 \subsubsection{Identify vocabularies}
    450 
    451 One generic way to indicate vocabularies for given metadata fields or data categories being discussed in the CMD community is to use dedicated annotation property in the schema or data category definition (tentatively labeled \code{@clavas:vocabulary}, cf: \emph{CMD 1.2}).
    452 
    453 The primary provider of relevant vocabularies is \xne{ISOcat} and \xne{CLAVAS} – a service for managing and providing vocabularies in SKOS format. However, in general we have to assume/consider a number of different sources.
    454 
    455387\subsubsection{Lookup}
    456388
    457 In abstract term, the lookup function takes as input the identifier of data category (or CMD element) and a literal string value and returns a list of potentially matching entities. Before actual lookup, there may have to be some string-normalizing preprocessing.
     389In abstract terms, the lookup function takes as input the identifier of a data category (or CMD element) and a literal string value, and returns a list of potentially matching entities, ideally with some confidence score. Before the actual lookup, some string-normalizing preprocessing may be necessary.
    458390
    459391%\begin{definition}[{signature of the lookup function}]
    460392\begin{equation}
    461 lookup \ ( \ DataCategory \ ,  \ Literal \ )  \quad \mapsto \quad ( \ Concept \ | \ Entity \ )*
     393lookup \ ( \ DataCategory \ ,  \ Literal \ )  \quad \mapsto \quad ( \ \textless  Concept \ | \ Entity  ,\ confidenceScore \textgreater \ )*
    462394\end{equation}
    463395%\end{definition}
    464396
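A toy Python sketch of a function with this signature is given below. It is not the proposed implementation: the in-memory `vocabularies` mapping stands in for the configured reference datasets, the data category key and entity IRI are hypothetical, and the string similarity from the standard library's `difflib` is used merely as a placeholder confidence score.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Tuple

# data category -> entity IRI -> known labels (all values hypothetical)
Vocabularies = Dict[str, Dict[str, List[str]]]


def normalize(literal: str) -> str:
    """String-normalizing preprocessing: collapse whitespace, fold case."""
    return " ".join(literal.split()).casefold()


def lookup(
    data_category: str,
    literal: str,
    vocabularies: Vocabularies,
    threshold: float = 0.8,
) -> List[Tuple[str, float]]:
    """Return (entity IRI, confidence) candidates, best match first."""
    needle = normalize(literal)
    candidates = []
    for entity_iri, labels in vocabularies.get(data_category, {}).items():
        # Confidence of an entity = similarity of its best-matching label.
        score = max(
            (SequenceMatcher(None, needle, normalize(label)).ratio()
             for label in labels),
            default=0.0,
        )
        if score >= threshold:
            candidates.append((entity_iri, score))
    return sorted(candidates, key=lambda c: c[1], reverse=True)


vocabularies = {
    "example:Organisation": {
        "http://example.org/org/mpi": ["MPI", "Max Planck Institute"],
    },
}
print(lookup("example:Organisation", "  mpi ", vocabularies))
```

The `vocabularies` argument plays the role of the initial configuration input that identifies datasets for given data categories; a data category without a configured dataset simply yields an empty candidate list.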
    465397In the implementation, there needs to be additional initial configuration input, identifying datasets for given data categories,
    466 which will be the result of the previous step \
     398which will be the result of the previous step -- identification of vocabularies. \
    467399
    468400
     
    475407
    476408As for the implementation, in the initial setup the system could resort to the \code{find}-interface provided by \xne{OpenSKOS}.
    477 However, in the long term a more general solution is required, a kind of hybrid \emph{vocabulary proxy service} that allows to search in a number of datasets, many of them distributed and available via different interfaces.
     409However, in the long term a more general solution is required, a kind of hybrid \emph{vocabulary proxy service} that allows searching in a number of datasets, many of them distributed and available via varying interfaces.
    478410
    479411\subsubsection{Candidate evaluation}
     
    484416\section{Implementation}
    485417
    486 The transformation of profiles and instances into RDF/XML is accomplished by a set of XSL-stylesheets.
    487 Once the data is available it has to be stored and published in a RDF triple store. The most promising solution seems to be \xne{Virtuoso}, a integrated feature-rich hybrid data store, able to deal with different types of data (``Universal Data Store''). \cite{Haslhofer2011europeana}
     418The transformation of profiles and instances into RDF/XML is accomplished by a set of XSL stylesheets that are currently being tested on a sample dataset. In the near future, a test with the whole CMD dataset will be performed.
     419
     420Once the data is available, it has to be stored and published in an RDF triple store. The most promising solution seems to be \xne{Virtuoso}, an integrated feature-rich hybrid data store, able to deal with different types of data (``Universal Data Store'') \cite{Haslhofer2011europeana}.
    488421
    489422% Although the distributed nature of the data is one of the defining features of LOD and theoretically one should be able to follow the data by dereferencable URIs, in practice it is mostly necessary to pool into one data store linked datasets from different sources that shall be queried together due to performance reasons. This implies that the data to be kept by the data store will be decisively larger, than ``just'' the original dataset.
     
    491424
    492425\section{Conclusions and Future Work}
    493 In this paper, we proposed an encoding of the whole of the CMD data domain in RDF, with special focus on the core model the general component schema. Additionally, some technical considerations were discussed regarding exposing this dataset as Linked Open Data and the implications for real semantic ontology-based data exploration.
    494 In the near future, a test with the whole CMD dataset will be performed.
    495 And work on mapping values to entities.
    496 
    497 With this new enhanced dataset, the groundwork is laid for the full-blown semantic search as proposed in the original goals, i.e. the possibility of exploring the dataset using external semantic resources.
    498 The user can access the data indirectly by browsing external vocabularies/taxonomies, with which the data will be linked like vocabularies of organizations or taxonomies of resource types.
    499 
     426In this abstract, we sketched the work on encoding of the whole of the CMD data domain in RDF, with special focus on the core model -- the general component schema and the mapping of element values to semantic entities. Additionally, some technical considerations were discussed regarding exposing this dataset as Linked Open Data.
     427
     428With this new enhanced dataset, the groundwork is laid for a full-blown \emph{semantic search}, i.e. the possibility of exploring the dataset indirectly using external semantic resources (like vocabularies of organizations or taxonomies of resource types) to which the CMD data will then be linked.
    500429
    501430
     
    503432\bibliography{CMD2RDF}
    504433
    505 \end{document}
     434\end{document}