Changeset 4451
- Timestamp:
- 02/05/14 20:27:38
- Location:
- CMDI-Interoperability/CMD2RDF/trunk/docs/papers/2014-LREC
- Files:
-
- 2 edited
CMDI-Interoperability/CMD2RDF/trunk/docs/papers/2014-LREC/CMD2RDF.bib
r3868 r4451 604 604 } 605 605 606 @STANDARD{ISODIS24622-1_2013, 607 title = {Language resource management -- Component Metadata Infrastructure -- Part 1: The Component Metadata Model (CMDI-1)}, 608 organization = {ISO}, 609 author = {{ISO/DIS 24622-1}}, 610 type = {International Standard}, 611 number = {24622-1}, 612 address = {Geneva, Switzerland}, 613 year = {2013}, 614 url = {http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.htm?csnumber=37336}, 615 abstract = {}, 616 owner = {m}, 617 publisher = {ISO}, 618 timestamp = {2014.02.05} 619 } 620 621 606 622 @comment{jabref-meta: selector_publisher:} 607 623 -
CMDI-Interoperability/CMD2RDF/trunk/docs/papers/2014-LREC/CMD2RDF.tex
r4428 r4451 1 1 \documentclass[10pt, a4paper]{article} 2 2 \usepackage{lrec2006} 3 3 4 \usepackage{color} 4 5 \usepackage{graphicx} … … 14 15 15 16 %%% PAGE DIMENSIONS 16 \usepackage{geometry} % to change the page dimensions17 \geometry{a4paper} % or letterpaper (US) or a5paper or....18 \geometry{margin=2.5cm} % for example, change the margins to 2 inches all round17 %\usepackage{geometry} % to change the page dimensions 18 %\geometry{a4paper} % or letterpaper (US) or a5paper or.... 19 %\geometry{margin=2.5cm} % for example, change the margins to 2 inches all round 19 20 %\topmargin=-0.6in 20 \textheight=700pt21 %\textheight=700pt 21 22 % \geometry{landscape} % set up the page for landscape 22 23 % read geometry.pdf for detailed page layout information … … 32 33 \begin{sffamily} \begin{shaded*} \noindent 33 34 \begin{tabular}{@{\hspace{-1mm}} p{0.3\textwidth} p{0.7\textwidth} } } 35 {\end{tabular} \end{shaded*} \end{sffamily} } 36 37 \newenvironment{example2a} 38 { \footnotesize 39 \begin{sffamily} \begin{shaded*} \noindent 40 \begin{tabular}{@{\hspace{-1mm}} p{0.4\textwidth} p{0.6\textwidth} } } 34 41 {\end{tabular} \end{shaded*} \end{sffamily} } 35 42 … … 56 63 { \end{textit} \normalsize} 57 64 58 \title{From C omponent Metadata to Linked Open Data}65 \title{From CLARIN Component Metadata to Linked Open Data} 59 66 60 67 \name{Matej Durco, Menzo Windhouwer} … … 64 71 matej.durco@assoc.oeaw.ac.at, menzo.windhouwer@dans.knaw.nl\\} 65 72 66 \abstract{To be done ...} 73 \abstract{In the european CLARIN infrastructure a growing number of resources are described with Component Metadata. In this paper we 74 describe a transformation to make this metadata available as linked data. After this first step it becomes possible to connect the CLARIN Component Metadata with other valuable knowledge sources in the Linked Data Cloud. 
\\ \newline \Keywords{Linked Open Data, RDF, metadata}} 67 75 68 76 % … … 73 81 \section{Motivation} 74 82 % 75 Although semantic interoperability has been one of the main motivations for CLARIN Component Metadata Infrastructure (CMDI), until now there has been no work on the obvious -- bringing CMDI to Semantic Web. We believe that providing the whole of CMD data as Linked Open Data (LOD) interlinked with external semantic resources, will open up new dimensions of processing and exploring of the CMD data by employing the power of semantic technologies. In this paper, we lay out how individual parts of the CMD framework can be expressed in RDF andinterlinked with existing external semantic resources (ontologies, knowledge bases, vocabularies).83 Although semantic interoperability has been one of the main motivations for CLARIN's Component Metadata Infrastructure (CMDI), until now there has been no work on the obvious -- bringing CMDI to the Semantic Web. We believe that providing the CLARIN CMD records as Linked Open Data (LOD) interlinked with external semantic resources, will open up new dimensions of processing and exploring of the CMD data by employing the power of semantic technologies. In this paper, we lay out how individual parts of the CMD Infrastructure can be expressed in RDF and made ready to be interlinked with existing external semantic resources (ontologies, knowledge bases, vocabularies). 76 84 %This conversion lays a foundation / is groundwork for providing the original dataset as a \emph{Linked Open Data} nucleus within the \emph{Web of Data}\cite{TimBL2006} as well as for real semantic (ontology-driven) search and exploration of the data. 
77 85 … … 87 95 \hspace{-0.1\textwidth}\includegraphics[width=0.8\textwidth]{CMDM} 88 96 \end{center} 89 \caption{Component Metadata Model }97 \caption{Component Metadata Model \cite{ISODIS24622-1_2013}} 90 98 \label{fig:CMDM} 91 99 \end{figure*} … … 118 126 %So we encounter both situations: one profile being used by many providers and one provider using many profiles. 119 127 120 %121 \section{LOD -- Linked Open Data}122 %123 The main added value of LOD\cite{TimBL2006} is the interconnecting of disparate datasets.124 In the broader context of LOD there is meanwhile an Open Knowledge Foundation's Working Group on Linked Data in Linguistics, that represents an obvious pool of candidate125 datasets to link the CMD data with\footnote{\url{http://linguistics.okfn.org/resources/llod/}}.126 Within these \xne{lexvo} seems most promising starting point, as it features URIs like \url{http://lexvo.org/id/term/eng/}, i.e. based on the ISO-639-3 language identifiers which are also used in CMD records.127 \xne{lexvo} also seems suitable as it is already linked with a number of other LOD linguistic datasets like \xne{WALS}, \xne{lingvoj} and \xne{Glottolog}.128 Of course, language is just one dimension to use for mapping.129 Step by step we will link other categories like countries, geographica, organisations, etc.130 to some of the central nodes of the LOD cloud \cite{Cyganiak2010}, like \xne{dbpedia}, \xne{Yago} or \xne{geonames},131 but also to domain-specific semantic resource like the ontology for language technology \xne{LT-World} \cite{Joerg2010} developed at DFKI.132 133 128 \section{CMD to RDF} 134 129 \label{sec:cmd2rdf} … … 143 138 \subsection{CMD specification}\label{sec:CMDM} 144 139 145 The main entity of the meta model is the CMD component modelled as a \code{rdfs:Class}. A CMD profile is basically a CMD component with some extra features, implying a specialization relation.
It may seem natural to translate a CMD element to a RDF property (as it holds the literal value), but given its complexity (e.g., attributes\footnote{Due to space considerations the abstractwill not further discuss attributes.}, relation to the containing component) it too has to be expressed as \code{rdfs:Class}. The actual literal value is a property of given element of type \code{cmdm:ElementValue}. For values that can be mapped to entities defined in external semantic resources, the references to these entities are expressed in parallel object properties of type \code{cmdm:hasElementEntity} (constituting outbound links). The containment relation between components and elements is expressed with a dedicated property \code{cmdm:contains}.140 The main entity of the meta model is the CMD component modelled as a \code{rdfs:Class}. A CMD profile is basically a CMD component with some extra features, implying a specialization relation. It may seem natural to translate a CMD element to a RDF property (as it holds the literal value), but given its complexity (e.g., attributes\footnote{Due to space considerations we will not further discuss attributes.}, relation to the containing component) it too has to be expressed as \code{rdfs:Class}. The actual literal value is a property of given element of type \code{cmdm:ElementValue}. For values that can be mapped to entities defined in external semantic resources, the references to these entities are expressed in parallel object properties of type \code{cmdm:hasElementEntity} (constituting outbound links). The containment relation between components and elements is expressed with a dedicated property \code{cmdm:contains}. 146 141 147 142 \begin{figure*} … … 160 155 cmdm:contains & a & rdf:Property ; \\ 161 156 & rdfs:domain & cmdm:Component ; \\ 162 & rdfs:range & :Component ,:Element . \\157 & rdfs:range & cmdm:Component , cmdm:Element . 
\\ 163 158 164 159 %cmdm:containsAttribute & a &rdf:Property; 165 % & rdfs:domain & :Component,:Element;166 % & rdfs:range & :Attribute.160 % & rdfs:domain & cmdm:Component, cmdm:Element; 161 % & rdfs:range & cmdm:Attribute. 167 162 168 163 \multicolumn{3}{l}{\# values} \\ … … 178 173 \\ 179 174 cmdm:hasElementEntity & a & rdf:Property ; \\ 180 & rdfs:domain & :Element ; \\181 & rdfs:range & :Entity . \\175 & rdfs:domain & cmdm:Element ; \\ 176 & rdfs:range & cmdm:Entity . \\ 182 177 % \\ 183 178 %\multicolumn{3}{l}{\# analogue for attributes ...} \\ … … 187 182 188 183 %cmdm:hasAttributeEntity & a & rdf:Property ; \\ 189 % & rdfs:domain & :Attribute ; \\190 % & rdfs:range & :Entity . \\184 % & rdfs:domain & cmdm:Attribute ; \\ 185 % & rdfs:range & cmdm:Entity . \\ 191 186 \end{example3} 192 187 \end{center} … … 197 192 \subsection{CMD profile and component definitions} 198 193 This top-level classes and properties are subsequently used for modelling the actual profiles, components and elements as they are defined in the CR. 199 For stand-alone components, the IRI is the exactpath into the CR to get the RDF representation for the profile/component\furl{http://catalog.clarin.eu/ds/ComponentRegistry/rest/registry/components/clarin.eu:cr1:c\_1271859438125/rdf}. For ``inner'' components (that are defined as part of another component) and elements the identifier is a concatenation of the nearest ancestor stand-alone component's IRI and the dot-path to given component/element (e.g., Actor:\\ \code{cr:clarin.eu:cr1:c\_1271859438197/rdf \#Actor.Actor\_Languages.Actor\_Language}\footnote{For the sake of readability, in the examples we will collapse the component IRIs, refering to them just by their name, prefixed with \code{cmd:}})194 For stand-alone components, the IRI is the (future) path into the CR to get the RDF representation for the profile/component\furl{http://catalog.clarin.eu/ds/ComponentRegistry/rest/registry/components/clarin.eu:cr1:c\_1271859438125/rdf}. 
For ``inner'' components (that are defined as part of another component) and elements the identifier is a concatenation of the nearest ancestor stand-alone component's IRI and the dot-path to given component/element (e.g., Actor:\\ \code{cr:clarin.eu:cr1:c\_1271859438197/rdf \#Actor.Actor\_Languages.Actor\_Language}\footnote{For the sake of readability, in the examples we will collapse the component IRIs, refering to them just by their name, prefixed with \code{cmd:}}) 200 195 201 196 \begin{example2} 202 197 cmd:collection \\ 203 $\;$ a & cmdm:Profile ; \\204 $\;$ rdfs:label & "collection" ; \\205 $\;$ dc:identifier & cr:clarin.eu:cr1:p\_1345561703620 . \\198 $\;$ a & cmdm:Profile ; \\ 199 $\;$ rdfs:label & "collection" ; \\ 200 $\;$ dc:identifier & cr:clarin.eu:cr1:p\_1345561703620 . \\ 206 201 cmd:Actor \\ 207 $\;$ a &cmdm:Component . \\202 $\;$ a &cmdm:Component . \\ 208 203 \end{example2} 209 204 … … 214 209 dcr:datcat \\ 215 210 $\;$ a & owl:AnnotationProperty ; \\ 216 $\;$ rdfs:label & "data category"@en ;\\211 $\;$ rdfs:label & "data category"@en . \\ 217 212 % & rdfs:comment & "This resource is equivalent to this data category."@en ; \\ 218 213 % & skos:note & "The data category should be identified by its PID."@en ; \\ … … 223 218 \begin{example2} 224 219 cmd:LanguageName \\ 225 $\;$ dcr:datcat & isocat:DC-2484 . \\220 $\;$ dcr:datcat & isocat:DC-2484 . \\ 226 221 \end{example2} 227 222 … … 254 249 \begin{example3} 255 250 \textless lr1\textgreater \\ 256 $\enspace \,$ a & & cmdm:Resource ; \\257 \multicolumn{2}{l}{cmdm:hasMimeType } & "audio/wav" . \\251 $\enspace \,$ a & & cmdm:Resource ; \\ 252 \multicolumn{2}{l}{cmdm:hasMimeType } & "audio/wav" . \\ 258 253 \end{example3} 259 254 … … 267 262 (Note, that one MD record can describe multiple resources. This can be also easily accommodated in OpenAnnotation.) 
268 263 269 \begin{example2 }264 \begin{example2a} 270 265 \_:anno1 \\ 271 266 $\:$ a & oa:Annotation ; \\ 272 267 $\:$ oa:hasTarget & \textless lr1a \textgreater, \textless lr1b\textgreater ; \\ 273 268 $\:$ oa:hasBody & \_:topComponent1 ; \\ 274 $\:$ oa:motivatedBy& oa:describing . \\275 \end{example2 }269 $\:$ oa:motivatedBy & oa:describing . \\ 270 \end{example2a} 276 271 277 272 \subsubsection{Provenance} … … 282 277 \_:topComponent1 \\ 283 278 $\:$ dc:identifier & \textless lr1.cmd \textgreater ; \\ 284 $\:$ dc:creator & \var{\{cmd:MdCreator\}}; \\279 $\:$ dc:creator & "John Doe" ; \\ 285 280 $\:$ dc:publisher & \textless http://clarin.eu\textgreater ; \\ 286 $\:$ dc:created & \var{\{cmd:MdCreated\}}. \\281 $\:$ dc:created & "2014-02-05"\^{}\^{}xs:date . \\ 287 282 \end{example2} 288 283 289 284 \subsubsection{Collection hierarchy} % ( Resource Proxy â IsPartOf)} 290 285 291 In CMD, there are dedicated generic elements -- the \code{cmd:ResourceProxyList} structure -- used to express both the collection hierarchy and to point to resource(s) described by the CMD record. The collection hierarchy can be modelled as an \xne{OAI-ORE Aggregation}\furl{http://www.openarchives.org/ore/1.0/primer \#Foundations}. (The links to resources are handled by \code{oa:hasTarget}.)286 In CMD, there are dedicated generic elements -- the \code{cmd:ResourceProxyList} structure -- used to express both the collection hierarchy and to point to resource(s) described by the CMD record. The collection hierarchy can be modelled as an \xne{OAI-ORE Aggregation}\furl{http://www.openarchives.org/ore/1.0/primer#Foundations}. (The links to resources are handled by \code{oa:hasTarget}.) 292 287 : 293 288 … … 317 312 \begin{example3} 318 313 \_:actor1 & a & cmd:Actor . \\ 319 \_:actor1lang1 & a & cmd:Actor \\320 & & .Actor\_Language . \\314 \_:actor1lang1 & a & cmd:Actor. \\ 315 & & Actor\_Language . \\ 321 316 \_:actor1 & cmd:contains & \_:actor1lang1 . 
\\ 322 317 \end{example3} … … 338 333 cmd:Person & a & cmdm:Component . \\ 339 334 cmd:Person.Organisation & a & cmdm:Element . \\ 340 cmd:has OrganisationElementValue \\335 cmd:hasPerson.OrganisationElementValue \\ 341 336 & rdfs:subProperyOf & cmdm:hasElementValue ; \\ 342 & rdfs:domain & cmd: Organisation ; \\337 & rdfs:domain & cmd:Person.Organisation ; \\ 343 338 & rdfs:range & xs:string . \\ 344 cmd:has OrganisationElementEntity \\339 cmd:hasPerson.OrganisationElementEntity \\ 345 340 & rdfs:subProperyOf & cmdm:hasElementEntity ; \\ 346 & rdfs:domain & cmd: Organisation ; \\347 & rdfs:range & cmd: OrganisationElementEntity .\\348 cmd: OrganisationElementEntity \\341 & rdfs:domain & cmd:Person.Organisation ; \\ 342 & rdfs:range & cmd:Person.OrganisationElementEntity .\\ 343 cmd:Person.OrganisationElementEntity \\ 349 344 & a & cmdm:Entity . \\ 350 345 \\ … … 353 348 & cmdm:contains & \_:org . \\ 354 349 \_:org & a & cmd:Person.Organisation ; \\ 355 & \multicolumn{2}{l}{cmd:has OrganisationElementValue \quad 'MPI'\^{}\^{}xs:string ;} \\356 & \multicolumn{2}{l}{ cmd:has OrganisationElementEntity \quad <http://www.mpi.nl/> . }\\350 & \multicolumn{2}{l}{cmd:hasPerson.OrganisationElementValue \quad 'MPI'\^{}\^{}xs:string ;} \\ 351 & \multicolumn{2}{l}{ cmd:hasPerson.OrganisationElementEntity \quad <http://www.mpi.nl/> . }\\ 357 352 358 353 <http://www.mpi.nl/> & a & cmd:OrganisationElementEnity . … … 364 359 365 360 366 \subsubsection{Elements, Fields, Values} 361 \subsubsection{Elements, Fields, Values}\label{sec:values} 367 362 Finally, we want to integrate also the actual field values in the CMD records into the linked data. 368 As explained before, CMD elements have to be typed as \code{rdfs:Class}, the actual value expressed as \code{cmds:ElementValue}, and they are related bua a \code{cmdm:hasElementValue} property. 
369 370 While generating triples with literal values seems straightforward, the more challenging but also more valuable aspect is to generate object property triples (predicate \code{cmdm:hasElementEntity}) with the literal values mapped to semantic entities. Following example shows the whole chain of statements from metamodel to literal value and corresponding semantic entity. 371 372 The actual mapping process from values to entities is a complex challenging task and will be tackled in more detail in the full paper. The main idea is to find entities in selected reference datasets (controlled vocabularies, ontologies) corresponding to the literal values in the metadata records. The obtained entity identifiers are further used to generate new RDF triples, representing outbound links. 363 As explained before, CMD elements have to be typed as \code{rdfs:Class}, the actual value expressed as \code{cmds:ElementValue}, and they are related by a \code{cmdm:hasElementValue} property. 364 365 While generating triples with literal values seems straightforward, the more challenging but also more valuable aspect is to generate object property triples (predicate \code{cmdm:hasElementEntity}) with the literal values mapped to semantic entities. The example in Figure \ref{fig:final-example} shows the whole chain of statements from metamodel to literal value and corresponding semantic entity. 366 373 367 374 368 \begin{comment} … … 441 435 \section{Implementation} 442 436 443 The transformation of profiles and instances into RDF/XML is accomplished by a set of XSL-stylesheets, that are currently being tested on a sample dataset. Once ready, they will be integrated into the CMDI core infrastructure, e.g., the CR. 444 %And in the near future, a test on the instances in the complete CLARIN joint metadata domain will be performed. 445 Once the linked data is available it has to be stored and published in a RDF triple store. We will elaborate on this aspect further in the final paper. 
437 The transformation of profiles and instances into RDF/XML is accomplished by a set of XSL-stylesheets. In the future, when the mapping has been tested extensively, they will be integrated into the CMD core infrastructure, e.g., the CR. A linked data representation of the CLARIN joint metadata domain can then be stored in a RDF triple store and exposed via a SPARQL endpoint. 446 438 %The most promising solution seems to be \xne{Virtuoso}, an integrated feature-rich hybrid data store, able to deal with different types of data (``Universal Data Store''). \cite{Haslhofer2011europeana} 447 439 448 440 % Although the distributed nature of the data is one of the defining features of LOD and theoretically one should be able to follow the data by dereferencable URIs, in practice it is mostly necessary to pool into one data store linked datasets from different sources that shall be queried together due to performance reasons. This implies that the data to be kept by the data store will be decisively larger, than ``just'' the original dataset. 449 441 450 \section{Conclusions and Future Work} 451 In this abstract, we sketched the work on encoding of the whole of the CMD data domain in RDF, with special focus on the core model -- the general component schema. In the full paper we will also elaborate on the task of mapping element values to semantic entities. Additionally, some technical considerations will be discussed regarding exposing this dataset as Linked Open Data. 442 % 443 \section{CMDI's future in the LOD Cloud} 444 % 445 The main added value of LOD \cite{TimBL2006} is the interconnecting of disparate datasets in the so called LOD cloud \cite{Cyganiak2010}. 446 447 The actual mapping process from CMDI values (see Section \ref{sec:values}) to entities is a complex and challenging task. The main idea is to find entities in selected reference datasets (controlled vocabularies, ontologies) corresponding to the literal values in the metadata records. 
The obtained entity identifiers are further used to generate new RDF triples, representing outbound links. 448 449 In the broader context of LOD Cloud there is the Open Knowledge Foundation's Working Group on Linked Data in Linguistics, that represents an obvious pool of candidate 450 datasets to link the CMD data with\footnote{\url{http://linguistics.okfn.org/resources/llod/}}. Within these \xne{lexvo} seems a most promising starting point, as it features URIs like \url{http://lexvo.org/id/term/eng/}, i.e. based on the ISO-639-3 language identifiers which are also used in CMD records. 451 \xne{lexvo} also seems suitable as it is already linked with a number of other LOD linguistic datasets like \xne{WALS}, \xne{lingvoj} and \xne{Glottolog}. 452 Of course, language is just one dimension to use for mapping. 453 Step by step we will link other categories like countries, geographica, organisations, etc. 454 to some of the central nodes of the LOD cloud , like \xne{dbpedia}, \xne{Yago} or \xne{geonames}, 455 but also to domain-specific semantic resource like the ontology for language technology \xne{LT-World} \cite{Joerg2010} developed at DFKI. 456 457 \section{Conclusions} 458 In this paper, we sketched the work on encoding of the whole of the CMD data domain in RDF, with special focus on the core model -- the general component schema. In the future we will extend this with mapping element values to semantic entities. 452 459 453 460 With this new enhanced dataset, the groundwork is laid for a full-blown \emph{semantic search}, i.e., the possibility of exploring the dataset indirectly using external semantic resources (like vocabularies of organizations or taxonomies of resource types) to which the CMD data will then be linked.
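Illustrative note on the identifier scheme in the CMD2RDF.tex diff: for ``inner'' components and elements, the identifier is described as the nearest ancestor stand-alone component's IRI concatenated with the dot-path to the component or element. The following is only a minimal sketch of that rule, not the changeset's XSL implementation. The function names `dot_path` and `inner_iri` are hypothetical, and the `#` separator is an assumption read off the paper's `Actor.Actor_Languages.Actor_Language` example.

```python
def dot_path(names):
    """Join nested component/element names into a dot-path."""
    return ".".join(names)


def inner_iri(standalone_iri, names):
    """Build the identifier of an inner component or element.

    standalone_iri: IRI of the nearest ancestor stand-alone component.
    names: names from that component down to the target component/element.
    The '#' separator is an assumption taken from the paper's example.
    """
    if not names:
        raise ValueError("names must contain at least one entry")
    return f"{standalone_iri}#{dot_path(names)}"


if __name__ == "__main__":
    # Collapsed IRI as written in the paper's example (cr: prefix).
    base = "cr:clarin.eu:cr1:c_1271859438197/rdf"
    print(inner_iri(base, ["Actor", "Actor_Languages", "Actor_Language"]))
```

Keeping the dot-path in the fragment part means every inner identifier still resolves to the stand-alone component's RDF document, which seems to be the intent of the scheme.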