% source: CMDI-Interoperability/CMD2RDF/trunk/docs/papers/2014-LREC/CMD2RDF.tex @ 3816
% Last change on this file since 3816 was 3816, checked in by mwindhouwer, 11 years ago
%
% M docs/papers/2014-LREC/CMD2RDF.tex
%   some fixes
%   maybe skip Attributes for the abstract
%   reply to IRI question
%
% M xsl/Component2RDF.xsl
%   added dcterms:identifier

%\documentclass{article}
\documentclass{llncs}
\usepackage{llncsdoc}
\usepackage{color}
\usepackage{graphicx}
\usepackage{amsmath}
\usepackage{framed}

\usepackage{verbatim} % adds environment for commenting out blocks of text & for better verbatim

%\newcommand{\comment}[1]{}
\newcommand{\commentx}[1]{\textcolor{red}{#1}}

%%% PAGE DIMENSIONS
\usepackage{geometry} % to change the page dimensions
\geometry{a4paper} % or letterpaper (US) or a5paper or....
\geometry{margin=2.5cm} % for example, change the margins to 2 inches all round
%\topmargin=-0.6in
\textheight=700pt
% \geometry{landscape} % set up the page for landscape
% read geometry.pdf for detailed page layout information

\newcommand{\code}[1]{\texttt{#1}}
\newcommand{\xne}[1]{\textsf{#1}} % named entity
\newcommand{\furl}[1]{\footnote{\url{#1}}}
\newcommand{\var}[1]{\textrm{\textit{#1}}} % variable, definition

%@{\hspace{-2mm}}
\newenvironment{example2}
{ \footnotesize
\begin{ttfamily} \begin{shaded*} \noindent
\begin{tabular}{p{0.4\textwidth} p{0.6\textwidth} } }
{\end{tabular} \end{shaded*} \end{ttfamily} }

\newenvironment{example3}
{ \footnotesize
\begin{ttfamily} \begin{shaded*} \noindent
\begin{tabular}{@{\hspace{-1mm}} p{0.25\textwidth} p{0.25\textwidth} p{0.45\textwidth}}
}
{ \end{tabular} \end{shaded*} \end{ttfamily} }

\definecolor{shadecolor}{rgb}{0.95,0.95,1.0}

% xml syntax highlighting
% source http://snipt.org/vngf3
\usepackage{listings}

\definecolor{grey}{rgb}{0.4,0.4,0.4}
\definecolor{darkblue}{rgb}{0.0,0.0,0.6}
\definecolor{cyan}{rgb}{0.0,0.6,0.6}


\newenvironment{notex}
{\footnotesize \color{grey} \begin{textit}}
{ \end{textit} \normalsize}


%
\begin{document}

\title{Component Metadata to Linked Open Data}

\author{Matej Durco\inst{1} \and Menzo Windhouwer\inst{2}}

\institute{\email{matej.durco@assoc.oeaw.ac.at}\newline
Institute for Corpus Linguistics and Text Technology (ICLTT), Vienna, Austria
\and
\email{menzo.windhouwer@dans.knaw.nl}\newline
The Language Archive - DANS, The Hague, The Netherlands}

\maketitle
%
\begin{abstract}
The hype/trend to Web of Data...

Although semantic interoperability has been one of the main motivations for the CLARIN Component Metadata Infrastructure (CMDI), until now there has been no work on the obvious -- bringing CMDI to the Semantic Web. We believe that providing the whole of the CMD data as Linked Open Data, linked with external semantic resources, will make it possible to fully exploit the power of semantic technologies and will open a new level of processing and exploring CMD data. In this paper, we propose an expression of the whole CMD data domain (from the meta model to individual metadata records) in RDF.

\commentx{Menzo: I don't think we can express CMD data automatically as an ontology. For that too many semantics are still hidden in CMDI. We are building blocks (e.g., RR/CLAVAS) that might enable us to do so in the future, but I think it's better now to go for CMD as LOD linked into the LOD cloud ...}

\end{abstract}
%
\begin{keywords}
Linked Open Data, RDF, metadata
%metamodel, research infrastructure
\end{keywords}
%
\section{Introduction}
%
\commentx{Not sure how much of the introduction, CMD explain + Status of the data domain we want, may and need to reuse between the two papers...}


The hype/trend to Web of Data...

In this paper, we lay out how the individual parts of the CMD framework can be expressed in RDF and interlinked with existing external semantic resources (ontologies, knowledge bases, vocabularies). This conversion lays the groundwork for providing the original dataset as a \emph{Linked Open Data} nucleus within the \emph{Web of Data}~\cite{TimBL2006}, as well as for real semantic (ontology-driven) search and exploration of the data.

%
\section{The Component Metadata Infrastructure}\label{CMDI}
%
?

%
\subsection{Current status of the joint CMD Domain}
%
To provide a frame of reference for the proportions of the undertaking described in the following section, we give a few numbers about the data in the CMD domain, both on the schema level, i.e. with regard to the defined profiles and the data categories used, and on the instance level, i.e. the actual CMD records.

\subsubsection{CMD Profiles}
In the Component Registry (CR), 133\footnote{All numbers are as of 2013-09 if not stated otherwise.} public profiles and 696 components are defined. Table~\ref{table:dev} shows the development of the CR and DCR population over time.

Next to the `native' CMD profiles, a number of profiles have been created that implement existing metadata formats, like OLAC/DCMI-terms, TEI Header or the META-SHARE schema. The resulting profiles demonstrate the flexibility and expressi\-vi\-ty of the CMD metamodel. The individual profiles also differ considerably in their structure: next to flat profiles with just one level of components or elements and 5 to 20 fields (\textit{dublincore}, \textit{collection}, the set of \textit{Bamdes}-profiles), there are complex profiles with up to 10 levels (\textit{ExperimentProfile}, profiles for describing Web Services) and a few hundred elements. The biggest single profile is currently the remodelled maximum schema from the META-SHARE project~\cite{Gavrilidou2012meta} for describing corpora, with 117 components and 337 elements.

\subsubsection{Instance Data}

The main CLARIN OAI-PMH harvester\footnote{\url{http://catalog.clarin.eu/oai-harvester/}}
collects records from 69 providers on a daily basis. The complete dataset amounts to over half a million records.
Of the providers, 16 offer CMDI records; the other 53 provide OLAC/DC records\label{info:olac-records}, which are converted into the corresponding CMD profile after harvesting. Next to these 81,226 original OLAC records, there are a few providers offering their OLAC or DCMI-terms records already converted into CMDI; all in all, OLAC and DCMI-terms records amount to 139,152.
On the other hand, some of the comparatively few providers of `native' CMD records expose multiple profiles (e.g. the Meertens Institute uses 12 different profiles). So we encounter both situations: one profile being used by many providers and one provider using many profiles.


%
\section{LOD -- Linked Open Data}
%
Linked Data~\cite{TimBL2006}, RDF~\cite{RDF2004}

dbpedia, Yago -- huge compiled knowledge bases to link to...

Ontology for Language Technology: LT-World~\cite{Joerg2010}

LOD cloud, Cyganiak and Jentzsch~\cite{Cyganiak2010}.


\section{CMD to RDF}
\label{sec:cmd2rdf}
In this section, an RDF encoding is proposed for all levels of the CMD data domain:

\begin{itemize}
\item CMD meta model
\item profile definitions
\item the administrative and structural information of CMD records
\item individual values in the fields of the CMD records
\end{itemize}

\subsection{CMD specification}

The main entity of the meta model is the CMD component, modelled as an \code{rdfs:Class}. A CMD profile is basically a CMD component with some extra features, implying a specialization relation. It may seem natural to translate a CMD element to an RDF property (as it holds the literal value), but given its complexity (e.g. attributes, relation to the containing component) it too has to be an \code{rdfs:Class}. The actual literal value is attached to the element by the property \code{cmdm:hasElementValue}. For values that can be mapped to entities defined in external vocabularies or semantic resources, the references to these entities are expressed with the parallel property \code{cmdm:hasElementEntity}. Attributes are modelled analogously with \code{cmdm:Attribute}, \code{cmdm:hasAttributeValue} and \code{cmdm:hasAttributeEntity}.

The containment relation between components and elements is expressed with a dedicated property \code{cmdm:contains}, and analogously for the attributes of individual components and elements with \code{cmdm:containsAttribute}.

\label{table:rdf-spec}
\begin{example3}
\multicolumn{3}{l}{@prefix cmdm: \textless http://www.clarin.eu/cmd/general.rdf\#\textgreater . } \\
\\
\multicolumn{3}{l}{\# basic building blocks of CMD Model} \\
cmdm:Component & a & rdfs:Class . \\
cmdm:Profile & rdfs:subClassOf & cmdm:Component . \\
cmdm:Element & a & rdfs:Class . \\
%cmdm:Attribute & a & rdfs:Class . \\
\\
\multicolumn{3}{l}{\# basic CMD nesting} \\
cmdm:contains & a & rdf:Property ; \\
& rdfs:domain & cmdm:Component ; \\
& rdfs:range & cmdm:Component , cmdm:Element . \\
%
%cmdm:containsAttribute & a &rdf:Property;
% & rdfs:domain & :Component, :Element;
% & rdfs:range & :Attribute.
%
\multicolumn{3}{l}{\# values} \\
cmdm:Value & a & rdfs:Literal . \\
\\
cmdm:hasElementValue & a & rdf:Property ; \\
& rdfs:domain & cmdm:Element ; \\
& rdfs:range & cmdm:Value . \\
\\
\multicolumn{3}{l}{\# add a parallel separate class/property for the resolved entities} \\
cmdm:Entity & a & rdfs:Class . \\
\\
cmdm:hasElementEntity & a & rdf:Property ; \\
& rdfs:domain & cmdm:Element ; \\
& rdfs:range & cmdm:Entity . \\
\\
%cmdm:hasAttributeValue & a & rdf:Property ; \\
% & rdfs:domain & cmdm:Attribute ; \\
% & rdfs:range & rdfs:Literal . \\
%
%cmdm:hasAttributeEntity & a & rdf:Property ; \\
% & rdfs:domain & :Attribute ; \\
% & rdfs:range & :Entity . \\
\end{example3}

\noindent
These entities are used for modelling the actual profiles, components and elements as they are defined in the Component Registry.
For stand-alone (top) components, the IDs as issued by the Component Registry can be used as entity IRIs. For ``inner'' components (that are defined as part of another component) and for elements, the identifier is a concatenation of the IRI of the parent top component and the dot-path to the given component or element (Actor: \code{cr:clarin.eu:cr1:c\_1271859438197/rdf\#Actor\_Languages.Actor\_Language}).
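
\noindent
Spelled out for the example above, the identifier is a plain concatenation of two parts. The \code{\#} separator is simply read off the example IRI; whether it is the same for every component is an assumption here:

\begin{example2}
IRI of the parent top component & cr:clarin.eu:cr1:c\_1271859438197/rdf \\
dot-path to the inner component & Actor\_Languages.Actor\_Language \\
\multicolumn{2}{l}{$\rightarrow$ cr:clarin.eu:cr1:c\_1271859438197/rdf\#Actor\_Languages.Actor\_Language} \\
\end{example2}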

\commentx{Matej: shouldn't we add the name of the component in the IRI for human-readability?
similar to how it is generated in profile XSDs: \textless xs:simpleType name="simpletype-MimeType-clarin.eu.cr1.c\_1290431694511"\textgreater }

\commentx{Menzo: the IRI is the exact path into the CR to get the RDF representation for the profile/component. I think it should stay like that because you need to be able to fetch it to get, for example, the dcr:datcat mappings. Actually the profile/component name is there as it's (in general) the first component name after the '\#'.}

\label{table:rdf-cmd}
\begin{example3}
cmd:collection & a & cmdm:Profile; \\
& rdfs:label & "collection"; \\
& dcterms:identifier & cr:clarin.eu:cr1:p\_1345561703620. \\
\multicolumn{3}{l}{cr:clarin.eu:cr1:c\_1271859438197\#Actor} \\
& a & cmdm:Component. \\
\end{example3}

\commentx{Menzo: we need more context for inner components. In the example LanguageName looks well defined, but take a Component/Element like Title. Is it the title of a book or the title of a person? Only when the semantics are clear, e.g., with a dcr:datcat, one can ignore the context and collapse all Components/Elements to a single RDF class/property.}
\commentx{Matej: wouldn't that be remedied by cmdm:contains? or is it too much inferencing?}

\begin{notex}
Menzo: inner components don't have IDs so I propose a path built from the context up to a shareable component (we need some nice term for that, in the TDS I called it a top notion so maybe a top component). The cmd prefix also needs to be bound to a component specific URI. This URI contains the top component ID, e.g., \furl{http://catalog.clarin.eu/ds/ComponentRegistry/rest/registry/components/clarin.eu:cr1:c\_1271859438125/rdf}.
\end{notex}

\subsection{Data Categories}
Windhouwer~\cite{Windhouwer2012_LDL} proposes to use the data categories as annotation properties,
so as to avoid too strong semantic implications.

\begin{example3}
dcr:datcat & a & owl:AnnotationProperty ; \\
& rdfs:label & "data category"@en ; \\
& rdfs:comment & "This resource is equivalent to this data category."@en ; \\
& skos:note & "The data category should be identified by its PID."@en . \\
\end{example3}

This implies that the \code{@ConceptLink} attribute, which is used on CMD elements and components in the CMD profiles to reference the data category, would be modelled as:

\begin{example3}
cmd:LanguageName & dcr:datcat & isocat:DC-2484. \\
\end{example3}

\begin{comment}
Encoding data categories as annotation properties is in contrast to the common approach seen with dublincore terms
used usually directly as data properties:

\begin{example3}
<lr1> & dc:title & "Language Resource 1"
\end{example3}

However, we argue against direct mapping of complex data categories to data properties and in favour of modelling data categories as annotation properties.
In a specific (OWL 2) application the relation with the data categories can be expressed as \code{owl:equivalentClass} for classes, \code{owl:equivalentProperty} for properties or \code{owl:sameAs} for individuals:

\begin{example3}
\#myPOS & owl:equivalentClass & isocat:DC-1345. \\
\#myPOS & owl:equivalentProperty & isocat:DC-1345. \\
\#myNoun & owl:sameAs & isocat:DC-1333. \\
\end{example3}

\end{comment}

\subsection{RELcat - Ontological relations}
As described in Section~\ref{CMDI}, relations between data categories are not stored directly in the \xne{ISOcat} DCR, but rather in a dedicated module, the Relation Registry \xne{RELcat}. The relations are grouped into relation sets and stored as RDF triples~\cite{SchuurmanWindhouwer2011}. The following sample relation is taken from the \xne{CMDI} relation set, which expresses a number of equivalences between \xne{ISOcat} data categories and \xne{dublincore} terms:

\begin{example3}
isocat:DC-2538 & rel:sameAs & dct:date
\end{example3}

\noindent
By design, the relations in the Relation Registry are not expressed with predicates from known vocabularies like \xne{SKOS} or \xne{OWL}, again with the aim of avoiding too strong semantic implications. This leaves leeway for further specialization of the relations in specific applications. The \code{rel:*} properties can be understood as an upper layer of a taxonomy of relation types, which implies that the more specific predicates are subtypes of them:

\begin{example3}
owl:sameAs & rdfs:subPropertyOf & rel:sameAs
\end{example3}
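
\noindent
The practical effect of a subtyping statement between relation types follows from RDFS entailment (rule \emph{rdfs7}): every triple stated with the sub-property also holds for the super-property. A minimal illustration with placeholder names (\code{ex:narrowSame}, \code{ex:broadSame}, \code{<a>} and \code{<b>} are introduced here for illustration and are not part of RELcat):

\begin{example3}
ex:narrowSame & rdfs:subPropertyOf & ex:broadSame . \\
<a> & ex:narrowSame & <b> . \\
$\rightarrow$ \\
<a> & ex:broadSame & <b> . \\
\end{example3}

\noindent
Data that only asserts the specific predicate can thus still be queried through the general one, but not the other way round.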

\commentx{Menzo: I would use owl:sameAs rdfs:subPropertyOf rel:sameAs. I see the rel:* properties as an upper layer of a taxonomy of relation types. The RELcat types are loose and the OWL ones specific, hence the subtyping. In RELcat you might also query multiple graphs with multiple vocabularies various 'same-as' properties then still need to be distinguishable but the general rel:sameAs need to be created.}

\commentx{Matej: strip this stipulations - rest of the subsection or just short referrer to SPIN rules ?}
\begin{comment}
Is this correct:
?? That means, that to be able to infer that a value in a CMD element also pertains to a given data category, e.g.:

\begin{example2}
cmd:PublicationYear = 2012 $\rightarrow$ & dc:created = 2012
\end{example2}

\commentx{Menzo: yes. I do have some of the SPIN rules somewhere to generate those. My idea is that one takes a dcr:datcat annotated graph. This can be using OWL or SKOS or any other RDF vocabulary. This base graph should have been expanded depending on the reasoning one uses, i.e., all entailments are in place. The dcr:datcat can then be translated into rel:sameAs and all equivalences get expanded, so one can also query using ISOcat DCs.}

\noindent
following facts need to be present in the ontology :

\begin{example3}
<lr1> & cmd:PublicationYear & 2012\^{}\^{}xs:year \\
cmd:PublicationYear & owl:equivalentProperty & isocat:DC-2538 \\
isocat:DC-2538 & rel:sameAs & dc:created \\
owl:sameAs & rdfs:subPropertyOf & rel:sameAs \\
$\rightarrow$ \\
<lr1> & dc:created & 2012\^{}\^{}xs:year \\
\end{example3}
\end{comment}


%%%%%%%%%%%%%%%%%%%%%
\subsection{CMD instances}
In the next step, we want to express the individual CMD instances, i.e. the metadata records.

\subsubsection{Resource Identifier}

\commentx{Matej: I still yearn for something like cmdm:Resource and cmdm:MDRecord}
\begin{example3}
<lr1> & a & cmdm:Resource . \\
<lr1.cmd> & a & cmdm:MDRecord .
\end{example3}

It seems natural to use the PID of a Language Resource (\code{<lr1>}) as the resource identifier for the subject in the RDF representation. While this seems semantically sound, not every resource has a PID. (This is especially the case for ``virtual'' resources like collections, which are solely defined by their constituents and don't have any data of their own.) As a fall-back, the PID of the MD record (\code{<lr1.cmd>}, taken from the \code{cmd:MdSelfLink} element) could be used as the resource identifier.
If identifiers are present for both resource and metadata, the relationship between the resource and the metadata record can be expressed as an annotation using the \xne{OpenAnnotation} vocabulary\furl{http://openannotation.org/spec/core/core.html\#Motivations}.
Note also that one MD record can describe multiple resources; this can easily be accommodated in OpenAnnotation:

\commentx{Menzo: also there can be multiple resource proxies. Maybe we can use an RDF list?}

\begin{example3}
\_:anno1 & a & oa:Annotation ; \\
& oa:hasTarget & <lr1a>, <lr1b> ; \\
& oa:hasBody & <lr1.cmd> ; \\
& oa:motivatedBy & oa:describing . \\
\end{example3}
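
\noindent
With this encoding, the metadata records describing a given resource can be retrieved with a simple SPARQL query. The following is a sketch only: \code{<lr1a>} is the placeholder identifier from the example above and would have to be replaced by the actual PID, and the \code{oa:} namespace IRI is given as we understand the OpenAnnotation specification:

{\footnotesize
\begin{verbatim}
PREFIX oa: <http://www.w3.org/ns/oa#>

SELECT ?record
WHERE {
  ?anno a oa:Annotation ;
        oa:hasTarget   <lr1a> ;
        oa:hasBody     ?record ;
        oa:motivatedBy oa:describing .
}
\end{verbatim}
}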

\subsubsection{Provenance}

The information from \code{cmd:Header} represents the provenance information about the modelled data:

\begin{example3}
<lr1.cmd> & dcterms:identifier & <lr1.cmd> ; \\
& dcterms:creator & \var{\{cmd:MdCreator\}} ; \\
& dcterms:publisher & <http://clarin.eu> ; \\
& dcterms:created & \var{\{cmd:MdCreated\}} . \\
\end{example3}

\subsubsection{Hierarchy (Resource Proxy -- IsPartOf)}
In CMD, the \code{cmd:ResourceProxyList} structure is used both to express the collection hierarchy and to point to the resource(s) described by the MD record. This can be modelled as an \xne{OAI-ORE Aggregation}\furl{http://www.openarchives.org/ore/1.0/primer\#Foundations}:

\begin{example3}
<lr0.cmd> & a & ore:ResourceMap . \\
<lr0.cmd> & ore:describes & <lr0.agg> . \\
<lr0.agg> & a & ore:Aggregation ; \\
& ore:aggregates & <lr1.cmd>, <lr2.cmd> . \\
\end{example3}
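
\noindent
Under this modelling, the collection hierarchy below a record can be traversed with a SPARQL~1.1 property path. The query below is a sketch that assumes every collection-level MD record is linked to its aggregation as in the example above; \code{<lr0.cmd>} is the placeholder identifier used there:

{\footnotesize
\begin{verbatim}
PREFIX ore: <http://www.openarchives.org/ore/terms/>

SELECT ?member
WHERE {
  <lr0.cmd> (ore:describes/ore:aggregates)+ ?member .
}
\end{verbatim}
}

\noindent
Each step of the path leads from a resource map to its aggregation and on to the aggregated records, so nested collections are followed to arbitrary depth.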

\commentx{Matej: Should both collection hierarchy and resource-pointers (collection and resource MD records) be encoded as ore:Aggregation?}

\begin{comment}
This is rather complicated: skip this?:
Additionally the flat header field \code{cmd:MdCollectionDisplayName} has been introduced to indicate by simple means the collection, of which given resource is part.
This information can be used to generate a separate one-level grouping of the resources, in which the value from the \code{cmd:MdCollectionDisplayName} element would be used as the label of an otherwise undefined \code{ore:ResourceMap}.
Even the identifier/ URI for this collections is not clear. Although this collections should match with the ResourceProxy hierarchy, there is no guarantee for this, thus a 1:1 mapping cannot be expected.

\begin{example3}
\_:mdcoll & a & ore:ResourceMap; \\
& rdfs:label & "Collection 1"; \\
\_:mdcoll\#aggreg & a & ore:Aggregation \\
& ore:aggregates & <lr1.cmd>, <lr2.cmd>; \\
\end{example3}
\end{comment}

\subsubsection{Components -- nested structures}
For expressing the tree structure of the CMD records, i.e. the containment relation between the components, the dedicated property \code{cmdm:contains} is used:

\begin{example3}
\_:actor1 & a & cmd:Actor . \\
?? <lr1> ? & cmdm:contains & \_:actor1 . \\
?? <lr1.cmd> ? & cmdm:contains & \_:actor1 . \\
\end{example3}

\subsection{Elements, Fields, Values}
Finally, we also want to integrate the actual field values of the CMD records into the ontology.
As explained before, CMD elements have to be typed as \code{rdfs:Class}, with the actual value expressed through the \code{cmdm:hasElementValue} property and the corresponding data category expressed as an annotation property.

While generating triples with literal values seems straightforward, the more challenging but also more valuable aspect is to generate object property triples in which the literal values are mapped to semantic entities. The following example shows the whole chain of statements from the metamodel to the literal value. The mapping process is detailed in Section~\ref{sec:values2entities}.

\begin{example3}
cmd:Person & a & cmdm:Component . \\
cmd:Person.Organisation & a & cmdm:Element . \\
\multicolumn{3}{l}{cmd:hasOrganisationElementValue} \\
& rdfs:subPropertyOf & cmdm:hasElementValue ; \\
& rdfs:domain & cmd:Person.Organisation ; \\
& rdfs:range & xs:string . \\
\multicolumn{3}{l}{cmd:hasOrganisationElementEntity} \\
& rdfs:subPropertyOf & cmdm:hasElementEntity ; \\
& rdfs:domain & cmd:Person.Organisation ; \\
& rdfs:range & cmd:OrganisationElementEntity .\\
\\
\multicolumn{3}{l}{\# person (mentioned in a MD record) has an affiliation (cmd:Person/cmd:Organisation) } \\
\_:pers & a & cmd:Person ; \\
& cmdm:contains & \_:org . \\
\_:org & a & cmd:Person.Organisation ; \\
& \multicolumn{2}{l}{cmd:hasOrganisationElementValue \quad 'MPI'\^{}\^{}xs:string ;} \\
& \multicolumn{2}{l}{ cmd:hasOrganisationElementEntity \quad <http://mpi.nl> . }\\
%
<http://mpi.nl> & a & cmd:OrganisationElementEntity .
\end{example3}
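
\noindent
Once values are linked to entities, records can be selected by entity rather than by string. The following SPARQL query over the example data is a sketch; the IRI bound to the \code{cmd:} prefix is a placeholder, since the binding of that prefix is not fixed here:

{\footnotesize
\begin{verbatim}
PREFIX cmdm: <http://www.clarin.eu/cmd/general.rdf#>
PREFIX cmd:  <http://example.org/cmd#>

SELECT ?pers ?label
WHERE {
  ?pers a cmd:Person ;
        cmdm:contains ?org .
  ?org  cmd:hasOrganisationElementEntity <http://mpi.nl> ;
        cmd:hasOrganisationElementValue  ?label .
}
\end{verbatim}
}

\noindent
The query returns the person nodes affiliated with the entity together with the literal form found in each record, so that spelling variants of the same organisation end up in one result set.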

\begin{comment}
\begin{example3}
cmd:timeCoverage & a & cmds:Element \\
cmd:timeCoverageValue & a & cmds:ElementValue \\
cmd:timeCoverage & dcr:datcat & isocat:DC-2502 \\
<lr1> & cmd:contains & \_:timeCoverage1 \\
\_:timeCoverage1 & a & cmd:timeCoverage \\
\_:timeCoverage1 & cmd:timeCoverageValue & "19th century" \\
\end{example3}

\commentx{Menzo: no need to repeat dcr:datcat in the instance.}

\begin{example3}
\var{cmds:Element} & \var{cmds:ElementValue\_?} & \var{xsd:anyURI}\\
\_:organisation1 & cmd:OrganisationValue\_? & <org1> \\
\end{example3}

\begin{notex}
Don't we need a separate property (predicate) for the triples with object properties pointing to entities,
i.e. \code{cmd:Organisation\_} additionally to \code{cmd:Organisation}
\end{notex}
\end{comment}


%%%%%%%%%%%%%%%%%
\section{Mapping field values to semantic entities}
\label{sec:values2entities}

\commentx{this is probably definitely too much for one abstract - so we could just announce the need for this mapping process.}

This task is a prerequisite for expressing the CMD instance data in RDF as well. The main idea is to find entities in selected reference datasets (controlled vocabularies, ontologies) that match the literal values in the metadata records. The entity identifiers obtained are then used to generate new RDF triples. The task involves the following steps:

\begin{enumerate}
\item identify appropriate controlled vocabularies for individual metadata fields or data categories (manual task)
\item extract the \emph{distinct (data category, value) pairs} from the metadata records
\item \textbf{look up} the individual literal values in the given reference data (as indicated by the data category) to retrieve candidate entities or concepts
\item assess the reliability of the match
\item generate new RDF triples with entity identifiers as object properties
\end{enumerate}

This task is basically an application of ontology mapping: for our ``anonymous'' concepts we try to find semantically equivalent concepts in other semantic resources or vocabularies.
% This is almost equivalent to the definition of ontology mapping function as given by \cite{EhrigSure2004, amrouch2012survey}: ``for each concept (node) in ontology A [tries to] find a corresponding concept (node), which has the same or similar semantics, in ontology B and vice verse''.

The transformation of the data has been partly described in the previous section. The data can be converted into RDF triples automatically in a straightforward way:

\begin{example3}
\_:organisation1 & \multicolumn{2}{l}{cmd:hasOrganisationElementValue \quad 'MPI'\^{}\^{}xs:string .} \\
\end{example3}

However, for the needs of the mapping task we propose to reduce and rewrite the data so as to retrieve the distinct concept-value pairs:

\begin{example3}
\_:1 & a & cmd:OrganisationElementEntity ; \\
& skos:altLabel & "MPI" .
\end{example3}
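
\noindent
The distinct (data category, value) pairs of step~2 can be extracted directly from the RDF representation. The following SPARQL sketch assumes that element classes carry \code{dcr:datcat} annotations and that the element-specific value properties are declared as sub-properties of \code{cmdm:hasElementValue}, as proposed in Section~\ref{sec:cmd2rdf}; the \code{dcr:} namespace IRI is an assumption:

{\footnotesize
\begin{verbatim}
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX cmdm: <http://www.clarin.eu/cmd/general.rdf#>
PREFIX dcr:  <http://www.isocat.org/ns/dcr.rdf#>

SELECT DISTINCT ?datcat ?value
WHERE {
  ?element   a          ?elemClass ;
             ?valueProp ?value .
  ?valueProp rdfs:subPropertyOf cmdm:hasElementValue .
  ?elemClass dcr:datcat ?datcat .
}
\end{verbatim}
}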

\subsubsection{Identify vocabularies}

One generic way to indicate vocabularies for given metadata fields or data categories, currently being discussed in the CMD community, is to use a dedicated annotation property in the schema or data category definition (tentatively labelled \code{@clavas:vocabulary}, cf. \emph{CMD 1.2}).

The primary providers of relevant vocabularies are \xne{ISOcat} and \xne{CLAVAS} -- a service for managing and providing vocabularies in SKOS format. However, in general we have to consider a number of different sources.

\subsubsection{Lookup}

In abstract terms, the lookup function takes as input the identifier of a data category (or CMD element) and a literal string value, and returns a list of potentially matching entities. Before the actual lookup, some string-normalizing preprocessing may be necessary.

%\begin{definition}[{signature of the lookup function}]
\begin{equation}
\mathit{lookup}(\mathit{DataCategory},\ \mathit{Literal}) \quad \mapsto \quad (\mathit{Concept} \mid \mathit{Entity})^{*}
\end{equation}
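
\noindent
One way to make the preprocessing step explicit is to factor the lookup through a normalisation function. The names $\mathit{norm}$ and $\mathit{match}$ are introduced here for illustration only; they do not refer to an existing implementation:

\begin{equation}
\mathit{lookup}(d,\ l) \quad = \quad \mathit{match}(d,\ \mathit{norm}(l))
\end{equation}

\noindent
Here $d$ is a data category, $l$ a literal, $\mathit{norm}$ maps a literal to a normalized form (e.g. trimmed and case-folded), and $\mathit{match}$ returns the candidate entities for the normalized string in the reference data configured for $d$. For the organisation value used in the earlier example, with $d_{\mathit{org}}$ as a placeholder for the relevant data category, one would hope for a result such as

\begin{equation}
\mathit{lookup}(d_{\mathit{org}},\ \text{``MPI''}) \quad = \quad (\ \texttt{<http://mpi.nl>}\ )
\end{equation}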
%\end{definition}

In the implementation, additional initial configuration input is needed that identifies the datasets for the given data categories;
this configuration is the result of the previous step:


%\begin{definition}{Required configuration data indicating data category to available }
\begin{equation}
\mathit{DataCategory} \quad \mapsto \quad \mathit{SemanticResource}^{+}
\end{equation}
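
\noindent
A concrete instance of this configuration could be stated in RDF itself. The sketch below reuses the tentative \code{clavas:vocabulary} label mentioned above; its use as an RDF predicate and the vocabulary IRI are assumptions made for illustration:

\begin{example3}
isocat:DC-2484 & clavas:vocabulary & <http://example.org/vocab/languages> . \\
\end{example3}

\noindent
Here the data category referenced by \code{cmd:LanguageName} is pointed to a single vocabulary; several objects for one data category would correspond to the $+$ in the mapping above.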
%\end{definition}


As for the implementation, in the initial setup the system could resort to the \code{find}-interface provided by \xne{OpenSKOS}.
However, in the long term a more general solution is required, a kind of hybrid \emph{vocabulary proxy service} that allows searching in a number of datasets, many of them distributed and available via different interfaces.

\subsubsection{Candidate evaluation}
The lookup is the most sensitive step in the process, being the gate between ``strings'' and semantic entities. In general, the resulting candidates cannot be seen as reliable matches and should undergo further scrutiny to ensure that the match is semantically correct. In some situations these ambiguities can be resolved algorithmically, but in many cases human curation of the generated data will ultimately be required.

%One example: A lookup with the pair \code{<organization, "Academy of sciences">} would probably return a list of organizations, as there is a national Academy of Sciences, in a number of countries. It would require further heuristics, e.g. checking the corresponding department, contact or -- less reliably -- the language of the described resource, to determine which specific Academy of Sciences is meant in given resource description.

\section{Implementation}

The transformation of profiles and instances into RDF/XML is accomplished by a set of XSL stylesheets.
Once the data is available, it has to be stored and published in an RDF triple store. The most promising solution seems to be \xne{Virtuoso}, an integrated, feature-rich hybrid data store able to deal with different types of data (``Universal Data Store'')~\cite{Haslhofer2011europeana}.

% Although the distributed nature of the data is one of the defining features of LOD and theoretically one should be able to follow the data by dereferencable URIs, in practice it is mostly necessary to pool into one data store linked datasets from different sources that shall be queried together due to performance reasons. This implies that the data to be kept by the data store will be decisively larger, than ``just'' the original dataset.


\section{Conclusions and Future Work}
In this paper, we proposed an encoding of the whole of the CMD data domain in RDF, with special focus on the core model, the general component schema. Additionally, some technical considerations were discussed regarding exposing this dataset as Linked Open Data and the implications for real semantic, ontology-based data exploration.
In the near future, a test with the whole CMD dataset will be performed,
followed by work on mapping values to entities.

With this new enhanced dataset, the groundwork is laid for the full-blown semantic search as proposed in the original goals, i.e. the possibility of exploring the dataset using external semantic resources.
The user can access the data indirectly by browsing external vocabularies or taxonomies with which the data will be linked, such as vocabularies of organizations or taxonomies of resource types.



\bibliographystyle{splncs}
\bibliography{CMD2RDF}

\end{document}
