source: CMDI-Interoperability/CMD2RDF/trunk/docs/papers/2014-LREC/CMD2RDF.tex @ 3862

1	%\documentclass{article}
2	\documentclass{llncs}
3	\usepackage{llncsdoc}
4	\usepackage{color}
5	\usepackage{graphicx}
6	\usepackage{amsmath}
7	\usepackage{framed}
8
9	\usepackage{verbatim} % adds environment for commenting out blocks of text & for better verbatim
10
11	%\newcommand{\comment}[1]{}
12	\newcommand{\commentx}[1]{\textcolor{red}{#1}}
13
14	%%% PAGE DIMENSIONS
15	\usepackage{geometry} % to change the page dimensions
16	\geometry{a4paper} % or letterpaper (US) or a5paper or....
17	\geometry{margin=2.5cm} % for example, change the margins to 2 inches all round
18	%\topmargin=-0.6in
19	\textheight=700pt
20	% \geometry{landscape} % set up the page for landscape
21	% read geometry.pdf for detailed page layout information
22
23	\newcommand{\code}[1]{\texttt{#1}}
24	\newcommand{\xne}[1]{\textsf{#1}} % named entity
25	\newcommand{\furl}[1]{\footnote{\url{#1}}}
26	\newcommand{\var}[1]{\textrm{\textit{#1}}} % variable, definition
27
28	%@{\hspace{-2mm}}
29	\newenvironment{example2}
30	{ \footnotesize
31	\begin{ttfamily} \begin{shaded*} \noindent
32	\begin{tabular}{p{0.4\textwidth} p{0.6\textwidth} } }
33	{\end{tabular} \end{shaded*} \end{ttfamily} }
34
35	\newenvironment{example3}
36	{ \footnotesize
37	\begin{ttfamily} \begin{shaded*} \noindent
38	\begin{tabular}{@{\hspace{-1mm}} p{0.25\textwidth} p{0.25\textwidth} p{0.45\textwidth}}
39	}
40	{ \end{tabular} \end{shaded*} \end{ttfamily} }
41
42	\definecolor{shadecolor}{rgb}{0.95,0.95,1.0}
43
44	% xml syntax highlighting
45	% source http://snipt.org/vngf3
46	\usepackage{listings}
47
48	\definecolor{grey}{rgb}{0.4,0.4,0.4}
49	\definecolor{darkblue}{rgb}{0.0,0.0,0.6}
50	\definecolor{cyan}{rgb}{0.0,0.6,0.6}
51
52
53	\newenvironment{notex}
54	{\footnotesize \color{grey} \begin{textit}}
55	{ \end{textit} \normalsize}
56
57
58	%
59	\begin{document}
60
61	\title{Component Metadata to Linked Open Data}
62
63	\author{Matej Durco\inst{1} \and Menzo Windhouwer\inst{2}}
64
65	\institute{\email{matej.durco@assoc.oeaw.ac.at}\newline
66	Institute for Corpus Linguistics and Text Technology (ICLTT), Vienna, Austria
67	\and
68	\email{menzo.windhouwer@dans.knaw.nl}\newline
69	The Language Archive - DANS, The Hague, The Netherlands}
70
71	\maketitle
72	%
73	%\begin{abstract}
74	%\end{abstract}
75	%
76	\begin{keywords}
77	Linked Open Data, RDF, metadata
78	%metamodel, research infrastructure
79	\end{keywords}
80	%
81	\section{Motivation}
82	%
83	Although semantic interoperability has been one of the main motivations for the CLARIN Component Metadata Infrastructure (CMDI), until now there has been no work on the obvious next step -- bringing CMDI to the Semantic Web. We believe that providing the whole of the CMD data as Linked Open Data, linked with external semantic resources, will open up a whole new level of processing and exploring the CMD data by employing the power of semantic technologies. In this paper, we lay out how individual parts of the CMD framework can be expressed in RDF and interlinked with existing external semantic resources (ontologies, knowledge bases, vocabularies).
84	%This conversion lays a foundation / is groundwork for providing the original dataset as a \emph{Linked Open Data} nucleus within the \emph{Web of Data}\cite{TimBL2006} as well as for real semantic (ontology-driven) search and exploration of the data.
85
86	%
87	\section{The Component Metadata Infrastructure}\label{CMDI}
88	%
89
90	The natural building blocks of CMDI are components. Components are used to group elements and attributes, which can take values, and also other components. A coherent component, e.g., a component to capture information on a contact person or one for project information, can be reused and is stored for that purpose in the Component Registry (CR). A metadata modeller selects components and combines them into a profile targeted at a specific resource type, a collection of resources or a project, tool or service. A profile can be used as the schema for a metadata record. CLARIN centres offer these CMD records to the joint metadata domain. There are some generic tools which operate on all the CMD records in this domain, e.g., the Virtual Language Observatory. These tools have to deal with the variety of CMD profiles. They can do so by operating on a semantic level, as components, elements and values can all be annotated with links to concepts in various registries. Currently used registries are the Dublin Core metadata %elements and
91	terms and the ISOcat Data Category Registry. These concept links allow profiles, while being diverse in structure, to share semantics. Generic tools can use these semantics to overcome differences in terminology and also in structure.
92
93	\begin{figure*}
94	\begin{center}
95	\hspace{-0.1\textwidth}\includegraphics[width=0.8\textwidth]{CMDM}
96	\end{center}
97	\caption{Component Metadata Model}
98	\label{fig:CMDM}
99	\end{figure*}
100
101
102	%
103	\subsection{Current status of the joint CMD Domain}
104	%
105	To provide a frame of reference for the proportions of the undertaking, this section gives a few numbers about the data in the CMD domain.
106	%, both on the schema level, i.e. with regard to the defined profiles and data categories used, as well as on the instance level, the actual CMD records.
107
108	\subsubsection{CMD Profiles }
109	The CR currently defines 133 public profiles and 772 components.
110	In addition to the `native' CMD profiles, a number of profiles have been created that implement existing metadata formats, such as OLAC/DCMI-terms, TEI Header or the META-SHARE schema.
111	%The resulting profiles proof the flexibility/expressi\-vi\-ty of the CMD metamodel.
112	The individual profiles also differ considerably in their structure -- next to simple flat profiles
113	%with just one level of components or elements with 5 to 20 fields (\textit{dublincore}, \textit{collection}, the set of \textit{Bamdes}-profiles)
114	there are complex ones with up to 10 levels %(\textit{ExperimentProfile}, profiles for describing Web Services)
115	and a few hundred elements.
116	%The biggest single profile is currently the remodelled maximum schema from the META-SHARE project \cite{Gavrilidou2012meta} for describing corpora, with 117 components and 337 elements.
117
118	\subsubsection{Instance Data}
119
120	The main CLARIN OAI-PMH harvester\footnote{\url{http://catalog.clarin.eu/oai-harvester/}}
121	regularly collects records from the providers -- currently over 550,000 records from 69 providers.
122	Of these providers, 16 offer CMDI records; the other 53 provide around 140,000 OLAC/DC records, which are converted into the corresponding CMD profile.
123	%Next to these 81.226 original OLAC records, there a few providers offering their OLAC or DCMI-terms records already converted into CMDI, thus all in all OLAC, DCMI-terms records amount to 139.152.
124	On the other hand, some of the comparatively few providers of `native' CMD records expose multiple profiles (e.g. the Meertens Institute uses 12 different ones), so that all in all there is instance data for more than 60 profiles.
125	%So we encounter both situations: one profile being used by many providers and one provider using many profiles.
126
127	%
128	\section{LOD -- Linked Open Data}
129	%
130	The main added value of LOD\cite{TimBL2006} is the interconnecting of disparate datasets.
131	In the broader context of LOD, the Open Knowledge Foundation's Working Group on Open Data in Linguistics now provides an obvious pool of candidate
132	datasets to link the CMD data with\footnote{\url{http://linguistics.okfn.org/resources/llod/}}.
133	Among these, \xne{lexvo} seems the most promising starting point, as it features URIs like \url{http://lexvo.org/id/term/eng/}, i.e. for the ISO-639-3 language identifiers, which are also used in CMD records.
134	\xne{lexvo} also seems suitable because it is already linked with a number of LDL datasets, among others \xne{WALS}, \xne{lingvoj} and \xne{Glottolog}.
135	Of course, language is just one dimension to use for linking/mapping.
136	Step by step we will link other categories like countries, geolocations, organisations, etc.
137	to some of the central nodes of the LOD cloud \cite{Cyganiak2010}, like \xne{dbpedia}, \xne{Yago} or \xne{geonames},
138	but also to domain-specific semantic resources like the ontology for language technology \xne{LT-World} \cite{Joerg2010} developed at DFKI.
139
140	\section{CMD to RDF}
141	\label{sec:cmd2rdf}
142	In the following, an RDF encoding is proposed for all levels of the CMD data domain:
143	\begin{itemize}
144	\item CMD meta model
145	\item profile definitions
146	\item the administrative and structural information of CMD records
147	\item individual values in the fields of the CMD records
148	\end{itemize}
149
150	\subsection{CMD specification}\label{sec:CMDM}
151
152	The main entity of the meta model is the CMD component, modelled as an \code{rdfs:Class}. A CMD profile is basically a CMD component with some extra features, implying a specialization relation. It may seem natural to translate a CMD element to an RDF property (as it holds the literal value), but given its complexity (e.g. attributes\footnote{Due to space considerations the remainder of the paper will not discuss attributes.}, relation to the containing component) it too has to be an \code{rdfs:Class}. The actual literal value is attached to a given element via the property \code{cmdm:hasElementValue}. For values that can be mapped to entities defined in external vocabularies or semantic resources, the references to these entities are expressed in the parallel property \code{cmdm:hasElementEntity}. The containment relation between components and elements is expressed with a dedicated property \code{cmdm:contains}.
153
154	\label{table:rdf-spec}
155	\begin{example3}
156	@prefix cmdm: \textless http://www.clarin.eu/cmd/general.rdf\#\textgreater . \\
157	\\
158	\multicolumn{3}{l}{\# basic building blocks of CMD Model} \\
159	cmdm:Component & a & rdfs:Class . \\
160	cmdm:Profile & rdfs:subClassOf & cmdm:Component . \\
161	cmdm:Element & a & rdfs:Class . \\
162	%cmdm:Attribute & a & rdfs:Class . \\
163	\\
164	\multicolumn{3}{l}{\# basic CMD nesting} \\
165	cmdm:contains & a & rdf:Property ; \\
166	& rdfs:domain & cmdm:Component ; \\
167	& rdfs:range & cmdm:Component , cmdm:Element . \\
168
169	%cmdm:containsAttribute & a &rdf:Property;
170	% & rdfs:domain & :Component, :Element;
171	% & rdfs:range & :Attribute.
172
173	\multicolumn{3}{l}{\# values} \\
174
175	cmdm:Value & a & rdfs:Literal . \\
176	\\
177	cmdm:hasElementValue & a & rdf:Property ; \\
178	& rdfs:domain & cmdm:Element ; \\
179	& rdfs:range & cmdm:Value . \\
180	\\
181	\multicolumn{3}{l}{\# add a parallel separate class/property for the resolved entities} \\
182	cmdm:Entity & a & rdfs:Class . \\
183	\\
184	cmdm:hasElementEntity & a & rdf:Property ; \\
185	& rdfs:domain & cmdm:Element ; \\
186	& rdfs:range & cmdm:Entity . \\
187	% \\
188	%\multicolumn{3}{l}{\# analogue for attributes ...} \\
189	%cmdm:hasAttributeValue & a & rdf:Property ; \\
190	% & rdfs:domain & cmdm:Attribute ; \\
191	% & rdfs:range & rdfs:Literal . \\
192
193	%cmdm:hasAttributeEntity & a & rdf:Property ; \\
194	% & rdfs:domain & :Attribute ; \\
195	% & rdfs:range & :Entity . \\
196	\end{example3}
197
198	\noindent
199	These entities are used for modelling the actual profiles, components and elements as they are defined in the CR.
200	For stand-alone/top components, the IRI\furl{http://catalog.clarin.eu/ds/ComponentRegistry/rest/registry/components/clarin.eu:cr1:c\_1271859438125/rdf} is the exact path into the CR to get the RDF representation for the profile/component. For ``inner'' components (that are defined as part of another component) and elements, the identifier is a concatenation of the parent top component IRI and the dot-path to the given component/element (Actor: \code{cr:clarin.eu:cr1:c\_1271859438197/rdf\#Actor.Actor\_Languages.Actor\_Language}).\footnote{For the sake of readability, we will collapse the component IRIs and refer to them just by their name, prefixed with \code{cmd:}.}
201
202	\begin{example3}
203	cmd:collection & a & cmdm:Profile; \\
204	& rdfs:label & "collection"; \\
205	& dcterms:identifier & cr:clarin.eu:cr1:p\_1345561703620. \\
206	cmd:Actor & a &cmdm:Component. \\
207	\end{example3}
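
\noindent
As an illustrative sketch of the dot-path scheme (not taken from the CR), the nested definitions below \code{cmd:Actor} could be typed and chained with \code{cmdm:contains} as follows. We assume here that \code{Actor\_Languages} and \code{Actor\_Language} are both components; the names again stand for the collapsed IRIs.

\begin{example3}
\multicolumn{3}{l}{\# assumed nesting: Actor, Actor\_Languages, Actor\_Language} \\
\multicolumn{3}{l}{cmd:Actor.Actor\_Languages \quad a \quad cmdm:Component .} \\
\multicolumn{3}{l}{cmd:Actor.Actor\_Languages.Actor\_Language \quad a \quad cmdm:Component .} \\
\multicolumn{3}{l}{cmd:Actor \quad cmdm:contains \quad cmd:Actor.Actor\_Languages .} \\
\multicolumn{3}{l}{cmd:Actor.Actor\_Languages \quad cmdm:contains} \\
\multicolumn{3}{l}{\quad cmd:Actor.Actor\_Languages.Actor\_Language .} \\
\end{example3}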
208
209	\subsubsection{Data Categories}
210	One of the semantic registries in use by CMDI for its concept links is ISOcat. \cite{Windhouwer2012_LDL} proposes to link to the data categories via an annotation property.
211
212	\begin{example3}
213	dcr:datcat & a & owl:AnnotationProperty ; \\
214	& rdfs:label & "data category"@en . \\
215	% & rdfs:comment & "This resource is equivalent to this data category."@en ; \\
216	% & skos:note & "The data category should be identified by its PID."@en ; \\
217	\end{example3}
218
219	The \code{@ConceptLink} attribute on CMD elements and components referencing the data category can be modelled as:
220
221	\begin{example3}
222	cmd:LanguageName & dcr:datcat & isocat:DC-2484. \\
223	\end{example3}
224
225	%\subsection{RELcat - Ontological relations}
226	% \commentx{for now we could probably skip all of relcat (although it is the future of semantic mapping ;) - we spare something for the next paper.}
227
228	Relations between data categories are not stored directly in the \xne{ISOcat} DCR, but rather in the dedicated Relation Registry \xne{RELcat} as RDF triples \cite{WINDHOUWER12.954} with dedicated predicates based on an extensible taxonomy of relation types. In the final paper, we will provide more details on the role of this important building block in the endeavour.
229
230	\begin{comment}
231	A sample relation from the \xne{CMDI} relation set expressing a number of equivalences between \xne{ISOcat} data categories and \xne{dublincore} terms:
232
233	\begin{example3}
234	isocat:DC-2538 & rel:sameAs & dct:date
235	\end{example3}
236
237	\noindent
238	By design, the relations in the Relation Registry are not expressed with predicates from known vocabularies like \xne{SKOS} or \xne{OWL}, again with the aim of avoiding too strong semantic implications. This leaves leeway for further specialization of the relations in specific applications. The \code{rel:*} properties can be understood as an upper layer of a taxonomy of relation types, implying a subtyping:
239
240	\begin{example3}
241	rel:sameAs & rdfs:subPropertyOf & owl:sameAs
242	\end{example3}
243	\end{comment}
244
245	%%%%%%%%%%%%%%%%%%%%%
246	\subsection{CMD instances}
247	In the next step, we want to express the individual CMD instances, the metadata records.
248
249	We provide a generic top-level class for all resources (including metadata records), the \code{cmdm:Resource} class, and the \code{cmdm:hasMimeType} predicate to type the resources.
250
251	\begin{example3}
252	<lr1> & a & cmdm:Resource; \\
253	& cmdm:hasMimeType & "audio/wav". \\
254	\end{example3}
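
\noindent
These two terms do not appear in the specification listing in Section~\ref{sec:CMDM}. A minimal sketch of how they could be declared there, under the assumption that MIME types are kept as plain literals, is:

\begin{example3}
\multicolumn{3}{l}{\# assumed declarations, not part of the listing above} \\
cmdm:Resource & a & rdfs:Class . \\
cmdm:hasMimeType & a & rdf:Property ; \\
& rdfs:domain & cmdm:Resource ; \\
& rdfs:range & rdfs:Literal . \\
\end{example3}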
255
256	\subsubsection {Resource Identifier}
257
258	\begin{comment}
259	It seems natural to use the PID of a Language Resource ( \code{<lr1>} ) as the resource identifier for the subject in the RDF representation. While this seems semantically sound, not every resource has to have a PID. (This is especially the case for ``virtual'' resources like collections, that are solely defined by their constituents and don't have any data on their own.) As a fall-back the PID of the MD record ( \code{<lr1.cmd>} from \code{cmd:MdSelfLink} element) could be used as the resource identifier.
260	If identifiers are present for both resource and metadata, \end{comment}
261	The relationship between the resource and the metadata record can be expressed as an annotation using the \xne{OpenAnnotation} vocabulary\furl{http://openannotation.org/spec/core/core.html\#Motivations}.
262	Note that one MD record can describe multiple resources; this is easily accommodated in OpenAnnotation.
263
264	\begin{example3}
265	\_:anno1 & a & oa:Annotation ; \\
266	& oa:hasTarget & <lr1a>, <lr1b> ; \\
267	& oa:hasBody & <lr1.cmd> ; \\
268	& oa:motivatedBy & oa:describing . \\
269	\end{example3}
270
271	\subsubsection{Provenance}
272
273	The information in the \code{cmd:Header} of a CMD record provides the provenance information about the modelled data:
274
275	\begin{example3}
276	<lr1.cmd> & dcterms:identifier & <lr1.cmd> ; \\
277	& dcterms:creator & \var{\{cmd:MdCreator\}} ; \\
278	& dcterms:publisher & <http://clarin.eu> ; \\
279	& dcterms:created & \var{\{cmd:MdCreated\}} . \\
280	\end{example3}
281
282	\subsubsection{Hierarchy (Resource Proxy -- IsPartOf)}
283	In CMD, the \code{cmd:ResourceProxyList} structure is used both to express the collection hierarchy and to point to the resource(s) described by the CMD record. This can be modelled as an \xne{OAI-ORE Aggregation}\furl{http://www.openarchives.org/ore/1.0/primer\#Foundations}
284	:
285
286	\begin{example3}
287	<lr0.cmd> & a & ore:ResourceMap . \\
288	<lr0.cmd> & ore:describes & <lr0.agg> . \\
289	<lr0.agg> & a & ore:Aggregation ; \\
290	& ore:aggregates & <lr1.cmd>, <lr2.cmd> . \\
291	\end{example3}
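
\noindent
Following the same pattern, an aggregated record can itself act as a resource map, so that deeper collection hierarchies nest naturally. The sketch below assumes this reading; the identifiers \code{<lr1.agg>}, \code{<lr1a.cmd>} and \code{<lr1b.cmd>} are made up for illustration.

\begin{example3}
\multicolumn{3}{l}{\# assumed second level below <lr1.cmd>} \\
<lr1.cmd> & a & ore:ResourceMap ; \\
& ore:describes & <lr1.agg> . \\
<lr1.agg> & a & ore:Aggregation ; \\
& ore:aggregates & <lr1a.cmd>, <lr1b.cmd> . \\
\end{example3}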
292
293	\begin{comment}
294	This is rather complicated: skip this?:
295	Additionally the flat header field \code{cmd:MdCollectionDisplayName} has been introduced to indicate by simple means the collection, of which given resource is part.
296	This information can be used to generate a separate one-level grouping of the resources, in which the value from the \code{cmd:MdCollectionDisplayName} element would be used as the label of an otherwise undefined \code{ore:ResourceMap}.
297	Even the identifier/URI for these collections is not clear. Although these collections should match the ResourceProxy hierarchy, there is no guarantee for this, thus a 1:1 mapping cannot be expected.
298
299	\begin{example3}
300	\_:mdcoll & a & ore:ResourceMap; \\
301	& rdfs:label & "Collection 1"; \\
302	\_:mdcoll\#aggreg & a & ore:Aggregation \\
303	& ore:aggregates & <lr1.cmd>, <lr2.cmd>; \\
304	\end{example3}
305	\end{comment}
306
307	\subsubsection{Components -- nested structures}
308	To express the tree structure of the CMD records, i.e. the containment relation between the components, the dedicated property \code{cmdm:contains} is used:
309
310	\begin{example3}
311	\_:actor1 & a & cmd:Actor . \\
312	\_:actor1lang1 & a & cmd:Actor.Actor\_Language . \\
313	\_:actor1 & cmdm:contains & \_:actor1lang1 . \\
314	\end{example3}
315
316	Additionally, we have to hook the top component to its containing metadata record.
317
318	\begin{example3}
319	\_:coll1 & a & cmd:collection. \\
320	\_:coll1 & cmdm:describesResource & <lr1.cmd> . \\
321	\end{example3}
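
\noindent
The property \code{cmdm:describesResource} is likewise not declared in the listing in Section~\ref{sec:CMDM}. One possible declaration, assuming that metadata records are themselves typed as \code{cmdm:Resource} and leaving the domain unspecified in this sketch, would be:

\begin{example3}
\multicolumn{3}{l}{\# assumed declaration; domain deliberately left open} \\
cmdm:describesResource & a & rdf:Property ; \\
& rdfs:range & cmdm:Resource . \\
\end{example3}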
322
323	\subsubsection{Elements, Fields, Values}
324	Finally, we also want to integrate the actual field values in the CMD records into the ontology.
325	As explained before, CMD elements have to be typed as \code{rdfs:Class}, the actual value is expressed via the \code{cmdm:hasElementValue} property, and the corresponding data category is expressed as an annotation property.
326
327	While generating triples with literal values seems straightforward, the more challenging but also more valuable aspect is to generate object property triples (predicate \code{cmdm:hasElementEntity}) with the literal values mapped to semantic entities. The following example shows the whole chain of statements from the metamodel to the literal value and the corresponding semantic entity.
328
329	The actual mapping process from values to entities is a complex and challenging task and will be tackled in more detail in the full paper. The main idea is to find entities in selected reference datasets (controlled vocabularies, ontologies) corresponding to the literal values in the metadata records. The obtained entity identifiers are then used to generate new RDF triples, representing outbound links.
330
331	\begin{example3}
332	cmd:Person & a & cmdm:Component . \\
333	cmd:Person.Organisation & a & cmdm:Element . \\
334	cmd:hasOrganisationElementValue \\
335	& rdfs:subPropertyOf & cmdm:hasElementValue ; \\
336	& rdfs:domain & cmd:Person.Organisation ; \\
337	& rdfs:range & xs:string . \\
338	cmd:hasOrganisationElementEntity \\
339	& rdfs:subPropertyOf & cmdm:hasElementEntity ; \\
340	& rdfs:domain & cmd:Person.Organisation ; \\
341	& rdfs:range & cmd:OrganisationElementEntity .\\
342	\\
343	\multicolumn{3}{l}{\# person (mentioned in a MD record) has an affiliation (cmd:Person/cmd:Organisation) } \\
344	\_:pers & a & cmd:Person ; \\
345	& cmdm:contains & \_:org . \\
346	\_:org & a & cmd:Person.Organisation ; \\
347	& \multicolumn{2}{l}{cmd:hasOrganisationElementValue \quad 'MPI'\^{}\^{}xs:string ;} \\
348	& \multicolumn{2}{l}{ cmd:hasOrganisationElementEntity \quad <http://mpi.nl> . }\\
349
350	<http://mpi.nl> & a & cmd:OrganisationElementEntity . \\
351	\end{example3}
352
353	\begin{comment}
354	%%%%%%%%%%%%%%%%%
355	\section{Mapping field values to semantic entities}
356	\label{sec:values2entities}
357
358	\commentx{this is probably too much for one abstract - so we could just announce the need for this mapping process.}
359
360	This task is a prerequisite to be able to express also the CMD instance data in RDF. The main idea is to find entities in selected reference datasets (controlled vocabularies, ontologies) matching the literal values in the metadata records. The obtained entity identifiers are further used to generate new RDF triples, representing outbound links.
361
362	It involves following steps:
363
364	\begin{enumerate}
365	\item identify appropriate controlled vocabularies for individual metadata fields or data categories (manual task)
366	\item extract \emph{distinct data category, value pairs} from the metadata records
367	\item actual \textbf{lookup} of the individual literal values in given reference data (as indicated by the data category) to retrieve candidate entities, concepts
368	\item assess the reliability of the match
369	\item generate new RDF triples with entity identifiers as object properties
370	\end{enumerate}
371
372	This task is basically an application of an ontology mapping method, trying to find semantically equivalent concepts from other semantic resources / vocabularies for our ``anonymous'' concepts.
373	% This is almost equivalent to the definition of ontology mapping function as given by \cite{EhrigSure2004, amrouch2012survey}: ``for each concept (node) in ontology A [tries to] find a corresponding concept (node), which has the same or similar semantics, in ontology B and vice verse''.
374
375	\subsubsection{Identify vocabularies}
376
377	One generic way to indicate vocabularies for given metadata fields or data categories that is being discussed in the CMD community is to use a dedicated annotation property in the schema or data category definition (tentatively labeled \code{@clavas:vocabulary}, cf. \emph{CMD 1.2}).
378
379	The primary providers of relevant vocabularies are \xne{ISOcat} and \xne{CLAVAS} -- a service for managing and providing vocabularies in SKOS format. However, in general we have to consider a number of different sources.
380
381	\subsubsection{Extract input data}
382	Starting from the literal triples as defined in the previous section (\code{cmdm:hasElementValue}), we aggregate the element values to retrieve distinct \emph{concept-value pairs}:
383
384	\begin{example3}
385	\_:1 & a & cmd:OrganisationElementEntity ; \\
386	& skos:altLabel & "MPI" . \\
387	\end{example3}
388
389	\subsubsection{Lookup}
390
391	In abstract terms, the lookup function takes as input the identifier of data category (or CMD element) and a literal string value and returns a list of potentially matching entities, ideally with some confidence score. Before actual lookup, there may have to be some string-normalizing preprocessing.
392
393	%\begin{definition}[{signature of the lookup function}]
394	\begin{equation}
395	lookup \ ( \ DataCategory \ , \ Literal \ ) \quad \mapsto \quad ( \ \textless Concept \ \| \ Entity ,\ confidenceScore \textgreater \ )*
396	\end{equation}
397	%\end{definition}
398
399	In the implementation, there needs to be additional initial configuration input, identifying datasets for given data categories,
400	which will be the result of the previous step -- identification of vocabularies. \
401
402
403	%\begin{definition}{Required configuration data indicating data category to available }
404	\begin{equation}
405	DataCategory \quad \mapsto \quad SemanticResource+
406	\end{equation}
407	%\end{definition}
408
409
410	As for the implementation, in the initial setup the system could resort to the \code{find}-interface provided by \xne{OpenSKOS}.
411	However, in the long term a more general solution is required, a kind of hybrid \emph{vocabulary proxy service} that allows to search in a number of datasets, many of them distributed and available via varying interfaces.
412
413	\subsubsection{Candidate evaluation}
414	The lookup is the most sensitive step in the process, being the gate between ``strings'' and semantic entities. In general, the resulting candidates cannot be seen as reliable matches and should undergo further scrutiny to ensure that the match is semantically correct. In some situations these ambiguities can be resolved algorithmically, but in many cases they will ultimately require human curation of the generated data.
415
416	%One example: A lookup with the pair \code{<organization, "Academy of sciences">} would probably return a list of organizations, as there is a national Academy of Sciences, in a number of countries. It would require further heuristics, e.g. checking the corresponding department, contact or -- less reliably -- the language of the described resource, to determine which specific Academy of Sciences is meant in given resource description.
417
418	\end{comment}
419
420	\section{Implementation}
421
422	The transformation of profiles and instances into RDF/XML is accomplished by a set of XSL stylesheets that are currently being tested on a sample dataset. Once ready, they will be integrated into the CMDI core infrastructure, e.g., the CR.
423	%And in the near future, a test on the instances in the complete CLARIN joint metadata domain will be performed.
424
425	Once the linked data is available, it has to be stored and published in an RDF triple store, which we will tackle in the final paper.
426	%The most promising solution seems to be \xne{Virtuoso}, an integrated feature-rich hybrid data store, able to deal with different types of data (``Universal Data Store''). \cite{Haslhofer2011europeana}
427
428	% Although the distributed nature of the data is one of the defining features of LOD and theoretically one should be able to follow the data by dereferencable URIs, in practice it is mostly necessary to pool into one data store linked datasets from different sources that shall be queried together due to performance reasons. This implies that the data to be kept by the data store will be decisively larger, than ``just'' the original dataset.
429
430	\section{Conclusions and Future Work}
431	In this abstract, we sketched the work on encoding of the whole of the CMD data domain in RDF, with special focus on the core model -- the general component schema. In the full paper we will also elaborate on the task of mapping element values to semantic entities. Additionally, some technical considerations will be discussed regarding exposing this dataset as Linked Open Data.
432
433	With this new enhanced dataset, the groundwork is laid for a full-blown \emph{semantic search}, i.e. the possibility of exploring the dataset indirectly using external semantic resources (like vocabularies of organizations or taxonomies of resource types) to which the CMD data will then be linked.
434
435	\bibliographystyle{splncs}
436	\bibliography{CMD2RDF}
437
438	\end{document}
