\chapter{Mapping on instance level,\\ CMD as LOD}
\label{ch:design-instance}

\begin{quotation}
I do think that ISOcat, CLAVAS, RELcat and actual language
resource all provide a part of the semantic network.

And if you can express these all in RDF, which we can for almost all of them (maybe
except the actual language resource ... unless it has a schema adorned
with ISOcat DC references ... \textless insert a SCHEMAcat plug ;-) \textgreater, but for
metadata we have that in the CMDI profiles ...) you could load all the
relevant parts in a triple store and do your SPARQL/reasoning on it. Well
that's where I'm ultimately heading with all these registries related to
semantic interoperability ... I hope ;-)

\hfill \textit{Menzo Windhouwer} \cite{Menzo2013mail}
\end{quotation}

As described in previous chapters (\ref{ch:infra}, \ref{ch:design}), semantic interoperability is one of the main motivations for the CMD infrastructure. However, the established machinery pertains mostly to the schema level; the actual values in the fields of CMD instances remain ``just strings''. This is the case even though the problem of different labels for semantically equivalent or even identical entities is even more virulent on the instance level. While for a number of metadata fields the value domain can be enforced through schema validation, some important fields (like \concept{organization} or \concept{resource type}) have a constrained value domain that nevertheless cannot be explicitly and exhaustively enumerated. This leads to chronically inconsistent use of labels for referring to entities (as the instance data shows, some organizations are referred to by more than 20 different labels or spelling variants), prompting an urgent need for better means of harmonizing the values of such constrained fields.

One potential remedy is the use of reference datasets -- controlled vocabularies, taxonomies, ontologies and the like. In fact, this is a very common approach, be it the authority files in the library world or the domain-specific reference vocabularies maintained by practically every research community. Not as strict as schema definitions, they cannot be used for validation, but they still help to harmonize the data by offering preferred labels and identifiers for entities.

In this chapter, we explore how this general approach can be employed for our specific problem of harmonizing the (literal) values in selected instance fields and mapping them to entities defined in corresponding vocabularies. This proposal is furthermore embedded in a more general effort to \textbf{express the whole of the CMD data domain (model and instances) in RDF}, constituting one large ontology interlinked with existing external semantic resources (ontologies, knowledge bases, vocabularies). This lays a foundation for providing the original dataset as a \emph{Linked Open Data} nucleus within the \emph{Web of Data} \cite{TimBL2006}
as well as for real semantic (ontology-driven) search and exploration of the data.

The following section \ref{sec:cmd2rdf} lays out how the individual parts of the CMD framework can be expressed in RDF.
In \ref{sec:values2entities} we investigate in further detail the above-mentioned critical aspect of the effort, namely the task of translating the string values in metadata fields to corresponding semantic entities. Finally, the technical aspects of providing the resulting ontology as LOD and the implications for ontology-driven semantic search are briefly tackled in \ref{sec:lod} and \ref{semantic-search} respectively.

\section{CMD to RDF}
\label{sec:cmd2rdf}
In this section, an RDF encoding is proposed for all levels of the CMD data domain:

\begin{itemize}
\item CMD meta model
\item profile definitions
\item the administrative and structural information of CMD records
\item individual values in the fields of the CMD records
\end{itemize}

\subsection{CMD specification}

The main entity of the meta model is the CMD component, typed as a specialization of \code{rdfs:Class}. A CMD profile is basically a CMD component with some extra features, implying a specialization relation. It would be natural to translate a CMD element to an RDF property, but it needs to be a class, as a CMD element -- next to its value -- can also have attributes. This further implies a property \code{ElementValue} to express the actual value of a given CMD element.

\label{table:rdf-spec}
\begin{example3}
cmds:Component & a  & rdfs:Class. \\
cmds:Profile & rdfs:subClassOf  & cmds:Component. \\
cmds:Element & a  & rdfs:Class. \\
cmds:ElementValue & a & rdf:Property. \\
cmds:Attribute & a & rdf:Property. \\
\end{example3}

\noindent
These entities are used for typing the actual profiles, components and elements (as they are defined in the Component Registry):

\label{table:rdf-cmd}
\begin{example3}
cmd:collection & a & cmds:Profile; \\
 & rdfs:label & "collection"; \\
 & dcterms:identifier & cr:clarin.eu:cr1:p\_1345561703620. \\
cmd:Actor       & a & cmds:Component. \\
cmd:Actor.LanguageName  & a & cmds:Element. \\
\end{example3}
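
A minimal sketch of how such schema-level triples could be generated programmatically with Python's \xne{rdflib} follows; the namespace URIs for \code{cmds:}, \code{cmd:} and \code{cr:} are hypothetical placeholders, as the text does not fix them.

\begin{verbatim}
from rdflib import Graph, Literal, Namespace, RDF, RDFS

# hypothetical namespace URIs -- placeholders only
CMDS = Namespace("http://www.clarin.eu/cmd/s#")
CMD = Namespace("http://www.clarin.eu/cmd/#")
CR = Namespace("http://catalog.clarin.eu/cr/")
DCTERMS = Namespace("http://purl.org/dc/terms/")

g = Graph()
# the meta model (cf. the first example above)
g.add((CMDS.Component, RDF.type, RDFS.Class))
g.add((CMDS.Profile, RDFS.subClassOf, CMDS.Component))
g.add((CMDS.Element, RDF.type, RDFS.Class))
# typing of a concrete profile as defined in the Component Registry
g.add((CMD.collection, RDF.type, CMDS.Profile))
g.add((CMD.collection, RDFS.label, Literal("collection")))
g.add((CMD.collection, DCTERMS.identifier,
       CR["clarin.eu:cr1:p_1345561703620"]))
print(g.serialize(format="turtle"))
\end{verbatim}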

%\begin{note}
%Should the ID assigned in the Component Registry  for the CMD entities be used as identifier in RDF, or rather the verbose name? (if yes, how to ensure uniqueness – generate the name from the cmd-path?)
%\end{note}

\subsection{Data Categories}
Windhouwer \cite{Windhouwer2012_LDL} proposes to use the data categories as annotation properties:

\begin{example3}
dcr:datcat & a  & owl:AnnotationProperty ; \\
 & rdfs:label  & "data category"@en ; \\
 & rdfs:comment  & "This resource is equivalent to this data category."@en ; \\
 & skos:note  & "The data category should be identified by its PID."@en ; \\
\end{example3}

That implies that the \code{@ConceptLink} attribute on CMD elements and components, as used in the CMD profiles to reference the data category, would be modelled as:

\begin{example3}
cmd:LanguageName & dcr:datcat & isocat:DC-2484. \\
\end{example3}

Encoding data categories as annotation properties stands in contrast to the common approach seen with Dublin Core terms, which are usually used directly as data properties:

\begin{example3}
<lr1> & dc:title & "Language Resource 1"
\end{example3}

\noindent
However, we argue against a direct mapping of complex data categories to data properties and in favour of modelling data categories as annotation properties, so as to avoid too strong semantic implications \cite{Windhouwer2012_LDL}.
In a specific (OWL 2) application, the relation with the data categories can be expressed as \code{owl:equivalentClass} for classes, \code{owl:equivalentProperty} for properties or \code{owl:sameAs} for individuals:

\begin{example3}
\#myPOS & owl:equivalentClass & isocat:DC-1345. \\
\#myPOS & owl:equivalentProperty & isocat:DC-1345. \\
\#myNoun & owl:sameAs & isocat:DC-1333. \\
\end{example3}

\subsection{RELcat -- Ontological relations}
As described in \ref{def:rr}, relations between data categories are not stored directly in the \xne{ISOcat} DCR, but rather in a dedicated module, the Relation Registry \xne{RELcat}. The relations there are grouped into relation sets and stored as RDF triples \cite{SchuurmanWindhouwer2011}. A sample relation from the \xne{CMDI} relation set expresses an equivalence between an \xne{ISOcat} data category and a \xne{dublincore} term:

\begin{example3}
isocat:DC-2538 & rel:sameAs & dct:date
\end{example3}

\noindent
By design, the relations in the Relation Registry are not expressed with predicates from known vocabularies like \xne{SKOS} or \xne{OWL}, again with the aim to avoid too strong semantic implications. This leaves leeway for further specialization of the relations in specific applications. The \code{rel:*} properties can be understood as an upper layer of a taxonomy of relation types, implying a subtyping:

\begin{example3}
rel:sameAs & rdfs:subPropertyOf & owl:sameAs
\end{example3}
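
A short sketch of how an application could exploit this subtyping: once the subproperty axiom is loaded, a SPARQL property path over \code{rdfs:subPropertyOf} recovers the stronger \code{owl:sameAs} reading from the weak \code{rel:sameAs} statements. The namespace URIs for \code{rel:} and \code{isocat:} are assumptions.

\begin{verbatim}
from rdflib import Graph

g = Graph()
g.parse(data="""
@prefix rel:    <http://www.isocat.org/rel/> .       # assumed URI
@prefix owl:    <http://www.w3.org/2002/07/owl#> .
@prefix rdfs:   <http://www.w3.org/2000/01/rdf-schema#> .
@prefix isocat: <http://www.isocat.org/datcat/> .    # assumed URI
@prefix dct:    <http://purl.org/dc/terms/> .

rel:sameAs rdfs:subPropertyOf owl:sameAs .
isocat:DC-2538 rel:sameAs dct:date .
""", format="turtle")

# find all pairs linked by any (transitive) subproperty of owl:sameAs
q = """
PREFIX owl:  <http://www.w3.org/2002/07/owl#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?a ?b WHERE {
  ?p rdfs:subPropertyOf* owl:sameAs .
  ?a ?p ?b .
}
"""
for a, b in g.query(q):
    print(a, "owl:sameAs", b)
\end{verbatim}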


\subsection{CMD instances}
In the next step, we want to express the individual CMD instances, i.e. the metadata records, making use of the previously defined entities on the schema level, but also of entities from external ontologies.

\subsubsection{Resource Identifier}

It seems natural to use the PID of a language resource (\code{<lr1>}) as the resource identifier for the subject in the RDF representation. While this seems semantically sound, not every resource has a PID. (This is especially the case for ``virtual'' resources like collections, which are solely defined by their constituents and don't have any data of their own.) As a fall-back, the PID of the MD record (\code{<lr1.cmd>}, from the \code{cmd:MdSelfLink} element) can be used as the resource identifier.
If identifiers are present for both the resource and the metadata record, the relationship between the two can be expressed as an annotation using the \xne{OpenAnnotation} vocabulary\furl{http://openannotation.org/spec/core/core.html\#Motivations}.
(Note also that one MD record can describe multiple resources; this, too, can easily be accommodated in OpenAnnotation):

\begin{example3}
\_:anno1  & a & oa:Annotation; \\
 & oa:hasTarget  & <lr1a>, <lr1b>; \\
 & oa:hasBody  & <lr1.cmd>; \\
 & oa:motivatedBy  & oa:describing \\
\end{example3}

\subsubsection{Provenance}

The information from \code{cmd:Header} represents the provenance information about the modelled data:

\begin{example3}
<lr1.cmd> & dcterms:identifier  & <lr1.cmd>;  \\
 & dcterms:creator & "\var{\{cmd:MdCreator\}}";  \\
 & dcterms:publisher  & <http://clarin.eu>; \\
 & dcterms:created /dcterms:modified? & "\var{\{cmd:MdCreated\}}" \\
\end{example3}

\subsubsection{Hierarchy (ResourceProxy -- IsPartOf)}
In CMD, the \code{cmd:ResourceProxyList} structure is used both to express the collection hierarchy and to point to the resource(s) described by the MD record. This can be modelled as an \xne{OAI-ORE Aggregation}\furl{http://www.openarchives.org/ore/1.0/primer\#Foundations}:

\begin{example3}
<lr0.cmd>  & a   & ore:ResourceMap \\
<lr0.cmd> & ore:describes & <lr0.agg> \\
<lr0.agg> & a   & ore:Aggregation \\
& ore:aggregates  & <lr1.cmd>, <lr2.cmd>; \\
\end{example3}

\subsubsection{Components -- nested structures}

For expressing the tree structure of the CMD records, i.e. the containment relation between the components, a dedicated property \code{cmd:contains} is used:

\begin{example3}
\_:Actor1  & a & cmd:Actor \\
<lr1> & cmd:contains & \_:Actor1 \\
\end{example3}

\subsection{Elements, Fields, Values}
Finally, we also want to integrate the actual field values in the CMD records into the ontology.

% \subsubsection{Predicates}
As explained before, CMD elements have to be typed as \code{rdfs:Class}, the actual value is expressed via a \code{cmds:ElementValue} property, and the corresponding data category is expressed as an annotation property.

The following example shows the whole chain of statements from the metamodel down to the literal value:

\begin{example3}
cmd:timeCoverage  & a   & cmds:Element \\
cmd:timeCoverageValue & a & cmds:ElementValue \\
cmd:timeCoverage  & dcr:datcat  & isocat:DC-2502 \\
<lr1> & cmd:contains & \_:timeCoverage1 \\
\_:timeCoverage1 & a & cmd:timeCoverage \\
\_:timeCoverage1 & cmd:timeCoverageValue & "19th century" \\
\end{example3}
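
To make the transformation concrete, here is a minimal sketch of how such instance triples could be generated from a CMD record with Python (\xne{lxml} and \xne{rdflib}); the record fragment, the resource PID and the \code{cmd:} namespace URI are hypothetical.

\begin{verbatim}
from lxml import etree
from rdflib import BNode, Graph, Literal, Namespace, RDF, URIRef

CMD = Namespace("http://www.clarin.eu/cmd/#")    # assumed namespace

# hypothetical fragment of a CMD record
record = etree.fromstring(
    "<Components><timeCoverage>19th century</timeCoverage></Components>")

g = Graph()
lr1 = URIRef("http://hdl.handle.net/0000/lr1")   # hypothetical resource PID

for el in record.iter("timeCoverage"):
    node = BNode()                  # _:timeCoverage1 in the example above
    g.add((lr1, CMD.contains, node))
    g.add((node, RDF.type, CMD.timeCoverage))
    g.add((node, CMD.timeCoverageValue, Literal(el.text)))
\end{verbatim}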


While generating triples with literal values seems straightforward, the more challenging but also more valuable aspect is to generate object-property triples, with the literal values mapped to semantic entities:

\begin{example3}
\var{cmds:Element} & \var{cmds:ElementValue\_?} & \var{xsd:anyURI}\\
\_:organisation1 & cmd:OrganisationValue\_? & <org1> \\
\end{example3}

\begin{comment}
Don't we need a separate property (predicate) for the triples with object properties pointing to entities,
i.e. \code{cmd:Organisation\_} additionally to \code{cmd:Organisation} 
\end{comment}

The mapping process is detailed in \ref{sec:values2entities}.



%%%%%%%%%%%%%%%%%
\section{Mapping field values to semantic entities}
\label{sec:values2entities}

This task is a prerequisite for being able to express the CMD instance data in RDF as well. The main idea is to find entities in selected reference datasets (controlled vocabularies, ontologies) matching the literal values in the metadata records. The obtained entity identifiers are then used to generate new RDF triples. The task involves the following steps (a schematic sketch of the pipeline follows the list):

\begin{enumerate}
\item identify appropriate controlled vocabularies for individual metadata fields or data categories (manual task)
\item extract \emph{distinct data category, value pairs} from the metadata records
\item actual \textbf{lookup} of the individual literal values in the given reference data (as indicated by the data category) to retrieve candidate entities, concepts
\item assess the reliability of the match
\item generate new RDF triples with entity identifiers as object properties
\end{enumerate}
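
The following skeleton, a sketch only, restates these five steps as code; \code{vocab\_for\_datcat}, \code{lookup} and \code{assess} are hypothetical helpers that are discussed in the remainder of this section.

\begin{verbatim}
def values_to_entities(records, vocab_for_datcat, lookup, assess,
                       threshold=0.9):
    """Sketch of the five-step pipeline; all helpers are hypothetical."""
    # step 2: distinct (data category, value) pairs
    pairs = {(f.datcat, f.value) for rec in records for f in rec.fields}
    triples = []
    for datcat, value in pairs:
        if datcat not in vocab_for_datcat:   # step 1 was done manually
            continue
        candidates = lookup(datcat, value)   # step 3: maximize recall
        scored = [(c, assess(c, datcat, value)) for c in candidates]
        best, score = max(scored, key=lambda cs: cs[1],
                          default=(None, 0.0))  # step 4: assess matches
        if score >= threshold:
            # step 5: emit an object-property triple
            triples.append((datcat, value, best))
    return triples
\end{verbatim}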

This task is basically an application of an ontology mapping method.

We do not try to achieve complete ontology alignment; we just want to find,
for our ``anonymous'' concepts, semantically equivalent concepts from other ontologies.
This is almost equivalent to the definition of the ontology mapping function as given by \cite{EhrigSure2004, amrouch2012survey}:
``for each concept (node) in ontology A [tries to] find a corresponding concept
(node), which has the same or similar semantics, in ontology B and vice versa''.

The first two points in the above enumeration represent the steps necessary to be able to apply the ontology mapping.
The identification of appropriate vocabularies is discussed in the next subsection. In the operationalization, the identified vocabularies could be treated as one aggregated semantic resource to map all entities against. For the sake of higher precision, it may be sensible to perform the task separately for individual concepts, i.e. organisations, persons, etc., and in every run consider only the relevant vocabularies.

The transformation of the data has been partly described in the previous section. The data can be trivially and automatically converted into RDF triples such as:

\begin{example3}
\_:organisation1 & cmd:OrganisationValue & "MPI" \\
\end{example3}

However, for the needs of the mapping task, we propose to reduce and rewrite the data to retrieve distinct \emph{concept, value} pairs (cf. figure \ref{fig:smc_cmd2lod}):

\begin{example3}
\_:1 & a & clavas:Organisation;\\
   & skos:altLabel & "MPI";
\end{example3}
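
Such distinct pairs could, for instance, be extracted from the generated triples with a SPARQL query along the following lines, a sketch assuming the namespace URIs introduced above:

\begin{verbatim}
from rdflib import Graph

g = Graph()
g.parse("cmd-records.ttl", format="turtle")  # hypothetical instance dump

# distinct (data category, literal value) pairs, using dcr:datcat to
# identify the concept behind each CMD element (URIs are assumptions)
q = """
PREFIX cmds: <http://www.clarin.eu/cmd/s#>
PREFIX dcr:  <http://www.isocat.org/ns/dcr.rdf#>
SELECT DISTINCT ?datcat ?value WHERE {
  ?elementType dcr:datcat ?datcat .
  ?field a ?elementType ;
         ?valueProp ?value .
  ?valueProp a cmds:ElementValue .
}
"""
for datcat, value in g.query(q):
    print(datcat, value)
\end{verbatim}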

The \var{lookup} function is a customized version of the \var{map} function that operates on these (concept, label) pairs.

The two steps \var{lookup} and \var{assess} correspond exactly to the two steps of the system \xne{LogMap2} described in \cite{jimenez2012large}: 1) computation of mapping candidates (maximize recall) and 2) assessment of the candidates (maximize precision).


\begin{figure*}[!ht]
\includegraphics[width=1\textwidth]{images/SMC_CMD2LOD}
\caption{Sketch of the process of transforming the CMD metadata records to an RDF representation}
\label{fig:smc_cmd2lod}
\end{figure*}

\subsubsection{Identify vocabularies}

One generic way to indicate vocabularies for given metadata fields or data categories, currently being discussed in the CMD community, is to use a dedicated annotation property in the schema or data category definition (tentatively labelled \code{@clavas:vocabulary}). For such a mechanism to work, the consuming applications (like a metadata editor) need to be made aware of this convention and interpret it accordingly.

The primary providers of relevant vocabularies are \xne{ISOcat} and \xne{CLAVAS} -- a service for managing and providing vocabularies in SKOS format (cf. \ref{def:CLAVAS}). Closed and corresponding simple data categories are already being exported from ISOcat in SKOS format and imported into CLAVAS/OpenSKOS, and other relevant vocabularies shall be ingested into this system as well, so that we can assume OpenSKOS as a first source of vocabularies. However, definitely not all of the existing reference data will be hosted by OpenSKOS, so in general we have to consider a number of different sources (cf. \ref{refdata}).

Data in OpenSKOS is modelled purely in SKOS, so there is no more specific typing of the entities in the vocabularies; rather, all the entities are \code{skos:Concept}s:

\begin{example3}
<org1> & a   & skos:Concept \\
\end{example3}

\noindent
We may want to add some more typing and introduce classes for entities from individual vocabularies, like \code{clavas:Organization} or similar. Insofar as CLAVAS will also maintain mappings/links to other datasets,

\begin{example3}
<org1> & skos:exactMatch  & <dbpedia/org1>, <lt-world/orgx>;
\end{example3}

\noindent
we could use these to expand the data with alternative identifiers, fostering the interlinking of the data:

\begin{example3}
<org1>  & dcterms:identifier  & <org1>, <dbpedia/org1>, <lt-world/orgx>;
\end{example3}

\subsubsection{Lookup}

In abstract terms, the lookup function takes as input the identifier of a data category (or CMD element) and a literal string value, and returns a list of potentially matching entities. Before the actual lookup, some string-normalizing preprocessing may be necessary.

\begin{definition}{signature of the lookup function}
lookup \ : \ ( \ DataCategory, \ Literal \ ) \quad \mapsto \quad ( \ Concept \ | \ Entity \ )*
\end{definition}

In the implementation, there needs to be an additional initial configuration input identifying the datasets for given data categories,
which will be the result of the previous step.

\begin{definition}{Required configuration data mapping a data category to available datasets}
DataCategory \quad \mapsto \quad Dataset+
\end{definition}

As for the implementation, in the initial setup the system could resort to the \code{find}-interface provided by \xne{OpenSKOS}.
However, in the long term a more general solution is required: a kind of hybrid \emph{vocabulary proxy service} that allows searching in a number of datasets, many of them distributed and available via different interfaces. Figure \ref{fig:vocabulary_proxy} sketches the general setup. The service has to be able to a) proxy search requests to a number of search interfaces (SRU, SPARQL) and b) fetch, cache and search in datasets.
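
A minimal sketch of such a \var{lookup} function against an OpenSKOS instance is given below; the endpoint URL, the request parameters and the shape of the JSON response are assumptions to be checked against the actual OpenSKOS API documentation, and the configuration mapping holds toy values only.

\begin{verbatim}
import requests

# assumed endpoint of an OpenSKOS instance
OPENSKOS = "http://openskos.org/api/find-concepts"
# step-1 configuration: data category -> datasets (toy values)
DATCAT_TO_DATASETS = {"isocat:DC-2459": ["clavas:organisations"]}

def normalize(value):
    """Minimal string preprocessing before the lookup."""
    return " ".join(value.strip().split())

def lookup(datcat, value):
    """lookup(DataCategory, Literal) -> list of candidate concepts."""
    if datcat not in DATCAT_TO_DATASETS:
        return []
    resp = requests.get(OPENSKOS,
                        params={"q": normalize(value), "format": "json"})
    resp.raise_for_status()
    # restricting the search to the configured datasets is omitted here,
    # as the request parameter depends on the OpenSKOS deployment
    return resp.json().get("response", {}).get("docs", [])
\end{verbatim}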

\begin{figure*}[!ht]
\includegraphics[width=1\textwidth]{images/VocabularyProxy_clientapp}
\caption{Sketch of a general setup for vocabulary lookup via a \xne{VocabularyProxy} service}
\label{fig:vocabulary_proxy}
\end{figure*}


\subsubsection{Candidate evaluation}
The lookup is the most sensitive step in the process, as it is the gate between strings and semantic entities. In general, the resulting candidates cannot be considered reliable and should undergo further scrutiny to ensure that the match is semantically correct.

One example: a lookup with the pair \code{<organization, "Academy of sciences">} would probably return a list of organizations, as there is a national Academy of Sciences in a number of countries. It would require further heuristics, e.g. checking the corresponding department, contact or -- less reliably -- the language of the described resource, to determine which specific Academy of Sciences is meant in a given resource description.
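
Such heuristics could be combined in a simple scoring function as sketched below; the candidate fields, the context values and the weights are all hypothetical:

\begin{verbatim}
def assess(candidate, context):
    """Toy plausibility score for an organization candidate.

    `candidate` is a concept as returned by lookup(); `context` holds
    other values from the same CMD record (all field names assumed).
    """
    score = 0.0
    labels = candidate.get("prefLabel", []) + candidate.get("altLabel", [])
    if context["value"] in labels:
        score += 0.5    # exact label match
    if context.get("country") and \
            context["country"] == candidate.get("country"):
        score += 0.3    # e.g. derived from the contact address
    if context.get("language") in candidate.get("languages", []):
        score += 0.2    # weakest signal, as noted above
    return score
\end{verbatim}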

In some situations these ambiguities can be resolved algorithmically, but in many cases human curation of the generated data will be required in the end. In this respect, it is worth noting that the CLARIN search engine VLO provides a feedback link that allows even a normal user to report problems or inconsistencies in CMD records.



%%%%%%%%%%%%%%%%%%%%%
\section{SMC LOD - Semantic Web Application}
\label{sec:lod}

With the new enhanced dataset, as detailed in section \ref{sec:cmd2rdf}, the groundwork is laid for the full-blown semantic search proposed in the original goals, i.e. the possibility of exploring the dataset using external semantic resources.
The user can access the data indirectly by browsing the external vocabularies/taxonomies with which the data is linked, such as vocabularies of organizations or taxonomies of resource types.

The technical base for a semantic web application is usually an RDF triple store, as discussed in \ref{semweb-tech}.
Given that our main concern is the data itself, its processing and display, we want to rely on a stable, robust, feature-rich solution minimizing the effort of providing the data online. The most promising candidate seems to be \xne{Virtuoso}, an integrated, feature-rich hybrid data store able to deal with different types of data (``Universal Data Store'').


Although the distributed nature of the data is one of the defining features of LOD, and theoretically one should be able to follow the data by dereferenceable URIs, in practice it is, for performance reasons, mostly necessary to pool linked datasets from different sources that shall be queried together into one data store. This implies that the data to be kept by the data store will be decisively larger than ``just'' the original dataset.
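
Once the pooled data is loaded, it can be queried through the store's SPARQL endpoint; the following sketch uses \xne{SPARQLWrapper} against a hypothetical local Virtuoso endpoint, with the resource names, the \code{cmd:} namespace and the \code{cmd:OrganisationValue} property all assumed:

\begin{verbatim}
from SPARQLWrapper import JSON, SPARQLWrapper

# hypothetical endpoint of a Virtuoso instance hosting the pooled data
sparql = SPARQLWrapper("http://localhost:8890/sparql")
sparql.setReturnFormat(JSON)

# cross-dataset query: resources whose organization field was mapped to
# a concept that the vocabulary links to a DBpedia entity
sparql.setQuery("""
PREFIX cmd:  <http://www.clarin.eu/cmd/#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT ?resource WHERE {
  ?org skos:exactMatch <http://dbpedia.org/resource/Max_Planck_Society> .
  ?resource cmd:contains ?field .
  ?field cmd:OrganisationValue ?org .
}
""")
for row in sparql.queryAndConvert()["results"]["bindings"]:
    print(row["resource"]["value"])
\end{verbatim}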

\section{Summary}

In this chapter, an expression of the whole of the CMD data domain in RDF was proposed, with special focus on the method for translating the string values in metadata fields to corresponding semantic entities.
This task can also be seen as building a bridge between the world of XML resources and semantic resources expressed in RDF.
Additionally, some technical considerations were discussed regarding exposing this dataset as Linked Open Data and the implications for real semantic, ontology-based exploration of the data.

%The process of expressing the whole of the data as one semantic resource, can be also understood as schema or ontology merging task. Data categories being the primary mapping elements