1 | \documentclass[10pt, a4paper]{article} |
---|
2 | \usepackage{lrec2006} |
---|
3 | |
---|
4 | \usepackage{color} |
---|
5 | \usepackage{graphicx} |
---|
6 | \usepackage{amsmath} |
---|
7 | \usepackage{framed} |
---|
8 | \usepackage{url} |
---|
9 | |
---|
10 | \usepackage{verbatim} % adds environment for commenting out blocks of text & for better verbatim |
---|
11 | \usepackage{multicol} |
---|
12 | |
---|
13 | %\newcommand{\comment}[1]{} |
---|
14 | \newcommand{\commentx}[1]{\textcolor{red}{#1}} |
---|
15 | |
---|
16 | %%% PAGE DIMENSIONS |
---|
17 | %\usepackage{geometry} % to change the page dimensions |
---|
18 | %\geometry{a4paper} % or letterpaper (US) or a5paper or.... |
---|
19 | %\geometry{margin=2.5cm} % for example, change the margins to 2 inches all round |
---|
20 | %\topmargin=-0.6in |
---|
21 | %\textheight=700pt |
---|
22 | % \geometry{landscape} % set up the page for landscape |
---|
23 | % read geometry.pdf for detailed page layout information |
---|
24 | |
---|
25 | \newcommand{\code}[1]{\texttt{#1}} |
---|
26 | \newcommand{\xne}[1]{\textsf{#1}} % named entity |
---|
27 | \newcommand{\furl}[1]{\footnote{\url{#1}}} |
---|
28 | \newcommand{\var}[1]{\textrm{\textit{#1}}} % variable, definition |
---|
29 | |
---|
30 | %@{\hspace{-2mm}} |
---|
31 | \newenvironment{example2} |
---|
32 | { \footnotesize |
---|
33 | \begin{sffamily} \begin{shaded*} \noindent |
---|
34 | \begin{tabular}{@{\hspace{-1mm}} p{0.3\textwidth} p{0.7\textwidth} } } |
---|
35 | {\end{tabular} \end{shaded*} \end{sffamily} } |
---|
36 | |
---|
37 | \newenvironment{example2a} |
---|
38 | { \footnotesize |
---|
39 | \begin{sffamily} \begin{shaded*} \noindent |
---|
40 | \begin{tabular}{@{\hspace{-1mm}} p{0.4\textwidth} p{0.6\textwidth} } } |
---|
41 | {\end{tabular} \end{shaded*} \end{sffamily} } |
---|
42 | |
---|
43 | \newenvironment{example3} |
---|
44 | { \footnotesize |
---|
45 | \begin{sffamily} \begin{shaded*} \noindent |
---|
46 | \begin{tabular}{@{\hspace{-1mm}} p{0.25\textwidth} p{0.25\textwidth} p{0.45\textwidth}} |
---|
47 | } |
---|
48 | { \end{tabular} \end{shaded*} \end{sffamily} } |
---|
49 | |
---|
50 | \definecolor{shadecolor}{rgb}{0.95,0.95,1.0} |
---|
51 | |
---|
52 | % xml syntax highlighting |
---|
53 | % source http://snipt.org/vngf3 |
---|
54 | \usepackage{listings} |
---|
55 | |
---|
56 | \definecolor{grey}{rgb}{0.4,0.4,0.4} |
---|
57 | \definecolor{darkblue}{rgb}{0.0,0.0,0.6} |
---|
58 | \definecolor{cyan}{rgb}{0.0,0.6,0.6} |
---|
59 | |
---|
60 | |
---|
61 | \newenvironment{notex} |
---|
62 | {\footnotesize \color{grey} \begin{textit}} |
---|
63 | { \end{textit} \normalsize} |
---|
64 | |
---|
65 | \title{From CLARIN Component Metadata to Linked Open Data} |
---|
66 | |
---|
67 | \name{Matej \v{D}ur\v{c}o, Menzo Windhouwer} |
---|
68 | |
---|
69 | \address{ Institute for Corpus Linguistics and Text Technology (ICLTT), The Language Archive - DANS \\ |
---|
70 | Vienna, Austria, The Hague, The Netherlands \\ |
---|
71 | matej.durco@oeaw.ac.at, menzo.windhouwer@dans.knaw.nl\\} |
---|
72 | |
---|
73 | \abstract{In the European CLARIN infrastructure a growing number of resources are described with Component Metadata. In this paper we |
---|
74 | describe a transformation to make this metadata available as linked data. After this first step it becomes possible to connect the CLARIN Component Metadata with other valuable knowledge sources in the Linked Data Cloud. \\ \newline \Keywords{Linked Open Data, RDF, metadata}} |
---|
75 | |
---|
76 | % |
---|
77 | \begin{document} |
---|
78 | |
---|
79 | \maketitleabstract |
---|
80 | |
---|
81 | \section{Motivation} |
---|
82 | % |
---|
83 | Although semantic interoperability has been one of the main motivations for CLARIN's Component Metadata Infrastructure (CMDI) \cite{Broeder+2010}\furl{http://www.clarin.eu/cmdi/}, until now there has been no work on the obvious next step -- bringing CMDI to the Semantic Web. We believe that providing the CLARIN CMD records as Linked Open Data (LOD), interlinked with external semantic resources, will open up new ways of processing and exploring the CMD data by employing the power of semantic technologies. In this paper, we lay out how individual parts of the CMD data domain can be expressed in RDF and made ready to be interlinked with existing external semantic resources (ontologies, taxonomies, knowledge bases, vocabularies). |
---|
84 | %This conversion lays a foundation / is groundwork for providing the original dataset as a \emph{Linked Open Data} nucleus within the \emph{Web of Data}\cite{TimBL2006} as well as for real semantic (ontology-driven) search and exploration of the data. |
---|
85 | |
---|
86 | % |
---|
87 | \section{The Component Metadata Infrastructure}\label{CMDI} |
---|
88 | % |
---|
89 | |
---|
90 | The basic building blocks of CMDI are components. Components are used to group elements and attributes, which can take values, and also other components (see Figure \ref{fig:CMDM}). Components are stored in the Component Registry (CR), where they can be reused by other modellers. Thus a metadata modeller selects or creates components and combines them into a profile targeted at a specific resource type, a collection of resources or a project, tool or service. A profile serves as a blueprint for a schema for metadata records. CLARIN centres offer these CMD records describing their resources to the joint metadata domain. There are a number of generic tools which operate on all the CMD records in this domain, e.g., the Virtual Language Observatory\furl{http://www.clarin.eu/vlo/}. These tools have to deal with the variety of CMD profiles. They can do so by operating on a semantic level, as components, elements and values can all be annotated with links to concepts in various registries. Currently used concept registries are the Dublin Core metadata %elements and |
---|
91 | terms and the ISOcat Data Category Registry. These concept links allow profiles, while being diverse in structure, to share semantics. Generic tools can use this semantic linkage to overcome differences in terminology and also in structure. |
---|
92 | |
---|
93 | \begin{figure*} |
---|
94 | \begin{center} |
---|
95 | \hspace{-0.1\textwidth}\includegraphics[width=0.8\textwidth]{CMDM} |
---|
96 | \end{center} |
---|
97 | \caption{Component Metadata Model \cite{ISODIS24622-1_2013}} |
---|
98 | \label{fig:CMDM} |
---|
99 | \end{figure*} |
---|
100 | |
---|
101 | |
---|
102 | % |
---|
103 | \subsection{Current status of the joint CMD Domain} |
---|
104 | % |
---|
105 | To provide a frame of reference for the proportions of the undertaking, this section gives a few numbers about the data in the CMD domain. |
---|
106 | %, both on the schema level, i.e. with regard to the defined profiles and data categories used, as well as on the instance level, the actual CMD records. |
---|
107 | |
---|
108 | \subsubsection{CMD Profiles } |
---|
109 | Currently 146 public profiles and 857 components are defined in the CR. |
---|
110 | Next to the `native' ones a number of profiles have been created that implement existing metadata formats, like OLAC/DCMI-terms, TEI Header or the META-SHARE schema. |
---|
111 | %The resulting profiles proof the flexibility/expressi\-vi\-ty of the CMD metamodel. |
---|
112 | The individual profiles also differ considerably in their structure -- next to simple flat profiles |
---|
113 | %with just one level of components or elements with 5 to 20 fields (\textit{dublincore}, \textit{collection}, the set of \textit{Bamdes}-profiles) |
---|
114 | there are complex ones with up to 10 levels %(\textit{ExperimentProfile}, profiles for describing Web Services) |
---|
115 | and a few hundred elements. |
---|
116 | %The biggest single profile is currently the remodelled maximum schema from the META-SHARE project \cite{Gavrilidou2012meta} for describing corpora, with 117 components and 337 elements. |
---|
117 | |
---|
118 | \subsubsection{Instance Data} |
---|
119 | |
---|
120 | The main CLARIN OAI-PMH harvester\footnote{\url{http://catalog.clarin.eu/oai-harvester/}} |
---|
121 | regularly collects records from the currently 58 providers, over 600,000 records in total. |
---|
122 | Some 20 of the providers offer CMDI records; the rest provide around 140,000 OLAC/DC records, which are converted into the corresponding CMD profile. |
---|
123 | %Next to these 81.226 original OLAC records, there a few providers offering their OLAC or DCMI-terms records already converted into CMDI, thus all in all OLAC, DCMI-terms records amount to 139.152. |
---|
124 | %On the other hand, some |
---|
125 | Some of the comparatively few providers of `native' CMD records expose multiple profiles (e.g. Meertens Institute uses 12 different ones), so that overall instance data for more than 60 profiles is present. |
---|
126 | %So we encounter both situations: one profile being used by many providers and one provider using many profiles. |
---|
127 | |
---|
128 | \section{CMD to RDF} |
---|
129 | \label{sec:cmd2rdf} |
---|
130 | In the following, an RDF encoding is proposed for all levels of the CMD data domain: |
---|
131 | \begin{itemize} |
---|
132 | \item CMD meta model, |
---|
133 | \item profile and component definitions, |
---|
134 | \item administrative and structural information of CMD records and |
---|
135 | \item individual values in the fields of the CMD records. |
---|
136 | \end{itemize} |
---|
137 | |
---|
138 | \subsection{CMD specification}\label{sec:CMDM} |
---|
139 | |
---|
140 | The main entity of the meta model is the CMD component, modelled as an \code{rdfs:Class}. A CMD profile is basically a CMD component with some extra features, implying a specialization relation. It may seem natural to translate a CMD element to an RDF property (as it holds the literal value), but given its complexity (e.g., attributes\footnote{Due to space considerations we will not further discuss attributes.}, relation to the containing component) it too has to be expressed as an \code{rdfs:Class}. The actual literal value is a property of the given element of type \code{cmdm:ElementValue}. For values that can be mapped to entities defined in external semantic resources, the references to these entities are expressed in parallel object properties of type \code{cmdm:hasElementEntity} (constituting outbound links). The containment relation between components and elements is expressed with a dedicated property \code{cmdm:contains}. |
---|
141 | |
---|
142 | \begin{figure*} |
---|
143 | \begin{center} |
---|
144 | \label{table:rdf-spec} |
---|
145 | \begin{example3} |
---|
146 | \multicolumn{3}{l}{@prefix cmdm: \textless http://www.clarin.eu/cmd/general.rdf\#\textgreater . }\\ |
---|
147 | \\ |
---|
148 | \multicolumn{3}{l}{\# basic building blocks of CMD Model} \\ |
---|
149 | cmdm:Component & a & rdfs:Class . \\ |
---|
150 | cmdm:Profile & rdfs:subClassOf & cmdm:Component . \\ |
---|
151 | cmdm:Element & a & rdfs:Class . \\ |
---|
152 | %cmdm:Attribute & a & rdfs:Class . \\ |
---|
153 | \\ |
---|
154 | \multicolumn{3}{l}{\# basic CMD nesting} \\ |
---|
155 | cmdm:contains & a & rdf:Property ; \\ |
---|
156 | & rdfs:domain & cmdm:Component ; \\ |
---|
157 | & rdfs:range & cmdm:Component , cmdm:Element . \\ |
---|
158 | |
---|
159 | %cmdm:containsAttribute & a &rdf:Property; |
---|
160 | % & rdfs:domain & cmdm:Component, cmdm:Element; |
---|
161 | % & rdfs:range & cmdm:Attribute. |
---|
162 | |
---|
163 | \multicolumn{3}{l}{\# values} \\ |
---|
164 | |
---|
165 | cmdm:Value & a & rdfs:Literal . \\ |
---|
166 | \\ |
---|
167 | cmdm:hasElementValue & a & rdf:Property ; \\ |
---|
168 | & rdfs:domain & cmdm:Element ; \\ |
---|
169 | & rdfs:range & cmdm:Value . \\ |
---|
170 | \\ |
---|
171 | \multicolumn{3}{l}{\# add a parallel separate class/property for the resolved entities} \\ |
---|
172 | cmdm:Entity & a & rdfs:Class . \\ |
---|
173 | \\ |
---|
174 | cmdm:hasElementEntity & a & rdf:Property ; \\ |
---|
175 | & rdfs:domain & cmdm:Element ; \\ |
---|
176 | & rdfs:range & cmdm:Entity . \\ |
---|
177 | % \\ |
---|
178 | %\multicolumn{3}{l}{\# analogue for attributes ...} \\ |
---|
179 | %cmdm:hasAttributeValue & a & rdf:Property ; \\ |
---|
180 | % & rdfs:domain & cmdm:Attribute ; \\ |
---|
181 | % & rdfs:range & rdfs:Literal . \\ |
---|
182 | |
---|
183 | %cmdm:hasAttributeEntity & a & rdf:Property ; \\ |
---|
184 | % & rdfs:domain & cmdm:Attribute ; \\ |
---|
185 | % & rdfs:range & cmdm:Entity . \\ |
---|
186 | \end{example3} |
---|
187 | \end{center} |
---|
188 | \caption{The CMD meta model in RDF} |
---|
189 | \label{fig:cmd-metamodel} |
---|
190 | \end{figure*} |
---|
191 | |
---|
192 | \subsection{CMD profile and component definitions} |
---|
193 | These top-level classes and properties are subsequently used for modelling the actual profiles, components and elements as they are defined in the CR. |
---|
194 | For stand-alone components, the IRI is the (future) path into the CR to get the RDF representation for the profile/component\furl{http://catalog.clarin.eu/ds/ComponentRegistry/rest/registry/components/clarin.eu:cr1:c\_1271859438125/rdf}. For ``inner'' components (that are defined as part of another component) and elements, the identifier is a concatenation of the nearest ancestor stand-alone component's IRI and the dot-path to the given component/element (e.g., Actor:\\ \code{cr:clarin.eu:cr1:c\_1271859438197/rdf \#Actor.Actor\_Languages.Actor\_Language}\footnote{For the sake of readability, in the examples we will collapse the component IRIs, referring to them just by their name, prefixed with \code{cmd:}}). |
---|
195 | |
---|
196 | \begin{example2} |
---|
197 | cmd:collection \\ |
---|
198 | $\;$ a & cmdm:Profile ; \\ |
---|
199 | $\;$ rdfs:label & "collection" ; \\ |
---|
200 | $\;$ dc:identifier & cr:clarin.eu:cr1:p\_1345561703620 . \\ |
---|
201 | cmd:Actor \\ |
---|
202 | $\;$ a &cmdm:Component . \\ |
---|
203 | \end{example2} |
---|
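As a hedged illustration of the identifier scheme described above, the following sketch declares two inner components of \code{cmd:Actor} in the collapsed \code{cmd:} notation. The names are derived from the dot-path in the example IRI; typing both as \code{cmdm:Component} and relating them with \code{cmdm:contains} at the definition level are assumptions made for illustration, not the output of the actual transformation.

\begin{example3}
\multicolumn{3}{l}{\# illustrative sketch only; typing and nesting are assumptions} \\
\multicolumn{3}{l}{cmd:Actor.Actor\_Languages} \\
 & a & cmdm:Component . \\
\multicolumn{3}{l}{cmd:Actor.Actor\_Languages.Actor\_Language} \\
 & a & cmdm:Component . \\
cmd:Actor & cmdm:contains & cmd:Actor.Actor\_Languages . \\
\multicolumn{3}{l}{cmd:Actor.Actor\_Languages} \\
 & \multicolumn{2}{l}{cmdm:contains \quad cmd:Actor.Actor\_Languages.Actor\_Language .} \\
\end{example3}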
204 | |
---|
205 | \subsubsection{Data Categories} |
---|
206 | The primary concept registry in use by CMDI for its concept links is ISOcat. The recommended approach to link to the data categories is via an annotation property \cite{Windhouwer2012_LDL}. |
---|
207 | |
---|
208 | \begin{example2} |
---|
209 | dcr:datcat \\ |
---|
210 | $\;$ a & owl:AnnotationProperty ; \\ |
---|
211 | $\;$ rdfs:label & "data category"@en . \\ |
---|
212 | % & rdfs:comment & "This resource is equivalent to this data category."@en ; \\ |
---|
213 | % & skos:note & "The data category should be identified by its PID."@en ; \\ |
---|
214 | \end{example2} |
---|
215 | |
---|
216 | Consequently, the \code{@ConceptLink} attribute on CMD elements and components referencing the data category can be modelled as: |
---|
217 | |
---|
218 | \begin{example2} |
---|
219 | cmd:LanguageName \\ |
---|
220 | $\;$ dcr:datcat & isocat:DC-2484 . \\ |
---|
221 | \end{example2} |
---|
222 | |
---|
223 | %\subsection{RELcat - Ontological relations} |
---|
224 | % \commentx{for now we could probably skip all of relcat (although it is the future of semantic mapping ;) - we spare something for the next paper.} |
---|
225 | \begin{comment} |
---|
226 | Relations between data categories are not stored directly in the \xne{ISOcat} DCR, but rather in the dedicated Relation Registry \xne{RELcat} as RDF triples \cite{WINDHOUWER12.954} with dedicated predicates based on an extensible taxonomy of relation types. In the final paper, we will provide more details on the role of this important building block in the endeavour. |
---|
227 | |
---|
228 | |
---|
229 | A sample relation from the \xne{CMDI} relation set expressing a number of equivalences between \xne{ISOcat} data categories and \xne{dublincore} terms: |
---|
230 | |
---|
231 | \begin{example3} |
---|
232 | isocat:DC-2538 & rel:sameAs & dct:date |
---|
233 | \end{example3} |
---|
234 | |
---|
235 | \noindent |
---|
236 | By design, the relations in the Relation Registry are not expressed with predicates from known vocabularies like \xne{SKOS} or \xne{OWL}, again with the aim to avoid too strong semantic implications. This leaves leeway for further specialization of the relations in specific applications. The \code{rel:*} properties can be understood as an upper layer of a taxonomy of relation types, implying a subtyping: |
---|
237 | |
---|
238 | \begin{example3} |
---|
239 | rel:sameAs & rdfs:subPropertyOf & owl:sameAs |
---|
240 | \end{example3} |
---|
241 | \end{comment} |
---|
242 | |
---|
243 | %%%%%%%%%%%%%%%%%%%%% |
---|
244 | \subsection{CMD instances} |
---|
245 | In the next step, we express the individual CMD instances, i.e. the metadata records, in RDF. |
---|
246 | |
---|
247 | We provide a generic top-level class for all resources (including metadata records), \code{cmdm:Resource}, and the \code{cmdm:hasMimeType} predicate to type the resources. |
---|
248 | |
---|
249 | \begin{example3} |
---|
250 | \textless lr1\textgreater \\ |
---|
251 | $\enspace \,$ a & & cmdm:Resource ; \\ |
---|
252 | \multicolumn{2}{l}{cmdm:hasMimeType } & "audio/wav" . \\ |
---|
253 | \end{example3} |
---|
254 | |
---|
255 | \subsubsection {Resource Identifier} |
---|
256 | |
---|
257 | \begin{comment} |
---|
258 | It seems natural to use the PID of a Language Resource ( \code{<lr1>} ) as the resource identifier for the subject in the RDF representation. While this seems semantically sound, not every resource has to have a PID. (This is especially the case for ``virtual'' resources like collections, that are solely defined by their constituents and don't have any data on their own.) As a fall-back the PID of the MD record ( \code{<lr1.cmd>} from \code{cmd:MdSelfLink} element) could be used as the resource identifier. |
---|
259 | If identifiers are present for both resource and metadata, \end{comment} |
---|
260 | The PID of a Language Resource ( \code{<lr1>} ) is used as the IRI for the described resource in the RDF representation. |
---|
261 | The relationship between the resource and the metadata record can be expressed as an annotation using the \xne{OpenAnnotation} vocabulary\furl{http://openannotation.org/spec/core/core.html}. |
---|
262 | (Note that one MD record can describe multiple resources; this can also be easily accommodated in OpenAnnotation.) |
---|
263 | |
---|
264 | \begin{example2a} |
---|
265 | \_:anno1 \\ |
---|
266 | $\:$ a & oa:Annotation ; \\ |
---|
267 | $\:$ oa:hasTarget & \textless lr1a \textgreater, \textless lr1b\textgreater ; \\ |
---|
268 | $\:$ oa:hasBody & \_:topComponent1 ; \\ |
---|
269 | $\:$ oa:motivatedBy & oa:describing . \\ |
---|
270 | \end{example2a} |
---|
271 | |
---|
272 | \subsubsection{Provenance} |
---|
273 | |
---|
274 | The information from the CMD record \code{cmd:Header} represents the provenance information about the modelled data. |
---|
275 | |
---|
276 | \begin{example2} |
---|
277 | \_:topComponent1 \\ |
---|
278 | $\:$ dc:identifier & \textless lr1.cmd \textgreater ; \\ |
---|
279 | $\:$ dc:creator & "John Doe" ; \\ |
---|
280 | $\:$ dc:publisher & \textless http://clarin.eu\textgreater ; \\ |
---|
281 | $\:$ dc:created & "2014-02-05"\^{}\^{}xs:date . \\ |
---|
282 | \end{example2} |
---|
283 | |
---|
284 | \subsubsection{Collection hierarchy} % (Resource Proxy -- IsPartOf) |
---|
285 | |
---|
286 | In CMD, there are dedicated generic elements -- the \code{cmd:ResourceProxyList} structure -- used to express both the collection hierarchy and to point to resource(s) described by the CMD record. The collection hierarchy can be modelled as an \xne{OAI-ORE Aggregation}\furl{http://www.openarchives.org/ore/1.0/primer#Foundations}. (The links to resources are handled by \code{oa:hasTarget}.) |
---|
287 | |
---|
288 | |
---|
289 | \begin{example3} |
---|
290 | \textless lr0.cmd \textgreater & a & ore:ResourceMap . \\ |
---|
291 | \textless lr0.cmd\textgreater & ore:describes & \textless lr0.agg\textgreater . \\ |
---|
292 | \textless lr0.agg\textgreater & a & ore:Aggregation ; \\ |
---|
293 | & ore:aggregates & \textless lr1.cmd\textgreater, \textless lr2.cmd\textgreater . \\ |
---|
294 | \end{example3} |
---|
295 | |
---|
296 | \begin{comment} |
---|
297 | Additionally the flat header field \code{cmd:MdCollectionDisplayName} has been introduced to indicate by simple means the collection, of which given resource is part. |
---|
298 | This information can be used to generate a separate one-level grouping of the resources, in which the value from the \code{cmd:MdCollectionDisplayName} element would be used as the label of an otherwise undefined \code{ore:ResourceMap}. |
---|
299 | Even the identifier/ URI for this collections is not clear. Although this collections should match with the ResourceProxy hierarchy, there is no guarantee for this, thus a 1:1 mapping cannot be expected. |
---|
300 | |
---|
301 | \begin{example3} |
---|
302 | \_:mdcoll & a & ore:ResourceMap; \\ |
---|
303 | & rdfs:label & "Collection 1"; \\ |
---|
304 | \_:mdcoll\#aggreg & a & ore:Aggregation \\ |
---|
305 | & ore:aggregates & <lr1.cmd>, <lr2.cmd>; \\ |
---|
306 | \end{example3} |
---|
307 | \end{comment} |
---|
308 | |
---|
309 | \subsubsection{Components -- nested structures} |
---|
310 | For expressing the tree structure of the CMD records, i.e. the containment relation between the components, the dedicated property \code{cmdm:contains} is used: |
---|
311 | |
---|
312 | \begin{example3} |
---|
313 | \_:actor1 & a & cmd:Actor . \\ |
---|
314 | \_:actor1lang1 & a & cmd:Actor.Language . \\ |
---|
315 | \_:actor1 & cmdm:contains & \_:actor1lang1 . \\ |
---|
316 | \end{example3} |
---|
317 | |
---|
318 | \begin{comment} |
---|
319 | \noindent |
---|
320 | We use \code{cmdm:describesResource} for if the \code{@res} attribute is used , i.e., one or more references to a resource (via a proxy), on a component |
---|
321 | |
---|
322 | \begin{example3} |
---|
323 | \_:coll1 & a & cmd:collection. \\ |
---|
324 | \_:coll1 & cmdm:describesResource & <lr1> . \\ |
---|
325 | \end{example3} |
---|
326 | \end{comment} |
---|
327 | |
---|
328 | |
---|
329 | \begin{figure*} |
---|
330 | \begin{center} |
---|
331 | \begin{example3} |
---|
332 | cmd:Person & a & cmdm:Component . \\ |
---|
333 | cmd:Person.Organisation & a & cmdm:Element . \\ |
---|
334 | cmd:hasPerson.OrganisationElementValue \\ |
---|
335 | & rdfs:subPropertyOf & cmdm:hasElementValue ; \\ |
---|
336 | & rdfs:domain & cmd:Person.Organisation ; \\ |
---|
337 | & rdfs:range & xs:string . \\ |
---|
338 | cmd:hasPerson.OrganisationElementEntity \\ |
---|
339 | & rdfs:subPropertyOf & cmdm:hasElementEntity ; \\ |
---|
340 | & rdfs:domain & cmd:Person.Organisation ; \\ |
---|
341 | & rdfs:range & cmd:Person.OrganisationElementEntity .\\ |
---|
342 | cmd:Person.OrganisationElementEntity \\ |
---|
343 | & a & cmdm:Entity . \\ |
---|
344 | \\ |
---|
345 | \multicolumn{3}{l}{\# person (mentioned in a MD record) has an affiliation (cmd:Person/cmd:Organisation) } \\ |
---|
346 | \_:pers & a & cmd:Person ; \\ |
---|
347 | & cmdm:contains & \_:org . \\ |
---|
348 | \_:org & a & cmd:Person.Organisation ; \\ |
---|
349 | & \multicolumn{2}{l}{cmd:hasPerson.OrganisationElementValue \quad 'MPI'\^{}\^{}xs:string ;} \\ |
---|
350 | & \multicolumn{2}{l}{ cmd:hasPerson.OrganisationElementEntity \quad \textless http://www.mpi.nl/\textgreater . }\\ |
---|
351 | |
---|
352 | \textless http://www.mpi.nl/\textgreater & a & cmd:Person.OrganisationElementEntity . |
---|
353 | \end{example3} |
---|
354 | \end{center} |
---|
355 | \caption{Chain of statements from metamodel to literal value and corresponding semantic entity} |
---|
356 | \label{fig:final-example} |
---|
357 | \end{figure*} |
---|
358 | |
---|
359 | |
---|
360 | \subsubsection{Elements, Fields, Values}\label{sec:values} |
---|
361 | Finally, we also want to integrate the actual field values in the CMD records into the linked data. |
---|
362 | As explained before, CMD elements have to be typed as \code{rdfs:Class}, the actual value is expressed as \code{cmdm:ElementValue}, and the two are related by a \code{cmdm:hasElementValue} property. |
---|
363 | |
---|
364 | While generating triples with literal values seems straightforward, the more challenging but also more valuable aspect is to generate object property triples (predicate \code{cmdm:hasElementEntity}) with the literal values mapped to semantic entities. The example in Figure \ref{fig:final-example} shows the whole chain of statements from metamodel to literal value and corresponding semantic entity. |
---|
365 | |
---|
366 | |
---|
367 | \begin{comment} |
---|
368 | %%%%%%%%%%%%%%%%% |
---|
369 | \section{Mapping field values to semantic entities} |
---|
370 | \label{sec:values2entities} |
---|
371 | |
---|
372 | This task is a prerequisite to be able to express also the CMD instance data in RDF. The main idea is to find entities in selected reference datasets (controlled vocabularies, ontologies) matching the literal values in the metadata records. The obtained entity identifiers are further used to generate new RDF triples, representing outbound links. |
---|
373 | |
---|
374 | It involves following steps: |
---|
375 | |
---|
376 | \begin{enumerate} |
---|
377 | \item identify appropriate controlled vocabularies for individual metadata fields or data categories (manual task) |
---|
378 | \item extract \emph{distinct data category, value pairs} from the metadata records |
---|
379 | \item actual \textbf{lookup} of the individual literal values in given reference data (as indicated by the data category) to retrieve candidate entities, concepts |
---|
380 | \item assess the reliability of the match |
---|
381 | \item generate new RDF triples with entity identifiers as object properties |
---|
382 | \end{enumerate} |
---|
383 | |
---|
384 | This task is basically an application of ontology mapping method, trying to find for our ``anonymous'' concepts semantically equivalent concepts from other semantic resources / vocabularies. |
---|
385 | % This is almost equivalent to the definition of ontology mapping function as given by \cite{EhrigSure2004, amrouch2012survey}: ``for each concept (node) in ontology A [tries to] find a corresponding concept (node), which has the same or similar semantics, in ontology B and vice verse''. |
---|
386 | |
---|
387 | \subsubsection{Identify vocabularies} |
---|
388 | |
---|
389 | One generic way to indicate vocabularies for given metadata fields or data categories being discussed in the CMD community is to use dedicated annotation property in the schema or data category definition (tentatively labeled \code{@clavas:vocabulary}, cf: \emph{CMD 1.2}). |
---|
390 | |
---|
391 | The primary providers of relevant vocabularies are \xne{ISOcat} and \xne{CLAVAS} -- a service for managing and providing vocabularies in SKOS format. However, in general we have to assume/consider a number of different sources. |
---|
392 | |
---|
393 | \subsubsection{Extract input data} |
---|
394 | Starting from the literal triples as defined in the previous section (\code{cmdm:hasElementValue}) we aggregate the element values to retrieve distinct \emph{concept-value pairs}: |
---|
395 | |
---|
396 | \begin{example3} |
---|
397 | \_:1 & a & cmd:OrganisationElementEntity . \\ |
---|
398 | & skos:altLabel & "MPI"; |
---|
399 | \end{example3} |
---|
400 | |
---|
401 | \subsubsection{Lookup} |
---|
402 | |
---|
403 | In abstract terms, the lookup function takes as input the identifier of data category (or CMD element) and a literal string value and returns a list of potentially matching entities, ideally with some confidence score. Before actual lookup, there may have to be some string-normalizing preprocessing. |
---|
404 | |
---|
405 | %\begin{definition}[{signature of the lookup function}] |
---|
406 | \begin{equation} |
---|
407 | lookup \ ( \ DataCategory \ , \ Literal \ ) \quad \mapsto \quad ( \ \textless Concept \ | \ Entity ,\ confidenceScore \textgreater \ )* |
---|
408 | \end{equation} |
---|
409 | %\end{definition} |
---|
410 | |
---|
411 | In the implementation, there needs to be additional initial configuration input, identifying datasets for given data categories, |
---|
412 | which will be the result of the previous step -- identification of vocabularies. \ |
---|
413 | |
---|
414 | |
---|
415 | %\begin{definition}{Required configuration data indicating data category to available } |
---|
416 | \begin{equation} |
---|
417 | DataCategory \quad \mapsto \quad SemanticResource+ |
---|
418 | \end{equation} |
---|
419 | %\end{definition} |
---|
420 | |
---|
421 | |
---|
422 | As for the implementation, in the initial setup the system could resort to the \code{find}-interface provided by \xne{OpenSKOS}. |
---|
423 | However, in the long term a more general solution is required, a kind of hybrid \emph{vocabulary proxy service} that allows to search in a number of datasets, many of them distributed and available via varying interfaces. |
---|
424 | |
---|
425 | \subsubsection{Candidate evaluation} |
---|
426 | The lookup is the most sensitive step in the process, being the gate between ``strings'' and semantic entities. In general, the resulting candidates cannot be seen as reliable matches and should undergo further scrutiny to ensure that the match is semantically correct. In some situation this ambiguities can be resolved algorithmically, but in the end in many cases it will require human curation of the generated data. |
---|
427 | |
---|
428 | %One example: A lookup with the pair \code{<organization, "Academy of sciences">} would probably return a list of organizations, as there is a national Academy of Sciences, in a number of countries. It would require further heuristics, e.g. checking the corresponding department, contact or -- less reliably -- the language of the described resource, to determine which specific Academy of Sciences is meant in given resource description. |
---|
429 | |
---|
430 | \end{comment} |
---|
431 | |
---|
432 | \section{Implementation} |
---|
433 | |
---|
434 | The transformation of profiles and instances into RDF/XML is accomplished by a set of XSL stylesheets. In the future, when the mapping has been tested extensively, they will be integrated into the CMD core infrastructure, e.g., the CR. A linked data representation of the CLARIN joint metadata domain can then be stored in an RDF triple store and exposed via a SPARQL endpoint. |
---|
435 | %The most promising solution seems to be \xne{Virtuoso}, an integrated feature-rich hybrid data store, able to deal with different types of data (``Universal Data Store''). \cite{Haslhofer2011europeana} |
---|
436 | |
---|
437 | % Although the distributed nature of the data is one of the defining features of LOD and theoretically one should be able to follow the data by dereferencable URIs, in practice it is mostly necessary to pool into one data store linked datasets from different sources that shall be queried together due to performance reasons. This implies that the data to be kept by the data store will be decisively larger, than ``just'' the original dataset. |
---|
438 | |
---|
439 | % |
---|
440 | \section{CMDI's future in the LOD Cloud} |
---|
441 | % |
---|
442 | The main added value of LOD \cite{TimBL2006} is the interconnection of disparate datasets in the so-called LOD cloud \cite{Cyganiak2010}. |
---|
443 | |
---|
444 | The actual mapping process from CMDI values (see Section \ref{sec:values}) to entities is a complex and challenging task. The main idea is to find entities in selected reference datasets (controlled vocabularies, ontologies) corresponding to the literal values in the metadata records. The obtained entity identifiers are further used to generate new RDF triples, representing outbound links. |
---|
445 | |
---|
446 | In the broader context of the LOD Cloud, the Open Knowledge Foundation's Working Group on Linked Data in Linguistics represents an obvious pool of candidate |
---|
447 | datasets to link the CMD data with\footnote{\url{http://linguistics.okfn.org/resources/llod/}}. Among these, \xne{lexvo} seems a most promising starting point, as it features URIs like \url{http://lexvo.org/id/term/eng/}, i.e. based on the ISO-639-3 language identifiers which are also used in CMD records. |
---|
448 | \xne{lexvo} also seems suitable as it is already linked with a number of other LOD linguistic datasets like \xne{WALS}, \xne{lingvoj} and \xne{Glottolog}. |
---|
449 | Of course, language is just one dimension to use for mapping. |
---|
450 | Step by step we will link other categories like countries, geographica, organisations, etc. |
---|
451 | to some of the central nodes of the LOD cloud, like \xne{dbpedia}, \xne{Yago} or \xne{geonames}, |
---|
452 | but also to domain-specific semantic resources like the ontology for language technology \xne{LT-World} \cite{Joerg2010} developed at DFKI. |
---|
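To make the intended linking more concrete, the following sketch shows how a language value in a CMD record might be connected to a \xne{lexvo} entity using the properties from Section \ref{sec:CMDM}. It is an illustration under assumptions rather than output of the actual transformation: the element name \code{cmd:Language.ISO639}, the two derived property names and the blank node label are hypothetical, and the target IRI assumes the \xne{lexvo} namespace for ISO 639-3 codes, \url{http://lexvo.org/id/iso639-3/}.

\begin{example3}
\multicolumn{3}{l}{\# hypothetical element and property names, for illustration only} \\
\_:lang1 & a & cmd:Language.ISO639 ; \\
 & \multicolumn{2}{l}{cmd:hasLanguage.ISO639ElementValue \quad "eng"\^{}\^{}xs:string ;} \\
 & \multicolumn{2}{l}{cmd:hasLanguage.ISO639ElementEntity \quad \textless http://lexvo.org/id/iso639-3/eng\textgreater .} \\
\end{example3}

With such a triple in place, records could be grouped by the resolved entity rather than by the literal string, so that differently spelled language values pointing to the same entity would be found together.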
453 | |
---|
454 | \section{Conclusions} |
---|
455 | In this paper, we sketched the encoding of the whole CMD data domain in RDF, with special focus on the core model -- the general component schema. In the future we will extend this by mapping element values to semantic entities. |
---|
456 | |
---|
457 | With this new enhanced dataset, the groundwork is laid for a full-blown \emph{semantic search}, i.e., the possibility of exploring the dataset indirectly using external semantic resources (like vocabularies of organizations or taxonomies of resource types) to which the CMD data will then be linked. |
---|
458 | |
---|
459 | \bibliographystyle{lrec2006} |
---|
460 | \bibliography{CMD2RDF} |
---|
461 | |
---|
462 | \end{document} |
---|