1 | \chapter{Mapping on instance level,\\ CMD as LOD} |
---|
2 | \label{ch:design-instance} |
---|
3 | |
---|
4 | \begin{quotation} |
---|
5 | I do think that ISOcat, CLAVAS, RELcat and actual language |
---|
6 | resource all provide a part of the semantic network. |
---|
7 | |
---|
8 | And if you can express these all in RDF, which we can for almost all of them (maybe |
---|
9 | except the actual language resource ... unless it has a schema adorned |
---|
10 | with ISOcat DC references ... \textless insert a SCHEMAcat plug ;-) \textgreater, but for |
---|
11 | metadata we have that in the CMDI profiles ...) you could load all the |
---|
12 | relevant parts in a triple store and do your SPARQL/reasoning on it. Well |
---|
13 | that's where I'm ultimately heading with all these registries related to |
---|
14 | semantic interoperability ... I hope ;-)\cite{Menzo2013mail} |
---|
15 | \end{quotation} |
---|
16 | |
---|
17 | As described in previous chapters (\ref{ch:infra}, \ref{ch:design}), semantic interoperability is one of the main motivations for the CMD infrastructure. However, the established machinery pertains mostly to the schema level; the actual values in the fields of CMD instances remain ``just strings''. This is the case even though the problem of different labels for semantically equivalent or even identical entities is even more acute on the instance level. While for a number of metadata fields the value domain can be enforced through schema validation, some important fields (like \concept{organization} or \concept{resource type}) have a constrained value domain that nevertheless cannot be exhaustively enumerated. This leads to a chronically inconsistent use of labels for referring to entities (as the instance data shows, some organizations are referred to by more than 20 different labels or spelling variants), prompting an urgent need for better means of harmonizing the values of such constrained fields. |
---|
18 | |
---|
19 | One potential remedy is the use of reference datasets -- controlled vocabularies, taxonomies, ontologies and such. In fact, this is a very common approach, be it the authority files in the library world or the domain-specific reference vocabularies maintained by practically every research community. Not being as strict as schema definitions, they cannot be used for validation, but they still help to harmonize the data by offering preferred labels and identifiers for entities. |
---|
20 | |
---|
21 | In this chapter, we explore how this general approach can be employed for our specific problem of harmonizing the (literal) values in selected instance fields and mapping them to entities defined in corresponding vocabularies. This proposal is furthermore embedded in a more general effort to \textbf{express the whole of the CMD data domain (model and instances) in RDF} constituting one large ontology interlinked with existing external semantic resources (ontologies, knowledge bases, vocabularies). This result lays a foundation for providing the original dataset as a \emph{Linked Open Data} nucleus within the \emph{Web of Data}\cite{TimBL2006} |
---|
22 | as well as for real semantic (ontology-driven) search and exploration of the data. |
---|
23 | |
---|
24 | The following section \ref{sec:cmd2rdf} lays out how individual parts of the CMD framework can be expressed in RDF. |
---|
25 | In \ref{sec:values2entities} we investigate in further detail the above-mentioned critical aspect of the effort, namely the task of translating the string values in metadata fields to corresponding semantic entities. Finally, the technical aspects of providing the resulting ontology as LOD and the implications for an ontology-driven semantic search are discussed briefly in \ref{sec:lod} and \ref{semantic-search}, respectively. |
---|
26 | |
---|
27 | \section{CMD to RDF} |
---|
28 | \label{sec:cmd2rdf} |
---|
29 | In this section, RDF encoding is proposed for all levels of the CMD data domain: |
---|
30 | |
---|
31 | \begin{itemize} |
---|
32 | \item CMD meta model |
---|
33 | \item profile definitions |
---|
34 | \item the administrative and structural information of CMD records |
---|
35 | \item individual values in the fields of the CMD records |
---|
36 | \end{itemize} |
---|
37 | |
---|
38 | \subsection{CMD specification} |
---|
39 | |
---|
40 | The main entity of the meta model is the CMD component, which is typed as a specialization of \code{owl:Class}. A CMD profile is basically a CMD component with some extra features, implying a specialization relation: |
---|
41 | |
---|
42 | \label{table:rdf-spec} |
---|
43 | \begin{example3} |
---|
44 | cmds:Component & rdfs:subClassOf & owl:Class. \\ |
---|
45 | cmds:Profile & rdfs:subClassOf & cmds:Component. \\ |
---|
46 | cmds:Element & rdfs:subClassOf & rdf:Property. \\ |
---|
47 | \end{example3} |
---|
48 | |
---|
49 | \noindent |
---|
50 | These entities are used for typing the actual profiles, components and elements (as they are defined in the Component Registry): |
---|
51 | |
---|
52 | \label{table:rdf-cmd} |
---|
53 | \begin{example3} |
---|
54 | cmd:collection & a & cmds:Profile; \\ |
---|
55 | & rdfs:label & "collection"; \\ |
---|
56 | & dcterms:identifier & cr:clarin.eu:cr1:p\_1345561703620. \\ |
---|
57 | cmd:Actor & a & cmds:Component. \\ |
---|
58 | cmd:LanguageName & a & cmds:Element. \\ |
---|
59 | \end{example3} |
---|
60 | |
---|
61 | \begin{note} |
---|
62 | Should the ID assigned in the Component Registry for the CMD entities be used as identifier in RDF, or rather the verbose name? (if yes, how to ensure uniqueness -- generate the name from the cmd-path?) |
---|
63 | \end{note} |
---|
64 | |
---|
65 | \subsection{Data Categories} |
---|
66 | Windhouwer \cite{Windhouwer2012_LDL} proposes to use the data categories as annotation properties: |
---|
67 | |
---|
68 | \begin{example3} |
---|
69 | dcr:datcat & a & owl:AnnotationProperty ; \\ |
---|
70 | & rdfs:label & "data category"@en ; \\ |
---|
71 | & rdfs:comment & "This resource is equivalent to this data category."@en ; \\ |
---|
72 | & skos:note & "The data category should be identified by its PID."@en ; \\ |
---|
73 | \end{example3} |
---|
74 | |
---|
75 | That implies that the \code{@ConceptLink} attribute on CMD elements and components as used in the CMD profiles to reference the data category would be modelled as: |
---|
76 | |
---|
77 | \begin{example3} |
---|
78 | cmd:LanguageName & dcr:datcat & isocat:DC-2484. \\ |
---|
79 | \end{example3} |
---|
80 | |
---|
81 | Encoding data categories as annotation properties contrasts with the common approach seen with dublincore terms, |
---|
82 | which are usually used directly as data properties: |
---|
83 | |
---|
84 | \begin{example3} |
---|
85 | <lr1> & dc:title & "Language Resource 1" |
---|
86 | \end{example3} |
---|
87 | |
---|
88 | \noindent |
---|
89 | Analogously, we could model \xne{ISOcat} data categories as data properties, i.e. metadata elements referencing ISOcat data categories could be encoded as follows: |
---|
90 | |
---|
91 | \begin{example3} |
---|
92 | <lr1> & isocat:DC-2502 & "19th century" |
---|
93 | \end{example3} |
---|
94 | |
---|
95 | \noindent |
---|
96 | However, Windhouwer\cite{Windhouwer2012_LDL} argues against direct mapping of complex data categories to data properties and in favour of modelling data categories as annotation properties, so as to avoid too strong semantic implications. |
---|
97 | |
---|
98 | This raises the converse question of whether all data categories should rather be handled uniformly, which would mean encoding dublincore terms as annotation properties as well. The pragmatic view, however, suggests encoding the data in line with the prevailing approach, i.e. expressing dublincore terms directly as data properties. |
---|
99 | |
---|
100 | |
---|
101 | \noindent |
---|
102 | The REST web service of \xne{ISOcat} provides an RDF representation of the data categories: |
---|
103 | |
---|
104 | \begin{example3} |
---|
105 | isocat:languageName & dcr:datcat & isocat:DC-2484; \\ |
---|
106 | & rdfs:label & "language name"@en; \\ |
---|
107 | & rdfs:comment & "A human understandable..."@en; \\ |
---|
108 | & \ldots \\ |
---|
109 | \end{example3} |
---|
110 | |
---|
111 | However, this is meant only as a template, as stated in the explanatory comment of the exported data: |
---|
112 | |
---|
113 | \begin{quotation} |
---|
114 | By default the RDF export inserts \code{dcr:datcat} annotation properties to maintain the link between the generated RDF resources and the used Data Categories. However, it is possible to also maintain a stronger semantic link when the RDF resources will be used as OWL (2) classes, properties or individuals. |
---|
115 | \end{quotation} |
---|
116 | |
---|
117 | So in a specific (OWL 2) application the relation with the data categories can be expressed as \code{owl:equivalentClass} for classes, \code{owl:equivalentProperty} for properties or \code{owl:sameAs} for individuals: |
---|
118 | |
---|
119 | \begin{example3} |
---|
120 | \#myPOS & owl:equivalentClass & isocat:DC-1345. \\ |
---|
121 | \#myPOS & owl:equivalentProperty & isocat:DC-1345. \\ |
---|
122 | \#myNoun & owl:sameAs & isocat:DC-1333. \\ |
---|
123 | \end{example3} |
---|
124 | |
---|
125 | |
---|
126 | \subsection{RELcat - Ontological relations} |
---|
127 | As described in \ref{def:rr}, relations between data categories are not stored directly in the \xne{ISOcat} DCR, but rather in a dedicated module, the Relation Registry \xne{RELcat}. The relations are grouped into relation sets and stored as RDF triples\cite{SchuurmanWindhouwer2011}. The following sample relation is taken from the \xne{CMDI} relation set, which expresses a number of equivalences between \xne{ISOcat} data categories and \xne{dublincore} terms: |
---|
128 | |
---|
129 | \begin{example3} |
---|
130 | isocat:DC-2538 & rel:sameAs & dct:date |
---|
131 | \end{example3} |
---|
132 | |
---|
133 | \noindent |
---|
134 | By design, the relations in the Relation Registry are not expressed with predicates from known vocabularies like \xne{SKOS} or \xne{OWL}, again with the aim of avoiding too strong semantic implications. This leaves leeway for further specialization of the relations in specific applications. |
---|
135 | |
---|
136 | \begin{note} |
---|
137 | Does this mean, that I would say: |
---|
138 | \begin{example3} |
---|
139 | rel:sameAs & owl:equivalentProperty & owl:sameAs |
---|
140 | \end{example3} |
---|
141 | |
---|
142 | to enable the inference of the equivalences? |
---|
143 | |
---|
144 | Is this correct: |
---|
145 | \end{note} |
---|
146 | ?? That means, that to be able to infer that a value in a CMD element also pertains to a given data category, e.g.: |
---|
147 | |
---|
148 | \begin{example2} |
---|
149 | cmd:PublicationYear = 2012 $\rightarrow$ & dc:created = 2012 |
---|
150 | \end{example2} |
---|
151 | |
---|
152 | \noindent |
---|
153 | the following facts need to be present in the ontology: |
---|
154 | |
---|
155 | \begin{example3} |
---|
156 | <lr1> & cmd:PublicationYear & "2012"\^{}\^{}xsd:gYear \\ |
---|
157 | cmd:PublicationYear & owl:equivalentProperty & isocat:DC-2538 \\ |
---|
158 | isocat:DC-2538 & rel:sameAs & dc:created \\ |
---|
159 | rel:sameAs & owl:equivalentProperty & owl:sameAs \\ |
---|
160 | $\rightarrow$ \\ |
---|
161 | <lr1> & dc:created & "2012"\^{}\^{}xsd:gYear \\ |
---|
162 | \end{example3} |
---|
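
\noindent
The chain of facts above can be traced with a small forward-chaining sketch. It is only an illustration and not a real OWL reasoner: the function \code{infer} and its single rule are assumptions made here, and reading \code{owl:sameAs} between two property identifiers as interchangeability of the predicates is likewise an assumption, in line with the tentative declaration discussed in the note above.

```python
# Minimal forward-chaining sketch (NOT a real OWL reasoner).
# Triples are plain (subject, predicate, object) tuples of strings.

EQ_PROP = "owl:equivalentProperty"
SAME_AS = "owl:sameAs"

# The four premises of the example above (datatype omitted for brevity).
FACTS = [
    ("<lr1>", "cmd:PublicationYear", "2012"),
    ("cmd:PublicationYear", EQ_PROP, "isocat:DC-2538"),
    ("isocat:DC-2538", "rel:sameAs", "dc:created"),
    ("rel:sameAs", EQ_PROP, SAME_AS),
]


def infer(triples):
    """Repeat one simplified rule until no new triple is produced.

    Rule (assumption): if a and b are linked by owl:equivalentProperty
    or owl:sameAs, every triple (s, a, o) also holds as (s, b, o).
    """
    facts = set(triples)
    changed = True
    while changed:
        changed = False
        # Pairs of predicates currently regarded as interchangeable.
        pairs = {(s, o) for s, p, o in facts if p in (EQ_PROP, SAME_AS)}
        pairs |= {(b, a) for a, b in pairs}  # both relations are symmetric
        for a, b in pairs:
            for s, p, o in list(facts):
                if p == a and (s, b, o) not in facts:
                    facts.add((s, b, o))
                    changed = True
    return facts


inferred = infer(FACTS)
```

In this sketch the first pass turns the \code{rel:sameAs} statement into an \code{owl:sameAs} statement, and the second pass propagates the value of \code{cmd:PublicationYear} to \code{dc:created}. A production setup would leave this to the reasoner of the triple store.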
163 | |
---|
164 | \noindent |
---|
165 | What about other relations we may want to express? (Do we need them, and if yes, where to put them -- still in the RR?) Examples: |
---|
166 | |
---|
167 | \begin{example3} |
---|
168 | cmd:MDCreator & rdfs:subClassOf & dcterms:Agent \\ |
---|
169 | clavas:Organization & rdfs:subClassOf & dcterms:Agent \\ |
---|
170 | <org1> & a & clavas:Organization \\ |
---|
171 | \end{example3} |
---|
172 | |
---|
173 | \subsection{CMD instances} |
---|
174 | In the next step, we want to express the individual CMD instances, the metadata records, making use of the previously defined entities on the schema level, but also entities from external ontologies. |
---|
175 | |
---|
176 | \subsubsection {Resource Identifier} |
---|
177 | |
---|
178 | It seems natural to use the PID of a Language Resource (\code{<lr1>}) as the resource identifier for the subject in the RDF representation. While this seems semantically sound, not every resource has a PID. (This is especially the case for ``virtual'' resources like collections, which are defined solely by their constituents and do not have any data of their own.) As a fall-back, the PID of the MD record (\code{<lr1.cmd>} from the \code{cmd:MdSelfLink} element) could be used as the resource identifier. |
---|
179 | If identifiers are present for both resource and metadata, the relationship between the resource and the metadata record can be expressed as an annotation using the \xne{OpenAnnotation} vocabulary\furl{http://openannotation.org/spec/core/core.html\#Motivations}: |
---|
180 | |
---|
181 | \begin{example3} |
---|
182 | \_:anno1 & a & oa:Annotation; \\ |
---|
183 | & oa:hasTarget & <lr1>; \\ |
---|
184 | & oa:hasBody & <lr1.cmd>; \\ |
---|
185 | & oa:motivatedBy & oa:describing \\ |
---|
186 | \end{example3} |
---|
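
\noindent
The fall-back rule for the resource identifier and the optional annotation can be summarized in a minimal sketch. The function names \code{subject\_identifier} and \code{describing\_annotation} are hypothetical, as is the fixed blank node label; the sketch only restates the rule described above and assumes that both identifiers are passed in as plain strings (or \code{None} if missing).

```python
def subject_identifier(resource_pid, md_self_link):
    """Return the identifier to use as RDF subject for a CMD record.

    Prefers the PID of the resource and falls back to the PID of the
    metadata record (cmd:MdSelfLink) if the resource has none.
    """
    if resource_pid:
        return resource_pid
    if md_self_link:
        return md_self_link
    raise ValueError("record has neither a resource PID nor an MdSelfLink")


def describing_annotation(resource_pid, md_self_link, node="_:anno1"):
    """Return OpenAnnotation triples if both identifiers are present."""
    if not (resource_pid and md_self_link):
        return []  # nothing to relate when one identifier is missing
    return [
        (node, "rdf:type", "oa:Annotation"),
        (node, "oa:hasTarget", resource_pid),
        (node, "oa:hasBody", md_self_link),
        (node, "oa:motivatedBy", "oa:describing"),
    ]
```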
187 | |
---|
188 | \subsubsection{Provenance} |
---|
189 | |
---|
190 | The information from \code{cmd:Header} represents the provenance information about the modelled data: |
---|
191 | |
---|
192 | \begin{example3} |
---|
193 | <lr1.cmd> & dcterms:identifier & <lr1.cmd>; \\ |
---|
194 | & dcterms:creator ?? & "\var{\{cmd:MdCreator\}}"; \\ |
---|
195 | & dcterms:publisher & <http://clarin.eu>, <provider-oai-accesspoint>; ?? \\ |
---|
196 | & dcterms:created /dcterms:modified? & "\var{\{cmd:MdCreated\}}" ?? \\ |
---|
197 | \end{example3} |
---|
198 | |
---|
199 | \subsubsection{Hierarchy (Resource Proxy -- IsPartOf)} |
---|
200 | In CMD, the \code{cmd:ResourceProxyList} structure is used both to express the collection hierarchy and to point to the resource(s) described by the MD record. This can be modelled as an \xne{OAI-ORE Aggregation}\furl{http://www.openarchives.org/ore/1.0/primer\#Foundations} |
---|
201 | : |
---|
202 | |
---|
203 | \begin{example3} |
---|
204 | <lr0.cmd> & a & ore:ResourceMap \\ |
---|
205 | <lr0.cmd> & ore:describes & <lr0.agg> \\ |
---|
206 | <lr0.agg> & a & ore:Aggregation \\ |
---|
207 | & ore:aggregates & <lr1.cmd>, <lr2.cmd>; \\ |
---|
208 | \end{example3} |
---|
209 | |
---|
210 | \noindent |
---|
211 | ?? Should both collection hierarchy and resource-pointers (collection and resource MD records) be encoded as ore:Aggregation? |
---|
212 | Additionally, the flat header field \code{cmd:MdCollectionDisplayName} has been introduced to indicate, by simple means, the collection of which a given resource is part. |
---|
213 | This information can be used to generate a separate one-level grouping of the resources, in which the value from the \code{cmd:MdCollectionDisplayName} element would be used as the label of an otherwise undefined \code{ore:ResourceMap}. |
---|
214 | Even the identifier/URI for these collections is not clear. Although these collections should match the ResourceProxy hierarchy, there is no guarantee of this, so a 1:1 mapping cannot be expected. |
---|
215 | |
---|
216 | \todocode{check consistency for MdCollectionDisplayName vs. IsPartOf in the instance data} |
---|
217 | |
---|
218 | \begin{example3} |
---|
219 | \_:mdcoll & a & ore:ResourceMap; \\ |
---|
220 | & rdfs:label & "Collection 1"; \\ |
---|
221 | \_:mdcoll\#aggreg & a & ore:Aggregation \\ |
---|
222 | & ore:aggregates & <lr1.cmd>, <lr2.cmd>; \\ |
---|
223 | \end{example3} |
---|
224 | |
---|
225 | \subsubsection{Components -- nested structures} |
---|
226 | |
---|
227 | There are two variants to express the tree structure of the CMD records, i.e. the containment relation between the components: |
---|
228 | |
---|
229 | \begin{enumerate}[a)] |
---|
230 | \item each component is encoded as an object property |
---|
231 | |
---|
232 | \begin{example3} |
---|
233 | <lr1> & cmd:Actor & \_:Actor1 \\ |
---|
234 | <lr1> & cmd:Actor & \_:Actor2 \\ |
---|
235 | \_:Actor1 & cmd:motherTongue & iso-639:aac \\ |
---|
236 | \_:Actor2 & cmd:motherTongue & iso-639:deu \\ |
---|
237 | \_:Actor1 & cmd:role & "Interviewer" \\ |
---|
238 | \_:Actor2 & cmd:role & "Speaker" \\ |
---|
239 | \end{example3} |
---|
240 | |
---|
241 | \item a dedicated object property is used |
---|
242 | |
---|
243 | \begin{example3} |
---|
244 | \_:Actor1 & a & cmd:Actor \\ |
---|
245 | <lr1> & cmd:contains & \_:Actor1 \\ |
---|
246 | \end{example3} |
---|
247 | |
---|
248 | \end{enumerate} |
---|
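
\noindent
The difference between the two variants can be made concrete with a short sketch that emits the triples for one nested component. The function names and the scheme for generating blank node labels are assumptions made for illustration only; neither variant is implied to be the preferred one.

```python
from itertools import count


def nest_as_property(subject, component, fields, ids):
    """Variant a): the component name itself serves as the object property."""
    node = f"_:{component}{next(ids)}"
    triples = [(subject, f"cmd:{component}", node)]
    triples += [(node, f"cmd:{name}", value) for name, value in fields.items()]
    return triples


def nest_with_contains(subject, component, fields, ids):
    """Variant b): a dedicated cmd:contains property plus explicit typing."""
    node = f"_:{component}{next(ids)}"
    triples = [
        (node, "rdf:type", f"cmd:{component}"),
        (subject, "cmd:contains", node),
    ]
    triples += [(node, f"cmd:{name}", value) for name, value in fields.items()]
    return triples


# Example: one Actor component of the record <lr1>.
ids = count(1)
variant_a = nest_as_property("<lr1>", "Actor", {"role": "Interviewer"}, ids)
variant_b = nest_with_contains("<lr1>", "Actor", {"role": "Speaker"}, ids)
```

Variant a) yields one triple fewer per component, while variant b) keeps the containment relation uniform across all component types, which may simplify generic queries over the tree structure.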
249 | |
---|
250 | \subsection{Elements, Fields, Values} |
---|
251 | Finally, we want to integrate also the actual field values in the CMD records into the ontology. |
---|
252 | |
---|
253 | \subsubsection{Predicates} |
---|
254 | As explained before, CMD elements are typed as \code{rdf:Property}, with the corresponding data category expressed as an annotation property: |
---|
255 | |
---|
256 | \begin{example3} |
---|
257 | cmd:timeCoverage & a & cmds:Element \\ |
---|
258 | cmd:timeCoverage & dcr:datcat & isocat:DC-2502 \\ |
---|
259 | <lr1> & cmd:timeCoverage & "19th century" \\ |
---|
260 | |
---|
261 | \end{example3} |
---|
262 | |
---|
263 | \subsubsection{Literal values -- data properties} |
---|
264 | |
---|
265 | Generating triples with literal values is straightforward: |
---|
266 | |
---|
267 | \begin{definition}{Literal triples} |
---|
268 | lr:Resource \ \quad cmds:Property \ \quad xsd:string |
---|
269 | \end{definition} |
---|
270 | |
---|
271 | \begin{example3} |
---|
272 | <lr1> & cmd:Organisation & "MPI" \\ |
---|
273 | \end{example3} |
---|
274 | |
---|
275 | \subsubsection{Mapping to entities -- object properties} |
---|
276 | |
---|
277 | The more challenging but also more valuable aspect is to generate objectProperty triples with the literal values mapped to semantic entities: |
---|
278 | |
---|
279 | \begin{definition}{new RDF triples} |
---|
280 | lr:Resource \ \quad cmd:Property \ \quad xsd:anyURI |
---|
281 | \end{definition} |
---|
282 | |
---|
283 | \begin{example3} |
---|
284 | <lr1> & cmd:Organisation\_? & <org1> \\ |
---|
285 | \end{example3} |
---|
286 | |
---|
287 | \begin{note} |
---|
288 | Don't we need a separate property (predicate) for the triples with object properties pointing to entities, |
---|
289 | i.e. \code{cmd:Organisation\_} in addition to \code{cmd:Organisation}? |
---|
290 | \end{note} |
---|
291 | |
---|
292 | The mapping process is detailed in \ref{sec:values2entities}. |
---|
293 | |
---|
294 | %%%%%%%%%%%%%%%%%55 |
---|
295 | \section{Mapping field values to semantic entities} |
---|
296 | \label{sec:values2entities} |
---|
297 | |
---|
298 | This task is a prerequisite for expressing the CMD instance data in RDF as well. The main idea is to find entities in selected reference datasets (controlled vocabularies, ontologies) that match the literal values in the metadata records. The obtained entity identifiers are then used to generate new RDF triples. The task involves the following steps: |
---|
299 | |
---|
300 | \begin{enumerate} |
---|
301 | \item identify appropriate controlled vocabularies for individual metadata fields or data categories (manual task) |
---|
302 | \item extract \emph{distinct data category, value pairs} from the metadata records |
---|
303 | \item actual \textbf{lookup} of the individual literal values in given reference data (as indicated by the data category) to retrieve candidate entities, concepts |
---|
304 | \item assess the reliability of the match |
---|
305 | \item generate new RDF triples with entity identifiers as object properties |
---|
306 | \end{enumerate} |
---|
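
\noindent
Step 2 of the list above, the extraction of distinct pairs, can be sketched as follows. This is a minimal illustration under stated assumptions: the records are assumed to be already parsed into lists of (data category, value) tuples, the function name \code{distinct\_pairs} is hypothetical, and the identifier \code{datcat:organisation} used in the example is a placeholder rather than an actual data category PID.

```python
from collections import Counter


def distinct_pairs(records):
    """Count distinct (data category, value) pairs over all records.

    records: iterable of lists of (data_category, value) tuples,
             one list per metadata record.
    Surrounding whitespace is stripped so that trivially different
    spellings of the same value are not counted separately.
    """
    counts = Counter()
    for fields in records:
        for data_category, value in fields:
            counts[(data_category, value.strip())] += 1
    return counts


# Example with a placeholder data category identifier.
example_records = [
    [("datcat:organisation", "MPI")],
    [("datcat:organisation", " MPI "), ("datcat:organisation", "Max Planck Institute")],
]
example_counts = distinct_pairs(example_records)
```

Keeping the frequency of each pair, rather than only the set of pairs, would allow the subsequent lookup and curation to be prioritized by how often a value actually occurs in the data.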
307 | |
---|
308 | \begin{figure*}[!ht] |
---|
309 | \includegraphics[width=1\textwidth]{images/SMC_CMD2LOD} |
---|
310 | \caption{Sketch of the process of transforming the CMD metadata records to an RDF representation} |
---|
311 | \label{fig:smc_cmd2lod} |
---|
312 | \end{figure*} |
---|
313 | |
---|
314 | \subsubsection{Identify vocabularies} |
---|
315 | |
---|
316 | \todoin{Identify related ontologies, vocabularies? - see DARIAH:CV} |
---|
317 | LT-World \cite{Joerg2010} |
---|
318 | |
---|
319 | One generic way to indicate vocabularies for given metadata fields or data categories, currently being discussed in the CMD community, is to use a dedicated annotation property (tentatively \code{@clavas:vocabulary}) in the schema or data category definition. For such a mechanism to work, the consuming applications (like a metadata editor) need to be made aware of this convention and interpret it accordingly. |
---|
320 | |
---|
321 | The primary providers of relevant vocabularies are \xne{ISOcat} and \xne{CLAVAS} -- a service for managing and providing vocabularies in SKOS format (cf. \ref{def:CLAVAS}). Closed data categories and the corresponding simple data categories are already being exported from ISOcat in SKOS format and imported into CLAVAS/OpenSKOS; other relevant vocabularies are also to be ingested into this system, so that we can assume OpenSKOS as a first source of vocabularies. However, certainly not all of the existing reference data will be hosted by OpenSKOS, so in general we have to consider a number of different sources (cf. \ref{refdata}). |
---|
322 | |
---|
323 | Data in OpenSKOS is modelled purely in SKOS, so there is no more specific typing of the entities in the vocabularies; rather, all the entities are instances of \code{skos:Concept}: |
---|
324 | |
---|
325 | \begin{example3} |
---|
326 | <org1> & a & skos:Concept \\ |
---|
327 | \end{example3} |
---|
328 | |
---|
329 | \noindent |
---|
330 | We may want to add some more typing and introduce classes for entities from individual vocabularies like \code{clavas:Organization} or similar. Insofar as CLAVAS will also maintain mappings/links to other datasets |
---|
331 | |
---|
332 | \begin{example3} |
---|
333 | <org1> & skos:exactMatch & <dbpedia/org1>, <lt-world/orgx>; |
---|
334 | \end{example3} |
---|
335 | |
---|
336 | \noindent |
---|
337 | we could use it to expand the data with alternative identifiers, fostering the interlinking of data: |
---|
338 | |
---|
339 | \begin{example3} |
---|
340 | <org1> & dcterms:identifier & <org1>, <dbpedia/org1>, <lt-world/orgx>; |
---|
341 | \end{example3} |
---|
342 | |
---|
343 | \subsubsection{Lookup} |
---|
344 | |
---|
345 | In abstract terms, the lookup function takes as input the identifier of a data category (or CMD element) and a literal string value and returns a list of potentially matching entities. Before the actual lookup, some string-normalizing preprocessing may be necessary. |
---|
346 | |
---|
347 | \begin{definition}{signature of the lookup function} |
---|
348 | lookup \ ( \ DataCategory \ , \ Literal \ ) \quad \mapsto \quad ( \ Concept \ | \ Entity \ )* |
---|
349 | \end{definition} |
---|
350 | |
---|
351 | In the implementation, there needs to be additional initial configuration input, identifying datasets for given data categories, |
---|
352 | which will be the result of the previous step. |
---|
353 | |
---|
354 | \begin{definition}{Required configuration data mapping data categories to available datasets} |
---|
355 | DataCategory \quad \mapsto \quad Dataset+ |
---|
356 | \end{definition} |
---|
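
\noindent
The lookup signature and the required configuration can be illustrated with a minimal sketch. Everything concrete in it is an assumption: the in-memory dictionaries stand in for real vocabularies, the identifier \code{datcat:organisation}, the entity URIs and the labels are placeholders, and exact matching on normalized labels is just one simple strategy among several possible ones.

```python
import unicodedata

# Hypothetical configuration: data category -> datasets (result of step 1).
DATASETS_FOR = {"datcat:organisation": ["organisations"]}

# Hypothetical in-memory vocabularies: dataset -> {entity URI: [labels]}.
VOCABULARIES = {
    "organisations": {
        "<org1>": ["Max Planck Institute", "MPI"],
        "<org2>": ["Academy of Sciences"],
    }
}


def normalize(literal):
    """Strip accents, fold case and collapse whitespace before comparison."""
    text = unicodedata.normalize("NFKD", literal)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    return " ".join(text.casefold().split())


def lookup(data_category, literal):
    """Return the URIs of all entities having a label equal to the literal.

    Only the datasets configured for the given data category are searched;
    an unknown data category yields an empty candidate list.
    """
    wanted = normalize(literal)
    matches = []
    for dataset in DATASETS_FOR.get(data_category, []):
        for uri, labels in VOCABULARIES[dataset].items():
            if any(normalize(label) == wanted for label in labels):
                matches.append(uri)
    return matches
```

For instance, \code{lookup("datcat:organisation", " mpi ")} would return the single candidate \code{<org1>} in this toy setup, whereas a real vocabulary would typically return several candidates that need further evaluation.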
357 | |
---|
358 | As for the implementation, in the initial setup the system could resort to the \code{find}-interface provided by \xne{OpenSKOS}. |
---|
359 | However, in the long term a more general solution is required: a kind of hybrid \emph{vocabulary proxy service} that allows searching in a number of datasets, many of them distributed and available via different interfaces. Figure \ref{fig:vocabulary_proxy} sketches the general setup. The service has to be able to a) proxy search requests to a number of search interfaces (SRU, SPARQL), and b) fetch, cache and search in datasets. |
---|
360 | |
---|
361 | \begin{figure*}[!ht] |
---|
362 | \includegraphics[width=1\textwidth]{images/VocabularyProxy_clientapp} |
---|
363 | \caption{Sketch of a general setup for vocabulary lookup via a \xne{VocabularyProxy} service} |
---|
364 | \label{fig:vocabulary_proxy} |
---|
365 | \end{figure*} |
---|
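
\noindent
The basic behaviour expected of such a proxy service can be outlined in a few lines. This is a sketch under strong assumptions: the class name \code{VocabularyProxy} is borrowed from the figure, but its interface is invented here, and the backends are plain callables standing in for real SRU or SPARQL clients, which are not shown.

```python
class VocabularyProxy:
    """Fan a query out to several vocabulary backends and cache the result."""

    def __init__(self, backends):
        # backends: dict mapping a source name to a callable that takes
        # a query string and returns a list of entity URIs.
        self.backends = backends
        self.cache = {}

    def search(self, query):
        """Return (source name, entity URI) pairs for the query."""
        if query not in self.cache:
            results = []
            for name, backend in self.backends.items():
                try:
                    results.extend((name, uri) for uri in backend(query))
                except Exception:
                    # A failing remote source should not break the lookup.
                    continue
            self.cache[query] = results
        return self.cache[query]
```

Keeping the source name with every hit lets the consuming application weigh candidates by the provenance of the vocabulary. Cache invalidation and the fetching of whole datasets are deliberately left out of this sketch.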
366 | |
---|
367 | \subsubsection{Candidate evaluation} |
---|
368 | The lookup is the most sensitive step in the process, as that is the gate between strings and semantic entities. In general, the resulting candidates cannot be seen as reliable and should undergo further scrutiny to ensure that the match is semantically correct. |
---|
369 | |
---|
370 | One example: A lookup with the pair \code{<organization, "Academy of sciences">} would probably return a list of organizations, as there is a national Academy of Sciences in a number of countries. It would require further heuristics, e.g. checking the corresponding department, contact or -- less reliably -- the language of the described resource, to determine which specific Academy of Sciences is meant in the given resource description. |
---|
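
\noindent
A heuristic of this kind could be sketched as a simple scoring of the candidates against hints taken from the record. The function name \code{rank\_candidates}, the field names and the weights are all assumptions chosen for illustration; they only encode the idea that contact information is a stronger hint than the language of the resource.

```python
def rank_candidates(candidates, context):
    """Order candidate entities by their agreement with the record context.

    candidates: list of dicts with 'uri' and optional 'country', 'language'.
    context:    dict with optional 'contact_country', 'resource_language'.
    Returns a list of (score, uri) pairs, best match first.
    """
    def score(candidate):
        points = 0
        country = context.get("contact_country")
        if country and candidate.get("country") == country:
            points += 2  # contact information is treated as the stronger hint
        language = context.get("resource_language")
        if language and candidate.get("language") == language:
            points += 1  # the resource language is the weaker hint
        return points

    scored = [(score(c), c["uri"]) for c in candidates]
    # Sort by descending score; ties keep their original order (stable sort).
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

Candidates that end up with equal scores remain ambiguous and would have to be passed on to manual curation.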
371 | |
---|
372 | In some situations these ambiguities can be resolved algorithmically, but in many cases they will ultimately require human curation of the generated data. In this respect, it is worth noting that the CLARIN search engine VLO provides a feedback link that allows even ordinary users to report problems or inconsistencies in CMD records. |
---|
373 | |
---|
374 | |
---|
375 | %%%%%%%%%%%%%%%%%%%%% |
---|
376 | \section{SMC LOD - Semantic Web Application} |
---|
377 | \label{sec:lod} |
---|
378 | |
---|
379 | |
---|
380 | |
---|
381 | \cite{Europeana RDF Store Report} |
---|
382 | |
---|
383 | Technical aspects (RDF-store?): Virtuoso |
---|
384 | |
---|
385 | \todocode{install Jena + fuseki}\furl{http://jena.apache.org}\furl{http://jena.apache.org/documentation/serving_data/index.html}\furl{http://csarven.ca/how-to-create-a-linked-data-site} |
---|
386 | |
---|
387 | \todocode{install older python (2.5?) to be able to install dot2tex - transforming dot files to nicer pgf formatted graphs}\furl{http://dot2tex.googlecode.com/files/dot2tex-2.8.7.zip}\furl{file:/C:/Users/m/2kb/tex/dot2tex-2.8.7/} |
---|
388 | |
---|
389 | \todocode{check install siren}\furl{http://siren.sindice.com/} |
---|
390 | |
---|
391 | |
---|
392 | \todocode{check/install: raptor for generating dot out of rdf}\furl{http://librdf.org/raptor/} |
---|
393 | |
---|
394 | \todocode{check/install: Linked Data browser: LoD p. 81; Haystack}\furl{http://en.wikipedia.org/wiki/Haystack_(PIM)} |
---|
395 | |
---|
396 | / interface (ontology browser?) |
---|
397 | |
---|
398 | semantic search component in the Linked Media Framework |
---|
399 | \todocode{!!! check install LMF - kiwi - SemanticSearch !!!}\furl{http://code.google.com/p/kiwi/wiki/SemanticSearch} |
---|
400 | |
---|
401 | \todoin{check SARQ}\furl{http://github.com/castagna/SARQ} |
---|
402 | |
---|
403 | |
---|
404 | \section {Full semantic search - concept-based + ontology-driven ?} |
---|
405 | \label{semantic-search} |
---|
406 | |
---|
407 | With the new enhanced dataset, as detailed in section \ref{sec:cmd2rdf}, the groundwork is laid for the full-blown semantic search as proposed in the original goals, i.e. the possibility for ontology-driven or at least `semantic resources assisted' exploration of the dataset. |
---|
408 | |
---|
409 | The aim is to enhance the search by employing ontological resources. |
---|
410 | Primarily, this enhancement means that the user can access the data indirectly by browsing one or more ontologies with which the data will be linked. These could be, for example, ontologies of organizations and projects. |
---|
411 | |
---|
412 | |
---|
413 | SPARQL |
---|
414 | |
---|
415 | rechercheisidore, dbpedia, ... |
---|
416 | |
---|
417 | \section{Summary} |
---|
418 | In this chapter, an expression of the whole CMD data domain in RDF was proposed, with a special focus on how to translate the string values in metadata fields to corresponding semantic entities. Additionally, some technical considerations were discussed regarding exposing this dataset as Linked Open Data, along with the implications for real semantic, ontology-based data exploration. |
---|
419 | |
---|