- Timestamp:
- 09/30/13 11:54:57 (11 years ago)
- Location:
- SMC4LRT/chapters
- Files:
-
- 9 edited
SMC4LRT/chapters/Data.tex
r3553 r3638 8 8 9 9 10 \subsection{CMD-Framework} 11 10 \subsection{Component Metadata Framework} 11 \label{def:CMD} 12 13 The \emph{Component Metadata Framework} (CMD) is the data model of the CLARIN metadata infrastructure. (See \ref{CMDI} for information about the infrastructure. The XML-schema of CMD -- the general-component-schema -- is featured in appendix \ref{lst:general-component-schema}.) 14 CMD is used to define the so-called \var{profiles}, which are constructed out of reusable \var{components} -- collections of metadata fields. The components can contain other components and they can be reused in multiple profiles. A profile itself is just a special kind of component (a subclass) with some additional administrative information. 15 The actual core provision for semantic interoperability is the requirement that each CMD element (i.e. metadata field) refers ``via a PID to exactly one data category\footnote{persistently referenceable concept definition} (cf. \ref{def:DCR}), thus 16 indicating unambiguously how the content of the field in a metadata description should be interpreted'' \cite{Broeder+2010}. 17 18 While the primary registry for data categories used in CMD is the \xne{ISOcat} Data Category Registry (cf. \ref{def:DCR}), other authoritative sources are accepted (so-called ``trusted registries''), especially the set of terms maintained by the Dublin Core Metadata Initiative \cite{DCMI:2005}. 19 20 Once the profiles are defined, they are transformed into an XML Schema that prescribes the structure of the instance records. 21 The generated schema also conveys, as annotations, the information about the referenced data categories. 
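As an illustration of the last point, the following minimal Python sketch reads the data category references back out of a generated schema. It assumes that the reference is carried as a \code{dcr:datcat} attribute on the element declarations; the namespace URI, the sample schema and the function name are assumptions made for this sketch and are not taken from the actual general-component-schema.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"
# Assumed namespace URI for the dcr:datcat annotation attribute.
DCR = "http://www.isocat.org/ns/dcr"

# Hypothetical fragment of a schema generated from a CMD profile.
SAMPLE_XSD = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:dcr="http://www.isocat.org/ns/dcr">
  <xs:element name="LanguageName" type="xs:string"
              dcr:datcat="http://www.isocat.org/datcat/DC-2484"/>
  <xs:element name="Comment" type="xs:string"/>
</xs:schema>"""


def datcat_links(xsd_text):
    """Map element names to the data category they reference, if any."""
    root = ET.fromstring(xsd_text)
    links = {}
    for element in root.iter("{%s}element" % XS):
        datcat = element.get("{%s}datcat" % DCR)
        if datcat is not None:
            links[element.get("name")] = datcat
    return links


if __name__ == "__main__":
    for name, datcat in datcat_links(SAMPLE_XSD).items():
        print(name, datcat)
```

Elements without an annotation (here the hypothetical Comment field) are simply skipped, which mirrors the fact that only annotated fields take part in the semantic mapping.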
12 22 13 23 … … 100 110 12.893 & MPI CGN \\ 101 111 10.628 & Bavarian Archive for Speech Signals (BAS) \\ 102 7.964 & Pacific And Regional Archive for Digital Sources in Endangered Cultures (PARADISEC)\\112 7.964 & Pacific And Regional Archive for Digital Sources in Endangered Cultures\\ 103 113 7.348 & WALS RefDB \\ 104 114 5.689 & Lund Corpora \\ -
SMC4LRT/chapters/Design_SMCinstance.tex
r3553 r3638 1 \chapter{ System design - mapping on instance level}1 \chapter{Mapping on instance level, CMD as LOD} 2 2 \label{ch:design-instance} 3 3 4 \begin{quotation} 4 I do think that ISOcat, CLAVAS, RELcat , anactual language5 I do think that ISOcat, CLAVAS, RELcat and actual language 5 6 resource all provide a part of the semantic network. 6 7 … … 11 12 relevant parts in a triple store and do your SPARQL/reasoning on it. Well 12 13 that's where I'm ultimately heading with all these registries related to 13 semantic interoperability ... I hope ;-) 14 semantic interoperability ... I hope ;-)\cite{Menzo2013mail} 14 15 \end{quotation} 15 \cite{Menzo2013mail} 16 17 18 Linked Data - Express dataset in RDF 19 20 21 Partly as by-product of the entities-mapping effort we will get the metadata rendered in RDF, linked with 22 So theoretically we then only need to provide them ``on the web'', to make them a nucleus of the LinkedData-Cloud. 23 24 25 Technical aspects (RDF-store?) / interface (ontology browser?) 26 27 \todocode{check/install: raptor for generating dot out of rdf}\furl{http://librdf.org/raptor/} 28 29 \todocode{check/install: Linked Data browser: LoD p. 81; Haystack}\furl{http://en.wikipedia.org/wiki/Haystack_(PIM)} 30 31 defining the Mapping: 32 \begin{enumerate} 33 \item convert to RDF 34 translate: MDRecord $\rightarrow$ [\#mdrecord \#property literal] 35 \item map: \#mdrecord \#property literal $\rightarrow$ [\#mdrecord \#property \#entity] 36 \end{enumerate} 37 38 \begin{figure*}[!ht] 39 \includegraphics[width=1\textwidth]{images/SMC_CMD2LOD} 40 \caption{The process of transforming the CMD metadata records to and RDF representation} 41 \label{fig:smc_cmd2lod} 42 \end{figure*} 43 16 17 As described in previous chapters (\ref{ch:infrastructure},\ref{ch:design_schema}), semantic interoperability is one of the main motivations for the CMD infrastructure. 
However, this machinery pertains mostly to the schema level; the actual values in the fields of CMD instances remain ``just strings''. This is the case even though the problem of different labels for semantically equivalent or even identical entities is even more virulent on the instance level. While for a number of metadata fields the value domain can be enforced through schema validation, some important fields (like \concept{organization} or \concept{resource type}) have a constrained value domain that nevertheless cannot be exhaustively enumerated. This leads to a chronically inconsistent use of labels for referring to entities (as the instance data shows, some organizations are referred to by more than 20 different labels or spelling variants), prompting an urgent need for better means of harmonizing the constrained-field values. 18 19 One potential remedy is the use of reference datasets -- controlled vocabularies, taxonomies, ontologies and such. In fact, this is a very common approach, be it the authority files in the library world, or domain-specific reference vocabularies maintained by practically every research community. Not as strict as schema definitions, they cannot be used for validation, but they still help to harmonize the data by offering preferred labels and identifiers for entities. 20 21 In this chapter, we explore how this general approach can be employed for our specific problem of harmonizing the (literal) values in selected instance fields and mapping them to entities defined in corresponding vocabularies. This proposal is furthermore embedded in a more general effort to \textbf{express the whole of the CMD data domain (model and instances) in RDF} constituting one large ontology interlinked with existing external semantic resources (ontologies, knowledge bases, vocabularies). 
This result lays a foundation for providing the original dataset as a \emph{Linked Open Data} nucleus within the \emph{Web of Data}\cite{TimBL2006} 22 as well as for real semantic (ontology-driven) search and exploration of the data. 23 24 The following section \ref{sec:cmd2rdf} lays out how individual parts of the CMD framework can be expressed in RDF. 25 In \ref{sec:values2entities} we investigate in further detail the abovementioned critical aspect of the effort, namely the task of translating the string values in metadata fields to corresponding semantic entities. Finally, the technical aspects of providing the resulting ontology as LOD and the implications for an ontology-driven semantic search are tackled briefly in \ref{sec:lod} and \ref{semantic-search} respectively. 44 26 45 27 \section{CMD to RDF} 46 \label{ch:cmd2rdf} 47 48 A few modules/components of the CMD infrastructure are dedicated to semantic interoperability. The DCR as global registry for concepts, CLAVAS for maintaining controlled vocabularies in SKOS format, RR for expressing arbitrary relations between concepts. 49 However, the actual values in the CMD instances are ``just strings'' and for the most part cannot be validated by the schema, although they often could be mapped to a corresponding controlled vocabulary. 50 51 Thus one aim of this work is to express the whole of the CMD data (model and instances) in RDF. 
This would allow to map the string values in selected fields to semantic entities, which in turn would allow real semantic (ontology-driven) search and bring about a linking with the web of data \todocite{Web of Data, TimBL} 52 53 The following chapter lays out, how individual parts of the CMD framework can be expressed in RDF 28 \label{sec:cmd2rdf} 29 In this section, RDF encoding is proposed for all levels of the CMD data domain: 30 31 \begin{itemize} 32 \item CMD meta model 33 \item profile definitions 34 \item the administrative and structural information of CMD records 35 \item individual values in the fields of the CMD records 36 \end{itemize} 54 37 55 38 \subsection{CMD specification} 56 The meta model 39 40 The main entity of the meta model is the CMD component, which is typed as a specialization of \code{owl:Class}. A CMD profile is basically a CMD component with some extra features, implying a specialization relation: 57 41 58 42 \label{table:rdf-spec} 59 \begin{example} 60 cmd\_spec:Profile & subClassOf & owl:Class. \\ 61 cmd\_spec:Component & subClassOf & owl:Class. \\ 62 cmd\_spec:Element & subClassOf & rdf:Property. \\ 63 \end{example} 64 65 Typing the profiles, components and elements: 43 \begin{example3} 44 cmds:Component & subClassOf & owl:Class. \\ 45 cmds:Profile & subClassOf & cmds:Component. \\ 46 cmds:Element & subClassOf & rdf:Property. \\ 47 \end{example3} 48 49 \noindent 50 These entities are used for typing the actual profiles, components and elements (as they are defined in the Component Registry): 66 51 67 52 \label{table:rdf-cmd} 68 \begin{example }69 cmd:collection & a & cmd \_spec:Profile; \\70 & rdfs:label & `collection'; \\53 \begin{example3} 54 cmd:collection & a & cmds:Profile; \\ 55 & rdfs:label & "collection"; \\ 71 56 & dcterms:identifier & cr:clarin.eu:cr1:p\_1345561703620. \\ 72 cmd:Actor & a & cmd \_spec:Component. \\73 cmd:LanguageName & a & cmd \_spec:Element. \\74 \end{example }57 cmd:Actor & a & cmds:Component. 
\\ 58 cmd:LanguageName & a & cmds:Element. \\ 59 \end{example3} 75 60 76 61 \begin{note} 77 Should the ID assigned in the component registry for the CMD entities used as ID in rdf, or rather the verbose name? (if yes, how to ensure uniqueness -- generate the name from the cmd-path?)62 Should the ID assigned in the Component Registry for the CMD entities be used as identifier in RDF, or rather the verbose name? (if yes, how to ensure uniqueness -- generate the name from the cmd-path?) 78 63 \end{note} 79 64 80 65 \subsection{Data Categories} 81 Windhouwer (2012) proposes to use the data categories as annotation properties. 82 Definition of the annotation property \code{dcr:datcat} 83 84 \begin{example} 66 Windhouwer \cite{Windhouwer2012_LDL} proposes to use the data categories as annotation properties: 67 68 \begin{example3} 85 69 dcr:datcat & a & owl:AnnotationProperty ; \\ 86 70 & rdfs:label & "data category"@en ; \\ 87 & rdfs:comment & "This resource is equivalent to \\ 88 this data category."@en ; \\ 89 & skos:note & "The data category should be \\ 90 & & identified by its PID."@en ; \\ 91 \end{example} 92 93 Still, leaving open the possibility for ``a stronger semantic link'' : 94 \begin{quotation} 95 By default the RDF export inserts \code{dcr:datcat} annotation properties to maintain the link between the generated RDF resources and the used Data Categories. However, it is possible to also maintain a stronger semantic link when the RDF resources will be used as OWL (2) classes, properties or individuals. 96 \end{quotation} 97 98 For classes the OWL 2 \code{owl:equivalentClass} can be used, for example: 99 100 \begin{example} 101 \#myPOS & owl:equivalentClass & isocat:DC-1345. \\ 102 \end{example} 103 104 For properties OWL 2 provides \code{owl:equivalentProperty}, for example: 105 106 \begin{example} 107 \#myPOS & owl:equivalentProperty & isocat:DC-1345. 
\\ 108 \end{example} 109 110 Finally \code{owl:sameAs} can be used for individuals, for example: 111 112 \begin{example} 113 \#myNoun & owl:sameAs & isocat:DC-1333. \\ 114 \end{example} 115 116 117 ISOcat provides a RDF representation of the data categories : 118 119 \begin{example} 71 & rdfs:comment & "This resource is equivalent to this data category."@en ; \\ 72 & skos:note & "The data category should be identified by its PID."@en ; \\ 73 \end{example3} 74 75 That implies that the \code{@ConceptLink} attribute on CMD elements and components, as used in the CMD profiles to reference the data category, would be modelled as: 76 77 \begin{example3} 78 cmd:LanguageName & dcr:datcat & isocat:DC-2484. \\ 79 \end{example3} 80 81 Encoding data categories as annotation properties contrasts with the common approach seen with dublincore terms, 82 which are usually used directly as data properties: 83 84 \begin{example3} 85 <lr1> & dc:title & "Language Resource 1" 86 \end{example3} 87 88 \noindent 89 Analogously, we could model \xne{ISOcat} data categories as data properties, i.e. metadata elements referencing ISOcat data categories could be encoded as follows: 90 91 \begin{example3} 92 <lr1> & isocat:DC-2502 & "19th century" 93 \end{example3} 94 95 \noindent 96 However, Windhouwer\cite{Windhouwer2012_LDL} argues against direct mapping of complex data categories to data properties and in favour of modelling data categories as annotation properties, so as to avoid too strong semantic implications. 97 98 This raises the converse question of whether to handle all data categories uniformly, which would mean encoding dublincore terms as annotation properties as well. The pragmatic view, however, is to encode the data in line with the prevailing approach, i.e. to express dublincore terms directly as data properties. 
99 100 101 \noindent 102 The REST web service of \xne{ISOcat} provides an RDF representation of the data categories: 103 104 \begin{example3} 120 105 isocat:languageName & dcr:datcat & isocat:DC-2484; \\ 121 106 & rdfs:label & "language name"@en; \\ 122 107 & rdfs:comment & "A human understandable..."@en; \\ 123 108 & \ldots \\ 124 \end{example} 109 \end{example3} 110 111 However, this is only meant as a template, as is stated in the explanatory comment of the exported data: 112 113 \begin{quotation} 114 By default the RDF export inserts \code{dcr:datcat} annotation properties to maintain the link between the generated RDF resources and the used Data Categories. However, it is possible to also maintain a stronger semantic link when the RDF resources will be used as OWL (2) classes, properties or individuals. 115 \end{quotation} 116 117 So in a specific (OWL 2) application the relation with the data categories can be expressed as \code{owl:equivalentClass} for classes, \code{owl:equivalentProperty} for properties or \code{owl:sameAs} for individuals: 118 119 \begin{example3} 120 \#myPOS & owl:equivalentClass & isocat:DC-1345. \\ 121 \#myPOS & owl:equivalentProperty & isocat:DC-1345. \\ 122 \#myNoun & owl:sameAs & isocat:DC-1333. \\ 123 \end{example3} 124 125 126 \subsection{RELcat - Ontological relations} 127 As described in \ref{def:rr}, relations between data categories are not stored directly in the \xne{ISOcat} DCR, but rather in a dedicated module, the Relation Registry \xne{RELcat}. The relations here are grouped into relation sets and stored as RDF triples\cite{SchuurmanWindhouwer2011}. 
A sample relation from the \xne{CMDI} relation set expressing a number of equivalences between \xne{ISOcat} data categories and \xne{dublincore} terms: 128 129 \begin{example3} 130 isocat:DC-2538 & rel:sameAs & dct:date 131 \end{example3} 132 133 \noindent 134 By design, the relations in Relation Registry are not expressed with predicates from known vocabularies like \xne{SKOS} or \xne{OWL}, again with the aim to avoid too strong semantic implications. This leaves leeway for further specialization of the relations in specific applications. 125 135 126 136 \begin{note} 127 Output from isocat is only meant as template! 128 129 In the RDF representation, the data categories seem to be referenced by their mnemonicIdentifier (rdf:ID=âlanguageNameâ) how is this guaranteed URI and how is the data category meant to be referred to? 137 Does this mean, that I would say: 138 \begin{example3} 139 rel:sameAs & owl:equivalentProperty & owl:sameAs 140 \end{example3} 141 142 to enable the inference of the equivalences? 143 144 Is this correct: 130 145 \end{note} 131 132 Finally, the ConceptLink attribute used in the CMD profiles to reference the data category is modelled as: 133 134 \begin{example} 135 cmd:LanguageName & dcr:datcat & isocat:DC-248. \\ 136 \end{example} 137 146 ?? That means, that to be able to infer that a value in a CMD element also pertains to a given data category, e.g.: 147 148 \begin{example2} 149 cmd:PublicationYear = 2012 $\rightarrow$ & dc:created = 2012 150 \end{example2} 151 152 \noindent 153 following facts need to be present in the ontology : 154 155 \begin{example3} 156 <lr1> & cmd:PublicationYear & 2012\^{}\^{}xs:year \\ 157 cmd:PublicationYear & owl:equivalentProperty & isocat:DC-2538 \\ 158 isocat:DC-2538 & rel:sameAs & dc:created \\ 159 rel:sameAs & owl:equivalentProperty & owl:sameAs \\ 160 $\rightarrow$ \\ 161 <lr1> & dc:created & 2012\^{}\^{}xs:year \\ 162 \end{example3} 163 164 \noindent 165 What about other relations we may want to express? 
(Do we need them and if yes, where to put them? -- still in RR?) Examples: 166 167 \begin{example3} 168 cmd:MDCreator & rdfs:subClassOf & dcterms:Agent \\ 169 clavas:Organization & rdfs:subClassOf & dcterms:Agent \\ 170 <org1> & a & clavas:Organization \\ 171 \end{example3} 138 172 139 173 \subsection{CMD instances} 140 174 In the next step, we want to express the individual CMD instances, the metadata records, making use of the previously defined entities on the schema level, but also entities from external ontologies. 141 175 142 176 \subsubsection {Resource Identifier} 143 177 144 It seems natural to use the PID of a Language Resource ( \code{<lr1>} ) as the resource identifier for the subject in the RDF representation. While this seems semantically sound, not every resource has to have a PID . Alternatively we could use the PID of the MD record ( \code{<lr1.cmd>} from \code{<cmd:MdSelfLink>})as the resource identifier.145 The relationship between the resource and the metadata record could be expressed as an annotation:146 147 \begin{example }178 It seems natural to use the PID of a Language Resource ( \code{<lr1>} ) as the resource identifier for the subject in the RDF representation. While this seems semantically sound, not every resource has to have a PID. (This is especially the case for ``virtual'' resources like collections that are defined solely by their constituents and have no data of their own.) As a fall-back, the PID of the MD record ( \code{<lr1.cmd>} from the \code{cmd:MdSelfLink} element) could be used as the resource identifier. 
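The fall-back rule for choosing the subject identifier can be sketched as a small Python function. The function name, the parameter names and the handle-style identifiers used in the usage lines are hypothetical and serve only to illustrate the rule; how the two PIDs are obtained from a record is left open here.

```python
def subject_identifier(resource_pid, md_self_link):
    """Pick the identifier used as RDF subject for a described resource.

    Prefers the PID of the resource itself and falls back to the PID of
    the metadata record (the value of cmd:MdSelfLink) when the resource
    has no PID of its own, e.g. for "virtual" collections.
    """
    if resource_pid:
        return resource_pid
    if md_self_link:
        return md_self_link
    raise ValueError("neither a resource PID nor an MdSelfLink is available")


if __name__ == "__main__":
    # Hypothetical identifiers, for illustration only.
    print(subject_identifier("hdl:0000/lr1", "hdl:0000/lr1.cmd"))
    print(subject_identifier(None, "hdl:0000/lr1.cmd"))
```

Raising an error when both identifiers are missing is a design choice of this sketch; an implementation might instead mint a local identifier or a blank node.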
179 If identifiers are present for both resource and metadata, the relationship between the resource and the metadata record can be expressed as an annotation using the \xne{OpenAnnotation} vocabulary\furl{http://openannotation.org/spec/core/core.html\#Motivations}: 180 181 \begin{example3} 148 182 \_:anno1 & a & oa:Annotation; \\ 149 183 & oa:hasTarget & <lr1>; \\ 150 184 & oa:hasBody & <lr1.cmd>; \\ 151 185 & oa:motivatedBy & oa:describing \\ 152 \end{example} 153 154 \subsection{Provenance} 155 156 Use the information from CMD-Header for information about the modelled data : 157 158 \begin{example} 159 <lr1.cmd> 160 & dcterms:identifier & <lr1.cmd>; \\ 161 & dcterms:creator ?? & "\{<cmd:MdCreator>\}"; \\ 162 \end{example} 163 164 Other proposed fields: 165 166 \begin{example} 167 & dcterms:publisher & <http://clarin.eu>, \\ 168 & <provider-oai-accesspoint>; ?? \\ 169 & dcterms:created/modified "\{<cmd:MdCreated>\}" ?? \\ 170 \end{example} 186 \end{example3} 187 188 \subsubsection{Provenance} 189 190 The information from \code{cmd:Header} represents the provenance information about the modelled data: 191 192 \begin{example3} 193 <lr1.cmd> & dcterms:identifier & <lr1.cmd>; \\ 194 & dcterms:creator ?? & "\var{\{cmd:MdCreator\}}"; \\ 195 & dcterms:publisher & <http://clarin.eu>, <provider-oai-accesspoint>; ?? \\ 196 & dcterms:created /dcterms:modified? & "\var{\{cmd:MdCreated\}}" ?? \\ 197 \end{example3} 171 198 172 199 \subsubsection{Hierarchy ( Resource Proxy -- IsPartOf)} 173 In CMD, <cmd:ResourceProxyList> is used to express both collection hierarchy and point to resource(s) described by the MD record. This can be modeled as OAI-ORE Aggregation\furl{http://www.openarchives.org/ore/1.0/primer\#Foundations} 174 \furl{http://openannotation.org/spec/core/core.html\#Motivations} 200 In CMD, the \code{cmd:ResourceProxyList} structure is used both to express the collection hierarchy and to point to the resource(s) described by the MD record. 
This can be modelled as \xne{OAI-ORE Aggregation}\furl{http://www.openarchives.org/ore/1.0/primer\#Foundations} 175 201 : 176 202 177 \begin{example }203 \begin{example3} 178 204 <lr0.cmd> & a & ore:ResourceMap \\ 179 205 <lr0.cmd> & ore:describes & <lr0.agg> \\ 180 206 <lr0.agg> & a & ore:Aggregation \\ 181 ore:aggregates & <lr1.cmd>, <lr2.cmd>; \\182 \end{example }183 184 207 & ore:aggregates & <lr1.cmd>, <lr2.cmd>; \\ 208 \end{example3} 209 210 \noindent 185 211 ?? Should both collection hierarchy and resource-pointers (collection and resource MD records) be encoded as ore:Aggregation? 186 Additionally the flat header field <cmd:MdCollectionDisplayName>has been introduced to indicate by simple means the collection, of which given resource is part.187 This information can be used to generate a separate one-level grouping of the resources, in which the value from the <cmd:MdCollectionDisplayName> element would be used as the label of an otherwise undefined ore:ResourceMap.212 Additionally, the flat header field \code{cmd:MdCollectionDisplayName} has been introduced to indicate by simple means the collection of which the given resource is part. 213 This information can be used to generate a separate one-level grouping of the resources, in which the value from the \code{cmd:MdCollectionDisplayName} element would be used as the label of an otherwise undefined \code{ore:ResourceMap}. 188 214 Even the identifier/URI for these collections is not clear. Although these collections should match the ResourceProxy hierarchy, there is no guarantee of this, so a 1:1 mapping cannot be expected. 215 189 216 \todocode{check consistency for MdCollectionDisplayName vs. 
IsPartOf in the instance data} 190 217 191 \begin{example }218 \begin{example3} 192 219 \_:mdcoll & a & ore:ResourceMap; \\ 193 220 & rdfs:label & "Collection 1"; \\ 194 \_:mdcoll\#aggreg ation& a & ore:Aggregation \\221 \_:mdcoll\#aggreg & a & ore:Aggregation \\ 195 222 & ore:aggregates & <lr1.cmd>, <lr2.cmd>; \\ 196 \end{example} 197 223 \end{example3} 198 224 199 225 \subsubsection{Components -- nested structures} 200 226 201 \begin{note} 202 ?? Model (instance) components as blank nodes via objectProperty: 203 \end{note} 204 205 \begin{example} 227 There are two variants for expressing the tree structure of the CMD records, i.e. the containment relation between the components: 228 229 \begin{enumerate}[a)] 230 \item the components are encoded as object properties 231 232 \begin{example3} 206 233 <lr1> & cmd:Actor & \_:Actor1 \\ 207 234 <lr1> & cmd:Actor & \_:Actor2 \\ … … 210 237 \_:Actor1 & cmd:role & "Interviewer" \\ 211 238 \_:Actor2 & cmd:role & "Speaker" \\ 212 \end{example} 213 214 ?? or rather as Classes (and express the containement hierarchy with some extra predicate): 215 \begin{example} 239 \end{example3} 240 241 \item the components are typed as classes and a dedicated object property expresses the containment 242 243 \begin{example3} 216 244 \_:Actor1 & a & cmd:Actor \\ 217 245 <lr1> & cmd:contains & \_:Actor1 \\ 218 \end{example} 219 220 \subsubsection{Elements, Fields, Values} 221 222 There are two steps to the modeling of the actual values in the fields of CMD records in RDF. The first one is to express the values as triples with literal values, then for selected fields -- using the literal values -- try to find corresponding entities in appropriate controlled vocabularies and generate new triples. 223 There seems to need to be a separate property (predicate) for fields that are mapped to entities, like: 224 225 \begin{example} 226 <lr1> & cmd:Organisation & "MPI" \\ 227 <lr1> & cmd:Organisation\_? 
& <org1> \\ 228 \end{example} 229 230 %\subsubsection{Literal Values} 231 \paragraph{Literal Values} 232 233 Usually, RDF-mapping of dublincore descriptions is to data properties (cf. OLAC-DcmiTerms profile ) 234 235 \begin{example} 236 <lr1> & dct:title & "Language Resource 1" 237 \end{example} 238 239 Analogously, we could model isocat data categories as data properties . Metadata elements referencing ISOcat datacategories could be encoded as follows: 240 241 \begin{example} 242 <lr1> & isocat:DC-2502 & "19th century" 243 \end{example} 244 245 However, Windhouwer (2012) argues against direct mapping of complex data categories to data properties, but proposes to rather model data categories as annotation properties. 246 247 \begin{example} 248 cmd:timeCoverage & a & cmd\_spec:Element \\ 246 \end{example3} 247 248 \end{enumerate} 249 250 \subsection{Elements, Fields, Values} 251 Finally, we want to integrate also the actual field values in the CMD records into the ontology. 252 253 \subsubsection{Predicates} 254 As explained before CMD elements are typed as \code{rdf:Property} with the corresponding data category expressed as annotation property: 255 256 \begin{example3} 257 cmd:timeCoverage & a & cmds:Element \\ 249 258 cmd:timeCoverage & dcr:datcat & isocat:DC-2502 \\ 250 259 <lr1> & cmd:timeCoverage & "19th century" \\ 251 ... 252 \end{example} 253 254 This raises the vice-versa question, whether to rather handle all data categories uniformly, thus encoding dublincore terms also as annotation properties. 255 256 %\subsubsection{Mapping to entities â Vocabularies â CLAVAS} 257 \paragraph{Mapping to entities â Vocabularies â CLAVAS} 258 259 A major (if not the main) motivation for the CMD to RDF mapping is the wish to have better control over and better quality of values in metadata fields with constrained value domain like organization or resource type. 
As the allowed values for these fields often cannot be explicitly enumerated, it is not possible to restrict them by means of an XML schema. This leads to inconsistent use of labels for referring to entities. (As the instance data shows, some organizations are referred to by more than 20 different labels.) 260 Thus, one goal of this work is to map (string) values in selected fields to entities defined in corresponding vocabularies. The main provider of relevant vocabularies is ISOcat and CLAVAS â a service for managing and providing vocabularies in SKOS format. Closed and corresponding simple data categories are already being exported from ISOcat in SKOS format and imported into CLAVAS/OpenSKOS and also other relevant vocabularies shall be ingested into this system, so that for our purposes we can assume OpenSKOS as the one source of vocabularies. 261 Data in OpenSKOS is modelled purely in SKOS, so there is no more specific typing of the entities in the vocabularies, but rather all the entities are \xne{skos:Concepts}: 262 263 \begin{example} 260 261 \end{example3} 262 263 \subsubsection{Literal values -- data properties} 264 265 To generate triples with literal values is straightforward: 266 267 \begin{definition}{Literal triples} 268 lr:Resource \ \quad cmds:Property \ \quad xsd:string 269 \end{definition} 270 271 \begin{example3} 272 <lr1> & cmd:Organisation & "MPI" \\ 273 \end{example3} 274 275 \subsubsection{Mapping to entities -- object properties} 276 277 The more challenging but also more valuable aspect is to generate objectProperty triples with the literal values mapped to semantic entities: 278 279 \begin{definition}{new RDF triples} 280 lr:Resource \ \quad cmd:Property \ \quad xsd:anyURI 281 \end{definition} 282 283 \begin{example3} 284 <lr1> & cmd:Organisation\_? & <org1> \\ 285 \end{example3} 286 287 \begin{note} 288 Don't we need a separate property (predicate) for the triples with object properties pointing to entities, 289 i.e. 
\code{cmd:Organisation\_} in addition to \code{cmd:Organisation}? 290 \end{note} 291 292 The mapping process is detailed in \ref{sec:values2entities}. 293 294 %%%%%%%%%%%%%%%%%%% 295 \section{Mapping field values to semantic entities} 296 \label{sec:values2entities} 297 298 This task is a prerequisite for expressing the CMD instance data in RDF as well. The main idea is to find entities in selected reference datasets (controlled vocabularies, ontologies) matching the literal values in the metadata records. The obtained entity identifiers are further used to generate new RDF triples. It involves the following steps: 299 300 \begin{enumerate} 301 \item identify appropriate controlled vocabularies for individual metadata fields or data categories (manual task) 302 \item extract \emph{distinct data category, value pairs} from the metadata records 303 \item actual \textbf{lookup} of the individual literal values in the given reference data (as indicated by the data category) to retrieve candidate entities, concepts 304 \item assess the reliability of the match 305 \item generate new RDF triples with entity identifiers as object properties 306 \end{enumerate} 307 308 \begin{figure*}[!ht] 309 \includegraphics[width=1\textwidth]{images/SMC_CMD2LOD} 310 \caption{Sketch of the process of transforming the CMD metadata records to an RDF representation} 311 \label{fig:smc_cmd2lod} 312 \end{figure*} 313 314 \subsubsection{Identify vocabularies -- CLAVAS} 315 316 \todoin{Identify related ontologies, vocabularies? - see DARIAH:CV} 317 LT-World \cite{Joerg2010} 318 319 One generic way to indicate vocabularies for given metadata fields or data categories, currently discussed in the CMD community, is to use a dedicated annotation property (tentatively \code{@clavas:vocabulary}) in the schema or data category definition. For such a mechanism to work, the consuming applications (like a metadata editor) need to be made aware of this convention and interpret it accordingly. 
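The second step of the process listed above, the extraction of distinct data category, value pairs, can be sketched in Python as follows. The in-memory representation of a record as a list of pairs, the function name and the sample values are assumptions made for this illustration; in practice the pairs would be read from the CMD records together with the data category annotations of the corresponding profiles.

```python
from collections import Counter

# Hypothetical records: each record is a list of (data category, value) pairs.
SAMPLE_RECORDS = [
    [("organization", "MPI"), ("resource type", "corpus")],
    [("organization", "MPI "), ("resource type", "lexicon")],
    [("organization", "Max Planck Institute for Psycholinguistics")],
]


def distinct_pairs(records):
    """Count how often each distinct (data category, value) pair occurs.

    Only surrounding whitespace is trimmed here; any further normalization
    is deliberately left to the lookup step.
    """
    counts = Counter()
    for record in records:
        for data_category, value in record:
            value = value.strip()
            if value:
                counts[(data_category, value)] += 1
    return counts


if __name__ == "__main__":
    for pair, count in distinct_pairs(SAMPLE_RECORDS).most_common():
        print(count, pair)
```

Keeping the frequency of each pair is a choice of this sketch: it allows the subsequent lookup and curation effort to be directed at the most frequent values first.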
320 321 The primary providers of relevant vocabularies are \xne{ISOcat} and \xne{CLAVAS} -- a service for managing and providing vocabularies in SKOS format (cf. \ref{def:CLAVAS}). Closed and corresponding simple data categories are already being exported from ISOcat in SKOS format and imported into CLAVAS/OpenSKOS and also other relevant vocabularies shall be ingested into this system, so that we can assume OpenSKOS as a first source of vocabularies. However, certainly not all of the existing reference data will be hosted by OpenSKOS, so in general we have to consider a number of different sources (cf. \ref{refdata}). 322 323 Data in OpenSKOS is modelled purely in SKOS, so there is no more specific typing of the entities in the vocabularies, but rather all the entities are typed as \code{skos:Concept}: 324 325 \begin{example3} 264 326 <org1> & a & skos:Concept \\ 265 \end{example} 266 267 We may want to add some more typing and introduce classes for entities from individual vocabularies like clavas:Organization or similar. 268 As far as CLAVAS will also maintain mappings/links to other datasets: 269 270 \begin{example} 271 <org1> skos:exactMatch <dbpedia/org1>, <lt-world/orgx>; 272 \end{example} 273 327 \end{example3} 328 329 \noindent 330 We may want to add some more typing and introduce classes for entities from individual vocabularies like \code{clavas:Organization} or similar. Insofar as CLAVAS also maintains mappings/links to other datasets 331 332 \begin{example3} 333 <org1> & skos:exactMatch & <dbpedia/org1>, <lt-world/orgx>; 334 \end{example3} 335 336 \noindent 274 337 we could use it to expand the data with alternative identifiers, fostering the interlinking of data: 275 338 276 \begin{example} 277 <org1> dcterms:identifier <org1>, <dbpedia/org1>, <lt-world/orgx>; 278 \end{example} 279 280 281 282 \paragraph{Mapping from strings to Entities} 283 284 Find matching entities in selected Ontologies based on the textual values in the metadata records. 
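A minimal Python sketch of such a matching of textual values against vocabulary entries is given here. It works on small in-memory vocabularies and uses a simple string normalization (case folding, removal of punctuation, collapsing of whitespace). The concept identifiers, labels and function names are hypothetical; the sketch does not use the OpenSKOS interface or any other existing service.

```python
import re
import unicodedata


def normalize(label):
    """Case-fold, replace punctuation by spaces and collapse whitespace."""
    text = unicodedata.normalize("NFKC", label).casefold()
    text = re.sub(r"[^\w\s]", " ", text)
    return " ".join(text.split())


def build_index(vocabulary):
    """Index a vocabulary {concept id: [labels]} by normalized label."""
    index = {}
    for concept, labels in vocabulary.items():
        for label in labels:
            index.setdefault(normalize(label), set()).add(concept)
    return index


def lookup(datasets, data_category, literal):
    """Return the sorted candidate concepts for a literal value.

    `datasets` maps a data category to the index of its vocabulary;
    an unknown category or value yields an empty list.
    """
    index = datasets.get(data_category, {})
    return sorted(index.get(normalize(literal), ()))


# Hypothetical vocabulary with two organizations sharing one label.
ORGANIZATIONS = {
    "org:aos-a": ["Academy of Sciences"],
    "org:aos-b": ["Academy of Sciences", "AoS"],
    "org:mpi": ["MPI", "Max Planck Institute for Psycholinguistics"],
}
DATASETS = {"organization": build_index(ORGANIZATIONS)}

if __name__ == "__main__":
    print(lookup(DATASETS, "organization", " academy of Sciences. "))
```

The lookup returns all candidates sharing the normalized label rather than a single best match, so ambiguous values such as the one in the usage line remain visible for a later evaluation step.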
285 286 287 Identify related ontologies: 288 LT-World \cite{Joerg2010} 289 290 task: 291 \begin{enumerate} 292 \item express MDRecords in RDF 293 \item identify related ontologies/vocabularies (category $\rightarrow$ vocabulary) 294 \item use a lookup/mapping function (Vocabulary Alignment Service? CATCH-PLUS?) 295 296 %\fbox{ function lookup: Category x String -> ConceptualDomain} 297 \begin{eqnarray*} 298 lookup(Category, Literal) \rightarrow ConceptualDomain?? 299 \end{eqnarray*} 300 301 302 Normally this would be served by dedicated controlled vocabularies, but expect also some string-normalizing preprocessing etc. 303 \end{enumerate} 304 305 306 307 \subsection{RELcat - Ontological relations} 308 Information in RELcat is already stored in RDF \cite{SchuurmanWindhouwer2011}. One relation from the example relation set for CMDI : 309 310 \begin{example} 311 isocat:DC-2538 rel:sameAs dct:date 312 \end{example} 313 314 Should we generate the redundant triples based on the relations defined between data categories? I.e. if there is a relation and a resource has value: 315 316 \begin{example} 317 <lr1> isocat:DC-2538 2012^^xs:year 318 \end{example} 319 320 should we generate 321 322 \begin{example} 323 <lr1> dct:date 2012^^xs:year 324 \end{example} 325 326 ? 327 328 What about other relations we may want to express? (Do we need them and if yes, where to put them? -- still in RR?) Examples: 329 330 \begin{example} 331 cmd:MDCreator & rdfs:subClassOf & dcterms:Agent \\ 332 clavas:Organization & rdfs:subClassOf & dcterms:Agent \\ 333 <org1> & a & clavas:Organization \\ 334 \end{example} 335 336 337 339 \begin{example3} 340 <org1> & dcterms:identifier & <org1>, <dbpedia/org1>, <lt-world/orgx>; 341 \end{example3} 342 343 \subsubsection{Lookup} 344 345 In abstract terms, the lookup function takes as input the identifier of a data category (or CMD element) and a literal string value and returns a list of potentially matching entities. 
Before actual lookup, there may have to be some string-normalizing preprocessing. 346 347 \begin{definition}{signature of the lookup function} 348 lookup \ ( \ DataCategory \ , \ Literal \ ) \quad \mapsto \quad ( \ Concept \ | \ Entity \ )* 349 \end{definition} 350 351 In the implementation, there needs to be additional initial configuration input, identifying datasets for given data categories, 352 which will be the result of the previous step. 353 354 \begin{definition}{Required configuration data indicating data category to available } 355 DataCategory \quad \mapsto \quad Dataset+ 356 \end{definition} 357 358 As for the implementation, in the initial setup the system could resort to the \code{find}-interface provided by \xne{OpenSKOS}. 359 However, in the long term a more general solution is required, a kind of hybrid \emph{vocabulary proxy service} that allows to search in a number of datasets, many of them distributed and available via different interfaces. Figure \ref{fig:vocabulary_proxy} sketches the general setup. The service has to be able to a) proxy search requests to a number of search interfaces (SRU, SPARQL), b) fetch, cache and search in datasets. 360 361 \begin{figure*}[!ht] 362 \includegraphics[width=1\textwidth]{images/VocabularyProxy_clientapp} 363 \caption{Sketch of a general setup for vocabulary lookup via a \xne{VocabularyProxy} service} 364 \label{fig:vocabulary_proxy} 365 \end{figure*} 366 367 \subsubsection{Candidate evaluation} 368 The lookup is the most sensitive step in the process, as that is the gate between strings and semantic entities. In general, the resulting candidates cannot be seen as reliable and should undergo further scrutiny to ensure that the match is semantically correct. 369 370 One example: A lookup with the pair \code{<organization, "Academy of sciences">} would probably return a list of organizations, as there is a national Academy of Sciences, in a number of countries. It would require further heuristics, e.g. 
checking the corresponding department, contact or -- less reliably -- the language of the described resource, to determine which specific Academy of Sciences is meant in a given resource description. 371 372 In some situations these ambiguities can be resolved algorithmically, but in the end in many cases it will require human curation of the generated data. In this respect, it is worth noting that the CLARIN search engine VLO provides a feedback link that allows even the normal user to report on problems or inconsistencies in CMD records. 373 374 375 %%%%%%%%%%%%%%%%%%%%% 338 376 \section{SMC LOD - Semantic Web Application} 377 \label{sec:lod} 339 378 340 379 \todoin{read: Europeana RDF Store Report} 341 380 381 Technical aspects (RDF-store?): Virtuoso 382 342 383 \todocode{install Jena + fuseki}\furl{http://jena.apache.org}\furl{http://jena.apache.org/documentation/serving_data/index.html}\furl{http://csarven.ca/how-to-create-a-linked-data-site} 343 384 … … 345 386 346 387 \todocode{check install siren}\furl{http://siren.sindice.com/} 388 389 390 \todocode{check/install: raptor for generating dot out of rdf}\furl{http://librdf.org/raptor/} 391 392 \todocode{check/install: Linked Data browser: LoD p. 81; Haystack}\furl{http://en.wikipedia.org/wiki/Haystack_(PIM)} 393 394 / interface (ontology browser?) 347 395 348 396 semantic search component in the Linked Media Framework … … 353 401 354 402 \section {Full semantic search - concept-based + ontology-driven ?} 403 \label{semantic-search} 355 404 356 405 With the new enhanced dataset, as detailed in section \ref{ch:cmd2rdf}, the groundwork is laid for the full-blown semantic search as proposed in the original goals, i.e. the possibility for ontology-driven or at least `semantic resources assisted' exploration of the dataset. 357 406 358 407 Namely to enhance it by employing ontological resources.
359 Mainly this enhancement shall mean, that the user can access the data indirectly by browsing one or multiple ontologies, with which the data will then be linked. These could be for example ontologies of Organizations and Projects. 360 408 Mainly this enhancement shall mean, that the user can access the data indirectly by browsing one or multiple ontologies, with which the data will then be linked. These could be for example ontologies of Organizations and Projects. 409 410 411 SPARQL 412 413 rechercheisidore, dbpedia, ... 361 414 362 415 \section{Summary} 363 364 365 416 In this chapter, an expression of the whole of the CMD data domain into RDF was proposed, with special focus on the way how to translate the string values in metadata fields to corresponding semantic entities. Additionally, some technical considerations were discussed regarding exposing this dataset as Linked Open Data and the implications for real semantic ontology-based data exploration. 417 -
SMC4LRT/chapters/Design_SMCschema.tex
r3553 r3638 1 1 2 \chapter{ Concept-based mapping on schema level -- system design}2 \chapter{System design -- concept-based mapping on schema level} 3 3 \label{ch:design} 4 4 5 In this chapter, we define the part of the proposed system pertaining to the schema level: the concept-based crosswalk and search functionality -- the tasks that the Semantic Mapping Component was originally conceived for within the larger CMD Infrastructure (cf. \ref{def:CMDI}) -- and, additionally, the aspect of visualization of schema-level (model)data.6 7 We start by drawing a global view onthe system, introducing its individual components and the dependencies among them.8 In the next section, the internal data model is presented and explained. In section \ref{ sec:cx} the design of the actual main service for resolving crosswalks is described, divided into the interface specification and actual implementation. In section \ref{def:concept_search} we elaborate on a search functionality that builds upon the aforementioned service in terms of appropriate query language, a search engine to integrate the search in and the peculiarities of the user interface that could support this enhanced search possibilities. Finally, in section \ref{smc-browser} an advanced interactive user interface for exploring the CMD data domain is proposed.5 In this chapter, we define the main function of the proposed system -- the \textbf{concept-based crosswalk and search functionality} -- the tasks that the Semantic Mapping Component was originally conceived for within the larger CMD Infrastructure (cf. \ref{def:CMDI}). Additionally we explore the related aspect of analytic visualization of the processed data. 6 7 We start by drawing an overall view of the system, introducing its individual components and the dependencies among them. 8 In the next section, the internal data model is presented and explained. 
In section \ref{def:cx} the design of the actual main service for serving crosswalks is described, divided into the interface specification and notes on the actual implementation. In section \ref{def:qx} we elaborate on a search functionality that builds upon the aforementioned service in terms of an appropriate query language, a search engine to integrate the search in and the peculiarities of the user interface that could support these enhanced search possibilities. Finally, in section \ref{smc-browser} an advanced interactive user interface for exploring the CMD data domain is proposed. 9 9 10 10 \section{System Architecture} 11 11 12 The Semantic Mapping module is based on the DCR and CMD framework (cf. section \ref{def:DCR}) 13 and is being developed as a separate service on the side of CLARIN Metadata Service, its primary consuming service, but shall be equally usable by other applications. 14 12 The SMC module is part of the CMD Infrastructure. It is a consumer of data from the production-side registries and serves search services on the exploitation side of the infrastructure, as well as third party applications accessing the joint CLARIN metadata domain.
15 13 16 14 \begin{figure*}[!ht] … … 20 18 \end{figure*} 21 19 20 The SMC module can be broken down into the following components: 22 21 23 22 \begin{description} 24 \item[crosswalk service] the main service translating between indexes, detailed in \ref{sec:cx}25 \item[concept-based query expansion] 23 \item[crosswalk service] the basic service translating between fields (or indexes), detailed in \ref{def:cx} 24 \item[concept-based query expansion] a module for query expansion based on the crosswalks 26 25 \item[smc-xsl] set of xslt-stylesheets (governed by a build-file) for pre- and post-processing the data 27 26 \item[SMC Browser] a web application to explore the CMD data domain consisting of the two modules: \xne{smc-stats} and \xne{smc-graph} … … 30 29 \end{description} 31 30 31 The component diagram in Figure \ref{fig:smc_modules} depicts the dependencies between the components of the system. The \xne{crosswalk service} uses the set of XSL-stylesheets \xne{smc-xsl} and accesses the CMDI registries: \xne{Component Registry}, \xne{ISOcat DCR} and \xne{RELcat} to retrieve the data. It exposes an interface \xne{cx} to be used by third party applications. The \xne{query expansion} module uses the crosswalk service to rewrite queries, also exposing a corresponding API \xne{qx}. 32 33 \xne{SMC Browser} consists of two parts, \xne{smc-stats} and \xne{smc-graph}, and also uses the set of stylesheets for processing the data. \xne{smc-graph} is built on top of a library for interactive visualization of graphs. 34 32 35 For broader context see the reference architecture diagram in Figure \ref{fig:ref_arch}.
33 36 34 \section{Data model - Terms} 37 \section{Data model} 38 39 Before we get to the definition of the actual service, we define the internal data model, divided into two parts: 40 41 \begin{description} 42 \item[smcIndex] a data type for denoting indexes in a human-readable way used internally and as input and output format of the service 43 \item[Terms.xsd] the schema for internal representation of the processed data 44 \end{description} 45 46 \subsection{smcIndex}\label{def:smcIndex} 47 In this section, we describe \code{smcIndex} -- the data type to denote indexes used by the components of the system internally, as well as input and output on the interfaces. 48 49 An \code{smcIndex} is a human-readable string adhering to a specific syntax, denoting some search index. The syntax is based on two main ideas drawn from existing work: a) denoting a context by a prefix, derived from the way indices are referenced in CQL-syntax\footnote{Contextual Query Language, \url{http://www.loc.gov/standards/sru/specs/cql.html}} (analogous to the XML-namespace mechanism), e.g. \concept{dc.title}, and b) the dot-notation used in IMDI-browser\footnote{\url{http://www.lat-mpi.eu/tools/imdi}} to denote paths into structured data (analogous to XPath), e.g. \concept{Session.Location.Country}. The grammar generates only single terms that may not contain whitespace.
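As a rough illustration of these two notational ideas, the following Python sketch splits an index string into an optional registry prefix and its path steps. The function name \code{split\_smc\_index} and the returned tuple are hypothetical and not part of the SMC code base. The registry shortcuts \code{isocat} and \code{dc} and the separators are taken from the grammar of \code{smcIndex}; profile prefixes and the identifier-based forms using \code{\#} are deliberately left out.

```python
# Registry shortcuts (dcrID) and context separators as named in the grammar.
DCR_IDS = ("isocat", "dc")
CONTEXT_SEPARATORS = (".", ":")


def split_smc_index(index):
    """Split an index string into (kind, context, steps).

    Simplified sketch: only the DCR prefix and the dot path are handled.
    """
    if not index or any(ch.isspace() for ch in index):
        raise ValueError("an smcIndex is a single term without whitespace")
    for dcr in DCR_IDS:
        for sep in CONTEXT_SEPARATORS:
            prefix = dcr + sep
            if index.startswith(prefix) and len(index) > len(prefix):
                # dcrIndex: one-level term behind the registry prefix
                return ("dcrIndex", dcr, [index[len(prefix):]])
    # cmdIndex: path-like structure, one step per dot-separated name
    return ("cmdIndex", None, index.split("."))
```

Under these assumptions, \code{split\_smc\_index("dc.title")} yields a \code{dcrIndex} in the context \code{dc}, while \code{Session.Location.Country} is read as a \code{cmdIndex} with three path steps.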
50 51 \begin{defcap} 52 \caption{Grammar of \code{smcIndex}} 53 \begin{align*} 54 smcIndex &::= dcrIndex \ | \ cmdIndex \\ 55 dcrIndex &::= dcrID \ contextSep \ datcatLabel \\ 56 & \quad \quad | \ [\ dcrID \ contextSep \ ] \ datcatID \\ 57 cmdIndex &::= profile \\ 58 & \quad \quad | \ cmdEntityId \\ 59 & \quad \quad | \ [\ profile \ contextSep \ ] \ dotPath \\ 60 profile &::= profileName \ [ \ \texttt{\#} \ profileID \ ] \\ 61 dotPath &::= [\ dotPath \ pathSep \ ] \ elemName \\ 62 cmdEntityId &::= componentId \ [ \ \texttt{\#} \ elemName \ ] \\ 63 contextSep &::= \texttt{`.`} \ | \ \texttt{`:`} \\ 64 pathSep &::= \texttt{`.`} \\ 65 dcrId &::= \texttt{`isocat`} \ | \ \texttt{`dc`} 66 \end{align*} 67 \end{defcap} 68 69 The grammar distinguishes two main types of \code{smcIndex}: a) \code{dcrIndex} referring to data categories and b) \code{cmdIndex} denoting a specific ``CMD entity'', i.e. an element (metadata field), component or whole profile defined within CMD (cf. \ref{def:CMD} for description of the CMD data model). 70 These two types of \code{smcIndex} follow different construction patterns. 71 \code{cmdIndex} has a recursive path-like structure and can be interpreted as a XPath-expression into the instances of CMD profiles. In contrast to it, \code{dcrIndex} consists of just one-level term and is generally not directly applicable on existing data. It can be understood as abstract index referring to well-defined concepts -- the data categories -- and for actual search it needs to be resolved to the set of CMD elements it is referred by. In return, one can expect to match more metadata fields from multiple profiles, all referring to the same data category. 72 73 It is important to note, that in general -- by design -- \code{smcIndex} can be ambiguous, meaning it can refer to multiple concepts, or CMD entities. This is due to the fact that the labels of the data categories and CMD entities are not guaranteed unique. 
74 Although it may seem problematic and undesirable to have an ambiguous reference, this is an intentional design decision. The labels are needed for human-readability and ambiguity can be useful, as long as one is aware of it. 75 However there needs to be also the possibility to refer to data categories or CMD entities unambiguously. Therefore, the syntax also allows to reference indexes by the corresponding identifier. Following are some explanations to the individual constituents of the grammar: 76 77 \code{dcrID} is a shortcut referring to a data category registry. Next to \xne{ISOcat}, other registries can function as a DCR, in particular, the \xne{dublincore} set of metadata terms. \code{datcatLabel} is the human-readable name of given data category (e.g. \concept{telephoneNumber}). In the case of \xne{ISOcat} data categories the verbose descriptor \code{mnemonicIdentifier} is used. However despite its name, it is not guaranteed unique. Therefore, \code{datcatID} has to be used if a data category shall be referenced unambiguously. For \xne{dublincore} terms no such distinct identifier and label exist, the concepts are denoted by the lexical term itself, which is unique within the \concept{dublincore} namespace. 78 79 \code{profile} is reference to a CMD profile. Again, dealing with the ambiguity, it can be either the name of the profile \code{profileName} or its identifier \code{profileId} as issued by the Component Registry (e.g. \code{clarin.eu:cr1:p\_1272022528363} for \concept{LexicalResourceProfile}). Even if a profile is referenced by its identifier it may and should be prefixed by its name to still ensure human-readability. 
Or, seen the other way round, the name is disambiguated by suffixing it with the identifier: 80 81 \begin{example1} 82 \concept{LexicalResourceProfile\#clarin.eu:cr1:p\_1272022528363} \\ 83 \concept{LexicalResourceProfile\#clarin.eu:cr1:p\_1290431694579} 84 \end{example1} 85 86 \noindent 87 \code{dotPath} allows to address a leaf element (\concept{Session.Actor.Role}), or any intermediary XML element corresponding to a CMD component (\concept{Session.Actor}) within a metadata description. This allows to easily express search in whole components, instead of having to list all individual fields. The paths don't need to start from the root entity (the profile), they can reference any subtree structure. However longer paths are often needed for more specific references, e.g. instead of \concept{Name} one could say \concept{Actor.Name} vs. \concept{Project.Name} or even \concept{Session.Actor.Name} vs. \concept{Drama.Actor.Name}. Still this mechanism does not guarantee unique references, it only allows to narrow down the ambiguity. 88 89 \subsection{Terms} 35 90 \label{datamodel-terms} 36 91 37 \todocode{Terms.xsd} 38 39 \begin{note} 40 Describe the CMD-format? 41 \end{note} 42 92 In abstract terms, the internal format is basically a table of indexes with information collected from the upstream registries or created during preprocessing. Main entity is \code{Term} that represents either a label of a data category, or a CMD entity (a CMD component or element). Further entities \code{Termset} and \code{Concept} are mainly used for logical grouping of the \code{Terms}. In the following, we explain the data model of these entities and their use in more detail. For a full \xne{Terms.xsd} XML schema see listing \ref{list:terms-schema}. 93 94 \subsubsection{Type \code{Term}} 95 96 \code{Term} is a polymorph data type, that can have different sets of attributes depending on the type of data it represents. 
97 98 \begin{table}[ht] 99 \caption{Attributes of \code{Term} when encoding data category} 100 \label{table:terms-attributes-datcat} 101 \begin{tabular}{ l | l | l } 102 attribute & allowed values & sample value\\ 103 \hline 104 \var{concept-id} & PID given by DCR & \code{isocat:DC-2522} \\ 105 \var{set} & identifier of the DCR \emph{dcrID} & \code{isocat} \\ 106 \var{type} & one of ['id', 'label', 'mnemonic'] & \code{id}, \code{label}\\ 107 \var{xml:lang} & two-letter language code (only for ISOcat) & \code{en}, \code{si} \\ 108 \end{tabular} 109 \end{table} 110 111 %\captionsetup{justification=raggedright, singlelinecheck=false} 112 \lstset{language=XML} 113 \begin{lstlisting}[label=list:terms-attributes-datcat, caption=sample \code{Term} element encoding an ISOcat data category] 114 <Term concept-id="http://www.isocat.org/datcat/DC-2544" set="isocat" 115 type="label" xml:lang="fr">nom de ressource</Term> 116 \end{lstlisting} 117 118 \begin{table}[ht] 119 \caption{Attributes of \code{Term} when encoding CMD entity} 120 \label{table:terms-attributes-cmd} 121 \begin{tabularx}{1\textwidth}{ l | X | X } 122 attribute & allowed values & sample value\\ 123 \hline 124 \var{id} & \var{cmdEntityId} as defined in \ref{def:smcIndex} & \code{clarin.eu:cr1:c\_1290431694487\#Url} \\ 125 \var{type} & one of ['CMD\_Element', 'CMD\_Component'] & \code{CMD\_Element}\\ 126 \var{name} & name of the component or element & \code{Url} \\ 127 \var{path} & \var{dotPath} (cf. 
\ref{def:smcIndex}) & \code{SpeechCorpus.Access.Contact.Url} \\ 128 \var{parent} & name of the parent component & \code{Contact} \\ 129 \end{tabularx} 130 \end{table} 131 132 \lstset{language=XML} 133 \begin{lstlisting}[label=list:terms-attributes-element, caption=sample \code{Term} element encoding a CMD element] 134 <Term type="CMD_Element" name="Url" datcat="http://www.isocat.org/datcat/DC-2546" 135 id="clarin.eu:cr1:c_1290431694487#Url" parent="Contact" 136 path="SpeechCorpus.Access.Contact.Url"/> 137 \end{lstlisting} 138 139 \begin{table}[ht] 140 \caption{Attributes of \code{Term} when encoding a term in the inverted index?} 141 \label{table:terms-attributes-index} 142 \begin{tabularx}{1\textwidth}{ l | X | X } 143 attribute & allowed values & sample value\\ 144 \hline 145 \var{id} & \var{cmdEntityId} cf. \ref{def:smcIndex} & \code{clarin.eu:cr1:c\_1359626292113 \#ResourceTitle} \\ 146 \var{type} & one of \code{['id', 'mnemonic', 'label', 'full-path']} & \code{full-path}\\ 147 \var{schema} & \var{profileID} & \code{clarin.eu:cr1:p\_1357720977520} \\ 148 \var{concept-id} & id of the corresponding (data category) & \var{isocat:}\code{DC-2545} \\ 149 \var{node-value} & \var{dotPath} & \code{SpeechCorpus.Access.Contact.Url} \\ 150 \end{tabularx} 151 \end{table} 152 153 \lstset{language=XML} 154 \begin{lstlisting}[label=list:terms-attributes-index, caption=sample \code{Term} element encoding a term in the inverted index] 155 <Term set="cmd" type="full-path" schema="clarin.eu:cr1:p_1357720977520" 156 id="clarin.eu:cr1:c_1359626292113#ResourceTitle" 157 concept-id="http://www.isocat.org/datcat/DC-2545" > 158 AnnotatedCorpusProfile.GeneralInfo.ResourceTitle 159 </Term> 160 \end{lstlisting} 161 162 163 \subsubsection{Type \code{Concept}} 164 \code{Concept} represents a data category. Identifier is the PID issued by the DCR. 165 It groups all terms belonging to given data category. 
166 The content model is a sequence of \code{Terms} followed by a sequence of \code{info} elements. 167 Initially, after loading from DCR, a \code{Concept} contains only \code{Term}s of type: \code{id, mnemonic, label} encoding the corresponding attributes of the data category, followed by \code{info} elements holding the definition potentially in different languages: 168 169 \lstset{language=XML} 170 \begin{lstlisting}[label=list:concept, caption=sample \code{Concept} element representing the data category \concept{resourceTitle}] 171 <Concept xmlns:dcif="http://www.isocat.org/ns/dcif" type="datcat" 172 id="http://www.isocat.org/datcat/DC-2545"> 173 <Term set="isocat" type="mnemonic">resourceTitle</Term> 174 <Term set="isocat" type="id">DC-2545</Term> 175 <Term set="isocat" type="label" xml:lang="en">resource title</Term> 176 <Term set="isocat" type="label" xml:lang="fi">resurssin otsikko</Term> 177 ... 178 <info xml:lang="en">The title is the complete title 179 of the resource without any abbreviations.</info> 180 ... 181 </Concept> 182 \end{lstlisting} 183 184 In the inverted index the \code{Concept} is enriched with the \code{Terms} representing corresponding CMD entities (cf. Listing \ref{list:concept-cmd-term}). 
185 186 \lstset{language=XML} 187 \begin{lstlisting}[label=list:concept-cmd-term, caption=\code{Term} for CMD element added to \code{Concept}] 188 <Term set="cmd" type="full-path" schema="clarin.eu:cr1:p_1345561703620" 189 id="clarin.eu:cr1:c_1345561703619#Name">collection.CollectionInfo.Name</Term> 190 \end{lstlisting} 191 192 \lstset{language=XML} 193 \begin{lstlisting}[label=lst:dcr-cmd-map, caption=Sample of the inverted index \code{Concept} $\mapsto$ \code{Term}] 194 <Concept id="http://www.isocat.org/datcat/DC-2545" type="datcat"> 195 <Term set="isocat" type="mnemonic">resourceTitle</Term> 196 <Term set="isocat" type="id">DC-2545</Term> 197 <Term set="isocat" type="label" xml:lang="en">resource title</Term> 198 <Term set="isocat" type="label" xml:lang="hr">naslov resursa</Term> 199 <Term set="isocat" type="label" xml:lang="lv">resursa nosaukums</Term> 200 ... 201 <Term set="cmd" type="full-path" schema="clarin.eu:cr1:p_1357720977520" 202 id="clarin.eu:cr1:c_1359626292113#ResourceTitle"> 203 AnnotatedCorpusProfile.GeneralInfo.ResourceTitle</Term> 204 <Term set="cmd" type="full-path" schema="clarin.eu:cr1:p_1297242111880" 205 id="clarin.eu:cr1:c_1271859438123#Title"> 206 AnnotationTool.GeneralInfo.Title</Term> 207 <Term set="cmd" type="full-path" schema="clarin.eu:cr1:p_1274880881885" 208 id="clarin.eu:cr1:c_1274880881884#Title"> 209 imdi-corpus.Corpus.Title</Term> 210 <Term set="cmd" type="full-path" schema="clarin.eu:cr1:p_1271859438204" 211 id="clarin.eu:cr1:c_1271859438201#Title"> 212 Session.Title</Term> 213 ... 214 </Concept> 215 \end{lstlisting} 216 217 218 \subsubsection{Type \code{Termsets/Termset}} 219 \code{Termset} groups a set of terms as outlined in \ref{table:cx-list-params}. It is identified by the \code{@set} attribute. 220 For example all french labels of isocat data categories under the identifier \code{isocat-fr} build a termset, as well as all the full-paths of one profile. 
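To make the structure of the inverted index in Listing \ref{lst:dcr-cmd-map} more tangible, the following Python sketch reads one \code{Concept} element and groups its CMD full paths by profile, i.e. by the \code{schema} attribute. It is only an illustration under assumptions: the function name \code{paths\_by\_profile} is hypothetical, the sample is an abridged copy of the listing, and the actual system processes this data with the \xne{smc-xsl} stylesheets rather than with Python.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Abridged copy of the sample Concept from the inverted index listing.
SAMPLE = """<Concept id="http://www.isocat.org/datcat/DC-2545" type="datcat">
  <Term set="isocat" type="mnemonic">resourceTitle</Term>
  <Term set="isocat" type="id">DC-2545</Term>
  <Term set="cmd" type="full-path" schema="clarin.eu:cr1:p_1357720977520"
        id="clarin.eu:cr1:c_1359626292113#ResourceTitle">
    AnnotatedCorpusProfile.GeneralInfo.ResourceTitle</Term>
  <Term set="cmd" type="full-path" schema="clarin.eu:cr1:p_1271859438204"
        id="clarin.eu:cr1:c_1271859438201#Title">
    Session.Title</Term>
</Concept>"""


def paths_by_profile(concept_xml):
    """Return (concept id, {profile id: [full paths]}) for one Concept."""
    concept = ET.fromstring(concept_xml)
    grouped = defaultdict(list)
    for term in concept.findall("Term"):
        # keep only the Terms that represent CMD entities
        if term.get("set") == "cmd" and term.get("type") == "full-path":
            grouped[term.get("schema")].append((term.text or "").strip())
    return concept.get("id"), dict(grouped)
```

The result for the sample is the PID of the data category together with one list of paths per profile, which is the kind of cluster a crosswalk from a data category to CMD elements would be built from.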
221 222 Finally, \code{Termsets} is a root element grouping \code{Termset} elements. 223 224 \lstset{language=XML} 225 \begin{lstlisting}[label=list:termset, caption=\code{Termset} element representing a CMD profile] 226 <Termset name="AnnotatedCorpusProfile" id="clarin.eu:cr1:p_1357720977520" 227 type="CMD_Profile"> 228 <info> 229 <id>clarin.eu:cr1:p_1357720977520</id> 230 <description>A CMDI profile for annotated text corpus resources.</description> 231 <name>AnnotatedCorpusProfile</name> 232 <registrationDate>2013-01-31T11:57:12+00:00</registrationDate> 233 <creatorName>nalida</creatorName> 234 ... 235 </info> 236 <Term type="CMD_Component" name="GeneralInfo" datcat="" 237 id="clarin.eu:cr1:c_1359626292113" 238 parent="AnnotatedCorpusProfile" 239 path="AnnotatedCorpusProfile.GeneralInfo"> 240 <Term ... 241 </Term> 242 ... 243 </Termset> 244 \end{lstlisting} 245 246 The content of the \code{Termset} can optionally begin with an \code{info} element (conveying information as provided by the source registry, like definition, creation date or author) followed by a flat or nested list of \code{Term} elements. 247 248 249 %%%%%%%%%%%%%%%%%%%%%% 43 250 \section{cx -- crosswalk service} 44 \label{ def:cx}251 \label{sec:cx} 45 252 46 253 The crosswalk service offers the functionality, that was understood under the term \textit{Semantic Mapping} as conceived in the original plans of the Component Metadata Infrastructure. Semantic interoperability has been one of the main concerns addressed by the CMDI and appropriate provisions were weaved into the underlying meta-model as well as all the modules of the infrastructure. 47 The task of the crosswalk service is to collect the relevant information maintained in the registries of the infrastructure and process it to generate mappings, i.e. \emph{crosswalks} between fields in heterogeneous metadata schemas, building the base for concept-based search in the heterogeneous data collection of the joint CLARIN metadata domain. (cf. 
\ref{def:qx}). 48 49 The core means for semantic interoperability in CMDI are the \emph{data categories} (cf. \ref{def:DCR}), well-defined atomic concepts, that are supposed to be referenced in schemata annotating fields to unambiguously indicate their intended semantics. Drawing upon this system, the crosswalks are not generated directly between the fields of individual schemata by some matching algorithm, but rather the data categories are used as bridges for translation. This results in clusters of semantically equivalent metadata fields (with data categories serving as pivotal points), rather than in a collection of pair-wise equivalencies between the fields. 50 51 \subsection{smcIndex}\label{indexes} 52 In this section we describe \emph{smcIndex} -- the data type for input and output of the proposed application. 53 An smcIndex is a human-readable string adhering to a specific syntax, denoting some search index. 54 The generic syntax is: 55 \begin{eqnarray*} 56 smcIndex ::= context \ contextSep \ conceptLabel 57 \end{eqnarray*} 58 59 We distinguish two types of smcIndexes: (i) \emph{dcrIndex} referring to data categories and (ii) \emph{cmdIndex} denoting a specific 60 ``CMD-entity'', i.e. a metadata field, component or whole profile defined within CMD. The \textit{cmdIndex} can be interpreted as a XPath into the instances of CMD-profiles. In contrast to it, the \textit{dcrIndexes} are generally not directly applicable on existing data, but can be understood as abstract indexes referring to well-defined concepts -- the data categories -- and for actual search they need to be resolved to the metadata fields they are referred by. In return one can expect to match more metadata fields from multiple profiles, all referring to the same data category. 
61 62 These two types of smcIndex also follow different construction patterns: 63 \begin{eqnarray*} 64 smcIndex & ::= & dcrIndex \ | \ cmdIndex \\ 65 dcrIndex & ::= & dcrID \ contextSep \ datcatLabel \\ 66 cmdIndex & ::= & profile \ \\ 67 & & | \ [\ profile \ contextSep \ ] \ dotPath \\ 68 dotPath & ::= & [\ dotPath \ pathSep \ ] \ elemName \\ 69 contextSep & ::= & \texttt{`.`} \ | \ \texttt{`:`} \\ 70 pathSep & ::= & \texttt{`.`} \\ 71 dcrId & ::= & \texttt{`isocat`} \ | \ \texttt{`dc`} 72 \end{eqnarray*} 73 74 The grammar is based on the way indices are referenced in CQL-syntax\footnote{Context Query Language, \url{http://www.loc.gov/standards/sru/specs/cql.html}} (\texttt{dc.title}) and on the dot-notation used in IMDI-browser\footnote{\url{http://www.lat-mpi.eu/tools/imdi}} (\texttt{Session.Location.Country}). 75 76 \textit{dcrID} is a shortcut referring to a data category registry 77 %\footnote{Next to ISOcat other registries can function as a DCR, e.g., the Dublin Core set of metadata terms.} 78 similar to the namespace-mechanism in XML-documents. \textit{datcatLabel} is the verbose Identifier- (e.g. \texttt{telephoneNumber}) or the Name-attribute (in any available translation, e.g. \texttt{numero di telefono@it}) of the data category. 254 Consequently, the infrastructure has also foreseen this dedicated module, \emph{Semantic Mapping}, that exploits this mechanism to find \textbf{corresponding fields in different metadata schemas}. 255 256 The task of the crosswalk service is to collect the relevant information maintained in the registries of the infrastructure and process it to generate the mappings, or \textbf{crosswalks} between fields in heterogeneous metadata schemas. These crosswalks can be used by other applications building the base for concept-based search in the heterogeneous data collection of the joint CLARIN metadata domain (cf. \ref{def:qx}). 257 258 The core means for semantic interoperability in CMDI are the \emph{data categories} (cf. 
\ref{def:DCR}), well-defined atomic concepts, that are supposed to be referenced in schemas by annotating fields to unambiguously indicate their intended semantics. Drawing upon this system, the crosswalks are not generated directly between the fields of individual schemas by some kind of matching algorithm, but rather the data categories are used as reliable bridges for translation. This results in clusters of semantically equivalent metadata fields (with data categories serving as pivotal points), instead of a collection of pair-wise links between fields. 259 260 \subsection{Interface Specification} 261 \label{def:cx-interface} 262 263 In this section, we define the abstract interface of the proposed service, in terms of the input parameters and output data format. 264 265 \todoin{The two interfaces list and map 266 Full definition in appendix and under link!} 267 268 \subsubsection*{Method \code{list}} 269 270 Method \code{list} lists available items for given context or type. This allows the client applications to configure the query input and provide autocompletion functionality. 
271 272 \begin{definition}{URI-pattern of the \code{list} method} 273 /smc/cx/list/\$context 274 \end{definition} 275 276 \noindent 277 Table \ref{table:cx-list-params} lists the allowed values for the \var{\$context} parameter and the corresponding types of returned data 278 279 \begin{table} 280 \caption{Allowed values for parameters of the \code{list}-method and corresponding return values} 281 \label{table:cx-list-params} 282 \begin{tabular}{ l | p{0.7\textwidth} } 283 \var{\$context} & returns a list of \\ 284 \hline 285 \code{*,top} & available termsets \\ 286 \var{\{termset\}} & terms (CMD components and elements) of given termset \\ 287 \code{dcr} & available data category registries (isocat, dublincore) \\ 288 \code{isocat} & ISOcat data categories referenced in CMD data \\ 289 \code{languages} & available languages (only for isocat data categories) \\ 290 \code{cmd-profiles} & all available CMD profiles \\ 291 \code{cmd-full-paths} & all complete (starting from Profile) \emph{dotPaths} to CMD components and elements\\ 292 \code{cmd-minimal-paths} & reduced but still unique paths to CMD components and elements \\ 293 \code{relsets} & available relation sets (defined in the Relation Registry) 294 \end{tabular} 295 \end{table} 296 297 Also the application should deliver additional information about the indexes like description and a link to the definition of the underlying entity in the source registry. 298 %NO (this will be handled by the servic as multililngual labels e) : or the Name-attribute (in any available translation, e.g. 
\texttt{numero di telefono@it}) of the data category.} 79 299 % While it is desirable to also allow the Name-attribute of the data category (\texttt{telephone number}), especially also the Names defined in other working languages (\texttt{numero di telefono@it, numer telefonu@pl}), special care has to be taken here as these attributes mostly contain white spaces, which could cause problems in downstream components, when parsing a complex query containing such indices. 80 \textit{profile} is the name of the profile. % (despite the danger of ambiguity). 81 \textit{dotPath} allows to address a leaf element (\texttt{Session.Actor.Role}), or any intermediary XML-element corresponding to a CMD-component (\texttt{Session.Actor}) within a metadata description. %This allows to easily express search in whole components, instead of having to list all individual fields. 82 83 Generally, smcIndexes can be ambiguous, meaning they can refer to multiple concepts, or entities (CMD-elements). This is due to the fact that the names of the data categories, and CMD-entities are not guaranteed unique. The module will have to cope with this, by providing on demand the list of identifiers corresponding to a given smcIndex. 84 85 %As an important sidenote -- cmdIndexes can be ambiguous, meaning they can refer to multiple entities (metadata fields), examples of valid indexes: 86 %\begin{verbatim} 87 %Name 88 %Actor.Name, Project.Name 89 %Session.Actor.Name, Drama.Actor.Name 90 %\end{verbatim} 91 92 %So we disambiguate (or narrow down the ambiguity) by prefixing context. 93 94 \subsection{Interface Specification} 95 96 In this section, we describe the actual task of the proposed service -- \textbf{mapping indexes to indexes} -- in abstract terms. The returned mappings can be used by other applications to expand or translate the original user query, to match elements in other schemas. 
97 % \footnote{This primary usage of SMC for work with user-created query strings explains the need for human-readability of the indices.} 98 99 In the operation mode, the application accepts any index (\textit{smcIndex}, cf. \ref{indexes}) and returns a list of corresponding indexes (or only the input index, if no correspondences were found): 100 \newline 101 102 \textit{smcIndex $\mapsto$ smcIndex[ ]} 103 \newline 104 105 We can distinguish following levels for this mapping function: 106 107 (1) \emph{data category identity} -- for the resolution only the basic data category map derived from Component Registry is employed. Accordingly, only indexes denoting CMD-elements (\textit{cmdIndexes)} bound to a given data category are returned: 108 \newline 109 110 \begin{example} 111 isocat.size & $\mapsto$ & [teiHeader.extent, TextCorpusProfile.Number] 112 \end{example} 113 \newline 114 115 \textit{cmdIndex} as input is also possible. It is translated to a corresponding data category, proceeding as above: 116 \newline 117 118 \begin{example} 119 imdi-corpus.Name & $\mapsto$ \\ 120 (isocat.resourceName) & $\mapsto$ TextCorpusProfile.GeneralInfo.Name 121 \end{example} 122 \newline 123 124 (2) \emph{relations between data categories} -- employing also information from the Relation Registry, related (equivalent) data categories are retrieved and subsequently both the input and the related data categories resolved to cmdIndexes: 125 \newline 126 127 \texttt{isocat.resourceTitle $\mapsto$ } 128 \verb| (+ dc.title) |$\mapsto$ \newline 129 \verb| [imdi-corpus.Title, | \newline 130 \verb| TextCorpusProfile.GeneralInfo.Title,| \newline 131 \verb| teiHeader.titleStmt.title,| \newline 132 \verb| teiHeader.monogr.title]| 133 \newline 134 135 (3) \emph{container data categories} -- further expansions will be possible once the container data categories \cite{SchuurmanWindhouwer2011} will be used. 
Currently only fields (leaf nodes) in metadata descriptions are linked to data categories. However, at times, there is a need to conceptually bind also the components, meaning that besides the ``atomic'' data category for \texttt{actorName, there would be also a data category for the complex concept \texttt{Actor}.} 136 Having concept links also on components will require a compositional approach to the task of semantic mapping, resulting in: 137 \newline 138 \texttt{Actor.Name $\mapsto$ }\newline 139 \verb| [Actor.Name, Actor.FullName, |\newline 140 \verb| Person.Name, Person.FullName]| 141 300 301 302 \subsubsection*{Method \code{map} } 303 304 Method \code{map} performs the actual translations: 305 it accepts any index (adhering to the \var{smcIndex} datatype, cf. \ref{def:smcIndex}) and returns a list of corresponding indexes. 306 %it returns list of equivalent terms/smcIndexes for a given term/smcIndex. 307 308 \begin{definition}{General function definition} 309 smcIndex \mapsto smcIndex[ ] 310 \end{definition} 311 312 \begin{definition}{URI-pattern of the \code{map} method} 313 /smc/cx/map/\{\$context\}/\{\$term\} \ [ \ ?format=\{\$format\} \ ] \ [ \ \&relset=\{\$relset\} \ ] 314 \end{definition} 315 316 \noindent 317 Parameter definition:\\* 318 \begin{description} 319 \item[\var{\$context}] identifies the context to search in for the \var{\$term}; primarily this would be one of \code{[*, isocat, dc, cmd]}, in extended mode any of the terms listed in table \ref{table:cx-list-params} is accepted 320 \item[\var{\$term}] \var{smcIndex} term (without the context prefix); the term is used to look up a concept, to deliver the list of equivalent indexes; case-insensitive 321 \item[\var{\$format}] the desired result format can be indicated explicitly, as an alternative to the default content negotiation; one of \code{[json, rdf, xml]}; \code{xml} is the default 322 \item[\var{\$relset}] optional; reference to a relset to be applied on the identified concept to expand the cluster of 
equivalent indexes; allows multiple values from \code{list/relsets}; if multiple sets are given, they are all applied in the expansion 323 \end{description} 324 325 \noindent 326 Possible return formats: 327 \begin{description} 328 \item[\var{'', default}] internal XML format with all attributes (\xne{Terms.xsd}, cf. listing \ref{lst:map-output}) 329 330 331 \item[\var{schema}] distinct schemas (\code{Termset}) referencing the given data category or string 332 \lstset{language=XML} 333 \begin{lstlisting} 334 <Termset schema="clarin.eu:cr1:p_1295178776924" name="serviceDescription"/> 335 \end{lstlisting} 336 \item[\var{datcat}] distinct data categories (\code{Term@id@da}) by \code{@concept-id} 337 \lstset{language=XML} 338 \begin{lstlisting} 339 <Term concept-id="http://www.isocat.org/datcat/DC-2512" 340 set="isocat" type="datcat">creatorFullName</Term> 341 \end{lstlisting} 342 \item[\var{cmdid, id}] distinct cmd entities (\code{Term}) by \code{@id} 343 \begin{lstlisting} 344 <Term type="CMD_Element" name="Name" elem="Name" parent="Session" 345 datcat="http://www.isocat.org/datcat/DC-2544" 346 id="clarin.eu:cr1:c_1349361150645#Name" path="DBD.Session.Name"/> 347 \end{lstlisting} 348 349 \end{description} 350 351 \begin{table}[ht] 352 \caption{Sample values for parameters of the \code{map}-method and corresponding return values} 353 \label{table:cx-map-params} 354 355 \begin{tabular}{ l l | l} 356 \var{\$context} & \var{\$term} & returns \\ 357 \hline 358 \code{*} & \code{name} & ? 
\\ 359 \code{isocat} & \code{resourceTitle} & CMD terms \\ 360 \code{cmd} & \code{name} & \\ 361 362 \end{tabular} 363 \end{table} 364 365 \noindent 366 Sample request\\* 367 \begin{example1} 368 /smc/cx/map/isocat/resourceTitle 369 \end{example1} 370 \lstset{language=XML} 371 \begin{lstlisting}[label=lst:map-output, caption=Corresponding sample output ] 372 <Terms > 373 <Term set="cmd" type="full-path" schema="clarin.eu:cr1:p_1297242111880" 374 id="clarin.eu:cr1:c_1271859438123#Title"> 375 AnnotationTool.GeneralInfo.Title</Term> 376 <Term set="cmd" type="full-path" schema="clarin.eu:cr1:p_1288172614014" 377 id="clarin.eu:cr1:c_1288172614011#resourceTitle"> 378 BamdesLexicalResource.BamdesCommonFields.resourceTitle 379 </Term> 380 <Term set="cmd" type="full-path" schema="clarin.eu:cr1:p_1274880881885" 381 id="clarin.eu:cr1:c_1274880881884#Title"> 382 imdi-corpus.Corpus.Title</Term> 383 <Term set="cmd" type="full-path" schema="clarin.eu:cr1:p_1271859438204" 384 id="clarin.eu:cr1:c_1271859438201#Title"> 385 Session.Title</Term> 386 <Term set="cmd" type="full-path" schema="clarin.eu:cr1:p_1272022528363" 387 id="clarin.eu:cr1:c_1271859438123#Title"> 388 LexicalResourceProfile.LexicalResource.GeneralInfo.Title</Term> 389 <Term set="cmd" type="full-path" schema="clarin.eu:cr1:p_1284723009187" 390 id="clarin.eu:cr1:c_1271859438123#Title">collection.GeneralInfo.Title</Term> 391 \end{lstlisting} 392 393 \noindent 394 We can distinguish following levels for the mapping function: 395 396 \noindent 397 (1) \emph{data category identity} -- for the resolution only the basic data category map derived from Component Registry is employed. 
Accordingly, only indexes denoting CMD elements (\var{cmdIndex)} bound to a given data category are returned: 398 \noindent 399 \begin{example2} 400 %\begin{tabularx}{\textwidth}{| p{0.4\textwidth} p{0.6\textwidth} } 401 isocat.size $\mapsto$ & [teiHeader.extent, TextCorpusProfile.Number] 402 \end{example2} 403 %\end{tabularx} 404 405 \noindent 406 \var{cmdIndex} as input is also possible. It is translated to a corresponding data category, proceeding as above: 407 408 \begin{example2} 409 imdi-corpus.Name $\mapsto$ \\ 410 (isocat.resourceName) $\mapsto$ & TextCorpusProfile.GeneralInfo.Name 411 \end{example2} 412 413 \noindent 414 (2) \emph{relations between data categories} -- employing also information from the Relation Registry, related (equivalent) data categories are retrieved and subsequently both the input and the related data categories resolved to a list of \var{cmdIndexes}: 415 \begin{example2} 416 isocat.resourceTitle $\mapsto$ \\ 417 (+ dc.title) $\mapsto$ & [GeneralInfo.Title, Text.TextTitle, collection.CollectionInfo.Title, resourceInfo. identificationInfo. resourceName, teiHeader.titleStmt.title, teiHeader.monogr.title] 418 \end{example2} 419 420 \noindent 421 (3) \emph{container data categories} -- further expansions will be possible once the \emph{container data categories} \cite{SchuurmanWindhouwer2011} will be used.\footnote{Although metadata modellers are encouraged to indicate data categories for both components and element, this is taking up only slowly and currently only around 14 per cent of the components have a data category specified.} The idea is to set a concept link also for the components, meaning that besides the ``atomic'' data category for \concept{actorName}, there would be also a data category for the complex concept \concept{Actor}. 
422 Having concept links also on components will require a compositional approach for the mapping function, resulting in: 423 \begin{example2} 424 Actor.Name $\mapsto$ & [Actor.Name, Actor.FullName, \\ 425 & Person.Name, Person.FullName] 426 \end{example2} 142 427 143 428 \subsection{Implementation} 144 429 145 At the core of the described module is a set of XSL-stylesheets, governed by a ant-build file and a configuration file holding the information about individual source registries.430 At the core of the described module is a set of XSL-stylesheets, governed by an ant-build file and a configuration file holding the information about individual source registries. 146 431 147 432 \todoin{generate and reference XSLT-documentation} 148 433 434 The service is implemented as a RESTful service, however only supporting the GET operation, as it operates on a data set, that the users cannot change directly. (The changes have to be performed in the upstream registries.) 435 149 436 150 437 \subsubsection{Initialization} 151 152 First, there is an initialization phase, in which the application fetches the information from the source modules (cf. \ref{def:CMD}) and transforms it into the internal Terms format (cf. \ref{datamodel-terms}). All profiles and components from the Component Registry are read and all the URIs to data categories are extracted to construct an inverted map of data categories: 153 \newline 154 155 \textit{datcatURI $\mapsto$ profile.component.element[]} 156 \newline 157 158 The collected data categories are enriched with information from corresponding registries (DCRs), adding the verbose identifier, the description and available translations into other working languages. %, usable as base for multi-lingual search user-interface. 159 160 Finally relation sets defined in the Relation Registry are fetched and matched with the data categories in the map to create sets of semantically equivalent (or otherwise related) data categories. 
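The inverted map of data categories constructed during initialization can be illustrated with a minimal Python sketch. It assumes that the profiles have already been flattened into pairs of dotPath and data category URI; the sample pairs and the function name \code{build\_inverted\_map} are hypothetical and serve only as illustration, they are not taken from the actual XSLT implementation.

```python
from collections import defaultdict

# Hypothetical sample input: (dotPath, data category URI) pairs, as they
# might be extracted from the profiles in the Component Registry.
ELEMENTS = [
    ("DBD.Session.Name", "http://www.isocat.org/datcat/DC-2544"),
    ("imdi-corpus.Corpus.Name", "http://www.isocat.org/datcat/DC-2544"),
    ("Session.Actor.FullName", "http://www.isocat.org/datcat/DC-2512"),
]


def build_inverted_map(elements):
    """Group the dotPaths of CMD elements by the data category they reference."""
    inverted = defaultdict(list)
    for dot_path, datcat_uri in elements:
        inverted[datcat_uri].append(dot_path)
    return dict(inverted)


# datcatURI -> profile.component.element[]
DCR_CMD_MAP = build_inverted_map(ELEMENTS)
```

Looking up a data category URI in \code{DCR\_CMD\_MAP} then yields all CMD elements bound to it, which is the lookup needed for the \emph{data category identity} level of the mapping function.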
161 162 \todocode{example of inverted index} 438 \label{smc_init} 439 During initialization the application fetches the information from the source modules (cf. \ref{def:CMDI}) and transforms it into the internal \xne{Terms} format (cf. \ref{datamodel-terms}). All profiles and components from the Component Registry are read and all the URIs to data categories are extracted to construct an inverted map of data categories: 440 441 \begin{definition}{Principal structure of the inverted index} 442 datcatURI \mapsto profile.component.element[] 443 \end{definition} 444 445 The collected data categories are enriched with information from corresponding registries (DCRs), adding the label, the description and available translations into other working languages. %, usable as base for multi-lingual search user-interface. 446 447 Finally, relation sets defined in the Relation Registry are fetched and matched with the data categories in the map to create sets of semantically equivalent (or otherwise related) data categories. 448 449 \begin{figure*}[!ht] 450 \includegraphics[width=1\textwidth]{images/smc_init.png} 451 \caption{The various stages of the data flow during the initialization} 452 \label{fig:smc_init} 453 \end{figure*} 454 455 Following datasets are available, after the initialization sequence has finished (cf. 
figure \ref{fig:smc_init}): 456 \begin{description} 457 \item[\xne{termsets}] a list of all available Termsets compiled from the CMD profiles, and available DCRs; for \xne{ISOcat} a termset is generated for every available language 458 \item[\xne{cmd-terms}] a flat list of \code{Term} elements representing all components and elements in all known profiles; grouped in \code{Termset} elements representing the profiles 459 \item[\xne{cmd-terms-nested}] as above, however the \code{Term} elements are nested reflecting the component structure in the profile 460 \item[\xne{dcr-terms}] a list of \code{Concept} elements representing the data categories with nested \code{Term} elements 461 \item[\xne{dcr-cmd-map}] the main inverted index -- a list of concepts as in \xne{dcr-terms}, but with additional \code{Term} elements included in the \code{Concept} elements representing the CMD components or elements corresponding to the given data category (cf. listing \ref{lst:dcr-cmd-map}) 462 \item[\xne{rr-terms}] additional index generated based on the relations between data categories as defined in the Relation Registry; the \code{Concept} elements representing the pair of related data categories are wrapped with a \code{Relation} element (with a \code{@type} attribute) 463 \end{description} 163 464 164 465 \subsubsection{Operation} 165 166 \subsubsection{Computing summaries} 466 For the actual service operation a minimal application has been implemented that accesses the cached internal datasets and optionally applies XSL-stylesheets for post-processing depending on the requested format. 467 The application implements the interface as defined in \ref{def:cx-interface} as an XQuery module based on the \xne{restxq}-library within an \xne{eXist} XML-database. 167 468 168 469 \subsection{Extensions} 169 470 170 A useful supplementary function of the module would be to provide a list of existing indexes. 171 That would allow the search user-interface to equip the query-input with autocompletion. 
Also the application should deliver additional information about the indexes like description and a link to the definition of the underlying entity in the source registry. 172 173 Once there will be overlapping\footnote{i.e. different relations may be defined for one data category in different relation sets} user-defined relation sets in the Relation Registry an additional input parameter will be required to \emph{explicitly restrict the selection of relation sets} to apply in the mapping function. 174 175 Also, use of \emph{other than equivalency relations will necessitate more complex logic in the query expansion and accordingly also more complex response of the SMC, either returning the relation types themselves as well or equip the list of indexes with some similarity ratio.} 176 177 471 Once there will be overlapping\footnote{i.e. different relations may be defined for one data category in different relation sets} user-defined relation sets in the Relation Registry an additional input parameter will be required to \emph{explicitly restrict the selection of relation sets} to apply in the mapping function. 472 473 Also, use of \emph{other than equivalency} relations will necessitate more complex logic in the query expansion and accordingly also more complex response of the crosswalk service, either returning the relation types themselves as well or equip the list of indexes with some kind of similarity ratio. 178 474 179 475 \section{qx -- concept-based search} … … 182 478 In this section we want to explore, how this shall be accomplished, i.e. how to bring the enhanced capabilities to the user. 183 479 184 The emphasis lies on the query language and the corresponding query input interface. 185 186 Crucial aspect is the question how to deal with the even greater amount of information in a user-friendly way, ie how to prevent overwhelming, intimidating or frustrating the user. 
187 188 offering it (the information) semi-transparently to the user (or application) on the consumption side. 189 190 Semi-transparently means, that primarily the semantic mapping shall integrate seamlessly in the interaction with the service, but it shall ``explain'' - offer enough information - on demand, for the user to understand its role and also being able manipulate easily. 191 192 193 ? 194 Facets 195 Controlled Vocabularies 196 Synonym Expansion (via TermExtraction(ContentSet)) 197 480 The emphasis lies on the query language and the corresponding query input interface. A crucial aspect is the question of how to integrate the additional processing, i.e. how to deal with the even greater amount of information in a user-friendly way without overwhelming the user, while still being verbose about the applied processing on demand, so that the user can understand how the result came about and, even more important, can manipulate the processing easily. 481 482 Note that \emph{query expansion} needs to be distinguished from \emph{query translation}, the task of expressing the input query in another query language (e.g. a CQL query expressed as XPath). 483 484 Note also that this chapter deals only with the schema level, i.e. the expansion here pertains only to the indexes to be searched in, not to the search terms. The corresponding instance level is tackled in \ref{semantic-search}. 198 485 199 486 \subsection{Query language} 200 CQL? 201 487 The \emph{Contextual Query Language} (CQL) is used as the base query language to build upon; it is a well-established standard, designed with extensibility in mind. 202 488 203 489 \subsection{Query Expansion} 204 490 205 491 As long as the indexes to expand with are equivalent, the query expansion is simply a disjunction, returning a union of matching records. 
Thus \code{isocat.resourceTitle any "elephant"} would translate into 492 493 \begin{example1} 494 GeneralInfo.Title any "elephant" \\ 495 OR resourceInfo.resourceName any "elephant" \\ 496 OR CollectionInfo.Title any "elephant" \\ 497 OR teiHeader.titleStmt.title any "elephant" \\ 498 \end{example1} 499 500 \noindent 501 Alternatively to the -- potentially costly -- on the fly expansion, the concept-based equivalence clusters could be applied already during the indexing of the data. That means that ``virtual'' search indexes are defined for individual data categories in which values from all metadata fields annotated with given data category are indexed. Indeed, this approach is already being applied in the search applications VLO and Meertens Institute Search Engine (cf. \ref{cmdi_exploitation}). 206 502 207 503 \subsection{SMC as module for Metadata Repository} 208 504 209 As a concrete proof of concept the functionality of SMC has been integrated into the Metadata Repository, another module of the CMDI providing all the metadata records harvested within the CLARIN joint metadata domain .505 As a concrete proof of concept the functionality of SMC has been integrated into the Metadata Repository, another module of the CMDI providing all the metadata records harvested within the CLARIN joint metadata domain (cf. \ref{cmdi_exploitation}). 210 506 211 507 Metadata repository is implemented in xquery running within the eXist XML-database as a web application. 
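The disjunctive expansion of a single search clause over a cluster of equivalent indexes, as it could be applied before a query is evaluated, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name \code{expand\_clause} and the \code{EQUIVALENTS} table are hypothetical (the table mirrors the example given for \code{isocat.resourceTitle}), and quoting or escaping of the search term is not handled.

```python
def expand_clause(index, relation, term, equivalents):
    """Expand one CQL searchClause into a disjunction over equivalent indexes.

    If no equivalents are known, the clause is returned unchanged.
    """
    indexes = equivalents.get(index, [index])
    clauses = ['{} {} "{}"'.format(i, relation, term) for i in indexes]
    return " OR ".join(clauses)


# Hypothetical equivalence cluster, mirroring the example in the text.
EQUIVALENTS = {
    "isocat.resourceTitle": [
        "GeneralInfo.Title",
        "resourceInfo.resourceName",
        "CollectionInfo.Title",
        "teiHeader.titleStmt.title",
    ],
}

expanded = expand_clause("isocat.resourceTitle", "any", "elephant", EQUIVALENTS)
print(expanded)
```

In the actual setting the equivalence cluster would be obtained from the \code{map} method of the crosswalk service rather than from a hard-coded table.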
… … 219 515 220 516 221 \subsection{User Interface?} 222 223 224 \subsubsection*{Query Input} 225 517 \subsection{User Interface} 518 519 A starting point for our considerations is the traditional structure found in many (advanced) search interfaces, which is basically an array of index--term pairs, or in more advanced alternatives: tuples of index, comparison operator, term and boolean operator: 520 \begin{definition}{Generic data format for structured queries} 521 [ < index, operation, term, boolean > ] 522 \end{definition} 523 524 \noindent 525 This maps trivially to the main clause of the CQL syntax, the \var{searchClause} \ref{def:searchClause}. 526 % {Basic clause of the CQL syntax} 527 \begin{definition}{The main clause of the CQL syntax, the \code{searchClause}} 528 \label{def:searchClause} 529 searchClause \ ::= \ index \ relation \ searchTerm 530 \end{definition} 531 532 \noindent 533 An alternative would be a smart parsing input field with contextual autocomplete, though such a widget would still share the underlying data model. 226 534 227 535 \begin{figure*}[!ht] … … 231 539 \end{figure*} 232 540 541 \noindent 233 542 Using data categories from ISOcat as search indexes brings about -- next to solid semantic grounding -- the advantage of multilingual labels and descriptions/definitions. 234 543 235 \subsubsection*{Columns} 236 237 \subsubsection*{Summaries} 238 239 \subsubsection*{Differential Views} 240 Visualize impact of given mapping in terms of covered dataset (number of matched records). 241 242 \subsubsection*{Visualization} 243 Landscape, Treemap, SOM 244 245 \todoin{check Ontology Mapping and Alignement / saiks/Ontology4 4auf1.pdf} 246 247 \section{SMC-Browser} 544 A fundamentally different approach is the ``content first'' paradigm that, similar to the well-known simple search fields found in general search engines, provides suggestions via autocompletion on the fly, when the user starts typing any string. 
The difference is, that the suggestions are typed, so that the user is informed from which index given term comes (\concept{person}, \concept{place}, etc.) 545 546 Although we concentrate on query input, the use of indexes has to be consistent across, be it in labeling the fields of the results, or when providing facets to drill down the search. 547 548 549 \section{SMC Browser} 248 550 \label{smc-browser} 249 551 250 Explore the Component Metadata Framework 251 252 As the data set keeps growing both in numbers and in complexity, the call from the CMD community to provide advanced/enhanced ways for its exploration gets stronger. \textit{SMC browser} is one answer to this need. It is a web application, that lets the metadata modeller explore the information about profiles, components, elements and the usage of DCs as an interactive graph. This allows for example to examine the reuse of components or DCs in different profiles. The graph is accompanied by statistical information about individual `nodes', e.g., counting how many elements a profiles contains, or in how many profiles a DC is used. 253 254 In CMD, metadata schemas are defined by profiles, that are constructed out of reusable components - collections of metadata fields. The components can contain other components, and they can be reused in multiple profiles. Furthermore, every CMD element (metadata field) refers via a PID to a data category to indicate unambiguously how the content of the field in a metadata description should be interpreted \cite{Broeder+2010}. 
255 256 Thus, every profile can be expressed as a tree, with the profile component as the root node, the used components as intermediate nodes and elements or data categories as leaf nodes, parent-child relationship being defined by the inclusion (\code{componentA -includes-> componentB}) or referencing (\code{elementA -refersTo-> datcat1}).The reuse of components in multiple profiles and especially also the referencing of the same data categories in multiple CMD elements leads to a blending of the individual profile trees into a graph (acyclic directed, but not necessarily connected). 257 552 As the CMD dataset keeps growing both in numbers and in complexity, the call from the community to provide enhanced ways for its exploration gets stronger. In the following, some design considerations for an application to answer this need are proposed. 553 554 While the Component Registry (cf. \ref{def:CR}) allows to browse, search and view existing profiles and components, it is not possible to easily find out, which components are reused in which profiles and also which data categories are referenced by which elements. However this kind of information is crucial during profile creation as well as for curation of the existing profiles, as it enables the data modeller to recognize a) which components and data categories are those most often used, indicating their adoption and popularity within the community and b) the thematic contexts in which individual components are used, providing a hint about their appropriateness for given research data. 555 556 \subsection{Design} 557 In the following, we elaborate on the basic idea of the proposed application, the source data, requirements and proposed application UI-layout. 558 559 \subsubsection{Basic concept} 560 561 If we consider the CMD data model (cf. 
\ref{def:CMD}) we recognize that every profile can be expressed as a tree with the profile component as the root node, the used components as intermediate nodes and elements or data categories as leaf nodes, parent-child relationship being defined by \var{inclusion} and \var{reference}. 562 563 \begin{defcap}[!ht] 564 \caption{\var{inclusion} and \var{reference} relationship} 565 \begin{align*} 566 cmds:Component & \xrightarrow{includes} \quad cmds:Component \\ 567 cmds:Component & \xrightarrow{includes} \quad cmds:Element \\ 568 cmds:Element & \xrightarrow{refersTo} \quad DatCat 569 \end{align*} 570 \end{defcap} 571 The reuse of components in multiple profiles and especially also the referencing of the same data categories in multiple CMD elements leads to a blending of the individual profile trees into a graph (acyclic directed, but not necessarily connected). The main idea for the \xne{SMC Browser} is to \textbf{visualize this graph inherent in the CMD data}. 572 573 \subsubsection{Requirements} 574 Given the size of the data set (currently more than 4.000 nodes and growing) it is obvious, that it is not possible to overview the whole of the graph in one view. Thus, a general essential requirement is to be able to select and view subgraphs by various means. 575 576 In a basic scenario, user looks for possibly reusable profiles or components, based on some common terms associated with the type of data to be described (e.g. \code{"corpus"}). If the search yields matching profiles or components, the user should be able to view the whole structure of the profiles, explore the definitions for individual components and see which data categories are being referenced for semantic grounding. Furthermore, it has to be possible to view multiple profiles concurrently, in particular to be able to see the components or data categories they share and, vice versa, in which profiles a given data category is referenced. 
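The blending of profile trees into one graph, and the reverse question of which profiles reference a given data category, can be illustrated with a small Python sketch. The node identifiers and the edge list are hypothetical sample data, and the functions \code{build\_graph} and \code{profiles\_using} are names introduced here for illustration only. Each node is keyed by its identifier, so a component reused in two profiles is represented exactly once.

```python
from collections import defaultdict

# Hypothetical edges (parent, child) covering both the inclusion and the
# reference relationship; "component:Actor" is reused by two profiles.
EDGES = [
    ("profile:Session", "component:Actor"),
    ("profile:Drama", "component:Actor"),
    ("component:Actor", "element:Actor.Name"),
    ("element:Actor.Name", "datcat:actorName"),
    ("profile:Drama", "element:Drama.Title"),
    ("element:Drama.Title", "datcat:resourceTitle"),
]


def build_graph(edges):
    """Return forward and backward adjacency sets for a directed graph."""
    children, parents = defaultdict(set), defaultdict(set)
    for parent, child in edges:
        children[parent].add(child)
        parents[child].add(parent)
    return children, parents


def profiles_using(node, parents):
    """Walk against the edge direction and collect all profiles reaching node."""
    seen, stack, found = set(), [node], set()
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        if current.startswith("profile:"):
            found.add(current)
        stack.extend(parents.get(current, ()))
    return found


CHILDREN, PARENTS = build_graph(EDGES)
print(sorted(profiles_using("datcat:actorName", PARENTS)))
```

Keeping a backward adjacency next to the forward one is what makes the traversal from data categories towards the profiles as cheap as the traversal in edge direction.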
577 578 This scenario implies a few requirements on the user interface: 579 \begin{itemize} 580 \item select nodes from a list of all available nodes (ideally grouped by type) 581 \item filter the node list 582 \item select an arbitrary number of nodes of any type (be it profiles, components, elements, data categories) 583 \item traverse the graph starting from selected nodes into arbitrary depth 584 \item traverse the graph backwards (meaning against the direction of the edges, i.e. e.g. from data categories towards the profiles) 585 \item maintain the identity of the nodes, meaning one component or one data category used in two profiles has to be represented by one node (for displaying the reuse) 586 \item show auxiliary information about the nodes on demand 587 \end{itemize} 588 589 \subsubsection{Application layout} 590 \begin{figure*}[!ht] 591 \begin{center} 592 \includegraphics[width=1\textwidth]{images/smc-browser_UIsketch.png} 593 \end{center} 594 \caption{A sketch of a possible layout for the SMC Browser -- individual parts of the user interface} 595 \label{fig:smc-browser_sketch} 596 \end{figure*} 597 598 \noindent 599 Prospective parts of the application layout (cf. 
figure \ref{fig:smc-browser_sketch}): 600 \begin{description} 601 \item[index panel] list of all available nodes (profiles, components, elements, data categories); allows selecting nodes to be displayed in the graph pane 602 \item[main graph pane] displays the selected subgraph, needs as much space as possible 603 \item[graph navigation bar] for manipulation of the displayed graph by various means 604 \item[detail view] displaying definition and statistical information for selected nodes 605 \item[statistics] a separate view on the data listing the statistical information for the whole dataset in tables 606 \end{description} 607 608 \subsection{Implementation} 609 The application is implemented in \xne{javascript} based on a generic visualization \xne{js}-library \xne{d3}\furl{https://github.com/mbostock/d3/}. The library allows for data-driven visualization (hence the name \xne{d3 = data-driven documents}), attributes of data items being dynamically bound to attributes of the SVG objects representing them. This caters for high flexibility, fast development and consistent data views. The library also delivers the base graph layout algorithm: \emph{force-directed graph layout}\furl{https://github.com/mbostock/d3/wiki/Force-Layout##wiki-force}: 610 611 \begin{quotation} 612 A flexible force-directed graph layout implementation using position Verlet integration to allow simple constraints. [\dots] 613 In addition to the repulsive charge force, a pseudo-gravity force keeps nodes centered in the visible area and avoids expulsion of disconnected subgraphs, while links are fixed-distance geometric constraints. Additional custom forces and constraints may be applied on the "tick" event, simply by updating the x and y attributes of nodes. 614 \end{quotation} 615 616 An especially remarkable feature is the possibility to add custom constraints that are accommodated with the constraints imposed by the base algorithm. 
This enables flexible customization of the layout, while still harnessing the power of the underlying layout algorithm. At the same time this is a quite challenging feature to master, as with different constraints affecting the layout algorithm, it is at times difficult to understand the impact of a specific constraint on the layout. 617 618 \subsubsection{Data preprocessing} 619 \label{smc-browser-data-preprocessing} 620 The application operates on a set of static XHTML and JSON data files that are created in a preprocessing step and deployed with the application. The preprocessing consists of a series of XSLT transformations (cf. figure \ref{fig:smc_processing}), starting from the internal datasets generated during the initialization (cf. \ref{smc_init}). The HTML output for \xne{smc-stats} is generated in two steps (\var{track S}) via an intermediate internal generic XML format for representing tabular data. The JSON data for the \xne{smc-graph} as expected by the \xne{d3} library is also generated in two steps (\var{track G}). First, an XML representation of the graph is generated from the data (\xne{terms2graph.xsl}), on which a generic XSLT-transformation is applied (\xne{graph\_json.xsl}) transforming the XML graph into the required JSON format. 
In fact, this track is run multiple times generating different variants of the graph, featuring different aspects of the dataset: 621 622 \begin{description} 623 \item[SMC graph basic] 624 the basic graph contains \var{profiles $\mapsto$ components $\mapsto$ elements $\mapsto$ datcats} 625 \item[SMC graph all] 626 additionally rendering the new profile-groups and relations between data categories (from Relation Registry) 627 \item[only profiles + datcats] 628 just profiles and data categories are rendered (with direct links between those, skipping all components and elements) 629 \item[profiles + datcats + datcats + groups + rr] 630 as above but again with profile-groups and relations 631 \item[only profiles] 632 just profiles with links between them representing the degree of similarity based on the reuse of components and data categories 633 \end{description} 634 635 Additionally, a detour pass (\var{track D}) is executed, in which the graph is also transformed into the DOT format and run through the \xne{Graphviz dot} tool to get an SVG representation of the graph. In an early stage of development, this was actually the only processing path. However, it soon became obvious that the graph was getting too large to be displayed in its entirety. Figure \ref{fig:cmd-dep-dotgraph} displays an old version of such a dot-generated graph visualization. Currently, the \xne{dot} output is only used as input for the final graph data, providing initialization coordinates for the nodes in the \code{dot}-layout. 
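The exact JSON structure produced by \var{track G} is not spelled out in this section. As a rough sketch of what a graph file for the \xne{d3} force layout can look like, the following Python snippet turns an edge list into an array of nodes and an array of links whose \code{source} and \code{target} refer to nodes by their position in the node array (a form the force layout accepts). The function name \code{graph\_to\_d3\_json} and the \code{name} property are assumptions made for this illustration; the real transformation is done in XSLT.

```python
import json


def graph_to_d3_json(edges):
    """Serialize (source, target) edges as a nodes/links JSON document."""
    index = {}  # node name -> position in the nodes array
    nodes = []
    links = []
    for source, target in edges:
        for name in (source, target):
            if name not in index:
                index[name] = len(nodes)
                nodes.append({"name": name})
        links.append({"source": index[source], "target": index[target]})
    return json.dumps({"nodes": nodes, "links": links}, indent=2)


# Hypothetical miniature graph: profile -> component -> element -> datcat
print(graph_to_d3_json([
    ("Session", "Actor"),
    ("Actor", "Actor.Name"),
    ("Actor.Name", "actorName"),
]))
```

Precomputed coordinates, such as those taken from the \xne{dot} output, could be carried as additional properties of the node objects.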
636 637 638 \begin{figure*} 639 \includegraphics[width=1\textwidth]{images/smc_processing_-mdrepo} 640 \caption{The data flow in process of precomputing data for the SMC browser} 641 \label{fig:smc_processing} 642 \end{figure*} 643 644 \subsubsection{User interface} 645 646 \begin{figure*}[!ht] 647 \includegraphics[width=1\textwidth]{images/navigation_bar_2013-09-28.png} 648 \caption{Navigation bar of the SMC Browser with a number of options to manipulate the visible graph} 649 \label{fig:navbar} 650 \end{figure*} 651 652 653 As proposed in the design section, the starting point when using the SMC browser is the node list on the left, listing all nodes grouped by type (profiles, components, elements, data categories) and sorted alphabetically. This list can be filtered by a simple substring search which is important, as already now there are more than 4.000 nodes in the graph. Individual nodes are selected and deselected by a simple click. All selected nodes are displayed in the main graph pane represented by a circle with a label. The representation is styled by type. Based on the settings in the navigation bar (cf. figure \ref{fig:navbar}), next to the selected nodes also related nodes are displayed. The \code{depth-before} and \code{depth-after} options govern how many levels in each direction are traversed and displayed starting from the set of selected nodes. Option \code{layout} allows to select from one of available layouts -- next to the 654 basic \code{force} layout there are also directed layouts, that are often better suited for displaying the directed graph. 655 Other options influence the layouting algorithm (\code{link-distance}, \code{charge}, \code{friction}) and the visual representation of the nodes and edges (\code{node-size, labels, curve}). 656 657 One special option is \code{graph}, that allows to switch between different graphs as listed in \ref{smc-browser-data-preprocessing}. 
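How \code{depth-before} and \code{depth-after} could determine the set of displayed nodes is sketched below in Python. This is an illustration under assumptions, not the browser's actual code: the function name, the edge-list representation and the example node names are hypothetical.

```python
from collections import defaultdict


def visible_nodes(edges, selected, depth_before, depth_after):
    """Return the selected nodes plus their neighbourhood.

    edges        -- iterable of (source, target) pairs of a directed graph
    selected     -- set of node ids picked by the user
    depth_before -- number of levels to follow against the edge direction
    depth_after  -- number of levels to follow along the edge direction
    """
    successors = defaultdict(set)
    predecessors = defaultdict(set)
    for source, target in edges:
        successors[source].add(target)
        predecessors[target].add(source)

    def expand(adjacency, depth):
        # Breadth-first expansion, limited to `depth` levels.
        reached = set(selected)
        frontier = set(selected)
        for _ in range(depth):
            frontier = {
                neighbour
                for node in frontier
                for neighbour in adjacency.get(node, ())
            } - reached
            if not frontier:
                break
            reached |= frontier
        return reached

    return expand(predecessors, depth_before) | expand(successors, depth_after)


if __name__ == "__main__":
    chain = [
        ("profile", "component"),
        ("component", "element"),
        ("element", "datcat"),
    ]
    print(sorted(visible_nodes(chain, {"component"}, 1, 1)))
```

With both depths set to 0 only the selected nodes themselves are returned; increasing either value widens the displayed subgraph in the respective direction.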
658 659 There is user documentation deployed with the application and featured in the appendix \ref{sec:smc-browser-userdocs}, where all aspects of interaction with the application (\ref{interaction}) and the options in the navigation bar (\ref{options}) are described. 660 661 \subsection{Extensions} 662 Next to the basic setup described above, there are a number of possible additional features that could enhance the functionality and usefulness of the discussed tool. 663 664 \subsubsection*{Graph operations -- differential views} 665 An important feature would be to be able to apply set operations on selected (sub)graphs, especially \emph{intersection} and \emph{difference}. This would enable the user to easily extract components (nodes) that are shared (or not shared) among given schemas (subgraphs). 666 667 \subsubsection*{Generalization} 668 There is a high potential to broaden the scope of application for the discussed tool, provided some generalizations are taken into account. 669 Equipped with a more flexible or modular matching algorithm (additionally to the initially foreseen identity match), the tool could visualize matches between any given schemas, not only CMD-based ones. 670 671 Also, the input format being a graph, with appropriate preprocessing the tool could visualize any structural information that is suited to be expressed as a graph, like cooccurrence analysis, dependency networks, RDF data in general etc. 672 673 \subsubsection*{Viewer for external data} 674 The above feature would be even more useful if the application were able to ingest and process external data. The data can be passed either via upload or via a parameter with a URL of the data. This is especially attractive also to providers of other data and applications, who could provide a simple link in their user interface (with the data-parameter appropriately set), that would allow to visualize their data in the SMC browser. 
675 676 One prominent visualization application offering this feature is the geobrowser e4D\furl{http://www.informatik.uni-leipzig.de:8080/e4D/} (currently \xne{GeoTemCo}\furl{https://github.com/stjaenicke/GeoTemCo}, developed in the context of the \xne{europeana connect} initiative), accepting data in KML format. 677 678 \subsubsection*{Integrate with instance data} 679 The usefulness and information gain of the application could be greatly increased by integrating the instance data, i.e. by generating and displaying a variant of the graph which contains only profiles for which there is actually instance data present in the CLARIN joint metadata domain. Obviously, in such a visualization the size of data could be incorporated, in the simplest case the number of records being mapped onto the radius of the nodes, but there are a number of other metrics that could be applied in the visualizations. 680 681 Also such a visualization could feature direct search links from individual nodes into the dataset, i.e. from a profile node a link could lead into a search interface listing metadata records of the given profile. 258 682 259 683 \section{Summary} 260 261 262 684 In this core chapter, we laid out a design for a system dealing with concept-based crosswalks on schema level. 685 The system consists of three main parts: the crosswalk service, the query expansion module and \xne{SMC Browser} -- a tool for visualizing and exploring the schemas and the corresponding crosswalks. 686 -
SMC4LRT/chapters/Infrastructure.tex
r3553 r3638 3 3 4 4 5 \section{CLARIN / CMDI}5 \section{CLARIN} 6 6 \label{def:CLARIN} 7 8 CLARIN - Common Language Resource and Technology Infrastructure\cite{Varadi2008} - is one of the large research infrastructure initiatives as envisaged by the European Strategy Forum on Research Infrastructures (ESFRI) and fostered by the framework programmes of the European Commission. The mission of this project is to provide 9 10 \begin{quote} 11 \dots easy and sustainable access for scholars in the humanities and social sciences to digital language data (in written, spoken, video or multimodal form) and advanced tools to discover, explore, exploit, annotate, analyse or combine them, independent of where they are located.\cite{CLARIN2013web} 12 \end{quote} 13 14 \begin{comment} 15 To this end CLARIN is in the process of building a networked federation of European data repositories, service centres and centres of expertise, with single sign-on access for all members of the academic community in all participating countries. Tools and data from different centres will be interoperable, so that data collections can be combined and tools from different sources can be chained to perform complex operations to support researchers in their work. 16 \end{comment} 17 18 The initiative foresees a federated network of centres providing resources and services in a harmonized, interoperable manner to the academic community in all participating countries. 19 20 In the preparation phase of the project 2008 - 2011 over 180 institutions from 38 countries participated. In the construction phase, the action impetus moved, as projected, more to the individual national initiatives of this federated endeavour, while kept together by the common principles set up during the preparation phase and established processes and bodies ensuring the flow of information and coherent action on European level. 
21 22 Since 2013, CLARIN also became an \emph{European Research Infrastructure Consortium} (ERIC), which is a new type of legal entity established within EU, especially designed to give the research infrastructure initiatives a more stable status and better means to act independently. This is an important step to ensure a continuity of the endeavour, the chronic problem of (international) projects. 23 24 \section{Component Metadata Infrastructure - CMDI} 7 25 \label{def:CMDI} 8 CLARIN - Common Language Resource and Technology Infrastructure - constituted by over 180 members from round 38 countries. The mission of this project is to 9 10 \begin{quotation} 11 \dots create a research infrastructure that makes language resources and technologies (LRT) available to scholars of all disciplines, especially SSH large-scale pan-European collaborative effort to create, coordinate and make language resources and technology available and readily usable. 12 \end{quotation} 13 14 The infrastructure foresees a federated network of centers (with federated identity management) but mainly providing resources and services in an agreed upon / coherent / uniform / consistent /standardized manner. The foundation for this goal shall be the Common or Component Metadata infrastructure, a model that caters for flexible metadata profiles, allowing to accommodate existing schemas. 15 16 As stated before, the SMC is part of CMDI and depends on multiple modules on the production side of the infrastructure. Before we describe the interaction itself in chapter \ref{ch:design}, we introduce in short these modules and the data they provide: 26 27 One core pillar of CLARIN is the \emph{Component Metadata Infrastructure} (CMDI)\furl{http://www.clarin.eu/cmdi} -- a distributed system consisting of multiple interconnected modules aimed at creating and providing metadata for LRT in a coherent harmonized way. 
The conceptual foundation of CMDI is the \emph{Component Metadata Framework}\cite{Broeder+2010}, a flexible meta model that supports creation of metadata schemas also allowing to accommodate existing schemas (cf. \ref{def:CMD}). 28 29 The SMC is part of CMDI and depends on multiple modules on the production side of the infrastructure. Before we describe the SMC and its interaction with these modules in detail in chapter \ref{ch:design}, we introduce the latter and the type of data they provide: 17 30 18 31 \begin{itemize} 19 32 \item Data Category Registry 33 \item Component Registry 20 34 \item Relation Registry 21 \item Component Registry 22 \item Vocabulary Alignement Service (OpenSKOS) 35 \end{itemize} 36 37 \noindent 38 All these components are running services, that this work shall directly build upon. 39 40 Next to these core services, that SMC has direct dependencies to, some other services are being developed within the CMDI ecosystem that are also relevant in the context of SMC: 41 42 \begin{itemize} 23 43 \item Schema Registry (SCHEMAcat\furl{http://lux13.mpi.nl/schemacat/site/index.html}) 24 44 \item SchemaParser 45 \item Vocabulary Alignement Service (OpenSKOS) 25 46 \end{itemize} 26 47 27 On the other hand, SMC shall serve the modules on the exploitation side of the infrastructure, i.e. search services used by end users. These are briefly introduced in \ label{cmdi_exploitation}.48 On the other hand, SMC shall serve the modules on the exploitation side of the infrastructure, i.e. search services used by end users. These are briefly introduced in \ref{cmdi_exploitation}. 28 49 29 50 \begin{figure*}[!ht] … … 33 54 34 55 35 \subsection{CMDI registries: DCR, CR, RR} 36 \label{def:CMD} 37 \label{def:DCR} 38 39 40 41 The \emph{Data Category Registry} (DCR) is a central registry that enables the community to collectively define and maintain a set of relevant linguistic data categories. 
The resulting commonly agreed controlled vocabulary is the cornerstone for grounding the semantic interpretation within the CMD framework. 42 The data model and the procedures of the DCR are defined by the ISO standard \cite{ISO12620:2009}, and is implemented in \emph{ISOcat}\footnote{\url{http://www.isocat.org/}}. 43 Next to a web interface for users to browse and manage the data categories, DCR provides a REST-style webservice allowing applications to access the information (provided in Data Category Interchange Format - DCIF). The data categories are assigned a persistent identifier, making them globally and permanently referenceable. 44 45 The \emph{Component Metadata Framework} (CMD) is built on top of the DCR and complements it. While the DCR defines the atomic concepts, within CMD the metadata schemas can be constructed out of reusable components - collections of metadata fields. The components can contain other components, and they can be reused in multiple profiles as long as each field ``refers via a PID to exactly one data category in the ISO DCR, thus indicating unambiguously how the content of the field in a metadata description should be interpreted'' \cite{Broeder+2010}. This allows to trivially infer equivalences between metadata fields in different CMD-based schemata. While the primary registry used in CMD is the ISOcat DCR, other authoritative sources for data categories (``trusted registries'') are accepted, especially Dublin Core Metadata Initiative \cite{DCMI:2005}. 46 47 \emph{Component Registry} implements the Component Data Model and allows to define, maintain and publish CMD-components and -profiles. 48 56 \subsection{CMDI registries} 57 58 The CMD framework as data model (cf. \ref{def:CMD}) together with the two registries, the \emph{Data Category Registry} \xne{ISOcat} and the \emph{Component Registry}, form the backbone of the CMD Infrastructure. In the following we briefly explain their role and interaction. 
49 59 50 60 \begin{figure*}[!ht] … … 53 63 \end{figure*} 54 64 65 \subsubsection*{Data Category Registry} 66 \label{def:DCR} 67 68 The \emph{Data Category Registry} (DCR) is a central registry that enables the community to collectively define and maintain a set of relevant linguistic data categories. The resulting commonly agreed controlled vocabulary is the cornerstone for grounding the semantic interpretation within the CMD framework. 69 The data model and the procedures of the DCR are defined by the ISO standard \cite{ISO12620:2009}, and is implemented in \xne{ISOcat}\furl{http://www.isocat.org/}. 70 Next to a web interface for users to browse and manage the data categories, DCR provides a REST-style webservice allowing applications to access the information (provided in Data Category Interchange Format - DCIF). The data categories are assigned a persistent identifier, making them globally and permanently referenceable. 71 72 \subsubsection*{Component Registry} 73 74 \emph{Component Registry} (CR)\furl{http://catalog.clarin.eu/ds/ComponentRegistry/} implements the CMD data model and fulfills two functions. For one, it is a robust web application for creating and editing new CMD components and profiles. On the other hand it is the actual registry that persistently stores and exposes published CMD profiles, allowing to browse and search in them and view their structure. 75 76 The primary user of the CR is the metadata modeller with the task to create a dedicated metadata profile for a given resource type. She can browse and search the CR for components and profiles that are suitable or come close. The registry already contains many general components, e.g., for contact persons, language and geographical information. In general many of these can be reused as they are or have to be only slightly adapted, i.e., add or remove some metadata elements and/or components. Also new components can be created to model the unique aspects of the resources under consideration. 
All components are combined into one profile. Components, elements and values should be linked to a concept to make its semantics explicit.\cite{Durco2013_MTSR} 77 78 This approach of integrating prerequisites for semantic interoperability directly into the process of metadata creation differs 79 from the traditional methods of schema matching that try to establish pairwise alignments between schemas only after they were created and published -- be it algorithm-based or by means of explicit manually defined crosswalks\cite{Shvaiko2005}. 80 81 \subsubsection*{Ontological Relations -- Relation Registry} 82 55 83 The framework as described so far provides a sound mechanism for binding the semantic interpretation of the metadata descriptions. 56 84 However there needs to be an additional means to capture information about relations between data categories. 57 This information was deliberately not included in the DCR, because relations often depend on the context in which they are used, making global agreement unfeasible. CMDI proposes a separate module -- the \emph{Relation Registry}\label{def:rr} (RR) \cite{Kemps-Snijders+2008} --, where arbitrary relations between data categories can be stored and maintained. We expect that the RR should be under control of the metadata user whereas the DCR is under control of the metadata modeler. 58 These relations don't need to pass a standardization process, but rather separate research teams may define their own sets of relations according to the specific needs of the project. That is not to say that every researcher has to create her own set of relations -- some basic recommended sets will be defined right from the start. But new -- even contradictory -- ones can be created when needed. 59 60 There is a prototypical implementation of such a relation registry called \emph{RELcat} being developed at MPI, Nijmegen. \cite{Windhouwer2011,SchuurmanWindhouwer2011}, that already hosts a few relation sets. 
There is no user interface to it yet, but it is accessible as a REST-webservice\footnote{sample relation set: \url{http://lux13.mpi.nl/relcat/rest/set/cmdi}}. 85 This information was deliberately not included in the DCR, because relations often depend on the context in which they are used, making global agreement unfeasible. CMDI proposes a separate module -- the \emph{Relation Registry}\label{def:rr} (RR) \cite{Kemps-Snijders+2008} --, where arbitrary relations between data categories can be stored and maintained. This design grounds on the expectation that the RR should be under control of the metadata user whereas the DCR is under control of the metadata modeller. 86 87 These relations don't need to pass a standardization process, but rather separate research teams may define their own sets of relations according to the specific needs of the project. That is not to say that every researcher has to create her own set of relations -- some basic recommended sets will be defined right from the start. But new -- even contradictory -- ones can be created when needed. 88 89 There is a prototypical implementation of such a relation registry called \emph{RELcat} being developed at MPI, Nijmegen\cite{Windhouwer2011,SchuurmanWindhouwer2011}, that already hosts a few relation sets. There is no user interface to it yet, but it is accessible as a REST-webservice\footnote{sample relation set: \url{http://lux13.mpi.nl/relcat/rest/set/cmdi}}. 61 90 This implementation stores the individual relations as RDF-triples 62 \begin{example} 63 <subjectDatcat, relationPredicate, objectDatcat> 64 \end{example} 65 allowing typed relations, like equivalency (\texttt{rel:sameAs}) and subsumption (\texttt{rel:subClassOf}). The relations are grouped into relation sets that can be used independently. 66 67 !check DCR-RR/Odijk2010 -follow up 68 !Cf. 
Erhard Hinrichs 2009 91 92 \begin{example3} 93 <subjectDatcat, & relationPredicate, & objectDatcat> 94 \end{example3} 95 96 allowing typed relations, like equivalency (\code{rel:sameAs}) and subsumption (\code{rel:subClassOf}). The relations are grouped into relation sets that can be used independently. 97 98 \todoin{check DCR-RR/Odijk2010 -follow up ?; Cf. Erhard Hinrichs 2009 } 99 100 \subsubsection*{Schema Registry} 69 101 70 102 SCHEMAcat is a registry for schemata of all kinds (not just XML-based) semantically annotated with data categories. … … 73 105 (search) algorithms to traverse the semantic graph thus made explicit\cite{Schuurman2011_SCHEMAcat}. 74 106 75 \noindent76 All these components are running services, that this work shall directly build upon.77 78 This approach of integrating prerequisites for semantic interoperability directly into the process of metadata creation differs79 from the traditional methods of schema matching that try to establish pairwise alignments between schemas only after they were created and published. % -- be it algorithm-based or by means of explicit manually defined crosswalks \cite{Shvaiko2005}.80 81 Consequently, the infrastructure also foresees a dedicated module, \emph{Semantic Mapping}, that exploits this mechanism to deliver correspondences between different metadata schemas. The details of its functioning and its interaction with the aforementioned modules is described in the following chapter \ref{ch:design}.82 107 83 108 \subsection{Vocabulary Service / Reference Data Registry} … … 86 111 The urgent need for reliable community-shared registry services for concepts, controlled vocabularies and reference data for both the LRT and Digital Humanities community has been discussed on many occasions in various contexts. Applications and tasks requiring or profiting from this kind of service comprise Data-Enrichment / Annotation, Metadata Generation, Curation, Data Analysis, etc. 
As there is a substantial overlap in the vocabularies relevant for the various communities and even more so a high potential for reusability on the technical level, there is a strong case for tight cooperation between different initiatives. 87 112 88 In the context of the CLARIN initiative, one activity to tackle this issue -- mainly driven by CLARIN-NL -- is the project/taskforce \emph{CLAVAS - Vocabulary Alignment Service for CLARIN} where the plan is to reuse and enhance for CLARIN needs a SKOS-based vocabulary repository and editor OpenSKOS\furl{http://openskos.org}, developed and run within the dutch program CATCHplus\footnote{\textit{Continuous Access To Cultural Heritage} - \url{http://www.catchplus.nl/en/}}. See below for a more detailed description of this system. As of spring 2013, the Standing Committee on CLARIN Technical Centers (SCCTC) adopted the issue of Controlled Vocabularies and Concept Registries as one of the infrastructural (A-center) services to be dealt with. 89 113 In the context of the CLARIN initiative, one activity to tackle this issue -- mainly driven by CLARIN-NL -- is the project/taskforce \emph{CLAVAS - Vocabulary Alignment Service for CLARIN} where the plan is to reuse and enhance for CLARIN needs a SKOS-based vocabulary repository and editor OpenSKOS\furl{http://openskos.org}, developed and run within the dutch program CATCHplus\footnote{\textit{Continuous Access To Cultural Heritage} - \url{http://www.catchplus.nl/en/}}. See below for a more detailed description of this system. As of spring 2013, the Standing Committee on CLARIN Technical Centres (SCCTC) adopted the issue of Controlled Vocabularies and Concept Registries as one of the infrastructural (A-centre) services to be dealt with. 114 115 \begin{note} 90 116 In parallel, within the sister ESFRI project DARIAH a taskforce with the same goal has been set up : \emph{Service for Reference Data and Controlled Vocabularies}. 
This taskforce was introduced at the 2nd VCC Meeting in Vienna in November 2012. It is conceived as a collaborative endeavor between VCC1/Task 5: Data federation and interoperability and VCC3/Task3: Reference Data Registries (and external partners). The main goal is to \emph{establish a service providing controlled vocabularies and reference data} for the DARIAH (and CLARIN) community. 91 117 92 Regarding the responsibilities of the DARIAH working groups:93 VCC3/Task 3 identifies and recommends vocabularies relevant for the community. VCC1/Task 5 provides basic/generic services relevant for whole community. Especially, the Schema Registry, that allows to express mappings between different schemas seems to be one starting point. In accordance with the VCC1 strategy, concentrate on pulling together (pooling) existing resources and only implement necessary ``glue'' to put the pieces together (data conversion, service-wrappers...)94 95 118 Thus there is a momentum and a high potential for a collaborative approach in at least these two big initiatives CLARIN and DARIAH, that serve a very wide-spread and diverse community. 119 \end{note} 96 120 97 121 \subsubsection{Abstract service description} 98 122 As to the service itself it is primarily meant to serve other applications, rather than being used directly by end users, but a basic user interface is still necessary for administration etc. By using global semantic identifiers instead of strings, such a service enables the harmonization of metadata descriptions and annotations and is an indispensable step towards semantic data and \xne{LOD}. 99 Besides providing vocabularies, the service should also hold and expose equivalenc ies (and other relationships) between concepts from different vocabularies (concept schemes). These relationships come primarily from existing mappings, but can (and hopefully will) be subsequently generated (manually) for specific subsets on demand in a community process. 
An example for equivalencies from Wikipedia\footnote{\href{http://de.wikipedia.org/wiki/Johann_Wolfgang_von_Goethe}{page for J. W. Goethe}}:123 Besides providing vocabularies, the service should also hold and expose equivalences (and other relationships) between concepts from different vocabularies (concept schemes). These relationships come primarily from existing mappings, but can (and hopefully will) be subsequently generated (manually) for specific subsets on demand in a community process. An example for equivalences from Wikipedia\footnote{\href{http://de.wikipedia.org/wiki/Johann_Wolfgang_von_Goethe}{page for J. W. Goethe}}: 100 124 \begin{verbatim} 101 GND: 118540238 | LCCN: n79003362 | NDL: 00441109 | VIAF: 24602065 | Wikipedia-Personensuche 125 GND: 118540238 | LCCN: n79003362 | 126 NDL: 00441109 | VIAF: 24602065 102 127 \end{verbatim} 103 128 104 129 \subsubsection{Vocabulary Service - CLAVAS} 105 130 \label{def:CLAVAS} 106 As described in previous section (\ref{def:DCR}), a solid pil ar for defining and maintaining data categories is the ISOcat data category registry. However, while ISOcat has been in productive use for some time, it is -- by design -- not usable for all kinds of reference data. In general, it suits well for defining concepts/data categories (with closed or open concept domains), but its complex data model and standardization workflow does not lend itself well to maintain ``semi-closed'' concept domains, controlled vocabularies, like lists of entities (e.g. organizations or authors). In such cases, the concept domain is not closed (new entities need to be added), but it is also not open (not any string is a valid entity). Besides, the domain may be very large (millions of entities) and has to be presumed changing (especially new entities being added).131 As described in previous section (\ref{def:DCR}), a solid pillar for defining and maintaining data categories is the ISOcat data category registry. 
However, while ISOcat has been in productive use for some time, it is -- by design -- not usable for all kinds of reference data. In general, it suits well for defining concepts/data categories (with closed or open concept domains), but its complex data model and standardization workflow does not lend itself well to maintain ``semi-closed'' concept domains, controlled vocabularies, like lists of entities (e.g. organizations or authors). In such cases, the concept domain is not closed (new entities need to be added), but it is also not open (not any string is a valid entity). Besides, the domain may be very large (millions of entities) and has to be presumed changing (especially new entities being added). 107 132 108 133 This shortcoming leads to a need for an additional registry/repository service for this kind of data (controlled vocabularies). Within the CLARIN project mainly the abovementioned taskforce \emph{CLAVAS} is concerned with this challenge. … … 130 155 \label{interaction-dcr-skos} 131 156 132 133 157 DCR recognizes following types of data categories (Figure \ref{fig:dc_type}): 134 simple, complex: closed, open, constrained, (container)? 158 \code{simple, complex: closed, open, constrained, (container)?} 135 159 136 160 \begin{figure*}[!ht] … … 149 173 150 174 151 The semantic proximity of a /data category/ to a /concept/may mislead to152 a na"ive approach to mapping DCR to SKOS, namely mapping every data category (from one profile) to a concept153 all of them belonging to the \xne{ISOcat -profile:ConceptScheme}.175 The fact that data categories are basically definitions of concepts may mislead to 176 a na"ive approach to mapping DCR to SKOS, namely mapping every data category to a \code{skos:Concept} 177 all of them belonging to the \xne{ISOcat:ConceptScheme}. 154 178 However this is not practical/useful, ISOcat as whole is too disparate, and so would be the resulting vocabulary. 
155 179 156 A more sensible approach is to export only closed DCs as separate ConceptSchemes and their respective simple DCs as Concepts within that scheme. 180 A more sensible approach is to export only closed DCs as separate \code{skos:ConceptSchemes} and their respective simple DCs as \code{Concepts} within that scheme. 181 182 \begin{quotation} 157 183 The rationale is, that if we see a vocabulary as a set of possible values for a 158 184 field/element/attribute, complex DCs in ISOcat are the users of such 159 185 vocabularies and simple DCs the DCR equivalence of values in such a 160 vocabulary.\cite{Menzo2013mail} 161 162 Another aspect is, that a simple DC can be in valuedomains of multiple closed DCs. 163 Also a skos:Concept can belong to multiple ConceptSchemes\furl{http://www.w3.org/TR/skos-primer/\#secscheme}. 164 So there could a 1:1 one mapping [complex closed DCs] to [skos:ConceptSchemes] and [simple DCS] to [skos:Concepts]. 186 vocabulary. 187 \end{quotation}\cite{Menzo2013mail} 188 189 Another aspect is, that a simple DC can be in value domains of multiple closed DCs. 190 Also a \code{skos:Concept} can belong to multiple \code{skos:ConceptSchemes}\furl{http://www.w3.org/TR/skos-primer/\#secscheme}. 191 So there could be a 1:1 mapping of [complex closed DCs] to [skos:ConceptSchemes] and of [simple DCs] to [skos:Concepts]. 165 192 That would automatically convey also the possibly multiple membership of simple DCs / skos:Concepts in closed DCs / skos:ConceptSchemes. 166 193 … … 332 359 \todocite {MI Search Engine} 333 360 334 And finally, there is the \emph{Metadata Repository} aimed to collect all the harvested metadata descriptions from CLARIN cent ers,361 And finally, there is the \emph{Metadata Repository} aimed to collect all the harvested metadata descriptions from CLARIN centres, 335 362 and \emph{Metadata Service} that provides search access to this body of data. 
As such, Metadata Service is the primary application to use Semantic Mapping, to optionally expand user queries before issuing a search in the Metadata Repository. \cite{Durco2011} 336 363 -
SMC4LRT/chapters/Literature.tex
r3551 r3638 16 16 17 17 \subsection{Metadata} 18 A comprehensive architecture for harmonized handling of metadata -- the Component Metadata Infrastructure (CMDI)\f ootnote{\url{http://www.clarin.eu/cmdi}} \cite{Broeder2011} -- is being implemented within the CLARIN project\footnote{\url{http://clarin.eu}}. This service-oriented architecture consisting of a number of interacting software modules allows metadata creation and provision based on a flexible meta model, the \emph{Component Metadata Framework}, that facilitates creation of customized metadata schemas -- acknowledging that no one metadata schema can cover the large variety of language resources and usage scenarios -- however at the same time equipped with well-defined methods to ground their semantic interpretation in a community-wide controlled vocabulary -- the data category registry \cite{Kemps-Snijders+2009,Broeder2010}.18 A comprehensive architecture for harmonized handling of metadata -- the Component Metadata Infrastructure (CMDI)\furl{http://www.clarin.eu/cmdi} \cite{Broeder2011} -- is being implemented within the CLARIN project\footnote{\url{http://clarin.eu}}. This service-oriented architecture consisting of a number of interacting software modules allows metadata creation and provision based on a flexible meta model, the \emph{Component Metadata Framework}, that facilitates creation of customized metadata schemas -- acknowledging that no one metadata schema can cover the large variety of language resources and usage scenarios -- however at the same time equipped with well-defined methods to ground their semantic interpretation in a community-wide controlled vocabulary -- the data category registry \cite{Kemps-Snijders+2009,Broeder2010}. 19 19 20 20 Individual components of this infrastructure will be described in more detail in the section \ref{ch:infra}. 
… … 87 87 In their rather theoretical work Ehrig and Sure \cite{EhrigSure2004} elaborate on the various similarity measures which are at the core of the mapping task. On the dedicated platform OAEI\footnote{Ontology Alignment Evalution Intiative - \url{http://oaei.ontologymatching.org/}} an ongoing effort is being carried out and documented comparing various alignment methods applied on different domains. 88 88 89 One more specific recent inspirative work is that of Noah et. al \cite{Noah2010} developing a semantic digital library for an academic institution. The scope is limited to document collections, but nevertheless many aspects seem very relevant for this work, like operating on document metadata, ontology population or sophisticated querying and searching. 89 One more specific recent inspirational work is that of Noah et. al \cite{Noah2010} developing a semantic digital library for an academic institution. The scope is limited to document collections, but nevertheless many aspects seem very relevant for this work, like operating on document metadata, ontology population or sophisticated querying and searching. 90 90 91 91 \todoin{check if relevant: http://schema.org/} … … 99 99 100 100 \subsection{Ontology Visualization} 101 102 Landscape, Treemap, SOM 103 104 \todoin{check Ontology Mapping and Alignement / saiks/Ontology4 4auf1.pdf} 101 105 102 106 … … 123 127 124 128 \section{Summary} 125 This chapter concentrated on the current affairs/developments regarding the infrastructures for Language Resources and Technology and 126 on the other hand gave an overview of the state of the art regarding methods to be applied in this work: Semantic Web Technologies, Ontology Mapping and Ontology Visualization.
129 This chapter concentrated on the current affairs/developments regarding the infrastructures for Language Resources and Technology and on the other hand gave an overview of the state of the art regarding methods to be applied in this work: Semantic Web Technologies, Ontology Mapping and Ontology Visualization. -
SMC4LRT/chapters/Results.tex
r3551 r3638 49 49 50 50 51 \subsection{SMC Browser -- Advanced Interactive User Interface}51 \subsection{SMC Browser -- advanced interactive user interface} 52 52 53 53 SMC Browser\furl{http://clarin.aac.ac.at/smc-browser} is a web application to explore the complex dataset of the Component Metadata Framework, by visualizing its structure as an interactive graph. 54 In particular, it enables the metadata modeller to examine the reuse of components or DCs in different profiles. The graph is accompanied by statistical information about individual `nodes', e.g. counting how many elements a profiles contains, or in how many profiles a DC is used. 54 55 55 56 It is implemented on top of the js-library d3, the code is checked in clarin-svn. … … 249 250 The model has been expressed as 4 CMD profiles each for a distinct resource type however all four sharing most of the components, as can be seen in figure \ref{fig:resource_info_5}. The biggest single profile is currently the remodelled maximum schema from the META-SHARE project for describing corpora, with 117 distinct components and 337 elements. When expanded, this translates to 419 components and 1587 elements. However, many of the components and elements are optional (and conditional), thus a specific instance will never use all the possible elements. 
250 251 251 In a parallel effort, LINDAT, the czech national infrastructure initiative engaged in both CLARIN and META-SHARE, created a CMD profile (\xne{data}\furl{http://catalog.clarin.eu/ds/ComponentRegistry/?item=clarin.eu:cr1:p_1349361150622}) modelling the minimal obligatory set of META-SHARE \xne{resourceInfo}), however combined with a simple dublincore record. 252 In a parallel effort, LINDAT, the czech national infrastructure initiative engaged in both CLARIN and META-SHARE, created a CMD profile (\concept{data}\furl{http://catalog.clarin.eu/ds/ComponentRegistry/?item=clarin.eu:cr1:p_1349361150622}) modelling the minimal obligatory set of META-SHARE \concept{resourceInfo}), however combined with a simple dublincore record. 252 253 This way, the information gets partly duplicated, but with the advantage, that a minimal information is conveyed in the widely understood format, retaining the expressivity of the feature-rich schema. 253 254 … … 288 289 \item MD Search employing Semantic Mapping 289 290 \item MD Search employing Fuzzy Search 291 \item Visualize impact of given mapping in terms of covered dataset (number of matched records). 290 292 \end{itemize} 291 293 -
SMC4LRT/chapters/abstract_en.tex
r2672 r3638 1 1 \chapter*{Abstract} 2 2 3 According to the guidelines of the faculty, an abstract in English has to be inserted here. 3 4 This work is embedded in the context of a large research infrastructure initiative aimed at easing and harmonizing access to language resources and technology (LRT) in Europe, the \emph{Common Language Resource and Technology Infrastructure} or CLARIN. A core technical pillar of this initiative is the \emph{Component Metadata Infrastructure}, a distributed system for creating and providing metadata for LRT in a coherent harmonized way. The outcome of this work, the \emph{Semantic Mapping Component}, was conceived as one module within the infrastructure dedicated to overcome the semantic interoperability problem stemming from the heterogeneity of the resource descriptions, by harnessing the mechanisms of the semantic layer built-in at the core of the infrastructure. 5 6 The ultimate objective of the effort -- in line with the overall mission of the infrastructure -- was to \emph{enhance search functionality} over the large heterogeneous collection of resource descriptions. This was pursued by two separate, complementary approaches: a) Enriching the search capabilities with concept-based crosswalks on schema level. 7 And -- acknowledging the integrative power of the \emph{Linked Open Data} paradigm -- b) expressing the domain data as a \emph{Semantic Web} resource. 8 9 In parallel with the two approaches, the work delivered two main results: a) the \emph{specification} of the module for \emph{concept-based search} together with the underlying \emph{crosswalks service} accompanied by a proof-of-concept implementation. And b) the blueprint for expressing the original dataset in RDF, effectively laying a foundation for providing this dataset as \emph{Linked Open Data}. 10 As a by-product, the application \emph{SMC browser} was developed -- a visualization tool for interactive exploration of the dataset. 
This tool provided means to generate a number of advanced analyses of the data, directly used by the community for exploration and curation of the complex dataset. As such, they are considered the main contribution of this work by the author. 11 -
SMC4LRT/chapters/appendix.tex
r3551 r3638 4 4 5 5 6 \chapter{Data model ?} 6 \chapter{Data model reference} 7 In the following complete data models, schemas are listed for reference: The diagram of the data model for data category specification in figure~\ref{fig:DCR_data_model}, \xne{Terms.xsd} -- the XML schema used by the SMC module internally in listing~\ref{lst:terms-schema} (cf. \ref{datamodel-terms}) and the \xne{general-component-schema.xsd}\furl{https://infra.clarin.eu/cmd/general-component-schema.xsd} -- the schema representing the CMD meta model for defining CMD profiles and components in listing~\ref{lst:cmd-schema}. Figure \ref{fig:ref_arch} depicts an abstract reference architecture, that provides a conceptual frame for this work and in figure \ref{fig:acdh_context} an overview of the roles and services of the \xne{ACDH -- Austrian Centre for Digital Humanities} -- the home of SMC -- explicates the concrete current situation regarding the architectural context of SMC. 8 7 9 \begin{figure*}[!ht] 8 10 \begin{center} … … 12 14 \label{fig:DCR_data_model} 13 15 \end{figure*} 16 17 \input{images/Terms.xsd} 18 19 \input{images/general-component-schema.xsd} 20 14 21 15 22 \begin{figure*}[!ht] … … 29 36 \end{figure*} 30 37 31 \section {SMC Reports}32 \label{sec:reports}33 38 34 SCM Reports is a (growing) set of documents analyzing specific phenomena in the CMD data domain based on the visual and numerical output from the SMC Browser \ref{smc-browser}. 
39 \chapter{SMC Browser} 35 40 36 41 42 \begin{figure*}[!ht] 43 \begin{center} 44 \includegraphics[width=1\textwidth]{images/cmd-deps-graph_part2.png} 45 \end{center} 46 \caption{An early version of a visual representation of (a part of) the \xne{smc-graph} generated with the \code{dot} tool.} 47 \label{fig:cmd-dep-dotgraph} 48 \end{figure*} 49 50 \section{SMC Browser user documentation} 51 \label{sec:smc-browser-userdocs} 52 53 \input{chapters/userdocs_cleaned} 54 55 56 57 58 59 \chapter{SMC Reports} 60 \label{ch:reports} 61 62 SMC Reports is a (growing) set of documents analyzing specific phenomena in the CMD data domain based on the visual and numerical output from the SMC Browser \ref{smc-browser}. 63 37 64 \input{chapters/examples_cleaned} -
SMC4LRT/chapters/danksagung.tex
r2672 r3638 1 1 \chapter*{Danksagung} 2 2 3 Hier fügen Sie optional eine Danksagung ein. 3 Ich möchte mich herzlich bedanken, bei allen Kollegen die mir mit Rat zur Seite gestanden sind 4 und meinen Liebsten für ihre extra-portion Geduld, die ich ihnen abverlangt habe.