Changeset 3638 for SMC4LRT


Timestamp:
09/30/13 11:54:57 (11 years ago)
Author:
vronk
Message:

major reorganization, detailing of Design-chapters; abstract_en

Location:
SMC4LRT/chapters
Files:
9 edited

  • SMC4LRT/chapters/Data.tex

    r3553 r3638  
    88
    99
    10 \subsection{CMD-Framework}
    11 
     10\subsection{Component Metadata Framework}
     11\label{def:CMD}
     12
     13The \emph{Component Metadata Framework} (CMD) is the data model of the CLARIN metadata infrastructure. (See \ref{CMDI} for information about the infrastructure. The XML-schema of CMD -- the general-component-schema -- is featured in appendix \ref{lst:general-component-schema}.)
     14CMD is used to define so-called \var{profiles}, which are constructed out of reusable \var{components} -- collections of metadata fields. Components can contain other components and can be reused in multiple profiles. A profile is itself just a special kind of component (a subclass), with some additional administrative information.
     15The actual core provision for semantic interoperability is the requirement that each CMD element (i.e. metadata field) refers ``via a PID to exactly one data category\footnote{persistently referenceable concept definition} (cf. \ref{def:DCR}), thus
     16indicating unambiguously how the content of the field in a metadata description should be interpreted'' \cite{Broeder+2010}.
     17
     18While the primary registry for data categories used in CMD is the \xne{ISOcat} Data Category Registry (cf. \ref{def:DCR}), other authoritative sources are accepted (so-called ``trusted registries''), especially the set of terms maintained by the Dublin Core Metadata Initiative \cite{DCMI:2005}.
     19
     20Once the profiles are defined, they are transformed into an XML Schema that prescribes the structure of the instance records.
     21The generated schema also conveys as annotation the information about the referenced data categories.
    1222
    1323
     
    10011012.893 & MPI CGN \\
    10111110.628 & Bavarian Archive for Speech Signals (BAS) \\
    102 7.964 & Pacific And Regional Archive for Digital Sources in Endangered Cultures (PARADISEC) \\
     1127.964 & Pacific And Regional Archive for Digital Sources in Endangered Cultures\\
    1031137.348 & WALS RefDB \\
    1041145.689 & Lund Corpora \\
  • SMC4LRT/chapters/Design_SMCinstance.tex

    r3553 r3638  
    1 \chapter{System design - mapping on instance level}
     1\chapter{Mapping on instance level, CMD as LOD}
    22\label{ch:design-instance}
     3
    34\begin{quotation}
    4 I do think that ISOcat, CLAVAS, RELcat, an actual language
     5I do think that ISOcat, CLAVAS, RELcat and actual language
    56resource all provide a part of the semantic network.
    67
     
    1112relevant parts in a triple store and do your SPARQL/reasoning on it. Well
    1213that's where I'm ultimately heading with all these registries related to
    13 semantic interoperability ... I hope ;-)
     14semantic interoperability ... I hope ;-)\cite{Menzo2013mail}
    1415\end{quotation}
    15 \cite{Menzo2013mail}
    16 
    17 
    18 Linked Data - Express dataset in RDF
    19 
    20 
    21 Partly as by-product of the entities-mapping effort we will get the metadata rendered in RDF, linked with
    22 So theoretically we then only need to provide them ``on the web'', to make them a nucleus of the LinkedData-Cloud.
    23 
    24 
    25 Technical aspects (RDF-store?) / interface (ontology browser?)
    26 
    27 \todocode{check/install: raptor for generating dot out of rdf}\furl{http://librdf.org/raptor/}
    28 
    29 \todocode{check/install: Linked Data browser: LoD p. 81; Haystack}\furl{http://en.wikipedia.org/wiki/Haystack_(PIM)}
    30 
    31 defining the Mapping:
    32 \begin{enumerate}
    33 \item convert to RDF
    34 translate: MDRecord $\rightarrow$ [\#mdrecord \#property literal]
    35 \item map: \#mdrecord \#property literal  $\rightarrow$ [\#mdrecord \#property \#entity]
    36 \end{enumerate}
    37 
    38 \begin{figure*}[!ht]
    39 \includegraphics[width=1\textwidth]{images/SMC_CMD2LOD}
    40 \caption{The process of transforming the CMD metadata records to and RDF representation}
    41 \label{fig:smc_cmd2lod}
    42 \end{figure*}
    43 
     16
     17As described in previous chapters (\ref{ch:infrastructure}, \ref{ch:design_schema}), semantic interoperability is one of the main motivations for the CMD infrastructure. However, this machinery pertains mostly to the schema level; the actual values in the fields of CMD instances remain ``just strings''. This is the case even though the problem of different labels for semantically equivalent or even identical entities is even more virulent on the instance level. While for a number of metadata fields the value domain can be enforced through schema validation, some important fields (like \concept{organization} or \concept{resource type}) have a constrained value domain that nevertheless cannot be exhaustively enumerated. This leads to a chronically inconsistent use of labels for referring to entities (as the instance data shows, some organizations are referred to by more than 20 different labels or spelling variants), prompting an urgent need for better means of harmonizing the values of such constrained fields.
     18
     19One potential remedy is the use of reference datasets -- controlled vocabularies, taxonomies, ontologies and the like. In fact, this is a very common approach, be it the authority files in the library world or the domain-specific reference vocabularies maintained by practically every research community. Being less strict than schema definitions, they cannot be used for validation, but they still help to harmonize the data by offering preferred labels and identifiers for entities.
     20
     21In this chapter, we explore how this general approach can be employed for our specific problem of harmonizing the (literal) values in selected instance fields and mapping them to entities defined in corresponding vocabularies. This proposal is furthermore embedded in a more general effort to \textbf{express the whole of the CMD data domain (model and instances) in RDF} constituting one large ontology interlinked with existing external semantic resources (ontologies, knowledge bases, vocabularies). This result lays a foundation for providing the original dataset as a \emph{Linked Open Data} nucleus within the \emph{Web of Data}\cite{TimBL2006}
     22as well as for real semantic (ontology-driven) search and exploration of the data.
     23
     24The following section \ref{sec:cmd2rdf} lays out how individual parts of the CMD framework can be expressed in RDF.
     25In \ref{sec:values2entities} we investigate in further detail the above-mentioned critical aspect of the effort, namely the task of translating the string values in metadata fields to corresponding semantic entities. Finally, the technical aspects of providing the resulting ontology as LOD and the implications for an ontology-driven semantic search are addressed briefly in \ref{sec:lod} and \ref{semantic-search} respectively.
    4426
    4527\section{CMD to RDF}
    46 \label{ch:cmd2rdf}
    47 
    48 A few modules/components of the CMD infrastructure are dedicated to semantic interoperability. The DCR as global registry for concepts, CLAVAS for maintaining controlled vocabularies in SKOS format, RR for expressing arbitrary relations between concepts.
    49 However, the actual values in the CMD instances are ``just strings'' and for the most part cannot be validated by the schema, although they often could be mapped to a corresponding controlled vocabulary.
    50 
    51 Thus one aim of this work is to express the whole of the CMD data (model and instances) in RDF. This would allow to map the string values in selected fields to semantic entities, which in turn would allow real semantic (ontology-driven) search and bring about a linking with the web of data \todocite{Web of Data, TimBL}
    52 
    53 The following chapter lays out, how individual parts of the CMD framework can be expressed in RDF
     28\label{sec:cmd2rdf}
     29In this section, an RDF encoding is proposed for all levels of the CMD data domain:
     30
     31\begin{itemize}
     32\item CMD meta model
     33\item profile definitions
     34\item the administrative and structural information of CMD records
     35\item individual values in the fields of the CMD records
     36\end{itemize}
    5437
    5538\subsection{CMD specification}
    56 The meta model
     39
     40The main entity of the meta model is the CMD component, which is typed as a specialization of \code{owl:Class}. A CMD profile is basically a CMD component with some extra features, implying a specialization relation:
    5741
    5842\label{table:rdf-spec}
    59 \begin{example}
    60 cmd\_spec:Profile & subClassOf  & owl:Class. \\
    61 cmd\_spec:Component & subClassOf  & owl:Class. \\
    62 cmd\_spec:Element & subClassOf  & rdf:Property. \\
    63 \end{example}
    64 
    65 Typing the profiles, components and elements:
     43\begin{example3}
     44cmds:Component & subClassOf  & owl:Class. \\
     45cmds:Profile & subClassOf  & cmds:Component. \\
     46cmds:Element & subClassOf  & rdf:Property. \\
     47\end{example3}
     48
     49\noindent
     50These entities are used for typing the actual profiles, components and elements (as they are defined in the Component Registry):
    6651
    6752\label{table:rdf-cmd}
    68 \begin{example}
    69 cmd:collection & a & cmd\_spec:Profile; \\
    70  & rdfs:label & `collection'; \\
     53\begin{example3}
     54cmd:collection & a & cmds:Profile; \\
     55 & rdfs:label & "collection"; \\
    7156 & dcterms:identifier & cr:clarin.eu:cr1:p\_1345561703620. \\
    72 cmd:Actor       & a & cmd\_spec:Component. \\
    73 cmd:LanguageName  & a & cmd\_spec:Element. \\
    74 \end{example}
     57cmd:Actor       & a & cmds:Component. \\
     58cmd:LanguageName  & a & cmds:Element. \\
     59\end{example3}
    7560
    7661\begin{note}
    77 Should the ID assigned in the component registry  for the CMD entities  used as ID in rdf, or rather the verbose name? (if yes, how to ensure uniqueness – generate the name from the cmd-path?)
     62Should the ID assigned in the Component Registry for the CMD entities be used as the identifier in RDF, or rather the verbose name? (If the name is used, how to ensure uniqueness – generate the name from the cmd-path?)
    7863\end{note}
    7964
    8065\subsection{Data Categories}
    81 Windhouwer (2012) proposes to use the data categories as annotation properties.
    82 Definition of the annotation property \code{dcr:datcat}
    83 
    84 \begin{example}
     66Windhouwer \cite{Windhouwer2012_LDL} proposes to use the data categories as annotation properties:
     67
     68\begin{example3}
    8569dcr:datcat & a  & owl:AnnotationProperty ; \\
    8670 & rdfs:label  & "data category"@en ; \\
    87  & rdfs:comment  & "This resource is equivalent to  \\
    88 this data category."@en ; \\
    89  & skos:note  & "The data category should be  \\
    90  &   & identified by its PID."@en ; \\
    91 \end{example}
    92 
    93 Still, leaving open the possibility for “a stronger semantic link” :
    94 \begin{quotation}
    95 By default the RDF export inserts \code{dcr:datcat} annotation properties to maintain the link between the generated RDF resources and the used Data Categories. However, it is possible to also maintain a stronger semantic link when the RDF resources will be used as OWL (2) classes, properties or individuals.
    96 \end{quotation}
    97 
    98 For classes the OWL 2 \code{owl:equivalentClass} can be used, for example:
    99 
    100 \begin{example}
    101 \#myPOS & owl:equivalentClass & isocat:DC-1345. \\
    102 \end{example}
    103 
    104 For properties OWL 2 provides \code{owl:equivalentProperty}, for example:
    105 
    106 \begin{example}
    107 \#myPOS & owl:equivalentProperty & isocat:DC-1345. \\
    108 \end{example}
    109 
    110 Finally \code{owl:sameAs} can be used for individuals, for example:
    111 
    112 \begin{example}
    113 \#myNoun & owl:sameAs & isocat:DC-1333. \\
    114 \end{example}
    115 
    116 
    117 ISOcat provides a RDF representation of the data categories :
    118 
    119 \begin{example}
     71 & rdfs:comment  & "This resource is equivalent to  this data category."@en ; \\
     72 & skos:note  & "The data category should be identified by its PID."@en ; \\
     73\end{example3}
     74
     75That implies that the \code{@ConceptLink} attribute on CMD elements and components as used in the CMD profiles to reference the data category would be modelled as:
     76
     77\begin{example3}
     78cmd:LanguageName & dcr:datcat & isocat:DC-2484. \\
     79\end{example3}
     80
     81Encoding data categories as annotation properties is in contrast to the common approach seen with dublincore terms,
     82which are usually used directly as data properties:
     83
     84\begin{example3}
     85<lr1> & dc:title & "Language Resource 1"
     86\end{example3}
     87
     88\noindent
     89Analogously, we could model \xne{ISOcat} data categories as data properties, i.e. metadata elements referencing ISOcat data categories could be encoded as follows:
     90
     91\begin{example3}
     92<lr1> & isocat:DC-2502 & "19th century"
     93\end{example3}
     94
     95\noindent
     96However, Windhouwer\cite{Windhouwer2012_LDL} argues against direct mapping of complex data categories to data properties and in favour of modelling data categories as annotation properties, so as to avoid too strong semantic implications.
     97
     98This raises the converse question of whether all data categories should be handled uniformly, which would mean encoding dublincore terms as annotation properties as well. However, the pragmatic view is to encode the data in line with the prevailing approach, i.e. to express dublincore terms directly as data properties.
     99
     100
     101\noindent
     102The REST web service of \xne{ISOcat} provides an RDF representation of the data categories:
     103
     104\begin{example3}
    120105isocat:languageName & dcr:datcat & isocat:DC-2484; \\
    121106 & rdfs:label & "language name"@en; \\
    122107 & rdfs:comment & "A human understandable..."@en; \\
    123108 & 
  \\
    124 \end{example}
     109\end{example3}
     110
     111However, this is only meant as a template, as stated in the explanatory comment of the exported data:
     112
     113\begin{quotation}
     114By default the RDF export inserts \code{dcr:datcat} annotation properties to maintain the link between the generated RDF resources and the used Data Categories. However, it is possible to also maintain a stronger semantic link when the RDF resources will be used as OWL (2) classes, properties or individuals.
     115\end{quotation}
     116
     117So in a specific (OWL 2) application the relation with the data categories can be expressed as \code{owl:equivalentClass} for classes, \code{owl:equivalentProperty} for properties or \code{owl:sameAs} for individuals:
     118
     119\begin{example3}
     120\#myPOS & owl:equivalentClass & isocat:DC-1345. \\
     121\#myPOS & owl:equivalentProperty & isocat:DC-1345. \\
     122\#myNoun & owl:sameAs & isocat:DC-1333. \\
     123\end{example3}
     124
     125
     126\subsection{RELcat - Ontological relations}
     127As described in \ref{def:rr}, relations between data categories are not stored directly in the \xne{ISOcat} DCR, but rather in a dedicated module, the Relation Registry \xne{RELcat}. The relations there are grouped into relation sets and stored as RDF triples\cite{SchuurmanWindhouwer2011}. The following sample relation is taken from the \xne{CMDI} relation set, which expresses a number of equivalences between \xne{ISOcat} data categories and \xne{dublincore} terms:
     128
     129\begin{example3}
     130isocat:DC-2538 & rel:sameAs & dct:date
     131\end{example3}
     132
     133\noindent
     134By design, the relations in the Relation Registry are not expressed with predicates from known vocabularies like \xne{SKOS} or \xne{OWL}, again with the aim of avoiding too strong semantic implications. This leaves leeway for further specialization of the relations in specific applications.
    125135
    126136\begin{note}
    127 Output from isocat is only meant as template!
    128 
    129 In the RDF representation, the data categories seem to be referenced by their mnemonicIdentifier (rdf:ID=”languageName”) how is this guaranteed URI and how is the data category meant to be referred to?
     137Does this mean that I would state:
     138\begin{example3}
     139rel:sameAs & owl:equivalentProperty & owl:sameAs
     140\end{example3}
     141
     142to enable the inference of the equivalences?
     143
     144Is this correct:
    130145\end{note}
    131 
    132 Finally, the ConceptLink attribute used in the CMD profiles to reference the data category is modelled as:
    133 
    134 \begin{example}
    135 cmd:LanguageName & dcr:datcat & isocat:DC-248. \\
    136 \end{example}
    137 
     146?? That means that, to be able to infer that a value in a CMD element also pertains to a given data category, e.g.:
     147
     148\begin{example2}
     149 cmd:PublicationYear = 2012 $\rightarrow$ & dc:created = 2012
     150\end{example2}
     151
     152\noindent
     153the following facts need to be present in the ontology:
     154
     155\begin{example3}
     156<lr1> & cmd:PublicationYear & 2012\^{}\^{}xs:year \\
     157cmd:PublicationYear &  owl:equivalentProperty & isocat:DC-2538 \\
     158isocat:DC-2538 & rel:sameAs & dc:created \\
     159rel:sameAs & owl:equivalentProperty &  owl:sameAs \\
     160$\rightarrow$ \\
     161<lr1> & dc:created & 2012\^{}\^{}xs:year \\
     162\end{example3}
     163
     164\noindent
     165What about other relations we may want to express? (Do we need them and if yes, where to put them? – still in RR?) Examples:
     166
     167\begin{example3}
     168cmd:MDCreator   & rdfs:subClassOf & dcterms:Agent \\
     169clavas:Organization & rdfs:subClassOf & dcterms:Agent \\
     170<org1> & a & clavas:Organization \\
     171\end{example3}
    138172
    139173\subsection{CMD instances}
    140 
     174In the next step, we want to express the individual CMD instances, the metadata records, making use of the previously defined entities on the schema level, but also entities from external ontologies.
    141175
    142176\subsubsection {Resource Identifier}
    143177
    144 It seems natural to use the PID of a Language Resource ( \code{<lr1>} ) as the resource identifier for the subject in the RDF representation. While this seems semantically sound, not every resource has to have a PID . Alternatively we could use the PID of the MD record ( \code{<lr1.cmd>}  from \code{<cmd:MdSelfLink>}) as the resource identifier.
    145 The relationship between the resource and the metadata record could be expressed as an annotation :
    146 
    147 \begin{example}
     178It seems natural to use the PID of a Language Resource ( \code{<lr1>} ) as the resource identifier for the subject in the RDF representation. While this seems semantically sound, not every resource has to have a PID. (This is especially the case for ``virtual'' resources like collections, which are solely defined by their constituents and do not have any data of their own.) As a fall-back, the PID of the MD record ( \code{<lr1.cmd>}  from the \code{cmd:MdSelfLink} element) could be used as the resource identifier.
     179If identifiers are present for both resource and metadata, the relationship between the resource and the metadata record can be expressed as an annotation using the \xne{OpenAnnotation} vocabulary\furl{http://openannotation.org/spec/core/core.html\#Motivations}:
     180
     181\begin{example3}
    148182\_:anno1  & a & oa:Annotation; \\
    149183 & oa:hasTarget  & <lr1>; \\
    150184 & oa:hasBody  & <lr1.cmd>; \\
    151185 & oa:motivatedBy  & oa:describing \\
    152 \end{example}
    153 
    154 \subsection{Provenance}
    155 
    156 Use the information from CMD-Header for information about the modelled data  :
    157 
    158 \begin{example}
    159 <lr1.cmd>
    160  & dcterms:identifier  & <lr1.cmd>;  \\
    161  & dcterms:creator ??  & "\{<cmd:MdCreator>\}";  \\
    162 \end{example}
    163 
    164 Other proposed fields:
    165 
    166 \begin{example}
    167  & dcterms:publisher  & <http://clarin.eu>,  \\
    168  & <provider-oai-accesspoint>; ?? \\
    169  & dcterms:created/modified “\{<cmd:MdCreated>\}” ?? \\
    170 \end{example}
     186\end{example3}
     187
     188\subsubsection{Provenance}
     189
     190The information from \code{cmd:Header} represents the provenance information about the modelled data:
     191
     192\begin{example3}
     193<lr1.cmd> & dcterms:identifier  & <lr1.cmd>;  \\
     194 & dcterms:creator ??  & "\var{\{cmd:MdCreator\}}";  \\
     195 & dcterms:publisher  & <http://clarin.eu>, <provider-oai-accesspoint>; ?? \\
     196 & dcterms:created /dcterms:modified? & "\var{\{cmd:MdCreated\}}" ?? \\
     197\end{example3}
    171198
    172199\subsubsection{Hierarchy ( Resource Proxy – IsPartOf)}
    173 In CMD, <cmd:ResourceProxyList> is used to express both collection hierarchy and point to resource(s) described by the MD record. This can be modeled as OAI-ORE Aggregation\furl{http://www.openarchives.org/ore/1.0/primer\#Foundations}
    174 \furl{http://openannotation.org/spec/core/core.html\#Motivations}
     200In CMD, the \code{cmd:ResourceProxyList} structure is used both to express the collection hierarchy and to point to the resource(s) described by the MD record. This can be modelled as \xne{OAI-ORE Aggregation}\furl{http://www.openarchives.org/ore/1.0/primer\#Foundations}
    175201:
    176202
    177 \begin{example}
     203\begin{example3}
    178204<lr0.cmd>  & a   & ore:ResourceMap \\
    179205<lr0.cmd> & ore:describes & <lr0.agg> \\
    180206<lr0.agg> & a   & ore:Aggregation \\
    181 ore:aggregates  & <lr1.cmd>, <lr2.cmd>; \\
    182 \end{example}
    183 
    184  
     207& ore:aggregates  & <lr1.cmd>, <lr2.cmd>; \\
     208\end{example3}
     209
     210\noindent
    185211?? Should both collection hierarchy and resource-pointers (collection and resource MD records) be encoded as ore:Aggregation?
    186 Additionally the flat header field <cmd:MdCollectionDisplayName> has been introduced to indicate by simple means the collection, of which given resource is part.
    187 This information can be used to generate a separate one-level grouping of the resources, in which the value from the <cmd:MdCollectionDisplayName> element would be used as the label of an otherwise undefined ore:ResourceMap.
     212Additionally, the flat header field \code{cmd:MdCollectionDisplayName} has been introduced to indicate by simple means the collection of which a given resource is part.
     213This information can be used to generate a separate one-level grouping of the resources, in which the value from the \code{cmd:MdCollectionDisplayName} element would be used as the label of an otherwise undefined \code{ore:ResourceMap}.
    188214Even the identifier/URI for these collections is not clear. Although these collections should match the ResourceProxy hierarchy, there is no guarantee for this, thus a 1:1 mapping cannot be expected.
     215
    189216\todocode{check consistency for MdCollectionDisplayName vs. IsPartOf in the instance data}
    190217
    191 \begin{example}
     218\begin{example3}
    192219\_:mdcoll  & a   & ore:ResourceMap; \\
    193220 & rdfs:label & "Collection 1"; \\
    194 \_:mdcoll\#aggregation & a   & ore:Aggregation \\
     221\_:mdcoll\#aggreg & a   & ore:Aggregation \\
    195222 & ore:aggregates  & <lr1.cmd>, <lr2.cmd>; \\
    196 \end{example}
    197 
     223\end{example3}
    198224       
    199225\subsubsection{Components – nested structures}
    200226
    201 \begin{note}
    202 ?? Model (instance) components as blank nodes via objectProperty:
    203 \end{note}
    204 
    205 \begin{example}
     227There are two variants to express the tree structure of the CMD records, i.e. the containment relation between the components:
     228
     229\begin{enumerate}[a)]
     230\item the components are encoded as object properties
     231
     232\begin{example3}
    206233<lr1>  & cmd:Actor  & \_:Actor1 \\
    207234<lr1>  & cmd:Actor  & \_:Actor2 \\
     
    210237\_:Actor1  & cmd:role & "Interviewer" \\
    211238\_:Actor2 & cmd:role & "Speaker" \\
    212 \end{example}
    213 
    214 ?? or rather as Classes (and express the containement hierarchy with some extra predicate):
    215 \begin{example}
     239\end{example3}
     240
     241\item a dedicated object property is used
     242
     243\begin{example3}
    216244\_:Actor1  & a & cmd:Actor \\
    217245<lr1> & cmd:contains & \_:Actor1 \\
    218 \end{example}
    219 
    220 \subsubsection{Elements, Fields, Values}
    221 
    222 There are two steps to the modeling of the actual values in the fields of CMD records in RDF. The first one is to express the values as triples with literal values, then for selected fields – using the literal values – try to find corresponding entities in appropriate controlled vocabularies and generate new triples.
    223 There seems to need to be a separate property (predicate) for fields that are mapped to entities, like:
    224 
    225 \begin{example}
    226 <lr1> & cmd:Organisation & "MPI" \\
    227 <lr1> & cmd:Organisation\_? & <org1> \\
    228 \end{example}
    229 
    230 %\subsubsection{Literal Values}
    231 \paragraph{Literal Values}
    232 
    233 Usually, RDF-mapping of dublincore descriptions is to data properties (cf. OLAC-DcmiTerms profile )
    234 
    235 \begin{example}
    236 <lr1> & dct:title & "Language Resource 1"
    237 \end{example}
    238 
    239 Analogously, we could model isocat data categories  as data properties . Metadata elements referencing ISOcat datacategories could be encoded as follows:
    240 
    241 \begin{example}
    242 <lr1> & isocat:DC-2502 & "19th century"
    243 \end{example}
    244 
    245 However, Windhouwer (2012) argues against direct mapping of complex data categories to data properties, but proposes to rather model data categories as annotation properties.
    246 
    247 \begin{example}
    248 cmd:timeCoverage  & a   & cmd\_spec:Element \\
     246\end{example3}
     247
     248\end{enumerate}
     249
     250\subsection{Elements, Fields, Values}
     251Finally, we also want to integrate the actual field values in the CMD records into the ontology.
     252
     253\subsubsection{Predicates}
     254As explained before, CMD elements are typed as \code{rdf:Property}, with the corresponding data category expressed as an annotation property:
     255
     256\begin{example3}
     257cmd:timeCoverage  & a   & cmds:Element \\
    249258cmd:timeCoverage  & dcr:datcat  & isocat:DC-2502 \\
    250259<lr1>  & cmd:timeCoverage  & "19th century" \\
    251 ...
    252 \end{example}
    253 
    254 This raises the vice-versa question, whether to rather handle all data categories uniformly, thus encoding dublincore terms also as annotation properties.
    255 
    256 %\subsubsection{Mapping to entities – Vocabularies  – CLAVAS}
    257 \paragraph{Mapping to entities – Vocabularies  – CLAVAS}
    258 
    259 A major (if not the main) motivation for the CMD to RDF mapping is the wish to have better control over  and better quality of values in metadata fields with constrained value domain like organization or resource type. As the allowed values for these fields often cannot be explicitly enumerated, it is not possible to restrict them by means of an XML schema. This leads to inconsistent use of labels for referring to entities. (As the instance data shows, some organizations are referred to by more than 20 different labels.)
    260 Thus, one goal of this work is to map (string) values in selected fields to entities defined in corresponding vocabularies. The main provider of relevant vocabularies is ISOcat and CLAVAS  – a service for managing and providing vocabularies in SKOS format. Closed and corresponding simple data categories are already being exported from ISOcat in SKOS format and imported into CLAVAS/OpenSKOS and also other relevant vocabularies shall be ingested into this system, so that for our purposes we can assume OpenSKOS as the one source of vocabularies.
    261 Data in OpenSKOS is modelled purely in SKOS, so there is no more specific typing of the entities in the vocabularies, but rather all the entities are \xne{skos:Concepts}:
    262 
    263 \begin{example}
     260
     261\end{example3}
     262
     263\subsubsection{Literal values -- data properties}
     264
     265Generating triples with literal values is straightforward:
     266
     267\begin{definition}{Literal triples}
     268lr:Resource \ \quad cmds:Property \ \quad xsd:string
     269\end{definition}
     270
     271\begin{example3}
     272<lr1> & cmd:Organisation & "MPI" \\
     273\end{example3}
     274
     275\subsubsection{Mapping to entities -- object properties}
     276
     277The more challenging but also more valuable aspect is to generate objectProperty triples with the literal values mapped to semantic entities:
     278
     279\begin{definition}{new RDF triples}
     280lr:Resource \ \quad cmd:Property \ \quad xsd:anyURI
     281\end{definition}
     282
     283\begin{example3}
     284<lr1> & cmd:Organisation\_? & <org1> \\
     285\end{example3}
     286
     287\begin{note}
     288Don't we need a separate property (predicate) for the triples with object properties pointing to entities,
     289i.e. \code{cmd:Organisation\_} in addition to \code{cmd:Organisation}?
     290\end{note}
     291
     292The mapping process is detailed in \ref{sec:values2entities}.
     293
     294%%%%%%%%%%%%%%%%%55
     295\section{Mapping field values to semantic entities}
     296\label{sec:values2entities}
     297
     298This task is a prerequisite for expressing the CMD instance data in RDF as well. The main idea is to find entities in selected reference datasets (controlled vocabularies, ontologies) matching the literal values in the metadata records. The obtained entity identifiers are then used to generate new RDF triples. It involves the following steps:
     299
     300\begin{enumerate}
     301\item identify appropriate controlled vocabularies for individual metadata fields or data categories (manual task)
     302\item extract \emph{distinct (data category, value) pairs} from the metadata records
     303\item actual \textbf{lookup} of the individual literal values in given reference data (as indicated by the data category) to retrieve candidate entities, concepts
     304\item assess the reliability of the match
     305\item generate new RDF triples with entity identifiers as object properties
     306\end{enumerate}
     307
     308\begin{figure*}[!ht]
     309\includegraphics[width=1\textwidth]{images/SMC_CMD2LOD}
     310\caption{Sketch of the process of transforming the CMD metadata records to an RDF representation}
     311\label{fig:smc_cmd2lod}
     312\end{figure*}
     313
     314\subsubsection{Identify vocabularies  – CLAVAS}
     315
     316\todoin{Identify related ontologies, vocabularies? - see DARIAH:CV}
     317LT-World \cite{Joerg2010}
     318
     319One generic way to indicate vocabularies for given metadata fields or data categories, currently being discussed in the CMD community, is to use a dedicated annotation property (tentatively \code{@clavas:vocabulary}) in the schema or data category definition. For such a mechanism to work, the consuming applications (like a metadata editor) need to be made aware of this convention and interpret it accordingly.
     320
     321The primary providers of relevant vocabularies are \xne{ISOcat} and \xne{CLAVAS} – a service for managing and providing vocabularies in SKOS format (cf. \ref{def:CLAVAS}). Closed and corresponding simple data categories are already being exported from ISOcat in SKOS format and imported into CLAVAS/OpenSKOS. Other relevant vocabularies shall be ingested into this system as well, so that we can assume OpenSKOS as a first source of vocabularies. However, not all of the existing reference data will be hosted by OpenSKOS, so in general we have to consider a number of different sources (cf. \ref{refdata}).
     322
     323Data in OpenSKOS is modelled purely in SKOS, so there is no more specific typing of the entities in the vocabularies; rather, all the entities are instances of \code{skos:Concept}:
     324
     325\begin{example3}
    264326<org1> & a   & skos:Concept \\
    265 \end{example}
    266 
    267 We may want to add some more typing and introduce classes for entities from individual vocabularies like clavas:Organization or similar.
    268 As far as CLAVAS will also maintain mappings/links to other datasets:
    269 
    270 \begin{example}
    271 <org1>   skos:exactMatch    <dbpedia/org1>, <lt-world/orgx>;
    272 \end{example}
    273 
     327\end{example3}
     328
     329\noindent
     330We may want to add some more typing and introduce classes for entities from individual vocabularies like \code{clavas:Organization} or similar. Insofar as CLAVAS will also maintain mappings/links to other datasets
     331
     332\begin{example3}
     333<org1> & skos:exactMatch  & <dbpedia/org1>, <lt-world/orgx>;
     334\end{example3}
     335
     336\noindent
    274337we could use it to expand the data with alternative identifiers, fostering the interlinking of data:
    275338
    276 \begin{example}
    277 <org1>   dcterms:identifier <org1>, <dbpedia/org1>, <lt-world/orgx>;
    278 \end{example}
    279 
    280 
    281 
    282 \paragraph{Mapping from strings to Entities}
    283 
    284 Find matching entities in selected Ontologies based on the textual values in the metadata records.
    285 
    286 
    287 Identify related ontologies:
    288 LT-World \cite{Joerg2010}
    289 
    290 task:
    291 \begin{enumerate}
    292 \item  express MDRecords in RDF
    293 \item  identify related ontologies/vocabularies (category $\rightarrow$ vocabulary)
    294 \item  use a lookup/mapping function (Vocabulary Alignement Service? CATCH-PLUS?)
    295 
    296 %\fbox{ function lookup: Category x String -> ConceptualDomain}
    297 \begin{eqnarray*}
    298 lookup(Category, Literal) \rightarrow ConceptualDomain??
    299 \end{eqnarray*}
    300 
    301 
    302 Normally this would be served by dedicated controlled vocabularies, but expect also some string-normalizing preprocessing etc.
    303 \end{enumerate}
    304 
    305 
    306 
    307 \subsection{RELcat - Ontological relations}
    308 Information in RELcat is already stored in RDF \cite{SchuurmanWindhouwer2011}.  One relation from the example relation set for CMDI :
    309 
    310 \begin{example}
    311 isocat:DC-2538 rel:sameAs dct:date
    312 \end{example}
    313 
    314 Should we generate the redundant triples based on the relations defined between data categories?  I.e.  if there is a relation and a resource has value:
    315 
    316 \begin{example}
    317 <lr1> isocat:DC-2538 2012^^xs:year
    318 \end{example}
    319 
    320 should we generate
    321 
    322 \begin{example}
    323 <lr1> dct:date 2012^^xs:year
    324 \end{example}
    325 
    326 ?
    327 
    328 What about other relations we may want to express? (Do we need them and if yes, where to put them? – still in RR?) Examples:
    329 
    330 \begin{example}
    331 cmd:MDCreator   & owl:subClassOf & dcterms:Agent \\
    332 clavas:Organization & owl:subClassOf & dcterms:Agent \\
    333 <org1> & a & clavas:Organization \\
    334 \end{example}
    335 
    336 
    337 
     339\begin{example3}
     340<org1>  & dcterms:identifier  & <org1>, <dbpedia/org1>, <lt-world/orgx>;
     341\end{example3}
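
The expansion of identifiers described above can be illustrated with a minimal sketch. It assumes that triples are held as plain Python tuples of prefixed names; the function name \code{expand\_identifiers} and the tuple representation are illustrative assumptions and not part of CLAVAS or OpenSKOS.

```python
from collections import defaultdict

SKOS_EXACT_MATCH = "skos:exactMatch"
DCTERMS_IDENTIFIER = "dcterms:identifier"


def expand_identifiers(triples):
    """Derive dcterms:identifier triples from skos:exactMatch links.

    `triples` is a list of (subject, predicate, object) tuples.
    """
    aliases = defaultdict(list)
    for subject, predicate, obj in triples:
        if predicate == SKOS_EXACT_MATCH:
            aliases[subject].append(obj)

    derived = []
    for subject, matches in aliases.items():
        # The entity's own identifier comes first, then its known equivalents.
        for identifier in [subject] + matches:
            derived.append((subject, DCTERMS_IDENTIFIER, identifier))
    return derived


# Mappings as in the example above (hypothetical entity identifiers).
mappings = [
    ("<org1>", "skos:exactMatch", "<dbpedia/org1>"),
    ("<org1>", "skos:exactMatch", "<lt-world/orgx>"),
]

for triple in expand_identifiers(mappings):
    print(" ".join(triple))
```

In a real setup the same derivation would more likely be expressed as a SPARQL \code{CONSTRUCT} query against the triple store; the sketch only makes the intended result explicit.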
     342
     343\subsubsection{Lookup}
     344
     345In abstract terms, the lookup function takes as input the identifier of a data category (or CMD element) and a literal string value and returns a list of potentially matching entities. Before the actual lookup, some string-normalizing preprocessing may be necessary.
     346
     347\begin{definition}{signature of the lookup function}
     348lookup \ ( \ DataCategory \ ,  \ Literal \ )  \quad \mapsto \quad ( \ Concept \ | \ Entity \ )*
     349\end{definition}
     350
     351In the implementation, there needs to be additional initial configuration input, identifying datasets for given data categories,
     352which will be the result of the previous step.
     353
     354\begin{definition}{Required configuration data mapping each data category to the available datasets}
     355DataCategory \quad \mapsto \quad Dataset+
     356\end{definition}
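
The two signatures above can be combined into a small sketch. It is only an illustration under stated assumptions: the configuration table, the in-memory reference data, the entity identifiers and the function names \code{normalize} and \code{lookup} are all hypothetical, and a real implementation would query a vocabulary service instead of a dictionary.

```python
import unicodedata

# Hypothetical configuration: DataCategory -> Dataset+
DATASETS_FOR_CATEGORY = {
    "organization": ["organizations"],
}

# Hypothetical reference data: dataset -> {normalized label: [entity ids]}
REFERENCE_DATA = {
    "organizations": {
        "academy of sciences": ["<org1>", "<org2>"],
    },
}


def normalize(literal):
    """String-normalizing preprocessing: Unicode form, case, whitespace."""
    text = unicodedata.normalize("NFKC", literal)
    return " ".join(text.lower().split())


def lookup(data_category, literal):
    """lookup(DataCategory, Literal) -> (Concept | Entity)*"""
    key = normalize(literal)
    candidates = []
    # Search every dataset configured for this data category.
    for dataset in DATASETS_FOR_CATEGORY.get(data_category, []):
        candidates.extend(REFERENCE_DATA.get(dataset, {}).get(key, []))
    return candidates


print(lookup("organization", "  Academy of   Sciences "))
```

The function deliberately returns all candidates rather than a single best match, since the reliability of each match still has to be assessed in a separate step.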
     357
     358As for the implementation, in the initial setup the system could resort to the \code{find}-interface provided by \xne{OpenSKOS}.
     359However, in the long term a more general solution is required, a kind of hybrid \emph{vocabulary proxy service} that allows searching in a number of datasets, many of them distributed and available via different interfaces. Figure \ref{fig:vocabulary_proxy} sketches the general setup. The service has to be able to a) proxy search requests to a number of search interfaces (SRU, SPARQL), and b) fetch, cache and search in datasets.
     360
     361\begin{figure*}[!ht]
     362\includegraphics[width=1\textwidth]{images/VocabularyProxy_clientapp}
     363\caption{Sketch of a general setup for vocabulary lookup via a \xne{VocabularyProxy} service}
     364\label{fig:vocabulary_proxy}
     365\end{figure*}
     366
     367\subsubsection{Candidate evaluation}
     368The lookup is the most sensitive step in the process, as that is the gate between strings and semantic entities. In general, the resulting candidates cannot be seen as reliable and should undergo further scrutiny to ensure that the match is semantically correct.
     369
     370One example: A lookup with the pair \code{<organization, "Academy of sciences">} would probably return a list of organizations, as there is a national Academy of Sciences in a number of countries. Further heuristics would be required, e.g. checking the corresponding department, contact or -- less reliably -- the language of the described resource, to determine which specific Academy of Sciences is meant in a given resource description.
     371
     372In some situations these ambiguities can be resolved algorithmically, but in many cases human curation of the generated data will ultimately be required. In this respect, it is worth noting that the CLARIN search engine VLO provides a feedback link that allows even ordinary users to report problems or inconsistencies in CMD records.
     373
     374
     375%%%%%%%%%%%%%%%%%%%%%
    338376\section{SMC LOD - Semantic Web Application}
     377\label{sec:lod}
    339378
    340379\todoin{read: Europeana RDF Store Report}
    341380
     381Technical aspects (RDF-store?): Virtuoso
     382
    342383\todocode{install Jena +  fuseki}\furl{http://jena.apache.org}\furl{http://jena.apache.org/documentation/serving_data/index.html}\furl{http://csarven.ca/how-to-create-a-linked-data-site}
    343384
     
    345386
    346387\todocode{check install siren}\furl{http://siren.sindice.com/}
     388
     389
     390\todocode{check/install: raptor for generating dot out of rdf}\furl{http://librdf.org/raptor/}
     391
     392\todocode{check/install: Linked Data browser: LoD p. 81; Haystack}\furl{http://en.wikipedia.org/wiki/Haystack_(PIM)}
     393
     394 / interface (ontology browser?)
    347395
    348396semantic search component in the Linked Media Framework
     
    353401
    354402\section {Full semantic search - concept-based + ontology-driven ?}
     403\label{semantic-search}
    355404
    356405With the new enhanced dataset, as detailed in section \ref{ch:cmd2rdf}, the groundwork is laid for the full-blown semantic search as proposed in the original goals, i.e. the possibility for ontology-driven or at least `semantic resources assisted' exploration of the dataset.
    357406
    358407Namely to enhance it by employing ontological resources.
    359 Mainly this enhancement shall mean, that the user can access the data indirectly by browsing one or multiple  ontologies, with which the data will then be linked. These could be for example ontologies of Organizations and Projects.
    360 
     408Mainly, this enhancement shall mean that the user can access the data indirectly by browsing one or multiple ontologies, with which the data will then be linked. These could be, for example, ontologies of Organizations and Projects.
     409
     410
     411SPARQL
     412
     413rechercheisidore, dbpedia, ...
    361414
    362415\section{Summary}
    363 
    364 
    365 
     416In this chapter, an expression of the whole of the CMD data domain in RDF was proposed, with special focus on how to translate the string values in metadata fields to corresponding semantic entities. Additionally, some technical considerations were discussed regarding exposing this dataset as Linked Open Data and the implications for real semantic ontology-based data exploration.
     417
  • SMC4LRT/chapters/Design_SMCschema.tex

    r3553 r3638  
    11
    2 \chapter{Concept-based mapping on schema level -- system design}
     2\chapter{System design -- concept-based mapping on schema level}
    33\label{ch:design}
    44
    5 In this chapter, we define the part of the proposed system pertaining to the schema level: the concept-based crosswalk and search functionality -- the tasks that the Semantic Mapping Component was originally conceived for within the larger CMD Infrastructure (cf. \ref{def:CMDI}) -- and, additionally,  the aspect of visualization of schema-level (model) data.
    6 
    7 We start by drawing a global view on the system, introducing its individual components and the dependencies among them.
    8 In the next section, the internal data model is presented and explained. In section \ref{sec:cx} the design of the actual main service for resolving crosswalks is described, divided into the interface specification and actual implementation. In section \ref{def:concept_search} we elaborate on a search functionality that builds upon the aforementioned service in terms of appropriate query language, a search engine to integrate the search in and the peculiarities of the user interface that could support this enhanced search possibilities. Finally, in section \ref{smc-browser} an advanced interactive user interface for exploring the CMD data domain is proposed.
     5In this chapter, we define the main function of the proposed system -- the \textbf{concept-based crosswalk and search functionality} -- the tasks that the Semantic Mapping Component was originally conceived for within the larger CMD Infrastructure (cf. \ref{def:CMDI}). Additionally we explore the related aspect of analytic visualization of the processed data.
     6
     7We start by drawing an overall view of the system, introducing its individual components and the dependencies among them.
     8In the next section, the internal data model is presented and explained. In section \ref{def:cx} the design of the actual main service for serving crosswalks is described, divided into the interface specification and notes on the actual implementation. In section \ref{def:qx} we elaborate on a search functionality that builds upon the aforementioned service in terms of an appropriate query language, a search engine to integrate the search in and the peculiarities of the user interface that could support these enhanced search possibilities. Finally, in section \ref{smc-browser} an advanced interactive user interface for exploring the CMD data domain is proposed.
    99
    1010\section{System Architecture}
    1111
    12 The Semantic Mapping module is based on the DCR and CMD framework (cf. section \ref{def:DCR})
    13 and is being developed as a separate service on the side of CLARIN  Metadata Service, its primary consuming service, but shall be equally usable by other applications.
    14 
     12The SMC module is part of the CMD Infrastructure. It is a consumer of data from the production-side registries and serves search services on the exploitation side of the infrastructure, as well as third party applications accessing the joint CLARIN metadata domain.
    1513
    1614\begin{figure*}[!ht]
     
    2018\end{figure*}
    2119
     20The SMC module can be broken down into the following components:
    2221
    2322\begin{description}
    24 \item[crosswalk service] the main service translating between indexes, detailed in \ref{sec:cx}
    25 \item[concept-based query expansion]
     23\item[crosswalk service] the basic service translating between fields (or indexes), detailed in \ref{def:cx}
     24\item[concept-based query expansion] a module for query expansion based on the crosswalks
    2625\item[smc-xsl] set of xslt-stylesheets (governed by a build-file) for pre- and post-processing the data
    2726\item[SMC Browser] a web application to explore the CMD data domain consisting of the two modules: \xne{smc-stats} and \xne{smc-graph}
     
    3029\end{description}
    3130
     31The component diagram in \ref{fig:smc_modules} depicts the dependencies between the components of the system. The \xne{crosswalk service} uses the set of XSL-stylesheets \xne{smc-xsl} and accesses the CMDI registries: \xne{Component Registry}, \xne{ISOcat DCR} and \xne{RELcat} to retrieve the data. It exposes an interface \xne{cx} to be used by third party applications. The \xne{query expansion} module uses the crosswalk service to rewrite queries, also exposing a corresponding API \xne{qx}.
     32
     33\xne{SMC Browser} consists of two parts, \xne{smc-stats} and \xne{smc-graph}, and also uses the set of stylesheets for processing the data. \xne{smc-graph} is built on top of a library for interactive visualization of graphs.
     34
    3235For broader context see the reference architecture diagram in Figure \ref{fig:ref_arch}.
    3336
    34 \section{Data model - Terms}
     37\section{Data model}
     38
     39Before we get to the definition of the actual service, we define the internal data model, divided into two parts:
     40
     41\begin{description}
     42\item[smcIndex] a data type for denoting indexes in a human-readable way used internally and as input and output format of the service
     43\item[Terms.xsd] the schema for internal representation of the processed data
     44\end{description}
     45
     46\subsection{smcIndex}\label{def:smcIndex}
     47In this section, we describe \code{smcIndex} -- the data type to denote indexes used by the components of the system internally, as well as input and output on the interfaces.
     48
     49An \code{smcIndex} is a human-readable string adhering to a specific syntax, denoting some search index. The syntax is based on two main ideas drawn from existing work: a) denoting a context by a prefix, derived from the way indices are referenced in CQL-syntax\footnote{Context Query Language, \url{http://www.loc.gov/standards/sru/specs/cql.html}} (analogous to the XML-namespace mechanism), e.g. \concept{dc.title}, and b) the dot-notation used in IMDI-browser\footnote{\url{http://www.lat-mpi.eu/tools/imdi}} to denote paths into structured data (analogous to XPath), e.g. \concept{Session.Location.Country}. The grammar generates only single terms, which may not contain whitespace.
     50
     51\begin{defcap}
     52\caption{Grammar of \code{smcIndex}}
     53\begin{align*}
     54smcIndex &::= dcrIndex \ | \ cmdIndex  \\
     55dcrIndex &::= dcrID \ contextSep \ datcatLabel \\
     56            & \quad \quad   | \  [\ dcrID \ contextSep \ ] \ datcatID \\
     57cmdIndex &::= profile  \\
     58                    &    \quad \quad  | \  cmdEntityId \\
     59                      &   \quad \quad | \  [\ profile \ contextSep \ ] \ dotPath \\
     60profile &::= profileName \ [ \ \texttt{\#} \ profileID \ ] \\
     61dotPath  &::= [\ dotPath \ pathSep \ ] \ elemName \\
     62cmdEntityId &::= componentId \ [ \ \texttt{\#} \ elemName \ ] \\
     63contextSep &::= \texttt{`.`} \ | \  \texttt{`:`} \\
     64pathSep &::= \texttt{`.`} \\
     65dcrID &::= \texttt{`isocat`} \ | \ \texttt{`dc`}
     66\end{align*}
     67\end{defcap}
     68
     69The grammar distinguishes two main types of \code{smcIndex}: a) \code{dcrIndex} referring to data categories and b) \code{cmdIndex} denoting a specific ``CMD entity'', i.e. an element (metadata field), component or whole profile defined within CMD (cf. \ref{def:CMD} for description of the CMD data model).
     70These two types of \code{smcIndex} follow different construction patterns.
     71\code{cmdIndex} has a recursive path-like structure and can be interpreted as an XPath expression into the instances of CMD profiles. In contrast, \code{dcrIndex} consists of just a one-level term and is generally not directly applicable to existing data. It can be understood as an abstract index referring to well-defined concepts -- the data categories -- and for actual search it needs to be resolved to the set of CMD elements that refer to it. In return, one can expect to match more metadata fields from multiple profiles, all referring to the same data category.
     72
     73It is important to note that in general -- by design -- \code{smcIndex} can be ambiguous, meaning it can refer to multiple concepts or CMD entities. This is due to the fact that the labels of the data categories and CMD entities are not guaranteed to be unique.
     74Although it may seem problematic and undesirable to have an ambiguous reference, this is an intentional design decision. The labels are needed for human-readability and ambiguity can be useful, as long as one is aware of it.
     75However, there also needs to be the possibility to refer to data categories or CMD entities unambiguously. Therefore, the syntax also allows referencing indexes by the corresponding identifier. The individual constituents of the grammar are explained in the following:
     76
     77\code{dcrID} is a shortcut referring to a data category registry. Next to \xne{ISOcat}, other registries can function as a DCR, in particular the \xne{dublincore} set of metadata terms. \code{datcatLabel} is the human-readable name of a given data category (e.g. \concept{telephoneNumber}). In the case of \xne{ISOcat} data categories the verbose descriptor \code{mnemonicIdentifier} is used. However, despite its name, it is not guaranteed to be unique. Therefore, \code{datcatID} has to be used if a data category is to be referenced unambiguously. For \xne{dublincore} terms no such distinction between identifier and label exists; the concepts are denoted by the lexical term itself, which is unique within the \concept{dublincore} namespace.
     78
     79\code{profile} is a reference to a CMD profile. Again, dealing with the ambiguity, it can be either the name of the profile \code{profileName} or its identifier \code{profileID} as issued by the Component Registry (e.g. \code{clarin.eu:cr1:p\_1272022528363} for \concept{LexicalResourceProfile}). Even if a profile is referenced by its identifier, it may and should be prefixed by its name to still ensure human-readability. Or, seen the other way round, the name is disambiguated by suffixing it with the identifier:
     80
     81\begin{example1}
     82\concept{LexicalResourceProfile\#clarin.eu:cr1:p\_1272022528363} \\
     83\concept{LexicalResourceProfile\#clarin.eu:cr1:p\_1290431694579}
     84\end{example1}
     85
     86\noindent
     87\code{dotPath} allows addressing a leaf element (\concept{Session.Actor.Role}), or any intermediary XML element corresponding to a CMD component (\concept{Session.Actor}) within a metadata description. This makes it easy to express a search in whole components, instead of having to list all individual fields. The paths do not need to start from the root entity (the profile); they can reference any subtree structure. However, longer paths are often needed for more specific references, e.g. instead of \concept{Name} one could say \concept{Actor.Name} vs. \concept{Project.Name} or even \concept{Session.Actor.Name} vs. \concept{Drama.Actor.Name}. Still, this mechanism does not guarantee unique references; it only allows narrowing down the ambiguity.
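
To make the distinction between the two index types more tangible, the following sketch classifies a given string as \code{dcrIndex} or \code{cmdIndex}. It is not a full parser of the grammar: it only checks for a known \code{dcrID} prefix or a bare \code{datcatID}, and the identifier pattern \code{DC-<number>} is an assumption based on the ISOcat identifiers used in the examples. The function name is hypothetical.

```python
import re

DCR_IDS = ("isocat", "dc")            # dcrID alternatives from the grammar
CONTEXT_SEPARATORS = (".", ":")       # contextSep alternatives
DATCAT_ID = re.compile(r"DC-\d+")     # assumption: ISOcat-style identifiers


def classify_smc_index(index):
    """Return (kind, dcr_id, term) for a single smcIndex string."""
    if not index or any(ch.isspace() for ch in index):
        raise ValueError("an smcIndex must be a single term without whitespace")

    # A bare data category identifier, i.e. datcatID without dcrID prefix.
    if DATCAT_ID.fullmatch(index):
        return ("dcrIndex", None, index)

    # dcrID contextSep (datcatLabel | datcatID)
    for dcr_id in DCR_IDS:
        for sep in CONTEXT_SEPARATORS:
            prefix = dcr_id + sep
            if index.startswith(prefix) and len(index) > len(prefix):
                return ("dcrIndex", dcr_id, index[len(prefix):])

    # Everything else is treated as a reference to a CMD entity.
    return ("cmdIndex", None, index)


for sample in ("dc.title", "isocat:DC-2545", "Session.Actor.Role"):
    print(sample, "->", classify_smc_index(sample))
```

Note that in this sketch the prefix check takes precedence, so a CMD path that happened to start with \code{dc.} or \code{isocat.} would be classified as \code{dcrIndex}. This mirrors the ambiguity that the grammar accepts by design.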
     88
     89\subsection{Terms}
    3590\label{datamodel-terms}
    3691
    37 \todocode{Terms.xsd}
    38 
    39 \begin{note}
    40 Describe the CMD-format?
    41 \end{note}
    42 
     92In abstract terms, the internal format is basically a table of indexes with information collected from the upstream registries or created during preprocessing. The main entity is \code{Term}, which represents either a label of a data category or a CMD entity (a CMD component or element). The further entities \code{Termset} and \code{Concept} are mainly used for logical grouping of the \code{Terms}. In the following, we explain the data model of these entities and their use in more detail. For the full \xne{Terms.xsd} XML schema see listing \ref{list:terms-schema}.
     93
     94\subsubsection{Type \code{Term}}
     95
     96\code{Term} is a polymorphic data type that can have different sets of attributes depending on the type of data it represents.
     97
     98\begin{table}[ht]
     99\caption{Attributes of \code{Term} when encoding data category}
     100\label{table:terms-attributes-datcat}
     101 \begin{tabular}{ l | l | l }
     102  attribute & allowed values & sample value\\
     103\hline
     104  \var{concept-id} &  PID given by DCR  & \code{isocat:DC-2522} \\
     105  \var{set} & identifier of the DCR \emph{dcrID}  & \code{isocat} \\
     106  \var{type} &  one of ['id', 'label', 'mnemonic'] & \code{id}, \code{label}\\
     107 \var{xml:lang} & two-letter language code (only for ISOcat) & \code{en}, \code{si} \\
     108 \end{tabular}
     109\end{table}
     110
     111%\captionsetup{justification=raggedright, singlelinecheck=false}
     112\lstset{language=XML}
     113\begin{lstlisting}[label=list:terms-attributes-datcat, caption=sample \code{Term} element encoding an ISOcat data category]
     114<Term concept-id="http://www.isocat.org/datcat/DC-2544" set="isocat"
     115        type="label" xml:lang="fr">nom de ressource</Term>
     116\end{lstlisting}
     117
     118\begin{table}[ht]
     119\caption{Attributes of \code{Term} when encoding CMD entity}
     120\label{table:terms-attributes-cmd}
     121 \begin{tabularx}{1\textwidth}{ l | X | X }
     122  attribute & allowed values & sample value\\
     123\hline
     124  \var{id} &  \var{cmdEntityId} as defined in \ref{def:smcIndex} & \code{clarin.eu:cr1:c\_1290431694487\#Url} \\
     125  \var{type} &  one of ['CMD\_Element', 'CMD\_Component'] & \code{CMD\_Element}\\
     126  \var{name} & name of the component or element & \code{Url} \\
     127  \var{path} &  \var{dotPath} (cf. \ref{def:smcIndex}) & \code{SpeechCorpus.Access.Contact.Url} \\
     128  \var{parent} & name of the parent component &  \code{Contact} \\
     129 \end{tabularx}
     130\end{table}
     131
     132\lstset{language=XML}
     133\begin{lstlisting}[label=list:terms-attributes-element, caption=sample \code{Term} element encoding a CMD element]
     134<Term type="CMD_Element" name="Url" datcat="http://www.isocat.org/datcat/DC-2546"
     135          id="clarin.eu:cr1:c_1290431694487#Url" parent="Contact"
     136          path="SpeechCorpus.Access.Contact.Url"/>
     137\end{lstlisting}
     138
     139\begin{table}[ht]
     140\caption{Attributes of \code{Term} when encoding a term in the inverted index?}
     141\label{table:terms-attributes-index}
     142 \begin{tabularx}{1\textwidth}{ l | X | X }
     143  attribute & allowed values & sample value\\
     144\hline
     145  \var{id} &  \var{cmdEntityId} cf. \ref{def:smcIndex} & \code{clarin.eu:cr1:c\_1359626292113 \#ResourceTitle} \\
     146  \var{type} &  one of \code{['id', 'mnemonic', 'label', 'full-path']} & \code{full-path}\\
     147  \var{schema}  & \var{profileID} & \code{clarin.eu:cr1:p\_1357720977520} \\ 
     148  \var{concept-id} & id of the corresponding (data category) &  \var{isocat:}\code{DC-2545} \\
     149  \var{node-value} &  \var{dotPath} & \code{SpeechCorpus.Access.Contact.Url} \\
     150 \end{tabularx}
     151\end{table}
     152
     153\lstset{language=XML}
     154\begin{lstlisting}[label=list:terms-attributes-index, caption=sample \code{Term} element encoding a term in the inverted index]
     155   <Term set="cmd" type="full-path" schema="clarin.eu:cr1:p_1357720977520"
     156                id="clarin.eu:cr1:c_1359626292113#ResourceTitle"
     157                concept-id="http://www.isocat.org/datcat/DC-2545" >
     158        AnnotatedCorpusProfile.GeneralInfo.ResourceTitle
     159   </Term>
     160\end{lstlisting}
     161
     162
     163\subsubsection{Type \code{Concept}}
     164\code{Concept} represents a data category. Identifier is the PID issued by the DCR.
     165It groups all terms belonging to given data category.
     166The content model is a sequence of \code{Terms} followed by a sequence of \code{info} elements.
     167Initially, after loading from DCR, a \code{Concept} contains only \code{Term}s of type: \code{id, mnemonic, label} encoding the corresponding attributes of the data category, followed by \code{info} elements holding the definition potentially in different languages:
     168
     169\lstset{language=XML}
     170\begin{lstlisting}[label=list:concept, caption=sample \code{Concept} element representing the data category \concept{resourceTitle}]
     171<Concept xmlns:dcif="http://www.isocat.org/ns/dcif" type="datcat"
     172               id="http://www.isocat.org/datcat/DC-2545">
     173         <Term set="isocat" type="mnemonic">resourceTitle</Term>
     174         <Term set="isocat" type="id">DC-2545</Term>
     175         <Term set="isocat" type="label" xml:lang="en">resource title</Term>
     176         <Term set="isocat" type="label" xml:lang="fi">resurssin otsikko</Term>
     177        ...
     178         <info xml:lang="en">The title is the complete title
     179                        of the resource without any abbreviations.</info>
     180        ...
     181</Concept>
     182\end{lstlisting}
     183
     184In the inverted index the \code{Concept} is enriched with the \code{Terms} representing corresponding CMD entities (cf. Listing \ref{list:concept-cmd-term}).
     185
     186\lstset{language=XML}
     187\begin{lstlisting}[label=list:concept-cmd-term, caption=\code{Term} for CMD element added to \code{Concept}]
     188 <Term set="cmd" type="full-path" schema="clarin.eu:cr1:p_1345561703620"
     189            id="clarin.eu:cr1:c_1345561703619#Name">collection.CollectionInfo.Name</Term>
     190\end{lstlisting}
     191
     192\lstset{language=XML}
     193\begin{lstlisting}[label=lst:dcr-cmd-map, caption=Sample of the inverted index \code{Concept} $\mapsto$ \code{Term}]
     194    <Concept id="http://www.isocat.org/datcat/DC-2545" type="datcat">
     195        <Term set="isocat" type="mnemonic">resourceTitle</Term>
     196        <Term set="isocat" type="id">DC-2545</Term>
     197        <Term set="isocat" type="label" xml:lang="en">resource title</Term>
     198        <Term set="isocat" type="label" xml:lang="hr">naslov resursa</Term>
     199        <Term set="isocat" type="label" xml:lang="lv">resursa nosaukums</Term>
     200        ...
     201        <Term set="cmd" type="full-path" schema="clarin.eu:cr1:p_1357720977520"
     202                id="clarin.eu:cr1:c_1359626292113#ResourceTitle">
     203                        AnnotatedCorpusProfile.GeneralInfo.ResourceTitle</Term>
     204        <Term set="cmd" type="full-path" schema="clarin.eu:cr1:p_1297242111880"
     205                id="clarin.eu:cr1:c_1271859438123#Title">
     206                        AnnotationTool.GeneralInfo.Title</Term>
     207        <Term set="cmd" type="full-path" schema="clarin.eu:cr1:p_1274880881885"
     208                id="clarin.eu:cr1:c_1274880881884#Title">
     209                        imdi-corpus.Corpus.Title</Term>
     210        <Term set="cmd" type="full-path" schema="clarin.eu:cr1:p_1271859438204"
     211                id="clarin.eu:cr1:c_1271859438201#Title">
     212                        Session.Title</Term>
     213        ...
     214    </Concept>
     215\end{lstlisting}
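
The following sketch illustrates how such an inverted index could be used to resolve a data category to the CMD paths that refer to it. It is a minimal example under stated assumptions: the wrapping \code{index} element and the function name \code{resolve} are hypothetical, the sample data is an abridged copy of the listing above, and only the Python standard library is used.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the sample above; the <index> wrapper is an assumption.
INVERTED_INDEX = """
<index>
  <Concept id="http://www.isocat.org/datcat/DC-2545" type="datcat">
    <Term set="isocat" type="mnemonic">resourceTitle</Term>
    <Term set="isocat" type="id">DC-2545</Term>
    <Term set="cmd" type="full-path" schema="clarin.eu:cr1:p_1357720977520"
          id="clarin.eu:cr1:c_1359626292113#ResourceTitle">
          AnnotatedCorpusProfile.GeneralInfo.ResourceTitle</Term>
    <Term set="cmd" type="full-path" schema="clarin.eu:cr1:p_1271859438204"
          id="clarin.eu:cr1:c_1271859438201#Title">
          Session.Title</Term>
  </Concept>
</index>
"""


def resolve(root, dcr_term):
    """Return the CMD full paths of all concepts labelled with dcr_term."""
    paths = []
    for concept in root.iter("Concept"):
        terms = concept.findall("Term")
        # Labels, mnemonics and ids contributed by the data category registry.
        names = {(t.text or "").strip() for t in terms if t.get("set") != "cmd"}
        if dcr_term in names:
            paths.extend(
                (t.text or "").strip()
                for t in terms
                if t.get("set") == "cmd" and t.get("type") == "full-path"
            )
    return paths


root = ET.fromstring(INVERTED_INDEX)
print(resolve(root, "resourceTitle"))
# ['AnnotatedCorpusProfile.GeneralInfo.ResourceTitle', 'Session.Title']
```

Because the lookup compares against all registry-side terms of a concept, the same paths are returned for the mnemonic \concept{resourceTitle} and for the identifier \concept{DC-2545}.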
     216
     217
     218\subsubsection{Type \code{Termsets/Termset}}
     219\code{Termset} groups a set of terms as outlined in \ref{table:cx-list-params}. It is identified by the \code{@set} attribute.
     220For example, all French labels of ISOcat data categories under the identifier \code{isocat-fr} form a termset, as do all the full-paths of one profile.
     221
     222Finally, \code{Termsets} is a root element grouping \code{Termset} elements.
     223
     224\lstset{language=XML}
     225\begin{lstlisting}[label=list:termset, caption=\code{Termset} element representing a CMD profile]
     226<Termset name="AnnotatedCorpusProfile" id="clarin.eu:cr1:p_1357720977520"
     227            type="CMD_Profile">
     228      <info>
     229         <id>clarin.eu:cr1:p_1357720977520</id>
     230         <description>A CMDI profile for annotated text corpus resources.</description>
     231         <name>AnnotatedCorpusProfile</name>
     232         <registrationDate>2013-01-31T11:57:12+00:00</registrationDate>
     233         <creatorName>nalida</creatorName>
     234          ...
     235     </info>
     236     <Term type="CMD_Component" name="GeneralInfo" datcat=""
     237            id="clarin.eu:cr1:c_1359626292113"     
     238            parent="AnnotatedCorpusProfile"
     239            path="AnnotatedCorpusProfile.GeneralInfo">
     240            <Term ...
     241     </Term>
     242     ...
     243</Termset>
     244\end{lstlisting}
     245
     246The content of the \code{Termset} can optionally begin with an \code{info} element (conveying information as provided by the source registry, like definition, creation date or author) followed by a flat or nested list of \code{Term} elements.
     247
     248
     249%%%%%%%%%%%%%%%%%%%%%%
    43250\section{cx -- crosswalk service}
    44 \label{def:cx}
     251\label{sec:cx}
    45252
    46253The crosswalk service offers the functionality that was understood under the term \textit{Semantic Mapping} as conceived in the original plans of the Component Metadata Infrastructure. Semantic interoperability has been one of the main concerns addressed by the CMDI and appropriate provisions were woven into the underlying meta-model as well as all the modules of the infrastructure.
    47 The task of the crosswalk service is to collect the relevant information maintained in the registries of the infrastructure and process it to generate mappings, i.e. \emph{crosswalks} between fields in heterogeneous metadata schemas, building the base for concept-based search in the heterogeneous data collection of the joint CLARIN metadata domain. (cf. \ref{def:qx}).
    48 
    49 The core means for semantic interoperability in CMDI are the \emph{data categories} (cf. \ref{def:DCR}), well-defined atomic concepts, that are supposed to be referenced in schemata annotating fields to unambiguously indicate their intended semantics. Drawing upon this system, the crosswalks are not generated directly between the fields of individual schemata by some matching algorithm, but rather the data categories are used as bridges for translation. This results in clusters of semantically equivalent metadata fields (with data categories serving as pivotal points), rather than in a collection of pair-wise equivalencies between the fields.
    50 
    51 \subsection{smcIndex}\label{indexes}
    52 In this section we describe \emph{smcIndex} -- the data type for input and output of the proposed application.
    53 An smcIndex is a human-readable string adhering to a specific syntax, denoting some search index.
    54 The generic syntax is:
    55 \begin{eqnarray*}
    56 smcIndex ::= context \ contextSep \ conceptLabel
    57 \end{eqnarray*}
    58 
    59 We distinguish two types of smcIndexes: (i) \emph{dcrIndex} referring to data categories and (ii) \emph{cmdIndex} denoting a specific
    60 ``CMD-entity'', i.e. a metadata field, component or whole profile defined within CMD. The \textit{cmdIndex} can be interpreted as a XPath into the instances of CMD-profiles. In contrast to it, the \textit{dcrIndexes} are generally not directly applicable on existing data, but can be understood as abstract indexes referring to well-defined concepts -- the data categories -- and for actual search they need to be resolved to the metadata fields they are referred by. In return one can expect to match more metadata fields from multiple profiles, all referring to the same data category.
    61 
    62 These two types of smcIndex also follow different construction patterns:
    63 \begin{eqnarray*}
    64 smcIndex & ::= & dcrIndex \ | \ cmdIndex  \\
    65 dcrIndex & ::= & dcrID \ contextSep \ datcatLabel \\
    66 cmdIndex & ::= & profile \  \\
    67                       &  &  | \  [\ profile \ contextSep \ ] \ dotPath \\
    68 dotPath  & ::= & [\ dotPath \ pathSep \ ] \ elemName \\
    69 contextSep & ::= & \texttt{`.`} \ | \  \texttt{`:`} \\
    70 pathSep & ::= & \texttt{`.`} \\
    71 dcrId & ::= & \texttt{`isocat`} \ | \ \texttt{`dc`}
    72 \end{eqnarray*}
    73 
    74 The grammar is based on the way indices are referenced in CQL-syntax\footnote{Context Query Language, \url{http://www.loc.gov/standards/sru/specs/cql.html}} (\texttt{dc.title}) and on the dot-notation used in IMDI-browser\footnote{\url{http://www.lat-mpi.eu/tools/imdi}} (\texttt{Session.Location.Country}).
    75 
    76 \textit{dcrID} is a shortcut referring to a data category registry
    77 %\footnote{Next to ISOcat other registries can function as a DCR, e.g., the Dublin Core set of metadata terms.}
    78 similar to the namespace-mechanism in XML-documents.  \textit{datcatLabel} is the verbose Identifier- (e.g. \texttt{telephoneNumber}) or the Name-attribute (in any available translation, e.g. \texttt{numero di telefono@it}) of the data category.
     254Consequently, the infrastructure has also foreseen this dedicated module, \emph{Semantic Mapping}, that exploits this mechanism to find \textbf{corresponding fields in different metadata schemas}.
     255
     256The task of the crosswalk service is to collect the relevant information maintained in the registries of the infrastructure and process it to generate the mappings, or \textbf{crosswalks}, between fields in heterogeneous metadata schemas. These crosswalks can be used by other applications as the base for concept-based search in the heterogeneous data collection of the joint CLARIN metadata domain (cf. \ref{def:qx}).
     257
     258The core means for semantic interoperability in CMDI are the \emph{data categories} (cf. \ref{def:DCR}), well-defined atomic concepts, that are supposed to be referenced in schemas by annotating fields to unambiguously indicate their intended semantics. Drawing upon this system, the crosswalks are not generated directly between the fields of individual schemas by some kind of matching algorithm, but rather the data categories are used as reliable bridges for translation. This results in clusters of semantically equivalent metadata fields (with data categories serving as pivotal points), instead of a collection of pair-wise links between fields.
     259
     260\subsection{Interface Specification}
     261\label{def:cx-interface}
     262
     263In this section, we define the abstract interface of the proposed service, in terms of the input parameters and output data format.
     264
     265\todoin{The two interfaces list and map
     266Full definition in appendix and under link!}
     267
     268\subsubsection*{Method \code{list}}
     269
     270Method \code{list} lists the available items for a given context or type. This allows client applications to configure the query input and provide autocompletion functionality.
     271
     272\begin{definition}{URI-pattern of the \code{list} method}
     273/smc/cx/list/\$context
     274\end{definition}
     275
     276\noindent
     277Table \ref{table:cx-list-params} lists the allowed values for the \var{\$context} parameter and the corresponding types of returned data.
     278
     279\begin{table}
     280\caption{Allowed values for parameters of the \code{list}-method and corresponding return values}
     281\label{table:cx-list-params}
     282 \begin{tabular}{ l | p{0.7\textwidth} }
     283  \var{\$context}  & returns a list of \\
     284 \hline
     285  \code{*,top} & available termsets \\
     286  \var{\{termset\}} & terms (CMD components and elements) of given termset \\
     287  \code{dcr} & available data category registries (isocat, dublincore) \\
     288  \code{isocat}  & ISOcat data categories referenced in CMD data \\
     289  \code{languages} & available languages (only for isocat data categories) \\
     290  \code{cmd-profiles} & all available CMD profiles \\
     291  \code{cmd-full-paths} & all complete (starting from Profile) \emph{dotPaths} to CMD components and elements\\
     292  \code{cmd-minimal-paths} & reduced but still unique paths to CMD components and elements \\
     293  \code{relsets} & available relation sets (defined in the Relation Registry)
     294 \end{tabular}
     295\end{table}
     296
     297 Also, the application should deliver additional information about the indexes, such as a description and a link to the definition of the underlying entity in the source registry.
     298%NO (this will be handled by the servic as multililngual labels e) : or the Name-attribute (in any available translation, e.g. \texttt{numero di telefono@it}) of the data category.}
    79299% While it is desirable to also allow the Name-attribute of the data category (\texttt{telephone number}), especially also the Names defined in other working languages (\texttt{numero di telefono@it, numer telefonu@pl}), special care has to be taken here as these attributes mostly contain white spaces, which could cause problems in downstream components, when parsing a complex query containing such indices.
    80 \textit{profile} is the name of the profile. % (despite the danger of ambiguity).
    81 \textit{dotPath} allows to address a leaf element (\texttt{Session.Actor.Role}), or any intermediary XML-element corresponding to a CMD-component (\texttt{Session.Actor})   within a metadata description. %This allows to easily express search in whole components, instead of having to list all individual fields.
    82 
    83 Generally, smcIndexes can be ambiguous, meaning they can refer to multiple concepts, or entities (CMD-elements). This is due to the fact that the names of the data categories, and CMD-entities are not guaranteed unique. The module will have to cope with this, by providing on demand the list of identifiers corresponding to a given smcIndex.
    84 
    85 %As an important sidenote -- cmdIndexes can be ambiguous, meaning they can refer to multiple entities (metadata fields), examples of valid indexes:
    86 %\begin{verbatim}
    87 %Name
    88 %Actor.Name, Project.Name
    89 %Session.Actor.Name, Drama.Actor.Name
    90 %\end{verbatim}
    91 
    92 %So we disambiguate (or narrow down the ambiguity) by prefixing context.
    93 
    94 \subsection{Interface Specification}
    95 
    96 In this section, we describe the actual task of the proposed service -- \textbf{mapping indexes to indexes} -- in abstract terms. The returned mappings can be used by other applications to expand or translate the original user query, to match elements in other schemas.
    97 % \footnote{This primary usage of SMC for work with user-created query strings explains the need for human-readability of the indices.}
    98 
    99 In the operation mode, the application accepts any index (\textit{smcIndex}, cf. \ref{indexes}) and returns a list of corresponding indexes (or only the input index, if no correspondences were found):
    100 \newline
    101 
    102 \textit{smcIndex $\mapsto$ smcIndex[ ]}
    103 \newline
    104 
    105 We can distinguish following levels for this mapping function:
    106 
    107 (1) \emph{data category identity} -- for the resolution only the basic data category map derived from Component Registry is employed. Accordingly, only indexes denoting CMD-elements (\textit{cmdIndexes)} bound to a given data category are returned:
    108 \newline
    109 
    110 \begin{example}
    111 isocat.size     & $\mapsto$ & [teiHeader.extent, TextCorpusProfile.Number]
    112 \end{example}
    113 \newline
    114 
    115 \textit{cmdIndex} as input is also possible. It is translated to a corresponding data category, proceeding as above:
    116 \newline
    117 
    118 \begin{example}
    119 imdi-corpus.Name & $\mapsto$ \\
    120 (isocat.resourceName) & $\mapsto$ TextCorpusProfile.GeneralInfo.Name
    121 \end{example}   
    122 \newline
    123 
    124 (2) \emph{relations between data categories} -- employing also information from the Relation Registry, related (equivalent) data categories are retrieved and subsequently both the input and the related data categories resolved to cmdIndexes:
    125 \newline
    126 
    127 \texttt{isocat.resourceTitle  $\mapsto$ }
    128 \verb|   (+ dc.title) |$\mapsto$  \newline
    129 \verb|   [imdi-corpus.Title, | \newline
    130 \verb|    TextCorpusProfile.GeneralInfo.Title,| \newline
    131 \verb|    teiHeader.titleStmt.title,| \newline
    132 \verb|    teiHeader.monogr.title]|
    133 \newline
    134 
    135 (3) \emph{container data categories} -- further expansions will be possible once the container data categories \cite{SchuurmanWindhouwer2011} will be used. Currently only fields (leaf nodes) in metadata descriptions are linked to data categories. However, at times, there is a need to conceptually bind also the components, meaning that besides the ``atomic'' data category for \texttt{actorName, there would be also a data category for the complex concept \texttt{Actor}.}
    136 Having concept links also on components will require a compositional approach to the task of semantic mapping, resulting in:
    137 \newline
    138 \texttt{Actor.Name $\mapsto$ }\newline
    139 \verb|    [Actor.Name, Actor.FullName, |\newline
    140 \verb|     Person.Name, Person.FullName]|
    141 
     300
     301
     302\subsubsection*{Method \code{map} }
     303
     304Method \code{map} performs the actual translations:
     305it accepts any index (adhering to the \var{smcIndex} datatype, cf. \ref{def:smcIndex}) and returns a list of corresponding indexes.
     306%it returns list of equivalent terms/smcIndexes for a given term/smcIndex.
     307
     308\begin{definition}{General function definition}
     309smcIndex \mapsto smcIndex[ ]
     310\end{definition}
     311
     312\begin{definition}{URI-pattern of the \code{map} method}
     313/smc/cx/map/\{\$context\}/\{\$term\} \ [ \ ?format=\{\$format\} \ ] \ [ \ \&relset=\{\$relset\} \ ]
     314\end{definition}
     315
     316\noindent
     317Parameter definition:\\*
     318\begin{description}
     319\item[\var{\$context}] identifies the context in which to search for the \var{\$term}; primarily this would be one of \code{[*, isocat, dc, cmd]}; in extended mode any of the terms listed in table \ref{table:cx-list-params} is accepted
     320\item[\var{\$term}] \var{smcIndex} term (without the context prefix); the term is used to look up a concept and deliver the list of equivalent indexes; case-insensitive
     321\item[\var{\$format}] the desired result format can be indicated explicitly, as an alternative to the default content negotiation; one of \code{[json, rdf, xml]}; \code{xml} is the default
     322\item[\var{\$relset}] optional; reference to a relset to be applied to the identified concept to expand the cluster of equivalent indexes; allows multiple values from \code{list/relsets}; if multiple sets are given, they are all applied in the expansion
     323\end{description}
     324
     325\noindent
     326Possible return formats:
     327\begin{description}
     328\item[\var{'', default}] internal XML format with all attributes (\xne{Terms.xsd}, cf. listing \ref{lst:map-output})
     329
     330
     331\item[\var{schema}] distinct schemas (\code{Termset}) referencing given data category or string
     332\lstset{language=XML}
     333\begin{lstlisting}
     334<Termset schema="clarin.eu:cr1:p_1295178776924" name="serviceDescription"/>
     335\end{lstlisting}
     336\item[\var{datcat}] distinct data categories (\code{Term@id@da}) by \code{@concept-id}
     337\lstset{language=XML}
     338\begin{lstlisting}
     339<Term concept-id="http://www.isocat.org/datcat/DC-2512"
     340           set="isocat" type="datcat">creatorFullName</Term>
     341\end{lstlisting}
     342\item[\var{cmdid, id}] distinct cmd entities (\code{Term}) by \code{@id}
     343\begin{lstlisting}
     344<Term type="CMD_Element" name="Name" elem="Name" parent="Session"
     345       datcat="http://www.isocat.org/datcat/DC-2544"
     346       id="clarin.eu:cr1:c_1349361150645#Name"  path="DBD.Session.Name"/>
     347\end{lstlisting}
     348
     349\end{description}
     350
     351\begin{table}[ht]
     352\caption{Sample values for parameters of the \code{map}-method and corresponding return values}
     353\label{table:cx-map-params}
     354
     355 \begin{tabular}{ l  l | l}
     356  \var{\$context}  & \var{\$term} & returns \\
     357 \hline
     358  \code{*} & \code{name} & ? \\
     359  \code{isocat} & \code{resourceTitle} & CMD terms \\
     360  \code{cmd} & \code{name} & \\
     361
     362 \end{tabular}
     363\end{table}
     364
     365\noindent
     366Sample request\\*
     367\begin{example1}
     368/smc/cx/map/isocat/resourceTitle
     369\end{example1}
     370\lstset{language=XML}
     371\begin{lstlisting}[label=lst:map-output, caption=Corresponding sample output ]
     372<Terms >
     373    <Term set="cmd" type="full-path" schema="clarin.eu:cr1:p_1297242111880"
     374        id="clarin.eu:cr1:c_1271859438123#Title">
     375                AnnotationTool.GeneralInfo.Title</Term>
     376    <Term set="cmd" type="full-path" schema="clarin.eu:cr1:p_1288172614014"
     377        id="clarin.eu:cr1:c_1288172614011#resourceTitle">
     378                BamdesLexicalResource.BamdesCommonFields.resourceTitle
     379     </Term>
     380   <Term set="cmd" type="full-path" schema="clarin.eu:cr1:p_1274880881885"
     381        id="clarin.eu:cr1:c_1274880881884#Title">
     382                imdi-corpus.Corpus.Title</Term>
     383   <Term set="cmd" type="full-path" schema="clarin.eu:cr1:p_1271859438204"
     384        id="clarin.eu:cr1:c_1271859438201#Title">
     385                Session.Title</Term>
     386   <Term set="cmd" type="full-path" schema="clarin.eu:cr1:p_1272022528363"
     387        id="clarin.eu:cr1:c_1271859438123#Title">
     388                LexicalResourceProfile.LexicalResource.GeneralInfo.Title</Term>
     389    <Term set="cmd" type="full-path" schema="clarin.eu:cr1:p_1284723009187"
     390        id="clarin.eu:cr1:c_1271859438123#Title">collection.GeneralInfo.Title</Term>
     391\end{lstlisting}
     392
     393\noindent
     394We can distinguish the following levels for the mapping function:
     395
     396\noindent
     397(1) \emph{data category identity} -- for the resolution only the basic data category map derived from the Component Registry is employed. Accordingly, only indexes denoting CMD elements (\var{cmdIndex}) bound to a given data category are returned:
     398\noindent
     399\begin{example2}
     400%\begin{tabularx}{\textwidth}{| p{0.4\textwidth}  p{0.6\textwidth} }
     401isocat.size     $\mapsto$ & [teiHeader.extent, TextCorpusProfile.Number]
     402\end{example2}
     403%\end{tabularx}
     404
     405\noindent
     406\var{cmdIndex} as input is also possible. It is translated to a corresponding data category, proceeding as above:
     407
     408\begin{example2}
     409imdi-corpus.Name $\mapsto$ \\
     410(isocat.resourceName) $\mapsto$ & TextCorpusProfile.GeneralInfo.Name
     411\end{example2} 
     412
     413\noindent
     414(2) \emph{relations between data categories} -- employing also information from the Relation Registry, related (equivalent) data categories are retrieved and subsequently both the input and the related data categories resolved to a list of \var{cmdIndexes}:
     415\begin{example2}
     416isocat.resourceTitle $\mapsto$  \\
     417(+ dc.title) $\mapsto$  & [GeneralInfo.Title, Text.TextTitle, collection.CollectionInfo.Title, resourceInfo. identificationInfo. resourceName, teiHeader.titleStmt.title, teiHeader.monogr.title]
     418\end{example2}
     419
     420\noindent
     421(3) \emph{container data categories} -- further expansions will be possible once the \emph{container data categories} \cite{SchuurmanWindhouwer2011} are in use.\footnote{Although metadata modellers are encouraged to indicate data categories for both components and elements, this practice is being adopted only slowly, and currently only around 14 per cent of the components have a data category specified.} The idea is to set a concept link for the components as well, meaning that besides the ``atomic'' data category for \concept{actorName}, there would also be a data category for the complex concept \concept{Actor}.
     422Having concept links also on components will require a compositional approach for the mapping function, resulting in:
     423\begin{example2}
     424Actor.Name $\mapsto$ & [Actor.Name, Actor.FullName, \\
     425& Person.Name, Person.FullName]
     426\end{example2}
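
\noindent
The first two levels can also be stated compactly. The following is only a sketch in a notation that is not part of the service definition: $I(d)$ denotes the list of \var{cmdIndexes} that the inverted index built during initialization (cf. \ref{smc_init}) holds for a data category $d$, $\mathit{dc}(x)$ the data category a \var{cmdIndex} $x$ is bound to, and $R(d)$ the set of data categories related to $d$ according to the applied relation sets. All three symbols are assumptions introduced here for illustration.

\begin{eqnarray*}
\mathit{map}_1(d) & = & I(d) \\
\mathit{map}_1(x) & = & I(\mathit{dc}(x)) \\
\mathit{map}_2(d) & = & I(d) \ \cup \bigcup_{d' \in R(d)} I(d')
\end{eqnarray*}

\noindent
In this reading, level (2) reduces to level (1) whenever $R(d)$ is empty, i.e. when no relation set contributes a related data category.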
    142427
    143428\subsection{Implementation}
    144429
    145 At the core of the described module is a set of XSL-stylesheets, governed by a ant-build file and a configuration file holding the information about individual source registries.
     430At the core of the described module is a set of XSL-stylesheets, governed by an ant-build file and a configuration file holding the information about individual source registries.
    146431
    147432\todoin{generate and reference XSLT-documentation}
    148433
     434The service is implemented as a RESTful service; however, it supports only the GET operation, as it operates on a data set that the users cannot change directly. (Changes have to be performed in the upstream registries.)
     435
    149436
    150437\subsubsection{Initialization}
    151 
    152 First, there is an initialization phase, in which the application fetches the information from the source modules (cf. \ref{def:CMD}) and transforms it into the internal Terms format (cf. \ref{datamodel-terms}). All profiles and components from the Component Registry are read and all the URIs to data categories are extracted to construct an inverted map of data categories:
    153 \newline
    154 
    155 \textit{datcatURI $\mapsto$ profile.component.element[]}
    156 \newline
    157 
    158 The collected data categories are enriched with information from corresponding registries (DCRs), adding the verbose identifier, the description and available translations into other working languages. %, usable as base for multi-lingual search user-interface.
    159 
    160 Finally relation sets defined in the Relation Registry are fetched and matched with the data categories in the map to create sets of semantically equivalent (or otherwise related) data categories.
    161 
    162 \todocode{example of inverted index}
     438\label{smc_init}
     439During initialization the application fetches the information from the source modules (cf. \ref{def:CMDI}) and transforms it into the internal \xne{Terms} format (cf. \ref{datamodel-terms}). All profiles and components from the Component Registry are read and all the URIs to data categories are extracted to construct an inverted map of data categories:
     440
     441\begin{definition}{Principal structure of the inverted index}
     442datcatURI \mapsto profile.component.element[]
     443\end{definition}
     444
     445The collected data categories are enriched with information from corresponding registries (DCRs), adding the label, the description and available translations into other working languages. %, usable as base for multi-lingual search user-interface.
     446
     447Finally, relation sets defined in the Relation Registry are fetched and matched with the data categories in the map to create sets of semantically equivalent (or otherwise related) data categories.
     448
     449\begin{figure*}[!ht]
     450\includegraphics[width=1\textwidth]{images/smc_init.png}
     451\caption{The various stages of the data flow during the initialization}
     452\label{fig:smc_init}
     453\end{figure*}
     454
     455The following datasets are available after the initialization sequence has finished (cf. figure \ref{fig:smc_init}):
     456\begin{description}
     457\item[\xne{termsets}] a list of all available Termsets compiled from the CMD profiles and the available DCRs; for \xne{ISOcat} a termset is generated for every available language
     458\item[\xne{cmd-terms}] a flat list of \code{Term} elements representing all components and elements in all known profiles; grouped in \code{Termset} elements representing the profiles
     459\item[\xne{cmd-terms-nested}] as above, however the \code{Term} elements are nested reflecting the component structure in the profile
     460\item[\xne{dcr-terms}] a list of \code{Concept} elements representing the data categories with nested \code{Term} elements
     461\item[\xne{dcr-cmd-map}] the main inverted index -- a list of concepts as in \xne{dcr-terms}, but with additional \code{Term} elements included in the \code{Concept} elements representing the CMD components or elements corresponding to given data category (cf. listing \ref{lst:dcr-cmd-map})
     462\item[\xne{rr-terms}] an additional index generated from the relations between data categories as defined in the Relation Registry; the \code{Concept} elements representing the pair of related data categories are wrapped in a \code{Relation} element (with a \code{@type} attribute)
     463\end{description}
    163464
    164465\subsubsection{Operation}
    165 
    166 \subsubsection{Computing summaries}
     466For the actual service operation a minimal application has been implemented that accesses the cached internal datasets and optionally applies XSL-stylesheets for post-processing, depending on the requested format.
     467The application implements the interface defined in \ref{def:cx-interface} as an XQuery module based on the \xne{restxq}-library within an \xne{eXist} XML-database.
    167468
    168469\subsection{Extensions}
    169470
    170 A useful supplementary function of the module would be to provide a list of existing indexes.
    171 That would allow the search user-interface to equip the query-input with autocompletion. Also the application should deliver additional information about the indexes like description and a link to the definition of the underlying entity in the source registry.
    172 
    173 Once there will be overlapping\footnote{i.e. different relations may  be defined for one data category in different relation sets} user-defined relation sets in the Relation Registry an additional input parameter will be required to \emph{explicitly restrict the selection of relation sets} to apply in the mapping function.
    174 
    175 Also, use of \emph{other than equivalency relations will necessitate more complex logic in the query expansion and accordingly also more complex response of the SMC, either returning the relation types themselves as well or equip the list of indexes with some similarity ratio.}
    176 
    177 
     471Once there are overlapping\footnote{i.e. different relations may be defined for one data category in different relation sets} user-defined relation sets in the Relation Registry, an additional input parameter will be required to \emph{explicitly restrict the selection of relation sets} to apply in the mapping function.
     472
     473Also, the use of \emph{other than equivalency} relations will necessitate more complex logic in the query expansion and accordingly a more complex response of the crosswalk service, either returning the relation types themselves as well or equipping the list of indexes with some kind of similarity ratio.
    178474
    179475\section{qx -- concept-based search}
     
    182478In this section we want to explore, how this shall be accomplished, i.e. how to bring the enhanced capabilities to the user.
    183479
    184 The emphasis lies on the query language and the corresponding query input interface.
    185 
    186 Crucial aspect is the question how to deal with the even greater amount of information in a user-friendly way, ie how to prevent overwhelming, intimidating or frustrating the user.
    187 
    188 offering it (the information) semi-transparently to the user (or application) on the consumption side.
    189 
    190 Semi-transparently means, that primarily the semantic mapping shall integrate seamlessly in the interaction with the service, but it shall ``explain'' - offer enough information - on demand, for the user to understand its role and also being able manipulate easily.
    191 
    192 
    193 ?
    194 Facets
    195 Controlled Vocabularies
    196 Synonym Expansion (via TermExtraction(ContentSet))
    197 
     480The emphasis lies on the query language and the corresponding query input interface. A crucial aspect is the question of how to integrate the additional processing, i.e. how to deal with the even greater amount of information in a user-friendly way without overwhelming the user, while still, on demand, being verbose about the applied processing, so that the user can understand how the result came about and, even more importantly, manipulate the processing easily.
     481
     482Note that \emph{query expansion} needs to be distinguished from \emph{query translation}, the task of expressing an input query in another query language (e.g. a CQL query expressed as XPath).
     483
     484Note also that this chapter deals only with the schema level, i.e. the expansion here pertains only to the indexes to be searched in, not to the search terms. The corresponding instance level is tackled in \ref{semantic-search}.
    198485
    199486\subsection{Query language}
    200 CQL?
    201 
     487The \emph{Context Query Language} (CQL), a well-established standard designed with extensibility in mind, is used as the base query language to build upon.
    202488
    203489\subsection{Query Expansion}
    204490
    205 
     491As long as the indexes to expand with are equivalent, the query expansion is simply a disjunction, returning the union of matching records. Thus \code{isocat.resourceTitle any "elephant"} would translate into
     492
     493\begin{example1}
     494GeneralInfo.Title any "elephant" \\
     495OR resourceInfo.resourceName any "elephant" \\
     496OR CollectionInfo.Title any "elephant" \\
     497OR teiHeader.titleStmt.title any "elephant" \\
     498\end{example1}
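
\noindent
In general terms, and only as a sketch under the assumption that all indexes returned by the crosswalk service are treated as equivalent, the expansion of a single search clause with index $i$, relation $r$ and search term $t$ can be written as follows. Here $\mathit{map}(i)$ stands for the list of indexes returned by the \code{map} method for $i$; the function name $\mathit{expand}$ and the symbols are introduced for illustration only.

\[
\mathit{expand}(i \ r \ t) \ = \ \bigvee_{j \, \in \, \mathit{map}(i)} ( \, j \ r \ t \, )
\]

\noindent
In the example above, $i$ is \code{isocat.resourceTitle}, $r$ is \code{any}, $t$ is \code{"elephant"}, and the disjunction ranges over the four listed indexes.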
     499
     500\noindent
     501As an alternative to the potentially costly on-the-fly expansion, the concept-based equivalence clusters could be applied already during the indexing of the data. That means that ``virtual'' search indexes are defined for individual data categories, in which the values from all metadata fields annotated with the given data category are indexed. Indeed, this approach is already being applied in the search applications VLO and Meertens Institute Search Engine (cf. \ref{cmdi_exploitation}).
    206502
    207503\subsection{SMC as module for Metadata Repository}
    208504
    209 As a concrete proof of concept the functionality of SMC has been integrated into the Metadata Repository, another module of the CMDI providing all the metadata records harvested within the CLARIN joint metadata domain.
     505As a concrete proof of concept the functionality of SMC has been integrated into the Metadata Repository, another module of the CMDI providing all the metadata records harvested within the CLARIN joint metadata domain (cf. \ref{cmdi_exploitation}).
    210506
    211507The Metadata Repository is implemented in XQuery running within the eXist XML-database as a web application.
     
    219515
    220516
    221 \subsection{User Interface?}
    222 
    223 
    224 \subsubsection*{Query Input}
    225 
     517\subsection{User Interface}
     518
     519A starting point for our considerations is the traditional structure found in many (advanced) search interfaces, which is basically an array of index-term pairs or, in more advanced alternatives, tuples of index, comparison operator, term and boolean operator:
     520\begin{definition}{Generic data format for structured queries}
     521 [ < index, operation, term, boolean > ]
     522\end{definition}
     523
     524\noindent
     525This maps trivially to the main clause of the CQL syntax, the \var{searchClause} (cf. \ref{def:searchClause}).
     526% {Basic clause of the CQL syntax}
     527\begin{definition}{The main clause of the CQL syntax, the \code{searchClause}}
     528\label{def:searchClause}
     529searchClause \ ::= \ index \ relation \ searchTerm
     530\end{definition}
     531
     532\noindent
     533An alternative would be a smart parsing input field with contextual autocompletion, though such a widget would still share the underlying data model.
    226534
    227535\begin{figure*}[!ht]
     
    231539\end{figure*}
    232540
     541\noindent
    233542Using data categories from ISOcat as search indexes brings about -- next to solid semantic grounding -- the advantage of multilingual labels and descriptions/definitions.
    234543
    235 \subsubsection*{Columns}
    236 
    237 \subsubsection*{Summaries}
    238 
    239 \subsubsection*{Differential Views}
    240 Visualize impact of given mapping in terms of covered dataset (number of matched records).
    241 
    242 \subsubsection*{Visualization}
    243 Landscape, Treemap, SOM
    244 
    245 \todoin{check Ontology Mapping and Alignement / saiks/Ontology4 4auf1.pdf}
    246 
    247 \section{SMC-Browser}
     544A fundamentally different approach is the ``content first'' paradigm, which, similar to the notorious simple search fields found in general search engines, provides suggestions via autocompletion on the fly as soon as the user starts typing. The difference is that the suggestions are typed, so that the user is informed from which index a given term comes (\concept{person}, \concept{place}, etc.).
     545
     546Although we concentrate on query input, the use of indexes has to be consistent across the interface, be it in labeling the fields of the results or in providing facets to drill down the search.
     547
     548
     549\section{SMC Browser}
    248550\label{smc-browser}
    249551
    250 Explore the Component Metadata Framework
    251 
    252 As the data set keeps growing both in numbers and in complexity, the call from the CMD community to provide advanced/enhanced ways for its exploration gets stronger. \textit{SMC browser} is one answer to this need. It is a web application, that lets the metadata modeller explore the information about profiles, components, elements and the usage of DCs as an interactive graph. This allows for example to examine the reuse of components or DCs in different profiles. The graph is accompanied by statistical information about individual `nodes', e.g., counting how many elements a profiles contains, or in how many profiles a DC is used.
    253 
    254 In CMD, metadata schemas are defined by profiles, that are constructed out of reusable components - collections of metadata fields. The components can contain other components, and they can be reused in multiple profiles. Furthermore, every CMD element (metadata field) refers via a PID to a data category to indicate unambiguously how the content of the field in a metadata description should be interpreted \cite{Broeder+2010}.
    255 
    256 Thus, every profile can be expressed as a tree, with the profile component as the root node, the used components as intermediate nodes and elements or data categories as leaf nodes, parent-child relationship being defined by the inclusion (\code{componentA -includes-> componentB}) or referencing (\code{elementA -refersTo-> datcat1}).The reuse of components in multiple profiles and especially also the referencing of the same data categories in multiple CMD elements leads to a blending of the individual profile trees into a graph (acyclic directed, but not necessarily connected).
    257 
     552As the CMD dataset keeps growing both in numbers and in complexity, the call from the community to provide enhanced ways for its exploration gets stronger.  In the following, some design considerations for an application to answer this need are proposed.
     553
     554While the Component Registry (cf. \ref{def:CR}) allows users to browse, search and view existing profiles and components, it is not possible to easily find out which components are reused in which profiles and which data categories are referenced by which elements. However, this kind of information is crucial during profile creation as well as for curation of the existing profiles, as it enables the data modeller to recognize a) which components and data categories are used most often, indicating their adoption and popularity within the community, and b) the thematic contexts in which individual components are used, providing a hint about their appropriateness for given research data.
     555
     556\subsection{Design}
     557In the following, we elaborate on the basic idea of the proposed application, the source data, requirements and proposed application UI-layout.
     558
     559\subsubsection{Basic concept}
     560
     561If we consider the CMD data model (cf. \ref{def:CMD}) we recognize that every profile can be expressed as a tree with the profile component as the root node, the used components as intermediate nodes and elements or data categories as leaf nodes, parent-child relationship being defined by \var{inclusion} and \var{reference}.
     562
     563\begin{defcap}[!ht]
     564\caption{\var{inclusion} and \var{reference} relationship}
     565\begin{align*}
     566cmds:Component  & \xrightarrow{includes} \quad  cmds:Component \\
     567cmds:Component  & \xrightarrow{includes} \quad  cmds:Element \\
     568cmds:Element  & \xrightarrow{refersTo} \quad DatCat
     569\end{align*}
     570\end{defcap}
     571The reuse of components in multiple profiles and especially also the referencing of the same data categories in multiple CMD elements leads to a blending of the individual profile trees into a graph (acyclic directed, but not necessarily connected). The main idea for the \xne{SMC Browser} is to \textbf{visualize this graph inherent in the CMD data}.
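
The blending of profile trees into one graph can be illustrated with a minimal sketch. It is not the implementation of the \xne{SMC Browser}; the node identifiers and the two toy profiles are invented for illustration, and the only assumption carried over from the text is that a component or data category keeps one identity across all profiles.

```python
from collections import defaultdict

# Hypothetical toy data: every profile tree is a list of (parent, child) pairs.
# The identifiers are invented and not taken from the Component Registry.
profile_trees = {
    "profile:Corpus": [
        ("profile:Corpus", "comp:Contact"),
        ("comp:Contact", "elem:Contact.name"),
        ("elem:Contact.name", "datcat:personName"),
    ],
    "profile:Lexicon": [
        ("profile:Lexicon", "comp:Contact"),
        ("comp:Contact", "elem:Contact.name"),
        ("elem:Contact.name", "datcat:personName"),
        ("profile:Lexicon", "elem:Lexicon.author"),
        ("elem:Lexicon.author", "datcat:personName"),
    ],
}


def merge_trees(trees):
    """Blend profile trees into one directed graph keyed by node identifier."""
    children = defaultdict(set)
    for edges in trees.values():
        for parent, child in edges:
            children[parent].add(child)
            children[child]  # register leaf nodes as keys, too
    return children


graph = merge_trees(profile_trees)

# Nodes with more than one parent are the reused ones.
parents = defaultdict(set)
for parent, kids in graph.items():
    for kid in kids:
        parents[kid].add(parent)
reused = sorted(node for node, ps in parents.items() if len(ps) > 1)
```

In this toy example the shared component and the shared data category each appear once in the merged graph, although they occur in both trees.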
     572
     573\subsubsection{Requirements}
     574Given the size of the data set (currently more than 4.000 nodes and growing), it is obviously not possible to take in the whole graph in one view. Thus, a general essential requirement is to be able to select and view subgraphs by various means.
     575
     576In a basic scenario, the user looks for possibly reusable profiles or components, based on some common terms associated with the type of data to be described (e.g. \code{"corpus"}). If the search yields matching profiles or components, the user should be able to view the whole structure of the profiles, explore the definitions for individual components and see which data categories are being referenced for semantic grounding. Furthermore, it has to be possible to view multiple profiles concurrently, in particular to be able to see the components or data categories they share and, vice versa, in which profiles a given data category is referenced.
     577
     578This scenario implies a few requirements on the user interface:
     579\begin{itemize}
     580\item select nodes from a list of all available nodes (ideally grouped by type)
     581\item filter the node list
     582\item select an arbitrary number of nodes of any type (be it profiles, components, elements, data categories)
     583\item traverse the graph starting from selected nodes into arbitrary depth
     584\item traverse the graph backwards (meaning against the direction of the edges, e.g. from data categories towards the profiles)
     585\item maintain the identity of the nodes, meaning one component or one data category used in two profiles has to be represented by one node (for displaying the reuse)
     586\item show auxiliary information about the nodes on demand
     587\end{itemize}
     588
     589\subsubsection{Application layout}
     590\begin{figure*}[!ht]
     591\begin{center}
     592\includegraphics[width=1\textwidth]{images/smc-browser_UIsketch.png}
     593\end{center}
     594\caption{A sketch of a possible layout for the SMC Browser -- individual parts of the user interface}
     595\label{fig:smc-browser_sketch}
     596\end{figure*}
     597
     598\noindent
     599Prospective parts of the application layout (cf. figure \ref{fig:smc-browser_sketch}):
     600\begin{description}
     601\item[index panel] list of all available nodes (profiles, components, elements, data categories); allows to select nodes to be displayed in the graph pane
     602\item[main graph pane] displays the selected subgraph, needs as much space as possible
     603\item[graph navigation bar] for manipulation of the displayed graph by various means
     604\item[detail view] displaying definition and statistical information for selected nodes
     605\item[statistics] a separate view on the data listing the statistical information for whole dataset in tables
     606\end{description}
     607
     608\subsection{Implementation}
     609The application is implemented in \xne{javascript} based on a generic visualization \xne{js}-library \xne{d3}\furl{https://github.com/mbostock/d3/}. The library allows for data-driven visualization (hence the name \xne{d3 = data-driven documents}), attributes of data items being dynamically bound to attributes of the SVG objects representing them. This caters for high flexibility, fast development and consistent data views. The library also delivers the base graph layout algorithm: \emph{force-directed graph layout}\furl{https://github.com/mbostock/d3/wiki/Force-Layout##wiki-force}:
     610
     611\begin{quotation}
     612A flexible force-directed graph layout implementation using position Verlet integration to allow simple constraints.  [\dots]
     613In addition to the repulsive charge force, a pseudo-gravity force keeps nodes centered in the visible area and avoids expulsion of disconnected subgraphs, while links are fixed-distance geometric constraints. Additional custom forces and constraints may be applied on the "tick" event, simply by updating the x and y attributes of nodes.
     614\end{quotation}
     615
     616An especially remarkable feature is the possibility to add custom constraints, which are accommodated alongside the constraints imposed by the base algorithm. This enables flexible customization of the layout, while still harnessing the power of the underlying layout algorithm. At the same time this is quite a challenging feature to master: with different constraints affecting the layout algorithm, it is at times difficult to understand the impact of a specific constraint on the layout.
     617
     618\subsubsection{Data preprocessing}
     619\label{smc-browser-data-preprocessing}
     620The application operates on a set of static XHTML and JSON data files that are created in a preprocessing step and deployed with the application. The preprocessing consists of a series of XSLT transformations (cf. figure \ref{fig:smc_processing}), starting from the internal datasets generated during the initialization (cf. \ref{smc_init}). The HTML output for \xne{smc-stats} is generated in two steps (\var{track S})  via an intermediate internal generic XML format for representing tabular data. The JSON data for the \xne{smc-graph} as expected by the \xne{d3} library is also generated in two steps (\var{track G}). First, an XML representation of the graph is generated from the data (\xne{terms2graph.xsl}), on which a generic XSLT-transformation is applied (\xne{graph\_json.xsl}) transforming the XML graph  into the required JSON format. In fact, this track is run multiple times generating different variants of the graph, featuring different aspects of the dataset:
     621
     622\begin{description}
     623\item[SMC graph basic]
     624        the basic graph contains \var{profiles $\mapsto$ components $\mapsto$ elements $\mapsto$ datcats}
     625\item[SMC graph all]
     626        additionally rendering the new profile-groups and relations between data categories (from Relation Registry)
     627\item[only profiles + datcats]
     628        just profiles and data categories are rendered (with direct links between those, skipping all components and elements)
     629\item[profiles + datcats + groups + rr]
     630        as above but again with profile-groups and relations
     631\item[only profiles]
     632       just profiles with links between them representing the degree of similarity based on the reuse of components and data categories
     633\end{description}
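
The target format of \var{track G} can be pictured with a small sketch. The actual conversion is done by the XSLT stylesheets named above; the Python function below is only a stand-in for illustration, and the attribute names \code{name} and \code{type} as well as the sample edges are assumptions. What it relies on is that the \xne{d3} force layout accepts a list of nodes and a list of links whose \code{source} and \code{target} are indices into the node list.

```python
import json


def to_d3_json(edges):
    """Serialize (source, target) pairs as a nodes/links structure."""
    names = sorted({name for edge in edges for name in edge})
    index = {name: i for i, name in enumerate(names)}
    # "name" and "type" are illustrative attributes, not the real smc-graph schema.
    nodes = [{"name": name, "type": name.split(":", 1)[0]} for name in names]
    links = [{"source": index[s], "target": index[t]} for s, t in edges]
    return json.dumps({"nodes": nodes, "links": links}, indent=2)


# Hypothetical sample: profile -> component -> element -> datcat
sample_edges = [
    ("profile:Corpus", "component:Contact"),
    ("component:Contact", "element:name"),
    ("element:name", "datcat:personName"),
]
graph_json = to_d3_json(sample_edges)
```

Because every node is listed exactly once and links only carry indices, a node reused in several profiles stays a single object in the visualization.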
     634
     635Additionally, a detour pass (\var{track D}) is executed, in which the graph is also transformed into the DOT format and run through the \xne{Graphviz dot} tool to get an SVG representation of the graph. In an early stage of development, this was actually the only processing path. However, it soon became obvious that the graph was getting too large to be displayed in its entirety. Figure \ref{fig:cmd-dep-dotgraph} displays an old version of such a dot-generated graph visualization. Currently, the \xne{dot} output is only used as input for the final graph data, providing initialization coordinates for the nodes in the \code{dot}-layout.
     636
     637
     638\begin{figure*}
     639\includegraphics[width=1\textwidth]{images/smc_processing_-mdrepo}
     640\caption{The data flow in process of precomputing data for the SMC browser}
     641\label{fig:smc_processing}
     642\end{figure*}
     643
     644\subsubsection{User interface}
     645
     646\begin{figure*}[!ht]
     647\includegraphics[width=1\textwidth]{images/navigation_bar_2013-09-28.png}
     648\caption{Navigation bar of the SMC Browser with a number of options to manipulate the visible graph}
     649\label{fig:navbar}
     650\end{figure*}
     651
     652
     653As proposed in the design section, the starting point when using the SMC browser is the node list on the left, listing all nodes grouped by type (profiles, components, elements, data categories) and sorted alphabetically. This list can be filtered by a simple substring search, which is important, as there are already more than 4.000 nodes in the graph. Individual nodes are selected and deselected by a simple click. All selected nodes are displayed in the main graph pane, each represented by a circle with a label. The representation is styled by type. Based on the settings in the navigation bar (cf. figure \ref{fig:navbar}), related nodes are displayed next to the selected nodes. The \code{depth-before} and \code{depth-after} options govern how many levels in each direction are traversed and displayed starting from the set of selected nodes. The \code{layout} option allows selecting one of the available layouts -- next to the
     654basic \code{force} layout there are also directed layouts, that are often better suited for displaying the directed graph.
     655Other options influence the layouting algorithm (\code{link-distance}, \code{charge}, \code{friction}) and the visual representation of the nodes and edges (\code{node-size, labels, curve}).
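
The effect of the \code{depth-before} and \code{depth-after} options can be sketched as a bounded breadth-first expansion in both edge directions. This is a minimal illustration under the assumption that the visible subgraph is the union of both expansions; the function name, its parameters and the sample edges are hypothetical and do not reproduce the code of the application.

```python
def neighbourhood(edges, selected, depth_before=0, depth_after=1):
    """Return the selected nodes plus all nodes within the given number of levels."""
    forward, backward = {}, {}
    for source, target in edges:
        forward.setdefault(source, set()).add(target)
        backward.setdefault(target, set()).add(source)

    def walk(adjacency, depth):
        seen = set(selected)
        frontier = set(selected)
        for _ in range(depth):
            # one level further along the chosen direction
            frontier = {n for f in frontier for n in adjacency.get(f, ())} - seen
            seen |= frontier
        return seen

    return walk(backward, depth_before) | walk(forward, depth_after)


# Hypothetical graph: two profiles sharing a component and a data category.
edges = [
    ("P1", "C1"), ("P2", "C1"), ("C1", "E1"), ("E1", "D1"),
    ("P2", "E2"), ("E2", "D1"),
]
down_from_p1 = neighbourhood(edges, {"P1"}, depth_before=0, depth_after=2)
up_from_d1 = neighbourhood(edges, {"D1"}, depth_before=3, depth_after=0)
```

Starting from the data category and walking against the edge direction reveals every profile in which it is referenced, which is the backward traversal called for in the requirements.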
     656
     657One special option is \code{graph}, that allows to switch between different graphs as listed in \ref{smc-browser-data-preprocessing}.
     658
     659There is user documentation deployed with the application and featured in the appendix \ref{sec:smc-browser-userdocs}, where all aspects of interaction with the application (\ref{interaction}) and the options in the navigation bar (\ref{options}) are described.
     660
     661\subsection{Extensions}
     662Next to the basic setup described above, there is a number of possible additional features, that could enhance the functionality and usefulness of the discussed tool.
     663
     664\subsubsection*{Graph operations -- differential views}
     665An important feature would be to be able to apply set operations on selected (sub)graphs, especially \emph{intersection} and \emph{difference}. This would enable the user to easily extract components (nodes) that are shared (or not shared) among given schemas (subgraphs).
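
A minimal sketch of such differential views, assuming that a subgraph is identified with the set of nodes reachable from a profile; the helper \code{reachable} and the sample edges are hypothetical.

```python
def reachable(edges, root):
    """Collect the root and every node reachable from it along the edges."""
    adjacency = {}
    for source, target in edges:
        adjacency.setdefault(source, set()).add(target)
    seen, stack = {root}, [root]
    while stack:
        node = stack.pop()
        for child in adjacency.get(node, ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


# Hypothetical graph: two profiles sharing a component and a data category.
edges = [
    ("P1", "C1"), ("P2", "C1"), ("C1", "E1"), ("E1", "D1"),
    ("P2", "E2"), ("E2", "D1"),
]
sub_p1 = reachable(edges, "P1")
sub_p2 = reachable(edges, "P2")

shared = sub_p1 & sub_p2    # intersection: nodes used by both profiles
only_p2 = sub_p2 - sub_p1   # difference: nodes specific to P2
```

Operating on node sets keeps the operations cheap; the edges of the resulting view can be recovered by keeping only those edges whose both ends lie in the result.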
     666
     667\subsubsection*{Generalization}
     668There is a high potential to broaden the scope of application for the discussed tool, provided some generalizations are taken into account.
     669Equipped with a more flexible or modular matching algorithm (additionally to the initially foreseen identity match), the tool could visualize matches between any given schemas, not only CMD-based ones.
     670
     671Also, the input format being a graph, with appropriate preprocessing the tool could visualize any structural information, that is suited to be expressed as graph, like cooccurrence analysis, dependency networks, RDF data in general etc.
     672
     673\subsubsection*{Viewer for external data}
     674The above feature would be even more useful if the application were able to ingest and process external data. The data could be passed either via upload or via a parameter with a URL of the data. This is especially attractive for providers of other data and applications, who could provide a simple link in their user interface (with the data-parameter appropriately set) that would allow visualizing their data in the SMC browser.
     675
     676One prominent visualization application offering this feature is the geobrowser e4D\furl{http://www.informatik.uni-leipzig.de:8080/e4D/} (currently \xne{GeoTemCo}\furl{https://github.com/stjaenicke/GeoTemCo}, developed in the context of the \xne{europeana connect} initiative), accepting data in KML format.
     677
     678\subsubsection*{Integrate with instance data}
     679The usefulness and information gain of the application could be greatly increased by integrating the instance data, i.e. by generating and displaying a variant of the graph which contains only profiles for which there is actually instance data present in the CLARIN joint metadata domain. Obviously, in such a visualization the size of the data could be incorporated, in the simplest case by mapping the number of records onto the radius of the nodes, but there are a number of other metrics that could be applied in the visualizations.
     680
     681Also such a visualization could feature direct search links from individual nodes into the dataset, i.e.  from a profile node a link could lead into a search interface listing metadata records of given profile.
    258682
    259683\section{Summary}
    260 
    261 
    262 
     684In this core chapter, we laid out a design for a system dealing with concept-based crosswalks on the schema level.
     685The system consists of three main parts: the crosswalk service, the query expansion module and \xne{SMC Browser} -- a tool for visualizing and exploring the schemas and the corresponding crosswalks.
     686
  • SMC4LRT/chapters/Infrastructure.tex

    r3553 r3638  
    33
    44
    5 \section{CLARIN / CMDI}
     5\section{CLARIN}
    66\label{def:CLARIN}
     7
     8CLARIN - Common Language Resource and Technology Infrastructure\cite{Varadi2008} - is one of the large research infrastructure initiatives as envisaged by the European Strategy Forum on Research Infrastructures (ESFRI) and fostered by the framework programmes of the European Commission. The mission of this project is to provide
     9
     10\begin{quote}
     11\dots easy and sustainable access for scholars in the humanities and social sciences to digital language data (in written, spoken, video or multimodal form) and advanced tools to discover, explore, exploit, annotate, analyse or combine them, independent of where they are located.\cite{CLARIN2013web}
     12\end{quote}
     13
     14\begin{comment}
     15To this end CLARIN is in the process of building a networked federation of European data repositories, service centres and centres of expertise, with single sign-on access for all members of the academic community in all participating countries. Tools and data from different centres will be interoperable, so that data collections can be combined and tools from different sources can be chained to perform complex operations to support researchers in their work.
     16\end{comment}
     17
     18The initiative foresees a federated network of centres providing resources and services in a harmonized, interoperable manner to the academic community in all participating countries.
     19
     20In the preparation phase of the project (2008 - 2011), over 180 institutions from 38 countries participated. In the construction phase, the focus of activity moved, as projected, towards the individual national initiatives of this federated endeavour, which are held together by the common principles set up during the preparation phase and by established processes and bodies ensuring the flow of information and coherent action on the European level.
     21
     22Since 2013, CLARIN has also been a \emph{European Research Infrastructure Consortium} (ERIC), a new type of legal entity established within the EU, especially designed to give research infrastructure initiatives a more stable status and better means to act independently. This is an important step towards ensuring the continuity of the endeavour, a chronic problem of (international) projects.
     23
     24\section{Component Metadata Infrastructure -  CMDI}
    725\label{def:CMDI}
    8 CLARIN - Common Language Resource and Technology Infrastructure - constituted by over 180 members from round 38 countries. The mission of this project is to
    9 
    10 \begin{quotation}
    11 \dots create a research infrastructure that makes language resources and technologies (LRT) available to scholars of all disciplines, especially SSH large-scale pan-European collaborative effort to create, coordinate and make language resources and technology available and readily usable.
    12 \end{quotation}
    13 
    14 The infrastructure foresees a federated network of centers (with federated identity management) but mainly providing resources and services in an agreed upon / coherent / uniform / consistent /standardized manner. The foundation for this goal shall be the Common or Component Metadata infrastructure, a model that caters for flexible metadata profiles, allowing to accommodate existing schemas.
    15 
    16 As stated before, the SMC is part of CMDI and depends on multiple modules on the production side of the infrastructure. Before we describe the interaction itself in chapter \ref{ch:design}, we introduce in short these modules and the data they provide:
     26
     27One core pillar of CLARIN is the \emph{Component Metadata Infrastructure} (CMDI)\furl{http://www.clarin.eu/cmdi} -- a distributed system consisting of multiple interconnected modules aimed at creating and providing metadata for LRT in a coherent harmonized way. The conceptual foundation of CMDI is the \emph{Component Metadata Framework}\cite{Broeder+2010}, a flexible meta model that supports creation of metadata schemas also allowing to accommodate existing schemas (cf. \ref{def:CMD}).
     28
     29The SMC is part of CMDI and depends on multiple modules on the production side of the infrastructure. Before we describe the SMC and its interaction with these modules in detail in chapter \ref{ch:design}, we introduce the latter and the type of data they provide:
    1730
    1831\begin{itemize}
    1932\item Data Category Registry
     33\item Component Registry
    2034\item Relation Registry
    21 \item Component Registry
    22 \item Vocabulary Alignement Service (OpenSKOS)
     35\end{itemize}
     36
     37\noindent
     38All these components are running services, that this work shall directly build upon.
     39
     40Next to these core services, that SMC has direct dependencies to, some other services are being developed within the CMDI ecosystem that are also relevant in the context of SMC:
     41
     42\begin{itemize}
    2343\item Schema Registry (SCHEMAcat\furl{http://lux13.mpi.nl/schemacat/site/index.html})
    2444\item SchemaParser
     45\item Vocabulary Alignment Service (OpenSKOS)
    2546\end{itemize}
    2647
    27 On the other hand, SMC shall serve the modules on the exploitation side of the infrastructure, i.e. search services used by end users. These are briefly introduced in \label{cmdi_exploitation}.
     48On the other hand, SMC shall serve the modules on the exploitation side of the infrastructure, i.e. search services used by end users. These are briefly introduced in \ref{cmdi_exploitation}.
    2849
    2950\begin{figure*}[!ht]
     
    3354
    3455
    35 \subsection{CMDI registries: DCR, CR, RR}
    36 \label{def:CMD}
    37 \label{def:DCR}
    38 
    39 
    40 
    41 The \emph{Data Category Registry} (DCR) is a central registry that enables the community to collectively define and maintain a set of relevant linguistic data categories. The resulting commonly agreed controlled vocabulary is the cornerstone for grounding the semantic interpretation within the CMD framework.
    42 The data model and the procedures of the DCR are defined by the ISO standard \cite{ISO12620:2009}, and is implemented in \emph{ISOcat}\footnote{\url{http://www.isocat.org/}}.
    43 Next to a web interface for users to browse and manage the data categories, DCR provides a REST-style webservice allowing applications to access the information (provided in Data Category Interchange Format - DCIF). The data categories are assigned a persistent identifier, making them globally and permanently referenceable.
    44 
    45 The \emph{Component Metadata Framework} (CMD) is built on top of the DCR and complements it. While the DCR defines the atomic concepts, within CMD the metadata schemas can be constructed out of reusable components - collections of metadata fields. The components can contain other components, and they can be reused in multiple profiles as long as each field ``refers via a PID to exactly one data category in the ISO DCR, thus indicating unambiguously how the content of the field in a metadata description should be interpreted'' \cite{Broeder+2010}. This allows to trivially infer equivalences between metadata fields in different CMD-based schemata. While the primary registry used in CMD is the ISOcat DCR, other authoritative sources for data categories (``trusted registries'') are accepted, especially Dublin Core Metadata Initiative \cite{DCMI:2005}.
    46 
    47 \emph{Component Registry} implements the Component Data Model and allows to define, maintain and publish CMD-components and -profiles.
    48 
     56\subsection{CMDI registries}
     57
     58The CMD framework as data model (cf. \ref{def:CMD}), together with the two registries, the \emph{Data Category Registry} \xne{ISOcat} and the \emph{Component Registry}, forms the backbone of the CMD Infrastructure. In the following we briefly explain their role and interaction.
    4959
    5060\begin{figure*}[!ht]
     
    5363\end{figure*}
    5464       
     65\subsubsection*{Data Category Registry}
     66\label{def:DCR}
     67
     68The \emph{Data Category Registry} (DCR) is a central registry that enables the community to collectively define and maintain a set of relevant linguistic data categories. The resulting commonly agreed controlled vocabulary is the cornerstone for grounding the semantic interpretation within the CMD framework.
     69The data model and the procedures of the DCR are defined by the ISO standard \cite{ISO12620:2009} and are implemented in \xne{ISOcat}\furl{http://www.isocat.org/}.
     70Next to a web interface for users to browse and manage the data categories, DCR provides a REST-style webservice allowing applications to access the information (provided in Data Category Interchange Format - DCIF). The data categories are assigned a persistent identifier, making them globally and permanently referenceable.
     71
     72\subsubsection*{Component Registry}
     73
     74\emph{Component Registry} (CR)\furl{http://catalog.clarin.eu/ds/ComponentRegistry/} implements the CMD data model and fulfills two functions. On the one hand, it is a robust web application for creating and editing new CMD components and profiles. On the other hand, it is the actual registry that persistently stores and exposes published CMD profiles, allowing users to browse and search in them and view their structure.
     75
     76The primary user of the CR is the metadata modeller with the task to create a dedicated metadata profile for a given resource type. She can browse and search the CR for components and profiles that are suitable or come close. The registry already contains many general components, e.g., for contact persons, language and geographical information. In general, many of these can be reused as they are or have to be only slightly adapted, i.e., by adding or removing some metadata elements and/or components. New components can also be created to model the unique aspects of the resources under consideration. All components are combined into one profile. Components, elements and values should be linked to a concept to make their semantics explicit.\cite{Durco2013_MTSR}
     77
     78This approach of integrating prerequisites for semantic interoperability directly into the process of metadata creation differs
     79from the traditional methods of schema matching that try to establish pairwise alignments between schemas only after they were created and published -- be it algorithm-based or by means of explicit manually defined crosswalks\cite{Shvaiko2005}.
     80
     81\subsubsection*{Ontological Relations -- Relation Registry}
     82
    5583The framework as described so far provides a sound mechanism for binding the semantic interpretation of the metadata descriptions.
    5684However there needs to be an additional means to capture information about relations between data categories.
    57 This information was deliberately not included in the DCR, because relations often depend on the context in which they are used, making global agreement unfeasible. CMDI proposes a separate module -- the \emph{Relation Registry}\label{def:rr} (RR) \cite{Kemps-Snijders+2008} --, where arbitrary relations between data categories can be stored and maintained. We expect that the RR should be under control of the metadata user whereas the DCR is under control of the metadata modeler.
    58  These relations don't need to pass a standardization process, but rather separate research teams may define their own sets of relations according to the specific needs of the project. That is not to say that every researcher has to create her own set of relations -- some basic recommended sets will be defined right from the start. But new -- even contradictory -- ones can be created when needed.
    59 
    60 There is a prototypical implementation of such a relation registry called \emph{RELcat} being developed at MPI, Nijmegen. \cite{Windhouwer2011,SchuurmanWindhouwer2011}, that already hosts a few relation sets. There is no user interface to it yet, but it is accessible as a REST-webservice\footnote{sample relation set: \url{http://lux13.mpi.nl/relcat/rest/set/cmdi}}.
     85This information was deliberately not included in the DCR, because relations often depend on the context in which they are used, making global agreement unfeasible. CMDI proposes a separate module -- the \emph{Relation Registry}\label{def:rr} (RR) \cite{Kemps-Snijders+2008} --, where arbitrary relations between data categories can be stored and maintained. This design is grounded in the expectation that the RR should be under control of the metadata user whereas the DCR is under control of the metadata modeller.
     86
     87These relations don't need to pass a standardization process, but rather separate research teams may define their own sets of relations according to the specific needs of the project. That is not to say that every researcher has to create her own set of relations -- some basic recommended sets will be defined right from the start. But new -- even contradictory -- ones can be created when needed.
     88
     89There is a prototypical implementation of such a relation registry called \emph{RELcat} being developed at MPI, Nijmegen\cite{Windhouwer2011,SchuurmanWindhouwer2011}, that already hosts a few relation sets. There is no user interface to it yet, but it is accessible as a REST-webservice\footnote{sample relation set: \url{http://lux13.mpi.nl/relcat/rest/set/cmdi}}.
    6190This implementation stores the individual relations as RDF-triples
    62 \begin{example}
    63 <subjectDatcat, relationPredicate, objectDatcat>
    64 \end{example}
    65 allowing typed relations, like equivalency (\texttt{rel:sameAs}) and subsumption (\texttt{rel:subClassOf}). The relations are grouped into relation sets that can be used independently.
    66 
    67 !check DCR-RR/Odijk2010 -follow up
    68 !Cf. Erhard Hinrichs 2009
     91
     92\begin{example3}
     93<subjectDatcat, & relationPredicate, & objectDatcat>
     94\end{example3}
     95
     96allowing typed relations, like equivalency (\code{rel:sameAs}) and subsumption (\code{rel:subClassOf}). The relations are grouped into relation sets that can be used independently.
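
How such typed triples could be exploited can be shown with a small sketch. The triples and data category names below are invented and not taken from \xne{RELcat}; treating \code{rel:sameAs} as symmetric and transitive is an assumption made for illustration only.

```python
# Hypothetical relation set as (subject, predicate, object) triples.
relation_set = [
    ("datcat:A", "rel:sameAs", "datcat:B"),
    ("datcat:B", "rel:sameAs", "datcat:C"),
    ("datcat:D", "rel:subClassOf", "datcat:A"),
]


def equivalents(triples, datcat):
    """Return all data categories linked to datcat via rel:sameAs."""
    adjacency = {}
    for subject, predicate, obj in triples:
        if predicate == "rel:sameAs":
            # assumption: equivalence is read in both directions
            adjacency.setdefault(subject, set()).add(obj)
            adjacency.setdefault(obj, set()).add(subject)
    seen, stack = {datcat}, [datcat]
    while stack:
        node = stack.pop()
        for other in adjacency.get(node, ()):
            if other not in seen:
                seen.add(other)
                stack.append(other)
    return seen


same_as_c = equivalents(relation_set, "datcat:C")
```

Because relation sets can be used independently, the same function could be applied to a different list of triples to obtain a different, possibly contradictory, set of equivalences.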
     97
     98\todoin{check DCR-RR/Odijk2010 -follow up ?; Cf. Erhard Hinrichs 2009 }
     99
     100\subsubsection*{Schema Registry}
    69101
    70102SCHEMAcat is a registry for schemata of all kinds (not just XML-based) semantically annotated with data categories.
     
    73105(search) algorithms to traverse the semantic graph thus made explicit\cite{Schuurman2011_SCHEMAcat}.
    74106
    75 \noindent
    76 All these components are running services, that this work shall directly build upon.
    77 
    78 This approach of integrating prerequisites for semantic interoperability directly into the process of metadata creation differs
    79 from the traditional methods of schema matching that try to establish pairwise alignments between schemas only after they were created and published. % -- be it algorithm-based or by means of explicit manually defined crosswalks \cite{Shvaiko2005}.
    80 
    81 Consequently, the infrastructure also foresees a dedicated module, \emph{Semantic Mapping}, that exploits this mechanism to deliver correspondences between different metadata schemas. The details of its functioning and its interaction with the aforementioned modules is described in the following chapter \ref{ch:design}.
    82107
    83108\subsection{Vocabulary Service / Reference Data Registry}
     
    86111The urgent need for reliable community-shared registry services for concepts, controlled vocabularies and reference data for both the LRT and Digital Humanities community has been discussed on many occasions in various contexts. Applications and tasks requiring or profiting from this kind of service comprise Data-Enrichment / Annotation, Metadata Generation, Curation, Data Analysis, etc. As there is a substantial overlap in the vocabularies relevant for the various communities and even more so a high potential for reusability on the technical level, there is a strong case for tight cooperation between different initiatives.
    87112
    88 In the context of the CLARIN initiative, one activity to tackle this issue -- mainly driven by CLARIN-NL -- is the project/taskforce \emph{CLAVAS - Vocabulary Alignment Service for CLARIN} where the plan is to reuse and enhance for CLARIN needs a SKOS-based  vocabulary repository and editor OpenSKOS\furl{http://openskos.org}, developed and run within the dutch program CATCHplus\footnote{\textit{Continuous Access To Cultural Heritage} - \url{http://www.catchplus.nl/en/}}. See below for a more detailed description of this system. As of spring 2013, the Standing Committee on CLARIN Technical Centers (SCCTC) adopted the issue of Controlled Vocabularies and Concept Registries as one of the infrastructural (A-center) services to be dealt with.
    89 
     113In the context of the CLARIN initiative, one activity to tackle this issue -- mainly driven by CLARIN-NL -- is the project/taskforce \emph{CLAVAS - Vocabulary Alignment Service for CLARIN}, where the plan is to reuse OpenSKOS\furl{http://openskos.org}, a SKOS-based vocabulary repository and editor developed and run within the Dutch program CATCHplus\footnote{\textit{Continuous Access To Cultural Heritage} - \url{http://www.catchplus.nl/en/}}, and to enhance it for CLARIN needs. See below for a more detailed description of this system. As of spring 2013, the Standing Committee on CLARIN Technical Centres (SCCTC) adopted the issue of Controlled Vocabularies and Concept Registries as one of the infrastructural (A-centre) services to be dealt with.
     114
     115\begin{note}
    90116In parallel, within the sister ESFRI project DARIAH, a taskforce with the same goal has been set up: \emph{Service for Reference Data and Controlled Vocabularies}. This taskforce was introduced at the 2nd VCC Meeting in Vienna in November 2012. It is conceived as a collaborative endeavor between VCC1/Task 5: Data federation and interoperability and VCC3/Task 3: Reference Data Registries (and external partners). The main goal is to \emph{establish a service providing controlled vocabularies and reference data} for the DARIAH (and CLARIN) community.
    91117
    92 Regarding the responsibilities of the DARIAH working groups:
    93 VCC3/Task 3 identifies and recommends vocabularies relevant for the community. VCC1/Task 5 provides basic/generic services relevant for whole community. Especially, the Schema Registry, that allows to express mappings between different schemas seems to be one starting point. In accordance with the VCC1 strategy, concentrate on pulling together (pooling) existing resources and only implement necessary ``glue'' to put the pieces together (data conversion, service-wrappers...)
    94 
    95118Thus there is momentum and high potential for a collaborative approach in at least these two big initiatives, CLARIN and DARIAH, which serve a widespread and diverse community.
     119\end{note}
    96120
    97121\subsubsection{Abstract service description}
    98122As to the service itself it is primarily meant to serve other applications, rather than being used directly by end users, but a basic user interface is still necessary for administration etc.  By using global semantic identifiers instead of strings, such a service enables the harmonization of metadata descriptions and annotations and is an indispensable step towards semantic data and \xne{LOD}.
    99 Besides providing vocabularies, the service should also hold and expose equivalencies (and other relationships) between concepts from different vocabularies (concept schemes). These relationships come primarily from existing mappings, but can (and hopefully will) be subsequently generated (manually) for specific subsets on demand in a community process. An example for equivalencies from Wikipedia\footnote{\href{http://de.wikipedia.org/wiki/Johann_Wolfgang_von_Goethe}{page for J. W. Goethe}}:
     123Besides providing vocabularies, the service should also hold and expose equivalences (and other relationships) between concepts from different vocabularies (concept schemes). These relationships come primarily from existing mappings, but can (and hopefully will) be subsequently generated (manually) for specific subsets on demand in a community process. An example for equivalences from Wikipedia\footnote{\href{http://de.wikipedia.org/wiki/Johann_Wolfgang_von_Goethe}{page for J. W. Goethe}}:
    100124\begin{verbatim}
    101 GND: 118540238 | LCCN: n79003362 | NDL: 00441109 | VIAF: 24602065 | Wikipedia-Personensuche
     125GND: 118540238 | LCCN: n79003362 |
     126NDL: 00441109 | VIAF: 24602065
    102127\end{verbatim}
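
A minimal sketch of how such equivalences might be held and looked up, in Python. The identifiers are those from the example above; the table \code{EQUIVALENCES} and the helper \code{equivalents} are hypothetical names chosen for illustration and are not part of any CLAVAS or OpenSKOS interface.

```python
# Hypothetical in-memory table: each set groups (scheme, identifier) pairs
# that denote the same entity in different authority files.
# The values are taken from the Goethe example above.
EQUIVALENCES = [
    {("GND", "118540238"), ("LCCN", "n79003362"),
     ("NDL", "00441109"), ("VIAF", "24602065")},
]


def equivalents(scheme, identifier):
    """Return all other (scheme, identifier) pairs known to be equivalent."""
    key = (scheme, identifier)
    for group in EQUIVALENCES:
        if key in group:
            return sorted(group - {key})
    return []  # unknown identifier: no equivalences recorded


print(equivalents("GND", "118540238"))
```

A real service would presumably expose such groups as typed relations between concepts rather than as plain sets; the sketch only shows the lookup idea.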
    103128
    104129\subsubsection{Vocabulary Service - CLAVAS}
    105130\label{def:CLAVAS}
    106 As described in previous section (\ref{def:DCR}), a solid pilar for defining and maintaining data categories is the ISOcat data category registry. However, while ISOcat has been in productive use for some time, it is – by design – not usable for all kinds of reference data. In general, it suits well for defining concepts/data categories (with closed or open concept domains), but its complex data model and standardization workflow does not lend itself well to maintain “semi-closed'' concept domains, controlled vocabularies, like lists of entities (e.g. organizations or authors). In such cases, the concept domain is not closed (new entities need to be added), but it is also not open (not any string is a valid entity). Besides, the domain may be very large (millions of entities) and has to be presumed changing (especially new entities being added).
     131As described in the previous section (\ref{def:DCR}), a solid pillar for defining and maintaining data categories is the ISOcat data category registry. However, while ISOcat has been in productive use for some time, it is -- by design -- not usable for all kinds of reference data. In general, it is well suited for defining concepts/data categories (with closed or open concept domains), but its complex data model and standardization workflow do not lend themselves well to maintaining ``semi-closed'' concept domains, i.e. controlled vocabularies like lists of entities (e.g. organizations or authors). In such cases, the concept domain is not closed (new entities need to be added), but it is also not open (not any string is a valid entity). Besides, the domain may be very large (millions of entities) and has to be presumed to be changing (especially through new entities being added).
    107132
    108133This shortcoming leads to a need for an additional registry/repository service for this kind of data (controlled vocabularies). Within the CLARIN project mainly the abovementioned taskforce \emph{CLAVAS} is concerned with this challenge.
     
    130155\label{interaction-dcr-skos}
    131156
    132 
    133157DCR recognizes the following types of data categories (Figure \ref{fig:dc_type}):
    134 simple, complex: closed, open, constrained, (container)?
     158\code{simple, complex: closed, open, constrained, (container)?}
    135159
    136160\begin{figure*}[!ht]
     
    149173
    150174
    151 The semantic proximity of a /data category/ to a /concept/ may mislead to
    152 a na"ive approach to mapping DCR to SKOS, namely mapping every data category (from one profile) to a concept
    153 all of them belonging to the \xne{ISOcat-profile:ConceptScheme}.
     175The fact that data categories are basically definitions of concepts may suggest
     176a na"ive approach to mapping DCR to SKOS, namely mapping every data category to a \code{skos:Concept},
     177all of them belonging to the \xne{ISOcat:ConceptScheme}.
    154178However, this is not practical: ISOcat as a whole is too disparate, and so would be the resulting vocabulary.
    155179
    156 A more sensible approach is to export only closed DCs as separate ConceptSchemes and their respective simple DCs as Concepts within that scheme.
     180A more sensible approach is to export only closed DCs as separate \code{skos:ConceptSchemes} and their respective simple DCs as \code{Concepts} within that scheme.
     181
     182\begin{quotation}
    157183The rationale is, that if we see a vocabulary as a set of possible values for a
    158184field/element/attribute, complex DCs in ISOcat are the users of such
    159185vocabularies and simple DCs the DCR equivalence of values in such a
    160 vocabulary.\cite{Menzo2013mail}
    161 
    162 Another aspect is, that a simple DC can be in valuedomains of multiple closed DCs.
    163 Also a skos:Concept can belong to multiple ConceptSchemes\furl{http://www.w3.org/TR/skos-primer/\#secscheme}.
    164 So there could a 1:1 one mapping [complex closed DCs] to [skos:ConceptSchemes] and [simple DCS] to [skos:Concepts].
     186vocabulary.
     187\end{quotation}\cite{Menzo2013mail}
     188
     189Another aspect is that a simple DC can be in the value domains of multiple closed DCs.
     190Likewise, a \code{skos:Concept} can belong to multiple \code{skos:ConceptSchemes}\furl{http://www.w3.org/TR/skos-primer/\#secscheme}.
     191So there could be a 1:1 mapping of [complex closed DCs] to [skos:ConceptSchemes] and of [simple DCs] to [skos:Concepts].
    165192That would automatically also convey the possibly multiple membership of simple DCs / skos:Concepts in closed DCs / skos:ConceptSchemes.
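
The mapping described above can be sketched as follows, in Python. This is only an illustration under stated assumptions: the input format (a dictionary from the PID of a closed DC to the PIDs in its value domain), the function name \code{export_to_skos} and the placeholder PIDs in the usage line are all hypothetical and do not reflect the actual ISOcat export.

```python
def export_to_skos(closed_dcs):
    """Map closed DCs to ConceptSchemes and simple DCs to Concepts.

    closed_dcs: dict mapping the PID of a complex closed DC to the list of
    PIDs of the simple DCs in its value domain (assumed input format).
    Returns (schemes, concepts): the sorted list of scheme PIDs, and a dict
    mapping each concept PID to the sorted list of schemes it belongs to.
    """
    concepts = {}
    for scheme_pid, value_domain in closed_dcs.items():
        for simple_pid in value_domain:
            # A simple DC in several value domains becomes one concept
            # that is a member of several concept schemes.
            concepts.setdefault(simple_pid, set()).add(scheme_pid)
    schemes = sorted(closed_dcs)
    return schemes, {pid: sorted(s) for pid, s in concepts.items()}


# Placeholder PIDs, for illustration only.
schemes, concepts = export_to_skos({"DC-A": ["DC-1", "DC-2"], "DC-B": ["DC-2"]})
print(schemes, concepts)
```

Keeping the scheme membership as a set per concept is what conveys the multiple membership mentioned above without duplicating the concept itself.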
    166193
     
    332359\todocite {MI Search Engine}
    333360
    334 And finally, there is the \emph{Metadata Repository} aimed to collect all the harvested metadata descriptions from CLARIN centers,
     361And finally, there is the \emph{Metadata Repository} aimed to collect all the harvested metadata descriptions from CLARIN centres,
    335362and \emph{Metadata Service} that provides search access to this body of data. As such, Metadata Service is the primary application to use Semantic Mapping, to optionally expand user queries before issuing a search in the Metadata Repository. \cite{Durco2011}
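
Query expansion of this kind could look roughly like the following Python sketch. It assumes a crosswalk that lists, for each concept, the metadata fields referencing it; the table \code{CROSSWALK}, the function \code{expand\_query} and all field and concept names are hypothetical and only serve as illustration.

```python
# Hypothetical crosswalk: concept (data category) -> metadata fields that
# reference it in different schemas. All names are illustrative only.
CROSSWALK = {
    "concept:resourceTitle": ["profileA.Title", "profileB.Name"],
    "concept:language": ["profileA.Language", "profileC.Lang"],
}


def expand_query(field, value, crosswalk=CROSSWALK):
    """Return (field, value) pairs for every field sharing a concept with
    `field`; the original pair always comes first."""
    expanded = [(field, value)]
    for fields in crosswalk.values():
        if field in fields:
            for other in fields:
                if (other, value) not in expanded:
                    expanded.append((other, value))
    return expanded


print(expand_query("profileA.Title", "Faust"))
```

The resulting pairs would then be combined disjunctively in the search issued against the Metadata Repository; a field without any crosswalk entry is passed through unchanged.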
    336363
  • SMC4LRT/chapters/Literature.tex

    r3551 r3638  
    1616
    1717\subsection{Metadata}
    18 A comprehensive architecture for harmonized handling of metadata -- the Component Metadata Infrastructure (CMDI)\footnote{\url{http://www.clarin.eu/cmdi}} \cite{Broeder2011} -- is being implemented within the CLARIN project\footnote{\url{http://clarin.eu}}. This service-oriented architecture consisting of a number of interacting software modules allows metadata creation and provision based on a flexible meta model, the \emph{Component Metadata Framework}, that facilitates creation of customized metadata schemas -- acknowledging that no one metadata schema can cover the large variety of language resources and usage scenarios -- however at the same time equipped with well-defined methods to ground their semantic interpretation in a community-wide controlled vocabulary -- the data category registry \cite{Kemps-Snijders+2009,Broeder2010}.
     18A comprehensive architecture for harmonized handling of metadata -- the Component Metadata Infrastructure (CMDI)\furl{http://www.clarin.eu/cmdi} \cite{Broeder2011} -- is being implemented within the CLARIN project\footnote{\url{http://clarin.eu}}. This service-oriented architecture consisting of a number of interacting software modules allows metadata creation and provision based on a flexible meta model, the \emph{Component Metadata Framework}, that facilitates creation of customized metadata schemas -- acknowledging that no one metadata schema can cover the large variety of language resources and usage scenarios -- however at the same time equipped with well-defined methods to ground their semantic interpretation in a community-wide controlled vocabulary -- the data category registry \cite{Kemps-Snijders+2009,Broeder2010}.
    1919
    2020Individual components of this infrastructure will be described in more detail in the section \ref{ch:infra}.
     
    8787In their rather theoretical work Ehrig and Sure \cite{EhrigSure2004} elaborate on the various similarity measures which are at the core of the mapping task. On the dedicated platform OAEI\footnote{Ontology Alignment Evaluation Initiative - \url{http://oaei.ontologymatching.org/}} an ongoing effort is being carried out and documented, comparing various alignment methods applied to different domains.
    8888
    89 One more specific recent inspirative work is that of Noah et. al \cite{Noah2010} developing a semantic digital library for an academic institution. The scope is limited to document collections, but nevertheless many aspects seem very relevant for this work, like operating on document metadata, ontology population or sophisticated querying and searching.
     89One more specific recent inspirational work is that of Noah et al. \cite{Noah2010} developing a semantic digital library for an academic institution. The scope is limited to document collections, but nevertheless many aspects seem very relevant for this work, like operating on document metadata, ontology population or sophisticated querying and searching.
    9090
    9191\todoin{check if relevant: http://schema.org/}
     
    9999
    100100\subsection{Ontology Visualization}
     101
     102Landscape, Treemap, SOM
     103
     104\todoin{check Ontology Mapping and Alignement / saiks/Ontology4 4auf1.pdf}
    101105
    102106
     
    123127
    124128\section{Summary}
    125 This chapter concentrated on the current affairs/developments regarding the infrastructures for Language Resources and Technology and
    126 on the other hand gave an overview of the state of the art regarding methods to be applied in this work: Semantic Web Technologies, Ontology Mapping and Ontology Visualization.
     129This chapter concentrated on the current affairs/developments regarding the infrastructures for Language Resources and Technology and on the other hand gave an overview of the state of the art regarding methods to be applied in this work: Semantic Web Technologies, Ontology Mapping and Ontology Visualization.
  • SMC4LRT/chapters/Results.tex

    r3551 r3638  
    4949
    5050
    51 \subsection{SMC Browser -- Advanced Interactive User Interface}
     51\subsection{SMC Browser -- advanced interactive user interface}
    5252
    5353SMC Browser\furl{http://clarin.aac.ac.at/smc-browser} is a web application to explore the complex dataset of the Component Metadata Framework, by visualizing its structure as an interactive graph.
     54In particular, it enables the metadata modeller to examine the reuse of components or DCs in different profiles. The graph is accompanied by statistical information about individual `nodes', e.g. counting how many elements a profile contains, or in how many profiles a DC is used.
    5455
    5556It is implemented on top of the js-library d3, the code is checked in clarin-svn.
     
    249250The model has been expressed as 4 CMD profiles each for a distinct resource type however all four sharing most of the components, as can be seen in figure \ref{fig:resource_info_5}. The biggest single profile is currently the remodelled maximum schema from the META-SHARE project for describing corpora, with 117 distinct components and 337 elements. When expanded, this translates to 419 components and 1587 elements. However, many of the components and elements are optional (and conditional), thus a specific instance will never use all the possible elements.
    250251
    251 In a parallel effort, LINDAT, the czech national infrastructure initiative engaged in both CLARIN and META-SHARE, created a CMD profile (\xne{data}\furl{http://catalog.clarin.eu/ds/ComponentRegistry/?item=clarin.eu:cr1:p_1349361150622}) modelling the minimal obligatory set of META-SHARE \xne{resourceInfo}), however combined with a simple dublincore record.
     252In a parallel effort, LINDAT, the Czech national infrastructure initiative engaged in both CLARIN and META-SHARE, created a CMD profile (\concept{data}\furl{http://catalog.clarin.eu/ds/ComponentRegistry/?item=clarin.eu:cr1:p_1349361150622}) modelling the minimal obligatory set of META-SHARE \concept{resourceInfo}, however combined with a simple Dublin Core record.
    252253This way, the information gets partly duplicated, but with the advantage that minimal information is conveyed in a widely understood format while retaining the expressivity of the feature-rich schema.
    253254
     
    288289\item MD Search employing Semantic Mapping
    289290\item MD Search employing Fuzzy Search
     291\item Visualize impact of given mapping in terms of covered dataset (number of matched records).
    290292\end{itemize}
    291293
  • SMC4LRT/chapters/abstract_en.tex

    r2672 r3638  
    11\chapter*{Abstract}
    22
    3 According to the guidelines of the faculty, an abstract in English has to be inserted here.
     3
     4This work is embedded in the context of a large research infrastructure initiative aimed at easing and harmonizing access to language resources and technology (LRT) in Europe, the \emph{Common Language Resource and Technology Infrastructure} or CLARIN. A core technical pillar of this initiative is the \emph{Component Metadata Infrastructure}, a distributed system for creating and providing metadata for LRT in a coherent harmonized way. The outcome of this work, the \emph{Semantic Mapping Component}, was conceived as one module within the infrastructure dedicated to overcoming the semantic interoperability problem stemming from the heterogeneity of the resource descriptions, by harnessing the mechanisms of the semantic layer built in at the core of the infrastructure.
     5
     6The ultimate objective of the effort -- in line with the overall mission of the infrastructure -- was to \emph{enhance search functionality} over the large heterogeneous collection of resource descriptions. This was pursued by two separate, complementary approaches: a) enriching the search capabilities with concept-based crosswalks on the schema level,
     7and b) -- acknowledging the integrative power of the \emph{Linked Open Data} paradigm -- expressing the domain data as a \emph{Semantic Web} resource.
     8
     9In line with the two approaches, the work delivered two main results: a) the \emph{specification} of the module for \emph{concept-based search} together with the underlying \emph{crosswalks service}, accompanied by a proof-of-concept implementation, and b) the blueprint for expressing the original dataset in RDF, effectively laying a foundation for providing this dataset as \emph{Linked Open Data}.
     10As a by-product, the application \emph{SMC browser} was developed -- a visualization tool for interactive exploration of the dataset. This tool provided means to generate a number of advanced analyses of the data, directly used by the community for exploration and curation of the complex dataset.  As such, they are considered the main contribution of this work by the author.
     11
  • SMC4LRT/chapters/appendix.tex

    r3551 r3638  
    44
    55
    6 \chapter{Data model ?}
     6\chapter{Data model reference}
     7In the following, complete data models and schemas are listed for reference: the diagram of the data model for data category specification in figure~\ref{fig:DCR_data_model}, \xne{Terms.xsd} -- the XML schema used by the SMC module internally -- in listing~\ref{lst:terms-schema} (cf. \ref{datamodel-terms}), and the \xne{general-component-schema.xsd}\furl{https://infra.clarin.eu/cmd/general-component-schema.xsd} -- the schema representing the CMD meta model for defining CMD profiles and components -- in listing~\ref{lst:cmd-schema}. Figure \ref{fig:ref_arch} depicts an abstract reference architecture that provides a conceptual frame for this work, and figure \ref{fig:acdh_context} gives an overview of the roles and services of the \xne{ACDH -- Austrian Centre for Digital Humanities} -- the home of SMC -- explicating the concrete current situation regarding the architectural context of SMC.
     8
    79\begin{figure*}[!ht]
    810\begin{center}
     
    1214\label{fig:DCR_data_model}
    1315\end{figure*}
     16
     17\input{images/Terms.xsd}
     18
     19\input{images/general-component-schema.xsd}
     20
    1421
    1522\begin{figure*}[!ht]
     
    2936\end{figure*}
    3037
    31 \section {SMC Reports}
    32 \label{sec:reports}
    3338
    34 SCM Reports is a (growing) set of documents analyzing specific phenomena in the CMD data domain based on the visual and numerical output from the SMC Browser \ref{smc-browser}.
     39\chapter{SMC Browser}
    3540
    3641
     42\begin{figure*}[!ht]
     43\begin{center}
     44\includegraphics[width=1\textwidth]{images/cmd-deps-graph_part2.png}
     45\end{center}
     46\caption{An early version of a visual representation of (a part of) the \xne{smc-graph} generated with the \code{dot} tool.}
     47\label{fig:cmd-dep-dotgraph}
     48\end{figure*}
     49
     50\section{SMC Browser user documentation}
     51\label{sec:smc-browser-userdocs}
     52
     53\input{chapters/userdocs_cleaned}
     54
     55
     56
     57
     58
     59\chapter{SMC Reports}
     60\label{ch:reports}
     61
     62SMC Reports is a (growing) set of documents analyzing specific phenomena in the CMD data domain based on the visual and numerical output from the SMC Browser \ref{smc-browser}.
     63
    3764\input{chapters/examples_cleaned}
  • SMC4LRT/chapters/danksagung.tex

    r2672 r3638  
    11\chapter*{Danksagung}
    22
    3 Here you may optionally insert an acknowledgement.
     3I would like to sincerely thank all the colleagues who stood by me with their advice,
     4and my loved ones for the extra portion of patience that I demanded of them.