Changeset 3638 for SMC4LRT

Timestamp:

09/30/13 11:54:57 (11 years ago)

Author:

vronk

Message:

major reorganization, detailing of Design-chapters; abstract_en

Location:

SMC4LRT/chapters

Files:

: 9 edited

Data.tex (modified) (2 diffs)
Design_SMCinstance.tex (modified) (5 diffs)
Design_SMCschema.tex (modified) (6 diffs)
Infrastructure.tex (modified) (8 diffs)
Literature.tex (modified) (4 diffs)
Results.tex (modified) (3 diffs)
abstract_en.tex (modified) (1 diff)
appendix.tex (modified) (3 diffs)
danksagung.tex (modified) (1 diff)

SMC4LRT/chapters/Data.tex

-                      r3553
+                      r3638
+\subsection{CMD-Framework}
+\subsection{Component Metadata Framework}
+\label{def:CMD}
+The \emph{Component Metadata Framework} (CMD) is the data model of the CLARIN metadata infrastructure. (See \ref{CMDI} for information about the infrastructure. The XML-schema of CMD -- the general-component-schema -- is featured in appendix \ref{lst:general-component-schema}.)
+CMD is used to define the so-called \var{profiles} being constructed out of reusable \var{components} -- collections of metadata fields. The components can contain other components and they can be reused in multiple profiles. Profile itself is just a special kind of a component (a sub class), with some additional administrative information.
+The actual core provision for semantic interoperability is the requirement that each CMD element (i.e. metadata field) refers ``via a PID to exactly one data category\footnote{persistently referenceable concept definition} (cf. \ref{def:DCR}), thus
+indicating unambiguously how the content of the field in a metadata description should be interpreted'' \cite{Broeder+2010}.
+While the primary registry for data categories used in CMD is the \xne{ISOcat} Data Category Registry (cf. \ref{def:DCR}), other authoritative sources are accepted (so-called ``trusted registries''), especially the set of terms maintained by the Dublin Core Metadata Initiative \cite{DCMI:2005}.
+Once the profiles are defined, they are transformed into an XML-Schema that prescribes the structure of the instance records.
+The generated schema also conveys as annotation the information about the referenced data categories.
 …
 .893 & MPI CGN \\
 .628 & Bavarian Archive for Speech Signals (BAS) \\
 .964 & Pacific And Regional Archive for Digital Sources in Endangered Cultures (PARADISEC) \\
+.964 & Pacific And Regional Archive for Digital Sources in Endangered Cultures\\
 .348 & WALS RefDB \\
 .689 & Lund Corpora \\

SMC4LRT/chapters/Design_SMCinstance.tex

-                      r3553
+                      r3638
 \chapter{System design - mapping on instance level}
+\chapter{Mapping on instance level, CMD as LOD}
 \label{ch:design-instance}
 \begin{quotation}
 I do think that ISOcat, CLAVAS, RELcat, an actual language
+I do think that ISOcat, CLAVAS, RELcat and actual language
 resource all provide a part of the semantic network.
 …
 relevant parts in a triple store and do your SPARQL/reasoning on it. Well
 that's where I'm ultimately heading with all these registries related to
 semantic interoperability ... I hope ;-)
+semantic interoperability ... I hope ;-)\cite{Menzo2013mail}
 \end{quotation}
+\cite{Menzo2013mail}
+Linked Data - Express dataset in RDF
+Partly as by-product of the entities-mapping effort we will get the metadata rendered in RDF, linked with
+So theoretically we then only need to provide them ``on the web'', to make them a nucleus of the LinkedData-Cloud.
+Technical aspects (RDF-store?) / interface (ontology browser?)
+\todocode{check/install: raptor for generating dot out of rdf}\furl{http://librdf.org/raptor/}
+\todocode{check/install: Linked Data browser: LoD p. 81; Haystack}\furl{http://en.wikipedia.org/wiki/Haystack_(PIM)}
+defining the Mapping:
+\begin{enumerate}
+\item convert to RDF
+translate: MDRecord $\rightarrow$ [\#mdrecord \#property literal]
+\item map: \#mdrecord \#property literal  $\rightarrow$ [\#mdrecord \#property \#entity]
+\end{enumerate}
+\begin{figure*}[!ht]
+\includegraphics[width=1\textwidth]{images/SMC_CMD2LOD}
+\caption{The process of transforming the CMD metadata records to an RDF representation}
+\label{fig:smc_cmd2lod}
+\end{figure*}
+As described in previous chapters (\ref{ch:infrastructure},\ref{ch:design_schema}), semantic interoperability is one of the main motivations for the CMD infrastructure. However, this machinery pertains mostly to the schema level; the actual values in the fields of CMD instances remain ``just strings''. This is the case even though the problem of different labels for semantically equivalent or even identical entities is even more virulent on the instance level. While for a number of metadata fields the value domain can be enforced through schema validation, some important fields (like \concept{organization} or \concept{resource type}) have a constrained value domain that nevertheless cannot be explicitly and exhaustively enumerated. This leads to a chronically inconsistent use of labels for referring to entities (as the instance data shows, some organizations are referred to by more than 20 different labels or spelling variants), prompting an urgent need for better means of harmonizing the constrained-field values.
+One potential remedy is the use of reference datasets -- controlled vocabularies, taxonomies, ontologies and such. In fact, this is a very common approach, be it the authority files in the library world, or domain-specific reference vocabularies maintained by practically every research community. Not being as strict as schema definitions, they cannot be used for validation, but they still help to harmonize the data by offering preferred labels and identifiers for entities.
+In this chapter, we explore how this general approach can be employed for our specific problem of harmonizing the (literal) values in selected instance fields and mapping them to entities defined in corresponding vocabularies. This proposal is furthermore embedded in a more general effort to \textbf{express the whole of the CMD data domain (model and instances) in RDF} constituting one large ontology interlinked with existing external semantic resources (ontologies, knowledge bases, vocabularies). This result lays a foundation for providing the original dataset as a \emph{Linked Open Data} nucleus within the \emph{Web of Data}\cite{TimBL2006}
+as well as for real semantic (ontology-driven) search and exploration of the data.
+The following section \ref{sec:cmd2rdf} lays out how individual parts of the CMD framework can be expressed in RDF.
+In \ref{sec:values2entities} we investigate in further detail the abovementioned critical aspect of the effort, namely the task of translating the string values in metadata fields to corresponding semantic entities. Finally, the technical aspects of providing the resulting ontology as LOD and the implications for an ontology-driven semantic search are tackled briefly in \ref{sec:lod} and \ref{semantic-search} respectively.
 \section{CMD to RDF}
+\label{ch:cmd2rdf}
+A few modules/components of the CMD infrastructure are dedicated to semantic interoperability. The DCR as global registry for concepts, CLAVAS for maintaining controlled vocabularies in SKOS format, RR for expressing arbitrary relations between concepts.
+However, the actual values in the CMD instances are ``just strings'' and for the most part cannot be validated by the schema, although they often could be mapped to a corresponding controlled vocabulary.
+Thus one aim of this work is to express the whole of the CMD data (model and instances) in RDF. This would allow to map the string values in selected fields to semantic entities, which in turn would allow real semantic (ontology-driven) search and bring about a linking with the web of data \todocite{Web of Data, TimBL}
+The following chapter lays out how individual parts of the CMD framework can be expressed in RDF.
+\label{sec:cmd2rdf}
+In this section, RDF encoding is proposed for all levels of the CMD data domain:
+\begin{itemize}
+\item CMD meta model
+\item profile definitions
+\item the administrative and structural information of CMD records
+\item individual values in the fields of the CMD records
+\end{itemize}
 \subsection{CMD specification}
+The meta model
+The main entity of the meta model is the CMD component and is typed as specialization of the \code{owl:Class}. CMD profile is basically a CMD component with some extra features, implying a specialization relation:
 \label{table:rdf-spec}
+\begin{example}
+cmd\_spec:Profile & subClassOf  & owl:Class. \\
+cmd\_spec:Component & subClassOf  & owl:Class. \\
+cmd\_spec:Element & subClassOf  & rdf:Property. \\
+\end{example}
+Typing the profiles, components and elements:
+\begin{example3}
+cmds:Component & subClassOf  & owl:Class. \\
+cmds:Profile & subClassOf  & cmds:Component. \\
+cmds:Element & subClassOf  & rdf:Property. \\
+\end{example3}
+\noindent
+These entities are used for typing the actual profiles, components and elements (as they are defined in the Component Registry):
 \label{table:rdf-cmd}
 \begin{example}
 cmd:collection & a & cmd\_spec:Profile; \\
  & rdfs:label & `collection'; \\
+\begin{example3}
+cmd:collection & a & cmds:Profile; \\
+ & rdfs:label & "collection"; \\
  & dcterms:identifier & cr:clarin.eu:cr1:p\_1345561703620. \\
 cmd:Actor       & a & cmd\_spec:Component. \\
 cmd:LanguageName  & a & cmd\_spec:Element. \\
 \end{example}
+cmd:Actor       & a & cmds:Component. \\
+cmd:LanguageName  & a & cmds:Element. \\
+\end{example3}
 \begin{note}
 Should the ID assigned in the component registry  for the CMD entities  used as ID in rdf, or rather the verbose name? (if yes, how to ensure uniqueness -- generate the name from the cmd-path?)
+Should the ID assigned in the Component Registry  for the CMD entities be used as identifier in RDF, or rather the verbose name? (if yes, how to ensure uniqueness -- generate the name from the cmd-path?)
 \end{note}
 \subsection{Data Categories}
+Windhouwer (2012) proposes to use the data categories as annotation properties.
+Definition of the annotation property \code{dcr:datcat}
+\begin{example}
+Windhouwer \cite{Windhouwer2012_LDL} proposes to use the data categories as annotation properties:
+\begin{example3}
 dcr:datcat & a  & owl:AnnotationProperty ; \\
  & rdfs:label  & "data category"@en ; \\
+ & rdfs:comment  & "This resource is equivalent to  \\
+this data category."@en ; \\
+ & skos:note  & "The data category should be  \\
+ &   & identified by its PID."@en ; \\
+\end{example}
+Still, leaving open the possibility for ``a stronger semantic link'':
+\begin{quotation}
+By default the RDF export inserts \code{dcr:datcat} annotation properties to maintain the link between the generated RDF resources and the used Data Categories. However, it is possible to also maintain a stronger semantic link when the RDF resources will be used as OWL (2) classes, properties or individuals.
+\end{quotation}
+For classes the OWL 2 \code{owl:equivalentClass} can be used, for example:
+\begin{example}
+\#myPOS & owl:equivalentClass & isocat:DC-1345. \\
+\end{example}
+For properties OWL 2 provides \code{owl:equivalentProperty}, for example:
+\begin{example}
+\#myPOS & owl:equivalentProperty & isocat:DC-1345. \\
+\end{example}
+Finally \code{owl:sameAs} can be used for individuals, for example:
+\begin{example}
+\#myNoun & owl:sameAs & isocat:DC-1333. \\
+\end{example}
+ISOcat provides an RDF representation of the data categories:
+\begin{example}
+ & rdfs:comment  & "This resource is equivalent to  this data category."@en ; \\
+ & skos:note  & "The data category should be identified by its PID."@en ; \\
+\end{example3}
+That implies that the \code{@ConceptLink} attribute on CMD elements and components as used in the CMD profiles to reference the data category would be modelled as:
+\begin{example3}
+cmd:LanguageName & dcr:datcat & isocat:DC-2484. \\
+\end{example3}
+Encoding data categories as annotation properties is in contrast to the common approach seen with dublincore terms
+used usually directly as data properties:
+\begin{example3}
+<lr1> & dc:title & "Language Resource 1"
+\end{example3}
+\noindent
+Analogously, we could model \xne{ISOcat} data categories as data properties, i.e. metadata elements referencing ISOcat data categories could be encoded as follows:
+\begin{example3}
+<lr1> & isocat:DC-2502 & "19th century"
+\end{example3}
+\noindent
+However, Windhouwer\cite{Windhouwer2012_LDL} argues against direct mapping of complex data categories to data properties and in favour of modelling data categories as annotation properties, so as to avoid too strong semantic implications.
+This raises the vice-versa question, whether to rather handle all data categories uniformly, which would mean encoding dublincore terms also as annotation properties, but the pragmatic view dictates to encode the data in line with the prevailing approach, i.e. express dublincore terms directly as data properties.
+\noindent
+The REST web service of \xne{ISOcat} provides an RDF representation of the data categories:
+\begin{example3}
 isocat:languageName & dcr:datcat & isocat:DC-2484; \\
  & rdfs:label & "language name"@en; \\
  & rdfs:comment & "A human understandable..."@en; \\
  & \ldots  \\
+\end{example}
+\end{example3}
+However this is only meant as template, as is stated in the explanatory comment of the exported data:
+\begin{quotation}
+By default the RDF export inserts \code{dcr:datcat} annotation properties to maintain the link between the generated RDF resources and the used Data Categories. However, it is possible to also maintain a stronger semantic link when the RDF resources will be used as OWL (2) classes, properties or individuals.
+\end{quotation}
+So in a specific (OWL 2) application the relation with the data categories can be expressed as \code{owl:equivalentClass} for classes, \code{owl:equivalentProperty} for properties or \code{owl:sameAs} for individuals:
+\begin{example3}
+\#myPOS & owl:equivalentClass & isocat:DC-1345. \\
+\#myPOS & owl:equivalentProperty & isocat:DC-1345. \\
+\#myNoun & owl:sameAs & isocat:DC-1333. \\
+\end{example3}
+\subsection{RELcat - Ontological relations}
+As described in \ref{def:rr} relations between data categories are not stored directly in the \xne{ISOcat} DCR, but rather in a dedicated module the Relation Registry \xne{RELcat}. The relations here are grouped into relation sets and stored as RDF triples\cite{SchuurmanWindhouwer2011}. A sample relation from the \xne{CMDI} relation set expressing a number of equivalences between \xne{ISOcat} data categories and \xne{dublincore} terms:
+\begin{example3}
+isocat:DC-2538 & rel:sameAs & dct:date
+\end{example3}
+\noindent
+By design, the relations in Relation Registry are not expressed with predicates from known vocabularies like \xne{SKOS} or \xne{OWL}, again with the aim to avoid too strong semantic implications. This leaves leeway for further specialization of the relations in specific applications.
 \begin{note}
+Output from isocat is only meant as template!
+In the RDF representation, the data categories seem to be referenced by their mnemonicIdentifier (rdf:ID=``languageName'') how is this guaranteed URI and how is the data category meant to be referred to?
+Does this mean, that I would say:
+\begin{example3}
+rel:sameAs & owl:equivalentProperty & owl:sameAs
+\end{example3}
+to enable the inference of the equivalences?
+Is this correct:
 \end{note}
+Finally, the ConceptLink attribute used in the CMD profiles to reference the data category is modelled as:
+\begin{example}
+cmd:LanguageName & dcr:datcat & isocat:DC-2484. \\
+\end{example}
+?? That means, that to be able to infer that a value in a CMD element also pertains to a given data category, e.g.:
+\begin{example2}
+ cmd:PublicationYear = 2012 $\rightarrow$ & dc:created = 2012
+\end{example2}
+\noindent
+the following facts need to be present in the ontology:
+\begin{example3}
+<lr1> & cmd:PublicationYear & 2012\^{}\^{}xs:year \\
+cmd:PublicationYear &  owl:equivalentProperty & isocat:DC-2538 \\
+isocat:DC-2538 & rel:sameAs & dc:created \\
+rel:sameAs & owl:equivalentProperty &  owl:sameAs \\
+$\rightarrow$ \\
+<lr1> & dc:created & 2012\^{}\^{}xs:year \\
+\end{example3}
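The inference chain above can be traced with a few lines of code. The following is only a minimal sketch, assuming a naive forward-chaining loop over plain tuples; the function name \code{infer} and the two rewrite rules (property equivalence and \code{owl:sameAs} applied to predicates) are simplifying assumptions for illustration, not the behaviour of any particular reasoner, and the datatype of the literal is omitted.

```python
OWL_EQP = "owl:equivalentProperty"
OWL_SAME = "owl:sameAs"


def infer(facts):
    """Naive forward chaining: propagate statements across equivalent predicates.

    Assumption: both owl:equivalentProperty and owl:sameAs between two
    predicates p and q are read as "every triple (s, p, o) also holds as
    (s, q, o)", in both directions.
    """
    closure = set(facts)
    while True:
        # Collect all predicate pairs (p, q) that license a rewrite.
        rewrites = set()
        for s, p, o in closure:
            if p in (OWL_EQP, OWL_SAME):
                rewrites.add((s, o))
                rewrites.add((o, s))
        # Apply every rewrite to every triple; keep only unseen triples.
        new = {
            (s, q, o)
            for (s, p, o) in closure
            for (p2, q) in rewrites
            if p == p2
        } - closure
        if not new:
            return closure
        closure |= new


facts = {
    ("<lr1>", "cmd:PublicationYear", "2012"),
    ("cmd:PublicationYear", OWL_EQP, "isocat:DC-2538"),
    ("isocat:DC-2538", "rel:sameAs", "dc:created"),
    ("rel:sameAs", OWL_EQP, OWL_SAME),
}
closure = infer(facts)
```

In this sketch the fourth fact is what turns the application-neutral \code{rel:sameAs} into something the loop can act on; without it, no \code{dc:created} statement is derived.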
+\noindent
+What about other relations we may want to express? (Do we need them and if yes, where to put them? -- still in RR?) Examples:
+\begin{example3}
+cmd:MDCreator   & rdfs:subClassOf & dcterms:Agent \\
+clavas:Organization & rdfs:subClassOf & dcterms:Agent \\
+<org1> & a & clavas:Organization \\
+\end{example3}
 \subsection{CMD instances}
+In the next step, we want to express the individual CMD instances, the metadata records, making use of the previously defined entities on the schema level, but also entities from external ontologies.
 \subsubsection {Resource Identifier}
 It seems natural to use the PID of a Language Resource ( \code{<lr1>} ) as the resource identifier for the subject in the RDF representation. While this seems semantically sound, not every resource has to have a PID . Alternatively we could use the PID of the MD record ( \code{<lr1.cmd>}  from \code{<cmd:MdSelfLink>}) as the resource identifier.
 The relationship between the resource and the metadata record could be expressed as an annotation :
 \begin{example}
+It seems natural to use the PID of a Language Resource ( \code{<lr1>} ) as the resource identifier for the subject in the RDF representation. While this seems semantically sound, not every resource has to have a PID. (This is especially the case for ``virtual'' resources like collections, that are solely defined by their constituents and don't have any data on their own.) As a fall-back the PID of the MD record ( \code{<lr1.cmd>}  from \code{cmd:MdSelfLink} element) could be used as the resource identifier.
+If identifiers are present for both resource and metadata, the relationship between the resource and the metadata record can be expressed as an annotation using the \xne{OpenAnnotation} vocabulary\furl{http://openannotation.org/spec/core/core.html\#Motivations}:
+\begin{example3}
 \_:anno1  & a & oa:Annotation; \\
  & oa:hasTarget  & <lr1>; \\
  & oa:hasBody  & <lr1.cmd>; \\
  & oa:motivatedBy  & oa:describing \\
+\end{example}
+\subsection{Provenance}
+Use the information from CMD-Header for information about the modelled data  :
+\begin{example}
+<lr1.cmd>
+ & dcterms:identifier  & <lr1.cmd>;  \\
+ & dcterms:creator ??  & "\{<cmd:MdCreator>\}";  \\
+\end{example}
+Other proposed fields:
+\begin{example}
+ & dcterms:publisher  & <http://clarin.eu>,  \\
+ & <provider-oai-accesspoint>; ?? \\
+ & dcterms:created/modified "\{<cmd:MdCreated>\}" ?? \\
+\end{example}
+\end{example3}
+\subsubsection{Provenance}
+The information from \code{cmd:Header} represents the provenance information about the modelled data:
+\begin{example3}
+<lr1.cmd> & dcterms:identifier  & <lr1.cmd>;  \\
+ & dcterms:creator ??  & "\var{\{cmd:MdCreator\}}";  \\
+ & dcterms:publisher  & <http://clarin.eu>, <provider-oai-accesspoint>; ?? \\
+ & dcterms:created /dcterms:modified? & "\var{\{cmd:MdCreated\}}" ?? \\
+\end{example3}
 \subsubsection{Hierarchy (Resource Proxy -- IsPartOf)}
+In CMD, <cmd:ResourceProxyList> is used to express both collection hierarchy and point to resource(s) described by the MD record. This can be modeled as OAI-ORE Aggregation\furl{http://www.openarchives.org/ore/1.0/primer\#Foundations}
+\furl{http://openannotation.org/spec/core/core.html\#Motivations}
+In CMD, the \code{cmd:ResourceProxyList} structure is used to express both collection hierarchy and point to resource(s) described by the MD record. This can be modelled as \xne{OAI-ORE Aggregation}\furl{http://www.openarchives.org/ore/1.0/primer\#Foundations}
+:
 \begin{example}
+\begin{example3}
 <lr0.cmd>  & a   & ore:ResourceMap \\
 <lr0.cmd> & ore:describes & <lr0.agg> \\
 <lr0.agg> & a   & ore:Aggregation \\
 ore:aggregates  & <lr1.cmd>, <lr2.cmd>; \\
 \end{example}
+& ore:aggregates  & <lr1.cmd>, <lr2.cmd>; \\
+\end{example3}
+\noindent
 ?? Should both collection hierarchy and resource-pointers (collection and resource MD records) be encoded as ore:Aggregation?
 Additionally the flat header field <cmd:MdCollectionDisplayName> has been introduced to indicate by simple means the collection, of which given resource is part.
 This information can be used to generate a separate one-level grouping of the resources, in which the value from the <cmd:MdCollectionDisplayName> element would be used as the label of an otherwise undefined ore:ResourceMap.
+Additionally the flat header field \code{cmd:MdCollectionDisplayName} has been introduced to indicate by simple means the collection, of which given resource is part.
+This information can be used to generate a separate one-level grouping of the resources, in which the value from the \code{cmd:MdCollectionDisplayName} element would be used as the label of an otherwise undefined \code{ore:ResourceMap}.
 Even the identifier/URI for these collections is not clear. Although these collections should match the ResourceProxy hierarchy, there is no guarantee for this, thus a 1:1 mapping cannot be expected.
 \todocode{check consistency for MdCollectionDisplayName vs. IsPartOf in the instance data}
 \begin{example}
+\begin{example3}
 \_:mdcoll  & a   & ore:ResourceMap; \\
  & rdfs:label & "Collection 1"; \\
 \_:mdcoll\#aggregation & a   & ore:Aggregation \\
+\_:mdcoll\#aggreg & a   & ore:Aggregation \\
  & ore:aggregates  & <lr1.cmd>, <lr2.cmd>; \\
+\end{example}
+\end{example3}
 \subsubsection{Components -- nested structures}
+\begin{note}
+?? Model (instance) components as blank nodes via objectProperty:
+\end{note}
+\begin{example}
+There are two variants to express the tree structure of the CMD records, i.e. the containment relation between the components:
+\begin{enumerate}[a)]
+\item the components are encoded as object property
+\begin{example3}
 <lr1>  & cmd:Actor  & \_:Actor1 \\
 <lr1>  & cmd:Actor  & \_:Actor2 \\
 …
 \_:Actor1  & cmd:role & "Interviewer" \\
 \_:Actor2 & cmd:role & "Speaker" \\
+\end{example}
+?? or rather as Classes (and express the containment hierarchy with some extra predicate):
+\begin{example}
+\end{example3}
+\item a dedicated object property is used
+\begin{example3}
 \_:Actor1  & a & cmd:Actor \\
 <lr1> & cmd:contains & \_:Actor1 \\
+\end{example}
+\subsubsection{Elements, Fields, Values}
+There are two steps to the modeling of the actual values in the fields of CMD records in RDF. The first one is to express the values as triples with literal values, then for selected fields -- using the literal values -- try to find corresponding entities in appropriate controlled vocabularies and generate new triples.
+There seems to be a need for a separate property (predicate) for fields that are mapped to entities, like:
+\begin{example}
+<lr1> & cmd:Organisation & "MPI" \\
+<lr1> & cmd:Organisation\_? & <org1> \\
+\end{example}
+%\subsubsection{Literal Values}
+\paragraph{Literal Values}
+Usually, RDF-mapping of dublincore descriptions is to data properties (cf. OLAC-DcmiTerms profile )
+\begin{example}
+<lr1> & dct:title & "Language Resource 1"
+\end{example}
+Analogously, we could model isocat data categories  as data properties . Metadata elements referencing ISOcat datacategories could be encoded as follows:
+\begin{example}
+<lr1> & isocat:DC-2502 & "19th century"
+\end{example}
+However, Windhouwer (2012) argues against direct mapping of complex data categories to data properties, but proposes to rather model data categories as annotation properties.
+\begin{example}
+cmd:timeCoverage  & a   & cmd\_spec:Element \\
+\end{example3}
+\end{enumerate}
+\subsection{Elements, Fields, Values}
+Finally, we want to integrate also the actual field values in the CMD records into the ontology.
+\subsubsection{Predicates}
+As explained before CMD elements are typed as \code{rdf:Property} with the corresponding data category expressed as annotation property:
+\begin{example3}
+cmd:timeCoverage  & a   & cmds:Element \\
 cmd:timeCoverage  & dcr:datcat  & isocat:DC-2502 \\
 <lr1>  & cmd:timeCoverage  & "19th century" \\
+...
+\end{example}
+This raises the vice-versa question, whether to rather handle all data categories uniformly, thus encoding dublincore terms also as annotation properties.
+%\subsubsection{Mapping to entities -- Vocabularies -- CLAVAS}
+\paragraph{Mapping to entities -- Vocabularies -- CLAVAS}
+A major (if not the main) motivation for the CMD to RDF mapping is the wish to have better control over  and better quality of values in metadata fields with constrained value domain like organization or resource type. As the allowed values for these fields often cannot be explicitly enumerated, it is not possible to restrict them by means of an XML schema. This leads to inconsistent use of labels for referring to entities. (As the instance data shows, some organizations are referred to by more than 20 different labels.)
+Thus, one goal of this work is to map (string) values in selected fields to entities defined in corresponding vocabularies. The main provider of relevant vocabularies is ISOcat and CLAVAS -- a service for managing and providing vocabularies in SKOS format. Closed and corresponding simple data categories are already being exported from ISOcat in SKOS format and imported into CLAVAS/OpenSKOS and also other relevant vocabularies shall be ingested into this system, so that for our purposes we can assume OpenSKOS as the one source of vocabularies.
+Data in OpenSKOS is modelled purely in SKOS, so there is no more specific typing of the entities in the vocabularies, but rather all the entities are \xne{skos:Concepts}:
+\begin{example}
+\end{example3}
+\subsubsection{Literal values -- data properties}
+To generate triples with literal values is straightforward:
+\begin{definition}{Literal triples}
+lr:Resource \ \quad cmds:Property \ \quad xsd:string
+\end{definition}
+\begin{example3}
+<lr1> & cmd:Organisation & "MPI" \\
+\end{example3}
+\subsubsection{Mapping to entities -- object properties}
+The more challenging but also more valuable aspect is to generate objectProperty triples with the literal values mapped to semantic entities:
+\begin{definition}{new RDF triples}
+lr:Resource \ \quad cmd:Property \ \quad xsd:anyURI
+\end{definition}
+\begin{example3}
+<lr1> & cmd:Organisation\_? & <org1> \\
+\end{example3}
+\begin{note}
+Don't we need a separate property (predicate) for the triples with object properties pointing to entities,
+i.e. \code{cmd:Organisation\_} additionally to \code{cmd:Organisation}
+\end{note}
+The mapping process is detailed in \ref{sec:values2entities}
+%%%%%%%%%%%%%%%%%55
+\section{Mapping field values to semantic entities}
+\label{sec:values2entities}
+This task is a prerequisite to be able to express also the CMD instance data in RDF. The main idea is to find entities in selected reference datasets (controlled vocabularies, ontologies) matching the literal values in the metadata records. The obtained entity identifiers are further used to generate new RDF triples. It involves the following steps:
+\begin{enumerate}
+\item identify appropriate controlled vocabularies for individual metadata fields or data categories (manual task)
+\item extract \emph{distinct data category, value pairs} from the metadata records
+\item actual \textbf{lookup} of the individual literal values in given reference data (as indicated by the data category) to retrieve candidate entities, concepts
+\item assess the reliability of the match
+\item generate new RDF triples with entity identifiers as object properties
+\end{enumerate}
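Steps 2 and 5 of the list above are purely mechanical and can be sketched in a few lines. This is a minimal illustration under stated assumptions: the metadata records are taken to be already parsed into (resource, data category or element, literal) tuples, the result of the lookup and evaluation steps is taken to be a plain dictionary, and the names \code{distinct\_pairs} and \code{entity\_triples} are hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical in-memory form of one metadata field:
# (resource identifier, data category or CMD element, literal value)
Field = Tuple[str, str, str]


def distinct_pairs(fields: Iterable[Field]) -> Set[Tuple[str, str]]:
    """Step 2: collect the distinct (data category, value) pairs."""
    return {(category, value) for _, category, value in fields}


def entity_triples(
    fields: Iterable[Field], mapping: Dict[Tuple[str, str], str]
) -> List[Tuple[str, str, str]]:
    """Step 5: emit a triple pointing to an entity for every mapped value.

    Values without an entry in the mapping are skipped; they keep only
    their literal triple.
    """
    triples = []
    for resource, category, value in fields:
        entity = mapping.get((category, value))
        if entity is not None:
            triples.append((resource, category, entity))
    return triples


# Illustrative data only.
fields = [
    ("<lr1>", "cmd:Organisation", "MPI"),
    ("<lr2>", "cmd:Organisation", "MPI"),
    ("<lr2>", "cmd:Organisation", "Unknown Lab"),
]
# Assumed outcome of steps 3 and 4 (lookup and reliability assessment).
mapping = {("cmd:Organisation", "MPI"): "<org1>"}
```

Working on distinct pairs keeps the number of lookups proportional to the number of different labels in the data, not to the number of records.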
+\begin{figure*}[!ht]
+\includegraphics[width=1\textwidth]{images/SMC_CMD2LOD}
+\caption{Sketch of the process of transforming the CMD metadata records to an RDF representation}
+\label{fig:smc_cmd2lod}
+\end{figure*}
+\subsubsection{Identify vocabularies -- CLAVAS}
+\todoin{Identify related ontologies, vocabularies? - see DARIAH:CV}
+LT-World \cite{Joerg2010}
+One generic way to indicate vocabularies for given metadata fields or data categories being discussed in the CMD community is to use a dedicated annotation property (tentatively \code{@clavas:vocabulary}) in the schema or data category definition. For such a mechanism to work, the consuming applications (like a metadata editor) need to be made aware of this convention and interpret it accordingly.
+The primary provider of relevant vocabularies is \xne{ISOcat} and \xne{CLAVAS} -- a service for managing and providing vocabularies in SKOS format (cf. \ref{def:CLAVAS}). Closed and corresponding simple data categories are already being exported from ISOcat in SKOS format and imported into CLAVAS/OpenSKOS and also other relevant vocabularies shall be ingested into this system, so that we can assume OpenSKOS as a first source of vocabularies. However, by no means all of the existing reference data will be hosted by OpenSKOS, so in general we have to consider a number of different sources (cf. \ref{refdata}).
+Data in OpenSKOS is modelled purely in SKOS, so there is no more specific typing of the entities in the vocabularies, but rather all the entities are \code{skos:Concepts}:
+\begin{example3}
 <org1> & a   & skos:Concept \\
+\end{example}
+We may want to add some more typing and introduce classes for entities from individual vocabularies like clavas:Organization or similar.
+As far as CLAVAS will also maintain mappings/links to other datasets:
+\begin{example}
+<org1>   skos:exactMatch    <dbpedia/org1>, <lt-world/orgx>;
+\end{example}
+\end{example3}
+\noindent
+We may want to add some more typing and introduce classes for entities from individual vocabularies like \code{clavas:Organization} or similar. As far as CLAVAS will also maintain mappings/links to other datasets
+\begin{example3}
+<org1> & skos:exactMatch  & <dbpedia/org1>, <lt-world/orgx>;
+\end{example3}
+\noindent
 we could use it to expand the data with alternative identifiers, fostering the interlinking of data:
+\begin{example}
+<org1>   dcterms:identifier <org1>, <dbpedia/org1>, <lt-world/orgx>;
+\end{example}
+\paragraph{Mapping from strings to Entities}
+Find matching entities in selected Ontologies based on the textual values in the metadata records.
+Identify related ontologies:
+LT-World \cite{Joerg2010}
+task:
+\begin{enumerate}
+\item  express MDRecords in RDF
+\item  identify related ontologies/vocabularies (category $\rightarrow$ vocabulary)
+\item  use a lookup/mapping function (Vocabulary Alignment Service? CATCH-PLUS?)
+%\fbox{ function lookup: Category x String -> ConceptualDomain}
+\begin{eqnarray*}
+lookup(Category, Literal) \rightarrow ConceptualDomain??
+\end{eqnarray*}
+Normally this would be served by dedicated controlled vocabularies, but expect also some string-normalizing preprocessing etc.
+\end{enumerate}
+\subsection{RELcat - Ontological relations}
+Information in RELcat is already stored in RDF \cite{SchuurmanWindhouwer2011}.  One relation from the example relation set for CMDI :
+\begin{example}
+isocat:DC-2538 rel:sameAs dct:date
+\end{example}
+Should we generate the redundant triples based on the relations defined between data categories?  I.e.  if there is a relation and a resource has value:
+\begin{example}
+<lr1> isocat:DC-2538 2012\^{}\^{}xs:year
+\end{example}
+should we generate
+\begin{example}
+<lr1> dct:date 2012\^{}\^{}xs:year
+\end{example}
+?
+What about other relations we may want to express? (Do we need them and if yes, where to put them? -- still in RR?) Examples:
+\begin{example}
+cmd:MDCreator   & rdfs:subClassOf & dcterms:Agent \\
+clavas:Organization & rdfs:subClassOf & dcterms:Agent \\
+<org1> & a & clavas:Organization \\
+\end{example}
+\begin{example3}
+<org1>  & dcterms:identifier  & <org1>, <dbpedia/org1>, <lt-world/orgx>;
+\end{example3}
+\subsubsection{Lookup}
+In abstract terms, the lookup function takes as input the identifier of a data category (or CMD element) and a literal string value and returns a list of potentially matching entities. Before the actual lookup, there may have to be some string-normalizing preprocessing.
+\begin{definition}{signature of the lookup function}
+lookup \ ( \ DataCategory \ ,  \ Literal \ )  \quad \mapsto \quad ( \ Concept \ | \ Entity \ )*
+\end{definition}
+In the implementation, there needs to be additional initial configuration input, identifying datasets for given data categories,
+which will be the result of the previous step.
+\begin{definition}{Required configuration data mapping each data category to the available datasets}
+DataCategory \quad \mapsto \quad Dataset+
+\end{definition}
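+The two signatures above can be sketched as follows. This is a minimal illustration, not the proposed implementation: the configuration table, the in-memory dataset and the entity identifiers are hypothetical placeholders, and the normalization (lower-casing, accent stripping, whitespace collapsing) is just one possible choice of preprocessing.

```python
import unicodedata

# Hypothetical configuration: DataCategory -> Dataset+
DATASETS_FOR_CATEGORY = {
    "organization": ["organizations"],
}

# Hypothetical in-memory datasets: normalized label -> entity identifiers
DATASETS = {
    "organizations": {
        "academy of sciences": ["<org1>", "<org2>"],
    },
}


def normalize(literal):
    """Lower-case, strip accents and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", literal)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return " ".join(without_accents.lower().split())


def lookup(data_category, literal):
    """lookup(DataCategory, Literal) -> (Concept | Entity)*"""
    key = normalize(literal)
    candidates = []
    for dataset_name in DATASETS_FOR_CATEGORY.get(data_category, []):
        candidates.extend(DATASETS[dataset_name].get(key, []))
    return candidates
```

+A data category without configured datasets simply yields an empty candidate list, so callers can fall back to keeping the original string value.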
+As for the implementation, in the initial setup the system could resort to the \code{find}-interface provided by \xne{OpenSKOS}.
+However, in the long term a more general solution is required: a kind of hybrid \emph{vocabulary proxy service} that allows searching in a number of datasets, many of them distributed and available via different interfaces. Figure \ref{fig:vocabulary_proxy} sketches the general setup. The service has to be able to a) proxy search requests to a number of search interfaces (SRU, SPARQL), and b) fetch, cache and search in datasets.
+\begin{figure*}[!ht]
+\includegraphics[width=1\textwidth]{images/VocabularyProxy_clientapp}
+\caption{Sketch of a general setup for vocabulary lookup via a \xne{VocabularyProxy} service}
+\label{fig:vocabulary_proxy}
+\end{figure*}
+\subsubsection{Candidate evaluation}
+The lookup is the most sensitive step in the process, as that is the gate between strings and semantic entities. In general, the resulting candidates cannot be seen as reliable and should undergo further scrutiny to ensure that the match is semantically correct.
+One example: A lookup with the pair \code{<organization, "Academy of sciences">} would probably return a list of organizations, as there is a national Academy of Sciences in a number of countries. It would require further heuristics, e.g. checking the corresponding department, contact or -- less reliably -- the language of the described resource, to determine which specific Academy of Sciences is meant in a given resource description.
+In some situations these ambiguities can be resolved algorithmically, but in many cases human curation of the generated data will ultimately be required. In this respect, it is worth noting that the CLARIN search engine VLO provides a feedback link that allows even ordinary users to report problems or inconsistencies in CMD records.
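+One conceivable heuristic for this evaluation step is sketched below: candidates are ranked by how many context attributes they share with the metadata record, and the result is flagged for human curation whenever no candidate stands out. The attribute names (\code{department}, \code{language}) and the sample values are hypothetical; a real implementation would have to weight the attributes according to their reliability.

```python
def rank_candidates(candidates, record_context):
    """Order candidates by the number of context attributes shared with the record.

    Returns (ranked_candidates, needs_curation).
    """
    def score(candidate):
        return sum(
            1
            for key, value in record_context.items()
            if candidate.get(key) == value
        )

    scored = [(score(c), c) for c in candidates]
    best = max((s for s, _ in scored), default=0)
    # sorted() is stable, so ties keep their original order.
    ranked = [c for _, c in sorted(scored, key=lambda pair: -pair[0])]
    # Ask for human curation if nothing matched or the top score is shared.
    ties = sum(1 for s, _ in scored if s == best)
    needs_curation = best == 0 or ties > 1
    return ranked, needs_curation


# Hypothetical candidates returned by a lookup for "Academy of sciences"
candidates = [
    {"id": "<org1>", "department": "Institute A", "language": "de"},
    {"id": "<org2>", "department": "Institute B", "language": "cs"},
]
ranked, needs_curation = rank_candidates(
    candidates, {"department": "Institute B", "language": "cs"}
)
```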
+%%%%%%%%%%%%%%%%%%%%%
 \section{SMC LOD - Semantic Web Application}
+\label{sec:lod}
 \todoin{read: Europeana RDF Store Report}
+Technical aspects (RDF-store?): Virtuoso
 \todocode{install Jena +  fuseki}\furl{http://jena.apache.org}\furl{http://jena.apache.org/documentation/serving_data/index.html}\furl{http://csarven.ca/how-to-create-a-linked-data-site}
 …
 \todocode{check install siren}\furl{http://siren.sindice.com/}
+\todocode{check/install: raptor for generating dot out of rdf}\furl{http://librdf.org/raptor/}
+\todocode{check/install: Linked Data browser: LoD p. 81; Haystack}\furl{http://en.wikipedia.org/wiki/Haystack_(PIM)}
+ / interface (ontology browser?)
 semantic search component in the Linked Media Framework
 …
 \section {Full semantic search - concept-based + ontology-driven ?}
+\label{semantic-search}
 With the new enhanced dataset, as detailed in section \ref{ch:cmd2rdf}, the groundwork is laid for the full-blown semantic search as proposed in the original goals, i.e. the possibility for ontology-driven or at least `semantic resources assisted' exploration of the dataset.
 Namely to enhance it by employing ontological resources.
+This enhancement mainly means that the user can access the data indirectly by browsing one or multiple ontologies, with which the data will then be linked. These could be, for example, ontologies of Organizations and Projects.
+SPARQL
+rechercheisidore, dbpedia, ...
 \section{Summary}
+In this chapter, an expression of the whole of the CMD data domain in RDF was proposed, with special focus on how to translate the string values in metadata fields to corresponding semantic entities. Additionally, some technical considerations were discussed regarding exposing this dataset as Linked Open Data and the implications for real semantic ontology-based data exploration.

SMC4LRT/chapters/Design_SMCschema.tex

-                      r3553
+                      r3638
 \chapter{Concept-based mapping on schema level -- system design}
+\chapter{System design -- concept-based mapping on schema level}
 \label{ch:design}
 In this chapter, we define the part of the proposed system pertaining to the schema level: the concept-based crosswalk and search functionality -- the tasks that the Semantic Mapping Component was originally conceived for within the larger CMD Infrastructure (cf. \ref{def:CMDI}) -- and, additionally,  the aspect of visualization of schema-level (model) data.
 We start by drawing a global view on the system, introducing its individual components and the dependencies among them.
 In the next section, the internal data model is presented and explained. In section \ref{sec:cx} the design of the actual main service for resolving crosswalks is described, divided into the interface specification and actual implementation. In section \ref{def:concept_search} we elaborate on a search functionality that builds upon the aforementioned service in terms of appropriate query language, a search engine to integrate the search in and the peculiarities of the user interface that could support these enhanced search possibilities. Finally, in section \ref{smc-browser} an advanced interactive user interface for exploring the CMD data domain is proposed.
+In this chapter, we define the main function of the proposed system -- the \textbf{concept-based crosswalk and search functionality} -- the tasks that the Semantic Mapping Component was originally conceived for within the larger CMD Infrastructure (cf. \ref{def:CMDI}). Additionally we explore the related aspect of analytic visualization of the processed data.
+We start by drawing an overall view of the system, introducing its individual components and the dependencies among them.
+In the next section, the internal data model is presented and explained. In section \ref{def:cx} the design of the actual main service for serving crosswalks is described, divided into the interface specification and notes on the actual implementation. In section \ref{def:qx} we elaborate on a search functionality that builds upon the aforementioned service in terms of appropriate query language, a search engine to integrate the search in and the peculiarities of the user interface that could support these enhanced search possibilities. Finally, in section \ref{smc-browser} an advanced interactive user interface for exploring the CMD data domain is proposed.
 \section{System Architecture}
+The Semantic Mapping module is based on the DCR and CMD framework (cf. section \ref{def:DCR})
+and is being developed as a separate service on the side of CLARIN  Metadata Service, its primary consuming service, but shall be equally usable by other applications.
+The SMC module is part of the CMD Infrastructure. It is a consumer of data from the production-side registries and serves search services on the exploitation side of the infrastructure, as well as third party applications accessing the joint CLARIN metadata domain.
 \begin{figure*}[!ht]
 …
 \end{figure*}
+The SMC module can be broken down into the following components:
 \begin{description}
 \item[crosswalk service] the main service translating between indexes, detailed in \ref{sec:cx}
 \item[concept-based query expansion]
+\item[crosswalk service] the basic service translating between fields (or indexes), detailed in \ref{def:cx}
+\item[concept-based query expansion] a module for query expansion based on the crosswalks
 \item[smc-xsl] set of xslt-stylesheets (governed by a build-file) for pre- and post-processing the data
 \item[SMC Browser] a web application to explore the CMD data domain consisting of the two modules: \xne{smc-stats} and \xne{smc-graph}
 …
 \end{description}
+The component diagram in \ref{fig:smc_modules} depicts the dependencies between the components of the system. The \xne{crosswalk service} uses the set of XSL-stylesheets \xne{smc-xsl} and accesses the CMDI registries: \xne{Component Registry}, \xne{ISOcat DCR} and \xne{RELcat} to retrieve the data. It exposes an interface \xne{cx} to be used by third party applications. The \xne{query expansion} module uses the crosswalk service to rewrite queries, also exposing a corresponding API \xne{qx}.
+\xne{SMC Browser} consists of two parts, \xne{smc-stats} and \xne{smc-graph}, and also uses the set of stylesheets for processing the data. \xne{smc-graph} is built on top of a library for interactive visualization of graphs.
 For broader context see the reference architecture diagram in Figure \ref{fig:ref_arch}.
+\section{Data model - Terms}
+\section{Data model}
+Before we get to the definition of the actual service, we define the internal data model, divided into two parts:
+\begin{description}
+\item[smcIndex] a data type for denoting indexes in a human-readable way used internally and as input and output format of the service
+\item[Terms.xsd] the schema for internal representation of the processed data
+\end{description}
+\subsection{smcIndex}\label{def:smcIndex}
+In this section, we describe \code{smcIndex} -- the data type to denote indexes used by the components of the system internally, as well as input and output on the interfaces.
+An \code{smcIndex} is a human-readable string adhering to a specific syntax, denoting some search index. The syntax draws on two ideas from existing work: a) denoting a context by a prefix, derived from the way indices are referenced in CQL-syntax\footnote{Contextual Query Language, \url{http://www.loc.gov/standards/sru/specs/cql.html}} (analogous to the XML-namespace mechanism), e.g. \concept{dc.title}, and b) the dot-notation used in IMDI-browser\footnote{\url{http://www.lat-mpi.eu/tools/imdi}} to denote paths into structured data (analogous to XPath), e.g. \concept{Session.Location.Country}. The grammar generates only single terms, which may not contain whitespace.
+\begin{defcap}
+\caption{Grammar of \code{smcIndex}}
+\begin{align*}
+smcIndex &::= dcrIndex \ | \ cmdIndex  \\
+dcrIndex &::= dcrID \ contextSep \ datcatLabel \\
+            & \quad \quad   | \  [\ dcrID \ contextSep \ ] \ datcatID \\
+cmdIndex &::= profile  \\
+                    &    \quad \quad  | \  cmdEntityId \\
+                      &   \quad \quad | \  [\ profile \ contextSep \ ] \ dotPath \\
+profile &::= profileName \ [ \ \texttt{\#} \ profileID \ ] \\
+dotPath  &::= [\ dotPath \ pathSep \ ] \ elemName \\
+cmdEntityId &::= componentId \ [ \ \texttt{\#} \ elemName \ ] \\
+contextSep &::= \texttt{`.`} \ | \  \texttt{`:`} \\
+pathSep &::= \texttt{`.`} \\
+dcrID &::= \texttt{`isocat`} \ | \ \texttt{`dc`}
+\end{align*}
+\end{defcap}
+The grammar distinguishes two main types of \code{smcIndex}: a) \code{dcrIndex} referring to data categories and b) \code{cmdIndex} denoting a specific ``CMD entity'', i.e. an element (metadata field), component or whole profile defined within CMD (cf. \ref{def:CMD} for description of the CMD data model).
+These two types of \code{smcIndex} follow different construction patterns.
+\code{cmdIndex} has a recursive path-like structure and can be interpreted as an XPath expression into the instances of CMD profiles. In contrast, \code{dcrIndex} consists of just a one-level term and is generally not directly applicable to existing data. It can be understood as an abstract index referring to well-defined concepts -- the data categories -- and for actual search it needs to be resolved to the set of CMD elements that refer to it. In return, one can expect to match more metadata fields from multiple profiles, all referring to the same data category.
+It is important to note that in general -- by design -- \code{smcIndex} can be ambiguous, meaning it can refer to multiple concepts or CMD entities. This is due to the fact that the labels of the data categories and CMD entities are not guaranteed to be unique.
+Although it may seem problematic and undesirable to have an ambiguous reference, this is an intentional design decision. The labels are needed for human-readability and ambiguity can be useful, as long as one is aware of it.
+However, there also needs to be a way to refer to data categories or CMD entities unambiguously. Therefore, the syntax also allows indexes to be referenced by the corresponding identifier. The individual constituents of the grammar are explained in the following:
+\code{dcrID} is a shortcut referring to a data category registry. Next to \xne{ISOcat}, other registries can function as a DCR, in particular the \xne{dublincore} set of metadata terms. \code{datcatLabel} is the human-readable name of a given data category (e.g. \concept{telephoneNumber}). In the case of \xne{ISOcat} data categories, the verbose descriptor \code{mnemonicIdentifier} is used. However, despite its name, it is not guaranteed to be unique. Therefore, \code{datcatID} has to be used if a data category is to be referenced unambiguously. For \xne{dublincore} terms, no such distinction between identifier and label exists; the concepts are denoted by the lexical term itself, which is unique within the \concept{dublincore} namespace.
+\code{profile} is a reference to a CMD profile. Again, to deal with the ambiguity, it can be either the name of the profile \code{profileName} or its identifier \code{profileID} as issued by the Component Registry (e.g. \code{clarin.eu:cr1:p\_1272022528363} for \concept{LexicalResourceProfile}). Even if a profile is referenced by its identifier, it may and should be prefixed by its name to still ensure human-readability. Or, seen the other way round, the name is disambiguated by suffixing it with the identifier:
+\begin{example1}
+\concept{LexicalResourceProfile\#clarin.eu:cr1:p\_1272022528363} \\
+\concept{LexicalResourceProfile\#clarin.eu:cr1:p\_1290431694579}
+\end{example1}
+\noindent
+\code{dotPath} allows addressing a leaf element (\concept{Session.Actor.Role}), or any intermediary XML element corresponding to a CMD component (\concept{Session.Actor}) within a metadata description. This makes it easy to express a search in whole components, instead of having to list all individual fields. The paths do not need to start from the root entity (the profile); they can reference any subtree structure. However, longer paths are often needed for more specific references, e.g. instead of \concept{Name} one could say \concept{Actor.Name} vs. \concept{Project.Name} or even \concept{Session.Actor.Name} vs. \concept{Drama.Actor.Name}. Still, this mechanism does not guarantee unique references; it only allows narrowing down the ambiguity.
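+To illustrate how a client might tell the different kinds of \code{smcIndex} apart, the following sketch classifies an index string by its surface form. It is a simplification and not a full parser for the grammar: the identifier patterns (\code{DC-} followed by digits for data categories, \code{clarin.eu:cr1:c\_} followed by digits for components) are inferred from the sample values and are assumptions, and a profile identifier followed by a \code{dotPath} is not handled. The function name \code{classify} is hypothetical.

```python
import re

# Assumed surface patterns, inferred from sample values (not normative).
DCR_PREFIX = re.compile(r"^(isocat|dc)[.:](.+)$")
DATCAT_ID = re.compile(r"^DC-\d+$")
COMPONENT_ID = re.compile(r"^clarin\.eu:cr1:c_\d+(#.+)?$")


def classify(index):
    """Classify an smcIndex string as dcrIndex or cmdIndex (simplified)."""
    if not index or any(ch.isspace() for ch in index):
        raise ValueError("an smcIndex is a single term without whitespace")
    if COMPONENT_ID.match(index):
        # cmdEntityId ::= componentId [ '#' elemName ]
        component, _, element = index.partition("#")
        return {"kind": "cmdIndex", "component": component,
                "element": element or None}
    if "#" in index:
        # profile ::= profileName [ '#' profileID ]
        name, _, profile_id = index.partition("#")
        return {"kind": "cmdIndex", "profile": name, "profile_id": profile_id}
    match = DCR_PREFIX.match(index)
    if match:
        # dcrIndex with explicit registry prefix
        return {"kind": "dcrIndex", "dcr": match.group(1),
                "datcat": match.group(2)}
    if DATCAT_ID.match(index):
        # dcrIndex given only by datcatID
        return {"kind": "dcrIndex", "dcr": None, "datcat": index}
    # Everything else is read as a dotPath (or a bare profile name).
    return {"kind": "cmdIndex", "path": index.split(".")}
```

+Note that a string such as \concept{dc.title} is read as a \code{dcrIndex} here, although a CMD component named \concept{dc} would produce the same string; this mirrors the ambiguity that the grammar accepts by design.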
+\subsection{Terms}
 \label{datamodel-terms}
+\todocode{Terms.xsd}
+\begin{note}
+Describe the CMD-format?
+\end{note}
+In abstract terms, the internal format is basically a table of indexes with information collected from the upstream registries or created during preprocessing. The main entity is \code{Term}, which represents either a label of a data category or a CMD entity (a CMD component or element). The further entities \code{Termset} and \code{Concept} are mainly used for logical grouping of the \code{Terms}. In the following, we explain the data model of these entities and their use in more detail. For the full \xne{Terms.xsd} XML schema see listing \ref{list:terms-schema}.
+\subsubsection{Type \code{Term}}
+\code{Term} is a polymorphic data type that can have different sets of attributes depending on the type of data it represents.
+\begin{table}[ht]
+\caption{Attributes of \code{Term} when encoding data category}
+\label{table:terms-attributes-datcat}
+ \begin{tabular}{ l | l | l }
+  attribute & allowed values & sample value\\
+\hline
+  \var{concept-id} &  PID given by DCR  & \code{isocat:DC-2522} \\
+  \var{set} & identifier of the DCR \emph{dcrID}  & \code{isocat} \\
+  \var{type} &  one of ['id', 'label', 'mnemonic'] & \code{id}, \code{label}\\
+ \var{xml:lang} & two-letter language code (only for ISOcat) & \code{en}, \code{si} \\
+ \end{tabular}
+\end{table}
+%\captionsetup{justification=raggedright, singlelinecheck=false}
+\lstset{language=XML}
+\begin{lstlisting}[label=list:terms-attributes-datcat, caption=sample \code{Term} element encoding an ISOcat data category]
+<Term concept-id="http://www.isocat.org/datcat/DC-2544" set="isocat"
+        type="label" xml:lang="fr">nom de ressource</Term>
+\end{lstlisting}
+\begin{table}[ht]
+\caption{Attributes of \code{Term} when encoding CMD entity}
+\label{table:terms-attributes-cmd}
+ \begin{tabularx}{1\textwidth}{ l | X | X }
+  attribute & allowed values & sample value\\
+\hline
+  \var{id} &  \var{cmdEntityId} as defined in \ref{def:smcIndex} & \code{clarin.eu:cr1:c\_1290431694487\#Url} \\
+  \var{type} &  one of ['CMD\_Element', 'CMD\_Component'] & \code{CMD\_Element}\\
+  \var{name} & name of the component or element & \code{Url} \\
+  \var{path} &  \var{dotPath} (cf. \ref{def:smcIndex}) & \code{SpeechCorpus.Access.Contact.Url} \\
+  \var{parent} & name of the parent component &  \code{Contact} \\
+ \end{tabularx}
+\end{table}
+\lstset{language=XML}
+\begin{lstlisting}[label=list:terms-attributes-element, caption=sample \code{Term} element encoding a CMD element]
+<Term type="CMD_Element" name="Url" datcat="http://www.isocat.org/datcat/DC-2546"
+          id="clarin.eu:cr1:c_1290431694487#Url" parent="Contact"
+          path="SpeechCorpus.Access.Contact.Url"/>
+\end{lstlisting}
+\begin{table}[ht]
+\caption{Attributes of \code{Term} when encoding a term in the inverted index?}
+\label{table:terms-attributes-index}
+ \begin{tabularx}{1\textwidth}{ l | X | X }
+  attribute & allowed values & sample value\\
+\hline
+  \var{id} &  \var{cmdEntityId} cf. \ref{def:smcIndex} & \code{clarin.eu:cr1:c\_1359626292113 \#ResourceTitle} \\
+  \var{type} &  one of \code{['id', 'mnemonic', 'label', 'full-path']} & \code{full-path}\\
+  \var{schema}  & \var{profileID} & \code{clarin.eu:cr1:p\_1357720977520} \\
+  \var{concept-id} & id of the corresponding (data category) &  \var{isocat:}\code{DC-2545} \\
+  \var{node-value} &  \var{dotPath} & \code{SpeechCorpus.Access.Contact.Url} \\
+ \end{tabularx}
+\end{table}
+\lstset{language=XML}
+\begin{lstlisting}[label=list:terms-attributes-index, caption=sample \code{Term} element encoding a term in the inverted index]
+   <Term set="cmd" type="full-path" schema="clarin.eu:cr1:p_1357720977520"
+                id="clarin.eu:cr1:c_1359626292113#ResourceTitle"
+                concept-id="http://www.isocat.org/datcat/DC-2545" >
+        AnnotatedCorpusProfile.GeneralInfo.ResourceTitle
+   </Term>
+\end{lstlisting}
+\subsubsection{Type \code{Concept}}
+\code{Concept} represents a data category. Its identifier is the PID issued by the DCR.
+It groups all terms belonging to a given data category.
+The content model is a sequence of \code{Term} elements followed by a sequence of \code{info} elements.
+Initially, after loading from the DCR, a \code{Concept} contains only \code{Term}s of type \code{id}, \code{mnemonic} or \code{label}, encoding the corresponding attributes of the data category, followed by \code{info} elements holding the definition, potentially in different languages:
+\lstset{language=XML}
+\begin{lstlisting}[label=list:concept, caption=sample \code{Concept} element representing the data category \concept{resourceTitle}]
+<Concept xmlns:dcif="http://www.isocat.org/ns/dcif" type="datcat"
+               id="http://www.isocat.org/datcat/DC-2545">
+         <Term set="isocat" type="mnemonic">resourceTitle</Term>
+         <Term set="isocat" type="id">DC-2545</Term>
+         <Term set="isocat" type="label" xml:lang="en">resource title</Term>
+         <Term set="isocat" type="label" xml:lang="fi">resurssin otsikko</Term>
+        ...
+         <info xml:lang="en">The title is the complete title
+                        of the resource without any abbreviations.</info>
+        ...
+</Concept>
+\end{lstlisting}
+In the inverted index the \code{Concept} is enriched with the \code{Terms} representing corresponding CMD entities (cf. Listing \ref{list:concept-cmd-term}).
+\lstset{language=XML}
+\begin{lstlisting}[label=list:concept-cmd-term, caption=\code{Term} for CMD element added to \code{Concept}]
+ <Term set="cmd" type="full-path" schema="clarin.eu:cr1:p_1345561703620"
+            id="clarin.eu:cr1:c_1345561703619#Name">collection.CollectionInfo.Name</Term>
+\end{lstlisting}
+\lstset{language=XML}
+\begin{lstlisting}[label=lst:dcr-cmd-map, caption=Sample of the inverted index \code{Concept} $\mapsto$ \code{Term}]
+    <Concept id="http://www.isocat.org/datcat/DC-2545" type="datcat">
+        <Term set="isocat" type="mnemonic">resourceTitle</Term>
+        <Term set="isocat" type="id">DC-2545</Term>
+        <Term set="isocat" type="label" xml:lang="en">resource title</Term>
+        <Term set="isocat" type="label" xml:lang="hr">naslov resursa</Term>
+        <Term set="isocat" type="label" xml:lang="lv">resursa nosaukums</Term>
+        ...
+        <Term set="cmd" type="full-path" schema="clarin.eu:cr1:p_1357720977520"
+                id="clarin.eu:cr1:c_1359626292113#ResourceTitle">
+                        AnnotatedCorpusProfile.GeneralInfo.ResourceTitle</Term>
+        <Term set="cmd" type="full-path" schema="clarin.eu:cr1:p_1297242111880"
+                id="clarin.eu:cr1:c_1271859438123#Title">
+                        AnnotationTool.GeneralInfo.Title</Term>
+        <Term set="cmd" type="full-path" schema="clarin.eu:cr1:p_1274880881885"
+                id="clarin.eu:cr1:c_1274880881884#Title">
+                        imdi-corpus.Corpus.Title</Term>
+        <Term set="cmd" type="full-path" schema="clarin.eu:cr1:p_1271859438204"
+                id="clarin.eu:cr1:c_1271859438201#Title">
+                        Session.Title</Term>
+        ...
+    </Concept>
+\end{lstlisting}
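+A consumer of this format could read the inverted index with a few lines of code. The following sketch parses a shortened copy of the \code{Concept} element from the listing above with the Python standard library and collects the full paths of the CMD elements bound to the data category; the helper name \code{cmd\_paths} is hypothetical and the structure of the surrounding document is not assumed.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the sample Concept element shown in the listing.
SAMPLE = """
<Concept id="http://www.isocat.org/datcat/DC-2545" type="datcat">
  <Term set="isocat" type="mnemonic">resourceTitle</Term>
  <Term set="isocat" type="id">DC-2545</Term>
  <Term set="cmd" type="full-path" schema="clarin.eu:cr1:p_1357720977520"
        id="clarin.eu:cr1:c_1359626292113#ResourceTitle">
        AnnotatedCorpusProfile.GeneralInfo.ResourceTitle</Term>
  <Term set="cmd" type="full-path" schema="clarin.eu:cr1:p_1271859438204"
        id="clarin.eu:cr1:c_1271859438201#Title">
        Session.Title</Term>
</Concept>
"""


def cmd_paths(concept):
    """Return the full paths of all CMD terms grouped under a Concept."""
    return [
        (term.text or "").strip()
        for term in concept.findall("Term")
        if term.get("set") == "cmd" and term.get("type") == "full-path"
    ]


concept = ET.fromstring(SAMPLE.strip())
# Inverted index: data category PID -> list of CMD full paths
index = {concept.get("id"): cmd_paths(concept)}
```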
+\subsubsection{Type \code{Termsets/Termset}}
+\code{Termset} groups a set of terms as outlined in table \ref{table:cx-list-params}. It is identified by the \code{@set} attribute.
+For example, all French labels of ISOcat data categories form a termset under the identifier \code{isocat-fr}, as do all the full-paths of one profile.
+Finally, \code{Termsets} is a root element grouping \code{Termset} elements.
+\lstset{language=XML}
+\begin{lstlisting}[label=list:termset, caption=\code{Termset} element representing a CMD profile]
+<Termset name="AnnotatedCorpusProfile" id="clarin.eu:cr1:p_1357720977520"
+            type="CMD_Profile">
+      <info>
+         <id>clarin.eu:cr1:p_1357720977520</id>
+         <description>A CMDI profile for annotated text corpus resources.</description>
+         <name>AnnotatedCorpusProfile</name>
+         <registrationDate>2013-01-31T11:57:12+00:00</registrationDate>
+         <creatorName>nalida</creatorName>
+          ...
+     </info>
+     <Term type="CMD_Component" name="GeneralInfo" datcat=""
+            id="clarin.eu:cr1:c_1359626292113"
+            parent="AnnotatedCorpusProfile"
+            path="AnnotatedCorpusProfile.GeneralInfo">
+            <Term ...
+     </Term>
+     ...
+</Termset>
+\end{lstlisting}
+The content of the \code{Termset} can optionally begin with an \code{info} element (conveying information as provided by the source registry, like definition, creation date or author) followed by a flat or nested list of \code{Term} elements.
+%%%%%%%%%%%%%%%%%%%%%%
 \section{cx -- crosswalk service}
 \label{def:cx}
+\label{sec:cx}
 The crosswalk service offers the functionality that was understood under the term \textit{Semantic Mapping} as conceived in the original plans of the Component Metadata Infrastructure. Semantic interoperability has been one of the main concerns addressed by the CMDI, and appropriate provisions were woven into the underlying meta-model as well as all the modules of the infrastructure.
+The task of the crosswalk service is to collect the relevant information maintained in the registries of the infrastructure and process it to generate mappings, i.e. \emph{crosswalks} between fields in heterogeneous metadata schemas, building the base for concept-based search in the heterogeneous data collection of the joint CLARIN metadata domain (cf. \ref{def:qx}).
+The core means for semantic interoperability in CMDI are the \emph{data categories} (cf. \ref{def:DCR}), well-defined atomic concepts, that are supposed to be referenced in schemata annotating fields to unambiguously indicate their intended semantics. Drawing upon this system, the crosswalks are not generated directly between the fields of individual schemata by some matching algorithm, but rather the data categories are used as bridges for translation. This results in clusters of semantically equivalent metadata fields (with data categories serving as pivotal points), rather than in a collection of pair-wise equivalencies between the fields.
+\subsection{smcIndex}\label{indexes}
+In this section we describe \emph{smcIndex} -- the data type for input and output of the proposed application.
+An smcIndex is a human-readable string adhering to a specific syntax, denoting some search index.
+The generic syntax is:
+\begin{eqnarray*}
+smcIndex ::= context \ contextSep \ conceptLabel
+\end{eqnarray*}
+We distinguish two types of smcIndexes: (i) \emph{dcrIndex} referring to data categories and (ii) \emph{cmdIndex} denoting a specific
+``CMD-entity'', i.e. a metadata field, component or whole profile defined within CMD. The \textit{cmdIndex} can be interpreted as a XPath into the instances of CMD-profiles. In contrast to it, the \textit{dcrIndexes} are generally not directly applicable on existing data, but can be understood as abstract indexes referring to well-defined concepts -- the data categories -- and for actual search they need to be resolved to the metadata fields they are referred by. In return one can expect to match more metadata fields from multiple profiles, all referring to the same data category.
+These two types of smcIndex also follow different construction patterns:
+\begin{eqnarray*}
+smcIndex & ::= & dcrIndex \ | \ cmdIndex  \\
+dcrIndex & ::= & dcrID \ contextSep \ datcatLabel \\
+cmdIndex & ::= & profile \  \\
+                      &  &  | \  [\ profile \ contextSep \ ] \ dotPath \\
+dotPath  & ::= & [\ dotPath \ pathSep \ ] \ elemName \\
+contextSep & ::= & \texttt{`.`} \ | \  \texttt{`:`} \\
+pathSep & ::= & \texttt{`.`} \\
+dcrId & ::= & \texttt{`isocat`} \ | \ \texttt{`dc`}
+\end{eqnarray*}
+The grammar is based on the way indices are referenced in CQL-syntax\footnote{Context Query Language, \url{http://www.loc.gov/standards/sru/specs/cql.html}} (\texttt{dc.title}) and on the dot-notation used in IMDI-browser\footnote{\url{http://www.lat-mpi.eu/tools/imdi}} (\texttt{Session.Location.Country}).
+\textit{dcrID} is a shortcut referring to a data category registry
+%\footnote{Next to ISOcat other registries can function as a DCR, e.g., the Dublin Core set of metadata terms.}
+similar to the namespace-mechanism in XML-documents.  \textit{datcatLabel} is the verbose Identifier- (e.g. \texttt{telephoneNumber}) or the Name-attribute (in any available translation, e.g. \texttt{numero di telefono@it}) of the data category.
+Consequently, the infrastructure has also foreseen this dedicated module, \emph{Semantic Mapping}, that exploits this mechanism to find \textbf{corresponding fields in different metadata schemas}.
+The task of the crosswalk service is to collect the relevant information maintained in the registries of the infrastructure and process it to generate the mappings, or \textbf{crosswalks} between fields in heterogeneous metadata schemas. These crosswalks can be used by other applications building the base for concept-based search in the heterogeneous data collection of the joint CLARIN metadata domain (cf. \ref{def:qx}).
+The core means for semantic interoperability in CMDI are the \emph{data categories} (cf. \ref{def:DCR}), well-defined atomic concepts, that are supposed to be referenced in schemas by annotating fields to unambiguously indicate their intended semantics. Drawing upon this system, the crosswalks are not generated directly between the fields of individual schemas by some kind of matching algorithm, but rather the data categories are used as reliable bridges for translation. This results in clusters of semantically equivalent metadata fields (with data categories serving as pivotal points), instead of a collection of pair-wise links between fields.
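+The pivot idea can be made concrete with a small sketch. Fields are grouped by the data category they reference, and a crosswalk is then a lookup in these clusters rather than a set of pair-wise links. The field and data category names are illustrative values only, and the function names are hypothetical; relations between data categories are not considered here.

```python
from collections import defaultdict

# Illustrative (cmdIndex, dcrIndex) pairs, as they might be derived from
# the data category references in the Component Registry.
FIELD_TO_DATCAT = [
    ("teiHeader.extent", "isocat.size"),
    ("TextCorpusProfile.Number", "isocat.size"),
    ("imdi-corpus.Name", "isocat.resourceName"),
    ("TextCorpusProfile.GeneralInfo.Name", "isocat.resourceName"),
]


def build_clusters(pairs):
    """Group fields by the data category they refer to."""
    clusters = defaultdict(list)
    for field, datcat in pairs:
        clusters[datcat].append(field)
    return dict(clusters)


def crosswalk(index, pairs):
    """Map an index to the corresponding indexes via the data category."""
    clusters = build_clusters(pairs)
    if index in clusters:
        # A data category resolves to all fields in its cluster.
        return clusters[index]
    for fields in clusters.values():
        if index in fields:
            # A field resolves to the other fields sharing its data category.
            return [f for f in fields if f != index] or [index]
    # No correspondence found: return only the input index.
    return [index]
```

+With $n$ fields referring to the same data category, the cluster needs only $n$ field-to-category links, whereas pair-wise crosswalks would require $n(n-1)/2$ links between fields.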
+\subsection{Interface Specification}
+\label{def:cx-interface}
+In this section, we define the abstract interface of the proposed service, in terms of the input parameters and output data format.
+\todoin{The two interfaces list and map
+Full definition in appendix and under link!}
+\subsubsection*{Method \code{list}}
+Method \code{list} lists the available items for a given context or type. This allows client applications to configure the query input and provide autocompletion functionality.
+\begin{definition}{URI-pattern of the \code{list} method}
+/smc/cx/list/\$context
+\end{definition}
+\noindent
+Table \ref{table:cx-list-params} lists the allowed values for the \var{\$context} parameter and the corresponding types of returned data.
+\begin{table}
+\caption{Allowed values for parameters of the \code{list}-method and corresponding return values}
+\label{table:cx-list-params}
+ \begin{tabular}{ l | p{0.7\textwidth} }
+  \var{\$context}  & returns a list of \\
+ \hline
+  \code{*,top} & available termsets \\
+  \var{\{termset\}} & terms (CMD components and elements) of given termset \\
+  \code{dcr} & available data category registries (isocat, dublincore) \\
+  \code{isocat}  & ISOcat data categories referenced in CMD data \\
+  \code{languages} & available languages (only for isocat data categories) \\
+  \code{cmd-profiles} & all available CMD profiles \\
+  \code{cmd-full-paths} & all complete (starting from Profile) \emph{dotPaths} to CMD components and elements\\
+  \code{cmd-minimal-paths} & reduced but still unique paths to CMD components and elements \\
+  \code{relsets} & available relation sets (defined in the Relation Registry)
+ \end{tabular}
+\end{table}
+The application should also deliver additional information about the indexes, such as a description and a link to the definition of the underlying entity in the source registry.
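+From the client side, a request to the \code{list} method only requires filling the \var{\$context} slot of the URI pattern. The sketch below builds such request URLs; the base URL is a placeholder, since the address of a deployed service is not specified here, and the helper name \code{list\_url} is hypothetical.

```python
from urllib.parse import quote

# Placeholder base URL; the address of a real deployment is an assumption.
BASE_URL = "http://example.org"


def list_url(context, base=BASE_URL):
    """Build a request URL following the pattern /smc/cx/list/$context."""
    # Keep '*' literal, since it is one of the allowed context values.
    return f"{base}/smc/cx/list/{quote(context, safe='*')}"


top_level = list_url("*")
profiles = list_url("cmd-profiles")
french_labels = list_url("isocat-fr")
```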
+%NO (this will be handled by the servic as multililngual labels e) : or the Name-attribute (in any available translation, e.g. \texttt{numero di telefono@it}) of the data category.}
 % While it is desirable to also allow the Name-attribute of the data category (\texttt{telephone number}), especially also the Names defined in other working languages (\texttt{numero di telefono@it, numer telefonu@pl}), special care has to be taken here as these attributes mostly contain white spaces, which could cause problems in downstream components, when parsing a complex query containing such indices.
+\textit{profile} is the name of the profile. % (despite the danger of ambiguity).
+\textit{dotPath} allows to address a leaf element (\texttt{Session.Actor.Role}), or any intermediary XML-element corresponding to a CMD-component (\texttt{Session.Actor})   within a metadata description. %This allows to easily express search in whole components, instead of having to list all individual fields.
+Generally, smcIndexes can be ambiguous, meaning they can refer to multiple concepts, or entities (CMD-elements). This is due to the fact that the names of the data categories, and CMD-entities are not guaranteed unique. The module will have to cope with this, by providing on demand the list of identifiers corresponding to a given smcIndex.
+%As an important sidenote -- cmdIndexes can be ambiguous, meaning they can refer to multiple entities (metadata fields), examples of valid indexes:
+%\begin{verbatim}
+%Name
+%Actor.Name, Project.Name
+%Session.Actor.Name, Drama.Actor.Name
+%\end{verbatim}
+%So we disambiguate (or narrow down the ambiguity) by prefixing context.
+\subsection{Interface Specification}
+In this section, we describe the actual task of the proposed service -- \textbf{mapping indexes to indexes} -- in abstract terms. The returned mappings can be used by other applications to expand or translate the original user query, to match elements in other schemas.
+% \footnote{This primary usage of SMC for work with user-created query strings explains the need for human-readability of the indices.}
+In the operation mode, the application accepts any index (\textit{smcIndex}, cf. \ref{indexes}) and returns a list of corresponding indexes (or only the input index, if no correspondences were found):
+\newline
+\textit{smcIndex $\mapsto$ smcIndex[ ]}
+\newline
+We can distinguish the following levels for this mapping function:
+(1) \emph{data category identity} -- for the resolution, only the basic data category map derived from the Component Registry is employed. Accordingly, only indexes denoting CMD-elements (\textit{cmdIndexes}) bound to a given data category are returned:
+\newline
+\begin{example}
+isocat.size     & $\mapsto$ & [teiHeader.extent, TextCorpusProfile.Number]
+\end{example}
+\newline
+\textit{cmdIndex} as input is also possible. It is translated to a corresponding data category, proceeding as above:
+\newline
+\begin{example}
+imdi-corpus.Name & $\mapsto$ \\
+(isocat.resourceName) & $\mapsto$ TextCorpusProfile.GeneralInfo.Name
+\end{example}
+\newline
+(2) \emph{relations between data categories} -- employing also information from the Relation Registry, related (equivalent) data categories are retrieved and subsequently both the input and the related data categories resolved to cmdIndexes:
+\newline
+\texttt{isocat.resourceTitle  $\mapsto$ }
+\verb|   (+ dc.title) |$\mapsto$  \newline
+\verb|   [imdi-corpus.Title, | \newline
+\verb|    TextCorpusProfile.GeneralInfo.Title,| \newline
+\verb|    teiHeader.titleStmt.title,| \newline
+\verb|    teiHeader.monogr.title]|
+\newline
+(3) \emph{container data categories} -- further expansions will be possible once the container data categories \cite{SchuurmanWindhouwer2011} are used. Currently only fields (leaf nodes) in metadata descriptions are linked to data categories. However, at times there is a need to conceptually bind the components as well, meaning that besides the ``atomic'' data category for \texttt{actorName}, there would also be a data category for the complex concept \texttt{Actor}.
+Having concept links also on components will require a compositional approach to the task of semantic mapping, resulting in:
+\newline
+\texttt{Actor.Name $\mapsto$ }\newline
+\verb|    [Actor.Name, Actor.FullName, |\newline
+\verb|     Person.Name, Person.FullName]|
+\subsubsection*{Method \code{map} }
+Method \code{map} performs the actual translations:
+it accepts any index (adhering to the \var{smcIndex} datatype, cf. \ref{def:smcIndex}) and returns a list of corresponding indexes.
+%it returns list of equivalent terms/smcIndexes for a given term/smcIndex.
+\begin{definition}{General function definition}
+smcIndex \mapsto smcIndex[ ]
+\end{definition}
+\begin{definition}{URI-pattern of the \code{map} method}
+/smc/cx/map/\{\$context\}/\{\$term\} \ [ \ ?format=\{\$format\} \ ] \ [ \ \&relset=\{\$relset\} \ ]
+\end{definition}
+\noindent
+Parameter definition:\\*
+\begin{description}
+\item[\var{\$context}] identifies the context in which to search for the \var{\$term}; primarily this would be one of \code{[*, isocat, dc, cmd]}, in extended mode any of the terms listed in table \ref{table:cx-list-params} is accepted
+\item[\var{\$term}] \var{smcIndex} term (without the context prefix); the term is used to look up a concept and to deliver the list of equivalent indexes; case-insensitive
+\item[\var{\$format}] the desired result format can be indicated explicitly, as an alternative to the default content negotiation; one of \code{[json, rdf, xml]}; \code{xml} is the default
+\item[\var{\$relset}] optional; reference to a relset to be applied to the identified concept to expand the cluster of equivalent indexes; allows multiple values from \code{list/relsets}; if multiple sets are given, they are all applied in the expansion
+\end{description}
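The URI pattern above can be illustrated with a small client-side helper. The following is only a sketch: the function name \code{build\_map\_url} is hypothetical, and passing several relation sets as repeated \code{relset} parameters is an assumption, since the text only states that multiple values are allowed.

```python
from urllib.parse import quote, urlencode


def build_map_url(context, term, fmt=None, relsets=()):
    """Assemble a request path following the URI pattern of the map method."""
    # Percent-encode the path segments; '*' is kept literal because it is
    # the wildcard context.
    path = "/smc/cx/map/{}/{}".format(quote(context, safe="*"), quote(term, safe=""))
    params = []
    if fmt is not None:
        params.append(("format", fmt))
    # Assumption: several relation sets are sent as repeated 'relset' parameters.
    params.extend(("relset", name) for name in relsets)
    return path + ("?" + urlencode(params) if params else "")


print(build_map_url("isocat", "resourceTitle"))
print(build_map_url("*", "name", fmt="json"))
```

If \code{format} is omitted, the helper adds no query parameter and the choice of format is left to content negotiation.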
+\noindent
+Possible return formats:
+\begin{description}
+\item[\var{'', default}] internal XML format with all attributes (\xne{Terms.xsd}, cf. listing \ref{lst:map-output})
+\item[\var{schema}] distinct schemas (\code{Termset}) referencing given data category or string
+\lstset{language=XML}
+\begin{lstlisting}
+<Termset schema="clarin.eu:cr1:p_1295178776924" name="serviceDescription"/>
+\end{lstlisting}
+\item[\var{datcat}] distinct data categories (\code{Term@id@da}) by \code{@concept-id}
+\lstset{language=XML}
+\begin{lstlisting}
+<Term concept-id="http://www.isocat.org/datcat/DC-2512"
+           set="isocat" type="datcat">creatorFullName</Term>
+\end{lstlisting}
+\item[\var{cmdid, id}] distinct cmd entities (\code{Term}) by \code{@id}
+\begin{lstlisting}
+<Term type="CMD_Element" name="Name" elem="Name" parent="Session"
+       datcat="http://www.isocat.org/datcat/DC-2544"
+       id="clarin.eu:cr1:c_1349361150645#Name"  path="DBD.Session.Name"/>
+\end{lstlisting}
+\end{description}
+\begin{table}[ht]
+\caption{Sample values for parameters of the \code{map}-method and corresponding return values}
+\label{table:cx-map-params}
+ \begin{tabular}{ l  l | l}
+  \var{\$context}  & \var{\$term} & returns \\
+ \hline
+  \code{*} & \code{name} & ? \\
+  \code{isocat} & \code{resourceTitle} & CMD terms \\
+  \code{cmd} & \code{name} & \\
+ \end{tabular}
+\end{table}
+\noindent
+Sample request\\*
+\begin{example1}
+/smc/cx/map/isocat/resourceTitle
+\end{example1}
+\lstset{language=XML}
+\begin{lstlisting}[label=lst:map-output, caption=Corresponding sample output ]
+<Terms >
+    <Term set="cmd" type="full-path" schema="clarin.eu:cr1:p_1297242111880"
+        id="clarin.eu:cr1:c_1271859438123#Title">
+                AnnotationTool.GeneralInfo.Title</Term>
+    <Term set="cmd" type="full-path" schema="clarin.eu:cr1:p_1288172614014"
+        id="clarin.eu:cr1:c_1288172614011#resourceTitle">
+                BamdesLexicalResource.BamdesCommonFields.resourceTitle
+     </Term>
+   <Term set="cmd" type="full-path" schema="clarin.eu:cr1:p_1274880881885"
+        id="clarin.eu:cr1:c_1274880881884#Title">
+                imdi-corpus.Corpus.Title</Term>
+   <Term set="cmd" type="full-path" schema="clarin.eu:cr1:p_1271859438204"
+        id="clarin.eu:cr1:c_1271859438201#Title">
+                Session.Title</Term>
+   <Term set="cmd" type="full-path" schema="clarin.eu:cr1:p_1272022528363"
+        id="clarin.eu:cr1:c_1271859438123#Title">
+                LexicalResourceProfile.LexicalResource.GeneralInfo.Title</Term>
+    <Term set="cmd" type="full-path" schema="clarin.eu:cr1:p_1284723009187"
+        id="clarin.eu:cr1:c_1271859438123#Title">collection.GeneralInfo.Title</Term>
+</Terms>
+\end{lstlisting}
+\noindent
+We can distinguish the following levels for the mapping function:
+\noindent
+(1) \emph{data category identity} -- for the resolution, only the basic data category map derived from the Component Registry is employed. Accordingly, only indexes denoting CMD elements (\var{cmdIndex}) bound to a given data category are returned:
+\noindent
+\begin{example2}
+%\begin{tabularx}{\textwidth}{| p{0.4\textwidth}  p{0.6\textwidth} }
+isocat.size     $\mapsto$ & [teiHeader.extent, TextCorpusProfile.Number]
+\end{example2}
+%\end{tabularx}
+\noindent
+\var{cmdIndex} as input is also possible. It is translated to a corresponding data category, proceeding as above:
+\begin{example2}
+imdi-corpus.Name $\mapsto$ \\
+(isocat.resourceName) $\mapsto$ & TextCorpusProfile.GeneralInfo.Name
+\end{example2}
+\noindent
+(2) \emph{relations between data categories} -- employing also information from the Relation Registry, related (equivalent) data categories are retrieved and subsequently both the input and the related data categories resolved to a list of \var{cmdIndexes}:
+\begin{example2}
+isocat.resourceTitle $\mapsto$  \\
+(+ dc.title) $\mapsto$  & [GeneralInfo.Title, Text.TextTitle, collection.CollectionInfo.Title, resourceInfo. identificationInfo. resourceName, teiHeader.titleStmt.title, teiHeader.monogr.title]
+\end{example2}
+\noindent
+(3) \emph{container data categories} -- further expansions will be possible once the \emph{container data categories} \cite{SchuurmanWindhouwer2011} are used.\footnote{Although metadata modellers are encouraged to indicate data categories for both components and elements, this is being taken up only slowly, and currently only around 14 per cent of the components have a data category specified.} The idea is to set a concept link for the components as well, meaning that besides the ``atomic'' data category for \concept{actorName}, there would also be a data category for the complex concept \concept{Actor}.
+Having concept links also on components will require a compositional approach for the mapping function, resulting in:
+\begin{example2}
+Actor.Name $\mapsto$ & [Actor.Name, Actor.FullName, \\
+& Person.Name, Person.FullName]
+\end{example2}
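The first two levels of the mapping function can be sketched in a few lines of Python. This is an illustration, not the actual implementation (which is XSLT/XQuery-based): the function name \code{smc\_map} is hypothetical, the dictionaries are toy data loosely modelled on the examples above, and the relation set is assumed to list equivalent data categories per data category.

```python
# Toy data modelled on the examples in the text; not the real registry content.
DATCAT_TO_CMD = {
    "isocat.size": ["teiHeader.extent", "TextCorpusProfile.Number"],
    "isocat.resourceTitle": ["imdi-corpus.Title", "TextCorpusProfile.GeneralInfo.Title"],
    "dc.title": ["teiHeader.titleStmt.title", "teiHeader.monogr.title"],
}
# Hypothetical relation set declaring two data categories as equivalent.
RELSET = {"isocat.resourceTitle": ["dc.title"]}


def smc_map(index, relset=None):
    """Return the cmdIndexes corresponding to an smcIndex (levels 1 and 2)."""
    if index in DATCAT_TO_CMD:
        datcats = [index]
    else:
        # A cmdIndex as input is first translated to its data categories.
        datcats = [dc for dc, paths in DATCAT_TO_CMD.items() if index in paths]
    if relset:
        # Level 2: add the related (equivalent) data categories.
        for dc in list(datcats):
            for related in relset.get(dc, []):
                if related not in datcats:
                    datcats.append(related)
    result = []
    for dc in datcats:
        for path in DATCAT_TO_CMD.get(dc, []):
            if path not in result:
                result.append(path)
    # If no correspondences were found, only the input index is returned.
    return result or [index]


print(smc_map("isocat.size"))
print(smc_map("isocat.resourceTitle", RELSET))
```

Unlike the second example above, this sketch keeps the input cmdIndex in the result list; whether the service does the same is not specified here.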
 \subsection{Implementation}
 At the core of the described module is a set of XSL-stylesheets, governed by a ant-build file and a configuration file holding the information about individual source registries.
+At the core of the described module is a set of XSL-stylesheets, governed by an ant-build file and a configuration file holding the information about individual source registries.
 \todoin{generate and reference XSLT-documentation}
+The service is implemented as a RESTful service. However, it supports only the GET operation, as it operates on a data set that the users cannot change directly. (Changes have to be performed in the upstream registries.)
 \subsubsection{Initialization}
+First, there is an initialization phase, in which the application fetches the information from the source modules (cf. \ref{def:CMD}) and transforms it into the internal Terms format (cf. \ref{datamodel-terms}). All profiles and components from the Component Registry are read and all the URIs to data categories are extracted to construct an inverted map of data categories:
+\newline
+\textit{datcatURI $\mapsto$ profile.component.element[]}
+\newline
+The collected data categories are enriched with information from corresponding registries (DCRs), adding the verbose identifier, the description and available translations into other working languages. %, usable as base for multi-lingual search user-interface.
+Finally relation sets defined in the Relation Registry are fetched and matched with the data categories in the map to create sets of semantically equivalent (or otherwise related) data categories.
+\todocode{example of inverted index}
+\label{smc_init}
+During initialization the application fetches the information from the source modules (cf. \ref{def:CMDI}) and transforms it into the internal \xne{Terms} format (cf. \ref{datamodel-terms}). All profiles and components from the Component Registry are read and all the URIs to data categories are extracted to construct an inverted map of data categories:
+\begin{definition}{Principal structure of the inverted index}
+datcatURI \mapsto profile.component.element[]
+\end{definition}
+The collected data categories are enriched with information from corresponding registries (DCRs), adding the label, the description and available translations into other working languages. %, usable as base for multi-lingual search user-interface.
+Finally, relation sets defined in the Relation Registry are fetched and matched with the data categories in the map to create sets of semantically equivalent (or otherwise related) data categories.
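The principal structure of the inverted index can be illustrated as follows. This is a minimal sketch under stated assumptions: the input format (a dictionary mapping each profile name to pairs of element path and data category URI), the function name \code{build\_inverted\_index} and the \code{urn:example:} URIs are hypothetical. The actual initialization works on the XML data of the registries.

```python
from collections import defaultdict


def build_inverted_index(profiles):
    """Map every data category URI to the full paths of the elements bound to it."""
    index = defaultdict(list)
    for profile_name, elements in profiles.items():
        for path, datcat_uri in elements:
            if datcat_uri is None:
                # Elements without a concept link cannot be indexed.
                continue
            index[datcat_uri].append("{}.{}".format(profile_name, path))
    return dict(index)


# Hypothetical toy input: (path within profile, data category URI or None)
profiles = {
    "imdi-corpus": [("Corpus.Title", "urn:example:title")],
    "Session": [("Title", "urn:example:title"), ("Actor.Role", None)],
}
print(build_inverted_index(profiles))
```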
+\begin{figure*}[!ht]
+\includegraphics[width=1\textwidth]{images/smc_init.png}
+\caption{The various stages of the data flow during the initialization}
+\label{fig:smc_init}
+\end{figure*}
+The following datasets are available after the initialization sequence has finished (cf. figure \ref{fig:smc_init}):
+\begin{description}
+\item[\xne{termsets}] a list of all available Termsets compiled from the CMD profiles and available DCRs; for \xne{ISOcat} a termset is generated for every available language
+\item[\xne{cmd-terms}] a flat list of \code{Term} elements representing all components and elements in all known profiles; grouped in \code{Termset} elements representing the profiles
+\item[\xne{cmd-terms-nested}] as above, however the \code{Term} elements are nested reflecting the component structure in the profile
+\item[\xne{dcr-terms}] a list of \code{Concept} elements representing the data categories with nested \code{Term} elements
+\item[\xne{dcr-cmd-map}] the main inverted index -- a list of concepts as in \xne{dcr-terms}, but with additional \code{Term} elements included in the \code{Concept} elements representing the CMD components or elements corresponding to given data category (cf. listing \ref{lst:dcr-cmd-map})
+\item[\xne{rr-terms}] additional index generated based on the relations between data categories as defined in the Relation Registry; the \code{Concept} elements representing the pair of related data categories are wrapped with a \code{Relation} element (with a \code{@type} attribute)
+\end{description}
 \subsubsection{Operation}
+\subsubsection{Computing summaries}
+For the actual service operation a minimal application has been implemented that accesses the cached internal datasets and optionally applies XSL-stylesheets for post-processing, depending on the requested format.
+The application implements the interface as defined in \ref{def:cx-interface} as an XQuery module based on the \xne{restxq}-library within an \xne{eXist} XML-database.
 \subsection{Extensions}
+A useful supplementary function of the module would be to provide a list of existing indexes.
+That would allow the search user-interface to equip the query-input with autocompletion. Also the application should deliver additional information about the indexes like description and a link to the definition of the underlying entity in the source registry.
+Once there will be overlapping\footnote{i.e. different relations may  be defined for one data category in different relation sets} user-defined relation sets in the Relation Registry an additional input parameter will be required to \emph{explicitly restrict the selection of relation sets} to apply in the mapping function.
+Also, use of \emph{other than equivalency relations will necessitate more complex logic in the query expansion and accordingly also more complex response of the SMC, either returning the relation types themselves as well or equip the list of indexes with some similarity ratio.}
+Once there are overlapping\footnote{i.e. different relations may be defined for one data category in different relation sets} user-defined relation sets in the Relation Registry, an additional input parameter will be required to \emph{explicitly restrict the selection of relation sets} to apply in the mapping function.
+Also, the use of \emph{other than equivalency} relations will necessitate more complex logic in the query expansion and accordingly also a more complex response of the crosswalk service, either returning the relation types themselves as well or equipping the list of indexes with some kind of similarity ratio.
 \section{qx -- concept-based search}
 …
 In this section we want to explore, how this shall be accomplished, i.e. how to bring the enhanced capabilities to the user.
+The emphasis lies on the query language and the corresponding query input interface.
+Crucial aspect is the question how to deal with the even greater amount of information in a user-friendly way, ie how to prevent overwhelming, intimidating or frustrating the user.
+offering it (the information) semi-transparently to the user (or application) on the consumption side.
+Semi-transparently means, that primarily the semantic mapping shall integrate seamlessly in the interaction with the service, but it shall ``explain'' - offer enough information - on demand, for the user to understand its role and also being able manipulate easily.
+?
+Facets
+Controlled Vocabularies
+Synonym Expansion (via TermExtraction(ContentSet))
+The emphasis lies on the query language and the corresponding query input interface. A crucial aspect is the question of how to integrate the additional processing, i.e. how to deal with the even greater amount of information in a user-friendly way without overwhelming the user, while still explaining the applied processing on demand, so that the user can understand how the result came about and, even more importantly, manipulate the processing easily.
+Note that \emph{query expansion} needs to be distinguished from \emph{query translation}, the task of expressing the input query in another query language (e.g. a CQL query expressed as XPath).
+Note also that this chapter deals only with the schema level, i.e. the expansion here pertains only to the indexes to be searched in, not to the search terms. The corresponding instance level is tackled in \ref{semantic-search}.
 \subsection{Query language}
+CQL?
+The \emph{Contextual Query Language} (CQL), a well-established standard designed with extensibility in mind, is used as the base query language to build upon.
 \subsection{Query Expansion}
+As long as the indexes to expand with are equivalent, the query expansion is simply a disjunction, returning a union of matching records. Thus \code{isocat.resourceTitle any "elephant"} would translate into
+\begin{example1}
+GeneralInfo.Title any "elephant" \\
+OR resourceInfo.resourceName any "elephant" \\
+OR CollectionInfo.Title any "elephant" \\
+OR teiHeader.titleStmt.title any "elephant" \\
+\end{example1}
+\noindent
+As an alternative to the -- potentially costly -- on-the-fly expansion, the concept-based equivalence clusters could be applied already during the indexing of the data. That means that ``virtual'' search indexes are defined for individual data categories, in which the values from all metadata fields annotated with the given data category are indexed. Indeed, this approach is already being applied in the search applications VLO and Meertens Institute Search Engine (cf. \ref{cmdi_exploitation}).
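The on-the-fly variant of the expansion can be sketched as a simple string rewriting of a single search clause. The function \code{expand\_clause} and the \code{equivalents} dictionary are hypothetical; in practice the list of equivalent indexes would be obtained from the \code{map} method. The sketch handles only one clause and does not parse complete CQL queries.

```python
def expand_clause(index, relation, term, equivalents):
    """Expand one search clause into a disjunction over equivalent indexes."""
    # Fall back to the original index if no correspondences are known.
    indexes = equivalents.get(index, [index])
    # Escape backslashes and double quotes inside the quoted search term.
    escaped = term.replace("\\", "\\\\").replace('"', '\\"')
    clauses = ['{} {} "{}"'.format(i, relation, escaped) for i in indexes]
    return " OR ".join(clauses)


# Hypothetical equivalence cluster, e.g. as returned by the map method
equivalents = {
    "isocat.resourceTitle": ["GeneralInfo.Title", "teiHeader.titleStmt.title"],
}
print(expand_clause("isocat.resourceTitle", "any", "elephant", equivalents))
```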
 \subsection{SMC as module for Metadata Repository}
 As a concrete proof of concept the functionality of SMC has been integrated into the Metadata Repository, another module of the CMDI providing all the metadata records harvested within the CLARIN joint metadata domain.
+As a concrete proof of concept the functionality of SMC has been integrated into the Metadata Repository, another module of the CMDI providing all the metadata records harvested within the CLARIN joint metadata domain (cf. \ref{cmdi_exploitation}).
 Metadata repository is implemented in xquery running within the eXist XML-database as a web application.
 …
+\subsection{User Interface?}
+\subsubsection*{Query Input}
+\subsection{User Interface}
+A starting point for our considerations is the traditional structure found in many (advanced) search interfaces, which is basically an array of index--term pairs or, in more advanced alternatives, tuples of index, comparison operator, term and boolean operator:
+\begin{definition}{Generic data format for structured queries}
+ [ < index, operation, term, boolean > ]
+\end{definition}
+\noindent
+This maps trivially to the main clause of the CQL syntax, the \var{searchClause} (cf. \ref{def:searchClause}).
+% {Basic clause of the CQL syntax}
+\begin{definition}{The main clause of the CQL syntax, the \code{searchClause}}
+\label{def:searchClause}
+searchClause \ ::= \ index \ relation \ searchTerm
+\end{definition}
+\noindent
+An alternative would be a smart parsing input field with contextual autocomplete, though such a widget would still share the underlying data model.
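How the generic tuple format maps to CQL search clauses can be shown with a short sketch. The function name \code{rows\_to\_cql} is hypothetical, and it is an assumption that the boolean operator of a row connects it to the following row (so the operator of the last row is ignored).

```python
def rows_to_cql(rows):
    """Serialize a list of (index, relation, term, boolean) tuples as a CQL query."""
    parts = []
    for position, (index, relation, term, boolean) in enumerate(rows):
        # Each tuple yields one searchClause: index relation searchTerm
        parts.append('{} {} "{}"'.format(index, relation, term))
        # Assumption: the boolean joins this row with the next one.
        if position < len(rows) - 1:
            parts.append(boolean)
    return " ".join(parts)


rows = [
    ("isocat.resourceTitle", "any", "elephant", "and"),
    ("dc.creator", "=", "smith", "or"),
]
print(rows_to_cql(rows))
```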
 \begin{figure*}[!ht]
 …
 \end{figure*}
+\noindent
 Using data categories from ISOcat as search indexes brings about -- next to solid semantic grounding -- the advantage of multilingual labels and descriptions/definitions.
+\subsubsection*{Columns}
+\subsubsection*{Summaries}
+\subsubsection*{Differential Views}
+Visualize impact of given mapping in terms of covered dataset (number of matched records).
+\subsubsection*{Visualization}
+Landscape, Treemap, SOM
+\todoin{check Ontology Mapping and Alignement / saiks/Ontology4 4auf1.pdf}
+\section{SMC-Browser}
+A fundamentally different approach is the ``content first'' paradigm that, similar to the familiar simple search fields found in general search engines, provides suggestions via autocompletion on the fly when the user starts typing any string. The difference is that the suggestions are typed, so that the user is informed from which index a given term comes (\concept{person}, \concept{place}, etc.).
+Although we concentrate on query input, the use of indexes has to be consistent across the whole interface, be it in labeling the fields of the results or when providing facets to drill down the search.
+\section{SMC Browser}
 \label{smc-browser}
+Explore the Component Metadata Framework
+As the data set keeps growing both in numbers and in complexity, the call from the CMD community to provide enhanced ways for its exploration gets stronger. \textit{SMC browser} is one answer to this need. It is a web application that lets the metadata modeller explore the information about profiles, components, elements and the usage of DCs as an interactive graph. This allows, for example, examining the reuse of components or DCs in different profiles. The graph is accompanied by statistical information about individual `nodes', e.g., counting how many elements a profile contains, or in how many profiles a DC is used.
+In CMD, metadata schemas are defined by profiles, that are constructed out of reusable components - collections of metadata fields. The components can contain other components, and they can be reused in multiple profiles. Furthermore, every CMD element (metadata field) refers via a PID to a data category to indicate unambiguously how the content of the field in a metadata description should be interpreted \cite{Broeder+2010}.
+Thus, every profile can be expressed as a tree, with the profile component as the root node, the used components as intermediate nodes and elements or data categories as leaf nodes, parent-child relationship being defined by the inclusion (\code{componentA -includes-> componentB}) or referencing (\code{elementA -refersTo-> datcat1}). The reuse of components in multiple profiles and especially also the referencing of the same data categories in multiple CMD elements leads to a blending of the individual profile trees into a graph (acyclic directed, but not necessarily connected).
+As the CMD dataset keeps growing both in numbers and in complexity, the call from the community to provide enhanced ways for its exploration gets stronger.  In the following, some design considerations for an application to answer this need are proposed.
+While the Component Registry (cf. \ref{def:CR}) allows browsing, searching and viewing existing profiles and components, it is not possible to easily find out which components are reused in which profiles and which data categories are referenced by which elements. However, this kind of information is crucial during profile creation as well as for curation of the existing profiles, as it enables the data modeller to recognize a) which components and data categories are used most often, indicating their adoption and popularity within the community, and b) the thematic contexts in which individual components are used, providing a hint about their appropriateness for given research data.
+\subsection{Design}
+In the following, we elaborate on the basic idea of the proposed application, the source data, requirements and proposed application UI-layout.
+\subsubsection{Basic concept}
+If we consider the CMD data model (cf. \ref{def:CMD}) we recognize that every profile can be expressed as a tree with the profile component as the root node, the used components as intermediate nodes and elements or data categories as leaf nodes, parent-child relationship being defined by \var{inclusion} and \var{reference}.
+\begin{defcap}[!ht]
+\caption{\var{inclusion} and \var{reference} relationship}
+\begin{align*}
+cmds:Component  & \xrightarrow{includes} \quad  cmds:Component \\
+cmds:Component  & \xrightarrow{includes} \quad  cmds:Element \\
+cmds:Element  & \xrightarrow{refersTo} \quad DatCat
+\end{align*}
+\end{defcap}
+The reuse of components in multiple profiles and especially also the referencing of the same data categories in multiple CMD elements leads to a blending of the individual profile trees into a graph (acyclic directed, but not necessarily connected). The main idea for the \xne{SMC Browser} is to \textbf{visualize this graph inherent in the CMD data}.
+\subsubsection{Requirements}
+Given the size of the data set (currently more than 4,000 nodes and growing), it is obviously not possible to view the whole graph at once. Thus, a general essential requirement is to be able to select and view subgraphs by various means.
+In a basic scenario, the user looks for possibly reusable profiles or components, based on some common terms associated with the type of data to be described (e.g. \code{"corpus"}). If the search yields matching profiles or components, the user should be able to view the whole structure of the profiles, explore the definitions of individual components and see which data categories are being referenced for semantic grounding. Furthermore, it has to be possible to view multiple profiles concurrently, in particular to be able to see the components or data categories they share and, vice versa, in which profiles a given data category is referenced.
+This scenario implies a few requirements on the user interface:
+\begin{itemize}
+\item select nodes from a list of all available nodes (ideally grouped by type)
+\item filter the node list
+\item select an arbitrary number of nodes of any type (be it profiles, components, elements, data categories)
+\item traverse the graph starting from selected nodes into arbitrary depth
+\item traverse the graph backwards (meaning against the direction of the edges, e.g. from data categories towards the profiles)
+\item maintain the identity of the nodes, meaning one component or one data category used in two profiles has to be represented by one node (for displaying the reuse)
+\item show auxiliary information about the nodes on demand
+\end{itemize}
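The requirements on node identity and on traversal in both directions can be made concrete with a small sketch. It is written in Python for brevity and makes no claim about the actual implementation; the function names are hypothetical, and \code{depth\_before} and \code{depth\_after} stand for the number of levels followed against and along the direction of the edges, respectively.

```python
from collections import defaultdict


def build_graph(edges):
    """Build successor and predecessor maps; a shared name yields one shared node."""
    succ, pred = defaultdict(set), defaultdict(set)
    for src, dst in edges:
        succ[src].add(dst)
        pred[dst].add(src)
    return succ, pred


def neighbourhood(start, adjacency, depth):
    """Collect all nodes reachable from start within the given number of levels."""
    seen = set(start)
    frontier = set(start)
    for _ in range(depth):
        frontier = {n for f in frontier for n in adjacency.get(f, ())} - seen
        if not frontier:
            break
        seen |= frontier
    return seen


def subgraph(selected, succ, pred, depth_before, depth_after):
    """Nodes to display: the selection plus its context in both directions."""
    return (neighbourhood(selected, pred, depth_before)
            | neighbourhood(selected, succ, depth_after))


# Hypothetical toy data: two profiles reusing the same component
edges = [
    ("ProfileA", "Actor"), ("ProfileB", "Actor"),
    ("Actor", "Actor.Name"), ("Actor.Name", "datcat:actorName"),
]
succ, pred = build_graph(edges)
print(sorted(subgraph({"Actor"}, succ, pred, depth_before=1, depth_after=1)))
```

Because the component \code{Actor} is stored once, selecting it and traversing one level backwards reveals both profiles that reuse it.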
+\subsubsection{Application layout}
+\begin{figure*}[!ht]
+\begin{center}
+\includegraphics[width=1\textwidth]{images/smc-browser_UIsketch.png}
+\end{center}
+\caption{A sketch of a possible layout for the SMC Browser -- individual parts of the user interface}
+\label{fig:smc-browser_sketch}
+\end{figure*}
+\noindent
+Prospective parts of the application layout (cf. figure \ref{fig:smc-browser_sketch}):
+\begin{description}
+\item[index panel] list of all available nodes (profiles, components, elements, data categories); allows to select nodes to be displayed in the graph pane
+\item[main graph pane] displays the selected subgraph, needs as much space as possible
+\item[graph navigation bar] for manipulation of the displayed graph by various means
+\item[detail view] displaying definition and statistical information for selected nodes
+\item[statistics] a separate view on the data listing the statistical information for whole dataset in tables
+\end{description}
+\subsection{Implementation}
+The application is implemented in \xne{javascript} based on a generic visualization \xne{js}-library \xne{d3}\furl{https://github.com/mbostock/d3/}. The library allows for data-driven visualization (hence the name \xne{d3 = data-driven documents}), attributes of data items being dynamically bound to attributes of the SVG objects representing them. This caters for high flexibility, fast development and consistent data views. The library also delivers the base graph layout algorithm: \emph{force-directed graph layout}\furl{https://github.com/mbostock/d3/wiki/Force-Layout##wiki-force}:
+\begin{quotation}
+A flexible force-directed graph layout implementation using position Verlet integration to allow simple constraints.  [\dots]
+In addition to the repulsive charge force, a pseudo-gravity force keeps nodes centered in the visible area and avoids expulsion of disconnected subgraphs, while links are fixed-distance geometric constraints. Additional custom forces and constraints may be applied on the "tick" event, simply by updating the x and y attributes of nodes.
+\end{quotation}
+An especially remarkable feature is the possibility of adding custom constraints, which are accommodated alongside the constraints imposed by the base algorithm. This enables flexible customization of the layout while still harnessing the power of the underlying layout algorithm. At the same time, this is quite a challenging feature to master: with different constraints affecting the layout algorithm, it is at times difficult to understand the impact of a specific constraint on the layout.
+\subsubsection{Data preprocessing}
+\label{smc-browser-data-preprocessing}
+The application operates on a set of static XHTML and JSON data files that are created in a preprocessing step and deployed with the application. The preprocessing consists of a series of XSLT transformations (cf. figure \ref{fig:smc_processing}), starting from the internal datasets generated during the initialization (cf. \ref{smc_init}). The HTML output for \xne{smc-stats} is generated in two steps (\var{track S}) via an intermediate internal generic XML format for representing tabular data. The JSON data for the \xne{smc-graph}, as expected by the \xne{d3} library, is also generated in two steps (\var{track G}). First, an XML representation of the graph is generated from the data (\xne{terms2graph.xsl}); then a generic XSLT-transformation (\xne{graph\_json.xsl}) is applied to it, transforming the XML graph into the required JSON format. In fact, this track is run multiple times, generating different variants of the graph that feature different aspects of the dataset:
+\begin{description}
+\item[SMC graph basic]
+        the basic graph contains \var{profiles $\mapsto$ components $\mapsto$ elements $\mapsto$ datcats}
+\item[SMC graph all]
+        additionally rendering the new profile-groups and relations between data categories (from Relation Registry)
+\item[only profiles + datcats]
+        just profiles and data categories are rendered (with direct links between those, skipping all components and elements)
+\item[profiles + datcats + groups + rr]
+        as above but again with profile-groups and relations
+\item[only profiles]
+       just profiles with links between them representing the degree of similarity based on the reuse of components and data categories
+\end{description}
+Additionally, a detour pass (\var{track D}) is executed, in which the graph is also transformed into the DOT format and run through the \xne{Graphviz dot} tool to get an SVG representation of the graph. In an early stage of development, this was actually the only processing path. However, it soon became obvious that the graph was getting too large to be displayed in its entirety. Figure \ref{fig:cmd-dep-dotgraph} displays an old version of such a dot-generated graph visualization. Currently, the \xne{dot} output is only used as input for the final graph data, providing initialization coordinates for the nodes in the \code{dot}-layout.
+\begin{figure*}
+\includegraphics[width=1\textwidth]{images/smc_processing_-mdrepo}
+\caption{The data flow in the process of precomputing data for the SMC browser}
+\label{fig:smc_processing}
+\end{figure*}
+\subsubsection{User interface}
+\begin{figure*}[!ht]
+\includegraphics[width=1\textwidth]{images/navigation_bar_2013-09-28.png}
+\caption{Navigation bar of the SMC Browser with a number of options to manipulate the visible graph}
+\label{fig:navbar}
+\end{figure*}
+As proposed in the design section, the starting point when using the SMC browser is the node list on the left, listing all nodes grouped by type (profiles, components, elements, data categories) and sorted alphabetically. This list can be filtered by a simple substring search, which is important, as there are already more than 4,000 nodes in the graph. Individual nodes are selected and deselected by a simple click. All selected nodes are displayed in the main graph pane, each represented by a circle with a label. The representation is styled by type. Based on the settings in the navigation bar (cf. figure \ref{fig:navbar}), related nodes are displayed next to the selected nodes. The \code{depth-before} and \code{depth-after} options govern how many levels in each direction are traversed and displayed, starting from the set of selected nodes. The option \code{layout} allows selecting one of the available layouts -- next to the
+basic \code{force} layout there are also directed layouts, which are often better suited for displaying the directed graph.
+Other options influence the layout algorithm (\code{link-distance}, \code{charge}, \code{friction}) and the visual representation of the nodes and edges (\code{node-size, labels, curve}).
+One special option is \code{graph}, which allows switching between the different graphs listed in \ref{smc-browser-data-preprocessing}.
+There is user documentation deployed with the application and featured in the appendix \ref{sec:smc-browser-userdocs}, where all aspects of interaction with the application (\ref{interaction}) and the options in the navigation bar (\ref{options}) are described.
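The effect of the \code{depth-before} and \code{depth-after} options can be sketched as a bounded breadth-first traversal in both edge directions. The following Python fragment is only an illustration under assumptions (a plain edge list as input, the hypothetical function name \code{neighbourhood}); it does not reproduce the actual JavaScript code of the application.

```python
from collections import deque


def neighbourhood(edges, selected, depth_before, depth_after):
    """Return the selected nodes plus related nodes within the given depths.

    edges: iterable of (source, target) pairs of a directed graph
    selected: iterable of node keys chosen by the user
    depth_before: number of levels to follow against the edge direction
    depth_after: number of levels to follow along the edge direction
    """
    successors, predecessors = {}, {}
    for source, target in edges:
        successors.setdefault(source, set()).add(target)
        predecessors.setdefault(target, set()).add(source)

    def reach(adjacency, max_depth):
        # Breadth-first search that stops expanding at max_depth.
        seen = set(selected)
        frontier = deque((node, 0) for node in seen)
        while frontier:
            node, depth = frontier.popleft()
            if depth == max_depth:
                continue
            for neighbour in adjacency.get(node, ()):
                if neighbour not in seen:
                    seen.add(neighbour)
                    frontier.append((neighbour, depth + 1))
        return seen

    return reach(predecessors, depth_before) | reach(successors, depth_after)


if __name__ == "__main__":
    chain = [("profile", "component"), ("component", "element"), ("element", "datcat")]
    print(sorted(neighbourhood(chain, ["component"], 1, 1)))
```

With both depths set to zero only the selected nodes themselves remain visible.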
+\subsection{Extensions}
+Next to the basic setup described above, there are a number of possible additional features that could enhance the functionality and usefulness of the discussed tool.
+\subsubsection*{Graph operations -- differential views}
+An important feature would be to be able to apply set operations on selected (sub)graphs, especially \emph{intersection} and \emph{difference}. This would enable the user to easily extract components (nodes) that are shared (or not shared) among given schemas (subgraphs).
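Such a differential view could, for instance, be derived from the sets of nodes reachable from two profiles. The Python sketch below assumes a plain directed edge list and hypothetical function names; it illustrates the idea and is not an existing feature of the tool.

```python
def reachable(edges, start):
    """Collect all nodes reachable from start along directed edges."""
    adjacency = {}
    for source, target in edges:
        adjacency.setdefault(source, []).append(target)
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for neighbour in adjacency.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    return seen


def compare_subgraphs(edges, profile_a, profile_b):
    """Intersection and differences of the subgraphs below two profiles."""
    sub_a = reachable(edges, profile_a)
    sub_b = reachable(edges, profile_b)
    return {
        "shared": sub_a & sub_b,   # nodes used by both profiles
        "only_a": sub_a - sub_b,   # nodes used only by profile_a
        "only_b": sub_b - sub_a,   # nodes used only by profile_b
    }


if __name__ == "__main__":
    demo = [("A", "c1"), ("A", "c2"), ("B", "c2"), ("B", "c3"), ("c2", "e1")]
    print(compare_subgraphs(demo, "A", "B"))
```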
+\subsubsection*{Generalization}
+There is a high potential to broaden the scope of application for the discussed tool, provided some generalizations are taken into account.
+Equipped with a more flexible or modular matching algorithm (in addition to the initially foreseen identity match), the tool could visualize matches between any given schemas, not only CMD-based ones.
+Also, the input format being a graph, with appropriate preprocessing the tool could visualize any structural information that is suited to be expressed as a graph, like cooccurrence analysis, dependency networks, RDF data in general etc.
+\subsubsection*{Viewer for external data}
+The above feature would be even more useful if the application were able to ingest and process external data. The data could be passed either via upload or via a parameter with a URL of the data. This is especially attractive to providers of other data and applications, who could place a simple link in their user interface (with the data parameter appropriately set) that would allow their data to be visualized in the SMC browser.
+One prominent visualization application offering this feature is the geobrowser e4D\furl{http://www.informatik.uni-leipzig.de:8080/e4D/} (currently \xne{GeoTemCo}\furl{https://github.com/stjaenicke/GeoTemCo}, developed in the context of the \xne{europeana connect} initiative), accepting data in KML format.
+\subsubsection*{Integrate with instance data}
+The usefulness and information gain of the application could be greatly increased by integrating the instance data, i.e. by generating and displaying a variant of the graph which contains only profiles for which instance data is actually present in the CLARIN joint metadata domain. In such a visualization the size of the data could be incorporated as well, in the simplest case by mapping the number of records onto the radius of the nodes, but a number of other metrics could be applied too.
+Such a visualization could also feature direct search links from individual nodes into the dataset, i.e. from a profile node a link could lead into a search interface listing the metadata records of the given profile.
 \section{Summary}
+In this core chapter, we laid out a design for a system dealing with concept-based crosswalks on schema level.
+The system consists of three main parts: the crosswalk service, the query expansion module and \xne{SMC Browser} -- a tool for visualizing and exploring the schemas and the corresponding crosswalks.

SMC4LRT/chapters/Infrastructure.tex

-                      r3553
+                      r3638
 \section{CLARIN / CMDI}
+\section{CLARIN}
 \label{def:CLARIN}
+CLARIN - Common Language Resource and Technology Infrastructure\cite{Varadi2008} - is one of the large research infrastructure initiatives as envisaged by the European Strategy Forum on Research Infrastructures (ESFRI) and fostered by the framework programmes of the European Commission. The mission of this project is to provide
+\begin{quote}
+\dots easy and sustainable access for scholars in the humanities and social sciences to digital language data (in written, spoken, video or multimodal form) and advanced tools to discover, explore, exploit, annotate, analyse or combine them, independent of where they are located.\cite{CLARIN2013web}
+\end{quote}
+\begin{comment}
+To this end CLARIN is in the process of building a networked federation of European data repositories, service centres and centres of expertise, with single sign-on access for all members of the academic community in all participating countries. Tools and data from different centres will be interoperable, so that data collections can be combined and tools from different sources can be chained to perform complex operations to support researchers in their work.
+\end{comment}
+The initiative foresees a federated network of centres providing resources and services in a harmonized, interoperable manner to the academic community in all participating countries.
+In the preparation phase of the project 2008 - 2011 over 180 institutions from 38 countries participated. In the construction phase, the action impetus moved, as projected, more to the individual national initiatives of this federated endeavour, while kept together by the common principles set up during the preparation phase and established processes and bodies ensuring the flow of information and coherent action on European level.
+Since 2013, CLARIN has also been a \emph{European Research Infrastructure Consortium} (ERIC), a new type of legal entity established within the EU, especially designed to give research infrastructure initiatives a more stable status and better means to act independently. This is an important step towards ensuring the continuity of the endeavour, a chronic problem of (international) projects.
+\section{Component Metadata Infrastructure -  CMDI}
 \label{def:CMDI}
+CLARIN - Common Language Resource and Technology Infrastructure - constituted by over 180 members from round 38 countries. The mission of this project is to
+\begin{quotation}
+\dots create a research infrastructure that makes language resources and technologies (LRT) available to scholars of all disciplines, especially SSH large-scale pan-European collaborative effort to create, coordinate and make language resources and technology available and readily usable.
+\end{quotation}
+The infrastructure foresees a federated network of centers (with federated identity management) but mainly providing resources and services in an agreed upon / coherent / uniform / consistent /standardized manner. The foundation for this goal shall be the Common or Component Metadata infrastructure, a model that caters for flexible metadata profiles, allowing to accommodate existing schemas.
+As stated before, the SMC is part of CMDI and depends on multiple modules on the production side of the infrastructure. Before we describe the interaction itself in chapter \ref{ch:design}, we introduce in short these modules and the data they provide:
+One core pillar of CLARIN is the \emph{Component Metadata Infrastructure} (CMDI)\furl{http://www.clarin.eu/cmdi} -- a distributed system consisting of multiple interconnected modules aimed at creating and providing metadata for LRT in a coherent harmonized way. The conceptual foundation of CMDI is the \emph{Component Metadata Framework}\cite{Broeder+2010}, a flexible meta model that supports creation of metadata schemas also allowing to accommodate existing schemas (cf. \ref{def:CMD}).
+The SMC is part of CMDI and depends on multiple modules on the production side of the infrastructure. Before we describe the SMC and its interaction with these modules in detail in chapter \ref{ch:design}, we introduce the latter and the type of data they provide:
 \begin{itemize}
 \item Data Category Registry
+\item Component Registry
 \item Relation Registry
+\item Component Registry
+\item Vocabulary Alignment Service (OpenSKOS)
+\end{itemize}
+\noindent
+All these components are running services, that this work shall directly build upon.
+Next to these core services, that SMC has direct dependencies to, some other services are being developed within the CMDI ecosystem that are also relevant in the context of SMC:
+\begin{itemize}
 \item Schema Registry (SCHEMAcat\furl{http://lux13.mpi.nl/schemacat/site/index.html})
 \item SchemaParser
+\item Vocabulary Alignment Service (OpenSKOS)
 \end{itemize}
 On the other hand, SMC shall serve the modules on the exploitation side of the infrastructure, i.e. search services used by end users. These are briefly introduced in \label{cmdi_exploitation}.
+On the other hand, SMC shall serve the modules on the exploitation side of the infrastructure, i.e. search services used by end users. These are briefly introduced in \ref{cmdi_exploitation}.
 \begin{figure*}[!ht]
 …
+\subsection{CMDI registries: DCR, CR, RR}
+\label{def:CMD}
+\label{def:DCR}
+The \emph{Data Category Registry} (DCR) is a central registry that enables the community to collectively define and maintain a set of relevant linguistic data categories. The resulting commonly agreed controlled vocabulary is the cornerstone for grounding the semantic interpretation within the CMD framework.
+The data model and the procedures of the DCR are defined by the ISO standard \cite{ISO12620:2009}, and are implemented in \emph{ISOcat}\footnote{\url{http://www.isocat.org/}}.
+Next to a web interface for users to browse and manage the data categories, DCR provides a REST-style webservice allowing applications to access the information (provided in Data Category Interchange Format - DCIF). The data categories are assigned a persistent identifier, making them globally and permanently referenceable.
+The \emph{Component Metadata Framework} (CMD) is built on top of the DCR and complements it. While the DCR defines the atomic concepts, within CMD the metadata schemas can be constructed out of reusable components - collections of metadata fields. The components can contain other components, and they can be reused in multiple profiles as long as each field ``refers via a PID to exactly one data category in the ISO DCR, thus indicating unambiguously how the content of the field in a metadata description should be interpreted'' \cite{Broeder+2010}. This allows to trivially infer equivalences between metadata fields in different CMD-based schemata. While the primary registry used in CMD is the ISOcat DCR, other authoritative sources for data categories (``trusted registries'') are accepted, especially Dublin Core Metadata Initiative \cite{DCMI:2005}.
+\emph{Component Registry} implements the Component Data Model and allows to define, maintain and publish CMD-components and -profiles.
+\subsection{CMDI registries}
+The CMD framework as data model (cf. \ref{def:CMD}), together with the two registries, the \emph{Data Category Registry} \xne{ISOcat} and the \emph{Component Registry}, builds the backbone of the CMD Infrastructure. In the following we briefly explain their role and interaction.
 \begin{figure*}[!ht]
 …
 \end{figure*}
+\subsubsection*{Data Category Registry}
+\label{def:DCR}
+The \emph{Data Category Registry} (DCR) is a central registry that enables the community to collectively define and maintain a set of relevant linguistic data categories. The resulting commonly agreed controlled vocabulary is the cornerstone for grounding the semantic interpretation within the CMD framework.
+The data model and the procedures of the DCR are defined by the ISO standard \cite{ISO12620:2009}, and are implemented in \xne{ISOcat}\furl{http://www.isocat.org/}.
+Next to a web interface for users to browse and manage the data categories, DCR provides a REST-style webservice allowing applications to access the information (provided in Data Category Interchange Format - DCIF). The data categories are assigned a persistent identifier, making them globally and permanently referenceable.
+\subsubsection*{Component Registry}
+\emph{Component Registry} (CR)\furl{http://catalog.clarin.eu/ds/ComponentRegistry/} implements the CMD data model and fulfills two functions. For one, it is a robust web application for creating and editing new CMD components and profiles. On the other hand, it is the actual registry that persistently stores and exposes published CMD profiles, allowing users to browse and search them and view their structure.
+The primary user of the CR is the metadata modeller with the task to create a dedicated metadata profile for a given resource type. She can browse and search the CR for components and profiles that are suitable or come close. The registry already contains many general components, e.g., for contact persons, language and geographical information. In general many of these can be reused as they are or have to be only slightly adapted, i.e., by adding or removing some metadata elements and/or components. Also new components can be created to model the unique aspects of the resources under consideration. All components are combined into one profile. Components, elements and values should be linked to a concept to make their semantics explicit.\cite{Durco2013_MTSR}
+This approach of integrating prerequisites for semantic interoperability directly into the process of metadata creation differs
+from the traditional methods of schema matching that try to establish pairwise alignments between schemas only after they were created and published -- be it algorithm-based or by means of explicit manually defined crosswalks\cite{Shvaiko2005}.
+\subsubsection*{Ontological Relations -- Relation Registry}
 The framework as described so far provides a sound mechanism for binding the semantic interpretation of the metadata descriptions.
 However there needs to be an additional means to capture information about relations between data categories.
+This information was deliberately not included in the DCR, because relations often depend on the context in which they are used, making global agreement unfeasible. CMDI proposes a separate module -- the \emph{Relation Registry}\label{def:rr} (RR) \cite{Kemps-Snijders+2008} --, where arbitrary relations between data categories can be stored and maintained. We expect that the RR should be under control of the metadata user whereas the DCR is under control of the metadata modeler.
+ These relations don't need to pass a standardization process, but rather separate research teams may define their own sets of relations according to the specific needs of the project. That is not to say that every researcher has to create her own set of relations -- some basic recommended sets will be defined right from the start. But new -- even contradictory -- ones can be created when needed.
+There is a prototypical implementation of such a relation registry called \emph{RELcat} being developed at MPI, Nijmegen. \cite{Windhouwer2011,SchuurmanWindhouwer2011}, that already hosts a few relation sets. There is no user interface to it yet, but it is accessible as a REST-webservice\footnote{sample relation set: \url{http://lux13.mpi.nl/relcat/rest/set/cmdi}}.
+This information was deliberately not included in the DCR, because relations often depend on the context in which they are used, making global agreement unfeasible. CMDI proposes a separate module -- the \emph{Relation Registry}\label{def:rr} (RR) \cite{Kemps-Snijders+2008} --, where arbitrary relations between data categories can be stored and maintained. This design grounds on the expectation that the RR should be under control of the metadata user whereas the DCR is under control of the metadata modeller.
+These relations don't need to pass a standardization process, but rather separate research teams may define their own sets of relations according to the specific needs of the project. That is not to say that every researcher has to create her own set of relations -- some basic recommended sets will be defined right from the start. But new -- even contradictory -- ones can be created when needed.
+There is a prototypical implementation of such a relation registry called \emph{RELcat} being developed at MPI, Nijmegen\cite{Windhouwer2011,SchuurmanWindhouwer2011}, that already hosts a few relation sets. There is no user interface to it yet, but it is accessible as a REST-webservice\footnote{sample relation set: \url{http://lux13.mpi.nl/relcat/rest/set/cmdi}}.
 This implementation stores the individual relations as RDF-triples
+\begin{example}
+<subjectDatcat, relationPredicate, objectDatcat>
+\end{example}
+allowing typed relations, like equivalency (\texttt{rel:sameAs}) and subsumption (\texttt{rel:subClassOf}). The relations are grouped into relation sets that can be used independently.
+!check DCR-RR/Odijk2010 -follow up
+!Cf. Erhard Hinrichs 2009
+\begin{example3}
+<subjectDatcat, & relationPredicate, & objectDatcat>
+\end{example3}
+allowing typed relations, like equivalency (\code{rel:sameAs}) and subsumption (\code{rel:subClassOf}). The relations are grouped into relation sets that can be used independently.
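To illustrate how such a relation set might be consumed, the following Python sketch looks up all data categories equivalent to a given one. It rests on several assumptions that are not taken from \xne{RELcat}: the relation set is available as a plain list of triples, the identifiers are invented placeholders, and \code{rel:sameAs} is treated as symmetric and transitive.

```python
from collections import deque

SAME_AS = "rel:sameAs"


def equivalents(triples, datcat):
    """Return all data categories linked to datcat by chains of rel:sameAs.

    triples: list of (subject, predicate, object) tuples from one relation set
    Symmetry and transitivity of rel:sameAs are assumptions of this sketch.
    """
    # Build an undirected adjacency map from the equivalence triples only.
    adjacency = {}
    for subject, predicate, obj in triples:
        if predicate == SAME_AS:
            adjacency.setdefault(subject, set()).add(obj)
            adjacency.setdefault(obj, set()).add(subject)

    seen = {datcat}
    queue = deque([datcat])
    while queue:
        current = queue.popleft()
        for neighbour in adjacency.get(current, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    seen.discard(datcat)  # the starting category is not its own equivalent
    return seen


if __name__ == "__main__":
    # Placeholder identifiers, not real data category PIDs.
    relation_set = [
        ("dc:a", SAME_AS, "dc:b"),
        ("dc:b", SAME_AS, "dc:c"),
        ("dc:d", "rel:subClassOf", "dc:a"),
    ]
    print(sorted(equivalents(relation_set, "dc:a")))
```

Because relation sets can be used independently, a caller would pass only the triples of the set it trusts.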
+\todoin{check DCR-RR/Odijk2010 -follow up ?; Cf. Erhard Hinrichs 2009 }
+\subsubsection*{Schema Registry}
 SCHEMAcat is a registry for schemata of all kinds (not just XML-based) semantically annotated with data categories.
 …
 (search) algorithms to traverse the semantic graph thus made explicit\cite{Schuurman2011_SCHEMAcat}.
-\noindent
-All these components are running services, that this work shall directly build upon.
-This approach of integrating prerequisites for semantic interoperability directly into the process of metadata creation differs
-from the traditional methods of schema matching that try to establish pairwise alignments between schemas only after they were created and published. % -- be it algorithm-based or by means of explicit manually defined crosswalks \cite{Shvaiko2005}.
-Consequently, the infrastructure also foresees a dedicated module, \emph{Semantic Mapping}, that exploits this mechanism to deliver correspondences between different metadata schemas. The details of its functioning and its interaction with the aforementioned modules is described in the following chapter \ref{ch:design}.
 \subsection{Vocabulary Service / Reference Data Registry}
 …
 The urgent need for reliable community-shared registry services for concepts, controlled vocabularies and reference data for both the LRT and Digital Humanities community has been discussed on many occasions in various contexts. Applications and tasks requiring or profiting from this kind of service comprise Data-Enrichment / Annotation, Metadata Generation, Curation, Data Analysis, etc. As there is a substantial overlap in the vocabularies relevant for the various communities and even more so a high potential for reusability on the technical level, there is a strong case for tight cooperation between different initiatives.
+In the context of the CLARIN initiative, one activity to tackle this issue -- mainly driven by CLARIN-NL -- is the project/taskforce \emph{CLAVAS - Vocabulary Alignment Service for CLARIN} where the plan is to reuse and enhance for CLARIN needs a SKOS-based  vocabulary repository and editor OpenSKOS\furl{http://openskos.org}, developed and run within the Dutch program CATCHplus\footnote{\textit{Continuous Access To Cultural Heritage} - \url{http://www.catchplus.nl/en/}}. See below for a more detailed description of this system. As of spring 2013, the Standing Committee on CLARIN Technical Centers (SCCTC) adopted the issue of Controlled Vocabularies and Concept Registries as one of the infrastructural (A-center) services to be dealt with.
+In the context of the CLARIN initiative, one activity to tackle this issue -- mainly driven by CLARIN-NL -- is the project/taskforce \emph{CLAVAS - Vocabulary Alignment Service for CLARIN} where the plan is to reuse and enhance for CLARIN needs a SKOS-based  vocabulary repository and editor OpenSKOS\furl{http://openskos.org}, developed and run within the Dutch program CATCHplus\footnote{\textit{Continuous Access To Cultural Heritage} - \url{http://www.catchplus.nl/en/}}. See below for a more detailed description of this system. As of spring 2013, the Standing Committee on CLARIN Technical Centres (SCCTC) adopted the issue of Controlled Vocabularies and Concept Registries as one of the infrastructural (A-centre) services to be dealt with.
+\begin{note}
 In parallel, within the sister ESFRI project DARIAH a taskforce with the same goal has been set up : \emph{Service for Reference Data and Controlled Vocabularies}. This taskforce was introduced at the 2nd VCC Meeting in Vienna in November 2012. It is conceived as a collaborative endeavor between VCC1/Task 5: Data federation and interoperability and VCC3/Task3: Reference Data Registries (and external partners). The main goal is to \emph{establish a service providing controlled vocabularies and reference data} for the DARIAH (and CLARIN) community.
-Regarding the responsibilities of the DARIAH working groups:
-VCC3/Task 3 identifies and recommends vocabularies relevant for the community. VCC1/Task 5 provides basic/generic services relevant for whole community. Especially, the Schema Registry, that allows to express mappings between different schemas seems to be one starting point. In accordance with the VCC1 strategy, concentrate on pulling together (pooling) existing resources and only implement necessary ``glue'' to put the pieces together (data conversion, service-wrappers...)
 Thus there is a momentum and a high potential for a collaborative approach in at least these two big initiatives CLARIN and DARIAH, that serve a very wide-spread and diverse community.
+\end{note}
 \subsubsection{Abstract service description}
 As to the service itself it is primarily meant to serve other applications, rather than being used directly by end users, but a basic user interface is still necessary for administration etc.  By using global semantic identifiers instead of strings, such a service enables the harmonization of metadata descriptions and annotations and is an indispensable step towards semantic data and \xne{LOD}.
 Besides providing vocabularies, the service should also hold and expose equivalencies (and other relationships) between concepts from different vocabularies (concept schemes). These relationships come primarily from existing mappings, but can (and hopefully will) be subsequently generated (manually) for specific subsets on demand in a community process. An example for equivalencies from Wikipedia\footnote{\href{http://de.wikipedia.org/wiki/Johann_Wolfgang_von_Goethe}{page for J. W. Goethe}}:
+Besides providing vocabularies, the service should also hold and expose equivalences (and other relationships) between concepts from different vocabularies (concept schemes). These relationships come primarily from existing mappings, but can (and hopefully will) be subsequently generated (manually) for specific subsets on demand in a community process. An example for equivalences from Wikipedia\footnote{\href{http://de.wikipedia.org/wiki/Johann_Wolfgang_von_Goethe}{page for J. W. Goethe}}:
 \begin{verbatim}
+GND: 118540238 | LCCN: n79003362 | NDL: 00441109 | VIAF: 24602065 | Wikipedia-Personensuche
+GND: 118540238 | LCCN: n79003362 |
+NDL: 00441109 | VIAF: 24602065
 \end{verbatim}
 \subsubsection{Vocabulary Service - CLAVAS}
 \label{def:CLAVAS}
 As described in previous section (\ref{def:DCR}), a solid pilar for defining and maintaining data categories is the ISOcat data category registry. However, while ISOcat has been in productive use for some time, it is -- by design -- not usable for all kinds of reference data. In general, it suits well for defining concepts/data categories (with closed or open concept domains), but its complex data model and standardization workflow does not lend itself well to maintain ``semi-closed'' concept domains, controlled vocabularies, like lists of entities (e.g. organizations or authors). In such cases, the concept domain is not closed (new entities need to be added), but it is also not open (not any string is a valid entity). Besides, the domain may be very large (millions of entities) and has to be presumed changing (especially new entities being added).
+As described in previous section (\ref{def:DCR}), a solid pillar for defining and maintaining data categories is the ISOcat data category registry. However, while ISOcat has been in productive use for some time, it is -- by design -- not usable for all kinds of reference data. In general, it suits well for defining concepts/data categories (with closed or open concept domains), but its complex data model and standardization workflow does not lend itself well to maintain ``semi-closed'' concept domains, controlled vocabularies, like lists of entities (e.g. organizations or authors). In such cases, the concept domain is not closed (new entities need to be added), but it is also not open (not any string is a valid entity). Besides, the domain may be very large (millions of entities) and has to be presumed changing (especially new entities being added).
 This shortcoming leads to a need for an additional registry/repository service for this kind of data (controlled vocabularies). Within the CLARIN project mainly the abovementioned taskforce \emph{CLAVAS} is concerned with this challenge.
 …
 \label{interaction-dcr-skos}
 DCR recognizes following types of data categories (Figure \ref{fig:dc_type}):
+simple, complex: closed, open, constrained, (container)?
+\code{simple, complex: closed, open, constrained, (container)?}
 \begin{figure*}[!ht]
 …
 The semantic proximity of a /data category/ to a /concept/ may mislead to
 a na"ive approach to mapping DCR to SKOS, namely mapping every data category (from one profile) to a concept
 all of them belonging to the \xne{ISOcat-profile:ConceptScheme}.
+The fact that data categories are basically definitions of concepts may mislead to
+a na"ive approach to mapping DCR to SKOS, namely mapping every data category to a \code{skos:Concept}
+all of them belonging to the \xne{ISOcat:ConceptScheme}.
 However this is not practical/useful, ISOcat as whole is too disparate, and so would be the resulting vocabulary.
+A more sensible approach is to export only closed DCs as separate ConceptSchemes and their respective simple DCs as Concepts within that scheme.
+A more sensible approach is to export only closed DCs as separate \code{skos:ConceptSchemes} and their respective simple DCs as \code{Concepts} within that scheme.
+\begin{quotation}
 The rationale is, that if we see a vocabulary as a set of possible values for a
 field/element/attribute, complex DCs in ISOcat are the users of such
 vocabularies and simple DCs the DCR equivalence of values in such a
+vocabulary.\cite{Menzo2013mail}
+Another aspect is, that a simple DC can be in valuedomains of multiple closed DCs.
+Also a skos:Concept can belong to multiple ConceptSchemes\furl{http://www.w3.org/TR/skos-primer/\#secscheme}.
+So there could a 1:1 one mapping [complex closed DCs] to [skos:ConceptSchemes] and [simple DCS] to [skos:Concepts].
+vocabulary.
+\end{quotation}\cite{Menzo2013mail}
+Another aspect is, that a simple DC can be in value domains of multiple closed DCs.
+Also a \code{skos:Concept} can belong to multiple \code{skos:ConceptSchemes}\furl{http://www.w3.org/TR/skos-primer/\#secscheme}.
+So there could be a 1:1 mapping of [complex closed DCs] to [skos:ConceptSchemes] and of [simple DCs] to [skos:Concepts].
 That would automatically convey also the possibly multiple membership of simple DCs / skos:Concepts in closed DCs / skos:ConceptSchemes.
 …
 \todocite {MI Search Engine}
 And finally, there is the \emph{Metadata Repository} aimed to collect all the harvested metadata descriptions from CLARIN centers,
+And finally, there is the \emph{Metadata Repository} aimed to collect all the harvested metadata descriptions from CLARIN centres,
 and \emph{Metadata Service} that provides search access to this body of data. As such, Metadata Service is the primary application to use Semantic Mapping, to optionally expand user queries before issuing a search in the Metadata Repository. \cite{Durco2011}

SMC4LRT/chapters/Literature.tex

-                      r3551
+                      r3638
 \subsection{Metadata}
 A comprehensive architecture for harmonized handling of metadata -- the Component Metadata Infrastructure (CMDI)\footnote{\url{http://www.clarin.eu/cmdi}} \cite{Broeder2011} -- is being implemented within the CLARIN project\footnote{\url{http://clarin.eu}}. This service-oriented architecture consisting of a number of interacting software modules allows metadata creation and provision based on a flexible meta model, the \emph{Component Metadata Framework}, that facilitates creation of customized metadata schemas -- acknowledging that no one metadata schema can cover the large variety of language resources and usage scenarios -- however at the same time equipped with well-defined methods to ground their semantic interpretation in a community-wide controlled vocabulary -- the data category registry \cite{Kemps-Snijders+2009,Broeder2010}.
+A comprehensive architecture for harmonized handling of metadata -- the Component Metadata Infrastructure (CMDI)\furl{http://www.clarin.eu/cmdi} \cite{Broeder2011} -- is being implemented within the CLARIN project\footnote{\url{http://clarin.eu}}. This service-oriented architecture consisting of a number of interacting software modules allows metadata creation and provision based on a flexible meta model, the \emph{Component Metadata Framework}, that facilitates creation of customized metadata schemas -- acknowledging that no one metadata schema can cover the large variety of language resources and usage scenarios -- however at the same time equipped with well-defined methods to ground their semantic interpretation in a community-wide controlled vocabulary -- the data category registry \cite{Kemps-Snijders+2009,Broeder2010}.
 Individual components of this infrastructure will be described in more detail in the section \ref{ch:infra}.
 …
 In their rather theoretical work Ehrig and Sure \cite{EhrigSure2004} elaborate on the various similarity measures which are at the core of the mapping task. On the dedicated platform OAEI\footnote{Ontology Alignment Evaluation Initiative - \url{http://oaei.ontologymatching.org/}} an ongoing effort is being carried out and documented comparing various alignment methods applied on different domains.
 One more specific recent inspirative work is that of Noah et. al \cite{Noah2010} developing a semantic digital library for an academic institution. The scope is limited to document collections, but nevertheless many aspects seem very relevant for this work, like operating on document metadata, ontology population or sophisticated querying and searching.
+One more specific recent inspirational work is that of Noah et al. \cite{Noah2010}, developing a semantic digital library for an academic institution. The scope is limited to document collections, but many aspects nevertheless seem very relevant for this work, such as operating on document metadata, ontology population, or sophisticated querying and searching.
 \todoin{check if relevant: http://schema.org/}
 …
 \subsection{Ontology Visualization}
+Landscape, Treemap, SOM
+\todoin{check Ontology Mapping and Alignment / saiks/Ontology4 4auf1.pdf}
 …
 \section{Summary}
+This chapter concentrated on the current affairs/developments regarding the infrastructures for Language Resources and Technology and
+on the other hand gave an overview of the state of the art regarding methods to be applied in this work: Semantic Web Technologies, Ontology Mapping and Ontology Visualization.
+This chapter covered current developments in infrastructures for Language Resources and Technology and gave an overview of the state of the art in the methods applied in this work: Semantic Web Technologies, Ontology Mapping and Ontology Visualization.

SMC4LRT/chapters/Results.tex

-                      r3551
+                      r3638
 \subsection{SMC Browser -- Advanced Interactive User Interface}
+\subsection{SMC Browser -- advanced interactive user interface}
 SMC Browser\furl{http://clarin.aac.ac.at/smc-browser} is a web application to explore the complex dataset of the Component Metadata Framework, by visualizing its structure as an interactive graph.
+In particular, it enables the metadata modeller to examine the reuse of components or DCs in different profiles. The graph is accompanied by statistical information about individual `nodes', e.g. counting how many elements a profile contains, or in how many profiles a DC is used.
 It is implemented on top of the JavaScript library d3; the code is checked into the CLARIN SVN repository.
 …
 The model has been expressed as four CMD profiles, one for each distinct resource type, with all four sharing most of their components, as can be seen in figure \ref{fig:resource_info_5}. The biggest single profile is currently the remodelled maximum schema from the META-SHARE project for describing corpora, with 117 distinct components and 337 elements. When expanded, this translates to 419 components and 1587 elements. However, many of the components and elements are optional (and conditional), thus a specific instance will never use all the possible elements.
 In a parallel effort, LINDAT, the czech national infrastructure initiative engaged in both CLARIN and META-SHARE, created a CMD profile (\xne{data}\furl{http://catalog.clarin.eu/ds/ComponentRegistry/?item=clarin.eu:cr1:p_1349361150622}) modelling the minimal obligatory set of META-SHARE \xne{resourceInfo}), however combined with a simple dublincore record.
+In a parallel effort, LINDAT, the Czech national infrastructure initiative engaged in both CLARIN and META-SHARE, created a CMD profile (\concept{data}\furl{http://catalog.clarin.eu/ds/ComponentRegistry/?item=clarin.eu:cr1:p_1349361150622}) modelling the minimal obligatory set of META-SHARE \concept{resourceInfo}, however combined with a simple Dublin Core record.
 This way, the information gets partly duplicated, but with the advantage that minimal information is conveyed in a widely understood format, while retaining the expressivity of the feature-rich schema.
 …
 \item MD Search employing Semantic Mapping
 \item MD Search employing Fuzzy Search
+\item Visualize impact of given mapping in terms of covered dataset (number of matched records).
 \end{itemize}

SMC4LRT/chapters/abstract_en.tex

-                      r2672
+                      r3638
 \chapter*{Abstract}
+According to the guidelines of the faculty, an abstract in English has to be inserted here.
+This work is embedded in the context of a large research infrastructure initiative aimed at easing and harmonizing access to language resources and technology (LRT) in Europe, the \emph{Common Language Resource and Technology Infrastructure} or CLARIN. A core technical pillar of this initiative is the \emph{Component Metadata Infrastructure}, a distributed system for creating and providing metadata for LRT in a coherent, harmonized way. The outcome of this work, the \emph{Semantic Mapping Component}, was conceived as one module within the infrastructure dedicated to overcoming the semantic interoperability problem stemming from the heterogeneity of the resource descriptions, by harnessing the mechanisms of the semantic layer built into the core of the infrastructure.
+The ultimate objective of the effort -- in line with the overall mission of the infrastructure -- was to \emph{enhance search functionality} over the large heterogeneous collection of resource descriptions. This was pursued by two separate, complementary approaches: a) Enriching the search capabilities with concept-based crosswalks on schema level.
+And -- acknowledging the integrative power of the \emph{Linked Open Data} paradigm  -- b) expressing the domain data as a \emph{Semantic Web} resource.
+In parallel with the two approaches, the work delivered two main results: a) the \emph{specification} of the module for \emph{concept-based search} together with the underlying \emph{crosswalks service} accompanied by a proof-of-concept implementation. And b) the blueprint for expressing the original dataset in RDF, effectively laying a foundation for providing this dataset as \emph{Linked Open Data}.
+As a by-product, the application \emph{SMC browser} was developed -- a visualization tool for interactive exploration of the dataset. This tool provided means to generate a number of advanced analyses of the data, directly used by the community for exploration and curation of the complex dataset.  As such, they are considered the main contribution of this work by the author.

SMC4LRT/chapters/appendix.tex

-                      r3551
+                      r3638
+\chapter{Data model ?}
+\chapter{Data model reference}
+In the following, the complete data models and schemas are listed for reference: the diagram of the data model for data category specification in figure~\ref{fig:DCR_data_model}; \xne{Terms.xsd}, the XML schema used internally by the SMC module, in listing~\ref{lst:terms-schema} (cf. \ref{datamodel-terms}); and \xne{general-component-schema.xsd}\furl{https://infra.clarin.eu/cmd/general-component-schema.xsd}, the schema representing the CMD meta model for defining CMD profiles and components, in listing~\ref{lst:cmd-schema}. Figure~\ref{fig:ref_arch} depicts an abstract reference architecture that provides a conceptual frame for this work, and figure~\ref{fig:acdh_context} gives an overview of the roles and services of the \xne{ACDH -- Austrian Centre for Digital Humanities}, the home of SMC, showing the current architectural context of SMC.
 \begin{figure*}[!ht]
 \begin{center}
 …
 \label{fig:DCR_data_model}
 \end{figure*}
+\input{images/Terms.xsd}
+\input{images/general-component-schema.xsd}
 \begin{figure*}[!ht]
 …
 \end{figure*}
-\section {SMC Reports}
-\label{sec:reports}
+SCM Reports is a (growing) set of documents analyzing specific phenomena in the CMD data domain based on the visual and numerical output from the SMC Browser \ref{smc-browser}.
+\chapter{SMC Browser}
+\begin{figure*}[!ht]
+\begin{center}
+\includegraphics[width=1\textwidth]{images/cmd-deps-graph_part2.png}
+\end{center}
+\caption{An early version of a visual representation of (a part of) the \xne{smc-graph} generated with the \code{dot} tool.}
+\label{fig:cmd-dep-dotgraph}
+\end{figure*}
+\section{SMC Browser user documentation}
+\label{sec:smc-browser-userdocs}
+\input{chapters/userdocs_cleaned}
+\chapter{SMC Reports}
+\label{ch:reports}
+SMC Reports is a (growing) set of documents analyzing specific phenomena in the CMD data domain based on the visual and numerical output from the SMC Browser \ref{smc-browser}.
 \input{chapters/examples_cleaned}

SMC4LRT/chapters/danksagung.tex

r2672	r3638
1	1	\chapter*{Danksagung}
2	2
3		Here you may optionally insert an acknowledgement.
	3	I would like to sincerely thank all the colleagues who stood by me with their advice,
	4	and my loved ones for the extra portion of patience that I demanded of them.
