Changeset 3665 for SMC4LRT

Timestamp:

10/02/13 19:52:31

Author:

vronk

Message:

rework of Results, Definitions, appendix, added Conclusion,
smaller changes to Design, Data

Location:

SMC4LRT/chapters

Files:

12 edited

Conclusion.tex (modified) (1 diff)
Data.tex (modified) (6 diffs)
Definitions.tex (modified) (1 diff)
Design_SMCinstance.tex (modified) (4 diffs)
Design_SMCschema.tex (modified) (16 diffs)
Infrastructure.tex (modified) (11 diffs)
Introduction.tex (modified) (11 diffs)
Results.tex (modified) (12 diffs)
abstract_de.tex (modified) (1 diff)
abstract_en.tex (modified) (1 diff)
appendix.tex (modified) (3 diffs)
danksagung.tex (modified) (1 diff)

SMC4LRT/chapters/Conclusion.tex

-                      r3551
+                      r3665
 \label{ch:conclusions}
 Further work is needed on more complex types of response (similarity ratio, relation types) and also on the interaction with Metadata Service to find the optimal way of providing the features of semantic mapping and query expansion as semantic search within the search user-interface.
+With this work, a technical description together with a prototypical implementation for the \emph{Semantic Mapping Component} was delivered -- one module within an infrastructure for providing metadata, the \emph{Component Metadata Infrastructure}.
+The statistics about current usage/population of the CMD demonstrate that the basic concept of a flexible metamodel with an integrated semantic layer is being taken up by the community. Metadata modellers are increasingly not only making use of the infrastructure, but are also reusing the modelling work done so far. The provisions designed to ensure semantic interoperability (DCR together with the RR) are practically in place and prove to be useful.
+SMC features a concept-based crosswalk service providing correspondences between fields in metadata formats and a module for query expansion building on top of it, allowing concept-based semantic search. Further work is needed on the crosswalk service providing more complex types of response (similarity ratio, relation types) with implications for the query expansion module. The integration of the semantic mapping features in the search user interface is only rudimentary at present, calling for a more elaborate solution.
+% Dynamic integration of the information from the Relation Registry into the search interface and search processing.
 More work is needed on consolidation of the actual values in the CMD records. CLARIN has set up a separate task force for data curation, which will have to be an ongoing effort. Also, work is ongoing on enriching the SMC browser with instance data information, making it possible to see and inspect directly which profiles and DCs are effectively being used in the instance data (and how often).
+A whole separate track is the effort to deliver the CMD data as \emph{Linked Open Data}, for which only the groundwork has been done by specifying the modelling of the data in RDF. Further steps are: setting up a processing workflow to apply the specified model and transform all the data (profiles and instances) into RDF, a server solution to host the data and allow querying it, and finally a web interface on top of it for users to explore the dataset.
+%Irrespective of the additional levels - the user wants and has to get to the resource. (not always) to the "original"
+And finally, a visualization tool for the schema level data of the discussed data collection was developed -- the \emph{SMC Browser}.
+Judging by the feedback received so far from colleagues in the community, it is already a useful tool with considerable further potential. As detailed in \ref{smc-browser-extensions}, there are a number of features that could enhance the functionality and usefulness of the tool: integration with instance data to see directly which profiles are effectively being used; set operations on subgraphs (like intersection and difference) to enable differential views; a generalized matching algorithm; and enhancing the tool to act as an independent visualization service by accepting external graph data (from any domain).
+Irrespective of the additional levels - the user wants and has to get to the resource. (not always)
+to the "original"
+Within the CLARIN community a number of (permanent) tasks has been identified and corresponding task forces have been established,
+one of them being metadata curation. The results of this work represent a directly applicable groundwork for this ongoing effort.
+One particularly pressing aspect of the curation is the consolidation of the actual values in the CMD records, a topic explicitly treated in this work.

SMC4LRT/chapters/Data.tex

-                      r3638
+                      r3665
 \label{def:CMD}
 The \emph{Component Metadata Framework} (CMD) is the data model of the CLARIN metadata infrastructure. (See \ref{CMDI} for information about the infrastructure. The XML-schema of CMD -- the general-component-schema -- is featured in appendix \ref{lst:general-component-schema}.)
+The \emph{Component Metadata Framework} (CMD) is the data model of the CLARIN Component Metadata Infrastructure. (See \ref{def:CMDI} for information about the infrastructure. The XML-schema defining CMD -- the \xne{general-component-schema} -- is featured in appendix \ref{lst:cmd-schema}.)
 CMD is used to define the so-called \var{profiles}, which are constructed out of reusable \var{components} -- collections of metadata fields. Components can contain other components and can be reused in multiple profiles. A profile itself is just a special kind of component (a subclass) with some additional administrative information.
 The actual core provision for semantic interoperability is the requirement, that each CMD element (i.e. metadata field) refers ``via a PID to exactly one data category\footnote{persistently referenceable concept definition} (cf. \ref{def:DCR}), thus
 indicating unambiguously how the content of the field in a metadata description should be interpreted'' \cite{Broeder+2010}.
+This approach of integrating prerequisites for semantic interoperability directly into the process of metadata creation is fundamentally different from the traditional methods of schema matching that try to establish pairwise alignments between already existing schemas -- be it algorithm-based or by means of explicit manually defined crosswalks\cite{Shvaiko2005}.
 While the primary registry for data categories used in CMD is the \xne{ISOcat} Data Category Registry (cf. \ref{def:DCR}), other authoritative sources are accepted (so-called ``trusted registries''), especially the set of terms maintained by the Dublin Core Metadata Initiative \cite{DCMI:2005}.
 …
 \caption{The development of defined profiles and DCs over time}
 \label{table:dev_profiles}
+  \begin{tabular}{ l | r | r | r | r }
+%  \begin{tabular}{ l | r | r | r | r }
+  \begin{tabular}{ l  r  r  r  r }
     \hline
 date     & 2011-01 & 2012-06 & 2013-01 & 2013-06  \\
 …
 \subsection{Instance Data}
 \todoin{ add historical perspective on data - list overall}
+\subsubsection{Instance Data}
+%\todoin{ add historical perspective on data - list overall}
 The main CLARIN OAI-PMH harvester\footnote{\url{http://catalog.clarin.eu/oai-harvester/}}
 …
 \caption{Top 20 profiles, with the respective number of records}
 \begin{center}
+  \begin{tabular}{ r | l }
+  \begin{tabular}{ r l }
+    \hline
 \# records & profile \\
     \hline
 …
 \caption{Top 20 collections, with the respective number of records}
 \begin{center}
+  \begin{tabular}{ r | l }
+  \begin{tabular}{ r l }
+    \hline
 \# records & collection \\
     \hline
 …
 \subsection{TEI / teiHeader}
+\label{tei}
  TEI/teiHeader/ODD,
 \subsection{ISLE/IMDI}

SMC4LRT/chapters/Definitions.tex

-                      r3553
+                      r3665
 \label{ch:def}
+\section {Abbreviations}
+\label{abbr}
+\begin{table}[!h]
+\caption{Acronyms used throughout this document}
+\begin{tabular}{ l p{0.8\textwidth} }
+ACDH & \xne{Austrian Centre for Digital Humanities}, cf. \ref{acdh} \\
+CLARIN & \xne{Common Language Resources and Technology Infrastructure} -- a research infrastructure initiative, cf. \ref{def:CLARIN} \\
+CLAVAS & \xne{Vocabulary Alignment Service for CLARIN}, cf. \ref{def:CLAVAS} \\
+CMD & \xne{Component Metadata Framework} -- the data model underlying the CMD Infrastructure, cf. \ref{def:CMD} \\
+CMDI & \xne{Component Metadata Infrastructure}, cf. \ref{def:CMDI} \\
+ERIC & \xne{European Research Infrastructure  Consortium} -- a legal entity for long-term research infrastructure initiatives \\
+DARIAH & \xne{Digital Research Infrastructure for Arts and Humanities}\furl{http://www.dariah.eu} -- another research infrastructure initiative, sister project to CLARIN \\
+DC & data category, cf. \ref{def:DCR}  \\
+DCR & data category registry, cf. \ref{def:DCR} \cite{ISO12620:2009} \\
+DH & Digital Humanities, also eHumanities \\
+LINDAT & Czech national infrastructure for LRT\furl{http://lindat.ufal.cuni.cz} \\
+MPI & Max Planck Institute, especially MPI for Psycholinguistics in Nijmegen, task leader of CMDI \\
+OLAC & \xne{Open Language Archive Community}\furl{http://www.language-archives.org/} \ref{def:OLAC} \\
+PID & persistent identifier \cite{CLARIN2009_PID} \\
+PURL & persistent uniform resource locator \cite{PURL1995} \\
+RDF & \xne{Resource Description Framework} \cite{RDF2004} \\
+RR & Relation Registry, cf. \ref{def:rr}   \\
+TEI & \xne{Text Encoding Initiative}, cf. \ref{tei} \\
+\end{tabular}
+\end{table}
 \section {Namespaces}
-Namespaces mentioned through this document listed:
+\begin{description}
+\item[dcif]
+\item[skos]
+\end{description}
+%\label{table:namespaces}
+\section {Abbreviations}
+%Namespaces referenced in this document, especially in \ref{sec:cmd2rdf} defining the RDF representation.
+\begin{description}
+\item[CLARIN] \textit{Common Language Resources and Technology Infrastructure} \ref{def:CLARIN}
+\item[CLAVAS] \textit{Vocabulary Alignment Service for CLARIN} \ref{def:CLAVAS}
+\item[CMD] \textit{Component Metadata} \ref{def:CMD}
+\item[CMDI] \textit{Component Metadata Infrastructure} \ref{def:CMDI}
+\item[ERIC] \textit{European Research Infrastructure  Consortium} - a legal entity for long-term research infrastructure initiatives
+\item[DARIAH] \textit{Digital Research Infrastructure for Arts and Humanities}
+\item[DC] data category
+\item[DCR] data category registry \cite{ISO12620:2009}
+\item[DH] Digital Humanities, also eHumanities
+\item[LINDAT] Czech national infrastructure for LRT\furl{http://lindat.ufal.cuni.cz}
+\item[OLAC] \textit{Open Language Archive Community}\furl{http://www.language-archives.org/}\ref{def:OLAC}
+\item[PID] persistent identifier \cite{CLARIN2009_PID}
+\item[PURL] persistent uniform resource locator \cite{PURL1995}
+\item[RDF] \textit{Resource Description Framework} \cite{RDF2004}
+\item[RR] Relation Registry\ref{def:rr}
+\item[TEI] \textit{Text Encoding Initiative}
+\end{description}
+\begin{table}[!h]
+\caption{Namespaces referenced in this document}
+  \begin{tabular}{ l  l }
+\var{Prefix name} & \var{Prefix IRI} \\
+%    \hline
+rdf: & http://www.w3.org/1999/02/22-rdf-syntax-ns\# \\
+rdfs: & http://www.w3.org/2000/01/rdf-schema\# \\
+xsd: & http://www.w3.org/2001/XMLSchema\# \\
+owl: & http://www.w3.org/2002/07/owl\# \\
+skos:   & http://www.w3.org/2004/02/skos/core\# \\
+isocat: & http://www.isocat.org/datcat/ \\
+dcr:& http://isocat.org/ns/dcr.rdf\#  \\
+cmd: & http://clarin.eu/cmd/1.0\# \\
+cmds:    & ? \\
+dce: & http://purl.org/dc/elements/1.1/ \\
+dcterms: & http://purl.org/dc/terms/ \\
+oa: & http://www.w3.org/ns/oa\# \\
+ore: & http://www.openarchives.org/ore/terms/ \\
+cr: & http://catalog.clarin.eu/ds/ComponentRegistry/rest/registry/ \\
+\end{tabular}
+\end{table}
 \section {Terms}
+\section{Formatting conventions}
 In the following, the terms used in this work are explained.
+Inline formatting for highlighting: \\
+\begin{description}
+\item[Concept]  Basic "entity" in an ontology? that of which an ontology is built
+\item[Ontology]  \quote{formal, explicit specification of a shared conceptualisation} \cite{Gruber1993}, but for us mainly a collection of concepts as opposed to lexicon, which is a collection of words.
+\item[Word]  a lexical unit, a word in a language, something that has a surface realization (writtenForm) and is a carrier of sense. so a relation holds: hasSense(Word, Concept)
+\item[Lexicon]  a collection of words, a (lexical) vocabulary
+\item[Vocabulary] an index providing mapping from Word (string) to Concept (uri)
+\item[(Data)Category] (almost) the same as Concept; Things like \concept{Topic}, \concept{Genre}, \concept{Organization}, \concept{ResourceType} are instantiations of Category
+\item[ConceptualDomain] the Class of entities a Concept/Category denotes. For Organization it would be all (existing) organizations,  CD(ResourceType)={Corpus, Lexicon, Document, Image, Video, ...}. Entities of the domain can themselves be Categories (\concept{ResourceType:Image}), but they can also be individuals
+ (\concept{Organization University of Vienna})
+        \todoin{Is it synonymous to value domain, range}
+\item[Entity]
+\item[Resource] informational resource, in the context of CLARIN-Project  mainly Language Resources (Corpus, Lexicon, Multimedia)
+\item[Metadata Description] description of some properties of a resource.  MD-Record
+\item[Schema] - CMD-Profile
+\item[Annotation]
+\end{description}
+\begin{tabular}{ l l }
+\xne{Named Entity} & an application or project name (institution names are written in plain text) \\
+\code{code} & names of xml elements and attributes; also a concrete (sample) value  \\
+\code{concept} & lexical label denoting a concept  \\
+\var{variable} & definitions  and variables
+\end{tabular}
+Lexicon vs. Ontology
+A lexicon is a linguistic object, an ontology is not \cite{Hirst2009}. We don't need to be that strict, but it shall be a guiding principle in this work to consider things (Datasets, Vocabularies, Resources) also along this dichotomy/polarity: Conceptual vs. Lexical.
+And while every Ontology has to have a lexical representation (canonically: rdfs:label, rdfs:comment, skos:*label), if we don't try to force observed objects into a binary classification, but consider a bias spectrum, we should be able to locate these along this spectrum.
+So the main focus of a typical ontology are the concepts (``conceptualization''), primarily language-independent.
+\begin{definition}{A definition in a block with caption}
+some \ formal \ expression \ equation \ or \ grammar
+\end{definition}
+\noindent
+Example blocks, simple:
+\begin{example1}
+Short piece of sample data
+\end{example1}
+Another special case are Controlled Vocabularies or Taxonomies/Classification Systems, let alone folksonomies, in that they identify terms and concepts/meanings, i.e. there is no explicit mapping between the language representation and the concept, but rather the term is an implicit carrier of the meaning/concept.
+So for example in the LCSH the surface realization of each subject-heading at the same time identifies the Concept ~.
+ontological vs. semasiological (semantic features: categorial/archisemes, differentiating, specifying)
+\noindent
+or with tabs (especially for RDF triples):
+\begin{example3}
+my:work & my:example & my:block
+\end{example3}

SMC4LRT/chapters/Design_SMCinstance.tex

-                      r3638
+                      r3665
 \chapter{Mapping on instance level, CMD as LOD}
+\chapter{Mapping on instance level,\\ CMD as LOD}
 \label{ch:design-instance}
 …
 \end{quotation}
 As described in previous chapters (\ref{ch:infrastructure},\ref{ch:design_schema}), semantic interoperability is one of the main motivations for the CMD infrastructure. However, this machinery pertains mostly to the schema level, the actual values in the fields of CMD instances reman ``just strings''. This is the case even though the problem of different labels for semantically equivalent or even identical entities is even more so virulent on the instance level. While for a number of metadata fields the value domain can be enforced through schema validation, some important fields (like \concept{organization} or \concept{resource type})  have a constrained value domain that yet cannot be explicitly exhaustively enumerated. This leads to a chronically inconsistent use of labels for referring to entities (as the instance data shows, some organizations are referred to by more than 20 different labels, or spelling variants.) prompting an urgent need for better means for harmonizing the constrained-field values.
+As described in previous chapters (\ref{ch:infra},\ref{ch:design}), semantic interoperability is one of the main motivations for the CMD infrastructure. However, the established machinery pertains mostly to the schema level; the actual values in the fields of CMD instances remain ``just strings''. This is the case even though the problem of different labels for semantically equivalent or even identical entities is even more virulent on the instance level. While for a number of metadata fields the value domain can be enforced through schema validation, some important fields (like \concept{organization} or \concept{resource type}) have a constrained value domain that nevertheless cannot be explicitly and exhaustively enumerated. This leads to a chronically inconsistent use of labels for referring to entities (as the instance data shows, some organizations are referred to by more than 20 different labels or spelling variants), prompting an urgent need for better means for harmonizing the constrained-field values.
 One potential remedy is the use of reference datasets -- controlled vocabularies, taxonomies, ontologies and such. In fact, this is a very common approach, be it the authority files in the library world, or domain-specific reference vocabularies maintained by practically every research community. Not as strict as schema definitions, they cannot be used for validation, but still help to harmonize the data by offering preferred labels and identifiers for entities.
 …
 \end{figure*}
 \subsubsection{Identify vocabularies -- CLAVAS}
+\subsubsection{Identify vocabularies}
 \todoin{Identify related ontologies, vocabularies? - see DARIAH:CV}
 …
 \label{semantic-search}
 With the new enhanced dataset, as detailed in section \ref{ch:cmd2rdf}, the groundwork is laid for the full-blown semantic search as proposed in the original goals, i.e. the possibility for ontology-driven or at least `semantic resources assisted' exploration of the dataset.
+With the new enhanced dataset, as detailed in section \ref{sec:cmd2rdf}, the groundwork is laid for the full-blown semantic search as proposed in the original goals, i.e. the possibility for ontology-driven or at least `semantic resources assisted' exploration of the dataset.
 Namely to enhance it by employing ontological resources.

SMC4LRT/chapters/Design_SMCschema.tex

-                      r3638
+                      r3665
 We start by drawing an overall view of the system, introducing its individual components and the dependencies among them.
 In the next section, the internal data model is presented and explained. In section \ref{def:cx} the design of the actual main service for serving crosswalks is described, divided into the interface specification and notes on the actual implementation. In section \ref{def:qx} we elaborate on a search functionality that builds upon the aforementioned service in terms of appropriate query language, a search engine to integrate the search in and the peculiarities of the user interface that could support this enhanced search possibilities. Finally, in section \ref{smc-browser} an advanced interactive user interface for exploring the CMD data domain is proposed.
+In the next section, the internal data model is presented and explained. In section \ref{sec:cx} the design of the actual main service for serving crosswalks is described, divided into the interface specification and notes on the actual implementation. In section \ref{sec:qx} we elaborate on a search functionality that builds upon the aforementioned service in terms of appropriate query language, a search engine to integrate the search in and the peculiarities of the user interface that could support these enhanced search possibilities. Finally, in section \ref{smc-browser} an advanced interactive user interface for exploring the CMD data domain is proposed.
 \section{System Architecture}
 …
 \begin{description}
 \item[crosswalk service] the basic service translating between fields (or indexes), detailed in \ref{def:cx}
+\item[crosswalk service] the basic service translating between fields (or indexes), detailed in \ref{def:cx-interface}
 \item[concept-based query expansion] a module for query expansion based on the crosswalks
 \item[smc-xsl] set of xslt-stylesheets (governed by a build-file) for pre- and post-processing the data
 …
 \label{datamodel-terms}
 In abstract terms, the internal format is basically a table of indexes with information collected from the upstream registries or created during preprocessing. Main entity is \code{Term} that represents either a label of a data category, or a CMD entity (a CMD  component or element). Further entities \code{Termset} and \code{Concept} are mainly used for logical grouping of the \code{Terms}. In the following, we explain the data model of these entities and their use in more detail. For a full \xne{Terms.xsd} XML schema see listing \ref{list:terms-schema}.
+In abstract terms, the internal format is basically a table of indexes with information collected from the upstream registries or created during preprocessing. Main entity is \code{Term} that represents either a label of a data category, or a CMD entity (a CMD  component or element). Further entities \code{Termset} and \code{Concept} are mainly used for logical grouping of the \code{Terms}. In the following, we explain the data model of these entities and their use in more detail. For a full \xne{Terms.xsd} XML schema see listing \ref{lst:terms-schema}.
 \subsubsection{Type \code{Term}}
 …
 %\captionsetup{justification=raggedright, singlelinecheck=false}
 \lstset{language=XML}
 \begin{lstlisting}[label=list:terms-attributes-datcat, caption=sample \code{Term} element encoding an ISOcat data category]
+\begin{lstlisting}[label=lst:terms-attributes-datcat, caption=sample \code{Term} element encoding an ISOcat data category]
 <Term concept-id="http://www.isocat.org/datcat/DC-2544" set="isocat"
         type="label" xml:lang="fr">nom de ressource</Term>
 …
 \lstset{language=XML}
 \begin{lstlisting}[label=list:terms-attributes-element, caption=sample \code{Term} element encoding a CMD element]
+\begin{lstlisting}[label=lst:terms-attributes-element, caption=sample \code{Term} element encoding a CMD element]
 <Term type="CMD_Element" name="Url" datcat="http://www.isocat.org/datcat/DC-2546"
           id="clarin.eu:cr1:c_1290431694487#Url" parent="Contact"
 …
 \lstset{language=XML}
 \begin{lstlisting}[label=list:terms-attributes-index, caption=sample \code{Term} element encoding a term in the inverted index]
+\begin{lstlisting}[label=lst:terms-attributes-index, caption=sample \code{Term} element encoding a term in the inverted index]
    <Term set="cmd" type="full-path" schema="clarin.eu:cr1:p_1357720977520"
                 id="clarin.eu:cr1:c_1359626292113#ResourceTitle"
 …
 \lstset{language=XML}
 \begin{lstlisting}[label=list:concept, caption=sample \code{Concept} element representing the data category \concept{resourceTitle}]
+\begin{lstlisting}[label=lst:concept, caption=sample \code{Concept} element representing the data category \concept{resourceTitle}]
 <Concept xmlns:dcif="http://www.isocat.org/ns/dcif" type="datcat"
                id="http://www.isocat.org/datcat/DC-2545">
 …
 \end{lstlisting}
 In the inverted index the \code{Concept} is enriched with the \code{Terms} representing corresponding CMD entities (cf. Listing \ref{list:concept-cmd-term}).
 \lstset{language=XML}
 \begin{lstlisting}[label=list:concept-cmd-term, caption=\code{Term} for CMD element added to \code{Concept}]
+In the inverted index the \code{Concept} is enriched with the \code{Terms} representing corresponding CMD entities (cf. Listing \ref{lst:concept-cmd-term}).
+\lstset{language=XML}
+\begin{lstlisting}[label=lst:concept-cmd-term, caption=\code{Term} for CMD element added to \code{Concept}]
  <Term set="cmd" type="full-path" schema="clarin.eu:cr1:p_1345561703620"
             id="clarin.eu:cr1:c_1345561703619#Name">collection.CollectionInfo.Name</Term>
 …
 \lstset{language=XML}
 \begin{lstlisting}[label=list:termset, caption=\code{Termset} element representing a CMD profile]
+\begin{lstlisting}[label=lst:termset, caption=\code{Termset} element representing a CMD profile]
 <Termset name="AnnotatedCorpusProfile" id="clarin.eu:cr1:p_1357720977520"
             type="CMD_Profile">
 …
 Consequently, the infrastructure has also foreseen this dedicated module, \emph{Semantic Mapping}, that exploits this mechanism to find \textbf{corresponding fields in different metadata schemas}.
 The task of the crosswalk service is to collect the relevant information maintained in the registries of the infrastructure and process it to generate the mappings, or \textbf{crosswalks} between fields in heterogeneous metadata schemas. These crosswalks can be used by other applications building the base for concept-based search in the heterogeneous data collection of the joint CLARIN metadata domain (cf. \ref{def:qx}).
+The task of the crosswalk service is to collect the relevant information maintained in the registries of the infrastructure and process it to generate the mappings, or \textbf{crosswalks} between fields in heterogeneous metadata schemas. These crosswalks can be used by other applications building the base for concept-based search in the heterogeneous data collection of the joint CLARIN metadata domain (cf. \ref{sec:qx}).
 The core means for semantic interoperability in CMDI are the \emph{data categories} (cf. \ref{def:DCR}), well-defined atomic concepts, that are supposed to be referenced in schemas by annotating fields to unambiguously indicate their intended semantics. Drawing upon this system, the crosswalks are not generated directly between the fields of individual schemas by some kind of matching algorithm, but rather the data categories are used as reliable bridges for translation. This results in clusters of semantically equivalent metadata fields (with data categories serving as pivotal points), instead of a collection of pair-wise links between fields.
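 The pivot idea described above can be illustrated with a minimal sketch. It is not the SMC implementation (which is realized as XSL stylesheets); the profile names, field paths and their assignment to data categories below are made-up sample data, and the function names \code{build\_clusters} and \code{crosswalk} are hypothetical. Fields are grouped by the data category they reference, and a crosswalk for a given field is simply the rest of its cluster.

```python
from collections import defaultdict

# Hypothetical sample data: (schema, field path, data category URI).
# The pairing of paths and data categories is illustrative only.
DC_TITLE = "http://www.isocat.org/datcat/DC-2545"
DC_URL = "http://www.isocat.org/datcat/DC-2546"

FIELDS = [
    ("profileA", "collection.CollectionInfo.Name", DC_TITLE),
    ("profileB", "AnnotatedCorpus.ResourceTitle", DC_TITLE),
    ("profileC", "Contact.Url", DC_URL),
    ("profileC", "Description.Title", DC_TITLE),
]


def build_clusters(fields):
    """Group fields by the data category they reference (the pivot)."""
    clusters = defaultdict(set)
    for schema, path, datcat in fields:
        clusters[datcat].add((schema, path))
    return clusters


def crosswalk(fields, schema, path):
    """Return all other fields that share the data category of the given field."""
    clusters = build_clusters(fields)
    datcat_of = {(s, p): dc for s, p, dc in fields}
    datcat = datcat_of.get((schema, path))
    if datcat is None:
        # Unknown field: no data category, hence no correspondences.
        return set()
    return clusters[datcat] - {(schema, path)}


if __name__ == "__main__":
    for match in sorted(crosswalk(FIELDS, "profileA", "collection.CollectionInfo.Name")):
        print(match)
```

 Note that no pairwise links between schemas are stored in this sketch: adding a further schema only requires its own field-to-data-category annotations, and all correspondences follow from the shared pivot.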
 …
 \subsection{Implementation}
+The core functionality  of the SMC is implemented as a set of XSL-stylesheets
 At the core of the described module is a set of XSL-stylesheets, governed by an ant-build file and a configuration file holding the information about individual source registries.
 …
 \section{qx -- concept-based search}
 \label{def:qx}
+\label{sec:qx}
 To recall, the main goal of this work is to enhance the search capabilities of the search engines serving the metadata.
 In this section we want to explore, how this shall be accomplished, i.e. how to bring the enhanced capabilities to the user.
 …
 Metadata repository is implemented in xquery running within the eXist XML-database as a web application.
+There is also an XQuery implementation that is integrated as a module of SADE/cr-xq -- an eXist-based web application framework for publishing resources, on which the Metadata Repository is running.
 …
 \begin{description}
 \item[SMC graph basic]
         the basic graph contains \var{profiles $\mapsto$ components $\mapsto$ elements $\mapsto$ datcats}
+        the basic graph contains \var{profiles $\mapsto$ components $\mapsto$ elements $\mapsto$ datcats}; processing 155 profiles yields a graph with over 4,500 nodes and over 7,500 edges
 \item[SMC graph all]
         additionally rendering the new profile-groups and relations between data categories (from Relation Registry)
 …
 Additionally, a detour pass (\var{track D}) is executed, in which the graph is also transformed into the DOT format and run through the \xne{Graphviz dot} tool to get an SVG representation of the graph. In an early stage of development, this was actually the only processing path. However, it soon became obvious that the graph was getting too large to be displayed in its entirety. Figure \ref{fig:cmd-dep-dotgraph} displays an old version of such a dot-generated graph visualization. Currently, the \xne{dot} output is only used as input for the final graph data, providing initialization coordinates for the nodes in the \code{dot}-layout.
+The graph is constructed from all profiles defined in the Component Registry and related datasets. To resolve the (multilingual) names and descriptions of data categories referenced in the CMD elements, definitions of the referenced data categories from DublinCore and ISOcat are fetched.
 \begin{figure*}
 …
 One special option is \code{graph}, that allows to switch between different graphs as listed in \ref{smc-browser-data-preprocessing}.
 There is user documentation deployed with the application and featured in the appendix \ref{sec:smc-browser-userdocs}, where a all aspects of interaction with the application (\ref{interaction}) and the options in the navigation bar (\ref{options}) are described.
+There is user documentation deployed with the application and featured in the appendix \ref{sec:smc-browser-userdocs}, where all aspects of interaction with the application (\ref{interaction}) and the options in the navigation bar (\ref{options}) are described.
 \subsection{Extensions}
+\label{smc-browser-extensions}
 Next to the basic setup described above, there is a number of possible additional features, that could enhance the functionality and usefulness of the discussed tool.

SMC4LRT/chapters/Infrastructure.tex

-                      r3638
+                      r3665
 \label{ch:infra}
+In this chapter, we present the infrastructure, in which this work is embedded. We start with a short general introduction about the large research infrastructure initiative CLARIN, followed by a close examination of its technical infrastructure for creating and publishing metadata. In section \ref{sec:cv}, we discuss the services for managing controlled vocabularies and their role in the context of metadata creation.
 \section{CLARIN}
 …
 The initiative foresees a federated network of centres providing resources and services in a harmonized, interoperable manner to the academic community in all participating countries.
 In the preparation phase of the project 2008 - 2011 over 180 institutions from 38 countries participated. In the construction phase, the action impetus moved, as projected, more to the individual national initiatives of this federated endeavour, while kept together by the common principles set up during the preparation phase and established processes and bodies ensuring the flow of information and coherent action on European level.
+In the preparation phase of the project 2008 - 2011 over 180 institutions from 38 countries participated. In the construction phase, the action impetus moved, as projected, more to the individual national initiatives of this federated endeavour, while kept together by the common principles set up during the preparation phase and established processes and administrative decision bodies ensuring the flow of information and coherent action on European level.
 Since 2013, CLARIN has also been a \emph{European Research Infrastructure Consortium} (ERIC), a new type of legal entity established within the EU, especially designed to give research infrastructure initiatives a more stable status and better means to act independently. This is an important step towards ensuring the continuity of the endeavour, a chronic problem of (international) projects.
+\section{Component Metadata Infrastructure -  CMDI}
+\section{Component Metadata Infrastructure -- CMDI}
 \label{def:CMDI}
 One core pillar of CLARIN is the \emph{Component Metadata Infrastructure} (CMDI)\furl{http://www.clarin.eu/cmdi} -- a distributed system consisting of multiple interconnected modules aimed at creating and providing metadata for LRT in a coherent harmonized way. The conceptual foundation of CMDI is the \emph{Component Metadata Framework}\cite{Broeder+2010}, a flexible meta model that supports the creation of metadata schemas and can also accommodate existing schemas (cf. \ref{def:CMD}).
 The SMC is part of CMDI and depends on multiple modules on the production side of the infrastructure. Before we describe the SMC and its interaction with these modules in detail in chapter \ref{ch:design}, we introduce the latter and the type of data they provide:
+The SMC is part of CMDI and depends on multiple modules on the production side of the infrastructure. Before we describe the SMC and its interaction with these modules in detail in chapter \ref{ch:design}, we introduce the latter and the type of data they provide in \ref{cmdi-registries}:
 \begin{itemize}
 …
 \noindent
+All these components are running services, that this work shall directly build upon.
+Next to these core services, that SMC has direct dependencies to, some other services are being developed within the CMDI ecosystem that are also relevant in the context of SMC:
+All these modules are running services that this work builds upon directly.
+In contrast, SMC is meant as a provider for the modules on the exploitation side of the infrastructure, i.e. search and exploration services used by the end users. These are briefly introduced in \ref{cmdi_exploitation}.
+\begin{figure*}[ht]
+\begin{center}
+\includegraphics[width=0.8\textwidth]{images/CMDI_components_old_clean.png}
+\caption{The diagram [from early CLARIN/CMDI presentations] shows individual modules of the CMDI and their interrelations as envisaged in the initial phase of the CLARIN project}
+\label{fig:cmdi-old}
+\end{center}
+\end{figure*}
+Besides the above-mentioned services, with which SMC interacts directly, some other services and applications are part of the CMDI ecosystem; they are briefly introduced in \ref{cmdi-other} for completeness:
 \begin{itemize}
+\item Schema Registry (SCHEMAcat\furl{http://lux13.mpi.nl/schemacat/site/index.html})
+\item metadata editors
+\item Schema Registry
 \item SchemaParser
-\item Vocabulary Alignement Service (OpenSKOS)
 \end{itemize}
+On the other hand, SMC shall serve the modules on the exploitation side of the infrastructure, i.e. search services used by end users. These are briefly introduced in \ref{cmdi_exploitation}.
+\begin{figure*}[!ht]
+\includegraphics[width=0.8\textwidth]{images/CMDI_components_old.png}
+\caption{The diagram (from early CLARIN/CMDI presentations) shows individual modules of the CMDI and their interrelations}
+\end{figure*}
+Finally, the Vocabulary Alignment Service, a module playing a crucial role in metadata curation, is treated separately in section \ref{sec:cv}.
 \subsection{CMDI registries}
 The CMD framework as data model (cf. \ref{def:CMD} together with the two registries the \emph{Data Category Registry} \xne{ISOcat} and the \emph{Component Registry} build the backbone of the CMD Infrastructure. In the following we explain briefly their role and interaction.
 \begin{figure*}[!ht]
+\label{cmdi-registries}
+The CMD framework as data model (cf. \ref{def:CMD}), together with the two registries, the \emph{Data Category Registry} \xne{ISOcat} and the \emph{Component Registry}, forms the backbone of the CMD Infrastructure. Compare figure \ref{fig:cmdi-old}, which shows the rather na\"{i}ve initial vision of the system, with figure \ref{fig:SMC-linkage}, which details the actual linkage between the data in the individual registries. In the following, we briefly explain their role and interaction.
+\begin{figure*}[t]
 \includegraphics[width=1\textwidth]{images/SMC_CR-DCR-RR_Linkage_v2}
 \caption{The diagram depicts the links between pieces of data in the individual registries that serve as basis for semantic mapping}
+\label{fig:SMC-linkage}
 \end{figure*}
 \subsubsection*{Data Category Registry}
+\subsubsection*{Data Category Registry -- ISOcat}
 \label{def:DCR}
+The \emph{Data Category Registry} (DCR) is a central registry that enables the community to collectively define and maintain a set of relevant linguistic data categories. The resulting commonly agreed controlled vocabulary is the cornerstone for grounding the semantic interpretation within the CMD framework.
+The data model and the procedures of the DCR are defined by the ISO standard \cite{ISO12620:2009}, and is implemented in \xne{ISOcat}\furl{http://www.isocat.org/}.
+Next to a web interface for users to browse and manage the data categories, DCR provides a REST-style webservice allowing applications to access the information (provided in Data Category Interchange Format - DCIF). The data categories are assigned a persistent identifier, making them globally and permanently referenceable.
+The \emph{Data Category Registry} (DCR) is a central registry that enables the community to collectively define and maintain a set of relevant linguistic data categories (DC). The resulting shared controlled vocabulary is the cornerstone for grounding the semantic interpretation within the CMD framework (among others -- DCR is not specific to CMDI, it is meant to be used as common concept registry in many applications).
+The data model and the procedures of the DCR are defined by the ISO standard \cite{ISO12620:2009}.
+\xne{ISOcat}\furl{http://www.isocat.org/} is an implementation of this standard framework developed by MPI for Psycholinguistics, Nijmegen in collaboration with the ISO technical committee \xne{ISO TC 37 Terminology and Other Language and Content Resources}.
+Next to a web interface for users to browse and manage the data categories, ISOcat provides a REST-style webservice allowing applications to retrieve the data category specifications. By default, these are provided in the \xne{Data Category Interchange Format - DCIF}, the standardized XML serialization of the data model, but RDF and HTML representations are available as well.
+The core data model defining the data category specification is rather complex, consisting of an administrative, a linguistic and a description part, containing language-specific versions of definitions, value domains, examples and other attributes (cf. \ref{fig:DCR_data_model} for the diagram of the full data model). The following types of data categories are recognized (cf. figure \ref{fig:dc_type}): \var{simple, complex}: (\var{closed, open} or \var{constrained}), \var{container}. One fundamental aspect to emphasize is that the data categories are assigned a persistent identifier, making them globally and permanently referenceable.
+\begin{figure*}[!ht]
+\begin{center}
+\includegraphics[width=0.7\textwidth]{images/dc_types}
+\end{center}
+\caption{Data Category types\cite{Windhouwer2011ISOcat_intro}}
+\label{fig:dc_type}
+\end{figure*}
 \subsubsection*{Component Registry}
+\emph{Component Registry} (CR)\furl{http://catalog.clarin.eu/ds/ComponentRegistry/} implements the CMD data model and fulfills two functions. For one it as a robust web application for creating and editing new CMD components and profiles. On the other hand it is the actual registry the persistently stores and exposes published CMD profiles, allowing to browse and search in them and view their structure.
+The primary user of the CR is the metadata modeller with the task to create a dedicated metadata profile for a given resource type. She can browse and search the CR for components and profiles that are suitable or come close. The registry already contains many general components, e.g., for contact persons, language and geographical information. In general many of these can be reused as they are or have to be only slightly adapted, i.e., add or a remove some metadata elements and/or components. Also new components can be created to model the unique aspects of the resources under consideration. All components are combined into one profile. Components, elements and values should be linked to a concept to make its semantics explicit.\cite{Durco2013_MTSR}
+This approach of integrating prerequisites for semantic interoperability directly into the process of metadata creation differs
+from the traditional methods of schema matching that try to establish pairwise alignments between schemas only after they were created and published -- be it algorithm-based or by means of explicit manually defined crosswalks\cite{Shvaiko2005}.
+\label{def:CR}
+\emph{Component Registry}\furl{http://catalog.clarin.eu/ds/ComponentRegistry/} (CR) implements the CMD data model (cf. \ref{def:CMD}) and fulfills two functions. For one, it is the actual registry that persistently stores and exposes published CMD profiles via a web interface, which allows users to browse and search them and view their structure, accompanied by a REST webservice that allows client applications to retrieve the profile definitions. At the same time, the web interface serves as an editor for creating and editing new CMD components and profiles.
+The primary user of the CR is the metadata modeller with the task to create a dedicated metadata profile for a given resource type. She can browse and search the CR for components and profiles that are suitable or come close. The registry already contains many general components, e.g., for contact persons, language and geographical information. In general many of these can be reused as they are or have to be only slightly adapted, i.e., have some metadata elements and/or components  added or removed. Also new components can be created if needed to model the unique aspects of the resources under consideration.\cite{Durco2013_MTSR}
+Let us reiterate that the actual core provision for semantic interoperability is the requirement that the elements (and as far as possible also components and values) should be linked ``via a PID to exactly one data category (cf. \ref{def:DCR}), thus indicating unambiguously how the content of the field in a metadata description should be interpreted''\cite{Broeder+2010}, or \emph{to make its semantics explicit}.
+As dictated by the CMD model, all components needed for the modelled resource description are compiled into one profile.
+Once a profile is finished, the Component Registry automatically provides the corresponding XML schema in the \code{cmd} target namespace \code{http://www.clarin.eu/cmd}, which can be used as a basis for creating and validating metadata records.
 \subsubsection*{Ontological Relations -- Relation Registry}
 …
 The framework as described so far provides a sound mechanism for binding the semantic interpretation of the metadata descriptions.
 However there needs to be an additional means to capture information about relations between data categories.
 This information was deliberately not included in the DCR, because relations often depend on the context in which they are used, making global agreement unfeasible. CMDI proposes a separate module -- the \emph{Relation Registry}\label{def:rr} (RR) \cite{Kemps-Snijders+2008} --, where arbitrary relations between data categories can be stored and maintained. This design grounds on the expectation that the RR should be under control of the metadata user whereas the DCR is under control of the metadata modeller.
 These relations don't need to pass a standardization process, but rather separate research teams may define their own sets of relations according to the specific needs of the project. That is not to say that every researcher has to create her own set of relations -- some basic recommended sets will be defined right from the start. But new -- even contradictory -- ones can be created when needed.
 There is a prototypical implementation of such a relation registry called \emph{RELcat} being developed at MPI, Nijmegen\cite{Windhouwer2011,SchuurmanWindhouwer2011}, that already hosts a few relation sets. There is no user interface to it yet, but it is accessible as a REST-webservice\footnote{sample relation set: \url{http://lux13.mpi.nl/relcat/rest/set/cmdi}}.
 This implementation stores the individual relations as RDF-triples
+This information was deliberately not included in the DCR, because relations often depend on the context in which they are used, making global agreement unfeasible. CMDI proposes a separate module -- the \emph{Relation Registry}\label{def:rr} (RR) \cite{Kemps-Snijders+2008} --, where arbitrary relations between data categories can be stored and maintained. This design decision is based on the assumption that the relations should be under the control of the metadata user, whereas the data categories are under the control of the metadata modeller.
+The relations don't need to pass a standardization process, but rather separate research teams may define their own sets of relations according to the specific needs of the project. That is not to say that every researcher has to create her own set of relations -- some basic recommended sets will be defined right from the start. But new -- even contradictory -- ones can be created when needed.
+There is a prototypical implementation of such a relation registry called \xne{RELcat} being developed at MPI, Nijmegen\cite{Windhouwer2011,SchuurmanWindhouwer2011}, that already hosts a few relation sets. There is no user interface to it yet, but it is accessible as a REST-webservice\footnote{sample relation set: \url{http://lux13.mpi.nl/relcat/rest/set/cmdi}}.
+This implementation stores the individual relations as RDF triples
 \begin{example3}
+<subjectDatcat, & relationPredicate, & objectDatcat>
+subjectDatcat & relationPredicate & objectDatcat
 \end{example3}
+allowing typed relations, like equivalency (\code{rel:sameAs}) and subsumption (\code{rel:subClassOf}). The relations are grouped into relation sets that can be used independently.
+\todoin{check DCR-RR/Odijk2010 -follow up ?; Cf. Erhard Hinrichs 2009 }
+allowing typed relations, like equivalency (\code{rel:sameAs}) and subsumption (\code{rel:subClassOf}). The relations are grouped into relation sets that can be used independently. The relations are deliberately defined in a separate namespace, instead of reusing existing ones (\code{skos:exactMatch, owl:sameAs}) with the aim to avoid introducing too specific semantics. These relations can be mapped to appropriate other predicates when integrating the relation sets in concrete applications.
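The mapping of the registry-specific relation predicates to other predicates in a concrete application can be illustrated with a minimal sketch. It is not part of RELcat; the identifiers of the data categories are hypothetical, and the choice of target predicates (in particular \code{skos:broadMatch} for subsumption) is merely an assumption made for this illustration.

```python
# Illustrative sketch only: the target predicates are an assumption of this
# example, the relation registry itself does not prescribe them.
REL_TO_TARGET = {
    "rel:sameAs": "skos:exactMatch",
    "rel:subClassOf": "skos:broadMatch",
}


def map_relations(triples, mapping=REL_TO_TARGET):
    """Rewrite the predicate of each (subject, predicate, object) triple.

    Triples whose predicate is not covered by the mapping are kept unchanged.
    """
    return [(s, mapping.get(p, p), o) for s, p, o in triples]


# A hypothetical relation set with invented data category identifiers.
relation_set = [
    ("ex:DC-A", "rel:sameAs", "ex:DC-B"),
    ("ex:DC-C", "rel:subClassOf", "ex:DC-A"),
]

for triple in map_relations(relation_set):
    print(triple)
```

Keeping the mapping in a separate table means that an application can choose weaker or stronger target semantics per relation set without touching the stored relations.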
+\subsection{Further parts of the infrastructure}
+\label{cmdi-other}
 \subsubsection*{Schema Registry}
 SCHEMAcat is a registry for schemata of all kinds (not just XML-based) semantically annotated with data categories.
+SCHEMAcat\furl{http://lux13.mpi.nl/schemacat/site/index.html} is a registry for schemas of all kinds (not just the CMD-based, in fact not even just XML-based) semantically annotated with data categories.
+\begin{quotation}
 RELcat and SCHEMAcat will provide the means to harvest and specify this information in the form of relationships and allow
 (search) algorithms to traverse the semantic graph thus made explicit\cite{Schuurman2011_SCHEMAcat}.
+\subsection{Vocabulary Service / Reference Data Registry}
+\subsubsection{Motivation \& related activities in the community}
+The urgent need for reliable community-shared registry services for concepts, controlled vocabularies and reference data for both the LRT and Digital Humanities community has been discussed on many occasions in various contexts. Applications and tasks requiring or profiting from this kind of service comprise Data-Enrichment / Annotation, Metadata Generation, Curation, Data Analysis, etc. As there is a substantial overlap in the vocabularies relevant for the various communities and even more so a high potential for reusability on the technical level, there is a strong case for tight cooperation between different initiatives.
+In the context of the CLARIN initiative, one activity to tackle this issue -- mainly driven by CLARIN-NL -- is the project/taskforce \emph{CLAVAS - Vocabulary Alignment Service for CLARIN} where the plan is to reuse and enhance for CLARIN needs a SKOS-based  vocabulary repository and editor OpenSKOS\furl{http://openskos.org}, developed and run within the dutch program CATCHplus\footnote{\textit{Continuous Access To Cultural Heritage} - \url{http://www.catchplus.nl/en/}}. See below for a more detailed description of this system. As of spring 2013, the Standing Committee on CLARIN Technical Centres (SCCTC) adopted the issue of Controlled Vocabularies and Concept Registries as one of the infrastructural (A-centre) services to be dealt with.
+\begin{note}
+In parallel, within the sister ESFRI project DARIAH a taskforce with the same goal has been set up : \emph{Service for Reference Data and Controlled Vocabularies}. This taskforce was introduced at the 2nd VCC Meeting in Vienna in November 2012. It is conceived as a collaborative endeavor between VCC1/Task 5: Data federation and interoperability and VCC3/Task3: Reference Data Registries (and external partners). The main goal is to \emph{establish a service providing controlled vocabularies and reference data} for the DARIAH (and CLARIN) community.
+Thus there is a momentum and a high potential for a collaborative approach in at least these two big initiatives CLARIN and DARIAH, that serve a very wide-spread and diverse community.
+\end{note}
+\subsubsection{Abstract service description}
+As to the service itself it is primarily meant to serve other applications, rather than being used directly by end users, but a basic user interface is still necessary for administration etc.  By using global semantic identifiers instead of strings, such a service enables the harmonization of metadata descriptions and annotations and is an indispensable step towards semantic data and \xne{LOD}.
+\end{quotation}
+\subsubsection*{Schema Parser}
+Schema Parser is a service developed at the Meertens Institute, Amsterdam, that processes XML Schemas to generate all possible paths in the instance data. It is used primarily as an auxiliary service to the search engine developed at the same institute, presented in the following subsection.
+\subsubsection*{Metadata editors}
+\label{md-editors}
+Metadata creation, i.e. the authoring of actual metadata records, is indisputably the fundamental task in the whole system.
+Though not directly interacting with SMC, metadata editors need to be mentioned, i.e. the tools that human metadata editors use for authoring metadata.
+Given that the Component Registry generates an XML schema for every profile, basically any generic XML editor with schema validation can be used (e.g. the wide-spread \xne{oXygen}). However, there have been efforts within the CLARIN community to develop dedicated tools, tailor-made for the creation of CMD records.
+Two examples are the stand-alone application \xne{Arbil}\cite{withers2012arbil}\furl{http://tla.mpi.nl/tools/tla-tools/arbil/}, being developed at the Max Planck Institute for Psycholinguistics, Nijmegen, and the web-based application developed within the project \xne{NaLiDa}\cite{dima2012mdeditor}\furl{http://www.sfs.uni-tuebingen.de/nalida/en/} at the Seminar f\"{u}r Sprachwissenschaft, University of T\"{u}bingen.
+\subsection{CMDI - Exploitation side}
+\label{cmdi_exploitation}
+Metadata complying with the CMD data model is being created by a growing number of institutions by various means -- automatic transformation from legacy data or authoring of new metadata records with the help of one of the metadata editors (cf. \ref{md-editors}). The CMD infrastructure requires the content providers to publish their metadata via the OAI-PMH protocol and announce the OAI-PMH endpoints. These are harvested daily by a dedicated CLARIN harvester\furl{http://catalog.clarin.eu/oai-harvester/}. The harvested data is validated against the corresponding schemas (every profile implies a separate schema). In the future, a subsequent normalization step will play a bigger role; currently, only minimal ad-hoc label normalization is performed for a few organization names. Finally, the data is made (publicly) available as compressed archive files. These are fetched by the exploitation side applications, which ingest the metadata records, index them and make them available for searching and browsing (cf. figure \ref{fig:cmd-ingestion}).
+\begin{figure*}[!ht]
+\begin{center}
+\includegraphics[width=0.8\textwidth]{images/CMDingestion_woVAS}
+\caption{Within CMDI, metadata is harvested from content providers via OAI-PMH and made available to consumers/users by search applications}
+\label{fig:cmd-ingestion}
+\end{center}
+\end{figure*}
+The first stable and publicly available application providing access to the collected metadata of CMDI has been the \xne{VLO - Virtual Language Observatory}\furl{http://www.clarin.eu/vlo/}\cite{VanUytvanck2010}, developed by the Technical Group at the MPI for Psycholinguistics, Nijmegen, based on the wide-spread full-text search engine \xne{Apache Solr}\furl{http://lucene.apache.org/solr/}.
+The application employs a faceted search with 10 fixed facets (figure \ref{fig:vlo}).
+As the processed metadata records are instances of different CMD profiles and thus have very differing structures, to map the fields in the records onto the facets the application relies on the data category references in the underlying schemas, effectively making use of this basic layer of semantic  interoperability provided by the infrastructure.
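The principle of mapping differently structured records onto a fixed set of facets via data category references can be sketched as follows. This is only an illustration of the idea, not the actual VLO implementation; the facet names, data category identifiers and element paths are invented for this example.

```python
# Hypothetical configuration: each facet is defined by a set of
# data category identifiers (invented for this illustration).
FACET_DATCATS = {
    "language": {"ex:DC-language"},
    "country": {"ex:DC-country"},
}


def fields_for_facets(schema_fields, facet_datcats=FACET_DATCATS):
    """Group element paths of a schema by facet.

    schema_fields is a list of (path, datcat) pairs, where datcat is the
    data category referenced by the element, or None if there is none.
    """
    result = {facet: [] for facet in facet_datcats}
    for path, datcat in schema_fields:
        for facet, datcats in facet_datcats.items():
            if datcat in datcats:
                result[facet].append(path)
    return result


# Two hypothetical profiles with different structures but shared data categories.
profile_a = [("/CMD/Components/Text/Language", "ex:DC-language"),
             ("/CMD/Components/Text/Title", None)]
profile_b = [("/CMD/Components/Corpus/Content/Lang", "ex:DC-language"),
             ("/CMD/Components/Corpus/Location/Country", "ex:DC-country")]

print(fields_for_facets(profile_a + profile_b))
```

Elements in both profiles end up in the same facet although their paths differ, because only the referenced data category is compared.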
+\begin{figure*}[ht]
+\begin{center}
+\includegraphics[width=0.8\textwidth]{images/screen_VLO_overview.png}
+\caption{screenshot of the faceted browser of the VLO}
+\label{fig:vlo}
+\end{center}
+\end{figure*}
+More recently, the team at the Meertens Institute developed a similar application, the \texttt{MI Search Engine}\furl{http://www.meertens.knaw.nl/cmdi/search/}. It too is based on Apache Solr and provides a faceted search, but with a substantially more sophisticated indexing process and search interface \cite{zhang2012cmdi}. Instead of reducing the data into a fixed number of indexes or facets, the application employs the aforementioned \xne{Schema Parser} to dynamically generate an index configuration that covers all data, again relying on the data categories to merge information from semantically equivalent metadata fields in the different schemas into a common index.
+The application also offers some innovative solutions on the user interface, like search by similarity, content-first search or specialized contextual widgets visualizing the time dimension, the geographic information and other derived data.
+% \todoin { describe indexing and search}
+And finally, there is the \xne{Metadata Repository}, being developed by the author as an XQuery application in the XML database \xne{eXist}, originally (in the initial blueprints of the infrastructure) foreseen as the main storage of the collected metadata, with the \xne{Metadata Service} on top providing search access to the data, optionally applying \xne{Semantic Mapping} to expand user queries (cf. figure \ref{fig:cmdi-old})\cite{Durco2011}.
+However, the application has not yet reached production quality and is used rather as an experimental testbed by the author. Meanwhile, the functionality of the Metadata Service has been integrated directly into the Metadata Repository together with the auxiliary use of Semantic Mapping, making it the implementation of the semantic search module as proposed in this work (cf. \ref{sec:qx}).
+%%%%%%%%%%%%%%%%%%%%
+\section{Vocabulary Service / Reference Data Registries}
+\label{sec:cv}
+\subsection{Motivation \& broader context}
+The provisions for data harmonization and semantic interoperability as presented until now pertain mostly to the schema level. However, the problem of incoherent labeling and nomenclature is even more virulent in the actual metadata fields on the instance level. While for a number of fields the value domain can be enforced through schema validation, many fields (e.g. \concept{organization} or \concept{resource type})  have a constrained value domain that nevertheless cannot be explicitly and exhaustively enumerated. This leads to a chronically inconsistent use of labels for referring to entities (as the instance data shows, some organizations are referred to by more than 20 different labels or spelling variants), prompting an urgent need for better means of harmonizing the constrained-field values.
+This issue is to be seen in a broader context of a general need for reliable community-shared registry services for concepts, controlled vocabularies and reference data in both the LRT and Digital Humanities community, applicable in a range of applications and tasks like data enrichment and annotation, metadata generation and curation, data analysis, etc.
+Moreover, by using global semantic identifiers instead of strings, such a service enables the harmonization of metadata descriptions and annotations and is an indispensable step towards transformation of this data into \emph{Linked Open Data}.
+Consequently, activities with regard to controlled vocabularies are ongoing not only in CLARIN, but also within the sister ESFRI project DARIAH. As there is a substantial overlap in the vocabularies relevant for the various communities and even more so a high potential for reusability on the technical level, there is a strong case for tight synergic cooperation between individual initiatives.
+It also has to be kept in mind that a great deal of work on controlled vocabularies has already been done and a large body of data is present in individual specialized communities (taxonomies) as well as -- with more general scope -- in the libraries world (authority files).
+\begin{comment}
 Besides providing vocabularies, the service should also hold and expose equivalences (and other relationships) between concepts from different vocabularies (concept schemes). These relationships come primarily from existing mappings, but can (and hopefully will) be subsequently generated (manually) for specific subsets on demand in a community process. An example for equivalences from Wikipedia\footnote{\href{http://de.wikipedia.org/wiki/Johann_Wolfgang_von_Goethe}{page for J. W. Goethe}}:
 \begin{verbatim}
 …
 NDL: 00441109 | VIAF: 24602065
 \end{verbatim}
+\subsubsection{Vocabulary Service - CLAVAS}
+\end{comment}
+\subsection{Implementation -- OpenSKOS/CLAVAS}
 \label{def:CLAVAS}
+As described in previous section (\ref{def:DCR}), a solid pillar for defining and maintaining data categories is the ISOcat data category registry. However, while ISOcat has been in productive use for some time, it is -- by design -- not usable for all kinds of reference data. In general, it suits well for defining concepts/data categories (with closed or open concept domains), but its complex data model and standardization workflow does not lend itself well to maintain ``semi-closed'' concept domains, controlled vocabularies, like lists of entities (e.g. organizations or authors). In such cases, the concept domain is not closed (new entities need to be added), but it is also not open (not any string is a valid entity). Besides, the domain may be very large (millions of entities) and has to be presumed changing (especially new entities being added).
+This shortcoming leads to a need for an additional registry/repository service for this kind of data (controlled vocabularies). Within the CLARIN project mainly the abovementioned taskforce \emph{CLAVAS} is concerned with this challenge.
+The foundation is the vocabulary repository and editor OpenSKOS\furl{http://openskos.org}.
+This repository can serve as a project independent manager and provider of controlled vocabularies.
+One important feature of the OpenSKOS system is its distributed nature. It allows individual instances to synchronize the maintained vocabularies among each other via OAI-PMH protocol. This caters for a reliable redundant system, as multiple instances would provide identical synchronized data, while the primary responsibility for individual vocabularies could lie with different instances/organizations based on their specialization, field of expertise.
+Currently, the Meertens Institute\furl{http://meertens.knaw.nl/} of the Dutch Royal Academy of Sciences (KNAW), as well as Netherlands Institute for Sound and Vision\furl{http://www.beeldengeluid.nl/} are running an instance of OpenSKOS.
+As the work on this vocabulary repository started in the context of a cultural heritage program, originally it served vocabularies not directly relevant for the LRT-community \emph{GTAA - Gemeenschappelijke Thesaurus Audiovisuele Archieven} or \emph{AAT - Art \& Architecture Thesaurus}\furl{http://openskos.org/api/collections}. As part of the process of adaptation to the needs of CLARIN and LRT-community data categories from \xne{ISOcat} have been converted into SKOS-format and ingested into the system.
+\xne{Austrian Centre for Digital Humanities} is also running a prototypical instance of the OpenSKOS system with ISOcat data.
+A plan has been developed/adopted to support further vocabularies relevant for the community.
+The following are those to be handled in the short term, in order of urgency/relevance/priority:
+In the context of CLARIN (primarily CLARIN-NL), a concrete initiative has been conducted -- \xne{Vocabulary Alignment Service for CLARIN} or CLAVAS -- with the objective to reuse and enhance for CLARIN needs a SKOS-based vocabulary repository and editor \xne{OpenSKOS}\furl{http://openskos.org}, developed and run within the Dutch program \xne{CATCHplus}\footnote{\textit{Continuous Access To Cultural Heritage} - \url{http://www.catchplus.nl/en/}}.
+%As of spring 2013, the Standing Committee on CLARIN Technical Centres (SCCTC) adopted the issue of Controlled Vocabularies and Concept Registries as one of the infrastructural (A-centre) services to be dealt with.
+The basic idea of this repository is to serve as a project independent manager and provider of controlled vocabularies, as an exchange platform for data in SKOS format.
+One important feature of the \xne{OpenSKOS} system is its distributed architecture. Multiple instances can be set up, that can synchronize the maintained vocabularies among each other via OAI-PMH protocol. This caters for a reliable redundant system, in which multiple instances provide identical synchronized data, with organizations behind individual instances assuming the primary responsibility for individual vocabularies based on their specialization or field of expertise.
+Currently, the Meertens Institute\furl{http://meertens.knaw.nl/} of the Dutch Royal Academy of Sciences (KNAW), the Netherlands Institute for Sound and Vision\furl{http://www.beeldengeluid.nl/}, as well as the Austrian Centre for Digital Humanities at the Austrian Academy of Sciences are each running an instance of the OpenSKOS system.
+As the work on this vocabulary repository started in the context of a cultural heritage program, it originally served vocabularies not directly relevant for the LRT community, such as \concept{GTAA - Gemeenschappelijke Thesaurus Audiovisuele Archieven} or \concept{AAT - Art \& Architecture Thesaurus}\furl{http://openskos.org/api/collections}.  Within CLAVAS, a number of vocabularies relevant for the CLARIN and LRT community were identified that will be gradually integrated into the vocabulary repository. (See \ref{refdata} for a more complete list of required reference data together with candidate existing vocabularies.) The following vocabularies have already been integrated into the \xne{CLAVAS} instance of OpenSKOS:
 \begin{itemize}
 \item the list of language codes\cite{ISO639}
-\item country codes
 \item organization names for the domain of language resources
+\item a number of data categories from ISOcat (see \ref{sec:export-dcr} for details of the process)
 \end{itemize}
+See \ref{refdata} for a more complete list of required reference data together with candidate existing vocabularies
+and \ref{interaction-dcr-skos} for discussion on mapping the information about data categories from \xne{ISOcat} to \xne{SKOS}.
+\subsection{Interaction between DCR, VAS and client applications}
+\label{interaction-dcr-skos}
+DCR recognizes following types of data categories (Figure \ref{fig:dc_type}):
+\code{simple, complex: closed, open, constrained, (container)?}
+\begin{figure*}[!ht]
+\begin{center}
+\includegraphics[width=0.7\textwidth]{images/dc_types}
+\end{center}
+\caption{Data Category types}
+\label{fig:dc_type}
+\end{figure*}
+\todocite{DC types - ISOcat introduction at CLARIN-NL Workshop}
+See \ref{fig:DCR_data_model} for full DCR data model.
+\subsubsection{Export DCR to SKOS}
+\cite{Menzo2013mail}
+\subsection{Export DCR to SKOS}
+\label{sec:export-dcr}
+Based on the premise that the data in DCR also represents a kind of controlled vocabulary, there is an effort to export data categories in SKOS format and import them into the Vocabulary Service.
+Note that there are two interaction paths between ISOcat and the Vocabulary Service. The first, importing certain data categories from ISOcat into the Vocabulary Service, is described in this section. The second (described in the next section \ref{interaction-dcr-skos}) is that the value domains of certain data categories are defined by reference to a vocabulary maintained in the Vocabulary Service.
 The fact that data categories are basically definitions of concepts may tempt one into
+a na"ive approach to mapping DCR to SKOS, namely mapping every data category to a \code{skos:Concept}
+all of them belonging to the \xne{ISOcat:ConceptScheme}.
+However this is not practical/useful, ISOcat as whole is too disparate, and so would be the resulting vocabulary.
+A more sensible approach is to export only closed DCs as separate \code{skos:ConceptSchemes} and their respective simple DCs as \code{Concepts} within that scheme.
+a na\"{i}ve approach to mapping DCR data to SKOS, namely mapping every data category to a \code{skos:Concept},
+all of them belonging to the \code{ISOcat:ConceptScheme}. However, the data in ISOcat as a whole is too disparate in scope for such a vocabulary to be useful.
+A more sensible approach is to export only closed DCs (with explicitly defined value domain, cf. \ref{def:DCR}) as separate \code{skos:ConceptSchemes} and their respective simple DCs as \code{skos:Concepts} within that scheme.
 \begin{quotation}
 …
 field/element/attribute, complex DCs in ISOcat are the users of such
 vocabularies and simple DCs the DCR equivalence of values in such a
+vocabulary.
+\end{quotation}\cite{Menzo2013mail}
+Another aspect is that a simple DC can be in the value domains of multiple closed DCs.
+Also, a \code{skos:Concept} can belong to multiple \code{skos:ConceptSchemes}\furl{http://www.w3.org/TR/skos-primer/\#secscheme}.
+So there could be a 1:1 mapping of [complex closed DCs] to [skos:ConceptSchemes] and of [simple DCs] to [skos:Concepts].
+That would automatically also convey the possible multiple membership of simple DCs / skos:Concepts in closed DCs / skos:ConceptSchemes.
+Alternatively, for each value domain a SKOS concept scheme with SKOS concepts can be created,
+i.e., a SKOS concept always belongs to one concept scheme, but multiple SKOS concepts refer to the same simple DC using <dcr:datcat/> (and <dcterms:source/>).
+This is how the export for CLAVAS currently works.\furl{http://www.isocat.org/rest/profile/5.clavas}\furl{https://trac.clarin.eu/browser/cats/ISOcat/trunk/mod-ISOcat-interface-rest/representations/dcs2/clavas.xsl}
+\begin{figure*}[!ht]
+\begin{center}
+\includegraphics[width=0.6\textwidth]{images/export_DCR2SKOS.png}
+\end{center}
+\caption{The data flow and linking between schema, data categories and vocabularies}
+\label{fig:export_dcr2skos}
+\end{figure*}
+Open or constrained DCs are not exported as they don't provide anything to a vocabulary.
+There is no need to express the relationship between this constrained DC
+and the vocabulary in CLAVAS itself.
+Indeed it is not possible to express the conceptualDomain/range of a data category within SKOS.
+However, they can refer to a CLAVAS vocabulary. Indeed, providing vocabularies for constrained but large and complex conceptual domains is the main motivation for the vocabulary repository.
+However it needs to be yet assessed how useful this approach is. In the metadata profile
+there are many closed DCs with small value domains. How useful are those
+in CLAVAS?
+Originally, the vocabulary repository has been conceived to manage rather large and complex value domains, that do not fit easily in the DCR data-model.
+Where the value domains are big (ISO 639-3) or can only be
+partially enumerated (organization names) ISOcat can't/shouldn't contain
+the value domains but just refer to CLAVAS, i.e., ISOcat wouldn't be a
+provider.
+vocabulary.\cite{Menzo2013mail}
+\end{quotation}
+\begin{comment}
 Still there are some closed DCs which might be good vocabulary
 providers, e.g., /linguistic subject/ (DC-2527/), and still also need to
 …
 then 20, 50 or 100 values are exported.
+\subsubsection{Vocabulary linking and use}
+Currently (before integration of VAS and DCR), the only possibility to constrain the value domain of a data category
+is by the means a XML Schema provides \todoin{check xml schema possibilities to restrict values}, like a regular expression. So for the data category \concept{languageID DC-2482}
+the rule looks like:
+However it needs to be yet assessed how useful this approach is. In the metadata profile
+there are many closed DCs with small value domains. How useful are those
+in CLAVAS?
+\end{comment}
+\begin{figure*}
+\begin{center}
+\includegraphics[width=0.6\textwidth]{images/export_DCR2SKOS.png}
+\end{center}
+\caption{The wrong and correct variant of exporting ISOcat data categories in SKOS format to the Vocabulary Service}
+\label{fig:export_dcr2skos}
+\end{figure*}
+Another aspect is that a simple DC can be in the value domains of multiple closed DCs.
+Also, a \code{skos:Concept} can belong to multiple \code{skos:ConceptSchemes}\furl{http://www.w3.org/TR/skos-primer/\#secscheme}.
+So there could be a 1:1 mapping of [complex closed DCs] to [skos:ConceptSchemes] and of [simple DCs] to [skos:Concepts].
+That would automatically also convey the possible multiple membership of simple DCs / skos:Concepts in closed DCs / skos:ConceptSchemes.
+Alternatively, for each value domain a SKOS concept scheme with SKOS concepts can be created,
+i.e., a SKOS concept always belongs to one concept scheme, but multiple SKOS concepts refer to the same simple DC using \code{<dcr:datcat/>} (and \code{<dcterms:source/>}).
+This is how the export for CLAVAS currently works.\furl{http://www.isocat.org/rest/profile/5.clavas}\furl{https://trac.clarin.eu/browser/cats/ISOcat/trunk/mod-ISOcat-interface-rest/representations/dcs2/clavas.xsl}
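+The export variant described above (one concept scheme per value domain, with concepts pointing back to the simple DC) can be sketched as follows. This is only an illustrative fragment and not actual output of the CLAVAS export: all \code{example.org} URIs, the labels and the \code{DC-0000} identifier are placeholders, and both the \code{dcr} namespace URI and the form in which \code{dcr:datcat} and \code{dcterms:source} carry the reference are assumptions.
+\begin{lstlisting}[language=XML]
+<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
+         xmlns:skos="http://www.w3.org/2004/02/skos/core#"
+         xmlns:dcterms="http://purl.org/dc/terms/"
+         xmlns:dcr="http://www.isocat.org/ns/dcr"> <!-- assumed namespace URI -->
+  <!-- one concept scheme per value domain of a closed DC (placeholder URIs) -->
+  <skos:ConceptScheme rdf:about="http://example.org/clavas/closedDC-A"/>
+  <skos:ConceptScheme rdf:about="http://example.org/clavas/closedDC-B"/>
+  <!-- each concept belongs to exactly one scheme ... -->
+  <skos:Concept rdf:about="http://example.org/clavas/closedDC-A/value-1">
+    <skos:inScheme rdf:resource="http://example.org/clavas/closedDC-A"/>
+    <skos:prefLabel xml:lang="en">value 1</skos:prefLabel>
+    <dcr:datcat rdf:resource="http://example.org/isocat/DC-0000"/>
+    <dcterms:source rdf:resource="http://example.org/isocat/DC-0000"/>
+  </skos:Concept>
+  <!-- ... but a second concept in another scheme refers to the same simple DC -->
+  <skos:Concept rdf:about="http://example.org/clavas/closedDC-B/value-1">
+    <skos:inScheme rdf:resource="http://example.org/clavas/closedDC-B"/>
+    <skos:prefLabel xml:lang="en">value 1</skos:prefLabel>
+    <dcr:datcat rdf:resource="http://example.org/isocat/DC-0000"/>
+    <dcterms:source rdf:resource="http://example.org/isocat/DC-0000"/>
+  </skos:Concept>
+</rdf:RDF>
+\end{lstlisting}
+In this sketch the multiple membership of a simple DC is not expressed by one concept sitting in two schemes, but stays recoverable through the shared reference to the data category.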
+\subsection{Linking to vocabularies in data categories and schemas -- interaction between ISOcat, CLAVAS and client applications}
+\label{interaction-dcr-skos}
+In the following, we elaborate on the possible ways to model references to vocabularies in data category specification and to
+convey that information to the client application. As of the writing, this is work in progress with some design decision yet to be made.\footnote{Large parts of this subsection come from email correspondence with M. Windhouwer in spring 2013.\cite{Menzo2013mail}}
+Providing vocabularies for constrained but large and complex conceptual domains is the main motivation for the vocabulary repository:
+\begin{quotation}
+Originally, the vocabulary repository has been conceived to manage rather large and complex value domains, that do not fit easily in the DCR data model. Where the value domains are big (ISO 639-3) or can only be
+partially enumerated (organization names) ISOcat can't/shouldn't contain
+the value domains but just refer to CLAVAS, i.e., ISOcat wouldn't be a
+provider.\cite{Menzo2013mail}
+\end{quotation}
+Currently, the only way to constrain the value domain of a data category
+is by the means an XML Schema provides, such as enumeration or regular expression. So for the data category \concept{languageID\#DC-2482} the rule looks like:
 \lstset{language=XML}
 \begin{lstlisting}
 …
 \end{lstlisting}
 A current proposal by Windhouwer\cite{Menzo2013mail} for integration with CLAVAS foresees following extension:
+A proposal by Windhouwer\cite{Menzo2013mail} for integration with CLAVAS foresees the following extension:
 \begin{lstlisting}
 …
 \end{lstlisting}
+\begin{quotation}
 \code{@href} points to the vocabulary. Actually a PID should be used in the context
 of ISOcat, but it is not clear how persistent are the vocabularies. This may pose a problem as part of DC specification may now have a different persistency then the core.
 …
 \code{@type} could be \code{closed} or \code{open}. \code{closed}: only values in the vocabulary are
 valid. \code{open}: the values in the vocabulary are hints/preferred values. Basically the DC itself is then open.
+This would yield a definition of the conceptualDomain for the data category as follows:
+\end{quotation}
+This yields a definition of the value domain for the data category, where the new rule pointing to the vocabulary is \emph{added} (cf. listing \ref{lst:dcif-conceptualDomain}), so that -- once the information from the DC specification gets into the schema -- tools that don't support vocabulary lookup but are capable of XSD/RNG validation, can still use the regular expression based definition.
 \lstset{language=XML}
 \begin{lstlisting}
+\begin{lstlisting}[label=lst:dcif-conceptualDomain, caption=definition of conceptualDomain for the data category \concept{languageID} employing the proposed extension for pointing to a vocabulary]
   <dcif:conceptualDomain type="constrained">
      <dcif:dataType>string</dcif:dataType>
 …
 \end{lstlisting}
+I.e. the new rule pointing to the vocabulary would be \emph{added}, so that tools that don't support CLAVAS lookup but are capable of XSD/RNG validation, can still use the regular expression based definition.
+\begin{note}
+Integrate:
+ISOcat refers to CLAVAS as a hint, the metadata schema is the final one that has the real CLAVAS vocabulary reference, i.e., no reference to CLAVAS via ISOcat.
+\end{note}
+Note though, that anything stated in the DC specification is not binding,
+but rather a generic hint or recommendation, \todoin{check: it is not ``normative''}.
+(Even if the DC is closed.) The authoritative/normative information is in the schema.
+A schema modeler, (concept)linking an element in the schema
+to a DC can decide to have another restriction for the values allowed
+in that element. The information from DCR serves as recommendation or default.
+\begin{figure*}[!ht]
+\begin{figure*}[ht]
 \begin{center}
 \includegraphics[width=0.7\textwidth]{images/concept_linking.png}
 \end{center}
 \caption{The data flow and linking between schema, data categories and vocabularies}
+\caption{The linking between schemas, data categories and vocabularies}
 \label{fig:concept_linking}
 \end{figure*}
+\paragraph {Modelling the vocabulary reference in the schema}
+It needs to be yet defined how the information about the vocabulary can be translated into a valid schema representation.
+One brute-force approach would be to explicitly enumerate all the values from the vocabulary. This is currently being done
+within the CMD-framework with the language-codes\furl{http://catalog.clarin.eu/ds/ComponentRegistry/?item=clarin.eu:cr1:c_1271859438110}. However, there is clearly a limit to this approach both in terms of size of the vocabulary (ISO-639 contains 7,679 items (language codes), adding some 2MB to each schema referencing it) and its stability/change rate -- ISO-639 is a standard with a fixed list, however most other vocabularies are more volatile (think organization). And even this supposedly fixed list undergoes regular changes -- it is being updated semi-annually, with entries being added, deleted, merged and split.\furl{http://www-01.sil.org/iso639-3/changes.asp}
+Most of these vocabularies also cannot be seen as closed-constrained, i.e. the list provides a recommended orthographic variant for a given entity, while still allowing other values for the given field rather than restricting the values to only the items from the vocabulary (think organizations).
+So this has to be solved in ``soft'' way. Most schema languages allow to annotate the schema.
+This is already used with DCR, adding the \code{@dcr:datcat} into schema elements.
+Also CMDI (ComponentRegistry when generating schemas) puts information in \code{<xs:appinfo/>}.
+Tools like Arbil can get access to these annotations, e.g., a reference to a CLAVAS vocabulary, and act upon
+it, i.e., use OpenSKOSs autocomplete API.
+Normal XSD validation then wouldn't validate if a value actually is part of the vocabulary. This
+isn't a problem if the vocabulary is open, e.g., organisation names, but
+it is when the value domain is closed, e.g., ISO 639-3. In the latter case
+the XSD generation might have two modes: a lax (smaller) version which
+It is important to emphasize that anything stated in the DC specification is not binding (even if the DC is of type \var{closed}), but rather a non-normative hint or recommendation. The authoritative source is the schema. A schema modeller binding an element in a schema to a data category can still decide to apply other restrictions to the value domain of that element than the ones suggested in the DC specification. This applies equally to the proposed vocabulary reference mechanism: the author of the data category suggests a vocabulary to be used for values of the given data category, but the metadata modeller decides if and how this vocabulary will be integrated into the modelled schema.
+There are basically two options for integrating the vocabulary into the schema.
+One approach is to explicitly enumerate all the values from the vocabulary.
+Within CMD this has been done in the component for language-codes\furl{http://catalog.clarin.eu/ds/ComponentRegistry/?item=clarin.eu:cr1:c_1271859438110}. This method allows strict validation of the given metadata field, however there is clearly a limit to this approach in terms of a) size of the vocabulary\footnote{e.g. \xne{ISO-639} contains 7,679 items (language codes), adding some 2MB to each schema referencing it}, b) completeness -- most of the vocabularies cannot be seen as closed, i.e. they represent only a partial enumeration just providing a recommended label for an entity,
+and c) stability or change rate -- even the supposedly fixed list of language-codes \xne{ISO-639-*} undergoes regular changes -- it is being updated semi-annually, with entries being added, deleted, merged and split.\furl{http://www-01.sil.org/iso639-3/changes.asp}
+The other, ``soft'' alternative is to convey the information about data category and vocabulary in the schema as annotation, either in an \code{<xs:appinfo>} element or by some attribute in a dedicated namespace. This method is already being employed in the Component Registry, indicating the data category of a generated element with the \code{@dcr:datcat} attribute.
+Once the data category and vocabulary reference end up in the specification of the CMD profile and the derived XSD, the information can finally be used by client applications (like a metadata editor)\footnote{Note though, that this is not a standard mechanism but rather a convention. The client application must implement it in order to be able to make use of it.}. The tool
+can use the reference to the data category to fetch explanations (semantic information) and translations from ISOcat and it can access the autocomplete/search interface of the Vocabulary Service to offer the user suggestions from the recommended vocabulary (cf. figure \ref{fig:concept_linking}).
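+A minimal sketch of how such an annotation could look in a generated XSD is given below. Only the \code{@dcr:datcat} attribute and the use of \code{xs:appinfo} are taken from the existing practice. The \code{vocabulary} element with its \code{@href} and \code{@type} attributes merely mirrors the names used in the proposal and is hypothetical, as are the element name, the \code{DC-0000} identifier and the \code{example.org} URI.
+\begin{lstlisting}[language=XML]
+<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
+           xmlns:dcr="http://www.isocat.org/ns/dcr">
+  <!-- dcr:datcat links the element to its data category (placeholder id) -->
+  <xs:element name="Organisation" type="xs:string"
+              dcr:datcat="http://www.isocat.org/datcat/DC-0000">
+    <xs:annotation>
+      <xs:appinfo>
+        <!-- hypothetical annotation, ignored by plain XSD validation;
+             a client application would have to know this convention -->
+        <vocabulary href="http://example.org/clavas/organisations"
+                    type="open"/>
+      </xs:appinfo>
+    </xs:annotation>
+  </xs:element>
+</xs:schema>
+\end{lstlisting}
+A validator accepts any string for this element, whereas a tool aware of the convention could read the annotation and query the referenced vocabulary for suggestions.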
+The drawback of this variant is that validation against the vocabulary is given up. This
+isn't a problem if the vocabulary is of \code{@type=open}, e.g. \concept{organisation names}, but
+it is when the value domain is closed, e.g. \concept{languageId}. In the latter case,
+the XSD generation could support both modes: a lax (smaller) version which
 doesn't contain the closed vocabulary as an enumeration and leaves it to
 the tool, and a strict version which does contain the vocabulary as an
 enumeration. Probably the latter should stay the default, but Arbil could
+enumeration. Probably the latter should stay the default, but the client application could
 request the lax version leading to smaller and quicker XSD validation
 inside the tool.
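+The two modes could be sketched roughly as follows, assuming the value domain is generated as a named simple type. The type names are hypothetical, the two enumerated codes stand for the complete list, and the pattern in the lax variant is purely illustrative and not the actual rule of the data category.
+\begin{lstlisting}[language=XML]
+<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
+  <!-- strict mode: vocabulary enumerated, checked by any XSD validator -->
+  <xs:simpleType name="LanguageIdStrict">
+    <xs:restriction base="xs:string">
+      <xs:enumeration value="deu"/>
+      <xs:enumeration value="eng"/>
+      <!-- remaining language codes omitted in this sketch -->
+    </xs:restriction>
+  </xs:simpleType>
+  <!-- lax mode: only a coarse pattern (illustrative), the membership
+       check is left to a tool that can look up the vocabulary -->
+  <xs:simpleType name="LanguageIdLax">
+    <xs:restriction base="xs:string">
+      <xs:pattern value="[a-z]{3}"/>
+    </xs:restriction>
+  </xs:simpleType>
+</xs:schema>
+\end{lstlisting}
+The lax type keeps the schema small at the cost of accepting syntactically valid but non-existent codes.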
+With this proposal, ISOcat constrained DCs can refer to a CLAVAS vocabulary as a way to constrain (we stretch this a bit if a vocabulary is 'open', e.g., like organization names where it provides the preferred spelling of known organizations but still has to be possible to add new organization names, not in the vocabulary).
+ In ISOcat, such constraints have the same status as, for example, the data type: ISOcat just provides hints and has no way to enforce them. Look at CMDI, where the CMDI elements refer to an ISOcat DC via a concept link but may have a completely different data type. In an ideal world the Component Editor would take over the data type and the CLAVAS vocabulary from the linked DC specification. This way the reference to the CLAVAS vocabulary ends up in the CMD component/profile specification and the derived XSD, and can be used by tools that support CLAVAS, e.g., Arbil (well, it's in the planning).
+\begin{note}
+\noindent
+something similar for the link to an EBNF grammar in SCHEMAcat:
+%\begin{lstlisting}
+\begin{verbatim}
+      <scr:valueSchema
+               xmlns:scr="http://www.isocat.org/ns/scr"
+               pid="http://hdl.handle.net/1839/00-SCHM-0000-0000-004A-A"
+               type="ISO 14977:1996 EBNF"/>
+\end{verbatim}
+%\end{lstlisting}
+\end{note}
+Finally, the client application (e.g. a metadata editor) is configured/guided by the schema.
+It can use the reference to the DC to fetch explanations (semantic information)  (and translations) from ISOcat, but it is bound to the value range as restricted by the schema.
+\subsection{CMDI - Exploitation side}
+\label{cmdi_exploitation}
+Metadata complying with the CMD-framework is being created by a growing number of institutions by various means: automatic transformation from legacy data, or authoring of new metadata records with the help of one of the Metadata-Editors (TODO: cite: Arbil, NALIDA, ). The CMD-Infrastructure requires the content providers to publish their metadata via the OAI-PMH protocol and announce the OAI-PMH endpoints.  These are being harvested daily by a dedicated CLARIN harvester\footnote{\url{http://catalog.clarin.eu/oai-harvester/}}. The harvested data is validated against the schemas \todoin{What about Normalization?} and made available in packaged datasets. These are being fetched by the exploitation side components, which index the metadata records and make them available for searching and browsing.
+\begin{figure*}[!ht]
+\includegraphics[width=0.8\textwidth]{images/CMDingestion_woVAS}
+\caption{Within CMDI, metadata is harvested from content providers via OAI-PMH and made available to consumers/users by exploitation side components}
+\end{figure*}
+The first stable and publicly available application providing access to the collected metadata of CMDI has been the \texttt{VLO - Virtual Language Observatory}\footnote{\url{http://www.clarin.eu/vlo/}}\cite{VanUytvanck2010}, being developed within the CLARIN project. This application operates on the same collection of data as is discussed in this work, however it employs a faceted search, manually mapping the appropriate metadata fields from the different schemas to 10? fixed facets. The underlying search engine is the widely used full-text search engine Apache Solr\footnote{\url{http://lucene.apache.org/solr/}}. Although this is a very reductionist approach, it is certainly a great starting point offering a core set of categories together with an initial set of category mappings.
+More recently, the team at Meertens Institute developed a similar application, the \texttt{MI Search Engine}\furl{http://www.meertens.knaw.nl/cmdi/search/}. It too is based on Apache Solr and provides a faceted search, but with a substantially more sophisticated indexing process and search interface. \todoin { describe indexing and search}
+\todocite {MI Search Engine}
+And finally, there is the \emph{Metadata Repository} aimed to collect all the harvested metadata descriptions from CLARIN centres,
+and \emph{Metadata Service} that provides search access to this body of data. As such, Metadata Service is the primary application to use Semantic Mapping, to optionally expand user queries before issuing a search in the Metadata Repository. \cite{Durco2011}
+\section{Content Repositories}
+Metadata is only one aspect of the availability of resources. It is the first step to announce and describe the resources. However it is of little value, if the resources themselves are not equally well accessible. Thus another pillar of the CLARIN infrastructure are Content Repositories - centres to ensure availability of resources.
+RDF-stores in Content Repositories (Fedora, ..)
+The requirements for these repositories: PIDs, CMD, OAI-PMH
+\todocite{center-B paper}
+\section{Distributed system - federated search}
+Metadata -> harvesting via OAI-PMH, but Content search has to be really distributed.
+\begin{description}
+\item[Z39.50/SRU/SRW/CQL] LoC
+\item[OAI-PMH]
+\end{description}
+%However for the presumably default (and recommended) scenario, where the modeller wants to use the information from the data category, the \xne{Component Editor} could offer to take over the data type and the vocabulary reference from the linked DC specification.
+%%%%%%%%%%%%%%%%%
+\section{Other aspects of the infrastructure}
+While this work concentrates solely on the metadata, it needs to be recognized that metadata is only one aspect of the infrastructure, whose actual purpose is the availability of resources. Metadata is a necessary first step to announce and describe the resources. However, it is of little value if the resources themselves are not accessible.
+Consequently, another pillar of the CLARIN infrastructure are the centres\furl{http://www.clarin.eu/node/3812}:
+\begin{quotation}
+CLARIN's distributed network is made out of centres. These units, often a university or an academic institute, offer the scientific community access to services on a sustainable basis.
+\end{quotation}
+CLARIN imposes a number of criteria, that each centre needs to fulfill to become a CLARIN Centre\furl{http://www.clarin.eu/node/3767}\cite{CE-2013-0095}.
+CLARIN also maintains a central registry, the \xne{Centre Registry}\furl{https://centerregistry-clarin.esc.rzg.mpg.de/}, maintaining structured information about every centre, meant as primary entry point into the CLARIN network of centres.
+One core service of such centres are the content repositories, systems meant for long-term preservation and publication of research data and resources.
+\begin{figure*}
+\begin{center}
+\includegraphics[width=0.7\textwidth]{images/FCS_components.png}
+\end{center}
+\caption{components of the Federated Content Search}
+\label{fig:fcs}
+\end{figure*}
+Another aspect of the availability of resources is that, while metadata can be harvested and indexed locally in one repository, this is not possible with the content itself, partly due to the size of the data, but mainly due to legal obligations (licenses, copyright) restricting the access to and availability of the resources. CLARIN's answer to this problem is the task force \emph{Federated Content Search}\furl{http://www.clarin.eu/fcs}\cite{stehouwer2012fcs} aiming at establishing an architecture that allows searching simultaneously (via the aggregator) across a number of resources hosted by different content providers via a harmonized interface adhering to a common protocol. The agreed upon protocol is a compatible extension of the SRU/CQL protocol developed and endorsed by the Library of Congress as the XML- (and web-)based successor of Z39.50. The maintenance of SRU/CQL was
+transferred from LoC to OASIS in 2012, and OASIS released a first version of the protocol as Committee Specification in April 2012\cite{OASIS2012sru}.
 \section{Summary}
+In this chapter we presented individual parts of the infrastructure. Next to the core registries that this work directly builds upon -- the ISOcat Data Category Registry, the Component Registry and the Relation Registry -- a number of other services and applications forming the CLARIN ecosystem were briefly introduced. Separate consideration was given to the issue of controlled vocabularies together with a related module, the Vocabulary Alignment Service (and its implementation OpenSKOS), which allows managing vocabularies and using them in client applications. Finally, a few other aspects of the infrastructure that are equally important, though not pertaining to the metadata level, were briefly touched upon.

SMC4LRT/chapters/Introduction.tex

-                      r3553
+                      r3665
 While in the Digital Libraries community a consolidation already took place and global federated networks of digital library repositories are set up, in the field of Language Resource and Technology the landscape is still scattered, although meanwhile looking back at a decade of standardization and integration efforts. One main reason seems to be the complexity and diversity of the metadata associated with the resources, stemming from the wide range of resource types combined with project-specific needs. (Chapter \ref{ch:data} analyses the disparity in the data domain.)
 This situation has been identified by the community and numerous standardization initiatives had been undertaken. The process has gained a new momentum thanks to large framework programmes introduced by the European Commission aimed at fostering the development of common large-scale international research infrastructures. One key player in this development is the project CLARIN (see section \ref{def:CLARIN}). The main objective of this initiative is to make language resources and technologies more easily available to scholars by providing a common harmonized architecture for accessing and working with Language Resources and Technology (LRT). One core pillar of this architecture is the \emph{Component Metadata Infrastructure} (cf. \ref{def:CMDI}) -- a distributed system consisting of multiple interconnected modules aimed at creating and providing metadata for LRT in a coherent harmonized way.
+This situation has been identified by the community and numerous standardization initiatives had been undertaken. The process has gained a new momentum thanks to large framework programmes introduced by the European Commission aimed at fostering the development of common large-scale international research infrastructures. One key player in this development is the project CLARIN (see section \ref{def:CLARIN}). The main objective of this initiative is to make language resources and technologies (LRT) more easily available to scholars by means of a common harmonized architecture. One core pillar of this architecture is the \emph{Component Metadata Infrastructure} (cf. \ref{def:CMDI}) -- a distributed system consisting of multiple interconnected modules aimed at creating and providing metadata for LRT in a coherent harmonized way.
 This work discusses one module within the Component Metadata Infrastructure -- the \emph{Semantic Mapping Component} -- dedicated to overcome or at least ease the semantic interoperability problem stemming from the heterogeneity of the resource descriptions, without the reductionist approach of imposing one common description schema for all resources.
 …
 \section{Main Goal}
 The primary goal of this work is to \emph{\textbf{enhance search functionality} over a large heterogeneous collection of resource descriptions} in the field of LRT, henceforth referred to as \xne{semantic search}, distincting it from the necessary underlying preprocessing, referred to as \xne{semantic mapping}.
+The primary goal of this work is to \emph{\textbf{enhance search functionality} over a large heterogeneous collection of resource descriptions} in the field of LRT, henceforth referred to as \xne{semantic search}, distinguishing it from the underlying processing, referred to as \xne{semantic mapping}.
 The -- notoriously polysemic -- term ``mapping'' can have three different meanings within this work,
 …
 \end{description}
 The work can further be divided along the schema -- instance duality. Figure \ref{fig:master_outline} spans the conceptual space of this work and depicts the relations between individual subgoals.
+The work can further be divided along the schema -- instance duality. Figure \ref{fig:master_outline} spans the conceptual space of this work and depicts the dependencies between individual subgoals.
 \begin{figure*}[!ht]
 …
 \subsubsection*{Concept-based query expansion}
 Once the crosswalks are available, they can be used to rewrite user queries (or to generate appropriate search indexes), so that they match related fields across heterogeneous metadata schemas resulting in higher recall when searching.
+Once the crosswalks are available, they can be used to rewrite user queries, so that they match equivalent or similar fields across heterogeneous metadata schemas resulting in higher recall when searching.
 \paragraph{Example}
 …
 \end{quote}
 while other fields, labeled with the same (sub)strings but with different semantics shouldn't be considered:
+The expansion cannot be solved by simple string matching, as there are other fields labeled with the same (sub)strings but with different semantics, that shouldn't be considered:
 \begin{quote}
 \concept{Project/Title, Organisation/Name, Country/Name}
+\concept{Project/Title, Organisation/Name, Country/Name, LanguageName}
 \end{quote}
 …
 \subsubsection*{Ontology-driven data exploration}
 Based on the results of the previous parts of the work -- crosswalks and semantic interpretation -- the discussed dataset can be expressed as one big ontology. Consequently, semantic web technologies can be applied giving the user new means of \emph{exploring the dataset} through semantic resources.
+Based on the results of the previous parts of the work -- crosswalks and semantic interpretation -- the discussed dataset can be expressed as one big ontology. Consequently, semantic web technologies can be applied giving the user new means of \emph{exploring the dataset}.
 \paragraph{Example}
 …
 \subsubsection*{Visualization}
 Given the large, heterogeneous and complex dataset, it seems indispensable to equip the user with advanced means for exploration of and interaction with it. Hence this subgoal aiming at exploring ways of visualizing the data at hand.
+Given the large, heterogeneous and complex dataset, it seems indispensable to equip the user with advanced means to explore and interact with it. Hence this subgoal aimed to propose ways of visualizing the data at hand.
 \section{Method}
 …
 Building on this groundwork, in accordance with the first subgoal, we lay out the design of the service for handling crosswalks and concept-based query expansion. We describe the workflow, the central methods and the role of the module relative to other parts of the infrastructure.
 Subsequently, we explore the ways of integrating this service into exploitation tools (metadata search engines), to enhance search/retrieval through the use of semantic relations between concepts or categories.
+Subsequently, we explore the ways of integrating this service into exploitation tools (metadata search engines), to enhance search/retrieval through the use of semantic relations between concepts or categories. This theoretical part will be accompanied by a prototypical implementation as proof of concept.
+This theoretical part will be accompanied by a prototypical implementation as proof of concept.
+%In an evaluation phase, we apply a set of  test queries and compare a traditional search with a semantically expanded query in terms of recall/precision measures.
+In an evaluation phase, we apply a set of  test queries and compare a traditional search with a semantically expanded query in terms of recall/precision measures.
+In this work, the focus lies on the actual method to generate and apply the crosswalks -- expressed in the specification and operationalized in the (prototypical) implementation of the service -- rather than trying to establish final, accomplished crosswalks between the schemas. In fact, given the great diversity of resources and research tasks, a ``final'' complete alignment does not seem achievable at all. Therefore also the focus shall be on \emph{dynamic mapping}, i.e. to enable the users to directly manipulate the level of use of the crosswalks or even apply custom crosswalks depending on their current task or research question being able to actively influence the recall/precision ratio of the search results, and essentially to modulate the semantic search space.
+Note that in this work, the focus lies on the actual method to generate and apply the crosswalks -- expressed in the specification and operationalized in the (prototypical) implementation of the service -- rather than on trying to establish final, accomplished crosswalks between the schemas. In fact, given the great diversity of resources and research tasks, a ``final'' complete alignment does not seem achievable at all. Therefore, the focus shall also be on \emph{dynamic mapping}, i.e. enabling users to directly manipulate the level of use of the crosswalks, or even to apply custom crosswalks, depending on their current task or research question. This lets them actively influence the recall/precision ratio of the search results and, essentially, modulate the semantic search space.
 Serving the second subgoal, semantic interpretation on the instance level, we will propose the expression of all of the domain data (from meta-model specification to instances) in RDF, linking to corresponding entities in appropriate external
+Serving the second subgoal -- semantic interpretation on the instance level -- we will propose the expression of all of the domain data (from meta-model specification to instances) in RDF, linking to corresponding entities in appropriate external
 semantic resources (controlled vocabularies, ontologies).
 Once the dataset is expressed in RDF, it can be exposed via a semantic web application and published as another nucleus of \emph{Linked Open Data} in the global \emph{Web Of Data}.
 A separate usability evaluation of the semantic search is indicated, examining the user interaction with and display of the relevant additional information in the user search interface, however this issue can only be tackled marginally and will have to be outsourced into future work.
+A separate evaluation of the usability of the proposed semantic search solution is indicated, examining the user interaction with and display of the relevant additional information in the user search interface. However, this issue can only be tackled marginally here and will have to be deferred to future work.
 \section{Expected Results}
 …
 The main result of this work will be the \emph{specification} of the two modules \xne{concept-based search} and the underlying \xne{crosswalk service}.
 This theoretical part will be accompanied by a proof-of-concept \emph{implementation} of the components
 and the results and findings of the \emph{evaluation}.
+and the sample results. % and findings of the \emph{evaluation}.
 Another result of the work will be the original dataset expressed as RDF interlinked with existing external resources (ontologies, knowledge bases, vocabularies), effectively laying a foundation for providing this dataset as \emph{Linked Open Data}\furl{http://linkeddata.org/}.
 …
 \begin{description}
 \item [Crosswalk service] specification and a basic implementation of the service
 \item [Concept-based search] design of the query expansion and prototypical integration with search engines
+\item [Concept-based search] design of the query expansion and prototypical integration with a search engine
 \item [Visualization tool] design of an application for interactive exploration of the concerned dataset
 \item [Evaluation] evaluation results of querying the dataset comparing simple search and semantic search
+%\item [Evaluation] evaluation results of querying the dataset comparing simple search and semantic search
 \item [LinkedData] translation of the source dataset to RDF-based format with links into existing datasets, ontologies, knowledge bases
 \end{description}
 \section{Structure of the work}
 The work starts with examining the state of the art work in the two fields  language resources and technology and semantic web technologies in chapter \ref{ch:lit}, followed by administrative chapter \ref{ch:def} explaining the terms and abbreviations used in the work.
+The work starts with examining the state of the art in the two fields of language resources and technology and semantic web technologies in chapter \ref{ch:lit}, followed by the administrative chapter \ref{ch:def} explaining the abbreviations and formatting conventions used throughout this work.
 In chapter \ref{ch:data} we analyze the situation in the data domain of LRT metadata and in chapter \ref{ch:infra} we discuss the individual software components of the infrastructure underlying this work.
 …
 The main part of the work is found in chapters \ref{ch:design} and \ref{ch:design-instance} laying out the design of the software module and a proposal how to model the data in RDF respectively.
+The evaluation and the results are discussed in chapter \ref{ch:results}. Finally, in chapter \ref{ch:conclusions} we summarize the findings of the work and lay out where it could develop in the future.
+%evaluation and the
+The results are discussed in chapter \ref{ch:results}. Finally, in chapter \ref{ch:conclusions} we summarize the findings of the work and lay out where it could develop in the future.
 \section{Keywords}

SMC4LRT/chapters/Results.tex

-                      r3638
+                      r3665
 \label{ch:results}
+In this chapter, the results of the work are presented, divided into two main areas:
+software and data.
+In two sections, we explore the CMD data domain - the usage of the data categories on the one hand and the integration of existing formats on the other hand. While these two aspects were not directly part of this work, they were a) made possible by output of this work (SMC-Browser, statistical analysis), b) yield a valuable test case for the usefulness of the work and c) are an indispensable prerequisit for the necessary curation work being carried out by the CMDI community.
+In this chapter, the results of the work are presented. After a short update about the current state of affairs in the infrastructure as a whole, the individual parts of the work are listed with pointers to their specifications in previous chapters and links to the running prototypes.
+In the subsequent two sections, we explore a few specific aspects of the CMD data domain -- regarding the usage of the data categories (\ref{sec:explore-datcats}) and the integration of existing formats (\ref{sec:explore-formats}). While these topics are not directly results of this work, the presented analyses are. They were made possible by the technical solution of this work, yield a valuable test case for the usefulness of the work and are an indispensable prerequisite for the necessary coordination and curation work being carried out by the CMDI community.
 \section{Current status of the infrastructure}
 …
 The main services of the infrastructure have been in stable production for the last two years.
 Relation Registry is operational as an early prototype.
 Three instances of OpenSKOS are running, one of them being hosted by ACDH.
+Three instances of \xne{OpenSKOS} are running, one of them being hosted by \xne{ACDH}.
 \subsection{CMDI - data}
 More than 130 profiles are defined. (See \ref{table:dev_profiles} for more details about profiles.)
+More than 130 profiles are defined. (See table \ref{table:dev_profiles} for more details about profiles.)
 The official CLARIN harvester\furl{http://catalog.clarin.eu/oai-harvester/} collects data from 69 providers on a daily basis.
 The collection amounts to over 550.000 records in 64 profiles.
+The collection amounts to over 550.000 records in more than 60 distinct profiles.
 \subsection{ACDH - the home of SMC}
+Within CLARIN-AT a new centre has been brought to life, the Austrian Centre for Digital Humanities, that provides depositing and publishing services to the DH community, as well as infrastructural services that are part of the CLARIN Metadata Infrastructure.
+Figure \ref{fig:acdh_context} sketches the broader context of \xne{acdh} and its different roles.
+\section {Software}
+The specification of the system can be found in the chapters \ref{ch:design} and \ref{ch:design-instance}.
+There is prototypical implementation for three parts of the system
+\begin{itemize}
+\item the crosswalk service as a REST web service
+\item a module to integrate with a search engine
+\item web application that allows advanced interaction with the data set
+\end{itemize}
+The SMC module is being maintained in the CMDI code repository\footnote {\url{http://svn.clarin.eu/SMC}}.
+Furthermore, the CMD data has been expressed RDF, as first important step towards incorporating the dataset in the \emph{Web of Data}.
+Within CLARIN-AT a new centre has been brought to life, the Austrian Centre for Digital Humanities, with the mission to foster the digital research paradigm in the humanities. It is designed to provide depositing and publishing services to the DH community, as well as infrastructural services that are part of the CLARIN Metadata Infrastructure. SMC is one of the services provided by this centre.
+Figure \ref{fig:acdh_context} sketches the broader context of \xne{ACDH} and its different roles.
+%%%%%%%%%%%%%%%%
+\section {Technical solution}
+With this work we delivered a module embedded in a larger metadata infrastructure, aimed at supporting the semantic interoperability across the heterogeneous data in this infrastructure. The module consists of multiple interrelated components. The technical specification of the module can be found in chapter \ref{ch:design}. A prototypical implementation has been developed for the three main parts of the system. The code of this implementation is maintained in the central CMDI code repository\footnote {\url{http://svn.clarin.eu/SMC}}.
+The module itself is hosted at the \xne{CLARIN-AT} server, offering a main entry point page linking to the various parts of the module at:
+\\
+\url{http://clarin.aac.ac.at/smc} (soon: \url{http://acdh.ac.at/smc})
 \subsection{SMC - crosswalks service}
+The core function of the SMC is implemented as a set of XSL-stylesheets, with auxiliary functionality (like caching or a wrapping web service) being provided by a wrapping application implemented in Java.
+The crosswalk service is exposed as a REST web service whose interface provides mappings between search indexes, as defined in \ref{sec:cx}.
+This interface is available as part of the SMC application:
+\url{http://clarin.aac.ac.at/smc/cx}
 \subsection{SMC - as a module within Metadata Repository}
+There is also an XQuery implementation, integrated as a module of SADE/cr-xq, an eXist-based web application framework for publishing resources, on which the Metadata Repository is running.
+The SMC is also integrated as a module with the Metadata Repository, enabling \emph{semantic search} over the joint metadata domain.
+\url{http://clarin.aac.ac.at/mdrepo/smc}
 \subsection{SMC Browser -- advanced interactive user interface}
+SMC Browser\furl{http://clarin.aac.ac.at/smc-browser} is a web application to explore the complex dataset of the Component Metadata Framework, by visualizing its structure as an interactive graph.
+In particular, it enables the metadata modeller to examine the reuse of components or DCs in different profiles. The graph is accompanied by statistical information about individual `nodes', e.g. counting how many elements a profile contains, or in how many profiles a DC is used.
+It is implemented on top of the js-library d3; the code is checked in to clarin-svn.
+The graph is constructed from all profiles defined in the Component Registry and related datasets. To resolve the (multilingual) names and descriptions of data categories referenced in the CMD elements, the definitions of the referenced data categories are fetched from DublinCore and ISOcat.
+E.g. starting from 124 profiles, this amounts to a graph with ??? nodes and ??? edges.
+\begin{figure*}[!ht]
+SMC Browser is an advanced web-based visualization application to explore the complex dataset of the \xne{Component Metadata Infrastructure}, by visualizing its structure as an interactive graph. In particular, it enables the metadata modeller to examine the reuse of components or DCs in different profiles. The graph is accompanied by numerical statistics about the dataset as a whole and about individual items (profiles, components, data categories), a set of example results and user documentation. Details about design and implementation can be found in \ref{smc-browser}. The publicly available instance is maintained under:
+\url{http://clarin.aac.ac.at/smc/browser}
+\begin{figure*}
 \includegraphics[width=1\textwidth]{images/screen_SMC-Browser_2013-01-23}
 \caption{Screenshot of the SMC browser}
 \end{figure*}
-SMC Browser also features detailed numerical statistics about the dataset as whole and about individual items (profiles, components, data categories), a set of example results and user documentation.
-In the following section, we make extensive use of the output of this tool, to visualize individual aspects of the discussed data set.
 \subsection{SMC LOD}
+\section{Exploring the usage of data categories}
+At the core of the whole SMC (and CMDI) are the data categories as basic conceptual building blocks or anchors.
+We want to take a closer look on the usage of the data categories in the CMD infrastructure, examplifying on a few very common concepts -- \concept{language}, \concept{name}, \concept{resource type}, \concept{???}.
+In the ISOcat DCR 791 DCs are defined in the Metadata thematic profile, out of which 222 were created by the \textit{Athens Core} group. \todoin{need to check, how many of these athens-core data categories are being employed}
+\subsection{Language}
+While there are 69 components and 97 elements containing a substring `language' defined in the CR
+still only 19 distinct DCs with a `language' substring are being used\footnote{Here the term `used' means referenced in CMD components and elements.}. The most commonly used ones:
+\textit{languageID} (\texttt{DC-2482}) and \textit{languageName} (\texttt{DC-2484}) are referenced by more than 80 profiles.
+Additionally, these two DCs are linked to the Dublin Core term \textit{Language} in the RR.
+Thus a search engine capable of interpreting RR information could offer the user a simple Dublin Core-based search interface, while -- by expanding the query -- still searching over all available data, and, moreover, on demand offer the user a more finegrained semantic interpretation for the matches based on the originally assigned DCs. Figure \ref{fig:language_datcats} depicts the relations between the language data categories and their usage in the profiles. We encounter all types of situations: profiles using only \textit{dc:Language} or \textit{dcterms:Language}, \textit{isocat:languageId} or \textit{isocat:languageName},
+most profiles use both \textit{isocat:languageId} and \textit{isocat:languageName} and there are even profiles that refer to both \textit{isocat} and \textit{dublincore} data categories (\textit{data}, \textit{HZSKCorpus}, \textit{ToolService}).
+In a separate track, a model has been proposed (cf. \ref{ch:design-instance}) to express CMD data in RDF, as a first important step towards incorporating the dataset in the \emph{Web of Data}.
+%%%%%%%%%%%%%%%555
+\section{Exploring the CMD data -- SMC reports}
+SMC reports is a (growing) set of documents analyzing specific phenomena in the CMD data domain that were created making extensive use of the visual and numerical output from the \xne{SMC Browser}. In this section, we deliver a few examples of these analyses. A complete, up-to-date listing is maintained on the SMC website:
+\url{http://clarin.aac.ac.at/smc/reports}
+\subsection{Usage of data categories}
+\label{sec:explore-datcats}
+At the core of the whole SMC (and CMDI) are the data categories as basic semantic building blocks or anchors.
+In the ISOcat DCR, currently 791 DCs are defined in the Metadata thematic profile, starting from 222 that were initially created by the so-called \textit{Athens Core} group in 2010. %\todoin{need to check, how many of these athens-core data categories are being employed}
+As can be seen in table \ref{table:dev_profiles}, around 500 distinct data categories are being used in CMD profiles.
+We want to take a closer look at the usage of the data categories in the CMD data domain, exemplified by the very common concepts \concept{language} and \concept{name}. %, \concept{resource type}, \concept{???}.
+\subsubsection{Language}
+While there are 69 components and 97 elements containing a substring \code{`language'} defined in the CR
+still only 19 distinct DCs with a \code{`language'} substring are being used\footnote{Here the term `used' means referenced in CMD components and elements.}. The most commonly used ones:
+\concept{languageID\#DC-2482} and \concept{languageName\#DC-2484} are referenced by more than 80 profiles.
+Additionally, these two DCs are linked to the Dublin Core term \concept{Language} in the RR.
+Thus a search engine capable of interpreting RR information could offer the user a simple Dublin Core-based search interface, while -- by expanding the query -- still searching over all available data, and, moreover, on demand offer the user a more fine-grained semantic interpretation for the matches based on the originally assigned DCs. Figure \ref{fig:language_datcats} depicts the relations between the language data categories and their usage in the profiles. We encounter all types of situations: profiles using only \concept{dc:Language} or \concept{dcterms:Language}, \concept{isocat:languageId} or \concept{isocat:languageName},
+most profiles use both \concept{isocat:languageId} and \concept{isocat:languageName} and there are even profiles that refer to both \concept{isocat} and \concept{dublincore} data categories (\concept{data}, \concept{HZSKCorpus}, \concept{ToolService}).
 …
 \includegraphics[width=1\textwidth]{images/SMC-export_language_custom_v2c.pdf}
 \end{center}
 \caption{The four main \textit{Language} data categories and in which profiles they are being used}
+\caption{The four main \concept{Language} data categories and in which profiles they are being used}
 \label{fig:language_datcats}
 \end{figure*}
 It requires further inspection and in the end a case-by-case decision whether the other, less often used `language' DCs can be treated as equivalent to the above-mentioned ones.
 \textit{languageScript}, \textit{implementationLanguage}, as well as \textit{noLanguages} or  \textit{sizePerLanguage} clearly do not belong to the language cluster.
 But \textit{sourceLanguage}, \textit{languageMother} or \textit{participantDominantLanguage} can at least be expected to share the same value domain (natural languages) and even if they do not describe the language of the resource, they could be considered when one aims at maximizing the recall (i.e., trying to find anything related to a given language). This is actually exactly the scenario the RR was conceived for -- allow to define custom relation sets based on specific needs of a project or of a research question.
 \subsection{Name / Title}
+\concept{languageScript}, \concept{implementationLanguage}, as well as \concept{noLanguages} or  \concept{sizePerLanguage} clearly do not belong to the language cluster.
+But \concept{sourceLanguage}, \concept{languageMother} or \concept{participantDominantLanguage} can at least be expected to share the same value domain (natural languages) and, even if they do not describe the language of the resource, they could be considered when one aims at maximizing the recall (i.e., trying to find anything related to a given language). This is exactly the scenario the RR was conceived for: allowing custom relation sets to be defined based on the specific needs of a project or of a research question.
+\subsubsection{Name / Title}
 There are as many as 72 CMD elements with the label \texttt{Name}, referring to 12 different DCs.
+Again the main DC \textit{resourceName} (\texttt{DC-2544}) being used in 74 profiles together with the semantically near \textit{resourceTitle} (\texttt{DC-2545}) used in 69 profiles offer a good coverage over available data.
+Some of the DCs referenced by \texttt{Name} elements are \textit{author} (\texttt{DC-4115}), \textit{contact full name} (\texttt{DC-2454}), \textit{dcterms:Contributor}, \textit{project name} (\texttt{DC-2536}), \textit{web service name} (\texttt{DC-4160}) and \textit{language name} (\texttt{DC-2484}). This implies, that a na\"{i}ve search in a \texttt{Name} element would match semantically very heterogeneous fields and only applying the semantic information provided by the DCs and/or the context of the element (the enclosing components) allows to disambiguate the meaning of the values.
+\subsection{Resource type}
+\subsection{Subject, Genre, Topic}
+\section{Exploring the integration of existing formats}
+Again, the main DC \concept{resourceName\#DC-2544}, used in 74 profiles, together with the semantically close \concept{resourceTitle\#DC-2545}, used in 69 profiles, offers good coverage of the available data.
+Some of the DCs referenced by \code{Name} elements are \concept{author\#DC-4115}, \concept{contact full name\#DC-2454}, \concept{dcterms:Contributor}, \concept{project name\#DC-2536}, \concept{web service name\#DC-4160} and \concept{language name\#DC-2484}. This implies that a na\"{i}ve search in a \texttt{Name} element would match semantically very heterogeneous fields, and that only applying the semantic information provided by the DCs and/or the context of the element (the enclosing components) allows the meaning of the values to be disambiguated.
+%\subsection{Resource type}
+% \subsection{Subject, Genre, Topic}
+\subsection{Integration of existing formats}
+\label{sec:explore-formats}
 CLARIN set out with the aspiration /yearning to overcome the babylon of metadata formats
 …
 In this section, we want to elaborate on/analyze the state of integration efforts for 4 major formats: \xne{dublincore/OLAC}, \xne{teiHeader} and \xne{META-SHARE resourceInfo}.
 \subsection{dublincore / OLAC}
+\subsubsection{dublincore / OLAC}
 Very widely used (because) simple format
 …
 \caption{Profiles modelling dublincore terms}
 \label{table:dcterms-profiles}
+  \begin{tabular}{ l | l | l | r | r }
+    \hline
+profile name & created & creator & count & instances \\
+    \hline
+component-dc-terms-modular & 2010-04-21 & CMDI-team & 15 / 15 / 15 \\
+component-dc-terms & 2010-04-21 & CMDI-team & 0 / 15 / 15 \\
+DcmiTerms & 2010-10-28 & Dieter Van Uytvanck & 0 / 55 / 55 & \\
+OLAC-DcmiTerms & 2010-10-28 & Dieter Van Uytvanck & 0 / 55 / 55 & \\
+OLAC-DcmiTerms\footnote{optional DANS-DC-metadata component} & 2013-02-12 & Menzo Windhouwer & 1 / 71 / 62 & \\
+DC-UBU & 2013-05-29& Utrecht University Library & 0 / 15 / 15 & \\
+OLAC-DcmiTerms-ref & 2013-06-24 & fankhauser@ids-mannheim.de & 0 / 55 / 55 & \\
+    \hline
+  \end{tabular}
+%  \begin{tabular}{ |l | l | l | r | r | }
+  \begin{tabu}{ l  l  l  r  r }
+    \hline
+\rowfont{\itshape\small} profile name & created & creator & count & instances \\
+   \hline
+component-dc-terms-modular & 2010-04 & CMDI-team & 15 / 15 / 15 & \\
+component-dc-terms & 2010-04 & CMDI-team & 0 / 15 / 15 & \\
+DcmiTerms & 2010-10 & D. Van Uytvanck & 0 / 55 / 55 & 46.156 \\
+OLAC-DcmiTerms & 2010-10 & D. Van Uytvanck & 0 / 55 / 55 & 85.149 \\
+OLAC-DcmiTerms\footnote{optional DANS-DC-metadata component} & 2013-02 & M. Windhouwer & 1 / 71 / 62 & \\
+DC-UBU & 2013-05 & Utrecht Uni Lib & 0 / 15 / 15 & \\
+OLAC-DcmiTerms-ref & 2013-06 & Fankhauser, IDS & 0 / 55 / 55 & 697 \\
+OLAC-DcmiTerms-ref-DWR & private & ? & 1 / 61 / 55 &  775 \\
+    \hline
+  \end{tabu}
 \end{table}
 Additionally, there is a number of profiles with concept links to dublincore terms,
 Some use all of the dublincore elements or terms as one component within a larger profile,
 one example being the \xne{data} profile created by the Czech initiative LINDAT modells  the minimal obligatory set of META-SHARE \xne{resourceInfo}) combined with a simple dublincore record (see also subsection about META-SHARE below).
 Other profiles refer only to some data categories. Most often used: \concept{Title} (used in 33 profiles) and \concept{Creator} (in 29 profiles).
+one example being the \xne{data} profile created by the Czech initiative LINDAT, which models the minimal obligatory set of the META-SHARE \xne{resourceInfo} schema (cf. the subsection about META-SHARE below) combined with a simple dublincore record.
+Other profiles refer only to some data categories. Most often used: \concept{dc:Title} (used in 33 profiles) and \concept{dc:Creator} (in 29 profiles).
 Profiles that make more frequent use of the dublincore terms:
 \begin{itemize}
+\item EastRepublican (8)
+\item HZSKCorpus (17)
+\item teiHeader (8)
+\item ToolService (15)
+\item OralHistoryInterviewDANS (15)
 \end{itemize}
 \begin{figure*}[!ht]
 \begin{center}
 \includegraphics[width=0.8\textwidth]{images/profiles_using_dcmiterms.png}
+\begin{tabular}{l r}
+EastRepublican & 8 \\
+HZSKCorpus &17 \\
+teiHeader &8 \\
+ToolService &15 \\
+OralHistoryInterviewDANS & 15 \\
+\end{tabular}
+\begin{figure*}
+\begin{center}
+\includegraphics[width=1\textwidth]{images/profiles_using_dcmiterms.png}
 \end{center}
 \caption{Profiles referring to at least some of the dublincore data categories/terms}
 …
 \subsection{teiHeader}
+\subsubsection{teiHeader}
 TEI is a de-facto standard for encoding any kind of textual resources. It defines a set of elements to annotate individual aspects of the text being encoded. For the purposes of text description / metadata the complex element \code{teiHeader} is foreseen.
 …
 This yields a more complex, but also a more systematic and flexible setup, with a clean separation/boundary/interface of the semantic space of TEI and the possibility to map the TEI elements (via their data categories) to multiple and/or different data categories according to the specific needs of a project or research question.
+\begin{figure*}[!ht]
+\begin{center}
+\includegraphics[width=0.75\textwidth]{images/teiHeader_DBNL.png}
+%[!ht]
+\begin{figure*}
+\begin{center}
+\includegraphics[width=0.65\textwidth]{images/teiHeader_DBNL.png}
 \end{center}
 \caption{The reuse of components between the original teiHeader-profile (2010) and the profiles used in the Nederlab project}
 …
 \end{figure*}
+% p{0.2\textwidth}
 \begin{table}
 \caption{Overview of TEI-related CMD profiles}
 \label{table:tei-profiles}
   \begin{tabular}{ l | r | l | r | r | r}
     \hline
 profile name & created & creator & count & instances \\
     \hline
 teiHeader & 2010 & ICLTT, Durco & 16/35/13 & 467 \\
+  \begin{tabu}{ p{0.2\textwidth}  r  l  r  r  }
+    \hline
+\rowfont{\itshape\small} profile name & created & creator & count & instances \\
+    \hline
+teiHeader & 2010 & Durco, ICLTT & 16/35/13 & 467 \\
 teiHeader & 2012 & Deutsches Text Archiv & 56/82/10 & 857 \\
 TEIDocumentDescription & 2012 & Leipzig Corpora, Eckart & 16/35/13 & ? \\
 DBNL\_Tekst & 2013 & Nederlab, Zhang & 20/38,15 & \textgreater 37 Mio.\footnote{There shall be a metadata record for every article.} \\
 DBNL\_Tekst\_Onzelfstandig  & & & 20/47/21 &  \\
     \hline
   \end{tabular}
+TEIDocument Description & 2012 & Eckart, Leipzig Corpora & 16/35/13 & ? \\
+DBNL\_Tekst & 2013 & Zhang, Nederlab & 20/38/15 & \textgreater 37 Mio.\footnote{There shall be a metadata record for every article.} \\
+DBNL\_Tekst\_ Onzelfstandig  & & & 20/47/21 &  \\
+    \hline
+  \end{tabu}
 \end{table}
 …
 clarin.eu:cr1:p 1366279029218 (private)
+\subsection{META-SHARE}
+META-SHARE created a new metadata model \cite{Gavrilidou2012meta}. Although inspired by the Component Metadata, META-SHARE metadata imposes a single large schema for all resource types with a minimal core subset of obligatory metadata elements and with many optional components.
+%
+\subsubsection{META-SHARE}
+%
+META-SHARE created a new metadata model \cite{Gavrilidou2012meta}. Although inspired by the Component Metadata, META-SHARE metadata imposes a single large schema for all resource types with a minimal core subset of obligatory metadata elements and with many optional components.
 %In cooperation between metadata teams from CLARIN and META-SHARE
+\begin{figure*}[!ht]
+The original META-SHARE schema actually accommodates four models for different resource types. Consequently, the model has been expressed as four CMD profiles, each for a distinct resource type, with all four sharing most of the components, as can be seen in figure \ref{fig:resource_info_5}. The biggest single profile is currently the remodelled maximum schema from the META-SHARE project for describing corpora, with 117 distinct components and 337 elements. When expanded, this translates to 419 components and 1587 elements. However, many of the components and elements are optional (and conditional), thus a specific instance will never use all the possible elements.
+In a parallel effort, LINDAT, the Czech national infrastructure initiative engaged in both CLARIN and META-SHARE, created a CMD profile (\concept{data}\furl{http://catalog.clarin.eu/ds/ComponentRegistry/?item=clarin.eu:cr1:p_1349361150622}) modelling the minimal obligatory set of META-SHARE \concept{resourceInfo}, however combined with a simple dublincore record.
+This way, the information gets partly duplicated, but with the advantage that minimal information is conveyed in a widely understood format, while retaining the expressivity of the feature-rich schema.
+The expression of the META-SHARE schema in CMD allows a direct comparison of the two different approaches taken in the two projects: a metamodel that allows custom profiles with shared semantics to be generated vs. the more traditional way of trying to create one schema that fits in all the information. It shows nicely the trade-off: many custom schemas with the risk of proliferation and problems with semantic interoperability, or one very large schema with the risk of overwhelming the user and still not being able to capture all specific information.
+\begin{figure*}
 \begin{center}
 \includegraphics[width=0.5\textwidth]{images/SMC-resourceInfo.png}
 \end{center}
+\caption{The five \concept{resourceInfo} profiles with the first level of components}
+\label{fig:resource_info_5}
+\end{figure*}
+\begin{figure*}
+\begin{center}
+\includegraphics[width=0.75\textwidth]{images/LINDAT-profile-data.png}
+\end{center}
 \caption{profile by LINDAT combining META-SHARE \xne{resourceInfo} component with dublincore elements }
 \label{fig:resource_info_5}
+\label{fig:META-SHARE-LINDAT}
 \end{figure*}
 …
 \caption{Profiles modelling resourceInfo}
 \label{table:resourceinfo-profiles}
   \begin{tabular}{ l | l | l | r | r }
     \hline
 profile name & created & creator & count & instances \\
     \hline
 resourceInfo (minimal) & 2013-02-13 & LINDAT.CZ & 34 / 41 / 21 \\
 resourceInfo (lexical) & 2013-06-02 & P. Labropoulou & 86 / 226 / 57 \\
 resourceInfo (tools) & 2013-06-02 & P. Labropoulou & 61 / 176 / 52 \\
 resourceInfo (language) & 2013-06-02 & P. Labropoulou & 89 / 228 / 54 \\
 resourceInfo (corpus) & 2013-06-02 & P. Labropoulou & 117 / 337 / 72 \\
     \hline
   \end{tabular}
+  \begin{tabu}{ l l l r r }
+    \hline
+\rowfont{\itshape\small} profile name & created & creator & count & instances \\
+    \hline
+resourceInfo (minimal) & 2013-02 & LINDAT.CZ & 34 / 41 / 21 & 67 \\
+resourceInfo (lexical) & 2013-06 & P. Labropoulou & 86 / 226 / 57 \\
+resourceInfo (tools) & 2013-06 & P. Labropoulou & 61 / 176 / 52 \\
+resourceInfo (language) & 2013-06 & P. Labropoulou & 89 / 228 / 54 \\
+resourceInfo (corpus) & 2013-06 & P. Labropoulou & 117 / 337 / 72 \\
+    \hline
+  \end{tabu}
 \end{table}
+The model has been expressed as four CMD profiles, each for a distinct resource type; however, all four share most of the components, as can be seen in figure \ref{fig:resource_info_5}. The biggest single profile is currently the remodelled maximum schema from the META-SHARE project for describing corpora, with 117 distinct components and 337 elements. When expanded, this translates to 419 components and 1587 elements. However, many of the components and elements are optional (and conditional), so a specific instance will never use all the possible elements.
+In a parallel effort, LINDAT, the Czech national infrastructure initiative engaged in both CLARIN and META-SHARE, created a CMD profile (\concept{data}\furl{http://catalog.clarin.eu/ds/ComponentRegistry/?item=clarin.eu:cr1:p_1349361150622}) modelling the minimal obligatory set of META-SHARE \concept{resourceInfo}, combined, however, with a simple Dublin Core record.
+This way, the information is partly duplicated, but with the advantage that a minimal set of information is conveyed in a widely understood format while the expressivity of the feature-rich schema is retained.
+\begin{figure*}[!ht]
+\begin{center}
+\includegraphics[width=0.75\textwidth]{images/LINDAT-profile-data.png}
+\end{center}
+\caption{Profile by LINDAT combining the META-SHARE \xne{resourceInfo} component with Dublin Core elements}
+\label{fig:META-SHARE-LINDAT}
+\end{figure*}
+\begin{figure*}[!ht]
+\begin{center}
+\includegraphics[height=1\textheight]{images/resourceInfoBIG.png}
+\begin{figure*}
+\begin{center}
+\includegraphics[height=0.95\textheight]{images/resourceInfoBIG.png}
 \end{center}
 \caption{the META-SHARE based profile for describing corpora}
 …
+%%%%%%%%%%%%%%%%%%%%%%%
+\subsection{SMC cloud}
+As the latest, still experimental, addition, SMC browser provides a special type of graph that displays only profiles. The links between them reflect the reuse of components and data categories (i.e. how many components or data categories the linked pairs of profiles share), indicating the degree of similarity or semantic proximity. Figure \ref{fig:SMC_cloud} depicts one possible output of the graph
+covering a large part of the defined profiles. It nicely shows the clusters of strongly related profiles in contrast to the greater distances between more loosely connected profiles.
+\begin{figure*}[!ht]
+\begin{center}
+\includegraphics[width=1\textwidth]{images/just_profiles_6.png}
+\end{center}
+\caption{SMC cloud -- graph visualizing the semantic proximity of profiles}
+\label{fig:SMC_cloud}
+\end{figure*}
+\begin{comment}
 \section{Evaluation}
 \label{evaluation}
 …
 AF + DCR + RR
+\end{comment}
 \section{Summary}
+The direct comparison of the CMD approach, a metamodel that allows custom profiles with shared semantics to be generated, with the more traditional way of trying to create one schema that fits everything in, as in META-SHARE, nicely shows the trade-off: many custom schemas or one very large schema.
+In this final chapter, we presented the results: on the one hand, the technical solution of the module \xne{Semantic Mapping Component}; on the other hand, we spent a good part of the chapter on commented analyses of the processed dataset that were made possible by \xne{SMC Browser}, an interactive visualization tool developed as part of this work for exploring the schema-level data of the discussed collection. As such, the analyses can be seen as an evaluation, a proof of concept and of the usefulness of the presented work.

SMC4LRT/chapters/abstract_de.tex

-                      r2672
+                      r3665
 \chapter*{Kurzfassung}
+Insert the abstract in German here, in accordance with the faculty's guidelines.
+This work is embedded in a large international research infrastructure initiative whose task is
+to provide easy, stable, harmonized access to language resources and technologies in Europe, the \emph{Common Language Resource and Technology Infrastructure} or CLARIN. The technical core of this undertaking is the \emph{Component Metadata Infrastructure}, a distributed system that enables the harmonized, coherent creation and dissemination of metadata for language resources. The outcome of this work, the module \emph{Semantic Mapping Component}, was conceived as a part of this system in order to overcome, by exploiting the mechanisms built into the infrastructure, the problem of semantic interoperability that arises from the heterogeneity of the metadata formats.
+The actual goal and benefit of this work -- in line with the general idea of the whole undertaking -- was the \emph{improvement of search capabilities} in the large heterogeneous collection of metadata. This task was tackled in two separate, complementary approaches: a) design and development of a service providing \emph{crosswalks} (correspondences between fields in different metadata formats) on the basis of well-defined concepts, and the application of these \emph{crosswalks} in search scenarios to increase recall; b) acknowledging the integrative power of the \emph{Linked Open Data} paradigm, modelling of the domain data as a \emph{Semantic Web} resource, to enable the use of semantic technologies on the dataset.
+Corresponding to the two approaches, the work also delivered two main results: a) the specification of a module for \emph{concept-based search} together with the underlying \emph{crosswalk service}, accompanied by a test implementation; b) a specification for modelling the source data in RDF format, which lays the foundation for providing the data as \emph{Linked Open Data}.
+Partly as a by-product, the application \emph{SMC Browser} was also developed -- an interactive visualization tool for exploring the schema level of the data collection. With the help of this tool, a number of in-depth analyses of the data could be produced, which are used directly by the research community for the exploration and curation of the complex data. The application and the analysis reports can thus be regarded as a valuable contribution to the research community.

SMC4LRT/chapters/abstract_en.tex

-                      r3638
+                      r3665
 This work is embedded in the context of a large research infrastructure initiative aimed at easing and harmonizing access to language resources and technology (LRT) in Europe, the \emph{Common Language Resource and Technology Infrastructure} or CLARIN. A core technical pillar of this initiative is the \emph{Component Metadata Infrastructure}, a distributed system for creating and providing metadata for LRT in a coherent, harmonized way. The outcome of this work, the \emph{Semantic Mapping Component}, was conceived as one module within the infrastructure dedicated to overcoming the semantic interoperability problem stemming from the heterogeneity of the resource descriptions, by harnessing the mechanisms of the semantic layer built in at the core of the infrastructure.
+This work is embedded in the context of a large research infrastructure initiative aimed at providing easy, stable, harmonized access to language resources and technology (LRT) in Europe, the \emph{Common Language Resource and Technology Infrastructure} or CLARIN. A core technical pillar of this initiative is the \emph{Component Metadata Infrastructure}, a distributed system for creating and providing metadata for LRT in a coherent, harmonized way. The outcome of this work, the \emph{Semantic Mapping Component}, was conceived as one module within the infrastructure dedicated to overcoming the semantic interoperability problem stemming from the heterogeneity of the resource descriptions, by harnessing the mechanisms of the semantic layer built into the core of the infrastructure.
+The ultimate objective of the effort -- in line with the overall mission of the infrastructure -- was to \emph{enhance search functionality} over the large heterogeneous collection of resource descriptions. This was pursued by two separate, complementary approaches: a) Enriching the search capabilities with concept-based crosswalks on schema level.
+And -- acknowledging the integrative power of the \emph{Linked Open Data} paradigm  -- b) expressing the domain data as a \emph{Semantic Web} resource.
+The ultimate objective of this work -- in line with the overall mission of the whole initiative -- was to \emph{enhance search functionality} over the large heterogeneous collection of resource descriptions. This objective was pursued in two separate, complementary approaches: a) design a service delivering crosswalks (i.e. equivalences between fields in disparate metadata formats) based on well-defined concepts, and apply these concept-based crosswalks in search scenarios to enhance recall; b) acknowledging the integrative power of the \emph{Linked Open Data} paradigm, express the domain data as a \emph{Semantic Web} resource, to enable the application of semantic technologies to the dataset.
+In parallel with the two approaches, the work delivered two main results: a) the \emph{specification} of the module for \emph{concept-based search} together with the underlying \emph{crosswalks service} accompanied by a proof-of-concept implementation. And b) the blueprint for expressing the original dataset in RDF, effectively laying a foundation for providing this dataset as \emph{Linked Open Data}.
+As a by-product, the application \emph{SMC browser} was developed -- a visualization tool for interactive exploration of the dataset. This tool provided means to generate a number of advanced analyses of the data, directly used by the community for exploration and curation of the complex dataset.  As such, they are considered the main contribution of this work by the author.
+In parallel with the two approaches, the work delivered two main results: a) the \emph{specification} of the module for \emph{concept-based search} together with the underlying \emph{crosswalks service}, accompanied by a proof-of-concept implementation; and b) the blueprint for expressing the original dataset in RDF format, effectively laying a foundation for providing this dataset as \emph{Linked Open Data}.
+Partly as a by-product, the application \emph{SMC browser} was developed -- an interactive visualization tool to explore the dataset on the schema level. This tool provided the means to generate a number of advanced analyses of the data, directly used by the community for exploration and curation of the complex dataset. As such, the tool and the reports can be considered a valuable contribution to the community.

SMC4LRT/chapters/appendix.tex

-                      r3638
+                      r3665
 \begin{figure*}[!ht]
 \begin{center}
 \includegraphics[width=1\textwidth]{images/acdh-diagram_300dpi_rotated.png}
+\includegraphics[width=1\textheight, angle=90]{images/acdh-diagram_300dpi.png}
 \end{center}
 \caption{Austrian Centre for Digital Humanities - the home of SMC - in context}
 …
 \end{figure*}
+\chapter{CMD -- sample data}
+\chapter{SMC Browser}
+\section{Definition of a CMD profile}
+\section{CMD record}
+\chapter{SMC Browser -- related material }
 …
 \input{chapters/userdocs_cleaned}
+\section {Sample SMC graphs}
+\label{sec:smc-graphs}
+\begin{comment}
 \chapter{SMC Reports}
 \label{ch:reports}
+\label{ch:smc-reports}
 SMC Reports is a (growing) set of documents analyzing specific phenomena in the CMD data domain based on the visual and numerical output from the SMC Browser \ref{smc-browser}.
 \input{chapters/examples_cleaned}
+\end{comment}

SMC4LRT/chapters/danksagung.tex

r3638	r3665
2	2
3	3	I would like to sincerely thank all colleagues who stood by me with advice
4		and my loved ones for ~~their~~ extra portion of patience that I demanded of them.
	4	and my loved ones for the extra portion of patience that I demanded of them.
