Changeset 4117 for SMC4LRT

Timestamp: 12/01/13 19:04:51
Author: vronk
Message: minor orthographic corrections
Location: SMC4LRT/chapters
Files: 14 edited

  • SMC4LRT/chapters/Conclusion.tex

    r3776 r4117  

     %Irrespective of the additional levels - the user wants and has to get to the resource. (not always) to the "original"
    -And finally, a visualization tool for exploring the schema level data of the discussed data collection was developed -- the \emph{SMC Browser}. Considering the feedback received until now from the colleagues in the community, it is already now a useful tool with high further potential. As detailed in \ref{smc-browser-extensions}, there is a number of features, that could enhance the functionality and usefulness of the tool: integrate with instance data to be able to directly see which profiles are effectively being used; allow set operations on subgraphs (like intersection and difference) to enable differential views; generalize the matching algorithm; enhance the tool to act as an independent visualization service, by accepting external graph data (from any domain).
    +And finally, a visualization tool for exploring the schema-level data of the discussed data collection was developed -- the \emph{SMC Browser}. Considering the feedback received so far from colleagues in the community, it is already a useful tool with high further potential. As detailed in \ref{smc-browser-extensions}, there are a number of features that could enhance the functionality and usefulness of the tool: integrate with instance data to be able to directly see which profiles are effectively being used; allow set operations on subgraphs (like intersection and difference) to enable differential views; generalize the matching algorithm; enhance the tool to act as an independent visualization service, by accepting external graph data (from any domain).

     Within the CLARIN community, a number of (permanent) tasks have been identified and corresponding task forces have been established,
  • SMC4LRT/chapters/Data.tex

    r3776 r4117  

    -\chapter{Analysis of the data landscape}
    +\chapter{Analysis of the Data Landscape}
     \label{ch:data}
     This section gives an overview of existing standards and formats for metadata in the field of Language Resources and Technology, together with a description of their characteristics and their respective usage in initiatives and data collections. Special attention is paid to the Component Metadata Framework, representing the base data model for the infrastructure this work is part of.
     
     The \emph{Component Metadata Framework} (CMD) is the data model of the CLARIN Component Metadata Infrastructure. (See \ref{def:CMDI} for information about the infrastructure. The XML schema defining CMD -- the \xne{general-component-schema} -- is featured in appendix \ref{lst:cmd-schema}.)
     CMD is used to define so-called \var{profiles}, constructed out of reusable \var{components} -- collections of metadata fields. Components can contain other components, and they can be reused in multiple profiles. A profile itself is just a special kind of component (a subclass), with some additional administrative information.
    -The actual core provision for semantic interoperability is the requirement, that each CMD element (i.e. metadata field) refers ``via a PID to exactly one data category(cf. \ref{def:DCR}\footnote{in short: persistently referencable concept definition}), thus
    +The actual core provision for semantic interoperability is the requirement that each CMD element (i.e. metadata field) refers ``via a PID to exactly one data category (cf. \ref{def:DCR}\footnote{in short: a persistently referencable concept definition}), thus
     indicating unambiguously how the content of the field in a metadata description should be interpreted'' \cite{Broeder+2010}.

    -%This approach of integrating prerequisites for semantic interoperability directly into the process of metadata creation is fundamentally different from the traditional methods of schema matching that try to establish pairwise alignments between already existing schemas -- be it algorithm-based or by means of explicit manually defined crosswalks\cite{Shvaiko2005}.
    +%This approach of integrating prerequisites for semantic interoperability directly into the process of metadata creation is fundamentally different from the traditional methods of schema matching that try to establish pairwise alignments between already existing schemas -- be it algorithm-based or by means of explicit manually defined crosswalks \cite{Shvaiko2005}.

     While the primary registry for data categories used in CMD is the \xne{ISOcat} Data Category Registry (cf. \ref{def:DCR}), other authoritative sources are accepted (so-called ``trusted registries''), especially the set of terms maintained by the Dublin Core Metadata Initiative \cite{DCMI:2005}.

    -Once the profiles are defined they are transformed into a XML Schema, that prescribes the structure of the instance records.
    +Once the profiles are defined, they are transformed into an XML Schema that prescribes the structure of the instance records.
     The generated schema also conveys, as annotations, the information about the referenced data categories.

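The mechanism described in this hunk, each metadata element carrying a persistent reference to exactly one data category, can be sketched with a few lines of code. CMDI component specifications use a \code{ConceptLink} attribute for this reference, but the sample document, element layout and PIDs below are invented for illustration; real component specifications are considerably richer.

```python
# Sketch: collect data-category references from a CMD-style component
# specification. The ConceptLink attribute name follows CMDI conventions;
# the sample document and the PIDs are invented for illustration.
import xml.etree.ElementTree as ET

SAMPLE = """
<CMD_Component name="Actor">
  <CMD_Element name="firstName" ConceptLink="http://example.org/datcat/DC-2552"/>
  <CMD_Element name="lastName" ConceptLink="http://example.org/datcat/DC-2553"/>
  <CMD_Component name="Contact">
    <CMD_Element name="email" ConceptLink="http://example.org/datcat/DC-2521"/>
  </CMD_Component>
</CMD_Component>
"""

def concept_links(xml_text: str) -> dict:
    """Map each element name to the PID of its referenced data category."""
    root = ET.fromstring(xml_text)
    return {el.get("name"): el.get("ConceptLink")
            for el in root.iter("CMD_Element")}

print(concept_links(SAMPLE))
```

A consumer with such a mapping can interpret two differently named fields as semantically equivalent whenever they point to the same data category, which is exactly the interoperability provision the text describes.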
     
     The main CLARIN OAI-PMH harvester\footnote{\url{http://catalog.clarin.eu/oai-harvester/}}
     collects records from 69 providers on a daily basis. The complete dataset amounts to 540.065 records.
    -16 of the providers offer CMDI records, the other 53 provide OLAC/DC records\label{info:olac-records}, that are being converted into the corresponding CMD profile after harvesting. Next to these 81.226 original OLAC records, there a few providers offering their OLAC or DCMI-terms records already converted into CMDI, thus all in all OLAC, DCMI-terms records amount to 139.152.
    -On the other hand, some of the comparatively few providers of `native' CMD records expose multiple profiles (e.g. Meertens Institute uses 12 different profiles.) So we encounter both situations: one profile being used by many providers and one provider using many profiles.
    +16 of the providers offer CMDI records; the other 53 provide OLAC/DC records\label{info:olac-records} that are converted into the corresponding CMD profile after harvesting. Next to these 81.226 original OLAC records, there are a few providers offering their OLAC or DCMI-terms records already converted into CMDI, so all in all OLAC and DCMI-terms records amount to 139.152.
    +On the other hand, some of the comparatively few providers of `native' CMD records expose multiple profiles (e.g. the Meertens Institute uses 12 different profiles). So we encounter both situations: one profile being used by many providers and one provider using many profiles.


     
     24.583 & DoBeS archive \\
     23.185 & Language and Cognition \\
    +17.859 & Institut für Deutsche Sprache, CLARIN-D Zentrum, Mannheim \\
     14.593 & talkbank \\
     14.363 & Acquisition \\
    -14.320 & Institut für Deutsche Sprache, CLARIN-D Zentrum, Mannheim \\
     12.893 & MPI CGN \\
     10.628 & Bavarian Archive for Speech Signals (BAS) \\
     
     4.640 & Oxford Text Archive \\
     4.492 & Leipzig Corpora Collection \\
    -3.539 & Institut für Deutsche Sprache, CLARIN-D Zentrum, Mannheim \\
     3.280 & A Digital Archive of Research Papers in Computational Linguistics \\
     3.147 & CLARIN NL \\
     3.081 & MPI für Bildungsforschung \\
    +2.678 & WALS Online \\
     \hline
       \end{tabu}
     
     \end{table}

    -We can also observe a large disparity on the amount of records between individual providers and profiles. Almost half of all records is provided by the Meertens Institute (\textit{Liederenbank} and \textit{Soundbites} collections), another 25\% by MPI for Psycholinguistics (\textit{corpus} + \textit{Session} records from the \textit{The Language Archive}). On the other hand there are 25 profiles that have less than 10 instances. This can be owing both to the state of the respective project (resources and records still being prepared) and the modelled granularity level (collection vs. individual resource).
    +We can also observe a large disparity in the number of records between individual providers and profiles. Almost half of all records are provided by the Meertens Institute (\textit{Liederenbank} and \textit{Soundbites} collections), another 25\% by the MPI for Psycholinguistics (\textit{corpus} + \textit{Session} records from \textit{The Language Archive}). On the other hand, there are 25 profiles that have fewer than 10 instances. This can be due both to the state of the respective project (resources and records still being prepared) and to the modelled granularity level (collection vs. individual resource).


     
     Next to CLARIN and CMDI, there is a host of related previous and concurrent work. In the following, we briefly introduce some formats and data collections established in the field and, where applicable, we also sketch the ties with CMDI and existing integration efforts.

    -As for comprehensive overview of formats and standards, the CLARIN deliverable \textit{Interoperability and Standards} \cite{CLARIN_D5.C-3} provides overview of standards, vocabularies and other normative/standardization work in the field of Language Resources and Technology. And \textit{Seeing standards: a visualization of the metadata universe} by Riley and Becker \cite{Riley2010seeing} pus the overwhelming amount of existing metadata standards into a systematic comprehensive visual overview analyzing the use of standards from four aspects: community, domain, function, and purpose. Though despite its aspiration on comprehensiveness it leaves out some of the formats relevant in the context of this work: IMDI, EDM, ESE.
    -
    -
    -\subsection{Dublin Core metadata terms}
    +As for a comprehensive overview of formats and standards, the CLARIN deliverable \textit{Interoperability and Standards} \cite{CLARIN_D5.C-3} provides an overview of standards, vocabularies and other normative work in the field of Language Resources and Technology. \textit{Seeing standards: a visualization of the metadata universe} by Riley and Becker \cite{Riley2010seeing} puts the overwhelming number of existing metadata standards into a systematic, comprehensive visual overview, analyzing the use of standards from four aspects: community, domain, function, and purpose. Despite its aspiration to comprehensiveness, it leaves out some formats relevant in the context of this work: IMDI, EDM, ESE.
    +
    +
    +\subsection{Dublin Core Metadata Terms}
     The work on this metadata format started in 1995 at the Metadata Workshop\furl{http://dublincore.org/workshops/dc1/} organized by OCLC/NCSA in Dublin, Ohio, USA. It is nowadays maintained by the Dublin Core Metadata Initiative.

     
     \end{description}

    -The DCMI terms format is very widely spread nowadays. Thanks to its simplicity it is used as the common denominator in many applications, content management systems integrate Dublin Core to use in \code{meta} tags of served pages (\code{<meta name="DC.Publisher" content="publisher-name" >}), it is default minimal description in content repositories (Fedora-commons, DSpace). It is also the obligatory base format in the OAI-PMH protocol. The OpenArchives register\furl{http://www.openarchives.org/Register/BrowseSites} lists more than 2100 data providers.
    -
    -There are multiple possible serializations, in particular a mapping t RDF is specified\furl{http://dublincore.org/documents/dcq-rdf-xml/}.
    +The DCMI terms format is very widespread nowadays. Thanks to its simplicity, it is used as the common denominator in many applications: content management systems embed Dublin Core in the \code{meta} tags of served pages (\code{<meta name="DC.Publisher" content="publisher-name" >}), and it is the default minimal description format in content repositories (Fedora Commons, DSpace). It is also the obligatory base format in the OAI-PMH protocol. The OpenArchives register\furl{http://www.openarchives.org/Register/BrowseSites} lists more than 2100 data providers.
    +
    +There are multiple possible serializations; in particular, a mapping to RDF is specified\furl{http://dublincore.org/documents/dcq-rdf-xml/}.
     Worth noting is Dublin Core's take on the classification of resources\furl{http://dublincore.org/documents/resource-typelist/}.
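The embedding of Dublin Core into HTML \code{meta} tags mentioned in this hunk can be sketched as follows. The \code{DC.*} naming convention and the \code{Publisher} example come from the text; the helper function and the sample values are illustrative, not taken from any particular content management system.

```python
# Sketch: render a dict of Dublin Core fields as HTML meta tags, the way
# content management systems embed them in served pages. The helper and
# the sample field values are invented for illustration.
from html import escape

def dc_meta_tags(fields: dict) -> str:
    """Emit one <meta> tag per field, using the DC.* naming convention."""
    return "\n".join(
        f'<meta name="DC.{escape(name)}" content="{escape(value)}">'
        for name, value in fields.items()
    )

print(dc_meta_tags({
    "Title": "A sample resource",
    "Publisher": "publisher-name",
    "Language": "en",
}))
```

Because the element set is so small, such tags are trivial for any aggregator to parse, which is precisely why the format works as a common denominator.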

    -The simplicity of the format is also it's main drawback when considered as metadata format in the research communities. It it too general to capture all specific details, individual research groups need to describe different kinds of resources with.
    +The simplicity of the format is also its main drawback when considered as a metadata format for the research communities. It is too general to capture all the specific details with which individual research groups need to describe different kinds of resources.
    157157
     \subsection{OLAC}
     \label{def:OLAC}

    -\xne{OLAC Metadata}\furl{http://www.language-archives.org/}format \cite{Bird2001} is a application profile\cite{heery2000application}, of the \xne{Dublin Core metadata terms}, adapted to the needs of the linguistic community. It is developed and maintained by the \xne{Open Language Archives Community} providing a common platform and an infrastructure for ``creating a worldwide virtual library of language resources'' \cite{Simons2003OLAC}.
    -
    -The OLAC schema \furl{http://www.language-archives.org/OLAC/1.1/olac.xsd} extends the dcterms schema mainly by adding attributes with controlled vocabularies, for domain specific semantic annotation (\code{linguistic-field}, \code{role}, \code{linguistic-type}, \code{language}, \code{discourse-type}).
    +The \xne{OLAC Metadata}\furl{http://www.language-archives.org/} format \cite{Bird2001} is an application profile \cite{heery2000application} of the \xne{Dublin Core metadata terms}, adapted to the needs of the linguistic community. It is developed and maintained by the \xne{Open Language Archives Community}, which provides a common platform and an infrastructure for ``creating a worldwide virtual library of language resources'' \cite{Simons2003OLAC}.
    +
    +The OLAC schema\furl{http://www.language-archives.org/OLAC/1.1/olac.xsd} extends the dcterms schema mainly by adding attributes with controlled vocabularies for domain-specific semantic annotation (\code{linguistic-field}, \code{role}, \code{linguistic-type}, \code{language}, \code{discourse-type}).

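The extension mechanism described in this hunk, a plain Dublin Core element refined by an attribute drawn from a controlled vocabulary, can be sketched as below. The namespace URIs are the published DC and OLAC ones; the record content is invented, and the markup is simplified with respect to a real OLAC record (which additionally uses \code{xsi:type} to signal the refinement).

```python
# Sketch: build a minimal OLAC-flavoured record -- Dublin Core elements,
# one of them refined by a controlled-vocabulary attribute (olac:code).
# Namespace URIs are the published DC/OLAC ones; content is invented and
# the markup is simplified compared to a real OLAC record.
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"
OLAC = "http://www.language-archives.org/OLAC/1.1/"
ET.register_namespace("dc", DC)
ET.register_namespace("olac", OLAC)

record = ET.Element("record")
ET.SubElement(record, f"{{{DC}}}title").text = "A sample speech corpus"
lang = ET.SubElement(record, f"{{{DC}}}language")
lang.set(f"{{{OLAC}}}code", "de")  # code from a controlled language vocabulary

print(ET.tostring(record, encoding="unicode"))
```

The point of the attribute is that the free-text element content stays human-readable while the code remains machine-comparable across archives.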
     \begin{quotation}
     
     OLAC provides a ``search over 100,000 records collected from 44 archives\furl{http://www.language-archives.org/archives}, covering resources in half of the world's living languages''.

    -Note, that OLAC archives are being harvested by CLARIN harvester and OLAC records are part of the CMDI joint metadata domain (cf. \ref{tab:cmd-profiles}, \ref{reports:OLAC}).
    +Note that OLAC archives are harvested by the CLARIN harvester and OLAC records are part of the CMDI joint metadata domain (cf. \ref{tab:cmd-profiles}, \ref{reports:OLAC}).


     

     \begin{quotation}
    -  The Text Encoding Initiative (TEI) is a consortium which collectively develops and maintains a standard for the representation of texts in digital form \dots  [Next to] its chief deliverable is a set of Guidelines which specify encoding methods for machine-readable texts, chiefly in the humanities, social sciences and linguistics, \dots the Consortium provides a variety of TEI-related resources, training events and software. [abgridged]
    +  The Text Encoding Initiative (TEI) is a consortium which collectively develops and maintains a standard for the representation of texts in digital form \dots [Next to] its chief deliverable is a set of Guidelines which specify encoding methods for machine-readable texts, chiefly in the humanities, social sciences and linguistics, \dots the Consortium provides a variety of TEI-related resources, training events and software. [abridged]
     \end{quotation}

    -TEI is a de-facto standard for encoding any kind of digital textual resources being developed by a large community since 1994. It defines a set of elements to annotate individual aspects of the text being encoded. For the purposes of text description, metadata encoding (of main concern for us) the complex top-level element \code{teiHeader} is foreseen. TEI is not prescriptive, but rather descriptive, it does not provide just one fixed schema, but allows for a certain flexibility wrt to elements used and inner structure, allowing to generate custom schemas adopted to projects' needs. Thus there is also not just one fixed \code{teiHeader}.
    +TEI is a de facto standard for encoding any kind of digital textual resource, developed by a large community since 1994. It defines a set of elements to annotate individual aspects of the text being encoded. For the purposes of text description and metadata encoding (of main concern for us), the complex top-level element \code{teiHeader} is foreseen. TEI is not prescriptive but rather descriptive: it does not provide just one fixed schema, but allows for a certain flexibility with respect to the elements used and the inner structure, allowing custom schemas adapted to projects' needs to be generated. Thus there is also not just one fixed \code{teiHeader}.

     Some of the data collections encoded in TEI are the corpora of the DWDS\furl{http://www.dwds.de}, the Deutsches Textarchiv\furl{http://www.dwds.de/dta} \cite{Geyken2011deutsches}, and the Oxford Text Archive\furl{http://ota.oucs.ox.ac.uk/}.
     
     \subsection{ISLE/IMDI -- The Language Archive}

    -\xne{IMDI}\furl{http://www.mpi.nl/imdi/} (\xne{EAGLES/ISLE Meta Data Initiative}) is an elaborate format for detailed descriptions of multi-media/multi-modal language resoruces developed within the corresponding project\cite{wittenburg2000eagles} 2000 to 2003.
    -
    -To serve the main goal of the project, easing access to language resources fostering the reuse, resource description in this new format were created for a number of collections and were made available via a dedicated \xne{IMDI browser}\furl{http://corpus1.mpi.nl/ds/imdi_browser/}, that allowed browsing the collection structure as well as complex advanced search over the deeply structured metadata. Also a metadata editor was developed for generating records in this format, with provisions for offline field-work and synchronization with the repository.
    +\xne{IMDI}\furl{http://www.mpi.nl/imdi/} (\xne{EAGLES/ISLE Meta Data Initiative}) is an elaborate format for detailed descriptions of multi-media/multi-modal language resources, developed within the corresponding project \cite{wittenburg2000eagles} from 2000 to 2003.
    +
    +To serve the main goal of the project, easing access to language resources and fostering their reuse, resource descriptions in this new format were created for a number of collections and were made available via a dedicated \xne{IMDI browser}\furl{http://corpus1.mpi.nl/ds/imdi_browser/} that allowed browsing the collection structure as well as complex advanced search over the deeply structured metadata. A metadata editor was also developed for generating records in this format, with provisions for offline fieldwork and synchronization with the repository.

     The project lead, responsible for running the repository and the whole infrastructure, was the Technical Group at the MPI for Psycholinguistics, which has engaged in a number of projects aimed at building a stable technical infrastructure for long-term archiving of and work with language resources since its foundation (together with the Institute itself) in the 1970s\furl{http://tla.mpi.nl/home/history/}. Recently, the group and the established infrastructure have been renamed to \xne{TLA -- The Language Archive}\furl{http://tla.mpi.nl/} (``Your partner for language data, tools and archiving''), where on one platform both a host of language resources and their descriptions are preserved and provided, and tools for working with this data are offered. The archive is also an aggregator itself, offering various collections from different (also external) projects (like DOBES, CGN, RELISH, etc.).
     
     \label{def:META-SHARE}

    -META-SHARE was the subproject (2010-2013) of META-NET, a Network of Excellence consisting of 60 research centres from 34 countries, that covered the technical aspects.
    +META-SHARE was the subproject (2010-2013) of META-NET, a Network of Excellence consisting of 60 research centres from 34 countries, covering the technical aspects.


     
     \end{quotation}

    -Within the project META-SHARE a new metadata format was developed\cite{Gavrilidou2012meta}. Although inspired by the Component Metadata, META-SHARE metadata imposes a single large schema for all resource types with a subset of core obligatory elements and with many optional components.
    +Within the META-SHARE project, a new metadata format was developed \cite{Gavrilidou2012meta}. Although inspired by Component Metadata, META-SHARE metadata imposes a single large schema for all resource types, with a subset of core obligatory elements and many optional components.
     %In cooperation between metadata teams from CLARIN and META-SHARE

    -The original META-SHARE schema actually accomodates four models for different resource types. Consequently, the model has been expressed as 4 CMD profiles each for a distinct resource type however all four sharing most of the components, as can be seen in figure \ref{fig:resource_info_5}. The biggest single profile is currently the remodelled maximum schema from the META-SHARE project for describing corpora, with 117 distinct components and 337 elements. When expanded, this translates to 419 components and 1587 elements. However, many of the components and elements are optional (and conditional), thus a specific instance will never use all the possible elements. (See \ref{reports-meta-share} for more details about the format based on its integration into CMDI)
    -
    -The technical infrastructure of META-SHARE represents a distributed network of repositories consists of a number of member repositories, that offer their own subset of resource\furl{http://www.meta-share.eu/}.
    -
    -Selected member repositories\footnote{7 as of 2013-07}  play the role of managing nodes providing ``a core set of services critical to the whole of the META-SHARE network''\cite{Piperidis2012meta}, especially collecting the resource descriptions from other members and exposing the aggregated information to the users.
    +The original META-SHARE schema actually accommodates four models for different resource types. Consequently, the model has been expressed as four CMD profiles, one for each resource type, with all four sharing most of the components, as can be seen in figure \ref{fig:resource_info_5}. The biggest single profile is currently the remodelled maximum schema from the META-SHARE project for describing corpora, with 117 distinct components and 337 elements. When expanded, this translates to 419 components and 1587 elements. However, many of the components and elements are optional (and conditional), thus a specific instance will never use all the possible elements. (See \ref{reports-meta-share} for more details about the format, based on its integration into CMDI.)
    +
    +The technical infrastructure of META-SHARE is a distributed network consisting of a number of member repositories that offer their own subset of resources\furl{http://www.meta-share.eu/}.
    +
    +Selected member repositories\footnote{7 as of 2013-07} play the role of managing nodes providing ``a core set of services critical to the whole of the META-SHARE network'' \cite{Piperidis2012meta}, especially collecting the resource descriptions from other members and exposing the aggregated information to the users.
     The whole network offers approximately 2.000 resources (the numbers differ even across individual managing nodes).

    -One point of criticism from the community was, the fact, that META-SHARE infrastructure does not provide any interface to the outer world, such as a OAI-PMH endpoint.
    +One point of criticism from the community was the fact that the META-SHARE infrastructure does not provide any interface to the outer world, such as an OAI-PMH endpoint.

     %? MetaShare ontology\furl{http://metashare.ilsp.gr/portal/knowledgebase/TheMetaShareOntology}
     

     The European Language Resources Association (ELRA)\furl{http://elra.info} offers a large collection of language resources (over 1.100), with a focus on spoken resources, but also written, terminological and multimodal resources, mostly under license for a fee (although selected resources are available for free as well).
    -The available datasets can be search for via ELRA Catalog\furl{http://catalog.elra.info/}
    +The available datasets can be searched for via the ELRA Catalog\furl{http://catalog.elra.info/}.
     Additionally, ELRA runs the so-called \xne{Universal Catalog}, a repository comprising information regarding Language Resources (LRs) identified all over the world.

     
     ELRA's missions are to promote language resources for the Human Language Technology (HLT) sector, and to evaluate language engineering technologies.

    -ELDA\furl{http://www.elda.org/} - Evaluations and Language resources Distribution Agency -- is ELRA's operational body, set up to identify, classify, collect, validate and produce the language resources which may be needed by the HLT -- Human Language Technology -- community.
    +ELDA\furl{http://www.elda.org/} -- the Evaluations and Language Resources Distribution Agency -- is ELRA's operational body, set up to identify, classify, collect, validate and produce the language resources which may be needed by the HLT community.
     ELDA handles the practical and legal issues related to the distribution of language resources, provides legal advice in the field of HLT, and
     
     \subsection{LDC}

    -Linguistic Data Consortium\furl{http://www.ldc.upenn.edu/} hosted by University of Pennsylvania is another provider/aggregator of high quality curated language resources. The data is provided for a fee, more than 650 resources have been made available since 1993. The catalog is freely accessible. The metadata is additionally aggregated by OLAC archives.
    +The Linguistic Data Consortium\furl{http://www.ldc.upenn.edu/}, hosted by the University of Pennsylvania, is another provider/aggregator of high-quality curated language resources. The data is licensed for a fee; more than 650 resources have been made available since 1993. The catalogue is freely accessible. The metadata is additionally aggregated by the OLAC archives.
    256256
     \section{Formats and Collections in the World of Libraries}
     \label{sec:lib-formats}

    -There are at least two reasons to concern ourselves with the developments in the world of Libraries and Information Systems (LIS): the long tradition implying rich experience and the fact, that almost all of the resources in the libraries are language resources. This argument gets even more relevant in the light of the efforts to digitize large portions of the material pursued in many (national) libraries in the last years (cf. discussion on Libraries partnering with Google). And given the amounts of data, even the sole bibliographic records constitute sizable language resources in they own right.
    +There are at least two reasons to concern ourselves with the developments in the world of Libraries and Information Systems (LIS): the long tradition implying rich experience, and the fact that almost all of the resources in libraries are language resources. This argument gets even more relevant in the light of the efforts to digitize large portions of the material pursued by many (national) libraries in recent years (cf. the discussion on libraries partnering with Google). And given the amounts of data, even the bibliographic records alone constitute sizable language resources in their own right.

     %\item[LoC] Library of Congress \url{http://www.loc.gov}
     
     There is a long tradition of standardized metadata formats in the world of Libraries and Information Systems (LIS), a major role in the standardization being assumed for decades by the Library of Congress\furl{http://www.loc.gov/standards/}.

    -The \xne{MARC}\furl{www.loc.gov/marc/} set of formats (being used since 1970s ) ``are standards for the representation and communication of bibliographic and related information in machine-readable form''. A number of variants developed over the years, the most widely spread is \xne{MARC 21} since 1999 -- is the standard format used for communication among libraries around the world.
    -
    -MARC 21 consists of 5 ``communication formats'' for specific types of data (Bibliographic, Authority Data, Holdings Data, Classification, and Community Information), are widely used standards for the representation and exchange of bibliographic, authority, holdings, classification, and community information data in machine-readable form. In 2002, the Library of Congress developed the \xne{MARCXML} schema for representing MARC records in XML;
    +The \xne{MARC}\furl{www.loc.gov/marc/} set of formats (in use since the 1970s) ``are standards for the representation and communication of bibliographic and related information in machine-readable form''. A number of variants have developed over the years; the most widespread, \xne{MARC 21} (in use since 1999), is the standard format for communication among libraries around the world.
    +
    +MARC 21 consists of 5 ``communication formats'' for specific types of data (Bibliographic, Authority Data, Holdings Data, Classification, and Community Information), which are widely used standards for the representation and exchange of such data in machine-readable form. In 2002, the Library of Congress developed the \xne{MARCXML} schema for representing MARC records in XML.

     \xne{METS -- Metadata Encoding and Transmission Standard} is a format from the family of Library of Congress standards (since 2001) for encoding descriptive, administrative, and structural metadata regarding objects within a digital library.
     
    277277A METS record acts as a flexible container that accomodates other pieces of data (different levels of metadata and encoded objects themselves or references to those) in external formats\furl{http://www.loc.gov/standards/mets/mets-extenders.html}.
    278278
    279 Number of tools have been developed to author and process \xne{METS} format\furl{http://www.loc.gov/standards/mets/mets-tools.html} and numerous projects (online editions, DAM systems) use METS for structuring and recording the data\footnote{\url{http://www.loc.gov/standards/mets/mets-registry.html} though seems rather outdated} among others also \xne{austrian literature online}\furl{http://www.loc.gov/standards/mets/mets-registry.html}
     279A number of tools have been developed to author and process the \xne{METS} format\furl{http://www.loc.gov/standards/mets/mets-tools.html}, and numerous projects (online editions, DAM systems) use METS for structuring and recording their data\footnote{\url{http://www.loc.gov/standards/mets/mets-registry.html}, though it seems rather outdated}, among others \xne{austrian literature online}\furl{http://www.loc.gov/standards/mets/mets-registry.html}.
    280280
    281281\xne{Metadata Object Description Schema} - ``is a schema for a bibliographic element set that may be used for a variety of purposes, and particularly for library applications''. It is a simplified subset of MARC 21 that uses language-based tags rather than numeric ones and is richer
    282282than Dublin Core. It is one of the schemas endorsed to extend (be used inside) METS.
    283283
    284 There have been efforts to create a conceptually more sound base for the bibliographic data -- in 1998 \xne{Functional Requirements for Bibliographic Records} (FRBR) \cite{FRBR1998} was published, an abstract model for the data expressed as an Entity Relationship Model and a standard based on FRBR, the \xne{Resource Description and Access} (RDA) has been proposed as an comprehensive standard for resource description and discovery, that however was confronted with opposition from the LIS community, questioning the need of abandoning established cataloging practices \cite{gorman2007rda}.
     284There have been efforts to create a conceptually more sound base for bibliographic data -- in 1998, the \xne{Functional Requirements for Bibliographic Records} (FRBR) \cite{FRBR1998} was published, an abstract model for the data expressed as an Entity Relationship Model. A standard based on FRBR, \xne{Resource Description and Access} (RDA), has been proposed as a comprehensive standard for resource description and discovery; it was, however, confronted with opposition from the LIS community, which questioned the need to abandon established cataloging practices \cite{gorman2007rda}.
    285285Although work on RDA continues, among others at the Library of Congress, there has been no wider adoption of the standard by the LIS community so far.
    286286
    287287\subsection{ESE, Europeana Data Model - EDM}
    288288
    289 Within the big european initiative \xne{Europeana} (cf. \ref{lit:digi-lib}) information about digitised objects are collected from a great number of cultural institutions from all of Europe, currently hosting information about 29 million objects from 2.200 institutions from 36 countries\furl{http://www.pro.europeana.eu/web/guest/content}.
    290 
    291 For collecting metadata from the content providers, Europeana originally developed and advised the common format \xne{ESE Europeana Semantic Elements}\furl{http://pro.europeana.eu/ese-documentation} a Dublin Core-based application profile\furl{www.europeana.eu/schemas/ese/ESE-V3.4.xsd}. Soon it became obvious, that this format is too limiting and work started on a Semantic Web compatible RDF-based format -- the Europeana Data Model EDM\furl{http://pro.europeana.eu/edm-documentation} \cite{isaac2012europeana,haslhofer2011data,doerr2010europeana}.
     289Within the big European initiative \xne{Europeana} (cf. \ref{lit:digi-lib}), information about digitised objects is collected from a great number of cultural institutions from all over Europe; it currently hosts information about 29 million objects from 2,200 institutions in 36 countries\furl{http://www.pro.europeana.eu/web/guest/content}.
     290
     291For collecting metadata from the content providers, Europeana originally developed and advised the common format \xne{ESE Europeana Semantic Elements}\furl{http://pro.europeana.eu/ese-documentation}, a Dublin Core-based application profile\furl{www.europeana.eu/schemas/ese/ESE-V3.4.xsd}. Soon it became obvious that this format was too limiting, and work started on a Semantic Web-compatible, RDF-based format -- the Europeana Data Model EDM\furl{http://pro.europeana.eu/edm-documentation} \cite{isaac2012europeana,haslhofer2011data,doerr2010europeana}.
    292292EDM is fully compatible with ESE, which continues to be accepted from the providers. There is also already a SPARQL endpoint\furl{http://europeana.ontotext.com/sparql} for exploring the Europeana data in the new format.
    293293%https://github.com/europeana
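To sketch how such an endpoint can be consumed, the following composes a SPARQL protocol GET request. The query shape and the \code{edm:ProvidedCHO} class are illustrative assumptions and should be checked against the endpoint's documentation; the request is only composed here, not sent.

```python
from urllib.parse import urlencode

# Endpoint mentioned in the text; the query below is an illustrative assumption.
ENDPOINT = "http://europeana.ontotext.com/sparql"

QUERY = """\
PREFIX edm: <http://www.europeana.eu/schemas/edm/>
SELECT ?object WHERE { ?object a edm:ProvidedCHO } LIMIT 10
"""

def build_request_url(endpoint: str, sparql: str) -> str:
    """Compose the GET URL for a SPARQL protocol request."""
    params = {"query": sparql, "format": "application/sparql-results+json"}
    return endpoint + "?" + urlencode(params)

url = build_request_url(ENDPOINT, QUERY)
```

Fetching \code{url} with any HTTP client would then return the query results, provided the endpoint supports the requested result format.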
     
    297297\label{refdata}
    298298
    299 One goal of this work being the groundwork for exposing the discussed dataset in the Semantic Web
    300 one preparatory task is to identify external semantic resources like controlled vocabularies or ontologies that the dataset could be linked with\footnote{Similar activity of inventarizing vocabularies and thesauri was conducted in the context of the \xne{Europeana} initiative
     299Since one goal of this work is to lay the groundwork for exposing the discussed dataset in the Semantic Web,
     300a preparatory task is to identify external semantic resources like controlled vocabularies or ontologies that the dataset could be linked with\footnote{A similar activity of inventorying vocabularies and thesauri was conducted in the context of the \xne{Europeana} initiative
    301301\url{http://europeanalabs.eu/wiki/WP12Vocabularies}, \url{https://europeanalabs.eu/wiki/DesignSemanticThesauri}}.
    302302
    303 Conceptually, we want to partition these resources in two types. On the one hand abstract concepts constituting all kinds of classifications, typologies, taxonomies. On the other hand named entities that exist(ed) in real world, like persons, organizations or geographical places. Main motivation for this distinction is the insight, that while for named entities there is (mostly) ``something'' in the (physical) world that gives a solid ground for equivalence relations between references from different sources (sameAs), for concepts we need to accept a plurality of existing conceptualizations and while we can (and have to) try to identify relations between them, the equivalence relation is inherently much weaker. This insight entails a partly different approach -- simply put, while we can aspire to create one large list/index encompassing all named entities, we have to maintain a forest of conceptual trees.
    304 
    305 In the following we inventarize such resources (cf. tables \ref{table:data-ne}, \ref{table:data-concepts}) covering the domains expected to be needed for linking the original dataset. (Information about size of the dataset is meant rather as a rough indication of the "general weight" of the dataset, not necessarily a precise up to date information.) The acronyms in the tables are resolved in the glossary \ref{table:vocab-glossary}
    306 How this resources will be employed is discussed in \ref{sec:values2entities}.
    307 Additionally, some verbose commentary follows.
     303Conceptually, we want to partition these resources into two types. On the one hand, abstract concepts constituting all kinds of classifications, typologies and taxonomies. On the other hand, named entities that exist(ed) in the real world, like persons, organizations or geographical places. The main motivation for this distinction is the insight that, while for named entities there is (mostly) ``something'' in the (physical) world that gives a solid ground for equivalence relations between references from different sources (\code{sameAs}), for concepts we need to accept a plurality of existing conceptualizations, and while we can (and have to) try to identify relations between them, the equivalence relation is inherently much weaker. This insight entails a partly different approach -- simply put, while we can aspire to create one large list/index encompassing all named entities, we have to maintain a forest of conceptual trees.
     304
     305In the following, we inventory such resources (cf. tables \ref{table:data-ne}, \ref{table:data-concepts}) covering the domains expected to be needed for linking the original dataset. (Information about the size of a dataset is meant as a rough indication of its ``general weight'', not necessarily as precise, up-to-date information.) The acronyms in the tables are resolved in the glossary \ref{table:vocab-glossary}. How these resources will be employed is discussed in \ref{sec:values2entities}. Additionally, some verbose commentary follows.
    308306
    309307%\subsubsection{Named entities}
    310308
    311 The largest controlled vocabularies of named entities are the authority files of (national) libraries. These are further aggregated into the so-called Virtual International Authority File, a huge resource, with entries from different authority files referring to the same entity being merged. This resource can be explored via a search interface and there is also a search service for applications.
    312 Other general large-scale resources are the vocabularies curated and provided by Getty Research Institute\furl{http://www.getty.edu/research/tools/vocabularies/index.html}, however there is only a limited free access and licensed and fee for full access. But recently there work was announced to publish the vocabularies as LOD\furl{http://www.getty.edu/research/tools/vocabularies/lod/index.html}
     309The largest controlled vocabularies of named entities are the authority files of (national) libraries. These are further aggregated into the so-called \xne{Virtual International Authority File}, a huge resource, with entries from different authority files referring to the same entity being merged. This resource can be explored via a search interface and there is also a search service for applications.
     310Other general large-scale resources are the vocabularies curated and provided by the Getty Research Institute\furl{http://www.getty.edu/research/tools/vocabularies/index.html}. Free access is limited and a fee is charged for full access, but recently the provider announced plans to publish the vocabularies as LOD\furl{http://www.getty.edu/research/tools/vocabularies/lod/index.html}.
    313311
    314312Regarding existing domain-specific semantic resources, \texttt{LT-World}\footnote{\url{http://www.lt-world.org/}}, the ontology-based portal covering primarily Language Technology being developed at DFKI\footnote{Deutsches Forschungszentrum für Künstliche Intelligenz, \url{http://www.dfki.de}}, is a prominent resource providing information about the entities (Institutions, Persons, Projects, Tools, etc.) in this field of study \cite{Joerg2010}.
    315313
    316 Also to mention \xne{Yago}, a large knowledge base created by MPI informatik integrating dbpedia, geonames and wordnet\furl{http://www.mpi-inf.mpg.de/yago-naga/yago/} \cite{Suchanek2007yago}.
     314Also worth mentioning is \xne{Yago}\furl{http://www.mpi-inf.mpg.de/yago-naga/yago/}, a large knowledge base created by MPI Informatik integrating the DBpedia, GeoNames and WordNet datasets \cite{Suchanek2007yago}.
    317315
    318316So we witness a strong general trend towards the Semantic Web and Linked Open Data.
     
    351349
    352350In this chapter, we gave an overview of the existing formats and datasets in the broad context of Language Resources and Technology.
    353 We also gave an overview of main formats and collections in the domain of Library and Information Services and a inventory of existing controlled vocabularies for named entities and concepts (taxonomies, classifications), needed as input in section \ref{sec:values2entities} about mapping values to entities.
     351We also gave an overview of the main formats and collections in the domain of Library and Information Services and an inventory of existing controlled vocabularies for named entities and concepts (taxonomies, classifications), needed as input in section \ref{sec:values2entities} about mapping values to entities.
    354352
    355353
     
    451449 %   \hline
    452450
    453 AAT & international Architecture and Arts Thesaurus, Getty \\
     451AAT & International Architecture and Arts Thesaurus, Getty \\
    454452CONA & Cultural Objects Name Authority \\
    455453DAI & Deutsches Archäologisches Institut \\
     
    460458FAST & Faceted Application of Subject Terminology \\
    461459Getty & Getty Research Institute curating the \href{http://www.getty.edu/research/tools/vocabularies/index.html}{vocabularies}, part of Getty Trust \\
    462 GND & \emph{Gemeinsame Norm Datei} - Integrated authority Files of the German National Library \\
     460GND & \emph{Gemeinsame Normdatei} - Integrated Authority Files of the German National Library \\
    463461GTAA & Gemeenschappelijke Thesaurus Audiovisuele Archieven (Common Thesaurus for Audiovisual Archives) \\
    464462% {quotation} The thesaurus consists of several facets for describing TV programs: subjects; people mentioned; named entities (Corporation names, music bands etc); locations; genres; makers and presentators. \end{quotation} \\
     
    467465LCC & Library of Congress Classification \\
    468466LCSH & Library of Congress Subject Headings \\
    469 LoC & Library of Congress\furl{http://loc.gov} \\
    470 OCLC & Online Computer Library Center\furl{http://www.oclc.org} -- world's biggest library federation \\
    471 PKND & \href{http://prometheus-bildarchiv.de/de/tools/pknd}{prometheus} KÃŒnstlerNamensansetzungsDatei\\
     467LoC & \href{http://loc.gov}{Library of Congress} \\
     468OCLC & \href{http://www.oclc.org}{Online Computer Library Center} -- world's biggest library federation \\
     469PKND & \href{http://prometheus-bildarchiv.de/de/tools/pknd}{Prometheus} KünstlerNamensansetzungsDatei\\
    472470RKD & Rijksbureau voor Kunsthistorische Documentatie -- Netherlands Institute for Art History \\
    473471TGN & Getty Thesaurus of Geographic Names \\
  • SMC4LRT/chapters/Definitions.tex

    r3776 r4117  
    2525RDF & \xne{Resource Description Framework} \cite{RDF2004} \\
    2626RR & Relation Registry, cf. \ref{def:rr}   \\
    27 TEI & \xne{Text Encoding Initiative}, cf. \ref{tei} \\
     27TEI & \xne{Text Encoding Initiative}, cf. \ref{def:tei} \\
    2828\end{tabular}
    2929\end{table}
     
    5858\end{table}
    5959
    60 \section{Formatting conventions}
     60\section{Formatting Conventions}
    6161
    6262Inline formatting for highlighting: \\
  • SMC4LRT/chapters/Design_SMCinstance.tex

    r3776 r4117  
    1 \chapter{Mapping on instance level,\\ CMD as LOD}
     1\chapter{Mapping on Instance Level,\\ CMD as LOD}
    22\label{ch:design-instance}
    33
     
    77
    88And if you can express these all in RDF, which we can for almost all of them (maybe
    9 except the actual language resource ... unless it has a schema adorned
     9except for the actual language resource ... unless it has a schema adorned
    1010with ISOcat DC references ... \textless insert a SCHEMAcat plug ;-) \textgreater, but for
    1111metadata we have that in the CMDI profiles ...) you could load all the
     
    1818
    1919
    20 As described in previous chapters (\ref{ch:infra},\ref{ch:design}), semantic interoperability is one of the main motivations for the CMD infrastructure. However, the established machinery pertains mostly to the schema level, the actual values in the fields of CMD instances remain ``just strings''. This is the case even though the problem of different labels for semantically equivalent or even identical entities is even more so virulent on the instance level. While for a number of metadata fields the value domain can be enforced through schema validation, some important fields (like \concept{organization} or \concept{resource type})  have a constrained value domain that yet cannot be explicitly exhaustively enumerated. This leads to a chronically inconsistent use of labels for referring to entities (as the instance data shows, some organizations are referred to by more than 20 different labels, or spelling variants.) prompting an urgent need for better means for harmonizing the constrained-field values.
    21 
    22 One potential remedy is the use of reference datasets -- controlled vocabularies, taxonomies, ontologies and such. In fact, this is a very common approach, be it the authority files in libraries world, or domain-specific reference vocabularies maintained by practically every research community. Not as strict as schema definitions, they cannot be used for validation, but still help to harmonize the data, by offering preferred labels and identifiers for entities.
    23 
    24 In this chapter, we explore how this general approach can be employed for our specific problem of harmonizing the (literal) values in selected instance fields and mapping them to entities defined in corresponding vocabularies. This proposal is furthermore embedded in a more general effort to \textbf{express the whole of the CMD data domain (model and instances) in RDF} constituting one large ontology interlinked with existing external semantic resources (ontologies, knowledge bases, vocabularies). This result lays a foundation for providing the original dataset as a \emph{Linked Open Data} nucleus within the \emph{Web of Data}\cite{TimBL2006}
     20As described in previous chapters (\ref{ch:infra}, \ref{ch:design}), semantic interoperability is one of the main motivations for the CMD infrastructure. However, the established machinery pertains mostly to the schema level; the actual values in the fields of CMD instances remain ``just strings''. This is the case even though the problem of different labels for semantically equivalent or even identical entities is even more virulent on the instance level. While for a number of metadata fields the value domain can be enforced through schema validation, some important fields (like \concept{organization} or \concept{resource type}) have a constrained value domain that cannot yet be explicitly and exhaustively enumerated. This leads to a chronically inconsistent use of labels for referring to entities (as the instance data shows, some organizations are referred to by more than 20 different labels or spelling variants), prompting an urgent need for better means of harmonizing the constrained-field values.
     21
     22One potential remedy is the use of reference datasets -- controlled vocabularies, taxonomies, ontologies and suchlike. In fact, this is a very common approach, be it the authority files in the library world, or domain-specific reference vocabularies maintained by practically every research community. Not being as strict as schema definitions, they cannot be used for validation, but they still help to harmonize the data by offering preferred labels and identifiers for entities.
     23
     24In this chapter, we explore how this general approach can be employed for our specific problem of harmonizing the (literal) values in selected instance fields and mapping them to entities defined in corresponding vocabularies. This proposal is furthermore embedded in a more general effort to \textbf{express the whole of the CMD data domain (model and instances) in RDF} constituting one large ontology interlinked with existing external semantic resources (ontologies, knowledge bases, vocabularies). This result lays a foundation for providing the original dataset as a \emph{Linked Open Data} nucleus within the \emph{Web of Data} \cite{TimBL2006}
    2525as well as for real semantic (ontology-driven) search and exploration of the data.
    2626
    2727The following section \ref{sec:cmd2rdf} lays out how individual parts of the CMD framework can be expressed in RDF.
    28 In \ref{sec:values2entities} we investigate in further detail the abovementioned critical aspect of the effort, namely the task of translating the string values in metadata fields to corresponding semantic entities. Finally, the technical aspects of providing the resulting ontology as LOD and the implications for an ontology-driven semantic search are tackled briefly in \ref{sec:lod} and \ref{semantic-search} respectively.
     28In \ref{sec:values2entities} we investigate in further detail the abovementioned critical aspect of the effort, namely the task of translating the string values in metadata fields to corresponding semantic entities. Finally, the technical aspects of providing the resulting ontology as LOD and the implications for an ontology-driven semantic search are tackled briefly in \ref{sec:lod}.
    2929
    3030\section{CMD to RDF}
     
    3939\end{itemize}
    4040
    41 \subsection{CMD specification}
     41\subsection{CMD Specification}
    4242
    4343The main entity of the meta model is the CMD component, typed as a specialization of \code{rdfs:Class}. A CMD profile is basically a CMD component with some extra features, implying a specialization relation. It would be natural to translate a CMD element to an RDF property, but it needs to be a class, as a CMD element -- next to its value -- can also have attributes. This further implies a property \code{ElementValue} to express the actual value of a given CMD element.
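The typing scheme just described can be sketched as a handful of RDF triples, represented here as plain tuples; the prefix and term names (\code{cmd:Component} etc.) are illustrative stand-ins mirroring the prose, not the actual vocabulary.

```python
# The CMD meta model as (subject, predicate, object) triples; term names
# are hypothetical stand-ins for the entities described in the text.
META_MODEL = [
    ("cmd:Component",    "rdfs:subClassOf", "rdfs:Class"),    # component: specialized class
    ("cmd:Profile",      "rdfs:subClassOf", "cmd:Component"), # profile: component with extras
    ("cmd:Element",      "rdfs:subClassOf", "rdfs:Class"),    # element is a class, not a property,
                                                              # because it can carry attributes
    ("cmd:ElementValue", "rdf:type",        "rdf:Property"),  # carries the element's actual value
]

def describe(subject, triples):
    """Return all (predicate, object) pairs asserted for a subject."""
    return [(p, o) for s, p, o in triples if s == subject]
```

A query like \code{describe("cmd:Profile", META\_MODEL)} then surfaces the specialization relation between profiles and components.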
     
    5454
    5555\noindent
    56 This entities are used for typing the actual profiles, components and elements (as they are defined in the Component Registry):
     56These entities are used for typing the actual profiles, components and elements (as they are defined in the Component Registry):
    5757
    5858\label{table:rdf-cmd}
     
    8080\end{example3}
    8181
     82\noindent
    8283That implies that the \code{@ConceptLink} attribute on CMD elements and components as used in the CMD profiles to reference the data category would be modelled as:
    8384
     
    8687\end{example3}
    8788
     89\noindent
    8890Encoding data categories as annotation properties is in contrast to the common approach seen with dublincore terms,
    8991which are usually used directly as data properties:
     
    9496
    9597\noindent
    96 However, we argue against direct mapping of complex data categories to data properties and in favour of modelling data categories as annotation properties, so as to avoid too strong semantic implications.\cite{Windhouwer2012_LDL}
     98However, we argue against direct mapping of complex data categories to data properties and in favour of modelling data categories as annotation properties, so as to avoid too strong semantic implications. \cite{Windhouwer2012_LDL}
    9799In a specific (OWL 2) application the relation with the data categories can be expressed as \code{owl:equivalentClass} for classes, \code{owl:equivalentProperty} for properties or \code{owl:sameAs} for individuals:
    98100
     
    104106
    105107
    106 \subsection{RELcat - Ontological relations}
    107 As described in \ref{def:rr} relations between data categories are not stored directly in the \xne{ISOcat} DCR, but rather in a dedicated module the Relation Registry \xne{RELcat}. The relations here are grouped into relation sets and stored as RDF triples\cite{SchuurmanWindhouwer2011}. A sample relation from the \xne{CMDI} relation set expressing a number of equivalences between \xne{ISOcat} data categories and \xne{dublincore} terms:
     108\subsection{RELcat - Ontological Relations}
     109As described in \ref{def:rr}, relations between data categories are not stored directly in the \xne{ISOcat} DCR, but rather in a dedicated module, the Relation Registry \xne{RELcat}. The relations here are grouped into relation sets and stored as RDF triples \cite{SchuurmanWindhouwer2011}. A sample relation from the \xne{CMDI} relation set expressing a number of equivalences between \xne{ISOcat} data categories and \xne{dublincore} terms:
    108110
    109111\begin{example3}
     
    112114
    113115\noindent
    114 By design, the relations in Relation Registry are not expressed with predicates from known vocabularies like \xne{SKOS} or \xne{OWL}, again with the aim to avoid too strong semantic implications. This leaves leeway for further specialization of the relations in specific applications. The \code{rel:*} properties can be undrestood as an upper layer of a taxonony of relation types, implying a subtyping:
     116By design, the relations in the Relation Registry are not expressed with predicates from known vocabularies like \xne{SKOS} or \xne{OWL}, again with the aim to avoid too strong semantic implications. This leaves leeway for further specialization of the relations in specific applications. The \code{rel:*} properties can be understood as an upper layer of a taxonomy of relation types, implying a subtyping:
    115117
    116118\begin{example3}
     
    120122
    121123
    122 \subsection{CMD instances}
     124\subsection{CMD Instances}
    123125In the next step, we want to express the individual CMD instances, the metadata records, making use of the previously defined entities on the schema level, but also entities from external ontologies.
    124126
    125127\subsubsection {Resource Identifier}
    126128
    127 It seems natural to use the PID of a Language Resource ( \code{<lr1>} ) as the resource identifier for the subject in the RDF representation. While this seems semantically sound, not every resource has to have a PID. (This is especially the case for ``virtual'' resources like collections, that are solely defined by their constituents and don't have any data on their own.) As a fall-back the PID of the MD record ( \code{<lr1.cmd>}  from \code{cmd:MdSelfLink} element) could be used as the resource identifier.
     129It seems natural to use the PID of a Language Resource (\code{<lr1>}) as the resource identifier for the subject in the RDF representation. While this seems semantically sound, not every resource has to have a PID. (This is especially the case for ``virtual'' resources like collections that are solely defined by their constituents and don't have any data of their own.) As a fall-back, the PID of the MD record (\code{<lr1.cmd>}, from the \code{cmd:MdSelfLink} element) could be used as the resource identifier.
    128130If identifiers are present for both resource and metadata, the relationship between the resource and the metadata record can be expressed as an annotation using the \xne{OpenAnnotation} vocabulary\furl{http://openannotation.org/spec/core/core.html\#Motivations}.
    129 (Note also, that one MD record can describe multiple resources, this can be also easily accomodated in OpenAnnotation):
     131(Note also that one MD record can describe multiple resources; this, too, can easily be accommodated in OpenAnnotation):
    130132
    131133\begin{example3}
     
    202204
    203205%%%%%%%%%%%%%%%%%
    204 \section{Mapping field values to semantic entities}
     206\section{Mapping Field Values to Semantic Entities}
    205207\label{sec:values2entities}
    206208
     
    232234\end{example3}
    233235
    234 However for the needs of the mapping task we propose to reduce and rewrite to retrieve distinct concept , value pairs (cf. figure \ref{fig:smc_cmd2lod}):
     236However, for the needs of the mapping task, we propose to reduce and rewrite it to retrieve distinct (concept, value) pairs (cf. figure \ref{fig:smc_cmd2lod}):
    235237
    236238\begin{example3}
     
    239241\end{example3}
    240242
    241 \var{lookup} function is a customized version of the \var(map) function, that operates on this information pairs (concept, label).
     243The \var{lookup} function is a customized version of the \var{map} function that operates on these information pairs (concept, label).
    242244
    243245The two steps \var{lookup} and \var{assess} correspond exactly to the two steps in \cite{jimenez2012large} in their system \xne{LogMap2}: 1) computation of mapping candidates (maximize recall) and 2) assessment of the candidates (maximize precision).
     
    252254\subsubsection{Identify vocabularies}
    253255
    254 One generic way to indicate vocabularies for given metadata fields or data categories being discussed in the CMD community is to use dedicated annotation property in the schema or data category definition (tentatively labeled \code{@clavas:vocabulary}) . For such a mechanism to work, the consuming applications (like metadata editor) need to be made aware of this convention and interpret it accordingly.
    255 
    256 The primary provider of relevant vocabularies is \xne{ISOcat} and \xne{CLAVAS} – a service for managing and providing vocabularies in SKOS format (cf. \ref{def:CLAVAS}). Closed and corresponding simple data categories are already being exported from ISOcat in SKOS format and imported into CLAVAS/OpenSKOS and also other relevant vocabularies shall be ingested into this system, so that we can assume OpenSKOS as a first source of vocabularies. However definitely not all of the existing reference data will be hosted by OpenSKOS, so in general we have to assume/consider a number of different sources (cf. \ref{refdata}).
     256One generic way to indicate vocabularies for given metadata fields or data categories, being discussed in the CMD community, is to use a dedicated annotation property in the schema or data category definition (tentatively labelled \code{@clavas:vocabulary}). For such a mechanism to work, the consuming applications (like a metadata editor) need to be made aware of this convention and interpret it accordingly.
     257
     258The primary providers of relevant vocabularies are \xne{ISOcat} and \xne{CLAVAS}, a service for managing and providing vocabularies in SKOS format (cf. \ref{def:CLAVAS}). Closed and corresponding simple data categories are already being exported from ISOcat in SKOS format and imported into CLAVAS/OpenSKOS, and other relevant vocabularies shall also be ingested into this system, so that we can assume OpenSKOS as a first source of vocabularies. However, definitely not all of the existing reference data will be hosted by OpenSKOS, so in general we have to assume/consider a number of different sources (cf. \ref{refdata}).
    257259
    258260Data in OpenSKOS is modelled purely in SKOS, so there is no more specific typing of the entities in the vocabularies, but rather all the entities are \code{skos:Concepts}:
     
    280282In abstract terms, the lookup function takes as input the identifier of a data category (or CMD element) and a literal string value, and returns a list of potentially matching entities. Before the actual lookup, some string-normalizing preprocessing may be necessary.
    281283
    282 \begin{definition}{signature of the lookup function}
     284\begin{definition}{Signature of the lookup function}
    283285lookup \ ( \ DataCategory \ ,  \ Literal \ )  \quad \mapsto \quad ( \ Concept \ | \ Entity \ )*
    284286\end{definition}
    285287
    286 In the implementation there needs to be additional initial configuration input, identifying datasets for given data categories,
     288In the implementation, there needs to be additional initial configuration input, identifying datasets for given data categories,
    287289which will be the result of the previous step.
    288290
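A minimal sketch of this lookup step, assuming a simple in-memory configuration that maps data categories to vocabularies; all names and sample entries below are hypothetical:

```python
import unicodedata

# Hypothetical configuration: data category -> vocabulary (label -> entity URI).
VOCABULARIES = {
    "organization": {
        "austrian academy of sciences": "http://example.org/entity/aas-at",
        "academy of sciences": "http://example.org/entity/aas-generic",
    },
}

def normalize(literal: str) -> str:
    """String-normalizing preprocessing applied before the actual lookup."""
    s = unicodedata.normalize("NFKC", literal)
    return " ".join(s.lower().split())

def lookup(data_category: str, literal: str) -> list:
    """lookup(DataCategory, Literal) -> (Concept | Entity)* -- candidate generation."""
    vocab = VOCABULARIES.get(data_category, {})
    needle = normalize(literal)
    # keep recall high: any label containing (or contained in) the value matches
    return [uri for label, uri in vocab.items()
            if needle in label or label in needle]
```

A call like \code{lookup("organization", "Academy of Sciences")} returns both candidate URIs, reflecting that this step maximizes recall and defers disambiguation to the assessment step.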
     
    303305The lookup is the most sensitive step in the process, as that is the gate between strings and semantic entities. In general, the resulting candidates cannot be seen as reliable and should undergo further scrutiny to ensure that the match is semantically correct.
    304306
    305 One example: A lookup with the pair \code{<organization, "Academy of sciences">} would probably return a list of organizations, as there is a national Academy of Sciences, in a number of countries. It would require further heuristics, e.g. checking the corresponding department, contact or -- less reliably -- the language of the described resource, to determine which specific Academy of Sciences is meant in given resource description.
    306 
    307 In some situation this ambiguities can be resolved algorithmically, but in the end in many cases it will require human curation of the generated data. In this respect, it is worth to note, that the CLARIN search engine VLO provides a feedback link, that allows even the normal user to report on problems or inconsistencies in CMD records.
     307One example: A lookup with the pair \code{<organization, "Academy of sciences">} would probably return a list of organizations, as there is a national Academy of Sciences in a number of countries. It would require further heuristics, e.g. checking the corresponding department, contact or -- less reliably -- the language of the described resource, to determine which specific Academy of Sciences is meant in a given resource description.
     308
     309In some situations these ambiguities can be resolved algorithmically, but in many cases human curation of the generated data will be required in the end. In this respect, it is worth noting that the CLARIN search engine VLO provides a feedback link that allows even the normal user to report problems or inconsistencies in CMD records.
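One algorithmic heuristic of the kind mentioned above could filter ambiguous candidates by the country found in the record's contact or department information. This is only a sketch under invented data structures, not the actual curation workflow:

```python
# Invented candidate records; in practice these would come from the lookup step.
candidates = [
    {"uri": "http://example.org/AcademyOfSciences-AT", "country": "AT"},
    {"uri": "http://example.org/AcademyOfSciences-CZ", "country": "CZ"},
]

def disambiguate(candidates: list[dict], record_country: str) -> list[dict]:
    """Keep candidates whose country matches the record; if none match,
    fall back to the full (still ambiguous) candidate list for human curation."""
    filtered = [c for c in candidates if c["country"] == record_country]
    return filtered or candidates
```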
    308310
    309311
     
    317319
    318320The technical base for a semantic web application is usually an RDF triple-store, as discussed in \ref{semweb-tech}.
    319 Given that our main concern is the data itself, their processing and display, we want to rely on stable, robust feature rich solution minimizing the effort to provide the data online. The most promising solution seems to be \xne{Virtuoso}, a integrated feature-rich hybrid data store, able to deal with different types of data (``Universal Data Store'').
    320 
    321 
    322 Although the distributed nature of the data is one of the defining features of LOD and theoretically one should be able to follow the data by dereferencable URIs, in practice it is mostly necessary to pool into one data store linked datasets from different sources that shall be queried together due to performance reasons. This implies that the data to be kept by the data store will be decisively larger, than ``just'' the original dataset.
     321Given that our main concern is the data themselves, their processing and display, we want to rely on a stable, robust, feature-rich solution minimizing the effort to provide the data online. The most promising solution seems to be \xne{Virtuoso}, an integrated feature-rich hybrid data store, able to deal with different types of data (``Universal Data Store'').
     322
     323
     324Although the distributed nature of the data is one of the defining features of LOD and theoretically one should be able to follow the data by dereferencable URIs, in practice it is mostly necessary to pool into one data store linked datasets from different sources that shall be queried together due to performance reasons. This implies that the data to be kept by the data store will be decisively larger than ``just'' the original dataset.
    323325
    324326\section{Summary}
  • SMC4LRT/chapters/Design_SMCschema.tex

    r3776 r4117  
    11
    2 \chapter{System design -- concept-based mapping on schema level}
     2\chapter{System Design -- Concept-based Mapping on Schema Level}
    33\label{ch:design}
    44
     
    66
    77We start by drawing an overall view of the system, introducing its individual components and the dependencies among them.
    8 In the next section, the internal data model is presented and explained. In section \ref{sec:cx} the design of the actual main service for serving crosswalks is described, divided into the interface specification and notes on the actual implementation. In section \ref{sec:qx} we elaborate on a search functionality that builds upon the aforementioned service in terms of appropriate query language, a search engine to integrate the search in and the peculiarities of the user interface that could support this enhanced search possibilities. Finally, in section \ref{smc-browser} an advanced interactive user interface for exploring the CMD data domain is proposed.
     8In the next section, the internal data model is presented and explained. In section \ref{sec:cx}, the design of the actual main service for serving crosswalks is described, divided into the interface specification and notes on the actual implementation. In section \ref{sec:qx}, we elaborate on a search functionality that builds upon the aforementioned service, in terms of an appropriate query language, a search engine to integrate the search into, and the peculiarities of a user interface that could support these enhanced search possibilities. Finally, in section \ref{smc-browser}, an advanced interactive user interface for exploring the CMD data domain is proposed.
    99
    1010\section{System Architecture}
     
    1414\begin{figure*}
    1515\includegraphics[width=0.8\textwidth]{images/SMC_modules.png}
    16 \caption{The component view on the SMC - modules and their inter-dependencies}
     16\caption{The component view on the SMC - modules and their interdependencies}
    1717\label{fig:smc_modules}
    1818\end{figure*}
     
    3131The component diagram in \ref{fig:smc_modules} depicts the dependencies between the components of the system. The \xne{crosswalk service} uses the set of XSL-stylesheets \xne{smc-xsl} and accesses the CMDI registries: \xne{Component Registry}, \xne{ISOcat DCR} and \xne{RELcat} to retrieve the data. It exposes an interface \xne{cx} to be used by third party applications. The \xne{query expansion} module uses the crosswalk service to rewrite queries, also exposing a corresponding API \xne{qx}.
    3232
    33 \xne{SMC Browser} consists of two parts the \xne{smc-stats} and \xne{smc-graph} and also uses the set of stylesheets for processing the data. \xne{smc-graph} is build on top of a library for interactive visualization of graphs.
     33\xne{SMC Browser} consists of two parts, the \xne{smc-stats} and \xne{smc-graph}, and also uses the set of stylesheets for processing the data. \xne{smc-graph} is built on top of a library for interactive visualization of graphs.
    3434
    3535For broader context see the reference architecture diagram in Figure \ref{fig:ref_arch}.
    3636
    37 \section{Data model}
     37\section{Data Model}
    3838
    3939Before we get to the definition of the actual service, we define the internal data model, divided into two parts:
     
    4747In this section, we describe \var{smcIndex} -- the data type to denote indexes used by the components of the system internally, as well as input and output on the interfaces.
    4848
    49 An \var{smcIndex} is a human-readable string adhering to a specific syntax, denoting a search index. The syntax is based on two main ideas drawn from existing work: a) denoting a context by a prefix is derived from the way indices are referenced in CQL-syntax\footnote{Context Query Language, \url{http://www.loc.gov/standards/sru/specs/cql.html}} (analogous to the XML-namespace mechanism, cf. \ref{cql}), e.g. \concept{dc.title} and b) on the dot-notation used in IMDI-browser\footnote{\url{http://www.lat-mpi.eu/tools/imdi}} to denote paths into structured data (analogous to XPath), e.g. \concept{Session.Location.Country}. The grammar generates only single terms, that may not contain whitespaces.
     49An \var{smcIndex} is a human-readable string adhering to a specific syntax, denoting a search index. The syntax is based on two main ideas drawn from existing work: a) denoting a context by a prefix, derived from the way indices are referenced in CQL-syntax\footnote{Context Query Language, \url{http://www.loc.gov/standards/sru/specs/cql.html}} (analogous to the XML-namespace mechanism, cf. \ref{cql}), e.g. \concept{dc.title}, and b) the dot-notation used in the IMDI-browser\footnote{\url{http://www.lat-mpi.eu/tools/imdi}} to denote paths into structured data (analogous to XPath), e.g. \concept{Session.Location.Country}. The grammar generates only single terms that may not contain whitespace.
    5050
    5151\begin{defcap}
     
    7373It is important to note that in general an \var{smcIndex} can be ambiguous, meaning it can refer to multiple concepts or CMD entities. This is due to the fact that the labels of the data categories and CMD entities are not guaranteed unique.
    7474Although it may seem problematic and undesirable to have an ambiguous reference, this is an intentional design decision. The labels are needed for human-readability and ambiguity can be useful, as long as one is aware of it.
    75 However there needs to be also the possibility to refer to data categories or CMD entities unambiguously. Therefore, the syntax also allows to reference indexes by the corresponding identifier. Following are some explanations to the individual constituents of the grammar:
    76 
    77 \var{dcrID} is a shortcut referring to a data category registry. Next to \xne{ISOcat}, other registries can function as a DCR, in particular, the \xne{dublincore} set of metadata terms. \var{datcatLabel} is the human-readable name of given data category (e.g. \concept{telephoneNumber}). In the case of \xne{ISOcat} data categories the verbose descriptor \code{mnemonicIdentifier} is used. However despite its name, it is not guaranteed unique. Therefore, \var{datcatID} has to be used if a data category shall be referenced unambiguously. For \xne{dublincore} terms no such distinct identifier and label exist, the concepts are denoted by the lexical term itself, which is unique within the \concept{dublincore} namespace.
     75However, there also needs to be the possibility to refer to data categories or CMD entities unambiguously. Therefore, the syntax also allows referencing indexes by the corresponding identifier. Following are some explanations of the individual constituents of the grammar:
     76
     77\var{dcrID} is a shortcut referring to a data category registry. Next to \xne{ISOcat}, other registries can function as a DCR, in particular the \xne{dublincore} set of metadata terms. \var{datcatLabel} is the human-readable name of a given data category (e.g. \concept{telephoneNumber}). In the case of \xne{ISOcat} data categories, the verbose descriptor \code{mnemonicIdentifier} is used. However, despite its name, it is not guaranteed unique. Therefore, \var{datcatID} has to be used if a data category shall be referenced unambiguously. For \xne{dublincore} terms no such distinct identifier and label exist; the concepts are denoted by the lexical term itself, which is unique within the \concept{dublincore} namespace.
    7878
    7979\var{profile} is a reference to a CMD profile. Again, it can be either the name of the profile \var{profileName} or -- for guaranteed unambiguous reference -- its identifier \var{profileId} as issued by the Component Registry (e.g. \var{clarin.eu:cr1:p\_1272022528363} for \concept{LexicalResourceProfile}). Even if a profile is referenced by its identifier, it may and should be prefixed by its name to still ensure human-readability. Or, seen the other way round, the name is disambiguated by suffixing it with the identifier:
     
    8585
    8686%\noindent
    87 \var{dotPath} allows to address a leaf element (\concept{Session.Actor.Role}), or any intermediary XML element corresponding to a CMD component (\concept{Session.Actor}) within a metadata description. This allows to easily express search in whole components, instead of having to list all individual fields. The paths don't need to start from the root entity (the profile), they can reference any subtree structure. However longer paths are often needed for more specific references, e.g. instead of \concept{Name} one could say \concept{Actor.Name} vs. \concept{Project.Name} or even \concept{Session.Actor.Name} vs. \concept{Drama.Actor.Name}. Still this mechanism does not guarantee unique references, it only allows to specify context and thus narrow down the semantic ambiguity.
     87\var{dotPath} allows addressing a leaf element (\concept{Session.Actor.Role}), or any intermediary XML element corresponding to a CMD component (\concept{Session.Actor}), within a metadata description. This makes it easy to express search in whole components, instead of having to list all individual fields. The paths do not need to start from the root entity (the profile); they can reference any subtree structure. However, longer paths are often needed for more specific references, e.g. instead of \concept{Name} one could say \concept{Actor.Name} vs. \concept{Project.Name} or even \concept{Session.Actor.Name} vs. \concept{Drama.Actor.Name}. Still, this mechanism does not guarantee unique references; it only allows specifying a context and thus narrowing down the semantic ambiguity.
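The dot-path semantics described above can be sketched as a simple suffix match over path steps; the matching rule and the sample paths are illustrative assumptions, not the actual implementation:

```python
def matches(smc_index: str, element_path: str) -> bool:
    """A dotPath matches an element path when its steps form a contiguous
    trailing segment of the path -- i.e. paths need not start at the root."""
    index_steps = smc_index.split(".")
    path_steps = element_path.split(".")
    return path_steps[-len(index_steps):] == index_steps

# Invented element paths from hypothetical profiles.
paths = ["Session.Actor.Name", "Session.Project.Name", "Drama.Actor.Name"]
hits = [p for p in paths if matches("Actor.Name", p)]
```

The longer the path given, the fewer candidate elements match, illustrating how context narrows down (but does not eliminate) ambiguity.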
    8888
    8989\subsection{Terms}
     
    9595\subsubsection{Type \code{Term}}
    9696
    97 \code{Term} is a polymorph data type, that can have different sets of attributes depending on the type of data it represents.
     97\code{Term} is a polymorphic data type that can have different sets of attributes depending on the type of data it represents.
    9898
    9999\begin{table}[h]
    100 \caption{Attributes of \code{Term} when encoding data category}
     100\caption{Attributes of \code{Term} when encoding data category (enclosed in \code{Concept})}
    101101\label{table:terms-attributes-datcat}
    102102 \begin{tabu}{ p{0.1\textwidth} p{0.4\textwidth} >{\footnotesize}X }
     
    104104\rowfont{\itshape\small}   attribute & allowed values & sample value\\
    105105\hline
    106   \var{concept-id} &  PID given by DCR  & \code{isocat:DC-2522} \\
     106%  \var{concept-id} &  PID given by DCR  & \code{isocat:DC-2522} \\
    107107  \var{set} & identifier of the DCR \emph{dcrID}  & \code{isocat} \\
    108108  \var{type} &  one of ['id', 'label', 'mnemonic'] & \code{id}, \code{label}\\
     
    223223
    224224\subsubsection{Type \code{Relation}}
    225 As explained in \ref{def:rr}, the framework allows to express relations between concepts or data categories. These are maintained in the Relation Registry and fetched from there by SMC upon initialization. Type \code{Relation} is the internal representation of this information. It has attribute \var{type} indicating the type of the relation as delivered by RR (currently only \code{sameAs}).  The relations of one relation set are enclosed in \code{Termset} element carrying the identifier of the relation set. The content of \code{Relation} is a sequence of at least two \code{Concepts}. Currently, it is always exactly two \code{Concepts} corresponding to the pairs delivered from RR, but by traversing the equivalence relation concept clusters (or ``cliques'') could be generated, that contain more than two equivalent concepts.
     225As explained in \ref{def:rr}, the framework allows expressing relations between concepts or data categories. These are maintained in the Relation Registry and fetched from there by SMC upon initialization. Type \code{Relation} is the internal representation of this information. It has an attribute \var{type} indicating the type of the relation as delivered by RR (currently only \code{sameAs}). The relations of one relation set are enclosed in a \code{Termset} element carrying the identifier of the relation set. The content of \code{Relation} is a sequence of at least two \code{Concepts}. Currently, it is always exactly two \code{Concepts} corresponding to the pairs delivered from RR, but by traversing the equivalence relation, concept clusters (or ``cliques'') could be generated that contain more than two equivalent concepts.
    226226
    227227% role="about"
     
    261261
    262262%%%%%%%%%%%%%%%%%%%%%%
    263 \section{cx -- crosswalk service}
     263\section{cx -- Crosswalk Service}
    264264\label{sec:cx}
    265265
    266 The crosswalk service offers the functionality, that was understood under the term \textit{Semantic Mapping} as conceived in the original plans of the Component Metadata Infrastructure. Semantic interoperability has been one of the main concerns addressed by the CMDI and appropriate provisions were weaved into the underlying meta-model as well as all the modules of the infrastructure.
     266The crosswalk service offers the functionality that was understood under the term \textit{Semantic Mapping} as conceived in the original plans of the Component Metadata Infrastructure. Semantic interoperability has been one of the main concerns addressed by the CMDI and appropriate provisions were weaved into the underlying meta-model as well as all the modules of the infrastructure.
    267267Consequently, the infrastructure has also foreseen this dedicated module, \emph{Semantic Mapping}, that exploits this mechanism to find \textbf{corresponding fields in different metadata schemas}.
    268268
    269269The task of the crosswalk service is to collect the relevant information maintained in the registries of the infrastructure and process it to generate the mappings, or \textbf{crosswalks}, between fields in heterogeneous metadata schemas. These crosswalks can be used by other applications, representing the base for concept-based search in the heterogeneous data collection of the joint CLARIN metadata domain (cf. \ref{sec:qx}).
    270270
    271 The core means for semantic interoperability in CMDI are the \emph{data categories} (cf. \ref{def:DCR}), well-defined atomic concepts, that are supposed to be referenced in schemas by annotating fields to unambiguously indicate their intended semantics. Drawing upon this system, the crosswalks are not generated directly between the fields of individual schemas by some kind of matching algorithm (cf. \ref{lit:schema-matching}), but rather the data categories are used as reliable bridges for translation. This results in clusters of semantically equivalent metadata fields (with data categories serving as pivotal points) instead of a collection of pair-wise links between fields.
     271The core means for semantic interoperability in CMDI are the \emph{data categories} (cf. \ref{def:DCR}), well-defined atomic concepts that are supposed to be referenced in schemas by annotating fields to unambiguously indicate their intended semantics. Drawing upon this system, the crosswalks are not generated directly between the fields of individual schemas by some kind of matching algorithm (cf. \ref{lit:schema-matching}), but rather the data categories are used as reliable bridges for translation. This results in clusters of semantically equivalent metadata fields (with data categories serving as pivotal points) instead of a collection of pair-wise links between fields.
    272272
    273273\subsection{Interface Specification}
     
    455455The documentation of the XSLT stylesheets and the build process is found in appendix \ref{sec:smc-xsl-docs}.
    456456
    457 The service is implemented as a RESTful service, however only supporting the GET operation, as it operates on a data set, that the users cannot change directly. (The changes have to be performed in the upstream registries.)
     457The service is implemented as a RESTful service; however, it supports only the GET operation, as it operates on a data set that the users cannot change directly. (The changes have to be performed in the upstream registries.)
    458458
    459459
     
    479479\item[\xne{termsets}] a list of all available Termsets compiled from the CMD profiles and available DCRs; for \xne{ISOcat} a termset is generated for every available language
    480480\item[\xne{cmd-terms}] a flat list of \code{Term} elements representing all components and elements in all known profiles; grouped in \code{Termset} elements representing the profiles
    481 \item[\xne{cmd-terms-nested}] as above, however the \code{Term} elements are nested reflecting the component structure in the profile
     481\item[\xne{cmd-terms-nested}] as above, however, the \code{Term} elements are nested reflecting the component structure in the profile
    482482\item[\xne{dcr-terms}] a list of \code{Concept} elements representing the data categories with nested \code{Term} elements encoding its properties (\code{id, label}
    483483\item[\xne{dcr-cmd-map}] the main inverted index -- a list of concepts as in \xne{dcr-terms}, but with additional \code{Term} elements included in the \code{Concept} elements representing the CMD components or elements corresponding to given data category (cf. listing \ref{lst:dcr-cmd-map})
    484 \item[\xne{rr-terms}] Additional index generated based on the relations between data categories as defined in the Relation Registry; the \code{Concept} elements representing the pair of related data categories are wrapped with a \code{Relation} element (with a \code{@type} attribute
     484\item[\xne{rr-terms}] Additional index generated based on the relations between data categories as defined in the Relation Registry; the \code{Concept} elements representing the pair of related data categories are wrapped with a \code{Relation} element (with a \code{@type} attribute).
    485485\end{description}
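The derivation of the \xne{dcr-cmd-map} inverted index can be illustrated with a small sketch: element-to-data-category annotations are inverted into a concept-to-elements map. The element paths and data category identifiers below are invented for illustration:

```python
# Hypothetical annotations: CMD element path -> referenced data category.
annotations = {
    "TextCorpusProfile.Corpus.Language": "isocat:DC-2482",
    "LexicalResourceProfile.Lexicon.Language": "isocat:DC-2482",
    "SessionProfile.Actor.Role": "isocat:DC-2502",
}

# Invert into the dcr-cmd-map style index: concept -> list of CMD elements.
dcr_cmd_map: dict[str, list[str]] = {}
for element, datcat in annotations.items():
    dcr_cmd_map.setdefault(datcat, []).append(element)
```

The data categories thus serve as pivotal points clustering semantically equivalent fields, as described for the crosswalk service above.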
    486486
    487487\subsubsection{Operation}
    488 For the actual service operation a minimal application has been implemented, that accesses the cached internal datasets and optionally applies XSL stylesheets for post-processing depending on requested format.
     488For the actual service operation, a minimal application has been implemented that accesses the cached internal datasets and optionally applies XSL stylesheets for post-processing depending on the requested format.
    489490The application implements the interface as defined in \ref{def:cx-interface} as an XQuery module based on the \xne{restxq} library within an \xne{eXist} XML database.
    490490
     
    495495Also, the use of relations \emph{other than equivalence} will necessitate more complex logic in the query expansion and accordingly also a more complex response of the crosswalk service, either returning the relation types themselves as well or equipping the list of indexes with some kind of similarity ratio.
    496496
    497 \section{qx -- concept-based search}
     497\section{qx -- Concept-based Search}
    498498\label{sec:qx}
    499499To recall, the main goal of this work is to enhance the search capabilities of the search engines serving the metadata.
    500 In this section we want to explore how this shall be accomplished, i.e. how to bring the enhanced capabilities to the user.
     500In this section, we want to explore how this shall be accomplished, i.e. how to bring the enhanced capabilities to the user.
    501501
    502502The emphasis lies on the query language and the corresponding query input interface. A crucial aspect is the question of how to integrate the additional processing, i.e. how to deal with the even greater amount of information in a user-friendly way without overwhelming the user, while still being verbose about the applied processing on demand, so that the user can understand how the result came about and, even more importantly, manipulate the processing easily.
    503503
    504 Note, that this chapter deals only with the schema level, i.e. the expansion here pertains only to the indexes to be searched in, not to the search terms. The instance level is dealt with in \ref{semantic-search}.
     504Note that this chapter deals only with the schema level, i.e. the expansion here pertains only to the indexes to be searched in, not to the search terms. The instance level is tackled in \ref{sec:values2entities} (and even there only superficially).
    505505
    506506Note also that \emph{query expansion} needs to be distinguished from \emph{query translation}, the task of expressing an input query in another query language (e.g. a CQL query expressed as XPath).
     
    509509\label{cql}
    510510As the base query language to build upon, the \emph{Context Query Language} (CQL) is used, a well-established standard designed with extensibility in mind.
    511 CQL is the query language defined as part of \xne{SRU/CQL} -- the communication protocol introduced by the Library of Congress. SRU is a simplified, XML- and HTTP-based successor to Z39.50\cite{Lynch1991}, which is very widely spread in the library networks.
    512 It was introduced 2002 \cite{Morgan04}. The maintenance of SRU/CQL has been
    513 transfered from LoC to OASIS in 2012, and OASIS released a first version of the protocol as Committee Specification in April 2012\cite{OASIS2012sru}.)
     511CQL is the query language defined as part of \xne{SRU/CQL} -- the communication protocol introduced by the Library of Congress. SRU is a simplified, XML- and HTTP-based successor to Z39.50 \cite{Lynch1991}, which is very widely spread in the library networks.
     512It was introduced in 2002 \cite{Morgan04}. The maintenance of SRU/CQL has been
     513transferred from LoC to OASIS in 2012, and OASIS released a first version of the protocol as Committee Specification in April 2012 \cite{OASIS2012sru}.)
    514514
    515515Coming from the libraries world, the protocol has a certain bias in favor of bibliographic metadata.
     
    525525The query language part (CQL - Context Query Language) defines a relatively complex and complete query language.
    526526The decisive feature of the query language is its inherent extensibility, allowing the definition of own indexes and operators.
    527 In particular, CQL introduces so-called \emph{context sets} -- a kind of application profiles that allow to define new indexes or even comparison operators in own namespaces. This feature can be employed to integrate the dynamic indexes adhering to the \var{smcIndex} syntax as proposed in \ref{def:smcIndex}.
     527In particular, CQL introduces the so-called \emph{context sets} -- a kind of application profile that allows defining new indexes or even comparison operators in own namespaces. This feature can be employed to integrate the dynamic indexes adhering to the \var{smcIndex} syntax as proposed in \ref{def:smcIndex}.
    528528
    529529The SRU/CQL protocol has also been adopted by the CLARIN community as the base for a protocol for federated content search\furl{http://clarin.eu/fcs} (FCS) \cite{stehouwer2012fcs}, which is another argument to use this protocol for metadata search as well, given the inherent interrelation between metadata and content search.
     
    541541
    542542%\begin{note}
    543 Alternatively to the -- potentially costly -- on the fly expansion, the concept-based equivalence clusters could be applied already during the indexing of the data. That means that ``virtual'' search indexes are defined for individual data categories in which values from all metadata fields annotated with given data category are indexed. Indeed, this approach is already being applied in the search applications VLO and Meertens Institute Search Engine (cf. \ref{cmdi_exploitation}).
     543As an alternative to the -- potentially costly -- on-the-fly expansion, the concept-based equivalence clusters could be applied already during the indexing of the data. That means that ``virtual'' search indexes are defined for individual data categories, in which values from all metadata fields annotated with a given data category are indexed. Indeed, this approach is already being applied in the search applications VLO and Meertens Institute Search Engine (cf. \ref{cmdi_exploitation}).
    544544%\end{note}
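The concept-based expansion of a search index can be sketched as follows: an \var{smcIndex} referring to a data category is rewritten into a disjunction over all metadata fields annotated with that concept. The crosswalk mapping and the rewritten query syntax are illustrative assumptions, not the actual FCS implementation:

```python
# Invented crosswalk: data category -> annotated metadata fields.
crosswalk = {
    "isocat:DC-2482": ["Corpus.Language", "Lexicon.Language"],
}

def expand(index: str, term: str) -> str:
    """Rewrite one searchClause: expand a concept index into an OR over the
    corresponding fields; unknown indexes are passed through unchanged."""
    fields = crosswalk.get(index, [index])
    return " or ".join(f'{f} = "{term}"' for f in fields)
```

This is the schema-level half of the processing; the pre-indexing alternative mentioned above would bake the same clusters into ``virtual'' indexes instead.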
    545545
    546 \subsection{SMC as module for Metadata Repository}
     546\subsection{SMC as Module for Metadata Repository}
    547547
    548548As a concrete proof of concept, the functionality of SMC has been integrated into the Metadata Repository, another module of the CMDI providing all the metadata records harvested within the CLARIN joint metadata domain (cf. \ref{cmdi_exploitation}).
    549549
    550 Metadata repository itself is implemented as custom project within \xne{cr-xq}, a generic web application developed in XQuery running within the eXist XML-database. \xne{cr-xq} is developed by the author as part of a larger publication framework \xne{corpus\_shell}. As can be seen in figure \ref{fig:modules-mdrepo} within \xne{cr-xq} the crosswalk service -- implemented as the \xne{smc-xq}  module -- is used by the search module \xne{fcs}, which is in turn used by the \xne{query\_input} module, that provides a user interface widget for formulating the query.
     550Metadata Repository itself is implemented as a custom project within \xne{cr-xq}, a generic web application developed in XQuery running within the eXist XML database. \xne{cr-xq} is developed by the author as part of a larger publication framework, \xne{corpus\_shell}. As can be seen in figure \ref{fig:modules-mdrepo}, within \xne{cr-xq} the crosswalk service -- implemented as the \xne{smc-xq} module -- is used by the search module \xne{fcs}, which is in turn used by the \xne{query\_input} module that provides a user interface widget for formulating the query.
    551551
    552552\begin{figure*}
    553553\begin{center}
    554554\includegraphics[width=0.8\textwidth]{images/modules_mdrepo-smc.png}
    555 \caption{The component view on the SMC - modules and their inter-dependencies}
     555\caption{The component diagram of the integration of SMC as module within the Metadata Repository}
    556556\label{fig:modules-mdrepo}
    557557\end{center}
     
    561561\subsection{User Interface}
    562562
    563 A starting point for our considerations is the traditional structure found in many (``advanced'') search interfaces, which is basically a an array of tuples of index, comparison operator, terms combined by a boolean operator. This is reflected in the CQL syntax with the basic \var{searchClause} and the boolean operators to formulate more complex queries.
     563A starting point for our considerations is the traditional structure found in many (``advanced'') search interfaces, which is basically an array of tuples of index, comparison operator and term, combined by a boolean operator. This is reflected in the CQL syntax with the basic \var{searchClause} and the boolean operators to formulate more complex queries.
    564564\begin{definition}{Generic data format for structured queries}
    565565 < index, operation, term, boolean >+
     
    581581
    582582\noindent
    583 Using data categories from ISOcat as search indexes brings about -- next to solid semantic grounding -- the advantage of multilingual labels and descriptions/definitions.
    584 Although we concentrate on query input, the use of indexes has to be consistent across the user interface, be it in labeling the fields of the results, or when providing facets to drill down the search.
    585 
    586 A fundamentally different approach is the "content first" paradigm, that, similiar to the notorious simple search fields found in general search engines, provides suggestions via autocompletion on the fly, when the user starts typing any string. The difference is, that the suggestions are typed, so that the user is informed from which index given term comes (\concept{person}, \concept{place}, etc.)
    587 
    588 Combining the two approaches, we could arrive at a ``smart'' widget a input field with on the fly query parsing and contextual autocomplete. Though even such a widget would still share the underlying data model of \xne{CQL} in combination with \var{smcIndexes}.
     583Using data categories from ISOcat as search indexes brings about -- next to solid semantic grounding -- the advantage of multilingual labels and descriptions/definitions. Although we concentrate on query input, the use of indexes has to be consistent across the user interface, be it in labelling the fields of the results, or when providing facets to drill down the search.
     584
     585A fundamentally different approach is the ``content first'' paradigm that, similar to the notorious simple search fields found in general search engines, provides suggestions via autocompletion on the fly when the user starts typing any string. The difference is that the suggestions are typed, so that the user is informed from which index a given term comes (\concept{person}, \concept{place}, etc.).
     586
     587Combining the two approaches, we could arrive at a ``smart'' widget consisting of one input field with on-the-fly query parsing and contextual autocomplete. However, even such a widget would still share the underlying data model of \xne{CQL} in combination with \var{smcIndexes}.
    589588
    590589
     
    595594As the CMD dataset keeps growing both in numbers and in complexity, the call from the community to provide enhanced ways for its exploration gets stronger.  In the following, some design considerations for an application to answer this need are proposed.
    596595
    597 While the Component Registry (cf. \ref{def:CR}) allows to browse, search and view existing profiles and components, it is not possible to easily find out, which components are reused in which profiles and also which data categories are referenced by which elements. However this kind of information is crucial during profile creation as well as for curation of the existing profiles, as it enables the data modeller to recognize a) which components and data categories are those most often used, indicating their adoption and popularity within the community and b) the thematic contexts in which individual components are used, providing a hint about their appropriateness for given research data.
      596While the Component Registry (cf. \ref{def:CR}) allows users to browse, search and view existing profiles and components, it is not possible to easily find out which components are reused in which profiles, or which data categories are referenced by which elements. However, this kind of information is crucial during profile creation as well as for the curation of existing profiles, as it enables the data modeller to recognize a) which components and data categories are used most often, indicating their adoption and popularity within the community, and b) the thematic contexts in which individual components are used, providing a hint about their appropriateness for given research data.
    598597
    599598\subsection{Design}
     
    615614
    616615\subsubsection{Requirements}
    617 Given the size of the data set (currently more than 4.000 nodes and growing) it is obvious, that it is not possible to overview the whole of the graph in one view. Thus, a general essential requirement is to be able to select and view subgraphs by various means.
      616Given the size of the data set (currently more than 4,000 nodes and growing), it is obvious that the whole graph cannot be surveyed in one view. Thus, a general essential requirement is to be able to select and view subgraphs by various means.
    618617
    619618In a basic scenario, a user looks for possibly reusable profiles or components, based on some common terms associated with the type of data to be described (e.g. \code{"corpus"}). If the search yields matching profiles or components, the user should be able to view the whole structure of the profiles, explore the definitions of individual components and see which data categories are being referenced for semantic grounding. Furthermore, it has to be possible to view multiple profiles concurrently, in particular to be able to see the components or data categories they share and, vice versa, in which profiles a given data category is referenced.
     
    658657\end{quotation}
    659658
    660 Especially remarkable feature is the possibility to add custom constraints, that are accomodated with the constraints imposed by the base algorithm. This enables flexible customization of the layout, still harnessing the power of the underlying layout algorithm. At the same time this is a quite challenging feature to master, as with different constraint affecting the layout algorithm, it is at times difficult to understand the impact of a specific constraint on the layout.
     659An especially remarkable feature is the possibility to add custom constraints that are reconciled with the constraints imposed by the base algorithm. This enables flexible customization of the layout while still harnessing the power of the underlying layout algorithm. At the same time, this is a quite challenging feature to master, as with different constraints affecting the layout algorithm, it is at times difficult to understand the impact of a specific constraint on the layout.
    661660
    662661\subsubsection{Data preprocessing}
    663662\label{smc-browser-data-preprocessing}
    664 The application operates on a set of static XHTML and JSON data files, that are created in a preprocessing step and deployed with the application. The preprocessing consists of a series of XSLT transformations (cf. figure \ref{fig:smc_processing}), starting from the internal datasets generated during the initialization (cf. \ref{smc_init}). The HTML output for \xne{smc-stats} is generated in two steps (\var{track S})  via an intermediate internal generic XML format for representing tabular data. The JSON data for the \xne{smc-graph} as expected by the \xne{d3} library is also generated in two steps (\var{track G}). First, a XML representation of the graph is generated from the data (\xne{terms2graph.xsl}), on which a generic XSLT-transformation is applied (\xne{graph\_json.xsl}) transforming the XML graph  into required JSON format. In fact, this track is run multiple times generating different variants of the graph, featuring different aspects of the dataset:
     663The application operates on a set of static XHTML and JSON data files that are created in a preprocessing step and deployed with the application. The preprocessing consists of a series of XSLT transformations (cf. figure \ref{fig:smc_processing}), starting from the internal datasets generated during the initialization (cf. \ref{smc_init}). The HTML output for \xne{smc-stats} is generated in two steps (\var{track S}) via an intermediate internal generic XML format for representing tabular data. The JSON data for the \xne{smc-graph} as expected by the \xne{d3} library is also generated in two steps (\var{track G}). First, an XML representation of the graph is generated from the data (\xne{terms2graph.xsl}), on which a generic XSLT transformation is applied (\xne{graph\_json.xsl}), transforming the XML graph into the required JSON format. In fact, this track is run multiple times, generating different variants of the graph, featuring different aspects of the dataset:
    665664
    666665\begin{description}
     
    677676\end{description}
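The final serialization step can be illustrated in miniature: the graph is flattened into the node-link JSON shape that the (v3-era) \xne{d3} force layout consumes, i.e. a \code{nodes} array plus a \code{links} array of source/target indices. The field names and the toy profile below are illustrative assumptions, not the actual SMC data schema.

```python
import json

# Toy term tree: (type, name, children) triples standing in for the XML
# graph produced by terms2graph.xsl. Names are invented for illustration.
profile = ("Profile", "TextCorpusProfile",
           [("Component", "GeneralInfo",
             [("Element", "title", [])])])

def to_node_link(tree):
    """Flatten a term tree into d3-style node-link JSON."""
    nodes, links = [], []
    def walk(node, parent_idx):
        ntype, name, children = node
        idx = len(nodes)
        nodes.append({"type": ntype, "name": name})
        if parent_idx is not None:
            links.append({"source": parent_idx, "target": idx})
        for child in children:
            walk(child, idx)
    walk(tree, None)
    return {"nodes": nodes, "links": links}

graph = to_node_link(profile)
print(json.dumps(graph, indent=2))
```

In the real pipeline this flattening is done by \xne{graph\_json.xsl}; the sketch only shows the target shape of the data.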
    678677
    679 Additionally, a detour pass (\var{track D}) is executed, in which the graph is also transformed into the DOT format and run through the \xne{Graphviz dot} tool to get a SVG representation of the graph. In an early stage of development, this was actually the only processing path. However soon it became obvious, that the graph is getting to huge to be displayed in its entirety. Figure \ref{fig:cmd-dep-dotgraph} displays an old version of such a dot generated graph visualization. Currently, the \xne{dot} output is only used as input for the final graph data, providing initialization coordinates for the nodes in the \code{dot}-layout.
    680 
    681 To The graph is constructed from all profiles defined in the Component Registry and related datasets. To resolve (multilingual) name and description of data categories referenced in the CMD elements definitions of referenced data categories from DublinCore and ISOcat are fetched.
     678Additionally, a detour pass (\var{track D}) is executed, in which the graph is also transformed into the DOT format and run through the \xne{Graphviz dot} tool to get an SVG representation of the graph. In an early stage of development, this was actually the only processing path. However, it soon became obvious that the graph was getting too large to be displayed in its entirety. Figure \ref{fig:cmd-dep-dotgraph} displays an old version of such a dot-generated graph visualization. Currently, the \xne{dot} output is only used as input for the final graph data, providing initialization coordinates for the nodes in the \code{dot}-layout.
     679
     680The graph is constructed from all profiles defined in the Component Registry and related datasets. To resolve the (multilingual) names and descriptions of the data categories referenced in the CMD element definitions, the specifications of the referenced data categories are fetched from DublinCore and ISOcat.
    682681
    683682
     
    698697
    699698As proposed in the design section, the starting point when using the SMC browser is the node list on the left, listing all nodes grouped by type (profiles, components, elements, data categories) and sorted alphabetically. This list can be filtered by a simple substring search, which is important, as there are already more than 4,000 nodes in the graph. Individual nodes are selected and deselected by a simple click. All selected nodes are displayed in the main graph pane, represented by a circle with a label. The representation is styled by type. Based on the settings in the navigation bar (cf. figure \ref{fig:navbar}), related nodes are displayed next to the selected ones. The \code{depth-before} and \code{depth-after} options govern how many levels in each direction are traversed and displayed starting from the set of selected nodes. Option \code{layout} allows to select one of the available layouts -- next to the
    700 basic \code{force} layout there are also directed layouts, that are often better suited for displaying the directed graph.
     699basic \code{force} layout there are also directed layouts that are often better suited for displaying the directed graph.
    701700Other options influence the layouting algorithm (\code{link-distance}, \code{charge}, \code{friction}) and the visual representation of the nodes and edges (\code{node-size, labels, curve}).
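The effect of \code{depth-before} and \code{depth-after} can be sketched as a bounded traversal over the directed graph: incoming edges are followed up to \code{before} levels, outgoing edges up to \code{after} levels, and the union of both reachable sets is displayed. The edge list and names below are assumptions for illustration, not the SMC Browser source.

```python
from collections import deque

# Invented miniature of the CMD graph: profiles -> components -> elements
# -> data categories.
EDGES = [("profileA", "comp1"), ("profileB", "comp1"),
         ("comp1", "elem1"), ("elem1", "datcat1")]

def reach(selected, adj, depth):
    """Breadth-first traversal from the selected nodes, at most `depth` levels."""
    seen = set(selected)
    frontier = deque((n, 0) for n in selected)
    while frontier:
        node, d = frontier.popleft()
        if d == depth:
            continue
        for nxt in adj.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, d + 1))
    return seen

def neighbourhood(selected, before=1, after=1):
    """Nodes shown for the given selection and depth-before/after settings."""
    out, inc = {}, {}
    for s, t in EDGES:
        out.setdefault(s, []).append(t)
        inc.setdefault(t, []).append(s)
    return reach(selected, inc, before) | reach(selected, out, after)

print(sorted(neighbourhood(["comp1"], before=1, after=1)))
```

Selecting \code{comp1} with depth 1 in both directions thus pulls in the profiles using it and the elements it contains, while deeper nodes (the data categories) only appear when \code{depth-after} is increased.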
    702701
    703 One special option is \code{graph}, that allows to switch between different graphs as listed in \ref{smc-browser-data-preprocessing}.
     702One special option is \code{graph}, which allows switching between the different graphs as listed in \ref{smc-browser-data-preprocessing}.
    704703
    705704There is user documentation deployed with the application and featured in the appendix \ref{sec:smc-browser-userdocs}, where all aspects of interaction with the application (\ref{interaction}) and the options in the navigation bar (\ref{options}) are described.
     
    708707\label{smc-browser-extensions}
    709708
    710 Next to the basic setup described above, there is a number of possible additional features, that could enhance the functionality and usefulness of the discussed tool.
     709In addition to the basic setup described above, there are a number of possible additional features that could enhance the functionality and usefulness of the discussed tool.
    711710
    712711\subsubsection*{Graph operations -- differential views}
     
    717716Equipped with a more flexible or modular matching algorithm (in addition to the initially foreseen identity match), the tool could visualize matches between any given schemas, not only CMD-based ones.
    718717
    719 Also, the input format being a graph, with appropriate preprocessing the tool could visualize any structural information, that is suited to be expressed as graph, like cooccurrence analysis, dependency networks, RDF data in general etc.
     718Also, since the input format is a graph, with appropriate preprocessing the tool could visualize any structural information that lends itself to being expressed as a graph, such as co-occurrence analyses, dependency networks, or RDF data in general.
    720719
    721720\subsubsection*{Viewer for external data}
    722 The above feature would be even more useful if the application would be enabled to ingest and process external data. The data can be passed either via upload or via a parameter with a URL of the data. This is especially attractive also to providers of other data and applications, who could provide a simple link in their user interface (with the data-parameter appropriately set), that would allow to visualize their data in the SMC browser.
     721The above feature would be even more useful if the application were able to ingest and process external data. The data could be passed either via upload or via a parameter carrying the URL of the data. This is especially attractive to providers of other data and applications, who could offer a simple link in their user interface (with the data parameter appropriately set) that visualizes their data in the SMC browser.
    723722
    724723One prominent visualization application offering this feature is the geobrowser e4D\furl{http://www.informatik.uni-leipzig.de:8080/e4D/} (currently \xne{GeoTemCo}\furl{https://github.com/stjaenicke/GeoTemCo}, developed in the context of the \xne{europeana connect} initiative), accepting data in KML format.
    725724
    726725\subsubsection*{Integrate with instance data}
    727 The usefulness and information gain of the application could be greatly increased by integrating the instance data. I.e. generate and display a variant of the graph which contains only profiles for which there is actually instance data present in the CLARIN joint metadata domain. Obviously, in such a visualization the size of data could be incorporated, in the most simple case number of records being mapped on the radius of the nodes, but there are a number of other metrics that could be applied in the visualizations.
     726The usefulness and information gain of the application could be greatly increased by integrating the instance data, i.e. generating and displaying a variant of the graph that contains only profiles for which instance data is actually present in the CLARIN joint metadata domain. Obviously, such a visualization could incorporate the size of the data, in the simplest case by mapping the number of records onto the radius of the nodes, but there are a number of other metrics that could be applied in the visualizations.
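The simplest variant -- record counts mapped onto node radii -- can be sketched as follows. Scaling the radius with the square root of the count (i.e. making the circle's *area* proportional to the count) is the customary choice, so that a profile with twice the records does not appear four times bigger; the profile names and counts are invented for illustration.

```python
import math

# Invented instance-data counts per profile.
record_counts = {"TextCorpusProfile": 40000,
                 "OLAC-DcmiTerms": 2500,
                 "teiHeader": 100}

def radius(count, r_min=4.0, r_max=30.0, max_count=None):
    """Map a record count onto a node radius, area-proportionally."""
    max_count = max_count or max(record_counts.values())
    scale = math.sqrt(count / max_count)
    return r_min + (r_max - r_min) * scale

for profile, n in record_counts.items():
    print(profile, round(radius(n), 1))
```

Other metrics mentioned in the text (e.g. reuse counts) could be mapped onto the same visual channel, or onto colour or edge thickness instead.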
    728727
    729728Such a visualization could also feature direct search links from individual nodes into the dataset, i.e. from a profile node a link could lead to a search interface listing the metadata records of the given profile.
     
    731730
    732731%%%%%%%%%%%%%%%%%%%%%%%%%
    733 \section{Application of \emph{schema matching} techniques in SMC}
     732\section{Application of \emph{Schema Matching} Techniques in SMC}
    734733\label{sec:schema-matching-app}
    735734
     
    739738Or, put in terms of the schema matching methodology, the system relies on explicitly set concept equivalences as a basis for mapping between schema entities. By referencing a data category in a CMD element, the modeller binds this element to a concept, making two elements linked to the same data category trivially equivalent.
    740739
    741 However this is only holds for schemas already created within the CMD framework (and even for these only to a certain degree, as will be explained later). Given the growing universe of definitions (data categories and components) in the CMD framework the metadata modeller could very well profit from applying schema mapping techniques as pre-processing step in the task of integrating existing external schemas into the infrastructure. (User involvement is identified by \cite{shvaiko2012ontology} as one of promising future challenges to ontology matching.) Already now, we witness a growing proliferation of components in the Component Registry and of data categories in the Data Category Registry.
     740However, this only holds for schemas already created within the CMD framework (and even for these only to a certain degree, as will be explained later). Given the growing universe of definitions (data categories and components) in the CMD framework, the metadata modeller could very well profit from applying schema mapping techniques as a pre-processing step in the task of integrating existing external schemas into the infrastructure. (User involvement is identified by \cite{shvaiko2012ontology} as one of the promising future challenges to ontology matching.) Already now, we witness a growing proliferation of components in the Component Registry and of data categories in the Data Category Registry.
    742741
    743742Let us restate the problem of integrating existing external schemas as an application of \var{schema matching} method:
    744743The data modeller starts off with an existing schema \var{$S_{x}$}. The system accommodates a set of schemas\footnote{Even though within CMDI the data models are called `profiles', we can still refer to them as `schemas', because every profile has an unambiguous expression in an XML Schema.} \var{$S_{1..n}$}.
    745 It is very improbable, that there is a \var{$S_{y} \in S_{1..n}$} that fully matches \var{$S_{x}$}.
     744It is very improbable that there is a \var{$S_{y} \in S_{1..n}$} that fully matches \var{$S_{x}$}.
    746745Given the heterogeneity of the schemas present in the field of research, full alignments are not achievable at all.
    747 However thanks to the compositional nature of the CMD data model, data modeller can reuse just parts of any of the schemas -- the
     746However, thanks to the compositional nature of the CMD data model, the data modeller can reuse just parts of any of the schemas -- the
    748747components \var{c}. Thus the task is to find for every entity $e_{x} \in S_{x}$ the set of semantically equivalent candidate components $\{c_{y}\}$, which corresponds to the definition of the mapping function for single entities as defined in \cite{EhrigSure2004}.
    749 Given, that the modeller does not have to reuse the components as they are, but can use existing components as base to create his own, even candidates that are not equivalent can be of interest, thus we can further relax the task and allow even candidates that are just similar to a certain degree (operationalized as threshold $t$ on the output of the \var{similarity} function).
     748Given that the modeller does not have to reuse the components as they are, but can use existing components as a base to create his own, even candidates that are not equivalent can be of interest. Thus we can further relax the task and allow candidates that are just similar to a certain degree (operationalized as a threshold $t$ on the output of the \var{similarity} function).
    750749Being only a pre-processing step meant to provide suggestions to the human modeller, the task assigns higher importance to recall than to precision.
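The relaxed matching task can be sketched as follows: for each entity of the external schema, return every registry component whose similarity exceeds the threshold $t$. For illustration the sketch uses a purely label-based string similarity; a real matcher would combine this with structural and extensional features, and the component names below are invented.

```python
from difflib import SequenceMatcher

# Invented registry component names.
registry_components = ["actorLanguage", "languageName",
                       "corpusTitle", "resourceTitle"]

def candidates(entity, components, t=0.5):
    """Recall-oriented: keep every component scoring at or above threshold t."""
    scored = [(c, SequenceMatcher(None, entity.lower(), c.lower()).ratio())
              for c in components]
    return sorted([(c, round(s, 2)) for c, s in scored if s >= t],
                  key=lambda x: -x[1])

print(candidates("title", registry_components))
```

Lowering $t$ trades precision for recall, which matches the stated preference: it is cheaper for the human modeller to discard a bad suggestion than to miss a good one.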
    751750
     
    764763the mapping function could be enriched with \emph{extensional} features based on the concept clusters as delivered by the crosswalk service (\ref{sec:cx}). It would also be worthwhile to test to what extent the \var{smcIndex} paths as defined in \ref{def:smcIndex} could be used as a feature (computing the longest matching subpath).
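Such a path-based feature could be computed as sketched below: two smcIndex-style dotted paths are scored by their longest run of identical consecutive segments, normalized by the length of the longer path. The dotted-path syntax follows the thesis; the scoring itself and the example paths are assumptions.

```python
def longest_common_subpath(p1, p2):
    """Length of the longest run of identical consecutive path segments."""
    a, b = p1.split("."), p2.split(".")
    best = 0
    for i in range(len(a)):
        for j in range(len(b)):
            k = 0
            while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
                k += 1
            best = max(best, k)
    return best

def path_similarity(p1, p2):
    """Normalize the longest common subpath by the longer path's length."""
    return (longest_common_subpath(p1, p2)
            / max(len(p1.split(".")), len(p2.split("."))))

print(path_similarity("TeiHeader.fileDesc.titleStmt.title",
                      "TextCorpus.fileDesc.titleStmt.title"))
```

Two elements sharing most of their path context would thus score high even when their root profiles differ.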
    765764
    766 Although we examplified on the case of integration of an external schema, the described approach could be applied also to the schemas already integrated in the system. Although there is already a high baseline given thanks to the mechanisms of reuse of components and data categories, there certainly still exist semantic proximities that are not explicitly expressed by these mechanisms. This deficiency is rooted in the collaborative creation of the CMD components and profiles, where individual modellers overlooked, deliberately ignored or only partially reused existing components or profiles. This can be seen on the case of multiple teiHeader profiles, that though they are modelling the same existing metadata format, are completely disconnected in terms of components and data category reuse (cf. \ref{results:tei}).
    767 
    768 Note, that in the case of reuse of components, in the normal scenario, the semantic equivalence is ensured even though the new component (and all its subcomponents) is a copy of the old one with new identity, because the references to data categories are copied as well. Thus, by default, the new component shares all data categories with the original one and the modeller has to deliberately change them if required. But even with reuse of components scenarios are thinkable, in which the semantic linking gets broken, or is not established, even though semantic equivalency pervails.
     765Although we used the integration of an external schema as an example, the described approach could also be applied to the schemas already integrated in the system. While there is already a high baseline thanks to the mechanisms of reuse of components and data categories, there certainly still exist semantic proximities that are not explicitly expressed by these mechanisms. This deficiency is rooted in the collaborative creation of the CMD components and profiles, where individual modellers overlooked, deliberately ignored or only partially reused existing components or profiles. This can be seen in the case of multiple teiHeader profiles that, though they model the same existing metadata format, are completely disconnected in terms of component and data category reuse (cf. \ref{results:tei}).
     766
     767Note that in the case of reuse of components, in the normal scenario, the semantic equivalence is ensured even though the new component (and all its subcomponents) is a copy of the old one with a new identity, because the references to data categories are copied as well. Thus, by default, the new component shares all data categories with the original one, and the modeller has to deliberately change them if required. But even with reuse of components, scenarios are conceivable in which the semantic linking gets broken, or is never established, even though semantic equivalence prevails.
    769768
    770769The question is what to do with the new correspondences that would possibly be determined if, as proposed, we applied schema matching to the already integrated schemas. One possibility is to add a data category reference where one element of the pair is still missing one.
    771 However if both already are linked to a data category, the data category pair could be added to the relation set in Relation Registry (cf. \ref{def:rr}).
     770However, if both are already linked to a data category, the data category pair could be added to the relation set in the Relation Registry (cf. \ref{def:rr}).
    772771 
    773772Once all the equivalences (and other relations) between the profiles/schemas have been found, similarity ratios can be determined.
    774773These new similarity ratios could be applied as alternative weights in the profiles-similarity graph (\ref{sec:smc-cloud}).
    775774
    776 In contrast to the task described here, that -- restricted to matching XML schemas -- can be seen as staying in the ``XML World'',
     775In contrast to the task described here, which -- restricted to matching XML schemas -- can be seen as staying in the ``XML world'',
    777776another aspect within this work is clearly situated in the Semantic Web domain and requires application of ontology matching methods -- the mapping of field values to semantic entities described in \ref{sec:values2entities}.
    778777
    779 %This approach of integrating prerequisites for semantic interoperability directly into the process of metadata creation is fundamentally different from the traditional methods of schema matching that try to establish pairwise alignments between already existing schemas -- be it algorithm-based or by means of explicit manually defined crosswalks\cite{Shvaiko2005}.
     778%This approach of integrating prerequisites for semantic interoperability directly into the process of metadata creation is fundamentally different from the traditional methods of schema matching that try to establish pairwise alignments between already existing schemas -- be it algorithm-based or by means of explicit manually defined crosswalks \cite{Shvaiko2005}.
    780779
    781780
    782781
    783782\section{Summary}
    784 In this core chapter, we layed out a design for a system dealing with concept-based crosswalks on schema level.
     783In this core chapter, we laid out a design for a system dealing with concept-based crosswalks on schema level.
    785784The system consists of three main parts: the crosswalk service, the query expansion module and \xne{SMC Browser} -- a tool for visualizing and exploring the schemas and the corresponding crosswalks.
    786785In addition, we elaborated on the application of schema matching methods to infer mappings between schemas.
  • SMC4LRT/chapters/Infrastructure.tex

    r3776 r4117  
    1 \chapter{Underlying infrastructure}
     1\chapter{Underlying Infrastructure}
    22\label{ch:infra}
    33
     
    77\label{def:CLARIN}
    88
     9CLARIN - Common Language Resource and Technology Infrastructure \cite{Varadi2008} - is one of the large research infrastructure initiatives as envisaged by the European Strategy Forum on Research Infrastructures (ESFRI) and fostered by the framework programmes of the European Commission. The mission of this project is to provide
     9CLARIN - Common Language Resource and Technology Infrastructure \cite{Varadi2008} - is one of the large research infrastructure initiatives as envisaged by the European Stategy Forum on Research Infrastructures (ESFRI) and fostered by the framework programmes of the European Commission. The mission of this project is to provide
    1010
    1111\begin{quote}
    12 \dots easy and sustainable access for scholars in the humanities and social sciences to digital language data (in written, spoken, video or multimodal form) and advanced tools to discover, explore, exploit, annotate, analyse or combine them, independent of where they are located.\cite{CLARIN2013web}
     12\dots easy and sustainable access for scholars in the humanities and social sciences to digital language data (in written, spoken, video or multimodal form) and advanced tools to discover, explore, exploit, annotate, analyse or combine them, independent of where they are located. \cite{CLARIN2013web}
    1313\end{quote}
    1414
     
    1919The initiative foresees a federated network of centres providing resources and services in a harmonized, interoperable manner to the academic community in all participating countries.
    2020
    21 In the preparation phase of the project 2008 - 2011 over 180 institutions from 38 countries participated. In the construction phase, the action impetus moved, as projected, more to the individual national initiatives of this federated endeavour, while kept together by the common principles set up during the preparation phase and established processes and administrative decision bodies ensuring the flow of information and coherent action on European level.
     21In the preparation phase of the project (2008--2011), over 180 institutions from 38 countries participated. In the construction phase, the action impetus moved, as projected, more to the individual national initiatives of this federated endeavour, while being kept together by the common principles set up during the preparation phase and by established processes and administrative decision bodies ensuring the flow of information and coherent action on the European level.
    2222
    2323Since 2013, CLARIN has also been a \emph{European Research Infrastructure Consortium} (ERIC), a new type of legal entity established within the EU, especially designed to give research infrastructure initiatives a more stable status and better means to act independently. This is an important step to ensure the continuity of the endeavour -- a chronic problem of (international) projects.
     
    2727\label{def:CMDI}
    2828
     29One core pillar of CLARIN is the \emph{Component Metadata Infrastructure} (CMDI)\furl{http://www.clarin.eu/cmdi} -- a distributed system consisting of multiple interconnected modules aimed at creating and providing metadata for LRT in a coherent, harmonized way. The conceptual foundation of CMDI is the \emph{Component Metadata Framework} \cite{Broeder+2010}, a flexible meta model that supports the creation of metadata schemas while also allowing existing schemas to be accommodated (cf. \ref{def:CMD}).
     29One core pillar of CLARIN is the \emph{Component Metadata Infrastructure} (CMDI)\furl{http://www.clarin.eu/cmdi} -- a distributed system consisting of multiple interconnected modules aimed at creating and providing metadata for LRT in a coherent harmonized way. The conceptual foundation of CMDI is the \emph{Component Metadata Framework} \cite{Broeder+2010}, a flexible meta model that supports creation of metadata schemas also allowing to accommodate existing schemas (cf. \ref{def:CMD}).
    3030
    3131The SMC is part of CMDI and depends on multiple modules on the production side of the infrastructure. Before we describe the SMC and its interaction with these modules in detail in chapter \ref{ch:design}, we introduce the latter and the type of data they provide in \ref{cmdi-registries}:
     
    3838
    3939\noindent
    40 All these modules are running services, that this work shall directly build upon.
     40All these modules are running services that this work shall directly build upon.
    4141
    4242In contrast, SMC is meant as a provider for the modules on the exploitation side of the infrastructure, i.e. search and exploration services used by the end users. These are briefly introduced in \ref{cmdi_exploitation}.
     
    6060Finally, the Vocabulary Alignment Service, a module playing a crucial role in metadata curation, is treated separately in section \ref{sec:cv}.
    6161
    62 \subsection{CMDI registries}
     62\subsection{CMDI Registries}
    6363\label{cmdi-registries}
    6464The CMD framework as data model (cf. \ref{def:CMD}), together with the two registries, the \emph{Data Category Registry} \xne{ISOcat} and the \emph{Component Registry}, forms the backbone of the CMD Infrastructure. See figure \ref{fig:cmdi-old} with the rather na\"{i}ve initial vision of the system, contrasted with figure \ref{fig:SMC-linkage} detailing the actual linkage between the data in the individual registries. In the following, we briefly explain their role and interaction.
     
    6666\begin{figure*}[t]
    6767\includegraphics[width=1\textwidth]{images/SMC_CR-DCR-RR_Linkage_v2}
    68 \caption{The diagram depicts the links between pieces of data in the individual registries that serve as basis for semantic mapping}
     68\caption{The diagram depicts the links between pieces of data in the individual registries that serve as basis for semantic mapping.}
    6969\label{fig:SMC-linkage}
    7070\end{figure*}
     
    7979Next to a web interface for users to browse and manage the data categories, ISOcat provides a REST-style webservice allowing applications to retrieve the data category specifications. By default, a specification is provided in the \xne{Data Category Interchange Format - DCIF}, the standardized XML serialization of the data model, but RDF and HTML representations are available as well.
    8080
    81 The core data model defining the data category specification is rather complex, consisting of administrative, linguistic and description part, containing language-specific versions of definitions, value domains, examples and other attributes (cf. \ref{fig:DCR_data_model} for the diagram of the full data model). Following types of data categories are recognized (cf. figure \ref{fig:dc_type}): \var{simple, complex}: (\var{closed, open} or \var{constrained}), \var{container}. One fundamental aspect to emphasize is, that the data categories are assigned a persistent identifier, making them globally and permanently referable.
     81The core data model defining the data category specification is rather complex, consisting of an administrative, a linguistic and a description part, containing language-specific versions of definitions, value domains, examples and other attributes (cf. \ref{fig:DCR_data_model} for the diagram of the full data model). The following types of data categories are recognized (cf. figure \ref{fig:dc_type}): \var{simple, complex} (\var{closed, open} or \var{constrained}), \var{container}. One fundamental aspect to emphasize is that the data categories are assigned a persistent identifier, making them globally and permanently referable.
    8282
    8383\begin{figure*}[!ht]
     
    8585\includegraphics[width=0.7\textwidth]{images/dc_types}
    8686\end{center}
    87 \caption{Data Category types\cite{Windhouwer2011ISOcat_intro}}
     87\caption{Data Category types \cite{Windhouwer2011}}
    8888\label{fig:dc_type}
    8989\end{figure*}
     
    9292\label{def:CR}
    9393
    94 \emph{Component Registry}\furl{http://catalog.clarin.eu/ds/ComponentRegistry/} (CR) implements the CMD data model (cf. \ref{def:CMD}) and fulfills two functions. For one, it is the actual registry that persistently stores and exposes published CMD profiles via a web interface allowing to browse and search in them and view their structure accompaniged by a REST webservice to allows client applications to retrieve the profile definitions. At the same time the web interface serves as an editor for creating and editing new CMD components and profiles.
    95 
    96 The primary user of the CR is the metadata modeller with the task to create a dedicated metadata profile for a given resource type. She can browse and search the CR for components and profiles that are suitable or come close. The registry already contains many general components, e.g., for contact persons, language and geographical information. In general many of these can be reused as they are or have to be only slightly adapted, i.e., have some metadata elements and/or components  added or removed. Also new components can be created if needed to model the unique aspects of the resources under consideration.\cite{Durco2013_MTSR}
    97 
    98 Let us reiterate, that the actual core provision for semantic interoperability is the requirement that the elements (and as far as possible also components and values) should be linked ``via a PID to exactly one data category (cf. \ref{def:DCR}), thus indicating unambiguously how the content of the field in a metadata description should be interpreted''\cite{Broeder+2010}, or \emph{to make its semantics explicit}.
      94\emph{Component Registry}\furl{http://catalog.clarin.eu/ds/ComponentRegistry/} (CR) implements the CMD data model (cf. \ref{def:CMD}) and fulfills two functions. For one, it is the actual registry that persistently stores and exposes published CMD profiles via a web interface allowing users to browse and search them and view their structure, accompanied by a REST webservice that allows client applications to retrieve the profile definitions. At the same time, the web interface serves as an editor for creating and editing new CMD components and profiles.
     95
      96The primary user of the CR is the metadata modeller with the task to create a dedicated metadata profile for a given resource type. She can browse and search the CR for components and profiles that are suitable or come close. The registry already contains many general components, e.g., for contact persons, language and geographical information. In general, many of these can be reused as they are or have to be only slightly adapted, i.e., have some metadata elements and/or components added or removed. New components can also be created if needed to model the unique aspects of the resources under consideration. \cite{Durco2013MTSR}
     97
     98Let us reiterate that the actual core provision for semantic interoperability is the requirement that the elements (and as far as possible also components and values) should be linked ``via a PID to exactly one data category (cf. \ref{def:DCR}), thus indicating unambiguously how the content of the field in a metadata description should be interpreted'' \cite{Broeder+2010}, or \emph{to make its semantics explicit}.
    9999
    100100As dictated by the CMD model, all components needed for the modelled resource description are compiled into one profile.
     
    104104
    105105The framework as described so far provides a sound mechanism for binding the semantic interpretation of the metadata descriptions.
    106 However there needs to be an additional means to capture information about relations between data categories.
    107 This information was deliberately not included in the DCR, because relations often depend on the context in which they are used, making global agreement unfeasible. CMDI proposes a separate module -- the \emph{Relation Registry}\label{def:rr} (RR) \cite{Kemps-Snijders+2008} --, where arbitrary relations between data categories can be stored and maintained. This design decision is based upon the assumption that the relations be under control of the metadata user whereas the data categories are under control of the metadata modeller.
      106However, there needs to be an additional means to capture information about relations between data categories.
      107This information was deliberately not included in the DCR, because relations often depend on the context in which they are used, making global agreement unfeasible. CMDI proposes a separate module -- the \emph{Relation Registry}\label{def:rr} (RR) \cite{Kemps-Snijders+2008} --, where arbitrary relations between data categories can be stored and maintained. This design decision is based upon the assumption that the relations need to be under the control of the metadata user whereas the data categories are under the control of the metadata modeller.
    108108
    109109The relations don't need to pass a standardization process, but rather separate research teams may define their own sets of relations according to the specific needs of the project. That is not to say that every researcher has to create her own set of relations -- some basic recommended sets will be defined right from the start. But new -- even contradictory -- ones can be created when needed.
    110110
    111 There is a prototypical implementation of such a relation registry called \xne{RELcat} being developed at MPI, Nijmegen\cite{Windhouwer2011,SchuurmanWindhouwer2011}, that already hosts a few relation sets. There is no user interface to it yet, but it is accessible as a REST-webservice\footnote{sample relation set: \url{http://lux13.mpi.nl/relcat/rest/set/cmdi}}.
     111There is a prototypical implementation of such a relation registry called \xne{RELcat} being developed at MPI, Nijmegen \cite{Windhouwer2011,SchuurmanWindhouwer2011} that already hosts a few relation sets. There is no user interface to it yet, but it is accessible as a REST-webservice\footnote{sample relation set: \url{http://lux13.mpi.nl/relcat/rest/set/cmdi}}.
    112112This implementation stores the individual relations as RDF triples allowing typed relations, like equivalency (\code{rel:sameAs}) and subsumption (\code{rel:subClassOf}). The relations are grouped into relation sets that can be used independently. The relations are deliberately defined in a separate namespace, instead of reusing existing ones (\code{skos:exactMatch, owl:sameAs}) with the aim to avoid introducing too specific semantics. These relations can be mapped to appropriate other predicates when integrating the relation sets in concrete applications.
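To make this concrete, a relation-set entry could be expressed as RDF triples along the following lines (an illustrative sketch: the data category PIDs and the \code{rel:} namespace URI are hypothetical, not taken from the actual RELcat data):

\begin{lstlisting}
# hypothetical namespace and data category PIDs, for illustration only
@prefix rel: <http://example.org/relcat/rel#> .

# equivalence between two data categories
<http://www.isocat.org/datcat/DC-0001>
    rel:sameAs <http://purl.org/dc/elements/1.1/title> .

# subsumption: one data category is narrower than another
<http://www.isocat.org/datcat/DC-0002>
    rel:subClassOf <http://www.isocat.org/datcat/DC-0001> .
\end{lstlisting}

When such a set is integrated into an application, \code{rel:sameAs} could then be mapped to, e.g., \code{skos:exactMatch} or \code{owl:sameAs} as appropriate.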
    113113
     
    116116\end{definition}
    117117
    118 \subsection{Further parts of the infrastructure}
     118\subsection{Further Parts of the Infrastructure}
    119119\label{cmdi-other}
    120120
     
    124124\begin{quotation}
    125125RELcat and SCHEMAcat will provide the means to harvest and specify this information in the form of relationships and allow
    126 (search) algorithms to traverse the semantic graph thus made explicit\cite{Schuurman2011_SCHEMAcat}.
     126(search) algorithms to traverse the semantic graph thus made explicit \cite{SchuurmanWindhouwer2011}.
    127127\end{quotation}
    128128
    129129\subsubsection*{Schema Parser}
    130 Schema Parser is a service developed at the Meertens Institute, Amsterdam, that processes XML Schemas to generate all possible paths in the instance data. It is used primarily as auxiliary service to the search engine developed at the same institute, presented in the following subsection.
      130Schema Parser is a service developed at the Meertens Institute, Amsterdam, that processes XML Schemas to generate all possible paths in the instance data. It is used primarily as an auxiliary service to the search engine developed at the same institute, presented in the following subsection.
    131131
    132132\subsubsection*{Metadata editors}
     
    137137
    138138Given that the Component Registry generates an XML schema for every profile, basically any generic XML editor with schema validation can be used (e.g. the widespread \xne{oXygen}). However, there have been efforts within the CLARIN community to develop dedicated tools, tailor-made for the creation of CMD records.
    139 Two examples being the stand-alone application \xne{Arbil}\cite{withers2012arbil}\furl{http://tla.mpi.nl/tools/tla-tools/arbil/} being developed at Max Planck Institute for Psycholinguistics, Nijmegen and the web-based application developed within the project \xne{NaLiDa}\cite{dima2012mdeditor}\furl{http://www.sfs.uni-tuebingen.de/nalida/en/} at the Seminar für Sprachwissenschaft University Tübingen.
    140 
    141 
    142 \subsection{CMDI exploitation side}
      139Two examples are the stand-alone application \xne{Arbil}\furl{http://tla.mpi.nl/tools/tla-tools/arbil/} \cite{withers2012arbil}, developed at the Max Planck Institute for Psycholinguistics, Nijmegen, and the web-based application developed within the project \xne{NaLiDa}\furl{http://www.sfs.uni-tuebingen.de/nalida/en/} \cite{dima2012mdeditor} at the Seminar für Sprachwissenschaft, University of Tübingen.
     140
     141
     142\subsection{CMDI Exploitation Side}
    143143\label{cmdi_exploitation}
    144 Metadata complying with the CMD data model is being created by a growing number of institutions  by various means -- automatic transformation from legacy data or authoring of new metadata records with the help of one of the metadata editors (cf. \ref{md-editors}). The CMD infrastructure requires the content providers to publish their metadata via the OAI-PMH protocol and announce the OAI-PMH endpoints.  These are being collected daily by a dedicated CLARIN harvester\furl{http://catalog.clarin.eu/oai-harvester/}. The harvested data is validated against the corresponding schemas (every profile implies a separate schema). In the future a subsequent normalization step will play a bigger role, currently only minimal ad-hoc label normalization is performed for a few organization names. Finally, the data is made (publicly) available as compressed archive files. These are being fetched by the exploitation side applications, that ingest the metadata records, index them and make them available for searching and browsing (cf. figure \ref{fig:cmd-ingestion}).
      144Metadata complying with the CMD data model is being created by a growing number of institutions by various means -- automatic transformation from legacy data or authoring of new metadata records with the help of one of the metadata editors (cf. \ref{md-editors}). The CMD infrastructure requires the content providers to publish their metadata via the OAI-PMH protocol and announce the OAI-PMH endpoints. These are being collected daily by a dedicated CLARIN harvester\furl{http://catalog.clarin.eu/oai-harvester/}. The harvested data is validated against the corresponding schemas (every profile implies a separate schema). In the future a subsequent normalization step will play a bigger role; currently, only minimal ad-hoc label normalization is performed for a few organization names. Finally, the data is made (publicly) available as compressed archive files. These are being fetched by the exploitation-side applications that ingest the metadata records, index them and make them available for searching and browsing (cf. figure \ref{fig:cmd-ingestion}).
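For illustration, a provider endpoint would be harvested with standard OAI-PMH requests of the following form (the endpoint URL and the metadata prefix are assumptions made for this sketch):

\begin{lstlisting}
# full harvest of CMDI records from an (illustrative) endpoint
http://repository.example.org/oai?verb=ListRecords&metadataPrefix=cmdi

# subsequent incremental harvest, restricted by datestamp
http://repository.example.org/oai?verb=ListRecords&metadataPrefix=cmdi&from=2013-11-01
\end{lstlisting}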
    145145
    146146\begin{figure*}[!ht]
    147147\begin{center}
    148148\includegraphics[width=0.8\textwidth]{images/CMDingestion_woVAS}
    149 \caption{Within CMDI, metadata is harvested from content providers via OAI-PMH and made available to consumers/users by search applications}
     149\caption{Within CMDI, metadata is harvested from content providers via OAI-PMH and made available to consumers/users by search applications.}
    150150\label{fig:cmd-ingestion}
    151151\end{center}
    152152\end{figure*}
    153153
    154 The first stable and publicly available application providing access to the collected metadata of CMDI has been the \xne{VLO - Virtual Language Observatory}\furl{http://www.clarin.eu/vlo/}\cite{VanUytvanck2010}, developed by the Technical Group at the MPI for Psycholinguistics, Nijmegen, based on the wide-spread full-text search engine \xne{Apache Solr}\furl{http://lucene.apache.org/solr/}.
     154The first stable and publicly available application providing access to the collected metadata of CMDI has been the \xne{VLO - Virtual Language Observatory}\furl{http://www.clarin.eu/vlo/} \cite{VanUytvanck2010}, developed by the Technical Group at the MPI for Psycholinguistics, Nijmegen, based on the wide-spread full-text search engine \xne{Apache Solr}\furl{http://lucene.apache.org/solr/}.
    155155The application employs a faceted search with 10 fixed facets (figure \ref{fig:vlo}).
    156156As the processed metadata records are instances of different CMD profiles and thus have very different structures, the application relies on the data category references in the underlying schemas to map the fields in the records onto the facets, effectively making use of this basic layer of semantic interoperability provided by the infrastructure.
     
    159159\begin{center}
    160160\includegraphics[width=0.8\textwidth]{images/screen_VLO_overview.png}
    161 \caption{screenshot of the faceted browser of the VLO}
     161\caption{Screenshot of the faceted browser of the VLO}
    162162\label{fig:vlo}
    163163\end{center}
    164164\end{figure*}
    165165
    166 More recently, the team at Meertens Institute developed a similar application the \texttt{MI Search Engine}\furl{http://www.meertens.knaw.nl/cmdi/search/}. It too is based on the Apache Solr and provides a faceted search, but with a substantially more sophisticated indexing process and search interface \cite{zhang2012cmdi}. Instead of reducing the data into a fixed number of indexes or facets, the application employs the aforementioned \xne{Schema Parser} to dynamically generate an index configuration that covers all data, again relying on the data categories to merge information from semantically equivalent metadata fields in the different schemas into a common index.
      166More recently, the team at the Meertens Institute developed a similar application, the \texttt{MI Search Engine}\furl{http://www.meertens.knaw.nl/cmdi/search/}. It is also based on Apache Solr and provides a faceted search, but with a substantially more sophisticated indexing process and search interface \cite{Zhang2012cmdi}. Instead of reducing the data into a fixed number of indexes or facets, the application employs the aforementioned \xne{Schema Parser} to dynamically generate an index configuration that covers all data, again relying on the data categories to merge information from semantically equivalent metadata fields in the different schemas into a common index.
    167167The application also offers some innovative solutions in the user interface, like search by similarity, content-first search or specialized contextual widgets visualizing the time dimension, the geographic information and other derived data.
    168168% \todoin { describe indexing and search}
    169169
    170 And finally, there is the \xne{Metadata Repository}, being developed by the author as a XQuery application in the XML database \xne{eXist}, originally (in the initial blueprints of the infrastructure) foreseen as main storage of the collected metadata with the \xne{Metadata Service} on top providing search access to the data optionally applying \xne{Semantic Mapping} to expand user queries (cf. figure \ref{fig:cmdi-old}). \cite{Durco2011}
    171 However the application still did not reach production quality, and is used rather as experimenting field for the author. Meanwhile the functionality of the Metadata Service had been integrated directly into the Metadata Repository together with the auxiliary use of Semantic Mapping, making it the implementation of the semantic search module as proposed in this work (cf. \ref{sec:qx}).
      170And finally, there is the \xne{Metadata Repository}, being developed by the author as an XQuery application in the XML database \xne{eXist}, originally (in the initial blueprints of the infrastructure) foreseen as the main storage of the collected metadata with the \xne{Metadata Service} on top providing search access to the data, optionally applying \xne{Semantic Mapping} to expand user queries (cf. figure \ref{fig:cmdi-old}) \cite{Durco2011}.
      171However, the application has still not reached production quality and is used rather as an experimentation platform for the author. Meanwhile, the functionality of the Metadata Service has been integrated directly into the Metadata Repository, together with the auxiliary use of Semantic Mapping, making it the implementation of the semantic search module proposed in this work (cf. \ref{sec:qx}).
    172172
    173173%%%%%%%%%%%%%%%%%%%%
     
    175175\label{sec:cv}
    176176
    177 \subsection{Motivation \& broader context}
    178 The provisions for data harmonization and semantic interoperability as presented until now pertain mostly to the schema level. However the problem of incoherent labeling and nomenclature is even more virulent in the actual metadata fields on the instance level. While for a number of fields the value domain can be enforced through schema validation, many fields (e.g. \concept{organization} or \concept{resource type})  have a constrained value domain that yet cannot be explicitly exhaustively enumerated. This leads to a chronically inconsistent use of labels for referring to entities (as the instance data shows, some organizations are referred to by more than 20 different labels, or spelling variants.) prompting an urgent need for better means for harmonizing the constrained-field values.
     177\subsection{Motivation \& Broader Context}
      178The provisions for data harmonization and semantic interoperability as presented until now pertain mostly to the schema level. However, the problem of incoherent labelling and nomenclature is even more virulent in the actual metadata fields on the instance level. While for a number of fields the value domain can be enforced through schema validation, many fields (e.g. \concept{organization} or \concept{resource type}) have a constrained value domain that cannot yet be explicitly and exhaustively enumerated. This leads to a chronically inconsistent use of labels for referring to entities (as the instance data shows, some organizations are referred to by more than 20 different labels or spelling variants), prompting an urgent need for better means for harmonizing the constrained-field values.
    179179
    180180This issue is to be seen in a broader context of a general need for reliable community-shared registry services for concepts, controlled vocabularies and reference data in both the LRT and Digital Humanities community, applicable in a range of applications and tasks like data enrichment and annotation, metadata generation and curation, data analysis, etc.
     
    183183Consequently, activities with regard to controlled vocabularies are ongoing not only in CLARIN, but also within the sister ESFRI project DARIAH. As there is a substantial overlap in the vocabularies relevant for the various communities and even more so a high potential for reusability on the technical level, there is a strong case for tight synergic cooperation between individual initiatives.
    184184
    185 It has to be also kept in mind, that a hoist of work on controlled vocabularies has already been done and a large body of data is present in individual specialized communities (taxonomies) as well as -- with more general scope -- in the libraries world (authority files).
      185It also has to be kept in mind that a host of work on controlled vocabularies has already been done and a large body of data is present in individual specialized communities (taxonomies) as well as -- with more general scope -- in the libraries world (authority files).
    186186
    187187\begin{comment}
     
    196196\label{def:CLAVAS}
    197197
    198 In the context of CLARIN (primarily CLARIN-NL), a concrete initiative has been conducted -- \xne{Vocabulary Alignment Service for CLARIN} or CLAVAS -- with the objective to reuse and enhance for CLARIN needs a SKOS-based vocabulary repository and editor \xne{OpenSKOS}\furl{http://openskos.org}, developed and run within the dutch program \xne{CATCHplus}\footnote{\textit{Continuous Access To Cultural Heritage} - \url{http://www.catchplus.nl/en/}}.
      198In the context of CLARIN (primarily CLARIN-NL), a concrete initiative has been conducted -- \xne{Vocabulary Alignment Service for CLARIN} or CLAVAS -- with the objective to reuse the SKOS-based vocabulary repository and editor \xne{OpenSKOS}\furl{http://openskos.org}, developed and run within the Dutch programme \xne{CATCHplus}\footnote{\textit{Continuous Access To Cultural Heritage} - \url{http://www.catchplus.nl/en/}}, and to enhance it for CLARIN's needs.
    199199
    200200%As of spring 2013, the Standing Committee on CLARIN Technical Centres (SCCTC) adopted the issue of Controlled Vocabularies and Concept Registries as one of the infrastructural (A-centre) services to be dealt with.
    201201
    202202The basic idea of this repository is to serve as a project independent manager and provider of controlled vocabularies, as an exchange platform for data in SKOS format.
    203 One important feature of the \xne{OpenSKOS} system is its distributed architecture. Multiple instances can be set up, that can synchronize the maintained vocabularies among each other via OAI-PMH protocol. This caters for a reliable redundant system, in which multiple instances provide identical synchronized data, with organizations behind individual instances assuming the primary responsibility for individual vocabularies based on their specialization or field of expertise.
    204 
    205 Currently, the Meertens Institute\furl{http://meertens.knaw.nl/} of the Dutch Royal Academy of Sciences (KNAW), Netherlands Institute for Sound and Vision\furl{http://www.beeldengeluid.nl/}, as well as Austrian Centre for Digital Humanities at the Austrian Academy of Sciences are running a instance of the OpenSKOS system.
    206 
    207 As the work on this vocabulary repository started in the context of a cultural heritage program, originally it served vocabularies not directly relevant for the LRT-community \concept{GTAA - Gemeenschappelijke Thesaurus Audiovisuele Archieven} or \concept{AAT - Art \& Architecture Thesaurus}\furl{http://openskos.org/api/collections}.  Within the CLAVAS, a number of vocabularies relevant for the CLARIN and LRT-community were identified, that will be gradually integrated into the vocabulary repository. (See \ref{refdata} for a more complete list of required reference data together with candidate existing vocabularies.) Following vocabularies were already integrated into the \xne{CLAVAS} instance of OpenSKOS:
      203One important feature of the \xne{OpenSKOS} system is its distributed architecture. Multiple instances can be set up that can synchronize the maintained vocabularies among each other via the OAI-PMH protocol. This caters for a reliable redundant system, in which multiple instances provide identical synchronized data, with the organizations behind individual instances assuming the primary responsibility for individual vocabularies based on their specialization or field of expertise.
     204
     205Currently, the Meertens Institute\furl{http://meertens.knaw.nl/} of the Dutch Royal Academy of Sciences (KNAW), Netherlands Institute for Sound and Vision\furl{http://www.beeldengeluid.nl/}, as well as Austrian Centre for Digital Humanities at the Austrian Academy of Sciences are running an instance of the OpenSKOS system.
     206
      207As the work on this vocabulary repository started in the context of a cultural heritage programme, originally it served vocabularies not directly relevant for the LRT community, such as \concept{GTAA - Gemeenschappelijke Thesaurus Audiovisuele Archieven} or \concept{AAT - Art \& Architecture Thesaurus}\furl{http://openskos.org/api/collections}. Within CLAVAS, a number of vocabularies relevant for the CLARIN and LRT community were identified that will be gradually integrated into the vocabulary repository. (See \ref{refdata} for a more complete list of required reference data together with candidate existing vocabularies.) The following vocabularies have already been integrated into the \xne{CLAVAS} instance of OpenSKOS:
    208208\begin{itemize}
    209 \item the list of language codes\cite{ISO639}
     209\item the list of language codes \cite{ISO639}
    210210\item organization names for the domain of language resources
    211211\item a number of data categories from ISOcat (see \ref{sec:export-dcr} for details of the process)
     
    215215\label{sec:export-dcr}
    216216
    217 Based on the premise, that the data in DCR also represents a kind of a controlled vocabularies, there is an effort to export data categories in SKOS format and import them into the Vocabulary Service.
    218 
    219 Note, that there are two interaction paths between the ISOcat and the Vocabulary Service. The first, importing certain data categories from ISOcat into the Vocabulary Service, is described in this section. The second aspect (described in next section \ref{interaction-dcr-skos}) is, that the value domains of certain data categories are defined by reference to a vocabulary maintained in the Vocabulary Service.
      217Based on the premise that the data in the DCR also represents a kind of controlled vocabulary, there is an effort to export data categories in SKOS format and import them into the Vocabulary Service.
     218
      219Note that there are two interaction paths between ISOcat and the Vocabulary Service. The first, importing certain data categories from ISOcat into the Vocabulary Service, is described in this section. The second aspect (described in the next section, \ref{interaction-dcr-skos}) is that the value domains of certain data categories are defined by reference to a vocabulary maintained in the Vocabulary Service.
    220220
    221221The fact that data categories are basically definitions of concepts may suggest
    222222a na\"{i}ve approach to mapping DCR data to SKOS, namely mapping every data category to a \code{skos:Concept}
    223 all of them belonging to the \code{ISOcat:ConceptScheme}. However the data in ISOcat as whole is too disparate in scope for such a vocabulary to be useful.
     223all of them belonging to the \code{ISOcat:ConceptScheme}. However, the data in ISOcat as a whole is too disparate in scope for such a vocabulary to be useful.
    224224
    225225A more sensible approach is to export only closed DCs (with an explicitly defined value domain, cf. \ref{def:DCR}) as separate \code{skos:ConceptSchemes} and their respective simple DCs as \code{skos:Concepts} within that scheme.
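Under this mapping, a closed complex DC becomes a \code{skos:ConceptScheme} and each simple DC in its value domain a \code{skos:Concept}. Sketched in Turtle, with hypothetical PIDs and labels standing in for real ISOcat entries:

\begin{lstlisting}
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .

# closed complex DC -> skos:ConceptScheme (hypothetical PID)
<http://www.isocat.org/datcat/DC-0100>
    a skos:ConceptScheme ;
    skos:prefLabel "resource type"@en .

# simple DC from its value domain -> skos:Concept
<http://www.isocat.org/datcat/DC-0101>
    a skos:Concept ;
    skos:prefLabel "corpus"@en ;
    skos:inScheme <http://www.isocat.org/datcat/DC-0100> .
\end{lstlisting}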
    226226
    227227\begin{quotation}
    228 The rationale is, that if we see a vocabulary as a set of possible values for a
     228The rationale is that if we see a vocabulary as a set of possible values for a
    229229field/element/attribute, complex DCs in ISOcat are the users of such
    230230vocabularies and simple DCs the DCR equivalence of values in such a
    231 vocabulary.\cite{Menzo2013mail}
     231vocabulary. \cite{Menzo2013mail}
    232232\end{quotation}
    233233
    234234\begin{comment}
    235 Still there are some closed DCs which might be good vocabulary
     235Still there are some closed DCs, which might be good vocabulary
    236236providers, e.g., /linguistic subject/ (DC-2527/), and still also need to
    237237stay in ISOcat. I think at some point we should create a smaller set of
     
    240240then 20, 50 or 100 values are exported.
    241241
    242 However it needs to be yet assessed how useful this approach is. In the metadata profile
      242However, it has yet to be assessed how useful this approach is. In the metadata profile
    243243there are many closed DCs with small value domains. How useful are those
    244244in CLAVAS?
     
    253253\end{figure*}
    254254
    255 Another aspect is, that a simple DC can be in value domains of multiple closed DCs.
     255Another aspect is that a simple DC can be in value domains of multiple closed DCs.
    256256Also a \code{skos:Concept} can belong to multiple \code{skos:ConceptSchemes}\furl{http://www.w3.org/TR/skos-primer/\#secscheme}.
    257257So there could be a 1:1 mapping of [complex closed DCs] to [skos:ConceptSchemes] and [simple DCs] to [skos:Concepts].
     
    260260Alternatively, for each value domain a SKOS concept scheme with SKOS concepts can be created,
    261261i.e., a SKOS concept always belongs to one concept schema, but multiple SKOS concepts refer to the same simple DC using \code{<dcr:datcat/>} (and \code{<dcterms:source/>}).
    262 This is, how the export for CLAVAS currently works.\furl{http://www.isocat.org/rest/profile/5.clavas}\furl{https://trac.clarin.eu/browser/cats/ISOcat/trunk/mod-ISOcat-interface-rest
     262This is how the export for CLAVAS currently works.\furl{http://www.isocat.org/rest/profile/5.clavas}\furl{https://trac.clarin.eu/browser/cats/ISOcat/trunk/mod-ISOcat-interface-rest
    263263/representations/dcs2/clavas.xsl}
    264264
    265265
    266 \subsection{Linking to vocabularies in data categories and schemas -- interaction between ISOcat, CLAVAS and client applications}
     266\subsection{Linking to Vocabularies in Data Categories and Schemas -- Interaction between ISOcat, CLAVAS and Client Applications}
    267267\label{interaction-dcr-skos}
    268268
    269269In the following, we elaborate on the possible ways to model references to vocabularies in data category specification and to
    270 convey that information to the client application. As of the writing, this is work in progress with some design decision yet to be made.\footnote{Large parts of this subsection come from email correspondence with M. Windhouwer in spring 2013.\cite{Menzo2013mail}}
      270convey that information to the client application. As of this writing, this is work in progress with some design decisions yet to be made.\footnote{Large parts of this subsection come from email correspondence with M. Windhouwer in spring 2013. \cite{Menzo2013mail}}
    271271
    272272Providing vocabularies for constrained but large and complex conceptual domains is the main motivation for the vocabulary repository:
    273273
    274274\begin{quotation}
    275 Originally, the vocabulary repository has been conceived to manage rather large and complex value domains, that do not fit easily in the DCR data model. Where the value domains are big (ISO 639-3) or can only be
     275Originally, the vocabulary repository has been conceived to manage rather large and complex value domains that do not fit easily in the DCR data model. Where the value domains are big (ISO 639-3) or can only be
    276276partially enumerated (organization names) ISOcat can't/shouldn't contain
    277277the value domains but just refer to CLAVAS, i.e., ISOcat wouldn't be a
    278 provider.\cite{Menzo2013mail}
     278provider. \cite{Menzo2013mail}
    279279\end{quotation}
    280280
     
    290290\end{lstlisting}
    291291
    292 A proposal by Windhouwer\cite{Menzo2013mail} for integration with CLAVAS foresees following extension:
      292A proposal by Windhouwer \cite{Menzo2013mail} for integration with CLAVAS foresees the following extension:
    293293
    294294\begin{lstlisting}
     
    298298\begin{quotation}
    299299\code{@href} points to the vocabulary. Actually a PID should be used in the context
    300 of ISOcat, but it is not clear how persistent are the vocabularies. This may pose a problem as part of DC specification may now have a different persistency then the core.
     300of ISOcat, but it is not clear how persistent are the vocabularies. This may pose a problem as part of DC specification may now have a different persistency than the core.
    301301
    302302\code{@type} could be \code{closed} or \code{open}. \code{closed}: only values in the vocabulary are
     
    304304\end{quotation}
    305305
    306 This yields a definition of the value domain for the data category, where the new rule pointing to the vocabulary is \emph{added} (cf. listing \ref{lst:dcif-conceptualDomain}), so that -- once the information from the DC specification gets into the schema -- tools that don't support vocabulary lookup but are capable of XSD/RNG validation, can still use the regular expression based definition.
     306This yields a definition of the value domain for the data category, where the new rule pointing to the vocabulary is \emph{added} (cf. listing \ref{lst:dcif-conceptualDomain}), so that -- once the information from the DC specification gets into the schema -- tools that don't support vocabulary lookup, but are capable of XSD/RNG validation, can still use the regular expression based definition.
    307307 
    308308\lstset{language=XML}
    309 \begin{lstlisting}[label=lst:dcif-conceptualDomain, caption=definition of conceptualDomain for the data category \concept{languageID} employing the proposed extension for pointing to a vocabulary]
     309\begin{lstlisting}[label=lst:dcif-conceptualDomain, caption=Definition of conceptualDomain for the data category \concept{languageID} employing the proposed extension for pointing to a vocabulary]
    310310  <dcif:conceptualDomain type="constrained">
    311311     <dcif:dataType>string</dcif:dataType>
     
    331331\end{figure*}
    332332
    333 It is important to emphasize, that anything stated in the DC specification is not binding (even if the DC is of type \var{closed}), but rather a non-normative hint or  recommendation. The authoritative source is the schema. A schema modeller binding an element in a schema to a data category can still decide to have other restriction for the values domain of that element then the ones suggested in the DC specification. This applies equally to the proposed vocabulary reference mechanism: The author of the data category suggests a vocabulary to be used for values of given data category, but the metadata modeller decides, if and how this vocabulary will be integrated into the modelled schema.
    334 
    335 There are basically two options, how the vocabulary can be integrated into the schema.
      333It is important to emphasize that anything stated in the DC specification is not binding (even if the DC is of type \var{closed}), but rather a non-normative hint or recommendation. The authoritative source is the schema. A schema modeller binding an element in a schema to a data category can still decide to have other restrictions for the value domain of that element than the ones suggested in the DC specification. This applies equally to the proposed vocabulary reference mechanism: The author of the data category suggests a vocabulary to be used for values of a given data category, but the metadata modeller decides if and how this vocabulary will be integrated into the modelled schema.
     334
      335There are basically two options for how the vocabulary can be integrated into the schema.
    336336One approach is to explicitly enumerate all the values from the vocabulary.
    337 Within CMD this has been done in the component for language-codes\furl{http://catalog.clarin.eu/ds/ComponentRegistry/?item=clarin.eu:cr1:c_1271859438110}. This method allows to strictly validate given metadata field, however there is clearly a limit to this approach in terms of a) size of the vocabulary\footnote{e.g. \xne{ISO-639} contains 7.679 items (language codes) adding some 2MB to each schema referencing it}, b) completeness -- most of the vocabularies cannot be seen as closed, i.e. they represent only a partial enumeration just providing a recommended label for an entity,
      337Within CMD this has been done in the component for language-codes\furl{http://catalog.clarin.eu/ds/ComponentRegistry/?item=clarin.eu:cr1:c_1271859438110}. This method allows strict validation of the given metadata field; however, there is clearly a limit to this approach in terms of a) the size of the vocabulary\footnote{e.g. \xne{ISO-639} contains 7,679 items (language codes), adding some 2MB to each schema referencing it}, b) completeness -- most of the vocabularies cannot be seen as closed, i.e. they represent only a partial enumeration just providing a recommended label for an entity,
    338338and c) stability or change rate -- even the supposedly fixed list of language-codes \xne{ISO-639-*} undergoes regular changes -- it is being updated semi-annually, with entries being added, deleted, merged and split.\furl{http://www-01.sil.org/iso639-3/changes.asp}
    339339
    340340The other ``soft'' alternative is to convey the information about data category and vocabulary in the schema as an annotation, either in an \code{<xs:appinfo>} element or by some attribute in a dedicated namespace. This method is already being employed in the Component Registry, indicating the data category of a generated element with the \code{@dcr:datcat} attribute.
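By way of illustration, such an annotated element declaration in a generated XSD could look as follows. The \code{@dcr:datcat} attribute is the existing convention; the \code{@vocab} attribute is only a hypothetical counterpart for the vocabulary reference, and the attribute values are placeholders:

\lstset{language=XML}
\begin{lstlisting}[caption=Sketch of a schema-level annotation linking an element to a data category and (hypothetically) to a vocabulary]
<xs:element name="LanguageName" type="xs:string"
            dcr:datcat="[PID of the data category]"
            vocab="[URI of the recommended vocabulary]"/>
\end{lstlisting}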
    341341
    342 Once the data category and vocabulary reference end up in the specification of the CMD profile and the derived XSD, the information can finally be used by client applications (like metadata editor)\footnote{Note though, that this is not a standard mechanism but rather a convention. The client application must implement it in order to be able to make use of it.}. The tool
      342Once the data category and vocabulary reference end up in the specification of the CMD profile and the derived XSD, the information can finally be used by client applications (like a metadata editor)\footnote{Note though that this is not a standard mechanism but rather a convention. The client application must implement it in order to be able to make use of it.}. The tool
    343343can use the reference to the data category to fetch explanations (semantic information) and translations from ISOcat, and it can access the autocomplete/search interface of the Vocabulary Service to offer the user suggestions from the recommended vocabulary (cf. figure \ref{fig:concept_linking}).
    344344
    345 The drawback of this variant is, that we gave up the validation. This
     345The drawback of this variant is that we gave up the validation. This
    346346isn't a problem if the vocabulary is of \code{@type=open}, e.g. \concept{organisation names}, but
    347 it is when the value domain is closed, e.g. \concept{languageId}. In the latter case,
     347it is when the value domain is closed, e.g. \concept{languageID}. In the latter case,
    348348the XSD generation could support both modes: a lax (smaller) version which
    349349doesn't contain the closed vocabulary as an enumeration and leaves it to
    350 the tool, and a strict version which does contain the vocabulary as an
      350the tool, and a strict version which does contain the vocabulary as an
    351351enumeration. Probably the latter should stay the default, but the client application could
    352352request the lax version leading to smaller and quicker XSD validation
    353353inside the tool.
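The difference between the two generated versions would lie only in the derived simple type; a sketch with illustrative type names and a truncated value set:

\lstset{language=XML}
\begin{lstlisting}[caption=Sketch of strict vs. lax value domain definitions for a closed vocabulary]
<!-- strict: the closed vocabulary materialized as an enumeration -->
<xs:simpleType name="languageID-strict">
   <xs:restriction base="xs:string">
      <xs:enumeration value="deu"/>
      <xs:enumeration value="eng"/>
      <!-- ... one entry per item in the vocabulary ... -->
   </xs:restriction>
</xs:simpleType>

<!-- lax: only the pattern; the vocabulary lookup is left to the tool -->
<xs:simpleType name="languageID-lax">
   <xs:restriction base="xs:string">
      <xs:pattern value="[a-z]{3}"/>
   </xs:restriction>
</xs:simpleType>
\end{lstlisting}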
    354354
    355 %However for the presumably default (and recommended) scenario, where the modeller wants to use the information from the data category, the \xne{Component Editor} could offer to take over the data type and the vocabulary reference from the linked DC specification.
     355%However, for the presumably default (and recommended) scenario, where the modeller wants to use the information from the data category, the \xne{Component Editor} could offer to take over the data type and the vocabulary reference from the linked DC specification.
    356356
    357357
    358358%%%%%%%%%%%%%%%%%
    359 \section{Other aspects of the infrastructure}
    360 While this work concentrates solely on the metadata, it is important to acknowledge that it is only one aspect of the infrastructure and its actual purpose -- the availability of resources. To announce and describe the resources by metadata is a necessary first step. However it is of little value, if the resources themselves are not accessible. We want to briefly mention at least two other important aspects: content repositories for storing the resources and federated content search for searching in the resources.
     359\section{Other Aspects of the Infrastructure}
      360While this work concentrates solely on the metadata, it is important to acknowledge that it is only one aspect of the infrastructure and its actual purpose -- the availability of resources. To announce and describe the resources by metadata is a necessary first step. However, it is of little value if the resources themselves are not accessible. We want to briefly mention at least two other important aspects: content repositories for storing the resources and federated content search for searching in the resources.
    361361
    362362\subsubsection{CLARIN Centres}
     
    367367\end{quotation}
    368368
    369 CLARIN imposes a number of criteria, that each centre needs to fulfill to become a CLARIN Centre\furl{http://www.clarin.eu/node/3767}\cite{CE-2013-0095}.
     369CLARIN imposes a number of criteria that each centre needs to fulfill to become a CLARIN Centre\furl{http://www.clarin.eu/node/3767} \cite{CE-2013-0095}.
    370370CLARIN also maintains a central registry, the \xne{Centre Registry}\furl{https://centerregistry-clarin.esc.rzg.mpg.de/}, holding structured information about every centre, meant as the primary entry point into the CLARIN network of centres.
    371371
    372 One core service of such centres are the content repositories, systems meant for long-term preservation and online provision of research data and resources. A number of centres have been identified that provide Depositing Services\furl{http://clarin.eu/3773}, i.e. allow third parties researchers (not just the home users) to store research data.
      372Among the core services of such centres are the content repositories, systems meant for long-term preservation and online provision of research data and resources. A number of centres have been identified that provide Depositing Services\furl{http://clarin.eu/3773}, i.e. they allow third-party researchers (not just the home users) to store research data.
    373373
    374374\begin{comment}
     
    394394\subsubsection{Federated Content Search}
    395395
    396 Another aspect of the availability of resources is, that while metadata can be harvested and indexed locally in one repository, this is not possible with the content itself, both due to the size of the data, but mainly due to legal obligations (licenses, copyright), restricting the access to and availability of the resources. CLARIN's answer to this problem is the task force \emph{Federated Content Search}\furl{http://www.clarin.eu/fcs} \cite{stehouwer2012fcs} aiming at establishing an architecture allowing to search simultaneously (via an aggregator) across a number of resources hosted by different content providers via a harmonized interface adhering to a common protocol. The agreed upon protocol is a compatible extension of the SRU/CQL protocol developed and endorsed by the Library of Congress as the XML- (and web)based successor of the Z39.50 \cite{Lynch1991}.
      396Another aspect of the availability of resources is that while metadata can be harvested and indexed locally in one repository, this is not possible with the content itself, due both to the size of the data and, mainly, to legal obligations (licenses, copyright) restricting the access to and availability of the resources. CLARIN's answer to this problem is the task force \emph{Federated Content Search}\furl{http://www.clarin.eu/fcs} \cite{stehouwer2012fcs} aiming at establishing an architecture allowing to search simultaneously (via an aggregator) across a number of resources hosted by different content providers via a harmonized interface adhering to a common protocol. The agreed-upon protocol is a compatible extension of the SRU/CQL protocol developed and endorsed by the Library of Congress as the XML- (and web-)based successor of Z39.50 \cite{Lynch1991}.
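In practice, such a query is sent as a standard SRU \code{searchRetrieve} request carrying a CQL query; a minimal sketch (the endpoint host is hypothetical, the parameters are standard SRU 1.2):

\begin{lstlisting}[caption=Sketch of an SRU/CQL request against a (hypothetical) FCS endpoint]
http://repository.example.org/sru?operation=searchRetrieve
       &version=1.2&query=Elephant&maximumRecords=10
\end{lstlisting}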
    397397
    398398Note that in practice the line between metadata and content data is not so clear -- usually there is a need to filter by metadata even when searching in content. Therefore, most content search engines also feature some kind of metadata filter. Thus, it seems reasonable to harmonize the search protocol and query language for metadata and content. This proposition is further elaborated on in \ref{cql}.
     
    400400\section{Summary}
    401401
    402 In this chapter we presented individual parts of the infrastructure, next to the core registries: ISOcat Data Category Registry, Component Registry and Relation Registry, that this work directly builds upon, a number of other services and application forming the CLARIN ecosystem were briefly introduced. A separate consideration was dedicated to the issue of controlled vocabularies together with a related module the Vocabulary Alignment Service (and its implementation OpenSKOS) that allows to manage vocabularies and use them in client application. Finally a few other aspects of the infrastructure, that are equally important, however not pertaining to the metadata level, were briefly tackled.
    403 
      402In this chapter, we presented the individual parts of the infrastructure: next to the core registries that this work directly builds upon (the ISOcat Data Category Registry, the Component Registry and the Relation Registry), a number of other services and applications forming the CLARIN ecosystem were briefly introduced. A separate consideration was dedicated to the issue of controlled vocabularies, together with a related module, the Vocabulary Alignment Service (and its implementation OpenSKOS), that allows managing vocabularies and using them in client applications. Finally, a few other aspects of the infrastructure that are equally important, however not pertaining to the metadata level, were briefly tackled.
     403
  • SMC4LRT/chapters/Introduction.tex

    r3776 r4117  
    44%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    55
    6 \section{Motivation / problem statement}
     6\section{Motivation / Problem Statement}
    77
     88While in the Digital Libraries community a consolidation already took place and global federated networks of digital library repositories are set up, in the field of Language Resources and Technology the landscape is still scattered, even though it can meanwhile look back at a decade of standardization and integration efforts. (Chapter \ref{ch:data} analyses the disparity in the data domain.)
    99
    10 This situation has been identified by the community and numerous standardization initiatives had been undertaken. The process has gained a new momentum thanks to large framework programmes introduced by the European Commission aimed at fostering the development of common large-scale international research infrastructures. One key player in this development is the project CLARIN (see section \ref{def:CLARIN}). The main objective of this initiative is to make language resources and technologies (LRT) more easily available to scholars by means of a common harmonized architecture. One core pillar of this architecture is the \emph{Component Metadata Infrastructure} (cf. \ref{def:CMDI}) -- a distributed system consisting of multiple interconnected modules aimed at creating and providing metadata for LRT in a coherent harmonized way.
      10This situation has been identified by the community and numerous standardization initiatives have been undertaken. The process has gained new momentum thanks to large framework programmes introduced by the European Commission aimed at fostering the development of common large-scale international research infrastructures. One key player in this development is the project CLARIN (see section \ref{def:CLARIN}). The main objective of this initiative is to make language resources and technologies (LRT) more easily available to scholars by means of a common harmonized architecture. One core pillar of this architecture is the \emph{Component Metadata Infrastructure} (CMDI, cf. \ref{def:CMDI}) -- a distributed system consisting of multiple interconnected modules aimed at creating and providing metadata for LRT in a coherent harmonized way.
    1111
    12 This work discusses one module within the Component Metadata Infrastructure -- the \emph{Semantic Mapping Component} -- dedicated to overcome or at least ease the semantic interoperability problem stemming from the heterogeneity of the resource descriptions, without the reductionist approach of imposing one common description schema for all resources.
     12This work discusses one module within the Component Metadata Infrastructure -- the \emph{Semantic Mapping Component} (SMC) -- dedicated to overcome or at least ease the semantic interoperability problem stemming from the heterogeneity of the resource descriptions, without the reductionist approach of imposing one common description schema for all resources.
    1313
    1414\section{Main Goal}
     
    4040Semantic interoperability has been one of the main concerns addressed by the CMDI and appropriate provisions were weaved into the underlying meta-model as well as all the modules of the infrastructure. The task of the crosswalk service -- the primary part of the SMC module -- is to collect the relevant information maintained in the registries of the infrastructure and process it to generate mappings, i.e. \emph{crosswalks} between fields in heterogeneous metadata schemas that can serve as basis for concept-based search.
    4141
    42 Thus, the goal is not primarily to produce the crosswalks but rather to develop the service serving existing ones.
     42Thus, the goal is not primarily to define new crosswalks but rather to develop a service serving existing ones.
    4343
    4444\subsubsection*{Concept-based query expansion}
     
    4848\paragraph{Example}
    4949Confronted with a user query searching in the notorious \concept{dublincore:title} the query has to be \emph{expanded} to
    50 all the semantically near fields (\emph{concept cluster}), that are however labelled (or even structured) differently in other schemas like:
      50all the semantically near fields (\emph{concept cluster}) that are, however, labelled (or even structured) differently in other schemas, like:
    5151
    5252\begin{quote}
     
    5454\end{quote}
    5555
    56 The expansion cannot be solved by simple string matching, as there are other fields labeled with the same (sub)strings but with different semantics, that shouldn't be considered:
     56The expansion cannot be solved by simple string matching, as there are other fields labelled with the same (sub)strings but with different semantics that shouldn't be considered:
    5757
    5858\begin{quote}
     
    6262\subsubsection*{Semantic interpretation}
    6363
    64 The problem of different labels for semantically similar or even identical entities is even more so virulent on the level of individual values in the fields of the instance data. A number of metadata fields (like \concept{organization} or \concept{resource type})  have a constrained value domain that yet cannot be explicitly exhaustively enumerated. This leads to a chronically inconsistent use of labels for referring to entities. (As the instance data shows, some organizations are referred to by more than 20 different labels.) Thus, one goal of this work is to propose a mechanism to map (string) values in selected fields to entities defined in corresponding vocabularies.
      64The problem of different labels for semantically similar or even identical entities is even more virulent on the level of individual values in the fields of the instance data. A number of metadata fields (like \concept{organization} or \concept{resource type}) have a constrained value domain that nevertheless cannot be explicitly and exhaustively enumerated. This leads to a chronically inconsistent use of labels for referring to entities. (As the evidence in the metadata records collected within CMDI shows, some organizations are referred to by more than 20 different labels.) Thus, one goal of this work is to propose a mechanism to map (string) values in selected fields to entities defined in corresponding vocabularies.
    6565
    6666\subsubsection*{Ontology-driven data exploration}
     
    7575
    7676\section{Method}
    77 We start with examining the existing data and with the description of the existing infrastructure in which this work is embedded.
      77We start by examining the existing data and by describing the existing infrastructure in which this work is embedded.
    7878
    7979Building on this groundwork, in accordance with the first subgoal, we lay out the design of the service for handling crosswalks and concept-based query expansion. We describe the workflow, the central methods and the role of the module relative to other parts of the infrastructure.
     
    9090Once the dataset is expressed in RDF, it can be exposed via a semantic web application and published as another nucleus of \emph{Linked Open Data} in the global \emph{Web Of Data}.
    9191
    92 A separate evaluation of the usability of the proposed semantic search solution is indicated, examining the user interaction with and display of the relevant additional information in the user search interface, however this issue can only be tackled marginally and will have to be outsourced into future work.
      92A separate evaluation of the usability of the proposed semantic search solution is indicated, examining the user interaction with and display of the relevant additional information in the user search interface; however, this issue can only be tackled marginally and will have to be deferred to future work.
    9393
    9494\section{Expected Results}
     
    108108\end{description}
    109109
    110 \section{Structure of the work}
     110\section{Structure of the Work}
    111111The work starts with examining the state of the art work in the two fields  language resources and technology and semantic web technologies in chapter \ref{ch:lit}. In chapter \ref{ch:data} we analyze the situation in the data domain of LRT metadata and in chapter \ref{ch:infra} we discuss the individual software components of the infrastructure underlying this work.
    112112
     
    116116The results are discussed in chapter \ref{ch:results}. Finally, in chapter \ref{ch:conclusions} we summarize the findings of the work and lay out where it could develop in the future.
    117117
    118 The auxiliary material accompanying the work is found in the appendix. After the administrative chapter \ref{ch:def} explaining the abbreviations and formatting conventions used throughout this work, full specifications of the used data models (\ref{ch:data-model-ref} and data samples (\ref{ch:cmd-sample}) are listed for reference, as well as the developer and user documentation for the technical solution of this work, the SMC module (\ref{ch:smc-docs}).
     118The auxiliary material accompanying the work is found in the appendix. After the administrative chapter \ref{ch:def} explaining the abbreviations and formatting conventions used throughout this work, full specifications of the used data models (\ref{ch:data-model-ref}) and data samples (\ref{ch:cmd-sample}) are listed for reference, as well as the developer and user documentation for the technical solution of this work, the SMC module (\ref{ch:smc-docs}).
    119119
    120120
  • SMC4LRT/chapters/Literature.tex

    r3776 r4117  
    44%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    55
    6 In this chapter we give a short overview of the development of large research infrastructures (with focus on those for language resources and technology), then we examine in more detail the hoist of work (methods and systems) on schema/ontology matching
      6In this chapter, we give a short overview of the development of large research infrastructures (with a focus on those for language resources and technology), then we examine in more detail the host of work (methods and systems) on schema/ontology matching
    77and review Semantic Web principles and technologies.
    88
     
    1717\xne{FLaReNet}\furl{http://www.flarenet.eu/} -- Fostering Language Resources Network -- running 2007 to 2010 concentrated rather on ``community and consensus building'' developing a common vision and mapping the field of LRT via survey.
    1818
    19 \xne{CLARIN}\furl{http://clarin.eu} -- Common Language Resources and Technology Infrastructure -- large research infrastructure providing sustainable access for scholars in the humanities and social sciences to digital language data, and especially its technical core the Component Metadata Infrastructure (CMDI)  -- a comprehensive architecture for harmonized handling of metadata\cite{Broeder2011} --
      19\xne{CLARIN}\furl{http://clarin.eu} -- Common Language Resources and Technology Infrastructure -- a large research infrastructure providing sustainable access for scholars in the humanities and social sciences to digital language data, and especially its technical core, the Component Metadata Infrastructure (CMDI) -- a comprehensive architecture for harmonized handling of metadata \cite{Broeder2011} --
    2020are the primary context of this work, therefore the description of this underlying infrastructure is detailed in separate chapter \ref{ch:infra}.
    2121Both above-mentioned projects can be seen as predecessors to CLARIN, the IMDI metadata model being one starting point for the development of CMDI.
     
     2323More of a sister project is the initiative \xne{DARIAH} -- Digital Research Infrastructure for the Arts and Humanities\furl{http://dariah.eu}. It has a broader scope, but has many personal ties as well as problems and solutions similar to CLARIN's. Therefore, there are efforts to intensify the cooperation between these two research infrastructures for digital humanities.
    2424
    25 \xne{META-SHARE} is another multinational project aiming to build an infrastructure for language resource\cite{Piperidis2012meta}, however focusing more on Human Language Technologies domain.\furl{http://meta-share.eu}
      25\xne{META-SHARE} is another multinational project aiming to build an infrastructure for language resources \cite{Piperidis2012meta}, however focusing more on the Human Language Technologies domain.\furl{http://meta-share.eu}
    2626
    2727\begin{quotation}
     28\noindent
    2829META-NET is designing and implementing META-SHARE, a sustainable network of repositories of language data, tools and related web services documented with high-quality metadata, aggregated in central inventories allowing for uniform search and access to resources. Data and tools can be both open and with restricted access rights, free and for-a-fee.
    2930\end{quotation}
    3031
    31 See \ref{def:META-SHARE} for more details about META-SHARE's catalog and metadata format.
     32See \ref{def:META-SHARE} for more details about META-SHARE's catalogue and metadata format.
    3233
    3334
     
    3637
    3738In a broader view we should also regard the activities in the domain of libraries and information sciences (LIS).
    38 Starting already in 1970's with connecting, exchanging and harmonizing their bibliographic catalogs, libraries were the early adopters and driving force in the field of search federation even before the era of internet (e.g. \xne{Linked Systems Project} \cite{Fenly1988}), the LIS community certainly has a long tradition, wealth of experience and robust solutions with respect to metadata aggregation and harmonization and exploitation.
      39Starting already in the 1970s with connecting, exchanging and harmonizing their bibliographic catalogues, libraries were the early adopters and driving force in the field of search federation even before the era of the internet (e.g. the \xne{Linked Systems Project} \cite{Fenly1988}). The LIS community thus certainly has a long tradition, a wealth of experience and robust solutions with respect to metadata aggregation, harmonization and exploitation.
    3940%, starting collaborative efforts in mid 70s
    4041
     
     4243 The biggest one is \xne{WorldCat}\furl{http://www.worldcat.org/} (totalling 273.7 million records \cite{OCLCAnnualReport2012}), powered by OCLC, a cooperative of over 72,000 libraries worldwide.
    4344
    44 In Europe, multiple recent initiatives have pursuit similar goals of pooling together the immense wealth of information sheltered in the many libraries:
     45In Europe, multiple recent initiatives have pursued similar goals of pooling together the immense wealth of information sheltered in the many libraries:
    4546\xne{The European Library}\furl{http://www.theeuropeanlibrary.org/tel4/} offers a search interface over more than 18 million digital items and almost 120 million bibliographic records from 48 National Libraries and leading European Research Libraries.
    4647
     
    5051Most recently, with \xne{Europeana Cloud}\furl{http://pro.europeana.eu/web/europeana-cloud} (2013 to 2015) another initiative in the realm of \xne{Europeana} has been started, a Best Practice Network, coordinated by The European Library, designed to ``establish a cloud-based system for Europeana and its aggregators, providing new content, new metadata, a new linked storage system, new tools and services for researchers and a new platform - Europeana Research''.
    5152
    52 The related catalogs and formats are described in the section \ref{sec:lib-formats}.
    53 
    54 
    55 \section{Existing crosswalks (services)}
     53The related catalogues and formats are described in the section \ref{sec:lib-formats}.
     54
     55
     56\section{Existing Crosswalks (Services)}
    5657
     5758Crosswalks as lists of equivalent fields from two schemas have been around for a long time, in the world of enterprise systems, e.g. to bridge to legacy systems, as well as in the LIS domain. \cite{Day2002crosswalks} lists a number of mappings between metadata formats, mostly between the Dublin Core and MARC families of formats.\footnote{\url{http://loc.gov/marc/marc2dc.html}, \url{http://www.loc.gov/marc/dccross.html}}
    5859
    59 However, besides being restricted in terms of covered formats, these crosswalks are just static correspondence lists, often just available as documents and only limited coverage of formats. One effort, that comes nearer to our idea of a service delivering crosswalks dynamically is the \xne{Metadata Crosswalk Service}\footnote{\url{http://www.oclc.org/developer/services/metadata-crosswalk-service}, \url{http://www.oclc.org/research/activities/xwalk.html}, (SOAP based)} offered by OCLC as part of \xne{Metadata Schema Transformation Services}\furl{http://www.oclc.org/research/activities/schematrans.html?urlm=160118}
      60However, besides being restricted in terms of covered formats, these crosswalks are just static correspondence lists, often available only as documents. One effort that comes nearer to our idea of a service delivering crosswalks dynamically is the \xne{Metadata Crosswalk Service}\footnote{\url{http://www.oclc.org/developer/services/metadata-crosswalk-service}, \url{http://www.oclc.org/research/activities/xwalk.html}, (SOAP based)} offered by OCLC as part of \xne{Metadata Schema Transformation Services}\furl{http://www.oclc.org/research/activities/schematrans.html?urlm=160118},
    6061
    6162\begin{quotation}
     
    6364\end{quotation}
    6465
    65 Although the website states ``Crosswalk Web Service is now a production system that has been incorporated into OCLC products and services'', the demo service\furl{http://errol.oclc.org/schemaTrans.oclc.org.search} is not accessible. Also, this service only offers crosswalks between formats relevant for the LIS community: \xne{Dublin Core, MARCXML, MARC-2709, MODS}. So, altogether the service does not seem suitable to be used as is for the purposes of this work. But it certainly can serve as inspiration as for the specification of the planned service.
     66Although the website states ``Crosswalk Web Service is now a production system that has been incorporated into OCLC products and services'', the demo service\furl{http://errol.oclc.org/schemaTrans.oclc.org.search} is not accessible. Also, this service only offers crosswalks between formats relevant for the LIS community: \xne{Dublin Core, MARCXML, MARC-2709, MODS}. So, altogether the service does not seem suitable to be used as is for the purposes of this work. But it certainly can serve as inspiration for the specification of the planned service.
    6667
    6768\begin{comment}
     
    7980\label{lit:schema-matching}
    8081
    81 As Shvaiko\cite{shvaiko2012ontology} states ``\emph{Ontology matching} is a solution to the semantic heterogeneity problem. It finds correspondences between semantically related entities of ontologies.''
    82 As such, it provides a very suitable methodical foundation for the problem at hand -- the \emph{semantic mapping}. (In sections \ref{sec:schema-matching-app} and \ref{sec:values2entities} we elaborate on the possible ways to apply these methods to the described problem.)
     82As Shvaiko \cite{shvaiko2012ontology} states ``\emph{Ontology matching} is a solution to the semantic heterogeneity problem. It finds correspondences between semantically related entities of ontologies.''
     83As such, it provides a very suitable methodical foundation for the problem at hand -- the \emph{semantic mapping}. (In sections \ref{sec:schema-matching-app} and \ref{sec:values2entities}, we elaborate on the possible ways to apply these methods to the described problem.)
    8384
    8485There is a plethora of work on methods and technology in the field of \emph{schema and ontology matching} as witnessed by a sizable number of publications providing overviews, surveys and classifications of existing work \cite{Kalfoglou2003, Shvaiko2008, Noy2005_ontologyalignment, Noy2004_semanticintegration, Shvaiko2005_classification} and most recently \cite{shvaiko2012ontology, amrouch2012survey}.
    8586
    86 %Shvaiko and Euzenat provide a summary of the key challenges\cite{Shvaiko2008} as well as a comprehensive survey of approaches for schema and ontology matching based on a proposed new classification of schema-based matching techniques\cite{}.
    87 
    8887Shvaiko and Euzenat also run the web page \url{http://www.ontologymatching.org/} dedicated to this topic and the related OAEI\footnote{Ontology Alignment Evaluation Initiative - \url{http://oaei.ontologymatching.org/}}, an ongoing effort to evaluate alignment tools based on various alignment tasks from different domains.
    8988
    90 Interestingly, \cite{shvaiko2012ontology} somewhat self-critically asks if after years of research``the field of ontology matching [is] still making progress?''.
     89Interestingly, \cite{shvaiko2012ontology} somewhat self-critically asks if after years of research ``the field of ontology matching [is] still making progress?''.
    9190
    9291\subsubsection{Method}
     
    113112
    114113\cite{EhrigSure2004} and \cite{amrouch2012survey} instead introduce \var{ontology mapping} when applying the task to individual entities, in the meaning of a function that ``for each concept (node) in ontology A [tries to] find a corresponding concept
    115 (node), which has the same or similar semantics, in ontology B and vice verse''. In the meaning as result it is ``formal expression describing a semantic relationship between two (or more) concepts belonging to two (or more) different ontologies''.
    116 
    117 \cite{EhrigSure2004} further specify the mapping function as based on a similarity function, that for a pair of entities from two (or more) ontologies computes a ratio indicating the semantic proximity of the two entities.
     114(node), which has the same or similar semantics, in ontology B and vice versa''. In the meaning of a result, it is a ``formal expression describing a semantic relationship between two (or more) concepts belonging to two (or more) different ontologies''.
     115
     116\cite{EhrigSure2004} further specify the mapping function as based on a similarity function that for a pair of entities from two (or more) ontologies computes a ratio indicating the semantic proximity of the two entities.
    118117
    119118\begin{defcap}[!ht]
     
    135134\cite{Algergawy2010} classifies, reviews, and experimentally compares major methods of element similarity measures and their combinations. \cite{shvaiko2012ontology}, comparing a number of recent systems, finds that ``semantic and extensional methods are still rarely employed. In fact, most of the approaches are quite often based only on terminological and structural methods.''
    136135
    137 \cite{Ehrig2006} employs this \var{similarity} function over single entities to derive the notion of \var{ontology similarity} as ``based on similarity of pairs of single entities from the different ontologies''. This is operationalized as some kind of aggregating function\cite{ehrig2004qom}, that combines all similiarity measures (mostly modulated by custom weighting) computed for pairs of single entities again into one value (from the \var{[0,1]} range) expressing the similarity ratio of the two ontologies being compared. (The employment of weights allows to apply machine learning approaches for optimization of the results.)
     136\cite{Ehrig2006} employs this \var{similarity} function over single entities to derive the notion of \var{ontology similarity} as ``based on similarity of pairs of single entities from the different ontologies''. This is operationalized as some kind of aggregating function \cite{ehrig2004qom} that combines all similarity measures (mostly modulated by custom weighting) computed for pairs of single entities again into one value (from the \var{[0,1]} range) expressing the similarity ratio of the two ontologies being compared. (The employment of weights allows to apply machine learning approaches for optimization of the results.)
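The aggregation step can be illustrated with a minimal sketch (hypothetical uniform weights and a toy label-based string similarity from the standard library; the actual systems combine far more sophisticated measures):

```python
from difflib import SequenceMatcher

def label_similarity(a, b):
    """Toy element-level measure: normalized string similarity of two labels."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def ontology_similarity(entities_a, entities_b, weights=None):
    """Aggregate pairwise element similarities into a single value in [0,1]:
    for each entity of A take its best match in B, then a weighted mean."""
    best_scores = [max(label_similarity(a, b) for b in entities_b)
                   for a in entities_a]
    if weights is None:
        weights = [1.0] * len(best_scores)  # uniform weighting (could be learned)
    return sum(w * s for w, s in zip(weights, best_scores)) / sum(weights)

score = ontology_similarity(["title", "creator"], ["Title", "author", "date"])
```

Making the weights explicit, as here, is what opens the door to the machine-learning optimization mentioned above.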
    138137
    139138Thus, \var{ontology similarity} is a much weaker assertion than \var{ontology alignment}; in fact, the computed similarity is interpreted to assert ontology alignment: the aggregated similarity above a defined threshold indicates an alignment.
     
    149148\end{enumerate}
    150149
    151 In  contrast, \cite{jimenez2012large} in their system \xne{LogMap2} reduce the process into just two steps: computation of mapping candidates (maximise recall) and assessment of the candidates (maximize precision), that however correspond  to the steps 2 and 3 of the above procedure and in fact the other steps are implicitly present in the described system.
     150In contrast, \cite{jimenez2012large} in their system \xne{LogMap2} reduce the process to just two steps: computation of mapping candidates (maximize recall) and assessment of the candidates (maximize precision), which, however, correspond to steps 2 and 3 of the above procedure; in fact, the other steps are implicitly present in the described system.
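Caricatured in a few lines, such a two-phase process might look as follows (purely illustrative; the thresholds and the toy string similarity are invented for this sketch, whereas LogMap2 itself relies on lexical indexation and logic-based assessment):

```python
from difflib import SequenceMatcher

def similarity(a, b):
    # toy measure: normalized string similarity of the two labels
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def compute_candidates(source, target, loose=0.5):
    """Phase 1: over-generate mapping candidates (maximize recall)."""
    return [(s, t, similarity(s, t))
            for s in source for t in target if similarity(s, t) >= loose]

def assess(candidates, strict=0.8):
    """Phase 2: keep only the confident candidates (maximize precision)."""
    return [(s, t) for s, t, score in candidates if score >= strict]

candidates = compute_candidates(["organisation", "language"],
                                ["organization", "date"])
final = assess(candidates)
```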
    152151
    153152
     
    155154A number of existing systems for schema/ontology matching/alignment are collected in the above-mentioned overview publications:
    156155
    157 \xne{IF-Map} \cite{kalfoglou2003if}, \xne{QOM} \cite{ehrig2004qom}, \xne{FOAM} \cite{EhrigSure2005}, \xne{Similarity Flooding (SF)} \cite{melnik}, \xne{S-Match} \cite{Giunchiglia2007_semanticmatching}, the \xne{Prompt} tools \cite{Noy2003_theprompt} integrating with \xne{Protégé} or \xne{COMA++} \cite{Aumueller2005}, \xne{Chimaera}. Additionally, \cite{shvaiko2012ontology} lists and evaluates some more recent contributions: \xne{SAMBO, Falcon, RiMOM, ASMOV, Anchor-Flood, AgreementMaker}.
    158 
    159 All of the tools use multiple methods as described in the previous section, exploiting both element as well as structural features and applying some kind of composition or aggregation of the computed atomic measures, to arrive to a alignment assertion.
     156\xne{IF-Map} \cite{kalfoglou2003if}, \xne{QOM} \cite{ehrig2004qom}, \xne{FOAM} \cite{EhrigSure2005}, \xne{Similarity Flooding (SF)} \cite{melnik2002similarity}, \xne{S-Match} \cite{Giunchiglia2007_semanticmatching}, the \xne{Prompt} tools \cite{Noy2003_theprompt} integrating with \xne{Protégé} or \xne{COMA++} \cite{Aumueller2005}, \xne{Chimaera}. Additionally, \cite{shvaiko2012ontology} lists and evaluates some more recent contributions: \xne{SAMBO, Falcon, RiMOM, ASMOV, Anchor-Flood, AgreementMaker}.
     157
     158All of the tools use multiple methods as described in the previous section, exploiting both element and structural features and applying some kind of composition or aggregation of the computed atomic measures to arrive at an alignment assertion.
    160159
    161160Next to OWL, the input format supported by all the systems, some also accept XML Schemas (\xne{COMA++, SF, Cupid, SMatch}),
     
    169168\section{Semantic Web -- Linked Open Data}
    170169
    171 Linked Data paradigm\cite{TimBL2006} for publishing data on the web is increasingly been taken up by data providers across many disciplines \cite{bizer2009linked}. \cite{HeathBizer2011} gives comprehensive overview of the principles of Linked Data with practical examples and current applications.
     170The Linked Data paradigm \cite{TimBL2006} for publishing data on the web is increasingly being taken up by data providers across many disciplines \cite{bizer2009linked}. \cite{HeathBizer2011} gives a comprehensive overview of the principles of Linked Data with practical examples and current applications.
    172171
    173172\subsubsection{Semantic Web - Technical solutions / Server applications}
    174173\label{semweb-tech}
    175174
    176 The provision of the produced semantic resources on the web requires technical solutions to store the RDF triples, query them efficiently via SPARQL\cite{SPARQL2008} and \textit{idealiter} expose them via a web interface to the users.
     175The provision of the produced semantic resources on the web requires technical solutions to store the RDF triples, query them efficiently via SPARQL \cite{SPARQL2008} and \textit{idealiter} expose them via a web interface to the users.
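At its core, such a store has to answer pattern queries against a set of subject-predicate-object statements. The toy matcher below (a stand-in for a real SPARQL engine; the resource names are invented) illustrates the principle of variable binding:

```python
def match(triples, pattern):
    """Return one binding dict per triple matching the pattern;
    tokens starting with '?' are variables (repeated variables are
    not checked for consistency in this sketch)."""
    results = []
    for triple in triples:
        binding = {}
        for part, term in zip(pattern, triple):
            if part.startswith("?"):
                binding[part] = term
            elif part != term:
                break  # constant mismatch: triple does not match
        else:
            results.append(binding)
    return results

triples = [
    ("ex:resource1", "dc:title", "A Grammar of Aari"),
    ("ex:resource1", "dc:creator", "ex:person7"),
    ("ex:resource2", "dc:title", "WALS Online"),
]
# analogous to:  SELECT ?s ?t WHERE { ?s dc:title ?t }
titles = match(triples, ("?s", "dc:title", "?t"))
```

Production triple stores add to this exact-match core the indexing, join processing and persistence that make SPARQL queries over hundreds of millions of triples feasible.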
    177176
    178177Meanwhile, a number of RDF triple store solutions relying on native, DBMS-backed, or hybrid persistence layers are available: open-source solutions like \xne{Jena, Sesame} or \xne{BigData} as well as commercial solutions like \xne{AllegroGraph, OWLIM, Virtuoso}.
    179178
    180 A qualitative and quantitative study\cite{Haslhofer2011europeana}   in the context of Europeana evaluated a number of RDF stores (using the whole Europeana EDM data set = 382,629,063 triples as data load) and came to the conclusion, that ``certain RDF stores, notably OpenLink Virtuoso and 4Store'' can handle the large test dataset.
    181 
    182 \xne{OpenLink Virtuoso Universal Server}\furl{http://virtuoso.openlinksw.com} is hybrid storage solution for a range of data models, including relational data, RDF and XML, and free text documents.\cite{Erling2009Virtuoso, Haslhofer2011europeana}
    183 Virtuoso is used to host many important Linked Data sets, e.g., DBpedia\furl{http://dbpedia.org} \cite{auer2007dbpedia}.
    184 Virtuoso is offered both as commercial and open-source version license models exist.
     179A qualitative and quantitative study \cite{Haslhofer2011europeana} in the context of Europeana evaluated a number of RDF stores (using the whole Europeana EDM data set = 382,629,063 triples as data load) and came to the conclusion that ``certain RDF stores, notably OpenLink Virtuoso and 4Store'' can handle the large test dataset.
     180
     181\xne{OpenLink Virtuoso Universal Server}\furl{http://virtuoso.openlinksw.com} is a hybrid storage solution for a range of data models, including relational data, RDF and XML, and free text documents \cite{Erling2009Virtuoso, Haslhofer2011europeana}.
     182Virtuoso is used to host many important Linked Data sets, e.g. DBpedia\furl{http://dbpedia.org} \cite{auer2007dbpedia}.
     183Virtuoso is offered under both commercial and open-source license models.
    185184
    186185Another solution worth examining is the \xne{Linked Media Framework}\furl{http://code.google.com/p/lmf/} -- ``easy-to-setup server application that bundles together three Apache open source projects to offer some advanced services for linked media management'': publishing legacy data as linked data, semantic search by enriching data with content from the Linked Data Cloud, using SKOS thesaurus for information extraction.
     
    206205There also exists a sizable number of stand-alone solutions (\xne{Ontorama, FOAFnaut, IsaViz, GKB-Editor} and more), though often bound to a specific dataset or data type (\xne{Wordnet, FOAF, Cyc}).
    207206
    208 There is also plenty of general graph visualization tools, that can be adopted for viewing the RDF data as graph, like the traditional graph layouting tool \xne{GraphViz dot}, or more recently \xne{Gephi} \cite{bastian2009gephi}, a stand-alone interactive tool for graph visualization with a number of layouting algorithms and display options. A rather recent generic visualization javascript library \xne{d3}\footnote{\url{http://d3js.org}} % \cite{bostock2011d3} seems especially appealing thanks to its data-driven paradigm, dedicated support for graphs with integrated customizable graph layouting algorithm and -- being pure javascript -- allowing web-based solutions.
     207There are also plenty of general graph visualization tools that can be adapted for viewing the RDF data as a graph, like the traditional graph layouting tool \xne{GraphViz dot}, or more recently \xne{Gephi} \cite{Bastian2009gephi}, a stand-alone interactive tool for graph visualization with a number of layouting algorithms and display options. A rather recent generic visualization javascript library, \xne{d3}\footnote{\url{http://d3js.org}} \cite{bostock2011d3}, seems especially appealing thanks to its data-driven paradigm, dedicated support for graphs with an integrated customizable graph layouting algorithm and -- being pure javascript -- allowing web-based solutions.
    209208
    210209%Most recently a web-based version of this versatile tool has been released\furl{http://protegewiki.stanford.edu/wiki/WebProtege} that supports collaborative ontology development
    211210
    212 The solutions are rather sparse when it comes to more advanced visualizations, beyond the simple one to one display of the data model graph as a visual graph, especially the visualization of ontology mapping and alignment. Besides \xne{OLA} \cite{euzenat2004ola}, \xne{PromptViz} \cite{Noy2003_theprompt} and \xne{CogZ} \cite{falconer2009cogz} we would like to point out one solution developed at the IFS of the Technical University in Vienna \cite{lanzenberger2006alviz}, \xne{AlViz}  -- a tool which visually supports semi-automatic alignment of ontologies. It is implemented as a ``multiple-view plug-in for Protege using J-Trees and Graphs. Based on similarity measures of an ontology matching algorithm AlViz helps to assess and optimize the alignment results.'' It applies visual clues like colouring to indicate the computed similarity of concepts between two ontologies and clustering for reducing the complexity of the displayed datasets (cf. figure \ref{fig:alviz}). Unfortunately, the development of this very promising research prototype seems to have stalled, the last available version being from 2009.
     211The solutions are rather sparse when it comes to more advanced visualizations, beyond the simple one-to-one display of the data model graph as a visual graph, especially the visualization of ontology mapping and alignment. Besides \xne{OLA} \cite{euzenat2004ola}, \xne{PromptViz} \cite{Noy2003_theprompt} and \xne{CogZ} \cite{falconer2009cogz} we would like to point out one solution developed at the IFS of the Technical University in Vienna \cite{lanzenberger2006alviz}, \xne{AlViz}, a tool that visually supports semi-automatic alignment of ontologies. It is implemented as a ``multiple-view plug-in for Protege using J-Trees and Graphs. Based on similarity measures of an ontology matching algorithm AlViz helps to assess and optimize the alignment results.'' It applies visual cues like colouring to indicate the computed similarity of concepts between two ontologies and clustering for reducing the complexity of the displayed datasets (cf. figure \ref{fig:alviz}). Unfortunately, the development of this very promising research prototype seems to have stalled; the last available version is from 2009.
    213212
    214213\begin{figure*}
     
    228227\subsubsection{Linguistic ontologies}
    229228
    230 One prominent instance of a linguistic ontology is \xne{General Ontology for Linguistic Description} or GOLD\cite{Farrar2003}\furl{http://linguistics-ontology.org},
    231 that ``gives a formalized account of the most basic categories and relations (the "atoms") used in the scientific description of human language, attempting to codify the general knowledge of the field. The motivation is to`` facilite automated reasoning over linguistic data and help establish the basic concepts through which intelligent search can be carried out''.
     229One prominent instance of a linguistic ontology is \xne{General Ontology for Linguistic Description} or GOLD \cite{Farrar2003}\furl{http://linguistics-ontology.org},
     230that ``gives a formalized account of the most basic categories and relations (the `atoms') used in the scientific description of human language, attempting to codify the general knowledge of the field''. The motivation is to ``facilitate automated reasoning over linguistic data and help establish the basic concepts through which intelligent search can be carried out''.
    232231
    233232In line with the aspiration ``to be compatible with the general goals of the Semantic Web'', the dataset is provided via a web application as well as a dump in OWL format\furl{http://linguistics-ontology.org/gold-2010.owl} \cite{GOLD2010}.
    234233
    235234
    236 Founded in 1934, SIL International\furl{http://www.sil.org/about-sil} (originally known as the Summer Institute of Linguistics, Inc) is a leader in the identification and documentation of the world's languages. Results of this research are published in Ethnologue: Languages of the World\furl{http://www.ethnologue.com/} \cite{grimes2000ethnologue}, a comprehensive catalog of the world's nearly 7,000 living languages. SIL also maintains Language \& Culture Archives a large collection of all kinds resources in the ethnolinguistic domain \furl{http://www.sil.org/resources/language-culture-archives}.
     235Founded in 1934, SIL International\furl{http://www.sil.org/about-sil} (originally known as the Summer Institute of Linguistics, Inc.) is a leader in the identification and documentation of the world's languages. Results of this research are published in Ethnologue: Languages of the World\furl{http://www.ethnologue.com/} \cite{grimes2000ethnologue}, a comprehensive catalogue of the world's nearly 7,000 living languages. SIL also maintains the Language \& Culture Archives, a large collection of all kinds of resources in the ethnolinguistic domain \furl{http://www.sil.org/resources/language-culture-archives}.
    237236
    238237 World Atlas of Language Structures (WALS) \furl{http://WALS.info} \cite{wals2011}
    239 is ``a large database of structural (phonological, grammatical, lexical) properties of languages gathered from descriptive materials (such as reference grammars) ''. First appeared 2005, current online version published in 2011 provides a compendium of detailed expert definitions of individual linguistic features, accompanied by a sophisticated web interface integrating the information on linguistic features with their occurrence in the world languages and their geographical distribution.
    240 
    241 Simons \cite{Simons2003developing} developed a Semantic Interpretation Language (SIL) that is used to define the meaning of the elements and attributes in an XML markup schema in terms of abstract concepts defined in a formal semantic schema
     238is ``a large database of structural (phonological, grammatical, lexical) properties of languages gathered from descriptive materials (such as reference grammars)''. First published in 2005, the current online version (2011) provides a compendium of detailed expert definitions of individual linguistic features, accompanied by a sophisticated web interface integrating the information on linguistic features with their occurrence in the world's languages and their geographical distribution.
     239
     240Simons \cite{Simons2003developing} developed a Semantic Interpretation Language (SIL) that is used to define the meaning of the elements and attributes in an XML markup schema in terms of abstract concepts defined in a formal semantic schema.
    242241Extending on this work, Simons et al. \cite{Simons2004semantics} propose a method for mapping linguistic descriptions in plain XML into semantically rich RDF/OWL, employing the GOLD ontology as the target semantic schema.
    243242
    244 These ontologies can be used by (``ontologized'') Lexicons refer to them to describe linguistic properties of the Lexical Entries, as opposed to linking to Domain Ontologies to anchor Senses/Meanings.
    245 
    246 
    247 Work on Semantic Interpretation Language as well as the GOLD ontology can be seen as conceptual predecessor of the Data Category Registry a ISO-standardized procedure for defining and standardizing ``widely accepted linguistic concepts'', that is at the core of the CLARIN's metadata infrastructure (cf. \ref{def:DCR}).
    248 Although not exactly an ontology in the common sense of
    249 Although (by design) this registry does not contain any relations between concepts,
    250 the central entities are concepts and not lexical items, thus it can be seen as a proto-ontology.
     243These ontologies can be used by (``ontologized'') lexicons, which refer to them to describe linguistic properties of the Lexical Entries, as opposed to linking to Domain Ontologies to anchor Senses/Meanings.
     244
     245
     246Work on Semantic Interpretation Language as well as the GOLD ontology can be seen as a conceptual predecessor of the Data Category Registry, an ISO-standardized procedure for defining and standardizing ``widely accepted linguistic concepts'' that is at the core of CLARIN's metadata infrastructure (cf. \ref{def:DCR}).
     247Although not exactly an ontology in the common sense --
     248given that this registry (by design) does not contain any relations between concepts --
     249the central entities are concepts and not lexical items, thus it can be seen as a semantic resource.
    251250Another indication of the heritage is the fact that concepts of the GOLD ontology were migrated into ISOcat (495 items) in 2010.
    252251
    253252Notice that although this work is concerned with language resources, it operates primarily on the metadata level; thus the overlap with linguistic ontologies codifying the discipline-specific linguistic terminology is rather marginal (perhaps on the level of description of specific linguistic aspects of given resources).
    254253
    255 \subsubsection{Lexicalised ontologies,``ontologized'' lexicons}
     254\subsubsection{Lexicalised ontologies, ``ontologized'' lexicons}
    256255
    257256The other type of relation between ontologies and linguistics or language are lexicalised ontologies. Hirst \cite{Hirst2009} elaborates on the differences between ontology and lexicon and the possibility to reuse lexicons for development of ontologies.
     
    259258In a number of works, Buitelaar, McCrae et al. \cite{Buitelaar2009, buitelaar2010ontology, McCrae2010c, buitelaar2011ontology, Mccrae2012interchanging} argue for ``associating linguistic information with ontologies'' or ``ontology lexicalisation'' and draw attention to lexical and linguistic issues in knowledge representation in general. This basic idea lies behind the series of proposed models \xne{LingInfo}, \xne{LexOnto}, \xne{LexInfo} and, most recently, \xne{lemon}, aimed at allowing complex lexical information for such ontologies and at describing the relationship between the lexicon and the ontology.
    260259The most recent in this line, \xne{lemon} or \xne{lexicon model for ontologies}, defines ``a formal model for the proper representation of the continuum between: i) ontology semantics; ii) terminology that is used to convey this in natural
    261 language; and iii) linguistic information on these terms and their constituent lexical units'', in essence enabling the creation of a lexicon for a given ontology, adopting the principle of ``semantics by reference", no complex semantic in-
    262 formation needs to be stated in the lexicon.
    263 a clear separation of the lexical layer and the ontological layer.
    264 
    265 Lemon builds on existing work, next to the LexInfo and LIR ontology-lexicon models.
    266 and in particular on global standards: W3C standard: SKOS (Simple Knowledge Organization System) \cite{SKOS2009} and ISO standards the Lexical Markup Framework (ISO 24613:2008 \cite{ISO24613:2008}) and
    267 and Specification of Data Categories, Data Category Registry (ISO 12620:2009 \cite{ISO12620:2009})
    268 
    269 Lexical Markup Framework LMF \cite{Francopoulo2006LMF, ISO24613:2008} defines a metamodel for representing data in lexical databases used with monolingual and multilingual computer applications, provides a RDF serialization (?!?!).
     260language; and iii) linguistic information on these terms and their constituent lexical units''.
     261In essence, \xne{lemon} enables the creation of a lexicon for a given ontology, adopting the principle of ``semantics by reference''. No complex semantic information needs to be stated in the lexicon, ensuring (or at least fostering) a clear separation of the lexical layer and the ontological layer.
     262
     263Lemon builds on existing work, next to the LexInfo and LIR ontology-lexicon models, and in particular on global standards: the W3C standard SKOS (Simple Knowledge Organization System) \cite{SKOS2009} and the ISO standards Lexical Markup Framework (ISO 24613:2008 \cite{ISO24613:2008}) and Specification of Data Categories / Data Category Registry (ISO 12620:2009 \cite{ISO12620:2009}).
     264
     265The Lexical Markup Framework (LMF) \cite{Francopoulo2006LMF, ISO24613:2008} defines a metamodel for representing data in lexical databases used with monolingual and multilingual computer applications. LMF also specifies an RDF serialization.
    270266
    271267An overview of current developments in application of the linked data paradigm for linguistic data collections was given at the  workshop Linked Data in Linguistics\furl{http://ldl2012.lod2.eu/} 2012 \cite{ldl2012}.
    272268
    273269
    274 The primary motivation for linguistic ontologies like \xne{lemon} are the tasks ontology-based information extraction, ontology learning and population from text, where the entities are often referred to by non-nominal word forms and with ambiguous semantics. Given, that the discussed collection contains mainly highly structured data referencing entities in their nominal form, linguistic ontologies are not directly relevant for this work.
     270The primary motivation for linguistic ontologies like \xne{lemon} are the tasks ontology-based information extraction, ontology learning and population from text, where the entities are often referred to by non-nominal word forms and with ambiguous semantics. Given that the discussed collection contains mainly highly structured data referencing entities in their nominal form, linguistic ontologies are not directly relevant for this work.
    275271
    276272
    277273\section{Summary}
    278 This chapter concentrated on the current affairs/developments regarding the infrastructures for Language Resources and Technology and on the other hand gave an overview of the state of the art regarding methods to be applied in this work: Semantic Web Technologies, Ontology Mapping and Ontology Visualization.
     274This chapter concentrated, on the one hand, on the current developments regarding the infrastructures for Language Resources and Technology and, on the other hand, gave an overview of the state of the art regarding methods to be applied in this work: Semantic Web Technologies, Ontology Mapping and Ontology Visualization.
  • SMC4LRT/chapters/Results.tex

    r3776 r4117  
    66In the subsequent two sections, we explore a few specific aspects of the CMD data domain -- regarding the usage of the data categories (\ref{sec:explore-datcats}) and the integration of existing formats (\ref{sec:explore-formats}). While these topics are not directly results of this work, the presented analyses are. They were made possible by the technical solution of this work, yield a valuable test case for the usefulness of the work and are an indispensable prerequisite for the necessary coordination and curation work being carried out by the CMDI community.
    77
    8 \section{Current status of the infrastructure}
     8\section{Current Status of the Infrastructure}
    99Before we get to the results of this work,  we briefly summarize the current state of affairs within the CLARIN infrastructure at large to help contextualize the actual results.
    1010
    11 \subsection{CMDI - services}
     11\subsection{CMDI -- Services}
    1212The main services of the infrastructure have been in stable production for the last two years.
    1313The Relation Registry is operational as an early prototype.
    1414Three instances of \xne{OpenSKOS} are running, one of them being hosted by \xne{ACDH}.
    1515
    16 \subsection{CMDI - data}
     16\subsection{CMDI -- Data}
    1717More than 130 profiles are defined. (See table \ref{table:dev_profiles} for more details about profiles.)
    1818The official CLARIN harvester\furl{http://catalog.clarin.eu/oai-harvester/} collects data from 69 providers on a daily basis.
    1919The collection amounts to over 550.000 records in more than 60 distinct profiles.
    2020
    21 \subsection{ACDH - the home of SMC}
    22        
     21\subsection{ACDH -- The Home of SMC}
     22\label{acdh}   
    2323Within CLARIN-AT a new centre has been brought to life, the Austrian Centre for Digital Humanities, with the mission to foster the digital research paradigm in the humanities. It is designed to provide depositing and publishing services to the DH community, as well as infrastructural services that are part of the CLARIN Metadata Infrastructure. SMC is one of these services provided by this centre.
    2424Figure \ref{fig:acdh_context} sketches the broader context of \xne{ACDH} and its different roles.
    2525
    2626%%%%%%%%%%%%%%%%
    27 \section {Technical solution}
     27\section{Technical Solution}
    2828With this work we delivered a module embedded in a larger metadata infrastructure, aimed at supporting the semantic interoperability across the heterogeneous data in this infrastructure. The module consists of multiple interrelated components. The technical specification of the module can be found in chapter \ref{ch:design}. A prototypical implementation has been developed for the three main parts of the system. The code of this implementation is maintained in the central CMDI code repository\footnote {\url{http://svn.clarin.eu/SMC}}.
    2929
     
    3131\\
    3232
    33 \url{http://clarin.arz.oeaw.ac.at/smc} (soon: \url{http://acdh.ac.at/smc})
    34 
    35 
    36 \subsection{SMC - crosswalks service}
    37 the crosswalk service as a REST web service
    38 
    39 exposes an interface that provides mappings between search indexes as defined in \ref{sec:cx}
    40 
    41 This interface is available as part of the smc application:
    42 
    43 \url{http://clarin.arz.oeaw.ac.at/smc/cx}
    44 
    45 \subsection{SMC - as a module within Metadata Repository}
    46 The SMC is also integrated as module with the Metadata Repository enabling \emph{semantic search} over the joint metadata domain.
    47 
    48 \url{http://clarin.arz.oeaw.ac.at/mdrepo/} (module not integrated yet )
    49 
    50 \subsection{SMC Browser -- advanced interactive user interface}
     33\url{http://clarin.oeaw.ac.at/smc/}
     34
     35
     36\subsection{SMC -- Crosswalks Service}
      37As a REST web service, the crosswalk service exposes an interface that provides mappings between search indexes as defined in \ref{sec:cx}. This interface is available via the wrapping smc application:
     38
     39\url{http://clarin.oeaw.ac.at/smc/cx}
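To illustrate the underlying idea (not the actual service API, whose parameters are not reproduced here), a minimal sketch of a concept-based crosswalk lookup; all index names and data category identifiers below are hypothetical:

```python
# Minimal sketch of a concept-based crosswalk lookup.
# All index names and data category identifiers are hypothetical
# illustrations, not the actual SMC data.

# Each search index (schema.field) is bound to a data category.
INDEX_TO_DATCAT = {
    "olac.title": "DC-2545",
    "teiHeader.titleStmt.title": "DC-2545",
    "MetaShare.resourceName": "DC-2544",
    "cmd.GeneralInfo.name": "DC-2544",
}

def crosswalk(index_name: str) -> list[str]:
    """Return all index names bound to the same data category."""
    datcat = INDEX_TO_DATCAT.get(index_name)
    if datcat is None:
        return [index_name]
    return sorted(i for i, d in INDEX_TO_DATCAT.items() if d == datcat)

# A query against one index is expanded to all equivalent indexes:
print(crosswalk("olac.title"))
```

A search application can thus expand a query on one field to all fields sharing the same concept, raising recall without the user knowing the individual schemas.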
     40
     41\subsection{SMC -- as a Module within Metadata Repository}
      42The SMC will also be integrated as a module into the Metadata Repository, enabling \emph{semantic search} over the joint metadata domain.
     43
     44\url{http://clarin.oeaw.ac.at/mdrepo/}
     45
     46\subsection{SMC Browser -- Advanced Interactive User Interface}
    5147
    5248SMC Browser is an advanced web-based visualization application to explore the complex dataset of the \xne{Component Metadata Infrastructure} by visualizing its structure as an interactive graph. In particular, it enables the metadata modeller to examine the reuse of components or DCs in different profiles. The graph is accompanied by numerical statistics about the dataset as a whole and about individual items (profiles, components, data categories), a set of example results and user documentation. Details about design and implementation can be found in \ref{smc-browser}. The publicly available instance is maintained under:
    5349
    54 \url{http://clarin.arz.oeaw.ac.at/smc-browser}
     50\url{http://clarin.oeaw.ac.at/smc-browser}
    5551
    5652\begin{figure*}
     
    6359
    6460
    65 %%%%%%%%%%%%%%%555
    66 \section{Exploring the CMD data -- SMC reports}
     61%%%%%%%%%%%%%%%
     62\section{Exploring the CMD Data -- SMC Reports}
    6763SMC reports is a (growing) set of documents analyzing specific phenomena in the CMD data domain, created making extensive use of the visual and numerical output of the \xne{SMC Browser}. In this section, we deliver a few examples of these analyses. A complete, up-to-date listing is maintained on the SMC website:
    6864
    69 \url{http://clarin.aac.ac.at/smc/reports}
    70 
    71 \subsection{Usage of data categories}
     65\url{http://clarin.oeaw.ac.at/smc-browser/docs/reports.html}
     66
     67\subsection{Usage of Data Categories}
    7268\label{sec:explore-datcats}
    7369At the core of the whole SMC (and CMDI) are the data categories as basic semantic building blocks or anchors.
     
    9086\includegraphics[width=1\textwidth]{images/SMC-export_language_custom_v2c.pdf}
    9187\end{center}
    92 \caption{The four main \concept{Language} data categories and in which profiles they are being used}
      88\caption{The four main \concept{Language} data categories and the profiles they are used in}
    9389\label{fig:language_datcats}
    9490\end{figure*}
     
    10399Again, the main DC \concept{resourceName\#DC-2544}, used in 74 profiles, together with the semantically near \concept{resourceTitle\#DC-2545}, used in 69 profiles, offers good coverage of the available data.
    104100
    105 Some of the DCs referenced by \code{Name} elements are \concept{author\#DC-4115}), \concept{contact full name\#DC-2454}), \concept{dcterms:Contributor}, \concept{project name\#DC-2536}), \concept{web service name\#DC-4160}) and \concept{language name\#DC-2484}). This implies, that a na\"{i}ve search in a \texttt{Name} element would match semantically very heterogeneous fields and only applying the semantic information provided by the DCs and/or the context of the element (the enclosing components) allows to disambiguate the meaning of the values.
      101Some of the DCs referenced by \code{Name} elements are \concept{author\#DC-4115}, \concept{contact full name\#DC-2454}, \concept{dcterms:Contributor}, \concept{project name\#DC-2536}, \concept{web service name\#DC-4160} and \concept{language name\#DC-2484}. This implies that a na\"{i}ve search in a \texttt{Name} element would match semantically very heterogeneous fields; only applying the semantic information provided by the DCs and/or the context of the element (the enclosing components) makes it possible to disambiguate the meaning of the values.
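This disambiguation argument can be sketched with a toy example (the records are invented for illustration; only the DC identifiers follow the ones named above): a naïve search over \texttt{Name} elements returns semantically unrelated values, while a concept-based restriction does not.

```python
# Toy illustration of naive vs. concept-based search over Name elements.
# The records are invented; the DC identifiers follow the ones named above.
records = [
    {"element": "Name", "datcat": "DC-4115", "value": "Jane Doe"},   # author
    {"element": "Name", "datcat": "DC-2536", "value": "Project X"},  # project name
    {"element": "Name", "datcat": "DC-2484", "value": "German"},     # language name
]

# A naive search matches every Name element, regardless of meaning:
naive = [r["value"] for r in records if r["element"] == "Name"]

# A concept-based search keeps only fields bound to the intended category:
language_names = [r["value"] for r in records if r["datcat"] == "DC-2484"]

print(naive)           # three semantically heterogeneous values
print(language_names)  # only the language name
```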
    106102
    107103%\subsection{Resource type}
     
    109105% \subsection{Subject, Genre, Topic}
    110106
    111 \subsection{Integration of existing formats}
     107\subsection{Integration of Existing Formats}
    112108\label{sec:explore-formats}
    113109
     
    118114\subsubsection{dublincore / OLAC}
    119115\label{reports:OLAC}
    120 Very widely used (because) simple format
      116A very widely used (because simple) metadata format (cf. \ref{def:OLAC}).
    122118%\ref{info:olac-records}
    123119
    124120Here the problem of proliferation seems especially virulent. Table \ref{table:dcterms-profiles} lists all the profiles modelling dcterms.
    125 As all these profiles are link to the corresponding dublincore data categories, this does not pose a major problem on the exploitation side, however the cluttering of the component registry with structurally identical or almost identical profiles needs to be questioned within the community.
      121As all these profiles link to the corresponding dublincore data categories, this does not pose a major problem on the exploitation side; however, the cluttering of the component registry with structurally identical or almost identical profiles needs to be questioned within the community.
    126122
    127123\begin{figure*}[!ht]
     
    135131
    136132\begin{table}
    137 \caption{Profiles modelling dublincore terms}
     133\caption{Profiles Modelling Dublincore Terms}
    138134\label{table:dcterms-profiles}
    139135%  \begin{tabular}{ |l | l | l | r | r | }
     
    154150\end{table}
    155151
    156 Additionally, there is a number of profiles with concept links to dublincore terms,
     152Additionally, there is a number of profiles with concept links to dublincore terms.
    157153Some use all of the dublincore elements or terms as one component within a larger profile,
    158154one example being the \xne{data} profile created by the Czech initiative LINDAT, which models the minimal obligatory set of the META-SHARE \xne{resourceInfo} schema (cf. the subsection about META-SHARE below) combined with a simple dublincore record.
     
    180176\label{results:tei}
    181177TEI is a de-facto standard for encoding any kind of textual resources. It defines a set of elements to annotate individual aspects of the text being encoded. For the purposes of text description / metadata the complex element \code{teiHeader} is foreseen.
    182 TEI does not provide just one fixed schema, but allows for a certain flexibility wrt to elements used and inner structure, allowing to generate custom schemas adopted to projects' needs. \ref{def:tei}.
      178TEI does not provide just one fixed schema, but allows for a certain flexibility regarding the elements used and inner structure, allowing the generation of custom schemas adapted to projects' needs (cf. \ref{def:tei}).
    183179Thus there is also not just one fixed \xne{teiHeader}.
    184180
    185181The widespread use of TEI for encoding textual resources  brings about a strong interest of multiple research teams of the CLARIN community to integrate TEI with CMDI. There was a first attempt already in 2010, modelling the recommended \xne{teiHeader}\furl{http://www.tei-c.org/release/doc/tei-p5-doc/en/html/HD.html\#HD7}, encoding \xne{fileDesc} and \xne{profileDesc} components, leaving out \xne{encodingDesc} and \xne{revisionDesc}. The leaf elements were bound to the most prominent data categories, making it a mixture of both dublincore and isocat.
    186182
    187 The large research project \xne{Deutsches Textarchiv}\furl{http://deutschestextarchiv.de/}\cite{Geyken2011deutsches}, digitizing a hoist of historical german texts from the period 1650 - 1900 also uses TEI to encode the material and consequently the teiHeader to hold the metadata information. Part of the project is also to integrate the data and metadata with the CLARIN infrastructure, meaning CMD records need to be generated for the resources. For this the team generated a completely new profile (as yet private) closely modelling the version of the teiHeader\furl{http://www.deutschestextarchiv.de/doku/basisformat_header} used in the project.
    188 Regarding the question, why another teiHeader-based profile was generated not reusing the existing one, according to a personal note by a member of the project team and author of the profile, Axel Herold\cite{Herold2013} the profile was custom made for this particular project and it seemed undesirable to create a generalised TEI header profile.
    189 
    190 \xne{Nederlab} is another large-scale project aiming processing historic Dutch newspaper articles into a platform for search and analysis, starting 2013 in Netherlands\furl{http://www.nederlab.nl}. Within this project, the metadata is also encoded in a \concept{teiHeader} and the data shall be integrated within CLARIN. Here, another set of CMD profiles was created, however reusing existing components.
      183The large research project \xne{Deutsches Textarchiv}\furl{http://deutschestextarchiv.de/} \cite{Geyken2011deutsches}, digitizing a host of historical German texts from the period 1650--1900, also uses TEI to encode the material and consequently the teiHeader to hold the metadata information. Part of the project is also to integrate the data and metadata with the CLARIN infrastructure, meaning CMD records need to be generated for the resources. For this, the team generated a completely new profile (as yet private) closely modelling the version of the teiHeader\furl{http://www.deutschestextarchiv.de/doku/basisformat_header} used in the project.
      184Regarding the question why another teiHeader-based profile was generated instead of reusing the existing one: according to a personal note by Axel Herold \cite{Herold2013}, a member of the project team and author of the profile, the profile was custom-made for this particular project and it seemed undesirable to create a generalised TEI header profile.
     185
      186\xne{Nederlab} is another large-scale project, started in 2013 in the Netherlands\furl{http://www.nederlab.nl}, aiming at processing historic Dutch newspaper articles into a platform for search and analysis. Within this project, the metadata is also encoded in a \concept{teiHeader} and the data shall be integrated within CLARIN. Here, another set of CMD profiles was created, this time reusing existing components.
    191187As seen in figure \ref{fig:teiHeader_DBNL}, components \xne{fileDesc} and \xne{profileDesc} were reused, while the components \xne{encodingDesc} and \xne{revisionDesc}, left out in the original profile, were added.
    192188
    193 Another approach was applied within the context of other CLARIN-NL projects\cite{Menzo2013-05tei}. Based on an ODD-file, a data category for every element of the teiHeader (135 datcats) was generated. In a subsequent step, an enriched schema was generated, that remodells the original teiHeader-schema, but with the individual elements being annotated with the new data categories (\code{dcr:datcat}-attribute). This schema is now maintained in the SCHEMAcat (cf. \ref{ch:infra}). The next step would be to create again a new profile, but with all the components and elements in it bound to the corresponding data categories, moving the semantic linking into the relation registry, where appropriate relations could be defined between the data categories derived from TEI and the \xne{isocat} and/or \xne{dublincore} DCs.
      189Another approach was applied within the context of other CLARIN-NL projects \cite{Menzo2013-05tei}. Based on an ODD-file, a data category for every element of the teiHeader (135 datcats) was generated. In a subsequent step, an enriched schema was generated that remodels the original teiHeader-schema, but with the individual elements being annotated with the new data categories (\code{dcr:datcat}-attribute). This schema is now maintained in the SCHEMAcat (cf. \ref{ch:infra}). The next step would be to create again a new profile, but with all the components and elements in it bound to the corresponding data categories, moving the semantic linking into the relation registry, where appropriate relations could be defined between the data categories derived from TEI and the \xne{isocat} and/or \xne{dublincore} DCs.
    195191This yields a more complex, but also more systematic and flexible setup, with a clean interface to the semantic space of TEI and the possibility to map the TEI elements (via their data categories) to multiple and/or different data categories according to the specific needs of a project or research question.
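The enrichment step described above can be roughly sketched as follows; this is a simplification, and the element names and category URIs are hypothetical, not the actual ODD-derived data categories:

```python
# Rough sketch of the enrichment step: annotate element declarations
# in a schema with dcr:datcat attributes pointing to data categories.
# Element names and category URIs are hypothetical illustrations.
import xml.etree.ElementTree as ET

DCR_NS = "http://www.isocat.org/ns/dcr"

# Hypothetical mapping from teiHeader element names to data category URIs.
ELEMENT_TO_DATCAT = {
    "titleStmt": "http://example.org/datcat/DC-0001",
    "publicationStmt": "http://example.org/datcat/DC-0002",
}

def enrich(schema_xml: str) -> str:
    """Add a dcr:datcat attribute to every mapped element declaration."""
    ET.register_namespace("dcr", DCR_NS)
    root = ET.fromstring(schema_xml)
    for el in root.iter():
        datcat = ELEMENT_TO_DATCAT.get(el.get("name", ""))
        if datcat:
            el.set(f"{{{DCR_NS}}}datcat", datcat)
    return ET.tostring(root, encoding="unicode")

schema = '<schema><element name="titleStmt"/><element name="other"/></schema>'
print(enrich(schema))
```

Unmapped elements (here \code{other}) pass through unchanged, so the enriched schema stays structurally identical to the original.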
    195191
     
    230226%In cooperation between metadata teams from CLARIN and META-SHARE
    231227
    232 The original META-SHARE schema actually accomodates four models for different resource types. Consequently, the model has been expressed as 4 CMD profiles each for a distinct resource type however all four sharing most of the components, as can be seen in figure \ref{fig:resource_info_5}. The biggest single profile is currently the remodelled maximum schema from the META-SHARE project for describing corpora, with 117 distinct components and 337 elements. When expanded, this translates to 419 components and 1587 elements. However, many of the components and elements are optional (and conditional), thus a specific instance will never use all the possible elements.
    233 
    234 In a parallel effort, LINDAT, the czech national infrastructure initiative engaged in both CLARIN and META-SHARE, created a CMD profile (\concept{data}\furl{http://catalog.clarin.eu/ds/ComponentRegistry/?item=clarin.eu:cr1:p_1349361150622}) modelling the minimal obligatory set of META-SHARE \concept{resourceInfo}), however combined with a simple dublincore record.
    235 This way, the information gets partly duplicated, but with the advantage, that a minimal information is conveyed in the widely understood format, retaining the expressivity of the feature-rich schema.
    236 
    237 The expression of the META-SHARE schema in CMD allows a direct comparison of the two different approaches taken in the two projects: a metamodel allowing to generate custom profiles with shared semantics vs. the more traditional way of trying to generate one schema to fit in all the information. It shows nicely the trade-off: many custom schemas with the risk of proliferation and problems with semantic interoperability or one very large with the risk of overwhelming the user and still not being able to capture all specific informations.
      228The original META-SHARE schema actually accommodates four models for different resource types. Consequently, the model has been expressed as four CMD profiles, each for a distinct resource type, however, all four sharing most of the components, as can be seen in figure \ref{fig:resource_info_5}. The biggest single profile is currently the remodelled maximum schema from the META-SHARE project for describing corpora, with 117 distinct components and 337 elements. When expanded, this translates to 419 components and 1587 elements. However, many of the components and elements are optional (and conditional), thus a specific instance will never use all the possible elements.
     229
      230In a parallel effort, LINDAT, the Czech national infrastructure initiative engaged in both CLARIN and META-SHARE, created a CMD profile (\concept{data}\furl{http://catalog.clarin.eu/ds/ComponentRegistry/?item=clarin.eu:cr1:p_1349361150622}) modelling the minimal obligatory set of META-SHARE \concept{resourceInfo}, however, combined with a simple dublincore record.
      231This way, the information gets partly duplicated, but with the advantage that minimal information is conveyed in the widely understood format, while retaining the expressivity of the feature-rich schema.
     232
      233The expression of the META-SHARE schema in CMD allows a direct comparison of the two different approaches taken in the two projects: a metamodel allowing to generate custom profiles with shared semantics vs. the more traditional way of trying to generate one schema to fit in all the information. It nicely shows the trade-off: many custom schemas with the risk of proliferation and problems with semantic interoperability, or one very large schema with the risk of overwhelming the user and still not being able to capture all specific information.
    238234
    239235\begin{figure*}
     
    249245\includegraphics[width=0.75\textwidth]{images/LINDAT-profile-data.png}
    250246\end{center}
    251 \caption{profile by LINDAT combining META-SHARE \xne{resourceInfo} component with dublincore elements }
     247\caption{Profile by LINDAT combining META-SHARE \xne{resourceInfo} component with dublincore elements }
    252248\label{fig:META-SHARE-LINDAT}
    253249\end{figure*}
     
    274270\includegraphics[height=0.95\textheight]{images/resourceInfoBIG.png}
    275271\end{center}
    276 \caption{the META-SHARE based profile for describing corpora}
     272\caption{The META-SHARE based profile for describing corpora}
    277273\label{fig:META-SHARE-BIG}
    278274\end{figure*}
     
    280276
    281277%%%%%%%%%%%%%%%%%%%%%%%
    282 \subsection{SMC cloud}
     278\subsection{SMC Cloud}
    283279\label{sec:smc-cloud}
    284 As a latest, still experimental, addition, SMC browser provides a special type of graph, that displays only profiles. The links between them reflect the reuse of components and data categories (i.e. how many components or data categories do the linked pairs of profiles share), indicating the degree of similarity or semantic proximity. Figure \ref{fig:SMC_cloud} depicts one possible output of the graph
      280As the latest, still experimental, addition, SMC Browser provides a special type of graph that displays only profiles. The links between them reflect the reuse of components and data categories (i.e. how many components or data categories the linked pairs of profiles share), indicating the degree of similarity or semantic proximity. Figure \ref{fig:SMC_cloud} depicts one possible output of the graph
    285281covering a large part of the defined profiles. It shows nicely the clusters of strongly related profiles in contrast to the greater distances between more loosely connected profiles.
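The similarity measure underlying such a profile graph can be sketched as follows -- a simplified, hypothetical computation counting shared components per profile pair (the actual SMC Browser derives its weights from the full dataset, including data categories):

```python
# Sketch of the profile-similarity idea behind the profile graph:
# the weight of an edge = number of components two profiles share.
# Profile and component names are invented for illustration.
from itertools import combinations

profiles = {
    "TextCorpus":   {"GeneralInfo", "Access", "Annotation"},
    "SpeechCorpus": {"GeneralInfo", "Access", "Recording"},
    "Lexicon":      {"GeneralInfo", "LexicalEntry"},
}

# Keep an edge only for pairs that actually share components.
edges = {
    (a, b): len(profiles[a] & profiles[b])
    for a, b in combinations(sorted(profiles), 2)
    if profiles[a] & profiles[b]
}
print(edges)
```

Profiles connected by heavy edges then cluster together in the layout, while loosely connected ones drift apart, as observed in figure \ref{fig:SMC_cloud}.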
    286282
     
    293289\end{figure*}
    294290
    295 \begin{comment}
    296 \section{Evaluation}
    297 \label{evaluation}
    298 
    299 Sample Queries:
    300 
    301 candidate Categories:
    302 ResourceType, Format
    303 Genre, Topic
    304 Project, Institution, Person, Publisher
    305 
    306 
    307 
    308 \subsection{Use Cases}
    309 
    310 \begin{itemize}
    311 
    312 \item MD Search employing Semantic Mapping
    313 \item MD Search employing Fuzzy Search
    314 \item Visualize impact of given mapping in terms of covered dataset (number of matched records).
    315 \end{itemize}
    316 
    317 
    318 \section{Discussion}
    319 
    320 \subsection{Semantic Mapping in Metadata vs. Content/Annotation}
    321 AF + DCR + RR
    322 
    323 \end{comment}
    324 
    325291\section{Summary}
    326292In this final chapter, we presented the results: on the one hand, the technical solution of the module \xne{Semantic Mapping Component}; on the other hand, we spent a good part of the chapter on commented analyses of the processed dataset that were made possible by \xne{SMC Browser}, an interactive visualization tool developed as part of this work for the exploration of the schema-level data of the discussed collection. As such, the analyses can be seen as an evaluation, a proof of the concept and usefulness of the presented work.
  • SMC4LRT/chapters/abstract_de.tex

    r3665 r4117  
    11\chapter*{Kurzfassung}
    22
    3 Diese Arbeit ist eingebettet in eine große internationale Forschungsinfrastruktur-Iinitiave, die zur Aufgabe hat,
    4 einfachen, stabilen, harmonisierten Zugang zu Sprachressourcen und Technologien in Europa zu ermöglichen, der \emph{Common Language Resource and Technology Infrastructure} oder CLARIN. Das technische Herzstück dieser Unternehmung is die \emph{Component Metadata Infrastructure}, ein verteiltes System, das harmonisiertes koherentes Erstellen und Verbreiten von Metadaten für Sprachressourcen ermöglicht. Das Ergebnis dieser Arbeit, das Modul \emph{Semantic Mapping Component}, wurde als Bestandteil des Systems erdacht, um unter Ausnutzung der in die Infrastruktur eingebauten Mechanismen das Problem der semantischen Interoperabilität zu überwinden, das sich aus der Heterogenität der Metatadaten-Formate ergibt.
    5 
    6 Das eigentliche Ziel, der Nutzen dieser Arbeit -- im Einklang mit der generellen Idee des ganzen Unterfangens -- war die \emph{Verbesserung der Suchmöglichkeiten} in der großen heterogenen Sammlung von Metadaten. Diese Aufgabe wurde in zwei separaten sich ergänzenden Herangehensweisen angegangen: a) Entwurf und Entwicklung eines Dienstes (service) zur Bereitstellung von \emph{crosswalks} (Entsprechungen zwischen Feldern in unterschiedlichen Metadaten-Formaten) auf der Basis von wohldefinierten Konzepten und die Anwendung dieser \emph{crosswalks} bei Suchszenarien um die Trefferquote zu erhöhen. b) die integrative Kraft des \emph{Linked Open Data} Paradigma anerkennend, Modellierung der Domändaten als eine \emph{Semantic Web} Ressource, um die Nutzung von semantischen Technologien auf dem Datensatz zu ermöglichen.
      3Das eigentliche Ziel, der Nutzen dieser Arbeit war die \emph{Verbesserung der Suchmöglichkeiten} in einer großen heterogenen Sammlung von Metadaten. Diese Aufgabe wurde in zwei separaten, sich ergänzenden Herangehensweisen angegangen: a) Entwurf und Entwicklung eines Dienstes (service) zur Bereitstellung von \emph{crosswalks} (Entsprechungen zwischen Feldern in unterschiedlichen Metadaten-Formaten) auf der Basis von wohldefinierten Konzepten und die Anwendung dieser \emph{crosswalks} in Suchszenarien, um die Trefferquote zu erhöhen; b) die integrative Kraft des \emph{Linked Open Data}-Paradigmas anerkennend, Modellierung der Domänendaten als eine \emph{Semantic Web}-Ressource, um die Nutzung von semantischen Technologien auf dem Datensatz zu ermöglichen.
    74
    85Entsprechend den zwei Herangehensweisen lieferte die Arbeit auch zwei Hauptergebnisse: a) die Spezifikation eines Moduls fÃŒr \emph{konzept-basierte Suche} zusammen mit dem zugrundeliegenden Dienst \emph{crosswalk service}, begleitet von einer Testimplementierung; b) Spezifikation der Modellierung der Ausgangsdaten im RDF Format, womit die Grundlage geschaffen ist, die Daten als \emph{Linked Open Data} bereitzustellen.
    96
    107Teilweise als Nebenprodukt wurde auch die Anwendung \emph{SMC Browser} entwickelt -- ein interaktives Visualisierungswerkzeug zur Erschließung der Schema-Ebene der Datensammlung. Mit Hilfe dieses Werkzeugs konnte eine Reihe von tiefergehenden Analysen der Daten erstellt werden, die direkt von der Forschergemeinschaft zur Erschließung und Redaktion der komplexen Daten genutzt werden. Somit können die Anwendung und die Analyseberichte als ein wertvoller Beitrag für die Forschergemeinschaft angesehen werden.
     8
      9Diese Arbeit ist eingebettet in eine große internationale Forschungsinfrastrukturinitiative, die zur Aufgabe hat,
      10einfachen, stabilen, harmonisierten Zugang zu Sprachressourcen und Technologien in Europa zu ermöglichen, der \emph{Common Language Resource and Technology Infrastructure} oder CLARIN. Das technische Herzstück dieser Unternehmung ist die \emph{Component Metadata Infrastructure}, ein verteiltes System, das harmonisiertes, kohärentes Erstellen und Verbreiten von Metadaten für Sprachressourcen ermöglicht. Das Ergebnis dieser Arbeit, das Modul \emph{Semantic Mapping Component}, wurde als Bestandteil des Systems erdacht, um unter Ausnutzung der in die Infrastruktur eingebauten Mechanismen das Problem der semantischen Interoperabilität zu überwinden, das sich aus der Heterogenität der Metadaten-Formate ergibt.
  • SMC4LRT/chapters/abstract_en.tex

    r3665 r4117  
    11\chapter*{Abstract}
    22
    3 
    4 This work is embedded in the context of a large research infrastructure initiative aimed at providing easy, stable, harmonized access to language resources and technology (LRT) in Europe, the \emph{Common Language Resource and Technology Infrastructure} or CLARIN. A core technical pillar of this initiative is the \emph{Component Metadata Infrastructure}, a distributed system for creating and providing metadata for LRT in a coherent harmonized way. The outcome of this work, the \emph{Semantic Mapping Component}, was conceived as one module within the infrastructure dedicated to overcome the semantic interoperability problem stemming from the heterogeneity of the resource descriptions, by harnessing the mechanisms of the semantic layer built-in into the core of the infrastructure.
    5 
    6 The ultimate objective of this work -- in line with the overall mission of the whole initiative -- was to \emph{enhance search functionality} over the large heterogeneous collection of resource descriptions. This objective was pursued in two separate, complementary approaches: a) design a service delivering crosswalks (i.e. equivalences between fields in disparate metadata formats) based on well-defined concepts and apply this concept-based crosswalks in search scenarios to enhance recall. b) acknowledging the integrative power of the \emph{Linked Open Data} paradigm, express the domain data as a \emph{Semantic Web} resource, to enable the application of semantic technologies on the dataset.
      3The ultimate objective of this work was to \emph{enhance search functionality} over a large heterogeneous collection of resource descriptions. This objective was pursued in two separate, complementary approaches: a) design a service delivering crosswalks (i.e. equivalences between fields in disparate metadata formats) based on well-defined concepts and apply these concept-based crosswalks in search scenarios to enhance recall; b) acknowledging the integrative power of the \emph{Linked Open Data} paradigm, express the domain data as a \emph{Semantic Web} resource, to enable the application of semantic technologies on the dataset.
    74
    85In parallel with the two approaches, the work delivered two main results: a) the \emph{specification} of the module for \emph{concept-based search} together with the underlying \emph{crosswalks service} accompanied by a proof-of-concept implementation. And b) the blueprint for expressing the original dataset in RDF format, effectively laying a foundation for providing this dataset as \emph{Linked Open Data}.
     
    107Partly as by-product, the application \emph{SMC browser} was developed -- an interactive visualization tool to explore the dataset on the schema level. This tool provided means to generate a number of advanced analyses of the data, directly used by the community for exploration and curation of the complex dataset.  As such, the tool and the reports can be considered a valuable contribution to the community.
    118
      9This work is embedded in the context of a large research infrastructure initiative aimed at providing easy, stable, harmonized access to language resources and technology (LRT) in Europe, the \emph{Common Language Resource and Technology Infrastructure} or CLARIN. A core technical pillar of this initiative is the \emph{Component Metadata Infrastructure}, a distributed system for creating and providing metadata for LRT in a coherent, harmonized way. The outcome of this work, the \emph{Semantic Mapping Component}, was conceived as one module within the infrastructure dedicated to overcoming the semantic interoperability problem stemming from the heterogeneity of the resource descriptions, by harnessing the mechanisms of the semantic layer built into the core of the infrastructure.
     10
  • SMC4LRT/chapters/acknowledgements.tex

    r3776 r4117  
    11\chapter*{Acknowledgements}
    22
    3 I would like to thank all the colleagues from my institute and from the CLARIN community, for the support, the fruitful discussions and helpful feedback, especially Menzo Windhouwer, Daan Broeder, Dieter Van Uytvanck, Marc Kemps-Snijders and Hennie Brugman. \\
    4 And to all my dear one, for the extra portion of patience I demanded from them
     3I would like to thank all the colleagues from the institute and from the CLARIN community for the support, the fruitful discussions and helpful feedback, especially Menzo Windhouwer, Daan Broeder, Dieter Van Uytvanck, Marc Kemps-Snijders and Hennie Brugman. \\
     4And all my dear ones, for the extra portion of patience I demanded from them.
    55\\
    6  \\
    7 With love to em.
     6
     7\hfill with love to em
  • SMC4LRT/chapters/appendix.tex

    r3776 r4117  
    66\chapter{Data model reference}
    77\label{ch:data-model-ref}
    8 In the following complete data models, schemas are listed for reference: The diagram of the data model for data category specification in figure~\ref{fig:DCR_data_model},  \xne{Terms.xsd} -- the XML schema used by the SMC module internally in listing~\ref{lst:terms-schema} (cf. \ref{datamodel-terms}) and the \xne{general-component-schema.xsd}\furl{https://infra.clarin.eu/cmd/general-component-schema.xsd} -- the schema representing the CMD meta model for defining CMD profiles and components in listing~\ref{lst:cmd-schema}. Figure \ref{fig:ref_arch} depicts an abstract reference architecture, that provides a conceptual frame for this work and in figure \ref{fig:acdh_context} an overview of the roles and services of the \xne{ACDH -- Austrian Centre for Digital Humanities} -- the home of SMC -- explicates the concrete current situation regarding the architectural context of SMC.
      8In the following, the complete data models and schemas are listed for reference: the diagram of the data model for data category specification in figure~\ref{fig:DCR_data_model}, \xne{Terms.xsd} -- the XML schema used internally by the SMC module -- in listing~\ref{lst:terms-schema} (cf. \ref{datamodel-terms}) and the \xne{general-component-schema.xsd}\furl{https://infra.clarin.eu/cmd/general-component-schema.xsd} -- the schema representing the CMD meta model for defining CMD profiles and components -- in listing~\ref{lst:cmd-schema}. Figure \ref{fig:ref_arch} depicts an abstract reference architecture that provides a conceptual frame for this work, and figure \ref{fig:acdh_context} gives an overview of the roles and services of the \xne{ACDH -- Austrian Centre for Digital Humanities} -- the home of SMC -- explicating the concrete current situation regarding the architectural context of SMC.
    99
    10 \begin{figure*}[!ht]
     10\input{images/Terms.xsd}
     11
     12\input{images/general-component-schema.xsd}
     13
     14\begin{figure*}
     15\begin{center}
     16\includegraphics[width=1\textwidth]{images/EDC_components_v4.png}
     17\end{center}
     18\caption{Reference Architecture}
     19\label{fig:ref_arch}
     20\end{figure*}
     21
     22\begin{figure*}[p]
    1123\begin{center}
    1224\includegraphics[width=1\textwidth]{images/DCR_data_model.jpg}
     
    1628\end{figure*}
    1729
    18 \input{images/Terms.xsd}
    19 
    20 \input{images/general-component-schema.xsd}
    21 
    22 
    2330\begin{figure*}[!ht]
    2431\begin{center}
    25 \includegraphics[width=1\textwidth]{images/EDC_components_v4.png}
    26 \end{center}
    27 \caption{Reference Architecture}
    28 \label{fig:ref_arch}
    29 \end{figure*}
    30 
    31 \begin{figure*}[!ht]
    32 \begin{center}
    33 \includegraphics[width=1\textheight, angle=90]{images/acdh-diagram_300dpi.png}
     32\includegraphics[width=0.95\textheight, angle=90]{images/acdh-diagram_300dpi.png}
    3433\end{center}
    3534\caption{Austrian Centre for Digital Humanities - the home of SMC - in context}
     
    3736\end{figure*}
    3837
    39 \chapter{CMD -- sample data}
     38\chapter{CMD -- Sample Data}
    4039\label{ch:cmd-sample}
    4140
     
    4544\input{chapters/collection_spec.xml.tex}
    4645
    47 \section{CMD record}
     46\section{CMD Record}
    4847The following listing represents a sample CMD record -- an instance of the \concept{collection} profile listed above.
    4948
     
    5150
    5251
    53 \chapter{SMC -- documentation}
     52\chapter{SMC -- Documentation}
    5453\label{ch:smc-docs}
    5554
    5655\begin{figure*}
    5756\begin{center}
    58 \includegraphics[height=1\textwidth, angle=90]{images/build_init.png}
     57\includegraphics[width=1.1\textheight, angle=90]{images/build_init.png}
    5958\end{center}
    6059\caption{A graphical representation of the dependencies and calls in the main \xne{ant} build file.}
     
    6261\end{figure*}
    6362
    64 \section{Documentation of smc-xsl}
     63\section{Developer Documentation}
    6564\label{sec:smc-xsl-docs}
    66 \todoin{generate and reference XSLT-documentation}
    6765
    68 \section{SMC Browser user documentation}
     66A developer documentation of the code and the system is included in the source repository
     67
     68\noindent
     69\url{https://svn.clarin.eu/SMC/trunk/SMC/docs}
     70
     71\noindent
     72A short introduction can be found online as part of the application:
     73
     74\noindent
     75\url{http://clarin.oeaw.ac.at/smc/docs/devdocs.html}
     76
     77\section{SMC Browser User Documentation}
    6978\label{sec:smc-browser-userdocs}
    7079
    7180\input{chapters/userdocs_cleaned}
    7281
    73 \section {Sample SMC graphs}
     82\clearpage
     83\section {Sample SMC Graphs}
    7484\label{sec:smc-graphs}
    7585
     
    8191\label{fig:cmd-dep-dotgraph}
    8292\end{figure*}
     93
     94\begin{figure*}[h]
     95\begin{center}
     96\includegraphics[width=1\textwidth]{images/SMC-export_sample.png}
     97\end{center}
     98\caption{A sample output from the SMC Browser showing a number of frequently used data categories and the clusters of profiles using them.}
     99\label{fig:smc-sample}
     100\end{figure*}
     101
    83102
    84103
  • SMC4LRT/chapters/userdocs_cleaned.tex

    r3776 r4117  
    22Explore the \DUroletitlereference{Component Metadata Framework}
    33
    4 In \emph{CMD}, metadata schemas are defined by profiles, that are constructed out of reusable components  - collections
     4In \emph{CMD}, metadata schemas are defined by profiles that are constructed out of reusable components -- collections
    55of metadata fields. The components can contain other components, and they can be reused in multiple profiles.
    66Furthermore, every CMD element (metadata field) refers via a PID to a data category to indicate unambiguously how the content of the field in a metadata description should
     
    1212SMC Browser visualizes this graph structure in an interactive fashion. You can have a look at the \href{examples.html}{examples} for inspiration.
    1313
    14 It is implemented on top of wonderful js-library \href{https://github.com/mbostock/d3}{d3}, the code checked in \href{https://svn.clarin.eu/SMC/trunk/SMC}{clarin-svn} (and needs refactoring). More technical documentation follows soon.
     14It is implemented on top of the wonderful js-library \href{https://github.com/mbostock/d3}{d3}; the code is checked into \href{https://svn.clarin.eu/SMC/trunk/SMC}{clarin-svn} (and needs refactoring). There is also some preliminary \href{devdocs.html}{technical documentation}.
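The profile/component/element/data-category structure described above can be pictured as a typed node-link graph, the shape that d3 force layouts conventionally consume. The following is an illustrative sketch only, not the actual SMC Browser code; the node ids and the data-category PID \xne{DC-2536} are made-up examples:

```javascript
// Hypothetical sketch: CMD schema level as a {nodes, links} graph.
// Node types mirror the four index categories of the SMC Browser.
const nodes = [
  { id: "collection",  type: "profile" },
  { id: "GeneralInfo", type: "component" },
  { id: "title",       type: "element" },
  { id: "DC-2536",     type: "datcat" }   // made-up data-category PID
];

const links = [
  { source: "collection",  target: "GeneralInfo" }, // profile uses component
  { source: "GeneralInfo", target: "title" },       // component contains element
  { source: "title",       target: "DC-2536" }      // element refers to data category
];

// Example traversal: which data categories are reachable from a given node?
function reachableDatcats(start) {
  const adjacency = new Map();               // node id -> [target ids]
  for (const l of links) {
    if (!adjacency.has(l.source)) adjacency.set(l.source, []);
    adjacency.get(l.source).push(l.target);
  }
  const seen = new Set();
  const stack = [start];
  const result = [];
  while (stack.length) {
    const id = stack.pop();
    if (seen.has(id)) continue;
    seen.add(id);
    const node = nodes.find(n => n.id === id);
    if (node && node.type === "datcat") result.push(id);
    for (const target of adjacency.get(id) || []) stack.push(target);
  }
  return result;
}

console.log(reachableDatcats("collection")); // [ 'DC-2536' ]
```

Such a traversal is essentially what lies behind expanding a selected profile down to its used data categories in the graph pane.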
    1515
    1616
     
    5353}
    5454
    55 The User interface is divided into 4 main parts:
     55The user interface is divided into four main parts:
    5656%
    5757\begin{description}
    5858\item[{Index}] \leavevmode
    5959Lists all available Profiles, Components, Elements and used Data Categories
    60 The lists can be filtered (enter search pattern in the input box at the top of the index-pane)
    61 By clicking on individual items, they are added to the \DUroletitlereference{selected nodes} and get rendered in the graph pane
     60The lists can be filtered (enter search pattern in the input box at the top of the index-pane).
     61Clicking on individual items adds them to the \DUroletitlereference{selected nodes}, and they get rendered in the graph pane.
    6262
    6363\item[{Main (Graph)}] \leavevmode
     
    7878}
    7979
    80 Following data sets are distinguished wrt user interaction:
     80The following data sets are distinguished with respect to user interaction:
    8181%
    8282\begin{description}
    8383\item[{all data}] \leavevmode
    8484the full graph with all profiles, components, elements and data categories and links between them.
    85 
    86 Currently this amounts to roughly 2.000 nodes and 3.000 links
     85Currently this amounts to roughly 4.600 nodes and 7.500 links.
    8786
    8887\item[{selected nodes}] \leavevmode
    89 nodes explicitely selected by the user (see below how to \hyperref[select-nodes]{select nodes})
     88nodes explicitly selected by the user (see below how to \hyperref[select-nodes]{select nodes}).
    9089
    9190\item[{data to show}] \leavevmode
     
    145144}
    146145
    147 The navigation pane provides following option to control the rendering of the graph:
     146The navigation pane provides the following options to control the rendering of the graph:
    148147%
    149148\begin{description}
Note: See TracChangeset for help on using the changeset viewer.