Changeset 3776 for SMC4LRT


Timestamp:
10/16/13 16:06:54 (11 years ago)
Author:
vronk
Message:

final layout cleaning; backup

Location:
SMC4LRT
Files:
18 edited

  • SMC4LRT/Outline.tex

    r3681 r3776  
    7676
    7777\listoffigures
    78 \listoftodos
    79 \begin{comment}
     78%\listoftodos
     79%\begin{comment}
    8080\input{chapters/Introduction}
    8181
     
    8383
    8484
    85 \input{chapters/Definitions}
    86 \end{comment}
     85
     86%\end{comment}
    8787\input{chapters/Data}
    8888
    89 \begin{comment}
     89%\begin{comment}
    9090
    9191\input{chapters/Infrastructure}
     
    9999\input{chapters/Conclusion}
    100100
    101 \end{comment}
     101%\end{comment}
    102102
    103103
     
    108108\appendix
    109109
    110 %\input{chapters/appendix}
     110\input{chapters/Definitions}
     111\input{chapters/appendix}
    111112
    112113
  • SMC4LRT/chapters/Conclusion.tex

    r3665 r3776  
    88% Dynamic integration of the information from the Relation Registry into the search interface and search processing.
    99
    10 A whole separate track is the effort to deliver the CMD data as \emph{Linked Open Data}, for which only the groundwork has been done by specifying the modelling of the data in RDF. Further steps are: setup of a processing workflow to apply the specified model and transform all the data (profiles and instances) into RDF, a server solution to host the data and allow querying it and finally, on top of it offer a web interface for the users to explore the dataset.
     10A whole separate track is the effort to deliver the CMD data as \emph{Linked Open Data}, for which only the groundwork has been done by specifying the modelling of the data in RDF. Further steps are: setting up a processing workflow to apply the specified model and transform all the data (profiles and instances) into RDF, a server solution to host the data and allow querying it, and, eventually, a web interface for users to explore the dataset.
    1111
    1212%Irrespective of the additional levels - the user wants and has to get to the resource. (not always) to the "original"
    13 And finally, a visualization tool for the schema level data of the discussed data collection was developed -- the \emph{SMC Browser}.
    14 Considering the feedback received until now from the colleagues in the community, it is already now a useful tool with high further potential. As detailed in \ref{smc-browser-extensions}, there is a number of features, that could enhance the functionality and usefulness of the tool: integrate with instance data to be able to directly see which profiles are effectively being used; allow set operations on subgraphs (like intersection and difference) to enable differential views; generalize the matching algorithm; enhance the tool to act as an independent visualization service, by accepting external graph data (from any domain).
     13And finally, a visualization tool for exploring the schema level data of the discussed data collection was developed -- the \emph{SMC Browser}. Considering the feedback received so far from colleagues in the community, it is already a useful tool with high further potential. As detailed in \ref{smc-browser-extensions}, there are a number of features that could enhance the functionality and usefulness of the tool: integrate with instance data to be able to directly see which profiles are effectively being used; allow set operations on subgraphs (like intersection and difference) to enable differential views; generalize the matching algorithm; enhance the tool to act as an independent visualization service, by accepting external graph data (from any domain).
    1514
    1615Within the CLARIN community a number of (permanent) tasks have been identified and corresponding task forces have been established,
    17 one of them being metadata curation. The results of this work represent a directly applicable groundwork for this ongoing effort.
     16one of them being metadata curation. The results of this work represent a directly applicable input for this ongoing effort.
    1817One particularly pressing aspect of the curation is the consolidation of the actual values in the CMD records, a topic explicitly treated in this work.
  • SMC4LRT/chapters/Data.tex

    r3681 r3776  
    1010The \emph{Component Metadata Framework} (CMD) is the data model of the CLARIN Component Metadata Infrastructure. (See \ref{def:CMDI} for information about the infrastructure. The XML-schema defining CMD -- the \xne{general-component-schema} -- is featured in appendix \ref{lst:cmd-schema}.)
    1111CMD is used to define the so-called \var{profiles}, constructed out of reusable \var{components} -- collections of metadata fields. The components can contain other components and they can be reused in multiple profiles. A profile itself is just a special kind of component (a subclass) with some additional administrative information.
    12 The actual core provision for semantic interoperability is the requirement, that each CMD element (i.e. metadata field) refers ``via a PID to exactly one data category\footnote{persistently referenceable concept definition} (cf. \ref{def:DCR}), thus
     12The actual core provision for semantic interoperability is the requirement that each CMD element (i.e. metadata field) refers ``via a PID to exactly one data category (cf. \ref{def:DCR}\footnote{in short: a persistently referenceable concept definition}), thus
    1313indicating unambiguously how the content of the field in a metadata description should be interpreted'' \cite{Broeder+2010}.
    1414
     
    1717While the primary registry for data categories used in CMD is the \xne{ISOcat} Data Category Registry (cf. \ref{def:DCR}), other authoritative sources are accepted (so-called ``trusted registries''), especially the set of terms maintained by the Dublin Core Metadata Initiative \cite{DCMI:2005}.
    1818
    19 Once the profiles are defined they are transformed into a XML-Schema, that prescribes the structure of the instance records.
     19Once the profiles are defined, they are transformed into an XML Schema that prescribes the structure of the instance records.
    2020The generated schema also conveys, as annotations, the information about the referenced data categories.
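A minimal sketch of this validation step, using lxml; the file names are placeholders (the XSD standing in for the schema generated for a profile, the XML file for a CMD instance record).

    # Sketch only: validate a CMD instance against the XSD generated for its profile.
    # "profile.xsd" and "record.cmdi.xml" are placeholder file names.
    from lxml import etree

    schema = etree.XMLSchema(etree.parse("profile.xsd"))
    instance = etree.parse("record.cmdi.xml")

    if schema.validate(instance):
        print("record conforms to the profile schema")
    else:
        # error_log lists the violations of the prescribed structure
        for error in schema.error_log:
            print(error.line, error.message)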
    2121
     
    2424In the CR, 124 public Profiles and 696 Components are defined\footnote{All numbers are as of 2013-06 if not stated otherwise}. Table \ref{table:dev_profiles} shows the development of the CR and DCR population over time.
    2525
    26 Next to the `native' CMD profiles a number of profiles have been created that implement existing metadata formats, like OLAC/DCMI-terms, TEI Header or the META-SHARE schema. The resulting profiles proof the flexibility/expressi\-vi\-ty of the CMD metamodel. The individual profiles differ also very much in their structure -- next to flat profiles with just one level of components or elements with 5 to 20 fields (\textit{dublincore}, \textit{collection}, the set of \textit{Bamdes}-profiles) there are complex profiles with up to 10 levels (\textit{ExperimentProfile}, profiles for describing Web Services ) and a few hundred elements. The biggest single profile is currently the remodelled maximum schema from the META-SHARE project \cite{Gavrilidou2012meta} for describing corpora, with 419 components and 1587 elements
    27 (when expanded\footnote{The reusability of components results in an element expansion, i.e., elements of a component (e.g. \textit{Contact}) included by three other components (\textit{Project}, \textit{Institution}, \textit{Access}) will appear three times in the instantiated record.}).
     26Next to the `native' CMD profiles a number of profiles have been created that implement existing metadata formats, like OLAC/DCMI-terms, TEI Header or the META-SHARE schema. The resulting profiles prove the flexibility/expressi\-vi\-ty of the CMD metamodel. The individual profiles also differ considerably in their structure -- next to flat profiles with just one level of components or elements with 5 to 20 fields (\concept{dublincore}, \concept{collection}, the set of \concept{Bamdes}-profiles) there are complex profiles with up to 10 levels (\concept{ExperimentProfile}, profiles for describing Web Services) and a few hundred elements. The biggest single profile is currently the remodelled maximum schema from the META-SHARE project \cite{Gavrilidou2012meta} for describing corpora, with 117 distinct components and 337 elements (or 419 components and 1587 elements when expanded\footnote{The reusability of components results in an element expansion, i.e., elements of a component (e.g. \concept{Contact}) included by three other components (\concept{Project}, \concept{Institution}, \concept{Access}) will appear three times in the instantiated record.}).
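The expansion effect described in the footnote can be illustrated with a toy sketch; the component names echo the footnote's example, the element counts are made up.

    # Toy model of element expansion through component reuse (illustrative numbers).
    # Each component declares its own elements and the components it includes.
    components = {
        "Contact":       {"elements": 4, "includes": []},
        "Project":       {"elements": 3, "includes": ["Contact"]},
        "Institution":   {"elements": 5, "includes": ["Contact"]},
        "Access":        {"elements": 2, "includes": ["Contact"]},
        "corpusProfile": {"elements": 6, "includes": ["Project", "Institution", "Access"]},
    }

    def expanded_elements(name):
        """Element slots in the instantiated record; reused components count repeatedly."""
        component = components[name]
        return component["elements"] + sum(expanded_elements(c) for c in component["includes"])

    # "Contact" is pulled in three times, so its elements appear three times as well
    print(expanded_elements("corpusProfile"))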
    2827
    2928
     
    136135Next to CLARIN and CMDI, there is a host of related previous and concurrent work. In the following, we briefly introduce some formats and data collections established in the field and, where applicable, we also sketch the ties with CMDI and existing integration efforts.
    137136
    138 Some overview/survey works regarding existing formats are: The CLARIN deliverable \textit{Interoperability and Standards} \cite{CLARIN_D5.C-3} provides overview of standards, vocabularies and other normative/standardization work in the field of Language Resources and Technology. And \textit{Seeing standards: a visualization of the metadata universe} by Riley and Becker \cite{Riley2010seeing} putting the overwhelming amount of existing metadata standards into a systematic comprehensive overview analyzing the use of standards from four aspects: community, domain, function, and purpose. Though despite its aspiration on comprehensiveness it leaves out some of the formats relevant in the context of this work: IMDI, EDM, ESE, TEI???
     137As for comprehensive overviews of formats and standards, the CLARIN deliverable \textit{Interoperability and Standards} \cite{CLARIN_D5.C-3} provides an overview of standards, vocabularies and other normative/standardization work in the field of Language Resources and Technology. And \textit{Seeing standards: a visualization of the metadata universe} by Riley and Becker \cite{Riley2010seeing} puts the overwhelming amount of existing metadata standards into a systematic, comprehensive visual overview, analyzing the use of standards from four aspects: community, domain, function, and purpose. Though despite its aspiration to comprehensiveness, it leaves out some of the formats relevant in the context of this work: IMDI, EDM, ESE.
    139138
    140139
     
    150149\end{description}
    151150
    152 Today, Dublin Core metadata terms is very widely spread. Thanks to its simplicity it is used as the common denominator in many applications, content management systems integrate Dublin Core to use in \code{meta} tags of served pages (\code{<meta name="DC.Publisher" content="publisher-name" >}), it is default minimal description in content repositories (Fedora-commons, DSpace). It is also the obligatory base format in the OAI-PMH protocol. The OpenArchives register\furl{http://www.openarchives.org/Register/BrowseSites} lists more than 2100 data providers.
     151The DCMI terms format is very widely spread nowadays. Thanks to its simplicity it is used as the common denominator in many applications: content management systems embed Dublin Core in \code{meta} tags of served pages (\code{<meta name="DC.Publisher" content="publisher-name" >}), and it is the default minimal description in content repositories (Fedora-commons, DSpace). It is also the obligatory base format in the OAI-PMH protocol. The OpenArchives register\furl{http://www.openarchives.org/Register/BrowseSites} lists more than 2100 data providers.
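A hedged sketch of harvesting such Dublin Core records over OAI-PMH with plain Python; the endpoint URL is a placeholder, only the OAI-PMH and unqualified Dublin Core namespaces are the standard ones.

    # Sketch: harvest unqualified Dublin Core records via OAI-PMH.
    # The endpoint is a placeholder; any OAI-PMH data provider would do.
    import urllib.request
    import xml.etree.ElementTree as ET

    OAI = "{http://www.openarchives.org/OAI/2.0/}"
    DC = "{http://purl.org/dc/elements/1.1/}"

    url = "http://example.org/oai?verb=ListRecords&metadataPrefix=oai_dc"

    with urllib.request.urlopen(url) as response:
        tree = ET.parse(response)

    for record in tree.iter(OAI + "record"):
        titles = [e.text for e in record.iter(DC + "title")]
        publishers = [e.text for e in record.iter(DC + "publisher")]
        print(titles, publishers)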
    153152
    154153There are multiple possible serializations, in particular a mapping to RDF is specified\furl{http://dublincore.org/documents/dcq-rdf-xml/}.
     
    160159\label{def:OLAC}
    161160
    162 \xne{OLAC Metadata}\furl{http://www.language-archives.org/}format \cite{Bird2001} is a application profile\cite{heery2000application}, of the \xne{Dublin Core metadata terms}, adapted to the needs of the linguistic community. It is developed and maintained by the \xne{Open Language Archives Community} providing a common platform and an infrastructure for ``creating a worldwide virtual library of language resources'' \cite{Simons2003OLAC}.
    163 
    164 The OLAC schema \furl{http://www.language-archives.org/OLAC/1.1/olac.xsd} extends the dcterms schema mainly by adding attributes with controlled vocabularies, for domain specific semantic annotation (\code{linguistic-field, linguistic-type, language, role, discourse-type})
     161\xne{OLAC Metadata}\furl{http://www.language-archives.org/} format \cite{Bird2001} is an application profile \cite{heery2000application} of the \xne{Dublin Core metadata terms}, adapted to the needs of the linguistic community. It is developed and maintained by the \xne{Open Language Archives Community}, providing a common platform and an infrastructure for ``creating a worldwide virtual library of language resources'' \cite{Simons2003OLAC}.
     162
     163The OLAC schema\furl{http://www.language-archives.org/OLAC/1.1/olac.xsd} extends the dcterms schema mainly by adding attributes with controlled vocabularies for domain-specific semantic annotation (\code{linguistic-field}, \code{role}, \code{linguistic-type}, \code{language}, \code{discourse-type}).
    165164
    166165\begin{quotation}
     
    234233One point of criticism from the community was the fact that the META-SHARE infrastructure does not provide any interface to the outside world, such as an OAI-PMH endpoint.
    235234
    236 ? MetaShare ontology\furl{http://metashare.ilsp.gr/portal/knowledgebase/TheMetaShareOntology}
     235%? MetaShare ontology\furl{http://metashare.ilsp.gr/portal/knowledgebase/TheMetaShareOntology}
    237236
    238237
    239238\subsection{ELRA}
    240239
    241 European Language Resources Association\furl{http://elra.info} ELRA, offers a large collection of language resources, mostly under license for a fee, although some resources are available for free as well.
     240The European Language Resources Association (ELRA)\furl{http://elra.info} offers a large collection of language resources (over 1.100), with a focus on spoken resources, but also written, terminological and multimodal resources, mostly under license for a fee (although selected resources are available for free as well).
    242241The available datasets can be searched for via the ELRA Catalog\furl{http://catalog.elra.info/}.
    243242Additionally, ELRA runs the so-called \xne{Universal Catalog} -- a repository comprising information regarding Language Resources (LRs) identified all over the world.
     
    254253\subsection{LDC}
    255254
    256 Linguistic Data Consortium\furl{http://www.ldc.upenn.edu/} is another provider of high quality curated language resources
    257 
     255The Linguistic Data Consortium\furl{http://www.ldc.upenn.edu/}, hosted by the University of Pennsylvania, is another provider/aggregator of high-quality curated language resources. The data is provided for a fee; more than 650 resources have been made available since 1993. The catalog is freely accessible, and the metadata is additionally aggregated by the OLAC archives.
    258256
    259257\section{Formats and Collections in the World of Libraries}
    260 
    261 There are at least two reasons to concern ourselves with the developments in the world of Libraries and Information Systems (LIS): the long tradition implying rich experience and the fact, that almost all of the resources in the libraries are language resources. This argument gets even more relevant in the light of the efforts to digitize large portions of the material pursued in many (national) libraries in the last years (cf. discussion on Libraries partnering with Google). And given the amounts of data, even only the bibliographic records constitute sizable language resources in they own right.
     258\label{sec:lib-formats}
     259
     260There are at least two reasons to concern ourselves with the developments in the world of Libraries and Information Systems (LIS): the long tradition implying rich experience, and the fact that almost all of the resources in the libraries are language resources. This argument becomes even more relevant in the light of the efforts to digitize large portions of the material pursued by many (national) libraries in recent years (cf. the discussion on Libraries partnering with Google). And given the amounts of data, even the bibliographic records alone constitute sizable language resources in their own right.
    262261
    263262%\item[LoC] Library of Congress \url{http://www.loc.gov}
     
    280279A number of tools have been developed to author and process the \xne{METS} format\furl{http://www.loc.gov/standards/mets/mets-tools.html} and numerous projects (online editions, DAM systems) use METS for structuring and recording the data\footnote{\url{http://www.loc.gov/standards/mets/mets-registry.html} though seems rather outdated}, among others also \xne{austrian literature online}\furl{http://www.loc.gov/standards/mets/mets-registry.html}
    281280
    282 Metadata Object Description Schema - ``is a schema for a bibliographic element set that may be used for a variety of purposes, and particularly for library applications''. It is a simplified subset of MARC 21 using  language-based tags rather than numeric ones,
     281\xne{Metadata Object Description Schema} (MODS) ``is a schema for a bibliographic element set that may be used for a variety of purposes, and particularly for library applications''. It is a simplified subset of MARC 21 using language-based tags rather than numeric ones,
    283282richer than Dublin Core. It is one of the schemas endorsed to extend (be used inside) METS.
    284283
    285 In 1998 a new  Entitiy Relationship model - FRBR - Functional Requirements for Bibliographic Records  2002 \cite{FRBR1998}
    286 and since ?? RDA - Resource Description and Access
     284There have been efforts to create a conceptually more sound basis for bibliographic data: in 1998 the \xne{Functional Requirements for Bibliographic Records} (FRBR) \cite{FRBR1998} were published, an abstract model of the data expressed as an Entity Relationship model. A standard based on FRBR, \xne{Resource Description and Access} (RDA), has been proposed as a comprehensive standard for resource description and discovery, which however was confronted with opposition from the LIS community, questioning the need to abandon established cataloging practices \cite{gorman2007rda}.
     285And although there is still work on RDA, among others by the Library of Congress, there has been no wider adoption of the standard by the LIS community until now.
    287286
    288287\subsection{ESE, Europeana Data Model - EDM}
    289288
    290 Within the big european initiative \xne{Europeana} (cf. \ref{lit:digi-lib}) information about digitised objects are collected from a great number of cultural institutions from all of Europe, currently
    291 
    292 originally developed and advised the common format \xne{ESE Europeana Semantic Elements}\furl{http://pro.europeana.eu/ese-documentation} a Dublin Core-based application profile\furl{www.europeana.eu/schemas/ese/ESE-V3.4.xsd}. Soon it became obvious, that this format is very limiting and work started on a Semantic Web compatible RDF-based format -- the Europeana Data Model EDM\furl{http://pro.europeana.eu/edm-documentation} \cite{isaac2012europeana, haslhofer2011data,doerr2010europeana}.
    293 EDM is fully compatible with ESE, which is (and will be) accepted from the providers. There is a SPARQL endpoint\furl{http://europeana.ontotext.com/sparql} to explore the semantic data of Europeana.
     289Within the big European initiative \xne{Europeana} (cf. \ref{lit:digi-lib}), information about digitised objects is collected from a great number of cultural institutions from all over Europe, currently hosting information about 29 million objects from 2.200 institutions in 36 countries\furl{http://www.pro.europeana.eu/web/guest/content}.
     290
     291For collecting metadata from the content providers, Europeana originally developed and advocated the common format \xne{Europeana Semantic Elements} (ESE)\furl{http://pro.europeana.eu/ese-documentation}, a Dublin Core-based application profile\furl{www.europeana.eu/schemas/ese/ESE-V3.4.xsd}. Soon it became obvious that this format is too limiting, and work started on a Semantic Web compatible RDF-based format -- the Europeana Data Model (EDM)\furl{http://pro.europeana.eu/edm-documentation} \cite{isaac2012europeana,haslhofer2011data,doerr2010europeana}.
     292EDM is fully compatible with ESE, which is (and will be) still accepted from the providers. There is already a SPARQL endpoint\furl{http://europeana.ontotext.com/sparql} to explore the Europeana data in the new format.
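A sketch of querying that endpoint with Python (SPARQLWrapper); \code{edm:ProvidedCHO} is a core EDM class, but the endpoint and its data layout may have changed, so this is an assumption-laden illustration rather than a tested recipe.

    # Sketch: count provided cultural heritage objects at the Europeana endpoint.
    # Endpoint URL taken from the text above; availability is not guaranteed.
    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper("http://europeana.ontotext.com/sparql")
    sparql.setQuery("""
        PREFIX edm: <http://www.europeana.eu/schemas/edm/>
        SELECT (COUNT(?cho) AS ?n) WHERE { ?cho a edm:ProvidedCHO }
    """)
    sparql.setReturnFormat(JSON)
    result = sparql.query().convert()
    print(result["results"]["bindings"][0]["n"]["value"])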
    294293%https://github.com/europeana
    295294
     
    304303Conceptually, we want to partition these resources into two types: on the one hand, abstract concepts constituting all kinds of classifications, typologies and taxonomies; on the other hand, named entities that exist(ed) in the real world, like persons, organizations or geographical places. The main motivation for this distinction is the insight that, while for named entities there is (mostly) ``something'' in the (physical) world that gives a solid ground for equivalence relations between references from different sources (sameAs), for concepts we need to accept a plurality of existing conceptualizations, and while we can (and have to) try to identify relations between them, the equivalence relation is inherently much weaker. This insight entails a partly different approach -- simply put, while we can aspire to create one large list/index encompassing all named entities, we have to maintain a forest of conceptual trees.
    305304
    306 In the following we inventarize such resources, covering the domains expected in the dataset. (Information about size of the dataset is meant rather as a rough indication of the "general weight" of the dataset, not necessarily a precise up to date information.) The acronyms in the tables are resolved in the subsequent glossary.
     305In the following we inventory such resources (cf. tables \ref{table:data-ne}, \ref{table:data-concepts}), covering the domains expected to be needed for linking the original dataset. (Information about the size of a dataset is meant rather as a rough indication of its ``general weight'', not necessarily as precise, up-to-date information.) The acronyms in the tables are resolved in the glossary \ref{table:vocab-glossary}.
    307306How these resources will be employed is discussed in \ref{sec:values2entities}.
     307Additionally, some verbose commentary follows.
    308308
    309309%\subsubsection{Named entities}
     
    312312Other general large-scale resources are the vocabularies curated and provided by the Getty Research Institute\furl{http://www.getty.edu/research/tools/vocabularies/index.html}; however, there is only limited free access, full access being licensed for a fee. But recently work was announced to publish the vocabularies as LOD\furl{http://www.getty.edu/research/tools/vocabularies/lod/index.html}
    313313
    314 Yago is a large knowledge integrating dbpedia, geonames and ..??
    315 
    316 Regarding existing domain-specific semantic resources \texttt{LT-World}\footnote{\url{http://www.lt-world.org/}},  the ontology-based portal covering primarily Language Technology being developed at DFKI\footnote{Deutsches Forschungszentrum fÃŒr KÃŒnstliche Intelligenz, \url{http://www.dfki.de}},  is a prominent resource providing information about the entities (Institutions, Persons, Projects, Tools, etc.) in this field of study. \cite{Joerg2010}
     314Regarding existing domain-specific semantic resources, \texttt{LT-World}\footnote{\url{http://www.lt-world.org/}}, the ontology-based portal covering primarily Language Technology being developed at DFKI\footnote{Deutsches Forschungszentrum für Künstliche Intelligenz, \url{http://www.dfki.de}}, is a prominent resource providing information about the entities (Institutions, Persons, Projects, Tools, etc.) in this field of study \cite{Joerg2010}.
     315
     316Also worth mentioning is \xne{Yago}, a large knowledge base created by the MPI Informatik, integrating DBpedia, Geonames and WordNet\furl{http://www.mpi-inf.mpg.de/yago-naga/yago/} \cite{Suchanek2007yago}.
    317317
    318318So we witness a strong general trend towards Semantic Web and Linked Open Data.
     
    323323
    324324%\subsection{Concepts -- Classifications, Taxonomies, \dots}
     325
     326
     327\begin{comment}
     328
     329VoID "Vocabulary of Interlinked Datasets") is an RDF based schema to describe linked datasets\furl{http://semanticweb.org/wiki/VoID}
     330
     331\subsection{schema.org}
     332http://schema.org/docs/datamodel.html
     333http://www.w3.org/wiki/WebSchemas/ExternalEnumerations
     334
     335microdata or
     336http://www.w3.org/TR/rdfa-lite/
     337 Resource Description Framework in attributes
     338
     339the entire WorldCat cataloging collection made publicly
     340available using Schema.org mark-up with library extensions for use by developers and
     341search partners such as Bing, Google, Yahoo! and Yandex
     342
     343OCLC begins adding linked data to WorldCat by appending
     344Schema.org descriptive mark-up to WorldCat.org pages, thereby
     345making OCLC member library data available for use by intelligent
     346Web crawlers such as Google and Bing
     347
     348\end{comment}
     349
     350\section{Summary}
     351
     352In this chapter, we gave an overview of the existing formats and datasets in the broad context of Language Resources and Technology.
     353We also gave an overview of the main formats and collections in the domain of Library and Information Services and an inventory of existing controlled vocabularies for named entities and concepts (taxonomies, classifications), needed as input in section \ref{sec:values2entities} about mapping values to entities.
     354
    325355
    326356
     
    345375& & \multicolumn{3}{l}{764,000 persons; 333,000 works; 192,000 organizations; 639,000 geographica } \\
    346376Yago \cite{Suchanek2007yago} & MPI Informatik & 1E7 / 1.2E8 & huge semantic KB (aggregated from Wikipedia, Wordnet, Geonames) & \href{http://www.mpi-inf.mpg.de/yago-naga/yago/downloads.html}{data dumps} \\
    347 \href{http://lt-world.de}{LT-World} & DFKI & 3.300 persons, 4.600 organizations & ontology-based portal for Language Technology & \href{http://www.lt-world.org/kb/}{portal} \\
     377\href{http://lt-world.de}{LT-World} & DFKI & 3.300 persons& ontology-based portal for LRT & \href{http://www.lt-world.org/kb/}{portal} \\
     378& & 4.600 organizations & & \\
    348379Geonames & Geonames & \textgreater 1E7 (2.8E6 / 5.5E6) & "modern" place names & data dump + web service \\
    349380PKND     & prometheus & \textgreater 37,000 & persons, artists & \href{http://prometheus-bildarchiv.de/de/tools/pknd}{XML dump} \\
     
    389420GND/s & DNB & 202.000 & subjects (Schlagwörter), universal, lang:de & \\
    390421GTAA & NISL & 3.800 & Subjects, describing TV programs & \href{http://datahub.io/de/dataset/gemeenschappelijke-thesaurus-audiovisuele-archieven}{(RDF) data dumps}, \href{https://openskos.meertens.knaw.nl/}{OpenSKOS} -- search service \\
    391 DDC & OCLC & & universal classification by field of study, translated in multiple languages & \href{http://dewey.info/}{dewey.info} \\
     422DDC & OCLC & & universal classification by field of study, multi langs & \href{http://dewey.info/}{dewey.info} \\
    392423UDC & & & & \\
    393424Wiki Categories & Wikipedia & 995,911& classification of Wiki articles as skos:Concepts & SKOS Vocabulary, SPARQL \\
    394425 DBpedia Ontology & Wikipedia & 529 / 2333 & general classification of Wiki articles as ontology & \href{http://wiki.dbpedia.org/Ontology39?v=g9b}{RDF data}, SPARQL\\
    395 ISOcat & (CLARIN) & \textgreater 6,500 & data categories defining (linguistic) concepts in a number of thematic groups (Metadata, Lexical Resources, ...) & \href{http://www.isocat.org}{web-app}, service \\
    396 Object Names Thesaurus & British Museum & &  classification of objects in the collection & \\
    397 Material Thesaurus & British Museum & & classification of material & \\
    398 Thesaurus of Monument Types & British Museum & & types of monuments & \\
     426ISOcat & (CLARIN) & \textgreater 6,500 & data categories defining (linguistic) concepts & \href{http://www.isocat.org}{web-app}, service \\
     427Object Names Thes. & British Museum & &  classification of objects in the collection & \\
     428Material Thes. & British Museum & & classification of material & \\
     429Thes. Monument Types & British Museum & & types of monuments & \\
    399430Hornbostel-Sachs-Systematik & & 300 categories & classification of musical instruments & \href{http://www.music.vt.edu/musicdictionary/texth/Hornbostel-Sachs.html}{web page} \\
    400431Oberbegriffsdatei  & DMB & & a set of vocabularies for museums, lang:de  & \url{museumsvokabular.de}, PDF, XML dumps\\
     
    408439\end{landscape}
    409440
    410 \begin{description}
    411 \item[AAT] international Architecture and Arts Thesaurus, Getty
    412 \item[CONA] Cultural Objects Name Authority
    413 \item[DAI] Deutsches ArchÀologisches Institut
    414 \item[DDC] Dewey Decimal Classification
    415 \item[DFKI] Deutsches Forschungszentrum fÃŒr KÃŒnstliche Intellligenz
    416 \item[DMB] Deutscher Museumsbund
    417 \item[DNB] Deutsche National Bibliothek
    418 \item[FAST] Faceted Application of Subject Terminology
    419 \item[Getty] Getty Research Institute curating the vocabularies\furl{http://www.getty.edu/research/tools/vocabularies/index.html}, part of Getty Trust
    420 \item[GND] \emph{Gemeinsame Norm Datei} - Integrated authority Files of the German National Library
    421 \item[GTAA] Gemeenschappelijke Thesaurus Audiovisuele Archieven (Common Thesaurus [for] Audiovisual Archives)
    422 \begin{quotation} The thesaurus consists of several facets for describing TV programs: subjects; people mentioned; named entities (Corporation names, music bands etc); locations; genres; makers and presentators. \end{quotation}
    423 \item[ISO] International Standardization Organization
    424 \item[LCCN] Library of Congress Control Number
    425 \item[LCC] Library of Congress Classification
    426 \item[LCSH] Library of Congress Subject Headings
    427 \item[LoC] Library of Congress\furl{http://loc.gov}
    428 \item[OCLC] Online Computer Library Center\furl{http://www.oclc.org} -- world's biggest library federation
    429 \item[PKND] prometheus KÃŒnstlerNamensansetzungsDatei\furl{http://prometheus-bildarchiv.de/de/tools/pknd}
    430 \item[RKD] Rijksbureau voor Kunsthistorische Documentatie -- Netherlands Institute for Art History
    431 \item[TGN] Getty Thesaurus of Geographic Names
    432 \item[UDC] Universal Decimal Classification                             
    433 \item[ULAN] Union List of Artist Names
    434 \item[VIAF] Virtual International Authority File -- union of the authority files of \textgreater 20 national (and prominent research) libraries
    435 \end{description}
    436 
    437 
    438 \begin{comment}
    439 
    440 VoID "Vocabulary of Interlinked Datasets") is an RDF based schema to describe linked datasets\furl{http://semanticweb.org/wiki/VoID}
    441 
    442 \subsection{schema.org}
    443 http://schema.org/docs/datamodel.html
    444 http://www.w3.org/wiki/WebSchemas/ExternalEnumerations
    445 
    446 microdata or
    447 http://www.w3.org/TR/rdfa-lite/
    448  Resource Description Framework in attributes
    449 
    450 the entire WorldCat cataloging collection made publicly
    451 available using Schema.org mark-up with library extensions for use by developers and
    452 search partners such as Bing, Google, Yahoo! and Yandex
    453 
    454 OCLC begins adding linked data to WorldCat by appending
    455 Schema.org descriptive mark-up to WorldCat.org pages, thereby
    456 making OCLC member library data available for use by intelligent
    457 Web crawlers such as Google and Bing
    458 
    459 \end{comment}
    460 
    461 \section{Summary}
    462 
    463 In this chapter, we gave an overview of the existing formats and datasets in the broad context of Language Resources and Technology.
    464 We also gave an overview of main formats and collections in the domain of Library and Information Services and a inventory of existing controlled vocabularies for named entities and concepts (taxonomies, classifications).
    465 
     441
     442
     443\begin{table}
     444\caption{Glossary of acronyms used in the overview of controlled vocabularies (tables \ref{table:data-ne}, \ref{table:data-concepts}) }
     445\label{table:vocab-glossary}
     446
     447%  \begin{tabu}{  >{\sffamily}l p{0.8\textwidth}
     448\begin{tabular}{ >{\sffamily}l p{0.8\textwidth}}
     449%    \hline
     450%\rowfont{\itshape\small} name & provider & size (items / facts)  & description & access \\
     451 %   \hline
     452
     453AAT & Art and Architecture Thesaurus, Getty \\
     454CONA & Cultural Objects Name Authority \\
     455DAI & Deutsches Archäologisches Institut \\
     456DDC & Dewey Decimal Classification       \\
     457DFKI & Deutsches Forschungszentrum für Künstliche Intelligenz \\
     458DMB & Deutscher Museumsbund \\
     459DNB & Deutsche Nationalbibliothek \\
     460FAST & Faceted Application of Subject Terminology \\
     461Getty & Getty Research Institute curating the \href{http://www.getty.edu/research/tools/vocabularies/index.html}{vocabularies}, part of Getty Trust \\
     462GND & \emph{Gemeinsame Normdatei} - Integrated Authority File of the German National Library \\
     463GTAA & Gemeenschappelijke Thesaurus Audiovisuele Archieven (Common Thesaurus [for] Audiovisual Archives) \\
     464% {quotation} The thesaurus consists of several facets for describing TV programs: subjects; people mentioned; named entities (Corporation names, music bands etc); locations; genres; makers and presentators. \end{quotation} \\
     465ISO & International Organization for Standardization \\
     466LCCN & Library of Congress Control Number \\
     467LCC & Library of Congress Classification \\
     468LCSH & Library of Congress Subject Headings \\
     469LoC & Library of Congress\furl{http://loc.gov} \\
     470OCLC & Online Computer Library Center\furl{http://www.oclc.org} -- world's biggest library federation \\
     471PKND & \href{http://prometheus-bildarchiv.de/de/tools/pknd}{prometheus} KünstlerNamensansetzungsDatei\\
     472RKD & Rijksbureau voor Kunsthistorische Documentatie -- Netherlands Institute for Art History \\
     473TGN & Getty Thesaurus of Geographic Names \\
     474UDC & Universal Decimal Classification                            \\
     475ULAN & Union List of Artist Names \\
     476VIAF & Virtual International Authority File -- union of the authority files of \textgreater 20 national (and prominent research) libraries  \\
     477\end{tabular}
     478\end{table}
     479
  • SMC4LRT/chapters/Definitions.tex

    r3680 r3776  
    7474\end{definition}
    7575
    76 \noindent
     76\begin{example1}
    7777Example blocks, simple:
    78 \begin{example1}
    79 Short piece of sample data
    8078\end{example1}
    8179
    82 \noindent
    83 or with tabs (especially for RDF triples):
    8480\begin{example3}
    85 my:work & my:example & my:block
     81or with & tabs (especially for & RDF triples)
    8682\end{example3}
  • SMC4LRT/chapters/Design_SMCinstance.tex

    r3680 r3776  
    1212relevant parts in a triple store and do your SPARQL/reasoning on it. Well
    1313that's where I'm ultimately heading with all these registries related to
    14 semantic interoperability ... I hope ;-)\cite{Menzo2013mail}
     14semantic interoperability ... I hope ;-)
     15
     16\hfill \textit{Menzo Windhouwer} \cite{Menzo2013mail}
    1517\end{quotation}
     18
    1619
    1720As described in previous chapters (\ref{ch:infra},\ref{ch:design}), semantic interoperability is one of the main motivations for the CMD infrastructure. However, the established machinery pertains mostly to the schema level; the actual values in the fields of CMD instances remain ``just strings''. This is the case even though the problem of different labels for semantically equivalent or even identical entities is even more virulent on the instance level. While for a number of metadata fields the value domain can be enforced through schema validation, some important fields (like \concept{organization} or \concept{resource type}) have a constrained value domain that cannot (yet) be explicitly and exhaustively enumerated. This leads to a chronically inconsistent use of labels for referring to entities (as the instance data shows, some organizations are referred to by more than 20 different labels or spelling variants), prompting an urgent need for better means of harmonizing the constrained-field values.
     
    3841\subsection{CMD specification}
    3942
    40 The main entity of the meta model is the CMD component and is typed as specialization of the \code{owl:Class}. CMD profile is basically a CMD component with some extra features, implying a specialization relation:
     43The main entity of the meta model is the CMD component, typed as a class (an instance of \code{rdfs:Class}). A CMD profile is basically a CMD component with some extra features, implying a specialization relation. It would be natural to translate a CMD element to an RDF property, but it needs to be a class, as a CMD element -- next to its value -- can also carry attributes. This further implies a dedicated property \code{cmds:ElementValue} to express the actual value of a given CMD element.
    4144
    4245\label{table:rdf-spec}
    4346\begin{example3}
    44 cmds:Component & subClassOf  & owl:Class. \\
    45 cmds:Profile & subClassOf  & cmds:Component. \\
    46 cmds:Element & subClassOf  & rdf:Property. \\
    47 \end{example3}
     47cmds:Component & a  & rdfs:Class. \\
     48cmds:Profile & rdfs:subClassOf  & cmds:Component. \\
     49cmds:Element & a  & rdfs:Class. \\
     50cmds:ElementValue & a & rdf:Property \\
     51cmds:Attribute & a & rdf:Property \\
     52\end{example3}
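A minimal rdflib sketch of these metamodel declarations; the URI chosen for the \code{cmds:} namespace is a placeholder, not one fixed by the text.

    # Sketch: the CMD metamodel declarations expressed with rdflib.
    # The cmds: namespace URI is a placeholder.
    from rdflib import Graph, Namespace
    from rdflib.namespace import RDF, RDFS

    CMDS = Namespace("http://www.clarin.eu/cmd/metamodel#")  # placeholder URI

    g = Graph()
    g.bind("cmds", CMDS)
    g.add((CMDS.Component, RDF.type, RDFS.Class))
    g.add((CMDS.Profile, RDFS.subClassOf, CMDS.Component))
    g.add((CMDS.Element, RDF.type, RDFS.Class))
    g.add((CMDS.ElementValue, RDF.type, RDF.Property))
    g.add((CMDS.Attribute, RDF.type, RDF.Property))

    print(g.serialize(format="turtle"))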
     53
    4854
    4955\noindent
     
    5662 & dcterms:identifier & cr:clarin.eu:cr1:p\_1345561703620. \\
    5763cmd:Actor       & a & cmds:Component. \\
    58 cmd:LanguageName  & a & cmds:Element. \\
    59 \end{example3}
    60 
    61 \begin{note}
    62 Should the ID assigned in the Component Registry  for the CMD entities be used as identifier in RDF, or rather the verbose name? (if yes, how to ensure uniqueness – generate the name from the cmd-path?)
    63 \end{note}
     64cmd:Actor.LanguageName  & a & cmds:Element. \\
     65\end{example3}
     66
     67%\begin{note}
     68%Should the ID assigned in the Component Registry  for the CMD entities be used as identifier in RDF, or rather the verbose name? (if yes, how to ensure uniqueness – generate the name from the cmd-path?)
     69%\end{note}
     70
    6471
    6572\subsection{Data Categories}
     
    6976dcr:datcat & a  & owl:AnnotationProperty ; \\
    7077 & rdfs:label  & "data category"@en ; \\
    71  & rdfs:comment  & "This resource is equivalent to  this data category."@en ; \\
     78 & rdfs:comment  & "This resource is equivalent to this data category."@en ; \\
    7279 & skos:note  & "The data category should be identified by its PID."@en ; \\
    7380\end{example3}
     
    8794
    8895\noindent
    89 Analogously, we could model \xne{ISOcat} data categories as data properties, i.e. metadata elements referencing ISOcat data categories could be encoded as follows:
    90 
    91 \begin{example3}
    92 <lr1> & isocat:DC-2502 & "19th century"
    93 \end{example3}
    94 
    95 \noindent
    96 However, Windhouwer\cite{Windhouwer2012_LDL} argues against direct mapping of complex data categories to data properties and in favour of modelling data categories as annotation properties, so as to avoid too strong semantic implications.
    97 
    98 This raises the vice-versa question, whether to rather handle all data categories uniformly, which would mean encoding dublincore terms also as annotation properties, but the pragmatic view dictates to encode the data in line with the prevailing approach, i.e. express dublincore terms directly as data properties.
    99 
    100 
    101 \noindent
    102 The REST web service of \xne{ISOcat} provides a RDF representation of the data categories:
    103 
    104 \begin{example3}
    105 isocat:languageName & dcr:datcat & isocat:DC-2484; \\
    106  & rdfs:label & "language name"@en; \\
    107  & rdfs:comment & "A human understandable..."@en; \\
    108  & 
  \\
    109 \end{example3}
    110 
    111 However this is only meant as template, as is stated in the explanatory comment of the exported data:
    112 
    113 \begin{quotation}
    114 By default the RDF export inserts \code{dcr:datcat} annotation properties to maintain the link between the generated RDF resources and the used Data Categories. However, it is possible to also maintain a stronger semantic link when the RDF resources will be used as OWL (2) classes, properties or individuals.
    115 \end{quotation}
    116 
    117 So in a specific (OWL 2) application the relation with the data categories can be expressed as \code{owl:equivalentClass} for classes, \code{owl:equivalentProperty} for properties or \code{owl:sameAs} for individuals:
     96However, we argue against a direct mapping of complex data categories to data properties and in favour of modelling data categories as annotation properties, so as to avoid too strong semantic implications \cite{Windhouwer2012_LDL}.
     97In a specific (OWL 2) application the relation with the data categories can be expressed as \code{owl:equivalentClass} for classes, \code{owl:equivalentProperty} for properties or \code{owl:sameAs} for individuals:
    11898
    11999\begin{example3}
     
    132112
    133113\noindent
    134 By design, the relations in Relation Registry are not expressed with predicates from known vocabularies like \xne{SKOS} or \xne{OWL}, again with the aim to avoid too strong semantic implications. This leaves leeway for further specialization of the relations in specific applications.
    135 
    136 \begin{note}
    137 Does this mean, that I would say:
    138 \begin{example3}
    139 rel:sameAs & owl:equivalentProperty & owl:sameAs
    140 \end{example3}
    141 
    142 to enable the inference of the equivalences?
    143 
    144 Is this correct:
    145 \end{note}
    146 ?? That means, that to be able to infer that a value in a CMD element also pertains to a given data category, e.g.:
    147 
    148 \begin{example2}
    149  cmd:PublicationYear = 2012 $\rightarrow$ & dc:created = 2012
    150 \end{example2}
    151 
    152 \noindent
    153 following facts need to be present in the ontology :
    154 
    155 \begin{example3}
    156 <lr1> & cmd:PublicationYear & 2012\^{}\^{}xs:year \\
    157 cmd:PublicationYear &  owl:equivalentProperty & isocat:DC-2538 \\
    158 isocat:DC-2538 & rel:sameAs & dc:created \\
    159 rel:sameAs & owl:equivalentProperty &  owl:sameAs \\
    160 $\rightarrow$ \\
    161 <lr1> & dc:created & 2012\^{}\^{}xs:year \\
    162 \end{example3}
    163 
    164 \noindent
    165 What about other relations we may want to express? (Do we need them and if yes, where to put them? – still in RR?) Examples:
    166 
    167 \begin{example3}
    168 cmd:MDCreator   & owl:subClassOf & dcterms:Agent \\
    169 clavas:Organization & owl:subClassOf & dcterms:Agent \\
    170 <org1> & a & clavas:Organization \\
    171 \end{example3}
     114By design, the relations in the Relation Registry are not expressed with predicates from known vocabularies like \xne{SKOS} or \xne{OWL}, again with the aim to avoid too strong semantic implications. This leaves leeway for further specialization of the relations in specific applications. The \code{rel:*} properties can be understood as an upper layer of a taxonomy of relation types, implying a subtyping:
     115
     116\begin{example3}
     117rel:sameAs & rdfs:subPropertyOf & owl:sameAs
     118\end{example3}
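To illustrate what this subtyping buys, a small rdflib sketch (the \code{rel:} and \code{isocat:} URIs are illustrative placeholders): a SPARQL property path over \code{rdfs:subPropertyOf} lets a query for \code{owl:sameAs} links pick up the Relation Registry statements as well.

    # Sketch: rel:sameAs declared as subproperty of owl:sameAs, so a query over
    # owl:sameAs and its subproperties also sees Relation Registry links.
    # The rel: and isocat: namespace URIs below are illustrative placeholders.
    from rdflib import Graph, Namespace
    from rdflib.namespace import OWL, RDFS

    REL = Namespace("http://example.org/relcat/")
    ISOCAT = Namespace("http://www.isocat.org/datcat/")
    DCTERMS = Namespace("http://purl.org/dc/terms/")

    g = Graph()
    g.add((REL.sameAs, RDFS.subPropertyOf, OWL.sameAs))
    g.add((ISOCAT["DC-2538"], REL.sameAs, DCTERMS.created))

    query = """
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        PREFIX owl:  <http://www.w3.org/2002/07/owl#>
        SELECT ?a ?b WHERE {
            ?p rdfs:subPropertyOf* owl:sameAs .
            ?a ?p ?b .
        }
    """
    for a, b in g.query(query):
        print(a, "can be treated as the same as", b)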
     119
     120
    172121
    173122\subsection{CMD instances}
     
    177126
    178127It seems natural to use the PID of a Language Resource ( \code{<lr1>} ) as the resource identifier for the subject in the RDF representation. While this seems semantically sound, not every resource has to have a PID. (This is especially the case for ``virtual'' resources like collections, that are solely defined by their constituents and don't have any data on their own.) As a fall-back the PID of the MD record ( \code{<lr1.cmd>}  from \code{cmd:MdSelfLink} element) could be used as the resource identifier.
    179 If identifiers are present for both resource and metadata, the relationship between the resource and the metadata record can be expressed as an annotation using the \xne{OpenAnnotation} vocabulary\furl{http://openannotation.org/spec/core/core.html\#Motivations}:
     128If identifiers are present for both resource and metadata, the relationship between the resource and the metadata record can be expressed as an annotation using the \xne{OpenAnnotation} vocabulary\furl{http://openannotation.org/spec/core/core.html\#Motivations}.
     129(Note also that one MD record can describe multiple resources; this can also be easily accommodated in OpenAnnotation):
    180130
    181131\begin{example3}
    182132\_:anno1  & a & oa:Annotation; \\
    183  & oa:hasTarget  & <lr1>; \\
     133 & oa:hasTarget  & <lr1a>, <lr1b>; \\
    184134 & oa:hasBody  & <lr1.cmd>; \\
    185135 & oa:motivatedBy  & oa:describing \\
     
    192142\begin{example3}
    193143<lr1.cmd> & dcterms:identifier  & <lr1.cmd>;  \\
    194  & dcterms:creator ??  & "\var{\{cmd:MdCreator\}}";  \\
    195  & dcterms:publisher  & <http://clarin.eu>, <provider-oai-accesspoint>; ?? \\
    196  & dcterms:created /dcterms:modified? & "\var{\{cmd:MdCreated\}}" ?? \\
     144 & dcterms:creator & "\var{\{cmd:MdCreator\}}";  \\
     145 & dcterms:publisher  & <http://clarin.eu>\\
     146 & dcterms:created /dcterms:modified? & "\var{\{cmd:MdCreated\}}" \\
    197147\end{example3}
    198148
     
    207157& ore:aggregates  & <lr1.cmd>, <lr2.cmd>; \\
    208158\end{example3}
    209 
    210 \noindent
    211 ?? Should both collection hierarchy and resource-pointers (collection and resource MD records) be encoded as ore:Aggregation?
    212 Additionally the flat header field \code{cmd:MdCollectionDisplayName} has been introduced to indicate by simple means the collection, of which given resource is part.
    213 This information can be used to generate a separate one-level grouping of the resources, in which the value from the \code{cmd:MdCollectionDisplayName} element would be used as the label of an otherwise undefined \code{ore:ResourceMap}.
    214 Even the identifier/ URI for this collections is not clear. Although this collections should match with the ResourceProxy hierarchy, there is no guarantee for this, thus a 1:1 mapping cannot be expected.
    215 
    216 \todocode{check consistency for MdCollectionDisplayName vs. IsPartOf in the instance data}
    217 
    218 \begin{example3}
    219 \_:mdcoll  & a   & ore:ResourceMap; \\
    220  & rdfs:label & "Collection 1"; \\
    221 \_:mdcoll\#aggreg & a   & ore:Aggregation \\
    222  & ore:aggregates  & <lr1.cmd>, <lr2.cmd>; \\
    223 \end{example3}
    224159       
    225160\subsubsection{Components – nested structures}
    226161
    227 There are two variants to express the tree structure of the CMD records, i.e. the containment relation between the components:
    228 
    229 \begin{enumerate}[a)]
    230 \item the components are encoded as object property
    231 
    232 \begin{example3}
    233 <lr1>  & cmd:Actor  & \_:Actor1 \\
    234 <lr1>  & cmd:Actor  & \_:Actor2 \\
    235 \_:Actor1  & cmd:motherTongue  & iso-639:aac \\
    236 \_:Actor2  & cmd:motherTongue  & iso-639:deu \\
    237 \_:Actor1  & cmd:role & "Interviewer" \\
    238 \_:Actor2 & cmd:role & "Speaker" \\
    239 \end{example3}
    240 
    241 \item a dedicated object property is used
     162For expressing the tree structure of the CMD records, i.e. the containment relation between the components, a dedicated property \code{cmd:contains} is used:
    242163
    243164\begin{example3}
     
    246167\end{example3}
    247168
    248 \end{enumerate}
    249 
    250169\subsection{Elements, Fields, Values}
    251170Finally, we want to integrate also the actual field values in the CMD records into the ontology.
    252171
    253 \subsubsection{Predicates}
    254 As explained before CMD elements are typed as \code{rdf:Property} with the corresponding data category expressed as annotation property:
     172% \subsubsection{Predicates}
     173As explained before, CMD elements have to be typed as \code{rdfs:Class}, with the actual value expressed via the \code{cmds:ElementValue} property and the corresponding data category expressed as an annotation property.
     174
     175The following example shows the whole chain of statements from the metamodel down to a literal value:
    255176
    256177\begin{example3}
    257178cmd:timeCoverage  & a   & cmds:Element \\
     179cmd:timeCoverageValue & a & cmds:ElementValue \\
    258180cmd:timeCoverage  & dcr:datcat  & isocat:DC-2502 \\
    259 <lr1>  & cmd:timeCoverage  & "19th century" \\
    260 
    261 \end{example3}
    262 
    263 \subsubsection{Literal values -- data properties}
    264 
    265 To generate triples with literal values is straightforward:
    266 
    267 \begin{definition}{Literal triples}
    268 lr:Resource \ \quad cmds:Property \ \quad xsd:string
    269 \end{definition}
    270 
    271 \begin{example3}
    272 <lr1> & cmd:Organisation & "MPI" \\
    273 \end{example3}
    274 
    275 \subsubsection{Mapping to entities -- object properties}
    276 
    277 The more challenging but also more valuable aspect is to generate objectProperty triples with the literal values mapped to semantic entities:
    278 
    279 \begin{definition}{new RDF triples}
    280 lr:Resource \ \quad cmd:Property \ \quad xsd:anyURI
    281 \end{definition}
    282 
    283 \begin{example3}
    284 <lr1> & cmd:Organisation\_? & <org1> \\
    285 \end{example3}
    286 
    287 \begin{note}
     181<lr1> & cmd:contains & \_:timeCoverage1 \\
     182\_:timeCoverage1 & a & cmd:timeCoverage \\
     183\_:timeCoverage1 & cmd:timeCoverageValue & "19th century" \\
     184\end{example3}
     185
     186
     187While generating triples with literal values seems straightforward, the more challenging but also more valuable aspect is to generate object property triples with the literal values mapped to semantic entities:
     188
     189\begin{example3}
     190\var{cmds:Element} & \var{cmds:ElementValue\_?} & \var{xsd:anyURI}\\
     191\_:organisation1 & cmd:OrganisationValue\_? & <org1> \\
     192\end{example3}
     193
     194\begin{comment}
    288195Don't we need a separate property (predicate) for the triples with object properties pointing to entities,
    289196i.e. \code{cmd:Organisation\_} additionally to \code{cmd:Organisation}
    290 \end{note}
    291 
    292 The mapping process is detailed in \ref{sec:values2entities}
    293 
    294 %%%%%%%%%%%%%%%%%55
     197\end{comment}
     198
     199The mapping process is detailed in \ref{sec:values2entities}.
     200
     201
     202
     203%%%%%%%%%%%%%%%%%
    295204\section{Mapping field values to semantic entities}
    296205\label{sec:values2entities}
     
    310219We don't try to achieve complete ontology alignment, we just want to find
    311220for our ``anonymous'' concepts semantically equivalent concepts from other ontologies.
    312 This is very near just other phrasing for the definition of ontology mapping function as given by \cite{EhrigSure2004, amrouch2012survey}:
     221This is almost equivalent to the definition of ontology mapping function as given by \cite{EhrigSure2004, amrouch2012survey}:
    313222``for each concept (node) in ontology A [tries to] find a corresponding concept
    314223(node), which has the same or similar semantics, in ontology B and vice versa''.
    315224
    316225The first two points in the above enumeration represent the steps necessary to be able to apply the ontology mapping.
    317 The identification of appropriate vocabularies is discussed in the next subsection. In the operationalization, the identified vocabularies could be treated as one aggregated ontology to map all entities against. For the sake of higher precision, it may be sensible to perform the task separately for individual concepts, i.e. organisations, persons etc. and in every run consider only relevant vocabularies.
    318 
    319 
    320 The transformation of the data has been partly described in previous section:
    321 It can be trivially automatically converted into RDF triples as :
    322 
    323 \begin{example3}
    324 <lr1> & cmd:Organisation & "MPI" \\
    325 \end{example3}
    326 
    327 However for the needs of the mapping task we propose to reduce and rewrite to retrieve distinct concept , value pairs:
    328 
    329 \begin{example3}
    330 \_:1 & a & cmd:Organisation;\\
     226The identification of appropriate vocabularies is discussed in the next subsection. In the operationalization, the identified vocabularies could be treated as one aggregated semantic resource to map all entities against. For the sake of higher precision, it may be sensible to perform the task separately for individual concepts, i.e. organisations, persons etc. and in every run consider only relevant vocabularies.
     227
     228The transformation of the data has been partly described in the previous section. It can be trivially and automatically converted into RDF triples as:
     229
     230\begin{example3}
     231\_:organisation1 & cmd:OrganisationValue & "MPI" \\
     232\end{example3}
     233
     234However, for the needs of the mapping task, we propose to reduce and rewrite the data to retrieve distinct (concept, value) pairs (cf. figure \ref{fig:smc_cmd2lod}):
     235
     236\begin{example3}
     237\_:1 & a & clavas:Organisation;\\
    331238   & skos:altLabel & "MPI";
    332239\end{example3}
     
    345252\subsubsection{Identify vocabularies}
    346253
    347 \todoin{Identify related ontologies, vocabularies? - see DARIAH:CV}
    348 LT-World \cite{Joerg2010}
    349 
    350 One generic way to indicate vocabularies for given metadata fields or data categories being discussed in the CMD community is to use dedicated annotation property (tentatively \code{@clavas:vocabulary}) in the schema or data category definition. For such a mechanism to work, the consuming applications (like metadata editor) need to be made aware of this convention and interpret it accordingly.
     254One generic way to indicate vocabularies for given metadata fields or data categories, currently being discussed in the CMD community, is to use a dedicated annotation property in the schema or data category definition (tentatively labeled \code{@clavas:vocabulary}). For such a mechanism to work, the consuming applications (like the metadata editor) need to be made aware of this convention and interpret it accordingly.
    351255
    352256The primary providers of relevant vocabularies are \xne{ISOcat} and \xne{CLAVAS} – a service for managing and providing vocabularies in SKOS format (cf. \ref{def:CLAVAS}). Closed and corresponding simple data categories are already being exported from ISOcat in SKOS format and imported into CLAVAS/OpenSKOS, and other relevant vocabularies shall also be ingested into this system, so that we can assume OpenSKOS as a first source of vocabularies. However, definitely not all of the existing reference data will be hosted by OpenSKOS, so in general we have to assume/consider a number of different sources (cf. \ref{refdata}).
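A hedged sketch of the basic lookup this enables: matching a CMD field value against the SKOS labels of such a vocabulary loaded into rdflib; the vocabulary file name is a placeholder, and a production setup would rather query OpenSKOS or the aggregated store.

    # Sketch: naive matching of a field value against a SKOS vocabulary.
    # "organisations.skos.rdf" is a placeholder for e.g. an OpenSKOS export.
    from rdflib import Graph
    from rdflib.namespace import SKOS

    g = Graph()
    g.parse("organisations.skos.rdf")

    def lookup(value):
        """Concepts whose prefLabel or altLabel matches the normalized value."""
        needle = value.strip().lower()
        hits = set()
        for predicate in (SKOS.prefLabel, SKOS.altLabel):
            for concept, label in g.subject_objects(predicate):
                if str(label).strip().lower() == needle:
                    hits.add(concept)
        return hits

    print(lookup("MPI"))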
     
    380284\end{definition}
    381285
    382 In the implementation, there needs to be additional initial configuration input, identifying datasets for given data categories,
      286In the implementation, an additional initial configuration input is needed, identifying the datasets for given data categories,
    383287which will be the result of the previous step.
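
A minimal sketch of such a configuration is given below, simply as a mapping from data category identifiers to vocabulary sources; the PIDs and URLs are placeholders, not actual values.

\lstset{language=Python}
\begin{lstlisting}[label=lst:sketch-vocab-config, caption={Sketch of the configuration mapping data categories to vocabularies}]
# Illustrative configuration sketch: which vocabulary (dataset) to
# consult for which data category. PIDs and URLs are placeholders.
DATCAT_VOCABULARIES = {
    "http://www.isocat.org/datcat/DC-0000": [      # placeholder PID
        "http://example.org/clavas/organisations"  # placeholder vocabulary
    ],
    "http://www.isocat.org/datcat/DC-0001": [      # placeholder PID
        "http://example.org/vocab/resource-types"  # placeholder vocabulary
    ],
}

def vocabularies_for(datcat_pid):
    """Return the configured vocabulary sources for a data category."""
    return DATCAT_VOCABULARIES.get(datcat_pid, [])
\end{lstlisting}
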
    384288
     
    409313\label{sec:lod}
    410314
    411 
    412 With the new enhanced dataset, as detailed in section \ref{sec:cmd2rdf}, the groundwork is laid for the full-blown semantic search as proposed in the original goals, i.e. the possibility for ontology-driven or at least `semantic resources assisted' exploration of the dataset.
    413 
    414 Namely to enhance it by employing ontological resources.
    415 Mainly this enhancement shall mean, that the user can access the data indirectly by browsing one or multiple ontologies, with which the data will then be linked. These could be for example ontologies of Organizations and Projects.
    416 
    417 
    418 SPARQL
    419 
    420 rechercheisidore, dbpedia, ...
    421 
    422 
    423 \cite{Europeana RDF Store Report}
    424 
    425 Technical aspects (RDF-store?): Virtuoso
    426 
    427 
    428 semantic search component in the Linked Media Framework
    429 
    430 \todoin{check SARQ}\furl{http://github.com/castagna/SARQ}
    431 
    432 
    433 %\section {Full semantic search - concept-based + ontology-driven ?}
    434 %\label{semantic-search}
    435 
     315With the new enhanced dataset, as detailed in section \ref{sec:cmd2rdf}, the groundwork is laid for the full-blown semantic search as proposed in the original goals, i.e. the possibility of exploring the dataset using external semantic resources.
      316The user can access the data indirectly by browsing external vocabularies and taxonomies with which the data will be linked, for example vocabularies of organizations or taxonomies of resource types.
     317
      318The technical base for a semantic web application is usually an RDF triple store, as discussed in \ref{semweb-tech}.
      319Given that our main concern is the data itself, its processing and display, we want to rely on a stable, robust, feature-rich solution that minimizes the effort of providing the data online. The most promising candidate seems to be \xne{Virtuoso}, an integrated hybrid data store able to deal with different types of data (``Universal Data Store'').
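
Once the data is loaded into such a store, client applications can query it via its SPARQL endpoint. The following sketch shows such a query; the endpoint URL (the default local Virtuoso address) and the query pattern are assumptions for illustration only.

\lstset{language=Python}
\begin{lstlisting}[label=lst:sketch-sparql, caption={Sketch of a SPARQL query against the triple store}]
# Illustrative sketch: querying the triple store via its SPARQL endpoint.
# The endpoint URL and the query pattern are assumptions.
from SPARQLWrapper import SPARQLWrapper, JSON

endpoint = SPARQLWrapper("http://localhost:8890/sparql")
endpoint.setQuery("""
    PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
    SELECT ?resource ?label WHERE {
        ?resource ?property ?entity .
        ?entity skos:altLabel ?label .
        FILTER (CONTAINS(LCASE(STR(?label)), "mpi"))
    } LIMIT 10
""")
endpoint.setReturnFormat(JSON)
for row in endpoint.query().convert()["results"]["bindings"]:
    print(row["resource"]["value"], row["label"]["value"])
\end{lstlisting}
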
     320
     321
      322Although the distributed nature of the data is one of the defining features of LOD and theoretically one should be able to follow the data via dereferenceable URIs, in practice it is mostly necessary, for performance reasons, to pool linked datasets from different sources that shall be queried together into one data store. This implies that the data to be kept by the data store will be considerably larger than ``just'' the original dataset.
    436323
    437324\section{Summary}
    438325
    439 %The task can be also seen as building bridge between XML resources and semantic resources expressed in RDF, OWL.
    440 
    441 The process of expressing the whole of the data as one semantic resource, can be also understood as schema or ontology merging task. Data categories being the primary mapping elements
    442 
    443 
    444 In this chapter, an expression of the whole of the CMD data domain into RDF was proposed, with special focus on the way how to translate the string values in metadata fields to corresponding semantic entities. Additionally, some technical considerations were discussed regarding exposing this dataset as Linked Open Data and the implications for real semantic ontology-based data exploration.
    445 
     326In this chapter, an expression of the whole of the CMD data domain into RDF was proposed, with special focus on the method to translate the string values in metadata fields to corresponding semantic entities.
      327This task can also be seen as building a bridge between the world of XML resources and that of semantic resources expressed in RDF.
     328Additionally, some technical considerations were discussed regarding exposing this dataset as Linked Open Data and the implications for real semantic ontology-based data exploration.
     329
     330%The process of expressing the whole of the data as one semantic resource, can be also understood as schema or ontology merging task. Data categories being the primary mapping elements
  • SMC4LRT/chapters/Design_SMCschema.tex

    r3680 r3776  
    1212The SMC module is part of the CMD Infrastructure. It is a consumer of data from the production-side registries and serves search services on the exploitation side of the infrastructure, as well as third party applications accessing the joint CLARIN metadata domain.
    1313
    14 \begin{figure*}[!ht]
     14\begin{figure*}
    1515\includegraphics[width=0.8\textwidth]{images/SMC_modules.png}
    1616\caption{The component view on the SMC - modules and their inter-dependencies}
     
    4545
    4646\subsection{smcIndex}\label{def:smcIndex}
    47 In this section, we describe \code{smcIndex} -- the data type to denote indexes used by the components of the system internally, as well as input and output on the interfaces.
    48 
    49 An \code{smcIndex} is a human-readable string adhering to a specific syntax, denoting some search index. The syntax is based on two main ideas drawn from existing work: a) denoting a context by a prefix is derived from the way indices are referenced in CQL-syntax\footnote{Context Query Language, \url{http://www.loc.gov/standards/sru/specs/cql.html}} (analogous to the XML-namespace mechanism), e.g. \concept{dc.title} and b) on the dot-notation used in IMDI-browser\footnote{\url{http://www.lat-mpi.eu/tools/imdi}} to denote paths into structured data (analogous to XPath), e.g. \concept{Session.Location.Country}. The grammar generates only single terms, that may not contain whitespaces.
      47In this section, we describe \var{smcIndex} -- the data type used to denote indexes, both internally by the components of the system and as input and output on the interfaces.
     48
      49An \var{smcIndex} is a human-readable string adhering to a specific syntax, denoting a search index. The syntax is based on two main ideas drawn from existing work: a) denoting a context by a prefix, derived from the way indices are referenced in CQL syntax\footnote{Context Query Language, \url{http://www.loc.gov/standards/sru/specs/cql.html}} (analogous to the XML-namespace mechanism, cf. \ref{cql}), e.g. \concept{dc.title}, and b) the dot-notation used in the IMDI browser\footnote{\url{http://www.lat-mpi.eu/tools/imdi}} to denote paths into structured data (analogous to XPath), e.g. \concept{Session.Location.Country}. The grammar generates only single terms that may not contain whitespace.
    5050
    5151\begin{defcap}
    52 \caption{Grammar of \code{smcIndex}}
     52\caption{Grammar of \var{smcIndex}}
    5353\begin{align*}
    5454smcIndex &::= dcrIndex \ | \ cmdIndex  \\
     
    6767\end{defcap}
    6868
    69 The grammar distinguishes two main types of \code{smcIndex}: a) \code{dcrIndex} referring to data categories and b) \code{cmdIndex} denoting a specific ``CMD entity'', i.e. an element (metadata field), component or whole profile defined within CMD (cf. \ref{def:CMD} for description of the CMD data model).
    70 These two types of \code{smcIndex} follow different construction patterns.
    71 \code{cmdIndex} has a recursive path-like structure and can be interpreted as a XPath-expression into the instances of CMD profiles. In contrast to it, \code{dcrIndex} consists of just one-level term and is generally not directly applicable on existing data. It can be understood as abstract index referring to well-defined concepts -- the data categories -- and for actual search it needs to be resolved to the set of CMD elements it is referred by. In return, one can expect to match more metadata fields from multiple profiles, all referring to the same data category.
    72 
    73 It is important to note, that in general -- by design -- \code{smcIndex} can be ambiguous, meaning it can refer to multiple concepts, or CMD entities. This is due to the fact that the labels of the data categories and CMD entities are not guaranteed unique.
     69The grammar distinguishes two main types of \var{smcIndex}: a) \var{dcrIndex} referring to data categories and b) \var{cmdIndex} denoting a specific ``CMD entity'', i.e. an element (metadata field), component or whole profile defined within CMD (cf. \ref{def:CMD} for description of the CMD data model).
     70These two types of \var{smcIndex} follow different construction patterns.
      71\var{cmdIndex} has a recursive path-like structure and can be interpreted as an XPath expression into the instances of CMD profiles. In contrast, \var{dcrIndex} consists of just a one-level term and is generally not directly applicable to existing data. It can be understood as an abstract index referring to well-defined concepts -- the data categories -- and for actual search it needs to be resolved to the set of CMD elements that refer to it. In return, one can expect to match more metadata fields from multiple profiles, all referring to the same data category.
     72
      73It is important to note that in general an \var{smcIndex} can be ambiguous, meaning it can refer to multiple concepts or CMD entities. This is due to the fact that the labels of the data categories and CMD entities are not guaranteed to be unique.
     7474Although it may seem problematic and undesirable to have an ambiguous reference, this is an intentional design decision. The labels are needed for human-readability and ambiguity can be useful, as long as one is aware of it.
     7575However, there also needs to be the possibility to refer to data categories or CMD entities unambiguously. Therefore, the syntax also allows referencing indexes by the corresponding identifier. Following are some explanations of the individual constituents of the grammar:
    7676
    77 \code{dcrID} is a shortcut referring to a data category registry. Next to \xne{ISOcat}, other registries can function as a DCR, in particular, the \xne{dublincore} set of metadata terms. \code{datcatLabel} is the human-readable name of given data category (e.g. \concept{telephoneNumber}). In the case of \xne{ISOcat} data categories the verbose descriptor \code{mnemonicIdentifier} is used. However despite its name, it is not guaranteed unique. Therefore, \code{datcatID} has to be used if a data category shall be referenced unambiguously. For \xne{dublincore} terms no such distinct identifier and label exist, the concepts are denoted by the lexical term itself, which is unique within the \concept{dublincore} namespace.
    78 
    79 \code{profile} is reference to a CMD profile. Again, dealing with the ambiguity, it can be either the name of the profile \code{profileName} or its identifier \code{profileId} as issued by the Component Registry (e.g. \code{clarin.eu:cr1:p\_1272022528363} for \concept{LexicalResourceProfile}). Even if a profile is referenced by its identifier it may and should be prefixed by its name to still ensure human-readability. Or, seen the other way round, the name is disambiguated by suffixing it with the identifier:
      77\var{dcrID} is a shortcut referring to a data category registry. Next to \xne{ISOcat}, other registries can function as a DCR, in particular the \xne{dublincore} set of metadata terms. \var{datcatLabel} is the human-readable name of a given data category (e.g. \concept{telephoneNumber}). In the case of \xne{ISOcat} data categories the verbose descriptor \code{mnemonicIdentifier} is used. However, despite its name, it is not guaranteed to be unique. Therefore, \var{datcatID} has to be used if a data category shall be referenced unambiguously. For \xne{dublincore} terms no such distinct identifier and label exist; the concepts are denoted by the lexical term itself, which is unique within the \concept{dublincore} namespace.
     78
      79\var{profile} is a reference to a CMD profile. Again, it can be either the name of the profile \var{profileName} or -- for a guaranteed unambiguous reference -- its identifier \var{profileId} as issued by the Component Registry (e.g. \var{clarin.eu:cr1:p\_1272022528363} for \concept{LexicalResourceProfile}). Even if a profile is referenced by its identifier, it may and should be prefixed by its name to still ensure human-readability. Or, seen the other way round, the name is disambiguated by suffixing it with the identifier:
    8080
    8181\begin{example1}
     
    8484\end{example1}
    8585
    86 \noindent
    87 \code{dotPath} allows to address a leaf element (\concept{Session.Actor.Role}), or any intermediary XML element corresponding to a CMD component (\concept{Session.Actor}) within a metadata description. This allows to easily express search in whole components, instead of having to list all individual fields. The paths don't need to start from the root entity (the profile), they can reference any subtree structure. However longer paths are often needed for more specific references, e.g. instead of \concept{Name} one could say \concept{Actor.Name} vs. \concept{Project.Name} or even \concept{Session.Actor.Name} vs. \concept{Drama.Actor.Name}. Still this mechanism does not guarantee unique references, it only allows to narrow down the ambiguity.
     86%\noindent
      87\var{dotPath} allows addressing a leaf element (\concept{Session.Actor.Role}), or any intermediary XML element corresponding to a CMD component (\concept{Session.Actor}), within a metadata description. This makes it easy to express search over whole components, instead of having to list all individual fields. The paths do not need to start from the root entity (the profile); they can reference any subtree structure. However, longer paths are often needed for more specific references, e.g. instead of \concept{Name} one could say \concept{Actor.Name} vs. \concept{Project.Name} or even \concept{Session.Actor.Name} vs. \concept{Drama.Actor.Name}. Still, this mechanism does not guarantee unique references; it only allows specifying context and thus narrowing down the semantic ambiguity.
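
To make the distinction between the two index types more tangible, the following sketch classifies a given index string roughly along the lines of the grammar above; it is a simplification covering only the common cases, not a full implementation of the grammar.

\lstset{language=Python}
\begin{lstlisting}[label=lst:sketch-smcindex, caption={Simplified classification of an \var{smcIndex} string}]
# Simplified sketch: classify an smcIndex string as dcrIndex or cmdIndex.
# It covers only the common cases, not every production of the grammar.
import re

DCR_PREFIXES = ("isocat", "dc")  # known data category registries

def classify_smc_index(index):
    """Return ("dcrIndex", registry, term) or ("cmdIndex", dotPath)."""
    prefix, _, rest = index.partition(".")
    if prefix in DCR_PREFIXES and rest:
        return ("dcrIndex", prefix, rest)
    if re.fullmatch(r"[A-Za-z_][\w-]*(\.[A-Za-z_][\w-]*)*", index):
        return ("cmdIndex", index)
    raise ValueError("not a valid smcIndex: " + index)

print(classify_smc_index("isocat.resourceTitle"))
print(classify_smc_index("Session.Location.Country"))
\end{lstlisting}
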
    8888
    8989\subsection{Terms}
    9090\label{datamodel-terms}
    9191
    92 In abstract terms, the internal format is basically a table of indexes with information collected from the upstream registries or created during preprocessing. Main entity is \code{Term} that represents either a label of a data category, or a CMD entity (a CMD  component or element). Further entities \code{Termset} and \code{Concept} are mainly used for logical grouping of the \code{Terms}. In the following, we explain the data model of these entities and their use in more detail. For a full \xne{Terms.xsd} XML schema see listing \ref{lst:terms-schema}.
      92Here we describe the XML schema for the internal representation of the processed data.
      93In abstract terms, the internal format is basically a table with information about indexes collected from the upstream registries or created during preprocessing. \code{Term} is the main entity; it represents either a label of a data category or a CMD entity (a CMD component or element). \code{Termset} represents a logical collection of \code{Terms} (one profile or the data categories of one type). \code{Concept} represents a data category and groups all corresponding terms. \code{Relation} is used to express a relation between two \code{Concepts}. In the following, we explain the data model of these entities and their use in more detail. For the full \xne{Terms.xsd} XML schema see listing \ref{lst:terms-schema}.
    9394
    9495\subsubsection{Type \code{Term}}
     
     9697\code{Term} is a polymorphic data type that can have different sets of attributes depending on the type of data it represents.
    9798
    98 \begin{table}[ht]
     99\begin{table}[h]
    99100\caption{Attributes of \code{Term} when encoding data category}
    100101\label{table:terms-attributes-datcat}
    101  \begin{tabular}{ l | l | l }
    102   attribute & allowed values & sample value\\
     102 \begin{tabu}{ p{0.1\textwidth} p{0.4\textwidth} >{\footnotesize}X }
     103\hline
     104\rowfont{\itshape\small}   attribute & allowed values & sample value\\
    103105\hline
    104106  \var{concept-id} &  PID given by DCR  & \code{isocat:DC-2522} \\
     
    106108  \var{type} &  one of ['id', 'label', 'mnemonic'] & \code{id}, \code{label}\\
    107109 \var{xml:lang} & two-letter language code (only for ISOcat) & \code{en}, \code{si} \\
    108  \end{tabular}
     110\hline
     111 \end{tabu}
    109112\end{table}
    110113
    111 %\captionsetup{justification=raggedright, singlelinecheck=false}
    112 \lstset{language=XML}
    113 \begin{lstlisting}[label=lst:terms-attributes-datcat, caption=sample \code{Term} element encoding an ISOcat data category]
    114 <Term concept-id="http://www.isocat.org/datcat/DC-2544" set="isocat"
    115         type="label" xml:lang="fr">nom de ressource</Term>
    116 \end{lstlisting}
    117 
    118 \begin{table}[ht]
     114\begin{table}[h]
    119115\caption{Attributes of \code{Term} when encoding CMD entity}
    120116\label{table:terms-attributes-cmd}
    121 \begin{tabularx}{1\textwidth}{ l | X | X }
    122  %\begin{tabu}{1\textwidth}{ l | l | l }
    123   attribute & allowed values & sample value\\
     117 \begin{tabu}{ p{0.1\textwidth}  p{0.4\textwidth} >{\footnotesize}X }
     118\hline
     119\rowfont{\itshape\small}   attribute & allowed values & sample value\\
    124120\hline
    125121  \var{id} &  \var{cmdEntityId} as defined in \ref{def:smcIndex} & \code{clarin.eu:cr1:c\_1290431694487\#Url} \\
    126   \var{type} &  one of ['CMD\_Element', 'CMD\_Component'] & \code{CMD\_Element}\\
     122  \var{type} & {\footnotesize \code{CMD\_Element} | \code{CMD\_Component} } & \code{CMD\_Element}\\
     123  \var{datcat} &  reference to the data category, URL or \var{dcrIndex} & \code{isocat:DC-2546}\\
    127124  \var{name} & name of the component or element & \code{Url} \\
    128125  \var{path} &  \var{dotPath} (cf. \ref{def:smcIndex}) & \code{SpeechCorpus.Access.Contact.Url} \\
    129126  \var{parent} & name of the parent component &  \code{Contact} \\
    130  \end{tabularx}
     127\hline
     128 \end{tabu}
    131129\end{table}
    132130
    133 \lstset{language=XML}
    134 \begin{lstlisting}[label=lst:terms-attributes-element, caption=sample \code{Term} element encoding a CMD element]
    135 <Term type="CMD_Element" name="Url" datcat="http://www.isocat.org/datcat/DC-2546"
    136           id="clarin.eu:cr1:c_1290431694487#Url" parent="Contact"
    137           path="SpeechCorpus.Access.Contact.Url"/>
    138 \end{lstlisting}
    139 
    140 \begin{table}[ht]
    141 \caption{Attributes of \code{Term} when encoding a term in the inverted index?}
     131\begin{table}
     132\caption{Attributes of \code{Term} when encoding a CMD entity in the inverted index}
    142133\label{table:terms-attributes-index}
    143  \begin{tabularx}{1\textwidth}{ l | X | X }
    144   attribute & allowed values & sample value\\
     134 \begin{tabu}{ p{0.1\textwidth}  p{0.4\textwidth} >{\footnotesize}X }
     135\hline
     136\rowfont{\itshape\small}   attribute & allowed values & sample value\\
    145137\hline
    146138  \var{id} &  \var{cmdEntityId} cf. \ref{def:smcIndex} & \code{clarin.eu:cr1:c\_1359626292113 \#ResourceTitle} \\
    147   \var{type} &  one of \code{['id', 'mnemonic', 'label', 'full-path']} & \code{full-path}\\
      139 \var{set} & identifier of the containing termset & \code{cmd} \\
     140  \var{type} &  one of \code{full-path} or \code{min-path} & \code{full-path}\\
    148141  \var{schema}  & \var{profileID} & \code{clarin.eu:cr1:p\_1357720977520} \\ 
    149   \var{concept-id} & id of the corresponding (data category) &  \var{isocat:}\code{DC-2545} \\
     142%  \var{concept-id} & id of the corresponding (data category) &  \var{isocat:}\code{DC-2545} \\
    150143  \var{node-value} &  \var{dotPath} & \code{SpeechCorpus.Access.Contact.Url} \\
    151  \end{tabularx}
     144\hline
     145 \end{tabu}
    152146\end{table}
    153147
     148%\captionsetup{justification=raggedright, singlelinecheck=false}
     149\lstset{language=XML}
     150\begin{lstlisting}[label=lst:terms-attributes-datcat, caption=sample \code{Term} element encoding an ISOcat data category]
     151  <Term concept-id="http://www.isocat.org/datcat/DC-2544" set="isocat"
     152             type="label" xml:lang="fr">nom de ressource</Term>
     153\end{lstlisting}
     154
     155\lstset{language=XML}
     156\begin{lstlisting}[label=lst:terms-attributes-element, caption=sample \code{Term} element encoding a CMD element]
     157  <Term type="CMD_Element" name="Url" id="clarin.eu:cr1:c_1290431694487#Url"
     158             parent="Contact" datcat="http://www.isocat.org/datcat/DC-2546"
     159             path="SpeechCorpus.Access.Contact.Url"/>
     160\end{lstlisting}
     161
    154162\lstset{language=XML}
    155163\begin{lstlisting}[label=lst:terms-attributes-index, caption=sample \code{Term} element encoding a term in the inverted index]
    156    <Term set="cmd" type="full-path" schema="clarin.eu:cr1:p_1357720977520"
    157                 id="clarin.eu:cr1:c_1359626292113#ResourceTitle"
    158                 concept-id="http://www.isocat.org/datcat/DC-2545" >
     164  <Term set="cmd" type="full-path" schema="clarin.eu:cr1:p_1357720977520"
     165             id="clarin.eu:cr1:c_1359626292113#ResourceTitle"
     166             concept-id="http://www.isocat.org/datcat/DC-2545" >
    159167        AnnotatedCorpusProfile.GeneralInfo.ResourceTitle
    160    </Term>
     168  </Term>
    161169\end{lstlisting}
    162170
    163171
    164172\subsubsection{Type \code{Concept}}
    165 \code{Concept} represents a data category. Identifier is the PID issued by the DCR.
     173\code{Concept} represents a data category. Identifier is the PID issued by the DCR encoded in the \var{id} attribute.
    166174It groups all terms belonging to given data category.
    167175The content model is a sequence of \code{Terms} followed by a sequence of \code{info} elements.
    168 Initially, after loading from DCR, a \code{Concept} contains only \code{Term}s of type: \code{id, mnemonic, label} encoding the corresponding attributes of the data category, followed by \code{info} elements holding the definition potentially in different languages:
     176Initially, after loading from DCR, a \code{Concept} contains only \code{Term}s of type: \code{id, mnemonic, label} (in multiple languages) encoding the corresponding attributes of the data category, followed by \code{info} elements holding the definition (also potentially in different languages). In the inverted index, the \code{Concept} is enriched with the \code{Terms} representing corresponding CMD entities (cf. Listing \ref{lst:dcr-cmd-map}).
     177
    169178
    170179\lstset{language=XML}
    171180\begin{lstlisting}[label=lst:concept, caption=sample \code{Concept} element representing the data category \concept{resourceTitle}]
    172 <Concept xmlns:dcif="http://www.isocat.org/ns/dcif" type="datcat"
    173                id="http://www.isocat.org/datcat/DC-2545">
    174          <Term set="isocat" type="mnemonic">resourceTitle</Term>
    175          <Term set="isocat" type="id">DC-2545</Term>
    176          <Term set="isocat" type="label" xml:lang="en">resource title</Term>
    177          <Term set="isocat" type="label" xml:lang="fi">resurssin otsikko</Term>
     181  <Concept id="http://www.isocat.org/datcat/DC-2545" type="datcat">
     182    <Term set="isocat" type="mnemonic">resourceTitle</Term>
     183    <Term set="isocat" type="id">DC-2545</Term>
     184    <Term set="isocat" type="label" xml:lang="en">resource title</Term>
     185    <Term set="isocat" type="label" xml:lang="fi">resurssin otsikko</Term>
     186    ...
     187    <info xml:lang="en">The title is the complete title
     188                of the resource without any abbreviations.</info>
     189     ...
     190  </Concept>
     191\end{lstlisting}
     192
     193%\lstset{language=XML}
     194%\begin{lstlisting}[label=lst:concept-cmd-term, caption=\code{Term} for CMD element added to %\code{Concept}]
     195% <Term set="cmd" type="full-path" schema="clarin.eu:cr1:p_1345561703620"
     196%            id="clarin.eu:cr1:c_1345561703619#Name">collection.CollectionInfo.Name</Term>
     197%\end{lstlisting}
     198
     199\lstset{language=XML}
     200\begin{lstlisting}[label=lst:dcr-cmd-map, caption=Sample of the inverted index \code{Concept} $\mapsto$ \code{Term}]
     201  <Concept id="http://www.isocat.org/datcat/DC-2545" type="datcat">
     202    <Term set="isocat" type="mnemonic">resourceTitle</Term>
     203    <Term set="isocat" type="id">DC-2545</Term>
     204    <Term set="isocat" type="label" xml:lang="en">resource title</Term>
     205    <Term set="isocat" type="label" xml:lang="hr">naslov resursa</Term>
     206    <Term set="isocat" type="label" xml:lang="lv">resursa nosaukums</Term>
     207      ...
     208    <Term set="cmd" type="full-path" schema="clarin.eu:cr1:p_1357720977520"
     209            id="clarin.eu:cr1:c_1359626292113#ResourceTitle">
     210        AnnotatedCorpusProfile.GeneralInfo.ResourceTitle</Term>
     211    <Term set="cmd" type="full-path" schema="clarin.eu:cr1:p_1297242111880"
     212            id="clarin.eu:cr1:c_1271859438123#Title">
     213        AnnotationTool.GeneralInfo.Title</Term>
     214    <Term set="cmd" type="full-path" schema="clarin.eu:cr1:p_1271859438204"
     215            id="clarin.eu:cr1:c_1271859438201#Title">
     216        Session.Title</Term>
    178217        ...
    179          <info xml:lang="en">The title is the complete title
    180                         of the resource without any abbreviations.</info>
    181         ...
    182 </Concept>
    183 \end{lstlisting}
    184 
    185 In the inverted index the \code{Concept} is enriched with the \code{Terms} representing corresponding CMD entities (cf. Listing \ref{lst:concept-cmd-term}).
    186 
    187 \lstset{language=XML}
    188 \begin{lstlisting}[label=lst:concept-cmd-term, caption=\code{Term} for CMD element added to \code{Concept}]
    189  <Term set="cmd" type="full-path" schema="clarin.eu:cr1:p_1345561703620"
    190             id="clarin.eu:cr1:c_1345561703619#Name">collection.CollectionInfo.Name</Term>
    191 \end{lstlisting}
    192 
    193 \lstset{language=XML}
    194 \begin{lstlisting}[label=lst:dcr-cmd-map, caption=Sample of the inverted index \code{Concept} $\mapsto$ \code{Term}]
    195     <Concept id="http://www.isocat.org/datcat/DC-2545" type="datcat">
    196         <Term set="isocat" type="mnemonic">resourceTitle</Term>
    197         <Term set="isocat" type="id">DC-2545</Term>
    198         <Term set="isocat" type="label" xml:lang="en">resource title</Term>
    199         <Term set="isocat" type="label" xml:lang="hr">naslov resursa</Term>
    200         <Term set="isocat" type="label" xml:lang="lv">resursa nosaukums</Term>
    201         ...
    202         <Term set="cmd" type="full-path" schema="clarin.eu:cr1:p_1357720977520"
    203                 id="clarin.eu:cr1:c_1359626292113#ResourceTitle">
    204                         AnnotatedCorpusProfile.GeneralInfo.ResourceTitle</Term>
    205         <Term set="cmd" type="full-path" schema="clarin.eu:cr1:p_1297242111880"
    206                 id="clarin.eu:cr1:c_1271859438123#Title">
    207                         AnnotationTool.GeneralInfo.Title</Term>
    208         <Term set="cmd" type="full-path" schema="clarin.eu:cr1:p_1274880881885"
    209                 id="clarin.eu:cr1:c_1274880881884#Title">
    210                         imdi-corpus.Corpus.Title</Term>
    211         <Term set="cmd" type="full-path" schema="clarin.eu:cr1:p_1271859438204"
    212                 id="clarin.eu:cr1:c_1271859438201#Title">
    213                         Session.Title</Term>
    214         ...
    215     </Concept>
    216 \end{lstlisting}
    217 
     218  </Concept>
     219\end{lstlisting}
     220%    <Term set="cmd" type="full-path" schema="clarin.eu:cr1:p_1274880881885"
     221  %          id="clarin.eu:cr1:c_1274880881884#Title">
     222     %   imdi-corpus.Corpus.Title</Term>
     223
     224\subsubsection{Type \code{Relation}}
      225As explained in \ref{def:rr}, the framework allows expressing relations between concepts or data categories. These are maintained in the Relation Registry and fetched from there by SMC upon initialization. Type \code{Relation} is the internal representation of this information. It has the attribute \var{type} indicating the type of the relation as delivered by RR (currently only \code{sameAs}). The relations of one relation set are enclosed in a \code{Termset} element carrying the identifier of the relation set. The content of \code{Relation} is a sequence of at least two \code{Concepts}. Currently, it is always exactly two \code{Concepts}, corresponding to the pairs delivered from RR, but by traversing the equivalence relation, concept clusters (or ``cliques'') containing more than two equivalent concepts could be generated.
     226
     227% role="about"
      228\begin{lstlisting}[label=lst:relation, caption=Internal representation of the relation between concepts]
     229  <Relation type="sameAs">
     230    <Concept type="datcat" id="http://www.isocat.org/datcat/DC-2484"/>
     231    <Concept type="datcat" id="http://purl.org/dc/elements/1.1/language"/>
     232  </Relation>
     233\end{lstlisting}
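
The clique building mentioned here could, for instance, be realized as a simple connected-components computation over the delivered pairs; the sketch below reuses the concept identifiers from the listing above.

\lstset{language=Python}
\begin{lstlisting}[label=lst:sketch-cliques, caption={Sketch: building concept cliques from sameAs pairs}]
# Sketch: collapse pairwise sameAs relations into equivalence cliques
# by computing connected components over the concept identifiers.
from collections import defaultdict

def build_cliques(pairs):
    """Group concept identifiers related by (transitive) sameAs pairs."""
    graph = defaultdict(set)
    for a, b in pairs:
        graph[a].add(b)
        graph[b].add(a)
    seen, cliques = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        cliques.append(component)
    return cliques

pairs = [
    ("http://www.isocat.org/datcat/DC-2484",
     "http://purl.org/dc/elements/1.1/language"),
]
print(build_cliques(pairs))
\end{lstlisting}
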
    218234
    219235\subsubsection{Type \code{Termsets/Termset}}
    220 \code{Termset} groups a set of terms as outlined in \ref{table:cx-list-params}. It is identified by the \code{@set} attribute.
    221 For example all french labels of isocat data categories under the identifier \code{isocat-fr} build a termset, as well as all the full-paths of one profile.
    222 
    223 Finally, \code{Termsets} is a root element grouping \code{Termset} elements.
     236\code{Termset} groups a set of terms. (Possible termsets are listed in table \ref{table:cx-list-params}.) It is identified by the \code{@set} attribute.
      237For example, all French labels of ISOcat data categories under the identifier \code{isocat-fr} form a termset, as do all the full paths of one profile. The content of the \code{Termset} can optionally begin with an \code{info} element (conveying information as provided by the source registry, like definition, creation date or author), followed by a flat or nested list of \code{Term} elements. Finally, \code{Termsets} is the root element grouping \code{Termset} elements.
    224238
    225239\lstset{language=XML}
    226240\begin{lstlisting}[label=lst:termset, caption=\code{Termset} element representing a CMD profile]
    227 <Termset name="AnnotatedCorpusProfile" id="clarin.eu:cr1:p_1357720977520"
     241  <Termset name="AnnotatedCorpusProfile" id="clarin.eu:cr1:p_1357720977520"
    228242            type="CMD_Profile">
    229       <info>
    230          <id>clarin.eu:cr1:p_1357720977520</id>
    231          <description>A CMDI profile for annotated text corpus resources.</description>
    232          <name>AnnotatedCorpusProfile</name>
    233          <registrationDate>2013-01-31T11:57:12+00:00</registrationDate>
    234          <creatorName>nalida</creatorName>
    235           ...
    236      </info>
    237      <Term type="CMD_Component" name="GeneralInfo" datcat=""
     243    <info>
     244      <id>clarin.eu:cr1:p_1357720977520</id>
     245      <description>A CMDI profile for annotated text corpus resources.
     246      </description>
     247      <name>AnnotatedCorpusProfile</name>
     248      <registrationDate>2013-01-31T11:57:12+00:00</registrationDate>
     249      <creatorName>nalida</creatorName>
     250      ...
     251   </info>
     252   <Term type="CMD_Component" name="GeneralInfo" datcat=""
    238253            id="clarin.eu:cr1:c_1359626292113"     
    239254            parent="AnnotatedCorpusProfile"
    240255            path="AnnotatedCorpusProfile.GeneralInfo">
    241             <Term ...
     256       <Term ...
    242257     </Term>
    243258     ...
    244 </Termset>
    245 \end{lstlisting}
    246 
    247 The content of the \code{Termset} can optionally begin with an \code{info} element (conveying information as provided by the source registry, like definition, creation date or author) followed by a flat or nested list of \code{Term} elements.
    248 
     259  </Termset>
     260\end{lstlisting}
    249261
    250262%%%%%%%%%%%%%%%%%%%%%%
     
    255267Consequently, the infrastructure has also foreseen this dedicated module, \emph{Semantic Mapping}, that exploits this mechanism to find \textbf{corresponding fields in different metadata schemas}.
    256268
    257 The task of the crosswalk service is to collect the relevant information maintained in the registries of the infrastructure and process it to generate the mappings, or \textbf{crosswalks} between fields in heterogeneous metadata schemas. These crosswalks can be used by other applications building the base for concept-based search in the heterogeneous data collection of the joint CLARIN metadata domain (cf. \ref{sec:qx}).
    258 
    259 The core means for semantic interoperability in CMDI are the \emph{data categories} (cf. \ref{def:DCR}), well-defined atomic concepts, that are supposed to be referenced in schemas by annotating fields to unambiguously indicate their intended semantics. Drawing upon this system, the crosswalks are not generated directly between the fields of individual schemas by some kind of matching algorithm, but rather the data categories are used as reliable bridges for translation. This results in clusters of semantically equivalent metadata fields (with data categories serving as pivotal points), instead of a collection of pair-wise links between fields.
      269The task of the crosswalk service is to collect the relevant information maintained in the registries of the infrastructure and process it to generate the mappings, or \textbf{crosswalks}, between fields in heterogeneous metadata schemas. These crosswalks can be used by other applications and represent the basis for concept-based search in the heterogeneous data collection of the joint CLARIN metadata domain (cf. \ref{sec:qx}).
     270
      271The core means for semantic interoperability in CMDI are the \emph{data categories} (cf. \ref{def:DCR}), well-defined atomic concepts that are supposed to be referenced in schemas by annotating fields to unambiguously indicate their intended semantics. Drawing upon this system, the crosswalks are not generated directly between the fields of individual schemas by some kind of matching algorithm (cf. \ref{lit:schema-matching}); rather, the data categories are used as reliable bridges for translation. This results in clusters of semantically equivalent metadata fields (with data categories serving as pivotal points) instead of a collection of pair-wise links between fields.
    260272
    261273\subsection{Interface Specification}
     
    264276In this section, we define the abstract interface of the proposed service, in terms of the input parameters and output data format.
    265277
    266 \todoin{The two interfaces list and map
    267 Full definition in appendix and under link!}
    268 
    269 \subsubsection*{Method \code{list}}
    270 
    271 Method \code{list} lists available items for given context or type. This allows the client applications to configure the query input  and provide autocompletion functionality.
    272 
    273 \begin{definition}{URI-pattern of the \code{list} method}
     278%\todoin{The two interfaces list and map Full definition in appendix and under link!}
     279
     280\subsubsection*{Method \var{list}}
     281
      282Method \var{list} lists the available items for a given context or type. This allows the client applications to configure the query input and provide autocompletion functionality. Table \ref{table:cx-list-params} lists the accepted values for the \var{\$context} parameter and the corresponding types of returned data.
     283
     284\begin{definition}{URI-pattern of the \var{list} method}\label{def:list-method}
    274285/smc/cx/list/\$context
    275286\end{definition}
    276 
    277 \noindent
    278 Table \ref{table:cx-list-params} lists the allowed values for the \var{\$context} parameter and the corresponding types of returned data
    279287
    280288\begin{table}
    281289\caption{Allowed values for parameters of the \code{list}-method and corresponding return values}
    282290\label{table:cx-list-params}
    283  \begin{tabular}{ l | p{0.7\textwidth} }
    284   \var{\$context}  & returns a list of \\
    285  \hline
     291% \begin{tabular}{ l | p{0.7\textwidth} }
     292%  \var{\$context}  & returns a list of \\
     293 \begin{tabu}{ l p{0.7\textwidth} }
     294\hline
     295\rowfont{\itshape\small} \$context & returns a list of \\
     296\hline
    286297  \code{*,top} & available termsets \\
    287298  \var{\{termset\}} & terms (CMD components and elements) of given termset \\
     
    292303  \code{cmd-full-paths} & all complete (starting from Profile) \emph{dotPaths} to CMD components and elements\\
    293304  \code{cmd-minimal-paths} & reduced but still unique paths to CMD components and elements \\
    294   \code{relsets} & available relation sets (defined in the Relation Registry)
    295  \end{tabular}
     305  \code{relsets} & available relation sets (defined in the Relation Registry) \\
     306\hline
     307\end{tabu}
    296308\end{table}
    297309
    298  Also the application should deliver additional information about the indexes like description and a link to the definition of the underlying entity in the source registry.
      310\subsubsection*{Method \var{explain}}
      311The service also has to deliver additional information about the indexes, such as a description and a link to the definition of the underlying entity in the source registry.
     312
     313\begin{definition}{URI-pattern of the \code{explain} method}\label{def:explain-method}
     314/smc/cx/explain/\{\$context\} \ [ \ /\{\$term\} \ ] \ [ \ ?format=\$format \ ] \ [ \ ?lang=\$lang \ ]
     315\end{definition}
     316
     317\begin{example1}
     318/smc/cx/explain/cmd/clarin.eu:cr1:p\_1357720977520 \\
     319/smc/cx/explain/isocat/DC-2506?lang=et,pt
     320\end{example1}
     321
     322\lstset{extendedchars=false,
     323escapeinside='', language=XML}
     324\begin{lstlisting}[label=lst:sample-explain, caption=Sample output of the \var{explain} function for a data category]
     325  <Concept type="datcat" id="http://www.isocat.org/datcat/DC-2506">
     326    <Term set="isocat" type="mnemonic">annotationMode</Term>
     327    <Term set="isocat" type="id">DC-2506</Term>
      328    <Term set="isocat" type="label" xml:lang="et">m'\"{a}'rgendusviis</Term>
     329    <Term set="isocat" type="label" xml:lang="pt">modo de anota'çã'o</Term>
      330    <info xml:lang="et">N'\"{a}'itab, kas ressurss m'\"{a}'rgendati
      331                                  k'\"{a}'sitsi v'\~{o}'i automaatselt.</info>
     332    <info xml:lang="pt">Indica se o recurso foi criado manualmente
     333                                  ou por processo autom'á'tico.</info>
     334</Concept>
     335\end{lstlisting}
     336
    299337%NO (this will be handled by the servic as multililngual labels e) : or the Name-attribute (in any available translation, e.g. \texttt{numero di telefono@it}) of the data category.}
    300338% While it is desirable to also allow the Name-attribute of the data category (\texttt{telephone number}), especially also the Names defined in other working languages (\texttt{numero di telefono@it, numer telefonu@pl}), special care has to be taken here as these attributes mostly contain white spaces, which could cause problems in downstream components, when parsing a complex query containing such indices.
    301339
    302340
    303 \subsubsection*{Method \code{map} }
    304 
    305 Method \code{map} performs the actual translations:
     341\subsubsection*{Method \var{map} }
     342
     343Method \var{map} performs the actual translations:
    306344it accepts any index (adhering to the \var{smcIndex} datatype, cf. \ref{def:smcIndex}) and returns a list of corresponding indexes.
    307345%it returns list of equivalent terms/smcIndexes for a given term/smcIndex.
    308346
    309 \begin{definition}{General function definition}
    310 smcIndex \mapsto smcIndex[ ]
     347\begin{definition}{General function definition}\label{def:map-method-general}
     348smcIndex \mapsto smcIndex*
    311349\end{definition}
    312350
    313 \begin{definition}{URI-pattern of the \code{map} method}
     351\begin{definition}{URI-pattern of the \var{map} method}
    314352/smc/cx/map/\{\$context\}/\{\$term\} \ [ \ ?format=\{\$format\} \ ] \ [ \ \&relset=\{\$relset\} \ ]
    315353\end{definition}
    316354
    317355\noindent
    318 Parameter definition:\\*
     356Parameter definition:
    319357\begin{description}
    320 \item[\var{\$context}] identifies the context to search in for the \var{\$term}, primarily this would be one of \code{[*, isocat, dc, cmd]}, in extended mode any of terms listed in table \ref{table:cx-list-params} is accepted
      358\item[\var{\$context}] identifies the context in which to search for the \var{\$term}; primarily this is one of \code{[*, isocat, dc, cmd]}, in extended mode any of the terms listed in table \ref{table:cx-list-params} is accepted
     321359\item[\var{\$term}] \var{smcIndex} term (without the context prefix); the term is used to look up a concept and deliver the list of equivalent indexes; case-insensitive
     322360\item[\var{\$format}] the desired result format can be indicated explicitly, as an alternative to the default content negotiation; one of \code{[json, rdf, xml]}; \code{xml} is the default
    323 \item[\var{\$relset}] optional; reference to a relset to be applied on the identified concept to expand the cluster of equivalent ; allows multiple values from \code{list/relsets}; if multiple sets are they are all applied in the expansion
     361\item[\var{\$relset}] optional; reference to a relation set to be combined with the identified concept to expand the cluster of matching concepts; allows multiple values from \code{list/relsets}; if multiple sets are listed they are all applied in the expansion
    324362\end{description}
    325363
     
    327365Possible return formats:
    328366\begin{description}
    329 \item[\var{'', default}] internal XML format with all attributes (\xne{Terms.xsd}, cf. listing \ref{lst:map-output})
    330 
    331 
     367\item[\var{default}] internal XML format with all attributes (\xne{Terms.xsd}, cf. listing \ref{lst:map-output})
    332368\item[\var{schema}] distinct schemas (\code{Termset}) referencing given data category or string
    333369\lstset{language=XML}
     
    335371<Termset schema="clarin.eu:cr1:p_1295178776924" name="serviceDescription"/>
    336372\end{lstlisting}
    337 \item[\var{datcat}] distinct data categories (\code{Term@id@da}) by \code{@concept-id}
     373\item[\var{datcat}] distinct data categories, by grouping the \code{Term@datcat} attribute of the matching terms 
    338374\lstset{language=XML}
    339375\begin{lstlisting}
     
    341377           set="isocat" type="datcat">creatorFullName</Term>
    342378\end{lstlisting}
    343 \item[\var{cmdid, id}] distinct cmd entities (\code{Term}) by \code{@id}
     379\item[\var{cmdid, id}] distinct cmd entities grouped by \code{@id}
    344380\begin{lstlisting}
    345381<Term type="CMD_Element" name="Name" elem="Name" parent="Session"
     
    350386\end{description}
    351387
    352 \begin{table}[ht]
    353 \caption{Sample values for parameters of the \code{map}-method and corresponding return values}
    354 \label{table:cx-map-params}
    355 
    356  \begin{tabular}{ l  l | l}
    357   \var{\$context}  & \var{\$term} & returns \\
    358  \hline
    359   \code{*} & \code{name} & ? \\
    360   \code{isocat} & \code{resourceTitle} & CMD terms \\
    361   \code{cmd} & \code{name} & \\
    362 
    363  \end{tabular}
    364 \end{table}
    365 
    366388\noindent
    367389Sample request\\*
     
    371393\lstset{language=XML}
    372394\begin{lstlisting}[label=lst:map-output, caption=Corresponding sample output ]
    373 <Terms >
     395<Termset>
    374396    <Term set="cmd" type="full-path" schema="clarin.eu:cr1:p_1297242111880"
    375         id="clarin.eu:cr1:c_1271859438123#Title">
    376                 AnnotationTool.GeneralInfo.Title</Term>
     397                id="clarin.eu:cr1:c_1271859438123#Title">
     398            AnnotationTool.GeneralInfo.Title</Term>
    377399    <Term set="cmd" type="full-path" schema="clarin.eu:cr1:p_1288172614014"
    378         id="clarin.eu:cr1:c_1288172614011#resourceTitle">
    379                 BamdesLexicalResource.BamdesCommonFields.resourceTitle
     400                id="clarin.eu:cr1:c_1288172614011#resourceTitle">
     401            BamdesLexicalResource.BamdesCommonFields.resourceTitle
    380402     </Term>
    381403   <Term set="cmd" type="full-path" schema="clarin.eu:cr1:p_1274880881885"
    382         id="clarin.eu:cr1:c_1274880881884#Title">
    383                 imdi-corpus.Corpus.Title</Term>
     404                id="clarin.eu:cr1:c_1274880881884#Title">
     405            imdi-corpus.Corpus.Title</Term>
    384406   <Term set="cmd" type="full-path" schema="clarin.eu:cr1:p_1271859438204"
    385         id="clarin.eu:cr1:c_1271859438201#Title">
    386                 Session.Title</Term>
     407                id="clarin.eu:cr1:c_1271859438201#Title">
     408            Session.Title</Term>
    387409   <Term set="cmd" type="full-path" schema="clarin.eu:cr1:p_1272022528363"
    388         id="clarin.eu:cr1:c_1271859438123#Title">
    389                 LexicalResourceProfile.LexicalResource.GeneralInfo.Title</Term>
     410                id="clarin.eu:cr1:c_1271859438123#Title">
     411            LexicalResourceProfile.LexicalResource.GeneralInfo.Title</Term>
    390412    <Term set="cmd" type="full-path" schema="clarin.eu:cr1:p_1284723009187"
    391         id="clarin.eu:cr1:c_1271859438123#Title">collection.GeneralInfo.Title</Term>
     413                id="clarin.eu:cr1:c_1271859438123#Title">
     414            collection.GeneralInfo.Title</Term>
    392415\end{lstlisting}
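
From a client application, such a request might be issued as follows; the service base URL is an assumption and has to be replaced by the address of the actual deployment.

\lstset{language=Python}
\begin{lstlisting}[label=lst:sketch-map-client, caption={Sketch of a client call to the \var{map} method}]
# Sketch of a client call to the map method; the base URL is an
# assumption and has to be replaced by the actual service address.
import xml.etree.ElementTree as ET
import urllib.request

BASE_URL = "http://localhost:8080/smc/cx"  # assumed deployment address

def map_index(context, term, fmt="xml"):
    url = "%s/map/%s/%s?format=%s" % (BASE_URL, context, term, fmt)
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")

# Expand a data category reference into the corresponding CMD paths.
result = map_index("isocat", "resourceTitle")
for term in ET.fromstring(result).iter("Term"):
    print(term.get("schema"), (term.text or "").strip())
\end{lstlisting}
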
    393416
     
    420443
    421444\noindent
    422 (3) \emph{container data categories} -- further expansions will be possible once the \emph{container data categories} \cite{SchuurmanWindhouwer2011} will be used.\footnote{Although metadata modellers are encouraged to indicate data categories for both components and element, this is taking up only slowly and currently only around 14 per cent of the components have a data category specified.} The idea is to set a concept link also for the components, meaning that besides the ``atomic'' data category for \concept{actorName}, there would be also a data category for the complex concept \concept{Actor}.
      445(3) \emph{container data categories} -- further expansions will be possible once the \emph{container data categories} \cite{SchuurmanWindhouwer2011} are used.\footnote{Although metadata modellers are encouraged to indicate data categories for both components and elements, this is being taken up only slowly and currently only around 14 per cent of the components have a data category specified.} The idea is to set a concept link also for the components, meaning that besides the ``atomic'' data category for \concept{actorName}, there would also be a data category for the complex concept \concept{Actor}.
    423446Having concept links also on components will require a compositional approach for the mapping function, resulting in:
    424447\begin{example2}
     
    429452\subsection{Implementation}
    430453
    431 The core functionality  of the SMC is implemented as a set of XSL-stylesheets
    432 
    433454At the core of the described module is a set of XSL-stylesheets, governed by an ant-build file and a configuration file holding the information about individual source registries.
    434 
    435 \todoin{generate and reference XSLT-documentation}
     455The documentation of the XSLT stylesheets and the build process is found in appendix \ref{sec:smc-xsl-docs}.
    436456
     437457The service is implemented as a RESTful service; however, it only supports the GET operation, as it operates on a data set that the users cannot change directly. (The changes have to be performed in the upstream registries.)
     
    440460\subsubsection{Initialization}
    441461\label{smc_init}
    442 During initialization the application fetches the information from the source modules (cf. \ref{def:CMDI}) and transforms it into the internal \xne{Terms} format (cf. \ref{datamodel-terms}). All profiles and components from the Component Registry are read and all the URIs to data categories are extracted to construct an inverted map of data categories:
    443 
    444 \begin{definition}{Principal structure of the inverted index}
    445 datcatURI \mapsto profile.component.element[]
      462During initialization the application fetches the information from the source modules (cf. \ref{def:CMDI}) and transforms it into the internal \xne{Terms} format (cf. \ref{datamodel-terms}). All profiles and components from the Component Registry are read and all the URIs to data categories are extracted to construct an inverted map of data categories (cf. \ref{def:inverted-index}).
     463
     464\begin{definition}{Principal structure of the inverted index}\label{def:inverted-index}
     465datcatPID \mapsto profile.component.element*
    446466\end{definition}
    447467
    448468The collected data categories are enriched with information from corresponding registries (DCRs), adding the label, the description and available translations into other working languages. %, usable as base for multi-lingual search user-interface.
    449 
    450469Finally, relation sets defined in the Relation Registry are fetched and matched with the data categories in the map to create sets of semantically equivalent (or otherwise related) data categories.
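
A much reduced sketch of the construction of the inverted map is given below; the input is a minimal stand-in for the internal \xne{Terms} representation (cf. \ref{datamodel-terms}), using the attributes \code{@datcat} and \code{@path} shown earlier in this chapter.

\lstset{language=Python}
\begin{lstlisting}[label=lst:sketch-inverted-index, caption={Sketch: building the inverted map from \code{Term} elements}]
# Sketch: build the inverted map datcatPID -> [dotPath, ...] from a
# flat list of Term elements carrying @datcat and @path attributes.
import xml.etree.ElementTree as ET
from collections import defaultdict

TERMS = """
<Termset name="SpeechCorpus" type="CMD_Profile">
  <Term type="CMD_Element" name="Url"
        datcat="http://www.isocat.org/datcat/DC-2546"
        path="SpeechCorpus.Access.Contact.Url"/>
  <Term type="CMD_Element" name="Description"
        datcat="" path="SpeechCorpus.GeneralInfo.Description"/>
</Termset>
"""

def build_inverted_map(termset_xml):
    inverted = defaultdict(list)
    for term in ET.fromstring(termset_xml).iter("Term"):
        datcat = term.get("datcat")
        if datcat:  # only terms with a concept link contribute
            inverted[datcat].append(term.get("path"))
    return inverted

print(dict(build_inverted_map(TERMS)))
\end{lstlisting}
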
    451470
    452 \begin{figure*}[!ht]
     471\begin{figure*}
    453472\includegraphics[width=1\textwidth]{images/smc_init.png}
    454473\caption{The various stages of the data flow during the initialization}
     
    461480\item[\xne{cmd-terms}] a flat list of \code{Term} elements representing all components and elements in all known profiles; grouped in \code{Termset} elements representing the profiles
    462481\item[\xne{cmd-terms-nested}] as above, however the \code{Term} elements are nested reflecting the component structure in the profile
    463 \item[\xne{dcr-terms}] a list of \code{Concept} elements representing the data categories with nested \code{Term} elements
      482\item[\xne{dcr-terms}] a list of \code{Concept} elements representing the data categories, with nested \code{Term} elements encoding their properties (\code{id}, \code{label}, ...)
    464483\item[\xne{dcr-cmd-map}] the main inverted index -- a list of concepts as in \xne{dcr-terms}, but with additional \code{Term} elements included in the \code{Concept} elements representing the CMD components or elements corresponding to given data category (cf. listing \ref{lst:dcr-cmd-map})
     465484\item[\xne{rr-terms}] additional index generated based on the relations between data categories as defined in the Relation Registry; the \code{Concept} elements representing a pair of related data categories are wrapped in a \code{Relation} element (with a \code{@type} attribute)
     
    467486
    468487\subsubsection{Operation}
    469 For the actual service operation a minimal application has been implemented, that accesses the cached internal datasets and optionally applies XSL-stylesheets for post-processing depending on requested format.
    470 The application implements the interface as defined in \ref{def:cx-interface} as a XQuery module based on the \xne{restxq}-library within a \xne{eXist} XML-database.
      488For the actual service operation a minimal application has been implemented that accesses the cached internal datasets and optionally applies XSL stylesheets for post-processing, depending on the requested format.
      489The application implements the interface as defined in \ref{def:cx-interface} as an XQuery module based on the \xne{restxq} library within an \xne{eXist} XML database.
    471490
    472491\subsection{Extensions}
     
     474493Once there are overlapping\footnote{i.e. different relations may be defined for one data category in different relation sets} user-defined relation sets in the Relation Registry, an additional input parameter will be required to \emph{explicitly restrict the selection of relation sets} to apply in the mapping function.
    475494
    476 Also, use of \emph{other than equivalency} relations will necessitate more complex logic in the query expansion and accordingly also more complex response of the crosswalk service, either returning the relation types themselves as well or equip the list of indexes with some kind of similarity ratio.
      495Also, the use of \emph{other than equivalence} relations will necessitate more complex logic in the query expansion and accordingly also a more complex response of the crosswalk service, either returning the relation types themselves as well or equipping the list of indexes with some kind of similarity ratio.
    477496
    478497\section{qx -- concept-based search}
    479498\label{sec:qx}
    480499To recall, the main goal of this work is to enhance the search capabilities of the search engines serving the metadata.
    481 In this section we want to explore, how this shall be accomplished, i.e. how to bring the enhanced capabilities to the user.
     500In this section we want to explore how this shall be accomplished, i.e. how to bring the enhanced capabilities to the user.
    482501
     484503The emphasis lies on the query language and the corresponding query input interface. A crucial aspect is the question of how to integrate the additional processing, i.e. how to deal with the even greater amount of information in a user-friendly way without overwhelming the user, while still being verbose on demand about the applied processing, so that the user can understand how the result came about and, even more importantly, can manipulate the processing easily.
    484503
    485 Note, that \emph{query expansion} yet needs to distinguished from \emph{query translation}, a task to express input query in another query language (e.g. CQL query expressed as XPath).
    486 
    487 Note, also that this chapter deals only with the schema-level, i.e. the expansion here pertains only to the indexes to be searched in, not to the search terms. The corresponding instance level is tackled in \ref{semantic-search}.
      504Note that this chapter deals only with the schema level, i.e. the expansion here pertains only to the indexes to be searched in, not to the search terms. The instance level is dealt with in \ref{semantic-search}.
      505
      506Note also that \emph{query expansion} needs to be distinguished from \emph{query translation}, the task of expressing an input query in another query language (e.g. a CQL query expressed as XPath).
    488507
    489508\subsection{Query language}
     509\label{cql}
      510As the base query language to build upon, the \emph{Context Query Language} (CQL) is used, a well-established standard designed with extensibility in mind.
      511CQL is the query language defined as part of \xne{SRU/CQL} -- the communication protocol introduced by the Library of Congress. SRU is a simplified, XML- and HTTP-based successor to Z39.50 \cite{Lynch1991}, which is very widespread in the library networks.
      512It was introduced in 2002 \cite{Morgan04}. The maintenance of SRU/CQL has been
      513transferred from LoC to OASIS in 2012, and OASIS released a first version of the protocol as a Committee Specification in April 2012 \cite{OASIS2012sru}.
     514
     515Coming from the libraries world, the protocol has a certain bias in favor of bibliographic metadata.
     516However, the protocol is defined in a very generic way, with a strong focus on extensibility.
     517It is equally suitable for content search.
     518\begin{comment}
     519The protocol part (SRU) defines three major operations:
     5201) \emph{explain}: in which the target repository announces its particular configuration (e.g. available indices),
     5212) \emph{scan}:  informing about terms available in/for given index, and
     5223) \emph{searchRetrieve}: returning a search result based on a CQL query.
     523\end{comment}
     524
      525The query language part (CQL -- Context Query Language) defines a relatively complex and complete query language.
      526Its decisive feature is its inherent extensibility, allowing custom indexes and operators to be defined.
      527In particular, CQL introduces so-called \emph{context sets} -- a kind of application profile that allows new indexes or even comparison operators to be defined in separate namespaces. This feature can be employed to integrate the dynamic indexes adhering to the \var{smcIndex} syntax as proposed in \ref{def:smcIndex}.
     528
      529The SRU/CQL protocol has also been adopted by the CLARIN community as the basis for a protocol for federated content search\furl{http://clarin.eu/fcs} (FCS) \cite{stehouwer2012fcs}, which is another argument for using this protocol for metadata search as well, given the inherent interrelation between metadata and content search.
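
To illustrate how such context-set qualified indexes could be handled on the implementation side, the following small Python sketch splits an index of the form \code{prefix.name} into its context set prefix and the index name proper. The \code{cmd} prefix and the index names are hypothetical examples, not the actual \var{smcIndex} syntax defined in \ref{def:smcIndex}.

\begin{lstlisting}[language=Python]
# Sketch: split a context-set qualified index into prefix and index name.
# The "cmd" prefix and the index names are hypothetical examples.
KNOWN_CONTEXT_SETS = {"cql", "dc", "cmd"}

def parse_index(qualified: str):
    """Return (context_set, index) for an index like 'cmd.languageName'."""
    prefix, sep, name = qualified.partition(".")
    if sep == "" or prefix not in KNOWN_CONTEXT_SETS:
        # no (known) prefix: treat the whole string as a plain index name
        return None, qualified
    return prefix, name

print(parse_index("cmd.languageName"))  # ('cmd', 'languageName')
print(parse_index("title"))             # (None, 'title')
\end{lstlisting}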
    491530
    492531\subsection{Query Expansion}
     
    501540\end{example1}
    502541
    503 \noindent
     542%\begin{note}
    504543As an alternative to the -- potentially costly -- on-the-fly expansion, the concept-based equivalence clusters could be applied already during the indexing of the data. That means that ``virtual'' search indexes are defined for individual data categories, in which the values from all metadata fields annotated with the given data category are indexed. Indeed, this approach is already being applied in the search applications VLO and Meertens Institute Search Engine (cf. \ref{cmdi_exploitation}).
     544%\end{note}
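
A minimal sketch of this indexing-time alternative, assuming a simple mapping from metadata fields to the data categories they are annotated with (all field names and data category identifiers below are purely illustrative):

\begin{lstlisting}[language=Python]
# Sketch: add "virtual" per-data-category fields to a record at indexing time.
# FIELD2DATCAT stands in for the information derived from the CMD profiles.
FIELD2DATCAT = {
    "Corpus.Language.Name": "datcat:languageName",
    "teiHeader.language": "datcat:languageName",
    "teiHeader.author": "datcat:creator",
}

def add_virtual_fields(record: dict) -> dict:
    """Return the record enriched with one field per referenced data category."""
    enriched = dict(record)
    for field, value in record.items():
        datcat = FIELD2DATCAT.get(field)
        if datcat:
            enriched.setdefault(datcat, []).append(value)
    return enriched

print(add_virtual_fields({"teiHeader.language": "German"}))
# {'teiHeader.language': 'German', 'datcat:languageName': ['German']}
\end{lstlisting}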
    505545
    506546\subsection{SMC as module for Metadata Repository}
     
    508548As a concrete proof of concept, the functionality of SMC has been integrated into the Metadata Repository, another module of the CMDI, which provides all the metadata records harvested within the CLARIN joint metadata domain (cf. \ref{cmdi_exploitation}).
    509549
    510 Metadata repository is implemented in xquery running within the eXist XML-database as a web application.
    511 
    512 There is also a XQuery implementation, that is integrated as a module of the SADE/cr-xq - eXist-based web application framework for publishing resources, on which the Metadata Repository is running.
    513 
    514 
    515 \begin{figure*}[!ht]
      550The Metadata Repository itself is implemented as a custom project within \xne{cr-xq}, a generic web application developed in XQuery, running within the eXist XML database. \xne{cr-xq} is developed by the author as part of a larger publication framework, \xne{corpus\_shell}. As can be seen in figure \ref{fig:modules-mdrepo}, within \xne{cr-xq} the crosswalk service -- implemented as the \xne{smc-xq} module -- is used by the search module \xne{fcs}, which is in turn used by the \xne{query\_input} module, which provides a user interface widget for formulating the query.
     551
     552\begin{figure*}
     553\begin{center}
    516554\includegraphics[width=0.8\textwidth]{images/modules_mdrepo-smc.png}
    517555\caption{The component view on the SMC - modules and their inter-dependencies}
    518556\label{fig:modules-mdrepo}
     557\end{center}
    519558\end{figure*}
    520559
     
    522561\subsection{User Interface}
    523562
    524 A starting point for our considerations is the traditional structure found in many (advanced) search interface, which is basically a an array of index - term pairs, or in more advanced alternatives: tuples of index, comparison operator, term and boolean operator:
      563A starting point for our considerations is the traditional structure found in many (``advanced'') search interfaces, which is basically an array of tuples of index, comparison operator and term, combined by boolean operators. This is reflected in the CQL syntax with the basic \var{searchClause} and the boolean operators for formulating more complex queries.
    525564\begin{definition}{Generic data format for structured queries}
    526  [ < index, operation, term, boolean > ]
     565 < index, operation, term, boolean >+
    527566\end{definition}
    528567
    529 \noindent
    530 This maps trivially to the main clause of the CQL syntax, the \var{searchClause} \ref{def:searchClause}.
    531568% {Basic clause of the CQL syntax}
    532 \begin{definition}{The main clause of the CQL syntax, the \code{searchClause}}
     569\begin{definition}{The basic \code{searchClause} of the CQL syntax}
    533570\label{def:searchClause}
    534571searchClause \ ::= \ index \ relation \ searchTerm
    535572\end{definition}
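
The correspondence between the generic tuple structure above and CQL can be made concrete with a small sketch that serializes a list of such tuples into a query string (a simplified rendering that ignores escaping and operator precedence; the index names are invented):

\begin{lstlisting}[language=Python]
# Sketch: serialize <index, operation, term, boolean> tuples into a CQL query.
def to_cql(rows):
    query = ""
    for index, operation, term, boolean in rows:
        clause = '%s %s "%s"' % (index, operation, term)
        query = clause if not query else "%s %s %s" % (query, boolean, clause)
    return query

rows = [
    ("cmd.languageName", "=", "German", None),  # first row needs no boolean
    ("cmd.creator", "=", "Grimm", "and"),
]
print(to_cql(rows))  # cmd.languageName = "German" and cmd.creator = "Grimm"
\end{lstlisting}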
    536573
    537 \noindent
    538 An alternative would be a smart parsing input field with contextual autocomplete. Though such a widget would still share the underlying data model.
    539 
    540 \begin{figure*}[!ht]
     574\begin{figure*}
     575\begin{center}
    541576\includegraphics[width=0.8\textwidth]{images/query_input_autocomplete_term.png}
    542577\caption{A proposed query input interface offering concepts as search indexes}
    543578\label{fig:query_input}
     579\end{center}
    544580\end{figure*}
    545581
    546582\noindent
    547583Using data categories from ISOcat as search indexes brings about -- next to solid semantic grounding -- the advantage of multilingual labels and descriptions/definitions.
    548 
    549 A fundementally different approach is the "content first" paradigm, that, similiar to the notorious simple search fields found in general search engines, provides suggestions via autocompletion on the fly, when the user starts typing any string. The difference is, that the suggestions are typed, so that the user is informed from which index given term comes (\concept{person}, \concept{place}, etc.)
    550 
    551 Although we concentrate on query input, the use of indexes has to be consistent across, be it in labeling the fields of the results, or when providing facets to drill down the search.
    552 
    553 
     584Although we concentrate on query input, the use of indexes has to be consistent across the user interface, be it in labeling the fields of the results, or when providing facets to drill down the search.
     585
     586A fundamentally different approach is the "content first" paradigm, that, similiar to the notorious simple search fields found in general search engines, provides suggestions via autocompletion on the fly, when the user starts typing any string. The difference is, that the suggestions are typed, so that the user is informed from which index given term comes (\concept{person}, \concept{place}, etc.)
     587
      588Combining the two approaches, we could arrive at a ``smart'' widget: an input field with on-the-fly query parsing and contextual autocomplete. Even such a widget would, however, still share the underlying data model of \xne{CQL} in combination with \var{smcIndexes}.
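
A typed-suggestion lookup of the kind described above can be sketched in a few lines: each candidate term is stored together with the index it originates from, so that the widget can display both (the indexes and terms are invented sample data):

\begin{lstlisting}[language=Python]
# Sketch: "content first" autocompletion returning typed suggestions.
TERMS = {
    "person": ["Grimm, Jacob", "Grimm, Wilhelm"],
    "place": ["Vienna", "Villach"],
}

def suggest(prefix: str, limit: int = 10):
    """Return (term, index) pairs whose term starts with the typed prefix."""
    prefix = prefix.lower()
    hits = [(term, index) for index, terms in TERMS.items()
            for term in terms if term.lower().startswith(prefix)]
    return sorted(hits)[:limit]

print(suggest("vi"))  # [('Vienna', 'place'), ('Villach', 'place')]
\end{lstlisting}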
     589
     590
     591%%%%%%%%%%%%%%%%%%%%%%%%%%
    554592\section{SMC Browser}
    555593\label{smc-browser}
     
    597635\includegraphics[width=1\textwidth]{images/smc-browser_UIsketch.png}
    598636\end{center}
    599 \caption{A sketch of a possible layout for the SMC Browser -- individual parts of the user interface}
     637\caption{A sketch of a possible layout for the SMC Browser -- individual parts of the user interface and the update dependencies}
    600638\label{fig:smc-browser_sketch}
    601639\end{figure*}
     
    604642Prospective parts of the application layout (cf. figure \ref{fig:smc-browser_sketch}):
    605643\begin{description}
    606 \item[index panel] list of all available nodes (profiles, components, elements, data categories); allows to select nodes to be displayed in the graph pane
     644\item[index pane] list of all available nodes (profiles, components, elements, data categories); allows to select nodes to be displayed in the graph pane
    607645\item[main graph pane] displays the selected subgraph, needs as much space as possible
    608646\item[graph navigation bar] for manipulation of the displayed graph by various means
    609647\item[detail view] displaying definition and statistical information for selected nodes
    610648\item[statistics] a separate view on the data listing the statistical information for whole dataset in tables
     649\item[notifications] a widget to provide feedback about the system status to the user
    611650\end{description}
    612651
     
    634673\item[profiles + datcats + datcats + groups + rr]
    635674        as above but again with profile-groups and relations
    636 \item[only profiles]
     675\item[profiles similarity]
    637676       just profiles with links between them representing the degree of similarity based on the reuse of components and data categories
    638677\end{description}
     
    692731
    693732%%%%%%%%%%%%%%%%%%%%%%%%%
    694 \section{Application of Schema Matching techniques in SMC}
     733\section{Application of \emph{schema matching} techniques in SMC}
    695734\label{sec:schema-matching-app}
    696735
    697736Even though the described module is about ``semantic mapping'',  until now  we did not directly make use of the traditional ontology/schema mapping/alignment methods and tools as summarized in \ref{lit:schema-matching}. This is due
    698 to the fact that the in this work we can harness the mechanisms of the semantic interoperability layer built into the core of the CMD Infrastructure, which integrates the task of identifying semantic correspondences directly into the process of schema creation,
     737to the fact that in this work we can harness the mechanisms of the semantic interoperability layer built into the core of the CMD Infrastructure, which integrates the task of identifying semantic correspondences directly into the process of schema creation,
    699738to a high degree obviating the need for complex a posteriori schema matching/mapping techniques.
    700 Or put in terms of the schema matching methodology, the system relies on explicitely set concept equivalences as base for mapping between schema entities. By referencing a data category in a CMD element, the modeller binds this element to a concept, making two elements linked to the same data category trivially equivalent.
      739Or, put in terms of the schema matching methodology, the system relies on explicitly asserted concept equivalences as the basis for mapping between schema entities. By referencing a data category in a CMD element, the modeller binds this element to a concept, making two elements linked to the same data category trivially equivalent.
    701740
    702741However, this only holds for schemas already created within the CMD framework (and even for these only to a certain degree, as will be explained later). Given the growing universe of definitions (data categories and components) in the CMD framework, the metadata modeller could very well profit from applying schema mapping techniques as a pre-processing step in the task of integrating existing external schemas into the infrastructure. (User involvement is identified by \cite{shvaiko2012ontology} as one of the promising future challenges to ontology matching.) Already now, we witness a growing proliferation of components in the Component Registry and of data categories in the Data Category Registry.
    703742
    704743Let us restate the problem of integrating existing external schemas as an application of the \var{schema matching} method:
    705 The data modeller starts off with existing schema \var{$S_{x}$}. The system accomodates a set of schemas\footnote{We talk of schema even though the creation (and also remodelling) takes place in the component registry by creating CMD profiles and components, because every profile has an unambiguous expression in XML Schema.} \var{$S_{1..n}$}.
    706 It is very unprobable, that there is a \var{$S_{y} \in S_{1..n}$} that fully matches \var{$S_{x}$}.
    707 Given the heterogenity of the schemas present in the field of research, full alignments are not achievable at all.
      744The data modeller starts off with an existing schema \var{$S_{x}$}. The system accommodates a set of schemas\footnote{Even though within CMDI the data models are called `profiles', we can still refer to them as `schemas', because every profile has an unambiguous expression in an XML Schema.} \var{$S_{1..n}$}.
      745It is very improbable that there is an \var{$S_{y} \in S_{1..n}$} that fully matches \var{$S_{x}$}.
     746Given the heterogeneity of the schemas present in the field of research, full alignments are not achievable at all.
    708747However, thanks to the compositional nature of the CMD data model, the data modeller can reuse just parts of any of the schemas -- the
    709748components \var{c}. Thus the task is to find, for every entity $e_{x} \in S_{x}$, the set of semantically equivalent candidate components $\{c_{y}\}$, which corresponds to the definition of the mapping function for single entities as given in \cite{EhrigSure2004}.
    710 Given, that the modeller does not have to reuse the components as they are, but can use existing components as base to create his own, she is helped even with candidates that are not equivalent, thus we can further relax the task and allow even candidates that are just similar to a certain degree, that can be operationalized as threshold $t$ on the output of the \var{similarity} function
      749Given that the modeller does not have to reuse the components as they are, but can use existing components as a base to create her own, even candidates that are not equivalent can be of interest. Thus we can further relax the task and allow candidates that are just similar to a certain degree (operationalized as a threshold $t$ on the output of the \var{similarity} function).
    711750The fact that this is only a pre-processing step meant to provide suggestions to the human modeller implies that recall is more important than precision.
    712751
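
To make the relaxed matching task more tangible, the following sketch retrieves candidate components for a single entity using a crude \var{similarity} function (normalized label similarity combined with the overlap of referenced data categories) and a threshold $t$. The weighting, the threshold value and all labels and identifiers are arbitrary choices for illustration, not part of the actual SMC implementation.

\begin{lstlisting}[language=Python]
# Sketch: recall-oriented retrieval of candidate components for one entity.
from difflib import SequenceMatcher

def similarity(entity, component):
    """Combine label string similarity and data category overlap (ad hoc weights)."""
    label_sim = SequenceMatcher(None, entity["label"].lower(),
                                component["label"].lower()).ratio()
    shared = len(set(entity["datcats"]) & set(component["datcats"]))
    union = len(set(entity["datcats"]) | set(component["datcats"])) or 1
    return 0.5 * label_sim + 0.5 * shared / union

def candidates(entity, components, t=0.4):
    """Return components with similarity >= t, best matches first."""
    scored = [(similarity(entity, c), c["label"]) for c in components]
    return sorted([s for s in scored if s[0] >= t], reverse=True)

entity = {"label": "actorName", "datcats": ["dc:personName"]}
components = [
    {"label": "Actor.Name", "datcats": ["dc:personName"]},
    {"label": "Project", "datcats": ["dc:projectName"]},
]
print(candidates(entity, components))
\end{lstlisting}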
     
    723762
    724763Next to the usual features and measures that can be applied, like label equality, string similarity and structural equality,
    725 the mapping function could be enriched with \emph{extensional} features based on the concept clusters as delivered by the crosswalk service \ref{sec:cx}.
    726 
    727 It would be worthwhile to test, in how far the \var{smcIndex} paths as defined in \ref{def:smcIndex} could be used as feature.
    728 longest matching subpath.
    729 
      764the mapping function could be enriched with \emph{extensional} features based on the concept clusters as delivered by the crosswalk service (\ref{sec:cx}). It would also be worthwhile to test to what extent the \var{smcIndex} paths as defined in \ref{def:smcIndex} could be used as a feature (e.g. by computing the longest matching subpath).
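
As a sketch of how this path feature could be operationalized, the function below computes the longest common contiguous segment sequence of two dotted paths, normalized by the length of the longer path. The dot-separated path notation is an assumption made for the example, not necessarily the exact \var{smcIndex} serialization.

\begin{lstlisting}[language=Python]
# Sketch: longest matching subpath of two dotted paths, normalized to [0,1].
def path_similarity(p1: str, p2: str) -> float:
    a, b = p1.split("."), p2.split(".")
    best = 0
    for i in range(len(a)):
        for j in range(len(b)):
            k = 0
            while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
                k += 1
            best = max(best, k)
    return best / max(len(a), len(b))

print(path_similarity("Session.Actor.Name", "Corpus.Actor.Name"))  # 2/3
\end{lstlisting}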
    730765
    731766Although we exemplified the approach on the case of integrating an external schema, it could also be applied to the schemas already integrated in the system. While there is already a high baseline thanks to the mechanisms of reuse of components and data categories, there certainly still exist semantic proximities that are not explicitly expressed by these mechanisms. This deficiency is rooted in the collaborative creation of the CMD components and profiles, where individual modellers overlooked, deliberately ignored or only partially reused existing components or profiles. This can be seen in the case of the multiple teiHeader profiles, which, though they model the same existing metadata format, are completely disconnected in terms of component and data category reuse (cf. \ref{results:tei}).
    732767
    733 Note, that in the case of reuse of components, in the normal scenario, the semantic equivalency is ensured even though the new component (and all its subcomponents) is a copy of the old one with new identity, because the references to data categories are copied as well, thus by default the new component shares all data categories with the original one and the modeller has to deliberately change them if required. But even with reuse of components scenarios are thinkable, in which the semantic linking gets broken, or is not established, even though semantic equivalency pervails.
      768Note that in the case of component reuse, in the normal scenario, semantic equivalence is ensured even though the new component (and all its subcomponents) is a copy of the old one with a new identity, because the references to data categories are copied as well. Thus, by default, the new component shares all data categories with the original one and the modeller has to deliberately change them if required. But even with reuse of components, scenarios are thinkable in which the semantic linking gets broken, or is never established, even though semantic equivalence prevails.
    734769
    735770The question is what to do with the new correspondences that would possibly be determined when, as proposed, schema matching is applied to the already integrated schemas. One possibility is to add a data category reference, if one element of the pair is still missing one.
    736771However, if both are already linked to a data category, the data category pair could be added to the relation set in the Relation Registry (cf. \ref{def:rr}).
    737772 
    738 Once all the equivalencies (and other relations) between the profiles/schemas were found, simliarity ratios can be determined.
    739 This new simliarity ratios could be applied as alternative weights in the just-profiles graph \ref{sec:smc-cloud}.
    740 
    741 In contrast to the task described here, that -- restricted matching XML schemas -- can be seen as staying in the ``XML World'',
    742 another aspect within this work is clearly situated in the Semantic Web world and requires application of ontology matching methods, the mapping of field values to semantic entities described in \ref{sec:values2entities}.
    743 
      773Once all the equivalences (and other relations) between the profiles/schemas have been found, similarity ratios can be determined.
      774These new similarity ratios could be applied as alternative weights in the profiles-similarity graph (\ref{sec:smc-cloud}).
     775
      776In contrast to the task described here, which -- restricted to matching XML schemas -- can be seen as staying in the ``XML world'',
      777another aspect within this work is clearly situated in the Semantic Web domain and requires the application of ontology matching methods -- the mapping of field values to semantic entities described in \ref{sec:values2entities}.
    744778
    745779%This approach of integrating prerequisites for semantic interoperability directly into the process of metadata creation is fundamentally different from the traditional methods of schema matching that try to establish pairwise alignments between already existing schemas -- be it algorithm-based or by means of explicit manually defined crosswalks\cite{Shvaiko2005}.
     780
    746781
    747782
  • SMC4LRT/chapters/Infrastructure.tex

    r3671 r3776  
    9999
    100100As dictated by the CMD model, all components needed for the modelled resource description are compiled into one profile.
    101 Once a profile is finished, the Component Registry provides automatically the corresponding XML schema in the \code{cmd} target namespace \code{http://www.clarin.eu/cmd}, that can be used as base for creating and validating metadata records.
      101Once a profile is created, the Component Registry automatically provides the corresponding XML schema, which can be used as a basis for creating and validating metadata records in the \code{cmd} namespace \code{http://www.clarin.eu/cmd}.
    102102
    103103\subsubsection*{Ontological Relations -- Relation Registry}
     
    110110
    111111There is a prototypical implementation of such a relation registry called \xne{RELcat}, being developed at the MPI Nijmegen \cite{Windhouwer2011,SchuurmanWindhouwer2011}, which already hosts a few relation sets. There is no user interface to it yet, but it is accessible as a REST web service\footnote{sample relation set: \url{http://lux13.mpi.nl/relcat/rest/set/cmdi}}.
    112 This implementation stores the individual relations as RDF triples
    113 
    114 \begin{example3}
    115 subjectDatcat & relationPredicate & objectDatcat
    116 \end{example3}
    117 
    118 allowing typed relations, like equivalency (\code{rel:sameAs}) and subsumption (\code{rel:subClassOf}). The relations are grouped into relation sets that can be used independently. The relations are deliberately defined in a separate namespace, instead of reusing existing ones (\code{skos:exactMatch, owl:sameAs}) with the aim to avoid introducing too specific semantics. These relations can be mapped to appropriate other predicates when integrating the relation sets in concrete applications.
      112This implementation stores the individual relations as RDF triples, allowing typed relations, like equivalence (\code{rel:sameAs}) and subsumption (\code{rel:subClassOf}). The relations are grouped into relation sets that can be used independently. The relations are deliberately defined in a separate namespace, instead of reusing existing ones (\code{skos:exactMatch}, \code{owl:sameAs}), with the aim of avoiding the introduction of overly specific semantics. These relations can be mapped to other appropriate predicates when integrating the relation sets into concrete applications.
     113
     114\begin{definition}{The relation triples as stored by the Relation Registry}
     115\textless \ subjectDatcat \ relationPredicate \  objectDatcat \textgreater
     116\end{definition}
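
To give an impression of how a client could consume such a relation set, the following sketch loads the RDF data with the Python library \xne{rdflib} and extracts the pairs related by the equivalence predicate. The namespace URI used for the \code{rel} predicate is an assumption made for this example and would have to be taken from the actual RELcat data.

\begin{lstlisting}[language=Python]
# Sketch: read a relation set and list data category pairs marked as equivalent.
# NOTE: the 'rel' namespace URI below is assumed, not taken from RELcat.
from rdflib import Graph, URIRef

REL_SAME_AS = URIRef("http://www.clarin.eu/relcat/rel#sameAs")  # assumed URI

def sameas_pairs(source):
    """Yield (subjectDatcat, objectDatcat) pairs related by rel:sameAs."""
    g = Graph()
    g.parse(source)  # e.g. a locally downloaded copy of the relation set
    for s, _, o in g.triples((None, REL_SAME_AS, None)):
        yield str(s), str(o)

# usage, assuming the relation set has been fetched to a local RDF/XML file:
# for subj, obj in sameas_pairs("cmdi-relationset.rdf"):
#     print(subj, "<->", obj)
\end{lstlisting}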
    119117
    120118\subsection{Further parts of the infrastructure}
     
    142140
    143141
    144 \subsection{CMDI - Exploitation side}
     142\subsection{CMDI exploitation side}
    145143\label{cmdi_exploitation}
    146144Metadata complying with the CMD data model is being created by a growing number of institutions  by various means -- automatic transformation from legacy data or authoring of new metadata records with the help of one of the metadata editors (cf. \ref{md-editors}). The CMD infrastructure requires the content providers to publish their metadata via the OAI-PMH protocol and announce the OAI-PMH endpoints.  These are being collected daily by a dedicated CLARIN harvester\furl{http://catalog.clarin.eu/oai-harvester/}. The harvested data is validated against the corresponding schemas (every profile implies a separate schema). In the future a subsequent normalization step will play a bigger role, currently only minimal ad-hoc label normalization is performed for a few organization names. Finally, the data is made (publicly) available as compressed archive files. These are being fetched by the exploitation side applications, that ingest the metadata records, index them and make them available for searching and browsing (cf. figure \ref{fig:cmd-ingestion}).
     
    285283\lstset{language=XML}
    286284\begin{lstlisting}
    287         <dcif:conceptualDomain type="constrained">
    288                 <dcif:dataType>string</dcif:dataType>
    289                 <dcif:ruleType>XML Schema regular expression</dcif:ruleType>
    290                 <dcif:rule>[a-z]{3}</dcif:rule>
    291         </dcif:conceptualDomain>
     285  <dcif:conceptualDomain type="constrained">
     286    <dcif:dataType>string</dcif:dataType>
     287    <dcif:ruleType>XML Schema regular expression</dcif:ruleType>
     288    <dcif:rule>[a-z]{3}</dcif:rule>
     289  </dcif:conceptualDomain>
    292290\end{lstlisting}
    293291
     
    295293
    296294\begin{lstlisting}
    297         <clavas:vocabulary href="http://my.openskos.org/vocab/ISO-639" type="closed"/>
     295  <clavas:vocabulary href="http://my.openskos.org/vocab/ISO-639" type="closed"/>
    298296\end{lstlisting}
    299297
     
    319317     <dcif:ruleType>CLAVAS vocabulary</dcif:ruleType>
    320318      <dcif:rule>
    321          <clavas:vocabulary href="http://my.openskos.org/vocab/ISO-639" type="closed"/>
     319         <clavas:vocabulary href="http://my.openskos.org/vocab/ISO-639"
     320                                     type="closed"/>
    322321      </dcif:rule>
    323322  </dcif:conceptualDomain>
     
    359358%%%%%%%%%%%%%%%%%
    360359\section{Other aspects of the infrastructure}
    361 While this work concentrates solely on the metadata, it needs to be recognized, that it is only aspect of the infrastructure and its actual purpose the availability of resources. Metadata is a necessary first step to announce and describe the resources. However it is of little value, if the resources themselves are not accessible. Consequently, another pillar of the CLARIN infrastructure are the centres\furl{http://www.clarin.eu/node/3812}:
      360While this work concentrates solely on the metadata, it is important to acknowledge that metadata is only one aspect of the infrastructure, whose actual purpose is the availability of resources. Announcing and describing the resources by metadata is a necessary first step. However, it is of little value if the resources themselves are not accessible. We want to briefly mention at least two other important aspects: content repositories for storing the resources and federated content search for searching within them.
     361
     362\subsubsection{CLARIN Centres}
     363One view on the CLARIN infrastructure is that of a network of centres\furl{http://www.clarin.eu/node/3812}:
    362364
    363365\begin{quotation}
     
    368370CLARIN also maintains a central registry, the \xne{Centre Registry}\furl{https://centerregistry-clarin.esc.rzg.mpg.de/}, maintaining structured information about every centre, meant as primary entry point into the CLARIN network of centres.
    369371
    370 One core service of such centres are the content repositories, systems meant for long-term preservation and publication of research data and resources. A number of centres have been identified that provide Depositing Services\furl{http://clarin.eu/3773}, i.e. allow third parties researchers (not just the home users) to store research data.
    371 
      372One core service of such centres are the content repositories -- systems meant for long-term preservation and online provision of research data and resources. A number of centres have been identified that provide Depositing Services\furl{http://clarin.eu/3773}, i.e. allow third-party researchers (not just the home users) to store research data.
     373
     374\begin{comment}
    372375In the following a few further well established repositories are mentioned.
    373376
     
    379382\item[OpenAIRE] - Open Acces Infrastructure for Research in Europe \footnote{\url{http://www.openaire.eu/}}
    380383\end{description}
    381 
     384\end{comment}
    382385
    383386\begin{figure*}
     
    389392\end{figure*}
    390393
    391 Another aspect of the availability of resources is, that while metadata can be harvested and indexed locally in one repository, this is not possible with the content itself, both due to the size of the data, but mainly due to legal obligations (licenses, copyright), restricting the access to and availability of the resources. CLARIN's answer to this problem is the task force \emph{Federated Content Search}\furl{http://www.clarin.eu/fcs}\cite{stehouwer2012fcs} aiming at establishing an architecture allowing to search simultaneously (via the aggregator) across a number of resources hosted by different content providers via a harmonized interface adhering to a common protocol. The agreed upon protocol is a compatible extension of the SRU/CQL protocol developed and endorsed by the Library of Congress as the XML- (and web)based successor of the Z39.50. The maintenance of SRU/CQL has been
    392 transfered from LoC to OASIS in 2012, and OASIS released a first version of the protocol as Committee Specification in April 2012\cite{OASIS2012sru}.)
    393 
     394\subsubsection{Federated Content Search}
     395
      396Another aspect of the availability of resources is that, while metadata can be harvested and indexed locally in one repository, this is not possible with the content itself, partly due to the size of the data, but mainly due to legal obligations (licenses, copyright) restricting the access to and availability of the resources. CLARIN's answer to this problem is the task force \emph{Federated Content Search}\furl{http://www.clarin.eu/fcs} \cite{stehouwer2012fcs}, aiming at establishing an architecture that allows searching simultaneously (via an aggregator) across a number of resources hosted by different content providers via a harmonized interface adhering to a common protocol. The agreed-upon protocol is a compatible extension of the SRU/CQL protocol developed and endorsed by the Library of Congress as the XML- (and web-)based successor of Z39.50 \cite{Lynch1991}.
     397
      398Note that in practice the line between metadata and content data is not so clear -- usually there is a need to filter by metadata even when searching in content. Therefore, most content search engines also feature some kind of metadata filters. Thus it seems reasonable to harmonize the search protocol and query language for metadata and content. This proposition is further elaborated on in \ref{cql}.
    394399
    395400\section{Summary}
  • SMC4LRT/chapters/Introduction.tex

    r3665 r3776  
    109109
    110110\section{Structure of the work}
    111 The work starts with examining the state of the art work in the two fields  language resources and technology and semantic web technologies in chapter \ref{ch:lit}, followed by administrative chapter \ref{ch:def} explaining the abbreviations and formatting conventions used throughout this work.
    112 
    113 In chapter \ref{ch:data} we analyze the situation in the data domain of LRT metadata and in chapter \ref{ch:infra} we discuss the individual software components of the infrastructure underlying this work.
      111The work starts with examining the state of the art in the two fields of language resources and technology and semantic web technologies in chapter \ref{ch:lit}. In chapter \ref{ch:data} we analyze the situation in the data domain of LRT metadata and in chapter \ref{ch:infra} we discuss the individual software components of the infrastructure underlying this work.
    114112
    115113The main part of the work is found in chapters \ref{ch:design} and \ref{ch:design-instance}, laying out the design of the software module and a proposal for how to model the data in RDF, respectively.
     
    118116The results are discussed in chapter \ref{ch:results}. Finally, in chapter \ref{ch:conclusions} we summarize the findings of the work and lay out where it could develop in the future.
    119117
      118The auxiliary material accompanying the work is found in the appendix. After the administrative chapter \ref{ch:def} explaining the abbreviations and formatting conventions used throughout this work, full specifications of the used data models (\ref{ch:data-model-ref}) and data samples (\ref{ch:cmd-sample}) are listed for reference, as well as the developer and user documentation for the technical solution of this work, the SMC module (\ref{ch:smc-docs}).
     119
     120
    120121\section{Keywords}
    121122
  • SMC4LRT/chapters/Literature.tex

    r3681 r3776  
    1313In recent years, multiple large-scale initiatives have set out to combat the fragmented nature of the language resources landscape in general and the metadata interoperability problems in particular.
    1414
    15 \xne{EAGLES/ISLE Meta Data Initiative} (IMDI) \cite{wittenburg2000eagles} 2000 to 2003 proposed a standard for metadata descriptions of Multi-Media/Multi-Modal Language Resources aiming at easing access to Language Resources and thus increases their reusability.   
      15The \xne{EAGLES/ISLE Meta Data Initiative} (IMDI)\furl{http://www.mpi.nl/imdi/} \cite{wittenburg2000eagles}, running from 2000 to 2003, proposed a standard for metadata descriptions of Multi-Media/Multi-Modal Language Resources, aiming at easing access to Language Resources and thus increasing their reusability.
    1616
    1717\xne{FLaReNet}\furl{http://www.flarenet.eu/} -- Fostering Language Resources Network -- running from 2007 to 2010, concentrated rather on ``community and consensus building'', developing a common vision and mapping the field of LRT via surveys.
    1818
    19 \xne{CLARIN} -- Common Language Resources and Technology Infrastructure -- large research infrastructure providing sustainable access for scholars in the humanities and social sciences to digital language data, and especially its technical core the Component Metadata Infrastructure (CMDI)  -- a comprehensive architecture for harmonized handling of metadata\cite{Broeder2011} --
      19\xne{CLARIN}\furl{http://clarin.eu} -- Common Language Resources and Technology Infrastructure -- a large research infrastructure providing sustainable access for scholars in the humanities and social sciences to digital language data, and especially its technical core, the Component Metadata Infrastructure (CMDI) -- a comprehensive architecture for harmonized handling of metadata \cite{Broeder2011} --
    2020are the primary context of this work; therefore the description of this underlying infrastructure is detailed in a separate chapter (\ref{ch:infra}).
    2121Both above-mentioned projects can be seen as predecessors to CLARIN, the IMDI metadata model being one starting point for the development of CMDI.
     
    3535\label{lit:digi-lib}
    3636
    37 In a broader view we should also regard the activities in the world of libraries.
    38 Starting already in 1970's with connecting, exchanging and harmonizing their bibliographic catalogs, they certainly have a long tradition, wealth of experience and stable solutions.
    39 
    40 Mainly driven by national libraries still bigger aggregations of the bibliographic data are being set up.
    41  The biggest one being the \xne{Worldcat}\furl{http://www.worldcat.org/} (totalling 273.7 million records \cite{OCLCAnnualReport2012})
    42 powered by OCLC, a cooperative of over 72.000 libraries worldwide.
    43 
    44 In Europe, more recent initiatives have pursuit similar goals:
     37In a broader view we should also regard the activities in the domain of libraries and information sciences (LIS).
      38Starting already in the 1970s with connecting, exchanging and harmonizing their bibliographic catalogs, libraries were early adopters and a driving force in the field of search federation even before the era of the internet (e.g. the \xne{Linked Systems Project} \cite{Fenly1988}). The LIS community thus has a long tradition, a wealth of experience and robust solutions with respect to metadata aggregation, harmonization and exploitation.
     39%, starting collaborative efforts in mid 70s
     40
      41Driven mainly by national libraries, ever bigger aggregations of bibliographic data are being set up.
      42The biggest one is \xne{Worldcat}\furl{http://www.worldcat.org/} (totalling 273.7 million records \cite{OCLCAnnualReport2012}), powered by OCLC, a cooperative of over 72,000 libraries worldwide.
     43
      44In Europe, multiple recent initiatives have pursued similar goals of pooling together the immense wealth of information sheltered in the many libraries:
    4545\xne{The European Library}\furl{http://www.theeuropeanlibrary.org/tel4/} offers a search interface over more than 18 million digital items and almost 120 million bibliographic records from 48 National Libraries and leading European Research Libraries.
    4646
    47 \xne{Europeana}\furl{http://www.europeana.eu/} \cite{purday2009think} is a cultural heritage initiative with even broader scope, serving as ``meta-aggregator and portal for European digitised works'', encompassing material not just from libraries, but also museums, archives and all other kinds of collections (In fact, The European Library is the \emph{library aggregator} for Europeana).
    48 
    49 A large number of projects contribute(d) to Europeana. E.g. the auxiliary project \xne{EuropeanaConnect}\furl{http://www.europeanaconnect.eu/} (2009-2011) delivered the core technical components for Europeana as well as further services reusable in other contexts, e.g. the spatio-temporal browser \xne{GeoTemCo}\furl{https://github.com/stjaenicke/GeoTemCo} \cite{janicke2013geotemco}.
    50 Most recently, with \xne{Europeana Cloud}\furl{http://pro.europeana.eu/web/europeana-cloud} (2013 to 2015) a succession of \xne{Europeana} was established, a Best Practice Network, coordinated by The European Library, designed to establish a cloud-based system for Europeana and its aggregators, providing new content, new metadata, a new linked storage system, new tools and services for researchers and a new platform - Europeana Research.
    51 
    52 The related catalogs and formats are described in the section \ref{sec:other-md-catalogs}
     47\xne{Europeana}\furl{http://www.europeana.eu/} \cite{purday2009think} is a cultural heritage initiative with even broader scope, serving as ``meta-aggregator and portal for European digitised works'', encompassing material not just from libraries, but also museums, archives and all other kinds of collections. (In fact, The European Library is the \emph{library aggregator} for Europeana.)
     48
     49A large number of projects contribute(d) to \xne{Europeana}. E.g. the auxiliary project \xne{EuropeanaConnect}\furl{http://www.europeanaconnect.eu/} (2009-2011) delivered the core technical components for Europeana as well as further services reusable in other contexts, one of them being the spatio-temporal browser \xne{GeoTemCo}\furl{https://github.com/stjaenicke/GeoTemCo} \cite{janicke2013geotemco}.
     50Most recently, with \xne{Europeana Cloud}\furl{http://pro.europeana.eu/web/europeana-cloud} (2013 to 2015) another initiative in the realm of \xne{Europeana} has been started, a Best Practice Network, coordinated by The European Library, designed to ``establish a cloud-based system for Europeana and its aggregators, providing new content, new metadata, a new linked storage system, new tools and services for researchers and a new platform - Europeana Research''.
     51
     52The related catalogs and formats are described in the section \ref{sec:lib-formats}.
    5353
    5454
    5555\section{Existing crosswalks (services)}
    5656
    57 Crosswalks as list of equivalent fields from two schemas have been around already for a long time, in the world of enterprise systems, e.g. to bridge to legacy systems and also in libraries,  e.g. \emph{MARC to Dublin Core Crosswalk}\furl{http://loc.gov/marc/marc2dc.html}
    58 
    59 \cite{Day2002crosswalks} lists a number of mappings between metadata formats.
    60 
    61 Mostly Dublin Core and MARC family of formats
    62 
    63 http://www.loc.gov/marc/dccross.html
    64 
    65 
    66 static
    67 metadata crosswalk repository
    68 
    69 
    70 OCLC launched \xne{Metadata Schema Transformation Services}\furl{http://www.oclc.org/research/activities/schematrans.html?urlm=160118}
    71 in particular \xne{Crosswalk Web Service}\furl{http://www.oclc.org/developer/services/metadata-crosswalk-service}
    72 http://www.oclc.org/research/activities/xwalk.html
      57Crosswalks as lists of equivalent fields from two schemas have been around for a long time, both in the world of enterprise systems, e.g. to bridge to legacy systems, and in the LIS domain. \cite{Day2002crosswalks} lists a number of mappings between metadata formats, mostly between the Dublin Core and MARC families of formats.\footnote{\url{http://loc.gov/marc/marc2dc.html}, \url{http://www.loc.gov/marc/dccross.html}}
     58
      59However, these crosswalks are just static correspondence lists, often only available as documents, and cover only a limited set of formats. One effort that comes nearer to our idea of a service delivering crosswalks dynamically is the \xne{Metadata Crosswalk Service}\footnote{\url{http://www.oclc.org/developer/services/metadata-crosswalk-service}, \url{http://www.oclc.org/research/activities/xwalk.html}, (SOAP based)} offered by OCLC as part of its \xne{Metadata Schema Transformation Services}\furl{http://www.oclc.org/research/activities/schematrans.html?urlm=160118}
    7360
    7461\begin{quotation}
     
    7663\end{quotation}
    7764
    78 the Crosswalk Web Service is now a production system that has been incorporated into the following OCLC products and services.
    79 
    80 However the demo service is not available\furl{http://errol.oclc.org/schemaTrans.oclc.org.search}
    81 
    82 
    83 
    84 Offered formats?
    85 These however concentrate on the formats for the LIS community available and are ??
    86 
    87 For this service, a metadata format is defined as a triple of:
    88 
    89     Standard—the metadata standard of the record (e.g. MARC, DC, MODS, etc ...)
    90     Structure—the structure of how the metadata is expressed in the record (e.g. XML, RDF, ISO 2709, etc ...)
    91     Encoding—the character encoding of the metadata (e.g. MARC8, UTF-8, Windows 1251, etc ...)
    92 
    93 
    94 Offered interface!?
    95 he Crosswalk Web Service has 4 methods:
    96 
    97     translate(...) - This method translates the records. See the documentation for more information.
    98     getSupportedSourceRecordFormats() - This method returns a list of formats that are supported as input formats.
    99     getSupportedTargetRecordFormats() - This method returns a list of formats that the input formats can be translated to.
    100     getSupportedJavaEncodings() - Some formats will support all of the character encodings that Java supports. This function returns the list of encodings that Java supports.
    101 
      65Although the website states that the ``Crosswalk Web Service is now a production system that has been incorporated into OCLC products and services'', the demo service\furl{http://errol.oclc.org/schemaTrans.oclc.org.search} is not accessible. Also, this service only offers crosswalks between formats relevant for the LIS community: \xne{Dublin Core, MARCXML, MARC-2709, MODS}. So, altogether, the service does not seem suitable to be used as is for the purposes of this work, but it can certainly serve as inspiration for the specification of the planned service.
     66
     67\begin{comment}
     68The Crosswalk Web Service has 4 methods:
     69\begin{description}
     70\item[translate()]  This method translates the records.
     71\item[getSupportedSourceRecordFormats()]  This method returns a list of formats that are supported as input formats.
     72\item[getSupportedTargetRecordFormats()] This method returns a list of formats that the input formats can be translated to.
     73\item[getSupportedJavaEncodings()] Some formats will support all of the character encodings that Java supports. This function returns the list of encodings that Java supports.
     74\end{description}
     75\end{comment}
    10276
    10377
     
    154128This elegant abstraction introduced with the \var{similarity} function provides a general model that can accommodate a broad range of comparison relationships and corresponding similarity measures. And here, again, we encounter a broad range of possible approaches.
    155129
    156 \cite{ehrig2004qom} lists a number of basic features and corresponsing similarity measures:
    157 Starting from primitive data types, next to value equality, string similarity, edit distance or in general relative distance can be computed.
    158 For concepts, next to the directly applicable unambiguous \code{sameAs} statements, label similarity can be determined (again either as string similarity, but also broaded by employing external taxonomies and other semantic resources like WordNet - \emph{extensional} methods), equal (shared) class instances, shared superclasses, subclasses, properties.
    159 
    160 Element-level (terminological)  vs structure-level (structural)  \cite{Shvaiko2005_classification}
    161 
    162 based on background knowledge...
    163 
    164 subclass–superclass relationships, domains and ranges of properties, analysis of the graph structure of the ontology.
    165 
    166 For properties the degree of the super an subproperties equality, overlapping domain and/or range.
    167 Additionally to these measures applicable on individual ontology items, there are approaches (like the \var{Similarity Flooding algorithm} \cite{melnik2002similarity}) to propagate computed similarities across the graph defined by relations between entities (primarily subsumption hierarchy).
      130\cite{ehrig2004qom} lists a number of basic features and corresponding similarity measures; \cite{Shvaiko2005_classification} classifies the features into element-level (terminological), structure-level (structural) and those based on background knowledge (extensional):
      131Starting from primitive data types, next to value equality, string similarity, edit distance or, in general, relative distance can be computed. For concepts, besides the directly applicable unambiguous \code{sameAs} statements, label similarity can be determined (again either as string similarity or by employing external taxonomies and other semantic resources like WordNet -- \emph{extensional} methods), as well as equal (shared) class instances, subclass–superclass relationships and shared properties. For properties, the degree of equality of super- and subproperties and overlapping domains and/or ranges can be considered.
     132
      133In addition to these measures applicable to individual ontology items, there are approaches (like the \var{Similarity Flooding algorithm} \cite{melnik2002similarity}) to propagate computed similarities across the graph defined by relations between entities (primarily the subsumption hierarchy), or even to analyse and compare the overall graph structure of the ontology.
    168134
    169135\cite{Algergawy2010} classifies, reviews, and experimentally compares major methods of element similarity measures and their combinations. \cite{shvaiko2012ontology} comparing a number of recent systems finds that ``semantic and extensional methods are still rarely employed. In fact, most of the approaches are quite often based only on terminological and structural methods.
     
    189155A number of existing systems for schema/ontology matching/alignment is collected in the above-mentioned overview publications:
    190156
    191 IF-Map \cite{kalfoglou2003if}, QOM \cite{ehrig2004qom}, \xne{FOAM} \cite{EhrigSure2005}, Similarity Flooding (SF) \cite{melnik}, S-Match \cite{Giunchiglia2007_semanticmatching}, the Prompt tools \cite{Noy2003_theprompt} integrating with Protégé or \xne{COMA++} \cite{Aumueller2005}, \xne{Chimaera}. Additionally, \cite{shvaiko2012ontology} lists and evaluates some more recent contributions: \xne{SAMBO, Falcon, RiMOM, ASMOV, Anchor-Flood, AgreementMaker}.
     157\xne{IF-Map} \cite{kalfoglou2003if}, \xne{QOM} \cite{ehrig2004qom}, \xne{FOAM} \cite{EhrigSure2005}, \xne{Similarity Flooding (SF)} \cite{melnik}, \xne{S-Match} \cite{Giunchiglia2007_semanticmatching}, the \xne{Prompt} tools \cite{Noy2003_theprompt} integrating with \xne{Protégé} or \xne{COMA++} \cite{Aumueller2005}, \xne{Chimaera}. Additionally, \cite{shvaiko2012ontology} lists and evaluates some more recent contributions: \xne{SAMBO, Falcon, RiMOM, ASMOV, Anchor-Flood, AgreementMaker}.
    192158
    193159All of the tools use multiple methods as described in the previous section, exploiting both element-level as well as structural features and applying some kind of composition or aggregation of the computed atomic measures to arrive at an alignment assertion.
     
    206172
    207173\subsubsection{Semantic Web - Technical solutions / Server applications}
    208 
    209 
    210 The provision of the produced semantic resources on the web requires technical solutions to store the RDF triples, query them efficiently
    211 and idealiter expose them via a web interface to the users.
    212 
    213 Meanwhile a number of RDF triple store solutions relying both on native, DBMS-backed or hybrid persistence layer are available, open-source solutions like \xne{Jena, Sesame} or \xne{BigData} as well as a number of commercial solutions \xne{AllegroGraph, OWLIM, Virtuoso}.
     174\label{semweb-tech}
     175
     176The provision of the produced semantic resources on the web requires technical solutions to store the RDF triples, query them efficiently via SPARQL\cite{SPARQL2008} and \textit{idealiter} expose them via a web interface to the users.
     177
      178Meanwhile, a number of RDF triple store solutions relying on native, DBMS-backed or hybrid persistence layers are available: open-source solutions like \xne{Jena, Sesame} or \xne{BigData} as well as commercial solutions like \xne{AllegroGraph, OWLIM, Virtuoso}.
    214179
    215180A qualitative and quantitative study \cite{Haslhofer2011europeana} in the context of Europeana evaluated a number of RDF stores (using the whole Europeana EDM data set of 382,629,063 triples as data load) and came to the conclusion that ``certain RDF stores, notably OpenLink Virtuoso and 4Store'' can handle the large test dataset.
    216181
    217182\xne{OpenLink Virtuoso Universal Server}\furl{http://virtuoso.openlinksw.com} is a hybrid storage solution for a range of data models, including relational data, RDF and XML, and free text documents \cite{Erling2009Virtuoso, Haslhofer2011europeana}.
    218 Virtuoso is used to host many important Linked Data sets (e.g., DBpedia\furl{http://dbpedia.org} \cite{auer2007dbpedia}).
     183Virtuoso is used to host many important Linked Data sets, e.g., DBpedia\furl{http://dbpedia.org} \cite{auer2007dbpedia}.
    219184Virtuoso is offered both under a commercial and an open-source license.
    220185
    221186Another solution worth examining is the \xne{Linked Media Framework}\furl{http://code.google.com/p/lmf/} -- ``easy-to-setup server application that bundles together three Apache open source projects to offer some advanced services for linked media management'': publishing legacy data as linked data, semantic search by enriching data with content from the Linked Data Cloud, using SKOS thesaurus for information extraction.
    222187
    223 One more specific work is that of Noah et. al \cite{Noah2010} developing a semantic digital library for an academic institution. The scope is limited to document collections, but nevertheless many aspects seem very relevant for this work, like operating on document metadata, ontology population or sophisticated querying and searching.
    224 
      188One more specific work is that of Noah et al. \cite{Noah2010}, developing a semantic digital library for an academic institution. The scope is limited to document collections, but nevertheless many aspects seem very relevant for this work, like operating on document metadata, ontology population or sophisticated querying and searching. Another solution in a related, more specialized domain and already in productive use is \xne{rechercheisidore}\furl{http://rechercheisidore.fr} \cite{pouyllau2011isidore}, a French portal for digital humanities resources.
    225189
    226190\begin{comment}
     
    231195
    232196Haystack\furl{http://en.wikipedia.org/wiki/Haystack_(PIM)}
     197
     198\todoin{check SARQ}\furl{http.//github.com/castagna/SARQ}
     199
    233200\end{comment}
    234201
    235202\subsubsection{Ontology Visualization}
    236203
    237 Landscape, Treemap, SOM
    238 
    239 \todoin{check Ontology Mapping and Alignement / saiks/Ontology4 4auf1.pdf}
      204Complex structured datasets like ontologies require dedicated means for their high-level exploration, like aggregations and interactive visualization techniques. A large variety of solutions has been implemented in the last two decades (cf. the overview of the field in \cite{lanzenberger2010ontology}, also for citations of the tools listed below). Given the inherent graph structure of the RDF data model, the obvious and most common approach is a tree- or graph-based visualization with concepts being represented as nodes and relations as edges. Numerous solutions are realized as plug-ins for the widespread open-source ontology editor \xne{Prot\'{e}g\'{e}} \cite{grosso1999protege} developed at Stanford University, like \xne{OntoViz, Jambalaya, TouchGraph, OWLViz, OntoSphere, PromptViz} etc.
     205
     206There exists also a sizable number of stand-alone solutions (\xne{Ontorama, FOAFnaut, IsaViz, GKB-Editor} and more) though often bound to a specific dataset or data type (\xne{Wordnet, FOAF, Cyc}).
     207
      208There are also plenty of general graph visualization tools that can be adapted for viewing RDF data as a graph, like the traditional graph layouting tool \xne{GraphViz dot}, or, more recently, \xne{Gephi} \cite{bastian2009gephi}, a stand-alone interactive tool for graph visualization with a number of layouting algorithms and display options. The rather recent generic visualization JavaScript library \xne{d3}\footnote{\url{http://d3js.org}} \cite{bostock2011d3} seems especially appealing thanks to its data-driven paradigm, dedicated support for graphs with an integrated customizable graph layouting algorithm and -- being pure JavaScript -- allowing web-based solutions.
     209
     210%Most recently a web-based version of this versatile tool has been released\furl{http://protegewiki.stanford.edu/wiki/WebProtege} that supports collaborative ontology development
     211
      212The solutions are rather sparse when it comes to more advanced visualizations beyond the simple one-to-one display of the data model graph as a visual graph, especially the visualization of ontology mapping and alignment. Besides \xne{OLA} \cite{euzenat2004ola}, \xne{PromptViz} \cite{Noy2003_theprompt} and \xne{CogZ} \cite{falconer2009cogz}, we would like to point out one solution developed at the IFS of the Technical University in Vienna \cite{lanzenberger2006alviz}, \xne{AlViz} -- a tool which visually supports semi-automatic alignment of ontologies. It is implemented as a ``multiple-view plug-in for Protege using J-Trees and Graphs. Based on similarity measures of an ontology matching algorithm AlViz helps to assess and optimize the alignment results.'' It applies visual cues like colouring to indicate the computed similarity of concepts between two ontologies and clustering to reduce the complexity of the displayed datasets (cf. figure \ref{fig:alviz}). Unfortunately, the development of this very promising research prototype seems to have stalled, the last available version being from 2009.
     213
     214\begin{figure*}
     215\begin{center}
     216\includegraphics[width=0.8\textwidth]{images/AlViz_screenshot.png}
     217\caption{Screenshot of AlViz -- tool for visual exploration of ontology alignment \cite{lanzenberger2006alviz}}
     218\label{fig:alviz}
     219\end{center}
     220\end{figure*}
    240221
    241222
     
    243224\section{Language and Ontologies}
    244225
    245 There are two different relation links betwee language or linguistics and ontologies: a) `linguistic ontologies' domain ontologies conceptualizing the linguistic domain, capturing aspects of linguistic resources; b) `lexicalized' ontologies, where ontology entities are enriched with linguistic, lexical information.
      226There are two different kinds of links between language or linguistics and ontologies: a) `linguistic ontologies' -- domain ontologies conceptualizing the linguistic domain, capturing aspects of linguistic resources; b) `lexicalized' ontologies, where ontology entities are enriched with linguistic, lexical information.
    246227
    247228\subsubsection{Linguistic ontologies}
     
    270251Another indication of the heritage is the fact that concepts of the GOLD ontology were migrated into ISOcat (495 items) in 2010.
    271252
    272 Notice that although this work is concerned with language resources, it is primarily on the metadata level, thus the overlap with linguistic ontologies codifying the terminology of the discipline linguistic is rather marginal (perhaps on level of description of specific linguistic aspects of given resources).
      253Notice that although this work is concerned with language resources, it operates primarily on the metadata level, thus the overlap with linguistic ontologies codifying the discipline-specific linguistic terminology is rather marginal (perhaps at the level of the description of specific linguistic aspects of given resources).
    273254
     274255\subsubsection{Lexicalised ontologies, ``ontologized'' lexicons}
    275 
    276256
     277257The other type of relation between ontologies and linguistics or language is lexicalised ontologies. Hirst \cite{Hirst2009} elaborates on the differences between ontology and lexicon and on the possibility of reusing lexicons for the development of ontologies.
  • SMC4LRT/chapters/Results.tex

    r3681 r3776  
    3131\\
    3232
    33 \url{http://clarin.aac.ac.at/smc} (soon: \url{http://acdh.ac.at/smc})
     33\url{http://clarin.arz.oeaw.ac.at/smc} (soon: \url{http://acdh.ac.at/smc})
    3434
    3535
     
    4141This interface is available as part of the smc application:
    4242
    43 \url{http://clarin.aac.ac.at/smc/cx}
     43\url{http://clarin.arz.oeaw.ac.at/smc/cx}
    4444
     4545\subsection{SMC -- as a module within the Metadata Repository}
     4646The SMC is also integrated as a module within the Metadata Repository, enabling \emph{semantic search} over the joint metadata domain.
    4747
    48 \url{http://clarin.aac.ac.at/mdrepo/smc}
      48\url{http://clarin.arz.oeaw.ac.at/mdrepo/} (module not integrated yet)
    4949
    5050\subsection{SMC Browser -- advanced interactive user interface}
     
     5252SMC Browser is an advanced web-based visualization application for exploring the complex dataset of the \xne{Component Metadata Infrastructure} by rendering its structure as an interactive graph. In particular, it enables the metadata modeller to examine the reuse of components or DCs in different profiles. The graph is accompanied by numerical statistics about the dataset as a whole and about individual items (profiles, components, data categories), a set of example results and user documentation. Details about design and implementation can be found in \ref{smc-browser}. The publicly available instance is maintained at:
    5353
    54 \url{http://clarin.aac.ac.at/smc/browser}
     54\url{http://clarin.arz.oeaw.ac.at/smc-browser}
    5555
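As a side note, the kind of per-type statistics mentioned above can in principle be computed directly from a d3-style graph serialization. The following sketch is illustrative only and assumes a hypothetical \code{graph.json} with a \code{nodes} array whose entries carry a \code{type} attribute; it does not reflect the actual data format or code of the SMC Browser.

\begin{lstlisting}[caption=Counting nodes per type in a d3-style graph (illustrative only)]
// Hypothetical file name and attribute names -- assumptions for illustration.
d3.json("graph.json", function (error, graph) {
  if (error) { console.error(error); return; }

  // Tally the nodes by their type attribute (e.g. profile, component, datcat).
  var counts = {};
  graph.nodes.forEach(function (n) {
    counts[n.type] = (counts[n.type] || 0) + 1;
  });

  console.log(counts); // e.g. {profile: ..., component: ..., datcat: ...}
});
\end{lstlisting}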
    5656\begin{figure*}
     
    287287\begin{figure*}[!ht]
    288288\begin{center}
    289 \includegraphics[width=1\textwidth]{images/just_profiles_6.png}
     289\includegraphics[width=1\textwidth]{images/just_profiles_9.png}
    290290\end{center}
    291291\caption{SMC cloud -- graph visualizing the semantic proximity of profiles}
  • SMC4LRT/chapters/acknowledgements.tex

    r2697 r3776  
    11\chapter*{Acknowledgements}
    22
    3 I would like to thank all the colleagues from the CLARIN community, for the support, the fruitful discussions and helpful feedback, especially Daan Broeder, Menzo Windhouwer, Marc Kemps-Snijders, Hennie Brugman.
    4 
      3I would like to thank all the colleagues from my institute and from the CLARIN community for their support, the fruitful discussions and the helpful feedback, especially Menzo Windhouwer, Daan Broeder, Dieter Van Uytvanck, Marc Kemps-Snijders and Hennie Brugman. \\
      4And to all my dear ones, for the extra portion of patience I demanded from them.
     5\\
     6 \\
    57With love to em.
  • SMC4LRT/chapters/appendix.tex

    r3665 r3776  
    55
    66\chapter{Data model reference}
     7\label{ch:data-model-ref}
     78In the following, the complete data models and schemas are listed for reference: the diagram of the data model for data category specification in figure~\ref{fig:DCR_data_model}, \xne{Terms.xsd} -- the XML schema used by the SMC module internally -- in listing~\ref{lst:terms-schema} (cf. \ref{datamodel-terms}) and the \xne{general-component-schema.xsd}\furl{https://infra.clarin.eu/cmd/general-component-schema.xsd} -- the schema representing the CMD meta model for defining CMD profiles and components -- in listing~\ref{lst:cmd-schema}. Figure \ref{fig:ref_arch} depicts an abstract reference architecture that provides a conceptual frame for this work, and figure \ref{fig:acdh_context} gives an overview of the roles and services of the \xne{ACDH -- Austrian Centre for Digital Humanities} -- the home of SMC -- explicating the concrete current situation regarding the architectural context of SMC.
    89
     
    3738
    3839\chapter{CMD -- sample data}
     40\label{ch:cmd-sample}
    3941
    4042\section{Definition of a CMD profile}
      43The following listing presents a sample CMD specification for the \concept{collection\#clarin.eu:cr1:p\_1345561703620} profile.
     44
     45\input{chapters/collection_spec.xml.tex}
    4146
    4247\section{CMD record}
      48The following listing shows a sample CMD record -- an instance of the \concept{collection} profile specified above.
     49
     50\input{chapters/collection_instance.xml.tex}
    4351
    4452
    45 \chapter{SMC Browser -- related material }
     53\chapter{SMC -- documentation}
     54\label{ch:smc-docs}
    4655
     56\begin{figure*}
     57\begin{center}
     58\includegraphics[height=1\textwidth, angle=90]{images/build_init.png}
     59\end{center}
     60\caption{A graphical representation of the dependencies and calls in the main \xne{ant} build file.}
     61\label{fig:smc-build_init}
     62\end{figure*}
    4763
    48 \begin{figure*}[!ht]
    49 \begin{center}
    50 \includegraphics[width=1\textwidth]{images/cmd-deps-graph_part2.png}
    51 \end{center}
    52 \caption{An early version of a visual representation of (a part of) the \xne{smc-graph} generated with the \code{dot} tool.}
    53 \label{fig:cmd-dep-dotgraph}
    54 \end{figure*}
     64\section{Documentation of smc-xsl}
     65\label{sec:smc-xsl-docs}
     66\todoin{generate and reference XSLT-documentation}
    5567
    5668\section{SMC Browser user documentation}
     
    6274\label{sec:smc-graphs}
    6375
     76\begin{figure*}[h]
     77\begin{center}
     78\includegraphics[width=1\textwidth]{images/cmd-deps-graph_part2.png}
     79\end{center}
     80\caption{An early version of a visual representation of (a part of) the \xne{smc-graph} generated with the \code{dot} tool.}
     81\label{fig:cmd-dep-dotgraph}
     82\end{figure*}
     83
     84
    6485\begin{comment}
    6586       
    6687\chapter{SMC Reports}
    67 \label{ch:smc-reports}
     88%\%label{ch:smc-reports}
    6889
    6990SMC Reports is a (growing) set of documents analyzing specific phenomena in the CMD data domain based on the visual and numerical output from the SMC Browser \ref{smc-browser}.
  • SMC4LRT/chapters/userdocs_cleaned.tex

    r3666 r3776  
    9999The nodes are colour-coded by type:
    100100
    101 \includegraphics[height=100px]{C:/Users/m/3/clarin/_repo/SMC/docs/graph_legend.svg}
     101\includegraphics[height=100px]{images/graph_legend.png}
    102102
    103103\phantomsection\label{select-nodes}
  • SMC4LRT/images/Terms.xsd.tex

    r3640 r3776  
    22\begin{lstlisting}[label=lst:terms-schema, caption=Terms.xsd -- schema of the internal data model \ref{datamodel-terms}]
    33<?xml version="1.0" encoding="UTF-8"?>
    4 <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified" xmlns:ns2="http://www.w3.org/1999/xlink">
     4<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
     5elementFormDefault="qualified" xmlns:ns2="http://www.w3.org/1999/xlink">
    56  <xs:import namespace="http://www.w3.org/1999/xlink" schemaLocation="ns2.xsd"/>
    67  <xs:import namespace="http://www.w3.org/XML/1998/namespace" schemaLocation="xml.xsd"/>
  • SMC4LRT/thesis.tex

    r3666 r3776  
    2828\thesisverfassung{Matej \v{D}ur\v{c}o} % Verfasser
    2929\thesisauthor{Matej \v{D}ur\v{c}o} % your name
    30 \thesisauthoraddress{Viktorgasse 8/6, 1040 Wien} % your address
      30\thesisauthoraddress{Josefstädterstrasse 70/32, 1080 Wien} % your address
    3131\thesismatrikelno{0005416} % your registration number
    3232
    33 \thesisbetreins{ao.Univ.-Prof.?? Dr. Andreas Rauber}
    34 \thesisbetrzwei{Univ.-Prof. Mag. Dr. Gerhard Budin}
     33\thesisbetreins{ao.Univ.-Prof. Dr. Andreas Rauber, Univ.-Prof. Mag. Dr. Gerhard Budin}
     34\thesisbetrzwei{}
    3535%\thesisbetrdrei{Dr. Vorname Familienname} % optional
    3636
     
    5858
    5959
    60 %\begin{comment}
     60\begin{comment}
     61\end{comment}
    6162\input{chapters/Introduction}
    6263
    6364\input{chapters/Literature}
    6465
    65 \input{chapters/Definitions}
    66 
    67 
    6866\input{chapters/Data}
    6967
    7068\input{chapters/Infrastructure}
     69
    7170\input{chapters/Design_SMCschema}
    72 
    73 
    7471
    7572\input{chapters/Design_SMCinstance}
    7673
    7774\input{chapters/Results}
    78 %\end{comment}
     75
    7976\input{chapters/Conclusion}
    8077
     
    9087%\bibliography{references}
    9188%\bibliographystyle{ieeetr}
    92 \bibliography{../../2bib/lingua,../../2bib/ontolingua,../../2bib/smc4lrt,../../2bib/semweb,../../2bib/distributed_systems,../../2bib/own}
     89\bibliography{../../2bib/lingua,../../2bib/ontolingua,../../2bib/smc4lrt,../../2bib/semweb,../../2bib/distributed_systems,../../2bib/own,../../2bib/diglib,../../2bib/it-misc,../../2bib/infovis}
    9390
    9491\appendix
     92
     93\input{chapters/Definitions}
    9594
    9695\input{chapters/appendix}
Note: See TracChangeset for help on using the changeset viewer.