Changeset 3551 for SMC4LRT

Timestamp:

09/11/13 18:04:14

Author:

vronk

Message:

intermediate version - ongoing work on introduction

Location:

SMC4LRT/chapters

Files:

10 edited

Conclusion.tex (modified) (1 diff)
Data.tex (modified) (5 diffs)
Definitions.tex (modified) (2 diffs)
Design_SMCinstance.tex (modified) (5 diffs)
Design_SMCschema.tex (modified) (9 diffs)
Infrastructure.tex (modified) (13 diffs)
Introduction.tex (modified) (8 diffs)
Literature.tex (modified) (3 diffs)
Results.tex (modified) (8 diffs)
appendix.tex (modified) (2 diffs)


SMC4LRT/chapters/Conclusion.tex

r3204	r3551
8	8
9	9	More work is needed on consolidating the actual values in the CMD records. CLARIN has set up a separate task force for data curation, which will have to be an ongoing effort. Work is also ongoing on enriching the SMC browser with instance data information, allowing users to directly see and inspect which profiles and DCs are actually being used in the instance data (and how often).
	10
	11
	12	Irrespective of the additional levels - the user wants and has to get to the resource. (not always)
	13	to the "original"

SMC4LRT/chapters/Data.tex

-                      r3140
+                      r3551
 \subsubsection{CMD Profiles }
 In the CR 124\footnote{All numbers are as of 2013-06 if not stated otherwise} public Profiles and 696 Components are defined. Table \ref{table:dev} shows the development of the CR and DCR population over time.
+In the CR 124\footnote{All numbers are as of 2013-06 if not stated otherwise} public Profiles and 696 Components are defined. Table \ref{table:dev_profiles} shows the development of the CR and DCR population over time.
 Next to the `native' CMD profiles a number of profiles have been created that implement existing metadata formats, like OLAC/DCMI-terms, TEI Header or the META-SHARE schema. The resulting profiles prove the flexibility/expressi\-vi\-ty of the CMD metamodel. The individual profiles also differ considerably in their structure -- next to flat profiles with just one level of components or elements with 5 to 20 fields (\textit{dublincore}, \textit{collection}, the set of \textit{Bamdes}-profiles) there are complex profiles with up to 10 levels (\textit{ExperimentProfile}, profiles for describing Web Services) and a few hundred elements. The biggest single profile is currently the remodelled maximum schema from the META-SHARE project \cite{Gavrilidou2012meta} for describing corpora, with 419 components and 1587 elements
 …
 \begin{table}
 \caption{The development of defined profiles and DCs over time}
 \label{table:dev}
+\label{table:dev_profiles}
   \begin{tabular}{ l | r | r | r | r }
     \hline
 …
 VIAF - Virtual International Authority File
 Other related relevant activities and initiatives
 …
 \section{LRT Metadata Catalogs/Collections}
+\label{sec:lrt-md-catalogs}
 \todoin{Overview of catalogs, name, since, \#providers, \#resources}
 …
 \section{Other Metadata Catalogs/Collections}
+\label{sec:other-md-catalogs}
 \subsection{(Digital) Libraries}

SMC4LRT/chapters/Definitions.tex

-                      r3140
+                      r3551
 \chapter{Definitions}
+Meanings of ``mapping'':
+\begin{itemize}
+\item transform
+\item match (schemas)
+\item  overview (browser)
+\end{itemize}
+\label{ch:def}
 \section {Namespaces}
 …
 \item[CMDI] \textit{Component Metadata Infrastructure} \ref{def:CMDI}
 \item[ERIC] \textit{European Research Infrastructure  Consortium} - a legal entity for long-term research infrastructure initiatives
+\item[DARIAH] \textit{Digital Research Infrastructure for Arts and Humanities}
 \item[DC] data category
 \item[DCR] data category registry \cite{ISO12620:2009}
+\item[DH] Digital Humanities, also eHumanities
+\item[LINDAT] Czech national infrastructure for LRT\furl{http://lindat.ufal.cuni.cz}
 \item[OLAC] \textit{Open Language Archive Community}\furl{http://www.language-archives.org/}\ref{def:OLAC}
 \item[PID] persistent identifier \todocite{PID}

SMC4LRT/chapters/Design_SMCinstance.tex

-                      r3240
+                      r3551
+\chapter{Design - Mapping on instance level}
+Linked Data - Express dataset in RDF
+\chapter{System design - mapping on instance level}
+\label{ch:design-instance}
 \begin{quotation}
 I do think that ISOcat, CLAVAS, RELcat, an actual language
 …
 semantic interoperability ... I hope ;-)
 \end{quotation}
+\todocite{Menzo}
+\cite{Menzo2013mail}
+Linked Data - Express dataset in RDF
 …
 \begin{example}
 <lr1> dct:title "Language Resource 1"
+<lr1> & dct:title & "Language Resource 1"
 \end{example}
 …
 \begin{example}
 <lr1> isocat:DC-2502 "19th century"
+<lr1> & isocat:DC-2502 & "19th century"
 \end{example}
 …
 \todocode{Load data: relcat, clavas, olac-and-dc-providers cmd, lt-world?}
+\section {Full semantic search - concept-based + ontology-driven ?}
+With the new enhanced dataset, as detailed in section \ref{ch:cmd2rdf}, the groundwork is laid for the full-blown semantic search proposed in the original goals, i.e. the possibility of ontology-driven or at least `semantic resources assisted' exploration of the dataset.
+The search is to be enhanced by employing ontological resources.
+In practice, this enhancement means that the user can access the data indirectly by browsing one or more ontologies with which the data is linked. These could be, for example, ontologies of organizations and projects.
 \section{Summary}

SMC4LRT/chapters/Design_SMCschema.tex

-                      r3240
+                      r3551
 \chapter{System Design - Mapping on schema level}
+\chapter{Concept-based mapping on schema level -- system design}
 \label{ch:design}
+In this chapter, we lay out the functioning of the semantic mapping on schema level, the task the Semantic Mapping Component was originally conceived for within the larger CMD Infrastructure (cf. \ref{def:CMDI}).
+Semantic interoperability was one of the main concerns addressed by the CMDI and is woven tightly into all modules of the infrastructure. The task of the SMC module is to collect information maintained in the registries of the infrastructure and process it to generate mappings, i.e. \emph{crosswalks} between fields in heterogeneous metadata formats. This information serves as the basis for the concept-based search.
+We start by giving a global view of the system, introducing its individual components and the dependencies among them.
+In the next section, the internal data model is presented and explained. In section \ref{sec:cx} the design of the actual main service for resolving crosswalks is described, divided into the interface specification and the actual implementation. In section \ref{def:concept_search} we elaborate on a search functionality that builds upon the aforementioned service in terms of an appropriate query language, a search engine to integrate the search in and the peculiarities of a user interface that could support these enhanced search possibilities. Finally, in section \ref{smc-browser} an advanced interactive user interface for exploring the CMD data domain is proposed.
 \section{System Architecture}
+The Semantic Mapping module is based on the DCR and CMD framework and is being developed as a separate service on the side of CLARIN  Metadata Service, its primary consuming service, but shall be equally usable by other applications.
+The Semantic Mapping module is based on the DCR and CMD framework (cf. section \ref{def:DCR})
+and is being developed as a separate service on the side of CLARIN  Metadata Service, its primary consuming service, but shall be equally usable by other applications.
+\begin{figure*}[!ht]
+\includegraphics[width=0.8\textwidth]{images/SMC_modules.png}
+\caption{The component view on the SMC - modules and their inter-dependencies}
+\label{fig:smc_modules}
+\end{figure*}
+\begin{description}
+\item[crosswalk service] the main service translating between indexes, detailed in \ref{sec:cx}
+\item[concept-based query expansion]
+\item[smc-xsl] set of xslt-stylesheets (governed by a build-file) for pre- and post-processing the data
+\item[SMC Browser] a web application to explore the CMD data domain consisting of the two modules: \xne{smc-stats} and \xne{smc-graph}
+\item[smc-stats] a module of the \xne{SMC Browser} providing human-readable statistical summaries of the CMD data domain
+\item[smc-graph] a module of the \xne{SMC Browser} providing an advanced interactive graph-based user interface for exploring the CMD data domain
+\end{description}
+For broader context see the reference architecture diagram in Figure \ref{fig:ref_arch}.
+\section{Data model - Terms}
+\label{datamodel-terms}
+\todocode{Terms.xsd}
 \begin{note}
-Do we need separate \\section{Data Model}?
 Describe the CMD-format?
 \end{note}
+\begin{figure*}[!ht]
+\includegraphics[width=0.8\textwidth]{images/SMC_modules.png}
+\caption{The process of transforming the CMD metadata records to an RDF representation}
+\label{fig:smc_modules}
+\end{figure*}
+For broader context see the reference architecture diagram in Figure \ref{fig:ref_arch}.
+\subsection{Use Cases}
+\begin{itemize}
+\item MD Search employing Semantic Mapping
+\item MD Search employing Fuzzy Search
+\end{itemize}
+\section{Crosswalks -- Mapping on schema level}
+merging the pieces of information provided by those,
+offering them semi-transparently to the user (or application) on the consumption side.
+a module of the Component Metadata Infrastructure performing semantic mapping on search indexes. This  builds the base for query expansion to facilitate semantic search and enhance recall when querying the Metadata Repository.
+\section{Crosswalk service}
+\label{sec:cx}
+The crosswalk service offers the functionality that was understood under the term \textit{Semantic Mapping} as conceived in the original plans of the Component Metadata Infrastructure. It allows translating between search indexes. In particular, it expresses data category based indexes as equivalent paths to fields in the CMD profiles. This way it builds the base for query expansion that enhances recall when searching in the heterogeneous data collection of the joint CLARIN metadata domain.
 …
 \subsection{Interface Specification}
+In this section, we describe the actual task of the proposed application -- \textbf{mapping indexes to indexes} -- in abstract terms. The returned mappings can be used by other applications to expand or translate the original user query, to match elements in other schemas.
+\footnote{Though tightly related, mapping of terms and query expansion are to be seen as two separate functions.}
+In this section, we describe the actual task of the proposed service -- \textbf{mapping indexes to indexes} -- in abstract terms. The returned mappings can be used by other applications to expand or translate the original user query, to match elements in other schemas.
 % \footnote{This primary usage of SMC for work with user-created query strings explains the need for human-readability of the indices.}
 …
 \newline
 \texttt{isocat.size $\mapsto$ } \newline
+\verb|   [teiHeader.extent, |\newline
 \verb|    TextCorpusProfile.Number]|
+\begin{example}
+isocat.size     & $\mapsto$ & [teiHeader.extent, TextCorpusProfile.Number]
+\end{example}
 \newline
 …
 \newline
+\texttt{imdi-corpus.Name   $\mapsto$ } \newline
+\verb|   (isocat.resourceName) |$\mapsto$  \newline
+\verb|   TextCorpusProfile.GeneralInfo.Name|
+\newline
+\begin{example}
+imdi-corpus.Name & $\mapsto$ \\
+(isocat.resourceName) & $\mapsto$ TextCorpusProfile.GeneralInfo.Name
+\end{example}
+\newline
 (2) \emph{relations between data categories} -- employing also information from the Relation Registry, related (equivalent) data categories are retrieved and subsequently both the input and the related data categories resolved to cmdIndexes:
 …
 \verb|     Person.Name, Person.FullName]|
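The two resolution modes described above can be illustrated with a minimal sketch. It assumes a ready-made map from data categories to CMD element paths; the entry \texttt{dc.title}, the path \texttt{OLAC-DcmiTerms.title}, the relation set and all function names are hypothetical and serve only the illustration. The actual service is built on XSL stylesheets, not on this code.

```python
# Hypothetical, hand-made data: data category -> CMD element paths (cmdIndexes).
# The first two entries mirror the examples in the text; "dc.title" is invented.
DATCAT_TO_PATHS = {
    "isocat.size": ["teiHeader.extent", "TextCorpusProfile.Number"],
    "isocat.resourceName": ["TextCorpusProfile.GeneralInfo.Name"],
    "dc.title": ["OLAC-DcmiTerms.title"],
}

# Hypothetical relation sets, standing in for data from the Relation Registry.
RELATION_SETS = [{"isocat.resourceName", "dc.title"}]


def related_datcats(datcat, relation_sets):
    """Return the data category plus all categories sharing a relation set with it.

    Only direct membership is considered; relation sets are not chained.
    """
    result = {datcat}
    for relation_set in relation_sets:
        if datcat in relation_set:
            result |= relation_set
    return result


def resolve(datcat, use_relations=False):
    """Resolve a data category index to the list of equivalent CMD element paths."""
    datcats = related_datcats(datcat, RELATION_SETS) if use_relations else {datcat}
    paths = []
    for current in sorted(datcats):  # sorted for a deterministic result order
        paths.extend(DATCAT_TO_PATHS.get(current, []))
    return paths


print(resolve("isocat.size"))
print(resolve("isocat.resourceName", use_relations=True))
```

A consuming application could use the returned paths to expand a user query into a disjunction over all listed fields. How the expansion is done is a separate concern from the mapping itself.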
+\subsection{Initialization}
+First there is an initialization phase, in which the application fetches the information from the source modules (cf. \ref{components}). All profiles and components from the Component Registry are read and all the URIs to data categories are extracted to construct an inverted map of data categories:
+\subsection{Implementation}
+At the core of the described module is a set of XSL stylesheets, governed by an Ant build file and a configuration file holding the information about individual source registries.
+\todoin{generate and reference XSLT-documentation}
+\subsubsection{Initialization}
+First, there is an initialization phase, in which the application fetches the information from the source modules (cf. \ref{def:CMD}) and transforms it into the internal Terms format (cf. \ref{datamodel-terms}). All profiles and components from the Component Registry are read and all the URIs to data categories are extracted to construct an inverted map of data categories:
 \newline
 …
 Finally relation sets defined in the Relation Registry are fetched and matched with the data categories in the map to create sets of semantically equivalent (or otherwise related) data categories.
+\todocode{example of inverted index}
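As a placeholder for the pending example, the construction of the inverted map can be sketched as follows. The sketch assumes the profiles and components have already been flattened into pairs of element path and data category; the variable and function names are hypothetical, and the real initialization is performed by the XSL stylesheets.

```python
from collections import defaultdict

# Hypothetical, hand-made input: (path to CMD element, data category it refers to).
# In the described system such pairs would be extracted from the Component Registry.
ELEMENT_LINKS = [
    ("teiHeader.extent", "isocat.size"),
    ("TextCorpusProfile.Number", "isocat.size"),
    ("TextCorpusProfile.GeneralInfo.Name", "isocat.resourceName"),
]


def build_inverted_map(links):
    """Invert element -> data category links into data category -> element paths."""
    inverted = defaultdict(set)
    for path, datcat in links:
        inverted[datcat].add(path)
    # Sort the paths so that the resulting map is deterministic.
    return {datcat: sorted(paths) for datcat, paths in inverted.items()}


inverted_map = build_inverted_map(ELEMENT_LINKS)
for datcat, paths in sorted(inverted_map.items()):
    print(datcat, "->", paths)
```

Keeping the map keyed by data category means a lookup at query time is a single dictionary access, at the cost of rebuilding the map whenever the registries change.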
+\subsubsection{Operation}
+\subsubsection{Computing summaries}
 \subsection{Extensions}
 …
 \section{Concept-based search}
+Main purpose for the undertaking described in previous two chapters (mapping of concepts and entities) is to enhance the search capabilities of the MDService serving the Metadata/Resources-data. Namely to enhance it by employing ontological resources.
+Mainly this enhancement shall mean, that the user can access the data indirectly by browsing one or multiple  ontologies,
+with which the data will then be linked. These could be for example ontologies of Organizations and Projects.
+In this section we want to explore, how this shall be accomplished, ie how to bring the enhanced capabilities to the user.
+\label{def:concept_search}
+To recall, the main goal of this work is to enhance the search capabilities of the search engines serving the metadata.
+In this section we want to explore, how this shall be accomplished, i.e. how to bring the enhanced capabilities to the user.
+The emphasis lies on the query language and the corresponding query input interface.
 A crucial aspect is the question of how to deal with the even greater amount of information in a user-friendly way, i.e. how to avoid overwhelming, intimidating or frustrating the user.
+offering it (the information) semi-transparently to the user (or application) on the consumption side.
 Semi-transparently means that the semantic mapping shall primarily integrate seamlessly into the interaction with the service, but it shall ``explain'' itself on demand, i.e. offer enough information for the user to understand its role and to be able to manipulate it easily.
+?
 …
 \subsection{SMC as module for Metadata Repository}
+(MD)search frameworks:
+\begin{description}
+\item[Zebra/Z39.50] JZKit
+\item[Lucene/Solr]
+\item[eXist] - xml DB
+\end{description}
+As a concrete proof of concept the functionality of SMC has been integrated into the Metadata Repository, another module of the CMDI providing all the metadata records harvested within the CLARIN joint metadata domain.
+The Metadata Repository is implemented in XQuery running within the eXist XML database as a web application.
+\begin{figure*}[!ht]
+\includegraphics[width=0.8\textwidth]{images/modules_mdrepo-smc.png}
+\caption{The component view on the SMC - modules and their inter-dependencies}
+\label{fig:modules-mdrepo}
+\end{figure*}
 \subsection{User Interface?}
 \subsubsection*{Query Input}
+\begin{figure*}[!ht]
+\includegraphics[width=0.8\textwidth]{images/query_input_autocomplete_term.png}
+\caption{A proposed query input interface offering concepts as search indexes}
+\label{fig:query_input}
+\end{figure*}
+Using data categories from ISOcat as search indexes brings about -- next to solid semantic grounding -- the advantage of multilingual labels and descriptions/definitions.
 \subsubsection*{Columns}
 …
 \todoin{check Ontology Mapping and Alignement / saiks/Ontology4 4auf1.pdf}
+\section{SMC-Browser}
+\label{smc-browser}
+Explore the Component Metadata Framework
+As the data set keeps growing both in numbers and in complexity, the call from the CMD community to provide advanced/enhanced ways for its exploration gets stronger. \textit{SMC browser} is one answer to this need. It is a web application that lets the metadata modeller explore the information about profiles, components, elements and the usage of DCs as an interactive graph. This allows, for example, examining the reuse of components or DCs in different profiles. The graph is accompanied by statistical information about individual `nodes', e.g., counting how many elements a profile contains, or in how many profiles a DC is used.
+In CMD, metadata schemas are defined by profiles that are constructed out of reusable components - collections of metadata fields. The components can contain other components, and they can be reused in multiple profiles. Furthermore, every CMD element (metadata field) refers via a PID to a data category to indicate unambiguously how the content of the field in a metadata description should be interpreted \cite{Broeder+2010}.
+Thus, every profile can be expressed as a tree, with the profile component as the root node, the used components as intermediate nodes and elements or data categories as leaf nodes, the parent-child relationship being defined by inclusion (\code{componentA -includes-> componentB}) or referencing (\code{elementA -refersTo-> datcat1}). The reuse of components in multiple profiles and especially also the referencing of the same data categories in multiple CMD elements leads to a blending of the individual profile trees into a graph (directed acyclic, but not necessarily connected).
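The blending of profile trees into one graph, and the kind of statistic mentioned above (in how many profiles a DC is used), can be illustrated with a minimal sketch. All node names except those taken from the \code{} examples are hypothetical, and the code is not the implementation of the SMC browser.

```python
from collections import defaultdict

# Hypothetical edges (parent, child); both "includes" and "refersTo" links
# are treated uniformly as directed edges of the graph.
EDGES = [
    ("profileA", "componentA"),
    ("componentA", "componentB"),
    ("componentB", "elementA"),
    ("elementA", "datcat1"),
    ("profileB", "componentB"),  # componentB is reused in a second profile
    ("profileB", "elementC"),
    ("elementC", "datcat2"),
]
PROFILES = ["profileA", "profileB"]


def build_graph(edges):
    """Build an adjacency list (parent -> list of children) from the edge list."""
    graph = defaultdict(list)
    for parent, child in edges:
        graph[parent].append(child)
    return graph


def reachable(graph, root):
    """Return all nodes reachable from root, i.e. the flattened profile tree."""
    seen = set()
    stack = [root]
    while stack:
        node = stack.pop()
        for child in graph.get(node, []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def profiles_using(graph, profiles, node):
    """List the profiles whose tree contains the given component, element or DC."""
    return sorted(p for p in profiles if node in reachable(graph, p))


graph = build_graph(EDGES)
print(profiles_using(graph, PROFILES, "datcat1"))
print(profiles_using(graph, PROFILES, "datcat2"))
```

Because shared components and data categories appear only once as nodes, reuse becomes visible as a node with more than one incoming path, which is what the graph view is meant to expose.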
 \section{Summary}

SMC4LRT/chapters/Infrastructure.tex

-                      r3234
+                      r3551
 \section{CLARIN / CMDI}
 \label{def:CLARIN}
+\label{def:CMDI}
 CLARIN - Common Language Resource and Technology Infrastructure - is constituted by over 180 members from around 38 countries. The mission of this project is to
 …
 As stated before, the SMC is part of CMDI and depends on multiple modules of the infrastructure. Before we describe the interaction itself in chapter \ref{method}, we introduce in short these modules and the data they provide:
+As stated before, the SMC is part of CMDI and depends on multiple modules of the infrastructure. Before we describe the interaction itself in chapter \ref{ch:design}, we briefly introduce these modules and the data they provide:
 \begin{itemize}
 …
 ?MDService
 \begin{figure*}[!ht]
 \includegraphics[width=1\textwidth]{images/SMC_CR-DCR-RR_Linkage_v2}
 \caption{The diagram depicts the links between pieces of data in the individual registries that serve as basis for semantic mapping}
+\end{figure*}
+\begin{figure*}[!ht]
+\includegraphics[width=0.8\textwidth]{images/CMDI_components_old.png}
+\caption{The diagram (from early CLARIN/CMDI presentations) shows individual modules of the CMDI and their interrelations}
+\end{figure*}
 \subsection{CMDI - DCR/CR/RR}
 \label{def:cmdi}
 \label{def:dcr}
+\label{def:CMD}
+\label{def:DCR}
 The \emph{Data Category Registry} (DCR) is a central registry that enables the community to collectively define and maintain a set of relevant linguistic data categories. The resulting commonly agreed controlled vocabulary is the cornerstone for grounding the semantic interpretation within the CMD framework.
 …
 % \emph{Component Registry} implements the Component Data Model and allows to define, maintain and publish CMD-components and -profiles.
+\begin{figure*}[!ht]
+\includegraphics[width=1\textwidth]{images/SMC_CR-DCR-RR_Linkage_v2}
+\caption{The diagram depicts the links between pieces of data in the individual registries that serve as basis for semantic mapping}
+\end{figure*}
 The framework as described so far provides a sound mechanism for binding the semantic interpretation of the metadata descriptions.
 However there needs to be an additional means to capture information about relations between data categories.
 …
 from the traditional methods of schema matching that try to establish pairwise alignments between schemas only after they were created and published. % -- be it algorithm-based or by means of explicit manually defined crosswalks \cite{Shvaiko2005}.
 Consequently, the infrastructure also foresees a dedicated module, \emph{Semantic Mapping}, that exploits this mechanism to deliver correspondences between different metadata schemas. The details of its functioning and its interaction with the aforementioned modules is described in the following chapter \ref{method}.
+Consequently, the infrastructure also foresees a dedicated module, \emph{Semantic Mapping}, that exploits this mechanism to deliver correspondences between different metadata schemas. The details of its functioning and its interaction with the aforementioned modules are described in the following chapter \ref{ch:design}.
 \subsection{Vocabulary Service / Reference Data Registry}
 …
 \subsubsection{Vocabulary Service - CLAVAS}
+As described in the previous section (\ref{dcr}), a solid pillar for defining and maintaining data categories is the ISOcat data category registry. However, while ISOcat has been in productive use for some time, it is -- by design -- not usable for all kinds of reference data. In general, it is well suited for defining concepts/data categories (with closed or open concept domains), but its complex data model and standardization workflow do not lend themselves well to maintaining ``semi-closed'' concept domains, controlled vocabularies, like lists of entities (e.g. organizations or authors). In such cases, the concept domain is not closed (new entities need to be added), but it is also not open (not any string is a valid entity). Besides, the domain may be very large (millions of entities) and has to be presumed changing (especially new entities being added).
+\label{def:CLAVAS}
+As described in the previous section (\ref{def:DCR}), a solid pillar for defining and maintaining data categories is the ISOcat data category registry. However, while ISOcat has been in productive use for some time, it is -- by design -- not usable for all kinds of reference data. In general, it is well suited for defining concepts/data categories (with closed or open concept domains), but its complex data model and standardization workflow do not lend themselves well to maintaining ``semi-closed'' concept domains, controlled vocabularies, like lists of entities (e.g. organizations or authors). In such cases, the concept domain is not closed (new entities need to be added), but it is also not open (not any string is a valid entity). Besides, the domain may be very large (millions of entities) and has to be presumed changing (especially new entities being added).
 This shortcoming leads to a need for an additional registry/repository service for this kind of data (controlled vocabularies). Within the CLARIN project mainly the abovementioned taskforce \emph{CLAVAS} is concerned with this challenge.
 …
 Currently, the Meertens Institute\furl{http://meertens.knaw.nl/} of the Dutch Royal Academy of Sciences (KNAW), as well as Netherlands Institute for Sound and Vision\furl{http://www.beeldengeluid.nl/} are running an instance of OpenSKOS.
 As the work on this vocabulary repository started in the context of a cultural heritage program, originally it served vocabularies not directly relevant for the LRT-community, such as \emph{GTAA - Gemeenschappelijke Thesaurus Audiovisuele Archieven} or \emph{AAT - Art \& Architecture Thesaurus}\furl{http://openskos.org/api/collections}. As part of the process of adaptation to the needs of CLARIN and the LRT-community, data categories from \xne{ISOcat} have been converted into SKOS-format and ingested into the system.
 \xne{CLARIN Centre Vienna} is also running a prototypical instance of the OpenSKOS system with ISOcat data.
+\xne{Austrian Centre for Digital Humanities} is also running a prototypical instance of the OpenSKOS system with ISOcat data.
 A plan has been developed/adopted to support further vocabularies relevant for the community.
 …
 See \ref{refdata} for a more complete list of required reference data together with candidate existing vocabularies
 and \ref{interaction-dcr-skos} for discussion on mapping the information about data categories from ISOcat to \xne{SKOS}.
+and \ref{interaction-dcr-skos} for discussion on mapping the information about data categories from \xne{ISOcat} to \xne{SKOS}.
 \subsection{Interaction between DCR, VAS and client applications}
 …
 With this proposal, ISOcat constrained DCs can refer to a CLAVAS vocabulary as a way to constrain (we stretch this a bit if a vocabulary is 'open', e.g., like organization names where it provides the preferred spelling of known organizations but still has to be possible to add new organization names, not in the vocabulary).
  In ISOcat such constraints have the same status as, for example, the data type, which is that ISOcat just provides hints it has no way to enforce this. Look at CMDI where the CMDI elements refer to a ISOcat DC via a concept link but they may have a completely different data type. In an ideal world the Component Editor would take over the data type and the CLAVAS vocabulary from the linked DC specification. This way the reference to the CLAVAS vocabulary ends up in the CMD component/profile specification and the derived XSD, and can be used by tools that support CLAVAS, e.g., Arbil (well its in the planning).
+ In ISOcat, such constraints have the same status as, for example, the data type: ISOcat just provides hints and has no way to enforce them. Consider CMDI, where the CMDI elements refer to an ISOcat DC via a concept link but may have a completely different data type. In an ideal world the Component Editor would take over the data type and the CLAVAS vocabulary from the linked DC specification. This way the reference to the CLAVAS vocabulary ends up in the CMD component/profile specification and the derived XSD, and can be used by tools that support CLAVAS, e.g., Arbil (well, it's in the planning).
 \begin{note}
 …
 It can use the reference to the DC to fetch explanations (semantic information)  (and translations) from ISOcat, but it is bound to the value range as restricted by the schema.
-\todoask{ Could the application use the the vocabulary indication in DC-spec as default or fallback?}
 \subsection{CMDI - Exploitation side}
 Metadata complying with the CMD-framework is being created by a growing number of institutions by various means: automatic transformation from legacy data, or authoring of new metadata records with the help of one of the Metadata-Editors (TODO: cite: Arbil, NALIDA, ). The CMD-Infrastructure requires the content providers to publish their metadata via the OAI-PMH protocol and announce the OAI-PMH endpoints. These are being harvested daily by a dedicated CLARIN harvester\footnote{\url{http://catalog.clarin.eu/oai-harvester/}}. The harvested data is validated against the schemas \todoin{What about Normalization?} and made available in packaged datasets. These are being fetched by the exploitation side components that index the metadata records and make them available for searching and browsing.
 …
 and \emph{Metadata Service} that provides search access to this body of data. As such, Metadata Service is the primary application to use Semantic Mapping, to optionally expand user queries before issuing a search in the Metadata Repository. \cite{Durco2011}
 \section{Content Repositories}
 Metadata is only one aspect of the availability of resources. It is the first step to announce and describe the resources. However it is of little value, if the resources themselves are not equally well accessible. Thus another pillar of the CLARIN infrastructure are Content Repositories - centres to ensure availability of resources.
 …
 \section{Distributed system - federated search}
+Metadata -> harvesting via OAI-PMH
+but Content search has to be really distributed.
+?
+Metadata -> harvesting via OAI-PMH, but Content search has to be really distributed.
 \begin{description}
 \item[Z39.50/SRU/SRW/CQL] LoC
 …
 \end{description}
 \section{Summary}

SMC4LRT/chapters/Introduction.tex

-                      r3234
+                      r3551
 \section{Motivation / problem statement}
 While in the Digital Libraries community a consolidation generally already happened and global federated networks of digital library repositories are set up, in the field of Language Resource and Technology the landscape is still scattered, although meanwhile looking back at a decade of standardization and integration efforts. One main reason seems to be the complexity and diversity of the metadata associated with the resources, stemming from the wide range of resource types combined with project-specific needs. (chapter \ref{ch:data} analyses the disparity in the data domain)
+While in the Digital Libraries community a consolidation already took place and global federated networks of digital library repositories are set up, in the field of Language Resource and Technology the landscape is still scattered, although meanwhile looking back at a decade of standardization and integration efforts. One main reason seems to be the complexity and diversity of the metadata associated with the resources, stemming from the wide range of resource types combined with project-specific needs. (Chapter \ref{ch:data} analyses the disparity in the data domain.)
+This situation has been identified by the community and multiple standardization initiatives had been conducted/undertaken. The process has gained a new momentum thanks to large research infrastructure programmes introduced by the European Commission, aimed at fostering the development of common large-scale international infrastructures. One key player in this development is the project CLARIN (see section \ref{def:CLARIN}). The main objective of this initiative is to make language resources and technologies more easily available to scholars, by providing a common harmonized architecture for accessing and working with LRT. One core pillar of this architecture is the \emph{Component Metadata Infrastructure} (cf. \ref{def:cmdi})
+-- a distributed system consisting of multiple interconnected applications aimed at creating and providing metadata for lLRT in a coherent harmonized way.
+This situation has been identified by the community and numerous standardization initiatives have been undertaken. The process has gained new momentum thanks to large framework programmes introduced by the European Commission aimed at fostering the development of common large-scale international research infrastructures. One key player in this development is the project CLARIN (see section \ref{def:CLARIN}). The main objective of this initiative is to make language resources and technologies more easily available to scholars by providing a common harmonized architecture for accessing and working with Language Resources and Technology (LRT). One core pillar of this architecture is the \emph{Component Metadata Infrastructure} (cf. \ref{def:CMDI}) -- a distributed system consisting of multiple interconnected modules aimed at creating and providing metadata for LRT in a coherent harmonized way.
 This work discusses a module within the Component Metadata Infrastructure -- the \emph{Semantic Mapping Component} -- dedicated to overcome or at least ease the semantic interoperability problem stemming from the heterogenity of the resource descriptions, without the reductionist approach of trying to impose one common description schema for all resources.
+This work discusses one module within the Component Metadata Infrastructure -- the \emph{Semantic Mapping Component} -- dedicated to overcoming or at least easing the semantic interoperability problem stemming from the heterogeneity of the resource descriptions, without the reductionist approach of imposing one common description schema for all resources.
 \section{Main Goal}
 The primary goal of this work is to \emph{\textbf{enhance search functionality} over a large heterogeneous collection of resource descriptions} in the field of Language Resources and Technology (LRT), henceforth referred to as \emph{semantic search} , distincting it from the necessary underlying processing, referred to as \emph{semantic mapping}.
+The primary goal of this work is to \emph{\textbf{enhance search functionality} over a large heterogeneous collection of resource descriptions} in the field of LRT, henceforth referred to as \xne{semantic search}, distinguishing it from the necessary underlying preprocessing, referred to as \xne{semantic mapping}.
 The -- notoriously polysemic -- term ``mapping'' can have three different meanings within this work,
 …
 \end{description}
 The work can further be divided along the schema / instance duality/dimension. Figure \ref{fig:master_outline} sketches the goals / conceptual space of this thesis.
+The work can further be divided along the schema -- instance duality. Figure \ref{fig:master_outline} spans the conceptual space of this work and depicts the relations between individual subgoals.
+%\includegraphics[width=\unitlength]{images/master_outline.eps}
+\begin{figure*}[!ht]
+\begin{center}
+%\includegraphics[width=1\textwidth]{images/SMC-export_language_custom_v2c.pdf}
+\includegraphics{images/master_outline.png}
+\end{center}
+\caption{The conceptual space of this work}
 \label{fig:master_outline}
+\input{images/master_outline.eps_tex}
+\end{figure*}
+%\input{images/master_outline.eps_tex}
 \subsubsection*{Crosswalks}
 Goal is not primarily to produce the crosswalks but rather to develop the service serving them.
+\subsubsection*{Crosswalk service}
+Semantic interoperability has been one of the main concerns addressed by the CMDI and appropriate provisions were woven into the underlying meta-model as well as all the modules of the infrastructure. The task of the crosswalk service -- the primary part of the SMC module -- is to collect the relevant information maintained in the registries of the infrastructure and process it to generate mappings, i.e. \emph{crosswalks} between fields in heterogeneous metadata schemas that can serve as a basis for concept-based search.
+???
+While this may seem a rather trivial task, it is not if we consider the heterogeneity and complexity of the dataset,
+further complicated by the fact that this shall be a community-driven process, without a central authority defining the relations,
+and that there may even be a need for different relation sets for different tasks. In fact, a number of modules of the discussed infrastructure are dedicated to overcoming the semantic interoperability problem.
+Thus, the goal is not primarily to produce the crosswalks but rather to develop the service serving existing ones.
 \subsubsection*{Concept-based query expansion}
 Once the crosswalks are available, they can be used to expand/translate user queries, to match related fields across heterogeneous metadata formats, resulting in higher recall.
+Once the crosswalks are available, they can be used to rewrite user queries (or to generate appropriate search indexes), so that they match related fields across heterogeneous metadata schemas resulting in higher recall when searching.
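The query rewriting described above can be illustrated with a minimal Python sketch. It is not the SMC implementation; the schema names, field names and concept identifiers are hypothetical placeholders, and each field is assumed to be linked to exactly one concept.

```python
from collections import defaultdict

# Hypothetical crosswalk data: (schema, field) -> concept identifier.
# All names below are illustrative assumptions, not actual registry content.
FIELD_TO_CONCEPT = {
    ("dublincore", "title"): "concept:title",
    ("profileA", "resourceName"): "concept:title",
    ("profileB", "resourceTitle"): "concept:title",
    ("profileB", "projectTitle"): "concept:projectTitle",
}


def build_concept_clusters(field_to_concept):
    """Group all fields that are linked to the same concept."""
    clusters = defaultdict(set)
    for field, concept in field_to_concept.items():
        clusters[concept].add(field)
    return clusters


def expand_query(schema, field, term, field_to_concept):
    """Rewrite a single-field query into queries over the whole concept cluster."""
    concept = field_to_concept.get((schema, field))
    if concept is None:
        # No crosswalk known for this field: keep the query unchanged.
        return [(schema, field, term)]
    clusters = build_concept_clusters(field_to_concept)
    return sorted((s, f, term) for s, f in clusters[concept])
```

Calling `expand_query("dublincore", "title", "grammar", FIELD_TO_CONCEPT)` yields one query per field in the cluster of the title concept, while `projectTitle`, which shares a substring but is linked to a different concept, is left out.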
 \paragraph{Example}
 Confronted with a user query searching in the notorious \concept{dublincore:title} the query has to be expanded to
 all the semantically near fields (concept cluster), that are however labelled (or even structured) differently in other formats like
+Confronted with a user query searching in the notorious \concept{dublincore:title} the query has to be \emph{expanded} to
+all the semantically near fields (\emph{concept cluster}), that are however labelled (or even structured) differently in other schemas like:
 \begin{quote}
 …
 \end{quote}
+but probably not to other fields, using same (sub)strings for the field labels
+but with different semantics, like:
+while other fields, labeled with the same (sub)strings but with different semantics shouldn't be considered:
 \begin{quote}
 …
 \subsubsection*{Semantic interpretation}
 The problem of different labels for semantically similar or even identical things is even more so virulent on the level of individual values in the instance data. A number of metadata fields (like \concept{organization} or \concept{resource type})  have a constrained value domain that yet cannot be explicitly/exhaustively enumerated. This leads to inconsistent use of labels for referring to entities. (As the instance data shows, some organizations are referred to by more than 20 different labels.) Thus, one goal of this work is to map (string) values in selected fields to entities defined in corresponding vocabularies.
+The problem of different labels for semantically similar or even identical entities is even more virulent on the level of individual values in the fields of the instance data. A number of metadata fields (like \concept{organization} or \concept{resource type}) have a constrained value domain that nevertheless cannot be exhaustively enumerated. This leads to a chronically inconsistent use of labels for referring to entities. (As the instance data shows, some organizations are referred to by more than 20 different labels.) Thus, one goal of this work is to propose a mechanism to map (string) values in selected fields to entities defined in corresponding vocabularies.
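A minimal Python sketch of such a value-to-entity lookup follows. The vocabulary, the entity identifier and the normalization rule are assumptions made for illustration only; the mechanism proposed in this work may differ.

```python
import re

# Hypothetical vocabulary: entity identifier -> known label variants.
VOCABULARY = {
    "org:example-institute": ["Example Institute", "Example Inst.", "EI"],
}


def normalize(label):
    """Lowercase, replace punctuation by spaces and collapse whitespace."""
    cleaned = re.sub(r"[^\w\s]", " ", label.lower())
    return " ".join(cleaned.split())


def build_index(vocabulary):
    """Map every normalized label variant to its entity identifier."""
    index = {}
    for entity, labels in vocabulary.items():
        for label in labels:
            index[normalize(label)] = entity
    return index


def map_value(value, index):
    """Return the entity for a field value, or None if no label matches."""
    return index.get(normalize(value))
```

Exact matching on normalized labels only covers variants already recorded in the vocabulary; unmatched values return `None` and would have to be handled separately, for example by manual curation.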
 \subsubsection*{Ontology-driven search / data exploration}
+\subsubsection*{Ontology-driven data exploration}
 By applying semantic web technologies, the user will be given new means of \emph{exploring the dataset} through semantic resources (ontology-driven search/browsing/exploration).
+Based on the results of the previous parts of the work -- crosswalks and semantic interpretation -- the discussed dataset can be expressed as one big ontology. Consequently, semantic web technologies can be applied giving the user new means of \emph{exploring the dataset} through semantic resources.
 \paragraph{Example}
 Ontology-driven search: Starting from a list of topics the user can browse an ontology to find institutions concerned with those topics and retrieve a union of resources for the resulting cluster. Thus in general the user is enabled to work with the data based on information that is not present in the original dataset, but rather in external linked-in semantic resources.
+Ontology-driven search -- Starting from a list of topics the user can browse an ontology to find institutions concerned with those topics and retrieve a union of resources for the resulting cluster. Thus in general the user is enabled to work with the data based on information that is not present in the original dataset, but rather in external interlinked semantic resources.
 \subsubsection*{Visualization}
 …
 \section{Method}
 The primary concern of this work is the integrative effort, i.e. bringing together existing pieces (resources, components and methods). We start with examining the existing data and the description of the evolving infrastructure in which this work is embedded.
+We start with examining the existing data and with the description of the existing infrastructure in which this work is embedded.
 Building on this groundwork, in accordance with the first subgoal, we lay out the design of the service for handling crosswalks and concept-based query expansion. We describe the workflow, the central methods and the role of the module relative to other parts of the infrastructure.
 …
 \section{Expected Results}
 The main result of this work will be the \emph{specification} of the two modules \texttt{Semantic Search} and the underlying \texttt{Semantic Mapping}.
+The main result of this work will be the \emph{specification} of the two modules \xne{concept-based search} and the underlying \texttt{crosswalk service}.
 This theoretical part will be accompanied by a proof-of-concept \emph{implementation} of the components
 and the results and findings of the \emph{evaluation}.
 …
 \begin{description}
 \item [Specification Semantic Mapping] design of the mapping mechanism
 \item [Specification Semantic Search] design of the query expansion and integration with search engines
 \item [Prototype] proof of concept implementation
+\item [Crosswalk service] specification and basic proof-of-concept implementation of the module
+\item [Concept-based search] design of the query expansion and integration with search engines
+\item [Visualization] design of an application for interactive exploration of the concerned dataset
 \item [Evaluation] evaluation results of querying the dataset comparing traditional search and semantic search
 \item [LinkedData] translation of the source dataset to RDF-based format with links into existing datasets/ontologies/knowledgebases
+\item [LinkedData] translation of the source dataset to an RDF-based format with links into existing datasets, ontologies and knowledge bases
 \end{description}
 …
 In chapter \ref{ch:data} we analyze the situation in the data domain of LRT metadata and in chapter \ref{ch:infra} we discuss the individual software components /modules /services of the infrastructure underlying this work.
 The main part of the work is found in chapters \ref{ch:design}, \ref{ch:implementation} and \ref{ch:cmd2rdf} laying out the design of the software module, the proposal  how to modell the data in RDF and the possibilities of visualization respectively.
+The main part of the work is found in chapters \ref{ch:design} and \ref{ch:design-instance}, laying out the design of the software module and the proposal for how to model the data in RDF, respectively.
 The evaluation and the results are discussed in chapter \ref{ch:results}. Finally, in chapter \ref{ch:conclusions} we summarize the findings of the work and lay out where it could develop in the future.

SMC4LRT/chapters/Literature.tex

-                      r3140
+                      r3551
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 This work is guided by \todoin{two (or three? + Infrastructure} main dimensions: the data - in broad, Language Resource and Technology  and the method - Semantic Web technologies. This division is reflected in the following chapter:
+This work is guided by two main dimensions: the \textbf{data} -- broadly speaking, Language Resources and Technology -- and the \textbf{method} -- Schema matching and Semantic Web technologies. This division is reflected in the following chapter:
 \section{(Infrastructure for) Language Resources and Technology}
 …
 Chapter \ref{ch:data} examines the field of LRT in more detail.
 \subsection{Metadata}
 A comprehensive architecture for harmonized handling of metadata -- the Component Metadata Infrastructure (CMDI)\footnote{\url{http://www.clarin.eu/cmdi}} \cite{Broeder+2011} -- is being implemented within the CLARIN project\footnote{\url{http://clarin.eu}}. This service-oriented architecture consisting of a number of interacting software modules allows metadata creation and provision based on a flexible meta model, the \emph{Component Metadata Framework}, that facilitates creation of customized metadata schemas -- acknowledging that no one metadata schema can cover the large variety of language resources and usage scenarios -- however at the same time equipped with well-defined methods to ground their semantic interpretation in a community-wide controlled vocabulary -- the data category registry \cite{Kemps-Snijders2009,Broeder2010}.
+A comprehensive architecture for harmonized handling of metadata -- the Component Metadata Infrastructure (CMDI)\footnote{\url{http://www.clarin.eu/cmdi}} \cite{Broeder2011} -- is being implemented within the CLARIN project\footnote{\url{http://clarin.eu}}. This service-oriented architecture consisting of a number of interacting software modules allows metadata creation and provision based on a flexible meta model, the \emph{Component Metadata Framework}, that facilitates creation of customized metadata schemas -- acknowledging that no one metadata schema can cover the large variety of language resources and usage scenarios -- however at the same time equipped with well-defined methods to ground their semantic interpretation in a community-wide controlled vocabulary -- the data category registry \cite{Kemps-Snijders+2009,Broeder2010}.
 Individual components of this infrastructure will be described in more detail in the section \ref{ch:components}.
+Individual components of this infrastructure will be described in more detail in the section \ref{ch:infra}.
+A number of solutions have evolved in recent years.
+The first to undertake standardization efforts for the exchange of catalog information were digital libraries.
+Z39.50 as base protocol, Worldcat, mapping/configuration files.
+These catalogs are further described in the section \ref{sec:other-md-catalogs}
+In recent years, the evolving research infrastructures all identified a common/harmonized search as a crucial component of the system and came up with a number of solutions, however often limited to collecting metadata, reducing it to dublincore
+and offering a lucene/solr based faceted search.
+These catalogs are further described in the section \ref{sec:lrt-md-catalogs}.
+Riley and Becker \cite{Riley2010seeing} put the overwhelming amount of existing metadata standards into a systematic comprehensive overview analyzing the use of standards from four aspects: community, domain, function, and purpose.
 \subsection{Content Repositories}
 …
 \todoin{check if relevant: http://schema.org/}
+\subsection{Existing Crosswalk services}
+\url{http://www.oclc.org/developer/services/metadata-crosswalk-service}
+\url{http://semanticweb.org/wiki/VoID}
+\url{http://www.dnb.de/rdf}
 \subsection{Ontology Visualization}

SMC4LRT/chapters/Results.tex

-                      r3240
+                      r3551
+\chapter{Evaluation}
+\label{ch:Evaluation}
+\section{Sample Queries}
+candidate Categories:
+ResourceType, Format
+Genre, Topic
+Project, Institution, Person, Publisher
+\section{Exploring Data Categories}
+In the ISOcat DCR 791 DCs are defined in the Metadata thematic profile, out of which 222 were created by the \textit{Athens Core} group. \todoin{need to check, how many of these athens-core data categories are being employed} In the following we describe two showcases -- \textit{Language} and \textit{name} -- in more detail.
+\chapter{Results and Findings}
+\label{ch:results}
+In this chapter, the results of the work are presented, divided into two main areas:
+software and data.
+In two sections, we explore the CMD data domain -- the usage of the data categories on the one hand and the integration of existing formats on the other hand. While these two aspects were not directly part of this work, they a) were made possible by the output of this work (SMC-Browser, statistical analysis), b) yield a valuable test case for the usefulness of the work and c) are an indispensable prerequisite for the necessary curation work being carried out by the CMDI community.
+\section{Current status of the infrastructure}
+Before we get to the results of this work,  we briefly summarize the current state of affairs within the CLARIN infrastructure at large to help contextualize the actual results.
+\subsection{CMDI - services}
+The main services of the infrastructure have been in stable production for the last two years.
+The Relation Registry is operational as an early prototype.
+Three instances of OpenSKOS are running, one of them being hosted by ACDH.
+\subsection{CMDI - data}
+More than 130 profiles are defined. (See \ref{table:dev_profiles} for more details about profiles.)
+The official CLARIN harvester\furl{http://catalog.clarin.eu/oai-harvester/} collects data from 69 providers on a daily basis.
+The collection amounts to over 550,000 records in 64 profiles.
+\subsection{ACDH - the home of SMC}
+Within CLARIN-AT a new centre has been brought to life, the Austrian Centre for Digital Humanities, that provides depositing and publishing services to the DH community, as well as infrastructural services that are part of the CLARIN Metadata Infrastructure.
+Figure \ref{fig:acdh_context} sketches the broader context of \xne{acdh} and its different roles.
+\section {Software}
+The specification of the system can be found in the chapters \ref{ch:design} and \ref{ch:design-instance}.
+There is a prototypical implementation for three parts of the system:
+\begin{itemize}
+\item the crosswalk service as a REST web service
+\item a module to integrate with a search engine
+\item a web application that allows advanced interaction with the data set
+\end{itemize}
+The SMC module is being maintained in the CMDI code repository\footnote {\url{http://svn.clarin.eu/SMC}}.
+Furthermore, the CMD data has been expressed in RDF, as a first important step towards incorporating the dataset in the \emph{Web of Data}.
+\subsection{SMC - crosswalks service}
+The core function of the SMC is implemented as a set of XSL-stylesheets, with auxiliary functionality (like caching or a wrapping web service) being provided by a wrapping application implemented in Java.
+\subsection{SMC - as a module within Metadata Repository}
+There is also an XQuery implementation that is integrated as a module of SADE/cr-xq, the eXist-based web application framework for publishing resources on which the Metadata Repository is running.
+\subsection{SMC Browser -- Advanced Interactive User Interface}
+SMC Browser\furl{http://clarin.aac.ac.at/smc-browser} is a web application to explore the complex dataset of the Component Metadata Framework, by visualizing its structure as an interactive graph.
+It is implemented on top of the js-library d3; the code is checked in to clarin-svn.
+The graph is constructed from all profiles defined in the Component Registry and related datasets. To resolve the (multilingual) names and descriptions of the data categories referenced in the CMD elements, the definitions of the referenced data categories are fetched from DublinCore and ISOcat.
+E.g. starting from 124 profiles, this amounts to a graph with ??? nodes and ??? edges.
+\begin{figure*}[!ht]
+\includegraphics[width=1\textwidth]{images/screen_SMC-Browser_2013-01-23}
+\caption{Screenshot of the SMC browser}
+\end{figure*}
+SMC Browser also features detailed numerical statistics about the dataset as a whole and about individual items (profiles, components, data categories), a set of example results and user documentation.
+In the following section, we make extensive use of the output of this tool, to visualize individual aspects of the discussed data set.
+\subsection{SMC LOD}
+\section{Exploring the usage of data categories}
+At the core of the whole SMC (and CMDI) are the data categories as basic conceptual building blocks or anchors.
+We want to take a closer look at the usage of the data categories in the CMD infrastructure, exemplified by a few very common concepts -- \concept{language}, \concept{name}, \concept{resource type}, \concept{???}.
+In the ISOcat DCR 791 DCs are defined in the Metadata thematic profile, out of which 222 were created by the \textit{Athens Core} group. \todoin{need to check, how many of these athens-core data categories are being employed}
 \subsection{Language}
 …
 \subsection{Name}
+\subsection{Name / Title}
 There are as many as 72 CMD elements with the label \texttt{Name}, referring to 12 different DCs.
 Again the main DC \textit{resourceName} (\texttt{DC-2544}) being used in 74 profiles together with the semantically near \textit{resourceTitle} (\texttt{DC-2545}) used in 69 profiles offer a good coverage over available data.
 …
 \subsection{Subject, Genre, Topic}
+\section{Mapping existing Formats}
+\section{Exploring the integration of existing formats}
+CLARIN set out with the aspiration /yearning to overcome the babylon of metadata formats
+and its flexible CMD metamodel is specifically designed to integrate existing formats.
+In this section, we analyze the state of integration efforts for three major formats: \xne{dublincore/OLAC}, \xne{teiHeader} and \xne{META-SHARE resourceInfo}.
 \subsection{dublincore / OLAC}
 Very widely used format
+Very widely used (because) simple format
 \ref{info:olac-records}
+There are 4-5 CMD profiles modelling OLAC/dcmi-terms
+Here the problem of proliferation seems especially virulent. Table \ref{table:dcterms-profiles} lists all the profiles modelling dcterms.
+As all these profiles are linked to the corresponding dublincore data categories, this does not pose a major problem on the exploitation side; however, the cluttering of the component registry with structurally identical or almost identical profiles needs to be questioned within the community.
+\begin{figure*}[!ht]
+\begin{center}
+\includegraphics[width=0.5\textwidth]{images/dcmiterms-profiles.png}
+\end{center}
+\caption{The meanwhile four DCMI profiles with identical conceptual linking}
+\label{fig:dcmi-profiles}
+\end{figure*}
+\begin{table}
+\caption{Profiles modelling dublincore terms}
+\label{table:dcterms-profiles}
+  \begin{tabular}{ l | l | l | r | r }
+    \hline
+profile name & created & creator & count & instances \\
+    \hline
+component-dc-terms-modular & 2010-04-21 & CMDI-team & 15 / 15 / 15 & \\
+component-dc-terms & 2010-04-21 & CMDI-team & 0 / 15 / 15 & \\
+DcmiTerms & 2010-10-28 & Dieter Van Uytvanck & 0 / 55 / 55 & \\
+OLAC-DcmiTerms & 2010-10-28 & Dieter Van Uytvanck & 0 / 55 / 55 & \\
+OLAC-DcmiTerms\footnote{optional DANS-DC-metadata component} & 2013-02-12 & Menzo Windhouwer & 1 / 71 / 62 & \\
+DC-UBU & 2013-05-29& Utrecht University Library & 0 / 15 / 15 & \\
+OLAC-DcmiTerms-ref & 2013-06-24 & fankhauser@ids-mannheim.de & 0 / 55 / 55 & \\
+    \hline
+  \end{tabular}
+\end{table}
+Additionally, there are a number of profiles with concept links to dublincore terms.
+Some use all of the dublincore elements or terms as one component within a larger profile,
+one example being the \xne{data} profile created by the Czech initiative LINDAT, which models the minimal obligatory set of META-SHARE (\xne{resourceInfo}) combined with a simple dublincore record (see also the subsection about META-SHARE below).
+Other profiles refer only to some data categories. Most often used: \concept{Title} (used in 33 profiles) and \concept{Creator} (in 29 profiles).
+Profiles that make more frequent use of the dublincore terms:
+\begin{itemize}
+\item EastRepublican (8)
+\item HZSKCorpus (17)
+\item teiHeader (8)
+\item ToolService (15)
+\item OralHistoryInterviewDANS (15)
+\end{itemize}
+\begin{figure*}[!ht]
+\begin{center}
+\includegraphics[width=0.8\textwidth]{images/profiles_using_dcmiterms.png}
+\end{center}
+\caption{Profiles referring to at least some of the dublincore data categories/terms}
+\label{fig:profiles-using-dcmiterms}
+\end{figure*}
 …
 The widespread use of TEI for encoding textual resources  brings about a strong interest of multiple research teams of the CLARIN community to integrate TEI with CMDI. There was a first attempt already in 2010, modelling the recommended \xne{teiHeader}\furl{http://www.tei-c.org/release/doc/tei-p5-doc/en/html/HD.html\#HD7}, encoding \xne{fileDesc} and \xne{profileDesc} components, leaving out \xne{encodingDesc} and \xne{revisionDesc}. The leaf elements were bound to the most prominent data categories, making it a mixture of both dublincore and isocat.
 The large research project \xne{Deutsches Text Archiv}\furl{http://deutschestextarchiv.de/}\todocite{DTA}, digitizing a hoist of historical german texts from the period 1650 - 1900 also uses TEI to encode the material and consequently the teiHeader to hold the metadata information.
+\todoin{Why a separate cmd-profile}
 \xne{Nederlab} is another large-scale project concerned with \todoin{dutch? historic texts}, starting 2013 in Netherlands\todocite{Nederlab}. Within this project another set of CMD profiles was created, however reusing existing components.
 As seen in figure \ref{fig:teiHeadeer_DBNL}, components \xne{fileDesc} and \xne{profileDesc} were reused, while the components \xne{encodingDesc} and \xne{revisionDesc}, left out in the original profile, were added.
 Another approach was applied within the context of other CLARIN-NL projects, \todocite{Windhouwer, 2012} generated, based on an ODD-file, a data category for every element of the teiHeader (135 datcats) creating a dedicated data category selection: \xne{TEI Header (2.1.0)}. In a subsequent step, an enriched schema was generated, that remodells the original teiHeader-schema, but with the individual elements being annotated with the new data categories (\code{dcr:datcat}-attribute). This schema is now maintained in the SCHEMAcat (cf. \ref{ch:components}. The next step would be to create again a new profile, but with all the components and elements in it bound to the corresponding data categories, moving the semantic linking into the relation registry, where appropriate relations could be defined between the data categories derived from TEI and the \xne{isocat} and/or \xne{dublincore} DCs.
+The large research project \xne{Deutsches Textarchiv}\furl{http://deutschestextarchiv.de/}\cite{Geyken2011deutsches}, digitizing a host of historical German texts from the period 1650--1900, also uses TEI to encode the material and consequently the teiHeader to hold the metadata information. Part of the project is also to integrate the data and metadata with the CLARIN infrastructure, meaning CMD records need to be generated for the resources. For this the team generated a completely new profile (as yet private) closely modelling the version of the teiHeader\furl{http://www.deutschestextarchiv.de/doku/basisformat_header} used in the project.
+Regarding the question why another teiHeader-based profile was generated instead of reusing the existing one: according to a personal note by Axel Herold\cite{Herold2013}, a member of the project team and author of the profile, the profile was custom-made for this particular project and it seemed undesirable to create a generalised TEI header profile.
+\xne{Nederlab} is another large-scale project, aiming at processing historic Dutch newspaper articles into a platform for search and analysis, starting 2013 in the Netherlands\furl{http://www.nederlab.nl}. Within this project, the metadata is also encoded in a \concept{teiHeader} and the data shall be integrated within CLARIN. Here, another set of CMD profiles was created, however reusing existing components.
+As seen in figure \ref{fig:teiHeader_DBNL}, components \xne{fileDesc} and \xne{profileDesc} were reused, while the components \xne{encodingDesc} and \xne{revisionDesc}, left out in the original profile, were added.
+Another approach was applied within the context of other CLARIN-NL projects\cite{Menzo2013-05tei}. Based on an ODD-file, a data category for every element of the teiHeader (135 datcats) was generated. In a subsequent step, an enriched schema was generated that remodels the original teiHeader-schema, but with the individual elements being annotated with the new data categories (\code{dcr:datcat}-attribute). This schema is now maintained in the SCHEMAcat (cf. \ref{ch:infra}). The next step would be to create again a new profile, but with all the components and elements in it bound to the corresponding data categories, moving the semantic linking into the relation registry, where appropriate relations could be defined between the data categories derived from TEI and the \xne{isocat} and/or \xne{dublincore} DCs.
 This yields a more complex, but also a more systematic and flexible setup, with a clean separation/boundary/interface of the semantic space of TEI and the possibility to map the TEI elements (via their data categories) to multiple and/or different data categories according to the specific needs of a project or research question.
 …
   \begin{tabular}{ l | r | l | r | r | r}
     \hline
 project, author & created & profile name & comp elem datcats & instances \\
     \hline
 Deutsches Text Archiv & 2012 & teiHeader & 56/82/10 & 857 \\
 ICLTT, Durco & 2010 & teiHeader & 16/35/13 & 467 \\
 Leipzig Corpora, Eckart & 2012 & TEIDocumentDescription & 16/35/13 & ? \\
 Nederlab, Zhang & 2013 & DBNL\_Tekst & 20/38,15 & ? \\
   & & DBNL\_Tekst\_Onzelfstandig & 20/47/21 & ? \\
+profile name & created & creator & count & instances \\
+    \hline
+teiHeader & 2010 & ICLTT, Durco & 16/35/13 & 467 \\
+teiHeader & 2012 & Deutsches Text Archiv & 56/82/10 & 857 \\
+TEIDocumentDescription & 2012 & Leipzig Corpora, Eckart & 16/35/13 & ? \\
+DBNL\_Tekst & 2013 & Nederlab, Zhang & 20/38/15 & \textgreater 37 million\footnote{There shall be a metadata record for every article.} \\
+DBNL\_Tekst\_Onzelfstandig  & & & 20/47/21 &  \\
     \hline
   \end{tabular}
 \end{table}
-\todoin{DBNL\_Tekst\_Onzelfstandig - how many instances?}
 DBNL\_Tekst clarin.eu:cr1:p\_1361876010678,
 clarin.eu:cr1:p 1366279029218 (private)
 …
 META-SHARE created a new metadata model \cite{Gavrilidou2012meta}. Although inspired by the Component Metadata, META-SHARE metadata imposes a single large schema for all resource types with a minimal core subset of obligatory metadata elements and with many optional components.
 %In cooperation between metadata teams from CLARIN and META-SHARE
+The model has been expressed as 4 CMD profiles for distinct resource types sharing most of the components. The biggest single profile is currently the remodelled maximum schema from the META-SHARE project for describing corpora, with 419 components and 1587 elements (when expanded). Although most of the elements are optional
+resourceInfo    419     1587    72      790     797     50.22 %
+\todoin{how many distinct components/elements}
+This? shows nicely the trade-off between the two different approaches between CMD and META-SHARE: many custom schemas or one very large.
+In a parallel effort, LINDAT, the Czech national infrastructure initiative with ties to both CLARIN and META-SHARE, created a CMD profile modelling the minimal obligatory set of META-SHARE, combined with dublincore.
+So the information is partly duplicated, but with the advantage that minimal information is conveyed in the widely understood format, while retaining the expressivity of the feature-rich schema.
+resourceInfo    65      92      21      82      10      10.87 %
+\begin{figure*}[!ht]
+\begin{center}
+\includegraphics[width=0.5\textwidth]{images/SMC-resourceInfo.png}
+\end{center}
+\caption{profile by LINDAT combining META-SHARE \xne{resourceInfo} component with dublincore elements }
+\label{fig:resource_info_5}
+\end{figure*}
+\begin{table}
+\caption{Profiles modelling resourceInfo}
+\label{table:resourceinfo-profiles}
+  \begin{tabular}{ l | l | l | r | r }
+    \hline
+profile name & created & creator & count & instances \\
+    \hline
+resourceInfo (minimal) & 2013-02-13 & LINDAT.CZ & 34 / 41 / 21 \\
+resourceInfo (lexical) & 2013-06-02 & P. Labropoulou & 86 / 226 / 57 \\
+resourceInfo (tools) & 2013-06-02 & P. Labropoulou & 61 / 176 / 52 \\
+resourceInfo (language) & 2013-06-02 & P. Labropoulou & 89 / 228 / 54 \\
+resourceInfo (corpus) & 2013-06-02 & P. Labropoulou & 117 / 337 / 72 \\
+    \hline
+  \end{tabular}
+\end{table}
+The model has been expressed as 4 CMD profiles, each for a distinct resource type, with all four sharing most of the components, as can be seen in figure \ref{fig:resource_info_5}. The biggest single profile is currently the remodelled maximum schema from the META-SHARE project for describing corpora, with 117 distinct components and 337 elements. When expanded, this translates to 419 components and 1587 elements. However, many of the components and elements are optional (and conditional), thus a specific instance will never use all the possible elements.
+In a parallel effort, LINDAT, the Czech national infrastructure initiative engaged in both CLARIN and META-SHARE, created a CMD profile (\xne{data}\furl{http://catalog.clarin.eu/ds/ComponentRegistry/?item=clarin.eu:cr1:p_1349361150622}) modelling the minimal obligatory set of META-SHARE (\xne{resourceInfo}), however combined with a simple dublincore record.
+This way, the information gets partly duplicated, but with the advantage that minimal information is conveyed in the widely understood format, while retaining the expressivity of the feature-rich schema.
 \begin{figure*}[!ht]
 …
+\section{Summary}
+\chapter{Results}
+\label{ch:results}
+\section{Software module}
+The core function of the SMC is implemented as a set of XSL stylesheets, with auxiliary functionality (like caching or a wrapping web service) being provided by a wrapping application implemented in Java. There is also a plan to provide an XQuery implementation. The SMC module is being maintained in the CMDI code repository\footnote{\url{http://svn.clarin.eu/SMC}}.
+\subsection{SMC Browser -- Advanced Interactive User Interface}
+Explore the Component Metadata Framework
+In CMD, metadata schemas are defined by profiles that are constructed out of reusable components -- collections of metadata fields. The components can contain other components, and they can be reused in multiple profiles. Furthermore, every CMD element (metadata field) refers via a PID to a data category to indicate unambiguously how the content of the field in a metadata description should be interpreted (Broeder et al., 2010).
+Thus, every profile can be expressed as a tree, with the profile component as the root node, the used components as intermediate nodes and elements or data categories as leaf nodes, the parent-child relationship being defined by inclusion (componentA -includes-> componentB) or referencing (elementA -refersTo-> datcat1). The reuse of components in multiple profiles, and especially the referencing of the same data categories in multiple CMD elements, leads to a blending of the individual profile trees into a graph (directed and acyclic, but not necessarily connected).
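The blending of profile trees into one graph can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the SMC implementation (which is based on XSL stylesheets): the profile, component, element and data category names are hypothetical, and the includes and refersTo relations are both reduced to plain directed edges.

```python
# Hypothetical profile trees, each given as a list of (parent, child) edges.
# Both profiles reuse "componentA" and both reference "datcat1".
PROFILE_EDGES = {
    "profileA": [
        ("profileA", "componentA"),
        ("componentA", "elementA"),
        ("elementA", "datcat1"),
    ],
    "profileB": [
        ("profileB", "componentA"),
        ("profileB", "elementB"),
        ("elementB", "datcat1"),
    ],
}


def merge_profiles(profile_edges):
    """Merge all profile trees into one directed graph (node -> set of children)."""
    graph = {}
    for edges in profile_edges.values():
        for parent, child in edges:
            graph.setdefault(parent, set()).add(child)
            graph.setdefault(child, set())  # make sure leaf nodes are present
    return graph


def parents(graph, node):
    """Return all nodes that point to the given node, sorted by name."""
    return sorted(p for p, children in graph.items() if node in children)


graph = merge_profiles(PROFILE_EDGES)
```

A node with more than one parent, such as the shared component or the shared data category here, is exactly where the individual trees blend into a graph.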
+SMC Browser visualizes this graph structure in an interactive fashion. You can have a look at the examples for inspiration.
+It is implemented on top of the wonderful JavaScript library d3; the code is checked into the CLARIN SVN repository (and needs refactoring). More technical documentation will follow soon.
+The graph is constructed from all profiles defined in the Component Registry. To resolve the names and descriptions of the data categories referenced in the CMD elements, the definitions of all (public) data categories from DublinCore and ISOcat (from the Metadata Profile [RDF] -- retrieving takes some time!) are fetched. However, only data categories used in CMD become part of the graph. Here is a quantitative summary of the dataset.
+\section{Evaluation}
+\label{evaluation}
+Sample Queries:
+candidate Categories:
+ResourceType, Format
+Genre, Topic
+Project, Institution, Person, Publisher
+\subsection{Use Cases}
+\begin{itemize}
+\item MD Search employing Semantic Mapping
+\item MD Search employing Fuzzy Search
+\end{itemize}
 …
 \section{Summary}
+\begin{figure*}[!ht]
+\includegraphics[width=1\textwidth]{images/screen_SMC-Browser_2013-01-23}
+\caption{Screenshot of the SMC browser}
+\end{figure*}
+The direct comparison of the CMD approach, a metamodel that allows custom profiles with shared semantics to be generated, with the more traditional approach of creating one schema to fit all, as in META-SHARE, nicely shows the trade-off: many custom schemas or one very large one.

SMC4LRT/chapters/appendix.tex

-                      r3240
+                      r3551
 \includegraphics[width=1\textwidth]{images/DCR_data_model.jpg}
 \end{center}
 \caption{DCR data model}
+\caption{DCIF -- the data model for the Data Category Registry as defined by the ISO Standard ISO12620:2009 \cite{ISO12620:2009}}
 \label{fig:DCR_data_model}
 \end{figure*}
-\todocite{DCR data model}
 \begin{figure*}[!ht]
 …
 \label{fig:ref_arch}
 \end{figure*}
+\begin{figure*}[!ht]
+\begin{center}
+\includegraphics[width=1\textwidth]{images/acdh-diagram_300dpi_rotated.png}
+\end{center}
+\caption{Austrian Centre for Digital Humanities - the home of SMC - in context}
+\label{fig:acdh_context}
+\end{figure*}
+\section{SMC Reports}
+\label{sec:reports}
+SMC Reports is a (growing) set of documents analyzing specific phenomena in the CMD data domain based on the visual and numerical output from the SMC Browser \ref{smc-browser}.
+\input{chapters/examples_cleaned}
