Changeset 3665 for SMC4LRT


Timestamp:
10/02/13 19:52:31 (11 years ago)
Author:
vronk
Message:

rework of Results, Definitions, appendix, added Conclusion,
smaller changes to Design, Data

Location:
SMC4LRT/chapters
Files:
12 edited

  • SMC4LRT/chapters/Conclusion.tex

    r3551 r3665  
    33\label{ch:conclusions}
    44
    5 Further work is needed on more complex types of response (similarity ratio, relation types) and also on the interaction with Metadata Service to find the optimal way of providing the features of semantic mapping and query expansion as semantic search within the search user-interface.
     5With this work, a technical description together with a prototypical implementation for the \emph{Semantic Mapping Component} was delivered -- one module within an infrastructure for providing metadata, the \emph{Component Metadata Infrastructure}.
    66
    7 The statistics about current usage/population of the CMD demonstrate that the basic concept of a flexible metamodel with integrated semantic layer is being taken up by the community. Metadata modellers increasingly making use not only of the infrastructure, but are also reusing the modelling work done so far. The provisions designed to ensure semantic interoperability (DCR together with the RR) are pratically in place and prove to be useful.
     7SMC features a concept-based crosswalk service providing correspondences between fields in metadata formats and a module for query expansion building on top of it, allowing concept-based semantic search. Further work is needed on the crosswalk service providing more complex types of response (similarity ratio, relation types) with implications for the query expansion module. The integration of the semantic mapping features in the search user interface is only rudimentary at present, calling for a more elaborate solution.
     8% Dynamic integration of the information from the Relation Registry into the search interface and search processing.
    89
    9 More work is needed on consolidation of the actual values in the CMD records. CLARIN has set up a separate task force for data curation, which will have to be an ongoing effort. Also, work is ongoing on enriching the SMC browser with instance data information, allowing to directly see and inspect, which profiles and DCs are effectively being used in the instance data (and how often).
     10A separate track altogether is the effort to deliver the CMD data as \emph{Linked Open Data}, for which only the groundwork has been laid by specifying the modelling of the data in RDF. Further steps are: setting up a processing workflow to apply the specified model and transform all the data (profiles and instances) into RDF; a server solution to host the data and allow querying it; and finally, on top of it, a web interface for users to explore the dataset.
    1011
     12%Irrespective of the additional levels - the user wants and has to get to the resource. (not always) to the "original"
     13And finally, a visualization tool for the schema level data of the discussed data collection was developed -- the \emph{SMC Browser}.
     14Judging by the feedback received so far from colleagues in the community, it is already a useful tool with considerable further potential. As detailed in \ref{smc-browser-extensions}, there are a number of features that could enhance the functionality and usefulness of the tool: integrating instance data to show directly which profiles are effectively being used; allowing set operations on subgraphs (like intersection and difference) to enable differential views; generalizing the matching algorithm; and enhancing the tool to act as an independent visualization service by accepting external graph data (from any domain).
    1115
    12 Irrespective of the additional levels - the user wants and has to get to the resource. (not always)
    13 to the "original"
     16Within the CLARIN community a number of (permanent) tasks have been identified and corresponding task forces have been established,
     17one of them being metadata curation. The results of this work represent a directly applicable groundwork for this ongoing effort.
     18One particularly pressing aspect of the curation is the consolidation of the actual values in the CMD records, a topic explicitly treated in this work.
  • SMC4LRT/chapters/Data.tex

    r3638 r3665  
    1111\label{def:CMD}
    1212
    13 The \emph{Component Metadata Framework} (CMD) is the data model of the CLARIN metadata infrastructure. (See \ref{CMDI} for information about the infrastructure. The XML-schema of CMD -- the general-component-schema -- is featured in appendix \ref{lst:general-component-schema}.)
     13The \emph{Component Metadata Framework} (CMD) is the data model of the CLARIN Component Metadata Infrastructure. (See \ref{def:CMDI} for information about the infrastructure. The XML-schema defining CMD -- the \xne{general-component-schema} -- is featured in appendix \ref{lst:cmd-schema}.)
    1414CMD is used to define so-called \var{profiles}, which are constructed out of reusable \var{components} -- collections of metadata fields. Components can contain other components and can be reused in multiple profiles. A profile itself is just a special kind of component (a subclass) with some additional administrative information.
    1515The actual core provision for semantic interoperability is the requirement, that each CMD element (i.e. metadata field) refers ``via a PID to exactly one data category\footnote{persistently referenceable concept definition} (cf. \ref{def:DCR}), thus
    1616indicating unambiguously how the content of the field in a metadata description should be interpreted'' \cite{Broeder+2010}.
    1717
     18This approach of integrating prerequisites for semantic interoperability directly into the process of metadata creation is fundamentally different from the traditional methods of schema matching that try to establish pairwise alignments between already existing schemas -- be it algorithm-based or by means of explicit manually defined crosswalks\cite{Shvaiko2005}.
     19
    1820While the primary registry for data categories used in CMD is the \xne{ISOcat} Data Category Registry (cf. \ref{def:DCR}), other authoritative sources are accepted (so-called ``trusted registries''), especially the set of terms maintained by the Dublin Core Metadata Initiative \cite{DCMI:2005}.
    1921
     
    3234\caption{The development of defined profiles and DCs over time}
    3335\label{table:dev_profiles}
    34   \begin{tabular}{ l | r | r | r | r }
     36%  \begin{tabular}{ l | r | r | r | r }
     37  \begin{tabular}{ l  r  r  r  r }
     38
    3539    \hline
    3640date     & 2011-01 & 2012-06 & 2013-01 & 2013-06  \\
     
    5155
    5256
    53 \subsection{Instance Data}
    54 
    55 
    56 \todoin{ add historical perspective on data - list overall}
     57\subsubsection{Instance Data}
     58
     59
     60%\todoin{ add historical perspective on data - list overall}
    5761
    5862The main CLARIN OAI-PMH harvester\footnote{\url{http://catalog.clarin.eu/oai-harvester/}}
     
    6569\caption{Top 20 profiles, with the respective number of records}
    6670\begin{center}
    67   \begin{tabular}{ r | l }
     71  \begin{tabular}{ r l }
     72    \hline
    6873\# records & profile \\
    6974    \hline
     
    96101\caption{Top 20 collections, with the respective number of records}
    97102\begin{center}
    98   \begin{tabular}{ r | l }
     103  \begin{tabular}{ r l }
     104    \hline
    99105\# records & collection \\
    100106    \hline
     
    154160
    155161\subsection{TEI / teiHeader}
     162\label{tei}
     163
    156164 TEI/teiHeader/ODD,
     165
    157166
    158167\subsection{ISLE/IMDI}
  • SMC4LRT/chapters/Definitions.tex

    r3553 r3665  
    22\label{ch:def}
    33
     4\section {Abbreviations}
     5\label{abbr}
     6
     7\begin{table}[!h]
     8\caption{Acronyms used throughout this document}
     9\begin{tabular}{ l p{0.8\textwidth} }
     10ACDH & \xne{Austrian Centre for Digital Humanities}, cf. \ref{acdh} \\
     11CLARIN & \xne{Common Language Resources and Technology Infrastructure} -- a research infrastructure initiative, cf. \ref{def:CLARIN} \\
     12CLAVAS & \xne{Vocabulary Alignement Service for CLARIN}, cf. \ref{def:CLAVAS} \\
     13CMD & \xne{Component Metadata Framework} -- the data model underlying the CMD Infrastructure, cf. \ref{def:CMD} \\
     14CMDI & \xne{Component Metadata Infrastructure}, cf. \ref{def:CMDI} \\
     15ERIC & \xne{European Research Infrastructure  Consortium} -- a legal entity for long-term research infrastructure initiatives \\
     16DARIAH & \xne{Digital Research Infrastructure for the Arts and Humanities}\furl{http://www.dariah.eu} -- another research infrastructure initiative, sister project to CLARIN \\
     17DC & data category, cf. \ref{def:DCR}  \\
     18DCR & data category registry, cf. \ref{def:DCR} \cite{ISO12620:2009} \\
     19DH & Digital Humanities, also eHumanities \\
     20LINDAT & Czech national infrastructure for LRT\furl{http://lindat.ufal.cuni.cz} \\
     21MPI & Max Planck Institute, especially MPI for Psycholinguistics in Nijmegen, task leader of CMDI \\
     22OLAC & \xne{Open Language Archives Community}\furl{http://www.language-archives.org/}, cf. \ref{def:OLAC} \\
     23PID & persistent identifier \cite{CLARIN2009_PID} \\
     24PURL & persistent uniform resource locator \cite{PURL1995} \\
     25RDF & \xne{Resource Description Framework} \cite{RDF2004} \\
     26RR & Relation Registry, cf. \ref{def:rr}   \\
     27TEI & \xne{Text Encoding Initiative}, cf. \ref{tei} \\
     28\end{tabular}
     29\end{table}
     30
    431\section {Namespaces}
    5 Namespaces mentioned through this document listed:
    632
    7 \begin{description}
    8 \item[dcif] 
    9 \item[skos]
    10 \end{description} 
     33%\label{table:namespaces}
    1134
    12 \section {Abbreviations}
     35%Namespaces referenced in this document, especially in \ref{sec:cmd2rdf} defining the RDF representation.
    1336
    14 \begin{description}
    15 \item[CLARIN] \textit{Common Language Resources and Technology Infrastructure} \ref{def:CLARIN}
    16 \item[CLAVAS] \textit{Vocabulary Alignement Service for CLARIN} \ref{def:CLAVAS}
    17 \item[CMD] \textit{Component Metadata} \ref{def:CMD}
    18 \item[CMDI] \textit{Component Metadata Infrastructure} \ref{def:CMDI}
    19 \item[ERIC] \textit{European Research Infrastructure  Consortium} - a legal entity for long-term research infrastructure initiatives
    20 \item[DARIAH] \textit{Digital Research Infrastructure for Arts and Humanities}
    21 \item[DC] data category
    22 \item[DCR] data category registry \cite{ISO12620:2009}
    23 \item[DH] Digital Humanities, also eHumanities
    24 \item[LINDAT] czech national infrastructure for LRT\furl{http://lindat.ufal.cuni.cz}
    25 \item[OLAC] \textit{Open Language Archive Community}\furl{http://www.language-archives.org/}\ref{def:OLAC}
    26 \item[PID] persistend identifier \cite{CLARIN2009_PID}
    27 \item[PURL] persistent uniform resource locator \cite{PURL1995}
    28 \item[RDF] \textit{Resource Description Framework} \cite{RDF2004}
    29 \item[RR] Relation Registry\ref{def:rr} 
    30 \item[TEI] \textit{Text Encoding Initiative}
    31 \end{description}
     37\begin{table}[!h]
     38\caption{Namespaces referenced in this document}
     39  \begin{tabular}{ l  l }
     40\var{Prefix name} & \var{Prefix IRI} \\
     41%    \hline
     42rdf: & http://www.w3.org/1999/02/22-rdf-syntax-ns\# \\
     43rdfs: & http://www.w3.org/2000/01/rdf-schema\# \\
     44xsd: & http://www.w3.org/2001/XMLSchema\# \\
     45owl: & http://www.w3.org/2002/07/owl\# \\
     46skos:   & http://www.w3.org/2004/02/skos/core\# \\
     47isocat: & http://www.isocat.org/datcat/ \\
     48dcr:& http://isocat.org/ns/dcr.rdf\#  \\
     49cmd: & http://clarin.eu/cmd/1.0\# \\
     50cmds:    & ? \\
     51dce: & http://purl.org/dc/elements/1.1/ \\
     52dcterms: & http://purl.org/dc/terms/ \\
     53oa: & http://www.w3.org/ns/oa\# \\
     54ore: & http://www.openarchives.org/ore/terms/ \\
     55cr: & http://catalog.clarin.eu/ds/ComponentRegistry/rest/registry/ \\
     56\end{tabular}
     57\end{table}
    3258
    33 \section {Terms}
     59\section{Formatting conventions}
    3460
    35 In the following, the terms used in this work are explained.
     61Inline formatting for highlighting: \\
    3662
    37 \begin{description}
    38 \item[Concept]  Basic "entity" in an ontology? that of what an ontology is build
    39 \item[Ontology]  \quote{formal, explicit specification of a shared conceptualisation} \cite{Gruber1993}, but for us mainly a collection of concepts as opposed to lexicon, which is a collection of words.
    40 \item[Word]  a lexical unit, a word in a language, something that has a surface realization (writtenForm) and is a carrier of sense. so a relation holds: hasSense(Word, Concept)
    41 \item[Lexicon]  a collection of words, a (lexical) vocabulary
    42 \item[Vocabulary] an index providing mapping from Word (string) to Concept (uri)
    43 \item[(Data)Category] (almost) the same as Concept; Things like \concept{Topic}, \concept{Genre}, \concept{Organization}, \concept{ResourceType} are instantiations of Category
    44 \item[ConceptualDomain] the Class of entities a Concept/Category denotes. For Organization it would be all (existing) organizations,  CD(ResourceType)={Corpus, Lexicon, Document, Image, Video, ...}. Entities of the domain can itself be Categories (\concept{ResourceType:Image}), but it can be also individuals
    45  (\concept{Organization University of Vienna})
    46         \todoin{Is it synonymous to value domain, range}
    47 \item[Entity]
    48 \item[Resource] informational resource, in the context of CLARIN-Project  mainly Language Resources (Corpus, Lexicon, Multimedia)
    49 \item[Metadata Description] description of some properties of a resource.  MD-Record
    50 \item[Schema] - CMD-Profile
    51 \item[Annotation]
    52 \end{description}
     63\begin{tabular}{ l l }
     64\xne{Named Entity} & an application or project name (institution names are written in plain text) \\
     65\code{code} & names of xml elements and attributes; also a concrete (sample) value  \\
     66\code{concept} & lexical label denoting a concept  \\
     67\var{variable} & definitions  and variables
     68\end{tabular}
    5369
    5470
    55 Lexicon vs. Ontology
    56 Lexicon is a linguistic object an ontology is not.\cite{Hirst2009} We don't need to be that strict, but it shall be a guiding principle in this work to consider things (Datasets, Vocabularies, Resources) also along this dichotomy/polarity: Conceptual vs. Lexical.
    57 And while every Ontology has to have a lexical representation (canonically: rdfs:label, rdfs:comment, skos:*label), if we don't try to force observed objects into a binary classification, but consider a bias spectrum, we should be able to locate these along this spectrum.
    58 So the main focus of a typical ontology are the concepts (``conceptualization''), primarily language-independent.
     71\begin{definition}{A definition in a block with caption}
     72some \ formal \ expression \ equation \ or \ grammar
     73\end{definition}
    5974
     75\noindent
     76Example blocks, simple:
     77\begin{example1}
     78Short piece of sample data
     79\end{example1}
    6080
    61 Another special case are Controlled Vocabularies or Taxonomies/Classification Systems, let alone folksonomies, in that they identify terms and concepts/meanings, ie there is no explicit mapping between the language represenation and the concept, but rather the term is implicit carrier of the meaning/concept.
    62 So for example in the LCSH the surface realization of each subject-heading at the same time identifies the Concept ~.
    63 
    64 ontologicky vs. semaziologicky (Semanticke priznaky: kategoriálne/archysémy, difernciacne, specifikacne)
     81\noindent
     82or with tabs (especially for RDF triples):
     83\begin{example3}
     84my:work & my:example & my:block
     85\end{example3}
  • SMC4LRT/chapters/Design_SMCinstance.tex

    r3638 r3665  
    1 \chapter{Mapping on instance level, CMD as LOD}
     1\chapter{Mapping on instance level,\\ CMD as LOD}
    22\label{ch:design-instance}
    33
     
    1515\end{quotation}
    1616
    17 As described in previous chapters (\ref{ch:infrastructure},\ref{ch:design_schema}), semantic interoperability is one of the main motivations for the CMD infrastructure. However, this machinery pertains mostly to the schema level, the actual values in the fields of CMD instances reman ``just strings''. This is the case even though the problem of different labels for semantically equivalent or even identical entities is even more so virulent on the instance level. While for a number of metadata fields the value domain can be enforced through schema validation, some important fields (like \concept{organization} or \concept{resource type})  have a constrained value domain that yet cannot be explicitly exhaustively enumerated. This leads to a chronically inconsistent use of labels for referring to entities (as the instance data shows, some organizations are referred to by more than 20 different labels, or spelling variants.) prompting an urgent need for better means for harmonizing the constrained-field values.
     17As described in previous chapters (\ref{ch:infra},\ref{ch:design}), semantic interoperability is one of the main motivations for the CMD infrastructure. However, the established machinery pertains mostly to the schema level; the actual values in the fields of CMD instances remain ``just strings''. This is the case even though the problem of different labels for semantically equivalent or even identical entities is even more virulent on the instance level. While for a number of metadata fields the value domain can be enforced through schema validation, some important fields (like \concept{organization} or \concept{resource type}) have a constrained value domain that nevertheless cannot be exhaustively enumerated. This leads to a chronically inconsistent use of labels for referring to entities (as the instance data shows, some organizations are referred to by more than 20 different labels or spelling variants), prompting an urgent need for better means of harmonizing the constrained-field values.
    1818
    1919One potential remedy is the use of reference datasets -- controlled vocabularies, taxonomies, ontologies and such. In fact, this is a very common approach, be it the authority files in the library world, or domain-specific reference vocabularies maintained by practically every research community. Not as strict as schema definitions, they cannot be used for validation, but still help to harmonize the data, by offering preferred labels and identifiers for entities.
     
    312312\end{figure*}
    313313
    314 \subsubsection{Identify vocabularies  – CLAVAS}
     314\subsubsection{Identify vocabularies}
    315315
    316316\todoin{Identify related ontologies, vocabularies? - see DARIAH:CV}
     
    403403\label{semantic-search}
    404404
    405 With the new enhanced dataset, as detailed in section \ref{ch:cmd2rdf}, the groundwork is laid for the full-blown semantic search as proposed in the original goals, i.e. the possibility for ontology-driven or at least `semantic resources assisted' exploration of the dataset.
     405With the new enhanced dataset, as detailed in section \ref{sec:cmd2rdf}, the groundwork is laid for the full-blown semantic search as proposed in the original goals, i.e. the possibility for ontology-driven or at least `semantic resources assisted' exploration of the dataset.
    406406
    407407Namely to enhance it by employing ontological resources.
  • SMC4LRT/chapters/Design_SMCschema.tex

    r3638 r3665  
    66
    77We start by drawing an overall view of the system, introducing its individual components and the dependencies among them.
    8 In the next section, the internal data model is presented and explained. In section \ref{def:cx} the design of the actual main service for serving crosswalks is described, divided into the interface specification and notes on the actual implementation. In section \ref{def:qx} we elaborate on a search functionality that builds upon the aforementioned service in terms of appropriate query language, a search engine to integrate the search in and the peculiarities of the user interface that could support this enhanced search possibilities. Finally, in section \ref{smc-browser} an advanced interactive user interface for exploring the CMD data domain is proposed.
      8In the next section, the internal data model is presented and explained. In section \ref{sec:cx} the design of the actual main service for serving crosswalks is described, divided into the interface specification and notes on the actual implementation. In section \ref{sec:qx} we elaborate on a search functionality that builds upon the aforementioned service in terms of appropriate query language, a search engine to integrate the search in and the peculiarities of the user interface that could support these enhanced search possibilities. Finally, in section \ref{smc-browser} an advanced interactive user interface for exploring the CMD data domain is proposed.
    99
    1010\section{System Architecture}
     
    2121
    2222\begin{description}
    23 \item[crosswalk service] the basic service translating between fields (or indexes), detailed in \ref{def:cx}
     23\item[crosswalk service] the basic service translating between fields (or indexes), detailed in \ref{def:cx-interface}
    2424\item[concept-based query expansion] a module for query expansion based on the crosswalks
    2525\item[smc-xsl] set of xslt-stylesheets (governed by a build-file) for pre- and post-processing the data
     
    9090\label{datamodel-terms}
    9191
    92 In abstract terms, the internal format is basically a table of indexes with information collected from the upstream registries or created during preprocessing. Main entity is \code{Term} that represents either a label of a data category, or a CMD entity (a CMD  component or element). Further entities \code{Termset} and \code{Concept} are mainly used for logical grouping of the \code{Terms}. In the following, we explain the data model of these entities and their use in more detail. For a full \xne{Terms.xsd} XML schema see listing \ref{list:terms-schema}.
     92In abstract terms, the internal format is basically a table of indexes with information collected from the upstream registries or created during preprocessing. Main entity is \code{Term} that represents either a label of a data category, or a CMD entity (a CMD  component or element). Further entities \code{Termset} and \code{Concept} are mainly used for logical grouping of the \code{Terms}. In the following, we explain the data model of these entities and their use in more detail. For a full \xne{Terms.xsd} XML schema see listing \ref{lst:terms-schema}.
    9393
    9494\subsubsection{Type \code{Term}}
     
    111111%\captionsetup{justification=raggedright, singlelinecheck=false}
    112112\lstset{language=XML}
    113 \begin{lstlisting}[label=list:terms-attributes-datcat, caption=sample \code{Term} element encoding an ISOcat data category]
     113\begin{lstlisting}[label=lst:terms-attributes-datcat, caption=sample \code{Term} element encoding an ISOcat data category]
    114114<Term concept-id="http://www.isocat.org/datcat/DC-2544" set="isocat"
    115115        type="label" xml:lang="fr">nom de ressource</Term>
     
    131131
    132132\lstset{language=XML}
    133 \begin{lstlisting}[label=list:terms-attributes-element, caption=sample \code{Term} element encoding a CMD element]
     133\begin{lstlisting}[label=lst:terms-attributes-element, caption=sample \code{Term} element encoding a CMD element]
    134134<Term type="CMD_Element" name="Url" datcat="http://www.isocat.org/datcat/DC-2546"
    135135          id="clarin.eu:cr1:c_1290431694487#Url" parent="Contact"
     
    152152
    153153\lstset{language=XML}
    154 \begin{lstlisting}[label=list:terms-attributes-index, caption=sample \code{Term} element encoding a term in the inverted index]
     154\begin{lstlisting}[label=lst:terms-attributes-index, caption=sample \code{Term} element encoding a term in the inverted index]
    155155   <Term set="cmd" type="full-path" schema="clarin.eu:cr1:p_1357720977520"
    156156                id="clarin.eu:cr1:c_1359626292113#ResourceTitle"
     
    168168
    169169\lstset{language=XML}
    170 \begin{lstlisting}[label=list:concept, caption=sample \code{Concept} element representing the data category \concept{resourceTitle}]
     170\begin{lstlisting}[label=lst:concept, caption=sample \code{Concept} element representing the data category \concept{resourceTitle}]
    171171<Concept xmlns:dcif="http://www.isocat.org/ns/dcif" type="datcat"
    172172               id="http://www.isocat.org/datcat/DC-2545">
     
    182182\end{lstlisting}
    183183
    184 In the inverted index the \code{Concept} is enriched with the \code{Terms} representing corresponding CMD entities (cf. Listing \ref{list:concept-cmd-term}).
    185 
    186 \lstset{language=XML}
    187 \begin{lstlisting}[label=list:concept-cmd-term, caption=\code{Term} for CMD element added to \code{Concept}]
     184In the inverted index the \code{Concept} is enriched with the \code{Terms} representing corresponding CMD entities (cf. Listing \ref{lst:concept-cmd-term}).
     185
     186\lstset{language=XML}
     187\begin{lstlisting}[label=lst:concept-cmd-term, caption=\code{Term} for CMD element added to \code{Concept}]
    188188 <Term set="cmd" type="full-path" schema="clarin.eu:cr1:p_1345561703620"
    189189            id="clarin.eu:cr1:c_1345561703619#Name">collection.CollectionInfo.Name</Term>
     
    223223
    224224\lstset{language=XML}
    225 \begin{lstlisting}[label=list:termset, caption=\code{Termset} element representing a CMD profile]
     225\begin{lstlisting}[label=lst:termset, caption=\code{Termset} element representing a CMD profile]
    226226<Termset name="AnnotatedCorpusProfile" id="clarin.eu:cr1:p_1357720977520"
    227227            type="CMD_Profile">
     
    254254Consequently, the infrastructure has also foreseen this dedicated module, \emph{Semantic Mapping}, that exploits this mechanism to find \textbf{corresponding fields in different metadata schemas}.
    255255
    256 The task of the crosswalk service is to collect the relevant information maintained in the registries of the infrastructure and process it to generate the mappings, or \textbf{crosswalks} between fields in heterogeneous metadata schemas. These crosswalks can be used by other applications building the base for concept-based search in the heterogeneous data collection of the joint CLARIN metadata domain (cf. \ref{def:qx}).
     256The task of the crosswalk service is to collect the relevant information maintained in the registries of the infrastructure and process it to generate the mappings, or \textbf{crosswalks} between fields in heterogeneous metadata schemas. These crosswalks can be used by other applications building the base for concept-based search in the heterogeneous data collection of the joint CLARIN metadata domain (cf. \ref{sec:qx}).
    257257
    258258The core means for semantic interoperability in CMDI are the \emph{data categories} (cf. \ref{def:DCR}), well-defined atomic concepts, that are supposed to be referenced in schemas by annotating fields to unambiguously indicate their intended semantics. Drawing upon this system, the crosswalks are not generated directly between the fields of individual schemas by some kind of matching algorithm, but rather the data categories are used as reliable bridges for translation. This results in clusters of semantically equivalent metadata fields (with data categories serving as pivotal points), instead of a collection of pair-wise links between fields.
     
    428428\subsection{Implementation}
    429429
     430The core functionality  of the SMC is implemented as a set of XSL-stylesheets
     431
    430432At the core of the described module is a set of XSL-stylesheets, governed by an ant-build file and a configuration file holding the information about individual source registries.
    431433
     
    474476
    475477\section{qx -- concept-based search}
    476 \label{def:qx}
     478\label{sec:qx}
    477479To recall, the main goal of this work is to enhance the search capabilities of the search engines serving the metadata.
    478480In this section we want to explore, how this shall be accomplished, i.e. how to bring the enhanced capabilities to the user.
     
    506508
    507509The Metadata Repository is implemented in XQuery, running within the eXist XML database as a web application.
     510
     511There is also an XQuery implementation that is integrated as a module of SADE/cr-xq, the eXist-based web application framework for publishing resources on which the Metadata Repository runs.
    508512
    509513
     
    622626\begin{description}
    623627\item[SMC graph basic]
    624         the basic graph contains \var{profiles $\mapsto$ components $\mapsto$ elements $\mapsto$ datcats}
     628        the basic graph contains \var{profiles $\mapsto$ components $\mapsto$ elements $\mapsto$ datcats}; processing 155 profiles yields a graph with over 4,500 nodes and over 7,500 edges
    625629\item[SMC graph all]
    626630        additionally rendering the new profile-groups and relations between data categories (from Relation Registry)
     
    635639Additionally, a detour pass (\var{track D}) is executed, in which the graph is also transformed into the DOT format and run through the \xne{Graphviz dot} tool to get an SVG representation of the graph. In an early stage of development, this was actually the only processing path. However, it soon became obvious that the graph was getting too large to be displayed in its entirety. Figure \ref{fig:cmd-dep-dotgraph} displays an old version of such a dot-generated graph visualization. Currently, the \xne{dot} output is only used as input for the final graph data, providing initialization coordinates for the nodes in the \code{dot}-layout.
    636640
     641The graph is constructed from all profiles defined in the Component Registry and related datasets. To resolve the (multilingual) names and descriptions of data categories referenced in the CMD elements, the definitions of the referenced data categories are fetched from DublinCore and ISOcat.
     642
    637643
    638644\begin{figure*}
     
    657663One special option is \code{graph}, that allows to switch between different graphs as listed in \ref{smc-browser-data-preprocessing}.
    658664
    659 There is user documentation deployed with the application and featured in the appendix \ref{sec:smc-browser-userdocs}, where a all aspects of interaction with the application (\ref{interaction}) and the options in the navigation bar (\ref{options}) are described.
     665There is user documentation deployed with the application and featured in the appendix \ref{sec:smc-browser-userdocs}, where all aspects of interaction with the application (\ref{interaction}) and the options in the navigation bar (\ref{options}) are described.
    660666
    661667\subsection{Extensions}
     668\label{smc-browser-extensions}
     669
    662670Next to the basic setup described above, there are a number of possible additional features that could enhance the functionality and usefulness of the discussed tool.
    663671
  • SMC4LRT/chapters/Infrastructure.tex

    r3638 r3665  
    22\label{ch:infra}
    33
     4In this chapter, we present the infrastructure, in which this work is embedded. We start with a short general introduction about the large research infrastructure initiative CLARIN, followed by a close examination of its technical infrastructure for creating and publishing metadata. In section \ref{sec:cv}, we discuss the services for managing controlled vocabularies and their role in the context of metadata creation.
    45
    56\section{CLARIN}
     
    1819The initiative foresees a federated network of centres providing resources and services in a harmonized, interoperable manner to the academic community in all participating countries.
    1920
    20 In the preparation phase of the project 2008 - 2011 over 180 institutions from 38 countries participated. In the construction phase, the action impetus moved, as projected, more to the individual national initiatives of this federated endeavour, while kept together by the common principles set up during the preparation phase and established processes and bodies ensuring the flow of information and coherent action on European level.
     21In the preparation phase of the project 2008 - 2011 over 180 institutions from 38 countries participated. In the construction phase, the action impetus moved, as projected, more to the individual national initiatives of this federated endeavour, while kept together by the common principles set up during the preparation phase and established processes and administrative decision bodies ensuring the flow of information and coherent action on European level.
    2122
    2223Since 2013, CLARIN has also been a \emph{European Research Infrastructure Consortium} (ERIC), which is a new type of legal entity established within the EU, especially designed to give research infrastructure initiatives a more stable status and better means to act independently. This is an important step towards ensuring the continuity of the endeavour, a chronic problem of (international) projects.
    2324
    24 \section{Component Metadata Infrastructure -  CMDI}
     25
     26\section{Component Metadata Infrastructure -- CMDI}
    2527\label{def:CMDI}
    2628
    2729One core pillar of CLARIN is the \emph{Component Metadata Infrastructure} (CMDI)\furl{http://www.clarin.eu/cmdi} -- a distributed system consisting of multiple interconnected modules aimed at creating and providing metadata for LRT in a coherent harmonized way. The conceptual foundation of CMDI is the \emph{Component Metadata Framework}\cite{Broeder+2010}, a flexible meta model that supports creation of metadata schemas also allowing to accommodate existing schemas (cf. \ref{def:CMD}).
    2830
    29 The SMC is part of CMDI and depends on multiple modules on the production side of the infrastructure. Before we describe the SMC and its interaction with these modules in detail in chapter \ref{ch:design}, we introduce the latter and the type of data they provide:
     31The SMC is part of CMDI and depends on multiple modules on the production side of the infrastructure. Before we describe the SMC and its interaction with these modules in detail in chapter \ref{ch:design}, we introduce the latter and the type of data they provide in \ref{cmdi-registries}:
    3032
    3133\begin{itemize}
     
    3638
    3739\noindent
    38 All these components are running services, that this work shall directly build upon.
    39 
    40 Next to these core services, that SMC has direct dependencies to, some other services are being developed within the CMDI ecosystem that are also relevant in the context of SMC:
     40All these modules are running services, that this work shall directly build upon.
     41
     42In contrast, SMC is meant as provider for the modules on the exploitation side of the infrastructure, i.e. search and exploration services used by the end users. These are briefly introduced in \ref{cmdi_exploitation}.
     43
     44\begin{figure*}[ht]
     45\begin{center}
     46\includegraphics[width=0.8\textwidth]{images/CMDI_components_old_clean.png}
     47\caption{The diagram [from early CLARIN/CMDI presentations] shows individual modules of the CMDI and their interrelations as envisaged in the initial phase of the CLARIN project}
     48\label{fig:cmdi-old}
     49\end{center}
     50\end{figure*}
     51
     52Next to the above-mentioned services SMC is in direct interaction with, some other services and applications are part of the CMDI ecosystem that are briefly introduced in \ref{cmdi-other} for completeness:
    4153
    4254\begin{itemize}
    43 \item Schema Registry (SCHEMAcat\furl{http://lux13.mpi.nl/schemacat/site/index.html})
     55\item metadata editors
     56\item Schema Registry
    4457\item SchemaParser
    45 \item Vocabulary Alignement Service (OpenSKOS)
    4658\end{itemize}
    4759
    48 On the other hand, SMC shall serve the modules on the exploitation side of the infrastructure, i.e. search services used by end users. These are briefly introduced in \ref{cmdi_exploitation}.
    49 
    50 \begin{figure*}[!ht]
    51 \includegraphics[width=0.8\textwidth]{images/CMDI_components_old.png}
    52 \caption{The diagram (from early CLARIN/CMDI presentations) shows individual modules of the CMDI and their interrelations}
    53 \end{figure*}
    54 
     60Finally, the Vocabulary Alignment Service, a module playing a crucial role in metadata curation, is treated separately in section \ref{sec:cv}.
    5561
    5662\subsection{CMDI registries}
    57 
    58 The CMD framework as data model (cf. \ref{def:CMD} together with the two registries the \emph{Data Category Registry} \xne{ISOcat} and the \emph{Component Registry} build the backbone of the CMD Infrastructure. In the following we explain briefly their role and interaction.
    59 
    60 \begin{figure*}[!ht]
     63\label{cmdi-registries}
     64The CMD framework as data model (cf. \ref{def:CMD}) together with the two registries the \emph{Data Category Registry} \xne{ISOcat} and the \emph{Component Registry} build the backbone of the CMD Infrastructure. See figure \ref{fig:cmdi-old} with the rather na\"{i}ve initial vision of the system contrasted with the figure \ref{fig:SMC-linkage} detailing the actual linkage between the data in the individual registries. In the following, we explain briefly their role and interaction.
     65
     66\begin{figure*}[t]
    6167\includegraphics[width=1\textwidth]{images/SMC_CR-DCR-RR_Linkage_v2}
    6268\caption{The diagram depicts the links between pieces of data in the individual registries that serve as basis for semantic mapping}
     69\label{fig:SMC-linkage}
    6370\end{figure*}
    6471       
    65 \subsubsection*{Data Category Registry}
     72\subsubsection*{Data Category Registry -- ISOcat}
    6673\label{def:DCR}
    6774
    68 The \emph{Data Category Registry} (DCR) is a central registry that enables the community to collectively define and maintain a set of relevant linguistic data categories. The resulting commonly agreed controlled vocabulary is the cornerstone for grounding the semantic interpretation within the CMD framework.
    69 The data model and the procedures of the DCR are defined by the ISO standard \cite{ISO12620:2009}, and is implemented in \xne{ISOcat}\furl{http://www.isocat.org/}.
    70 Next to a web interface for users to browse and manage the data categories, DCR provides a REST-style webservice allowing applications to access the information (provided in Data Category Interchange Format - DCIF). The data categories are assigned a persistent identifier, making them globally and permanently referenceable.
     75The \emph{Data Category Registry} (DCR) is a central registry that enables the community to collectively define and maintain a set of relevant linguistic data categories (DC). The resulting shared controlled vocabulary is the cornerstone for grounding the semantic interpretation within the CMD framework (among others -- DCR is not specific to CMDI, it is meant to be used as common concept registry in many applications).
     76
     77The data model and the procedures of the DCR are defined by the ISO standard \cite{ISO12620:2009}.
     78\xne{ISOcat}\furl{http://www.isocat.org/} is an implementation of this standard framework developed by MPI for Psycholinguistics, Nijmegen in collaboration with the ISO technical committee \xne{ISO TC 37 Terminology and Other Language and Content Resources}.
     79Next to a web interface for users to browse and manage the data categories, ISOcat provides a REST-style webservice allowing applications to retrieve the data category specifications. By default, these are provided in the \xne{Data Category Interchange Format - DCIF}, the standardized XML-serialization of the data model, but RDF and HTML representations are available as well.
     80
     81The core data model defining the data category specification is rather complex, consisting of an administrative, a linguistic and a description part, containing language-specific versions of definitions, value domains, examples and other attributes (cf. \ref{fig:DCR_data_model} for the diagram of the full data model). The following types of data categories are recognized (cf. figure \ref{fig:dc_type}): \var{simple, complex}: (\var{closed, open} or \var{constrained}), \var{container}. One fundamental aspect to emphasize is that the data categories are assigned a persistent identifier, making them globally and permanently referable.
     82
     83\begin{figure*}[!ht]
     84\begin{center}
     85\includegraphics[width=0.7\textwidth]{images/dc_types}
     86\end{center}
     87\caption{Data Category types\cite{Windhouwer2011ISOcat_intro}}
     88\label{fig:dc_type}
     89\end{figure*}
    7190
    7291\subsubsection*{Component Registry}
    73 
    74 \emph{Component Registry} (CR)\furl{http://catalog.clarin.eu/ds/ComponentRegistry/} implements the CMD data model and fulfills two functions. For one it as a robust web application for creating and editing new CMD components and profiles. On the other hand it is the actual registry the persistently stores and exposes published CMD profiles, allowing to browse and search in them and view their structure.
    75 
    76 The primary user of the CR is the metadata modeller with the task to create a dedicated metadata profile for a given resource type. She can browse and search the CR for components and profiles that are suitable or come close. The registry already contains many general components, e.g., for contact persons, language and geographical information. In general many of these can be reused as they are or have to be only slightly adapted, i.e., add or a remove some metadata elements and/or components. Also new components can be created to model the unique aspects of the resources under consideration. All components are combined into one profile. Components, elements and values should be linked to a concept to make its semantics explicit.\cite{Durco2013_MTSR}
    77 
    78 This approach of integrating prerequisites for semantic interoperability directly into the process of metadata creation differs
    79 from the traditional methods of schema matching that try to establish pairwise alignments between schemas only after they were created and published -- be it algorithm-based or by means of explicit manually defined crosswalks\cite{Shvaiko2005}.
     92\label{def:CR}
     93
     94\emph{Component Registry}\furl{http://catalog.clarin.eu/ds/ComponentRegistry/} (CR) implements the CMD data model (cf. \ref{def:CMD}) and fulfills two functions. For one, it is the actual registry that persistently stores and exposes published CMD profiles via a web interface allowing to browse and search in them and view their structure, accompanied by a REST webservice that allows client applications to retrieve the profile definitions. At the same time, the web interface serves as an editor for creating and editing new CMD components and profiles.
     95
     96The primary user of the CR is the metadata modeller with the task to create a dedicated metadata profile for a given resource type. She can browse and search the CR for components and profiles that are suitable or come close. The registry already contains many general components, e.g., for contact persons, language and geographical information. In general many of these can be reused as they are or have to be only slightly adapted, i.e., have some metadata elements and/or components  added or removed. Also new components can be created if needed to model the unique aspects of the resources under consideration.\cite{Durco2013_MTSR}
     97
     98Let us reiterate, that the actual core provision for semantic interoperability is the requirement that the elements (and as far as possible also components and values) should be linked ``via a PID to exactly one data category (cf. \ref{def:DCR}), thus indicating unambiguously how the content of the field in a metadata description should be interpreted''\cite{Broeder+2010}, or \emph{to make its semantics explicit}.
     99
     100As dictated by the CMD model, all components needed for the modelled resource description are compiled into one profile.
     101Once a profile is finished, the Component Registry provides automatically the corresponding XML schema in the \code{cmd} target namespace \code{http://www.clarin.eu/cmd}, that can be used as base for creating and validating metadata records.
    80102
    81103\subsubsection*{Ontological Relations -- Relation Registry}
     
    83105The framework as described so far provides a sound mechanism for binding the semantic interpretation of the metadata descriptions.
    84106However there needs to be an additional means to capture information about relations between data categories.
    85 This information was deliberately not included in the DCR, because relations often depend on the context in which they are used, making global agreement unfeasible. CMDI proposes a separate module -- the \emph{Relation Registry}\label{def:rr} (RR) \cite{Kemps-Snijders+2008} --, where arbitrary relations between data categories can be stored and maintained. This design grounds on the expectation that the RR should be under control of the metadata user whereas the DCR is under control of the metadata modeller.
    86 
    87 These relations don't need to pass a standardization process, but rather separate research teams may define their own sets of relations according to the specific needs of the project. That is not to say that every researcher has to create her own set of relations -- some basic recommended sets will be defined right from the start. But new -- even contradictory -- ones can be created when needed.
    88 
    89 There is a prototypical implementation of such a relation registry called \emph{RELcat} being developed at MPI, Nijmegen\cite{Windhouwer2011,SchuurmanWindhouwer2011}, that already hosts a few relation sets. There is no user interface to it yet, but it is accessible as a REST-webservice\footnote{sample relation set: \url{http://lux13.mpi.nl/relcat/rest/set/cmdi}}.
    90 This implementation stores the individual relations as RDF-triples
     107This information was deliberately not included in the DCR, because relations often depend on the context in which they are used, making global agreement unfeasible. CMDI proposes a separate module -- the \emph{Relation Registry}\label{def:rr} (RR) \cite{Kemps-Snijders+2008} --, where arbitrary relations between data categories can be stored and maintained. This design decision is based upon the assumption that the relations should be under the control of the metadata user, whereas the data categories are under the control of the metadata modeller.
     108
     109The relations don't need to pass a standardization process, but rather separate research teams may define their own sets of relations according to the specific needs of the project. That is not to say that every researcher has to create her own set of relations -- some basic recommended sets will be defined right from the start. But new -- even contradictory -- ones can be created when needed.
     110
     111There is a prototypical implementation of such a relation registry called \xne{RELcat} being developed at MPI, Nijmegen\cite{Windhouwer2011,SchuurmanWindhouwer2011}, that already hosts a few relation sets. There is no user interface to it yet, but it is accessible as a REST-webservice\footnote{sample relation set: \url{http://lux13.mpi.nl/relcat/rest/set/cmdi}}.
     112This implementation stores the individual relations as RDF triples
    91113
    92114\begin{example3}
    93 <subjectDatcat, & relationPredicate, & objectDatcat>
     115subjectDatcat & relationPredicate & objectDatcat
    94116\end{example3}
    95117
    96 allowing typed relations, like equivalency (\code{rel:sameAs}) and subsumption (\code{rel:subClassOf}). The relations are grouped into relation sets that can be used independently.
    97 
    98 \todoin{check DCR-RR/Odijk2010 -follow up ?; Cf. Erhard Hinrichs 2009 }
     118allowing typed relations, like equivalency (\code{rel:sameAs}) and subsumption (\code{rel:subClassOf}). The relations are grouped into relation sets that can be used independently. The relations are deliberately defined in a separate namespace, instead of reusing existing ones (\code{skos:exactMatch, owl:sameAs}) with the aim to avoid introducing too specific semantics. These relations can be mapped to appropriate other predicates when integrating the relation sets in concrete applications.
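
How such a relation set could be used by a client application can be sketched as follows. This is an illustration only, not the RELcat implementation: the data category identifiers are placeholders, the function name \code{equivalents} is hypothetical, and treating \code{rel:sameAs} as symmetric and transitive is an assumption of this sketch.

```python
from collections import defaultdict

# Hypothetical relation set; the data category identifiers are placeholders.
RELATION_SET = [
    ("dc:A", "rel:sameAs", "dc:B"),
    ("dc:B", "rel:sameAs", "dc:C"),
    ("dc:D", "rel:subClassOf", "dc:A"),
]


def equivalents(datcat, triples, predicate="rel:sameAs"):
    """Collect all data categories connected to datcat via the predicate.

    The predicate is treated as symmetric and transitive (an assumption).
    """
    neighbours = defaultdict(set)
    for subject, pred, obj in triples:
        if pred == predicate:
            neighbours[subject].add(obj)
            neighbours[obj].add(subject)
    seen, stack = {datcat}, [datcat]
    while stack:
        for nxt in neighbours[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


expanded = sorted(equivalents("dc:A", RELATION_SET))
print(expanded)
```

Because the relations are grouped into independent sets, an application could pass a different list of triples to obtain a different expansion of the same data category.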
     119
     120\subsection{Further parts of the infrastructure}
     121\label{cmdi-other}
    99122
    100123\subsubsection*{Schema Registry}
    101124
    102 SCHEMAcat is a registry for schemata of all kinds (not just XML-based) semantically annotated with data categories.
    103 
     125SCHEMAcat\furl{http://lux13.mpi.nl/schemacat/site/index.html} is a registry for schemas of all kinds (not just the CMD-based, in fact not even just XML-based) semantically annotated with data categories.
     126\begin{quotation}
    104127RELcat and SCHEMAcat will provide the means to harvest and specify this information in the form of relationships and allow
    105128(search) algorithms to traverse the semantic graph thus made explicit\cite{Schuurman2011_SCHEMAcat}.
    106 
    107 
    108 \subsection{Vocabulary Service / Reference Data Registry}
    109 
    110 \subsubsection{Motivation \& related activities in the community}
    111 The urgent need for reliable community-shared registry services for concepts, controlled vocabularies and reference data for both the LRT and Digital Humanities community has been discussed on many occasions in various contexts. Applications and tasks requiring or profiting from this kind of service comprise Data-Enrichment / Annotation, Metadata Generation, Curation, Data Analysis, etc. As there is a substantial overlap in the vocabularies relevant for the various communities and even more so a high potential for reusability on the technical level, there is a strong case for tight cooperation between different initiatives.
    112 
    113 In the context of the CLARIN initiative, one activity to tackle this issue -- mainly driven by CLARIN-NL -- is the project/taskforce \emph{CLAVAS - Vocabulary Alignment Service for CLARIN} where the plan is to reuse and enhance for CLARIN needs a SKOS-based  vocabulary repository and editor OpenSKOS\furl{http://openskos.org}, developed and run within the dutch program CATCHplus\footnote{\textit{Continuous Access To Cultural Heritage} - \url{http://www.catchplus.nl/en/}}. See below for a more detailed description of this system. As of spring 2013, the Standing Committee on CLARIN Technical Centres (SCCTC) adopted the issue of Controlled Vocabularies and Concept Registries as one of the infrastructural (A-centre) services to be dealt with.
    114 
    115 \begin{note}
    116 In parallel, within the sister ESFRI project DARIAH a taskforce with the same goal has been set up : \emph{Service for Reference Data and Controlled Vocabularies}. This taskforce was introduced at the 2nd VCC Meeting in Vienna in November 2012. It is conceived as a collaborative endeavor between VCC1/Task 5: Data federation and interoperability and VCC3/Task3: Reference Data Registries (and external partners). The main goal is to \emph{establish a service providing controlled vocabularies and reference data} for the DARIAH (and CLARIN) community.
    117 
    118 Thus there is a momentum and a high potential for a collaborative approach in at least these two big initiatives CLARIN and DARIAH, that serve a very wide-spread and diverse community.
    119 \end{note}
    120 
    121 \subsubsection{Abstract service description}
    122 As to the service itself it is primarily meant to serve other applications, rather than being used directly by end users, but a basic user interface is still necessary for administration etc.  By using global semantic identifiers instead of strings, such a service enables the harmonization of metadata descriptions and annotations and is an indispensable step towards semantic data and \xne{LOD}.
     129\end{quotation}
     130
     131\subsubsection*{Schema Parser}
     132Schema Parser is a service developed at the Meertens Institute, Amsterdam, that processes XML Schemas to generate all possible paths in the instance data. It is used primarily as auxiliary service to the search engine developed at the same institute, presented in the following subsection.
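
The idea of generating all possible paths can be illustrated with a strongly simplified sketch. It does not reflect the actual Schema Parser: a schema is reduced here to a nested (name, children) structure, whereas real XML Schemas involve types, choices, recursion and attributes. The function name \code{all\_paths} and the sample structure are hypothetical.

```python
def all_paths(node, prefix=""):
    """Return the paths of all elements in a nested (name, children) tree."""
    name, children = node
    path = f"{prefix}/{name}"
    paths = [path]
    for child in children:
        paths.extend(all_paths(child, path))
    return paths


# Hypothetical, heavily reduced profile structure.
schema = (
    "CMD",
    [
        ("Header", []),
        ("Components", [("Contact", [("Name", []), ("Email", [])])]),
    ],
)
paths = all_paths(schema)
for p in paths:
    print(p)
```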
     133
     134\subsubsection*{Metadata editors}
     135\label{md-editors}
     136
     137Metadata creation, i.e. the authoring of actual metadata records, is indisputably the fundamental task in the whole system.
     138Though not directly interacting with SMC, metadata editors need to be mentioned, i.e. the tools that human metadata editors use for authoring metadata.
     139
     140Given that the Component Registry generates an XML schema for every profile, basically any generic XML editor with schema validation can be used (e.g. the wide-spread \xne{oXygen}). However, there have been efforts within the CLARIN community to develop dedicated tools, tailor-made for the creation of CMD records.
     141Two examples are the stand-alone application \xne{Arbil}\cite{withers2012arbil}\furl{http://tla.mpi.nl/tools/tla-tools/arbil/}, being developed at the Max Planck Institute for Psycholinguistics, Nijmegen, and the web-based application developed within the project \xne{NaLiDa}\cite{dima2012mdeditor}\furl{http://www.sfs.uni-tuebingen.de/nalida/en/} at the Seminar für Sprachwissenschaft, University of Tübingen.
     142
     143
     144\subsection{CMDI - Exploitation side}
     145\label{cmdi_exploitation}
     146Metadata complying with the CMD data model is being created by a growing number of institutions  by various means -- automatic transformation from legacy data or authoring of new metadata records with the help of one of the metadata editors (cf. \ref{md-editors}). The CMD infrastructure requires the content providers to publish their metadata via the OAI-PMH protocol and announce the OAI-PMH endpoints.  These are being collected daily by a dedicated CLARIN harvester\furl{http://catalog.clarin.eu/oai-harvester/}. The harvested data is validated against the corresponding schemas (every profile implies a separate schema). In the future a subsequent normalization step will play a bigger role, currently only minimal ad-hoc label normalization is performed for a few organization names. Finally, the data is made (publicly) available as compressed archive files. These are being fetched by the exploitation side applications, that ingest the metadata records, index them and make them available for searching and browsing (cf. figure \ref{fig:cmd-ingestion}).
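
A harvesting request against an announced endpoint can be sketched as follows. \code{verb}, \code{metadataPrefix} and \code{set} are standard OAI-PMH request parameters; the endpoint URL and the prefix value \code{cmdi} are assumptions made for illustration, and the function name \code{list\_records\_url} is hypothetical. The sketch only builds the request URL and does not contact any server.

```python
from urllib.parse import urlencode


def list_records_url(endpoint, metadata_prefix, set_spec=None):
    """Build an OAI-PMH ListRecords request URL for the given endpoint."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec is not None:
        # Optional selective harvesting by set.
        params["set"] = set_spec
    return f"{endpoint}?{urlencode(params)}"


# Hypothetical endpoint; "cmdi" as metadata prefix is an assumption.
url = list_records_url("http://example.org/oai", "cmdi")
print(url)
```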
     147
     148\begin{figure*}[!ht]
     149\begin{center}
     150\includegraphics[width=0.8\textwidth]{images/CMDingestion_woVAS}
     151\caption{Within CMDI, metadata is harvested from content providers via OAI-PMH and made available to consumers/users by search applications}
     152\label{fig:cmd-ingestion}
     153\end{center}
     154\end{figure*}
     155
     156The first stable and publicly available application providing access to the collected metadata of CMDI has been the \xne{VLO - Virtual Language Observatory}\furl{http://www.clarin.eu/vlo/}\cite{VanUytvanck2010}, developed by the Technical Group at the MPI for Psycholinguistics, Nijmegen, based on the wide-spread full-text search engine \xne{Apache Solr}\furl{http://lucene.apache.org/solr/}.
     157The application employs a faceted search with 10 fixed facets (figure \ref{fig:vlo}).
     158As the processed metadata records are instances of different CMD profiles and thus have very differing structures, to map the fields in the records onto the facets the application relies on the data category references in the underlying schemas, effectively making use of this basic layer of semantic  interoperability provided by the infrastructure.
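
The mapping of differently structured records onto common facets via data category references can be sketched as follows. This is not the VLO implementation: the facet definitions, profile names, paths and data category identifiers are all hypothetical and serve only to show the principle.

```python
# Hypothetical facet definitions: each facet lists the data categories it covers.
FACET_CONCEPTS = {
    "language": {"dc:language-id", "dc:language-name"},
    "organisation": {"dc:organisation"},
}

# Hypothetical profiles: field paths annotated with a data category reference.
PROFILE_FIELDS = {
    "ProfileA": {
        "/CMD/Components/A/Lang/Code": "dc:language-id",
        "/CMD/Components/A/Publisher": "dc:organisation",
    },
    "ProfileB": {
        "/CMD/Components/B/Language/Name": "dc:language-name",
    },
}


def facet_paths(facet_concepts, profile_fields):
    """Map every facet to the (profile, path) pairs whose data category it covers."""
    mapping = {facet: [] for facet in facet_concepts}
    for profile, fields in profile_fields.items():
        for path, datcat in fields.items():
            for facet, concepts in facet_concepts.items():
                if datcat in concepts:
                    mapping[facet].append((profile, path))
    return mapping


mapping = facet_paths(FACET_CONCEPTS, PROFILE_FIELDS)
print(mapping["language"])
```

The point of the design is that no pairwise crosswalk between the two profiles is needed: both are linked only through the shared data categories.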
     159
     160\begin{figure*}[ht]
     161\begin{center}
     162\includegraphics[width=0.8\textwidth]{images/screen_VLO_overview.png}
     163\caption{screenshot of the faceted browser of the VLO}
     164\label{fig:vlo}
     165\end{center}
     166\end{figure*}
     167
     168More recently, the team at the Meertens Institute developed a similar application, the \texttt{MI Search Engine}\furl{http://www.meertens.knaw.nl/cmdi/search/}. It too is based on Apache Solr and provides a faceted search, but with a substantially more sophisticated indexing process and search interface \cite{zhang2012cmdi}. Instead of reducing the data into a fixed number of indexes or facets, the application employs the aforementioned \xne{Schema Parser} to dynamically generate an index configuration that covers all data, again relying on the data categories to merge information from semantically equivalent metadata fields in the different schemas into a common index.
     169The application also offers some innovative solutions on the user interface, like search by similarity, content-first search or specialized contextual widgets visualizing the time dimension, the geographic information and other derived data.
     170% \todoin { describe indexing and search}
     171
     172And finally, there is the \xne{Metadata Repository}, being developed by the author as an XQuery application in the XML database \xne{eXist}, originally (in the initial blueprints of the infrastructure) foreseen as the main storage of the collected metadata, with the \xne{Metadata Service} on top providing search access to the data, optionally applying \xne{Semantic Mapping} to expand user queries (cf. figure \ref{fig:cmdi-old}). \cite{Durco2011}
     173However, the application has not yet reached production quality and is used rather as an experimental field by the author. Meanwhile, the functionality of the Metadata Service has been integrated directly into the Metadata Repository together with the auxiliary use of Semantic Mapping, making it the implementation of the semantic search module as proposed in this work (cf. \ref{sec:qx}).
     174
     175%%%%%%%%%%%%%%%%%%%%
     176\section{Vocabulary Service / Reference Data Registries}
     177\label{sec:cv}
     178
     179\subsection{Motivation \& broader context}
     180The provisions for data harmonization and semantic interoperability as presented until now pertain mostly to the schema level. However, the problem of incoherent labeling and nomenclature is even more virulent in the actual metadata fields on the instance level. While for a number of fields the value domain can be enforced through schema validation, many fields (e.g. \concept{organization} or \concept{resource type}) have a constrained value domain that nevertheless cannot be explicitly and exhaustively enumerated. This leads to a chronically inconsistent use of labels for referring to entities (as the instance data shows, some organizations are referred to by more than 20 different labels or spelling variants), prompting an urgent need for better means of harmonizing the constrained-field values.
     181
     182This issue is to be seen in a broader context of a general need for reliable community-shared registry services for concepts, controlled vocabularies and reference data in both the LRT and Digital Humanities community, applicable in a range of applications and tasks like data enrichment and annotation, metadata generation and curation, data analysis, etc.
     183Moreover, by using global semantic identifiers instead of strings, such a service enables the harmonization of metadata descriptions and annotations and is an indispensable step towards transformation of this data into \emph{Linked Open Data}.
     184
     185Consequently, activities with regard to controlled vocabularies are ongoing not only in CLARIN, but also within the sister ESFRI project DARIAH. As there is a substantial overlap in the vocabularies relevant for the various communities and even more so a high potential for reusability on the technical level, there is a strong case for tight synergic cooperation between individual initiatives.
     186
     187It also has to be kept in mind that a host of work on controlled vocabularies has already been done and a large body of data is present in individual specialized communities (taxonomies) as well as -- with more general scope -- in the libraries world (authority files).
     188
     189\begin{comment}
    123190Besides providing vocabularies, the service should also hold and expose equivalences (and other relationships) between concepts from different vocabularies (concept schemes). These relationships come primarily from existing mappings, but can (and hopefully will) be subsequently generated (manually) for specific subsets on demand in a community process. An example for equivalences from Wikipedia\footnote{\href{http://de.wikipedia.org/wiki/Johann_Wolfgang_von_Goethe}{page for J. W. Goethe}}:
    124191\begin{verbatim}
     
    126193NDL: 00441109 | VIAF: 24602065
    127194\end{verbatim}
    128 
    129 \subsubsection{Vocabulary Service - CLAVAS}
     195\end{comment}
     196
     197\subsection{Implementation -- OpenSKOS/CLAVAS}
    130198\label{def:CLAVAS}
    131 As described in previous section (\ref{def:DCR}), a solid pillar for defining and maintaining data categories is the ISOcat data category registry. However, while ISOcat has been in productive use for some time, it is – by design – not usable for all kinds of reference data. In general, it suits well for defining concepts/data categories (with closed or open concept domains), but its complex data model and standardization workflow does not lend itself well to maintain “semi-closed'' concept domains, controlled vocabularies, like lists of entities (e.g. organizations or authors). In such cases, the concept domain is not closed (new entities need to be added), but it is also not open (not any string is a valid entity). Besides, the domain may be very large (millions of entities) and has to be presumed changing (especially new entities being added).
    132 
    133 This shortcoming leads to a need for an additional registry/repository service for this kind of data (controlled vocabularies). Within the CLARIN project mainly the abovementioned taskforce \emph{CLAVAS} is concerned with this challenge.
    134 The foundation is the vocabulary repository and editor OpenSKOS\furl{http://openskos.org}.
    135 
    136 This repository can serve as a project independent manager and provider of controlled vocabularies.
    137 One important feature of the OpenSKOS system is its distributed nature. It allows individual instances to synchronize the maintained vocabularies among each other via OAI-PMH protocol. This caters for a reliable redundant system, as multiple instances would provide identical synchronized data, while the primary responsibility for individual vocabularies could lie with different instances/organizations based on their specialization, field of expertise.
    138 
    139 Currently, the Meertens Institute\furl{http://meertens.knaw.nl/} of the Dutch Royal Academy of Sciences (KNAW), as well as Netherlands Institute for Sound and Vision\furl{http://www.beeldengeluid.nl/} are running an instance of OpenSKOS.
    140 As the work on this vocabulary repository started in the context of a cultural heritage program, originally it served vocabularies not directly relevant for the LRT-community \emph{GTAA - Gemeenschappelijke Thesaurus Audiovisuele Archieven} or \emph{AAT - Art \& Architecture Thesaurus}\furl{http://openskos.org/api/collections}. As part of the process of adaptation to the needs of CLARIN and LRT-community data categories from \xne{ISOcat} have been converted into SKOS-format and ingested into the system.
    141 \xne{Austrian Centre for Digital Humanities} is also running a prototypical instance of the OpenSKOS system with ISOcat data.
    142 
    143 A plan has been developed/adopted to support further vocabularies relevant for the community.
    144 Following are those to be handled in short-term, in order of urgency/relevance/prirority:
     199
     200In the context of CLARIN (primarily CLARIN-NL), a concrete initiative has been conducted -- \xne{Vocabulary Alignment Service for CLARIN} or CLAVAS -- with the objective of reusing and enhancing, for CLARIN needs, the SKOS-based vocabulary repository and editor \xne{OpenSKOS}\furl{http://openskos.org}, developed and run within the Dutch program \xne{CATCHplus}\footnote{\textit{Continuous Access To Cultural Heritage} - \url{http://www.catchplus.nl/en/}}.
     201
     202%As of spring 2013, the Standing Committee on CLARIN Technical Centres (SCCTC) adopted the issue of Controlled Vocabularies and Concept Registries as one of the infrastructural (A-centre) services to be dealt with.
     203
     204The basic idea of this repository is to serve as a project independent manager and provider of controlled vocabularies, as an exchange platform for data in SKOS format.
     205One important feature of the \xne{OpenSKOS} system is its distributed architecture. Multiple instances can be set up that synchronize the maintained vocabularies with each other via the OAI-PMH protocol. This makes for a reliable redundant system, in which multiple instances provide identical synchronized data, with the organizations behind individual instances assuming the primary responsibility for individual vocabularies based on their specialization or field of expertise.
     206
     207Currently, the Meertens Institute\furl{http://meertens.knaw.nl/} of the Dutch Royal Academy of Sciences (KNAW), the Netherlands Institute for Sound and Vision\furl{http://www.beeldengeluid.nl/}, as well as the Austrian Centre for Digital Humanities at the Austrian Academy of Sciences are each running an instance of the OpenSKOS system.
     208
     209As the work on this vocabulary repository started in the context of a cultural heritage program, originally it served vocabularies not directly relevant for the LRT-community, such as \concept{GTAA - Gemeenschappelijke Thesaurus Audiovisuele Archieven} or \concept{AAT - Art \& Architecture Thesaurus}\furl{http://openskos.org/api/collections}.  Within CLAVAS, a number of vocabularies relevant for CLARIN and the LRT-community were identified that will be gradually integrated into the vocabulary repository. (See \ref{refdata} for a more complete list of required reference data together with candidate existing vocabularies.) The following vocabularies have already been integrated into the \xne{CLAVAS} instance of OpenSKOS:
    145210\begin{itemize}
    146211\item the list of language codes\cite{ISO639}
    147 \item country codes
    148212\item organization names for the domain of language resources
     213\item a number of data categories from ISOcat (see \ref{sec:export-dcr} for details of the process)
    149214\end{itemize}
    150215
    151 See \ref{refdata} for a more complete list of required reference data together with candidate existing vocabularies
    152 and \ref{interaction-dcr-skos} for discussion on mapping the information about data categories from \xne{ISOcat} to \xne{SKOS}.
    153 
    154 \subsection{Interaction between DCR, VAS and client applications}
    155 \label{interaction-dcr-skos}
    156 
    157 DCR recognizes following types of data categories (Figure \ref{fig:dc_type}):
    158 \code{simple, complex: closed, open, constrained, (container)?}
    159 
    160 \begin{figure*}[!ht]
    161 \begin{center}
    162 \includegraphics[width=0.7\textwidth]{images/dc_types}
    163 \end{center}
    164 \caption{Data Category types}
    165 \label{fig:dc_type}
    166 \end{figure*}
    167 \todocite{DC types - ISOcat introduction at CLARIN-NL Workshop}
    168 
    169 See \ref{fig:DCR_data_model} for full DCR data model.
    170 
    171 \subsubsection{Export DCR to SKOS}
    172 \cite{Menzo2013mail}
    173 
     216\subsection{Export DCR to SKOS}
     217\label{sec:export-dcr}
     218
     219Based on the premise that the data in DCR also represents a kind of controlled vocabulary, there is an effort to export data categories in SKOS format and import them into the Vocabulary Service.
     220
     221Note that there are two interaction paths between ISOcat and the Vocabulary Service. The first, importing certain data categories from ISOcat into the Vocabulary Service, is described in this section. The second (described in the next section, \ref{interaction-dcr-skos}) is that the value domains of certain data categories are defined by reference to a vocabulary maintained in the Vocabulary Service.
    174222
    175223The fact that data categories are basically definitions of concepts may mislead to
    176 a na"ive approach to mapping DCR to SKOS, namely mapping every data category to a \code{skos:Concept}
    177 all of them belonging to the \xne{ISOcat:ConceptScheme}.
    178 However this is not practical/useful, ISOcat as whole is too disparate, and so would be the resulting vocabulary.
    179 
    180 A more sensible approach is to export only closed DCs as separate \code{skos:ConceptSchemes} and their respective simple DCs as \code{Concepts} within that scheme.
     224a na\"{i}ve approach to mapping DCR data to SKOS, namely mapping every data category to a \code{skos:Concept}
     225all of them belonging to the \code{ISOcat:ConceptScheme}. However, the data in ISOcat as a whole is too disparate in scope for such a vocabulary to be useful.
     226
     227A more sensible approach is to export only closed DCs (with an explicitly defined value domain, cf. \ref{def:DCR}) as separate \code{skos:ConceptSchemes} and their respective simple DCs as \code{skos:Concepts} within that scheme.
    181228
    182229\begin{quotation}
     
    184231field/element/attribute, complex DCs in ISOcat are the users of such
    185232vocabularies and simple DCs the DCR equivalence of values in such a
    186 vocabulary.
    187 \end{quotation}\cite{Menzo2013mail}
    188 
    189 Another aspect is, that a simple DC can be in value domains of multiple closed DCs.
    190 Also a \code{skos:Concept} can belong to multiple \code{skos:ConceptSchemes}\furl{http://www.w3.org/TR/skos-primer/\#secscheme}.
    191 So there could a 1:1 mapping [complex closed DCs] to [skos:ConceptSchemes] and [simple DCS] to [skos:Concepts].
    192 That would automatically convey also the possibly multiplicate membership of simple DCs / skos:Concepts in closed DCs / skos:ConceptSchemes.
    193 
    194 Alternatively, for each value domain a SKOS concept scheme with SKOS concepts can be created,
    195 i.e., a SKOS concept always belongs to one concept schema, but multiple SKOS concepts refer to the same simple DC using <dcr:datcat/> (and <dcterms:source/>).
    196 This is, how the export for CLAVAS currently works.\furl{http://www.isocat.org/rest/profile/5.clavas}\furl{https://trac.clarin.eu/browser/cats/ISOcat/trunk/mod-ISOcat-interface-rest
    197 /representations/dcs2/clavas.xsl}
    198 
    199 
    200 
    201 \begin{figure*}[!ht]
    202 \begin{center}
    203 \includegraphics[width=0.6\textwidth]{images/export_DCR2SKOS.png}
    204 \end{center}
    205 \caption{The data flow and linking between schema, data categories and vocabularies}
    206 \label{fig:export_dcr2skos}
    207 \end{figure*}
    208  
    209 Open or constrained DCs are not exported as they don't provide anything to a vocabulary.
    210 There is no need to express the relationship between this constrained DC
    211 and the vocabulary in CLAVAS itself.
    212 Indeed it is not possible to express the conceptualDomain/range of a data category within SKOS.
    213 
    214 However, they can refer to a CLAVAS vocabulary. Indeed, providing vocabularies for constrained but large and complex conceptual domains is the main motivation for the vocabulary repository.
    215 
    216 However it needs to be yet assessed how useful this approach is. In the metadata profile
    217 there are many closed DCs with small value domains. How useful are those
    218 in CLAVAS?
    219 
    220 Originally, the vocabulary repository has been conceived to manage rather large and complex value domains, that do not fit easily in the DCR data-model.
    221 Where the value domains are big (ISO 639-3) or can only be
    222 partially enumerated (organization names) ISOcat can't/shouldn't contain
    223 the value domains but just refer to CLAVAS, i.e., ISOcat wouldn't be a
    224 provider.
     233vocabulary.\cite{Menzo2013mail}
     234\end{quotation}
     235
     236\begin{comment}
    225237Still there are some closed DCs which might be good vocabulary
    226238providers, e.g., /linguistic subject/ (DC-2527/), and still also need to
     
    230242then 20, 50 or 100 values are exported.
    231243
    232 
    233 \subsubsection{Vocabulary linking and use}
    234 Currently (before integration of VAS and DCR), the only possibility to constrain the value domain of a data category
    235 is by the means a XML Schema provides \todoin{check xml schema possibilities to restrict values}, like a regular expression. So for the data category \concept{languageID DC-2482}
    236 the rule looks like:
     244However it needs to be yet assessed how useful this approach is. In the metadata profile
     245there are many closed DCs with small value domains. How useful are those
     246in CLAVAS?
     247\end{comment}
     248
     249\begin{figure*}
     250\begin{center}
     251\includegraphics[width=0.6\textwidth]{images/export_DCR2SKOS.png}
     252\end{center}
     253\caption{The wrong and correct variant of exporting ISOcat data categories in SKOS format to the Vocabulary Service}
     254\label{fig:export_dcr2skos}
     255\end{figure*}
     256
     257Another aspect is, that a simple DC can be in value domains of multiple closed DCs.
     258Also a \code{skos:Concept} can belong to multiple \code{skos:ConceptSchemes}\furl{http://www.w3.org/TR/skos-primer/\#secscheme}.
     259So there could be a 1:1 mapping of [complex closed DCs] to [skos:ConceptSchemes] and of [simple DCs] to [skos:Concepts].
     260That would automatically also convey the possibly multiple membership of simple DCs / skos:Concepts in closed DCs / skos:ConceptSchemes.
     261
     262Alternatively, for each value domain a SKOS concept scheme with SKOS concepts can be created,
     263i.e., a SKOS concept always belongs to one concept scheme, but multiple SKOS concepts refer to the same simple DC using \code{<dcr:datcat/>} (and \code{<dcterms:source/>}).
     264This is, how the export for CLAVAS currently works.\furl{http://www.isocat.org/rest/profile/5.clavas}\furl{https://trac.clarin.eu/browser/cats/ISOcat/trunk/mod-ISOcat-interface-rest
     265/representations/dcs2/clavas.xsl}
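
As a rough illustration of this second variant, the following sketch shows two value domains exported as separate concept schemes, each with its own concept pointing back to the same simple DC. All URIs are placeholders, and expressing the back-reference as \code{dcterms:source} with an \code{rdf:resource} attribute is an assumption made for this sketch only; the actual serialization produced by the CLAVAS export may differ.

\lstset{language=XML}
\begin{lstlisting}
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:skos="http://www.w3.org/2004/02/skos/core#"
         xmlns:dcterms="http://purl.org/dc/terms/">
  <!-- one concept scheme per value domain (placeholder URIs) -->
  <skos:ConceptScheme rdf:about="http://example.org/scheme/closedDC-A"/>
  <skos:ConceptScheme rdf:about="http://example.org/scheme/closedDC-B"/>

  <!-- each concept belongs to exactly one scheme ... -->
  <skos:Concept rdf:about="http://example.org/scheme/closedDC-A/value1">
    <skos:prefLabel xml:lang="en">value one</skos:prefLabel>
    <skos:inScheme rdf:resource="http://example.org/scheme/closedDC-A"/>
    <!-- ... but both concepts refer to the same simple DC -->
    <dcterms:source rdf:resource="http://example.org/datcat/DC-0000"/>
  </skos:Concept>
  <skos:Concept rdf:about="http://example.org/scheme/closedDC-B/value1">
    <skos:prefLabel xml:lang="en">value one</skos:prefLabel>
    <skos:inScheme rdf:resource="http://example.org/scheme/closedDC-B"/>
    <dcterms:source rdf:resource="http://example.org/datcat/DC-0000"/>
  </skos:Concept>
</rdf:RDF>
\end{lstlisting}

In the 1:1 variant, by contrast, a single \code{skos:Concept} would carry two \code{skos:inScheme} statements, one per closed DC.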
     266
     267
     268\subsection{Linking to vocabularies in data categories and schemas -- interaction between ISOcat, CLAVAS and client applications}
     269\label{interaction-dcr-skos}
     270
     271In the following, we elaborate on the possible ways to model references to vocabularies in data category specification and to
     272convey that information to the client application. At the time of writing, this is work in progress with some design decisions yet to be made.\footnote{Large parts of this subsection come from email correspondence with M. Windhouwer in spring 2013.\cite{Menzo2013mail}}
     273
     274Providing vocabularies for constrained but large and complex conceptual domains is the main motivation for the vocabulary repository:
     275
     276\begin{quotation}
     277Originally, the vocabulary repository has been conceived to manage rather large and complex value domains, that do not fit easily in the DCR data model. Where the value domains are big (ISO 639-3) or can only be
     278partially enumerated (organization names) ISOcat can't/shouldn't contain
     279the value domains but just refer to CLAVAS, i.e., ISOcat wouldn't be a
     280provider.\cite{Menzo2013mail}
     281\end{quotation}
     282
     283Currently, the only possibility to constrain the value domain of a data category
     284is by the means an XML Schema provides, like enumeration or regular expression. So for the data category \concept{languageID\#DC-2482} the rule looks like:
    237285\lstset{language=XML}
    238286\begin{lstlisting}
     
    244292\end{lstlisting}
    245293
    246 A current proposal by Windhouwer\cite{Menzo2013mail} for integration with CLAVAS foresees following extension:
     294A proposal by Windhouwer\cite{Menzo2013mail} for integration with CLAVAS foresees the following extension:
    247295
    248296\begin{lstlisting}
     
    250298\end{lstlisting}
    251299
     300\begin{quotation}
    252301\code{@href} points to the vocabulary. Actually a PID should be used in the context
    253302of ISOcat, but it is not clear how persistent are the vocabularies. This may pose a problem as part of DC specification may now have a different persistency then the core.
     
    255304\code{@type} could be \code{closed} or \code{open}. \code{closed}: only values in the vocabulary are
    256305valid. \code{open}: the values in the vocabulary are hints/preferred values. Basically the DC itself is then open.
    257 
    258 This would yield a definition of the conceptualDomain for the data category as follows:
     306\end{quotation}
     307
     308This yields a definition of the value domain for the data category, where the new rule pointing to the vocabulary is \emph{added} (cf. listing \ref{lst:dcif-conceptualDomain}), so that -- once the information from the DC specification gets into the schema -- tools that don't support vocabulary lookup but are capable of XSD/RNG validation, can still use the regular expression based definition.
    259309 
    260310\lstset{language=XML}
    261 \begin{lstlisting}
     311\begin{lstlisting}[label=lst:dcif-conceptualDomain, caption=definition of conceptualDomain for the data category \concept{languageID} employing the proposed extension for pointing to a vocabulary]
    262312  <dcif:conceptualDomain type="constrained">
    263313     <dcif:dataType>string</dcif:dataType>
     
    274324\end{lstlisting}
    275325
    276 I.e. the new rule pointing to the vocabulary would be \emph{added}, so that tools that don't support CLAVAS lookup but are capable of XSD/RNG validation, can still use the regular expression based definition.
    277 
    278 \begin{note}
    279 Integrate:
    280 
    281 ISOcat refers to CLAVAS as a hint, the metadata schema is the final one that has the real CLAVAS vocabulary reference, i.e., no reference to CLAVAS via ISOcat.
    282 \end{note}
    283 
    284 Note though, that anything stated in the DC specification is not binding,
    285 but rather a generic hint or recommendation, \todoin{check: it is not ``normative''}.
    286 (Even if the DC is closed.) The authoritative/normative information is in the schema.
    287 A schema modeler, (concept)linking an element in the schema
    288 to a DC can decide to have another restriction for the values allowed
    289 in that element. The information from DCR serves as recommendation or default.
    290 
    291 
    292 \begin{figure*}[!ht]
     326\begin{figure*}[ht]
    293327\begin{center}
    294328\includegraphics[width=0.7\textwidth]{images/concept_linking.png}
    295329\end{center}
    296 \caption{The data flow and linking between schema, data categories and vocabularies}
     330\caption{The linking between schemas, data categories and vocabularies}
    297331\label{fig:concept_linking}
    298332\end{figure*}
    299  
    300 
    301 \paragraph {Modelling the vocabulary reference in the schema}
    302 It needs to be yet defined how the information about the vocabulary can be translated into a valid schema representation.
    303 One brute-force approach would be to explicitely enumerate all the values from the vocabulary. This is being currently done
    304 within the CMD-framework with the language-codes\furl{http://catalog.clarin.eu/ds/ComponentRegistry/?item=clarin.eu:cr1:c_1271859438110}. However there is clearly a limit to this approach both in terms of size of the vocabulary (ISO-639 contains 7.679 items (language codes)  adding some 2MB to each schema referencing it) and its stability/change rate --- ISO-639 is a standard with a fixed list, however most other vocabularies are more volatile (think organization). And even this supposedly fixed list undergoes regular changes -- it is being updated semi-annually, with entries being added, deleted, merged and split.\furl{http://www-01.sil.org/iso639-3/changes.asp}
    305 
    306 Most of these vocabularies also cannot be seen as closed-constrained, i.e. the list that is provided, provides a recommended orthography variant for a given entity, still allowing other values for given field rather than resricting the values to only the items from the vocabulary (think organizations).
    307 
    308 So this has to be solved in ``soft'' way. Most schema languages allow to annotate the schema.
    309 This is already used with DCR, adding the \code{@dcr:datcat} into schema elements.
    310 Also CMDI (ComponentRegistry when generating schemas) puts information in \code{<xs:appinfo/>}.
    311 
    312 Tools like Arbil can get access to these annotations, e.g., a reference to a CLAVAS vocabulary, and act upon
    313 it, i.e., use OpenSKOSs autocomplete API.
    314 Normal XSD validation then wouldn't validate if a value actually is part of the vocabulary. This
    315 isn't a problem if the vocabulary is open, e.g., organisation names, but
    316 it is when the value domain is closed, e.g., ISO 639-3. In the latter case
    317 the XSD generation might have two modes: a lax (smaller) version which
     333
     334It is important to emphasize that anything stated in the DC specification is not binding (even if the DC is of type \var{closed}), but rather a non-normative hint or recommendation. The authoritative source is the schema. A schema modeller binding an element in a schema to a data category can still decide to place other restrictions on the value domain of that element than the ones suggested in the DC specification. This applies equally to the proposed vocabulary reference mechanism: the author of the data category suggests a vocabulary to be used for values of a given data category, but the metadata modeller decides if and how this vocabulary will be integrated into the modelled schema.
     335
     336There are basically two options, how the vocabulary can be integrated into the schema.
     337One approach is to explicitly enumerate all the values from the vocabulary.
     338Within CMD this has been done in the component for language-codes\furl{http://catalog.clarin.eu/ds/ComponentRegistry/?item=clarin.eu:cr1:c_1271859438110}. This method allows strict validation of a given metadata field; however, there is clearly a limit to this approach in terms of a) size of the vocabulary\footnote{e.g. \xne{ISO-639} contains 7.679 items (language codes) adding some 2MB to each schema referencing it}, b) completeness -- most of the vocabularies cannot be seen as closed, i.e. they represent only a partial enumeration just providing a recommended label for an entity,
     339and c) stability or change rate -- even the supposedly fixed list of language-codes \xne{ISO-639-*} undergoes regular changes -- it is being updated semi-annually, with entries being added, deleted, merged and split.\furl{http://www-01.sil.org/iso639-3/changes.asp}
     340
     341The other, ``soft'' alternative is to convey the information about data category and vocabulary in the schema as annotation, either in an \code{<xs:appinfo>} element or by some attribute in a dedicated namespace. This method is already being employed in the Component Registry, indicating the data category of a generated element with the \code{@dcr:datcat} attribute.
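
A minimal sketch of what such an annotated element declaration could look like follows. The \code{@dcr:datcat} attribute is the one already in use; the \code{ex:vocabulary} element, its namespace and the vocabulary URI are hypothetical and only illustrate how the proposed \code{@href} and \code{@type} information might be carried inside \code{<xs:appinfo>}. The form of the data category URI is likewise an assumption.

\lstset{language=XML}
\begin{lstlisting}
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:dcr="http://www.isocat.org/ns/dcr"
           xmlns:ex="http://example.org/ns/vocabulary">
  <!-- the data category is referenced by an attribute
       in a dedicated namespace -->
  <xs:element name="languageID" type="xs:string"
              dcr:datcat="http://www.isocat.org/datcat/DC-2482">
    <xs:annotation>
      <xs:appinfo>
        <!-- hypothetical vocabulary reference; ignored by plain
             XSD validation, interpreted only by tools that know
             this convention -->
        <ex:vocabulary href="http://example.org/vocabulary/languages"
                       type="closed"/>
      </xs:appinfo>
    </xs:annotation>
  </xs:element>
</xs:schema>
\end{lstlisting}

Since the element is typed as a plain \code{xs:string} here, a validator accepts any value; checking the value against the vocabulary is left entirely to the client application.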
     342
     343Once the data category and vocabulary reference end up in the specification of the CMD profile and the derived XSD, the information can finally be used by client applications (such as a metadata editor)\footnote{Note though, that this is not a standard mechanism but rather a convention. The client application must implement it in order to be able to make use of it.}. The tool
     344can use the reference to the data category to fetch explanations (semantic information)  (and translations) from ISOcat and it can access the autocomplete/search interface of the Vocabulary Service to offer the user suggestions from the recommended vocabulary (cf. figure \ref{fig:concept_linking}).
     345
     346The drawback of this variant is that validation is given up. This
     347isn't a problem if the vocabulary is of \code{@type=open}, e.g. \concept{organisation names}, but
     348it is when the value domain is closed, e.g. \concept{languageId}. In the latter case,
     349the XSD generation could support both modes: a lax (smaller) version which
    318350doesn't contain the closed vocabulary as an enumeration and leaves it to
    319351the tool, and a strict version which does contain the vocabulary as an
    320 enumeration. Probably the latter should stay the default, but Arbil could
     352enumeration. Probably the latter should stay the default, but the client application could
    321353request the lax version leading to smaller and quicker XSD validation
    322354inside the tool.
    323355
    324 With this proposal, ISOcat constrained DCs can refer to a CLAVAS vocabulary as a way to constrain (we stretch this a bit if a vocabulary is 'open', e.g., like organization names where it provides the preferred spelling of known organizations but still has to be possible to add new organization names, not in the vocabulary).
    325 
    326  In ISOcat, such constraints have the same status as, for example, the data type, which is that ISOcat just provides hints it has no way to enforce this. Look at CMDI where the CMDI elements refer to a ISOcat DC via a concept link but they may have a completely different data type. In an ideal world the Component Editor would take over the data type and the CLAVAS vocabulary from the linked DC specification. This way the reference to the CLAVAS vocabulary ends up in the CMD component/profile specification and the derived XSD, and can be used by tools that support CLAVAS, e.g., Arbil (well its in the planning).
    327 
    328 \begin{note}
    329 \noindent
    330 something similar for the link to an EBNF grammar in SCHEMAcat:
    331 
    332 %\begin{lstlisting}
    333 \begin{verbatim}
    334       <scr:valueSchema
    335                xmlns:scr="http://www.isocat.org/ns/scr"
    336                pid="http://hdl.handle.net/1839/00-SCHM-0000-0000-004A-A"
    337                type="ISO 14977:1996 EBNF"/>
    338 \end{verbatim}
    339 %\end{lstlisting}
    340 \end{note}
    341 
    342 
    343 Finally, the client application (e.g. a metadata editor) is configured/guided by the schema.
    344 It can use the reference to the DC to fetch explanations (semantic information)  (and translations) from ISOcat, but it is bound to the value range as restricted by the schema.
    345 
    346 \subsection{CMDI - Exploitation side}
    347 \label{cmdi_exploitation}
    348 Metadata complying to the CMD-framework is being created by a growing number of institutions  by various means, automatic transformation from legacy data, authoring of new metadata records with the help of one of the Metadata-Editors (TODO: cite: Arbil, NALIDA, ). The CMD-Infrastructure requires the content providers to publish their metadata via the OAI-PMH protocol and announce the OAI-PMH endpoints.  These are being harvested daily by a dedicated CLARIN harvester\footnote{\url{http://catalog.clarin.eu/oai-harvester/}}. The harvested data is validated against the schemas \todoin{What about Normalization?}.  and made available in packaged datasets. These are being fetched by the exploitations side components, that index the metadata records and make them available for searching and browsing.
    349 
    350 \begin{figure*}[!ht]
    351 \includegraphics[width=0.8\textwidth]{images/CMDingestion_woVAS}
    352 \caption{Within CMDI, metadata is harvested from content providers via OAI-PMH and made available to consumers/users by exploitation side components}
    353 \end{figure*}
    354 
    355 
    356 The first stable and publicly available application providing access to the collected metadata of CMDI has been the \texttt{VLO - Virtual Language Observatory}\footnote{\url{http://www.clarin.eu/vlo/}}\cite{VanUytvanck2010}, being developed within the CLARIN project. This application operates on the same collection of data as is discussed in this work, however it employs a faceted search, mapping manually the appropriate metadata fields from the different schemas to 10? fixed facets. Underlying search engine is the widely used full-text search engine Apache Solr\footnote{\url{http://lucene.apache.org/solr/}}. Although this is a very reductionist approach it is certainly a great starting point offering a core set of categories together with an initial set of category mappings.
    357 
    358 More recently, the team at Meertens Institute developed a similar application the \texttt{MI Search Engine}\furl{http://www.meertens.knaw.nl/cmdi/search/}. It too is based on the Apache Solr and provides a faceted search, but with a substantially more sophisticated both indexing process and search interface. \todoin { describe indexing and search}
    359 \todocite {MI Search Engine}
    360 
    361 And finally, there is the \emph{Metadata Repository} aimed to collect all the harvested metadata descriptions from CLARIN centres,
    362 and \emph{Metadata Service} that provides search access to this body of data. As such, Metadata Service is the primary application to use Semantic Mapping, to optionally expand user queries before issuing a search in the Metadata Repository. \cite{Durco2011}
    363 
    364 \section{Content Repositories}
    365 Metadata is only one aspect of the availability of resources. It is the first step to announce and describe the resources. However it is of little value, if the resources themselves are not equally well accessible. Thus another pillar of the CLARIN infrastructure are Content Repositories - centres to ensure availability of resources.
    366 
    367 RDF-stores in Content Repositories (Fedora, ..)
    368 
    369 The requirements for these repositories: PIDs, CMD, OAI-PMH
    370 \todocite{center-B paper}
    371 
    372 \section{Distrbuted system - federated search}
    373 
    374 Metadata -> harvesting via OAI-PMH, but Content search has to be really distributed.
    375 
    376 \begin{description}
    377 \item[Z39.50/SRU/SRW/CQL] LoC
    378 \item[OAI-PMH]
    379 \end{description}
     356%However for the presumably default (and recommended) scenario, where the modeller wants to use the information from the data category, the \xne{Component Editor} could offer to take over the data type and the vocabulary reference from the linked DC specification.
     357
     358
     359%%%%%%%%%%%%%%%%%
     360\section{Other aspects of the infrastructure}
     361While this work concentrates solely on the metadata, it needs to be recognized that metadata is only one aspect of the infrastructure, whose actual purpose is the availability of resources. Metadata is a necessary first step to announce and describe the resources. However, it is of little value if the resources themselves are not accessible.
     362
     363Consequently, another pillar of the CLARIN infrastructure are the centres\furl{http://www.clarin.eu/node/3812}:
     364\begin{quotation}
     365CLARIN's distributed network is made out of centres. These units, often a university or an academic institute, offer the scientific community access to services on a sustainable basis.
     366\end{quotation}
     367
     368CLARIN imposes a number of criteria, that each centre needs to fulfill to become a CLARIN Centre\furl{http://www.clarin.eu/node/3767}\cite{CE-2013-0095}.
     369CLARIN also maintains a central registry, the \xne{Centre Registry}\furl{https://centerregistry-clarin.esc.rzg.mpg.de/}, maintaining structured information about every centre, meant as primary entry point into the CLARIN network of centres.
     370
     371One core service of such centres are the content repositories, systems meant for long-term preservation and publication of research data and resources.
     372
     373
     374\begin{figure*}
     375\begin{center}
     376\includegraphics[width=0.7\textwidth]{images/FCS_components.png}
     377\end{center}
     378\caption{components of the Federated Content Search}
     379\label{fig:fcs}
     380\end{figure*}
     381
     382Another aspect of the availability of resources is that, while metadata can be harvested and indexed locally in one repository, this is not possible with the content itself, partly due to the size of the data, but mainly due to legal obligations (licenses, copyright) restricting the access to and availability of the resources. CLARIN's answer to this problem is the task force \emph{Federated Content Search}\furl{http://www.clarin.eu/fcs}\cite{stehouwer2012fcs}, aiming at establishing an architecture that allows searching simultaneously (via the aggregator) across a number of resources hosted by different content providers via a harmonized interface adhering to a common protocol. The agreed upon protocol is a compatible extension of the SRU/CQL protocol developed and endorsed by the Library of Congress as the XML- (and web)based successor of Z39.50. The maintenance of SRU/CQL has been
     383transferred from LoC to OASIS in 2012, and OASIS released a first version of the protocol as Committee Specification in April 2012\cite{OASIS2012sru}.
    380384
    381385
    382386\section{Summary}
     387
     388In this chapter we presented the individual parts of the infrastructure. Next to the core registries that this work directly builds upon (ISOcat Data Category Registry, Component Registry and Relation Registry), a number of other services and applications forming the CLARIN ecosystem were briefly introduced. A separate consideration was dedicated to the issue of controlled vocabularies, together with a related module, the Vocabulary Alignment Service (and its implementation OpenSKOS), which allows vocabularies to be managed and used in client applications. Finally, a few other aspects of the infrastructure that are equally important, but do not pertain to the metadata level, were briefly touched upon.
     389
  • SMC4LRT/chapters/Introduction.tex

    r3553 r3665  
    88While in the Digital Libraries community a consolidation already took place and global federated networks of digital library repositories are set up, in the field of Language Resource and Technology the landscape is still scattered, although meanwhile looking back at a decade of standardization and integration efforts. One main reason seems to be the complexity and diversity of the metadata associated with the resources, stemming from the wide range of resource types combined with project-specific needs. (Chapter \ref{ch:data} analyses the disparity in the data domain.)
    99
    10 This situation has been identified by the community and numerous standardization initiatives had been undertaken. The process has gained a new momentum thanks to large framework programmes introduced by the European Commission aimed at fostering the development of common large-scale international research infrastructures. One key player in this development is the project CLARIN (see section \ref{def:CLARIN}). The main objective of this initiative is to make language resources and technologies more easily available to scholars by providing a common harmonized architecture for accessing and working with Language Resources and Technology (LRT). One core pillar of this architecture is the \emph{Component Metadata Infrastructure} (cf. \ref{def:CMDI}) -- a distributed system consisting of multiple interconnected modules aimed at creating and providing metadata for LRT in a coherent harmonized way.
     10This situation has been identified by the community and numerous standardization initiatives have been undertaken. The process has gained new momentum thanks to large framework programmes introduced by the European Commission aimed at fostering the development of common large-scale international research infrastructures. One key player in this development is the project CLARIN (see section \ref{def:CLARIN}). The main objective of this initiative is to make language resources and technologies (LRT) more easily available to scholars by means of a common harmonized architecture. One core pillar of this architecture is the \emph{Component Metadata Infrastructure} (cf. \ref{def:CMDI}) -- a distributed system consisting of multiple interconnected modules aimed at creating and providing metadata for LRT in a coherent harmonized way.
    1111
    1212This work discusses one module within the Component Metadata Infrastructure -- the \emph{Semantic Mapping Component} -- dedicated to overcoming or at least easing the semantic interoperability problem stemming from the heterogeneity of the resource descriptions, without the reductionist approach of imposing one common description schema for all resources.
     
    1414\section{Main Goal}
    1515
    16 The primary goal of this work is to \emph{\textbf{enhance search functionality} over a large heterogeneous collection of resource descriptions} in the field of LRT, henceforth referred to as \xne{semantic search}, distincting it from the necessary underlying preprocessing, referred to as \xne{semantic mapping}.
     16The primary goal of this work is to \emph{\textbf{enhance search functionality} over a large heterogeneous collection of resource descriptions} in the field of LRT, henceforth referred to as \xne{semantic search}, distinguishing it from the underlying processing, referred to as \xne{semantic mapping}.
    1717
    1818The -- notoriously polysemic -- term ``mapping'' can have three different meanings within this work,
     
    2525\end{description}
    2626
    27 The work can further be divided along the schema -- instance duality. Figure \ref{fig:master_outline} spans the conceptual space of this work and depicts the relations between individual subgoals.
     27The work can further be divided along the schema -- instance duality. Figure \ref{fig:master_outline} spans the conceptual space of this work and depicts the dependencies between individual subgoals.
    2828
    2929\begin{figure*}[!ht]
     
    4444\subsubsection*{Concept-based query expansion}
    4545
    46 Once the crosswalks are available, they can be used to rewrite user queries (or to generate appropriate search indexes), so that they match related fields across heterogeneous metadata schemas resulting in higher recall when searching.
     46Once the crosswalks are available, they can be used to rewrite user queries, so that they match equivalent or similar fields across heterogeneous metadata schemas resulting in higher recall when searching.
    4747
    4848\paragraph{Example}
     
    5454\end{quote}
    5555
    56 while other fields, labeled with the same (sub)strings but with different semantics shouldn't be considered:
     56The expansion cannot be solved by simple string matching, as there are other fields labeled with the same (sub)strings but with different semantics, that shouldn't be considered:
    5757
    5858\begin{quote}
    59 \concept{Project/Title, Organisation/Name, Country/Name}
     59\concept{Project/Title, Organisation/Name, Country/Name, LanguageName}
    6060\end{quote}
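
The distinction between label-based and concept-based matching can be illustrated with a minimal sketch. It is not the SMC implementation: the element paths, their assignment to data categories and the equivalence pair are assumptions made for illustration, and only the DC identifiers are taken from this work.

```python
# Hypothetical index: CMD element path -> data category (DC) it references.
# The DC identifiers appear in this work; the element paths and their
# assignment to DCs are invented for illustration.
FIELD_TO_DC = {
    "Resource/Name": "DC-2544",   # resourceName
    "Resource/Title": "DC-2545",  # resourceTitle
    "Project/Title": "DC-2536",   # project name
    "LanguageName": "DC-2484",    # languageName
}

# Hypothetical relation set: pairs of DCs treated as equivalent.
EQUIVALENT_DCS = [("DC-2544", "DC-2545")]


def related_dcs(dc, pairs):
    """Return dc plus every DC reachable from it via the equivalence pairs."""
    related = {dc}
    changed = True
    while changed:
        changed = False
        for a, b in pairs:
            if a in related and b not in related:
                related.add(b)
                changed = True
            elif b in related and a not in related:
                related.add(a)
                changed = True
    return related


def expand_field(field, field_to_dc=FIELD_TO_DC, pairs=EQUIVALENT_DCS):
    """Return all fields whose DC is equivalent to the DC of the given field."""
    dc = field_to_dc.get(field)
    if dc is None:
        return [field]  # no semantic anchor: leave the field unexpanded
    targets = related_dcs(dc, pairs)
    return sorted(f for f, d in field_to_dc.items() if d in targets)


def expand_query(field, term):
    """Rewrite a single-field query into a disjunction over equivalent fields."""
    return " OR ".join(f'{f} = "{term}"' for f in expand_field(field))


print(expand_query("Resource/Name", "Wikipedia"))
```

Because the expansion follows the referenced data categories rather than the field labels, \concept{Project/Title} and \concept{LanguageName} are left out even though their labels share substrings with the expanded fields.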
    6161
     
    6666\subsubsection*{Ontology-driven data exploration}
    6767
    68 Based on the results of the previous parts of the work -- crosswalks and semantic interpretation -- the discussed dataset can be expressed as one big ontology. Consequently, semantic web technologies can be applied giving the user new means of \emph{exploring the dataset} through semantic resources.
     68Based on the results of the previous parts of the work -- crosswalks and semantic interpretation -- the discussed dataset can be expressed as one big ontology. Consequently, semantic web technologies can be applied giving the user new means of \emph{exploring the dataset}.
    6969
    7070\paragraph{Example}
     
    7272
    7373\subsubsection*{Visualization}
    74 Given the large, heterogeneous and complex dataset, it seems indispensable to equip the user with advanced means for exploration of and interaction with it. Hence this subgoal aiming at exploring ways of visualizing the data at hand.
     74Given the large, heterogeneous and complex dataset, it seems indispensable to equip the user with advanced means to explore and interact with it. Hence, this subgoal aims to propose ways of visualizing the data at hand.
    7575
    7676\section{Method}
     
    7979Building on this groundwork, in accordance with the first subgoal, we lay out the design of the service for handling crosswalks and concept-based query expansion. We describe the workflow, the central methods and the role of the module relative to other parts of the infrastructure.
    8080
    81 Subsequently, we explore the ways of integrating this service into exploitation tools (metadata search engines), to enhance search/retrieval through the use of semantic relations between concepts or categories.
     81Subsequently, we explore the ways of integrating this service into exploitation tools (metadata search engines), to enhance search/retrieval through the use of semantic relations between concepts or categories. This theoretical part will be accompanied by a prototypical implementation as proof of concept.
    8282
    83 This theoretical part will be accompanied by a prototypical implementation as proof of concept.
     83%In an evaluation phase, we apply a set of  test queries and compare a traditional search with a semantically expanded query in terms of recall/precision measures.
    8484
    85 In an evaluation phase, we apply a set of  test queries and compare a traditional search with a semantically expanded query in terms of recall/precision measures.
    86 
    87 In this work, the focus lies on the actual method to generate and apply the crosswalks -- expressed in the specification and operationalized in the (prototypical) implementation of the service -- rather than trying to establish final, accomplished crosswalks between the schemas. In fact, given the great diversity of resources and research tasks, a ``final'' complete alignment does not seem achievable at all. Therefore also the focus shall be on \emph{dynamic mapping}, i.e. to enable the users to directly manipulate the level of use of the crosswalks or even apply custom crosswalks depending on their current task or research question being able to actively influence the recall/precision ratio of the search results, and essentially to modulate the semantic search space.
     85Note that in this work, the focus lies on the actual method to generate and apply the crosswalks -- expressed in the specification and operationalized in the (prototypical) implementation of the service -- rather than trying to establish final, accomplished crosswalks between the schemas. In fact, given the great diversity of resources and research tasks, a ``final'' complete alignment does not seem achievable at all. Therefore also the focus shall be on \emph{dynamic mapping}, i.e. to enable the users to directly manipulate the level of use of the crosswalks or even apply custom crosswalks depending on their current task or research question being able to actively influence the recall/precision ratio of the search results, and essentially to modulate the semantic search space.
    8886
    8987
    90 Serving the second subgoal, semantic interpretation on the instance level, we will propose the expression of all of the domain data (from meta-model specification to instances) in RDF, linking to corresponding entities in appropriate external
     88Serving the second subgoal -- semantic interpretation on the instance level -- we will propose the expression of all of the domain data (from meta-model specification to instances) in RDF, linking to corresponding entities in appropriate external
    9189semantic resources (controlled vocabularies, ontologies).
    9290Once the dataset is expressed in RDF, it can be exposed via a semantic web application and published as another nucleus of \emph{Linked Open Data} in the global \emph{Web Of Data}.
    9391
    94 A separate usability evaluation of the semantic search is indicated, examining the user interaction with and display of the relevant additional information in the user search interface, however this issue can only be tackled marginally and will have to be outsourced into future work.
     92A separate evaluation of the usability of the proposed semantic search solution would be warranted, examining the user interaction with and the display of the relevant additional information in the search user interface. However, this issue can only be tackled marginally here and will have to be deferred to future work.
    9593
    9694\section{Expected Results}
     
    9896The main result of this work will be the \emph{specification} of the two modules \xne{concept-based search} and the underlying \xne{crosswalk service}.
    9997This theoretical part will be accompanied by a proof-of-concept \emph{implementation} of the components
    100 and the results and findings of the \emph{evaluation}.
     98and the sample results. % and findings of the \emph{evaluation}.
    10199
    102100Another result of the work will be the original dataset expressed as RDF interlinked with existing external resources (ontologies, knowledge bases, vocabularies), effectively laying a foundation for providing this dataset as \emph{Linked Open Data}\furl{http://linkeddata.org/}.
     
    104102\begin{description}
    105103\item [Crosswalk service] specification and a basic implementation of the service
    106 \item [Concept-based search] design of the query expansion and prototypical integration with search engines
     104\item [Concept-based search] design of the query expansion and prototypical integration with a search engine
    107105\item [Visualization tool] design of an application for interactive exploration of the concerned dataset
    108 \item [Evaluation] evaluation results of querying the dataset comparing simple search and semantic search
     106%\item [Evaluation] evaluation results of querying the dataset comparing simple search and semantic search
    109107\item [LinkedData] translation of the source dataset to RDF-based format with links into existing datasets, ontologies, knowledge bases
    110108\end{description}
    111109
    112110\section{Structure of the work}
    113 The work starts with examining the state of the art work in the two fields  language resources and technology and semantic web technologies in chapter \ref{ch:lit}, followed by administrative chapter \ref{ch:def} explaining the terms and abbreviations used in the work.
     111The work starts with examining the state of the art work in the two fields  language resources and technology and semantic web technologies in chapter \ref{ch:lit}, followed by administrative chapter \ref{ch:def} explaining the abbreviations and formatting conventions used throughout this work.
    114112
    115113In chapter \ref{ch:data} we analyze the situation in the data domain of LRT metadata and in chapter \ref{ch:infra} we discuss the individual software components of the infrastructure underlying this work.
     
    117115The main part of the work is found in chapters \ref{ch:design} and \ref{ch:design-instance} laying out the design of the software module and a proposal how to model the data in RDF respectively.
    118116
    119 The evaluation and the results are discussed in chapter \ref{ch:results}. Finally, in chapter \ref{ch:conclusions} we summarize the findings of the work and lay out where it could develop in the future.
     117%evaluation and the
     118The results are discussed in chapter \ref{ch:results}. Finally, in chapter \ref{ch:conclusions} we summarize the findings of the work and lay out where it could develop in the future.
    120119
    121120\section{Keywords}
  • SMC4LRT/chapters/Results.tex

    r3638 r3665  
    22\label{ch:results}
    33
    4 In this chapter, the results of the work are presented, divided into two main areas:
    5 
    6 software and data.
    7 
    8 In two sections, we explore the CMD data domain - the usage of the data categories on the one hand and the integration of existing formats on the other hand. While these two aspects were not directly part of this work, they were a) made possible by output of this work (SMC-Browser, statistical analysis), b) yield a valuable test case for the usefulness of the work and c) are an indispensable prerequisit for the necessary curation work being carried out by the CMDI community.
     4In this chapter, the results of the work are presented. After a short update about the current state of affairs in the infrastructure as a whole, the individual parts of the work are listed with pointers to their specifications in previous chapters and links to the running prototypes.
     5
     6In the subsequent two sections, we explore a few specific aspects of the CMD data domain -- regarding the usage of the data categories (\ref{sec:explore-datcats}) and the integration of existing formats (\ref{sec:explore-formats}). While these topics are not directly results of this work, the presented analyses are. They were made possible by the technical solution of this work, yield a valuable test case for the usefulness of the work and are an indispensable prerequisite for the necessary coordination and curation work being carried out by the CMDI community.
    97
    108\section{Current status of the infrastructure}
     
    1412The main services of the infrastructure have been in stable production for the last two years.
    1513The Relation Registry is operational as an early prototype.
    16 Three instances of OpenSKOS are running, one of them being hosted by ACDH.
     14Three instances of \xne{OpenSKOS} are running, one of them being hosted by \xne{ACDH}.
    1715
    1816\subsection{CMDI - data}
    19 More than 130 profiles are defined. (See \ref{table:dev_profiles} for more details about profiles.)
     17More than 130 profiles are defined. (See table \ref{table:dev_profiles} for more details about profiles.)
    2018The official CLARIN harvester\furl{http://catalog.clarin.eu/oai-harvester/} collects data from 69 providers on a daily basis.
    21 The collection amounts to over 550.000 records in 64 profiles.
     19The collection amounts to over 550.000 records in more than 60 distinct profiles.
    2220
    2321\subsection{ACDH - the home of SMC}
    24 Within CLARIN-AT a new centre has been brought to life, the Austrian Centre for Digital Humanities, that provides depositing and publishing services to the DH community, as well as infrastructural services that are part of the CLARIN Metadata Infrastructure.
    25 Figure \ref{fig:acdh_context} sketches the broader context of \xne{acdh} and its different roles.
    26 
    27 
    28 \section {Software}
    29 The specification of the system can be found in the chapters \ref{ch:design} and \ref{ch:design-instance}.
    30 
    31 There is prototypical implementation for three parts of the system
    32 
    33 \begin{itemize}
    34 \item the crosswalk service as a REST web service
    35 \item a module to integrate with a search engine
    36 \item web application that allows advanced interaction with the data set
    37 \end{itemize}
    38 
    39 The SMC module is being maintained in the CMDI code repository\footnote {\url{http://svn.clarin.eu/SMC}}.
    40 
    41 Furthermore, the CMD data has been expressed RDF, as first important step towards incorporating the dataset in the \emph{Web of Data}.
     22       
     23Within CLARIN-AT a new centre has been brought to life, the Austrian Centre for Digital Humanities, with the mission to foster the digital research paradigm in the humanities. It is designed to provide depositing and publishing services to the DH community, as well as infrastructural services that are part of the CLARIN Metadata Infrastructure. SMC is one of the services provided by this centre.
     24Figure \ref{fig:acdh_context} sketches the broader context of \xne{ACDH} and its different roles.
     25
     26%%%%%%%%%%%%%%%%
     27\section {Technical solution}
     28With this work we delivered a module embedded in a larger metadata infrastructure, aimed at supporting the semantic interoperability across the heterogeneous data in this infrastructure. The module consists of multiple interrelated components. The technical specification of the module can be found in chapter \ref{ch:design}. A prototypical implementation has been developed for the three main parts of the system. The code of this implementation is maintained in the central CMDI code repository\footnote {\url{http://svn.clarin.eu/SMC}}.
     29
     30The module itself is hosted at the \xne{CLARIN-AT} server, offering a main entry point page linking to the various parts of the module at:
     31\\
     32
     33\url{http://clarin.aac.ac.at/smc} (soon: \url{http://acdh.ac.at/smc})
     34
    4235
    4336\subsection{SMC - crosswalks service}
    44 
    45 The core function of the SMC is implemented as a set of XSL-stylesheets, with auxiliary functionality (like caching or a wrapping web service) being provided by a wrapping application implemented in Java.
     37The crosswalk service is implemented as a REST web service.
     38
     39It exposes an interface that provides mappings between search indexes as defined in \ref{sec:cx}.
     40
     41This interface is available as part of the smc application:
     42
     43\url{http://clarin.aac.ac.at/smc/cx}
    4644
    4745\subsection{SMC - as a module within Metadata Repository}
    48 There is also a XQuery implementation, that is integrated as a module of the SADE/cr-xq - eXist-based web application framework for publishing resources, on which the Metadata Repository is running.
    49 
     46The SMC is also integrated as a module with the Metadata Repository, enabling \emph{semantic search} over the joint metadata domain.
     47
     48\url{http://clarin.aac.ac.at/mdrepo/smc}
    5049
    5150\subsection{SMC Browser -- advanced interactive user interface}
    5251
    53 SMC Browser\furl{http://clarin.aac.ac.at/smc-browser} is a web application to explore the complex dataset of the Component Metadata Framework, by visualizing its structure as an interactive graph.
    54 In particular, it enables the metadata modeller to examine the reuse of components or DCs in different profiles. The graph is accompanied by statistical information about individual `nodes', e.g. counting how many elements a profiles contains, or in how many profiles a DC is used.
    55 
    56 It is implemented on top of the js-library d3, the code is checked in clarin-svn.
    57 
    58 The graph is constructed from all profiles defined in the Component Registry and related datasets. To resolve (multilingual) name and description of data categories referenced in the CMD elements definitions of referenced data categories from DublinCore and ISOcat are fetched.
    59 
    60 E.g. starting from 124 profiles, this amounts to a graph with ??? nodes and ??? edges.
    61 
    62 \begin{figure*}[!ht]
     52SMC Browser is an advanced web-based visualization application to explore the complex dataset of the \xne{Component Metadata Infrastructure}, by visualizing its structure as an interactive graph. In particular, it enables the metadata modeller to examine the reuse of components or DCs in different profiles. The graph is accompanied by numerical statistics about the dataset as a whole and about individual items (profiles, components, data categories), a set of example results and user documentation. Details about design and implementation can be found in \ref{smc-browser}. The publicly available instance is maintained under:
     53
     54\url{http://clarin.aac.ac.at/smc/browser}
     55
     56\begin{figure*}
    6357\includegraphics[width=1\textwidth]{images/screen_SMC-Browser_2013-01-23}
    6458\caption{Screenshot of the SMC browser}
    6559\end{figure*}
    6660
    67 SMC Browser also features detailed numerical statistics about the dataset as whole and about individual items (profiles, components, data categories), a set of example results and user documentation.
    68 
    69 In the following section, we make extensive use of the output of this tool, to visualize individual aspects of the discussed data set.
    70 
    7161\subsection{SMC LOD}
    72 
    73 
    74 \section{Exploring the usage of data categories}
    75 At the core of the whole SMC (and CMDI) are the data categories as basic conceptual building blocks or anchors.
    76 We want to take a closer look on the usage of the data categories in the CMD infrastructure, examplifying on a few very common concepts -- \concept{language}, \concept{name}, \concept{resource type}, \concept{???}.
    77 
    78 In the ISOcat DCR 791 DCs are defined in the Metadata thematic profile, out of which 222 were created by the \textit{Athens Core} group. \todoin{need to check, how many of these athens-core data categories are being employed}
    79 
    80 \subsection{Language}
    81 While there are 69 components and 97 elements containing a substring `language' defined in the CR
    82 still only 19 distinct DCs with a `language' substring are being used\footnote{Here the term `used' means referenced in CMD components and elements.}. The most commonly used ones:
    83 \textit{languageID} (\texttt{DC-2482}) and \textit{languageName} (\texttt{DC-2484}) are referenced by more than 80 profiles.
    84 Additionally, these two DCs are linked to the Dublin Core term \textit{Language} in the RR.
    85 Thus a search engine capable of interpreting RR information could offer the user a simple Dublin Core-based search interface, while -- by expanding the query -- still searching over all available data, and, moreover, on demand offer the user a more finegrained semantic interpretation for the matches based on the originally assigned DCs. Figure \ref{fig:language_datcats} depicts the relations between the language data categories and their usage in the profiles. We encounter all types of situations: profiles using only \textit{dc:Language} or \textit{dcterms:Language}, \textit{isocat:languageId} or \textit{isocat:languageName},
    86 most profiles use both \textit{isocat:languageId} and \textit{isocat:languageName} and there are even profiles that refer to both \textit{isocat} and \textit{dublincore} data categories (\textit{data}, \textit{HZSKCorpus}, \textit{ToolService}).
     62In a separate track, a model has been proposed (cf. \ref{ch:design-instance}) to express CMD data in RDF, as a first important step towards incorporating the dataset in the \emph{Web of Data}.
     63
     64
     65%%%%%%%%%%%%%%%555
     66\section{Exploring the CMD data -- SMC reports}
     67SMC reports is a (growing) set of documents analyzing specific phenomena in the CMD data domain that were created making extensive use of the visual and numerical output from the \xne{SMC Browser}. In this section, we deliver a few examples of these analyses. A complete, up-to-date listing is maintained on the SMC website:
     68
     69\url{http://clarin.aac.ac.at/smc/reports}
     70
     71\subsection{Usage of data categories}
     72\label{sec:explore-datcats}
     73At the core of the whole SMC (and CMDI) are the data categories as basic semantic building blocks or anchors.
     74
     75In the ISOcat DCR, currently 791 DCs are defined in the Metadata thematic profile, starting from 222 that were initially created by the so-called \textit{Athens Core} group in 2010. %\todoin{need to check, how many of these athens-core data categories are being employed}
     76As can be seen in table \ref{table:dev_profiles}, around 500 distinct data categories are being used in CMD profiles.
     77We want to take a closer look at the usage of the data categories in the CMD data domain, exemplified by two very common concepts -- \concept{language} and \concept{name}. %, \concept{resource type}, \concept{???}.
     78
     79\subsubsection{Language}
     80While there are 69 components and 97 elements containing a substring \code{`language'} defined in the CR
     81still only 19 distinct DCs with a \code{`language'} substring are being used\footnote{Here the term `used' means referenced in CMD components and elements.}. The most commonly used ones:
     82\concept{languageID\#DC-2482} and \concept{languageName\#DC-2484} are referenced by more than 80 profiles.
     83Additionally, these two DCs are linked to the Dublin Core term \concept{Language} in the RR.
     84Thus a search engine capable of interpreting RR information could offer the user a simple Dublin Core-based search interface, while -- by expanding the query -- still searching over all available data, and, moreover, on demand offer the user a more fine-grained semantic interpretation for the matches based on the originally assigned DCs. Figure \ref{fig:language_datcats} depicts the relations between the language data categories and their usage in the profiles. We encounter all types of situations: profiles using only \concept{dc:Language} or \concept{dcterms:Language}, \concept{isocat:languageId} or \concept{isocat:languageName},
     85most profiles use both \concept{isocat:languageId} and \concept{isocat:languageName} and there are even profiles that refer to both \concept{isocat} and \concept{dublincore} data categories (\concept{data}, \concept{HZSKCorpus}, \concept{ToolService}).
    8786
    8887
     
    9190\includegraphics[width=1\textwidth]{images/SMC-export_language_custom_v2c.pdf}
    9291\end{center}
    93 \caption{The four main \textit{Language} data categories and in which profiles they are being used}
     92\caption{The four main \concept{Language} data categories and in which profiles they are being used}
    9493\label{fig:language_datcats}
    9594\end{figure*}
    9695
    9796It requires further inspection and in the end a case by case decision, if the other less often used `language' DCs can be treated as equivalent to the above mentioned ones.
    98 \textit{languageScript}, \textit{implementationLanguage}, as well as \textit{noLanguages} or  \textit{sizePerLanguage} clearly do not belong to the language cluster.
    99 But \textit{sourceLanguage}, \textit{languageMother} or \textit{participantDominantLanguage} can at least be expected to share the same value domain (natural languages) and even if they do not describe the language of the resource, they could be considered when one aims at maximizing the recall (i.e., trying to find anything related to a given language). This is actually exactly the scenario the RR was conceived for -- allow to define custom relation sets based on specific needs of a project or of a research question.
    100 
    101 
    102 \subsection{Name / Title}
     97\concept{languageScript}, \concept{implementationLanguage}, as well as \concept{noLanguages} or  \concept{sizePerLanguage} clearly do not belong to the language cluster.
     98But \concept{sourceLanguage}, \concept{languageMother} or \concept{participantDominantLanguage} can at least be expected to share the same value domain (natural languages) and even if they do not describe the language of the resource, they could be considered when one aims at maximizing the recall (i.e., trying to find anything related to a given language). This is exactly the scenario the RR was conceived for -- allowing custom relation sets to be defined based on the specific needs of a project or of a research question.
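
How such task-specific relation sets might modulate the search space is sketched below. This is a simplified illustration, not the RR data model: the set names \code{core} and \code{high-recall}, the \code{isocat:} prefix on the less common categories and the flat pair representation are all assumptions.

```python
# Hypothetical relation sets, each a collection of (concept, related concept)
# pairs. Which pairs belong to which set is an assumption for illustration.
RELATION_SETS = {
    "core": {
        ("dc:Language", "isocat:languageID"),
        ("dc:Language", "isocat:languageName"),
    },
    "high-recall": {
        ("dc:Language", "isocat:sourceLanguage"),
        ("dc:Language", "isocat:languageMother"),
        ("dc:Language", "isocat:participantDominantLanguage"),
    },
}


def expand_concept(concept, active_sets, relation_sets=RELATION_SETS):
    """Return the concept plus all concepts related to it in the active sets."""
    result = {concept}
    for name in active_sets:
        for source, target in relation_sets[name]:
            if source == concept:
                result.add(target)
    return sorted(result)


# Precision-oriented search: only the well-established equivalences.
print(expand_concept("dc:Language", ["core"]))

# Recall-oriented search: additionally include the looser relations.
print(expand_concept("dc:Language", ["core", "high-recall"]))
```

Activating or deactivating a relation set changes the number of data categories a query is expanded to, which is one way the user could influence the recall/precision ratio of the results.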
     99
     100
     101\subsubsection{Name / Title}
    103102There are as many as 72 CMD elements with the label \texttt{Name}, referring to 12 different DCs.
    104 Again the main DC \textit{resourceName} (\texttt{DC-2544}) being used in 74 profiles together with the semantically near \textit{resourceTitle} (\texttt{DC-2545}) used in 69 profiles offer a good coverage over available data.
    105 
    106 Some of the DCs referenced by \texttt{Name} elements are \textit{author} (\texttt{DC-4115}), \textit{contact full name} (\texttt{DC-2454}), \textit{dcterms:Contributor}, \textit{project name} (\texttt{DC-2536}), \textit{web service name} (\texttt{DC-4160}) and \textit{language name} (\texttt{DC-2484}). This implies, that a na\"{i}ve search in a \texttt{Name} element would match semantically very heterogeneous fields and only applying the semantic information provided by the DCs and/or the context of the element (the enclosing components) allows to disambiguate the meaning of the values.
    107 
    108 \subsection{Resource type}
    109 
    110 \subsection{Subject, Genre, Topic}
    111 
    112 \section{Exploring the integration of existing formats}
     103Again, the main DC \concept{resourceName\#DC-2544}, used in 74 profiles, together with the semantically close \concept{resourceTitle\#DC-2545}, used in 69 profiles, offers good coverage of the available data.
     104
     105Some of the DCs referenced by \code{Name} elements are \concept{author\#DC-4115}, \concept{contact full name\#DC-2454}, \concept{dcterms:Contributor}, \concept{project name\#DC-2536}, \concept{web service name\#DC-4160} and \concept{language name\#DC-2484}. This implies that a na\"{i}ve search in a \texttt{Name} element would match semantically very heterogeneous fields; only applying the semantic information provided by the DCs and/or the context of the element (the enclosing components) makes it possible to disambiguate the meaning of the values.
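
The difference between the two selection strategies can be made concrete with a small sketch. The DC identifiers are those listed above; the element paths (enclosing component plus element name) are hypothetical and only serve to show the principle.

```python
from typing import NamedTuple


class Element(NamedTuple):
    path: str  # hypothetical path: enclosing component / element name
    dc: str    # data category referenced by the element


# All elements carry the label "Name" but reference different DCs.
ELEMENTS = [
    Element("Resource/Name", "DC-2544"),    # resourceName
    Element("Author/Name", "DC-4115"),      # author
    Element("Contact/Name", "DC-2454"),     # contact full name
    Element("Project/Name", "DC-2536"),     # project name
    Element("WebService/Name", "DC-4160"),  # web service name
    Element("Language/Name", "DC-2484"),    # language name
]


def select_by_label(elements, label):
    """Naive selection: every element whose last path step equals the label."""
    return [e.path for e in elements if e.path.split("/")[-1] == label]


def select_by_dc(elements, dc):
    """Concept-based selection: only elements referencing the given DC."""
    return [e.path for e in elements if e.dc == dc]


print(select_by_label(ELEMENTS, "Name"))   # matches all six elements
print(select_by_dc(ELEMENTS, "DC-2544"))   # matches only the resource name
```

The label-based selection returns all six elements, whereas the selection by data category narrows the match to the one element that actually denotes the name of the resource.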
     106
     107%\subsection{Resource type}
     108
     109% \subsection{Subject, Genre, Topic}
     110
     111\subsection{Integration of existing formats}
     112\label{sec:explore-formats}
    113113
    114114CLARIN set out with the aspiration /yearning to overcome the babylon of metadata formats
     
    116116In this section, we want to elaborate on/analyze the state of integration efforts for 4 major formats: \xne{dublincore/OLAC}, \xne{teiHeader} and \xne{META-SHARE resourceInfo}.
    117117
    118 \subsection{dublincore / OLAC}
     118\subsubsection{dublincore / OLAC}
    119119
    120120Very widely used (because) simple format
     
    136136\caption{Profiles modelling dublincore terms}
    137137\label{table:dcterms-profiles}
    138   \begin{tabular}{ l | l | l | r | r }
    139     \hline
    140 profile name & created & creator & count & instances \\
    141     \hline
    142 component-dc-terms-modular & 2010-04-21 & CMDI-team & 15 / 15 / 15 \\
    143 component-dc-terms & 2010-04-21 & CMDI-team & 0 / 15 / 15 \\
    144 DcmiTerms & 2010-10-28 & Dieter Van Uytvanck & 0 / 55 / 55 & \\
    145 OLAC-DcmiTerms & 2010-10-28 & Dieter Van Uytvanck & 0 / 55 / 55 & \\
    146 OLAC-DcmiTerms\footnote{optional DANS-DC-metadata component} & 2013-02-12 & Menzo Windhouwer & 1 / 71 / 62 & \\
    147 DC-UBU & 2013-05-29& Utrecht University Library & 0 / 15 / 15 & \\
    148 OLAC-DcmiTerms-ref & 2013-06-24 & fankhauser@ids-mannheim.de & 0 / 55 / 55 & \\
    149     \hline
    150   \end{tabular}
     138%  \begin{tabular}{ |l | l | l | r | r | }
     139  \begin{tabu}{ l  l  l  r  r }
     140    \hline
     141\rowfont{\itshape\small} profile name & created & creator & count & instances \\
     142   \hline
     143component-dc-terms-modular & 2010-04 & CMDI-team & 15 / 15 / 15 & \\
     144component-dc-terms & 2010-04 & CMDI-team & 0 / 15 / 15 & \\
     145DcmiTerms & 2010-10 & D. Van Uytvanck & 0 / 55 / 55 & 46.156 \\
     146OLAC-DcmiTerms & 2010-10 & D. Van Uytvanck & 0 / 55 / 55 & 85.149 \\
     147OLAC-DcmiTerms\footnote{optional DANS-DC-metadata component} & 2013-02 & M. Windhouwer & 1 / 71 / 62 & \\
     148DC-UBU & 2013-05 & Utrecht Uni Lib & 0 / 15 / 15 & \\
     149OLAC-DcmiTerms-ref & 2013-06 & Fankhauser, IDS & 0 / 55 / 55 & 697 \\
     150OLAC-DcmiTerms-ref-DWR & private & ? & 1 / 61 / 55 &  775 \\
     151    \hline
     152  \end{tabu}
    151153\end{table}
    152154
    153155Additionally, there are a number of profiles with concept links to dublincore terms.
    154156Some use all of the dublincore elements or terms as one component within a larger profile,
    155 one example being the \xne{data} profile created by the Czech initiative LINDAT modells  the minimal obligatory set of META-SHARE \xne{resourceInfo}) combined with a simple dublincore record (see also subsection about META-SHARE below).
    156 Other profiles refer only to some data categories. Most often used: \concept{Title} (used in 33 profiles) and \concept{Creator} (in 29 profiles).
     157one example being the \xne{data} profile created by the Czech initiative LINDAT, which models the minimal obligatory set of the META-SHARE \xne{resourceInfo} schema (cf. the subsection about META-SHARE below) combined with a simple dublincore record.
     158Other profiles refer only to some data categories. The most frequently used are \concept{dc:Title} (in 33 profiles) and \concept{dc:Creator} (in 29 profiles).
    157159Profiles that make more frequent use of the dublincore terms:
    158160
    159 \begin{itemize}
    160 \item EastRepublican (8)
    161 \item HZSKCorpus (17)
    162 \item teiHeader (8)
    163 \item ToolService (15)
    164 \item OralHistoryInterviewDANS (15)
    165 \end{itemize}
    166 
    167 \begin{figure*}[!ht]
    168 \begin{center}
    169 \includegraphics[width=0.8\textwidth]{images/profiles_using_dcmiterms.png}
     161\begin{tabular}{l r}
     162EastRepublican & 8 \\
     163HZSKCorpus &17 \\
     164teiHeader &8 \\
     165ToolService &15 \\
     166OralHistoryInterviewDANS & 15 \\
     167\end{tabular}
     168
     169\begin{figure*}
     170\begin{center}
     171\includegraphics[width=1\textwidth]{images/profiles_using_dcmiterms.png}
    170172\end{center}
    171173\caption{Profiles referring to at least some of the dublincore data categories/terms}
     
    174176
    175177
    176 \subsection{teiHeader}
     178\subsubsection{teiHeader}
    177179
    178180TEI is a de-facto standard for encoding any kind of textual resource. It defines a set of elements to annotate individual aspects of the text being encoded. For the purposes of text description / metadata, the complex element \code{teiHeader} is provided.
     
    191193This yields a more complex, but also a more systematic and flexible setup, with a clean separation of the semantic space of TEI and the possibility to map the TEI elements (via their data categories) to multiple and/or different data categories according to the specific needs of a project or research question.
    192194
    193 \begin{figure*}[!ht]
    194 \begin{center}
    195 \includegraphics[width=0.75\textwidth]{images/teiHeader_DBNL.png}
     195%[!ht]
     196\begin{figure*}
     197\begin{center}
     198\includegraphics[width=0.65\textwidth]{images/teiHeader_DBNL.png}
    196199\end{center}
    197200\caption{The reuse of components between the original teiHeader-profile (2010) and the profiles used in the Nederlab project}
     
    199202\end{figure*}
    200203
     204% p{0.2\textwidth}
    201205\begin{table}
    202206\caption{Overview of TEI-related CMD profiles}
    203207\label{table:tei-profiles}
    204   \begin{tabular}{ l | r | l | r | r | r}
    205     \hline
    206 profile name & created & creator & count & instances \\
    207     \hline
    208 teiHeader & 2010 & ICLTT, Durco & 16/35/13 & 467 \\
     208  \begin{tabu}{ p{0.2\textwidth}  r  l  r  r  }
     209    \hline
     210\rowfont{\itshape\small} profile name & created & creator & count & instances \\
     211    \hline
     212teiHeader & 2010 & Durco, ICLTT & 16/35/13 & 467 \\
    209213teiHeader & 2012 & Deutsches Text Archiv & 56/82/10 & 857 \\
    210 TEIDocumentDescription & 2012 & Leipzig Corpora, Eckart & 16/35/13 & ? \\
    211 DBNL\_Tekst & 2013 & Nederlab, Zhang & 20/38,15 & \textgreater 37 Mio.\footnote{There shall be a metadata record for every article.} \\
    212 DBNL\_Tekst\_Onzelfstandig  & & & 20/47/21 &  \\
    213     \hline
    214   \end{tabular}
     214TEIDocument Description & 2012 & Eckart, Leipzig Corpora & 16/35/13 & ? \\
     215DBNL\_Tekst & 2013 & Zhang, Nederlab & 20/38/15 & \textgreater 37 million\footnote{There shall be a metadata record for every article.} \\
     216DBNL\_Tekst\_ Onzelfstandig  & & & 20/47/21 &  \\
     217    \hline
     218  \end{tabu}
    215219\end{table}
    216220
     
    218222clarin.eu:cr1:p 1366279029218 (private)
    219223
    220 \subsection{META-SHARE}
    221 
    222 
    223 META-SHARE created a new metadata model \cite{Gavrilidou2012meta}. Although inspired by the Component Metadata, META-SHARE metadata imposes a single large schema for all resource types with a minimal core subset of obligatory metadata elements and with many optional components.
     224%
     225\subsubsection{META-SHARE}
     226%
     227
     228META-SHARE created a new metadata model \cite{Gavrilidou2012meta}. Although inspired by the Component Metadata, META-SHARE metadata imposes a single large schema for all resource types with a minimal core subset of obligatory metadata elements and with many optional components.
    224229%In cooperation between metadata teams from CLARIN and META-SHARE
    225230
    226 \begin{figure*}[!ht]
     231The original META-SHARE schema actually accommodates four models for different resource types. Consequently, the model has been expressed as four CMD profiles, each for a distinct resource type, with all four sharing most of the components, as can be seen in figure \ref{fig:resource_info_5}. The biggest single profile is currently the remodelled maximum schema from the META-SHARE project for describing corpora, with 117 distinct components and 337 elements. When expanded, this translates to 419 components and 1587 elements. However, many of the components and elements are optional (and conditional), so a specific instance will never use all the possible elements.
     232
     233In a parallel effort, LINDAT, the Czech national infrastructure initiative engaged in both CLARIN and META-SHARE, created a CMD profile (\concept{data}\furl{http://catalog.clarin.eu/ds/ComponentRegistry/?item=clarin.eu:cr1:p_1349361150622}) modelling the minimal obligatory set of META-SHARE \concept{resourceInfo}, combined, however, with a simple dublincore record.
     234This way, the information is partly duplicated, but with the advantage that a minimal set of information is conveyed in a widely understood format, while retaining the expressivity of the feature-rich schema.
     235
     236The expression of the META-SHARE schema in CMD allows a direct comparison of the two different approaches taken in the two projects: a metamodel that allows generating custom profiles with shared semantics vs. the more traditional way of trying to create one schema that fits all the information. It nicely shows the trade-off: many custom schemas, with the risk of proliferation and problems with semantic interoperability, or one very large schema, with the risk of overwhelming the user and still not being able to capture all specific information.
     237
     238\begin{figure*}
    227239\begin{center}
    228240\includegraphics[width=0.5\textwidth]{images/SMC-resourceInfo.png}
    229241\end{center}
     242\caption{The five \concept{resourceInfo} profiles with the first level of components}
     243\label{fig:resource_info_5}
     244\end{figure*}
     245
     246\begin{figure*}
     247\begin{center}
     248\includegraphics[width=0.75\textwidth]{images/LINDAT-profile-data.png}
     249\end{center}
    230250\caption{profile by LINDAT combining META-SHARE \xne{resourceInfo} component with dublincore elements }
    231 \label{fig:resource_info_5}
     251\label{fig:META-SHARE-LINDAT}
    232252\end{figure*}
    233253
     
    235255\caption{Profiles modelling resourceInfo}
    236256\label{table:resourceinfo-profiles}
    237   \begin{tabular}{ l | l | l | r | r }
    238     \hline
    239 profile name & created & creator & count & instances \\
    240     \hline
    241 resourceInfo (minimal) & 2013-02-13 & LINDAT.CZ & 34 / 41 / 21 \\
    242 resourceInfo (lexical) & 2013-06-02 & P. Labropoulou & 86 / 226 / 57 \\
    243 resourceInfo (tools) & 2013-06-02 & P. Labropoulou & 61 / 176 / 52 \\
    244 resourceInfo (language) & 2013-06-02 & P. Labropoulou & 89 / 228 / 54 \\
    245 resourceInfo (corpus) & 2013-06-02 & P. Labropoulou & 117 / 337 / 72 \\
    246     \hline
    247   \end{tabular}
     257  \begin{tabu}{ l l l r r }
     258    \hline
     259\rowfont{\itshape\small} profile name & created & creator & count & instances \\
     260    \hline
     261resourceInfo (minimal) & 2013-02 & LINDAT.CZ & 34 / 41 / 21 & 67 \\
     262resourceInfo (lexical) & 2013-06 & P. Labropoulou & 86 / 226 / 57 & \\
     263resourceInfo (tools) & 2013-06 & P. Labropoulou & 61 / 176 / 52 & \\
     264resourceInfo (language) & 2013-06 & P. Labropoulou & 89 / 228 / 54 & \\
     265resourceInfo (corpus) & 2013-06 & P. Labropoulou & 117 / 337 / 72 & \\
     266    \hline
     267  \end{tabu}
    248268\end{table}
    249269
    250 The model has been expressed as 4 CMD profiles each for a distinct resource type however all four sharing most of the components, as can be seen in figure \ref{fig:resource_info_5}. The biggest single profile is currently the remodelled maximum schema from the META-SHARE project for describing corpora, with 117 distinct components and 337 elements. When expanded, this translates to 419 components and 1587 elements. However, many of the components and elements are optional (and conditional), thus a specific instance will never use all the possible elements.
    251 
    252 In a parallel effort, LINDAT, the czech national infrastructure initiative engaged in both CLARIN and META-SHARE, created a CMD profile (\concept{data}\furl{http://catalog.clarin.eu/ds/ComponentRegistry/?item=clarin.eu:cr1:p_1349361150622}) modelling the minimal obligatory set of META-SHARE \concept{resourceInfo}), however combined with a simple dublincore record.
    253 This way, the information gets partly duplicated, but with the advantage, that a minimal information is conveyed in the widely understood format, retaining the expressivity of the feature-rich schema.
    254 
    255 \begin{figure*}[!ht]
    256 \begin{center}
    257 \includegraphics[width=0.75\textwidth]{images/LINDAT-profile-data.png}
    258 \end{center}
    259 \caption{profile by LINDAT combining META-SHARE \xne{resourceInfo} component with dublincore elements }
    260 \label{fig:META-SHARE-LINDAT}
    261 \end{figure*}
    262 
    263 \begin{figure*}[!ht]
    264 \begin{center}
    265 \includegraphics[height=1\textheight]{images/resourceInfoBIG.png}
     270
     271\begin{figure*}
     272\begin{center}
     273\includegraphics[height=0.95\textheight]{images/resourceInfoBIG.png}
    266274\end{center}
    267275\caption{the META-SHARE based profile for describing corpora}
     
    270278
    271279
    272 
     280%%%%%%%%%%%%%%%%%%%%%%%
     281\subsection{SMC cloud}
     282As the latest, still experimental, addition, SMC browser provides a special type of graph that displays only profiles. The links between them reflect the reuse of components and data categories (i.e. how many components or data categories the linked pairs of profiles share), indicating the degree of similarity or semantic proximity. Figure \ref{fig:SMC_cloud} depicts one possible output of the graph
     283covering a large part of the defined profiles. It nicely shows the clusters of strongly related profiles in contrast to the greater distances between more loosely connected profiles.
     284
     285\begin{figure*}[!ht]
     286\begin{center}
     287\includegraphics[width=1\textwidth]{images/just_profiles_6.png}
     288\end{center}
     289\caption{SMC cloud -- graph visualizing the semantic proximity of profiles}
     290\label{fig:SMC_cloud}
     291\end{figure*}
     292
     293\begin{comment}
    273294\section{Evaluation}
    274295\label{evaluation}
     
    298319AF + DCR + RR
    299320
     321\end{comment}
    300322
    301323\section{Summary}
    302 
    303 The direct comparison of the CMD approach of metamodel allowing to generate custom profiles with shared semantics and a more traditional way of trying to generate one schema to fit all in as in META-SHARE shows nicely the trade-off: many custom schemas or one very large.
    304 
     324In this final chapter, we presented the results: on the one hand, the technical solution of the module \xne{Semantic Mapping Component}; on the other hand, a good part of the chapter was devoted to commented analyses of the processed dataset. These analyses were made possible by \xne{SMC Browser}, an interactive visualization tool developed as part of this work for exploring the schema-level data of the discussed collection. As such, the analyses can be seen as an evaluation, a proof of concept and of the usefulness of the presented work.
     325
  • SMC4LRT/chapters/abstract_de.tex

    r2672 r3665  
    11\chapter*{Kurzfassung}
    22
    3 Hier fügen Sie die Kurzfassung auf Deutsch gemäß den Vorgaben der Fakultät ein.
     3Diese Arbeit ist eingebettet in eine große internationale Forschungsinfrastruktur-Initiative, die zur Aufgabe hat,
     4einfachen, stabilen, harmonisierten Zugang zu Sprachressourcen und Technologien in Europa zu ermöglichen, der \emph{Common Language Resource and Technology Infrastructure} oder CLARIN. Das technische Herzstück dieser Unternehmung ist die \emph{Component Metadata Infrastructure}, ein verteiltes System, das harmonisiertes kohärentes Erstellen und Verbreiten von Metadaten für Sprachressourcen ermöglicht. Das Ergebnis dieser Arbeit, das Modul \emph{Semantic Mapping Component}, wurde als Bestandteil des Systems erdacht, um unter Ausnutzung der in die Infrastruktur eingebauten Mechanismen das Problem der semantischen Interoperabilität zu überwinden, das sich aus der Heterogenität der Metadaten-Formate ergibt.
     5
     6Das eigentliche Ziel, der Nutzen dieser Arbeit -- im Einklang mit der generellen Idee des ganzen Unterfangens -- war die \emph{Verbesserung der Suchmöglichkeiten} in der großen heterogenen Sammlung von Metadaten. Diese Aufgabe wurde in zwei separaten, sich ergänzenden Herangehensweisen angegangen: a) Entwurf und Entwicklung eines Dienstes (service) zur Bereitstellung von \emph{crosswalks} (Entsprechungen zwischen Feldern in unterschiedlichen Metadaten-Formaten) auf der Basis von wohldefinierten Konzepten und die Anwendung dieser \emph{crosswalks} bei Suchszenarien, um die Trefferquote zu erhöhen. b) die integrative Kraft des \emph{Linked Open Data} Paradigmas anerkennend, Modellierung der Domänendaten als eine \emph{Semantic Web} Ressource, um die Nutzung von semantischen Technologien auf dem Datensatz zu ermöglichen.
     7
     8Entsprechend den zwei Herangehensweisen lieferte die Arbeit auch zwei Hauptergebnisse: a) die Spezifikation eines Moduls für \emph{konzept-basierte Suche} zusammen mit dem zugrundeliegenden Dienst \emph{crosswalk service}, begleitet von einer Testimplementierung; b) Spezifikation der Modellierung der Ausgangsdaten im RDF Format, womit die Grundlage geschaffen ist, die Daten als \emph{Linked Open Data} bereitzustellen.
     9
     10Teilweise als Nebenprodukt wurde auch die Anwendung \emph{SMC Browser} entwickelt -- ein interaktives Visualisierungswerkzeug zur Erschließung der Schema-Ebene der Datensammlung. Mit Hilfe dieses Werkzeugs konnte eine Reihe von tiefergehenden Analysen der Daten erstellt werden, die direkt von der Forschergemeinschaft zur Erschließung und Redaktion der komplexen Daten genutzt werden. Somit können die Anwendung und die Analyseberichte als ein wertvoller Beitrag für die Forschergemeinschaft angesehen werden.
  • SMC4LRT/chapters/abstract_en.tex

    r3638 r3665  
    22
    33
    4 This work is embedded in the context of a large research infrastructure initiative aimed at easing and harmonizing access to language resources and technology (LRT) in Europe, the \emph{Common Language Resource and Technology Infrastructure} or CLARIN. A core technical pillar of this initiative is the \emph{Component Metadata Infrastructure}, a distributed system for creating and providing metadata for LRT in a coherent harmonized way. The outcome of this work, the \emph{Semantic Mapping Component}, was conceived as one module within the infrastructure dedicated to overcome the semantic interoperability problem stemming from the heterogeneity of the resource descriptions, by harnessing the mechanisms of the semantic layer built-in at the core of the infrastructure.
     4This work is embedded in the context of a large research infrastructure initiative aimed at providing easy, stable, harmonized access to language resources and technology (LRT) in Europe, the \emph{Common Language Resource and Technology Infrastructure} or CLARIN. A core technical pillar of this initiative is the \emph{Component Metadata Infrastructure}, a distributed system for creating and providing metadata for LRT in a coherent harmonized way. The outcome of this work, the \emph{Semantic Mapping Component}, was conceived as one module within the infrastructure dedicated to overcoming the semantic interoperability problem stemming from the heterogeneity of the resource descriptions, by harnessing the mechanisms of the semantic layer built into the core of the infrastructure.
    55
    6 The ultimate objective of the effort -- in line with the overall mission of the infrastructure -- was to \emph{enhance search functionality} over the large heterogeneous collection of resource descriptions. This was pursued by two separate, complementary approaches: a) Enriching the search capabilities with concept-based crosswalks on schema level.
    7 And -- acknowledging the integrative power of the \emph{Linked Open Data} paradigm  -- b) expressing the domain data as a \emph{Semantic Web} resource.
     6The ultimate objective of this work -- in line with the overall mission of the whole initiative -- was to \emph{enhance search functionality} over the large heterogeneous collection of resource descriptions. This objective was pursued in two separate, complementary approaches: a) design a service delivering crosswalks (i.e. equivalences between fields in disparate metadata formats) based on well-defined concepts and apply these concept-based crosswalks in search scenarios to enhance recall; b) acknowledging the integrative power of the \emph{Linked Open Data} paradigm, express the domain data as a \emph{Semantic Web} resource, to enable the application of semantic technologies on the dataset.
    87
    9 In parallel with the two approaches, the work delivered two main results: a) the \emph{specification} of the module for \emph{concept-based search} together with the underlying \emph{crosswalks service} accompanied by a proof-of-concept implementation. And b) the blueprint for expressing the original dataset in RDF, effectively laying a foundation for providing this dataset as \emph{Linked Open Data}.
    10 As a by-product, the application \emph{SMC browser} was developed -- a visualization tool for interactive exploration of the dataset. This tool provided means to generate a number of advanced analyses of the data, directly used by the community for exploration and curation of the complex dataset.  As such, they are considered the main contribution of this work by the author.
     8In parallel with the two approaches, the work delivered two main results: a) the \emph{specification} of the module for \emph{concept-based search} together with the underlying \emph{crosswalks service} accompanied by a proof-of-concept implementation. And b) the blueprint for expressing the original dataset in RDF format, effectively laying a foundation for providing this dataset as \emph{Linked Open Data}.
    119
     10Partly as a by-product, the application \emph{SMC browser} was developed -- an interactive visualization tool to explore the dataset on the schema level. This tool provided the means to generate a number of advanced analyses of the data, directly used by the community for exploration and curation of the complex dataset. As such, the tool and the reports can be considered a valuable contribution to the community.
     11
  • SMC4LRT/chapters/appendix.tex

    r3638 r3665  
    3030\begin{figure*}[!ht]
    3131\begin{center}
    32 \includegraphics[width=1\textwidth]{images/acdh-diagram_300dpi_rotated.png}
     32\includegraphics[width=1\textheight, angle=90]{images/acdh-diagram_300dpi.png}
    3333\end{center}
    3434\caption{Austrian Centre for Digital Humanities - the home of SMC - in context}
     
    3636\end{figure*}
    3737
     38\chapter{CMD -- sample data}
    3839
    39 \chapter{SMC Browser}
     40\section{Definition of a CMD profile}
     41
     42\section{CMD record}
     43
     44
     45\chapter{SMC Browser -- related material }
    4046
    4147
     
    5359\input{chapters/userdocs_cleaned}
    5460
     61\section {Sample SMC graphs}
     62\label{sec:smc-graphs}
    5563
    56 
    57 
    58 
     64\begin{comment}
     65       
    5966\chapter{SMC Reports}
    60 \label{ch:reports}
     67\label{ch:smc-reports}
    6168
    6269SMC Reports is a (growing) set of documents analyzing specific phenomena in the CMD data domain based on the visual and numerical output from the SMC Browser \ref{smc-browser}.
    6370
    6471\input{chapters/examples_cleaned}
     72\end{comment}
  • SMC4LRT/chapters/danksagung.tex

    r3638 r3665  
    22
    33Ich möchte mich herzlich bedanken, bei allen Kollegen die mir mit Rat zur Seite gestanden sind
    4 und meinen Liebsten für ihre extra-portion Geduld, die ich ihnen abverlangt habe.
     4und meinen Liebsten für die Extra-Portion Geduld, die ich ihnen abverlangt habe.