Changeset 4117 for SMC4LRT

Timestamp: 12/01/13 19:04:51
Author: vronk
Message: minor orthographic corrections
Location: SMC4LRT/chapters
Files: 14 edited

  • SMC4LRT/chapters/Conclusion.tex

    r3776 r4117  

     %Irrespective of the additional levels - the user wants and has to get to the resource. (not always) to the "original"
    -And finally, a visualization tool for exploring the schema level data of the discussed data collection was developed -- the \emph{SMC Browser}. Considering the feedback received until now from the colleagues in the community, it is already now a useful tool with high further potential. As detailed in \ref{smc-browser-extensions}, there is a number of features, that could enhance the functionality and usefulness of the tool: integrate with instance data to be able to directly see which profiles are effectively being used; allow set operations on subgraphs (like intersection and difference) to enable differential views; generalize the matching algorithm; enhance the tool to act as an independent visualization service, by accepting external graph data (from any domain).
    +And finally, a visualization tool for exploring the schema-level data of the discussed data collection was developed -- the \emph{SMC Browser}. Considering the feedback received so far from colleagues in the community, it is already a useful tool with high further potential. As detailed in \ref{smc-browser-extensions}, there are a number of features that could enhance the functionality and usefulness of the tool: integrate with instance data to be able to directly see which profiles are effectively being used; allow set operations on subgraphs (like intersection and difference) to enable differential views; generalize the matching algorithm; enhance the tool to act as an independent visualization service, by accepting external graph data (from any domain).

     Within the CLARIN community, a number of (permanent) tasks have been identified and corresponding task forces have been established,
  • SMC4LRT/chapters/Data.tex

    r3776 r4117  

    -\chapter{Analysis of the data landscape}
    +\chapter{Analysis of the Data Landscape}
     \label{ch:data}
     This section gives an overview of existing standards and formats for metadata in the field of Language Resources and Technology, together with a description of their characteristics and their respective usage in initiatives and data collections. Special attention is paid to the Component Metadata Framework, representing the base data model for the infrastructure this work is part of.
     
     The \emph{Component Metadata Framework} (CMD) is the data model of the CLARIN Component Metadata Infrastructure. (See \ref{def:CMDI} for information about the infrastructure. The XML schema defining CMD -- the \xne{general-component-schema} -- is featured in appendix \ref{lst:cmd-schema}.)
     CMD is used to define so-called \var{profiles}, constructed out of reusable \var{components} -- collections of metadata fields. Components can contain other components, and they can be reused in multiple profiles. A profile itself is just a special kind of component (a subclass), with some additional administrative information.
    -The actual core provision for semantic interoperability is the requirement, that each CMD element (i.e. metadata field) refers ``via a PID to exactly one data category(cf. \ref{def:DCR}\footnote{in short: persistently referencable concept definition}), thus
    +The actual core provision for semantic interoperability is the requirement that each CMD element (i.e. metadata field) refers ``via a PID to exactly one data category (cf. \ref{def:DCR}\footnote{in short: a persistently referencable concept definition}), thus
     indicating unambiguously how the content of the field in a metadata description should be interpreted'' \cite{Broeder+2010}.

    -%This approach of integrating prerequisites for semantic interoperability directly into the process of metadata creation is fundamentally different from the traditional methods of schema matching that try to establish pairwise alignments between already existing schemas -- be it algorithm-based or by means of explicit manually defined crosswalks\cite{Shvaiko2005}.
    +%This approach of integrating prerequisites for semantic interoperability directly into the process of metadata creation is fundamentally different from the traditional methods of schema matching that try to establish pairwise alignments between already existing schemas -- be it algorithm-based or by means of explicit manually defined crosswalks \cite{Shvaiko2005}.

     While the primary registry for data categories used in CMD is the \xne{ISOcat} Data Category Registry (cf. \ref{def:DCR}), other authoritative sources are accepted (so-called ``trusted registries''), especially the set of terms maintained by the Dublin Core Metadata Initiative \cite{DCMI:2005}.

    -Once the profiles are defined they are transformed into a XML Schema, that prescribes the structure of the instance records.
    +Once the profiles are defined, they are transformed into an XML Schema that prescribes the structure of the instance records.
     The generated schema also conveys, as annotations, the information about the referenced data categories.

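The mechanism described in this hunk, each metadata element carrying a persistent reference to exactly one data category, can be sketched with a few lines of code. CMDI component specifications use a \code{ConceptLink} attribute for this reference, but the sample document, element layout and PIDs below are invented for illustration; real component specifications are considerably richer.

```python
# Sketch: collect data-category references from a CMD-style component
# specification. The ConceptLink attribute name follows CMDI conventions;
# the sample document and the PIDs are invented for illustration.
import xml.etree.ElementTree as ET

SAMPLE = """
<CMD_Component name="Actor">
  <CMD_Element name="firstName" ConceptLink="http://example.org/datcat/DC-2552"/>
  <CMD_Element name="lastName" ConceptLink="http://example.org/datcat/DC-2553"/>
  <CMD_Component name="Contact">
    <CMD_Element name="email" ConceptLink="http://example.org/datcat/DC-2521"/>
  </CMD_Component>
</CMD_Component>
"""

def concept_links(xml_text: str) -> dict:
    """Map each element name to the PID of its referenced data category."""
    root = ET.fromstring(xml_text)
    return {el.get("name"): el.get("ConceptLink")
            for el in root.iter("CMD_Element")}

print(concept_links(SAMPLE))
```

A consumer with such a mapping can interpret two differently named fields as semantically equivalent whenever they point to the same data category, which is exactly the interoperability provision the text describes.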
     
     The main CLARIN OAI-PMH harvester\footnote{\url{http://catalog.clarin.eu/oai-harvester/}}
     collects records from 69 providers on a daily basis. The complete dataset amounts to 540.065 records.
    -16 of the providers offer CMDI records, the other 53 provide OLAC/DC records\label{info:olac-records}, that are being converted into the corresponding CMD profile after harvesting. Next to these 81.226 original OLAC records, there a few providers offering their OLAC or DCMI-terms records already converted into CMDI, thus all in all OLAC, DCMI-terms records amount to 139.152.
    -On the other hand, some of the comparatively few providers of `native' CMD records expose multiple profiles (e.g. Meertens Institute uses 12 different profiles.) So we encounter both situations: one profile being used by many providers and one provider using many profiles.
    +16 of the providers offer CMDI records; the other 53 provide OLAC/DC records\label{info:olac-records} that are converted into the corresponding CMD profile after harvesting. Next to these 81.226 original OLAC records, there are a few providers offering their OLAC or DCMI-terms records already converted into CMDI, so all in all OLAC and DCMI-terms records amount to 139.152.
    +On the other hand, some of the comparatively few providers of `native' CMD records expose multiple profiles (e.g. the Meertens Institute uses 12 different profiles). So we encounter both situations: one profile being used by many providers and one provider using many profiles.


     
     24.583 & DoBeS archive \\
     23.185 & Language and Cognition \\
    +17.859 & Institut für Deutsche Sprache, CLARIN-D Zentrum, Mannheim \\
     14.593 & talkbank \\
     14.363 & Acquisition \\
    -14.320 & Institut für Deutsche Sprache, CLARIN-D Zentrum, Mannheim \\
     12.893 & MPI CGN \\
     10.628 & Bavarian Archive for Speech Signals (BAS) \\
     
     4.640 & Oxford Text Archive \\
     4.492 & Leipzig Corpora Collection \\
    -3.539 & Institut für Deutsche Sprache, CLARIN-D Zentrum, Mannheim \\
     3.280 & A Digital Archive of Research Papers in Computational Linguistics \\
     3.147 & CLARIN NL \\
     3.081 & MPI für Bildungsforschung \\
    +2.678 & WALS Online \\
     \hline
       \end{tabu}
     
     \end{table}

    -We can also observe a large disparity on the amount of records between individual providers and profiles. Almost half of all records is provided by the Meertens Institute (\textit{Liederenbank} and \textit{Soundbites} collections), another 25\% by MPI for Psycholinguistics (\textit{corpus} + \textit{Session} records from the \textit{The Language Archive}). On the other hand there are 25 profiles that have less than 10 instances. This can be owing both to the state of the respective project (resources and records still being prepared) and the modelled granularity level (collection vs. individual resource).
    +We can also observe a large disparity in the number of records between individual providers and profiles. Almost half of all records are provided by the Meertens Institute (\textit{Liederenbank} and \textit{Soundbites} collections), another 25\% by the MPI for Psycholinguistics (\textit{corpus} + \textit{Session} records from \textit{The Language Archive}). On the other hand, there are 25 profiles that have fewer than 10 instances. This can be due both to the state of the respective project (resources and records still being prepared) and to the modelled granularity level (collection vs. individual resource).


     
     Next to CLARIN and CMDI, there is a host of related previous and concurrent work. In the following, we briefly introduce some formats and data collections established in the field and, where applicable, we also sketch the ties with CMDI and existing integration efforts.

    -As for comprehensive overview of formats and standards, the CLARIN deliverable \textit{Interoperability and Standards} \cite{CLARIN_D5.C-3} provides overview of standards, vocabularies and other normative/standardization work in the field of Language Resources and Technology. And \textit{Seeing standards: a visualization of the metadata universe} by Riley and Becker \cite{Riley2010seeing} pus the overwhelming amount of existing metadata standards into a systematic comprehensive visual overview analyzing the use of standards from four aspects: community, domain, function, and purpose. Though despite its aspiration on comprehensiveness it leaves out some of the formats relevant in the context of this work: IMDI, EDM, ESE.
    -
    -
    -\subsection{Dublin Core metadata terms}
    +As for a comprehensive overview of formats and standards, the CLARIN deliverable \textit{Interoperability and Standards} \cite{CLARIN_D5.C-3} provides an overview of standards, vocabularies and other normative work in the field of Language Resources and Technology. \textit{Seeing standards: a visualization of the metadata universe} by Riley and Becker \cite{Riley2010seeing} puts the overwhelming number of existing metadata standards into a systematic, comprehensive visual overview, analyzing the use of standards from four aspects: community, domain, function, and purpose. Despite its aspiration to comprehensiveness, it leaves out some formats relevant in the context of this work: IMDI, EDM, ESE.
    +
    +
    +\subsection{Dublin Core Metadata Terms}
     The work on this metadata format started in 1995 at the Metadata Workshop\furl{http://dublincore.org/workshops/dc1/} organized by OCLC/NCSA in Dublin, Ohio, USA. It is nowadays maintained by the Dublin Core Metadata Initiative.

     
     \end{description}

    -The DCMI terms format is very widely spread nowadays. Thanks to its simplicity it is used as the common denominator in many applications, content management systems integrate Dublin Core to use in \code{meta} tags of served pages (\code{<meta name="DC.Publisher" content="publisher-name" >}), it is default minimal description in content repositories (Fedora-commons, DSpace). It is also the obligatory base format in the OAI-PMH protocol. The OpenArchives register\furl{http://www.openarchives.org/Register/BrowseSites} lists more than 2100 data providers.
    -
    -There are multiple possible serializations, in particular a mapping t RDF is specified\furl{http://dublincore.org/documents/dcq-rdf-xml/}.
    +The DCMI terms format is very widespread nowadays. Thanks to its simplicity, it is used as the common denominator in many applications: content management systems embed Dublin Core in the \code{meta} tags of served pages (\code{<meta name="DC.Publisher" content="publisher-name" >}), and it is the default minimal description format in content repositories (Fedora Commons, DSpace). It is also the obligatory base format in the OAI-PMH protocol. The OpenArchives register\furl{http://www.openarchives.org/Register/BrowseSites} lists more than 2100 data providers.
    +
    +There are multiple possible serializations; in particular, a mapping to RDF is specified\furl{http://dublincore.org/documents/dcq-rdf-xml/}.
     Worth noting is Dublin Core's take on the classification of resources\furl{http://dublincore.org/documents/resource-typelist/}.
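The embedding of Dublin Core into HTML \code{meta} tags mentioned in this hunk can be sketched as follows. The \code{DC.*} naming convention and the \code{Publisher} example come from the text; the helper function and the sample values are illustrative, not taken from any particular content management system.

```python
# Sketch: render a dict of Dublin Core fields as HTML meta tags, the way
# content management systems embed them in served pages. The helper and
# the sample field values are invented for illustration.
from html import escape

def dc_meta_tags(fields: dict) -> str:
    """Emit one <meta> tag per field, using the DC.* naming convention."""
    return "\n".join(
        f'<meta name="DC.{escape(name)}" content="{escape(value)}">'
        for name, value in fields.items()
    )

print(dc_meta_tags({
    "Title": "A sample resource",
    "Publisher": "publisher-name",
    "Language": "en",
}))
```

Because the element set is so small, such tags are trivial for any aggregator to parse, which is precisely why the format works as a common denominator.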

    -The simplicity of the format is also it's main drawback when considered as metadata format in the research communities. It it too general to capture all specific details, individual research groups need to describe different kinds of resources with.
    +The simplicity of the format is also its main drawback when considered as a metadata format for the research communities. It is too general to capture all the specific details with which individual research groups need to describe different kinds of resources.
    157157
     \subsection{OLAC}
     \label{def:OLAC}

    -\xne{OLAC Metadata}\furl{http://www.language-archives.org/}format \cite{Bird2001} is a application profile\cite{heery2000application}, of the \xne{Dublin Core metadata terms}, adapted to the needs of the linguistic community. It is developed and maintained by the \xne{Open Language Archives Community} providing a common platform and an infrastructure for ``creating a worldwide virtual library of language resources'' \cite{Simons2003OLAC}.
    -
    -The OLAC schema \furl{http://www.language-archives.org/OLAC/1.1/olac.xsd} extends the dcterms schema mainly by adding attributes with controlled vocabularies, for domain specific semantic annotation (\code{linguistic-field}, \code{role}, \code{linguistic-type}, \code{language}, \code{discourse-type}).
    +The \xne{OLAC Metadata}\furl{http://www.language-archives.org/} format \cite{Bird2001} is an application profile \cite{heery2000application} of the \xne{Dublin Core metadata terms}, adapted to the needs of the linguistic community. It is developed and maintained by the \xne{Open Language Archives Community}, which provides a common platform and an infrastructure for ``creating a worldwide virtual library of language resources'' \cite{Simons2003OLAC}.
    +
    +The OLAC schema\furl{http://www.language-archives.org/OLAC/1.1/olac.xsd} extends the dcterms schema mainly by adding attributes with controlled vocabularies for domain-specific semantic annotation (\code{linguistic-field}, \code{role}, \code{linguistic-type}, \code{language}, \code{discourse-type}).

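The extension mechanism described in this hunk, a plain Dublin Core element refined by an attribute drawn from a controlled vocabulary, can be sketched as below. The namespace URIs are the published DC and OLAC ones; the record content is invented, and the markup is simplified with respect to a real OLAC record (which additionally uses \code{xsi:type} to signal the refinement).

```python
# Sketch: build a minimal OLAC-flavoured record -- Dublin Core elements,
# one of them refined by a controlled-vocabulary attribute (olac:code).
# Namespace URIs are the published DC/OLAC ones; content is invented and
# the markup is simplified compared to a real OLAC record.
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"
OLAC = "http://www.language-archives.org/OLAC/1.1/"
ET.register_namespace("dc", DC)
ET.register_namespace("olac", OLAC)

record = ET.Element("record")
ET.SubElement(record, f"{{{DC}}}title").text = "A sample speech corpus"
lang = ET.SubElement(record, f"{{{DC}}}language")
lang.set(f"{{{OLAC}}}code", "de")  # code from a controlled language vocabulary

print(ET.tostring(record, encoding="unicode"))
```

The point of the attribute is that the free-text element content stays human-readable while the code remains machine-comparable across archives.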
     \begin{quotation}
     
     OLAC provides a ``search over 100,000 records collected from 44 archives\furl{http://www.language-archives.org/archives}, covering resources in half of the world's living languages''.

    -Note, that OLAC archives are being harvested by CLARIN harvester and OLAC records are part of the CMDI joint metadata domain (cf. \ref{tab:cmd-profiles}, \ref{reports:OLAC}).
    +Note that OLAC archives are harvested by the CLARIN harvester and OLAC records are part of the CMDI joint metadata domain (cf. \ref{tab:cmd-profiles}, \ref{reports:OLAC}).


     

     \begin{quotation}
    -  The Text Encoding Initiative (TEI) is a consortium which collectively develops and maintains a standard for the representation of texts in digital form \dots  [Next to] its chief deliverable is a set of Guidelines which specify encoding methods for machine-readable texts, chiefly in the humanities, social sciences and linguistics, \dots the Consortium provides a variety of TEI-related resources, training events and software. [abgridged]
    +  The Text Encoding Initiative (TEI) is a consortium which collectively develops and maintains a standard for the representation of texts in digital form \dots [Next to] its chief deliverable is a set of Guidelines which specify encoding methods for machine-readable texts, chiefly in the humanities, social sciences and linguistics, \dots the Consortium provides a variety of TEI-related resources, training events and software. [abridged]
     \end{quotation}

    -TEI is a de-facto standard for encoding any kind of digital textual resources being developed by a large community since 1994. It defines a set of elements to annotate individual aspects of the text being encoded. For the purposes of text description, metadata encoding (of main concern for us) the complex top-level element \code{teiHeader} is foreseen. TEI is not prescriptive, but rather descriptive, it does not provide just one fixed schema, but allows for a certain flexibility wrt to elements used and inner structure, allowing to generate custom schemas adopted to projects' needs. Thus there is also not just one fixed \code{teiHeader}.
    +TEI is a de facto standard for encoding any kind of digital textual resource, developed by a large community since 1994. It defines a set of elements to annotate individual aspects of the text being encoded. For the purposes of text description and metadata encoding (of main concern for us), the complex top-level element \code{teiHeader} is foreseen. TEI is not prescriptive but rather descriptive: it does not provide just one fixed schema, but allows for a certain flexibility with respect to the elements used and the inner structure, allowing custom schemas adapted to projects' needs to be generated. Thus there is also not just one fixed \code{teiHeader}.

     Some of the data collections encoded in TEI are the corpora of the DWDS\furl{http://www.dwds.de}, the Deutsches Textarchiv\furl{http://www.dwds.de/dta} \cite{Geyken2011deutsches}, and the Oxford Text Archive\furl{http://ota.oucs.ox.ac.uk/}.
     
     \subsection{ISLE/IMDI -- The Language Archive}

    -\xne{IMDI}\furl{http://www.mpi.nl/imdi/} (\xne{EAGLES/ISLE Meta Data Initiative}) is an elaborate format for detailed descriptions of multi-media/multi-modal language resoruces developed within the corresponding project\cite{wittenburg2000eagles} 2000 to 2003.
    -
    -To serve the main goal of the project, easing access to language resources fostering the reuse, resource description in this new format were created for a number of collections and were made available via a dedicated \xne{IMDI browser}\furl{http://corpus1.mpi.nl/ds/imdi_browser/}, that allowed browsing the collection structure as well as complex advanced search over the deeply structured metadata. Also a metadata editor was developed for generating records in this format, with provisions for offline field-work and synchronization with the repository.
    +\xne{IMDI}\furl{http://www.mpi.nl/imdi/} (\xne{EAGLES/ISLE Meta Data Initiative}) is an elaborate format for detailed descriptions of multi-media/multi-modal language resources, developed within the corresponding project \cite{wittenburg2000eagles} from 2000 to 2003.
    +
    +To serve the main goal of the project, easing access to language resources and fostering their reuse, resource descriptions in this new format were created for a number of collections and were made available via a dedicated \xne{IMDI browser}\furl{http://corpus1.mpi.nl/ds/imdi_browser/} that allowed browsing the collection structure as well as complex advanced search over the deeply structured metadata. A metadata editor was also developed for generating records in this format, with provisions for offline fieldwork and synchronization with the repository.

     The project lead, responsible for running the repository and the whole infrastructure, was the Technical Group at the MPI for Psycholinguistics, which has engaged in a number of projects aimed at building a stable technical infrastructure for long-term archiving of and work with language resources since its foundation (together with the Institute itself) in the 1970s\furl{http://tla.mpi.nl/home/history/}. Recently, the group and the established infrastructure have been renamed to \xne{TLA -- The Language Archive}\furl{http://tla.mpi.nl/} (``Your partner for language data, tools and archiving''), where on one platform both a host of language resources and their descriptions are preserved and provided, and tools for working with this data are offered. The archive is also an aggregator itself, offering various collections from different (also external) projects (like DOBES, CGN, RELISH, etc.).
     
     \label{def:META-SHARE}

    -META-SHARE was the subproject (2010-2013) of META-NET, a Network of Excellence consisting of 60 research centres from 34 countries, that covered the technical aspects.
    +META-SHARE was the subproject (2010-2013) of META-NET, a Network of Excellence consisting of 60 research centres from 34 countries, covering the technical aspects.


     
     \end{quotation}

    -Within the project META-SHARE a new metadata format was developed\cite{Gavrilidou2012meta}. Although inspired by the Component Metadata, META-SHARE metadata imposes a single large schema for all resource types with a subset of core obligatory elements and with many optional components.
    +Within the META-SHARE project, a new metadata format was developed \cite{Gavrilidou2012meta}. Although inspired by Component Metadata, META-SHARE metadata imposes a single large schema for all resource types, with a subset of core obligatory elements and many optional components.
     %In cooperation between metadata teams from CLARIN and META-SHARE

    -The original META-SHARE schema actually accomodates four models for different resource types. Consequently, the model has been expressed as 4 CMD profiles each for a distinct resource type however all four sharing most of the components, as can be seen in figure \ref{fig:resource_info_5}. The biggest single profile is currently the remodelled maximum schema from the META-SHARE project for describing corpora, with 117 distinct components and 337 elements. When expanded, this translates to 419 components and 1587 elements. However, many of the components and elements are optional (and conditional), thus a specific instance will never use all the possible elements. (See \ref{reports-meta-share} for more details about the format based on its integration into CMDI)
    -
    -The technical infrastructure of META-SHARE represents a distributed network of repositories consists of a number of member repositories, that offer their own subset of resource\furl{http://www.meta-share.eu/}.
    -
    -Selected member repositories\footnote{7 as of 2013-07}  play the role of managing nodes providing ``a core set of services critical to the whole of the META-SHARE network''\cite{Piperidis2012meta}, especially collecting the resource descriptions from other members and exposing the aggregated information to the users.
    +The original META-SHARE schema actually accommodates four models for different resource types. Consequently, the model has been expressed as four CMD profiles, one for each resource type, with all four sharing most of the components, as can be seen in figure \ref{fig:resource_info_5}. The biggest single profile is currently the remodelled maximum schema from the META-SHARE project for describing corpora, with 117 distinct components and 337 elements. When expanded, this translates to 419 components and 1587 elements. However, many of the components and elements are optional (and conditional), thus a specific instance will never use all the possible elements. (See \ref{reports-meta-share} for more details about the format, based on its integration into CMDI.)
    +
    +The technical infrastructure of META-SHARE is a distributed network consisting of a number of member repositories that offer their own subset of resources\furl{http://www.meta-share.eu/}.
    +
    +Selected member repositories\footnote{7 as of 2013-07} play the role of managing nodes providing ``a core set of services critical to the whole of the META-SHARE network'' \cite{Piperidis2012meta}, especially collecting the resource descriptions from other members and exposing the aggregated information to the users.
     The whole network offers approximately 2.000 resources (the numbers differ even across individual managing nodes).

    -One point of criticism from the community was, the fact, that META-SHARE infrastructure does not provide any interface to the outer world, such as a OAI-PMH endpoint.
    +One point of criticism from the community was the fact that the META-SHARE infrastructure does not provide any interface to the outer world, such as an OAI-PMH endpoint.

     %? MetaShare ontology\furl{http://metashare.ilsp.gr/portal/knowledgebase/TheMetaShareOntology}
     

     The European Language Resources Association (ELRA)\furl{http://elra.info} offers a large collection of language resources (over 1.100), with a focus on spoken resources, but also written, terminological and multimodal resources, mostly under license for a fee (although selected resources are available for free as well).
    -The available datasets can be search for via ELRA Catalog\furl{http://catalog.elra.info/}
    +The available datasets can be searched for via the ELRA Catalog\furl{http://catalog.elra.info/}.
     Additionally, ELRA runs the so-called \xne{Universal Catalog}, a repository comprising information regarding Language Resources (LRs) identified all over the world.

     
     ELRA's missions are to promote language resources for the Human Language Technology (HLT) sector, and to evaluate language engineering technologies.

    -ELDA\furl{http://www.elda.org/} - Evaluations and Language resources Distribution Agency -- is ELRA's operational body, set up to identify, classify, collect, validate and produce the language resources which may be needed by the HLT -- Human Language Technology -- community.
    +ELDA\furl{http://www.elda.org/} -- the Evaluations and Language Resources Distribution Agency -- is ELRA's operational body, set up to identify, classify, collect, validate and produce the language resources which may be needed by the HLT community.
     ELDA handles the practical and legal issues related to the distribution of language resources, provides legal advice in the field of HLT, and
     
     \subsection{LDC}

    -Linguistic Data Consortium\furl{http://www.ldc.upenn.edu/} hosted by University of Pennsylvania is another provider/aggregator of high quality curated language resources. The data is provided for a fee, more than 650 resources have been made available since 1993. The catalog is freely accessible. The metadata is additionally aggregated by OLAC archives.
    +The Linguistic Data Consortium\furl{http://www.ldc.upenn.edu/}, hosted by the University of Pennsylvania, is another provider/aggregator of high-quality curated language resources. The data is licensed for a fee; more than 650 resources have been made available since 1993. The catalogue is freely accessible. The metadata is additionally aggregated by the OLAC archives.
    256256
     \section{Formats and Collections in the World of Libraries}
     \label{sec:lib-formats}

    -There are at least two reasons to concern ourselves with the developments in the world of Libraries and Information Systems (LIS): the long tradition implying rich experience and the fact, that almost all of the resources in the libraries are language resources. This argument gets even more relevant in the light of the efforts to digitize large portions of the material pursued in many (national) libraries in the last years (cf. discussion on Libraries partnering with Google). And given the amounts of data, even the sole bibliographic records constitute sizable language resources in they own right.
    +There are at least two reasons to concern ourselves with the developments in the world of Libraries and Information Systems (LIS): the long tradition implying rich experience, and the fact that almost all of the resources in libraries are language resources. This argument gets even more relevant in the light of the efforts to digitize large portions of the material pursued by many (national) libraries in recent years (cf. the discussion on libraries partnering with Google). And given the amounts of data, even the bibliographic records alone constitute sizable language resources in their own right.

     %\item[LoC] Library of Congress \url{http://www.loc.gov}
     
     There is a long tradition of standardized metadata formats in the world of Libraries and Information Systems (LIS), a major role in the standardization being assumed for decades by the Library of Congress\furl{http://www.loc.gov/standards/}.

    -The \xne{MARC}\furl{www.loc.gov/marc/} set of formats (being used since 1970s ) ``are standards for the representation and communication of bibliographic and related information in machine-readable form''. A number of variants developed over the years, the most widely spread is \xne{MARC 21} since 1999 -- is the standard format used for communication among libraries around the world.
    -
    -MARC 21 consists of 5 ``communication formats'' for specific types of data (Bibliographic, Authority Data, Holdings Data, Classification, and Community Information), are widely used standards for the representation and exchange of bibliographic, authority, holdings, classification, and community information data in machine-readable form. In 2002, the Library of Congress developed the \xne{MARCXML} schema for representing MARC records in XML;
    +The \xne{MARC}\furl{www.loc.gov/marc/} set of formats (in use since the 1970s) ``are standards for the representation and communication of bibliographic and related information in machine-readable form''. A number of variants have developed over the years; the most widespread, \xne{MARC 21} (in use since 1999), is the standard format for communication among libraries around the world.
    +
    +MARC 21 consists of 5 ``communication formats'' for specific types of data (Bibliographic, Authority Data, Holdings Data, Classification, and Community Information), which are widely used standards for the representation and exchange of such data in machine-readable form. In 2002, the Library of Congress developed the \xne{MARCXML} schema for representing MARC records in XML.

     \xne{METS -- Metadata Encoding and Transmission Standard} is a format from the family of Library of Congress standards (since 2001) for encoding descriptive, administrative, and structural metadata regarding objects within a digital library.
     
    277277A METS record acts as a flexible container that accomodates other pieces of data (different levels of metadata and encoded objects themselves or references to those) in external formats\furl{http://www.loc.gov/standards/mets/mets-extenders.html}.
    278278
    279 Number of tools have been developed to author and process \xne{METS} format\furl{http://www.loc.gov/standards/mets/mets-tools.html} and numerous projects (online editions, DAM systems) use METS for structuring and recording the data\footnote{\url{http://www.loc.gov/standards/mets/mets-registry.html} though seems rather outdated} among others also \xne{austrian literature online}\furl{http://www.loc.gov/standards/mets/mets-registry.html}
     279A number of tools have been developed to author and process the \xne{METS} format\furl{http://www.loc.gov/standards/mets/mets-tools.html}, and numerous projects (online editions, DAM systems) use METS for structuring and recording their data\footnote{\url{http://www.loc.gov/standards/mets/mets-registry.html}, though it seems rather outdated}, among others \xne{austrian literature online}\furl{http://www.loc.gov/standards/mets/mets-registry.html}.
    280280
    281281\xne{Metadata Object Description Schema} - ``is a schema for a bibliographic element set that may be used for a variety of purposes, and particularly for library applications''. It is a simplified subset of MARC 21 that uses language-based tags rather than numeric ones and is richer
    282282than Dublin Core. It is one of the schemas endorsed to extend (be used inside) METS.
    283283
    284 There have been efforts to create a conceptually more sound base for the bibliographic data -- in 1998 \xne{Functional Requirements for Bibliographic Records} (FRBR) \cite{FRBR1998} was published, an abstract model for the data expressed as an Entity Relationship Model and a standard based on FRBR, the \xne{Resource Description and Access} (RDA) has been proposed as an comprehensive standard for resource description and discovery, that however was confronted with opposition from the LIS community, questioning the need of abandoning established cataloging practices \cite{gorman2007rda}.
     284There have been efforts to create a conceptually more sound base for bibliographic data -- in 1998, the \xne{Functional Requirements for Bibliographic Records} (FRBR) \cite{FRBR1998} was published, an abstract model for the data expressed as an Entity Relationship Model. A standard based on FRBR, \xne{Resource Description and Access} (RDA), has been proposed as a comprehensive standard for resource description and discovery; it was, however, confronted with opposition from the LIS community, which questioned the need to abandon established cataloging practices \cite{gorman2007rda}.
    285285Although work on RDA continues, among others at the Library of Congress, there has been no wider adoption of the standard by the LIS community so far.
    286286
    287287\subsection{ESE, Europeana Data Model - EDM}
    288288
    289 Within the big european initiative \xne{Europeana} (cf. \ref{lit:digi-lib}) information about digitised objects are collected from a great number of cultural institutions from all of Europe, currently hosting information about 29 million objects from 2.200 institutions from 36 countries\furl{http://www.pro.europeana.eu/web/guest/content}.
    290 
    291 For collecting metadata from the content providers, Europeana originally developed and advised the common format \xne{ESE Europeana Semantic Elements}\furl{http://pro.europeana.eu/ese-documentation} a Dublin Core-based application profile\furl{www.europeana.eu/schemas/ese/ESE-V3.4.xsd}. Soon it became obvious, that this format is too limiting and work started on a Semantic Web compatible RDF-based format -- the Europeana Data Model EDM\furl{http://pro.europeana.eu/edm-documentation} \cite{isaac2012europeana,haslhofer2011data,doerr2010europeana}.
     289Within the big European initiative \xne{Europeana} (cf. \ref{lit:digi-lib}), information about digitised objects is collected from a great number of cultural institutions from all over Europe; it currently hosts information about 29 million objects from 2,200 institutions in 36 countries\furl{http://www.pro.europeana.eu/web/guest/content}.
     290
     291For collecting metadata from the content providers, Europeana originally developed and advised the common format \xne{ESE Europeana Semantic Elements}\furl{http://pro.europeana.eu/ese-documentation}, a Dublin Core-based application profile\furl{www.europeana.eu/schemas/ese/ESE-V3.4.xsd}. Soon it became obvious that this format was too limiting, and work started on a Semantic Web-compatible, RDF-based format -- the Europeana Data Model EDM\furl{http://pro.europeana.eu/edm-documentation} \cite{isaac2012europeana,haslhofer2011data,doerr2010europeana}.
    292292EDM is fully compatible with ESE, which continues to be accepted from the providers. There is also already a SPARQL endpoint\furl{http://europeana.ontotext.com/sparql} for exploring the Europeana data in the new format.
    293293%https://github.com/europeana
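To sketch how such an endpoint can be consumed, the following composes a SPARQL protocol GET request. The query shape and the \code{edm:ProvidedCHO} class are illustrative assumptions and should be checked against the endpoint's documentation; the request is only composed here, not sent.

```python
from urllib.parse import urlencode

# Endpoint mentioned in the text; the query below is an illustrative assumption.
ENDPOINT = "http://europeana.ontotext.com/sparql"

QUERY = """\
PREFIX edm: <http://www.europeana.eu/schemas/edm/>
SELECT ?object WHERE { ?object a edm:ProvidedCHO } LIMIT 10
"""

def build_request_url(endpoint: str, sparql: str) -> str:
    """Compose the GET URL for a SPARQL protocol request."""
    params = {"query": sparql, "format": "application/sparql-results+json"}
    return endpoint + "?" + urlencode(params)

url = build_request_url(ENDPOINT, QUERY)
```

Fetching \code{url} with any HTTP client would then return the query results, provided the endpoint supports the requested result format.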
     
    297297\label{refdata}
    298298
    299 One goal of this work being the groundwork for exposing the discussed dataset in the Semantic Web
    300 one preparatory task is to identify external semantic resources like controlled vocabularies or ontologies that the dataset could be linked with\footnote{Similar activity of inventarizing vocabularies and thesauri was conducted in the context of the \xne{Europeana} initiative
     299Since one goal of this work is to lay the groundwork for exposing the discussed dataset in the Semantic Web,
     300a preparatory task is to identify external semantic resources like controlled vocabularies or ontologies that the dataset could be linked with\footnote{A similar activity of inventorying vocabularies and thesauri was conducted in the context of the \xne{Europeana} initiative
    301301\url{http://europeanalabs.eu/wiki/WP12Vocabularies}, \url{https://europeanalabs.eu/wiki/DesignSemanticThesauri}}.
    302302
    303 Conceptually, we want to partition these resources in two types. On the one hand abstract concepts constituting all kinds of classifications, typologies, taxonomies. On the other hand named entities that exist(ed) in real world, like persons, organizations or geographical places. Main motivation for this distinction is the insight, that while for named entities there is (mostly) ``something'' in the (physical) world that gives a solid ground for equivalence relations between references from different sources (sameAs), for concepts we need to accept a plurality of existing conceptualizations and while we can (and have to) try to identify relations between them, the equivalence relation is inherently much weaker. This insight entails a partly different approach -- simply put, while we can aspire to create one large list/index encompassing all named entities, we have to maintain a forest of conceptual trees.
    304 
    305 In the following we inventarize such resources (cf. tables \ref{table:data-ne}, \ref{table:data-concepts}) covering the domains expected to be needed for linking the original dataset. (Information about size of the dataset is meant rather as a rough indication of the "general weight" of the dataset, not necessarily a precise up to date information.) The acronyms in the tables are resolved in the glossary \ref{table:vocab-glossary}
    306 How this resources will be employed is discussed in \ref{sec:values2entities}.
    307 Additionally, some verbose commentary follows.
     303Conceptually, we want to partition these resources into two types. On the one hand, abstract concepts constituting all kinds of classifications, typologies and taxonomies. On the other hand, named entities that exist(ed) in the real world, like persons, organizations or geographical places. The main motivation for this distinction is the insight that, while for named entities there is (mostly) ``something'' in the (physical) world that gives a solid ground for equivalence relations between references from different sources (\code{sameAs}), for concepts we need to accept a plurality of existing conceptualizations, and while we can (and have to) try to identify relations between them, the equivalence relation is inherently much weaker. This insight entails a partly different approach -- simply put, while we can aspire to create one large list/index encompassing all named entities, we have to maintain a forest of conceptual trees.
     304
     305In the following, we inventory such resources (cf. tables \ref{table:data-ne}, \ref{table:data-concepts}) covering the domains expected to be needed for linking the original dataset. (Information about the size of a dataset is meant as a rough indication of its ``general weight'', not necessarily as precise, up-to-date information.) The acronyms in the tables are resolved in the glossary \ref{table:vocab-glossary}. How these resources will be employed is discussed in \ref{sec:values2entities}. Additionally, some verbose commentary follows.
    308306
    309307%\subsubsection{Named entities}
    310308
    311 The largest controlled vocabularies of named entities are the authority files of (national) libraries. These are further aggregated into the so-called Virtual International Authority File, a huge resource, with entries from different authority files referring to the same entity being merged. This resource can be explored via a search interface and there is also a search service for applications.
    312 Other general large-scale resources are the vocabularies curated and provided by Getty Research Institute\furl{http://www.getty.edu/research/tools/vocabularies/index.html}, however there is only a limited free access and licensed and fee for full access. But recently there work was announced to publish the vocabularies as LOD\furl{http://www.getty.edu/research/tools/vocabularies/lod/index.html}
     309The largest controlled vocabularies of named entities are the authority files of (national) libraries. These are further aggregated into the so-called \xne{Virtual International Authority File}, a huge resource, with entries from different authority files referring to the same entity being merged. This resource can be explored via a search interface and there is also a search service for applications.
     310Other general large-scale resources are the vocabularies curated and provided by the Getty Research Institute\furl{http://www.getty.edu/research/tools/vocabularies/index.html}. Free access is limited and a fee is charged for full access, but recently the provider announced plans to publish the vocabularies as LOD\furl{http://www.getty.edu/research/tools/vocabularies/lod/index.html}.
    313311
    314312Regarding existing domain-specific semantic resources, \texttt{LT-World}\footnote{\url{http://www.lt-world.org/}}, the ontology-based portal covering primarily Language Technology being developed at DFKI\footnote{Deutsches Forschungszentrum für Künstliche Intelligenz, \url{http://www.dfki.de}}, is a prominent resource providing information about the entities (Institutions, Persons, Projects, Tools, etc.) in this field of study \cite{Joerg2010}.
    315313
    316 Also to mention \xne{Yago}, a large knowledge base created by MPI informatik integrating dbpedia, geonames and wordnet\furl{http://www.mpi-inf.mpg.de/yago-naga/yago/} \cite{Suchanek2007yago}.
     314Also worth mentioning is \xne{Yago}\furl{http://www.mpi-inf.mpg.de/yago-naga/yago/}, a large knowledge base created by MPI Informatik integrating the DBpedia, GeoNames and WordNet datasets \cite{Suchanek2007yago}.
    317315
    318316So we witness a strong general trend towards the Semantic Web and Linked Open Data.
     
    351349
    352350In this chapter, we gave an overview of the existing formats and datasets in the broad context of Language Resources and Technology.
    353 We also gave an overview of main formats and collections in the domain of Library and Information Services and a inventory of existing controlled vocabularies for named entities and concepts (taxonomies, classifications), needed as input in section \ref{sec:values2entities} about mapping values to entities.
     351We also gave an overview of the main formats and collections in the domain of Library and Information Services and an inventory of existing controlled vocabularies for named entities and concepts (taxonomies, classifications), needed as input in section \ref{sec:values2entities} about mapping values to entities.
    354352
    355353
     
    451449 %   \hline
    452450
    453 AAT & international Architecture and Arts Thesaurus, Getty \\
     451AAT & International Architecture and Arts Thesaurus, Getty \\
    454452CONA & Cultural Objects Name Authority \\
    455453DAI & Deutsches Archäologisches Institut \\
     
    460458FAST & Faceted Application of Subject Terminology \\
    461459Getty & Getty Research Institute curating the \href{http://www.getty.edu/research/tools/vocabularies/index.html}{vocabularies}, part of Getty Trust \\
    462 GND & \emph{Gemeinsame Norm Datei} - Integrated authority Files of the German National Library \\
     460GND & \emph{Gemeinsame Normdatei} - Integrated Authority Files of the German National Library \\
    463461GTAA & Gemeenschappelijke Thesaurus Audiovisuele Archieven (Common Thesaurus for Audiovisual Archives) \\
    464462% {quotation} The thesaurus consists of several facets for describing TV programs: subjects; people mentioned; named entities (Corporation names, music bands etc); locations; genres; makers and presentators. \end{quotation} \\
     
    467465LCC & Library of Congress Classification \\
    468466LCSH & Library of Congress Subject Headings \\
    469 LoC & Library of Congress\furl{http://loc.gov} \\
    470 OCLC & Online Computer Library Center\furl{http://www.oclc.org} -- world's biggest library federation \\
    471 PKND & \href{http://prometheus-bildarchiv.de/de/tools/pknd}{prometheus} KÃŒnstlerNamensansetzungsDatei\\
     467LoC & \href{http://loc.gov}{Library of Congress} \\
     468OCLC & \href{http://www.oclc.org}{Online Computer Library Center} -- world's biggest library federation \\
     469PKND & \href{http://prometheus-bildarchiv.de/de/tools/pknd}{Prometheus} KünstlerNamensansetzungsDatei\\
    472470RKD & Rijksbureau voor Kunsthistorische Documentatie -- Netherlands Institute for Art History \\
    473471TGN & Getty Thesaurus of Geographic Names \\
  • SMC4LRT/chapters/Definitions.tex

    r3776 r4117  
    2525RDF & \xne{Resource Description Framework} \cite{RDF2004} \\
    2626RR & Relation Registry, cf. \ref{def:rr}   \\
    27 TEI & \xne{Text Encoding Initiative}, cf. \ref{tei} \\
     27TEI & \xne{Text Encoding Initiative}, cf. \ref{def:tei} \\
    2828\end{tabular}
    2929\end{table}
     
    5858\end{table}
    5959
    60 \section{Formatting conventions}
     60\section{Formatting Conventions}
    6161
    6262Inline formatting for highlighting: \\
  • SMC4LRT/chapters/Design_SMCinstance.tex

    r3776 r4117  
    1 \chapter{Mapping on instance level,\\ CMD as LOD}
     1\chapter{Mapping on Instance Level,\\ CMD as LOD}
    22\label{ch:design-instance}
    33
     
    77
    88And if you can express these all in RDF, which we can for almost all of them (maybe
    9 except the actual language resource ... unless it has a schema adorned
     9except for the actual language resource ... unless it has a schema adorned
    1010with ISOcat DC references ... \textless insert a SCHEMAcat plug ;-) \textgreater, but for
    1111metadata we have that in the CMDI profiles ...) you could load all the
     
    1818
    1919
    20 As described in previous chapters (\ref{ch:infra},\ref{ch:design}), semantic interoperability is one of the main motivations for the CMD infrastructure. However, the established machinery pertains mostly to the schema level, the actual values in the fields of CMD instances remain ``just strings''. This is the case even though the problem of different labels for semantically equivalent or even identical entities is even more so virulent on the instance level. While for a number of metadata fields the value domain can be enforced through schema validation, some important fields (like \concept{organization} or \concept{resource type})  have a constrained value domain that yet cannot be explicitly exhaustively enumerated. This leads to a chronically inconsistent use of labels for referring to entities (as the instance data shows, some organizations are referred to by more than 20 different labels, or spelling variants.) prompting an urgent need for better means for harmonizing the constrained-field values.
    21 
    22 One potential remedy is the use of reference datasets -- controlled vocabularies, taxonomies, ontologies and such. In fact, this is a very common approach, be it the authority files in libraries world, or domain-specific reference vocabularies maintained by practically every research community. Not as strict as schema definitions, they cannot be used for validation, but still help to harmonize the data, by offering preferred labels and identifiers for entities.
    23 
    24 In this chapter, we explore how this general approach can be employed for our specific problem of harmonizing the (literal) values in selected instance fields and mapping them to entities defined in corresponding vocabularies. This proposal is furthermore embedded in a more general effort to \textbf{express the whole of the CMD data domain (model and instances) in RDF} constituting one large ontology interlinked with existing external semantic resources (ontologies, knowledge bases, vocabularies). This result lays a foundation for providing the original dataset as a \emph{Linked Open Data} nucleus within the \emph{Web of Data}\cite{TimBL2006}
     20As described in previous chapters (\ref{ch:infra}, \ref{ch:design}), semantic interoperability is one of the main motivations for the CMD infrastructure. However, the established machinery pertains mostly to the schema level; the actual values in the fields of CMD instances remain ``just strings''. This is the case even though the problem of different labels for semantically equivalent or even identical entities is even more virulent on the instance level. While for a number of metadata fields the value domain can be enforced through schema validation, some important fields (like \concept{organization} or \concept{resource type}) have a constrained value domain that cannot yet be explicitly and exhaustively enumerated. This leads to a chronically inconsistent use of labels for referring to entities (as the instance data shows, some organizations are referred to by more than 20 different labels or spelling variants), prompting an urgent need for better means of harmonizing the constrained-field values.
     21
     22One potential remedy is the use of reference datasets -- controlled vocabularies, taxonomies, ontologies and suchlike. In fact, this is a very common approach, be it the authority files in the library world, or domain-specific reference vocabularies maintained by practically every research community. Not being as strict as schema definitions, they cannot be used for validation, but they still help to harmonize the data by offering preferred labels and identifiers for entities.
     23
     24In this chapter, we explore how this general approach can be employed for our specific problem of harmonizing the (literal) values in selected instance fields and mapping them to entities defined in corresponding vocabularies. This proposal is furthermore embedded in a more general effort to \textbf{express the whole of the CMD data domain (model and instances) in RDF} constituting one large ontology interlinked with existing external semantic resources (ontologies, knowledge bases, vocabularies). This result lays a foundation for providing the original dataset as a \emph{Linked Open Data} nucleus within the \emph{Web of Data} \cite{TimBL2006}
    2525as well as for real semantic (ontology-driven) search and exploration of the data.
    2626
    2727The following section \ref{sec:cmd2rdf} lays out how individual parts of the CMD framework can be expressed in RDF.
    28 In \ref{sec:values2entities} we investigate in further detail the abovementioned critical aspect of the effort, namely the task of translating the string values in metadata fields to corresponding semantic entities. Finally, the technical aspects of providing the resulting ontology as LOD and the implications for an ontology-driven semantic search are tackled briefly in \ref{sec:lod} and \ref{semantic-search} respectively.
     28In \ref{sec:values2entities} we investigate in further detail the abovementioned critical aspect of the effort, namely the task of translating the string values in metadata fields to corresponding semantic entities. Finally, the technical aspects of providing the resulting ontology as LOD and the implications for an ontology-driven semantic search are tackled briefly in \ref{sec:lod}.
    2929
    3030\section{CMD to RDF}
     
    3939\end{itemize}
    4040
    41 \subsection{CMD specification}
     41\subsection{CMD Specification}
    4242
    4343The main entity of the meta model is the CMD component, typed as a specialization of \code{rdfs:Class}. A CMD profile is basically a CMD component with some extra features, implying a specialization relation. It would be natural to translate a CMD element to an RDF property, but it needs to be a class, as a CMD element -- next to its value -- can also have attributes. This further implies a property \code{ElementValue} to express the actual value of a given CMD element.
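The typing scheme just described can be sketched as a handful of RDF triples, represented here as plain tuples; the prefix and term names (\code{cmd:Component} etc.) are illustrative stand-ins mirroring the prose, not the actual vocabulary.

```python
# The CMD meta model as (subject, predicate, object) triples; term names
# are hypothetical stand-ins for the entities described in the text.
META_MODEL = [
    ("cmd:Component",    "rdfs:subClassOf", "rdfs:Class"),    # component: specialized class
    ("cmd:Profile",      "rdfs:subClassOf", "cmd:Component"), # profile: component with extras
    ("cmd:Element",      "rdfs:subClassOf", "rdfs:Class"),    # element is a class, not a property,
                                                              # because it can carry attributes
    ("cmd:ElementValue", "rdf:type",        "rdf:Property"),  # carries the element's actual value
]

def describe(subject, triples):
    """Return all (predicate, object) pairs asserted for a subject."""
    return [(p, o) for s, p, o in triples if s == subject]
```

A query like \code{describe("cmd:Profile", META\_MODEL)} then surfaces the specialization relation between profiles and components.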
     
    5454
    5555\noindent
    56 This entities are used for typing the actual profiles, components and elements (as they are defined in the Component Registry):
     56These entities are used for typing the actual profiles, components and elements (as they are defined in the Component Registry):
    5757
    5858\label{table:rdf-cmd}
     
    8080\end{example3}
    8181
     82\noindent
    8283That implies that the \code{@ConceptLink} attribute on CMD elements and components as used in the CMD profiles to reference the data category would be modelled as:
    8384
     
    8687\end{example3}
    8788
     89\noindent
    8890Encoding data categories as annotation properties is in contrast to the common approach seen with dublincore terms,
    8991which are usually used directly as data properties:
     
    9496
    9597\noindent
    96 However, we argue against direct mapping of complex data categories to data properties and in favour of modelling data categories as annotation properties, so as to avoid too strong semantic implications.\cite{Windhouwer2012_LDL}
     98However, we argue against direct mapping of complex data categories to data properties and in favour of modelling data categories as annotation properties, so as to avoid too strong semantic implications. \cite{Windhouwer2012_LDL}
    9799In a specific (OWL 2) application the relation with the data categories can be expressed as \code{owl:equivalentClass} for classes, \code{owl:equivalentProperty} for properties or \code{owl:sameAs} for individuals:
    98100
     
    104106
    105107
    106 \subsection{RELcat - Ontological relations}
    107 As described in \ref{def:rr} relations between data categories are not stored directly in the \xne{ISOcat} DCR, but rather in a dedicated module the Relation Registry \xne{RELcat}. The relations here are grouped into relation sets and stored as RDF triples\cite{SchuurmanWindhouwer2011}. A sample relation from the \xne{CMDI} relation set expressing a number of equivalences between \xne{ISOcat} data categories and \xne{dublincore} terms:
     108\subsection{RELcat - Ontological Relations}
     109As described in \ref{def:rr}, relations between data categories are not stored directly in the \xne{ISOcat} DCR, but rather in a dedicated module, the Relation Registry \xne{RELcat}. The relations here are grouped into relation sets and stored as RDF triples \cite{SchuurmanWindhouwer2011}. A sample relation from the \xne{CMDI} relation set expressing a number of equivalences between \xne{ISOcat} data categories and \xne{dublincore} terms:
    108110
    109111\begin{example3}
     
    112114
    113115\noindent
    114 By design, the relations in Relation Registry are not expressed with predicates from known vocabularies like \xne{SKOS} or \xne{OWL}, again with the aim to avoid too strong semantic implications. This leaves leeway for further specialization of the relations in specific applications. The \code{rel:*} properties can be undrestood as an upper layer of a taxonony of relation types, implying a subtyping:
     116By design, the relations in the Relation Registry are not expressed with predicates from known vocabularies like \xne{SKOS} or \xne{OWL}, again with the aim to avoid too strong semantic implications. This leaves leeway for further specialization of the relations in specific applications. The \code{rel:*} properties can be understood as an upper layer of a taxonomy of relation types, implying a subtyping:
    115117
    116118\begin{example3}
     
    120122
    121123
    122 \subsection{CMD instances}
     124\subsection{CMD Instances}
    123125In the next step, we want to express the individual CMD instances, the metadata records, making use of the previously defined entities on the schema level, but also entities from external ontologies.
    124126
    125127\subsubsection {Resource Identifier}
    126128
    127 It seems natural to use the PID of a Language Resource ( \code{<lr1>} ) as the resource identifier for the subject in the RDF representation. While this seems semantically sound, not every resource has to have a PID. (This is especially the case for ``virtual'' resources like collections, that are solely defined by their constituents and don't have any data on their own.) As a fall-back the PID of the MD record ( \code{<lr1.cmd>}  from \code{cmd:MdSelfLink} element) could be used as the resource identifier.
     129It seems natural to use the PID of a Language Resource (\code{<lr1>}) as the resource identifier for the subject in the RDF representation. While this seems semantically sound, not every resource has to have a PID. (This is especially the case for ``virtual'' resources like collections that are solely defined by their constituents and don't have any data of their own.) As a fall-back, the PID of the MD record (\code{<lr1.cmd>}, from the \code{cmd:MdSelfLink} element) could be used as the resource identifier.
    128130If identifiers are present for both resource and metadata, the relationship between the resource and the metadata record can be expressed as an annotation using the \xne{OpenAnnotation} vocabulary\furl{http://openannotation.org/spec/core/core.html\#Motivations}.
    129 (Note also, that one MD record can describe multiple resources, this can be also easily accomodated in OpenAnnotation):
     131(Note also that one MD record can describe multiple resources; this, too, can easily be accommodated in OpenAnnotation):
    130132
    131133\begin{example3}
     
    202204
    203205%%%%%%%%%%%%%%%%%
    204 \section{Mapping field values to semantic entities}
     206\section{Mapping Field Values to Semantic Entities}
    205207\label{sec:values2entities}
    206208
     
    232234\end{example3}
    233235
    234 However for the needs of the mapping task we propose to reduce and rewrite to retrieve distinct concept , value pairs (cf. figure \ref{fig:smc_cmd2lod}):
     236However, for the needs of the mapping task, we propose to reduce and rewrite it to retrieve distinct (concept, value) pairs (cf. figure \ref{fig:smc_cmd2lod}):
    235237
    236238\begin{example3}
     
    239241\end{example3}
    240242
    241 \var{lookup} function is a customized version of the \var(map) function, that operates on this information pairs (concept, label).
     243The \var{lookup} function is a customized version of the \var{map} function that operates on these information pairs (concept, label).
    242244
    243245The two steps \var{lookup} and \var{assess} correspond exactly to the two steps in \cite{jimenez2012large} in their system \xne{LogMap2}: 1) computation of mapping candidates (maximize recall) and 2) assessment of the candidates (maximize precision).
     
    252254\subsubsection{Identify vocabularies}
    253255
    254 One generic way to indicate vocabularies for given metadata fields or data categories being discussed in the CMD community is to use dedicated annotation property in the schema or data category definition (tentatively labeled \code{@clavas:vocabulary}) . For such a mechanism to work, the consuming applications (like metadata editor) need to be made aware of this convention and interpret it accordingly.
    255 
    256 The primary provider of relevant vocabularies is \xne{ISOcat} and \xne{CLAVAS} – a service for managing and providing vocabularies in SKOS format (cf. \ref{def:CLAVAS}). Closed and corresponding simple data categories are already being exported from ISOcat in SKOS format and imported into CLAVAS/OpenSKOS and also other relevant vocabularies shall be ingested into this system, so that we can assume OpenSKOS as a first source of vocabularies. However definitely not all of the existing reference data will be hosted by OpenSKOS, so in general we have to assume/consider a number of different sources (cf. \ref{refdata}).
     256One generic way to indicate vocabularies for given metadata fields or data categories, being discussed in the CMD community, is to use a dedicated annotation property in the schema or data category definition (tentatively labelled \code{@clavas:vocabulary}). For such a mechanism to work, the consuming applications (like a metadata editor) need to be made aware of this convention and interpret it accordingly.
     257
     258The primary providers of relevant vocabularies are \xne{ISOcat} and \xne{CLAVAS}, a service for managing and providing vocabularies in SKOS format (cf. \ref{def:CLAVAS}). Closed and corresponding simple data categories are already being exported from ISOcat in SKOS format and imported into CLAVAS/OpenSKOS, and other relevant vocabularies shall also be ingested into this system, so that we can assume OpenSKOS as a first source of vocabularies. However, definitely not all of the existing reference data will be hosted by OpenSKOS, so in general we have to assume/consider a number of different sources (cf. \ref{refdata}).
    257259
    258260Data in OpenSKOS is modelled purely in SKOS, so there is no more specific typing of the entities in the vocabularies, but rather all the entities are \code{skos:Concepts}:
     
    280282In abstract terms, the lookup function takes as input the identifier of a data category (or CMD element) and a literal string value, and returns a list of potentially matching entities. Before the actual lookup, some string-normalizing preprocessing may be necessary.
    281283
    282 \begin{definition}{signature of the lookup function}
     284\begin{definition}{Signature of the lookup function}
    283285lookup \ ( \ DataCategory \ ,  \ Literal \ )  \quad \mapsto \quad ( \ Concept \ | \ Entity \ )*
    284286\end{definition}
    285287
    286 In the implementation there needs to be additional initial configuration input, identifying datasets for given data categories,
     288In the implementation, there needs to be additional initial configuration input, identifying datasets for given data categories,
    287289which will be the result of the previous step.
    288290
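A minimal sketch of this lookup step, assuming a simple in-memory configuration that maps data categories to vocabularies; all names and sample entries below are hypothetical:

```python
import unicodedata

# Hypothetical configuration: data category -> vocabulary (label -> entity URI).
VOCABULARIES = {
    "organization": {
        "austrian academy of sciences": "http://example.org/entity/aas-at",
        "academy of sciences": "http://example.org/entity/aas-generic",
    },
}

def normalize(literal: str) -> str:
    """String-normalizing preprocessing applied before the actual lookup."""
    s = unicodedata.normalize("NFKC", literal)
    return " ".join(s.lower().split())

def lookup(data_category: str, literal: str) -> list:
    """lookup(DataCategory, Literal) -> (Concept | Entity)* -- candidate generation."""
    vocab = VOCABULARIES.get(data_category, {})
    needle = normalize(literal)
    # keep recall high: any label containing (or contained in) the value matches
    return [uri for label, uri in vocab.items()
            if needle in label or label in needle]
```

A call like \code{lookup("organization", "Academy of Sciences")} returns both candidate URIs, reflecting that this step maximizes recall and defers disambiguation to the assessment step.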
     
    303305The lookup is the most sensitive step in the process, as that is the gate between strings and semantic entities. In general, the resulting candidates cannot be seen as reliable and should undergo further scrutiny to ensure that the match is semantically correct.
    304306
    305 One example: A lookup with the pair \code{<organization, "Academy of sciences">} would probably return a list of organizations, as there is a national Academy of Sciences, in a number of countries. It would require further heuristics, e.g. checking the corresponding department, contact or -- less reliably -- the language of the described resource, to determine which specific Academy of Sciences is meant in given resource description.
    306 
    307 In some situation this ambiguities can be resolved algorithmically, but in the end in many cases it will require human curation of the generated data. In this respect, it is worth to note, that the CLARIN search engine VLO provides a feedback link, that allows even the normal user to report on problems or inconsistencies in CMD records.
     307One example: A lookup with the pair \code{<organization, "Academy of sciences">} would probably return a list of organizations, as there is a national Academy of Sciences in a number of countries. It would require further heuristics, e.g. checking the corresponding department, contact or -- less reliably -- the language of the described resource, to determine which specific Academy of Sciences is meant in a given resource description.
     308
     309In some situations these ambiguities can be resolved algorithmically, but in many cases human curation of the generated data will be required in the end. In this respect, it is worth noting that the CLARIN search engine VLO provides a feedback link that allows even the normal user to report problems or inconsistencies in CMD records.
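One algorithmic heuristic of the kind mentioned above could filter ambiguous candidates by the country found in the record's contact or department information. This is only a sketch under invented data structures, not the actual curation workflow:

```python
# Invented candidate records; in practice these would come from the lookup step.
candidates = [
    {"uri": "http://example.org/AcademyOfSciences-AT", "country": "AT"},
    {"uri": "http://example.org/AcademyOfSciences-CZ", "country": "CZ"},
]

def disambiguate(candidates: list[dict], record_country: str) -> list[dict]:
    """Keep candidates whose country matches the record; if none match,
    fall back to the full (still ambiguous) candidate list for human curation."""
    filtered = [c for c in candidates if c["country"] == record_country]
    return filtered or candidates
```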
    308310
    309311
     
    317319
    318320The technical base for a semantic web application is usually an RDF triple-store, as discussed in \ref{semweb-tech}.
    319 Given that our main concern is the data itself, their processing and display, we want to rely on stable, robust feature rich solution minimizing the effort to provide the data online. The most promising solution seems to be \xne{Virtuoso}, a integrated feature-rich hybrid data store, able to deal with different types of data (``Universal Data Store'').
    320 
    321 
    322 Although the distributed nature of the data is one of the defining features of LOD and theoretically one should be able to follow the data by dereferencable URIs, in practice it is mostly necessary to pool into one data store linked datasets from different sources that shall be queried together due to performance reasons. This implies that the data to be kept by the data store will be decisively larger, than ``just'' the original dataset.
     321Given that our main concern is the data themselves, their processing and display, we want to rely on a stable, robust, feature-rich solution minimizing the effort to provide the data online. The most promising solution seems to be \xne{Virtuoso}, an integrated feature-rich hybrid data store, able to deal with different types of data (``Universal Data Store'').
     322
     323
     324Although the distributed nature of the data is one of the defining features of LOD and theoretically one should be able to follow the data by dereferencable URIs, in practice it is mostly necessary to pool into one data store linked datasets from different sources that shall be queried together due to performance reasons. This implies that the data to be kept by the data store will be decisively larger than ``just'' the original dataset.
    323325
    324326\section{Summary}
  • SMC4LRT/chapters/Design_SMCschema.tex

    r3776 r4117  
    11
    2 \chapter{System design -- concept-based mapping on schema level}
     2\chapter{System Design -- Concept-based Mapping on Schema Level}
    33\label{ch:design}
    44
     
    66
    77We start by drawing an overall view of the system, introducing its individual components and the dependencies among them.
    8 In the next section, the internal data model is presented and explained. In section \ref{sec:cx} the design of the actual main service for serving crosswalks is described, divided into the interface specification and notes on the actual implementation. In section \ref{sec:qx} we elaborate on a search functionality that builds upon the aforementioned service in terms of appropriate query language, a search engine to integrate the search in and the peculiarities of the user interface that could support this enhanced search possibilities. Finally, in section \ref{smc-browser} an advanced interactive user interface for exploring the CMD data domain is proposed.
     8In the next section, the internal data model is presented and explained. In section \ref{sec:cx}, the design of the actual main service for serving crosswalks is described, divided into the interface specification and notes on the actual implementation. In section \ref{sec:qx}, we elaborate on a search functionality that builds upon the aforementioned service, in terms of an appropriate query language, a search engine to integrate the search into, and the peculiarities of a user interface that could support these enhanced search possibilities. Finally, in section \ref{smc-browser}, an advanced interactive user interface for exploring the CMD data domain is proposed.
    99
    1010\section{System Architecture}
     
    1414\begin{figure*}
    1515\includegraphics[width=0.8\textwidth]{images/SMC_modules.png}
    16 \caption{The component view on the SMC - modules and their inter-dependencies}
     16\caption{The component view on the SMC - modules and their interdependencies}
    1717\label{fig:smc_modules}
    1818\end{figure*}
     
    3131The component diagram in \ref{fig:smc_modules} depicts the dependencies between the components of the system. The \xne{crosswalk service} uses the set of XSL-stylesheets \xne{smc-xsl} and accesses the CMDI registries: \xne{Component Registry}, \xne{ISOcat DCR} and \xne{RELcat} to retrieve the data. It exposes an interface \xne{cx} to be used by third party applications. The \xne{query expansion} module uses the crosswalk service to rewrite queries, also exposing a corresponding API \xne{qx}.
    3232
    33 \xne{SMC Browser} consists of two parts the \xne{smc-stats} and \xne{smc-graph} and also uses the set of stylesheets for processing the data. \xne{smc-graph} is build on top of a library for interactive visualization of graphs.
     33\xne{SMC Browser} consists of two parts, the \xne{smc-stats} and \xne{smc-graph}, and also uses the set of stylesheets for processing the data. \xne{smc-graph} is built on top of a library for interactive visualization of graphs.
    3434
    3535For broader context see the reference architecture diagram in Figure \ref{fig:ref_arch}.
    3636
    37 \section{Data model}
     37\section{Data Model}
    3838
    3939Before we get to the definition of the actual service, we define the internal data model, divided into two parts:
     
    4747In this section, we describe \var{smcIndex} -- the data type to denote indexes used by the components of the system internally, as well as input and output on the interfaces.
    4848
    49 An \var{smcIndex} is a human-readable string adhering to a specific syntax, denoting a search index. The syntax is based on two main ideas drawn from existing work: a) denoting a context by a prefix is derived from the way indices are referenced in CQL-syntax\footnote{Context Query Language, \url{http://www.loc.gov/standards/sru/specs/cql.html}} (analogous to the XML-namespace mechanism, cf. \ref{cql}), e.g. \concept{dc.title} and b) on the dot-notation used in IMDI-browser\footnote{\url{http://www.lat-mpi.eu/tools/imdi}} to denote paths into structured data (analogous to XPath), e.g. \concept{Session.Location.Country}. The grammar generates only single terms, that may not contain whitespaces.
     49An \var{smcIndex} is a human-readable string adhering to a specific syntax, denoting a search index. The syntax is based on two main ideas drawn from existing work: a) denoting a context by a prefix, derived from the way indices are referenced in CQL-syntax\footnote{Context Query Language, \url{http://www.loc.gov/standards/sru/specs/cql.html}} (analogous to the XML-namespace mechanism, cf. \ref{cql}), e.g. \concept{dc.title}, and b) the dot-notation used in the IMDI-browser\footnote{\url{http://www.lat-mpi.eu/tools/imdi}} to denote paths into structured data (analogous to XPath), e.g. \concept{Session.Location.Country}. The grammar generates only single terms that may not contain whitespace.
    5050
    5151\begin{defcap}
     
    7373It is important to note that in general an \var{smcIndex} can be ambiguous, meaning it can refer to multiple concepts or CMD entities. This is due to the fact that the labels of the data categories and CMD entities are not guaranteed unique.
    7474Although it may seem problematic and undesirable to have an ambiguous reference, this is an intentional design decision. The labels are needed for human-readability and ambiguity can be useful, as long as one is aware of it.
    75 However there needs to be also the possibility to refer to data categories or CMD entities unambiguously. Therefore, the syntax also allows to reference indexes by the corresponding identifier. Following are some explanations to the individual constituents of the grammar:
    76 
    77 \var{dcrID} is a shortcut referring to a data category registry. Next to \xne{ISOcat}, other registries can function as a DCR, in particular, the \xne{dublincore} set of metadata terms. \var{datcatLabel} is the human-readable name of given data category (e.g. \concept{telephoneNumber}). In the case of \xne{ISOcat} data categories the verbose descriptor \code{mnemonicIdentifier} is used. However despite its name, it is not guaranteed unique. Therefore, \var{datcatID} has to be used if a data category shall be referenced unambiguously. For \xne{dublincore} terms no such distinct identifier and label exist, the concepts are denoted by the lexical term itself, which is unique within the \concept{dublincore} namespace.
     75However, there also needs to be the possibility to refer to data categories or CMD entities unambiguously. Therefore, the syntax also allows referencing indexes by the corresponding identifier. Following are some explanations of the individual constituents of the grammar:
     76
     77\var{dcrID} is a shortcut referring to a data category registry. Next to \xne{ISOcat}, other registries can function as a DCR, in particular the \xne{dublincore} set of metadata terms. \var{datcatLabel} is the human-readable name of a given data category (e.g. \concept{telephoneNumber}). In the case of \xne{ISOcat} data categories, the verbose descriptor \code{mnemonicIdentifier} is used. However, despite its name, it is not guaranteed unique. Therefore, \var{datcatID} has to be used if a data category shall be referenced unambiguously. For \xne{dublincore} terms no such distinct identifier and label exist; the concepts are denoted by the lexical term itself, which is unique within the \concept{dublincore} namespace.
    7878
    7979\var{profile} is a reference to a CMD profile. Again, it can be either the name of the profile \var{profileName} or -- for guaranteed unambiguous reference -- its identifier \var{profileId} as issued by the Component Registry (e.g. \var{clarin.eu:cr1:p\_1272022528363} for \concept{LexicalResourceProfile}). Even if a profile is referenced by its identifier, it may and should be prefixed by its name to still ensure human-readability. Or, seen the other way round, the name is disambiguated by suffixing it with the identifier:
     
    8585
    8686%\noindent
    87 \var{dotPath} allows to address a leaf element (\concept{Session.Actor.Role}), or any intermediary XML element corresponding to a CMD component (\concept{Session.Actor}) within a metadata description. This allows to easily express search in whole components, instead of having to list all individual fields. The paths don't need to start from the root entity (the profile), they can reference any subtree structure. However longer paths are often needed for more specific references, e.g. instead of \concept{Name} one could say \concept{Actor.Name} vs. \concept{Project.Name} or even \concept{Session.Actor.Name} vs. \concept{Drama.Actor.Name}. Still this mechanism does not guarantee unique references, it only allows to specify context and thus narrow down the semantic ambiguity.
     87\var{dotPath} allows addressing a leaf element (\concept{Session.Actor.Role}), or any intermediary XML element corresponding to a CMD component (\concept{Session.Actor}), within a metadata description. This makes it easy to express search in whole components, instead of having to list all individual fields. The paths do not need to start from the root entity (the profile); they can reference any subtree structure. However, longer paths are often needed for more specific references, e.g. instead of \concept{Name} one could say \concept{Actor.Name} vs. \concept{Project.Name} or even \concept{Session.Actor.Name} vs. \concept{Drama.Actor.Name}. Still, this mechanism does not guarantee unique references; it only allows specifying a context and thus narrowing down the semantic ambiguity.
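The dot-path semantics described above can be sketched as a simple suffix match over path steps; the matching rule and the sample paths are illustrative assumptions, not the actual implementation:

```python
def matches(smc_index: str, element_path: str) -> bool:
    """A dotPath matches an element path when its steps form a contiguous
    trailing segment of the path -- i.e. paths need not start at the root."""
    index_steps = smc_index.split(".")
    path_steps = element_path.split(".")
    return path_steps[-len(index_steps):] == index_steps

# Invented element paths from hypothetical profiles.
paths = ["Session.Actor.Name", "Session.Project.Name", "Drama.Actor.Name"]
hits = [p for p in paths if matches("Actor.Name", p)]
```

The longer the path given, the fewer candidate elements match, illustrating how context narrows down (but does not eliminate) ambiguity.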
    8888
    8989\subsection{Terms}
     
    9595\subsubsection{Type \code{Term}}
    9696
    97 \code{Term} is a polymorph data type, that can have different sets of attributes depending on the type of data it represents.
     97\code{Term} is a polymorphic data type that can have different sets of attributes depending on the type of data it represents.
    9898
    9999\begin{table}[h]
    100 \caption{Attributes of \code{Term} when encoding data category}
     100\caption{Attributes of \code{Term} when encoding data category (enclosed in \code{Concept})}
    101101\label{table:terms-attributes-datcat}
    102102 \begin{tabu}{ p{0.1\textwidth} p{0.4\textwidth} >{\footnotesize}X }
     
    104104\rowfont{\itshape\small}   attribute & allowed values & sample value\\
    105105\hline
    106   \var{concept-id} &  PID given by DCR  & \code{isocat:DC-2522} \\
     106%  \var{concept-id} &  PID given by DCR  & \code{isocat:DC-2522} \\
    107107  \var{set} & identifier of the DCR \emph{dcrID}  & \code{isocat} \\
    108108  \var{type} &  one of ['id', 'label', 'mnemonic'] & \code{id}, \code{label}\\
     
    223223
    224224\subsubsection{Type \code{Relation}}
    225 As explained in \ref{def:rr}, the framework allows to express relations between concepts or data categories. These are maintained in the Relation Registry and fetched from there by SMC upon initialization. Type \code{Relation} is the internal representation of this information. It has attribute \var{type} indicating the type of the relation as delivered by RR (currently only \code{sameAs}).  The relations of one relation set are enclosed in \code{Termset} element carrying the identifier of the relation set. The content of \code{Relation} is a sequence of at least two \code{Concepts}. Currently, it is always exactly two \code{Concepts} corresponding to the pairs delivered from RR, but by traversing the equivalence relation concept clusters (or ``cliques'') could be generated, that contain more than two equivalent concepts.
     225As explained in \ref{def:rr}, the framework allows expressing relations between concepts or data categories. These are maintained in the Relation Registry and fetched from there by SMC upon initialization. Type \code{Relation} is the internal representation of this information. It has an attribute \var{type} indicating the type of the relation as delivered by RR (currently only \code{sameAs}). The relations of one relation set are enclosed in a \code{Termset} element carrying the identifier of the relation set. The content of \code{Relation} is a sequence of at least two \code{Concepts}. Currently, it is always exactly two \code{Concepts} corresponding to the pairs delivered from RR, but by traversing the equivalence relation, concept clusters (or ``cliques'') could be generated that contain more than two equivalent concepts.
    226226
    227227% role="about"
     
    261261
    262262%%%%%%%%%%%%%%%%%%%%%%
    263 \section{cx -- crosswalk service}
     263\section{cx -- Crosswalk Service}
    264264\label{sec:cx}
    265265
    266 The crosswalk service offers the functionality, that was understood under the term \textit{Semantic Mapping} as conceived in the original plans of the Component Metadata Infrastructure. Semantic interoperability has been one of the main concerns addressed by the CMDI and appropriate provisions were weaved into the underlying meta-model as well as all the modules of the infrastructure.
     266The crosswalk service offers the functionality that was understood under the term \textit{Semantic Mapping} as conceived in the original plans of the Component Metadata Infrastructure. Semantic interoperability has been one of the main concerns addressed by the CMDI and appropriate provisions were weaved into the underlying meta-model as well as all the modules of the infrastructure.
    267267Consequently, the infrastructure has also foreseen this dedicated module, \emph{Semantic Mapping}, that exploits this mechanism to find \textbf{corresponding fields in different metadata schemas}.
    268268
    269269The task of the crosswalk service is to collect the relevant information maintained in the registries of the infrastructure and process it to generate the mappings, or \textbf{crosswalks}, between fields in heterogeneous metadata schemas. These crosswalks can be used by other applications, representing the base for concept-based search in the heterogeneous data collection of the joint CLARIN metadata domain (cf. \ref{sec:qx}).
    270270
    271 The core means for semantic interoperability in CMDI are the \emph{data categories} (cf. \ref{def:DCR}), well-defined atomic concepts, that are supposed to be referenced in schemas by annotating fields to unambiguously indicate their intended semantics. Drawing upon this system, the crosswalks are not generated directly between the fields of individual schemas by some kind of matching algorithm (cf. \ref{lit:schema-matching}), but rather the data categories are used as reliable bridges for translation. This results in clusters of semantically equivalent metadata fields (with data categories serving as pivotal points) instead of a collection of pair-wise links between fields.
     271The core means for semantic interoperability in CMDI are the \emph{data categories} (cf. \ref{def:DCR}), well-defined atomic concepts that are supposed to be referenced in schemas by annotating fields to unambiguously indicate their intended semantics. Drawing upon this system, the crosswalks are not generated directly between the fields of individual schemas by some kind of matching algorithm (cf. \ref{lit:schema-matching}), but rather the data categories are used as reliable bridges for translation. This results in clusters of semantically equivalent metadata fields (with data categories serving as pivotal points) instead of a collection of pair-wise links between fields.
    272272
    273273\subsection{Interface Specification}
     
    455455The documentation of the XSLT stylesheets and the build process is found in appendix \ref{sec:smc-xsl-docs}.
    456456
    457 The service is implemented as a RESTful service, however only supporting the GET operation, as it operates on a data set, that the users cannot change directly. (The changes have to be performed in the upstream registries.)
     457The service is implemented as a RESTful service; however, it supports only the GET operation, as it operates on a data set that the users cannot change directly. (The changes have to be performed in the upstream registries.)
    458458
    459459
     
    479479\item[\xne{termsets}] a list of all available Termsets compiled from the CMD profiles and available DCRs; for \xne{ISOcat} a termset is generated for every available language
    480480\item[\xne{cmd-terms}] a flat list of \code{Term} elements representing all components and elements in all known profiles; grouped in \code{Termset} elements representing the profiles
    481 \item[\xne{cmd-terms-nested}] as above, however the \code{Term} elements are nested reflecting the component structure in the profile
     481\item[\xne{cmd-terms-nested}] as above, however, the \code{Term} elements are nested reflecting the component structure in the profile
    482482\item[\xne{dcr-terms}] a list of \code{Concept} elements representing the data categories with nested \code{Term} elements encoding its properties (\code{id, label}
    483483\item[\xne{dcr-cmd-map}] the main inverted index -- a list of concepts as in \xne{dcr-terms}, but with additional \code{Term} elements included in the \code{Concept} elements representing the CMD components or elements corresponding to given data category (cf. listing \ref{lst:dcr-cmd-map})
    484 \item[\xne{rr-terms}] Additional index generated based on the relations between data categories as defined in the Relation Registry; the \code{Concept} elements representing the pair of related data categories are wrapped with a \code{Relation} element (with a \code{@type} attribute
     484\item[\xne{rr-terms}] Additional index generated based on the relations between data categories as defined in the Relation Registry; the \code{Concept} elements representing the pair of related data categories are wrapped with a \code{Relation} element (with a \code{@type} attribute).
    485485\end{description}
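The derivation of the \xne{dcr-cmd-map} inverted index can be illustrated with a small sketch: element-to-data-category annotations are inverted into a concept-to-elements map. The element paths and data category identifiers below are invented for illustration:

```python
# Hypothetical annotations: CMD element path -> referenced data category.
annotations = {
    "TextCorpusProfile.Corpus.Language": "isocat:DC-2482",
    "LexicalResourceProfile.Lexicon.Language": "isocat:DC-2482",
    "SessionProfile.Actor.Role": "isocat:DC-2502",
}

# Invert into the dcr-cmd-map style index: concept -> list of CMD elements.
dcr_cmd_map: dict[str, list[str]] = {}
for element, datcat in annotations.items():
    dcr_cmd_map.setdefault(datcat, []).append(element)
```

The data categories thus serve as pivotal points clustering semantically equivalent fields, as described for the crosswalk service above.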
    486486
    487487\subsubsection{Operation}
    488 For the actual service operation a minimal application has been implemented, that accesses the cached internal datasets and optionally applies XSL stylesheets for post-processing depending on requested format.
     488For the actual service operation, a minimal application has been implemented that accesses the cached internal datasets and optionally applies XSL stylesheets for post-processing depending on the requested format.
    489490The application implements the interface as defined in \ref{def:cx-interface} as an XQuery module based on the \xne{restxq} library within an \xne{eXist} XML database.
    490490
     
    495495Also, the use of relations \emph{other than equivalence} will necessitate more complex logic in the query expansion and accordingly also a more complex response of the crosswalk service, either returning the relation types themselves as well or equipping the list of indexes with some kind of similarity ratio.
    496496
    497 \section{qx -- concept-based search}
     497\section{qx -- Concept-based Search}
    498498\label{sec:qx}
    499499To recall, the main goal of this work is to enhance the search capabilities of the search engines serving the metadata.
    500 In this section we want to explore how this shall be accomplished, i.e. how to bring the enhanced capabilities to the user.
     500In this section, we want to explore how this shall be accomplished, i.e. how to bring the enhanced capabilities to the user.
    501501
    502502The emphasis lies on the query language and the corresponding query input interface. A crucial aspect is the question of how to integrate the additional processing, i.e. how to deal with the even greater amount of information in a user-friendly way without overwhelming the user, while still being verbose about the applied processing on demand, so that the user can understand how the result came about and, even more importantly, manipulate the processing easily.
    503503
    504 Note, that this chapter deals only with the schema level, i.e. the expansion here pertains only to the indexes to be searched in, not to the search terms. The instance level is dealt with in \ref{semantic-search}.
     504Note that this chapter deals only with the schema level, i.e. the expansion here pertains only to the indexes to be searched in, not to the search terms. The instance level is tackled in \ref{sec:values2entities} (and even there only superficially).
    505505
    506506Note also that \emph{query expansion} needs to be distinguished from \emph{query translation}, the task of expressing an input query in another query language (e.g. a CQL query expressed as XPath).
     
    509509\label{cql}
    510510As the base query language to build upon, the \emph{Context Query Language} (CQL) is used, a well-established standard designed with extensibility in mind.
    511 CQL is the query language defined as part of \xne{SRU/CQL} -- the communication protocol introduced by the Library of Congress. SRU is a simplified, XML- and HTTP-based successor to Z39.50\cite{Lynch1991}, which is very widely spread in the library networks.
    512 It was introduced 2002 \cite{Morgan04}. The maintenance of SRU/CQL has been
    513 transfered from LoC to OASIS in 2012, and OASIS released a first version of the protocol as Committee Specification in April 2012\cite{OASIS2012sru}.)
     511CQL is the query language defined as part of \xne{SRU/CQL} -- the communication protocol introduced by the Library of Congress. SRU is a simplified, XML- and HTTP-based successor to Z39.50 \cite{Lynch1991}, which is very widely spread in the library networks.
     512It was introduced in 2002 \cite{Morgan04}. The maintenance of SRU/CQL has been
     513transferred from LoC to OASIS in 2012, and OASIS released a first version of the protocol as Committee Specification in April 2012 \cite{OASIS2012sru}.)
    514514
    515515Coming from the libraries world, the protocol has a certain bias in favor of bibliographic metadata.
     
    525525The query language part (CQL - Context Query Language) defines a relatively complex and complete query language.
    526526The decisive feature of the query language is its inherent extensibility, allowing the definition of own indexes and operators.
    527 In particular, CQL introduces so-called \emph{context sets} -- a kind of application profiles that allow to define new indexes or even comparison operators in own namespaces. This feature can be employed to integrate the dynamic indexes adhering to the \var{smcIndex} syntax as proposed in \ref{def:smcIndex}.
     527In particular, CQL introduces the so-called \emph{context sets} -- a kind of application profile that allows defining new indexes or even comparison operators in own namespaces. This feature can be employed to integrate the dynamic indexes adhering to the \var{smcIndex} syntax as proposed in \ref{def:smcIndex}.
    528528
    529529The SRU/CQL protocol has also been adopted by the CLARIN community as the base for a protocol for federated content search\furl{http://clarin.eu/fcs} (FCS) \cite{stehouwer2012fcs}, which is another argument to use this protocol for metadata search as well, given the inherent interrelation between metadata and content search.
     
    541541
    542542%\begin{note}
    543 Alternatively to the -- potentially costly -- on the fly expansion, the concept-based equivalence clusters could be applied already during the indexing of the data. That means that ``virtual'' search indexes are defined for individual data categories in which values from all metadata fields annotated with given data category are indexed. Indeed, this approach is already being applied in the search applications VLO and Meertens Institute Search Engine (cf. \ref{cmdi_exploitation}).
     543As an alternative to the -- potentially costly -- on-the-fly expansion, the concept-based equivalence clusters could be applied already during the indexing of the data. That means that ``virtual'' search indexes are defined for individual data categories, in which values from all metadata fields annotated with a given data category are indexed. Indeed, this approach is already being applied in the search applications VLO and Meertens Institute Search Engine (cf. \ref{cmdi_exploitation}).
    544544%\end{note}
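The concept-based expansion of a search index can be sketched as follows: an \var{smcIndex} referring to a data category is rewritten into a disjunction over all metadata fields annotated with that concept. The crosswalk mapping and the rewritten query syntax are illustrative assumptions, not the actual FCS implementation:

```python
# Invented crosswalk: data category -> annotated metadata fields.
crosswalk = {
    "isocat:DC-2482": ["Corpus.Language", "Lexicon.Language"],
}

def expand(index: str, term: str) -> str:
    """Rewrite one searchClause: expand a concept index into an OR over the
    corresponding fields; unknown indexes are passed through unchanged."""
    fields = crosswalk.get(index, [index])
    return " or ".join(f'{f} = "{term}"' for f in fields)
```

This is the schema-level half of the processing; the pre-indexing alternative mentioned above would bake the same clusters into ``virtual'' indexes instead.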
    545545
    546 \subsection{SMC as module for Metadata Repository}
     546\subsection{SMC as Module for Metadata Repository}
    547547
    548548As a concrete proof of concept, the functionality of SMC has been integrated into the Metadata Repository, another module of the CMDI providing all the metadata records harvested within the CLARIN joint metadata domain (cf. \ref{cmdi_exploitation}).
    549549
    550 Metadata repository itself is implemented as custom project within \xne{cr-xq}, a generic web application developed in XQuery running within the eXist XML-database. \xne{cr-xq} is developed by the author as part of a larger publication framework \xne{corpus\_shell}. As can be seen in figure \ref{fig:modules-mdrepo} within \xne{cr-xq} the crosswalk service -- implemented as the \xne{smc-xq}  module -- is used by the search module \xne{fcs}, which is in turn used by the \xne{query\_input} module, that provides a user interface widget for formulating the query.
     550Metadata Repository itself is implemented as a custom project within \xne{cr-xq}, a generic web application developed in XQuery running within the eXist XML database. \xne{cr-xq} is developed by the author as part of a larger publication framework, \xne{corpus\_shell}. As can be seen in figure \ref{fig:modules-mdrepo}, within \xne{cr-xq} the crosswalk service -- implemented as the \xne{smc-xq} module -- is used by the search module \xne{fcs}, which is in turn used by the \xne{query\_input} module that provides a user interface widget for formulating the query.
    551551
    552552\begin{figure*}
    553553\begin{center}
    554554\includegraphics[width=0.8\textwidth]{images/modules_mdrepo-smc.png}
    555 \caption{The component view on the SMC - modules and their inter-dependencies}
     555\caption{The component diagram of the integration of SMC as module within the Metadata Repository}
    556556\label{fig:modules-mdrepo}
    557557\end{center}
     
    561561\subsection{User Interface}
    562562
    563 A starting point for our considerations is the traditional structure found in many (``advanced'') search interfaces, which is basically a an array of tuples of index, comparison operator, terms combined by a boolean operator. This is reflected in the CQL syntax with the basic \var{searchClause} and the boolean operators to formulate more complex queries.
     563A starting point for our considerations is the traditional structure found in many (``advanced'') search interfaces, which is basically an array of tuples of index, comparison operator and term, combined by a boolean operator. This is reflected in the CQL syntax with the basic \var{searchClause} and the boolean operators to formulate more complex queries.
    564564\begin{definition}{Generic data format for structured queries}
    565565 < index, operation, term, boolean >+
     
    581581
    582582\noindent
    583 Using data categories from ISOcat as search indexes brings about -- next to solid semantic grounding -- the advantage of multilingual labels and descriptions/definitions.
    584 Although we concentrate on query input, the use of indexes has to be consistent across the user interface, be it in labeling the fields of the results, or when providing facets to drill down the search.
    585 
    586 A fundamentally different approach is the "content first" paradigm, that, similiar to the notorious simple search fields found in general search engines, provides suggestions via autocompletion on the fly, when the user starts typing any string. The difference is, that the suggestions are typed, so that the user is informed from which index given term comes (\concept{person}, \concept{place}, etc.)
    587 
    588 Combining the two approaches, we could arrive at a ``smart'' widget a input field with on the fly query parsing and contextual autocomplete. Though even such a widget would still share the underlying data model of \xne{CQL} in combination with \var{smcIndexes}.
     583Using data categories from ISOcat as search indexes brings about -- next to solid semantic grounding -- the advantage of multilingual labels and descriptions/definitions. Although we concentrate on query input, the use of indexes has to be consistent across the user interface, be it in labelling the fields of the results, or when providing facets to drill down the search.
     584
     585A fundamentally different approach is the ``content first'' paradigm that, similar to the notorious simple search fields found in general search engines, provides suggestions via autocompletion on the fly when the user starts typing any string. The difference is that the suggestions are typed, so that the user is informed from which index a given term comes (\concept{person}, \concept{place}, etc.).
     586
     587Combining the two approaches, we could arrive at a ``smart'' widget consisting of one input field with on-the-fly query parsing and contextual autocomplete. However, even such a widget would still share the underlying data model of \xne{CQL} in combination with \var{smcIndexes}.
    589588
    590589
     
    595594As the CMD dataset keeps growing both in numbers and in complexity, the call from the community to provide enhanced ways for its exploration gets stronger.  In the following, some design considerations for an application to answer this need are proposed.
    596595
    597 While the Component Registry (cf. \ref{def:CR}) allows to browse, search and view existing profiles and components, it is not possible to easily find out, which components are reused in which profiles and also which data categories are referenced by which elements. However this kind of information is crucial during profile creation as well as for curation of the existing profiles, as it enables the data modeller to recognize a) which components and data categories are those most often used, indicating their adoption and popularity within the community and b) the thematic contexts in which individual components are used, providing a hint about their appropriateness for given research data.
      596While the Component Registry (cf. \ref{def:CR}) allows users to browse, search and view existing profiles and components, it is not possible to easily find out which components are reused in which profiles, or which data categories are referenced by which elements. However, this kind of information is crucial during profile creation as well as for the curation of existing profiles, as it enables the data modeller to recognize a) which components and data categories are used most often, indicating their adoption and popularity within the community, and b) the thematic contexts in which individual components are used, providing a hint about their appropriateness for given research data.
    598597
    599598\subsection{Design}
     
    615614
    616615\subsubsection{Requirements}
    617 Given the size of the data set (currently more than 4.000 nodes and growing) it is obvious, that it is not possible to overview the whole of the graph in one view. Thus, a general essential requirement is to be able to select and view subgraphs by various means.
      616Given the size of the data set (currently more than 4,000 nodes and growing), it is obvious that the whole graph cannot be surveyed in one view. Thus, a general essential requirement is to be able to select and view subgraphs by various means.
    618617
    619618In a basic scenario, a user looks for possibly reusable profiles or components, based on some common terms associated with the type of data to be described (e.g. \code{"corpus"}). If the search yields matching profiles or components, the user should be able to view the whole structure of the profiles, explore the definitions of individual components and see which data categories are being referenced for semantic grounding. Furthermore, it has to be possible to view multiple profiles concurrently, in particular to be able to see the components or data categories they share and, vice versa, in which profiles a given data category is referenced.
     
    658657\end{quotation}
    659658
    660 Especially remarkable feature is the possibility to add custom constraints, that are accomodated with the constraints imposed by the base algorithm. This enables flexible customization of the layout, still harnessing the power of the underlying layout algorithm. At the same time this is a quite challenging feature to master, as with different constraint affecting the layout algorithm, it is at times difficult to understand the impact of a specific constraint on the layout.
     659An especially remarkable feature is the possibility to add custom constraints that are reconciled with the constraints imposed by the base algorithm. This enables flexible customization of the layout while still harnessing the power of the underlying layout algorithm. At the same time, this is a quite challenging feature to master, as with different constraints affecting the layout algorithm, it is at times difficult to understand the impact of a specific constraint on the layout.
    661660
    662661\subsubsection{Data preprocessing}
    663662\label{smc-browser-data-preprocessing}
    664 The application operates on a set of static XHTML and JSON data files, that are created in a preprocessing step and deployed with the application. The preprocessing consists of a series of XSLT transformations (cf. figure \ref{fig:smc_processing}), starting from the internal datasets generated during the initialization (cf. \ref{smc_init}). The HTML output for \xne{smc-stats} is generated in two steps (\var{track S})  via an intermediate internal generic XML format for representing tabular data. The JSON data for the \xne{smc-graph} as expected by the \xne{d3} library is also generated in two steps (\var{track G}). First, a XML representation of the graph is generated from the data (\xne{terms2graph.xsl}), on which a generic XSLT-transformation is applied (\xne{graph\_json.xsl}) transforming the XML graph  into required JSON format. In fact, this track is run multiple times generating different variants of the graph, featuring different aspects of the dataset:
     663The application operates on a set of static XHTML and JSON data files that are created in a preprocessing step and deployed with the application. The preprocessing consists of a series of XSLT transformations (cf. figure \ref{fig:smc_processing}), starting from the internal datasets generated during the initialization (cf. \ref{smc_init}). The HTML output for \xne{smc-stats} is generated in two steps (\var{track S}) via an intermediate internal generic XML format for representing tabular data. The JSON data for the \xne{smc-graph} as expected by the \xne{d3} library is also generated in two steps (\var{track G}). First, an XML representation of the graph is generated from the data (\xne{terms2graph.xsl}), on which a generic XSLT transformation is applied (\xne{graph\_json.xsl}), transforming the XML graph into the required JSON format. In fact, this track is run multiple times, generating different variants of the graph, featuring different aspects of the dataset:
    665664
    666665\begin{description}
     
    677676\end{description}
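The final serialization step can be illustrated in miniature: the graph is flattened into the node-link JSON shape that the (v3-era) \xne{d3} force layout consumes, i.e. a \code{nodes} array plus a \code{links} array of source/target indices. The field names and the toy profile below are illustrative assumptions, not the actual SMC data schema.

```python
import json

# Toy term tree: (type, name, children) triples standing in for the XML
# graph produced by terms2graph.xsl. Names are invented for illustration.
profile = ("Profile", "TextCorpusProfile",
           [("Component", "GeneralInfo",
             [("Element", "title", [])])])

def to_node_link(tree):
    """Flatten a term tree into d3-style node-link JSON."""
    nodes, links = [], []
    def walk(node, parent_idx):
        ntype, name, children = node
        idx = len(nodes)
        nodes.append({"type": ntype, "name": name})
        if parent_idx is not None:
            links.append({"source": parent_idx, "target": idx})
        for child in children:
            walk(child, idx)
    walk(tree, None)
    return {"nodes": nodes, "links": links}

graph = to_node_link(profile)
print(json.dumps(graph, indent=2))
```

In the real pipeline this flattening is done by \xne{graph\_json.xsl}; the sketch only shows the target shape of the data.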
    678677
    679 Additionally, a detour pass (\var{track D}) is executed, in which the graph is also transformed into the DOT format and run through the \xne{Graphviz dot} tool to get a SVG representation of the graph. In an early stage of development, this was actually the only processing path. However soon it became obvious, that the graph is getting to huge to be displayed in its entirety. Figure \ref{fig:cmd-dep-dotgraph} displays an old version of such a dot generated graph visualization. Currently, the \xne{dot} output is only used as input for the final graph data, providing initialization coordinates for the nodes in the \code{dot}-layout.
    680 
    681 To The graph is constructed from all profiles defined in the Component Registry and related datasets. To resolve (multilingual) name and description of data categories referenced in the CMD elements definitions of referenced data categories from DublinCore and ISOcat are fetched.
     678Additionally, a detour pass (\var{track D}) is executed, in which the graph is also transformed into the DOT format and run through the \xne{Graphviz dot} tool to get an SVG representation of the graph. In an early stage of development, this was actually the only processing path. However, it soon became obvious that the graph was getting too large to be displayed in its entirety. Figure \ref{fig:cmd-dep-dotgraph} displays an old version of such a dot-generated graph visualization. Currently, the \xne{dot} output is only used as input for the final graph data, providing initialization coordinates for the nodes in the \code{dot}-layout.
     679
     680The graph is constructed from all profiles defined in the Component Registry and related datasets. To resolve the (multilingual) names and descriptions of the data categories referenced in the CMD element definitions, the specifications of the referenced data categories are fetched from DublinCore and ISOcat.
    682681
    683682
     
    698697
    699698As proposed in the design section, the starting point when using the SMC browser is the node list on the left, listing all nodes grouped by type (profiles, components, elements, data categories) and sorted alphabetically. This list can be filtered by a simple substring search, which is important, as there are already more than 4,000 nodes in the graph. Individual nodes are selected and deselected by a simple click. All selected nodes are displayed in the main graph pane, represented by a circle with a label. The representation is styled by type. Based on the settings in the navigation bar (cf. figure \ref{fig:navbar}), related nodes are displayed next to the selected ones. The \code{depth-before} and \code{depth-after} options govern how many levels in each direction are traversed and displayed starting from the set of selected nodes. Option \code{layout} allows to select one of the available layouts -- next to the
    700 basic \code{force} layout there are also directed layouts, that are often better suited for displaying the directed graph.
     699basic \code{force} layout there are also directed layouts that are often better suited for displaying the directed graph.
    701700Other options influence the layouting algorithm (\code{link-distance}, \code{charge}, \code{friction}) and the visual representation of the nodes and edges (\code{node-size, labels, curve}).
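The effect of \code{depth-before} and \code{depth-after} can be sketched as a bounded traversal over the directed graph: incoming edges are followed up to \code{before} levels, outgoing edges up to \code{after} levels, and the union of both reachable sets is displayed. The edge list and names below are assumptions for illustration, not the SMC Browser source.

```python
from collections import deque

# Invented miniature of the CMD graph: profiles -> components -> elements
# -> data categories.
EDGES = [("profileA", "comp1"), ("profileB", "comp1"),
         ("comp1", "elem1"), ("elem1", "datcat1")]

def reach(selected, adj, depth):
    """Breadth-first traversal from the selected nodes, at most `depth` levels."""
    seen = set(selected)
    frontier = deque((n, 0) for n in selected)
    while frontier:
        node, d = frontier.popleft()
        if d == depth:
            continue
        for nxt in adj.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, d + 1))
    return seen

def neighbourhood(selected, before=1, after=1):
    """Nodes shown for the given selection and depth-before/after settings."""
    out, inc = {}, {}
    for s, t in EDGES:
        out.setdefault(s, []).append(t)
        inc.setdefault(t, []).append(s)
    return reach(selected, inc, before) | reach(selected, out, after)

print(sorted(neighbourhood(["comp1"], before=1, after=1)))
```

Selecting \code{comp1} with depth 1 in both directions thus pulls in the profiles using it and the elements it contains, while deeper nodes (the data categories) only appear when \code{depth-after} is increased.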
    702701
    703 One special option is \code{graph}, that allows to switch between different graphs as listed in \ref{smc-browser-data-preprocessing}.
     702One special option is \code{graph}, which allows switching between the different graphs as listed in \ref{smc-browser-data-preprocessing}.
    704703
    705704There is user documentation deployed with the application and featured in the appendix \ref{sec:smc-browser-userdocs}, where all aspects of interaction with the application (\ref{interaction}) and the options in the navigation bar (\ref{options}) are described.
     
    708707\label{smc-browser-extensions}
    709708
    710 Next to the basic setup described above, there is a number of possible additional features, that could enhance the functionality and usefulness of the discussed tool.
     709In addition to the basic setup described above, there are a number of possible additional features that could enhance the functionality and usefulness of the discussed tool.
    711710
    712711\subsubsection*{Graph operations -- differential views}
     
    717716Equipped with a more flexible or modular matching algorithm (in addition to the initially foreseen identity match), the tool could visualize matches between any given schemas, not only CMD-based ones.
    718717
    719 Also, the input format being a graph, with appropriate preprocessing the tool could visualize any structural information, that is suited to be expressed as graph, like cooccurrence analysis, dependency networks, RDF data in general etc.
     718Also, since the input format is a graph, with appropriate preprocessing the tool could visualize any structural information that lends itself to being expressed as a graph, such as co-occurrence analyses, dependency networks, or RDF data in general.
    720719
    721720\subsubsection*{Viewer for external data}
    722 The above feature would be even more useful if the application would be enabled to ingest and process external data. The data can be passed either via upload or via a parameter with a URL of the data. This is especially attractive also to providers of other data and applications, who could provide a simple link in their user interface (with the data-parameter appropriately set), that would allow to visualize their data in the SMC browser.
     721The above feature would be even more useful if the application were able to ingest and process external data. The data could be passed either via upload or via a parameter carrying the URL of the data. This is especially attractive to providers of other data and applications, who could offer a simple link in their user interface (with the data parameter appropriately set) that visualizes their data in the SMC browser.
    723722
    724723One prominent visualization application offering this feature is the geobrowser e4D\furl{http://www.informatik.uni-leipzig.de:8080/e4D/} (currently \xne{GeoTemCo}\furl{https://github.com/stjaenicke/GeoTemCo}, developed in the context of the \xne{europeana connect} initiative), accepting data in KML format.
    725724
    726725\subsubsection*{Integrate with instance data}
    727 The usefulness and information gain of the application could be greatly increased by integrating the instance data. I.e. generate and display a variant of the graph which contains only profiles for which there is actually instance data present in the CLARIN joint metadata domain. Obviously, in such a visualization the size of data could be incorporated, in the most simple case number of records being mapped on the radius of the nodes, but there are a number of other metrics that could be applied in the visualizations.
     726The usefulness and information gain of the application could be greatly increased by integrating the instance data, i.e. generating and displaying a variant of the graph that contains only profiles for which instance data is actually present in the CLARIN joint metadata domain. Obviously, such a visualization could incorporate the size of the data, in the simplest case by mapping the number of records onto the radius of the nodes, but there are a number of other metrics that could be applied in the visualizations.
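The simplest variant -- record counts mapped onto node radii -- can be sketched as follows. Scaling the radius with the square root of the count (i.e. making the circle's *area* proportional to the count) is the customary choice, so that a profile with twice the records does not appear four times bigger; the profile names and counts are invented for illustration.

```python
import math

# Invented instance-data counts per profile.
record_counts = {"TextCorpusProfile": 40000,
                 "OLAC-DcmiTerms": 2500,
                 "teiHeader": 100}

def radius(count, r_min=4.0, r_max=30.0, max_count=None):
    """Map a record count onto a node radius, area-proportionally."""
    max_count = max_count or max(record_counts.values())
    scale = math.sqrt(count / max_count)
    return r_min + (r_max - r_min) * scale

for profile, n in record_counts.items():
    print(profile, round(radius(n), 1))
```

Other metrics mentioned in the text (e.g. reuse counts) could be mapped onto the same visual channel, or onto colour or edge thickness instead.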
    728727
    729728Such a visualization could also feature direct search links from individual nodes into the dataset, i.e. from a profile node a link could lead to a search interface listing the metadata records of the given profile.
     
    731730
    732731%%%%%%%%%%%%%%%%%%%%%%%%%
    733 \section{Application of \emph{schema matching} techniques in SMC}
     732\section{Application of \emph{Schema Matching} Techniques in SMC}
    734733\label{sec:schema-matching-app}
    735734
     
    739738Or, put in terms of the schema matching methodology, the system relies on explicitly set concept equivalences as a basis for mapping between schema entities. By referencing a data category in a CMD element, the modeller binds this element to a concept, making two elements linked to the same data category trivially equivalent.
    740739
    741 However this is only holds for schemas already created within the CMD framework (and even for these only to a certain degree, as will be explained later). Given the growing universe of definitions (data categories and components) in the CMD framework the metadata modeller could very well profit from applying schema mapping techniques as pre-processing step in the task of integrating existing external schemas into the infrastructure. (User involvement is identified by \cite{shvaiko2012ontology} as one of promising future challenges to ontology matching.) Already now, we witness a growing proliferation of components in the Component Registry and of data categories in the Data Category Registry.
     740However, this only holds for schemas already created within the CMD framework (and even for these only to a certain degree, as will be explained later). Given the growing universe of definitions (data categories and components) in the CMD framework, the metadata modeller could very well profit from applying schema mapping techniques as a pre-processing step in the task of integrating existing external schemas into the infrastructure. (User involvement is identified by \cite{shvaiko2012ontology} as one of the promising future challenges to ontology matching.) Already now, we witness a growing proliferation of components in the Component Registry and of data categories in the Data Category Registry.
    742741
    743742Let us restate the problem of integrating existing external schemas as an application of \var{schema matching} method:
    744743The data modeller starts off with an existing schema \var{$S_{x}$}. The system accommodates a set of schemas\footnote{Even though within CMDI the data models are called `profiles', we can still refer to them as `schemas', because every profile has an unambiguous expression in an XML Schema.} \var{$S_{1..n}$}.
    745 It is very improbable, that there is a \var{$S_{y} \in S_{1..n}$} that fully matches \var{$S_{x}$}.
     744It is very improbable that there is a \var{$S_{y} \in S_{1..n}$} that fully matches \var{$S_{x}$}.
    746745Given the heterogeneity of the schemas present in the field of research, full alignments are not achievable at all.
    747 However thanks to the compositional nature of the CMD data model, data modeller can reuse just parts of any of the schemas -- the
     746However, thanks to the compositional nature of the CMD data model, the data modeller can reuse just parts of any of the schemas -- the
    748747components \var{c}. Thus the task is to find for every entity $e_{x} \in S_{x}$ the set of semantically equivalent candidate components $\{c_{y}\}$, which corresponds to the definition of the mapping function for single entities as defined in \cite{EhrigSure2004}.
    749 Given, that the modeller does not have to reuse the components as they are, but can use existing components as base to create his own, even candidates that are not equivalent can be of interest, thus we can further relax the task and allow even candidates that are just similar to a certain degree (operationalized as threshold $t$ on the output of the \var{similarity} function).
     748Given that the modeller does not have to reuse the components as they are, but can use existing components as a base to create his own, even candidates that are not equivalent can be of interest. Thus we can further relax the task and allow candidates that are just similar to a certain degree (operationalized as a threshold $t$ on the output of the \var{similarity} function).
    750749Being only a pre-processing step meant to provide suggestions to the human modeller, the task assigns higher importance to recall than to precision.
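The relaxed matching task can be sketched as follows: for each entity of the external schema, return every registry component whose similarity exceeds the threshold $t$. For illustration the sketch uses a purely label-based string similarity; a real matcher would combine this with structural and extensional features, and the component names below are invented.

```python
from difflib import SequenceMatcher

# Invented registry component names.
registry_components = ["actorLanguage", "languageName",
                       "corpusTitle", "resourceTitle"]

def candidates(entity, components, t=0.5):
    """Recall-oriented: keep every component scoring at or above threshold t."""
    scored = [(c, SequenceMatcher(None, entity.lower(), c.lower()).ratio())
              for c in components]
    return sorted([(c, round(s, 2)) for c, s in scored if s >= t],
                  key=lambda x: -x[1])

print(candidates("title", registry_components))
```

Lowering $t$ trades precision for recall, which matches the stated preference: it is cheaper for the human modeller to discard a bad suggestion than to miss a good one.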
    751750
     
    764763the mapping function could be enriched with \emph{extensional} features based on the concept clusters as delivered by the crosswalk service (\ref{sec:cx}). It would also be worthwhile to test to what extent the \var{smcIndex} paths as defined in \ref{def:smcIndex} could be used as a feature (computing the longest matching subpath).
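Such a path-based feature could be computed as sketched below: two smcIndex-style dotted paths are scored by their longest run of identical consecutive segments, normalized by the length of the longer path. The dotted-path syntax follows the thesis; the scoring itself and the example paths are assumptions.

```python
def longest_common_subpath(p1, p2):
    """Length of the longest run of identical consecutive path segments."""
    a, b = p1.split("."), p2.split(".")
    best = 0
    for i in range(len(a)):
        for j in range(len(b)):
            k = 0
            while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
                k += 1
            best = max(best, k)
    return best

def path_similarity(p1, p2):
    """Normalize the longest common subpath by the longer path's length."""
    return (longest_common_subpath(p1, p2)
            / max(len(p1.split(".")), len(p2.split("."))))

print(path_similarity("TeiHeader.fileDesc.titleStmt.title",
                      "TextCorpus.fileDesc.titleStmt.title"))
```

Two elements sharing most of their path context would thus score high even when their root profiles differ.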
    765764
    766 Although we examplified on the case of integration of an external schema, the described approach could be applied also to the schemas already integrated in the system. Although there is already a high baseline given thanks to the mechanisms of reuse of components and data categories, there certainly still exist semantic proximities that are not explicitly expressed by these mechanisms. This deficiency is rooted in the collaborative creation of the CMD components and profiles, where individual modellers overlooked, deliberately ignored or only partially reused existing components or profiles. This can be seen on the case of multiple teiHeader profiles, that though they are modelling the same existing metadata format, are completely disconnected in terms of components and data category reuse (cf. \ref{results:tei}).
    767 
    768 Note, that in the case of reuse of components, in the normal scenario, the semantic equivalence is ensured even though the new component (and all its subcomponents) is a copy of the old one with new identity, because the references to data categories are copied as well. Thus, by default, the new component shares all data categories with the original one and the modeller has to deliberately change them if required. But even with reuse of components scenarios are thinkable, in which the semantic linking gets broken, or is not established, even though semantic equivalency pervails.
     765Although we used the integration of an external schema as an example, the described approach could also be applied to the schemas already integrated in the system. While there is already a high baseline thanks to the mechanisms of reuse of components and data categories, there certainly still exist semantic proximities that are not explicitly expressed by these mechanisms. This deficiency is rooted in the collaborative creation of the CMD components and profiles, where individual modellers overlooked, deliberately ignored or only partially reused existing components or profiles. This can be seen in the case of multiple teiHeader profiles that, though they model the same existing metadata format, are completely disconnected in terms of component and data category reuse (cf. \ref{results:tei}).
     766
     767Note that in the case of reuse of components, in the normal scenario, the semantic equivalence is ensured even though the new component (and all its subcomponents) is a copy of the old one with a new identity, because the references to data categories are copied as well. Thus, by default, the new component shares all data categories with the original one, and the modeller has to deliberately change them if required. But even with reuse of components, scenarios are conceivable in which the semantic linking gets broken, or is never established, even though semantic equivalence prevails.
    769768
    770769The question is what to do with the new correspondences that would possibly be determined if, as proposed, we applied schema matching to the already integrated schemas. One possibility is to add a data category reference where one element of the pair is still missing one.
    771 However if both already are linked to a data category, the data category pair could be added to the relation set in Relation Registry (cf. \ref{def:rr}).
     770However, if both are already linked to a data category, the data category pair could be added to the relation set in the Relation Registry (cf. \ref{def:rr}).
    772771 
    773772Once all the equivalences (and other relations) between the profiles/schemas have been found, similarity ratios can be determined.
    774773These new similarity ratios could be applied as alternative weights in the profiles-similarity graph (\ref{sec:smc-cloud}).
    775774
    776 In contrast to the task described here, that -- restricted to matching XML schemas -- can be seen as staying in the ``XML World'',
     775In contrast to the task described here, which -- restricted to matching XML schemas -- can be seen as staying in the ``XML world'',
    777776another aspect within this work is clearly situated in the Semantic Web domain and requires application of ontology matching methods -- the mapping of field values to semantic entities described in \ref{sec:values2entities}.
    778777
    779 %This approach of integrating prerequisites for semantic interoperability directly into the process of metadata creation is fundamentally different from the traditional methods of schema matching that try to establish pairwise alignments between already existing schemas -- be it algorithm-based or by means of explicit manually defined crosswalks\cite{Shvaiko2005}.
     778%This approach of integrating prerequisites for semantic interoperability directly into the process of metadata creation is fundamentally different from the traditional methods of schema matching that try to establish pairwise alignments between already existing schemas -- be it algorithm-based or by means of explicit manually defined crosswalks \cite{Shvaiko2005}.
    780779
    781780
    782781
    783782\section{Summary}
    784 In this core chapter, we layed out a design for a system dealing with concept-based crosswalks on schema level.
     783In this core chapter, we laid out a design for a system dealing with concept-based crosswalks on schema level.
    785784The system consists of three main parts: the crosswalk service, the query expansion module and \xne{SMC Browser} -- a tool for visualizing and exploring the schemas and the corresponding crosswalks.
    786785In addition, we elaborated on the application of schema matching methods to infer mappings between schemas.
  • SMC4LRT/chapters/Infrastructure.tex

    r3776 r4117  
    1 \chapter{Underlying infrastructure}
     1\chapter{Underlying Infrastructure}
    22\label{ch:infra}
    33
     
    77\label{def:CLARIN}
    88
     9CLARIN - Common Language Resource and Technology Infrastructure \cite{Varadi2008} - is one of the large research infrastructure initiatives as envisaged by the European Strategy Forum on Research Infrastructures (ESFRI) and fostered by the framework programmes of the European Commission. The mission of this project is to provide
     9CLARIN - Common Language Resource and Technology Infrastructure \cite{Varadi2008} - is one of the large research infrastructure initiatives as envisaged by the European Stategy Forum on Research Infrastructures (ESFRI) and fostered by the framework programmes of the European Commission. The mission of this project is to provide
    1010
    1111\begin{quote}
    12 \dots easy and sustainable access for scholars in the humanities and social sciences to digital language data (in written, spoken, video or multimodal form) and advanced tools to discover, explore, exploit, annotate, analyse or combine them, independent of where they are located.\cite{CLARIN2013web}
     12\dots easy and sustainable access for scholars in the humanities and social sciences to digital language data (in written, spoken, video or multimodal form) and advanced tools to discover, explore, exploit, annotate, analyse or combine them, independent of where they are located. \cite{CLARIN2013web}
    1313\end{quote}
    1414
     
    1919The initiative foresees a federated network of centres providing resources and services in a harmonized, interoperable manner to the academic community in all participating countries.
    2020
    21 In the preparation phase of the project 2008 - 2011 over 180 institutions from 38 countries participated. In the construction phase, the action impetus moved, as projected, more to the individual national initiatives of this federated endeavour, while kept together by the common principles set up during the preparation phase and established processes and administrative decision bodies ensuring the flow of information and coherent action on European level.
     21In the preparation phase of the project (2008--2011), over 180 institutions from 38 countries participated. In the construction phase, the action impetus moved, as projected, more to the individual national initiatives of this federated endeavour, while being kept together by the common principles set up during the preparation phase and by established processes and administrative decision bodies ensuring the flow of information and coherent action on the European level.
    2222
    2323Since 2013, CLARIN has also been a \emph{European Research Infrastructure Consortium} (ERIC), a new type of legal entity established within the EU, especially designed to give research infrastructure initiatives a more stable status and better means to act independently. This is an important step to ensure the continuity of the endeavour -- a chronic problem of (international) projects.
     
    2727\label{def:CMDI}
    2828
     29One core pillar of CLARIN is the \emph{Component Metadata Infrastructure} (CMDI)\furl{http://www.clarin.eu/cmdi} -- a distributed system consisting of multiple interconnected modules aimed at creating and providing metadata for LRT in a coherent, harmonized way. The conceptual foundation of CMDI is the \emph{Component Metadata Framework} \cite{Broeder+2010}, a flexible meta model that supports the creation of metadata schemas while also allowing existing schemas to be accommodated (cf. \ref{def:CMD}).
     29One core pillar of CLARIN is the \emph{Component Metadata Infrastructure} (CMDI)\furl{http://www.clarin.eu/cmdi} -- a distributed system consisting of multiple interconnected modules aimed at creating and providing metadata for LRT in a coherent harmonized way. The conceptual foundation of CMDI is the \emph{Component Metadata Framework} \cite{Broeder+2010}, a flexible meta model that supports creation of metadata schemas also allowing to accommodate existing schemas (cf. \ref{def:CMD}).
    3030
    3131The SMC is part of CMDI and depends on multiple modules on the production side of the infrastructure. Before we describe the SMC and its interaction with these modules in detail in chapter \ref{ch:design}, we introduce the latter and the type of data they provide in \ref{cmdi-registries}:
     
    3838
    3939\noindent
    40 All these modules are running services, that this work shall directly build upon.
     40All these modules are running services that this work shall directly build upon.
    4141
    4242In contrast, SMC is meant as a provider for the modules on the exploitation side of the infrastructure, i.e. search and exploration services used by the end users. These are briefly introduced in \ref{cmdi_exploitation}.
     
    6060Finally, the Vocabulary Alignment Service, a module playing a crucial role in metadata curation, is treated separately in section \ref{sec:cv}.
    6161
    62 \subsection{CMDI registries}
     62\subsection{CMDI Registries}
    6363\label{cmdi-registries}
    6464The CMD framework as data model (cf. \ref{def:CMD}), together with the two registries, the \emph{Data Category Registry} \xne{ISOcat} and the \emph{Component Registry}, forms the backbone of the CMD Infrastructure. See figure \ref{fig:cmdi-old} with the rather na\"{i}ve initial vision of the system, contrasted with figure \ref{fig:SMC-linkage} detailing the actual linkage between the data in the individual registries. In the following, we briefly explain their role and interaction.
     
    6666\begin{figure*}[t]
    6767\includegraphics[width=1\textwidth]{images/SMC_CR-DCR-RR_Linkage_v2}
    68 \caption{The diagram depicts the links between pieces of data in the individual registries that serve as basis for semantic mapping}
     68\caption{The diagram depicts the links between pieces of data in the individual registries that serve as basis for semantic mapping.}
    6969\label{fig:SMC-linkage}
    7070\end{figure*}
     
    7979Next to a web interface for users to browse and manage the data categories, ISOcat provides a REST-style webservice allowing applications to retrieve the data category specifications. By default, a specification is provided in the \xne{Data Category Interchange Format - DCIF}, the standardized XML serialization of the data model, but RDF and HTML representations are available as well.
    8080
    81 The core data model defining the data category specification is rather complex, consisting of administrative, linguistic and description part, containing language-specific versions of definitions, value domains, examples and other attributes (cf. \ref{fig:DCR_data_model} for the diagram of the full data model). Following types of data categories are recognized (cf. figure \ref{fig:dc_type}): \var{simple, complex}: (\var{closed, open} or \var{constrained}), \var{container}. One fundamental aspect to emphasize is, that the data categories are assigned a persistent identifier, making them globally and permanently referable.
     81The core data model defining the data category specification is rather complex, consisting of an administrative, a linguistic and a description part, containing language-specific versions of definitions, value domains, examples and other attributes (cf. \ref{fig:DCR_data_model} for the diagram of the full data model). The following types of data categories are recognized (cf. figure \ref{fig:dc_type}): \var{simple, complex} (\var{closed, open} or \var{constrained}), \var{container}. One fundamental aspect to emphasize is that the data categories are assigned a persistent identifier, making them globally and permanently referable.
    8282
    8383\begin{figure*}[!ht]
     
    8585\includegraphics[width=0.7\textwidth]{images/dc_types}
    8686\end{center}
    87 \caption{Data Category types\cite{Windhouwer2011ISOcat_intro}}
     87\caption{Data Category types \cite{Windhouwer2011}}
    8888\label{fig:dc_type}
    8989\end{figure*}
     
    9292\label{def:CR}
    9393
    94 \emph{Component Registry}\furl{http://catalog.clarin.eu/ds/ComponentRegistry/} (CR) implements the CMD data model (cf. \ref{def:CMD}) and fulfills two functions. For one, it is the actual registry that persistently stores and exposes published CMD profiles via a web interface allowing to browse and search in them and view their structure accompaniged by a REST webservice to allows client applications to retrieve the profile definitions. At the same time the web interface serves as an editor for creating and editing new CMD components and profiles.
    95 
    96 The primary user of the CR is the metadata modeller with the task to create a dedicated metadata profile for a given resource type. She can browse and search the CR for components and profiles that are suitable or come close. The registry already contains many general components, e.g., for contact persons, language and geographical information. In general many of these can be reused as they are or have to be only slightly adapted, i.e., have some metadata elements and/or components  added or removed. Also new components can be created if needed to model the unique aspects of the resources under consideration.\cite{Durco2013_MTSR}
    97 
    98 Let us reiterate, that the actual core provision for semantic interoperability is the requirement that the elements (and as far as possible also components and values) should be linked ``via a PID to exactly one data category (cf. \ref{def:DCR}), thus indicating unambiguously how the content of the field in a metadata description should be interpreted''\cite{Broeder+2010}, or \emph{to make its semantics explicit}.
      94\emph{Component Registry}\furl{http://catalog.clarin.eu/ds/ComponentRegistry/} (CR) implements the CMD data model (cf. \ref{def:CMD}) and fulfills two functions. For one, it is the actual registry that persistently stores and exposes published CMD profiles via a web interface allowing users to browse and search them and view their structure, accompanied by a REST webservice that allows client applications to retrieve the profile definitions. At the same time, the web interface serves as an editor for creating and editing new CMD components and profiles.
     95
      96The primary user of the CR is the metadata modeller with the task to create a dedicated metadata profile for a given resource type. She can browse and search the CR for components and profiles that are suitable or come close. The registry already contains many general components, e.g., for contact persons, language and geographical information. In general, many of these can be reused as they are or have to be only slightly adapted, i.e., have some metadata elements and/or components added or removed. New components can also be created if needed to model the unique aspects of the resources under consideration. \cite{Durco2013MTSR}
     97
     98Let us reiterate that the actual core provision for semantic interoperability is the requirement that the elements (and as far as possible also components and values) should be linked ``via a PID to exactly one data category (cf. \ref{def:DCR}), thus indicating unambiguously how the content of the field in a metadata description should be interpreted'' \cite{Broeder+2010}, or \emph{to make its semantics explicit}.
    9999
    100100As dictated by the CMD model, all components needed for the modelled resource description are compiled into one profile.
     
    104104
    105105The framework as described so far provides a sound mechanism for binding the semantic interpretation of the metadata descriptions.
    106 However there needs to be an additional means to capture information about relations between data categories.
    107 This information was deliberately not included in the DCR, because relations often depend on the context in which they are used, making global agreement unfeasible. CMDI proposes a separate module -- the \emph{Relation Registry}\label{def:rr} (RR) \cite{Kemps-Snijders+2008} --, where arbitrary relations between data categories can be stored and maintained. This design decision is based upon the assumption that the relations be under control of the metadata user whereas the data categories are under control of the metadata modeller.
      106However, there needs to be an additional means to capture information about relations between data categories.
      107This information was deliberately not included in the DCR, because relations often depend on the context in which they are used, making global agreement unfeasible. CMDI proposes a separate module -- the \emph{Relation Registry}\label{def:rr} (RR) \cite{Kemps-Snijders+2008} --, where arbitrary relations between data categories can be stored and maintained. This design decision is based upon the assumption that the relations need to be under the control of the metadata user whereas the data categories are under the control of the metadata modeller.
    108108
    109109The relations don't need to pass a standardization process, but rather separate research teams may define their own sets of relations according to the specific needs of the project. That is not to say that every researcher has to create her own set of relations -- some basic recommended sets will be defined right from the start. But new -- even contradictory -- ones can be created when needed.
    110110
    111 There is a prototypical implementation of such a relation registry called \xne{RELcat} being developed at MPI, Nijmegen\cite{Windhouwer2011,SchuurmanWindhouwer2011}, that already hosts a few relation sets. There is no user interface to it yet, but it is accessible as a REST-webservice\footnote{sample relation set: \url{http://lux13.mpi.nl/relcat/rest/set/cmdi}}.
     111There is a prototypical implementation of such a relation registry called \xne{RELcat} being developed at MPI, Nijmegen \cite{Windhouwer2011,SchuurmanWindhouwer2011} that already hosts a few relation sets. There is no user interface to it yet, but it is accessible as a REST-webservice\footnote{sample relation set: \url{http://lux13.mpi.nl/relcat/rest/set/cmdi}}.
    112112This implementation stores the individual relations as RDF triples allowing typed relations, like equivalency (\code{rel:sameAs}) and subsumption (\code{rel:subClassOf}). The relations are grouped into relation sets that can be used independently. The relations are deliberately defined in a separate namespace, instead of reusing existing ones (\code{skos:exactMatch, owl:sameAs}) with the aim to avoid introducing too specific semantics. These relations can be mapped to appropriate other predicates when integrating the relation sets in concrete applications.
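To make this concrete, a relation-set entry could be expressed as RDF triples along the following lines (an illustrative sketch: the data category PIDs and the \code{rel:} namespace URI are hypothetical, not taken from the actual RELcat data):

\begin{lstlisting}
# hypothetical namespace and data category PIDs, for illustration only
@prefix rel: <http://example.org/relcat/rel#> .

# equivalence between two data categories
<http://www.isocat.org/datcat/DC-0001>
    rel:sameAs <http://purl.org/dc/elements/1.1/title> .

# subsumption: one data category is narrower than another
<http://www.isocat.org/datcat/DC-0002>
    rel:subClassOf <http://www.isocat.org/datcat/DC-0001> .
\end{lstlisting}

When such a set is integrated into an application, \code{rel:sameAs} could then be mapped to, e.g., \code{skos:exactMatch} or \code{owl:sameAs} as appropriate.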
    113113
     
    116116\end{definition}
    117117
    118 \subsection{Further parts of the infrastructure}
     118\subsection{Further Parts of the Infrastructure}
    119119\label{cmdi-other}
    120120
     
    124124\begin{quotation}
    125125RELcat and SCHEMAcat will provide the means to harvest and specify this information in the form of relationships and allow
    126 (search) algorithms to traverse the semantic graph thus made explicit\cite{Schuurman2011_SCHEMAcat}.
     126(search) algorithms to traverse the semantic graph thus made explicit \cite{SchuurmanWindhouwer2011}.
    127127\end{quotation}
    128128
    129129\subsubsection*{Schema Parser}
    130 Schema Parser is a service developed at the Meertens Institute, Amsterdam, that processes XML Schemas to generate all possible paths in the instance data. It is used primarily as auxiliary service to the search engine developed at the same institute, presented in the following subsection.
      130Schema Parser is a service developed at the Meertens Institute, Amsterdam, that processes XML Schemas to generate all possible paths in the instance data. It is used primarily as an auxiliary service to the search engine developed at the same institute, presented in the following subsection.
    131131
    132132\subsubsection*{Metadata editors}
     
    137137
    138138Given that the Component Registry generates an XML schema for every profile, basically any generic XML editor with schema validation can be used (e.g. the widespread \xne{oXygen}). However, there have been efforts within the CLARIN community to develop dedicated tools, tailor-made for the creation of CMD records.
    139 Two examples being the stand-alone application \xne{Arbil}\cite{withers2012arbil}\furl{http://tla.mpi.nl/tools/tla-tools/arbil/} being developed at Max Planck Institute for Psycholinguistics, Nijmegen and the web-based application developed within the project \xne{NaLiDa}\cite{dima2012mdeditor}\furl{http://www.sfs.uni-tuebingen.de/nalida/en/} at the Seminar für Sprachwissenschaft University Tübingen.
    140 
    141 
    142 \subsection{CMDI exploitation side}
      139Two examples are the stand-alone application \xne{Arbil}\furl{http://tla.mpi.nl/tools/tla-tools/arbil/} \cite{withers2012arbil}, developed at the Max Planck Institute for Psycholinguistics, Nijmegen, and the web-based application developed within the project \xne{NaLiDa}\furl{http://www.sfs.uni-tuebingen.de/nalida/en/} \cite{dima2012mdeditor} at the Seminar für Sprachwissenschaft, University of Tübingen.
     140
     141
     142\subsection{CMDI Exploitation Side}
    143143\label{cmdi_exploitation}
    144 Metadata complying with the CMD data model is being created by a growing number of institutions  by various means -- automatic transformation from legacy data or authoring of new metadata records with the help of one of the metadata editors (cf. \ref{md-editors}). The CMD infrastructure requires the content providers to publish their metadata via the OAI-PMH protocol and announce the OAI-PMH endpoints.  These are being collected daily by a dedicated CLARIN harvester\furl{http://catalog.clarin.eu/oai-harvester/}. The harvested data is validated against the corresponding schemas (every profile implies a separate schema). In the future a subsequent normalization step will play a bigger role, currently only minimal ad-hoc label normalization is performed for a few organization names. Finally, the data is made (publicly) available as compressed archive files. These are being fetched by the exploitation side applications, that ingest the metadata records, index them and make them available for searching and browsing (cf. figure \ref{fig:cmd-ingestion}).
      144Metadata complying with the CMD data model is being created by a growing number of institutions by various means -- automatic transformation from legacy data or authoring of new metadata records with the help of one of the metadata editors (cf. \ref{md-editors}). The CMD infrastructure requires the content providers to publish their metadata via the OAI-PMH protocol and announce the OAI-PMH endpoints. These are being collected daily by a dedicated CLARIN harvester\furl{http://catalog.clarin.eu/oai-harvester/}. The harvested data is validated against the corresponding schemas (every profile implies a separate schema). In the future a subsequent normalization step will play a bigger role; currently, only minimal ad-hoc label normalization is performed for a few organization names. Finally, the data is made (publicly) available as compressed archive files. These are being fetched by the exploitation-side applications that ingest the metadata records, index them and make them available for searching and browsing (cf. figure \ref{fig:cmd-ingestion}).
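For illustration, a provider endpoint would be harvested with standard OAI-PMH requests of the following form (the endpoint URL and the metadata prefix are assumptions made for this sketch):

\begin{lstlisting}
# full harvest of CMDI records from an (illustrative) endpoint
http://repository.example.org/oai?verb=ListRecords&metadataPrefix=cmdi

# subsequent incremental harvest, restricted by datestamp
http://repository.example.org/oai?verb=ListRecords&metadataPrefix=cmdi&from=2013-11-01
\end{lstlisting}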
    145145
    146146\begin{figure*}[!ht]
    147147\begin{center}
    148148\includegraphics[width=0.8\textwidth]{images/CMDingestion_woVAS}
    149 \caption{Within CMDI, metadata is harvested from content providers via OAI-PMH and made available to consumers/users by search applications}
     149\caption{Within CMDI, metadata is harvested from content providers via OAI-PMH and made available to consumers/users by search applications.}
    150150\label{fig:cmd-ingestion}
    151151\end{center}
    152152\end{figure*}
    153153
    154 The first stable and publicly available application providing access to the collected metadata of CMDI has been the \xne{VLO - Virtual Language Observatory}\furl{http://www.clarin.eu/vlo/}\cite{VanUytvanck2010}, developed by the Technical Group at the MPI for Psycholinguistics, Nijmegen, based on the wide-spread full-text search engine \xne{Apache Solr}\furl{http://lucene.apache.org/solr/}.
     154The first stable and publicly available application providing access to the collected metadata of CMDI has been the \xne{VLO - Virtual Language Observatory}\furl{http://www.clarin.eu/vlo/} \cite{VanUytvanck2010}, developed by the Technical Group at the MPI for Psycholinguistics, Nijmegen, based on the wide-spread full-text search engine \xne{Apache Solr}\furl{http://lucene.apache.org/solr/}.
    155155The application employs a faceted search with 10 fixed facets (figure \ref{fig:vlo}).
    156156As the processed metadata records are instances of different CMD profiles and thus have very different structures, the application relies on the data category references in the underlying schemas to map the fields in the records onto the facets, effectively making use of this basic layer of semantic interoperability provided by the infrastructure.
     
    159159\begin{center}
    160160\includegraphics[width=0.8\textwidth]{images/screen_VLO_overview.png}
    161 \caption{screenshot of the faceted browser of the VLO}
     161\caption{Screenshot of the faceted browser of the VLO}
    162162\label{fig:vlo}
    163163\end{center}
    164164\end{figure*}
    165165
    166 More recently, the team at Meertens Institute developed a similar application the \texttt{MI Search Engine}\furl{http://www.meertens.knaw.nl/cmdi/search/}. It too is based on the Apache Solr and provides a faceted search, but with a substantially more sophisticated indexing process and search interface \cite{zhang2012cmdi}. Instead of reducing the data into a fixed number of indexes or facets, the application employs the aforementioned \xne{Schema Parser} to dynamically generate an index configuration that covers all data, again relying on the data categories to merge information from semantically equivalent metadata fields in the different schemas into a common index.
      166More recently, the team at the Meertens Institute developed a similar application, the \texttt{MI Search Engine}\furl{http://www.meertens.knaw.nl/cmdi/search/}. It is also based on Apache Solr and provides a faceted search, but with a substantially more sophisticated indexing process and search interface \cite{Zhang2012cmdi}. Instead of reducing the data into a fixed number of indexes or facets, the application employs the aforementioned \xne{Schema Parser} to dynamically generate an index configuration that covers all data, again relying on the data categories to merge information from semantically equivalent metadata fields in the different schemas into a common index.
    167167The application also offers some innovative solutions in the user interface, like search by similarity, content-first search or specialized contextual widgets visualizing the time dimension, the geographic information and other derived data.
    168168% \todoin { describe indexing and search}
    169169
    170 And finally, there is the \xne{Metadata Repository}, being developed by the author as a XQuery application in the XML database \xne{eXist}, originally (in the initial blueprints of the infrastructure) foreseen as main storage of the collected metadata with the \xne{Metadata Service} on top providing search access to the data optionally applying \xne{Semantic Mapping} to expand user queries (cf. figure \ref{fig:cmdi-old}). \cite{Durco2011}
    171 However the application still did not reach production quality, and is used rather as experimenting field for the author. Meanwhile the functionality of the Metadata Service had been integrated directly into the Metadata Repository together with the auxiliary use of Semantic Mapping, making it the implementation of the semantic search module as proposed in this work (cf. \ref{sec:qx}).
      170And finally, there is the \xne{Metadata Repository}, being developed by the author as an XQuery application in the XML database \xne{eXist}, originally (in the initial blueprints of the infrastructure) foreseen as the main storage of the collected metadata with the \xne{Metadata Service} on top providing search access to the data, optionally applying \xne{Semantic Mapping} to expand user queries (cf. figure \ref{fig:cmdi-old}) \cite{Durco2011}.
      171However, the application has still not reached production quality and is used rather as an experimentation platform for the author. Meanwhile, the functionality of the Metadata Service has been integrated directly into the Metadata Repository, together with the auxiliary use of Semantic Mapping, making it the implementation of the semantic search module proposed in this work (cf. \ref{sec:qx}).
    172172
    173173%%%%%%%%%%%%%%%%%%%%
     
    175175\label{sec:cv}
    176176
    177 \subsection{Motivation \& broader context}
    178 The provisions for data harmonization and semantic interoperability as presented until now pertain mostly to the schema level. However the problem of incoherent labeling and nomenclature is even more virulent in the actual metadata fields on the instance level. While for a number of fields the value domain can be enforced through schema validation, many fields (e.g. \concept{organization} or \concept{resource type})  have a constrained value domain that yet cannot be explicitly exhaustively enumerated. This leads to a chronically inconsistent use of labels for referring to entities (as the instance data shows, some organizations are referred to by more than 20 different labels, or spelling variants.) prompting an urgent need for better means for harmonizing the constrained-field values.
     177\subsection{Motivation \& Broader Context}
      178The provisions for data harmonization and semantic interoperability as presented until now pertain mostly to the schema level. However, the problem of incoherent labelling and nomenclature is even more virulent in the actual metadata fields on the instance level. While for a number of fields the value domain can be enforced through schema validation, many fields (e.g. \concept{organization} or \concept{resource type}) have a constrained value domain that cannot yet be explicitly and exhaustively enumerated. This leads to a chronically inconsistent use of labels for referring to entities (as the instance data shows, some organizations are referred to by more than 20 different labels or spelling variants), prompting an urgent need for better means for harmonizing the constrained-field values.
    179179
    180180This issue is to be seen in a broader context of a general need for reliable community-shared registry services for concepts, controlled vocabularies and reference data in both the LRT and Digital Humanities community, applicable in a range of applications and tasks like data enrichment and annotation, metadata generation and curation, data analysis, etc.
     
    183183Consequently, activities with regard to controlled vocabularies are ongoing not only in CLARIN, but also within the sister ESFRI project DARIAH. As there is a substantial overlap in the vocabularies relevant for the various communities and even more so a high potential for reusability on the technical level, there is a strong case for tight synergic cooperation between individual initiatives.
    184184
    185 It has to be also kept in mind, that a hoist of work on controlled vocabularies has already been done and a large body of data is present in individual specialized communities (taxonomies) as well as -- with more general scope -- in the libraries world (authority files).
      185It also has to be kept in mind that a host of work on controlled vocabularies has already been done and a large body of data is present in individual specialized communities (taxonomies) as well as -- with more general scope -- in the libraries world (authority files).
    186186
    187187\begin{comment}
     
    196196\label{def:CLAVAS}
    197197
    198 In the context of CLARIN (primarily CLARIN-NL), a concrete initiative has been conducted -- \xne{Vocabulary Alignment Service for CLARIN} or CLAVAS -- with the objective to reuse and enhance for CLARIN needs a SKOS-based vocabulary repository and editor \xne{OpenSKOS}\furl{http://openskos.org}, developed and run within the dutch program \xne{CATCHplus}\footnote{\textit{Continuous Access To Cultural Heritage} - \url{http://www.catchplus.nl/en/}}.
      198In the context of CLARIN (primarily CLARIN-NL), a concrete initiative has been conducted -- \xne{Vocabulary Alignment Service for CLARIN} or CLAVAS -- with the objective to reuse the SKOS-based vocabulary repository and editor \xne{OpenSKOS}\furl{http://openskos.org}, developed and run within the Dutch programme \xne{CATCHplus}\footnote{\textit{Continuous Access To Cultural Heritage} - \url{http://www.catchplus.nl/en/}}, and to enhance it for CLARIN's needs.
    199199
    200200%As of spring 2013, the Standing Committee on CLARIN Technical Centres (SCCTC) adopted the issue of Controlled Vocabularies and Concept Registries as one of the infrastructural (A-centre) services to be dealt with.
    201201
    202202The basic idea of this repository is to serve as a project independent manager and provider of controlled vocabularies, as an exchange platform for data in SKOS format.
    203 One important feature of the \xne{OpenSKOS} system is its distributed architecture. Multiple instances can be set up, that can synchronize the maintained vocabularies among each other via OAI-PMH protocol. This caters for a reliable redundant system, in which multiple instances provide identical synchronized data, with organizations behind individual instances assuming the primary responsibility for individual vocabularies based on their specialization or field of expertise.
    204 
    205 Currently, the Meertens Institute\furl{http://meertens.knaw.nl/} of the Dutch Royal Academy of Sciences (KNAW), Netherlands Institute for Sound and Vision\furl{http://www.beeldengeluid.nl/}, as well as Austrian Centre for Digital Humanities at the Austrian Academy of Sciences are running a instance of the OpenSKOS system.
    206 
    207 As the work on this vocabulary repository started in the context of a cultural heritage program, originally it served vocabularies not directly relevant for the LRT-community \concept{GTAA - Gemeenschappelijke Thesaurus Audiovisuele Archieven} or \concept{AAT - Art \& Architecture Thesaurus}\furl{http://openskos.org/api/collections}.  Within the CLAVAS, a number of vocabularies relevant for the CLARIN and LRT-community were identified, that will be gradually integrated into the vocabulary repository. (See \ref{refdata} for a more complete list of required reference data together with candidate existing vocabularies.) Following vocabularies were already integrated into the \xne{CLAVAS} instance of OpenSKOS:
      203One important feature of the \xne{OpenSKOS} system is its distributed architecture. Multiple instances can be set up that can synchronize the maintained vocabularies among each other via the OAI-PMH protocol. This caters for a reliable redundant system, in which multiple instances provide identical synchronized data, with the organizations behind individual instances assuming the primary responsibility for individual vocabularies based on their specialization or field of expertise.
     204
     205Currently, the Meertens Institute\furl{http://meertens.knaw.nl/} of the Dutch Royal Academy of Sciences (KNAW), Netherlands Institute for Sound and Vision\furl{http://www.beeldengeluid.nl/}, as well as Austrian Centre for Digital Humanities at the Austrian Academy of Sciences are running an instance of the OpenSKOS system.
     206
      207As the work on this vocabulary repository started in the context of a cultural heritage programme, originally it served vocabularies not directly relevant for the LRT community, such as \concept{GTAA - Gemeenschappelijke Thesaurus Audiovisuele Archieven} or \concept{AAT - Art \& Architecture Thesaurus}\furl{http://openskos.org/api/collections}. Within CLAVAS, a number of vocabularies relevant for the CLARIN and LRT community were identified that will be gradually integrated into the vocabulary repository. (See \ref{refdata} for a more complete list of required reference data together with candidate existing vocabularies.) The following vocabularies have already been integrated into the \xne{CLAVAS} instance of OpenSKOS:
    208208\begin{itemize}
    209 \item the list of language codes\cite{ISO639}
     209\item the list of language codes \cite{ISO639}
    210210\item organization names for the domain of language resources
    211211\item a number of data categories from ISOcat (see \ref{sec:export-dcr} for details of the process)
     
    215215\label{sec:export-dcr}
    216216
    217 Based on the premise, that the data in DCR also represents a kind of a controlled vocabularies, there is an effort to export data categories in SKOS format and import them into the Vocabulary Service.
    218 
    219 Note, that there are two interaction paths between the ISOcat and the Vocabulary Service. The first, importing certain data categories from ISOcat into the Vocabulary Service, is described in this section. The second aspect (described in next section \ref{interaction-dcr-skos}) is, that the value domains of certain data categories are defined by reference to a vocabulary maintained in the Vocabulary Service.
      217Based on the premise that the data in the DCR also represents a kind of controlled vocabulary, there is an effort to export data categories in SKOS format and import them into the Vocabulary Service.
     218
      219Note that there are two interaction paths between ISOcat and the Vocabulary Service. The first, importing certain data categories from ISOcat into the Vocabulary Service, is described in this section. The second aspect (described in the next section, \ref{interaction-dcr-skos}) is that the value domains of certain data categories are defined by reference to a vocabulary maintained in the Vocabulary Service.
    220220
    221221The fact that data categories are basically definitions of concepts may suggest
    222222a na\"{i}ve approach to mapping DCR data to SKOS, namely mapping every data category to a \code{skos:Concept}
    223 all of them belonging to the \code{ISOcat:ConceptScheme}. However the data in ISOcat as whole is too disparate in scope for such a vocabulary to be useful.
     223all of them belonging to the \code{ISOcat:ConceptScheme}. However, the data in ISOcat as a whole is too disparate in scope for such a vocabulary to be useful.
    224224
    225225A more sensible approach is to export only closed DCs (with an explicitly defined value domain, cf. \ref{def:DCR}) as separate \code{skos:ConceptSchemes} and their respective simple DCs as \code{skos:Concepts} within that scheme.
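Under this mapping, a closed complex DC becomes a \code{skos:ConceptScheme} and each simple DC in its value domain a \code{skos:Concept}. Sketched in Turtle, with hypothetical PIDs and labels standing in for real ISOcat entries:

\begin{lstlisting}
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .

# closed complex DC -> skos:ConceptScheme (hypothetical PID)
<http://www.isocat.org/datcat/DC-0100>
    a skos:ConceptScheme ;
    skos:prefLabel "resource type"@en .

# simple DC from its value domain -> skos:Concept
<http://www.isocat.org/datcat/DC-0101>
    a skos:Concept ;
    skos:prefLabel "corpus"@en ;
    skos:inScheme <http://www.isocat.org/datcat/DC-0100> .
\end{lstlisting}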
    226226
    227227\begin{quotation}
    228 The rationale is, that if we see a vocabulary as a set of possible values for a
     228The rationale is that if we see a vocabulary as a set of possible values for a
    229229field/element/attribute, complex DCs in ISOcat are the users of such
    230230vocabularies and simple DCs the DCR equivalence of values in such a
    231 vocabulary.\cite{Menzo2013mail}
     231vocabulary. \cite{Menzo2013mail}
    232232\end{quotation}
    233233
    234234\begin{comment}
    235 Still there are some closed DCs which might be good vocabulary
     235Still there are some closed DCs, which might be good vocabulary
    236236providers, e.g., /linguistic subject/ (DC-2527/), and still also need to
    237237stay in ISOcat. I think at some point we should create a smaller set of
     
    240240then 20, 50 or 100 values are exported.
    241241
    242 However it needs to be yet assessed how useful this approach is. In the metadata profile
      242However, it has yet to be assessed how useful this approach is. In the metadata profile
    243243there are many closed DCs with small value domains. How useful are those
    244244in CLAVAS?
     
    253253\end{figure*}
    254254
    255 Another aspect is, that a simple DC can be in value domains of multiple closed DCs.
     255Another aspect is that a simple DC can be in value domains of multiple closed DCs.
    256256Also a \code{skos:Concept} can belong to multiple \code{skos:ConceptSchemes}\furl{http://www.w3.org/TR/skos-primer/\#secscheme}.
    257257So there could be a 1:1 mapping of [complex closed DCs] to [skos:ConceptSchemes] and [simple DCs] to [skos:Concepts].
     
    260260Alternatively, for each value domain a SKOS concept scheme with SKOS concepts can be created,
    261261i.e., a SKOS concept always belongs to one concept schema, but multiple SKOS concepts refer to the same simple DC using \code{<dcr:datcat/>} (and \code{<dcterms:source/>}).
    262 This is, how the export for CLAVAS currently works.\furl{http://www.isocat.org/rest/profile/5.clavas}\furl{https://trac.clarin.eu/browser/cats/ISOcat/trunk/mod-ISOcat-interface-rest
     262This is how the export for CLAVAS currently works.\furl{http://www.isocat.org/rest/profile/5.clavas}\furl{https://trac.clarin.eu/browser/cats/ISOcat/trunk/mod-ISOcat-interface-rest
    263263/representations/dcs2/clavas.xsl}
    264264
    265265
    266 \subsection{Linking to vocabularies in data categories and schemas -- interaction between ISOcat, CLAVAS and client applications}
     266\subsection{Linking to Vocabularies in Data Categories and Schemas -- Interaction between ISOcat, CLAVAS and Client Applications}
    267267\label{interaction-dcr-skos}
    268268
    269269In the following, we elaborate on the possible ways to model references to vocabularies in data category specification and to
    270 convey that information to the client application. As of the writing, this is work in progress with some design decision yet to be made.\footnote{Large parts of this subsection come from email correspondence with M. Windhouwer in spring 2013.\cite{Menzo2013mail}}
      270convey that information to the client application. As of this writing, this is work in progress with some design decisions yet to be made.\footnote{Large parts of this subsection come from email correspondence with M. Windhouwer in spring 2013. \cite{Menzo2013mail}}
    271271
    272272Providing vocabularies for constrained but large and complex conceptual domains is the main motivation for the vocabulary repository:
    273273
    274274\begin{quotation}
    275 Originally, the vocabulary repository has been conceived to manage rather large and complex value domains, that do not fit easily in the DCR data model. Where the value domains are big (ISO 639-3) or can only be
     275Originally, the vocabulary repository has been conceived to manage rather large and complex value domains that do not fit easily in the DCR data model. Where the value domains are big (ISO 639-3) or can only be
    276276partially enumerated (organization names) ISOcat can't/shouldn't contain
    277277the value domains but just refer to CLAVAS, i.e., ISOcat wouldn't be a
    278 provider.\cite{Menzo2013mail}
     278provider. \cite{Menzo2013mail}
    279279\end{quotation}
    280280
     
    290290\end{lstlisting}
    291291
    292 A proposal by Windhouwer\cite{Menzo2013mail} for integration with CLAVAS foresees following extension:
      292A proposal by Windhouwer \cite{Menzo2013mail} for integration with CLAVAS foresees the following extension:
    293293
    294294\begin{lstlisting}
     
    298298\begin{quotation}
    299299\code{@href} points to the vocabulary. Actually a PID should be used in the context
    300 of ISOcat, but it is not clear how persistent are the vocabularies. This may pose a problem as part of DC specification may now have a different persistency then the core.
     300of ISOcat, but it is not clear how persistent are the vocabularies. This may pose a problem as part of DC specification may now have a different persistency than the core.
    301301
    302302\code{@type} could be \code{closed} or \code{open}. \code{closed}: only values in the vocabulary are
     
    304304\end{quotation}
    305305
    306 This yields a definition of the value domain for the data category, where the new rule pointing to the vocabulary is \emph{added} (cf. listing \ref{lst:dcif-conceptualDomain}), so that -- once the information from the DC specification gets into the schema -- tools that don't support vocabulary lookup but are capable of XSD/RNG validation, can still use the regular expression based definition.
     306This yields a definition of the value domain for the data category, where the new rule pointing to the vocabulary is \emph{added} (cf. listing \ref{lst:dcif-conceptualDomain}), so that -- once the information from the DC specification gets into the schema -- tools that don't support vocabulary lookup, but are capable of XSD/RNG validation, can still use the regular expression based definition.
    307307 
    308308\lstset{language=XML}
    309 \begin{lstlisting}[label=lst:dcif-conceptualDomain, caption=definition of conceptualDomain for the data category \concept{languageID} employing the proposed extension for pointing to a vocabulary]
     309\begin{lstlisting}[label=lst:dcif-conceptualDomain, caption=Definition of conceptualDomain for the data category \concept{languageID} employing the proposed extension for pointing to a vocabulary]
    310310  <dcif:conceptualDomain type="constrained">
    311311     <dcif:dataType>string</dcif:dataType>
     
    331331\end{figure*}
    332332
    333 It is important to emphasize, that anything stated in the DC specification is not binding (even if the DC is of type \var{closed}), but rather a non-normative hint or  recommendation. The authoritative source is the schema. A schema modeller binding an element in a schema to a data category can still decide to have other restriction for the values domain of that element then the ones suggested in the DC specification. This applies equally to the proposed vocabulary reference mechanism: The author of the data category suggests a vocabulary to be used for values of given data category, but the metadata modeller decides, if and how this vocabulary will be integrated into the modelled schema.
    334 
    335 There are basically two options, how the vocabulary can be integrated into the schema.
      333It is important to emphasize that anything stated in the DC specification is not binding (even if the DC is of type \var{closed}), but rather a non-normative hint or recommendation. The authoritative source is the schema. A schema modeller binding an element in a schema to a data category can still decide to have other restrictions for the value domain of that element than the ones suggested in the DC specification. This applies equally to the proposed vocabulary reference mechanism: The author of the data category suggests a vocabulary to be used for values of a given data category, but the metadata modeller decides if and how this vocabulary will be integrated into the modelled schema.
     334
      335There are basically two options for how the vocabulary can be integrated into the schema.
    336336One approach is to explicitly enumerate all the values from the vocabulary.
    337 Within CMD this has been done in the component for language-codes\furl{http://catalog.clarin.eu/ds/ComponentRegistry/?item=clarin.eu:cr1:c_1271859438110}. This method allows to strictly validate given metadata field, however there is clearly a limit to this approach in terms of a) size of the vocabulary\footnote{e.g. \xne{ISO-639} contains 7.679 items (language codes) adding some 2MB to each schema referencing it}, b) completeness -- most of the vocabularies cannot be seen as closed, i.e. they represent only a partial enumeration just providing a recommended label for an entity,
      337Within CMD this has been done in the component for language-codes\furl{http://catalog.clarin.eu/ds/ComponentRegistry/?item=clarin.eu:cr1:c_1271859438110}. This method allows strict validation of the given metadata field; however, there is clearly a limit to this approach in terms of a) the size of the vocabulary\footnote{e.g. \xne{ISO-639} contains 7,679 items (language codes), adding some 2MB to each schema referencing it}, b) completeness -- most of the vocabularies cannot be seen as closed, i.e. they represent only a partial enumeration just providing a recommended label for an entity,
    338338and c) stability or change rate -- even the supposedly fixed list of language-codes \xne{ISO-639-*} undergoes regular changes -- it is being updated semi-annually, with entries being added, deleted, merged and split.\furl{http://www-01.sil.org/iso639-3/changes.asp}
    339339
    340340The other ``soft'' alternative is to convey the information about data category and vocabulary in the schema as an annotation, either in an \code{<xs:appinfo>} element or by some attribute in a dedicated namespace. This method is already being employed in the Component Registry, indicating the data category of a generated element with the \code{@dcr:datcat} attribute.
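By way of illustration, such an annotated element declaration in a generated XSD could look as follows. The \code{@dcr:datcat} attribute is the existing convention; the \code{@vocab} attribute is only a hypothetical counterpart for the vocabulary reference, and the attribute values are placeholders:

\lstset{language=XML}
\begin{lstlisting}[caption=Sketch of a schema-level annotation linking an element to a data category and (hypothetically) to a vocabulary]
<xs:element name="LanguageName" type="xs:string"
            dcr:datcat="[PID of the data category]"
            vocab="[URI of the recommended vocabulary]"/>
\end{lstlisting}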
    341341
    342 Once the data category and vocabulary reference end up in the specification of the CMD profile and the derived XSD, the information can finally be used by client applications (like metadata editor)\footnote{Note though, that this is not a standard mechanism but rather a convention. The client application must implement it in order to be able to make use of it.}. The tool
      342Once the data category and vocabulary reference end up in the specification of the CMD profile and the derived XSD, the information can finally be used by client applications (like a metadata editor)\footnote{Note though that this is not a standard mechanism but rather a convention. The client application must implement it in order to be able to make use of it.}. The tool
    343343can use the reference to the data category to fetch explanations (semantic information) and translations from ISOcat, and it can access the autocomplete/search interface of the Vocabulary Service to offer the user suggestions from the recommended vocabulary (cf. figure \ref{fig:concept_linking}).
    344344
    345 The drawback of this variant is, that we gave up the validation. This
     345The drawback of this variant is that we gave up the validation. This
    346346isn't a problem if the vocabulary is of \code{@type=open}, e.g. \concept{organisation names}, but
    347 it is when the value domain is closed, e.g. \concept{languageId}. In the latter case,
     347it is when the value domain is closed, e.g. \concept{languageID}. In the latter case,
    348348the XSD generation could support both modes: a lax (smaller) version which
    349349doesn't contain the closed vocabulary as an enumeration and leaves it to
    350 the tool, and a strict version which does contain the vocabulary as an
      350the tool, and a strict version which does contain the vocabulary as an
    351351enumeration. Probably the latter should stay the default, but the client application could
    352352request the lax version leading to smaller and quicker XSD validation
    353353inside the tool.
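The difference between the two generated versions would lie only in the derived simple type; a sketch with illustrative type names and a truncated value set:

\lstset{language=XML}
\begin{lstlisting}[caption=Sketch of strict vs. lax value domain definitions for a closed vocabulary]
<!-- strict: the closed vocabulary materialized as an enumeration -->
<xs:simpleType name="languageID-strict">
   <xs:restriction base="xs:string">
      <xs:enumeration value="deu"/>
      <xs:enumeration value="eng"/>
      <!-- ... one entry per item in the vocabulary ... -->
   </xs:restriction>
</xs:simpleType>

<!-- lax: only the pattern; the vocabulary lookup is left to the tool -->
<xs:simpleType name="languageID-lax">
   <xs:restriction base="xs:string">
      <xs:pattern value="[a-z]{3}"/>
   </xs:restriction>
</xs:simpleType>
\end{lstlisting}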
    354354
    355 %However for the presumably default (and recommended) scenario, where the modeller wants to use the information from the data category, the \xne{Component Editor} could offer to take over the data type and the vocabulary reference from the linked DC specification.
     355%However, for the presumably default (and recommended) scenario, where the modeller wants to use the information from the data category, the \xne{Component Editor} could offer to take over the data type and the vocabulary reference from the linked DC specification.
    356356
    357357
    358358%%%%%%%%%%%%%%%%%
    359 \section{Other aspects of the infrastructure}
    360 While this work concentrates solely on the metadata, it is important to acknowledge that it is only one aspect of the infrastructure and its actual purpose -- the availability of resources. To announce and describe the resources by metadata is a necessary first step. However it is of little value, if the resources themselves are not accessible. We want to briefly mention at least two other important aspects: content repositories for storing the resources and federated content search for searching in the resources.
     359\section{Other Aspects of the Infrastructure}
      360While this work concentrates solely on the metadata, it is important to acknowledge that it is only one aspect of the infrastructure and its actual purpose -- the availability of resources. To announce and describe the resources by metadata is a necessary first step. However, it is of little value if the resources themselves are not accessible. We want to briefly mention at least two other important aspects: content repositories for storing the resources and federated content search for searching in the resources.
    361361
    362362\subsubsection{CLARIN Centres}
     
    367367\end{quotation}
    368368
    369 CLARIN imposes a number of criteria, that each centre needs to fulfill to become a CLARIN Centre\furl{http://www.clarin.eu/node/3767}\cite{CE-2013-0095}.
     369CLARIN imposes a number of criteria that each centre needs to fulfill to become a CLARIN Centre\furl{http://www.clarin.eu/node/3767} \cite{CE-2013-0095}.
    370370CLARIN also maintains a central registry, the \xne{Centre Registry}\furl{https://centerregistry-clarin.esc.rzg.mpg.de/}, holding structured information about every centre, meant as the primary entry point into the CLARIN network of centres.
    371371
    372 One core service of such centres are the content repositories, systems meant for long-term preservation and online provision of research data and resources. A number of centres have been identified that provide Depositing Services\furl{http://clarin.eu/3773}, i.e. allow third parties researchers (not just the home users) to store research data.
      372Among the core services of such centres are the content repositories, systems meant for long-term preservation and online provision of research data and resources. A number of centres have been identified that provide Depositing Services\furl{http://clarin.eu/3773}, i.e. they allow third-party researchers (not just the home users) to store research data.
    373373
    374374\begin{comment}
     
    394394\subsubsection{Federated Content Search}
    395395
    396 Another aspect of the availability of resources is, that while metadata can be harvested and indexed locally in one repository, this is not possible with the content itself, both due to the size of the data, but mainly due to legal obligations (licenses, copyright), restricting the access to and availability of the resources. CLARIN's answer to this problem is the task force \emph{Federated Content Search}\furl{http://www.clarin.eu/fcs} \cite{stehouwer2012fcs} aiming at establishing an architecture allowing to search simultaneously (via an aggregator) across a number of resources hosted by different content providers via a harmonized interface adhering to a common protocol. The agreed upon protocol is a compatible extension of the SRU/CQL protocol developed and endorsed by the Library of Congress as the XML- (and web)based successor of the Z39.50 \cite{Lynch1991}.
      396Another aspect of the availability of resources is that while metadata can be harvested and indexed locally in one repository, this is not possible with the content itself, due both to the size of the data and, mainly, to legal obligations (licenses, copyright) restricting the access to and availability of the resources. CLARIN's answer to this problem is the task force \emph{Federated Content Search}\furl{http://www.clarin.eu/fcs} \cite{stehouwer2012fcs} aiming at establishing an architecture allowing to search simultaneously (via an aggregator) across a number of resources hosted by different content providers via a harmonized interface adhering to a common protocol. The agreed-upon protocol is a compatible extension of the SRU/CQL protocol developed and endorsed by the Library of Congress as the XML- (and web-)based successor of Z39.50 \cite{Lynch1991}.
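In practice, such a query is sent as a standard SRU \code{searchRetrieve} request carrying a CQL query; a minimal sketch (the endpoint host is hypothetical, the parameters are standard SRU 1.2):

\begin{lstlisting}[caption=Sketch of an SRU/CQL request against a (hypothetical) FCS endpoint]
http://repository.example.org/sru?operation=searchRetrieve
       &version=1.2&query=Elephant&maximumRecords=10
\end{lstlisting}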
    397397
    398398Note that in practice the line between metadata and content data is not so clear -- usually there is a need to filter by metadata even when searching in content. Therefore, most content search engines also feature some kind of metadata filter. Thus, it seems reasonable to harmonize the search protocol and query language for metadata and content. This proposition is further elaborated on in \ref{cql}.
     
    400400\section{Summary}
    401401
    402 In this chapter we presented individual parts of the infrastructure, next to the core registries: ISOcat Data Category Registry, Component Registry and Relation Registry, that this work directly builds upon, a number of other services and application forming the CLARIN ecosystem were briefly introduced. A separate consideration was dedicated to the issue of controlled vocabularies together with a related module the Vocabulary Alignment Service (and its implementation OpenSKOS) that allows to manage vocabularies and use them in client application. Finally a few other aspects of the infrastructure, that are equally important, however not pertaining to the metadata level, were briefly tackled.
    403 
      402In this chapter, we presented the individual parts of the infrastructure: next to the core registries that this work directly builds upon (the ISOcat Data Category Registry, the Component Registry and the Relation Registry), a number of other services and applications forming the CLARIN ecosystem were briefly introduced. A separate consideration was dedicated to the issue of controlled vocabularies, together with a related module, the Vocabulary Alignment Service (and its implementation OpenSKOS), that allows managing vocabularies and using them in client applications. Finally, a few other aspects of the infrastructure that are equally important, however not pertaining to the metadata level, were briefly tackled.
     403
  • SMC4LRT/chapters/Introduction.tex

    r3776 r4117  
    44%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    55
    6 \section{Motivation / problem statement}
     6\section{Motivation / Problem Statement}
    77
     88While in the Digital Libraries community a consolidation already took place and global federated networks of digital library repositories are set up, in the field of Language Resources and Technology the landscape is still scattered, even though it can meanwhile look back at a decade of standardization and integration efforts. (Chapter \ref{ch:data} analyses the disparity in the data domain.)
    99
    10 This situation has been identified by the community and numerous standardization initiatives had been undertaken. The process has gained a new momentum thanks to large framework programmes introduced by the European Commission aimed at fostering the development of common large-scale international research infrastructures. One key player in this development is the project CLARIN (see section \ref{def:CLARIN}). The main objective of this initiative is to make language resources and technologies (LRT) more easily available to scholars by means of a common harmonized architecture. One core pillar of this architecture is the \emph{Component Metadata Infrastructure} (cf. \ref{def:CMDI}) -- a distributed system consisting of multiple interconnected modules aimed at creating and providing metadata for LRT in a coherent harmonized way.
      10This situation has been identified by the community and numerous standardization initiatives have been undertaken. The process has gained new momentum thanks to large framework programmes introduced by the European Commission aimed at fostering the development of common large-scale international research infrastructures. One key player in this development is the project CLARIN (see section \ref{def:CLARIN}). The main objective of this initiative is to make language resources and technologies (LRT) more easily available to scholars by means of a common harmonized architecture. One core pillar of this architecture is the \emph{Component Metadata Infrastructure} (CMDI, cf. \ref{def:CMDI}) -- a distributed system consisting of multiple interconnected modules aimed at creating and providing metadata for LRT in a coherent harmonized way.
    1111
    12 This work discusses one module within the Component Metadata Infrastructure -- the \emph{Semantic Mapping Component} -- dedicated to overcome or at least ease the semantic interoperability problem stemming from the heterogeneity of the resource descriptions, without the reductionist approach of imposing one common description schema for all resources.
     12This work discusses one module within the Component Metadata Infrastructure -- the \emph{Semantic Mapping Component} (SMC) -- dedicated to overcome or at least ease the semantic interoperability problem stemming from the heterogeneity of the resource descriptions, without the reductionist approach of imposing one common description schema for all resources.
    1313
    1414\section{Main Goal}
     
    4040Semantic interoperability has been one of the main concerns addressed by the CMDI and appropriate provisions were weaved into the underlying meta-model as well as all the modules of the infrastructure. The task of the crosswalk service -- the primary part of the SMC module -- is to collect the relevant information maintained in the registries of the infrastructure and process it to generate mappings, i.e. \emph{crosswalks} between fields in heterogeneous metadata schemas that can serve as basis for concept-based search.
    4141
    42 Thus, the goal is not primarily to produce the crosswalks but rather to develop the service serving existing ones.
     42Thus, the goal is not primarily to define new crosswalks but rather to develop a service serving existing ones.
    4343
    4444\subsubsection*{Concept-based query expansion}
     
    4848\paragraph{Example}
    4949Confronted with a user query searching in the notorious \concept{dublincore:title} the query has to be \emph{expanded} to
    50 all the semantically near fields (\emph{concept cluster}), that are however labelled (or even structured) differently in other schemas like:
      50all the semantically near fields (\emph{concept cluster}) that are, however, labelled (or even structured) differently in other schemas, like:
    5151
    5252\begin{quote}
     
    5454\end{quote}
    5555
    56 The expansion cannot be solved by simple string matching, as there are other fields labeled with the same (sub)strings but with different semantics, that shouldn't be considered:
     56The expansion cannot be solved by simple string matching, as there are other fields labelled with the same (sub)strings but with different semantics that shouldn't be considered:
    5757
    5858\begin{quote}
     
    6262\subsubsection*{Semantic interpretation}
    6363
    64 The problem of different labels for semantically similar or even identical entities is even more so virulent on the level of individual values in the fields of the instance data. A number of metadata fields (like \concept{organization} or \concept{resource type})  have a constrained value domain that yet cannot be explicitly exhaustively enumerated. This leads to a chronically inconsistent use of labels for referring to entities. (As the instance data shows, some organizations are referred to by more than 20 different labels.) Thus, one goal of this work is to propose a mechanism to map (string) values in selected fields to entities defined in corresponding vocabularies.
      64The problem of different labels for semantically similar or even identical entities is even more virulent on the level of individual values in the fields of the instance data. A number of metadata fields (like \concept{organization} or \concept{resource type}) have a constrained value domain that nevertheless cannot be explicitly and exhaustively enumerated. This leads to a chronically inconsistent use of labels for referring to entities. (As the evidence in the metadata records collected within CMDI shows, some organizations are referred to by more than 20 different labels.) Thus, one goal of this work is to propose a mechanism to map (string) values in selected fields to entities defined in corresponding vocabularies.
    6565
    6666\subsubsection*{Ontology-driven data exploration}
     
    7575
    7676\section{Method}
    77 We start with examining the existing data and with the description of the existing infrastructure in which this work is embedded.
      77We start by examining the existing data and by describing the existing infrastructure in which this work is embedded.
    7878
    7979Building on this groundwork, in accordance with the first subgoal, we lay out the design of the service for handling crosswalks and concept-based query expansion. We describe the workflow, the central methods and the role of the module relative to other parts of the infrastructure.
     
    9090Once the dataset is expressed in RDF, it can be exposed via a semantic web application and published as another nucleus of \emph{Linked Open Data} in the global \emph{Web Of Data}.
    9191
    92 A separate evaluation of the usability of the proposed semantic search solution is indicated, examining the user interaction with and display of the relevant additional information in the user search interface, however this issue can only be tackled marginally and will have to be outsourced into future work.
      92A separate evaluation of the usability of the proposed semantic search solution is indicated, examining the user interaction with and display of the relevant additional information in the user search interface; however, this issue can only be tackled marginally and will have to be deferred to future work.
    9393
    9494\section{Expected Results}
     
    108108\end{description}
    109109
    110 \section{Structure of the work}
     110\section{Structure of the Work}
    111111The work starts with examining the state of the art work in the two fields  language resources and technology and semantic web technologies in chapter \ref{ch:lit}. In chapter \ref{ch:data} we analyze the situation in the data domain of LRT metadata and in chapter \ref{ch:infra} we discuss the individual software components of the infrastructure underlying this work.
    112112
     
    116116The results are discussed in chapter \ref{ch:results}. Finally, in chapter \ref{ch:conclusions} we summarize the findings of the work and lay out where it could develop in the future.
    117117
    118 The auxiliary material accompanying the work is found in the appendix. After the administrative chapter \ref{ch:def} explaining the abbreviations and formatting conventions used throughout this work, full specifications of the used data models (\ref{ch:data-model-ref} and data samples (\ref{ch:cmd-sample}) are listed for reference, as well as the developer and user documentation for the technical solution of this work, the SMC module (\ref{ch:smc-docs}).
     118The auxiliary material accompanying the work is found in the appendix. After the administrative chapter \ref{ch:def} explaining the abbreviations and formatting conventions used throughout this work, full specifications of the used data models (\ref{ch:data-model-ref}) and data samples (\ref{ch:cmd-sample}) are listed for reference, as well as the developer and user documentation for the technical solution of this work, the SMC module (\ref{ch:smc-docs}).
    119119
    120120
  • SMC4LRT/chapters/Literature.tex

    r3776 r4117  
    44%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    55
    6 In this chapter we give a short overview of the development of large research infrastructures (with focus on those for language resources and technology), then we examine in more detail the hoist of work (methods and systems) on schema/ontology matching
      6In this chapter, we give a short overview of the development of large research infrastructures (with a focus on those for language resources and technology), then we examine in more detail the host of work (methods and systems) on schema/ontology matching
    77and review Semantic Web principles and technologies.
    88
     
    1717\xne{FLaReNet}\furl{http://www.flarenet.eu/} -- Fostering Language Resources Network -- running 2007 to 2010 concentrated rather on ``community and consensus building'' developing a common vision and mapping the field of LRT via survey.
    1818
    19 \xne{CLARIN}\furl{http://clarin.eu} -- Common Language Resources and Technology Infrastructure -- large research infrastructure providing sustainable access for scholars in the humanities and social sciences to digital language data, and especially its technical core the Component Metadata Infrastructure (CMDI)  -- a comprehensive architecture for harmonized handling of metadata\cite{Broeder2011} --
      19\xne{CLARIN}\furl{http://clarin.eu} -- Common Language Resources and Technology Infrastructure -- a large research infrastructure providing sustainable access for scholars in the humanities and social sciences to digital language data, and especially its technical core, the Component Metadata Infrastructure (CMDI) -- a comprehensive architecture for harmonized handling of metadata \cite{Broeder2011} --
    2020are the primary context of this work, therefore the description of this underlying infrastructure is detailed in separate chapter \ref{ch:infra}.
    2121Both above-mentioned projects can be seen as predecessors to CLARIN, the IMDI metadata model being one starting point for the development of CMDI.
     
     2323More of a sister project is the initiative \xne{DARIAH} -- Digital Research Infrastructure for the Arts and Humanities\furl{http://dariah.eu}. It has a broader scope, but has many personal ties as well as problems and solutions similar to CLARIN's. Therefore, there are efforts to intensify the cooperation between these two research infrastructures for digital humanities.
    2424
    25 \xne{META-SHARE} is another multinational project aiming to build an infrastructure for language resource\cite{Piperidis2012meta}, however focusing more on Human Language Technologies domain.\furl{http://meta-share.eu}
      25\xne{META-SHARE} is another multinational project aiming to build an infrastructure for language resources \cite{Piperidis2012meta}, however focusing more on the Human Language Technologies domain.\furl{http://meta-share.eu}
    2626
    2727\begin{quotation}
     28\noindent
    2829META-NET is designing and implementing META-SHARE, a sustainable network of repositories of language data, tools and related web services documented with high-quality metadata, aggregated in central inventories allowing for uniform search and access to resources. Data and tools can be both open and with restricted access rights, free and for-a-fee.
    2930\end{quotation}
    3031
    31 See \ref{def:META-SHARE} for more details about META-SHARE's catalog and metadata format.
     32See \ref{def:META-SHARE} for more details about META-SHARE's catalogue and metadata format.
    3233
    3334
     
    3637
    3738In a broader view we should also regard the activities in the domain of libraries and information sciences (LIS).
    38 Starting already in 1970's with connecting, exchanging and harmonizing their bibliographic catalogs, libraries were the early adopters and driving force in the field of search federation even before the era of internet (e.g. \xne{Linked Systems Project} \cite{Fenly1988}), the LIS community certainly has a long tradition, wealth of experience and robust solutions with respect to metadata aggregation and harmonization and exploitation.
      39Starting already in the 1970s with connecting, exchanging and harmonizing their bibliographic catalogues, libraries were the early adopters and driving force in the field of search federation even before the era of the internet (e.g. the \xne{Linked Systems Project} \cite{Fenly1988}). The LIS community thus certainly has a long tradition, a wealth of experience and robust solutions with respect to metadata aggregation, harmonization and exploitation.
    3940%, starting collaborative efforts in mid 70s
    4041
     
     4243 The biggest one is \xne{WorldCat}\furl{http://www.worldcat.org/} (totalling 273.7 million records \cite{OCLCAnnualReport2012}), powered by OCLC, a cooperative of over 72,000 libraries worldwide.
    4344
    44 In Europe, multiple recent initiatives have pursuit similar goals of pooling together the immense wealth of information sheltered in the many libraries:
     45In Europe, multiple recent initiatives have pursued similar goals of pooling together the immense wealth of information sheltered in the many libraries:
    4546\xne{The European Library}\furl{http://www.theeuropeanlibrary.org/tel4/} offers a search interface over more than 18 million digital items and almost 120 million bibliographic records from 48 National Libraries and leading European Research Libraries.
    4647
     
    5051Most recently, with \xne{Europeana Cloud}\furl{http://pro.europeana.eu/web/europeana-cloud} (2013 to 2015) another initiative in the realm of \xne{Europeana} has been started, a Best Practice Network, coordinated by The European Library, designed to ``establish a cloud-based system for Europeana and its aggregators, providing new content, new metadata, a new linked storage system, new tools and services for researchers and a new platform - Europeana Research''.
    5152
    52 The related catalogs and formats are described in the section \ref{sec:lib-formats}.
    53 
    54 
    55 \section{Existing crosswalks (services)}
     53The related catalogues and formats are described in the section \ref{sec:lib-formats}.
     54
     55
     56\section{Existing Crosswalks (Services)}
    5657
     5758Crosswalks as lists of equivalent fields from two schemas have been around for a long time, in the world of enterprise systems, e.g. to bridge to legacy systems, as well as in the LIS domain. \cite{Day2002crosswalks} lists a number of mappings between metadata formats, mostly between the Dublin Core and MARC families of formats.\footnote{\url{http://loc.gov/marc/marc2dc.html}, \url{http://www.loc.gov/marc/dccross.html}}
    5859
    59 However, besides being restricted in terms of covered formats, these crosswalks are just static correspondence lists, often just available as documents and only limited coverage of formats. One effort, that comes nearer to our idea of a service delivering crosswalks dynamically is the \xne{Metadata Crosswalk Service}\footnote{\url{http://www.oclc.org/developer/services/metadata-crosswalk-service}, \url{http://www.oclc.org/research/activities/xwalk.html}, (SOAP based)} offered by OCLC as part of \xne{Metadata Schema Transformation Services}\furl{http://www.oclc.org/research/activities/schematrans.html?urlm=160118}
      60However, besides being restricted in terms of covered formats, these crosswalks are just static correspondence lists, often available only as documents. One effort that comes nearer to our idea of a service delivering crosswalks dynamically is the \xne{Metadata Crosswalk Service}\footnote{\url{http://www.oclc.org/developer/services/metadata-crosswalk-service}, \url{http://www.oclc.org/research/activities/xwalk.html}, (SOAP based)} offered by OCLC as part of \xne{Metadata Schema Transformation Services}\furl{http://www.oclc.org/research/activities/schematrans.html?urlm=160118},
    6061
    6162\begin{quotation}
     
    6364\end{quotation}
    6465
    65 Although the website states ``Crosswalk Web Service is now a production system that has been incorporated into OCLC products and services'', the demo service\furl{http://errol.oclc.org/schemaTrans.oclc.org.search} is not accessible. Also, this service only offers crosswalks between formats relevant for the LIS community: \xne{Dublin Core, MARCXML, MARC-2709, MODS}. So, altogether the service does not seem suitable to be used as is for the purposes of this work. But it certainly can serve as inspiration as for the specification of the planned service.
     66Although the website states ``Crosswalk Web Service is now a production system that has been incorporated into OCLC products and services'', the demo service\furl{http://errol.oclc.org/schemaTrans.oclc.org.search} is not accessible. Also, this service only offers crosswalks between formats relevant for the LIS community: \xne{Dublin Core, MARCXML, MARC-2709, MODS}. So, altogether the service does not seem suitable to be used as is for the purposes of this work. But it certainly can serve as inspiration for the specification of the planned service.
    6667
    6768\begin{comment}
     
    7980\label{lit:schema-matching}
    8081
    81 As Shvaiko\cite{shvaiko2012ontology} states ``\emph{Ontology matching} is a solution to the semantic heterogeneity problem. It finds correspondences between semantically related entities of ontologies.''
    82 As such, it provides a very suitable methodical foundation for the problem at hand -- the \emph{semantic mapping}. (In sections \ref{sec:schema-matching-app} and \ref{sec:values2entities} we elaborate on the possible ways to apply these methods to the described problem.)
     82As Shvaiko \cite{shvaiko2012ontology} states ``\emph{Ontology matching} is a solution to the semantic heterogeneity problem. It finds correspondences between semantically related entities of ontologies.''
     83As such, it provides a very suitable methodical foundation for the problem at hand -- the \emph{semantic mapping}. (In sections \ref{sec:schema-matching-app} and \ref{sec:values2entities}, we elaborate on the possible ways to apply these methods to the described problem.)
    8384
    8485There is a plethora of work on methods and technology in the field of \emph{schema and ontology matching} as witnessed by a sizable number of publications providing overviews, surveys and classifications of existing work \cite{Kalfoglou2003, Shvaiko2008, Noy2005_ontologyalignment, Noy2004_semanticintegration, Shvaiko2005_classification} and most recently \cite{shvaiko2012ontology, amrouch2012survey}.
    8586
    86 %Shvaiko and Euzenat provide a summary of the key challenges\cite{Shvaiko2008} as well as a comprehensive survey of approaches for schema and ontology matching based on a proposed new classification of schema-based matching techniques\cite{}.
    87 
    8887Shvaiko and Euzenat also run the web page \url{http://www.ontologymatching.org/} dedicated to this topic and the related OAEI\footnote{Ontology Alignment Evaluation Initiative - \url{http://oaei.ontologymatching.org/}}, an ongoing effort to evaluate alignment tools based on various alignment tasks from different domains.
    8988
    90 Interestingly, \cite{shvaiko2012ontology} somewhat self-critically asks if after years of research``the field of ontology matching [is] still making progress?''.
     89Interestingly, \cite{shvaiko2012ontology} somewhat self-critically asks if after years of research ``the field of ontology matching [is] still making progress?''.
    9190
    9291\subsubsection{Method}
     
    113112
    114113\cite{EhrigSure2004} and \cite{amrouch2012survey} instead introduce \var{ontology mapping} when applying the task to individual entities, in the meaning of a function that ``for each concept (node) in ontology A [tries to] find a corresponding concept
    115 (node), which has the same or similar semantics, in ontology B and vice verse''. In the meaning as result it is ``formal expression describing a semantic relationship between two (or more) concepts belonging to two (or more) different ontologies''.
    116 
    117 \cite{EhrigSure2004} further specify the mapping function as based on a similarity function, that for a pair of entities from two (or more) ontologies computes a ratio indicating the semantic proximity of the two entities.
     114(node), which has the same or similar semantics, in ontology B and vice versa''. In the meaning of a result, it is a ``formal expression describing a semantic relationship between two (or more) concepts belonging to two (or more) different ontologies''.
     115
     116\cite{EhrigSure2004} further specify the mapping function as based on a similarity function that for a pair of entities from two (or more) ontologies computes a ratio indicating the semantic proximity of the two entities.
    118117
    119118\begin{defcap}[!ht]
     
    135134\cite{Algergawy2010} classifies, reviews, and experimentally compares major methods of element similarity measures and their combinations. \cite{shvaiko2012ontology}, comparing a number of recent systems, finds that ``semantic and extensional methods are still rarely employed. In fact, most of the approaches are quite often based only on terminological and structural methods.''
    136135
    137 \cite{Ehrig2006} employs this \var{similarity} function over single entities to derive the notion of \var{ontology similarity} as ``based on similarity of pairs of single entities from the different ontologies''. This is operationalized as some kind of aggregating function\cite{ehrig2004qom}, that combines all similiarity measures (mostly modulated by custom weighting) computed for pairs of single entities again into one value (from the \var{[0,1]} range) expressing the similarity ratio of the two ontologies being compared. (The employment of weights allows to apply machine learning approaches for optimization of the results.)
     136\cite{Ehrig2006} employs this \var{similarity} function over single entities to derive the notion of \var{ontology similarity} as ``based on similarity of pairs of single entities from the different ontologies''. This is operationalized as some kind of aggregating function \cite{ehrig2004qom} that combines all similarity measures (mostly modulated by custom weighting) computed for pairs of single entities again into one value (from the \var{[0,1]} range) expressing the similarity ratio of the two ontologies being compared. (The employment of weights allows to apply machine learning approaches for optimization of the results.)
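The aggregation step can be illustrated with a minimal sketch (hypothetical uniform weights and a toy label-based string similarity from the standard library; the actual systems combine far more sophisticated measures):

```python
from difflib import SequenceMatcher

def label_similarity(a, b):
    """Toy element-level measure: normalized string similarity of two labels."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def ontology_similarity(entities_a, entities_b, weights=None):
    """Aggregate pairwise element similarities into a single value in [0,1]:
    for each entity of A take its best match in B, then a weighted mean."""
    best_scores = [max(label_similarity(a, b) for b in entities_b)
                   for a in entities_a]
    if weights is None:
        weights = [1.0] * len(best_scores)  # uniform weighting (could be learned)
    return sum(w * s for w, s in zip(weights, best_scores)) / sum(weights)

score = ontology_similarity(["title", "creator"], ["Title", "author", "date"])
```

Making the weights explicit, as here, is what opens the door to the machine-learning optimization mentioned above.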
    138137
    139138Thus, \var{ontology similarity} is a much weaker assertion than \var{ontology alignment}; in fact, the computed similarity is interpreted to assert ontology alignment: the aggregated similarity above a defined threshold indicates an alignment.
     
    149148\end{enumerate}
    150149
    151 In  contrast, \cite{jimenez2012large} in their system \xne{LogMap2} reduce the process into just two steps: computation of mapping candidates (maximise recall) and assessment of the candidates (maximize precision), that however correspond  to the steps 2 and 3 of the above procedure and in fact the other steps are implicitly present in the described system.
     150In contrast, \cite{jimenez2012large} in their system \xne{LogMap2} reduce the process to just two steps: computation of mapping candidates (maximize recall) and assessment of the candidates (maximize precision), which, however, correspond to steps 2 and 3 of the above procedure; in fact, the other steps are implicitly present in the described system.
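Caricatured in a few lines, such a two-phase process might look as follows (purely illustrative; the thresholds and the toy string similarity are invented for this sketch, whereas LogMap2 itself relies on lexical indexation and logic-based assessment):

```python
from difflib import SequenceMatcher

def similarity(a, b):
    # toy measure: normalized string similarity of the two labels
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def compute_candidates(source, target, loose=0.5):
    """Phase 1: over-generate mapping candidates (maximize recall)."""
    return [(s, t, similarity(s, t))
            for s in source for t in target if similarity(s, t) >= loose]

def assess(candidates, strict=0.8):
    """Phase 2: keep only the confident candidates (maximize precision)."""
    return [(s, t) for s, t, score in candidates if score >= strict]

candidates = compute_candidates(["organisation", "language"],
                                ["organization", "date"])
final = assess(candidates)
```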
    152151
    153152
     
    155154A number of existing systems for schema/ontology matching/alignment are collected in the above-mentioned overview publications:
    156155
    157 \xne{IF-Map} \cite{kalfoglou2003if}, \xne{QOM} \cite{ehrig2004qom}, \xne{FOAM} \cite{EhrigSure2005}, \xne{Similarity Flooding (SF)} \cite{melnik}, \xne{S-Match} \cite{Giunchiglia2007_semanticmatching}, the \xne{Prompt} tools \cite{Noy2003_theprompt} integrating with \xne{Protégé} or \xne{COMA++} \cite{Aumueller2005}, \xne{Chimaera}. Additionally, \cite{shvaiko2012ontology} lists and evaluates some more recent contributions: \xne{SAMBO, Falcon, RiMOM, ASMOV, Anchor-Flood, AgreementMaker}.
    158 
    159 All of the tools use multiple methods as described in the previous section, exploiting both element as well as structural features and applying some kind of composition or aggregation of the computed atomic measures, to arrive to a alignment assertion.
     156\xne{IF-Map} \cite{kalfoglou2003if}, \xne{QOM} \cite{ehrig2004qom}, \xne{FOAM} \cite{EhrigSure2005}, \xne{Similarity Flooding (SF)} \cite{melnik2002similarity}, \xne{S-Match} \cite{Giunchiglia2007_semanticmatching}, the \xne{Prompt} tools \cite{Noy2003_theprompt} integrating with \xne{Protégé} or \xne{COMA++} \cite{Aumueller2005}, \xne{Chimaera}. Additionally, \cite{shvaiko2012ontology} lists and evaluates some more recent contributions: \xne{SAMBO, Falcon, RiMOM, ASMOV, Anchor-Flood, AgreementMaker}.
     157
     158All of the tools use multiple methods as described in the previous section, exploiting both element and structural features and applying some kind of composition or aggregation of the computed atomic measures to arrive at an alignment assertion.
    160159
    161160Next to OWL, the input format supported by all the systems, some also accept XML Schemas (\xne{COMA++, SF, Cupid, SMatch}),
     
    169168\section{Semantic Web -- Linked Open Data}
    170169
    171 Linked Data paradigm\cite{TimBL2006} for publishing data on the web is increasingly been taken up by data providers across many disciplines \cite{bizer2009linked}. \cite{HeathBizer2011} gives comprehensive overview of the principles of Linked Data with practical examples and current applications.
     170The Linked Data paradigm \cite{TimBL2006} for publishing data on the web is increasingly being taken up by data providers across many disciplines \cite{bizer2009linked}. \cite{HeathBizer2011} gives a comprehensive overview of the principles of Linked Data with practical examples and current applications.
    172171
    173172\subsubsection{Semantic Web - Technical solutions / Server applications}
    174173\label{semweb-tech}
    175174
    176 The provision of the produced semantic resources on the web requires technical solutions to store the RDF triples, query them efficiently via SPARQL\cite{SPARQL2008} and \textit{idealiter} expose them via a web interface to the users.
     175The provision of the produced semantic resources on the web requires technical solutions to store the RDF triples, query them efficiently via SPARQL \cite{SPARQL2008} and \textit{idealiter} expose them via a web interface to the users.
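At its core, such a store has to answer pattern queries against a set of subject-predicate-object statements. The toy matcher below (a stand-in for a real SPARQL engine; the resource names are invented) illustrates the principle of variable binding:

```python
def match(triples, pattern):
    """Return one binding dict per triple matching the pattern;
    tokens starting with '?' are variables (repeated variables are
    not checked for consistency in this sketch)."""
    results = []
    for triple in triples:
        binding = {}
        for part, term in zip(pattern, triple):
            if part.startswith("?"):
                binding[part] = term
            elif part != term:
                break  # constant mismatch: triple does not match
        else:
            results.append(binding)
    return results

triples = [
    ("ex:resource1", "dc:title", "A Grammar of Aari"),
    ("ex:resource1", "dc:creator", "ex:person7"),
    ("ex:resource2", "dc:title", "WALS Online"),
]
# analogous to:  SELECT ?s ?t WHERE { ?s dc:title ?t }
titles = match(triples, ("?s", "dc:title", "?t"))
```

Production triple stores add to this exact-match core the indexing, join processing and persistence that make SPARQL queries over hundreds of millions of triples feasible.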
    177176
    178177Meanwhile, a number of RDF triple store solutions relying on native, DBMS-backed, or hybrid persistence layers are available: open-source solutions like \xne{Jena, Sesame} or \xne{BigData} as well as commercial solutions like \xne{AllegroGraph, OWLIM, Virtuoso}.
    179178
    180 A qualitative and quantitative study\cite{Haslhofer2011europeana}   in the context of Europeana evaluated a number of RDF stores (using the whole Europeana EDM data set = 382,629,063 triples as data load) and came to the conclusion, that ``certain RDF stores, notably OpenLink Virtuoso and 4Store'' can handle the large test dataset.
    181 
    182 \xne{OpenLink Virtuoso Universal Server}\furl{http://virtuoso.openlinksw.com} is hybrid storage solution for a range of data models, including relational data, RDF and XML, and free text documents.\cite{Erling2009Virtuoso, Haslhofer2011europeana}
    183 Virtuoso is used to host many important Linked Data sets, e.g., DBpedia\furl{http://dbpedia.org} \cite{auer2007dbpedia}.
    184 Virtuoso is offered both as commercial and open-source version license models exist.
     179A qualitative and quantitative study \cite{Haslhofer2011europeana} in the context of Europeana evaluated a number of RDF stores (using the whole Europeana EDM data set = 382,629,063 triples as data load) and came to the conclusion that ``certain RDF stores, notably OpenLink Virtuoso and 4Store'' can handle the large test dataset.
     180
     181\xne{OpenLink Virtuoso Universal Server}\furl{http://virtuoso.openlinksw.com} is a hybrid storage solution for a range of data models, including relational data, RDF and XML, and free text documents \cite{Erling2009Virtuoso, Haslhofer2011europeana}.
     182Virtuoso is used to host many important Linked Data sets, e.g. DBpedia\furl{http://dbpedia.org} \cite{auer2007dbpedia}.
     183Virtuoso is offered under both commercial and open-source license models.
    185184
    186185Another solution worth examining is the \xne{Linked Media Framework}\furl{http://code.google.com/p/lmf/} -- ``easy-to-setup server application that bundles together three Apache open source projects to offer some advanced services for linked media management'': publishing legacy data as linked data, semantic search by enriching data with content from the Linked Data Cloud, using SKOS thesaurus for information extraction.
     
    206205There also exists a sizable number of stand-alone solutions (\xne{Ontorama, FOAFnaut, IsaViz, GKB-Editor} and more), though often bound to a specific dataset or data type (\xne{Wordnet, FOAF, Cyc}).
    207206
    208 There is also plenty of general graph visualization tools, that can be adopted for viewing the RDF data as graph, like the traditional graph layouting tool \xne{GraphViz dot}, or more recently \xne{Gephi} \cite{bastian2009gephi}, a stand-alone interactive tool for graph visualization with a number of layouting algorithms and display options. A rather recent generic visualization javascript library \xne{d3}\footnote{\url{http://d3js.org}} % \cite{bostock2011d3} seems especially appealing thanks to its data-driven paradigm, dedicated support for graphs with integrated customizable graph layouting algorithm and -- being pure javascript -- allowing web-based solutions.
     207There are also plenty of general graph visualization tools that can be adapted for viewing the RDF data as a graph, like the traditional graph layouting tool \xne{GraphViz dot}, or more recently \xne{Gephi} \cite{Bastian2009gephi}, a stand-alone interactive tool for graph visualization with a number of layouting algorithms and display options. A rather recent generic visualization javascript library, \xne{d3}\footnote{\url{http://d3js.org}} \cite{bostock2011d3}, seems especially appealing thanks to its data-driven paradigm, dedicated support for graphs with an integrated customizable graph layouting algorithm and -- being pure javascript -- allowing web-based solutions.
    209208
    210209%Most recently a web-based version of this versatile tool has been released\furl{http://protegewiki.stanford.edu/wiki/WebProtege} that supports collaborative ontology development
    211210
    212 The solutions are rather sparse when it comes to more advanced visualizations, beyond the simple one to one display of the data model graph as a visual graph, especially the visualization of ontology mapping and alignment. Besides \xne{OLA} \cite{euzenat2004ola}, \xne{PromptViz} \cite{Noy2003_theprompt} and \xne{CogZ} \cite{falconer2009cogz} we would like to point out one solution developed at the IFS of the Technical University in Vienna \cite{lanzenberger2006alviz}, \xne{AlViz}  -- a tool which visually supports semi-automatic alignment of ontologies. It is implemented as a ``multiple-view plug-in for Protege using J-Trees and Graphs. Based on similarity measures of an ontology matching algorithm AlViz helps to assess and optimize the alignment results.'' It applies visual clues like colouring to indicate the computed similarity of concepts between two ontologies and clustering for reducing the complexity of the displayed datasets (cf. figure \ref{fig:alviz}). Unfortunately, the development of this very promising research prototype seems to have stalled, the last available version being from 2009.
     211The solutions are rather sparse when it comes to more advanced visualizations, beyond the simple one-to-one display of the data model graph as a visual graph, especially the visualization of ontology mapping and alignment. Besides \xne{OLA} \cite{euzenat2004ola}, \xne{PromptViz} \cite{Noy2003_theprompt} and \xne{CogZ} \cite{falconer2009cogz} we would like to point out one solution developed at the IFS of the Technical University in Vienna \cite{lanzenberger2006alviz}, \xne{AlViz}, a tool that visually supports semi-automatic alignment of ontologies. It is implemented as a ``multiple-view plug-in for Protege using J-Trees and Graphs. Based on similarity measures of an ontology matching algorithm AlViz helps to assess and optimize the alignment results.'' It applies visual cues like colouring to indicate the computed similarity of concepts between two ontologies and clustering for reducing the complexity of the displayed datasets (cf. figure \ref{fig:alviz}). Unfortunately, the development of this very promising research prototype seems to have stalled; the last available version is from 2009.
    213212
    214213\begin{figure*}
     
    228227\subsubsection{Linguistic ontologies}
    229228
    230 One prominent instance of a linguistic ontology is \xne{General Ontology for Linguistic Description} or GOLD\cite{Farrar2003}\furl{http://linguistics-ontology.org},
    231 that ``gives a formalized account of the most basic categories and relations (the "atoms") used in the scientific description of human language, attempting to codify the general knowledge of the field. The motivation is to`` facilite automated reasoning over linguistic data and help establish the basic concepts through which intelligent search can be carried out''.
     229One prominent instance of a linguistic ontology is \xne{General Ontology for Linguistic Description} or GOLD \cite{Farrar2003}\furl{http://linguistics-ontology.org},
     230that ``gives a formalized account of the most basic categories and relations (the `atoms') used in the scientific description of human language, attempting to codify the general knowledge of the field''. The motivation is to ``facilitate automated reasoning over linguistic data and help establish the basic concepts through which intelligent search can be carried out''.
    232231
    233232In line with the aspiration ``to be compatible with the general goals of the Semantic Web'', the dataset is provided via a web application as well as a dump in OWL format\furl{http://linguistics-ontology.org/gold-2010.owl} \cite{GOLD2010}.
    234233
    235234
    236 Founded in 1934, SIL International\furl{http://www.sil.org/about-sil} (originally known as the Summer Institute of Linguistics, Inc) is a leader in the identification and documentation of the world's languages. Results of this research are published in Ethnologue: Languages of the World\furl{http://www.ethnologue.com/} \cite{grimes2000ethnologue}, a comprehensive catalog of the world's nearly 7,000 living languages. SIL also maintains Language \& Culture Archives a large collection of all kinds resources in the ethnolinguistic domain \furl{http://www.sil.org/resources/language-culture-archives}.
     235Founded in 1934, SIL International\furl{http://www.sil.org/about-sil} (originally known as the Summer Institute of Linguistics, Inc.) is a leader in the identification and documentation of the world's languages. Results of this research are published in Ethnologue: Languages of the World\furl{http://www.ethnologue.com/} \cite{grimes2000ethnologue}, a comprehensive catalogue of the world's nearly 7,000 living languages. SIL also maintains the Language \& Culture Archives, a large collection of all kinds of resources in the ethnolinguistic domain \furl{http://www.sil.org/resources/language-culture-archives}.
    237236
    238237 World Atlas of Language Structures (WALS) \furl{http://WALS.info} \cite{wals2011}
    239 is ``a large database of structural (phonological, grammatical, lexical) properties of languages gathered from descriptive materials (such as reference grammars) ''. First appeared 2005, current online version published in 2011 provides a compendium of detailed expert definitions of individual linguistic features, accompanied by a sophisticated web interface integrating the information on linguistic features with their occurrence in the world languages and their geographical distribution.
    240 
    241 Simons \cite{Simons2003developing} developed a Semantic Interpretation Language (SIL) that is used to define the meaning of the elements and attributes in an XML markup schema in terms of abstract concepts defined in a formal semantic schema
     238is ``a large database of structural (phonological, grammatical, lexical) properties of languages gathered from descriptive materials (such as reference grammars)''. First published in 2005, the current online version (2011) provides a compendium of detailed expert definitions of individual linguistic features, accompanied by a sophisticated web interface integrating the information on linguistic features with their occurrence in the world's languages and their geographical distribution.
     239
     240Simons \cite{Simons2003developing} developed a Semantic Interpretation Language (SIL) that is used to define the meaning of the elements and attributes in an XML markup schema in terms of abstract concepts defined in a formal semantic schema.
    242241Extending on this work, Simons et al. \cite{Simons2004semantics} propose a method for mapping linguistic descriptions in plain XML into semantically rich RDF/OWL, employing the GOLD ontology as the target semantic schema.
    243242
    244 These ontologies can be used by (``ontologized'') Lexicons refer to them to describe linguistic properties of the Lexical Entries, as opposed to linking to Domain Ontologies to anchor Senses/Meanings.
    245 
    246 
    247 Work on Semantic Interpretation Language as well as the GOLD ontology can be seen as conceptual predecessor of the Data Category Registry a ISO-standardized procedure for defining and standardizing ``widely accepted linguistic concepts'', that is at the core of the CLARIN's metadata infrastructure (cf. \ref{def:DCR}).
    248 Although not exactly an ontology in the common sense of
    249 Although (by design) this registry does not contain any relations between concepts,
    250 the central entities are concepts and not lexical items, thus it can be seen as a proto-ontology.
     243These ontologies can be used by (``ontologized'') lexicons, which refer to them to describe linguistic properties of the Lexical Entries, as opposed to linking to Domain Ontologies to anchor Senses/Meanings.
     244
     245
     246Work on Semantic Interpretation Language as well as the GOLD ontology can be seen as a conceptual predecessor of the Data Category Registry, an ISO-standardized procedure for defining and standardizing ``widely accepted linguistic concepts'' that is at the core of CLARIN's metadata infrastructure (cf. \ref{def:DCR}).
     247Although not exactly an ontology in the common sense --
     248given that this registry (by design) does not contain any relations between concepts --
     249the central entities are concepts and not lexical items, thus it can be seen as a semantic resource.
    251250Another indication of the heritage is the fact that concepts of the GOLD ontology were migrated into ISOcat (495 items) in 2010.
    252251
    253252Notice that although this work is concerned with language resources, it operates primarily on the metadata level; thus the overlap with linguistic ontologies codifying the discipline-specific linguistic terminology is rather marginal (perhaps on the level of description of specific linguistic aspects of given resources).
    254253
    255 \subsubsection{Lexicalised ontologies,``ontologized'' lexicons}
     254\subsubsection{Lexicalised ontologies, ``ontologized'' lexicons}
    256255
    257256The other type of relation between ontologies and linguistics or language are lexicalised ontologies. Hirst \cite{Hirst2009} elaborates on the differences between ontology and lexicon and the possibility to reuse lexicons for development of ontologies.
     
    259258In a number of works, Buitelaar, McCrae et al. \cite{Buitelaar2009, buitelaar2010ontology, McCrae2010c, buitelaar2011ontology, Mccrae2012interchanging} argue for ``associating linguistic information with ontologies'' or ``ontology lexicalisation'' and draw attention to lexical and linguistic issues in knowledge representation in general. This basic idea lies behind the series of proposed models \xne{LingInfo}, \xne{LexOnto}, \xne{LexInfo} and, most recently, \xne{lemon}, aimed at allowing complex lexical information for such ontologies and at describing the relationship between the lexicon and the ontology.
    260259The most recent in this line, \xne{lemon} or \xne{lexicon model for ontologies}, defines ``a formal model for the proper representation of the continuum between: i) ontology semantics; ii) terminology that is used to convey this in natural
    261 language; and iii) linguistic information on these terms and their constituent lexical units'', in essence enabling the creation of a lexicon for a given ontology, adopting the principle of ``semantics by reference", no complex semantic in-
    262 formation needs to be stated in the lexicon.
    263 a clear separation of the lexical layer and the ontological layer.
    264 
    265 Lemon builds on existing work, next to the LexInfo and LIR ontology-lexicon models.
    266 and in particular on global standards: W3C standard: SKOS (Simple Knowledge Organization System) \cite{SKOS2009} and ISO standards the Lexical Markup Framework (ISO 24613:2008 \cite{ISO24613:2008}) and
    267 and Specification of Data Categories, Data Category Registry (ISO 12620:2009 \cite{ISO12620:2009})
    268 
    269 Lexical Markup Framework LMF \cite{Francopoulo2006LMF, ISO24613:2008} defines a metamodel for representing data in lexical databases used with monolingual and multilingual computer applications, provides a RDF serialization (?!?!).
     260language; and iii) linguistic information on these terms and their constituent lexical units''.
     261In essence, \xne{lemon} enables the creation of a lexicon for a given ontology, adopting the principle of ``semantics by reference''. No complex semantic information needs to be stated in the lexicon, ensuring (or at least fostering) a clear separation of the lexical layer and the ontological layer.
     262
     263Lemon builds on existing work, next to the LexInfo and LIR ontology-lexicon models, and in particular on global standards: the W3C standard SKOS (Simple Knowledge Organization System) \cite{SKOS2009} and the ISO standards Lexical Markup Framework (ISO 24613:2008 \cite{ISO24613:2008}) and Specification of Data Categories / Data Category Registry (ISO 12620:2009 \cite{ISO12620:2009}).
     264
     265The Lexical Markup Framework (LMF) \cite{Francopoulo2006LMF, ISO24613:2008} defines a metamodel for representing data in lexical databases used with monolingual and multilingual computer applications. LMF also specifies an RDF serialization.
    270266
    271267An overview of current developments in application of the linked data paradigm for linguistic data collections was given at the  workshop Linked Data in Linguistics\furl{http://ldl2012.lod2.eu/} 2012 \cite{ldl2012}.
    272268
    273269
    274 The primary motivation for linguistic ontologies like \xne{lemon} are the tasks ontology-based information extraction, ontology learning and population from text, where the entities are often referred to by non-nominal word forms and with ambiguous semantics. Given, that the discussed collection contains mainly highly structured data referencing entities in their nominal form, linguistic ontologies are not directly relevant for this work.
     270The primary motivation for linguistic ontologies like \xne{lemon} are the tasks ontology-based information extraction, ontology learning and population from text, where the entities are often referred to by non-nominal word forms and with ambiguous semantics. Given that the discussed collection contains mainly highly structured data referencing entities in their nominal form, linguistic ontologies are not directly relevant for this work.
    275271
    276272
    277273\section{Summary}
    278 This chapter concentrated on the current affairs/developments regarding the infrastructures for Language Resources and Technology and on the other hand gave an overview of the state of the art regarding methods to be applied in this work: Semantic Web Technologies, Ontology Mapping and Ontology Visualization.
     274This chapter concentrated, on the one hand, on the current developments regarding the infrastructures for Language Resources and Technology and, on the other hand, gave an overview of the state of the art regarding methods to be applied in this work: Semantic Web Technologies, Ontology Mapping and Ontology Visualization.
  • SMC4LRT/chapters/Results.tex

    r3776 r4117  
    66In the subsequent two sections, we explore a few specific aspects of the CMD data domain -- regarding the usage of the data categories (\ref{sec:explore-datcats}) and the integration of existing formats (\ref{sec:explore-formats}). While these topics are not directly results of this work, the presented analyses are. They were made possible by the technical solution of this work, yield a valuable test case for the usefulness of the work and are an indispensable prerequisite for the necessary coordination and curation work being carried out by the CMDI community.
    77
    8 \section{Current status of the infrastructure}
     8\section{Current Status of the Infrastructure}
    99Before we get to the results of this work,  we briefly summarize the current state of affairs within the CLARIN infrastructure at large to help contextualize the actual results.
    1010
    11 \subsection{CMDI - services}
     11\subsection{CMDI -- Services}
    1212The main services of the infrastructure have been in stable production for the last two years.
    1313The Relation Registry is operational as an early prototype.
    1414Three instances of \xne{OpenSKOS} are running, one of them being hosted by \xne{ACDH}.
    1515
    16 \subsection{CMDI - data}
     16\subsection{CMDI -- Data}
    1717More than 130 profiles are defined. (See table \ref{table:dev_profiles} for more details about profiles.)
    1818The official CLARIN harvester\furl{http://catalog.clarin.eu/oai-harvester/} collects data from 69 providers on a daily basis.
    1919The collection amounts to over 550.000 records in more than 60 distinct profiles.
    2020
    21 \subsection{ACDH - the home of SMC}
    22        
     21\subsection{ACDH -- The Home of SMC}
     22\label{acdh}   
    2323Within CLARIN-AT a new centre has been brought to life, the Austrian Centre for Digital Humanities, with the mission to foster the digital research paradigm in the humanities. It is designed to provide depositing and publishing services to the DH community, as well as infrastructural services that are part of the CLARIN Metadata Infrastructure. SMC is one of these services provided by this centre.
    2424Figure \ref{fig:acdh_context} sketches the broader context of \xne{ACDH} and its different roles.
    2525
    2626%%%%%%%%%%%%%%%%
    27 \section {Technical solution}
     27\section{Technical Solution}
    2828With this work we delivered a module embedded in a larger metadata infrastructure, aimed at supporting the semantic interoperability across the heterogeneous data in this infrastructure. The module consists of multiple interrelated components. The technical specification of the module can be found in chapter \ref{ch:design}. A prototypical implementation has been developed for the three main parts of the system. The code of this implementation is maintained in the central CMDI code repository\footnote {\url{http://svn.clarin.eu/SMC}}.
    2929
     
    3131\\
    3232
    33 \url{http://clarin.arz.oeaw.ac.at/smc} (soon: \url{http://acdh.ac.at/smc})
    34 
    35 
    36 \subsection{SMC - crosswalks service}
    37 the crosswalk service as a REST web service
    38 
    39 exposes an interface that provides mappings between search indexes as defined in \ref{sec:cx}
    40 
    41 This interface is available as part of the smc application:
    42 
    43 \url{http://clarin.arz.oeaw.ac.at/smc/cx}
    44 
    45 \subsection{SMC - as a module within Metadata Repository}
    46 The SMC is also integrated as module with the Metadata Repository enabling \emph{semantic search} over the joint metadata domain.
    47 
    48 \url{http://clarin.arz.oeaw.ac.at/mdrepo/} (module not integrated yet )
    49 
    50 \subsection{SMC Browser -- advanced interactive user interface}
     33\url{http://clarin.oeaw.ac.at/smc/}
     34
     35
     36\subsection{SMC -- Crosswalks Service}
      37As a REST web service, the crosswalk service exposes an interface that provides mappings between search indexes as defined in \ref{sec:cx}. This interface is available via the wrapping smc application:
     38
     39\url{http://clarin.oeaw.ac.at/smc/cx}
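To illustrate the underlying idea (not the actual service API, whose parameters are not reproduced here), a minimal sketch of a concept-based crosswalk lookup; all index names and data category identifiers below are hypothetical:

```python
# Minimal sketch of a concept-based crosswalk lookup.
# All index names and data category identifiers are hypothetical
# illustrations, not the actual SMC data.

# Each search index (schema.field) is bound to a data category.
INDEX_TO_DATCAT = {
    "olac.title": "DC-2545",
    "teiHeader.titleStmt.title": "DC-2545",
    "MetaShare.resourceName": "DC-2544",
    "cmd.GeneralInfo.name": "DC-2544",
}

def crosswalk(index_name: str) -> list[str]:
    """Return all index names bound to the same data category."""
    datcat = INDEX_TO_DATCAT.get(index_name)
    if datcat is None:
        return [index_name]
    return sorted(i for i, d in INDEX_TO_DATCAT.items() if d == datcat)

# A query against one index is expanded to all equivalent indexes:
print(crosswalk("olac.title"))
```

A search application can thus expand a query on one field to all fields sharing the same concept, raising recall without the user knowing the individual schemas.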
     40
     41\subsection{SMC -- as a Module within Metadata Repository}
      42The SMC will also be integrated as a module into the Metadata Repository, enabling \emph{semantic search} over the joint metadata domain.
     43
     44\url{http://clarin.oeaw.ac.at/mdrepo/}
     45
     46\subsection{SMC Browser -- Advanced Interactive User Interface}
    5147
    5248SMC Browser is an advanced web-based visualization application to explore the complex dataset of the \xne{Component Metadata Infrastructure} by visualizing its structure as an interactive graph. In particular, it enables the metadata modeller to examine the reuse of components or DCs in different profiles. The graph is accompanied by numerical statistics about the dataset as a whole and about individual items (profiles, components, data categories), a set of example results and user documentation. Details about design and implementation can be found in \ref{smc-browser}. The publicly available instance is maintained under:
    5349
    54 \url{http://clarin.arz.oeaw.ac.at/smc-browser}
     50\url{http://clarin.oeaw.ac.at/smc-browser}
    5551
    5652\begin{figure*}
     
    6359
    6460
    65 %%%%%%%%%%%%%%%555
    66 \section{Exploring the CMD data -- SMC reports}
     61%%%%%%%%%%%%%%%
     62\section{Exploring the CMD Data -- SMC Reports}
    6763SMC reports is a (growing) set of documents analyzing specific phenomena in the CMD data domain, created making extensive use of the visual and numerical output of the \xne{SMC Browser}. In this section, we deliver a few examples of these analyses. A complete, up-to-date listing is maintained on the SMC website:
    6864
    69 \url{http://clarin.aac.ac.at/smc/reports}
    70 
    71 \subsection{Usage of data categories}
     65\url{http://clarin.oeaw.ac.at/smc-browser/docs/reports.html}
     66
     67\subsection{Usage of Data Categories}
    7268\label{sec:explore-datcats}
    7369At the core of the whole SMC (and CMDI) are the data categories as basic semantic building blocks or anchors.
     
    9086\includegraphics[width=1\textwidth]{images/SMC-export_language_custom_v2c.pdf}
    9187\end{center}
    92 \caption{The four main \concept{Language} data categories and in which profiles they are being used}
      88\caption{The four main \concept{Language} data categories and the profiles they are used in}
    9389\label{fig:language_datcats}
    9490\end{figure*}
     
    10399Again, the main DC \concept{resourceName\#DC-2544}, used in 74 profiles, together with the semantically near \concept{resourceTitle\#DC-2545}, used in 69 profiles, offers good coverage of the available data.
    104100
    105 Some of the DCs referenced by \code{Name} elements are \concept{author\#DC-4115}), \concept{contact full name\#DC-2454}), \concept{dcterms:Contributor}, \concept{project name\#DC-2536}), \concept{web service name\#DC-4160}) and \concept{language name\#DC-2484}). This implies, that a na\"{i}ve search in a \texttt{Name} element would match semantically very heterogeneous fields and only applying the semantic information provided by the DCs and/or the context of the element (the enclosing components) allows to disambiguate the meaning of the values.
      101Some of the DCs referenced by \code{Name} elements are \concept{author\#DC-4115}, \concept{contact full name\#DC-2454}, \concept{dcterms:Contributor}, \concept{project name\#DC-2536}, \concept{web service name\#DC-4160} and \concept{language name\#DC-2484}. This implies that a na\"{i}ve search in a \texttt{Name} element would match semantically very heterogeneous fields; only applying the semantic information provided by the DCs and/or the context of the element (the enclosing components) makes it possible to disambiguate the meaning of the values.
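This disambiguation argument can be sketched with a toy example (the records are invented for illustration; only the DC identifiers follow the ones named above): a naïve search over \texttt{Name} elements returns semantically unrelated values, while a concept-based restriction does not.

```python
# Toy illustration of naive vs. concept-based search over Name elements.
# The records are invented; the DC identifiers follow the ones named above.
records = [
    {"element": "Name", "datcat": "DC-4115", "value": "Jane Doe"},   # author
    {"element": "Name", "datcat": "DC-2536", "value": "Project X"},  # project name
    {"element": "Name", "datcat": "DC-2484", "value": "German"},     # language name
]

# A naive search matches every Name element, regardless of meaning:
naive = [r["value"] for r in records if r["element"] == "Name"]

# A concept-based search keeps only fields bound to the intended category:
language_names = [r["value"] for r in records if r["datcat"] == "DC-2484"]

print(naive)           # three semantically heterogeneous values
print(language_names)  # only the language name
```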
    106102
    107103%\subsection{Resource type}
     
    109105% \subsection{Subject, Genre, Topic}
    110106
    111 \subsection{Integration of existing formats}
     107\subsection{Integration of Existing Formats}
    112108\label{sec:explore-formats}
    113109
     
    118114\subsubsection{dublincore / OLAC}
    119115\label{reports:OLAC}
    120 Very widely used (because) simple format
      116A very widely used (because simple) metadata format (cf. \ref{def:OLAC}).
    122118%\ref{info:olac-records}
    123119
    124120Here the problem of proliferation seems especially virulent. Table \ref{table:dcterms-profiles} lists all the profiles modelling dcterms.
    125 As all these profiles are link to the corresponding dublincore data categories, this does not pose a major problem on the exploitation side, however the cluttering of the component registry with structurally identical or almost identical profiles needs to be questioned within the community.
      121As all these profiles link to the corresponding dublincore data categories, this does not pose a major problem on the exploitation side; however, the cluttering of the component registry with structurally identical or almost identical profiles needs to be questioned within the community.
    126122
    127123\begin{figure*}[!ht]
     
    135131
    136132\begin{table}
    137 \caption{Profiles modelling dublincore terms}
     133\caption{Profiles Modelling Dublincore Terms}
    138134\label{table:dcterms-profiles}
    139135%  \begin{tabular}{ |l | l | l | r | r | }
     
    154150\end{table}
    155151
    156 Additionally, there is a number of profiles with concept links to dublincore terms,
     152Additionally, there is a number of profiles with concept links to dublincore terms.
    157153Some use all of the dublincore elements or terms as one component within a larger profile,
    158154one example being the \xne{data} profile created by the Czech initiative LINDAT, which models the minimal obligatory set of the META-SHARE \xne{resourceInfo} schema (cf. the subsection about META-SHARE below) combined with a simple dublincore record.
     
    180176\label{results:tei}
    181177TEI is a de-facto standard for encoding any kind of textual resources. It defines a set of elements to annotate individual aspects of the text being encoded. For the purposes of text description / metadata the complex element \code{teiHeader} is foreseen.
    182 TEI does not provide just one fixed schema, but allows for a certain flexibility wrt to elements used and inner structure, allowing to generate custom schemas adopted to projects' needs. \ref{def:tei}.
      178TEI does not provide just one fixed schema, but allows for a certain flexibility regarding the elements used and inner structure, allowing the generation of custom schemas adapted to projects' needs (cf. \ref{def:tei}).
    183179Thus there is also not just one fixed \xne{teiHeader}.
    184180
    185181The widespread use of TEI for encoding textual resources  brings about a strong interest of multiple research teams of the CLARIN community to integrate TEI with CMDI. There was a first attempt already in 2010, modelling the recommended \xne{teiHeader}\furl{http://www.tei-c.org/release/doc/tei-p5-doc/en/html/HD.html\#HD7}, encoding \xne{fileDesc} and \xne{profileDesc} components, leaving out \xne{encodingDesc} and \xne{revisionDesc}. The leaf elements were bound to the most prominent data categories, making it a mixture of both dublincore and isocat.
    186182
    187 The large research project \xne{Deutsches Textarchiv}\furl{http://deutschestextarchiv.de/}\cite{Geyken2011deutsches}, digitizing a hoist of historical german texts from the period 1650 - 1900 also uses TEI to encode the material and consequently the teiHeader to hold the metadata information. Part of the project is also to integrate the data and metadata with the CLARIN infrastructure, meaning CMD records need to be generated for the resources. For this the team generated a completely new profile (as yet private) closely modelling the version of the teiHeader\furl{http://www.deutschestextarchiv.de/doku/basisformat_header} used in the project.
    188 Regarding the question, why another teiHeader-based profile was generated not reusing the existing one, according to a personal note by a member of the project team and author of the profile, Axel Herold\cite{Herold2013} the profile was custom made for this particular project and it seemed undesirable to create a generalised TEI header profile.
    189 
    190 \xne{Nederlab} is another large-scale project aiming processing historic Dutch newspaper articles into a platform for search and analysis, starting 2013 in Netherlands\furl{http://www.nederlab.nl}. Within this project, the metadata is also encoded in a \concept{teiHeader} and the data shall be integrated within CLARIN. Here, another set of CMD profiles was created, however reusing existing components.
      183The large research project \xne{Deutsches Textarchiv}\furl{http://deutschestextarchiv.de/} \cite{Geyken2011deutsches}, digitizing a host of historical German texts from the period 1650--1900, also uses TEI to encode the material and consequently the teiHeader to hold the metadata information. Part of the project is also to integrate the data and metadata with the CLARIN infrastructure, meaning CMD records need to be generated for the resources. For this, the team generated a completely new profile (as yet private) closely modelling the version of the teiHeader\furl{http://www.deutschestextarchiv.de/doku/basisformat_header} used in the project.
      184Regarding the question why another teiHeader-based profile was generated instead of reusing the existing one: according to a personal note by Axel Herold \cite{Herold2013}, a member of the project team and author of the profile, the profile was custom-made for this particular project and it seemed undesirable to create a generalised TEI header profile.
     185
      186\xne{Nederlab} is another large-scale project, started in 2013 in the Netherlands\furl{http://www.nederlab.nl}, aiming at processing historic Dutch newspaper articles into a platform for search and analysis. Within this project, the metadata is also encoded in a \concept{teiHeader} and the data shall be integrated within CLARIN. Here, another set of CMD profiles was created, this time reusing existing components.
    191187As seen in figure \ref{fig:teiHeader_DBNL}, components \xne{fileDesc} and \xne{profileDesc} were reused, while the components \xne{encodingDesc} and \xne{revisionDesc}, left out in the original profile, were added.
    192188
    193 Another approach was applied within the context of other CLARIN-NL projects\cite{Menzo2013-05tei}. Based on an ODD-file, a data category for every element of the teiHeader (135 datcats) was generated. In a subsequent step, an enriched schema was generated, that remodells the original teiHeader-schema, but with the individual elements being annotated with the new data categories (\code{dcr:datcat}-attribute). This schema is now maintained in the SCHEMAcat (cf. \ref{ch:infra}). The next step would be to create again a new profile, but with all the components and elements in it bound to the corresponding data categories, moving the semantic linking into the relation registry, where appropriate relations could be defined between the data categories derived from TEI and the \xne{isocat} and/or \xne{dublincore} DCs.
      189Another approach was applied within the context of other CLARIN-NL projects \cite{Menzo2013-05tei}. Based on an ODD-file, a data category for every element of the teiHeader (135 datcats) was generated. In a subsequent step, an enriched schema was generated that remodels the original teiHeader-schema, but with the individual elements being annotated with the new data categories (\code{dcr:datcat}-attribute). This schema is now maintained in the SCHEMAcat (cf. \ref{ch:infra}). The next step would be to create again a new profile, but with all the components and elements in it bound to the corresponding data categories, moving the semantic linking into the relation registry, where appropriate relations could be defined between the data categories derived from TEI and the \xne{isocat} and/or \xne{dublincore} DCs.
    195191This yields a more complex, but also more systematic and flexible setup, with a clean interface to the semantic space of TEI and the possibility to map the TEI elements (via their data categories) to multiple and/or different data categories according to the specific needs of a project or research question.
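The enrichment step described above can be roughly sketched as follows; this is a simplification, and the element names and category URIs are hypothetical, not the actual ODD-derived data categories:

```python
# Rough sketch of the enrichment step: annotate element declarations
# in a schema with dcr:datcat attributes pointing to data categories.
# Element names and category URIs are hypothetical illustrations.
import xml.etree.ElementTree as ET

DCR_NS = "http://www.isocat.org/ns/dcr"

# Hypothetical mapping from teiHeader element names to data category URIs.
ELEMENT_TO_DATCAT = {
    "titleStmt": "http://example.org/datcat/DC-0001",
    "publicationStmt": "http://example.org/datcat/DC-0002",
}

def enrich(schema_xml: str) -> str:
    """Add a dcr:datcat attribute to every mapped element declaration."""
    ET.register_namespace("dcr", DCR_NS)
    root = ET.fromstring(schema_xml)
    for el in root.iter():
        datcat = ELEMENT_TO_DATCAT.get(el.get("name", ""))
        if datcat:
            el.set(f"{{{DCR_NS}}}datcat", datcat)
    return ET.tostring(root, encoding="unicode")

schema = '<schema><element name="titleStmt"/><element name="other"/></schema>'
print(enrich(schema))
```

Unmapped elements (here \code{other}) pass through unchanged, so the enriched schema stays structurally identical to the original.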
    195191
     
    230226%In cooperation between metadata teams from CLARIN and META-SHARE
    231227
    232 The original META-SHARE schema actually accomodates four models for different resource types. Consequently, the model has been expressed as 4 CMD profiles each for a distinct resource type however all four sharing most of the components, as can be seen in figure \ref{fig:resource_info_5}. The biggest single profile is currently the remodelled maximum schema from the META-SHARE project for describing corpora, with 117 distinct components and 337 elements. When expanded, this translates to 419 components and 1587 elements. However, many of the components and elements are optional (and conditional), thus a specific instance will never use all the possible elements.
    233 
    234 In a parallel effort, LINDAT, the czech national infrastructure initiative engaged in both CLARIN and META-SHARE, created a CMD profile (\concept{data}\furl{http://catalog.clarin.eu/ds/ComponentRegistry/?item=clarin.eu:cr1:p_1349361150622}) modelling the minimal obligatory set of META-SHARE \concept{resourceInfo}), however combined with a simple dublincore record.
    235 This way, the information gets partly duplicated, but with the advantage, that a minimal information is conveyed in the widely understood format, retaining the expressivity of the feature-rich schema.
    236 
    237 The expression of the META-SHARE schema in CMD allows a direct comparison of the two different approaches taken in the two projects: a metamodel allowing to generate custom profiles with shared semantics vs. the more traditional way of trying to generate one schema to fit in all the information. It shows nicely the trade-off: many custom schemas with the risk of proliferation and problems with semantic interoperability or one very large with the risk of overwhelming the user and still not being able to capture all specific informations.
      228The original META-SHARE schema actually accommodates four models for different resource types. Consequently, the model has been expressed as four CMD profiles, each for a distinct resource type, however, all four sharing most of the components, as can be seen in figure \ref{fig:resource_info_5}. The biggest single profile is currently the remodelled maximum schema from the META-SHARE project for describing corpora, with 117 distinct components and 337 elements. When expanded, this translates to 419 components and 1587 elements. However, many of the components and elements are optional (and conditional), thus a specific instance will never use all the possible elements.
     229
      230In a parallel effort, LINDAT, the Czech national infrastructure initiative engaged in both CLARIN and META-SHARE, created a CMD profile (\concept{data}\furl{http://catalog.clarin.eu/ds/ComponentRegistry/?item=clarin.eu:cr1:p_1349361150622}) modelling the minimal obligatory set of META-SHARE \concept{resourceInfo}, however, combined with a simple dublincore record.
      231This way, the information gets partly duplicated, but with the advantage that minimal information is conveyed in the widely understood format, while retaining the expressivity of the feature-rich schema.
     232
      233The expression of the META-SHARE schema in CMD allows a direct comparison of the two different approaches taken in the two projects: a metamodel allowing to generate custom profiles with shared semantics vs. the more traditional way of trying to generate one schema to fit in all the information. It nicely shows the trade-off: many custom schemas with the risk of proliferation and problems with semantic interoperability, or one very large schema with the risk of overwhelming the user and still not being able to capture all specific information.
    238234
    239235\begin{figure*}
     
    249245\includegraphics[width=0.75\textwidth]{images/LINDAT-profile-data.png}
    250246\end{center}
    251 \caption{profile by LINDAT combining META-SHARE \xne{resourceInfo} component with dublincore elements }
     247\caption{Profile by LINDAT combining META-SHARE \xne{resourceInfo} component with dublincore elements }
    252248\label{fig:META-SHARE-LINDAT}
    253249\end{figure*}
     
    274270\includegraphics[height=0.95\textheight]{images/resourceInfoBIG.png}
    275271\end{center}
    276 \caption{the META-SHARE based profile for describing corpora}
     272\caption{The META-SHARE based profile for describing corpora}
    277273\label{fig:META-SHARE-BIG}
    278274\end{figure*}
     
    280276
    281277%%%%%%%%%%%%%%%%%%%%%%%
    282 \subsection{SMC cloud}
     278\subsection{SMC Cloud}
    283279\label{sec:smc-cloud}
    284 As a latest, still experimental, addition, SMC browser provides a special type of graph, that displays only profiles. The links between them reflect the reuse of components and data categories (i.e. how many components or data categories do the linked pairs of profiles share), indicating the degree of similarity or semantic proximity. Figure \ref{fig:SMC_cloud} depicts one possible output of the graph
      280As the latest, still experimental, addition, SMC Browser provides a special type of graph that displays only profiles. The links between them reflect the reuse of components and data categories (i.e. how many components or data categories the linked pairs of profiles share), indicating the degree of similarity or semantic proximity. Figure \ref{fig:SMC_cloud} depicts one possible output of the graph
    285281covering a large part of the defined profiles. It shows nicely the clusters of strongly related profiles in contrast to the greater distances between more loosely connected profiles.
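The similarity measure underlying such a profile graph can be sketched as follows -- a simplified, hypothetical computation counting shared components per profile pair (the actual SMC Browser derives its weights from the full dataset, including data categories):

```python
# Sketch of the profile-similarity idea behind the profile graph:
# the weight of an edge = number of components two profiles share.
# Profile and component names are invented for illustration.
from itertools import combinations

profiles = {
    "TextCorpus":   {"GeneralInfo", "Access", "Annotation"},
    "SpeechCorpus": {"GeneralInfo", "Access", "Recording"},
    "Lexicon":      {"GeneralInfo", "LexicalEntry"},
}

# Keep an edge only for pairs that actually share components.
edges = {
    (a, b): len(profiles[a] & profiles[b])
    for a, b in combinations(sorted(profiles), 2)
    if profiles[a] & profiles[b]
}
print(edges)
```

Profiles connected by heavy edges then cluster together in the layout, while loosely connected ones drift apart, as observed in figure \ref{fig:SMC_cloud}.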
    286282
     
    293289\end{figure*}
    294290
    295 \begin{comment}
    296 \section{Evaluation}
    297 \label{evaluation}
    298 
    299 Sample Queries:
    300 
    301 candidate Categories:
    302 ResourceType, Format
    303 Genre, Topic
    304 Project, Institution, Person, Publisher
    305 
    306 
    307 
    308 \subsection{Use Cases}
    309 
    310 \begin{itemize}
    311 
    312 \item MD Search employing Semantic Mapping
    313 \item MD Search employing Fuzzy Search
    314 \item Visualize impact of given mapping in terms of covered dataset (number of matched records).
    315 \end{itemize}
    316 
    317 
    318 \section{Discussion}
    319 
    320 \subsection{Semantic Mapping in Metadata vs. Content/Annotation}
    321 AF + DCR + RR
    322 
    323 \end{comment}
    324 
    325291\section{Summary}
    326292In this final chapter, we presented the results: on the one hand, the technical solution of the module \xne{Semantic Mapping Component}; on the other hand, we spent a good part of the chapter on commented analyses of the processed dataset that were made possible by \xne{SMC Browser}, an interactive visualization tool developed as part of this work for the exploration of the schema-level data of the discussed collection. As such, the analyses can be seen as an evaluation, a proof of the concept and usefulness of the presented work.
  • SMC4LRT/chapters/abstract_de.tex

    r3665 r4117  
    11\chapter*{Kurzfassung}
    22
    3 Diese Arbeit ist eingebettet in eine große internationale Forschungsinfrastruktur-Iinitiave, die zur Aufgabe hat,
    4 einfachen, stabilen, harmonisierten Zugang zu Sprachressourcen und Technologien in Europa zu ermöglichen, der \emph{Common Language Resource and Technology Infrastructure} oder CLARIN. Das technische Herzstück dieser Unternehmung is die \emph{Component Metadata Infrastructure}, ein verteiltes System, das harmonisiertes koherentes Erstellen und Verbreiten von Metadaten für Sprachressourcen ermöglicht. Das Ergebnis dieser Arbeit, das Modul \emph{Semantic Mapping Component}, wurde als Bestandteil des Systems erdacht, um unter Ausnutzung der in die Infrastruktur eingebauten Mechanismen das Problem der semantischen Interoperabilität zu überwinden, das sich aus der Heterogenität der Metatadaten-Formate ergibt.
    5 
    6 Das eigentliche Ziel, der Nutzen dieser Arbeit -- im Einklang mit der generellen Idee des ganzen Unterfangens -- war die \emph{Verbesserung der Suchmöglichkeiten} in der großen heterogenen Sammlung von Metadaten. Diese Aufgabe wurde in zwei separaten sich ergänzenden Herangehensweisen angegangen: a) Entwurf und Entwicklung eines Dienstes (service) zur Bereitstellung von \emph{crosswalks} (Entsprechungen zwischen Feldern in unterschiedlichen Metadaten-Formaten) auf der Basis von wohldefinierten Konzepten und die Anwendung dieser \emph{crosswalks} bei Suchszenarien um die Trefferquote zu erhöhen. b) die integrative Kraft des \emph{Linked Open Data} Paradigma anerkennend, Modellierung der Domändaten als eine \emph{Semantic Web} Ressource, um die Nutzung von semantischen Technologien auf dem Datensatz zu ermöglichen.
      3Das eigentliche Ziel, der Nutzen dieser Arbeit war die \emph{Verbesserung der Suchmöglichkeiten} in einer großen heterogenen Sammlung von Metadaten. Diese Aufgabe wurde in zwei separaten, sich ergänzenden Herangehensweisen angegangen: a) Entwurf und Entwicklung eines Dienstes (service) zur Bereitstellung von \emph{crosswalks} (Entsprechungen zwischen Feldern in unterschiedlichen Metadaten-Formaten) auf der Basis von wohldefinierten Konzepten und die Anwendung dieser \emph{crosswalks} in Suchszenarien, um die Trefferquote zu erhöhen; b) die integrative Kraft des \emph{Linked Open Data}-Paradigmas anerkennend, Modellierung der Domänendaten als eine \emph{Semantic Web}-Ressource, um die Nutzung von semantischen Technologien auf dem Datensatz zu ermöglichen.
    74
    85Entsprechend den zwei Herangehensweisen lieferte die Arbeit auch zwei Hauptergebnisse: a) die Spezifikation eines Moduls fÃŒr \emph{konzept-basierte Suche} zusammen mit dem zugrundeliegenden Dienst \emph{crosswalk service}, begleitet von einer Testimplementierung; b) Spezifikation der Modellierung der Ausgangsdaten im RDF Format, womit die Grundlage geschaffen ist, die Daten als \emph{Linked Open Data} bereitzustellen.
    96
    107Teilweise als Nebenprodukt wurde auch die Anwendung \emph{SMC Browser} entwickelt -- ein interaktives Visualisierungswerkzeug zur Erschließung der Schema-Ebene der Datensammlung. Mit Hilfe dieses Werkzeugs konnte eine Reihe von tiefergehenden Analysen der Daten erstellt werden, die direkt von der Forschergemeinschaft zur Erschließung und Redaktion der komplexen Daten genutzt werden. Somit können die Anwendung und die Analyseberichte als ein wertvoller Beitrag für die Forschergemeinschaft angesehen werden.
     8
      9Diese Arbeit ist eingebettet in eine große internationale Forschungsinfrastrukturinitiative, die zur Aufgabe hat,
      10einfachen, stabilen, harmonisierten Zugang zu Sprachressourcen und Technologien in Europa zu ermöglichen, der \emph{Common Language Resource and Technology Infrastructure} oder CLARIN. Das technische Herzstück dieser Unternehmung ist die \emph{Component Metadata Infrastructure}, ein verteiltes System, das harmonisiertes, kohärentes Erstellen und Verbreiten von Metadaten für Sprachressourcen ermöglicht. Das Ergebnis dieser Arbeit, das Modul \emph{Semantic Mapping Component}, wurde als Bestandteil des Systems erdacht, um unter Ausnutzung der in die Infrastruktur eingebauten Mechanismen das Problem der semantischen Interoperabilität zu überwinden, das sich aus der Heterogenität der Metadaten-Formate ergibt.
  • SMC4LRT/chapters/abstract_en.tex

    r3665 r4117  
    11\chapter*{Abstract}
    22
    3 
    4 This work is embedded in the context of a large research infrastructure initiative aimed at providing easy, stable, harmonized access to language resources and technology (LRT) in Europe, the \emph{Common Language Resource and Technology Infrastructure} or CLARIN. A core technical pillar of this initiative is the \emph{Component Metadata Infrastructure}, a distributed system for creating and providing metadata for LRT in a coherent harmonized way. The outcome of this work, the \emph{Semantic Mapping Component}, was conceived as one module within the infrastructure dedicated to overcome the semantic interoperability problem stemming from the heterogeneity of the resource descriptions, by harnessing the mechanisms of the semantic layer built-in into the core of the infrastructure.
    5 
    6 The ultimate objective of this work -- in line with the overall mission of the whole initiative -- was to \emph{enhance search functionality} over the large heterogeneous collection of resource descriptions. This objective was pursued in two separate, complementary approaches: a) design a service delivering crosswalks (i.e. equivalences between fields in disparate metadata formats) based on well-defined concepts and apply this concept-based crosswalks in search scenarios to enhance recall. b) acknowledging the integrative power of the \emph{Linked Open Data} paradigm, express the domain data as a \emph{Semantic Web} resource, to enable the application of semantic technologies on the dataset.
      3The ultimate objective of this work was to \emph{enhance search functionality} over a large heterogeneous collection of resource descriptions. This objective was pursued in two separate, complementary approaches: a) design a service delivering crosswalks (i.e. equivalences between fields in disparate metadata formats) based on well-defined concepts and apply these concept-based crosswalks in search scenarios to enhance recall; b) acknowledging the integrative power of the \emph{Linked Open Data} paradigm, express the domain data as a \emph{Semantic Web} resource, to enable the application of semantic technologies on the dataset.
    74
    85In parallel with the two approaches, the work delivered two main results: a) the \emph{specification} of the module for \emph{concept-based search} together with the underlying \emph{crosswalks service} accompanied by a proof-of-concept implementation. And b) the blueprint for expressing the original dataset in RDF format, effectively laying a foundation for providing this dataset as \emph{Linked Open Data}.
     
    107Partly as by-product, the application \emph{SMC browser} was developed -- an interactive visualization tool to explore the dataset on the schema level. This tool provided means to generate a number of advanced analyses of the data, directly used by the community for exploration and curation of the complex dataset.  As such, the tool and the reports can be considered a valuable contribution to the community.
    118
      9This work is embedded in the context of a large research infrastructure initiative aimed at providing easy, stable, harmonized access to language resources and technology (LRT) in Europe, the \emph{Common Language Resource and Technology Infrastructure} or CLARIN. A core technical pillar of this initiative is the \emph{Component Metadata Infrastructure}, a distributed system for creating and providing metadata for LRT in a coherent, harmonized way. The outcome of this work, the \emph{Semantic Mapping Component}, was conceived as one module within the infrastructure dedicated to overcoming the semantic interoperability problem stemming from the heterogeneity of the resource descriptions, by harnessing the mechanisms of the semantic layer built into the core of the infrastructure.
     10
  • SMC4LRT/chapters/acknowledgements.tex

    r3776 r4117  
    11\chapter*{Acknowledgements}
    22
    3 I would like to thank all the colleagues from my institute and from the CLARIN community, for the support, the fruitful discussions and helpful feedback, especially Menzo Windhouwer, Daan Broeder, Dieter Van Uytvanck, Marc Kemps-Snijders and Hennie Brugman. \\
    4 And to all my dear one, for the extra portion of patience I demanded from them
     3I would like to thank all the colleagues from the institute and from the CLARIN community for the support, the fruitful discussions and helpful feedback, especially Menzo Windhouwer, Daan Broeder, Dieter Van Uytvanck, Marc Kemps-Snijders and Hennie Brugman. \\
     4And all my dear ones, for the extra portion of patience I demanded from them.
    55\\
    6  \\
    7 With love to em.
     6
     7\hfill with love to em
  • SMC4LRT/chapters/appendix.tex

    r3776 r4117  
    66\chapter{Data model reference}
    77\label{ch:data-model-ref}
    8 In the following complete data models, schemas are listed for reference: The diagram of the data model for data category specification in figure~\ref{fig:DCR_data_model},  \xne{Terms.xsd} -- the XML schema used by the SMC module internally in listing~\ref{lst:terms-schema} (cf. \ref{datamodel-terms}) and the \xne{general-component-schema.xsd}\furl{https://infra.clarin.eu/cmd/general-component-schema.xsd} -- the schema representing the CMD meta model for defining CMD profiles and components in listing~\ref{lst:cmd-schema}. Figure \ref{fig:ref_arch} depicts an abstract reference architecture, that provides a conceptual frame for this work and in figure \ref{fig:acdh_context} an overview of the roles and services of the \xne{ACDH -- Austrian Centre for Digital Humanities} -- the home of SMC -- explicates the concrete current situation regarding the architectural context of SMC.
      8In the following, the complete data models and schemas are listed for reference: the diagram of the data model for data category specification in figure~\ref{fig:DCR_data_model}, \xne{Terms.xsd} -- the XML schema used internally by the SMC module -- in listing~\ref{lst:terms-schema} (cf. \ref{datamodel-terms}) and the \xne{general-component-schema.xsd}\furl{https://infra.clarin.eu/cmd/general-component-schema.xsd} -- the schema representing the CMD meta model for defining CMD profiles and components -- in listing~\ref{lst:cmd-schema}. Figure \ref{fig:ref_arch} depicts an abstract reference architecture that provides a conceptual frame for this work, and figure \ref{fig:acdh_context} gives an overview of the roles and services of the \xne{ACDH -- Austrian Centre for Digital Humanities} -- the home of SMC -- explicating the concrete current situation regarding the architectural context of SMC.
    99
    10 \begin{figure*}[!ht]
     10\input{images/Terms.xsd}
     11
     12\input{images/general-component-schema.xsd}
     13
     14\begin{figure*}
     15\begin{center}
     16\includegraphics[width=1\textwidth]{images/EDC_components_v4.png}
     17\end{center}
     18\caption{Reference Architecture}
     19\label{fig:ref_arch}
     20\end{figure*}
     21
     22\begin{figure*}[p]
    1123\begin{center}
    1224\includegraphics[width=1\textwidth]{images/DCR_data_model.jpg}
     
    1628\end{figure*}
    1729
    18 \input{images/Terms.xsd}
    19 
    20 \input{images/general-component-schema.xsd}
    21 
    22 
    2330\begin{figure*}[!ht]
    2431\begin{center}
    25 \includegraphics[width=1\textwidth]{images/EDC_components_v4.png}
    26 \end{center}
    27 \caption{Reference Architecture}
    28 \label{fig:ref_arch}
    29 \end{figure*}
    30 
    31 \begin{figure*}[!ht]
    32 \begin{center}
    33 \includegraphics[width=1\textheight, angle=90]{images/acdh-diagram_300dpi.png}
     32\includegraphics[width=0.95\textheight, angle=90]{images/acdh-diagram_300dpi.png}
    3433\end{center}
    3534\caption{Austrian Centre for Digital Humanities - the home of SMC - in context}
     
    3736\end{figure*}
    3837
    39 \chapter{CMD -- sample data}
     38\chapter{CMD -- Sample Data}
    4039\label{ch:cmd-sample}
    4140
     
    4544\input{chapters/collection_spec.xml.tex}
    4645
    47 \section{CMD record}
     46\section{CMD Record}
    4847The following listing represents a sample CMD record -- an instance of the \concept{collection} profile listed above.
    4948
     
    5150
    5251
    53 \chapter{SMC -- documentation}
     52\chapter{SMC -- Documentation}
    5453\label{ch:smc-docs}
    5554
    5655\begin{figure*}
    5756\begin{center}
    58 \includegraphics[height=1\textwidth, angle=90]{images/build_init.png}
     57\includegraphics[width=1.1\textheight, angle=90]{images/build_init.png}
    5958\end{center}
    6059\caption{A graphical representation of the dependencies and calls in the main \xne{ant} build file.}
     
    6261\end{figure*}
    6362
    64 \section{Documentation of smc-xsl}
     63\section{Developer Documentation}
    6564\label{sec:smc-xsl-docs}
    66 \todoin{generate and reference XSLT-documentation}
    6765
    68 \section{SMC Browser user documentation}
     66A developer documentation of the code and the system is included in the source repository
     67
     68\noindent
     69\url{https://svn.clarin.eu/SMC/trunk/SMC/docs}
     70
     71\noindent
     72A short introduction can be found online as part of the application:
     73
     74\noindent
     75\url{http://clarin.oeaw.ac.at/smc/docs/devdocs.html}
     76
     77\section{SMC Browser User Documentation}
    6978\label{sec:smc-browser-userdocs}
    7079
    7180\input{chapters/userdocs_cleaned}
    7281
    73 \section {Sample SMC graphs}
     82\clearpage
     83\section {Sample SMC Graphs}
    7484\label{sec:smc-graphs}
    7585
     
    8191\label{fig:cmd-dep-dotgraph}
    8292\end{figure*}
     93
     94\begin{figure*}[h]
     95\begin{center}
     96\includegraphics[width=1\textwidth]{images/SMC-export_sample.png}
     97\end{center}
     98\caption{A sample output from the SMC Browser showing a number of frequently used data categories and the clusters of profiles using them.}
     99\label{fig:smc-sample}
     100\end{figure*}
     101
    83102
    84103
  • SMC4LRT/chapters/userdocs_cleaned.tex

    r3776 r4117  
    22Explore the \DUroletitlereference{Component Metadata Framework}
    33
    4 In \emph{CMD}, metadata schemas are defined by profiles, that are constructed out of reusable components  - collections
     4In \emph{CMD}, metadata schemas are defined by profiles that are constructed out of reusable components -- collections
    55of metadata fields. The components can contain other components, and they can be reused in multiple profiles.
    66Furthermore, every CMD element (metadata field) refers via a PID to a data category to indicate unambiguously how the content of the field in a metadata description should
     
    1212SMC Browser visualizes this graph structure in an interactive fashion. You can have a look at the \href{examples.html}{examples} for inspiration.
    1313
    14 It is implemented on top of wonderful js-library \href{https://github.com/mbostock/d3}{d3}, the code checked in \href{https://svn.clarin.eu/SMC/trunk/SMC}{clarin-svn} (and needs refactoring). More technical documentation follows soon.
     14It is implemented on top of the wonderful js-library \href{https://github.com/mbostock/d3}{d3}; the code is checked into \href{https://svn.clarin.eu/SMC/trunk/SMC}{clarin-svn} (and needs refactoring). There is also some preliminary \href{devdocs.html}{technical documentation}.
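The profile/component/element/data-category structure described above can be pictured as a typed node-link graph, the shape that d3 force layouts conventionally consume. The following is an illustrative sketch only, not the actual SMC Browser code; the node ids and the data-category PID \xne{DC-2536} are made-up examples:

```javascript
// Hypothetical sketch: CMD schema level as a {nodes, links} graph.
// Node types mirror the four index categories of the SMC Browser.
const nodes = [
  { id: "collection",  type: "profile" },
  { id: "GeneralInfo", type: "component" },
  { id: "title",       type: "element" },
  { id: "DC-2536",     type: "datcat" }   // made-up data-category PID
];

const links = [
  { source: "collection",  target: "GeneralInfo" }, // profile uses component
  { source: "GeneralInfo", target: "title" },       // component contains element
  { source: "title",       target: "DC-2536" }      // element refers to data category
];

// Example traversal: which data categories are reachable from a given node?
function reachableDatcats(start) {
  const adjacency = new Map();               // node id -> [target ids]
  for (const l of links) {
    if (!adjacency.has(l.source)) adjacency.set(l.source, []);
    adjacency.get(l.source).push(l.target);
  }
  const seen = new Set();
  const stack = [start];
  const result = [];
  while (stack.length) {
    const id = stack.pop();
    if (seen.has(id)) continue;
    seen.add(id);
    const node = nodes.find(n => n.id === id);
    if (node && node.type === "datcat") result.push(id);
    for (const target of adjacency.get(id) || []) stack.push(target);
  }
  return result;
}

console.log(reachableDatcats("collection")); // [ 'DC-2536' ]
```

Such a traversal is essentially what lies behind expanding a selected profile down to its used data categories in the graph pane.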
    1515
    1616
     
    5353}
    5454
    55 The User interface is divided into 4 main parts:
     55The user interface is divided into four main parts:
    5656%
    5757\begin{description}
    5858\item[{Index}] \leavevmode
    5959Lists all available Profiles, Components, Elements and used Data Categories
    60 The lists can be filtered (enter search pattern in the input box at the top of the index-pane)
    61 By clicking on individual items, they are added to the \DUroletitlereference{selected nodes} and get rendered in the graph pane
     60The lists can be filtered (enter search pattern in the input box at the top of the index-pane).
     61Clicking on individual items adds them to the \DUroletitlereference{selected nodes}, and they get rendered in the graph pane.
    6262
    6363\item[{Main (Graph)}] \leavevmode
     
    7878}
    7979
    80 Following data sets are distinguished wrt user interaction:
     80The following data sets are distinguished with respect to user interaction:
    8181%
    8282\begin{description}
    8383\item[{all data}] \leavevmode
    8484the full graph with all profiles, components, elements and data categories and links between them.
    85 
    86 Currently this amounts to roughly 2.000 nodes and 3.000 links
     85Currently this amounts to roughly 4.600 nodes and 7.500 links.
    8786
    8887\item[{selected nodes}] \leavevmode
    89 nodes explicitely selected by the user (see below how to \hyperref[select-nodes]{select nodes})
     88nodes explicitly selected by the user (see below how to \hyperref[select-nodes]{select nodes}).
    9089
    9190\item[{data to show}] \leavevmode
     
    145144}
    146145
    147 The navigation pane provides following option to control the rendering of the graph:
     146The navigation pane provides the following options to control the rendering of the graph:
    148147%
    149148\begin{description}
Note: See TracChangeset for help on using the changeset viewer.