\chapter{Underlying Infrastructure}
\label{ch:infra}

In this chapter, we present the infrastructure in which this work is embedded. We start with a short general introduction to the large research infrastructure initiative CLARIN, followed by a close examination of its technical infrastructure for creating and publishing metadata. In section \ref{sec:cv}, we discuss the services for managing controlled vocabularies and their role in the context of metadata creation.
5
6\section{CLARIN}
7\label{def:CLARIN}
8
CLARIN - Common Language Resource and Technology Infrastructure \cite{Varadi2008} - is one of the large research infrastructure initiatives envisaged by the European Strategy Forum on Research Infrastructures (ESFRI) and fostered by the framework programmes of the European Commission. The mission of this project is to provide
10
11\begin{quote}
12\dots easy and sustainable access for scholars in the humanities and social sciences to digital language data (in written, spoken, video or multimodal form) and advanced tools to discover, explore, exploit, annotate, analyse or combine them, independent of where they are located. \cite{CLARIN2013web}
13\end{quote}
14
15\begin{comment}
16To this end CLARIN is in the process of building a networked federation of European data repositories, service centres and centres of expertise, with single sign-on access for all members of the academic community in all participating countries. Tools and data from different centres will be interoperable, so that data collections can be combined and tools from different sources can be chained to perform complex operations to support researchers in their work.
17\end{comment}
18
19The initiative foresees a federated network of centres providing resources and services in a harmonized, interoperable manner to the academic community in all participating countries.
20
In the preparation phase of the project (2008--2011), over 180 institutions from 38 countries participated. In the construction phase, the impetus shifted, as projected, towards the individual national initiatives of this federated endeavour. These are held together by the common principles set up during the preparation phase and by established processes and administrative decision bodies that ensure the flow of information and coherent action at the European level.

Since 2013, CLARIN has also been a \emph{European Research Infrastructure Consortium} (ERIC), a new type of legal entity established within the EU and designed specifically to give research infrastructure initiatives a more stable status and better means to act independently. This is an important step towards ensuring the continuity of the endeavour, a chronic problem of (international) projects.
24
25
26\section{Component Metadata Infrastructure -- CMDI}
27\label{def:CMDI}
28
One core pillar of CLARIN is the \emph{Component Metadata Infrastructure} (CMDI)\furl{http://www.clarin.eu/cmdi} -- a distributed system consisting of multiple interconnected modules aimed at creating and providing metadata for LRT in a coherent, harmonized way. The conceptual foundation of CMDI is the \emph{Component Metadata Framework} \cite{Broeder+2010}, a flexible meta model that supports the creation of metadata schemas and can also accommodate existing schemas (cf. \ref{def:CMD}).
30
31The SMC is part of CMDI and depends on multiple modules on the production side of the infrastructure. Before we describe the SMC and its interaction with these modules in detail in chapter \ref{ch:design}, we introduce the latter and the type of data they provide in \ref{cmdi-registries}:
32
33\begin{itemize}
34\item Data Category Registry
35\item Component Registry
36\item Relation Registry
37\end{itemize}
38
39\noindent
40All these modules are running services that this work shall directly build upon.
41
In contrast, the SMC is meant as a provider for the modules on the exploitation side of the infrastructure, i.e. the search and exploration services used by the end users. These are briefly introduced in \ref{cmdi_exploitation}.
43
44\begin{figure*}[ht]
45\begin{center}
46\includegraphics[width=0.8\textwidth]{images/CMDI_components_old_clean.png}
47\caption{The diagram [from early CLARIN/CMDI presentations] shows individual modules of the CMDI and their interrelations as envisaged in the initial phase of the CLARIN project}
48\label{fig:cmdi-old}
49\end{center}
50\end{figure*}
51
Besides the above-mentioned services with which the SMC interacts directly, some other services and applications are part of the CMDI ecosystem. They are briefly introduced in \ref{cmdi-other} for completeness:
53
54\begin{itemize}
55\item metadata editors
56\item Schema Registry
57\item SchemaParser
58\end{itemize}
59
Finally, the Vocabulary Alignment Service, a module playing a crucial role in metadata curation, is treated separately in section \ref{sec:cv}.
61
62\subsection{CMDI Registries}
63\label{cmdi-registries}
The CMD framework as data model (cf. \ref{def:CMD}), together with the two registries, the \emph{Data Category Registry} \xne{ISOcat} and the \emph{Component Registry}, forms the backbone of the CMD Infrastructure. Figure \ref{fig:cmdi-old} shows the rather na\"{i}ve initial vision of the system; figure \ref{fig:SMC-linkage} contrasts it with the actual linkage between the data in the individual registries. In the following, we briefly explain their role and interaction.
65
66\begin{figure*}[t]
67\includegraphics[width=1\textwidth]{images/SMC_CR-DCR-RR_Linkage_v2}
68\caption{The diagram depicts the links between pieces of data in the individual registries that serve as basis for semantic mapping.}
69\label{fig:SMC-linkage}
70\end{figure*}
71       
72\subsubsection*{Data Category Registry -- ISOcat}
73\label{def:DCR}
74
75The \emph{Data Category Registry} (DCR) is a central registry that enables the community to collectively define and maintain a set of relevant linguistic data categories (DC). The resulting shared controlled vocabulary is the cornerstone for grounding the semantic interpretation within the CMD framework (among others -- DCR is not specific to CMDI, it is meant to be used as common concept registry in many applications).
76
The data model and the procedures of the DCR are defined by the ISO standard \cite{ISO12620:2009}.
\xne{ISOcat}\furl{http://www.isocat.org/} is an implementation of this standard framework developed by the MPI for Psycholinguistics, Nijmegen, in collaboration with the ISO technical committee \xne{ISO TC 37 Terminology and Other Language and Content Resources}.
Besides a web interface for users to browse and manage the data categories, ISOcat provides a REST-style web service that allows applications to retrieve the data category specifications. By default, these are provided in the \xne{Data Category Interchange Format - DCIF}, the standardized XML serialization of the data model, but RDF and HTML representations are available as well.

The core data model defining the data category specification is rather complex. It consists of an administrative, a linguistic and a description part, containing language-specific versions of definitions, value domains, examples and other attributes (cf. \ref{fig:DCR_data_model} for the diagram of the full data model). The following types of data categories are recognized (cf. figure \ref{fig:dc_type}): \var{simple}, \var{complex} (\var{closed, open} or \var{constrained}) and \var{container}. One fundamental aspect to emphasize is that the data categories are assigned a persistent identifier, making them globally and permanently referable.
82
83\begin{figure*}[!ht]
84\begin{center}
85\includegraphics[width=0.7\textwidth]{images/dc_types}
86\end{center}
87\caption{Data Category types \cite{Windhouwer2011}}
88\label{fig:dc_type}
89\end{figure*}
90
91\subsubsection*{Component Registry}
92\label{def:CR}
93
The \emph{Component Registry}\furl{http://catalog.clarin.eu/ds/ComponentRegistry/} (CR) implements the CMD data model (cf. \ref{def:CMD}) and fulfills two functions. First, it is the actual registry that persistently stores and exposes published CMD profiles: a web interface allows users to browse and search the profiles and view their structure, and a REST web service allows client applications to retrieve the profile definitions. Second, the web interface serves as an editor for creating and editing CMD components and profiles.

The primary user of the CR is the metadata modeller, whose task is to create a dedicated metadata profile for a given resource type. She can browse and search the CR for components and profiles that are suitable or come close. The registry already contains many general components, e.g., for contact persons, language and geographical information. In general, many of these can be reused as they are or need only slight adaptation, i.e., having some metadata elements and/or components added or removed. New components can also be created if needed to model the unique aspects of the resources under consideration \cite{Durco2013MTSR}.
97
98Let us reiterate that the actual core provision for semantic interoperability is the requirement that the elements (and as far as possible also components and values) should be linked ``via a PID to exactly one data category (cf. \ref{def:DCR}), thus indicating unambiguously how the content of the field in a metadata description should be interpreted'' \cite{Broeder+2010}, or \emph{to make its semantics explicit}.
99
100As dictated by the CMD model, all components needed for the modelled resource description are compiled into one profile.
Once a profile is created, the Component Registry automatically provides the corresponding XML schema, which can be used as the basis for creating and validating metadata records in the \code{cmd} namespace \code{http://www.clarin.eu/cmd}.
102
103\subsubsection*{Ontological Relations -- Relation Registry}
104
The framework as described so far provides a sound mechanism for binding the semantic interpretation of the metadata descriptions.
However, an additional means is needed to capture information about relations between data categories.
This information was deliberately not included in the DCR, because relations often depend on the context in which they are used, making global agreement infeasible. CMDI proposes a separate module, the \emph{Relation Registry}\label{def:rr} (RR) \cite{Kemps-Snijders+2008}, where arbitrary relations between data categories can be stored and maintained. This design decision is based on the assumption that the relations need to be under the control of the metadata user, whereas the data categories are under the control of the metadata modeller.
108
109The relations don't need to pass a standardization process, but rather separate research teams may define their own sets of relations according to the specific needs of the project. That is not to say that every researcher has to create her own set of relations -- some basic recommended sets will be defined right from the start. But new -- even contradictory -- ones can be created when needed.
110
111There is a prototypical implementation of such a relation registry called \xne{RELcat} being developed at MPI, Nijmegen \cite{Windhouwer2011,SchuurmanWindhouwer2011} that already hosts a few relation sets. There is no user interface to it yet, but it is accessible as a REST-webservice\footnote{sample relation set: \url{http://lux13.mpi.nl/relcat/rest/set/cmdi}}.
This implementation stores the individual relations as RDF triples, allowing typed relations such as equivalence (\code{rel:sameAs}) and subsumption (\code{rel:subClassOf}). The relations are grouped into relation sets that can be used independently. The relation types are deliberately defined in a separate namespace instead of reusing existing ones (\code{skos:exactMatch, owl:sameAs}), with the aim of avoiding overly specific semantics. These relations can be mapped to other appropriate predicates when integrating the relation sets into concrete applications.
113
114\begin{definition}{The relation triples as stored by the Relation Registry}
115\textless \ subjectDatcat \ relationPredicate \  objectDatcat \textgreater
116\end{definition}
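
To make the triple pattern concrete, the following is a minimal sketch of how a single equivalence relation might be serialized in RDF/XML. It is an illustration only: the namespace URI bound to the \code{rel} prefix is a placeholder (the actual RELcat namespace is not given here), the form of the data category PID is assumed, and the choice of the Dublin Core \concept{language} term as the object is an assumption made for the example, not a statement about the content of any existing relation set.

\begin{lstlisting}[language=XML]
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rel="http://example.org/relcat/relations#">
  <!-- the rel namespace above is a placeholder, not the RELcat one -->
  <!-- subject: the data category languageID (DC-2482) -->
  <rdf:Description rdf:about="http://www.isocat.org/datcat/DC-2482">
    <!-- predicate: equivalence; object: assumed example target -->
    <rel:sameAs rdf:resource="http://purl.org/dc/terms/language"/>
  </rdf:Description>
</rdf:RDF>
\end{lstlisting}

An application integrating such a relation set could then map \code{rel:sameAs} to a predicate of its own choice, for instance \code{skos:exactMatch}.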
117
118\subsection{Further Parts of the Infrastructure}
119\label{cmdi-other}
120
121\subsubsection*{Schema Registry}
122
123SCHEMAcat\furl{http://lux13.mpi.nl/schemacat/site/index.html} is a registry for schemas of all kinds (not just the CMD-based, in fact not even just XML-based) semantically annotated with data categories.
124\begin{quotation}
125RELcat and SCHEMAcat will provide the means to harvest and specify this information in the form of relationships and allow
126(search) algorithms to traverse the semantic graph thus made explicit \cite{SchuurmanWindhouwer2011}.
127\end{quotation}
128
129\subsubsection*{Schema Parser}
Schema Parser is a service developed at the Meertens Institute, Amsterdam, that processes XML Schemas to generate all possible paths in the instance data. It is used primarily as an auxiliary service to the search engine developed at the same institute, presented in \ref{cmdi_exploitation}.
131
132\subsubsection*{Metadata editors}
133\label{md-editors}
134
Metadata creation, i.e. the authoring of actual metadata records, is indisputably the fundamental task in the whole system.
Though they do not interact directly with the SMC, metadata editors, i.e. the tools that human metadata editors use for authoring metadata, need to be mentioned.

Given that the Component Registry generates an XML schema for every profile, basically any generic XML editor with schema validation can be used (e.g. the widespread \xne{oXygen}). However, there have been efforts within the CLARIN community to develop dedicated tools tailor-made for the creation of CMD records.
Two examples are the stand-alone application \xne{Arbil}\furl{http://tla.mpi.nl/tools/tla-tools/arbil/} \cite{withers2012arbil}, developed at the Max Planck Institute for Psycholinguistics, Nijmegen, and the web-based application developed within the project \xne{NaLiDa}\furl{http://www.sfs.uni-tuebingen.de/nalida/en/} \cite{dima2012mdeditor} at the Seminar f\"{u}r Sprachwissenschaft, University of T\"{u}bingen.
140
141
142\subsection{CMDI Exploitation Side}
143\label{cmdi_exploitation}
Metadata complying with the CMD data model is being created by a growing number of institutions by various means: automatic transformation from legacy data or authoring of new metadata records with the help of one of the metadata editors (cf. \ref{md-editors}). The CMD infrastructure requires the content providers to publish their metadata via the OAI-PMH protocol and to announce their OAI-PMH endpoints. These endpoints are harvested daily by a dedicated CLARIN harvester\furl{http://catalog.clarin.eu/oai-harvester/}. The harvested data is validated against the corresponding schemas (every profile implies a separate schema). A subsequent normalization step will play a bigger role in the future; currently, only minimal ad-hoc label normalization is performed for a few organization names. Finally, the data is made (publicly) available as compressed archive files. These are fetched by the exploitation side applications, which ingest the metadata records, index them and make them available for searching and browsing (cf. figure \ref{fig:cmd-ingestion}).
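
For illustration, a harvesting request and the shape of the corresponding response might look as follows. This is a sketch under assumptions: the endpoint \code{http://example.org/oai}, the record identifier and the dates are invented placeholders, and \code{cmdi} is assumed as the metadata prefix under which a provider exposes its CMD records. Only the element structure is taken from the OAI-PMH protocol.

\begin{lstlisting}[language=XML]
<!-- request (placeholder endpoint):
     http://example.org/oai?verb=ListRecords&metadataPrefix=cmdi -->
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2013-06-01T12:00:00Z</responseDate>
  <request verb="ListRecords"
           metadataPrefix="cmdi">http://example.org/oai</request>
  <ListRecords>
    <record>
      <header>
        <identifier>oai:example.org:record-1</identifier>
        <datestamp>2013-05-30</datestamp>
      </header>
      <metadata>
        <!-- one CMD record, to be validated against the schema
             of the profile it instantiates -->
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
\end{lstlisting}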
145
146\begin{figure*}[!ht]
147\begin{center}
148\includegraphics[width=0.8\textwidth]{images/CMDingestion_woVAS}
149\caption{Within CMDI, metadata is harvested from content providers via OAI-PMH and made available to consumers/users by search applications.}
150\label{fig:cmd-ingestion}
151\end{center}
152\end{figure*}
153
The first stable and publicly available application providing access to the collected metadata of CMDI has been the \xne{VLO - Virtual Language Observatory}\furl{http://www.clarin.eu/vlo/} \cite{VanUytvanck2010}, developed by the Technical Group at the MPI for Psycholinguistics, Nijmegen, based on the widespread full-text search engine \xne{Apache Solr}\furl{http://lucene.apache.org/solr/}.
The application employs a faceted search with 10 fixed facets (figure \ref{fig:vlo}).
As the processed metadata records are instances of different CMD profiles and thus differ widely in structure, the application relies on the data category references in the underlying schemas to map the fields in the records onto the facets, effectively making use of the basic layer of semantic interoperability provided by the infrastructure.
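
The principle can be sketched as a mapping from facets to data categories. The fragment below is purely illustrative: the element names are hypothetical and do not reproduce the actual configuration format of the VLO, and the form of the data category PID is assumed.

\begin{lstlisting}[language=XML]
<!-- hypothetical mapping; element names invented for illustration -->
<facetMapping>
  <facet name="language">
    <!-- every schema element bound to this data category
         contributes its values to the facet -->
    <datcat>http://www.isocat.org/datcat/DC-2482</datcat>
  </facet>
</facetMapping>
\end{lstlisting}

Two profiles that name or nest the element differently, but bind it to the same data category, would thus feed the same facet.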
157
158\begin{figure*}[ht]
159\begin{center}
160\includegraphics[width=0.8\textwidth]{images/screen_VLO_overview.png}
161\caption{Screenshot of the faceted browser of the VLO}
162\label{fig:vlo}
163\end{center}
164\end{figure*}
165
More recently, the team at the Meertens Institute developed a similar application, the \texttt{MI Search Engine}\furl{http://www.meertens.knaw.nl/cmdi/search/}. It is also based on Apache Solr and provides a faceted search, but with a substantially more sophisticated indexing process and search interface \cite{Zhang2012cmdi}. Instead of reducing the data to a fixed number of indexes or facets, the application employs the aforementioned \xne{Schema Parser} to dynamically generate an index configuration that covers all data, again relying on the data categories to merge information from semantically equivalent metadata fields in the different schemas into a common index.
167The application also offers some innovative solutions on the user interface, like search by similarity, content-first search or specialized contextual widgets visualizing the time dimension, the geographic information and other derived data.
168% \todoin { describe indexing and search}
169
Finally, there is the \xne{Metadata Repository}, developed by the author as an XQuery application in the XML database \xne{eXist}. In the initial blueprints of the infrastructure it was foreseen as the main storage of the collected metadata, with the \xne{Metadata Service} on top providing search access to the data and optionally applying \xne{Semantic Mapping} to expand user queries (cf. figure \ref{fig:cmdi-old}) \cite{Durco2011}.
However, the application has not yet reached production quality and serves rather as an experimental testbed for the author. Meanwhile, the functionality of the Metadata Service has been integrated directly into the Metadata Repository together with the auxiliary use of Semantic Mapping, making it the implementation of the semantic search module proposed in this work (cf. \ref{sec:qx}).
172
173%%%%%%%%%%%%%%%%%%%%
174\section{Vocabulary Service / Reference Data Registries}
175\label{sec:cv}
176
177\subsection{Motivation \& Broader Context}
The provisions for data harmonization and semantic interoperability presented so far pertain mostly to the schema level. However, the problem of incoherent labelling and nomenclature is even more virulent in the actual metadata fields on the instance level. While for a number of fields the value domain can be enforced through schema validation, many fields (e.g. \concept{organization} or \concept{resource type}) have a constrained value domain that nevertheless cannot be exhaustively enumerated. This leads to a chronically inconsistent use of labels for referring to entities (as the instance data shows, some organizations are referred to by more than 20 different labels or spelling variants), prompting an urgent need for better means of harmonizing the constrained-field values.
179
180This issue is to be seen in a broader context of a general need for reliable community-shared registry services for concepts, controlled vocabularies and reference data in both the LRT and Digital Humanities community, applicable in a range of applications and tasks like data enrichment and annotation, metadata generation and curation, data analysis, etc.
181Moreover, by using global semantic identifiers instead of strings, such a service enables the harmonization of metadata descriptions and annotations and is an indispensable step towards transformation of this data into \emph{Linked Open Data}.
182
183Consequently, activities with regard to controlled vocabularies are ongoing not only in CLARIN, but also within the sister ESFRI project DARIAH. As there is a substantial overlap in the vocabularies relevant for the various communities and even more so a high potential for reusability on the technical level, there is a strong case for tight synergic cooperation between individual initiatives.
184
It also has to be kept in mind that a great deal of work on controlled vocabularies has already been done and that a large body of data exists both in individual specialized communities (taxonomies) and, with a more general scope, in the library world (authority files).
186
187\begin{comment}
188Besides providing vocabularies, the service should also hold and expose equivalences (and other relationships) between concepts from different vocabularies (concept schemes). These relationships come primarily from existing mappings, but can (and hopefully will) be subsequently generated (manually) for specific subsets on demand in a community process. An example for equivalences from Wikipedia\footnote{\href{http://de.wikipedia.org/wiki/Johann_Wolfgang_von_Goethe}{page for J. W. Goethe}}:
189\begin{verbatim}
190GND: 118540238 | LCCN: n79003362 |
191NDL: 00441109 | VIAF: 24602065
192\end{verbatim}
193\end{comment}
194
195\subsection{Implementation -- OpenSKOS/CLAVAS}
196\label{def:CLAVAS}
197
198In the context of CLARIN (primarily CLARIN-NL), a concrete initiative has been conducted -- \xne{Vocabulary Alignment Service for CLARIN} or CLAVAS -- with the objective to reuse and enhance for CLARIN needs a SKOS-based vocabulary repository and editor \xne{OpenSKOS}\furl{http://openskos.org}, developed and run within the Dutch program \xne{CATCHplus}\footnote{\textit{Continuous Access To Cultural Heritage} - \url{http://www.catchplus.nl/en/}}.
199
200%As of spring 2013, the Standing Committee on CLARIN Technical Centres (SCCTC) adopted the issue of Controlled Vocabularies and Concept Registries as one of the infrastructural (A-centre) services to be dealt with.
201
202The basic idea of this repository is to serve as a project independent manager and provider of controlled vocabularies, as an exchange platform for data in SKOS format.
203One important feature of the \xne{OpenSKOS} system is its distributed architecture. Multiple instances can be set up that can synchronize the maintained vocabularies among each other via OAI-PMH protocol. This caters for a reliable redundant system, in which multiple instances provide identical synchronized data, with organizations behind individual instances assuming the primary responsibility for individual vocabularies based on their specialization or field of expertise.
204
Currently, the Meertens Institute\furl{http://meertens.knaw.nl/} of the Dutch Royal Academy of Sciences (KNAW), the Netherlands Institute for Sound and Vision\furl{http://www.beeldengeluid.nl/} and the Austrian Centre for Digital Humanities at the Austrian Academy of Sciences each run an instance of the OpenSKOS system.

As the work on this vocabulary repository started in the context of a cultural heritage programme, it originally served vocabularies not directly relevant to the LRT community, such as \concept{GTAA - Gemeenschappelijke Thesaurus Audiovisuele Archieven} or \concept{AAT - Art \& Architecture Thesaurus}\furl{http://openskos.org/api/collections}. Within CLAVAS, a number of vocabularies relevant to the CLARIN and LRT community were identified that will be gradually integrated into the vocabulary repository. (See \ref{refdata} for a more complete list of required reference data together with candidate existing vocabularies.) The following vocabularies have already been integrated into the \xne{CLAVAS} instance of OpenSKOS:
208\begin{itemize}
209\item the list of language codes \cite{ISO639}
210\item organization names for the domain of language resources
211\item a number of data categories from ISOcat (see \ref{sec:export-dcr} for details of the process)
212\end{itemize}
213
214\subsection{Export DCR to SKOS}
215\label{sec:export-dcr}
216
217Based on the premise that the data in DCR also represents a kind of a controlled vocabulary, there is an effort to export data categories in SKOS format and import them into the Vocabulary Service.
218
Note that there are two interaction paths between ISOcat and the Vocabulary Service. The first, importing certain data categories from ISOcat into the Vocabulary Service, is described in this section. The second (described in section \ref{interaction-dcr-skos}) is that the value domains of certain data categories are defined by reference to a vocabulary maintained in the Vocabulary Service.

The fact that data categories are basically definitions of concepts may tempt one into
a na\"{i}ve approach to mapping DCR data to SKOS, namely mapping every data category to a \code{skos:Concept},
all of them belonging to the \code{ISOcat:ConceptScheme}. However, the data in ISOcat as a whole is too disparate in scope for such a vocabulary to be useful.

A more sensible approach is to export only closed DCs (with an explicitly defined value domain, cf. \ref{def:DCR}) as separate \code{skos:ConceptSchemes} and their respective simple DCs as \code{skos:Concepts} within that scheme.
226
227\begin{quotation}
228The rationale is that if we see a vocabulary as a set of possible values for a
229field/element/attribute, complex DCs in ISOcat are the users of such
230vocabularies and simple DCs the DCR equivalence of values in such a
231vocabulary. \cite{Menzo2013mail}
232\end{quotation}
233
234\begin{comment}
235Still there are some closed DCs, which might be good vocabulary
236providers, e.g., /linguistic subject/ (DC-2527/), and still also need to
237stay in ISOcat. I think at some point we should create a smaller set of
238metadata DCs to be harvested by CLAVAS.
239Therefore a threshold seems sensible, where only value domains with more
240then 20, 50 or 100 values are exported.
241
242However, it needs to be yet assessed how useful this approach is. In the metadata profile
243there are many closed DCs with small value domains. How useful are those
244in CLAVAS?
245\end{comment}
246
247\begin{figure*}
248\begin{center}
249\includegraphics[width=0.6\textwidth]{images/export_DCR2SKOS.png}
250\end{center}
251\caption{The wrong and correct variant of exporting ISOcat data categories in SKOS format to the Vocabulary Service}
252\label{fig:export_dcr2skos}
253\end{figure*}
254
Another aspect is that a simple DC can be in the value domains of multiple closed DCs.
Likewise, a \code{skos:Concept} can belong to multiple \code{skos:ConceptSchemes}\furl{http://www.w3.org/TR/skos-primer/\#secscheme}.
So there could be a 1:1 mapping of [complex closed DCs] to [skos:ConceptSchemes] and of [simple DCs] to [skos:Concepts].
That would automatically also convey the possibly multiple membership of simple DCs / skos:Concepts in closed DCs / skos:ConceptSchemes.

Alternatively, for each value domain a SKOS concept scheme with SKOS concepts can be created,
i.e., a SKOS concept always belongs to exactly one concept scheme, but multiple SKOS concepts may refer to the same simple DC using \code{<dcr:datcat/>} (and \code{<dcterms:source/>}).
This is how the export for CLAVAS currently works.\furl{http://www.isocat.org/rest/profile/5.clavas}\furl{https://trac.clarin.eu/browser/cats/ISOcat/trunk/mod-ISOcat-interface-rest/representations/dcs2/clavas.xsl}
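
The variant with one concept scheme per value domain can be sketched as follows. The fragment is an illustration under assumptions rather than actual output of the export: the scheme and concept URIs, the data category number \code{DC-0000} and the label are invented, and the namespace bound to the \code{dcr} prefix as well as the form of the data category PID are assumed.

\begin{lstlisting}[language=XML]
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:skos="http://www.w3.org/2004/02/skos/core#"
         xmlns:dcterms="http://purl.org/dc/terms/"
         xmlns:dcr="http://www.isocat.org/ns/dcr.rdf#">
  <!-- one concept scheme per value domain of a closed DC
       (placeholder URI) -->
  <skos:ConceptScheme rdf:about="http://example.org/clavas/scheme/A"/>
  <!-- one concept per value; it belongs to exactly one scheme -->
  <skos:Concept rdf:about="http://example.org/clavas/scheme/A/value-1">
    <skos:inScheme rdf:resource="http://example.org/clavas/scheme/A"/>
    <skos:prefLabel xml:lang="en">example value</skos:prefLabel>
    <!-- back-reference to the simple DC (invented number) -->
    <dcr:datcat rdf:resource="http://www.isocat.org/datcat/DC-0000"/>
    <dcterms:source rdf:resource="http://www.isocat.org/datcat/DC-0000"/>
  </skos:Concept>
</rdf:RDF>
\end{lstlisting}

A second scheme whose value domain contains the same simple DC would receive its own concept carrying the same \code{dcr:datcat} reference.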
264
265
266\subsection{Linking to Vocabularies in Data Categories and Schemas -- Interaction between ISOcat, CLAVAS and Client Applications}
267\label{interaction-dcr-skos}
268
In the following, we elaborate on the possible ways to model references to vocabularies in data category specifications and to
convey that information to the client application. At the time of writing, this is work in progress, with some design decisions yet to be made.\footnote{Large parts of this subsection come from email correspondence with M. Windhouwer in spring 2013. \cite{Menzo2013mail}}
271
272Providing vocabularies for constrained but large and complex conceptual domains is the main motivation for the vocabulary repository:
273
274\begin{quotation}
275Originally, the vocabulary repository has been conceived to manage rather large and complex value domains that do not fit easily in the DCR data model. Where the value domains are big (ISO 639-3) or can only be
276partially enumerated (organization names) ISOcat can't/shouldn't contain
277the value domains but just refer to CLAVAS, i.e., ISOcat wouldn't be a
278provider. \cite{Menzo2013mail}
279\end{quotation}
280
Currently, the only way to constrain the value domain of a data category
is by the means that XML Schema provides, such as enumerations or regular expressions. For the data category \concept{languageID\#DC-2482}, for example, the rule looks as follows:
\lstset{language=XML}
\begin{lstlisting}
  <dcif:conceptualDomain type="constrained">
    <dcif:dataType>string</dcif:dataType>
    <dcif:ruleType>XML Schema regular expression</dcif:ruleType>
    <dcif:rule>[a-z]{3}</dcif:rule>
  </dcif:conceptualDomain>
\end{lstlisting}
291
A proposal by Windhouwer \cite{Menzo2013mail} for integration with CLAVAS foresees the following extension:

\begin{lstlisting}
  <clavas:vocabulary href="http://my.openskos.org/vocab/ISO-639" type="closed"/>
\end{lstlisting}
297
298\begin{quotation}
299\code{@href} points to the vocabulary. Actually a PID should be used in the context
300of ISOcat, but it is not clear how persistent are the vocabularies. This may pose a problem as part of DC specification may now have a different persistency than the core.
301
302\code{@type} could be \code{closed} or \code{open}. \code{closed}: only values in the vocabulary are
303valid. \code{open}: the values in the vocabulary are hints/preferred values. Basically the DC itself is then open.
304\end{quotation}
305
306This yields a definition of the value domain for the data category, where the new rule pointing to the vocabulary is \emph{added} (cf. listing \ref{lst:dcif-conceptualDomain}), so that -- once the information from the DC specification gets into the schema -- tools that don't support vocabulary lookup, but are capable of XSD/RNG validation, can still use the regular expression based definition.
307 
\lstset{language=XML}
\begin{lstlisting}[label=lst:dcif-conceptualDomain, caption=Definition of conceptualDomain for the data category \concept{languageID} employing the proposed extension for pointing to a vocabulary]
  <dcif:conceptualDomain type="constrained">
    <dcif:dataType>string</dcif:dataType>
    <dcif:ruleType>XML Schema regular expression</dcif:ruleType>
    <dcif:rule>[a-z]{3}</dcif:rule>
  </dcif:conceptualDomain>
  <dcif:conceptualDomain type="constrained">
    <dcif:dataType>string</dcif:dataType>
    <dcif:ruleType>CLAVAS vocabulary</dcif:ruleType>
    <dcif:rule>
      <clavas:vocabulary href="http://my.openskos.org/vocab/ISO-639"
                         type="closed"/>
    </dcif:rule>
  </dcif:conceptualDomain>
\end{lstlisting}
324
325\begin{figure*}[ht]
326\begin{center}
327\includegraphics[width=0.7\textwidth]{images/concept_linking.png}
328\end{center}
329\caption{The linking between schemas, data categories and vocabularies}
330\label{fig:concept_linking}
331\end{figure*}
332
It is important to emphasize that anything stated in the DC specification is not binding (even if the DC is of type \var{closed}), but rather a non-normative hint or recommendation. The authoritative source is the schema. A schema modeller binding an element in a schema to a data category can still decide to impose other restrictions on the value domain of that element than the ones suggested in the DC specification. This applies equally to the proposed vocabulary reference mechanism: the author of the data category suggests a vocabulary to be used for the values of a given data category, but the metadata modeller decides whether and how this vocabulary will be integrated into the modelled schema.
334
335There are basically two options how the vocabulary can be integrated into the schema.
336One approach is to explicitly enumerate all the values from the vocabulary.
337Within CMD this has been done in the component for language-codes\furl{http://catalog.clarin.eu/ds/ComponentRegistry/?item=clarin.eu:cr1:c_1271859438110}. This method allows to strictly validate given metadata field, however, there is clearly a limit to this approach in terms of a) size of the vocabulary\footnote{e.g. \xne{ISO-639} contains 7.679 items (language codes) adding some 2MB to each schema referencing it}, b) completeness -- most of the vocabularies cannot be seen as closed, i.e. they represent only a partial enumeration just providing a recommended label for an entity,
338and c) stability or change rate -- even the supposedly fixed list of language-codes \xne{ISO-639-*} undergoes regular changes -- it is being updated semi-annually, with entries being added, deleted, merged and split.\furl{http://www-01.sil.org/iso639-3/changes.asp}
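
The enumeration approach can be sketched as follows. This is a minimal illustration rather than the actual CMD component: the element name \code{languageID} and the three language codes are chosen only as examples, whereas a real schema would list every entry of the vocabulary.

\lstset{language=XML}
\begin{lstlisting}[label=lst:xsd-enumeration-sketch, caption={Sketch of a strict value domain given as an explicit enumeration (element name and codes are illustrative assumptions)}]
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="languageID">
    <xs:simpleType>
      <xs:restriction base="xs:string">
        <!-- one xs:enumeration per vocabulary entry; only three shown -->
        <xs:enumeration value="deu"/>
        <xs:enumeration value="eng"/>
        <xs:enumeration value="fra"/>
      </xs:restriction>
    </xs:simpleType>
  </xs:element>
</xs:schema>
\end{lstlisting}

With such a definition, a validator rejects any value that is not listed, which also means that the schema has to be regenerated whenever the vocabulary changes.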
339
The other, ``soft'' alternative is to convey the information about the data category and vocabulary in the schema as an annotation, either in an \code{<xs:appinfo>} element or via an attribute in a dedicated namespace. This method is already employed in the Component Registry, which indicates the data category of a generated element with the \code{@dcr:datcat} attribute.
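
A possible shape of such an annotated element is sketched below. Only the \code{@dcr:datcat} attribute and the \code{clavas:vocabulary} pointer are taken from the text. The element name, the data category identifier, the \code{dcr} namespace URI and in particular the \code{clavas} namespace URI (a placeholder) are assumptions made for the sake of a well-formed example.

\lstset{language=XML}
\begin{lstlisting}[label=lst:xsd-annotation-sketch, caption={Sketch of the soft alternative: data category and vocabulary conveyed as annotations (identifiers and namespace URIs are assumptions)}]
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:dcr="http://www.isocat.org/ns/dcr"
           xmlns:clavas="http://example.org/ns/clavas">
  <!-- the attribute links the element to its data category
       (identifier assumed for illustration) -->
  <xs:element name="languageID" type="xs:string"
              dcr:datcat="http://www.isocat.org/datcat/DC-2482">
    <xs:annotation>
      <xs:appinfo>
        <!-- hypothetical: vocabulary hint carried as an annotation -->
        <clavas:vocabulary href="http://my.openskos.org/vocab/ISO-639"
                           type="closed"/>
      </xs:appinfo>
    </xs:annotation>
  </xs:element>
</xs:schema>
\end{lstlisting}

Since the type remains a plain \code{xs:string}, a standard validator accepts any value here. The annotations only become effective in tools that know how to interpret them.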
341
Once the data category and vocabulary reference end up in the specification of the CMD profile and the derived XSD, the information can finally be used by client applications (such as a metadata editor)\footnote{Note though that this is not a standard mechanism but rather a convention. The client application must implement it in order to be able to make use of it.}. The tool
can use the reference to the data category to fetch explanations (semantic information) and translations from ISOcat, and it can access the autocomplete/search interface of the Vocabulary Service to offer the user suggestions from the recommended vocabulary (cf. figure \ref{fig:concept_linking}).

The drawback of this variant is that validation is given up. This
is not a problem if the vocabulary is of \code{@type=open}, e.g. \concept{organisation names}, but
it is when the value domain is closed, e.g. \concept{languageID}. In the latter case,
the XSD generation could support both modes: a lax (smaller) version, which
does not contain the closed vocabulary as an enumeration and leaves the check to
the tool, and a strict version, which does contain the vocabulary as an
enumeration. The latter should probably remain the default, but the client application could
request the lax version, leading to a smaller schema and quicker XSD validation
inside the tool.
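
One conceivable form of the lax version is sketched below. It is an assumption, not a description of the existing XSD generation, that the lax schema would fall back on the regular expression from the DC specification. The element name is again only illustrative.

\lstset{language=XML}
\begin{lstlisting}[label=lst:xsd-lax-sketch, caption={Sketch of a lax value domain that keeps only the pattern check (an assumed design)}]
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="languageID">
    <xs:simpleType>
      <xs:restriction base="xs:string">
        <!-- regular expression taken over from the DC specification -->
        <xs:pattern value="[a-z]{3}"/>
      </xs:restriction>
    </xs:simpleType>
  </xs:element>
</xs:schema>
\end{lstlisting}

Such a schema still rejects malformed values, but it accepts any three-letter string, so checking the value against the vocabulary remains the task of the tool.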
354
%However, for the presumably default (and recommended) scenario, where the modeller wants to use the information from the data category, the \xne{Component Editor} could offer to take over the data type and the vocabulary reference from the linked DC specification.


%%%%%%%%%%%%%%%%%
\section{Other Aspects of the Infrastructure}
While this work concentrates solely on the metadata, it is important to acknowledge that metadata is only one aspect of the infrastructure, whose actual purpose is the availability of resources. Announcing and describing the resources by metadata is a necessary first step; however, it is of little value if the resources themselves are not accessible. We therefore briefly mention two other important aspects: content repositories for storing the resources and federated content search for searching in the resources.

\subsubsection{CLARIN Centres}
One view of the CLARIN infrastructure is that of a network of centres\furl{http://www.clarin.eu/node/3812}:

\begin{quotation}
CLARIN's distributed network is made out of centres. These units, often a university or an academic institute, offer the scientific community access to services on a sustainable basis.
\end{quotation}

CLARIN imposes a number of criteria that each centre needs to fulfill to become a CLARIN Centre\furl{http://www.clarin.eu/node/3767} \cite{CE-2013-0095}.
CLARIN also maintains a central registry, the \xne{Centre Registry}\furl{https://centerregistry-clarin.esc.rzg.mpg.de/}, which holds structured information about every centre and is meant as the primary entry point into the CLARIN network of centres.

One core service of such centres is the content repository, a system meant for long-term preservation and online provision of research data and resources. A number of centres have been identified that provide Depositing Services\furl{http://clarin.eu/3773}, i.e. they allow third-party researchers (not just the centre's own users) to store research data.
373
\begin{comment}
In the following a few further well established repositories are mentioned.

\begin{description}
\item[PHAIDRA] Permanent Hosting, Archiving and Indexing of Digital Resources and Assets, provided by Vienna University \footnote{\url{https://phaidra.univie.ac.at/}}
\item[eSciDoc] provided by MPG + FIZ Karlsruhe \footnote{\url{https://www.escidoc.org/}}
\item[TextGrid] \furl{http://textgrid.de}
\item[DRIVER] pan-European infrastructure of Digital Repositories \footnote{\url{http://www.driver-repository.eu/}}
\item[OpenAIRE] - Open Access Infrastructure for Research in Europe \footnote{\url{http://www.openaire.eu/}}
\end{description}
\end{comment}

\begin{figure*}
\begin{center}
\includegraphics[width=0.7\textwidth]{images/FCS_components.png}
\end{center}
\caption{Components of the Federated Content Search and their interdependencies}
\label{fig:fcs}
\end{figure*}

\subsubsection{Federated Content Search}

Another aspect of the availability of resources is that, while metadata can be harvested and indexed locally in one repository, this is not possible with the content itself, partly due to the size of the data but mainly due to legal obligations (licenses, copyright) restricting the access to and availability of the resources. CLARIN's answer to this problem is the task force \emph{Federated Content Search}\furl{http://www.clarin.eu/fcs} \cite{stehouwer2012fcs}, which aims to establish an architecture that allows simultaneous search (via an aggregator) across a number of resources hosted by different content providers through a harmonized interface adhering to a common protocol. The agreed-upon protocol is a compatible extension of the SRU/CQL protocol, developed and endorsed by the Library of Congress as the XML- and web-based successor to Z39.50 \cite{Lynch1991}.
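
To give an impression of the underlying protocol, the following sketch shows the skeleton of a plain SRU 1.2 \code{searchRetrieve} exchange. The endpoint address and the query term are hypothetical, and the record schema and record content are deliberately left open, as they depend on the extension agreed upon by the content providers.

\lstset{language=XML}
\begin{lstlisting}[label=lst:sru-sketch, caption={Skeleton of an SRU 1.2 searchRetrieve request and response (endpoint and query are hypothetical)}]
<!-- Hypothetical request to an endpoint or aggregator:
       GET http://example.org/sru
       operation=searchRetrieve  version=1.2
       query=language  maximumRecords=10 -->
<sru:searchRetrieveResponse xmlns:sru="http://www.loc.gov/zing/srw/">
  <sru:version>1.2</sru:version>
  <sru:numberOfRecords>1</sru:numberOfRecords>
  <sru:records>
    <sru:record>
      <sru:recordSchema>
        <!-- identifier of the record schema used by the endpoint -->
      </sru:recordSchema>
      <sru:recordPacking>xml</sru:recordPacking>
      <sru:recordData>
        <!-- the hit, in the format agreed upon by the providers -->
      </sru:recordData>
      <sru:recordPosition>1</sru:recordPosition>
    </sru:record>
  </sru:records>
</sru:searchRetrieveResponse>
\end{lstlisting}

Because every endpoint answers in the same envelope, an aggregator can send one query to many content providers and merge their responses without provider-specific parsing.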
397
Note that in practice the line between metadata and content data is not so clear -- usually there is a need to filter by metadata even when searching in content. Therefore, most content search engines also feature some kind of metadata filter. Thus it seems reasonable to harmonize the search protocol and query language for metadata and content. This proposition is further elaborated on in \ref{cql}.

\section{Summary}

In this chapter, we presented the individual parts of the infrastructure. Next to the core registries that this work directly builds upon (ISOcat Data Category Registry, Component Registry and Relation Registry), a number of other services and applications forming the CLARIN ecosystem were briefly introduced. Separate consideration was given to the issue of controlled vocabularies, together with a related module, the Vocabulary Alignment Service (and its implementation OpenSKOS), which allows vocabularies to be managed and used in client applications. Finally, a few other aspects of the infrastructure that are equally important but do not pertain to the metadata level were briefly addressed.