1 | \chapter{Underlying Infrastructure} |
---|
2 | \label{ch:infra} |
---|
3 | |
---|
4 | In this chapter, we present the infrastructure in which this work is embedded. We start with a short general introduction to the large research infrastructure initiative CLARIN, followed by a close examination of its technical infrastructure for creating and publishing metadata. In section \ref{sec:cv}, we discuss the services for managing controlled vocabularies and their role in the context of metadata creation. |
---|
5 | |
---|
6 | \section{CLARIN} |
---|
7 | \label{def:CLARIN} |
---|
8 | |
---|
9 | CLARIN -- Common Language Resources and Technology Infrastructure \cite{Varadi2008} -- is one of the large research infrastructure initiatives as envisaged by the European Strategy Forum on Research Infrastructures (ESFRI) and fostered by the framework programmes of the European Commission. The mission of this project is to provide |
---|
10 | |
---|
11 | \begin{quote} |
---|
12 | \dots easy and sustainable access for scholars in the humanities and social sciences to digital language data (in written, spoken, video or multimodal form) and advanced tools to discover, explore, exploit, annotate, analyse or combine them, independent of where they are located. \cite{CLARIN2013web} |
---|
13 | \end{quote} |
---|
14 | |
---|
15 | \begin{comment} |
---|
16 | To this end CLARIN is in the process of building a networked federation of European data repositories, service centres and centres of expertise, with single sign-on access for all members of the academic community in all participating countries. Tools and data from different centres will be interoperable, so that data collections can be combined and tools from different sources can be chained to perform complex operations to support researchers in their work. |
---|
17 | \end{comment} |
---|
18 | |
---|
19 | The initiative foresees a federated network of centres providing resources and services in a harmonized, interoperable manner to the academic community in all participating countries. |
---|
20 | |
---|
21 | In the preparation phase of the project (2008--2011), over 180 institutions from 38 countries participated. In the construction phase, the impetus for action has shifted, as projected, towards the individual national initiatives of this federated endeavour. These are held together by the common principles set up during the preparation phase and by established processes and administrative decision bodies that ensure the flow of information and coherent action at the European level. |
---|
22 | |
---|
23 | Since 2013, CLARIN has also been a \emph{European Research Infrastructure Consortium} (ERIC), a new type of legal entity established within the EU and designed specifically to give research infrastructure initiatives a more stable status and better means to act independently. This is an important step towards ensuring the continuity of the endeavour, a chronic problem of (international) projects. |
---|
24 | |
---|
25 | |
---|
26 | \section{Component Metadata Infrastructure -- CMDI} |
---|
27 | \label{def:CMDI} |
---|
28 | |
---|
29 | One core pillar of CLARIN is the \emph{Component Metadata Infrastructure} (CMDI)\furl{http://www.clarin.eu/cmdi} -- a distributed system consisting of multiple interconnected modules aimed at creating and providing metadata for LRT in a coherent, harmonized way. The conceptual foundation of CMDI is the \emph{Component Metadata Framework} \cite{Broeder+2010}, a flexible meta-model that supports the creation of metadata schemas and can also accommodate existing schemas (cf. \ref{def:CMD}). |
---|
30 | |
---|
31 | The SMC is part of CMDI and depends on multiple modules on the production side of the infrastructure. Before we describe the SMC and its interaction with these modules in detail in chapter \ref{ch:design}, we introduce the latter and the type of data they provide in \ref{cmdi-registries}: |
---|
32 | |
---|
33 | \begin{itemize} |
---|
34 | \item Data Category Registry |
---|
35 | \item Component Registry |
---|
36 | \item Relation Registry |
---|
37 | \end{itemize} |
---|
38 | |
---|
39 | \noindent |
---|
40 | All these modules are running services that this work shall directly build upon. |
---|
41 | |
---|
42 | In contrast, the SMC is meant as a provider for the modules on the exploitation side of the infrastructure, i.e. the search and exploration services used by end users. These are briefly introduced in \ref{cmdi_exploitation}. |
---|
43 | |
---|
44 | \begin{figure*}[ht] |
---|
45 | \begin{center} |
---|
46 | \includegraphics[width=0.8\textwidth]{images/CMDI_components_old_clean.png} |
---|
47 | \caption{The diagram [from early CLARIN/CMDI presentations] shows individual modules of the CMDI and their interrelations as envisaged in the initial phase of the CLARIN project} |
---|
48 | \label{fig:cmdi-old} |
---|
49 | \end{center} |
---|
50 | \end{figure*} |
---|
51 | |
---|
52 | Besides the above-mentioned services, with which the SMC interacts directly, the CMDI ecosystem comprises some further services and applications, which are briefly introduced in \ref{cmdi-other} for completeness: |
---|
53 | |
---|
54 | \begin{itemize} |
---|
55 | \item metadata editors |
---|
56 | \item Schema Registry |
---|
57 | \item Schema Parser |
---|
58 | \end{itemize} |
---|
59 | |
---|
60 | Finally, the Vocabulary Alignment Service, a module playing a crucial role in metadata curation, is treated separately in section \ref{sec:cv}. |
---|
61 | |
---|
62 | \subsection{CMDI Registries} |
---|
63 | \label{cmdi-registries} |
---|
64 | The CMD framework as data model (cf. \ref{def:CMD}), together with the two registries, the \emph{Data Category Registry} \xne{ISOcat} and the \emph{Component Registry}, forms the backbone of the CMD Infrastructure. Figure \ref{fig:cmdi-old} shows the rather na\"{i}ve initial vision of the system, whereas figure \ref{fig:SMC-linkage} details the actual linkage between the data in the individual registries. In the following, we briefly explain their role and interaction. |
---|
65 | |
---|
66 | \begin{figure*}[t] |
---|
67 | \includegraphics[width=1\textwidth]{images/SMC_CR-DCR-RR_Linkage_v2} |
---|
68 | \caption{The diagram depicts the links between pieces of data in the individual registries that serve as basis for semantic mapping.} |
---|
69 | \label{fig:SMC-linkage} |
---|
70 | \end{figure*} |
---|
71 | |
---|
72 | \subsubsection*{Data Category Registry -- ISOcat} |
---|
73 | \label{def:DCR} |
---|
74 | |
---|
75 | The \emph{Data Category Registry} (DCR) is a central registry that enables the community to collectively define and maintain a set of relevant linguistic data categories (DC). The resulting shared controlled vocabulary is the cornerstone for grounding the semantic interpretation within the CMD framework. The DCR is, however, not specific to CMDI; it is meant to be used as a common concept registry in many applications. |
---|
76 | |
---|
77 | The data model and the procedures of the DCR are defined by the ISO standard \cite{ISO12620:2009}. |
---|
78 | \xne{ISOcat}\furl{http://www.isocat.org/} is an implementation of this standard developed by the MPI for Psycholinguistics, Nijmegen, in collaboration with the ISO technical committee \xne{ISO TC 37 Terminology and Other Language and Content Resources}. |
---|
79 | In addition to a web interface for users to browse and manage the data categories, ISOcat provides a REST-style webservice that allows applications to retrieve the data category specifications. By default, these are provided in the \xne{Data Category Interchange Format - DCIF}, the standardized XML serialization of the data model, but RDF and HTML representations are available as well. |
---|
80 | |
---|
81 | The core data model defining the data category specification is rather complex: it consists of an administrative, a linguistic and a description part, containing language-specific versions of definitions, value domains, examples and other attributes (cf. \ref{fig:DCR_data_model} for the diagram of the full data model). The following types of data categories are recognized (cf. figure \ref{fig:dc_type}): \var{simple}, \var{complex} (\var{closed}, \var{open} or \var{constrained}) and \var{container}. One fundamental aspect to emphasize is that every data category is assigned a persistent identifier, making it globally and permanently referable. |
---|
82 | |
---|
83 | \begin{figure*}[!ht] |
---|
84 | \begin{center} |
---|
85 | \includegraphics[width=0.7\textwidth]{images/dc_types} |
---|
86 | \end{center} |
---|
87 | \caption{Data Category types \cite{Windhouwer2011}} |
---|
88 | \label{fig:dc_type} |
---|
89 | \end{figure*} |
---|
90 | |
---|
91 | \subsubsection*{Component Registry} |
---|
92 | \label{def:CR} |
---|
93 | |
---|
94 | The \emph{Component Registry}\furl{http://catalog.clarin.eu/ds/ComponentRegistry/} (CR) implements the CMD data model (cf. \ref{def:CMD}) and fulfills two functions. First, it is the actual registry that persistently stores and exposes published CMD profiles. It does so via a web interface for browsing and searching the profiles and viewing their structure, accompanied by a REST webservice through which client applications can retrieve the profile definitions. Second, the web interface serves as an editor for creating and editing CMD components and profiles. |
---|
95 | |
---|
96 | The primary user of the CR is the metadata modeller, whose task is to create a dedicated metadata profile for a given resource type. She can browse and search the CR for components and profiles that are suitable or come close. The registry already contains many general components, e.g., for contact persons, language and geographical information. In general, many of these can be reused as they are or need only slight adaptation, i.e., the addition or removal of some metadata elements and/or components. New components can also be created if needed to model the unique aspects of the resources under consideration \cite{Durco2013MTSR}. |
---|
97 | |
---|
98 | Let us reiterate that the actual core provision for semantic interoperability is the requirement that the elements (and as far as possible also components and values) should be linked ``via a PID to exactly one data category (cf. \ref{def:DCR}), thus indicating unambiguously how the content of the field in a metadata description should be interpreted'' \cite{Broeder+2010}, or \emph{to make its semantics explicit}. |
---|
99 | |
---|
100 | As dictated by the CMD model, all components needed for the modelled resource description are compiled into one profile. |
---|
101 | Once a profile is created, the Component Registry automatically provides the corresponding XML schema, which can be used as a basis for creating and validating metadata records in the \code{cmd} namespace \code{http://www.clarin.eu/cmd}. |
---|
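As an illustration, the link between a schema element and its data category might surface in a generated schema roughly as in the following sketch. The element name is hypothetical, and the exact form of the annotation (an attribute \code{dcr:datcat} carrying the PID, bound to the namespace shown) is an assumption made for this example, not a verbatim excerpt from the output of the Component Registry.

\begin{lstlisting}[language=XML]
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:dcr="http://www.isocat.org/ns/dcr"
           targetNamespace="http://www.clarin.eu/cmd"
           elementFormDefault="qualified">
  <!-- hypothetical element; dcr:datcat holds the PID of the data category -->
  <xs:element name="languageID" type="xs:string"
              dcr:datcat="http://www.isocat.org/datcat/DC-2482"/>
</xs:schema>
\end{lstlisting}

A client that knows the PID can thus identify the field by its data category rather than by its name or its position in the profile.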
102 | |
---|
103 | \subsubsection*{Ontological Relations -- Relation Registry} |
---|
104 | |
---|
105 | The framework as described so far provides a sound mechanism for binding the semantic interpretation of the metadata descriptions. |
---|
106 | However, an additional means is needed to capture information about relations between data categories. |
---|
107 | This information was deliberately not included in the DCR, because relations often depend on the context in which they are used, making global agreement infeasible. CMDI proposes a separate module, the \emph{Relation Registry}\label{def:rr} (RR) \cite{Kemps-Snijders+2008}, where arbitrary relations between data categories can be stored and maintained. This design decision is based on the assumption that the relations need to be under the control of the metadata user, whereas the data categories are under the control of the metadata modeller. |
---|
108 | |
---|
109 | The relations don't need to pass a standardization process, but rather separate research teams may define their own sets of relations according to the specific needs of the project. That is not to say that every researcher has to create her own set of relations -- some basic recommended sets will be defined right from the start. But new -- even contradictory -- ones can be created when needed. |
---|
110 | |
---|
111 | There is a prototypical implementation of such a relation registry called \xne{RELcat}, being developed at the MPI, Nijmegen \cite{Windhouwer2011,SchuurmanWindhouwer2011}, that already hosts a few relation sets. There is no user interface to it yet, but it is accessible as a REST webservice\footnote{sample relation set: \url{http://lux13.mpi.nl/relcat/rest/set/cmdi}}. |
---|
112 | This implementation stores the individual relations as RDF triples, allowing typed relations such as equivalence (\code{rel:sameAs}) and subsumption (\code{rel:subClassOf}). The relations are grouped into relation sets that can be used independently. The relation types are deliberately defined in a separate namespace, instead of reusing existing ones (\code{skos:exactMatch, owl:sameAs}), with the aim of avoiding overly specific semantics. These relations can be mapped to other appropriate predicates when integrating the relation sets in concrete applications. |
---|
113 | |
---|
114 | \begin{definition}{The relation triples as stored by the Relation Registry} |
---|
115 | \textless \ subjectDatcat \ relationPredicate \ objectDatcat \textgreater |
---|
116 | \end{definition} |
---|
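A single relation of this kind could be serialized in RDF/XML as sketched below. The namespace URI bound to \code{rel} and the object identifier \code{DC-XXXX} are placeholders assumed for illustration; they are not taken from an actual relation set.

\begin{lstlisting}[language=XML]
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rel="http://example.org/relcat/relations#">
  <!-- subjectDatcat: the data category languageID -->
  <rdf:Description rdf:about="http://www.isocat.org/datcat/DC-2482">
    <!-- relationPredicate and objectDatcat (placeholder identifier) -->
    <rel:sameAs rdf:resource="http://www.isocat.org/datcat/DC-XXXX"/>
  </rdf:Description>
</rdf:RDF>
\end{lstlisting}

An application integrating the relation set remains free to interpret \code{rel:sameAs} as, for instance, \code{skos:exactMatch}.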
117 | |
---|
118 | \subsection{Further Parts of the Infrastructure} |
---|
119 | \label{cmdi-other} |
---|
120 | |
---|
121 | \subsubsection*{Schema Registry} |
---|
122 | |
---|
123 | SCHEMAcat\furl{http://lux13.mpi.nl/schemacat/site/index.html} is a registry for schemas of all kinds (not just the CMD-based, in fact not even just XML-based) semantically annotated with data categories. |
---|
124 | \begin{quotation} |
---|
125 | RELcat and SCHEMAcat will provide the means to harvest and specify this information in the form of relationships and allow |
---|
126 | (search) algorithms to traverse the semantic graph thus made explicit \cite{SchuurmanWindhouwer2011}. |
---|
127 | \end{quotation} |
---|
128 | |
---|
129 | \subsubsection*{Schema Parser} |
---|
130 | Schema Parser is a service developed at the Meertens Institute, Amsterdam, that processes XML Schemas to generate all possible paths in the instance data. It is used primarily as an auxiliary service to the search engine developed at the same institute, presented in \ref{cmdi_exploitation}. |
---|
131 | |
---|
132 | \subsubsection*{Metadata editors} |
---|
133 | \label{md-editors} |
---|
134 | |
---|
135 | Metadata creation, i.e. the authoring of actual metadata records, is indisputably the fundamental task in the whole system. |
---|
136 | Though they do not interact directly with the SMC, metadata editors need to be mentioned, i.e. the tools that human metadata editors use for authoring metadata. |
---|
137 | |
---|
138 | Given that the Component Registry generates an XML schema for every profile, basically any generic XML editor with schema validation can be used (e.g. the widespread \xne{oXygen}). However, there have been efforts within the CLARIN community to develop dedicated tools, tailor-made for the creation of CMD records. |
---|
139 | Two examples are the stand-alone application \xne{Arbil}\furl{http://tla.mpi.nl/tools/tla-tools/arbil/} \cite{withers2012arbil}, developed at the Max Planck Institute for Psycholinguistics, Nijmegen, and the web-based application developed within the project \xne{NaLiDa}\furl{http://www.sfs.uni-tuebingen.de/nalida/en/} \cite{dima2012mdeditor} at the Seminar f\"{u}r Sprachwissenschaft, University of T\"{u}bingen. |
---|
140 | |
---|
141 | |
---|
142 | \subsection{CMDI Exploitation Side} |
---|
143 | \label{cmdi_exploitation} |
---|
144 | Metadata complying with the CMD data model is being created by a growing number of institutions by various means: automatic transformation from legacy data or authoring of new metadata records with the help of one of the metadata editors (cf. \ref{md-editors}). The CMD infrastructure requires the content providers to publish their metadata via the OAI-PMH protocol and to announce their OAI-PMH endpoints. These endpoints are harvested daily by a dedicated CLARIN harvester\furl{http://catalog.clarin.eu/oai-harvester/}. The harvested data is validated against the corresponding schemas (every profile implies a separate schema). A subsequent normalization step is expected to play a bigger role in the future; currently, only minimal ad-hoc label normalization is performed for a few organization names. Finally, the data is made (publicly) available as compressed archive files. These are fetched by the exploitation-side applications, which ingest the metadata records, index them and make them available for searching and browsing (cf. figure \ref{fig:cmd-ingestion}). |
---|
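The shape of a harvested response can be sketched as follows. The element structure follows the OAI-PMH 2.0 protocol, whereas the endpoint URL, the record identifier, the dates and the metadata prefix \code{cmdi} are assumptions chosen for illustration.

\begin{lstlisting}[language=XML]
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2013-05-01T12:00:00Z</responseDate>
  <!-- hypothetical endpoint of a content provider -->
  <request verb="ListRecords" metadataPrefix="cmdi">http://example.org/oai</request>
  <ListRecords>
    <record>
      <header>
        <identifier>oai:example.org:record-1</identifier>
        <datestamp>2013-04-30</datestamp>
      </header>
      <metadata>
        <!-- the CMD record itself; its content is omitted here -->
        <CMD xmlns="http://www.clarin.eu/cmd"/>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
\end{lstlisting}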
145 | |
---|
146 | \begin{figure*}[!ht] |
---|
147 | \begin{center} |
---|
148 | \includegraphics[width=0.8\textwidth]{images/CMDingestion_woVAS} |
---|
149 | \caption{Within CMDI, metadata is harvested from content providers via OAI-PMH and made available to consumers/users by search applications.} |
---|
150 | \label{fig:cmd-ingestion} |
---|
151 | \end{center} |
---|
152 | \end{figure*} |
---|
153 | |
---|
154 | The first stable and publicly available application providing access to the collected metadata of CMDI was the \xne{VLO - Virtual Language Observatory}\furl{http://www.clarin.eu/vlo/} \cite{VanUytvanck2010}, developed by the Technical Group at the MPI for Psycholinguistics, Nijmegen, and based on the widespread full-text search engine \xne{Apache Solr}\furl{http://lucene.apache.org/solr/}. |
---|
155 | The application employs a faceted search with 10 fixed facets (figure \ref{fig:vlo}). |
---|
156 | As the processed metadata records are instances of different CMD profiles and thus differ widely in structure, the application relies on the data category references in the underlying schemas to map the fields in the records onto the facets, effectively making use of the basic layer of semantic interoperability provided by the infrastructure. |
---|
157 | |
---|
158 | \begin{figure*}[ht] |
---|
159 | \begin{center} |
---|
160 | \includegraphics[width=0.8\textwidth]{images/screen_VLO_overview.png} |
---|
161 | \caption{Screenshot of the faceted browser of the VLO} |
---|
162 | \label{fig:vlo} |
---|
163 | \end{center} |
---|
164 | \end{figure*} |
---|
165 | |
---|
166 | More recently, the team at the Meertens Institute developed a similar application, the \texttt{MI Search Engine}\furl{http://www.meertens.knaw.nl/cmdi/search/}. It is also based on Apache Solr and provides a faceted search, but with a substantially more sophisticated indexing process and search interface \cite{Zhang2012cmdi}. Instead of reducing the data to a fixed number of indexes or facets, the application employs the aforementioned \xne{Schema Parser} to dynamically generate an index configuration that covers all data, again relying on the data categories to merge information from semantically equivalent metadata fields in the different schemas into a common index. |
---|
167 | The application also offers some innovative solutions on the user interface, like search by similarity, content-first search or specialized contextual widgets visualizing the time dimension, the geographic information and other derived data. |
---|
168 | % \todoin { describe indexing and search} |
---|
169 | |
---|
170 | Finally, there is the \xne{Metadata Repository}, developed by the author as an XQuery application in the XML database \xne{eXist}. In the initial blueprints of the infrastructure, it was foreseen as the main storage of the collected metadata, with the \xne{Metadata Service} on top providing search access to the data and optionally applying \xne{Semantic Mapping} to expand user queries (cf. figure \ref{fig:cmdi-old}) \cite{Durco2011}. |
---|
171 | However, the application has not yet reached production quality and serves rather as an experimentation ground for the author. Meanwhile, the functionality of the Metadata Service has been integrated directly into the Metadata Repository, together with the auxiliary use of Semantic Mapping, making it the implementation of the semantic search module proposed in this work (cf. \ref{sec:qx}). |
---|
172 | |
---|
173 | %%%%%%%%%%%%%%%%%%%% |
---|
174 | \section{Vocabulary Service / Reference Data Registries} |
---|
175 | \label{sec:cv} |
---|
176 | |
---|
177 | \subsection{Motivation \& Broader Context} |
---|
178 | The provisions for data harmonization and semantic interoperability presented so far pertain mostly to the schema level. However, the problem of incoherent labelling and nomenclature is even more virulent in the actual metadata fields on the instance level. While for a number of fields the value domain can be enforced through schema validation, many fields (e.g. \concept{organization} or \concept{resource type}) have a constrained value domain that nevertheless cannot be exhaustively enumerated. This leads to a chronically inconsistent use of labels for referring to entities (as the instance data shows, some organizations are referred to by more than 20 different labels or spelling variants), prompting an urgent need for better means of harmonizing the constrained-field values. |
---|
179 | |
---|
180 | This issue is to be seen in a broader context of a general need for reliable community-shared registry services for concepts, controlled vocabularies and reference data in both the LRT and Digital Humanities community, applicable in a range of applications and tasks like data enrichment and annotation, metadata generation and curation, data analysis, etc. |
---|
181 | Moreover, by using global semantic identifiers instead of strings, such a service enables the harmonization of metadata descriptions and annotations and is an indispensable step towards transformation of this data into \emph{Linked Open Data}. |
---|
182 | |
---|
183 | Consequently, activities with regard to controlled vocabularies are ongoing not only in CLARIN, but also within the sister ESFRI project DARIAH. As there is a substantial overlap in the vocabularies relevant for the various communities and even more so a high potential for reusability on the technical level, there is a strong case for tight synergic cooperation between individual initiatives. |
---|
184 | |
---|
185 | It also has to be kept in mind that a great deal of work on controlled vocabularies has already been done and that a large body of data is present in individual specialized communities (taxonomies) as well as -- with a more general scope -- in the library world (authority files). |
---|
186 | |
---|
187 | \begin{comment} |
---|
188 | Besides providing vocabularies, the service should also hold and expose equivalences (and other relationships) between concepts from different vocabularies (concept schemes). These relationships come primarily from existing mappings, but can (and hopefully will) be subsequently generated (manually) for specific subsets on demand in a community process. An example for equivalences from Wikipedia\footnote{\href{http://de.wikipedia.org/wiki/Johann_Wolfgang_von_Goethe}{page for J. W. Goethe}}: |
---|
189 | \begin{verbatim} |
---|
190 | GND: 118540238 | LCCN: n79003362 | |
---|
191 | NDL: 00441109 | VIAF: 24602065 |
---|
192 | \end{verbatim} |
---|
193 | \end{comment} |
---|
194 | |
---|
195 | \subsection{Implementation -- OpenSKOS/CLAVAS} |
---|
196 | \label{def:CLAVAS} |
---|
197 | |
---|
198 | In the context of CLARIN (primarily CLARIN-NL), a concrete initiative has been conducted -- \xne{Vocabulary Alignment Service for CLARIN} or CLAVAS -- with the objective of reusing the SKOS-based vocabulary repository and editor \xne{OpenSKOS}\furl{http://openskos.org}, developed and run within the Dutch programme \xne{CATCHplus}\footnote{\textit{Continuous Access To Cultural Heritage} - \url{http://www.catchplus.nl/en/}}, and enhancing it for the needs of CLARIN. |
---|
199 | |
---|
200 | %As of spring 2013, the Standing Committee on CLARIN Technical Centres (SCCTC) adopted the issue of Controlled Vocabularies and Concept Registries as one of the infrastructural (A-centre) services to be dealt with. |
---|
201 | |
---|
202 | The basic idea of this repository is to serve as a project-independent manager and provider of controlled vocabularies and as an exchange platform for data in SKOS format. |
---|
203 | One important feature of the \xne{OpenSKOS} system is its distributed architecture. Multiple instances can be set up that synchronize the maintained vocabularies among each other via the OAI-PMH protocol. This allows for a reliable redundant system, in which multiple instances provide identical synchronized data, with the organizations behind the individual instances assuming primary responsibility for individual vocabularies based on their specialization or field of expertise. |
---|
204 | |
---|
205 | Currently, the Meertens Institute\furl{http://meertens.knaw.nl/} of the Royal Netherlands Academy of Arts and Sciences (KNAW), the Netherlands Institute for Sound and Vision\furl{http://www.beeldengeluid.nl/} and the Austrian Centre for Digital Humanities at the Austrian Academy of Sciences are each running an instance of the OpenSKOS system. |
---|
206 | |
---|
207 | As the work on this vocabulary repository started in the context of a cultural heritage programme, it originally served vocabularies not directly relevant for the LRT community, such as \concept{GTAA - Gemeenschappelijke Thesaurus Audiovisuele Archieven} or \concept{AAT - Art \& Architecture Thesaurus}\furl{http://openskos.org/api/collections}. Within CLAVAS, a number of vocabularies relevant for the CLARIN and LRT community were identified that will be gradually integrated into the vocabulary repository (see \ref{refdata} for a more complete list of required reference data together with candidate existing vocabularies). The following vocabularies have already been integrated into the \xne{CLAVAS} instance of OpenSKOS: |
---|
208 | \begin{itemize} |
---|
209 | \item the list of language codes \cite{ISO639} |
---|
210 | \item organization names for the domain of language resources |
---|
211 | \item a number of data categories from ISOcat (see \ref{sec:export-dcr} for details of the process) |
---|
212 | \end{itemize} |
---|
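To illustrate how such a vocabulary addresses the problem of label variants, an entry for an organization might be modelled in SKOS as sketched below. The URIs are placeholders and the choice of labels is an assumption made for this example; the entry is not copied from the CLAVAS data.

\begin{lstlisting}[language=XML]
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:skos="http://www.w3.org/2004/02/skos/core#">
  <!-- placeholder URIs; one concept collects all spelling variants -->
  <skos:Concept rdf:about="http://example.org/organizations/mpi-psycholinguistics">
    <skos:inScheme rdf:resource="http://example.org/organizations"/>
    <skos:prefLabel xml:lang="en">Max Planck Institute for Psycholinguistics</skos:prefLabel>
    <skos:altLabel xml:lang="en">MPI for Psycholinguistics</skos:altLabel>
    <skos:altLabel xml:lang="en">MPI Nijmegen</skos:altLabel>
  </skos:Concept>
</rdf:RDF>
\end{lstlisting}

A metadata record can then refer to the concept URI, so that all label variants resolve to the same entity.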
213 | |
---|
214 | \subsection{Export DCR to SKOS} |
---|
215 | \label{sec:export-dcr} |
---|
216 | |
---|
217 | Based on the premise that the data in DCR also represents a kind of a controlled vocabulary, there is an effort to export data categories in SKOS format and import them into the Vocabulary Service. |
---|
218 | |
---|
219 | Note that there are two interaction paths between ISOcat and the Vocabulary Service. The first, importing certain data categories from ISOcat into the Vocabulary Service, is described in this section. The second (described in section \ref{interaction-dcr-skos}) is that the value domains of certain data categories are defined by reference to a vocabulary maintained in the Vocabulary Service. |
---|
220 | |
---|
221 | The fact that data categories are basically definitions of concepts may tempt one into |
---|
222 | a na\"{i}ve approach to mapping DCR data to SKOS, namely mapping every data category to a \code{skos:Concept}, |
---|
223 | all of them belonging to the \code{ISOcat:ConceptScheme}. However, the data in ISOcat as a whole is too disparate in scope for such a vocabulary to be useful. |
---|
224 | |
---|
225 | A more sensible approach is to export only closed DCs (with an explicitly defined value domain, cf. \ref{def:DCR}) as separate \code{skos:ConceptScheme}s and their respective simple DCs as \code{skos:Concept}s within that scheme. |
---|
226 | |
---|
227 | \begin{quotation} |
---|
228 | The rationale is that if we see a vocabulary as a set of possible values for a |
---|
229 | field/element/attribute, complex DCs in ISOcat are the users of such |
---|
230 | vocabularies and simple DCs the DCR equivalence of values in such a |
---|
231 | vocabulary. \cite{Menzo2013mail} |
---|
232 | \end{quotation} |
---|
233 | |
---|
234 | \begin{comment} |
---|
235 | Still there are some closed DCs, which might be good vocabulary |
---|
236 | providers, e.g., /linguistic subject/ (DC-2527/), and still also need to |
---|
237 | stay in ISOcat. I think at some point we should create a smaller set of |
---|
238 | metadata DCs to be harvested by CLAVAS. |
---|
239 | Therefore a threshold seems sensible, where only value domains with more |
---|
240 | then 20, 50 or 100 values are exported. |
---|
241 | |
---|
242 | However, it needs to be yet assessed how useful this approach is. In the metadata profile |
---|
243 | there are many closed DCs with small value domains. How useful are those |
---|
244 | in CLAVAS? |
---|
245 | \end{comment} |
---|
246 | |
---|
247 | \begin{figure*} |
---|
248 | \begin{center} |
---|
249 | \includegraphics[width=0.6\textwidth]{images/export_DCR2SKOS.png} |
---|
250 | \end{center} |
---|
251 | \caption{The wrong and correct variant of exporting ISOcat data categories in SKOS format to the Vocabulary Service} |
---|
252 | \label{fig:export_dcr2skos} |
---|
253 | \end{figure*} |
---|
254 | |
---|
255 | Another aspect is that a simple DC can be in the value domains of multiple closed DCs. |
---|
256 | Likewise, a \code{skos:Concept} can belong to multiple \code{skos:ConceptScheme}s\furl{http://www.w3.org/TR/skos-primer/\#secscheme}. |
---|
257 | So there could be a 1:1 mapping from complex closed DCs to \code{skos:ConceptScheme}s and from simple DCs to \code{skos:Concept}s. |
---|
258 | That would automatically also convey the possibly multiple membership of simple DCs / \code{skos:Concept}s in closed DCs / \code{skos:ConceptScheme}s. |
---|
259 | |
---|
260 | Alternatively, for each value domain a SKOS concept scheme with SKOS concepts can be created, |
---|
261 | i.e., a SKOS concept always belongs to exactly one concept scheme, but multiple SKOS concepts may refer to the same simple DC using \code{<dcr:datcat/>} (and \code{<dcterms:source/>}). |
---|
262 | This is how the export for CLAVAS currently works.\furl{http://www.isocat.org/rest/profile/5.clavas}\furl{https://trac.clarin.eu/browser/cats/ISOcat/trunk/mod-ISOcat-interface-rest |
---|
263 | /representations/dcs2/clavas.xsl} |
---|
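The alternative in which each concept belongs to exactly one scheme can be sketched in RDF/XML as follows. All URIs are placeholders, \code{DC-XXXX} stands for an arbitrary simple DC, and the namespace bound to \code{dcr} is an assumption; the sketch does not reproduce the actual output of the export.

\begin{lstlisting}[language=XML]
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:skos="http://www.w3.org/2004/02/skos/core#"
         xmlns:dcr="http://www.isocat.org/ns/dcr.rdf#">
  <!-- one concept scheme per value domain, i.e. per closed DC -->
  <skos:ConceptScheme rdf:about="http://example.org/scheme/A"/>
  <skos:ConceptScheme rdf:about="http://example.org/scheme/B"/>
  <!-- two concepts, each in exactly one scheme, both pointing to the same simple DC -->
  <skos:Concept rdf:about="http://example.org/scheme/A/value">
    <skos:inScheme rdf:resource="http://example.org/scheme/A"/>
    <dcr:datcat rdf:resource="http://www.isocat.org/datcat/DC-XXXX"/>
  </skos:Concept>
  <skos:Concept rdf:about="http://example.org/scheme/B/value">
    <skos:inScheme rdf:resource="http://example.org/scheme/B"/>
    <dcr:datcat rdf:resource="http://www.isocat.org/datcat/DC-XXXX"/>
  </skos:Concept>
</rdf:RDF>
\end{lstlisting}

The shared reference to the data category keeps the two concepts recognizable as the same value, even though they live in separate schemes.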
264 | |
---|
265 | |
---|
266 | \subsection{Linking to Vocabularies in Data Categories and Schemas -- Interaction between ISOcat, CLAVAS and Client Applications} |
---|
267 | \label{interaction-dcr-skos} |
---|
268 | |
---|
269 | In the following, we elaborate on the possible ways to model references to vocabularies in data category specifications and to |
---|
270 | convey that information to the client application. At the time of writing, this is work in progress, with some design decisions yet to be made.\footnote{Large parts of this subsection come from email correspondence with M. Windhouwer in spring 2013 \cite{Menzo2013mail}.} |
---|
271 | |
---|
272 | Providing vocabularies for constrained but large and complex conceptual domains is the main motivation for the vocabulary repository: |
---|
273 | |
---|
274 | \begin{quotation} |
---|
275 | Originally, the vocabulary repository has been conceived to manage rather large and complex value domains that do not fit easily in the DCR data model. Where the value domains are big (ISO 639-3) or can only be |
---|
276 | partially enumerated (organization names) ISOcat can't/shouldn't contain |
---|
277 | the value domains but just refer to CLAVAS, i.e., ISOcat wouldn't be a |
---|
278 | provider. \cite{Menzo2013mail} |
---|
279 | \end{quotation} |
---|
280 | |
---|
281 | Currently, the only way to constrain the value domain of a data category |
---|
282 | is by the means that XML Schema provides, such as enumerations or regular expressions. For the data category \concept{languageID\#DC-2482}, for instance, the rule looks as follows: |
---|
283 | \lstset{language=XML} |
---|
284 | \begin{lstlisting} |
---|
285 | <dcif:conceptualDomain type="constrained"> |
---|
286 | <dcif:dataType>string</dcif:dataType> |
---|
287 | <dcif:ruleType>XML Schema regular expression</dcif:ruleType> |
---|
288 | <dcif:rule>[a-z]{3}</dcif:rule> |
---|
289 | </dcif:conceptualDomain> |
---|
290 | \end{lstlisting} |
---|
291 | |
---|
292 | A proposal by Windhouwer \cite{Menzo2013mail} for integration with CLAVAS foresees the following extension:
---|
293 | |
---|
294 | \begin{lstlisting} |
---|
295 | <clavas:vocabulary href="http://my.openskos.org/vocab/ISO-639" type="closed"/> |
---|
296 | \end{lstlisting} |
---|
297 | |
---|
298 | \begin{quotation} |
---|
299 | \code{@href} points to the vocabulary. Actually a PID should be used in the context |
---|
300 | of ISOcat, but it is not clear how persistent are the vocabularies. This may pose a problem as part of DC specification may now have a different persistency than the core. |
---|
301 | |
---|
302 | \code{@type} could be \code{closed} or \code{open}. \code{closed}: only values in the vocabulary are |
---|
303 | valid. \code{open}: the values in the vocabulary are hints/preferred values. Basically the DC itself is then open. |
---|
304 | \end{quotation} |
---|
305 | |
---|
306 | This yields a definition of the value domain for the data category in which the new rule pointing to the vocabulary is \emph{added} to the existing one (cf. listing \ref{lst:dcif-conceptualDomain}). Thus -- once the information from the DC specification gets into the schema -- tools that do not support vocabulary lookup, but are capable of XSD/RNG validation, can still use the regular-expression-based definition.
---|
307 | |
---|
308 | \lstset{language=XML} |
---|
309 | \begin{lstlisting}[label=lst:dcif-conceptualDomain, caption=Definition of conceptualDomain for the data category \concept{languageID} employing the proposed extension for pointing to a vocabulary] |
---|
310 | <dcif:conceptualDomain type="constrained"> |
---|
311 | <dcif:dataType>string</dcif:dataType> |
---|
312 | <dcif:ruleType>XML Schema regular expression</dcif:ruleType> |
---|
313 | <dcif:rule>[a-z]{3}</dcif:rule> |
---|
314 | </dcif:conceptualDomain> |
---|
315 | <dcif:conceptualDomain type="constrained"> |
---|
316 | <dcif:dataType>string</dcif:dataType> |
---|
317 | <dcif:ruleType>CLAVAS vocabulary</dcif:ruleType> |
---|
318 | <dcif:rule> |
---|
319 | <clavas:vocabulary href="http://my.openskos.org/vocab/ISO-639" |
---|
320 | type="closed"/> |
---|
321 | </dcif:rule> |
---|
322 | </dcif:conceptualDomain> |
---|
323 | \end{lstlisting} |
---|
324 | |
---|
325 | \begin{figure*}[ht] |
---|
326 | \begin{center} |
---|
327 | \includegraphics[width=0.7\textwidth]{images/concept_linking.png} |
---|
328 | \end{center} |
---|
329 | \caption{The linking between schemas, data categories and vocabularies} |
---|
330 | \label{fig:concept_linking} |
---|
331 | \end{figure*} |
---|
332 | |
---|
333 | It is important to emphasize that anything stated in the DC specification is not binding (even if the DC is of type \var{closed}), but rather a non-normative hint or recommendation. The authoritative source is the schema. A schema modeller binding an element in a schema to a data category can still decide to apply other restrictions to the value domain of that element than the ones suggested in the DC specification. This applies equally to the proposed vocabulary reference mechanism: the author of the data category suggests a vocabulary to be used for values of the given data category, but the metadata modeller decides if and how this vocabulary will be integrated into the modelled schema.
---|
334 | |
---|
335 | There are basically two options for integrating the vocabulary into the schema.
---|
336 | One approach is to explicitly enumerate all the values from the vocabulary. |
---|
337 | Within CMD this has been done in the component for language-codes\furl{http://catalog.clarin.eu/ds/ComponentRegistry/?item=clarin.eu:cr1:c_1271859438110}. This method allows strict validation of the given metadata field; however, there is clearly a limit to this approach in terms of a) size of the vocabulary\footnote{e.g. \xne{ISO-639} contains 7,679 items (language codes), adding some 2MB to each schema referencing it}, b) completeness -- most of the vocabularies cannot be seen as closed, i.e. they represent only a partial enumeration, just providing a recommended label for an entity, 
---|
338 | and c) stability or change rate -- even the supposedly fixed list of language-codes \xne{ISO-639-*} undergoes regular changes -- it is being updated semi-annually, with entries being added, deleted, merged and split.\furl{http://www-01.sil.org/iso639-3/changes.asp} |
---|
339 | |
---|
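
A minimal sketch of what such an explicit enumeration amounts to in the generated XSD is given below. The element name \code{languageID} and the three sample codes are chosen for illustration only; the actual CMD component may be structured differently, and a complete schema would carry one \code{xs:enumeration} per language code.

\begin{lstlisting}
<!-- illustrative fragment; assumes the xs prefix is bound to the
     XML Schema namespace on the enclosing xs:schema element -->
<xs:element name="languageID">
  <xs:simpleType>
    <xs:restriction base="xs:string">
      <xs:enumeration value="deu"/>
      <xs:enumeration value="eng"/>
      <xs:enumeration value="nld"/>
      <!-- ... one xs:enumeration per remaining language code ... -->
    </xs:restriction>
  </xs:simpleType>
</xs:element>
\end{lstlisting}

A validator rejects any value not listed, which is what makes this variant strict, but every change to the vocabulary requires regenerating the schema.
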
340 | The other, ``soft'' alternative is to convey the information about the data category and the vocabulary in the schema as an annotation, either in an \code{<xs:appinfo>} element or by an attribute in a dedicated namespace. This method is already employed in the Component Registry, which indicates the data category of a generated element with the \code{@dcr:datcat} attribute.
---|
341 | |
---|
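
The annotation-based variant might look as follows. This is a sketch under stated assumptions, not the output of the Component Registry: the prefixes \code{dcr} and \code{clavas} are taken to be bound to their namespaces on the enclosing \code{xs:schema} element, the form of the data category identifier is assumed, and \code{clavas:vocabulary} is the proposed extension quoted above, placed here inside an \code{xs:appinfo} element.

\begin{lstlisting}
<!-- illustrative fragment; namespace bindings for xs, dcr and clavas
     are assumed to be declared on the enclosing xs:schema element -->
<xs:element name="languageID" type="xs:string"
            dcr:datcat="http://www.isocat.org/datcat/DC-2482">
  <xs:annotation>
    <xs:appinfo>
      <!-- proposed, non-standard vocabulary reference -->
      <clavas:vocabulary href="http://my.openskos.org/vocab/ISO-639"
                         type="closed"/>
    </xs:appinfo>
  </xs:annotation>
</xs:element>
\end{lstlisting}

A standard XSD validator ignores both the foreign attribute and the content of \code{xs:appinfo}, so the vocabulary reference has no effect on validation and is only meaningful to tools that know the convention.
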
342 | Once the data category and vocabulary reference end up in the specification of the CMD profile and the derived XSD, the information can finally be used by client applications (like a metadata editor)\footnote{Note though that this is not a standard mechanism but rather a convention. The client application must implement it in order to be able to make use of it.}. The tool
---|
343 | can use the reference to the data category to fetch explanations (semantic information) and translations from ISOcat, and it can access the autocomplete/search interface of the Vocabulary Service to offer the user suggestions from the recommended vocabulary (cf. figure \ref{fig:concept_linking}).
---|
344 | |
---|
345 | The drawback of this variant is that we give up validation. This
---|
346 | isn't a problem if the vocabulary is of \code{@type=open}, e.g. \concept{organisation names}, but |
---|
347 | it is when the value domain is closed, e.g. \concept{languageID}. In the latter case, |
---|
348 | the XSD generation could support both modes: a lax (smaller) version which |
---|
349 | doesn't contain the closed vocabulary as an enumeration and leaves it to |
---|
350 | the tool, and a strict version, which does contain the vocabulary as an |
---|
351 | enumeration. Probably the latter should stay the default, but the client application could |
---|
352 | request the lax version, leading to a smaller schema and quicker XSD validation
---|
353 | inside the tool. |
---|
354 | |
---|
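
For a closed value domain such as \concept{languageID}, the lax version could fall back on the regular expression already present in the DC specification. The following fragment is only a sketch of this idea; the type name \code{languageID-lax} is hypothetical.

\begin{lstlisting}
<!-- lax variant (illustrative): only the pattern from the DC
     specification is enforced; membership in the closed vocabulary
     is left to the client application -->
<xs:simpleType name="languageID-lax">
  <xs:restriction base="xs:string">
    <xs:pattern value="[a-z]{3}"/>
  </xs:restriction>
</xs:simpleType>
\end{lstlisting}

Such a type accepts any three-letter lowercase string, including strings that are not language codes, so the check against the vocabulary remains the responsibility of the tool.
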
355 | %However, for the presumably default (and recommended) scenario, where the modeller wants to use the information from the data category, the \xne{Component Editor} could offer to take over the data type and the vocabulary reference from the linked DC specification. |
---|
356 | |
---|
357 | |
---|
358 | %%%%%%%%%%%%%%%%% |
---|
359 | \section{Other Aspects of the Infrastructure} |
---|
360 | While this work concentrates solely on the metadata, it is important to acknowledge that metadata is only one aspect of the infrastructure and of its actual purpose -- the availability of resources. Announcing and describing the resources by metadata is a necessary first step. However, it is of little value if the resources themselves are not accessible. We want to briefly mention at least two other important aspects: content repositories for storing the resources and federated content search for searching in the resources.
---|
361 | |
---|
362 | \subsubsection{CLARIN Centres} |
---|
363 | One view on the CLARIN infrastructure is that of a network of centres\furl{http://www.clarin.eu/node/3812}: |
---|
364 | |
---|
365 | \begin{quotation} |
---|
366 | CLARIN's distributed network is made out of centres. These units, often a university or an academic institute, offer the scientific community access to services on a sustainable basis. |
---|
367 | \end{quotation} |
---|
368 | |
---|
369 | CLARIN imposes a number of criteria that each centre needs to fulfill to become a CLARIN Centre\furl{http://www.clarin.eu/node/3767} \cite{CE-2013-0095}. |
---|
370 | CLARIN also maintains a central registry, the \xne{Centre Registry}\furl{https://centerregistry-clarin.esc.rzg.mpg.de/}, which holds structured information about every centre and is meant as the primary entry point into the CLARIN network of centres.
---|
371 | |
---|
372 | One core service of such centres is the content repository, a system meant for long-term preservation and online provision of research data and resources. A number of centres have been identified that provide Depositing Services\furl{http://clarin.eu/3773}, i.e. they allow third-party researchers (not just the home users) to store research data.
---|
373 | |
---|
374 | \begin{comment} |
---|
375 | In the following a few further well established repositories are mentioned. |
---|
376 | |
---|
377 | \begin{description} |
---|
378 | \item[PHAIDRA] Permanent Hosting, Archiving and Indexing of Digital Resources and Assets, provided by Vienna University \footnote{\url{https://phaidra.univie.ac.at/}} |
---|
379 | \item[eSciDoc] provided by MPG + FIZ Karlsruhe \footnote{\url{https://www.escidoc.org/}} |
---|
380 | \item[TextGrid] \furl{http:/textgrid.de} |
---|
381 | \item[DRIVER] pan-European infrastructure of Digital Repositories \footnote{\url{http://www.driver-repository.eu/}} |
---|
382 | \item[OpenAIRE] - Open Acces Infrastructure for Research in Europe \footnote{\url{http://www.openaire.eu/}} |
---|
383 | \end{description} |
---|
384 | \end{comment} |
---|
385 | |
---|
386 | \begin{figure*} |
---|
387 | \begin{center} |
---|
388 | \includegraphics[width=0.7\textwidth]{images/FCS_components.png} |
---|
389 | \end{center} |
---|
390 | \caption{Components of the Federated Content Search and their interdependencies} |
---|
391 | \label{fig:fcs} |
---|
392 | \end{figure*} |
---|
393 | |
---|
394 | \subsubsection{Federated Content Search} |
---|
395 | |
---|
396 | Another aspect of the availability of resources is that, while metadata can be harvested and indexed locally in one repository, this is not possible with the content itself, due both to the size of the data and, mainly, to legal obligations (licenses, copyright) that restrict the access to and availability of the resources. CLARIN's answer to this problem is the task force \emph{Federated Content Search}\furl{http://www.clarin.eu/fcs} \cite{stehouwer2012fcs}, which aims to establish an architecture that allows searching simultaneously (via an aggregator) across a number of resources hosted by different content providers via a harmonized interface adhering to a common protocol. The agreed-upon protocol is a compatible extension of the SRU/CQL protocol, developed and endorsed by the Library of Congress as the XML- and web-based successor of Z39.50 \cite{Lynch1991}.
---|
397 | |
---|
398 | Note that in practice the line between metadata and content data is not so clear -- usually there is a need to filter by metadata even when searching in content. Therefore, most content search engines also feature some kind of metadata filter. Thus it seems reasonable to harmonize the search protocol and query language for metadata and content. This proposition is further elaborated on in \ref{cql}.
---|
399 | |
---|
400 | \section{Summary} |
---|
401 | |
---|
402 | In this chapter, we presented individual parts of the infrastructure. Next to the core registries that this work directly builds upon (ISOcat Data Category Registry, Component Registry and Relation Registry), a number of other services and applications forming the CLARIN ecosystem were briefly introduced. Separate consideration was given to the issue of controlled vocabularies, together with a related module, the Vocabulary Alignment Service (and its implementation OpenSKOS), which allows vocabularies to be managed and used in client applications. Finally, a few other aspects of the infrastructure that are equally important, though not pertaining to the metadata level, were briefly touched upon.
---|
403 | |
---|