source: SMC4LRT/chapters/Infrastructure.tex @ 2672

Last change on this file since 2672 was 2672, checked in by vronk, 11 years ago

reorganized according to the MasterThesisTemplate? of the Departement

\chapter{Underlying infrastructure}\label{ch:components}

As stated before, the proposed module is part of CMDI and depends on several other modules of the infrastructure. Before describing the interaction itself in chapter \ref{method}, we briefly introduce these modules and the data they provide:

\begin{itemize}
\item Data Category Registry
\item Relation Registry
\item Component Registry
\item Vocabulary Alignment Service (OpenSKOS)
\item SchemaParser
\end{itemize}

?MDBrowser
?MDService


\begin{figure*}[!ht]
\includegraphics[width=1\textwidth]{images/SMC_CR-DCR-RR_Linkage_v2}
\caption{The diagram depicts the links between pieces of data in the individual registries that serve as the basis for semantic mapping}
\end{figure*}

\subsection{CMDI - Production side}

The \emph{Data Category Registry} (DCR) is a central registry that enables the community to collectively define and maintain a set of relevant linguistic data categories. The resulting commonly agreed controlled vocabulary is the cornerstone for grounding the semantic interpretation within the CMD framework.
The data model and the procedures of the DCR are defined by the ISO standard \cite{ISO12620:2009}; the registry itself is implemented as \emph{ISOcat}\footnote{\url{http://www.isocat.org/}}.
%Next to a web interface for users to browse and manage the data categories, DCR provides a REST-style webservice allowing applications to access the information (provided in Data Category Interchange Format - DCIF). The data categories are assigned a persistent identifier, making them globally and permanently referenceable.

The \emph{Component Metadata Framework} (CMD) is built on top of the DCR and complements it. While the DCR defines the atomic concepts, CMD allows metadata schemas to be constructed out of reusable components, i.e. collections of metadata fields. Components can contain other components, and they can be reused in multiple profiles as long as each field ``refers via a PID to exactly one data category in the ISO DCR, thus indicating unambiguously how the content of the field in a metadata description should be interpreted'' \cite{Broeder+2010}. Because fields that refer to the same data category share its interpretation, equivalences between metadata fields in different CMD-based schemas can be inferred trivially. While the primary registry used in CMD is the ISOcat DCR, other authoritative sources for data categories (``trusted registries'') are accepted, notably the Dublin Core Metadata Initiative \cite{DCMI:2005}.
% \emph{Component Registry} implements the Component Data Model and allows to define, maintain and publish CMD-components and -profiles.

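The inference of equivalences through shared data categories can be illustrated with a minimal sketch. It is not the actual implementation: the profile names, field names and data category identifiers below are invented for illustration, and the function \texttt{equivalent\_fields} is a hypothetical helper.

```python
from collections import defaultdict

# Hypothetical (profile, field) -> data category bindings.
# All profile names, field names and identifiers are made up for illustration.
FIELD_TO_DATCAT = {
    ("ProfileA", "Language/Name"): "isocat:DC-0001",
    ("ProfileB", "LanguageName"): "isocat:DC-0001",
    ("ProfileA", "Title"): "dc:title",
    ("ProfileC", "ResourceTitle"): "dc:title",
    ("ProfileC", "Genre"): "isocat:DC-0002",
}


def equivalent_fields(bindings):
    """Group fields by the data category they refer to.

    Fields in the same group are treated as equivalent. Groups with a
    single field are dropped because they yield no correspondence.
    """
    groups = defaultdict(set)
    for field, datcat in bindings.items():
        groups[datcat].add(field)
    return {dc: fields for dc, fields in groups.items() if len(fields) > 1}


if __name__ == "__main__":
    for datcat, fields in sorted(equivalent_fields(FIELD_TO_DATCAT).items()):
        print(datcat, sorted(fields))
```

In this sketch the two title fields and the two language-name fields are each grouped together, while the genre field, which shares its data category with no other field, produces no correspondence.
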
The framework as described so far provides a sound mechanism for binding the semantic interpretation of the metadata descriptions.
However, an additional means is needed to capture information about relations between data categories.
This information was deliberately not included in the DCR, because relations often depend on the context in which they are used, making global agreement infeasible. CMDI proposes a separate module, the \emph{Relation Registry} (RR) \cite{Kemps-Snijders+2008}, in which arbitrary relations between data categories can be stored and maintained. We expect that the RR should be under control of the metadata user, whereas the DCR is under control of the metadata modeler.
% These relations don't need to pass a standardization process, but rather separate research teams may define their own sets of relations according to the specific needs of the project. That is not to say that every researcher has to create her own set of relations -- some basic recommended sets will be defined right from the start. But new -- even contradictory -- ones can be created when needed.

A prototypical implementation of such a relation registry, \emph{RELcat}, is being developed at MPI, Nijmegen \cite{Windhouwer2011,SchuurmanWindhouwer2011}, and already hosts a few relation sets. It has no user interface yet, but it is accessible as a REST web service\footnote{sample relation set: \url{http://lux13.mpi.nl/relcat/rest/set/cmdi}}.
This implementation stores the individual relations as RDF triples
\begin{eqnarray*}
\langle \mathit{subjectDatacat}, \mathit{relationPredicate}, \mathit{objectDatacat} \rangle
\end{eqnarray*}
allowing typed relations such as equivalence (\texttt{rel:sameAs}) and subsumption (\texttt{rel:subClassOf}). The relations are grouped into relation sets that can be used independently.

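How a consumer might exploit such a relation set can be sketched as follows. The triples are invented for illustration and are not taken from RELcat. Treating \texttt{rel:sameAs} as symmetric and transitive is an assumption of this sketch, and \texttt{equivalents} is a hypothetical helper name.

```python
SAME_AS = "rel:sameAs"
SUBCLASS_OF = "rel:subClassOf"

# Hypothetical relation set: (subjectDatacat, relationPredicate, objectDatacat).
# The identifiers are made up for illustration.
RELATION_SET = [
    ("dc:language", SAME_AS, "isocat:DC-0001"),
    ("isocat:DC-0001", SAME_AS, "isocat:DC-0003"),
    ("isocat:DC-0004", SUBCLASS_OF, "isocat:DC-0001"),
]


def equivalents(datcat, triples):
    """Return all data categories linked to `datcat` via rel:sameAs.

    Assumption of this sketch: rel:sameAs is symmetric and transitive,
    so the result is the connected component containing `datcat`.
    """
    neighbours = {}
    for subject, predicate, obj in triples:
        if predicate == SAME_AS:
            neighbours.setdefault(subject, set()).add(obj)
            neighbours.setdefault(obj, set()).add(subject)

    seen = {datcat}
    stack = [datcat]
    while stack:
        current = stack.pop()
        for nxt in neighbours.get(current, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


if __name__ == "__main__":
    print(sorted(equivalents("dc:language", RELATION_SET)))
```

Subsumption triples are deliberately ignored here; whether and how narrower categories should contribute to a mapping depends on the relation set and the application.
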
!check DCR-RR/Odijk2010 -follow up
!Cf. Erhard Hinrichs 2009

A final relevant initiative is the \texttt{Vocabulary Alignment Service}, developed and run within the Dutch program CATCH\footnote{\textit{Continuous Access To Cultural Heritage} - \url{http://www.catchplus.nl/en/}}, which serves as a neutral manager and provider of controlled vocabularies. There are plans to reuse or enhance this service for the needs of the CLARIN project.

\noindent
All these components are running services on which this work builds directly.

This approach of integrating the prerequisites for semantic interoperability directly into the process of metadata creation differs
from the traditional methods of schema matching, which try to establish pairwise alignments between schemas only after they have been created and published. % -- be it algorithm-based or by means of explicit manually defined crosswalks \cite{Shvaiko2005}.

Consequently, the infrastructure also foresees a dedicated module, \emph{Semantic Mapping}, that exploits this novel mechanism to deliver correspondences between different metadata schemas. The details of its functioning and its interaction with the aforementioned modules are described in chapter \ref{method}.


\subsection{CMDI - Exploitation side}
Metadata complying with the CMD framework is being created by a growing number of institutions by various means: automatic transformation from legacy data, or authoring of new metadata records with the help of one of the metadata editors (TODO: cite: Arbil, NALIDA, ). The CMD infrastructure requires content providers to publish their metadata via the OAI-PMH protocol and to announce their OAI-PMH endpoints. These are harvested daily by a dedicated CLARIN harvester\footnote{\url{http://catalog.clarin.eu/oai-harvester/}}. The harvested data is validated against the schemas\todo{What about Normalization?} and made available in packaged datasets. These are fetched by the exploitation-side components, which index the metadata records and make them available for searching and browsing.

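To make the harvesting step more concrete, the following sketch builds the request URLs a harvester would issue. The verb \texttt{ListRecords} and the arguments \texttt{metadataPrefix} and \texttt{resumptionToken} come from the OAI-PMH specification. The endpoint URL and the prefix value \texttt{cmdi} are assumptions made for illustration, and \texttt{list\_records\_url} is a hypothetical helper, not part of the CLARIN harvester.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="cmdi", resumption_token=None):
    """Build an OAI-PMH ListRecords request URL.

    In OAI-PMH, resumptionToken is an exclusive argument: when a
    follow-up page is requested, only the verb and the token are sent.
    The default prefix "cmdi" is an assumption of this sketch.
    """
    if resumption_token is not None:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    # Hypothetical endpoint, used only for illustration.
    endpoint = "http://example.org/oai"
    print(list_records_url(endpoint))
    print(list_records_url(endpoint, resumption_token="token123"))
```

A harvester would repeat the second form of the request for as long as the response contains a non-empty resumption token.
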
\begin{figure*}[!ht]
\includegraphics[width=1\textwidth]{images/CMDingestion_woVAS}
\caption{Within CMDI, metadata is harvested from content providers via OAI-PMH and made available to consumers/users by exploitation side components}
\end{figure*}


The first stable and publicly available application providing access to the collected metadata of CMDI has been the \texttt{VLO - Virtual Language Observatory}\footnote{\url{http://www.clarin.eu/vlo/}}\cite{VanUytvanck2010}, developed within the CLARIN project. This application operates on the same collection of data as is discussed in this work; however, it employs a faceted search, manually mapping the appropriate metadata fields from the different schemas to 10? fixed facets. The underlying search engine is the widely used full-text search engine Apache Solr\footnote{\url{http://lucene.apache.org/solr/}}. Although this is a very reductionist approach, it is a good starting point, offering a core set of categories together with an initial set of category mappings.

More recently, the team at the Meertens Institute developed a similar application, the \texttt{MI Search Engine}\footnote{\url{http://www.meertens.knaw.nl/cmdi/search/}}. It is likewise based on Apache Solr and provides a faceted search, but with a substantially more sophisticated indexing process and search interface. \todo { describe indexing and search}
\todo { add citation}

Finally, there is the \emph{Metadata Repository}, which aims to collect all the harvested metadata descriptions from CLARIN centers,
and the \emph{Metadata Service}, which provides search access to this body of data. As such, the Metadata Service is the primary application to use Semantic Mapping, optionally expanding user queries before issuing a search in the Metadata Repository \cite{Durco2011}.

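The optional query expansion can be pictured with a small sketch. It assumes that Semantic Mapping delivers, for a given field, a set of equivalent fields. The field names, the mapping and the function \texttt{expand\_query} are hypothetical, and the actual query language of the Metadata Service is not modelled here.

```python
def expand_query(field, value, equivalents):
    """Expand a single (field, value) condition over equivalent fields.

    `equivalents` maps a field name to the set of fields considered
    equivalent to it. The original field always comes first; the
    remaining fields follow in sorted order to keep the result stable.
    """
    others = sorted(f for f in equivalents.get(field, ()) if f != field)
    return [(f, value) for f in [field] + others]


if __name__ == "__main__":
    # Hypothetical mapping, as it might be delivered by Semantic Mapping.
    mapping = {"ProfileA.Title": {"ProfileC.ResourceTitle", "dc:title"}}
    for condition in expand_query("ProfileA.Title", "dialect", mapping):
        print(condition)
```

The expanded conditions would then be combined disjunctively, so that a record matches if any of the equivalent fields contains the search value. A field without known equivalents is passed through unchanged.
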